Selecting data

In this section, we will learn how to get data out of RethinkDB. Most of the time, we will choose a database to work with and chain the table command onto it.

Select the whole table

Let’s find all foods. This is the same as SELECT * FROM foods in SQL.

 1 r.db('foodb').table('foods')
 2 //=>
 3 
 4 [{
 5     "created_at": Wed Feb 09 2011 00:37:17 GMT-08:00,
 6     "creator_id": null,
 7     "description": null,
 8     "food_group": "Herbs and Spices",
 9     "food_subgroup": "Spices",
10     "food_type": "Type 1",
11     "id": 43,
12     "itis_id": "29610",
13     "legacy_id": 46,
14     "name": "Caraway",
15     "name_scientific": "Carum carvi",
16     "picture_content_type": "image/jpeg",
17     "picture_file_name": "43.jpg",
18     "picture_file_size": 59897,
19     "picture_updated_at": Fri Apr 20 2012 09:38:36 GMT-07:00,
20     "updated_at": Fri Apr 20 2012 16:38:37 GMT-07:00,
21     "updater_id": null,
22     "wikipedia_id": null
23 }, {
24     "created_at": Wed Feb 09 2011 00:37:18 GMT-08:00,
25     "creator_id": null,
26     "description": null,
27     "food_group": "Herbs and Spices",
28     "food_subgroup": "Spices",
29     "food_type": "Type 1",
30     "id": 67,
31     "itis_id": "501839",
32     "legacy_id": 73,
33     "name": "Cumin",
34     "name_scientific": "Cuminum cyminum",
35     "picture_content_type": "image/jpeg",
36     "picture_file_name": "67.jpg",
37     "picture_file_size": 73485,
38     "picture_updated_at": Fri Apr 20 2012 09:32:32 GMT-07:00,
39     "updated_at": Fri Apr 20 2012 16:32:33 GMT-07:00,
40     "updater_id": null,
41     "wikipedia_id": null
42 },
43 ...
44 ]

You should get back an array of JSON objects. By default, the data explorer automatically paginates the result and displays only part of the data.

Typing r.db(db_name) all the time is insane. We can drop it and call r.table() without r.db() if the table is in the currently selected database. Without any indication, the default database is test: in the Data Explorer, a query without an r.db command runs against test. Unfortunately we cannot change the default database from the Data Explorer1.
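
With a driver, though, we can pick a default database when opening the connection. Below is a minimal sketch using the official JavaScript driver, assuming RethinkDB runs locally on the default port:

 1 var r = require('rethinkdb');
 2 
 3 // Open a connection whose default database is foodb,
 4 // so r.table('foods') no longer needs an explicit r.db()
 5 r.connect({host: 'localhost', port: 28015, db: 'foodb'}, function (err, conn) {
 6   if (err) throw err;
 7   r.table('foods').count().run(conn, function (err, total) {
 8     if (err) throw err;
 9     console.log(total);
10   });
11 });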

Counting

We can also count a table or any sequence by calling the count command.

1 r.db('foodb').table('foods').count()
2 //=>
3 863

Select a single document by its primary key

To select a single document, we call get on a table, passing its primary key value.

 1 r.db('foodb').table('foods')
 2   .get(108)
 3 //=>
 4 {
 5     "created_at": Wed Feb 09 2011 00:37:20 GMT-08:00,
 6     "creator_id": null,
 7     "description": null,
 8     "food_group": "Herbs and Spices",
 9     "food_subgroup": "Herbs",
10     "food_type": "Type 1",
11     "id": 108,
12     "itis_id": "32565",
13     "legacy_id": 115,
14     "name": "Lemon balm",
15     "name_scientific": "Melissa officinalis",
16     "picture_content_type": "image/jpeg",
17     "picture_file_name": "108.jpg",
18     "picture_file_size": 30057,
19     "picture_updated_at": Fri Apr 20 2012 09:33:54 GMT-07:00,
20     "updated_at": Fri Apr 20 2012 16:33:54 GMT-07:00,
21     "updater_id": null,
22     "wikipedia_id": null
23 }

Every document in RethinkDB includes a primary key field; its value is unique across the cluster and is used to identify the document. The name of the primary key field is id by default. However, when you create a table, you have the option to change the name of the primary key field. We will learn more about it later. Just keep a note of it here.
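
For example, we could create a table whose primary key field is something other than id. A quick sketch, where the table name and key field are made up for illustration:

1 // Create a table whose primary key field is "code" instead of "id"
2 r.db('test').tableCreate('countries', {primaryKey: 'code'})
3 
4 // get() now looks documents up by their "code" value
5 r.db('test').table('countries').get('USA')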

In RethinkDB, using an incremental primary key isn’t recommended because it’s hard to do in a cluster environment. To guarantee the uniqueness of each new value, we would have to check every node in the cluster somehow. The RethinkDB team decided2 to use a universally unique3 id instead of an incremental value.
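
When we insert a document without supplying an id, RethinkDB generates a UUID for us and reports it in the generated_keys field of the insert result. A small sketch, assuming a table named notes exists in the test database (the returned UUID will of course differ):

1 r.db('test').table('notes').insert({title: 'a draft'})
2 //=>
3 {
4   "inserted": 1, "errors": 0, "deleted": 0, "replaced": 0, "skipped": 0, "unchanged": 0,
5   "generated_keys": ["0b2c2f3a-1c3d-4e5f-8a9b-0c1d2e3f4a5b"]
6 }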

The get command returns the whole document. What if we only want a single field, such as name? RethinkDB has a command called bracket for that purpose. In Ruby it’s [], and in JavaScript it’s ().

We can do this in JavaScript:

1 r.db('foodb').table('foods')
2   .get(108)("name")
3 //=>
4 "Lemon balm"

Or in Ruby:

1 r.connect.repl
2 r.db('foodb').table('foods').get(108)[:name].run

What is special about bracket is that it returns the value of the field itself. The result has the same type as the field’s value; it is not a subset of the document. We can verify that with the typeOf command:

1 r.db('foodb').table('foods')
2   .get(108)
3   ("name")
4   .typeOf()
5 //=>
6 "STRING"

You can even get a nested field with bracket:

1 r.db('foodb').table('test')
2   .get(108)("address")("country")

assuming the document has an address field which is an object containing a field named country.

If you don’t like using bracket, you can use getField (JavaScript) or get_field (Ruby), which has the same effect:

1 r.db('foodb').table('foods')
2   .get(108)
3   .getField('name')
4 //=>
5 "Lemon balm"

How about getting a subset of a document? We can use pluck like this:

1 r.db('foodb').table('foods')
2   .get(108)
3   .pluck("name", "id")
4 //=>
5 {
6   "id": 108 ,
7   "name":  "Lemon balm"
8 }

pluck probably exists in the standard library of your favourite language. This example shows you how friendly ReQL is.

Select many documents by field value

To select many documents based on field values, we use the filter method, passing an object with the expected values.

Let’s find all foods that were inserted into the database in 2011, the year I came to the US.

 1 r.db('foodb').table('foods')
 2   .filter(r.row('created_at').year().eq(2011))
 3 //=>Executed in 59ms. 40 rows returned, 40 displayed, more available
 4 [{
 5     "created_at": Wed Feb 09 2011 00:37:17 GMT-08:00,
 6     "creator_id": null,
 7     "description": null,
 8     "food_group": "Herbs and Spices",
 9     "food_subgroup": "Spices",
10     "food_type": "Type 1",
11     "id": 43,
12     "itis_id": "29610",
13     "legacy_id": 46,
14     "name": "Caraway",
15     "name_scientific": "Carum carvi",
16     "picture_content_type": "image/jpeg",
17     "picture_file_name": "43.jpg",
18     "picture_file_size": 59897,
19     "picture_updated_at": Fri Apr 20 2012 09:38:36 GMT-07:00,
20     "updated_at": Fri Apr 20 2012 16:38:37 GMT-07:00,
21     "updater_id": null,
22     "wikipedia_id": null
23 }
24 ...
25 ]

r.row is new to you, but no worries, it just means the current document. We use r.row('created_at') to get the value of the created_at field, similar to how we use bracket on a get command to get a single value. Because created_at is a datetime value, I get its year with, well, the year command, then use eq to compare it with 2011. That sounds like a lot, but the query above is really simple and explains itself. Sometimes it feels redundant to explain a query, but I have to write this book anyway.
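
If we wanted a date range rather than an exact year, ReQL also has a during command. Here is a sketch of the same idea expressed as a range (the bounds are just an illustration):

1 // Keep documents whose created_at falls inside calendar year 2011 (UTC)
2 r.db('foodb').table('foods')
3   .filter(r.row('created_at').during(r.time(2011, 1, 1, 'Z'), r.time(2012, 1, 1, 'Z')))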

We can also pass a filter object to do an exact match:

 1 r.db('foodb').table('foods')
 2   .filter({
 3     food_type: 'Type 1',
 4     food_group: 'Fruits'
 5   })
 6 //=>
 7 [
 8 {
 9   "created_at": Wed Feb 09 2011 00:37:15 GMT-08:00 ,
10   "creator_id": null ,
11   "description": null ,
12   "food_group":  "Fruits" ,
13   "food_subgroup":  "Tropical fruits" ,
14   "food_type":  "Type 1" ,
15   "id": 14 ,
16   "itis_id":  "18099" ,
17   "legacy_id": 14 ,
18   "name":  "Custard apple" ,
19   "name_scientific":  "Annona reticulata" ,
20   "picture_content_type":  "image/jpeg" ,
21   "picture_file_name":  "14.jpg" ,
22   "picture_file_size": 29242 ,
23   "picture_updated_at": Fri Apr 20 2012 09:30:49 GMT-07:00 ,
24   "updated_at": Fri Apr 20 2012 16:30:49 GMT-07:00 ,
25   "updater_id": null ,
26   "wikipedia_id": null
27 },...
28 ]

Passing an object matches exactly those documents with the given fields and values. In other words, passing an object is equivalent to combining multiple eq commands with the and command. The query above can be rewritten as an expression:

1 r.db('foodb').table('foods')
2   .filter(
3     r.and(
4       r.row('food_type').eq('Type 1'),
5       r.row('food_group').eq('Fruits')
6     )
7   )

The object notation is much cleaner in this case.

From a selection of documents, we can use pluck to get a subset of each document’s fields instead of returning the whole document, similar to how we used bracket to get a particular field:

 1 r.db('foodb').table('foods')
 2   .filter({
 3     food_type: 'Type 1',
 4     food_group: 'Fruits'
 5   })
 6   .pluck('id', 'name', 'food_subgroup')
 7 //=>Executed in 70ms. 40 rows returned, 40 displayed, more available
 8 [
 9   {
10   "food_subgroup":  "Berries" ,
11   "id": 75 ,
12   "name":  "Black crowberry"
13   }, {
14   "food_subgroup":  "Tropical fruits" ,
15   "id": 150 ,
16   "name":  "Guava"
17   }, {
18   "food_subgroup":  "Tropical fruits" ,
19   "id": 151 ,
20   "name":  "Pomegranate"
21   }, ...
22 ]

By passing a list of fields to pluck, we get only those fields.

The opposite of pluck is without. We pass a list of fields, and it removes those fields from the document.

 1 r.db('foodb').table('foods')
 2   .filter({
 3     food_type: 'Type 1',
 4     food_group: 'Fruits'
 5   })
 6   .without("created_at", "picture_content_type", 'picture_file_name', 'picture_f\
 7 ile_size', 'picture_updated_at')
 8 //=> Executed in 52ms. 40 rows returned, 40 displayed, more available
 9 [
10 {
11   "creator_id": null ,
12   "description": null ,
13   "food_group":  "Fruits" ,
14   "food_subgroup":  "Berries" ,
15   "food_type":  "Type 1" ,
16   "id": 75 ,
17   "itis_id":  "23743" ,
18   "legacy_id": 81 ,
19   "name":  "Black crowberry" ,
20   "name_scientific":  "Empetrum nigrum" ,
21   "updated_at": Fri Apr 20 2012 16:29:43 GMT-07:00 ,
22   "updater_id": null ,
23   "wikipedia_id": null
24 },...
25 ]

With simple filtering, we can easily pass a filter object as above. But what about complex searching, such as finding all foods whose name starts with the character N? As you saw at the beginning, we used the r.row command to do a slightly more complex query.

1 r.db('foodb').table('foods')
2   .filter(r.row('created_at').year().eq(2011))

Let’s dive more into it.

r.row

r.row is our swiss army knife. It refers to the currently visited document: literally, the document RethinkDB is accessing at that moment. You can think of it like this in a JavaScript callback/iterator, or like the current element in an iterator loop. It’s very handy because we can call other ReQL commands on it to achieve our filtering.

It feels somewhat like jQuery’s filtering commands. For instance, we write this in JavaScript to filter all DOM elements whose data-type value is anime:

1 $('.db').find('.table').filter(function() {
2   return $(this).data('type')=='anime'
3 })

In ReQL, using filter with a filter object:

1 r.db('foodb').table('foods').filter({food_group: 'Fruits'})

We can rewrite it with r.row:

1 r.db('foodb').table('foods').filter(r.row('food_group').eq('Fruits'))

Breaking it down we have:

  • r.row ⇒ current document
  • ('food_group') ⇒ get the value of the field food_group
  • .eq('Fruits') ⇒ returns true if the value is equal to the argument, Fruits in this case

r.row is a RethinkDB object on which we can keep calling other methods to filter or manipulate it. The expression that we pass into filter is a normal ReQL expression that evaluates to a boolean result. RethinkDB runs it, and if the returned value is true, the document is included in the result set. Ideally, any expression that returns a boolean result can be used with filter. Note that filter expressions are evaluated on the RethinkDB server, therefore they have to be valid ReQL expressions; they cannot be arbitrary expressions in your host language. You cannot write:

1 r.db('db').table('table').filter(r.row('type') == 'anime')

When filtering, we usually compare values or test some condition to be matched, and RethinkDB gives us methods for that. You should refer to its API documentation for the full list of commands. Usually, we use r.row in combination with the pluck, without or bracket commands to narrow down data before comparing. Below are some functions for that purpose:

  • eq(value) checks equality with value. Similar to ==.
  • ne(value) checks inequality with value. Similar to !=.
  • ge(value) checks greater than or equal to value. Similar to >=.
  • gt(value) checks greater than value. Similar to >.
  • le(value) checks less than or equal to value. Similar to <=.
  • lt(value) checks less than value. Similar to <.
  • add(value) sums two numbers, concatenates two strings, or concatenates two arrays.
  • sub(value) subtracts two numbers.

Each of the above commands can be called on different data types. E.g., when you call add on an array, it appends the elements of the argument to the array; when you call it on a string, it concatenates the parameter to the original string; and when you call it on a number, it just does the arithmetic operation.

Run these commands in the data explorer:

 1 r.expr(["foo", "bar"]).add(['forbar'])
 2 //=>
 3 [
 4   "foo" ,
 5   "bar" ,
 6   "forbar"
 7 ]
 8 
 9 r.expr(2).add(10)
10 //=>
11 12
12 
13 r.expr('foo').add("bar")
14 //=>
15 "foobar"

You can find more about these commands in the RethinkDB documentation, in the Math and logic group4.

Let’s apply what we have learned by finding all foods whose name starts with the character R and which are tropical fruits.

 1 r.db("foodb").table("foods")
 2   .filter(
 3       r.row("name").match("^R")
 4       .and(
 5         r.row("food_subgroup").eq('Tropical fruits')
 6       )
 7   )
 8 //=>
 9 {
10   "created_at": Wed Feb 09 2011 00:37:27 GMT-08:00 ,
11   "creator_id": null ,
12   "description": null ,
13   "food_group":  "Fruits" ,
14   "food_subgroup":  "Tropical fruits" ,
15   "food_type":  "Type 1" ,
16   "id": 234 ,
17   "itis_id":  "506073" ,
18   "legacy_id": 249 ,
19   "name":  "Rambutan" ,
20   "name_scientific":  "Nephelium lappaceum" ,
21   "picture_content_type":  "image/jpeg" ,
22   "picture_file_name":  "234.jpg" ,
23   "picture_file_size": 71055 ,
24   "picture_updated_at": Fri Apr 20 2012 09:43:04 GMT-07:00 ,
25   "updated_at": Fri Apr 20 2012 16:43:05 GMT-07:00 ,
26   "updater_id": null ,
27   "wikipedia_id": null
28 }

Here we are using match with the regular expression ^R, meaning any name starting with R, and using and to combine it with another boolean. That other boolean is the result of getting the field food_subgroup and comparing it with 'Tropical fruits'.

filter seems handy but it’s actually limited. filter doesn’t leverage indexes; it scans and holds all the data in memory. Of course, this doesn’t scale infinitely: only 100,000 documents can be filtered this way. For anything larger than that, we have to use getAll or between, which we will learn in chapter 5.
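
As a quick preview of chapter 5, the indexed version of an equality filter looks like this; a sketch that assumes we create a secondary index named food_group first:

1 // indexCreate only needs to run once; getAll then uses the index
2 r.db('foodb').table('foods').indexCreate('food_group')
3 r.db('foodb').table('foods').getAll('Fruits', {index: 'food_group'})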

Now, let’s try to find all foods whose group contains more than 10 food documents. We would probably think of a simple solution like this: for each document, we get its food_group and count how many items have that same food group; if the result is greater than 10, we return true so that the document is included in the filter result. We may get duplicate results, but let’s try this naive solution:

1 r.db('foodb').table('foods')
2   .filter(
3     r.db('foodb').table('foods')
4       .filter(
5         {food_group: r.row("food_group")}
6       )
7       .count()
8       .gt(10)
9   )

The query looks good, but when we run it, we get this:

1 RqlCompileError: Cannot use r.row in nested queries.  Use functions instead in:
2 r.db("foodb").table("foods").filter(r.db("foodb").table("foods").filter({food_gr\
3 oup:
4 r.row("food_group")}).count().gt(10))

Basically, we have a nested query here, and RethinkDB doesn’t know which query r.row should belong to: the parent query or the sub-query? In that case, we have to use filter with a function. Let’s move on to the next section.

Filter with function

Besides passing a ReQL expression, we can also pass a function which returns true or false to filter.

Let’s try the previous example again.

1 r.db('foodb').table('foods')
2   .filter(function (food) {
3    return r.db('foodb').table('foods').filter({food_group: food("food_group")}).\
4 count().gt(10)
5   })

Now we are no longer using r.row; we pass an anonymous function with a single parameter (which we can name whatever we like). When iterating over the table, RethinkDB calls this function, passing the current document as its first argument. By using a function, we can still access the current document without using r.row, and we clearly bind the current document to a variable so that we can access its value and avoid ambiguity. Here, we name our argument food, so instead of writing:

1 filter({food_group: r.row("food_group")})

We will write:

1 filter({food_group: food("food_group")})

And we use a boolean value, count().gt(10) here, as the result of the function. Filtering with a function helps us write queries with complex logic.
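
Because the function body is just a ReQL expression, we can combine as many conditions as we like inside it. A small sketch mixing two conditions on the current document (the conditions themselves are arbitrary):

1 // Fruits whose name starts with the letter P
2 r.db('foodb').table('foods')
3   .filter(function (food) {
4     return food('food_group').eq('Fruits')
5       .and(food('name').match('^P'))
6   })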

Paginating data

We rarely want a whole sequence of documents; usually we care about a subset of the data, such as a page of results. In this section, we go over the commands orderBy, limit and skip.

Ordering data

So far, we have only selected data and accepted the default ordering. Let’s control how the results appear:

 1 r.db('foodb').table('foods')
 2   .filter(function (food) {
 3    return r.db('foodb').table('foods').filter({food_group: food("food_group")}).\
 4 count().gt(10)
 5   })
 6   .orderBy("name")
 7 //=>Executed in 5.69s. 821 rows returned
 8 [
 9   {
10     "created_at": { "$reql_type$": "TIME", "epoch_time": 1297240650, "timezone": "-08:00" },
11     "creator_id": null,
12     "description": null,
13     "food_group": "Aquatic foods",
14     "food_subgroup": "Mollusks",
15     "food_type": "Type 1",
16     "id": 280,
17     "itis_id": "69493",
18     "legacy_id": 307,
19     "name": "Abalone",
20     "name_scientific": "Haliotis",
21     "picture_content_type": "image/jpeg",
22     "picture_file_name": "280.jpg",
23     "picture_file_size": 99231,
24     "picture_updated_at": { "$reql_type$": "TIME", "epoch_time": 1334940073, "timezone": "-07:00" },
25     "updated_at": { "$reql_type$": "TIME", "epoch_time": 1334965273, "timezone": "-07:00" },
26     "updater_id": null,
27     "wikipedia_id": null
28   },
29   ...
30 ]

We reused the filter query above, but appended orderBy("name"). If you notice, the command above runs for quite a long time (Executed in 5.69s. 821 rows returned) and all rows are returned at once instead of as a stream as usual. When we call orderBy without specifying an index, it loads all the data into memory to sort it, which is both slow and inefficient. We will learn more about ordering with an index in chapter 5. For now, let’s continue with this method because, well, it is easy to use, at first :D
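
For comparison, ordering with an index looks like this; a sketch that assumes we create a secondary index named name first (indexes are covered properly in chapter 5):

1 r.db('foodb').table('foods').indexCreate('name')
2 // orderBy can now stream results straight off the index instead of sorting in memory
3 r.db('foodb').table('foods').orderBy({index: 'name'})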

We can reverse the order by applying the r.desc command:

1 r.db('foodb').table('foods')
2   .filter(function (food) {
3     return r.db('foodb').table('foods').filter({food_group: food("food_group")})\
4 .count().gt(10)
5   })
6   .orderBy(r.desc("name"))

We can order a table too, not just a filter result:

1 r.db('foodb').table('foods')
2   .orderBy(r.desc("name"))

We can order by multiple fields at a time:

1 r.db('foodb').table('foods')
2   .orderBy(r.desc("name"), r.asc("created_at"))

Here we order descending on the field name and ascending on the field created_at.

We can combine other document commands with orderBy too, such as plucking only a useful set of fields:

 1 r.db('foodb').table('foods')
 2   .pluck("id", "name", "food_group")
 3   .orderBy(r.desc("name"), r.asc("created_at"))
 4 //=>Executed in 122ms. 863 rows returned
 5 [
 6   {
 7       "food_group": "Milk and milk products",
 8       "id": 634,
 9       "name": "Yogurt"
10   },
11   {
12       "food_group": "Milk and milk products",
13       "id": 656,
14       "name": "Ymer"
15   },
16   {
17       "food_group": "Aquatic foods",
18       "id": 523,
19       "name": "Yellowtail amberjack"
20   },
21   ...
22 ]

Limiting data

Once we have an ordered sequence, we usually want to select a limited number of documents instead of the whole sequence. We use the command limit(n) for this purpose. It gets the first n elements from the sequence or array.

 1 r.db('foodb').table('foods')
 2   .pluck("id", "name", "food_group")
 3   .orderBy(r.desc("name"), r.asc("created_at"))
 4   .limit(4)
 5 //=>Executed in 107ms. 4 rows returned
 6 [{
 7     "food_group": "Milk and milk products",
 8     "id": 634,
 9     "name": "Yogurt"
10 }, {
11     "food_group": "Milk and milk products",
12     "id": 656,
13     "name": "Ymer"
14 }, {
15     "food_group": "Aquatic foods",
16     "id": 523,
17     "name": "Yellowtail amberjack"
18 }, {
19     "food_group": "Aquatic foods",
20     "id": 522,
21     "name": "Yellowfin tuna"
22 }]

limit gets us the number of documents that we want, but it always starts from the beginning of the sequence. To start selecting data from a given position, we use skip.

Skip

As its name suggests, skip(n) ignores n elements from the head of the sequence.

 1 r.db('foodb').table('foods')
 2   .pluck("id", "name", "food_group")
 3   .orderBy(r.desc("name"), r.asc("created_at"))
 4   .skip(2)
 5   .limit(2)
 6 //=> Executed in 97ms. 2 rows returned
 7 [{
 8     "food_group": "Aquatic foods",
 9     "id": 523,
10     "name": "Yellowtail amberjack"
11 }, {
12     "food_group": "Aquatic foods",
13     "id": 522,
14     "name": "Yellowfin tuna"
15 }]
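
Putting orderBy, skip and limit together gives us a classic pagination recipe. A sketch where page and pageSize are hypothetical values you would take from your application:

1 var page = 3;       // 1-based page number, for illustration
2 var pageSize = 25;  // documents per page
3 
4 r.db('foodb').table('foods')
5   .orderBy('name')
6   .skip((page - 1) * pageSize)
7   .limit(pageSize)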

Accessing nested fields

As you know, a RethinkDB document is a JSON object. Very likely it has two or more levels of data structure. So how can we access those nested fields, or drill down into them?

Let’s begin this section by creating some sample data. Just copy and paste it, and ignore the syntax for now; we will cover it in chapter 4.

First, create a table on the test db.

1 r.tableCreate("books")

Then, insert sample data:

 1 r.table("books")
 2   .insert([
 3         {
 4             id: 1,
 5             name: "Simply RethinkDB",
 6             address: {
 7                 country: {
 8                     code: "USA",
 9                     name: "The United State of America"
10                 }
11             },
12             contact: {
13                 phone: {
14                     work: "408-555-1212",
15                     home: "408-555-1213",
16                     cell: "408-555-1214"
17                 },
18                 email: {
19                     work: "bob@smith.com",
20                     home: "bobsmith@gmail.com",
21                     other: "bobbys@moosecall.net"
22                 },
23                 im: {
24                     skype: "Bob Smith",
25                     aim: "bobmoose",
26                     icq: "nobodyremembersicqnumbers"
27                 }
28             }
29         },
30         {
31             id: 2,
32             name: "TKKG",
33             address: {
34                 country: {
35                     code: "GER",
36                     name: "Germany"
37                 }
38             },
39             contact: {
40                 phone: {
41                     work: "408-111-1212",
42                     home: "408-111-1213",
43                     cell: "408-111-1214"
44                 },
45                 email: {
46                     work: "bob@gmail.com",
47                     home: "bobsmith@axcoto.com",
48                     other: "bobbys@axcoto.com"
49                 },
50                 im: {
51                     skype: "Jon",
52                     aim: "Jon",
53                     icq: "nooneremembersicqnumbers"
54                 }
55             }
56         }
57   ])

Depending on your language, you will usually have some way to access a nested field by following its nested path. In the above example, let’s say we want to access the skype im handle; the path is:

contact -> im -> skype

Using the JavaScript driver, we use bracket to access the field and its sub fields:

1 r.table('books').get(1)('contact')('im')('skype')
2 //=>
3 "Bob Smith"

Whereas in the Ruby driver, the bracket notation is [field]:

1 r.table('books').get(1)['contact']['im']['skype']

We can keep calling bracket to get to the final nested field, following the path. Not just on a single document: we can use bracket at the table level too:

 1 r.table('books')('address')('country')
 2 [
 3 {
 4   "code":  "GER" ,
 5   "name":  "Germany"
 6 }, {
 7   "code":  "USA" ,
 8   "name":  "The United State of America"
 9 }
10 ]

Or use it in combination with filter, on a selection:

1 r.table('books')
2   .filter({id: 1})('address')('country')('name')
3 //=>
4 "The United State of America"

Besides using the bracket command, we can also use getField if that feels more natural:

 1 r.table('books')
 2   .getField('contact')('email')
 3 //=>
 4 [
 5 {
 6   "home": "bobsmith@axcoto.com",
 7   "other": "bobbys@axcoto.com",
 8   "work": "bob@gmail.com"
 9 }, {
10   "home": "bobsmith@gmail.com",
11   "other": "bobbys@moosecall.net",
12   "work": "bob@smith.com"
13 }]
13 }]

At the end of the day, all you have to remember is to drill down the path with a chain of bracket commands.
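
If you only need a slice of a nested structure rather than one leaf value, pluck also accepts a nested selector object. A short sketch against the books table we just created:

1 r.table('books')
2   .pluck('name', {contact: {email: 'work'}})
3 //=> each document keeps only its name and its work email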