Leanpub: Publish Early, Publish Often

Data Type

Why do we have to discuss about data type? We use dynamic language and we almost dicuss about Ruby and JavaScript most of time. But understanding data type allow us to read API document better. It helps us to understand why we can can r.table.insert but we cannot call like r.table.filter().insert. Aren’t we still selecting data from table, so we should be able to insert data to it?

Data type helps us know that we can only call some method on some of data type

Each ReQL method can be call on one or many above data types. Take update command, when you browser the API document, you see

1 table.update(json | expr[, {durability: "hard", returnVals: false, nonAtomic: fa\
2 lse}]) → object
3 
4 selection.update(json | expr[, {durability: "hard", returnVals: false, nonAtomic\
5 : false}]) → object
6 
7 singleSelection.update(json | expr[, {durability: "hard", returnVals: false, non\
8 Atomic: false}]) → object

It means the command can be invoked on a table, or selection (eg: first 30 element of tables), or a single selection - a document is an example of single selection. The behaviour maybe different based on data type, even the command is same.

In RethinkDB, we have several data types. We will focus into those 2 kind for now:

Basic data type
Composite data type

Basic data type

These are usually the native data type in your language too:

 1 * Number: any real numbers. RethinkDB uses double precision (64-bit) floating po\
 2 int numbers internally
 3 * String
 4 * Time: This is native RethinkDB date time type. However, they will be converted\
 5  automatically to your native data type in your language by the driver.
 6 * Boolean: True/False 
 7 * Null: Depend on your language, it can be nil, null,..    
 8 * Object: any valid JSON object. In JavaScript, it will be a normal object. In R\
 9 uby, it can be a hash.
10 * Array: any valid JSON array.

The data type of a field or column can be change. If you assign a number to a field, you can still assign an value with different data type to that same field. So we don’t have a static schema for tables.

We have a very useful command to get the type of any vaue. It’s typeOf. Example:

 1 r.db('foodb').table('foods')
 2   .typeOf()
 3 //=>
 4 "TABLE"
 5 
 6 r.db('foodb').table('foods')
 7   .filter(r.row("name").match('A^'))
 8   .typeOf()
 9 //=>
10 "SELECTION<STREAM>"

It’s seems not very important to understand about data type at first. I really hope you should invest some time to use it frequently to understand the data type of a value.

To give a story. In MariaDB10.0/MySQL5.6, when data type doesn’t match, an index may not be used. Let’s say you have a field name with type VARCHAR(255) when you define it, then you create an index on that column. Query on that column with exact data type will make index kicked in. Let’s come back MySQL a bit.

First I insert below records.

1 INSERT INTO foods(name) VALUES("100");
2 Query OK, 1 row affected, 1 warning (0.00 sec)

Below query will use index:

 1 MariaDB [food]> EXPLAIN SELECT * FROM foods WHERE name="100";
 2 +------+-------------+-------+-------+---------------------+--------------------\
 3 -+---------+-------+------+-------+
 4 | id   | select_type | table | type  | possible_keys       | key                \
 5  | key_len | ref   | rows | Extra |
 6 +------+-------------+-------+-------+---------------------+--------------------\
 7 -+---------+-------+------+-------+
 8 |    1 | SIMPLE      | foods | const | index_foods_on_name | index_foods_on_name\
 9  | 257     | const |    1 |       |
10 +------+-------------+-------+-------+---------------------+--------------------\
11 -+---------+-------+------+-------+

But this query won’t:

 1 EXPLAIN select * from foods where name = 9;
 2 MariaDB [food]> EXPLAIN SELECT * FROM foods WHERE name=100;
 3 +------+-------------+-------+------+---------------------+------+---------+----\
 4 --+------+-------------+
 5 | id   | select_type | table | type | possible_keys       | key  | key_len | ref\
 6  | rows | Extra       |
 7 +------+-------------+-------+------+---------------------+------+---------+----\
 8 --+------+-------------+
 9 |    1 | SIMPLE      | foods | ALL  | index_foods_on_name | NULL | NULL    | NUL\
10 L |  890 | Using where |
11 +------+-------------+-------+------+---------------------+------+---------+----\
12 --+------+-------------+
13 1 row in set (0.00 sec)

When we pass string ‘9’, the index is used. When we pass number 9, the index isn’t used.

Of if you have a date time column and you passing time as string, the index won’t kicked in either.

The lesson here is we aboslutely should understand about data type.

Composite data type

We have 3 composite data types.

Streams

:are list or array, but they’re loaded in a lazy fashion. Instead of returning a whole array at once, meaning all data are read into memory, a cursor is return. A cursor is a pointer into the result set. We can loop over cursor to read data when we need. Imagine instead of an array, and loop over it, you know iterate over the cursor to get next value. It allows you to iterator over a data set without building an entire array in memory. It’s equivalent to PHP iterator, or Ruby iterator, or JavaScript iterator. Stream allows us access current element and keep track of current position so that we can, ideally call next() on a cursor to move to next element, until we reach to the end of array, it returns nil and iterator can stop. Because of that, we can work with large data set because RethinkDB doesn’t need to load all of data and return to client. The nature of stream make it read-only, you cannot change the data while iterating over it.

Selections

:represent subsets of tables, for example, the return values of filter or get. There are two kinds of selections, Selection<Object> and Selection<Stream>, which behave like objects or streams respectively. The difference between selections and objects/streams are that selections are writable –their return values can be passed as inputs to ReQL commands that modify the database. For instance, the get command will return a Selection<Object> that could then be passed to an update or delete command. We can think of selection like an array where each element keeps an reference back to real document so that we can modify them.

Tables

:are RethinkDB database tables. They behave like selections. However, they’re writable, as you can insert and delete documents in them. ReQL methods that use an index, like getAll, are only available on tables. Because index are created on table level.

In short, you cannot modify streams, you can update or change value of selection but you cannot remove existed document, or insert new one. Tables allows you insert new document or remove existed one.

Remember data types seems not much important but you should understand them well because it helps us understand the efficient of a query. If a query returns an array, it consumes lot of memory to hold the array.

Sorting data

When talking about data type, let think of how we sort them. It really doesn’t matter in the order, what is important is the definition of sorting data.

Understanding sorting is important in RethinkDB because of its schemaless. The primary key may not be a numeric field, it can be a string. Moreover than that, a field can have whatever data type, how are we going to compare an object to a string when sorting.

Here is sorting order:

Arrays (and strings) sort lexicographically. Objects are coerced to arrays before sorting. Strings are sorted by UTF-8 codepoint and do not support Unicode collations.

Mixed sequences of data sort in the following order:

arrays
booleans
null
numbers
objects
binary objects
geometry objects
times
strings

That mean array < booleans < null < numbers < objects < binary objects < geometry objects < times < strings.

Up next

Selecting data