Leanpub: Publish Early, Publish Often

7. Organizing your data

The number of fields you wish to store in your database would be a larger value than the five column table we saw earlier chapters. Also, some assumptions were made intrinsically on the kind of data we will store in the table. But this is not always the case in real life. In reality the data we encounter will be complex, even redundant. This is where the study of data modelling techniques and database design come in. While it is advised that the reader refer to a more comprehensive treatise on this subject, nonetheless we will try to study some good relational database design principles since the study would come in handy while learning SQL statements for multiple tables.

7.1 Normalization

Let us suppose we have a database of employees in a fictional institution as given below. If the database structure has not been modelled but has been extracted from a raw collection of information available, redundancy is expected.

Figure: the fictional firm’s database
employee_id	name	skill	manager_id	location
1	Socrates	Philosophy	(null)	Greece
2	Plato	Writing	1	Greece
3	Aristotle	Science	2	Greece
4	Descartes	Philosophy	(null)	France
4	Descartes	Philosophy	(null)	Netherlands

We can see that Descartes has two rows because he spent his life in both France and Netherlands. At a later point we decide that we wish to classify him with a different skill, we would have to update both rows since they should contain an identical (primary) skill. It would be easier to have a separate table for skills and and somehow allow the records which share the same skill to refer to this table. This way if we wish to reflect that both Socrates and Descartes were thinkers in Western Philosophy renaming the skill record in the second table would do the trick.

This process of breaking down a raw database into logical tables and removing redundancies is called Normalization. There are even levels of normalization called normal forms which dictate on how to acheive the desired design.

7.2 Atomicity

In the programming language examples we’ve seen, our assumption has always been that a language has a single author. But there are countless languages where multiple people contributed to the core design and should rightfully be acknowledged in our table. How would we go about making such a record? Let us take the case of BASIC which was designed by John Kemeny and Thomas Kurtz. The easiest option to add this new record into the table is to fit both author’s in the author field.

Figure: a record with a non-atomic field value
id	language	author	year	standard
1	Prolog	Colmerauer	1972	ISO
2	Perl	Wall	1987	(null)
3	APL	Iverson	1964	ANSI
4	Tcl	Ousterhout	1988	(null)
5	BASIC	Kemeny, Kurtz	1964	ANSI

You can immediately see that it would be difficult to write a query to retrieve this record based on the author field. If the data written as “Kemeny, Kurtz” or “Kurtz, Kemeny” or even “Kemeny & Kurtz”, it would be extremely difficult to put the right string in the WHERE conditional clause of the query. This is often the case with multiple values, and the solution is to redesign the table structure to make all field value atomic.

7.3 Repeating Groups

Another simple (but ultimately wrong) approach that comes to mind is to split the author field into two parts - author1 and author2. If a language has only one author, the author2 field would contain a null value. Can you spot the problem that will arise from this design decision?

Figure: a table with a repeating group
id	language	author1	author2	year	standard
1	Prolog	Colmerauer	(null)	1972	ISO
2	Perl	Wall	(null)	1987	(null)
3	APL	Iverson	(null)	1964	ANSI
4	Tcl	Ousterhout	(null)	1988	(null)
5	BASIC	Kemeny	Kurtz	1964	ANSI

This imposes an artificial constraint on how many authors a language can have. It seems to work fine for a couple of them, but what if a programming language was designed by a committee of a dozen or more people? At the database design time, how do we fix the number of authors we wish to support? This kind of design is referred to as a repeating group and must be actively avoided.

7.4 Splitting the table

The correct design to remove the problems listed above is to split the table into two - one holding the author details and one detailing the language.

Figure: a table holding author details
author_id	author	language_id
1	Colmerauer	1
2	Wall	2
3	Ousterhout	4
4	Iverson	3
5	Kemeny	5
6	Kurtz	5

Figure: a table holding programming language details
id	language	year	standard
1	Prolog	1972	ISO
2	Perl	1987	(null)
3	APL	1964	ANSI
4	Tcl	1988	(null)
5	BASIC	1964	ANSI

Once you have removed the non-atomicity of fields and repeating groups alongwith assigning unique id’s to your tables, your table structure is now in the first normal form. The author table’s language_id field which refers to the id field of the language table is called a foreign key constraint.

Listing: creating the new programming languages table

1 CREATE TABLE newlang_tbl (id       INTEGER     NOT NULL PRIMARY KEY,
2                           language VARCHAR(20) NOT NULL,
3                           year     INTEGER     NOT NULL,
4                           standard VARCHAR(10) NULL);

Listing: creating the authors table

1 CREATE TABLE authors_tbl (author_id   INTEGER     NOT NULL,
2                           author      VARCHAR(25) NOT NULL,
3                           language_id INTEGER REFERENCES newlang_tbl(id));

Notice that in the author’s table we’ve made a foreign key constraint by making the language_id field reference the id field of the new programming languages table using the keyword REFERENCES. You can only create a foreign key reference a primary or unique key, otherwise during the constraint creation time we would recieve an error similar to the following.

1 E_PS0490 CREATE/ALTER TABLE: The referenced columns in table 'newlang_tbl'
2     do not form a unique constraint; a foreign key may only reference
3     columns in a unique or primary key constraint.
4     (Thu May 17 15:28:45 2012)

Since we have created a reference to the language_id, inserting a row in the author’s table which does not yet have a language entry would also result in an error, called a Referential Integrity violation.

1 INSERT INTO authors_tbl (author_id, author, language_id) VALUES (5, 'Kemeny', 5)
2 
3 E_US1906 Cannot INSERT into table '"authors_tbl"' because the values do
4     not match those in table '"newlang_tbl"' (violation of REFERENTIAL
5     constraint '"$autho_r0000010c00000000"').

However when done sequentially, i.e. the language first and then its corresponding entry in the author table, everything works out.

Listing: making entries for BASIC in both the tables

1 INSERT INTO newlang_tbl (id, language, year, standard) VALUES (5, 'BASIC', 1964,\
2  'ANSI');
3 
4 INSERT INTO authors_tbl (author_id, author, language_id) VALUES (5, 'Kemeny', 5);

The other statements to get fully populated tables are given below.

 1 INSERT INTO newlang_tbl  (id, language, year,  standard) VALUES (1, 'Prolog', 19\
 2 72, 'ISO');
 3 INSERT INTO newlang_tbl  (id, language, year) VALUES (2, 'Perl', 1987);
 4 INSERT INTO newlang_tbl  (id, language, year,  standard) VALUES (3, 'APL', 1964,\
 5  'ANSI');
 6 INSERT INTO newlang_tbl  (id, language, year) VALUES (4, 'Tcl', 1988);
 7 
 8 INSERT INTO authors_tbl (author_id, author, language_id) VALUES (6, 'Kurtz', 5);
 9 INSERT INTO authors_tbl (author_id, author, language_id) VALUES (1, 'Colmerauer'\
10 , 1);
11 INSERT INTO authors_tbl (author_id, author, language_id) VALUES (2, 'Wall', 2);
12 INSERT INTO authors_tbl (author_id, author, language_id) VALUES (3, 'Ousterhout'\
13 , 4);
14 INSERT INTO authors_tbl (author_id, author, language_id) VALUES (4, 'Iverson', 3\
15 );

Up next

8. Doing more with queries