Database:Exercise - Combining rows of data from related tables - SQL JOIN

From Juneday education
Jump to: navigation, search

Work in progress

Remove this section when the page is production-ready.

Exercise for Combining rows of data from related tables - SQL JOIN

Notes on typography

Commands are written in a fixed width font and commands issued in a terminal in your computer shell are prefixed with a $.

Commands issued inside the SQLite3 interactive shell are prefixed with sqlite>

Example: List the files in the current directory using the command ls:

$ ls

Issue the .tables command inside the SQLite3 interactive shell:

sqlite> .tables

Purpose of this exercise

We are going to convert the books table from the lecture into two tables, one for publishers and one from books. We will have the books table reference the publishers table via a publisher_id rather than having the publisher name in the books table.

Further, we will practice making SELECT statements which reference both the books table and the publishers table, joining the two tables together using the fact that a book’s publisher_id corresponds to a publisher_id in the publishers table.

The exercise is built in progression, so you cannot skip any task. We start by creating a situation with a table for books which contains too much information, and move on to create two tables which are connected instead. This will show you a simple and basic way to move data out from one table into a second table, modify the first table and get a situation where you can get all the data from the old "know-it-all" table, by querying the new tables and joining them together instead.

Task 1 - Set up the books table

To keep things organised, create a directory for this exercise session:

$ mkdir Ex07
$ cd Ex07

Create a textfile called books.sql with the following contents:

PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
CREATE TABLE IF NOT EXISTS books
   (author TEXT, title TEXT, isbn TEXT PRIMARY KEY, publisher TEXT);
INSERT INTO "books"
    VALUES('John Smith','Life','0-0-0-0-0-1','Bonnier');
INSERT INTO "books"
    VALUES('James Woody','Love','0-0-0-0-0-2','Bonnier');
INSERT INTO "books"
    VALUES('Joan Carmen','Guns','0-0-0-0-0-3','Bonnier');
INSERT INTO "books"
    VALUES('Johnanna Boyd','Code','0-0-0-0-0-4','Bonnier');
INSERT INTO "books"
    VALUES('Eva Peron','Cars','0-0-0-0-0-5','Books R us');
COMMIT;

As usual, you create a database and the table by running sqlite3 with the name of the database as the only argument and redirect input from the file.

Expand using link to the right to see the syntax.

$ sqlite3 my_books < books.sql

Next, enter the interactive shell giving my_books as the only argument.

Expand using link to the right to see the syntax.

$ sqlite3 my_books

You can also download the file from Github.

Task 2 - Create the publishers table

In the interactive shell with the my_books database selected as the main database, issue an SQL statement to create the publishers table with the folowing columns and modifiers:

  • publisherid INTEGER PRIMARY KEY
  • name TEXT

Expand using link to the right to see the syntax.

sqlite> CREATE TABLE publishers(publisherid INTEGER PRIMARY KEY, name TEXT);

Now populate the newly created publishers table with two publishers with the following data:

  • Name: 'Bonnier'
  • Name: 'Books R us'

Expand using link to the right to see the syntax.

sqlite> INSERT INTO publishers(name) VALUES('Bonnier');
sqlite> INSERT INTO publishers(name) VALUES('Books R us');

Inspect the contents of the publishers table. Verify that the two publishers were inserted correctly.

Expand using link to the right to see the syntax.

sqlite> SELECT * FROM publishers;
1|Bonnier
2|Books R us

Where did the IDs for Bonnier and Books R us come from? We only inserted the names of the publishers!?

Expand using link to the right to see the answer if you don't know it.

The reason for the IDs to appear magically, is the fact that the publisherid column is a primary key of type INTEGER. In SQLite3, a primary key of INTEGER type has an "auto increment"-like property. Whenever a new row is inserted, the value for the column is automatically incremented with one. That is quite practical, to let the database figure out the next available ID for a growing table.

Read this page of the SQLite documentation to learn about primary keys with type INTEGER. The important part for you to know, is that you didn't have to calculate the IDs for the rows of new publishers, since SQLite calculated them for you, and that this was because the publisherid column was both a primary key and of type INTEGER.

Task 3 - Create a new books table with a PublisherID instead of the publisher name

Let’s create the new books table. We don’t want to store the publisher name in it any more, now that we have a dedicated table to keep track of the publishers for us. We’ll call the new table books2 for now.

Create the books2 table with the following columns:

  • author TEXT
  • title TEXT
  • isbn TEXT PRIMARY KEY
  • publisherid INTEGER

Expand using link to the right to see the syntax.

sqlite> CREATE TABLE books2(author TEXT, title TEXT, isbn TEXT PRIMARY KEY, publisherid INTEGER);

Task 4 - Populate the books2 table

We want the new table books2 to hold basically the same data as the old books table. The only thing that has changed is that we now will store the publisher’s id rather than the publisher name. Anyway, we may start with populating the books2 table with author, title and isbn. We will use a nested SELECT statement, in order to get the data for our INSERT statement. We don't expect you to know this syntax yet, so expand the hint below to see the syntax. If you want to try yourself without looking, here's a hint: SQLite INSERT manual. Look at the diagram, to see that you can use SELECT instead of the VALUES clause. Don't feel bad if you don't understand this, simply look at the hint below instead!

Expand using link to the right to see the syntax.

sqlite> INSERT INTO books2(author, title, isbn) SELECT author, title, isbn FROM books;

Note that instead of the VALUES clause, we have a SELECT statetment! This is a handy trick that is good to know, but perhaps an advanced bonus knowledge, since we haven’t talked so much about it in the lectures or in this book. We thought it might be good for you to have seen the syntax and tried it out, though. If you look at the syntax diagram here you can see that instead of a VALUES clause, there may be a SELECT statement in its place. The diagram works like this: You can follow the arrows as if they were a race track where you can choose different routes. As you see, you can "turn right" before the VALUES "road" and instead take the SELECT road. Of course, there is a similar diagram for how the SELECT works. You'll have to find it yourself if you want to look at that too.

But we are not done yet. We need to populate the table with the correct publisherids as well.

How can we do this? If we look at the old books table, we see that all books have "Bonnier" for publisher except the last book with isbn "0-0-0-0-0-5", which has "Books R us" for publisher. So we could actually set all books that don’t have that isbn, to publisherid 1 (Bonnier’s ID) and then set the book with ISBN 0-0-0-0-0-5 to publisherid 2 (Books R us’s ID). That’s one way of doing it. It is of course not the only way to get the correct IDs for the new books2 table, but it seems quite simple, works for this small example and will save us some typing compared to inserting the "correct" publisherid manually. So, use UPDATE to fix the IDs as described above.

Expand using link to the right to see the syntax.

sqlite> UPDATE books2 SET publisherid=1 WHERE isbn != '0-0-0-0-0-5'; -- "!=" is the "not equals" operator
sqlite> UPDATE books2 SET publisherid=2 WHERE isbn  = '0-0-0-0-0-5';

As you might remember from the lecture, we accomplished the same thing with another strategy. Instead of using inequality and equality, we used the following where clause for publisherid = 1:

 WHERE isbn < '0-0-0-0-0-5' -- "<" is the "less than" operator, and it works for strings too

That works equally well, because it is true that we should set publisherid to 1 for all rows of books whose isbn is less than (lexicographically) '0-0-0-0-0-5'. (Lexicographically means in the order of characters, kind of like alphabetically, but includes characters that are not letters.)

Before we continue, let’s inspect the books2 table and all its rows. Make a SELECT to retrieve all rows and all columns from books2. As a bonus, make SQLite3 include the column names using the .headers on directive.

Expand using link to the right to see the syntax.

sqlite> .headers on
sqlite> SELECT * FROM books2;
author|title|isbn|publisherid
John Smith|Life|0-0-0-0-0-1|1
James Woody|Love|0-0-0-0-0-2|1
Joan Carmen|Guns|0-0-0-0-0-3|1
Johnanna Boyd|Code|0-0-0-0-0-4|1
Eva Peron|Cars|0-0-0-0-0-5|2

The .headers on directive makes the interactive shell print the column names at the top. (SQLite-specific)

Task 5 - Get rid of the old books table

Now that we have the new books2 table and have populated it with the correct values, we can rid ourselves of the old books table using the DROP TABLE statement. It’s used like this:

sqlite> DROP TABLE books;

We may verify what tables remains in the database using the .tables directive. Do it!

Expand using link to the right to see the syntax.

sqlite> .tables
books2      publishers

If you can't see the old books table, hat’s fine. No more books table! It was old and contained too much information anyway!

Task 6 - Change the name of the books2 table to books

Syntax difference ahead! Click on the image to see MariaDB version.

We kind of liked the books table name, so let’s rename our books2 table to books and confirm that it worked:

sqlite> ALTER TABLE books2 RENAME TO books;
sqlite> .tables
books       publishers
SQLite ALTER TABLE syntax diagram

Now things looks even better! You can read up on the ALTER TABLE syntax here but it wouldn't be on the exam without the syntax diagram. You may need to use it in an assignment, though, so it’s good to have a reference. And it's a very useful command, so you really should read about it and test it out on some test database you can create on your own.

Task 7 - List all books and their publisher names

When we want to look up a book, or many books, we don’t want to see an ID in place of the publisher. We probably want to see the publisher name in full. So we need a way to write a SELECT statement which fetches the names from the publishers table and connect that table to our books table. In other words, finally we have a setup which allows us to learn the basics of joining to tables on a common column.

In order to do this, we use the fact that the publisherid for a book has the same value as the publisherid in the publishers table. If we have publisherid 1 in the books table, it means exactly the publisher with publisherid 1, also in the publishers table, i.e. "Bonnier".

We can connect the tables in the WHERE clause of our SELECT using this fact. But when we reference columns in different tables, we need to provide the table names. The publisherid in the books table will be called books.publisherid and the publisherid in the publishers table will be called publishers.publisherid. We also need to specify both tables in the FROM clause of our SELECT statement.

The SELECT statement for listing books.title, books.author, publisher’s name should now be possible for you to figure out. If not, see the hint by expanding the suggested solution below. You need to practice this, in order to pass this part of an exam.

Expand using link to the right to see the syntax.

sqlite> SELECT books.title, books.author, publishers.name
          FROM books, publishers
         WHERE books.publisherid = publishers.publisherid;

When you have peeked or figured the correct syntax out, note the WHERE clause. We say that the connection between the two tables books, publishers is the publisherid. In other words, we ask for all combinations of the two tables but we only want to keep those combinations where the publisherid in both tables are the same.

Try running the same SELECT but without the WHERE clause.

Expand using link to the right to see the syntax.

sqlite> SELECT books.title, books.author, publishers.name
          FROM books, publishers;

As you see, without the WHERE clause to filter out only the combinations where the publisherid matches, we get all possible combinations of the two tables. Some teachers and books will tell you that this is called a "Cartesian" product, but who cares what it's called? It makes it really clear, though, that the WHERE clause in a SELECT combining two tables is rather important. It is not so common that we'd want all combinations of the rows from both tables, which makes you wonder why they think it is important to give it a name when teaching SQL. However, regardless of what the combination is called, we hope that you understand now how the SELECT worked, and what the consequence of forgetting the WHERE would be.

Task 8 - List books and publishers using JOIN

Another way of connecting the two tables on their common column value (publisherid) is to use the SQL JOIN syntax. We would perhaps not test this syntax in an exam for an introductory book like this (at least not without providing the syntax diagram), but we think it is good if you have seen it and tried it out. You should, of course, not restrict your studies to what might turn up on an exam, however.

The idea behind the JOIN syntax, is to avoid to express that we want to create all combinations of the two tables and then filter the result via the WHERE clause. Instead, we explicitly state how the two tables are JOINed together like this:

sqlite> SELECT title, author, name FROM books
          JOIN publishers
            ON books.publisherid = publishers.publisherid;

In SQLite, this form of JOIN is equivalent to the SELECT with WHERE syntax. The difference lies in how we express our query.

This way, we can name the columns unqualified (without their table prefix). Instead we fix the connection with the JOIN clause where we explicitly say that the books table is joined together with the publishers table ON the fact that in both tables publisherid means the same thing.

A slightly shorter form of the JOIN above, is to use a USING clause instead of the ON clause, if the columns are named the same:

sqlite> SELECT a.name AS "Author", nationality, title, isbn, p.name as "Publisher"
          FROM books b
  NATURAL JOIN publishers p
          JOIN authors a USING(AuthorID);

Actually, there is still another way to take advantage of the fact that the column publisherid not only means the same thing in both tables, the column shares the exact name between the tables. It is called exactly publisherid in both tables. Naturally, this is something that SQLite can use. Here, we'll show you a shorter JOIN syntax which works when the column we are JOINing on, has the exact same name in the tables we are JOINing:

sqlite> SELECT Title, Author, Name
          FROM books
  NATURAL JOIN publishers;
Life|John Smith|Bonnier
Love|James Woody|Bonnier
Guns|Joan Carmen|Bonnier
Code|Johnanna Boyd|Bonnier
Cars|Eva Peron|Books R us

The trick in the NATURAL JOIN, is that it means that SQLite should JOIN the tables on the column which has the same name in both tables. It is only the publisherid column which occurs with exactly the same name in both tables, therefore the NATURAL JOIN works without the ON clause.

Note that, if there would have been more columns with the same names in both tables, the NATURAL JOIN wouldn't work. For instance if the name column of the publishers table happened to have the same name as a name column in the books table (but meaning something completely different - book name, and, publisher name respectively), this would fail miserably. Please try it out! It's always good to have many tools in your toolbox!

On a side-note, these kinds of JOIN statements are also called INNER JOINs.

Task 9 - What about the headers (column names)?

If you tell the interactive shell that you’d like to see the column names (the headers) in the result, you use the .headers on directive. Do it! (Perhaps you already did)

Expand using link to the right to see the syntax.

sqlite> .headers on

Now, run the SELECT statement again and carefully inspect the headers.

Expand using link to the right to see the syntax.

sqlite> SELECT title, author, name
          FROM books
          JOIN publishers
            ON books.publisherid = publishers.publisherid;
title|author|name
Life|John Smith|Bonnier
Love|James Woody|Bonnier
Guns|Joan Carmen|Bonnier
Code|Johnanna Boyd|Bonnier
Cars|Eva Peron|Books R us

Now, the publisher’s name is simply called Name in the publisher table. Therefore it is listed in the headers as name. Quite logically ;-) But what if name is not the header we’d like? We might prefer something more telling, like "Publisher". In fact, that is probably what we want.

For this, we can use a construct called AS. We can, in fact, rename the column header to what we want. First try this:

sqlite> SELECT title AS "The book name", author AS "Writer's name" FROM books;

Look at the headers (the first line of the result). We can use this to fix the header listed as Name in the larger SELECT statement above. Try to fix it yourself so that instead of Name, the publisher’s name will be listed as "Publisher". Hint: Use the AS construct again!

Expand using link to the right to see the syntax.

Using JOIN and ON:

sqlite> .headers on
sqlite> SELECT title, author, name AS "Publisher"
          FROM books
          JOIN publishers
            ON books.publisherid = publishers.publisherid;
title|author|Publisher
Life|John Smith|Bonnier
Love|James Woody|Bonnier
Guns|Joan Carmen|Bonnier
Code|Johnanna Boyd|Bonnier
Cars|Eva Peron|Books R us

Using qualified names and WHERE:

sqlite> .headers on
sqlite> SELECT books.title, books.author, publishers.name AS "Publisher"
          FROM books, publishers
         WHERE books.publisherid = publishers.publisherid;
title|author|Publisher
Life|John Smith|Bonnier
Love|James Woody|Bonnier
Guns|Joan Carmen|Bonnier
Code|Johnanna Boyd|Bonnier
Cars|Eva Peron|Books R us

Using NATURAL JOIN:

sqlite> .headers on
sqlite> SELECT title, author, name AS "Publisher"
          FROM books
  NATURAL JOIN publishers;
title|author|Publisher
Life|John Smith|Bonnier
Love|James Woody|Bonnier
Guns|Joan Carmen|Bonnier
Code|Johnanna Boyd|Bonnier
Cars|Eva Peron|Books R us

Task 10 - Using aliases for tables

Going back to the syntax where we used the WHERE clause to connect the tables, filtering on rows with the same value for publisherid, you remember this?

sqlite> SELECT books.title, books.author, publishers.name
          FROM books, publishers
         WHERE books.publisherid = publishers.publisherid;

That’s quite tiresome to write, since we are repeating the table names over and over. We don’t like repetition, so let’s shorten that statement using something called aliases. The idea is to create abbreviations for the table names so that for instance books.title becomes b.title:

sqlite> SELECT b.title, b.author, p.name
          FROM books b, publishers p
         WHERE b.publisherid = p.publisherid;

Try it out. It is very common, so again, it is good to have seen it and tried it out. So try it! No, really, try it!

Task 11 - What have we learned and for what?

Your task in this exercise is to reflect upon what you have done and learned. Here are some thoughts for you.

Imagine a database of 100 000 books. If we stored them in one flat (fat?) table, as the table we started with in this exercise session, and there were only four publishers equally distributed, then we would have 25 000 rows repeating the publisher name for each publisher.

What we have done today, is to move the publishers out of the books table, into a table of their own. This way, we will not repeat the publisher name more than once. This is particularly good if we had to change the name of a publisher. We do prefer to change the name in one place, compared to, say in 25 000 places. Think about what would happen if you had misspelled "Bonnier" in some rows? And think about what would happen if "Bonnier" changed its name to "Bonnier books and magazines". We'll we don't know what would happen, but compare the situation to what we accomplished by moving the Publisher name out to a publishers table where "Bonnier" occurred only once. A change of name of the publisher would mean changing one row instead of changing every row with a book by that publisher.

Task 12 - Next challenge

The next logical step for fixing our database is to "normalize" also the Authors. ("Normalize" is fancy-talk for moving data out to a new table as we did with the Publisher column of the first version of the books table.) We leave this as a voluntary exercise for those who want to try it on their own. A possible scenario would be to create an authors table. The authors table could hold AuthorID, FirstName, LastName, BornYear, Nationality etc. The books table could be changed to instead of Author, hold the AuthorID of the corresponding row in the authors table.

Hint: This exercise would make a good preparation for Assignment 2! So we recommend that you give it a try.

Expand using link to the right to see one proposed solution - this is an open question, so your solution may vary.

First, create and populate the authors table:

sqlite> CREATE TABLE "authors"(AuthorID INTEGER primary key, Name text, Nationality TEXT);
sqlite> INSERT INTO authors(Name, Nationality) VALUES('John Smith', 'English');
sqlite> INSERT INTO authors(Name, Nationality) VALUES('James Woody', 'Scottish');
sqlite> INSERT INTO authors(Name, Nationality) VALUES('Joan Carmen', 'Welsh');
sqlite> INSERT INTO authors(Name, Nationality) VALUES('Johnanna Boyd', 'American');
sqlite> INSERT INTO authors(Name, Nationality) VALUES('Eva Peron', 'Argentinian');

Next, create and populate the new books2 table with AuthorID instead of Author:

sqlite> create table books2(AuthorID INTEGER, Title text, ISBN text primary key, PublisherID INTEGER);
sqlite> insert into books2(ISBN,Title) SELECT ISBN,Title FROM books;
sqlite> update books2 set PublisherID=2 where ISBN='0-0-0-0-0-5';
sqlite> update books2 set PublisherID=1 where ISBN<'0-0-0-0-0-5';
sqlite> update books2 set AuthorID=1 WHERE ISBN LIKE "%1";
sqlite> update books2 set AuthorID=2 WHERE ISBN LIKE "%2";
sqlite> update books2 set AuthorID=3 WHERE ISBN LIKE "%3";
sqlite> update books2 set AuthorID=4 WHERE ISBN LIKE "%4";
sqlite> update books2 set AuthorID=5 WHERE ISBN LIKE "%5";

Next, drop the old books table and rename books2 to books

sqlite> drop table books;
sqlite> alter table books2 rename to books;

Next, try out a complicated JOIN:

sqlite> SELECT a.Name AS "Author", Nationality, Title, ISBN, p.Name as "Publisher"
>       FROM books b NATURAL JOIN publishers p
>       JOIN authors a on a.AuthorID=b.AuthorID;
Author|Nationality|Title|ISBN|Publisher
John Smith|English|Life|0-0-0-0-0-1|Bonnier
James Woody|Scottish|Love|0-0-0-0-0-2|Bonnier
Joan Carmen|Welsh|Guns|0-0-0-0-0-3|Bonnier
Johnanna Boyd|American|Code|0-0-0-0-0-4|Bonnier
Eva Peron|Argentinian|Cars|0-0-0-0-0-5|Books R us

The JOIN became a little complicated because some columns shared the same names. First, we have the PublisherID which is not a problem and actually allows us to use a NATURAL JOIN between books and publishers. But then we have Name, which in the publishers table represents a publisher name, and in authors represents an author name. So for the Name columns, we need to use some qualification in order to help SQLite understand which Name column we are talking about. So the SELECT starts by qualifying the author name: a.Name AS "Author". The columns Nationality, Title, ISBN do not need qualification, since they are unique column names. But the publisher name needs qualification too: p.Name AS "Publisher".

Next, we can NATURAL JOIN books with publishers, but here we give the shorter aliases for the respective columns: FROM books b NATURAL JOIN publishers p. Finally, we JOIN the result of the first NATURAL JOIN with authors ON the AuthorIDs of books and authors: JOIN authors a on a.AuthorID=b.AuthorID.

As you see, this could become a little complicated when we JOIN many tables. But the rules for qualifying column names are simple - if they need to be specified because they occur in more than one table, use qualification (as with Name in the example above). If they are unique you do not need to qualify them. If they occur in more than one table but means the same thing, it doesn't need to be qualified and can even be used in a NATURAL JOIN.

Using the SELECT with WHERE clause syntax, the same result can be achieved by this query:

sqlite> SELECT a.Name AS "Author", Nationality, Title, ISBN, p.Name AS "Publisher"
...> FROM books b, authors a, publishers p
...> WHERE a.AuthorID=b.AuthorID AND p.PublisherID=b.PublisherID;
Author|Nationality|Title|ISBN|Publisher
John Smith|English|Life|0-0-0-0-0-1|Bonnier
James Woody|Scottish|Love|0-0-0-0-0-2|Bonnier
Joan Carmen|Welsh|Guns|0-0-0-0-0-3|Bonnier
Johnanna Boyd|American|Code|0-0-0-0-0-4|Bonnier
Eva Peron|Argentinian|Cars|0-0-0-0-0-5|Books R us

You need to practice using JOIN and the syntax using filtering with WHERE.

Links

Further reading

Files

  • The first books.sql file from task 1 Github

Where to go next

« PreviousBook TOCNext »