Database:Combining rows of data from related tables - SQL JOIN

From Juneday education
Jump to: navigation, search

Meta

Expand using link to the right to see the meta section with instructor's notes.

Description

Often, we have a database consisting of a lot more than one single table. The tables in the database represent (or at least store data about) typically one thing or concept that our system needs information about. It is not uncommon that two or more tables are connected in some sense, by giving parts of the data for one larger concept. In this way, we could talk about connected tables.

We could for instance imagine a table about cars. This table could contain some data about cars in rows. Each row could contain data about Make, Color and License Number. But the complete data about the color and make need not be stored inside the Car table. Instead, we could store the License Number in the Car table and refer to rows of more information about the color of a car to a Color table, and information about the Make could be referred to a row in a Make table.

In this way, information about one color only needs to be stored once, in one row of the Color table, and many other tables with objects which have a color could refer to this one row. The same goes for the Make table.

When retrieving data from more than one table which are connected via references in this way, the SQL JOIN construct is used. This chapter/topic introduces both how tables can be decomposed into several tables and how to later JOIN them together in a SELECT statement with a JOIN clause. The chapter does not go into details about normalization and the theory around that. It is more of a hands-on topic with examples and exercises.

Prepare

Expect the main lecture to take around 25 - 30 minutes. The optional lectures are about decomposing a “self contained big table” into smaller tables (around 20 - 25 minutes) and some more advanced examples of JOIN statements (15 - 20 minutes). We recommend that you see all the videos before deciding whether you will give all of them or settle for the main lecture. Also, as usual, review the exercises and questions online. We also recommend that you read the theoretical introduction online on the course material website.

Topics addressed or used in the online exercises

  • Re-visiting creating a directory for the exercise, setting up a database, creating a table for the exercise and inserting some test data etc
  • The exercise goes from a “self-contained” (not normalized) table with redundant data, to a design with at least two tables, where some data is moved from the big table into a new table
  • How to change an existing table in SQLite3 (alter table syntax is rather limited in SQLite3, so this has to be done in steps)
  • How to move data from one table to another
  • Using IDs as foreign keys (surrogate keys)
  • Using WHERE on surrogate keys (IDs) as a join operation
  • Using a WHERE with a JOIN clause
  • Natural joins
  • Using the AS operator to create alias for columns and tables

Heads-up

Think about:

  • Splitting up a table into more tables isn’t always intuitive to the students (or the authors!) - consider using the optional lecture on the topic and refer to literature and online resources
  • Students may ask about the difference between using WHERE to create a join and using JOIN.
    • Our answer is that JOIN makes for a more clear SELECT statement, in particular if you want to filter on some condition other than the joining columns. If you use WHERE to join two tables together on e.g. “product_id” and also want to filter on e.g. price < 100.0 then the where clause has two different purposes - first the joining of the tables, then also the filtering on price.
    • You might have a different answer. Think about this and prepare for an answer.

Full frontal - Code examples up-front!

SELECT product.name, price, alcohol, volume, productgroup.name AS "Product group"
  FROM product JOIN productgroup
    ON product.productgroupid = productgroup.id
 WHERE product.name LIKE '%loch lomond%' AND price > 500.0;
name                             price    alcohol  volume   Product group
-------------------------------  -------  -------  -------  -------------
Loch Lomond Inchmurrin 12 Years  559.0    46.0     700      Whisky       
Loch Lomond Càrn Mòr 6 Years     628.0    46.0     700      Whisky

Introduction

Often, we have a database consisting of a lot more than one single table. The tables in the database represent (or at least store data about) typically one thing or concept that our system needs information about. It is not uncommon that two or more tables are connected in some sense, by giving parts of the data for one larger concept. In this way, we could talk about connected tables.

An example might perhaps clear up this situation. We might have a system which needs data about cars. Cars come in different make, color and model etc. Furthermore, a car which exists in the real world, must be possible to uniquely identify in our database. For this we typically use a primary key (a column with data which uniquely identifies (singles out) one row in the car table). For primary key, we choose something which is unique to each car (each row of information describing one car in the real world). A license plate number springs to mind, since we know that all cars have a unique license plate number.

But we might not want to duplicate all data about cars for such things which is shared by many real cars, e.g. color, make, model etc.

In fact, we might have use for, for instance, the colors for other objects in our database. If we have a table for cars for our system and also a table for garages in the same database, we might realize that both cars and garages have a color in the real world. At the same time, we might realize that the color "Green" doesn't belong to any or some specific cars or garages, it exists on its own merits apart from the objects having the color "Green".

Another thing worth thinking about, is that not only can we share colors as attributes for different types of objects, we might not want to duplicate information about the color name etc in hundreds of rows of data describing cars and garages etc.

A solution to this situation, is to make color its own table, and having both car and garage reference a suitable row in the color table. If we have a color table, we could decide that a color has a name (like "Green" for instance) and a color_id, where the id uniquely identifies a color.

Now, we can have both a "car" and a "garage" referencing the same color (using the color_id for "Green" for instance). This is done by adding a column to the table referencing another table, called a "foreign key".

You can think about a foreign key as a reference meaning "described further elsewhere".

Let's look at a minimal and simplified example of this relationship between a car table and a color table:

--Definition of the color and car tables:

CREATE TABLE color(color_id INTEGER PRIMARY KEY, color TEXT UNIQUE NOT
NULL);
CREATE TABLE car(id INTEGER PRIMARY KEY, license TEXT UNIQUE NOT NULL,
 color_id INTEGER,
 FOREIGN KEY(color_id) REFERENCES color(color_id));

--Contents of the car table:

sqlite> SELECT * FROM car;
id license color_id
---------- ---------- ----------
1 AAA 001 1
2 AAA 002 2
3 AAA 003 3

--Contents of the color table:

sqlite> SELECT * FROM color;
color_id color
---------- ----------
1 Red
2 Blue
3 White
4 Green
5 Silver
6 Gold
7 Pink
8 Grey
9 Black
10 Orange
11 Purple

As you see, the car table has a column called color_id, which is a foreign key that references some row in the color table, using the color_id integer value to do the actual connection between the tables.

You don't have to focus on the so called constraint of the car table reading foreign key(color_id) references color(color_id) just yet, it is there only to provide some integrity and safeguards, so that we don't add a row with a car that has a color_id which doesn't exist in the color table.

The concept of "foreign keys" has not so much to do with SQL syntax, but is rather a human construct where we mean to say that a car with a certain color_id has the same color as the color in the color table that has the same color_id.

This lecture and exercises will show you some basics for how to write SELECT statements which retrieves data from more than one single table, using some foreign keys as the means to join the two tables. A JOIN is thus a way in SQL to describe which columns are connected in two tables.

To get the cars including their color name from the tables above, we could for instance write the following SELECT statement:

sqlite> .header on
sqlite> .mode column
sqlite> SELECT license, color FROM car JOIN color ON car.color_id = color.color_id;
license     color     
----------  ----------
AAA 001     Red       
AAA 002     Blue      
AAA 003     White

JOIN using WHERE syntax

Let's say we have the following four tables in a database:

CREATE TABLE artist(artist_name TEXT, artist_id INTEGER PRIMARY KEY);
CREATE TABLE genre(genre_name TEXT, genre_id INTEGER PRIMARY KEY);
CREATE TABLE album(album_name TEXT, album_id INTEGER PRIMARY KEY);
CREATE TABLE track(title TEXT, artist_id INTEGER, genre_id INTEGER, album_id INTEGER, track_id INTEGER PRIMARY KEY);

A track has a title, an artist (an id which references the artist table), a genre (an id which references the genre table) and an album (an id which references the album table). By combining the four tables we can get all information about a track.

Now, if we would just do this:

SELECT title, artist_name, genre_name, album_name
  FROM track, artist, genre, album;

then we'd get all possible combinations of all rows in all tables. That's not what we want. We want a specific combination of rows from the tables. The combination we want is:

  • The title combined with the artist_name row where the artist_id is the same as in the row where we found the title combined with
  • the genre_name row where the genre_id is the same as in the row where we found the title combined with
  • the album_name where the album_id is the same as in the row where we found the title

for each row of tracks.

We can achieve this by actually first selecting all possible combination of rows and then add a criteria using WHERE for the various ids to be the same:

SELECT title, artist_name, genre_name, album_name
  FROM track,artist,genre,album
 WHERE track.artist_id = artist.artist_id
   AND track.genre_id = genre.genre_id
   AND track.album_id = album.album_id;

title                   artist_name   genre_name    album_name            
----------------------  ------------  ------------  ----------------------
The Look                Roxette       Pop           Look Sharp!           
Sleeping in my car      Roxette       Pop           Crash! Boom! Bang!    
Crawfish                Elvis         Exotica       King Creole           
Mystery Train           Elvis         Rock          Greatest Hits

Note that the track_id isn't used in the query. It is a so called surrogate key which allows us to have a primary key in the track table (something which is uniq).

If it seems a little lengthy to write the WHERE clause like the above, then you can use aliases for the table names:

SELECT title, artist_name, genre_name, album_name
  FROM track t, artist a, genre g, album am
 WHERE t.artist_id = a.artist_id
   AND t.genre_id = g.genre_id
   AND t.album_id = am.album_id;
title                   artist_name   genre_name    album_name            
----------------------  ------------  ------------  ----------------------
The Look                Roxette       Pop           Look Sharp!           
Sleeping in my car      Roxette       Pop           Crash! Boom! Bang!    
Crawfish                Elvis         Exotica       King Creole           
Mystery Train           Elvis         Rock          Greatest Hits

The draw-back of using WHERE to specify how tables are joined, is that we sometimes want to use WHERE to make a selection based on a column value which has nothing to do with joining the tables. That makes the WHERE clause harder to read and understand, because now we are using the WHERE for two different purposes. Let's say we want to join the tables like we did above, but only list tracks whose genre_name is Pop. This is what that would look like:

SELECT title, artist_name, genre_name, album_name
  FROM track t, artist a, genre g, album am
 WHERE t.artist_id = a.artist_id
   AND t.genre_id = g.genre_id
   AND t.album_id = am.album_id
   AND genre_name = 'Pop';
title                   artist_name   genre_name    album_name            
----------------------  ------------  ------------  ----------------------
The Look                Roxette       Pop           Look Sharp!           
Sleeping in my car      Roxette       Pop           Crash! Boom! Bang!

Next we'll look at a cleaner way to join tables, using JOIN so that we can let the WHERE be used for the selection criteria alone.

Using JOIN

The JOIN syntax uses the keyword ON to specify how rows from two JOINed tables should be matched. We'll jump right into it:

SELECT title, artist_name, genre_name, album_name
  FROM track
  JOIN artist ON track.artist_id = artist.artist_id
  JOIN genre  ON track.genre_id  = genre.genre_id
  JOIN album  ON track.album_id  = album.album_id;
title                   artist_name   genre_name    album_name            
----------------------  ------------  ------------  ----------------------
The Look                Roxette       Pop           Look Sharp!           
Sleeping in my car      Roxette       Pop           Crash! Boom! Bang!    
Crawfish                Elvis         Exotica       King Creole           
Mystery Train           Elvis         Rock          Greatest Hits

Now, using WHERE to select only Pop tracks will look like this:

SELECT title, artist_name, genre_name, album_name
  FROM track
  JOIN artist ON track.artist_id = artist.artist_id
  JOIN genre  ON track.genre_id  = genre.genre_id
  JOIN album  ON track.album_id  = album.album_id
 WHERE genre_name = 'Pop';
----------------------  ------------  ------------  ----------------------
The Look                Roxette       Pop           Look Sharp!           
Sleeping in my car      Roxette       Pop           Crash! Boom! Bang!

We can use table aliases also with JOIN:

SELECT title, artist_name, genre_name, album_name
  FROM track t
  JOIN artist a ON t.artist_id = a.artist_id
  JOIN genre g  ON t.genre_id  = g.genre_id
  JOIN album am ON t.album_id  = am.album_id;
title                   artist_name   genre_name    album_name            
----------------------  ------------  ------------  ----------------------
The Look                Roxette       Pop           Look Sharp!           
Sleeping in my car      Roxette       Pop           Crash! Boom! Bang!    
Crawfish                Elvis         Exotica       King Creole           
Mystery Train           Elvis         Rock          Greatest Hits

In the example at hand, you might have noticed that the columns used for matching two rows from different tables have the same names in both tables for all joins, e.g. track has an album_id column and so does album, and track has a genre_id column and so does genre etc. Next, we'll look at a way to exploit this fact to get a shorter syntax.

Using JOIN and USING

Since the columns from the track table we are using to match rows in the joined tables have the same name in the joined tables, we can use a shorter syntax:

SELECT title, artist_name, genre_name, album_name
  FROM track
  JOIN artist USING(artist_id)
  JOIN genre USING(genre_id)
  JOIN album USING(album_id);
title                   artist_name   genre_name    album_name            
----------------------  ------------  ------------  ----------------------
The Look                Roxette       Pop           Look Sharp!           
Sleeping in my car      Roxette       Pop           Crash! Boom! Bang!    
Crawfish                Elvis         Exotica       King Creole           
Mystery Train           Elvis         Rock          Greatest Hits

But wait, there's an even shorter syntax for joining tables on columns with the same names in both tables! We'll look at that next.

Using NATURAL JOIN

Here's the syntax for NATURAL JOIN, which is a kind of JOIN which exists for situation where you know that the columns used for joining two tables have the same name:

SELECT title, artist_name, genre_name, album_name
  FROM track NATURAL JOIN artist NATURAL JOIN genre NATURAL JOIN album;
title                   artist_name   genre_name    album_name            
----------------------  ------------  ------------  ----------------------
The Look                Roxette       Pop           Look Sharp!           
Sleeping in my car      Roxette       Pop           Crash! Boom! Bang!    
Crawfish                Elvis         Exotica       King Creole           
Mystery Train           Elvis         Rock          Greatest Hits

LEFT OUTER JOIN

What if the genre table has some genres that no track has? There might be the genres "Black Metal" and "Death Metal" in the genre table, even if no track has that genre, right?

How would we find those genres? Using a normal JOIN wouldn't work, since no rows of the track table would have the corresponding genre_id values. That's the whole point. How can we then find the "extra" or orphan genres?

To understand how, we need to think about how the normal JOIN works. It creates all combinations of the rows in two tables and then filters away (discards) the rows whose join column values don't match. What we want to do to solve the problem of finding the orphan genres, is to actually keep all rows of one of the tables (genre) regardless of whether we can match the genre_id value with the track table.

This is achieved with LEFT OUTER JOIN which is designed to keep all rows of the table to the LEFT of the expression. We'll put genre to the left, of course, since we are interested in genres.

Let's start with a query for all genre_name values in the genre table and LEFT OUTER JOIN with the track table. We use .nullvalue NULL to make it clearer what happens. We'll select title and genre_name:

SELECT genre_name, title
  FROM genre LEFT OUTER JOIN track
 USING(genre_id);
genre_name              title       
----------------------  ------------
Pop                     Sleeping in 
Pop                     The Look    
Exotica                 Crawfish    
Rock                    Mystery Trai
Black Metal             NULL        
Death Metal             NULL

Now, you see that the left table, genre, gets to keep all its rows even when the right table, track, doesn't have a matching genre_id! The title for such rows becomes NULL (because there are no rows with the two orphaned genres Black Metal or Death Metal, so there can't be a title value!).

Now, to get only the names of the orphan genres, we can add a WHERE clause specifying that title must be null:

SELECT genre_name, title
  FROM genre LEFT OUTER JOIN track USING(genre_id)
 WHERE title IS NULL;
genre_name              title       
----------------------  ------------
Black Metal             NULL        
Death Metal             NULL

We don't even need to list title in this case, since we are only interested in the genre_name of the orphan genres:

SELECT genre_name
  FROM genre LEFT OUTER JOIN track USING(genre_id)
 WHERE title IS NULL;
genre_name            
----------------------
Black Metal           
Death Metal

You can use any column from the track table in the WHERE clause. They will all become NULL, since there is no corresponding row when joining on a column whose value will be missing in the right-hand table.

As a bonus, we can show you how to answer the question of "what genres are not represented in the track table" without a JOIN. You can use a nested SELECT instead:

SELECT genre_name FROM genre WHERE genre_id NOT IN(SELECT DISTINCT genre_id FROM track);
genre_name            
----------------------
Black Metal           
Death Metal

The IN operator takes a column as its left-hand operand and a list as its right-hand operand. It evaluates to TRUE if the left-hand operand is present in the list. We negate the Boolean result by using NOT IN.

Using VIEWS to simplify queries

One useful technique when dealing with complicated JOINs, is to create a view from the JOIN statement. This will create a kind of table representing the result of performing the SELECT with the JOIN.

Here's how you could create a view from the example used above:

CREATE VIEW track_info AS
             SELECT title, artist_name, genre_name, album_name
               FROM track
       NATURAL JOIN artist
       NATURAL JOIN genre
       NATURAL JOIN album;

To use the view is simple. Just pretend it's a table and use a normal SELECT from it:

SELECT * FROM track_info;
title                   artist_name   genre_name    album_name            
----------------------  ------------  ------------  ----------------------
The Look                Roxette       Pop           Look Sharp!           
Sleeping in my car      Roxette       Pop           Crash! Boom! Bang!    
Crawfish                Elvis         Exotica       King Creole           
Mystery Train           Elvis         Rock          Greatest Hits

In SQLite3, views are read-only. So you can only use SELECT on them.

Here's the syntax page about creating views on SQLite.org.

Links

Videos and lecture presentations

Note that some of the videos are (so far) available in Swedish only. However, the slides are in English. We plan to add English videos in the future.

SQL JOIN - Swedish videos

Decomposing a table - English videos

This extra lecture video was created due to students' having trouble with decomposing tables on the 2018 exam. So watch it.

Extra lecture on JOINs - English videos

The extra lecture above is probably not part of the exam of basic/introductory courses using this book, accept as "VG" (harder) questions. But it might be good to have seen it anyway. So see it.

Further reading

Where to go next

« PreviousBook TOCNext »