Database:Combining rows of data from related tables - SQL JOIN

From Juneday education
Jump to: navigation, search

Meta

Expand using link to the right to see the meta section with instructor's notes.

Description

Often, we have a database consisting of a lot more than one single table. The tables in the database represent (or at least store data about) typically one thing or concept that our system needs information about. It is not uncommon that two or more tables are connected in some sense, by giving parts of the data for one larger concept. In this way, we could talk about connected tables.

We could for instance imagine a table about cars. This table could contain some data about cars in rows. Each row could contain data about Make, Color and License Number. But the complete data about the color and make need not be stored inside the Car table. Instead, we could store the License Number in the Car table and refer to rows of more information about the color of a car to a Color table, and information about the Make could be referred to a row in a Make table.

In this way, information about one color only needs to be stored once, in one row of the Color table, and many other tables with objects which have a color could refer to this one row. The same goes for the Make table.

When retrieving data from more than one table which are connected via references in this way, the SQL JOIN construct is used. This chapter/topic introduces both how tables can be decomposed into several tables and how to later JOIN them together in a SELECT statement with a JOIN clause. The chapter does not go into details about normalization and the theory around that. It is more of a hands-on topic with examples and exercises.

Prepare

Expect the main lecture to take around 25 - 30 minutes. The optional lectures are about decomposing a “self contained big table” into smaller tables (around 20 - 25 minutes) and some more advanced examples of JOIN statements (15 - 20 minutes). We recommend that you see all the videos before deciding whether you will give all of them or settle for the main lecture. Also, as usual, review the exercises and questions online. We also recommend that you read the theoretical introduction online on the course material website.

Topics addressed or used in the online exercises

  • Re-visiting creating a directory for the exercise, setting up a database, creating a table for the exercise and inserting some test data etc
  • The exercise goes from a “self-contained” (not normalized) table with redundant data, to a design with at least two tables, where some data is moved from the big table into a new table
  • How to change an existing table in SQLite3 (alter table syntax is rather limited in SQLite3, so this has to be done in steps)
  • How to move data from one table to another
  • Using IDs as foreign keys (surrogate keys)
  • Using WHERE on surrogate keys (IDs) as a join operation
  • Using a WHERE with a JOIN clause
  • Natural joins
  • Using the AS operator to create alias for columns and tables

Heads-up

Think about:

  • Splitting up a table into more tables isn’t always intuitive to the students (or the authors!) - consider using the optional lecture on the topic and refer to literature and online resources
  • Students may ask about the difference between using WHERE to create a join and using JOIN.
    • Our answer is that JOIN makes for a more clear SELECT statement, in particular if you want to filter on some condition other than the joining columns. If you use WHERE to join two tables together on e.g. “product_id” and also want to filter on e.g. price < 100.0 then the where clause has two different purposes - first the joining of the tables, then also the filtering on price.
    • You might have a different answer. Think about this and prepare for an answer.

Full frontal - Code examples up-front!

SELECT product.name, price, alcohol, volume, productgroup.name AS "Product group"
FROM product JOIN productgroup
 ON product.productgroupid = productgroup.id
WHERE product.name LIKE '%loch lomond%' AND price > 500.0;
name                             price    alcohol  volume   Product group
-------------------------------  -------  -------  -------  -------------
Loch Lomond Inchmurrin 12 Years  559.0    46.0     700      Whisky       
Loch Lomond Càrn Mòr 6 Years     628.0    46.0     700      Whisky

Introduction

Often, we have a database consisting of a lot more than one single table. The tables in the database represent (or at least store data about) typically one thing or concept that our system needs information about. It is not uncommon that two or more tables are connected in some sense, by giving parts of the data for one larger concept. In this way, we could talk about connected tables.

An example might perhaps clear up this situation. We might have a system which needs data about cars. Cars come in different make, color and model etc. Furthermore, a car which exists in the real world, must be able to uniquely identify in our database. For this we typically use a primary key (a column with data which uniquely identify one row in the car table). For primary key, we choose something which is unique to each car (each row of information describing one car in the real world). A license plate number springs to mind, since we know that all cars have a unique license plate number.

But we might not want to duplicate all data about cars for such things which is shared by many real cars, e.g. color, make, model etc.

In fact, we might have use for, for instance, the colors for other objects in our database. If we have a table for cars for our system and also a table for garages in the same database, we might realize that both cars and garages have a color in the real world. At the same time, we might realize that the color "Green" doesn't belong to any or some specific cars or garages, it exists on its own merits apart from the objects having the color "Green".

Another thing worth thinking about, is that not only can we share colors as attributes for different types of objects, we might not want to duplicate information about the color name etc in hundreds of rows of data describing cars and garages etc.

A solution to this situation, is to make color its own table, and having both car and garage reference a suitable row in the color table. If we have a color table, we could decide that a color has a name (like "Green" for instance) and a color_id, where the id uniquely identifies a color.

Now, we can have both a "car" and a "garage" referencing the same color (using the color_id for "Green" for instance). This is done by adding a column to the table referencing another table, called a "foreign key".

You can think about a foreign key as a reference meaning "described further elsewhere".

Let's look at a minimal and simplified example of this relationship between a car table and a color table:

--Definition of the color and car tables:

CREATE TABLE color(color_id integer primary key, color text unique not
null);
CREATE TABLE car(id integer primary key, license text unique not null,
 color_id integer,
 foreign key(color_id) references color(color_id));

--Contents of the car table:

sqlite> select * from car;
id license color_id
---------- ---------- ----------
1 AAA 001 1
2 AAA 002 2
3 AAA 003 3

--Contents of the color table:

sqlite> select * from color;
color_id color
---------- ----------
1 Red
2 Blue
3 White
4 Green
5 Silver
6 Gold
7 Pink
8 Grey
9 Black
10 Orange
11 Purple

As you see, the car table has a columns called color_id, which is a foreign key which references some row in the color table, using the color_id integer value to do the actual connection between the tables.

You don't have to focus on the so called constraint of the car table reading foreign key(color_id) references color(color_id) just yet, it is there only to provide some integrity and safeguards, so that we don't add a row with a car that has a color_id which doesn't exist in the color table.

The concept of "foreign keys" has not so much to do with SQL syntax, but is rather a human construct where we mean to say that a car with a certain color_id has the same color as the color in the color table that has the same color_id.

This lecture and exercises will show you some basics for how to write SELECT statements which retrieves data from more than one single table, using some foreign keys as the means to join the two tables. A JOIN is thus a way in SQL to describe which columns are connected in two tables.

To get the cars including their color name from the tables above, we could for instance write the following SELECT statement:

sqlite> .header on
sqlite> .mode column
sqlite> SELECT license, color FROM car JOIN color ON car.color_id = color.color_id;
license     color     
----------  ----------
AAA 001     Red       
AAA 002     Blue      
AAA 003     White

Next, see the lecture video(s) and do the exercises for this chapter.

Lecture presentations and Videos

Note that some of the videos are (so far) available in Swedish only. However, the slides are in English. We plan to add English videos in the future.

The extra lecture above is not part of the exam of basic courses using this book. But it might be good to have seen it anyway.

Links

Further reading

Where to go next

« PreviousBook TOCNext »