2. Purpose
One of the most common interview questions for IT jobs is “Can you explain the
difference between an inner join and an outer join?” The question is common for a simple
reason: understanding the difference, and knowing which join to use for a particular
purpose, is the key to writing the complex SQL queries needed to capture and analyze the
data used in most large applications and many small ones. This knowledge clearly separates
the employees who need no supervision from those who will constantly be asking for
assistance.
The purpose of this document is not just to give you a cheat sheet so you can get
through an interview sounding like you actually know SQL, but rather to give you an
understanding of joins that will allow you to excel in your current position even if you are
not selected for a new one.
That said, this is not an SQL training course. This guide is intended to fill a gap in
popular SQL learning materials and courses. Nearly all instructional materials and
instructors will at least attempt to explain the difference between an inner and an outer join,
but few do so in a way that students can understand if they do not already know the
difference. My intention is to explain the difference in a way that the rest of us can
understand. To this end, I will avoid “big words” and discussion of the underlying
mathematical theory where possible.
Assumptions
Throughout this document we will assume the following:
• You already understand SQL well enough to write single-table queries.
• You are at least vaguely familiar with the purpose of the WHERE clause.
• You have access to a relational database that supports SQL queries.
• There is already data in your database.
• You already have access to a query execution tool (even if the tool is just the
command line interface to the database management system) and you know
how to execute queries with it.
If any of the assumptions above are incorrect, the information below may not be of
much value to you. Most SQL manuals are clear enough to give the average person the
required knowledge, and most of the manuals also include either a database with data in it
or instructions to create one. Please consult your manual and/or your database
administrator for further information.
3. Sample Tables
The tables in the diagram below show a common database structure. Many of the
fields that would normally be in these tables are not in them because they are not needed to
explain the concepts. Below the diagram is a listing of the data in each of the tables.
In the structure below, a customer can have many orders. An order can have many
products, and a product can be on many orders.
There is one customer in the database who does not have any orders. There are also
several orders that do not have products. For the purpose of this exercise, I turned off
referential integrity checks, which allowed me to create orders that did not have customers.
It might seem like this should never happen in a production system, but an initial data
load is quite often executed without referential integrity checks because the load runs
much faster that way. We would all hope that all required data is included, but sometimes
a decision is made to load only part of the legacy data. Even when all data is expected to
be loaded, someone has to write the SQL to verify that it was loaded correctly. For these
reasons I consider this a valid example even though I had to disable referential integrity
to create it.
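For readers who want to build something similar, here is a minimal sketch of tables that
would fit the queries in this guide. The table and column names are the ones used in the
queries later in the document. The data types, the keys, and every row except the customer
Nick are assumptions made up for illustration and are not the original sample data. Foreign
keys are deliberately left out, which has the same effect as turning off referential
integrity checks. Data types may need adjusting for your database management system.

```sql
-- Hypothetical schema: names come from the queries in this guide;
-- data types and sample rows are assumptions.
CREATE TABLE CUST (
    CUST_ID  INTEGER PRIMARY KEY,
    CUST_NM  VARCHAR(50)
);

CREATE TABLE ORD (
    ORD_ID   INTEGER PRIMARY KEY,
    CUST_ID  INTEGER,      -- no foreign key, so orders without customers are possible
    ORD_TS   TIMESTAMP,
    SHIP_TS  TIMESTAMP
);

CREATE TABLE PRD (
    PRD_ID    INTEGER PRIMARY KEY,
    PRD_DESC  VARCHAR(100)
);

-- Link table: one row per product on an order.
CREATE TABLE ORD_PRD (
    ORD_ID  INTEGER,
    PRD_ID  INTEGER
);

INSERT INTO CUST (CUST_ID, CUST_NM) VALUES (1, 'Alice');  -- hypothetical customer
INSERT INTO CUST (CUST_ID, CUST_NM) VALUES (2, 'Nick');   -- has no orders

INSERT INTO ORD (ORD_ID, CUST_ID, ORD_TS, SHIP_TS)
VALUES (100, 1, '2024-01-05 10:00:00', '2024-01-06 09:00:00');
INSERT INTO ORD (ORD_ID, CUST_ID, ORD_TS, SHIP_TS)
VALUES (101, NULL, '2024-01-07 14:30:00', '2024-01-08 09:00:00');  -- order without a customer

INSERT INTO PRD (PRD_ID, PRD_DESC) VALUES (10, 'Widget');
INSERT INTO PRD (PRD_ID, PRD_DESC) VALUES (11, 'Gadget');  -- never ordered

INSERT INTO ORD_PRD (ORD_ID, PRD_ID) VALUES (100, 10);
-- Order 101 has no rows in ORD_PRD, so it is an order without products.
```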
5. Inner Join
An inner join will output only the results from each table where the join conditions
are met. It will not output any rows from either table where the conditions are not met. In
the examples below, the join condition is CUST.CUST_ID = ORD.CUST_ID. Only rows
where the CUST_ID field matches in both tables will be displayed in the output.
There are two basic ways to implement an inner join. Most of you are probably
already familiar with one of them.
1) You can put your join conditions in the WHERE clause:
SELECT
    *
FROM
    CUST
    , ORD
WHERE
    CUST.CUST_ID = ORD.CUST_ID
;
2) You can use INNER JOIN in the FROM clause (Note: in most relational database
management systems, JOIN is equivalent to INNER JOIN):
SELECT
    *
FROM
    CUST
    INNER JOIN ORD
        ON CUST.CUST_ID = ORD.CUST_ID
;
6. Left Outer Join
Outer joins are used to display the results from one table where there are no
corresponding results in another table (where the join conditions are not met). If you want
to know which of your customers have not ordered anything, you would need an outer join
because if a customer has not ordered anything, there will be no entry in the ORD table to
match the customer, and an inner join would not output that customer at all.
Much of the confusion around outer joins is caused by the use of LEFT and
RIGHT. Look at the query below:
SELECT * FROM CUST LEFT OUTER JOIN ORD ON CUST.CUST_ID = ORD.CUST_ID;
With the query written on a single line, the first table (CUST) is to the left of the
second table (ORD). A left outer join will output all of the rows in the first (or left) table
and only the rows from the second (or right) table that match the join conditions. This
means that where the join conditions are not met, the output for the second (or right) table
will be filled with nulls (some query tools will actually display <NULL>, and others will
simply display a blank field).
In the result set below, you can see that Nick has not ordered anything. If you go
back and look at the result sets for the inner join examples, you will see that Nick was not
displayed at all because there was no matching entry with his ID in the ORD table.
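To answer the question from the start of this section directly (which customers have not
ordered anything?), one common approach is to keep only the rows where the right table came
back null. The query below is a minimal sketch. It assumes that ORD_ID is never null in a
real ORD row, which is not stated in this guide but is the usual case for a key column.

```sql
-- Customers with no matching row in ORD.
-- For an unmatched customer every ORD column is null, so testing a column
-- that is never null in a real order (assumed here: ORD_ID) isolates them.
SELECT
    CUST.CUST_ID
    , CUST.CUST_NM
FROM
    CUST
    LEFT OUTER JOIN ORD
        ON CUST.CUST_ID = ORD.CUST_ID
WHERE
    ORD.ORD_ID IS NULL
;
```

With the data described in this guide, Nick should be the customer returned.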
7. Right Outer Join
Right outer joins serve the same purpose as left outer joins, but they are much less
common. The reason is that in most cases where you might want to use a right outer join,
you can simply swap the order of the tables in the query and use a left outer join
instead.
If we take the query from the left outer join example above and change it to a right
outer join without changing the table order, we will see all of the orders that have
customers and all of the orders that do not have customers. The left outer join version,
by contrast, showed us all of the customers that have orders and all of the customers that
do not have orders.
SELECT * FROM CUST RIGHT OUTER JOIN ORD ON CUST.CUST_ID = ORD.CUST_ID;
Once again, the non-matching results will be displayed as null. Notice in the output
below that the null values are on the left where they were previously on the right when we
used a left outer join. Since we did not specify the order to output the fields (we used
SELECT *), the fields from the left table are displayed on the left and the fields from the
right table are displayed on the right.
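As mentioned at the start of this section, the same rows can usually be obtained by
swapping the table order and using a left outer join. A sketch of that rewrite for the
query above:

```sql
-- Returns the same rows as CUST RIGHT OUTER JOIN ORD: every order is kept,
-- and the CUST columns are null where no customer matches.
SELECT
    *
FROM
    ORD
    LEFT OUTER JOIN CUST
        ON CUST.CUST_ID = ORD.CUST_ID
;
```

Because this version also uses SELECT *, the ORD fields now come first, so the null values
move to the right side of the output. List the fields explicitly if their order matters.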
8. Full Outer Join
Since I do not currently have access to a DB2 database in which I can create the sample
tables, and MySQL does not support full outer joins, I will describe the concept and
provide simulated output below.
Full outer joins are used where we need all rows from both left and right tables but
some rows in each table do not have corresponding entries in the other table. In our
example, we will be looking at all customers and all orders. Where a customer matches an
order, we want to display the results on one line. Where a customer does not have any
orders, we want to display the customer and some null fields where the order should be.
Where an order does not have a customer assigned to it, we want to display the order and
some null fields where the customer should be.
SELECT * FROM CUST FULL OUTER JOIN ORD ON CUST.CUST_ID = ORD.CUST_ID;
Full outer joins are rare, but in any situation where one is needed, the full outer join
is much less complex than the alternatives.
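For comparison, one widely used alternative on systems without FULL OUTER JOIN (such as
MySQL) is to combine a left outer join with the unmatched rows from a right outer join.
The sketch below lists fields explicitly instead of using SELECT *. The field list is an
assumption based on the fields used elsewhere in this guide, and the query assumes that
CUST_ID is never null in a real CUST row.

```sql
-- Part 1: all customers, with their orders where they exist.
SELECT
    CUST.CUST_ID
    , CUST.CUST_NM
    , ORD.ORD_ID
    , ORD.ORD_TS
FROM
    CUST
    LEFT OUTER JOIN ORD
        ON CUST.CUST_ID = ORD.CUST_ID
UNION ALL
-- Part 2: only the orders that matched no customer, so the
-- matched rows already returned by part 1 are not repeated.
SELECT
    CUST.CUST_ID
    , CUST.CUST_NM
    , ORD.ORD_ID
    , ORD.ORD_TS
FROM
    CUST
    RIGHT OUTER JOIN ORD
        ON CUST.CUST_ID = ORD.CUST_ID
WHERE
    CUST.CUST_ID IS NULL
;
```

Setting this next to the one-line FULL OUTER JOIN query shows why the full outer join is
the simpler choice wherever it is available.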
9. Using Multiple Joins
Looking through the dataset we started with, you might have noticed that not only
do we have customers without orders and orders without customers, but we also have
orders with no products. The customers without orders might be explained by a legacy data
conversion that did not include outdated orders or by an initial contact with a customer who
has not yet decided if he or she wants to order anything at all. The orders without
customers and the orders with no products on them, however, probably indicate that we
have some problems with the software that created the database entries (whether that
software is the user interface or a data conversion utility, the problem remains).
Since solving problems is what we do best in IT (at least if we want to retain our
jobs after someone asks us what we do all day), we might want to document the data that is
missing so the developer can look at the source code and correct it and so we can
retroactively correct the data if possible.
The query below will display every order along with its associated customer (or a null
value if there is no customer) and the products that are on the order (or a null value if
there are no products).
SELECT
    CUST.CUST_NM
    , ORD.ORD_TS
    , ORD.SHIP_TS
    , ORD.ORD_ID
    , PRD.PRD_DESC
FROM
    ORD
    LEFT OUTER JOIN CUST
        ON CUST.CUST_ID = ORD.CUST_ID
    LEFT OUTER JOIN ORD_PRD
        ON ORD.ORD_ID = ORD_PRD.ORD_ID
    LEFT OUTER JOIN PRD
        ON ORD_PRD.PRD_ID = PRD.PRD_ID
;
The same query, modified to display only the entries where the customer or the product is
null, is actually far more useful because you do not have to sort through all of the valid
data to locate the invalid data:
SELECT
    CUST.CUST_NM
    , ORD.ORD_TS
    , ORD.SHIP_TS
    , ORD.ORD_ID
    , PRD.PRD_DESC
FROM
    ORD
    LEFT OUTER JOIN CUST
        ON CUST.CUST_ID = ORD.CUST_ID
    LEFT OUTER JOIN ORD_PRD
        ON ORD.ORD_ID = ORD_PRD.ORD_ID
    LEFT OUTER JOIN PRD
        ON ORD_PRD.PRD_ID = PRD.PRD_ID
WHERE
    PRD.PRD_ID IS NULL
    OR CUST.CUST_ID IS NULL
;
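If the output is going to a developer, it can help to state on each row which piece is
missing. The variation below is only a sketch: the ISSUE_DESC field and its labels are made
up for illustration, and fewer fields are selected to keep it short.

```sql
SELECT
    ORD.ORD_ID
    -- The WHERE clause guarantees at least one of the two IDs is null,
    -- so the ELSE branch can only mean the product is missing.
    , CASE
        WHEN CUST.CUST_ID IS NULL AND PRD.PRD_ID IS NULL
            THEN 'no customer, no product'
        WHEN CUST.CUST_ID IS NULL
            THEN 'no customer'
        ELSE 'no product'
      END AS ISSUE_DESC
FROM
    ORD
    LEFT OUTER JOIN CUST
        ON CUST.CUST_ID = ORD.CUST_ID
    LEFT OUTER JOIN ORD_PRD
        ON ORD.ORD_ID = ORD_PRD.ORD_ID
    LEFT OUTER JOIN PRD
        ON ORD_PRD.PRD_ID = PRD.PRD_ID
WHERE
    PRD.PRD_ID IS NULL
    OR CUST.CUST_ID IS NULL
;
```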
11. Exercises
1. Modify the query with multiple joins to also include customers that do not have
orders.
2. Write a query to return all of the products that have never been ordered.
3. Write a query to return only orders that have neither customers nor products
associated with them.
4. Start applying this knowledge at work.