- Categories:
Working with Joins¶
A join combines rows from two tables to create a new combined row that can be used in the query.
Introduction¶
Joins are useful when the data in the tables is related. For example, one table might hold information about projects, and one table might hold information about employees working on those projects.
SELECT * FROM projects ORDER BY project_ID; +------------+------------------+ | PROJECT_ID | PROJECT_NAME | |------------+------------------| | 1000 | COVID-19 Vaccine | | 1001 | Malaria Vaccine | | 1002 | NewProject | +------------+------------------+SELECT * FROM employees ORDER BY employee_ID; +-------------+-----------------+------------+ | EMPLOYEE_ID | EMPLOYEE_NAME | PROJECT_ID | |-------------+-----------------+------------| | 10000001 | Terry Smith | 1000 | | 10000002 | Maria Inverness | 1000 | | 10000003 | Pat Wang | 1001 | | 10000004 | NewEmployee | NULL | +-------------+-----------------+------------+
The two joined tables usually contain one or more columns in common so that the rows in one table can be associated with the corresponding rows in the other table. For example, each row in the projects table might have a unique project ID number, and each row in the employees table might include the ID number of the project that the employee is currently assigned to.
The join operation specifies (explicitly or implicitly) how to relate rows in one table to the corresponding rows in the other table, typically by referencing the common column(s), such as project ID. For example, the following joins the project and employee tables shown above:
SELECT p.project_ID, project_name, employee_ID, employee_name, e.project_ID FROM projects AS p JOIN employees AS e ON e.project_ID = p.project_ID ORDER BY p.project_ID, e.employee_ID; +------------+------------------+-------------+-----------------+------------+ | PROJECT_ID | PROJECT_NAME | EMPLOYEE_ID | EMPLOYEE_NAME | PROJECT_ID | |------------+------------------+-------------+-----------------+------------| | 1000 | COVID-19 Vaccine | 10000001 | Terry Smith | 1000 | | 1000 | COVID-19 Vaccine | 10000002 | Maria Inverness | 1000 | | 1001 | Malaria Vaccine | 10000003 | Pat Wang | 1001 | +------------+------------------+-------------+-----------------+------------+
Although a single join operation can join only two tables, joins can be chained together. The result of a join is a table-like object, and that table-like object can then be joined to another table-like object. Conceptually, the idea is similar to the following (this is not the actual syntax):
table1 join (table2 join table 3)
In this pseudo-code, table2 and table3 are joined first. The table that results from that join is then joined with table1.
Joins can be applied not only to tables, but also to other table-like objects. You can join:
A table.
A view (materialized or non-materialized).
An expression that evaluates to the equivalent of a table (containing one or more columns and zero or more rows). For example:
The result set returned by a table function.
The result set returned by a subquery that returns a table.
When this topic refers to joining a table, it generally means joining any table-like object.
Note
Snowflake can improve performance by eliminating unnecessary joins. For details, see Understanding How Snowflake Can Eliminate Redundant Joins.
Types of Joins¶
Snowflake supports the following types of joins:
Inner join.
Outer join.
Cross join.
Natural join.
Inner Join¶
An inner join pairs each row in one table with the matching row(s) in the other table.
The example below uses an inner join:
SELECT p.project_ID, project_name, employee_ID, employee_name, e.project_ID FROM projects AS p INNER JOIN employees AS e ON e.project_id = p.project_id ORDER BY p.project_ID, e.employee_ID; +------------+------------------+-------------+-----------------+------------+ | PROJECT_ID | PROJECT_NAME | EMPLOYEE_ID | EMPLOYEE_NAME | PROJECT_ID | |------------+------------------+-------------+-----------------+------------| | 1000 | COVID-19 Vaccine | 10000001 | Terry Smith | 1000 | | 1000 | COVID-19 Vaccine | 10000002 | Maria Inverness | 1000 | | 1001 | Malaria Vaccine | 10000003 | Pat Wang | 1001 | +------------+------------------+-------------+-----------------+------------+
In this example, the output table contains two columns named “Project_ID”. One Project_ID column is from the projects
table, and one is from the employees table. For each row in the output table, the values in the two Project_ID
columns match because the query specified e.project_id = p.project_id
.
The output includes only valid pairs (i.e. rows that match the join condition). In this example there is no row for the project named “NewProject” (which has no employees assigned yet) or the employee named “NewEmployee” (who hasn’t been assigned to any projects yet).
Outer Join¶
An outer join lists all rows in the specified table, even if those rows have no match in the other table. For example, a left outer join between projects and employees lists all projects, including projects that do not yet have any employee assigned.
SELECT p.project_name, e.employee_name
FROM projects AS p LEFT OUTER JOIN employees AS e
ON e.project_ID = p.project_ID
ORDER BY p.project_name, e.employee_name;
+------------------+-----------------+
| PROJECT_NAME | EMPLOYEE_NAME |
|------------------+-----------------|
| COVID-19 Vaccine | Maria Inverness |
| COVID-19 Vaccine | Terry Smith |
| Malaria Vaccine | Pat Wang |
| NewProject | NULL |
+------------------+-----------------+
The project named “NewProject” is included in this output even though there is no matching row in the employees table. Because there are no matching employee names for the project named “NewProject”, the employee name is set to NULL.
A right outer join lists all employees (regardless of project).
SELECT p.project_name, e.employee_name FROM projects AS p RIGHT OUTER JOIN employees AS e ON e.project_ID = p.project_ID ORDER BY p.project_name, e.employee_name; +------------------+-----------------+ | PROJECT_NAME | EMPLOYEE_NAME | |------------------+-----------------| | COVID-19 Vaccine | Maria Inverness | | COVID-19 Vaccine | Terry Smith | | Malaria Vaccine | Pat Wang | | NULL | NewEmployee | +------------------+-----------------+
A full outer join lists all projects and all employees.
SELECT p.project_name, e.employee_name FROM projects AS p FULL OUTER JOIN employees AS e ON e.project_ID = p.project_ID ORDER BY p.project_name, e.employee_name; +------------------+-----------------+ | PROJECT_NAME | EMPLOYEE_NAME | |------------------+-----------------| | COVID-19 Vaccine | Maria Inverness | | COVID-19 Vaccine | Terry Smith | | Malaria Vaccine | Pat Wang | | NewProject | NULL | | NULL | NewEmployee | +------------------+-----------------+
Cross Join¶
A cross join combines each row in the first table with each row in the second table, creating every possible combination of rows (called a “Cartesian product”). Because most of the result rows contain parts of rows that are not actually related, a cross join is rarely useful by itself. In fact, cross joins are usually the result of accidentally omitting the join condition.
The result of a cross join can be very large (and expensive). If the first table has N rows and the second table has M rows, then the result is N x M rows. For example, if the first table has 100 rows and the second table has 1000 rows, then the result set contains 100,000 rows.
The following query shows a cross join:
SELECT p.project_name, e.employee_name FROM projects AS p CROSS JOIN employees AS e ORDER BY p.project_ID, e.employee_ID; +------------------+-----------------+ | PROJECT_NAME | EMPLOYEE_NAME | |------------------+-----------------| | COVID-19 Vaccine | Terry Smith | | COVID-19 Vaccine | Maria Inverness | | COVID-19 Vaccine | Pat Wang | | COVID-19 Vaccine | NewEmployee | | Malaria Vaccine | Terry Smith | | Malaria Vaccine | Maria Inverness | | Malaria Vaccine | Pat Wang | | Malaria Vaccine | NewEmployee | | NewProject | Terry Smith | | NewProject | Maria Inverness | | NewProject | Pat Wang | | NewProject | NewEmployee | +------------------+-----------------+
Note that this query contains no ON
clause and no filter.
The output of a cross join can be made more useful by applying a filter in the WHERE
clause:
SELECT p.project_name, e.employee_name FROM projects AS p CROSS JOIN employees AS e WHERE e.project_ID = p.project_ID ORDER BY p.project_ID, e.employee_ID; +------------------+-----------------+ | PROJECT_NAME | EMPLOYEE_NAME | |------------------+-----------------| | COVID-19 Vaccine | Terry Smith | | COVID-19 Vaccine | Maria Inverness | | Malaria Vaccine | Pat Wang | +------------------+-----------------+
The result of this cross join and filter is the same as the result of the following inner join:
SELECT p.project_name, e.employee_name FROM projects AS p INNER JOIN employees AS e ON e.project_ID = p.project_ID ORDER BY p.project_ID, e.employee_ID; +------------------+-----------------+ | PROJECT_NAME | EMPLOYEE_NAME | |------------------+-----------------| | COVID-19 Vaccine | Terry Smith | | COVID-19 Vaccine | Maria Inverness | | Malaria Vaccine | Pat Wang | +------------------+-----------------+
Important
Although the two queries in this example produce the same output when they use the same condition
(e.project_id = p.project_id
) in different clauses (WHERE
vs. FROM ... ON ...
), it is possible to
construct pairs of queries that use the same condition but that do not produce the same output.
The most common examples involve outer joins. If you execute table1 LEFT OUTER JOIN table2
, then for rows in
table1 that have no match, the columns that would have come from table2 contain NULL. A filter
like WHERE table2.ID = table1.ID
filters out rows in which either table2.id or table1.id contains a
NULL, while an explicit outer join in the FROM ... ON ...
clause does not filter out rows with NULL values.
In other words, an outer join with a filter might not actually act like an outer join.
Natural Join¶
A natural join is used when two tables contain columns that have the same name and in which the data in those
columns corresponds. In the employees and projects tables shown above, both tables have columns named “project_ID”.
A natural join implicitly constructs the ON
clause: ON projects.project_ID = employees.project_ID
.
If two tables have multiple columns in common, then all the common columns are used in the ON
clause. For example, if you had
two tables that each had columns named “city” and “province”, then a natural join would construct the following ON clause:
ON table2.city = table1.city AND table2.province = table1.province
.
The output of a natural join includes only one copy of each of the shared columns. For example, the following query produces a natural join containing all columns in the two tables, except that it omits all but one copy of the redundant project_ID column:
SELECT * FROM projects NATURAL JOIN employees ORDER BY employee_ID; +------------+------------------+-------------+-----------------+ | PROJECT_ID | PROJECT_NAME | EMPLOYEE_ID | EMPLOYEE_NAME | |------------+------------------+-------------+-----------------| | 1000 | COVID-19 Vaccine | 10000001 | Terry Smith | | 1000 | COVID-19 Vaccine | 10000002 | Maria Inverness | | 1001 | Malaria Vaccine | 10000003 | Pat Wang | +------------+------------------+-------------+-----------------+
A natural join can be combined with an outer join.
A natural join cannot be combined with an ON
clause because the join condition is already implied. However, you
can use a WHERE
clause to filter the results of a natural join.
Implementing Joins¶
Syntactically, there are two ways to join tables:
Snowflake recommends using the ON
sub-clause in the FROM
clause. The syntax is more flexible. And specifying the predicate
in the ON
clause avoids the problem of accidentally filtering rows with NULLs when using a WHERE
clause to
specify the join condition for an outer join.