Let's try to understand joins by trying to verbalize what 'question' each join is trying to answer.
We will try to frame the explanation in terms of a real world example. We will build a Jobs DB
with two tables - job and candidate. The job table will have a city
field representing where the job location is at. Similarly candidate
table also has a city
field representing the city which the candidate prefers. Our goal is to match candidates to jobs based on the city. We will try to answer different questions with different joins and subqueries.
This seemed like a better example than other examples based on primary keys because this particular example does the matching on a column which can allow duplicate values. So we can understand the general case of how joins work. The case where the join is on a primary key column or a col with unique constraint - will just be a special case of this general case.
CREATE TABLE candidate (id serial PRIMARY KEY, email VARCHAR (255) UNIQUE NOT NULL, city VARCHAR (255) );
CREATE TABLE job (id serial PRIMARY KEY, designation VARCHAR (255), company VARCHAR (255), city VARCHAR (255) );
We will keep Chennai and Bengaluru as common cities between the two tables and then add some extra entries to both tables with some other cities.
INSERT INTO candidate (email, city) VALUES
('[email protected]', 'Chennai'),
('[email protected]', 'Bengaluru'),
('[email protected]', 'Chennai'),
('[email protected]', 'Pune'),
('[email protected]', 'Hyderabad'),
('[email protected]', 'Bengaluru'),
('[email protected]', 'Pune'),
('[email protected]', 'Trivandrum'),
('[email protected]', 'Trivandrum'),
('[email protected]', 'Chennai'),
('[email protected]', 'Lucknow'),
('[email protected]', 'Indore');
INSERT INTO job (designation, company, city) VALUES
('SDE 1', 'Amazon', 'Chennai'),
('SDE 2', 'Amazon', 'Bengaluru'),
('SDE3', 'Amazon', 'Seattle'),
('SDE', 'Microsoft', 'Bengaluru'),
('Engineering Manager', 'FunnelIQ', 'Chennai'),
('SDE', 'Grab', 'Singapore'),
('Product Manager', 'Uber', 'Bengaluru'),
('Engineering Lead', 'Adobe', 'Gurgaon'),
('Principal Engineer', 'Grab', 'Bengaluru'),
('TPM', 'Amazon', 'Chennai'),
('SDE2', 'Grab', 'Singapore'),
('TPM', 'Amazon', 'Seattle'),
('SDE1', 'Adobe', 'Gurgaon');
Let's first review the data we are going to play around with.
SELECT * FROM candidate;
id | city | |
---|---|---|
1 | [email protected] | Chennai |
2 | [email protected] | Bengaluru |
3 | [email protected] | Chennai |
4 | [email protected] | Pune |
5 | [email protected] | Hyderabad |
6 | [email protected] | Bengaluru |
7 | [email protected] | Pune |
8 | [email protected] | Trivandrum |
9 | [email protected] | Trivandrum |
10 | [email protected] | Chennai |
11 | [email protected] | Lucknow |
12 | [email protected] | Indore |
(12 rows)
SELECT * FROM job;
id | designation | company | city |
---|---|---|---|
1 | SDE 1 | Amazon | Chennai |
2 | SDE 2 | Amazon | Bengaluru |
3 | SDE3 | Amazon | Seattle |
4 | SDE | Microsoft | Bengaluru |
5 | Engineering Manager | FunnelIQ | Chennai |
6 | SDE | Grab | Singapore |
7 | Product Manager | Uber | Bengaluru |
8 | Engineering Lead | Adobe | Gurgaon |
9 | Principal Engineer | Grab | Bengaluru |
10 | TPM | Amazon | Chennai |
11 | SDE2 | Grab | Singapore |
12 | TPM | Amazon | Seattle |
13 | SDE1 | Adobe | Gurgaon |
(13 rows)
We will first try to answer a few simple queries
This can be done with a subquery itself as we require fields from the candidate table only, with only the constraint being specified on the job table. So a subquery on the WHERE clause is enough.
SELECT candidate.id, candidate.email, candidate.city AS candidate_city
FROM candidate WHERE candidate.city IN (SELECT city FROM job);
id | candidate_city | |
---|---|---|
1 | [email protected] | Chennai |
2 | [email protected] | Bengaluru |
3 | [email protected] | Chennai |
6 | [email protected] | Bengaluru |
10 | [email protected] | Chennai |
(5 rows)
SELECT candidate.id, candidate.email, candidate.city AS candidate_city
FROM candidate WHERE candidate.city NOT IN (SELECT city FROM job);
id | candidate_city | |
---|---|---|
4 | [email protected] | Pune |
5 | [email protected] | Hyderabad |
7 | [email protected] | Pune |
8 | [email protected] | Trivandrum |
9 | [email protected] | Trivandrum |
11 | [email protected] | Lucknow |
12 | [email protected] | Indore |
(7 rows)
SELECT job.id, job.designation, job.company, job.city AS job_city
FROM job WHERE job.city IN (SELECT city FROM candidate);
id | designation | company | job_city |
---|---|---|---|
1 | SDE 1 | Amazon | Chennai |
2 | SDE 2 | Amazon | Bengaluru |
4 | SDE | Microsoft | Bengaluru |
5 | Engineering Manager | FunnelIQ | Chennai |
7 | Product Manager | Uber | Bengaluru |
9 | Principal Engineer | Grab | Bengaluru |
10 | TPM | Amazon | Chennai |
(7 rows)
SELECT job.id, job.designation, job.company, job.city AS job_city
FROM job WHERE job.city NOT IN (SELECT city FROM candidate);
id | designation | company | job_city |
---|---|---|---|
3 | SDE3 | Amazon | Seattle |
6 | SDE | Grab | Singapore |
8 | Engineering Lead | Adobe | Gurgaon |
11 | SDE2 | Grab | Singapore |
12 | TPM | Amazon | Seattle |
13 | SDE1 | Adobe | Gurgaon |
(6 rows)
Subqueries are enough if the fields we need are in one table only. But if we want to combine the fields from multiple tables, joins become necessary.
Let's say we are building an interface where we want to show all possible matching pairs of candidates and jobs. In this case the query to answer is "List all pairs of matching jobs and candidates which have the same city"
Inner Join is meant to answer this. The data has only Chennai and Bengaluru as the common cities. With 3 candidates in Chennai matched to 3 jobs in Chennai and 2 candidates in Bengaluru matched to 4 jobs there. So the total number of matching pairs would be 3 * 3 + 4 * 2 = 17
SELECT job.id AS job_id, job.designation, job.company, job.city AS job_city, candidate.id AS candidate_id,
candidate.email AS candidate_email, candidate.city AS candidate_city
FROM job INNER JOIN candidate ON job.city = candidate.city;
job_id | designation | company | job_city | candidate_id | candidate_email | candidate_city |
---|---|---|---|---|---|---|
10 | TPM | Amazon | Chennai | 1 | [email protected] | Chennai |
5 | Engineering Manager | FunnelIQ | Chennai | 1 | [email protected] | Chennai |
1 | SDE 1 | Amazon | Chennai | 1 | [email protected] | Chennai |
9 | Principal Engineer | Grab | Bengaluru | 2 | [email protected] | Bengaluru |
7 | Product Manager | Uber | Bengaluru | 2 | [email protected] | Bengaluru |
4 | SDE | Microsoft | Bengaluru | 2 | [email protected] | Bengaluru |
2 | SDE 2 | Amazon | Bengaluru | 2 | [email protected] | Bengaluru |
10 | TPM | Amazon | Chennai | 3 | [email protected] | Chennai |
5 | Engineering Manager | FunnelIQ | Chennai | 3 | [email protected] | Chennai |
1 | SDE 1 | Amazon | Chennai | 3 | [email protected] | Chennai |
9 | Principal Engineer | Grab | Bengaluru | 6 | [email protected] | Bengaluru |
7 | Product Manager | Uber | Bengaluru | 6 | [email protected] | Bengaluru |
4 | SDE | Microsoft | Bengaluru | 6 | [email protected] | Bengaluru |
2 | SDE 2 | Amazon | Bengaluru | 6 | [email protected] | Bengaluru |
10 | TPM | Amazon | Chennai | 10 | [email protected] | Chennai |
5 | Engineering Manager | FunnelIQ | Chennai | 10 | [email protected] | Chennai |
1 | SDE 1 | Amazon | Chennai | 10 | [email protected] | Chennai |
(17 rows)
Inner Join is symmetric. So the order doesn't matter. We can do INNER JOIN from candidate to job and see that it returns the same result set.
SELECT job.id AS job_id, job.designation, job.company, job.city AS job_city,
candidate.id AS candidate_id, candidate.email AS candidate_email, candidate.city AS candidate_city
FROM candidate INNER JOIN job ON job.city = candidate.city;
job_id | designation | company | job_city | candidate_id | candidate_email | candidate_city |
---|---|---|---|---|---|---|
10 | TPM | Amazon | Chennai | 1 | [email protected] | Chennai |
5 | Engineering Manager | FunnelIQ | Chennai | 1 | [email protected] | Chennai |
1 | SDE 1 | Amazon | Chennai | 1 | [email protected] | Chennai |
9 | Principal Engineer | Grab | Bengaluru | 2 | [email protected] | Bengaluru |
7 | Product Manager | Uber | Bengaluru | 2 | [email protected] | Bengaluru |
4 | SDE | Microsoft | Bengaluru | 2 | [email protected] | Bengaluru |
2 | SDE 2 | Amazon | Bengaluru | 2 | [email protected] | Bengaluru |
10 | TPM | Amazon | Chennai | 3 | [email protected] | Chennai |
5 | Engineering Manager | FunnelIQ | Chennai | 3 | [email protected] | Chennai |
1 | SDE 1 | Amazon | Chennai | 3 | [email protected] | Chennai |
9 | Principal Engineer | Grab | Bengaluru | 6 | [email protected] | Bengaluru |
7 | Product Manager | Uber | Bengaluru | 6 | [email protected] | Bengaluru |
4 | SDE | Microsoft | Bengaluru | 6 | [email protected] | Bengaluru |
2 | SDE 2 | Amazon | Bengaluru | 6 | [email protected] | Bengaluru |
10 | TPM | Amazon | Chennai | 10 | [email protected] | Chennai |
5 | Engineering Manager | FunnelIQ | Chennai | 10 | [email protected] | Chennai |
1 | SDE 1 | Amazon | Chennai | 10 | [email protected] | Chennai |
(17 rows)
After showing the pairs of jobs and matching candidates, lets say we also want to show the jobs which don't have any matching candidate. This is answered by Left Outer Join.
The query statement is - "Pair jobs with matching candidates and list all pairs. If there is no matching candidate for a job, then pair it with an empty row on the candidate side and list that also"
This can also be phrased as - "Pairs of matching jobs and candidates which have the same city + Jobs in cities without any available candidates"
Now in the above expression, the first part is the same as the INNER JOIN and the second part is the same as the "NOT IN" subquery we wrote for "Jobs in cities without any available candidates"
So the total rows would be the sum of the rows for both those queries = 17 + 6 = 23
SELECT job.id AS job_id, job.designation, job.company, job.city AS job_city,
candidate.id AS candidate_id, candidate.email AS candidate_email,
candidate.city AS candidate_city FROM job
LEFT OUTER JOIN candidate ON job.city = candidate.city;
job_id | designation | company | job_city | candidate_id | candidate_email | candidate_city |
---|---|---|---|---|---|---|
10 | TPM | Amazon | Chennai | 1 | [email protected] | Chennai |
5 | Engineering Manager | FunnelIQ | Chennai | 1 | [email protected] | Chennai |
1 | SDE 1 | Amazon | Chennai | 1 | [email protected] | Chennai |
9 | Principal Engineer | Grab | Bengaluru | 2 | [email protected] | Bengaluru |
7 | Product Manager | Uber | Bengaluru | 2 | [email protected] | Bengaluru |
4 | SDE | Microsoft | Bengaluru | 2 | [email protected] | Bengaluru |
2 | SDE 2 | Amazon | Bengaluru | 2 | [email protected] | Bengaluru |
10 | TPM | Amazon | Chennai | 3 | [email protected] | Chennai |
5 | Engineering Manager | FunnelIQ | Chennai | 3 | [email protected] | Chennai |
1 | SDE 1 | Amazon | Chennai | 3 | [email protected] | Chennai |
9 | Principal Engineer | Grab | Bengaluru | 6 | [email protected] | Bengaluru |
7 | Product Manager | Uber | Bengaluru | 6 | [email protected] | Bengaluru |
4 | SDE | Microsoft | Bengaluru | 6 | [email protected] | Bengaluru |
2 | SDE 2 | Amazon | Bengaluru | 6 | [email protected] | Bengaluru |
10 | TPM | Amazon | Chennai | 10 | [email protected] | Chennai |
5 | Engineering Manager | FunnelIQ | Chennai | 10 | [email protected] | Chennai |
1 | SDE 1 | Amazon | Chennai | 10 | [email protected] | Chennai |
11 | SDE2 | Grab | Singapore | |||
6 | SDE | Grab | Singapore | |||
13 | SDE1 | Adobe | Gurgaon | |||
8 | Engineering Lead | Adobe | Gurgaon | |||
12 | TPM | Amazon | Seattle | |||
3 | SDE3 | Amazon | Seattle |
(23 rows)
LEFT OUTER JOIN is not symmetric. If we change the order of the join, the meaning of the query changes.
It becomes, "Pair candidates with matching jobs and list all pairs. If there is no matching job for a candidate, then pair it with an empty row on the job side and list that also"
Or in other words - "Pairs of matching jobs and candidates which have the same city + Candidates in cities without any available jobs"
Again we can see that this is equivalent to adding the result of INNER JOIN with the result of the "NOT IN" subquery where we found the "Candidates in cities without any available jobs". So the total rows in the resultset would be 17 + 7 = 24
SELECT job.id AS job_id, job.designation, job.company, job.city AS job_city,
candidate.id AS candidate_id, candidate.email AS candidate_email,
candidate.city AS candidate_city FROM candidate LEFT OUTER JOIN job ON job.city = candidate.city;
job_id | designation | company | job_city | candidate_id | candidate_email | candidate_city |
---|---|---|---|---|---|---|
10 | TPM | Amazon | Chennai | 1 | [email protected] | Chennai |
5 | Engineering Manager | FunnelIQ | Chennai | 1 | [email protected] | Chennai |
1 | SDE 1 | Amazon | Chennai | 1 | [email protected] | Chennai |
9 | Principal Engineer | Grab | Bengaluru | 2 | [email protected] | Bengaluru |
7 | Product Manager | Uber | Bengaluru | 2 | [email protected] | Bengaluru |
4 | SDE | Microsoft | Bengaluru | 2 | [email protected] | Bengaluru |
2 | SDE 2 | Amazon | Bengaluru | 2 | [email protected] | Bengaluru |
10 | TPM | Amazon | Chennai | 3 | [email protected] | Chennai |
5 | Engineering Manager | FunnelIQ | Chennai | 3 | [email protected] | Chennai |
1 | SDE 1 | Amazon | Chennai | 3 | [email protected] | Chennai |
4 | [email protected] | Pune | ||||
5 | [email protected] | Hyderabad | ||||
9 | Principal Engineer | Grab | Bengaluru | 6 | [email protected] | Bengaluru |
7 | Product Manager | Uber | Bengaluru | 6 | [email protected] | Bengaluru |
4 | SDE | Microsoft | Bengaluru | 6 | [email protected] | Bengaluru |
2 | SDE 2 | Amazon | Bengaluru | 6 | [email protected] | Bengaluru |
7 | [email protected] | Pune | ||||
8 | [email protected] | Trivandrum | ||||
9 | [email protected] | Trivandrum | ||||
10 | TPM | Amazon | Chennai | 10 | [email protected] | Chennai |
5 | Engineering Manager | FunnelIQ | Chennai | 10 | [email protected] | Chennai |
1 | SDE 1 | Amazon | Chennai | 10 | [email protected] | Chennai |
11 | [email protected] | Lucknow | ||||
12 | [email protected] | Indore |
(24 rows)
Right Outer Join is just the mirror image of LEFT OUTER JOIN. The statement "FROM job RIGHT OUTER JOIN candidate" is exactly equivalent to "FROM candidate LEFT OUTER JOIN job", which is basically the query we discussed just before. So whenever we need to visualize RIGHT OUTER JOIN, we can just instead visualize it as LEFT OUTER JOIN with the order of the join reversed.
SELECT job.id AS job_id, job.designation, job.company, job.city AS job_city, candidate.id AS candidate_id,
candidate.email AS candidate_email, candidate.city AS candidate_city
FROM job RIGHT OUTER JOIN candidate ON job.city = candidate.city;
job_id | designation | company | job_city | candidate_id | candidate_email | candidate_city |
---|---|---|---|---|---|---|
10 | TPM | Amazon | Chennai | 1 | [email protected] | Chennai |
5 | Engineering Manager | FunnelIQ | Chennai | 1 | [email protected] | Chennai |
1 | SDE 1 | Amazon | Chennai | 1 | [email protected] | Chennai |
9 | Principal Engineer | Grab | Bengaluru | 2 | [email protected] | Bengaluru |
7 | Product Manager | Uber | Bengaluru | 2 | [email protected] | Bengaluru |
4 | SDE | Microsoft | Bengaluru | 2 | [email protected] | Bengaluru |
2 | SDE 2 | Amazon | Bengaluru | 2 | [email protected] | Bengaluru |
10 | TPM | Amazon | Chennai | 3 | [email protected] | Chennai |
5 | Engineering Manager | FunnelIQ | Chennai | 3 | [email protected] | Chennai |
1 | SDE 1 | Amazon | Chennai | 3 | [email protected] | Chennai |
4 | [email protected] | Pune | ||||
5 | [email protected] | Hyderabad | ||||
9 | Principal Engineer | Grab | Bengaluru | 6 | [email protected] | Bengaluru |
7 | Product Manager | Uber | Bengaluru | 6 | [email protected] | Bengaluru |
4 | SDE | Microsoft | Bengaluru | 6 | [email protected] | Bengaluru |
2 | SDE 2 | Amazon | Bengaluru | 6 | [email protected] | Bengaluru |
7 | [email protected] | Pune | ||||
8 | [email protected] | Trivandrum | ||||
9 | [email protected] | Trivandrum | ||||
10 | TPM | Amazon | Chennai | 10 | [email protected] | Chennai |
5 | Engineering Manager | FunnelIQ | Chennai | 10 | [email protected] | Chennai |
1 | SDE 1 | Amazon | Chennai | 10 | [email protected] | Chennai |
11 | [email protected] | Lucknow | ||||
12 | [email protected] | Indore |
(24 rows)
SELECT job.id AS job_id, job.designation, job.company, job.city AS job_city,
candidate.id AS candidate_id, candidate.email AS candidate_email, candidate.city AS candidate_city
FROM candidate RIGHT OUTER JOIN job ON job.city = candidate.city;
job_id | designation | company | job_city | candidate_id | candidate_email | candidate_city |
---|---|---|---|---|---|---|
10 | TPM | Amazon | Chennai | 1 | [email protected] | Chennai |
5 | Engineering Manager | FunnelIQ | Chennai | 1 | [email protected] | Chennai |
1 | SDE 1 | Amazon | Chennai | 1 | [email protected] | Chennai |
9 | Principal Engineer | Grab | Bengaluru | 2 | [email protected] | Bengaluru |
7 | Product Manager | Uber | Bengaluru | 2 | [email protected] | Bengaluru |
4 | SDE | Microsoft | Bengaluru | 2 | [email protected] | Bengaluru |
2 | SDE 2 | Amazon | Bengaluru | 2 | [email protected] | Bengaluru |
10 | TPM | Amazon | Chennai | 3 | [email protected] | Chennai |
5 | Engineering Manager | FunnelIQ | Chennai | 3 | [email protected] | Chennai |
1 | SDE 1 | Amazon | Chennai | 3 | [email protected] | Chennai |
9 | Principal Engineer | Grab | Bengaluru | 6 | [email protected] | Bengaluru |
7 | Product Manager | Uber | Bengaluru | 6 | [email protected] | Bengaluru |
4 | SDE | Microsoft | Bengaluru | 6 | [email protected] | Bengaluru |
2 | SDE 2 | Amazon | Bengaluru | 6 | [email protected] | Bengaluru |
10 | TPM | Amazon | Chennai | 10 | [email protected] | Chennai |
5 | Engineering Manager | FunnelIQ | Chennai | 10 | [email protected] | Chennai |
1 | SDE 1 | Amazon | Chennai | 10 | [email protected] | Chennai |
11 | SDE2 | Grab | Singapore | |||
6 | SDE | Grab | Singapore | |||
13 | SDE1 | Adobe | Gurgaon | |||
8 | Engineering Lead | Adobe | Gurgaon | |||
12 | TPM | Amazon | Seattle | |||
3 | SDE3 | Amazon | Seattle |
(23 rows)
Finally we come to FULL JOIN. We can use this when the requirement combines both LEFT and RIGHT outer joins. That is we want a query that will "Pair candidates with matching jobs and list all pairs. If there is no matching job for a candidate, then pair it with an empty row on the job side and list that also. Similarly, ff there is no matching candidate for a job, then pair it with an empty row on the candidate side and list that also"
In other words - "Candidate and job matching pairs, matched on city + Jobs in cities without any available candidates + Candidates in cities without any available jobs"
We can see that this is basically the same as adding the INNER JOIN result to the results from the two "NOT IN" subqueries we wrote. So the final count would be 17 + 7 + 6 = 30
SELECT job.id AS job_id, job.designation, job.company, job.city AS job_city, candidate.id AS candidate_id,
candidate.email AS candidate_email, candidate.city AS candidate_city
FROM candidate FULL JOIN job ON job.city = candidate.city;
job_id | designation | company | job_city | candidate_id | candidate_email | candidate_city |
---|---|---|---|---|---|---|
10 | TPM | Amazon | Chennai | 1 | [email protected] | Chennai |
5 | Engineering Manager | FunnelIQ | Chennai | 1 | [email protected] | Chennai |
1 | SDE 1 | Amazon | Chennai | 1 | [email protected] | Chennai |
9 | Principal Engineer | Grab | Bengaluru | 2 | [email protected] | Bengaluru |
7 | Product Manager | Uber | Bengaluru | 2 | [email protected] | Bengaluru |
4 | SDE | Microsoft | Bengaluru | 2 | [email protected] | Bengaluru |
2 | SDE 2 | Amazon | Bengaluru | 2 | [email protected] | Bengaluru |
10 | TPM | Amazon | Chennai | 3 | [email protected] | Chennai |
5 | Engineering Manager | FunnelIQ | Chennai | 3 | [email protected] | Chennai |
1 | SDE 1 | Amazon | Chennai | 3 | [email protected] | Chennai |
4 | [email protected] | Pune | ||||
5 | [email protected] | Hyderabad | ||||
9 | Principal Engineer | Grab | Bengaluru | 6 | [email protected] | Bengaluru |
7 | Product Manager | Uber | Bengaluru | 6 | [email protected] | Bengaluru |
4 | SDE | Microsoft | Bengaluru | 6 | [email protected] | Bengaluru |
2 | SDE 2 | Amazon | Bengaluru | 6 | [email protected] | Bengaluru |
7 | [email protected] | Pune | ||||
8 | [email protected] | Trivandrum | ||||
9 | [email protected] | Trivandrum | ||||
10 | TPM | Amazon | Chennai | 10 | [email protected] | Chennai |
5 | Engineering Manager | FunnelIQ | Chennai | 10 | [email protected] | Chennai |
1 | SDE 1 | Amazon | Chennai | 10 | [email protected] | Chennai |
11 | [email protected] | Lucknow | ||||
12 | [email protected] | Indore | ||||
11 | SDE2 | Grab | Singapore | |||
6 | SDE | Grab | Singapore | |||
13 | SDE1 | Adobe | Gurgaon | |||
8 | Engineering Lead | Adobe | Gurgaon | |||
12 | TPM | Amazon | Seattle | |||
3 | SDE3 | Amazon | Seattle |
(30 rows)