Correlated subqueries 🌐
Success
Correlated subqueries are a great way to "run a subquery" for each row in a table.
This can be slow if used inappropriately, but there are some great places to use this!
Warning
Correlated subqueries are advanced, and there aren't Excel equivalents for them.
Make sure that you are comfortable with the main concepts before diving into these advanced concepts.
Correlated subqueries "run a subquery" for each row in a table
We saw subqueries in the subqueries section, so what makes one "correlated"?
A subquery becomes a "correlated subquery" if it references a column from a table in the outer query.
The subquery is then "run" for each row in the referenced column.
This can take a bit of getting used to, but it's a powerful tool when used appropriately.
Converting joins into a correlated subquery
Warning
These are contrived examples to demonstrate the concept. You should stick to using JOIN for this kind of operation! 😝
Employee LEFT JOIN Department
To ease into the concept, we'll see how we could convert a JOIN into a correlated subquery.
We saw the following query in the Employee LEFT JOIN Department part of the joins section:
SELECT
Employee.EmployeeID,
Employee.EmployeeName,
Employee.DepartmentID,
Department.DepartmentName
FROM Employee
LEFT JOIN Department
ON Employee.DepartmentID = Department.DepartmentID
;
| EmployeeID | EmployeeName | DepartmentID | DepartmentName |
|---|---|---|---|
| 1 | Alice | 1 | Sales |
| 2 | Bob | 1 | Sales |
| 3 | Charlie | 2 | Marketing |
| 4 | Dave | 2 | Marketing |
| 5 | Eve | 3 | null |
In this example, we're finding the DepartmentName in the Department table that matches the DepartmentID for each employee in the Employee table.
We could convert this into a correlated subquery like so:
SELECT
EmployeeID,
EmployeeName,
DepartmentID,
(
SELECT DepartmentName
FROM Department
WHERE Department.DepartmentID = Employee.DepartmentID
) AS DepartmentName
FROM Employee
;
This isn't just a normal subquery: the WHERE clause references the Employee table which is outside the subquery!
Let's break down what's happening with each row in this table.
Employee 1
The first row is employee 1:
| EmployeeID | EmployeeName | DepartmentID |
|---|---|---|
| 1 | Alice | 1 |
Their DepartmentID is 1, so SQL will run the following subquery to find the DepartmentName:
SELECT DepartmentName
FROM Department
WHERE Department.DepartmentID = 1
;
| DepartmentName |
|---|
| Sales |
Therefore, the value Sales will be used for the DepartmentName for Alice.
Employee 2
The second row is employee 2:
| EmployeeID | EmployeeName | DepartmentID |
|---|---|---|
| 2 | Bob | 1 |
Their DepartmentID is also 1, so SQL will run the exact same subquery to find the DepartmentName; this means that Bob will also have Sales as their DepartmentName.
Employee 3
The third row is employee 3:
| EmployeeID | EmployeeName | DepartmentID |
|---|---|---|
| 3 | Charlie | 2 |
Their DepartmentID is 2, so SQL will run the following subquery to find the DepartmentName:
SELECT DepartmentName
FROM Department
WHERE Department.DepartmentID = 2
;
| DepartmentName |
|---|
| Marketing |
Therefore, the value Marketing will be used for the DepartmentName for Charlie.
Employee 4
The fourth row is employee 4:
| EmployeeID | EmployeeName | DepartmentID |
|---|---|---|
| 4 | Dave | 2 |
Their DepartmentID is also 2, so SQL will run the exact same subquery to find the DepartmentName; this means that Dave will also have Marketing as their DepartmentName.
Employee 5
The fifth row is employee 5:
| EmployeeID | EmployeeName | DepartmentID |
|---|---|---|
| 5 | Eve | 3 |
Their DepartmentID is 3, so SQL will run the following subquery to find the DepartmentName:
SELECT DepartmentName
FROM Department
WHERE Department.DepartmentID = 3
;
This subquery will return no rows, so the NULL value will be used for the DepartmentName for Eve.
Putting it all together
When we put all of these rows together, we get the following result:
| EmployeeID | EmployeeName | DepartmentID | DepartmentName |
|---|---|---|---|
| 1 | Alice | 1 | Sales |
| 2 | Bob | 1 | Sales |
| 3 | Charlie | 2 | Marketing |
| 4 | Dave | 2 | Marketing |
| 5 | Eve | 3 | null |
This is indeed identical to the result we got from the LEFT JOIN query!
What about the other join examples?
We saw two other join examples in the joins section:
- An inner join: Employee INNER JOIN Department
- A left join with multiple matches: Employee LEFT JOIN Address
Do you think we could convert these into correlated subqueries, too?
Employee INNER JOIN Department
If we try to convert the inner join into a correlated subquery, we get exactly the same result as the LEFT JOIN example unless we explicitly filter out the NULL values after the subquery.
This means that correlated subqueries are more like LEFT JOIN than INNER JOIN in this context.
Employee LEFT JOIN Address
If we try to convert the left join with multiple matches into a correlated subquery, we'd end up with an error.
This is because the subquery would return two rows for Eve, and subqueries used to define columns are only allowed to return a single value.
This is a good example of why correlated subqueries are less appropriate for joins!
Using correlated subqueries for filtering
The examples above were contrived to demonstrate the concept, but they're not the best use cases for correlated subqueries.
The most common place that you'll see correlated subqueries being used is in the WHERE clause; these usually come in two forms:
- Relativity tests
- Existence tests
Using correlated subqueries for relativity tests
It's typical to see correlated subqueries to be used to get the latest position of a record in a table that holds historical data.
For example, the HumanResources.EmployeeDepartmentHistory table holds the history of an employee's department. Let's take a look at the rows corresponding to employees 4, 16, 224, 234, and 250:
SELECT
BusinessEntityID,
DepartmentID,
StartDate,
EndDate
FROM HumanResources.EmployeeDepartmentHistory
WHERE BusinessEntityID IN (4, 16, 224, 234, 250)
ORDER BY BusinessEntityID, StartDate
;
| BusinessEntityID | DepartmentID | StartDate | EndDate |
|---|---|---|---|
| 4 | 1 | 2007-12-05 | 2010-05-30 |
| 4 | 2 | 2010-05-31 | null |
| 16 | 5 | 2007-12-20 | 2009-07-14 |
| 16 | 4 | 2009-07-15 | null |
| 224 | 7 | 2009-01-07 | 2011-08-31 |
| 224 | 8 | 2011-09-01 | null |
| 234 | 10 | 2009-01-31 | 2013-11-13 |
| 234 | 16 | 2013-11-14 | null |
| 250 | 4 | 2011-02-25 | 2011-07-30 |
| 250 | 13 | 2011-07-31 | 2012-07-14 |
| 250 | 5 | 2012-07-15 | null |
This table is nice for finding the latest department for each employee because we can just filter the rows where EndDate is NULL, but what if we didn't have the EndDate column and had to rely on just the StartDate?
We could use a correlated subquery to find the latest StartDate for each employee, and then use that to filter the rows:
SELECT
BusinessEntityID,
DepartmentID,
StartDate
FROM HumanResources.EmployeeDepartmentHistory AS History
WHERE BusinessEntityID IN (4, 16, 224, 234, 250)
AND StartDate = (
SELECT MAX(StartDate)
FROM HumanResources.EmployeeDepartmentHistory AS InnerHistory
WHERE InnerHistory.BusinessEntityID = History.BusinessEntityID
)
;
This is a bit more complex than the previous examples!
Not only is the subquery correlated, but it's also using an aggregate function (MAX) to find the latest StartDate for each employee.
Let's break down what's happening with each row in this table.
Employee 4
The first employee is 4, and they have two rows:
| BusinessEntityID | DepartmentID | StartDate |
|---|---|---|
| 4 | 1 | 2007-12-05 |
| 4 | 2 | 2010-05-31 |
We know that we want to keep the row with the latest StartDate.
A correlated subquery is run for each row in the EmployeeDepartmentHistory table, so the subquery is run twice for employee 4.
The subquery that is run for this employee (for both their rows) is:
SELECT MAX(StartDate)
FROM HumanResources.EmployeeDepartmentHistory AS InnerHistory
WHERE InnerHistory.BusinessEntityID = 4
;
| MAX(StartDate) |
|---|
| 2010-05-31 |
When the subquery is run for the first row, the StartDate is 2007-12-05. This is not the result of the subquery, so the row is discarded.
Conversely, when the subquery is run for the second row, the StartDate is 2010-05-31 so this second row is kept.
Employee 16
The second employee is 16:
| BusinessEntityID | DepartmentID | StartDate |
|---|---|---|
| 16 | 5 | 2007-12-20 |
| 16 | 4 | 2009-07-15 |
Similar to employee 4, a correlated subquery is run for each row in the EmployeeDepartmentHistory table, so the subquery is also run twice for employee 16.
The subquery that is run for this employee (for both their rows) is:
SELECT MAX(StartDate)
FROM HumanResources.EmployeeDepartmentHistory AS InnerHistory
WHERE InnerHistory.BusinessEntityID = 16
;
| MAX(StartDate) |
|---|
| 2009-07-15 |
When the subquery is run for the first row, the StartDate is 2007-12-20. This is not the result of the subquery, so the row is discarded.
Conversely, when the subquery is run for the second row, the StartDate is 2010-05-31 so this second row is kept.
Employees 224 and 234
Like the previous employees, employees 224 and 234 each have two rows, so the subquery is run twice for each of them too.
In both cases, the second row per employee is kept because it has the latest StartDate.
Employee 250
The fifth employee is 250:
| BusinessEntityID | DepartmentID | StartDate |
|---|---|---|
| 250 | 4 | 2011-02-25 |
| 250 | 13 | 2011-07-31 |
| 250 | 5 | 2012-07-15 |
This employee has three rows, so the subquery is run three times for them.
The subquery that is run for this employee (for all three of their rows) is:
SELECT MAX(StartDate)
FROM HumanResources.EmployeeDepartmentHistory AS InnerHistory
WHERE InnerHistory.BusinessEntityID = 250
;
| MAX(StartDate) |
|---|
| 2012-07-15 |
When the subquery is run for the first row, the StartDate is 2011-02-25. This is not the result of the subquery, so the row is discarded.
When the subquery is run for the second row, the StartDate is 2011-07-31. Again, this is not the result of the subquery, so the row is discarded.
Finally, when the subquery is run for the third row, the StartDate is 2012-07-15 so this third row is kept.
Putting it all together
When we put all of these rows together, we get the following result:
| BusinessEntityID | DepartmentID | StartDate |
|---|---|---|
| 4 | 2 | 2010-05-31 |
| 16 | 4 | 2009-07-15 |
| 224 | 8 | 2011-09-01 |
| 234 | 16 | 2013-11-14 |
| 250 | 5 | 2012-07-15 |
Tip
There are other ways (not using correlated subqueries) to achieve this result, too. Can you think of any?
Using correlated subqueries for existence tests
Another common place that you'll see correlated subqueries being used is to check whether a record exists.
In this case, the EXISTS keyword is used with the subquery to tell SQL to do the existence test.
For example, we could check which people in the Person.Person table have a record in the HumanResources.JobCandidate table:
SELECT
BusinessEntityID,
FirstName,
LastName
FROM Person.Person
WHERE EXISTS (
SELECT *
FROM HumanResources.JobCandidate
WHERE JobCandidate.BusinessEntityID = Person.BusinessEntityID
)
;
| BusinessEntityID | FirstName | LastName |
|---|---|---|
| 212 | Peng | Wu |
| 274 | Stephen | Jiang |
As per the Microsoft documentation:
Notice that subqueries that are introduced with
EXISTSare a bit different from other subqueries in the following ways:
- The keyword
EXISTSisn't preceded by a column name, constant, or other expression.- The select list of a subquery introduced by
EXISTSalmost always consists of an asterisk (*). There's no reason to list column names because you're just testing whether rows that meet the conditions specified in the subquery exist.The
EXISTSkeyword is important because frequently there's no alternative formulation without subqueries. Although some queries that are created withEXISTScan't be expressed any other way, many queries can useINor a comparison operator modified byANYorALLto achieve similar results.
Our example is, in fact, an example of a query that can be expressed using IN instead of EXISTS:
SELECT
BusinessEntityID,
FirstName,
LastName
FROM Person.Person
WHERE BusinessEntityID IN (
SELECT BusinessEntityID
FROM HumanResources.JobCandidate
)
;
So, what's an example of a query that can't be expressed using IN instead of EXISTS?
It's rare, but you'd need EXISTS over IN to evaluate more complex conditions that can't be boiled down to an equality test.
For example, the following query can't be expressed using IN:
SELECT
BusinessEntityID,
FirstName,
LastName,
ModifiedDate
FROM Person.Person
WHERE EXISTS (
SELECT *
FROM HumanResources.JobCandidate
WHERE JobCandidate.BusinessEntityID = Person.BusinessEntityID
AND JobCandidate.ModifiedDate > Person.ModifiedDate
)
;
This query checks for people who have a record in the HumanResources.JobCandidate table that was modified more recently than their record in the Person.Person table.
This can't be expressed using IN because of the JobCandidate.ModifiedDate > Person.ModifiedDate condition.
This could, however, be expressed using a LEFT JOIN and a WHERE clause:
SELECT
Person.BusinessEntityID,
Person.FirstName,
Person.LastName,
Person.ModifiedDate
FROM Person.Person
LEFT JOIN HumanResources.JobCandidate
ON JobCandidate.BusinessEntityID = Person.BusinessEntityID
WHERE JobCandidate.BusinessEntityID IS NOT NULL
AND JobCandidate.ModifiedDate > Person.ModifiedDate
;
Like with many SQL problems, the approach that you take will depend on the context, the performance of the approaches, and the readability of the code.
In this particular case, the LEFT JOIN is more performant than the EXISTS approach (on my personal computer, averaging 100,000 runs):
- The
LEFT JOINapproach takes approximately 0.88 milliseconds to run (44%) - The
EXISTSapproach takes approximately 1.11 milliseconds to run (56%)
However, the EXISTS approach is slightly more readable (in my opinion), so both approached have pros and cons.
Tip
You'll encounter several ways to achieve the same output in SQL. There's rarely a "right" way to do something, so to figure out which approach you take, you should consider several things like:
- The performance of the code
- The readability of the code
- The extensibility of the code
- The preferences of your team/company
- The tools that you have available to you
This is not a complete list and there are many other things to consider, but it's a good starting point.
A quick note for the programmers
Warning
If you've never done any programming before, skip this bit!
If you've used other programming languages, you might be thinking that correlated subqueries are similar to loops. You'd be right!
Correlated subqueries can be thought of as for-loops that run for each row in a table.
This is why correlated subqueries can be slow if used inappropriately: they can run a lot of times!
An extra example to get the brain juices flowing
This example is taken from the following article:
Suppose we have the following tables:
aa
| a |
|---|
| 1 |
| 2 |
| 3 |
xx
| x |
|---|
| 10 |
| 20 |
| 30 |
What do you think the results of the following queries are?
SELECT (SELECT TOP 1 SUM(1) FROM xx) FROM aa;
SELECT (SELECT TOP 1 SUM(a) FROM xx) FROM aa;
SELECT (SELECT TOP 1 SUM(x) FROM xx) FROM aa;
SELECT (SELECT TOP 1 SUM(a + x) FROM xx) FROM aa;
Here are the results:
Expand to show the results
SELECT (SELECT TOP 1 SUM(1) FROM xx) FROM aa |
|---|
| 3 |
| 3 |
| 3 |
SELECT (SELECT TOP 1 SUM(a) FROM xx) FROM aa |
|---|
| 6 |
SELECT (SELECT TOP 1 SUM(x) FROM xx) FROM aa |
|---|
| 60 |
| 60 |
| 60 |
SELECT (SELECT TOP 1 SUM(a + x) FROM xx) FROM aa |
|---|
| 63 |
| 66 |
| 69 |
Weird, right? 🤯
See if you can wrap your head around what's going on here, and then check out the article linked above for a great explanation of what's happening.
The SQL for running this example
This data is not in the AdventureWorks database, so you can run this example using the SQL below (uncomment the line you want to run):
WITH
aa AS (
SELECT *
FROM (VALUES (1), (2), (3)) AS t(a)
),
xx AS (
SELECT *
FROM (VALUES (10), (20), (30)) AS t(x)
)
SELECT (SELECT TOP 1 SUM(1) FROM xx) FROM aa
-- SELECT (SELECT TOP 1 SUM(a) FROM xx) FROM aa
-- SELECT (SELECT TOP 1 SUM(x) FROM xx) FROM aa
-- SELECT (SELECT TOP 1 SUM(a + x) FROM xx) FROM aa
;
Further reading
Check out the official Microsoft documentation for more information on correlated subqueries at:
Similarity to CROSS APPLY
There is a similarity between correlated subqueries and CROSS APPLY. This is outside the scope of this course, but you can read more about it at: