First let's prepare tables and data:
CREATE TABLE T1 (A UNIQUEIDENTIFIER); CREATE TABLE T2 (A UNIQUEIDENTIFIER); ;WITH CTE AS ( SELECT A FROM (VALUES (1), (2), (3), (4), (5), (6), (7), (8)) X(A) ) INSERT INTO T1 SELECT NEWID() FROM CTE AS CTE1 CROSS JOIN CTE AS CTE2 CROSS JOIN CTE AS CTE3 CROSS JOIN CTE AS CTE4 CROSS JOIN CTE AS CTE5 CROSS JOIN CTE AS CTE6 CROSS JOIN CTE AS CTE7 CROSS JOIN CTE AS CTE8; INSERT INTO T2 SELECT TOP 5 PERCENT A FROM T1 UNION ALL SELECT TOP 95 PERCENT NULL FROM T1;
We have now two tables with similar number of rows. The difference is that the second one has mostly NULL values. Now let's try joining those two tables together:
SELECT COUNT(1) FROM T1 INNER JOIN T2 ON T1.A = T2.A;
It takes 6922 ms on my machine. If you look at the execution plan, you will notice that SQL Server reads the whole T2 table, does aggregate on it and then performs a hash join.
The aggregate is useful, because it removes all the duplicate NULL values. But we're doing a inner join. Which means we cannot join on NULL value. We could simply filter these out. Let's try this:
SELECT COUNT(1) FROM T1 INNER JOIN T2 ON T1.A = T2.A WHERE T2.A IS NOT NULL;
This takes 4625 ms and gives the same result. In the execution plan you can observe that the aggregate is gone and we have residual predicate on T2, which removes all the unnecessary NULL values.
In conclusion we see that SQL Server could not filter out NULL values early unless explicitly told to do so. I think the reason for this is that adding such filter is rarely beneficial. So no one coded this optimization into the query optimizer. I personally encounter queries like the one above very rarely. Like once a year maybe. You can even see that the set up for this to surface is very specific. For this reason it makes no sense to go and add to every query with inner join a NOT NULL filter. It's just something to keep in mind.
No comments:
Post a Comment