Sunday, April 12, 2020

Join on nullable column

There is an interesting behavior I noticed in SQL Server. Let me show it to you.

First let's prepare tables and data:

CREATE TABLE T1 (A UNIQUEIDENTIFIER);
CREATE TABLE T2 (A UNIQUEIDENTIFIER);

-- 8 values cross joined with themselves 8 times: 8^8 = 16,777,216 rows
;WITH CTE AS ( SELECT A FROM (VALUES (1), (2), (3), (4), (5), (6), (7), (8)) X(A) )
INSERT INTO T1
SELECT NEWID() FROM CTE AS CTE1
CROSS JOIN CTE AS CTE2
CROSS JOIN CTE AS CTE3
CROSS JOIN CTE AS CTE4
CROSS JOIN CTE AS CTE5
CROSS JOIN CTE AS CTE6
CROSS JOIN CTE AS CTE7
CROSS JOIN CTE AS CTE8;

-- Fill T2 with 5 percent real values from T1 and 95 percent NULLs
INSERT INTO T2
SELECT TOP 5 PERCENT A FROM T1
UNION ALL
SELECT TOP 95 PERCENT NULL FROM T1;
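
If you want to sanity check the setup, a quick count should confirm that T1 got 8^8 = 16,777,216 rows from all those cross joins, that T2 is roughly the same size, and that about 95 percent of T2.A is NULL:

-- Sanity check: row counts and the share of NULLs in T2.A
SELECT
    (SELECT COUNT(1) FROM T1) AS T1Rows,
    (SELECT COUNT(1) FROM T2) AS T2Rows,
    (SELECT COUNT(1) FROM T2 WHERE A IS NULL) AS T2NullRows;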

We now have two tables with a similar number of rows. The difference is that the second one contains mostly NULL values. Now let's try joining those two tables together:

SELECT COUNT(1)
FROM T1
INNER JOIN T2 ON T1.A = T2.A;

It takes 6922 ms on my machine. If you look at the execution plan, you will notice that SQL Server reads the whole T2 table, runs an aggregate on it and then performs a hash join.
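
If you want to reproduce the measurement yourself, you can turn on the time and IO statistics before running the query; the exact numbers will of course differ on your machine:

-- Show elapsed time and logical reads for the query (numbers vary by machine)
SET STATISTICS TIME ON;
SET STATISTICS IO ON;

SELECT COUNT(1)
FROM T1
INNER JOIN T2 ON T1.A = T2.A;

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;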


The aggregate is useful because it removes all the duplicate NULL values. But we're doing an inner join, which means NULL can never match anything. We could simply filter those rows out. Let's try this:

SELECT COUNT(1)
FROM T1
INNER JOIN T2 ON T1.A = T2.A
WHERE T2.A IS NOT NULL;

This takes 4625 ms and gives the same result. In the execution plan you can observe that the aggregate is gone and we have a residual predicate on T2, which removes all the unnecessary NULL values.

In conclusion, SQL Server does not filter out NULL values early unless explicitly told to do so. I think the reason is that adding such a filter is rarely beneficial, so no one coded this optimization into the query optimizer. I personally encounter queries like the one above very rarely, maybe once a year. You can also see that the setup needed for this to surface is very specific. For this reason it makes no sense to add a NOT NULL filter to every query with an inner join. It's just something to keep in mind.

Sunday, April 5, 2020

Filtering on one column and joining on another - estimate problem

Let me show you a case where SQL Server has problems with cardinality estimation and how to fix it.

First let us prepare tables and data:

CREATE TABLE T1 ( A INT, B INT)
CREATE TABLE T2 ( A INT, B INT)
GO

-- GO 1000 repeats the previous batch 1000 times (an SSMS / sqlcmd feature),
-- so T1 ends up with 1000 rows of (1, 1)
INSERT INTO T1 VALUES (1, 1)
GO 1000

-- change A to 2 for 100 of those rows to skew the data
UPDATE TOP (100) T1 SET A = 2

INSERT INTO T2 VALUES (1, 1), (2, 2)

UPDATE STATISTICS T1
UPDATE STATISTICS T2
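
To double check the setup, a quick look at the data should show 900 rows with A = 1 and 100 rows with A = 2 in T1, and just the two rows in T2:

-- Check the skew in T1.A and the contents of T2
SELECT A, COUNT(1) AS Cnt FROM T1 GROUP BY A;
SELECT * FROM T2;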


Now try the following query and examine the execution plan:

SELECT COUNT (1) FROM T1
INNER JOIN T2 ON T1.A = T2.A
WHERE T2.B = 1

The initial execution plan

As we can see, the estimates on the join are wrong. The reason is that we filter on column B in table T2, but join to the other table on column A. SQL Server has no idea which values of column A survive that filter, so it guesses an average. In this case the guess is far off, since the data in column A of table T1 is skewed.
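
As far as I can tell, that average comes from the density information in the statistics on T1.A. Assuming auto create statistics is enabled, SQL Server builds a system statistic on T1.A when the query runs, and you can find it and inspect its density vector like this:

-- Find the auto-created statistics on T1.A (name will look like _WA_Sys_...),
-- then feed that name to DBCC SHOW_STATISTICS to see its density vector.
SELECT s.name AS StatName
FROM sys.stats AS s
JOIN sys.stats_columns AS sc
    ON sc.object_id = s.object_id AND sc.stats_id = s.stats_id
JOIN sys.columns AS c
    ON c.object_id = sc.object_id AND c.column_id = sc.column_id
WHERE s.object_id = OBJECT_ID('T1') AND c.name = 'A';

-- DBCC SHOW_STATISTICS ('T1', <StatName>);  -- plug in the name returned above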

How do we fix this? As we can see, the problem is the wrong estimate. We need to help SQL Server here and create filtered statistics:

CREATE STATISTICS S1 ON T2(A) WHERE B = 1

These statistics are on the join column, and the filter matches the query's WHERE clause exactly. The execution plan estimates look much better now:
The better execution plan
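
If you're curious what the optimizer now has to work with, you can look at the filtered statistics directly:

-- The header shows the filter expression and the rows it covers,
-- the histogram shows the distribution of T2.A within that filter.
DBCC SHOW_STATISTICS ('T2', S1);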

In this simplistic example it doesn't matter much. Yet I have encountered cases where such a misestimate caused hash spills. In addition, if the query is executed often, you can get significant performance problems.

Now, this is in no way a universal fix. It works only if the filter on B doesn't change. If we change the WHERE clause to B = 2, the statistics will be wrong again, and we would need to create new statistics for the new WHERE clause. This is not scalable: it may be doable for 5 values, but not for 10,000. Still, this trick may come in handy from time to time.
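
Just to illustrate the scaling problem: covering one more value of B would already mean creating another statistics object:

-- Every distinct value of B used in a filter needs its own filtered statistics
CREATE STATISTICS S2 ON T2(A) WHERE B = 2;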