Sunday, April 5, 2020

Filtering on one column and joining on another - estimate problem

Let me show you a case where SQL Server has problems with cardinality estimation and how to fix it.

First, let's prepare the tables and data:

CREATE TABLE T1 (A INT, B INT)
CREATE TABLE T2 (A INT, B INT)
GO

-- insert 1000 identical rows; GO 1000 repeats the batch 1000 times
INSERT INTO T1 VALUES (1, 1)
GO 1000

-- skew the data: 100 rows get A = 2, the remaining 900 keep A = 1
UPDATE TOP (100) T1 SET A = 2

INSERT INTO T2 VALUES (1, 1), (2, 2)

UPDATE STATISTICS T1
UPDATE STATISTICS T2


Now try the following query and examine the execution plan:

SELECT COUNT(1) FROM T1
INNER JOIN T2 ON T1.A = T2.A
WHERE T2.B = 1

The initial execution plan

As we can see, the estimates on the join are wrong. The reason is that we filter on column B of table T2 but join on column A. SQL Server has no statistics that correlate B with A, so it has no idea which values of A will pass the filter and falls back to an average-based guess. In this case the guess is far off, because the data in column A of table T1 is skewed.
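To see where the average-based guess comes from, we can inspect the histogram on T1.A. Auto-created statistics get generated names, so this sketch creates a named statistics object first (the name SA is my own choice, not from the original setup):

-- create named statistics on T1(A) so we can inspect them by name
CREATE STATISTICS SA ON T1 (A)

-- show the histogram for T1.A
DBCC SHOW_STATISTICS ('T1', 'SA') WITH HISTOGRAM

The histogram reflects the skew we created: roughly 900 rows with A = 1 and 100 rows with A = 2. Without filtered statistics, the optimizer cannot connect this distribution to the filter on T2.B.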

How do we fix this? The problem is the wrong estimate, so we can help SQL Server by creating filtered statistics:

CREATE STATISTICS S1 ON T2(A) WHERE B = 1

These statistics are on the join column, and their filter matches the query's WHERE clause exactly. The estimates in the execution plan look much better now:
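We can confirm that the filtered statistics object was created, and see its predicate, by querying the sys.stats catalog view:

-- list statistics objects on T2 along with their filter predicates
SELECT name, has_filter, filter_definition
FROM sys.stats
WHERE object_id = OBJECT_ID('T2')

The row for S1 should show has_filter = 1 together with the filter predicate on B.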
The better execution plan

In this simplistic example it doesn't matter much. Yet I have encountered cases where such misestimates caused hash spills to tempdb; if the query runs often, that adds up to significant performance problems.

Now, this is in no way a universal fix. It works only as long as the filter on B doesn't change. If we change the WHERE clause to B = 2, the statistics no longer match and the estimates will be wrong again. We would need to create new filtered statistics for each new WHERE clause, which doesn't scale: it may be doable for 5 distinct values, but not for 10,000. Still, this trick can come in handy from time to time.
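For example, to also cover a query filtering on B = 2 we would need a second filtered statistics object (the name S2 is my own choice), and so on for every distinct value of B:

-- a separate filtered statistics object is needed for each filter value
CREATE STATISTICS S2 ON T2(A) WHERE B = 2

This is why the approach only pays off when the filtering predicate is stable and the set of values is small.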
