Monday, November 4, 2019

Tracking query timeouts with extended events

Timeouts are something of a blind spot for a performance tuner. They don't show up in the DMVs the way other queries do; you learn about them either when someone points them out to you or from application logs, if you have such logging in place. For this reason I wanted an independent way of tracking timeouts. After quite exhaustive research I figured out how to do it with Extended Events. It turns out that when the application cancels a request, the request completes with a result of 2. Theoretically you could use the attention event instead, but I found it reported timeouts on statements which, from the application's point of view, hadn't timed out. Here's the event definition:

CREATE EVENT SESSION [QueryTimeouts] ON SERVER 
ADD EVENT sqlserver.rpc_completed(SET collect_statement=(1)
    ACTION(sqlserver.database_name,sqlserver.username)
    WHERE ([result]=(2))),
ADD EVENT sqlserver.sql_batch_completed(
    ACTION(sqlserver.database_name,sqlserver.username)
    WHERE ([result]=(2)))
ADD TARGET package0.asynchronous_file_target
    (SET filename = 'G:\QueryTimeouts.xel', max_file_size = 5, max_rollover_files = 2)
WITH (MAX_MEMORY=4096 KB,EVENT_RETENTION_MODE=ALLOW_SINGLE_EVENT_LOSS,MAX_DISPATCH_LATENCY=30 SECONDS)
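
Note that CREATE EVENT SESSION only defines the session; it has to be started before anything lands in the target file:

ALTER EVENT SESSION [QueryTimeouts] ON SERVER STATE = START;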

The data can be extracted with the following query:

DECLARE @filename NVARCHAR(max)
SELECT @filename = REPLACE(CAST(sesf.value AS NVARCHAR(MAX)), '.xel', '') 
FROM sys.server_event_session_fields sesf
JOIN sys.server_event_sessions ses ON ses.event_session_id = sesf.event_session_id
WHERE sesf.name = 'filename'
AND ses.name = 'QueryTimeouts'

;WITH cte
AS (SELECT CONVERT(XML, event_data) ed
    FROM sys.fn_xe_file_target_read_file(@filename + '*.xel', NULL, NULL, NULL) )
SELECT
	DATEADD(mi, DATEDIFF(mi, GETUTCDATE(), GETDATE()), ed.value('(event/@timestamp)[1]', 'datetime2(3)')) AS [timestamp],
	ed.value('(event/action[@name="database_name"]/value)[1]', 'nvarchar(128)') AS [database_name],
	ed.value('(event/action[@name="username"]/value)[1]', 'nvarchar(128)') AS username,
	ed.value('(event/data[@name="duration"]/value)[1]', 'bigint')/1000 AS duration,
	ed.value('(event/data[@name="cpu_time"]/value)[1]', 'bigint')/1000 AS cpu_time,
	ed.value('(event/data[@name="logical_reads"]/value)[1]', 'bigint') AS logical_reads,
	ISNULL(ed.value('(event/data[@name="spills"]/value)[1]', 'bigint'),0) AS spills,
	ISNULL(ed.value('(event/data[@name="batch_text"]/value)[1]', 'nvarchar(max)'),ed.value('(event/data[@name="statement"]/value)[1]', 'nvarchar(max)')) AS text
FROM cte
ORDER BY timestamp;

There are a few gotchas when using this event. First, you want to filter out non-application users, like yourself; you'd be surprised how often you cancel your own queries. Second, some applications are not written very well: they run a SELECT without a TOP clause and simply cancel the query after receiving some number of rows. It's better to filter those out as well, either in the event definition itself or when extracting the data, as sketched below.
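
For example, the rpc_completed event from the session above could get an extended predicate along these lines (both login names are placeholders for your own account and the cancel-happy application user):

ADD EVENT sqlserver.rpc_completed(SET collect_statement=(1)
    ACTION(sqlserver.database_name,sqlserver.username)
    WHERE ([result]=(2)
        AND [sqlserver].[username]<>N'DOMAIN\MyAccount'     -- placeholder: your own login
        AND [sqlserver].[username]<>N'CancelHappyAppUser'))  -- placeholder: the offending application login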

When analyzing the results I compare duration and CPU time. If they are similar, the query is probably just too slow. If the numbers are something like 30 seconds of duration but 1 second of CPU, the query was most likely blocked by another one, and you need other methods to identify the offending query. I also check spills; they can be slow, so fixing them may speed the query up. Note that, for some reason, spills are only reported for batches.
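
If you dump the output of the extraction query into a temp table (called #timeouts here purely for illustration), the same triage can be expressed in SQL:

-- duration and cpu_time are the millisecond columns computed by the query above;
-- when elapsed time dwarfs CPU time, blocking is the usual suspect.
SELECT *,
    CASE WHEN cpu_time * 10 < duration THEN 'probably blocked'
         ELSE 'probably just slow'
    END AS verdict
FROM #timeouts
ORDER BY duration DESC;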

Thursday, October 10, 2019

Search plan cache for index

Quite often I want to see where a specific index is used. I started with the script from Jonathan Kehayias. Sadly, his script took around 10 minutes on a 15 GB plan cache and produced many duplicate entries, so I tweaked it to my needs. My version groups results by plan hash, which helps if you have unparameterized queries. I also switched to exist() when searching the XML. Finally, I use both dm_exec_query_plan and dm_exec_text_query_plan; for some reason, during my testing, that turned out quicker than using only one of them. dm_exec_text_query_plan also lets you limit the execution plan to just the statement that uses the index of interest, in case a batch or stored procedure has more than one statement. That way you don't need to analyze the whole thing to find the relevant part.

Here's the code:

SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
DECLARE @IndexName AS NVARCHAR(128) = N'YourIndexName';

--Handle the case where the left or right was quoted manually but not the opposite side 
IF LEFT(@IndexName, 1) <> '[' SET @IndexName = '[' + @IndexName;
IF RIGHT(@IndexName, 1) <> ']' SET @IndexName = @IndexName + ']';

;WITH XMLNAMESPACES (DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
, cte AS
	(SELECT plan_handle, query_plan_hash, statement_start_offset, statement_end_offset,
		MIN(creation_time) OVER(PARTITION BY qs.query_plan_hash ORDER BY qs.query_plan_hash) creation_time,
		MAX(last_execution_time) OVER(PARTITION BY qs.query_plan_hash ORDER BY qs.query_plan_hash) last_execution_time,
        SUM(total_worker_time) OVER(PARTITION BY qs.query_plan_hash ORDER BY qs.query_plan_hash) total_worker_time,
		SUM(total_logical_reads) OVER(PARTITION BY qs.query_plan_hash ORDER BY qs.query_plan_hash) total_logical_reads,
		SUM(execution_count) OVER(PARTITION BY qs.query_plan_hash ORDER BY qs.query_plan_hash) execution_count,
		ROW_NUMBER() OVER(PARTITION BY qs.query_plan_hash ORDER BY qs.query_plan_hash) row_num
    FROM sys.dm_exec_query_stats AS qs
   )
, cte2 AS
	(SELECT * FROM cte
	OUTER APPLY sys.dm_exec_query_plan(cte.plan_handle) AS qp
	WHERE (query_plan.exist('//IndexScan[not(@Lookup)]/Object[@Index=sql:variable("@IndexName")]') = 1 OR query_plan IS NULL) AND  cte.row_num = 1
	)
, cte3 AS
	(SELECT ISNULL(
			SUBSTRING(qt.text, (cte2.statement_start_offset / 2) + 1, ((CASE cte2.statement_end_offset WHEN -1 THEN DATALENGTH(qt.text) ELSE cte2.statement_end_offset
						END - cte2.statement_start_offset) / 2) + 1), qt.text) AS TSQL,
		TRY_CAST(qp.query_plan AS XML) query_plan, cte2.creation_time, cte2.last_execution_time,
		cte2.total_worker_time, cte2.total_logical_reads, cte2.execution_count
    FROM cte2
    CROSS APPLY sys.dm_exec_text_query_plan(cte2.plan_handle, cte2.statement_start_offset, cte2.statement_end_offset) AS qp
    CROSS APPLY sys.dm_exec_sql_text(cte2.plan_handle) AS qt)
SELECT *
FROM cte3
WHERE query_plan.exist('//IndexScan[not(@Lookup)]/Object[@Index=sql:variable("@IndexName")]') = 1
ORDER BY cte3.execution_count DESC
OPTION (RECOMPILE);

I admit it's quite messy, but it gets me relatively clean data and runs in 2 minutes, which is quite a nice improvement over the original.

Before using this script, three warnings:
  1. Remember that the plan cache doesn't necessarily hold all execution plans, especially after an instance restart.
  2. This script won't catch index usage if the plan is not parseable as XML.
  3. Considering the above, be careful when performing index maintenance based on the script's results (see the cross-check sketched below).
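
Regarding the third warning, a cheap cross-check before dropping or disabling anything is the cumulative usage counters, which don't depend on the plan being cached or parseable as XML. A minimal sketch, run in the database that owns the index (the index name is the same placeholder as @IndexName above):

SELECT OBJECT_NAME(ius.object_id) AS table_name,
    i.name AS index_name,
    ius.user_seeks, ius.user_scans, ius.user_lookups, ius.user_updates,
    ius.last_user_seek, ius.last_user_scan
FROM sys.dm_db_index_usage_stats AS ius
JOIN sys.indexes AS i
    ON i.object_id = ius.object_id
    AND i.index_id = ius.index_id
WHERE ius.database_id = DB_ID()
AND i.name = N'YourIndexName';   -- same placeholder index name as above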

Sunday, September 22, 2019

Update on a view with NULL column

Let's set up an experiment:

CREATE TABLE T 
(
	c1 INT,
	c2 DATETIME2
);

GO

INSERT INTO T
VALUES (1, GETUTCDATE());

GO

CREATE VIEW V AS
SELECT c1, NULL as c2 FROM T;

GO

CREATE TRIGGER VT on V 
INSTEAD OF UPDATE AS

	UPDATE V SET c1 = 1
	FROM V JOIN inserted AS i ON V.c1 = i.c1

GO

Now assume you want to run an update statement through the view and, for some reason, that statement updates column c2 (yes, I agree this isn't a very sensible thing to do).

UPDATE V SET c2 = GETUTCDATE()
WHERE c1 = 1

This fails with the error:
Implicit conversion from data type datetime to int is not allowed. Use the CONVERT function to run this query.

How to work around this? Well, the solution is suggested by the error message: it seems an untyped NULL is interpreted as an integer by default. You need to cast the NULL in the view to the proper data type:

ALTER VIEW V AS
SELECT c1, CAST(NULL AS DATETIME2) AS c2
FROM T

Now the above update will be successful.
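
If you want to see for yourself that an untyped NULL defaults to int, SELECT ... INTO has to pick a concrete type for such a column (the table name is just a throwaway for the demonstration):

SELECT NULL AS c2 INTO dbo.ThrowawayTable;   -- the NULL constant becomes an int column
EXEC sp_help 'dbo.ThrowawayTable';           -- the column listing shows c2 as int
DROP TABLE dbo.ThrowawayTable;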

Normally you shouldn't have constant columns in a view that your code tries to update or insert into. Yet I ran into exactly that situation, and I'm sure it can happen to others. That's why I wanted to document this peculiarity.

Sunday, August 25, 2019

Who is doing all those batch requests anyway?

Let's say you have an instance which normally handles 1000 batch requests/s, and suddenly it's doing 10 times as many. It doesn't even need to affect performance in a visible way, yet it's a sign that something is going wrong somewhere. If you have many applications, each connecting through a different user, you can try to pinpoint which one it is. To do this you can set up an extended event session like this:

CREATE EVENT SESSION NumberOfStatementsByUser ON SERVER 
ADD EVENT sqlserver.sql_batch_completed(
    ACTION(sqlserver.username))
, ADD EVENT sqlserver.rpc_completed(
    ACTION(sqlserver.username))
ADD TARGET package0.histogram(SET source=N'sqlserver.username')
WITH (MAX_MEMORY=4096 KB, EVENT_RETENTION_MODE=ALLOW_SINGLE_EVENT_LOSS, 
        MAX_DISPATCH_LATENCY=30 SECONDS,MAX_EVENT_SIZE=0 KB,
        MEMORY_PARTITION_MODE=NONE,TRACK_CAUSALITY=OFF, STARTUP_STATE=OFF)

After starting it, you can check which user issues the most requests with the following query:

;WITH
	sql_batch_completed_by_database AS ( 
        SELECT CAST(xet.target_data AS xml) AS event_data
        FROM sys.dm_xe_session_targets AS xet  
        JOIN sys.dm_xe_sessions AS xe  
            ON (xe.address = xet.event_session_address)  
        WHERE xe.name = 'NumberOfStatementsByUser'
    )
    , requests_by_user AS (
        SELECT
              slot.value('(./value)[1]', 'sysname') AS Username
            , slot.value('@count', 'bigint') AS RequestCount
        FROM (SELECT event_data FROM sql_batch_completed_by_database) AS xe(event_data)
        CROSS APPLY event_data.nodes('/HistogramTarget/Slot') AS histogram_target(slot)
    )
SELECT Username, SUM(RequestCount) AS RequestCount
FROM requests_by_user
GROUP BY Username;

Personally, I start this extended event only when I see a problem, and only for a minute or so; that's enough to identify the troublesome user. Once I know who it is, I usually set up a trace just for that user to see what queries they run, and I stop the trace as soon as possible. The profiler is quirky and usually runs for a few seconds before it can be stopped, which is enough to capture thousands of events. In my experience such spikes happen when the executed code goes into an infinite loop or the user is missing permissions (yes, this has happened). Having the actual queries should be sufficient information to hand over to the people responsible for the application, so they can start troubleshooting their code.
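
For reference, the start-and-stop dance looks like this (the one-minute delay is just an example of "a minute or so"):

ALTER EVENT SESSION NumberOfStatementsByUser ON SERVER STATE = START;
WAITFOR DELAY '00:01:00';   -- let it collect for a minute
-- run the histogram query above, then shut the session down
ALTER EVENT SESSION NumberOfStatementsByUser ON SERVER STATE = STOP;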

Tuesday, July 9, 2019

Finding failing queries in SQL Server

Sometimes you know there is a query that fails with an error, but you don't know what it looks like; for example, the application doesn't log it, or the query is a dynamic one. To track these you can use an Extended Event session like this:

CREATE EVENT SESSION FailingQueries
ON SERVER
    ADD EVENT sqlserver.error_reported
    (ACTION (sqlserver.sql_text,
             sqlserver.database_id,
             sqlserver.username)
     WHERE (   [severity] > 10
               AND error_number <> 942 /*Database '%.*ls' cannot be opened because it is offline.*/
               AND error_number <> 916 /*The server principal "%.*ls" is not able to access the database "%.*ls" under the current security context.*/
               AND error_number <> 18456 /*Login failed for user '%.*ls'.%.*ls%.*ls*/
               AND error_number <> 17830 /*Network error code 0x2746 occurred while establishing a connection; the connection has been closed.*/
               AND error_number <> 4060 /*Cannot open database  requested by the login. The login failed.*/
     ))
    ADD TARGET package0.asynchronous_file_target
    (SET filename = 'G:\FailingQueries.xel', max_file_size = 5, max_rollover_files = 2)
WITH (MAX_DISPATCH_LATENCY = 5 SECONDS);
GO

(Don't forget to set your own path)

This session captures errors with severity higher than 10 (anything below is informational) and also filters out the error numbers I found useless and quite chatty in practice. For example, if a developer connects with SSMS to an instance where they don't have access to all databases, SSMS will try to open those databases anyway.
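
If another chatty error number turns up in your environment, the predicate can't be edited in place; you drop the event and re-add it with the extended filter (12345 below is a placeholder, not a real error number):

ALTER EVENT SESSION FailingQueries ON SERVER
    DROP EVENT sqlserver.error_reported;

ALTER EVENT SESSION FailingQueries ON SERVER
    ADD EVENT sqlserver.error_reported
    (ACTION (sqlserver.sql_text,
             sqlserver.database_id,
             sqlserver.username)
     WHERE (   [severity] > 10
               AND error_number <> 942
               AND error_number <> 916
               AND error_number <> 18456
               AND error_number <> 17830
               AND error_number <> 4060
               AND error_number <> 12345 /*placeholder for the extra error to exclude*/
     ));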

You can check the captured data with the following SQL:

DECLARE @filename NVARCHAR(max)
SELECT @filename = REPLACE(CAST(sesf.value AS NVARCHAR(MAX)), '.xel', '') 
FROM sys.server_event_session_fields sesf
JOIN sys.server_event_sessions ses ON ses.event_session_id = sesf.event_session_id
WHERE sesf.name = 'filename'
AND ses.name = 'FailingQueries'

;WITH cte
AS (SELECT CONVERT(XML, event_data) ed
    FROM sys.fn_xe_file_target_read_file(@filename + '*.xel', NULL, NULL, NULL) ),
shredded_data AS (
SELECT DATEADD(
           mi, DATEDIFF(mi, GETUTCDATE(), GETDATE()), ed.value('(event/@timestamp)[1]', 'datetime2(3)')) AS [timestamp],
       DB_NAME(ed.value('(event/action[@name="database_id"]/value)[1]', 'nvarchar(128)')) AS database_name,
       ed.value('(event/data[@name="severity"]/value)[1]', 'tinyint') AS [err_severity],
       ed.value('(event/data[@name="error_number"]/value)[1]', 'int') AS [err_number],
       ed.value('(event/data[@name="message"]/value)[1]', 'nvarchar(512)') AS [err_message],
       ed.value('(event/action[@name="sql_text"]/value)[1]', 'nvarchar(max)') AS [sql_text],
       ed.value('(event/action[@name="username"]/value)[1]', 'nvarchar(128)') AS username
FROM cte)
SELECT * FROM shredded_data sd
--WHERE sd.database_name = 'YourDatabase'
ORDER BY timestamp;

I like to have this Extended Event session enabled the whole time on development servers. That way, if developers come to me saying they get some errors but don't know what SQL is executed, I can just check the event data. I've also used a similar one in production.
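
To keep it running on those servers across restarts, two ALTERs are enough (the CREATE above leaves STARTUP_STATE at its default of OFF):

ALTER EVENT SESSION FailingQueries ON SERVER WITH (STARTUP_STATE = ON);
ALTER EVENT SESSION FailingQueries ON SERVER STATE = START;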