Monday, November 4, 2019

Tracking query timeouts with extended events

Timeouts are kind of a blind spot for performance tuner. They don't appear in any DMVs like other queries. You get to know about them either when someone points them out for you or from application logs, if you have such logging in place. For this reason I wanted independent way of tracking timeouts. After quite exhaustive research I figured out how to do this with Extended Events. Turns out that when application cancels request, it's result is 2. Theoretically you could use attention event, but I found it reported timeouts on statements, which didn't from application point of view. Here's the event definition:

CREATE EVENT SESSION [QueryTimeouts] ON SERVER 
ADD EVENT sqlserver.rpc_completed(SET collect_statement=(1)
    ACTION(sqlserver.database_name,sqlserver.username)
    WHERE ([result]=(2))),
ADD EVENT sqlserver.sql_batch_completed(
    ACTION(sqlserver.database_name,sqlserver.username)
    WHERE ([result]=(2)))
ADD TARGET package0.asynchronous_file_target
    (SET filename = 'G:\QueryTimeouts.xel', max_file_size = 5, max_rollover_files = 2)
WITH (MAX_MEMORY=4096 KB,EVENT_RETENTION_MODE=ALLOW_SINGLE_EVENT_LOSS,MAX_DISPATCH_LATENCY=30 SECONDS)

The data can be extracted with following query:

DECLARE @filename NVARCHAR(max)
SELECT @filename = REPLACE(CAST(sesf.value AS NVARCHAR(MAX)), '.xel', '') 
FROM sys.server_event_session_fields sesf
JOIN sys.server_event_sessions ses ON ses.event_session_id = sesf.event_session_id
WHERE sesf.name = 'filename'
AND ses.name = 'QueryTimeouts'

;WITH cte
AS (SELECT CONVERT(XML, event_data) ed
    FROM sys.fn_xe_file_target_read_file(@filename + '*.xel', NULL, NULL, NULL) )
SELECT
	DATEADD(mi, DATEDIFF(mi, GETUTCDATE(), GETDATE()), ed.value('(event/@timestamp)[1]', 'datetime2(3)')) AS [timestamp],
	ed.value('(event/action[@name="database_name"]/value)[1]', 'nvarchar(128)') AS [database_name],
	ed.value('(event/action[@name="username"]/value)[1]', 'nvarchar(128)') AS username,
	ed.value('(event/data[@name="duration"]/value)[1]', 'bigint')/1000 AS duration,
	ed.value('(event/data[@name="cpu_time"]/value)[1]', 'bigint')/1000 AS cpu_time,
	ed.value('(event/data[@name="logical_reads"]/value)[1]', 'bigint') AS logical_reads,
	ISNULL(ed.value('(event/data[@name="spills"]/value)[1]', 'bigint'),0) AS spills,
	ISNULL(ed.value('(event/data[@name="batch_text"]/value)[1]', 'nvarchar(max)'),ed.value('(event/data[@name="statement"]/value)[1]', 'nvarchar(max)')) AS text
FROM cte
ORDER BY timestamp;

There are few gotchas when using this event. First you want to filter out non application users like yourself. You'll be surprised how often you cancel your queries. Secondly, some applications are not written very well. They do select without TOP XX clause but cancel the query after receiving some number of rows. Better filter those out. Either in the event itself or when extracting data.

When analyzing the results I compare duration and CPU time. If they are similar then the query is probably too slow. If the times are like 30 seconds duration but 1 for CPU, then it was probably blocked by other query. You need to use other methods to identify the offending query. I also check spills. They can be slow, so fixing them may speed up the query. Note that spills are only reported for batches for some reason.