20 January 2022

Intermittent job failures with 20,000 SSRS Subscriptions

Intermittent job failures with 20,000 SSRS Subscriptions

We often work with customers that push the boundaries of SQL Server and the problems that occur are always interesting to investigate and write up. Recently, one of those customers reported that many SQL Server Reporting Services (SSRS) subscriptions that were scheduled to run on a Monday morning, hadn’t run at all. What is unique about the environment is that there are 20,000 SSRS subscriptions.

What started off as a routine investigation ended up becoming an interesting journey that took us into the guts of SQL Server Agent and SSRS.

As with most things related to SSRS, the first port of call was to review the SSRS logs to see if anything was recorded in there. The SSRS logs can be found in C:\Program Files\Microsoft SQL Server Reporting Services\SSRS\LogFiles.

Scrolling through the log file to around 08:30:00, which is when the subscriptions were scheduled to run, immediately offered a clue as to the problem. For a 26 second period, between 08:30:29 and 08:30:55, there were a slew of the following errors:

schedule!WindowsService_51!4e8!08/23/2021-08:30:55:: i INFO: Handling event 
with data: TimeEntered: 08/23/2021 08:30:55, Type: Event, EventType: TimedSubscription, 
SubscriptionID: ab35126f-a605-4069-973b-160a59bf36c6.library!WindowsService_51!30d0!

08/23/2021-08:30:55:: e ERROR: Throwing 
Microsoft.ReportingServices.Diagnostics.Utilities.SchedulerNotRespondingException: , 
The SQL Agent service is not running. This operation requires the SQL Agent service.

The solution seems obvious, make sure that the SQL Agent service is running, but it was running the whole time so where do we go from here?

How does SSRS use SQL Server Agent?

SSRS doesn’t have its own built-in scheduling engine and instead uses SQL Agent. Whenever a subscription is created, a corresponding SQL Agent job is created that is scheduled to run according to the scheduling parameters for that subscription.

The job itself does very little, it simply executes a procedure in the ReportServer database, named AddEvent, which in turn adds a single row into the Event table that looks like this:

exec [ReportServer].dbo.AddEvent @EventType='TimedSubscription', 

The @EventData parameter contains the SubscriptionId of the Subscription.

The SSRS service polls the Event table every 10 seconds (by default), looking for new rows in the table. If it finds a new row in the table, it treats this as a subscription that it needs to process, which it does, and then deletes the row from the Event table.

For a more detailed look at the SSRS scheduling architecture, there’s a good description of it here: Chapter 3: Reporting Services Architecture (codemag.com)

The important thing to note here, is that that there is a SQL Agent job that corresponds to the above subscription that we identified in the SSRS error log. So now we can check whether that SQL Agent job ran at 08:30. You can read our post on how to do that here: How do you find an SSRS subscription’s SQL Agent job?

Sure enough, the job had run successfully:

  1. At 08:30:51 we saw the job ran successfully, which means it inserted a row into the EventTable.
  2. At 08:30:55 the SSRS service polled for and identified the subscription that it needed to process.
  3. SSRS then reports that the SQL Agent service isn’t running, and the processing of that subscription fails to run.

Before continuing with the investigation, it was important to realise that the customer had several business-critical subscriptions that hadn’t reached their intended targets. So, as much as it was becoming an interesting problem to troubleshoot, our priority at this point was to identity and re-run all the subscriptions that had failed to run.  You can read our post on how to do that here: How do you identify and re-run failed SSRS subscriptions from SQL Server?  

Why do SSRS jobs fail when there are thousands of subscriptions?

With so many SQL Agent jobs running at the same time, our first thought was that SQL Server worker threads might be being exhausted. With 800 SQL Agent jobs scheduled to run at 08:30:00 on a Monday morning and max worker threads sitting at 576 on this server it was a reasonable line of enquiry. However, our monitoring never picked up the thread count rising above 150.

As it turns out, with SQL Agent being a separate service it manages its worker threads independently of SQL Server. In fact, each subsystem has a max worker threads setting which you can see by querying msdb.dbo.syssubsystems.

Looking in the SQL Agent error log we observed various messages of the following two types:

[251] Step 1 of job A9FA6DC2-C91B-4BB8-B774-24EED200D9ED is being queued for the TSQL subsystem

[398] The user job (D2E7DC77-EAA5-49EB-B1EE-E821BBC999CD) has been queued because the maximum number of user working threads (800) are already running.  This user job will be executed as soon as one of the working threads finishes execution.
Why do SSRS jobs fail when there are thousands of subscriptions

This shows a couple of interesting things. Firstly, the maximum number of T-SQL Agent Job steps that can concurrently run is 160. This corresponds to the first error (251) observed in the SQL Agent Error Log. Secondly, the maximum number of jobs that can concurrently run is 800, which corresponds to the second message (398).

Given that all the Agent Jobs have a single T-SQL step and there were never more than 160 T-SQL steps running concurrently, this explains why the thread count we were monitoring against the SQL Server didn’t rise by more than 150 or so and ruled out thread exhaustion as possible culprit.

Tracing the SSRS Service

We decided to trace events against the msdb database, originating from the SSRS application. Specifically, we were looking for a query or queries that executed immediately prior to the error message about SQLAgent being offline. The hope here was that we might identify a query that would give clues as to why SSRS was reporting that the SQL Agent was offline.

Whilst tracing, immediately prior to the error message, we identified an execution of the stored procedure Get_sqlagent_job_status, which lives in the ReportServer database. This stored procedure makes various calls to the msdb database, including this call:

EXECUTE @retval = master.dbo.xp_sqlagent_enum_jobs @is_sysadmin, @owner_login_name, @job_id 
IF (@retval <> 0)
    RETURN(1) -- Failure

This extended procedure returns various bits of information about a job, including when it last ran and when it’s next scheduled to run.

Interestingly, in the RSReportServer.config file there is a setting called MaxScheduleWait, which is described as follows:

Specifies the number of seconds the Report Server Windows service waits for a schedule to be updated by SQL Server Agent service when Next Run Time is requested. Valid values range from 1 to 60. In the default configuration file, MaxScheduleWait is set to 5.

So, SSRS determines when a Subscription is next scheduled to run by requesting this information from the corresponding SQL Agent job and it appears that it does this through the stored procedure Get_sqlagent_job_status.

We decided to increase the MaxScheduleWait to 60. The thinking behind this was that it was possible that, with 160 steps running concurrently and up to 800 jobs needing to update their last run time and next run time, it might take SQL Agent a while to update its Next Run Time.

Six weeks after having changed MaxScheduleWait to 60, we haven’t seen a single incident of SSRS jobs failing to execute with the error that the SQL Agent is not running.

Why does changing MaxScheduleWait to 60 help with a large number of subscriptions?

The flow chart below simplifies some of the steps that occur from the point at which a SQL Agent job associated with a subscription fires. For simplicity, the details associated with the task “Process Event” have been left out but you if you’re interested in what’s going on under the SSRS hood, this Microsoft Tech Community post by Tim Plesuk (microsoft.com) does a really good job of highlighting the flow of data in more detail.

Why does changing MaxScheduleWait to 60 help with a large number of subscriptions

  1. Updating a job’s status happens independently after a row is inserted into the Event table. More importantly, the time at which a job’s status is updated is indeterminate and could happen after the status is called for through get_sql_agent_job_status. Whilst this ordinarily happens within milliseconds of the last step completing, contention caused by hundreds of jobs trying to update the statuses can result in this time delay being over 5 seconds.

As you can see from the above flow chart, there are events that occur on the SQL Server (in green) and events that occur in the SSRS service (blue). More importantly, the dotted section, highlights events that occur on the SQL Server but in a separate timeframe to the main timeframe.

With hundreds of jobs running simultaneously, contention occurs in the msdb database as hundreds of jobs try to write to the sysjobactivity table concurrently. Consequently, the step “Execute procedure get_sql_agent_job_status” can potentially run against a sysjobactivity table that hasn’t been updated yet so the next run time for a job appears to be in the past, not the future as expected.

We tested this theory by running 1,000 jobs concurrently every 5 minutes, each one inserting a tiny row into a test table, and around 10% of them had a 2 to 3 second delay in reporting completion time. Once or twice every 2 to 3 hours there would be a handful of jobs that took 5 to 9 seconds to update the completion time. This shows that, indeed, with hundreds of jobs trying to update their statuses and next run times, it can take (infrequently) more than 5 seconds for the job to update its completion time.

Given what we know about the nature of the SSRS service and its reliance on the SQL Agent service – both of which are loosely coupled, it’s hard to think past this being the problem. In fact, the SSRS development team seemingly developed for this loose coupling by introducing the MaxWaitSchedule configuration option. It seems that they anticipated the potential for the time it takes to accurately get the next run time was potentially subject to delays greater than 5 seconds.

In summary

    • Increasing MaxScheduleWait in the RSReportServer.config file to 60 removed all occurrences of intermittent failed SSRS subscription jobs in our environment with 20,000 subscriptions, 800 of which were scheduled to run at the same time.
    • After 25 years working with SQL Server, it’s positive and enriching experience to still come across interesting and unique problems that we have never come across before.
    • All our Managed DBA Service customers now benefit from this new insight as we’ve updated our monitoring platform to alert if an SSRS subscription hasn’t run when it was expected to.
    • Technically, there is no limit to the number of SQL Agent jobs and subscriptions that could be set up although there are practical limitations.
    • Finally, don’t open Job Activity Monitor in SSMS when you have 20,000 SQL Agent jobs!


Let’s Talk

Have a SQL challenge? We can help.

"*" indicates required fields

This field is for validation purposes and should be left unchanged.