Managing Hung Jobs in SQL Server Agent Job and Oracle Job Scheduler

By:   |   Updated: 2022-09-22   |   Comments   |   Related: More > Other Database Platforms


Problem

My company has multiple scheduled recurring activities in both SQL Server and Oracle and sometimes the jobs hang. How can I handle this so processing can continue with minimal interruptions?

Solution

Job hangs can happen for numerous reasons:

  • A deadlock occurs and doesn't get cleared automatically
  • The job finished but is reported as still running; maybe the agent or the server rebooted
  • A package is corrupted, causing it to fail intermittently
  • A local resource is temporarily crowded
  • An executable failed with a transient error
  • Network communication got interrupted
  • Remote resources are temporarily unavailable
  • DNS intermittent malfunction, maybe it was updated recently, or one replica is failing
  • Waiting for user input before closing
  • No free space left for writing the log
  • A command needs to finish graciously before starting the next command
  • Antivirus blocking the job at that moment
  • A remote database session got killed and was unable to be recovered/terminated locally
  • A remote database query failed due to any of the reasons above

When any of these occur, sometimes you can't handle it in the code (retry/rollback). If the job can be restarted, you can automate this task to continue processing information with minimal disruption. However, be careful to exclude long-running jobs (suspected corrupt data, waiting on resource locked by another session, big logfile, etc.) or when the failure must be fixed first (network configuration changes, corrupt database files, disk full, etc.), as that will adversely affect the performance of those jobs and will only increase the time they take to complete.

How do you automate this task in SQL Server 2019 and Oracle Windows 19c, and what are the differences?

Oracle

In Oracle Windows 19c, there is a standard way to create this setup:

Often, it's still not clear or straightforward even after reading the following documentation:

So, I'm going to show a more straightforward approach that is easier to understand and can be extended and adapted to your environment. This will be simpler than creating the job, a scheduler program (with its metadata argument EVENT_MESSAGE being thrown), and a stored procedure (which receives the SYS.SCHEDULER$_EVENT_INFO). And once you understand how it works, it will be much easier to create the scheduler program and stored procedure to improve performance as recommended by Oracle experts.

First, you need to subscribe to event-based notifications sent from the scheduler event queue. To do this, run the command below, logged in as a DBA user, taking note of the name assigned:

 BEGIN
 DBMS_SCHEDULER.ADD_EVENT_QUEUE_SUBSCRIBER('DATA_EVENT');
 END;
 /
 

Note: Use the command below to remove the subscription:

 BEGIN
 DBMS_SCHEDULER.REMOVE_EVENT_QUEUE_SUBSCRIBER('DATA_EVENT');
 END;
 /
 

To check which users are subscribed to event-based notifications from the scheduler event queue, run the following commands. The ones added are the last row in the first output and the last two rows in the second output.

 SET LINESIZE 300
 COL OWNER FOR A5
 COL QUEUE_TABLE FOR A21
 COL CONSUMER_NAME FOR A25
 COL RULE FOR A35
 SELECT OWNER, QUEUE_TABLE, CONSUMER_NAME, RULE
 FROM DBA_QUEUE_SUBSCRIBERS
 WHERE QUEUE_NAME ='SCHEDULER$_EVENT_QUEUE';
 
 OWNER QUEUE_TABLE CONSUMER_NAME RULE
 ----- --------------------- ------------------------- -----------------------------------
 SYS SCHEDULER$_EVENT_QTAB ILM_AGENT
 SYS SCHEDULER$_EVENT_QTAB SCHEDULER$_EVENT_AGENT
 SYS SCHEDULER$_EVENT_QTAB DATA_EVENT tab.user_data.object_owner ='SYS'
 
 
 COL OBJECT_NAME FOR A30
 SELECT OWNER, OBJECT_NAME, OBJECT_TYPE, STATUS
 FROM DBA_OBJECTS
 WHERE OBJECT_NAME LIKE '%SCHEDULER$_EVENT_QUEUE%';
 
 OWNER OBJECT_NAME OBJECT_TYPE STATUS
 ----- ------------------------------ ----------------------- -------
 SYS SCHEDULER$_EVENT_QUEUE_R RULE SET VALID
 SYS SCHEDULER$_EVENT_QUEUE_N RULE SET VALID
 SYS SCHEDULER$_EVENT_QUEUE QUEUE VALID
 SYS SCHEDULER$_EVENT_QUEUE UNDEFINED VALID
 SYS SCHEDULER$_EVENT_QUEUE$21 RULE SET VALID
 SYS SCHEDULER$_EVENT_QUEUE$21 RULE VALID
 

Now let's show the query to check which jobs have exceeded their max duration. It will only return records after you modify the existing jobs (to be explained later). The table DBA_SCHEDULER_JOBS has information about the existing jobs. We're interested in the last time the job ran and if it's running. The table SYS.AQ$SCHEDULER$_EVENT_QTAB means AQ for Advanced Queuing and QTAB for Queue Table. Note: SE.USER_DATA is an object, so it must be cast appropriately to get its nested columns EVENT_TIMESTAMP and EVENT_TYPE. We're interested at what time the job exceeded its max duration:

 COL OWNER FOR A10
 COL JOB_NAME FOR A10
 COL EVENT_TIMESTAMP FOR A15
 COL EVENT_TYPE FOR A16
 COL LAST_START_DATE FOR A10
 COL STATE FOR A10
 SELECT J.OWNER, J.JOB_NAME,
 TREAT(SE.USER_DATA AS SYS.SCHEDULER$_EVENT_INFO).EVENT_TIMESTAMP EVENT_TIMESTAMP,
 TREAT(SE.USER_DATA AS SYS.SCHEDULER$_EVENT_INFO).EVENT_TYPE EVENT_TYPE,
 J.LAST_START_DATE, J.STATE
 FROM DBA_SCHEDULER_JOBS J
 INNER JOIN SYS.AQ$SCHEDULER$_EVENT_QTAB SE ON
 TREAT(SE.USER_DATA AS SYS.SCHEDULER$_EVENT_INFO).OBJECT_OWNER = J.OWNER AND
 TREAT(SE.USER_DATA AS SYS.SCHEDULER$_EVENT_INFO).OBJECT_NAME = J.JOB_NAME
 WHERE TREAT(SE.USER_DATA AS SYS.SCHEDULER$_EVENT_INFO).EVENT_TIMESTAMP >= J.LAST_START_DATE
 AND TREAT(SE.USER_DATA AS SYS.SCHEDULER$_EVENT_INFO).EVENT_TYPE ='JOB_OVER_MAX_DUR'
 AND J.STATE ='RUNNING'
 ORDER BY J.LAST_START_DATE;
 

With this query, we can create an event-based job that will stop other jobs that have exceeded their max duration. You can filter them as needed by modifying the query. The second step is to create the job with JOB_ACTION containing the previous query escaping single quotes and stopping the job with DBMS_SCHEDULER.STOP_JOB. Specifying this job will launch when the event type is JOB_OVER_MAX_DUR, specifying the queue type SYS.SCHEDULER$_EVENT_QUEUE and specifying the subscriber name created initially. Note: TAB.USER_DATA is a special collection and must be used exactly as that.

 BEGIN
 DBMS_SCHEDULER.CREATE_JOB(
 job_name => 'STOP_HUNG_JOBS',
 job_type => 'PLSQL_BLOCK',
 job_action => '
 BEGIN
 FOR i IN (
 SELECT DISTINCT OWNER||''.''||JOB_NAME JOB FROM (
 SELECT J.OWNER, J.JOB_NAME,
 TREAT(SE.USER_DATA AS SCHEDULER$_EVENT_INFO).EVENT_TIMESTAMP EVENT_TIMESTAMP,
 TREAT(SE.USER_DATA AS SCHEDULER$_EVENT_INFO).EVENT_TYPE EVENT_TYPE,
 J.LAST_START_DATE, J.STATE
 FROM DBA_SCHEDULER_JOBS J
 INNER JOIN SYS.AQ$SCHEDULER$_EVENT_QTAB SE ON
 TREAT(SE.USER_DATA AS SCHEDULER$_EVENT_INFO).OBJECT_OWNER = J.OWNER AND
 TREAT(SE.USER_DATA AS SCHEDULER$_EVENT_INFO).OBJECT_NAME = J.JOB_NAME
 WHERE TREAT(SE.USER_DATA AS SCHEDULER$_EVENT_INFO).EVENT_TIMESTAMP >= J.LAST_START_DATE
 AND TREAT(SE.USER_DATA AS SCHEDULER$_EVENT_INFO).EVENT_TYPE =''JOB_OVER_MAX_DUR''
 AND J.STATE =''RUNNING''
 ORDER BY J.LAST_START_DATE)) LOOP
 DBMS_SCHEDULER.STOP_JOB(i.JOB, TRUE);
 END LOOP;
 END;
 ',
 event_condition => 'TAB.USER_DATA.EVENT_TYPE =''JOB_OVER_MAX_DUR''',
 queue_spec => 'SYS.SCHEDULER$_EVENT_QUEUE, DATA_EVENT',
 enabled => TRUE);
 END;
 /
 

Logged in with the user MYDB, you can create a job with the query below to test it. It is set to wait for 120 seconds, to launch every minute, and to raise the event JOB_OVER_MAX_DUR if it has been running for more than 60 seconds, which is the smallest possible value.

 BEGIN
 DBMS_SCHEDULER.CREATE_JOBS(
 JOB_DEFINITION_ARRAY(
 JOB_DEFINITION(
 job_name => '"MYDB"."TEST"',
 job_type => 'PLSQL_BLOCK',
 job_action => 'DBMS_LOCK.SLEEP(120);',
 repeat_interval => 'FREQUENCY=MINUTELY;INTERVAL=1;',
 max_run_duration => INTERVAL '60' SECOND,
 enabled => TRUE,
 number_of_arguments => 0)));
 END;
 /
 

To specify the MAX_RUN_DURATION for an existing job is as follows:

 BEGIN
 DBMS_SCHEDULER.SET_ATTRIBUTE(
 name => '"MYDB"."TEST"',
 attribute => 'MAX_RUN_DURATION',
 value => NUMTODSINTERVAL(60, 'SECOND'));
 END;
 /
 

You can check the current active jobs and their configuration with the queries below:

 SET LINESIZE 300
 SET PAGESIZE 500
 COL OWNER FOR A5
 COL JOB_NAME FOR A22
 COL STATE FOR A9
 COL REPEAT_INTERVAL FOR A30
 COL LAST_START_DATE FOR A16
 COL LAST_RUN_DURATION FOR A16
 COL NEXT_RUN_DATE FOR A16
 COL START_DATE FOR A16
 COL MAX_RUN_DURATION FOR A15
 SELECT OWNER, JOB_NAME, STATE, LAST_START_DATE, LAST_RUN_DURATION, NEXT_RUN_DATE
 FROM DBA_SCHEDULER_JOBS
 WHERE JOB_NAME IN ('TEST','STOP_HUNG_JOBS')
 AND ENABLED='TRUE';
 
 OWNER JOB_NAME STATE LAST_START_DATE LAST_RUN_DURATIO NEXT_RUN_DATE
 ----- ---------------------- --------- ---------------- ---------------- ----------------
 SYS STOP_HUNG_JOBS SCHEDULED 30-AUG-22 09.07. +000000000 00:00
 26.017000 PM AME :00.297000
 RICA/MEXICO_CITY
 
 MYDB TEST RUNNING 30-AUG-22 09.07. 30-AUG-22 09.07.
 26.314000 PM AME 04.205000 PM AME
 RICA/MEXICO_CITY RICA/MEXICO_CITY
 
 
 
 SELECT OWNER, JOB_NAME, START_DATE, REPEAT_INTERVAL, ENABLED, MAX_RUN_DURATION
 FROM DBA_SCHEDULER_JOBS
 WHERE JOB_NAME IN ('TEST','STOP_HUNG_JOBS')
 AND ENABLED='TRUE';
 
 OWNER JOB_NAME START_DATE REPEAT_INTERVAL ENABL MAX_RUN_DURATIO
 ----- ---------------------- ---------------- ------------------------------ ----- ---------------
 SYS STOP_HUNG_JOBS TRUE
 MYDB TEST 30-AUG-22 09.02. FREQUENCY=MINUTELY;INTERVAL=1; TRUE +000 00:01:00
 04.283000 PM AME
 RICA/MEXICO_CITY
 

You can see what happens in each schedule with the query and output below:

 COL LOG_DATE FOR A10
 COL OWNER FOR A5
 COL JOB_NAME FOR A22
 COL STATUS FOR A10
 COL REQ_START_DATE FOR A10
 COL ACTUAL_START_DATE FOR A10
 COL RUN_DURATION FOR A13
 COL ADDITIONAL_INFO FOR A10
 SET LINESIZE 300
 SET PAGESIZE 500
 SELECT LOG_DATE, OWNER, JOB_NAME, STATUS, ACTUAL_START_DATE, RUN_DURATION, ADDITIONAL_INFO
 FROM DBA_SCHEDULER_JOB_RUN_DETAILS
 WHERE JOB_NAME IN ('TEST','STOP_HUNG_JOBS')
 ORDER BY ACTUAL_START_DATE;
 
 LOG_DATE OWNER JOB_NAME STATUS ACTUAL_STA RUN_DURATION ADDITIONAL
 ---------- ----- ---------------------- ---------- ---------- ------------ ----------
 30-AUG-22 MYDB TEST STOPPED 30-AUG-22 +000 00:01:18 REASON="St
 08.33.37.0 09.32.18.3 op job wit
 17000 PM - 61000 PM A h force ca
 06:00 MERICA/MEX lled by us
 ICO_CITY er: 'SYS'"
 
 30-AUG-22 SYS STOP_HUNG_JOBS SUCCEEDED 30-AUG-22 +000 00:00:01
 08.33.37.0 09.33.35.8
 17000 PM - 77000 PM A
 06:00 MERICA/MEX
 ICO_CITY
 
 30-AUG-22 MYDB TEST STOPPED 30-AUG-22 +000 00:01:18 REASON="St
 08.34.55.6 09.33.37.0 op job wit
 89000 PM - 17000 PM A h force ca
 06:00 MERICA/MEX lled by us
 ICO_CITY er: 'SYS'"
 
 30-AUG-22 SYS STOP_HUNG_JOBS SUCCEEDED 30-AUG-22 +000 00:00:01
 08.34.55.6 09.34.54.8
 89000 PM - 92000 PM A
 06:00 MERICA/MEX
 ICO_CITY
 

By creating this setup, you have automated the hung job stop functionality, and in the next job schedule, it can continue processing data without noticing and requiring manual intervention.

Additionally, to clear the scheduler job run details for any reason, you can do it with the following:

 DELETE FROM DBA_SCHEDULER_JOB_RUN_DETAILS WHERE JOB_NAME IN ('TEST','STOP_HUNG_JOBS');
 

To clear the queue events, use the following:

 DECLARE
 PO DBMS_AQADM.AQ$_PURGE_OPTIONS_T;
 BEGIN
 PO.BLOCK := FALSE;
 DBMS_AQADM.PURGE_QUEUE_TABLE(
 QUEUE_TABLE => 'SCHEDULER$_EVENT_QTAB',
 PURGE_CONDITION => NULL,
 PURGE_OPTIONS => PO);
 END;
 /
 

Once you've confirmed the stop job works successfully, you can drop the test job:

 BEGIN
 DBMS_SCHEDULER.STOP_JOB('MYDB.TEST');
 DBMS_SCHEDULER.DISABLE('MYDB.TEST');
 DBMS_SCHEDULER.DROP_JOB('MYDB.TEST', TRUE);
 END;
 /
 

The first command stops the job if it's running, the second disables the job from running in the next schedule, and the third deletes the job. Now you need to set the MAX_RUN_DURATION for the existing jobs you want to be stopped, and you're done.

SQL Server

No job property indicates the max time it should run, nor a way to raise an event when it has exceeded a certain amount of time. However, in the schedules, you can specify an active end time, and one job can be associated with multiple schedules, so this can be used to create a job that stops hung jobs.

Let's show the query to check which jobs have exceeded the schedule active end time, but it will only return records after you specify this property in a schedule:

 SELECT js.schedule_id, j.name job_name, msdb.dbo.agent_datetime(CONVERT(VARCHAR, ja.start_execution_date, 112), s.active_end_time) MustFinishBy
 FROM msdb.dbo.sysjobs j
 INNER JOIN msdb.dbo.sysjobschedules js ON js.job_id = j.job_id
 INNER JOIN msdb.dbo.sysschedules s ON s.schedule_id = js.schedule_id
 INNER JOIN msdb.dbo.sysjobactivity ja ON ja.job_id = j.job_id
 WHERE ja.start_execution_date IS NOT NULL
 AND s.active_end_time IS NOT NULL
 

With this query, we can create a job that will stop other jobs that have exceeded the scheduled active end time; you can filter them as needed by modifying the query. The steps below create the job, associate it with the local server, add a step to the job, add a schedule to run it daily (3rd parameter) every (5th parameter) minute (4th parameter), and attach the schedule to the job:

 EXEC msdb.dbo.sp_add_job 'STOP_HUNG_JOBS', 1
 EXEC msdb.dbo.sp_add_jobserver @job_name='STOP_HUNG_JOBS', @server_name = @@SERVERNAME
 EXEC msdb.dbo.sp_add_jobstep @job_name='STOP_HUNG_JOBS', @step_id=1, @step_name='STOP_HUNG_JOBS', @command='
 DECLARE @cmd VARCHAR(MAX);
 SET @cmd ='''';
 SELECT @[email protected]+''EXEC msdb.dbo.sp_stop_job ''''''+job_name+'''''';'' FROM (
 SELECT DISTINCT job_name FROM (
 SELECT js.schedule_id, j.name job_name, msdb.dbo.agent_datetime(CONVERT(VARCHAR, ja.start_execution_date, 112), s.active_end_time) MustFinishBy
 FROM msdb.dbo.sysjobs j
 INNER JOIN msdb.dbo.sysjobschedules js ON js.job_id = j.job_id
 INNER JOIN msdb.dbo.sysschedules s ON s.schedule_id = js.schedule_id
 INNER JOIN msdb.dbo.sysjobactivity ja ON ja.job_id = j.job_id
 WHERE ja.start_execution_date IS NOT NULL
 AND s.active_end_time IS NOT NULL) A
 WHERE GETDATE() > A.MustFinishBy) B;
 EXEC (@cmd);
 '
 EXEC msdb.dbo.sp_add_schedule 'MinutelyEvery1Minute', 1, 4, 1, 0x4, 1
 EXEC msdb.dbo.sp_attach_schedule @job_name='STOP_HUNG_JOBS', @schedule_name='MinutelyEvery1Minute'
 

To test this job, you can create schedules per minute programmatically with the query below. They're created to run every minute and set to end the minute after it starts:

 DECLARE
 @cnt INT = 0,
 @min VARCHAR(2),
 @nextmin VARCHAR(2),
 @hr VARCHAR(2),
 @nexthr VARCHAR(2);
 WHILE @cnt < 1440
 BEGIN
 SET @min = RIGHT('0'+CAST(@cnt%60 AS VARCHAR(2)), 2)
 SET @nextmin = RIGHT('0'+CAST((@cnt+1)%60 AS VARCHAR(2)), 2)
 SET @hr = RIGHT('0'+CAST(@cnt/60 AS VARCHAR(2)), 2)
 SET @nexthr = CASE WHEN @nextmin='00' THEN RIGHT('0'+CAST((@cnt/60)+1 AS VARCHAR(2)), 2) ELSE @hr END
 SET @nexthr = CASE WHEN @nexthr='24' THEN '00' ELSE @nexthr END
 EXEC ('EXEC msdb.dbo.sp_add_schedule '''[email protected]+':'[email protected]+''', 1, 4, 1, @active_start_time='''[email protected][email protected]+'00'', @active_end_time='''[email protected][email protected]+'00''')
 SET @cnt = @cnt + 1;
 END;
 

If you already have a schedule, you can update it with the query below, 000100 means the first minute after midnight:

 EXEC msdb.dbo.sp_update_schedule @name='00:00', @active_end_time = 000100;
 

You can check the existing schedules and their active end time with the query below:

 SELECT name, active_start_date, active_start_time, active_end_date, active_end_time
 FROM msdb.dbo.sysschedules
 

Note: The results look strange:

But, they display correctly when you view the schedules:

Also, note that the last schedule, even when we set it to 00:00:00, changed to 11:59:59 PM.

Now you can create the test job as follows:

 EXEC msdb.dbo.sp_add_job 'test', 1
 EXEC msdb.dbo.sp_add_jobserver @job_name='test', @server_name = @@SERVERNAME
 EXEC msdb.dbo.sp_add_jobstep @job_name='test', @step_id=1, @step_name='step1', @command='WAITFOR DELAY ''00:02:00'';'
 

And you can associate the job with the created schedules programmatically:

 DECLARE
 @cnt INT = 0,
 @min VARCHAR(2),
 @hr VARCHAR(2);
 WHILE @cnt < 1440
 BEGIN
 SET @min = RIGHT('0'+CAST(@cnt%60 AS VARCHAR(2)), 2)
 SET @hr = RIGHT('0'+CAST(@cnt/60 AS VARCHAR(2)), 2)
 EXEC ('EXEC msdb.dbo.sp_attach_schedule @job_name=''test'', @schedule_name='''[email protected]+':'[email protected]+'''')
 SET @cnt = @cnt + 1;
 END;
 

You can see what happens in each schedule as described in this tip: Script for SQL Server Agent Job Issues Across All Instances.

In my case, the output of all jobs history is below:

By creating this setup, you have automated the hung job stop functionality. In the next job schedule, it can continue processing data without noticing or requiring manual intervention.

Once you've confirmed the stop job works successfully, you can drop the test job, which also deletes the per minute schedules created earlier and its job history:

 EXEC msdb.dbo.sp_delete_job @job_name='test'
 

Now you just need to set the active_end_time for the existing schedules you want to be stopped, and you're done.

Next Steps

You can learn more about SQL Server scheduled jobs in the links below:




Get Started Now - Click here to get your free 14 day trial of SolarWinds Database Insights








About the author
Pablo Echeverria worked for more than 10 years as a software programmer and analyst. In 2016, I switched jobs to a DBA position, where I have implemented new processes, created better monitoring tools and grown my data scientist skills.

View all my tips


Article Last Updated: 2022-09-22

Comments For This Article