Technical Paper:
An Intelligent Scheduler/Controller to Manage Complex SAS Production
Overview
The Problem
The Model
Testing the Intelligent Scheduler/Controller
*Run 1*
*Run 2*
Read Your SAS License
Appendix: Program Downloads
Overview
Many industries have a substantial investment in SAS code to run complex production processes. These processes are made up of many individual SAS jobs and other jobs such as relational database and CRM updates. These jobs have complex preconditions and interactions. Some jobs cannot begin until external feed files or products from other parts of the production process become available. Some jobs may have a scheduled start time, or preconditions based on physical resources such as available CPU and disk space. Some jobs may not run properly - they may have various non-fatal errors, or fail completely, or run longer than expected. All the jobs must be monitored and their status made available to various company departments, depending on the exact nature of the problems encountered.
Although there are some scheduling products on the market, most SAS production is still done manually. Technical staff come in at night and over the weekends and physically manage production by submitting individual jobs and monitoring their progress. A return on investment (ROI) from automating the production process can come from:
- Discovering and documenting the company's production processes.
- Ensuring that every production run is done the same way.
- Ensuring that every production run is done according to best practices.
- Ensuring that there is a record made of every production run.
- Reducing the risk of having the production run fail to complete properly.
- Increasing IT staff productivity by freeing them from having to be on site to run the production process. The system will check the job preconditions and run the daily production reports. IT staff will only have to review and confirm the report recommendations or intervene if the process does not succeed. Overtime expenses will be reduced. IT staff stress will also be reduced.
Company production processes are often largely undocumented. Individuals may understand parts of the process well, but their knowledge is in their heads or in a dog-eared production manual full of sticky notes kept at their desk. Overall, it is difficult to determine what the production process is, whether it is followed with any consistency, whether it reflects best practices, or whether every run is properly documented. The conventional approach of documenting the processes and producing "shelf-ware" suffers from a continuing struggle to keep documents and production synchronized. Production processes are generally dynamic and documentation must try to keep up. Regulatory intervention such as Sarbanes-Oxley only intensifies the issues.
This technical paper demonstrates an external scheduler/controller that starts and monitors SAS jobs through a SAS socket server. Only a small production process is demonstrated here, but the same approach can model a large number of SAS jobs with complex preconditions and interactions. The model is easily understood since it duplicates the behaviors of experts running each of the individual SAS jobs, and the behavior of an expert who knows how to orchestrate the individual jobs to produce the overall production process. The socket server approach allows this system to require only a Base SAS installation, and allows SAS jobs to be started and monitored at any location. This in turn allows remote SAS installations with special resources to be shared by the process.
The Problem
Here is a simple but representative production problem. Our company has a daily production run consisting of three SAS jobs, SAS_A, SAS_B, and SAS_C. If SAS_A does not have enough disk space it will run for a while and then fail. Also, SAS_A often has divide-by-zero errors that must be detected. SAS_B shares a computer and cannot be allowed to start while other large jobs are running. SAS_C has dependencies on disk space, CPU, and a file (File_B.txt) produced by SAS_B.
All the jobs have preconditions. SAS_A and SAS_B are asynchronous and either can start before the other. SAS_B and SAS_C are synchronous and SAS_C can only run after SAS_B has produced File_B.txt. Also, SAS_A must be closely watched for errors.
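The preconditions described above can be expressed as small boolean checks. The following is a minimal sketch, not the code distributed with this paper: the class name `Preconditions`, the thresholds passed in, the order of the checks, and the use of the JVM's system load average as a stand-in for "available CPU" are all assumptions made for illustration.

```java
import java.io.File;
import java.lang.management.ManagementFactory;

/** Hypothetical precondition checks for a job such as SAS_C (illustration only). */
final class Preconditions {

    /** True if the volume holding 'dir' reports at least minFreeBytes of usable space. */
    static boolean hasDiskSpace(File dir, long minFreeBytes) {
        return dir.getUsableSpace() >= minFreeBytes;
    }

    /** True if the required file (for example File_B.txt) exists as a regular file. */
    static boolean hasFile(File requiredFile) {
        return requiredFile.isFile();
    }

    /**
     * True if the system load average is at or below maxLoad.
     * getSystemLoadAverage() returns a negative value when the figure is not
     * available on the platform; this sketch treats that case as "allow".
     */
    static boolean hasCpu(double maxLoad) {
        double load = ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage();
        return load < 0 || load <= maxLoad;
    }

    /** Returns a description of the first unmet precondition, or null if all are met. */
    static String firstUnmet(File workDir, long minFreeBytes, File requiredFile, double maxLoad) {
        if (!hasFile(requiredFile)) {
            return "missing file " + requiredFile.getName();
        }
        if (!hasDiskSpace(workDir, minFreeBytes)) {
            return "inadequate disk space";
        }
        if (!hasCpu(maxLoad)) {
            return "inadequate CPU";
        }
        return null;
    }
}
```

A controller would call `firstUnmet` on each visit and attempt a launch only when it returns null, writing the returned description to its log otherwise.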

The Model
[Figure: Overview of the Intelligent Scheduler/Controller]
The three SAS jobs are unchanged. Instead of being managed manually they are now accessed through a SAS Socket Server. This server will allow Java programs to start the SAS jobs and to retrieve files such as the SAS log file.
Above the server are four Java jobs arranged in a tree. The four Java jobs encapsulate the knowledge of four technical experts. Three of these, Job_A, Job_B, and Job_C, contain the knowledge to run the individual SAS jobs. The fourth Java program, DailyReports, contains the knowledge needed to orchestrate the individual jobs and complete the production run.
The three individual Java jobs are similar and only one needs to be described.

[Figure: Activity Diagram of an Individual Java Job Controller]
The first time it runs, the individual Java job controller may need to do some initialization. It then checks whether the SAS job it controls has been launched. If the job needs to be launched, the preconditions are checked and, if they are met, a launch is attempted by getting a socket connection, issuing launch commands, and listening for the socket server's return message.
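The launch attempt (connect, send a command, wait for the reply) might look roughly like the sketch below. It is not the client from the SAS Socket Server paper: the class name `LaunchAttempt`, the one-line text protocol, and any command string passed in are hypothetical. What it does show is how a refused or timed-out connection can be turned into an ordinary result that the controller logs before handing control back to the tree loop.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

/** Hypothetical launch attempt; the command text and reply format are assumptions. */
final class LaunchAttempt {

    /** Outcome of one attempt: whether the server replied, plus the reply or the failure reason. */
    static final class Result {
        final boolean replied;
        final String message;

        Result(boolean replied, String message) {
            this.replied = replied;
            this.message = message;
        }
    }

    /** Connects with a timeout, sends one command line, and reads one reply line. */
    static Result send(String host, int port, String command, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            socket.setSoTimeout(timeoutMillis); // also bound the wait for the reply
            PrintWriter out = new PrintWriter(
                    new OutputStreamWriter(socket.getOutputStream(), StandardCharsets.UTF_8), true);
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(socket.getInputStream(), StandardCharsets.UTF_8));
            out.println(command);
            String reply = in.readLine();
            if (reply == null) {
                return new Result(false, "server closed the connection without replying");
            }
            return new Result(true, reply);
        } catch (IOException e) {
            // Covers connection refused, connect timeout, and read timeout.
            return new Result(false, e.toString());
        }
    }
}
```

Because failures come back as a `Result` rather than an exception, the controller can simply record the message and try again on its next visit.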
If the SAS job has been previously started, the Java controller checks whether the SAS job's log file has been returned and scanned for serious errors. Using a socket server means that the SAS and Java jobs may not be on the same computers. The SAS log files must be retrieved from the remote computer and stored on the local computer so they can be examined in detail if needed. After they are stored locally, the SAS log files are scanned by a separate local Java program that looks for serious errors and writes them to a Java log file that records an overview of the complete production run.
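A log scan of this kind can be sketched as a line-by-line pattern match. The example below is illustrative only: the class name `SasLogScanner` and the three patterns are assumptions, chosen to catch lines that start with `ERROR`, data step dumps containing `_ERROR_=1`, and the "Mathematical operations could not be performed" note. A production scanner would need a pattern list tuned to the jobs being run.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical scanner for a locally stored copy of a SAS log (illustration only). */
final class SasLogScanner {

    // Assumed patterns for "possible errors"; not an exhaustive or official list.
    private static final Pattern SUSPICIOUS = Pattern.compile(
            "^ERROR\\b|_ERROR_=1|Mathematical operations could not be performed");

    /** Returns every log line that matches one of the assumed patterns, in order. */
    static List<String> scan(List<String> logLines) {
        List<String> possibleErrors = new ArrayList<>();
        for (String line : logLines) {
            if (SUSPICIOUS.matcher(line).find()) {
                possibleErrors.add(line);
            }
        }
        return possibleErrors;
    }
}
```

The lines of a stored log can be read with `java.nio.file.Files.readAllLines` and passed to `scan`; the size of the returned list corresponds to a "possible errors" count for the job.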
Understanding the Java program DailyReports requires understanding the tree and its built-in search loop. Additional Java software behind the scenes lets the four Java jobs appear as nodes in a tree. A loop can be started in the tree that will visit the nodes in the following order:
DailyReports -> Job_A -> DailyReports -> Job_B -> DailyReports -> Job_C -> DailyReports
This makes sense when we realize that this is the behavior of an expert orchestrating the three individual Java jobs. The expert will start at DailyReports and first look at what needs to be done with Job_A. He will then return to DailyReports and ask if all the jobs are completed or if the overall production run has timed out. If production has not completed and not timed out, Job_B is examined to see if it can be made to complete. The expert returns to DailyReports to determine if production is now completed or timed out. If production has not completed and not timed out, Job_C is examined, followed by a return to DailyReports to complete a loop.
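The visiting order can be reproduced with a short recursive traversal in which control returns to the parent after each child. This is a sketch of the idea only, not the tree search code in the downloadable software; the `TreeLoop` and `Node` names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical model of one loop around the tree (illustration only). */
final class TreeLoop {

    /** A named node with an ordered list of children. */
    static final class Node {
        final String name;
        final List<Node> children = new ArrayList<>();

        Node(String name, Node... kids) {
            this.name = name;
            children.addAll(Arrays.asList(kids));
        }
    }

    /** Visits the node, then each child subtree in turn, returning to the node after each child. */
    static void visit(Node node, List<String> history) {
        history.add(node.name);
        for (Node child : node.children) {
            visit(child, history);
            history.add(node.name); // control returns to the parent
        }
    }

    /** Returns the names visited during one complete loop starting at the root. */
    static List<String> oneLoop(Node root) {
        List<String> history = new ArrayList<>();
        visit(root, history);
        return history;
    }

    public static void main(String[] args) {
        Node root = new Node("DailyReports",
                new Node("Job_A"), new Node("Job_B"), new Node("Job_C"));
        System.out.println(String.join(" -> ", oneLoop(root)));
    }
}
```

Because `visit` is recursive, the same code handles a node such as Job_A that has children of its own.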
A few loops around the tree will usually match all the preconditions and synchronizations and take the production run to completion. If the production run should time out, the detailed record in the Java log file can be used to locate the problem. Any SAS jobs that completed will also have a copy of their log file saved to local disk.
An activity diagram of the four Java jobs and the tree loop process is shown below. A loop counter is used to produce a time-out condition.

[Figure: The Four Java Jobs Are Run from the Tree Loop]
Internally, the DailyReports program is fairly simple and its activity diagram is shown below.

[Figure: Activity Diagram of the DailyReports Program]
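The decision DailyReports makes on each visit can be sketched as follows. This is an assumption-laden illustration rather than the actual program: the class and enum names are hypothetical, and the sketch assumes that the counter is incremented on every DailyReports visit and that completion is tested before the time-out. A limit of 10 is consistent with note 4 of Run 2 and with the ten DailyReports entries in each search history.

```java
import java.util.Map;

/** Hypothetical version of the DailyReports decision logic (illustration only). */
final class DailyReportsSketch {

    enum Decision { CONTINUE, FINAL_REPORT, TIMED_OUT }

    private final int maxVisits;
    private int visits = 0;

    DailyReportsSketch(int maxVisits) {
        this.maxVisits = maxVisits;
    }

    /**
     * Called each time the tree loop reaches the root node.
     * jobCompleted maps a node name (Job_A, Job_B, ...) to its completion flag.
     */
    Decision production(Map<String, Boolean> jobCompleted) {
        visits++;
        boolean allDone = true;
        for (boolean done : jobCompleted.values()) {
            allDone = allDone && done;
        }
        if (allDone) {
            return Decision.FINAL_REPORT; // normal completion report
        }
        if (visits >= maxVisits) {
            return Decision.TIMED_OUT;    // warning report
        }
        return Decision.CONTINUE;         // hand control back to the tree loop
    }
}
```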
The software that runs in the background supporting the tree and loop is a set of Java programs. The loop reset block in the diagram above is simply making a conventional Java call to tell this software to repeat the loop around the tree. This software has many other features, including a memory-resident database that is used to store data that can be read by all the Java programs. This meets the need for the Java side of the server socket to "remember" the system state and make this information available to each Java program when it becomes active during the loop through the tree.
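The role of the memory-resident database can be illustrated with a small shared key-value store. The sketch below is not the Automated Consultant API: `SharedState`, its methods, and the `launched` and `logScanned` keys are hypothetical names. It only shows how a controller that is re-entered on every loop can recover what it did on earlier visits.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for the memory-resident database (illustration only). */
final class SharedState {

    private final Map<String, String> rows = new ConcurrentHashMap<>();

    /** Stores a value under a key that belongs to one node. */
    void put(String node, String key, String value) {
        rows.put(node + "." + key, value);
    }

    /** Reads a node's value, or defaultValue if nothing has been stored yet. */
    String get(String node, String key, String defaultValue) {
        return rows.getOrDefault(node + "." + key, defaultValue);
    }

    /** What a controller should do on this visit, based on what it recorded earlier. */
    static String nextAction(SharedState state, String node) {
        if (!"true".equals(state.get(node, "launched", "false"))) {
            return "LAUNCH";
        }
        if (!"true".equals(state.get(node, "logScanned", "false"))) {
            return "FETCH_AND_SCAN_LOG";
        }
        return "DONE";
    }
}
```

In this sketch a controller sets `launched` after a successful launch and `logScanned` after the log has been stored and checked, and the root node can read the same flags to decide whether the run is complete.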
This example has three controller nodes (Job_A, Job_B, and Job_C) and a root node, DailyReports, that monitors the overall production run. This small demonstration problem nevertheless illustrates a general solution to the problems of all complex production systems regardless of size: jobs have preconditions and synchronizations, and they can fail in part or in whole. The three SAS jobs (SAS_A, SAS_B, and SAS_C) are intelligently launched and monitored and provide sufficient information to document the production run, identify problems, and take corrective actions or notifications as needed.
It is easy to expand the number of controlled jobs by adding more nodes horizontally. The individual Java controllers have been simple enough in this example to implement with a single Java program. When requirements become significantly more complex, the single Java job can expand into a set of Java jobs arranged in their own subtree. For example, the node Job_A can become the root of a subtree composed of Job_A_1, Job_A_2, ..., Job_A_n, and some of these nodes can be the roots of their own subtrees. The tree looping mechanism (more properly termed a "tree search algorithm") will automatically expand to accommodate trees of any size or depth, and consequently production processes of large size and complexity.
This example does not demonstrate using Java threading to run multiple Java controllers simultaneously, or using multiple SAS socket servers to support parallel SAS execution. Both Java threading and multiple SAS servers are easily implemented.
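As a rough indication of what the threaded variant could look like, the sketch below runs several controller visits concurrently with a standard `ExecutorService`. It is not part of the downloadable example: `ParallelControllers` is a hypothetical name, and each controller is modelled as a `Callable` that returns a one-line status. Since a single socket server handles one job at a time, this would presumably pay off only when combined with multiple SAS socket servers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical concurrent execution of controller visits (illustration only). */
final class ParallelControllers {

    /** Runs every controller on its own thread and returns their statuses in input order. */
    static List<String> runAll(List<Callable<String>> controllers) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, controllers.size()));
        List<String> statuses = new ArrayList<>();
        try {
            // invokeAll blocks until every controller has finished or failed.
            for (Future<String> future : pool.invokeAll(controllers)) {
                try {
                    statuses.add(future.get());
                } catch (ExecutionException e) {
                    statuses.add("failed: " + e.getCause());
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        } finally {
            pool.shutdown();
        }
        return statuses;
    }
}
```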
Testing the Intelligent Scheduler/Controller
Two runs from the log produced by the four Java programs are shown below. Notes in each log are discussed at the bottom of that run.
*** Run 1 ***
Automated Consultant(R) version 2.1.1
Copyright (C) 1998-2003 Dymond and Associates, LLC
All rights reserved.
This is an alpha version of the software.
November 12, 2003 11:21:03 AM PST
Loading KB from file: C:\datasets\com\dymondassoc\xomj\examples\dailyreports\dailyreports.xos
KB version: 2.1.1 BBDB rows: 0 Tree nodes: 4
Beginning "Depth First" search algorithm. [note 1]
Start Executioner: class dailyreports.DailyReports.production [note 2]
Start Executioner: class dailyreports.Job_A.process [note 3]
Start sending launch msg to server.
SAS job appears to have started normally on the server.
Start Executioner: class dailyreports.DailyReports.production [note 4]
Start Executioner: class dailyreports.Job_B.process [note 5]
Job_B timed out waiting for a socket.
LAST EXCEPTION: java.net.ConnectException: Connection refused: connect
Job_B could not launch because it could not connect to the server.
Start Executioner: class dailyreports.DailyReports.production
Start Executioner: class dailyreports.Job_C.process [note 6]
Job_C does not have file File_B.txt needed to launch.
Start Executioner: class dailyreports.DailyReports.production [note 7]
Start Executioner: class dailyreports.Job_A.process [note 8]
Job_A timed out waiting for a socket.
LAST EXCEPTION: java.net.ConnectException: Connection refused: connect
Could not scan the log because it could not connect to the server.
Start Executioner: class dailyreports.DailyReports.production
Start Executioner: class dailyreports.Job_B.process [note 9]
Start sending launch msg to server.
SAS job appears to have started normally on the server.
Start Executioner: class dailyreports.DailyReports.production
Start Executioner: class dailyreports.Job_C.process [note 10]
Start sending launch msg to server.
SAS job appears to have started normally on the server.
Start Executioner: class dailyreports.DailyReports.production
Start Executioner: class dailyreports.Job_A.process [note 11]
Start sending query msg to server.
QUERY appears to have started normally on the server.
Start Executioner: class dailyreports.DailyReports.production
Start Executioner: class dailyreports.Job_B.process
Start sending query msg to server.
QUERY appears to have started normally on the server.
Start Executioner: class dailyreports.DailyReports.production
Start Executioner: class dailyreports.Job_C.process
Start sending query msg to server.
QUERY appears to have started normally on the server.
Start Executioner: class dailyreports.DailyReports.production [note 12]
FINAL REPORT
All nodes appear to have completed their tasks.
Node Name: Job_A
Start Time: November 12, 2003 11:21:04 AM PST
Job Completed: true
Possible Errors: 0
Copy of SAS log (if available): copy_sas_alog.txt
Node Name: Job_B
Start Time: November 12, 2003 11:21:42 AM PST
Job Completed: true
Possible Errors: 0
Copy of SAS log (if available): copy_sas_blog.txt
Node Name: Job_C
Start Time: November 12, 2003 11:21:42 AM PST
Job Completed: true
Possible Errors: 0
Copy of SAS log (if available): copy_sas_clog.txt
SearchHistory: [note 13]
DailyReports
Job_A
DailyReports
Job_B
DailyReports
Job_C
DailyReports
Job_A
DailyReports
Job_B
DailyReports
Job_C
DailyReports
Job_A
DailyReports
Job_B
DailyReports
Job_C
DailyReports
[note 1]
Automated Consultant is the software that sits behind the scenes and supports the tree, the Java programs in its nodes, and the tree loop. The log starts with messages from this program. A file named dailyreports.xos is loaded that contains the pre-built tree. The loop algorithm to be used is a "Depth First" search.
[note 2]
"Start Executioner:" indicates that the Java code in a node is beginning to execute. The loop is beginning here at the root node DailyReports. (The full name "dailyreports.DailyReports.production" means the production method in the DailyReports class in the dailyreports package has been called.) This code will conclude that the production run has not completed and not timed out, so it will exit and return control to the tree loop, which will move next to the Job_A node.
[note 3]
Job_A begins to run. It has found no problems with the preconditions, so it attempts to connect to the socket server and send launch instructions for SAS_A. Based on the message returned from the server, Job_A concludes the launch worked. Job_A is done for now and returns control to the tree loop.
[note 4]
DailyReports concludes that the production run has not completed and not timed out.
[note 5]
Java Job_B begins to run. It finds no problems with its preconditions and attempts to get a socket connection to launch SAS_B. But the socket connection timed out and failed because the socket server is still busy running SAS_A. There is only one socket server and it can only handle one job at a time. SAS_A will have to finish before the server will allow a new socket connection.
[note 6]
After visiting DailyReports, the tree loop activates Java Job_C. The preconditions are not met because SAS_B has not yet run to produce File_B.txt. Job_C exits without attempting to contact the socket server.
[note 7]
Returning to DailyReports here marks the first complete loop around the tree.
[note 8]
The loop begins again with a second visit to Job_A. Job_A is not complete yet. It has launched SAS_A but it has not recovered SAS_A's log file and checked it for errors. Job_A tries to contact the socket server to request the log file but the connection times out and fails. This occurs because SAS_A is still running and the server is still not available for a new connection.
[note 9]
Job_B meets its preconditions and tries to launch SAS_B. The server has now become available and accepts this launch request.
[note 10]
Job_C meets its preconditions which include File_B.txt just produced by SAS_B. The server is available and accepts this launch request.
[note 11]
All the SAS jobs have been launched but none of the SAS logs have been recovered and checked for errors. The following set of node executions asks the socket server to return these log files, which are then saved to local disk and checked for errors.
[note 12]
DailyReports determines the production run has completed and initiates a final report. For each Java job, the report describes the start time, completion status, possible errors in the underlying SAS job, and the name of the local copy of the SAS log file.
[note 13]
The search history lists the nodes that were executed during this production run.
*** Run 2 ***
Automated Consultant(R) version 2.1.1
Copyright (C) 1998-2003 Dymond and Associates, LLC
All rights reserved.
This is an alpha version of the software.
November 12, 2003 2:58:22 PM PST
Loading KB from file: C:\datasets\com\dymondassoc\xomj\examples\dailyreports\dailyreports.xos
KB version: 2.1.1 BBDB rows: 0 Tree nodes: 4
Beginning "Depth First" search algorithm.
Start Executioner: class dailyreports.DailyReports.production
Start Executioner: class dailyreports.Job_A.process
Start sending launch msg to server.
SAS job appears to have started normally on the server.
Start Executioner: class dailyreports.DailyReports.production
Start Executioner: class dailyreports.Job_B.process
Job_B timed out waiting for a socket.
LAST EXCEPTION: java.net.ConnectException: Connection refused: connect
Job_B could not launch because it could not connect to the server.
Start Executioner: class dailyreports.DailyReports.production
Start Executioner: class dailyreports.Job_C.process
Job_C does not have file File_B.txt needed to launch.
Start Executioner: class dailyreports.DailyReports.production
Start Executioner: class dailyreports.Job_A.process
Job_A timed out waiting for a socket.
LAST EXCEPTION: java.net.ConnectException: Connection refused: connect
Could not scan the log because it could not connect to the server.
Start Executioner: class dailyreports.DailyReports.production
Start Executioner: class dailyreports.Job_B.process
Start sending launch msg to server.
SAS job appears to have started normally on the server.
Start Executioner: class dailyreports.DailyReports.production
Start Executioner: class dailyreports.Job_C.process [note 1]
Job_C does not have adequate disk space to launch.
Start Executioner: class dailyreports.DailyReports.production
Start Executioner: class dailyreports.Job_A.process [note 2]
Start sending query msg to server.
QUERY appears to have started normally on the server.
The following possible errors have been found in the log file:
from sas: i=100000001 seed=53940 random=0.8810162027 numerator=1 denominator=0 error=. _ERROR_=1 _N_=1
from sas: NOTE: Mathematical operations could not be performed at the following places. The results of
Start Executioner: class dailyreports.DailyReports.production
Start Executioner: class dailyreports.Job_B.process
Start sending query msg to server.
QUERY appears to have started normally on the server.
Start Executioner: class dailyreports.DailyReports.production
Start Executioner: class dailyreports.Job_C.process [note 3]
Job_C does not have adequate CPU to launch.
Start Executioner: class dailyreports.DailyReports.production
FINAL REPORT [note 4]
WARNING! Some nodes did not complete their tasks.
Node Name: Job_A
Start Time: November 12, 2003 2:58:22 PM PST
Job Completed: true
Possible Errors: 2
Copy of SAS log (if available): copy_sas_alog.txt
Node Name: Job_B
Start Time: November 12, 2003 2:59:01 PM PST
Job Completed: true
Possible Errors: 0
Copy of SAS log (if available): copy_sas_blog.txt
Node Name: Job_C
Start Time: Never started
Job Completed: false
Possible Errors: 0
Copy of SAS log (if available): copy_sas_clog.txt
SearchHistory:
DailyReports
Job_A
DailyReports
Job_B
DailyReports
Job_C
DailyReports
Job_A
DailyReports
Job_B
DailyReports
Job_C
DailyReports
Job_A
DailyReports
Job_B
DailyReports
Job_C
DailyReports
[note 1]
Up to this point, Run 2 is proceeding the same as Run 1. However, in this case, Job_C is unable to meet one of the preconditions (disk space) and does not attempt to launch SAS_C.
[note 2]
Job_A connects to the socket server and requests a copy of the SAS_A log file. This is returned, stored on local disk, and scanned for errors. Divide-by-zero errors are detected and noted in the Java log file.
[note 3]
Job_C fails to start again because of another of its preconditions (available CPU).
[note 4]
DailyReports has timed out (its loop counter is greater than or equal to 10) and a warning report is written. SAS_A completed but with errors. The local copy of the SAS log file is available to more fully explore the errors. SAS_B has completed successfully without errors. SAS_C never met its preconditions and never started.
Read Your SAS License
Read your SAS license and be sure that it allows you to do whatever you are planning!
Appendix: Program Downloads
The Business Process Management (BPM) Package and Metadata Package software that provide the tree, tree search, and many other features can be downloaded from this location. Installation instructions are also at this location. Other examples and complete documentation are included with the download.
Source code for the Java and SAS programs used in this example can be downloaded in this zip file. The files should be unzipped and placed in a new examples directory at:
...com\dymondassoc\xomj\examples\dailyreports
Note that basic Java programming skills are required to understand and use these downloads. The socket server and the individual node controllers (Job_A, Job_B, and Job_C) are based on the SAS server and Java client program described in the technical paper SAS Socket Server.