Whenever I mention a job log, I always hear a lot of groans from people. They have met the beast. The number of job logs the system creates is huge, and no one wants to read them. But the job log contains some good information that can help you. Let me describe what a job log is, how it works, and what to look for.
To determine the source of a problem on a system, experienced programmers usually require a job log. Often, I am unable to figure out some problem a user has told me about unless I see the job log. There are just too many things that can be misinterpreted or left unsaid when dealing over the phone. Having the job log in front of you can solve a lot of problems.
A job log is nothing more than a listing of the messages that have been issued during the execution of a given job. Every time a program is called, the system creates a unique program message queue. It is perfectly normal for system commands or user applications to send messages to these program message queues.
While the system keeps track of the messages in different program message queues within a job, the job log is written in strict chronological sequence. You can remove some of these messages (see later discussion), or you can just let them accumulate. At the end of your job, the job attributes and the way in which the job ended (normal or abnormal) determine whether the job log is produced as a spooled file or thrown away.
To tell the system to log, use the job description's LOG parameter on the Submit Job (SBMJOB) command, or use the Change Job (CHGJOB) command. The default is LOG(4 0 *NOLIST), which means that you get the maximum logging, but you don't get a job log if the job completes normally. If the job completes abnormally, the system writes a spooled file, which is the job log. The definition of a "normal completion" is also controlled by parameters.
In general, I don't want a job log if the job completes normally, and I want all the help I can get if the job ends abnormally. The default of LOG(4 0 *NOLIST) gives me this.
The default can be changed. The first parameter controls the amount of logging, and the second parameter controls the severity level of the message that will be logged (0 = log everything). The third parameter is the only one I ever change. If I want to force a job log in a batch job, I use LOG(40*SECLVL).
It should be noted that the system sends messages regardless of what the logging level is. So you can't really improve runtime performance by specifying different LOG values. The only control you have from a performance viewpoint is whether a job log is written and how much information goes into the job log. From a performance viewpoint, you are controlling only what happens at the end of the job, not what occurs during the execution of the job. Most users just take the defaults or use LOG(4 0 *SECLVL). Unless you are very familiar with the other options, you can make a mess of the situation and lose the job log information when you really need it.
Another form of logging occurs with the LOGCLPGM parameter. This causes more logging to occur and does impact the performance of your job. I use this function only to assist in debugging as opposed to production work.
If you take the defaults, a few things normally cause a job log to be produced:
o A batch job has an unmonitored escape message. There are ways to avoid this, but usually you want the job log to help find the problem.
o Someone ends a job external to the job (or ends a subsystem). While there is an option on ENDJOB to delete the job's spooled files (DLTSPLF parameter), this option does not control the production of a job log.
o Someone signs off with LOG(*LIST). This forces the job log for an interactive job. Note that an interactive job that has an unmonitored escape message will not generate a job log, even though you may want one. System-provided displays and some user menus may monitor for every condition and prevent the job from terminating abnormally. The SIGNOFF command LIST parameter is what controls job log creation. This is a confusing point, so let me reiterate: Using LOG(4 0 *SECLVL) on an interactive job has no impact on whether you get the job log. It's the SIGNOFF command's LOG parameter that determines if a job log is produced. For batch jobs, specifying LOG(4 0 *SECLVL) does force a job log.
You can clean out the messages that go into the job log by one of several techniques:
Removing a single message. To remove a single message, use the following command:
RCVMSG MSGTYPE(*LAST)
You can use the Receive Message (RCVMSG) command right after the command that produced the message you want to remove from the job log. RCVMSG defaults to RMV(*YES), which removes the message. The MSGTYPE(*LAST) option will work for any type of message, including escape messages.
Removing all messages in a program message queue. If your program works successfully and you don't want the job log to include any of the messages sent to your program message queue, use this command:
RMVMSG CLEAR(*ALL)
A high-level language (HLL) program such as RPG may also receive messages sent by the system, such as "Duplicate key" or "Buffer too large." You can remove these messages with an API (QMHRCVPM or QMHRMVPM), or you can use the HLRMVMSG tool, which is in both QUSRTOOL and the TAA Productivity Tools. When your program is ending, just do a Call:
CALL 'HLRMVMSG'
This program performs the same function as RMVMSG CLEAR(*ALL). Note that the system creates a unique program message queue each time a program is called. If you have an RPG program that is ending with LR off, you can clear the messages associated with the current Call onlybyusingHLRMVMSG.
Removing all inactive messages. Once a program ends, there is no system function to get rid of the messages for a single program message queue. These messages become low-level messages, and you can do nothing with them. The system supports PGMQ(*ALLINACT) on RMVMSG to allow you to clear all messages from all inactive programs. This is a drastic measure, and I would use it only in a batch program in which I have complete control.
Job logs always follow the same format. One is shown in Figure 1 for you to refer to. The heading of a job log is made up of three lines that give the important overview information, such as system name, what release you are on, date and time, job name, and the job description the job started with.
The main body of the job log is made up of messages. Each message follows the same format:
o Message ID. Most of the messages sent by the system have a message ID. When you enter a command, it becomes a special message type termed "request." Request messages do not have a message ID. Messages sent from applications may or may not have a message ID. If a message ID exists, it is a key piece of information, as it can really help identify exactly what happened.
o Message Type. Every message has a type. The information type is just what it says. Depending on the message, it can be very important or not meaningful at all. The request type is used for commands. The request message is the command that was executed either interactively or in batch. A completion type is usually sent when a function completes normally, but not all commands send a completion message. A diagnostic type is usually either a warning type or a specific error condition that precedes an escape type message. A warning type of message indicates that the system has made an assumption for you.
An escape message indicates that some command failed, and this is usually what you want to look for. What confuses people is that an escape message will exist in the job log even if a program has monitored for the message. So not every escape message indicates a problem. To avoid confusion, I like to remove this kind of message.
A sender copy is usually associated with the CPA0701 inquiry message, which tells you that an unmonitored exception occurred. The good thing about this message is that it usually shows which high-level statement number your program blew up on.
If the statement number is *N (null), the program is not observable. You need to have the program observable to allow the system to determine the statement number.
You should see a reply type of message following CPA0701, which is the response to the
message. The reply may have come from an interactive user in the same job, from the operator if it is a batch job, or as a default response. For example, if the job is set to INQMSGRPY(*DFT), the message default will be used if an inquiry message occurs. If the job is set to INQMSGRPY(*SYS-RPYL), you will get the response as defined in the System Reply List for the message.
o Severity. This describes the severity of the message, which is determined by the SEV parameter on the Send Program Message (SNDPGMMSG) command. I rarely look at this information.
o Date and Time. Since the job log occurs in strict time sequence, you can use this information for some performance feedback. I rarely look at the date and time information.
o From and To Program Information. You get to see the program that sent the message and the program that received it. Note that the library where the program exists is shown. This is important because you can get confused by the library list and access the wrong version of the program.
The instruction information refers to a unique number assigned to each instruction in your program. Unfortunately, this is an internal number and isn't very meaningful to user applications. It's helpful to IBM but not the rest of us.
The following message IDs will appear in most job logs:
o CPF1124. This is the job start message. It repeats some of the information from the job log heading. It also describes the subsystem the job ran in, which can be important.
o CPF1125. This is for batch jobs invoked by the Submit Job (SBMJOB) command. It provides information about the parameters that were used on the SBMJOB command. You see the attributes the job started with, so it is a combination of the job description and the parameters that were specified on SBMJOB. This is the place to look if you are having trouble with job attributes such as the library list or the default output queue. Of course, you can change this information with commands like Change Job (CHGJOB) and Change Library List (CHGLIBL).
o CPC2402. If a batch job ends abnormally, you see CPC2402, which tells you that the job ended because it exceeded the severity. There isn't any specific information on this message, so I don't find it very helpful. It does tell you to correct the errors.
o Deletions. If you create objects into QTEMP, you will see the cleanup completion messages at the end of the job log where the system deletes the objects in QTEMP. These are rarely meaningful. The messages started to appear a few releases ago when the system made a change to avoid QTEMP objects from having to be cleaned up by RCLSTG if an abnormal system termination occurred.
o CPF1164. This should be the last message in your job log. It describes how the job completed. The same message is also sent to QHST and includes some simple job accounting information, such as how many seconds of CPU time the job used. You can analyze the information by using the QHST files or with a tool like PRTQHSTANL in QUSRTOOL or the TAA Productivity Tools.
When a job terminates abnormally, I display the job log and immediately get to the last page by entering "b" in the Control field (rather than rolling forward from page 1). Then, I start rolling backward past the following messages:
o CPF1164 job completion message
o Anything deleted from QTEMP messages
o The CPC2402 job ended message
I scroll through the messages until I run into something of interest, like a CPA0701 sender copy message, the last escape message, or a diagnostic message. This is the stuff that bad programs are made of.
Then comes the art form of debugging, which is to figure out what to do about the problem. Often, the problem is clear once you see the error. Sometimes, it is not. Sometimes, the real problem is further back in the job log, where some weird thing occurred. Sometimes, you can't figure it out by looking at the job log.
On complex batch jobs, I may send myself little audit trail messages to say where I am in the program. I usually start these with some special characters so they are easy to scan for or spot in the job log. For example, my CL code might say:
SNDPGMMSG MSG('%%% Beginning the update program')
If my program has aborted because of some bad value, I may want to send the job log a message to describe the value, just in case it blows up again. There are all kinds of creative things you can do to make the job log more meaningful when you need it.
Sometimes, the job log contains an escape message that indicates an obvious problem. However, if the job continued past this point, some program probably monitored for the error condition but did nothing about it. Since you can usually determine the program name the message was sent to, you can fix the program to either abort on the condition or provide for recovery.
Most of the message text includes enough information to determine which command failed and which program it was sent from. Sometimes, the message gets bubbled up from one program to another, and the "To program" you see in the job log is not the one that initially experienced the problem.
Some of these can be tough to track down. Sometimes, you can guess at the command and minimize your search. Sometimes, you will be dealing with a message like CPF9801, which can be sent from a variety of commands. If you are dealing with a message ID and you are not sure of what command could send the message, use the Display Message Use (DSPMSGUSE) command in the TAA Productivity Tools to see the possible commands.
Some of the ugliest job logs to review are those caused by Save (SAV) or Restore (RST) commands. These commands primarily do what they can and then report (by messages in the job log) what worked and what did not work. Consequently, a lot of messages can be produced, and it is easy to miss a problem.
A good solution is to net out the job log when you deal with SAV and RST commands. You can
use the outfile option on these commands (not necessarily easy to work with) and look for certain conditions, or you can use the CHKSAVRST tool in QUSRTOOL and the TAA Productivity Tools. This tool allows you to net the job log so that you see only important SAV/RST messages.
You can convert any job log to a program-described file by using the Copy Spooled File (CPYSPLF) command. Then, you can do normal database functions, such as saving the job log. You can reprint the job log by using Copy File (CPYF) with TOFILE(*PRINT).
I packaged these together for you with the Copy From Output Queue (CPYFRMOUTQ) tool in QUSRTOOL and the TAA Productivity Tools. The CPYFRMOUTQ command converts all the spooled files in an output queue to a database file with a separate member per spooled file. The companion command Copy To Output Queue (CPYTOOUTQ) converts one or all members of the database file back to spooled files.
Converting to an externally described file will make it simpler for you to process the data that is in a spooled file. The Display Job Log (DSPJOBLOG) command allows you to make a database file of the job log. Unfortunately, you can do this only to the current job. To get this to happen when you really need it, you probably need some standard exception-handling routine that gets invoked when there is a problem and converts the job log.
An alternative solution is the Copy Job Log (CPYJOBLOG) tool in QUSRTOOL or the TAA Productivity Tools. It will work on any job log, whether it is in your own job or not.
The system keeps track of every job with a unique control block. The job control block remains in existence as long as some spooled file created by the job still remains unprinted, held, or saved on the system. So, anytime you have a job log, the job control block is still hanging around. You need to delete job logs regularly with a function like the OA (Operational Assistant) cleanup routines or the Delete Old Spooled Files (DLTOLDSPLF) tool (in QUSRTOOL or the TAA Productivity Tools). When the last spooled file for a job is deleted, the system automatically deletes the job control block.
Some users need or want to keep job logs for all jobs for some period of time. This not only takes up a lot of spool space, but also can use a lot of system space, because the job control blocks are still active.
If you have this requirement, take a look at the Job Log Control (JLGCTL) tool in the TAA Productivity Tools. It will convert job logs into database file members and allow you to manage the whole process of converting, finding what you want, reprinting, and cleanup in a more orderly manner. An option on the Convert Job Log Control (CVTJLGCTL) command allows you to delete a job log after you have converted it.
One of the potential problems with interactive jobs is that you may not get a job log when you need one. The job may be monitoring for any error (CPF0000) and preventing the job from abnormally terminating. The user may see a message and just ignore it. Nothing forces the job log unless the user says SIGNOFF LOG(*LIST).
A solution for this problem is the Set Job Log (SETJOBLOG) tool in QUSRTOOL or the TAA Productivity Tools. The tool allows you to issue the SETJOBLOG command whenever your code senses a problem in which you want to ensure that a job log will exist. SETJOBLOG
creates a data area in QTEMP with a special value. When the user is ready to sign off, you use the Force Job Log (FRCJOBLOG) command, which checks the data area in QTEMP. If the data area is set, FRCJOBLOG ends with SIGNOFF LOG(*LIST). If it is not set, FRCJOBLOG ends with SIGNOFF LOG(*NOLIST). The tool also works in batch, but, instead of using SIGNOFF, the FRCJOBLOG command just sets the LOG parameter to cause a job log.
Job logs are not a good choice if you need a permanent audit trail of events. You cannot guarantee a job log will occur. Or someone may delete the job log before you get to it. If you want a permanent audit trail, a journal is a much safer function to use. You can write entries to a journal with the Send Journal Entry (SNDJRNE) command.
After working with job logs for a while, I think they are a good system solution. Job logs usually have what you need to figure out what went wrong. While the net impact is one of overkill, I'm not sure what to suggest to provide a better solution. Therefore, my conclusion is to learn how to live with the beast.
Jim Sloan, president of Jim Sloan, Inc., is a software vendor and consultant. A retired IBMer, Sloan was a software planner on the S/38 when it began as a piece of paper. He also worked on the planning and early releases of the AS/400. In addition, Sloan wrote the TAA tools that exist in QUSRTOOL and is now the owner of the TAA Productivity Tools product. He has been a speaker at COMMON for many years.
Every copy of OS/400 comes with the QUSRTOOL library, which contains the source code for many useful programmer/operator tools. The QUSRTOOL library must be unpackaged, and each tool within the library must be created. For information about how to unpackage the library and create the tools, see source member TTTINFO in source file QATTINFO in library QUSRTOOL.
Many of the tools in the QUSRTOOL library have been removed in OS/400 releases beyond V3R1 (e.g., V3R2 and V3R6), including a large group of tools known as the TAA Tools. For a complete list of the tools that have been removed, see source member AAAAREADME in source file QATTINFO in library QUSRTOOL.
If you are at a release beyond V3R1, you may not be able to find the tools mentioned in this article in the QUSRTOOL library. The tools, however, can be obtained from TAA Productivity Tools, a licensed product from Jim Sloan, Inc.
For more information about TAA Productivity Tools, contact:
Jim Sloan, Inc.
c/o Barsa Consulting Group, Inc.
Tel: 914-251-9494; Fax: 914-253-9413
Figure 1: Sample job log
5769SS1 V4R1M0 970829 Job Log TAASYS5 10/23/97
14:57:00 Page 1
Jobname..........: QDFTJOBD User ......: SLOAN Number.....
......: 029351
Jobdescription ......: QDFTJOBD Library.....: QGPL
MSGID
TYPE SEV DATE TIME FROM PGM LIBRARY INST TO
PGM LIBRARY INST
CPF1124 Information 00 10/23/97 14:56:58 QWTPIIPP QSYS 0539
*EXT *N
Message....: Job029351/SLOAN/QDFTJOBD started on 10/23/97
at 14:56:58
in subsystem QBASE in QSYS. Job entered system on 10/23/97 at
14:56:58.
CPI1125 Information 00 10/23/97 14:56:58 QWTPIIPP QSYS 0255
*EXT *N
Message....: Job029351/SLOAN/QDFTJOBDsubmitted.
Cause.....: Job029351/SLOAN/QDFTJOBDsubmittedtojob
queue QBATCH in
QGPL from job 029317/SLOAN/DSP01. Job 029351/SLOAN/QDFTJOBD was
started
using the Submit Job (SBMJOB) command with the following job
attributes:
JOBPTY(5) OUTPTY(5) PRTTXT() RTGDTA(QCMDB) SYSLIBL(QSYS
QSYS2
QHLPSYS QUSRSYS QRPG) CURLIB(SLOANT) INLLIBL(QGPL
QTEMP
TAABLDCDE TAATSTCDE TAATOOL) LOG(4 00 *NOLIST) LOGCLPGM(*NO)
INQMSGRPY(*DFT) OUTQ(SLOANT/SLOANT) PRTDEV(PRT01) HOLD(*NO)
DATE(*SYSVAL)
SWS(00000000) MSGQ(QUSRSYS/SLOAN) CCSID(65535) SRTSEQ(*N/*HEX)
LANGID(ENU)
CNTRYID(US).
*NONE Request 10/23/97 14:56:58 QWTSCSBJ *N
QCMD QSYS 0142
Message....: -CALLPGM(TSTCLP)
CPF1615 Information 00 10/23/97 14:56:58 QWDGMSG QSYS 00A8
TSTCLP SLOANT 000D
Message....: JobqueueQBATCHfoundinlibraryQGPL.
CPC1611 Completion 00 10/23/97 14:56:58 QWDGMSG QSYS 00A8
TSTCLP SLOANT 000D
Message....: JobdescriptionXXXcreatedinlibraryQTEMP.
CPD0170 Diagnostic 30 10/23/97 14:56:58 QCLCLCPR QSYS 0396
TSTCLP SLOANT 0011
Message....: ProgramTYSYSYSYinlibrary*LIBLnotfound.
Cause.....: TheprogramspecifiedontheCALLcommand can
not be
found.Recovery ...: Verifythespellingoftheprogram
and library
names. Correct the names and try the CALL command again.
CPF0001 Escape 30 10/23/97 14:56:58 QCLCLCPR QSYS 043D
TSTCLP SLOANT 0011
Message....: ErrorfoundonCALLcommand.
Cause.....: Thesystemdetectederrorsinthecommand.
Recovery . . .
: See the previously listed messages in the job log. Correct
the errors
and then try the command again. *N instead of a command name
means that the
name had not been determined before the error was found.
CPA0701 Sender copy 99 10/23/97 14:56:59 QCLXERR QSYS 00D8
QCLXERR QSYS 00D8
Message....: CPF0001receivedbyTSTCLPat500.(CDIR)
Cause.....: Controllanguage(CL)programTSTCLPin
library SLOANT
detected an error at statement number 500. Message text for
CPF0001 is:
ErrorfoundonCALLcommand.Recovery ...: Thisinquiry
message can be
avoided by changing the program. Monitor for the error (MONMSG
command) and
perform error recovery within the program. To continue, choose
a reply
value.Possiblechoicesforreplyingtomessage........
......
. : C - Cancel the CL program. D - Dump the CL program
variables and
cancel the CL program. I - Ignore the failing command. R - Try
the failing
command again.
*NONE Reply 10/23/97 14:56:59 QMHSNINQ QSYS 0A0E
QCLXERR QSYS 00D8
Message....:
C
CPF9999 Escape 40 10/23/97 14:56:58 QMHUNMSG *N
QCMD QSYS 016E
Message....: Functioncheck.CPF0001unmonitoredbyTSTCLP
at statement
500, instruction X'0011'.
Cause.....: Anescapeexceptionmessagewassenttoa
program which
did not monitor for that message. The full name of the program
to which the
unmonitored message was sent is TSTCLP . At the time the
message was sent
the program was stopped at higher level language statement
number(s) 500. If
more than one statement number is shown, the program was a
bound program.
5769SS1 V4R1M0 970829 Job Log TAASYS5 10/23/97
14:57:00 Page 2
Jobname..........: QDFTJOBD User ......: SLOAN Number.....
......: 029351
Jobdescription ......: QDFTJOBD Library.....: QGPL
MSGID TYPE SEV DATE TIME FROM PGM LIBRARY INST TO
PGM LIBRARY INST
Optimization does not allow a single statement number to be
determined. If
*N is shown as a value, it means the actual value was not
available.
Recovery ...: Seethelowlevelmessagespreviously
listed to locate
the cause of the function check. Correct any errors, and then
try the
request again.
CPC2402 Completion 50 10/23/97 14:56:59 QCMD QSYS 050C
*EXT *N
Message....: Jobended.Cancel message received at command
processor.
Cause.....: Amessagewithaseverityequaltoor
exceeding the end
severity was received at the command processor. Recovery . . .
:Seethe
messages previously listed to determine the message that caused
the job to
be ended. Correct the errors, and then try the request again.
CPC2191 Completion 00 10/23/97 14:56:59 QLIDLOBJ QSYS 0326
QLICLLIB QSYS 0216
Message....: ObjectXXXinQTEMPtype*JOBDdeleted.
CPF1164 Completion 00 10/23/97 14:56:59 QWTMCEOJ QSYS 0212
*EXT *N
Message....: Job029351/SLOAN/QDFTJOBDendedon10/23/97
at 14:56:59; 1
seconds used; end code 20 .
Cause.....: Job029351/SLOAN/QDFTJOBDcompletedon
10/23/97 at
14:56:59 after it used 1 seconds processing unit time. The job
had ending
code 20. The job ended after 1 routing steps with a secondary
ending code of
0. The job ending codes and their meanings are as follows: 0
-Thejob
completed normally. 10 - The job completed normally during
controlled ending
or controlled subsystem ending. 20 - The job exceeded end
severity (ENDSEV
job attribute). 30 - The job ended abnormally. 40 - The job
ended before
becoming active. 50 - The job ended while the job was active.
60-The
subsystem ended abnormally while the job was active. 70 - The
system ended
abnormally while the job was active. 80 - The job ended
(ENDJOBABN command).
90 - The job was forced to end after the time limit ended
(ENDJOBABN
command).Recovery ...: Formoreinformation,seethe
Work Management
book, SC41-5306.
LATEST COMMENTS
MC Press Online