Some organizations utilize numerous error logging and error notification techniques on the AIX RS/6000 platform. Described here is a technique which will separate error logging from error notification, and will consolidate all error logging into a single system. The separation of error logging from error notification will simplify the task of reporting errors from the operating system, applications, programs, scripts, databases, etc. The separation of error logging will result in a standardized method of reporting errors from all sources. The system or application administrator only needs to know how to report an error and does not need to consider how the responsible parties will be notified of an error.
The error notification system may still contain several mechanisms by which notification is performed, but will be fed by the single consolidated error logging system. The error notification system will utilize a variety of mechanisms including e-mail, CONTROL-M, SuperVision, BMC Patrol, etc.
Consolidating the error notification systems into AIX error logging requires an error message template with predefined data records. These records may then be parsed by a subsequent shell script or program. The following records were identified as ones that every error message should contain: when a script encounters an error, each record will be part of the error message sent to the "errlogger" program. The fields within each record are separated by the colon (:) character; the first field is the record identifier and the second field is the data. Sample data is shown in the example below:
Error Message Type: 0-TotalOutage|1-Critical|2-Urgent|3-Warning|4-Info
Error Notification Contact: Opensystems on-call|SAP on-call|...
Error Notification Time: Normal office hours|Immediate|Day Time Hours
Error Component Class: Hardware|Software
Error Component Name: AIX|OMS|EXE|Manugistics|Mercator|MQSeries|...
Error Return Code: 0-255
Error Label:
Error Description:
Error Email Address:
Supervision Group Name: opensys|sapsys|...
Valid values for the "Error Component Class" record are "Hardware" and "Software". We may add other classes as needed, such as "Firmware".
If other Supervision groups are required, they will need to be preconfigured into the system. Dave Webster is the person responsible for configuring these groups.
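Because every record has the form "identifier: data", a small helper can pull one record out of a logged message. The following is a minimal sketch rather than part of the system described here: the function name get_record is hypothetical, and it assumes each record occupies a single line. It splits only on the first colon, so data that itself contains a colon is kept whole, and it trims the blank that follows the colon.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original system): print the data
# field of the first record whose identifier matches $1, ignoring case.
# The error message is read from standard input, one record per line.
get_record() {
    awk -v id="$1" '
        BEGIN { id = tolower(id) }
        {
            pos = index($0, ":")          # split on the FIRST colon only
            if (pos == 0) next            # not a record line
            name = substr($0, 1, pos - 1)
            data = substr($0, pos + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", name)   # trim surrounding blanks
            gsub(/^[ \t]+|[ \t]+$/, "", data)
            if (tolower(name) == id) { print data; exit }
        }'
}

# Example usage with two sample records
MESSAGE='Error Message Type: 1-Critical
Error Email Address: dfrench@mtxia.com'
printf '%s\n' "${MESSAGE}" | get_record "Error Message Type"
```

The example prints "1-Critical". Requesting a record that is not present prints nothing, which a calling script can test for with an empty-string comparison.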
The following 3 records are added to the error message during the notification phase. These do NOT need to be part of the error logging phase:
Machine Class: RS/6000
Machine Type: $( lsattr -El sys0 -a modelname | awk '{ print $2}' )
Operating System: AIX $( oslevel )
The following is an example snippet of code to log an error from a shell script. This example logs a "file system full" error from one of the OMS machines:
...blah...
...blah...
...blah...
errlogger "
Error Message Type: 1-Critical
Error Notification Contact: Opensystems on-call
Error Notification Time: Normal office hours
Error Component Class: Software
Error Component Name: AIX
Error Return Code: 1
Error Label: File system Full
Error Description: The file system /home is more than 90% full. Please remove unneeded files or increase the size of the file system to correct this problem.
Error Email Address: dfrench@mtxia.com
Supervision Group Name: opensys
"
...blah...
...blah...
...blah...
Notice that the system name and the date/time were NOT included in the error message. This is because the system name and date/time are automatically inserted with the message when it is added to the error logging system.
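Repeating the full record block in every script invites typos in the record identifiers, which the notification phase depends on. One way to reduce that risk is a shared helper, sketched below under stated assumptions: the names format_error_message and log_error are hypothetical, the ten arguments follow the template order given earlier, and errlogger is assumed to live at /usr/bin/errlogger. On a system without errlogger the wrapper prints to standard error so the sketch can be tried elsewhere.

```shell
#!/bin/sh
# Hypothetical helper: build the ten-record message in template order.
# Arguments: type contact time class name return-code label
#            description email supervision-group
format_error_message() {
    printf 'Error Message Type: %s\n' "$1"
    printf 'Error Notification Contact: %s\n' "$2"
    printf 'Error Notification Time: %s\n' "$3"
    printf 'Error Component Class: %s\n' "$4"
    printf 'Error Component Name: %s\n' "$5"
    printf 'Error Return Code: %s\n' "$6"
    printf 'Error Label: %s\n' "$7"
    printf 'Error Description: %s\n' "$8"
    printf 'Error Email Address: %s\n' "$9"
    printf 'Supervision Group Name: %s\n' "${10}"
}

# Hypothetical wrapper: hand the message to errlogger when it is present,
# otherwise write it to standard error (useful when testing off AIX).
log_error() {
    msg=$(format_error_message "$@")
    if [ -x /usr/bin/errlogger ]; then
        /usr/bin/errlogger "
${msg}"
    else
        printf '%s\n' "${msg}" >&2
    fi
}

# Example usage: show the message that would be logged
format_error_message "1-Critical" "Opensystems on-call" \
    "Normal office hours" "Software" "AIX" "1" "File system Full" \
    "The file system /home is more than 90% full." \
    "dfrench@mtxia.com" "opensys"
```

A calling script would invoke log_error with the same ten arguments in place of the inline errlogger call.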
To configure the AIX Error Logging system, perform the following steps.
Create a file named "/tmp/operator.add" containing the following Error Notification object:
errnotify:
        en_label = OPMSG
        en_type = TEMP
        en_name = OPERATOR
        en_class = "O"
        en_method = "/home/bin/errnotify.ksh $1 $2 $3 $4 $5 $6 $7 $8 $9"
To add the object to the Error Notification object class, enter:
odmadd /tmp/operator.add
The odmadd command adds the Error Notification object contained in "/tmp/operator.add" to the errnotify file.
To verify that the object was added, enter:
odmget -q"en_name='OPERATOR'" errnotify
The odmget command locates the Error Notification object within the errnotify file that has an en_name value of "OPERATOR" and displays the object. The following output is returned:
errnotify:
        en_pid = 0
        en_name = "OPERATOR"
        en_persistenceflg = 0
        en_label = "OPMSG"
        en_crcid = 0
        en_class = "O"
        en_type = "TEMP"
        en_alertflg = ""
        en_resource = ""
        en_rtype = ""
        en_rclass = ""
        en_symptom = ""
        en_method = "/home/bin/errnotify.ksh $1 $2 $3 $4 $5 $6 $7 $8 $9"
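The create, add, and verify steps above can be combined into one script. This is a sketch only: the function name write_operator_stanza and the STANZA variable are hypothetical, and the check before odmadd assumes that odmget prints nothing when no matching object exists. The check is there because odmadd is understood to add the object each time it runs, so a second run would otherwise leave a duplicate. The ODM commands are skipped on systems where they are not installed.

```shell
#!/bin/sh
# Hypothetical helper: write the Error Notification stanza to the file
# named by $1. The here-document delimiter is quoted so that $1 ... $9
# are written literally instead of being expanded by this shell.
write_operator_stanza() {
    cat > "$1" <<'EOF'
errnotify:
        en_label = OPMSG
        en_type = TEMP
        en_name = OPERATOR
        en_class = "O"
        en_method = "/home/bin/errnotify.ksh $1 $2 $3 $4 $5 $6 $7 $8 $9"
EOF
}

# STANZA is a hypothetical variable; the default is the path used above.
STANZA="${STANZA:-/tmp/operator.add}"
write_operator_stanza "${STANZA}"

# odmadd and odmget exist only on AIX; do nothing elsewhere.
if command -v odmadd >/dev/null 2>&1; then
    # Add the object only when no OPERATOR object is registered yet.
    if [ -z "$(odmget -q"en_name='OPERATOR'" errnotify 2>/dev/null)" ]; then
        odmadd "${STANZA}"
    fi
    # Display the object to confirm it is present.
    odmget -q"en_name='OPERATOR'" errnotify
fi
```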
To delete the object, enter:
odmdelete -q"en_name='OPERATOR'" -o errnotify
The odmdelete command locates the Error Notification object within the errnotify file that has an en_name value of "OPERATOR" and removes it from the Error Notification object class.
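The en_method in the Error Notification object passes nine arguments, $1 through $9, to the notification script. According to the AIX error notification documentation these are, in order, the sequence number, error ID, class, type, alert flags, resource name, resource type, resource class, and error label of the logged entry; this should be confirmed against the documentation for the AIX release in use. The sketch below uses a hypothetical function, describe_notify_args, to label them, and the sample values in the call are illustrative only.

```shell
#!/bin/sh
# Hypothetical helper: label the nine arguments an en_method receives.
# The ordering follows the AIX error notification documentation as the
# editor understands it; verify it for your AIX release.
describe_notify_args() {
    printf 'Sequence number: %s\n' "$1"
    printf 'Error ID: %s\n' "$2"
    printf 'Error class: %s\n' "$3"
    printf 'Error type: %s\n' "$4"
    printf 'Alert flags: %s\n' "$5"
    printf 'Resource name: %s\n' "$6"
    printf 'Resource type: %s\n' "$7"
    printf 'Resource class: %s\n' "$8"
    printf 'Error label: %s\n' "$9"
}

# Example usage with made-up values
describe_notify_args 42 AA8AB241 O TEMP FALSE OPERATOR NONE NONE OPMSG
```

Only the first argument, the sequence number, is needed to retrieve the full entry with "errpt -a -l".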
The error logging program is called "/usr/bin/errlogger" and will exist on every system. However, this program is normally configured so that only the "root" user can execute it. The permissions on this program must be changed to allow any user to execute it. Log in as "root" and change the permissions as follows:
chmod 555 /usr/bin/errlogger
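The permission change can be confirmed before relying on it from application scripts. The following sketch uses a hypothetical function, world_executable, which inspects the mode string printed by "ls -ld" and succeeds only when the execute bit for "other" users is set. It assumes the usual ten-character mode string (for example -r-xr-xr-x) as the first field of the ls output.

```shell
#!/bin/sh
# Hypothetical helper: succeed when the "other" execute bit is set on $1.
# It reads the mode string printed by ls -ld, for example -r-xr-xr-x.
world_executable() {
    perms=$(ls -ld "$1" 2>/dev/null | awk '{ print $1 }')
    case "${perms}" in
        ?????????[xt]*) return 0 ;;   # 10th character is x (or t)
        *)              return 1 ;;   # missing file or bit not set
    esac
}

# Example usage
if world_executable /usr/bin/errlogger; then
    echo "errlogger can be run by any user"
else
    echo "errlogger is missing or not executable by all users"
fi
```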
The following script is the error notification script. This script performs a number of different notification methods such as CONTROL-M, Supervision, E-mail, etc.
#!/bin/ksh
################################################################
#
# Program: errnotify.ksh
#
# Description: Accepts incoming error messages from the AIX
#              Standard Error log and uses the Acme standard
#              notification mechanisms to notify groups or
#              individuals.
#
# Author: Dana French
#
# Date: 01/28/2002
#
################################################################

typeset -L10 JOBNAME
typeset -L50 DESCRIPTION
TMPSCRIPT="/tmp/ctmscript${$}.tmp"
TMPOUT="/tmp/tmp${$}.out"

################################################################
# Extract the full context of the error message from the
# AIX standard error log.

if errpt -a -l ${1} | sed -e "s/'/\\'/g" > ${TMPOUT}
then

################################################################
# Append the following information to the extracted error message.

    print "Machine Class: RS/6000
Machine Type: $( lsattr -El sys0 -a modelname | awk '{ print $2}' )
Operating System: AIX $( oslevel )" >> ${TMPOUT}
    chmod 666 ${TMPOUT}
else
    print "ERROR: Unable to extract error message from AIX error log" | tee /dev/console 1>&2
    exit 1
fi

################################################################
# Parse the records defined in the error message and extract
# various bits of information. This information is used to
# describe the person or persons who should be contacted and when
# they should be contacted. It also provides a description of the
# error generated.

grep -i 'Node ID:' ${TMPOUT} | awk -F: '{ print $2 }' | read -r -- NODEID
grep -i 'Error Notification Contact:' ${TMPOUT} | awk -F: '{ print $2 }' | read -r -- DESCPART1
grep -i 'Error Notification Time:' ${TMPOUT} | awk -F: '{ print $2 }' | read -r -- DESCPART2
grep -i 'Error Label:' ${TMPOUT} | awk -F: '{ print $2 }' | read -r -- DESCPART3
grep -i 'Error Email Address:' ${TMPOUT} | awk -F: '{ print $2 }' | read -r -- ERREMAIL
grep -i 'Supervision Group Name:' ${TMPOUT} | awk -F: '{ print $2 }' | read -r -- SVGROUP
grep -i 'Error Component Name:' ${TMPOUT} | awk -F: '{ print $2 }' | read -r -- COMPNAME
grep -i 'Error Return Code:' ${TMPOUT} | awk -F: '{ print $2 }' | read -r -- ERRORCODE

SHORTDESC="${NODEID}:ERR${$}:${DESCPART1}:${DESCPART2}:${DESCPART3}"

################################################################
# Create a shell script to print the content of the error
# message, remove itself, then exit with the error code
# from the program which generated the error.

if print "#!/bin/ksh
print -u2 -r -- '$( cat ${TMPOUT} )'
rm -f ${TMPSCRIPT}
exit ${ERRORCODE}" > "${TMPSCRIPT}"
then
    chmod 777 "${TMPSCRIPT}"
else
    print "ERROR: unable to create ${TMPSCRIPT}" | tee /dev/console 1>&2
    exit 2
fi

################################################################
# The ctmcreate parameters are defined here

export CONTROLM="/$( uname -n )/bmc/ctmagent/ctm"
CTMCREATE="${CONTROLM}/exe_AIX/ctmcreate"
MEMNAME=$( uname -n )_ERR${$}
GROUP="OPERATOR"
APPLICATION="ERRORS"
DATACENTER="FTW"
OWNER="root"
JOBNAME=JOB${$}
NODEGRP="$( uname -n )"
DESCRIPTION="${SHORTDESC}"
CMDLINE="/home/bin/ecsrun ${TMPSCRIPT}"
SHOUT="${SHORTDESC}"
SHOUT="TESTING ignore this message"
EXITCODE="0"

################################################################
# The ctmcreate command is executed here

${CTMCREATE} \
    -tasktype COMMAND \
    -group "${GROUP}" \
    -application "${APPLICATION}" \
    -nodegrp "${NODEGRP}" \
    -memname ${MEMNAME} \
    -jobname ${JOBNAME} \
    -owner "${OWNER}" \
    -description "${DESCRIPTION}" \
    -shout NOTOK ECS R "${SHOUT}" \
    -cmdline "${CMDLINE}"
STATUS="${?}"

################################################################
# If the Error Email Address record is defined, send the
# full content of the error message to the recipient defined.

if [[ "_${ERREMAIL}" != "_" ]]
then
    mail -s "${SHORTDESC}" ${ERREMAIL} < ${TMPOUT}
fi

rm -f ${TMPOUT}
exit 0