Some organizations use numerous error logging and error notification techniques on the AIX RS/6000 platform. This document describes a technique that separates error logging from error notification and consolidates all error logging into a single system. The separation simplifies the task of reporting errors from the operating system, applications, programs, scripts, databases, etc., and results in a standardized method of reporting errors from all sources. The system or application administrator only needs to know how to report an error, not how the responsible parties will be notified of it.

The error notification system may still use several mechanisms, including e-mail, CONTROL-M, SuperVision, and BMC Patrol, but all of them will be fed by the single consolidated error logging system.

Consolidating the error notification systems into the AIX error log requires an error message template with predefined data records, which can then be parsed by a shell script or program. Every error message should contain the records listed below: when a script encounters an error, each record becomes part of the error message sent to the "errlogger" program. Within each record, fields are separated by the colon (:) character; the first field is the record identifier and the second field is the data. Sample data is shown in the example below:


Error Message Type: 0-TotalOutage|1-Critical|2-Urgent|3-Warning|4-Info
Error Notification Contact: Opensystems on-call|SAP on-call|...
Error Notification Time: Normal office hours|Immediate|Day Time Hours
Error Component Class: Hardware|Software
Error Component Name: AIX|OMS|EXE|Manugistics|Mercator|MQSeries|...
Error Return Code: 0-255
Error Label: 
Error Description: 
Error Email Address: 
Supervision Group Name: opensys|sapsys|...
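
Because each record is a simple "identifier: data" pair, a single record can be pulled out of a message with standard tools. The following is a minimal sketch, not part of the original scripts: the function name get_record is hypothetical, and the code is written in portable POSIX shell so that it should also run under ksh. It splits on the first colon only, so data that itself contains a colon is preserved.

```shell
# get_record NAME
# Reads an error message on standard input and prints the data field
# of the first record whose identifier is NAME.
# (Hypothetical helper, for illustration only.)
get_record() {
    # Remove "NAME:" and any blanks that follow it, print only the
    # lines where that substitution happened, and keep the first one.
    sed -n "s/^$1:[[:space:]]*//p" | head -n 1
}
```

For example, piping a message through get_record 'Error Label' would print only the label text.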

Error Message Type:

This record indicates the criticality of the error. Five levels are currently defined, identified as 0-TotalOutage, 1-Critical, 2-Urgent, 3-Warning, and 4-Info.
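
A script could check the value before logging it. The following is a minimal sketch under the assumption that the five values listed in the template are the only ones accepted; the function name valid_msg_type is hypothetical.

```shell
# valid_msg_type VALUE
# Returns 0 if VALUE is one of the five levels named in the template,
# 1 otherwise. (Hypothetical helper, for illustration only.)
valid_msg_type() {
    case "${1:-}" in
        0-TotalOutage|1-Critical|2-Urgent|3-Warning|4-Info) return 0 ;;
        *) return 1 ;;
    esac
}
```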

Error Notification Contact:

The person, persons, or group to contact regarding the error should be identified using this record type, for example "Opensystems on-call" or "SAP on-call".

Error Notification Time:

This record defines when to contact the person, persons, or group responsible for resolving the error. The values shown in the template are "Normal office hours", "Immediate", and "Day Time Hours".

Error Component Class:

Currently, only two classes are defined: "Hardware" and "Software". We may add other classes as needed such as "Firmware".

Error Component Name:

This record identifies the failing component by its common name. This name should be short and provide immediate recognition of the failed component, such as AIX, OMS, EXE, Manugistics, Mercator, or MQSeries.

Error Return Code:

This is an arbitrary positive number from 1 to 255, supplied by the script writer or taken from the exit code of a compiled program. If no code is provided, it defaults to "1".
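
The default described above could be applied with a small helper. This is a minimal sketch, not the original implementation: the function name normalize_rc is hypothetical, and treating non-numeric or out-of-range values the same as a missing code is an assumption.

```shell
# normalize_rc [CODE]
# Prints CODE if it is a whole number from 1 to 255, otherwise prints
# the default of 1. (Hypothetical helper; falling back to 1 for
# invalid values, not only missing ones, is an assumption.)
normalize_rc() {
    rc=${1:-}
    case "$rc" in
        # Empty, or contains something other than digits.
        ''|*[!0-9]*) printf '1\n'; return 0 ;;
    esac
    if [ "$rc" -ge 1 ] 2>/dev/null && [ "$rc" -le 255 ] 2>/dev/null; then
        printf '%s\n' "$rc"
    else
        printf '1\n'
    fi
}
```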

Error Label:

A short description of the error. This description should be 64 characters or fewer and provide recognizable information about the error. It should not be a cryptic code of numbers and letters.
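
The length limit is easy to check before the message is logged. A minimal sketch follows; the function name check_label is hypothetical, and rejecting an empty label is an assumption rather than a stated rule.

```shell
# check_label TEXT
# Returns 0 if TEXT is non-empty and at most 64 characters long,
# 1 otherwise. (Hypothetical helper, for illustration only.)
check_label() {
    [ -n "${1:-}" ] && [ "${#1}" -le 64 ]
}
```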

Error Description:

This should be a detailed description of the error and provide specific information to the support personnel. The support personnel should be able to use the information provided in this description to help debug and diagnose the problem.

Error Email Address:

This record is optional. If defined, it causes the error notification system to send an email message containing the full text of the error message to all defined recipients. The data portion of this record should contain one or more valid email addresses.

Supervision Group Name:

This record is optional. If defined, it causes the error notification system to attempt to send a shout message to the Supervision console in the Operations Center. The data portion of this record should be the name of a predefined Supervision group, such as the "opensys" and "sapsys" groups shown in the template above.

If other Supervision groups are required, they will need to be preconfigured into the system. Dave Webster is the person responsible for configuring these groups.

The following 3 records are added to the error message during the notification phase. These do NOT need to be part of the error logging phase:


Machine Class: RS/6000
Machine Type: $( lsattr -El sys0 -a modelname | awk '{ print $2}' )
Operating System: AIX $( oslevel )

The following is an example snippet of code that logs an error from a shell script. It logs a "file system full" error from one of the OMS machines:


...blah...
...blah...
...blah...

errlogger "
Error Message Type: 1-Critical
Error Notification Contact: Opensystems on-call
Error Notification Time: Normal office hours
Error Component Class: Software
Error Component Name: AIX
Error Return Code: 1
Error Label: File system Full
Error Description: The file system /home is more than 90% full.  Please remove unneeded files or increase the size of the file system to correct this problem.
Error Email Address: dfrench@mtxia.com
Supervision Group Name: opensys
"

...blah...
...blah...
...blah...

Notice that the system name and the date/time were NOT included in the error message. This is because the system name and date/time are automatically inserted with the message when it is added to the error logging system.
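
Scripts that log many errors could wrap the template in a pair of functions so that every message has the same layout. The following is a minimal sketch under stated assumptions: the function names build_errmsg and log_error and the argument order are hypothetical, and the fallback of printing to standard error when errlogger is not installed exists only so the sketch can be tried on a non-AIX machine.

```shell
# build_errmsg TYPE CONTACT TIME CLASS NAME RC LABEL DESCRIPTION [EMAIL] [SVGROUP]
# Prints an error message in the record format described above.
# (Hypothetical helper; the argument order is an assumption.)
build_errmsg() {
    printf 'Error Message Type: %s\n' "${1:-}"
    printf 'Error Notification Contact: %s\n' "${2:-}"
    printf 'Error Notification Time: %s\n' "${3:-}"
    printf 'Error Component Class: %s\n' "${4:-}"
    printf 'Error Component Name: %s\n' "${5:-}"
    # An empty return code falls back to the documented default of 1.
    printf 'Error Return Code: %s\n' "${6:-1}"
    printf 'Error Label: %s\n' "${7:-}"
    printf 'Error Description: %s\n' "${8:-}"
    # The last two records are optional; emit them only when supplied.
    if [ -n "${9:-}" ]; then printf 'Error Email Address: %s\n' "$9"; fi
    if [ -n "${10:-}" ]; then printf 'Supervision Group Name: %s\n' "${10}"; fi
}

# log_error ARGS...
# Passes the formatted message to errlogger when it is available;
# otherwise prints it to standard error.
log_error() {
    if command -v errlogger >/dev/null 2>&1; then
        errlogger "$(build_errmsg "$@")"
    else
        build_errmsg "$@" >&2
    fi
}
```

With these in place, the "file system full" example reduces to one call to log_error with the ten values in order.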


To configure the AIX Error Logging system, perform the following steps.

  1. Create a file called "/tmp/operator.add" containing the following Error Notification object:

    
    
    errnotify:
         en_label = OPMSG
         en_type = TEMP
         en_name = OPERATOR
         en_class = "O"
         en_method = "/home/bin/errnotify.ksh $1 $2 $3 $4 $5 $6 $7 $8 $9"
    
    
    

    To add the object to the Error Notification object class, enter:

    
    odmadd /tmp/operator.add
    
    

    The odmadd command adds the Error Notification object contained in "/tmp/operator.add" to the errnotify file.

  2. To verify that the Error Notification object was added to the object class, enter:

    
    odmget -q"en_name='OPERATOR'" errnotify
    
    

    The odmget command locates the Error Notification object within the errnotify file that has an en_name value of "OPERATOR" and displays the object. The following output is returned:

    
    
    errnotify:
         en_pid = 0
         en_name = "OPERATOR"
         en_persistenceflg = 0
         en_label = "OPMSG"
         en_crcid = 0
         en_class = "O"
         en_type = "TEMP"
         en_alertflg = ""
         en_resource = ""
         en_rtype = ""
         en_rclass = ""
         en_symptom = ""
         en_method = "/home/bin/errnotify.ksh $1 $2 $3 $4 $5 $6 $7 $8 $9"
    
    

  3. To delete the OPERATOR Error Notification object from the Error Notification object class, enter:

    
    odmdelete -q"en_name='OPERATOR'" -o errnotify
    
    

    The odmdelete command locates the Error Notification object within the errnotify file that has an en_name value of "OPERATOR" and removes it from the Error Notification object class.
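
Step 1 above can also be scripted rather than typed into an editor. The following is a minimal sketch: the function name write_operator_stanza is hypothetical, and the stanza text is the one shown in step 1. The first argument passed to the notify method ($1) is the sequence number of the error log entry, which errnotify.ksh later hands to "errpt -l".

```shell
# write_operator_stanza FILE
# Writes the OPERATOR errnotify stanza from step 1 to FILE.
# (Hypothetical helper, for illustration only.)
write_operator_stanza() {
    # The quoted delimiter ('EOF') stops the shell from expanding
    # $1 through $9, which must reach the ODM as literal text.
    cat > "$1" <<'EOF'
errnotify:
        en_label = OPMSG
        en_type = TEMP
        en_name = OPERATOR
        en_class = "O"
        en_method = "/home/bin/errnotify.ksh $1 $2 $3 $4 $5 $6 $7 $8 $9"
EOF
}
```

Running write_operator_stanza /tmp/operator.add followed by odmadd /tmp/operator.add would then reproduce step 1.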

The error logging program is "/usr/bin/errlogger" and exists on every system. However, it is normally configured so that only the "root" user can execute it. The permissions on this program must be changed to allow any user to execute it. Log in as "root" and change the permissions as follows:


chmod 555 /usr/bin/errlogger


The following script is the error notification script. It supports several notification methods, such as CONTROL-M, Supervision, and e-mail.


#!/bin/ksh
################################################################
#
# Program:      errnotify.ksh
#
# Description:  Accepts incoming error messages from the AIX
# 		Standard Error log and uses the Acme standard
# 		notification mechanisms to notify groups or 
# 		individuals.
#
# Author:       Dana French
#
# Date:         01/28/2002
#
################################################################
typeset -L10 JOBNAME
typeset -L50 DESCRIPTION

TMPSCRIPT="/tmp/ctmscript${$}.tmp"
TMPOUT="/tmp/tmp${$}.out"

################################################################
# Extract the full context of the error message from the
# AIX standard error log.

if errpt -a -l ${1} | sed -e "s/'/\\'/g" > ${TMPOUT}
then
    ################################################################
    # Append the following information to the extracted error message.

    print "Machine Class: RS/6000
Machine Type: $( lsattr -El sys0 -a modelname | awk '{ print $2}' )
Operating System: AIX $( oslevel )" >> ${TMPOUT}

    chmod 666 ${TMPOUT}
else
    print "ERROR: Unable to extract error message from AIX error log" | tee /dev/console 1>&2
    exit 1
fi

################################################################
# Parse the records defined in the error message and extract
# various bits of information.  This information is used to
# describe the person or persons who should be contacted and when 
# they should be contacted.  It also provides a description of the 
# error generated.

grep -i 'Node ID:' ${TMPOUT} |
 awk -F: '{ print $2 }' |
 read -r -- NODEID

grep -i 'Error Notification Contact:' ${TMPOUT} |
 awk -F: '{ print $2 }' |
 read -r -- DESCPART1

grep -i 'Error Notification Time:' ${TMPOUT} |
 awk -F: '{ print $2 }' |
 read -r -- DESCPART2

grep -i 'Error Label:' ${TMPOUT} |
 awk -F: '{ print $2 }' |
 read -r -- DESCPART3

grep -i 'Error Email Address:' ${TMPOUT} |
 awk -F: '{ print $2 }' |
 read -r -- ERREMAIL

grep -i 'Supervision Group Name:' ${TMPOUT} |
 awk -F: '{ print $2 }' |
 read -r -- SVGROUP

grep -i 'Error Component Name:' ${TMPOUT} |
 awk -F: '{ print $2 }' |
 read -r -- COMPNAME

grep -i 'Error Return Code:' ${TMPOUT} |
 awk -F: '{ print $2 }' |
 read -r -- ERRORCODE

SHORTDESC="${NODEID}:ERR${$}:${DESCPART1}:${DESCPART2}:${DESCPART3}"

################################################################
# Create a shell script to print the content of the error 
# message, remove itself, then exit with the error code
# from the program which generated the error.

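# Escape each single quote as '\'' so the message survives inside
# the single-quoted string written to the generated script.
ESCAPED=$( sed -e "s/'/'\\\\''/g" ${TMPOUT} )

if print -r -- "#!/bin/ksh
print -u2 -r -- '${ESCAPED}'
rm -f ${TMPSCRIPT}
exit ${ERRORCODE:-1}" > "${TMPSCRIPT}"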
then
    chmod 777 "${TMPSCRIPT}"
else
    print "ERROR: unable to create ${TMPSCRIPT}" | tee /dev/console 1>&2
    exit 2
fi

################################################################
#  The ctmcreate parameters are defined here

export CONTROLM="/$( uname -n )/bmc/ctmagent/ctm"
CTMCREATE="${CONTROLM}/exe_AIX/ctmcreate"
MEMNAME=$( uname -n )_ERR${$}
GROUP="OPERATOR"
APPLICATION="ERRORS"
DATACENTER="FTW"
OWNER="root"
JOBNAME=JOB${$}
NODEGRP="$( uname -n )"
DESCRIPTION="${SHORTDESC}"
CMDLINE="/home/bin/ecsrun ${TMPSCRIPT}"
SHOUT="${SHORTDESC}"
# Uncomment the next line only while testing; it overrides the real shout text.
# SHOUT="TESTING ignore this message"
EXITCODE="0"

################################################################
# The ctmcreate command is executed here

${CTMCREATE} \
-tasktype COMMAND \
-group "${GROUP}" \
-application "${APPLICATION}" \
-nodegrp "${NODEGRP}" \
-memname ${MEMNAME} \
-jobname ${JOBNAME} \
-owner   "${OWNER}" \
-description "${DESCRIPTION}" \
-shout NOTOK ECS R "${SHOUT}" \
-cmdline "${CMDLINE}"

STATUS="${?}"

################################################################
# If the Error Email Address record is defined, send the
# full content of the error message to the recipient defined.

if [[ "_${ERREMAIL}" != "_" ]]
then
    mail -s "${SHORTDESC}" ${ERREMAIL} < ${TMPOUT}
fi

rm -f ${TMPOUT}

exit 0
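
The generated script wraps the error text in single quotes, so any single quote inside the text has to be written as '\'' or the quoted string ends early. The following minimal sketch shows that escaping in isolation; the function name sq_escape is hypothetical, and the code is written in portable POSIX shell so that it should also run under ksh.

```shell
# sq_escape
# Reads text on standard input and rewrites every single quote as
# '\'' so the result can be placed between single quotes in a
# generated shell script. (Hypothetical helper, for illustration only.)
sq_escape() {
    # Inside double quotes, four backslashes reach sed as two, and
    # sed's replacement turns those two into one literal backslash.
    sed "s/'/'\\\\''/g"
}
```

Text escaped this way can be assigned inside single quotes and read back unchanged.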