Some organizations use many different error logging and error
notification techniques on the AIX RS/6000 platform. Described here is
a technique that separates error logging from error notification and
consolidates all error logging into a single system. This separation
simplifies the task of reporting errors from the operating system,
applications, programs, scripts, databases, etc., and results in a
standardized method of reporting errors from all sources. The system or
application administrator only needs to know how to report an error and
does not need to consider how the responsible parties will be notified
of it.
The error notification system may still contain several mechanisms by
which notification is performed, but will be fed by the single
consolidated error logging system. The error notification system will
utilize a variety of mechanisms including e-mail, CONTROL-M,
SuperVision, BMC Patrol, etc.
Consolidating the error notification systems into the AIX
error logging system requires an error message template with
predefined data records. These records may then be parsed by a
subsequent shell script or program. The following records were
identified as those that every error message should contain. This means
that when a script encounters an error, each record will be part of the
error message sent to the "errlogger" program. The fields within each
record are separated by the colon (:) character; the first field is the
record identifier and the second field is the data. Sample data is shown
in the example below:
Error Message Type: 0-TotalOutage|1-Critical|2-Urgent|3-Warning|4-Info
Error Notification Contact: Opensystems on-call|SAP on-call|...
Error Notification Time: Normal office hours|Immediate|Day Time Hours
Error Component Class: Hardware|Software
Error Component Name: AIX|OMS|EXE|Manugistics|Mercator|MQSeries|...
Error Return Code: 0-255
Error Label:
Error Description:
Error Email Address:
Supervision Group Name: opensys|sapsys|...
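Because the data field may itself contain colons (a time such as 8:00am, for example), a parser should probably split each record at the first colon only. The following is a minimal sketch of such a lookup, not part of the original system; the function name record_value and the sample message are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: read an error message on stdin and print the
# data field of the first record whose identifier matches $1.
# Only the first colon on a line is treated as the separator, so data
# that itself contains colons is returned whole.
record_value() {
    awk -v id="$1" '
        {
            pos = index($0, ":")
            if (pos == 0) next                  # not a record line
            name = substr($0, 1, pos - 1)
            sub(/^[ \t]+/, "", name)            # trim leading blanks
            if (tolower(name) == tolower(id)) {
                data = substr($0, pos + 1)
                sub(/^[ \t]+/, "", data)        # trim leading blanks
                sub(/[ \t]+$/, "", data)        # trim trailing blanks
                print data
                exit
            }
        }'
}

# Illustrative message using the record identifiers defined above.
MESSAGE='Error Message Type: 1-Critical
Error Notification Time: Between 8:00am and 5:00pm weekdays
Error Return Code: 1'

printf '%s\n' "$MESSAGE" | record_value 'Error Notification Time'
```

Matching the identifier without regard to case mirrors the case-insensitive lookups a parsing script would typically perform.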
Error Message Type:
This record indicates the criticality of the error. Five levels are
currently defined and should be identified as follows:
- 0-TotalOutage
- 1-Critical
- 2-Urgent
- 3-Warning
- 4-Info
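A script that reports errors could check the message type against these five values before logging. The sketch below is one possible way to do that; the function name severity_level is hypothetical and not part of the system described here.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if $1 is one of the five defined
# message types, and print its numeric level on stdout.
severity_level() {
    case "$1" in
        0-TotalOutage|1-Critical|2-Urgent|3-Warning|4-Info)
            printf '%s\n' "${1%%-*}"    # text before the first "-"
            ;;
        *)
            printf 'unknown message type: %s\n' "$1" >&2
            return 1
            ;;
    esac
}

severity_level '1-Critical'

if ! severity_level 'Fatal' 2>/dev/null
then
    printf 'rejected\n'
fi
```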
Error Notification Contact:
The person, persons or group to contact regarding the error should be
identified using this record type. Examples follow:
- Opensystems on-call
- SAP on-call
- Joe Schmoe
- John Doe, Jane Doe
Error Notification Time:
When to contact the person, persons, or group responsible for
resolving this error should be defined using this record:
- Immediate
- Normal office hours
- Day time hours only
- Between 8:00am and 5:00pm weekdays
- Anytime
Error Component Class:
Currently, only two classes are defined: "Hardware" and
"Software". Other classes, such as "Firmware", may be added as
needed.
Error Component Name:
This record identifies the failing component by its common name. This
name should be short and provide immediate recognition of the failed
component such as:
- AIX
- OMS
- EXE
- Manugistics
- Mercator
- MQSeries
- Enterprise/CS
- CONTROL-M/Server
- CONTROL-M/Agent
Error Return Code:
This is an arbitrary positive number between 1 and 255. It is
provided by the script writer or taken from the exit code of a compiled
program. If this code is not provided, it defaults to "1".
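The defaulting rule can be sketched as a small function. The name normalize_return_code is hypothetical, and treating out-of-range or non-numeric values the same as a missing code is an assumption; the text above only specifies the default for a missing code.

```shell
#!/bin/sh
# Hypothetical helper: print $1 if it is an integer from 1 to 255.
# Otherwise print the default of 1. Mapping invalid values to the
# default is an assumption, not something the record definition states.
normalize_return_code() {
    case "$1" in
        ''|*[!0-9]*)
            printf '1\n'        # empty or not a plain number
            return 0
            ;;
    esac
    if [ "$1" -ge 1 ] && [ "$1" -le 255 ]
    then
        printf '%s\n' "$1"
    else
        printf '1\n'            # numeric but outside 1-255
    fi
}

normalize_return_code 42
normalize_return_code ''
```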
Error Label:
A short description of the error. This description should be 64
characters or fewer and provide recognizable information regarding the
error. It should not be a cryptic code of numbers and letters.
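The length limit is easy to check before the message is logged. A minimal sketch, using a hypothetical function name valid_label:

```shell
#!/bin/sh
# Hypothetical helper: succeed if the label in $1 is non-empty and
# no longer than 64 characters.
valid_label() {
    [ -n "$1" ] && [ "${#1}" -le 64 ]
}

if valid_label 'File system Full'
then
    printf 'label accepted\n'
fi
```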
Error Description:
This should be a detailed description of the error and provide
specific information to the support personnel. The support personnel
should be able to use the information provided in this description to
help debug and diagnose the problem.
Error Email Address:
This record is optional. If defined, it will cause the error
notification system to send an email message containing the full text of
the error message to all defined recipients. The data portion of this
record should contain one or more valid email addresses.
Supervision Group Name:
This record is optional. If defined, it will cause the error
notification system to attempt to send a shout message to the
Supervision console in the Operations Center. The data portion of this
record should contain the name of a predefined Supervision group.
Currently, the only valid groups are:
- opensys: Opensystems Group
- sapsys: SAP Basis Group
If other Supervision groups are required, they will need to be
preconfigured into the system. Dave Webster is the person responsible
for configuring these groups.
The following 3 records are added to the error
message during the notification phase. These do NOT
need to be part of the error logging phase:
Machine Class: RS/6000
Machine Type: $( lsattr -El sys0 -a modelname | awk '{ print $2}' )
Operating System: AIX $( oslevel )
The following is an example snippet of code to log an error from a
shell script. This example would represent logging a "file system full"
error from one of the OMS machines:
...blah...
...blah...
...blah...
errlogger "
Error Message Type: 1-Critical
Error Notification Contact: Opensystems on-call
Error Notification Time: Normal office hours
Error Component Class: Software
Error Component Name: AIX
Error Return Code: 1
Error Label: File system Full
Error Description: The file system /home is more than 90% full. Please remove unneeded files or increase the size of the file system to correct this problem.
Error Email Address: dfrench@mtxia.com
Supervision Group Name: opensys
"
...blah...
...blah...
...blah...
Notice that the system name and the date/time were
NOT included in the error message. This is because the
system name and date/time are automatically inserted with the
message when it is added to the error logging system.
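To avoid retyping the template in every script, the records could be assembled by a small wrapper function. The following is only a sketch: the function name log_error, the argument order, and the ERRLOGGER override variable are assumptions made for illustration and are not part of the system described here.

```shell
#!/bin/sh
# ERRLOGGER is an assumed override so the wrapper can be tried on a
# machine that has no errlogger command.
ERRLOGGER="${ERRLOGGER:-errlogger}"

# Hypothetical wrapper: build the ten-record message and pass it to
# errlogger as a single argument.
# Arguments, in order: type, contact, time, class, component name,
# return code, label, description, email (optional), group (optional)
log_error() {
    "$ERRLOGGER" "
Error Message Type: $1
Error Notification Contact: $2
Error Notification Time: $3
Error Component Class: $4
Error Component Name: $5
Error Return Code: ${6:-1}
Error Label: $7
Error Description: $8
Error Email Address: ${9:-}
Supervision Group Name: ${10:-}
"
}

# Example call, commented out because it writes to the AIX error log:
# log_error '1-Critical' 'Opensystems on-call' 'Normal office hours' \
#     'Software' 'AIX' 1 'File system Full' \
#     'The file system /home is more than 90% full.' \
#     'dfrench@mtxia.com' 'opensys'
```

Keeping the record identifiers in one place means a change to the template only has to be made in the wrapper.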
To configure the AIX Error Logging system, perform the following
steps.
- Create a file called "/tmp/operator.add" containing
the following Error Notification object:
errnotify:
en_label = OPMSG
en_type = TEMP
en_name = OPERATOR
en_class = "O"
en_method = "/home/bin/errnotify.ksh $1 $2 $3 $4 $5 $6 $7 $8 $9"
To add the object to the Error Notification object class, enter:
odmadd /tmp/operator.add
The odmadd command adds the Error Notification object contained in
"/tmp/operator.add" to the errnotify file.
- To verify that the Error Notification object was added to the object
class, enter:
odmget -q"en_name='OPERATOR'" errnotify
The odmget command locates the Error Notification object within the
errnotify file that has an en_name value of "OPERATOR"
and displays the object. The following output is returned:
errnotify:
en_pid = 0
en_name = "OPERATOR"
en_persistenceflg = 0
en_label = "OPMSG"
en_crcid = 0
en_class = "O"
en_type = "TEMP"
en_alertflg = ""
en_resource = ""
en_rtype = ""
en_rclass = ""
en_symptom = ""
en_method = "/home/bin/errnotify.ksh $1 $2 $3 $4 $5 $6 $7 $8 $9"
- To delete the OPERATOR Error Notification object from the Error
Notification object class, enter:
odmdelete -q"en_name='OPERATOR'" -o errnotify
The odmdelete command locates the Error Notification object within
the errnotify file that has an en_name value of "OPERATOR" and removes
it from the Error Notification object class.
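The steps above can be combined into one script that writes the stanza file and adds the object only if it is not already defined, on the assumption that running odmadd a second time would add a duplicate object. This is a sketch, not the original procedure; the function names are hypothetical, and the ODM commands are only invoked when they are found on the system.

```shell
#!/bin/sh
# Hypothetical helper: write the errnotify stanza to the file named by
# $1, using the notification method path given in $2.
write_operator_stanza() {
    cat > "$1" <<EOF
errnotify:
        en_label = OPMSG
        en_type = TEMP
        en_name = OPERATOR
        en_class = "O"
        en_method = "$2 \$1 \$2 \$3 \$4 \$5 \$6 \$7 \$8 \$9"
EOF
}

# Hypothetical helper: add the OPERATOR object unless odmget already
# reports one. Does nothing beyond writing the file when the ODM
# commands are not available.
install_operator_object() {
    stanza="/tmp/operator.add"
    write_operator_stanza "$stanza" /home/bin/errnotify.ksh || return 1
    if command -v odmget >/dev/null 2>&1 &&
       command -v odmadd >/dev/null 2>&1
    then
        if odmget -q"en_name='OPERATOR'" errnotify | grep OPERATOR >/dev/null
        then
            printf 'OPERATOR object is already defined\n'
        else
            odmadd "$stanza"
        fi
    else
        printf 'ODM commands not found; wrote %s only\n' "$stanza"
    fi
}
```

Calling install_operator_object as "root" on the target system would perform the create, verify, and add steps in order.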
The error logging program is called "/usr/bin/errlogger"
and will exist on every system. However, this program is normally
configured so that only the "root" user can execute it. The permissions
on this program must be changed to allow any user to execute it. Log in
as "root" and change the permissions as follows:
chmod 555 /usr/bin/errlogger
The following script is the error notification script. This script
performs a number of different notification methods such as CONTROL-M,
Supervision, E-mail, etc.
#!/bin/ksh
################################################################
#
# Program: errnotify.ksh
#
# Description: Accepts incoming error messages from the AIX
# Standard Error log and uses the Acme standard
# notification mechanisms to notify groups or
# individuals.
#
# Author: Dana French
#
# Date: 01/28/2002
#
################################################################
typeset -L10 JOBNAME
typeset -L50 DESCRIPTION
TMPSCRIPT="/tmp/ctmscript${$}.tmp"
TMPOUT="/tmp/tmp${$}.out"
################################################################
# Extract the full context of the error message from the
# AIX standard error log.
if errpt -a -l ${1} | sed -e "s/'/\\'/g" > ${TMPOUT}
then
################################################################
# Append the following information to the extracted error message.
print "Machine Class: RS/6000
Machine Type: $( lsattr -El sys0 -a modelname | awk '{ print $2}' )
Operating System: AIX $( oslevel )" >> ${TMPOUT}
chmod 666 ${TMPOUT}
else
print "ERROR: Unable to extract error message from AIX error log" | tee /dev/console 1>&2
exit 1
fi
################################################################
# Parse the records defined in the error message and extract
# various bits of information. This information is used to
# describe the person or persons who should be contacted and when
# they should be contacted. It also provides a description of the
# error generated.
grep -i 'Node ID:' ${TMPOUT} |
awk -F: '{ print $2 }' |
read -r -- NODEID
grep -i 'Error Notification Contact:' ${TMPOUT} |
awk -F: '{ print $2 }' |
read -r -- DESCPART1
grep -i 'Error Notification Time:' ${TMPOUT} |
awk -F: '{ print $2 }' |
read -r -- DESCPART2
grep -i 'Error Label:' ${TMPOUT} |
awk -F: '{ print $2 }' |
read -r -- DESCPART3
grep -i 'Error Email Address:' ${TMPOUT} |
awk -F: '{ print $2 }' |
read -r -- ERREMAIL
grep -i 'Supervision Group Name:' ${TMPOUT} |
awk -F: '{ print $2 }' |
read -r -- SVGROUP
grep -i 'Error Component Name:' ${TMPOUT} |
awk -F: '{ print $2 }' |
read -r -- COMPNAME
grep -i 'Error Return Code:' ${TMPOUT} |
awk -F: '{ print $2 }' |
read -r -- ERRORCODE
SHORTDESC="${NODEID}:ERR${$}:${DESCPART1}:${DESCPART2}:${DESCPART3}"
################################################################
# Create a shell script to print the content of the error
# message, remove itself, then exit with the error code
# from the program which generated the error.
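if print -r -- "#!/bin/ksh
# The quoted here-document delimiter keeps quotes in the text literal.
cat 1>&2 <<'ERRNOTIFY_EOF'
$( cat ${TMPOUT} )
ERRNOTIFY_EOF
rm -f ${TMPSCRIPT}
exit ${ERRORCODE:-1}" > "${TMPSCRIPT}"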
then
chmod 777 "${TMPSCRIPT}"
else
print "ERROR: unable to create ${TMPSCRIPT}" | tee /dev/console 1>&2
exit 2
fi
################################################################
# The ctmcreate parameters are defined here
export CONTROLM="/$( uname -n )/bmc/ctmagent/ctm"
CTMCREATE="${CONTROLM}/exe_AIX/ctmcreate"
MEMNAME=$( uname -n )_ERR${$}
GROUP="OPERATOR"
APPLICATION="ERRORS"
DATACENTER="FTW"
OWNER="root"
JOBNAME=JOB${$}
NODEGRP="$( uname -n )"
DESCRIPTION="${SHORTDESC}"
CMDLINE="/home/bin/ecsrun ${TMPSCRIPT}"
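SHOUT="${SHORTDESC}"
# NOTE: the next line overrides the real shout text with a test
# message; remove it for production use.
SHOUT="TESTING ignore this message"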
EXITCODE="0"
################################################################
# The ctmcreate command is executed here
${CTMCREATE} \
-tasktype COMMAND \
-group "${GROUP}" \
-application "${APPLICATION}" \
-nodegrp "${NODEGRP}" \
-memname ${MEMNAME} \
-jobname ${JOBNAME} \
-owner "${OWNER}" \
-description "${DESCRIPTION}" \
-shout NOTOK ECS R "${SHOUT}" \
-cmdline "${CMDLINE}"
STATUS="${?}"
################################################################
# If the Error Email Address record is defined, send the
# full content of the error message to the recipient defined.
if [[ "_${ERREMAIL}" != "_" ]]
then
mail -s "${SHORTDESC}" ${ERREMAIL} < ${TMPOUT}
fi
rm -f ${TMPOUT}
exit 0