psmonitor_k93 - Automated Process Monitoring Korn Shell Script
The purpose of this function is to monitor running processes on an
AIX system and to log error messages or send error notification if those
processes are not running. It is expected that this function will be
scheduled to run from "cron", executed once every minute of every day.
It is configured to only send one error message per day for each process
it monitors, so the monitoring system or system administrator is not
flooded with error messages.
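The once-per-day behavior comes from a working copy of the process list: an entry that fails a check is reported and then left out of the working copy, which is rebuilt once a day at midnight. The sketch below illustrates only that idea. The function name `check_and_prune` and its file arguments are hypothetical, not part of the script, and the code is written to run under either ksh93 or bash.

```shell
# Hypothetical helper: check every pattern in a working list against a
# saved "ps -ef" snapshot. Patterns with no match are reported once and
# then dropped from the working list, so later runs stay silent.
function check_and_prune {
    typeset worklist="$1" snapshot="$2" keep="$1.keep.$$" pattern
    : > "${keep}"
    while read -r pattern
    do
        if grep -- "${pattern}" "${snapshot}" > /dev/null 2>&1
        then
            printf '%s\n' "${pattern}" >> "${keep}"
        else
            printf 'ALERT: no process matches "%s"\n' "${pattern}"
        fi
    done < "${worklist}"
    mv -f -- "${keep}" "${worklist}"
}

# Demo with canned data instead of a live system.
demo_dir="${TMPDIR:-/tmp}/psmon_demo.$$"
mkdir -p "${demo_dir}"
printf '%s\n' 'root.*sendmail' 'mercatord' > "${demo_dir}/work"
printf '%s\n' 'root 4242 1 0 09:00 - 0:00 sendmail: accepting connections' > "${demo_dir}/ps"
check_and_prune "${demo_dir}/work" "${demo_dir}/ps"   # reports mercatord
check_and_prune "${demo_dir}/work" "${demo_dir}/ps"   # silent: already pruned
rm -rf -- "${demo_dir}"
```

The first call reports the missing pattern; the second call finds it already removed from the working list and prints nothing.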
Numerous command line options provide the system administrator with
the flexibility to customize this function to integrate with any error
logging/notification system already in place, or to provide that
functionality itself.
Command line options allow the administrator to customize the error
messages, insert messages to the standard AIX error log, send email
notification, execute an external script, or to embed their own Korn
shell code into this function to perform error logging and/or
notification.
Zero or more of these error logging and notification methods may be
performed by this function in the event it detects a process not
running.
The processes being monitored by this function are specified via
regular expressions read from a process list file. This file is
customized and maintained by the system administrator and is empty or
nonexistent by default. Each line of this file is assumed to contain a
regular expression representing one or more processes that MUST always
be running on the AIX system. This function compares each regular
expression from the process list file against the output from the "ps
-ef" command to determine if any processes matching the regular
expression are running. If not, it attempts to perform the error
logging and notifications specified by the command line options.
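The comparison described above can be sketched as follows. This is a simplified illustration rather than the script's own code: `is_running` is a hypothetical name, and the process listing is passed in as a string so the example can be tried without a live system. Dropping lines that contain `grep` mirrors the filter the script applies to the "ps -ef" output.

```shell
# Hypothetical helper: succeed if any line of a "ps -ef" style listing
# matches the given regular expression.
function is_running {
    typeset pattern="$1" listing="$2"
    printf '%s\n' "${listing}" | grep -v grep |
        grep -- "${pattern}" > /dev/null 2>&1
}

# Canned listing used in place of live "ps -ef" output for the demo.
sample_ps='root   4242     1  0 09:00:01  -  0:00 sendmail: accepting connections
oracle 5150     1  0 09:00:07  -  0:12 ora_pmon_DBNAME'

for regex in 'root.*sendmail' 'oracle.*DBNAME' 'mercatord'
do
    if is_running "${regex}" "${sample_ps}"
    then
        printf 'running: %s\n' "${regex}"
    else
        printf 'NOT running: %s\n' "${regex}"
    fi
done
```

Against a live system the second argument would be the output of `ps -ef`, for example `is_running 'root.*sendmail' "$( ps -ef )"`.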
The content of the error messages can be customized by the
administrator via several methods. These methods include the use of a
configuration file, environment variables, or by directly editing this
function. This function contains the following preset variables and
values, which are assembled to create an error message:
ErrMsgType="1-Critical" # Error Message Type
ErrNtfyCon="Unix on-call" # Error Notification Contact
ErrNtfyTim="Normal office hours" # Error Notification Time
ErrCompCls="Software" # Error Component Class
ErrCompNam="AIX" # Error Component Name
ErrRetCode="1" # Error Return Code
ErrLabel="Process not running" # Error Label
ErrDescrip="Process is not running for frame associated with" # Error Description
ErrEmail="dfrench@mtxia.com" # Error Email Address
These values may be changed by presetting shell or environment
variables before running this function, or by defining a configuration
file containing the shell variable values.
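Presetting works because the function declares each variable with the shell's `${name:-default}` expansion, which keeps a value that is already set and non-empty and otherwise substitutes the built-in default. A minimal sketch of that behavior, using a hypothetical function name `show_label`:

```shell
# Hypothetical helper: use the caller's ErrLabel if it is set and
# non-empty, otherwise fall back to the built-in default.
function show_label {
    typeset label="${ErrLabel:-Process not running}"
    printf 'Error Label: %s\n' "${label}"
}

unset ErrLabel
show_label                 # falls back to the default

ErrLabel="Listener down"
show_label                 # uses the preset value
```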
The configuration file must be a valid executable shell script
containing shell variable definitions. The variable values may be
changed to any value desired. Suggested values follow:
- ERROR MESSAGE TYPE
- Variable Name:
ErrMsgType
- This record provides information regarding the criticality of the
error.
- 0-TotalOutage
- 1-Critical
- 2-Urgent
- 3-Warning
- 4-Info
- ERROR NOTIFICATION CONTACT
- Variable Name:
ErrNtfyCon
- The person, persons or group to contact regarding the error should
be identified using this record type. Examples follow:
- Unix Systems on-call
- SAP on-call
- Joe Schmoe
- John Doe, Jane Doe
- ERROR NOTIFICATION TIME
- Variable Name:
ErrNtfyTim
- This record defines when to contact the person, persons, or group
responsible for resolving the error:
- Immediate
- Normal office hours
- Day time hours only
- Between 8:00am and 5:00pm weekdays
- Anytime
- ERROR COMPONENT CLASS
- Variable Name:
ErrCompCls
- Currently, only two classes are defined: "Hardware" and
"Software". Other classes may be added as needed, such as "Firmware".
- ERROR COMPONENT NAME
- Variable Name:
ErrCompNam
- This record identifies the failing component by its common name.
This name should be short and provide immediate recognition of the
failed component such as:
- AIX
- OMS
- EXE
- Manugistics
- Mercator
- MQSeries
- Enterprise/CS
- CONTROL-M/Server
- CONTROL-M/Agent
- ERROR RETURN CODE
- Variable Name:
ErrRetCode
- An arbitrary integer from 1 to 255, supplied by the script writer or
taken from the exit code of a compiled program. If no code is
provided, it defaults to "1".
- ERROR LABEL
- Variable Name:
ErrLabel
- A short description of the error. This description should be 64
characters or less and provide recognizable information regarding the
error. It should not be a cryptic code of numbers and letters.
- ERROR DESCRIPTION
- Variable Name:
ErrDescrip
- This should be a detailed description of the error and provide
specific information to the support personnel. The support personnel
should be able to use the information provided in this description to
help debug and diagnose the problem.
- ERROR EMAIL ADDRESS
- Variable Name:
ErrEmail
- This record is optional. If defined, it causes the error
notification system to send an email message containing the full text of
the error message to all defined recipients. The data portion of this
record should contain one or more valid email addresses.
The contents of a configuration file to define the error
message environment variables would appear as follows:
ErrMsgType="1-Critical"
ErrNtfyCon="Unix on-call"
ErrNtfyTim="Normal office hours"
ErrCompCls="Software"
ErrCompNam="AIX"
ErrRetCode="1"
ErrLabel="Process not running"
ErrDescrip="Process is not running for frame associated with"
ErrEmail="dfrench@mtxia.com"
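The function reads a configuration file by sourcing it with the `.` command, after checking that the file is non-empty and executable. The sketch below mirrors those checks under stated assumptions: `load_config` and the temporary file name are hypothetical, and the return code 3 simply echoes the value the function uses for this error.

```shell
# Hypothetical helper: source a configuration file only if it is
# non-empty and executable.
function load_config {
    typeset cfg="$1"
    [[ -s "${cfg}" ]] || return 0      # missing or empty: nothing to load
    if [[ ! -x "${cfg}" ]]
    then
        printf 'ERROR: "%s" is not executable.\n' "${cfg}" >&2
        return 3
    fi
    . "${cfg}"
}

# Demo: write a small configuration file, load it, show the values.
cfg="${TMPDIR:-/tmp}/psmonitor_demo.$$.cfg"
printf '%s\n' 'ErrMsgType="2-Urgent"' 'ErrNtfyCon="SAP on-call"' > "${cfg}"
chmod +x "${cfg}"
load_config "${cfg}" && printf '%s / %s\n' "${ErrMsgType}" "${ErrNtfyCon}"
rm -f -- "${cfg}"
```

Because the file is sourced rather than run as a separate process, its assignments take effect in the calling shell.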
When the error message values are defined via environment
variables before calling psmonitor_k93 as a function from
another script, the calling script might appear as
follows:
...
...
...
export ErrMsgType="1-Critical"
export ErrNtfyCon="Unix on-call"
export ErrNtfyTim="Normal office hours"
export ErrCompCls="Software"
export ErrCompNam="AIX"
export ErrRetCode="1"
export ErrLabel="Process not running"
export ErrDescrip="Process is not running for frame associated with"
export ErrEmail="dfrench@mtxia.com"
psmonitor_k93
...
...
...
Any external script may be specified on the command line of the
psmonitor_k93 function and will be executed, assuming the external
script will process the error message assembled by this function.
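Because the function runs the external script with the `.` command, the script executes in the function's own environment and can read the assembled message directly. The sketch below shows what such a script might contain. `MSG` and `PROCESSCMD` are the variable names used inside the function; the function name `psmon_log_entry`, the `PSMON_LOG` variable, and the default log path are assumptions made for illustration.

```shell
# Hypothetical logger script. When sourced by psmonitor_k93 it can read
# MSG (the assembled message) and PROCESSCMD (the regex that failed).
# The default log path is an assumption; override it with PSMON_LOG.
function psmon_log_entry {
    typeset logfile="${PSMON_LOG:-/tmp/psmonitor.log}"
    {
        printf '%s pattern="%s"\n' "$( date +%Y-%m-%dT%H:%M:%S )" "${PROCESSCMD}"
        printf '%s\n' "${MSG}"
    } >> "${logfile}"
}

# Write an entry only when a message is present, as it would be when
# the monitor sources this file after a failed check.
if [[ -n "${MSG:-}" ]]
then
    psmon_log_entry
fi
```

Saved to a file, made executable, and named with the `-l` option, a script like this would append one timestamped entry per failed check.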
The administrator may also add custom code to this function to process
error messages. This code runs only when the "-u" option is given, and
its location is identified inside the function by a comment that
reads: "CHANGE THE BODY OF THE FOLLOWING 'if' STATEMENT TO SUIT YOUR
INDIVIDUAL NEEDS AND REQUIREMENTS FOR LOGGING ERROR MESSAGES".
The process list file should contain regular expressions
representing lines of output from the "ps -ef" command. The default
process list file is located at "/etc/psmonitor.list". Example contents
of this file might appear as:
root.*sendmail
oracle DBNAME
mercatord
Each line of the process list file would be read and compared
against the list of all system processes. If a match is not found, an
error message is generated and logged via zero or more of the available
methods.
The psmonitor_k93 function can also be used to restart failed
processes by building an external shell script for this purpose. Then
run this function and specify the external shell script as a command
line option. When a process is detected as "not running", the external
shell script will be executed to restart the failed process.
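A restart script along these lines could map each failed pattern to the command that brings the process back. Everything in the sketch is hypothetical: the function name, the patterns, and the restart commands, which are only printed here rather than run and would need to be replaced with site-specific values. `PROCESSCMD` is the variable the function uses for the pattern that failed.

```shell
# Hypothetical restart script: choose a restart command based on the
# pattern that failed. The command strings are placeholders and are
# printed, not executed, so the sketch is safe to try.
function restart_command_for {
    case "$1" in
        *sendmail*)  printf '%s\n' 'startsrc -s sendmail -a "-bd -q30m"' ;;
        *mercatord*) printf '%s\n' '/path/to/start_mercator' ;;
        *)           return 1 ;;
    esac
}

if cmd=$( restart_command_for "${PROCESSCMD:-}" )
then
    printf 'Would run: %s\n' "${cmd}"
else
    printf 'No restart rule for "%s"\n' "${PROCESSCMD:-}" >&2
fi
```

Keeping the pattern-to-command mapping in one `case` statement makes it easy to see which monitored processes have a restart rule and which do not.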
#!/usr/bin/ksh93
################################################################
function usagemsg_psmonitor_k93 {
print "
Program: psmonitor_k93
This function reads a list of process arguments from a process list
file and checks the system process list to see if each process
is running. If not, it logs an error in the AIX error log. This
function is intended to be run from cron once every minute of
every day.
Usage: ${1##*/} [-?] [-vVLEu] [-p processListFile] [-c configFile]
[-l loggerScript] [-r resetDays] [-e emailAddress]
Where:
-p processListFile = Use the specified process list file.
Default: /etc/psmonitor.list
-c configFile = Use the specified configuration file to
define error message variables.
Default: none
-l loggerScript = Execute the external error logging script
specified by the file name "loggerScript".
Default: none
-r resetDays = Number of days between configuration resets
Default: 1 (NOT IMPLEMENTED AT THIS TIME)
-e emailAddress = Email address(es) to send error notification.
-u = Execute local user customized code section of this function.
-L = Do NOT log messages to AIX Errorlog.
-E = Do NOT send email messages for error notification.
-v = Verbose mode
-V = Very Verbose Mode
Example: psmonitor_k93
Author: Dana French (dfrench@mtxia.com)
Copyright 2006 by Dana French
\"AutoContent\" enabled
"
}
################################################################
function psmonitor_k93 {
typeset TRUE="1"
typeset FALSE="0"
typeset RETCODE="0"
typeset VERBOSE="${FALSE}"
typeset VERYVERB="${FALSE}"
typeset LOGGER=""
typeset PROCLISTFILE="/etc/psmonitor.list"
typeset CONFIGFILE=""
typeset RESETDAYS="1"
typeset TMPFILE="/tmp/psmonitor_k93.${$}.tmp"
typeset AIXERRLOG="${TRUE}"
typeset SENDEMAIL="${TRUE}"
typeset CUSTOMCODE="${FALSE}"
typeset VERSION="1.0"
typeset ErrMsgType="${ErrMsgType:-1-Critical}"
typeset ErrNtfyCon="${ErrNtfyCon:-Unix on-call}"
typeset ErrNtfyTim="${ErrNtfyTim:-Normal office hours}"
typeset ErrCompCls="${ErrCompCls:-Software}"
typeset ErrCompNam="${ErrCompNam:-AIX}"
typeset ErrRetCode="${ErrRetCode:-1}"
typeset ErrLabel="${ErrLabel:-Process not running}"
typeset ErrDescrip="${ErrDescrip:-Process is not running for frame associated with}"
typeset ErrEmail="${ErrEmail:-dfrench1@capgeminienergy.com Unix_Team@txu.com}"
typeset MESSAGE='Error Message Type: ${ErrMsgType}
Error Notification Contact: ${ErrNtfyCon}
Error Notification Time: ${ErrNtfyTim}
Error Component Class: ${ErrCompCls}
Error Component Name: ${ErrCompNam}
Error Return Code: ${ErrRetCode}
Error Label: ${ErrLabel} \"${PROCESSCMD}\"
Error Description: ${ErrDescrip} \"${PROCESSCMD}\"
Error Email Address: ${ErrEmail}'
while getopts ":vVELup:c:l:r#e:" OPTION
do
case "${OPTION}" in
'v') VERBOSE="${TRUE}";;
'V') VERYVERB="${TRUE}";;
'p') PROCLISTFILE="${OPTARG}";;
'c') CONFIGFILE="${OPTARG}";;
'l') LOGGER="${OPTARG}";;
'e') ErrEmail="${OPTARG}";;
'u') CUSTOMCODE="${TRUE}";;
'L') AIXERRLOG="${FALSE}";;
'E') SENDEMAIL="${FALSE}";;
'?') usagemsg_psmonitor_k93 "${0}" && return 1 ;;
':') usagemsg_psmonitor_k93 "${0}" && return 1 ;;
'#') usagemsg_psmonitor_k93 "${0}" && return 1 ;;
esac
done
shift $(( ${OPTIND} - 1 ))
(( VERBOSE == TRUE )) && print -u 2 -- "# Version: ${VERSION}"
(( VERBOSE == TRUE )) && print -u 2 -- "# Process List File: ${PROCLISTFILE}"
################################################################
trap "usagemsg_psmonitor_k93 ${0}" EXIT
####
#### Check to see if the specified process list file exists
#### and contains data. If not, display an error message and
#### return from the function with a non-zero return code.
####
RETCODE="1"
if ! [[ -s "${PROCLISTFILE}" ]]
then
print -u 2 -- "# ERROR: Process List file \"${PROCLISTFILE}\" does not exist"
print -u 2 -- "# or contains no data."
return ${RETCODE}
fi
####
#### Build a full path file name for the working copy of the
#### process list file, replacing the slashes with bang
#### symbols. This is so that if this function is executed
#### from multiple users, they will not likely overwrite each
#### other's working process list file.
####
typeset PROCLISTWORK="${PROCLISTFILE}.work"
if [[ "_${PROCLISTFILE}" != _/* ]]
then
typeset PROCLISTWORK="${PWD}/${PROCLISTFILE}.work"
fi
PROCLISTWORK="/tmp/${PROCLISTWORK//\//!}"
PROCLISTWORK="${PROCLISTWORK//!.!/!}"
(( VERBOSE == TRUE )) && print -u 2 -- "# Working process list File: ${PROCLISTWORK}"
####
#### Check to see if the working process list file exists,
#### if not create it from the user specified or default
#### process list file using sorted and unique record lines.
####
if ! [[ -f "${PROCLISTWORK}" ]]
then
(( VERBOSE == TRUE )) && print -u 2 -- "# Working process list file \"${PROCLISTWORK}\" does not exist"
(( VERBOSE == TRUE )) && print -u 2 -- "# Creating \"${PROCLISTWORK}\""
sort "${PROCLISTFILE}" | uniq > "${PROCLISTWORK}"
fi
####
#### Check to see if the user specified or default
#### process list file has a later time stamp than the
#### working process list file. If so, rebuild the working
#### process list file using sorted and unique record lines.
####
if [[ "${PROCLISTFILE}" -nt "${PROCLISTWORK}" ]]
then
(( VERBOSE == TRUE )) && print -u 2 -- "# Process list file \"${PROCLISTFILE}\" is newer than working copy."
(( VERBOSE == TRUE )) && print -u 2 -- "# Resetting working copy to resemble newer process list file."
sort "${PROCLISTFILE}" | uniq > "${PROCLISTWORK}"
fi
####
#### Check to see if the number of days between working file
#### resets is less than 1, if so display an error message
#### and return from the function with a non-zero return
#### code.
####
RETCODE="2"
if (( RESETDAYS <= 0 ))
then
print -u 2 -- "# ERROR: Number of days between working file resets is less than 1, MIN=1"
return ${RETCODE}
fi
####
#### If a configuration file is specified on the command line,
#### check to see that it exists, has a non-zero file length,
#### and is executable. If it passes these tests, execute it
#### to define the error message variables and values.
####
RETCODE="3"
if [[ "_${CONFIGFILE}" != "_" ]] && [[ -s "${CONFIGFILE}" ]]
then
(( VERBOSE == TRUE )) && print -u 2 -- "# Configuration File: ${CONFIGFILE}"
if [[ -x "${CONFIGFILE}" ]]
then
. "${CONFIGFILE}"
else
print -u 2 -- "# ERROR: Configuration file \"${CONFIGFILE}\" is not executable."
return ${RETCODE}
fi
fi
RETCODE="0"
trap "-" EXIT
(( VERYVERB == TRUE )) && set -x
####
#### Reset the working psmonitor.list file once a day at midnight
####
TOD=$( date +"%H%M" )
if [[ "_${TOD}" = _0000 ]]
then
rm -f -- "${PROCLISTWORK}"
sort -- "${PROCLISTFILE}" | uniq > "${PROCLISTWORK}"
fi
################################################################
#### Generate a list of all processes on the system and store
#### the list in an array, one process record line per array
#### element.
IFS=$'\n'
PLIST=( $( ps -ef | grep -v grep ) )
IFS=$' \t\n'
####
#### Loop through the record lines in the working
#### process list file one line at a time. Each line is
#### assumed to contain a regular expression representing a
#### process that appears in a system's "ps -ef" output.
####
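rm -f -- "${TMPFILE}"
#### Create an empty temporary file so the later "sort" succeeds even
#### when none of the monitored processes are running.
: > "${TMPFILE}"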
while read -r -- PROCESSCMD
do
(( VERBOSE == TRUE )) && print -u 2 -r -- "# Process args regex: \"${PROCESSCMD}\""
#### Test the contents of the process list array to determine
#### if it contains the process identifier read from the
#### working process list file. If it does not, then the
#### process is not running, so log an error message.
IFS=$'\n'
if ! print -r -- "${PLIST[*]}" | grep -- "${PROCESSCMD}" > /dev/null 2>&1
then
IFS=$' \t\n'
print -u 2 -r -- "# ERROR: Process matching \"${PROCESSCMD}\" does not exist"
#### Evaluate the error message text to cause the dynamically
#### assigned values to be substituted into the message.
eval MSG="\"${MESSAGE}\""
# (( VERBOSE == TRUE )) && print -- "${MSG}"
#### Insert the error message into the standard AIX error log
#### using the "errlogger" utility.
(( AIXERRLOG == TRUE )) && errlogger "${MSG}"
#### Email the error message to the person(s) or groups
#### identified as the recipient of these error messages.
#### This email address may be specified on the command line,
#### configuration file, or as an environment variable.
(( SENDEMAIL == TRUE )) && print -r -- "${MSG}" |
mail -s "$( hostname ) psmonitor_k93" "${ErrEmail}"
#### If an error logging script was specified on the command
#### line, execute it. Assume the script utilizes the
#### appropriate error message variables.
[[ "_${LOGGER}" != "_" ]] && [[ -x "${LOGGER}" ]] && . "${LOGGER}"
#### If the command line option to execute local user
#### customized code was selected on the command line,
#### execute this section of code. CHANGE THE BODY OF THE
#### FOLLOWING "if" STATEMENT TO SUIT YOUR INDIVIDUAL NEEDS
#### AND REQUIREMENTS FOR LOGGING ERROR MESSAGES.
if (( CUSTOMCODE == TRUE ))
then
(( VERBOSE == TRUE )) && print -u 2 "# Begin local user custom code section."
print "# "
print "# If you had inserted your customized code for error"
print "# logging and/or notification, this function would be"
print "# running it now..."
print "# "
(( VERBOSE == TRUE )) && print -u 2 "# End local user custom code section."
fi
else
#### If the process list array contains the process
#### identifier read from the working process list file, then
#### insert the process identifier into a temporary storage
#### file. This file will be used during the next invocation
#### of this function as the list of valid process identifiers
#### to test against.
IFS=$' \t\n'
print -r -- "${PROCESSCMD}" >> "${TMPFILE}"
fi
done < "${PROCLISTWORK}"
#### Sort the list of valid process identifiers and extract
#### only the unique values. Store these values in the
#### working process list file.
sort -- "${TMPFILE}" | uniq > "${PROCLISTWORK}"
################################################################
(( VERBOSE == TRUE )) && print -u 2 -r -- "# Begin checking for restarted processes."
####
#### Now loop through the record lines of the process list
#### file that do not appear in the working process list
#### file, and determine if any running processes match. If
#### so, add them back to the working process list file.
####
rm -f -- "${TMPFILE}"
cp -f -- "${PROCLISTWORK}" "${TMPFILE}"
sort -- "${PROCLISTFILE}" | uniq | comm -23 - "${PROCLISTWORK}" |
while read -r -- PROCESSCMD
do
(( VERBOSE == TRUE )) && print -u 2 -r -- "# Check for restarted process: \"${PROCESSCMD}\""
#### Test the contents of the process list array to determine
#### if it contains the process identifier read from the
#### process list file. If it does, then the process has
#### been restarted, so add it back into the working process
#### list file.
IFS=$'\n'
if print -r -- "${PLIST[*]}" | grep -- "${PROCESSCMD}" > /dev/null 2>&1
then
IFS=$' \t\n'
#### If the process list array contains the process
#### identifier read from the process list file, then
#### insert the process identifier into a temporary storage
#### file. This file will be used during the next invocation
#### of this function as the list of valid process identifiers
#### to test against.
(( VERBOSE == TRUE )) && print -u 2 -r -- "# Re-adding \"${PROCESSCMD}\" to the working process list."
print -r -- "${PROCESSCMD}" >> "${TMPFILE}"
fi
IFS=$' \t\n'
done
#### Sort the list of valid process identifiers and extract
#### only the unique values. Store these values in the
#### working process list file.
sort -- "${TMPFILE}" | uniq > "${PROCLISTWORK}"
rm -f -- "${TMPFILE}"
(( VERBOSE == TRUE )) && print -u 2 -r -- "# End checking for restarted processes."
return ${RETCODE}
}
################################################################
psmonitor_k93 "${@}"