Mt Xia: Technical Consulting Group

Business Continuity / Disaster Recovery / High Availability
Data Center Automation / Audit Response / Audit Compliance


psmonitor_k93


psmonitor_k93 - Automated Process Monitoring Korn Shell Script

The purpose of this function is to monitor running processes on an AIX system and to log error messages or send error notification if those processes are not running. It is expected that this function will be scheduled to run from "cron", executed once every minute of every day. It is configured to only send one error message per day for each process it monitors, so the monitoring system or system administrator is not flooded with error messages.

Numerous command line options provide the system administrator with the flexibility to customize this function to integrate with any error logging/notification system already in place, or to provide that functionality itself.

Command line options allow the administrator to customize the error messages, insert messages to the standard AIX error log, send email notification, execute an external script, or to embed their own Korn shell code into this function to perform error logging and/or notification.

Zero or more of these error logging and notification methods may be performed by this function in the event it detects a process not running.

The processes being monitored by this function are specified via regular expressions read from a process list file. This file is customized and maintained by the system administrator and is empty or nonexistent by default. Each line of this file is assumed to contain a regular expression representing one or more processes that MUST always be running on the AIX system. This function compares each regular expression from the process list file against the output from the "ps -ef" command to determine if any processes matching the regular expression are running. If not, it attempts to perform the error logging and notifications specified by the command line options.
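
As a rough illustration of the comparison described above, the following sketch tests regular expressions against a captured process listing. The helper name "proc_matches" and the sample listing are assumptions made for this example and are not part of psmonitor_k93. It is written in portable shell syntax so that it runs under ksh93 as well as other POSIX shells.

```shell
# Hypothetical helper: succeed if any line of a captured process
# listing matches the given regular expression.
#   $1 = regular expression from the process list file
#   $2 = captured "ps -ef" output (passed in so the check is repeatable)
proc_matches() {
  printf '%s\n' "$2" | grep -- "$1" > /dev/null 2>&1
}

# Sample listing standing in for real "ps -ef" output (made-up data).
# On a live system it would come from:  $( ps -ef | grep -v grep )
SAMPLE_PS='root  4120     1  0 08:00:01  -  0:02 /usr/lib/sendmail -bd -q30m
oracle 5230     1  0 08:00:05  -  0:10 ora_pmon_PROD'

for REGEX in 'root.*sendmail' 'mercatord'
do
  if proc_matches "${REGEX}" "${SAMPLE_PS}"
  then
    printf 'running: %s\n' "${REGEX}"
  else
    printf 'NOT running: %s\n' "${REGEX}"
  fi
done
```

Passing the listing as an argument keeps the check independent of the live system, which makes the matching rules easy to try out before adding an expression to the process list file.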

The content of the error messages can be customized by the administrator via several methods. These methods include the use of a configuration file, environment variables, or by directly editing this function. This function contains the following preset variables and values, which are assembled to create an error message:

ErrMsgType="1-Critical"			# Error Message Type
ErrNtfyCon="Unix on-call"		# Error Notification Contact
ErrNtfyTim="Normal office hours"	# Error Notification Time
ErrCompCls="Software"			# Error Component Class
ErrCompNam="AIX"			# Error Component Name
ErrRetCode="1"				# Error Return Code
ErrLabel="Process not running"		# Error Label
ErrDescrip="Process is not running for frame associated with"	# Error Description
ErrEmail="dfrench@mtxia.com"		# Error Email Address

These values may be changed by presetting shell or environment variables before running this function, or by defining a configuration file containing the shell variable values.
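
The override behavior can be sketched with the shell's "${VAR:-default}" expansion, which is the same form the function uses for its presets. The function name "show_defaults" and the variables "MsgType" and "Contact" are hypothetical names used only for this illustration.

```shell
# Sketch of the "preset unless already defined" pattern.
show_defaults() {
  # ${VAR:-default} keeps a value the caller already set and falls
  # back to the preset only when VAR is unset or empty.
  MsgType="${ErrMsgType:-1-Critical}"
  Contact="${ErrNtfyCon:-Unix on-call}"
  printf '%s / %s\n' "${MsgType}" "${Contact}"
}

unset ErrMsgType ErrNtfyCon
DEFAULTS=$( show_defaults )                          # presets apply
OVERRIDE=$( ErrMsgType="3-Warning" show_defaults )   # caller value wins

printf '%s\n%s\n' "${DEFAULTS}" "${OVERRIDE}"
```

Only the variable that was preset changes; the others keep their built-in values.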

The configuration file must be a valid executable shell script containing shell variable definitions. The variable values may be changed to any value desired. Suggested values follow:

ERROR MESSAGE TYPE
Variable Name: ErrMsgType
This record provides information regarding the criticality of the error.

  • 0-TotalOutage
  • 1-Critical
  • 2-Urgent
  • 3-Warning
  • 4-Info

ERROR NOTIFICATION CONTACT
Variable Name: ErrNtfyCon
The person, persons or group to contact regarding the error should be identified using this record type. Examples follow:

  • Unix Systems on-call
  • SAP on-call
  • Joe Schmoe
  • John Doe, Jane Doe

ERROR NOTIFICATION TIME
Variable Name: ErrNtfyTim
This record defines when to contact the person, persons, or group responsible for resolving the error:

  • Immediate
  • Normal office hours
  • Day time hours only
  • Between 8:00am and 5:00pm weekdays
  • Anytime

ERROR COMPONENT CLASS
Variable Name: ErrCompCls
Currently, only two classes are defined: "Hardware" and "Software". Other classes may be added as needed, such as "Firmware".

ERROR COMPONENT NAME
Variable Name: ErrCompNam
This record identifies the failing component by its common name. This name should be short and provide immediate recognition of the failed component such as:

  • AIX
  • OMS
  • EXE
  • Manugistics
  • Mercator
  • MQSeries
  • Enterprise/CS
  • CONTROL-M/Server
  • CONTROL-M/Agent

ERROR RETURN CODE
Variable Name: ErrRetCode
This is an arbitrary positive number from 1 to 255. It is provided by the script writer or taken from the exit code of a compiled program. If this code is not provided, it defaults to "1".

ERROR LABEL
Variable Name: ErrLabel
A short description of the error. This description should be 64 characters or less and provide recognizable information regarding the error. It should not be a cryptic code of numbers and letters.
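
The limits stated for the return code and the label could be checked before the values are used. The sketch below is an assumption for illustration only: psmonitor_k93 itself does not perform these checks, and the helper names "valid_retcode" and "valid_label" are hypothetical.

```shell
# Hypothetical validators; not part of psmonitor_k93.

# Succeed only if $1 is a whole number from 1 to 255.
valid_retcode() {
  case "$1" in
    ''|0*|*[!0-9]*) return 1 ;;   # empty, zero or leading zero, or non-digit
  esac
  [ "${#1}" -le 3 ] && [ "$1" -le 255 ]
}

# Succeed only if $1 is non-empty and at most 64 characters long.
valid_label() {
  [ -n "$1" ] && [ "${#1}" -le 64 ]
}

valid_retcode "${ErrRetCode:-1}"            || printf 'bad ErrRetCode\n'
valid_label "${ErrLabel:-Process not running}" || printf 'bad ErrLabel\n'
```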

ERROR DESCRIPTION
Variable Name: ErrDescrip
This should be a detailed description of the error and provide specific information to the support personnel. The support personnel should be able to use the information provided in this description to help debug and diagnose the problem.

ERROR EMAIL ADDRESS
Variable Name: ErrEmail
This record is optional. If defined, it causes the error notification system to send an email message containing the full text of the error message to all defined recipients. The data portion of this record should contain one or more valid email addresses.
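
The variables described above are assembled into a single message at the moment an error is detected. The function does this by keeping a template in single quotes and expanding it later with "eval". The following is a shortened sketch of that technique. The name "TEMPLATE" and the two-line layout are assumptions for this example, while the variable names come from the list above.

```shell
# Single quotes keep the ${...} references unexpanded in the template.
TEMPLATE='Error Label: ${ErrLabel} \"${PROCESSCMD}\"
Error Return Code: ${ErrRetCode}'

ErrLabel="Process not running"
ErrRetCode="1"
PROCESSCMD="root.*sendmail"

# eval re-reads the template inside double quotes, so the variable
# values current at this point are substituted into the message.
eval MSG="\"${TEMPLATE}\""

printf '%s\n' "${MSG}"
```

Because "eval" executes whatever the template contains, the template text itself should come only from a trusted source such as the function body.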

The contents of a configuration file to define the error message environment variables would appear as follows:

ErrMsgType="1-Critical"
ErrNtfyCon="Unix on-call"
ErrNtfyTim="Normal office hours"
ErrCompCls="Software"
ErrCompNam="AIX"
ErrRetCode="1"
ErrLabel="Process not running"
ErrDescrip="Process is not running for frame associated with"
ErrEmail="dfrench@mtxia.com"
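
A configuration file of this kind is read by sourcing it into the running shell, after checking that it has content and is executable. The sketch below mirrors those checks in a hypothetical helper named "load_config". The temporary file name and the values written to it are assumptions made for the demonstration.

```shell
# Hypothetical helper mirroring the checks described for the
# configuration file: non-zero size and executable before sourcing.
load_config() {
  [ -s "$1" ] || return 0        # nothing to load; keep the presets
  [ -x "$1" ] || return 3        # present but not executable
  . "$1"
}

# Build a throwaway configuration file for the demonstration.
CFG="${TMPDIR:-/tmp}/psmonitor_demo.$$.cfg"
printf '%s\n' 'ErrMsgType="2-Urgent"' 'ErrCompNam="MQSeries"' > "${CFG}"

chmod -x "${CFG}"
RC_NOEXEC=0; load_config "${CFG}" || RC_NOEXEC=$?   # rejected: not executable

chmod +x "${CFG}"
RC_OK=0; load_config "${CFG}" || RC_OK=$?           # accepted: variables defined

printf 'rc=%s rc=%s type=%s name=%s\n' \
  "${RC_NOEXEC}" "${RC_OK}" "${ErrMsgType}" "${ErrCompNam}"
rm -f -- "${CFG}"
```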

If the error message values are defined via environment variables before calling the psmonitor_k93 script as a function from another script, that may appear as follows:

...
...
...
export ErrMsgType="1-Critical"
export ErrNtfyCon="Unix on-call"
export ErrNtfyTim="Normal office hours"
export ErrCompCls="Software"
export ErrCompNam="AIX"
export ErrRetCode="1"
export ErrLabel="Process not running"
export ErrDescrip="Process is not running for frame associated with"
export ErrEmail="dfrench@mtxia.com"
psmonitor_k93
...
...
...

Any external script may be specified on the command line of the psmonitor_k93 function with the "-l" option and will be executed when a monitored process is not found. It is assumed that the external script will process the error message assembled by this function.
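
As an illustration, the body of such an external script might look like the sketch below. The function sources the external script, so variables such as ErrMsgType and PROCESSCMD are visible to it. The helper name "log_failure", the LOGFILE location, and the record layout are assumptions for this example. The stand-in values let the sketch run outside psmonitor_k93.

```shell
# Body of a hypothetical logger script. LOGFILE is an assumed name
# and is not defined by psmonitor_k93.
log_failure() {
  printf '%s|%s|%s\n' "$( date +%Y%m%d%H%M )" "${ErrMsgType}" "${PROCESSCMD}" \
    >> "${LOGFILE}"
}

# Stand-in values so the sketch can be run on its own.
LOGFILE="${TMPDIR:-/tmp}/psmonitor_demo.$$.log"
ErrMsgType="1-Critical"
PROCESSCMD="root.*sendmail"

log_failure
LOGGED=$( cat "${LOGFILE}" )
rm -f -- "${LOGFILE}"
printf '%s\n' "${LOGGED}"
```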

The administrator may add their own code to this function to process error messages. The location is identified inside the function by a comment statement that reads: "CHANGE THE BODY OF THE FOLLOWING 'if' STATEMENT TO SUIT YOUR INDIVIDUAL NEEDS AND REQUIREMENTS FOR LOGGING ERROR MESSAGES".

The process list file should contain regular expressions representing lines of output from the "ps -ef" command. The default process list file is located at "/etc/psmonitor.list". Example contents of this file might appear as:

root.*sendmail
oracle DBNAME
mercatord

Each line of the process list file is read and compared against the list of all system processes. If a match is not found, an error message is generated and logged via zero or more of the available methods.
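
The limit of one error message per day for each process relies on a working copy of the process list: an entry that fails is dropped from the working copy, and entries missing from the working copy are re-checked to see whether the process has been restarted. The sketch below shows the set difference behind that re-check. The directory and file names are assumptions for the demonstration.

```shell
WORKDIR="${TMPDIR:-/tmp}/psmonitor_demo.$$"
mkdir -p "${WORKDIR}"
MASTER="${WORKDIR}/master.list"     # stands in for /etc/psmonitor.list
WORKING="${WORKDIR}/working.list"   # stands in for the working copy

printf '%s\n' 'root.*sendmail' 'oracle DBNAME' 'mercatord' |
  sort | uniq > "${MASTER}"

# Suppose "mercatord" was reported as down on an earlier run: it has
# been dropped from the working copy, so later runs stay silent.
grep -v -- 'mercatord' "${MASTER}" > "${WORKING}"

# Lines found only in the master list are the entries already
# reported; these are the candidates to re-check for a restart.
REPORTED=$( comm -23 "${MASTER}" "${WORKING}" )

printf 'already reported: %s\n' "${REPORTED}"
rm -rf -- "${WORKDIR}"
```

Both inputs to "comm" must be sorted the same way, which is why the lists are passed through "sort" before being compared.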

The psmonitor_k93 function can also be used to restart failed processes. To do so, build an external shell script for this purpose, then run this function and specify the external shell script as a command line option. When a process is detected as "not running", the external shell script is executed to restart the failed process.
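
A restart script could select an action based on the regular expression that failed to match, which the function holds in PROCESSCMD. The following is a minimal sketch under that assumption. The helper name "restart_for" is hypothetical, and the "echo" commands are placeholders where real start commands would go.

```shell
# Body of a hypothetical restart script. Replace each echo with the
# command that actually starts the corresponding process.
restart_for() {
  case "$1" in
    'root.*sendmail') echo "would start sendmail" ;;
    'mercatord')      echo "would start mercatord" ;;
    *)                echo "no restart rule for: $1"; return 1 ;;
  esac
}

# Stand-in value; psmonitor_k93 sets PROCESSCMD before the script runs.
PROCESSCMD='mercatord'
RESULT=$( restart_for "${PROCESSCMD}" )
printf '%s\n' "${RESULT}"
```

Since a failed entry is dropped from the working process list until the process is seen running again or the list is reset at midnight, a script like this would be triggered once per failure rather than once per minute.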



#!/usr/bin/ksh93
################################################################
function usagemsg_psmonitor_k93 {
  print "
Program: psmonitor_k93

This function reads a list of process arguments from a process list
file and checks the system process list to see if each process
is running.  If not, it logs an error in the AIX error log.  This
function is intended to be run from cron once every minute of
every day.

Usage: ${1##*/} [-?] [-vVLEu] [-p processListFile] [-c configFile]
                [-l loggerScript] [-r resetDays] [-e emailAddress]
    Where:
      -p processListFile = Use the specified process list file.
                           Default: /etc/psmonitor.list
      -c configFile      = Use the specified configuration file to
                           define error message variables.
                           Default: none
      -l loggerScript    = Execute the external error logging script 
                           specified by the file name "loggerScript".
                           Default: none
      -r resetDays       = Number of days between configuration resets
                           Default: 1  (NOT IMPLEMENTED AT THIS TIME)
      -e emailAddress    = Email address(es) to send error notification.

      -u = Execute local user customized code section of this function.
      -L = Do NOT log messages to AIX Errorlog.
      -E = Do NOT send email messages for error notification.
      -v = Verbose mode
      -V = Very Verbose Mode

Example: psmonitor_k93

Author: Dana French (dfrench@mtxia.com)
        Copyright 2006 by Dana French

\"AutoContent\" enabled
"
}
################################################################
function psmonitor_k93 {
  typeset TRUE="1"
  typeset FALSE="0"
  typeset RETCODE="0"
  typeset VERBOSE="${FALSE}"
  typeset VERYVERB="${FALSE}"
  typeset LOGGER=""
  typeset PROCLISTFILE="/etc/psmonitor.list"
  typeset CONFIGFILE=""
  typeset RESETDAYS="1"
  typeset TMPFILE="/tmp/psmonitor_k93.${$}.tmp"
  typeset AIXERRLOG="${TRUE}"
  typeset SENDEMAIL="${TRUE}"
  typeset CUSTOMCODE="${FALSE}"
  typeset VERSION="1.0"
  typeset ErrMsgType="${ErrMsgType:-1-Critical}"
  typeset ErrNtfyCon="${ErrNtfyCon:-Unix on-call}"
  typeset ErrNtfyTim="${ErrNtfyTim:-Normal office hours}"
  typeset ErrCompCls="${ErrCompCls:-Software}"
  typeset ErrCompNam="${ErrCompNam:-AIX}"
  typeset ErrRetCode="${ErrRetCode:-1}"
  typeset ErrLabel="${ErrLabel:-Process not running}"
  typeset ErrDescrip="${ErrDescrip:-Process is not running for frame associated with}"
  typeset ErrEmail="${ErrEmail:-dfrench@mtxia.com}"

  typeset MESSAGE='Error Message Type: ${ErrMsgType}
Error Notification Contact: ${ErrNtfyCon}
Error Notification Time: ${ErrNtfyTim}
Error Component Class: ${ErrCompCls}
Error Component Name: ${ErrCompNam}
Error Return Code: ${ErrRetCode}
Error Label: ${ErrLabel} \"${PROCESSCMD}\"
Error Description: ${ErrDescrip} \"${PROCESSCMD}\"
Error Email Address: ${ErrEmail}'

  while getopts ":vVELup:c:l:r#e:" OPTION
  do
    case "${OPTION}" in
      'v') VERBOSE="${TRUE}";;
      'V') VERYVERB="${TRUE}";;
      'p') PROCLISTFILE="${OPTARG}";;
      'c') CONFIGFILE="${OPTARG}";;
      'l') LOGGER="${OPTARG}";;
      'e') ErrEmail="${OPTARG}";;
      'u') CUSTOMCODE="${TRUE}";;
      'L') AIXERRLOG="${FALSE}";;
      'E') SENDEMAIL="${FALSE}";;
      '?') usagemsg_psmonitor_k93 "${0}" && return 1 ;;
      ':') usagemsg_psmonitor_k93 "${0}" && return 1 ;;
      '#') usagemsg_psmonitor_k93 "${0}" && return 1 ;;
    esac
  done
   
  shift $(( ${OPTIND} - 1 ))
  
  (( VERBOSE == TRUE )) && print -u 2 -- "# Version: ${VERSION}"
  (( VERBOSE == TRUE )) && print -u 2 -- "# Process List File: ${PROCLISTFILE}"

################################################################

  trap "usagemsg_psmonitor_k93 ${0}" EXIT
  
#### 
#### Check to see if the specified process list file  exists
#### and contains data.  If not, display an error message and
#### return from the function with a non-zero return code.
#### 
  
  RETCODE="1"
  if ! [[ -s "${PROCLISTFILE}" ]]
  then
    print -u 2 -- "# ERROR: Process List file \"${PROCLISTFILE}\" does not exist"
    print -u 2 -- "#        or contains no data."
    return ${RETCODE}
  fi

#### 
#### Build a full path file name for the working copy of the
#### process list file, replacing the slashes with bang
#### symbols.  This is so that if this function is executed
#### by multiple users, they are unlikely to overwrite each
#### other's working process list file.
#### 

  typeset PROCLISTWORK="${PROCLISTFILE}.work"
  if [[ "_${PROCLISTFILE}" != _/* ]]
  then
      typeset PROCLISTWORK="${PWD}/${PROCLISTFILE}.work"
  fi
  PROCLISTWORK="/tmp/${PROCLISTWORK//\//!}"
  PROCLISTWORK="${PROCLISTWORK//!.!/!}"

  (( VERBOSE == TRUE )) && print -u 2 -- "# Working process list File: ${PROCLISTWORK}"
  
#### 
#### Check to see if the working process list file exists,
#### if not create it from the user specified or default
#### process list file using sorted and unique record lines. 
#### 

  if ! [[ -f "${PROCLISTWORK}" ]]
  then
    (( VERBOSE == TRUE )) && print -u 2 -- "# Working process list file \"${PROCLISTWORK}\" does not exist"
    (( VERBOSE == TRUE )) && print -u 2 -- "#     Creating \"${PROCLISTWORK}\""
    sort "${PROCLISTFILE}" | uniq > "${PROCLISTWORK}"
  fi
  
#### 
#### Check to see if the user specified or default
#### process list file has a later time stamp than the
#### working process list file. If so, rebuild the working
#### config file using sorted and unique record lines.
#### 

  if [[ "${PROCLISTFILE}" -nt "${PROCLISTWORK}" ]]
  then
    (( VERBOSE == TRUE )) && print -u 2 -- "# Process list file \"${PROCLISTFILE}\" is newer than working copy."
    (( VERBOSE == TRUE )) && print -u 2 -- "# Resetting working copy to resemble newer process list file."
    sort "${PROCLISTFILE}" | uniq > "${PROCLISTWORK}"
  fi
  
#### 
#### Check to see if the number of days between working file
#### resets is less than 1, if so display an error message
#### and return from the function with a non-zero return
#### code. 
#### 

  RETCODE="2"
  if (( RESETDAYS <= 0 ))
  then
    print -u 2 -- "# ERROR: Number of days between working file resets is less than 1, MIN=1"
    return ${RETCODE}
  fi

#### 
#### If a configuration file is specified on the command line,
#### check to see that it exists, has a non-zero file length,
#### and is executable.  If it passes these tests, execute it
#### to define the error message variables and values.
#### 

  RETCODE="3"
  if [[ "_${CONFIGFILE}" != "_" ]] && [[ -s "${CONFIGFILE}" ]]
  then
    (( VERBOSE == TRUE )) && print -u 2 -- "# Configuration File: ${CONFIGFILE}"
    if [[ -x "${CONFIGFILE}" ]]
    then
      . "${CONFIGFILE}"
    else
      print -u 2 -- "# ERROR: Configuration file \"${CONFIGFILE}\" is not executable."
      return ${RETCODE}
    fi
  fi

  RETCODE="0"
  
  trap "-" EXIT
  
  (( VERYVERB == TRUE )) && set -x

#### 
#### Reset the working psmonitor.list file once a day at midnight
#### 

  TOD=$( date +"%H%M" )
  if [[ "_${TOD}" = _0000 ]]
  then
    rm -f -- "${PROCLISTWORK}"
    sort -- "${PROCLISTFILE}" | uniq > "${PROCLISTWORK}"
  fi

################################################################

#### Generate a list of all processes on the system and store
#### the list in an array, one process record line per array
#### element. 

  IFS=$'\n'
  PLIST=( $( ps -ef | grep -v grep ) )
  IFS=$' \t\n'

#### 
#### Loop through the record lines in the working
#### process list file one line at a time.  Each line is
#### assumed to contain a regular expression representing a
#### process that appears in a system's "ps -ef" output.
#### 

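  rm -f -- "${TMPFILE}"
#### Create an empty temporary file so the "sort" after the loop
#### has input even when no monitored process is running.
  : > "${TMPFILE}"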
  while read -r -- PROCESSCMD
  do

    (( VERBOSE == TRUE )) && print -u 2 -r -- "# Process args regex: \"${PROCESSCMD}\""

#### Test the contents of the process list array to determine
#### if it contains the process identifier read from the
#### working process list file.  If it does not, then the
#### process is not running, so log an error message.

    IFS=$'\n'
    if ! print -r -- "${PLIST[*]}" | grep -- "${PROCESSCMD}" > /dev/null 2>&1
    then
      IFS=$' \t\n'
      print -u 2 -r -- "# ERROR: Process matching \"${PROCESSCMD}\" does not exist"

#### Evaluate the error message text to cause the dynamically
#### assigned values to be substituted into the message. 

      eval MSG="\"${MESSAGE}\""

#       (( VERBOSE == TRUE )) && print -- "${MSG}"

#### Insert the error message into the standard AIX error log
#### using the "errlogger" utility.

      (( AIXERRLOG == TRUE )) && errlogger "${MSG}"

#### Email the error message to the person(s) or groups
#### identified as the recipient of these error messages.
#### This email address may be specified on the command line,
#### configuration file, or as an environment variable.
#### ErrEmail is intentionally unquoted so that multiple
#### space-separated addresses become separate arguments.

      (( SENDEMAIL == TRUE )) && print -r -- "${MSG}" |
              mail -s "$( hostname ) psmonitor_k93" ${ErrEmail}

#### If an error logging script was specified on the command
#### line, execute it.  Assume the script utilizes the
#### appropriate error message variables.

      [[ "_${LOGGER}" != "_" ]] && [[ -x "${LOGGER}" ]] && . "${LOGGER}"

#### If the command line option to execute local user
#### customized code was selected on the command line,
#### execute this section of code.  CHANGE THE BODY OF THE
#### FOLLOWING "if" STATEMENT TO SUIT YOUR INDIVIDUAL NEEDS
#### AND REQUIREMENTS FOR LOGGING ERROR MESSAGES. 

      if (( CUSTOMCODE == TRUE ))
      then
        (( VERBOSE == TRUE )) && print -u 2 "# Begin local user custom code section."

        print "# "
        print "# If you had inserted your customized code for error"
        print "# logging and/or notification, this function would be"
        print "# running it now..."
        print "# "

        (( VERBOSE == TRUE )) && print -u 2 "# End local user custom code section."
      fi

    else

#### If the process list array contains the process
#### identifier read from the working process list file, then
#### insert the process identifier into a temporary storage
#### file.  This file will be used during the next invocation
#### of this function as the list of valid process identifiers
#### to test against.

      IFS=$' \t\n'
      print -r -- "${PROCESSCMD}" >> "${TMPFILE}"

    fi

  done < "${PROCLISTWORK}"

#### Sort the list of valid process identifiers and extract
#### only the unique values.  Store these values in the
#### working process list file.

  sort -- "${TMPFILE}" | uniq > "${PROCLISTWORK}"

################################################################

  (( VERBOSE == TRUE )) && print -u 2 -r -- "# Begin checking for restarted processes."

#### 
#### Now loop through the record lines of the process list
#### file that do not appear in the working process list
#### file, and determine if any running processes match.  If
#### so, add them back to the working process list file.
#### 

  rm -f -- "${TMPFILE}"
  cp -f -- "${PROCLISTWORK}" "${TMPFILE}"
  sort -- "${PROCLISTFILE}" | uniq | comm -23 - "${PROCLISTWORK}" |
  while read -r -- PROCESSCMD
  do

    (( VERBOSE == TRUE )) && print -u 2 -r -- "# Check for restarted process: \"${PROCESSCMD}\""

#### Test the contents of the process list array to determine
#### if it contains the process identifier read from the
#### process list file.  If it does, then the process has
#### been restarted, so add it back into the working process
#### list file.

    IFS=$'\n'
    if print -r -- "${PLIST[*]}" | grep -- "${PROCESSCMD}" > /dev/null 2>&1
    then
      IFS=$' \t\n'

#### If the process list array contains the process
#### identifier read from the process list file, then
#### insert the process identifier into a temporary storage
#### file.  This file will be used during the next invocation
#### of this function as the list of valid process identifiers
#### to test against.

      (( VERBOSE == TRUE )) && print -u 2 -r -- "#     Re-adding \"${PROCESSCMD}\" to the working process list."
      print -r -- "${PROCESSCMD}" >> "${TMPFILE}"

    fi
    IFS=$' \t\n'

  done

#### Sort the list of valid process identifiers and extract
#### only the unique values.  Store these values in the
#### working process list file.

  sort -- "${TMPFILE}" | uniq > "${PROCLISTWORK}"
  rm -f -- "${TMPFILE}"

  (( VERBOSE == TRUE )) && print -u 2 -r -- "# End checking for restarted processes."

  return ${RETCODE}
}
################################################################

psmonitor_k93 "${@}"
