Process-Management Monitor


CHAPTER 31
Process-Management Monitor
System process monitors can be a vital tool in determining the health of a running
machine. Ensuring that the required processes are running and that the total number
of each type of running process is appropriate is a good way to maintain system stability.
The downside of these types of monitors is that they let you know only which processes
are running and how many there are. They don’t give you an indication of the health of
each individual process.
This script dives a little deeper into the condition of processes. By using the ps command
with a customized format, we’ll be able to monitor the age, proportion of CPU
usage, virtual-memory consumption, and amount of CPU time consumed by a particular
process. If you are monitoring multiple instances of any given process, each instance will
be held up to the standard being monitored.
One other feature of this process monitor is that it can be configured not only to warn
you of impending peril from processes whose operational values are out of bounds, but
also to take action in the form of killing the aberrant process when necessary. The monitor
could be modified easily to perform other actions besides killing a process.
Using historical data, you can sometimes predict when a specific application will start
to consume too many resources. It was one such application I was working with that
prompted me to write this monitor. The monitor helped in characterizing exactly when
the application ran out of control and in finding the cause of the behavior. Both were very
helpful in fixing the problem.
The syntax for monitor configuration is fairly straightforward: each record has five
colon-separated fields, as shown in the following example. In order, the fields are the
process command, the indicator to track, a lower threshold, an upper threshold, and the
kill option. You can configure multiple processes by including several space-separated
records in the configuration string.
kill_plist="dhcpd:pcpu:15:30:1 sshd:pcpu:15:30:1"

The first field is the process command itself. This will be slightly different from, and
hopefully simpler than, the traditional ps -ef output. The ps -ef default output (-e for all
processes, -f for a full-format listing) includes the commands that are running, as well as
any arguments they were passed. The ps -eo comm output is formatted to include only
the commands that are running on a system without any path or argument information.
With this switch combination (-eo) you can also format your output in many ways to
show many other options, such as memory size, process age, process CPU time, and
so on. (On some UNIX systems, you may need to define the UNIX95 variable within the
script for the ps -eo command to function properly. The UNIX95 variable can be set to
anything you’d like; it just needs to not be undefined.) When specifying the process for
our script to monitor, you’ll want to use only the command name, as this is what the
script will be looking for.
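As a rough illustration of the custom format, the following sketch wraps ps -eo in a small helper. The function name ps_columns is hypothetical, and setting UNIX95 is assumed to be harmless on systems that do not need it. The output depends entirely on what is running on your machine.

```shell
# Hypothetical helper: list every process, showing only the requested columns.
# UNIX95 is set for this one command only; its value does not matter.
ps_columns() {
  UNIX95=1 ps -eo "$1"
}

# Show the process ID and the bare command name of each process.
ps_columns pid,comm
```

Further columns can be requested by extending the comma-separated list passed to the helper.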
The second field contains the indicator you want to track. The options are cputime,
which measures the number of minutes of CPU time the process has consumed; etime, which
is the elapsed time in minutes since the process began running; pcpu, which represents
the current percentage of the CPU capacity the process is consuming; and vsize, which
shows the virtual-memory size in kilobytes for the process.
The third and fourth fields contain the desired lower and upper thresholds for the
indicator you’re tracking.
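To make the role of the two thresholds concrete, here is a minimal sketch of how a reading might be compared against them. It is not taken from the monitor itself: the name classify is hypothetical, and the strict greater-than comparison is an assumption. The comparison is done in awk so that fractional readings, such as a pcpu value of 15.3, are handled.

```shell
# Hypothetical helper: compare a reading with the two thresholds.
# Usage: classify VALUE LOWER UPPER   (prints ok, warning, or error)
# Assumption: a threshold counts as crossed only when the reading exceeds it.
classify() {
  awk -v v="$1" -v lo="$2" -v hi="$3" 'BEGIN {
    if (v + 0 > hi + 0)      print "error"
    else if (v + 0 > lo + 0) print "warning"
    else                     print "ok"
  }'
}

classify 12.0 15 30   # below both thresholds
classify 15.3 15 30   # above the lower threshold only
classify 42 15 30     # above the upper threshold
```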
The fifth and final field is the kill option. It is a value from 0 to 3:
0: Send a notification when either the low (warning) or high (error) threshold has been
crossed, but don’t kill the process.
1: Send a warning notification when the low threshold has been crossed or an error
notification when the high threshold has been crossed, and kill the process.
2: Send only a low-level warning notification when either the low or high threshold has
been crossed, and kill the process.
3: Kill the process without any notification at all.
Note that for safety, if the kill option is not set or is set to anything but one of the values
outlined here, processes will not be killed. Notice that there are two levels of notification.
I have used alphanumeric paging for the high level (error status) and e-mail for the low
level (warning status). You may want to implement the notification method as appropriate
for your needs.
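One possible shape for the two levels of notification is a small dispatcher. This is only a sketch: send_notice is a hypothetical name, the level strings Warning and Error are assumed, and the echo commands are placeholders for whatever paging and e-mail commands your site uses.

```shell
# Hypothetical dispatcher: route a message according to its severity.
# Both branches only print; replace them with real delivery commands.
send_notice() {
  level=$1
  shift
  case $level in
    Error)   echo "PAGE: $*" ;;   # high level, for example an alphanumeric page
    Warning) echo "MAIL: $*" ;;   # low level, for example an e-mail
    *)       echo "LOG: $*" ;;    # anything else is simply recorded
  esac
}

send_notice Warning "sshd process id 1234 found with 17 percent CPU"
send_notice Error "sshd process id 1234 found with 45 percent CPU"
```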
The first section of the script sets up a few configuration variables, which alternatively
could be stored in a separate configuration file and sourced each time the script runs
through the loop. This would allow for live configuration changes to the script. The debug
value is for testing and the sleeptime value represents the amount of time to delay
between each run. The kill_plist variable is the main configuration value that lets the
script know what processes and values it should be watching.
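If you prefer the separate configuration file mentioned above, a sketch along the following lines might work. The file location and the load_config name are hypothetical, and a temporary file stands in for a real configuration file so that the example is self-contained.

```shell
# Hypothetical configuration file; a temporary file stands in for it here.
conf=${TMPDIR:-/tmp}/procmon.$$.conf
cat > "$conf" <<'EOF'
debug=1
sleeptime=3
kill_plist="dhcpd:pcpu:15:30:1 sshd:pcpu:15:30:1"
EOF

# Re-read the settings. Calling this at the top of each pass through the
# main loop would pick up edits made while the monitor is running.
load_config() {
  [ -r "$1" ] && . "$1"
}

load_config "$conf"
echo "sleeptime=$sleeptime kill_plist=$kill_plist"
rm -f "$conf"
```

Because the file is sourced, it can contain any shell code, so it should be writable only by the user who runs the monitor.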
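#!/bin/sh
debug=1      # Nonzero: report only; processes are killed only when this is 0
sleeptime=3  # Delay between runs
# Records to monitor: command:indicator:lower:upper:killoption
kill_plist="dhcpd:pcpu:15:30:1 sshd:pcpu:15:30:1"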
The following function performs all notifications and process terminations in the
script. It is called with seven sequentially numbered parameters. The positional variables
are somewhat difficult to understand and their values could have been assigned to more
meaningfully named variables before they were used, for ease of debugging later. To
streamline the script a little, I didn’t do this.
notify ()
{
  case $2 in
    0)
      # Warn/error level and don't kill..
      echo "$1: $3 process id $4 found with $5 $7. Should be less than $6."
      ;;
    1)
      # Warn/error level and kill..
      echo "$1: $3 process id $4 found with $5 $7. Should be less than $6."
      test $debug -eq 0 && kill $4
      ;;
    2)
      # Warning level only...
      echo "Warning: $3 process id $4 found with $5 $7. Should be less than $6."
      test $debug -eq 0 && kill $4
      ;;
    3)
      # Just kill, don't warn at all..
      test $debug -eq 0 && kill $4
      ;;
    *)
      echo "Warning: killoption not set correctly, please validate configuration."
      ;;
  esac
}
Here, for ease of reference, are all of the positional parameters passed to this
function:
$1: Text used to build the notification string; it distinguishes a warning from
an error
$2: The kill option, which has a possible value of 0-3
$3: The process name that is being monitored
$4: The process ID of the process being monitored
$5: The current value of the indicator you are tracking
$6: The monitor’s lower threshold
$7: The text equivalent of the indicator you are tracking
This is also a good example of how a function can reduce the length and complexity
of a script. The body of this function is code that would have to be repeated eight times
throughout the script if it were not placed in a function. An older version of this script was
written this way. Putting the code into a function reduced the script’s length by roughly
40 percent.
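Returning to the positional parameters of notify, the following sketch shows roughly what the function could look like with meaningfully named variables. It is an illustration, not the version used in this script. The name notify_named and the n_ variable names are hypothetical; the prefix is there because plain sh has no local variables, and unprefixed names could overwrite variables used elsewhere in the script.

```shell
debug=1   # Nonzero: report only, never kill

notify_named() {
  # Illustrative names for the seven positional parameters.
  n_level=$1 n_killopt=$2 n_process=$3 n_pid=$4
  n_current=$5 n_threshold=$6 n_label=$7
  n_msg="$n_process process id $n_pid found with $n_current $n_label."
  n_msg="$n_msg Should be less than $n_threshold."

  case $n_killopt in
    0) echo "$n_level: $n_msg" ;;
    1) echo "$n_level: $n_msg"
       test "$debug" -eq 0 && kill "$n_pid" ;;
    2) echo "Warning: $n_msg"
       test "$debug" -eq 0 && kill "$n_pid" ;;
    3) test "$debug" -eq 0 && kill "$n_pid" ;;
    *) echo "Warning: killoption not set correctly, please validate configuration." ;;
  esac
}

# With kill option 0 the function only reports; the PID here is made up.
notify_named "Error" 0 sshd 1234 45 30 "percent CPU"
```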
The following code is the beginning of the main loop. The script is intended to be run
at system startup; it will then be run continuously through an infinite loop. After each
iteration completes, the script will sleep for a predetermined time before the next iteration.
The first part here is a nested loop that progresses through each record in the
configuration string to parse its fields and set up the monitor.
while :
do
  for pline in $kill_plist
  do
    # Split the colon-separated record into its five fields.
    process=`echo "$pline" | cut -d: -f1`
    # Convert each %20 in the command name back to a space.
    process=`echo "$process" | sed -e 's/%20/ /g'`
    type=`echo "$pline" | cut -d: -f2`
    value=`echo "$pline" | awk -F: '{print $3}'`
    errval=`echo "$pline" | awk -F: '{print $4}'`
    killoption=`echo "$pline" | awk -F: '{print $5}'`
The process variable is assigned the first field in the configuration record (pline). It is
possible that the process command name you’re monitoring will consist of more than one
word, separated by spaces. Because spaces separate the records in the configuration
string, such a space must be written there as %20, a commonly used substitute for the
space character (as in URL encoding, for example). The sed command then converts each
%20 back into a space.

The type variable is the second field in the configuration record. As mentioned, it
specifies the performance indicator to watch: cputime (amount of CPU time consumed), etime
(elapsed time or age of process), pcpu (current percentage of the CPU consumed), or vsize
(virtual-memory size).
The value variable holds the lower warning threshold for the monitored value, taken
from the third field.
The errval variable is assigned the value of the upper error threshold for the monitored
value, taken from the fourth field.
The killoption variable is assigned the final field of the configuration record and
specifies an action to perform when the process deviates from the normal range.
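As an alternative to calling cut and awk once per field, the same five variables can be filled with a single read. The sketch below is one way to do it, assuming a POSIX shell; parse_record is a hypothetical name and the record shown is made up to demonstrate the %20 handling.

```shell
# Hypothetical helper: split one record into the five variables.
parse_record() {
  # IFS is changed only for this read, so the caller's IFS is untouched.
  IFS=: read process type value errval killoption <<EOF
$1
EOF
  # Turn each %20 placeholder back into a space.
  process=`echo "$process" | sed -e 's/%20/ /g'`
}

parse_record "my%20daemon:vsize:50000:80000:0"
echo "process=[$process] type=$type value=$value errval=$errval killoption=$killoption"
```

This avoids starting several extra processes for every record on every pass through the loop, at the cost of being slightly less obvious to read.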
If the kill option was not specified initially, we set it to be the default kill option. This
makes sure no processes are killed unless one of the options for doing so is explicitly used.
    if [ "$killoption" = "" ]
    then
      killoption=0
    fi
    test $debug -gt 0 && echo "Kill $process processes if $type is greater than $errval"
Next we pare down the full list of processes running on the system to the ones running
the command being monitored. Then we start a loop that iterates through the remaining
processes.
    for pid in `ps -eo pid,comm | egrep "${process}$|${process}:$" | grep -v grep |
      awk '{print $1}'`
    do
For each process ID, the script has to gather the pertinent information. The embedded
ps command gathers only the specific information we want.

      test $debug -gt 0 && echo "$process pid $pid"
      # Anchor the match on the PID column so that PID 12 does not also select PID 123.
      pid_string=`ps -eo pid,cputime,etime,pcpu,vsize,comm | \
        grep "^ *$pid " | egrep "${process}$|${process}:$" | grep -v grep`
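To see what the process-name pattern selects, it can help to run it against canned input instead of live ps output. In this sketch, filter_pids is a hypothetical helper, the sample listing is made up, and grep -E stands in for egrep.

```shell
# Hypothetical helper: read "pid comm" lines on stdin and print the PIDs
# whose command name ends with $1, optionally followed by a colon.
filter_pids() {
  grep -E "$1\$|$1:\$" | awk '{print $1}'
}

# Made-up listing in the style of "ps -eo pid,comm" output.
sample='  101 sshd
  102 dhcpd
  103 sshd:
  104 sshd-agent
 1010 bash'

echo "$sample" | filter_pids sshd
```

Only PIDs 101 and 103 are selected. The pattern is anchored at the end of the line but not at the start, so a command whose name merely ends in the monitored name would match as well.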
The following case statement is the heart of the monitor. The script tests for the monitor
type (cputime, etime, pcpu, or vsize); the cputime is the first monitor type listed. The code for
each type is slightly different, but all are very similar. Here we obtain the process time from
the ps output, as well as the number of fields that the proc_time variable contains.
      case $type in
        "cputime")
          proc_time=`echo $pid_string | awk '{print $2}'`
          fields=`echo $proc_time | awk -F: '{print NF}'`
          proc_time_min=`echo $proc_time | awk -F: '{print $(NF-1)}'`
Both of these are needed because the format of the time value varies depending on the
amount of time it represents. The cputime and etime variables have values of the form
days-hours:minutes:seconds, hours:minutes:seconds, or, for short durations, just
minutes:seconds. A low value might look something like 00:28 for 28 seconds. A high value
could be 1-18:32:29 for 1 day, 18 hours, 32 minutes, and 29 seconds. Both of these types
have to be processed and converted to minutes. (Seconds are dropped for simplicity.)
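The conversion can be sketched compactly by letting awk do all of the arithmetic. This is not the approach the script takes; it is a minimal alternative, with to_minutes as a hypothetical name. Doing the arithmetic in awk also sidesteps the problem of shells treating zero-padded values such as 08 as octal.

```shell
# Hypothetical helper: convert [[days-]hours:]minutes:seconds to whole minutes.
# Seconds are dropped, as in the monitor.
to_minutes() {
  echo "$1" | awk '{
    days = 0; t = $0
    n = index(t, "-")                 # position of the day separator, if any
    if (n > 0) { days = substr(t, 1, n - 1); t = substr(t, n + 1) }
    nf = split(t, f, ":")             # 2 fields: mm:ss, 3 fields: hh:mm:ss
    hours = (nf == 3) ? f[1] : 0
    mins = f[nf - 1]                  # minutes are always next to last
    print days * 1440 + hours * 60 + mins
  }'
}

to_minutes 00:28        # prints 0
to_minutes 02:05:10     # prints 125
to_minutes 1-18:32:29   # prints 2552
```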
Of the four performance indicators, the logic for handling the cputime and etime values
is the most complex because the format used to report them changes depending on the
amount of time these values represent.
          if [ $fields -lt 3 ]
          then
            proc_time_hr=0
            proc_time_day=0
