CHAPTER 14
Throttling Parallel Processes
I have often needed to perform a task across multiple remote systems. A common example is the installation of a software package on each machine in a large environment. With relatively small environments, you could simply write a script that loops through a list of systems and performs the desired task serially on each machine. Another method would be to loop through the list of machines and submit your job to the background so the tasks are performed in parallel. Neither of these methods scales well when you run a large environment, however. Processing the list sequentially is not an efficient use of resources and can take a long time to complete.
With too many background parallel processes, the initiating machine will run out of
network sockets and the loop that starts all the background tasks will stop functioning.
Additionally, even if you were permitted an unlimited number of socket connections, the
installation package may be quite large and you might end up saturating your network.
You might also have to deal with so many machines that the installations will take an
extremely long time to complete because of network contention. In all of these cases you
need to control the number of concurrent sessions you have running at any given time.
The scripts presented in this chapter demonstrate a way of controlling the number of
parallel background processes. You can then tune your script based on your particular
hardware and bandwidth by timing sample runs, and you can play with the number of
parallel processes to control the time it takes to run the background jobs. The general idea
of this algorithm is that a background job is spawned whose only task is to feed a large list
of items back to the parent process at a rate that is controlled by the parent process.
Since not a lot of us have to manage remote jobs on hundreds to tens of thousands of machines, this chapter uses an example that has broader applicability: a script that validates web-page links. The script takes the URL of a web site as input. It gathers the URLs found on the input page, and then gets all the URLs from each of those pages, up to a specified level of depth. It is usually sufficient to carry the process to two levels to gather from several hundred to a few thousand unique URLs.
Once the script has finished gathering the URLs, it validates each link and writes the validation results to a log file. The script starts URL validation in groups of parallel processes whose size is based on the number specified when the script was called. Once a group starts, the code waits for all the background tasks to complete before it starts
the next group. The script repeats the URL validation process until it has checked all web
pages passed to it.
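The loop that implements this grouping is not reached in this excerpt, but the underlying pattern is easy to sketch. The fragment below is only an illustration of the group-and-wait idea, assuming a hypothetical check_url function, a whitespace-separated url_list variable, and a max_parallel count; it is not the chapter's actual loop.

# Illustration only: start background jobs in groups of $max_parallel
# and let each group finish before starting the next.
count=0
for url in $url_list
do
  check_url $url &                 # one background validation per URL
  (( count += 1 ))
  if [ $count -ge $max_parallel ]
  then
    wait                           # block until the whole group finishes
    count=0
  fi
done
wait                               # pick up any jobs in the final, partial group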
You could easily modify the script to manage any parallel task. If you want to focus on
URL validation, you could limit the list of URLs to be validated to those residing within
your own domain; you would thereby create a miniature web crawler that validates URLs
on your own site.
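One simple way to impose that restriction is to filter the gathered list before validation begins. The lines below are only a sketch; mydomain is an illustrative variable name, and totalurls is the list of gathered links used later in the chapter.

# Sketch: keep only links that fall within your own domain.
mydomain="mysite.com"
totalurls=`echo $totalurls | tr ' ' '\n' | grep "$mydomain"`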
Parallel Processing with ksh
One feature available within ksh is called a co-process. This is a process that is run and sent to the background with syntax that allows the background child process to run asynchronously from the parent that called it. Both processes are able to communicate with each other. The following version of the web crawler uses the co-process method.
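Before walking through the crawler itself, here is a minimal, stand-alone illustration of the co-process mechanics: the child is started with the |& syntax, the parent writes to it with print -p, and the parent reads its replies with read -p. The function name and messages are only examples, not part of the chapter's script.

#!/bin/ksh
# Minimal ksh co-process demonstration (illustrative names only).
function shouter {
  read line                  # stdin is the pipe from the parent
  print "GOT: $line"         # stdout goes back through the same pipe
}

shouter |&                   # launch the function as a co-process

print -p "hello"             # write one message to the co-process
read -p reply                # read its answer from the pipe
echo $reply                  # prints: GOT: hello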
You start by defining the log file for the script to use, and then you have to determine
whether it already exists. If there is a previous version, you need to remove it.
#!/bin/ksh
LOGFILE=/tmp/valid_url.log
if [ -f $LOGFILE ]
then
  rm $LOGFILE
fi
The main loop calls the url_feeder function as the background co-process task. The function starts an infinite loop that waits for the message “GO” to be received. Once the function receives the message, it breaks out of the loop and continues executing function code.
function url_feeder {
  while read
  do
    [[ $REPLY = "GO" ]] && break
  done
The script passes this function a variable containing the list of unique URLs that have
been collected. The script collects all links based on the starting web-page URL and the
link depth it is permitted to search. This loop iterates through each of the pages and prints
the links, although not to a terminal. Later I will discuss in greater detail how this function
is called.
  for url in $*
  do
    print $url
  done
}
The find_urls function finds the list of web pages and validates the URLs.
function find_urls {
  url=$1
  urls=`lynx -dump $url | sed -n '/^References/,$p' | \
    egrep -v "ftp://|javascript:|mailto:|news:|https://" | \
    tail -n +3 | awk '{print $2}' | cut -d\? -f1 | sort -u`
It takes a single web-site URL (such as www.google.com) as a parameter. This is the
function that is called as a background task from the script’s main code, and it can be
performed in parallel with many other instances of itself.

The urls variable contains the list of links found by the lynx command on the page defined by the url variable. This lynx command lists all URLs found on a given site, in output that is easy to obtain and manipulate in text form. To remove links that do not represent web pages, I piped the output of lynx to egrep and ordered and formatted the links with tail, awk, cut, and sort.
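To see why the pipeline is assembled this way, it helps to know the general shape of lynx -dump output, which ends with a numbered reference list. The lines below are only a rough, made-up example of that tail section:

References

   1. http://www.example.com/index.html
   2. http://www.example.com/about.html?session=1234
   3. mailto:webmaster@example.com

Working through the pipeline: sed keeps everything from the References line to the end of the dump; egrep -v discards the protocols the script does not want to follow (ftp, javascript, mailto, news, and https); tail -n +3 drops the References heading and the blank line beneath it; awk prints the second column, which is the URL itself; cut trims any query string at the first ?; and sort -u removes duplicate links.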
Now you need to determine the number of URLs found on the page that was passed to the function. If no URLs were found, then the script checks whether the second positional parameter $2 was passed. If it was, then the function is acting in URL-validation mode and it should log a message stating the page was not found. If $2 was not passed, then the function is acting in URL-gathering mode and it should echo nothing, meaning it didn’t find any links to add to the URL list.
  urlcount=`echo $urls | wc -w`
  if [ "$urls" = "" ]
  then
    if [ "$2" != "" ]
    then
      echo $url Link Not Found on Server or Forbidden >> $LOGFILE
    else
      echo ""
    fi
If a single URL was found and it matches a particular special-case page that lynx reports, then we log that the web page has not been found. That special-case page itself can be ignored.
  elif [ $urlcount -eq 1 -a "$urls" = " ]
  then
    if [ "$2" != "" ]
    then
      echo $url Site Not Found >> $LOGFILE
    else
      echo ""
    fi
As in the previous section of code, if $2 is not passed, the function is acting in URL-
gathering mode.
The following code applies when the URL was found to be valid:
  else
    if [ "$2" != "" ]
    then
      echo "$url is valid" >> $LOGFILE
    else
      echo " $urls"
    fi
  fi
}
If this is the case and $2 was passed to the function, you would log that the web page is valid.
If $2 was not passed, the unchanged list of URLs would be passed back to the main loop.
The following is the beginning of the code where the script processes the switches passed by the user. The three possible switches define the levels of depth that the script will check, the URL of the beginning site, and the maximum number of processes permitted to run at the same time.
OPTIND=1
while getopts l:u:p: ARGS
do
  case $ARGS in
    l) levels=$OPTARG
       ;;
    u) totalurls=$OPTARG
       ;;
    p) max_parallel=$OPTARG
       ;;
    *) echo "Usage: $0 -l [levels to look] -u [url] -p [parallel checks]"
       ;;
  esac
done
If the user passes any other parameters, the script prints a usage statement explaining the
acceptable script parameters. You can find more detail on the processing of switches in
Chapter 5.
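As an example, a run that starts at a given site, follows links two levels deep, and allows at most 20 parallel validation processes might be invoked as follows (the script name webcrawl is only a placeholder):

./webcrawl -l 2 -u http://www.example.com -p 20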
The following code shows a nested loop that gathers a complete URL list, starting with the opening page and progressing through the number of levels to be checked by the script. The outer loop iterates through the levels. The inner loop steps through all previously found URLs to gather the links from each page. All URLs found by the inner loop are appended to the totalurls variable. Each pass through this inner loop generates a line of output noting the number of sites found.
while [ $levels -ne 0 ]
do
  (( levels -= 1 ))
  for url in $totalurls
  do
    totalurls="${totalurls}`find_urls $url`"
    url_count=`echo $totalurls | wc -w`
    echo Current total number of urls is: $url_count
  done
done

Now that the whole list has been gathered, we sort it with the -u option to reduce the list to unique values. In this way we avoid redundant checks. Then we determine and output the final number of sites the script found.
totalurls=`for url in $totalurls
do
  echo $url
done | sort -u`
url_count=`echo $totalurls | wc -w`
echo Final unique total number of urls is: $url_count
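If you prefer, the same deduplication step can be written a little more compactly with tr, assuming the gathered URLs are separated by whitespace:

# Equivalent, more compact form of the deduplication step.
totalurls=`echo $totalurls | tr ' ' '\n' | sort -u`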
This is where the script becomes interesting. You now call the url_feeder function as a
co-process by using the |& syntax; then you pass the total list of URLs to process.
url_feeder $totalurls |&
coprocess_pid=$!
As pointed out before, this is a capability unique to ksh. A co-process is somewhat like a background task, but a pipe acting as a channel of communication is also opened between it and the parent process. This allows two-way communication, which is sometimes referred to as IPC, or interprocess communication.
The url_feeder function prints out a list of all URLs it receives, but instead of printing them to standard output, the function prints them to the pipe established between the co-process and the parent process. One characteristic of printing to this newly established pipe is that the print being performed by the child co-process won’t complete until the value is read from the initiating parent process at the other end of the pipe. In this case, the value is read from the main loop. This allows us to control the rate at which we read new URLs to be processed, because the co-process can output URLs only as fast as the parent process can read them.
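The excerpt ends just as the main loop begins, but the consuming end of the pipe can be sketched. Something along the lines of the fragment below is one way for a parent to pull one URL at a time with read -p and throttle the background checks; it is only an outline of the idea, not the chapter's finished loop.

# Outline only: read URLs from the co-process one at a time and keep
# no more than $max_parallel validation jobs running at once.
while read -p next_url              # blocks until the feeder prints a URL
do
  find_urls $next_url validate &    # any non-empty second argument selects validation mode
  (( parallel_jobs += 1 ))
  if [ $parallel_jobs -ge $max_parallel ]
  then
    wait                            # let the current group of checks drain
    parallel_jobs=0
  fi
done
wait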
Next we initialize a few variables that are used to keep track of the current number of parallel jobs and processed URLs by setting them to zero and then sending the GO message to the co-process. This tells the url_feeder function that it can start sending URLs to be read by the parent process. The print -p syntax is needed because that is how the parent process communicates to the previously spawned co-process. The -p switch specifies printing to an established pipe.
processed_urls=0
parallel_jobs=0
print -p "GO"
while [ $processed_urls -lt $url_count ]
