Chapter 6: Example LWP Programs-P1
This chapter presents LWP programs that are more robust and feature-rich
than the examples shown in previous chapters. While Chapter 5, The LWP
Library, focused on teaching LWP and explained how LWP objects fit
together, this chapter shows you some sample LWP programs with more
user-friendly options and features.
We present three broad categories of web client programs:
Simple clients--programs that perform actions for users in real time,
usually with a finite list of URLs to act upon. In this section, we
present LWP versions of the hcat and hgrepurl programs that were
presented in Chapter 4, The Socket Library.
Periodic clients--robots that perform a request repeatedly, with some
delay between each request. Periodic clients typically request the
same resource over and over, or a different resource in a predictable
manner. For example, a client may request 0100.gif at 1 a.m., 0200.gif
at 2 a.m, etc. A periodic client might check some data and perform
action when a condition is met. In this section, we present a program
that periodically checks the status of a Federal Express document.
Recursive clients--robots that follow hyperlinks or other references on
an HTML page. In this section, we present a program that looks for
bad links in a web site.
The boundaries between these categories are not set in stone. It is possible to
write a periodic client that also happens to be a recursive client. Or a simple
client might become periodic if the document indicates that the page should
be refreshed every 15 minutes. We're not trying to classify all programs into
one category or another; these categories are given as a way to identify
distinct behaviors that a client may exhibit.
The examples in this chapter all use a simple command-line interface. In
Chapter 7, Graphical Examples with Perl/Tk, we have some additional
examples with a graphical interface using the Tk extension to Perl.
Simple Clients
Simple clients are programs that perform actions for users in real time,
usually with a finite list of URLs to act upon. In this section, we'll show
LWP versions of the socket-based hcat and hgrepurl programs that were
presented in Chapter 4.
Hypertext UNIX cat Revisited
As you might recall, the sockets version of hcat used the open_TCP( )
function to establish a connection to a web server, and then issued an HTTP
request, like "GET / HTTP/1.0". In LWP, many of the details are hidden
from the programmer. Instead of this:
open_TCP(F, $the_url[1], $the_url[2])
print F "GET $the_url[3] HTTP/1.0\n";
print F "Accept: */*\n";
print F "User-Agent: hcat/1.0\n\n";
in LWP, it can be written like this:
my $ua = new LWP::UserAgent;
$ua->agent("hcat/1.0");
my $request = new HTTP::Request("GET", $path);
my $response = $ua->request($request);
They both do the same thing; they request a document from a user-specified
web server and identify themselves in the User-Agent header. But one looks
a lot cleaner than the other. Instead of using the nitty-gritty socket code that
talks directly to the web server, you merely describe to LWP what the action
should be. LWP handles it for you. Many things, like handling URL
redirection or handling HTTP version differences, will be handled
automatically by LWP.
Also, the following lines in the sockets version of hcat can be replaced:
# print out server's response.
# get the HTTP response line
$the_response=<F>;
print $the_response if ($all || $response);
# get the header data
while(<F>=~ m/^(\S+):\s+(.+)/) {
print "$1: $2\n" if ($all || $header);
}
# get the entity body
if ($all || $data) {
print while (<F>);
}
In LWP, these lines can be written as:
my $code=$response->code;
my $desc = HTTP::Status::status_message($code);
my $headers=$response->headers_as_string;
my $body = $response->content;
if ($opt_r || $all) { print "HTTP/1.0 $code
$desc\n"; }
if ($opt_H || $all) { print "$headers\n";
}
if ($opt_d || $all) { print $body;
}
In addition, we've added proxy support, since it's trivial in LWP:
my $ua = new LWP::UserAgent;
$ua->agent("hcat/1.0");
# If proxy server specified, define it in the User
Agent object
if (defined $proxy) {
my $url = new URI::URL $path;
my $scheme = $url->scheme;
$ua->proxy($scheme, $proxy);
}
The source in its entirety looks like this:
#!/usr/local/bin/perl -w
use strict;
use HTTP::Status;
use HTTP::Response;
use LWP::UserAgent;
use URI::URL;
use vars qw($opt_h $opt_r $opt_H $opt_d $opt_p);
use Getopt::Std;
my $url;
my $goterr;
After calling all the necessary Perl modules and declaring variables, we
process command-line arguments:
getopts('hrHdp:');
my $all = !($opt_r || $opt_H || $opt_d); # all=1
when -r -H -d not set
if ($opt_h || $#ARGV==-1) { # print help text when
-h or no args
print_help( );
exit(0);
}
Then, for any string that remains as a command-like parameter, we treat it as
a URL, process it, and print out the result:
my $goterr = 0; # make sure we clear the error
flag
while ($url = shift @ARGV) {
my ($code, $desc, $headers,
$body)=simple_get('GET', $url, $opt_p);
if ($opt_r || $all) { print "HTTP/1.0 $code
$desc\n"; }
if ($opt_H || $all) { print "$headers\n";
}
if ($opt_d || $all) { print $body;
}
$goterr |= HTTP::Status::is_error($code);
}
exit($goterr);
The print-help( ) routine just prints out a range line and a list of command-
line options:
sub print_help {
print <<"HELP";
usage: $0 [-hrmbp] [proxy URL] URLs
-h help
-r response line only
-H HTTP header data only
-d data from entity body
-p use this proxy server
Example: $0 -p http://proxy:8080/
HELP
}
The actual processing is done in a separate function, called simple_get( ):
sub simple_get( ) {
my ($method, $path, $proxy) = @_;
# Create a User Agent object
my $ua = new LWP::UserAgent;
$ua->agent("hcat/1.0");
# If proxy server specified, define it in the User
Agent object
if (defined $proxy) {
my $url = new URI::URL $path;
my $scheme = $url->scheme;
$ua->proxy($scheme, $proxy);
}
# Ask the User Agent object to request a URL.
# Results go into the response object
(HTTP::Reponse).
my $request = new HTTP::Request($method, $path);
my $response = $ua->request($request);
# Parse/convert the response object for "easier
reading"
my $code=$response->code;
my $desc = HTTP::Status::status_message($code);
my $headers=$response->headers_as_string;
my $body = $response->content;
$body = $response->error_as_HTML if ($response-
>is_error);
return ($code, $desc, $headers, $body);
}
Within simple_get( ), an LWP::UserAgent object is created, and a proxy
server is defined for the object if one was specified to simple_get( ). A new
HTTP::Request object is created with the HTTP method and path that are
passed to simple_get( ). The request is given to UserAgent's request( )
method, and an HTTP::Response object is returned. From there,
HTTP::Response::code( ), HTTP::Response::headers_as_string( ), and
HTTP::Response::content( ) are used to extract the response information
from the HTTP::Response object.
Hypertext Grep URLs Revisited
The code that does the HTTP request of hgrepurl looks very much like hcat
's. Instead of repeating that information, let's center on another chunk of
code that changed from the sockets version of hgrepurl.
In Chapter 4, the raw sockets version checked the response code and then
skipped over the HTTP headers:
# if not an "OK" response of 200, skip it
if ($the_response !~ m@^HTTP/\d+\.\d+\s+200\s@)
{return;}
# get the header data
while(<F>=~ m/^(\S+):\s+(.+)/) {
# skip over the headers
}
In LWP, this can more easily be said with something like this:
if ($response->code!= RC_OK) { return; }
if ($response->content_type !~ m@text/html@) {
return; }
In the process of finding URLs without the help of LWP, one would have to
do something like this:
$data =~ s/<([^>]*)>//;
$in_brackets=$1;
$key='a';
$tag='href';
if ($in_brackets =~ /^\s*$key\s+/i) { # if tag
matches, try parms
if ($in_brackets =~ /\s+$tag\s*=\s*"([^"]*)"/i) {
$link=$1;
$link =~ s/[\n\r]//g; # kill newlines,returns
anywhere in url
# process the URL here
}
}
But in LWP, this simplifies to something like this:
my $parsed_html=HTML::Parse::parse_html($data);
for (@{ $parsed_html->extract_links(qw (body img))
}) {
my ($link) = @$_;
# process the URL here
}
As you can see, LWP simplified a lot of the code. Let's go over hgrepurl in a
little more detail:
#!/usr/local/bin/perl -w
use strict;
use HTTP::Status;
use HTTP::Response;
use LWP::UserAgent;
use URI::URL;
use HTML::Parse;
use vars qw($opt_h $opt_i $opt_l $opt_p);
use Getopt::Std;
my $url;
After calling all the necessary modules and declaring variables, there's the
usual command-line processing with getopts( ):
getopts('hilp:');
my $all = !($opt_i || $opt_l); # $all=1 when
-i -l not set
if ($opt_h || $#ARGV==-1) { # print help text when
-h or no args
print_help( );
exit(0);
}
Any remaining command-line arguments are treated as URLs and passed to
get_html( ):
while ($url = shift @ARGV) {
my ($code, $type, $data) = get_html($url,
$opt_p, $opt_i, $opt_l);
if (not_good($code, $type)) { next; }
if ($opt_i || $all) { print_images($data, $url);
}
if ($opt_l || $all) { print_hyperlinks($data,
$url); }
} # while there are URLs on the command line
As in hcat, print_help( ) displays a help message:
sub print_help {
print << "HELP";
usage: $0 [-hilp] [proxy URL] URLs
-h help
-i grep out images references only
-l grep out hyperlink references only
-p use this proxy server
Example: $0 -p http://proxy:8080/
HELP
}
The get_html( ) routine is defined next. The response of get_html( ) is the
response code, content type, and entity-body of the reply.
sub get_html( ) {
my($url, $proxy, $want_image, $want_link) = @_;
# Create a User Agent object