Chapter 6: Example LWP Programs-P2
Then the scan( ) method does all the real work. The scan( ) method accepts a
URL as a parameter. In a nutshell, here's what happens:
The scan( ) method pushes the first URL onto a queue. For each URL pulled
from the queue, the links on that page are extracted and pushed onto the
queue. To keep track of which URLs have already been visited (and to avoid
pushing them back onto the queue), we use an associative array called
%touched and associate any URL that has been visited with a value of 1.
Other variables track which document points to what, the content-type of
each document, which links are bad, which links are local, which links are
remote, and so on.
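The queue-and-%touched idea can be tried without any network access. The following is only a minimal sketch under stated assumptions: the %links_on table and the crawl_order( ) function are hypothetical stand-ins for real pages and are not part of CheckSite.

```perl
use strict;
use warnings;

# Hypothetical in-memory "site": each page maps to the links found on it.
my %links_on = (
    '/'  => ['/a', '/b'],
    '/a' => ['/b', '/'],
    '/b' => ['/c'],
    '/c' => [],
);

# Visit every page reachable from $root exactly once, in queue order.
sub crawl_order {
    my ($root, $links_on) = @_;
    my @urls    = ($root);        # queue of URLs still to look at
    my %touched = ($root => 1);   # URLs that have already been queued
    my @visited;

    while (defined(my $url = shift @urls)) {
        push @visited, $url;
        foreach my $child (@{ $links_on->{$url} || [] }) {
            next if $touched{$child};   # seen before: do not requeue
            $touched{$child} = 1;
            push @urls, $child;
        }
    }
    return @visited;
}

print join(' ', crawl_order('/', \%links_on)), "\n";   # prints "/ /a /b /c"
```

Because every URL is marked in %touched at the moment it is queued, pages that link back to each other (such as / and /a here) cannot cause an endless loop.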
For a more detailed look at how this works, let's step through it.
First, the initial URL is pushed onto a queue:
push (@urls , $root_url);
The URL is then checked with a HEAD method. If we can determine that
the URL is not an HTML document, we can skip it. Otherwise, we follow
that with a GET method to get the HTML:
my $request = new HTTP::Request('HEAD', $url);
my $response = $self->{'ua'}->request($request);
# if not HTML, don't bother to search it for URLs
next if ($response->header('Content-Type') !~
m@text/html@ );
# it is text/html, get the entity-body this time
$request->method('GET');
$response = $self->{'ua'}->request($request);
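The content-type test can be looked at on its own. The sketch below is an illustration rather than CheckSite's code: looks_like_html( ) is a hypothetical helper, and unlike the unanchored m@text/html@ match above it anchors the pattern, ignores case, and treats a missing Content-Type header as "not HTML".

```perl
use strict;
use warnings;

# Hypothetical helper: decide from a Content-Type header value whether
# a document should be scanned for links. A missing header counts as
# "not HTML" instead of triggering an undefined-value warning.
sub looks_like_html {
    my ($content_type) = @_;
    return 0 unless defined $content_type;
    return $content_type =~ m{^\s*text/html\b}i ? 1 : 0;
}

foreach my $ct ('text/html', 'text/html; charset=iso-8859-1', 'image/gif', undef) {
    my $label = defined($ct) ? $ct : '(no header)';
    printf "%-32s %s\n", $label, looks_like_html($ct) ? 'scan' : 'skip';
}
```

Only the first two values are reported as "scan"; the image and the missing header are skipped.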
Then we extract the links from the HTML page. Here, we use our own
function to extract the links. There is a similar function in the LWP library
that extracts links, but we opted not to use it, since it is less likely to
find links in slightly malformed HTML:
my @rel_urls = grab_urls($data);
foreach $verbose_link (@rel_urls) {
...
}
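To give a feel for what a forgiving link extractor might look like, here is a rough sketch. It is not the grab_urls( ) function used by CheckSite: extract_links( ) is a hypothetical name, and the regular expression is an assumption that only handles HREF and SRC attributes.

```perl
use strict;
use warnings;

# Hypothetical, deliberately forgiving extractor (not the book's grab_urls):
# pulls HREF and SRC attribute values out of raw HTML with a regex,
# accepting double-quoted, single-quoted, or unquoted values.
sub extract_links {
    my ($html) = @_;
    my @links;
    while ($html =~ m{\b(?:href|src)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))}gi) {
        # Exactly one of the three capture groups is defined per match.
        my $link = defined($1) ? $1 : defined($2) ? $2 : $3;
        push @links, $link;
    }
    return @links;
}

my $page = q{<a href="a.html">A</a> <A HREF=b.html>B <img src='c.gif'>};
print "$_\n" foreach extract_links($page);   # a.html, b.html, c.gif
```

A regex like this does not understand comments, scripts, or nested quoting, so it can both miss links and report text that is not a link; it trades accuracy for tolerance of sloppy markup.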
With each iteration of the foreach loop, we process one link. If we haven't
seen it before, we add it to the queue:
foreach $verbose_link (@rel_urls) {
if (! defined $self->{'touched'}{$full_child}) {
push (@urls, $full_child);
}
# remember which url we just pushed, to avoid repushing
$self->{'touched'}{$full_child} = 1;
}
While all of this is going on, we keep track of which documents don't exist,
what their content types are, which ones are local to the web server, which
are not local, and which are not HTTP-based. After scan( ) finishes, all of
the information is available from CheckSite's public interface. The bad( )
method returns an associative array of the URLs that encountered errors:
each key is a URL, and its value is a \n-delimited list of error messages.
The not_web( ), local( ), and remote( ) methods return similar associative
arrays, in which the presence of a URL as a key means that the URL is not
HTTP-based, is local to the web server, or is not local to the web server,
respectively. The type( ) method returns an associative array that maps
each URL to its content-type. Finally, the ref( ) method returns an
associative array that maps each URL to the URLs that refer to it,
delimited by \n. So if the entry for "www.ora.com" has a value of
"a.ora.com" and "b.ora.com", both "a.ora.com" and "b.ora.com" point to
"www.ora.com".
Here's the complete source of the CheckSite package, with some sample
code around it to read in command-line arguments and print out the results:
#!/usr/local/bin/perl -w
use strict;
use vars qw($opt_a $opt_v $opt_l $opt_r $opt_R
$opt_n $opt_b
$opt_h $opt_m $opt_p $opt_e $opt_d);
use Getopt::Std;
# Important variables
#----------------------------
# @lookat   queue of URLs to look at
# %local    $local{$URL}=1 (local URLs in associative array)
# %remote   $remote{$URL}=1 (remote URLs in associative array)
# %ref      $ref{$URL}="URL\nURL\n" (list of URLs separated by \n)
# %touched  $touched{$URL}=1 (URLs that have been visited)
# %notweb   $notweb{$URL}=1 if URL is non-HTTP
# %badlist  $badlist{$URL}="reason" (URLs that failed. Separated with \n)
getopts('avlrRnbhm:p:e:d:');
# Display help upon -h, no args, or no e-mail address
if ($opt_h || $#ARGV == -1 || (! $opt_e) ) {
print_help( );
exit(-1);
}
my ($print_local, $print_remote, $print_ref, $print_not_web,
$print_bad, $verbose, $max, $proxy,
$email, $delay, $url);
# set maximum number of URLs to visit to be unlimited
$max=0;
if ($opt_l) {$print_local=1;}
if ($opt_r) {$print_remote=1;}
if ($opt_R) {$print_ref=1;}
if ($opt_n) {$print_not_web=1;}
if ($opt_b) {$print_bad=1;}
if ($opt_v) {$verbose=1;}
if (defined $opt_m) {$max=$opt_m;}
if ($opt_p) {$proxy=$opt_p;}
if ($opt_e) {$email=$opt_e;}
if (defined $opt_d) {$delay=$opt_d;}
if ($opt_a) {
$print_local=$print_remote=$print_ref=$print_not_web=$print_bad = 1;
}
my $root_url=shift @ARGV;
# if there's no URL to start with, tell the user
unless ($root_url) {
print "Error: need URL to start with\n";
exit(-1);
}
# if no "output" options are selected, make "print_bad" the default
if (!($print_local || $print_remote || $print_ref ||
$print_not_web || $print_bad)) {
$print_bad=1;
}
# create CheckSite object and tell it to scan the site
my $site = new CheckSite($email, $delay, $max, $verbose, $proxy);
$site->scan($root_url);
# done with checking URLs. Report results
# print out references to local machine
if ($print_local) {
my %local = $site->local;
print "\nList of referenced local URLs:\n";
foreach $url (keys %local) {
print "local: $url\n";
}
}
# print out references to remote machines
if ($print_remote) {
my %remote = $site->remote;
print "\nList of referenced remote URLs:\n";
foreach $url (keys %remote) {
print "remote: $url\n";
}
}
# print non-HTTP references
if ($print_not_web) {
my %notweb = $site->not_web;
print "\nReferenced non-HTTP links:\n";
foreach $url (keys %notweb) {
print "notweb: $url\n";
}
}
# print reference list (what URL points to what)
if ($print_ref) {
my $refer_by;
my %ref = $site->ref;
print "\nReference information:\n";
while (($url,$refer_by) = each %ref) {
print "\nref: $url is referenced by:\n";
$refer_by =~ s/\n/\n  /g; # insert two spaces after each \n
print "  $refer_by";
}
}
# print out bad URLs, the server response line, and the Referer
if ($print_bad) {
my $reason;
my $refer_by;
my %bad = $site->bad;
my %ref = $site->ref;
print "\nThe following links are bad:\n";
while (($url,$reason) = each %bad) {
print "\nbad: $url Reason: $reason";
print "Referenced by:\n";
$refer_by = $ref{$url};
$refer_by =~ s/\n/\n  /g; # insert two spaces after each \n
print "  $refer_by";
} # while there's a bad link
} # if bad links are to be reported
sub print_help( ) {
print <<"USAGETEXT";
Usage: $0 URL\n
Options:
-l Display local URLs
-r Display remote URLs
-R Display which HTML pages refer to what
-n Display non-HTTP links
-b Display bad URLs (default)
-a Display all of the above
-v Print out URLs when they are examined
-e email Mandatory: Specify email address to include
in HTTP request.
-m # Examine at most # URLs\n
-p url Use this proxy server
-d # Delay # minutes between requests. (default=1)
Warning: setting # to 0 is not very nice.