Web Client Programming with Perl

Chapter 5: The LWP Library
As we showed in Chapter 1, the Web works over TCP/IP, in which the client
and server establish a connection and then exchange necessary information
over that connection. The chapters Demystifying the Browser and
Learning HTTP concentrated on HTTP, the protocol spoken between web
clients and servers. Now we'll fill in the rest of the puzzle: how your
program establishes and manages the connection required for speaking
HTTP.
In writing web clients and servers in Perl, there are two approaches. You can
establish a connection manually using sockets, and then use raw HTTP; or
you can use the library modules for WWW access in Perl, otherwise known
as LWP. LWP is a set of modules for Perl 5 that encapsulate common
functions for a web client or server. Since LWP is much quicker to program
with and cleaner than using raw sockets, this book uses it for all the
examples in the later chapters, starting with Example LWP Programs. If LWP
is not available on your platform, see
Chapter 4, which gives more detailed descriptions of the socket calls and
examples of simple web programs using sockets.
The LWP library is available at all CPAN archives. CPAN is a collection of
Perl libraries and utilities, freely available to all. There are many CPAN
mirror sites; you should use the one closest to you, or just go to
to have one chosen for you at random. LWP
was developed by a cast of thousands (well, maybe a dozen), but its primary
driving force is Gisle Aas. It is based on the libwww library developed for
Perl 4 by Roy Fielding.
Detailed discussion of each of the routines within LWP is beyond the scope
of this book. However, we'll show you how LWP can be used, and give you
a taste of it to get you started. This chapter is divided into three sections:
* First, we'll show you some very simple LWP examples, to give you an
  idea of what it makes possible.
* Next, we'll list most of the useful routines within the LWP library.
* At the end of the chapter, we'll present some examples that glue
  together the different components of LWP.
Some Simple Examples
LWP is distributed with a very helpful--but very short--"cookbook" tutorial,
designed to get you started. This section serves much the same function: to
show you some simpler applications using LWP.
Retrieving a File
In Chapter 4, we showed how a web client can be written by manually
opening a socket to the server and using I/O routines to send a request and
intercept the result. With LWP, however, you can bypass much of the dirty
work. To give you an idea of how simple LWP can make things, here's a
program that retrieves the URL in the command line and prints it to standard
output:
#!/bin/perl
use LWP::Simple;

print (get $ARGV[0]);
The first line, starting with #!, is the standard line that calls the Perl
interpreter. If you want to try this example on your own system, it's likely
you'll have to change this line to match the location of the Perl 5 interpreter
on your system.
The second line, starting with use, declares that the program will use the
LWP::Simple class. This class of routines defines the most basic HTTP
commands, such as get.
The third line uses the get( ) routine from LWP::Simple on the first
argument from the command line, and applies the result to the print( )
routine.
Can it get much easier than this? Actually, yes. There's also a getprint( )
routine in LWP::Simple for getting and printing a document in one fell
swoop. The third line of the program could also read:
getprint($ARGV[0]);

That's it. Obviously there's some error checking that you could do, but if you
just want to get your feet wet with a simple web client, this example will do.
You can call the program geturl and make it executable; for example, on
UNIX:
% chmod +x geturl
Windows NT users can use the pl2bat program, included with the Perl
distribution, to make geturl executable from the command line:
C:\your\path\here> pl2bat geturl
You can then call the program to retrieve any URL from the Web:
% geturl
<HTML>
<HEAD>
<LINK REV=MADE HREF="mailto:">
<TITLE>O'Reilly &amp; Associates</TITLE>
</HEAD>
<BODY bgcolor=#ffffff>
...
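The error checking mentioned above could look something like the following. This is only a minimal sketch: it relies on get( ) returning undef when a retrieval fails, and fetch_or_warn is a hypothetical helper name chosen for illustration, not a routine provided by LWP.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple qw(get);

# fetch_or_warn is a hypothetical helper name, not part of LWP.
# It returns the document body, or undef after printing a warning.
sub fetch_or_warn {
    my ($url) = @_;
    my $content = get($url);    # get() returns undef on failure
    unless (defined $content) {
        warn "Could not retrieve $url\n";
        return;
    }
    return $content;
}

# Only fetch when a URL was given on the command line.
if (@ARGV) {
    my $content = fetch_or_warn($ARGV[0]);
    exit 1 unless defined $content;
    print $content;
}
```

LWP::Simple does not say why a retrieval failed; a program that needs the status code or headers would have to use the fuller LWP interfaces instead.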
Parsing HTML
Since HTML is hard to read in text format, instead of printing the raw
HTML, you could strip it of HTML codes for easier reading. You could try
to do it manually:
#!/bin/perl

use LWP::Simple;

foreach (get $ARGV[0]) {
    s/<[^>]*>//g;
    print;
}
But this only does a little bit of the job. Why reinvent the wheel? There's
something in the LWP library that does this for you. To parse the HTML,
you can use the HTML module:
#!/bin/perl

use LWP::Simple;
use HTML::Parse;

print parse_html(get ($ARGV[0]))->format;
In addition to LWP::Simple, we include the HTML::Parse class. We call the
parse_html( ) routine on the result of the get( ), and then format it for
printing.
You can save this version of the program under the name showurl, make it
executable, and see what happens:
% showurl
O'Reilly & Associates

About O'Reilly -- Feedback -- Writing for
O'Reilly

What's New -- Here's a sampling of our most
recent postings...

* This Week in Web Review: Tracking Ads
Are you running your Web site like a
business? These tools can help.

* Traveling with your dog? Enter the latest
Travelers' Tales
writing contest and send us a tale.



New and Upcoming Releases
...
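Returning to the hand-rolled substitution from the start of this section, a small offline experiment suggests why it "only does a little bit of the job." The sketch below needs no network access; strip_tags is a hypothetical wrapper around the same regular expression, and the sample markup is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# strip_tags is a hypothetical helper wrapping the substitution shown earlier.
sub strip_tags {
    my ($html) = @_;
    $html =~ s/<[^>]*>//g;    # remove anything that looks like <...>
    return $html;
}

# Made-up sample markup, not taken from any real page.
my $sample = '<P ALT="a > b">Fish &amp; Chips</P>';

# Prints:  b">Fish &amp; Chips
print strip_tags($sample), "\n";
```

Two things go wrong here: the ">" inside the quoted attribute value ends the match too early, so part of the tag leaks into the output, and the &amp; entity is never converted back to a plain ampersand. A real parser handles both cases.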
Extracting Links
To find out which hyperlinks are referenced inside an HTML page, you
could go to the trouble of writing a program to search for text within angle
brackets (<...>), parse the enclosed text for the <A> or <IMG> tag, and
extract the hyperlink that appears after the HREF or SRC parameter. LWP
simplifies this process down to two function calls. Let's take the geturl
program from before and modify it:
#!/usr/local/bin/perl
use LWP::Simple;
use HTML::Parse;
use HTML::Element;

$html = get $ARGV[0];
$parsed_html = HTML::Parse::parse_html($html);

for (@{ $parsed_html->extract_links( ) }) {
    $link = $_->[0];
    print "$link\n";
}
The first change to notice is that in addition to LWP::Simple and
HTML::Parse, we added the HTML::Element class.
Then we get the document and pass it to HTML::Parse::parse_html( ). Given
HTML data, the parse_html( ) function parses the document into an internal
representation used by LWP.
$parsed_html = HTML::Parse::parse_html($html);
Here, the parse_html( ) function returns an instance of the
HTML::TreeBuilder class that contains the parsed HTML data. Since the
HTML::TreeBuilder class inherits from the HTML::Element class, we make use
of HTML::Element::extract_links( ) to find all the hyperlinks mentioned in
the HTML data:
for (@{ $parsed_html->extract_links( ) }) {
extract_links( ) returns a reference to a list of arrays, where each array in
the list holds a hyperlink mentioned in the HTML, followed by the element
that contains it. Before we can access the hyperlink returned by
extract_links( ), we dereference the list in the for loop:
for (@{ $parsed_html->extract_links( ) }) {
and dereference the array within the list with:
$link = $_->[0];
After the dereferencing, we have direct access to the hyperlink's location, and
we print it out:
print "$link\n";
Save this program into a file called showlink and run it:
% showlink
You'll see something like this:
graphics/texture.black.gif
/maps/homepage.map
/graphics/headers/homepage-anim.gif

/ads/international/satan.gif

...
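The structure returned by extract_links( ) can also be explored without fetching anything. The following sketch is an assumption-laden variation on showlink: it builds the tree with HTML::TreeBuilder directly (the class that parse_html( ) returns an instance of) rather than going through HTML::Parse, it parses a made-up HTML string instead of a live page, and it passes tag names to extract_links( ) to restrict the search. links_in is a hypothetical helper name.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# links_in is a hypothetical helper: it returns one [tag, link] pair
# for each hyperlink found in the given HTML string.
sub links_in {
    my ($html, @tags) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my @found;
    for my $entry (@{ $tree->extract_links(@tags) }) {
        my ($link, $element) = @$entry;    # first two slots of each inner array
        push @found, [ $element->tag, $link ];
    }
    $tree->delete;    # free the parse tree
    return @found;
}

# Made-up sample markup, not taken from any real page.
my $html = '<a href="/catalog/">Catalog</a> <img src="logo.gif">';
printf "%-4s %s\n", @$_ for links_in($html, 'a', 'img');
```

Keeping the element alongside the link makes it possible to tell an <A> hyperlink from an <IMG> source, which the plain showlink output does not distinguish.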
Expanding Relative URLs
In the previous example, showlink printed out the
hyperlinks exactly as they appear within the HTML. But in some cases, you
want to see the link as an absolute URL, with the full glory of a URL's
scheme, hostname, and path. Let's modify showlink to print out absolute
URLs all the time:
#!/usr/local/bin/perl

use LWP::Simple;
use HTML::Parse;
use HTML::Element;
use URI::URL;

$html = get $ARGV[0];
$parsed_html = HTML::Parse::parse_html($html);

for (@{ $parsed_html->extract_links( ) }) {
    $link = $_->[0];
    $url = new URI::URL $link;
    $full_url = $url->abs($ARGV[0]);
    print "$full_url\n";
}
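To see how relative links resolve against a base URL without touching the network, here is a small sketch. It uses URI->new_abs( ) from the URI module rather than the URI::URL interface in the listing above; the two are closely related, but this is a substitution, not the listing's own method. absolutize is a hypothetical helper name, and the example.com and example.org addresses are placeholders.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use URI;

# absolutize is a hypothetical helper: resolve $link relative to $base
# and return the absolute URL as a plain string.
sub absolutize {
    my ($link, $base) = @_;
    return URI->new_abs($link, $base)->as_string;
}

# Placeholder base URL, standing in for the page the links came from.
my $base = 'http://www.example.com/catalog/index.html';

for my $link ('prices.html', '/maps/homepage.map',
              '../graphics/logo.gif', 'http://other.example.org/') {
    print absolutize($link, $base), "\n";
}
```

A link that is already absolute passes through unchanged, so the same call can safely be applied to every link that extract_links( ) returns.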