Data-Mining the Web Using Perl

Burt L. Monroe
Director, Quantitative Social Science Initiative
Department of Political Science
The Pennsylvania State University

Data-Mining the Web

Examples

Election Returns in Luxembourg

Luxembourg Official Election Results, 2004

Parliamentary Speech

The Congressional Record

How’d You Do That?

There are several programming languages with “straightforward” facilities for doing this. Most notably,

Perl

Python

Java

I’m going to talk about Perl, because

it’s the most established

it’s the one I know

It appears that Python may be preferable, but that’s for someone else to say.

What’s Perl?

Open source (free / flexible / extensible / a little wild and woolly – like Linux, R) programming language.

It is very very good at processing text.

note, webpages are just texts.

note, datasets (like a flat spreadsheet or Stata file) are just texts.

Social scientists might have some use for turning one into the other, no?

It has very useful facilities for building

Spiders

Scrapers

(and “agents”, “robots”, “crawlers”, etc.)

What’s a Spider?

A spider is a program designed to automatically gather webpages.

If, for example, you want to automatically download all of the speeches delivered in Congress today – without manually clicking on every one, cutting and pasting, etc. – you might want to build a spider.
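
A minimal sketch of the link-finding step at the heart of a spider. To stay runnable without a network connection it works on a made-up page held in a string; the page content and the /speeches/ path are assumptions for illustration only. A real spider would download pages (for example with the LWP modules covered in Perl & LWP) and would use a proper HTML parser rather than this crude pattern.

```perl
use strict;
use warnings;

# Hypothetical page content; a real spider would download this over HTTP.
my $page = <<'HTML';
<html><body>
<a href="/speeches/001.html">Speech 1</a>
<a href="/speeches/002.html">Speech 2</a>
<a href="/about.html">About</a>
</body></html>
HTML

# Collect every link that points into the (made-up) /speeches/ directory.
my @to_visit;
while ($page =~ /<a\s+href="([^"]+)"/gi) {
    my $link = $1;
    push @to_visit, $link if $link =~ m{^/speeches/};
}

print "Would fetch: $_\n" for @to_visit;
```

The spider would then fetch each address in @to_visit, politely and one at a time, and repeat the same step on each new page.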

What’s a Scraper?

A scraper (or “screen-scraper”) extracts the information you want – whatever you consider to be data – from a given webpage.

If you want to know who said “health” and how many times, you might want to build a scraper.
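
The “who said health” idea can be sketched as follows. The transcript lines and the “SPEAKER: text” layout are invented for illustration; a real page would need its own pattern.

```perl
use strict;
use warnings;

# Made-up transcript lines in "SPEAKER: text" form.
my @lines = (
    'Mr. SMITH: Health care costs are rising, and health outcomes are not.',
    'Ms. JONES: I yield my time.',
    'Mr. SMITH: The health of the nation depends on it.',
);

my %count;    # speaker => number of times they said "health"
foreach my $line (@lines) {
    # Split the line into speaker and speech; skip lines that do not fit.
    my ($speaker, $text) = $line =~ /^(.+?):\s*(.*)$/ or next;
    my @hits = $text =~ /\bhealth\b/gi;    # every whole-word match, any case
    $count{$speaker} += @hits;
}

foreach my $speaker (sort keys %count) {
    print "$speaker said \"health\" $count{$speaker} time(s)\n";
}
```

The hash %count is the “data” here: one row per speaker, ready to be written out as a flat file.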

BEWARE!

Spiders (and other similar types of programs – “robots”, “crawlers”) can be put to nefarious use:

appropriating copyrighted materials

extracting email addresses for spammers

overwhelming servers to create “denial of service”

generally violating a site’s “terms of service” or “acceptable use policy”

If you are not careful to use legal and ethical good practices, you can

be denied access to a website altogether

get yourself or the university sued or even subjected to criminal penalties

Perl

Open-source

Cross-platform

(Windows – I recommend “ActivePerl”)

There are many websites with resources:

Comprehensive Perl Archive Network (CPAN)

PerlMonks

O’Reilly Publishing

Lots of mailing lists, etc.

Books

Basics of Perl

The best books are put out by O’Reilly Publishing and are generally known by the animal on the cover.

Learning Perl (the Llama)

or, Learning Perl on Win32 Systems (the Gecko)

Programming Perl (the Camel)

Web-mining

Perl & LWP (the Blesbok, apparently)

Spidering Hacks

These books, and some others, are or will be available in the “QuaSSI Library” (in Pond 216).

Running Perl

For machines with approved ActivePerl installations in Pond

Perl is located in c:/Perl/

For today,

we will operate entirely in the directory c:/Perl/eg/

To get there,

open Programs -> Accessories -> Command Prompt

At the prompt, type
c:

Type
cd Perl/eg

(In your particular installation, or on a Mac, or on something like Unix on high performance computing, these details will be different.)

The First Perl Program

Go to the QuaSSI Website for the example scripts for today’s workshop.

Right-click on the first script, “howdy.pl”, and save it to c:\Perl\eg\

Open up the text-editor WinEdt (you could use almost anything) and then open howdy.pl

That’s a complete Perl program.

Note: that’s all a program is – a text file.

Running a Perl Program

Go back to your command prompt.

Type
perl -w howdy.pl

(The -w tells perl to give you warnings about what might be wrong if the program is broken. It goes before the script name; anything typed after the script name is passed to the script instead.)

Modifying a program

Go back to WinEdt

Edit the text between the quotation marks to say something new

Click File -> Save

Go back to the command prompt

Hit the up arrow (to get the last command, perl -w howdy.pl) and press Enter

Look at that – you’re a programmer!

Break the program

Go back to WinEdt

Delete the semicolon at the end of the line

Save the file

Go back to the command prompt and run the program, with -w, again

What happened?

Perl at 30,000 feet

Much of the next set of slides is stolen shamelessly from Andy Lester’s “Perl at 10,000 Feet” at www.petdance.com

(I’m skipping even more than he did.)

Some generalities about Perl

Statements in Perl are, or usually can be, constructed in a fairly natural English-like way.

There are many ways to do any one thing.

The syntax can be off-putting and hard to read, especially at first. It is easy to “obfuscate” Perl code and this is sometimes done intentionally.

Main syntax rule: end every statement with ;

Data Types

Scalars

Arrays and Lists

Hashes

References

Filehandles

Objects
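
A quick sketch of what most of these types look like in code. The names and numbers are made up for illustration, and the filehandle is opened on a string in memory (rather than a file on disk) so the example is self-contained. Objects are left out here.

```perl
use strict;
use warnings;

my $name  = "Luxembourg";                   # scalar: one number or string
my @years = (1999, 2004);                   # array: ordered list of scalars
my %seats = (PartyA => 24, PartyB => 14);   # hash: key => value lookup (made-up data)
my $ref   = \@years;                        # reference: a scalar pointing at other data

print "$name, first year $years[0], PartyA seats $seats{PartyA}\n";
print "Via reference: $ref->[1]\n";

# Filehandle: opened here on an in-memory string instead of a file on disk.
my $text = "line one\nline two\n";
open(my $fh, '<', \$text) or die "Cannot open in-memory file: $!";
my @lines = <$fh>;
close($fh);
print "Read ", scalar(@lines), " lines\n";
```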

Scalars

Numbers

Generally decimal floating point

(Can be made integer, octal, hexadecimal)

Strings

Can contain any character

Can be null (empty): ""

Can be arbitrarily large
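
A small sketch of these points, using Perl’s standard literal forms. The variable names are arbitrary.

```perl
use strict;
use warnings;

my $decimal = 255;
my $octal   = 0377;     # a leading 0 means octal
my $hex     = 0xff;     # a leading 0x means hexadecimal
my $pi      = 3.14;     # floating point
my $empty   = "";       # the empty string
my $big     = "ab" x 1_000_000;   # strings can grow as large as memory allows

print "Same number three ways\n" if $decimal == $octal && $octal == $hex;
print "Empty string has length ", length($empty), "\n";
print "Big string has length ", length($big), "\n";
```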

Strings

Single-quoted

characters are as shown with only two exceptions.

single-quote in a single-quoted string requires \'

backslash in a single-quoted string requires \\

Double-quoted

it will interpolate – substitute the values of variables and control sequences.

For example

$foo = "myfile";

$datafile = "$foo.txt";

will result in the variable $datafile holding the string "myfile.txt"

Another example

print 'Howdy\n'; will print:

Howdy\n

print "Howdy\n"; will print

Howdy

(\n is a control sequence, standing for “new line”).
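
The quoting rules above, put side by side in a runnable sketch. The C:\Perl string is only an illustration of the two single-quote escapes.

```perl
use strict;
use warnings;

my $foo      = "myfile";
my $datafile = "$foo.txt";      # double quotes interpolate $foo
my $literal  = '$foo.txt';      # single quotes do not

print "$datafile\n";            # myfile.txt
print $literal, "\n";           # $foo.txt

my $escaped = 'It\'s in C:\\Perl';   # the two single-quote escapes: \' and \\
print $escaped, "\n";           # It's in C:\Perl

print length('Howdy\n'), "\n";  # 7: backslash and n are two ordinary characters
print length("Howdy\n"), "\n";  # 6: \n is a single newline character
```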

Scalar operators

Math

*, /, % (for modulo), ** (for exponentiation), etc.

Strings

x to repeat the thing on the left

"b" x 10 gives "bbbbbbbbbb"

. concatenates strings

("na" x 16) . " Batman!" gives "na" repeated 16 times, followed by " Batman!"

Perl knows to convert when mixing these two types:

"3" * 4 gives 12

"3" . 4 gives "34"
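
The operators above, collected into one runnable sketch. The variable names are arbitrary.

```perl
use strict;
use warnings;

my $power  = 2 ** 10;                   # exponentiation: 1024
my $rem    = 17 % 5;                    # modulo: 2
my $row    = "b" x 10;                  # repetition
my $theme  = ("na" x 16) . " Batman!";  # repetition, then concatenation
my $number = "3" * 4;                   # string used as a number: 12
my $string = "3" . 4;                   # number used as a string: "34"

print "$power $rem $row\n";
print "$theme\n";
print "$number $string\n";
```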

Comparing Scalars

Comparison        Numeric   String
Equal             ==        eq
Not equal         !=        ne
Less than         <         lt
Greater than      >         gt
Less / equal      <=        le
Greater / equal   >=        ge

8 < 25          TRUE!

"8" lt "25"     FALSE!
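
A sketch of why the distinction matters in practice. It also uses <=> and cmp, Perl’s built-in numeric and string sort comparators, which are not in the table above.

```perl
use strict;
use warnings;

print "8 < 25 is true\n"       if 8 < 25;            # numeric: 8 is smaller
print "'8' lt '25' is false\n" unless "8" lt "25";   # string: "8" sorts after "2"

my @values  = (8, 25, 100);
my @numeric = sort { $a <=> $b } @values;   # numeric order
my @string  = sort { $a cmp $b } @values;   # character-by-character order

print "numeric: @numeric\n";
print "string:  @string\n";
```

Sorting numbers with the string comparator is a common source of quietly wrong results when building datasets.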

Variables

A sign ($, @, or %), followed by a letter, followed by pretty much whatever (letters, digits, underscores).

Sign determines the type:

$foo is a scalar

@foo is an array (it holds a list)

%foo is a hash

Variables default to global (they apply in all parts of your program). This can be problematic.

local $var gives a global variable a temporary value for the current “block” of code (and anything called from it).

my $var creates a variable that exists only in the current block, and is the more usual construction.

the very common use strict; at the beginning of code forces good practice in the use of local variables (creates more syntax errors, but prevents more whoppers that could blow everything up.)
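
A minimal sketch of the difference between my and local under use strict. The variable and subroutine names are invented, and our (which declares a global under strict) is not covered on the slide.

```perl
use strict;
use warnings;

my $count = 1;              # visible for the rest of the file

{
    my $count = 99;         # a separate variable, visible only inside this block
    print "inside:  $count\n";
}

print "outside: $count\n";  # still 1

our $level = "global";      # package (global) variable, the kind local works on
sub show { return $level }  # reads whatever $level currently holds

sub demo {
    local $level = "temporary";   # restored automatically when demo() returns
    return show();
}

my $during = demo();
my $after  = show();
print "during: $during, after: $after\n";
```

In everyday scripts my is almost always what you want; local is mainly for temporarily changing Perl’s own global settings.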

Lists and Arrays

A list is an ordered set of (usually) scalars.

An array is a variable holding a list.

my @foo = (1, 2, 3);

my @bar = ("elephant", 3.14);

Can be constructed as lists of scalar variables:

my @data = ($name, $address, $SSN);

Using Arrays

Elements are indexed, from 0.

my @animals = ("frog", "bear", "elephant");

print $animals[2]; # prints elephant

Note: element is a scalar, so $ rather than @

Subsections are “slices”.

my @mammals = @animals[1,2];

Lots of functions for

using as a stack (moving things on and off the right or left side of the array).

sorting

joining two arrays

splitting a scalar string into an array

my $sentence = "This is my sentence.";

my @words = split(" ", $sentence);

# now @words contains ("This", "is", "my", "sentence.")
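
The array functions mentioned above, in one runnable sketch. The extra animals are arbitrary; push, pop, shift, unshift, sort, split and join are all Perl built-ins.

```perl
use strict;
use warnings;

my @animals = ("frog", "bear", "elephant");

push    @animals, "zebra";        # add on the right
unshift @animals, "ant";          # add on the left
my $last  = pop   @animals;       # remove from the right: zebra
my $first = shift @animals;       # remove from the left:  ant

my @sorted   = sort @animals;                 # alphabetical
my @combined = (@animals, "newt", "toad");    # joining two lists
my @words    = split(" ", "This is my sentence.");
my $rejoined = join("-", @words);

print "@sorted\n";
print scalar(@combined), " animals\n";
print "$rejoined\n";
```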

Programming Controls

Control structures

if / elsif / else

while

do {} while

do {} until

for ()

foreach () # loops over a list

Errors / warnings

die "message" kills program and prints “message”.

warn "message" prints message and keeps going.
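
A short sketch that exercises several of these structures. The scores and category names are made up, and the die is wrapped in eval (not covered on the slide) only so the example can keep running and show what die left behind.

```perl
use strict;
use warnings;

my @scores = (55, 72, 91);
my @grades;

foreach my $score (@scores) {
    if ($score >= 90) {
        push @grades, "high";
    } elsif ($score >= 60) {
        push @grades, "middle";
    } else {
        push @grades, "low";
    }
}

my $n = 0;
while ($n < 3) { $n++ }              # test first, then run the body
do { $n-- } until $n <= 0;           # run the body once, then test

warn "this is only a warning\n";     # goes to STDERR, program continues

# die normally ends the program; eval traps it here so the sketch keeps running.
my $ok = eval { die "something went wrong\n"; 1 };
print "grades: @grades; n = $n; died with: $@" unless $ok;
```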
