Chapter 19
Web Crawler
Copyright © 2005 Pearson Addison-Wesley. All rights reserved.
Chapter Objectives
• Provide a case study example from problem statement through implementation
• Demonstrate how hash tables and graphs can be used to solve a problem
Web Crawler
• A web crawler is a system that searches the web, beginning with a user-designated web page, looking for a designated target string
• A web crawler follows all of the links on each page that it encounters until there are no more pages or until it reaches a designated limit
Web Crawler
• For this case study, we will create a graphical web crawler with the following requirements:
– Enter a designated starting web page
– Enter a target string for which to search
– Limit the search to 50 pages
– Display the results when done
Web Crawler - Design
• Our web crawler system consists of three high-level components:
– The driver
– The graphical user interface
– The web crawler implementation
• Makes use of graphs and hash tables
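The three components above might be organized as in the following sketch. The class names and responsibilities here are illustrative assumptions, not the textbook's actual source.

```java
// Illustrative skeleton only -- class names are assumptions, not the
// textbook's actual classes.

// The driver: program entry point that wires the pieces together.
class WebCrawlerDriver {
    public static void main(String[] args) {
        WebCrawler crawler = new WebCrawler();
        CrawlerGUI gui = new CrawlerGUI(crawler);
        gui.run();  // GUI collects input and displays results
    }
}

// The graphical user interface: gathers the start page, target string,
// and page limit, then displays the results of the search.
class CrawlerGUI {
    private final WebCrawler crawler;

    CrawlerGUI(WebCrawler crawler) {
        this.crawler = crawler;
    }

    void run() {
        // build the window and dispatch searches to the crawler
    }
}

// The crawler implementation: uses a graph of visited pages and a
// hash set of pages still to be searched (see the algorithm slide).
class WebCrawler {
    // search logic goes here
}
```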
Web Crawler - Design
• The algorithm for the web crawler is as follows:
– Add the starting page to a HashSet of pages to be searched and to our graph
– Remove a page from the set of pages to be searched
– Search the page for the target string
• If the string exists, add the page to the list of results
– Search the page for links
• If the links have not already been searched, add them to the set of pages to be searched and to our graph
– Repeat the three previous steps until our limit is reached or the set is empty
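The loop above can be sketched in Java as follows. This is a minimal sketch: it replaces real HTTP fetching with an in-memory map of page contents and links (an assumption for testability), and it uses a `seen` set as a stand-in for the graph's vertex set; a full version would also record edges between pages.

```java
import java.util.*;

public class Crawler {
    // Hypothetical in-memory "web": page name -> content, and page -> links.
    // A real crawler would fetch each page over HTTP instead.
    static Map<String, String> content = new HashMap<>();
    static Map<String, List<String>> links = new HashMap<>();

    // Returns the pages containing target, following the slide's algorithm:
    // search up to maxPages pages, starting from start.
    static List<String> crawl(String start, String target, int maxPages) {
        Set<String> toSearch = new HashSet<>(); // pages waiting to be searched
        Set<String> seen = new HashSet<>();     // pages already added (graph vertices)
        List<String> results = new ArrayList<>();

        toSearch.add(start);
        seen.add(start);

        int searched = 0;
        while (!toSearch.isEmpty() && searched < maxPages) {
            // Remove a page from the set of pages to be searched
            String page = toSearch.iterator().next();
            toSearch.remove(page);
            searched++;

            // Search the page for the target string
            if (content.getOrDefault(page, "").contains(target))
                results.add(page);

            // Follow links that have not already been searched
            for (String link : links.getOrDefault(page, List.of()))
                if (seen.add(link))  // add() is true only for unseen pages
                    toSearch.add(link);
        }
        return results;
    }
}
```
With a page limit of 50, the loop stops either when 50 pages have been examined or when no unsearched pages remain, matching the requirements slide.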
FIGURE 19.1 User interface design
FIGURE 19.2 UML description