
How to Install Apache Nutch on Windows 10


Apache Nutch
Professor
Kim Kyoung-Yun 

Team 4
Nguyen Thi Phuong Hang
Park Minsoo
Ali Usman


Content
I. What is Apache Nutch?
II. How to install Apache Nutch on Windows 10?
III. How to crawl the web?


I. What is Apache Nutch?
• Apache Nutch is a highly extensible and scalable open-source web
crawler software project
• Runs on top of Hadoop
• Mostly used to feed a search index, but also for data mining
• Customizable/extensible plugin architecture


II. How to install Apache Nutch on
Windows 10
1. Requirements:
+ Windows-Cygwin environment
+ Java Runtime/Development Environment (JDK 11 / Java 11)
+ (Source build only) Apache Ant



Installing Cygwin
• Download the Cygwin installer and run setup.exe

Installing Java Runtime/Development
Environment
• Download the Java SE Development Kit 11 installer for Windows and run the
exe file
• Set the environment variables JAVA_HOME and PATH

• Check the installed Java version
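The two steps above can be sketched from the Cygwin shell as follows (the install path is an assumption; use the directory the JDK installer actually created on your machine):

```shell
# Hedged sketch: point JAVA_HOME at the JDK 11 install directory and put
# its bin folder on PATH. The path below is a placeholder assumption.
export JAVA_HOME="/cygdrive/c/Program Files/Java/jdk-11"
export PATH="$JAVA_HOME/bin:$PATH"
echo "$JAVA_HOME"
# java -version should now report a Java 11 runtime
```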


Installing Apache Ant
• Download and install Apache Ant (1.10.12)
• Set the environment variables
ANT_HOME and PATH
• Check: ant -version

Successful installation


2. Installing Nutch
• Download a binary package, apache-nutch-1.X-bin.zip (1.19)
• Unzip the Nutch package. There should be a folder apache-nutch-1.X
• Move the folder apache-nutch-1.X ({nutch_home}) into cygwin64/home
• Verify the Nutch installation:
+ Open a cygwin64 terminal
+ Run bin/nutch from {nutch_home}: $ bin/nutch



III. How to crawl the web?
1. Crawler Workflow
• initialize the crawldb, inject seed URLs
• generate a fetch list: select URLs from the crawldb for fetching
• fetch URLs from the fetch list
• parse documents: extract content, metadata and links
• update crawldb status, score and signature; add new URLs inline or at the end of one
crawler run
• invert links: map anchor texts to the documents the links point to
• calculate link rank and web graph and update the crawldb
• deduplicate documents by signature
• index document content, metadata, and anchor texts


1. Crawler Workflow (diagram)


III. How to crawl the web?
2. Installing Apache Solr (8.11.2)
• Download and unzip Apache Solr

2. Installing Apache Solr
• Check the Solr installation
+ Start: bin\solr.cmd start
+ Go to: http://localhost:8983/solr/#/
+ Status: bin\solr.cmd status
+ Stop: bin\solr.cmd stop



3. Crawl a site
Customize your crawl properties in the conf folder of
{nutch_home} and configure the regular expression
filters:
conf/nutch-site.xml
conf/regex-urlfilter.txt
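As an illustration, minimal contents for these two files might look like the following (the agent name and the URL pattern are placeholder assumptions; Nutch refuses to crawl until http.agent.name is set, and the filter below restricts the crawl to one seed site):

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: a minimal sketch; the agent name is a placeholder -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>MyNutchCrawler</value>
  </property>
</configuration>
```

```
# conf/regex-urlfilter.txt: accept only the placeholder seed site
# (this replaces the default catch-all "+." rule at the end of the file)
+^https?://nutch\.apache\.org/
# reject everything else
-.
```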


3. Crawl a site
Create a URL seed list
+ Create a urls folder
+ Create a file seed.txt under the urls folder and add the site
to be crawled
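These two steps can be done from the Cygwin terminal; the URL below is a placeholder for the site you actually want to crawl:

```shell
# Hedged sketch: create the seed list. The URL is a placeholder; use the
# site you want to crawl (it should match your regex-urlfilter rules).
mkdir -p urls
echo "https://nutch.apache.org/" > urls/seed.txt
cat urls/seed.txt
```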

3. Crawl a site
+ Open a Cygwin terminal
Seeding the crawldb with a list of URLs
+ bin/nutch inject crawl/crawldb urls
Fetching
+ bin/nutch generate crawl/crawldb crawl/segments
+ s1=`ls -d crawl/segments/2* | tail -1`
  echo $s1
+ bin/nutch fetch $s1
+ bin/nutch parse $s1
+ bin/nutch updatedb crawl/crawldb $s1
+ bin/nutch generate crawl/crawldb crawl/segments -topN 1000
+ s2=`ls -d crawl/segments/2* | tail -1`
  echo $s2
+ bin/nutch fetch $s2
+ bin/nutch parse $s2
+ bin/nutch updatedb crawl/crawldb $s2
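The repeated generate, fetch, parse, updatedb sequence above can be written as one loop. This is a sketch: NUTCH defaults to `echo bin/nutch` so the sequence can be previewed on any machine; set NUTCH=bin/nutch and run it from {nutch_home} for a real crawl. The round count is an illustrative assumption.

```shell
# Hedged sketch of one-or-more crawl rounds; defaults to a dry run that
# only prints the commands it would execute.
NUTCH="${NUTCH:-echo bin/nutch}"
for round in 1 2; do
  $NUTCH generate crawl/crawldb crawl/segments -topN 1000
  # pick the newest segment (directory names start with a timestamp)
  segment=$(ls -d crawl/segments/2* 2>/dev/null | tail -1)
  $NUTCH fetch "$segment"
  $NUTCH parse "$segment"
  $NUTCH updatedb crawl/crawldb "$segment"
done
```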


3. Crawl a site
Invertlinks
bin/nutch invertlinks crawl/linkdb
-dir crawl/segments
Dump to file and take a look:
bin/nutch readlinkdb
crawl/linkdb -dump out2


3. Crawl a site – Some errors and solutions

Solution:
+ Download and extract the Hadoop package
+ Set the environment variables %HADOOP_HOME% and PATH
+ Download the winutils.exe binary from a Hadoop redistribution and extract it to a folder
+ Replace the bin folder of the Hadoop folder with the bin folder from the Hadoop redistribution, which contains winutils.exe
+ Copy hadoop.dll from the bin of the Hadoop redistribution into C:/Windows/System32


3. Crawl a site

Using apache-nutch-1.18 will produce the above error
Solution:
+ Upgrade the version of Apache Nutch to 1.19



