Apache Nutch
Professor
Kim Kyoung-Yun
Content
I. What is Apache Nutch?
II. How to install Apache Nutch on Windows 10?
III. How to crawl a website?
I. What is Apache Nutch?
• Apache Nutch is a highly extensible and scalable open-source web
crawler software project
• Runs on top of Hadoop
• Mostly used to feed a search index, also data mining
• Customizable/ extensible plugin architecture
II. How to install Apache Nutch on Windows 10
1. Requirements:
+ Windows-Cygwin environment
+ Java Runtime/Development Environment (JDK 11 / Java 11)
+ (Source build only) Apache Ant
Installing Cygwin
• Download the Cygwin installer and run setup.exe
Installing Java Runtime/Development Environment
• Download Java SE Development Kit 11 for Windows and run the exe file
• Set up environment variables: JAVA_HOME and PATH
• Check the installed Java: java -version
Installing Apache Ant
• Download and install Apache Ant (1.10.12)
• Set variables ANT_HOME and PATH
• Check: ant -version
Successful installation
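The JAVA_HOME, ANT_HOME and PATH settings above can also be made inside the Cygwin shell. The sketch below is only an illustration: both install paths are hypothetical and must be adjusted, and report_tool is a helper name introduced here, not part of Java or Ant.

```shell
# Hypothetical install locations (assumptions); adjust to your machine.
export JAVA_HOME="${JAVA_HOME:-/cygdrive/c/Java/jdk-11}"
export ANT_HOME="${ANT_HOME:-/cygdrive/c/apache-ant-1.10.12}"
export PATH="$JAVA_HOME/bin:$ANT_HOME/bin:$PATH"

# Print whether a command is reachable through PATH.
report_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: not found on PATH"
    fi
}

report_tool java   # if found, run: java -version
report_tool ant    # if found, run: ant -version
```

Putting the export lines in ~/.bashrc would keep them for every new Cygwin terminal.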
2. Installing Nutch
• Download a binary package (apache-nutch-1.X-bin.zip) (1.19)
• Unzip the Nutch package. There should be a folder apache-nutch-1.X
• Move the folder apache-nutch-1.X (nutch_home) into cygwin64/home
• Verify the Nutch installation:
+ Open a cygwin64 terminal
+ Run bin/nutch from {nutch_home}: $ bin/nutch
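A minimal sketch of the verification step, assuming the 1.19 folder was moved into cygwin64/home and is therefore visible in Cygwin as /home/apache-nutch-1.19. The path and the check_nutch helper are assumptions made for illustration.

```shell
# Assumed location of {nutch_home}; adjust if the folder lives elsewhere.
NUTCH_HOME="${NUTCH_HOME:-/home/apache-nutch-1.19}"

# Print a short status line for a candidate Nutch directory.
check_nutch() {
    if [ -f "$1/bin/nutch" ]; then
        echo "bin/nutch found under $1"
    else
        echo "bin/nutch not found under $1"
    fi
}

check_nutch "$NUTCH_HOME"
# If found, run the launcher with no arguments; it should print its usage text:
#   cd "$NUTCH_HOME" && bin/nutch
```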
III. How to crawl a website
1. Crawler Workflow
• initialize crawldb, inject seed URLs
• generate fetch list: select URLs from crawldb for fetching
• fetch URLs from fetch list
• parse documents: extract content, metadata and links
• update crawldb status, score and signature; add new URLs inline or at the end of one crawler run
• invert links: map anchor texts to documents the links point to
• calculate link rank and web graph and update crawldb
• deduplicate documents by signature
• index document content, metadata, and anchor texts
2. Installing Apache Solr (8.11.2)
• Download and unzip Apache Solr
• Check the Solr installation (run from {APACHE_SOLR_HOME}):
+ Start: bin\solr.cmd start
+ Go to this: http://localhost:8983/solr/#/
+ Status: bin\solr.cmd status
+ Stop: bin\solr.cmd stop
3. Crawl a site
Customize your crawl properties in the conf folder of {nutch_home} and configure the regular expression filters:
conf/nutch-site.xml
conf/regex-urlfilter.txt
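A minimal sketch of what these two files might contain, to be run from {nutch_home}. The agent name "My Nutch Spider" and the host example.org are placeholders chosen for illustration, restrict_filter is a helper name introduced here, and GNU sed (as shipped with Cygwin) is assumed. Note that the first step overwrites conf/nutch-site.xml.

```shell
CONF_DIR="${CONF_DIR:-conf}"
mkdir -p "$CONF_DIR"

# Minimal nutch-site.xml: set the crawler's agent name (placeholder value).
# This overwrites any existing nutch-site.xml in CONF_DIR.
cat > "$CONF_DIR/nutch-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>My Nutch Spider</value>
  </property>
</configuration>
EOF

# Replace the catch-all "+." rule with a rule that accepts one host only.
restrict_filter() {
    sed -i 's|^+\.$|+^https?://([a-z0-9-]+\\.)*example\\.org/|' "$1"
}

if [ -f "$CONF_DIR/regex-urlfilter.txt" ]; then
    restrict_filter "$CONF_DIR/regex-urlfilter.txt"
fi
```

Limiting the filter to one host keeps the crawl from following links to the rest of the web.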
Create a URL seed list
+ Create a urls folder
+ Create a file seed.txt under the urls folder and add the site to crawl
For example: />
bin/crawl -i -s urls crawl 2
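The seed list steps can be sketched as follows, run from {nutch_home}. The URL https://example.org/ is a placeholder, not the site used in the slides.

```shell
# Create the seed folder and a seed file with one URL per line.
mkdir -p urls
printf '%s\n' "https://example.org/" > urls/seed.txt

# Show what will be injected into the crawldb.
cat urls/seed.txt
```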
+ Open Cygwin terminal
Crawl: bin/crawl -i -s urls crawl 2
Seeding the crawldb with a list of URLs
+ bin/nutch inject crawl/crawldb urls
Fetching
+ bin/nutch generate crawl/crawldb crawl/segments
+ s1=`ls -d crawl/segments/2* | tail -1`
+ echo $s1
+ bin/nutch fetch $s1
+ bin/nutch parse $s1
+ bin/nutch updatedb crawl/crawldb $s1
+ bin/nutch generate crawl/crawldb crawl/segments -topN 1000
+ s2=`ls -d crawl/segments/2* | tail -1`
+ echo $s2
+ bin/nutch fetch $s2
+ bin/nutch parse $s2
+ bin/nutch updatedb crawl/crawldb $s2
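The two rounds above repeat the same generate, fetch, parse, updatedb sequence, so they can be wrapped in a loop. This is a sketch, not the bin/crawl script itself: crawl_round and the NUTCH variable are names introduced here, and the round count and -topN value are example choices.

```shell
# Launcher to call; override NUTCH to point at another location.
NUTCH="${NUTCH:-bin/nutch}"

# One round; arguments: crawldb, segments folder, topN.
crawl_round() {
    "$NUTCH" generate "$1" "$2" -topN "$3" || return 1
    # Segment names are timestamps, so the last one listed is the newest.
    segment=$(ls -d "$2"/2* | tail -1)
    "$NUTCH" fetch "$segment" &&
        "$NUTCH" parse "$segment" &&
        "$NUTCH" updatedb "$1" "$segment"
}

# Run two rounds when the launcher is available, stopping at the first failure.
if [ -x "$NUTCH" ]; then
    for round in 1 2; do
        crawl_round crawl/crawldb crawl/segments 1000 || break
    done
fi
```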
Invertlinks
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
Dump to file and take a look:
bin/nutch readlinkdb crawl/linkdb -dump out2
Read db
• 1- bin/nutch readdb crawl/crawldb -stats > stats.txt
• 2- bin/nutch readdb crawl/crawldb/ -dump db
• 3- bin/nutch readlinkdb crawl/linkdb/ -dump link
• 4- bin/nutch readseg -dump crawl/segments/20131216194731 crawl/segments/20131216194731_dump -nocontent -nofetch -noparse -noparsedata -noparsetext
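The segment name in the readseg command is a timestamp that differs on every crawl. A small sketch that looks up the newest segment instead of typing the name by hand (latest_segment is a helper name introduced here):

```shell
# Newest segment directory under the given segments folder (empty if none).
latest_segment() {
    ls -d "$1"/2* 2>/dev/null | tail -1
}

segment=$(latest_segment crawl/segments)
if [ -n "$segment" ] && [ -x bin/nutch ]; then
    bin/nutch readseg -dump "$segment" "${segment}_dump" \
        -nocontent -nofetch -noparse -noparsedata -noparsetext
else
    echo "no segment or bin/nutch found; run from {nutch_home} after a crawl"
fi
```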
3. Crawl a site: some errors and solutions
Solution:
+ Download and extract the Hadoop package
+ Set the environment variable %HADOOP_HOME% and PATH
+ Download the winutils.exe binary from a Hadoop redistribution and extract it to a folder
+ Replace the bin folder of the Hadoop folder with the bin folder from the Hadoop redistribution, which contains winutils.exe
+ Copy hadoop.dll from the bin folder of the Hadoop redistribution into C:/Windows/System32
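After these steps, a quick check from the Cygwin terminal can confirm that both native files are in the Hadoop bin folder. The folder /cygdrive/c/hadoop/bin and the names check_hadoop_bin and HADOOP_BIN are assumptions made for this sketch.

```shell
# Report which of the two native files are present in a Hadoop bin folder.
check_hadoop_bin() {
    for f in winutils.exe hadoop.dll; do
        if [ -f "$1/$f" ]; then
            echo "found $f"
        else
            echo "missing $f"
        fi
    done
}

# Hypothetical location; point HADOOP_BIN at the real bin folder.
check_hadoop_bin "${HADOOP_BIN:-/cygdrive/c/hadoop/bin}"
```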
Using apache-nutch-1.18 produces the above error
Solution:
+ Upgrade the version of Apache Nutch to 1.19
4. Indexing in Solr
- Integrate Nutch with Solr
+ Go to the Solr folder (solr-8.11.2)