GATE
Annie Lib
Lucene
Course Project
Presenter:
Bui Dac Thinh
For: IR Students TH2010
Objective
Information Retrieval:
Search Engine
on crawled text datasets
and open-source system
“What is the largest city of Vietnam?”
“……….”
HIT/ NOT HIT
question
Answer document set
Objective
Search Engine Architecture
User Interface
Caching
Indexing and Ranking
Index Builder
Web Page Parser
Crawler
Web Graph
Builder
Link
Analysis
Inverted
index
Cached
Pages
Page & Site
Statistics
Page
Ranks
Web
Graph
Pages
Links
Anchors
Link Map
Online Part
Offline Part
7/2/14
What to do
Crawler
crawler4j (Java)
Sphider (PHP)
Scrapy (Python)
HtmlAgilityPack (.NET)
GeckoFX (.NET)
Link set
Textual data
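The crawler step above turns each fetched page into a link set and textual data. A minimal sketch of that split, using only the Python standard library; the class name `PageParser`, the function `parse_page`, and the sample page are hypothetical and do not come from any of the listed crawlers:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageParser(HTMLParser):
    """Collects outgoing links and text from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []        # the "link set"
        self.text_parts = []   # the "textual data"

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, href))

    def handle_data(self, data):
        if data.strip():
            self.text_parts.append(data.strip())


def parse_page(base_url, html):
    parser = PageParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.links, " ".join(parser.text_parts)


# Hypothetical page content, for illustration only.
html = '<html><body><h1>Hanoi</h1><p>Capital of <a href="/vn">Vietnam</a>.</p></body></html>'
links, text = parse_page("http://example.org/hanoi", html)
```

A real crawler would also fetch pages, respect robots.txt, skip script and style content, and avoid revisiting URLs; the tools listed above handle those parts.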
Preprocessing
GATE (Java)
UIMA (Java)
Data
ANNIE
OPENNLP
NLP
Survey of Tools & Resources
General frameworks
UIMA
GATE
NLP components, pipelines, and tools
Stanford Named Entity Recognizer (NER)
Stanford CoreNLP
NegEx
Enju
OpenNLP
Java framework
Apache OpenNLP
OpenNLP tools
Sentence detector
Tokenizer
POS tagger
Shallow and full syntactic parser
Named-entity detector
Emdros
Text database engine for analyzed and annotated text
Mallet
Machine learning for language toolkit in Java
NLTK
Weka
Wordnet::Similarity
Measures of semantic relatedness using WordNet
Annie
A Nearly-New Information Extraction System
Annie
A Nearly-New Information Extraction System
Document Reset
Tokeniser
Gazetteer
Sentence Splitter
RegEx Sentence Splitter
POS Tagger
Semantic Tagger
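The ANNIE components above run as a pipeline, each one adding annotations over character offsets in the document. A toy sketch of three of the stages (tokeniser, gazetteer, regex sentence splitter) in plain Python; this does not use the GATE API, and the `GAZETTEER` entries, function names, and tuple layout are assumptions made for illustration:

```python
import re

# Hypothetical gazetteer list; real ANNIE gazetteers are far larger.
GAZETTEER = {"Hanoi": "location", "Vietnam": "location"}


def tokenise(text):
    """Return Token annotations as (start, end, type, features) tuples."""
    return [(m.start(), m.end(), "Token", {"string": m.group()})
            for m in re.finditer(r"\w+|[^\w\s]", text)]


def gazetteer_lookup(tokens):
    """Emit a Lookup annotation for every token found in the gazetteer."""
    return [(start, end, "Lookup", {"majorType": GAZETTEER[feats["string"]]})
            for start, end, _, feats in tokens
            if feats["string"] in GAZETTEER]


def split_sentences(text):
    """Naive regex splitter: a sentence ends at '.', '!' or '?'."""
    return [(m.start(), m.end(), "Sentence", {})
            for m in re.finditer(r"\S[^.!?]*[.!?]?", text)]


text = "Hanoi is in Vietnam. It is large."
tokens = tokenise(text)
lookups = gazetteer_lookup(tokens)
sentences = split_sentences(text)
```

Here `lookups` marks "Hanoi" and "Vietnam" by offset, which is the kind of information a later semantic tagger could turn into named entities.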
GATE
Open source software
Community of Text engineering
Defined and repeatable process
The Eclipse of NLP
The Lucene of Information Extraction
GATE
Annotation
Plug-in
Document
Corpus
Process
GATE
LIVE
DEMO
Lucene
• A product of the Apache Jakarta project
• Popular: Xerox, Apple, Wikipedia, IBM, CNN, Nutch…
• Open source, written in Java
• An efficient, widely used framework for IR
• Index
• Search
Lucene4c / CLucene
NLucene / Lucene.NET
PyLucene
Ferret / RubyLucene
Zend Framework
What uses Lucene
NUTCH
WIKIPEDIA
RED-PIRANHA
CNET
……….
Lucene Sketch
/library/os-apache-lucenesearch/
In 5 Mins
Analysis with Lucene
Database
Textual Format
Analysis process
TOKENs
Extract words
Remove stop words
Stem to root
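The three analysis steps above (extract words, remove stop words, stem to root) can be sketched in a few lines of Python. This is an illustration only: the `STOP_WORDS` set is a small made-up list, and `stem` is a crude suffix stripper standing in for a real stemmer such as Porter's.

```python
import re

# Small illustrative stop-word list (an assumption, not Lucene's list).
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}


def extract_words(text):
    """Lowercase the text and pull out alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(words):
    return [w for w in words if w not in STOP_WORDS]


def stem(word):
    """Very crude suffix stripping; keeps at least three characters."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def analyze(text):
    """Run the three steps in order and return the resulting tokens."""
    return [stem(w) for w in remove_stop_words(extract_words(text))]


tokens = analyze("The crawler is indexing pages of the web")
```

The resulting tokens are what gets stored in the index, so the same analysis has to be applied to queries at search time.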
Analyzer
Operations done
on the text data
WhitespaceAnalyzer
Splits tokens at whitespace
SimpleAnalyzer
Divides text at non-letter characters
Puts text in lowercase
StopAnalyzer
Removes stop words
Puts text in lowercase
StandardAnalyzer
Tokenizes text based on a sophisticated grammar that recognizes e-mail addresses; acronyms; Chinese, Japanese, and Korean characters; alphanumerics; and more
Removes stop words
Puts text in lowercase
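To make the differences between the first three analyzers concrete, the sketch below imitates their described behavior in plain Python. These are not Lucene's implementations: the functions, the ASCII-only letter rule, and the `STOP_WORDS` set are simplifying assumptions, and StandardAnalyzer is left out because its grammar is much more involved.

```python
import re

# Illustrative stop-word list (an assumption, not Lucene's built-in set).
STOP_WORDS = {"a", "an", "and", "at", "is", "the", "to"}


def whitespace_analyze(text):
    """Like WhitespaceAnalyzer: split at whitespace, change nothing else."""
    return text.split()


def simple_analyze(text):
    """Like SimpleAnalyzer: split at non-letter characters, then lowercase."""
    return [t.lower() for t in re.split(r"[^A-Za-z]+", text) if t]


def stop_analyze(text):
    """Like StopAnalyzer: SimpleAnalyzer behavior plus stop-word removal."""
    return [t for t in simple_analyze(text) if t not in STOP_WORDS]


text = "The XY&Z Corp. is at xyz@example.com"
ws_tokens = whitespace_analyze(text)
simple_tokens = simple_analyze(text)
stop_tokens = stop_analyze(text)
```

Note how the e-mail address survives only the whitespace split; keeping such tokens intact is the kind of case the StandardAnalyzer grammar is described as handling.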
Analysis with Lucene
TOKENs
Index with Lucene
Database
Textual Format
Indexing process
INVERTED
INDEX
Core indexing classes
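The indexing process above ends in an inverted index: a map from each term to the documents that contain it. A minimal sketch of the data structure in Python, not of Lucene's on-disk format; the sample documents and the whitespace tokenization are assumptions:

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Hypothetical documents, for illustration only.
docs = {
    1: "lucene is a search library",
    2: "gate is a text engineering framework",
    3: "lucene builds an inverted index",
}
index = build_inverted_index(docs)
```

Looking up `index["lucene"]` then answers "which documents mention lucene?" without scanning the text of every document.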
Search with Lucene
Query
Term – Field – Term Modifier
IndexSearcher
Displaying search results
Search with Lucene
Query
Query = Term(s) + Operator(s)
Term
Field
Term modifier: ?, *, + …
Boolean: AND, +, NOT, -
Grouping: ( )
Range: [], {}
Boost: ^
…
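Since a query is terms combined with operators, Boolean operators can be read as set operations on the document ids each term matches. A small sketch under that reading; the `postings` data and the helper names `term`, `and_`, `or_`, `not_` are hypothetical and are not Lucene classes:

```python
# Toy postings (term -> set of matching document ids); hypothetical data.
postings = {
    "jakarta": {1, 2, 4},
    "apache": {2, 3, 4},
    "lucene": {4, 5},
}
all_docs = {1, 2, 3, 4, 5}


def term(t):
    """Documents that contain the term (empty set if the term is unknown)."""
    return postings.get(t, set())


def and_(a, b):
    return a & b


def or_(a, b):
    return a | b


def not_(a):
    return all_docs - a


# Evaluates the query: jakarta AND apache NOT lucene
result = and_(and_(term("jakarta"), term("apache")), not_(term("lucene")))
either = or_(term("jakarta"), term("lucene"))
```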
Search with Lucene
Term
Key: word/phrase
Value: id
“John likes to watch movies. Mary likes movies too”
“John also likes to watch football games”
Sentences
[1, 2, 1, 1, 2, 0, 0, 0, 1, 1]
[1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
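The two vectors above are term counts of each sentence over a shared word list. The sketch below reproduces them; the `VOCABULARY` order is an assumption chosen so that the output matches the vectors on the slide, since the slide does not show the word list itself.

```python
import re
from collections import Counter

# Assumed word order; picked to match the vectors shown above.
VOCABULARY = ["John", "likes", "to", "watch", "movies",
              "also", "football", "games", "Mary", "too"]


def bag_of_words(sentence, vocabulary=VOCABULARY):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(re.findall(r"\w+", sentence))
    return [counts[word] for word in vocabulary]


v1 = bag_of_words("John likes to watch movies. Mary likes movies too")
v2 = bag_of_words("John also likes to watch football games")
```

Each position in a vector is the id of a word (the key), and the number stored there is how many times that word occurs in the sentence.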
Search with Lucene
Field
Search within a specific field
title:"The Right Way" AND text:go
title:"Do it right" AND right
title:Do it right
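The last example above is a common pitfall: a field prefix applies only to the term directly after it, so in title:Do it right only "Do" is searched in the title and the remaining words go to the default field. A rough sketch of that rule, not Lucene's query parser; the regex, the function `parse_fields`, and the default field name `text` are assumptions:

```python
import re

# An optional "field:" prefix followed by a quoted phrase or a single word.
CLAUSE = re.compile(r'(?:(\w+):)?("[^"]*"|\S+)')
OPERATORS = {"AND", "OR", "NOT"}


def parse_fields(query, default_field="text"):
    """Return (field, term) pairs; unprefixed terms use the default field."""
    clauses = []
    for field, value in CLAUSE.findall(query):
        if not field and value in OPERATORS:
            continue  # Boolean operators are not terms
        clauses.append((field or default_field, value.strip('"')))
    return clauses


q1 = parse_fields('title:"The Right Way" AND text:go')
q2 = parse_fields('title:"Do it right" AND right')
q3 = parse_fields('title:Do it right')
```

Quoting the phrase, as in the second query, is what keeps all three words in the title field.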
Search with Lucene
Term Modifier
Wildcard Search: te?t, test*
Fuzzy Search: roam~0.8
Proximity Search: "jakarta apache"~10
Range Search: mod_year:[2002 TO 2003]
Boosting a term: jakarta^4 apache
Search with Lucene
IndexSearcher
Primary class for searching
Searches indices stored in a given directory
Calculates a score for each document that matches a given query
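To show what "calculates a score for each document" can mean, here is a simple TF-IDF style score in Python. It is a stand-in for illustration and not Lucene's scoring formula; the sample documents, the function `score`, and the exact weighting are assumptions.

```python
import math
from collections import Counter

# Hypothetical documents, for illustration only.
docs = {
    0: "lucene search engine",
    1: "lucene index lucene",
    2: "gate text engineering",
}


def score(query, doc_text, docs):
    """Sum of term frequency times inverse document frequency."""
    tf = Counter(doc_text.split())
    total = 0.0
    for term in query.split():
        # Number of documents that contain the term.
        df = sum(1 for text in docs.values() if term in text.split())
        if df == 0:
            continue
        idf = math.log(len(docs) / df) + 1.0
        total += tf[term] * idf
    return total


scores = {doc_id: score("lucene", text, docs) for doc_id, text in docs.items()}
```

Document 1 mentions the query term twice and so scores highest, while document 2 does not match and scores zero.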
Search with Lucene
Search Result
Primary classes
ScoreDoc: a document that hits
Document id (position in the index)
Score
TopDocs: the top-ranked hits and the total number of documents that hit
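The result classes above pair each hit with its score and keep the total hit count next to the best hits. A sketch of that shape in Python; `ScoredHit` and `TopHits` are hypothetical stand-ins loosely modeled on ScoreDoc and TopDocs, and the input scores are made up:

```python
import heapq
from collections import namedtuple

ScoredHit = namedtuple("ScoredHit", ["doc", "score"])      # one hit
TopHits = namedtuple("TopHits", ["total_hits", "hits"])    # the best n hits


def top_hits(scores, n):
    """Keep the n best-scoring matches and report how many matched in total."""
    matching = [ScoredHit(doc, s) for doc, s in scores.items() if s > 0]
    best = heapq.nlargest(n, matching, key=lambda hit: hit.score)
    return TopHits(total_hits=len(matching), hits=best)


# Hypothetical scores per document id.
scores = {0: 1.4, 1: 2.8, 2: 0.0, 3: 0.7}
result = top_hits(scores, 2)
```

Displaying search results then means walking `result.hits` in order and loading each document by its id.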