Information Retrieval (IR)

GATE
Annie Lib
Lucene
Course Project
Presenter:
Bui Dac Thinh
For: IR Students TH2010
Objective
Information Retrieval:
Search Engine
on crawled text datasets
and open-source systems
Question: “What is the largest city of Vietnam?”
Answer document set: “……….”
Result: HIT / NOT HIT
Search Engine Architecture
Components: User Interface, Caching, Indexing and Ranking, Index Builder, Web Page Parser, Crawler, Web Graph Builder, Link Analysis
Data: Inverted index, Cached Pages, Page & Site Statistics, Page Ranks, Web Graph, Pages, Links, Anchors, Link Map
The diagram is divided into an Online Part and an Offline Part
7/2/14
What to do
Crawler: crawler4j (Java), Sphider (PHP), Scrapy (Python), HtmlAgilityPack (.NET), GeckoFX (.NET)
Crawler output: link set, textual data
Preprocessing: GATE (Java), UIMA (Java)
Preprocessing output: data
NLP: ANNIE, OpenNLP
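As a rough illustration of the crawler step above, the sketch below uses only Python's standard library to pull a link set and the textual data out of a single HTML page. The class name `PageExtractor`, the sample HTML, and the base URL are hypothetical; a real project would rely on one of the crawlers listed above, and fetching, politeness rules, and deduplication are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PageExtractor(HTMLParser):
    """Collects outgoing links and visible text from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()        # the "link set"
        self.text_parts = []      # the "textual data"
        self._skip_depth = 0      # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self.links.add(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def extract(html, base_url):
    """Return (links, text) for one page."""
    parser = PageExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links, " ".join(parser.text_parts)


# Hypothetical page content and URL, for illustration only.
sample = '<html><body><h1>IR</h1><p>See <a href="/lucene">Lucene</a>.</p></body></html>'
links, text = extract(sample, "http://example.org/")
```

The links would feed the crawl frontier, while the text would go on to preprocessing.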
Survey of Tools & Resources

General frameworks

UIMA

GATE

NLP components, pipelines, and tools

Stanford Named Entity Recognizer (NER)

Stanford CoreNLP (CoreNLP)

NegEx (NegEx)

ENJU (ENJU)

OpenNLP
Java framework
Apache OpenNLP

OpenNLP tools

Sentence detector

Tokenizer

POS tagger

Shallow and full syntactic parser

Named-entity detector

Emdros

Text database engine for analyzed and annotated text

Mallet

Machine learning for language toolkit in Java

NLTK

Weka

Wordnet::Similarity

Measures of semantic relatedness using WordNet
ANNIE
A Nearly-New Information Extraction System

Document Reset


Tokeniser

Gazetteer

Sentence Splitter

RegEx Sentence Splitter

POS Tagger

Semantic Tagger
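To make the first stages of such a pipeline concrete, here is a toy sketch in plain Python. It is not the GATE or ANNIE API: the function names, the gazetteer entries, and the regular expressions are assumptions chosen for illustration, and the POS and semantic tagging stages are omitted.

```python
import re

# Hypothetical gazetteer entries: surface form -> annotation type.
GAZETTEER = {"Hanoi": "location", "Vietnam": "location"}


def tokenise(text):
    """Return (start, end, token) triples for words and punctuation marks."""
    return [(m.start(), m.end(), m.group())
            for m in re.finditer(r"\w+|[^\w\s]", text)]


def gazetteer_lookup(tokens):
    """Annotate every token that appears in the gazetteer list."""
    return [(start, end, GAZETTEER[tok])
            for start, end, tok in tokens if tok in GAZETTEER]


def split_sentences(text):
    """Regex sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


text = "Hanoi is the capital of Vietnam. It is a large city."
tokens = tokenise(text)
lookups = gazetteer_lookup(tokens)
sentences = split_sentences(text)
```

Each stage adds annotations with character offsets over the same text, which is the general idea behind chaining processing resources.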
GATE

Open source software

Community of Text engineering

Defined and repeatable process

The Eclipse of NLP

The Lucene of Information Extraction
GATE
Annotation
Plug-in
Document

Corpus
Process
GATE Live Demo
Lucene
• An Apache project (originally part of Apache Jakarta)
• Popular: Xerox, Apple, Wikipedia, IBM, CNN, Nutch…
• Open source, written in Java
• A widely used, high-performance framework for IR
• Index
• Search
Lucene4c / CLucene
Nlucene / Lucene.NET
PyLucene
Ferret / RubyLucene
ZEND Framework
What uses Lucene
NUTCH
WIKIPEDIA
RED-PIRANHA

CNET
……….
Lucene Sketch
/library/os-apache-lucenesearch/
In 5 Mins
Analysis with Lucene
Input: database, textual format
Analysis process: extract words, remove stop words, stem to root
Output: tokens
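The three analysis steps can be sketched in a few lines of plain Python. This is not Lucene code: the stop-word list is a tiny assumed sample, and `crude_stem` is a hypothetical suffix stripper standing in for a real stemming algorithm.

```python
import re

STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}  # assumed, tiny list


def crude_stem(word):
    """Strip a few common English suffixes (stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def analyze(text):
    """Turn raw text into index tokens."""
    words = re.findall(r"[a-z0-9]+", text.lower())      # extract words
    kept = [w for w in words if w not in STOP_WORDS]    # remove stop words
    return [crude_stem(w) for w in kept]                # stem to root


tokens = analyze("Indexing the crawled pages and searching documents")
```

The same analysis has to be applied to queries at search time, otherwise query terms would not match the stemmed tokens stored in the index.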
Analyzer: operations done on the text data
WhitespaceAnalyzer: splits tokens at whitespace
SimpleAnalyzer: divides text at non-letter characters; puts text in lowercase
StopAnalyzer: removes stop words; puts text in lowercase
StandardAnalyzer: tokenizes text based on a sophisticated grammar that recognizes e-mail addresses; acronyms; Chinese, Japanese, and Korean characters; alphanumerics; and more; removes stop words; puts text in lowercase
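To show how the simpler analyzers differ on the same input, the sketch below imitates their described behaviour in plain Python. These functions are illustrative stand-ins, not Lucene classes; the stop-word list is assumed, only ASCII letters are treated as letters, and the grammar-based StandardAnalyzer is not reproduced.

```python
import re

STOP_WORDS = {"a", "an", "and", "the", "to"}  # assumed; not Lucene's actual list


def whitespace_analyze(text):
    """Split at whitespace only; case and punctuation are kept."""
    return text.split()


def simple_analyze(text):
    """Split at non-letter characters and lowercase every token."""
    return [t.lower() for t in re.split(r"[^A-Za-z]+", text) if t]


def stop_analyze(text):
    """Like simple_analyze, then drop stop words."""
    return [t for t in simple_analyze(text) if t not in STOP_WORDS]


# Hypothetical sample input.
sample = "The Quick-Brown fox, e-mailed to bob@example.com"
by_whitespace = whitespace_analyze(sample)
by_simple = simple_analyze(sample)
by_stop = stop_analyze(sample)
```

Note how the e-mail address survives whitespace splitting as one token but is broken into pieces by the letter-based split, which is the kind of case a grammar-based tokenizer is meant to handle.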
Index with Lucene
Input: database, textual format
Indexing process
Output: inverted index
Core indexing classes
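The data structure produced by indexing can be sketched independently of Lucene. The following minimal Python example builds an inverted index (term to list of document ids) from a few hypothetical documents; the function name and the whitespace tokenization are assumptions, and real indexes also store positions, frequencies, and field information.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


# Hypothetical documents, for illustration only.
docs = {
    1: "lucene is a search library",
    2: "gate is a text engineering framework",
    3: "lucene builds an inverted index",
}
index = build_inverted_index(docs)
```

Looking up `index["lucene"]` then answers "which documents contain this term" without scanning the documents themselves.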
Search with Lucene

Query

Term – Field – Term Modifier

IndexSearcher

Displaying search results
Search with Lucene
Query

Query = Term(s) + Operator(s)
Term
Field
Term modifier: ?, *, + …
Boolean: AND, +, NOT, -
Grouping: ( )
Range: [], {}
Boost: ^
…
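Boolean operators map naturally onto set operations over postings. The sketch below is a simplified model, not Lucene's query evaluation: the postings table, the document ids, and the helper names `and_`, `or_`, and `not_` are all hypothetical.

```python
# Hypothetical postings: term -> set of ids of documents containing it.
POSTINGS = {
    "jakarta": {1, 2, 4},
    "apache": {1, 3, 4},
    "lucene": {2, 4},
}
ALL_DOCS = {1, 2, 3, 4}


def docs_for(term):
    """Documents containing the term (empty set if the term is unknown)."""
    return POSTINGS.get(term, set())


def and_(a, b):
    return a & b          # both operands must match


def or_(a, b):
    return a | b          # either operand may match


def not_(a):
    return ALL_DOCS - a   # everything except the matches


# jakarta AND apache NOT lucene
result = and_(and_(docs_for("jakarta"), docs_for("apache")),
              not_(docs_for("lucene")))
either = or_(docs_for("jakarta"), docs_for("lucene"))
```

Grouping with parentheses corresponds to the nesting of these calls.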
Search with Lucene
Term
Key: word/phrase
Value: id
Sentences:
“John likes to watch movies. Mary likes movies too”
“John also likes to watch football games”
[1, 2, 1, 1, 2, 0, 0, 0, 1, 1]
[1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
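The two vectors are term counts over a shared vocabulary. The slide does not show the vocabulary, so the ordering below is an assumption, chosen because it reproduces the vectors above; the function name `count_vector` is likewise hypothetical.

```python
import re
from collections import Counter

# Assumed vocabulary order; chosen so the output matches the vectors above.
VOCABULARY = ["John", "likes", "to", "watch", "movies",
              "also", "football", "games", "Mary", "too"]


def count_vector(sentence):
    """Count how often each vocabulary word occurs in the sentence."""
    counts = Counter(re.findall(r"\w+", sentence))
    return [counts[word] for word in VOCABULARY]


v1 = count_vector("John likes to watch movies. Mary likes movies too")
v2 = count_vector("John also likes to watch football games")
```

For example, the 2 in the second position of `v1` reflects that "likes" appears twice in the first sentence.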
Search with Lucene
Field

Restrict a term to a specific field with field:term
title:"The Right Way" AND text:go
title:"Do it right" AND right
title:Do it right
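The difference between the last two queries is that a field prefix applies only to the term directly after it. The small parser below illustrates this; it is a sketch, not Lucene's QueryParser, and it assumes that `text` is the default field and that only AND, OR, and NOT are operators.

```python
import re

OPERATORS = {"AND", "OR", "NOT"}
# Optional "field:" prefix, followed by a quoted phrase or a bare word.
TOKEN = re.compile(r'(?:(\w+):)?("[^"]*"|\S+)')


def parse_fielded(query, default_field="text"):
    """Return (field, term) pairs; unprefixed terms go to the default field."""
    clauses = []
    for field, term in TOKEN.findall(query):
        if not field and term in OPERATORS:
            continue  # operators are skipped in this sketch
        clauses.append((field or default_field, term.strip('"')))
    return clauses


phrase_query = parse_fielded('title:"Do it right" AND right')
bare_query = parse_fielded("title:Do it right")
```

In the unquoted form only "Do" is searched in the title field, while "it" and "right" fall back to the default field.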
Search with Lucene
Term Modifier
Wildcard Search: te?t test*
Fuzzy Search: roam~0.8
Proximity Search: "jakarta apache"~10
Range Search: mod_year:[2002 TO 2003]
Boosting a term: jakarta^4 apache
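Three of these modifiers can be approximated with short Python helpers. These are simplified models rather than Lucene's implementation: the wildcard check reuses `fnmatch` (which also accepts `[...]` patterns that Lucene does not), the fuzzy check reports plain edit distance instead of the 0.8 similarity threshold, and the proximity check only compares token positions. All function names are hypothetical.

```python
from fnmatch import fnmatchcase


def wildcard_match(term, pattern):
    """? matches exactly one character, * matches any run of characters."""
    return fnmatchcase(term, pattern)


def edit_distance(a, b):
    """Levenshtein distance, the usual basis of fuzzy term matching."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def within_proximity(tokens, w1, w2, max_gap):
    """True if w1 and w2 occur at most max_gap positions apart."""
    pos1 = [i for i, t in enumerate(tokens) if t == w1]
    pos2 = [i for i, t in enumerate(tokens) if t == w2]
    return any(abs(i - j) <= max_gap for i in pos1 for j in pos2)


# Hypothetical document text.
doc = "apache software foundation hosts the jakarta project".split()
near = within_proximity(doc, "jakarta", "apache", 10)
```

Under this model `te?t` accepts "test" and "text", and "foam" is one edit away from "roam".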
Search with Lucene
IndexSearcher

Primary class

searching indices stored in a given directory

calculates a score for each of the documents
that match a given query
Search with Lucene
Search Result

Primary class


ScoreDoc: a document that hits

Position

Score

TopDocs: the total number of documents that hit
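To show how a searcher, per-document scores, and a result set fit together, here is a toy search in plain Python. The `ScoreDoc` and `TopDocs` tuples are stand-ins named after the Lucene classes, not the classes themselves, and the TF-IDF-style score is an assumed simplification rather than Lucene's scoring formula.

```python
import math
from collections import Counter, namedtuple

ScoreDoc = namedtuple("ScoreDoc", ["doc", "score"])            # one hit
TopDocs = namedtuple("TopDocs", ["total_hits", "score_docs"])  # result set


def search(docs, query, n):
    """Score every document against the query and return the n best hits."""
    terms = query.lower().split()
    tokenized = [text.lower().split() for text in docs]
    hits = []
    for doc_id, tokens in enumerate(tokenized):
        tf = Counter(tokens)
        score = 0.0
        for term in terms:
            if tf[term]:
                # Number of documents containing the term (at least 1 here).
                df = sum(1 for t in tokenized if term in t)
                score += tf[term] * math.log(1 + len(docs) / df)
        if score > 0:
            hits.append(ScoreDoc(doc_id, score))
    hits.sort(key=lambda h: h.score, reverse=True)
    return TopDocs(len(hits), hits[:n])


# Hypothetical documents, for illustration only.
docs = ["lucene search library", "lucene lucene index", "gate text engineering"]
top = search(docs, "lucene index", n=1)
```

Here `top.total_hits` counts every matching document, while `top.score_docs` holds only the best `n` of them in descending score order.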