
Information Retrieval (IR)

GATE
Annie Lib
Lucene
Course Project
Presenter:
Bui Dac Thinh
For: IR Students TH2010
Objective
Information Retrieval:
Search Engine
on crawled text datasets
and open-source system
“What is the largest city of VietNam?”
“……….”
HIT/ NOT HIT
question
Answer document set
Search Engine Architecture
User Interface
Caching
Indexing and Ranking
Index Builder
Web Page Parser
Crawler
Web Graph
Builder


Link
Analysis
Inverted
index
Cached
Pages
Page & Site
Statistics
Page
Ranks
Web
Graph
Pages
Links
Anchors
Link Map
Online Part
Offline Part
7/2/14
What to do
Crawler
crawler4j JAVA
Sphider PHP
Scrapy PYTHON
HTMLAgilityPack .NET
GeckoFx .NET
Link set
Textual data
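
A crawler turns each fetched page into the two outputs named above: a link set and textual data. The sketch below is a minimal illustration using only Python's standard `html.parser`; it is not how crawler4j, Scrapy, or the other listed tools work internally. The class name `PageParser`, the helper `parse_page`, and the sample page are hypothetical, and text inside script or style elements is not filtered out.

```python
from html.parser import HTMLParser


class PageParser(HTMLParser):
    """Collects outgoing links and visible text from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []       # the "link set" handed back to the crawler
        self.text_parts = []  # the "textual data" passed on to preprocessing

    def handle_starttag(self, tag, attrs):
        # Record the href of every anchor tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

    def handle_data(self, data):
        # Keep non-empty text runs between tags.
        stripped = data.strip()
        if stripped:
            self.text_parts.append(stripped)


def parse_page(html):
    """Return (links, text) for one page of HTML."""
    parser = PageParser()
    parser.feed(html)
    parser.close()
    return parser.links, " ".join(parser.text_parts)


# Hypothetical sample page, for illustration only.
SAMPLE = (
    "<html><body><h1>IR course</h1>"
    '<p>See <a href="/lucene">Lucene</a> and <a href="/gate">GATE</a>.</p>'
    "</body></html>"
)
links, text = parse_page(SAMPLE)
print(links)
print(text)
```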
Preprocessing

GATE JAVA
UIMA JAVA
Data
ANNIE
OPENNLP
NLP
Survey of Tools & Resources

General frameworks

UIMA

GATE

NLP components, pipelines, and tools

Stanford Named Entity Recognizer (NER)

Stanford CoreNLP (CoreNLP)

NegEx (NegEx)

ENJU (ENJU)

OpenNLP
Java framework
Apache OpenNLP

OpenNLP tools


Sentence detector

Tokenizer

Named-entity detector

POS tagger

Shallow and full syntactic parser

Emdros

Text database engine for analyzed and annotated text

Mallet

Machine learning for language toolkit in Java

NLTK

Weka

Wordnet::Similarity

Measures of semantic relatedness using WordNet
ANNIE
A Nearly-New Information Extraction System

Document Reset


Tokeniser

Gazetteer

Sentence Splitter

RegEx Sentence Splitter

POS Tagger

Semantic Tagger
GATE

Open source software

Community of Text engineering

Defined and repeatable process

The Eclipse of NLP

The Lucene of Information Extraction
GATE
Annotation
Plug-in
Document

Corpus
Process
GATE
LIVE
DEMO
Lucene

A product of the Apache Jakarta project

Popular: Xerox, Apple, Wikipedia, IBM, CNN, Nutch…

Open source in JAVA

The most efficient framework for IR

Index

Search
Lucene4c / CLucene
NLucene / Lucene.NET
PyLucene
Ferret / RubyLucene
ZEND Framework
What uses Lucene
NUTCH
WIKIPEDIA
RED-PIRANHA

CNET
……….
Lucene Sketch
/>/library/os-apache-lucenesearch/
In 5 minutes
Analysis with Lucene
Database
Textual Format
Analysis process
TOKENs
Extract words
Remove stop words
Stem to root
Analyzer
Operations done
on the text data
WhitespaceAnalyzer
Splits tokens at whitespace
SimpleAnalyzer
Divides text at non-letter characters
Puts text in lowercase
StopAnalyzer
Removes stop words
Puts text in lowercase
StandardAnalyzer
Tokenizes text based on a sophisticated grammar that recognizes e-mail addresses; acronyms; Chinese, Japanese, and Korean characters; alphanumerics; and more
Removes stop words
Puts text in lowercase
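
The differences between the first three analyzers are easiest to see on one input. The following is a plain-Python sketch of the behaviour described above, not Lucene's implementation or API. The function names, the tiny `STOP_WORDS` set, and the sample string are hypothetical. StandardAnalyzer is left out because its grammar is too involved for a short example.

```python
import re

# Hypothetical, tiny stop-word list; a real list is longer.
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in", "is"}


def whitespace_analyzer(text):
    """Split tokens at whitespace only; case and punctuation are kept."""
    return text.split()


def simple_analyzer(text):
    """Divide text at non-letter characters and lowercase each token."""
    # [^\W\d_]+ matches runs of letters (word characters minus digits and "_").
    return [token.lower() for token in re.findall(r"[^\W\d_]+", text)]


def stop_analyzer(text):
    """Same as simple_analyzer, then remove stop words."""
    return [token for token in simple_analyzer(text) if token not in STOP_WORDS]


# Hypothetical sample input.
SAMPLE = "The XY&Z Corporation - xyz@example.com"

for analyzer in (whitespace_analyzer, simple_analyzer, stop_analyzer):
    print(analyzer.__name__, analyzer(SAMPLE))
```

Only the whitespace version keeps the e-mail address in one piece, because the other two split at every non-letter character.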
Analysis with Lucene
TOKENs
Index with Lucene
Database
Textual Format
Indexing process
INVERTED
INDEX
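
An inverted index maps each term to the documents that contain it, so a search looks up terms instead of scanning every document. The sketch below builds one in plain Python. It illustrates the data structure only and does not use Lucene's indexing classes; `tokenize`, `build_inverted_index`, and the two sample documents are assumptions made for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and extract alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


# Hypothetical two-document collection.
DOCS = [
    "John likes to watch movies. Mary likes movies too",
    "John also likes to watch football games",
]
index = build_inverted_index(DOCS)
print(index["john"], index["movies"], index["football"])
```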
Core indexing classes
Search with Lucene

Query

Term – Field – Term Modifier

IndexSearcher

Displaying search results
Search with Lucene
Query

Query = Term(s) + Operator(s)
Term
Field
Term modifier: ?, *, + …
Boolean: AND, +, NOT, -
Grouping: ( )
Range: [], {}
Boost: ^
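
Boolean operators can be read as set operations on the document ids that each term matches: AND intersects, OR unites, NOT subtracts. The sketch below shows that reading on a hypothetical postings table. It is an illustration of the idea, not Lucene's query evaluation, and the helper names are made up for the example.

```python
# Hypothetical postings: term -> set of ids of documents containing it.
POSTINGS = {
    "jakarta": {0, 1, 3},
    "apache": {1, 2, 3},
    "lucene": {3, 4},
}


def docs_for(term):
    """Documents containing the term (empty set if the term is unknown)."""
    return POSTINGS.get(term, set())


def q_and(left, right):
    return left & right   # both operands must match


def q_or(left, right):
    return left | right   # either operand may match


def q_not(left, right):
    return left - right   # drop documents matched by the right operand


# (jakarta AND apache) NOT lucene
result = q_not(q_and(docs_for("jakarta"), docs_for("apache")), docs_for("lucene"))
print(sorted(result))
```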

Search with Lucene
Term
Key: word/phrase
Value: id
“John likes to watch movies. Mary likes movies too”
“John also likes to watch football games”
Sentences
[1, 2, 1, 1, 2, 0, 0, 0, 1, 1]
[1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
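
Each vector counts how often every vocabulary word occurs in one sentence. The sketch below recomputes the two vectors in plain Python. The vocabulary order is an assumption: the slide does not state it, but this order reproduces both vectors shown. `term_vector` is a hypothetical helper name.

```python
import re
from collections import Counter

# Assumed vocabulary order (not stated on the slide); the list position
# of each word plays the role of its id.
VOCABULARY = ["John", "likes", "to", "watch", "movies",
              "also", "football", "games", "Mary", "too"]


def term_vector(sentence, vocabulary=VOCABULARY):
    """Count the occurrences of each vocabulary word in the sentence."""
    counts = Counter(re.findall(r"[A-Za-z]+", sentence))
    return [counts[word] for word in vocabulary]


v1 = term_vector("John likes to watch movies. Mary likes movies too")
v2 = term_vector("John also likes to watch football games")
print(v1)  # [1, 2, 1, 1, 2, 0, 0, 0, 1, 1]
print(v2)  # [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
```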
Search with Lucene
Field

Restrict a term to a named field
title:"The Right Way" AND text:go
title:"Do it right" AND right
title:Do it right
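
The three queries differ in which terms the `title:` prefix covers: a field name applies only to the term or quoted phrase directly after it, and every other term falls back to the default field. The sketch below makes that visible with a small regular-expression splitter. It is not Lucene's query parser; `parse_fields` is a hypothetical helper, and using `text` as the default field is an assumption.

```python
import re

# An optional "field:" prefix followed by a quoted phrase or a bare word.
TOKEN = re.compile(r'(?:(\w+):)?("[^"]*"|\S+)')
OPERATORS = {"AND", "OR", "NOT"}


def parse_fields(query, default_field="text"):
    """Return (field, term) pairs; unprefixed terms use default_field."""
    clauses = []
    for field, term in TOKEN.findall(query):
        if not field and term in OPERATORS:
            continue  # Boolean operators are not search terms
        clauses.append((field or default_field, term.strip('"')))
    return clauses


for query in ('title:"The Right Way" AND text:go',
              'title:"Do it right" AND right',
              'title:Do it right'):
    print(query, "->", parse_fields(query))
```

In the last query only `Do` is searched in `title`; `it` and `right` go to the default field, which is why the phrase needs quotes.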
Search with Lucene
Term Modifier


Wildcard Search: te?t test*

Fuzzy Search: roam~0.8

Proximity Search: "jakarta apache"~10

Range Search: mod_year:[2002 TO 2003]

Boosting term: jakarta^4 apache
Search with Lucene
IndexSearcher

Primary search class

Searches indices stored in a given directory

Calculates a score for each document that matches a given query
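
To show what calculating a score for each matching document can look like, the sketch below ranks documents with a simplified TF-IDF sum. This is a swapped-in textbook formula, not Lucene's scoring function and not the `IndexSearcher` API. The function names and the three sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and extract alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def search(query, docs):
    """Return (doc_id, score) pairs for matching documents, best first."""
    doc_terms = [Counter(tokenize(doc)) for doc in docs]
    n_docs = len(docs)
    scores = {}
    for term in tokenize(query):
        # Document frequency: how many documents contain the term.
        df = sum(1 for counts in doc_terms if term in counts)
        if df == 0:
            continue
        idf = math.log(1 + n_docs / df)  # rarer terms weigh more
        for doc_id, counts in enumerate(doc_terms):
            if term in counts:
                scores[doc_id] = scores.get(doc_id, 0.0) + counts[term] * idf
    # Highest score first; ties broken by document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical three-document collection.
DOCS = [
    "Lucene is a search library",
    "GATE is a text engineering framework",
    "Lucene builds an inverted index; Lucene searches it",
]
hits = search("lucene index", DOCS)
for doc_id, score in hits:
    print(doc_id, round(score, 3))
```

Documents that contain none of the query terms receive no score and are left out of the result.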
Search with Lucene
Search Result

Primary classes

ScoreDoc: a single document that matches the query (a hit)

Position

Score

TopDocs: the set of matching documents, with the total number of hits
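
The relationship between the two result types can be sketched with two small Python data classes. These are stand-ins modelled loosely on the description above, not Lucene's Java classes. Reading "Position" as the id of the document within the index is an assumption, and `top_docs` with its sample scores is hypothetical.

```python
import heapq
from dataclasses import dataclass
from typing import List


@dataclass
class ScoreDoc:
    doc: int      # position (id) of the matching document; assumed reading
    score: float  # relevance score of this hit


@dataclass
class TopDocs:
    total_hits: int             # number of documents that matched
    score_docs: List[ScoreDoc]  # the best hits, highest score first


def top_docs(scores, n):
    """Keep the n best-scoring hits but report the full hit count."""
    best = heapq.nlargest(n, scores.items(), key=lambda item: item[1])
    return TopDocs(len(scores), [ScoreDoc(doc, score) for doc, score in best])


# Hypothetical scores: document id -> score.
result = top_docs({0: 0.4, 3: 1.7, 5: 0.9, 8: 1.2}, n=2)
print(result.total_hits, [(hit.doc, hit.score) for hit in result.score_docs])
```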