Building an inverted index and searching over a large document collection


CAN THO UNIVERSITY
FACULTY OF INFORMATION & COMMUNICATION TECHNOLOGY

BIG DATA
Building an inverted index
and searching over a large document collection

INSTRUCTOR: Dr. PHAN THƯỢNG CANG

STUDENTS (Master's class HTTT-K24):
LÊ THỊ HỒNG CHIÊU - M2517001
PHAN THỊ THÚY KIỀU - M2517009


CONTENTS
1. Scope
2. Inverted Index and Search algorithms
3. Program demo
4. Conclusion
5. References
6. Lab guide

1. SCOPE
- Build a simple inverted index using the MapReduce model.
- Input data set: .txt files.
- Index structure (the docID lists are not sorted):
<Term_1> [tab] <docID> [space] <docID> ...
<Term_2> [tab] <docID> [space] <docID> ...
Example:
Line 1: term1 doc1.txt doc2.txt ...
Line 2: term2 doc2.txt doc5.txt ...
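The line layout above (term, a tab, then docIDs separated by single spaces) can be written and read back with a few string operations. The following is a minimal sketch, not part of the presented program; the class and method names are hypothetical, and it assumes every line is well formed (exactly one tab, no spaces inside a docID).

```java
import java.util.Arrays;
import java.util.List;

final class IndexLineSketch {
    /** Formats one index line: the term, a tab, then the docIDs joined by single spaces. */
    static String format(String term, List<String> docIds) {
        return term + "\t" + String.join(" ", docIds);
    }

    /** Returns the term of a well-formed index line (the text before the first tab). */
    static String term(String line) {
        return line.substring(0, line.indexOf('\t'));
    }

    /** Returns the docIDs of a well-formed index line (the text after the first tab). */
    static List<String> docIds(String line) {
        return Arrays.asList(line.substring(line.indexOf('\t') + 1).split(" "));
    }

    public static void main(String[] args) {
        String line = format("term1", List.of("doc1.txt", "doc2.txt"));
        System.out.println(line);
        System.out.println(term(line) + " -> " + docIds(line));
    }
}
```

Because the separator between docIDs is a space, a file name that contains a space could not be represented in this layout; the sketch inherits that limitation.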

1. SCOPE (continued)
- Simple search over the inverted index using the MapReduce model (the whole index is scanned; no technique is applied to speed up the search).
- Structure of the stored search results: each term that is found is output together with the list of docIDs that contain it.
<Term_1> [tab] <docID> [space] <docID> ...
<Term_2> [tab] <docID> [space] <docID> ...

2.1. Inverted Index (general)
Documents to be indexed: Friends, Romans, countrymen.
  -> Tokenizer
Token stream: Friends Romans Countrymen
  -> Linguistic modules
Modified tokens: friend roman countryman
  -> Indexer
Inverted index:
friend      -> 2, 4
roman       -> 1, 2
countryman  -> 13, 16

2.1. Inverted Index: applied to this problem
Tokenizer -> (term, docID) -> Sort by term -> Inverted Index (Dictionary and Postings)

2.1. Indexer steps: Tokenizer

Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.

Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious

2.1. Indexer steps: Sort

2.1. Indexer steps: Dictionary & Postings
- Multiple term entries in a single document are merged.
- Split into Dictionary and Postings
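The three indexer steps (tokenize, sort, merge into dictionary and postings) can be sketched in memory as follows. This is an illustration under stated assumptions, not the presented implementation: the class name is hypothetical, tokens are simply lower-cased and split on anything that is not a letter or an apostrophe, and no stemming is applied.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class SortBasedIndexerSketch {
    /** Builds term -> ascending list of docIDs from docID -> text. */
    static Map<String, List<Integer>> build(Map<Integer, String> docs) {
        // Step 1 (Tokenizer): emit one (term, docID) pair per token.
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (Map.Entry<Integer, String> doc : docs.entrySet()) {
            String text = doc.getValue().toLowerCase(Locale.ROOT);
            for (String token : text.split("[^a-z']+")) {
                if (!token.isEmpty()) {
                    pairs.add(Map.entry(token, doc.getKey()));
                }
            }
        }
        // Step 2 (Sort): order the pairs by term, then by docID.
        pairs.sort(Map.Entry.<String, Integer>comparingByKey()
                .thenComparing(Map.Entry.<String, Integer>comparingByValue()));
        // Step 3 (Dictionary & Postings): repeated (term, docID) pairs are now
        // adjacent, so a docID is appended only if it differs from the last one.
        Map<String, List<Integer>> index = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            List<Integer> postings =
                    index.computeIfAbsent(pair.getKey(), t -> new ArrayList<>());
            if (postings.isEmpty()
                    || !postings.get(postings.size() - 1).equals(pair.getValue())) {
                postings.add(pair.getValue());
            }
        }
        return index;
    }

    public static void main(String[] args) {
        Map<Integer, String> docs = Map.of(
                1, "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
                2, "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious");
        build(docs).forEach((term, postings) -> System.out.println(term + "\t" + postings));
    }
}
```

Sorting first is what makes the merge step a single linear pass: duplicates of the same term within one document end up next to each other.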

2.1. Inverted Index - MapReduce
Input, Splitting
Mapping (Mapper class): Tokenizer, emits (term, docID)
Shuffling: Sort by term
Reducing (Reducer class): Inverted Index (Dictionary and Postings)

2.1. Inverted Index – MapReduce: Model

Input files:
Doc1.txt: red orange blue / yellow blue
Doc2.txt: orange black red / yellow orange

Splitting (default 128 MB per split; here one split per file):
Split 1 (Doc1.txt): red orange blue / yellow blue
Split 2 (Doc2.txt): orange black red / yellow orange

Mapping, output (Key, Value):
Split 1: red, Doc1.txt | orange, Doc1.txt | blue, Doc1.txt | yellow, Doc1.txt | blue, Doc1.txt
Split 2: orange, Doc2.txt | black, Doc2.txt | red, Doc2.txt | yellow, Doc2.txt | orange, Doc2.txt

Shuffling, grouped and sorted by key (Key, Value):
black, Doc2.txt
blue, Doc1.txt | blue, Doc1.txt
orange, Doc1.txt | orange, Doc2.txt | orange, Doc2.txt
red, Doc1.txt | red, Doc2.txt
yellow, Doc1.txt | yellow, Doc2.txt

Reducing, output (Key, Value):
black, Doc2.txt
blue, Doc1.txt
orange, Doc1.txt Doc2.txt
red, Doc1.txt Doc2.txt
yellow, Doc1.txt Doc2.txt

Result:
black, Doc2.txt
blue, Doc1.txt
orange, Doc1.txt Doc2.txt
red, Doc1.txt Doc2.txt
yellow, Doc1.txt Doc2.txt
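To make the data flow of this example concrete, the three phases can be simulated in plain Java without Hadoop. This is only a single-process sketch with hypothetical class and method names; real MapReduce distributes the same steps across tasks, and the order of docIDs within a list is not guaranteed there.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class MapReduceIndexSketch {
    /** Map: emit one {term, docID} pair per whitespace-separated token of a line. */
    static List<String[]> map(String docId, String line) {
        List<String[]> pairs = new ArrayList<>();
        for (String term : line.trim().split("\\s+")) {
            if (!term.isEmpty()) {
                pairs.add(new String[] {term, docId});
            }
        }
        return pairs;
    }

    /** Shuffle: group the docIDs of all pairs by term, with terms in sorted order. */
    static TreeMap<String, List<String>> shuffle(List<String[]> pairs) {
        TreeMap<String, List<String>> groups = new TreeMap<>();
        for (String[] pair : pairs) {
            groups.computeIfAbsent(pair[0], t -> new ArrayList<>()).add(pair[1]);
        }
        return groups;
    }

    /** Reduce: drop duplicate docIDs (keeping first-seen order) and join with spaces. */
    static String reduce(List<String> docIds) {
        return String.join(" ", new LinkedHashSet<>(docIds));
    }

    /** Runs the three phases over files given as docID -> lines. */
    static Map<String, String> run(Map<String, List<String>> files) {
        List<String[]> pairs = new ArrayList<>();
        files.forEach((docId, lines) ->
                lines.forEach(line -> pairs.addAll(map(docId, line))));
        Map<String, String> index = new LinkedHashMap<>();
        shuffle(pairs).forEach((term, docIds) -> index.put(term, reduce(docIds)));
        return index;
    }

    public static void main(String[] args) {
        Map<String, List<String>> files = new LinkedHashMap<>();
        files.put("Doc1.txt", List.of("red orange blue", "yellow blue"));
        files.put("Doc2.txt", List.of("orange black red", "yellow orange"));
        run(files).forEach((term, docIds) -> System.out.println(term + "\t" + docIds));
    }
}
```

With the two example files, `run` yields the five index lines listed under Result, for instance `orange` mapped to `Doc1.txt Doc2.txt`.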

2.1. Inverted Index – MapReduce: Map

Map(k1: position of the line in the file,
    v1: a line of text in the file){
  docID = file.getName();
  word[] = v1.split();
  for (i = 0; i < word.length; i++)
    emit(k2: word[i], v2: docID);
}

2.1. Inverted Index – MapReduce: Reduce

Reduce(k2: the word,
       v2[]: list of docIDs with the same k2){
  deleteDuplicate(v2);
  listdocID = "";
  for each docID in v2:
    append docID to listdocID, separated by a space;
  emit(k3: the word ~ k2, v3: listdocID);
}

2.2. Search - MapReduce
Input, Splitting: query file, index files
Mapping (Mapper class): Tokenizer, emits (term/query, listdocID) for each term that appears in both the query and the index
Shuffling: Sort by term/query
Reducing (Reducer class): write results (term/query, listdocID); a whole query as key needs additional processing

2.2. Search – MapReduce: Map

Map(k1: position of the line in the index file,
    v1: a line of text in the index file){
  index = v1.split("\t");   // index[0]: term, index[1]: listdocID
  // word[]: the terms of the query
  for (i = 0; i < word.length; i++)
    if (word[i].equals(index[0]))
      emit(k2: word[i], v2: index[1]);
}

2.2. Search – MapReduce: Reduce

Reduce(k2: the word,
       v2[]: list of listdocID values with the same k2;
             in this case v2 has one element){
  // if k2 is a whole query, additional processing is needed
  emit(k3: the word ~ k2, v3: listdocID ~ v2[0]);
}
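The search described here is a plain linear scan: every index line is compared against every query term. A minimal single-process sketch of that logic, with hypothetical names and assuming index lines in the `term [tab] docIDs` layout, could look like this:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class LinearSearchSketch {
    /**
     * Scans all index lines and returns term -> docID list for every index
     * term that equals a (lower-cased) query term. No lookup structure is
     * used, mirroring the unoptimized search described in the text.
     */
    static Map<String, String> search(List<String> indexLines, String query) {
        String[] words = query.trim().split("\\s+");
        Map<String, String> hits = new LinkedHashMap<>();
        for (String line : indexLines) {
            String[] index = line.split("\t", 2); // index[0]: term, index[1]: docIDs
            if (index.length < 2) {
                continue; // skip lines that do not follow the expected layout
            }
            for (String word : words) {
                if (word.toLowerCase(Locale.ROOT).equals(index[0])) {
                    hits.put(index[0], index[1]);
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> indexLines = List.of(
                "black\tDoc2.txt",
                "blue\tDoc1.txt",
                "orange\tDoc1.txt Doc2.txt",
                "red\tDoc1.txt Doc2.txt",
                "yellow\tDoc1.txt Doc2.txt");
        search(indexLines, "Blue purple red")
                .forEach((term, docIds) -> System.out.println(term + "\t" + docIds));
    }
}
```

The cost is proportional to (number of index lines) x (number of query terms), which is why the scope section notes that no speed-up technique is applied.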

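2.3. Inverted Index – Java code: Mapper class
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class InvertedIndexMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Text docID = new Text();
    private final Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // The docID is the name of the file that this input split belongs to.
        FileSplit file = (FileSplit) context.getInputSplit();
        docID.set(file.getPath().getName());
        // Tokenize the line and emit (k2: word, v2: docID).
        StringTokenizer textOfLine =
                new StringTokenizer(value.toString().toLowerCase());
        while (textOfLine.hasMoreTokens()) {
            word.set(textOfLine.nextToken());
            context.write(word, docID);
        }
    }
}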

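2.3. Inverted Index – Java code: Reducer class
import java.io.IOException;
import java.util.LinkedHashSet;
import java.util.Set;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // A fresh set on every call, so nothing carries over between keys.
        // The set removes duplicates by exact match and keeps first-seen order.
        Set<String> docIDs = new LinkedHashSet<>();
        for (Text value : values) {
            // When this class also runs as the combiner, a value may already be
            // a space-separated list, so split it back into single docIDs.
            for (String docID : value.toString().trim().split("\\s+")) {
                if (!docID.isEmpty()) {
                    docIDs.add(docID);
                }
            }
        }
        // emit ~ Output(k3: word, v3: listdocID)
        context.write(key, new Text(String.join(" ", docIDs)));
    }
}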
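Duplicate removal in the reduce step can be done in more than one way, and the choice matters. As a hedged illustration (the class and method names are hypothetical and not part of the presented job), the sketch below compares substring matching on the accumulated string with exact matching through a set. Substring matching can silently drop a docID that merely occurs inside another file name.

```java
import java.util.LinkedHashSet;
import java.util.List;

final class DedupSketch {
    /** Skips a docID whenever the text built so far contains it anywhere. */
    static String bySubstring(List<String> docIds) {
        String out = "";
        for (String id : docIds) {
            if (!out.contains(id)) {
                out = out.isEmpty() ? id : out + " " + id;
            }
        }
        return out;
    }

    /** Skips a docID only if exactly the same docID was seen before. */
    static String bySet(List<String> docIds) {
        return String.join(" ", new LinkedHashSet<>(docIds));
    }

    public static void main(String[] args) {
        List<String> docIds = List.of("mydoc1.txt", "doc1.txt", "mydoc1.txt");
        System.out.println(bySubstring(docIds)); // "doc1.txt" is lost here
        System.out.println(bySet(docIds));       // both distinct docIDs are kept
    }
}
```

For the file names used in the slides (Doc1.txt, Doc2.txt) both approaches give the same result; the difference only shows up when one name is a substring of another.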

2.3. Inverted Index – Java code: Driver class
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InvertedIndexDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "Inverted Index");
        job.setJarByClass(InvertedIndexDriver.class); // class that contains main
        job.setMapperClass(InvertedIndexMapper.class);
        job.setCombinerClass(InvertedIndexReducer.class);
        job.setReducerClass(InvertedIndexReducer.class);
        job.setOutputKeyClass(Text.class);   // key type of the reduce output
        job.setOutputValueClass(Text.class); // value type of the reduce output
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Set the number of reducers.
        job.setNumReduceTasks(2);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

2.3. Search – Java code: Mapper class
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SearchMapper extends Mapper<LongWritable, Text, Text, Text> {
    private String[] fields = new String[0]; // the query terms

    // The query is read once in setup() rather than in map(),
    // because map() is called again for every index line.
    @Override
    public void setup(Context context) throws IOException {
        Configuration conf = context.getConfiguration();
        Path path = new Path(conf.get("queryPath"));
        FileSystem fileSystem = FileSystem.get(conf);
        StringBuilder query = new StringBuilder();
        // Read the query file, joining its lines with spaces.
        try (BufferedReader bufferedReader = new BufferedReader(
                new InputStreamReader(fileSystem.open(path)))) {
            String lineQuery;
            while ((lineQuery = bufferedReader.readLine()) != null) {
                if (query.length() > 0) {
                    query.append(' ');
                }
                query.append(lineQuery);
            }
        }
        String trimmed = query.toString().trim();
        if (!trimmed.isEmpty()) {
            fields = trimmed.split("\\s+"); // split the query into terms
        }
    }

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split one index line into term and docID list.
        String[] index = value.toString().split("\t", 2);
        if (index.length < 2) {
            return; // skip lines that do not follow the index layout
        }
        for (String field : fields) {
            if (field.toLowerCase().equals(index[0])) {
                context.write(new Text(field), new Text(index[1]));
            }
        }
    } // end map
} // end mapper

2.3. Search – Java code: Reducer class
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SearchReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // All values of one key arrive at the same reducer, and the index is
        // already grouped by term, so each key normally carries a single
        // docID list and no further grouping is needed.
        // emit ~ Output(k3: word, v3: listdocID)
        for (Text docIDs : values) {
            // The default text output already puts a tab between key and value.
            context.write(key, docIDs);
        }
    } // end reduce
} // end SearchReducer

2.3. Search – Java code: Driver class
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class SearchDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String[] otherArgs =
                new GenericOptionsParser(conf, args).getRemainingArgs();
        if (otherArgs.length != 3) {
            System.err.println("args include: <input-path> <output-path> <query-path>");
            System.exit(2);
        }
        String queryPath = otherArgs[2]; // path of the query file
        conf.set("queryPath", queryPath);
        Job job = Job.getInstance(conf, "Search Engine");
        job.setJarByClass(SearchDriver.class); // class that contains main
        job.setMapperClass(SearchMapper.class);
        job.setCombinerClass(SearchReducer.class);
        job.setReducerClass(SearchReducer.class);
        job.setOutputKeyClass(Text.class);   // key type of the reduce output
        job.setOutputValueClass(Text.class); // value type of the reduce output
        // Use the arguments left after generic Hadoop options were parsed.
        FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

3. Program demo
Live demo:
+ Quantity: 7 files (~19 kB)
+ Format: plain text (.txt).
Environment: one virtual machine running Linux (Ubuntu 16.04 distribution, 5 GB RAM, 50 GB disk).
Results of an earlier run:
+ Quantity: 1,364 files (~4.3 MB)
+ Format: plain text (.txt).
+ Processing time: ~2.5 hours

4. CONCLUSION
- Applied the MapReduce model to a simple problem.
- Completed the stated scope: building an inverted index and searching over the index.
- Because the test environment was very limited, the advantages of the MapReduce programming model could not be demonstrated.
