Course Recommendation: Building a Course Recommendation System

Online Course Search
Professor Kim Kyoung-Yun


Content
1. Introduction: Online Course Search Engine
2. Crawling by Apache Nutch
3. Indexing in Apache Solr
4. Data analysis
5. Functions
6. UI design
7. Results
8. Discussion


1. Introduction
Project domain: Online Course
URLs: (links not preserved)
Concept map: (figure)


1. Introduction
Scenario flow chart (figure); one branch is annotated "the words were
not trained", i.e. queries whose words are not in the trained model
return no result.

1. Introduction: Architecture
Architecture diagram (figure). The flow it shows:
• User enters a query: [number (int or str)] + keywords, for example
"six machine learning python" or "6 python computer vision".
• Query processing (correct, split) feeds the query parser.
• Data collection: a crawler gathers course pages into a .csv file with
the fields Title, Offered by, Level, Description, Skill, Rating; the
documents are indexed in Apache Solr.
• Topic modeling derives topics from the document index.
• Recommendation: calculating semantic similarity score and ranking.
• Result: Title, Score, URL, Description.

1. Introduction: Architecture
Architecture diagram for the paper-search variant (figure; seed URLs
not preserved). The flow it shows:
• User enters a query of keywords, for example "Human balance estimation".
• Query processing (correct, split).
• Data collection: a crawler extracts metadata and connects the entities
affiliation, journal, author, book, conference, publisher, and paper
into a document index.
• Topic modeling derives topics, which are used in the filter condition.
• Calculating semantic similarity score and ranking.
• Result: Title, URL, Abstract, Author, conference/journal/book name,
publisher, publish_date.


2. Crawling by Apache Nutch
• The seed.txt file (seed URLs not preserved)

• The regex-urlfilter file includes two "+^" rules whitelisting the seed
domains (the URL patterns were not preserved).

2. Crawling by Apache Nutch

+ Open a Cygwin terminal and go to {nutch_home}:
Crawl: bin/crawl -i -s urls crawl 3

Dump to file:
• bin/nutch readdb crawl/crawldb -stats > stats.txt
• bin/nutch readdb crawl/crawldb/ -dump db
• bin/nutch readlinkdb crawl/linkdb/ -dump link
• bin/nutch readseg -dump crawl/segments/{segment_folder}
crawl/segments/{segment_folder}_dump -nocontent -nofetch -noparse -noparsedata -noparsetext
1) CrawlDb - It contains information about every URL known to Nutch,
including its fetch status.
2) LinkDb - It contains, for each URL, the known incoming links and
their anchor text.
3) Segment - It contains the set of URLs fetched and parsed as a unit
in one crawl round.


3. Indexing in Solr
bin/nutch index crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/{segments_folder}/ -filter -normalize -deleteGone


4. Dataset
For the first URL, for example .../cializations/machinelearning-introduction:
the screenshots mark the page fields that can be crawled.

4. Dataset
Some data is missing, but we can ignore this; the remaining fields can
still be crawled.

5. Functions
Topic modeling: Top2Vec, which leverages joint document and word semantic
embedding to find topic vectors.
• Top2Vec(documents, speed, workers)
documents: Input corpus; should be a list of strings.
speed: Determines how long the model takes to train. The 'fast-learn'
option is the fastest but generates the lowest-quality vectors. The
'learn' option learns better-quality vectors but takes longer to train.
The 'deep-learn' option learns the best-quality vectors but takes
significant time to train.
workers: The number of worker threads used to train the model. A larger
number leads to faster training.


5. Functions
• search_documents_by_keywords(keywords, num_docs, keywords_neg=None,
return_documents=True, use_index=False, ef=None)
Semantic search of documents using keywords. The documents most semantically similar to the
combination of the keywords will be returned. If negative keywords are provided, the documents
will be semantically dissimilar to those words. Too many keywords or certain combinations of words
may give strange results. This method finds an average vector (negative keywords are subtracted)
of all the keyword vectors and returns the documents closest to the resulting vector.
Parameters
• keywords (List of str) – List of positive keywords being used for search of semantically similar documents.
• keywords_neg (List of str (Optional)) – List of negative keywords being used for search of semantically
dissimilar documents.
• num_docs (int) – Number of documents to return.
• return_documents (bool (Optional default True)) – Determines if the documents will be returned. If they were
not saved in the model they will also not be returned.
Return: documents, doc_scores, doc_ids
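The averaging-and-ranking idea behind search_documents_by_keywords can be sketched in plain Python. The toy 2-D vectors, document names, and the helper search_by_keywords below are invented for illustration; the real method operates on Top2Vec's learned embeddings.

```python
import math

def cosine(a, b):
    # Cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def search_by_keywords(doc_vecs, word_vecs, keywords, num_docs, keywords_neg=()):
    # Average vector of the positive keywords, minus the negative keywords.
    query = [0.0, 0.0]
    for w in keywords:
        query = [q + v for q, v in zip(query, word_vecs[w])]
    for w in keywords_neg:
        query = [q - v for q, v in zip(query, word_vecs[w])]
    query = [q / len(keywords) for q in query]
    # Rank documents by cosine similarity to the resulting vector.
    scored = sorted(((cosine(vec, query), doc) for doc, vec in doc_vecs.items()),
                    reverse=True)
    return [(doc, round(score, 3)) for score, doc in scored[:num_docs]]

# Toy 2-D embeddings (invented for illustration).
word_vecs = {"python": [1.0, 0.2], "statistics": [0.1, 1.0]}
doc_vecs = {
    "Python for Everybody": [0.9, 0.1],
    "Intro to Statistics":  [0.2, 0.9],
}
print(search_by_keywords(doc_vecs, word_vecs, ["python"], num_docs=1))
```

Subtracting the negative-keyword vectors before ranking is what makes the returned documents dissimilar to those words, as described above.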


5. Functions
• correcting_word(keyword)

Jaccard distance, the complement of the Jaccard coefficient, is used to
measure the dissimilarity between two sample sets.
We get the Jaccard distance by subtracting the Jaccard coefficient from 1.
We can also get it by dividing the difference between the sizes of the
union and the intersection of two sets by the size of the union.
For example: pyth -> python, machne -> machine
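A minimal sketch of correcting_word under the assumption that candidates come from the model's vocabulary and that words are compared as sets of character bigrams; the vocabulary below is invented for illustration.

```python
def bigrams(word):
    # A word as the set of its adjacent character pairs.
    return {word[i:i + 2] for i in range(len(word) - 1)}

def jaccard_distance(a, b):
    # 1 - |A ∩ B| / |A ∪ B|, equivalently (|A ∪ B| - |A ∩ B|) / |A ∪ B|.
    union = a | b
    return (len(union) - len(a & b)) / len(union)

def correcting_word(keyword, vocabulary):
    # Pick the vocabulary word whose bigram set is least dissimilar to the input's.
    return min(vocabulary, key=lambda w: jaccard_distance(bigrams(keyword), bigrams(w)))

vocabulary = ["python", "machine", "learning", "vision"]
print(correcting_word("pyth", vocabulary))    # -> python
print(correcting_word("machne", vocabulary))  # -> machine
```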


5. Functions
• get_recommendation(query)
-> Splitting the query
-> Correcting the query
-> Calculating the score of the query against documents
-> Displaying the result: "number" courses (title, url, score, and description)
query: consists of [number (str or int)] + keywords
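The query-splitting step can be sketched as follows; split_query and its default count are hypothetical names, but the input format ([number as digit or word] + keywords) is the one the slides describe.

```python
# Map number words to counts, since the count may be "six" or "6".
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
                "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}

def split_query(query, default=5):
    """Split a query into (number of results, keyword list)."""
    tokens = query.lower().split()
    if tokens and tokens[0].isdigit():
        return int(tokens[0]), tokens[1:]
    if tokens and tokens[0] in NUMBER_WORDS:
        return NUMBER_WORDS[tokens[0]], tokens[1:]
    return default, tokens  # no leading count: fall back to a default

print(split_query("six machine learning python"))  # (6, ['machine', 'learning', 'python'])
print(split_query("6 python computer vision"))     # (6, ['python', 'computer', 'vision'])
```

The remaining keywords would then go through correcting_word before scoring.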


5. Functions
• Showing "number" results: if the number of matching results is >= number,
show "number" of them; otherwise show however many matches there are.
• If the query is misspelled, the system can guess the intended query and
display the results.
For example: 7 pyt learning -> 7 python learning
• Input: the number can be str or int (one or 1, two or 2)
• If the keywords were not trained in the model, it cannot display any result.


