Online Course Search
Professor Kim Kyoung-Yun
Contents
1. Introduction: Online course Search Engine
2. Crawling by Apache Nutch
3. Indexing in Apache Solr
4. Dataset
5. Functions
6. UI design
7. Results
8. Discussion
1. Introduction
Project domain: Online Course
URLs:
Concept map:
1. Introduction
Scenario flow chart
1. Introduction
Scenario flow chart
Case shown: the words were not trained in the model
1. Introduction: Architecture
[Architecture diagram]
• User input: [number (int or str)] + keywords, for example “six machine learning python” or “6 python computer vision”.
• Query processing (correct, split).
• Data collection: the crawler writes a .csv file with Title, Offered by, Level, Description, Skill, and Rating fields.
• Topic modeling produces topics; documents are indexed in Apache Solr (document index, query parser).
• Recommendation: calculating semantic similarity score, ranking.
• Result: Title, Score, URL, Description.
1. Introduction: Architecture (paper search)
[Architecture diagram]
• User input: keywords, for example “Human balance estimation”.
• Query processing (correct, split).
• Data collection: the crawler extracts metadata connecting paper, author, affiliation, journal, conference, book, and publisher.
• Topic modeling; topics are used in the filter condition.
• Document index; calculating semantic similarity score, ranking.
• Result: Title, URL, Abstract, Author, conference/journal/book name, publisher, publish_date.
2. Crawling by Apache Nutch
• The seed.txt file contains the start URLs.
• The regex-urlfilter.txt file includes +^ rules that restrict the crawl to the target URL prefixes.
2. Crawling by Apache Nutch
+ Open a Cygwin terminal and go to {nutch_home}:
Crawl: bin/crawl -i -s urls crawl 3
Dump to file:
• bin/nutch readdb crawl/crawldb -stats > stats.txt
• bin/nutch readdb crawl/crawldb/ -dump db
• bin/nutch readlinkdb crawl/linkdb/ -dump link
• bin/nutch readseg -dump crawl/segments/{segment_folder} crawl/segments/{segment_folder}_dump -nocontent -nofetch -noparse -noparsedata -noparsetext
1) CrawlDb - contains all the links parsed by Nutch.
2) LinkDb - contains, for each URL, the outgoing and incoming URLs.
3) Segment - contains the list of URLs to be crawled or being crawled.
3. Indexing in Solr
bin/nutch index crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/{segments_folder}/ -filter -normalize -deleteGone
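As a quick check of the index, Solr can be queried directly. This is a minimal sketch, assuming a local Solr instance with a core named nutch and the pysolr client; the core name, URL, and field names are assumptions, not part of the project setup.

import pysolr

# Core name and URL are assumptions; adjust to the local Solr setup
solr = pysolr.Solr("http://localhost:8983/solr/nutch", timeout=10)

# Nutch's Solr schema typically exposes title, url, and content fields
results = solr.search("title:machine", rows=5)
for doc in results:
    print(doc.get("title"), doc.get("url"))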
4. Dataset
For the first URL, for example: cializations/machinelearning-introduction
[Screenshots of the crawled course pages: each page can be crawled.]
4. Dataset
[Screenshots: some pages have missing data, but we can ignore this; the pages can still be crawled.]
4. Dataset
[Further screenshots of crawled course pages.]
5. Functions
Topic modelling: Top2Vec, which leverages joint document and word semantic embedding to find topic vectors.
• Top2Vec(documents, speed, workers)
documents: Input corpus; should be a list of strings.
speed: Determines how long the model takes to train. The ‘fast-learn’ option is the fastest and will generate the lowest quality vectors. The ‘learn’ option will learn better quality vectors but takes longer to train. The ‘deep-learn’ option will learn the best quality vectors but takes significant time to train.
workers: The number of worker threads to be used in training the model. A larger number will lead to faster training.
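A minimal sketch of training the topic model on the crawled course descriptions; the .csv file name and the Description column are assumptions based on the dataset fields listed earlier.

import pandas as pd
from top2vec import Top2Vec

# Load the crawled course data (file name and column name are assumptions)
courses = pd.read_csv("courses.csv")
documents = courses["Description"].dropna().astype(str).tolist()

# 'learn' trades training time for vector quality; more workers -> faster training
model = Top2Vec(documents, speed="learn", workers=8)

print(model.get_num_topics())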
5. Functions
• search_documents_by_keywords(keywords, num_docs, keywords_neg=None,
return_documents=True, use_index=False, ef=None)
Semantic search of documents using keywords. The most semantically similar documents to the
combination of the keywords will be returned. If negative keywords are provided, the documents
will be semantically dissimilar to those words. Too many keywords or certain combinations of words
may give strange results. This method finds an average vector (negative keywords are subtracted) of
all the keyword vectors and returns the documents closest to the resulting vector.
Parameters
• keywords (List of str) – List of positive keywords being used for search of semantically similar documents.
• keywords_neg (List of str (Optional)) – List of negative keywords being used for search of semantically
dissimilar documents.
• num_docs (int) – Number of documents to return.
• return_documents (bool (Optional default True)) – Determines if the documents will be returned. If they were
not saved in the model they will also not be returned.
Return: documents, doc_scores, doc_ids
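A minimal usage sketch, assuming the Top2Vec model trained above; the keywords and number of documents are illustrative.

# Semantic keyword search over the trained model
documents, doc_scores, doc_ids = model.search_documents_by_keywords(
    keywords=["python", "machine", "learning"],
    num_docs=6,
)

for score, doc in zip(doc_scores, documents):
    print(round(float(score), 3), doc[:80])

Note that every keyword must be in the model's vocabulary; otherwise Top2Vec raises an error, which is why the query is spell-corrected first.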
5. Functions
• correcting_word(keyword)
Jaccard distance, the complement of the Jaccard coefficient, is used to measure the dissimilarity between two sample sets.
We get the Jaccard distance by subtracting the Jaccard coefficient from 1, or equivalently by dividing the difference between the sizes of the union and the intersection of the two sets by the size of the union:
d(A, B) = 1 − |A ∩ B| / |A ∪ B| = (|A ∪ B| − |A ∩ B|) / |A ∪ B|
For example: pyth -> python, machne -> machine
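A minimal sketch of this correction using character bigrams and NLTK's Jaccard distance; the small vocabulary below is illustrative, and in the real system it would come from the trained model's vocabulary.

from nltk.metrics.distance import jaccard_distance
from nltk.util import ngrams

vocabulary = ["python", "machine", "learning", "vision"]  # illustrative vocabulary

def correcting_word(keyword, vocab=vocabulary):
    # Pick the vocabulary word whose character-bigram set has the smallest Jaccard distance
    query_bigrams = set(ngrams(keyword, 2))
    distances = [(jaccard_distance(query_bigrams, set(ngrams(word, 2))), word) for word in vocab]
    return min(distances)[1]

print(correcting_word("pyth"))    # -> python
print(correcting_word("machne"))  # -> machine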
5. Functions
• get_recommendation (query)
-> Splitting query
-> Correcting query
-> Calculating score of query and documents
-> Display result: “number” courses (title, URL, score, and description)
query: [number (str or int)] + keywords
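A minimal sketch of this pipeline, assuming the correcting_word helper and the Top2Vec model from the previous slides; the number-word mapping and the default number of results are simplifications, not the project's exact implementation.

NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
                "six": 6, "seven": 7, "eight": 8, "nine": 9, "ten": 10}

def get_recommendation(query, model, vocab, default_num=5):
    tokens = query.lower().split()                        # split the query
    num_docs = default_num
    if tokens and (tokens[0].isdigit() or tokens[0] in NUMBER_WORDS):
        num_docs = int(tokens[0]) if tokens[0].isdigit() else NUMBER_WORDS[tokens[0]]
        tokens = tokens[1:]
    keywords = [correcting_word(t, vocab) for t in tokens]  # correct each keyword
    docs, scores, ids = model.search_documents_by_keywords(keywords, num_docs)
    for doc, score in zip(docs, scores):
        # In the full system, the title, URL, and description for each doc_id
        # would be looked up in the crawled .csv before display.
        print(round(float(score), 3), doc[:80])
    return docs, scores, ids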
5. Functions
• Showing “number” results when the number of available results is >= the requested number; otherwise showing all available results.
• If the query is misspelled, the system can guess the intended query and display the results.
For example: 7 pyt learning -> 7 python learning
• Input: the number can be a str or an int (one or 1, two or 2).
• If the keywords were not trained in the model, it cannot display results.
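Putting the pieces together, the example from this slide could be issued as follows (illustrative call against the sketches above).

# "7 pyt learning" -> 7 results for the corrected keywords "python learning"
get_recommendation("7 pyt learning", model, vocabulary)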