
MINISTRY OF EDUCATION AND TRAINING          VIETNAM ACADEMY OF SCIENCE AND TECHNOLOGY

GRADUATE UNIVERSITY OF SCIENCE AND TECHNOLOGY
---------------------------


Tran Lam Quan


SOME SEARCHING TECHNIQUES FOR ENTITIES BASED ON IMPLICIT SEMANTIC RELATIONS AND CONTEXT-AWARE QUERY SUGGESTIONS


Major: Mathematical Theory of Informatics
Code: 9.46.01.10


SUMMARY OF MATHEMATICS DOCTORAL THESIS


The thesis was completed at: Graduate University of Science and Technology - Vietnam Academy of Science and Technology.


Scientific supervisor: Dr. Vũ Tất Thắng


Reviewer 1: …
Reviewer 2: …
Reviewer 3: …


The thesis will be defended before the Academy-level doctoral thesis evaluation board, meeting at the Graduate University of Science and Technology - Vietnam Academy of Science and Technology, at … o'clock …, on … (day) … (month) 202… .


The thesis can be found at:



INTRODUCTION
1. The necessity of the thesis


In the big data era, when new data is generated incessantly, the search engine has become a useful tool for users to find information. According to statistics, approximately 71% of web search queries contain entity names [7], [8]. Looking at a query that consists only of entity names - "Vietnam", "Hanoi", "France" - we intuitively see the underlying semantics behind it. In other words, a similar relationship exists between the entity-name pair "Vietnam": "Hanoi" and the entity-name pair "France": "?". Viewed intuitively, this is one of the "natural" abilities of humans: the ability to infer unknown information/knowledge by analogical inference.


With the above query, a human can give an immediate answer, but a Search Engine (SE) can only find documents containing the aforementioned keywords; the SE cannot immediately give the answer "Paris". The same happens in the real world, with questions such as: "If Fansipan is the highest mountain in Vietnam, which one is the highest in Tibet?" or "If you know Elizabeth as Queen of England, who is the Japanese monarch?", etc. For queries with similar relationships as above, the keyword search engine has difficulty giving answers, while humans can easily make analogical inferences.


Figure 1.1: The result list returned from a keyword SE with query = "Việt Nam", "Hà Nội", "Pháp".


Researching and simulating the human ability to deduce from a familiar semantic domain ("Vietnam", "Hanoi") to an unfamiliar semantic domain ("France", "?") is the purpose of the first problem.


The second problem concerns query suggestion. Also according to statistics, the queries users enter are often short, ambiguous, and polysemous [1-6]. In a search session, a large number of results are returned, but most of them do not match the user's search intent. Therefore, many research directions have been proposed to improve results and assist searchers, including: query suggestion, query rewriting, query expansion, personalized recommendation, ranking/re-ranking of search results, etc.



The query suggestion research direction often applies traditional techniques such as query clustering and similarity measurement [9], [10]. However, traditional techniques have three disadvantages. First, they can only give suggestions similar or related to the most recently entered query (the current query), with no guarantee that the suggestions are better than the current query. Second, they cannot give the trend of what most users often ask after the current query. Third, these approaches do not consider the user's queries as a seamless sequence in order to capture the user's search intent. For example, on a keyword SE, type two consecutive queries q1: "Who is Joe Biden" and q2: "How old is he"; q1 and q2 are semantically related, yet the results returned for q1 and q2 are two very different result sets. This shows the disadvantage of keyword search.


Figure 1.2: The answer lists from the SE corresponding to q1 and q2.


By capturing a seamless query sequence - in other words, capturing the search context - the SE will "understand" the user's search intent. Moreover, by capturing the query sequence, the SE can suggest a follow-up query sequence reflecting the knowledge that the community most often asks about after q1, q2. This is the purpose of the second problem.


2. Thesis: Objectives, Layout and Contributions


The thesis researches, identifies, and experiments with methods to solve the two problems above. With the objectives set out, the main contributions of the thesis include:


- The thesis researches and builds an entity search technique based on implicit semantic relations, using clustering methods to improve search efficiency.


- Applying context-aware techniques to build a vertical search engine on its own knowledge-base domain (aviation data).


- Proposing a combined similarity measure for the context-aware query suggestion problem to improve the quality of suggestions.


CHAPTER 1: OVERVIEW


1.1. The problem of searching for entities based on implicit semantic relations


Consider a query consisting of the entities "Qur'an": "Islam", "Gospels": "?". Humans can immediately deduce the "?", but the SE only returns documents that contain the above keywords; it does not immediately give the answer "Christian". Because only entities are being looked for, query expansion and query rewriting techniques do not apply to this form of relationship, whose meaning is hidden in the entity pair. Hence, a new search form is studied, in which the search query has the motif {(A, B), (C, ?)}, where (A, B) is the source entity pair, (C, ?) is the target entity pair, and the two pairs (A, B), (C, ?) have a similar semantic relationship. Specifically, when the user enters a query consisting of 3 entities {(A, B), (C, ?)}, the SE has the task of listing and searching the candidate list of entities D (the entity denoted "?"), where each entity D satisfies the condition of having a semantic relationship with C, and the pair (C, D) has a relationship similar to that of the pair (A, B).

A semantic relation - in the narrow sense and from the lexical perspective - is expressed by the terms/patterns/context surrounding (before, between, after) the known entity pair. Because the semantic relation and the similarity relation are not explicitly stated in the query (the query consists of only 3 entities: A, B, C), this search morphology is called Implicit Relational Entity Search, or Implicit Relational Search, in short: IRS.




Consider an input query that includes only 3 entities, q = "Mekong": "Vietnam", "?": "China". The query q does not describe a semantic relation ("longest river", "largest", "widest basin", etc.). The search model based on implicit semantic relations is responsible for finding the entity "?" such that it satisfies a semantic relationship with the "China" entity, and the "?": "China" pair is similar to the pair "Mekong": "Vietnam".


Finding/calculating the relational similarity between two pairs of entities is a difficult problem for several reasons. First, relational similarity changes over time: considering the two entity pairs (Joe Biden, US President) and (Elizabeth, Queen of England), the similarity of the relationship changes with the term of office. Second, it is difficult because the entities themselves have proper names (names of individuals, organizations, places, ...) which are not common words found in a dictionary.


Figure 1.3: Input query: "Cuba", "José Marti", "Ấn Độ" (implicit semantics: "national hero").


Third, a pair of entities can have many different semantic relations, such as: "The Corona outbreak originated from Wuhan"; "Corona isolates Wuhan city"; "The number of Corona infections decreased gradually in Wuhan"; etc. Fourth, due to the timing factor, two entity pairs may share little or none of the context around them, e.g., Apple: iPod (2010s) and Sony: Walkman (1980s), leading to the two pairs being judged as not similar. Fifth, a pair of entities may have only one semantic relation but more than one expression of it: "X was acquired by Y" and "Y buys X". And finally, it is difficult because the entity D is unknown; D is the entity being searched for.



The search query takes the motif q = {(A, B), (C, ?)}; the query consists of only 3 entities: A, B, C. Identifying the similarity relationship between the entity pairs (A, B) and (C, ?) is a necessary condition for determining the entity to be found. As a problem of NLP (Natural Language Processing), relational similarity is one of the most important tasks in searching for entities based on implicit semantic relations. Thus, the thesis reviews the main research directions for relational similarity.



1.2. IRS - Related work


1.2.1. SMT - Structure Mapping Theory


SMT [12] considers similarity as a mapping of "knowledge" from the source domain to the target domain, according to mapping rules: eliminate the attributes of objects but preserve the relational mapping between objects from the source domain to the target domain.


- Mapping rule: M: si → ti (in which s: source, t: target).


- Eliminate attributes: HOT(si) ↛ HOT(ti); MASSIVE(si) ↛ MASSIVE(ti); ...



Figure 1.5 shows that, because of the same s (subject) and o (object) structures, SMT considers the pairs (Planet, Sun) and (Electron, Nucleus) to be relationally similar, regardless of the fact that the source and target objects - Sun and Nucleus, Planet and Electron - are very different in attributes (HOT, MASSIVE, ...). With respect to the purpose of the thesis, if the query is ((Planet, Sun), (Electron, ?)), SMT will output the correct answer: "Nucleus".


Figure 1.5: Structure Mapping Theory (SMT)


However, SMT is not feasible with low-level structures (lacking relations). Therefore, SMT is not feasible for the problem of searching entities based on implicit semantic relations.


1.2.2. Relational similarity based on the WordNet classification system


Cao [20] and Agirre [21] proposed relational similarity measures based on the similarity classification system in WordNet. However, as mentioned above, WordNet does not contain named entities. Thus, WordNet is not suitable for the entity search model.


1.2.3. VSM - Vector Space Model


Using the vector space model, Turney [13] represents each entity pair (A, B) by a vector formed from the patterns containing the pair and the occurrence frequencies of those patterns. The VSM performs relational similarity measurement as follows: patterns are generated manually and submitted as queries to a Search Engine (SE), and the number of results returned from the SE is taken as the frequency of occurrence of each pattern. The relational similarity of two pairs of entities is then computed by the Cosine between the two frequency vectors.
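As a minimal illustration of this idea (not the thesis implementation), the sketch below compares two entity pairs through their pattern-frequency vectors with a Cosine measure; the pattern list and counts are assumed purely for the example.

import math

# Hypothetical frequencies (e.g., SE hit counts) of the same, manually
# generated patterns, instantiated for two entity pairs.
patterns = ["X is the capital of Y", "X, capital of Y", "Y moved its capital to X"]
freq_ab = [120, 45, 10]   # e.g., ("Hanoi", "Vietnam")
freq_cd = [95, 30, 12]    # e.g., ("Paris", "France")

def cosine(u, v):
    # Cosine similarity between two frequency vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

print(cosine(freq_ab, freq_cd))  # relational similarity of the two pairs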


1.2.4. LRA - Latent Relational Analysis


Extending the VSM, Turney combines it with LRA to determine the level of relational similarity [14-16]. Like the VSM, LRA uses a vector made up of the patterns/contexts containing the entity pair (A, B) and the frequencies of those patterns (patterns in n-gram format). At the same time, LRA applies a thesaurus to extend variants such as: A bought B, A acquired B; X headquarters in Y, X offices in Y, etc. LRA uses the most frequent n-grams to associate patterns with the entity pair (A, B), then builds a pattern - entity pair matrix, where each element of the matrix represents the frequency of the pair (A, B) in the pattern. In order to reduce the matrix dimension, LRA uses Singular Value Decomposition (SVD) to reduce the number of columns of the matrix. Finally, LRA applies a Cosine measure to define the relational similarity between two pairs of entities.


Although it is an effective approach for identifying relational similarity, LRA requires a long time to compute and process: LRA needs 8 days to answer the 374 SAT analogy questions [17]. This is infeasible for a real-time response system.


1.2.5. LRME - Latent Relation Mapping Engine



1.2.6. LSR - Latent Semantic Relation


Bollegala et al. [17], [18] and Kato [19] use the Distributional Hypothesis at the context level: in the corpus, if two different contexts pi and pj usually co-occur with the same entity pairs (wm, wn), they are semantically similar; and when pi, pj are semantically similar, the entity pairs (wm, wn) are relationally similar.


The Distributional Hypothesis requires entity pairs to always co-occur with contexts, and the clustering algorithm of Bollegala is applied at the context level rather than at the term level within sentences. A similarity measure based only on the distributional hypothesis, and not on term similarity, significantly affects the quality of the clustering technique, and thus the quality of the search system.



1.2.7. Word2Vec


The Word2Vec model, proposed by Mikolov et al. [22], is a learning model that represents each word as a vector (the input word is encoded as a one-hot vector); Word2Vec describes the relationship (probability) between a word and its context. The Word2Vec model has two simple neural network architectures: Continuous Bag-Of-Words (CBOW) and Skip-gram.


Applying Skip-gram, at each training step the Word2Vec model predicts the context words within a certain window. Assuming the input training word is "banking", with the sliding window skip = m = 2, the left context output is "turning into" and the right context output is "crises as".


Figure 1.6: Relationship between the target word and the context in the Word2Vec model.
In order to make this prediction, the objective function in Skip-gram is implemented to maximize probability. With a sequence of training words w1, w2, ..., wT, Skip-gram applies Maximum Likelihood:

J(\theta) = \frac{1}{T} \sum_{t=1}^{T} \; \sum_{-m \le j \le m,\, j \ne 0} \log p(w_{t+j} \mid w_t)

in which: T: the number of words in the data set; t: the index of the trained word; m: the window size (skip); \theta: the vector representations;


p(o \mid c) = \frac{\exp(u_o^{T} v_c)}{\sum_{w=1}^{W} \exp(u_w^{T} v_c)}

in which: W: the vocabulary; c: the trained word (input/center); o: an output (context) word of c; u: the representation vector of o; v: the representation vector of c;


In their experiments, Mikolov et al. [22-25] treat phrases as single words, eliminate frequently repeated words, and use the Negative Sampling loss function, randomly selecting n words for the calculation instead of the entire vocabulary, which makes the training algorithm faster than the softmax function above.


Figure 1.7: Word2Vec "learns" the "hidden" relationship between the target word and its context.


Vector operations such as vec("king") - vec("man") ≈ vec("queen") - vec("woman") show that the Word2Vec model suits a query of the form "A : B :: C : ?"; in other words, the Word2Vec model is quite close to the research direction of the thesis. The difference: the input of Word2Vec (following the Skip-gram model) is one word and the output is a context, whereas the input of the semantics-based IRS model is 3 entities (A : B :: C : ?) and the output is the entity to be found (D).
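As a rough sketch of the analogy operation above - assuming some pre-trained embedding table vec (a dict of word -> numpy vector), which is not part of the thesis - the "A : B :: C : ?" question can be answered by a nearest-neighbour search around the offset vector:

import numpy as np

def solve_analogy(vec, a, b, c, top_n=1):
    # Return the words d maximising cos(vec[b] - vec[a] + vec[c], vec[d]).
    target = vec[b] - vec[a] + vec[c]
    def cos(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
    scores = {w: cos(target, v) for w, v in vec.items() if w not in (a, b, c)}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# e.g. solve_analogy(vec, "man", "king", "woman") should ideally return ["queen"]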


Regarding semantics-based entity search, starting from the existing problems and moving toward an "artificial intelligence" capability in the search engine, the thesis researches and applies techniques that simulate a natural human ability: the ability to infer undetermined information/knowledge by analogical inference.


1.3. The context-aware query suggestion problem


For an SE, the ability to "understand" the search intent in a user's query is a challenge. The data set used for mining is the Query Log (QLogs). QLogs record past queries and the "interactions" of users with the search engine, so QLogs contain valuable information about query content, purpose, behavior, habits and preferences, as well as the user's implicit feedback on the result sets returned by the SE.


Mining the log data set is useful in many applications: query analysis, advertising, trend detection, personalization, query suggestion, etc. For query suggestion, traditional techniques such as Explicit Feedback [30-32], Implicit Feedback [33-36], User profile [37-39], Thesaurus [40-42], ... only give suggestions similar to the user's input queries.


Figure 1.12: Traditional suggestion with input query: "điện thoại di động" (mobile phone).


1.4. Query suggestions - Related work


With QLogs as the kernel, it can be said that query suggestion with traditional techniques follows two main approaches:


- Cluster-based techniques apply similarity measures to aggregate similar queries into clusters (groups).


- Session-based techniques, where a search session is a continuous sequence of queries.
1.4.1. Session-based query suggestion technique


a) Based on query co-occurrence or adjacency within sessions in QLogs: in a session-based approach, adjacent or co-occurring query pairs belonging to the same session act as the candidate list for query suggestion.


b) Based on the graph (Query Flow Graph - QFG): on a QFG, two queries qi, qj belonging to the same search intent (search mission) are represented by a directed edge from qi to qj. Each node on the graph corresponds to a query, and each edge on the graph is also considered a search behavior.
The general session structure in the QFG is represented as: QLog = <q, u, t, V, C>;


Boldi et al. [50], [51] use the simplified session structure QLog = <q, u, t> to perform query suggestion, following a series of steps:


- Construct the QFG graph, with the set of sessions in the Query Logs as input.


- Queries qi and qj are connected if there exists at least one session in which qi and qj occur consecutively.


- Calculate the weight w(qi, qj) on each edge:

w(q_i, q_j) = \begin{cases} \frac{f(q_i, q_j)}{f(q_i)}, & \text{if } (w(q_i, q_j) > \theta) \lor (q_i = s) \lor (q_i = t) \\ 0, & \text{otherwise} \end{cases}    (1.6)

in which: f(q_i, q_j): the number of times qj occurs immediately after qi in a session; f(q_i): the number of occurrences of qi in QLogs; \theta: a threshold; s, t: the two state nodes marking the start and end of the query chain in a session.


- Identify the chains that satisfy condition (1.6) to analyze the user's intent: when a new query is entered, based on the graph, the system gives query suggestions in decreasing order of edge weight.
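A minimal sketch of how such a graph could be assembled from session data (the session format, field handling, and threshold value are assumptions for illustration; this is not Boldi et al.'s implementation):

from collections import defaultdict

def build_qfg(sessions, theta=0.05):
    # sessions: list of query lists, each list in chronological order.
    pair_count = defaultdict(int)   # f(qi, qj): qj follows qi in some session
    query_count = defaultdict(int)  # f(qi)
    for session in sessions:
        for qi, qj in zip(session, session[1:]):
            pair_count[(qi, qj)] += 1
        for q in session:
            query_count[q] += 1
    # Core of (1.6): w(qi, qj) = f(qi, qj) / f(qi), kept only above the threshold.
    edges = {}
    for (qi, qj), f_ij in pair_count.items():
        w = f_ij / query_count[qi]
        if w > theta:
            edges[(qi, qj)] = w
    return edges

# edges = build_qfg([["who is joe biden", "how old is joe biden", "biden policies"]])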


1.4.2. Cluster-based query suggestion technique

- K-means;
- Hierarchical;
- DBSCAN;


Figure 1.9: Clustering methods [54].



query. At the same time, the query layer immediately after the current query often includes better queries (query
strings) that better reflect the search intent.


CHAPTER 2: SEARCH FOR ENTITIES BASED ON IMPLICIT SEMANTIC RELATIONS
2.1. Problem


In nature, there exist relationships between two entities, such as: Khue Van Cac - Temple of Literature; Stephen Hawking - Physicist; Shakyamuni - Mahayana; Apple - iPhone; ... In the real world, there are questions like: "Knowing that Fansipan is the highest mountain in Vietnam, which is the highest mountain in India?", "If Biden is the president-elect of the United States, who is the most powerful person in Sweden?", ...


With a keyword search engine, according to statistics, queries are often short, ambiguous, and polysemous [1-6]. Approximately 71% of web search queries contain names of entities [7], [8]. If the user enters entities like "Vietnam", "Hanoi", "France", the search engine only returns documents that contain the above keywords; it does not immediately answer "Paris". Because only entities are being looked for, query expansion and query rewriting techniques are not applicable to this type of implicit semantic relation in the entity pair. Therefore, a new search morphology is researched. The search query has the pattern (A, B), (C, ?), in which (A, B) is the source entity pair and (C, ?) is the target entity pair. At the same time, the two pairs (A, B), (C, ?) have a semantic similarity. In other words, when the user enters the query (A, B), (C, ?), the search engine has the duty of listing entities D such that each entity D satisfies the condition of having a semantic relation with C, and the pair (C, D) has a similarity relation with the pair (A, B). With an input consisting of only 3 entities, "Vietnam", "Hanoi", "France", the semantic relation "is the capital" is not indicated in the query.


2.2. Method for searching entities based on implicit semantic relations
2.2.1. Architecture - Modeling


The concept of searching for entities through implicit semantic relations is the most obvious distinction from keyword-based search engines. Figure 2.1 simulates a query consisting of only three entities: query = (Vietnam, Mekong), (China, ?).


By convention, write q = {(A, B), (C, ?)}, where (Vietnam, Mekong) is the source entity pair and (China, ?) is the target entity pair. The search engine is responsible for identifying the entity ("?") that has a semantic relation with the "China" entity, such that the entity pair (China, ?) is similarly related to the entity pair (Vietnam, Mekong). Note that the above query does not explicitly contain the semantic relation between the two entities. This is because semantic relations are expressed in various ways around the pair of entities (Vietnam, Mekong), such as "the longest river", "big river system", "the largest basin", etc.



Because the query consists of only three entities and does not include a semantic relation, the model is called the implicit semantic relation search model.


In case the IRS does not find A, B, or C, the keyword search engine is applied.


The morphology of searching for entities based on implicit semantic relations must determine the semantic relation between two entities and calculate the similarity of two entity pairs, and from that, give the answer for the unknown entity (the entity "?"). On a specific corpus, in general, Implicit Relational Search (IRS) consists of three main modules: the module extracting semantic relations from the corpus; the module clustering semantic relations; and the module calculating the similarity relation between two entity pairs. In practice, the IRS model consists of two phases: an online phase, meeting the real-time search requirement, and an offline phase, which processes, calculates, and stores semantic relations and similarity relations; the modules for extracting and clustering semantic relations belong to the offline phase of the IRS model.


Extracting module of the semantic relations: From the corpus, this module extracts patterns (root sentences containing entity pairs and context), as illustrated above: "A the longest river B", where A, B are 2 named entities. The obtained pattern set will consist of different patterns - similar patterns, or patterns of different lengths and terms but with the same semantic expression. For example: A is the largest river of B; A is the river of B with the largest basin; or B has the longest river, A; etc.


Clustering module of semantic relations: From the obtained pattern set, clustering is performed to identify clusters of contexts, where the contexts in the same cluster are semantically similar. A table of the pattern indexes and the corresponding entity pairs is built.


Calculating module of the similarity relation between two entity pairs: this module is in the online phase of the IRS model. Receiving the query q = (A, B), (C, ?), the IRS looks up the entity pair (A, B) and the corresponding set of semantic relations (contexts) in the index table. From the obtained set of semantic relations, it finds the entity pairs (C, Di) associated with these relations. It then applies the Cosine measure to calculate and rank the similarity between (A, B) and (C, Di), and gives a ranked list of entities Di to answer the query.


Considering q = {(Vietnam, Mekong), (China, ?)}, the IRS finds a cluster containing the entity pair (Vietnam, Mekong) and the corresponding semantic relation "longest river" (from the original sentence: "The Mekong is the longest river in Vietnam"). This cluster also contains a similar semantic relation, "the largest river", which is associated with the entity pair (China, Changjiang) (from the original sentence: "The Changjiang is the biggest river in China"). The IRS puts "Changjiang" (Truong Giang) into the candidate list, ranks the semantic relations according to the similarity measure, and returns the results to the searcher.


Figure 2.2: General structure of IRS (corpus, entity pairs and contexts).



2.2.2. Extracting module of the semantic relations


Receiving the input query q = (A, B), (C, ?), the general structure of the IRS is modeled as follows:


- The Filter-Entities function (Fe) filters/seeks the candidate set S containing entity pairs (C, Di) that are related to the input entity pair (A, B):

Fe(q, D) = Fe(\{(A, B), (C, ?)\}, D) = \begin{cases} 1, & \text{if } Relevant(q, D) \\ 0, & \text{else} \end{cases}    (2.1)


- The Rank-Entities function (Re) ranks the entities Di, Dj in the candidate set S by RelSim (Relational Similarity); whichever has the higher RelSim gets the smaller rank value (i.e., is placed closer to the top, at a higher rank):

\forall D_i, D_j \in S: \ RelSim((A, B), (C, D_i)) > RelSim((A, B), (C, D_j)) \Rightarrow Re(q, D_i) < Re(q, D_j)    (2.2)


2.2.3. Clustering module of semantic relations


The clustering process groups "similar" elements into a cluster. In the semantic entity search model, the elements in a cluster are semantically similar context sentences. Similarity is a quantity used to compare two or more elements with each other, reflecting the correlation between two elements. Therefore, the thesis reviews measures of term similarity, vector-space similarity, and semantic similarity between two contexts.
a) Measurements of the similarity between 2 contexts
- Term similarity


Jaro function (Jaro-Winkler). The Jaro distance calculates the similarity between 2 strings a, b:

Sim_{Jaro}(a, b) = \begin{cases} 0, & \text{if } m = 0 \\ \frac{1}{3}\left(\frac{m}{|a|} + \frac{m}{|b|} + \frac{m - skip}{m}\right), & \text{otherwise} \end{cases}    (2.6)


Contrast model: As proposed by Tversky ("Features of similarity", Psychological Review, 1977), a contrast model is applied to calculate the similarity between two sentences a, b:

Sim(a, b) = \alpha \cdot f(a \cap b) - \beta \cdot f(a - b) - \gamma \cdot f(b - a)    (2.8)

Jaccard distance:  Sim(a, b) = \frac{|a \cap b|}{|a \cup b|}    (2.9)

- Similarity based on vector space: Euclidean, Manhattan, Cosine.


- Semantic similarity


According to the Distributional Hypothesis [17], words occurring in the same contexts tend to be semantically similar. Because Vietnamese does not have a Vietnamese WordNet system for calculating the semantic similarity between 2 terms, the thesis uses the PMI correlation to measure and evaluate the semantic similarity between two sentences (contexts).


PMI (Pointwise Mutual Information) method, proposed by Church and Hanks [1990]: based on the co-occurrence probability of 2 terms t1, t2 in the corpus, PMI(t1, t2) is calculated by the formula:

PMI(t_1, t_2) = \log_2 \frac{P(t_1, t_2)}{P(t_1) \cdot P(t_2)}    (2.16)
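A minimal sketch of PMI estimated from co-occurrence counts (the corpus statistics below are hypothetical, chosen only to make the formula concrete):

import math

def pmi(count_t1, count_t2, count_t1_t2, total):
    # PMI(t1, t2) = log2( P(t1, t2) / (P(t1) * P(t2)) ), with probabilities
    # estimated as counts over the total number of observations in the corpus.
    p1 = count_t1 / total
    p2 = count_t2 / total
    p12 = count_t1_t2 / total
    return math.log2(p12 / (p1 * p2)) if p12 > 0 else float("-inf")

# e.g. pmi(count_t1=500, count_t2=300, count_t1_t2=120, total=1_000_000) ≈ 9.64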


b) Clustering module of semantic relations


- Applying PMI to improve the similarity measure according to the Distributional Hypothesis:

Sim_{DH}(p, q) = Cosine(PMI(p, q)) = \frac{\sum_i \big( PMI(w_i, p) \cdot PMI(w_i, q) \big)}{\lVert PMI(w_i, p) \rVert \, \lVert PMI(w_i, q) \rVert}    (2.25)

- The similarity by terms of 2 contexts p, q:

Sim_{term}(p, q) = \frac{\sum_{i=1}^{n} \big( weight_i(p) \cdot weight_i(q) \big)}{\lVert weight(p) \rVert \, \lVert weight(q) \rVert}    (2.26)

- The combined similarity measure:

Sim(p, q) = \max\big( Sim_{DH}(p, q), Sim_{term}(p, q) \big)    (2.27)
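A compact sketch of the combined measure (2.25)-(2.27), assuming the PMI vectors and term-weight vectors of the two contexts are already available as dictionaries keyed by word; this illustrates the formulas, it is not the thesis source code:

import math

def cosine(u, v):
    # Cosine over sparse vectors represented as {key: weight} dicts.
    dot = sum(u[k] * v[k] for k in u if k in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def combined_sim(pmi_p, pmi_q, term_p, term_q):
    # Sim(p, q) = max(SimDH(p, q), Simterm(p, q)), formula (2.27).
    sim_dh = cosine(pmi_p, pmi_q)      # (2.25): cosine of the PMI vectors
    sim_term = cosine(term_p, term_q)  # (2.26): cosine of the term-weight vectors
    return max(sim_dh, sim_term)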
c) Clustering algorithm:


- Input: Set P = {p1, p2, ..., pn}; cluster threshold θ1; heuristic threshold θ2; Dmax: cluster diameter; Sim_cp: result of the combined similarity measurement function, applying formula (2.27).


- Output: Set of clusters Cset (ClusterID; context; weight of each context; respective entity pairs).

Program Clustering_algorithm
01. Cset = {}; iCount = 0;
02. for each context pi ∈ P do
03.     Dmax = 0; c* = NULL;
04.     for each cluster cj ∈ Cset do
05.         Sim_cp = Sim(pi, Centroid(cj))        // combined similarity (2.27)
06.         if (Sim_cp > Dmax) then
07.             Dmax = Sim_cp; c* ← cj;           // remember the closest cluster
08.         end if
09.     end for
10.     if (Dmax > θ1) then
11.         c*.append(pi)                         // pi joins the closest cluster
12.     else
13.         Cset ∪= new cluster{pi}               // otherwise pi seeds a new cluster
14.     end if
15.     iCount++;                                 // heuristic stop after θ2 contexts
16.     if (iCount > θ2) then
17.         exit Current_Proc_Cluster_Alg();
18.     end if
19. end for
20. Return Cset;
@Call Merge_Cset_from_OtherNodes()


2.2.4. Module calculating the relational similarity between two pairs of entities


The module calculating the relational similarity between two pairs of entities performs two tasks: filtering (searching) and ranking. As illustrated in 3.1, for the input query q = (A, B), (C, ?), through the inverted index the IRS executes the Filter-Entities function Fe to filter (search) out the candidate set containing entity pairs (C, Di) and the corresponding contexts, such that (C, Di) is similar to (A, B). Then it executes the Rank-Entities function Re to rank the entities Di, Dj within the candidate set according to the RelSim measure (Relational Similarity), which finally results in the ranked list {Di}.


Filter-Entities algorithm: Filter to find the candidate set containing the answer:
Input: Query q = (A, B), (C, ?)


Output: Candidate set S (includes entities Di and the corresponding contexts);


Program Filter_Entities
01. S = {};
03. for each context pi ∈ P(w) do
04.     W(p) = Context(pi).EntPairs();
05.     if (W(p) contains (C:Di)) then S ∪= W(p);
06. end for
07. return S


After executing Filter-Entities, a subset of entities Di and the corresponding contexts are obtained. RelSim processes and calculates only on this very small subset. In addition, RelSim uses the threshold α to eliminate entities Di with low RelSim values.


With Fe(q, D) = Fe({(A, B), (C, ?)}, D):

Fe(q, D_i) = \begin{cases} 1, & \text{if } RelSim((A, B), (C, D_i)) > \alpha \\ 0, & \text{else} \end{cases}    (2.29)


Rank-Entities function: the Rank-Entities algorithm is responsible for calculating RelSim:
Input: Candidate set S and:


- Source entity pair (A, B), denoted s; candidate entity pairs (C, Di), denoted c;
- Contexts corresponding to s, c; the resulting cluster set Cset;


- Known entities A, B, C → the corresponding clusters containing A, B, C are identified;
- Threshold α (to compare against the RelSim value); the threshold α is set empirically during testing;
- Initialize the dot product (β) and the used-context set (γ);


Output: List of answers (ranked entity list) Di;
Denotations:


- P(s), P(c): given in formulas (2.19), (2.20);
- f(s, pi), f(c, pi), ɸ(s), ɸ(c): given in (2.21), (2.22);


- γ: variable (set of contexts) keeping the already-considered contexts;
- q: temporary/intermediate variable (a context);


- Ω: a cluster;
Program Rank_Entities
01. for each context pi ∈ P(c) do
02.     if (pi ∈ P(s)) then
03.         β ← β + f(s, pi)·f(c, pi)
04.         γ ← γ ∪ {pi}
05.     else
06.         Ω ← the cluster containing pi
07.         max_co-occurs = 0;
08.         q ← NULL;
09.         for each context pj ∈ (P(s) \ P(c) \ γ) do
10.             if (pj ∈ Ω) & (f(s, pj) > max_co-occurs) then
11.                 max_co-occurs ← f(s, pj);
12.                 q ← pj;
13.             end if
14.         end for
15.         if (max_co-occurs > 0) then
16.             β ← β + f(s, q)·f(c, pi)
17.             γ ← γ ∪ {q}
18.         end if
19.     end if
20. end for
21. RelSim ← β / L2-norm(ɸ(s), ɸ(c))
22. if (RelSim ≥ α) then return RelSim


Algorithm interpretation: In the case where the source and target entity pairs have the same semantic relationship (sharing the same context, statements 01-02), i.e. pi ∈ P(s) ∩ P(c), the dot product is accumulated as in a modified version of the standard Cosine similarity formula.


In the case where pi ∈ P(c) but pi ∉ P(s), the algorithm finds a context pj (the temporary variable q, line 12) such that pi, pj belong to the same cluster. The loop body (statements 10-13) chooses the context pj with the largest frequency of co-occurrence with s. Under the Distributional Hypothesis, the more entity pairs two contexts pi, pj co-occur with, the higher the Cosine similarity between the two vectors; and the higher the cosine value, the more similar pi and pj are. Therefore, the pair (C, Di) is more accurate and more semantically consistent with the source entity pair (A, B).


Statements 15-18 accumulate the dot product. Statements 21-22 calculate the RelSim value. From the set of RelSim values, whichever entity Di has the higher RelSim receives the smaller rank value (closer to the top, i.e. a higher rank). Finally, the result set {Di} is the answer list for the query that the end-user wants to find.


2.3. Experiment Results - Evaluation
2.3.1. Dataset


The dataset is built from the empirical sample dataset, based on four named-entity subclasses: PER, ORG, LOC, and TIME.


2.3.2. Test - Parameter adjustment

To evaluate the effectiveness of the clustering and the Rank_Entities ranking algorithm, Chapter 2 varies the values of θ1 and α, then calculates the Precision, Recall, and F-Score measures corresponding to each value of α, θ1. Figure 2.3 shows that at α = 0.5, θ1 = 0.4, the F-Score has the highest value.


Figure 2.3: F-Score value corresponding to each changed value of α, θ1.


Line 22 of the Rank_Entities algorithm (if (RelSim ≥ α) then return RelSim) also shows that when α is small, the number of candidates increases, noise may appear, and real-time processing becomes more time-consuming because the system processes many candidate queries. Conversely, when α is large, the Recall value is small, causing the F-Score to decrease significantly.


2.3.3. Evaluation with MRR (Mean Reciprocal Rank)


For a query set Q, if the rank of the first correct answer for query q ∈ Q is r_q, then the MRR measure of Q is calculated as:

MRR(Q) = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{r_q}
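A minimal sketch of the MRR computation (ranks are 1-based; queries with no correct answer are assumed here to contribute 0 to the sum):

def mean_reciprocal_rank(first_correct_ranks):
    # first_correct_ranks: list of r_q values, one per query (None if no correct answer).
    total = sum(1.0 / r for r in first_correct_ranks if r)
    return total / len(first_correct_ranks)

# e.g. mean_reciprocal_rank([1, 2, None, 4]) = (1 + 0.5 + 0 + 0.25) / 4 = 0.4375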



With the 4 entity subclasses PER, ORG, LOC, and TIME, the method based on co-occurrence frequency (f) reaches an average MRR ≈ 0.69, while the PMI-based method reaches 0.86. This shows that PMI improves the accuracy of semantic similarity better than the raw co-occurrence frequency of context-entity pairs.


Figure 2.4: Comparison of PMI with f (co-occurrence frequency) based on MRR.
2.3.4. Experimental system


The dataset was downloaded from Viwiki (7,877 files) and Vn-news (35,440 files). The sources Viwiki and Vn-news were selected because these datasets contain samples of named entities. After reading and extracting the file contents and separating paragraphs and sentences (main sentences, sub-sentences), 1,572,616 sentences are obtained.



The general NER (Named Entity Recognition) labels include: PER: person name; ORG: organization name; LOC: place name; TIME: time expression; NUM: number; CUR: currency; PCT: percentage; MISC: other entity type; O: not an entity.


By using the context extraction algorithm and storing the result in the database, after performing the processing steps and restriction conditions, the database retains 404,507 context sentences. From this set of contexts, the clustering algorithm for semantic relations produces 124,805 clusters.


Figure 2.5: IRS experiment with the B-PER entity label


To evaluate the accuracy, 500 test queries were performed experimentally; the results showed an accuracy of about 92%.


Table 2.3: Examples of experimental results with input q = {A, B, C} and output D


..  Germany            Angela Merkel    Israel       Benjamin Netanyahu
..  Harry Kane         Tottenham        Messi        Barca
..  Hoàng Công Lương   Hòa Bình         Thiên Sơn    RO


2.4. Conclusion


The ability to infer undetermined information/knowledge by analogical inference is one of the natural abilities of humans. Chapter 2 presents an Implicit Relational entity Search model (IRS) that simulates this ability. The IRS model searches for information/knowledge in an unfamiliar domain, without requiring keywords in advance, by using a similar example (a similarity relation) from a familiar domain. The main contribution of Chapter 2 is building an entity search technique based on hidden (implicit) semantic relations, using a clustering method to improve search efficiency. At the same time, the thesis proposes a combined similarity measure - based on terms and on the distributional hypothesis - and applies a heuristic to the clustering algorithm to improve cluster quality.


CHAPTER 3: CONTEXT-AWARE QUERY SUGGESTION
3.1. Problem


In the field of query suggestion, traditional approaches such as session-based and document-click-based mine the Query Logs to generate suggestions. The approach of "context-aware query suggestion by mining session data and click-through documents" (called the "context-aware approach" for short, by Huanhuan Cao et al. [9], [10]) is a new approach: it considers the queries entered immediately before the query just typed (the current query) as a search context, in order to "capture" the user's search intent and provide more valuable, more accurate suggestions. Obviously, the preceding layer of queries has a semantic relationship with the current query. Next, the queries that immediately follow the current query are mined as the suggestion list. This method makes use of the "knowledge" of the community, because the query layer immediately following the current query reflects the problems that users often ask about after the current query.


The main contributions of chapter 3 include:


1) Applying context-aware techniques to build a vertical search engine on its own knowledge-base domain (aviation data).


2) Proposing a combined similarity measure for the context-aware query suggestion problem to improve the quality of suggestions.


In addition, chapter 3 has further experimental contributions: i) integrating Vietnamese speech recognition and synthesis as an option into the search engine to create a voice-search system with speech interaction; ii) applying the Concept-lattice structure to classify the returned result set.


3.2. Context-aware method


3.2.1. Definitions - Terminology


- Search session: a continuous sequence of queries, represented in chronological order. Each session corresponds to one user.


- General session structure: {sessionID; queryText; queryTime; URL_clicked}.


- Context: the adjacent query sequence before the current query. In a user's search session, the context is the query sequence immediately preceding the query just entered.
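A small sketch of these notions on an assumed session record layout (the field names follow the general structure above; the sample queries are invented):

# One session: the queries of a single user in chronological order.
session = {
    "sessionID": "s01",
    "queries": [
        {"queryText": "hanoi hue flight", "queryTime": "09:14", "URL_clicked": ["url1"]},
        {"queryText": "hanoi hue flight price", "queryTime": "09:15", "URL_clicked": []},
        {"queryText": "vietnam airlines baggage", "queryTime": "09:18", "URL_clicked": ["url2"]},
    ],
}

def context_of(session, current_index):
    # The context is the query sequence immediately preceding the current query.
    return [q["queryText"] for q in session["queries"][:current_index]]

# context_of(session, 2) -> ["hanoi hue flight", "hanoi hue flight price"]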



3.2.2. Architecture - Modeling

The idea of the context-aware method is based on two phases, online and offline. During a search session (online phase), the context-aware system receives the current query and treats the query sequence preceding the current query as a context. More precisely, this query sequence is mapped to a concept sequence, and this concept sequence expresses the user's search intention.

Figure 3.4: Context-aware query suggestion model.


When a search context is obtained, the system matches it against the pre-built context set (in the offline phase, this context set is pre-computed from the past query set - the Query Logs; regarding data structure and storage, the pre-built context set is stored on a suffix tree). A maximal match provides a candidate list: a list of the issues that most users often ask about after the queries they have already entered. After the ranking step, the candidate list becomes the suggestion list.


3.2.4. Offline phase - Clustering algorithm


The idea of the clustering algorithm: the algorithm scans all queries in the Query Logs once, and the clusters are generated during the scan. Each cluster is initially seeded with one query and then expanded gradually with similar queries. The expansion process stops when the cluster diameter exceeds the threshold Dmax. Because each cluster is seen as a concept, the cluster set is the concept set.

Input: Query Logs Q, threshold Dmax;


Output: Set of clusters: Cset;


program Context_Aware_Clustering_alg

// Initialize the array dim_array[d] = Ø, ∀d (d: a clicked document).
// dim_array keeps, for each dimension d, the clusters touching that dimension.

01. for each query qi ∈ Q do
02.     θ = Ø;
03.     for each nonZeroDimension d of vector qi do
04.         θ ∪= dim_array[d];
05.     end for
06.     C = arg min_{C' ∈ θ} distance(qi, C');
07.     if (diameter(C ∪ {qi}) ≤ Dmax) then
08.         C ∪= {qi}; update the diameter and centroid of cluster C;
09.     else
10.         C = new cluster({qi}); Cset ∪= C;
11.     end if
12.     for each nonZeroDimension d of vector qi do
13.         dim_array[d] ∪= {C};
14.     end for
15. end for
16. return Cset;
3.3.6. Analysis of pros and cons
Advantages:


- The context-aware approach is a novel one. When performing query suggestion, almost all traditional approaches take classical queries that already exist in the Query Logs as proposals. Such approaches can only propose queries similar or related to the current query, rather than giving the trend of what the community often asks after the current query. Likewise, no earlier approach places the queries preceding the current query into a search context, as a seamless expression of the user's intention. The context-aware technique, above all, is the idea of suggesting the problems that users often ask about after the current query, which is a unique, efficient, and "smart" focus in the field of query suggestion.


Disadvantages:


- When the user enters the first query, or some of the queries are new (new compared to past queries), or even not new but their meaning is not present in the frequent concept sequences (for example, in the data set with the two concept sequences c2c3 and c1c2c3, the algorithm determines the frequent sequence to be c2c3; in this case the user enters c1), the context-aware approach does not generate a suggestion even though c1 occurred in the past (it is already in QLogs).


- Each cluster (each concept) consists of a group of similar queries. The similarity measure is based only on URL clicks, without considering term similarity, which can significantly affect the quality of the clustering technique.


- The constraint that each query belongs to exactly one cluster (concept) is not reasonable and is unnatural for a polysemous query like "tiger" or "gladiator", or many other polysemous words in Vietnamese, etc.


- Besides, only query suggestion is considered, without URL recommendation or document suggestion. Likewise, the approach is "click-through" oriented but does not use the clicked-URL information in the search context (when searching on the suffix tree, the input concept sequence consists of queries only).
- On the bipartite graph, over the vertex set Q, the vectors are quite sparse (low dimensionality), and the set of clicked URLs also suffers from sparse data (URL-click sparsity); when the vectors are sparse, the quality of clustering is affected.


- In the clustering algorithm, when the Query Logs are large, or the number of dimensions of each vector is large, the dim_array[d] array becomes very large, requiring a large amount of memory to execute.

In fact, in any one search session, the user enters one or more queries; likewise, the user may click on none or many of the resulting URLs, among which there are unintended default URLs that are seen as noise. The context-aware method requires a series of consecutive queries to form a context, which does not reflect reality when users enter only one query. However, depending only on clicked URLs and not taking term similarity into account is the biggest disadvantage of this approach.



In terms of query suggestion, although it shares the same philosophy as the context-aware work in [9], [10] - "provide the suggestions that the majority of users often ask after the current query" - the approach, implementation, formulas, details of data structures, design, algorithms, source code, etc. in our search engine are completely different. When mining the Query Logs, the clustering step in our application does not rely simply on click-through; it focuses on three fixed and certain components: the query; the top N results; and the set of clicked URLs. These are the three most important components of the data mining tasks in the context-aware search engine, with the premises:


- If the intersection of the keyword (term) sets of two queries reaches a certain rate, the two queries are considered similar.


- If the intersection of the top N results of two queries reaches a certain rate, the two queries are considered similar.


- If the intersection of the sets of clicked URLs of two queries reaches a certain rate, the two queries are considered similar.


Considering the above premises, combined with thresholds drawn from experiments, to ensure an exact similarity measurement the thesis uses the following formulas:


- Similarity according to the keywords of 2 queries p, q:

Sim_{keywords}(p, q) = \frac{\sum_{i=1}^{n} \big( w(k_i(p)) + w(k_i(q)) \big)}{2 \times \max(k_n(p), k_n(q))}    (3.9)


In the above formula:


- kn(.): the total weight of the terms in p, in q;
- w(ki(.)): the weight of the i-th common term of p and q;


- The similarity over the top 50 URL results of 2 queries p, q:

Sim_{top50URL}(p, q) = \frac{\wedge(topU_p, topU_q)}{2 \times \max(k_n(p), k_n(q))}    (3.10)

denote: ∧(topUp, topUq): the intersection of the top50URL results of p and q;


- The similarity of two queries p, q by clicked URLs:

Sim_{URLsClicked}(p, q) = \frac{\wedge(U\_click_p, U\_click_q)}{2 \times \max(k_n(p), k_n(q))}    (3.11)

denote: ∧(U_click_p, U_click_q): the common clicked URLs of p and q; U_click_p: the number of URLs clicked for one query;


- From (3.9), (3.10), (3.11), the thesis proposes the equation to calculate the combination similarity:

Sim_{combination}(p, q) = \alpha \cdot Sim_{keywords}(p, q) + \beta \cdot Sim_{top50URL}(p, q) + \gamma \cdot Sim_{URLsClicked}(p, q)    (3.12)

with α + β + γ = 1; α, β, γ are threshold parameters drawn from experiments. In the search application, α = 0.4; β = 0.4; γ = 0.2.
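An illustrative sketch of formula (3.12) with the weights used in the search application; the three component similarities are assumed to have been computed per (3.9)-(3.11):

def combination_similarity(sim_keywords, sim_top50url, sim_urls_clicked,
                           alpha=0.4, beta=0.4, gamma=0.2):
    # Sim_combination(p, q) per (3.12); alpha + beta + gamma = 1.
    return alpha * sim_keywords + beta * sim_top50url + gamma * sim_urls_clicked

# e.g. combination_similarity(0.8, 0.5, 0.3) = 0.4*0.8 + 0.4*0.5 + 0.2*0.3 = 0.58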


3.2.8. Technique for classifying search results based on the Concept Lattice




attribute, from which the Concept-lattice is constructed [88]. In information search, FCA treats the object-attribute correlation like the document-term correlation. In the process of setting up the lattice, FCA defines each node in the lattice as a concept. The algorithm for constructing the Concept-lattice installs a couple at each node, consisting of a set of documents with common terms and the set of terms that co-occur in those documents. By extension, each concept (each node in the lattice) can be viewed as a question/answer pair. In the lattice, browsing up or down the nodes allows approaching more general or more detailed concepts, respectively. A more general concept corresponds to a larger number of resulting documents, and vice versa.


With the test set (aviation domain), the thesis uses the concept-lattice to classify the returned results according to predetermined topics.



a) Illustrative problem


To visualize the concept-lattice, the thesis presents a natural example that illustrates the concept set (a set of objects, a set of attributes). From the context table, the concept-lattice is formed visually by the Hasse diagram:


Table 3.2: Context table 1


Figure 3.10: Constructing the Concept-lattice from context table 1.
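To make the object-attribute context concrete, here is a small, naive sketch that enumerates the formal concepts of a toy context by closing attribute subsets; the toy documents and terms are invented, and this brute-force enumeration is only for illustration (the thesis builds the lattice incrementally with AddIntent):

from itertools import combinations

# Toy formal context: objects (documents) -> set of attributes (terms).
context = {
    "d1": {"airline", "baggage"},
    "d2": {"airline", "ticket"},
    "d3": {"airline", "baggage", "ticket"},
}
attributes = set().union(*context.values())

def extent(attrs):
    # Objects having all the given attributes.
    return {o for o, a in context.items() if attrs <= a}

def intent(objs):
    # Attributes shared by all the given objects.
    return set.intersection(*(context[o] for o in objs)) if objs else set(attributes)

concepts = set()
for r in range(len(attributes) + 1):
    for attrs in combinations(sorted(attributes), r):
        ext = extent(set(attrs))
        concepts.add((frozenset(ext), frozenset(intent(ext))))

for ext, intn in sorted(concepts, key=lambda c: -len(c[0])):
    print(sorted(ext), sorted(intn))   # each line is one (extent, intent) concept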



It can be seen that the ranking of the algorithm is based on the characteristics of the lattice structure: a parent concept contains more objects (documents), a sub-concept contains more terms. In the output, the first results contain all the terms to be found, and the following results contain a decreasing portion of the terms in the query. Figure 3.11 illustrates finding and classifying the returned result set on a lattice structure [92]. Regarding the classification of search results, during the process of browsing the lattice, the upper-cover set (the variable upper_cover of the Locate_Pivot function) contains labels, which describe the topics of the layers.


Figure 3.11: Search and classify results with the query "jaguar"

b) Creating and browsing the lattice


Creating the lattice: With the AddIntent algorithm, the running time in the best case is evaluated as O(|L|·|G|²·max(|g'|)), where L is the concept-lattice, G is the set of objects of L, and max(|g'|) is the greatest number of attributes of a concept in L.


It can be said that, among lattice-creation algorithms, AddIntent is a technique that adds concepts to the lattice gradually; in other words, assuming the data set has N documents, AddIntent adds documents 1, 2, ..., i, i+1, ..., N gradually, producing lattices Li, Li+1, ..., L. AddIntent has the main pseudo-code CreateLatticeIncrementally.


Algorithm interpretation:


CreateLatticeIncrementally procedure


01: CreateLatticeIncrementally(G, M, I)
02:     BottomConcept := (Ø, M)
03:     L := {BottomConcept}
04:     For each g in G
05:         ObjConcept = AddIntent(g', BottomConcept, L)
06:         Add g to the extent of ObjConcept and all concepts above
07:     End For


The CreateLatticeIncrementally procedure (G, M, I) receives the whole data set (the set of objects G consists of the files, the set of attributes M consists of the terms in the files, and the correlation I relates G and M). AddIntent works in a bottom-up direction, starting from (Ø, M); that is, the BottomConcept contains all terms of lattice L (row 02). The process starts by placing the BottomConcept at the bottom of the lattice (row 03).


For each object g of the set of objects G (for each file of the set of files), the procedure calls AddIntent to gradually add concepts to the concept-lattice, passing three parameters to AddIntent: g' (the intent, i.e. the set of terms of a file), the BottomConcept, and the lattice L (rows 04, 05). In the procedure body, AddIntent creates the concept (and its connections to other concepts), and the For...End For loop of the procedure takes each newly created concept in turn to update its extent (row 06). When the procedure finishes, the lattice has been created.




analyze the query, find the formal concepts (terms), browse the lattice, and match them with the concepts of the lattice. The core of browsing the lattice is in fact the AddIntent function; it can be said that AddIntent is the "backbone" function of the two processes of creating the lattice and searching on the lattice. The idea of the browse-and-search algorithm on the lattice is as follows (BR-Explorer [95]): use the AddIntent function to bring the query (its intent) into the lattice (so as to satisfy the order relation ≤), then find the pivot concept (Locate_Pivot) corresponding to the intent of the query. Finally, the result set, consisting of the documents in the pivot concept and the documents in the parent concepts of the pivot concept, is the set of results to be found. The found results are ranked: the first results contain all the terms to be found, and the next results contain a decreasing portion of the terms in the search request. In the BR-Explorer algorithm, two pseudo-code sections play the main role: Locate_Pivot and BR-Explorer.


In the BR-Explorer algorithm, the Locate_Pivot pseudo-code snippet determines the upper cover (the topic set).


Algorithm interpretation: For a query Q entered by the user, the Locate_Pivot function will:


- Return a concept (a concept in the upper-cover set) whose intent (set of terms) contains or equals the set of terms of the query Q (rows 04-13);
- Otherwise, return the BottomConcept.


During the browsing process, the variable upper_cover of the Locate_Pivot function contains labels, which describe the topics of the subclasses of the search result set.


Locate_Pivot function
01: found := false
// ⊥ is the BottomConcept in B(Gq, Mq, Iq)
02: SUBS := {⊥}
03: while !found do
04:     for each C = (A, B) ∈ SUBS do
05:         if x' = B then
06:             Pivot P := C; found := true
07:             break
08:         else if x' ⊂ B then
09:             SUBS := upper-cover(SUBS)
10:             break
11:         end if
12:     end for
13: end while
14: return P


c) Comments


The ranking of the algorithm is based on the characteristics of the lattice structure: a parent concept contains more objects (documents), a sub-concept contains more attributes (terms). The results found are ranked: the first results contain all the search terms, and the following results contain a decreasing portion of the terms in the search request.


The thesis is set up empirically on sample data consisting of documents about flights [98], [99]. Figure 3.12 illustrates browsing and searching the lattice for the query "Which airline flies to US, Europe, Canada, Mexico, the Caribbean?". In the context-aware search engine experiment (Section 3.4), the procedures for creating and browsing the lattice are applied to classify the returned result set.

Advantages: The Concept-lattice is suitable for clustering (by topic) and classifying concepts. Thanks to the parent-child conceptual relationship of the order relation ≺, the searcher can exploit information at the neighboring nodes of the lattice without wasting time searching the whole large text database again.


Disadvantages: When a query is taken into the lattice, AddIntent must be called. AddIntent is recursive, resulting in a significant increase in search time. The BR-Explorer function has drawbacks in running time: the inner function calls other functions (to spread the calculation over the lattice) and must be recursive (when adding queries to the lattice via AddIntent).


3.4. Experimental Results - Evaluation


Figure 3.13: Context-aware search model integrated with Vietnamese voice interaction.

The specialized search application differs from a general SE in several points: domain-dependent input data, its own Query Logs, context-aware query suggestion, and grouping of the returned results, forming a search engine different from general search engines. The addition of voice recognition to the SE creates a context-aware search engine with voice interaction [34], [76].

Figure 3.17: Context-aware Search Engine.


<b>Evaluation and comparison:</b> To evaluate the effectiveness of the context-aware method, the thesis establishes a comparison table between the context-aware SE and Lucene (Nutch), and compares the query suggestion technique with the two baseline methods, Adjacency and N-Gram. The comparison is based on:
 Relevance (quality), and
 Diversity (coverage) of the query suggestion set.



<i>Table 3.3: Comparison table of Context-aware search and Lucene-Nutch</i>
Sample dataset: combined with the data set.

| Criteria | Lucene-Nutch | Context-aware SE |
| Time for searching | milliseconds | milliseconds |
| Ranking | Have | Have |
| Practicality | Common | Applied with VNA (WAN) |
| The ability to quickly suggest | Not | Have |
| Classify the returning result set | Not | Have |
| Query suggestion | Not | Have |
<b>Comparison criteria</b>

The quality measurement reflects how accurately suggestions match the information need, helping users find what they care about. Coverage reflects diversity, covering many different search aspects. To perform the assessment, the thesis compares the context-aware suggestion technique with the two baseline methods: Adjacency and N-Gram.



The Adjacency method receives the query string q1, q2, .., qi; over all search sessions, Adjacency ranks candidate queries by the frequency with which they appear immediately after the query qi, and outputs the topN (N = 5) queries with the highest frequency as the suggestion list.


Similarly, N-Gram receives the input query sequence qs = q1, q2, .., qi. Over the search sessions, N-Gram ranks candidate queries by the frequency with which they appear right after the whole sequence qs, returning the topN queries with the highest frequency as the suggestion list.
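For concreteness, a minimal sketch of the two baselines is given below, assuming the query log is a list of search sessions, each session being an ordered list of query strings; adjacency_suggest and ngram_suggest are illustrative names, not the original implementations.

```python
from collections import Counter

# Sketch only: the two baseline suggesters over a log of search sessions,
# where each session is assumed to be an ordered list of query strings.

def adjacency_suggest(sessions, q_i, top_n=5):
    """Rank candidates by how often they appear immediately after q_i."""
    counts = Counter()
    for session in sessions:
        for prev_q, next_q in zip(session, session[1:]):
            if prev_q == q_i:
                counts[next_q] += 1
    return [q for q, _ in counts.most_common(top_n)]

def ngram_suggest(sessions, qs, top_n=5):
    """Rank candidates by how often they appear right after the whole prefix qs."""
    qs, counts = list(qs), Counter()
    for session in sessions:
        for start in range(len(session) - len(qs)):
            if session[start:start + len(qs)] == qs:
                counts[session[start + len(qs)]] += 1
    return [q for q, _ in counts.most_common(top_n)]
```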


<i>Figure 3.14: (a): Diversity measurement; (b): Relevance measurement. </i>


Coverage is measured as the ratio of the number of test cases for which a method is able to issue query suggestions to the total number of test cases. Figure 3.14(a) illustrates the coverage results of the three methods. As expected, when receiving a test case qs = q1, q2, .., qi, the N-Gram method can only give a suggestion list if the search-session training data contains a sequence of the form qs1 = q1, q2, .., qi, qi+1, .., qj. Obviously, the Adjacency method has a superior diversity ratio compared to the N-Gram method, because it only requires sequences of the form qs2 = .., qi, qi+1, .., qj to appear in the training data; in other words, qs1 is a special case of qs2. However, considering the time order within a search session, the N-Gram method has the advantage that, when suggesting, it suggests in sequence (the whole series of suggestions).



Compared with both N-Gram and Adjacency, in the case where neither qs1 nor qs2 is present in the training data, the context-aware method proves its effectiveness: it only requires a query string of the form qs2′ = .., qi′, qi+1, .., qj, where qi and qi′ are similar (belong to the same cluster), to still provide a suggestion list.
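This cluster-level matching can be illustrated by the following simplified sketch (not the thesis implementation); cluster_of is an assumed mapping from a query to its cluster label, so a prefix ending in qi′ still matches sessions that contain qi whenever the two queries share a cluster.

```python
from collections import Counter

# Simplified illustration: match the session prefix at the cluster level, so a
# prefix .., q_i', q_(i+1), .. is still covered when q_i and q_i' share a cluster.
# cluster_of is an assumed mapping from a query string to a cluster label.

def context_aware_suggest(sessions, qs, cluster_of, top_n=5):
    key = [cluster_of(q) for q in qs]          # represent the prefix by clusters
    counts = Counter()
    for session in sessions:
        labels = [cluster_of(q) for q in session]
        for start in range(len(session) - len(key)):
            if labels[start:start + len(key)] == key:
                counts[session[start + len(key)]] += 1
    return [q for q, _ in counts.most_common(top_n)]
```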


The quality measurement is scored by consulting experts. For the current query, if a suggested sentence in the list is judged appropriate, the method is awarded 1 point. If the suggestion list contains two or more suggestions that are almost identical, the method is awarded only 1 point for them. If a test case yields no suggestions, that test case is not counted. The score of a method on a specific test case equals the points earned divided by the total number of query suggestions, and the average score of each method equals the total score divided by the number of test cases counted.
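The two measurements described above can be computed as in the hedged sketch below, where each test case is assumed to be the list of (suggestion, is_appropriate) pairs a method produced (an empty list meaning no suggestion was issued); exact-string de-duplication stands in for the "almost identical" rule.

```python
# Sketch only: coverage and average relevance over a set of test cases.
# A test case is assumed to be a list of (suggestion_text, is_appropriate) pairs;
# an empty list means the method issued no suggestions for that case.

def coverage(test_cases):
    """Share of test cases for which at least one suggestion was issued."""
    answered = sum(1 for suggestions in test_cases if suggestions)
    return answered / len(test_cases)

def average_relevance(test_cases):
    """Mean per-case score: appropriate suggestions (duplicates counted once)
    divided by the number of suggestions; cases with no suggestions are skipped."""
    scores = []
    for suggestions in test_cases:
        if not suggestions:
            continue
        seen, points = set(), 0
        for text, is_appropriate in suggestions:
            if is_appropriate and text not in seen:
                points += 1
                seen.add(text)
        scores.append(points / len(suggestions))
    return sum(scores) / len(scores) if scores else 0.0
```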


Over all test samples, on both the relevance and diversity measurements, the scores of the three methods illustrated in Figure 3.14 show that the context-aware suggestions outperform the two baseline methods. Instead of suggesting at the level of single queries, the context-aware method determines the user's search intent at the cluster level (concept level).


<b>CHAPTER 4: CONCLUSION AND RECOMMENDATION </b>
<b>4.1. Conclusion </b>



The thesis applies Formal Concept Analysis (FCA) and the concept lattice to mining and searching textual data. The lattice has a mathematically elegant structure, well suited to mining, analyzing and clustering data, but it is not entirely suitable for search. Therefore, the thesis pursues two main research directions in depth: i) searching for entities based on implicit semantic relations, in order to simulate the human "natural" ability to infer unknown information/knowledge by analogy; and ii) context-aware query suggestion - considering a seamless query sequence to capture the search intent, then presenting the queries that users most often ask after the current query.


<b>4.2. Recommendation </b>


For the direction of entity search based on implicit semantic relations, the search model is currently constrained to three input entities, which is a disadvantage. To overcome this drawback, on the one hand one can consider adding relational mappings and time factors so that search results stay up to date and accurate. On the other hand, the entity search can be extended to input queries containing only one entity, for example: "Which river is the longest in China?"; the entity search model based on implicit semantics would then give the correct answer "Changjiang", even though the corpus only contains the original sentence "Changjiang is the largest river in China".



<b>LIST OF PUBLICATIONS </b>


1. Trần Lâm Quân - Vũ Tất Thắng. “Tìm kiếm thực thể dựa trên quan hệ ngữ nghĩa ẩn”. Hội thảo Quốc gia lần thứ XXI: Một số vấn đề chọn lọc của Công nghệ Thông tin và Truyền thông, 27-28/07/2018.
2. Trần Lâm Quân - Vũ Tất Thắng. “Search for entities based on the Implicit Semantic Relations”. Tạp chí Tin học và Điều khiển, Volume 35, Number 3, 2019.
3. Trần Lâm Quân - Đỗ Quốc Trường - Phan Đăng Hưng - Đinh Anh Tuấn - Phi Tùng Lâm - Vũ Tất Thắng - Lương Chi Mai. “A study of applying Vietnamese voice interaction for a context-based Aviation search engine”. The IEEE RIVF 2013 International Conference on Computing and Communication Technologies, 10-13/11/2013.
4. Trần Lâm Quân - Vũ Tất Thắng. “Context-aware and voice interactive search” (the SoCPaR 2013 special issue). Journal of Network and Innovative Computing, ISSN 2160-2174, Volume 2, pages 233-239, 2014.
5. Trần Lâm Quân - Phan Đăng Hưng - Vũ Tất Thắng. “Tìm kiếm bằng giọng nói với kĩ thuật hướng ngữ cảnh”. Tạp chí Khoa học và Công nghệ - Viện Hàn lâm Khoa học và Công nghệ Việt Nam, ISSN 0886-768X, Số 52 (1B), 29/06/2014.