INTEGRATING NATURAL LANGUAGE PROCESSING AND WEB GIS
FOR INTERACTIVE KNOWLEDGE DOMAIN VISUALIZATION

_______________

A Thesis
Presented to the
Faculty of
San Diego State University

_______________

In Partial Fulfillment
of the Requirements for the Degree
Master of Science in Geography
with a Concentration in
Geographic Information Science

_______________

by
Fangming Du
Summer 2014




Copyright © 2014
by
Fangming Du


All Rights Reserved



DEDICATION
To my parents and my family.



ABSTRACT OF THE THESIS
Integrating Natural Language Processing and Web GIS for
Interactive Knowledge Domain Visualization
by
Fangming Du
Master of Science in Geography with a Concentration in
Geographic Information Science
San Diego State University, 2014
Recent years have seen a powerful shift towards data-rich environments throughout
society. This has extended to a change in how the artifacts and products of scientific
knowledge production can be analyzed and understood. Bottom-up approaches are on the rise
that combine access to huge amounts of academic publications with advanced computer
graphics and data processing tools, including natural language processing. Knowledge
domain visualization is one of those multi-technology approaches, with its aim of turning
domain-specific human knowledge into highly visual representations in order to better
understand the structure and evolution of domain knowledge. For example, network
visualizations built from co-author relations contained in academic publications can provide
insight on how scholars collaborate with each other in one or multiple domains, and
visualizations built from the text content of articles can help us understand the topical
structure of knowledge domains.
These knowledge domain visualizations need to support interactive viewing and
exploration by users. Such spatialization efforts are increasingly looking to geography and
GIS as a source of metaphors and practical technology solutions, even when
non-georeferenced information is managed, analyzed, and visualized. When it comes to deploying
spatialized representations online, web mapping and web GIS can provide practical
technology solutions for interactive viewing of knowledge domain visualizations, from
panning and zooming to the overlay of additional information.
This thesis presents a novel combination of advanced natural language processing –
in the form of topic modeling – with dimensionality reduction through self-organizing maps
and the deployment of web mapping/GIS technology towards intuitive, GIS-like, exploration
of a knowledge domain visualization. A complete workflow is proposed and implemented
that processes any corpus of input text documents into a map form and leverages a web
application framework to let users explore knowledge domain maps interactively. This
workflow is implemented and demonstrated for a data set of more than 66,000 conference
abstracts.



TABLE OF CONTENTS
ABSTRACT
LIST OF TABLES
LIST OF FIGURES
ACKNOWLEDGEMENTS
CHAPTER
1 INTRODUCTION
    Problem Statement
    Objectives and Intellectual Merit
2 LITERATURE REVIEW
    Knowledge Domain Visualization
    Web GIS
    Spatialization
    Topic Modeling
3 RESEARCH DESIGN
    Functionality Design
        Spatial Concepts
        Real World
        Semantic World
        From Concepts to Functionality
    Workflow Design
    Web GIS Application Design
4 IMPLEMENTATION
    Workflow
        Text Processing Workflow
            Data Preprocessing
            LDA Topic Modeling
            SOM Training and Clustering
            Programming Environment
        GIS Processing Workflow
    Integrating Workflow with Web GIS
        Web GIS Implementation Framework
        Web Inferencing Services
        Mapping and Geoprocessing Services
        Web User Interface
    Evaluation of Performance
5 CONCLUSION
    Results Summary
    Limitations and Future Studies
REFERENCES
A FILTERED STOP TOPICS



LIST OF TABLES
Table 1 Semantic Generalization (Fabrikant and Skupin 2005)
Table 2 Functions for Non-geographic Information Visualization in GIS
Table 3 Dataset Format
Table 4 LDA Topic Model Training Output Files
Table 5 Input and Output Data for Inferencing Services
Table 6 Filtered Out Stop Topics, Each Stop Topic Consists of Several Topic Phrases




LIST OF FIGURES
Figure 1 Google Maps technology deployed for knowledge domain visualization
Figure 2 Perplexity evaluations of different computational language models (Blei 2003)
Figure 3 Data processing workflow
Figure 4 Web GIS application framework
Figure 5 NLP services
Figure 6 XML Processing for PDF Format Data
Figure 7 XML Schema
Figure 8 Data Content Preprocessing
Figure 9 Perplexity Computation Using Mallet
Figure 10 Perplexity Graph for Our Model
Figure 11 Trained SOM represented as Shape File. Panels (a) and (b) show the SOM
neurons as hexagons at zoom levels. Panels (c) and (d) contain renderings of
component planes, i.e. the distribution of the weights of one particular
attribute across the two-dimensional neuron geometry.
Figure 12 GIS Processing Workflow
Figure 13 SOM Polygon Dissolve and Labeling. (a) represents the dissolved polygons
from SOM neurons (Figure 11 (a)). (b) adds labels to cluster polygons.
Figure 14 Data and Process Flow in Web GIS
Figure 15 Inferencing Services for Projection Functionality
Figure 16 User Interface Components
Figure 17 Projection as Point
Figure 18 Projection as Overlay Map Layer
Figure 19 Time Consumption in Inferencing Service with Three Test Groups of Data
Figure 20 Time Consumption in Geoprocessing Service



ACKNOWLEDGEMENTS
I would like to thank the members of my thesis committee, Dr. Skupin, Dr. Tsou, and Dr.
Eckberg, for their help, support, and interest in my thesis work. In particular, I would like to
thank Dr. Skupin for serving as my major professor and graduate advisor. This thesis would
not have been possible without his great support. I am very grateful to him for
giving me invaluable advice and continuous guidance on my research project. In addition, I
would like to acknowledge and thank David Mckinsey and Marcus Chiu for their great
technical support. I would also like to thank Raymond Lee for his useful advice and help with
my thesis writing. I extend my gratitude to my colleagues and friends Jay Yang, Shuang
Yang, and Marilyn Stowell for their support. Finally, I would like to address
a special thank you to my parents for the love and care they have continually given me all
these years.



CHAPTER 1
INTRODUCTION
Visualization is the process of making a phenomenon visible or enabling the forming
of a mental image of it. Through different visualization products, human beings are able to
see and thus understand abstract information more efficiently. For example, on a subway
map, people can actually see the whole transportation system and understand how to transfer
between different lines to get to a destination.
Information visualization is the use of computer-supported, interactive, visual
representations of abstract data to amplify cognition (Card, Mackinlay & Shneiderman 1999).
With more and more information available online through computers and the Internet, it has
become much more difficult to make sense of this volume of information or even to produce
any form of visualization from it. With computational algorithms, information visualization
can represent large amounts of information visually so that people can better understand
and explore them and create new knowledge (Card, Mackinlay & Shneiderman 1999). Science
develops rapidly across disciplines, with new publications every year; it has become
almost impossible to understand the whole structure of science or even one knowledge
domain within it. Techniques and theories from information visualization are therefore
applied to the visualization of knowledge (Börner, Chen & Boyack 2003; Börner 2010). This
particular type of knowledge represents the opinions, values, and perspectives of scientific
disciplines, as communicated in scientific journals and articles. Its visualization can give an
overview of a whole discipline and its development from the past to the future, and thus
help guide professional groups in more fruitful directions (Börner, Chen & Boyack 2003;
Boyack, Klavans & Börner 2005).
Cartography, for its part, offers theories and practices for the visualization of
geographic information, and spatial metaphors have been used in information visualization
to draw on humans’ spatial cognition. Spatialization, which has emerged as a new research
frontier over the past decade, studies how to display high-dimensional data in
lower-dimensional space. It integrates computational algorithms for dimensionality
reduction with the spatial concepts and cartographic principles that help design
the lower-dimensional display space. Spatialization is applicable to knowledge domain
visualization and has the potential to integrate more cartographic approaches (Skupin,
Biberstine & Börner, 2013). However, interaction, one of the most important aspects of
information visualization, cannot be achieved with the traditional static cartographic
principles used in spatialization for knowledge domain visualization (Skupin, Biberstine &
Börner, 2013). Although some relatively simple online mapping technologies, such as Google
Maps (Figure 1), have been used for non-geographic knowledge domain visualization, these
tend to provide only very limited user interaction and functionality.


Figure 1 Google Maps technology deployed for knowledge domain
visualization
Meanwhile, more advanced web GIS solutions are now widely used to provide
interactive web mapping applications, but have traditionally focused solely on geographically
referenced data. This study will investigate whether and to what degree web GIS technology
can be employed in interactive knowledge domain visualization and how geographic
concepts and text mining techniques can be usefully combined in the process.

PROBLEM STATEMENT
Skupin (2002) discusses the creation of a base map using VSM (vector space model)
and SOM. In that spatialization approach, the VSM consists of vectors containing term
counts for each document. This is the high-dimensional model that then undergoes
dimensionality reduction using SOM. However, there are certain drawbacks to this use of
traditional VSM:
1. Scalability. Large document collections will result in vectors whose high
dimensionality may make SOM training more difficult;
2. Sparseness. Vectors in the VSM tend to be very sparse, since any particular document
vector will record a count of zero for most terms;
3. Term order. The order in which terms appear in the documents is lost in the
vector representation, at least when using unigram counts. While use of multi-part
n-grams would be possible, that can increase the already high dimensionality of the
VSM even further;
4. Semantic sensitivity. Documents with related content, but differences in actual
vocabulary (e.g., synonyms), may not display sufficiently strong similarity;
5. Stemming effects. Though stemming of the original terms will lower the model
dimensionality, it may result in "false positive matches" for stems that originate from
terms with significantly different meaning.
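
The sparseness and vocabulary drawbacks listed above can be illustrated with a minimal sketch. It is not the processing used in this thesis: the three sample documents, the whitespace tokenizer, and the function names are assumptions made purely for illustration.

```python
from collections import Counter


def build_term_count_vectors(documents):
    """Return (vocabulary, vectors) with one raw term-count vector per document."""
    # Assumption: naive lowercase/whitespace tokenization, no stemming or stop words.
    tokenized = [document.lower().split() for document in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors


def sparsity(vector):
    """Fraction of entries in a vector that are zero."""
    return sum(1 for value in vector if value == 0) / len(vector)


# Hypothetical mini-corpus: the first two documents are topically related
# but share no vocabulary.
documents = [
    "urban growth simulation with cellular automata",
    "city expansion modeling using agent based models",
    "sea surface temperature trends from satellite imagery",
]

vocabulary, vectors = build_term_count_vectors(documents)

# Number of terms that occur in both of the two related documents.
shared_terms = sum(1 for a, b in zip(vectors[0], vectors[1]) if a > 0 and b > 0)

print(len(vocabulary), sparsity(vectors[0]), shared_terms)
```

Even in this tiny corpus most entries of each vector are zero, and the two related documents have no terms in common, so a purely count-based comparison would treat them as unrelated.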

One key goal of this thesis is to explore the feasibility of replacing the VSM approach with a
topic model approach, prior to SOM training. Topic models – specifically latent Dirichlet
allocation (LDA) – treat each document as a mixture of topics derived from a collection of
documents (Blei, Ng & Jordan, 2003).
Another problem is the lack of a comprehensive workflow for the creation of base
maps from text documents, as opposed to processing steps occurring in several, relatively
separate, segments (Skupin 2002, 2004), which makes it difficult to replicate the process for
new document collections. Combining an existing Java library for topic modeling with a
newly developed Java library for SOM training creates the possibility of a seamless
processing workflow for the creation of base maps.
Finally, current knowledge domain visualizations do not provide sufficiently high
degrees of interaction to allow exploratory visualization by users. Instead, most of the more
intricate knowledge domain visualizations are image- or paper-based, with graphic zooming
as the only interactive operation supported. The technological base of web GIS should
provide a basis for more advanced interaction, since it is founded on a mature theoretical and
practical framework for managing, analyzing, and displaying geographic data online.

In addressing these various problems of contemporary knowledge domain visualization,
the following research questions are pursued in this thesis:

1. What are some of the fundamental spatial concepts in GIS that may be of potential use
for knowledge domain visualization? How could these be used?
This question will identify certain fundamental concepts in GIS and apply them to
the visualization of a high-dimensional space in which the knowledge domain is
represented. For example, the overlay operation can be used in GIS to project any
point/line/area geographic features onto a base map based on their geographic
coordinates, and we want to know whether and how this concept and GIS
technique is applicable to knowledge domain visualization.

2. How can one develop a domain base map from a large document corpus based on
NLP and dimensionality reduction techniques?
While the classic vector space model (VSM), in conjunction with the
self-organizing map (SOM) method, has been successfully used for domain mapping,
the more advanced NLP approach of latent Dirichlet allocation (LDA) has been
speculated to have advantages over a classic VSM approach, both in
computational performance and in terms of how meaningful the resulting
high-dimensional space is. This study will investigate how the LDA topic model can be
adapted and combined with SOM dimensionality reduction towards the creation
of detailed domain base maps.

3. How and to what degree can web GIS technology be utilized for interactive knowledge
domain visualization?
This question is intended to identify, adapt, and implement specific functions in a
web GIS environment, such that the spatial concepts of interest (see question 1)
can be operationalized in the context of the domain base map (see question 2). To
that end, a prototype web application will be implemented that combines web GIS
technology with live operations on a high-dimensional knowledge space and its
two-dimensional projection.
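
As a rough illustration of what projecting an item from the high-dimensional knowledge space onto the two-dimensional base map might involve, the sketch below finds the best-matching SOM neuron for a topic vector. The 2 x 2 grid, its weights, the function name, and the use of Euclidean distance are assumptions for illustration; the Java modules and web services developed in this thesis are not reproduced here.

```python
import math


def best_matching_unit(som_weights, vector):
    """Return the (row, column) of the neuron whose weight vector is closest to `vector`.

    `som_weights` is a grid (list of rows) of weight vectors that all have the
    same length as `vector`. Euclidean distance is an assumption of this sketch.
    """
    best_position = None
    best_distance = math.inf
    for row_index, row in enumerate(som_weights):
        for column_index, weights in enumerate(row):
            distance = math.dist(weights, vector)
            if distance < best_distance:
                best_position = (row_index, column_index)
                best_distance = distance
    return best_position


# Hypothetical trained SOM: 2 x 2 neurons, each holding weights for three topics.
som_weights = [
    [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]],
    [[0.1, 0.1, 0.8], [0.34, 0.33, 0.33]],
]

# Hypothetical topic vectors for two new documents.
position_a = best_matching_unit(som_weights, [0.7, 0.2, 0.1])
position_b = best_matching_unit(som_weights, [0.1, 0.2, 0.7])
print(position_a, position_b)
```

The returned grid position plays a role similar to a geographic coordinate: once an item has a location in the two-dimensional neuron geometry, it can in principle be drawn on top of the base map in the manner of a GIS overlay.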

OBJECTIVES AND INTELLECTUAL MERIT
The overall objective of this research is to create an integrated workflow and
framework that utilizes LDA topic modeling, SOM dimensionality reduction, and web GIS to
create interactive knowledge domain visualizations from any large, domain-specific text
corpus. The following specific objectives are pursued:
a) Java program modules are generated that can preprocess a text corpus,
iteratively create an LDA topic model, and perform SOM training in the same
programming environment.
b) GIS-based modules are created that transform the output of the LDA/SOM
process into data structures compatible with GIS software, such that the base
map can be represented in GIS.
c) The trained model and base map serve as the content drivers for web mapping and web
processing services that provide both interactive online domain mapping and
live NLP inference.

The intellectual merit of this research rests on a novel, iterative approach to LDA
topic modeling and the use of web GIS technology to implement advanced spatial operators
for interactive high-dimensional visualization and inference.
Compared to traditional VSM, the LDA topic model is meant to result in a
lower-dimensional representation that is computationally more efficient and also a potentially more
meaningful representation of the document corpus (Skupin, Biberstine & Börner, 2013). This
is one of the first studies to explore this combination of LDA topic modeling with SOM and
the first study to create a detailed knowledge domain base map through this process.
Meanwhile, the technological solutions and workflows proposed, developed, and
documented in this study will serve as a template for future visualizations of other knowledge
domains.
Web GIS technology has become very popular and widely adopted during the
previous decade. From simple web mapping, as found in Google Maps, it has extended to
web based geo-processing services providing much of the functionality found in stand-alone
GIS software. However, its underlying spatial concepts and analytical capabilities have
typically only been applied to geographically referenced information. This study represents
the first practical exploration of an extensive set of spatial concepts in a high-dimensional
framework and its operationalization for large-scale knowledge domain visualization in a
web GIS environment.




CHAPTER 2
LITERATURE REVIEW
As discussed in the previous chapter, this research studies the combination of LDA
topic modeling, SOM, and Web GIS for interactive knowledge domain visualization. This review
discusses knowledge domain visualization in the first section and Web GIS in the second
section. The third section reviews spatialization, namely the use of spatial concepts as
metaphors in information visualization. Finally, the chapter ends with
a discussion of topic modeling.

KNOWLEDGE DOMAIN VISUALIZATION
Knowledge domain visualization aims at the interactive visual representation of
knowledge domains (Börner, Chen & Boyack 2003). Knowledge domains can be considered
as abstract spaces within which different knowledge objects can be represented. A
specific discipline, such as medical science or geography, can be defined as one knowledge
domain that is made up of the scientific journals, articles, and professional groups in that
discipline.
Knowledge domain visualization is not a new field. Price (1965) introduced a method
to look into the development of science by analyzing scientific papers. He examined the
changes of references and citations of scientific papers over years using statistical analysis
and found out that changes in citations can indicate how scientific fields grow.
With the development of computer science and GIS, knowledge domain visualization
has expanded the older citation-based analysis into a new research field that utilizes all
kinds of data and visualization techniques to reveal the development of scientific knowledge
(Börner, Chen & Boyack 2003). Börner, Chen and Boyack (2003) introduced a general framework
for knowledge domain visualization and identified two fundamental problems.
One is the need to project high-dimensional data onto a two-dimensional
display space; the other is the conflict between large amounts of data and limited
space and resolution.
Besides the visual representation of knowledge domains, interaction, as one of the
most important elements in information visualization, also emerges as an important issue in
knowledge domain visualization. Shiffrin and Börner (2004) noted that without
interaction the visualization of knowledge domains will be of little use.

WEB GIS
In traditional cartography and GIS, theories and practices have been applied to static
representation of geographic information as maps, including projection, generalization, and
map design (Dent 1999; Robinson et al. 1995). Since the invention and evolutionary growth
of the Internet, several changes have taken place in cartography, especially in the
dissemination of and interactive access to geographic information (Kraak & Brown, 2001).
The widespread accessibility of the Internet around the world makes sharing and
providing geographic information and services easier and more powerful. Smith & Frew
(1995) introduced the Alexandria Digital Library, a project then under way and one of the
earliest distributed systems aimed at providing online services for sharing geographic
information. Its functionality included supporting access, queries, storage, and
management. Green and Bossomaier (2002) introduced the idea of distributing
geographic information systems (GIS) as online GIS services. They proposed a two-tier
framework consisting of a server side and a client side. This was a promising framework that
could provide more services, including geographic data sharing and geographic data analysis.
Tsou (2004) then introduced a Web-GIS architecture with three levels of geographic information
services: data archive, information display, and spatial analysis. The first level is a web-based
data warehouse, which can be seen as an extension of the earlier Alexandria Digital Library.
The second- and third-level services build on the first level, providing interactive user
services, similar to Green and Bossomaier’s two-tier framework. Tsou (2004) also provided a
prototype implementation of the three-level architecture.

Interaction is the second big change the web brings to cartography. In traditional
cartography and GIS, the contents of maps are static and the quality of maps mostly
depends on the data and on professional cartographers. However, since the launch of Google
Maps and Google Earth in 2005, people have had much easier access to geographic information
than ever before, and they can easily find the geographic information they want through
dynamic requests. Many other new websites have emerged based on the idea of “Web 2.0”
(Haklay, Singleton & Parker 2008). Goodchild (2007) introduced the emerging phenomenon
of sharing geographic information on the web, as on Wikimapia (www.wikimapia.org, an online
editable map on which anyone can mark or describe any site on the Earth) and Flickr
(www.flickr.com, where users upload photographs and locate them on the Earth’s surface by
longitude and latitude). Users thus not only have more freedom in how they view geographic
data, but also gradually become providers of geographic information. Tsou (2011) identified
this change and sought to redefine web cartography, emphasizing the trend toward
user-centered design, user-generated content, and ubiquitous access.

SPATIALIZATION
In geographic information science and cartography, there are several concepts which
have been applied to other research fields as spatial metaphors (Kuhn and Blumenthal 1996;
Skupin and Buttenfield 1996, 1997), such as location, distance and scale.
Fabrikant (2000, 2001) used region, distance and scale as spatial metaphors for the
visualization and interactive exploration of digital libraries. Different documents in the
library are displayed in a two-dimensional space. Distance represents the similarity between
documents, thus similar documents in the library would be displayed closer to each other.
Scale represents the level of details in the hierarchy of documents and region is used to
aggregate similar items.
Beyond these spatial metaphors, Skupin (2000, 2002) applied cartographic
approaches, such as generalization, feature labeling, and map design, to the visualization of
text documents. These cartographic approaches can tackle some issues that spatial metaphors
cannot address, such as dealing with the complexity of large numbers of features.
The study of applying these spatial metaphors and cartographic approaches to other
types of data visualization, especially for high-dimensional data, has formed a new research
frontier in recent years known as spatialization, the “systematic transformation of high
dimensional datasets into lower-dimensional, spatial representations for facilitating data
exploration and knowledge construction” (Skupin and Fabrikant 2008). It can thus transform
large amounts of unstructured and non-georeferenced high-dimensional data into an organized
geographic space. Other spatial concepts and techniques can then be applied to this display
space to draw on humans’ spatial cognition in understanding the original datasets.

TOPIC MODELING
A semantic space is one kind of high-dimensional data that can be transformed into a
lower-dimensional space using spatialization. Semantic spaces are built from document corpora
and can be represented using different models. One of the earliest models
to be widely used for text retrieval is the vector space model (VSM). It computes
similarities between textual units based on the frequencies of shared keywords (Salton, 1989;
Skupin & Buttenfield, 1996). One limitation is that many low-frequency keywords have
to be dropped when processing a large amount of data (Skupin, Biberstine & Börner, 2013);
another is that the VSM is sensitive to differences in vocabulary.
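
To make the idea of similarity based on shared keyword frequencies concrete, the following sketch computes cosine similarity between term-count vectors. Cosine similarity is assumed here as the measure, since the text above does not name one, and the four-term vocabulary and the counts are invented for illustration.

```python
import math


def cosine_similarity(vector_a, vector_b):
    """Cosine of the angle between two equal-length term-count vectors."""
    dot_product = sum(a * b for a, b in zip(vector_a, vector_b))
    norm_a = math.sqrt(sum(a * a for a in vector_a))
    norm_b = math.sqrt(sum(b * b for b in vector_b))
    if norm_a == 0 or norm_b == 0:
        # A document with no counted terms is treated as similar to nothing.
        return 0.0
    return dot_product / (norm_a * norm_b)


# Hypothetical counts over the vocabulary ["gis", "map", "topic", "web"].
document_a = [2, 1, 0, 1]
document_b = [1, 1, 0, 0]
document_c = [0, 0, 3, 0]

similarity_ab = cosine_similarity(document_a, document_b)
similarity_ac = cosine_similarity(document_a, document_c)
print(similarity_ab, similarity_ac)
```

Documents that share keywords receive a positive score, while documents with no keywords in common score zero regardless of how related their content is, which is one way the vocabulary sensitivity noted above shows up.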
A topic model is a new type of statistical model for discovering abstract topics in a
document corpus. Given that a document is about a particular topic, one would expect
the particular words describing that topic to appear in that document more frequently.
Latent Dirichlet allocation (LDA) is the most common topic model currently in use. In LDA,
a topic is defined as a distribution over a fixed vocabulary and each document is a mixture
of topics in different proportions (Blei, Ng & Jordan, 2003). A document in a semantic
space is thus described by its topic proportions, with the topics serving as dimensions.
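
The relation between topics, documents, and words described above can be sketched numerically. The example below only shows how a document's word distribution follows from its topic proportions and the per-topic word distributions; it does not perform LDA inference, and the vocabulary, the two topics, and all probabilities are invented for illustration.

```python
def document_word_distribution(topic_proportions, topic_word_distributions):
    """Mix per-topic word distributions using a document's topic proportions.

    Computes, for every word w, the sum over topics k of
    proportion[k] * probability of w under topic k.
    """
    vocabulary_size = len(topic_word_distributions[0])
    mixed = [0.0] * vocabulary_size
    for proportion, word_distribution in zip(topic_proportions, topic_word_distributions):
        for index, probability in enumerate(word_distribution):
            mixed[index] += proportion * probability
    return mixed


# Hypothetical vocabulary: ["map", "gis", "word", "topic"].
topic_word_distributions = [
    [0.50, 0.40, 0.05, 0.05],  # a topic dominated by mapping terms
    [0.05, 0.05, 0.40, 0.50],  # a topic dominated by text-analysis terms
]

# Hypothetical document that is 75% the first topic and 25% the second.
topic_proportions = [0.75, 0.25]

word_distribution = document_word_distribution(topic_proportions, topic_word_distributions)
print(word_distribution)
```

In this view the document is represented by the short vector of topic proportions, here of length two, instead of by a count for every term in the vocabulary.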




CHAPTER 3
RESEARCH DESIGN

The main goals of this research are to propose a workflow for processing text
documents to create knowledge domain maps and to propose a web application framework
for interactive exploration of the knowledge domain maps online. To accomplish these goals,
functionality, processing workflow and web GIS application design are introduced in this
chapter.

FUNCTIONALITY DESIGN
This section identifies the main conceptual building blocks in GIS and applies some
of them to the design of functionalities that are potentially applicable to the visualization of
high-dimensional non-geographic information.

Spatial Concepts
Fabrikant and Skupin (2005) propose a spatialization framework with a two-step
transformation process for semantic spaces. In the first step, they consider a semantic
generalization (Table 1) that deals with the abstraction of semantic spaces with geographic
primitives. The second step deals with selection of appropriate visual variables for the
representation.
Table 1 Semantic Generalization (Fabrikant and Skupin 2005)

| Real World | Semantic World |
|---|---|
| Feature | Semantic entity |
| Feature location | Entity location |
| World has time | Space is time |
| Scale | Granularity |

Using this framework, we can identify more fundamental concepts in GIS, especially
in web GIS, and apply them to an interactive visualization of the knowledge domain of
geography. Semantic generalization is applied to the data using the LDA topic model and
self-organizing maps (SOM). The visual representation in the 2-dimensional display space is
rendered with map symbols according to map design principles.
Golledge (1995) presented a set of primitives (identity, location, magnitude and time)
as the building blocks of spatial concepts. He then identified three levels of spatial
concepts based on these primitives. First-level concepts are derived concepts: distance,
angle and direction, sequence and order, connection and linkage. The second level covers
spatial distribution, including boundary, density, dispersion, pattern and shape. The third
level consists of higher-order derived concepts: correlation, overlay, network, hierarchy and
other concepts.
DiBiase et al. (2006) developed a comprehensive body of knowledge for geographic
information science and technology, which provides some basic concepts from the science and
technology perspective. The main concepts for the representation and analysis of geographic
information include geometric measures (distance, direction, shape, area and connectivity),
basic analytical operations (buffers, overlays, neighborhoods, and map algebra), elements of
geographic information (discrete entities, events and processes, fields in space and time, and
integrated models), and domains of geographic information (space, time, relationships between
space and time, and properties).
Janelle and Goodchild (2011) identified several fundamental spatial concepts from
more recent research in geographic information science: location, distance,
neighborhood and region, networks, overlays, scale, spatial heterogeneity, spatial
dependence, and objects and fields. This group of concepts forms the basis for
contemporary spatial analysis and visualization.
A comparison of these different approaches to the delineation of spatial concepts
yielded the following fundamental concepts in GIS that may be of particular use for the
representation of semantic spaces: identity, location, distance, neighborhood and region,
connection, scale, time, objects and fields, overlays, buffers, networks.



From Concepts to Functionality
Based on the spatial concepts just mentioned, we need to derive functions that can be
implemented in a web GIS system and are relevant to knowledge domain visualization.
However, since a knowledge domain is here defined as existing in a high-dimensional space,
some of the functions available in web GIS need to be extended to support high-dimensional
approaches. For example, in order to implement a buffer function, one first needs to compute
the proximal region in a high-dimensional space and then project it to the 2-dimensional
display space. The following table shows all the definitions of the functions in our
application.
Table 2 Functions for Non-geographic Information Visualization in GIS

| Concepts | Functions | Definitions |
|---|---|---|
| Identity | Identify | Access attributes of a feature. |
| Scale | Zoom | View data in different scale levels. |
| Proximal Region | High-dimensional buffering | The input geometry/text is buffered by calculating each offset in the semantic space and then represented in the 2-D display space as regions. |
| Neighbor | Find features within distance | The distance is computed in semantic space between the geometry/text input and other features in semantic space. |
| Projection | Project as discrete object; Project as continuous field | The input text is computed by topic models to get its coordinates in semantic space and then projected to the 2-D display space and represented as point/field. |
| Path | Shortest path | Different points would be interpolated between two input point/text. Then the path between the interpolated points would be represented in the display space as the shortest path. |
| Transect | View profile graph | View different term weights between different points in the display space as graphs. |
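The buffer, neighbor and path definitions in Table 2 share one pattern: compute in the high-dimensional semantic space first, then report the result through features that have a position in the display space. The sketch below illustrates that pattern. It assumes Euclidean distance between topic vectors and linear interpolation for the path, neither of which is specified in the text, and the feature ids and vectors are invented toy values.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two topic vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def semantic_buffer(query, features, radius):
    """Ids of features within `radius` of the query in semantic space.

    The returned features, drawn at their 2-D positions, would form the
    buffer region in the display space.
    """
    hits = sorted((euclidean(query, vec), fid) for fid, vec in features.items())
    return [fid for dist, fid in hits if dist <= radius]


def interpolate(start, end, steps):
    """steps + 1 evenly spaced points from start to end in semantic space."""
    return [
        [s + (e - s) * i / steps for s, e in zip(start, end)]
        for i in range(steps + 1)
    ]


def nearest_feature(point, features):
    """Id of the feature whose topic vector is closest to the point."""
    return min(features, key=lambda fid: euclidean(point, features[fid]))


def semantic_path(start, end, features, steps=4):
    """Sequence of features visited when moving from start to end."""
    path = []
    for point in interpolate(start, end, steps):
        fid = nearest_feature(point, features)
        if not path or path[-1] != fid:  # skip consecutive repeats
            path.append(fid)
    return path


# Toy features (invented): id -> topic vector over three topics.
features = {
    "n0": [0.9, 0.1, 0.0],
    "n1": [0.5, 0.5, 0.0],
    "n2": [0.1, 0.8, 0.1],
    "n3": [0.0, 0.1, 0.9],
}

print(semantic_buffer([0.9, 0.1, 0.0], features, radius=0.6))
print(semantic_path(features["n0"], features["n2"], features))
```

Doing the distance computation before the projection matters because two features that are neighbors in the 2-D display are not guaranteed to be close in the semantic space.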

WORKFLOW DESIGN
The original dataset used in this research is a collection of abstracts submitted
to the AAG for its annual meetings over 20 years, comprising around 66,000 records. Each
abstract consists of around 250 words of text, including author information and keywords.
We employ the LDA topic model to extract topics from this corpus of abstracts and then
apply SOM training to the abstracts based on the topics from the LDA topic
model.
The data come in various file formats and data structures across the 20-year range, so
extensive pre-processing will be necessary to represent each original AAG abstract within a
single XML schema. The title, keywords and abstract text of each abstract must then be
processed in MALLET (McCallum, 2002) using the LDA topic model to obtain a basic topic
loading that describes the whole collection of abstracts in terms of topics.
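As a rough illustration of the normalization step, the sketch below maps one parsed record to a single XML layout and then assembles the text that would be passed to the topic model. The element and attribute names (`abstract`, `title`, `keywords`, `keyword`, `text`, `id`, `year`) and the sample record are hypothetical; the thesis does not describe its actual schema.

```python
import xml.etree.ElementTree as ET


def abstract_to_xml(record):
    """Build one <abstract> element from a parsed record (hypothetical schema)."""
    root = ET.Element(
        "abstract", attrib={"id": str(record["id"]), "year": str(record["year"])}
    )
    ET.SubElement(root, "title").text = record["title"].strip()
    keywords = ET.SubElement(root, "keywords")
    for kw in record["keywords"]:
        ET.SubElement(keywords, "keyword").text = kw.strip().lower()
    # Collapse runs of whitespace and line breaks left over from source files.
    ET.SubElement(root, "text").text = " ".join(record["text"].split())
    return root


def lda_input_text(element):
    """Concatenate title, keywords and abstract text for topic modeling."""
    parts = [element.findtext("title", default="")]
    parts += [kw.text for kw in element.iter("keyword")]
    parts.append(element.findtext("text", default=""))
    return " ".join(p for p in parts if p)


# Sample record (invented) with the kind of stray whitespace that varies by year.
record = {
    "id": 1,
    "year": 2010,
    "title": " Mapping Knowledge Domains ",
    "keywords": ["SOM ", "LDA"],
    "text": "Abstract   body\n text.",
}

element = abstract_to_xml(record)
print(ET.tostring(element, encoding="unicode"))
print(lda_input_text(element))
```

Once every year's records pass through one function like this, the later steps only need to understand a single structure.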
Though the LDA topic model discovers a range of topics, some of these
"topics" will be of a syntactic or procedural type, with little value for semantic/topical
distinctions in the knowledge domain. For example, some of the topics exported by the
model are characterized by phrases such as “paper examines, paper explores, paper concludes,
paper discusses”. These general phrases can appear in very heterogeneous abstracts that
otherwise have little in common. We define this type of topic as a stop topic, which should
be removed from the original text corpus before further analysis.
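The text does not spell out how a stop topic is removed from the corpus. One plausible reading, assumed here, is that the characteristic terms of each stop topic are treated like stop words and stripped from the abstracts before the model is run again. The sketch below follows that reading; the function name, the tokenization rule and the sample sentences are illustrative only.

```python
import re


def strip_stop_topic_terms(documents, stop_topic_terms):
    """Remove the terms of stop topics from each document.

    Documents are lowercased and split into alphabetic tokens, so
    punctuation attached to a word does not prevent a match.
    """
    stop = {term.lower() for term in stop_topic_terms}
    cleaned = []
    for doc in documents:
        tokens = re.findall(r"[a-z]+", doc.lower())
        cleaned.append(" ".join(t for t in tokens if t not in stop))
    return cleaned


# Terms taken from the example stop topic quoted above.
stop_terms = ["paper", "examines", "explores", "concludes", "discusses"]

# Sample abstract sentences (invented).
documents = [
    "This paper examines urban growth.",
    "The paper discusses glacier retreat.",
]

print(strip_stop_topic_terms(documents, stop_terms))
```

After this filtering the two sample sentences keep only their topical words, so procedural phrasing no longer links otherwise unrelated abstracts.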
Deciding on the number of topics in the model is another challenge. Blei, Ng and Jordan
(2003) use perplexity, a statistical measure for comparing different probability models,
to evaluate the performance of different models (as shown in Figure 2).
We therefore use perplexity as the indicator for deciding the number of topics in our LDA
topic model.
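For reference, Blei, Ng and Jordan (2003) define the perplexity of a held-out set of M documents as exp(−Σ_d log p(w_d) / Σ_d N_d), where N_d is the number of words in document d; lower values indicate a better model. The sketch below computes this quantity from per-document log-likelihoods, which are assumed to come from whatever tool evaluates the model. The function name and the toy check are illustrative.

```python
import math


def perplexity(doc_log_likelihoods, doc_lengths):
    """Held-out perplexity: exp(-total log-likelihood / total word count)."""
    total_words = sum(doc_lengths)
    if total_words == 0:
        raise ValueError("perplexity is undefined for an empty test set")
    return math.exp(-sum(doc_log_likelihoods) / total_words)


# Sanity check with a toy model that assigns every word probability 1/4:
# a document of n words then has log-likelihood n * log(1/4).
doc_lengths = [10, 25, 5]
doc_log_likelihoods = [n * math.log(0.25) for n in doc_lengths]

print(perplexity(doc_log_likelihoods, doc_lengths))
```

A model that is equally uncertain among four words has a perplexity of about 4, which gives the measure its usual reading as an effective vocabulary size. To choose the number of topics, one would compute this value for each candidate model on the same held-out documents and compare.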




Figure 2 Perplexity evaluations of different computational language models
(Blei, Ng & Jordan, 2003)
The two most important output files from the LDA topic model are the document topic
file and the topic inferencer file. The document topic file gives every input document a score
for how strongly it relates to each of the topics. This file is then processed during SOM
training with the aim of generating a 2-dimensional topical display space. SOM training
treats topics as distinct dimensions and thus represents each AAG abstract as a topic vector.
Neurons in the SOM become associated with topic vectors of the same
dimensionality as the input vectors and become the geometric features for the visualization of
the geographic knowledge domain. The topic inferencer can be used to infer topic
scores for any text item, including arbitrary text entered by users later in the web GIS
application. The 2-dimensional arrangement of neurons from the SOM can then be processed in
GIS tools to create a base map. The whole proposed workflow is shown in Figure 3.
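The SOM step can be sketched in a few lines. The version below is a deliberately small, pure-Python illustration on a rectangular grid with a Gaussian neighborhood and linearly decaying learning rate and radius. These choices, the parameter values and the toy topic vectors are assumptions for illustration; the thesis does not state its SOM configuration here, and a real run on about 66,000 abstracts would use a dedicated SOM package.

```python
import math
import random


def best_matching_unit(weights, vector):
    """Grid position (row, col) of the neuron closest to the vector."""
    best, best_dist = None, float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            dist = sum((a - b) ** 2 for a, b in zip(w, vector))
            if dist < best_dist:
                best, best_dist = (r, c), dist
    return best


def train_som(data, rows, cols, epochs=20, lr0=0.5, seed=0):
    """Train a rows x cols SOM; each neuron holds a topic vector."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [
        [[rng.random() for _ in range(dim)] for _ in range(cols)]
        for _ in range(rows)
    ]
    radius0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)                     # learning rate decays
        radius = max(radius0 * (1.0 - frac), 0.5)   # neighborhood shrinks
        for vector in data:
            br, bc = best_matching_unit(weights, vector)
            for r in range(rows):
                for c in range(cols):
                    grid_dist2 = (r - br) ** 2 + (c - bc) ** 2
                    h = math.exp(-grid_dist2 / (2.0 * radius ** 2))
                    w = weights[r][c]
                    for i in range(dim):
                        # Pull the neuron towards the input vector.
                        w[i] += lr * h * (vector[i] - w[i])
    return weights


# Toy document-topic vectors (invented): two groups of abstracts.
doc_topics = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.9],
]

som = train_som(doc_topics, rows=3, cols=3)

# Projecting new text: its inferred topic vector is placed at the
# best-matching neuron, which has a fixed position in the 2-D display.
inferred = [0.85, 0.15, 0.0]
print(best_matching_unit(som, inferred))
```

The last step mirrors the role of the topic inferencer: text entered by a user is first turned into a topic vector and then located on the map through its best-matching neuron.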

