VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF ENGINEERING AND TECHNOLOGY


ENTRY FOR THE "STUDENT SCIENTIFIC RESEARCH" AWARD, 2012


Project title:
Information Extraction for Financial Analysis
Student: Lê Văn Khánh        Gender: Male
Class: K53CA        Faculty: Information Technology
Supervisor: Dr. Phạm Bảo Sơn


HANOI - 2012

Abstract

Today, much of the useful information on the World Wide Web is formatted for human
readers, which makes it difficult to extract relevant data from diverse sources.
Information Extraction (IE) was born to solve this problem, and flexible IE systems
that transform such resources into program-friendly structures, such as relational
databases or XML, are becoming a great necessity. In this report, we address the
problem of applying Information Extraction to financial analysis. The main goal is
to extract information from thousands of financial reports written in different
formats. We also present a systematic approach to building extraction rules, and we
evaluate the performance of our system.
Contents

Chapter 1. Introduction
1.1 Subject Overview
1.2 Information Extraction
1.3 Report Structure
Chapter 2. Approaches in Information Extraction
2.1 Manually Constructed IE Systems
2.1.1 TSIMMIS tool
2.1.2 W4F
2.1.3 XWRAP
2.2 Supervised IE Systems
2.2.1 SRV
2.2.2 RAPIER
2.2.3 WHISK
2.3 Semi-Supervised IE Systems
2.3.1 IEPAD
2.3.2 OLERA
2.4 Unsupervised IE Systems
2.4.1 RoadRunner
2.4.2 DEPTA
Chapter 3. Our Approach
3.1 Problem Formalization
3.1.1 HTML Mode
3.1.2 Plain-text Mode
3.2 Approaches & Implementation
3.2.1 HTML Mode
3.2.1.1 Preprocessing
3.2.1.2 Extracting
3.2.1.3 Finalizing
Chapter 4. Experimental Setup and Evaluations
4.1 Evaluation Metrics
4.2 Corpus Development
4.3 Evaluation Criteria
4.4 Training Process
4.5 Testing Process
Chapter 5. Conclusion & Future Work
References


Chapter 1
Introduction

1.1 Subject Overview

Nowadays, reports play an important role in every field: managers can follow the
progress of work through them. Our problem is how to deal with financial reports that
are written in various formats by different companies. These reports contain a great
deal of information, but readers usually need only a brief, essential summary in order
to quickly understand what the reports describe. For example, given an input document
such as the one in Figure 1.1, the desired output looks like Figure 1.2; Figure 1.3
shows the general scenario of our work.

Figure 1.1. A report

Figure 1.2. Excel format


Figure 1.3. Scenarios

In this scenario (Figure 1.3), we concentrate only on step 1: processing the reports to
obtain output such as Figure 1.2. The financial analysts then analyse these outputs in
step 2. To sum up, our task is to apply Information Extraction for Financial Analysis
in order to produce such output (e.g. Figure 1.2 above).

1.2 Information Extraction

Information extraction (IE) was originally applied to identify desired information in
natural language text and convert it into a well-defined structure, e.g. a database with
particular fields. With the huge and rapidly increasing amount of information sources
and electronic documents available on the World Wide Web, information extraction has
been extended to identify information in structured and semi-structured web pages.
Recently, more and more research groups have concentrated on the development of
information extraction systems for applications such as web mining and question
answering. Research on information extraction can be divided into two subareas:
designing extraction patterns for identifying the target information in a given text,
and using machine learning techniques to build such extraction patterns automatically,
thereby avoiding expensive manual construction. Many information extraction systems
have been implemented successfully, and some of them perform very well. As an example,
Figure 1.4 illustrates information extraction: given a seminar announcement document,
the entities Date, Start-time, Location, Speaker and Topic can be identified.

Figure 1.4. Information Extraction for Seminar Announcement
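To make this concrete, the short Python sketch below picks a few of these entities out of a hypothetical announcement with hand-written regular expressions. The announcement text, field names and patterns are illustrative assumptions of ours, not part of any particular IE system.

```python
import re

# A hypothetical seminar announcement (invented for illustration).
announcement = """
Dr. Jane Smith will give a talk on Information Extraction.
Date: 10/05/2012
Time: 2:00 PM
Place: Room 305, Building E3
"""

# Hand-written patterns for some of the target entities.
patterns = {
    "Speaker":    r"(?:Dr\.|Prof\.)\s+[A-Z][a-z]+\s+[A-Z][a-z]+",
    "Date":       r"Date:\s*(\d{1,2}/\d{1,2}/\d{4})",
    "Start-time": r"Time:\s*(\d{1,2}:\d{2}\s*(?:AM|PM))",
    "Location":   r"Place:\s*(.+)",
}

for field, pattern in patterns.items():
    match = re.search(pattern, announcement)
    if match:
        # Use the capturing group when there is one, otherwise the whole match.
        value = match.group(1) if match.groups() else match.group(0)
        print(f"{field}: {value.strip()}")
```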
Formally, an IE task is defined by its input and its extraction target. The input can
be unstructured documents, such as plain text written in natural language (e.g. Figure
1.4), or semi-structured documents that are popular on the Web, such as tables or
itemized and enumerated lists (e.g. Figure 1.5).

Figure 1.5. A Semi-structured page containing data records (in rectangular box) to be
extracted.
The extraction target of an IE task can be a relation of k-tuples (where k is the
number of fields, or attributes, in a record), or it can be a complex object with
hierarchically organized data. For some IE tasks, an attribute may have zero (missing)
or multiple instantiations in a record. IE systems are also called extractors or
wrappers. Traditional IE systems therefore rely mainly on rule-based, machine learning
and pattern mining techniques to extract the information.
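As a rough illustration of a wrapper extracting a relation of 2-tuples from a semi-structured page, the hedged Python sketch below reads (company, net profit) pairs from an itemized HTML list. The HTML fragment and the extraction rule are invented for illustration and do not correspond to any of the systems discussed in the next chapter.

```python
import re

# A tiny semi-structured page fragment (invented for illustration): each data
# record is rendered with the same template.
html = """
<ul>
  <li><b>ACB Bank</b> - Net profit: <i>1,200</i> billion VND</li>
  <li><b>VCB Bank</b> - Net profit: <i>4,500</i> billion VND</li>
  <li><b>BIDV</b> - Net profit: <i>3,300</i> billion VND</li>
</ul>
"""

# A single hand-written extraction rule recovers the whole relation of
# 2-tuples (company, net profit).
record_pattern = re.compile(r"<li><b>(.*?)</b> - Net profit: <i>(.*?)</i>")

for company, profit in record_pattern.findall(html):
    print(company, profit)
# ACB Bank 1,200
# VCB Bank 4,500
# BIDV 3,300
```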
1.3 Report Structure
Our report is organized as follows. In Chapter 2, we introduce IE systems in the
information extraction domain and review some of the solutions that have been proposed.
Chapter 3 then describes our approach and system implementation. Chapter 4 describes
the experiments we carried out to evaluate the quality of our approach. Finally,
Chapter 5 presents our conclusions and future work.

Chapter 2
Approaches in Information Extraction


Earlier IE systems were designed to help programmers write extraction rules, while
later IE systems use machine learning to generate generalized rules automatically. Such
systems differ in their degree of automation and accuracy. Accordingly, IE systems can
be classified into four classes: manually constructed IE systems, supervised IE
systems, semi-supervised IE systems and unsupervised IE systems.

2.1 Manually Constructed IE Systems
In manually constructed IE systems, users create a wrapper for each input source by
hand, using general programming languages such as Java, Python or Perl, or using
specially designed languages. These tools require developers with substantial computer
science and programming backgrounds, which makes them expensive to build and maintain.
Such systems include TSIMMIS [1], W4F [2] and XWRAP [3].
2.1.1 TSIMMIS tool
The main component of this tool is a wrapper that takes as input a specification file
that declaratively states where the data of interest is located in the source and how
it should be structured in the output. For example, Figure 2.1(a) shows such a
specification file.

Figure 2.1. (a) A TSIMMIS specification file and (b) the OEM output.
Each command is of the form [variables, source, pattern], where source specifies the
input text to be considered, pattern specifies how to find the text of interest within
the source, and variables is a list of variables that hold the extracted results. The
special symbol "*" in a pattern means discard, and "#" means save into the variables.
TSIMMIS then outputs data in the Object Exchange Model (e.g. Figure 2.1(b)), which
contains the extracted data together with information about the structure and contents
of the result. TSIMMIS provides two important operators: split and case. The split
operator divides an input list element into individual elements, and the case operator
allows the user to handle irregularities in the structure of the input pages.
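The following Python sketch mimics, in spirit, the behaviour of a single command as described above: text matched by "*" is discarded and text matched by "#" is bound, in order, to the variables. This is only a rough re-implementation of the idea for illustration, not TSIMMIS syntax or code, and the example source text is invented.

```python
import re

def apply_command(variables, source, pattern):
    """Apply a [variables, source, pattern]-style command.

    In the pattern, '*' marks text to discard and '#' marks text whose match
    is saved; saved pieces are bound, in order, to the variable names.
    """
    regex = ""
    for ch in pattern:
        if ch == "*":
            regex += r".*?"        # discard
        elif ch == "#":
            regex += r"(.*?)"      # save into the next variable
        else:
            regex += re.escape(ch)
    match = re.search(regex, source, re.DOTALL)
    return dict(zip(variables, match.groups())) if match else {}

# Illustrative usage (the source text and pattern are invented):
source = "<b>Revenue:</b> 500 billion VND<br>"
print(apply_command(["revenue"], source, "<b>Revenue:</b> #<br>"))
# {'revenue': '500 billion VND'}
```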
2.1.2 W4F
W4F stands for Wysiwyg Web Wrapper Factory, a Java toolkit for generating Web wrappers.
The wrapper development process consists of three independent layers: retrieval,
extraction and mapping. In the retrieval layer, a document is retrieved (from the Web
through the HTTP protocol), cleaned up and then parsed into a tree following the
Document Object Model (DOM) [5]. In the extraction layer, extraction rules are applied
to the DOM tree to extract information, which is then stored in an internal format
called the Nested String List (NSL). In the mapping layer, the NSL structures are
exported to the upper-level application according to mapping rules. Extraction rules
are expressed in HEL (HTML Extraction Language), which uses paths in the HTML parse
tree (i.e. the DOM tree) to address the data to be located. For example, users can
apply regular expressions to match or split (following the programming language syntax)
the string obtained from a DOM tree path.
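The Python sketch below imitates the flavour of this layered approach using only the standard library: it parses a small, well-formed HTML fragment into a tree, addresses one cell through a tree path, and then refines the cell's text with a regular expression split. It is not W4F or HEL code; the page fragment, the path and the field names are invented for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Retrieval layer (simplified): assume the page has already been fetched and
# cleaned up into a well-formed fragment.
page = """
<html>
  <body>
    <table>
      <tr><td>Net revenue: 1,250 | 1,480 | 1,730</td></tr>
    </table>
  </body>
</html>
"""

# Extraction layer: address the cell with a tree path, then apply a regular
# expression split, in the spirit of "DOM path + regular expression" rules.
tree = ET.fromstring(page)
cell_text = tree.find("./body/table/tr/td").text
label, values = cell_text.split(":", 1)
yearly_values = re.split(r"\s*\|\s*", values.strip())

# Mapping layer (simplified): export to a plain Python structure.
record = {"field": label.strip(), "values": yearly_values}
print(record)
# {'field': 'Net revenue', 'values': ['1,250', '1,480', '1,730']}
```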
2.1.3 XWRAP
The wrapper generation process includes two phases: structure analysis and
source-specific XML generation. First, XWRAP fetches and cleans up a page, generates a
parse tree, and then identifies regions and semantic tokens. In the second phase, the
system generates an XML template file based on the content tokens and the nesting
hierarchy specification, and then constructs a source-specific XML generator. The tool
requires the user to understand the HTML parse tree and to identify, for example, the
separating tags for rows and columns in a table. It relies mainly on extraction rules
over the DOM tree; no learning algorithm is used.
2.2 Supervised IE Systems
Supervised IE systems take a set of input documents labeled with examples of the data
to be extracted and output a wrapper. The user provides an initial set of annotated
examples to train the system. For such systems, general users rather than programmers
can be trained to use the labeling GUI, which reduces the cost of wrapper generation.
Representative systems are SRV [4], RAPIER [6] and WHISK [12].
2.2.1 SRV
SRV is a top-down relational learning algorithm that generates single-slot extraction
rules. The input documents are tokenized, and all substrings of contiguous tokens are
labeled as either extraction targets or not. The rules generated by SRV are logic rules
that rely on a set of token-oriented features, which can be either simple or relational.
A simple feature is a function that maps a token to a discrete value such as its length,
character type (e.g., numeric), orthography (e.g., capitalized) or part of speech (e.g.,
verb). A relational feature maps a token to another token, e.g., the contextual
(previous or next) tokens of the input token.
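To illustrate what such token-oriented features might look like, the hedged Python sketch below computes a few simple features (length, character type, orthography) and two relational features (previous and next token) for each token of an invented sentence. The feature names and values are ours, not SRV's actual feature set.

```python
def simple_features(token):
    """Simple features: each maps a token to a discrete value."""
    if token.isdigit():
        char_type = "numeric"
    elif token.isalpha():
        char_type = "alphabetic"
    else:
        char_type = "mixed"
    return {
        "length": len(token),
        "char_type": char_type,
        "orthography": "capitalized" if token[:1].isupper() else "lowercase",
    }

def relational_features(tokens, i):
    """Relational features: map a token to its neighbouring tokens."""
    return {
        "prev_token": tokens[i - 1] if i > 0 else None,
        "next_token": tokens[i + 1] if i + 1 < len(tokens) else None,
    }

tokens = "Net profit reached 1200 billion VND".split()
for i, token in enumerate(tokens):
    print(token, simple_features(token), relational_features(tokens, i))
```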
2.2.2 RAPIER
RAPIER also focuses on field-level extraction but uses a bottom-up (compression-based)
relational learning algorithm: it begins with the most specific rules and then replaces
them with more general ones. RAPIER learns single-slot extraction patterns that make use
of syntactic and semantic information, including a part-of-speech tagger and a lexicon
(WordNet). It also uses templates to learn extraction patterns. Each extraction rule
contains three distinct patterns: the pre-filler pattern matches the text immediately
preceding the filler, the filler pattern matches the actual slot filler, and the
post-filler pattern matches the text immediately following the filler. As an example,
Figure 2.2 shows the extraction rule for a book title, which is immediately preceded by
the words "Book", "Name" and "</b>", and immediately followed by the word "<b>".

Figure 2.2. RAPIER extraction rule
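A rough way to picture this pre-filler / filler / post-filler structure is the single regular expression below, written in Python, which recovers a book title that sits between "Book Name</b>" and "<b>". The HTML snippet is invented, and the regular expression is only our paraphrase of the rule in Figure 2.2, not RAPIER's own pattern language.

```python
import re

# An invented snippet laid out like the example described in the text.
snippet = "<b>Book Name</b> Artificial Intelligence: A Modern Approach <b>Author</b>"

# pre-filler:  the words "Book", "Name" and the tag "</b>"
# filler:      the actual slot value (the book title)
# post-filler: the tag "<b>"
rule = re.compile(r"Book\s+Name</b>\s*(?P<title>.+?)\s*<b>")

match = rule.search(snippet)
if match:
    print(match.group("title"))
# Artificial Intelligence: A Modern Approach
```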
2.2.3 WHISK
WHISK uses a covering learning algorithm to generate multi-slot extraction rules for a
wide variety of documents, ranging from structured text to free text. When applied to
free text, WHISK works best with input that has been annotated by a syntactic analyzer
and a semantic tagger. WHISK rules are based on a form of regular expression patterns
that identify the context of relevant phrases. For structured or semi-structured text, a
document is broken into multiple instances based on HTML tags or other regular
expressions. For free text, a sentence analyzer segments the text into instances, where
each instance is a clause, a sentence or a sentence fragment. Further pre-processing may
automatically add semantic tags or syntactic annotations. WHISK begins with untagged
instances and an empty set of tagged training instances. At each iteration, a set of
untagged instances is selected and presented to the user for annotation, and WHISK
creates a rule from a seed instance, as shown in Figure 2.3.

Figure 2.3. Creating a rule from a seed instance

2.3 Semi-Supervised IE Systems
In this section, we consider two such systems: IEPAD [13] and OLERA [14].
2.3.1 IEPAD
IEPAD generalizes extraction patterns from unlabeled Web pages. The method exploits the
fact that, if a Web page contains multiple data records, they are usually rendered
regularly through the same template; thus, repetitive patterns can be discovered if the
page is well encoded, and learning a wrapper reduces to discovering these repetitive
patterns. IEPAD uses a data structure called a PAT tree, a binary suffix tree, to
discover repetitive patterns in a Web page. It further applies the center star algorithm
to align the multiple strings that start from each occurrence of a repeat and end before
the start of the next occurrence.
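The Python sketch below gives a much simplified flavour of this idea: it encodes an invented page fragment as a sequence of tag tokens and searches, by brute force, for the longest tag sequence that repeats, standing in for what IEPAD does far more efficiently with a PAT tree and multiple string alignment.

```python
import re
from collections import Counter

# An invented page fragment whose records are generated from one template.
page = """
<table>
  <tr><td><b>ACB</b></td><td><i>1,200</i></td></tr>
  <tr><td><b>VCB</b></td><td><i>4,500</i></td></tr>
  <tr><td><b>BIDV</b></td><td><i>3,300</i></td></tr>
</table>
"""

# Encode the page as a string of tag tokens; the text content is ignored.
tags = re.findall(r"</?[a-z]+>", page)

# Count every contiguous tag sequence of length 2 to 10 and keep the longest
# one that occurs more than once: a crude stand-in for suffix-tree discovery.
counts = Counter(
    tuple(tags[i:i + n])
    for n in range(2, 11)
    for i in range(len(tags) - n + 1)
)
repeated = [seq for seq, c in counts.items() if c >= 2]
print(" ".join(max(repeated, key=len)))
# <tr> <td> <b> </b> </td> <td> <i> </i> </td> </tr>
```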

2.3.2 OLERA
OLERA acquires a rough example from the user for extraction rule generation. It can
learn extraction rules for pages containing single data records. OLERA consists of three
main operations: (1) enclosing an information block of interest, where the user marks an
information block containing a record to be extracted and OLERA discovers other similar
blocks and generalizes them into an extraction pattern (using a multiple string
alignment technique); (2) drilling down / rolling up an information slot, where drilling
down allows the user to navigate from a text fragment to more detailed components, while
rolling up combines several slots into a meaningful information unit; and (3)
designating relevant information slots for schema specification, as in IEPAD.
2.4 Unsupervised IE Systems
Unsupervised IE systems use no labeled training examples and require no user interaction
to generate a wrapper. In contrast to supervised IE systems, where the extraction
targets are specified by the users, here the extraction target is defined as the data
that was used to generate the page, or the non-tag text in the data-rich regions of the
input page. For this type of system, we examine RoadRunner [15] and DEPTA.
2.4.1 RoadRunner
RoadRunner considers the site generation process as the encoding of the original
database content into strings of HTML code, and data extraction as the corresponding
decoding process. Therefore, generating a wrapper for a set of HTML pages corresponds to
inferring a grammar for the HTML code. The system uses the ACME (Align, Collapse, Match
and Extract) matching technique to compare HTML pages of the same class and generate a
wrapper based on their similarities and differences. It starts by comparing two pages,
using the ACME technique to align the matched tokens and collapse the mismatched ones.
There are two kinds of mismatches: string mismatches, which are used to discover
attributes (#PCDATA), and tag mismatches, which are used to discover iterators (+) and
optionals (?). Figure 2.4 shows an example of matching the first two pages of the
running example, together with the generated wrapper. To reduce complexity, RoadRunner
adopts union-free regular expressions (UFRE).
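To give a feel for how comparing two pages of the same class exposes the data, the Python sketch below walks two token lists in parallel and treats every position where the tokens differ as a data attribute. This crude position-by-position comparison ignores iterators and optionals entirely, so it is only a hedged illustration of the string-mismatch idea, not the ACME algorithm; the two token lists are invented.

```python
# Two pages of the same class, already tokenized (invented for illustration).
page_a = ["<html>", "<b>", "Company:", "</b>", "ACB",
          "<b>", "Profit:", "</b>", "1,200", "</html>"]
page_b = ["<html>", "<b>", "Company:", "</b>", "VCB",
          "<b>", "Profit:", "</b>", "4,500", "</html>"]

wrapper = []   # the inferred template
fields = []    # positions where string mismatches reveal attributes

for position, (tok_a, tok_b) in enumerate(zip(page_a, page_b)):
    if tok_a == tok_b:
        wrapper.append(tok_a)        # matched token: part of the template
    else:
        wrapper.append("#PCDATA")    # string mismatch: a data attribute
        fields.append(position)

print(" ".join(wrapper))
# <html> <b> Company: </b> #PCDATA <b> Profit: </b> #PCDATA </html>
print(fields)
# [4, 8]
```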



