
Methodology of relational datamining for stock market prediction


VIETNAM NATIONAL UNIVERSITY, HANOI
COLLEGE OF TECHNOLOGY
***

CHU THAI HOA

METHODOLOGY OF
RELATIONAL DATAMINING FOR
STOCK MARKET PREDICTION
Major: Information Technology
Code: 1.01.10

MASTER'S THESIS

Instructor: Prof. Dr. HO TU BAO

Hanoi, June 2007


ABSTRACT
This thesis presents the methodology of relational data mining for stock market
prediction. It first clarifies each problem related to the keywords (methodology,
relational, data mining, stock market, and prediction) and then develops the
methodology of relational data mining, with an emphasis on Machine Methods for
Discovering Regularities (MMDR) for stock market prediction.
Stock market prediction has been widely studied as a time-series prediction
problem. Deriving relationships that allow one to predict future values of a
time series is challenging. One approach to prediction is to spot patterns in the
past, when we already know what followed them, and to test them on more recent
data. If a pattern is followed by the same outcome frequently enough, we can gain
confidence that it is a genuine relationship.
The purpose of relational data mining (RDM) is to overcome the limitations of
attribute-based learning methods (commonly used in finance) in representing
background knowledge and complex relations. RDM approaches look for patterns
that involve multiple tables (relations) from a relational database. This approach will
play a key role in future advances in data mining methodology and practice.
The MMDR method is one of the few hybrid probabilistic relational data mining
methods developed and applied to stock market data. The method has an advantage
in handling numerical data. It expresses patterns in First-order Logic (FOL) and
assigns probabilities to rules generated by composing patterns. This is illustrated
through an application of MMDR in a computational experiment on price
index data of the Standard and Poor's 500.
The thesis consists of 3 chapters concentrating on relational data mining
methodology for stock market prediction.

Methodology of Relational Data mining for Stock Market Prediction


ACKNOWLEDGEMENTS
This thesis would not have been completed without the help and support of
many people. I would like to take this opportunity to express my gratitude to
everyone who helped me during the work leading to the thesis.

In particular, I would like to thank my instructor, Prof. Dr. HO Tu Bao, for his
courage in accepting me as a Master's student, and for his enthusiasm, his knowledge
and his encouragement throughout the work. I would never have been able to finish
this thesis without his encouragement as well as his strict requirements for the
quality of the research.
I also enjoyed and appreciated the fruitful exchange of ideas with Dr. NGUYEN
Trong Dung, to whom I am also grateful for comments on the thesis. In the early
days of my research, Dr. HA Quang Thuy, Dr. PHAM Tran Nhu and Dr. DO Van
Thanh stimulated my interest in data mining for financial forecasting. I am thankful
for that and for the many discussions I had with them.
I am indebted to CFO LE The Anh and CFO NGUYEN Minh Quang for their
patience with my questions on financial and stock market forecasting. I am also
grateful to Dr. PHAM Ngoc Khoi, Dr. NGUYEN Phu Chien, MSc. DAO Van Thanh
and Mrs. LE Thi Hoang My for their words of encouragement during the months of
work on the thesis and for their style-improving suggestions. My thanks also go to
everyone who has provided support or advice to me on data mining, the stock
market, forecasting and so on in one way or another.
My family has created good conditions for me to complete the thesis. I
dedicate the thesis to my father, my mother and my younger brother, whose love and
support are always with me.

Hanoi, June 2007,
CHU Thai Hoa.



TABLE OF CONTENTS

ABSTRACT
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
LIST OF TABLES AND FIGURES
LIST OF ABBREVIATIONS
INTRODUCTION
Problem definition
Motivations of the Thesis
Objectives of the Thesis
Method of the Thesis study
Structure of the Thesis
CHAPTER I: OVERVIEW OF STOCK MARKET PREDICTION IN DM
1.1. Introduction to stock market prediction
1.1.1. Basic concepts of forecast
1.1.2. Prediction tasks in stock market
1.1.3. Stock market time series properties
1.1.4. Stock market prediction with the efficient market theory
1.1.5. Questions in stock market prediction
1.1.6. Challenges and Possibilities on Developing a Stock Market Prediction System
1.2. Data mining methodology for stock market prediction
1.2.1. Prediction in data mining
1.2.2. Parameters
1.2.3. Approaches to stock market prediction
1.2.4. Data mining methods in stock market
CHAPTER II: RELATIONAL DATA MINING FOR STOCK MARKET PREDICTION
II.1. Introduction
II.2. Basic problems
II.2.1. First-order logic and rules
II.2.2. Representative measurement theory
II.2.3. Breadth-first search
II.2.4. Occam's razor principle
II.3. Theory of RDM
II.3.1. Data types in RDM
II.3.2. Relational representation of examples
II.3.3. Background knowledge and problems of search for regularities
II.4. An algorithm for RDM: MMDR
II.4.1. Motivations of choice for MMDR
II.4.2. Some concepts
II.4.3. Algorithm MMDR
CHAPTER III: AN APPLICATION OF MMDR TO STOCK PRICE PREDICTION
III.1. MMDR model for prediction
III.2. Experiment preparation
III.2.1. Data description and representation
III.2.2. Demo program
III.3. Application of MMDR model
III.3.1. Step 1: Generating logical rules
III.3.2. Step 2: Learning logical rules
III.3.3. Step 3: Creating intervals
III.4. Results and evaluations
III.4.1. Stability of discovered rules on test data
III.4.2. Evaluations of forecast performance
CONCLUSIONS
Contributions of the thesis
Limitations of the thesis
Future work
Summary
APPENDICES
Source code
REFERENCES
In English
In Vietnamese
Website



LIST OF TABLES AND FIGURES

Comparison of AVL-based methods and first-order logic methods
UpDown predicate
Predicates Up and Down
Examples of terms
Attribute-based data example
Partial background knowledge for stock market
Figure III.1: Flow diagram for MMDR model: steps and techniques
Training set and Test set
Examples of rule consistent with hypotheses H1-H4
Table A.1: Stability checking table
Table A.2: Performance metrics for a set of 125 regularities
Figure A.1: Performance of 125 found regularities on test data
Table A.3: Performance metrics for a set of 292 regularities
Figure A.2: Performance of 125 found regularities on test data
Table A.5: Performance for regularity with conditional probability of 0.49
Figure A.3: Performance of an individual regularity with conditional probability of 0.49 on test data
Table A.6: Performance for regularity with conditional probability of 0.84
Figure A.4: Performance of an individual regularity with conditional probability of 0.84 on test data
Table A.7: Forecast result for the day December 1st, 2006 (the regularity with conditional probability of 0.84)
Table A.8: Forecast result for the day December 1st, 2006 (the set of 292 regularities with conditional probability not less than 0.65)



LIST OF ABBREVIATIONS
AI : Artificial Intelligence
AVL(s) : Attribute-value language(s)
DM : Data mining
FOL : First-order Logic
ILP : Inductive Logic Programming
ML : Machine Learning
MMDR : Machine Methods for Discovering Regularities
MRDM : Multi-Relational Data mining
RDM : Relational Data mining
RMT : Representative measurement theory



INTRODUCTION
Problem definition

There are four major technological reasons stimulating data mining
development, applications and public interest: the emergence of very large databases;
advances in computer technology; fast access to vast amounts of data; and the ability
to apply computationally intensive statistical methodology to these data.
Data mining is the process of discovering hidden patterns in data. Due to the
large size of databases, the importance of the information stored, and the value of
the information obtained, finding hidden patterns in data has become increasingly
significant. The stock market is an area in which large volumes of data are created
and stored on a daily basis.
Financial forecasting has been widely studied as a case of the time-series prediction
problem. Time series such as those of the stock market are often seen as
non-stationary, which presents challenges in predicting future values. The efficient
market theory states that it is practically impossible to predict financial markets
long-term. However, there is good evidence that short-term trends do exist and
programs can be written to find them. The data miners' challenge is to find the trends
quickly while they are valid, as well as to recognize the time when the trends are no
longer effective. Data mining methods provide the framework for stock market
prediction to discover hidden trends and patterns.
Well-known and commonly used data mining methods in stock market are
attribute-based learning methods, but they have some serious drawbacks: a limited
ability to represent background knowledge and a lack of complex relations. The
purpose of RDM is to overcome these limitations. RDM is a learning method that is
better suited for stock market mining, with a better ability to explain discovered
rules than other symbolic approaches.
However, current relational methods are relatively inefficient and have rather
limited facilities for handling numerical data. RDM as a hybrid learning method
combines the strengths of FOL and probabilistic inference to meet these challenges.
MMDR, one of the few hybrid probabilistic relational data mining methods, handles
numerical data efficiently and has been developed and applied to stock market data.
It is believed that now is the time for RDM methods, and in particular that applying
MMDR to stock market prediction has advantages in discovering regularities in
stock market time series.



Motivations of the Thesis

Until a few years ago, Vietnam's stock market was still at an early stage of
development and thus did not attract much attention from investors and researchers.
In particular, for interested learners, mastering professional methods of stock market
analysis and forecasting requires time and wide background knowledge of all the
fields covered. Moreover, according to the efficient market theory, it is
practically impossible to infer a fixed long-term global forecasting model from
historical stock market information. Therefore, few Vietnamese researchers have
been interested in, or carried out research on, stock market prediction.
The last two years have witnessed the surprising development of the Vietnamese
stock market with a host of notable events. In particular, since Vietnam became a
World Trade Organization (WTO) member, the Vietnamese economy has had many
opportunities to develop, leading to the development of many companies and
markets, including the financial and stock markets. It is said that Vietnam's stock
market will grow rapidly in the coming years and will rank second in the region, just
after China, in terms of growth rate.
Given the rapid development of Vietnam's financial market, professional
activities such as analysis and prediction of the financial market should be paid more
attention. In particular, these activities play a significant role in the task of
macroeconomic forecasting at the National Center for Socio-economic Information
and Forecast (under the Ministry of Planning and Investment), which helps make
sound policies related to socio-economic management and regulation at the macro
level. Data mining provides some methods and techniques that can help approach
stock market prediction quite effectively.
In fact, there have already been some studies and successful applications of data
mining techniques to stock market forecasting. However, acquiring the knowledge
and application techniques of each approach is quite challenging and time-consuming.
I read some papers and paid special attention to research on relational data
mining in finance by two researchers, Prof. Dr. Boris Kovalerchuk and Dr. Evgenii
Vityaev. They reported that, "Mining stock market data presents special challenges.
For one, the rewards for finding successful patterns are potentially enormous, but so
are the difficulties and sources of confusions. The efficient market theory states that
it is practically impossible to predict financial markets long-term. However, there is
good evidence that short-term trends do exist and programs can be written to find
them. The data miners' challenge is to find the trends quickly while they are valid, to
deal effectively with time series and calendar effects, as well as to recognize the
time when the trends are no longer effective".
The learning method RDM is able to learn more expressive rules, make better
use of underlying domain knowledge and explain discovered rules better than other
symbolic approaches. It is thus better suited for stock market mining. This approach
will play a key role in future advances in data mining methodology and practice.
The earlier algorithms for RDM suffer from relative computational inefficiency
and have rather limited tools for processing numerical data. This problem especially
needs to be considered in stock market analysis, where data commonly
are numerical time series. Therefore, RDM as a hybrid learning method
combining the strengths of FOL and probabilistic inference has been developed to
meet these challenges. MMDR, one of the few hybrid probabilistic relational data
mining methods, handles numerical data efficiently and has been developed and
applied to stock market forecasting.
The common question "Can stock market prediction be profitable?" is often
put to any research on methods of stock market prediction. In fact, there are few
people doing research on RDM for stock market forecasting, because it requires
interested learners to have wide background knowledge to understand all the fields
covered. Much less has been reported publicly on the success of data mining in real
trading by financial institutions. If real success is reported, then competitors can
apply the same methods and the leverage will disappear, because in essence all
fundamental data mining methods are not proprietary. I once focused my
study on an attempt to end up with a Master's Degree and as a millionaire (kidding),
but this is too high a risk to take.
Based on practical suggestions and requirements, as well as my
personal interest, I decided to do research on stock market forecasting.
Through some school lessons and extra self-learning efforts, I studied some data
mining techniques to seek a solution to the task. The above motivates the aim of
the thesis - to carry out research and experiments on the methodology of RDM for
stock market prediction.



Objectives of the Thesis
- Systematic organization of RDM methodology for stock market
prediction
Most of the existing studies on RDM for stock market prediction are reported in
a brief, overview style, which causes difficulties for many readers. The thesis is
primarily based on the book "Data Mining in Finance: Advances in Relational and
Hybrid Methods" and some papers by the two researchers Dr. Kovalerchuk and Dr.
Vityaev. However, after gaining a thorough grasp of the RDM methodology, I
systematically organize the methodology, especially the MMDR algorithm, from my
own point of view and supplement it with further knowledge of data mining and
stock market forecasting. Hopefully, this helps newcomers approach the problem
more easily.
- Experimental application of the MMDR method to stock market prediction
The centre of the thesis is the issue of discovering regularities in stock price series,
addressed and illustrated through MMDR. The thesis also carries out an RDM
application to stock market prediction through an experiment with a small
self-developed program on a set of Standard and Poor's data. The experiment helps
the reader understand and trust the feasibility and efficiency of the RDM
methodology and the MMDR algorithm presented in the thesis.

Method of the Thesis study
The study behind the thesis has been mostly goal driven. As problems appeared
on the way to realizing stock market prediction, they were tackled by various means
as listed below:
• Investigation of some existing machine learning and data mining methods
through related documents such as doctoral theses, Master's theses, online
papers, books, etc.
• Reading of financial and stock market literature for properties, forecasting
techniques and hints of regularities in stock market data that can be exploited.
• Learning about some existing stock market prediction software for a deeper
understanding of the regularities discovered.
• Some theoretical considerations on the mechanisms behind the generation of
stock data, and on general predictability demands and limits.
• Practical insights into the realm of trading in the stock market.
• Contacts with experts on data mining and data mining software development,
with stock market investors and chief financial officers.
• Courses on economic forecasting and the stock market, mostly organized by the
National Center for Socio-economic Information and Forecast.
• Collection of related documents and systematization of the Master's thesis.
• Programming in PHP and carrying out experiments to illustrate and to prove the
main idea and algorithm presented in the thesis.

Structure of the Thesis
The thesis is structured in the following way. The first part introduces the
problem definition, method of study, objectives and structure of the thesis.
Chapter 1 provides an overview of stock market prediction in data mining
in two parts. "Introduction to stock market prediction" includes
basic concepts of stock market forecasting, data mining with the Efficient Market
Theory, stock market time series properties, and challenges and possibilities in
developing a stock market prediction system, etc. The last part, "Data mining
methodology for stock market prediction", presents some major types of data mining
prediction, approaches to stock market prediction and comparisons of representation
languages and data mining methods used in stock market.
Chapter 2 discusses some basic problems, the theory of RDM and the MMDR
algorithm. In comparison with other data mining methods, the RDM approach is
considered from the point of view of its data types, representation languages
(to manipulate and interpret data) and class of hypotheses (to be tested on data). The
chapter mainly introduces MMDR, one of the few hybrid probabilistic relational
data mining methods, which is equipped with a probabilistic mechanism that is
necessary for time series with a high level of noise.
In Chapter 3, an MMDR application to stock market price prediction illustrates
the methodology through three steps: rule generating, rule learning and
interval creating. This chapter also presents some statistical results and evaluations
for the experiment conducted to demonstrate the application.
Finally, the contributions, limitations and future work of my research are given in
the conclusion of the thesis. In the appendix, the thesis also provides some
table structures and source code developed by myself that are used for the experiment.



CHAPTER I: OVERVIEW OF STOCK MARKET PREDICTION IN DATA MINING

1.1. Introduction to stock market prediction

1.1.1. Basic concepts of forecast
This section provides a brief overview of the basic concepts of forecasting. An
introductory discussion of the topic can be found in [46] - Michael Leonard,
Large-Scale Automatic Forecasting: Millions of Forecasts, International Symposium
of Forecasting, 2002.
Forecasts are time series predictions made for future periods in time. They are
random variables and therefore have an associated probability distribution. The
mean or median of each forecast is called the prediction. The variance of each
forecast is called the prediction error variance and the square root of the variance is
called the prediction standard error. The variance is computed from the forecast
model parameter estimates and the model residual variance.
The forecast for the next future period is called the one-step ahead forecast. The
forecast for h periods in the future is called the h-step ahead forecast. The forecast
horizon or forecast lead is the number of periods into the future for which
predictions are made (one-step, two-step,..., h-step). The larger the forecast horizon,
the larger the prediction error variance at the end of the horizon.
The confidence limits are based on the prediction standard errors and a chosen
confidence limit size. A confidence limit size of 0.05 results in 95% confidence
limits. The confidence limits are often computed assuming a normal distribution, but
others could be used. As with the prediction standard errors, the width of the
confidence limits increases with the forecast horizon.
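
The relationship between forecast horizon, prediction standard error and confidence
limits can be sketched as follows. This is only an illustration under a stated
assumption: the series is modelled as a driftless random walk (an assumption chosen
for simplicity, not taken from [46]), for which the h-step prediction error variance is
h times the one-step residual variance. The function name and the input numbers are
hypothetical.

```python
import math
from statistics import NormalDist


def random_walk_forecasts(last_value, residual_variance, horizon, alpha=0.05):
    """Return (h, prediction, std_error, lower, upper) for h = 1..horizon.

    Assumes a driftless random walk, so the h-step prediction error
    variance is h * residual_variance and the prediction stays at last_value.
    """
    # Two-sided normal quantile; a limit size of 0.05 gives 95% limits.
    z = NormalDist().inv_cdf(1 - alpha / 2)
    forecasts = []
    for h in range(1, horizon + 1):
        std_error = math.sqrt(h * residual_variance)
        forecasts.append((h, last_value, std_error,
                          last_value - z * std_error,
                          last_value + z * std_error))
    return forecasts


for h, pred, se, lower, upper in random_walk_forecasts(100.0, 4.0, 3):
    print(f"h={h}: prediction={pred:.2f}, std error={se:.2f}, "
          f"95% limits=({lower:.2f}, {upper:.2f})")
```

Each printed row is one forecast in the sense used here: a prediction together with
its standard error and confidence limits, with the limits widening as h grows.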
The prediction error is the difference between the predicted value and the actual
value when the actual value is known. For transformed models, it is important to
understand the difference between the model errors (or residuals) and the prediction
errors. The residuals measure the departure from the model in the transformed
metric. The prediction errors measure the departure from the original series.
Taken together, the predictions, prediction standard errors, and confidence
limits at each period in the forecast horizon are the forecasts. Although many people
use the word "forecast" to imply only prediction, a forecast is not one number for
each future time period.
Using a transformed forecasting model requires the following steps:
• The time series data are transformed.
• The transformed time series data are fit using the forecasting model.
• The forecasts are computed using the parameter estimates and the
transformed time series data.
• The forecasts (predictions, prediction standard errors, and confidence limits)
are inverse transformed.
The naive inverse transformation results in median forecasts. Obtaining mean
forecasts requires that the prediction and the prediction error variance both be
adjusted based on the transformation. Additionally, the model residuals will be
different from the prediction errors due to this inverse transformation. If no
transformation is used, the model residual and the prediction error will be the same,
and likewise the mean and median forecast will be the same (assuming a symmetric
disturbance distribution).
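
The difference between median and mean forecasts after an inverse transformation
can be illustrated with a logarithmic transformation. The sketch below assumes that
the forecast error on the log scale is normally distributed, so the back-transformed
forecast is lognormal; the log transformation, the function name and the numbers
are illustrative assumptions rather than part of the cited material.

```python
import math


def inverse_log_forecast(pred_log, var_log):
    """Back-transform a forecast made on log-transformed data.

    Assumes the log-scale forecast is normal with mean pred_log and
    prediction error variance var_log.
    """
    median = math.exp(pred_log)                # naive inverse transformation
    mean = math.exp(pred_log + var_log / 2.0)  # adjusted using the variance
    return median, mean


median, mean = inverse_log_forecast(math.log(100.0), 0.04)
print(f"median forecast = {median:.2f}, mean forecast = {mean:.2f}")
```

The mean forecast exceeds the median one whenever the log-scale variance is
positive, which is why the two coincide only when no transformation is used.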
The statistics of fit evaluate how well a forecasting model performs by
comparing the actual data to the predictions. For a given forecast model that has
been fitted to the time series data, the model should be checked or evaluated to see
how well it fits or forecasts the data. The statistics of fit can be computed from the
model residuals or the prediction errors.
When a particular statistic of fit is used for forecast model selection, it is
referred to as the model selection criterion. When using model selection criteria to
rank forecasting models, it is important to compare the errors on the same metric,
that is, you should not compare transformed model residuals with non-transformed
model residuals. You should first inverse transform the forecasts from the
transformed model prior to computing the prediction errors and then compute the
model selection criterion based on the prediction errors.
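
A minimal sketch of ranking two candidate models on the same metric is given
below. It assumes root mean squared error (RMSE) as the model selection criterion
and a log transformation for the second candidate; both choices, and all numbers,
are hypothetical and serve only to show the order of operations.

```python
import math


def rmse(actual, predicted):
    """Root mean squared prediction error, computed on the original metric."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


actual = [100.0, 102.0, 101.0, 105.0]

# Candidate A predicts directly on the original scale.
pred_a = [99.0, 101.0, 102.0, 104.0]

# Candidate B predicts on a log scale, so its predictions are inverse
# transformed before any prediction error is computed.
pred_b_log = [math.log(v) for v in (100.5, 101.5, 101.5, 104.0)]
pred_b = [math.exp(v) for v in pred_b_log]

scores = {
    "A (untransformed)": rmse(actual, pred_a),
    "B (log model, back-transformed)": rmse(actual, pred_b),
}
best = min(scores, key=scores.get)
print(scores, "->", best)
```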

1.1.2. Prediction tasks in stock market

Boris Kovalerchuk, Evgenii Vityaev, Data Mining For Financial Applications, In:
O. Maimon, L. Rokach (Eds.): The Data Mining and Knowledge Discovery
Handbook, Springer 2005, pp. 1203-1224
Stock market prediction includes uncovering market trends, planning
investment strategies, and identifying the best time to purchase stocks and which
stocks to purchase. Prediction tasks in stock market are typically posed in one of two
forms:
• Straight prediction of a numeric stock market characteristic, e.g., stock
return or exchange rate.
• Prediction of whether the stock market characteristic will increase or
decrease.
In the first case, it is necessary to take into account the trading cost and the
significance of the trading return. In the second case, it is necessary to forecast
whether the stock market characteristic will increase or decrease by no less than
some threshold. Thus, the difference between data mining methods for the two cases
can be less obvious than it first appears, because the second case may require some
kind of numeric forecast.
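
The second form of the task, with a threshold, can be sketched as a labelling
function. The 1% threshold below is a hypothetical stand-in for trading costs, and
the function name and prices are illustrative only.

```python
def label_moves(prices, threshold=0.01):
    """Label each move between consecutive prices as 'up', 'down' or 'flat'.

    A move counts as 'up' or 'down' only if the relative change is at
    least `threshold` (a hypothetical value standing in for trading costs).
    """
    labels = []
    for prev, curr in zip(prices, prices[1:]):
        change = (curr - prev) / prev
        if change >= threshold:
            labels.append("up")
        elif change <= -threshold:
            labels.append("down")
        else:
            labels.append("flat")
    return labels


print(label_moves([100.0, 102.0, 101.5, 99.0, 99.2]))
```

Deciding between "up" and "flat" already requires an estimate of the size of the
change, which is why the second form is close to a numeric forecast.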

Financial institutions produce huge datasets that build a foundation for approaching
these enormously complex and dynamic problems with data mining tools. The
potentially significant benefits of solving these problems have motivated extensive
research for years.

1.1.3. Stock market time series properties
One may wonder if there are universal characteristics of the many series coming
from markets different in size, location, sophistication, etc. The surprising fact is that
there are. Moreover, interacting systems in other fields, such as statistical mechanics,
suggest that the properties of stock market time series loosely depend on the market
microstructure and are common to a range of interacting systems. Such observations
have stimulated new models of markets based on analogies with particle systems and
brought in new analysis techniques opening the era of econophysics. A more
detailed discussion of stock market time series properties can be found in [66] -
Stefan Zemke, On Developing a Financial Prediction System: Pitfalls and
Possibilities, First International Workshop on Data Mining Lessons Learned at
ICML'02, 2002. This section briefly introduces the following stock market time
series properties:
- Distribution
The distribution of stock market series tends to be non-normal, sharp-peaked and
heavy-tailed, these properties being more pronounced for intraday values. Such
observations were pioneered, interestingly, around the time the Efficient Market
Hypothesis (EMH) was formulated. Extreme values appear more frequently in a stock
market series than in a normally distributed series of the same variance. This is
important to the practitioner, since often the values cannot be disregarded as
erroneous outliers but must be actively anticipated, because their magnitude can
influence trading performance.
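
One common way to quantify heavy tails is the sample excess kurtosis, which is zero
for a normal distribution and positive for sharp-peaked, heavy-tailed data. The
sketch below uses made-up returns with two extreme values; the measure and the
data are illustrative assumptions and are not taken from [66].

```python
def excess_kurtosis(values):
    """Sample excess kurtosis: fourth central moment / variance^2 - 3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 ** 2) - 3.0


# Hypothetical daily returns: mostly small moves plus two extreme days.
returns = [0.001, -0.002, 0.0015, -0.001, 0.0005,
           -0.0005, 0.08, -0.07, 0.002, -0.0015]
print(f"excess kurtosis = {excess_kurtosis(returns):.2f}")
```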
- Scaling property
The scaling property of a time series indicates that the series is self-similar at
different time scales. This is common in stock market time series: given a plot of
returns without axis labels, it is next to impossible to say whether it represents
hourly, daily or monthly changes, since all the plots look similar, with differences
appearing only at minute resolution. Thus prediction methods developed for one
resolution could, in principle, be applied to others.
- Data frequency
Data frequency refers to how often series values are collected: hourly, daily,
weekly, etc. Usually, if a stock market series provides values on a daily or longer
basis, it is low frequency data; otherwise, when many intraday quotes are included,
it is high frequency data. Tick-by-tick data includes all individual transactions, and
as such the event-driven time between data points varies, creating a challenge even
for such a simple calculation as correlation.
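
One way to make a correlation computable on irregular tick data is to sample both
series on a common time grid first. The sketch below uses last-tick sampling (the
latest price at or before each grid time); this is one possible approach chosen for
illustration, and the function names, tick times and prices are hypothetical.

```python
from bisect import bisect_right


def sample_last_tick(ticks, grid):
    """Return, for each grid time, the price of the latest tick at or before it.

    `ticks` is a list of (time, price) pairs sorted by time. Grid times
    earlier than the first tick yield None.
    """
    times = [t for t, _ in ticks]
    sampled = []
    for g in grid:
        i = bisect_right(times, g)
        sampled.append(ticks[i - 1][1] if i > 0 else None)
    return sampled


def pearson(x, y):
    """Pearson correlation of two equally long numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


# Two assets whose trades arrive at different, irregular times.
ticks_a = [(0.5, 10.0), (1.7, 10.2), (2.1, 10.1), (3.9, 10.4)]
ticks_b = [(0.2, 20.0), (0.9, 20.1), (2.8, 20.3), (3.5, 20.6)]
grid = [1, 2, 3, 4]

a = sample_last_tick(ticks_a, grid)
b = sample_last_tick(ticks_b, grid)
print(a, b, round(pearson(a, b), 3))
```

The result depends on the chosen grid spacing, which is part of the difficulty the
text refers to.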

1.1.4. Stock market prediction with the efficient market theory
The Efficient Market Theory/Hypothesis (EMH) initially got wide acceptance
in the financial community. It asserts, in weak form, that the current price of an asset
already reflects all information obtainable from past prices and assumes that news is
promptly incorporated into prices. Since news is assumed unpredictable, so are
prices. In other words, according to the EMH, the evolution of the prices for each
economic variable is a random walk. The variations in prices are completely
independent from one time step to the next in the long run. EMH states that it is
practically impossible to infer a fixed long-term global forecasting model from
historical stock market information. This idea is based on the observation that if the
market presents some kind of regularity then someone will take advantage of it and
the regularity disappears.
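
The random walk claim can be made concrete with a small simulation. The sketch
below builds a synthetic random walk from independent Gaussian steps (an
assumption made for illustration; real price changes need not be Gaussian) and
compares the lag-1 autocorrelation of the price changes with that of the price
levels. The seed, sample size and function name are arbitrary choices.

```python
import random


def lag1_autocorrelation(x):
    """Sample lag-1 autocorrelation of a sequence."""
    n = len(x)
    mean = sum(x) / n
    num = sum((x[i] - mean) * (x[i + 1] - mean) for i in range(n - 1))
    den = sum((v - mean) ** 2 for v in x)
    return num / den


rng = random.Random(0)
steps = [rng.gauss(0.0, 1.0) for _ in range(5000)]  # independent price changes

prices = [100.0]
for s in steps:
    prices.append(prices[-1] + s)  # each price is the previous one plus a step

# For a random walk the changes should show almost no autocorrelation,
# while the price levels themselves are highly persistent.
print("changes:", round(lag1_autocorrelation(steps), 3))
print("levels: ", round(lag1_autocorrelation(prices), 3))
```

Under the weak form of the EMH, a test of this kind on real price changes would
find no exploitable dependence; the following paragraphs discuss why real markets
deviate from that picture.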
However, real markets do not obey all the consequences of the hypothesis. For
example, a price random walk implies a normal distribution, which is not what is
observed, and there is a delay while the price stabilizes to a new level after news.
These observations, among others, lead to a more modern view: "Overall, the best
evidence points to the following conclusion.
The market isn't efficient with respect to any of the so-called levels of efficiency.
The value investing phenomenon is inconsistent with semi-strong form efficiency,
and the January effect is inconsistent even with weak form efficiency. Overall, the
evidence indicates that a great deal of information available at all levels is, at any
given time, reflected in stock prices. The market may not be easily beaten, but it
appears to be beatable, at least if you are willing to work at it."



The market efficiency theory does not exclude the possibility that hidden short-term
local conditional regularities may exist. These regularities cannot work "forever";
they should be corrected frequently. It has been shown that the stock market data are
not random and that the efficient market hypothesis is merely a subset of a larger
chaotic market hypothesis. This hypothesis does not exclude successful short-term
forecasting models for prediction of chaotic time series.
Data mining does not try to accept or reject the efficient market theory. Data
mining creates tools which can be useful for discovering subtle short-term
conditional patterns and trends in a wide range of stock market data. This means that
retraining should be a permanent part of data mining in stock market, and any claim
that a silver-bullet trading method has been found should be treated similarly to
claims that a perpetual motion machine has been discovered.
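
Permanent retraining is often organized as a walk-forward scheme, in which a model
is refitted on a recent window and applied only to the period immediately after it.
The sketch below generates such windows; the scheme, the function name and the
window sizes are illustrative assumptions and not a procedure prescribed by this
thesis.

```python
def walk_forward_windows(n_obs, train_size, test_size):
    """Yield (train_range, test_range) index pairs that roll forward in time."""
    start = 0
    while start + train_size + test_size <= n_obs:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += test_size  # move forward by one test period, then retrain


for train, test in walk_forward_windows(n_obs=10, train_size=4, test_size=2):
    print(f"retrain on {train.start}..{train.stop - 1}, "
          f"apply to {test.start}..{test.stop - 1}")
```

Each regularity is thus trusted only for a short period after the data it was learned
from, and is re-estimated as new data arrive.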
1.1.5. Questions in stock market prediction

The following are some questions of scientific and practical interest concerning
stock market prediction:
• Prediction possibility: Is statistically significant prediction of stock market
data possible? Is profitable prediction of such data possible, which involves
the answer to the former question adjusted by constraints imposed by real
markets?
• Methods: If prediction is possible, what methods are best at performing it?
What methods are best suited to what data characteristics - could it be said
in advance?
• Meta-methods: What are the ways to improve the methods? Can
metaheuristics successful in other domains, such as ensembles or pruning,
improve stock market prediction?
• Data: Can the amount and type of data needed for prediction be characterized?
• Data preprocessing: Can data transformations that facilitate prediction be
identified? In particular, what transformation formulae enhance input data?
• Evaluation: What are the features of a sound evaluation procedure, respecting
the properties of stock market data and the expectations of stock market
prediction? What are the common evaluation drawbacks?
• Predictor development: Are there any common features of successful
prediction systems? If so, what are they, and how could they be advanced?
Can common reasons for the failure of stock market prediction be identified?
Are they intrinsic and non-reparable, or is there a way to amend them?
• Transfer to other domains: Can the methods developed for stock market
prediction benefit other domains?



• Predictability estimation: Can stock market data be estimated reasonably
quickly to be predictable or not, without the investment needed to build a
custom system? What are the methods, what do they actually say, and what are
their limits?
• Consequences of predictability: What are the theoretical and practical
consequences of demonstrated predictability of stock market data, or of the
impossibility of it? How does a successful prediction method translate into
economic models? What could be the social consequences of stock market
prediction?

1.1.6. Challenges and Possibilities in Developing a Stock Market
Prediction System
Building a successful stock market prediction system presents many challenges.
Some are encountered over and over again, and though an individual solution might
be system-specific, general principles still apply. Using them as guidelines might
save time and effort and boost results, thereby promoting the project's success.
The idea of stock market prediction (and the resulting riches) is appealing and
has initiated countless attempts. In this competitive environment, if one wants
above-average results, one needs above-average insight and sophistication. Reported
successful systems are hybrid and custom made, whereas straightforward approaches,
e.g. a neural network plugged into relatively unprocessed data, usually fail. The
individuality of a hybrid system offers both chances and dangers. One can bring
together the best of many approaches; however, the complexity of their interaction
makes it hard to judge where a performance advantage or disadvantage comes from.
Stock market prediction has been widely studied as a case of the time-series
prediction problem. The difficulty of this problem is due to the following factors:
low signal-to-noise ratio, non-Gaussian noise distribution, nonstationarity, and
nonlinearity. Deriving relationships that allow one to predict future values of time
series is a challenging task when the underlying system is highly non-linear.
Usually, the history of the time series is provided and the goal is to extract from that
data a dynamic system. The dynamic system models the relationship between a
window of past values and a value T time steps ahead. Discovering such a model is
difficult in practice since the processes are typically corrupted by noise and can only
be partially modeled due to missing information and the overall complexity of the
problem. In addition, stock market time series are inherently non-stationary, so
adaptive forecasting techniques are required.
- Data Preprocessing



Before data is fed into an algorithm, it must be collected, inspected, cleaned and
selected. Since even the best predictor will fail on bad data, data quality and
preparation are crucial. Also, since a predictor can exploit only certain data
features, it is important to detect which data preprocessing/presentation works best.
• Visual inspection is invaluable. At first, one can look for a trend (and
whether it needs to be removed), the histogram (and whether the values should
be redistributed), missing values and outliers, and any regularities.
• Missing values are dealt with by data mining methods.
• Series to instances conversion is required by most learning algorithms, which
expect a fixed-length vector as input.
• Indicators are series derived from others, enhancing some features of
interest, such as trend reversal.
• Feature selection can make learning feasible because, owing to the curse of
dimensionality, long instances demand (exponentially) more data.
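
The series to instances conversion mentioned above can be illustrated with a short sketch. It is only an illustration under stated assumptions: the function name series_to_instances, the parameter names window and horizon (the T time steps ahead), and the price values are hypothetical and do not come from any particular prediction system.

```python
from typing import List, Tuple

Instance = Tuple[List[float], float]


def series_to_instances(series: List[float], window: int, horizon: int = 1) -> List[Instance]:
    """Convert a series into (window of past values, value `horizon` steps ahead) pairs."""
    if window < 1 or horizon < 1:
        raise ValueError("window and horizon must be positive")
    instances: List[Instance] = []
    # Stop early enough that the target index stays inside the series.
    for start in range(len(series) - window - horizon + 1):
        features = series[start:start + window]
        target = series[start + window + horizon - 1]
        instances.append((features, target))
    return instances


prices = [10.0, 11.0, 12.0, 11.5, 12.5, 13.0]  # made-up values for illustration
pairs = series_to_instances(prices, window=3, horizon=1)
```

Each pair holds a fixed-length vector of past values and the value to be predicted, which is the input form that most learning algorithms expect.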
- Prediction Algorithms
Common learning algorithms are listed below, together with their features that
are important to stock market prediction:
• Linear methods are widely used in stock market prediction.
• Neural networks seem to be the method of choice for stock market prediction.
• C4.5 and ILP generate decision trees or if-then rules, which are human
understandable if small.
• Nearest Neighbor does not create a general model, but to predict, it looks
back for the most similar case(s). Irrelevant/noisy features disrupt the
similarity measure, so pre-processing is worthwhile.
• A Bayesian classifier/predictor first learns the probabilities with which
evidence supports outcomes, and then uses them to predict the outcome for
new evidence.
• Support Vector Machines (SVM) are a relatively new and powerful learner,
having attractive characteristics for time series prediction.
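
As an illustration of the Nearest Neighbor idea in the list above, the following minimal sketch predicts a value by averaging the outcomes of the k most similar past cases. The Euclidean distance, the function name knn_predict and the training cases are assumptions made for this example only.

```python
import math
from typing import List, Sequence, Tuple


def knn_predict(train: List[Tuple[Sequence[float], float]],
                query: Sequence[float], k: int = 1) -> float:
    """Average the targets of the k training cases closest to `query`."""
    if k < 1 or not train:
        raise ValueError("need k >= 1 and at least one training case")

    def distance(a: Sequence[float], b: Sequence[float]) -> float:
        # Euclidean distance; noisy or irrelevant features distort it.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    nearest = sorted(train, key=lambda case: distance(case[0], query))[:k]
    return sum(target for _, target in nearest) / len(nearest)


# Hypothetical (features, outcome) cases.
train = [([1.0, 2.0], 3.0), ([2.0, 3.0], 4.0), ([5.0, 4.0], 3.0)]
```

No general model is built here: all work happens at prediction time, which is why the quality of the similarity measure matters so much.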
- System Evaluation
Proper evaluation is critical to the development of a prediction system. First, it
has to measure exactly the effect of interest, such as trading return, rather than
prediction accuracy alone. Second, it has to be sensitive enough to distinguish
often minor gains. Third, it has to convince that the gains are not merely a
coincidence.
• Evaluation bias, resulting from the evaluation scheme and the time series data,
needs to be recognized.



• Evaluation data should include different regimes, markets, even data errors,
and be plentiful. Dividing test data into segments helps to spot performance
irregularities (for different regimes).
• Sanity checks involve common sense. Prediction errors along the series
should not reveal any structure, unless the predictor missed something.
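
Two of the checks above can be sketched in a few lines. The lag-1 autocorrelation of the prediction errors is used here as one possible, assumed indicator of remaining structure, and the mean absolute error per segment as one possible way to spot performance irregularities. The function names and the error values are hypothetical.

```python
from typing import List


def lag1_autocorrelation(errors: List[float]) -> float:
    """Lag-1 autocorrelation of prediction errors; values far from 0 hint at structure."""
    n = len(errors)
    mean = sum(errors) / n
    variance_sum = sum((e - mean) ** 2 for e in errors)
    if variance_sum == 0:
        return 0.0  # constant errors: no variation to correlate
    covariance_sum = sum((errors[i] - mean) * (errors[i - 1] - mean) for i in range(1, n))
    return covariance_sum / variance_sum


def segment_mae(errors: List[float], n_segments: int) -> List[float]:
    """Mean absolute error per consecutive segment (leftover tail values are ignored)."""
    size = len(errors) // n_segments
    if size == 0:
        raise ValueError("more segments than errors")
    return [sum(abs(e) for e in errors[i * size:(i + 1) * size]) / size
            for i in range(n_segments)]


alternating = [1.0, -1.0, 1.0, -1.0]  # made-up errors that flip sign every step
trending = [1.0, 2.0, 3.0, 4.0]       # made-up errors that grow steadily
```

A clearly non-zero autocorrelation, or a segment with a much larger error than the others, suggests that the predictor missed something or that the market regime changed.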

1.2. Data mining methodology for stock market prediction


1.2.1. Prediction in data mining
a. Introduction

The goal of data mining is to produce new knowledge that the user can act
upon. It does this by building a model of the real world based on data collected from
a variety of sources. The result of the model building is a description of patterns and
relationships in the data that can be confidently used for prediction.
Prediction is one of the most important problems in data mining. It involves
using some variables or fields in the data set to predict unknown or future values of
other variables of interest. The goal of prediction is to forecast or deduce the value
of an attribute based on values of other attributes.
b. Major types of prediction
- In the view of construction and use of a model
Prediction can be viewed as the construction and use of a model to assess the
class of an unlabeled sample, or to assess the value or value ranges of an attribute
that a given sample is likely to have. In this view, classification and regression are
the two major types of prediction problems:
• Classification: used to predict discrete or nominal values. It predicts into
what category or class a case falls. In other words, classification problems
aim to identify the characteristics that indicate the group to which each case
belongs. Data mining creates classification models by examining already
classified data (cases) and inductively finding a predictive pattern.
• Regression: used to predict continuous or ordered values. It predicts what
numerical value a variable will have. In other words, regression uses existing
values to forecast what other values will be. The prediction of continuous
values can be modeled by statistical techniques of regression.
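
The difference between the two problem types shows up in the target that is derived from the same data. In the sketch below, which uses made-up prices and hypothetical function names, the regression target is the next-step return and the classification target is its direction.

```python
from typing import List


def regression_targets(prices: List[float]) -> List[float]:
    """Continuous target: relative change from each price to the next one."""
    return [(prices[i + 1] - prices[i]) / prices[i] for i in range(len(prices) - 1)]


def classification_targets(prices: List[float]) -> List[str]:
    """Nominal target: 'up' for a positive return, 'down' otherwise (zero included)."""
    return ["up" if r > 0 else "down" for r in regression_targets(prices)]


prices = [100.0, 110.0, 99.0]  # made-up values for illustration
returns = regression_targets(prices)
labels = classification_targets(prices)
```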
- In the view of the use of prediction
This view is commonly accepted in data mining. It refers to the use of prediction
to predict class labels as classification, and to the use of prediction to predict
continuous values as prediction:



• Classification: used to extract models describing important data classes.
Classification predicts categorical class labels. It classifies data (constructs
a model) based on the training set and the values (class labels) in a
classifying attribute, and uses the model in classifying new data.
• Prediction: used to predict future data trends, i.e., predict unknown or
missing values. It models continuous-valued functions. Any of the methods
and techniques used for classification may also be used for prediction.
1.2.2. Parameters

There are several parameters to characterize data mining methodologies for
stock market forecasting:
1.2.2.1. Data types
There are two major groups of data types:
• Attribute data type: objects are represented by attributes, that is, each
object x is given by a set of values A1(x), A2(x), ..., An(x).
• Relational data type: objects are represented by their relations with other
objects. For instance, x>y, y<z, x>z. In this example we may not know that
x=3, y=l and z=2. Thus attributes of objects are not known, but their
relations are known. Objects may have different attributes (e.g., x=5, y=2,
and z= 4), but still have the same relations.
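
The example above can be checked mechanically. The sketch below derives the pairwise order relations from two different attribute assignments and shows that they coincide; the function name order_relations and the string encoding of the relations are assumptions of this illustration.

```python
from itertools import combinations
from typing import Dict, Set


def order_relations(values: Dict[str, float]) -> Set[str]:
    """Describe objects only by their pairwise order relations, not by their values."""
    relations = set()
    for a, b in combinations(sorted(values), 2):
        if values[a] > values[b]:
            relations.add(f"{a}>{b}")
        elif values[a] < values[b]:
            relations.add(f"{a}<{b}")
        else:
            relations.add(f"{a}={b}")
    return relations


first = {"x": 3, "y": 1, "z": 2}
second = {"x": 5, "y": 2, "z": 4}
same_relations = order_relations(first) == order_relations(second)
```

The two attribute descriptions differ, while the relational descriptions are identical.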

1.2.2.2. Data set and techniques
Fundamental and technical analyses are two widely used techniques in stock
market forecasting.
- Fundamental analysis
Fundamental analysis tries to determine all the econometric variables that may
influence the dynamics of a given stock price or exchange rate. Often it is hard to
establish which of these variables are relevant and how to evaluate their effect.
- Technical analysis
Technical analysis assumes that when the sampling rate of a given economic
variable is high, all the information necessary to predict the future values is
contained in the time series itself. There are several difficulties in technical
analysis for accurate prediction: successive ticks correspond to bids from different
sources, the correlation between price variations may be low, time series are not
stationary, good statistical indicators may not be known, different realizations of
the random process may not be available, and the number of training examples may
not be enough to accurately infer rules. Therefore, technical analysis can fit
short-term predictions for stock market time series provided there are no great
changes in the economic environment between successive ticks. In practice,
technical analysis has been more successful in identifying market trends, which is
much easier than forecasting future stock prices. Currently, different data mining
techniques try to incorporate some of the most common technical analysis strategies
in the pre-processing of data and in the construction of appropriate attributes.
Two major options exist: use the time series itself or use all variables that may
influence the evolution of the time series. Data mining methods do not restrict
themselves to a particular option. They follow a fundamental analysis approach
incorporating all available attributes and their values, but they also do not exclude a
technical analysis approach based only on a time series such as stock price and
parameters derived from it. The most popular time series are index value at open,
index value at close, highest index value, lowest index value, trading volume, and
lagged returns from the time series of interest. Fundamental factors include the price
of gold, retail sales index, industrial production indices, and foreign currency
exchange rates. Technical factors include variables that are derived from time series
such as moving averages.
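
Two of the derived variables named above, moving averages and lagged returns, can be computed as in the following sketch. The function names, the window and lag parameters, and the closing values are hypothetical and serve only as an illustration.

```python
from typing import List


def simple_moving_average(values: List[float], window: int) -> List[float]:
    """Mean of each run of `window` consecutive values."""
    if window < 1:
        raise ValueError("window must be positive")
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]


def lagged_returns(values: List[float], lag: int = 1) -> List[float]:
    """Relative change between each value and the value `lag` steps earlier."""
    if lag < 1:
        raise ValueError("lag must be positive")
    return [(values[i] - values[i - lag]) / values[i - lag]
            for i in range(lag, len(values))]


close = [10.0, 12.0, 14.0, 13.0]  # made-up closing values
sma = simple_moving_average(close, window=2)
rets = lagged_returns(close, lag=1)
```

Such derived series can then be used as attributes alongside the raw series and any fundamental factors.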
1.2.2.3. Mathematical algorithm (method, model)
A variety of statistical, neural network and logical methods have been developed.
For example, there are many neural network models, based on different
mathematical algorithms, theories and methodologies. Combinations of different
models may provide better performance than the individual models. Many
data mining methods assume a functional form of the relationship being modeled.
1.2.2.4. Form of relationships between objects
The next characteristic of a specific data mining methodology is the form of the
relationship between objects. Many data mining methods assume a functional form
of the relationship being modeled. For instance, linear discriminant analysis
assumes linearity of the border that discriminates between two classes in the space
of attributes. Often it is hard to justify such a functional form in advance. RDM
methodology in the stock market does not assume a functional form for the
relationship. In addition, RDM algorithms do not assume the existence of
derivatives. They can automatically learn symbolic relations on numerical data of
stock market time series.

1.2.3. Approaches to stock market prediction
a. Physics approach and data mining approach
The impact of market players on market regularities stimulated a surge of
attempts to use ideas of statistical physics in finance. If an observer is a large
marketplace player, then such an observer can potentially change regularities of the
marketplace dynamically. Attempts to forecast in such a dynamic environment with
thousands of active agents lead to much more complex models than traditional data
mining models are designed for. This is one of the major reasons that such
interactions are modeled using ideas from statistical physics rather than from
statistical data mining. The physics approach in finance is also known as
"econophysics" and "physics of finance". The major difference from the data mining
approach comes from the fact that, in essence, the data mining approach is not about
developing specific methods for financial tasks, whereas the physics approach is.
b. Deterministic dynamic system approach
Stock market data are often represented as a time series of a variety of attributes
such as stock prices and indexes. Time series prediction has been one of the ultimate
challenges in mathematical modeling for many years. Currently, data mining
methods try to enhance this study with new approaches. The dynamic system
approach has been developed and applied successfully to many difficult problems in
physics. Recently, several studies have applied this technique to the stock
market. Usually, the history of the time series is provided and the goal is to extract
from that data a dynamic system. The dynamic system models the relationship
between a window of past values and a value T time steps ahead. The major steps of
this approach are presented below:
• Step 1: Development of state space for the dynamic system, i.e. selecting
and/or inventing attributes characterizing the system behavior.
• Step 2: Discovering the laws that govern the phenomenon, i.e. discovering
relations between attributes of current and previous states (state vectors) in
the form of differential equations.
• Step 3: Solving differential equations for identifying the transition function
(rules).
• Step 4: Use of the transition function as a predictor of the next state of the
dynamic system, e.g., next day stock value.
Inferring a set of rules for a dynamic system assumes that:
• There is enough information in the available data to sufficiently characterize
the dynamics of the system with high accuracy
• All of the variables that influence the time series are available or they vary
slowly enough that the system can be modeled adaptively
• The system has reached some kind of stationary evolution
• The system is a deterministic system
• The evolution of a system can be described by means of a surface in the
space of delayed values.
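
Step 4 above can be illustrated with a deliberately simplified sketch. Instead of deriving and solving differential equations, it assumes a linear one-step transition function x(t+1) = a * x(t) + b and fits a and b by ordinary least squares to consecutive values. This substitution, the function names and the series are assumptions of the example and do not represent the full dynamic system approach.

```python
from typing import List, Tuple


def fit_linear_transition(series: List[float]) -> Tuple[float, float]:
    """Least-squares fit of x[t+1] = a * x[t] + b over consecutive pairs."""
    current, following = series[:-1], series[1:]
    n = len(current)
    if n < 2:
        raise ValueError("need at least three observations")
    mean_x = sum(current) / n
    mean_y = sum(following) / n
    spread = sum((x - mean_x) ** 2 for x in current)
    if spread == 0:
        raise ValueError("constant series: slope is undefined")
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(current, following)) / spread
    b = mean_y - a * mean_x
    return a, b


def predict_next(series: List[float]) -> float:
    """Use the fitted transition function as a predictor of the next state."""
    a, b = fit_linear_transition(series)
    return a * series[-1] + b


doubling = [1.0, 2.0, 4.0, 8.0, 16.0]  # made-up deterministic series
a, b = fit_linear_transition(doubling)
```

The sketch works only because the made-up series is deterministic and noise-free, which is exactly what the assumptions listed above require and what real stock market data rarely satisfy.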



There are several applications of these methods to stock time series. However,
the literature contains claims both for and against the existence of a chaotic
deterministic system underlying the stock market. Recent research has focused on
methods to distinguish stochastic noise from deterministic chaotic dynamics and,
more generally, on constructing systems combining deterministic and probabilistic
techniques.
1.2.4. Data mining methods in stock market

Almost every computational method has been explored and used for financial
modeling. New developments augment the traditional technical analysis of stock
market curves that has been used extensively by financial institutions. Such stock
charting helps to identify buy/sell signals (timing "flags") using graphical patterns.
Data mining, as a process of discovering useful patterns and correlations, has its
own place in stock market modeling.
Similarly to other computational methods, almost every data mining method
and technique has been used in financial modeling. An incomplete list includes a
variety of linear and non-linear models, multi-layer neural networks, k-means and
hierarchical clustering, k-nearest neighbors, decision tree analysis, regression
(logistic regression, general multiple regression), ARIMA, principal component
analysis, and Bayesian learning. Less traditional methods used include rough sets,
RDM methods (deterministic inductive logic programming) and newer probabilistic
methods, support vector machines, independent component analysis, Markov models
and hidden Markov models.
1.2.4.1. Representation languages
a. Propositional Logic language

A proposition is a statement that can be true or false. Propositional logic uses
true statements to form or prove other true statements. In other words, propositional
logics are concerned with propositional (or sentential) operators which may be
applied to one or more propositions giving new propositions.
Propositional logic has very limited expressive power. It is not adequate for
formalizing valid arguments that rely on the internal structure of the propositions
involved.
b. First-order logic language
First-order logic (FOL) is a system of deduction extending propositional logic
by the ability to express relations between individuals. FOL languages support
variables, relations, and complex expressions.


The FOL language differs from a propositional logic language mainly by the
presence of variables. Therefore, a language of monadic functions and predicates is
a FOL language, but a very restricted language.
c. Attribute-value languages
An attribute-value language is a propositional language in which propositions are
attribute-value pairs that can be considered as predicates. In other words, in an
attribute-value language, objects are described by tuples of attribute-value pairs,
where each attribute represents some characteristic of the object.
Attribute-value languages are languages of monadic functions (functions of one
variable) and monadic predicates (Boolean functions with only one argument). This
language was not designed to represent relations that involve two, three or more
objects.
d. Comparison of these languages
Many well-known rule learners are propositional, but propositional
representations offer no general way to describe the essential relations among the
values of the attributes. In contrast with propositional rules, first-order rules have
an advantage in discovering relational assertions because they capture relations
directly. Several types of hypotheses/rules presented in FOL are simple relational
assertions with variables. Relational assertions can be conveniently expressed using
first-order representations, while they are very difficult to describe using
propositional representations.
Also, first-order rules allow one to express naturally other, more general
hypotheses, not only the relation between pairs of attributes. These more general
rules can be used both for classification problems and for interval forecasts of a
continuous variable. Moreover, these rules are able to capture Markov chain type
models used for stock market time series forecasting. The ability of algorithms to
learn sets of first-order rules that contain variables is significant because
first-order rules are much more expressive than propositional rules.
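
A relational assertion with a variable can be illustrated as follows. The sketch evaluates one hand-written rule, "for every time t, if price(t) > price(t-1) > price(t-2) then price(t+1) > price(t)", over a series and estimates how often the conclusion holds when the condition does. Both the rule and the prices are hypothetical, and the code only tests a given rule; it is not a rule learning algorithm and does not reproduce MMDR.

```python
from typing import List, Optional, Tuple


def rule_probability(prices: List[float]) -> Tuple[Optional[float], int]:
    """Estimate P(price[t+1] > price[t] | price[t] > price[t-1] > price[t-2])."""
    matches = 0
    successes = 0
    # The variable t ranges over every position where both the condition
    # (two past values) and the conclusion (one future value) can be checked.
    for t in range(2, len(prices) - 1):
        if prices[t] > prices[t - 1] > prices[t - 2]:
            matches += 1
            if prices[t + 1] > prices[t]:
                successes += 1
    probability = successes / matches if matches else None
    return probability, matches


prices = [1.0, 2.0, 3.0, 4.0, 3.0, 4.0, 5.0, 6.0]  # made-up values
probability, matches = rule_probability(prices)
```

The rule refers to relations between values at different times rather than to the values themselves, which is the kind of statement that an attribute-value representation cannot express directly.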
1.2.4.2. AVL-based methods
The common data mining methodology that assumes the attribute data type is known
as an attribute-based or attribute-value methodology. It covers a wide range of
statistical and connectionist (neural network) methods. There are two types of
attribute-value methods: the first one is based on numerical expressions, and the
second one is based on logical expressions and operations.
Historically, methods based on AVLs such as neural networks, the nearest
neighbors method, and decision trees dominate in financial applications of data
