
In order to achieve good behavior, the agent must explore its environment. Explo-
ration means trying different sorts of actions in various situations. While exploring,
some of the choices may be poor ones, which may lead to severe costs. In such cases,
it is more appropriate to train the agent on a computer-simulated model of the en-
vironment. It is sometimes possible to simulate an environment without explicitly
understanding it.
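As a concrete illustration of this trade-off, the sketch below shows ε-greedy action selection, one common exploration heuristic: the agent usually exploits its current value estimates but, with a small probability, tries an action at random. The names and the value of ε here are illustrative only; the example later in this chapter uses a Boltzmann-based heuristic instead.

```python
import random

def epsilon_greedy(q_values, epsilon=0.1):
    """Pick an action index from a list of Q-value estimates.

    With probability epsilon an action is chosen uniformly at random
    (exploration); otherwise the action with the highest current
    estimate is chosen (exploitation).
    """
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Example: three candidate actions with current estimates 1.0, 2.5, 0.3.
print(epsilon_greedy([1.0, 2.5, 0.3], epsilon=0.1))
```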
RL methods have been used to solve a variety of problems in a number of do-
mains. Pednault et al. (2002) solved targeted marketing problems. Tesauro (1994,
1995) developed an artificial backgammon player with RL. Hong and Prabhu (2004)
and Zhang and Dietterich (1996) used RL to solve manufacturing problems. Littman
and Boyan (1993) have used RL for the solution of a networking routing problem.
Using RL, Crites and Barto (1996) trained an elevator dispatching controller.
20.6 Reinforcement-Learning and Data-Mining
This chapter presents an overview of some of the ideas and computation methods in
RL. In this section the relation and relevance of RL to DM is discussed.
Most DM learning methods are taken from ML. It is common to distinguish among three categories of learning methods – Supervised Learning (SL), Unsupervised Learning and Reinforcement Learning. In SL, the learner is programmed to
extract a model from a set of observations, where each observation consists of ex-
plaining variables and corresponding responses. In unsupervised learning there is a
set of observations but no response, and the learner is expected to extract a helpful
representation of the domain from which the observations were drawn. RL requires
the learner to extract a model of response based on experience observations that in-
clude states, responses and the corresponding reinforcements.
SL methods are central in DM, and a correspondence may be established between SL
and RL in the following manner. Consider a learner that needs to extract a model
of response for different situations. A supervised learner will rely on a set of ob-
servations, each of which is labeled by an advisor (or an oracle). The label of each
observation is regarded by the agent as the desired response for the situation intro-
duced by the explanatory variables for this observation. In RL the privilege of having an advisor is not given. Instead, the learner views situations (in RL these are called states), chooses responses (in RL these are called actions) autonomously, and obtains rewards that indicate how good the choices were. In this view of SL and RL, states and realizations of the explaining variables are actually the same.
In some DM problems, the responses chosen in one situation affect future outcomes. This is typically the case in cost-sensitive DM problems. Since SL
relies on labeled observations and assumes no dependence between observations, it
is sometimes inappropriate for such problems. The RL model, on the other hand, perfectly fits cost-sensitive DM problems.^5 For example, Pednault et al. (2002) used RL to solve a problem in targeted marketing – deciding on the optimal targeting of promotion efforts in order to maximize the benefits due to promotion. Targeted marketing is a classical DM problem in which the desired response is unknown, and responses taken at one point in time affect the future. (For example, deciding on an extensive campaign for a specific product this month may reduce the effectiveness of a similar campaign the following month.)

^5 Despite this claim, there are several difficulties in applying RL methods to DM problems. A serious issue is that DM problems present batches of observations stored in a database, whereas RL methods require the incremental accumulation of observations through interaction.
Finally, DM may be defined as a process in which computer programs manipulate
data in order to provide knowledge about the domain that produced the data. From
the point of view implied by this definition, RL definitely needs to be considered a certain type of DM.
20.7 An Instructive Example
In this section, an example-problem from the area of supply-chain management is
presented and solved through RL. Specifically, the modeling of the problem as an
MDP with unknown reward and state-transition functions is shown; the application of Q-Learning is demonstrated; and the relations between RL and DM are discussed
with respect to the problem.
The term "supply-chain management" refers to the attempts of an enterprise
to optimize processes involved in purchasing, producing, shipping and distributing
goods. Among other objectives, enterprises seek to formulate a cost-effective inven-
tory policy. Consider the problem of an enterprise that purchases a single product
from a manufacturer and sells it to end-customers. The enterprise may maintain a
stock of the product in one or more warehouses. The stock helps the enterprise re-
spond to customer demand, which is usually stochastic. On the other hand, the en-
terprise has to invest in purchasing the stock and maintaining it. These activities lead
to costs.
Consider an enterprise that has two warehouses in two different locations and
behaves as follows. At the beginning of epoch t, the enterprise observes the stock
levels s_1(t) and s_2(t) at the first and the second warehouses, respectively. As a response, it may order from the manufacturer in quantities a_1(t), a_2(t) for the first and second warehouse, respectively. The decision of how many units to order for each of the warehouses is taken centrally (i.e., simultaneously by a single decision-maker), but the actual orders are issued separately by the two warehouses. The manufacturer charges c_d for each unit ordered, and an additional c_K for delivering an order to a warehouse (i.e., if the enterprise issues orders at both warehouses it is charged a fixed 2c_K in addition to the direct costs of the units ordered). It is assumed that there is no lead-time (i.e., the units ordered become available immediately after the orders are issued).
Subsequently, each of the warehouses observes a stochastic demand.
A warehouse that has enough units in stock sells the units and charges p for each
sold unit. If one of the warehouses fails to respond to the demand, whereas the other
warehouse, after delivering to its customers, can spare units, transshipment is initiated. Transshipment means transporting units between the warehouses in order to meet demand. Transshipment costs c_T for each unit transshipped. Any unit remaining in stock by the end of the epoch costs the enterprise c_i for that one epoch. The successive epoch begins with the number of units available at the end of the current epoch, and so on.
The enterprise wants to formulate an optimal inventory policy (i.e., given the stock levels, it wants to know when to issue orders and in what quantities, so as to maximize its long-run expected profits). This problem can be modeled as an MDP (see the definition of an MDP in Section 20.2). The stock levels s_1(t) and s_2(t) at the beginning of an epoch are the states faced by the enterprise's decision-makers. The possible quantities for the two orders are the possible actions given a state. As a consequence of choosing a certain action at a certain state, each warehouse obtains a deterministic quantity-on-hand. As the demand is observed and met (either directly or through transshipment), the actual immediate profit r_t can be calculated as the
revenue gained from selling products minus costs due to purchasing the products, de-
livering the orders, the transshipments and maintaining inventory. The stock levels at
the end of the period, and thus the state for the successive epoch, are also determined.
Since the demand is stochastic, both the reward (the profit) and the state-transition
function are stochastic.
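To make these dynamics concrete, the following sketch simulates one epoch of the environment described above. It is only an illustrative approximation under the stated assumptions (no lead-time; transshipment only from a warehouse with spare units); the function name step and the default parameter values c_d = 2, c_K = 10, c_T = 1, c_i = 0.5 and p = 10 mirror the numerical example later in this section and are not part of any standard library.

```python
def step(s1, s2, a1, a2, demand1, demand2,
         c_d=2.0, c_K=10.0, c_T=1.0, c_i=0.5, p=10.0):
    """Simulate one epoch of the two-warehouse inventory problem.

    (s1, s2)            stock levels at the beginning of the epoch
    (a1, a2)            order quantities for the two warehouses
    (demand1, demand2)  realized customer demand at each warehouse
    Returns (reward, next_s1, next_s2).
    """
    # Ordering costs: a fixed delivery charge for each warehouse that
    # actually issues an order, plus a per-unit cost for all units ordered.
    ordering = c_K * ((a1 > 0) + (a2 > 0)) + c_d * (a1 + a2)

    # No lead-time: ordered units are available immediately.
    q1, q2 = s1 + a1, s2 + a2

    # Serve local demand first.
    sold1, sold2 = min(q1, demand1), min(q2, demand2)
    q1, q2 = q1 - sold1, q2 - sold2
    short1, short2 = demand1 - sold1, demand2 - sold2

    # Transship spare units to cover the other warehouse's shortage.
    t1 = min(short1, q2)          # units moved from warehouse 2 to 1
    t2 = min(short2, q1)          # units moved from warehouse 1 to 2
    sold1, sold2 = sold1 + t1, sold2 + t2
    q1, q2 = q1 - t2, q2 - t1
    transshipment = c_T * (t1 + t2)

    # Holding cost on units left in stock at the end of the epoch.
    holding = c_i * (q1 + q2)
    revenue = p * (sold1 + sold2)
    reward = revenue - ordering - transshipment - holding
    return reward, q1, q2

# The numerical walk-through in the text (epoch 158):
# s = (4, 2), a = (0, 8), demands (5, 1) -> reward 29, next state (0, 8).
print(step(4, 2, 0, 8, 5, 1))
```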
Assuming that the demand functions at the two warehouses are unknown, the
problem of the enterprise is how to solve an MDP with unknown reward and state-
transition functions. In order to solve the problem via RL, a large number of ex-
perience episodes needs to be presented to an agent. Gathering such experience is
expensive, because in order to learn an optimal policy, the agent must explore its environment while simultaneously exploiting its current knowledge (see the discussion of the exploration-exploitation dilemma in Section 20.3.2). However, in many
cases learning may be based on simulated experience.
Consider using Q-Learning (see Section 20.3.2) for the solution of the enter-
prise's problem. Let this application be demonstrated for epoch t = 158, with initial stock levels s_1(158) = 4 and s_2(158) = 2. The agent constantly maintains a unique Q-value for each combination of initial stock levels and quantities ordered. Assuming that capacity at both warehouses is limited to 10 units of stock, the possible actions given the state are:

A(s_1(t), s_2(t)) = { (a_1, a_2) : a_1 + s_1(t) ≤ 10, a_2 + s_2(t) ≤ 10 }        (20.18)
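As an illustration, the action set of Equation 20.18 can be enumerated directly, assuming non-negative integer order quantities and the 10-unit capacity mentioned above; the function name feasible_actions is hypothetical.

```python
def feasible_actions(s1, s2, capacity=10):
    """All order-quantity pairs (a1, a2) that do not exceed warehouse capacity."""
    return [(a1, a2)
            for a1 in range(capacity - s1 + 1)
            for a2 in range(capacity - s2 + 1)]

# For the state used in the text, s1(158) = 4 and s2(158) = 2,
# there are 7 * 9 = 63 feasible actions, including (0, 8).
actions = feasible_actions(4, 2)
print(len(actions), (0, 8) in actions)
```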
The agent chooses an action from the set of possible actions based on some heuristic
that breaks the exploration-exploitation dilemma (see discussion in Section 20.3.2).
Assume that the current Q-values for the state s_1(158) = 4, s_2(158) = 2 are as described in Figure 20.1. The heuristic used should tend to choose actions for which the corresponding Q-value is high, while allowing each action to be chosen with a positive probability. Assume that the action chosen is a_1(158) = 0, a_2(158) = 8. This action means that the first warehouse does not issue an order while the second warehouse orders 8 units. Assume that the direct cost per unit is c_d = 2, and that the fixed cost for an order is c_K = 10. Since only the second warehouse issued an order, the enterprise's ordering costs are 10 + 8·2 = 26. The quantities-on-hand after receiving
the order are 4 units in the first warehouse and 10 units in the second warehouse.
Assume the demand realizations are 5 units from the first warehouse and a single
unit from the second warehouse. Although the first warehouse can provide only 4 units directly, the second warehouse can spare a unit from its stock; transshipment therefore occurs, and both warehouses meet the demand. Assume the transshipment cost is c_T = 1 for each unit transshipped. Since only one unit needs to be transshipped, the total
transshipment cost is 1. In epoch 158, six units were sold. Assuming the enterprise
charges p = 10 for each unit sold, the revenue from selling products in this epoch
is 60. At the end of the epoch, the stock levels are zero units for the first warehouse
and 8 units for the second warehouse. Assuming the inventory costs are 0.5 per unit
in stock for one period, the total inventory costs for epoch 158 are 4 (= 8 ·0.5).
The immediate reward for that epoch is 60 − 26 − 1 − 4 = 29. The state for the next epoch is s_1(159) = 0 and s_2(159) = 8. The agent can calculate V_158(s_1(159), s_2(159)) by maximizing the Q-values corresponding with s_1(159) and s_2(159), which it holds by the end of epoch 158. Assume that the result of this maximization is 25. Assume that the appropriate learning rate for s_1 = 4, s_2 = 2, a_1 = 0, a_2 = 8 and t = 158 is α_158(4,2,0,8) = 0.1, and that the discount factor is 0.9. The agent updates the appropriate entry according to the update rule in Equation 20.14 as follows:
Q_159((4,2), (0,8)) = 0.9 · Q_158((4,2), (0,8)) + 0.1 · [r_158 + γ · V_158(0,8)]
                    = 0.9 · 45 + 0.1 · [29 + 0.9 · 25] = 45.65.        (20.19)
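The same update can be verified numerically with a few lines of code. This is a hedged sketch of the tabular rule only; the function name q_update is illustrative.

```python
def q_update(q_old, reward, v_next, alpha, gamma):
    """One tabular Q-Learning update: Q <- (1 - alpha) * Q + alpha * (r + gamma * V)."""
    return (1 - alpha) * q_old + alpha * (reward + gamma * v_next)

# Numbers from the walk-through: Q_158((4,2),(0,8)) = 45, r_158 = 29,
# V_158(0,8) = 25, alpha = 0.1, gamma = 0.9.
print(q_update(45.0, 29.0, 25.0, alpha=0.1, gamma=0.9))  # 45.65 (up to rounding)
```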
This update changes the corresponding Q-value as indicated in Figure 20.2. Figure 20.3 shows the learning curve of a Q-Learning agent
that was trained to solve the enterprise’s problem in accordance with the parameters
assumed in this section. The agent was introduced to 200,000 simulated experience

episodes, in each of which the demands were drawn from Poisson distributions with
means 5 and 3 for the first and second warehouses respectively. The learning rates
were set to 0.05 for all t, and a heuristic based on the Boltzmann distribution was used to break the exploration-exploitation dilemma (see Sutton and Barto, 1998). The
figure shows a plot of the moving average reward (over 2000 episodes) against the
experience of the agent while gaining these rewards.
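A Boltzmann-based heuristic of the kind mentioned above can be sketched as softmax selection over the current Q-values, in which high-valued actions are preferred but every action keeps a positive probability of being chosen. The temperature parameter and the function name below are illustrative assumptions, not the exact settings used to produce Figure 20.3.

```python
import math
import random

def boltzmann_choice(actions, q, temperature=1.0):
    """Choose an action with probability proportional to exp(Q / temperature)."""
    prefs = [math.exp(q[a] / temperature) for a in actions]
    total = sum(prefs)
    r = random.random() * total
    acc = 0.0
    for a, w in zip(actions, prefs):
        acc += w
        if r <= acc:
            return a
    return actions[-1]

# Example: Q-values for three candidate order decisions in some state.
q = {(0, 8): 45.0, (0, 0): 30.0, (5, 5): 38.0}
print(boltzmann_choice(list(q), q, temperature=5.0))
```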
This section shows how RL algorithms (specifically, Q-Learning) can be used to learn from data observations. As discussed in Section 20.6, this by itself makes RL, in this case, a DM tool. However, the term DM may imply the use of an SL algorithm. Within the scope of the problem discussed here, SL is inappropriate. A supervised learner could induce an optimal (or at least a near-optimal) policy based on examples of the form ⟨s_1, s_2, a_1, a_2⟩, where s_1 and s_2 describe a certain state, and a_1 and a_2 are the optimal responses (order quantities) for that state. However, in the case discussed here, such examples are probably not available.
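To connect this to the SL view: once Q-Learning has converged, the kind of labeled examples a supervised learner would have needed can in effect be read off the learned Q-table by taking the greedy action in each state. The dictionary layout in this sketch is an assumption made for illustration.

```python
def greedy_policy(q_table):
    """Derive order quantities (a1, a2) for every state (s1, s2) from a Q-table.

    q_table maps (s1, s2) -> {(a1, a2): Q-value}.  The result is the kind of
    (state, best action) pairs a supervised learner would need as labels.
    """
    return {state: max(actions, key=actions.get)
            for state, actions in q_table.items()}

# Tiny illustrative Q-table for two states.
q_table = {
    (4, 2): {(0, 8): 45.65, (0, 0): 30.0, (5, 5): 38.0},
    (0, 8): {(6, 0): 41.0, (0, 0): 25.0},
}
print(greedy_policy(q_table))  # {(4, 2): (0, 8), (0, 8): (6, 0)}
```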

The methods presented in this chapter are useful for many application domains, such as manufacturing, security and medicine, and for many data mining techniques, such as decision trees, clustering, ensemble methods and genetic algorithms.
Fig. 20.1. Q-values for the state encountered in epoch 158, before the update. The value corresponding with the action finally chosen is marked.
Fig. 20.2. Q-values for the state encountered in epoch 158, after the update. The value corresponding with the action finally chosen is marked.
References
Arbel, R. and Rokach, L., Classifier evaluation under limited resources, Pattern Recognition
Letters, 27(14): 1619–1631, 2006, Elsevier.
Averbuch, M. and Karson, T. and Ben-Ami, B. and Maimon, O. and Rokach, L., Context-
sensitive medical information retrieval, The 11th World Congress on Medical Informat-
ics (MEDINFO 2004), San Francisco, CA, September 2004, IOS Press, pp. 282–286.
Bellman R. Dynamic Programming. Princeton University Press, 1957.
Fig. 20.3. The learning curve of a Q-Learning agent assigned to solve the enterprise’s trans-
shipment problem.
Bertsekas D.P. Dynamic Programming: Deterministic and Stochastic Models. Prentice-Hall,
1987.
Bertsekas D.P., Tsitsiklis J.N. Neuro-Dynamic Programming. Athena Scientific, 1996.
Claus C., Boutilier, C. The Dynamics of Reinforcement Learning in Cooperative Multiagent
Systems. AAAI-97 Workshop on Multiagent Learning, 1998.
Cohen S., Rokach L., Maimon O., Decision Tree Instance Space Decomposition with
Grouped Gain-Ratio, Information Science, Volume 177, Issue 17, pp. 3592-3612, 2007.
Crites R.H., Barto A.G. Improving Elevator Performance Using Reinforcement Learning.
Advances in Neural Information Processing Systems: Proceedings of the 1995 Confer-
ence, 1996.
Filar J., Vrieze K. Competitive Markov Decision Processes. Springer, 1997.

Hong J, Prabhu V.V. Distributed Reinforcement Learning for Batch Sequencing and Sizing
in Just-In-Time Manufacturing Systems. Applied Intelligence, 2004; 20:71-87.
Howard, R.A. Dynamic Programming and Markov Processes, M.I.T Press, 1960.
Hu J., Wellman M.P. Multiagent Reinforcement Learning: Theoretical Framework and Algo-
rithm. In Proceedings of the 15th International Conference on Machine Learning, 1998.
Jaakkola T., Jordan M.I., Singh S.P. On the Convergence of Stochastic Iterative Dynamic
Programming Algorithms. Neural Computation, 1994; 6:1185-201.
Kaelbling L.P., Littman M.L., Moore A.W. Reinforcement Learning: a Survey. Journal of
Artificial Intelligence Research 1996; 4:237-85.
Littman M.L., Boyan J.A. A Distributed Reinforcement Learning Scheme for Network Rout-
ing. In Proceedings of the International Workshop on Applications of Neural Networks
to Telecommunications, 1993.
Littman M.L. Markov Games as a Framework for Multi-Agent Reinforcement Learning. In Proceedings of the 11th International Conference on Machine Learning, 1994.
Littman M. L. Friend-or-Foe Q-Learning in General-Sum Games. Proceedings of the 18th
International Conference on Machine Learning, 2001.
Maimon O., and Rokach, L. Data Mining by Attribute Decomposition with semiconductors
manufacturing case study, in Data Mining for Design and Manufacturing: Methods and
Applications, D. Braha (ed.), Kluwer Academic Publishers, pp. 311–336, 2001.
Maimon O. and Rokach L., “Improving supervised learning by feature decomposition”, Pro-
ceedings of the Second International Symposium on Foundations of Information and
Knowledge Systems, Lecture Notes in Computer Science, Springer, pp. 178-196, 2002.
Maimon, O. and Rokach, L., Decomposition Methodology for Knowledge Discovery and
Data Mining: Theory and Applications, Series in Machine Perception and Artificial In-
telligence - Vol. 61, World Scientific Publishing, ISBN:981-256-079-3, 2005.
Moskovitch R, Elovici Y, Rokach L, Detection of unknown computer worms based on behav-
ioral classification of the host, Computational Statistics and Data Analysis, 52(9):4544–
4566, 2008.
Pednault E., Abe N., Zadrozny B. Sequential Cost-Sensitive Decision Making with Reinforcement-Learning. In Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002.
Puterman M.L. Markov Decision Processes. Wiley, 1994.
Rokach, L., Decomposition methodology for classification tasks: a meta decomposer frame-
work, Pattern Analysis and Applications, 9(2006):257–271.
Rokach L., Genetic algorithm-based feature set partitioning for classification problems, Pattern Recognition, 41(5):1676–1700, 2008.
Rokach L., Mining manufacturing data using genetic algorithm-based feature set decompo-
sition, Int. J. Intelligent Systems Technologies and Applications, 4(1):57-78, 2008.
Rokach, L. and Maimon, O., Theory and applications of attribute decomposition, IEEE In-
ternational Conference on Data Mining, IEEE Computer Society Press, pp. 473–480,
2001.
Rokach L. and Maimon O., Feature Set Decomposition for Decision Trees, Journal of Intel-
ligent Data Analysis, Volume 9, Number 2, 2005b, pp 131–158.
Rokach, L. and Maimon, O., Clustering methods, Data Mining and Knowledge Discovery
Handbook, pp. 321–352, 2005, Springer.
Rokach, L. and Maimon, O., Data mining for improving the quality of manufacturing: a
feature set decomposition approach, Journal of Intelligent Manufacturing, 17(3):285–
299, 2006, Springer.
Rokach, L., Maimon, O., Data Mining with Decision Trees: Theory and Applications, World
Scientific Publishing, 2008.
Rokach L., Maimon O. and Lavi I., Space Decomposition In Data Mining: A Clustering Ap-
proach, Proceedings of the 14th International Symposium On Methodologies For Intel-
ligent Systems, Maebashi, Japan, Lecture Notes in Computer Science, Springer-Verlag,
2003, pp. 24–31.
Rokach, L. and Maimon, O. and Averbuch, M., Information Retrieval System for Medical Narrative Reports, Lecture Notes in Artificial Intelligence 3055, pp. 217–228, Springer-Verlag, 2004.
Rokach, L. and Maimon, O. and Arbel, R., Selective voting – getting more for less in sensor fusion, International Journal of Pattern Recognition and Artificial Intelligence 20(3) (2006), pp. 329–350.
Ross S. Introduction to Stochastic Dynamic Programming. Academic Press. 1983.
Sen S., Sekaran M., Hale J. Learning to Coordinate Without Sharing Information. In Pro-
ceedings of the Twelfth National Conference on Artificial Intelligence, 1994.
Sutton R.S., Barto A.G. Reinforcement Learning, an Introduction. MIT Press, 1998.
Szepesvári C., Littman M.L. A Unified Analysis of Value-Function-Based Reinforcement-Learning Algorithms. Neural Computation, 1999; 11:2017-60.
Tesauro G.T. TD-Gammon, a Self-Teaching Backgammon Program, Achieves Master-Level Play. Neural Computation, 1994; 6:215-19.
Tesauro G.T. Temporal Difference Learning and TD-Gammon. Communications of the
ACM, 1995; 38:58-68.
Watkins C.J.C.H. Learning from Delayed Rewards. Ph.D. thesis; Cambridge University,
1989.
Watkins C.J.C.H., Dayan P. Technical Note: Q-Learning. Machine Learning, 1992; 8:279-92.
Zhang W., Dietterich T.G. High Performance Job-Shop Scheduling with a Time-Delay TD(λ) Network. Advances in Neural Information Processing Systems, 1996; 8:1024-30.

21
Neural Networks For Data Mining
G. Peter Zhang
Georgia State University, Department of Managerial Sciences
Summary. Neural networks have become standard and important tools for data mining. This chapter provides an overview of neural network models and their applications to data mining
tasks. We review the historical development of the field of neural networks and present three
important classes of neural models including feedforward multilayer networks, Hopfield net-
works, and Kohonen’s self-organizing maps. Modeling issues and applications of these models
for data mining are discussed.
Key words: neural networks, regression, classification, prediction, clustering
21.1 Introduction
Neural networks, or artificial neural networks, are an important class of tools for quan-
titative modeling. They have enjoyed considerable popularity among researchers and
practitioners over the last 20 years and have been successfully applied to solve a va-
riety of problems in almost all areas of business, industry, and science (Widrow,
Rumelhart & Lehr, 1994). Today, neural networks are treated as a standard data min-
ing tool and used for many data mining tasks such as pattern classification, time
series analysis, prediction, and clustering. In fact, most commercial data mining soft-
ware packages include neural networks as a core module.
Neural networks are computing models for information processing and are par-
ticularly useful for identifying the fundamental relationships among a set of variables or patterns in the data. They grew out of research in artificial intelligence; specifically, attempts to mimic the learning of biological neural networks, especially those in the human brain, which may contain more than 10^11 highly interconnected neurons. Although the artificial neural networks discussed in this chapter are extremely simple abstractions of biological systems and are very limited in size, ability, and power compared to biological neural networks, they do share two very important characteristics: 1) parallel processing of information and 2) learning and generalizing from experience.
