20
A Review of Reinforcement Learning Methods

Oded Maimon and Shahar Cohen
Department of Industrial Engineering, Tel-Aviv University, Ramat-Aviv 69978, Israel
Summary. Reinforcement-Learning is learning how best to react to situations through trial and error. In the Machine-Learning community, Reinforcement-Learning is researched with
respect to artificial (machine) decision-makers, referred to as agents. The agents are assumed
to be situated within an environment which behaves as a Markov Decision Process. This chap-
ter provides a brief introduction to Reinforcement-Learning, and establishes its relation to
Data-Mining. Specifically, the Reinforcement-Learning problem is defined; a few key ideas
for solving it are described; the relevance to Data-Mining is explained; and an instructive
example is presented.
Key words: Reinforcement-Learning
20.1 Introduction
Reinforcement-Learning (RL) is learning how best to react to situations through trial and error. The learning takes place as a decision-maker interacts with the environment she lives in. On a sequential basis, the decision-maker recognizes her state within the environment and reacts by initiating an action. Consequently she obtains
a reward signal, and enters another state. Both the reward and the next state are
affected by the current state and the action taken. In the Machine Learning (ML)
community, RL is researched with respect to artificial (machine) decision-makers,
referred to as agents.
The mechanism that generates reward signals and introduces new states is re-
ferred to as the dynamics of the environment. As the RL agent begins learning, it
is unfamiliar with that dynamics, and therefore initially it cannot correctly predict
the outcome of actions. However, as the agent interacts with the environment and
observes the actual consequences of its decisions, it can gradually adapt its behav-
ior accordingly. Through learning, the agent chooses actions according to a policy. A
policy is a means of deciding which action to choose when encountering a certain
state. A policy is optimal if it maximizes an agreed-upon return function. A return
function is usually some sort of expected weighted-sum over the sequence of rewards
obtained while following a specified policy. Typically the objective of the RL agent
is to find an optimal policy.
RL research has been continually advancing over the past three decades. The
aim of this chapter is to provide a brief introduction to this exciting research, and to
establish its relation to Data-Mining (DM). For a more comprehensive RL survey,
the reader is referred to Kaelbling et al. (1996). For a comprehensive introduction
to RL, see Sutton and Barto (1998). A rigorous presentation of RL can be found in
Bertsekas and Tsitsiklis (1996).
The rest of this chapter is organized as follows. Section 20.2 formally describes
the basic mathematical model of RL, and reviews some key results for this model.
Section 20.3 introduces some of the principles of computational methods in RL. Section 20.4 describes some extensions to the basic RL model and computation methods. Section 20.5 reviews several applications of RL. Section 20.6 discusses RL from a DM perspective. Finally, Section 20.7 presents an example of how RL is used to solve a typical problem.
20.2 The Reinforcement-Learning Model
RL is based on a well-known model called the Markov Decision Process (MDP). An MDP is a tuple ⟨S, A, R, P⟩, where S is a set of states, A is a set of actions, R : S×A → ℜ is a mean-reward function and P : S×A×S → [0,1] is a state-transition function. (In general it is possible to allow different sets of actions for different states, i.e. letting A(s) be the set of allowable actions in state s for all s ∈ S; for ease of notation, it is assumed here that all actions are allowed in all states.) An MDP evolves through discrete time stages. On stage t, the agent recognizes the state of its environment s_t ∈ S and reacts by choosing an action a_t ∈ A. Consequently it obtains a reward r_t, whose mean-value is R(s_t, a_t), and its environment transits to a new state s_{t+1} with probability P(s_t, a_t, s_{t+1}). The two sets - of states and of actions - may be finite or infinite. In this chapter, unless specified otherwise, both sets are assumed to be finite. The RL agent begins interacting with the environment without any knowledge of the mean-reward function or the state-transition function.
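As a concrete illustration of this formulation, the following minimal Python sketch represents a small finite MDP by explicit tables for R and P and simulates one stage. The two states, two actions, reward values and transition probabilities are invented purely for the example.

```python
import random

# A tiny, invented two-state MDP: S = {0, 1}, A = {0, 1}.
S = [0, 1]
A = [0, 1]

# Mean-reward function R(s, a), stored as a dictionary.
R = {(0, 0): 1.0, (0, 1): 0.0,
     (1, 0): 0.0, (1, 1): 2.0}

# State-transition function P(s, a, .): probabilities over next states.
P = {(0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
     (1, 0): [0.5, 0.5], (1, 1): [0.1, 0.9]}

def step(s, a):
    """Simulate one stage: return a (noisy) reward and the next state."""
    s_next = random.choices(S, weights=P[(s, a)])[0]
    reward = R[(s, a)] + random.gauss(0.0, 0.1)   # reward with mean R(s, a)
    return reward, s_next

r, s_next = step(0, 1)
print("reward:", r, "next state:", s_next)
```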
Situated within its environment, the agent seeks an optimal policy. A policy π : S×A → [0,1] is a mapping from state-action pairs to probabilities. Namely, an agent that observes the state s and follows the policy π will choose the action a ∈ A with probability π(s,a). Deterministic policies are of particular interest. A deterministic policy π_d is a policy in which for any state s ∈ S there exists an action a ∈ A so that π_d(s,a) = 1, and π_d(s,a′) = 0 for all a′ ≠ a. A deterministic policy is therefore a mapping from states to actions. For a deterministic policy π_d, the action a = π_d(s) is the action for which π_d(s,a) = 1. The subscript "d" is added to differentiate deterministic from non-deterministic policies (when it is clear from the context that a policy is deterministic, the subscript is omitted). A policy (either deterministic or not) is optimal if it maximizes some agreed-upon return function. The most common return function is the expected geometrically-discounted infinite-sum of rewards. Considering this return function, the objective of the agent is defined as follows:
E[ ∑_{t=1}^∞ γ^{t−1} r_t ] → max,    (20.1)

where γ ∈ (0,1) is a discount factor representing the extent to which the agent is willing to compromise immediate rewards for the sake of future rewards. The discount factor can be interpreted either as a means of capturing some characteristics of the problem (for example an economic interest-rate) or as a mathematical trick that makes RL problems more tractable. Other useful return-functions are defined by the expected finite-horizon sum of rewards, and the expected long-run average reward (see Kaelbling et al., 1996). This chapter assumes that the return function is the expected geometrically-discounted infinite-sum of rewards.
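As a small numerical illustration of the return in Equation 20.1 (truncated to a finite horizon), the following sketch computes the discounted sum for a made-up reward sequence with γ = 0.9.

```python
# Discounted return of Equation 20.1, truncated to four stages.
gamma = 0.9
rewards = [1.0, 0.0, 2.0, 1.0]     # hypothetical r_1, ..., r_4

# gamma^(t-1) * r_t for t = 1, ..., 4 (enumerate starts at 0)
discounted_return = sum(gamma ** t * r for t, r in enumerate(rewards))
print(discounted_return)           # 1.0 + 0.0 + 1.62 + 0.729 = 3.349
```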
Given a policy π, the value of the state s is defined by:

V^π(s) = E_π[ ∑_{t=1}^∞ γ^{t−1} r_t | s_1 = s ],   s ∈ S,    (20.2)
where the operator E_π represents expectation given that actions are chosen according to the policy π. The value of a state for a specific policy represents the return associated with following the policy given a specific initial state. Similarly, given the policy π, the value of a state-action pair is defined by:

Q^π(s,a) = E_π[ ∑_{t=1}^∞ γ^{t−1} r_t | s_1 = s, a_1 = a ],   s ∈ S; a ∈ A.    (20.3)
The value of a state-action pair for a specific policy is the return associated with first choosing a specific action in a specific state, and thereafter choosing actions according to the policy. The optimal value of a state is defined by:

V*(s) = max_π V^π(s),   s ∈ S.    (20.4)
A policy π* is optimal if it achieves the optimal values for all states, i.e. if:

V^{π*}(s) = V*(s),   ∀s ∈ S.    (20.5)
If π* is an optimal policy it also maximizes the value of all state-action pairs:

Q*(s,a) = Q^{π*}(s,a) = max_π Q^π(s,a),   ∀s ∈ S; a ∈ A.    (20.6)
A well-known result is that under the assumed return-function, any MDP has an optimal deterministic policy. This optimal policy, however, may not be unique. Any deterministic optimal policy must satisfy the following relation:

π*(s) = argmax_a Q*(s,a),   ∀s ∈ S.    (20.7)
Finally, the relation between optimal values of states and of state-action pairs is es-
tablished by the following set of equations:
Q*(s,a) = R(s,a) + γ ∑_{s′∈S} P(s,a,s′) V*(s′)
V*(s) = max_a Q*(s,a),   s ∈ S; a ∈ A.    (20.8)
For a more extensive discussion about MDPs, and related results, the reader is re-
ferred to Puterman (1994) or Ross (1983).
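To make the relations in Equations 20.7 and 20.8 concrete, the short sketch below assumes that the optimal state-action values Q*(s,a) are already known (the numbers are purely hypothetical) and extracts V* and a deterministic greedy policy from them.

```python
# Hypothetical optimal state-action values Q*(s, a) for a two-state,
# two-action problem (the numbers are invented for illustration).
Q_star = {0: {0: 4.7, 1: 5.2},
          1: {0: 6.1, 1: 3.9}}

# V*(s) = max_a Q*(s, a)        (second line of Equation 20.8)
V_star = {s: max(q.values()) for s, q in Q_star.items()}

# pi*(s) = argmax_a Q*(s, a)    (Equation 20.7)
pi_star = {s: max(q, key=q.get) for s, q in Q_star.items()}

print(V_star)    # {0: 5.2, 1: 6.1}
print(pi_star)   # {0: 1, 1: 0}
```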
Some RL tasks are continuous while others are episodic. An episodic task is one that terminates after a (possibly random) number of stages. When one repeat of an episodic task terminates, another may begin, possibly at a different initial state. Continuous tasks, on the other hand, never terminate. The objective defined by Equation 20.1 considers an infinite horizon and therefore might be seen as inappropriate for episodic tasks (which are finite by definition). However, by introducing the concept of an absorbing state, episodic tasks can be viewed as infinite-horizon tasks (Sutton and Barto, 1998). An absorbing state is one from which all actions result in a transition back to the same state with zero reward.
20.3 Reinforcement-Learning Algorithms
The environment in RL problems is modeled as an MDP with unknown mean-reward and state-transition functions. Many RL algorithms are generalizations of dynamic-programming (DP) algorithms (Bellman, 1957; Howard, 1960) for finding optimal policies in MDPs given these functions. Sub-section 20.3.1 introduces a few key
DP principles. The reader is referred to Puterman (1994), Bertsekas (1987) or Ross
(1983) for a more comprehensive discussion. Sub-section 20.3.2 introduces several
issues related to generalizing DP algorithms to RL problems. Please see Sutton and
Barto (1998) for a comprehensive introduction to RL algorithms, and Bertsekas and
Tsitsiklis (1996) for a more extensive treatment.
20.3.1 Dynamic-Programming
Typical DP algorithms begin with an arbitrary policy and proceed by evaluating the values of states or state-action pairs for this policy. These evaluations are used to derive a new, improved policy, whose values of states or state-action pairs are then evaluated, and so on. Given a deterministic policy π, the evaluation of values of states may take place by incorporating an iterative sequence of updates. The sequence begins with arbitrary initializations V_1(s) for each s. On the k-th repeat of the sequence, the values V_k(s) are used to derive V_{k+1}(s) for all s:

V_{k+1}(s) = R(s, π(s)) + γ ∑_{s′∈S} P(s, π(s), s′) V_k(s′),   ∀s ∈ S.    (20.9)

It can be shown that, following this sequence, V_k(s) converges to V^π(s) for all s, as k increases.
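A minimal sketch of this evaluation sequence, assuming the same kind of tabular representation of R and P as in the earlier sketches (all numbers invented) and a fixed number of sweeps in place of a convergence test, might look as follows.

```python
def evaluate_policy(S, R, P, policy, gamma=0.9, sweeps=100):
    """Iterative policy evaluation (Equation 20.9) for a deterministic policy.

    S: list of states; R[(s, a)]: mean reward; P[(s, a)]: transition
    probabilities over S (in the order of S); policy[s]: action chosen in s.
    """
    V = {s: 0.0 for s in S}                   # arbitrary initialization V_1
    for _ in range(sweeps):                   # one repeat of the update sequence
        V = {s: R[(s, policy[s])]
                + gamma * sum(p * V[s2]
                              for p, s2 in zip(P[(s, policy[s])], S))
             for s in S}
    return V

# Example: an invented two-state model and the policy "always choose action 0".
S = [0, 1]
R = {(0, 0): 1.0, (1, 0): 0.0}
P = {(0, 0): [0.9, 0.1], (1, 0): [0.5, 0.5]}
print(evaluate_policy(S, R, P, policy={0: 0, 1: 0}))
```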
Having the deterministic policy π and the values of states V^π(s) for this policy, the values of state-action pairs are given by:

Q^π(s,a) = R(s,a) + γ ∑_{s′∈S} P(s,a,s′) V^π(s′),   s ∈ S; a ∈ A.    (20.10)
The values of state-action pairs for a given policy π_k can be used to derive an improved deterministic policy π_{k+1}:

π_{k+1}(s) = argmax_{a∈A} Q^{π_k}(s,a),   ∀s ∈ S.    (20.11)

It can be shown that V^{π_{k+1}}(s) ≥ V^{π_k}(s) for all s, and that if the last relation is satisfied with equality for all states, then π_k is an optimal policy.
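Alternating the evaluation sequence of Equation 20.9 with the improvement step of Equations 20.10 and 20.11 yields classical policy iteration. The sketch below does exactly that on an invented tabular model; the number of evaluation sweeps and improvement rounds are arbitrary choices made for the example.

```python
def policy_iteration(S, A, R, P, gamma=0.9, eval_sweeps=100, rounds=20):
    """Alternate policy evaluation (Eq. 20.9) and improvement (Eqs. 20.10-20.11)."""
    policy = {s: A[0] for s in S}                  # arbitrary initial policy
    for _ in range(rounds):
        # Evaluate the current deterministic policy (Equation 20.9).
        V = {s: 0.0 for s in S}
        for _ in range(eval_sweeps):
            V = {s: R[(s, policy[s])]
                    + gamma * sum(p * V[s2]
                                  for p, s2 in zip(P[(s, policy[s])], S))
                 for s in S}
        # Improve it greedily (Equations 20.10 and 20.11).
        Q = {(s, a): R[(s, a)] + gamma * sum(p * V[s2]
                                             for p, s2 in zip(P[(s, a)], S))
             for s in S for a in A}
        new_policy = {s: max(A, key=lambda a: Q[(s, a)]) for s in S}
        if new_policy == policy:                   # no change: policy is optimal
            break
        policy = new_policy
    return policy, V

# Example on the invented two-state, two-action model from the earlier sketch.
S, A = [0, 1], [0, 1]
R = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 2.0}
P = {(0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
     (1, 0): [0.5, 0.5], (1, 1): [0.1, 0.9]}
print(policy_iteration(S, A, R, P))
```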
Improving the policy π may be done based on estimations of V^π(s) instead of the exact values (i.e. if V(s) estimates V^π(s), a new policy can be derived by calculating Q(s,a) according to Equation 20.10, where V(s) replaces V^π(s), and by calculating the improved policy according to Equation 20.11, based on Q(s,a)). Estimations of V^π(s) are usually the result of executing the sequence of updates defined by Equation 20.9, without waiting for V_k(s) to converge to V^π(s) for all s ∈ S. In particular, it is possible to repeatedly execute a single repeat of the sequence defined in Equation 20.9, to use the estimation results to derive a new policy as defined by Equations 20.10 and 20.11, to re-execute a single repeat of Equation 20.9 starting from the current estimation results, and so on. This well-known approach, termed value-iteration, begins with arbitrary initialization Q_1(s,a) for all s and a, and proceeds iteratively with the updates:
Q_{t+1}(s,a) = R(s,a) + γ ∑_{s′∈S} P(s,a,s′) V_t(s′),   ∀s ∈ S; a ∈ A; t = 1, 2, ...    (20.12)

where V_t(s) = max_a Q_t(s,a). It can be shown that using value-iteration, Q_t(s,a) converges to Q*(s,a) (there are also results concerning the rate of this convergence; see, for instance, the discussion in Puterman, 1994). The algorithm terminates using some stopping condition, e.g. when the change in the values Q_t(s,a) due to a single iteration is small enough (it is possible to establish a connection between the change in the values Q_t(s,a) due to a single iteration and the distance between Q_t(s,a) and Q*(s,a); see, for instance, the discussion in Puterman, 1994). Let the termination occur at stage T. The output policy is calculated according to:

π(s) = argmax_{a∈A} Q_T(s,a),   ∀s ∈ S.    (20.13)
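A compact sketch of value-iteration, using the same invented tabular model as in the earlier sketches and the stopping condition mentioned above (terminate when a single iteration changes the values only slightly), could be:

```python
def value_iteration(S, A, R, P, gamma=0.9, tol=1e-6):
    """Value-iteration (Equation 20.12) with a simple stopping condition."""
    Q = {(s, a): 0.0 for s in S for a in A}        # arbitrary initialization Q_1
    while True:
        V = {s: max(Q[(s, a)] for a in A) for s in S}
        new_Q = {(s, a): R[(s, a)] + gamma * sum(p * V[s2]
                                                 for p, s2 in zip(P[(s, a)], S))
                 for s in S for a in A}
        change = max(abs(new_Q[k] - Q[k]) for k in Q)
        Q = new_Q
        if change < tol:                           # stopping condition
            break
    # Output policy (Equation 20.13).
    return {s: max(A, key=lambda a: Q[(s, a)]) for s in S}

# Example on the invented two-state, two-action model.
S, A = [0, 1], [0, 1]
R = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 2.0}
P = {(0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
     (1, 0): [0.5, 0.5], (1, 1): [0.1, 0.9]}
print(value_iteration(S, A, R, P))
```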
20.3.2 Generalization of Dynamic-Programming to Reinforcement-Learning
It should be noted that both the mean-reward and the state-transition functions are required in order to carry out the computations described in the previous sub-section.
In RL, these functions are initially unknown. Two different approaches, indirect and
direct, may be used to generalize the discussion to the absence of these functions.
According to the indirect approach, samples from the consequences of choosing
various actions at various states are gathered. These samples are used to approximate the mean-reward and state-transition functions. Subsequently, the indirect approach
uses the approximations to extract policies. During the extraction of policies, the
approximated functions are used as if they were the exact ones.
The direct approach, the more common of the two, involves continuously main-
taining estimations of the optimal values of states and state-action pairs without hav-
ing any explicitly approximated mean-reward and state-transition functions. The
overview in this sub-section focuses on methods that take a direct approach.
In a typical direct method, the agent begins learning in a certain state, while hav-
ing arbitrary estimations of the optimal values of states and state-action pairs. Sub-
sequently the agent uses a while-learning policy to choose an action. Consequently a
new state is encountered and an immediate reward is obtained (i.e. a new experience
is gathered). The agent uses the new experience to update the estimations of optimal
values for states and state-action pairs visited in previous stages.
The policy that the agent uses while it learns needs to address a dilemma, known as the exploration-exploitation dilemma. Exploitation means using the knowledge gathered in order to obtain desired outcomes. In order to obtain desired outcomes at a certain stage, the agent needs to choose the action which corresponds with the maximal optimal value of a state-action pair given the state. Since the exact optimal values of state-action pairs are unknown, the agent can at best choose the action that maximizes the corresponding estimations. On the other hand, due to the random fluctuations of the reward signals and the random nature of the state-transition function, the agent's estimations are never accurate. In order to obtain better estimations, the agent must explore its possibilities. Exploration and exploitation are conflicting rationales, because by exploring possibilities the agent will sometimes choose actions that seem inferior at the time they are chosen.
In general, it is unknown how best to solve the exploration-exploitation dilemma. There are, however, several helpful heuristics, which typically work as follows. During learning, the action chosen in state s is drawn at random from the entire set of actions, but with a probability function that favors actions for which the current optimal value estimates are high. (See Sutton and Barto, 1998 for a discussion on the exploration-exploitation dilemma and its heuristic solutions.)
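One common heuristic of this kind is the ε-greedy rule: with a small probability the agent explores an action chosen uniformly at random, and otherwise it exploits the action with the highest current estimate. A minimal sketch (the estimates shown are hypothetical):

```python
import random

def epsilon_greedy(Q_estimates, s, actions, epsilon=0.1):
    """Choose an action in state s: explore with probability epsilon,
    otherwise exploit the action with the highest current estimate."""
    if random.random() < epsilon:
        return random.choice(actions)                        # exploration
    return max(actions, key=lambda a: Q_estimates[(s, a)])   # exploitation

# Hypothetical current estimates for one state with two actions.
Q_estimates = {(0, 0): 0.4, (0, 1): 0.7}
print(epsilon_greedy(Q_estimates, s=0, actions=[0, 1]))
```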
Many RL algorithms are stochastic variations of DP algorithms. Instead of using explicit mean-reward and state-transition functions, the agent uses the actual
reward signals and state transitions while interacting with the environment. These
actual outcomes implicitly estimate the real, unknown functions. There are several
assumptions under which the estimates maintained by stochastic variations of DP
algorithms converge to the optimal values of states and state-action pairs. Having the
optimal values, a deterministic optimal policy may be derived according to Equation
20.7. The reader is referred to Bertsekas and Tsitsiklis (1996), Jaakkola et al. (1994)
or Szepesvári and Littman (1999) for formal, general convergence results.
One of the most common RL algorithms is termed Q-Learning (Watkins, 1989; Watkins and Dayan, 1992). Q-Learning takes the direct approach, and can be regarded as the stochastic version of value-iteration. At stage t of Q-Learning, the agent holds Q_t(s,a) - estimations of the optimal values Q*(s,a) - for all state-action pairs. At this stage, the agent encounters state s_t and chooses the action a_t. Following the execution of a_t from s_t, the agent obtains the actual reward r_t, and faces the new state s_{t+1}. The tuple ⟨s_t, a_t, r_t, s_{t+1}⟩ is referred to as the experience gathered on stage t. Given that experience, the agent updates its estimate as follows:
Q_{t+1}(s_t,a_t) = (1 − α_t(s_t,a_t)) Q_t(s_t,a_t) + α_t(s_t,a_t) (r_t + γ V_t(s_{t+1}))
               = Q_t(s_t,a_t) + α_t(s_t,a_t) (r_t + γ V_t(s_{t+1}) − Q_t(s_t,a_t)),    (20.14)
where V_t(s) = max_a Q_t(s,a), and α_t(s_t,a_t) ∈ (0,1) is a step size reflecting the extent to which the new experience needs to be blended into the current estimates (in general, a unique step-size is defined for each state and action and for each stage; usually step-sizes decrease with time). It can be shown that, under several assumptions, Q_t(s,a) converges to Q*(s,a) for all s and a as t → ∞. (Convergence proofs can be found in Watkins and Dayan, 1992; Jaakkola et al., 1996; Szepesvári and Littman, 1999; and Bertsekas and Tsitsiklis, 1996.)
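The following sketch implements the tabular Q-Learning update of Equation 20.14 with an ε-greedy while-learning policy. For simplicity it uses a constant step size (whereas, as noted above, step-sizes usually decrease with time), and it is driven by the invented two-state environment used in the earlier sketches.

```python
import random
from collections import defaultdict

def q_learning(step, S, A, episodes=500, horizon=50,
               gamma=0.9, alpha=0.1, epsilon=0.1):
    """Tabular Q-Learning (Equation 20.14) with an epsilon-greedy policy.

    step(s, a) is assumed to simulate the environment, returning (r, s').
    A constant step size alpha is used here for simplicity.
    """
    Q = defaultdict(float)                        # Q_t(s, a), initialized to 0
    for _ in range(episodes):
        s = random.choice(S)                      # arbitrary initial state
        for _ in range(horizon):
            if random.random() < epsilon:         # explore
                a = random.choice(A)
            else:                                 # exploit current estimates
                a = max(A, key=lambda act: Q[(s, act)])
            r, s_next = step(s, a)                # experience <s, a, r, s'>
            V_next = max(Q[(s_next, act)] for act in A)
            # Equation 20.14: move Q(s, a) toward r + gamma * V_t(s')
            Q[(s, a)] += alpha * (r + gamma * V_next - Q[(s, a)])
            s = s_next
    return Q

# Example with the invented two-state, two-action environment used earlier.
R = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 2.0}
P = {(0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
     (1, 0): [0.5, 0.5], (1, 1): [0.1, 0.9]}

def step(s, a):
    s_next = random.choices([0, 1], weights=P[(s, a)])[0]
    return R[(s, a)] + random.gauss(0.0, 0.1), s_next

Q = q_learning(step, S=[0, 1], A=[0, 1])
print({s: max([0, 1], key=lambda a: Q[(s, a)]) for s in [0, 1]})
```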
In order to understand the claim that Q-Learning is a stochastic version of value-iteration, it is helpful to view Equation 20.14 as an update of the estimated value of a certain state-action pair Q_t(s_t,a_t) in the direction of r_t + γ V_t(s_{t+1}), with a step size α_t(s_t,a_t). With this interpretation, Q-Learning can be compared to value-iteration (Equation 20.12). Referring to a certain state-action pair, denoted by ⟨s_t,a_t⟩, and replacing the state-space index s′ with s_{t+1}, Equation 20.12 can be re-phrased as:
Q_{t+1}(s_t,a_t) = R(s_t,a_t) + γ ∑_{s_{t+1}∈S} P(s_t,a_t,s_{t+1}) V_t(s_{t+1}).    (20.15)
Rewriting Equation 20.14 with α_t(s_t,a_t) = 1 results in:
Q_{t+1}(s_t,a_t) = r_t + γ V_t(s_{t+1}).    (20.16)
The only difference between Equations 20.15 and 20.16 lies in the use of r_t in place of R(s_t,a_t) and of V_t(s_{t+1}) in place of ∑_{s_{t+1}∈S} P(s_t,a_t,s_{t+1}) V_t(s_{t+1}). It can be shown that:
E[ r_t + γ V_t(s_{t+1}) ] = R(s_t,a_t) + γ ∑_{s_{t+1}∈S} P(s_t,a_t,s_{t+1}) V_t(s_{t+1}),    (20.17)
namely, Equation 20.16 is a stochastic version of Equation 20.15. It is appropriate to use a unit step-size when basing an update on exact values, but it is inappropriate to do so when basing the update on unbiased estimates, since the learning algorithm must be robust to the random fluctuations.
In order to converge, Q-Learning is assumed to update each state-action pair infinitely often, but there is no explicit instruction as to which action to choose at each stage. However, in order to boost the rate at which the estimations converge to the optimal values, heuristics that address the exploration-exploitation dilemma are usually used.

20.4 Extensions to Basic Model and Algorithms
In general, the RL model is quite flexible and can be used to capture problems in a variety of domains. Applying an appropriate RL algorithm can lead to optimal solutions without requiring explicit mean-reward or state-transition functions. There are some problems, however, that the RL model described in Section 20.2 cannot capture. There may also be some serious difficulties in applying the RL algorithms described in Section 20.3. This section presents overviews of two extensions. The first extension involves a multi-agent RL scenario, where learning agents co-exist in a single environment. This is followed by an overview of the problem of large (or even infinite) sets of states and actions.
20.4.1 Multi-Agent RL
In RL, as in real-life, the existence of one autonomous agent affects the outcomes ob-
tained by other co-existing agents. An immediate approach for tackling multi-agent
RL is to let each agent refer to its colleagues (or adversaries) as part of the envi-
ronment. A learner that takes this approach is regarded as an independent learner
(Claus and Boutilier, 1998). It should be noted that the environment of an independent learner consists of learning components (the other agents) and is therefore not stationary. The model described in Section 20.2, as well as the convergence results mentioned in Section 20.3, assumed a stationary environment (i.e. the mean-reward
and state-transition functions do not change in time). Although convergence is not
guaranteed when using independent learners in multi-agent problems, several au-
thors have reported good empirical results for this approach. For example, Sen et al.
(1994) used Q-Learning in multi-agent domains.
Littman (1994) proposed Markov Games (often referred to as Stochastic Games) as the theoretic model appropriate for multi-agent RL problems. A k-agent Stochastic Game (SG) is defined by a tuple ⟨S, Ā, R̄, P⟩, where S is a finite set of states (as in the case of MDPs); Ā = A_1 × A_2 × ... × A_k is a Cartesian product of the k action-sets available for the k agents; R̄ : S × Ā → ℜ^k is a collection of k mean-reward functions for the k agents; and P : S × Ā × S → [0,1] is a state-transition function. The evolution of an SG is controlled by k autonomous agents acting simultaneously rather than by a single agent. The notion of a game as a model for multi-agent RL problems raises the concept of Nash-equilibrium as an optimality criterion. Generally speaking, a joint policy is said to be in Nash-equilibrium if no agent can gain from being the only one to deviate from it.
Several algorithms that rely on the SG model appear in the literature (Littman, 1994; Hu and Wellman, 1998; Littman, 2001). In order to assure convergence, agents in these algorithms are programmed as joint learners (Claus and Boutilier, 1998). A joint learner is aware of the existence of other agents and in one way or another adapts its own behavior to the behavior of its colleagues. In general, the problem of multi-agent RL is still the subject of ongoing research. For a comprehensive introduction to SGs the reader is referred to Filar and Vrieze (1997).
20.4.2 Tackling Large Sets of States and Actions
The algorithms described in Section 20.3 assumed that the agent maintains a look-up table with a unique entry corresponding to each state or state-action pair. As the agent gathers new experience, it retrieves the entry corresponding to the state or state-action pair of that experience, and updates the estimate stored within the entry. Representing estimates of optimal values in the form of a look-up table is limited to problems with a reasonably small number of states and actions. Obviously, as the number of states and actions increases, the memory required for the look-up table increases, and so do the time and experience needed to fill up this table with reliable estimates. That is to say, if there is a large number of states and actions, the agent cannot have the privilege of exploring them all, but must incorporate some sort of generalization.
Generalization takes place through function-approximation, in which the agent maintains a single approximated value function mapping states or state-action pairs to values. Function approximation is a central idea in Supervised-Learning (SL). The task in function-approximation is to find a function that will best approximate some unknown target-function based on a limited set of observations. The approximating function is located through a search over the space of parameters of a decided-upon family of parameterized functions. For example, the unknown function Q* : S × A → ℜ may be approximated by an artificial neural network of pre-determined architecture. A certain network f : S × A → ℜ belongs to a family of parameterized networks Φ, where the parameters are the weights of the connections in the network.
RL with function approximation inherits the direct approach described in Sec-
tion 20.3. That is, the agent repeatedly estimates the values of states or state-action
pairs for its current policy; uses the estimates to derive an improved policy; estimates
the values corresponding to the new policy; and so on. However, the function representation adds some complications to the process. In particular, when the number of actions is large, the principle of policy improvement by finding the action that maximizes current estimates does not scale well. Moreover, convergence
results characterizing look-up table representations usually cannot be generalized
to function-approximation representations. The reader is referred to Bertsekas and
Tsitsiklis (1996) for an extensive discussion on RL with function-approximation and
corresponding theoretic results.
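As one deliberately simple illustration of the idea (not the approach analyzed in the references above), the sketch below approximates Q with a linear function of a hand-crafted feature vector and adjusts the weights with a semi-gradient version of the Q-Learning update. The feature map, learning rate and experience tuple are all invented for the example, and the convergence guarantees of the tabular case do not carry over.

```python
def features(s, a):
    """A hypothetical, hand-crafted feature vector for the pair (s, a)."""
    return [1.0, float(s), float(a), float(s) * float(a)]

def q_value(w, s, a):
    """Linear approximation: Q(s, a) ~ w . features(s, a)."""
    return sum(wi * xi for wi, xi in zip(w, features(s, a)))

def semi_gradient_q_update(w, s, a, r, s_next, actions,
                           gamma=0.9, alpha=0.01):
    """One semi-gradient Q-Learning step on the weight vector w."""
    target = r + gamma * max(q_value(w, s_next, a2) for a2 in actions)
    error = target - q_value(w, s, a)
    x = features(s, a)
    return [wi + alpha * error * xi for wi, xi in zip(w, x)]

# Hypothetical usage with a single made-up experience tuple <s, a, r, s'>.
w = [0.0, 0.0, 0.0, 0.0]
w = semi_gradient_q_update(w, s=0, a=1, r=1.0, s_next=1, actions=[0, 1])
print(w)
```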
20.5 Applications of Reinforcement-Learning
Using RL, an agent may learn how to best behave in a complex environment without
any explicit knowledge regarding the nature or the dynamics of this environment. All
that an agent needs in order to find an optimal policy is the opportunity to explore its
options.
In some cases, RL occurs through interaction with the real environment under
consideration. However, there are cases in which experience is expensive. For exam-
ple, consider an agent that needs to learn a decision policy in a business environment.
