Sensor-based learning for practical
planning of fine motions in robotics
Enric Cervera, Angel P. del Pobil
Department of Computer Science and Engineering, Jaume-I University, Castelló, Spain
Received 4 July 2001; received in revised form 8 October 2001; accepted 28 November 2001
Abstract
This paper presents an implemented approach to part-mating of three-dimensional
non-cylindrical parts with a 6 DOF manipulator, considering uncertainties in modeling,
sensing and control. The core of the proposed solution is a reinforcement learning al-
gorithm for selecting the actions that achieve the goal in the minimum number of steps.
Position and force sensor values are encoded in the state of the system by means of a
neural network. Experimental results are presented for the insertion of different parts –
circular, quadrangular and triangular prisms – in three dimensions. The system exhibits
good generalization capabilities for different shapes and location of the assembled
parts. These results significantly extend most of the previous achievements in fine
motion tasks, which frequently model the robot as a polygon translating in the plane in
a polygonal environment or do not present actual implemented prototypes.
© 2002 Elsevier Science Inc. All rights reserved.
Keywords: Robotics; Neural nets; Reinforcement learning
1. Introduction
We present a practical framework for fine motion tasks, particularly the
insertion of non-cylindrical parts with uncertainty in modeling, sensing and
control. The approach is based on an algorithm which autonomously learns a
relationship between sensed states and actions. This relationship allows the
robot to select those actions which attain the goal in the minimum number of
steps. A feature extraction neural network complements the learning algo-
rithm, forming a practical sensing-action architecture for manipulation tasks.
In the type of motion planning problems addressed in this work, interactions
between the robot and objects are allowed, or even mandatory, for operations
such as compliant motions and parts mating. We restrict ourselves to tasks
which do not require complex plans; however, they are significantly difficult to
attain in practice due to uncertainties. Among these tasks, the peg-in-hole in-
sertion problem has been broadly studied, but very few results can be found in
the literature for three-dimensional non-cylindrical parts in an actual imple-
mentation.
We believe that practicality, although an important issue, has been vastly
underestimated in fine motion methods, since most of these approaches are
based on geometric models which become complex for non-trivial cases, espe-
cially in three dimensions [1].
The remainder of this paper is structured as follows. Section 2 reviews some
related work and states the key contributions of our work. In Section 3, we
describe the components of the architecture. Thorough experimental results are
then presented in Section 4. Finally, Section 5 discusses a number of issues
regarding the proposed approach, and draws some conclusions.
2. Background and motivation
2.1. Related research
Though the peg-in-hole problem has been exhaustively studied for a long

time [2–4], most of the implementations have been limited to planar motions or
cylindrical parts [5–7]. Caine et al. [8] pointed out the difficulties of inserting
prismatic pegs. To our knowledge, our results are the first for a system which
learns to insert non-cylindrical pegs (see Fig. 1) in a real-world task with un-
certainty in position and orientation.
Parts mating in real-world industry is frequently performed by passive
compliance devices [4], which support parts and aid their assembly. They are
capable of high-speed precision insertions, but they lack the flexibility of
software methods.
A difficult issue in parts mating is the need for nonlinear compliance for
chamferless insertions, which was demonstrated by Asada [2], who proposed a
supervised neural network for learning the nonlinear relationship between
sensing and motion in a two-dimensional frictionless peg-in-hole task. The use
of a supervised network presents a great difficulty in real-world three-dimen-
sional problems, since a proper training set has to be generated.
Lozano-Pérez [9] first proposed a formal approach to the synthesis of
compliant-motion strategies from geometric descriptions of assembly opera-
tions and explicit estimates of errors in sensing and control. In an extension to
this approach, Donald [10] presented a formal framework for computing
motion strategies which are guaranteed to succeed in the presence of three
kinds of uncertainty (sensing, control and model). Experimental verification is
described in [11], but only for planar tasks. Following Donald's work, Briggs
[12] proposed an O(n² log n) algorithm, where n is the number of vertices in
the environment, for the basic problem of manipulating a point from a spec-
ified start region to a specified goal region amidst planar polygonal obstacles

where control is subject to uncertainty. Latombe et al. [13] describe two
practical methods for computing preimages for a robot having a two-dimen-
sional Euclidean configuration space. Though the general principles of the
planning methods immediately extend to higher dimensional spaces, the geo-
metric algorithms do not, and only simulated examples of planar tasks are
shown. LaValle and Hutchinson [14] present another framework for manipu-
lation planning under uncertainty, based on preimages, though they consider
such approach to be reasonable only for a few dimensions. Their computed
examples are restricted to planar polygonal models.
A different geometric approach is introduced by McCarragher and Asada
[15] who define a discrete event in assembly as a change in contact state re-
flecting a change in a geometric constraint. The discrete event modeling is
accomplished using Petri nets. Dynamic programming is used for task-level
planning to determine the sequence of desired markings (contact state) for
discrete event control that minimizes a path length and uncertainty perfor-
mance measure. The method is applied to a dual peg-in-hole insertion task, but
the motion is kept planar.
Learning methods provide a framework for autonomous adaptation and
improvement during task execution. An approach to learning a reactive control
strategy for peg-in-hole insertion under uncertainty and noise is presented in
[16]. This approach is based on active generation of compliant behavior using a
Fig. 1. Diagram of the insertion task.
nonlinear admittance mapping from sensed positions and forces to velocity
commands. The controller learns the mapping through repeated attempts at
peg insertion. A two-dimensional version of the peg-in-hole task is imple-
mented on a real robot. The controller consists of a supervised neural network,
with stochastic units. In [5] the architecture is applied to a real ball-balancing
task, and a three-dimensional cylindrical peg-in-hole task. Kaiser and Dillman
[17] propose a hierarchical approach to learning the efficient application of

robot skills in order to solve complex tasks. Since people can carry out ma-
nipulation tasks with no apparent difficulty, they develop a method for the
acquisition of sensor-based robot skills from human demonstration. Two
manipulation skills are investigated: peg insertion and door opening. Distante
et al. [18] apply reinforcement learning techniques to the problem of target
reaching by using visual information.
2.2. Motivation
Approaches based on geometric models are far from being satisfactory:
most of them are restricted to planar problems, and a plan might not be found
if the part geometries are complex or the uncertainties are great. Many
frameworks do not consider incorrect modeling and robustness.
Though many of the approaches have been implemented in real-world en-
vironments, they are frequently limited to planar motions. Furthermore, cyl-
inders are the most utilized workpieces in three-dimensional problems.
If robots can be modeled as polygons moving amid polygonal obstacles in a
planar world, and a detailed model is available, a geometric framework is adequate.
However, since such conditions are rarely found in practice, we argue that a
robust, adaptive, autonomous learning architecture for robot manipulation
tasks – particularly part mating – is a necessary alternative in real-world en-
vironments, where uncertainties in modeling, sensing and control are un-
avoidable.
3. A practical adaptive architecture
Fig. 2 depicts the three components of the adaptive architecture: two sensor-
based motions – guarded and compliant – and an additional subsystem com-
bining learning and exploration.
This architecture relies on two types of sensor: position (x) and force (f).
Throughout this work, position and orientation of the tool frame are obtained
from the robot joint angles using the kinematic equations. Force measurements
are obtained from a wrist-mounted strain gauge sensor. It is assumed that all
sensors are calibrated, but uncertainty cannot be completely eliminated due to
sensor noise and calibration imprecision. The system's output is the end-
effector velocity (v) in Cartesian coordinates, which is translated to joint
coordinates by a resolved motion rate controller:

\dot{\theta} = J^{-1} v, \quad \text{where } v = \dot{x}. \qquad (1)
Since the work space of the fine motion task is limited to a small region, the
singularities of J are not important in this framework.
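For illustration only, Eq. (1) can be sketched numerically as follows. The Jacobian below is a placeholder (the paper relies on the robot's own kinematic model), and a pseudo-inverse is used as a safe stand-in for the plain inverse away from singularities.

```python
import numpy as np

def jacobian(q):
    """Placeholder for the 6x6 manipulator Jacobian at joint configuration q."""
    return np.eye(6)   # identity keeps the sketch runnable; a real model goes here

def joint_rates(q, v_cartesian):
    """Eq. (1): map a commanded Cartesian velocity v to joint rates."""
    J = jacobian(q)
    # pinv behaves like J^-1 away from singularities, which is the case here
    return np.linalg.pinv(J) @ v_cartesian

q = np.zeros(6)                                    # current joint angles (rad)
v = np.array([0.0, 0.0, -0.005, 0.0, 0.0, 0.0])    # 5 mm/s guarded approach along -Z
theta_dot = joint_rates(q, v)
```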
3.1. The insertion plan
Uncertainty in the location of the part and the hole prevents the success of a
simple position-based plan. Contact between parts has to be monitored, and
different actions are needed to perform a correct insertion. Other approaches
have tried to build a plan by considering all the possible contact states, but they
have only succeeded in simple planar tasks. In addition, uncertainty poses
difficulties for identifying the current state.
The proposed insertion plan consists of three steps, which are inspired by
intuitive manipulation skills:
(1) Approach hole until a contact is detected.
(2) Move compliantly around the hole until contact is lost (hole found).
(3) Move into the hole until a contact is detected (bottom of the hole).
This strategy differs from a pure random search in that an adaptation
procedure is performed during the second step. The system learns a relation-
ship between sensing and action, in an autonomous way, which guides the
exploration towards the target. Initially, the system relies heavily on explora-
tion. As a result of experience, an insertion skill is learned, and the mean in-
sertion time for the task is considerably improved.
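As an illustration, the three steps might be organized as the trial loop sketched below. The robot interface (approach_until_contact, select_action, and so on) is hypothetical and does not appear in the paper; it only makes the control flow explicit.

```python
def insertion_trial(robot, timeout_s=20.0):
    """One trial of the three-step plan; returns True on a successful insertion."""
    robot.approach_until_contact()                  # step 1: guarded approach
    elapsed = 0.0
    while robot.in_contact() and elapsed < timeout_s:
        state = robot.sensed_state()                # discretized position + force
        action = robot.select_action(state)         # learned policy with exploration
        elapsed += robot.execute_compliant(action)  # step 2: compliant exploration
    if not robot.in_contact():                      # contact lost: the hole was found
        robot.insert_until_bottom()                 # step 3: guarded insertion
        return True
    return False                                    # timeout: abort and start a new trial
```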

3.2. Guarded motions
In guarded motions, the system continuously monitors a condition that, when
met, stops the motion, e.g. a force value going beyond a fixed threshold.
Fig. 2. Subsystems of the adaptive architecture.
In the above insertion plan, all the steps are force-guarded. Starting from a
free state and due to the geometry of the task, a contact is gained if |F_z| rises to
0.1 kgf, and the contact is lost if |F_z| falls below 0.05 kgf. This dual threshold
accounts for small variations in the contact force due to friction, or uncertainty
in the measurements.
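A minimal sketch of this dual-threshold (hysteresis) contact test, using the 0.1/0.05 kgf values quoted above:

```python
CONTACT_ON = 0.10    # kgf: contact gained when |Fz| reaches this value
CONTACT_OFF = 0.05   # kgf: contact lost when |Fz| drops below this value

def update_contact(in_contact, fz):
    """Hysteresis keeps friction and sensor noise from toggling the contact state."""
    if not in_contact and abs(fz) >= CONTACT_ON:
        return True
    if in_contact and abs(fz) < CONTACT_OFF:
        return False
    return in_contact
```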
The part may occasionally be inserted directly during the first step, so additional in-
formation is required to determine whether the contact has been caused by the surface.
A position value is enough, since the depth of the hole is usually much greater
than the uncertainty in location. Another possibility is making small lateral
motions: if large forces are detected, the part has already been inserted into the
hole.
3.3. Compliant motions
Once a contact is achieved, motion is restricted to a surface. In practice, two
degrees of freedom (X, Y) are position-controlled, while the third one (Z) is
force-controlled. Initially, random compliant motions are performed, but a
relationship between sensed forces and actions is learned, which decreases the
time needed to insert the part.
During the third step, a complementary compliant motion is performed. In
this task, when the part is inserted, Z is position-controlled, while (X, Y) are
force-controlled.
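The hybrid control of the exploration step could be sketched as follows. The proportional gains are illustrative assumptions; only the −0.15 kgf force set-point (Section 4.1) comes from the paper.

```python
import numpy as np

KP_POS = 1.0       # 1/s, gain on the XY position error (illustrative)
KP_FORCE = 0.002   # m/s per kgf, gain on the Z force error (illustrative)
FZ_REF = -0.15     # kgf, contact force set-point (value from Section 4.1)

def compliant_velocity(p_err_xy, fz_measured):
    """Cartesian velocity command: position servo in (X, Y), force servo in Z."""
    vx, vy = KP_POS * np.asarray(p_err_xy, dtype=float)
    vz = KP_FORCE * (FZ_REF - fz_measured)
    return np.array([vx, vy, vz])
```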
3.4. Exploration and learning

Random search has been proposed in the literature as a valid tool for
dealing with uncertainties [19]. However, the insertion time greatly increases
when the clearance ratio decreases. In the proposed architecture (see Fig. 3), an
adaptation process learns a relationship between sensed states and actions,
which guides the insertion task towards completion with the minimum number
of actions.
A sensed state consists of a discretized position and force measurement, as
described below. A value is stored in a look-up table for each pair of state and
Fig. 3. Learning subsystem. Exploration is embedded in the action selection block.
action. This value represents the amount of reinforcement which is expected in
the future, starting from the state, if the action is performed.
The reinforcement (or cost) is a scalar value which measures the quality of
the performed action. In our setup, a negative constant reinforcement is gen-
erated after every motion. The learning algorithm adapts the values of the table
so that the expected reinforcement is maximized, i.e., the number of actions
(cost) to achieve the goal is minimized.
The discrete nature of the reinforcement learning algorithm requires the
extraction of discrete values from the sensor signals of force and posi-
tion. This feature extraction process along with the basis of the learning
algorithm is described below.
3.4.1. Feature extraction
Force sensing is introduced to compensate for the uncertainty in positioning
the end-effector. It is effective when a small displacement causes a contact,
since a large change in force is detected. However, with only force signals it is
not always possible to identify the actual contact state, i.e., different contacts
produce similar force measurements, as described in [20].
The adopted solution is to combine the force measurements with the relative
displacement of the end-effector from the initial position, i.e., that of the first
contact between the part and the surface.

The next problem is the discretization of the inputs, which is a requirement
of the learning algorithm. There is a conflict between size and fineness. With a
fine representation, the number of states is increased, thus slowing down the
convergence of the learning algorithm. Solutions are problem-dependent, using
heuristics for finding a good representation of manageable size.
We have obtained good results by dividing the exploration space into
three intervals along each position-controlled degree of freedom. For cylin-
drical parts, the XY-plane of the exploration space is divided into nine regions –
a 3 × 3 grid. For non-cylindrical parts, the rotation around the Z-axis has to be
considered too, so the total number of position states is 27. Region limits are fixed
according to the estimated uncertainty and the radius of exploration.
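A possible encoding of this qualitative position state is sketched below. The ±2 mm region limits follow Section 4.1; the angular limit for non-cylindrical parts is an assumption (one third of the ±14° exploration angle used in Section 4.2).

```python
import numpy as np

def interval_index(value, limit):
    """0, 1 or 2: below -limit, within [-limit, +limit), or at/above +limit."""
    return int(np.digitize(value, [-limit, limit]))

def position_state(dx_mm, dy_mm, dtheta_deg=None):
    """9 states for the cylinder (XY only), 27 when the Z rotation is included."""
    s = interval_index(dx_mm, 2.0) * 3 + interval_index(dy_mm, 2.0)
    if dtheta_deg is not None:                     # non-cylindrical parts only
        s = s * 3 + interval_index(dtheta_deg, 4.7)
    return s
```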
Though the force space could be partitioned in a similar way, an unsuper-
vised clustering scheme is used. In a previous work [20] we pointed out the
feasibility of unsupervised learning algorithms, particularly Kohonen's self-
organizing maps (SOMs) [21], for extracting feature information from sensor
data in robotic manipulation tasks.
An SOM is a lattice of units, or cells. Each unit is a vector with as many
components as inputs to the SOM. Though there is a neighborhood relation-
ship between units in the lattice, this is only used during the training of the map
and not in our scheme.
SOMs perform a nonlinear projection of the probability density function of
the input space onto the two-dimensional lattice of units. Though all six
force and torque signals are available, the practical solution adopted is to use
only the three torque signals as inputs to the map. The reason for this is the
strong correlation between the force and the torque; thus, adding those cor-
related signals does not add any new information to the system.
The SOM is trained with sensor samples obtained during insertions. After
training, each cell or unit of the map becomes a prototype or codebook vector,
which represents a region of the input space. The discretized force state is the
codebook vector nearest (in Euclidean distance) to the analog force values.
The number of units must be chosen a priori, seeking a balance between
size and fineness. In the experiments, a 6 × 4 map is used, totalling 24
discrete force states. Since the final state consists of position and force, there
are 9 × 24 = 216 discrete states in the cylindrical insertion, and 27 × 24 = 648
discrete states in the non-cylindrical task.
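The force discretization and the combined state index could look like the sketch below. The codebook values are random stand-ins, since the real ones come from the off-line SOM training described above; the 6 × 4 map size matches the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(6 * 4, 3))   # 24 units x 3 torque inputs (stand-in values)

def force_state(torques):
    """Index of the SOM unit whose codebook vector is nearest to the torque reading."""
    d = np.linalg.norm(codebook - np.asarray(torques, dtype=float), axis=1)
    return int(np.argmin(d))

def full_state(pos_state, torques, n_force_states=24):
    """Combine the qualitative position state and the force winner into one index."""
    return pos_state * n_force_states + force_state(torques)
```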
3.4.2. Reinforcement learning
The advantage of the proposed architecture over other random approaches
is the ability to learn a relationship between sensed states and actions. As the
system becomes skilled, this relationship is more intensely used to guide the
process towards completion with the minimum number of steps.
The system must learn without a teacher. The skill measurement is the time
or number of steps required to perform a correct insertion and is expressed in
terms of cost or negative reinforcement.
Sutton [22] defined reinforcement learning (RL) as the learning of a map-
ping from situations to actions so as to maximize a scalar reward or rein-
forcement signal.
Q-learning [23] is an RL algorithm that can be used whenever there is no
explicit model of the system and the cost structure. This algorithm learns the
state–action pairs which maximize a scalar reinforcement signal that will be
received over time. In the simplest case, this measure is the sum of the future
reinforcement values, and the objective is to learn an associative mapping that
at each time step selects, as a function of the current state, an action that
maximizes the expected sum of future reinforcement.
In Q-learning, a look-up table of Q-values is stored in memory, one Q-value
for each state–action pair. The Q-value is the expected amount of reinforce-
ment if, from that state, the action is performed and, afterwards, only optimal
actions are chosen. In our setup, when the system performs any action (mo-
tion), a negative constant reinforcement is signalled. This reinforcement rep-
resents the cost of the motion. Since the learning algorithm tends to maximize
the reinforcement, cost will be minimized, i.e., the system will learn those ac-
tions which lead to the goal with the minimum number of steps.
The basic learning step consists in updating a single Q-value. If the system
senses state s, and it performs action a, resulting in reinforcement r and
the system senses a new state s', then the Q-value for (s, a) is updated as
follows:

Q(s, a) \leftarrow (1 - \alpha)\, Q(s, a) + \alpha \left[ r + \gamma \max_{a' \in A(s')} Q(s', a') \right], \qquad (2)
where α is the learning rate and γ is a discount factor, which weighs the value of
future reinforcement. The table converges to the optimal values as long as all
the states are visited infinitely often. In practice, a good solution is obtained
with a few thousand trials of the task.
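A tabular implementation of the update (2) with the action-penalty reinforcement is straightforward; the learning rate, discount factor and reward value below are illustrative, since the paper does not report them.

```python
import numpy as np

N_STATES, N_ACTIONS = 216, 8              # cylindrical case: 9 x 24 states, 8 motions
Q = np.zeros((N_STATES, N_ACTIONS))
ALPHA, GAMMA, R_STEP = 0.1, 0.95, -1.0    # illustrative values

def q_update(s, a, s_next, r=R_STEP, terminal=False):
    """Eq. (2): move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    target = r if terminal else r + GAMMA * np.max(Q[s_next])
    Q[s, a] = (1.0 - ALPHA) * Q[s, a] + ALPHA * target
```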
3.4.3. Action selection and exploration

During the learning process, there is a conflict between exploration and
exploitation. Initially, the Q-values are meaningless and actions should be
chosen randomly, but as learning progresses, better actions should be chosen to
minimize the cost of learning. However, exploration cannot be completely
turned off, since the optimal action might not yet be discovered.
Some heuristics for exploration and exploitation can be found in the liter-
ature. In the implementation, we have chosen the Boltzmann exploration: the
Q-values are used for weighing exploitation and exploration. The probability
of selecting an action a in state s is
p(s, a) = \frac{\exp\left(Q(s, a)/T\right)}{\sum_{a'} \exp\left(Q(s, a')/T\right)}, \qquad (3)
where T is a positive value, which controls the degree of randomness, and it is
often referred to as temperature. It gradually decays from an initial value, and
exploration is turned off when it is close to zero, since the best action is selected
with probability 1.
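Eq. (3) with a decaying temperature can be sketched as follows; the initial temperature and decay rate are assumptions, since the paper only states that T decays towards zero.

```python
import numpy as np

def boltzmann_action(q_row, temperature, rng=np.random.default_rng()):
    """Eq. (3): sample an action with probability proportional to exp(Q/T)."""
    z = np.asarray(q_row, dtype=float) / max(temperature, 1e-8)
    z -= z.max()                      # shift for numerical stability
    p = np.exp(z)
    p /= p.sum()
    return int(rng.choice(len(p), p=p))

# illustrative schedule: T decays geometrically after each trial
T0, DECAY = 1.0, 0.999
```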
4. Experimental results
The system has been implemented on a robot arm equipped with a wrist-

mounted force sensor (Fig. 4). The task is the insertion of pegs of different
shapes (circular, square and triangular section) into their appropriate holes.
Pegs are made of wood, and the platform containing the holes is made of a
synthetic resin.
Uncertainty in the position and orientation is greater than the clearance
between the pegs and holes. The nominal goal is specified by a vector and a
rotation matrix relative to an external fixed frame of reference. This location is
supposed to be centered above the hole, so the peg would be inserted just by
moving straight along the Z axis with no rotation if there were no uncertainty
present. After positioning over the nominal goal, the robot performs a guarded
motion towards the hole.
If the insertion fails, the robot starts a series of perception and action cycles.
First, sensors are read, and a state is identified; depending on such state, one
action or another is chosen, and the learning mechanism updates the internal
parameters of decision. The robot performs compliant motions, i.e., it keeps
the contact with the surface while moving, so that it can detect the hole by a
sudden force change due to the loss of contact.
To avoid long exploration cycles, a timeout is set which stops the process if
the hole is not found within that time. In this case a new trial is started.
4.1. Case of the cylindrical peg
The peg is 29 mm in diameter, while the hole is chamferless and 29.15 mm in
diameter. The clearance between the peg and the hole is 0.075 mm, so the clear-
ance ratio is approximately 0.005. The peg has to be inserted to a depth of 10 mm into the hole.
The input space of the self-organizing map is defined by the three filtered
torque components. The map has 6 × 4 units. The map is trained off-line with
approximately 70,000 data vectors extracted from previous random trials.
Once the map is trained, the robot performs a sequence of trials, each of
which starts at a random position within an uncertainty radius of 3 mm. To
ensure that the goal lies within the exploration area, this area is set to
a 5 mm square centered at the real starting position. Exploration motions are
Fig. 4. Zebra Zero robot arm, grasping a peg over the platform.
tangential to the surface, i.e., along the X and Y dimensions. The exploration
space is partitioned into nine regions – the limits between regions are −2 and +2
mm away from the initial location for both X and Y. Each of these regions
defines a qualitative location state. The state is determined by combining the
winner unit of the map and this relative qualitative position with respect to the
initial location, so the total number of states is 24 × 9 = 216.
Contact is detected simply by thresholds in the force component F_z (normal
to the surface). During compliant motions, a force F_z equal to −0.15 kgf is
constantly exerted on the surface.
The action space is discretized. Exploratory compliant motions consist of
fixed steps in eight different directions of the XY-plane, with some degrees of
freedom (X, Y) being position-controlled and the other (Z) being force-
controlled. The complexity of the motion is transferred to the control modules,
and the learning process is simplified.
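The discrete action set might be written as follows; the eight compass directions follow the text, while the 1 mm step length is an assumption.

```python
import numpy as np

STEP_MM = 1.0                                   # assumed step length
ANGLES = np.deg2rad(np.arange(0, 360, 45))      # eight compass directions in the XY plane
ACTIONS_XY = [(STEP_MM * np.cos(a), STEP_MM * np.sin(a)) for a in ANGLES]
# For non-cylindrical parts, two rotations about Z are appended (Section 4.2),
# giving the 10 actions per state mentioned there.
```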
4.1.1. Learning results
The learning update step consists in modifying the Q-value of the previous
state and the performed action according to the reinforcement and the value of
the next state. The agent receives a constant negative reinforcement for each
action it performs (action-penalty representation). The best policy, the one that
maximizes the obtained reinforcement, is the one achieving the goal with the
minimum number of actions. Experimental results are shown in Fig. 5. The
critical phase is the surface-compliant motion towards the hole. The system
must learn to find the hole based on sensory information. The exploration time

Fig. 5. Smoothed insertion time taken on 4000 trials of the cylinder task.
of 4000 consecutive trials is shown. The timeout is set to 20 s in each trial. The
smoothed curve was obtained by filtering the data using a moving-average
window of 100 consecutive values.
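The smoothing used for Fig. 5 is a plain moving average; a short sketch:

```python
import numpy as np

def smooth(insertion_times, window=100):
    """Moving average over 'window' consecutive per-trial insertion times."""
    kernel = np.ones(window) / window
    return np.convolve(insertion_times, kernel, mode="valid")
```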
After 1500 trials, the insertion time is considerably improved over the values
at the first steps. Although the results presented by [5] show a faster conver-
gence for a similar task, one should note that the setup is quite different, since
the real location is used as input to the robot, and it is unclear how the trained
system could generalize to a different test location.
Fig. 6 depicts the evolution of the probability of successful insertions, given
a timeout of 20 s. This probability is estimated by calculating the percentage of
achieved goals during 100 consecutive trials. The system evolves from a bare
38% of successful insertions during the first 500 trials (accomplished by random
motions) to a satisfactory 93% of success during the last 500 trials of the
learning process.
4.1.2. Adaptation to a new position
Our system uses relative forces and a relative open-loop estimation of the
location of the end-effector. Theoretically, this information is invariant with
respect to the position and orientation of the target. Any goal should be
achieved with equal probability, provided an initial location within a given
bound of the real goal. Our system has been tested on different locations
without relearning, showing a good performance (83% of successful insertions)
when there is no orientation change.
Fig. 6. Evolution of the probability of successful insertion during the training process.
However, if the hole is rotated 90° (see Fig. 7), there is a significant loss in
performance (only 44% of insertions are successful), but, upon additional
training, the system quickly recovers near-perfect performance for the new
setup (Fig. 8 shows that with fewer than 500 new trials, more than 70% of in-
sertions are successful, whereas during the training process, 1000 trials were
required to achieve this rate).
Since the trial timeout is set at 20 s, additional experiments were carried out
with a higher timeout in order to study the distribution of successes over a long
time, and compare the differences between the random and learning strategies.
Fig. 9 depicts the distribution of successful insertions with respect to time for
1000 random trials and 1000 trials using the learned controller – learning has
been turned off during these trials. As expected, it was found that it is possible
to achieve nearly all the insertions with random motions, provided the neces-
sary amount of time. The learned controller, however, achieves the best results
in significantly less time.
4.2. Non-cylindrical shapes: square section
Due to its radial symmetry, the cylinder is simpler than other parts for in-
sertions. It has been widely studied in the literature since the force analysis can
Fig. 7. New orientation of the hole.
be done in two dimensions. Analytical results for pegs of other shapes are
much more difficult to obtain: Caine et al. [8] developed a heuristic approach to manage
1000 different contact states of a rectangular peg insertion.
Fig. 8. Adaptation to a new hole setup.
Fig. 9. Probability of insertion for random and learned strategies with the cylinder.
In our architecture, it is very simple to extend the agent's capabilities to deal
with other shapes apart from the cylinder. Besides the uncertainty in the po-
sition along the dimensions X and Y (tangential to the surface), the agent must
deal with the uncertainty in the orientation with respect to the Z axis (the hole
axis, which is normal to the surface). The peg used in the experiments has a
square section, its side being 28.8 mm. The hole is a 29.2 mm square, thus the
clearance is 0.2 mm, and the clearance ratio is approximately 0.013. The peg
is made of wood, like the cylinder, and the hole is located in the same platform

as before.
The radius of uncertainty in the position is 3 mm, and the uncertainty in the
orientation is ±8.5°. The exploration area is a 5 mm square and an angle of
±14°. The area is partitioned into nine regions, and the angle is divided into
three segments. The self-organizing map contains 6 × 4 units, as in the pre-
vious case. The rest of the training parameters are the same as before.
The same parameters are used with the map as with the cylinder. The trained
map is depicted in Fig. 10. The input space is partitioned by the map units in an
unsupervised way, according to the statistical distribution of the input data.
The total number of states is 27 × 24 = 648. Some of them may never actually
be visited, so the number of states reached in practice is somewhat smaller. There is
a tradeoff between the number of states and the learning speed.
Two new actions are added, namely rotations around the normal axis to the
surface, since symmetry around it does not hold any more. A qualitative
measure of that angle is also included in the estimation of the agent's location.
Since 10 different actions are possible at each state, the table of Q-values has
6480 entries.
Fig. 10. Voronoi diagram defined by the projection of the SOM on dimensions (M_x, M_y), for the cube task.
The rest of the architecture and the training procedure remains unchanged.
The increased difficulty of the task is shown by the low percentage of suc-
cessful insertions that are achieved randomly at the beginning of the learning
process.
4.2.1. Learning results
Fig. 11 depicts the insertion time during 8000 learning trials. One should

take into account that any failed insertion is recorded as 30 s, the trial timeout.
The improvement is shown more clearly in Fig. 12, which depicts the proba-
bility of successful insertion within 30 s time. The process is slightly more
unstable than for the cylinder due to the increased difficulty, but the agent achieves
a significant 80% of successful insertions. If this timeout is not considered, the
benefit is more apparent. Fig. 13 depicts the probability of successful insertion
for 1000 random trials and 1000 trials with the learned controller, with respect
to a time of up to 210 s (3.5 min). The difference is more dramatic than in the case
of the cylinder, since the random controller, even given a long time, only
succeeds in a low percentage of the trials (about 45%), whereas the
learned controller succeeds in more than 90% of the trials.
As far as we know, this is the best performance achieved for this task using a
square peg. In [5] only results for the cylinder are presented and, though
generalizing to other shapes is said to be possible, no real experiments are
carried out.
Fig. 11. Smoothed insertion time taken on 8000 trials of the cube task.
4.3. Other shapes: triangle
The architecture is not restricted to square shapes, but in principle it can be
used with any non-symmetric shape. Results are now presented for a triangular
Fig. 12. Evolution of the probability of successful insertion during the training process.
Fig. 13. Probability of insertion for random and learned strategies for the square peg.
peg, with three equal edges. Each edge is 30.5 mm long, and the hole edges are
30.9 mm long.
The exact same representation of the state space has been used as for the
square. The same radius of uncertainty and exploration area are considered.

The total number of states is 27 × 24 = 648 and the same actions (eight
translations and two rotations) are used. The parameters of the map are the
same as those used with the previous parts; the trained map is depicted in Fig. 14.
4.3.1. Learning results
The evolution of the mean insertion time during 8000 learning trials was
recorded. However, the improvement is not as apparent as in the previous
cases. Moreover, the probability of insertion only reaches about 60% of success
after the training process, whereas 80% of successful insertions were attained in
the cube example.
This is quite surprising, since initially the probability of insertion for the
triangle is higher, and that means that it is easier to insert the triangle randomly
than the cube. However, it is more difficult to improve these skills based on the
sensed forces for the triangle. This could be caused by the different contact
states, which seem to be more informative in the case of the cube. This is not a
contradiction at all. Possibly, the contacts of the triangle are more ambiguous,
as Fig. 14 suggests, thus making it difficult to learn a good strategy for the
insertion task.
Unfortunately, since there are no other published works on a similar task,
these results cannot be compared against others to test whether our hypothesis holds. Nevertheless,
Fig. 14. Voronoi diagram defined by the projection of the SOM on dimensions (M_x, M_y), for the triangle task.
this absence of results in the literature might be indicative of the difficulty of
properly performing and learning this task.
4.3.2. Learning using the previous SOM
An interesting generalization test is to use an SOM trained with samples

from insertions of the square peg for learning the insertions of the triangle peg.
Though the map was trained with a different shape, the purpose is to test whether the features
learned with the square are useful for the insertion of other shapes. Since the
size of the SOMs is the same, the state representation is not modified at all.
The evolution of the mean insertion time during 8000 learning trials is de-
picted in Fig. 15. The results are very similar to those obtained before with a
specific SOM. The probability of insertion is depicted in Fig. 16.
Fig. 17 depicts the probability of successful insertion for 1000 random trials,
1000 trials with the strategy learned with the specific SOM, and 1000 trials with
the strategy learned with the SOM from the cube task, with respect to a time up
to 210 s (3.5 min).
Surprisingly enough, results with the cube SOM are slightly better than
those obtained with the specific SOM. A possible explanation is that the SOM
trained with the cube is more powerful than that trained with the triangle. By
examining Figs. 10 and 14, which depict the Voronoi diagrams of both SOMs,
one can see that the cube SOM is covering a wider area of the input space than
the other one. It might occur that although some input data do not exert
Fig. 15. Smoothed insertion time taken on 8000 trials of the triangle task, with the SOM trained
with the square peg.
much influence during the training process of the triangle SOM (due to their low
probability density), they are still rather important for learning the in-
sertion strategy. Since the cube SOM is covering a wider area, some states may
Fig. 16. Evolution of the probability of successful insertion during the training process for the
triangle, with the SOM trained with the square peg.
Fig. 17. Probability of insertion for random and learned strategies for the triangle with SOMs
trained with the triangle and the square.

be properly identified with this SOM whereas they are ambiguous with the
triangle SOM.
This is an interesting result which demonstrates the generalization capabil-
ities of the SOM for extracting features which are suitable for different tasks.
5. Conclusion
A practical sensor-based learning architecture has been presented. We have
indicated the need for a robust representation of the task state, to minimize the
effects of uncertainty. The implemented system is fully autonomous, and in-
crementally improves its skill in performing the task.
Results for the 3D peg insertion task with both cylindrical and non-cylin-
drical pegs have demonstrated the effectiveness of the proposed approach. The
learning process is fully autonomous. First, features are extracted from sensor
signals by an unsupervised neural network. Later, the reinforcement learning
algorithm associates the optimal actions to each state.
The system is able to manage uncertainty in the position and orientation of
the peg. Notably, the uncertainty is larger than the clearance between the parts.
Experimental results demonstrate the ability of the system to learn to insert
non-cylindrical parts, for which no other working system has been described in
the literature. In addition, the system generalizes well to other positions and
orientations of the parts.
Future work includes the study of skill transfer between tasks, to avoid
learning a new shape from scratch. A promising example of using a neural
network trained with the square peg for the insertion of a triangle peg is shown.
Another important direction for future research will be to investigate the in-
tegration of the presented techniques with other sensors, e.g. vision.
References
[1] J. Canny, J. Reif, New lower bound techniques for robot motion planning problems, in: 28th
IEEE Symposium on Foundations of Computer Science, 1987, pp. 49–70.
[2] H. Asada, Representation and learning of nonlinear compliance using neural nets, IEEE

Transactions on Robotics and Automation 9 (6) (1993) 863–867.
[3] R.J. Desai, R.A. Volz, Identification and verification of termination conditions in fine motion
in presence of sensor errors and geometric uncertainties, in: Proceedings of the IEEE
International Conference on Robotics and Automation, 1989, pp. 800–807.
[4] D.E. Whitney, Quasi-static assembly of compliantly supported rigid parts, ASME Journal of
Dynamic Systems, Measurement and Control 104 (1982) 65–77.
[5] V. Gullapalli, J.A. Franklin, H. Benbrahim, Acquiring robot skills via reinforcement learning,
IEEE Control Systems 14 (1) (1994) 13–24.
[6] M. Kaiser, R. Dillman, Building elementary robot skills from human demonstration, in:
Proceedings of the IEEE International Conference on Robotics and Automation, 1996,
pp. 2700–2705.
[7] M. Nuttin, H. van Brussel, C. Baroglio, R. Piola, Fuzzy controller synthesis in robotic
assembly: procedure and experiments, in: 3rd IEEE International Conference on Fuzzy
Systems, 1994, pp. 1217–1223.
[8] M.E. Caine, T. Lozano-Pérez, W.P. Seering, Assembly strategies for chamferless parts, in:
Proceedings of the IEEE International Conference on Robotics and Automation, 1989, pp.
472–477.
[9] T. Lozano-Pérez, Spatial planning: a configuration space approach, IEEE Transactions on
Computers C-32 (2) (1983) 108–120.
[10] B.R. Donald, Error Detection and Recovery in Robotics, Springer, Berlin, 1989.
[11] J. Jennings, B.R. Donald, D. Campbell, Towards experimental verification of an automated
compliant motion planner based on a geometric theory of error detection and recovery,
in: Proceedings of the IEEE International Conference on Robotics and Automation, 1989,
pp. 632–637.
[12] A.J. Briggs, An efficient algorithm for one-step planar compliant motion planning with

uncertainty, in: 5th ACM Annual Symposium on Computational Geometry, 1989, pp. 187–
196.
[13] J.C. Latombe, A. Lazanas, S. Shekhar, Robot motion planning with uncertainty in control
and sensing, Artificial Intelligence 52 (1) (1991) 1–47.
[14] S.M. LaValle, S.A. Hutchinson, An objective-based framework for motion planning under
sensing and control uncertainties, International Journal of Robotics Research 17 (1) (1998) 19–
42.
[15] B.J. McCarragher, H. Asada, A discrete event controller using Petri nets applied to assembly,
in: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems,
1992, pp. 2087–2094.
[16] V. Gullapalli, R.A. Grupen, A.G. Barto, Learning reactive admittance control, in: Proceedings
of the IEEE International Conference on Robotics and Automation, 1992, pp. 1475–1480.
[17] M. Kaiser, R. Dillman, Hierarchical learning of efficient skill application for autonomous
robots, in: International Symposium on Intelligent Robotic Systems, 1995.
[18] C. Distante, A. Anglani, F. Taurisano, Target reaching by using visual information and Q-
learning controllers, Autonomous Robots 9 (2000) 41–50.
[19] M.A. Erdmann, Randomization in robot tasks, International Journal of Robotics Research 11
(5) (1992) 399–436.
[20] E. Cervera, A.P. del Pobil, E. Marta, M.A. Serna, Perception-based learning for motion in
contact in task planning, Journal of Intelligent and Robotic Systems 17 (1996) 283–308.
[21] T. Kohonen, Self-Organizing Maps, Springer Series in Information Sciences, Springer,
Berlin, 1995.
[22] R.S. Sutton (Ed.), Reinforcement Learning, Kluwer Academic Publishers, Dordrecht, 1992.
[23] C.J.C.H. Watkins, P. Dayan, Q-learning, Machine Learning 8 (1992) 279–292.