
Chapter 1
Introduction
The human visual system can quickly, effortlessly, and efficiently process visual information from the environment (Hoffman, 2000). As a result, modern computer vision has been heavily influenced by how biological visual systems encode properties of the natural environment that are important for the survival of the species. Human subjects can perform several complex tasks such as object localization, identification, and recognition in a given scene effortlessly, owing to their ability to attend to selected portions of their visual fields while ignoring other information. However, this selective attention (James, 1890) mechanism is not the only way perception is achieved. Human subjects can also utilize a divided attention mechanism to achieve perception in daily life (Goldstein, 2010). For instance, car drivers can simultaneously pay attention to traffic lights, traffic signs, and vehicles in front of them while driving. It is also worth noting that perception can occur even without directed attention (Reddy et al., 2007).
In this thesis, I present our research findings on the study of the temporal (fixation duration) and spatial (fixation location) properties of visual attention. Fixation durations have been extensively studied using the static scene change paradigm (Land and Hayhoe, 2001; Henderson, 2003; Pannasch et al., 2010). However, the influence of scene change on fixation durations in movies is not well understood. Part of this thesis attempts to fill this gap by looking at how fixation durations change in movies across scene changes. We also show how fixation durations can be used as an unbiased behavioural metric to quantify the centre bias in movies, which serves to complement spatial measures of the centre bias (Tatler, 2007; Tseng et al., 2009).
The second part of this thesis focuses on a computational model of the human visual attention system. We propose a novel method of combining bottom-up (sensory information) and top-down (experience) cues. Specifically, we used unsupervised learning techniques to categorize scene-specific patterns of visual attention and hypothesized that these patterns would be unique for different types of scenes, e.g., natural and man-made scenes. Using these patterns of eye fixations, we modulated our saliency maps to investigate whether we could improve our prediction of human fixations. Our results show that there are indeed scene-specific differences in visual attention patterns, and that augmenting sensory cues (bottom-up information) with such knowledge (top-down information) improves the predictive power of the proposed computational model.
The overall thesis is organized as follows. Chapter 2 reviews the current hypotheses on the control of fixation durations, and reviews the computational models used to predict visual attention in static and dynamic natural scenes. In Chapter 3, I describe our experimental methods and the analysis of fixation durations, with three major results. In Chapter 4, I describe a computational model of attention and discuss our results in comparison to previous models. In Chapter 5, I discuss our overall conclusions, with important directions for future work.
Chapter 2
Literature Review
2.1 Early History
Although the word attention, from the Latin attentio, existed in Roman times, there is no evidence of its study in those times. The very first documented account of attention dates back to 1649, when Descartes proposed that movements of the pineal gland were responsible for attention (Descartes, 1649). Following this, many other proposals were put forth over the years (Hobbes, 1655; Malebranche, 1721; Leibniz, 1765; von Helmholtz, 1886). However, all these proposals were based on the central idea of the obligatory coupling of visual attention to physical eye movements, otherwise known as overt attention. Helmholtz was the first to discover covert attention by successfully demonstrating that attention can be achieved even without physical eye movements to attended locations (von Helmholtz, 1896).
2.2 Types of Eye Movements
The acuity of the visual system in primates drops with eccentricity from the fovea towards the periphery. The density of cone photoreceptors is much greater at the centre than in the periphery. Peripheral vision is poor at detecting object information such as colour and shape, but is more sensitive to motion (Balas et al., 2009). Consequently, to attend to any one spatial location in detail, a gaze shift to that location is necessary, bringing the image onto the central fovea where the population of photoreceptors is the highest. With advancements in neuroscience, different kinds of oculomotor movements have been discovered, and their properties defined (Martinez-Conde et al., 2004; Sparks et al., 2002). Table 2.1 lists some well-known eye movements.

Table 2.1: Different eye movements

Gaze-stabilizing (fixational) movements:
• Tremor: Also known as nystagmus; a compensatory eye movement with the lowest amplitude of all the eye movements (Yarbus, 1967; Carpenter, 1988; Spauschus, 1999).
• Drift: Also known as the opto-kinetic reflex; a pursuit-like movement that stabilizes the image of a low-velocity object (Nachmias, 1959; Fender, 1969).
• Microsaccade: Also known as the vestibulo-ocular reflex and fixational saccades; an involuntary image-stabilizing eye movement made in response to head movements (Horw, 2003). Microsaccades are also thought to correct eye displacements caused by drifts (Yarbus, 1967; Ditchburn, 1953; Cornsweet, 1956).

Gaze-shifting (saccadic) movements:
• Saccade: A voluntary jump of the eye from one spatial location to another.
• Smooth pursuit: Voluntary tracking of a stimulus moving across the visual field to keep it under the foveal spotlight.
• Vergence: Coordinated eye movements that stabilize the target image on the foveal region of both eyes (Sparks, 2002).

In general, gaze-stabilizing eye movements compensate for head and body movements to keep the image under the high-resolution fovea, while gaze-shifting eye movements provide high-resolution samples of the visual environment by controlling and directing eye movements.
Figure 2.1: Visual orienting graph.

2.3 Overt and Covert Orienting of Visual Attention

All these eye movements relate to the broader concepts of overt and covert orienting of visual attention (Posner and Cohen, 1984). Johnson (1994) illustrated these different types of attention in Figure 2.1.

Overt attention is the result of a directed eye movement towards an attended location, as opposed to covert attention, where the attended location is independent of eye position. A further decomposition of these two types of attention suggests that eye movements are under the control of endogenous and exogenous subsystems (see Figure 2.1). Endogenous control refers to controlled eye movements under the influence of visual or verbal instruction, e.g., to look at a central fixation cross preceding the presentation of the stimuli. In contrast, exogenous control refers to autonomous eye movements under the influence of the visual stimuli, e.g., a brief presentation of a target in the periphery in attention capture experiments. The difference between covert and overt orienting of attention under endogenous control can be understood by an example from a cueing paradigm (Van der Stigchel and Theeuwes, 2007). In a majority of the trials, subjects were endogenously cued
(displaying an arrow at the central fixation) to covertly attend to the upcoming target location, while maintaining their gaze at the central fixation. Responses to these target onsets were sampled using key presses. However, in a few of the trials, subject responses were sampled by instructing them to saccade to the target location (overt orienting under endogenous control). Similarly, differences between covert and overt attention under exogenous control can be exemplified by an attention capture paradigm. A brief onset of a target at a peripheral location is first covertly attended (attention capture), followed by an overt orientation for focused processing (oculomotor capture).
A third-level decomposition has only been reported for covert attention. Two examples of the effects of covert attention are response facilitation and inhibition of return (IOR). Response facilitation was first demonstrated by Posner (1980), in a study in which subjects were asked to press a button when they detected a flash of light that could appear at one of four possible peripheral locations on a screen in front of them. To ensure that only covert attention was involved, the subjects were required to maintain central fixation throughout the trial. Before the target stimulus appeared, a cue was presented that would instruct the subject to orient his/her covert attention to one of the four possible locations. They found that reaction times were significantly lowered when subjects were cued to the correct location, despite the fact that their eyes were directed somewhere else. This appears to reflect a reflexive orienting of attention to the location of salient cues (Klein, 2000). However, the facilitation in saccade reaction times to the target location lasted for only 100 to 200 milliseconds following the cue onset. Any further delay in the onset of the target, when it was near the cued location, resulted in slower reaction times. This was later explained as the effect of IOR. The term inhibition of return was first coined by Posner and colleagues (Posner et al., 1985). They showed the relative impairment of immediate attentional shifts to a target
location if attention was recently withdrawn from that cued location. The delayed onset of the target near the cued location resulted in significantly slower reaction times. The IOR effect has been discussed as a novelty-seeking mechanism (Posner & Cohen, 1984) and as facilitating visual search when the target does not pop out (Klein and MacInnes, 1999). It is also worth noting that overt and covert attention can co-occur or act individually in a given scenario (Findlay and Gilchrist, 2003).
Over the years, other models of attention have also been put forth:
• Independent Attention Model: states that both types of attention can co-exist, since they are independently driven by the same scene (Klein, 1980).
• Sequential Attention Model: states that overt foveation is preceded by covert attention (Posner, 1980).
• Pre-Motor Theory of Attention: states that covert attention is a by-product of the motor system initiating overt foveation (Rizzolatti et al., 1987).
2.4 Temporal Properties of Visual Attention

Temporal properties of visual attention concern how a presented stimulus influences fixation durations. Overt orienting is manifested not only by physical eye movements to the desired location, but also by the duration that the eyes stay at the attended location. This behavioural property (fixation duration) has been investigated by many researchers and has been found to be a function of a variety of factors (Buswell, 1935; Rayner, 1998; Findlay and Gilchrist, 2003). Henderson and Smith (2009) listed the different control mechanisms affecting fixation durations:

• Process Monitoring states that fixation durations are driven by moment-to-moment visual and cognitive analysis.
– Immediate control exerts influences on fixation durations based on the visual and cognitive processes taking place during the fixation (Rayner and Pollatsek, 1981).
– Delayed control exerts influences on subsequent fixation durations, originating from the slow development of higher-level visual and cognitive processes.

• Autonomous Control states that most fixation durations are independent of the immediate perceptual and cognitive processing of the current fixation.

– Timing control suggests that fixation durations are determined by an internal stochastic timer designed to move the eyes at a constant rate, regardless of the scene type or task definition.
– Parameter control suggests that fixation durations are based on oculomotor timing parameters reflecting the global viewing conditions, determined early in scene viewing (Henderson and Hollingworth, 1999).

• Mixed Control suggests that fixation durations are driven by some combination of the above-mentioned processes. As an example, an argument can be made that most of the time fixation durations are under immediate control, but are occasionally influenced by delayed control. Reichle et al. (1998) showed in their reading task experiments that the fixation duration for the currently processed word was longer if the following or preceding word was skipped than when it was fixated. Another argument can also be made that fixation durations are under timing control, which is sometimes overridden by delayed control due to slower-acting higher-level visual and cognitive processes (Yang and McConkie, 2001).
Another interesting behavioural bias observed while watching natural stimuli (static images and videos) is the tendency to fixate near the centre more often than the periphery. This bias has been replicated in many studies (Buswell, 1935; Mannan et al., 1995, 1996, 1997; Reinagel and Zador, 1999; Parkhurst et al., 2002; Parkhurst and Niebur, 2003; Itti, 2004; Tatler et al., 2005; Tatler, 2007; Foulsham and Underwood, 2008; Tseng et al., 2009). At present, the reasons behind this centering phenomenon have yet to be elucidated. In a recent paper, Tseng et al. (2009) suggested that this centre bias should be largely attributed to the photographer bias and to an expectancy-derived viewing strategy that follows from it.

The photographer bias refers to the fact that photographers and film makers typically place objects of interest at or around the centre of the frame, presumably so that their viewing audience can easily perceive the intended meaning of the scene without needing to alter their gaze. Moreover, as stated above, it has been suggested that the photographer bias promotes a typical viewing strategy, whereby viewers develop a tendency to move their eyes toward the centre of a newly presented scene, since they expect the most interesting or important features of the scene to appear at or around that region (Tseng et al., 2009).

The precise contribution of the photographer bias to the centre bias is difficult to assess, and remains a crucial issue for understanding how we perceive visual images. In Tseng et al. (2009), two variables were measured to quantify the photographer bias. To assess top-down influences of the photographer bias, subjects were required to rate the extent to which the interesting aspects of the scene were biased toward the centre. However, there were large differences between the subjective ratings of the viewers, suggesting that the scene properties manifested in the photographer bias were not completely captured by this rating system. To measure the bottom-up influences of the photographer bias, the authors computed a saliency map based on the widely cited Itti and Koch (2000)
model. Importantly though, the contribution of saliency to the photographer and centre biases was found to be markedly smaller than that of top-down influences such as the subjective assessments by the participants of the experiment.

A different way to tackle this issue, which is not based on subjective measures but is more objective, is to assess the amount of photographer bias in a scene as a function of the disparity in fixation durations between the centre and the periphery. When a photographer bias is present, it should influence not only the number of fixations that are made toward the centre relative to the periphery, but also the duration of individual fixations. This is because viewers are expected to remain fixated at meaningful locations for longer periods compared to less-meaningful locations that would not especially evoke the viewer's interest. Under a photographer bias, most of the meaningful portions of the scene are positioned near the centre. Hence, when the viewer fixates these locations, the probability that he/she is fixating a meaningful spot is increased, and so is the probability that the fixation duration will be extended. In other words, since viewers are expected to remain fixated longer at informative, interesting, or otherwise meaningful locations relative to less-meaningful ones, we should expect a correlation between fixation duration and distance from the centre when a photographer bias is present. On the other hand, other biases of visual attention (like orbital reserve and motor bias) that are unrelated to the semantic content of the scene do not predict that fixation durations will be longer in central regions. If there were no photographer bias, viewers would not be expected to have longer fixations at central regions, as these regions would not contain information that is more meaningful than other parts of the image.
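To make this duration-based test concrete, the sketch below shows one way the relationship could be computed, assuming fixations are already available as (x, y, duration) triples in screen pixels. The function name and the choice of a rank correlation are illustrative assumptions, not the analysis used later in this thesis.

```python
import numpy as np
from scipy.stats import spearmanr

def duration_eccentricity_correlation(fixations, frame_size):
    """Correlate fixation duration with distance from the frame centre.

    fixations  : array of shape (n, 3) holding (x, y, duration) per fixation.
    frame_size : (width, height) of the stimulus in pixels.

    A reliable negative correlation (longer fixations nearer the centre) is
    consistent with a photographer bias; biases unrelated to scene content
    (e.g., a pure motor bias) do not predict it.
    """
    fixations = np.asarray(fixations, dtype=float)
    cx, cy = frame_size[0] / 2.0, frame_size[1] / 2.0
    eccentricity = np.hypot(fixations[:, 0] - cx, fixations[:, 1] - cy)
    rho, p_value = spearmanr(eccentricity, fixations[:, 2])
    return rho, p_value
```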
2.5 Models of Visual Attention
Visual attention can be driven by either bottom-up (exogenous-control) or top-down (endogenous-control) mechanisms. The bottom-up mechanisms are characterized by involuntary and unconscious eye movements with low cognitive load. The top-down mechanisms are characterized by voluntary and conscious eye movements, under the influence of special tasks, and are accompanied by higher cognitive load. Research studies have found that bottom-up influences act more rapidly than top-down processes (Wolfe et al., 2000; Henderson, 2003). In particular, Wolfe and colleagues (2000) conducted two sets of experiments that nicely demonstrated this concept. In the first experiment, 8 images were shown sequentially at 8 different locations on the screen for a period of 53 milliseconds each, interleaved with a variable-duration mask. The target could appear at one of the eight locations, distributed in a clock-like fashion. In the command condition, the images were presented sequentially in a clockwise manner, i.e., the first image appeared at the 12 o'clock location, followed by an image at the 1 o'clock location. If the target was to be presented as the fourth image in a presentation sequence, then the target would only appear at the fourth location. In the random anarchy condition, the target could appear randomly on each frame. The results showed that attention can shift much more rapidly when allowed to move randomly (anarchy condition) than when controlled in a top-down manner (command condition). Similar results were found for the second experiment, where the objective was to find mirror-reversed letters.

The bottom-up mechanisms are mainly driven by low-level processes that depend on the intrinsic features of the visual stimuli. This mechanism is also called automatic, exogenous, reflexive, or peripherally cued (Egeth and Yantis, 1997). Any feature in the visual environment can attract attention. Some features like colour, motion, orientation and size (including length and spatial frequency) have
proved to be reliable predictors of foveated attention, while others (e.g., shape, optical flow, luminance polarity, etc.) are less likely to attract attention (Wolfe and Horowitz, 2004). Over the years, many attempts have been made to develop computational models based on bottom-up attention. They have shown some degree of success in simulating the fixation patterns of humans. The foundations were laid by a classical paper from Treisman (Treisman and Gelade, 1980) on the "Feature Integration Theory" (FIT). According to the FIT, the human visual system can autonomously detect discrete features of the stimulus in a parallel fashion (within the limits of visual acuity and discriminability), resulting in feature maps. These feature maps are then combined into a master map of locations that shows where things are in the scene. This pre-attentive parallel processing capability mediates figure-ground grouping and texture segregation. However, for higher-level percepts such as object identification in a scene, focused attention is necessary. This is achieved by serial scanning of the map (item by item) and directing focal attention to the relevant location (Treisman, 1985). The more discriminable the object features are, the faster this conjunction search can be completed. Thus the FIT explains the finding that reaction times tend to increase with display size in conjunction-search tasks but remain almost constant in feature-search tasks. The FIT was the basis of the work by Koch and Ullman (1985) on the biological plausibility of a feature-based computational model. A key proposal was the idea of a topographical map that linearly combined the different individual feature maps. This master map, termed the Saliency Map, provided a measure of global conspicuity. An alternative map description, the Activation Map, was presented in Wolfe's guided search model (Wolfe et al., 1989; Chun et al., 1996) as the weighted summation of feature activations. These maps essentially combined both top-down and bottom-up information. The activation maps are then used to guide visual attention from object to object, starting with the object with the
highest activation, until the target is found or the current activation level falls below a threshold. Similarly, the notion of a Priority Map (Fecteau et al., 2006) also combines representations of bottom-up salience in a scene with the top-down relevance of objects to the subject's goal. Currently, there are three classes of models, as pointed out by Le Meur and Le Callet (2009).

• Hierarchical models decompose visual information using Gaussian, Fourier-based or wavelet-based decompositions. Different methods are then used to aggregate the information across the hierarchy to produce a unique saliency map (Itti et al., 1998; Le Meur et al., 2006; Bur and Hügli, 2007).

• Statistical models make use of the local statistical properties of regions in the scene. The saliency map is then computed as a measure of the deviation of these regional properties from their surroundings (Oliva et al., 2003; Bruce and Tsotsos, 2009; Gao et al., 2008).

• Bayesian models combine bottom-up sensory visual information with top-down prior knowledge relevant to the task at hand (Torralba, 2003b; Zhang et al., 2009).
The classic Itti and Koch saliency model (Itti et al., 1998; Itti and Koch, 2000, 2001) is an implementation and expansion of the work by Koch and Ullman (Koch and Ullman, 1985). Koch and Ullman claimed that the brain computes an explicit saliency map of the visual world. Saliency was defined using the principles of the centre-surround mechanism: pixels in the scene are salient if they differ from surrounding pixels in intensity. Features are computed from the responses of biologically plausible linear filters such as the DoG (difference of Gaussians; Hawken and Parker, 1987) and Gabor filters (Marčelja, 1980). Briefly, an input image is decomposed into three channels: colour, intensity and orientation. Subsequently, the colour and
intensity channel images are repeatedly sub-sampled using Gaussian-shaped kernels to create dyadic Gaussian pyramids. Four orientation Gabor pyramids are created, using four preferred orientations (0, 45, 90, and 135 degrees). A centre-surround operation, implemented by taking the difference of the filter responses, yields a set of feature maps. The feature maps for each channel are then normalized (to promote maps with a few strong peaks and suppress maps with many comparable peaks) and combined across scales and orientations to create a conspicuity map for each channel. These three maps are further normalized to enhance the conspicuous regions, and the channels are linearly combined to form an overall saliency map. The model's output then feeds into a two-layer winner-takes-all neural network to simulate the shifting of attention from one location to another. To avoid returning immediately to the previously processed location (IOR), DoG filters are used. The excitatory surround around the inhibitory centre gives a slight preference to salient features near the previously attended location (Itti and Koch, 2000).
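As a rough illustration of the centre-surround stage just described, the sketch below computes an intensity-only conspicuity map, assuming a grayscale input of VGA-like size. The full model adds colour and orientation channels, the peak-promoting normalization operator, and the winner-takes-all network; the pyramid levels and the crude normalization used here are illustrative choices.

```python
import numpy as np
import cv2  # OpenCV, used here for pyramid construction and resizing

def intensity_conspicuity(image, centre_levels=(2, 3, 4), deltas=(3, 4)):
    """Centre-surround intensity conspicuity in the spirit of Itti et al. (1998).

    Builds a dyadic Gaussian pyramid, takes across-scale differences between a
    'centre' level c and a 'surround' level c + delta, and sums the resulting
    feature maps into a single conspicuity map at the coarsest centre scale.
    """
    img = image.astype(np.float32) / 255.0
    # Dyadic Gaussian pyramid: level 0 is the input, each level halves the size.
    pyramid = [img]
    for _ in range(max(centre_levels) + max(deltas)):
        pyramid.append(cv2.pyrDown(pyramid[-1]))

    base_size = pyramid[centre_levels[0]].shape[::-1]   # (width, height) at level 2
    conspicuity = np.zeros(pyramid[centre_levels[0]].shape, dtype=np.float32)
    for c in centre_levels:
        for d in deltas:
            centre = pyramid[c]
            surround = cv2.resize(pyramid[c + d], centre.shape[::-1])
            feature_map = np.abs(centre - surround)            # centre-surround difference
            feature_map = cv2.resize(feature_map, base_size)
            conspicuity += feature_map / (feature_map.max() + 1e-8)  # crude normalization
    return conspicuity
```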
Even though this model has been shown to be successful in predicting human fixations, it is somewhat ad hoc in that there is no objective function to be optimized and many parameters must be tuned manually. In contrast, the model proposed by Bruce and Tsotsos (2009) defines bottom-up saliency based on information sampled from the image. Features are learned from a set of natural images using independent component analysis (ICA). These have been shown qualitatively to resemble the receptive fields found in the primary visual cortex (V1), and their responses exhibit the desired property of sparsity. Furthermore, since the learned features are independent, the joint probability of the features is the product of the features' marginal probabilities. Once the basis functions and coefficients are learned, they are tested on a set of new images. First, for each location x in the image, the responses of the learned basis functions are obtained (i.e., the ICA coefficients).
These ICA coefficients correspond to various basis filters that respond to different features. A histogram density estimate is used to produce the distribution of each of these coefficients over a local neighborhood. This is followed by computing the joint likelihood using the neighborhood coefficients. In the end, the final saliency map is obtained using the self-information metric -log(P(x | C)), i.e., the negative log-likelihood of the content given its neighborhood (ensemble). It is worth noting that if the neighborhood of the point of interest is defined as the entire image, the definition of saliency becomes identical to bottom-up saliency as defined in Oliva et al. (2003), where the saliency of each location is inversely proportional to its occurrence probability in the image.
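A schematic version of this self-information computation is sketched below, assuming a set of ICA basis filters has already been learned and the image has been cut into one patch per location. Treating the whole image as the neighborhood and the simple histogram binning are illustrative simplifications of the published model, and the function and argument names are assumptions.

```python
import numpy as np

def aim_saliency(patches, ica_filters, n_bins=64):
    """Self-information saliency in the spirit of Bruce and Tsotsos (2009).

    patches     : (n_locations, patch_dim) image patches, one per pixel location.
    ica_filters : (n_features, patch_dim) learned ICA basis filters.
    Returns a vector of -log p values, one per location (higher = more salient).
    """
    # ICA coefficients: response of every learned filter at every location.
    coeffs = patches @ ica_filters.T                    # (n_locations, n_features)

    log_likelihood = np.zeros(coeffs.shape[0])
    for f in range(coeffs.shape[1]):
        # Histogram density estimate of this coefficient over the neighborhood
        # (here: the whole image, which reduces to the Oliva et al. definition).
        hist, edges = np.histogram(coeffs[:, f], bins=n_bins, density=True)
        bin_idx = np.clip(np.digitize(coeffs[:, f], edges) - 1, 0, n_bins - 1)
        p = np.maximum(hist[bin_idx] * np.diff(edges)[bin_idx], 1e-12)
        # Independence of ICA features: the joint likelihood is the product,
        # so log-likelihoods simply add across features.
        log_likelihood += np.log(p)

    return -log_likelihood   # self-information: rarer content is more salient
```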
Another model, proposed by Itti and Baldi (2009), is based on a Bayesian observer's beliefs about the environment and how these beliefs evolve over time. Data observations that leave the observer's beliefs unaffected carry no surprise, and thus do not get registered in the model's output. In contrast, data observations that force the observer to significantly revise its existing beliefs elicit surprise. Briefly, initial beliefs are computed in a series of small windows over the entire image, along several low-level features (colour, intensity, orientation, and contrast) and at several spatial and temporal scales. Following this initial calculation of the beliefs, based on low-level hypotheses about the environment, any abrupt visual change in subsequent frames or images causes a re-evaluation of the prior beliefs about the environment. The model then uses Bayes' rule to compute the posterior beliefs/probabilities P(H|I) from the prior beliefs/probabilities P(H), over each hypothesis H in the hypothesis space, after observing each successive data sample or image I:

P(H|I) = P(I|H) P(H) / P(I)                (2.1)
Here, if the posterior distribution is significantly different from the prior, this implies that the observed data/image carries surprise in the form of an abrupt low-level visual change. In contrast, if there were no difference between the posterior and prior distributions, no surprise would be elicited. Thus surprise is quantified by the distance between the posterior and prior distributions, as measured using the KL divergence:

S(I, H) = KL[P(H|I), P(H)]                (2.2)

In summary, surprise reflects the computation of saliency over spatial and temporal scales. A spatial oddity (e.g., a house in a field) or a temporal oddity (e.g., a snow image appearing during a normal TV broadcast) will typically elicit surprise initially. But since the observer is continuously updating its beliefs about the world through Bayes' rule, the surprise elicited by repeated presentation of such an oddity decreases with every presentation. This definition of surprise saliency is somewhat similar to other definitions of saliency that are based on the deviation of local features from their neighborhood features over an image space (Itti et al., 1998; Bruce and Tsotsos, 2006; Gao et al., 2008), except that it extends the notion to the spatio-temporal realm. However, it is important to note that statistical uniqueness is not the same as surprise. Statistically unique snow images during a TV broadcast will still elicit decreasing surprise over time.
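As a toy illustration of equations 2.1 and 2.2, the sketch below tracks a Gaussian belief over a single local feature value with known noise variance, so both the Bayes update and the posterior-to-prior KL divergence have closed forms. The actual model maintains Poisson/Gamma beliefs over many features, locations, and scales, so this is an assumption-laden simplification; names and parameter values are illustrative.

```python
import numpy as np

def gaussian_surprise(prior_mu, prior_var, observation, noise_var=1.0):
    """Surprise elicited by one observation, as KL(posterior || prior) (eq. 2.2).

    Belief: a Gaussian over the local feature mean, updated with a Gaussian
    likelihood of known noise variance. Returns the updated belief and the
    surprise in nats.
    """
    # Bayes' rule (eq. 2.1) for the conjugate Gaussian case.
    post_var = 1.0 / (1.0 / prior_var + 1.0 / noise_var)
    post_mu = post_var * (prior_mu / prior_var + observation / noise_var)

    # Closed-form KL divergence between the posterior and prior Gaussians.
    surprise = (0.5 * np.log(prior_var / post_var)
                + (post_var + (post_mu - prior_mu) ** 2) / (2.0 * prior_var)
                - 0.5)
    return post_mu, post_var, surprise

# Repeated presentation of the same "oddity" yields shrinking surprise,
# because each update pulls the belief closer to the observed value.
mu, var = 0.0, 4.0
for frame in range(5):
    mu, var, s = gaussian_surprise(mu, var, observation=3.0)
    print(f"frame {frame}: surprise = {s:.3f} nats")
```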
Zhang and colleagues (Zhang et al., 2009) proposed a saliency model based on statistics learned from a series of natural images. A contrasting view is to learn the statistics from the local scene (Bruce and Tsotsos, 2006; Torralba et al., 2006) in the case of static images, or from a local spatio-temporal region in the case of movies (Bruce and Tsotsos, 2009). All these models use a definition of self-information, but their underlying implications are different. Saliency based on local image statistics argues that foreground objects are likely to have different features than the background; thus saliency is defined as the deviation of feature values from
the average statistics of the image. On the other hand, SUN's intuition is that since target objects are less frequently observed than the background in daily life, rare/novel features are more likely to be fixated by humans. This claim was substantiated using evidence from the literature describing what attracts attention in infants (Fagan, 1970; Friedman, 1972) and the fact that object novelty facilitates visual search (Wolfe, 2001). Employing this novelty strategy, Zhang et al. (2009) demonstrated the predictive power of their model on classical psychophysics results (such as search asymmetry and parallel vs. serial search) and on eye movement patterns in general while viewing images and movies. The main formulation of their probabilistic model is made under the assumption that the aim of the human visual system is to find potential targets that are important for survival. Attention is directed to regions that have a high likelihood of belonging to the target class:
log P(T_x | F_x) = log P(F_x | T_x) + log P(T_x) + log [1 / P(F_x)]                (2.3)

where x denotes a point in the visual scene, T_x is a binary variable indicating whether point x belongs to a target class, and F_x is the visual feature at point x. The first term on the right-hand side forms the top-down saliency (the probability of the target's features at the currently processed location), the second term is a constant, and the third term captures the bottom-up saliency (the self-information at the current location).
Taken together, the top-down and bottom-up saliency terms provide the pointwise mutual information between the features and targets. Two types of features were used: DoG and ICA. The DoG filters were applied to three different channels separately (intensity, red-green, and blue-yellow) at 4 scales, yielding a total of 12 features. Independent components were learned from the Kyoto image database (Wachtler et al., 2007). This was followed by fitting a generalized Gaussian distribution to the histogram of each feature (the histogram of each feature represents the frequency of its response, indicating how rarely or frequently a particular feature was present
in natural images). This is fundamentally different from Bruce and Tsotsos (2006), where the learned basis functions were used for density estimation over local neighborhoods, yielding a distribution of values for each coefficient. Bottom-up saliency was then computed by estimating the joint probability from the features. Since the features were independent, the joint probabilities were obtained as the product of the features' marginal probabilities.
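The bottom-up term of equation 2.3 can then be sketched as follows, assuming the per-feature response distributions have already been fitted offline as zero-mean generalized Gaussians on natural image statistics; the parameter names and the exact shape of the fit are illustrative assumptions.

```python
import numpy as np
from scipy.special import gammaln

def generalized_gaussian_logpdf(x, alpha, beta):
    """Log density of a zero-mean generalized Gaussian with scale alpha and shape beta."""
    return (np.log(beta) - np.log(2.0 * alpha) - gammaln(1.0 / beta)
            - (np.abs(x) / alpha) ** beta)

def sun_bottom_up_saliency(feature_responses, alphas, betas):
    """Bottom-up SUN saliency, -log P(F): the third term of eq. 2.3.

    feature_responses : (n_locations, n_features) filter responses (e.g., DoG or ICA).
    alphas, betas     : per-feature generalized-Gaussian parameters fitted on a
                        separate set of natural images (as in Zhang et al., 2009).
    Features are assumed independent, so log-probabilities sum across features.
    """
    log_p = np.zeros(feature_responses.shape[0])
    for f in range(feature_responses.shape[1]):
        log_p += generalized_gaussian_logpdf(feature_responses[:, f],
                                             alphas[f], betas[f])
    return -log_p   # rarer feature combinations are more salient
```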
Another model, from Hou and Zhang (2008), proposed a spectral residual approach for computing saliency maps. This approach differs from other methods based on natural image statistics in that it is non-parametric and computes saliency rapidly. It does so by computing the difference (the spectral residual) between the log spectrum (the log of the amplitude spectrum) of an image and a smoothed version of it. This is followed by transforming the residual back to the spatial domain to obtain the saliency map. Lately, Guo and colleagues (Guo et al., 2008) have claimed that the phase spectrum of an image has even more predictive power than the amplitude spectrum.
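A compact version of the spectral residual recipe is sketched below, assuming a grayscale frame and following the general steps described by Hou and Zhang (2008); the working resolution and smoothing kernel sizes here are illustrative.

```python
import numpy as np
import cv2

def spectral_residual_saliency(gray_image, work_size=(64, 64)):
    """Spectral residual saliency (Hou and Zhang, 2008): non-parametric and fast."""
    img = cv2.resize(gray_image.astype(np.float32), work_size)
    spectrum = np.fft.fft2(img)
    log_amplitude = np.log(np.abs(spectrum) + 1e-8)
    phase = np.angle(spectrum)

    # Spectral residual: log amplitude minus its locally smoothed version.
    smoothed = cv2.blur(log_amplitude, (3, 3))
    residual = log_amplitude - smoothed

    # Back to the spatial domain, keeping the original phase.
    saliency = np.abs(np.fft.ifft2(np.exp(residual + 1j * phase))) ** 2
    saliency = cv2.GaussianBlur(saliency.astype(np.float32), (9, 9), 2.5)
    return saliency / (saliency.max() + 1e-8)
```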
However, all these models (Hou and Zhang, 2008; Itti and Baldi, 2009; Zhang et al., 2009) share a common problem, which is sensitivity to the presence of ego-motion (Gao et al., 2008). They do not have any strategy for background motion cancellation, and thus frequently assign high salience values to background motion (due to camera panning or zooming) in the presence of more salient foreground motion.

Top-down mechanisms are defined as relying on task or contextual definition. These mechanisms are also called endogenous (Posner et al., 1980), voluntary (Jonides, 1981) or centrally cued (Posner et al., 1980) attention. The classical example of this type of attention mechanism comes from Yarbus' work (Yarbus, 1967). He was the first to demonstrate that a subject's gaze shifts vary with high-level task definition. In his experiment using paintings of people in a
living room, subjects were either asked no questions or asked specific questions, such as to judge the economic status of the people, what clothes they were wearing, or where they were. Three-minute recordings of gaze showed different eye scan paths for the different conditions (elicited by the different questions asked), implying that top-down task information does influence overt foveation. Posner's cueing paradigm (Posner and Cohen, 1980, 1984) also showed how covert and overt attention might co-exist to achieve a task. A central cue, shown as an arrow, is presented to the subjects, pointing to the location of the target. The ability to detect the target, in terms of reaction times, was typically better in the trials in which the target was presented at the cued location than in the trials in which the target appeared at an uncued location. Subsequently, Posner described three major functions concerning attention: alertness, orienting of attention, and target detection. Alerting pertains to the ability to process high-priority signals, e.g., peripheral motion gets prioritized by the attention system for subsequent processing. Orienting improves the efficiency of target processing in terms of acuity, by reporting more rapidly the events occurring at the foveated location. Target detection implies that observers are conscious of the presence of the stimulus.
In practice, computational models based on top-down information are more complex and require specific contextual knowledge of the task to modulate the feature maps. Recent studies in object recognition have proposed unified frameworks combining both top-down and bottom-up information to compute saliency (Torralba, 2003b; Gao and Vasconcelos, 2005; Gao et al., 2008).

For example, Torralba's context-based vision system (Torralba et al., 2003) uses contextual priming for place or object localization (for a target detection task), based on past search experiences in similar environments (Torralba, 2003a). In the search for an object in a scene, the probability of interest is the joint
probability that the object is present in the current scene, together with the object's probable location (if the object is present), given the observed features. This is calculated using Bayes' rule. Another example of contextual priming is the incorporation of gist (Torralba, 2003b), which represents a semantic classification of the scene, such as 'highway' or 'beach' (Oliva, 2005). Research studies have shown that humans can perform basic-level categorization of complex natural scenes within a glance (Potter, 1976; Thorpe et al., 1996; Schyns and Oliva, 1994). To perform these basic-level classifications at such speeds, it has been suggested that we may be using some sort of intermediate global representation (Fei-Fei and Perona, 2005). This is because the eyes have not moved much in such a short time, and thus no scene-wide exploration has been carried out to form object-level representations. These intermediate representations can also be used to recognize unfamiliar scenes (scenes which were not part of a training set) with good accuracy (Torralba et al., 2003).
Similarly, Gao and colleagues (Gao and Vasconcelos, 2005; Gao et al., 2008) posed saliency as a solution to a classification problem, with the aim of minimizing the classification error. They first applied this concept to the problem of object detection (Gao and Vasconcelos, 2005), under the perspective that saliency should be assigned to the locations in a scene that are useful for the task. To accomplish this, they selected a set of discriminative features best representing the class of interest (e.g., faces or cars). Saliency is then defined as a weighted sum of the features that are salient for that class. Thus, an inherently task-oriented definition of saliency is formed. Later, Gao and colleagues (2008) defined bottom-up saliency as the locations that are very different from their surroundings. They used difference-of-Gaussians (DoG) filters and Gabor filters, and measured the saliency of a pixel as the Kullback-Leibler (KL) divergence between the histogram of filter responses at that pixel and the histogram of filter responses in the surrounding region. Their results showed better performance for motion saliency, even in the presence
of ego-motion.
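A simplified sketch of that discriminant measure, for a single filter-response map and a single pixel, is given below using histogram estimates of the centre and surround response distributions. The window radii, bin count, and the use of a one-directional KL divergence are illustrative assumptions rather than the exact formulation of Gao et al. (2008).

```python
import numpy as np

def centre_surround_kl(response_map, y, x, centre_rad=8, surround_rad=24, n_bins=32):
    """Saliency of one pixel as KL(centre || surround) of filter-response histograms,
    in the spirit of the discriminant centre-surround approach of Gao et al. (2008)."""
    def patch(radius):
        y0, y1 = max(0, y - radius), y + radius + 1
        x0, x1 = max(0, x - radius), x + radius + 1
        return response_map[y0:y1, x0:x1].ravel()

    centre, surround = patch(centre_rad), patch(surround_rad)
    lo, hi = surround.min(), surround.max()
    p, _ = np.histogram(centre, bins=n_bins, range=(lo, hi), density=True)
    q, _ = np.histogram(surround, bins=n_bins, range=(lo, hi), density=True)
    p, q = p + 1e-8, q + 1e-8           # avoid log(0) in empty bins
    p, q = p / p.sum(), q / q.sum()     # renormalize to probability mass
    return float(np.sum(p * np.log(p / q)))
```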
Earlier efforts in developing visual attention models were limited to static images, utilizing only the spatial domain of the visual stimulus (Itti et al., 1998; Watson and Ahumada Jr, 2005; Le Meur et al., 2006; Walther and Koch, 2006; Kienzle et al., 2009; Najemnik and Geisler, 2008). However, more recent models take into account the spatio-temporal dynamics of the visual input typically experienced in normal settings (Ma et al., 2005; Hou and Zhang, 2008; Le Meur et al., 2007; Gao et al., 2008; Itti and Baldi, 2009; Zhang et al., 2009; Seo and Milanfar, 2009; Marat et al., 2009; Bruce and Tsotsos, 2009; Vig et al., 2010; Mahadevan and Vasconcelos, 2010; Zhao and Koch, 2011). As mentioned earlier, many of these models are parametric in nature, requiring explicit design parameters (Gao et al., 2008; Itti et al., 1998; Torralba et al., 2003; Zhang et al., 2009), e.g., the type of filters, the number of filters to use, etc. In contrast, non-parametric models tend to learn the filters directly from the training stimuli without any need for parameter adjustments (Bruce and Tsotsos, 2009; Seo and Milanfar, 2009).

However, most of the current models lack a simple and general top-down component, which is one of the issues this thesis focuses on. For instance, Gao et al. (2008) computed saliency using a discriminant centre-surround mechanism, Seo and Milanfar (2009) proposed a self-resemblance measure for computing salient regions, and Bruce and Tsotsos (2008) used a self-information measure. On the other hand, a few models have drawn inspiration from top-down processes. For instance, Itti and Baldi's (2006) surprise model used Bayesian statistics to compute saliency: the prior probability modulated the saliency over time, e.g., a new object in the scene will be highly salient, but with the passage of time it will become less salient. Zhao and Koch (2011) have shown that subjects weigh image features differently (e.g., faces and orientation were given more importance than colour and intensity). Thus, their proposed model accounted for the non-linear integration of feature maps. The
feature weights were learned from eye movement data using least-squares regression (Zhao and Koch, 2011) and AdaBoost techniques (Zhao and Koch, 2012). Torralba et al. (2006) attempted to integrate top-down information using contextual modulation. However, the contextual modulation was implemented using image-wide horizontal regions of interest (ROIs). These ROIs were learned from eye movements on labeled training data; e.g., in a street scene, when asked to search for people or for trees, subjects preferentially looked at the centre of the image and the top of the image, respectively. This approach is highly specific and not easily generalizable. In addition, the use of horizontal regions of interest appears quite limited, as image gist is more likely to be two-dimensional than one-dimensional. Part of this thesis focuses on how to integrate top-down information with bottom-up spatio-temporal saliency in a general way. To accomplish this, we first learnt scene categories from unlabeled training data consisting of scene gist descriptors, and verified that the eye movement patterns for the different scene categories were indeed different. Test images were first categorized into the different scene categories using the scene gist descriptor, and then the category-specific eye movement patterns were used to modulate the bottom-up saliency maps for these test images. Finally, we validated the saliency modulation by comparing it to a number of different controls, followed by comparisons to a number of well-known models of visual attention.
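As an orientation to the kind of pipeline this implies (Chapter 4 describes the actual implementation), the sketch below clusters unlabeled gist descriptors, builds a category-specific fixation prior from training eye movements, and uses it to re-weight a bottom-up saliency map. The clustering method, the multiplicative combination, and all names are assumptions for illustration only, not the exact procedure used in this thesis.

```python
import numpy as np
from sklearn.cluster import KMeans

def learn_category_priors(gist_descriptors, fixation_maps, n_categories=4):
    """Cluster unlabeled scenes by gist and average their fixation maps per cluster.

    gist_descriptors : (n_scenes, gist_dim) scene gist features.
    fixation_maps    : (n_scenes, H, W) array of smoothed fixation density maps.
    """
    kmeans = KMeans(n_clusters=n_categories, n_init=10).fit(gist_descriptors)
    priors = np.stack([fixation_maps[kmeans.labels_ == k].mean(axis=0)
                       for k in range(n_categories)])
    return kmeans, priors

def modulated_saliency(bottom_up_map, gist, kmeans, priors):
    """Weight a bottom-up saliency map by the fixation prior of the scene's category."""
    category = int(kmeans.predict(gist.reshape(1, -1))[0])
    combined = bottom_up_map * priors[category]     # assumed multiplicative combination
    return combined / (combined.max() + 1e-8)
```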
Chapter 3
Experiment and Analysis
In this chapter, I will present our study on the temporal properties of visual attention (fixation duration). I will give details of our method for collecting eye movement data from human subjects and parsing the eye data into fixations and saccades, followed by the analysis of fixation durations in the context of scene transitions. Our analyses show that fixation durations vary in response to global visual interruptions. In addition, fixation durations can also be used as a behavioural metric to quantify the centre bias.
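The parsing procedure itself is detailed in the methods below; as a generic illustration of what fixation/saccade parsing involves, the following sketch applies a simple velocity-threshold (I-VT) rule to a gaze trace. The thresholds and function name are illustrative and not the values or algorithm used in this study.

```python
import numpy as np

def parse_fixations(x, y, t, velocity_threshold=30.0, min_duration=0.08):
    """Velocity-threshold (I-VT) parsing of an eye trace into fixations.

    x, y : gaze position in degrees of visual angle; t : timestamps in seconds.
    Samples below the velocity threshold (deg/s) are grouped into fixations;
    groups shorter than min_duration are discarded. Returns a list of
    (centroid_x, centroid_y, duration) tuples.
    """
    x, y, t = map(np.asarray, (x, y, t))
    vx = np.gradient(x, t)
    vy = np.gradient(y, t)
    is_fixation = np.hypot(vx, vy) < velocity_threshold

    fixations, start = [], None
    # A trailing False acts as a sentinel so the last fixation run is flushed.
    for i, fix in enumerate(np.append(is_fixation, False)):
        if fix and start is None:
            start = i
        elif not fix and start is not None:
            duration = t[i - 1] - t[start]
            if duration >= min_duration:
                fixations.append((x[start:i].mean(), y[start:i].mean(), duration))
            start = None
    return fixations
```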
3.1 Methods
3.1.1 Participants

Eye movement traces were collected from a total of 32 university students (17 female and 15 male) between the ages of 20 and 28. They were monetarily compensated for their time. All the participants had normal or corrected-to-normal vision and were naive to eye movement experiments. Informed consent was obtained from all the participants before the start of the experiment.
Table 3.1: Movie Database

Index  Group  Movie Name           Scenes  Resolution  Run-Time (min:sec)
V01    1      Everest              95      636x480     20:10
V02    1      Animals              88      620x460     10:29
V03    1      Cats                 64      620x380     06:18
V04    1      Galapagos            43      636x480     07:40
V05    1      The Big Lebowski     62      640x476     04:55
V06    1      The Matrix           104     720x296     04:26
V07    2      Hitler               260     624x352     18:01
V08    2      Forbidden City Cop   348     720x416     17:06
V09    2      Flirting Scholar     446     720x400     19:03
V10    2      I, Robot             341     800x340     18:11
V11    2      Kung Fu Hustle       212     704x528     19:32
V12    2      Wong Fei Hong        237     720x320     18:01
3.1.2 Visual Stimuli

Twelve motion picture films were used to extract the movie clips used in the experiment. Each movie clip was taken from one film, with lengths ranging from 4 to 20 minutes, and included numerous scene transitions. The movies were properly de-interlaced before they were shown to the subjects. The audio track was stripped off the movies, followed by a conversion from RGB colour to neutral monochromatic luminance. To avoid any aspect ratio artifacts due to resizing frames to a common resolution, the movies were played at their native resolutions, which meant that the movie clips we used had slightly different resolutions and a range of run-times (Table 3.1). The movies were always centred on the screen, surrounded by a gray background, so it was unlikely that the different frame resolutions would have affected our results.
3.1.3 Procedure

We performed two sets of experiments. In the first set, 11 participants were presented with movies from Group 1 (see Table 3.1). Preliminary data analysis encouraged us to expand the collected psychophysical data by including more participants and longer movies. Thus, in the second set of experiments, 11 participants were presented with longer-duration movies from Group 2. Movies were played on a 21-inch computer monitor with a refresh rate of 120 Hz, at a distance of 56 cm from the subject (corresponding to a 40 x 30 degree field of view). Subjects were tested for eye dominance to facilitate better eye tracking. For the eye dominance test, subjects were instructed to extend their arms with palms facing away. They were then asked to bring their hands together and use the forefinger and thumb of both hands to form a small hole in the middle. This was followed by the instruction to look through the hole and focus on an object about 15 feet away, with both eyes open. Subsequently, they were instructed to close one eye at a time to find out when the object disappeared from view within the hole. The eye for which the object remained in the hole was determined to be the dominant eye. The subject's head was stabilized with a chin-rest, and subjects were instructed to simply watch the movies, which were shown in random order. In order to assess the level of alertness of the subjects, they were told that they would be asked some general questions about the movies at the end of the experiment.

In each experimental session, the movies were blocked into six trials. The order of these movies was varied between participants to minimize any potential order effects on their fixation patterns. All the subjects underwent a calibration procedure before the start of each eye tracking trial. Subjects were allowed to take breaks between the trials, and were allowed to complete a session over multiple days. In order to maintain alertness during the experiment, we limited each session to a maximum of 45 minutes. Instantaneous eye positions were tracked by a high-speed CMOS camera (CRS Research) utilizing the pupil and dual first Purkinje images (250 Hz sampling frequency, average gaze position accuracy of 0.25 degrees, and gaze tracking ranges between -20 to +20 degrees horizontal and -15 to +15 degrees