
Chapter 4
Computational Modelling
4.1 Methods
We used a Spatio-Temporal Coherency model, which is an extension of the user
attention model previously described by Ma et al. (2002). Our model used motion
intensity, spatial coherency, temporal coherency, face detection and gist modulation
to compute saliency maps (see Figure 4.1). The critical part of this model is motion
saliency, defined as the attention due to motion. Abrams and Christ (2003) have shown
that the onset of motion captures visual attention. Motion saliency for a frame is
computed over multiple preceding frames. According to our investigations, overt
orienting of the fovea to any motion-induced salient location in a given frame, n, is
influenced by the saliency at that location in the preceding frames (n-1, n-2, ...).
This is known as saccade latency in the literature, with latencies typically in the
order of 200-250 milliseconds (Becker and Jürgens, 1979; Hayhoe et al., 2003). We
investigated the influence of up to ten preceding frames (approximately 500
milliseconds); however, beyond the fifth frame (i.e., n-5) we did not see any
significant contribution to overt orienting. This means that the currently fixated
location in frame n was indeed selected based
on the saliency at that location in up to the n-5 preceding frames. The on-screen time
for 5 frames was about 210 milliseconds (the video frame rate in our experiments
was 24 frames per second), which is well within the bounds of the reported saccade
latencies.
4.1.1 Spatio-Temporal Saliency
We computed motion vectors using the Adaptive Rood Pattern Search (ARPS) algorithm
for fast block-matching motion estimation (Nie and Ma, 2002). The motion
vectors are computed by dividing the frame into a matrix of blocks. The fast
ARPS algorithm leverages the fact that general motion is usually coherent. For
example, if a block is surrounded by other blocks that moved in a particular
direction, there is a high probability that the current block will also have a
similar motion vector. Thus, the algorithm estimates the motion vector of a given
block using the motion vector of the macroblock to its immediate left.
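For concreteness, the sketch below illustrates the block-matching idea that ARPS accelerates; it is a simplified exhaustive search (the block size, search range, and function name are our choices, and the rood-pattern refinement of Nie and Ma (2002) is omitted).

```python
import numpy as np

def block_matching_motion(prev_frame, curr_frame, block=16, search=7):
    """Simplified block-matching motion estimation (illustrative sketch).

    For every block in the current frame, find the displacement (dy, dx)
    within +/- `search` pixels that minimises the sum of absolute
    differences (SAD) against the previous frame.  ARPS (Nie & Ma, 2002)
    accelerates this by predicting each block's vector from the macroblock
    to its immediate left and only probing a small rood-shaped pattern.
    """
    h, w = curr_frame.shape
    rows, cols = h // block, w // block
    motion = np.zeros((rows, cols, 2), dtype=np.float32)   # (dy, dx) per block

    for r in range(rows):
        for c in range(cols):
            y, x = r * block, c * block
            cur = curr_frame[y:y + block, x:x + block].astype(np.float32)
            best_sad, best = np.inf, (0, 0)
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    yy, xx = y + dy, x + dx
                    if yy < 0 or xx < 0 or yy + block > h or xx + block > w:
                        continue
                    ref = prev_frame[yy:yy + block, xx:xx + block].astype(np.float32)
                    sad = np.abs(cur - ref).sum()
                    if sad < best_sad:
                        best_sad, best = sad, (dy, dx)
            motion[r, c] = best
    return motion
```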
The Motion Intensity map, I (Figure 4.2), which is a measure of motion-induced
activity, is computed from the motion vectors (dx, dy), normalised by the maximum
magnitude in the motion vector field:

I(i, j) = \sqrt{dx_{i,j}^2 + dy_{i,j}^2} / \text{Max of Magnitude}     (4.1)
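Given a motion vector field, Equation 4.1 reduces to a simple normalisation; a minimal sketch (array names are ours):

```python
import numpy as np

def motion_intensity(dx, dy):
    """Equation 4.1: the magnitude of each block's motion vector,
    normalised by the maximum magnitude in the motion vector field."""
    magnitude = np.sqrt(dx.astype(np.float32) ** 2 + dy.astype(np.float32) ** 2)
    max_mag = magnitude.max()
    return magnitude / max_mag if max_mag > 0 else magnitude
```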
Spatial coherency (Cs) and temporal coherency (Ct) maps were obtained by
filtering the frames with an entropy filter. The Cs maps captured regularity
at a spatial scale of 9x9 pixels within a frame, while the Ct maps captured regularity
at the same spatial scale (i.e., 9x9 pixels) but over a temporal scale of 5 frames.
Spatial coherency (see Figure 4.3) measured the consistency of pixels around the
point of interest, otherwise known as the correlation.
Figure 4.1: Spatio-Temporal saliency model architecture diagram. Bottom-up (B/U) channels compute motion intensity (I), spatial coherency (Cs) and temporal coherency (Ct), which pass through a 9-scale Gaussian pyramid and centre-surround suppression before being combined as I x Ct x (1 - I x Cs) and modulated by a face channel (Viola and Jones, 2004); the top-down (T/D) gist channel modulates the result with the fixation map of the winning scene category.
Figure 4.2: Computation of motion intensity on adjacent frames: motion vectors are computed between Frame(n-1) and Frame(n), and the motion vector field yields the motion intensity map. Three examples from different movies are shown.
The higher the correlation, the higher the probability that the pixels belong to the
same object. This is computed as the entropy over the block of pixels. Higher
entropy indicates more randomness in the block structure, implying lower correlation
among pixels and hence lower spatial coherency. Figure 4.3 shows five examples of
the spatial coherency computation using the following equation.
Cs(x, y) = -\sum_{i=1}^{n} P_s(i) \log P_s(i)     (4.2)

where P_s(i) is the probability of occurrence of the pixel intensity i and n
corresponds to the 9x9 neighbourhood.
Similarly, to compute the consistency in pixel correlation over time, we used

Ct(x, y) = -\sum_{i=1}^{n} P_t(i) \log P_t(i)     (4.3)

where P_t(i) is the probability of occurrence of the pixel intensity i at the
corresponding location in the preceding five frames (m = 5).
Figure 4.3: Examples of spatial coherency maps computed on five different movie frames.
Higher entropy implies greater motion
and thus higher saliency at that location. The temporal coherency map (see Figure
4.4) in general signifies the motion energy in each fixated frame, contributed by the
five preceding frames (except for boundary frames, where motion vectors are invalid
due to scene or camera transitions).
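A minimal, illustrative sketch of the entropy filtering behind Equations 4.2 and 4.3 is given below; the histogram binning and the assumption that frames are scaled to [0, 1] are ours, and the implementation favours clarity over speed.

```python
import numpy as np
from scipy.ndimage import generic_filter

def _entropy(values, bins=32):
    # Shannon entropy of the intensity distribution within one local window;
    # assumes intensities have been scaled to [0, 1] (our assumption).
    hist, _ = np.histogram(values, bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def spatial_coherency(frame):
    """Equation 4.2: local entropy over a 9x9 neighbourhood within one frame."""
    return generic_filter(frame.astype(np.float32), _entropy, size=9)

def temporal_coherency(frames):
    """Equation 4.3: local entropy over a 9x9 neighbourhood spanning the five
    preceding frames (`frames` has shape (5, H, W)); the map aligned with the
    most recent frame is the last slice of the filtered volume."""
    stack = np.asarray(frames, dtype=np.float32)
    filtered = generic_filter(stack, _entropy, size=(5, 9, 9))
    return filtered[-1]
```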
Once all three feature maps are computed, we apply centre-surround suppression
to these maps to highlight regions having higher spatial contrast. This is akin
to simulating the behaviour of ganglion cells in the retina (Hubel and Wiesel, 1962). To
achieve this, we first compute a dyadic Gaussian pyramid (Burt and Adelson, 1983)
for each map by repeatedly low-pass filtering and subsampling the map (see Figure
4.5). For low-pass filtering, we used a 6 x 6 separable Gaussian kernel (Walther
and Koch, 2006) defined as K = [1 5 10 10 5 1] / 32 (see Walther, 2006, Appendix A.1
for more details).
We start with level 1 (L1), which is the actual size of the map. The image for each
successive level is obtained by first low-pass filtering the image. This step results
in a blurry image with suppressed higher spatial frequencies. The resulting image
is then subsampled to half of its current size to obtain the level 2 (L2) image. The
process continues until the map cannot be further subsampled (L9 in Figure 4.5).
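A sketch of this dyadic pyramid construction with the separable kernel K = [1 5 10 10 5 1] / 32 follows; the use of scipy's separable 1-D convolution and simple stride-2 subsampling is our choice.

```python
import numpy as np
from scipy.ndimage import convolve1d

K = np.array([1, 5, 10, 10, 5, 1], dtype=np.float32) / 32.0

def gaussian_pyramid(feature_map, levels=9):
    """Dyadic Gaussian pyramid: repeatedly low-pass filter with the separable
    kernel K (rows, then columns) and subsample by a factor of two, until
    `levels` maps have been produced or the map can no longer be halved."""
    pyramid = [feature_map.astype(np.float32)]        # level 1: original size
    for _ in range(levels - 1):
        smoothed = convolve1d(pyramid[-1], K, axis=0, mode='reflect')
        smoothed = convolve1d(smoothed, K, axis=1, mode='reflect')
        sub = smoothed[::2, ::2]                      # halve each dimension
        if min(sub.shape) < 1:
            break
        pyramid.append(sub)
    return pyramid
```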
Figure 4.4: Examples of temporal coherency maps computed over the previous five frames (n-5 to n-1), shown for three different movie examples.
Figure 4.5: Example of a temporal coherency map at nine different levels of the Gaussian pyramid. Starting at level 1 (L1 in the figure), which has the same size as the original map, each successive level is obtained by low-pass filtering and then subsampling the map to half of its size at the current level.
To simulate the behaviour of centre-surround receptive fields, we take the difference
between different levels of the pyramid for a given feature map, as previously
described in Itti et al. (1998). We select different levels of the pyramid to represent
the centre C ∈ {2, 3, 4} and the surround S ∈ {C + δ}, where δ ∈ {3, 4}. This results
in six intermediate maps, as shown in Figure 4.6. To obtain point-wise differences
across scales, the images are interpolated to a common size.
All six centre-surround maps are then added across scales to obtain a
single map per feature, as shown in Figure 4.7.
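A sketch of the centre-surround step is given below: each centre level c ∈ {2, 3, 4} is differenced against the surround levels c + δ, δ ∈ {3, 4}, after resizing to a common size, and the six maps are summed. The use of absolute differences (as in Itti et al., 1998), OpenCV bilinear resizing, and the choice of the smallest centre level as the common size are our assumptions.

```python
import cv2
import numpy as np

def center_surround(pyramid, centres=(2, 3, 4), deltas=(3, 4)):
    """Across-scale point-wise differences |centre - surround| for the six
    (c, c + delta) pairs, summed into a single map per feature.  Pyramid
    levels are 1-indexed in the text, so level L is pyramid[L - 1]; all maps
    are resized to the size of the smallest centre level."""
    ref_shape = pyramid[max(centres) - 1].shape
    target = ref_shape[::-1]                      # cv2.resize expects (width, height)
    out = np.zeros(ref_shape, dtype=np.float32)
    for c in centres:
        for d in deltas:
            centre = cv2.resize(pyramid[c - 1], target, interpolation=cv2.INTER_LINEAR)
            surround = cv2.resize(pyramid[c + d - 1], target, interpolation=cv2.INTER_LINEAR)
            out += np.abs(centre - surround)
    return out
```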
All three feature maps are then combined to produce a standard saliency map:

SaliencyMap = I \times Ct \times (1 - I \times Cs)     (4.4)
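Given the three feature maps rescaled to a common size and normalised to [0, 1] (the normalisation is our assumption), Equation 4.4 is a point-wise combination:

```python
def combine_features(I, Ct, Cs):
    """Equation 4.4: SaliencyMap = I * Ct * (1 - I * Cs), computed
    element-wise on feature maps normalised to [0, 1]."""
    return I * Ct * (1.0 - I * Cs)
```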
Figure 4.6: Taking point-wise differences across scales (2-5, 2-6, 3-6, 3-7, 4-7, 4-8) results in six intermediate maps for a given feature map.
Figure 4.7: Final feature maps (motion intensity, spatial coherency and temporal coherency) obtained after adding the across-scale centre-surround differences. The top row shows the feature maps before centre-surround suppression is applied; the bottom row shows the final feature maps after applying centre-surround suppression via across-scale point-wise differences followed by summation. The example, shown for one movie frame, clearly demonstrates the effectiveness of centre-surround suppression in producing sparser feature maps.
Since higher entropy in the temporal coherency map indicates greater motion
over a particular region, the intensity maps are directly multiplied with the temporal
coherency maps. This highlights the contribution of the motion-salient regions in
the saliency maps. In contrast, higher entropy in the spatial coherency map
indicates randomness in the block structure, suggesting that the region does not belong
to any single entity or object. Since we are interested in motion saliency induced
by spatially coherent objects, we assign higher values to the pixels belonging to
the same object. In Figure 4.8 we show the standard saliency map computed for a
randomly chosen frame from each of the movies in our database.
Figure 4.8: Saliency maps shown for a randomly selected frame from every movie in the database (Animals, Cats, Matrix, BigLebowski, Galapagos, Everest, Hitler, ForbiddenCityCop, IRobot, KungFuHustle, WongFeiHong). Columns 1 and 3 show movie frames, while columns 2 and 4 show the saliency map for the corresponding frame. Higher saliency values are indicated by warmer colours, as illustrated by the colour map on the right.
4.1.2 Face Modulation
We modulate the standard saliency map with high-level semantic knowledge, such
as faces, using a state-of-the-art face detector (Viola and Jones, 2004). This accounts
for the fact that overt attention is frequently deployed to faces (Cerf et al., 2009),
and it can be argued that faces are a part of the bottom-up information, as there
are cortical areas specialized for faces, in particular the fusiform gyrus (Kanwisher
et al., 1997).
The Viola and Jones (2004) face detector is based on training a cascaded classifier
using a learning technique called AdaBoost on a set of very simple visual features.
These visual features have Haar-like properties, as they are computed by
subtracting the sum of a sub-region from the sum of the remaining region.
Figure 4.9 shows examples of Haar-like rectangle features. Panels A and B show two-
rectangle features (horizontal / vertical) and panels C and D show three-rectangle
and four-rectangle features respectively. The value of a feature is computed
by subtracting the sum of the pixel values in the white region from the sum of
the pixel values in the grey region. These Haar-like features are simple and very
efficient to compute using the integral image representation. The integral image
representation allows the computation of any rectangle sum in constant time.
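The constant-time rectangle sums follow from the integral image, in which each entry holds the sum of all pixels above and to the left, so any rectangle sum requires only four lookups. A minimal sketch (the zero-padded border is a convenience of ours):

```python
import numpy as np

def integral_image(img):
    """Integral image with a leading row/column of zeros so that rectangle
    sums need no boundary checks."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def rect_sum(ii, top, left, height, width):
    """Sum of pixels in the rectangle [top:top+height, left:left+width],
    using four lookups regardless of the rectangle size."""
    return (ii[top + height, left + width] - ii[top, left + width]
            - ii[top + height, left] + ii[top, left])
```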
The Haar-like features are extracted over a 24 x 24 pixel sub-window, resulting
in thousands of features per image. The goal here is to construct a strong classifier
by selecting a small number of discriminant features from the limited set of labelled
training images. This is achieved by employing the AdaBoost technique to learn a
cascade of weak classifiers. Each weak classifier in the cascade is trained on a single
feature. The term weak signifies the fact that no single classifier in the cascade
can classify all the examples accurately. For each round of boosting, the AdaBoost
method selects the weak classifier with the lowest error rate, controlled by the
desired hit and miss rates. This is followed by the re-assignment of the weights to
emphasize the examples with poor classification for the next round.
Figure 4.9: Example of four basic rectangular features, as shown in the Viola and Jones
(2004) IJCV paper. Panels A and B show two-rectangle features, while panels C and D
show three-rectangle and four-rectangle features. Panel E shows an example of two features
overlaid on a face image. The first feature is a two-rectangle feature measuring the
difference between the eye and upper-cheek regions, while the second feature, a three-
rectangle feature, measures the difference between the eye region and the upper-nose region.
The AdaBoost method is regarded as a greedy algorithm, since it assigns a large
weight to every good feature and a small weight to poor features. The final strong
classifier is then a weighted combination of the weak classifiers.
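The sketch below illustrates the boosting loop on single-feature threshold classifiers (decision stumps); it shows the weight re-assignment described above rather than the exact Viola-Jones cascade training with hit/miss-rate targets, and all names are illustrative.

```python
import numpy as np

def adaboost_stumps(X, y, rounds=10):
    """Discrete AdaBoost over decision stumps.  X is (n_samples, n_features),
    y is +/-1.  Each round selects the single-feature threshold classifier
    with the lowest weighted error, then re-weights the examples so that
    misclassified ones count more in the next round."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)
    strong = []                                 # list of (alpha, feature, thresh, sign)
    for _ in range(rounds):
        best = None
        for f in range(d):
            for thresh in np.unique(X[:, f]):
                for sign in (1, -1):
                    pred = np.where(sign * (X[:, f] - thresh) > 0, 1, -1)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, f, thresh, sign, pred)
        err, f, thresh, sign, pred = best
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)   # weight of this weak classifier
        w *= np.exp(-alpha * y * pred)          # emphasise misclassified examples
        w /= w.sum()
        strong.append((alpha, f, thresh, sign))
    return strong

def adaboost_predict(strong, X):
    """Final strong classifier: sign of the weighted vote of the weak classifiers."""
    score = sum(a * np.where(s * (X[:, f] - t) > 0, 1, -1) for a, f, t, s in strong)
    return np.sign(score)
```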
Panel E of Figure 4.9 demonstrates how the selected features reflect useful
properties of the image. The example shows a two-feature classifier (the top row shows
the two selected features) trained over 507 faces. The first feature measures the
difference in luminance between the eye region and the upper cheeks.
The second selected feature measures the difference in luminance between
the eye region and the nose bridge. An intuitive rationale behind the selection of
these features is that the eye region is generally darker than the skin region.
Previous findings on static images suggest that people look at face components
(eyes, mouth, and nose) preferentially, with the eyes being given more preference
than the other components (Buswell, 1935; Yarbus, 1967; Langton et al., 2000;
Birmingham and Kingstone, 2009). However, a recent study (Võ et al., 2012) on
gaze allocation in dynamic scenes suggests that eyes are not fixated preferentially.
Võ et al. (2012) showed that the percentage of overall gaze distribution is not
significantly different for any of the face components for vocal scenes. However,
for mute scenes, they did find a significant drop in gaze distribution for the mouth
compared to the eyes and nose. In fact, the nose was given priority over the eyes
regardless of whether the person in the video made eye contact with the camera
or not, although these differences were found to be insignificant.
To detect faces in our video database, we used trained classifiers from the
OpenCV library (Bradski and Pisarevsky, 2000). The classifier detects a face
region and returns a bounding box encompassing the complete face. This is
followed by convolving the face region with a Gaussian whose size equals the
width of the box and whose peak value lies at the centre of the box. This automatically
assigns the highest feature value to the nose compared to the other face components.
Figure 4.10 shows the process of face modulation for an example frame from the movie
"The Matrix" (1999). Note that the bottom right corner highlights the salient regions
in the movie frame by overlaying the face-modulated saliency map on the movie
frame.
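A sketch of this face modulation step using OpenCV's pretrained frontal-face cascade is shown below; the particular cascade file, the Gaussian sigma, and the use of a point-wise maximum to merge the face bump into the saliency map are our assumptions, since the text only specifies the kernel width and peak location.

```python
import cv2
import numpy as np

def face_modulate(frame_bgr, saliency):
    """Detect faces with a Viola-Jones cascade and boost the saliency map
    inside each face bounding box with a Gaussian peaking at the box centre.
    `saliency` is assumed to have the same resolution as the frame."""
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

    out = saliency.astype(np.float32).copy()
    for (x, y, w, h) in faces:
        # Separable Gaussian sized to the box; sigma = w / 4 is our assumption.
        gx = cv2.getGaussianKernel(int(w), w / 4.0)
        gy = cv2.getGaussianKernel(int(h), w / 4.0)
        bump = gy @ gx.T
        bump /= bump.max()                      # peak value 1 at the box centre
        out[y:y + h, x:x + w] = np.maximum(out[y:y + h, x:x + w], bump)
    return out
```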
Figure 4.10: Example of saliency map modulation with a detected face region of interest (ROI). The top left panel shows the original movie frame with the face ROI bounding box. Subsequent panels show how the face modulation is applied to the spatio-temporal saliency map. The bottom right panel overlays the face-modulated saliency map on the movie frame, signifying hot spots in the frame.
4.1.3 Gist Modulation
We investigated an improvement to the bottom-up influenced spatio-temporal
saliency model by incorporating top-down semantics of the scene. Our hypothesis is
that variability in eye movement patterns for different scene categories (O'Connell
and Walther, 2012) can help in improving saliency prediction for the early fixations.
Earlier research has shown the influence of scene context in guiding
visual attention (Neider and Zelinsky, 2006; Chen et al., 2006). In Neider and
Zelinsky (2006), scene-constrained targets were found faster, with a higher
percentage of initial saccades directed to target-consistent scene regions. Moreover,
they found that contextual guidance biases eye movements towards target-consistent
regions (Navalpakkam and Itti, 2005) as opposed to excluding target-inconsistent
scene regions (Desimone and Duncan, 1995). Chen et al. (2006) showed that in the
presence of both top-down (scene preview) and bottom-up (colour singleton) cues,
top-down information prevails in guiding eye movements. They observed faster
manual reaction times, and more initial saccades were made to the target location.
In comparison, the colour singleton attracted attention only in the absence of a
scene preview.
Currently, there are different ways to compute the gist descriptor of an image
(Oliva and Torralba, 2001; Renninger and Malik, 2004; Siagian and Itti, 2007;
Torralba et al., 2003).
In the framework proposed by Oliva and Torralba (2001) for the purpose
of scene classification, an input image is subdivided into 4x4 equal-sized, non-
overlapping segments. A magnitude spectrum of the windowed Fourier transform
is then computed over each of these segments. This is followed by feature
dimension reduction using Principal Component Analysis.
Siagian and Itti (2007) computed the gist descriptor from the hierarchical model
of Itti et al. (1998). A 4x4 non-overlapping grid is placed over 34 sub-channels from
colour, orientation and intensity. An average value over each grid box is computed,
yielding a 16-dimensional vector per sub-channel. The resulting 544 raw gist values
are then reduced using PCA/ICA to an 80-dimensional gist descriptor or feature
vector. Subsequently, the scene classification is done using neural networks.

In experiments conducted by Renninger and Malik (2004), subjects
were asked to identify scenes to which they were exposed very briefly (<70 ms).
They found that the probability of correctly classifying the scene was always above
chance and improved with longer exposure durations. They subsequently proposed a
simple model based on texture analysis and showed its usefulness for scene
categorization. The model applied Gabor filters to an input scene and extracted 100
universal textons, selected from training using K-means clustering. Textons were
first proposed by Julesz (1981, 1986) as a first-order statistic determining
the strength of a texture relative to its surrounding texture. The gist vector
is then computed as a histogram of the universal textons. A new scene is later
identified by matching its texton histogram with the learned examples.
Torralba and colleagues (Torralba et al., 2003) proposed a model using wavelet
image decomposition, tuned to 6 orientations and 4 scales. A raw 384-dimensional
gist feature vector is then computed by averaging the filter responses (24
filter responses in total) over a 4x4 grid. Dimensionality reduction is applied, using
PCA, to reduce the gist descriptor to only 80 dimensions. Finally, classification
is done by finding the minimum Euclidean distance between the gist descriptor of
the input image and the learned examples.
Usually, gist descriptors are computed for manually labelled scenes (i.e., scenes
with a category label) (Oliva and Torralba, 2001). In a recent study, Fei-Fei and Perona
(2005) attempted to learn a scene's global representation using a distribution of
codewords. Codewords were represented by image patches randomly sampled from
650 training examples over 13 scene categories. A codebook of codewords was
learned using the k-means algorithm. Subsequently, theme models were formulated
using the best distribution of the learned codewords. Finally, each scene category
model was based on a mixture of theme models and codewords.
The overall gist modulation process can be described as follows. In total, we had
2300 scenes across our entire movie database. We used a 50-50 ratio to formulate
our training and test sets. To discover scene categories, we picked the first frame from
each scene as a representative frame of that scene. The rationale behind picking
the first frame is that the gist of a scene can be extracted within a glance (Oliva,
2005; Loschky and Larson, 2008). This makes the first frame of each scene a
good candidate for the computation of the gist descriptor. Furthermore, the overall
semantics within a scene did not vary much across all the frames of a given scene;
e.g., a fight sequence (indoor scene) remained indoor throughout the progression
of the scene. In the training phase (see Figure 4.11), the gist descriptors are used
to discover scene categories using an unsupervised clustering algorithm (Harris
et al., 2000). Each discovered cluster corresponds to one scene category. Once
the scene categories are identified, we formulate fixation maps (later referred to as
categorical fixation maps) for each of these categories, using only the early fixations in
the training set. The early fixations are defined as fixations ∈ {2, 3, 4, 5}. In the
test phase, we first classify a given movie scene based on its gist descriptor and
subsequently modulate the saliency maps, for early-fixation frames, using the categorical
fixation maps.
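A high-level sketch of this training/test procedure is given below. We substitute scikit-learn's Gaussian mixture clustering (with BIC model selection) for the unsupervised algorithm of Harris et al. (2000), and modulate by point-wise multiplication; both substitutions are our assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gist_categories(gist_train, fixmaps_train, max_categories=4):
    """Cluster training-scene gist descriptors into scene categories and build
    one categorical fixation map per cluster by averaging the early-fixation
    maps (fixations 2-5) of its member scenes."""
    best, best_bic = None, np.inf
    for k in range(2, max_categories + 1):          # model selection via BIC (our choice)
        gmm = GaussianMixture(n_components=k, random_state=0).fit(gist_train)
        bic = gmm.bic(gist_train)
        if bic < best_bic:
            best, best_bic = gmm, bic
    labels = best.predict(gist_train)
    cat_maps = {c: np.mean([fm for fm, l in zip(fixmaps_train, labels) if l == c], axis=0)
                for c in np.unique(labels)}
    return best, cat_maps

def gist_modulate(saliency, gist_descriptor, gmm, cat_maps):
    """Classify a test scene by its gist descriptor and modulate its
    early-fixation saliency maps with that category's fixation map."""
    category = gmm.predict(gist_descriptor.reshape(1, -1))[0]
    return saliency * cat_maps[category]
```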
4.1.3.1 Simulations
To test the robustness of the proposed method of top-down gist modulation, we ran
1000 simulations. One thousand random permutations of the 2300 frames (the first
frame from each scene of a movie) were generated. These permutations were then
bisected to form training and test sets. This was followed by clustering of the training
frames and the subsequent creation of categorical fixation maps for every simulation.
Finally, the modulation of the saliency maps of the test frames was carried out for each
simulation. It is worth mentioning that the computation of the gist descriptors and
the face-modulated saliency maps was carried out once and reused for every
simulation of the gist modulation.
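A minimal sketch of the permutation scheme (the seed and function name are ours):

```python
import numpy as np

def make_splits(n_scenes=2300, n_sims=1000, seed=0):
    """Generate the train/test index pairs for each simulation: a random
    permutation of all first frames, bisected 50/50."""
    rng = np.random.default_rng(seed)
    splits = []
    for _ in range(n_sims):
        order = rng.permutation(n_scenes)
        splits.append((order[:n_scenes // 2], order[n_scenes // 2:]))
    return splits
```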
4.1.3.2 Gist Descriptor
To build a gist descriptor, we used 32 multi-scale oriented filters (Torralba, 2003a;
Oliva et al., 2006), encompassing 4 different scales and 8 different orientations (see
Figure 4.11). This operation resulted in 32 response images for each training
frame. These response images represented the statistical configuration of orientations
and spatial frequencies present in the movie frame. Since the dimensionality of the
image space is very high, any further processing would have been computationally
very expensive. To address this issue, the response images were first segmented
into N x N non-overlapping regions. Each region was then represented by its mean
value. This operation resulted in a 1x32-dimensional vector per region across the
response images. To see whether the choice of N had any effect on the final results,
we experimented with three different values of N ∈ {2, 3, 4}. To further reduce the
dimensionality of the feature space, we used principal component analysis (PCA).
PCA was performed on each region, reducing the number of dimensions from 32
to 2 by using the 2nd and 3rd principal components. We avoided selecting
the 1st principal component since we found that all the filters contributed equally
to it. However, for the 2nd and 3rd principal components, we found discriminative
power in the 32 filter responses, with some filters contributing less than others.
Examples of the gist descriptor for different choices of N are shown in Figure 4.12.
We also show a Fourier signature of the gist descriptor, in the original space and in
the reduced-dimension space. As shown, dimensionality reduction did not have any
evident effect on the gist descriptor's Fourier signature.
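A sketch of this descriptor computation follows: each frame is filtered with 32 oriented Gabor filters (4 scales x 8 orientations), each response is averaged over an N x N grid of regions, and PCA keeps the 2nd and 3rd principal components per region. The Gabor parameter values and OpenCV's getGaborKernel (used in place of the filters of Torralba, 2003a) are our assumptions.

```python
import cv2
import numpy as np
from sklearn.decomposition import PCA

def gist_descriptors(frames_gray, N=4):
    """Gist descriptors for a set of grayscale frames: 32 Gabor responses,
    averaged over an N x N grid, reduced to PCs 2 and 3 per region."""
    n_frames = len(frames_gray)
    feats = np.zeros((n_frames, N * N, 32), dtype=np.float32)
    scales = [5, 9, 17, 33]                      # Gabor kernel sizes (assumption)
    thetas = np.arange(8) * np.pi / 8            # 8 orientations
    for i, frame in enumerate(frames_gray):
        h, w = frame.shape
        k = 0
        for ksize in scales:
            for theta in thetas:
                # getGaborKernel(ksize, sigma, theta, lambda, gamma, psi)
                kern = cv2.getGaborKernel((ksize, ksize), ksize / 4.0, theta,
                                          ksize / 2.0, 0.5, 0)
                resp = np.abs(cv2.filter2D(frame.astype(np.float32), -1, kern))
                for r in range(N):               # mean response per grid region
                    for c in range(N):
                        block = resp[r * h // N:(r + 1) * h // N,
                                     c * w // N:(c + 1) * w // N]
                        feats[i, r * N + c, k] = block.mean()
                k += 1
    # PCA per region: keep principal components 2 and 3 (indices 1 and 2)
    reduced = np.zeros((n_frames, N * N, 2), dtype=np.float32)
    for region in range(N * N):
        pcs = PCA(n_components=3).fit_transform(feats[:, region, :])
        reduced[:, region, :] = pcs[:, 1:3]
    return reduced.reshape(n_frames, -1)         # (n_frames, N*N*2) gist vectors
```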
4.1.3.3 Learning Scene Categories from Training Data
The reduced-dimensional training data was subjected to an unsupervised clustering
algorithm (Harris et al., 2000) to discover the scene categories. Since the training
data had more than 3 dimensions, visualizing the feature space in all dimensions
at once was not possible. In Figure 4.14, we show the feature space
for N = 2 by projecting the data onto a two-dimensional Cartesian coordinate system,
one pair of dimensions at a time, thus showing all the possible combinations of the
available dimensions. This particular example was taken from simulation 927, for
which we found two scene categories. In general, we found 2 to 4 scene categories
in our training data for the different choices of N ∈ {2, 3, 4}, over 1000 simulations.
Additional examples of such cases are shown in Figure 4.15. The small number of
scene categories we found may be the result of limited variety in our movie dataset.
In general, our movies
Figure 4.11: A flowchart of the training process for discovering scene categories. In total, 1150 examples are used to compute the gist descriptors (frame segmentation into 2x2, 3x3 or 4x4 regions, spatially oriented filtering, and region means), followed by dimensionality reduction using PCA and unsupervised clustering using a mixture of Gaussians; a categorical fixation map is created for each discovered cluster.