
Chapter 4
Computational Modelling
4.1 Methods
We used a Spatio-Temporal Coherency model, which is an extension of the user
attention model previously described by Ma et al. (2002). Our model used motion
intensity, spatial coherency, temporal coherency, face detection and gist modulation
to compute saliency maps (see Figure 4.1). The critical part of this model is motion
saliency, defined as the attention due to motion. Abrams and Christ (2003) have shown
that the onset of motion captures visual attention. Motion saliency for a frame is
computed over multiple preceding frames. According to our investigations, overt
orienting of the fovea to any motion-induced salient location in a given frame, n, is
influenced by the saliency at that location in the preceding frames (n-1, n-2, ...).
This is known as saccade latency in the literature, with latencies typically in the
order of 200-250 milliseconds (Becker and Jürgens, 1979; Hayhoe et al., 2003). We
investigated the influence of up to ten preceding frames (approximately 500
milliseconds); however, beyond the fifth frame (i.e., n-5) we did not see any
significant contribution to overt orienting. This means that the currently fixated
location in frame n was indeed selected based
on the saliency at that location in up to the n-5 preceding frames. The on-screen time
for 5 frames was about 210 milliseconds (the video frame rate in our experiments
was 24 frames per second), which is well within the bounds of the reported saccade
latencies.
4.1.1 Spatio-Temporal Saliency
We computed motion vectors using the Adaptive Rood Pattern Search (ARPS) algorithm
for fast block-matching motion estimation (Nie and Ma, 2002). The motion
vectors are computed by dividing the frame into a matrix of blocks. The fast
ARPS algorithm leverages the fact that general motion is usually coherent. For
example, if a block is surrounded by other blocks that moved in a particular
direction, there is a high probability that the current block will also have a
similar motion vector. Thus, the algorithm estimates the motion vector of a given
block using the motion vector of the macroblock to its immediate left.
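For concreteness, the sketch below illustrates the block-matching idea that ARPS accelerates; it is a simplified exhaustive search (the block size, search range, and function name are our choices, and the rood-pattern refinement of Nie and Ma (2002) is omitted).

```python
import numpy as np

def block_matching_motion(prev_frame, curr_frame, block=16, search=7):
    """Simplified block-matching motion estimation (illustrative sketch).

    For every block in the current frame, find the displacement (dy, dx)
    within +/- `search` pixels that minimises the sum of absolute
    differences (SAD) against the previous frame.  ARPS (Nie & Ma, 2002)
    accelerates this by predicting each block's vector from the macroblock
    to its immediate left and only probing a small rood-shaped pattern.
    """
    h, w = curr_frame.shape
    rows, cols = h // block, w // block
    motion = np.zeros((rows, cols, 2), dtype=np.float32)   # (dy, dx) per block

    for r in range(rows):
        for c in range(cols):
            y, x = r * block, c * block
            cur = curr_frame[y:y + block, x:x + block].astype(np.float32)
            best_sad, best = np.inf, (0, 0)
            for dy in range(-search, search + 1):
                for dx in range(-search, search + 1):
                    yy, xx = y + dy, x + dx
                    if yy < 0 or xx < 0 or yy + block > h or xx + block > w:
                        continue
                    ref = prev_frame[yy:yy + block, xx:xx + block].astype(np.float32)
                    sad = np.abs(cur - ref).sum()
                    if sad < best_sad:
                        best_sad, best = sad, (dy, dx)
            motion[r, c] = best
    return motion
```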
The Motion Intensity map, I (Figure 4.2), which is a measure of motion-induced
activity, is computed from the motion vectors (dx, dy), normalised by the maximum
magnitude in the motion vector field:

I(i, j) = \sqrt{dx_{i,j}^2 + dy_{i,j}^2} / \text{Max of Magnitude}     (4.1)
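Given a motion vector field, Equation 4.1 reduces to a simple normalisation; a minimal sketch (array names are ours):

```python
import numpy as np

def motion_intensity(dx, dy):
    """Equation 4.1: the magnitude of each block's motion vector,
    normalised by the maximum magnitude in the motion vector field."""
    magnitude = np.sqrt(dx.astype(np.float32) ** 2 + dy.astype(np.float32) ** 2)
    max_mag = magnitude.max()
    return magnitude / max_mag if max_mag > 0 else magnitude
```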
Spatial coherency (Cs) and temporal coherency (Ct) maps were obtained by
filtering the frames with an entropy filter. The Cs maps captured regularity
at a spatial scale of 9x9 pixels within a frame, while the Ct maps captured regularity
at the same spatial scale (i.e., 9x9 pixels) but over a temporal scale of 5 frames.
Spatial coherency (see Figure 4.3) measured the consistency of pixels around the
point of interest, otherwise known as the correlation.
Figure 4.1: Spatio-Temporal saliency model architecture diagram. Bottom-up (B/U) channels compute motion intensity (I), spatial coherency (Cs) and temporal coherency (Ct), which pass through a 9-scale Gaussian pyramid and centre-surround suppression before being combined as I x Ct x (1 - I x Cs) and modulated by a face channel (Viola and Jones, 2004); the top-down (T/D) gist channel modulates the result with the fixation map of the winning scene category.
Figure 4.2: Computation of motion intensity on adjacent frames: motion vectors are computed between Frame(n-1) and Frame(n), and the motion vector field yields the motion intensity map. Three examples from different movies are shown.
The higher the correlation, the higher the probability that the pixels belong to the
same object. This is computed as the entropy over the block of pixels. Higher
entropy indicates more randomness in the block structure, implying lower correlation
among pixels and hence lower spatial coherency. Figure 4.3 shows five examples of
the spatial coherency computation using the following equation.
Cs(x, y) = -\sum_{i=1}^{n} P_s(i) \log P_s(i)     (4.2)

where P_s(i) is the probability of occurrence of the pixel intensity i and n
corresponds to the 9x9 neighbourhood.
Similarly, to compute the consistency in pixel correlation over time, we used

Ct(x, y) = -\sum_{i=1}^{n} P_t(i) \log P_t(i)     (4.3)

where P_t(i) is the probability of occurrence of the pixel intensity i at the
corresponding location in the preceding five frames (m = 5).
Figure 4.3: Examples of spatial coherency maps computed on five different movie frames.
Higher entropy implies greater motion
and thus higher saliency at that location. The temporal coherency map (see Figure
4.4) in general signifies the motion energy in each fixated frame, contributed by the
five preceding frames (except for boundary frames, where motion vectors are invalid
due to scene or camera transitions).
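A minimal, illustrative sketch of the entropy filtering behind Equations 4.2 and 4.3 is given below; the histogram binning and the assumption that frames are scaled to [0, 1] are ours, and the implementation favours clarity over speed.

```python
import numpy as np
from scipy.ndimage import generic_filter

def _entropy(values, bins=32):
    # Shannon entropy of the intensity distribution within one local window;
    # assumes intensities have been scaled to [0, 1] (our assumption).
    hist, _ = np.histogram(values, bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def spatial_coherency(frame):
    """Equation 4.2: local entropy over a 9x9 neighbourhood within one frame."""
    return generic_filter(frame.astype(np.float32), _entropy, size=9)

def temporal_coherency(frames):
    """Equation 4.3: local entropy over a 9x9 neighbourhood spanning the five
    preceding frames (`frames` has shape (5, H, W)); the map aligned with the
    most recent frame is the last slice of the filtered volume."""
    stack = np.asarray(frames, dtype=np.float32)
    filtered = generic_filter(stack, _entropy, size=(5, 9, 9))
    return filtered[-1]
```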
Once all three feature maps are computed, we apply centre-surround suppression
to these maps to highlight regions having higher spatial contrast. This is akin
to simulating the behaviour of ganglion cells in the retina (Hubel and Wiesel, 1962). To
achieve this, we first compute a dyadic Gaussian pyramid (Burt and Adelson, 1983)
for each map by repeatedly low-pass filtering and subsampling the map (see Figure
4.5). For low-pass filtering, we used a 6 x 6 separable Gaussian kernel (Walther
and Koch, 2006) defined as K = [1 5 10 10 5 1] / 32 (see Walther, 2006, Appendix A.1
for more details).
We start with level 1 (L1), which is the actual size of the map. The image for each
successive level is obtained by first low-pass filtering the image. This step results
in a blurry image with suppressed higher spatial frequencies. The resulting image
is then subsampled to half of its current size to obtain the level 2 (L2) image. The
process continues until the map cannot be further subsampled (L9 in Figure 4.5).
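A sketch of this dyadic pyramid construction with the separable kernel K = [1 5 10 10 5 1] / 32 follows; the use of scipy's separable 1-D convolution and simple stride-2 subsampling is our choice.

```python
import numpy as np
from scipy.ndimage import convolve1d

K = np.array([1, 5, 10, 10, 5, 1], dtype=np.float32) / 32.0

def gaussian_pyramid(feature_map, levels=9):
    """Dyadic Gaussian pyramid: repeatedly low-pass filter with the separable
    kernel K (rows, then columns) and subsample by a factor of two, until
    `levels` maps have been produced or the map can no longer be halved."""
    pyramid = [feature_map.astype(np.float32)]        # level 1: original size
    for _ in range(levels - 1):
        smoothed = convolve1d(pyramid[-1], K, axis=0, mode='reflect')
        smoothed = convolve1d(smoothed, K, axis=1, mode='reflect')
        sub = smoothed[::2, ::2]                      # halve each dimension
        if min(sub.shape) < 1:
            break
        pyramid.append(sub)
    return pyramid
```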
Figure 4.4: Examples of temporal coherency maps computed over the previous five frames (n-5 to n-1), shown for three different movie examples.
Figure 4.5: Example of a temporal coherency map at nine different levels of the Gaussian pyramid. Starting at level 1 (L1 in the figure), which has the same size as the original map, each successive level is obtained by low-pass filtering and then subsampling the map to half of its size at the current level.
To simulate the behaviour of centre-surround receptive fields, we take the difference
between different levels of the pyramid for a given feature map, as previously
described in Itti et al. (1998). We select different levels of the pyramid to represent
the centre C ∈ {2, 3, 4} and the surround S ∈ {C + δ}, where δ ∈ {3, 4}. This results
in six intermediate maps, as shown in Figure 4.6. To obtain point-wise differences
across scales, the images are interpolated to a common size.
All six centre-surround maps are then added across scales to obtain a
single map per feature, as shown in Figure 4.7.
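A sketch of the centre-surround step is given below: each centre level c ∈ {2, 3, 4} is differenced against the surround levels c + δ, δ ∈ {3, 4}, after resizing to a common size, and the six maps are summed. The use of absolute differences (as in Itti et al., 1998), OpenCV bilinear resizing, and the choice of the smallest centre level as the common size are our assumptions.

```python
import cv2
import numpy as np

def center_surround(pyramid, centres=(2, 3, 4), deltas=(3, 4)):
    """Across-scale point-wise differences |centre - surround| for the six
    (c, c + delta) pairs, summed into a single map per feature.  Pyramid
    levels are 1-indexed in the text, so level L is pyramid[L - 1]; all maps
    are resized to the size of the smallest centre level."""
    ref_shape = pyramid[max(centres) - 1].shape
    target = ref_shape[::-1]                      # cv2.resize expects (width, height)
    out = np.zeros(ref_shape, dtype=np.float32)
    for c in centres:
        for d in deltas:
            centre = cv2.resize(pyramid[c - 1], target, interpolation=cv2.INTER_LINEAR)
            surround = cv2.resize(pyramid[c + d - 1], target, interpolation=cv2.INTER_LINEAR)
            out += np.abs(centre - surround)
    return out
```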
All three feature maps are then combined to produce a standard saliency map:

SaliencyMap = I \times Ct \times (1 - I \times Cs)     (4.4)
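Given the three feature maps rescaled to a common size and normalised to [0, 1] (the normalisation is our assumption), Equation 4.4 is a point-wise combination:

```python
def combine_features(I, Ct, Cs):
    """Equation 4.4: SaliencyMap = I * Ct * (1 - I * Cs), computed
    element-wise on feature maps normalised to [0, 1]."""
    return I * Ct * (1.0 - I * Cs)
```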
Figure 4.6: Taking point-wise differences across scales (2-5, 2-6, 3-6, 3-7, 4-7, 4-8) results in six intermediate maps for a given feature map.
Figure 4.7: Final feature maps (motion intensity, spatial coherency and temporal coherency) obtained after adding the across-scale centre-surround differences. The top row shows the feature maps before centre-surround suppression is applied; the bottom row shows the final feature maps after applying centre-surround suppression via across-scale point-wise differences followed by summation. The example, shown for one movie frame, clearly demonstrates the effectiveness of centre-surround suppression in producing sparser feature maps.
Since higher entropy in the temporal coherency map indicates greater motion
over a particular region, the intensity maps are directly multiplied with the temporal
coherency maps. This highlights the contribution of the motion-salient regions in
the saliency maps. In contrast, higher entropy in the spatial coherency map
indicates randomness in the block structure, suggesting that the region does not belong
to any single entity or object. Since we are interested in motion saliency induced
by spatially coherent objects, we assign higher values to the pixels belonging to
the same object. In Figure 4.8 we show the standard saliency map computed for a
randomly chosen frame from each of the movies in our database.
Figure 4.8: Saliency maps shown for a randomly selected frame from every movie in the database (Animals, Cats, Matrix, BigLebowski, Galapagos, Everest, Hitler, ForbiddenCityCop, IRobot, KungFuHustle, WongFeiHong). Columns 1 and 3 show movie frames, while columns 2 and 4 show the saliency map for the corresponding frame. Higher saliency values are indicated by warmer colours, as illustrated by the colour map on the right.
4.1.2 Face Modulation
We modulate the standard saliency map with high-level semantic knowledge, such
as faces, using a state-of-the-art face detector (Viola and Jones, 2004). This accounts
for the fact that overt attention is frequently deployed to faces (Cerf et al., 2009),
and it can be argued that faces are a part of the bottom-up information, as there
are cortical areas specialized for faces, in particular the fusiform gyrus (Kanwisher
et al., 1997).
The Viola and Jones (2004) face detector is based on training a cascaded classifier
using a learning technique called AdaBoost on a set of very simple visual features.
These visual features have Haar-like properties, as they are computed by
subtracting the sum of a sub-region from the sum of the remaining region.
Figure 4.9 shows examples of Haar-like rectangle features. Panels A and B show two-
rectangle features (horizontal / vertical) and panels C and D show three-rectangle
and four-rectangle features respectively. The value of a feature is computed
by subtracting the sum of the pixel values in the white region from the sum of
the pixel values in the grey region. These Haar-like features are simple and very
efficient to compute using the integral image representation. The integral image
representation allows the computation of any rectangle sum in constant time.
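The constant-time rectangle sums follow from the integral image, in which each entry holds the sum of all pixels above and to the left, so any rectangle sum requires only four lookups. A minimal sketch (the zero-padded border is a convenience of ours):

```python
import numpy as np

def integral_image(img):
    """Integral image with a leading row/column of zeros so that rectangle
    sums need no boundary checks."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def rect_sum(ii, top, left, height, width):
    """Sum of pixels in the rectangle [top:top+height, left:left+width],
    using four lookups regardless of the rectangle size."""
    return (ii[top + height, left + width] - ii[top, left + width]
            - ii[top + height, left] + ii[top, left])
```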
The Haar-like features are extracted over a 24 x 24 pixel sub-window, resulting
in thousands of features per image. The goal here is to construct a strong classifier
by selecting a small number of discriminant features from the limited set of labelled
training images. This is achieved by employing the AdaBoost technique to learn a
cascade of weak classifiers. Each weak classifier in the cascade is trained on a single
feature. The term weak signifies the fact that no single classifier in the cascade
can classify all the examples accurately. For each round of boosting, the AdaBoost
method selects the weak classifier with the lowest error rate, controlled by the
desired hit and miss rates. This is followed by the re-assignment of the weights to
emphasize the examples with poor classification for the next round.
Figure 4.9: Example of four basic rectangular features, as shown in the Viola and Jones
(2004) IJCV paper. Panels A and B show two-rectangle features, while panels C and D
show three-rectangle and four-rectangle features. Panel E shows an example of two features
overlaid on a face image. The first feature is a two-rectangle feature measuring the
difference between the eye and upper-cheek regions, while the second feature, a three-
rectangle feature, measures the difference between the eye region and the upper-nose region.
The AdaBoost method is regarded as a greedy algorithm, since it assigns a large
weight to every good feature and a small weight to poor features. The final strong
classifier is then a weighted combination of the weak classifiers.
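The sketch below illustrates the boosting loop on single-feature threshold classifiers (decision stumps); it shows the weight re-assignment described above rather than the exact Viola-Jones cascade training with hit/miss-rate targets, and all names are illustrative.

```python
import numpy as np

def adaboost_stumps(X, y, rounds=10):
    """Discrete AdaBoost over decision stumps.  X is (n_samples, n_features),
    y is +/-1.  Each round selects the single-feature threshold classifier
    with the lowest weighted error, then re-weights the examples so that
    misclassified ones count more in the next round."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)
    strong = []                                 # list of (alpha, feature, thresh, sign)
    for _ in range(rounds):
        best = None
        for f in range(d):
            for thresh in np.unique(X[:, f]):
                for sign in (1, -1):
                    pred = np.where(sign * (X[:, f] - thresh) > 0, 1, -1)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, f, thresh, sign, pred)
        err, f, thresh, sign, pred = best
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)   # weight of this weak classifier
        w *= np.exp(-alpha * y * pred)          # emphasise misclassified examples
        w /= w.sum()
        strong.append((alpha, f, thresh, sign))
    return strong

def adaboost_predict(strong, X):
    """Final strong classifier: sign of the weighted vote of the weak classifiers."""
    score = sum(a * np.where(s * (X[:, f] - t) > 0, 1, -1) for a, f, t, s in strong)
    return np.sign(score)
```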
Panel E of Figure 4.9 demonstrates how the selected features reflect useful
properties of the image. The example shows a two-feature classifier (the top row shows
the two selected features) trained over 507 faces. The first feature measures the
difference in luminance between the eye region and the upper cheeks.
The second selected feature measures the difference in luminance between
the eye region and the nose bridge. An intuitive rationale behind the selection of
these features is that the eye region is generally darker than the skin region.
Previous findings on static images suggest that people look at face components
(eyes, mouth, and nose) preferentially, with the eyes being given more preference
than the other components (Buswell, 1935; Yarbus, 1967; Langton et al., 2000;
Birmingham and Kingstone, 2009). However, a recent study (Võ et al., 2012) on
gaze allocation in dynamic scenes suggests that eyes are not fixated preferentially.
Võ et al. (2012) showed that the percentage of overall gaze distribution is not
significantly different for any of the face components for vocal scenes. However,
for mute scenes, they did find a significant drop in gaze distribution for the mouth
compared to the eyes and nose. In fact, the nose was given priority over the eyes
regardless of whether the person in the video made eye contact with the camera
or not, although these differences were found to be insignificant.
To detect faces in our video database, we used trained classifiers from the
OpenCV library (Bradski and Pisarevsky, 2000). The classifier detects a face
region and returns a bounding box encompassing the complete face. This is
followed by convolving the face region with a Gaussian whose size equals the
width of the box and whose peak value lies at the centre of the box. This automatically
assigns the highest feature value to the nose compared to the other face components.
Figure 4.10 shows the process of face modulation for an example frame from the movie
"The Matrix" (1999). Note that the bottom right corner highlights the salient regions
in the movie frame by overlaying the face-modulated saliency map on the movie
frame.
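A sketch of this face modulation step using OpenCV's pretrained frontal-face cascade is shown below; the particular cascade file, the Gaussian sigma, and the use of a point-wise maximum to merge the face bump into the saliency map are our assumptions, since the text only specifies the kernel width and peak location.

```python
import cv2
import numpy as np

def face_modulate(frame_bgr, saliency):
    """Detect faces with a Viola-Jones cascade and boost the saliency map
    inside each face bounding box with a Gaussian peaking at the box centre.
    `saliency` is assumed to have the same resolution as the frame."""
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

    out = saliency.astype(np.float32).copy()
    for (x, y, w, h) in faces:
        # Separable Gaussian sized to the box; sigma = w / 4 is our assumption.
        gx = cv2.getGaussianKernel(int(w), w / 4.0)
        gy = cv2.getGaussianKernel(int(h), w / 4.0)
        bump = gy @ gx.T
        bump /= bump.max()                      # peak value 1 at the box centre
        out[y:y + h, x:x + w] = np.maximum(out[y:y + h, x:x + w], bump)
    return out
```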
Figure 4.10: Example of saliency map modulation with a detected face region of interest (ROI). The top left panel shows the original movie frame with the face ROI bounding box. Subsequent panels show how the face modulation is applied to the spatio-temporal saliency map. The bottom right panel overlays the face-modulated saliency map on the movie frame, signifying hot spots in the frame.
4.1.3 Gist Modulation
We investigated an improvement to the bottom-up influenced spatio-temporal
saliency model by incorporating top-down semantics of the scene. Our hypothesis is
that variability in eye movement patterns for different scene categories (O'Connell
and Walther, 2012) can help in improving saliency prediction for the early fixations.
Earlier research has shown the influence of scene context in guiding
visual attention (Neider and Zelinsky, 2006; Chen et al., 2006). In Neider and
Zelinsky (2006), scene-constrained targets were found faster, with a higher
percentage of initial saccades directed to target-consistent scene regions. Moreover,
they found that contextual guidance biases eye movements towards target-consistent
regions (Navalpakkam and Itti, 2005) as opposed to excluding target-inconsistent
scene regions (Desimone and Duncan, 1995). Chen et al. (2006) showed that in the
presence of both top-down (scene preview) and bottom-up (colour singleton) cues,
top-down information prevails in guiding eye movements. They observed faster
manual reaction times, and more initial saccades were made to the target location.
In comparison, the colour singleton attracted attention only in the absence of a
scene preview.
Currently, there are different ways to compute the gist descriptor of an image
(Oliva and Torralba, 2001; Renninger and Malik, 2004; Siagian and Itti, 2007;
Torralba et al., 2003).
In the framework proposed by Oliva and Torralba (2001) for the purpose
of scene classification, an input image is subdivided into 4x4 equal-sized, non-
overlapping segments. A magnitude spectrum of the windowed Fourier transform
is then computed over each of these segments. This is followed by feature
dimension reduction using Principal Component Analysis.
Siagian and Itti (2007) computed the gist descriptor from the hierarchical model
of Itti et al. (1998). A 4x4 non-overlapping grid is placed over 34 sub-channels from
colour, orientation and intensity. An average value over each grid box is computed,
yielding a 16-dimensional vector per sub-channel. The resulting 544 raw gist values
are then reduced using PCA/ICA to an 80-dimensional gist descriptor or feature
vector. Subsequently, the scene classification is done using neural networks.

In experiments conducted by Renninger and Malik (2004), subjects
were asked to identify scenes to which they were exposed very briefly (<70 ms).
They found that the probability of correctly classifying the scene was always above
chance and improved with longer exposure durations. They subsequently proposed a
simple model based on texture analysis and showed its usefulness for scene
categorization. The model applied Gabor filters to an input scene and extracted 100
universal textons, selected from training using K-means clustering. Textons were
first proposed by Julesz (1981, 1986) as a first-order statistic determining
the strength of a texture relative to its surrounding texture. The gist vector
is then computed as a histogram of the universal textons. A new scene is later
identified by matching its texton histogram with the learned examples.
Torralba and colleagues (Torralba et al., 2003) proposed a model using wavelet
image decomposition, tuned to 6 orientations and 4 scales. A raw 384-dimensional
gist feature vector is then computed by averaging the filter responses (24
filter responses in total) over a 4x4 grid. Dimensionality reduction is applied, using
PCA, to reduce the gist descriptor to only 80 dimensions. Finally, classification
is done by finding the minimum Euclidean distance between the gist descriptor of
the input image and the learned examples.
Usually, gist descriptors are computed for manually labelled scenes (i.e., scenes
with a category label) (Oliva and Torralba, 2001). In a recent study, Fei-Fei and Perona
(2005) attempted to learn a scene's global representation using a distribution of
codewords. Codewords were represented by image patches randomly sampled from
650 training examples over 13 scene categories. A codebook of codewords was
learned using the k-means algorithm. Subsequently, theme models were formulated
using the best distribution of the learned codewords. Finally, each scene category
model was based on a mixture of theme models and codewords.
The overall gist modulation process can be described as follows. In total, we had
2300 scenes across our entire movie database. We used a 50-50 ratio to formulate
our training and test sets. To discover scene categories, we picked the first frame from
each scene as a representative frame of that scene. The rationale behind picking
the first frame is that the gist of a scene can be extracted within a glance (Oliva,
2005; Loschky and Larson, 2008). This makes the first frame of each scene a
good candidate for the computation of the gist descriptor. Furthermore, the overall
semantics within a scene did not vary much across all the frames of a given scene;
e.g., a fight sequence (indoor scene) remained indoor throughout the progression
of the scene. In the training phase (see Figure 4.11), the gist descriptors are used
to discover scene categories using an unsupervised clustering algorithm (Harris
et al., 2000). Each discovered cluster corresponds to one scene category. Once
the scene categories are identified, we formulate fixation maps (later referred to as
categorical fixation maps) for each of these categories, using only the early fixations in
the training set. The early fixations are defined as fixations ∈ {2, 3, 4, 5}. In the
test phase, we first classify a given movie scene based on its gist descriptor and
subsequently modulate the saliency maps, for early-fixation frames, using the categorical
fixation maps.
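A high-level sketch of this training/test procedure is given below. We substitute scikit-learn's Gaussian mixture clustering (with BIC model selection) for the unsupervised algorithm of Harris et al. (2000), and modulate by point-wise multiplication; both substitutions are our assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gist_categories(gist_train, fixmaps_train, max_categories=4):
    """Cluster training-scene gist descriptors into scene categories and build
    one categorical fixation map per cluster by averaging the early-fixation
    maps (fixations 2-5) of its member scenes."""
    best, best_bic = None, np.inf
    for k in range(2, max_categories + 1):          # model selection via BIC (our choice)
        gmm = GaussianMixture(n_components=k, random_state=0).fit(gist_train)
        bic = gmm.bic(gist_train)
        if bic < best_bic:
            best, best_bic = gmm, bic
    labels = best.predict(gist_train)
    cat_maps = {c: np.mean([fm for fm, l in zip(fixmaps_train, labels) if l == c], axis=0)
                for c in np.unique(labels)}
    return best, cat_maps

def gist_modulate(saliency, gist_descriptor, gmm, cat_maps):
    """Classify a test scene by its gist descriptor and modulate its
    early-fixation saliency maps with that category's fixation map."""
    category = gmm.predict(gist_descriptor.reshape(1, -1))[0]
    return saliency * cat_maps[category]
```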
4.1.3.1 Simulations
To test the robustness of the proposed method of top-down gist modulation, we ran
1000 simulations. One thousand random permutations of the 2300 frames (the first
frame from each scene of a movie) were generated. These permutations were then
bisected to form training and test sets. This was followed by clustering of the training
frames and the subsequent creation of categorical fixation maps for every simulation.
Finally, the modulation of the saliency maps of the test frames was carried out for each
simulation. It is worth mentioning that the computation of the gist descriptors and
the face-modulated saliency maps was carried out once and reused for every
simulation of the gist modulation.
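A minimal sketch of the permutation scheme (the seed and function name are ours):

```python
import numpy as np

def make_splits(n_scenes=2300, n_sims=1000, seed=0):
    """Generate the train/test index pairs for each simulation: a random
    permutation of all first frames, bisected 50/50."""
    rng = np.random.default_rng(seed)
    splits = []
    for _ in range(n_sims):
        order = rng.permutation(n_scenes)
        splits.append((order[:n_scenes // 2], order[n_scenes // 2:]))
    return splits
```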
4.1.3.2 Gist Descriptor
To build a gist descriptor, we used 32 multi-scale oriented filters (Torralba, 2003a;
Oliva et al., 2006), encompassing 4 different scales and 8 different orientations (see
Figure 4.11). This operation resulted in 32 response images for each training
frame. These response images represented the statistical configuration of orientations
and spatial frequencies present in the movie frame. Since the dimensionality of the
image space is very high, any further processing would have been computationally
very expensive. To address this issue, the response images were first segmented
into N x N non-overlapping regions. Each region was then represented by its mean
value. This operation resulted in a 1x32-dimensional vector per region across the
response images. To see whether the choice of N had any effect on the final results,
we experimented with three different values of N ∈ {2, 3, 4}. To further reduce the
dimensionality of the feature space, we used principal component analysis (PCA).
PCA was performed on each region, reducing the number of dimensions from 32
to 2 by using the 2nd and 3rd principal components. We avoided selecting
the 1st principal component since we found that all the filters contributed equally
to it. However, for the 2nd and 3rd principal components, we found discriminative
power in the 32 filter responses, with some filters contributing less than others.
Examples of the gist descriptor for different choices of N are shown in Figure 4.12.
We also show a Fourier signature of the gist descriptor, in the original space and in
the reduced-dimension space. As shown, dimensionality reduction did not have any
evident effect on the gist descriptor's Fourier signature.
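A sketch of this descriptor computation follows: each frame is filtered with 32 oriented Gabor filters (4 scales x 8 orientations), each response is averaged over an N x N grid of regions, and PCA keeps the 2nd and 3rd principal components per region. The Gabor parameter values and OpenCV's getGaborKernel (used in place of the filters of Torralba, 2003a) are our assumptions.

```python
import cv2
import numpy as np
from sklearn.decomposition import PCA

def gist_descriptors(frames_gray, N=4):
    """Gist descriptors for a set of grayscale frames: 32 Gabor responses,
    averaged over an N x N grid, reduced to PCs 2 and 3 per region."""
    n_frames = len(frames_gray)
    feats = np.zeros((n_frames, N * N, 32), dtype=np.float32)
    scales = [5, 9, 17, 33]                      # Gabor kernel sizes (assumption)
    thetas = np.arange(8) * np.pi / 8            # 8 orientations
    for i, frame in enumerate(frames_gray):
        h, w = frame.shape
        k = 0
        for ksize in scales:
            for theta in thetas:
                # getGaborKernel(ksize, sigma, theta, lambda, gamma, psi)
                kern = cv2.getGaborKernel((ksize, ksize), ksize / 4.0, theta,
                                          ksize / 2.0, 0.5, 0)
                resp = np.abs(cv2.filter2D(frame.astype(np.float32), -1, kern))
                for r in range(N):               # mean response per grid region
                    for c in range(N):
                        block = resp[r * h // N:(r + 1) * h // N,
                                     c * w // N:(c + 1) * w // N]
                        feats[i, r * N + c, k] = block.mean()
                k += 1
    # PCA per region: keep principal components 2 and 3 (indices 1 and 2)
    reduced = np.zeros((n_frames, N * N, 2), dtype=np.float32)
    for region in range(N * N):
        pcs = PCA(n_components=3).fit_transform(feats[:, region, :])
        reduced[:, region, :] = pcs[:, 1:3]
    return reduced.reshape(n_frames, -1)         # (n_frames, N*N*2) gist vectors
```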
4.1.3.3 Learning Scene Categories from Training Data
The reduced-dimensional training data was subjected to an unsupervised clustering
algorithm (Harris et al., 2000) to discover the scene categories. Since the training
data had more than 3 dimensions, visualizing the feature space in all dimensions
at once was not possible. In Figure 4.14, we show the feature space
for N = 2 by projecting the data onto a two-dimensional Cartesian coordinate system,
one pair of dimensions at a time, thus showing all the possible combinations of the
available dimensions. This particular example was taken from simulation 927, for
which we found two scene categories. In general, we found 2 to 4 scene categories
in our training data for the different choices of N ∈ {2, 3, 4}, over 1000 simulations.
Additional examples of such cases are shown in Figure 4.15. The small number of
scene categories we found may be the result of limited variety in our movie dataset.
In general, our movies
Figure 4.11: A flowchart of the training process for discovering scene categories. In total, 1150 examples are used to compute the gist descriptors (frame segmentation into 2x2, 3x3 or 4x4 regions, spatially oriented filtering, and region means), followed by dimensionality reduction using PCA and unsupervised clustering using a mixture of Gaussians; a categorical fixation map is created for each discovered cluster.