VISUAL ATTENTION IN DYNAMIC NATURAL SCENES 5

Chapter 5
Conclusion and Future Work
5.1 Thesis Conclusion
The main aim of this research work, Visual Attention in Dynamic Natural Scenes,
was to discover the mechanisms governing the deployment of attention in
response to natural visual stimuli. Humans can effortlessly perceive and respond
to the natural environment using the principles of selective attention and divided
attention. An attention deployment mechanism is therefore critical to the successful
design of an artificial system capable of performing at a human level of efficiency.
However, since attention can be deployed through multiple senses at a time, we
constrained our study to the visual modality. We are more interested in how the
visual system deploys attention. Which properties of the complex natural environment
(higher- or lower-order image statistics) attract attention, and which properties, even
though perceived, are ignored by the brain? Is vision-based attention primarily
visio-sensory driven, prior world knowledge driven, or both? What prompts active
scanning of the scene (i.e., the need to move gaze from the currently fixated location
to a new location)? How does the visual system react to a global scene change?
To answer these questions we studied the temporal (fixation duration) and spatial
(fixation location) properties of visual attention.
In investigating temporal properties we looked at how fixation durations varied
in response to movie scene transitions. We focused our analysis on the fixation ending
before the scene transition (last fixation), the ongoing fixation at the time of the
scene transition (cross-over fixation), and the first fixation after the scene transition.
In general we found that the first fixation was shorter in duration than the last
fixation, and that the cross-over fixation was longer in duration than both the last and
the first fixation. We further profiled all the first fixations and looked at the changes
in duration with progression over the first 1600 msec. We found that first fixation
duration varied with latency from the scene transition onset. Therefore we separated
first fixations into two sub-populations: early starting fixations and late starting
fixations. We found the largest difference between the early starting and late starting
fixations at approximately 180 - 220 milliseconds relative to the onset of the transition.
Subsequently we formulated early and late sets for cross-over and last fixations
corresponding to the early and late first fixations. As expected, we did not observe any
difference in duration between the early and late sets of the last fixation. This was
because subjects could not anticipate the incoming scene transition. However, we did
find differences in duration between the early and late sets of the cross-over fixation.
This gave rise to the saccade programming hypothesis. If the onset of the new scene
occurred in the saccade programming phase of the cross-over fixation, then it was too
late for it to influence the subsequent saccade, resulting in a first fixation at an
unintended location in the new scene. However, if the scene onset occurred before the
saccade programming phase, then the visual system had time to analyse the new scene and
subsequently initiate the saccade to the intended first fixation location in the new
scene. In conclusion, these results favoured the process monitoring mechanism (Henderson
and Smith, 2009). Any global visual change appeared not only to affect the length of the
ongoing cross-over fixation (immediate control), but also to shorten the duration of
the fixation immediately following that change (delayed control).
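The fixation roles used in this analysis can be illustrated with a minimal sketch. The Fixation class, the function names and the 200 ms split point are assumptions made for illustration only (the text reports the largest early/late difference at roughly 180 - 220 ms but does not state the exact threshold or the analysis code that was used).

```python
from dataclasses import dataclass


@dataclass
class Fixation:
    """A single fixation with onset and offset times in milliseconds."""
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def label_fixations(fixations, cut_ms):
    """Return (last, cross_over, first) fixations around a scene cut.

    last:       latest fixation that ends at or before the cut
    cross_over: fixation that starts before the cut and ends after it
    first:      earliest fixation that starts at or after the cut
    Any role that cannot be found is returned as None.
    """
    last = cross_over = first = None
    for fix in sorted(fixations, key=lambda f: f.start_ms):
        if fix.end_ms <= cut_ms:
            last = fix
        elif fix.start_ms < cut_ms:
            cross_over = fix
        elif first is None:
            first = fix
    return last, cross_over, first


def latency_group(first_fixation, cut_ms, threshold_ms=200.0):
    """Assign a first fixation to the early or late set by its onset latency.

    threshold_ms is a hypothetical split point, not a value from the thesis.
    """
    latency = first_fixation.start_ms - cut_ms
    return "early" if latency < threshold_ms else "late"
```

With a cut at 600 ms and fixations spanning 0-300, 350-700 and 750-900 ms, the three fixations would be labelled last, cross-over and first respectively, and the first fixation (latency 150 ms) would fall in the early set under the assumed threshold.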
In another analysis we investigated differences in fixation duration in the context
of visual correlates. Our analysis showed that for early ending cross-over fixations
there was a significantly large change in local luminance and contrast at the cross-over
fixation location before and after the scene transition. Although the changes in local
luminance and contrast were significant for the late set as well, they
were still small compared to the early set. These results suggest that despite the large
changes in a scene transition, the visual system analyzes the local information in
the new scene and, if the contrast at the new location is sufficiently high, appears
content to continue the fixation rather than picking a new location. These
results show that fixations were under the influence of moment-to-moment visual
and cognitive analysis.
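A minimal sketch of how such a local change could be measured is given below. It assumes greyscale frames stored as nested lists, a square window around the fixated pixel, mean grey level as local luminance, and the standard deviation of grey levels as (unnormalised RMS) contrast. The window size and these definitions are assumptions; the exact measures used in the thesis are not stated in this chapter.

```python
import math


def patch_values(frame, row, col, radius):
    """Collect grey levels in a square window centred on (row, col), clipped to the frame."""
    rows = range(max(0, row - radius), min(len(frame), row + radius + 1))
    cols = range(max(0, col - radius), min(len(frame[0]), col + radius + 1))
    return [frame[r][c] for r in rows for c in cols]


def local_luminance_and_contrast(frame, row, col, radius=1):
    """Return (mean grey level, standard deviation of grey levels) in the window."""
    values = patch_values(frame, row, col, radius)
    mean = sum(values) / len(values)
    rms = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return mean, rms


def change_across_cut(frame_before, frame_after, row, col, radius=1):
    """Absolute change in local luminance and contrast at one fixated location."""
    lum_b, con_b = local_luminance_and_contrast(frame_before, row, col, radius)
    lum_a, con_a = local_luminance_and_contrast(frame_after, row, col, radius)
    return abs(lum_a - lum_b), abs(con_a - con_b)
```

Here frame_before and frame_after would be the last frame of the old scene and the first frame of the new scene, and (row, col) the location of the cross-over fixation.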

In a separate analysis we attempted to quantify the centre bias by comparing
fixation durations and fixation distance to the frame centre. Such a strategy
potentially removed systematic biases in the quantification process, such as
inconsistent ranking among subjects and the selection of the saliency model
(Tseng et al., 2009). Moreover, the inferred results were purely based on behavioural
data. Our analysis showed a significant negative correlation between fixation duration
and fixation distance to the centre of the frame. Moreover, the strongest correlation
was observed at the time of the scene transition. In summary, our results support the
hypothesis that centre bias is largely due to the photographer's bias of placing
interesting things near the centre.
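The relation between fixation duration and distance to the frame centre can be sketched as follows. The use of Euclidean pixel distance and the Pearson coefficient are assumptions; the chapter reports a negative correlation without naming the statistic.

```python
import math


def distance_to_centre(x, y, width, height):
    """Euclidean distance in pixels from a fixation at (x, y) to the frame centre."""
    return math.hypot(x - width / 2.0, y - height / 2.0)


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)
```

A negative value of pearson_r(distances, durations) would correspond to the pattern described above: fixations closer to the centre tend to last longer.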
In investigating spatial properties of visual attention we proposed a computational
model that combined bottom-up stimulus features with top-down contextual/gist
modulation for natural and dynamic scenes. The scene-specific gist
modulation was learned using unlabeled training data of 1150 movie scenes. The
learning process was completely unsupervised, in contrast to previous approaches that made use
of labeled training data (Fei-Fei and Perona, 2005; Oliva et al., 2006). Moreover, the
modulation was demonstrated for a free viewing task in dynamic scenes, as compared to a
search task in static images (Torralba et al., 2006). To show the robustness of the
proposed scene category learning and subsequent modulation method we ran 1000
simulations, each having a unique training and test data permutation. The model's
performance was tested on early fixations from 32 subjects and it demonstrated
comparable performance (AUC = 0.9 and KL divergence = 1.6) to well-known
models of visual attention.
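The two scores mentioned here can be illustrated with a small sketch. It assumes one common formulation: saliency values sampled at fixated locations are compared with values sampled at control (for example random) locations, AUC is the probability that a fixated value exceeds a control value, and the KL divergence is taken between the two histograms of saliency values. The exact variants, bin counts and control sampling used in the thesis are not given in this chapter, so every parameter below is an assumption.

```python
import math


def auc(fixated, control):
    """Probability that a fixated saliency value exceeds a control value (ties count half)."""
    wins = 0.0
    for p in fixated:
        for n in control:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(fixated) * len(control))


def histogram(values, bins, lo=0.0, hi=1.0):
    """Count values into equally wide bins on [lo, hi]; out-of-range values are clamped."""
    counts = [0] * bins
    for v in values:
        idx = int((v - lo) / (hi - lo) * bins)
        counts[max(0, min(idx, bins - 1))] += 1
    return counts


def kl_divergence(p_counts, q_counts, eps=1e-9):
    """KL divergence D(P || Q) in nats between two histograms, smoothed by eps."""
    p_total = sum(p_counts) + eps * len(p_counts)
    q_total = sum(q_counts) + eps * len(q_counts)
    total = 0.0
    for pc, qc in zip(p_counts, q_counts):
        p = (pc + eps) / p_total
        q = (qc + eps) / q_total
        total += p * math.log(p / q)
    return total
```

Under this formulation an AUC of 0.5 and a KL divergence of 0 would indicate a model that does not separate fixated from control locations, while larger values indicate better separation.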
The critical part was the unique way of modulating the movie scenes, using
human fixation maps. These maps had an inherent centre bias, which
could potentially undermine the improved performance of the model. However,
we rigorously tested it against control conditions such as modulating with a fixation
map averaged across the scene categories and with fixation maps of the wrong scene
category. Our analysis showed that with modulation by the correct scene-specific
fixation maps the model's performance is significantly improved over modulations using
the average fixation map and incorrect scene fixation maps. These results suggest that
gist plays a significant role in guiding visual attention by mediating early fixation
behavior.
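The correct, average and wrong-category conditions can be sketched as below. The sketch assumes that modulation is a pointwise product of the bottom-up saliency map with a fixation map, followed by normalisation to unit sum. This operator, the function names and the category names in the usage note are assumptions made for illustration; the chapter does not specify how the modulation is computed.

```python
def normalise(grid):
    """Scale a 2-D map so its entries sum to 1 (an all-zero map is returned unchanged)."""
    total = sum(sum(row) for row in grid)
    if total == 0:
        return [list(row) for row in grid]
    return [[v / total for v in row] for row in grid]


def modulate(saliency, fixation_map):
    """Pointwise product of a saliency map and a fixation map, renormalised."""
    product = [[s * f for s, f in zip(s_row, f_row)]
               for s_row, f_row in zip(saliency, fixation_map)]
    return normalise(product)


def average_map(maps):
    """Element-wise mean of several equally sized maps."""
    n = len(maps)
    rows, cols = len(maps[0]), len(maps[0][0])
    return [[sum(m[r][c] for m in maps) / n for c in range(cols)] for r in range(rows)]


def control_conditions(saliency, category_maps, correct):
    """Modulate one saliency map with the correct, the averaged and a wrong category map."""
    wrong = next(name for name in category_maps if name != correct)
    return {
        "correct": modulate(saliency, category_maps[correct]),
        "average": modulate(saliency, average_map(list(category_maps.values()))),
        "wrong": modulate(saliency, category_maps[wrong]),
    }
```

For example, control_conditions(saliency, {"indoor": map_a, "outdoor": map_b}, "indoor") would return the three modulated maps, each of which could then be scored against the recorded early fixations.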
Another important factor was the unsupervised learning of movie scene categories.
In past research the scenes were either manually ranked or labeled (Oliva and
Torralba, 2001) or the main category label had to be specified to learn the theme
model (Fei-Fei and Perona, 2005). In contrast, we used unsupervised clustering to
discover the latent scene categories in our movies. The clustering was performed
for a thousand unique permutations of the training data, resulting in a minimum of two
and a maximum of four scene categories over all the simulations.
5.2 Future Work
In the current experiments scene onset was not controlled and mostly occurred in a
fixation with random latency. It would be interesting to extend the paradigm
by including scene transitions during saccades and controlling for the transition
onset latency in both fixations and saccades. This would enable us to get a more
complete picture of the effects of global and local visual change on subsequent
fixation behaviour. Another potential direction is to propose a model of saccade
programming that takes into account the variability observed in fixation duration
for dynamic scenes and to integrate it with the proposed saliency model.
Regarding improvement of the computational model there are several interesting
avenues to explore. Does gist play a role in the later part of the scene by influencing
late fixation behavior? The current model showed improved results for early fixations,
thus indicating a significant role of gist in early scene processing (Võ and Henderson,
2010), but is it possible to quantify the role of gist over time as the scene progresses?
What other features can be used to formulate an improved gist descriptor? The current
model uses a spatial frequency signature to build a gist descriptor (Oliva et al., 2006).
However, there are plenty of other methods to compute gist as well (Renninger
and Malik, 2004; Siagian and Itti, 2007). How can the formulation of
fixation maps for gist-dependent modulation be improved? Currently each scene in the
cluster contributes equally towards the formulation of the fixation map for subsequent
gist modulation. One way to improve upon this could be to formulate fixation maps for
each cluster using only the top 10% to 15% of scenes of the respective clusters. Another
approach is to weigh the contribution of each scene: a higher weight is assigned
to a scene if it is found closer to the cluster centroid, and vice versa.
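Both proposals can be sketched under stated assumptions. The sketch assumes that each scene has a gist descriptor vector and a fixation map, that "top" scenes are those closest to the cluster centroid, and that weights fall off as 1 / (1 + distance). The ranking criterion and the weighting function are hypothetical choices, since the text only requires that closer scenes receive higher weight.

```python
import math


def centroid_distance(descriptor, centroid):
    """Euclidean distance between a scene's gist descriptor and its cluster centroid."""
    return math.sqrt(sum((d - c) ** 2 for d, c in zip(descriptor, centroid)))


def scene_weights(distances):
    """Weights that decrease with centroid distance and sum to 1 (assumed form 1 / (1 + d))."""
    raw = [1.0 / (1.0 + d) for d in distances]
    total = sum(raw)
    return [w / total for w in raw]


def weighted_fixation_map(maps, weights):
    """Weighted element-wise sum of the per-scene fixation maps of one cluster."""
    rows, cols = len(maps[0]), len(maps[0][0])
    return [[sum(w * m[r][c] for w, m in zip(weights, maps)) for c in range(cols)]
            for r in range(rows)]


def top_fraction(distances, fraction=0.15):
    """Indices of the scenes closest to the centroid, keeping at least one scene."""
    keep = max(1, round(len(distances) * fraction))
    order = sorted(range(len(distances)), key=lambda i: distances[i])
    return order[:keep]
```

The equal-contribution scheme currently used corresponds to passing identical weights to weighted_fixation_map, so either proposal could be compared against it on the same clusters.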