
Multimed Tools Appl
DOI 10.1007/s11042-013-1826-9

Building 3D event logs for video investigation
Trung Kien Dang · Marcel Worring · The Duy Bui

T. K. Dang (✉) · M. Worring
University of Amsterdam, Amsterdam, The Netherlands

T. K. Dang · T. D. Bui
University of Engineering and Technology, Vietnam National University Hanoi, Hanoi, Vietnam

© Springer Science+Business Media New York 2014

Abstract In scene investigation, creating a video log captured using a handheld camera is
more convenient and more complete than taking photos and notes. By introducing video
analysis and computer vision techniques, it is possible to build a spatio-temporal representation of the investigation. Such a representation gives a better overview than a set of photos
and makes an investigation more accessible. We develop such methods and present an interface for navigating the result. The processing includes (i) segmenting a log into events using
novel structure and motion features making the log easier to access in the time dimension,
and (ii) mapping video frames to a 3D model of the scene so the log can be navigated in
space. Our results show that, using our proposed features, we can recognize more than 70
percent of all frames correctly, and more importantly find all the events. From there we
provide a method to semi-interactively map those events to a 3D model of the scene. With
this we can map more than 80 percent of the events. The result is a 3D event log that captures the investigation and supports applications such as revisiting the scene, examining the
investigation itself, or hypothesis testing.
Keywords Scene investigation · Video analysis · Story navigation · 3D model

1 Introduction
The increasing availability of cameras and the reduced cost of storage have encouraged
people to use images and videos in many aspects of their lives. Instead of writing a diary,
nowadays many people capture their daily activities in video logs. When such capturing is
continuous this is known as "life logging". This idea goes back to Vannevar Bush's Memex
device [5] and is still a topic of active research [2, 8, 33]. Similarly, professional activities
can be recorded with video to create logs. For example, in home safety assessment, an
investigator can walk around, examine a house and record speech notes when finding a construction issue. Another interesting professional application is crime scene investigation.
Instead of purposely taking photos and writing notes while looking for evidence, investigators
wear a head-mounted camera and focus on finding the evidence, while everything is automatically recorded in a log. These professional applications all share a similar setup, namely
a first person view video log recorded in a typically static scene. In this paper, we focus on
this group of professional logging applications, which we denote by scene investigation.
Our proposed scene investigation framework includes three phases: capturing, processing, and reviewing.
In the capturing phase, an investigator records the scene and all objects of interest it contains using various media including photos, videos, and speech. The capturing is a complex
process in which the investigator performs several actions to record different aspects of the
scene. In particular, the investigator records the overall scene to get an overview, walks
around to search for objects of interest and then examines those objects in detail. Together
these actions form the events of the capturing process. In the processing phase, the system analyzes all data to get information about the scene, the objects, as well as the events.
Later, in the reviewing phase, an investigator uses the collected information to perform various tasks: assessing the evidence, getting an overview of the case, measuring specific scene
characteristics, or evaluating hypotheses.
In common investigation practice, experts take photos of the scene and objects they find
important and add hand-written notes to them. This standard way of recording does not
provide sufficient basis for processing and reviewing for a number of reasons. A collection
of photos cannot give a good overview of the scene. Thus, it is hard to understand the
relation between objects. In some cases, investigators use the pictures to create a panorama
to get a better overview [4]. But since the viewpoint is fixed for each panorama, it does
not give a good spatial impression. Measuring characteristics of the scene is also not easy
with photos if the investigator has not planned for it in advance. More complicated tasks,
like making a hypothesis on how the suspect moved, are very difficult to perform using
a collection of photos due to the lack of sense of space. Finally, a collection of photos
and notes hardly captures investigation events, which are important for understanding the investigation process itself.
The scene can also be captured with the aim to create 3D models. 3D models make
discussion easier, hypothesis assessment more accurate, and court presentation much
clearer [11, 14]. Combining 3D models with video logs could enhance scene investigation further.
Using a video log to capture the investigation is straightforward in the capturing phase.
Instead of taking photos and notes, investigators film the scene with a camera. All moves
and observations of investigators are thus recorded in video logs. However, to reap this
benefit, it is crucial to have a method to extract information and events from the logs
for reviewing (Fig. 1). For example, a 3D model helps visualize the spatial relation between
events; while details of certain parts of the scene can be checked by reviewing events captured in these parts. Together, a 3D model of the scene and log events form what we call a
3D event log of the case. Such a log enables event-based navigation in 3D, and other applications such as knowledge mining of expert moves, or finding correlation among cases. The
question is how to obtain the required information and how to combine it.
3D models of a scene can be built in various ways [24, 28, 30, 35]. In this work, we
use 3D models reconstructed using a semi-interactive method that builds the model from


Fig. 1 An investigation log is a series of investigation events (like (a) taking an overview, (b) searching, (c) getting details, or (d) examining) within a scene. When reviewing the investigation, it is helpful if an analysis can point out which events happened and where in the scene they took place

panoramas [6]. These models are constructed prior to the work presented in this paper.

Here we focus on analyzing investigation logs to find events and connecting them to the 3D model.



Analyzing an investigation log is different from regular video analysis. The common
target in video analysis is to determine the content of a shot, while in an investigation
log analysis we already know the content (the scene) and the purpose (investigation). The
investigation events we want to detect arise from both content and the intentions of the
cameraman. For example, when an investigator captures an object of interest she will walk
around the object and zoom-in on it. While the general context and purpose of the log is
known, these events are not easy to recognize, as the mapping from intention to scene movement is not well defined and to some extent depends on the investigator. To identify events, we need features that consider both content and camera movement.
Once we have the events identified in the logs we need to map the events to the 3D model
of the scene. If the data were high-quality imagery, accurate matching would be
possible [19, 31]. Video frames of investigation logs, however, are not optimal for matching
as they suffer from intensity noise and motion blur. This hinders the performance of familiar
automatic matching methods. Different mapping approaches capable of dealing with lower
quality data need to be considered.
In the next section we review the related work. Then two main components of our system
are presented in subsequent sections: (i) analyzing an investigation log to segment it into
events, and (ii) mapping those events to a 3D model for reviewing (Fig. 2). In order to segment
an investigation log into events, we introduce in Section 3 our novel features to classify
frames into classes of events. This turns a log into a story of the investigation, making it
more accessible in the time dimension. Section 4 presents our semi-interactive approach
to map events to a reconstructed 3D model. Together with the log segmentation step, this
builds a 3D event log containing investigation events and their spatial and temporal relations.

Fig. 2 Overview of the framework to build a 3D event log of an investigation: the video log is automatically segmented into investigation events, which are then semi-interactively matched to the 3D reconstructed model to form the 3D event log

Section 5 evaluates the results of the proposed solution at the two analysis steps: segmenting
a log into events, and mapping the events to a 3D model. Finally, in Section 6 we present
our interface which allows for navigating the 3D events.

2 Related work

2.1 Video analysis and segmentation
Video analysis often starts with segmenting a video into units for easier management and
processing. The commonly used unit is a shot, a series of frames captured by a camera in
an uninterrupted period of time. Shot boundaries can be detected quite reliably, e.g. based
on motion [23]. After shot segmentation, various low-level features and machine learning
techniques are used to get more high-level information, such as whether a specific concept
is present [32]. Instead of content descriptors, we want to get information on the movements and actions of the investigator. Thus attention and intention are two important aspects.
Attention analysis tries to capture the passive reaction of the viewer. To that end the attention model defines what elements in the video are most likely to get the viewer’s attention.
Many works are based on the visual saliency of regions in frames. They are then used as
criteria to select key frames [1, 20, 22]. When analyzing video logs intention analysis tries
to find the motivation of the cameraman while capturing the scene. This information leads
to one more browsing dimension [21], or another way of summarization [1].
Tasks such as summarization are very difficult to tackle as a general problem. Indeed
existing systems have been built to handle data in specific domains, such as news [3] and
sports [27]. In all the examples mentioned, we see that domain-specific summarization methods
perform better than generic schemes. New domains bring new demands. Social networking
video sites, like YouTube, call for research on the analysis of user-generated videos (UGV) [1]
as well as life logs [2], a significant sub-class of UGVs. Indeed, we have seen research in
both hardware [2, 7, 10] and algorithms [8, 9] to meet that need. Research on summarizing
investigation logs is very limited.
Domains dictate requirements and limit the techniques applicable for analysis. For example, life logs, and in general many UGVs, are one-shot. This means that the familiar unit
of video analysis (shots) is no longer suitable, and new units as well as new segmentation
methods must be developed [1]. The quality of those videos is also lower than that of professionally
produced videos. Unstable motion and varying types of scenes violate the common assumptions on the motion model. A more difficult issue is that those videos are less structured,
making it harder to analyze contextual information. In this work we consider professional
logs of scene investigation, which share many challenges with general UGVs, but also have their
own domain-specific characteristics.
2.2 Video navigation
The simplest video navigation scheme, as seen on every DVD, divides a video into tracks
and presents them with a representative frame and description. If we apply multimedia analysis and know more about the purpose of navigation, there are many alternative ways to

navigate a video. For example, in [16] the track and representative frame scheme is enhanced
using an interactive mosaic as a customized interface. The method to create the video mosaic
takes into account various features including color distribution, existence of human faces,
and time, to select and pack key frames into a mosaic template. Apart from the familiar time


Multimed Tools Appl

dimension, we can also navigate in space. Tour into the video [15] demonstrates spatial navigation in video by decomposing an object into different depth layers, allowing users
to watch the video from new perspectives. The navigation can be object-based or frame-based.
In [12], object tracking enables an object-based video navigation scheme. For example, users can navigate video by dragging an object from one frame to a new location. The
system then automatically navigates to the frame in which the object location is closest to
that expectation.
What the novel navigation schemes described above have in common is that they depend
on video analysis to get the information required for navigation. That is also the way we
approach the problem. We first analyze the video logs and then use the result as the basis for
navigation in a 3D model.

3 Analyzing investigation logs
In this section we first discuss investigation events and their characteristics. Based on that
we motivate our solution for segmenting investigation logs, which is described subsequently.
3.1 Investigation events
Watching logs produced by professionals (policemen), we identify four types of events:
search, overview, detail, and examination. In a search segment investigators look around
the scene for interesting objects. An overview segment is taken with the intention to capture
spatial relations between objects and to position oneself in the room. In a detail segment
the investigator is interested in a specific object e.g. an important trace, and moves closer
or zooms in to capture it. Finally, in examination segments, investigators carefully look at
every side of an important object. The different situations lead to four different types of
segments in an investigation log. As a basis for video navigation, our aim is to automatically segment an investigation log into these four classes of events.
There are several clues for segmentation, namely structure and motion, visual content,
and voice. Voice is an accurate clue; however, since users usually add voice notes at only a few
important points, it does not cover all frames of the video. Since in an investigation
the objects of interest vary greatly and are unpredictable, both in type and appearance, the
visual content approach is infeasible. So understanding the movement of the cameraman
is the most reliable clue. The class of events can be predicted by studying the trajectory
of the cameraman's movement and his position relative to the objects. In computer vision terms
this represents the structure of the scene and the motion of the camera. We observe that the
four types of events have different structure and motion patterns. For example, an overview
has moderate pan and tilt camera motion and the camera is far from the objects. Table 1
summarizes the different types of events and characteristics of their motion patterns.
Though the description in Table 1 looks simple, performing the log segmentation is not.
Some terms, such as “go around objects”, are at a conceptual level. These are not considered
in standard camera motion analysis, which usually classifies video motion into pan, tilt, and
zoom. Also, it is not just about camera motion. For example, the term "close" implies that we also
need features representing the structure (depth) of the scene. Thus, to segment investigation
logs, we need features containing both camera motion and structure information.


Table 1 Types of investigation events, and characteristics of their motion patterns

Investigation event    Characteristics
Search                 Unstable, mixed motion
Overview               Moderate pan/tilt; far from objects
Detail                 Zooming-like motion; close to objects
Examination            Go around objects; close to objects

3.2 Segmentation using structure-motion features
As discussed, in order to find investigation events, we need features capturing patterns of
motion and structure. We propose such features below. These features, employed in a three-step framework (Fig. 3), help to segment a log into investigation events despite varying content.
3.2.1 Extracting structure and motion features
From the definition of the investigation event classes, it follows that to segment a log into
these classes, we need both structure and motion information.
While it is possible to estimate the structure and motion even from an uncalibrated
sequence [24], that approach is not robust and not efficient enough for the freely captured
investigation videos. To come to a solution, we look at geometric models capturing structure
and motion. We need models striking a balance between simple models, which are robust
to noise, and detailed models, which capture the full geometry and motion but are easily affected by
noise.
In our case of investigation, the scene can be assumed static. The most general model
capturing structure and motion in this case is the fundamental matrix [13]. In practice, many
applications, taking advantage of domain knowledge, use more specific models. Table 2
shows those models from the most general to the most specific. In studio shots where cameras are mounted on tripods, i.e. no translation is present, the structure and motion are well
captured by the homography model [13]. If the motion between two consecutive frames is
small, it can be approximated by the affine model. This fact is well exploited in video analysis [26]. When the only information required is to know whether a shot is a pan, tilt, or
zoom then the three-parameter model is enough [17, 23]. In that way, the structure of the
scene, i.e. variety in 3D depth, is ignored.
We base our method on the models in Table 2. We first find the correspondences between
frames to derive the motion and structure information. How well those correspondences fit
into the above models tells us something about the structure of the scene as well as the
camera motion. Such a measurement is called an information criterion (IC) [34].
An IC measures the likelihood of a model being the correct one, taking into account both
fitting errors and model complexity. The lower the IC value, the more likely the model is
correct. Conversely, the higher the value, the more likely the structure and motion possess
properties not captured by the model. A series of IC values computed on the four models
represented in Table 2 characterize the scene and the camera motion within the scene. Based
on those IC values we can build features capturing structure and motion information in
video. Figure 4 summarizes the proposed features and their meaning, derived from Table 2.


Fig. 3 Substeps to automatically segment an investigation log into events: extracting structure and motion (SM) features, classifying frames, and merging labeled frames


Table 2 Models commonly used in video analysis, their names and degrees of freedom, and the structure and motion conditions under which they hold

Model (d.o.f.)                      Structure and motion assumption
Homography H_P (8)                  Flat scene, or no translation in motion
Affine model H_A (6)                Far flat scene
Similarity model H_S (4)            Far flat scene, image plane parallel to scene plane
Three-parameter model H_R (3)       Same as H_S, no rotation around the principal ray

The degrees of freedom are required to compute the proposed features

The IC we use here is the Geometric Robust Information Criterion (GRIC) [34]. GRIC,
as reflected in its name, is robust against outliers. It has been successfully used in 3D reconstruction from images [25]. The main purpose of GRIC is to find the least complex model
capable of describing the data.
To introduce GRIC, let us first define some parameters. Let d denote the dimension of the model; r the input dimension; k the model's degrees of freedom; and E = [e_1, e_2, ..., e_n] the set of residuals resulting from fitting n corresponding data points to the model. The GRIC is now formulated as:

g(d, r, k, E) = \sum_{e_i \in E} \rho(e_i^2) + (\lambda_2 n d + \lambda_3 k), \qquad \rho(e_i^2) = \min\left(\frac{e_i^2}{\sigma^2},\, \lambda_1 (r - d)\right)    (1)

where σ is the standard deviation of the residuals.
The left term of (1), derived from fitting residuals, is the model fitting error. The minimum function used in ρ is meant to threshold outliers. The right term, consisting of model
parameters, is the model complexity. λ1 , λ2 , and λ3 are parameters steering the influence
of the fitting error and the model complexity on the criterion. Their suggested values are 2,
log(r), and log(rn) respectively [34].
Fig. 4 Proposed structure and motion criteria for video analysis. Each criterion corresponds to a motion model assumption: c_P to H_P (flat scene, or no translation), capturing depth variety and translation in camera motion; c_A to H_A (far flat scene), additionally capturing the scene's distance; c_S to H_S (far flat scene, image plane parallel to the scene plane), additionally capturing scene-camera alignment; and c_R to H_R (same as H_S, no rotation around the principal ray), additionally capturing camera rotation around the principal ray.

In our case, we consider a two-dimensional problem, i.e. d = 2, and the dimension of the input data is r = 4 (two 2D points). The degrees of freedom k for the models are given in Table 2, and n is the number of correspondences. The GRIC equation, with the dependence on the models in Table 2 made explicit, simplifies to:
g_H(k, E) = \sum_{e_i \in E} \min\left(\frac{e_i^2}{\sigma^2},\, 4\right) + \left(2n \log(4) + k \log(4n)\right)    (2)

In order to make the criteria comparable over frames, the number of correspondences n
must be the same. To enforce this condition, we compute motion fields by computing correspondences on a fixed sampling grid. As mentioned, GRIC is robust against outliers,
so the outliers that often exist in motion fields should not be a problem. For a pair of consecutive frames, we compute GRIC for each of the four models listed in Table 2 and Fig. 4. For
example, c_P = g_H(8, E), where E is the set of residuals of fitting correspondences to the
H_P model.
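For illustration, the simplified criterion (2) amounts to only a few lines of code. The sketch below is our own, not the paper's implementation; it assumes NumPy and residual arrays obtained from fitting the models of Table 2.

```python
# Minimal sketch of the simplified GRIC criterion of Eq. (2); our own code,
# not the paper's implementation. `residuals` is the array of fitting
# residuals e_i of the n grid-sampled correspondences for one model.
import numpy as np

def gric(residuals, k, sigma):
    """g_H(k, E): robust fitting error plus model complexity for a k-d.o.f. model."""
    n = len(residuals)
    # Left term: residuals thresholded at lambda1 * (r - d) = 2 * (4 - 2) = 4
    fit_error = np.minimum(residuals ** 2 / sigma ** 2, 4.0).sum()
    # Right term: model complexity with lambda2 = log(r) = log(4), lambda3 = log(r*n)
    complexity = 2 * n * np.log(4) + k * np.log(4 * n)
    return fit_error + complexity

# Structure and motion criteria for one frame pair, given residual arrays
# res_hp, ..., res_hr and a noise level sigma (all assumed to be available):
# c_P = gric(res_hp, 8, sigma)   # homography H_P
# c_A = gric(res_ha, 6, sigma)   # affine H_A
# c_S = gric(res_hs, 4, sigma)   # similarity H_S
# c_R = gric(res_hr, 3, sigma)   # three-parameter H_R
```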
Our features include estimates of the three 2D frame motion parameters of the H_R
model, and the four GRIC values of the four motion models (Fig. 4). The frame motion parameters (namely the dilation factor o, the horizontal movement x, and the vertical movement y)
have been used in video analysis before [17, 23], e.g. to recognize detail segments [17].
We consider them as the baseline features. Our proposed measurements (c_P, c_A, c_S, and
c_R) add more 3D structure and motion information to those baseline features. To make the features
robust against noisy measurements and to capture the trend in the structure and the motion,
we use the mean and variance of the criteria/parameters over a window of frames. This yields a
14-element feature vector for each frame:
F = \left( \bar{o}, \bar{x}, \bar{y}, \bar{c}_P, \bar{c}_A, \bar{c}_S, \bar{c}_R, \tilde{o}, \tilde{x}, \tilde{y}, \tilde{c}_P, \tilde{c}_A, \tilde{c}_S, \tilde{c}_R \right)    (3)

where \bar{\cdot} denotes the mean and \tilde{\cdot} the variance of the value over the feature window w_f.
In summary, computing the features includes:
1. Compute optical flow throughout the video. Sample the optical flow using a fixed grid to get correspondences.
2. Fit the correspondences to the models H_P, H_A, H_S, H_R to get the residuals and the three parameters o, x, y of H_R.
3. Compute c_P, c_A, c_S, c_R using (2).
4. Compute feature vectors as defined in (3).
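A possible realization of these four steps with OpenCV is sketched below. It reuses the gric() function from the previous sketch; the grid spacing, the closed-form least-squares fit of the three-parameter model H_R, the use of the residual standard deviation as sigma, and the omitted error handling are our own simplifications rather than details taken from the paper.

```python
# Sketch of steps 1-4 with OpenCV; gric() is the function from the previous sketch.
import cv2
import numpy as np

def grid_points(shape, step=16):
    """Fixed sampling grid, so n is the same for every frame pair (step 1)."""
    h, w = shape[:2]
    ys, xs = np.mgrid[step // 2:h:step, step // 2:w:step]
    return np.float32(np.dstack([xs, ys]).reshape(-1, 1, 2))

def fit_models(p0, p1):
    """Fit H_P, H_A, H_S and the three-parameter model H_R (step 2);
    return the residual arrays and the parameters (o, x, y) of H_R."""
    p0, p1 = p0.reshape(-1, 2), p1.reshape(-1, 2)
    res = {}

    Hp, _ = cv2.findHomography(p0, p1, cv2.RANSAC)            # H_P, 8 d.o.f.
    proj = cv2.perspectiveTransform(p0[None], Hp)[0]
    res['P'] = np.linalg.norm(proj - p1, axis=1)

    Ha, _ = cv2.estimateAffine2D(p0, p1)                       # H_A, 6 d.o.f.
    res['A'] = np.linalg.norm(p0 @ Ha[:, :2].T + Ha[:, 2] - p1, axis=1)

    Hs, _ = cv2.estimateAffinePartial2D(p0, p1)                # H_S, 4 d.o.f.
    res['S'] = np.linalg.norm(p0 @ Hs[:, :2].T + Hs[:, 2] - p1, axis=1)

    # H_R (3 d.o.f.): isotropic dilation o plus translation (x, y), no rotation;
    # closed-form least-squares fit.
    c0, c1 = p0.mean(0), p1.mean(0)
    o = ((p0 - c0) * (p1 - c1)).sum() / ((p0 - c0) ** 2).sum()
    t = c1 - o * c0
    res['R'] = np.linalg.norm(o * p0 + t - p1, axis=1)
    return res, (o, t[0], t[1])

def frame_pair_measurements(prev_gray, gray):
    """Raw per-pair measurements (o, x, y, c_P, c_A, c_S, c_R) (steps 1-3)."""
    p0 = grid_points(prev_gray.shape)
    p1, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, p0, None)
    ok = status.ravel() == 1
    res, (o, x, y) = fit_models(p0[ok], p1[ok])
    criteria = [gric(res[m], k, res[m].std() + 1e-6)           # Eq. (2)
                for m, k in zip('PASR', (8, 6, 4, 3))]
    return [o, x, y] + criteria

def feature_vector(measurements, t, wf=8):
    """14-element feature F of Eq. (3): mean and variance over window wf (step 4)."""
    window = np.array(measurements[max(0, t - wf + 1):t + 1])
    return np.concatenate([window.mean(axis=0), window.var(axis=0)])
```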
3.2.2 Classifying frames
Based on the new motion and structure features, we aim to classify the frames into four
classes corresponding to the four types of events listed in Table 1. The search acts as “the
others” class, containing frames that do not have a clear intention.
Since logs are captured by handheld or head-mounted cameras, the motion in the logs is
unstable. Consequently, the input features are noisy. It is even hard for humans to classify
every class correctly. While the detail class is quite recognizable from its zooming motion,
it is hard to distinguish the search and examination classes. Therefore, we expect that the
boundary between classes is not well defined by traditional motion features. While the proposed features are expected to distinguish classes, we do not know which features are best
to recognize the classes. Feature selection is needed here. Two popular choices with implicit
feature selection are support vector machines and random forest classifiers. We have carried out experiments with both, and the random forest classifier gave better results. Hence,
in Section 5, we only present results with the random forest classifier.
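The paper uses the random forest implementation of the Weka package (see Section 5); a rough scikit-learn equivalent operating on the 14-element feature vectors of (3) could look as follows. The variable names and classifier settings are placeholders, not values from the paper.

```python
# Rough scikit-learn stand-in for the frame classification step; the paper
# itself uses the random forest classifier of the Weka package. X_* are
# matrices of 14-element feature vectors (Eq. (3)), y_* the event labels.
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

LABELS = ['search', 'overview', 'detail', 'examination']

def classify_frames(X_train, y_train, X_test):
    # Random forests perform implicit feature selection, which is one reason
    # they are preferred here over a classifier without it.
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_train, y_train)
    return clf, clf.predict(X_test)

# Example usage:
# clf, y_pred = classify_frames(X_train, y_train, X_test)
# print(accuracy_score(y_test, y_pred))
# print(confusion_matrix(y_test, y_pred, labels=LABELS))
```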




3.2.3 Merging labeled frames
As mentioned, the captured motion is unstable and the input data for classification is noisy.
We thus expect many frames of other classes to be misclassified as search frames. To
improve the result of the labeling step, we first apply a voting technique over a window
of w_v frames, the voting window. Within the voting window, each type of label is
counted. Then the center frame is relabeled using the label with the highest vote count.
Finally, we group consecutive frames having the same label into events.
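A minimal sketch of this voting and merging step, assuming per-frame labels from the classifier; the function names are ours, and the window size follows the value w_v = 24 reported in Section 5.

```python
# Sketch of majority voting over a sliding window followed by merging
# consecutive identical labels into events; wv = 24 follows Section 5.
from collections import Counter
from itertools import groupby

def vote_labels(labels, wv=24):
    """Relabel each frame with the most frequent label in a centered window."""
    half = wv // 2
    return [Counter(labels[max(0, i - half):i + half + 1]).most_common(1)[0][0]
            for i in range(len(labels))]

def merge_into_events(labels):
    """Group consecutive frames with the same label into (label, start, end)."""
    events, start = [], 0
    for label, run in groupby(labels):
        length = len(list(run))
        events.append((label, start, start + length - 1))
        start += length
    return events

# events = merge_into_events(vote_labels(per_frame_labels, wv=24))
```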

4 Mapping investigation events to a 3D model
The video is now represented in the time dimension as a series of events. In this section, we present the method to enhance the comprehensiveness of the investigation in the
space dimension by connecting events to a 3D model of the scene. In this way we enable
interaction with a log in 3D.
For each type of event, we need one or more representative frames. These frames give the user a
hint of which part of the scene is covered by an event, as well as a rough indication of the camera motion. Overview and detail events are represented by the middle frames
of the events. For this we make the assumption that the middle frame of an overview or a
detail event is close to the average pose of all the frames in the event. Another way to create
a representative frame for an overview event would be a large virtual frame, a panorama,
that approximately covers the space captured by that overview. It is, however, costly and not
always feasible, and thus is not implemented in this work. As for search and examination events, one frame is not sufficient to represent the motion, so these are represented by three
frames: the first, the middle, and the last. To visualize the events in space, we have to match
these representative frames to the 3D model.
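A minimal sketch of this representative-frame rule, assuming events are (label, start, end) tuples as in the merging sketch above:

```python
# Representative frames per event: the middle frame for overview/detail,
# first/middle/last for search/examination (which need a hint of the motion).
def representative_frames(event):
    label, start, end = event            # event as (label, start, end)
    middle = (start + end) // 2
    if label in ('overview', 'detail'):
        return [middle]
    return [start, middle, end]
```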
Matching frames is a non-trivial operation. Logs are captured at varying locations in the
scene and with different poses, and video frames are not as clear as high-resolution images.
Also the number of images calibrated to the 3D model is limited. Thus we expect that some
representative frames may be poorly matched or cannot be matched at all. To overcome
those problems, we propose a semi-interactive solution containing two steps (Fig. 5): (i)
automatically map as many representative frames as possible to the 3D model, and then (ii)
let users interactively adjust predicted camera poses of other representative frames.

4.1 Automatic mapping of events
Since our 3D model is built using an image-based method [6], the frame-to-model mapping is formulated as image-to-image matching. Note that color laser scanners also use
images, calibrated to the scanning points, to capture color information. So our solution is
also applicable to laser-scanning-based systems.
Let I denote the set of images from which the model is built, or more generally a set
which is calibrated to the 3D model. Matching a representative frame i to one of the images
in I enables us to recover the camera pose. To do the matching, we use the well-known
SIFT detector and descriptor [19]. First SIFT keypoints and descriptors are computed for
representative frame i and every image in I. Keypoints of frame i are initially matched to
keypoints of every image in I, based on comparing descriptors [19] only. Correctly matched
keypoints are found by robustly estimating the geometric constraints between the two
images [13]. When doing so there might be more than one image in I matched to frame i.


Fig. 5 Mapping events to a 3D model includes an automatic mapping that paves the way for interactive mapping: the investigation events and the 3D reconstructed model are first matched automatically, then interactively, yielding the 3D event log

Since one matched image is enough to recover the camera pose, we take the one with the most
correctly matched keypoints, which potentially gives the most reliable camera pose estimation. In our work, the numbers of calibrated images, panoramas, and representative frames are
reasonably small, so we use an exhaustive method to match them. If the number of images
is large, it is recommended to use more sophisticated methods such as Best Bin First [18],
or Bag-of-Words [29]. When a match is established we can recover 3D information. To
estimate the camera pose of each matched frame we use the 5-point algorithm [30]. The geometric constraints computed in the estimation process also indicate the frames that cannot
be mapped automatically. Those frames are mapped using the interactive method presented
below.
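A sketch of this automatic mapping step with OpenCV is given below: SIFT matching with a ratio test, a robust geometric check, and pose recovery with the 5-point algorithm via the essential matrix. The camera intrinsics K are assumed to be known, the thresholds are illustrative, and error handling is kept minimal; this is our reading of the procedure, not the paper's code.

```python
# Sketch of the automatic frame-to-model mapping: match a representative frame
# against the set I of calibrated images with SIFT, keep the image with the
# most correctly matched keypoints, and recover the relative pose with the
# 5-point algorithm via the essential matrix. K (intrinsics) is assumed known.
import cv2
import numpy as np

sift = cv2.SIFT_create()
matcher = cv2.BFMatcher()

def match_pair(img_a, img_b, ratio=0.75):
    """SIFT matching with Lowe's ratio test plus a robust geometric check;
    returns arrays of geometrically consistent matched points."""
    kp_a, des_a = sift.detectAndCompute(img_a, None)
    kp_b, des_b = sift.detectAndCompute(img_b, None)
    if des_a is None or des_b is None:
        return np.empty((0, 2)), np.empty((0, 2))
    good = [m for m, n in matcher.knnMatch(des_a, des_b, k=2)
            if m.distance < ratio * n.distance]
    if len(good) < 8:
        return np.empty((0, 2)), np.empty((0, 2))
    pts_a = np.float32([kp_a[m.queryIdx].pt for m in good])
    pts_b = np.float32([kp_b[m.trainIdx].pt for m in good])
    _, mask = cv2.findFundamentalMat(pts_a, pts_b, cv2.FM_RANSAC)
    if mask is None:
        return np.empty((0, 2)), np.empty((0, 2))
    inliers = mask.ravel() == 1
    return pts_a[inliers], pts_b[inliers]

def map_frame(frame, calibrated_images, K, min_inliers=20):
    """Exhaustively match `frame` against the calibrated set; return the
    relative pose (R, t, up to scale) or None if the frame cannot be mapped."""
    pts_f, pts_i = max((match_pair(frame, img) for img in calibrated_images),
                       key=lambda pair: len(pair[0]))
    if len(pts_f) < min_inliers:
        return None                        # left for the interactive step
    E, _ = cv2.findEssentialMat(pts_f, pts_i, K, method=cv2.RANSAC)
    _, R, t, _ = cv2.recoverPose(E, pts_f, pts_i, K)
    return R, t
```

Keeping only the image with the most geometrically consistent matches mirrors the choice described above; frames for which no image passes the inlier threshold are left for the interactive step.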
4.2 Interactive mapping of events
To overcome the missing matches between some events and the 3D model, we employ user
interaction. The simplest way is to ask the user to navigate the 3D model to viewpoints
close to the representative frames of those events. However, this is ineffective as the starting
viewpoints could be far from the appropriate ones.
To overcome this problem, we make use of the observation that the log is continuously
captured and events are usually short. As a consequence, camera poses of events which are
close in time are also close in space. We exploit this closeness to reduce the time users
need to navigate to find those viewpoints (Fig. 6). For each representative frame that is
not mapped to the 3D model, we search backward and forward to find the closest mapped
representative frame (both automatically mapped and previously interactively mapped, in
terms of frames). We use the camera pose of these closest representative frames to initialize
the camera pose of the unmapped representative frame.

Fig. 6 Manually mapping an event to the 3D model could be hard (a). Fortunately, automatic mapping could provide an initial guess (b) that gives more visual similarity to the frames of the event. From there, users can quickly adjust the camera pose to a satisfactory position (c)

There are 6 parameters defining a
camera pose: 3 defining the coordinates in space, and 3 defining camera orientation/rotation.
We interpolate each of them from the parameters of the two closest known camera poses:
p_u = \frac{p_i t_j + p_j t_i}{t_i + t_j}    (4)

where p_u is a parameter of the unknown camera pose; p_i, p_j are the same parameters of the two closest known camera poses; and t_i, t_j are the frame distances to the frames of those known camera poses.
Applying this initialization, as illustrated in Fig. 6, we effectively utilize automatically
mapped and previously interactively mapped results to reduce the interaction time needed to
register an unmapped frame.
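A sketch of this initialization, applying (4) independently to each of the six pose parameters; the representation of orientation as three angles and the component-wise interpolation follow the simple scheme described here and are only meant to produce a rough starting point for the user.

```python
# Sketch of the initialization of Eq. (4), applied independently to each of
# the six pose parameters (3 position + 3 orientation values).
import numpy as np

def interpolate_pose(pose_i, pose_j, t_i, t_j):
    """pose_i, pose_j: 6-vectors of the two closest mapped frames;
    t_i, t_j: their frame distances to the unmapped frame."""
    pose_i, pose_j = np.asarray(pose_i, float), np.asarray(pose_j, float)
    return (pose_i * t_j + pose_j * t_i) / (t_i + t_j)

# Example: a frame 10 frames away from pose_i and 30 frames away from pose_j
# is initialized closer to pose_i (weights 0.75 and 0.25):
# pose_u = interpolate_pose(pose_i, pose_j, t_i=10, t_j=30)
```

Interpolating the three orientation parameters component-wise is only a rough guess, but since the user refines the pose interactively afterwards, a simple initialization like this appears sufficient.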
Having the camera pose of the frames, we visualize each of them as a camera frustum. A
camera frustum is a pyramid whose apex is drawn at the camera viewpoint. The image plane is
visualized by the pyramid's base, which has the same size as the image up to a scale s. The
distance from the apex to the pyramid's base is equal to the focal length of the camera up
to the same scale s. The scale s can be adjusted for proper visualization. The extension of
the planes formed by connecting the apex to the pyramid's base shows the field of view of
the frame. This field of view helps to compute points covered by a representative frame, or
vice versa to find which events cover a certain point in the 3D model. The camera frustum
concept is illustrated in Fig. 7.
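For illustration, a sketch of the frustum geometry just described: the apex at the camera center and four base corners at distance s·f along the viewing direction, offset by ±s·w/2 and ±s·h/2. The convention that R is the camera-to-world rotation and C the camera center is our assumption, not a detail from the paper.

```python
# Sketch of the frustum geometry: apex at the camera center C, four base
# corners at depth s*f with half-extents s*w/2 and s*h/2. R is assumed to be
# the camera-to-world rotation (its columns are the camera axes in world
# coordinates).
import numpy as np

def frustum_corners(R, C, w, h, f, s=0.1):
    """Return the apex and the four base corners of the frustum in world space."""
    corners_cam = s * np.array([[-w / 2, -h / 2, f],
                                [ w / 2, -h / 2, f],
                                [ w / 2,  h / 2, f],
                                [-w / 2,  h / 2, f]])
    corners_world = corners_cam @ R.T + C   # rotate into the world frame, translate
    return C, corners_world
```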


Fig. 7 A camera frustum is a pyramid shape representing the camera pose and field of view of a representative frame. The pyramid's base and the distance from the pyramid's apex to its base are proportional to the image size (width w, height h) and the focal length (f), with the same scale (s)

After interactive mapping we have a one-to-one correspondence between the temporal events of the video log and their 3D spatial positions in the model. That is to say, we have
arrived at our 3D event log.

5 Evaluation
In this section, we detail the implementation and give an evaluation of the log analysis
method of Section 3 and the method to connect these logs to a 3D model presented in
Section 4.
5.1 Dataset
Getting access to real experts in realistic settings is difficult. We were fortunate to have real
experts operating in a realistic training setting. Working with real experts, we did have to limit
the number of different indoor environments. However, in our opinion,
the complexity of the scene and the number of cameramen compensate for that limitation.
In order to obtain clear ground truth for training, we capture a set of videos of separate
types of events, i.e. each video purely contains frames of one type of event. The setup for
training data is a typical office scene, captured using a handheld camera. In total there are
more than 15 thousand frames captured for training.
For testing, we have captured logs in the same office scene. Furthermore, we had policemen and others capture logs in fake crime scenes. Those logs in total are about one hour of video. The ground truth is obtained by manually segmenting those logs into segments
corresponding to the four types of events defined in Table 1.
5.2 Analyzing investigation logs
5.2.1 Criteria
We evaluate the log analysis for the two stages of the algorithm, namely classifying frames
and segmenting logs. For the former, we look at the frame classification result. For the latter,
which is more important as it is the purpose of the analysis, we use three criteria to evaluate the
quality of the resulting investigation story: the completeness, the purity, and the continuity.



To define those criteria, we first define what we mean by a correct event. Let S = {s_1, s_2, ..., s_k} denote a segmentation. Each event s_i has a range g_i (a tuple composed of the start and end frames of the event) and a class l_i. Now let two segmentations \hat{S} and \bar{S} be given. A segment \hat{s}_i is correct with respect to the reference segmentation \bar{S} if there exists an event \bar{s}_j in \bar{S} that sufficiently overlaps \hat{s}_i and has the same class label. Formally:

\alpha(\hat{s}_i, \bar{S}) =
\begin{cases}
1 & \text{if } \exists\, \bar{s}_j : \dfrac{|\hat{g}_i \cap \bar{g}_j|}{\min(|\hat{g}_i|, |\bar{g}_j|)} > z \;\wedge\; \hat{l}_i \equiv \bar{l}_j \\
0 & \text{otherwise}
\end{cases}    (5)

where |.| is the number of frames in a range, and z indicates how much the two events must overlap. Here we use z = 0.75, i.e. the overlap is at least 75 % of the shorter event.
Now suppose that \hat{S} is the result of automatic segmentation and \bar{S} is the reference segmentation. The completeness of the story C, showing whether all events are found, is defined as the ratio of segments of \bar{S} correctly identified in \hat{S}:

C = \frac{\sum_{i=1}^{|\bar{S}|} \alpha(\bar{s}_i, \hat{S})}{|\bar{S}|}    (6)

The purity of the story P, reflecting whether identified events are correct, is defined as the ratio of segments of \hat{S} correctly identified in \bar{S}:

P = \frac{\sum_{i=1}^{|\hat{S}|} \alpha(\hat{s}_i, \bar{S})}{|\hat{S}|}    (7)

where |S| is the total number of segments in a segmentation S.

The last criterion is the continuity of the story U, reflecting how well events are recovered without being broken into several events or wrongly merged. It is defined as the ratio of the number of events in the result and in the ground truth:

U = \frac{|\hat{S}|}{|\bar{S}|}    (8)

If U is greater than 1.0 then the number of events in the result is greater than the real number of events, implying that there are false alarm events. When U is less than 1.0 the number of events found is less than the actual number of events, implying that we miss some events. An important restriction on U is that we do not want a high value of U, as the number of events should be manageable for reviewing logs. A perfect result has all criteria equal to 1.0.
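A sketch of the three criteria, with events represented as (label, start, end) tuples as in the merging sketch above; the overlap threshold z = 0.75 follows the text.

```python
# Sketch of the story criteria. Events are (label, start, end) tuples with
# inclusive frame ranges; S_hat is the automatic segmentation, S_bar the
# reference. The overlap threshold z = 0.75 follows the text.
def overlap(g_i, g_j):
    """Number of frames shared by two inclusive frame ranges."""
    return max(0, min(g_i[1], g_j[1]) - max(g_i[0], g_j[0]) + 1)

def is_correct(event, reference, z=0.75):
    """Eq. (5): correct if some reference event of the same class overlaps
    the event by more than z times the shorter of the two."""
    label, start, end = event
    for ref_label, ref_start, ref_end in reference:
        shorter = min(end - start + 1, ref_end - ref_start + 1)
        if (ref_label == label and
                overlap((start, end), (ref_start, ref_end)) > z * shorter):
            return True
    return False

def story_scores(S_hat, S_bar, z=0.75):
    """Completeness (Eq. 6), purity (Eq. 7) and continuity (Eq. 8)."""
    completeness = sum(is_correct(e, S_hat, z) for e in S_bar) / len(S_bar)
    purity = sum(is_correct(e, S_bar, z) for e in S_hat) / len(S_hat)
    continuity = len(S_hat) / len(S_bar)
    return completeness, purity, continuity
```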
5.2.2 Implementation
The motion fields are estimated using OpenCV's implementation of the Lucas-Kanade
method; other implementations can also be used. Results presented here are
produced with feature window w_f = 8 and voting window w_v = 24 (i.e. 1 second)
(Section 3.2.1). We use the random forest classifier implemented in the Weka package.




5.2.3 Results
The accuracy of classifying frames, defined as the number of correctly classified frames over the
total number of frames, using the baseline features and using the proposed features with and
without voting, is given in Table 3.
When using only 2D frame motion features, the accuracy of the frame classification is
0.60. Our proposed structure and motion features improve the accuracy to 0.71. Looking
into the confusion matrices (Table 4a,b) we see that the recall of most classes is increased.
The largest improvement is for the recall of the search class, increasing from 0.639 to 0.868.
The recall of the examination class is increased considerably, from 0.074 to 0.195. Most
of the incorrect results are frames misidentified as search frames. As mentioned, this is
an expected problem, as video logs captured from handheld or head-mounted cameras are
unstable. The recall of the detail class undesirably decreased by about 30 percent. This seems to
be due to an overfitting problem. This class, as discussed in Section 3.2.1, is best recognized
based on the three parameters of the H_R model, especially the dilation factor o. Adding the structure and motion
features brings the problem into a higher-dimensional space, causing overfitting.
A solution to consider in the future would be to use multiple classifiers and select different
sets of features to classify each class.
After we apply voting (Table 4c), the overall accuracy is further improved to 0.755. The
recall of the examination class is decreased. However, our ultimate goal is to recognize
events, not frames, and as shown below the final result with voting is improved.
We evaluate the log segmentation with the overlap threshold z set to 0.75. The results are
given in Table 5, including results with and without post processing of the voting window
with wv = 24. Without post processing the completeness is C = 1.0. As in reviewing
an investigation, it is important to find all important events, this is a very good result. The
purity of the story is reasonable, P = 0.65. However the number of events is extremely
high compared to the ground truth, (U = 17.41). This is undesirable as it would take much
time to review the investigation. Fortunately, applying the voting technique, the number of
identified events is much less and acceptable (U = 2.16), while the completeness remains

perfect. The purity P is decreased to 0.58 percent. This is practically acceptable as users can
correct the false alarm events during reviewing. Table 6 gives a detailed evaluation for each
class before and after voting. After applying voting, P is slightly decreased for all classes,
while U is greatly reduced to 1.0, the perfect value.
The results presented above are the merged results of the data we captured ourselves in a lab
room and the data captured by several others in the fake crime scene. This is because we
found no significant difference between them (the average accuracy is only about one
percent better for the data we captured ourselves). This shows that the method is
stable.

Table 3 Accuracy of the frame classification

Features                        Accuracy
Baseline                        0.596
Proposed features               0.710
Proposed features & voting      0.755


Table 4 Classification results (confusion matrices): (a) using only 2D motion parameters as features (baseline), (b) using the proposed structure and motion features, and (c) using the proposed features and voting

(a) Baseline
               Search   Overview   Detail   Examination
Search         0.639    0.136      0.174    0.051
Overview       0.476    0.325      0.131    0.068
Detail         0.229    0.053      0.660    0.058
Examination    0.535    0.077      0.314    0.074

(b) Proposed structure and motion features
               Search   Overview   Detail   Examination
Search         0.868    0.046      0.046    0.041
Overview       0.502    0.497      0.000    0.001
Detail         0.537    0.016      0.328    0.119
Examination    0.723    0.029      0.053    0.195

(c) Proposed features and voting
               Search   Overview   Detail   Examination
Search         0.931    0.028      0.026    0.015
Overview       0.515    0.485      0.000    0.000
Detail         0.649    0.012      0.280    0.059
Examination    0.834    0.000      0.022    0.144

Rows are the ground-truth classes and columns the predicted classes; the recall of each class is the corresponding diagonal entry. The recall is improved in most of the classes compared to the baseline, especially the hard examination class

5.3 Mapping events to a 3D model
In terms of representative frames, the percentage of frames mapped to the 3D model is
about 70 percent, of which about 20 percent is mapped automatically. The percentage of
unmatched frames due to lack of frame-to-frame overlap, i.e. no visual clue to map at all, is
about 30 percent. This results in 81.9 percent of events mapped to the 3D model (Table 7).
Table 7 also provides more insight into how well each type of event can be mapped. All the overview
events are matched, either automatically or interactively. The examination events are the hardest: none of them is matched automatically. This is due to the fact that those events are
usually captured at close distance, while the panorama images are captured at a wide view.
We have tried to extract SIFT keypoints on downscaled representative frames to match the
scale. Unfortunately, this did not help, probably because the details are too blurred in the
panoramas. A possible solution is to work with panoramas at higher resolution or with multiple focus settings.
Table 5 Log segmentation result with and without applying voting

                 C       P       U
Before voting    1.00    0.65    17.41
After voting     1.00    0.58    2.16


Table 6 Log segmentation evaluation per class

(a) Before voting
               C       P       U
Search         1.00    0.68    16.57
Overview       1.00    0.66    139.50
Detail         1.00    0.71    7.41
Examination    1.00    0.54    54.33

(b) After voting
               C       P       U
Search         1.00    0.58    2.13
Overview       1.00    0.61    19.00
Detail         1.00    0.68    1.15
Examination    1.00    0.33    4.00

In conclusion, more than 80 percent of the events can be mapped to the 3D model, of which
about 25 percent is done automatically. This provides sufficient connection to represent a log in a
3D model, giving us a spatio-temporal representation of the investigation for review.

6 Navigating investigation logs
We describe here our navigation system for investigation logs. As discussed, the system
aims to enable users to re-visit and re-investigate scenes. The user interface, shown in Fig. 8,
includes the main window showing a 3D model of the scene and a storyboard at the bottom
showing events in chronological order. Those two components present the investigation in
space and time. Users navigate an investigation via interaction with the two components.
When the user selects an event, camera frustums are displayed in the model to indicate the
area in the scene covered by that segment. Vice versa, when the user clicks on a point in
the model, the log segments covering that point are highlighted and the camera frustums of those
events are displayed. Those interactions visualize the relation between the scene and the
log, i.e. the spatial and the temporal elements of the investigation. To take a closer look at
that segment, users click on the camera frustum to transform it into a camera viewpoint and
watch the video segment in an attached window. Those interactions are demonstrated in the
accompanying video.

Table 7 Percentage of events mapped automatically, mapped interactively, and not mapped

                Automatic   Interactive   Missed
Search          9.6         29.4          6.8
Overview        1.7         1.1           0.0
Detail          9.6         28.2          10.2
Examination     0.0         2.3           1.1
All events      20.9        61.0          18.1

In total, 81.9 % of the events are mapped (automatically or interactively) and 18.1 % are missed


Fig. 8 The log navigation allows users to (a) dive into the scene, (b) check the location of events related to a selected point, and (c) watch a video segment of an event and compare it to the scene



7 Conclusion and future work
We propose to use a combination of video logs and 3D models, coined 3D event logs, to
provide a new way to do scene investigation. The 3D event logs provide a comprehensive
representation of an investigation process in time and space, helping users to easily get an
overview of the process and understand its details. To build such event logs we have to
overcome two problems: (i) decomposing a log into investigation events, and (ii) mapping
those events into a 3D model.
By using novel features capable of describing scene structure and camera motion and
machine learning techniques, we can classify frames into event classes at more than 70
percent accuracy. This helps to recover the investigation story completely, with fairly
good purity. To map events to a 3D model, we use a semi-interactive approach that combines
automatic computer vision techniques with user interaction. More than 80 percent of the
events in our experimental logs were mapped into a 3D model of the scene, providing a
representation that supports reviewing well.

Improvements to the results could be obtained by developing features that capture more
structure and motion information by considering more than two frames. An example could
be computing the GRIC of three-frame constraints [25]. Also, feature selection and multiple
classifiers would be helpful to increase the accuracy of classification. Structure and motion
features could also be useful in semantic scene analysis. Hence, we are also looking for
other domains where those features are effective.
Acknowledgments We thank Jurrien Bijhold and the Netherlands Forensic Institute for providing the data
and bringing in domain knowledge, and the police investigators for participating in the experiment. This
work is supported by the Research Grant from Vietnam’s National Foundation for Science and Technology
Development (NAFOSTED), No. 102.02-2011.13.

References
1. Abdollahian G, Taskiran CM, Pizlo Z, Delp EJ (2010) Camera motion-based analysis of user generated
video. IEEE Trans Multimed 12(1):28–41
2. Aizawa K (2005) Digitizing personal experiences: capture and retrieval of life log In: MMM ’05:
Proceedings of the 11th international multimedia modelling conference, pp 10–15
3. Albiol A, Torres L, Delp EJ (2003) The indexing of persons in news sequences using audio-visual data
In: IEEE international conference on acoustic, speech, and signal processing
4. Bijhold J, Ruifrok A, Jessen M, Geradts Z, Ehrhardt S, Alberink I (2007) Forensic audio and visual
evidence 2004–2007: a review. 15th INTERPOL forensic science symposium
5. Bush V (1945) As we may think. The Atlantic
6. Dang TK, Worring M, Bui TD (2011) A semi-interactive panorama based 3D reconstruction framework
for indoor scenes. Comp Vision Image Underst 115:1516–1524
7. Dickie C, Vertegaal R, Fono D, Sohn C, Chen D, Cheng D, Shell JS, Aoudeh O (2004) Augmenting and
sharing memory with eyeblog In: CARPE’04: Proceedings of the the 1st ACM workshop on continuous
archival and retrieval of personal experiences, pp 105–109
8. Doherty AR, Smeaton AF (2008) Automatically segmenting lifelog data into events In: WIAMIS
’08: Proceedings of the 2008 9th international workshop on image analysis for multimedia interactive
services, pp 20–23
9. Doherty AR, Smeaton AF, Lee K, Ellis DPW (2007) Multimodal segmentation of lifelog data In:
Proceedings of RIAO 2007. Pittsburgh
10. Gemmell J, Williams L, Wood K, Lueder R, Bell G (2004) Passive capture and ensuing issues for a
personal lifetime store In: CARPE’04: Proceedings of the the 1st ACM workshop on continuous archival
and retrieval of personal experiences, pp 48–55
11. Gibson S, Hubbold RJ, Cook J, Howard TLJ (2003) Interactive reconstruction of virtual environments
from video sequences. Comput Graph 27(2):293–301


Multimed Tools Appl
12. Goldman DB, Gonterman C, Curless B, Salesin D, Seitz SM (2008) Video object annotation, navigation, and composition In: UIST ’08: Proceedings of the 21st annual ACM symposium on user interface
software and technology, pp 3–12
13. Hartley R, Zisserman A (2004) Multiple view geometry in computer vision, 2nd edn. Cambridge
University Press
14. Howard TLJ, Murta AD, Gibson S (2000) Virtual environments for scene of crime reconstruction and
analysis In: SPIE – visual data exploration and analysis VII, vol 3960, pp 1–8
15. Kang HW, Shin SY (2002) Tour into the video: image-based navigation scheme for video sequences
of dynamic scenes In: VRST ’02: Proceedings of the ACM symposium on virtual reality software and
technology, pp 73–80
16. Kim K, Essa I, Abowd GD (2006) Interactive mosaic generation for video navigation In: MULTIMEDIA
’06: Proceedings of the 14th annual ACM international conference on multimedia, pp 655–658
17. Lan DJ, Ma YF, Zhang HJ (2003) A novel motion-based representation for video mining In: International
conference on multimedia and expo, vol 3, pp 469–472
18. Lowe DG (1999) Object recognition from local scale-invariant features In: International conference on
computer vision, vol 2, pp 1150–1157
19. Lowe DG (2004) Distinctive image features from scale-invariant keypoints. Int J Comput Vis 60(2):91–
110
20. Ma YF, Lu L, Zhang HJ, Li M (2003) A user attention model for video summarization In: ACM
multimedia, pp 533–542
21. Mei T, Hua XS, Zhou HQ, Li S (2007) Modeling and mining of users’ capture intention for home video.
IEEE Trans Multimed 9(1)

22. Meur OL, Thoreau D, Callet PL, Barba D (2005) A spatial-temporal model of the selective human visual
attention In: International conference on image processing, vol 3, pp 1188–1191
23. Ngo CW, Pong TC, Zhang H (2002) Motion-based video representation for scene change detection. Int
J Comput Vis 50(2):127–142
24. Pollefeys M, Van Gool L, Vergauwen M, Verbiest F, Cornelis K, Tops J, Koch R (2004) Visual modeling
with a hand-held camera. Int J Comput Vis 59:207–232
25. Pollefeys M, Verbiest F, Van Gool L (2002) Surviving dominant planes in uncalibrated structure and
motion recovery In: European conference on computer vision, pp 837–851
26. Robinson D, Milanfar P (2003) Fast local and global projection-based methods for affine motion
estimation. J Math Imaging Vis 8(1):35–54
27. Rui Y, Gupta A, Acero A (2000) Automatically extracting highlights for TV baseball program In: ACM
multimedia, pp 105–115
28. Sinha SN, Steedly D, Szeliski R, Agrawala M, Pollefeys M (2008) Interactive 3D architectural modeling
from unordered photo collections. ACM Trans Graph 27(5):159
29. Sivic J, Zisserman A (2009) Efficient visual search of videos cast as text retrieval. IEEE Trans Pattern
Anal Mach Intell 31(4):591–606
30. Snavely N, Seitz SM, Szeliski R (2006) Photo tourism: exploring photo collections in 3D. ACM Trans
Graph 25(3):835–846
31. Snavely N, Seitz SM, Szeliski R (2008) Modeling the world from internet photo collections. Int J Comput
Vis 80(2):189–210
32. Snoek CGM, Worring M (2009) Concept-based video retrieval. Found Trends Inf Retr 4(2):215–322
33. Tancharoen D, Yamasaki T, Aizawa K (2005) Practical experience recording and indexing of life log
video In: CARPE ’05: Proceedings of the 2nd ACM workshop on continuous archival and retrieval of
personal experiences, pp 61–66
34. Torr P, Fitzgibbon AW, Zisserman A (1999) The problem of degeneracy in structure and motion recovery
from uncalibrated image sequences. Int J Comput Vis 32(1)
35. van den Hengel A, Dick A, Thormählen T, Ward B, Torr PHS (2007) VideoTrace: rapid interactive scene
modelling from video. ACM Trans Graph 26(3):86




Trung Kien Dang received his M.Sc. in Telematics from the University of Twente, The Netherlands, in 2003 and his Ph.D. in Computer Science from the University of Amsterdam, The Netherlands, in 2013. His research includes 3D model reconstruction and video log analysis.

Marcel Worring received the M.Sc. degree (honors) in computer science from the VU Amsterdam, The
Netherlands, in 1988 and the Ph.D. degree in computer science from the University of Amsterdam in 1993. He
is currently an Associate Professor in the Informatics Institute of the University of Amsterdam. His research
focus is multimedia analytics, the integration of multimedia analysis, multimedia mining, information visualization, and multimedia interaction into a coherent framework yielding more than its constituent components.
He has published over 150 scientific papers covering a broad range of topics from low-level image and video
analysis up to multimedia analytics. Dr. Worring was co-chair of the 2007 ACM International Conference on
Image and Video Retrieval in Amsterdam, co-initiator and organizer of the VideOlympics, and program chair
for both ICMR 2013 and ACM Multimedia 2013. He was an Associate Editor of the IEEE TRANSACTIONS
ON MULTIMEDIA and the Pattern Analysis and Applications journal.



The Duy Bui received his B.Sc. in Computer Science from the University of Wollongong, Australia, in 2000 and his Ph.D. in Computer Science from the University of Twente, the Netherlands, in 2004. He is now Associate Professor at the Human Machine Interaction Laboratory, Vietnam National University, Hanoi. His research includes
computer graphics, image processing and artificial intelligence, with a focus on smart interacting systems.


