Concept-based Video Retrieval
Cees G.M. Snoek and Marcel Worring
SSIP 2008
www.mediamill.nl
Concept-Based Video Retrieval
Cees Snoek and Marcel Worring
with contributions by:
many
Intelligent Systems Lab Amsterdam,
University of Amsterdam, The Netherlands
The science of labeling
- To understand anything in science, things have to have a name that is recognized and universal:
  - naming chemical elements
  - naming living organisms
  - naming rocks and minerals
  - naming the human genome
  - naming 'categories'
  - naming textual information
What about naming video information?
Problem statement
[Figure: multimedia archives are stored as raw bit streams; the problem is to label them with semantic concepts such as Hu Jintao, Basketball, Table, Tree, US flag, Building, Aircraft, Dog, Tennis, Mountain, and Explosion]
Different low-level features
[Figure: example features, such as a color histogram and texture measures like regularity, coarseness, and directionality]
Each feature yields a vector representation of the visual data.
Basic example: color histogram
[Figure: a 640 x 380 pixel image, 243,200 pixels in total, summarized as a histogram of pixel counts per color]
The histogram is a summary of the data, in this case summarizing color characteristics.
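As a sketch of the idea (using NumPy and a synthetic 640 x 380 image in place of the frame on the slide), a color histogram simply counts how often each quantized color occurs:

```python
import numpy as np

# A synthetic "image": height x width x 3 (RGB), values 0..255.
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(380, 640, 3), dtype=np.uint8)

def color_histogram(image, bins_per_channel=4):
    """Summarize an RGB image as a normalized color histogram.

    Each channel is quantized into `bins_per_channel` bins, giving a
    bins_per_channel**3 dimensional feature vector.
    """
    # Quantize each channel to a bin index in [0, bins_per_channel).
    q = (image.astype(np.int64) * bins_per_channel) // 256
    # Combine the three channel indices into one color index.
    index = (q[..., 0] * bins_per_channel + q[..., 1]) * bins_per_channel + q[..., 2]
    hist = np.bincount(index.ravel(), minlength=bins_per_channel ** 3)
    return hist / hist.sum()  # normalize: the histogram sums to one

hist = color_histogram(image)
print(hist.shape)  # a 64-dimensional vector summarizing 243,200 pixels
```

Whatever the image size, the summary has a fixed dimensionality, which is what makes it usable as a feature vector.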
Advanced example: codebook model
- Create a codeword vocabulary
  - Codeword annotation (e.g. Sky, Water)
- Discretize the image with codewords
- Represent the image as a codebook histogram
References: Leung and Malik, IJCV 2001; Sivic and Zisserman, ICCV 2003; van Gemert, PhD thesis, UvA, 2008.
[Figure: an image discretized into codewords and the resulting codebook histogram]
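The three steps above can be sketched as follows, assuming (as is common but not stated on the slide) that the vocabulary is built by k-means clustering of local descriptors; the descriptors here are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed setup: each image yields a set of local descriptors
# (e.g. color features around sampled points); random stand-ins here.
descriptors = rng.normal(size=(500, 8))

# 1) Create a codeword vocabulary: a few iterations of k-means.
def build_codebook(data, k=16, iterations=10):
    codebook = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(iterations):
        # Assign each descriptor to its nearest codeword.
        d = np.linalg.norm(data[:, None, :] - codebook[None, :, :], axis=2)
        assignment = d.argmin(axis=1)
        # Move each codeword to the mean of its assigned descriptors.
        for j in range(k):
            members = data[assignment == j]
            if len(members):
                codebook[j] = members.mean(axis=0)
    return codebook

# 2) Discretize an image: map its descriptors to codeword indices.
def assign_codewords(descriptors, codebook):
    d = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    return d.argmin(axis=1)

# 3) Represent the image as a normalized codebook histogram.
def codebook_histogram(descriptors, codebook):
    words = assign_codewords(descriptors, codebook)
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()

codebook = build_codebook(descriptors)
hist = codebook_histogram(rng.normal(size=(300, 8)), codebook)
```

Every image, regardless of how many points are sampled from it, ends up as one fixed-length histogram over the shared vocabulary.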
The goal: semantic video indexing
- The process of automatically detecting the presence of a semantic concept (e.g. Airplane) in a video stream
Semantic indexing
- The computer vision approach
  - Building detectors one at a time
[Figure: a face detector for frontal faces; 3 years later, a face detector for non-frontal faces]
One (or more) PhD for every new concept.
So how about these?
Road, Beach, Boat, Animal, Building, Graphic, People, Car, Vegetation, Overlayed Text, Studio Setting, Outdoor, ...
... and the more than 1,000 others.
Generic concept detection in a nutshell
Training: labeled examples (e.g. outdoor, aircraft) feed feature extraction and a supervised learner.
Testing: video feeds feature measurement and classification, yielding "It is an aircraft, probability 0.7" and "It is outdoor, probability 0.95".
K nearest neighbor
[Figure: labeled samples in a two-dimensional feature space (F1, F2); a new sample is classified by a majority vote over its k nearest neighbors]
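The nearest-neighbor rule pictured above fits in a few lines; this minimal sketch uses two toy clusters in the (F1, F2) feature space:

```python
import numpy as np

def knn_predict(train_x, train_y, query, k=3):
    """Label a query point by majority vote over its k nearest neighbors."""
    distances = np.linalg.norm(train_x - query, axis=1)
    nearest = train_y[np.argsort(distances)[:k]]
    values, counts = np.unique(nearest, return_counts=True)
    return values[counts.argmax()]

# Two clusters in a two-dimensional feature space (F1, F2).
train_x = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
                    [1.0, 1.0], [0.9, 1.1], [1.2, 0.8]])
train_y = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(train_x, train_y, np.array([0.15, 0.2])))  # → 0
print(knn_predict(train_x, train_y, np.array([1.05, 0.9])))  # → 1
```

There is no training phase at all: the labeled examples themselves are the model.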
Linear classification
[Figure: the same feature space (F1, F2), with the two classes separated by a linear decision boundary]
Support vector machine
- Support Vector Machine
  - Learns from provided examples
  - Maximizes the margin between two classes
SVM usually is a good choice.
[Figure: feature space (F1, F2) with the maximum-margin decision boundary and its margin]
Supervised Learner
- Depends on many parameters
  - Select the best of multiple parameter combinations
  - Using cross validation
[Figure: an SVM maps a feature vector to a semantic concept probability, with a weight for the positive class and a weight for the negative class]
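Parameter selection by cross validation can be sketched as below, assuming scikit-learn (not the toolkit used in the lecture) and synthetic data; the parameter grid shown is illustrative, not the one the authors used:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Two synthetic classes in a 5-dimensional feature space.
rng = np.random.default_rng(2)
x = np.vstack([rng.normal(0, 1, size=(50, 5)),
               rng.normal(2, 1, size=(50, 5))])
y = np.array([0] * 50 + [1] * 50)

# Try multiple parameter combinations; keep the best under cross validation.
grid = GridSearchCV(
    SVC(probability=True),  # probabilistic output, as on the slide
    param_grid={"C": [0.1, 1, 10], "gamma": ["scale", 0.1]},
    cv=3,
)
grid.fit(x, y)

# Semantic concept probability per sample.
probabilities = grid.predict_proba(x)[:, 1]
```

The same grid search is repeated per concept, since the best parameter combination generally differs from concept to concept.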
How to improve concept detection?
[Figure: several feature extraction and supervised learner pipelines running in parallel, combined by vector concatenation & normalization]
Feature fusion: multimodal
References: Snoek, ACM Multimedia 2005; Magelhaes, CIVR 2007.
[Figure: visual and textual feature extraction combined by feature fusion, followed by a single supervised learner]
+ Only one learning phase
+ Truly a multimedia representation
- Multimodal combination often ad hoc
- One modality may dominate
- Feature vectors easily become too large
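Early fusion by vector concatenation can be sketched as below; the per-vector normalization (an assumption of this sketch, one of several possible schemes) is what keeps one modality from dominating just because its raw counts are larger:

```python
import numpy as np

def early_fusion(feature_vectors):
    """Concatenate per-modality feature vectors into one representation.

    Each vector is L1-normalized first, so that no single modality
    dominates merely because its histogram has larger counts.
    """
    normalized = [v / v.sum() if v.sum() else v for v in feature_vectors]
    return np.concatenate(normalized)

visual = np.array([3.0, 1.0, 0.0, 4.0])  # e.g. a color histogram
textual = np.array([10.0, 0.0, 30.0])    # e.g. word counts from speech
fused = early_fusion([visual, textual])
print(fused.shape)  # the fused vector grows with every added modality
```

The downside listed on the slide is visible immediately: the fused dimensionality is the sum of all modality dimensionalities.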
Feature fusion: unimodal
References: van de Sande, CIVR 2008.
[Figure: a point sampling strategy (Harris-Laplace salient points or dense sampling) feeds color feature extraction; a codebook model turns the image into a bag-of-features histogram of relative codeword frequencies; a spatial pyramid yields multiple bags-of-features, one per image region]
+ Codebook model reduces dimensionality
- Combination still ad hoc
- One feature may dominate
Classifier fusion: multimodal
References: Wu, ACM Multimedia 2004; Snoek, ACM Multimedia 2005.
[Figure: visual and textual feature extraction each feed their own supervised learner; the outputs are combined by classifier fusion]
+ Focus on modality strength
+ Fusion in semantic space
- Expensive in terms of learning effort
- Possible loss of feature space correlation
Classifier fusion: unimodal
References: Snoek, TRECVID 2006; Wang, ACM MIR 2007.
[Figure: global, regional, and keypoint image feature extraction feed a Support Vector Machine, Logistic Regression, and a Fisher Linear Discriminant; their outputs are aggregated by a geometric mean]
+ Aggregation functions reduce learning effort
+ Offers the opportunity to use all available examples efficiently
- A linear function is likely to be sub-optimal
Modeling relations
- Exploitation of conceptual co-occurrence
  - Concepts do not occur in a vacuum
  - On the contrary, they are related (e.g. Sky and Aircraft)
References: Naphade and Huang, TMM 3(1) 2001; IBM 2003.
- What is sports?
  - Answer: a combination of various individual sports
Modeling relations
- Learning co-occurrence
  - Explicitly model relations, using graphical models
    - Computationally complex
    - Limited scalability
  - Implicitly learn relations, using SVM or data mining tools
    - Assumes the classifier learns the relations
    - Suffers from error propagation
References: IBM 2003; Qi, ACM Multimedia 2007; Liu, IEEE TMM 2008.
IBM's pipeline
References: IBM 2003; Naphade and Huang, TMM 3(1) 2001.
[Figure: IBM's pipeline, annotated with the stages feature fusion, classifier fusion, and modeling relations]
Semantic Pathfinder
[Figure: the semantic pathfinder. A content analysis step extracts visual and textual features, combines them into multimodal features, and trains a supervised learner. A style analysis step adds layout, content, capture, and context feature extraction. A context analysis step combines semantic features. For each concept (e.g. Animal, Sports, Vehicle, Flag, Fire, Entertainment, Monologue, Weather news, Hu Jintao), the best of the three paths is selected after validation.]
Semantic Pathfinder
[Figure: the same semantic pathfinder diagram, annotated to show where feature fusion, classifier fusion, and modeling relations occur within the content, style, and context analysis steps]
Tsinghua University
[Figure: Tsinghua University's concept detection pipeline]
Tsinghua University
[Figure: the same pipeline, annotated with feature fusion, classifier fusion, and modeling relations]
Fragmented research efforts…
- Video analysis researchers
  - Until 2001 everybody defined her or his own concepts
  - Using specific and small data sets
  - Hard to compare methodologies
- NIST
  - Since 2001, a worldwide evaluation by NIST
NIST TRECVID benchmark
- Benchmark objectives (anno 2001)
  - Promote progress in video retrieval research
  - Provide a common dataset (shots, recognized speech, key frames)
  - Use open, metrics-based evaluation
- Large international field of participants
  - ... and the 70 others
- Currently the de facto standard for evaluation
[Figure: TRECVID evolution 2001-2006 in data, tasks, and participants (source: Paul Over, NIST). Data grows from the NIST and Prelinger archives, via CNN and C-Span and ABC and CNN news, to English, Chinese, and Arabic TV news, with hours of training and test data rising towards 180. Tasks expand from shots and search to concepts, stories, BBC rushes, and camera motion. The numbers of teams that applied and finished grow towards 70, with 10, 18, 51, 40, and 55 peer-reviewed papers.]
NIST TRECVID Benchmark
Concept detection task
- Given:
  - a video dataset segmented into a set S of unique shots
  - a set of N semantic concept definitions
- Task:
  - How well can you detect the concepts?
  - Rank the shots in S based on the presence of each concept from N
Measuring uncertainty
[Figure: a ranked result list, with the set of retrieved items, the set of relevant items, and their overlap, the set of relevant retrieved items]
- Precision: the fraction of retrieved items that is relevant
- Recall: the fraction of relevant items that is retrieved
Precision and recall have an inverse relationship.
TRECVID evaluation measures
- Classification procedure
  - Training: many hours of (partly) annotated video
  - Testing: many hours of unseen video
- Evaluation measure: Average Precision
  - Combines precision and recall
  - Averages precision after every relevant shot
  - Top of the ranked list is most important

AP = (1/1 + 2/3 + 3/4 + ...) / (total number of relevant shots)
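The AP formula above fits in a few lines of Python; this sketch assumes, for simplicity, that every relevant shot appears somewhere in the ranked list, so the denominator equals the number of relevant shots found:

```python
def average_precision(ranked_relevance):
    """Average precision over a ranked list.

    `ranked_relevance` is a list of booleans: True where the shot at that
    rank is relevant. Precision is averaged after every relevant shot, so
    the top of the ranked list matters most.
    """
    hits = 0
    precisions = []
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

# Relevant shots at ranks 1, 3, and 4 of a five-shot ranking:
ap = average_precision([True, False, True, True, False])
print(ap)  # (1/1 + 2/3 + 3/4) / 3 ≈ 0.806
```

Swapping the relevant shot at rank 1 with the irrelevant one at rank 5 drops the AP sharply, which is exactly the "top of the list is most important" behavior.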
Semantic Pathfinder @ TRECVID
With the MediaMill team
- 2004: The Good
- 2005: The Bad (ill-defined concepts / few examples)
- 2006: The Ugly (exploit TV repetition)
491 detectors, a closer look
The number of labeled image examples used at training time seems decisive for concept detector accuracy.
Demo time!
Concept detector: requires examples
- TRECVID's collaborative research agenda has been pushing manual concept annotation efforts
[Figure: publicly available annotation lexicons grow from 17, 32, 39, and 101 concepts towards 374 and 491 annotated concepts (LSCOM, MediaMill - UvA, and others)]
Concept definition
- MM078: Police/Security Personnel
  - Shots depicting law enforcement or private security agency personnel.
Collaborative annotation tool
- Manual annotation by 100+ TRECVID participants
  - Incomplete, but reliable
References: Christel, Informedia, 2005; Volkmer et al., ACM MM 2005.
[Screenshot: the TRECVID 2005 collaborative annotation tool]
Manual annotations: LSCOM-lite
- LSCOM: Large-Scale Concept Ontology for Multimedia
  - Aims for an ontology of 1,000 annotated concepts
- LSCOM-Lite: annotations for 39 semantic concepts
  - Used in TRECVID 2005 and 2006
References: Naphade et al., IEEE Multimedia 2006.
TRECVID Criticism
- Focus is on the final result
  - TRECVID judges the relative merit of indexing methods
  - It ignores the repeatability of intermediate analysis steps
- Systems are becoming more complex
  - Typically combining several features and learning methods
- Component-based optimization and comparison is impossible
[Figure: the semantic pathfinder architecture; what is the contribution of these components?]
MediaMill Challenge
- The Challenge provides
  - A manually annotated lexicon of 101 semantic concepts
  - Pre-computed low-level multimedia features
  - Trained classifier models
  - Five experiments
  - A baseline implementation together with baseline results
- The Challenge allows you to
  - Gain insight into intermediate video analysis steps
  - Foster repeatability of experiments
  - Optimize video analysis systems on a component level
  - Compare with and improve upon the baseline
- The Challenge lowers the threshold for novice multimedia researchers
[Figure: five experiments defined on the pipeline: visual feature extraction, textual feature extraction, early fusion, late fusion, and combined analysis, each followed by a supervised learner]
Available online: http://www.mediamill.nl/challenge/
MediaMill Challenge
- Advantages
  - For research
    - People can focus on the experiment for which they have the expertise, without having to do all the processing: pure computer vision, pure natural language processing, pure machine learning, ...
  - For education
    - Students can do large scale experiments, compare themselves to each other, ... and to the state-of-the-art
Columbia374
- Baseline for 374 concept detectors
  - Focus is on visual analysis experiments
Available online.
Case study: Fabchannel.com
- Fabchannel narrowcasts concerts from the Amsterdam Paradiso and Melkweg venues
  - Currently +/- 700 concerts online
- Fabchannel's request
  - What can you do with 45 hours of live concerts?
- Answer:
  - Let's try the semantic pathfinder to detect concert concepts
Demo
Results for singer
Demo
Results for drummer
Conclusions
- An international community is building a bridge to narrow the semantic gap
  - Currently detects more than 500 concepts in broadcast video
  - Generalizes outside the news domain
- Important lessons
  - No single method is superior for all concepts
  - It is best to learn the optimal approach per concept
  - The best methods cover variation in features, classifiers, and concepts