Efficient video identification based on locality sensitive hashing and triangle inequality

EFFICIENT VIDEO IDENTIFICATION
BASED ON LOCALITY SENSITIVE
HASHING AND TRIANGLE INEQUALITY

Yang Zixiang

NATIONAL UNIVERSITY OF SINGAPORE
2005


Name:

Yang Zixiang

Degree:

Master of Science

Dept:

Computer Science

Thesis Title:

Efficient Video Identification Based on Locality Sensitive Hashing and Triangle Inequality

Abstract
Video identification, the search for duplicated versions of a video clip in a large video database, requires fast and robust similarity search in high-dimensional space. Locality sensitive hashing (LSH) is a well-known indexing method for efficient approximate similarity search in such spaces. In this thesis, we present a highly efficient video identification method for transcoded video content based on locality sensitive hashing and the triangle inequality. To handle a large volume of videos, we design a compact feature dataset and index it using an improved locality sensitive hashing scheme. In addition, we employ the triangle inequality to further improve system efficiency. Experimental results demonstrate that, once the features of a given 8 s query video have been extracted, it takes about 0.17 s to retrieve it from a 96-hour video database. Furthermore, our system is robust to changes in the frame size, frame rate and compression bit-rate of the query videos.

Keywords:

video identification
video search
video hashing
locality sensitive hashing


EFFICIENT VIDEO IDENTIFICATION
BASED ON LOCALITY SENSITIVE
HASHING AND TRIANGLE INEQUALITY

Yang Zixiang
B. Eng. (Hons), XJTU, P. R. China

A THESIS SUBMITTED
FOR THE DEGREE OF MASTER OF SCIENCE
DEPARTMENT OF COMPUTER SCIENCE
SCHOOL OF COMPUTING
NATIONAL UNIVERSITY OF SINGAPORE
2005


Acknowledgements

I sincerely thank my supervisors, Dr. Sun Qibin and Dr. Ooi Wei Tsang, who have guided and supported me throughout my postgraduate years. Their suggestions for improvement and faith in my work have strengthened my confidence. I have benefited tremendously, both technically and personally, from their guidance and supervision.
I send my sincere regards to all the colleagues with whom I worked during my academic years, for their valuable suggestions: Dr. Tian Qi, Dr. Heng Wei Jyh, Dr. Gao Sheng, Dr. Zhu Yongwei, Dr. He Dajun and Mr. Zhang Zhishou. I also thank the many friends who have contributed in one way or another, Mr. Yuan Junsong, Mr. Yang Xianfeng, Mr. Wang Dehong, Mr. Ye Shuiming, Mr. Zhou Zhicheng and Mr. Li Zhi, to name a few, for their help and encouragement.
Finally, special thanks go to my family members, who gave me their continual moral support to complete this course.



Publications
Zixiang Yang, Wei Tsang Ooi and Qibin Sun, “Hierarchical, non-uniform locality sensitive hashing and its application to video identification,” in Proceedings of International Conference on Multimedia and Expo, Jun 2004, Taipei, Taiwan.

Wei Jyh Heng, Yu Chen, Zixiang Yang and Qibin Sun, “Classroom assistant for realtime information retrieval,” in Proceedings of International Conference on Information Technology: Research and Education, pp.436-439, Aug 2003, Newark, New Jersey, USA.



Contents
Acknowledgements
Publications
Contents
List of Figures
List of Tables
Summary
1 Introduction
1.1 Classification for Video Search Systems
1.1.1 “Query by Keywords” and “Query by Video Clip”
1.1.2 Video Retrieval and Video Identification
1.2 Different Levels of Video Identification
1.3 Different Tasks of Video Identification
1.4 Objectives
1.5 Organization of Thesis
2 Background and Related Work
2.1 Content-Based Video Identification: A Survey
2.1.1 Architecture of a Video Storage and Identification System
2.1.2 Video Segmentation and Feature Extraction
2.1.3 Similarity Measuring
2.1.4 Feature Vectors Indexing
2.1.5 Some Well-known Video Search Systems
2.2 Similarity Search via Database Index Structure
2.3 Introduction to Locality Sensitive Hashing
3 Efficient Video Identification Based on Locality Sensitive Hashing and Triangle Inequality
3.1 System Overview
3.2 Slide Search Window on Query Video
3.3 Improvements to Locality Sensitive Hashing
3.3.1 Description of Locality Sensitive Hashing
3.3.2 Improvements to Locality Sensitive Hashing
3.4 Skip Redundant Match Operations by Triangle Inequality
3.5 Feature Extraction
4 Experimental Results and Discussion
4.1 Feature Dataset of the Video Database
4.2 Query Video Datasets
4.3 Performance of HNLSH
4.4 Performance of Video Identification
4.5 Comparison with NTT’s “Active Search”
5 Conclusions and Future Work
5.1 Conclusions
5.2 Future Work
Bibliography



List of Figures
1.1 Two types of classifications for video search systems
1.2 Different levels of video identification
2.1 Architecture of a video storage and identification system
2.2 Structure of video segmentation and feature extraction module
2.3 Architectural diagram of a video retrieval system
2.4 Interface of Informedia system
2.5 A 2D example of merging the results from multiple hash tables
2.6 Disk accesses comparison between LSH and SR-tree
3.1 System overview
3.2 A usual video search algorithm
3.3 Slide search window on query video
3.4 Locality sensitive hashing
3.5 Hierarchical partitioning in locality sensitive hashing
3.6 Non-uniform selection of partitioned dimensions in locality sensitive hashing
3.7 PDF of Gaussian distributions for different variances
3.8 Illustration of HNLSH for video identification
3.9 Skip redundant match operations by triangle inequality
3.10 Quantization of the HSV color space
3.11 Frame partition
4.1 A distance pattern between the query video and the videos in database
4.2 Distance distribution of the query video and the videos in database
4.3 Performance of HNLSH
4.4 Performance of video identification
5.1 Incorporate hierarchical feature vectors with hierarchical hash tables
5.2 Process diagram for special domain video indexing



List of Tables
4.1 Number of hash tables N vs. miss rate
4.2 Summary of the performance for video identification
4.3 Comparison of our algorithm and NTT's "active search"



Summary
The problem of content-based video identification is to identify duplicated versions of a given short query video clip in a large video database based on content similarity. Video identification has many applications, including tracking news reports across channels, video copyright management on the internet, detection and statistical analysis of broadcast commercials, and video database management. Three key steps in building a video database for video identification are (i) video segmentation and feature extraction to represent the video clips; (ii) similarity measurement between the query video and the videos in the database; and (iii) indexing of the feature vectors to allow efficient search for similar videos.
In this thesis, we present a highly efficient transcoding-level video identification system for a large video database, designed by considering “feature extraction”, “feature indexing” and “video database construction” together. The selected feature is robust to changes in frame size, frame rate and compression bit-rate. Principal component analysis (PCA) and an improved locality sensitive hashing (LSH, an index structure from the database area) are then used to reduce the dimensionality of the feature space and to generate the index code. The original LSH works well only for uniformly distributed high-dimensional data points, and can be improved for video identification, where data points may be clustered. We therefore make two improvements to LSH to distribute the points more evenly. First, by building a hierarchical hash table, we adapt the number of hashed dimensions to the density of the data points. Second, we choose the hashed dimensions carefully so that the points are hashed more evenly, which makes the hash table more uniformly populated and reduces the miss rate. We further apply the triangle inequality to the buckets returned by LSH to skip some redundant match operations. In terms of system design, to reduce the storage needed for the video database’s feature dataset, we slide the search window over the query video rather than over the videos in the database.
Experimental results verify that our improved LSH is much better than the original LSH in both efficiency and accuracy when applied to the video feature dataset for similarity search. For video identification, our system is robust to transcoding-level noise, i.e. changes in frame size, frame rate and compression bit-rate. By combining the improved LSH with the triangle inequality, we greatly reduce the search space and the number of redundant match operations, which improves efficiency. We further demonstrate the promising system performance by comparing our algorithm with NTT’s “active search” algorithm. The use of LSH with the triangle inequality and the sliding of the search window over the query video are the two main contributions of this research work.
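
To illustrate how the triangle inequality can skip match operations inside a bucket, the following is a minimal sketch and not the scheme detailed in Chapter 3. It assumes Euclidean distance, a single hypothetical reference point whose distance to every database vector is computed once at build time, and a fixed match threshold; the names `match_with_pruning`, `reference` and `threshold` are illustrative only.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def match_with_pruning(query, candidates, reference, threshold):
    """Return (matches, skipped) for the candidates within `threshold` of `query`.

    `candidates` is a list of (vector, dist_to_reference) pairs; the second
    element is assumed to be precomputed when the database is built.
    """
    d_query_ref = euclidean(query, reference)
    matches, skipped = [], 0
    for vector, d_cand_ref in candidates:
        # Triangle inequality: d(q, x) >= |d(q, r) - d(x, r)|.
        # If this lower bound already exceeds the threshold, x cannot match,
        # so the full distance computation is skipped.
        if abs(d_query_ref - d_cand_ref) > threshold:
            skipped += 1
            continue
        if euclidean(query, vector) <= threshold:
            matches.append(vector)
    return matches, skipped


# Hypothetical 2-D feature vectors standing in for one LSH bucket.
reference = (0.0, 0.0)
database = [(1.0, 0.0), (1.1, 0.1), (5.0, 5.0), (0.0, 9.0)]
bucket = [(v, euclidean(v, reference)) for v in database]
matches, skipped = match_with_pruning((1.0, 0.1), bucket, reference, 0.5)
```

In this toy bucket, two of the four candidates are rejected by the lower bound alone, so only two full distance computations are needed.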



Chapter 1
Introduction
We live in a world of information. Information was first delivered to the general public through broadcasting media such as newspapers, radio, and eventually television. Later, the computer was invented. Computers allow information to be compiled in digital form and make it possible for people to search for the information they need. Furthermore, information can be retrieved selectively when required, which is very useful when querying a huge database. Given the great success of text search engines such as Google and Yahoo, researchers started to wonder whether the same concept could be applied to videos, since digital video has become increasingly popular with the development of hardware and of video compression standards such as MPEG. There is a wide range of applications for content-based video search. For example, you may be interested in a historic event or a scene involving a movie star, but have only a little material about it. With an effective video retrieval system, you can find more detailed video content. Video producers may be interested in how their publications are spread around the world; with a video identification system, they can find out whether illegal copies exist. A video search system is also useful for video editors, who can search for useful video clips with a simple query instead of spending hours browsing unrelated video content. For video database management, videos with similar content can be clustered to facilitate browsing. In [1], Hong-Jiang Zhang summarized the state-of-the-art technologies, directions, and important applications for research on content-based video retrieval. Some applications are:


• Professional and educational applications
  o Automated authoring of web video content
  o Searching and browsing large video archives
  o Easy access to educational video material
  o Indexing and archiving multimedia presentations
  o Indexing and archiving multimedia collaborative sessions
• Consumer domain applications
  o Video overview and access
  o Video content filtering
  o Enhanced access to broadcast video

While video is widely accepted as a form of broadcasting media, the ability to search through video content has only recently been investigated. Text search in documents simply looks for matching words, and it has achieved great success. A straightforward approach to indexing a video database is therefore to represent the visual content in textual form (e.g. keywords and attributes). These keywords serve as indices to access the associated visual data. This approach has the advantage that visual databases can be accessed using standard query languages such as SQL. However, it requires too much manual annotation and processing. More seriously, the descriptive data are not reliable because they do not conform to a standard language, so they are inconsistent and might not capture the video content. The retrieval results may therefore be unsatisfactory, since the query is based on features that have been inadequately represented. In fact, searching for content within a video sequence is much more complicated, and different video search applications have different kinds of inputs and requirements. We can classify video search systems into “query by keywords” and “query by video clip” based on their inputs, or into video retrieval and video identification based on their results. We give more details about these categories in the next section.

1.1 Classification for Video Search Systems
1.1.1 “Query by Keywords” and “Query by Video Clip”
We can classify video search systems into “query by keywords” and “query by video clip” based on their inputs. In the first case, query by keywords, we give the video search system several keywords to find a category of video clips, and the returned video clips are ranked by their similarity to the query keywords. Here, keywords refer not only to text but also to other properties that describe the video content, such as shape and color. “Query by keywords” is a semantic-level video retrieval application [2, 3, 4] which works just like a text search engine. Its advantage is ease of use: users only need to give the system some keywords or descriptions of what they want. However, since text cannot represent the content of video well, the returned results may be unsatisfactory. In the second case, query by video clip, an example video clip is used as the query to search for similar videos; this has also been actively researched [5, 6, 7]. This kind of system is suitable when users cannot clearly describe what they want in keywords, when a text index structure for the video database is unavailable, or when they want to search for specific video clips, as in pirated video copy detection. Compared with “query by keywords”, “query by video clip” provides a more flexible way to search the video database, because a well-built text index structure is usually unavailable for a large video database: manually labeling the whole video database is very laborious, while the performance of automatic indexing is poor. For “query by video clip”, the query clip can be a sub-shot, a shot or several shots, depending on the requirements of the users. Since the query clip is usually a logical story unit with a cohesive semantic meaning, “query by video clip” is a more natural way for users to access and search the video database. The applications of “query by video clip” comprise video copyright management [8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20], video content identification in broadcast video [21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36], and similar video content search by a given example [5, 6, 7, 37, 38, 39, 40, 41].

1.1.2 Video Retrieval and Video Identification
We can classify video search systems into video retrieval and video identification based on their results. For video retrieval, we measure the similarity between the query and the video clips in the database. The resulting video clips are ranked by similarity and returned to the users, who browse the results and decide which one is exactly what they wanted, just as with a text search engine. Thus, video retrieval is a measurement problem. For video identification, the system needs to decide whether or not a video clip in the database is a duplicated version of the query, based on the similarity measure, so video identification is a decision problem. Video identification is a relatively new area compared with video retrieval: video retrieval has been researched extensively for more than ten years, whereas video identification has only recently been proposed as a topic. The two areas are similar in some respects. Some of the main research issues in video retrieval, including video content representation and indexing, are shared by video identification, and video identification can inherit many techniques from video retrieval. For example, representation schemes used in video retrieval systems, such as key frame representation, color histogram features and motion histograms, are also used in some video identification systems [11, 24]. However, video retrieval and video identification differ in several ways:
Firstly, the query is different. The query for video retrieval can be text, shape, color or other properties that describe the video content; it can also be a video clip. For video identification, the query must be a video clip. Therefore, video identification always belongs to “query by video clip”, while “query by video clip” also includes those video retrieval systems that use a video clip as the query.
Secondly, video retrieval aims to find video clips that look similar to the query in some way, for example because they contain similar objects, while video identification aims to identify video clips that are perceptually the same as the query, except for quality differences or the effects of various video editing operations. Results that are similar to the query only at the semantic level are valid for video retrieval, but for video identification they may be false alarms. Thus, the features for video identification need to be far more discriminative, but they do not need to be semantic, as those used for video retrieval do.
Thirdly, video retrieval generally includes a relevance feedback loop that incorporates user interaction, i.e. users decide which of the returned video clips are better, whereas a video identification system outputs the final results itself. That is to say, video retrieval generally needs more manual work, for example in feature extraction, data supervision and training, because artificial intelligence currently performs poorly on semantic-level applications.
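
The distinction drawn at the start of this section between a measurement problem and a decision problem can be made concrete with a minimal sketch. It assumes that a distance to the query has already been computed for every clip; the clip identifiers, distance values and threshold are hypothetical and chosen only for illustration.

```python
def retrieve(distances, top_k):
    """Video retrieval: rank clips by distance and return the best top_k.

    The user, not the system, makes the final choice among the ranked clips.
    """
    ranked = sorted(distances.items(), key=lambda item: item[1])
    return [clip_id for clip_id, _ in ranked[:top_k]]


def identify(distances, threshold):
    """Video identification: decide for each clip whether it is a duplicate."""
    return {clip_id: dist <= threshold for clip_id, dist in distances.items()}


# Hypothetical distances between one query clip and three database clips.
distances = {"clip_a": 0.02, "clip_b": 0.40, "clip_c": 0.75}
ranked = retrieve(distances, top_k=2)
decisions = identify(distances, threshold=0.05)
```

Retrieval returns `clip_b` as its second-best result even though it is far from the query, whereas identification rejects it outright; this is why a merely similar clip counts as a false alarm in identification.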
[Diagram: “query by video clip” covers video identification and part of video retrieval; “query by keywords” covers video retrieval only.]
Figure 1.1 Two types of classifications for video search systems


Figure 1.1 shows the relation between the above classifications for video search systems. Since the topic of this thesis is video identification, we will not discuss “query by keywords” any further. For “query by video clip”, the differences between video retrieval and video identification lead to different considerations and emphases in the system framework, even though both are described as “similarity video search”. For video retrieval, the task of retrieving video clips similar to the query at the concept level brings the challenge of capturing and modeling the semantic meaning inherent in the query. With an appropriate semantic model and similarity definition, video clips (a shot or several shots) with a concept similar to that of the query can be found [42]. For video identification, however, the recognition task is relatively simple, and complex concept-level content modeling is usually unnecessary for identifying and locating the duplicated versions of the query. Instead, the features or signatures are expected to be compact and robust to variations introduced by digitization, coding and post-editing, e.g. different frame size, frame rate, compression bit-rate and color shifting.
Furthermore, the methods and intentions for organizing and managing the video database differ between video retrieval and video identification. Both tasks need to organize and index the video database, but their purposes are fundamentally different, even though both use the term “video indexing”. For video retrieval, “video indexing” refers to annotating the video content and classifying it into different concepts or semantic classes, which helps the user to browse and retrieve the video content more effectively. In video identification, on the other hand, “video indexing” means applying basic database index techniques, e.g. a tree structure or a hash index [43, 44], to organize the feature dataset extracted from the video content. Such a database index structure aims to accelerate the search. Its nodes do not carry semantic-level meaning, unlike those of a video retrieval index, whose nodes carry such meaning in order to facilitate browsing of the video content.
Finally, the search speed requirements are different for video retrieval and video identification. In video retrieval, search speed is normally not the main concern, since precision and recall are not yet good enough. The bottleneck is the gap between low-level perceptual features and high-level semantic concepts. For video identification, however, search speed is a major concern, because its applications usually target a very large video database or a time-critical online environment. On the other hand, compared with video retrieval, the task of video identification is relatively simple. In general, video identification can achieve quite high precision and recall, which makes efficient search possible.
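
For reference, precision and recall as used above follow their standard definitions: precision is the fraction of returned clips that are correct, and recall is the fraction of correct clips that are returned. The short sketch below computes both; the clip identifiers are hypothetical.

```python
def precision_recall(returned, relevant):
    """Compute (precision, recall) for a set of returned clip identifiers."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# The system returns four clips; three clips in the database are true duplicates.
precision, recall = precision_recall(
    returned=["v1", "v2", "v3", "v4"],
    relevant=["v1", "v2", "v5"],
)
```

Here two of the four returned clips are correct, giving a precision of 0.5, and two of the three true duplicates are found, giving a recall of 2/3.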
Video identification and video retrieval are research issues at different levels. In fact, even within video identification there are research problems at different levels. We describe these levels in the next section.



1.2 Different Levels of Video Identification
[Diagram with columns “Query Video Clips”, “Potential Resulted Video Clips” and “Levels”. The levels are ordered from large change from the original copy (hard) to small change (easy): nearly duplicate version detection (recorded by two cameras from different angles); shot level video editing (the order of the shots may be changed, or additional shots inserted); frame level video editing (the logo, subtitle, etc. may be changed); overall brightness, contrast, hue, saturation, etc. adjustment; transcoding (different frame size, frame rate, bit-rate, or different compression codec); nearly same version (recorded by two TV recorders with same conditions).]
Figure 1.2 Different levels of video identification
We divide video identification problems into 6 levels based on the noise between the original and the duplicated versions of a video clip. Figure 1.2 illustrates these levels. Systems for high-level, or semantic-level, video identification problems have to be robust to large noise, such as recording by cameras at different angles, different shot orders and various video editing operations. These systems are more concerned with precision and recall than with search speed. They usually need to apply models and semantic-level features to achieve acceptable results, which is a relatively difficult task. Low-level, or exact-match-level, video identification problems are easier. They involve only small noise, such as frame shift, transcoding and overall brightness adjustment. Since precision and recall of nearly 100% can be achieved, low-level video identification systems are more concerned with search speed and scalability. They usually do not apply models, and their features do not need to be semantic but have to be far more discriminative. More details and some typical research works for each level are listed here:
1) Nearly duplicated version detection: The duplicated version of the video clip may be recorded by cameras from different angles. Because of the different view angles, some objects may be occluded while other objects may appear. Dong Qing Zhang et al. [36] presented a part-based image similarity measure derived from the stochastic matching of Attributed Relational Graphs that represent the compositional parts and part relations of image scenes. They compared this model with several prior similarity models, such as color histogram and local edge descriptor, and it outperformed the prior approaches by a large margin.
2) Shot level video editing: The order of the shots in the duplicated version may be different, or shots may have been inserted into or deleted from the original version. Victor Kulesh et al. [25] presented an approach to video clip recognition based on HMM and GMM for modeling the video and audio streams respectively. Their method can detect a new, shorter version of a video clip produced by removing some shots from the original one.
3) Frame level video editing: The video editing operations are limited to the frame level; the logo, subtitles, etc. may be changed. Timothy C. Hoad et al. [14] presented a shot-length comparison method for video identification. This method was found to be extremely robust to changes in the video, including alterations to the colors, changes in frame size, frame rate and bit-rate, and the introduction of analogue interference, because the feature is not related to the content of any single frame.
4) Overall brightness, contrast, hue, saturation, etc. adjustment: This is common in conversion between TV standards (such as PAL and NTSC). Color (brightness) ordinal features are useful for this kind of video identification [28, 33, 37], since the ordinal measure is insensitive to uniform color shifting.
5) Transcoding level: The duplicated version is transcoded from the original version. It may differ in frame size, frame rate, bit-rate or compression codec. Oostveen et al. [17] proposed a new hashing solution (i.e., perceptual/robust hashes or fingerprints) and a database index strategy for video identification. Their fingerprints are robust to the above transcoding operations. Unfortunately, they did not report their search speed. Our work in this thesis is also at this level.
6) Nearly same version level: The duplicated version may be captured from real-time TV broadcasting using a different TV recorder (under the same conditions) from the one used for the original version. There is only a little frame shift noise between the duplicated and original versions. Kunio Kashino et al. [31] proposed a quick search method for audio and video signals based on histogram pruning. They tested their algorithm on a 48-hour video database and obtained good search speed.
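
The insensitivity of the ordinal measure mentioned under level 4 can be illustrated with a minimal sketch. It assumes that a frame has been partitioned into blocks and that the mean intensity of each block is already available; the block values and the function name `ordinal_signature` are hypothetical, and the cited works [28, 33, 37] may define their features differently.

```python
def ordinal_signature(block_means):
    """Return the rank of each block's mean intensity (0 = darkest block)."""
    order = sorted(range(len(block_means)), key=lambda i: block_means[i])
    ranks = [0] * len(block_means)
    for rank, block_index in enumerate(order):
        ranks[block_index] = rank
    return ranks


# Hypothetical mean intensities of a frame split into 2 x 2 blocks.
original = [52.0, 120.5, 87.3, 200.1]
brighter = [m + 30.0 for m in original]        # uniform brightness shift
higher_contrast = [1.2 * m for m in original]  # uniform gain

signature = ordinal_signature(original)
```

Because only the relative order of the block intensities is kept, any change that preserves that order, such as a uniform shift or gain, leaves the signature unchanged.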


1.3 Different Tasks of Video Identification
Besides the above 6 levels, there are 3 different tasks of video identification:
1) Task 1 is to find the identical video clips by comparing the query video with the
videos in database [15]. The video database comprise of many short video clips.
This task does not need to locate a short query video in a long video in database.
2) Task 2 is to identify recurrences of specified video segments in a long
video clip [29]. The noise in task 2 is quite small because these recurring video
segments are in the same video clip, i.e. the query videos have none of the distortions, such as
changes in frame size, frame rate and compression bit-rate, found in a normal video
identification application.
3) Task 3 is to search for and locate a short query video clip in a large video database
that comprises many long video clips [17, 31]. This is the general case of video
identification and is more difficult than the above two. Our work in this
thesis falls into this category.
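To make task 3 concrete, the following is a minimal brute-force sketch of "search and locate". It assumes each video is already represented as a sequence of per-frame feature vectors. It is only a naive baseline for illustration, not the method of this thesis or of [17, 31]; the names `locate_in_database`, `database` and the sample data are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def locate_in_database(query, database):
    """Slide the query over every video and return the best match.

    query:    list of per-frame feature vectors
    database: dict mapping video id -> list of per-frame feature vectors
    Returns (video_id, frame_offset, total_distance).
    """
    best = (None, -1, math.inf)
    for video_id, frames in database.items():
        for offset in range(len(frames) - len(query) + 1):
            total = sum(
                euclidean(q, frames[offset + i]) for i, q in enumerate(query)
            )
            if total < best[2]:
                best = (video_id, offset, total)
    return best


database = {
    "clip_a": [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [2.0, 1.0]],
    "clip_b": [[9.0, 9.0], [5.0, 5.0], [6.0, 5.0], [0.0, 1.0]],
}
# A slightly distorted copy of frames 1..2 of clip_b.
query = [[5.1, 4.9], [5.9, 5.2]]

video_id, offset, distance = locate_in_database(query, database)
print(video_id, offset)
```

The cost of this exhaustive scan grows with the total number of frames in the database, which is why an index structure is needed for large collections.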

1.4 Objectives
Our work in this thesis is located at the second lowest level of video identification
problems, i.e. the transcoding level. The task is to search for and locate a transcoded
short query video clip in a large video database that comprises many long video
clips. That is to say, our objective is to build a highly efficient content-based video
identification system that is robust to transcoding-level noise, i.e. changes in
frame size, frame rate and compression bit-rate.

1.5 Organization of Thesis
The rest of this thesis is organized as follows. Chapter 2 gives a broad survey of
content-based video identification. Background on similarity search in high-dimensional databases and locality sensitive hashing (LSH) is also provided, since they
are closely related to this thesis. Chapter 3 presents our highly efficient video identification system for a large video database based on improved locality sensitive hashing
and triangle inequality. Chapter 4 evaluates our system performance. Finally, Chapter 5
concludes the thesis and points out future work.

Chapter 2
Background and Related Work
This chapter provides background and related work. First, we survey issues related
to video identification, including “feature extraction”,
“similarity measuring” and “index structures”. In-depth surveys of video
search can be found in [1, 45, 46, 47]. Second, we give background on
efficient similarity search in high-dimensional space via database index structures,
which is closely related to this thesis. Finally, we introduce locality sensitive hashing (LSH), a highly efficient index structure applied in our work.

2.1 Content-Based Video Identification: A Survey
2.1.1 Architecture of a Video Storage and Identification System
A systematic video database used for video identification has two main processes:
storage and identification. The storage process extracts features from videos and organizes these feature vectors for storage in the database. In the identification process,
an input query is represented by the appropriate features, and a search is performed on the
stored feature vectors to find the closest videos. A similarity metric is used to measure
the similarity between the query video and the videos in the database. The feature vector
indexing structure can improve the search efficiency. Figure 2.1 shows the architecture
of a video storage and identification system.
[Figure 2.1: block diagram. Blocks: Query; Query User Interface; Video Segmentation & Feature Extraction; Add New Videos into the Database; New Videos; Feature Vector Indexing; Similarity Measuring; Output; Output User Interface; Database (videos + features)]

Figure 2.1 Architecture of a video storage and identification system
In the above system, there are three key modules: (i) video segmentation and feature
extraction; (ii) similarity measuring; (iii) feature vector indexing. Some high-level or
semantic-level video search systems do not have the “feature vector indexing” module,
which is useful for increasing search speed, because at the current stage they are concerned only with precision and recall.
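The storage and identification processes and the three modules can be sketched as a toy class. This is an illustrative skeleton under strong assumptions: the feature (mean intensity per frame), the distance function and the class name `VideoDatabase` are placeholders and not the features or structures used in this thesis. The linear scan in `identify` marks the place where a feature vector index would be substituted.

```python
import math


def extract_features(frames):
    """Module (i), toy version: one mean intensity value per frame."""
    return [sum(frame) / len(frame) for frame in frames]


def distance(a, b):
    """Module (ii): Euclidean distance between equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class VideoDatabase:
    def __init__(self):
        # video id -> stored feature vector
        self.features = {}

    def add_video(self, video_id, frames):
        """Storage process: extract features and store them."""
        self.features[video_id] = extract_features(frames)

    def identify(self, frames):
        """Identification process: return the id of the closest stored video.

        Module (iii) would replace this linear scan with an index lookup.
        """
        query = extract_features(frames)
        return min(
            self.features,
            key=lambda vid: distance(query, self.features[vid]),
        )


db = VideoDatabase()
db.add_video("news", [[10, 12], [200, 198]])
db.add_video("sports", [[90, 95], [100, 105]])

# A slightly altered copy of the "news" clip.
result = db.identify([[11, 13], [199, 197]])
print(result)
```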

2.1.2 Video Segmentation and Feature Extraction
This module is the main part of the whole video search system. Much research
has been done on this module [48]. Figure 2.2 shows how to extract features to represent a video clip. Video has both spatial and temporal dimensions and hence a good
video index should capture the spatiotemporal contents of the scene. Normally, a video
is first segmented into elemental video segments (scenes or shots). For some video databases that comprise only short video clips (e.g. task 1 in Section 1.3), this step may
