

Semantic Mining
Technologies for
Multimedia Databases
Dacheng Tao
Nanyang Technological University, Singapore
Dong Xu
Nanyang Technological University, Singapore
Xuelong Li
University of London, UK

Information science reference
Hershey • New York


Director of Editorial Content: Kristin Klinger
Senior Managing Editor: Jamie Snavely
Managing Editor: Jeff Ash
Assistant Managing Editor: Carole Coulson
Typesetter: Amanda Appicello
Cover Design: Lisa Tosheff
Printed at: Yurchak Printing Inc.


Published in the United States of America by
Information Science Reference (an imprint of IGI Global)
701 E. Chocolate Avenue, Suite 200
Hershey PA 17033
Tel: 717-533-8845
Fax: 717-533-8661
E-mail:
Web site:

and in the United Kingdom by
Information Science Reference (an imprint of IGI Global)
3 Henrietta Street
Covent Garden
London WC2E 8LU
Tel: 44 20 7240 0856
Fax: 44 20 7379 0609
Web site:
Copyright © 2009 by IGI Global. All rights reserved. No part of this publication may be reproduced, stored or distributed in any form or by
any means, electronic or mechanical, including photocopying, without written permission from the publisher.
Product or company names used in this set are for identification purposes only. Inclusion of the names of the products or companies does
not indicate a claim of ownership by IGI Global of the trademark or registered trademark.
Library of Congress Cataloging-in-Publication Data
Semantic mining technologies for multimedia databases / Dacheng Tao, Dong Xu, and Xuelong Li, editors.
p. cm.
Includes bibliographical references and index.
Summary: "This book provides an introduction to the most recent techniques in multimedia semantic mining necessary to researchers new
to the field"--Provided by publisher.
ISBN 978-1-60566-188-9 (hardcover) -- ISBN 978-1-60566-189-6 (ebook) 1. Multimedia systems. 2. Semantic Web. 3. Data mining. 4.
Database management. I. Tao, Dacheng, 1978- II. Xu, Dong, 1979- III. Li, Xuelong, 1976-
QA76.575.S4495 2009
006.7--dc22
2008052436

British Cataloguing in Publication Data

A Cataloguing in Publication record for this book is available from the British Library.
All work contributed to this book is new, previously-unpublished material. The views expressed in this book are those of the authors, but not
necessarily of the publisher.


Table of Contents

Preface . ................................................................................................................................................ xv
Section I
Multimedia Information Representation
Chapter I
Video Representation and Processing for Multimedia Data Mining ...................................................... 1

Amr Ahmed, University of Lincoln, UK
Chapter II
Image Features from Morphological Scale-Spaces............................................................................... 32

Sébastien Lefèvre, University of Strasbourg – CNRS, France
Chapter III
Face Recognition and Semantic Features ............................................................................................. 80

Huiyu Zhou, Brunel University, UK

Yuan Yuan, Aston University, UK

Chunmei Shi, People’s Hospital of Guangxi, China
Section II
Learning in Multimedia Information Organization
Chapter IV
Shape Matching for Foliage Database Retrieval ................................................................................ 100


Haibin Ling, Temple University, USA

David W. Jacobs, University of Maryland, USA
Chapter V
Similarity Learning for Motion Estimation......................................................................................... 130

Shaohua Kevin Zhou, Siemens Corporate Research Inc., USA

Jie Shao, Google Inc., USA

Bogdan Georgescu, Siemens Corporate Research Inc., USA

Dorin Comaniciu, Siemens Corporate Research Inc., USA


Chapter VI
Active Learning for Relevance Feedback in Image Retrieval............................................................. 152

Jian Cheng, National Laboratory of Pattern Recognition, Institute of Automation, Chinese

Academy of Sciences, China

Kongqiao Wang, Nokia Research Center, Beijing, China

Hanqing Lu, National Laboratory of Pattern Recognition, Institute of Automation, Chinese

Academy of Sciences, China
Chapter VII
Visual Data Mining Based on Partial Similarity Concepts.................................................................. 166


Juliusz L. Kulikowski, Polish Academy of Sciences, Poland
Section III
Semantic Analysis
Chapter VIII
Image/Video Semantic Analysis by Semi-Supervised Learning......................................................... 183

Jinhui Tang, National University of Singapore, Singapore

Xian-Sheng Hua, Microsoft Research Asia, China

Meng Wang, Microsoft Research Asia, China
Chapter IX
Content-Based Video Semantic Analysis............................................................................................. 211

Shuqiang Jiang, Chinese Academy of Sciences, China

Yonghong Tian, Peking University, China

Qingming Huang, Graduate University of Chinese Academy of Sciences, China

Tiejun Huang, Peking University, China

Wen Gao, Peking University, China
Chapter X
Applications of Semantic Mining on Biological Process Engineering................................................ 236

Hossam A. Gabbar, University of Ontario Institute of Technology, Canada

Naila Mahmut, Heart Center - Cardiovascular Research Hospital for Sick Children, Canada

Chapter XI
Intuitive Image Database Navigation by Hue-Sphere Browsing......................................................... 263

Gerald Schaefer, Aston University, UK

Simon Ruszala, Teleca, UK


Section IV
Multimedia Resource Annotation
Chapter XII
Formal Models and Hybrid Approaches for Efficient Manual Image Annotation and Retrieval ....... 272

Rong Yan, IBM T.J. Watson Research Center, USA

Apostol Natsev, IBM T.J. Watson Research Center, USA

Murray Campbell, IBM T.J. Watson Research Center, USA
Chapter XIII
Active Video Annotation: To Minimize Human Effort........................................................................ 298

Meng Wang, Microsoft Research Asia, China

Xian-Sheng Hua, Microsoft Research Asia, China

Jinhui Tang, National University of Singapore, Singapore

Guo-Jun Qi, University of Science and Technology of China, China
Chapter XIV
Annotating Images by Mining Image Search...................................................................................... 323


Xin-Jing Wang, Microsoft Research Asia, China

Lei Zhang, Microsoft Research Asia, China

Xirong Li, Microsoft Research Asia, China

Wei-Ying Ma, Microsoft Research Asia, China
Chapter XV
Semantic Classification and Annotation of Images............................................................................. 350

Yonghong Tian, Peking University, China

Shuqiang Jiang, Chinese Academy of Sciences, China

Tiejun Huang, Peking University, China

Wen Gao, Peking University, China
Section V
Other Topics Related to Semantic Mining
Chapter XVI
Association-Based Image Retrieval..................................................................................................... 379

Arun Kulkarni, The University of Texas at Tyler, USA

Leonard Brown, The University of Texas at Tyler, USA
Chapter XVII
Compressed-Domain Image Retrieval Based on Colour Visual Patterns ........................................... 407

Gerald Schaefer, Aston University, UK



Chapter XVIII
Resource Discovery Using Mobile Agents ......................................................................................... 419

M. Singh, Middlesex University, UK

X. Cheng, Middlesex University, UK & Beijing Normal University, China

X. He, Reading University, UK
Chapter XIX
Multimedia Data Indexing .................................................................................................................. 449

Zhu Li, Hong Kong Polytechnic University, Hong Kong

Yun Fu, BBN Technologies, USA

Junsong Yuan, Northwestern University, USA

Ying Wu, Northwestern University, USA

Aggelos Katsaggelos, Northwestern University, USA

Thomas S. Huang, University of Illinois at Urbana-Champaign, USA
Compilation of References................................................................................................................ 476
About the Contributors..................................................................................................................... 514
Index.................................................................................................................................................... 523


Detailed Table of Contents


Preface . ................................................................................................................................................ xv
Section I
Multimedia Information Representation
Chapter I
Video Representation and Processing for Multimedia Data Mining ...................................................... 1

Amr Ahmed, University of Lincoln, UK
Video processing and segmentation are important stages in multimedia data mining, especially given the advances in, and diversity of, available video data. The aim of this chapter is to introduce researchers, especially new ones, to video representation, processing, and segmentation techniques. It provides a gentle introduction, followed by the principles of video structure and representation, and then a state-of-the-art survey of segmentation techniques focusing on shot detection. Performance evaluation and common issues are also discussed before concluding the chapter.
Chapter II
Image Features from Morphological Scale-Spaces............................................................................... 32

Sébastien Lefèvre, University of Strasbourg – CNRS, France
Multimedia data mining is a critical problem due to the huge amount of data available. Efficient and reliable data mining solutions require both appropriate features to be extracted from the data and relevant techniques to cluster and index the data. In this chapter, the authors deal with the first problem, feature extraction for image representation. A wide range of features has been introduced in the literature, and some attempts have been made to build standards (e.g., MPEG-7). These features are extracted with image processing techniques, and the authors focus here on a particular image processing toolbox, namely mathematical morphology, which remains relatively unknown to the multimedia mining community even though it offers some very interesting feature extraction methods. They review these morphological features, from the basic ones (granulometry or pattern spectrum, differential morphological profile) to more complex ones that manage to gather complementary information.


Chapter III
Face Recognition and Semantic Features ............................................................................................. 80


Huiyu Zhou, Brunel University, UK

Yuan Yuan, Aston University, UK

Chunmei Shi, People’s Hospital of Guangxi, China
The authors present a face recognition scheme based on the extraction of semantic features from faces and tensor subspace analysis. These semantic features consist of the eyes and mouth, plus the region outlined by the three weight centres of the edges of these features. The extracted features are compared across images in the tensor subspace domain. Singular value decomposition is used to solve the eigenvalue problem and to project the geometrical properties onto the face manifold. The authors also compare the performance of the proposed scheme with that of other established techniques; the results demonstrate the superiority of the proposed method.
Section II
Learning in Multimedia Information Organization
Chapter IV
Shape Matching for Foliage Database Retrieval ................................................................................ 100

Haibin Ling, Temple University, USA

David W. Jacobs, University of Maryland, USA
Computer-aided foliage image retrieval systems have the potential to dramatically speed up the process
of plant species identification. Despite previous research, this problem remains challenging due to the
large intra-class variability and inter-class similarity of leaves. This is particularly true when a large
number of species are involved. In this chapter, the authors present a shape-based approach, the inner-distance shape context, as a robust and reliable solution. They show that this approach naturally captures part structures and is appropriate to the shape of leaves. Furthermore, they show that this approach can be easily extended to include texture information arising from the veins of leaves. They also describe a real electronic field guide system that uses their approach. The effectiveness of the proposed method is demonstrated in experiments on two leaf databases involving more than 100 species and 1000 leaves.
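The descriptor underlying this chapter can be sketched compactly. Below is a minimal Python version of the plain (Euclidean) shape context, a log-polar histogram of where the other contour points fall relative to a reference point; the chapter's inner-distance variant replaces the Euclidean distance with the length of the shortest path inside the silhouette, which this sketch does not implement. The bin counts and radii here are illustrative assumptions.

```python
import math

def shape_context(points, idx, n_r=5, n_theta=12, r_min=0.125, r_max=2.0):
    """Log-polar histogram of the positions of all other contour points
    relative to points[idx] (plain Euclidean variant)."""
    px, py = points[idx]
    # normalise by the mean pairwise distance so the descriptor is scale-invariant
    dists = [math.dist(p, q) for i, p in enumerate(points) for q in points[i + 1:]]
    mean_d = sum(dists) / len(dists)
    hist = [[0] * n_theta for _ in range(n_r)]
    for j, (qx, qy) in enumerate(points):
        if j == idx:
            continue
        d = math.dist((px, py), (qx, qy)) / mean_d
        theta = math.atan2(qy - py, qx - px) % (2 * math.pi)
        if d <= r_min:
            r_bin = 0
        else:
            # radial bins log-spaced between r_min and r_max
            r_bin = min(n_r - 1, int(n_r * math.log(d / r_min) / math.log(r_max / r_min)))
        t_bin = min(n_theta - 1, int(theta / (2 * math.pi / n_theta)))
        hist[r_bin][t_bin] += 1
    return hist
```

Matching two leaves then reduces to comparing the per-point histograms, e.g. with a chi-squared cost and an assignment step.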
Chapter V

Similarity Learning for Motion Estimation......................................................................................... 130

Shaohua Kevin Zhou, Siemens Corporate Research Inc., USA

Jie Shao, Google Inc., USA

Bogdan Georgescu, Siemens Corporate Research Inc., USA

Dorin Comaniciu, Siemens Corporate Research Inc., USA
Motion estimation necessitates an appropriate choice of similarity function. Because generic similarity
functions derived from simple assumptions are insufficient to model complex yet structured appearance
variations in motion estimation, the authors propose to learn a discriminative similarity function to match


images under varying appearances by casting image matching into a binary classification problem. They
use the LogitBoost algorithm to learn the classifier based on an annotated database that exemplifies the
structured appearance variations: An image pair in correspondence is positive and an image pair out of
correspondence is negative. To leverage the additional distance structure of negatives, they present a
location-sensitive cascade training procedure that bootstraps negatives for later stages of the cascade
from the regions closer to the positives, which enables viewing a large number of negatives and steering the training process to yield lower training and test errors. They also apply the learned similarity function to estimating the motion of the endocardial wall of the left ventricle in echocardiography and to performing visual tracking. They obtain improved performance when comparing the learned similarity function with conventional ones.
Chapter VI
Active Learning for Relevance Feedback in Image Retrieval............................................................. 152

Jian Cheng, National Laboratory of Pattern Recognition, Institute of Automation, Chinese

Academy of Sciences, China


Kongqiao Wang, Nokia Research Center, Beijing, China

Hanqing Lu, National Laboratory of Pattern Recognition, Institute of Automation, Chinese

Academy of Sciences, China
Relevance feedback is an effective approach to boost the performance of image retrieval. Labeling data
is indispensable for relevance feedback, but it is also very tedious and time-consuming. How to alleviate users’ burden of labeling has been a crucial problem in relevance feedback. In recent years, active
learning approaches have attracted more and more attention, such as query learning, selective sampling,
multi-view learning, etc. The well-known examples include Co-training, Co-testing, SVMactive, etc.
The authors introduce some representative active learning methods used in relevance feedback. In particular, they present a new active learning algorithm based on multi-view learning, named Co-SVM. In the Co-SVM algorithm, color and texture are naturally considered as sufficient and uncorrelated views of an image. An SVM classifier is learned in the color and texture feature subspaces, respectively. The two classifiers are then used to classify the unlabeled data, and the unlabeled samples on which the two classifiers disagree are chosen for labeling. Extensive experiments show that the proposed algorithm is beneficial to image retrieval.
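The selection rule at the heart of Co-SVM can be sketched in a few lines. The stand-in below uses nearest-centroid classifiers instead of SVMs (an assumption for brevity; any per-view binary classifier works the same way): train one classifier per view on the labeled data, then queue for labeling the unlabeled samples on which the two views disagree.

```python
import math

def train_centroid(samples, labels):
    # nearest-centroid stand-in for the per-view SVM classifier
    def mean(vs):
        return [sum(col) / len(vs) for col in zip(*vs)]
    pos = mean([s for s, y in zip(samples, labels) if y == 1])
    neg = mean([s for s, y in zip(samples, labels) if y == 0])
    return lambda x: 1 if math.dist(x, pos) < math.dist(x, neg) else 0

def disagreement_sampling(color_view, texture_view, labels, unlabeled_ids):
    """Co-SVM-style selection: train one classifier per view on the
    labeled data (labels[i] is None for unlabeled samples), then pick
    the unlabeled samples the two views classify differently -- these
    are the most informative ones to ask the user to label."""
    labeled_ids = [i for i in range(len(labels)) if labels[i] is not None]
    ys = [labels[i] for i in labeled_ids]
    f_color = train_centroid([color_view[i] for i in labeled_ids], ys)
    f_texture = train_centroid([texture_view[i] for i in labeled_ids], ys)
    return [i for i in unlabeled_ids
            if f_color(color_view[i]) != f_texture(texture_view[i])]
```

The returned indices are shown to the user in the next feedback round, after which both classifiers are retrained.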
Chapter VII
Visual Data Mining Based on Partial Similarity Concepts.................................................................. 166

Juliusz L. Kulikowski, Polish Academy of Sciences, Poland
Visual data mining is a procedure aimed at selecting, from a document repository, subsets of documents presenting certain classes of objects; the latter may be characterized as classes of objects' similarity or, more generally, as classes of objects satisfying certain relationships. In this chapter, attention is focused on the selection of visual documents representing objects belonging to similarity classes.


Section III
Semantic Analysis
Chapter VIII
Image/Video Semantic Analysis by Semi-Supervised Learning......................................................... 183

Jinhui Tang, National University of Singapore, Singapore


Xian-Sheng Hua, Microsoft Research Asia, China

Meng Wang, Microsoft Research Asia, China
The insufficiency of labeled training samples is a major obstacle in the automatic semantic analysis of large-scale image/video databases. Semi-supervised learning, which attempts to learn from both labeled and unlabeled data, is a promising approach to tackling this problem. As a major family of semi-supervised learning, graph-based methods have attracted increasing research attention. In this chapter, a brief introduction is given to popular semi-supervised learning methods, especially the graph-based ones, as well as their applications in image annotation, video annotation, and image retrieval. It is well known that pair-wise similarity is an essential factor in graph-propagation-based semi-supervised learning methods. A novel graph-based semi-supervised learning method, named Structure-Sensitive Anisotropic Manifold Ranking (SSAniMR), is derived from a PDE-based anisotropic diffusion framework. Instead of using Euclidean distance only, SSAniMR further takes local structural difference into account to measure pair-wise similarity more accurately. Finally, some future directions for using semi-supervised learning to analyze multimedia content are discussed.
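The graph-propagation machinery the chapter builds on can be sketched as follows. This is the standard manifold-ranking iteration F ← αSF + (1 − α)Y with a plain Gaussian-kernel similarity, not SSAniMR's structure-sensitive variant; the values of σ and α are illustrative choices.

```python
import math

def propagate_labels(points, labels, sigma=1.0, alpha=0.9, iters=200):
    """Plain graph-based label propagation: labels[i] is 1, 0, or None
    (unlabeled).  Labeled points inject their labels into the graph and
    similarity spreads them to the unlabeled points."""
    n = len(points)
    # Gaussian affinity matrix with zero diagonal
    W = [[0.0 if i == j else
          math.exp(-math.dist(points[i], points[j]) ** 2 / (2 * sigma ** 2))
          for j in range(n)] for i in range(n)]
    # symmetric normalisation S = D^(-1/2) W D^(-1/2)
    d = [sum(row) for row in W]
    S = [[W[i][j] / math.sqrt(d[i] * d[j]) for j in range(n)] for i in range(n)]
    Y = [1.0 if y == 1 else -1.0 if y == 0 else 0.0 for y in labels]
    F = Y[:]
    for _ in range(iters):
        F = [alpha * sum(S[i][j] * F[j] for j in range(n)) + (1 - alpha) * Y[i]
             for i in range(n)]
    return [1 if f > 0 else 0 for f in F]
```

SSAniMR would replace the Gaussian kernel above with a similarity that also accounts for local structural difference between neighbourhoods.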
Chapter IX
Content-Based Video Semantic Analysis............................................................................................. 211

Shuqiang Jiang, Chinese Academy of Sciences, China

Yonghong Tian, Peking University, China

Qingming Huang, Graduate University of Chinese Academy of Sciences, China

Tiejun Huang, Peking University, China

Wen Gao, Peking University, China
With the explosive growth in the amount of video data and rapid advances in computing power, extensive research efforts have been devoted to content-based video analysis. In this chapter, the authors give a broad discussion of this research area, covering topics such as video structure analysis, object detection and tracking, event detection, and visual attention analysis. Different video representation and indexing models are also presented.
Chapter X
Applications of Semantic Mining on Biological Process Engineering................................................ 236

Hossam A. Gabbar, University of Ontario Institute of Technology, Canada

Naila Mahmut, Heart Center - Cardiovascular Research Hospital for Sick Children, Canada
Semantic mining is an essential part of knowledge-based and decision support systems, where it enables the extraction of useful knowledge from available databases with the ultimate goal of supporting the decision-making process. In process systems engineering, decisions are made throughout plant/process/product life cycles. The provision of smart semantic mining techniques will improve the decision-making process for all life cycle activities. In particular, safety- and environment-related decisions are highly dependent on process internal and external conditions and dynamics with respect to equipment geometry and plant layout. This chapter discusses practical methods for semantic mining using systematic knowledge representation integrated with process modeling and domain knowledge. POOM, the plant/process object-oriented modeling methodology, is explained and used as a basis to implement semantic mining as applied to process systems engineering. Case studies are illustrated for biological process engineering, in particular MoFlo systems, focusing on process safety and operation design support.
Chapter XI
Intuitive Image Database Navigation by Hue-Sphere Browsing......................................................... 263

Gerald Schaefer, Aston University, UK

Simon Ruszala, Teleca, UK
Efficient and effective techniques for managing and browsing large image databases are increasingly sought after. This chapter presents a simple yet efficient and effective approach to navigating image datasets. Based on the concept of a globe as the visualisation and navigation medium, thumbnails are projected onto the surface of a sphere according to their colour. Navigation is performed by rotating and tilting the globe as well as zooming into an area of interest. Experiments on a medium-sized image database demonstrate the usefulness of the presented approach.
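One plausible reading of the hue-sphere mapping can be sketched as follows: an image's average colour determines its position on the globe, with hue as longitude and lightness as latitude (dark thumbnails near one pole, light ones near the other). The exact formulation in the chapter may differ.

```python
import colorsys
import math

def hue_sphere_position(rgb, radius=1.0):
    """Map an image's average RGB colour (0-255 per channel) to a 3-D
    point on the navigation globe: hue -> longitude, lightness -> latitude."""
    r, g, b = (c / 255.0 for c in rgb)
    h, l, s = colorsys.rgb_to_hls(r, g, b)
    lon = 2 * math.pi * h              # hue sweeps around the equator
    lat = math.pi * (l - 0.5)          # dark at the south pole, light at the north
    return (radius * math.cos(lat) * math.cos(lon),
            radius * math.cos(lat) * math.sin(lon),
            radius * math.sin(lat))
```

Thumbnails placed this way cluster by colour, so rotating and tilting the globe naturally browses from one colour family to the next.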

Section IV
Multimedia Resource Annotation
Chapter XII
Formal Models and Hybrid Approaches for Efficient Manual Image Annotation and Retrieval ....... 272

Rong Yan, IBM T.J. Watson Research Center, USA

Apostol Natsev, IBM T.J. Watson Research Center, USA

Murray Campbell, IBM T.J. Watson Research Center, USA
Although important in practice, manual image annotation and retrieval has rarely been studied by means of formal modeling methods. In this chapter, the authors propose a set of formal models to characterize the annotation times of two commonly used manual annotation approaches, tagging and browsing. Based on the complementary properties of these models, they design new hybrid approaches, called frequency-based annotation and learning-based annotation, to improve the efficiency of manual image annotation as well as retrieval. Both simulation and experimental results show that the proposed algorithms can achieve up to a 50% reduction in annotation time over baseline methods for manual image annotation, and produce significantly better annotation and retrieval results in the same amount of time.
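The frequency-based idea can be illustrated with a toy cost model (the constants and the linear costs are assumptions for illustration, not the chapter's fitted formulas): tagging costs time per relevant image, while browsing costs a quick judgement per image in the whole collection, so rare keywords favour tagging and frequent ones favour browsing.

```python
def hybrid_annotation_time(keyword_freqs, n_images, t_tag=2.0, t_judge=0.5):
    """Toy frequency-based hybrid: keyword_freqs maps keyword -> fraction
    of the collection it applies to.  For each keyword, choose whichever
    manual mode is estimated to be cheaper and sum the total time."""
    total, plan = 0.0, {}
    for kw, freq in keyword_freqs.items():
        cost_tag = t_tag * freq * n_images   # type the tag on each relevant image
        cost_browse = t_judge * n_images     # sweep the whole grid once per keyword
        plan[kw] = "tag" if cost_tag <= cost_browse else "browse"
        total += min(cost_tag, cost_browse)
    return plan, total
```

With these illustrative constants, a keyword applying to half the collection is cheaper to browse, while a keyword applying to 1% of it is cheaper to tag.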
Chapter XIII
Active Video Annotation: To Minimize Human Effort........................................................................ 298

Meng Wang, Microsoft Research Asia, China

Xian-Sheng Hua, Microsoft Research Asia, China

Jinhui Tang, National University of Singapore, Singapore

Guo-Jun Qi, University of Science and Technology of China, China


This chapter introduces the application of active learning to video annotation. The insufficiency of training data is a major obstacle in learning-based video annotation, and active learning is a promising approach to dealing with this difficulty. It iteratively annotates a selected set of the most informative samples, such that the obtained training set is more effective than one gathered randomly. The authors present a brief review of typical active learning approaches and categorize the sample selection strategies in these methods into five criteria: risk reduction, uncertainty, positivity, density, and diversity. In particular, they introduce the widely applied Support Vector Machine (SVM)-based active learning scheme. Afterwards, they analyze the deficiencies of existing active learning methods for video annotation: in most of these methods, the to-be-annotated concepts are treated equally without preference, and only one modality is applied. To address these two issues, they introduce a multi-concept multi-modality active learning scheme, which better exploits human labeling effort by considering both the learnabilities of different concepts and the potential of different modalities.
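The SVM-based scheme's simplest selection criterion, uncertainty, can be stated directly in code: samples whose decision value f(x) lies closest to zero are nearest the margin and are queried first. The snippet assumes `scores` maps sample ids to decision values from the current classifier (a hypothetical interface for illustration).

```python
def uncertainty_sampling(scores, unlabeled_ids, batch_size=5):
    """Margin-based active selection: the smaller |f(x)|, the closer the
    sample lies to the decision boundary and the more informative its
    label is expected to be, so sort by |f(x)| and take the top batch."""
    return sorted(unlabeled_ids, key=lambda i: abs(scores[i]))[:batch_size]
```

The multi-concept multi-modality scheme described above generalizes this rule by also weighting which concept and which modality each label would help most.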
Chapter XIV
Annotating Images by Mining Image Search...................................................................................... 323

Xin-Jing Wang, Microsoft Research Asia, China

Lei Zhang, Microsoft Research Asia, China

Xirong Li, Microsoft Research Asia, China

Wei-Ying Ma, Microsoft Research Asia, China
Although it has been studied for years by the computer vision and machine learning communities, image annotation is still far from practical. In this chapter, the authors propose a novel attempt at modeless image annotation, which investigates how effective a data-driven approach can be, and suggest annotating an uncaptioned image by mining its search results. They collected 2.4 million images with their surrounding texts from a few photo forum websites as the database supporting this data-driven approach. The entire process contains three steps: 1) the search process, to discover visually and semantically similar search results; 2) the mining process, to discover salient terms from textual descriptions of the search results; and 3) the annotation rejection process, to filter noisy terms yielded by step 2. To ensure real-time annotation, two key techniques are leveraged: one is to map the high-dimensional image visual features into hash codes, the other is to implement the system as a distributed one, in which the search and mining processes are provided as Web services. As a typical result, the entire process finishes in less than one second. Since no training dataset is required, the proposed approach enables annotation with an unlimited vocabulary, and is highly scalable and robust to outliers. Experimental results on real web images show the effectiveness and efficiency of the proposed algorithm.
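The hash-code step can be illustrated with random-hyperplane locality-sensitive hashing, one common way to map high-dimensional visual features to compact codes; the chapter does not necessarily use this exact scheme. Nearby feature vectors receive codes at small Hamming distance, so candidate search results can be found by comparing bits instead of floating-point vectors.

```python
import random

def make_hasher(dim, n_bits, seed=0):
    """Random-hyperplane LSH: each bit records which side of a random
    hyperplane the feature vector falls on."""
    rng = random.Random(seed)
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]
    def hash_code(x):
        return tuple(1 if sum(p_i * x_i for p_i, x_i in zip(p, x)) >= 0 else 0
                     for p in planes)
    return hash_code

def hamming(a, b):
    # number of differing bits between two hash codes
    return sum(u != v for u, v in zip(a, b))
```

Looking up an uncaptioned image then reduces to retrieving database images whose codes lie within a small Hamming radius, after which the surrounding texts of those hits are mined for salient terms.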

Chapter XV
Semantic Classification and Annotation of Images............................................................................. 350

Yonghong Tian, Peking University, China

Shuqiang Jiang, Chinese Academy of Sciences, China

Tiejun Huang, Peking University, China

Wen Gao, Peking University, China
With the rapid growth of image collections, content-based image retrieval (CBIR) has been an active area of research with notable recent progress. However, automatic image retrieval by semantics still remains a challenging problem. In this chapter, the authors describe two promising techniques towards semantic image retrieval: semantic image classification and automatic image annotation. For each technique, four aspects are presented: task definition, image representation, computational models, and evaluation. Finally, a brief discussion of their application in image retrieval is given.
Section V
Other Topics Related to Semantic Mining
Chapter XVI
Association-Based Image Retrieval..................................................................................................... 379

Arun Kulkarni, The University of Texas at Tyler, USA

Leonard Brown, The University of Texas at Tyler, USA
With advances in computer technology and the World Wide Web, there has been an explosion in the amount and complexity of multimedia data that are generated, stored, transmitted, analyzed, and accessed. In order to extract useful information from this huge amount of data, many content-based image retrieval (CBIR) systems have been developed in the last decade. A typical CBIR system captures image features that represent image properties such as color, texture, or the shape of objects in the query image, and tries to retrieve images from the database with similar features. Recent advances in CBIR systems include relevance-feedback-based interactive systems. The main advantage of CBIR systems with relevance feedback is that they take into account the gap between high-level concepts and low-level features, as well as the subjectivity of human perception of visual content. CBIR systems with relevance feedback are more effective than conventional CBIR systems; however, they depend on human interaction. In this chapter, the authors describe a new approach to image storage and retrieval called association-based image retrieval (ABIR), which tries to mimic human memory: the human brain stores and retrieves images by association. A generalized bi-directional associative memory (GBAM) is used to store associations between feature vectors that represent the images stored in the database. Section I introduces the reader to CBIR systems. Section II presents the architecture of the ABIR system, Section III deals with preprocessing and feature extraction techniques, and Section IV presents various models of the GBAM. Section V presents case studies.
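The associative recall that ABIR relies on can be illustrated with the classic Kosko BAM, used here as a simple stand-in for the chapter's generalized GBAM: associations between bipolar (+1/−1) feature vectors are stored in a single weight matrix by the outer-product rule, and recall is a thresholded matrix-vector product that also tolerates some noise in the query.

```python
def bam_train(pairs):
    """Store (x, y) associations with the outer-product rule
    W = sum_k x_k y_k^T, where x and y are bipolar vectors."""
    n, m = len(pairs[0][0]), len(pairs[0][1])
    return [[sum(x[i] * y[j] for x, y in pairs) for j in range(m)]
            for i in range(n)]

def bam_recall(W, x):
    # forward pass: y = sign(x^T W), with ties broken towards +1
    return [1 if sum(x[i] * W[i][j] for i in range(len(x))) >= 0 else -1
            for j in range(len(W[0]))]
```

In an ABIR-style system, x would be a feature vector extracted from the query image and the recalled y an associated feature vector indexing related images.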
Chapter XVII
Compressed-Domain Image Retrieval Based on Colour Visual Patterns ........................................... 407

Gerald Schaefer, Aston University, UK
Image retrieval and image compression have typically been pursued separately. Little research has been done on a synthesis of the two that allows image retrieval to be performed directly in the compressed domain, without the need to uncompress the images first. In this chapter, the author shows that such compressed-domain image retrieval can indeed be done and leads to effective and efficient retrieval performance. A novel compression algorithm, colour visual pattern image coding (CVPIC), is introduced, along with several retrieval algorithms that operate directly on compressed CVPIC data. The experiments demonstrate that it is not only possible to realise such midstream content access, but also that the presented techniques outperform standard retrieval techniques such as colour histograms and colour correlograms.


Chapter XVIII
Resource Discovery Using Mobile Agents ......................................................................................... 419

M. Singh, Middlesex University, UK

X. Cheng, Middlesex University, UK & Beijing Normal University, China


X. He, Reading University, UK
The discovery of multimedia resources on a network is the focus of much research in the post-Semantic-Web era. The task of resource discovery can be automated by using agents. This chapter reviews the technologies currently most used to facilitate the resource discovery process, and presents a case study of a fully functioning resource discovery system using mobile agents.
Chapter XIX
Multimedia Data Indexing .................................................................................................................. 449

Zhu Li, Hong Kong Polytechnic University, Hong Kong

Yun Fu, BBN Technologies, USA

Junsong Yuan, Northwestern University, USA

Ying Wu, Northwestern University, USA

Aggelos Katsaggelos, Northwestern University, USA

Thomas S. Huang, University of Illinois at Urbana-Champaign, USA
Rapid advances in multimedia capture, storage, and communication technologies have ushered in an era of unprecedented growth of digital media content in audio, visual, and synthetic forms, both personally and commercially produced. How to manage these data to make them more accessible and searchable to users is a key challenge in current multimedia computing research. In this chapter, the authors discuss the problems and challenges in multimedia data management, and review the state of the art in data structures and algorithms for multimedia indexing, media feature space management and organization, and applications of these techniques in multimedia data management.
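As a concrete taste of the index structures such a survey covers, here is a minimal k-d tree with nearest-neighbour search, a classic choice for low-dimensional feature spaces (an illustrative sketch, not the chapter's code).

```python
import math

def build_kdtree(points, depth=0):
    """Recursively split on alternating axes at the median point.
    Each node is a (point, left_subtree, right_subtree) tuple."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return (points[mid],
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))

def nearest(node, target, depth=0, best=None):
    """Branch-and-bound nearest-neighbour search over the tree."""
    if node is None:
        return best
    point, left, right = node
    if best is None or math.dist(point, target) < math.dist(best, target):
        best = point
    axis = depth % len(target)
    near, far = (left, right) if target[axis] < point[axis] else (right, left)
    best = nearest(near, target, depth + 1, best)
    # descend the far side only if the splitting plane is closer than the best hit
    if abs(target[axis] - point[axis]) < math.dist(best, target):
        best = nearest(far, target, depth + 1, best)
    return best
```

For the high-dimensional feature spaces common in multimedia indexing, k-d trees degrade, which is exactly why the chapter also surveys dimensionality reduction and approximate methods.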
Compilation of References................................................................................................................ 476
About the Contributors..................................................................................................................... 514
Index.................................................................................................................................................... 523




Preface

With the explosive growth of multimedia databases in terms of both size and variety, effective and efficient indexing and searching techniques for large-scale multimedia databases have become an urgent
research topic in recent years.
For data organization, the conventional approach is based on keywords or a text description of a multimedia datum. However, it is tedious to annotate all data with text, and it is almost impossible for people to capture everything in words. Moreover, a text description is often not precise enough to describe a multimedia datum. For example, it is unrealistic to describe a music clip in words; an image says more than a thousand words; and a keyword-based video shot description cannot characterize the contents for a specific user. Therefore, it is important to utilize content-based approaches (CbA) to mine the semantic information of a multimedia datum.
In the last ten years, we have witnessed very significant contributions of CbA to semantic mining for multimedia data organization. CbA means that the data organization, including retrieval and indexing, utilizes the contents of the data themselves, rather than keywords provided by humans. The contents of a datum can therefore be obtained using techniques from statistics, computer vision, and signal processing. For example, Markov random fields can be applied to image modeling; spatial-temporal analysis is important for video representation; and the Mel frequency cepstral coefficients have been shown to be among the most effective features for audio signal classification.
Apart from the conventional approaches mentioned above, machine learning also plays an indispensable role in current semantic mining tasks: for example, random sampling techniques and support vector machines for human-computer interaction, manifold learning and subspace methods for data visualization, discriminant analysis for feature selection, and classification trees for data indexing.
The goal of this IGI Global book is to provide an introduction to the most recent research and techniques in multimedia semantic mining, so that new researchers can enter this field step by step and choose suitable approaches for their specific applications. The book is also an important reference for researchers in multimedia, a handbook for research students, and a repository for multimedia technologists.
The major contributions of this book are in three aspects: (1) collecting the most recent and important research results in semantic mining for multimedia data organization, (2) giving new researchers a comprehensive review of the state-of-the-art techniques for the different tasks of multimedia database management, and (3) providing technologists and programmers with important algorithms for multimedia system construction.
This edited book attracted submissions from eight countries including Canada, China, France, Japan,
Poland, Singapore, United Kingdom, and United States. Among these submissions, 19 have been accepted. We strongly believe that it is now an ideal time to publish this edited book with the 19 selected



chapters. The contents of this edited book will provide readers with cutting-edge and topical information
for their related research.
The accepted chapters address a wide range of topics in semantic mining from multimedia databases; an overview of the included chapters is given below.
This book starts with new multimedia information representations (Video Representation and
Processing for Multimedia Data Mining) (Image Features from Morphological Scale-spaces) (Face
Recognition and Semantic Features), after which learning in multimedia information organization, an
important topic in semantic mining, is studied by four chapters (Shape Matching for Foliage Database
Retrieval) (Similarity Learning For Motion Estimation) (Active Learning for Relevance Feedback in
Image Retrieval) (Visual Data Mining Based on Partial Similarity Concepts). Thereafter, four schemes
are presented for semantic analysis in four chapters (Image/Video Semantic Analysis by Semi-Supervised
Learning) (Content-Based Video Semantic Analysis) (Semantic Mining for Green Production Systems)
(Intuitive Image Database Navigation by Hue-sphere Browsing). The multimedia resource annotation
is also essential for a retrieval system and four chapters provide interesting ideas (Hybrid Tagging and
Browsing Approaches for Efficient Manual Image Annotation) (Active Video Annotation: To Minimize
Human Effort) (Image Auto-Annotation by Search) (Semantic Classification and Annotation of Images).
The last part of this book presents other related topics for semantic mining (Association-Based Image
Retrieval) (Compressed-domain Image Retrieval based on Colour Visual Patterns) (Multimedia Resource
Discovery using Mobile Agent) (Multimedia Data Indexing).
Dacheng Tao
Email:

Nanyang Technological University, Singapore
Dong Xu
Email:
Nanyang Technological University, Singapore
Xuelong Li
Email:
University of London, UK



Section I

Multimedia Information
Representation




Chapter I

Video Representation and
Processing for
Multimedia Data Mining
Amr Ahmed
University of Lincoln, UK

ABSTRACT
Video processing and segmentation are important stages for multimedia data mining, especially given the advances in, and diversity of, available video data. The aim of this chapter is to introduce researchers, especially new ones, to video representation, processing, and segmentation techniques. This includes an easy and smooth introduction, followed by the principles of video structure and representation, and then a state-of-the-art review of segmentation techniques, focusing on shot detection. Performance evaluation and common issues are also discussed before concluding the chapter.

I. INTRODUCTION
With the fast-progressing advances in digital video technologies and the wide availability of more efficient computing resources, we seem to be living in an era of explosion in digital video. Video data are now widely available and easily generated in large volumes. This is not only at the professional level: video can be found everywhere, on the internet, especially on video uploading and sharing sites, on personal digital cameras and camcorders, and on the camera mobile phones that have become almost the norm.
People use these readily available facilities to generate video data. But at some point, sooner or later, they realize that managing these data can be a bottleneck. This is because the available techniques and tools for accessing, searching, and retrieving video data are not on the same level as those for other traditional

Copyright © 2009, IGI Global, distributing in print or electronic forms without written permission of IGI Global is prohibited.



data, such as text. The advances in video access, search, and retrieval techniques have not been progressing at the same pace as digital video technologies and the volume of data they generate. This can be attributed, at least partly, to the nature of video data and its richness compared to text data. But it can also be attributed to the increase in our demands. In text, we are no longer satisfied with searching for an exact match of a sequence of characters or strings, but need to find similar meanings and other higher-level matches. We look forward to doing the same with video data. But the nature of video data is different.
Video data are more complex and naturally larger in volume than traditional text data. They usually combine visual and audio data, as well as textual data. These data need to be appropriately annotated and indexed in an accessible form for search and retrieval techniques to deal with them. This can be achieved based on textual information, visual and/or audio features, or, more importantly, semantic information. The textual-based approach is theoretically the simplest. Video data need to be

annotated by textual descriptions, such as keywords or short sentences describing the contents. This
converts the search task into the known area of searching in the text data, where the existing relatively
advanced tools and techniques can be utilized. The main bottleneck here is the huge time and effort
that are needed to accomplish this annotation task, let alone any accuracy issues. The feature-based
approach, whether visual and/or audio, depends on annotating the video data by combinations of their
extracted low-level features such as intensity, color, texture, shape, motion, and other audio features.
This is very useful in doing a query-by-example task. But still not very useful in searching for specific
event or more semantic attributes. The semantic-based approach is, in one sense, similar to the textbased approach. Video data need to be annotated, but in this case, with high-level information that
represents the semantic meaning of the contents, rather than just describing the contents. The difficulty
of this annotation is the high variability of the semantic meaning, of the same video data, among different people, cultures, and ages, to name just a few. It will depend on so many factors, including the
purpose of the annotation, the domain and application, cultural and personal views, and could even be
subject to the mood and personality of the annotator. Hence, generally automating this task is highly
challenging. For specific domains, carefully selected combinations of visual and/or audio features correlate with useful semantic information. Hence, the efficient extraction of those features is crucial to
the high-level analysis and mining of the video data.
In this chapter, we focus on the core techniques that facilitate the high-level analysis and mining
of the video data. One of the important initial steps in segmentation and analysis of video data is the
shot-boundary detection. This is the first step in decomposing the video sequence to its logical structure
and components, in preparation for analysis of each component. It is worth mentioning that the subject
is enormous and this chapter is meant to be more of an introduction, especially for new researchers.
Also, in this chapter, we only focus on the visual modality of the video. Hence, the audio and textual
modalities are not covered.
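To give a first flavor of what shot-boundary detection involves, the classic approach compares a simple feature, such as a gray-level histogram, between consecutive frames and declares a hard cut when the difference exceeds a threshold. A minimal sketch of this idea (the bin count and threshold are illustrative choices, not values from this chapter):

```python
def gray_histogram(frame, bins=16, levels=256):
    """Histogram of gray levels for one frame (a 2-D list of 0-255 values)."""
    hist = [0] * bins
    for row in frame:
        for pixel in row:
            hist[pixel * bins // levels] += 1
    return hist

def hist_difference(h1, h2):
    """Sum of absolute bin differences, normalized to the range [0, 1]."""
    total = sum(h1)
    return sum(abs(a - b) for a, b in zip(h1, h2)) / (2.0 * total)

def detect_cuts(frames, threshold=0.5):
    """Indices i where a hard cut is declared between frame i-1 and frame i."""
    hists = [gray_histogram(f) for f in frames]
    return [i for i in range(1, len(hists))
            if hist_difference(hists[i - 1], hists[i]) > threshold]

# Toy sequence: two dark frames, then two bright frames -> one cut at index 2.
dark = [[10] * 8 for _ in range(8)]
bright = [[240] * 8 for _ in range(8)]
print(detect_cuts([dark, dark, bright, bright]))  # [2]
```

Histogram-based comparison is less sensitive to object motion than pixel-by-pixel differencing, which is one reason it is a common baseline in the shot-detection literature reviewed later.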
After this introductory section, section II provides the principles of video data, so that we know what data we are dealing with and what they represent. This includes video structure and representation, for both compressed and uncompressed data. The various types of shot transitions are defined in
section III, as well as the various approaches of classifying them. Then, in section IV, the key categories
of the shot-boundary detection techniques are discussed. First, the various approaches of categorizing
the shot-detection techniques are discussed, along with the various factors contributing to that. Then,
a selected hierarchical approach is used to represent the most common techniques. This is followed by
discussion of the performance evaluation measures and some common issues. Finally the chapter is
summarized and concluded in section V.





II. VIDEO STRUCTURE AND REPRESENTATION
This section aims to introduce researchers, mainly new ones, to the principles of video data structure and representation. This is an important introduction to understanding the data that will be dealt with and what they represent, and it is essential for following the subsequent sections, especially on shot-transition detection.
This section starts with an explanation of the common structure of a video sequence and a discussion of the various levels in that structure. The logical structure of the video sequence is particularly important for segmentation and data mining.

A. Video Structure
The video consists of a number of frames. These frames are usually, and preferably, adjacent to each other on the storage media, but should definitely be played back in the correct order and at the correct speed to convey the recorded sequence of actions and/or motion. In fact, each single frame is a still image that
consists of pixels, which are the smallest units from the physical point of view. These pixels are dealt
with when analyzing the individual frames, and the processing usually utilizes many image processing techniques. However, the aim of most video analysis applications is to identify the basic elements and contents of the video. Hence, the logical structure and elements are of more importance than the individual pixels.
Video is usually played back at 25 or 30 frames per second, as described in more detail in section 'B' below. These rates are chosen so that the human eye does not detect the separation between frames and so that motion appears smooth and continuous. Hence,
as far as we, the human beings, are concerned, we usually perceive and process the video on a higher
level structure. We can easily detect and identify objects, people, and locations within the video. Some
objects may change position on the screen from one frame to another, and recording locations could be
changing between frames as well. These changes allow us to perceive the motion of objects and people.
More importantly, they allow us to detect higher-level aspects such as behaviors and sequences, which we can put together to detect and understand a story or an event recorded in the video.
According to the above, the video sequence can be logically represented as a hierarchical structure,
as depicted in fig. 1 and illustrated in fig. 2. It is worth mentioning that as we go up in the hierarchy,

more detailed sub-levels may be added or slight variations of interpretation may exist. This mainly depends on the domain at hand. But at least the shot level, as defined below, seems to be commonly understood and agreed upon.
The definition of each level in the hierarchy is given below, in the reverse order, i.e. bottom-up:



Frame: The frame is simply a single still image. It is considered the smallest logical unit in this hierarchy and is important in the analysis of the other logical levels.
Shot: The shot is a sequence of consecutive frames, temporally adjacent, that has been recorded
continuously, within the same session and location, by the same single camera, and without substantial change in the contents of the picture. So, a shot is highly expected to contain a continuous
action in both space and time. A shot could be the result of what you continuously record, perhaps with a camcorder or even a mobile video camera, from the moment you press the record button until you stop the recording. But of course, if you drop the camera or the mobile phone, or someone quickly passes in front of your camera, this would probably cause a sudden change in the picture contents. If such a change is significant, it may break the continuity, and the recording may then not be considered a single shot.

Figure 1. Hierarchy of the video logical structure (a tree in which the video sequence is divided into segments, each segment into scenes, each scene into shots, and each shot into its frames)
Scene: The scene is a collection of related shots. Normally, those shots are recorded within the same session, at the same location, but they can be recorded by different cameras. An example could be a conversation scene between a few people. One camera may record the wide location and have all people in the picture, while another camera focuses on the person who is currently talking, and maybe another camera focuses on the audience. Each camera continuously records its designated view, but the final output to the viewer is what the director selects from those different views. So, the director can switch between cameras at various times within the conversation based on the flow of the conversation, a change of the talking person, a reaction from the audience, and so on. Although the finally generated views are usually substantially different, for the viewer the scene still seems to be logically related in terms of the location, timing, people, and/or objects involved. In fact, we are cleverer than that. In some cases, the conversation can move from one point to another, and the director may inject some images or sub-videos related to the discussion. This introduces huge changes in the pictures shown to the viewer, but the viewer can still follow them and identify that they are related. This leads to the next level in the logical structure of the video.
Segment: The video segment is a group of scenes related to a specific context. They do not have to have been recorded at the same location or at the same time, and of course they can be recorded with various cameras. However, they are logically related to each other within a specific semantic


Figure 2. Illustration of the video logical structure (a timeline in which the video sequence is split into segments, the segments into scenes, and the scenes into shots, laid out along the time axis)




context. The various scenes of the same event, or related events, within a news broadcast are an example of a video segment.
Sequence: A video sequence consists of a number of video segments. They are usually expected to be related or to share some context or semantic aspects, but in reality this may not always be the case.
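The hierarchy above maps naturally onto a nested data structure. A minimal Python sketch (the class and field names are illustrative, not taken from the chapter):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Shot:
    start_frame: int  # index of the first frame in the shot
    end_frame: int    # index of the last frame (inclusive)

    def frame_count(self) -> int:
        return self.end_frame - self.start_frame + 1

@dataclass
class Scene:
    shots: List[Shot] = field(default_factory=list)

@dataclass
class Segment:
    scenes: List[Scene] = field(default_factory=list)

@dataclass
class VideoSequence:
    segments: List[Segment] = field(default_factory=list)

    def shot_count(self) -> int:
        return sum(len(scene.shots)
                   for seg in self.segments for scene in seg.scenes)

# A sequence with one segment containing two scenes and three shots in total.
video = VideoSequence(segments=[
    Segment(scenes=[
        Scene(shots=[Shot(0, 99), Shot(100, 249)]),
        Scene(shots=[Shot(250, 399)]),
    ])
])
print(video.shot_count())  # 3
```

Note that only the shot level carries frame indices directly; the scene and segment levels are purely logical groupings, mirroring the discussion above.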

Depending on the application and its domain, the analysis of a video to extract its logical components can target any of the above levels. However, it is worth mentioning that the definitions may differ slightly with the domain, as they tend to be based on semantics, especially towards the top levels (scenes and segments in particular). A more common and more promising starting point is the video shot. Although classified as part of the logical structure, it also has a tight link with the physical recording action. As it is the result of continuous recording from a single camera within the same location and time, the shot has essentially the same definition in almost all applications and domains. Hence, the shot is a strong candidate starting point for extracting the structure and components of the video data.
In order to correctly extract shots, we need to detect their boundaries, lengths, and types. To do so, we need to be aware of how shots are usually joined together in the first place. This is discussed in detail in section III, and the various techniques of shot detection are reviewed in section IV.

B. Video Representation
In this sub-section we discuss video representation for both compressed and uncompressed data. We first explore the additional dimensionality of video data and frame-rates, with their associated redundancy in uncompressed data. Then, we discuss compressed data representation, the techniques for reducing the various types of redundancy, and how these can be utilized for shot detection.





1) Uncompressed Video Data

The video sequence contains groups of successive frames. They are designed so that, when played back, the human eye perceives continuous motion of objects within the video, and no flicker is noticed due to the change from one frame to another. The film industry uses a frame-rate of 24 frames/sec for films, but the two most common TV standard formats are PAL and NTSC. The frame-rate in those two standards is 25 frames/sec for the PAL TV standard and 30 frames/sec for the NTSC TV standard. For videos that are converted from films, some care needs to be taken, especially due to the different frame-rates involved in the different standards. A machine called a telecine is usually used for this conversion, which involves the 2:2 pulldown process for PAL or the 3:2 pulldown process for NTSC.
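The 3:2 pulldown cadence can be verified with simple arithmetic: alternate film frames are held for three and then two interlaced fields, so 24 film frames per second become 60 fields, i.e. 30 NTSC frames. A sketch of this counting (the cadence is the standard 3:2 pattern, not something specific to this chapter):

```python
def pulldown_32_fields(film_frames: int) -> int:
    """Number of interlaced fields produced by 3:2 pulldown.

    Film frames are alternately held for 3 fields and 2 fields,
    so every pair of film frames yields 5 fields.
    """
    fields = 0
    for i in range(film_frames):
        fields += 3 if i % 2 == 0 else 2
    return fields

# One second of film: 24 frames -> 60 fields -> 30 interlaced NTSC frames.
fields = pulldown_32_fields(24)
print(fields, fields // 2)  # 60 30
```

For PAL, no such cadence is needed: 2:2 pulldown maps each film frame to exactly two fields, and the film is typically simply run 4% faster at 25 frames/sec.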
Like a still image, each pixel within a frame has one or more values, such as intensity or color. But video data have an extra dimension in addition to the spatial dimensions of still images: the temporal dimension. The changes between frames in a video can be exhibited in any of the above attributes: pixel values, spatial, and/or temporal. The segmentation of video, as discussed in later sections, is based on detecting the changes in one or more of these attributes, or in their statistical properties and/or evolution.
The video data also carry motion information. Motion information is useful for segmenting video, as it gives an indication of the level of activity and its dynamics within the video. Activity levels can change between the different parts of a video and can, in fact, characterize those parts. Unlike individual image properties and pixel values, motion information is embedded within the video data. Techniques based on motion information, as discussed in section IV, have to extract it first. This is usually a computationally expensive task.
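Motion extraction is commonly done by block matching: for each block of the current frame, an exhaustive search over a small window in the previous frame finds the displacement with the lowest sum of absolute differences (SAD). A minimal sketch (block size and search range are illustrative); the nested loops also show why the task is expensive, since the cost grows with the block area times the square of the search range:

```python
def block_sad(prev, cur, by, bx, dy, dx, B):
    """SAD between the BxB block of `cur` at (by, bx) and the block of
    `prev` displaced by (dy, dx)."""
    sad = 0
    for y in range(B):
        for x in range(B):
            sad += abs(cur[by + y][bx + x] - prev[by + y + dy][bx + x + dx])
    return sad

def motion_vector(prev, cur, by, bx, B=4, search=2):
    """Exhaustive search over a +/- `search` window; returns the best (dy, dx)."""
    h, w = len(prev), len(prev[0])
    best, best_sad = (0, 0), float("inf")
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            # Skip displacements that fall outside the previous frame.
            if 0 <= by + dy and by + dy + B <= h and 0 <= bx + dx and bx + dx + B <= w:
                sad = block_sad(prev, cur, by, bx, dy, dx, B)
                if sad < best_sad:
                    best_sad, best = sad, (dy, dx)
    return best

# Toy frames: a bright 4x4 patch moves one pixel to the right between frames.
prev = [[0] * 8 for _ in range(8)]
cur = [[0] * 8 for _ in range(8)]
for y in range(2, 6):
    for x in range(2, 6):
        prev[y][x] = 200
        cur[y][x + 1] = 200
print(motion_vector(prev, cur, 2, 3))  # (0, -1): the block came from one pixel to the left
```

Real encoders use fast, approximate search strategies instead of this exhaustive scan precisely because of the cost noted above.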
From the discussion above, it can be noticed that in many cases, the video data will usually contain
redundancy. For example, some scenes can be almost stationary where there are almost no movements
or changes happening. In such cases, the 25 frames produced every second, assuming using the PAL
standard, will be very similar, which is a redundancy. This example represents a redundancy in the
temporal information, which is between the frames. Similarly, if large regions of an image have the same attributes, this represents a redundancy in the spatial information, which is within the same image. Both types of redundancy can be reduced as described in the next subsection, on compressed video data.
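Temporal redundancy can be quantified with a statistic as simple as the mean absolute difference (MAD) between consecutive frames: a near-zero MAD means the frames are nearly identical, and hence redundant. A sketch (the threshold is an illustrative choice):

```python
def mean_abs_diff(f1, f2):
    """Mean absolute pixel difference between two equally sized frames."""
    total = sum(abs(a - b) for r1, r2 in zip(f1, f2) for a, b in zip(r1, r2))
    pixels = len(f1) * len(f1[0])
    return total / pixels

def redundant_pairs(frames, threshold=1.0):
    """Indices i where frame i is nearly identical to frame i-1."""
    return [i for i in range(1, len(frames))
            if mean_abs_diff(frames[i - 1], frames[i]) < threshold]

# A stationary scene: three identical frames, then a changed one.
still = [[50] * 4 for _ in range(4)]
changed = [[200] * 4 for _ in range(4)]
print(redundant_pairs([still, still, still, changed]))  # [1, 2]
```

Inter-frame coders exploit exactly this observation: when the MAD is small, it is cheaper to store the (mostly zero) difference than the frame itself.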

2) Compressed Video Data
Video compression aims to reduce the redundancy that exists in video data, with minimal visual effect on the video. This is useful in multimedia storage and transmission, among other applications. Compression can be applied along one or more of the video dimensions, spatial and/or temporal. Each of them is described, with a focus on the MPEG standards, as follows:
Spatial-Coding (Intra-Coding)
The compression in the spatial dimension deals with reducing the redundancy within the same image or frame. There is no need to refer to any other frame. Hence, it is also called intra-coding, where the term "intra" means "within the same image or frame". The Discrete Cosine Transform (DCT)
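To make intra-coding concrete: the 2-D DCT of an image block concentrates the block's energy into a few low-frequency coefficients, which is what makes the subsequent quantization and entropy coding effective. A naive pure-Python sketch of the 2-D DCT-II on an 8x8 block (real codecs use fast transforms; this is only for illustration). For a perfectly flat block, all the energy ends up in the single DC coefficient:

```python
import math

def dct2(block):
    """Naive 2-D DCT-II of an NxN block, as used in MPEG intra-coding."""
    n = len(block)
    def c(k):  # orthonormalization factor
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            s = 0.0
            for y in range(n):
                for x in range(n):
                    s += (block[y][x]
                          * math.cos((2 * y + 1) * u * math.pi / (2 * n))
                          * math.cos((2 * x + 1) * v * math.pi / (2 * n)))
            out[u][v] = c(u) * c(v) * s
    return out

# A flat 8x8 block of value 100: only the DC coefficient out[0][0] is non-zero.
flat = [[100] * 8 for _ in range(8)]
coeffs = dct2(flat)
print(round(coeffs[0][0]))  # 800
print(abs(round(coeffs[0][1], 6)))  # 0.0
```

In a codec, these coefficients would next be quantized (discarding small high-frequency terms) and entropy coded, which is where the actual bit savings come from.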



