Multimedia Image
and Video Processing
Second Edition
Multimedia Image
and Video Processing
Edited by
Ling Guan
Yifeng He
Sun-Yuan Kung
Second Edition
CRC Press is an imprint of the
Taylor & Francis Group, an informa business
Boca Raton London New York
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2012 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business
No claim to original U.S. Government works
Version Date: 20120215
International Standard Book Number-13: 978-1-4398-3087-1 (eBook - PDF)
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been
made to publish reliable data and information, but the author and publisher cannot assume responsibility for the valid-
ity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright
holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this
form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may
rectify in any future reprint.


Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or uti-
lized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopy-
ing, microfilming, and recording, or in any information storage or retrieval system, without written permission from the
publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://
www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923,
978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For
organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for
identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at

and the CRC Press Web site at

Contents
List of Figures ix
Preface xxvii
Acknowledgments xxix
Introduction xxxi
Editors li
Contributors liii
Part I Fundamentals of Multimedia
1. Emerging Multimedia Standards 3
Huifang Sun
2. Fundamental Methods in Image Processing 29
April Khademi, Anastasios N. Venetsanopoulos, Alan R. Moody,
and Sridhar Krishnan
3. Application-Specific Multimedia Architecture 77
Tung-Chien Chen, Tzu-Der Chuang, and Liang-Gee Chen

4. Multimedia Information Mining 129
Zhongfei (Mark) Zhang and Ruofei Zhang
5. Information Fusion for Multimodal Analysis and Recognition 153
Yongjin Wang, Ling Guan, and Anastasios N. Venetsanopoulos
6. Multimedia-Based Affective Human–Computer Interaction 173
Yisu Zhao, Marius D. Cordea, Emil M. Petriu, and Thomas E. Whalen
Part II Methodology, Techniques, and Applications: Coding of
Video and Multimedia Content
7. Part Overview: Coding of Video and Multimedia Content 197
Oscar Au and Bing Zeng
8. Distributed Video Coding 215
Zixiang Xiong
9. Three-Dimensional Video Coding 233
Anthony Vetro
10. AVS: An Application-Oriented Video Coding Standard 255
Siwei Ma, Li Zhang, Debin Zhao, and Wen Gao
Part III Methodology, Techniques, and Applications: Multimedia
Search, Retrieval, and Management
11. Multimedia Search and Management 291
Linjun Yang, Xian-Sheng Hua, and Hong-Jiang Zhang
12. Video Modeling and Retrieval 301
Zheng-Jun Zha, Jin Yuan, Yan-Tao Zheng, and Tat-Seng Chua
13. Image Retrieval 319
Lei Zhang and Wei-Ying Ma
14. Digital Media Archival 345
Chong-Wah Ngo and Song Tan
Part IV Methodology, Techniques, and Applications: Multimedia
Security
15. Part Review on Multimedia Security 367
Alex C. Kot, Huijuan Yang, and Hong Cao
16. Introduction to Biometry 397
Carmelo Velardo, Jean-Luc Dugelay, Lionel Daniel, Antitza Dantcheva,
Nesli Erdogmus, Neslihan Kose, Rui Min, and Xuran Zhao
17. Watermarking and Fingerprinting Techniques for Multimedia Protection 419
Sridhar Krishnan, Xiaoli Li, Yaqing Niu, Ngok-Wah Ma, and Qin Zhang
18. Image and Video Copy Detection Using Content-Based Fingerprinting 459
Mehrdad Fatourechi, Xudong Lv, Mani Malek Esmaeili, Z. Jane Wang, and
Rabab K. Ward
Part V Methodology, Techniques, and Applications: Multimedia
Communications and Networking
19. Emerging Technologies in Multimedia Communications and Networking:
Challenges and Research Opportunities 489
Chang Wen Chen
20. A Proxy-Based P2P Live Streaming Network: Design, Implementation, and
Experiments 519
Dongni Ren, S H. Gary Chan, and Bin Wei
21. Scalable Video Streaming over the IEEE 802.11e WLANs 531
Chuan Heng Foh, Jianfei Cai, Yu Zhang, and Zefeng Ni
22. Resource Optimization for Distributed Video Communications 549
Yifeng He and Ling Guan
Part VI Methodology, Techniques, and Applications: Architecture
Design and Implementation for Multimedia Image and
Video Processing
23. Algorithm/Architecture Coexploration 573
Gwo Giun (Chris) Lee, He Yuan Lin, and Sun Yuan Kung

24. Dataflow-Based Design and Implementation of Image Processing
Applications 609
Chung-Ching Shen, William Plishker, and Shuvra S. Bhattacharyya
25. Application-Specific Instruction Set Processors for Video Processing 631
Sung Dae Kim and Myung Hoon Sunwoo
Part VII Methodology, Techniques, and Applications: Multimedia
Systems and Applications
26. Interactive Multimedia Technology in Learning: Integrating Multimodality,
Embodiment, and Composition for Mixed-Reality Learning Environments . . 659
David Birchfield, Harvey Thornburg, M. Colleen Megowan-Romanowicz,
Sarah Hatton, Brandon Mechtley, Igor Dolgov, Winslow Burleson, and
Gang Qian
27. Literature Survey on Recent Methods for 2D to 3D Video Conversion 691
Raymond Phan, Richard Rzeszutek, and Dimitrios Androutsos
28. Haptic Interaction and Avatar Animation Rendering Centric Telepresence
in Second Life 717
A. S. M. Mahfujur Rahman, S. K. Alamgir Hossain, and A. El Saddik
Index 741
List of Figures
1.1 Typical MPEG-1 encoder structure. 6
1.2 (a) An example of an MPEG GOP of 9, N = 9, M = 3. (b) Transmission
order of an MPEG GOP of 9 and (c) Display order of an MPEG GOP of 9. 7
1.3 Two zigzag scan methods for MPEG-2 video coding. 8
1.4 Block diagram of an H.264 encoder. 13
1.5 Encoding processing of JPEG-2000. 18
1.6 (a) MPEG-1 audio encoder. (b) MPEG-1 audio decoder. 20
1.7 Relations between tools of MPEG-7. 23
1.8 Illustration of MPEG-21 DIA. 25
2.1 Histogram example with L number of bins. (a) FLAIR MRI (brain). (b) PDF p_G(g) of (a). 31
2.2 Example histograms with varying number of bins (bin widths).
(a) 100 bins, (b) 30 bins, (c) 10 bins, (d) 5 bins. 32
2.3 Empirical histogram and KDA estimate of two random variables,
N(0, 1) and N(5,1). (a) Histogram. (b) KDA. 33
2.4 Types of kernels for KDA. (a) Box, (b) triangle, (c) Gaussian, and
(d) Epanechnikov. 34
2.5 KDA of random sample (N(0,1) +N(5,1)) for box, triangle, and
Epanechnikov kernels. (a) Box, (b) triangle, and (c) Epanechnikov. 34
2.6 Example image and its corresponding histogram with mean and variance indicated. (a) g(x, y). (b) PDF p_G(g) of (a). 36
2.7 HE techniques applied to mammogram lesions. (a) Original.
(b) Histogram equalized. 37
2.8 The KDA of lesion “(e)” in Figure 2.7, before and after enhancement.
Note that after equalization, the histogram resembles a uniform PDF.
(a) Before equalization. (b) After equalization. 38
2.9 Image segmentation based on global histogram thresholding.
(a) Original. (b) B(x, y) ∗g(x, y). (c) (1 −B(x, y)) ∗ g(x, y). 39
2.10 The result of a three-class Otsu segmentation on the image of Figure 2.6a.
The left image is the segmentation result of all three classes (each class is
assigned a unique intensity value). The remaining images are binary
segmentations for each tissue class B(x, y). (a) Otsu segmentation.
(b) Background class. (c) Brain class. (d) Lesion class. 40
2.11 Otsu’s segmentation on retinal image showing several misclassified pixels. (a) Original. (b) PDF p_G(g) of (a). (c) Otsu segmentation. 41
2.12 Example FLAIR with WML, gradient image, and fuzzy edge mapping functions. (a) y(x_1, x_2). (b) g(x_1, x_2) = ∇y. (c) ρ_k and p_G(g). (d) ρ_k(x_1, x_2). 42
2.13 T1- and T2-weighted MR images (1 mm slice thickness) of the brain and corresponding histograms. Images are from the BrainWeb database. (a) T1-weighted MRI. (b) T2-weighted MRI. (c) Histogram of Figure 2.13a. (d) Histogram of Figure 2.13b. 44
2.14 T1- and T2-weighted MR images (1 mm slice thickness) with 9% noise and corresponding histograms. Images are from the BrainWeb database. (a) T1-weighted MRI with 9% noise. (b) T2-weighted MRI with 9% noise. (c) Histogram of Figure 2.14a. (d) Histogram of Figure 2.14b. 45
2.15 (Un)correlated noise sources and their 3D surface representation. (a) 2D Gaussian IID noise. (b) Surface representation of Figure 2.15a. (c) 2D colored noise. (d) Surface representation of Figure 2.15c. 47
2.16 Empirically found M_2 distribution and the observed M_2^obs for uncorrelated and correlated 2D data of Figure 2.15. (a) p(M_2) and M_2^obs for Figure 2.15a. (b) p(M_2) and M_2^obs for Figure 2.15c. 48
2.17 Correlated 2D variables generated from normally (N) and
uniformly (U) distributed random variables. Parameters used to simulate the
random distributions are shown in Table 2.1. 49
2.18 1D nonstationary data. 50
2.19 Grid for the 2D extension of the RA test. (a), (b), and (c) show several examples of different spatial locations where the number of RAs is computed. 51
2.20 Empirically found distribution of R and the observed R for 2D stationary and nonstationary data. (a) IID stationary noise. (b) p(R) and R of (a). (c) Nonstationary noise. (d) p(R) and R of (c). 52
2.21 Nonstationary 2D variables generated from normally (N) and uniformly (U) distributed random variables. Parameters (μ, σ) and (a, b) used to simulate the underlying distributions are shown in Table 2.1. 53
2.22 Scatterplot of gradient magnitude images of original image (x-axis)
and reconstructed version (y-axis). 54
2.23 Bilaterally filtered examples. (a) Original. (b) Bilaterally filtered. (c) Original.
(d) Bilaterally filtered. 56
2.24 Image reconstruction of example shown in Figure 2.23a. (a) Y_rec^0.35, (b) Y_rec^0.50, (c) Y_est^0.58, and (d) Y_rec^0.70. 58
2.25 Reconstruction example (τ = 0.51 and τ = 0.53, respectively). (a) S(Y_rec^τ(x_1, x_2)) and C(Y_rec^τ(x_1, x_2)). (b) Y_rec^0.51. (c) Hist(Y). (d) Hist(Y_rec^0.51). (e) S(Y_rec^τ(x_1, x_2)) and C(Y_rec^τ(x_1, x_2)). (f) Y_rec^0.53. (g) Hist(Y). (h) Hist(Y_rec^0.53). 60
2.26 Normalized differences in smoothness and sharpness, between the proposed
method and the bilateral filter. (a) Smoothness. (b) Sharpness. 61
2.27 Fuzzy edge strength ρ_k versus intensity y for the image in Figure 2.23a. (a) ρ_k vs. y, (b) μ_ρ(y), and (c) μ_ρ(x_1, x_2). 62

2.28 Original image y(x_1, x_2), global edge profile μ_ρ(y), and global edge values mapped back to spatial domain μ_ρ(x_1, x_2). (a) y(x_1, x_2), (b) μ_ρ(y), and (c) μ_ρ(x_1, x_2). 63
2.29 Modified transfer function c(y) with original graylevel PDF p_Y(y), and the resultant image c(x_1, x_2). (a) c(y) and p_Y(y), and (b) the resultant image c(x_1, x_2). 64
2.30 CE transfer function and contrast-enhanced image. (a) y_CE(y) and p_Y(y). (b) Contrast-enhanced image y_CE(x_1, x_2). 65
2.31 Original, contrast-enhanced images and WML segmentation. (a–c) Original.
(d–f) Enhanced. (g–i) Segmentation. 66
2.32 One level of DWT decomposition of retinal images. (a) Normal image
decomposition; (b) decomposition of the retinal images with diabetic
retinopathy. CE was performed in the higher frequency bands (HH, LH, HL)
for visualization purposes. 68
2.33 Medical images exhibiting texture. (a) Normal small bowel, (b) small bowel

lymphoma, (c) normal retinal image, (d) central retinal vein occlusion,
(e) benign lesion, and (f) malignant lesion. CE was performed on (e) and
(f) for visualization purposes. 71
3.1 A general architecture of a multimedia application system. 79
3.2 (a) The general architecture and (b) hardware design issues of the
video/image processing engine. 80
3.3 Memory hierarchy: trade-offs and characteristics. 82
3.4 Conventional two-stage macroblock pipelining architecture. 84
3.5 Block diagram of the four-stage MB pipelining H.264/AVC encoding system. . . 85
3.6 The spatial relationship between the current macroblock and the searching
range. 86
3.7 The procedure of ME in a video coding system for a sequence. 87
3.8 Block partition of H.264/AVC variable block size. 88
3.9 The hardware architecture of 1DInterYSW, where N = 4, P_h = 2, and P_v = 2. 89
3.10 The hardware architecture of 2DInterYH, where N = 4, P_h = 2, and P_v = 2. 90
3.11 The hardware architecture of 2DInterLC, where N = 4, P_h = 2, and P_v = 2. 90
3.12 The hardware architecture of 2DIntraVS, where N = 4, P_h = 2, and P_v = 2. 91
3.13 The hardware architecture of 2DIntraKP, where N = 4, P_h = 2, and P_v = 2. 92
3.14 The hardware architecture of 2DIntraHL, where N = 4, P_h = 2, and P_v = 2. 92
3.15 (a) The concept, (b) the hardware architecture, and (c) the detailed
architecture of PE array with 1-D adder tree, of Propagate Partial SAD,
where N = 4. 93
3.16 (a) The concept, (b) the hardware architecture, and (c) the scan order
and memory access, of SAD Tree, where N = 4. 94
3.17 The hardware architecture of inter-level PE with data flow I for (a) FBSME,
where N = 16; (b) VBSME, where N = 16 and n = 4 95
3.18 The hardware architecture of Propagate Partial SAD with Data Flow II for
VBSME, where N = 16 and n = 4 96
3.19 The hardware architecture of SAD Tree with Data Flow III for VBSME,
where N = 16 and n = 4 97
3.20 Block diagram of the IME engine. It mainly consists of eight PE-Array
SAD Trees. Eight horizontally adjacent candidates are processed in parallel. . . 101
3.21 M-parallel PE-array SAD Tree architecture. The inter-candidate data reuse can be achieved in both horizontal and vertical directions with Ref. Pels Reg. Array, and the on-chip SRAM bandwidth is reduced. 101
3.22 PE-array SAD Tree architecture. The costs of the 16 4 ×4 blocks are separately summed up by 16 2-D Adder sub-trees and then reduced by one VBS Tree for larger blocks. 102
3.23 The operation loops of MRF-ME for H.264/AVC. 103
3.24 The level-C data reuse scheme. (a) There are overlapped regions of SWs for
horizontally adjacent MBs; (b) the physical location to store SW data in local
memory. 103
3.25 The MRSC scheme for MRF-ME requires multiple SW memories. The
reference pixels of multiple reference frames are loaded independently
according to the level-C data reuse scheme. 104
3.26 The SRMC scheme can exploit the frame-level DR for MRF-ME. Only a single SW memory is required. 105
3.27 Schedule of MB tasks for MRF-ME; (a) the original (MRSC) version; (b) the
proposed (SRMC) version. 106
3.28 Estimated MVPs in PMD for Lagrangian mode decision. 107
3.29 Proposed architecture with SRMC scheme. 108
3.30 The schedule of SRMC scheme in the proposed framework. 109
3.31 The rate-distortion efficiency of the reference software and the proposed framework. Four sequences with different characteristics are used for the experiment. Foreman has a lot of deformation with medium motion. Mobile has complex textures and regular motion. Akiyo has a still scene, while Stefan has large motions. The encoding parameters are baseline profile, IPPP structure, CIF, 30 frames/s, 4 reference frames, ±16-pel search range, and low-complexity mode decision. (a) Akiyo (CIF, 30 fps); (b) Mobile (CIF, 30 fps); (c) Stefan (CIF, 30 fps); (d) Foreman (CIF, 30 fps). 110
3.32 Multiple reference frame motion estimation. 112
3.33 Variable block size motion estimation. 112

3.34 Interpolation scheme for luminance component: (a) 6-tap FIR filter for half-pixel interpolation. (b) Bilinear filter for quarter-pixel interpolation. 112
3.35 Best partition for a picture with different quantization parameters
(black block: inter block, gray block: intra block). 113
3.36 FME refinement flow for each block and sub-block. 113
3.37 FME procedure of Lagrangian inter mode decision in H.264/AVC reference
software. 114
3.38 The matching cost flowchart of each candidate. 115
3.39 Nested loops of fractional motion estimation. 115
3.40 Data reuse exploration with loop analysis. (a) Original nested loops;
(b) Loop i and Loop j are interchanged. 116
3.41 Intra-candidate data reuse for fractional motion estimation. (a) Reference
pixels in the overlapped (gray) interpolation windows for two horizontally
adjacent interpolated pixels P0 and P1 can be reused; (b) Overlapped (gray)
interpolation windows data reuse for a 4 ×4 interpolated block. In total, 9 ×9 reference pixels are enough with the technique of intra-candidate data reuse. 118
3.42 Inter-candidate data reuse for half-pel refinement of fractional motion
estimation. The overlapped (gray) region of interpolation windows can be
reused to reduce memory access. 118
3.43 Hardware architecture for fractional motion estimation engine. 119
3.44 Block diagram of 4 ×4-block PU. 120
3.45 Block diagram of interpolation engine. 121
3.46 Hardware processing flow of variable-block size fractional motion estimation.
(a) Basic flow; (b) advanced flow. 121
3.47 Inter-4 ×4-block interpolation window data reuse. (a) Vertical data reuse,
(b) horizontal data reuse. 122
3.48 Search window SRAMs data arrangement. (a) Physical location of reference pixels
in the search window; (b) traditional data arrangement with 1-D random
access; (c) proposed ladder-shaped data arrangement with 2-D
random access. 122

3.49 Illustration of fractional motion estimation algorithm. The white circles are
the best integer-pixel candidates. The light-gray circles are the half-pixel
candidates. The dark-gray circles are the quarter-pixel candidates. The circles
labeled “1” and “2” are the candidates refined in the first and second passes,
respectively. (a) Conventional two-step algorithm; (b) Proposed one-pass
algorithm. The 25 candidates inside the dark square are processed in parallel. . 124
3.50 Rate-distortion performance of the proposed one-pass FME algorithm.
The solid, dashed, and dotted lines show the performance of the two-step
algorithm in the reference software, the proposed one-pass algorithm,
and the algorithm with only half-pixel refinement. 125
3.51 Architecture of fractional motion estimation. The processing engines on the
left side are used to generate the matching costs of integer-pixel and
half-pixel candidates. The transformed residues are reused to generate the
matching costs of quarter-pixel candidates with the processing engines inside
the light-gray box on the right side. Then, the 25 matching costs are compared
to find the best MV. 125
4.1 Relationships among the fields interconnected with multimedia information mining. 132
4.2 The typical architecture of a multimedia information mining system. 134
4.3 Graphic representation of the model developed for the randomized data
generation for exploiting the synergy between imagery and text. 137
4.4 The architecture of the prototype system. 142
4.5 An example of image and annotation word pairs in the generated database.
The number following each word is the corresponding weight of the word. . . . 143
4.6 The interface of the automatic image annotation prototype. 144
4.7 Average SWQP(n) comparisons between MBRM and the developed
approach. 146
4.8 Precision comparison between UPMIR and UFM. 147

4.9 Recall comparison between UPMIR and UFM. 148
4.10 Average precision comparison among UPMIR, Google Image Search, and
Yahoo! Image Search. 149
5.1 Multimodal information fusion levels. 155
5.2 Block diagram of kernel matrix fusion-based system. 164
5.3 Block diagram of KCCA-based fusion at the feature level. 165
5.4 Block diagram of KCCA-based fusion at the score level. 165
5.5 Experimental results of kernel matrix fusion (KMF)-based method (weighted
sum (WS), multiplication (M)). 167
5.6 Experimental results of KCCA-based fusion at the feature level. 167
5.7 Experimental results of KCCA-based fusion at the score level. 168
6.1 HCI devices for three main human sensing modalities: audio, video,
and haptic. 174
6.2 Examples of emotional facial expressions from JAFFE (first three rows), MMI
(fourth row), and FG-NET (last row) databases. 177
6.3 Muscle-controlled 3D wireframe head model. 179
6.4 Person-dependent recognition of facial expressions for faces from the MMI
database. 179
6.5 Person-independent recognition of facial expressions for faces from the MMI
database. 180
6.6 Visual tracking and recognition of facial expression. 181
6.7 General steps of proposed head movement detection. 182
6.8 General steps of proposed eye gaze detection. 183
6.9 Geometrical eye and nostril model. 184
6.10 Example of gaze detection based on the |D − D_0| global parameter difference. 184
6.11 Taxonomy of the human-head language attributes. 185

6.12 Fuzzy inference system for multimodal emotion evaluation. 186
6.13 Fuzzy membership functions for the five input variables. (a) Happiness,
(b) anger, (c) sadness, (d) head-movement, and (e) eye-gaze. 187
6.14 Fuzzy membership functions for the three output variables. (a) Emotion
set-A, (b) emotion set-B, and (c) emotion set-C. 188
6.15 Image sequence of female subject showing the admire emotion state. 188
6.16 Facial muscles. 190
6.17 The architecture of the 3D head and facial animation system. 190
6.18 The muscle control of the wireframe model of the face. 191
6.19 Fundamental facial expressions generated by the 3D muscle-controlled facial
animation system: surprise, disgust, fear, sadness, anger, happiness, and
neutral position. 192
7.1 9-Mode intraprediction for 4 ×4 blocks. 203
7.2 4 × 4 ICT and inverse ICT matrices in H.264. 204
7.3 Multiple reference frame. 204
8.1 (a) Direct MT source coding. (b) Indirect MT source coding (the chief executive
officer (CEO) problem). 217
8.2 Block diagram of the interframe video coder proposed by Witsenhausen
and Wyner in their 1980 patent. 219
8.3 Witsenhausen–Wyner video coding. (a) Encoding, (b) decoding. 221
8.4 Witsenhausen–Wyner video coding versus H.264/AVC and H.264/AVC IntraSkip coding when the bitstreams are protected with Reed–Solomon codes and transmitted over a simulated CDMA2000 1X channel. (a) Football with a compression/transmission rate of 3.78/4.725 Mb/s. (b) Mobile with a compression/transmission rate of 4.28/5.163 Mb/s. 221
8.5 Block diagram of layered WZ video coding 222
8.6 Error robustness performance of WZ video coding compared with H.26L FGS
for Football. The 10th decoded frame by H.26L FGS (a) and WZ video coding
(b) in the 7th simulated transmission (out of a total of 200 runs). 222
8.7 (a) 3D camera settings and (b) first pair of frames from the 720 ×288 stereo sequence “tunnel.” 223
8.8 PSNR versus frame number comparison among separate H.264/AVC coding,
two-terminal video coding, and joint encoding at the same sum rate of
6.581 Mbps for the (a) left and the (b) right sequences of the “tunnel.” 224
8.9 The general framework proposed in [46] for three-terminal video coding. 224
8.10 An example of left-and-right-to-center frame warping (based on the first
frames of the Ballet sequence). (a) The decoded left frame. (b) The original
center frame. (c) The decoded right frame. (d) The left frame warped to the
center. (e) The warped center frame, and (f) The right frame warped to
the center. 225
8.11 Depth camera-assisted MT video coding. 226
8.12 An MT video capturing system with four HD texture cameras and one
low-resolution (QCIF) depth camera. 227
8.13 An example of depth map refinement and side information comparisons.
(a) The original HD frame. (b) The preprocessed (warped) depth frame.
(c) The refined depth frame. (d) The depth frame generated without
the depth camera. (e) Side information with depth camera help, and
(f) Side information without depth camera help. 228
9.1 Applications of 3D and multiview video. 235
9.2 Illustration of inter-view prediction in MVC. 237
9.3 Sample coding results for Ballroom and Race1 sequences; each sequence
includes eight views at video graphics array (VGA) resolution. 239
9.4 Subjective picture quality evaluation results given as average
MOS with 95% confidence intervals. 241
9.5 Comparison of full-resolution and frame-compatible formats:
(a) full-resolution stereo pair; (b) side-by-side format;
(c) top-and-bottom format. 243
9.6 Illustration of video codec for scalable resolution enhancement of frame-compatible video. 244
9.7 Example of 2D-plus-depth representation. 247
9.8 Effect of down/up sampling filters on depth maps and corresponding
synthesis result (a, b) using conventional linear filters; (c, d) using
nonlinear filtering as proposed in [58]. 249
9.9 Sample plot of quality for a synthesized view versus bit rate
where optimal combinations of QP for texture and depth are
determined for a target set of bit rates. 250
10.1 The block diagram of AVS video encoder. 258
10.2 Neighboring samples used for intraluma prediction.
(a): 8 ×8 based. (b): 4 ×4 based. 259
10.3 Five intraluma prediction modes in all profiles in AVS1-P2. 260
10.4 Macroblock partitions in AVS1-P2. 261
10.5 VBMC performance testing on QCIF and 720p test sequences.
(a) QCIF and (b) 1280 ×720 Progressive. 261
10.6 Multiple reference picture performance testing. 262
10.7 Video codec architecture for video sequence with static background
(AVS1-P2 Shenzhan Profile). 262
10.8 Interpolation filter performance comparison. 264
10.9 Filtering for fractional sample accuracy MC. Uppercase letters indicate samples on the full-sample grid, lowercase letters represent samples at half- and quarter-sample positions, and all remaining samples with an integer-number subscript are at eighth-pixel locations. 265
10.10 Temporal direct mode in AVS1-P2. (a) Motion vector derivation for direct
mode in frame coding. Colocated block’s reference index is 0 (solid line),
or 1 (dashed line). (b) Motion vector derivation for direct mode in top field
coding. Colocated block’s reference index is 0. (c) Motion vector derivation for
direct mode in top field coding. Colocated block’s reference index is 1 (solid line), 2 (dashed line pointing to bottom field), or 3 (dashed line pointing to top
field). (d) Motion vector derivation for direct mode in top field coding.
Colocated block’s reference index is 1. (e) Motion vector derivation for direct
mode in top field coding. Colocated block’s reference index is 0 (solid line),
2 (dashed line pointing to bottom field), or 3 (dashed line pointing
to top field). 268
10.11 Motion vector derivation for symmetric mode in AVS1-P2. (a) Frame coding.
(b) Field coding, forward reference index is 1, backward reference index is 0.
(c) Field coding, forward reference index is 0, backward reference index is 1. . . 270
10.12 Quantization matrix patterns in AVS1-P2 Jiaqiang Profile. 272
10.13 Predefined quantization weighting parameters in AVS1-P2 Jiaqiang Profile:
(a) default parameters, (b) parameters for keeping detail information
of texture, and (c) parameters for removing detail information of texture. 272
10.14 Coefficient scan in AVS1-P2. (a) zigzag scan. (b) alternate scan. 273
10.15 Coefficient coding process in AVS1-P2 2D VLC entropy coding scheme.
(a) Flowchart of coding one intraluma block. (b) Flowchart of coding one
interluma block. (c) Flowchart of coding one interchroma block. 275
10.16 An example table in AVS1-P2—VLC1_Intra: from (Run, Level) to CodeNum. 276
10.17 Coefficient coding process in AVS1-P2 context-adaptive arithmetic coding. . . . 277
10.18 Deblocking filter process in AVS1-P2. 278
10.19 Slice-type conversion process. E: entropy coding, E^-1: entropy decoding, Q: quantization, Q^-1: inverse quantization, T: transform, T^-1: inverse transform, MC: motion compensation. (a) Convert P-slice to L-slice. (b) Convert L-slice to P-slice. 280

10.20 Slice structure in AVS1-P2. (a) Normal slice structure where the slice can only
contain continual lines of macroblocks. (b) Flexible slice set allowing more
flexible grouping of macroblocks in slice and slice set. 280
10.21 Test sequences: (a) Vidyo 1 (1280 × 720@60 Hz); (b) Kimono 1
(1920 ×1080@24 Hz); (c) Crossroad (352 ×288@30 Hz); (d) Snowroad
(352 ×288@30 Hz); (e) News and (f) Paris. 284
10.22 Rate–distortion curves of different profiles. (a) Performance of Jiaqiang Profile,
(b) performance of Shenzhan Profile, and (c) performance of Yidong Profile. . . 285
11.1 Overview of the offline processing and indexing process for a typical
multimedia search system. 292
11.2 Overview of the query process for a typical multimedia search system. 292
12.1 An illustration of SVM. The support vectors are circled. 303
12.2 The framework of automatic semantic video search. 307
12.3 The query representation as structured concept threads. 309
12.4 UI and framework of VisionGo system. 312
13.1 A general CBIR framework. 320
13.2 A typical flowchart of relevance feedback. 328
13.3 Three different two-dimensional (2D) distance metrics. The red dot q denotes
the initial query point, and the green dot q

denotes the learned optimal query
point, which is estimated to be the center of all the positive examples. Circles
and crosses are positive and negative examples. (a) Euclidean distance;
(b) normalized Euclidean distance; and (c) Mahalanobis distance. 329
13.4 The framework of search-based annotation. 333
14.1 Large digital video archival management. 347
14.2 Near-duplicates detection framework. 348
14.3 Partial near-duplicate videos. Given a video corpus, near-duplicate segments create hyperlinks to interrelate different portions of the videos. 353
14.4 A temporal network. The columns of the lattice are frames from the reference
videos, ordered according to the k-NN of the query frame sequence. The label
on each frame shows its time stamp in the video. The optimal path is
highlighted. For ease of illustration, not all paths and keyframes are shown. . . 354
14.5 Automatically tagging the movie 310 to Yuma using YouTube clips. 356
14.6 Topic structure generation and video documentation framework. 358
14.7 A graphical view of the topic structure of the news videos about “Arkansas
School Shooting.” 359
14.8 Google-context video summarization system. 361
14.9 Timeline-based visualization of videos about the topic “US Presidential
Election 2008.” Important videos are mined and aligned with news articles, and
then attached to a milestone timeline of the topic. When an event is selected,
the corresponding scene, tags, and news snippet are presented to users 361
15.1 Forgery image examples in comparison with their authentic versions. 375
15.2 Categorization of image forgery detection techniques. 378
15.3 Image acquisition model and common forensic regularities. 379
16.1 Scheme of a general biometric system and its modules: enrollment, recognition,
and update. Typical interactions among the components are shown. 399
16.2 The lines represent two examples of cumulative matching characteristic curve
plots for two different systems. The solid line represents the system that
performs better. N is the number of subjects in the database. 404
16.3 Typical examples of biometric system graphs. The two distributions
(a) represent the client/impostor scores; by varying the threshold, different
values of FAR and FRR can be computed. An ROC curve (b) is used to
summarize the operating points of a biometric system; for each different
application, different performances are required of the system. 405
16.4 (a) Average face, (b),(c) eigenfaces 1 and 2, and (d),(e) eigenfaces 998 and 999, as estimated on a subset of 1000 images of the FERET face database. 407
16.5 A colored (a) and a near-infrared (b) version of the same iris. 410
16.6 A scheme that summarizes the steps performed in Daugman’s approach. 410
16.7 Example of a fingerprint (a), and of the minutiae: (b) termination,
(c) bifurcation, (d) crossover, (e) lake, and (f) point or island. 411
16.8 The two interfaces of Google Picasa (a) and Apple iPhoto (b). Both systems summarize all the persons present in the photo collection and let the user look for a particular face among all the others. 415
17.1 Generic watermarking process. 421
17.2 Fingerprint extraction/registration and identification procedure for legacy
content protection. (a) Populating the database and (b) Identifying the
new file. 423
17.3 Structure of the proposed P2P fingerprinting method. 423
17.4 Overall spatio-temporal JND model. 425
17.5 The process of eye track analysis. 426
17.6 Watermark bit corresponding to approximate energy subregions. 429
17.7 Diagram of combined spatio-temporal JND model-guided watermark
embedding. 429
17.8 Diagram of combined spatio-temporal JND model-guided watermark
extraction. 430
17.9 (a) Original walk pal video. (b) Watermarked pal video by Model 1.
(c) Watermarked pal video by Model 2. (d) Watermarked pal video by Model 3.
(e) Watermarked pal video by the combined spatio-temporal JND model. 431
17.10 (a) Robustness versus MPEG2 compression by four models. (b) Robustness
versus MPEG4 compression by four models. 432
17.11 Robustness versus Gaussian noise. 433
17.12 Robustness versus valumetric scaling. 433
17.13 BER results of each frame versus MPEG2 compression. 434

17.14 BER results of each frame versus Gaussian noise. 435
17.15 BER results of each frame versus valumetric scaling. 435
17.16 Example of decomposition with MMP algorithm. (a) The original music signal.
(b) The MDCT coefficients of the signal. (c) The molecule atoms after 10
iterations. (d) The reconstructed signal based on the molecule atoms in (c). 439
17.17 Example of decomposition with MMP algorithm. 440
17.18 Fingerprint matching. 442
17.19 MDCT coefficients after low-pass filter. (a) MDCT coefficients of the
low-pass-filtered signal. (b) MDCT coefficient differences between the original
signal and the low-pass-filtered signal. 443
17.20 MDCT coefficients after random noise. (a) MDCT coefficients of the noised
signal. (b) MDCT coefficient differences between the original signal and the
noised signal. 444
17.21 MDCT coefficients after MP3 compression. (a) MDCT coefficients of MP3
signal with bit rate 16 kbps. (b) MDCT coefficient differences between the
original signal and the MP3 signal. 444
17.22 Fingerprint embedding flowchart. 448
17.23 Two kinds of fingerprints in a video. UF denotes that a unique fingerprint is
embedded and SF denotes that a sharable fingerprint is embedded. 452
17.24 The topology of base file and supplementary file distribution. 452
17.25 Comparison of images before and after fingerprinting. (a) Original Lena.
(b) Original Baboon. (c) Original Peppers. (d) Fingerprinted Lena.
(e) Fingerprinted Baboon. (f) Fingerprinted Peppers. 453
17.26 Images after Gaussian white noise, compression, and median filter. (a) Lena
with noise power at 7000. (b) Baboon with noise power at 7000. (c) Peppers
with noise power at 7000. (d) Lena at quality 5 of JPEG compression.
(e) Baboon at quality 5 of JPEG compression. (f) Peppers at quality 5 of JPEG
compression. (g) Lena with median filter [9 9]. (h) Baboon with median filter
[9 9]. (i) Peppers with median filter [9 9]. 454
18.1 The building blocks of a CF algorithm. 461

18.2 Overall scheme for finding copies of an original digital media using CF. 461
18.3 An example of partitioning an image into overlapping blocks
of size m ×m. 464
18.4 Some of the common preprocessing algorithms for content-based video
fingerprinting. 465
18.5 (a–c) Frames 61, 75, and 90 from a video. (d) A representative frame
generated as a result of linearly combining these frames. 466
18.6 Example of how SIFT can be used for feature extraction from an image.
(a) Original image, (b) SIFT features (original image), and (c) SIFT features
(rotated image). 468
18.7 Normalized Hamming distance. 472
18.8 (a) An original image and (b–f) sample content-preserving attacks. 473
18.9 The overall structure of FJLT, FMT-FJLT, and HCF algorithms. 477
18.10 The ROC curves for NMF, FJLT, and HCF fingerprinting algorithms when
tested on a wide range of attacks. 478
18.11 A nonsecure version of the proposed content-based video fingerprinting
algorithm. 479
18.12 Comparison of the secure and nonsecure versions in the presence of (a) time shift from −0.5 s to +0.5 s and (b) noise with variance σ^2. 479
19.1 Illustration of the wired-cum-wireless networking scenario. 499
19.2 Illustration of the proposed HTTP streaming proxy. 500
19.3 Example of B frame hierarchy. 501
19.4 User feedback-based video adaptation. 504
19.5 User attention-based video adaptation scheme. 505
19.6 Integration of UEP and authentication. (a) Joint ECC-based scheme.
(b) Joint media error and authentication protection. 512

19.7 Block diagram of the JMEAP system. 513
19.8 Structure of transmission packets. The dashed arrows represent hash
appending. 513
20.1 A proxy-based P2P streaming network. 520
20.2 Overview of FastMesh–SIM architecture. 524
20.3 Software design. (a) FastMesh architecture; (b) SIM architecture; (c) RP
architecture. 525
20.4 HKUST-Princeton trials. (a) A lab snapshot; (b) a topology snapshot; (c) screen
capture. 527
20.5 Peer delay distribution. (a) Asian Peers; (b) US peers. 528
20.6 Delay reduction by IP multicast. 529
21.1 The four ACs in an EDCA node. 534
21.2 The encoding structure. 535
21.3 An example of the loss impact results. 536
21.4 An example of the RPI value for each packet. 537
21.5 Relationship between packet loss probability, retry limit, and transmission
collision probability. 538
21.6 PSNR performance of scalable video traffic delivery over EDCA and EDCA
with various ULP schemes. 540
21.7 Packet loss rate of scalable video traffic delivery over EDCA. 540
21.8 Packet loss rate of scalable video traffic delivery over EDCA with fixed retry
limit-based ULP. 541
21.9 Packet loss rate of scalable video traffic delivery over EDCA with adaptive
retry limit-based ULP. 541
21.10 Block diagram of the proposed cross-layer QoS design. 543
21.11 PSNR of received video for DCF. 545
21.12 PSNR of received video for EDCA. 545
21.13 PSNR for our cross-layer design. 546

22.1 Illustration of a WVSN. 557
22.2 Comparison of power consumption at each sensor node. 564
22.3 Trade-off between the PSNR requirement and the achievable maximum
network lifetime in lossless transmission. 565
22.4 Comparison of the visual quality at frame 1 in the Foreman CIF sequence with different distortion requirements D_h, ∀h ∈ V: (a) D_h = 300.0, (b) D_h = 100.0, and (c) D_h = 10.0. 565
23.1 Complexity spectrum for advanced visual computing algorithms. 574
23.2 Spectrum of platforms. 576
23.3 Levels of abstraction. 577
23.4 Features in various levels of abstraction. 578
23.5 Concept of AAC. 578
23.6 Advanced visual system design methodology. 579
23.7 Dataflow model of a 4-tap FIR filter. 580
23.8 Pipeline view of dataflow in a 4-tap FIR filter 581
23.9 An example illustrating the quantification of the algorithmic degree of parallelism. 587
23.10 Lifetime analysis of input data for typical visual computing systems. 589
23.11 Filter support of a 3-tap horizontal filter. 590
23.12 Filter support of a 3-tap vertical filter. 590

23.13 Filter support of a 3-tap temporal filter. 591
23.14 Filter support of a 3 ×3 ×3 spatial–temporal filter. 591
23.15 Search windows for motion estimation: (a) Search window of a single block.
(b) Search window reuse of two consecutive blocks, where the gray region is
the overlapped region. 592
23.16 Search windows for motion estimation at coarser data granularity. (a) Search
window of a single big block. (b) Search window reuse of two consecutive big
blocks, where the gray region is the overlapped region. 593
23.17 Average external data transfer rates versus local storage at various data
granularities. 594
23.18 Dataflow graph of Loeffler DCT. 596
23.19 Dataflow graphs of various DCT: (a) 8-point CORDIC-based Loeffler DCT,
(b) 8-point integer DCT, and (c) 4-point integer DCT. 597
23.20 Reconfigurable dataflow of the 8-point type-II DCT, 8-point integer DCT, and
4-point DCT. 598
23.21 Dataflow graph of H.264/AVC. 599
23.22 Dataflow graph schedule of H.264/AVC at a fine granularity. 599
23.23 Dataflow graph schedule of H.264/AVC at a coarse granularity. 600
23.24 Data granularities possessing various shapes and sizes. 601
23.25 Linear motion trajectory in spatio-temporal domain. 602
23.26 Spatio-temporal motion search strategy for backward motion estimation. 603
23.27 Data rate comparison of the STME for various numbers of search locations. 604
23.28 PSNR comparison of the STME for various numbers of search locations. 604
23.29 PSNR comparison of ME algorithms. 605
23.30 Block diagram of the STME architecture. 605
24.1 Dataflow graph of an image processing application for Gaussian filtering. 616
24.2 A typical FPGA architecture. 621
24.3 Simplified Xilinx Virtex-6 FPGA CLB. 621
24.4 Parallel processing for tile pixels geared toward FPGA implementation. 622
24.5 A typical GPU architecture. 625

25.1 SoC components used in recent electronic devices. 633
25.2 Typical structure of ASIC. 634
25.3 Typical structure of a processor. 634
25.4 Progress of DSPs. 635
25.5 Typical structure of ASIP. 636
25.6 Xtensa LX3 DPU architecture. 638
25.7 Design flow using LISA. 639
25.8 Example of DFG representation. 640
25.9 ADL-based ASIP design flow. 641
25.10 Overall VSIP architecture. 642
25.11 Packed pixel data located in block boundary. 643
25.12 Horizontal packed addition instructions in VSIP. (a) dst = HADD(src).
(b) dst = HADD(src:mask). (c) dst = HADD(src:mask1.mask2). 643
25.13 Assembly program of core block for in-loop deblocking filter. 644
25.14 Assembly program of intraprediction. 644
25.15 Operation flow of (a) fTRAN and (b) TRAN instruction in VSIP. 645
25.16 Operation flow of ME hardware accelerator in VSIP. (a) ME operation in the
first cycle. (b) ME operation in the second cycle. 646
25.17 Architecture of the ASIP. 647
25.18 Architecture of the ASIP. 649
25.19 Architecture example of the ASIP [36] with 4 IPEU, 1 FPEU, and 1 IEU. 651
25.20 Top-level system architecture. 652
25.21 SIMD unit of the proposed ASIP. 653
26.1 SMALLab mixed-reality learning environment. 661
26.2 (a) The SMALLab system with cameras, speakers, and projector, and (b) SMALLab software architecture. 669
26.3 The block diagram of the object tracking system used in the multimodal
sensing module of SMALLab. 670

26.4 Screen capture of projected Layer Cake Builder scene. 675
26.5 Layer Cake Builder interaction architecture schematic. 676
26.6 Students collaborating to compose a layer cake structure in SMALLab. 678
26.7 Layer cake structure created in SMALLab. 678
27.1 Tsukuba image pair: left view (a) and right view (b). 692
27.2 Disparity map example. 693
27.3 3DTV System by MERL. (a) Array of 16 cameras, (b) array of 16 projectors,
(c) rear-projection 3D display with double-lenticular screen, and
(d) front-projection 3D display with single-lenticular screen. 693
27.4 The ATTEST 3D-video processing chain. 695
27.5 Flow diagram of the algorithm by Ideses et al. 697
27.6 Block diagram of the algorithm by Huang et al. 698
27.7 Block diagram of the algorithm by Chang et al. 699
27.8 Block diagram of the algorithm by Kim et al. 700
27.9 Multiview synthesis using SfM and DIBR by Knorr et al. Gray: original camera
path, red: virtual stereo cameras, blue: original camera of a multiview setup. . . 701
27.10 Block diagram of the algorithm by Li et al. 702
27.11 Block diagram of the algorithm by Wu et al. 703
27.12 Block diagram of the algorithm by Xu et al. 704
27.13 Block diagram of the algorithm by Yan et al. 704
27.14 Block diagram of the algorithm by Cheng et al. 706
27.15 Block diagram of the algorithm by Li et al. 707
27.16 Block diagram of the algorithm by Ng et al. 708
27.17 Flow chart of the algorithm by Cheng and Liang. 710
27.18 Flow chart of the algorithm by Yamada and Suzuki. 711
28.1 A basic communication block diagram depicting various components of the SL
interpersonal haptic communication system. 719
28.2 The haptic jacket controller and its hardware components. An array of vibro-tactile motors is placed in the gaiter-like wearable cloth in order to wirelessly stimulate haptic interaction. 722
28.3 The flexible avatar annotation scheme allows the user to annotate any part of
the virtual avatar body with haptic and animation properties. When the other party interacts with an annotated part, the user receives the corresponding haptic rendering on his/her haptic jacket and views the animation rendering on the screen. 722
28.4 User-dependent haptic interaction access design. The haptic and animation
data are annotated based on the target user groups such as family, friend,
lovers, and formal. 724
28.5 SL and haptic communication system block diagram. 726
28.6 A code snippet depicting portion of the Linden Script that allows customized
control of the user interaction. 727
28.7 An overview of the target-user-group-specific interaction rules stored (and shareable) in an XML file. 728
28.8 Processing time of different interfacing modules of the SL Controller. The
figure depicts the modules that interface with our system. 729
28.9 Processing time of the components of the implemented interaction controller
with respect to different haptic and animation interactions. 729
28.10 Haptic and animation rendering time over 18 samples. The interaction
response time changes due to the network parameters of
SL controller system. 731
28.11 Average of the interaction response times sampled at particular time intervals. The data were gathered during three weeks of experiment sessions and averaged. From our analysis, we observed that, depending on the server load, the user might experience delays in their interactions. 731
28.12 Interaction response time in varying density of traffic in the SL map location for
the Nearby Interaction Handler. 733
28.13 Usability study of the SL haptic interaction system. 735

28.14 Comparison between the responses of users from different (a) gender, (b) age
groups, and (c) technical background. 736