DUBLIN CITY UNIVERSITY
SCHOOL OF ELECTRONIC ENGINEERING
Detection of Interesting Events in Movies using
only the Audio signal
PHAM MINH LUAN NGUYEN
August 2009
MASTER OF ENGINEERING
IN
TELECOMMUNICATIONS
Supervised by Dr. Sean Marlow
Acknowledgements
I would like to thank my supervisor Dr. Sean Marlow for his extensive guidance, enthusiasm
and commitment to this project. Thanks are also due to Dr. David Sadlier for providing movies
and code. Thanks also to all other friends and colleagues for their contributions to this work.
Declaration
I hereby declare that, except where otherwise indicated, this document is entirely my own
work and has not been submitted in whole or in part to any other university.
Signed: Date:
Abstract
The rapid expansion of the movie industry is driving the need for efficient digital video
indexing, browsing and playback systems. This report develops an automatic detector that
finds exciting events directly in the original movie using only the audio signal. Interesting
events in movies are typically flagged by high audio amplitude, so detecting these events
from the audio amplitude is an efficient method: it is fast, taking advantage of the fact that
audio features are computationally cheaper than visual features. The detected highlight
events are then classified in order to evaluate the automatic system.
Contents
ACKNOWLEDGEMENTS
DECLARATION
ABSTRACT
CONTENTS
LIST OF FIGURES
LIST OF GRAPHS
LIST OF TABLES
CHAPTER 1 – INTRODUCTION
1.1 Related work
1.1.1 Automatically Selecting Shots for Action Movie Trailers
1.1.2 Voice Processing for Automatic TV Sports Program Highlights Detection
1.1.3 Audio/visual analysis for high-speed TV advertisement detection from MPEG bitstream
1.2 Exciting event detection in movies using the audio signal
CHAPTER 2 – MPEG-1 AUDIO/VIDEO STANDARD
2.1 Overview
2.2 MPEG-1 Layer II Audio
CHAPTER 3 – MOVIE HIGHLIGHT DETECTION
3.1 Getting Ground Truth
3.2 Automatic Detection
3.2.1 Getting Scale Factor
3.2.2 Audio amplitude threshold
CHAPTER 4 – RESULTS AND ANALYSIS
4.1 Results
4.1.1 The average audio amplitude
4.1.2 The audio amplitude threshold time
4.1.3 Results and result tables
4.2 Precision and Recall
CHAPTER 5 – CONCLUSIONS AND FURTHER WORK
5.1 System Evaluation
5.2 Further work
REFERENCES
List of Figures
Figure 2-1: ISO/MPEG-1 Layer I/II encoder
Figure 2-2: Structure of Layer-II subband samples
Figure 2-3: The data bitstream structure of Layer-II
Figure 3-1: MPEG-1 Layer-II Frequency Subbands
Figure 3-2: Video frame audio levels generated from scalefactors corresponding to temporally associated audio
List of Graphs
Graph 3-1: Per-frame audio amplitude level for example movie
Graph 3-2: Per-second audio amplitude level for example movie
Graph 3-3: Audio amplitude profile of Night at the Museum 2
Graph 3-4: Audio amplitude detection of Night at the Museum 2
Graph 3-5: Audio amplitude detection of Night at the Museum 2 and Ground Truth (blue is automatic detection; red is the Ground Truth)
Graph 3-6: Audio amplitude profile of The Kingdom
Graph 3-7: Audio amplitude detection of The Kingdom
Graph 3-8: Audio amplitude detection of The Kingdom and Ground Truth
Graph 3-9: Audio amplitude profile of The Legend of Butch and Sundance
Graph 3-10: Audio amplitude detection of The Legend of Butch and Sundance
Graph 3-11: Comparison of automatic detection result and Ground Truth
Graph 3-12: Audio amplitude profile (Night at the Museum 2 – one frame)
Graph 3-13: Automatic detection and Ground Truth (Night at the Museum 2 – one frame)
Graph 3-14: Audio amplitude profile (Night at the Museum 2 – two frames)
Graph 3-15: Automatic detection and Ground Truth (Night at the Museum 2 – two frames)
Graph 3-16: Audio amplitude profile (Night at the Museum 2 – two seconds)
Graph 3-17: Automatic detection and Ground Truth (Night at the Museum 2 – two seconds)
Graph 3-18: Audio amplitude profile (Night at the Museum 2 – four seconds)
Graph 3-19: Automatic detection and Ground Truth (Night at the Museum 2 – four seconds)
Graph 3-20: Audio amplitude profile (The Kingdom – one frame)
Graph 3-21: Automatic detection and Ground Truth (The Kingdom – one frame)
Graph 3-22: Audio amplitude profile (The Kingdom – two frames)
Graph 3-23: Automatic detection and Ground Truth (The Kingdom – two frames)
Graph 3-24: Audio amplitude profile (The Kingdom – two seconds)
Graph 3-25: Automatic detection and Ground Truth (The Kingdom – two seconds)
Graph 3-26: Audio amplitude profile (The Kingdom – four seconds)
Graph 3-27: Automatic detection and Ground Truth (The Kingdom – four seconds)
Graph 3-28: Audio amplitude profile (The Legend of Butch and Sundance – one frame)
Graph 3-29: Automatic detection and Ground Truth (The Legend of Butch and Sundance – one frame)
Graph 3-30: Audio amplitude profile (The Legend of Butch and Sundance – two frames)
Graph 3-31: Automatic detection and Ground Truth (The Legend of Butch and Sundance – two frames)
Graph 3-32: Audio amplitude profile (The Legend of Butch and Sundance – two seconds)
Graph 3-33: Automatic detection and Ground Truth (The Legend of Butch and Sundance – two seconds)
Graph 3-34: Audio amplitude profile (The Legend of Butch and Sundance – four seconds)
Graph 3-35: Automatic detection and Ground Truth (The Legend of Butch and Sundance – four seconds)
List of Tables
Table 3-1: Ground Truth of Night at the Museum 2
Table 3-2: Ground Truth of The Kingdom
Table 3-3: Ground Truth of The Kingdom (continued)
Table 3-4: Ground Truth of The Legend of Butch and Sundance
Table 3-5: Ground Truth of The Legend of Butch and Sundance (continued)
Table 4-1: Comparison of results between the automatic system and the Ground Truth
Table 4-2: Possible exciting events detected by the automatic system
Table 4-3: Ground Truth events missed by the automatic system
Table 4-4: Comparison of results between the automatic system and the Ground Truth
Table 4-5: Possible exciting events detected by the automatic system
Table 4-6: Comparison of results between the automatic system and the Ground Truth
Table 4-7: Possible exciting events detected by the automatic system
Table 4-8: Ground Truth events missed by the automatic system
Table 4-9: Precision and Recall values for three movies
Chapter 1 – Introduction
The growing availability of video content creates a strong requirement for efficient tools to
manage and access multimedia data [3]. Considerable progress has been made in audio
analysis for movie content, with automatic highlight detection being one of the targets of
recent research. Highlight detection is important, since highlights provide the user with a
short version of the movie that ideally contains all the important information for
understanding the content. Hence, the user may quickly evaluate whether the movie is
interesting or not.
Audio, which includes voice, music, and various kinds of environmental sounds, is an
important type of media and a significant part of audiovisual data. As more and more
digital audio databases are put in place these days, people are realizing the importance of
managing audio databases effectively through audio content analysis. Audio segmentation
and classification have applications in professional media production, audio archive
management, commercial music usage, surveillance, and so on. Furthermore, audio content
analysis may play a primary role in video annotation. Current approaches to video
segmentation and indexing mostly focus on the visual information. However, visual-based
processing often leads to far too fine a segmentation of the audiovisual sequence, and
combining the diverse multimedia components (audio, visual, and textual information)
will be essential in achieving a fully functional system for video parsing.
Existing research on content-based audio data management is very limited. In general it
follows four directions [6]. The first is audio segmentation and classification, where one
basic problem is speech/music discrimination. The second is audio retrieval; one specific
technique in content-based audio retrieval is query-by-humming. The third is audio analysis
for video indexing. The fourth is the integration of audio and visual information for video
segmentation and indexing.
1.1 Related work
1.1.1 Automatically Selecting Shots for Action Movie Trailers
Alan F. Smeaton, Bart Lehane, Noel E. O’Connor, Conor Brady and Gary Craig of Dublin
City University, Ireland have researched the area of movie highlights [3]. Their study was
based on the following principles:
• They utilise a shot boundary technique in order to generate the basic shot-based
structure of a movie. Colour histograms have been demonstrated as a highly accurate
and efficient method of comparing images and detecting shot boundaries.
• The audio track of a movie is analysed in order to detect the presence of the following
categories: speech, music, silence, speech with background music and other audio.
Their rationale for using these audio categories is that music can be indicative of high,
or low, points of a movie.
• For each shot they also detect two motion features, the motion intensity and the
percentage of camera movement present. The motion intensity is an indicator of the
amount of motion within each frame of video, and is determined by calculating the
standard deviation of the motion vectors.
The features used to detect trailer shots are shot length, motion intensity, the amount of
camera movement, and the amounts of speech, music, silence, speech with background music
and other audio present in each shot. The performance of their shot selection was evaluated
using the classic measures of precision and recall, where the set of shots selected by their
trained approach was compared against the ground truth of shots which appear in the official
movie trailer. Their SVM (support vector machine) approach selects shots in rank order
based on their likelihood of inclusion in the original trailer, and the specific metric they use
for evaluation is R-Precision [14]. Given a ranked list produced as the output of a system to
be evaluated, R-Precision is defined as the precision at rank position R, where R is the
number of documents or objects relevant to the query.
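As a minimal sketch of this metric (not code from the paper), R-Precision can be computed from a ranked shot list and a set of relevant ground-truth shots; the shot names below are hypothetical:

```python
def r_precision(ranked, relevant):
    """R-Precision: precision at rank position R, where R is the
    number of items relevant to the query."""
    r = len(relevant)
    if r == 0:
        return 0.0
    top_r = ranked[:r]                      # the first R ranked items
    hits = sum(1 for item in top_r if item in relevant)
    return hits / r

# Hypothetical example: shots ranked by likelihood of trailer inclusion,
# compared against 4 ground-truth trailer shots.
ranked_shots = ["s7", "s2", "s9", "s1", "s5", "s3"]
trailer_shots = {"s2", "s9", "s3", "s8"}
print(r_precision(ranked_shots, trailer_shots))  # 2 hits in top 4 -> 0.5
```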
When evaluating shot selection they face the issue of how to evaluate sub-shot retrieval. One
approach they could take is to evaluate based on the proportion of frames from the original
movie which appear in the trailer; this would correspond to the way gradual shot transitions
are evaluated in TRECVid [13] using frame-precision and frame-recall, where the evaluation
is in terms of the number of overlapping frames.
Evaluation of their approach to trailer shot selection was done using leave-one-out k-fold
cross validation. This is a technique used in information retrieval in which a dataset T is
divided into training (T1) and testing (T2) subsets, T = T1 + T2; training is done on T1 and
testing on T2, then T is re-divided into different training and testing subsets T1′ and T2′
and the training and evaluation are repeated, a total of k times.
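The k-fold procedure described above can be sketched as follows; the dataset contents and the value of k are placeholders, not details from the paper:

```python
def k_fold_splits(dataset, k):
    """Divide dataset T into k folds; each round trains on T1
    (the k-1 remaining folds) and tests on T2 (the held-out fold),
    repeated k times."""
    folds = [dataset[i::k] for i in range(k)]
    for i in range(k):
        t2 = folds[i]
        t1 = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield t1, t2

movies = [f"movie{i}" for i in range(6)]
for train, test in k_fold_splits(movies, k=3):
    print(len(train), len(test))  # 4 2 on each of the 3 rounds
```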
The results show several interesting aspects. Firstly, the consistently high results indicate
that this approach to selecting shots for action movie trailers is both accurate and reliable.
One possible danger is that the accuracy of their results could be biased by the use of
automatic shot segmentation. A correct classification of a movie trailer shot occurs when the
ground-truth trailer sub-shot occurs within the selected movie full-shot.
Three event classes were chosen (exciting, dialogue and musical) that typically encapsulate
all relevant portions of a movie. A range of low-level audiovisual features were extracted
and finite state machines were used in order to detect the events.
1.1.2 Voice Processing for Automatic TV Sports Program Highlights
Detection
This study was done by Seán Marlow, David A. Sadlier, Noel O’Connor and Noel Murphy of
Dublin City University, Ireland [4]. It uses sports programmes supplied by the Centre for
Digital Video Processing at DCU and focuses on the audio for highlight detection in sports
programmes. The authors used features of MPEG-1 Layer II audio together with a
characteristic of sports audio: the audio amplitude rises sharply when an exciting event
happens in the programme, e.g. a goal in a football match, a penalty offence or a Red Card
offence. Highlights are detected by thresholding the audio amplitude, derived from the scale
factors in the MPEG-1 Layer II bitstream. The scale factors were stripped from the audio
stream and processed to obtain an amplitude level for each frame; a highlight was flagged
when three consecutive per-frame audio amplitudes exceeded the amplitude threshold. The
authors detect highlights by thresholding the audio amplitude because it is a cheap,
fast approach. Their method detected almost all of the highlight events in the sports
programmes, and was successful in locating both the presence of highlight events and the
boundaries of the events.
Their work is a preliminary investigation into the usefulness of pure audio analysis for
summarisation of (limited types of) sports programmes. A further eight 10-minute
summaries were generated from various other broadcast sports programmes; the returned
clips make up the final summary.
In a real scenario, automatic summarisation of such broadcasts would also depend on some
combination of closed-caption (teletext) analysis and analysis at the visual level.
1.1.3 Audio/visual analysis for high-speed TV advertisement detection
from MPEG bitstream
This project is research by David A. Sadlier, Noel O’Connor, Sean Marlow and Noel Murphy
[5]. The research is concerned with TV advertisements. A television programme is typically
accompanied by beginning/end credits, with one or more ad-breaks somewhere in the
middle. To the user, these features of a programme would generally be regarded as an
insignificant part of the material. Their study was based on the following principles:
• Black Video Frame Detection: a black video frame may be recognised by its luminance
histogram, which would typically be characterised by having most of its ‘power’ at the
bottom end of the pixel amplitude spectrum, corresponding to black or very dark pixels.
• Silent Video Frame Detection: a summation of the absolute values of all the individual
audio samples corresponding to the temporal length of one video frame may be defined
as the ‘audio level’ for that frame, i.e. for a video frame with relatively quiet audio, a
low audio level would be expected. Thus, by thresholding this audio level, silent video
frames (of intensity defined by the threshold) may be detected.
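The two per-frame tests above can be sketched as follows. This is a minimal illustration assuming raw luminance values (0-255) and PCM samples for one frame; the threshold values are made-up, not taken from the paper:

```python
def is_black_frame(frame_pixels, dark_level=32, dark_fraction=0.95):
    # Most of the luminance histogram's "power" at the bottom end of
    # the pixel amplitude spectrum -> black or very dark frame.
    dark = sum(1 for p in frame_pixels if p < dark_level)
    return dark / len(frame_pixels) >= dark_fraction

def audio_level(frame_audio):
    # "Audio level" = summation of the absolute sample values covering
    # the temporal length of one video frame.
    return sum(abs(s) for s in frame_audio)

def is_silent_frame(frame_audio, threshold=1000):
    # Quiet audio gives a low audio level; the threshold defines the
    # intensity below which a frame counts as silent.
    return audio_level(frame_audio) < threshold
```

A series of frames passing both tests would then be a candidate ad-break boundary.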
The authors report that a series of black/silent video frames may indicate the existence of an
ad-break. However, they also use some features of the advertisement breaks themselves:
the length of the breaks and the number of frames between two breaks.
1.2 Exciting event detection in movies using the audio signal
We have several cases to study for event and highlight detection. In the first case, events in
movies were detected using audiovisual data [3]. In the second case, the audio signal was
used to highlight events in sports programmes [4]. In the third case, audiovisual data was
used to detect ad-breaks in television programmes [5]. However, none of these detected
events in movies using only the audio signal.
Using the audio signal to highlight events in movies is the cheaper approach: it does not need
as much computation time as methods based on audiovisual data. In this document we
choose one feature of the audio signal for highlighting events: the audio amplitude. The
audio amplitude is one indicator of exciting events, which usually occur with high audio
amplitude in movies; such high-amplitude events may be gunshots, fights, crashes or
explosions. So the audio amplitude may be helpful for highlighting events.
Chapter 2 – MPEG-1 Audio/Video Standard
2.1 Overview
The Moving Pictures Experts Group (MPEG) [15], which meets under the International
Organization for Standardization (ISO), generates international standards for digital video
and audio compression. MPEG-1 is a standard in five parts:
1. ISO/IEC 11172-1:1993
This addresses the problem of combining one or more data streams from the video and
audio parts of the MPEG-1 standard with timing information to form a single stream, i.e.
multiplexing and synchronisation of audio/video.
2. ISO/IEC 11172-2:1993
This specifies a coded representation that can be used for compressing video sequences.
3. ISO/IEC 11172-3:1993
This specifies a coded representation that can be used for compressing audio sequences
– both mono and stereo.
4. ISO/IEC 11172-4:1995
Part 4 specifies how tests can be designed to verify whether bitstreams and decoders
meet the requirements specified in parts 1, 2 and 3.
5. ISO/IEC 11172-5:1998
Technically not a standard but a technical report; it gives a full software implementation
of the first three parts of the MPEG-1 standard.
2.2 MPEG-1 Layer II Audio
The MPEG-1 audio standard (ISO/IEC 11172-3) comprises a flexible hybrid coding technique
that incorporates several methods, including subband decomposition, filter-bank analysis,
transform coding, entropy coding, dynamic bit allocation, nonuniform quantization, adaptive
segmentation, and psychoacoustic analysis. The MPEG-1 audio codec operates on 16-bit PCM
input data at sample rates of 32, 44.1 and 48 kHz. Moreover, MPEG-1 offers separate
modes for mono, stereo, dual independent mono and joint stereo. Available bit rates are
32-192 kb/s for mono and 64-384 kb/s for stereo.
The MPEG-1 architecture contains three layers of increasing complexity, delay and output
quality. Each higher layer incorporates functional blocks from the lower layers. The input
signal is first decomposed into 32 critically subsampled subbands using a polyphase
realization of a pseudo-QMF (PQMF) bank. The channels are equally spaced, such that a
48-kHz input signal is split into 750-Hz subbands, with the subbands decimated 32:1. A
511th-order prototype filter was chosen such that the inherent overall PQMF distortion
remains below the threshold of audibility. Moreover, the prototype filter was designed
for high sidelobe attenuation (96 dB) to ensure that intraband aliasing remains negligible.
Figure 2-1: ISO/MPEG-1 layer I/II encoder. [2]
For the purposes of psychoacoustic analysis and determination of just noticeable distortion
(JND) thresholds, a 512-point (Layer I) or 1024-point (Layer II) FFT is computed in parallel
with the subband decomposition for each decimated block of 12 input samples (8 ms at
48 kHz). Next, the subbands are block companded (normalized by a scale factor) such that
the maximum sample amplitude in each block is unity; then an iterative bit allocation
procedure applies the JND thresholds to select an optimal quantizer from a predetermined
set for each subband. Quantizers are selected such that both the masking and bit rate
requirements are simultaneously satisfied. In each subband, scale factors are quantized
using 6 bits and quantizer selections are encoded using 4 bits.
MPEG-1 Audio specifies three layers. The different layers offer increasingly higher audio
quality at slightly increased complexity. While Layers I and II share the basic structure of the
encoding process, having their roots in an earlier algorithm known as MUSICAM, Layer III
is substantially different.
Layer I is the simplest layer and operates at data rates between 32 and 224 kb/s per
channel. The preferred range of operation is above 128 kb/s. Layer I finds application, for
example, in the Digital Compact Cassette (DCC) at 192 kb/s per channel. Layer II is of
medium complexity and employs data rates between 32 and 192 kb/s per channel. At
128 kb/s per channel it provides very good audio quality.
The MPEG-1 Layer-II compression algorithm encodes audio signals as follows: the
frequency spectrum of the audio signal, bandlimited to 20 kHz, is uniformly divided into 32
subbands. The subbands are assigned individual bit allocations according to the audibility of
quantisation noise within each subband. A psychoacoustic model of the ear analyses the
audio signal and provides this information to the quantiser.
Layer-II frames consist of 1152 samples: 3 groups of 12 samples from each of the 32
subbands. A group of 12 samples gets a bit allocation and, if this is non-zero, a scalefactor.
Scalefactors are weights that scale groups of 12 samples such that they fully use the range of
the quantiser. The scalefactor for such a group is determined by the next largest value (given
in a look-up table) to the maximum of the absolute values of the 12 samples. Thus it
provides an indication of the maximum power exhibited by any one of the 12 samples
within the group.
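The scalefactor selection just described can be sketched as follows. Note that the lookup table here is a shortened, made-up stand-in for illustration only; the real 6-bit-indexed table is defined in ISO/IEC 11172-3:

```python
# Illustrative, abbreviated scalefactor table (descending order).
# The actual MPEG-1 table in ISO/IEC 11172-3 has many more entries.
SCALEFACTOR_TABLE = [2.0, 1.5874, 1.2599, 1.0, 0.7937, 0.6299, 0.5]

def choose_scalefactor(granule):
    """Pick the next largest table value above the maximum absolute
    value of the 12-sample granule; the samples are then divided by
    it before quantization."""
    peak = max(abs(s) for s in granule)
    chosen = SCALEFACTOR_TABLE[0]
    for sf in SCALEFACTOR_TABLE:
        if sf >= peak:
            chosen = sf      # smallest entry still covering the peak
        else:
            break
    return chosen

granule = [0.4, -0.7, 0.65, 0.1] + [0.0] * 8   # peak = 0.7
sf = choose_scalefactor(granule)               # -> 0.7937
scaled = [s / sf for s in granule]             # now within [-1, 1]
```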
Figure 2-2: Structure of Layer-II subband samples. [5]
Figure 2-3: The data bitstream structure of Layer-II. [5]
Chapter 3 – Movie highlight detection
This study focuses on audio, especially the audio amplitude. In a movie there are many kinds
of events, e.g. speech, music, speech with background music, screams. Usually the audio
amplitude does not change much if the event is just speech. An exciting event in a movie
may be a gunshot, an explosion, a laugh or a scream. When an exciting event happens, the
audio amplitude increases suddenly, e.g. a gunshot or a loud voice.
3.1 Getting Ground Truth
When we get results from the automatic detection method, how do we know how well it
performs? We need a table of the exciting events, and to build this table we have to work by
hand; we call this work the Ground Truth. To know exactly where the events happen in a
movie, we watch the movie and note the exciting events: when each event happens and how
long it lasts, writing all of this information (event time and event length) in a table. One
problem in this step is subjectivity: an event may be exciting to us but not to someone else.
To address this we can refer to the movie trailer while building the Ground Truth. The
trailer is made manually to advertise the movie, so exciting events may appear in it,
although not all exciting events do. We therefore use the trailer only as a reference for
judging how good the automatic method is.
Another problem when building the Ground Truth is the length of the events. For example,
an event may combine a gunshot with fighting and beating, so we must either choose the
main event or combine all of them into one big event. In some cases a big event lasts a long
time, so the automatic detection can return as many results as we want.
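The timestamps in the Ground Truth tables use an hh.mm.ss format alongside second offsets. A small helper (an illustrative sketch, not part of the detection system) converts between the two:

```python
def to_seconds(ts):
    """Convert a Ground Truth timestamp like '01.21.47' (hh.mm.ss)
    to an absolute offset in seconds from the start of the movie."""
    h, m, s = (int(part) for part in ts.split("."))
    return h * 3600 + m * 60 + s

# Event 19 of Night at the Museum 2: 01.21.47 - 01.22.39
start, end = to_seconds("01.21.47"), to_seconds("01.22.39")
print(start, end, end - start)  # 4907 4959 52
```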
Event | Location in movie (hh.mm.ss – hh.mm.ss) (s – s) | Classified events | Length (s)
1 | 00.01.17 – 00.01.22 (77 – 88) | Music and name of movie | 11
2 | 00.08.31 – 00.09.10 (531 – 550) | Loud noise, scream, dump | 19
3 | 00.15.26 – 00.15.49 (926 – 949) | Loud voice, scream | 23
4 | 00.24.20 – 00.24.40 (1460 – 1480) | Loud noise | 20
5 | 00.26.20 – 00.27.41 (1580 – 1661) | Buster, scream, drum-beat | 81
6 | 00.30.00 – 00.32.56 (1800 – 1976) | Drum-beat, buster, cracker, wham, fighting, sound of spear flying | 176
7 | 00.35.00 – 00.35.36 (2100 – 2136) | Scream, fighting | 36
8 | 00.49.00 – 00.49.20 (2940 – 2960) | Sound of water flowing | 20
9 | 00.56.44 – 00.57.14 (3404 – 3434) | Scream, squeak | 20
10 | 00.58.58 – 00.59.12 (3538 – 3552) | Scream, yell, charivari | 14
11 | 01.01.56 – 01.02.09 (3716 – 3729) | Scream, speech | 13
12 | 01.03.30 – 01.04.30 (3810 – 3870) | Loud voice | 60
13 | 01.07.07 – 01.07.40 (4027 – 4060) | Whirr, scream, music | 33
14 | 01.07.50 – 01.08.40 (4070 – 4120) | Alarm, scream, shouting | 50
15 | 01.14.20 – 01.16.47 (4460 – 4607) | Scream, drum-beat, crunch, clump, crash, footstep, loud noise | 147
16 | 01.17.36 – 01.17.57 (4656 – 4677) | Trumpet-call, battle-cry | 21
17 | 01.19.58 – 01.20.13 (4798 – 4813) | Beating, smack | 15
18 | 01.21.11 – 01.21.40 (4871 – 4900) | Drum beating, fighting | 29
19 | 01.21.47 – 01.22.39 (4907 – 4959) | Shouting, drum beating, fighting | 52
20 | 01.23.32 – 01.23.53 (5012 – 5033) | Crash, beating, smack | 21
21 | 01.24.09 – 01.24.56 (5049 – 5096) | Drumbeating, shouting | 47
22 | 01.31.49 – 01.32.09 (5509 – 5529) | Roaring | 20
Table 3-1: Ground Truth of Night at the Museum 2
Event | Location in movie (hh.mm.ss – hh.mm.ss) (s – s) | Classified events | Length (s)
1 | 00.00.39 – 00.00.56 (39 – 56) | Music and name of movie | 17
2 | 00.00.58 – 00.03.51 (58 – 231) | Speech, drum beating | 173
3 | 00.06.56 – 00.07.00 (416 – 420) | Gunshot | 4
4 | 00.07.30 – 00.08.00 (450 – 480) | Gunshot, machine-gun shot | 30
5 | 00.08.10 – 00.09.04 (490 – 544) | Gunshot, machine-gun shot | 54
6 | 00.09.28 – 00.10.33 (578 – 633) | Loud voice, ambulance, shouting | 55
7 | 00.11.40 – 00.11.52 (700 – 712) | Explosion | 12
8 | 00.14.50 – 00.15.30 (890 – 930) | Speech, beating | 40
9 | 00.15.42 – 00.16.20 (942 – 980) | Speech, beating | 38
10 | 00.31.15 – 00.32.54 (1875 – 1974) | Whistle, sound of wheel brake | 99
11 | 00.33.00 – 00.33.30 (1980 – 2010) | Sound of wheel brake | 30
12 | 00.47.20 – 00.47.45 (2840 – 2865) | Gunshot, crashing | 25
13 | 00.51.30 – 00.52.30 (3090 – 3150) | Alarm, drumbeat | 60
14 | 00.52.56 – 00.53.10 (3176 – 3190) | Shouting, beating, machine-gun shot | 14
15 | 01.14.45 – 01.15.15 (4485 – 4515) | Drum-beating, scraping, gunshot, explosion | 30
16 | 01.19.50 – 01.20.30 (4790 – 4830) | Explosion, crashing | 40
17 | 01.20.44 – 01.21.47 (4844 – 4907) | Shouting, beating, gunshot | 63
18 | 01.22.00 – 01.22.12 (4920 – 4932) | Crash, beating | 12
19 | 01.25.12 – 01.25.23 (5112 – 5123) | Explosion, crashing, smash | 11
20 | 01.25.30 – 01.25.44 (5130 – 5144) | Explosion, gunshot, shouting | 14
21 | 01.26.14 – 01.26.54 (5174 – 5214) | Explosion, gunshot | 40
22 | 01.27.00 – 01.27.10 (5220 – 5230) | Gunshot | 10
23 | 01.27.12 – 01.28.00 (5232 – 5280) | Gunshot | 48
24 | 01.28.05 – 01.28.26 (5285 – 5306) | Gunshot, explosion | 21
25 | 01.30.55 – 01.31.12 (5455 – 5472) | Gunshot | 17
Table 3-2: Ground Truth of The Kingdom
Event | Location in movie (hh.mm.ss – hh.mm.ss) (s – s) | Classified events | Length (s)
26 | 01.31.24 – 01.31.42 (5484 – 5502) | Gunshot, explosion | 18
27 | 01.31.46 – 01.32.02 (5506 – 5522) | Scream, gunshot, shouting | 16
28 | 01.32.47 – 01.33.05 (5567 – 5585) | Scream, gunshot | 18
29 | 01.33.10 – 01.33.56 (5590 – 5636) | Shouting, gunshot, beating | 46
30 | 01.36.44 – 01.37.07 (5804 – 5827) | Gunshot, shouting | 23
Table 3-3: Ground Truth of The Kingdom (continued)
Event | Location in movie (hh.mm.ss – hh.mm.ss) (s – s) | Classified events | Length (s)
1 | 00.00.19 – 00.00.29 (19 – 29) | Gunshot | 10
2 | 00.01.00 – 00.02.12 (60 – 132) | Gunshot, shouting, yell | 72
3 | 00.04.00 – 00.04.30 (240 – 270) | Speech | 30
4 | 00.05.09 – 00.05.40 (309 – 340) | Gunshot | 31
5 | 00.07.54 – 00.08.10 (474 – 490) | Gunshot, shouting, hoofbeat | 26
6 | 00.10.05 – 00.11.10 (605 – 670) | Gunshot, hoofbeat | 65
7 | 00.12.23 – 00.13.20 (743 – 800) | Loud voice, shouting | 57
8 | 00.14.23 – 00.14.50 (863 – 890) | Shouting | 27
9 | 00.15.50 – 00.16.42 (950 – 1002) | Scream, gunshot, shouting | 52
10 | 00.18.08 – 00.18.21 (1088 – 1101) | Gunshot | 13
Table 3-4: Ground Truth of The Legend of Butch and Sundance
Event | Location in movie (hh.mm.ss – hh.mm.ss) (s – s) | Classified events | Length (s)
11 | 00.19.54 – 00.20.05 (1194 – 1205) | Laughing | 11
12 | 00.23.15 – 00.23.55 (1395 – 1435) | Gunshot | 40
13 | 00.28.08 – 00.29.04 (1688 – 1744) | Shouting, gunshot, hoofbeat | 56
14 | 00.30.01 – 00.30.26 (1801 – 1826) | Sound of slide-action | 25
15 | 00.30.30 – 00.30.46 (1830 – 1846) | Shouting, gunshot | 16
16 | 00.42.03 – 00.43.00 (2523 – 2580) | Explosion, gunshot, shouting | 57
17 | 00.44.15 – 00.45.19 (2655 – 2709) | Gunshot, shouting | 54
18 | 00.53.27 – 00.54.46 (3207 – 3286) | Sound of slide-action | 79
19 | 00.55.01 – 00.55.31 (3301 – 3331) | Shouting | 30
20 | 00.57.40 – 00.57.56 (3460 – 3476) | Explosion | 16
21 | 00.58.20 – 00.59.16 (3500 – 3556) | Shouting, hoofbeat | 36
22 | 01.04.02 – 01.04.50 (3842 – 3890) | Gunshot | 48
23 | 01.05.02 – 01.05.58 (3902 – 3958) | Gunshot, loud voice | 56
24 | 01.10.58 – 01.11.12 (4258 – 4272) | Beating | 14
25 | 01.14.19 – 01.15.50 (4459 – 4550) | Drumbeat, gunshot, shouting | 91
26 | 01.16.30 – 01.18.40 (4590 – 4720) | Gunshot, shouting | 130
27 | 01.20.23 – 01.21.23 (4823 – 4883) | Gunshot, shouting | 60
28 | 01.21.43 – 01.22.27 (4903 – 4947) | Loud voice | 44
29 | 01.23.45 – 01.24.10 (5025 – 5040) | Gunshot | 15
Table 3-5: Ground Truth of The Legend of Butch and Sundance (continued)
3.2 Automatic Detection
The audio amplitude is a useful feature for detecting exciting events. In a movie, exciting
events may last only milliseconds, while other events last seconds or minutes. On the other
hand, some events have high audio amplitude but are not exciting, and some exciting events
do not have audio amplitude high enough to be detected. So in practice this automatic
method may miss some events. Automatic detection means obtaining the exciting events
automatically through a system: the system analyses only the audio signal of the movie and
suggests the exciting events along with their lengths.
To use the audio amplitude in our detection, we first extract only the audio from the movie
and save it as a *.aud file. The next step is to strip the scale factors from the audio file. The
scale factors give a clear view of the audio amplitude because they carry information about
it. Once we have the scale factors of the audio file, we find the audio amplitude in one movie
frame; the per-frame audio amplitude is a good indicator of the exciting events in the movie.
Before doing this, we need to choose the type of movie and the type of movie file. In some
cases the movie type gives better detection results; for example, an action movie is a good
type to work with because its exciting events usually have higher audio amplitude than the
other events. To get the scale factors we also need to know the file format, because this
depends on the compression method used. In this project the movie type is MPEG-1 and the
audio type is MPEG-1 Layer II; since we need to study the audio amplitude, MPEG-1
Layer II is a suitable choice for this sample study.
Once we have the audio level for each frame, we begin to analyse the audio amplitude of the
movie. This analysis is based on the audio amplitude, so we focus on the amplitude
threshold and on the threshold time. In each case we changed the values in order to compare
results and find the better way to detect exciting events.
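The thresholding idea above can be sketched as follows. The amplitude values, threshold and minimum run length here are illustrative assumptions, not the values used in this project:

```python
def detect_highlights(levels, threshold, min_run):
    """Flag runs where the per-frame audio amplitude stays at or above
    `threshold` for at least `min_run` consecutive frames; return
    (start_frame, length_in_frames) pairs, i.e. the suggested exciting
    events together with their lengths."""
    events, run_start = [], None
    for i, level in enumerate(levels):
        if level >= threshold:
            if run_start is None:
                run_start = i              # a loud run begins
        elif run_start is not None:
            if i - run_start >= min_run:   # run just ended: long enough?
                events.append((run_start, i - run_start))
            run_start = None
    if run_start is not None and len(levels) - run_start >= min_run:
        events.append((run_start, len(levels) - run_start))
    return events

# Hypothetical per-frame amplitude profile:
levels = [10, 80, 85, 90, 12, 95, 11, 70, 75, 72]
print(detect_highlights(levels, threshold=60, min_run=2))
# -> [(1, 3), (7, 3)]; the single loud frame at index 5 is rejected
```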
3.2.1 Getting Scale Factor
3.2.1.1 Reduction of cut-off frequency [4], [5]
One scale factor is computed for each group of 12 subband frequency samples (called a
“granule”). The maximum absolute value of the 12-sample granule is determined and
mapped to a scale factor value via a lookup table defined in the standard. The samples in the
granule are divided by the scale factor prior to the quantization stage. The dynamic range
covered by the scale factors is 120 dB.
Most of the energy in a speech signal lies between 0.1 kHz and 4 kHz. According to the
MPEG-1 Layer-II audio standard, the maximum allowable frequency component in the
audio signal is 20 kHz. At the encoder, the frequency spectrum (0 – 20 kHz) is divided
uniformly into 32 subbands, each having a bandwidth of 0.625 kHz. Thus, subbands 2
through 7 represent the frequency range from 0.625 kHz to 4.375 kHz.
By limiting the audio examination to these subbands, which approximate the range of the
speech band, we further concentrate the audio investigation on vocals, so the influence of
voices on the generation of the audio amplitude profile is increased. It was expected that
examining subbands 2 through 7 would provide a reasonable trade-off between rejection of
low-frequency background noise and the capture of exciting events.
Figure 3-1: MPEG-1 Layer-II Frequency Subbands. [4]
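A per-video-frame audio level restricted to subbands 2-7 might be computed as in the sketch below. The scalefactor array layout and the number of granules per video frame are illustrative assumptions (the real mapping depends on the video frame rate and audio sample rate), not details taken from the standard:

```python
# `scalefactors[granule][subband]` is a hypothetical array of decoded
# scalefactor values, one row per granule, 32 columns per row.
SPEECH_SUBBANDS = range(1, 7)  # zero-based indices for subbands 2..7

def frame_audio_levels(scalefactors, granules_per_video_frame=3):
    """Sum scalefactors over the speech-band subbands for each group of
    granules spanning one video frame, producing an amplitude profile."""
    levels = []
    for start in range(0, len(scalefactors), granules_per_video_frame):
        chunk = scalefactors[start:start + granules_per_video_frame]
        # Only subbands 2-7 contribute, so low-frequency background
        # noise has less influence on the amplitude profile.
        level = sum(g[b] for g in chunk for b in SPEECH_SUBBANDS)
        levels.append(level)
    return levels
```

The resulting per-frame levels are what the amplitude threshold is later applied to.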