National Central University
Department of Computer Science and Information Engineering
Master's Thesis
Complex-valued Gaussian Process Regression
for speech separation
Graduate student: Le Dinh Nguyen
Advisor: Professor Jia-Ching Wang
June 2017
NATIONAL CENTRAL UNIVERSITY
Department of Computer Science
Master Thesis
Complex-valued Gaussian Process Regression
for speech separation
Graduate student: Le Dinh Nguyen
Advisor: Jia-Ching Wang
June 2017
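Chinese Abstract
Speech separation is a challenging problem in signal processing that plays an
important role in a variety of real-world applications, such as speech
recognition systems and telecommunications. Its main goal is to estimate the
speech of each individual speaker from a mixture containing several speakers.
Because speech signals in natural environments are often disturbed by noise or
by other speech, speech separation has become an attractive research topic.
On the other hand, the Gaussian process (GP) is a kernel-based machine learning
method that has been widely applied in signal processing. In this study, we
propose a method based on Gaussian process regression (GPR) to model the
nonlinear mapping between mixed speech signals and clean speech; the
reconstructed speech signal is obtained from the mean function of the GP model.
The hyper-parameters of the model are optimized with the conjugate gradient
method. The experiments use speech data from TIMIT, and the results show that
the proposed method achieves better performance.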
Abstract
Speech separation is a challenging signal processing problem that plays a
significant role in improving the accuracy of various real-world applications,
such as speech recognition systems and telecommunications. Its main goal is to
isolate or estimate the target voice of each speaker from a mixture of several
speakers talking at the same time. Because speech signals collected in natural
environments are frequently corrupted by noise, speech separation has been an
attractive research topic over the past several decades.
The Gaussian process (GP) is a flexible kernel-based learning method that has
found widespread application in signal processing. In this thesis, a supervised
method is proposed for handling the speech separation problem. We focus on
modeling the nonlinear mapping between mixed and clean speech with GP
regression, in which the reconstructed audio signal is estimated by the
predictive mean of the GP model. The nonlinear conjugate gradient method is
used to optimize the hyper-parameters. An experiment on a subset of the TIMIT
speech dataset is carried out to confirm the validity of the proposed approach.
Acknowledgements
The work presented in this thesis was carried out at the Department of
Computer Science and Information Engineering at National Central University,
Taiwan, during the years 2015-2017.
First of all, I wish to express my deepest gratitude to my research advisor,
Professor Jia-Ching Wang, for guiding and encouraging me in my research. The
fact that the thesis is finished at all is due in great part to his endless
enthusiasm for talking about my work.
I also especially thank Ms. Sih-Huei Chen. She gave me great support on
theoretical matters and helped me take my initial thesis proposal and develop
it into a true body of work, resulting in several conference and workshop
papers that we wrote together.
I would like to thank the students in the laboratory for many interesting
discussions, for their help, and for making life at the laboratory so enjoyable.
In particular, I would like to thank Ms. Sih-Huei Chen for discussing and
co-working on the research, and Mr. Tuan Pham for helping me become familiar
with source separation.
The financial support provided by the National Central University fellowship
program and by my advisor, Professor Jia-Ching Wang, is gratefully acknowledged.
In addition, I wish to thank my family for their support in all my efforts.
Table of Contents
Chapter 1 Introduction
1.1 Motivation
1.2 Aim and Objective
1.3 Thesis Overview
Chapter 2 Background knowledge
2.1 Gaussian Process
2.1.1 Introduction
2.1.2 Covariance functions
2.1.3 Optimization of hyper-parameters
2.2 Short-time Fourier transform
2.2.1 Introduction
2.2.2 Spectrogram of STFT
2.2.3 Inverse short-time Fourier transform
2.3 Overlap-add method
2.4 Complex-valued Derivatives:
2.4.1 Differentiating complex exponentials of a real parameter
2.4.1.1 Differentiating complex exponentials
2.4.2 Differentiating function of a complex parameter
Chapter 3 Employed systems
3.1 System overview:
3.1.1 Real-valued GP-based system for source separation
3.1.2 Complex-valued GP-based system for source separation
3.2 GP regression-based source separation:
3.2.1 Real-valued GPR-based source separation
3.2.2 Complex-valued GPR-based source separation
Chapter 4 Experiments
4.1 Real-valued GP regression-based model for source separation
4.2 Complex-valued GP regression-based model for speech enhancement
Chapter 5 Conclusions and future work
Bibliographies
List of Figures
Figure 1.1 Cocktail party problem
Figure 1.2 An example of single channel source separation
Figure 2.1 GP model for regression
Figure 2.2 GP model for regression
Figure 2.3 Windows overlapping
Figure 2.4 STFT of signal
Figure 2.5 (2-D) presentation of a spectrogram
Figure 2.6 ISTF process
Figure 2.7 A general diagram of OLA analysis and synthesis system
Figure 2.8 Linear convolution
Figure 2.9 OLA overview
Figure 2.10 An example of OLA
Figure 3.1 Real-valued GPR-based system
Figure 3.2 Complex-valued GPR-based system
Figure 4.1 Spectrograms of mixture, 1 source and 1 de-noised speech
List of Tables
Table 2.1 List of common Kernel functions
Table 4.1 Source separation performance using 512-points STFT
Table 4.2 Source separation performance using 1024-points STFT
Table 4.3 SNR and SegSNR in dB averaged over the white noise
Table 4.4 SNR and SegSNR in dB averaged over the babble noise
List of symbols and abbreviations
Symbols
∪ → Joint distribution
f* → Predictive mean
x* → Test input
cov(f*) → Predictive covariance
l_d → Characteristic length-scale
σ → Variance
θ → Set of hyper-parameters
I → Identity matrix
∂ → Derivative function
z → Complex number
z_R → Real part of z
z_I → Imaginary part of z
∇ → Gradient
Abbreviations
DNN → Deep neural networks
GP → Gaussian process
GPR → Gaussian process regression
NMF → Nonnegative matrix factorization
SCSS → Single-channel speech separation
STFT → Short-time Fourier transform
DFT → Discrete Fourier transform
STFTM → STFT magnitude
FT → Fourier transform
FFT → Fast Fourier transform
iSTFT → Inverse short-time Fourier transform
iFFT → Inverse fast Fourier transform
SDR → Source-to-distortion ratio
SAR → Source-to-artifacts ratio
SIR → Source-to-interference ratio
SNR → Signal-to-noise ratio
SegSNR → Segmental signal-to-noise ratio
i.i.d → Independent and identically distributed