
HANOI UNIVERSITY OF SCIENCE AND TECHNOLOGY

MASTER THESIS
A Study on Improving
Speaker Diarization System
TUNG LAM NGUYEN

Dept. of Control Engineering and Automation

Supervisor: Dr. T. Anh Xuan Tran

School: School of Electrical Engineering

Hanoi, March 1, 2022



SOCIALIST REPUBLIC OF VIETNAM
Independence - Freedom - Happiness

CONFIRMATION OF MASTER'S THESIS REVISIONS
Full name of the author: Nguyễn Tùng Lâm
Thesis title: Nghiên cứu phương pháp cải thiện chất lượng hệ thống ghi nhật ký người nói (A Study on Improving Speaker Diarization System)
Major: Control Engineering and Automation
Student ID: CBC19018
The author, the scientific supervisor, and the thesis examination committee confirm that the author has revised and supplemented the thesis according to the minutes of the committee meeting on 31/12/2021, with the following contents:
- Pages 4, 5: Redrew the basic system diagram, separating the speaker diarization part from the speaker recognition part, and added a general description of the system.
- Page 26: Added a block diagram of the Agglomerative Hierarchical Clustering (AHC) algorithm.
- Page 27: Revised figure 2.15, the example of the AHC algorithm.
- Pages 32-34: Added a Frameworks section, which presents the self-developed Kal-Star library and notes that Kal-Star is used for the system described on pages 4, 5.
- Pages 41-49: Replaced the old system diagram with the new one and reorganized the content.
- Pages 53, 56: Moved the result tables so they are easier to follow.

February 28, 2022
Supervisor

Thesis author

CHAIR OF THE EXAMINATION COMMITTEE

SĐH.QT9.BM11

Issued for the first time on 11/11/2014




Declaration of Authorship
I, Tung Lam NGUYEN, declare that this thesis, titled “A Study on Improving Speaker
Diarization System”, and the work presented in it are my own. I confirm that:

• This work was done wholly or mainly while in candidature for a research degree at
this University.
• Where any part of this thesis has previously been submitted for a degree or any
other qualification at this University or any other institution, this has been clearly
stated.
• Where I have consulted the published work of others, this is always clearly attributed.
• Where I have quoted from the work of others, the source is always given. With the
exception of such quotations, this thesis is entirely my own work.
• I have acknowledged all main sources of help.
Signed:
Date:




“I’m not much but I’m all I have.”
- Philip K. Dick, Martian Time-Slip




Abstract
Speaker diarization is the task of dividing a conversation into segments spoken by
the same speaker, usually referred to as “who spoke when”. At Viettel, this task is
especially important to the IP contact center (IPCC) automatic quality assurance
system, through which hundreds of thousands of calls are processed every day.
Integrated within a speaker recognition system, speaker diarization helps distinguish
between agents and customers within each support call and provides further useful
insights (e.g., agent attitude and customer satisfaction). The key to performing this
task accurately is to learn discriminative speaker representations. X-Vectors,
bottleneck features of a time-delayed neural network (TDNN), have emerged as the
speaker representation of choice for many speaker diarization systems. ECAPA-TDNN,
a recent development of the X-Vector network that adds residual connections and
attention over both time frames and feature channels, has shown state-of-the-art
results on popular English corpora. The aim of this work is therefore to explore the
capability of ECAPA-TDNN versus X-Vectors in the current Vietnamese speaker
diarization system. Both the baseline and the proposed systems are evaluated on two
tasks: speaker verification, to assess the discriminative characteristics of the speaker
representations; and speaker diarization, to assess how these representations affect
the whole complex system. The data used include private data sets (IPCC_110000,
VTR_1350) and a public data set (ZALO_400). The conducted experiments show that
the proposed system outperforms the baseline system on all tasks and all data sets.




Acknowledgements
First and foremost, I would like to express my deep gratitude to my main supervisor, Dr.
T. Anh Xuan Tran. Without her outstanding guidance and patience, I would never have
finished this thesis.
I would like to thank Dr. Van Hai Do, Mr. Nhat Minh Le and my colleagues at Viettel
Cyberspace Center, whose kindness and tremendous technical assistance made my days
working on this thesis much easier.
Finally, huge thanks to my friends for the stress relief at weekends, and to my family,
who did most of the cooking so I would have more time to work on this thesis.
Hanoi, March 1, 2022




Contents

Declaration of Authorship ..... i
Abstract ..... v
Acknowledgements ..... vii

1 Introduction ..... 1
  1.1 Research Interest ..... 1
  1.2 Thesis Outline ..... 6

2 Speaker Diarization System ..... 7
  2.1 Front-end Processing ..... 7
    2.1.1 Features Extraction ..... 7
    2.1.2 Front-end Post-processing ..... 10
      2.1.2.1 Speech Enhancement ..... 10
      2.1.2.2 De-reverberation ..... 10
      2.1.2.3 Speech Separation ..... 10
  2.2 Voice Activity Detection ..... 11
  2.3 Segmentation ..... 12
  2.4 Speaker Representations ..... 13
    2.4.1 X-Vector Embeddings ..... 13
      2.4.1.1 Frame Level ..... 15
      2.4.1.2 Segment Level ..... 15
    2.4.2 ECAPA-TDNN Embeddings ..... 16
      2.4.2.1 Frame-level ..... 17
        2.4.2.1.1 1D Convolutional Layer ..... 17
        2.4.2.1.2 1D Squeeze-and-Excitation Block ..... 18
        2.4.2.1.3 Res2Net-with-Squeeze-Excitation Block ..... 19
      2.4.2.2 Segment-level ..... 19
        2.4.2.2.1 Attentive Statistical Pooling ..... 20
  2.5 Clustering ..... 22
    2.5.1 PLDA Scoring ..... 22
    2.5.2 Agglomerative Hierarchical Clustering ..... 26

3 Experiments ..... 29
  3.1 Evaluation Metrics ..... 29
    3.1.1 Equal Error Rate and Minimum Decision Cost Function ..... 29
    3.1.2 Diarization Error Rate ..... 31
  3.2 Frameworks ..... 31
    3.2.1 Kaldi ..... 31
    3.2.2 SpeechBrain ..... 32
    3.2.3 Kal-Star ..... 33
  3.3 Data Sets ..... 34
    3.3.1 IPCC_110000 ..... 35
      3.3.1.1 IPCC_110000 Verification Test Set ..... 36
      3.3.1.2 IPCC_110000 Diarization Test Set ..... 37
    3.3.2 VTR_1350 ..... 38
      3.3.2.1 VTR_1350 Verification Test Set ..... 39
      3.3.2.2 VTR_1350 Diarization Test Set ..... 39
    3.3.3 ZALO_400 ..... 40
      3.3.3.1 ZALO_400 Verification Test Set ..... 40
      3.3.3.2 ZALO_400 Diarization Test Set ..... 41
  3.4 Baseline System ..... 41
    3.4.1 Speaker Diarization System ..... 41
    3.4.2 Speaker Verification System ..... 43
  3.5 Proposed System ..... 46

4 Results ..... 51
  4.1 Speaker Verification Task ..... 51
  4.2 Speaker Diarization Task ..... 54

5 Conclusions and Future Works ..... 57




List of Figures

1.1 A traditional speaker diarization system diagram. ..... 1
1.2 An example speaker diarization result. ..... 2
1.3 An example clustering result of a 3-way conversation (adapted from [8]). Each dot represents a speech segment in two dimensions. ..... 3
1.4 Generic speaker diarization system diagram, including 3 phases: embeddings extractor training, PLDA backend training and speaker diarization. In this thesis, two state-of-the-art embeddings extractors, X-Vector and ECAPA-TDNN, are experimented with. ..... 4
1.5 Generic speaker verification system diagram, employing the same embeddings extractor and PLDA backend as used in 1.4. This system is primarily used to optimize the speaker diarization system. The EER threshold can be used for clustering without knowing the number of speakers in system 1.4. ..... 5
2.1 Diagram of an F-banks / MFCCs extraction process (adapted from [11]). ..... 8
2.2 N=10 Mel filters for signal samples sampled at 16000 Hz. ..... 9
2.3 Example output of a VAD system visualized in Audacity (audio editor) [36]. ..... 11
2.4 Diagram of the X-Vectors DNN (adapted from [58]). ..... 14
2.5 Diagram of X-Vectors’ frame-level TDNN with sub-sampling (as configured in [59]). ..... 15
2.6 Diagram of X-Vectors’ segment-level DNN (as configured in [59]). ..... 16
2.7 Complete network architecture of ECAPA-TDNN (adapted from [62]). ..... 17
2.8 Kernel sliding across speech frames in a dilated 1D-CNN layer, with k=3, d=4 and c=6. Essentially this is a TDNN layer with context of {3,0,3}. ..... 18
2.9 A 1D-Squeeze-and-Excitation block. Different colors represent different scales for channels. ..... 18
2.10 A Res2Net-with-Squeeze-Excitation block. ..... 20
2.11 Attentive Statistics Pooling (on both time frames and channels). ..... 21
2.12 An example of LDA transformation from 2D to 1D (taken from [76]). ..... 23
2.13 Fitting the parameters of the PLDA model (taken from [77]). ..... 25
2.14 Agglomerative hierarchical clustering flowchart. ..... 26
2.15 An example iterative process of agglomerative hierarchical clustering (taken from [80]). ..... 27
2.16 Visualization of the result of hierarchical clustering (taken from [80]). ..... 27
3.1 An EER plot. ..... 30
3.2 Kaldi logo. ..... 32
3.3 Kaldi general architecture diagram. ..... 32
3.4 Filtering the VTR_1350 data set by utterance duration and number of utterances per speaker. ..... 34
3.5 Generating 200 5-way conversations from the VTR_1350 data set. The minimum and maximum numbers of utterances picked from each conversation are 2 and 30, respectively. ..... 34
3.6 IPCC_110000 data distributions. ..... 35
3.7 VTR_1350 data distributions. ..... 39
3.8 ZALO_400 data distributions. ..... 40
3.9 Baseline speaker diarization system diagram. ..... 44
3.10 Baseline speaker verification system diagram. ..... 45
3.11 Proposed speaker diarization system diagram. ..... 48
3.12 Proposed speaker verification system diagram. ..... 49
4.1 A speaker diarization output of a 3-way conversation in the VTR_1350 test set. ..... 54




List of Tables

3.1 List of speech tasks and corpora that are currently supported by SpeechBrain (taken from [81]). ..... 33
3.2 IPCC_110000 data set overview. ..... 35
3.3 IPCC_110000 data subsets. ..... 36
3.4 VTR_1350 data set overview. ..... 38
3.5 ZALO_400 data set overview. ..... 40
3.6 EER and MinDCF performance of all systems on the standard VoxCeleb1 and VoxSRC 2019 test sets (taken from [62]). ..... 46
3.7 Diarization Error Rates (DERs) on the AMI dataset using the beamformed array signal on baseline and proposed systems (taken from [88]). ..... 47
4.1 EER and MinDCF performance. ..... 53
4.2 DER performance. ..... 56


List of Abbreviations

IPCC          IP Contact Center
DNN           Deep Neural Network
CNN           Convolutional Neural Network
TDNN          Time-Delayed Neural Network
RTTM          Rich Transcription Time Marked
RNN           Recurrent Neural Network
LPC           Linear Prediction Coding
PLP           Perceptual Linear Prediction
DWT           Discrete Wavelet Transform
MFBC          Mel Filterbank Coefficients
MFCC          Mel Frequency Cepstral Coefficients
STFT          Short-time Discrete Fourier Transform
DCT           Discrete Cosine Transform
WPE           Weighted Prediction Error
MLE           Maximum Likelihood Estimation
PIT           Permutation Invariant Training
VAD           Voice Activity Detection
SAD           Speech Activity Detection
HMM           Hidden Markov Model
GMM           Gaussian Mixture Model
GLR           Generalized Likelihood Ratio
BIC           Bayesian Information Criterion
UBM           Universal Background Model
LDA           Linear Discriminant Analysis
PLDA          Probabilistic Linear Discriminant Analysis
LSTM          Long Short-Term Memory
SE-Res2Net    Res2Net-with-Squeeze-Excitation
ReLU          Rectified Linear Unit
AAM           Additive Angular Margin
AHC           Agglomerative Hierarchical Clustering
EER           Equal Error Rate
CER           Crossover Error Rate
FAR           False Acceptance Rate
FRR           False Rejection Rate
TPR           True Positive Rate
FPR           False Positive Rate
FNR           False Negative Rate
MinDCF        Minimum Decision Cost Function
DER           Diarization Error Rate
PCM           Pulse-Code Modulation
SNR           Signal-to-Noise Ratio


Chapter 1

Introduction

1.1 Research Interest

Speaker diarization, usually referred to as "who spoke when", is the method of dividing a
conversation that often includes a number of speakers into segments spoken by the same
speaker. This task is especially important to the Viettel IP contact center (IPCC) automatic
quality assurance system, where hundreds of thousands of calls are processed every day
while human resources are limited and costly. In scenarios where only single-channel
recordings are provided, speaker diarization, integrated within a speaker recognition
system, helps distinguish between agents and customers within each support call and
provides further useful insights (e.g., agent attitude and customer satisfaction). Beyond the
contact center, speaker diarization can also be applied to analyzing other forms of recorded
conversations such as meetings, medical therapy sessions, court sessions, and talk shows.
FIGURE 1.1: A traditional speaker diarization system diagram (audio input → front-end
processing → voice activity detection → segmentation → speaker representation →
clustering → post-processing → diarization output).

A traditional speaker diarization system (figure 1.1) is built from six modules: front-end
processing, voice activity detection, segmentation, speaker representation, clustering,
and post-processing. All output information, including the number of speakers and the
beginning time and duration of each of their speech segments, is encapsulated in the form
of a Rich Transcription Time Marked (RTTM) file [1] (figure 1.2).
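As a concrete illustration, a diarization result can be written as one RTTM "SPEAKER"
record per speech segment. The following is a minimal Python sketch, not taken from the
thesis; the recording ID, timings, and speaker labels are hypothetical:

    # Minimal sketch: write diarization output as NIST RTTM "SPEAKER" records.
    # Hypothetical segments: (onset in seconds, duration in seconds, speaker label).
    segments = [
        (0.52, 3.41, "spk00"),
        (4.10, 2.07, "spk01"),
        (6.30, 1.85, "spk00"),
    ]

    with open("call_001.rttm", "w") as f:
        for onset, dur, spk in segments:
            # RTTM fields: type file chan tbeg tdur ortho stype name conf slat
            f.write(f"SPEAKER call_001 1 {onset:.2f} {dur:.2f} "
                    f"<NA> <NA> {spk} <NA> <NA>\n")

Each record carries the segment's onset, duration, and assigned cluster label; fields that
RTTM defines but diarization does not produce (e.g., orthography and confidence) are
filled with "<NA>".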

