
ADVANCES IN PUNCTUATION AND
DISFLUENCY PREDICTION
WANG XUANCONG
B.Sc. (Hons.) NUS
A THESIS SUBMITTED
FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
NUS GRADUATE SCHOOL FOR INTEGRATIVE
SCIENCES AND ENGINEERING
NATIONAL UNIVERSITY OF SINGAPORE
2015

DECLARATION
I hereby declare that this thesis is my original work and it has been written by
me in its entirety. I have duly acknowledged all the sources of information which
have been used in the thesis.
This thesis has also not been submitted for any degree in any university previ-
ously.
Wang Xuancong
23 January 2015

Acknowledgment
My PhD journey has been a life journey during which I have not only acquired
knowledge in the field of speech and natural language processing, but also learned
various research skills: how to collaborate with other people, and how to
analyze problems and come up with effective solutions. Now, at the end of this
journey, it is time to acknowledge all those who have contributed to it.
First and foremost, I would like to thank my main supervisor Prof. Ng Hwee
Tou and my co-supervisor Prof. Sim Khe Chai. I began my initial research in
speech processing under Prof. Sim Khe Chai. As a physics undergraduate, I
lacked many of the techniques needed for computer science research. Prof. Sim was


very patient and helpful in teaching me those basic experimental skills in addition
to knowledge in speech processing. Later, my research focus shifted to natural
language processing (NLP), because I realized that there was a gap between
speech recognition and natural language processing in real-life applications,
and that some intermediate processing was indispensable for downstream NLP
tasks. Prof. Ng, with his many years of experience in the NLP field, has helped
me tremendously in coming up with useful ideas and tackling difficult problems.
Under their teaching and supervision, I have acquired knowledge in both the
speech and NLP fields. They have also devoted much time to providing invaluable
guidance and assistance in the writing of my papers and thesis. Discussions
with them have been very pleasant and helpful in improving my scientific
skills.
Next, I would like to thank the other member of my thesis advisory committee,
Prof. Wang Ye. His guidance and feedback throughout my candidature have
always been helpful and encouraging.
I would also like to thank my friends, schoolmates and colleagues in NUS
Graduate School for Integrative Sciences and Engineering and NUS School of
Computing for their support, helpful discussions, and fellowship.
Finally, I would like to thank my parents for their continued emotional care
and spiritual support especially when I encountered difficulties or failures.
Contents
1 Introduction 1
1.1 Why do we need to predict punctuation? . . . . . . . . . . . . . . 3
1.2 Why do we need to predict disfluency? . . . . . . . . . . . . . . . 6
1.3 Contributions of this Thesis . . . . . . . . . . . . . . . . . . . . . 9
1.3.1 Dynamic Conditional Random Fields for Joint Sentence
Boundary and Punctuation Prediction . . . . . . . . . . . 9
1.3.2 A Beam-Search Decoder for Disfluency Detection . . . . 10

1.3.3 Combining Punctuation and Disfluency Prediction . . . . 11
1.4 Organization of the Thesis . . . . . . . . . . . . . . . . . . . . . 11
2 Related Work 13
2.1 Sentence Boundary and Punctuation Prediction . . . . . . . . . . 14
2.2 Disfluency Prediction . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3 Joint Learning and Joint Label Prediction . . . . . . . . . . . . . 17
2.4 Model Combination using Beam-Search Decoders . . . . . . . . . 19
3 Machine Learning Models 21
3.1 Conditional Random Fields . . . . . . . . . . . . . . . . . . . . . 21
3.2 Max-margin Markov Networks (M3N) . . . . . . . . . . . . . . . 26
3.3 Graphical Model Extension . . . . . . . . . . . . . . . . . . . . . 27
3.4 The Relationship between Model Complexity and Clique Order . 32
4 Dynamic Conditional Random Fields for Joint Sentence Boundary
and Punctuation Prediction 36
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.2 Model Description . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.3 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.3.1 Lexical Features . . . . . . . . . . . . . . . . . . . . . . 38
4.3.2 Prosodic Features . . . . . . . . . . . . . . . . . . . . . . 39
4.3.3 Normalized N-gram Language Model Scores . . . . . . . 39
4.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.4.1 Data Preparation . . . . . . . . . . . . . . . . . . . . . . 40
4.4.2 Incremental Local Training . . . . . . . . . . . . . . . . . 41
4.4.3 Vocabulary Pruning . . . . . . . . . . . . . . . . . . . . . 42
4.4.4 Experimental Results . . . . . . . . . . . . . . . . . . . . 43
4.4.5 Comparison to a two-stage LCRF+LCRF . . . . . . . . . 45
4.4.6 Results on the Switchboard Corpus . . . . . . . . . . . . 46
4.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5 A Beam-Search Decoder for Disfluency Detection 48

5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.2 The Improved Baseline System . . . . . . . . . . . . . . . . . . . 49
5.2.1 Node-Weighted and Label-Weighted Max-Margin Markov
Networks (M3N) . . . . . . . . . . . . . . . . . . . . . . 50
5.2.2 Features . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.3 The Beam-Search Decoder Framework . . . . . . . . . . . . . . . 53
5.3.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . 53
5.3.2 General Framework . . . . . . . . . . . . . . . . . . . . . 55
5.3.3 Hypothesis Producers . . . . . . . . . . . . . . . . . . . . 57
5.3.4 Hypothesis Evaluators . . . . . . . . . . . . . . . . . . . 59
5.3.5 Integrating M3N into the Decoder Framework . . . . . . . 60
5.3.6 POS-Class Specific Expert Models . . . . . . . . . . . . . 61
5.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.4.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . 63
5.4.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
5.4.3 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . 67
5.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
6 Combining Punctuation and Disfluency Prediction: An Empirical Study 70
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
6.2 The Baseline System . . . . . . . . . . . . . . . . . . . . . . . . 72
6.2.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . 72
6.2.2 Features . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
6.2.3 Evaluation and Results . . . . . . . . . . . . . . . . . . . 75
6.3 The Cascade Approach . . . . . . . . . . . . . . . . . . . . . . . 77
6.3.1 Hard Cascade . . . . . . . . . . . . . . . . . . . . . . . . 78
6.3.2 Soft Cascade . . . . . . . . . . . . . . . . . . . . . . . . 79
6.3.3 Experimental Results . . . . . . . . . . . . . . . . . . . . 79
6.4 The Rescoring Approach . . . . . . . . . . . . . . . . . . . . . . 82

6.5 The Joint Approach . . . . . . . . . . . . . . . . . . . . . . . . . 84
6.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
6.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
7 Conclusion and Future Work 90
Summary
With the advancement of automatic speech recognition (ASR) technology, more
and more natural language processing (NLP) applications have entered our
daily life, for example spoken language translation, automatic question
answering, and speech information retrieval. When dealing with recognized spontaneous
speech, several natural problems arise. Firstly, recognized speech does not have
punctuation or sentence boundary information. Secondly, spontaneous speech
contains disfluency which carries no useful content information. The lack of punc-
tuation and sentence boundary information and the presence of disfluency affect
the performance of downstream NLP tasks.
Thus, the goal of this work is to develop or improve algorithms to automati-
cally detect sentence boundaries, add punctuation, and identify disfluent words in
recognized speech so as to improve the performance of downstream NLP tasks.
Specifically, we focus on punctuation prediction and disfluency prediction. For
punctuation prediction, we propose using dynamic conditional random fields for
joint sentence boundary and punctuation prediction. We have also investigated
several model optimization techniques which are important for practical applica-
tions. For disfluency prediction, we propose a beam-search decoder approach.
Our decoder can combine generative models like n-gram language models (LM)
and discriminative models like Max-margin Markov Networks (M3N). Lastly, we
have performed an empirical study on various state-of-the-art methods for com-
bining the two tasks, and we have highlighted some insights in balancing the trade-
off between performance and efficiency for building practical systems.


List of Tables
4.1 Comparison of punctuation prediction F1 measures (in %) for dif-
ferent algorithms and features for punctuation prediction. . . . . . 45
4.2 Comparison of F1 measures (in %) on sentence boundary detec-
tion (Stage 1) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.3 Comparison of F1 measures (in %) on punctuation prediction (Stage
2) (using predicted sentence boundaries from Stage 1 for LCRF
and MaxEnt) . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.4 Comparison between DCRF and LCRF on the Switchboard corpus. 46
5.1 Feature templates for filler word prediction . . . . . . . . . . . . 53
5.2 Feature templates for edit word prediction . . . . . . . . . . . . . 54
5.3 Baseline edit F1 scores for different POS tags . . . . . . . . . . . 55
5.4 POS classes for expert M3N models and their baseline F1 scores . 62
5.5 Weighted hamming loss, v(y_t, ȳ_t), for M3N for both stages . . . . 64
5.6 Edit detection F1 scores (%) of expert models on all words be-
longing to that POS class in the test set (expert-M3N column),
and baseline model on all words belonging to that POS class in
the test set (baseline-M3N column) . . . . . . . . . . . . . . . . . 65
5.7 Degradation of the overall performance by expert models com-
pared to the baseline model . . . . . . . . . . . . . . . . . . . . . 65
5.8 Performance of the beam-search decoder with different combina-
tions of components . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.9 An example showing the effect of measuring the quality of the
cleaned-up sentence. . . . . . . . . . . . . . . . . . . . . . . . . 68
6.1 Corpus statistics for all the experiments. *: each conversation pro-
duces two long/sentence-joined sequences, one from each speaker. 73
6.2 Labels for punctuation prediction and disfluency prediction. . . . 74
6.3 Feature templates for disfluency prediction, or punctuation pre-
diction, or joint prediction for all the experiments in this chapter. . 76
6.4 Baseline results showing the degradation by joining utterances
into long sentences, removing precision/recall balancing, and re-
ducing the clique order of features. All models are trained using
M3N. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
6.5 Performance comparison between the hard cascade method and
the soft cascade method with respect to the baseline isolated pre-
diction. All models are trained using M3N without balancing pre-
cision and recall. . . . . . . . . . . . . . . . . . . . . . . . . . . 80
6.6 Performance comparison between the rescoring method and the
soft-cascade method with respect to the baseline isolated predic-
tion. The rescoring is done on 2n² hypotheses. All models are
trained using M3N without balancing precision and recall. Figures
in brackets are the oracle F1 scores of the 2n² hypotheses.
*: on the development set, the best overall result is obtained at
n = 10. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.7 Performance comparison among 2-layer FCRF, mixed-label LCRF
and cross-product LCRF, with respect to the soft-cascade and the
isolated prediction baseline. All models are trained using GRMM,
with reduced clique orders. . . . . . . . . . . . . . . . . . . . . . 85


List of Figures
3.1 Graphical structure of a linear-chain CRF of length T (three dif-
ferent ways of drawing). . . . . . . . . . . . . . . . . . . . . . . 29
3.2 Graphical structure of two dynamic CRFs of length T . . . . . . . 30
3.3 Graphical structure of a 2-layer factorial CRF of length T . . . . . 30
3.4 Linear-chain CRF with a 3rd-order clique . . . . . . . . . . . . . 31
3.5 An illustration of the number of weights for every observation
feature on each clique of a dynamic CRF. . . . . . . . . . . . . . 33
4.1 A graphical representation of the three basic undirected graphical
models. y_i denotes the 1st-layer label, z_i denotes the 2nd-layer
label, and x_i denotes the observation sequence. . . . . . . . . . 38
4.2 An example showing the two layers of factorial CRF labels for a
sentence in TDT3 English corpus . . . . . . . . . . . . . . . . . . 39
4.3 Punctuation statistics and distribution of the number of words in
an utterance in the preprocessed TDT3 corpus. . . . . . . . . . . . 41
4.4 The effect of vocabulary pruning and feature pruning. The x-
axis represents the value x such that the proportion of the fea-
tures/vocabulary remaining after pruning is 2^(−x). . . . . . . . . . . 42
6.1 Illustration of the rescoring pipeline framework using the four
M3N models used in the soft-cascade method: P(PU|x), P(DF|PU, x),
P(DF|x) and P(PU|DF, x) . . . . . . . . . . . . . . . . . . . . . 81
6.2 Illustration using (a) mixed-label LCRF; (b) cross-product LCRF;
and (c) 2-layer FCRF, for joint punctuation (PU) and disfluency
(DF) prediction. Unshaded nodes are observations and shaded
nodes are variables to be predicted. . . . . . . . . . . . . . . . . . 84
7.1 An overall picture showing the relationship between different ma-
chine learning models used in this thesis and their evolution over
time. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
Chapter 1
Introduction
Language plays a very important role in the advancement of human civilization.
It is through the use of language that knowledge is spread, and it is also
through language that people communicate with one another. Lan-
guage can be expressed in the form of speech (spoken language) or written text
(written language). After the invention of computers, especially with the popular-
ization of personal computers, the interaction between human and computer has
become more and more frequent. In the early days, the primary means of interac-
tion with computers was through electro-mechanical devices like keyboards and
mice. With the advancement of computer science, especially machine learn-
ing theory, people started to develop computer algorithms that can recognize human
speech. Automatic speech recognition (ASR) technology has advanced signif-
icantly in recent decades. Some day in the future, speech may well become
the dominant way of interacting with computers and mobile devices, because it
is the natural way human beings interact with one another. In fact, many natural

language processing (NLP) applications have already been adopted in our daily
life. For example, iPhone’s Siri can now recognize and interpret human speech
queries and execute user commands. Automatic spoken language translators (e.g.,
Google Translate) can recognize speech in a source language, translate it into a
target language, and synthesize speech in the target language. There are also many
other spoken language processing tasks which are still undergoing research, e.g.,
automatic creation of meeting minutes, telephone speech tracking, etc. In a typi-
cal application framework, the ASR system is used to convert speech into text for
downstream NLP systems to process.
One problem when dealing with ASR output is that it has neither punctua-
tion nor complete sentence boundary information (Ostendorf et al., 2008). Punc-
tuation is a set of symbols that is used to divide text into sentences, clauses, etc.,
for the disambiguation of meaning. Punctuation symbols occur only in written
language and are not pronounced in spoken language. Thus, conventional ASR systems do not out-
put punctuation symbols because they only model audible speech sounds. How-
ever, ASR systems are able to detect silence duration and use that information to
predict sentence boundaries. Typically, if the silence duration is longer than some
pre-set threshold, a sentence boundary is inserted. Therefore, if the speaker pauses too
long in the middle of a sentence, or too briefly between two sentences, the
ASR system might not be able to predict the sentence boundaries accurately.
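As a hedged sketch (not taken from any particular ASR system), the pause-threshold heuristic just described can be written as follows; the word/pause input format and the 0.5-second threshold are illustrative assumptions:

```python
def segment_by_silence(words, pause_after, threshold=0.5):
    """Split a recognized word sequence into sentences by inserting a
    boundary after any word whose following silence exceeds the
    threshold (in seconds).

    `words` and `pause_after` are parallel lists; the threshold value
    is an illustrative assumption, not a standard setting.
    """
    segments, current = [], []
    for word, pause in zip(words, pause_after):
        current.append(word)
        if pause > threshold:   # long silence: assume a sentence boundary
            segments.append(current)
            current = []
    if current:                 # flush any trailing, unterminated segment
        segments.append(current)
    return segments
```

Run on an utterance like "i dont know why" with a long hesitation after "know", this heuristic wrongly splits one sentence into two, which is exactly the failure mode described above.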
The other problem in processing human speech is that spontaneous speech of-
ten contains disfluency. A previous study (Shriberg, 1999) shows that about
5–10% of natural conversation is disfluent. The proportion not only varies from
person to person and from country to country, but also depends on the circumstances in
which the speech is made. For example, everyday telephone conversations usu-
ally contain more disfluency than news reports. The presence of disfluency affects
ASR performance significantly; the effect is more prominent in spontaneous
speech than in read speech. Moreover, the presence of disfluency also

confuses downstream NLP applications since most systems are trained using flu-
ent text.
In the rest of this chapter, we will first introduce the use of punctuation in
written text and the presence of disfluency in spontaneous speech. After that, we
will give a brief summary of the three main contributions of this thesis, followed
by an outline of the thesis.
In the literature, spotting disfluent word tokens in a text is called
“disfluency detection”, because the disfluent word tokens are already
present in the text for us to identify. However, inserting
punctuation symbols into unpunctuated text is called “punctuation prediction”, be-
cause punctuation is not present in the original text, so the algorithm needs to find
possible locations and insert an appropriate punctuation symbol at each location.
In this thesis, since we have treated both tasks as label prediction tasks, we will
refer to both problems as prediction tasks for simplicity. We also use the
terms “disfluency detection” and “disfluency prediction” interchangeably in some
sections, i.e., the term “disfluency prediction” in this thesis refers to “disfluency
detection” in the literature.
1.1 Why do we need to predict punctuation?
Punctuation is a very important constituent in written language. It is a product
of language evolution: not all languages have contained punctuation from the
beginning. For example, punctuation was not used in Japanese and
Korean writing until the late 19th and early 20th centuries. Moreover, the
punctuation used in ancient Chinese is very different from that used today. In fact, most of
the ancient inscriptions do not contain punctuation.
Humans introduced punctuation into written language because without it,
the meaning of a sequence of words can often be am-
biguous. This kind of ambiguity can occur both inside a sentence (intra-sentence)
and across sentences (inter-sentence). At the intra-sentence level, for example,
consider the following two sentences:
“Woman, without her man, is nothing.”
“Woman: without her, man is nothing.”
The first sentence essentially says that “woman is nothing” and emphasizes
the importance of man, while the second essentially says that “man is
nothing” and emphasizes the importance of woman (example adapted from
Wikipedia). This ambiguity arises because the word ‘her’ can be either a
third-person singular pronoun or a possessive determiner indicating belonging to a female
entity. Moreover, this kind of ambiguity can also occur at the inter-sentence level.
For example,
“John fell sick. In the hospital, there was another man.”
“John fell sick in the hospital. There was another man.”
Without punctuation, we are not sure whether there was another man in the hos-
pital or John fell sick in the hospital. This ambiguity arises because the adver-
bial phrase “in the hospital” can either post-modify the previous sentence or pre-
modify the next sentence. As such, there is uncertainty in the position of the
sentence boundary. Moreover, sometimes whether a sentence boundary is present
can also lead to some ambiguity. For example,
“I don’t know why.”
“I don’t know. Why?”
In the first case, the speaker says in one sentence that he/she does not know the
reason. The second case splits the word sequence into two sentences: in
the first, the speaker declares that he/she does not know, and in the second,
he/she asks for the reason. Without knowledge of the context or
without listening to the actual speech, it is very difficult to determine which case
is more appropriate because both are grammatically correct.
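The two readings above correspond to two different per-token punctuation label sequences, which is how this thesis frames the task. The following is a minimal illustrative sketch; the tag names (NONE, PERIOD, QMARK) are assumptions for the example, not the exact tag set used later:

```python
def apply_punct(words, tags):
    """Render a word sequence with one punctuation label per token,
    where each label names the symbol inserted after that word."""
    symbol = {"NONE": "", "PERIOD": ".", "QMARK": "?"}
    return " ".join(w + symbol[t] for w, t in zip(words, tags))

words = ["i", "don't", "know", "why"]
reading1 = ["NONE", "NONE", "NONE", "PERIOD"]   # "I don't know why."
reading2 = ["NONE", "NONE", "PERIOD", "QMARK"]  # "I don't know. Why?"
```

A punctuation predictor must choose between label sequences like `reading1` and `reading2` given only the word sequence and, when available, prosodic cues.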

Interestingly, this kind of structural ambiguity can be partially resolved in
speech by increasing the pause duration after those words followed by punctu-
ation symbols. This is one reason why, without punctuation symbols, raw text
contains less information than the corresponding speech. In addition, we
can speak a statement-like sentence in a rising tone to turn it into a question. Sim-
ilarly, in text form, we can put a question mark at the end of a statement-like
sentence to denote that it is a question, e.g., “you are sure about that?”. Further-
more, a sentence can also be spoken in a more emphatic form to express emotion.
Such sentences are called exclamatory sentences and we denote them by ending
the sentence with an exclamation mark. Features such as pause duration and ris-
ing/falling tone are also called prosodic features or acoustic features because they
describe the characteristics of speech sound. From these two examples, we can
also see that both prosody and punctuation introduce additional information apart
from the raw sequence of words.
From the above analysis, punctuation has two main purposes: firstly, it breaks
up a sequence of words into smaller linguistic units to establish a hierarchical
structure, which can reduce ambiguity and make the text easier to read; secondly,
it indicates the purpose of the sentence. Therefore, by predicting punctuation
in a text, we can recover structure information in the original text and reduce
ambiguity in its meaning, which can aid parsing and semantic analysis. We can
also infer its sentence type, e.g., whether it is a question or a statement,
which can be useful for machine translation, because given the same sequence of
words, translating it as a question can be very different from translating it as a
statement.
1.2 Why do we need to predict disfluency?
Disfluency is an artefact of spoken language. It only occurs in speech, but not
in written text. Disfluent speech may contain breaks, irregularities, and other
non-lexical vocables. Disfluency can result from a few factors. It could be that
the speaker has made a mistake in speech and wants to make a correction. It could

also be that people pause for a short moment to think about what they should say
next. Moreover, some people have the habit of inserting words/phrases such as
“uh-huh”, “I mean”, “you know”, etc., every so often while they speak.
Not all types of disfluency show up in ASR output. For example,
if a speaker pauses to think for a moment without making any voiced sounds, and
resumes his speech without speaking any words incorrectly, then provided that
the speech is transcribed correctly, the ASR output will not contain disfluency.
Sometimes, the speaker might have spoken an incomplete word and then aborted
that word. Since conventional ASR systems do not output partial words, the ASR
system might output either no words, a word different from the intended word,
or the intended word. Partial-word detection and elimination are considered
disfluency processing at the sub-word level. They are usually handled by the speech
recognizer. In the NLP literature for disfluency prediction, people mainly focus
on word-level disfluency that is reflected in the text form.
At the text level, there are mainly two types of disfluencies: filler words and
edit words. Filler words include filled pauses (e.g., ‘uh’, ‘um’) and discourse
markers (e.g., “I mean”, “you know”). They are insertions in spontaneous speech
to indicate pauses or mark boundaries in discourse. Edit words are words that
are spoken wrongly and then corrected by the speaker. For example, consider the
utterance:
I want a flight
Edit
  
to Boston
Filler
  
uh I mean
Repair
  

to Denver
The speaker initially spoke the wrong destination, “to Boston”. After
that, he paused and signaled a correction by saying “uh, I mean”.
Finally, he spoke the correct destination, “to Denver”. So the phrase “to Boston”,
which is spoken incorrectly, is called the reparandum in linguistics. It is to be re-
placed by the correct destination, “to Denver”, which is called the repair. The words
“uh I mean” are called fillers. They are inserted to give the speaker some time to
think about the correct destination and to give listeners a cue that he is making a
correction afterwards. More complex disfluencies can be reduced to this simple
scheme. For example, the speaker can restart a sentence or abort a sentence. In
that case, the entire incomplete sentence is the reparandum.
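The clean-up convention implied by this scheme can be sketched as follows: drop reparanda and fillers, keep repairs and fluent words. The tag names are illustrative assumptions, and the spans are given here rather than predicted (predicting them is the actual detection task):

```python
def remove_disfluency(tagged_words):
    """Produce the fluent version of an utterance from (word, tag)
    pairs, where tag is one of "RM" (reparandum), "FL" (filler),
    "RP" (repair), or "O" (fluent)."""
    return [w for w, tag in tagged_words if tag in ("RP", "O")]

utterance = [("i", "O"), ("want", "O"), ("a", "O"), ("flight", "O"),
             ("to", "RM"), ("boston", "RM"),            # reparandum
             ("uh", "FL"), ("i", "FL"), ("mean", "FL"), # fillers
             ("to", "RP"), ("denver", "RP")]            # repair
```

`remove_disfluency(utterance)` yields the intended fluent word sequence "i want a flight to denver", which is what downstream NLP systems would rather consume.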
Disfluency varies significantly from person to person, and it also depends on