How Vietnamese attitudes can be recognized and confused:
Cross-cultural perception and speech prosody analysis
Dang-Khoa Mac, Eric Castelli
International Research Center MICA
HUST-CNRS/UMI 2954 Grenoble INP
Hanoi, Vietnam
{dang-khoa.mac, eric.castelli}@mica.edu.vn
Véronique Aubergé (1), Albert Rilliard (2)
(1) Laboratory of Informatics of Grenoble (LIG), (2) LIMSI
CNRS
(1) Grenoble, (2) Orsay, France



Abstract - Prosodic attitudes, or social affects, are a main
part of face-to-face interaction and are linked to language
through culture. This paper presents a study of prosodic
attitudes in Vietnamese, a tonal language. Perception
experiments on 16 Vietnamese attitudes were carried out
with Vietnamese and French participants. The results
revealed perception differences between native and
non-native listeners. As attitudinal expressions are partially
carried by speech prosody, a prosodic analysis was also
carried out in order to better understand why these
attitudes are recognized or confused, and to bring out some
prosodic characteristics of Vietnamese social affects.
Keywords - Vietnamese, attitude, perception, prosodic
analysis
I. INTRODUCTION
During communication between humans, speech is an
important information channel to express mental,
intentional, attitudinal and emotional states. According to
some theoretical models of affects [1], the affective
expression in speech communication may be controlled at
different levels of cognitive processing, from the
involuntarily controlled expressions of emotion to the
intentionally, voluntarily controlled expressions of
attitudes. Therefore, attitudes and emotions can be
distinguished depending on the nature of the control exerted
by the speaker (voluntary vs. involuntary) [2]. Some types
of expressivity may be expressed as either an attitude or an
emotion. For example, “surprise” can be considered as an
attitude when expressed during a voluntary process;
otherwise it can be considered as an emotion.
Attitude expression carries the intention and point of
view of the speaker (e.g. surprise, confirmation, politeness,
etc.) [3]. Attitudes are constructed for each language and
each culture, and they need to be learned by children or by
second-language students [5]. As all attitudinal expressions
are constructed for a certain language and culture, they can
differ between languages. Some attitudes can be expected
to have a universal value (e.g. “surprise”), but attitudes
specific to one language may not be recognized or may be
ambiguous in another language [7]. The understanding of
this phenomenon may benefit from cross-cultural studies
[3,6,7].
The important role of prosody in emotional and
attitudinal expression has been shown in many studies [4,9].
According to [4], some emotions can be characterized by
the mean level and the range of F0. That study also
showed different contour shapes for different emotions.
In a tonal language such as Vietnamese, the acoustic
parameters involved in the linguistic and affective
functions of prosody (F0, intensity, timing) also play an
important role at the phonemic level for lexical access.
Moreover, Vietnamese tones use voice quality settings
such as creaky voice [12], which are used in the morphology
of attitudes and emotions in some other languages [11].
After presenting the corpus, we describe the perceptual
experiment with Vietnamese and French participants. The
results show the differences in attitude perception between
native and non-native listeners. Then, a prosodic
analysis is presented and discussed to help explain
the results of the perception test. The paper concludes
with a discussion.
II. EXPERIMENTS
A. The corpus

In research on social affects in different
languages [5,10], the attitudes were selected using the
literature on foreign-language teaching. Unfortunately,
as Vietnamese is an under-resourced language, there is little
research on Vietnamese expressive speech. We found only one
study [12], which describes 16 Vietnamese attitudes (cf.
Table I). These attitudes were selected and recorded
audio-visually by a male native speaker from Hanoi (standard
pronunciation of Vietnamese). However, for the purpose of
prosodic analysis, this paper addresses only the audio
information of the Vietnamese attitudes.
TABLE I. SELECTION OF 16 VIETNAMESE ATTITUDES, WITH THEIR
ABBREVIATIONS

Attitude                          Abbr.   Attitude                 Abbr.
Declaration                       DEC     Irritation               IRR
Interrogation                     INT     Sarcastic irony          SAR
Exclamation of neutral surprise   EXo     Scorn                    SCO
Exclamation of positive surprise  EXp     Politeness               POL
Exclamation of negative surprise  EXn     Admiration               ADM
Obviousness                       OBV     Infant-directed speech   IDS
Doubt-Incredulity                 DOU     Seduction                SED
Authority                         AUT     Colloquial               COL
B. Perception tests
The perception test was carried out to study how
native and non-native listeners recognize and confuse the
16 Vietnamese attitudes. To examine the influence of
sentence length, three sentences of one, two and five
syllables were chosen from the corpus. To control for a
possible effect of Vietnamese tones on the perception of
attitudes, all syllables carry tone 1 (the level
tone). The perception test therefore comprises 48 stimuli (3
sentences × 16 attitudes).
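As a minimal sketch of how this stimulus set is composed, the following Python snippet enumerates the sentence and attitude combinations and derives the chance level for a 16-way forced choice. The abbreviations are those of Table I; the variable names and the choice to identify each sentence only by its syllable count are assumptions made for illustration.

```python
from itertools import product

# Attitude abbreviations as listed in Table I.
ATTITUDES = ["DEC", "INT", "EXo", "EXp", "EXn", "OBV", "DOU", "AUT",
             "IRR", "SAR", "SCO", "POL", "ADM", "IDS", "SED", "COL"]

# Sentences are identified here only by their length in syllables
# (an illustrative simplification).
SENTENCE_LENGTHS = [1, 2, 5]

# One stimulus per (sentence, attitude) pair.
stimuli = list(product(SENTENCE_LENGTHS, ATTITUDES))

# With 16 response labels, a random answer is correct 1 time in 16.
chance_level = 100.0 / len(ATTITUDES)

print(len(stimuli), chance_level)
```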
2011 International Conference on Asian Language Processing
978-0-7695-4554-7/11 $26.00 © 2011 IEEE
DOI 10.1109/IALP.2011.39
Forty listeners participated in this experiment: 20
Vietnamese (10 men and 10 women) who speak the same
dialect as the speaker, and 20 French (10 men and 10
women) who had never been exposed to the Vietnamese
language. The test interface gave them the labels and the
definitions of the 16 attitudes (in the native language of the
listeners). No listener reported any difficulty in
understanding the concepts of these 16 attitudes. All
subjects listened to each stimulus only once. After
each stimulus, they were asked to indicate the perceived
attitude among the 16 presented.
C. Result analysis
Effect of factors: First, a repeated-measures ANOVA
was carried out to evaluate the relative importance of the
following factors on the listeners’ perception: sentence
length (number of syllables), the listeners’ linguistic
background (native or non-native) and the listeners’
gender. The ANOVA shows that the listeners’ linguistic
background has a significant effect on perception
(p<0.01): Vietnamese and French listeners do not perceive
these expressions in the same way. In contrast, sentence
length (number of syllables) and the listeners’ gender have
no significant effect on perception at the 1% level (p>0.01).

TABLE II. THE OUTPUT OF ANOVA IN PERCENT OF GOOD
ANSWERS. SIGNIFICANT EFFECTS AT THE 1% LEVEL ARE MARKED WITH AN ASTERISK.

Factors                               df   F          p
Attitude                              15   28.700     0.000 *
Listener (Vietnamese or French)       1    1286.772   0.000 *
Gender of listener                    1    3.754      0.053
Sentence length (Num. of syllables)   2    1.376      0.253

Attitude recognition: Figure 1 presents the recognition
rates (in percent) of the 16 attitudes for both groups of
listeners. Overall, most of the attitudes were recognized
above chance level, and native listeners had higher
recognition scores than non-native ones. Some attitudes were
well recognized by both Vietnamese and French listeners:
DEC, AUT, IRR, SAR, SED.

Figure 1. Recognition rate of 16 attitudes by Vietnamese and French
listeners. The dashed line indicates the chance level (6.25%).
Some other attitudes received low recognition scores
(POL) or were recognized by neither Vietnamese nor
French listeners (ADM). The SCO and IDS attitudes were
well recognized by Vietnamese listeners but hardly
recognized by French listeners. Conversely, the EXn
attitude was recognized by the French listeners, but not by
the Vietnamese ones.
Attitude confusion: The analysis of the confusions
between attitudes gives interesting details on the
perceptual proximity between the 16 expressive labels.
From the confusion matrices, confusion graphs (cf. Figure
2) were built, reporting all the confusions higher than twice
the chance level (i.e. 12.5%).
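The thresholding step described above can be sketched as follows: build a confusion matrix in percent from (presented, answered) label pairs, keep the off-diagonal cells above twice the chance level, and list the reciprocal pairs. This is an illustration under stated assumptions, not the authors' tooling; the function names are hypothetical and the toy responses are invented, not data from the experiment.

```python
from collections import Counter, defaultdict

def confusion_percentages(responses):
    """responses: iterable of (presented, answered) label pairs.
    Returns {presented: {answered: percent of that attitude's trials}}."""
    counts = defaultdict(Counter)
    for presented, answered in responses:
        counts[presented][answered] += 1
    matrix = {}
    for presented, row in counts.items():
        total = sum(row.values())
        matrix[presented] = {ans: 100.0 * n / total for ans, n in row.items()}
    return matrix

def strong_confusions(matrix, n_labels=16, factor=2.0):
    """Off-diagonal cells above factor * chance (12.5% for 16 labels)."""
    threshold = factor * 100.0 / n_labels
    return {(p, a): pct
            for p, row in matrix.items()
            for a, pct in row.items()
            if p != a and pct > threshold}

def reciprocal_pairs(confusions):
    """Pairs (A, B) where A is confused with B and B with A."""
    return sorted({tuple(sorted(pair)) for pair in confusions
                   if (pair[1], pair[0]) in confusions})

# Hypothetical toy responses, NOT data from the paper.
toy = ([("SAR", "SAR")] * 6 + [("SAR", "SCO")] * 4 +
       [("SCO", "SCO")] * 7 + [("SCO", "SAR")] * 3 +
       [("DEC", "DEC")] * 9 + [("DEC", "POL")] * 1)
matrix = confusion_percentages(toy)
confusions = strong_confusions(matrix)
print(reciprocal_pairs(confusions))
```

In this toy set the DEC to POL confusion (10%) stays below the 12.5% threshold and is therefore not reported, whereas both directions of the SAR and SCO confusion exceed it.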
For both Vietnamese and French listeners, ADM was
not recognized: it was confused with COL and EXo (by
Vietnamese listeners) and with COL and IDS (by French
listeners). Vietnamese listeners did not recognize the EXn
attitude and confused it with EXo and DOU. French listeners
did not recognize IDS and confused it with SAR or DOU.
Vietnamese listeners made reciprocal confusions
between some pairs or groups of attitudes: SAR and SCO;
POL and DEC; SED and COL; EXo, EXn and DOU.
French listeners made reciprocal confusions between AUT
and IRR; DEC and OBV; DOU and EXn; DOU and EXo.




Figure 2. Confusion graphs (in percentage of recognition) for
Vietnamese (top) and French (bottom) listeners. The reciprocal
confusions are in bold
Some similarities can be found in the confusions of
Vietnamese and French listeners. Both groups made
reciprocal confusions between EXn, DOU and EXo. They
strongly confused EXp with EXn (>30%) and IRR with AUT
(about 25%). They also confused POL, COL, INT and
EXn with DEC. However, there are some differences
between the two groups. SED was strongly confused with COL
(33% of confusion) by Vietnamese listeners, but not by
French listeners. For Vietnamese listeners, SAR and SCO
show strong reciprocal confusions, while the French
listeners show no confusion between these two attitudes.
III. PROSODIC ANALYSIS
A prosodic analysis was carried out to give some
acoustic explanations of the recognition and confusion of
the 16 Vietnamese attitudes. According to the ANOVA
(cf. Table II), sentence length has no influence on the
perception of attitudes. Among the three
types of sentence, only the five-syllable sentence has the
complete structure of a Vietnamese sentence (Subject - Verb
- Object). The five-syllable sentence also allows
us to analyze the variations of prosodic parameters in the
different parts of the sentence (first, middle and last part).
Therefore, and to save space, the prosodic analysis was
carried out only on the five-syllable sentence.
A. Principal Component Analysis (PCA)
The audio signals of the 16 attitudes were phonetically
segmented by hand. Three acoustic parameters were
extracted automatically: F0 (in semitones calculated with 1
Hz as the reference value), syllabic duration (in seconds),
and intensity (in dB). We calculated the mean values of F0
and intensity over each sentence (F0_mean, Int_mean), the
slope over the last syllable (F0_final_slope, Int_Final_slope) and
the slope over the whole sentence (i.e., the mean value of the last
syllable minus the mean value of the first syllable:
F0_slope, Int_slope). For the syllabic duration, the mean
(dur_mean) and the length of the final syllable (final_length)
were calculated. Using the parameters described above as
features, separate Principal Component Analyses were
carried out in order to see how well these acoustic
parameters distinguish the 16 attitudes
(Figure 3).
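The F0 features named above can be sketched in a few lines of Python. The semitone conversion (12 times the base-2 logarithm of the frequency ratio, with 1 Hz as reference) and the sentence slope (last-syllable mean minus first-syllable mean) follow the text. The text does not say how the final-syllable slope was computed, so the least-squares slope over sample index used here is an assumption, as are the function names and the example F0 values.

```python
import math
from statistics import mean

def hz_to_semitones(f0_hz, ref_hz=1.0):
    """Convert F0 in Hz to semitones relative to ref_hz (1 Hz in the text)."""
    return 12.0 * math.log2(f0_hz / ref_hz)

def linear_slope(values):
    """Least-squares slope of values against their sample index.
    Assumed definition; needs at least two samples."""
    n = len(values)
    x_mean = (n - 1) / 2.0
    y_mean = mean(values)
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den

def f0_features(syllables_hz):
    """syllables_hz: one list of voiced F0 samples (Hz) per syllable."""
    st = [[hz_to_semitones(f) for f in syl] for syl in syllables_hz]
    all_samples = [v for syl in st for v in syl]
    return {
        "F0_mean": mean(all_samples),
        # Sentence slope as defined in the text:
        # last-syllable mean minus first-syllable mean.
        "F0_slope": mean(st[-1]) - mean(st[0]),
        # Final slope: regression slope over the last syllable (assumption).
        "F0_final_slope": linear_slope(st[-1]),
    }

# Hypothetical F0 tracks; not measurements from the corpus.
example = [[100.0, 100.0], [110.0, 110.0], [200.0, 200.0]]
features = f0_features(example)
print(features)
```

In this example the last syllable is one octave above the first, so the sentence slope comes out at about 12 semitones. The same pattern would apply to intensity in dB without the semitone conversion.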
In the PCA based on the F0 parameters, F0 slope
separates the 16 attitudes into two groups: attitudes with a
rising F0 contour (EXp, IRR, EXo, DOU, EXn, ADM,
OBV) and the others, with a falling F0 contour. The F0 final
slope shows that ADM, EXn, DOU, INT and DEC
have a rising F0 on the last syllable, whereas OBV, AUT and IDS
have a falling F0 on the last syllable. IRR and EXp are
characterized by a high F0 mean and a high positive F0 slope.
OBV and AUT are distinguished from the other attitudes
by a very low, negative F0 final slope.
In the PCA based on intensity, the mean intensity
parameter singles out some attitudes with very low
intensity (ADM, COL, SED, SCO, POL). AUT, IDS and
EXp have the highest mean intensity and a positive final
slope. The Int_slope parameter is important for distinguishing
IRR (highest positive slope) and SED (lowest negative
slope).
With the duration parameters, IDS, SCO and SAR are
set apart by a high mean duration. IDS is also distinguished
by high values of both the mean duration and the length of
the last syllable.
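To illustrate how a two-dimensional PCA map of this kind can be obtained from a small feature table, here is a dependency-free sketch. It standardizes the features, computes their covariance matrix, and finds the first two components by power iteration with deflation. This numerical method is swapped in for the sake of a self-contained example; the text does not say which PCA implementation was used. All function names and feature values are hypothetical.

```python
import math

def standardize(rows):
    """Z-score each column (feature) across rows (attitudes).
    Assumes no column is constant."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
            for j in range(d)]
    return [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]

def covariance(rows):
    """Sample covariance matrix of already-centred rows."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]

def top_component(cov, iterations=500):
    """Dominant eigenvector and eigenvalue by power iteration."""
    d = len(cov)
    v = [float(i + 1) for i in range(d)]  # generic, non-symmetric start
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:  # matrix has no variance left in this direction
            break
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                     for i in range(d))
    return v, eigenvalue

def pca_2d(rows):
    """Project rows onto the first two principal components."""
    z = standardize(rows)
    cov = covariance(z)
    v1, l1 = top_component(cov)
    # Deflation: remove the first component before finding the second.
    d = len(cov)
    deflated = [[cov[i][j] - l1 * v1[i] * v1[j] for j in range(d)]
                for i in range(d)]
    v2, _ = top_component(deflated)
    return [(sum(a * b for a, b in zip(r, v1)),
             sum(a * b for a, b in zip(r, v2))) for r in z]

# Hypothetical rows (F0_mean, F0_slope, F0_final_slope); not corpus data.
toy = {
    "A": [80.0, 2.0, 1.0],
    "B": [85.0, 3.0, -1.0],
    "C": [78.0, -2.0, 0.5],
    "D": [76.0, -3.0, -0.5],
}
scores = pca_2d(list(toy.values()))
for label, (pc1, pc2) in zip(toy, scores):
    print(label, round(pc1, 3), round(pc2, 3))
```

Standardizing first keeps a feature with a large numeric range (such as mean F0 in semitones) from dominating the components, which is a design choice of this sketch rather than something stated in the text.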

B. Prosodic contours comparison
For all attitudes, the F0 contours were extracted (in
semitones calculated with 1 Hz as the reference value) to
examine the similarities and the specific shapes of the
intonation contours. Figure 4 shows the F0 contours of the
five-syllable sentence for the 16 Vietnamese attitudes.
Overall, most attitudes have a duration of 0.8 to 1 s.
However, three attitudes (SCO, SAR, IDS) are
twice as long as the others.
Figure 3. Two main dimensions of the PCA for the 16 attitudes, based on F0
(top), intensity (middle) and duration (bottom)
For most attitudes, the F0 curves in the middle of the
sentence (from the second syllable to the next-to-last
syllable) are nearly the same. The F0 contours of the
attitudes differ mostly on the first and the last
syllables. Research on other languages also shows the
informative weight of the first syllable [8]. In the case of
Vietnamese, the attitudes AUT, IRR, OBV and EXp have
a first syllable with a long duration and a rising F0.
Among them, IRR has a last syllable with a level
contour, while EXp, OBV and AUT have a last syllable with
a falling contour.
The F0 contours of DEC, POL and ADM are nearly
identical, with a flat shape on all syllables. That may explain
why they were confused in the perception test. INT, EXn
and DOU have the same last-syllable shape (slightly
rising), which may cause some confusion between them.
According to the perception test, Vietnamese listeners
recognized the SAR and SCO attitudes, but with a strong
reciprocal confusion. This result can also be explained
by the similar shapes of their F0 contours. Both attitudes
have a long overall duration, due to a substantial
lengthening of their first and last syllables. Their F0
contours rise rapidly from the first syllable and fall
after the second syllable. EXp and OBV have a distinctive
last-syllable shape, which rises at the beginning but
falls rapidly at the end. IDS can also be
distinguished from the other attitudes by having the longest
duration.
IV. DISCUSSION AND CONCLUSIONS
Using a cross-cultural perception test, 16 Vietnamese
attitudes were evaluated by native and non-native listeners.
The experimental results do not show any significant effect of
the listener’s gender or of sentence length. In contrast,
there are some clear differences between the perception
of native and non-native listeners. Some attitudes, such as
DEC, AUT, IRR, SAR and SED, were well recognized by both
Vietnamese and French listeners. One can suppose that the
concepts and the expressions of these attitudes are similar
in the two languages and the two cultures. Other
attitudes are recognized by native listeners but hardly
recognized by non-native ones (SCO and IDS). Such
attitudes may be conceptually encoded using different
strategies by Vietnamese and French speakers.
The fact that some attitudes were not recognized by
Vietnamese listeners, French listeners, or both may be
explained by the assumption that such attitudes
cannot be distinguished satisfactorily from others on the
basis of audio information only, outside any pertinent
interaction context: the listeners may need more
information, particularly visual information from the
face or from gestures, to distinguish such attitudes. This raises
interesting questions for future research on audio-visual
perception and the analysis of facial parameters. This is
particularly the case for the EXn attitude, which is not
recognized by natives while non-natives do recognize it:
the subtle variations of prosody may not be sufficient when
set against the sentence’s meaning, a problem that
non-native listeners do not have.
The prosodic analysis offered some reasonable
explanations for the recognition and confusion of these 16
attitudes. It also gives us some basic characteristics of
Vietnamese attitudes. These are the baseline results for our
future work on modeling Vietnamese prosodic attitudes.
However, this analysis was limited to three prosodic
parameters (F0, intensity and duration). Future work
will also deal with voice quality analysis and visual
parameter analysis, in order to provide a more complete
description of Vietnamese social affects. Future work will
also explore the importance of the tonal system in the
production and the perception of Vietnamese attitudes, not
only for native speakers, but also for foreign speakers without any
linguistic knowledge of a tonal language: will they be able
to separate tonal from attitudinal information?
Figure 4. The F0 contours of the five-syllable sentence for the 16
Vietnamese attitudes
REFERENCES
[1] K. R. Scherer and H. Ellgring, “Multimodal Expression of
Emotion: Affect Programs or Componential Appraisal Patterns?”,
Emotion, 7(1), pp. 158-171, 2007.
[2] V. Aubergé, “A Gestalt Morphology of Prosody Directed by
Functions: the Example of a Step by Step Model Developed at
ICP”, Speech Prosody, 2002.
[3] F. Daneš, “Involvement with language and in language”, Journal
of Pragmatics, 22, pp. 251-264, 1994.
[4] T. Bänziger and K. R. Scherer, “The role of intonation in emotional
expressions”, Speech Communication, 46(3-4), pp. 252-267, 2005.
[5] P. Delattre, “Les dix intonations de base du français”, The French
Review, 40(1), pp. 1-14, 1966.
[6] S. Shigeno, “Cultural similarities and differences in the recognition
of audio-visual speech stimuli”, ICSLP98, 1998.
[7] K. R. Scherer, R. Banse and H. G. Wallbott, “Emotion inferences from
vocal expression correlate across languages and cultures”, Journal
of Cross-Cultural Psychology, 32(1), pp. 76-92, 2001.
[8] V. Aubergé, T. Grépillat and A. Rilliard, “Can we perceive attitudes
before the end of sentences? The gating paradigm for prosodic
contours”, 5th Eurospeech, 1997.
[9] S. Mozziconacci, “Prosody and Emotion”, Speech Prosody, 2002.
[10] M. L. Diaféria, “Les Attitudes de l’Anglais : Premiers Indices
Prosodiques”, Master thesis, INP Grenoble, France, 2002.
[11] C. Gobl and A. Ní Chasaide, “The role of voice quality in
communicating emotion, mood and attitude”, Speech
Communication, 40(1-2), pp. 189-212, 2003.
[12] T. X. Le, “Etude contrastive de l’intonation expressive en français
et en vietnamien”, PhD thesis of Linguistic and Phonetic,
Université Paris 3, 1989.