
A Conversation Analysis–Informed Test
of L2 Aural Pragmatic Comprehension
F. SCOTT WALTERS
Queens College, City University of New York
Flushing, New York, United States

Speech act theory–based second language pragmatics testing (SLPT)
raises test-validation issues owing to a lack of correspondence with
empirical conversational data. On the assumption that conversation
analysis (CA) provides a more accurate account of language use, it is
suggested that CA serve as a more empirically valid basis for SLPT development. The current study explores this notion by administering a pilot
CA-informed test (CAIT) of listening comprehension to learners of
English as a second language (ESL) and to a control group of native
speakers of English. The listening CAIT protocol involved participants’
addressing multiple-choice items after listening to audiotaped conversational sequences derived from the CA literature. Statistical analyses of
pilot-test responses, correlations of test score with participant demographic variables, and CA-informed, qualitative analyses of nonnative
and native speaker responses with reference to operationalized pragmatic norms provided tentative evidence that the CAIT aural-comprehension measure possesses some utility in SLPT.

Second language pragmatics testing (SLPT) is a relatively new area of
language testing (e.g., Hudson, Detmer, & Brown, 1995; Yamashita,
2001). Nevertheless, well-known standardized tests of English as a second
or foreign language have components which address pragmatics. For
example, the TOEFL Internet-based Test (or TOEFL iBT) contains a section focusing in part on “listening for pragmatic understanding,” defined
as “to understand a speaker’s purpose, attitude, degree of certainty, etc.”
(Educational Testing Service, 2007, p. 31). Similarly, Part 1 of the
Cambridge First Certificate in English (FCE), Paper 4, contains multiple-choice items, some of which target pragmatics, that is, “function, purpose, [or] attitude” of a speaker (University of Cambridge Local
Examinations Syndicate, 2007, p. 2). SLPT itself has a theoretical basis in
speech act theory (Austin, 1962; Searle, 1975) and methodological roots
in cross-cultural pragmatics (e.g., Blum-Kulka, House, & Kasper, 1989) and in interlanguage pragmatics (e.g., Blum-Kulka, 1982; Faerch &
Kasper, 1989). However, current speech act–based SLPT practices reveal
certain problems in validity. That is, given that validity (or validation) involves making inferences from test responses about a given skill (Bachman,
1990; Chapelle, 1999), this article argues that speech act–based SLPT elicits responses from which appropriate skill inferences cannot be made. This
article then describes a pilot study in which conversation analysis (CA) is
used as a possible alternative on which to base an SLPT measure.

SECOND LANGUAGE PRAGMATICS COMPREHENSION
As Kasper and Rose (2002) indicate, within the developmental second
language (L2) pragmatics literature, studies into L2 pragmatic comprehension have been relatively rare. Early studies investigated L2 learners’
attribution of illocutionary force, that is, the process through which a hearer
interprets the meaning of an utterance as being of a particular speech act,
such as a request or a refusal. Carrell (1979), for example, found that
advanced L2 learners have full access to such inferential skills and are
able to infer indirect speech acts. Carrell (1981) also found evidence of a
hierarchy of difficulty in one particular speech act—indirect requests—
depending on how the act was syntactically constructed. When such
requests—in this case, to paint a circle blue—were phrased as interrogatives (i.e., Must you make the circle blue?) or negatives, they were more difficult for lower proficiency learners to interpret than were requests in
conventional form (i.e., Please + imperative verb phrase). Gibbs (1984)
found that L2 listeners have direct access to the meanings of indirect speech acts if the utterance forms and situations are conventionalized, but that nonconventionalized utterances, or those lacking familiar situational contexts, must be processed sequentially—literal meaning first, then nonliteral. Bouton (1988) found that groups of learners from

differing cultural backgrounds (German, Portuguese-Spanish, and
Taiwanese on the one hand and Korean, Japanese, and mainland Chinese
on the other) have different perceptions of indirect answers, such as indirect criticism (e.g., p. 59: Speaker A: “What did you think of [Mark’s term
paper]?” Speaker B: “I thought it was well typed”). Koike (1996) tested
the ability of L2 students of Spanish to identify direct and indirect speech
acts (requests and apologies) recited on videotaped monologues by a
native speaker of Spanish, finding an association with length of L2 study.
These results are somewhat uncertain, however, because a third of the
participants were actually Spanish-English bilinguals. In any event, one
may note in these studies assumptions regarding the notions of direct
and indirect speech acts, namely, that there is a regular association
between form and function of so-called direct speech acts, and that those
speech acts which are indirect are understood by hearers to be so with
reference to the direct versions. These assumptions, as we shall see, do
not stand up well under scrutiny.

SECOND LANGUAGE PRAGMATICS TESTING
Such assumptions have carried over into SLPT development. For
example, in an early study, Hudson et al. (1995) devised a prototype battery of SLP test instruments employing a written discourse completion
test (DCT). A DCT consists of a short situational description followed by
a blank into which a respondent writes what he or she feels is an appropriate response, as in the following item (p. 87):
Situation 2: You work at a small shop that repairs jewelry. A valued
customer comes into the shop to pick up an antique watch that you
know is to be a present. It is not ready yet, even though you promised it
would be.
You: _____________________________________________________


The DCTs in Hudson et al. were intended to elicit requests, refusals,
and apologies—three so-called speech acts. Prepilot rating of the situational prompts (items) by native speakers (NS) of English showed a 92%
agreement with the intended speech-act codings (e.g., as an apology);
a high percentage of agreement can be interpreted to mean that the operational norm (i.e., the provisional standard embodied in the test item)
assumed by the researcher is likely to be valid. Prompts on which native
speakers had disagreed as to speech-act codings were later found to elicit
multiple speech acts with nonnative-speaking participants.
Brown (2001) examined various SLP test methods adapted from the
Hudson et al. (1995) study, focusing on requests, refusals, and apologies,
and given to Japanese learners of English as a foreign language (EFL)
and American learners of Japanese as a second language (JSL). Methods
included a written discourse completion task (WDCT), the responses
scored by raters on a five-point scale of appropriateness. A multiple-choice discourse completion task (MDCT) replaced the fill-in blank with a
set of three response options on a test sheet. In an oral discourse completion task (ODCT), the situation descriptions were delivered via tape
recording, and the participant’s response was orally produced and
recorded. There were three additional tasks: The first was the discourse
role-play task (DRPT), which involved a printed situation acted out by the
test taker and a native speaker of the target language. The second, a role-play self-assessment task (RPSA), involved the test taker viewing an audiotaped record of his or her performance on the DRPT and rating it on a
five-point scale. A discourse self-assessment task (DSAT) was similar to
the DRPT in that after reading a situation-prompt, the test taker rated his
or her ability to respond on a five-point Likert scale. In comparing the
results, Brown found high reliability coefficients for most measures across
both EFL and JSL groups. Factor analysis suggested method effects, that
is, a productive language factor with the WDCT, ODCT, and DRPTs; and a

paper-and-pencil factor for the WDCT and the MDCT with the JSL group.
That is, success at the WDCT, ODCT, and DRPT seemed to depend on
productive language ability, and, for the JSL group, success at the WDCT
and MDCT tasks seemed to depend on skills associated with use of paper-and-pencil tests.
Yamashita (2001) attempted to improve on written situational prompts
by employing pictures to elicit apology strategies. A picture response test
(PRT) was developed, with feedback from native speakers of both
Japanese and English to eliminate cultural ambiguity. Six pictorial situations (e.g., Person A borrows Person B’s book and then accidentally drops
it into a pond) involved the test taker adopting a Person A character and
responding in writing to a fictitious Person B. Frequency counts were
made of targeted phenomena, for example, expressions of dismay,
explicit apology-devices (i.e., I’m sorry), and interjections.

PROBLEMS WITH MAINSTREAM SLPT
Although the preceding literature review is not exhaustive, it is sufficiently representative to allow discussion of the limits that speech act theory–based SLPT instruments have with regard to validation. Such methods
evince problems with authenticity and hence with producing appropriate
evidence of L2 pragmatic competence. First, the SLPT methods reflect
an underlying assumption regarding speech act form–function correspondence, that is, direct versus indirect speech act. However, as Levinson
(1983) points out, the form–function correspondence implicit in speech
act theory is essentially meaningless because most speech acts are indirect. Second, we may observe, as do Levinson (1983), Richards and
Schmidt (1983), and Rose (1992), that speech act theory tends to focus
simplistically on the speaker at the expense of the hearer; however, conversational actions, such as promises, cannot be performed alone. Mey
(2001) similarly refers to pragmatic acts occurring above the level of individual utterances. Given these criticisms of the theory underlying traditional SLPT, one can see that the very unit of analysis built into the test
methods of Hudson et al. (1995), Yamashita (2001), and in those reviewed
in Brown (2001) is suspect from the standpoint of authenticity; hence,
inferring pragmatic competence from these measures is dubious.
The use of the DCT and related methods also evinces shortcomings.
The format is clearly artificial, and there is a risk, despite careful wording and translation of the situation-prompts, that participants will elaborate on the context in ways not envisaged by the investigator (Yamashita,
2001, p. 48; Hudson et al., 1995, p. 52). Moreover, some DCT-related
studies suggest a method effect: excessively long written responses by
high-proficiency nonnative speakers (NNSs) (Blum-Kulka & Olshtain,
1986; Faerch & Kasper, 1989; see also Brown, 2001). The implications of
these shortcomings for practical L2 pragmatics assessment can be underscored by pointing out that if a teacher or student desires a valid assessment of that student’s L2 pragmatics ability, an SLPT instrument, such as
the DCT, drawing on speech act theory for assessment criteria is unlikely
to provide useful information. The assessment fails because the construct speech act is founded on learners’ and researchers’ intuitive understandings of what that act consists of, rather than on some objectively
verifiable criterion of pragmatic behavior. That is, the assessment results
will indicate what the learner believes is so, rather than demonstrating
L2 pragmatic mastery per se.
Some researchers have attempted to address DCT shortcomings. For
example, Rose (1992) added rejoinders (i.e., a third, hearer’s turn after
the traditional second-turn blank into which the subject writes his or her
response) intended to make the DCT exchanges more nearly authentic
by embedding responses in a simulated stream of discourse. Unfortunately,
the two conditions in Rose (1992)—items with and without rejoinder—
did not elicit statistically significant differences. Again, such attempts do
not appear to address the problem: If speech acts are in fact products of
metapragmatic judgments that only loosely correspond to actual conversational practice, then inferences of L2 pragmatics ability made from
such elicitations are of questionable validity.

CONVERSATION ANALYSIS
Such problems with speech act theory and the DCT and its variants are
crucially illuminated by findings in conversation analysis (CA; e.g., Sacks,
Schegloff, & Jefferson, 1974; Schegloff, 2007). CA is an approach to the
study of language that avoids categorizations of use based on native-speaker intuitions. Data in CA consist of audio- and/or videotaped natural conversations, which are finely transcribed using special conventions
(see appendix). A working principle of CA is that no aspect of talk can be assumed to be nonfunctional (Heritage, 1984), and a fundamental question is, Why that now? (Schegloff, 2007): Why is the interlocutor making
a particular utterance at a particular point in a given conversational
sequence? Conclusions about language use can only arise by determining
how interlocutors themselves orient to (i.e., demonstrate understanding
of) a given utterance, as evidenced by explicit, recorded turns of talk—
rather than from possibly erroneous researcher or respondent intuition.
In explicating language use, it is not enough to rely on linguistic form
and assign utterances to speech act–theoretical categories such as question, because such utterances may be oriented to by the speakers as
something rather different, such as a complaint or request (Schegloff,
1984). Hence, from the CA perspective, as Schegloff (1988) points out,
the so-called speech act, the “single act of utterance” (p. 56) is not a fundamental unit of talk in an absolute sense. Rather, the fundamental unit
of analysis is the sequence (as in, sequence of turns of talk), of which the
minimal example is the adjacency pair (e.g., Sacks, Schegloff, & Jefferson,
1974), that is, a two-part structure composed of a first pair part uttered by
one speaker (e.g., It was just beautiful), which projects a second pair part
which is “conditionally relevant” (Schegloff, 2007, p. 20), that is, potentially uttered by another speaker (e.g., Well thank you uh I thought it was
quite nice) (Pomerantz, 1978). The qualification in “potentially uttered” is
important because conversational meanings and structures are not predetermined but co-constructed by the speakers, turn by turn. Hence, the a
priori, linguistic form–function pairing of utterances in speech act theory
is analyzably absent from natural conversation (Schegloff, 1984, 1988).
CA criticisms extend to the DCT itself, which does not allow for the
concept of sequence, integral to analyses of actual “talk-in-interaction”
(Schegloff, 2007, p. xiii). Golato (2003), for example, shows that DCTs
do not capture some aspects of actual pragmatic behavior and may elicit
responses that do not appear in natural conversation. Golato transcribed compliment responses gathered in a German L1 domestic setting on videotape and compared them with responses to compliments elicited from
L1 German speakers via written DCTs. Comparison of frequency counts
of various response categories revealed striking differences. For example,
what may be termed the archetypal response to a compliment, danke
(“thank you”), appeared in 12.4% of the DCT responses, but not once in
the naturally occurring data. On the other hand, compliment responses
that included assessments that agreed with the complimenter and that
also contain a positive pursuit marker in a later turn (A: The meat is
excellent/B: Super, right? … Yeah) appeared in 12% of the naturally occurring talk but in only 0.5% of the DCT-elicited responses. Here, a zero or
near-zero percentage figure (i.e., frequency) can be interpreted to mean
that the results do not reflect a valid picture of native-speaker behavior.
Given the above review of traditional SLPT in the light of CA findings,
at least two conclusions may be considered. First, the speech act theory–
based SLPT emphasis on the speaker at the expense of the hearer has
resulted in a paucity of research into the testing of aural pragmatic comprehension; recall that the ODCT in Brown (2001) involved audio recordings of
situational descriptions, not conversational sequences. Second, in order
to enhance SLPT validity, speech act–theoretical test methods—whether
written, aural, or pictorial—should be replaced by an approach more
directly reflecting actual language use. Speech act theory, as Kasper
(2005) points out, is rooted in a rationalistic approach to language study
that assumes an ideal speaker endowed with reason. An L2 test developer,
approaching language use from a rationalistic perspective, makes assumptions regarding the speaker’s intentions behind his or her language use.
However, intentions are invisible to the test developer, whose assumptions regarding a subject’s intentions may lead one’s data-capturing
method astray, as shown by Golato (2003), whose DCT, for example, provided data which was empirical, but which concerned beliefs and intentions, not actual language use. In the end, when searching for a valid set
of norms on which to base an SLPT measure, what is most useful is the empirical language-use data itself, open to public scrutiny. Given corpora
of CA analyses that are available as resources for test development, SLPT
may well abandon its rationalistic perspective and instead adopt a datadriven approach, employing empirical findings as an operational SLP
test norm, thereby potentially enhancing validity, a fundamental concern
in assessment (Bachman, 1990).
All of which leads to a question: If the use of rationalistic, speech act
theory–derived test methods is seen to be of dubious value in light of CA,
might then there be CA-informed methods that, by application of principles such as conditional relevance, can enhance SLPT validation in the
testing of aural pragmatic competence? CA has been used in the analysis
of existing educational (Marlaire & Maynard, 1990) and L2 oral language
tests (Lazaraton, 1997; Johnson & Tyler, 1998; Kim & Suh, 1998; Lazaraton,
2002; Ross, 2007a, 2007b). To date, however, no attempts have been made
to actually develop CA-informed tests of SLP aural ability.

RESEARCH QUESTIONS
Application of CA to SLPT initially seems to present a paradigmatic
mismatch: Traditional language testing (LT) usually strives to produce a
generalizable numerical score to represent a target skill level with reference to an objective criterion or norm. On the other hand, CA regards
even a single interactional event, contextualized and irreproducible, as
having a nonstatistical (indeed, nonnormative in the LT sense) significance and so quantification of behavior is irrelevant (Schegloff, 1993). In
other words, CA research, it is argued, is nonstatistical in nature and thus
inappropriate for quantitative measurement.
However, in practical terms, these differing approaches to talk-in-interaction need not preclude one approach informing the other. The strength
of CA, and its potential benefit for SLPT, lies in its ability to uncover how
speakers use various pragmatic actions (or practices) to co-construct
sequences of conversation, by analyzing how they display their respective
orientations to the emergent talk. Indeed, some LT studies have applied
CA principles to the validation of existing L2 tests. For example, Ross
(2007a) performed contrastive CA analyses on transcripts of oral
proficiency interviews (OPIs) of an EFL examinee who had backslid to a
lower numerical score since his initial interview. Analyses found differences
in rater severity and examinee behavior that affected the score. In a follow-up statistical study, Ross (2007b) applied a form of Rasch modeling to a
body of OPI test-score data, which determined a negligible effect of overall
rater severity on OPI score, if rater differences were corrected through statistical equating. The essential point is that consideration of quantitative
data prompted an application of CA, which provided data on L2-test interactions (uncovering a heretofore uninvestigated range of OPI practices),
in turn motivating a statistical examination of related test data, finally resulting in a reconsideration of test-use validity. From such cross-paradigm (CA,
qualitative; LT, quantitative) research (see also Lazaraton, 1997, 2002), it
appears that CA can indeed inform SLP test validation and use. Given this
potential, it seems not unwarranted to hypothesize that CA could serve as a
resource for SLP test-item construction as well. Moreover, given the theoretical and methodological problems with speech act–based SLPT, it can be
reasoned that employing CA findings as an operational test norm in actual
SLPT development might potentially enhance validity (Bachman, 1990).
Accordingly, in an attempt to develop an alternative to traditional
SLPT methods, a pilot study was initiated into the development of a
CA-informed test (CAIT) of English as a second language (ESL) pragmatic listening comprehension. The overall goal of the study was to determine the feasibility of a CAIT measure. In this connection, it should be
noted that this overarching goal does not necessarily imply achieving a
high degree of validity in the use of the measure at an early stage of development. Evidence of feasibility, then, can lead to test revisions, in turn
leading to eventual validation of CAIT use.
For this study, the overarching goal encompassed the following specific research questions:
1. What would be the statistical features of a listening CAIT when administered to advanced ESL speakers?
2. How would responses to a listening CAIT vary according to group differences?
3. Would the operationalized norm be confirmed or violated by native-speaker responses to listening CAIT questions?
4. Finally, can items testing L2 listening comprehension be practically and usefully derived from CA data?


PARTICIPANTS
Participating were 70 adults—43 nonnative speakers (NNSs) of English,
and a control group of 27 native speakers (NSs) of English. L1s spoken by
the NNS group were Korean (10), Arabic (8), Chinese (6), Spanish (4),
Japanese (4), Urdu (2), Albanian (2), and one each of Yoruba, Kikuyu,
Turkish, French, Thai, Brazilian Portuguese, and Baule, a language of
Côte d’Ivoire. Participants were graduate students and some of their
spouses at a U.S. university. Demographic information was collected on
all participants for age, sex, native language, second (or third) language,
academic status, and number of years of formal English-language study.
Information was collected on NNS persons only for age of arrival, length
of stay in the United States, and most recent score on the Test of English
as a Foreign Language (TOEFL), the score range for which was 550–670
on the paper-and-pencil scale. This range suggests that these NNSs were
at a relatively high level of English proficiency.

PRAGMATIC TARGETS
For this study, ESL aural pragmatic competence was operationalized as
the ability to understand three types of pragmatic actions: assessment
responses, compliment responses, and presequence responses. These
actions do not constitute a representative sample of overall pragmatic
competence; to infer such competence would be invalid. Rather, they
were chosen because they are well documented in the CA literature
(Pomerantz, 1978, 1984; Schegloff, 2007), and three targets seemed enough for a pilot intended to determine overall CAIT feasibility.
Assessment responses are actions in which a speaker displays evaluations
of events. Among the various types, two are shown here (Pomerantz,
1984). Some are upgrades, which agree with the assessment in the first
pair part (see Transcript 1, Speaker A), with a more emphatic word
choice (Speaker B):
Transcript 1
A : T’s- tsuh beautiful day out isn’t it?
B : Yeh it’s just gorgeous ← (p. 61)

Other assessment responses are disagreements. The disagreement given by Speaker B in Transcript 2, for example, is a weakened agreement:
Transcript 2
A : I know but I, I-I still say thet the sewing machine’s quicker.
B : Oh it c’n be quicker but it doesn’ do the job. ← (p. 73)

There is also a range of compliment responses, as described by
Pomerantz (1978). In addition to acceptance tokens such as “Thank
you” there are evaluative shifts, in which compliment recipients may either
disagree or offer a scaled-down version of it, as Speaker B does in
Transcript 3 (Pomerantz, 1978), offering “quite nice” instead of Speaker
A’s “beautiful”:
Transcript 3
A : It was just beautiful.
B : Well thank you uh I thought it was quite nice, ←


Yet another type of compliment response consists of reference shifts,
whereby the recipient of a compliment focuses the talk away from him- or
herself and onto something else:
Transcript 4
A : You’re a good rower, Honey.
B : These are very easy to row. Very light. ← (p. 102)

The third target used in the pilot was the presequence response, as given by
Speaker B in Transcript 5 (Schegloff, 2007):
Transcript 5
A : Hey I got sump’in thet’s wild
B : What. ←
A : Ya know one a’ these great big red fire alarm boxes thet’r on the corner? I got one. (p. 39)

Speaker A in the first turn performs a presequence called a pretelling, a
check to see if conditions are appropriate for delivering a news item.
Speaker B in the second turn performs the actual presequence response,
in this case a go-ahead, showing that speaker B is willing to listen to the
coming message. Speaker B, however, could have prevented A from delivering the item, by performing a blocking response, as shown in Transcript
6 (Schegloff, 2007):
Transcript 6
A : Didju hear about thee, pottery and lead poisoning
B : Yeah Ethie wz just telling us ← (p. 40)

TEST METHOD
Administration of the 10-item listening CAIT took approximately
20 minutes. Before administration of the pilot test, participants answered
a short demographic questionnaire. The listening CAIT itself involved
test takers’ listening to short tape-recorded dialogues between two native speakers of English. There were 10 listening items in all, each having a
particular pragmatic action as the (provisional) target. The scripted dialogue prompts were derived directly from CA data examples, adopting
the item-form technique proposed by Roid and Haladyna (1982). In this
technique, a linguistic or conversation-sequential pattern provides a template for the generation of similar examples to form a potentially useful
pool of test items. Note Transcript 7 from Schegloff (2007, p. 31), involving a preinvitation (line 2):
Transcript 7
1  Judy : Hi John
2  John : Ha you doin- say what ’r you doing?
3  Judy : Well, we’re going out. Why.
4  John : Oh, I was just gonna say come out and ((turn continues))


With minor alterations, the following item-prompt (for Item 1) was
constructed:
1  M : Hi Jane, this is Dick.
2  W : Hi Dick.
3  M : How ya doin- uh what ’r you guys doing?
4  W : Well we’re about to leave for class. Why.
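
As an aside for readers who want to operationalize the item-form idea computationally, the following minimal Python sketch (not part of the original study; the template, slot names, and the generate_prompt helper are hypothetical) shows how a conversation-sequential pattern such as the one above can be treated as a template for generating a pool of parallel scripted prompts:

```python
# A minimal sketch of the item-form idea (Roid & Haladyna, 1982) as used here:
# a conversation-sequential pattern (identification -> greeting -> preinvitation
# -> blocking response) serves as a template from which similar prompts can be
# generated. The slot names and fillers below are invented for illustration only.

PREINVITATION_TEMPLATE = [
    "Hi {callee}, this is {caller}.",             # identification (stands in for a phone ring)
    "Hi {caller}.",                               # greeting
    "How ya doin- uh what 'r you guys doing?",    # preinvitation
    "Well we're about to {activity}. Why.",       # blocking response
]

def generate_prompt(caller: str, callee: str, activity: str) -> list[str]:
    """Fill the sequential template to produce one scripted item-prompt."""
    return [turn.format(caller=caller, callee=callee, activity=activity)
            for turn in PREINVITATION_TEMPLATE]

if __name__ == "__main__":
    for n, turn in enumerate(generate_prompt("Dick", "Jane", "leave for class"), start=1):
        print(n, turn)
```

Varying the fillers produces structurally parallel dialogues that preserve the greeting–preinvitation–blocking sequence, which is the sense in which the pattern functions as an item form.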

After listening to each dialogue, the participant read a short question
prompt on a test form and then selected one of four printed response
options, as in Item 1 below. In the aural prompt, M indicates a male speaker,
W a female speaker. An asterisk indicates the answer coded correct.
Item 1
1  M : Hi Jane, this is Dick.
2  W : Hi Dick.
3  M : How ya doin- uh what ’r you guys doing?
4  W : Well we’re about to leave for class. Why.

In the conversation, what do you think the man will most probably do NEXT?
(a) suggest going to class with the woman and the others
(b) offer to carry the woman’s heavy book bag for her
(c) explain that he had intended to make an invitation *
(d) invite the woman and the other students to do something


The above item represents construction features typical of the
entire item set. Another item, based on Pomerantz (1978, p. 99), whose
aural prompt exemplifies one way in which native English speakers respond
to compliments, reflects an approach to crafting response options:
Item 7
W : Wow, you made,—like a ton of stuff. This really is a lot of food!
M : Oh::, just a few little things really,
In the conversation, what WAS the man doing?
(a) agreeing that there are many things
(b) rejecting the woman’s point of view
(c) understating his own achievement *
(d) emphasizing his own accomplishment

Here, the woman compliments the man on his having prepared a large
meal, whereas the man delivers a disagreement proposing that the credit
given is exaggerated. Inasmuch as this particular task is intended to test a
learner’s ability to identify both the compliment and the most likely
appropriate response, the answer choices are crafted in such a way as to
make the identification slightly difficult; hence, the explicit term compliment is not inserted into any of the response options, and the answer
coded correct is circuitously worded.

The recorded part of each prompt was played twice so that the skill of
aural pragmatic competence would not be confounded by the skills of
overall comprehension or memory. Prior to administering the pilot measure, audio scripts and multiple-choice items were reviewed by two CA
scholars as content specialists, which resulted in changes to some items.
For example, one piloted item (Item 1) was based on a recorded telephone conversation (Schegloff, 2007, p. 31). However, the original data
excerpt itself contained little information to convey this fact, no lexis
referring to the call, nor any telephone ring. In order to avoid ambiguity,
the first turn—namely, Hi Jane,—was slightly modified to read, Hi Jane,
this is Dick, the addition serving to identify the turn as part of a phone call.
This addition arguably threatened the authenticity of this particular item;
this issue will be taken up in the Discussion section.

RESULTS
Descriptive Statistics
Descriptive statistics of the listening CAIT—relating to the first research
question—are given in Table 1. The NS score range was smaller than
those of the combined and NNS groups, but the fact that there was a
range suggests that the operationalized norm was disconfirmed. The
overall variance (the average squared deviation of scores from the mean) for both groups
separately and together was narrow, less than two score points—1.87 for
the NNS group, 1.69 for the NS group, and 1.79 for the combined group.
The low NNS variance values seemed to indicate that most NNS participants responded near-normatively; that is, they got almost all the correct
answers. All distributions were nearly normal. Item-level statistics were
also calculated.
TABLE 1
Pilot Listening CAIT Descriptive Statistics by Group

Statistic                     NS*       NNS**     Whole group
Mean                          7.33      7.12      7.20
Mode                          8.00      7.00      7.00
Median                        7.33      7.08      7.20
Range                         5.00      6.00      6.00
Variance                      1.69      1.87      1.79
Standard deviation            1.30      1.38      1.33
Skewness                      0.77      0.08      0.01
Kurtosis                      −0.91     −0.14     −0.19
Ave. item facility            0.73      0.71      0.72
Ave. item discrimination      0.30      0.28      0.29

* NS group n = 27. ** NNS group n = 43.

The p statistic, the proportion of participants who passed the item, sometimes called an item mean (Davidson, 2000), is an index of
item difficulty or facility. The average facility value for the whole group
was p = 0.72, that for the NNS group p = 0.71, and that for the NS group
p = 0.73, suggesting that the test overall was relatively easy. The item discrimination index (symbolized by d) was obtained with the point-biserial
coefficient. If a single item correlated highly with the total test score of a
group, then the item was considered to discriminate between high and low
scorers (Haladyna, 1999). A value of d = 0.30 or higher is usually considered acceptable. On the listening CAIT, the d values were moderate, the

average index for the whole group being d = 0.29; for the NNS group,
d = 0.28; and for the NS group, d = 0.30. Of course, it may be somewhat
misleading to interpret item discrimination with an NS control group.
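
To make these item-level indices concrete, here is a minimal sketch, using invented 0/1 response data rather than the study’s actual responses, of how p (facility) and a point-biserial d (discrimination) can be computed:

```python
import numpy as np

# Rows = test takers, columns = the 10 listening items, scored 1 (keyed answer) or 0.
# The matrix below is invented illustrative data, not the study's responses.
responses = np.array([
    [1, 1, 0, 1, 1, 1, 1, 1, 1, 0],
    [1, 0, 1, 1, 0, 1, 1, 1, 0, 1],
    [0, 1, 1, 1, 1, 0, 1, 0, 1, 1],
    [1, 1, 1, 0, 1, 1, 0, 1, 1, 1],
])

total_scores = responses.sum(axis=1)

# Item facility (p): proportion of participants answering the item correctly.
facility = responses.mean(axis=0)

# Point-biserial discrimination (d): correlation of each 0/1 item column with the
# total score (here the total includes the item itself, as in simple item analyses).
def point_biserial(item: np.ndarray, totals: np.ndarray) -> float:
    return float(np.corrcoef(item, totals)[0, 1])

discrimination = np.array([point_biserial(responses[:, j], total_scores)
                           for j in range(responses.shape[1])])

print("p (facility):      ", np.round(facility, 2))
print("d (discrimination):", np.round(discrimination, 2))
```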
Reliability—the ability of a test to measure a particular trait in a consistent
manner across groups (Bachman, 1990)—was calculated with the
Spearman-Brown split-half coefficient, using Horst’s (1953) correction
formula. To obtain this coefficient, the test was divided into two equivalent halves on the basis of target pragmatic skill (see Table 2). It may be
argued that the halves are not entirely equivalent. For example, Items 3 and 7
each involve different actions: a disagreement with a compliment (“ …
you’re doing a great job in all your classes”/“Well I guess she hasn’t seen
my term paper for Astronomy class”) and a downgrade (“Wow, you
made,—like a ton of stuff …”/“Oh::, just a few little things really”).
However, these actions may be considered roughly equivalent because
both are evaluative shifts (Pomerantz, 1978). Items 1 and 9 each involve
differing actions, preinvitations and preoffers, yet Schegloff (2007) indicates that invitations often seem to be a subclass of offers. Thus, there are
empirical grounds for considering the split halves of the listening CAIT
to be content equivalent. Statistically, the halves possessed equal means
and variances, a requirement for using the Spearman-Brown split-half coefficient (Bachman, 1990), as shown by nondirectional t tests run on the split-half means and nondirectional F-tests run on the variances (see Table 3).

TABLE 2
Division of Listening CAIT Into Equivalent Halves

Split-half test A                              Split-half test B
Item   Pragmatic target(s)                     Item   Pragmatic target(s)
9      Preoffers                               1      Preinvitations, hedges
2      Prerequests                             6      Prerequests, preoffers
3      Disagreeing with compliments            7      Downgrading compliments
4      Preinvitations, go-aheads               10     Preinvitations, go-aheads
5      Pretellings                             8      Pretellings

The coefficients obtained were r = −0.137 for the whole group;
r = −0.019 for the NNS group; and r = −0.369 for the NS group. The coefficients reveal pronounced attenuation, attributable to a narrow range of
ESL pragmatics ability among the participants; recall that the NNS group
average TOEFL score was 617.18, which was relatively high.
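
For readers wishing to reproduce this type of estimate, the sketch below computes a basic Spearman-Brown split-half coefficient from invented half-test scores; Horst’s (1953) correction, applied in the study, is omitted here for simplicity:

```python
import numpy as np

# Invented half-test scores for illustration, one pair of values per test taker
# (half A = items 9, 2, 3, 4, 5; half B = items 1, 6, 7, 10, 8; each half scored 0-5).
half_a = np.array([4, 3, 5, 4, 3, 2, 4, 5])
half_b = np.array([3, 4, 4, 4, 2, 3, 5, 4])

# Correlation between the two halves.
r_half = np.corrcoef(half_a, half_b)[0, 1]

# Spearman-Brown prophecy formula: estimated reliability of the full-length test.
r_full = (2 * r_half) / (1 + r_half)

print(f"half-test correlation r = {r_half:.3f}")
print(f"Spearman-Brown split-half reliability = {r_full:.3f}")
```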

Demographic Factors
The second research question concerned how responses to a listening
CAIT might vary according to group differences. Pearson’s product-moment correlation coefficient (symbolized by r) was used to determine
the correlation between listening-CAIT score and the demographic variables—age, sex, native language, second (or third) language, academic
status, age at arrival and length of stay in the United States, recent TOEFL
score, and number of years of formal English study, defined for both
groups as formal instruction in English grammar and language arts
(though these may differ for NS and NNS learners). The point-biserial
correlation coefficient was used to determine the correlation between
test score and gender. The correlation values are given in Table 4. Possibly
TABLE 3
Split-Half Statistics: Listening CAIT

                      NS            NNS           Whole group
Mean, Test A          3.67          3.58          3.61
Mean, Test B          3.67          3.54          3.59
Variance, Test A      0.85          0.82          0.82
Variance, Test B      1.15          1.06          1.09
t test*               2.06 (n.s.)   0.83 (n.s.)   0.86 (n.s.)
F-test**              1.33 (n.s.)   1.30 (n.s.)   1.32 (n.s.)
Df                    27            43            69

* t-critical: 2.056 (NS); 2.021 (NNS); 1.976 (whole-group). ** F-critical: 1.93 (NS); 1.69 (NNS); 1.54 (whole-group); Alpha = 0.05.

Possibly because of the limited variance, as mentioned earlier (see Table 1), correlation strengths were r = 0.28 and below, that is, moderate to negligible; the low variances can be seen as the cause of depressed
correlation values (Cziko, 1981; Kunnan, 1992). For example, because
the test responses did not vary widely from the mean, variances were minimal; thus correlation values for age (e.g., whole group r = −0.02), length
of U.S. residence (0.02), educational status (0.01), and so on, were artificially low. There was a slight correlation of test score overall with being female (whole group r = 0.18). The strongest correlations, for both
NS as well as NNS groups, were with the number of languages studied
(NS r = 0.28; NNS r = 0.24) and years of formal English study (r = 0.26;
r = 0.30); implications of this result are given in the Discussion section.
No correlations with L1 appeared in the data, nor was there an association with NNS proficiency as measured by the TOEFL.
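
As a small illustration of the two coefficient types used in this analysis, the following sketch (with invented scores and demographic values, not the study’s data) computes Pearson’s r for a continuous variable and a point-biserial coefficient for a dichotomous one:

```python
import numpy as np
from scipy import stats

# Invented illustrative data: total CAIT scores, years of formal English study,
# and sex coded 1 = female, 0 = male.
scores = np.array([8, 7, 9, 6, 7, 8, 5, 9])
years_english = np.array([12, 10, 14, 8, 9, 13, 6, 15])
sex_female = np.array([1, 0, 1, 0, 1, 1, 0, 1])

# Pearson product-moment correlation for a continuous demographic variable.
r_years, p_years = stats.pearsonr(scores, years_english)

# Point-biserial correlation for a dichotomous variable (equivalent to Pearson's r
# computed with a 0/1-coded variable).
r_sex, p_sex = stats.pointbiserialr(sex_female, scores)

print(f"score x years of English: r = {r_years:.2f} (p = {p_years:.3f})")
print(f"score x sex (female):     r_pb = {r_sex:.2f} (p = {p_sex:.3f})")
```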

Listening CAIT Item-Content Analysis
Analysis of item-response patterns can help one determine whether
inferences about learner ability—here, ability to understand compliments,
assessments, and presequences, as well as responses—can be validly made
given the adequacy of the content coverage of the test items and the
responses to those items (Messick, 1988). Examination of these responses
was the focus of the third research question. Table 5 presents a breakdown
of operationally correct responses and alternate interpretations—in CA
terms, orientations. A glance at the operationalized target (correct) columns shows that a significant number of NSs did not adhere to the operationalized norm, as several distracters, the intended wrong answers, were
chosen by the native speakers at varying rates of frequency. The table also
shows that the NSs and NNSs produced similar response patterns.
TABLE 4
Correlations of Listening CAIT Score With Demographic Variables

                        NS        NNS       Whole group
Age                    −0.16      0.23      −0.02
Sex (F)*                0.11      0.07       0.18
Educational status      0.02      0.16       0.01
No. of languages        0.28      0.24       0.18
Years of English        0.26      0.30       0.22
Length of residence    −0.18      0.21       0.02
Age of arrival           —        0.05        —
TOEFL                    —       −0.01        —
NS versus NNS*           —         —         0.08

* Point-biserial coefficient used for this variable; all others Pearson’s r.

TABLE 5
Proportions of NS and NNS Orientations to Listening CAIT Items

Item   Operationalized target(s)     NS     NNS    Other orientations*        NS     NNS
1      Preinvitation, blocking       0.37   0.49   Invitation                 0.44   0.47
                                                   Suggestion                 0.15   0.05
                                                   Offer                      0.04    —
2      Prerequest, offers            0.59   0.81   States possession of car   0.37   0.16
                                                   Giving directions          0.04   0.02
3      Compliment, disagreement      0.70   0.47   Expressing dislike         0.30   0.42
                                                   Expressing pride            —     0.12
4      Preinvitation, go-ahead       0.89   0.93   Guessing                   0.07   0.05
                                                   Suggestion                 0.04   0.09
5      Pretelling                    0.63   0.70   Telling                    0.37   0.19
                                                   Guessing                    —     0.09
6      Prerequest, offer             0.85   0.84   Express sympathy           0.15   0.12
                                                   Suggestion/reminder         —     0.05
7      Compliment downgrade          0.89   0.58   Disagreement               0.11   0.23
                                                   Agreement                   —     0.19
8      Pretelling                    0.93   0.81   Story different topic      0.04   0.09
                                                   Story different topic      0.04    —
                                                   Offer                       —     0.09
9      Specific preoffer             0.85   0.65   Generic preoffer            —     0.09
                                                   Offer                      0.15   0.26
10     Preinvitation, go-ahead       0.63   0.84   Blocking, “busy”            —     0.12
                                                   Blocking, “go away”        0.37   0.02
                                                   Expressing boredom          —     0.02

Note. Dashes indicate no action was observed for the given option. Some values inflated due to rounding.
* Chosen distracters.

For example, pluralities of respondents in both groups selected options coded correct, with the exception of the preinvitation Item 1, in which a plurality of NS respondents selected the invitation distracter. It is interesting that
this plurality (p = 0.47) of NS respondents appears to have perceived the
sequential implicativeness of the preinvitation “uh what ’r you guys doing”
in line 3 (see Item 1), but not the significance of the blocking move in line
4, thereby addressing the prompt by selecting distracter (d).
About half of the NNS and more than a third of the NS participants
appear to have understood the meaning of the blocking move in line 4,
but about half of all participants did not, selecting the invitation distracter
(d). This trend suggests that native speakers of American English, and
advanced ESL speakers who have been exposed to the pragmatic norms,
do in fact choose to disregard blocking maneuvers and proceed with the
prefigured invitation. If so, an invitation distracter may make for an item
from which ESL aural pragmatic competence vis-à-vis blocking cannot be
reliably determined.
In several items, NNS participants chose distracters that NS participants did not select, suggesting at least a partial skill discrimination
according to the operationalized norm. One example was Item 3, in
which 12% of NNS respondents chose the expressing pride distracter (b).
Another example was Item 7, where 19% of NNS chose the agreement
option (a).

Test Practicality
The fourth research question concerned practicality of listening CAIT
development and use. Applying Roid and Haladyna’s (1982) item-form
technique was relatively straightforward. Moreover, the test administration was relatively brief, requiring about 20 minutes.

DISCUSSION
Before discussing the results, limitations of the current study should be
noted. First, the present sample of ESL participants was not truly random;
hence, any conclusions regarding the impact of demographic variables on
ESL pragmatic competence cannot be final. Second, the domain of pragmatic competence selected—assessment responses, compliment responses,
and presequence responses—was necessarily limited. Piloting items testing other skills, for example, turn-taking or self-initiated repair (Levinson,
1983), is one task of future CAIT development studies. Third, because of
the low reliability coefficients, apparently due to low test variance (Kunnan,
1992), it is unclear whether or not quantitative SLPT using the present
approach is workable, given a methodological conundrum: Broaden the
range of L2 proficiency to include intermediate- as well as high-proficiency
test takers, and pragmatic performance may be confounded with overall
aural comprehension or grammatical competence; indeed, we may note
Bardovi-Harlig and Dornyei’s (1998) finding that low-proficiency EFL
learners were more sensitive to target language (TL) grammatical issues
than they were to pragmatic issues, though the reverse was true for ESL
learners. Conversely, if one delimits the pool to high-proficiency learners,
then one may reap low variances and thus poor reliabilities. If so, validation of test-score inferences regarding advanced L2 pragmatic competence may rest on accepting validity without statistical reliability (Moss,
1994). Such an approach is not unthinkable because “items may be indicated as undesirable on the basis of fit or discrimination, yet have content
that is representative of the trait” (Hudson, 1991, p. 180). Further research with intermediate-proficiency learners, perhaps with more controls for NS
versus ESL/EFL instructional approaches, seems indicated to determine
the feasibility of applying CAIT to SLPT development.
The first research question of this study concerned the statistical features of the pilot listening CAIT when administered to NSs and advanced
NNSs. In addition to the low reliabilities, a statistical item of concern is that
the NS-group p values evinced a wide range, suggesting that the NS norm
derived from the CA literature was limited. On the positive side, the overall
item discrimination was moderate and the frequency distribution of number correct was approximately normal, suggesting that the pilot CAIT
method may have some usefulness in testing SLP aural comprehension.
The second research question concerned whether demographic factors affected listening CAIT responses (see Table 4). For the NNS group,
test score correlated modestly with length of residence in the United
States. Although correlation does not necessarily imply causality, this
finding would seem to resonate with interlanguage pragmatics studies
which show length of stay in the TL community as a factor in acquisition
of TL pragmatics (Blum-Kulka & Olshtain, 1986), especially in U.S. academic contexts (Bardovi-Harlig & Hartford, 1993). Even without overt
training in pragmatics (Bouton, 1992), L2 learners resident in the TL community are often more sensitive to pragmatic errors than to grammatical ones, compared with their nonresident L2-learning counterparts (Bardovi-Harlig &
Dornyei, 1998). However, NNS total test score did not correlate with
TOEFL score, a result that would appear to give support to Bachman’s
(1990) model of language competence, which separates grammatical and
pragmatic competencies.
The strongest positive correlations with NNS and NS total test scores
were with total number of foreign languages learned and years of formal
English study (see Table 4). Again, though correlation does not imply
causation, it is tempting to hypothesize that these factors, theoretically dealing with meta-aware, attention-allocating use of language, had a significant bearing on the pragmatic performance of both NS and NNS participants in this study. Such a conclusion would seem to find theoretical and empirical resonance in the notions in second language acquisition of attention
to input (e.g., Schmidt, 1983, 1990) as well as the paper-and-pencil, or
formal-traditional test-taking, factor found in Brown (2001).
One may not unreasonably assume that attention to ESL pragmatic
input occurred in the process of the formal learning of the several L2s by
the respective participants. Possible corroborative evidence for this
assumption can be seen in the coefficients obtained for total test score
with educational status for the NNS group (r = 0.16) in contrast with that
obtained for the NS group (r = 0.02). That is, along with years of formal
English study and length of residence in the United States, the number
of years devoted to formal education seems to have had some relationship to NNS acquisition of ESL pragmatic skills, whereas length of formal
education does not for NSs, for whom acquisition of English pragmatic
competence will have largely been accomplished prior to and at least
somewhat independently of formal schooling. However, it may be noted
that for the NS group, years of formal English study may have correlated
highly with CAIT score for much the same reasons that it did with the
NNS group, in that such would cultivate metalinguistic and metapragmatic skill. If true, then years of formal English study, somewhat counterintuitively, would not correlate highly with age or length of residence,
because metalinguistic or metapragmatic training in school may well
have been more variable across participants than mere physical residence
in the Anglophone environment.
One may also speculate that relatively high correlations of CAIT score
with years of English study and with number of languages studied resulted
from affective flexibility—empathy, or the ability to adopt new language
egos (Guiora, Brannon, & Dull, 1972). Such a conclusion should be taken
with caution, partly because of the difficulty in operationalizing empathy cross-culturally (Brown, 1994), and partly because the narrow range of
variance in participant responses has arguably (Cziko, 1981; Kunnan,
1992) caused the absolute values of all the coefficients to be low, thus possibly obscuring actual relationships between demographic variables and
listening CAIT score.
The third research question concerned whether the operationalized
norm would be confirmed by NS responses to listening CAIT questions.
Again, one may take as a general principle that the higher the p values (or
frequencies) among the NS participants, the more nearly valid the operational norm exemplified by the item, and conversely, the lower the percent values, the greater the disconfirmation of a norm. Here, one may
note the wide range of NS p values, the correct response rate being 73.3%
for the NS and 71.2% for the NNS participants—results which recall those
of the Hudson et al. (1995) study. One possible explanation for the NS
response patterns is that some of the NS-chosen distracter items (see
Table 5) represent an NS norm that is somewhat broader than the operationalized norm derived from Pomerantz (1978, 1984) and Schegloff
(2007). If so, such responses may call into question the validity of inferring ESL pragmatic ability among the NNS participants, at least with
regard to the three pragmatic actions used in this study, and additional
CA research may be needed to shed light on these results, to provide a
firmer basis for an operational CAIT norm. One may note, however, that
while the 73.3% agreement rate among NS participants is somewhat lower
than the 92% NS agreement rate in Hudson et al. (1995), this overall rate
reflects a majority of the pilot-test sample and is somewhat encouraging.
This violation of the operational norm points to a fundamental consideration regarding the validity of CAIT results and thus usefulness of the
pilot CAIT itself, namely, method effect—in CA terms, the sequential organization of the protocol. As Marlaire and Maynard (1990) point out, test
results emerge through social collaborations between tester and test
taker. The outcome of such collaborations with a listening CAIT will
depend significantly on the application of CA data and principles to aural
prompts, and on the careful crafting of the stem and response options.

A problematic example highlighted by the application of CA data and
principles to aural prompts involves the telephone-call item mentioned
earlier (Item 1). Textual additions, such as the one to compensate for the
absence of a recorded ring, may threaten the authenticity of an item—
and paradoxically invoke the kind of rationalistic assumptions about NS
behavior that application of CA to SLPT should avoid. Examples of the
latter issue, namely, careful crafting of stem and response options,
are manifest in the numerous instances in which NS participants chose
distracters. One approach for the CAIT developer in addressing the distracter issue would be to replace weak distracters with ones that better distract, repiloting these items with a wide sample of NSs to determine d and p values and to eliminate nonfunctioning response options.
However, this raises several questions: One is how to precisely word distracters. In Item 7, for example, it is important to critically examine the
item content, as well as the NS response patterns, to determine whether
option (c) is in fact the right answer, or whether it should be worded in
more technically correct CA terms as downgrading the compliment or responding to a compliment. In preparing the pilot, downgrading was considered too
obscure for the ESL learners participating in the study, and understating
seemed a reasonable, nontechnical synonym. NS response data showed
that a majority (89%) chose option (c), “understating his own achievement,” but 11% selected option (b) “rejecting the woman’s point of view,”
analyzable as being synonymous with option (c).
Overall, the adequacy of CAIT distracter wording will depend on several issues. One concerns the purposes of testing and instruction. One
may imagine an ESL course in which explicit mastery of CA metalanguage, for example, downgrading or presequence, was an instructional
goal, making the explicit inclusion of such terms in a correct answer choice relevant. In
such cases, prudent test design would suggest the same word be used also
in a distracter, but with an incorrect grammatical object, for example,
downgrading the amount of food. A second issue is whether nontechnical
wordings such as understating can serve as unambiguous substitutes for
more technical CA terminology without misleading the L2 test taker. A
third issue is whether a given nontechnical wording can adequately distract without seeming to overlap descriptively with the answer coded as

correct; given the NS response pattern in Item 7, option (b) may need
revision. Attending to such details, with reference to possibly emergent
findings from future CA research as to the NS pragmatic norm, will hopefully affect the “conversation” between CAIT items and the test taker such
that valid skill-inferences can be made.
Related to this third issue of CAIT item construction is the second question alluded to earlier—a fundamental concern involving application of
CA principles: Recall that one shortcoming of speech act–based SLPT was
the inability of the measures to adequately operationalize and assess
pragmatic competence with regard to conversational sequences. As
Schegloff (2007) notes, turns of talk are generally characterized by a relationship of adjacency or “nextness” (p. 15); that is, a next turn both displays a speaker’s understanding of another speaker’s just-prior turn and
also constitutes an action that is responsive to that just-prior turn. However,
next turns are not generated deterministically; for example, just-prior
turns are not always understood as their speaker had intended; hence,
next turns may be various. Schegloff (1988), for instance, analyzes examples of conversational breakdowns in which the speakers over a four-turn
sequence display variant understandings of the first speaker’s initial question. Thus, a conversational sequence can evolve in varying, often unpredictable directions, co-constructed by interactants in real time, evincing a
range of possible sequential trajectories (Schegloff, 1993). Hence, to
return to the pilot items under review, a given set of response options, as
in Item 1, may well legitimately embody more than one possible adjacency
pair—more than one correct answer. Thus, in applying CA principles to
pilot posttest analysis, we see that the hypothetical speaker of the blocked
preinvitation in Item 1 could explain that he had intended to make an
invitation (option c) or could ignore the blocking action and deliver the
invitation (option d). From this perspective, the CAIT developer may consider that the problem may not be a limited NS norm but that the item
may not operationalize conditional manifestations of that norm.
A solution to this problem of sequential indeterminacy for CAIT item construction might be to use a multiple-correct item format, with perhaps as many as three correct out of five options, to accommodate multiple conditional relevancies. In this connection, it seems appropriate to point out
that the operational use of any candidate multiple-correct options, such as
Item 1’s option (d) or Item 7’s option (b), should be supported by actual
CA studies of talk-in-interaction and not by metapragmatic judgments of
NSs who may be called on to review draft items. Otherwise, a CAIT
approach to SLPT will have little validation benefit over speech act–based
SLPT. It may thus be useful to apply CA principles (such as conditional relevance) to the wordings of the item stems and prompts in light of NS and
NNS pilot-test response patterns as a way not only to generate hypotheses
regarding the TL norm (see Schegloff, 1993, pp. 114–115, for relevant discussion), but to effect refinements on the wording (and coding) of items.
In such fine-grained considerations, one may see the potential of CA to
inform the crafting of assessment tasks that may more adequately reflect
TL behavior and thus support validity of CAIT item use.
The fourth research question concerned the overall feasibility of
employing L2 listening-comprehension items derived from CA data.
From an item-writer’s perspective, the suggested listening CAIT development procedure evinces practicality in that designing and revising
prompts from CA transcriptions using Roid and Haladyna’s (1982) item-form technique was relatively straightforward. Nor was test administration time-consuming, taking approximately 20 minutes. Yet, practicality
of method, while necessary, is insufficient, whereas validity is essential.
Recall that the basic objection to the use of speech act theory–based
SLPT methods was lack of authenticity because of the arguably intuition-based, largely norm-irrelevant, metapragmatic nature of DCT responses
and hence the questionable validity of skill inference. Further, recall the
plausible conclusion, based on the moderate correlation between test
score and formal L2 instruction, that the participants were using metapragmatic skill in addressing the listening CAIT items, plus the aforementioned violations of the operationalized norm (see Table 5): One
may then legitimately ask whether the application of CA has actually
enhanced validity over traditional SLPT methods.
Three findings suggest a positive, though tentative, answer. The first is the overall 73.3% agreement rate among NS participants. This finding
also resonates with Golato’s (2003) finding of a degree of overlap between
DCT-collected data and data gathered by CA methodology. Such overlap
of offline metapragmatic and online pragmatic behavior suggests a corresponding overlap in underlying trait. In this light, the overall 73% NS
congruence with the operationalized CAIT norm—as well as the fact that
several NS item responses had 85% or better normative congruence (see
Table 5)—can be seen as evidence for validity with regard to assessing the
delimited, aural pragmatic competence in the L2 population tested. It
would thus appear that the process of using CA research findings as a
provisional norm, as well as CA practitioners as content specialists in the
prepilot evaluation of that provisional norm, has led to test results, in the
form of high NS p values, that suggest that an NS norm can be approached,
despite the indeterminacy of natural conversation. (However, as suggested earlier, employment of a multiple-correct item format may be necessary to accommodate alternate sequential trajectories.) One may
further note that, unlike with intuition-driven, speech act–based SLPT
studies, this finding is made with reference to an objective, empirically
supportable NS norm (Pomerantz, 1978, 1984; Schegloff, 2007).
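For readers less familiar with the psychometric terms used here, an item’s NS p value is simply the proportion of NS participants selecting the normatively keyed option, and the overall agreement rate is the corresponding proportion pooled across items. The brief sketch below shows how such figures could be computed; the item keys and response data are invented for illustration and are not the Table 5 values.

```python
# Illustrative only: computing per-item NS p values and an overall
# agreement rate from keyed responses. All data below are invented.

item_keys = {"item1": "c", "item3": "a", "item5": "b"}

ns_responses = {  # one list of selected options per item
    "item1": ["c", "c", "d", "c"],
    "item3": ["a", "a", "a", "b"],
    "item5": ["b", "b", "c", "b"],
}

# p value per item: proportion of NS participants choosing the keyed option
p_values = {
    item: sum(choice == item_keys[item] for choice in choices) / len(choices)
    for item, choices in ns_responses.items()
}

# Overall agreement: keyed choices as a proportion of all choices, pooled over items.
keyed_total = sum(
    choice == item_keys[item]
    for item, choices in ns_responses.items()
    for choice in choices
)
overall_agreement = keyed_total / sum(len(choices) for choices in ns_responses.values())

print(p_values)           # {'item1': 0.75, 'item3': 0.75, 'item5': 0.75}
print(overall_agreement)  # 0.75
```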
A second finding tentatively supporting usefulness of the pilot CAIT
method involves nonnormative NNS responses that NS participants did
not make—apparently genuinely NNS behavior (see Table 5). Examples
include the expressing pride distracter in Item 3 (testing compliments) and
the guessing distracter of Item 5 (testing pretellings). These responses are
encouraging from the standpoint of CAIT development: Despite the seeming limitations of the operationalized norm, they demonstrate that alternative responses can be crafted that are usable as candidate distracters for future iterations of the listening CAIT. More fundamentally, the
very observable existence of an area of NNS behavior outside of an operationalized NS norm further suggests that the norm itself is realizable,
despite CA methodological perspectives on the irrelevance of such norms (Schegloff,
1993; but see Atkinson & Heritage, 1984, p. 2), and can be a basis on which
CAIT measures can be constructed and inferences of SLP skill validly made.
A third tentative finding involves postpilot application of CA principles to
draft items, in light of NS and NNS response patterns, which revealed possible multiple conversational trajectories beyond the operational norm.
Such analysis may indicate the outlines of a workable method for CAIT validation,
as a way to generate both hypotheses regarding NS pragmatic norms and
potential refinements on the wording and coding of response options in
keeping with CA principles such as conditional relevance. Such findings
seem to suggest that CAIT development may hold some promise as an area
of SLPT research and as an advance on traditional approaches to SLPT.

THE AUTHOR
F. Scott Walters is an assistant professor in the Department of Linguistics and
Communication Disorders, Queens College, City University of New York, United
States. His research interests include L2 testing, conversation analysis, and TESOL
assessment-literacy training. He has also engaged in program evaluation for less commonly taught languages (LCTL) and for students with interrupted formal education (SIFE).

REFERENCES
Atkinson, J. M., & Heritage, J. (Eds.). (1984). Structures of social action: Studies in conversational analysis. New York: Cambridge University Press.
Austin, J. L. (1962). How to do things with words. Cambridge, MA: Harvard University
Press.
Bachman, L. F. (1990). Fundamental considerations in language testing. Cambridge:
Cambridge University Press.
Bardovi-Harlig, K., & Dörnyei, Z. (1998). Do language learners recognize pragmatic
violations? Pragmatic versus grammatical awareness in instructed L2 learning.
TESOL Quarterly, 32, 233–262.
Bardovi-Harlig, K., & Hartford, B. S. (1993). Input in an institutional setting. Studies
in Second Language Acquisition, 17, 171–188.

Blum-Kulka, S. (1982). Learning how to say what you mean in a second language.
Applied Linguistics, 3, 29–59.
Blum-Kulka, S., House, J., & Kasper, G. (Eds.). (1989). Cross-cultural pragmatics: Requests
and apologies. Norwood, NJ: Ablex.
Blum-Kulka, S., & Olshtain, E. (1986). Too many words: Length of utterance and
pragmatic failure. Studies in Second Language Acquisition, 8, 47–61.
Bouton, L. F. (1988). A cross-cultural study of ability to interpret implicatures in
English. World Englishes, 17, 183–196.
Bouton, L. F. (1992). The interpretation of implicature in English by NNS: Does it
come automatically—without being explicitly taught? In L. F. Bouton & Y. Kachru
(Eds.), Pragmatics and language learning: Vol. 3 (pp. 53–65). Urbana-Champaign:
University of Illinois Press.
Brown, H. D. (1994). Principles of language learning and teaching (3rd ed.). Englewood
Cliffs, NJ: Prentice-Hall.
Brown, J. D. (2001). Pragmatics tests: Different purposes, different tests. In K. R.
Rose & G. Kasper (Eds.), Pragmatics in language teaching (pp. 301–325). Cambridge:
Cambridge University Press.
Carrell, P. L. (1979). Indirect speech acts in ESL: Indirect answers. In C. A. Yorio, K.
Perkins, & J. Schachter (Eds.), On TESOL ’79 (pp. 297–307). Washington, DC:
TESOL.
Carrell, P. L. (1981). Relative difficulty of request forms in L1/L2 comprehension. In
M. Hines & W. Rutherford (Eds.), On TESOL ’81 (pp. 141–152). Washington, DC:
TESOL.
Chapelle, C. (1999). Validity in language assessment. Annual Review of Applied
Linguistics, 19, 254–272.

Cziko, G. (1981). Psychometric and edumetric approaches to language testing.
Applied Linguistics, 2, 27–43.
Davidson, F. (2000). The language tester’s statistical toolbox. System, 28, 605–617.
Educational Testing Service. (2007). TOEFL iBT tips: How to prepare for the TOEFL iBT.
Retrieved February 12, 2008, from /pdf/TOEFL_Tips.pdf
Faerch, C., & Kasper, G. (1989). Internal and external modification in interlanguage
request realization. In S. Blum-Kulka, J. House, & G. Kasper (Eds.), Cross-cultural
pragmatics: Requests and apologies (pp. 221–247). Norwood, NJ: Ablex.
Gibbs, R. W. (1984). Literal meaning and psychological theory. Cognitive Science, 8,
275–304.
Golato, A. (2003). Studying compliment responses: A comparison of DCTs and
recordings of naturally occurring talk. Applied Linguistics, 1, 1–54.
Guiora, A. Z., Brannon, R. C., & Dull, C. Y. (1972). Empathy and second language
learning. Language Learning, 22, 111–130.
Haladyna, T. M. (1999). Developing and validating multiple-choice test items (2nd ed.).
Mahwah, NJ: Erlbaum.
Heritage, J. (1984). Conversation analysis. In J. Heritage (Ed.), Garfinkel and ethnomethodology (pp. 233–292). Cambridge: Polity Press.
Horst, P. (1953). Correcting the Kuder-Richardson reliability for dispersion of item
difficulties. Psychological Bulletin, 50, 371–374.
Hudson, T. D. (1991). Relationships among IRT item discrimination and item fit indices in criterion-referenced language testing. Language Testing, 8, 160–181.
Hudson, T., Detmer, E., & Brown, J. D. (1995). Developing prototypic measures of cross-cultural pragmatics. Honolulu: University of Hawai’i Press.
Johnson, M., & Tyler, A. (1998). Re-analyzing the OPI: How much does it look
like natural conversation? In R. Young & A. W. He (Eds.), Talking and testing: Discourse approaches to the assessment of oral proficiency (pp. 27–51). Amsterdam:
Benjamins.
Kasper, G. (2005, April). Speech acts in interaction: Towards discursive pragmatics. Plenary
talk presented at the 16th International Conference on Pragmatics and Language
Learning. Bloomington, IN.
Kasper, G., & Rose, K. R. (2002). Pragmatic development in a second language. Oxford:
Blackwell.
Kim, K., & Suh, K. (1998). Confirmation sequences as interactional resources in Korean language proficiency interviews. In R. Young & A. W. He (Eds.), Studies in
Bilingualism: Vol. 14. Talking and testing: Discourse approaches to the assessment of oral
proficiency (pp. 297–332). Amsterdam: Benjamins.
Koike, D. A. (1996). Transfer of pragmatic competence and suggestions in Spanish
foreign language learning. In S. M. Gass & J. Neu (Eds.), Speech acts across cultures:
Challenges to communication in a second language (pp. 257–281). Berlin: Mouton de
Gruyter.
Kunnan, A. J. (1992). An investigation of a criterion-referenced test using G-theory,
and factor and cluster analyses. Language Testing, 9, 30–49.
Lazaraton, A. (1997). Preference organization in oral proficiency interviews: The case of
language ability assessments. Research on Language and Social Interaction, 30, 53–72.
Lazaraton, A. (2002). Studies in language testing: Vol. 14. A qualitative approach to the validation of oral language tests. Cambridge: Cambridge
University Press.
Levinson, S. C. (1983). Pragmatics. Cambridge: Cambridge University Press.
Marlaire, C. L., & Maynard, D. W. (1990). Standardized testing as an interactional
phenomenon. Sociology of Education, 63, 83–101.
Messick, S. (1988). The once and future issues of validity: Assessing the meaning and
consequences of measurement. In H. Wainer & H. I. Braun (Eds.), Test validity
(pp. 33–45). Hillsdale, NJ: Erlbaum.
Mey, J. L. (2001). Pragmatics: An introduction. Malden, MA: Blackwell.
Moss, P. A. (1994). Can there be validity without reliability? Educational Researcher,
23(2), 5–12.
Pomerantz, A. (1978). Compliment responses: Notes on the co-operation of multiple
constraints. In J. Schenkein (Ed.), Studies in the organization of conversational interaction (pp. 57–101). New York: Academic Press.
Pomerantz, A. (1984). Agreeing and disagreeing with assessments: Some features of preferred/dispreferred turn shapes. In J. M. Atkinson & J. Heritage (Eds.),
Structures of social action: Studies in conversational analysis (pp. 79–112). New York:
Cambridge University Press.
Richards, J. C., & Schmidt, R. W. (1983). Conversational analysis. In J. C. Richards &
R. W. Schmidt (Eds.), Language and communication (pp. 117–154). London:
Longman.
Roid, G., & Haladyna, T. M. (1982). Toward a technology of test-item writing. New York:
Academic Press.
Rose, K. (1992). Speech acts and questionnaires: The effect of hearer response.
Journal of Pragmatics, 17, 49–62.
Ross, S. J. (2007a). A comparative task-in-interaction analysis of OPI backsliding.
Journal of Pragmatics, 39, 2017–2044.
Ross, S. J. (2007b, November). An event history approach to unbiased task assessment.
Paper presented at the East Coast Organization of Language Testers conference,
George Washington University, Washington, DC.
Sacks, H., Schegloff, E., & Jefferson, G. (1974). A simplest systematics for the organization of turn-taking for conversation. Language, 50, 696–735.
Schegloff, E. A. (1984). On questions and ambiguities in conversation. In J. M.
Atkinson & J. Heritage (Eds.), Structures of social action: Studies in conversational analysis (pp. 28–52). New York: Cambridge University Press.
Schegloff, E. A. (1988). Presequences and indirection: Applying speech act theory to
ordinary conversation. Journal of Pragmatics, 12, 55–62.
Schegloff, E. A. (1993). Reflections on quantification in the study of conversation.
Research on Language and Social Interaction, 26, 99–128.
Schegloff, E. A. (2007). Sequence organization in interaction: A primer in conversation analysis: Vol. 1. New York: Cambridge University Press.
Schmidt, R. (1983). Interaction, acculturation, and acquisition of communicative
competence. In N. Wolfson & E. Judd (Eds.), Sociolinguistics and second language
acquisition (pp. 137–174). Rowley, MA: Newbury House.
Schmidt, R. (1990). The role of consciousness in second language learning. Applied
Linguistics, 11, 129–158.