Language Assessment Quarterly, 7: 54–74, 2010
Copyright © Taylor & Francis Group, LLC
ISSN: 1543-4303 print / 1543-4311 online


DOI: 10.1080/15434300903464418
Variability in ESL Essay Rating Processes: The Role of the
Rating Scale and Rater Experience
Khaled Barkaoui
York University
Various factors contribute to variability in English as a second language (ESL) essay scores and rating
processes. Most previous research, however, has focused on score variability in relation to task,
rater, and essay characteristics. A few studies have examined variability in essay rating processes.
The current study used think-aloud protocols to examine the roles of rating scales, rater experience,
and interactions between them in variability in raters’ decision-making processes and the aspects of
writing they attend to when reading and rating ESL essays. The study included 11 novice and 14
experienced raters, who each rated 12 ESL essays, both holistically and analytically, while thinking
aloud. The findings indicated that rating scale type had larger effects on the participants’ rating
processes than did rater experience. With holistic scoring, raters tended to refer more often to the
essay (the focus of the assessment), whereas with analytic scoring they tended to refer to the rating
scale (the source of evaluation criteria) more frequently; analytic scoring drew raters’ attention to all
evaluation criteria in the rating scale, and novices were influenced by variation in rating scales more
than were the experienced raters. The article concludes with implications for essay rating practices
and research.
This study examined the roles and effects of two sources of variability in the rating context, rat-
ing scale and rater experience, on English as a second language (ESL) essay rating processes. It
may be useful to think of the rating process as involving a reader/rater interacting with three
texts (the writing task, the essay, and the rating scale) within a specific sociocultural context
(e.g., institution) that specifies the criteria, purposes, and possibly processes of reading and
interpreting the three texts to arrive at a rating decision (Lumley, 2005; Weigle, 2002). Although
various factors can contribute to variability in scores and rater decision-making processes,
research on second-language essay rating has tended to focus on such factors as task require-
ments, rater characteristics, and/or essay features (Barkaoui, 2007a).

However, it is obvious that other contextual factors, such as rating procedures, influence
raters’ judgment of student performance and the scores they assign. As Schoonen (2005)
argued, “The effects of task and rater are most likely dependent on what has to be scored in a
text and how it has to be scored” (p. 5). In addition, the rating scale is an important component
of the rating context because it specifies what raters should look for in a written performance
and will ultimately influence the validity of the inferences and the fairness of the decisions
that educators make about individuals and programs based on essay test scores (Weigle,
2002). This aspect of the rating context, however, has received little attention (Barkaoui,
2007a; Hamp-Lyons & Kroll, 1997; Weigle, 2002).
This article focuses on two types of rating scales, holistic and analytic, that are widely used in
large-scale and classroom assessments (Hamp-Lyons, 1991; Weigle, 2002). These two types of
scales differ in terms of scoring methods and implications for rater decision-making processes
(Goulden, 1992, 1994; Weigle, 2002). In terms of scoring method, in analytic scoring raters assign
subscores to individual writing traits (e.g., language, content, organization); these subscores may
then be summed to arrive at an overall score. In holistic scoring, the rater may also consider individ-
ual elements of writing but chooses one score to reflect the overall quality of the paper (Goulden,
1992, 1994). In terms of decision-making processes, with analytic scoring, the rater has to evaluate the different writing traits separately. In holistic scoring the rater also has to consider the different writing traits, but must then weight and combine the assessments of these traits to arrive at one overall score, which is likely to make the rating task more cognitively demanding.
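To make this contrast concrete, the following minimal sketch (in Python; not part of the original study) illustrates how the two scoring methods combine trait-level judgments. The trait names match the five analytic rating categories used in this study, but the weights and score values are invented for illustration.

```python
# Illustrative sketch only: contrasts analytic and holistic scoring as described above.
# Trait names follow the analytic scale used in this study; weights and scores are invented.

ANALYTIC_TRAITS = ["communicative quality", "organization", "argumentation",
                   "linguistic accuracy", "linguistic appropriacy"]

def analytic_score(subscores):
    """Analytic scoring: one subscore (1-9) per trait; subscores may then be
    summed to arrive at an overall value."""
    assert set(subscores) == set(ANALYTIC_TRAITS)
    return {"subscores": subscores, "total": sum(subscores.values())}

def holistic_score(impressions, weights):
    """Holistic scoring: the rater still considers individual traits but must
    weight and combine them into a single score on the 9-point scale."""
    combined = sum(impressions[trait] * weights[trait] for trait in impressions)
    return max(1, min(9, round(combined)))

essay = dict(zip(ANALYTIC_TRAITS, [6, 5, 5, 4, 5]))
print(analytic_score(essay)["total"])                            # 25 (out of 45)
print(holistic_score(essay, {t: 0.2 for t in ANALYTIC_TRAITS}))  # 5
```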
These differences are likely to influence essay rating processes and outcomes. However,
although the literature is replete with arguments for and against the two rating methods, little is
known about whether and how they impact on ESL essay reading and rating processes and scores
(Barkaoui, 2007a; Hamp-Lyons & Kroll, 1997; Weigle, 2002). Such studies as have been reported
in the literature (e.g., Bacha, 2001; O’Loughlin, 1994; Schoonen, 2005; Song & Caruso, 1996)
examined the effects of rating scales on rater and score reliability but did not consider the rating process. Furthermore, the findings of some of these studies are mixed. For example, in two studies
comparing the holistic and analytic scores assigned by ESL and English teachers to ESL essays,
O’Loughlin (1994) found that holistic ratings achieved higher levels of interrater agreement across
both rater groups, whereas Song and Caruso (1996) found significant differences in terms of the
holistic, but not the analytic, scores across rater groups. Bacha (2001), on the other hand, reported
high levels of inter- and intrarater reliabilities for both types of rating scales.
I am not aware of any study that has examined the effects of different types of rating scales
on L2 essay rating processes (but see Barkaoui, 2007b). Most qualitative studies have investi-
gated the decision-making behaviors and aspects of writing that raters attend to when rating
essays with no specific rating guidelines (e.g., Cumming, Kantor, & Powers, 2002; Delaruelle,
1997), or when using holistic (e.g., Milanovic, Saville, & Shuhong, 1996; Sakyi, 2003;
Vaughan, 1991) or analytic scoring (e.g., Cumming, 1990; Lumley, 2005; Smith, 2000; Weigle,
1999). Lumley and Smith may be two exceptions in that, although they did not specifically com-
pare different rating scales, their findings raise several relevant questions concerning the role of
the rating scale in essay rating processes. Smith found that raters attend to other textual features
in addition to those mentioned in the rating scale, that raters with different reading strategies
interpret and apply the rating criteria differently, and that the rating criteria have different effects
on raters with different approaches to essay reading and rating. Lumley found that (a) raters may
understand the rating criteria similarly in general, but emphasize different components and
apply them in different ways, and (b) raters may face problems reconciling their impression of
the text, the specific features of the text, and the wordings of the rating scale.
Another limitation of previous research is that the frameworks that describe the essay rating
process (e.g., Cumming et al., 2002; Freedman & Calfee, 1983; Homburg, 1984; Milanovic et
al., 1996; Ruth & Murphy, 1988; Sakyi, 2003) do not discuss whether and how the content and
organization of the rating scale influence rater decision-making behaviors and the aspects of
writing raters attend to. For example, Freedman and Calfee seemed to suggest that essay rating
is a linear process where the rater reads the essay, forms a mental representation of it, compares
and matches this representation to the rating criteria, and then articulates a rating decision. Other studies of essay rating did not include any rating scales (e.g., Cumming et al., 2002). As a result,
these studies do not discuss the role of the rating scale in variation in rater decision-making
behaviors. Such information is crucial for designing, selecting, and improving rating scales and
rater training as well as for the validation of ESL writing assessments.
To examine rating scales inevitably means examining the individuals using them, i.e., raters.
As Lumley (2005) emphasized, the rater is at the center of the rating activity (cf. Cumming,
Kantor, & Powers, 2001; Erdosy, 2004). One of the rater factors that seems to play an
important role in the rating process is rater experience (e.g., Cumming, 1990; Lumley, 2005;
Schoonen, Vergeer, & Eiting, 1997; Wolfe, 2006). Schoonen et al., for instance, argued that the
expertise and knowledge that raters bring to the rating task are essential for a reliable and valid
rating (p. 158). There is a relatively extensive literature on the effects of rater expertise on ESL
essay rating processes (Cumming, 1990; Delaruelle, 1997; Erdosy, 2004; Sakyi, 2003; Weigle,
1999). This research indicates that experienced and novice raters employ qualitatively different
rating processes. Cumming (1990), for example, found that experienced teachers had a much
fuller mental representation of the essay assessment task and used a large and varied number of
criteria, self-control strategies,1 and knowledge sources to read and judge ESL essays. Novice
raters, by contrast, tended to evaluate essays with only a few of these component skills and cri-
teria, using skills that may derive from their general reading abilities or other knowledge they
have acquired previously (e.g., editing).
However, there is no research on how raters with different levels of experience approach
essay rating with different types of rating scales. Cumming (1990) hypothesized that novice
raters, unlike experienced raters, may benefit from analytic scoring procedures to direct their
attention to specific aspects of writing as well as appropriate evaluation strategies and criteria,
whereas Goulden (1994) hypothesized that analytic scoring is easier for inexperienced raters, as
fewer unguided decisions (e.g., weighting different evaluation criteria) are required. It was the
aim of the present study to investigate these empirical issues. Specifically, the current study
used think-aloud protocols to examine the roles of rating scale type (holistic vs. analytic), rater
experience (novice vs. experienced), and the interaction between them in variability in ESL essay rating processes. Following previous research (e.g., Cumming et al., 2002; Lumley, 2005,
Milanovic et al., 1996), rating processes are defined as the decision-making behaviors of the
raters and the aspects of writing they attend to while reading and rating ESL essays.
1 Raters' strategies for controlling their own evaluation behavior (e.g., define, assess, and revise own rating criteria; summarize own rating judgment collectively).

METHOD
Participants
The study included 11 novice and 14 experienced raters randomly selected from among 60 volunteers in a larger study on ESL essay scores and rating processes (Barkaoui, 2008). Experienced
raters were graduate students and/or ESL instructors who had been teaching and rating ESL
writing for at least 5 years, had an M.A. or M.Ed. degree, had received specific training in
assessment and essay rating, and rated themselves as competent or expert raters. Novice raters
were mainly teaching English as a second language (TESL) students who were enrolled in or had just completed a preservice or teacher training program in ESL, had no ESL teaching or rating experience at all at the time of data collection, and rated themselves as novice raters. The participants were recruited from various ESL and ESL teacher education (TESL) programs at universities in southern Ontario. They varied in terms of their
gender, age, and first-language backgrounds, but all were native or highly proficient non-native
speakers of English. Table 1 describes the profile of a typical participant in each group.
Data Collection Procedures
The study included 180 essays produced under real-exam conditions by adult ESL learners from
diverse parts of the world and with varying levels of proficiency in English. Each essay was
written within 30 minutes in response to one of two comparable independent prompts (Study
and Sports).
Each rater rated a random sample of 24 essays, 12 silently and 12 while thinking aloud. To
ensure counterbalancing, half the participants in each group were randomly assigned to start
with holistic rating and the other half to start with analytic rating. The holistic and analytic scales, borrowed from Hamp-Lyons (1991, pp. 247–251), included the same evaluation criteria,
wording and number of score levels (9), but differed in terms of whether to assign one overall
score (holistic) or multiple scores (analytic) to each essay. The rating criteria in the analytic
scale were grouped under five categories: communicative quality, organization, argumentation,
linguistic accuracy, and linguistic appropriacy.
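As a rough illustration of this counterbalanced design (a sketch only, not the study's actual assignment procedure), the function below gives each rater a random sample of 24 essays and alternates which scale is used first; the rater and essay labels, the seed, and the function name are all invented.

```python
# Illustrative sketch of the counterbalanced design described above; not the study's script.
import random

def assign_ratings(raters, essay_pool, seed=0):
    """Give each rater a random sample of 24 essays (12 to rate silently, 12 while
    thinking aloud) and counterbalance which rating scale is used first."""
    rng = random.Random(seed)
    order = raters[:]
    rng.shuffle(order)
    plan = {}
    for i, rater in enumerate(order):
        essays = rng.sample(essay_pool, 24)
        plan[rater] = {
            "first_scale": "holistic" if i % 2 == 0 else "analytic",  # half start with each scale
            "silent": essays[:12],
            "think_aloud": essays[12:],
        }
    return plan

# Run once per rater group (novice, experienced) so counterbalancing holds within each group.
plan = assign_ratings([f"novice_{k}" for k in range(1, 12)],
                      [f"essay_{k}" for k in range(1, 181)])
```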
Each participant attended a 30-minute individual training session about one of the rating
scales and rated and discussed a sample of four essays. Next, each rated 12 essays silently at
home using the first rating scale (these silent ratings are not considered in this paper). Each rater
then attended a 30-min session where they received detailed instructions and careful training on
how to think aloud while rating the essays following procedures and instructions in Cumming
et al. (2001, pp. 83–85). Later, each participant rated the remaining 12 essays while thinking
aloud into a tape-recorder. At least two weeks later, each participant attended a second training session with the second rating scale and rated 12 essays silently and 12 while thinking aloud. Each participant rated the same 12 think-aloud essays with both scales but in a different random order of essays and prompts. All participants did all the think-aloud protocols individually, at the participant's home, to allow them enough time to verbalize and to minimize researcher effects on the participants' performance. Figure 1 summarizes the data collection procedures.

TABLE 1
Typical Profile of a Novice and an Experienced Rater

                                        Novice (n = 11)        Experienced (n = 14)
Role at time of the research            TESL student           ESL teacher
ESL teaching experience                 None                   10 years or more
Rating experience                       None                   5 years or more
Post-graduate study                     None                   M.A./M.Ed.
Received training in assessment         No                     Yes
Self-assessment of rating ability       Novice                 Competent or expert

Note. TESL = teaching English as a second language; ESL = English as a second language.
Data Analysis
Data for the current study consisted of the participants' think-aloud protocols only. Because
some raters did not record their thinking aloud while rating some of the essays and because of
poor recording quality, only 558 protocols (out of 600) were analyzed. The novice raters
provided 264 of these protocols (47%). There was an equal number of protocols for each rating
scale and on each prompt. The protocols were coded with the assistance of the computer
program Observer 5.0 (Noldus Information Technology, 2003), software for the organization,
analysis, and management of audio and video data. Using Observer allowed coding to be carried
out directly from the protocol audio-recordings (instead of transcripts).
The unit of analysis for the think-aloud protocols was a decision-making statement, which
was segmented using the following criteria from Cumming et al. (2002): (a) a pause of five
seconds or more, (b) rater reading aloud a segment of the essay, and/or (c) end or beginning of
the assessment of a single essay. The coding scheme was developed based mainly on Cum-
ming et al.’s (2002) empirically based schemes of rater decision-making behaviors and
aspects of writing raters attend to. Cumming et al.’s main model of rater behavior, as it
applied to the rating of independent prompts,2 consists of various decision-making behaviors
grouped under three foci (rater self-monitoring behavior, ideational and rhetorical elements of
the text, control of language within the text) and two strategies (interpretation and judgment).
Interpretation strategies consist of reading strategies aimed at comprehending the essay,
whereas judgment concerns evaluation strategies for formulating a rating. Cumming et al. also
distinguished between three general types of decision-making behavior: a focus on self-monitoring (i.e., focus on one's own rating behavior, e.g., monitor for personal bias), a focus on the essay's realization of ideational and rhetorical elements (e.g., essay rhetorical structure, coherence, relevance), and a focus on the essay's accuracy and fluency in the English language (e.g., syntax, lexis).

2 Cumming et al. (2001, 2002) developed three frameworks based on data from different types of tasks and both ESL and English teachers.

FIGURE 1 Summary of data collection procedures.
Phase 1:
1. Orientation session for rating scale 1 (scales counterbalanced).
2. Rating 12 essays silently using scale 1 (at home).
3. Think-aloud training.
4. Rating 12 essays while thinking aloud using scale 1 (at home).
Phase 2:
5. Orientation session for rating scale 2.
6. Rating 12 essays silently using scale 2 (same essays as in 2 above) (at home).
7. Rating 12 essays while thinking aloud using scale 2 (same essays as in 4 above) (at home).
Based on preliminary inspections of the data, 36 codes were selected from Cumming et al.’s
frameworks and three new ones were added: (a) Read, interpret, refer, or comment on rating
scale to account for the raters’ uses of the rating scales; (b) Assess communicative effectiveness
or quality, which pertains to text comprehensibility and clarity at both the local and global
levels; and (c) Compare scores across rating categories, to account for participants’ comparison
of scores assigned to the same essay on different analytic rating categories. The final coding
scheme consisted of 39 codes. A complete list of the codes with examples from the current study
is presented in the appendix.
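For readers who want a concrete picture of how such a scheme can be organized, the rough sketch below (in Python) mirrors the two-strategies-by-three-foci structure described above. Only a handful of codes are shown, and their placement in particular cells is illustrative rather than a reproduction of the scheme in the appendix.

```python
# Rough, partial sketch of the coding scheme's structure (strategy crossed with focus).
# The full scheme has 39 codes; the placement of the examples below is illustrative only.
CODING_SCHEME = {
    ("interpretation", "self-monitoring"): ["Read or reread essay",
                                            "Refer to, read or interpret rating scale"],
    ("interpretation", "rhetorical"):      ["Summarize ideas and propositions"],
    ("interpretation", "language"):        ["Interpret ambiguous or unclear phrases"],
    ("judgment", "self-monitoring"):       ["Articulate, justify or revise scoring decision",
                                            "Compare scores across rating categories"],
    ("judgment", "rhetorical"):            ["Assess text organization", "Rate ideas and/or rhetoric"],
    ("judgment", "language"):              ["Rate language overall",
                                            "Assess style, register, or linguistic appropriacy"],
}

def tally(statement_codes):
    """Count how often each (strategy, focus) cell occurs in one coded protocol,
    where statement_codes is a list of (strategy, focus, code) triples."""
    counts = {cell: 0 for cell in CODING_SCHEME}
    for strategy, focus, _code in statement_codes:
        counts[(strategy, focus)] += 1
    return counts
```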
The author coded all the protocols by assigning each decision-making statement all the
relevant codes in the coding scheme. To check the reliability of the coding, the coding scheme
was discussed with another researcher, who then independently coded a random sample of 70 protocols (3,083 codes). Percentage agreement achieved was 81%, computed for agreement in
terms of the main categories in the appendix. Percentage agreements for main categories and
within each category varied, however (e.g., 76% for self-monitoring-judgment, 85% for
language-judgment). For most cases, the coders were able to reconcile the codes. In the few
cases where they were not able to reach an agreement, the author decided the final code to be
assigned.
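The 81% figure is a simple percentage agreement computed at the main-category level; a minimal sketch of that computation (with invented codes rather than the study's data) is given below.

```python
# Minimal sketch of percentage agreement between two coders at the main-category level;
# the category labels and coded statements below are invented for illustration.

def percent_agreement(coder_a, coder_b):
    """Percentage of coding decisions on which the two coders assigned the same category."""
    assert len(coder_a) == len(coder_b), "Both coders must code the same statements"
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return 100 * matches / len(coder_a)

a = ["self-monitoring-judgment", "language-judgment", "rhetorical-interpretation"]
b = ["self-monitoring-judgment", "language-interpretation", "rhetorical-interpretation"]
print(f"{percent_agreement(a, b):.0f}% agreement")  # 67% agreement
```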
As in previous studies (e.g., Cumming, 1990; Cumming et al., 2002; Wolfe, 2006;
Wolfe, Kao, & Ranney, 1998), the focus in this study is on comparing the frequency of the
decision-making behaviors and aspects of writing attended to. Consequently, the coded
protocol data were tallied and percentages were computed for each rater for each code in the
coding scheme. These percentages served as the data for comparison across rater groups
and rating scales. Statistical tests were then conducted on the main categories in the appen-
dix. Subcategories were used for descriptive purposes only and to explain significant differ-
ences in main categories. Because the coded data did not seem to meet the statistical
assumptions of parametric tests, nonparametric tests were used to compare coded data
across rating scales (Wilcoxon Signed-Ranks Test) and across rater groups (Mann-Whitney Test).3 Because these tests rely on ranks, the following descriptive statistics are reported
next: median (Mdn) and the highest (Max) and lowest (Min) values for each main category.
Finally, because each participant provided 12 protocols for each rating scale, each rater had
24 percentages for each code. For example, each rater had 24 percentages, 1 for each essay for
each rating scale (i.e., 12 essays × 2 rating scales), for the code “scan whole composition.” To be
able to analyze the coded data statistically, these percentages had to be aggregated as follows.
To compare coded data across rating scales, the protocols were aggregated at the rater level, by
type of rating scale, to obtain 2 average percentages for each code for each rater, 1 for each
rating scale. To compare the coded data across rater groups, the protocols were aggregated at the
rater level to obtain one proportion per rater. Statistical tests were then conducted on aggregated
data.
3 The Wilcoxon signed-ranks test is a nonparametric equivalent of the dependent t test, whereas the Mann-Whitney test is a nonparametric equivalent of the independent t test for comparing two independent groups.
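A brief sketch of how these two comparisons can be run on the aggregated per-rater percentages is given below; the numbers are invented, and scipy's implementations are used here simply as stand-ins for whatever statistical software was actually employed.

```python
# Sketch (with invented per-rater percentages) of the two nonparametric comparisons above.
import numpy as np
from scipy.stats import wilcoxon, mannwhitneyu

# One aggregated percentage per rater for a given code (e.g., judgment strategies).
holistic = np.array([58.3, 55.1, 60.2, 57.8, 62.0])    # same raters, holistic scale
analytic = np.array([63.0, 61.5, 66.2, 60.9, 64.8])    # same raters, analytic scale
novice = np.array([59.1, 58.0, 60.3])                  # per-rater percentages, novice group
experienced = np.array([61.5, 62.2, 60.8, 63.4])       # per-rater percentages, experienced group

# Across rating scales: paired data (each rater used both scales) -> Wilcoxon signed-ranks test.
print(wilcoxon(holistic, analytic))

# Across rater groups: independent groups -> Mann-Whitney test.
print(mannwhitneyu(novice, experienced))
```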
FINDINGS
Scale Effects
Table 2 reports descriptive statistics of the percentages of decision-making strategies and
aspects of writing reported in the think-aloud protocols by main category across rating scales.
Overall, (a) the participants reported more judgment (Mdn = 58% and 63% for holistic and ana-
lytic, respectively) than interpretation strategies (Mdn = 42% and 37%) with both rating scales,
(b) self-monitoring focus was the most frequently mentioned (Mdn = 44% and 50%) and lan-
guage focus the least frequently mentioned (Mdn = 23% and 20%) with both rating scales, and
(c) Wilcoxon Signed-Ranks tests indicated that the holistic scale elicited significantly (p < .05)
more interpretation strategies for the three focuses (self-monitoring, Mdn = 31%; language, Mdn
= 6%; and rhetorical and ideational, Mdn = 4%) and more language focus (Mdn = 23%) than did
the analytic scale, which elicited significantly more judgment strategies (Mdn = 63%) and self-
monitoring focus (Mdn = 50%) than did the holistic scale.
TABLE 2
Descriptive Statistics for Decision-Making Behaviors by Rating Scale

                                   Holistic                     Analytic
                           Mdn      Min      Max        Mdn      Min      Max
Focus
  Self-monitoring*        43.88    36.18    62.30      50.40    39.29    62.53
  Rhetorical              31.00    18.58    44.10      28.10    22.24    36.84
  Language*               22.84    12.12    37.77      20.39    11.99    33.96
Strategy
  Interpretation*         41.70    32.86    51.12      37.38    25.12    43.67
  Judgment*               58.30    48.88    67.14      62.62    56.33    74.88
Strategy × Focus
  Interpretation
    Self-monitoring*      30.96    26.50    36.38      29.71    18.20    35.92
    Rhetorical*            3.67      .35    13.99       3.42      .80     6.31
    Language*              5.75     2.20    11.46       3.76     1.07    11.70
  Judgment
    Self-monitoring*      13.41     7.67    28.84      22.06    10.97    31.98
    Rhetorical            24.83    15.57    36.08      26.09    19.40    33.10
    Language              17.51     9.92    27.57      15.56     9.99    27.51

Note. N = 25 raters. *Wilcoxon Signed Ranks tests indicated that the differences across rating scales were statistically significant at p < .05.

In terms of subcategories, Table 3 lists the strategies that were reported more frequently with each rating scale. There were more references to specific linguistic features (e.g., syntax, lexis, spelling) with the holistic scale, whereas the analytic scale elicited more references to rating language overall (see the appendix for examples). In addition, with holistic scoring raters tended to read and interpret the essay more frequently, whereas the analytic scale elicited more references to the rating scale and to articulating and justifying scores. Finally, the analytic scale prompted more references to text organization and linguistic appropriacy.

TABLE 3
Medians for Strategies That Differed by 1% or More Across Rating Scales

Strategies                                            Holistic Mdn    Analytic Mdn
Higher with the holistic scale
  Read or reread essay                                    19.32%          14.33%
  Interpret ambiguous or unclear phrases                   2.36%           1.29%
  Articulate general impression                            2.97%           1.83%
  Rate ideas and/or rhetoric                               3.18%           2.09%
  Classify errors into types                               3.26%           1.82%
  Consider lexis                                           2.28%           1.28%
  Consider syntax and morphology                           3.62%           2.24%
  Consider spelling or punctuation                         3.78%           1.91%
Higher with the analytic scale
  Refer to, read or interpret rating scale                 7.78%          11.07%
  Articulate, justify or revise scoring decision           8.55%          16.85%
  Assess text organization                                 2.98%           4.54%
  Assess style, register, or linguistic appropriacy        1.10%           3.49%
  Rate language overall                                    1.04%           3.32%

Rater Experience Effects
Table 4 reports descriptive statistics for the percentages of think-aloud codes by main category across rater groups. It shows that, overall, (a) both groups reported more judgment (Mdn = 59% and 61% for novices and experts, respectively) than interpretation (Mdn = 41% and 39%) strategies, (b) self-monitoring focus was the most frequently mentioned (Mdn = 49% and 45%) and language the least frequently mentioned focus (Mdn = 23% and 20%) for both groups, and (c) the novice raters reported slightly more interpretation strategies (Mdn = 41%) and self-monitoring focus (Mdn = 49%) than the experienced group (Mdn = 39% and 45%, respectively), who reported slightly more judgment strategies (Mdn = 61%) and rhetorical and ideational focus (Mdn = 30%). Mann-Whitney tests indicated that none of these differences was statistically significant at p < .05, however.

TABLE 4
Descriptive Statistics for Decision-Making Behaviors by Rater Group

                           Novice (n = 11)               Experienced (n = 14)
                        Mdn      Min      Max         Mdn      Min      Max
Focus
  Self-monitoring      49.08    43.29    62.42       45.25    39.91    54.74
  Rhetorical           27.70    22.43    37.38       30.32    21.32    38.83
  Language             23.04    13.45    27.99       20.43    13.91    34.62
Strategy
  Interpretation       40.88    36.19    45.09       38.51    31.94    45.52
  Judgment             59.12    54.91    63.81       61.49    54.48    68.06
Strategy × Focus
  Interpretation
    Self-monitoring    30.20    25.91    34.69       29.42    23.77    33.24
    Rhetorical          4.47      .81     9.31        3.33      .90     7.30
    Language            4.81     2.31    11.58        4.58     1.76     9.65
  Judgment
    Self-monitoring    18.30    14.81    27.73       16.79    10.83    26.43
    Rhetorical         23.32    19.01    30.68       27.02    18.68    33.52
    Language           16.28    11.15    20.19       15.75    12.15    24.97

Table 5 shows the subcategories that each rater group reported more frequently than the other group did. Overall, the novices tended to refer to the rating scale and to focus on local textual aspects and understanding essay content (e.g., summarize ideas) more frequently than did the experienced raters, who tended to refer more frequently to the essay and to rhetorical aspects of writing such as text organization and ideas, as well as the writer's situation and essay length, two aspects that were not included in the rating scales.

TABLE 5
Medians for Strategies That Differed by 1% or More Across Rater Groups

Strategies                                         Novice Mdn    Experienced Mdn
Higher for the novice group
  Refer to, read or interpret rating scale           10.15%            8.38%
  Articulate or justify score                        13.79%           11.22%
  Interpret ambiguous or unclear phrases              2.03%            1.02%
  Summarize ideas and propositions                    1.87%            0.71%
  Edit or interpret unclear phrases                   1.69%            0.51%
  Consider spelling and punctuation                   3.78%            2.64%
Higher for the experienced group
  Read or reread essay                               15.89%           17.35%
  Envision writer's personal situation                0.66%            1.67%
  Assess text organization                            2.42%            4.19%
  Rate ideas and/or rhetoric                          2.01%            3.13%
  Assess quantity                                     1.01%            2.17%

Interaction Effects
Table 6 reports descriptive statistics of the percentages of think-aloud codes by main category across rating scales and rater groups. First, comparing across rating scales within rater group, Table 6 shows that both rater groups reported more self-monitoring focus and judgment strategies with the analytic scale and more interpretation strategies and language-interpretation with the holistic scale. Wilcoxon Signed Ranks tests indicated that these differences across rating scales were statistically significant for both rater groups at p < .05. In addition, the novice raters
reported significantly more language focus (Mdn = 27%, particularly language-interpretation) and rhetorical-interpretation (Mdn = 6%) with the holistic scale than they did with the analytic scale (Mdn = 18% and 3%, respectively), while the experienced raters reported significantly more self-monitoring-interpretation (Mdn = 31%) with the holistic scale than they did with the analytic scale (Mdn = 29%). The following is a list of the subcategories of strategies that each rater group reported more frequently with each rating scale.
The novices reported the following subcategories more frequently with
1. Holistic scale: Read or reread essay; Articulate general impression; Interpret unclear phrases; Rate ideas or rhetoric; Classify errors into types; Consider error gravity; Consider lexis; Consider syntax or morphology; and Consider spelling and punctuation.
2. Analytic scale: Read or refer to rating scale; Articulate or justify score; Assess text organization; Assess style, register or linguistic appropriacy; and Rate language overall.
The experienced raters reported the following subcategories more frequently with
1. Holistic scale: Read or reread essay; Envision writer's personal situation; Rate ideas and rhetoric; and Classify errors into types.
2. Analytic scale: Read or refer to rating scale; Articulate or justify score; Assess style, register or linguistic appropriacy; and Rate language overall.

TABLE 6
Descriptive Statistics for Decision-Making Behaviors by Rating Scale and Rater Groups

                                     Holistic                     Analytic
Group/Scale                  Mdn      Min      Max        Mdn      Min      Max
Novice (n = 11)
  Focus
    Self-monitoring         44.19    36.18    62.30      52.46    47.45    62.53
    Rhetorical              26.08    21.38    43.34      26.89    22.24    32.06
    Language                26.71    12.12    34.10      18.21    11.99    29.00
  Strategy
    Interpretation          41.70    35.48    51.12      37.38    33.61    42.44
    Judgment                58.30    48.88    64.52      62.62    57.56    66.39
  Strategy × Focus
    Interpretation
      Self-monitoring       30.56    26.50    33.59      30.04    25.33    35.92
      Rhetorical             5.92      .35    13.99       3.39     1.28     6.31
      Language               6.52     2.20    11.46       3.73     1.07    11.70
    Judgment
      Self-monitoring       14.84     7.67    28.84      22.72    18.89    28.20
      Rhetorical            22.49    16.45    32.68      25.04    19.40    28.67
      Language              15.25     9.92    27.57      14.63     9.99    19.34
Experienced (n = 14)
  Focus
    Self-monitoring         43.66    38.83    51.94      45.79    39.29    60.11
    Rhetorical              31.11    18.58    44.10      29.98    23.60    36.84
    Language                22.23    12.24    37.77      21.03    12.92    33.96
  Strategy
    Interpretation          41.63    32.86    49.32      37.05    25.12    43.67
    Judgment                58.37    50.68    67.14      62.95    56.33    74.88
  Strategy × Focus
    Interpretation
      Self-monitoring       31.08    26.95    36.38      29.36    18.20    33.00
      Rhetorical             3.40      .65    10.42       3.48      .80     4.55
      Language               5.14     2.24    11.21       4.02     1.27     8.34
    Judgment
      Self-monitoring       12.23     8.43    21.97      20.57    10.97    31.98
      Rhetorical            26.15    15.57    36.08      27.01    21.27    33.10
      Language              17.66    10.00    26.55      17.06    10.84    27.51
Several trends emerge from these comparisons. First, the novice raters exhibited a shift from
a focus on specific linguistic features (e.g., syntax, lexis, spelling) with the holistic scale to a
focus on rating language overall with the analytic scale. Second, both groups tended to refer
more often to linguistic appropriacy with the analytic scale. Finally, the novices referred more
frequently to text organization when rating the essays analytically, suggesting that the analytic
scale drew their attention to this aspect of writing as well as linguistic appropriacy.
Second, comparing across rater groups within rating scale, Table 6 shows that (a) the experi-
enced raters made more comments related to rhetorical-judgment and rhetorical focus than the nov-
ices did with both rating scales, (b) the novices made more comments on self-monitoring-judgment
than the experienced raters did with both rating scales, and (c) the experienced raters made more comments involving language-judgment (Mdn = 17%) than the novices did (Mdn = 15%) with the
analytic scale. None of these differences was statistically significant at p < .05, however. The fol-
lowing is a list of the subcategories of strategies that each rater group reported more frequently with
each rating scale.
With the holistic scale, the following subcategories were reported more frequently by
1. Novices: Read or refer to rating scale; Articulate or justify score; and Consider spelling
and punctuation.
2. Experienced raters: Read or reread essay; Envision writer’s personal situation; Assess
reasoning and topic development; and Assess text organization.
With the analytic scale, the following subcategories were reported more frequently by
1. Novices: Read or refer to rating scale; Articulate or justify score; and Consider spelling
and punctuation.
2. Experienced raters: Read or reread essay; Assess quantity; and Consider syntax or mor-
phology.
These results indicate that novices tended to refer to the rating scales, the source of the
evaluation criteria, more frequently than did the experienced raters, who referred more often to
the essay, the focus of the assessment, regardless of the rating scale used. In addition, with
holistic scoring, the experienced raters referred more frequently to rhetorical and ideational
aspects (i.e., ideas, organization, development) than did the novices, but with the analytic
scale, they referred to linguistic features (syntax and morphology) and text length more often
than did the novices.
Collectively, these results highlight four observations. First, the differences tend to be
larger (and significant) across rating scales than they are across rater groups. Although the
two groups differed in terms of several strategies, the differences across scales are more
noticeable. Second, these results suggest that the holistic scale led the novice raters to
report specific linguistic aspects (e.g., lexis, error frequency, syntax, spelling) separately,
whereas the analytic scale, by grouping these aspects under one heading in the scale (linguistic accuracy), led these raters to treat these specific aspects as one category rather than multiple categories that need to be considered (and perhaps weighted and scored) separately. This is evident in the trend, among the novices, of reporting more specific linguistic
aspects with the holistic scale, compared to a higher proportion of reporting language over-
all with the analytic scale. Third, the novice raters tended to read or refer to the rating scale
and to articulate, revise, or justify scores more often than the experienced raters did with
both rating scales, whereas the latter tended to read or reread the essay more often. Fourth,
the analytic scale drew the attention of raters from both groups to linguistic appropriacy, an
aspect that most participants reported that they were not familiar with or sure what it meant.
In addition, the analytic scale seems to have drawn the novice raters’ attention to text orga-
nization.
The analyses reported so far were conducted using aggregated data. The decision-making
behaviors of the raters and the aspects of writing they attended to were also compared across rat-
ing scales for each rater separately using Wilcoxon Signed-Ranks tests.4 Table 7 summarizes the
results of these analyses. The sample size in each case refers to the number of protocols by an
individual rater, rather than to the number of raters as was the case with the previous (aggre-
gated) analyses.
Table 7 shows that the patterns across rating scales are, overall, the same as those reported for
aggregated data just presented. For instance, Table 7 shows that more raters reported signifi-
cantly more self-monitoring-judgment, self-monitoring, and judgment strategies with the ana-
lytic scale than they did with the holistic scale. There were some individual differences,
however. For instance, although the main trend was toward reporting more rhetorical-interpreta-
tion strategies with the holistic scale, two experienced raters reported these strategies signifi-
cantly more often with the analytic scale. On the other hand, among those raters who reported
significantly different proportions of judgment and interpretation strategies across rating scales,
all reported more judgment strategies with analytic scoring and more interpretation strategies
with holistic scoring.
Another point worth noting in Table 7 concerns the number of raters from each group who
exhibited significant change in the proportions of aspects of writing and decision-making behaviors they reported across rating scales. In terms of focus (i.e., self-monitoring, rhetoric and ideas,
and language), there were 30 cases with significant change across rating scales; 17 of them
(57%) involved novice raters. In terms of strategies (i.e., interpretation and judgment), there
were 22 cases of significant change, 14 of them (64%) involving experienced raters, who
reported significantly more interpretation strategies with the holistic scale and significantly
more judgment strategies with the analytic scale. This trend suggests that variation across rating
scales tended to appear (a) in the rating criteria or focus of novice raters and (b) in the strategies or decision-making behaviors of experienced raters.

4 In other words, for each rater the proportions of decision-making behaviors and aspects of writing attended to were compared across rating scales. The unit of analysis was the think-aloud protocol, and sample size was the number of protocols per rater per rating scale (i.e., n = 12 protocols per rater per rating scale, except for raters with missing data).
SUMMARY AND DISCUSSION
Overall, the findings of this study indicated a larger effect of rating scale than of rater experience
on raters’ decision-making behaviors and aspects of writing attended to. The holistic scale
elicited more interpretation strategies for the three focuses (self-monitoring, language, and rhe-
torical and ideational) and more language focus than the analytic scale, which elicited signifi-
cantly more judgment strategies and self-monitoring focus. That the analytic scale resulted in
more judgment strategies, whereas the holistic scale prompted more interpretation strategies was
expected, as the raters had to make more than one score decision with the analytic scale. How-
ever, this suggests that the way the rating criteria are organized in an evaluation scheme affects
the relative frequency of the strategies raters use (cf. Barkaoui, 2007b). With holistic scoring,
raters tended to read or reread the essays, interpret unclear parts of the texts, assess task comple-
tion, ideas and rhetoric, and attend to specific linguistic features more frequently. With the ana-
lytic scale, raters tended to read and refer to the rating scale; articulate and justify scores; and
assess text organization, linguistic appropriacy, and language overall more often. These findings
suggest that with analytic scoring, there was closer attention to various criteria on the rating scale and use of judgment and self-monitoring strategies. Score analyses (not reported here) indicated that raters tended to be more self-consistent with the analytic scale (see Barkaoui, 2008).

TABLE 7
Comparison of Decision-Making Behaviors Across Rating Scales for Individual Raters

                           No. of Raters With Significantly    No. of Raters With Significantly
                            Higher Proportion for Holistic      Higher Proportion for Analytic
                              Novice        Experienced           Novice        Experienced
Focus
  Self-monitoring                0               1                    6               3
  Rhetorical                     4c              2                    1               2
  Language                       5               2                    2               2
Strategy
  Interpretation                 4               7                    0               0
  Judgment                       0               0                    4               7
Strategy × Focus
  Interpretation
    Self-monitoring              1               4                    1               0
    Rhetorical                   6               4                    0               2
    Language                     6               4                    1               0
  Judgment
    Self-monitoring              0               0                    8               8
    Rhetorical                   4               1                    0               2
    Language                     4               2                    1               1

Note. Novice: n = 11 raters; Experienced: n = 14 raters; n = 12 protocols per rater per rating scale, except for raters with missing data.
c This is read as follows: Of the 11 novice raters, 4 had a significantly (p < .05) larger proportion for Rhetorical focus with the holistic scale than they did with the analytic scale.
These findings suggest that the way the rating criteria are organized in an evaluation scheme
influences the aspects of writing that raters attend to. For example, although both scales
included the same rating criteria, the holistic method seems to have focused raters’ attention on
language, rhetoric and ideas (i.e., linguistic accuracy, organization and argumentation). The ana-
lytic scale, on the other hand, seems to have drawn the raters’ attention to linguistic appropriacy,
an aspect that the raters seemed to be less familiar with. It seems that raters are more likely to
attend to all criteria listed in the scale with analytic scoring.
Differences across rater groups were not significant. However, novices reported slightly
more interpretation strategies and self-monitoring focus than did the experienced raters, who
reported slightly more judgment strategies and rhetorical and ideational focus. Generally, nov-
ices were more dependent on the rating scales for rating criteria and decisions than were the
experienced raters. They tended to refer to the rating scales and rely on criteria listed in the
scales more frequently when making their scoring decisions. In addition, they tended to focus on
specific, local aspects of writing more often and to spend more time interpreting and/or editing
text than the experienced raters did (cf. Cumming, 1990; Sakyi, 2003). These tendencies seem to be due to the novices' lack of experience with ESL writing, which might have led them to focus
on local linguistic features in order to understand the texts before they could evaluate other
aspects (cf. Sakyi, 2003). In addition, because they lack established criteria for judging writing
quality and/or how to approach the rating task, these novice raters may have relied on the rating
scale more heavily and/or based their score decisions and justifications on simple or easily dis-
cernable aspects of writing such as lexis, syntax, and punctuation (cf. Sakyi, 2003).
Experienced raters, by contrast, reported more judgment strategies and rhetorical and ide-
ational focus and tended to allot more time to reading and assessing the essays overall, particu-
larly in terms of rhetoric and ideas, than the novices did (cf. Cumming, 1990; Milanovic et al.,
1996). This seems to be particularly true with the holistic scale. In addition, the experienced rat-
ers tended to refer to other criteria than those mentioned in the rating scales (e.g., length, writer's
situation) more frequently than did the novices. Score analyses (not reported here), however,
indicated that the experienced raters tended to be more self-consistent and more homogeneous in
terms of severity than were the novice raters (Barkaoui, 2008). Previous studies (e.g., Cumming,
1990; Weigle, 1999) found large qualitative differences between novice and experienced raters.
The results of this study, however, suggest that these differences may depend on the evaluation
tool (i.e., rating scale) used.
There was some evidence of a differential effect of rating scales across rater groups. There
was a general trend among the novice raters to report attending to specific linguistic features
(e.g., lexis, error frequency, syntax, spelling) more frequently with the holistic scale but to refer
to language overall more often with the analytic scale. It is possible that, because the holistic
scale lists several specific linguistic features (grammar, vocabulary, spelling, etc.) without any
indication of their importance relative to each other or to other criteria, it led the novices to treat
these features as multiple categories that need to be considered (and perhaps weighted and
scored) separately. By grouping these aspects under one heading in the scale, the analytic scale
seems to have led the novice raters to treat these specific aspects as one component rather than
multiple categories that need to be considered separately. It would, thus, seem that variation in
the organization of rating scales is likely to influence the rating processes and criteria or focus of novice raters. Novices also tended to attend to organization more often with the analytic scale,
suggesting that this scale drew their attention to this aspect of writing.
Overall, these findings lend support to Cumming’s (1990) hypothesis that variation in rating
scale can affect the rating criteria and strategies that novice raters employ. The analytic scale
seems to have focused the novice raters’ attention on the rating criteria in the scale and helped
them organize their rating criteria coherently. In addition, it seems to have made the rating task
easier and more manageable for them than with holistic scoring. As previously noted, novices were
more self-consistent with the analytic scale. With holistic scoring, the novices had to deal with a
more complex task that involved not only evaluating the essays in terms of the various criteria in
the scale but also weighing these criteria to arrive at a single overall score. As a result, many of
them were unable to rate consistently with this method and felt a need to break the rating task
down into a series of smaller decisions to make it manageable. It should be noted here that raters
from both groups expressed a general preference for analytic scoring because it does not require
the rater to decide on a single score when an essay displays different levels of proficiency in different writing areas. Holistic scoring, by contrast, often leads to conflicting criteria, thus making
the rating task more complex (cf. Sakyi, 2003; Vaughan, 1991). Rating scales seem also to influ-
ence the decision-making behaviors of experienced raters, but this effect was not as pronounced as
it was for the novices. Analyses at the individual rater level indicated that there was some individ-
ual variability in terms of decision-making behavior and aspects of writing attended to.
The findings of this study, thus, suggest that analytic scoring focuses raters’ attention on the
criteria listed in the rating scale (Goulden, 1994) and allows raters to reduce the number of
conflicts they face in their scoring decisions (cf. Lumley, 2005; Sakyi, 2003). As such, it seems
suitable for less experienced raters because it has the potential to focus their attention on the
rating task and rating criteria in the rubric, to lessen the cognitive demands of weighting and
arbitrating between rating criteria, and to enhance their self-consistency. These are effects that
Weigle (1994, 1998) found to be associated with rater training as well. Because it is more
complex, holistic scoring may require a higher level of rating expertise (Cumming, 1990; Huot,
1993). Nevertheless, it seems that analytic scoring can be a useful tool for experienced raters
when joining a new assessment system because it can draw their attention to all rating criteria,
particularly those that they are not familiar with. For instance, the analytic scale drew the attention of raters in both groups to linguistic appropriacy, an aspect that many raters reported was
new to them. Although the same descriptor of linguistic appropriacy was included in both rating
scales, raters in both groups seem to have ignored this aspect when evaluating the essays holisti-
cally. Score analyses (not reported here) indicated that linguistic appropriacy and organization
did not have significant associations with the holistic scores the raters assigned once the other
three analytic criteria (communicative quality, argumentation, and linguistic accuracy) were
accounted for (Barkaoui, 2008).
LIMITATIONS AND FUTURE RESEARCH
As with any research, there were limitations to the present study. First, as discussed elsewhere
(Barkaoui, forthcoming), think-aloud protocols do not provide a complete picture of the rating
process and might have affected the rating processes of some participants. For example, several
raters reported that they found it difficult to report complex, intuitive, and/or tacit thoughts and
reactions; that verbalization might have affected various aspects of their rating processes (e.g.,
rating criteria); and that these effects, as well as the quality and quantity of verbalization, varied
across raters and rating scales. These limitations need to be taken into account when interpreting
the findings and conclusions of this study.
Second, the think-aloud protocols were coded and analyzed quantitatively. As Cumming et al.
(2001) noted, the coding framework has its limitations. Some behaviors overlap with one
another, making it difficult to separate them, code them, or agree on their coding. In addition,
the terminology used to describe rating behaviors is open to different interpretations; raters
might have meant different things by using the same term or meant the same thing by different
terms (Cumming et al., 2001, p. 71). Furthermore, the quantification of qualitative data in this
study was limited to comparing the frequency of codes. Although this was a useful strategy
given the relatively large number of participants in this study, it cannot detect such qualitative
differences as variation in sequences of decision-making behaviors and individual rating styles
within and across rating scales, raters, and groups. Rating style refers to how a rater reads the
essay, interprets the rating scale, and assigns a score (Lumley, 2005; Sakyi, 2003; Smith, 2000).
Both sequencing and rating styles are important aspects of the rating process.

Third, this study was cross-sectional; it compared the performance of two groups of raters at two
points on the expertise continuum. As such, although it provided important insights into similarities
and differences between the two groups, it says very little about whether, how and why raters’ rating
processes and criteria change over time or how rating expertise develops. This is also a limitation of
all previous studies on rater expertise, however. In addition, rater experience is only one of the various rater factors (e.g., L1, educational background) that can influence essay rating processes and criteria.
Finally, the study adopted an experimental, rather than a naturalistic, approach to the exami-
nation of the effects of rating scales and rater experience on essay rating processes. Although
raters were provided with a detailed description of the examinees as well as the test and its pur-
pose, some of the motivation and institutional norms that appear in rating for a real exam may
have been lacking (cf. Lumley, 2005), which may limit the generalizability of the findings to
real-test contexts. In addition, an experimental approach treats the rating process as an individ-
ual, cognitive process and isolates it from its social, institutional and political contexts.
Despite these limitations, the current study suggests several implications that can be tested in
other specific assessment contexts and points to several areas for further research. First, the
current qualitative data set could be explored further in several ways to shed more light on the
role of rater experience and the rating scale in mediating the rating process, how raters mentally
represent the rating scale and apply it, and whether and how these mental representations differ
across raters (cf. Lumley, 2002; Smith, 2000; Wolfe et al., 1998). Further analyses are being
conducted to examine whether raters employ different decision-making behaviors (e.g., judg-
ment vs. interpretation) and attend to different aspects of writing (e.g., language vs. rhetoric) at
different stages of the rating process and whether these processes vary across rating scales,
raters and rater groups. These additional analyses will also examine when and why raters refer to
the rating scale; how often and to what extent they use criteria and language from the rating
scale to decide, explain, or justify the scores they assign; and whether these behaviors vary
significantly across raters, rater groups, and rating scales.
Second, the current study could be replicated with raters from different linguistic, cultural
and professional backgrounds, with different writing tasks, with different rating scales, and in
different assessment systems and contexts. For instance, the current study included rating scales
that are identical in terms of evaluation criteria, wording, and number of score levels. Future
studies could compare rater performance across rating scales that vary in terms of wording,
focus and number of rating criteria, and number of score levels. In addition, with the growing
interest in alternative approaches to assessment, it is worthwhile exploring the performance of dif-
ferent rating scales in the context of such assessments.
Third, there is a need for longitudinal studies to investigate how and why rater performance
changes over time, not only in terms of severity and self-consistency (cf. Lumley & McNamara,
1995) but also in terms of rating beliefs, processes, and criteria. Such research can use qualitative
methods to investigate how rating expertise develops over time, how this process varies across indi-
viduals and contexts, how raters are socialized into new institutional or assessment contexts, and how
and why raters appropriate (or not) the rating values and approaches in new assessment contexts.
Finally, the current study examined the role of one contextual factor, the rating scale, in
variability in rater performance. Future studies need to examine other contextual factors that may
influence essay rating processes and outcomes, including the broader sociocultural, institutional,
and political contexts within which ESL essay rating occurs. As Torrance (1998) argued, essay rat-
ing is a “socially situated process” with a “social meaning” and “social consequences.” The social
context within which the assessment occurs is central because it provides meaning and purpose for
the rating and shapes the processes and outcomes of this activity (Lumley, 2005). Rating scales are
one product and expression of the assessment beliefs, values, and practices of the institutional con-
text within which they are developed and used. Other expressions of these beliefs and values
include rater training practices and writing tasks, genres, and practices valued. Research on writing
assessment contexts, however, has been scarce (Barkaoui, 2007a; Weigle, 2002). Such research
will require focusing on raters making real judgments in specific courses, programs, or institutions
and using naturalistic, ethnographic approaches (e.g., observation of ratings, discussion of scores,
interviews) as well as score analyses (cf. Broad, 2003; Davison, 2004). Such research can significantly enhance our understanding of how the broader sociocultural and institutional contexts within which essay rating occurs contribute to variability in rater performance and essay scores.
ACKNOWLEDGMENTS
This research was partially supported by a grant from Educational Testing Service (TOEFL Small Grant for Doctoral Research in Second or Foreign Language Assessment, 2006). An earlier version of this paper was presented at the AAAL conference, March 2009, Denver, CO. I would like to thank the raters who participated in this study and Alister Cumming, Merrill Swain, Liz Hamp-Lyons, and two anonymous Language Assessment Quarterly reviewers for their comments on earlier versions of this article.
REFERENCES
Bacha, N. (2001). Writing evaluation: What can analytic versus holistic essay scoring tell us? System, 29, 371–383.
Barkaoui, K. (2007a). Participants, texts, and processes in second language writing assessment: A narrative review of
the literature. The Canadian Modern Language Review, 64, 97–132.
Barkaoui, K. (2007b). Rating scale impact on EFL essay marking: A mixed-method study. Assessing Writing, 12, 86–107.
Barkaoui, K. (2008). Effects of scoring method and rater experience on ESL essay rating processes and outcomes. Unpublished doctoral dissertation, University of Toronto, Toronto, Canada.
Barkaoui, K. (forthcoming). Think-aloud protocols in research on essay rating: An empirical study of their veridicality
and reactivity. Language Testing.
Broad, B. (2003). What we really value: Rubrics in teaching and assessing writing. Logan: Utah State University Press.
Cumming, A. (1990). Expertise in evaluating second language compositions. Language Testing, 7, 31–51.
Cumming, A., Kantor, R., & Powers, D. (2001). Scoring TOEFL essays and TOEFL 2000 prototype writing tasks: An investigation into raters' decision making and development of a preliminary analytic framework (TOEFL Monograph Series No. 22). Princeton, NJ: Educational Testing Service.
Cumming, A., Kantor, R., & Powers, D. (2002). Decision making while rating ESL/EFL writing tasks: A descriptive
framework. Modern Language Journal, 86, 67–96.
Davison, C. (2004). The contradictory culture of teacher-based assessment: ESL teacher assessment practices in Australia
and Hong Kong secondary schools. Language Testing, 21, 305–334.
Delaruelle, S. (1997). Text type and rater decision-making in the writing module. In G. Brindley & G. Wigglesworth
(Eds.), Access: Issues in English language test design and delivery (pp. 215–242). Sydney, Australia: National
Center for English Language Teaching and Research, Macquarie University.
Erdosy, M. U. (2004). Exploring variability in judging writing ability in a second language: A study of four experienced
raters of ESL compositions (TOEFL Research Report No. RR-03-17). Princeton, NJ: Educational Testing Service.
Freedman, S. W., & Calfee, R. C. (1983). Holistic assessment of writing: Experimental design and cognitive theory. In P.
Mosenthal, L. Tamor, & S. A. Walmsley (Eds.), Research on writing: Principles and methods (pp. 75–98). New York:
Longman.
Goulden, N. R. (1992). Theory and vocabulary for communication assessments. Communication Education, 41, 258–269.
Goulden, N. R. (1994). Relationship of analytic and holistic methods to raters' scores for speeches. The Journal of Research and Development in Education, 27, 73–82.
Hamp-Lyons, L. (1991). Scoring procedures for ESL contexts. In L. Hamp-Lyons (Ed.), Assessing second language
writing in academic contexts (pp. 241–276). Norwood, NJ: Ablex.
Hamp-Lyons, L., & Kroll, B. (1997). TOEFL 2000-writing: Composition, community and assessment (TOEFL Monograph Series No. 5). Princeton, NJ: Educational Testing Service.
Homburg, T. J. (1984). Holistic evaluation of ESL compositions: Can it be validated objectively? TESOL Quarterly, 18, 87–107.
Huot, B. A. (1993). The influence of holistic scoring procedures on reading and rating student essays. In M. M. Williamson & B. A. Huot (Eds.), Validating holistic scoring for writing assessment: Theoretical and empirical foundations (pp. 206–236). Cresskill, NJ: Hampton.
Lumley, T. (2005). Assessing second language writing: The rater’s perspective. New York: Peter Lang.
Lumley, T., & McNamara, T. F. (1995). Rater characteristics and rater bias: Implications for training. Language Testing,
12, 54–71.
Milanovic, M., Saville, N., & Shuhong, S. (1996). A study of the decision-making behaviour of composition markers. In M. Milanovic & N. Saville (Eds.), Performance testing, cognition and assessment: Selected papers from the 15th Language Testing Colloquium (LTRC), Cambridge and Arnhem (pp. 92–114). Cambridge, UK: Cambridge University Press.
Noldus Information Technology b.v. (2003). Observer (Version 5.0). [Computer software]. Wageningen, the Netherlands:
Author.
O’Loughlin, K. (1994). The assessment of writing by English and ESL teachers. Australian Review of Applied Linguistics,
17, 23–44.
Ruth, L., & Murphy, S. (1988). Designing writing tasks for the assessment of writing. Norwood, NJ: Ablex.
Sakyi, A. A. (2003). A study of the holistic scoring behaviors of experienced and novice ESL instructors. Unpublished doctoral dissertation, University of Toronto, Toronto, Canada.
Schoonen, R. (2005). Generalizability of writing scores: An application of structural equation modeling. Language Testing, 22, 1–30.
Schoonen, R., Vergeer, M., & Eiting, M. (1997). The assessment of writing ability: Expert readers versus lay readers.
Language Testing, 14, 157–184.
Smith, D. (2000). Rater judgments in the direct assessment of competency-based second language writing ability. In
G. Brindley (Ed.), Studies in immigrant English language assessment, Volume 1 (pp. 159–189). Sydney, Australia:
Macquarie University.
Song, C. B., & Caruso, I. (1996). Do English and ESL faculty differ in evaluating the essays of native English-speaking
and ESL students? Journal of Second Language Writing, 5, 163–182.
Torrance, H. (1998). Learning from research in assessment: A response to writing assessment-raters' elaboration of the rating task. Assessing Writing, 5, 31–37.
Vaughan, C. (1991). Holistic assessment: What goes on in the rater’s mind? In L. Hamp-Lyons (Ed.), Assessing second
language writing in academic contexts (pp. 111–125). Norwood, NJ: Ablex.
Weigle, S. C. (1994). Effects of training on raters of ESL compositions. Language Testing, 11, 197–223.
Weigle, S. C. (1998). Using FACETS to model rater training effects. Language Testing, 15, 263–287.
Weigle, S. C. (1999). Investigating rater/prompt interactions in writing assessment: Quantitative and qualitative
approaches. Assessing Writing, 6, 145–178.
Weigle, S. C. (2002). Assessing writing. Cambridge, UK: Cambridge University Press.
Wolfe, E. W. (2006). Uncovering rater’s cognitive processing and focus using think-aloud protocols. Journal of Writing
Assessment, 2, 37–56.
Wolfe, E. W., Kao, C., & Ranney, M. (1998). Cognitive differences in proficient and non-proficient essay scorers.
Written Communication, 15, 465–492.
APPENDIX
TABLE A1
Coding Scheme (Adapted From Cumming et al., 2002) With Examples
Code Examples^a
1. Self-monitoring focus
1.1. Interpretation Strategies
Read or interpret prompt OK, this is an opinion piece and they have to agree or disagree with the
statement it’s more important for students to study history and literature than it is
for them to study math and science OK, alright [reads essay 176] (E20, A, 176)
Read or reread essay OK, moving on, number 108, this has five paragraphs, there is a little more body to
it . . . reading, I DISAGREE WITH THE STATEMENT BELOW (E73, H, 108)
Envision personal situation and
viewpoint of writer
ANCIENT TIMES it sounds like somebody who’s learned maybe through speaking and
he’s. . . he’s not aware about the endings inform instead of informed (Rater 75, A, 125)
Scan whole composition 184, OK, this is one is a lot shorter and it’s. . . for some reason it is divided into four
separate paragraphs even though each paragraph seems to have one or two
sentences so, OK, so I’m gonna read it now GENERALLY IT IS SAID THAT
(N28, H, 184)
Refer to, read, or interpret rating scale [Reads level 6 for LAP] there is limited ability . . . so, that's the thing is I question, what does it mean 'limited ability to manipulate the linguistic system appropriately' . . . hmm . . . that bothers me, I don't understand what that means, there is limited ability but this intrudes only occasionally, . . . I'm gonna look at number seven [reads level 7 for LAP] (N4, A, 184)
1.2. Judgment Strategies
Decide on macrostrategy for
reading and rating
I’m going to read the entire thing through before I make any decisions on how to
rate the paper [reads essay silently] (N9, H, 108)
Consider own personal response
or biases
Ok, I’ll give it a five, hmm . . . I’m such a generous person, OK, we’ll go on to the
next essay (N20, A, 180)
Define or revise rating criteria OK, this is what I will do. Because it’s short, I am going to . . . give this writer a
lower mark on organization, I think. Yeah (N24, A, 184)
Compare with other
compositions or ‘anchors’
THE FRIEND EASILY I think this is the most. . . this is the worst essay I’ve read
today ALSO THE CHILD (N6, A, 276)
Summarize, distinguish, or tally
judgments collectively
That was a pretty good essay, well, pretty good essay, better than the other ones I
had. OK, I’m finished (E7, A, 108)
Articulate general impression THE UPPER STATEMENT . . . interesting, short and sweet (E23, A, 184)
Articulate, justify, or revise
scoring decision
[Reads scale] APPLES, I would . . . I’m gonna change my mind, I think it is
irrelevant hmm I’m gonna change for argumentation [from 4] to a three.
Linguistic Accuracy [reads scale] (E12, A, 180)
Compare scores across rating
categories
I think that linguistic accuracy is . . . is . . . a higher rating than linguistic
appropriacy [reads essay] (E2, A, 223)
Monitor for personal biases; consider harshness/leniency of judgment/score This is not a good essay, but I'm not gonna judge it yet, I'm gonna wait and see till the end, but so far I notice paragraphing is a problem, ideas aren't clear, so I'll continue though IN THE WORLD TODAY (E21, H, 261)
2. Rhetorical and ideational focus
2.1. Interpretation Strategies
Interpret ambiguous or unclear
phrases
TO ONE SELF OK, so I think this person is trying to say that hmm . . . although
literature and history may be interesting, they really don’t affect our day to day life
KNOWLEDGE OF THESE SUBJECTS (E19, A, 145)
Discern rhetorical structure LEARN MORE ABOUT Oh, OK, I see, so the second paragraph here is really part of the introduction and they're getting into specifics in paragraph three HISTORY IS REALLY IMPORTANT (E2, H, 108)
Summarize ideas/ propositions Second paragraph talks about health (E24, H, 217)
2.2. Judgment Strategies
Consider writer’s use and/or
understanding of prompt
OK, looks like this person misunderstood the question THIS FOUR COURSES ′
(E7, A, 125)
Assess reasoning, logic or topic
development
PARTICIPATE IN SPORTS so, it’s sort of like a circular argument OK,
CHILDREN WHO SPEND (N4, H, 261)
Assess task completion Well, this person didn’t really address the prompt. it clearly says, do you agree or
disagree with the following statement, this person did not take a position or stand on
the prompt, so . . .’ (E7, H, 108)
Assess communicative
effectiveness or quality
WHICH ONE IS GOOD OR BAD losing it again here, losing the ability to
communicate OF COURSE BAD EVENTS (E28, H, 134)
Assess relevance SOCCER OR TENNIS, it seems to be off-topic . . . hmmm . . . oh, well,
disadvantage, I guess, IN THE WORLD (E1, H, 261)
Assess coherence TO DRAW SOMETHING OK, totally incoherent, totally off topic NOW LET’S
GO BACK (E2, H, 224)
Assess interest, originality,
creativity or sophistication
LITERATURE IS IMPORTANT [laughs] that’s a novel way to say this THE
SIDED STUDY (N11, A, 184)
Identify redundancies IN THE PAST now they’re starting to get repetitious, the writer just said this in the
previous paragraph PEOPLE MAKE THEIR (E52, H, 134)
Assess text organization For organization, it is organized very well from what I see, . . . I can . . . I’m writing
down eight for organization, let me see, the argument, OK (N16, A, 108)
Assess style, register, discourse
functions or genre
LEARN MUCH ABOUT HISTORY so I notice there is a shift in tone here; it’s a
little more academic before, but not any more, I THINK WHEN STUDENTS (E21,
A, 134)
Rate ideas and/or rhetoric [Reads paragraph 3] again it’s very biased and opinionated paragraph and I’m
reading number four [reads paragraph 4] (N9, A, 145)
3. Language focus
3.1. Interpretation Strategies
Observe layout IN HIS STUDENTS LIFE and I’ve just noticed that everything is written in caps
which is very annoying to read, but . . . well, I’ll try AS A STUDENT (E21, H, 145)
Classify errors into types STURDY BODY article mistakes, s morphemes mistakes HOWEVER THERE
ARE BAD POINTS (E28, H, 276)
Edit or interpret ambiguous or
unclear phrases
BOTH HELPS THE I crossed out the s because . . . STUDENT TO GROW (N6, A,
108)
3.2. Judgment Strategies
Assess quantity of total written
production
Again, I don’t think they are ready for university level writing, but they produced a
fair amount of text (E2, A, 223)
Consider gravity of errors I am thinking there is a problem a big problem right now with vocabulary use,
moving on to paragraph two IF STUDENTS ONLY STUDY (E2, H, 184)
Consider error frequency ALSO THEIR MINDS you know what, this is pretty darn good. hmm lots of
grammar mistakes though . . . and spelling . . . and stuff like that hmmm. (E20, A, 217)
Assess fluency so this one did not address the prompt, it has sort of some some fluency, but a lot
of problems; there is . . . it has good and bad sections in it (E7, H, 108)
Consider (command of) lexis SHOULD BE NIPPED interesting choice of words IN THE BUD (E19, H, 246)
Consider (command of) syntax
or morphology
SUCCESSFULLY IN THEIR LIFES a lot of word form errors here HISTORY IS
IMPORTANT (E23, A, 125)
Consider (command of) spelling
or punctuation
FOR INSTANCE, Ok, so I can see a few problems with the punctuation here
THESE CLASSES (N24, H, 176)
Rate language overall IGNORE DELICATE EMOTIONS mostly, this student has a good use of language, even though the argument is scarcely developed IN CONCLUSION THE PARALLEL STUDY (E16, A, 184)
Note. Transcription conventions: CAPITAL LETTERS = text read directly from essay; Italics = text read from the prompt; Underlined = text read directly from the rating scale; [ ] = procedural and other behaviors.
^a Each quote is followed by rater code (N = novice, E = experienced), rating scale (A = Analytic, H = Holistic), and essay code (101–190 = essay on study topic, 201–290 = essay on sports topic).