
The Subjective and Objective Interface of
Bias Detection on Language Tests
Steven J. Ross and Junko Okabe
Kwansei Gakuin University
Kobe-Sanda, Japan
Test validity is predicated on there being a lack of bias in tasks, items, or test content.
It is well known that factors such as test candidates' mother tongue, life experiences,
and socialization practices of the wider community may serve to inject subtle interactions
between individuals' backgrounds and the test content. When the gender of the
test candidate interacts further with these factors, the potential for item bias to influence
test performances grows. A dilemma faced by test designers concerns how they
can proactively screen test content for possible sources of bias. Conventional practices
in many contexts rely on the subjective opinion of review panels in detecting
sensitive topical content and potentially biased material and items. In the last 2 decades
this practice has been rivaled by the increased availability of item bias diagnostic
software. Few studies have compared the relative accuracy and cost utility of the
two approaches in the domain of language assessment. This study makes just that
comparison. A 4-passage, 20-item reading comprehension test was given to a stratified
sample of 825 high school students and college undergraduates at 5 Japanese institutions.
The sampling included a focus group of 468 female students compared to a
reference group of 357 male English as a foreign language (EFL) learners. The test
passages and items were also given to a panel of 97 in-service and preservice EFL
teachers for subjective ratings of potential gender bias. The results of the actual item
responses were then empirically checked for evidence of differential item functioning
using Simultaneous Item Bias analysis, the Mantel-Haenszel Delta method, and
logistic regression. Concordance analyses of the subjective and objective methods
suggest that subjective screening of bias overestimates the extent of actual item bias.
Implications for cost-effective approaches to item bias detection are discussed.


The issue of test bias has always been central in the consideration of test validity. Bias
has been of concern because inferences based on test outcomes often lead
to consequences affecting the life-course trajectories of test candidates, such as in
the use of tests for employment, admissions, or professional certification. Test results
may be considered unambiguously fair to the extent that candidates are compared,
as in the case of norm-referenced tests, on only the domain-relevant constructs included
in the measurement instrument devised for the purpose. In the real world of
testing practice, uncontaminated construct-relevant domain coverage is often more
an ideal than a reality. This is especially true when the testing construct involves domains
of knowledge or ability related to language learning.
ISSUES IN SECOND LANGUAGE ASSESSMENT BIAS
Language learning, particularly second or foreign language learning, is influenced
to no small degree by factors that interact with, and that are sometimes even independent
of, the direct consequences of formal classroom-based achievement. Yet
in many high stakes contexts, foreign or second language ability is used as a
gate-keeping criterion for employment and admissions decisions. Further, inclusion
of foreign language ability on selection tests is often predicated on the assumption
that candidates' relative standing reflects the cumulative effects of
achievement propelled by long-term commitment to diligent scholarship. These
assumptions do not often factor in the possibly biasing influences of cross-linguistic
transfer and naturalistic acquisition on individual differences in test outcomes.
Constructing high stakes measures to be free of these kinds of bias presents a challenging
task to language test designers, particularly when the implicit meritocratic
intention is to reward scholastic achievement.
Studies of bias on language tests have tended to fall into the three broad categories
of transfer, experience, and socialization practices. The first, which accounts for the
influence of transfer from a first-learned language to a second or foreign language,
addresses the extent of bias occurring when speakers of different first languages are
tested on a common second language. Chen and Henning (1985), for instance, noted
the transferability of Latin cognates from Spanish to English lexical recognition,
which served to bias native speakers of Spanish over native speakers of Chinese.
Working in the same vein, Sasaki (1991) corroborated Chen and Henning using a
different DIF detection method. Both of these studies suggested that when novel words
are encountered by Spanish and Chinese speakers, the cognitive task of lexical inference
differs. For instance, consider the following sample sentence:
Residents evacuated their homes during the conflagration.
For Romance language speakers, the deductive task is to parse "conflagration" for
its affixation and locate the core free morpheme. Once located, the Romance language
speaker can compare the root to similar known free morphemes in the
reader's native language, for instance, incendio or conflagración.
The Chinese speaker, in contrast, starts at the same deductive step, but must
compare the free root morpheme to all other previously learned morphemes (i.e.,
most probably, “flag”). The resulting difference leads Spanish speakers to follow a
semantically based second step, while Chinese speakers are likely to split between
a semantic and phonetic comparison strategy. The item response accuracy in such
cases favors the Romance language speakers, even when matched with Chinese
counterparts for overall proficiency.

The transferability factor applies to orthographic phenomena as well. Brown
and Iwashita (1996) detected bias favoring Chinese learners of Japanese over native
English speakers, whose native language orthography is typologically most
distant from Japanese. Given that modern written Japanese relies on Chinese
character compounds for the formation of nominal phrases, as well as the root
forms of many verbs, Chinese students of Japanese can transfer their knowledge of
semantic roots for many Japanese words and compounds, even without knowledge
of their corresponding phonemic representations or exact semantic reference.
Here a similar strategic difference emerges for speakers of Chinese versus
speakers of an Indo-European language. While the exact compound might not exist
in modern written Chinese, the component Chinese characters provide a deductive
strategy to Chinese learners of Japanese that is not available to English speakers.
Consider, for instance, the compound 新幹線 (bullet train), which does not have a
direct counterpart in Chinese. The component characters 新 "new," 幹 "trunk," and
線 "line" provide the basis for a lexical inference that the compound refers to a kind
of rail transportation system. For an English-speaking learner of Japanese, the
cognitive load falls on deducing the meaning of the whole compound from its
components. Here, a mixed grapheme-to-phoneme strategy is most likely if 新
"new" and 線 "line" are recognized as "shin" and "sen." The lexical inference here
might entail filling in the missing component 幹 "trunk" with a syllable that matches
the surrounding "shin___sen" for successful compound word recognition.
Examining transferability on a macrolevel, Ross (2000), while controlling for
biographical and experiential factors such as age, educational background, and
hours of ESL learning, found weaker evidence of a language distance factor. The
distance factor comprised canonical syntactic structure, orthography, and
typological grouping, which served to influence the relative rates of learning
English by 72 different groups of migrants to Australia.
The overall picture of transfer bias suggests that on the microlevel, particularly
in studies that triangulate two different native languages against a target language,
evidence of transfer bias tends to be identifiable. When many languages are compared
and individual differences in experiential and cognitive variables are factored
in, transfer bias at the macro or language-typological level appears to be less
readily identifiable.
A second type of bias in language assessment arises from differential exposure to
a target language that candidates might experience. Ryan and Bachman (1992), for
instance, considered Test of English as a Foreign Language (TOEFL) type items to be
more culturally oriented toward the North American context than a British comparison,
the First Certificate in English. Language learners with exposure to instruction
in American English and TOEFL test preparation courses were thought to have a
greater chance on such items than learners whose exposure did not prepare them for
the cultural framework TOEFL samples in its reading and listening items. Their
findings suggest that high stakes language tests for admissions such as TOEFL may
indirectly include knowledge of cultural reference in addition to the core linguistic
constructs considered to be the object of measurement. Presumably this phenomenon
would be observable on language tests such as the International English Language
Testing System (IELTS), which is designed to qualify candidates for admission
to universities in the United Kingdom, New Zealand, or Australia.
Cultural background comparisons in second language performance assessments
have demonstrated how speech community norms may transfer into assessment processes
like oral proficiency interviews. While not overtly recognized as a source of
assessment bias, interlanguage pragmatic transfer has been seen to influence the performances
of Asian speakers when compared to European speakers (Young, 1995;
Young & Halleck, 1998; Young & Milanovic, 1992). The implication is that if assessments
are norm-referenced, speakers from discourse communities favoring verbosity
may be advantaged in assessments such as interactive interviews. This observation
apparently extends to semi-direct speech tasks such as the SPEAK test. Kim
(2001), for instance, found differential rating functions for pronunciation and grammar
ratings for Asians when compared to equal-ability European test candidates. The
implication here is that raters apply the rating scale differently.
In considering possible sources of bias in university admissions, Zwick and
Sklar (2003) opined that the foreign language component on the SAT II created a
"bilingual advantage" for particular candidates for admission to the University of
California. If candidates had been raised in bilingual households, for instance, they
would be expected to score higher on the foreign language listening comprehension
component, which is an optional third subscore on the SAT II. This test is required
for undergraduate admissions to all campuses of the University of California.
The issue of bias in this case stems from the assumption that the foreign
language component was presumably conceptualized as an achievement indicator,
when in fact the highest scoring candidates are from bilingual households. The
perceived advantage is that such candidates develop their proficiency not through
coursework and scholarship, but through naturalistic exposure.
Elder (1997) reported on a similar fairness issue arising from the use of second
language tests for access to higher education in Australia. Elder noted that the
score weighting policy on the Victoria Certificate of Education, functioning as it
does as a qualification for university admission in that state, explicitly profiles the
language learning history of the test candidate. This form of candidate profiling
aimed to reweight the influence of the foreign language scores on the admissions
qualification so as to minimize the preferential bias bilingual candidates enjoyed
over conventional foreign language learners. Elder found that interactions between
English and the profile categorizations were not symmetric across different foreign
language test candidatures and concluded that efforts to adjust for differential
exposure profiles are fraught with difficulty.
A third category of bias in language assessment deals with differences in socialization
patterns. Socialization patterns might involve academic tracking early in a
school student's educational career, usually into either science or humanities academic
tracks in high school (Pae, 2004). In some cultural contexts, academic tracking
might correspond to gender socialization practices as well.
In contrast to cultural assumptions made about the verbal advantage females
have over males, Hyde and Linn (1988) concluded in a meta-analysis of 165 studies
of gender differences on all facets of verbal tests that there was an effect size of only
d = .11 for gender differences. To them, this constituted little firm evidence to support
the assumed female verbal advantage. Willingham and Cole (1997) and
Zwick (2002) concur with this interpretation, noting that gender differences have
steadily diminished over the last four decades and now account for no more than
1% of the total variation on ability tests in general. Willingham and Cole (1997, p.
348), however, noted that females tend to frequent the top 10% in standardized tests
of reading and writing.
Surveys of gender differences on the Advanced Placement Test, used for
admissions to the more selective American universities, suggest reasons
why verbal differences in literacy still tend to persist. Dwyer and Johnson (1997, p.
136) describe considerable effect size differences between college-bound males
and females in preference for language studies. This finding suggests that in
the North American context, socialization patterns could serve to channel high
school students into academic tracks that tend to correlate with gender.
To date, language socialization issues have not been central in foreign or second
language test bias analyses in multicultural contexts because of the more immediate
and salient influences of exposure and transfer on high stakes tests. In contexts
that are not characterized by multiculturalism, a more subtle threat of bias may be
related to how socialization practices steer males and females into different academic
domains, and in doing so cumulatively serve to make gender in particular
knowledge domains differentially salient. When language tests inadvertently sample
particular domains more than others, the issue of schematic knowledge interacting
with the gender of the test candidate takes on a new level of importance.
In a study of differential item functioning (DIF) on a foreign language vocabulary test
for Finnish secondary students, Takala and Kaftandjieva (2000) found that individual
vocabulary items showed domain-sampling effects, whereas the total score on the
test did not reflect systematic gender bias. Their study identified how words sampled
from male activity domains such as mechanics and sports might yield higher scores
for male test candidates than for females at the same ability level. Their approach
used conventional statistical analyses of DIF, which, according to some current standards
of test practice, would serve to identify and eliminate biased items before test
scores are interpreted (American Educational Research Association, American Psychological
Association, & National Council on Measurement in Education, 1999).
With such practices for bias-free testing, faulty items would be screened through
sensitivity review and content moderation prior to test administration, and then subjected
to DIF analyses before the final score tally.
The issue of interest we address in this article is how gender bias on foreign language
tests devised for high stakes purposes can be diagnosed when accepted cultural
practices disfavor the use of empirical analysis of item functioning prior to
score interpretation. In this study we address the issue of the accuracy of sensitivity
review and bias screening through content moderation prior to test administration
by comparing the judgments of both expert and novice moderation groups with the
results of three different empirical approaches to DIF.
BACKGROUND TO THE STUDY
Four sample subtests written for a high stakes university admissions test were used in
the study. The subtests were all from the fourth section of a six-section English as a
foreign language (EFL) test given annually to approximately 630,000 Japanese high
school seniors. The results of the exam are norm-referenced and serve to qualify candidates
for secondary examinations to specific academic departments at national and
public universities (Ingulsrud, 1994). Increasingly, private Japanese universities use
the results of the Center examination for admissions decisions, making the test the
most influential gate-keeping device in the Japanese educational system.
The format of the EFL test is a "discrete point" type of test of language structure
and vocabulary, sampling the high school syllabus mandated by the Japanese Ministry
of Education. It is construed as an achievement test because only vocabulary
and grammatical structures occurring in about 40 high school textbooks sanctioned
by the Ministry of Education are sampled on the test. The six sections of the
examination cover knowledge of segmental pronunciation, tonic word stress, discrete-point
grammar, word order, paragraph coherence and cohesion, interpretation
of short texts describing graphics and data in tabular format, interactive
dialogic discourse in the form of a transcribed conversation, and comprehension of
a 400-word reading comprehension passage. All items, usually 50 in all, are in
multiple-choice format to facilitate machine scoring.
The test is constructed by a committee of 20 examiners who convene 40 days
each year to draft, moderate, and revise the examination before its administration
in January each year. On several occasions during the test construction period the
draft passages and items are sent out to an external moderation panel for sensitivity
and bias review. The external moderation panel, whose membership is not known
to the test committee members, is composed of former committee members and
examination committee chairpersons. Their task is to critique the draft passages
and items and to recommend changes, large and small. On occasion the moderation
panel recommends substitution of entire draft test sections. This usually occurs
when issues of test sensitivity or bias are raised. The criteria for sensitivity are
themselves highly subjective and variable across moderation panels. For some, test
content should involve “heart-warming” topics that avoid dark or pessimistic
themes. For others, avoiding references to specific social or ethnic groups may be
the most important criterion.

The four passages included in the study were originally drafted for the fourth section
of the EFL language examination. The specifications for the fourth section call
for three or four paragraphs describing charts, figures, or tabular data concerning hypothetical
experimental or survey data in a social science domain. This section of the
test is known to be the most domain-sensitive, because the content sampling usually
sits at the borderline where male–female differences in experiential schemata begin
to emerge in the population.
The four passages were never used in the operational test, but were held in reserve
as alternates. All four had at various stages of development undergone external
review by the moderation panel and were found to be possibly too gender-sensitive,
thus ending further investment of committee time in their revision.
The operational test is not screened with DIF statistics prior to score interpretation.
The current test policy endorsed by the Japanese testing community is predicated
on the assumption that the moderation panel reviews are sufficiently accurate
in detecting faulty, insensitive, or biased items before any are used on the operational
test. The research issue addressed here thus considers empirical evidence of
the accuracy of the subjective approach currently used, and directly examines
whether subjective interpretations of gender bias in fact concur with objective
analyses using empirical methods common to DIF analysis.
METHOD
The four-passage, 20-item reading comprehension test was given to a stratified
sample of 825 high school students and college undergraduates at five institutions.
The sampling included a focus group of 468 female students compared to a reference
group of 357 male EFL learners. The aim of the sampling was to approximate
the range of scores normally observed in the population of Japanese high school
seniors. The 20-item test was given in multiple-choice format with enough time (1
hr) for completion, and was followed with a survey about the age, gender, and
language learning experiences of the sample test candidates.
Materials
The test section specifications call for a three to four paragraph text describing
graphs, figures, or tables written as specimens of social science types of academic
writing. In the case of the experimental test, four of these passages were used. Each
of the passages had five items that tested readers’ comprehension of the passage
content. The themes sampled on the test can be seen in Table 1.
The experimental test comprised four short reading passages, which closely
approximate the format and content of Section Four of the Center Examination.
The sampling of students in this study yielded a mean and variance similar to the
operational test. Table 2 lists descriptive statistics for the test.
Bias Survey Procedure
A test bias survey was constructed for use by in-service and preservice EFL teachers.
The sampling of high school level teachers parallels the normal career path of
Japanese members of a typical moderation panel. The actual external moderation
panel comprises university faculty members, most of whom had followed a
career path starting with junior and senior high school EFL teaching.
TABLE 1
Experimental Passage Order and
Thematic Content
Passage Thematic Content
I Letter rotation experiment
II Visual illusions experiment
III Soccer league tournament
IV Survey of transportation use changes
TABLE 2
Mean, Standard Deviation, Internal Consistency, and Sample Size

M       SD     Reliability   Sample Size   Items
12.36   4.14   .780          825           20
The bias survey was thus devised to sample early-, mid-, and late-career EFL teachers
who were assumed to represent the larger population of language teaching professionals
from whom future test moderation panel members are drafted. In-service teachers
(n = 37) were surveyed individually.
In addition to the sampling of in-service teachers, a larger group of preservice
EFL teachers in training (n = 60) was also surveyed so as to compare the ratings provided
by seasoned professional teachers with those of neophyte teachers. All respondents
were asked to examine the four test passages and each of the 20 items on the
test before rating the likelihood that each item would favor male or female test candidates.
The preservice teachers in training completed the survey during Teaching
English as a Foreign Language (TEFL) Methodology course meetings.
The rating scale used and instructions are shown in the Appendix.
ANALYSES: OBJECTIVE DIFFERENTIAL ITEM
FUNCTIONING ANALYSIS
A variety of options now exist for detecting DIF. Comparative research suggests
that DIF methods tend to differ in the extent of Type I error and power. Whitmore
and Schumacker (1999), for instance, found logistic regression more accurate than
an analysis of variance approach. A direct comparison of logistic regression and
the Mantel-Haenszel procedure (Rogers & Swaminathan, 1993) indicated moderate
differences in power. Swanson, Clauser, Case, Nungester, and Featherman (2002)
more recently approached DIF with hierarchical logistic regression and found it to
be more accurate than standard logistic regression or Mantel-Haenszel estimates.
In this approach, different possible sources of item bias can be dummy-coded and

nested in the multilevel design. Recent uses of logistic regression for DIF extend to
polytomous rating categories (Lee, Breland, & Muraki, 2005) but still enable an
examination of nonuniform DIF through interaction terms between matching
scores and group membership.
Although multilevel modeling approaches offer extended opportunities for testing
nested sources of potential DIF, single-level methods, such as the logistic regression
and Mantel-Haenszel approaches, have tended to prevail in DIF studies.
Penfield (2001) compared three variants of Mantel-Haenszel according to differences
in the criterion significance level, and concluded that the generalized approach
provided the lowest error and most power. Zwick and Thayer (2002) found
that modifications of the Mantel-Haenszel procedure involving an empirical Bayes
approach showed promise of greater potential for bias detection. A direct comparison
of the Mantel-Haenszel procedure with Simultaneous Item Bias (SIB;
Narayanan & Swaminathan, 1994) concluded that the Mantel-Haenszel procedure
yielded smaller Type I error rates relative to SIB.
In this study, three empirical methods of detecting DIF were used. The choice
of bias detection methods was based on their overall frequency of use in empirical
DIF studies. The three methods were thought to represent conventional approaches
to DIF research, and thus to best operationalize the "objective" approaches to be
compared with subjective methods.
Mantel-Haenszel Delta was computed from six sets of equipercentile-matched
ability subgroups cross-tabulated by gender. Differences in the observed Deltas for
the matched males and females were evaluated against a chi-square distribution.
This method matches males and females along the latent ability continuum and detects
improbable discontinuities between the expected percentage of success and
the observed data.
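To make the procedure concrete, the following is a minimal sketch of a Mantel-Haenszel screen for a single dichotomous item, assuming simulated data: the six score bands stand in for the equipercentile-matched ability subgroups described above, and the pooled odds ratio is rescaled to the ETS Delta metric. The data, cut points, and variable names are illustrative, not taken from the study.

```python
# A minimal sketch of Mantel-Haenszel DIF screening for one item.
# All data are simulated; the score bands stand in for the paper's six
# equipercentile-matched ability subgroups.
import numpy as np
from statsmodels.stats.contingency_tables import StratifiedTable

rng = np.random.default_rng(7)
n = 825
gender = rng.integers(0, 2, n)              # 0 = female (focal), 1 = male (reference)
total = rng.integers(0, 21, n)              # total score on the 20-item test
item = rng.binomial(1, 0.2 + 0.03 * total)  # response tied to ability, no true DIF

bands = np.digitize(total, bins=[4, 7, 10, 13, 16])  # six ability strata

tables = []
for b in np.unique(bands):
    in_band = bands == b
    # One 2x2 table per stratum: rows = gender, columns = incorrect/correct.
    t = np.array([[np.sum((gender == g) & (item == c) & in_band)
                   for c in (0, 1)] for g in (0, 1)])
    if t.sum(axis=1).min() > 0:             # skip strata missing a gender group
        tables.append(t)

st = StratifiedTable(tables)
delta_mh = -2.35 * np.log(st.oddsratio_pooled)   # ETS Delta metric
result = st.test_null_odds(correction=True)      # chi-square test of no DIF
print(f"MH Delta = {delta_mh:.3f}, p = {result.pvalue:.3f}")
```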
The second method of detecting bias was a logistic regression performed on the dichotomously
scored outcomes for each of the 20 items. The baseline model tested the
effects of gender controlling for each student's total score (Camilli & Shepard, 1994).
In this binary regression, the probability of success should be solely influenced by the
individual's overall ability. In the event of no bias, only the test score will account for
systematic covariance with the responses on a particular item. If bias does affect
a particular item, the variable encoding gender will covary with the item response
independently of the covariance between the score and the outcome. Further, if bias is
associated with particular levels of ability on the latent score continuum, a nonuniform
DIF can be diagnosed with a Gender × Total Score interaction term:

Item response = constant + gender + score + (gender × score)

In the event a nonuniform DIF is confirmed not to exist, the interaction term can
be deleted to yield a main effect for gender, controlling for test score. Gender effects
are then tested for nonrandomness against a t distribution.
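A minimal sketch of this two-step logistic regression screen for one item, again with simulated placeholder data rather than the study's responses, follows:

```python
# A minimal sketch of the logistic regression DIF screen for a single item.
# Data are simulated placeholders, not the study's responses.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 825
df = pd.DataFrame({"gender": rng.integers(0, 2, n),
                   "score": rng.integers(0, 21, n)})
df["item"] = rng.binomial(1, 0.2 + 0.03 * df["score"].to_numpy())

# Full model: a significant gender x score term signals nonuniform DIF.
full = smf.logit("item ~ gender + score + gender:score", data=df).fit(disp=0)
if full.pvalues["gender:score"] < .05:
    print(f"nonuniform DIF, p = {full.pvalues['gender:score']:.3f}")
else:
    # No nonuniform DIF: drop the interaction and test the gender main
    # effect (uniform DIF), controlling for the total score.
    reduced = smf.logit("item ~ gender + score", data=df).fit(disp=0)
    print(f"uniform DIF test, p = {reduced.pvalues['gender']:.3f}")
```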
The third empirical method was Simultaneous Item Bias (SIB), utilizing item response
theory (Shealy & Stout, 1993). The SIB approach was performed on each of the 20
items in turn. The sums of all the other items were used in rotation as ability estimates
in matching male and female examinees via a regression approach. This approach
employs the matching strategy of the Mantel-Haenszel method, and uses the total
score based on k − 1 items as a concurrent covariate for each of the item bias tests. Differences
in estimates of DIF were evaluated against a z distribution.
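The full Shealy-Stout procedure includes a regression correction for measurement error in the matching score. The simplified sketch below implements only the core matching-and-weighting logic and the z test, so it illustrates the idea rather than reproducing the SIBTEST estimator; the response matrix is a placeholder.

```python
# A simplified sketch of the SIB matching logic for one studied item:
# examinees are matched on the rest score (the k-1 other items), and the
# weighted reference-focal difference in proportion correct is referred to
# a z distribution. The Shealy-Stout regression correction is omitted.
import numpy as np

def sib_z(responses: np.ndarray, is_focal: np.ndarray, studied: int) -> float:
    """responses: n x k matrix of 0/1 scores; is_focal: boolean group flags."""
    item = responses[:, studied]
    rest = responses.sum(axis=1) - item        # matching score on k-1 items
    beta_hat, var_hat = 0.0, 0.0
    for s in np.unique(rest):
        foc = item[(rest == s) & is_focal]
        ref = item[(rest == s) & ~is_focal]
        if len(foc) < 2 or len(ref) < 2:
            continue                           # stratum too thin to match
        w = len(foc) / is_focal.sum()          # weight by focal-group share
        beta_hat += w * (ref.mean() - foc.mean())
        var_hat += w**2 * (ref.var(ddof=1) / len(ref)
                           + foc.var(ddof=1) / len(foc))
    return beta_hat / np.sqrt(var_hat)         # positive = favors reference

rng = np.random.default_rng(7)
resp = rng.binomial(1, 0.6, size=(825, 20))    # placeholder response matrix
focal = rng.integers(0, 2, 825).astype(bool)   # placeholder focal-group flags
print(f"z = {sib_z(resp, focal, studied=12):.2f}")  # item 13 is index 12
```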
The composite results of the three different approaches to estimating DIF for each
of the 20 items are given in Table 3. Each of the three objective measures employs a
different test statistic to assess the likelihood of the observed bias statistic. Analogous
to meta-analytic methods, the different effects can be assessed on standardized
metrics. To this end, each DIF estimate, controlled for overall candidate ability, is
presented as a conventional probability (p < .05) of rejecting the null hypothesis.
Table 3 indicates that the Mantel-Haenszel and SIB approaches are equally parsimonious
in detecting gender bias on the 20-item test. Both of these methods employ
ability matches of men and women along the latent ability continuum. In contrast,
the logistic regression approach, which uses the total score as a covariate,
appears slightly more likely to detect bias. All three methods concur in detecting
gender bias on the Soccer item 13 shown in Table 4.
ANALYSIS: SUBJECTIVE ESTIMATES OF BIAS
The panel of 97 preservice and in-service teachers was categorized into male and
female subgroups of novices and experienced teachers based on survey responses.
TABLE 3
Objective Bias Probabilities Per Item
Item MH Delta Logistic SIB
Letters1 0.98 0.142 0.791
Letters2 0.88 0.918 0.761
Letters3 0.2 0.901 0.133
Letters4 0.69 0.617 0.557
Letters5 0.96 0.981 0.768
Visuals6 0.39 0.029 0.686
Visuals7 0.17 0.292 0.199

Visuals8 0.36 0.178 0.281
Visuals9 0.71 0.357 0.974
Visuals10 0.24 0.106 0.361
Soccer11 0.97 0.659 0.96
Soccer12 0.87 0.776 0.806
Soccer13 0.001 0.036 0.001
Soccer14 0.37 0.7 0.414
Soccer15 0.47 0.456 0.583
Transprt16 0.61 0.099 0.827
Transprt17 0.39 0.48 0.31
Transprt18 0.29 0.048 0.539
Transprt19 0.48 0.071 0.48
Transprt20 0.84 0.529 0.605
Note. Items in bold represent significant probabilities, where p < .05. MH = Mantel–Haenszel;
SIB = simultaneous item bias.
TABLE 4
Biased Item No. 13 From the Soccer Passage
Item Description
13 If the Fighters defeat the Sharks by a score of 1–0 then:
1 The Lions will play the Sharks.
2 The Fighters will play the Bears.
3 The Sharks will play the Eagles.
4 The Fighters will play the Lions.
The aim of this subdivision of the subjective raters was to explore possible sources
of differential sensitivity to bias in the test questions. In contrast to the objective
methods of diagnosing item bias, the subjective ratings do not employ any information
about individual ability inferred from the total score. Subjective estimates
rely completely on the apparent schematic content and presumed world knowledge
needed to answer each item. Further, because ratings were on a Likert-type scale,
differences between the observed mean rating and the null hypothesis needed to be
tested to provide bias probabilities¹ comparable to those in Table 3. To this end, the
mean rating of gender bias on each of the 20 items was tested against the hypothesis
that the male versus female advantage on each item equaled zero. Table 5 contains
the subjectively estimated probabilities that each item is biased.

¹Subjective ratings were tested against the null hypothesis that the population mean bias (mu)
equals zero, with the observed subjective mean evaluated against a single-sample t distribution.
The exact probabilities of the observed t tests were then used in Table 5.
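A minimal sketch of the test described in the footnote, applied to one item's ratings from one hypothetical rater subgroup (negative codes indicating a perceived male advantage), follows:

```python
# A minimal sketch of testing one item's subjective bias ratings against a
# population mean of zero. The ratings are hypothetical placeholders.
import numpy as np
from scipy import stats

ratings = np.array([-1, 0, -2, -1, 0, -1, -1, 0, -2, -1])  # one rater subgroup
t_stat, p_value = stats.ttest_1samp(ratings, popmean=0.0)
print(f"t = {t_stat:.2f}, exact p = {p_value:.3f}")  # p values fill Table 5
```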
In contrast with the objective measures of bias, the subjective analysis diagnoses
considerably more bias. Complete subjective agreement in diagnosing bias occurs
in 6 of the 20 items. As the third column in Table 5 suggests, it appears that experienced
male (EM) teachers are the most inclined to assume there is gender bias.
This subgroup in fact sees bias in the majority of items. Experienced female teachers,
in contrast, are the most conservative in assuming that schematic content indicates
possible test item bias. The novice male teachers-in-training correspond to
their more experienced male counterparts in assuming there is a schematic bias in
two of the four test passages.
Of particular interest is the tendency of the subjective raters to apply the bias diagnosis
not to individual items, but to entire test passages. It appears likely that these male respondents
equate content sensitivity with test bias. Sources of this confusion will be examined
in the narrative accounts provided by some of the male teachers. The tendency to
see content schema as bias suggests that subjective raters see topical domains as the key
source of possible bias. Both male and female in-service teachers would be expected to
share equivalently accurate knowledge about the cumulative consequences of socialization
on Japanese teenagers' world knowledge. As Table 5 would suggest, however, the
experienced male teachers appear to overgeneralize the extent of possible schematic
knowledge differences between male and female students. The domains that appear to
be high bias risk to male teachers involve spatial processing (Visuals 6–9), all of the
items concerned with a sports tournament (Soccer 11–15), and all of the items about the
passage describing changes in transportation (Transport 16–20).

TABLE 5
Subjective Bias Probabilities Per Item

Item Label    Novice Males   Novice Females   Expert Males   Expert Females
Letters1      0.337          0.71             0.494          0.136
Letters2      0.165          0.42             0.666          0.104
Letters3      0.082          0.452            0.494          0.426
Letters4      0.999          0.194            0.33           0.263
Letters5      0.999          0.497            0.33           0.435
Visuals6      0.082          0.628            0.005          0.426
Visuals7      0.165          0.008            0.056          0.104
Visuals8      0.04           0.001            0.002          0.003
Visuals9      0.999          0.728            0.012          0.538
Visuals10     0.584          0.341            0.541          0.671
Soccer11      0.035          0.007            0.001          0.001
Soccer12      0.009          0.001            0.001          0.001
Soccer13      0.001          0.001            0.001          0.001
Soccer14      0.004          0.001            0.001          0.001
Soccer15      0.005          0.009            0.001          0.001
Transprt16    0.19           0.552            0.001          0.165
Transprt17    0.673          0.473            0.01           0.272
Transprt18    0.19           0.124            0.001          0.336
Transprt19    0.19           0.151            0.004          0.5
Transprt20    0.19           0.044            0.004          0.385

Note. Items in bold represent significant probabilities, where p < .05.
Subjective Accounts of Bias
As a way of accounting for the presumed bias in the test items, interviews with
three veteran male instructors were undertaken. These interviews provide a post
hoc reflective account of why bias would be expected in test items. The three accounts
provide subjective evidence as to the sources of the putative sensitivity or
bias in test items. Three facets of belief about gender differences were included in
the interview. The first was a global impression of the test materials. The second
was concerned with the four passages used in the actual reading test. The third
question in the interview phase addressed how each male teacher assumes his
colleagues are aware of gender differences among students.
Teacher A (mid-30s, male)
Overall impression and belief

"In general, I believe that boys are better than girls at understanding scientific
and logical essays. Having said so, I normally don't pay attention to
such gender differences. As for the actual English test materials, when compared
with Japanese materials, the topics are easier and the contents are less complicated.
Thus I tend to focus students on how to solve the tasks and get
higher scores rather than on the comprehension of the contents. In this
sense, rather than the topic, the form of the tasks may cause a different performance
between boys and girls; boys may do better in mathematical and
logical types of tasks. Anyway, girls do much better than boys as far as English
exams are concerned. So I don't think there is any particular gender difference
among different English tests."
Here the male teacher contends that there is a systematic difference in female
students' understanding of scientific and logical content. Yet assuming that the language
test content does not narrowly sample such domains, he claims there is no
bias on the test in question. In fact, he suggests that the advantage is on the side of
the female students, because the domain is language study. This view perhaps reflects
a prevailing social construct in Japan: science and math are perceived as
male domains, so test content focused on language gives females an advantage.
Teacher A's Analysis of Four Test Passages

Passage 1 (Letter Rotation): "I found this an interesting passage, but I
don't think there would be any gender difference in the performance."
Passage 2 (Human Factors): "As for the topic, girls may hold more interest
in it. However, I think boys would do better in such a task type using
graphs. As a result there may not be any difference between them. Anyway, I
believe boys are good at such mathematical tasks."
Passage 3 (Soccer Playoffs): "The topic is soccer and I don't think there is
any gender difference in their interest and knowledge about it as far as high
school students are concerned. Rather than the topic, the knowledge about
this table may make the difference. However, again, I think this table is quite
familiar to both genders since we use it often in school ball games or other
club activities."
Passage 4 (Transportation): "The topic is about transportation and I think
boys can do better here. Boys like vehicles, machines, etc., don't they? I think
boys definitely like and do well in passages about sports, transportation, animation,
and pop idols, while girls are good at fashion, songs, singers, movies,
cooking, girls' comics, and trendy dramas. If they are asked to read a passage
about fashion, boys wouldn't understand much."
This male teacher's perception of domains of interest and experience corresponds
to the overall tendency of male teachers to ascribe knowledge and abilities
to female and male students differentially. Interestingly, two of the passages, Human
Factors and Transportation, are considered in the male domain of interest,
while the Soccer Playoff Schedule is deemed less likely to trigger schematic bias.
This testimony concurs with the statistical tendency of Japanese male teachers to
assume that differences in schematic content trigger bias even when no such bias is
detected objectively. Oddly, this teacher does not predict that the one passage (Soccer
Playoffs) with objective evidence of gender bias would in fact yield any.
Teacher A: About Teachers' Awareness

"I think teachers' awareness about gender difference varies according to
the generation. In general, older generations (50s, 60s) have a clearer and
very often wrong image of gender difference among students."

This younger male teacher, while adhering to the pattern of projected bias about
the passages, seems aware that Japanese teachers in general entertain the notion
that male–female differences in schematic background knowledge actually exist.
Interestingly, no account of the source of such putative differences, whether natural
or constructed through socialization processes, is given.

Teacher B (mid-40s, male)
Overall impression and belief
“In general, I didn’t notice gender performance differences in different
passages, apart from the fact that girls do better than boys on English tests
as a whole. If I dare raise an example of gender difference, girls might do
particularly better when it’s about fashion or designing. These topics are
very unfamiliar to boys. As for the task types, I don’t think multiple choice
type would cause any gender difference in the performance. Certainly girls
tend to perform much better than boys in essay type or self-expressive types
of tasks. In this sense, this kind of paper test may be easier for boys to show
their knowledge or ability. Performance-based types of exams such as essays
or interviews would reveal the gender difference far more dramatically.
Girls certainly are said to be weak at map reading or space/direction recognition.
However, paper tests wouldn't go that far to reveal the difference between
genders. Similarly, boys are said to be good at logical structure, but an
English test in multiple choice format wouldn’t be appropriate to prove it.”
The account provided by this mid-career male teacher does not specifically
nominate any particular biased passages; rather, he contends that gender differences
are pervasive. This interpretation matches the overall pattern of these Japanese
male teachers seeing ubiquitous gender differences. Interestingly, this account
contends that the modality of the test serves to remove gender
differences—that the multiple-choice format neutralizes the potential for bias. This
account conflicts with meta-analyses of test format (Willingham & Cole, 1997),
which have found that the multiple-choice format tends to favor male test candidates.
The assertion here is that the multiple-choice format removes the "natural" advantage
in language ability that female students are assumed to possess.
Teacher B's Analysis of Four Test Passages

"I felt that passage[s] 2 and 4 might show some gender differences. Passage
2 is about shapes and this is about math knowledge. Passage 4 is about
transportation and it is a sociological topic. The former requires rather particular
vocabulary such as 'triangle', which is not very frequent in the normal
English textbooks. So, boys who had read something related to math in
English could answer this much more easily. Similarly passage 4 (Transportation)
has some specific vocabulary and I think it is more familiar to boys.
Passage 1 (Letter Rotation) is about angles and directions. If one can understand
what these pictures mean, then it's easy to answer. Passage 3's topic is
soccer, but the point of this passage is this table. So I don't think this is difficult
for those not interested in sports. Certainly this table may require some
attention, but this is a very common type of table and we see it on TV daily. In
the end, these 4 passages don't require any special mathematical knowledge.
The only difference is in Passage 2 (Human Factors); this may require mathematical
vocabulary to understand the context and it may cause gender differences
in the end."
This male teacher's account seems to contradict the statistical data. While the
schemata sampled in the passages may differ according to the life experiences of male
and female students, the extent of that difference is not enough to trigger systematic
bias in answering the test questions. It is odd, therefore, that male teachers—both
in-service and preservice—have tended to assume there is a
systematic handicap for female students. The implied source of the bias is numerical
reasoning, and since the four passages do not require much more than simple
arithmetic, no clear source of bias is identified. The one biased item (no. 13), nested
in the Soccer Playoff Schedule passage, is not referred to at all in the oral account.
The possible source of bias is said to be in male versus female domains of "interest":
students not interested in sports in general would be disadvantaged on a
sports-related topic.
Teacher B: About Teachers' Awareness

"I think a teacher's own common sense should be good enough to judge
any gender bias in the exam. Certainly some teachers, especially older ones,
tend to have different gender images about students. But more importance
lies in whether the exam questions are about what they've learned in the curriculum
or not. This is because the gender difference would show very little
influence on the multiple-choice tests."

This teacher does not expect that male and female teachers differ in their diagnostic
accuracy when it comes to gender bias on tests. Rather, the source of differences
may emerge as a consequence of teachers' age and experience. Presumably, gender
differences observed over a career in the classroom serve to reinforce expectations
about what female students are likely to be familiar with. However, if the content is
part of the taught curriculum, the potential for bias is considered minimized, especially
when the test format requires selection from alternative answers.
Teacher C (mid-50s, male)
Overall impression and belief

"Overall, female students always perform better than males on language
tests. Rather than the gender, whether s/he knows the topic or not would affect
the performance. Language ability cannot simply be the pure knowledge
of the language. It cannot be separated from the content knowledge or discourse
knowledge."

Teacher C asserts that gender differences favor female students, and that observed
differences are not necessarily the consequence of biased items. The account itself
seems to assume that language, content, and discourse are inseparable elements,
and somehow female students prevail. This account does not correspond well with
the objective data, where male teachers tend to assume that content sampling serves
to handicap female test candidates, who otherwise would enjoy an advantage in the
foreign language domain.
Teacher C’s Analysis of Four Test Passages
"Since our school has more female than male students, we EFL teachers
naturally choose female-oriented topics or tasks such as fashion or
movies. We sometimes feel uncomfortable to select topics such as baseball
with them. Here the topic (of Passage 3) is soccer, and I think in this case
the topic would hinder the female students’ performance, although this
passage can be solved without knowing about soccer. Similarly, passage 4,
which is about transportation, may be preferred by male students. As for
the task types, this multiple-choice type is easy for both male and female
students. Female students are particularly good at essay type questions.
Boys are somehow always bad at expressing themselves using words. So
the multiple-choice tasks wouldn’t reveal gender difference clearly. As for
Passages 1 and 2, I don’t think there is any difference between boys and
girls. These are quite gender-neutral and you need only your common
sense to understand the questions.”
Once again the assumption is that the multiple-choice method of testing serves
to reduce the potential natural advantage that female students would enjoy. Performance
assessments, in contrast, are thought to favor female students. What is striking
about this account is the assumption that language advantages for female foreign
language learners are natural, and not the consequences of differential
streaming, reinforcement, or social engineering.
Teacher C: About Teachers' Awareness

"Sometimes students show interest in a topic that seems very tough or
unfamiliar for them. So it is all up to the teachers to make the material interesting
and thought-provoking. Even though some topics, for example, fashion,
are not very familiar to males, teachers can still make it interesting and
enjoyable."
SUMMARY OF TEACHER ACCOUNTS
When these three teachers were asked for their overall impressions and their own
opinions about gender differences, their first answers were always similar: they did not
think there was a significant gender difference between girls and boys on the four passages;
girls are simply better at overall performance on English exams. However, as
the interviews went on and the subjects were asked about the passages, these male
teachers revealed their ideas about the sources of gender differences.
Because these high school teachers find gender differences according to the
method of testing quite obvious, they do not seem to pay much attention to the differences
shown according to the topical domain of each passage. The overall pattern
suggests that hermeneutic assessment of bias issues is susceptible to hypersensitivity.
These male teachers do not construe possible schematic differences as
products of differential socialization practices, but apparently tend to overgeneralize
them as natural categories of gender differences.
CONCORDANCE ANALYSES
After the subjective and objective analyses of bias on the 20 test items were compiled,
a direct comparison was undertaken. The objective and subjective probability-of-bias
estimates were converted into effect sizes. This conversion to a standard metric
allows for a direct comparison between the objective and subjective estimates of
bias. Because the subjective and objective estimates of item bias used different indicators
of statistical significance, an effect size conversion (Shadish, Robinson, & Lu,
1999) yielded a single indicator to facilitate direct comparisons. A zero effect size
indicates an estimate of no bias. Negative valence on the effect size estimates
indicates bias thought to favor males.
As Table 6 suggests, the subjective estimates of bias produce larger effect indicators
of bias than do the objective methods. For Soccer 13, the item detected by
the objective methods to produce a bias favoring males, the effect size of the bias is
in the small effect range (Cohen, 1988). The subjective diagnostics of bias, in contrast,
are mainly in the large effect range (± .80). While the one authentically biased
item gets concordant agreement between the subjective and objective approaches,
the magnitude of the bias estimation is disproportionately large in the
subjective estimation.
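The paper does not reproduce the conversion formulas. One common chain of this kind runs p → z → r → d; the sketch below is therefore illustrative rather than a reconstruction of the exact computation attributed to Shadish, Robinson, and Lu (1999).

```python
# An illustrative p -> z -> r -> d conversion for putting the different bias
# statistics on a common effect-size metric. This is one common chain, not
# necessarily the exact computation used in the paper.
import numpy as np
from scipy.stats import norm

def p_to_d(p_two_sided: float, n_total: int, favors_males: bool) -> float:
    """Convert a two-sided p value into a signed Cohen's d."""
    z = norm.isf(p_two_sided / 2)         # |z| equivalent of the p value
    r = z / np.sqrt(n_total)              # correlation-metric effect size
    d = 2 * r / np.sqrt(1 - r**2)         # standard r-to-d conversion
    return -d if favors_males else d      # negative valence = favors males

# Example: the MH Delta probability for Soccer 13 (p = .001, N = 825).
print(f"d = {p_to_d(0.001, 825, favors_males=True):.2f}")   # about -0.23
```

With the rounded Table 3 probability for Soccer 13 (p = .001) and N = 825, this chain gives d ≈ −.23, close to the MH Delta entry for that item in Table 6; the paper presumably converted the exact test statistics rather than rounded probabilities.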

The subjective judgments of item bias apparently differ according to the experience
and gender of the preservice and in-service teachers in this study. To examine
whether there is a conditional proclivity to identify bias subjectively, a factorial
analysis of variance was performed with the individual test items' mean effect sizes as
the dependent variable and dummy codes for rater experience and gender as independent
variables. As Figure 1 indicates, there is no significant interaction between
experience and the raters' own gender.
Both experienced and nonexperienced male EFL teachers tend to rate the test
items as being biased in favor of male test candidates. The negative effect sizes
reflect bias toward male test candidates, while an effect size of zero would indicate
no anticipated bias. The factorial analysis of variance indicates that there is
a main effect for the gender of the teachers. The near-significance of experience
also suggests that preservice teachers in general tend to be less prone to assuming
that test items favor male test candidates more than their female counterparts.
TABLE 6
Effect Sizes for Objective and Subjective Bias Estimates

Item        MH Delta   Logistic Regression   SIB      Novice Males   Novice Females   Expert Males   Expert Females
Letters1 0.002 0.103 0.018 –0.384 0.076 –0.216 0.56
Letters2 –0.009 –0.007 –0.021 –0.561 0.167 0.137 0.613
Letters3 0.09 0.0087 0.105 –0.711 0.155 –0.218 0.294
Letters4 –0.028 –0.035 –0.041 0.0 –0.275 –0.312 0.417

Letters5 0.003 0.001 0.02 0.0 0.14 0.312 0.289
Visuals6 0.06 0.153 0.028 –0.711 –0.1 –0.94 –0.294
Visuals7 0.096 0.074 0.09 –0.561 0.558 –0.623 0.613
Visuals8 0.064 0.094 0.075 0.851 0.696 –1.04 –1.18
Visuals9 0.026 0.068 0.002 0.0 –0.071 –0.833 –0.202
Visuals10 0.082 0.113 0.064 –0.271 –0.196 –0.193 0.156
Soccer11 0.002 0.031 0.003 –0.876 –0.568 –1.11 –1.33
Soccer12 0.011 0.02 0.017 –1.11 –0.696 –1.11 –1.33
Soccer13 –0.23 –0.147 –0.264 –1.45 –0.696 –1.11 –1.33
Soccer14 0.063 0.027 0.057 –1.24 –0.696 –1.11 –1.33
Soccer15 0.05 0.052 0.038 –1.2 –0.549 –1.11 –1.33
Transprt16 0.035 0.116 0.015 –0.529 –0.123 –1.11 –0.52
Transprt17 0.06 0.049 0.071 –0.167 0.148 –0.85 –0.469
Transprt18 0.074 –0.139 0.043 –0.529 –0.32 –1.11 –0.357
Transprt19 0.049 0.12 0.049 –0.529 –0.298 –0.966 –0.249
Transprt20 0.014 0.044 0.036 –0.529 0.0 –0.966 –0.322
Note. Items in bold represent biased test items. MH = Mantel–Haenszel; SIB = simultaneous item
bias.
Table 7 lists the main effects and the test of the interaction between experience
and the sex of the teachers.²
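A minimal sketch of that factorial layout follows, with random placeholder effect sizes; the real inputs would be the four subjective-rating columns of Table 6, whose 20 items × 4 subgroups yield the 76 error degrees of freedom shown in Table 7.

```python
# A minimal sketch of the 2 x 2 factorial ANOVA on per-item mean effect sizes.
# The effect-size values are random placeholders; the real inputs would be
# the four subjective-rating columns of Table 6 (20 items x 4 subgroups = 80
# observations, leaving the 76 error df shown in Table 7).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(3)
df = pd.DataFrame({
    "experience": np.repeat([0, 1], 40),          # 0 = preservice, 1 = in-service
    "sex": np.tile(np.repeat([1, 2], 20), 2),     # 1 = male, 2 = female raters
    "effect_size": rng.normal(-0.4, 0.5, 80),     # placeholder item effect sizes
})
model = smf.ols("effect_size ~ C(experience) * C(sex)", data=df).fit()
print(anova_lm(model, typ=2))   # main effects and interaction, as in Table 7
```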
COMPARATIVE CONCORDANCE ANALYSES
As a final comparison of the differences between the subjective and objective approaches
to bias detection, separate concordance analyses were performed on the
bias estimate effects. The three objective methods of detecting bias on the test
items show a strong concordance in agreement about the lack of systematic gender
bias on the 20 test items. Table 8 indicates a Concordance W of .690 among the objective
DIF detection methods. As Siegel and Castellan (1988) noted, a concordance
indicator does not necessarily provide evidence that the agreement is unidirectional.
This point is illustrated in the agreement among the subjective raters of gender
bias. Here, the Concordance W is comparably large (.709), but points to the opposite
conclusion from that reached by employing objective methods of bias
detection. The subjective ratings of gender bias are strongly in agreement about the
existence of bias in the test items, though even the three samples of narrative accounts
do not provide much consistent insight into why there is such hypersensitivity
to the issue of gender differences.

FIGURE 1 Interaction of experience and rater gender on bias effect sizes. Experience = 0 refers
to preservice EFL teachers; Experience = 1 refers to in-service teachers. Sex = 1 refers to
male teachers; Sex = 2 refers to female teachers.

²We have used "sex" to refer to teacher gender so as to distinguish it from the gender of the students
in the subjective and objective analyses.
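Kendall's W is simple to compute directly: each method's effect sizes are ranked across the 20 items, and the variance of the rank sums is referred to a chi-square approximation. The sketch below uses a random placeholder matrix and omits a tie correction, which is immaterial for continuous effect sizes.

```python
# A minimal sketch of Kendall's coefficient of concordance W across 20 items.
# The input matrix is a random placeholder; feeding in the three objective
# (or four subjective) effect-size columns would reproduce a Table 8-style W.
import numpy as np
from scipy.stats import rankdata, chi2

def kendalls_w(ratings: np.ndarray) -> tuple[float, float]:
    """ratings: m raters/methods (rows) x n items (columns) -> (W, p)."""
    m, n = ratings.shape
    ranks = np.vstack([rankdata(row) for row in ratings])  # rank within rater
    rank_sums = ranks.sum(axis=0)
    s = np.sum((rank_sums - rank_sums.mean()) ** 2)
    w = 12.0 * s / (m**2 * (n**3 - n))                     # no tie correction
    p = chi2.sf(m * (n - 1) * w, df=n - 1)                 # chi-square approx.
    return w, p

rng = np.random.default_rng(11)
w, p = kendalls_w(rng.normal(size=(3, 20)))                # 3 methods, 20 items
print(f"W = {w:.3f}, p = {p:.3f}, df = 19")
```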
CONCLUSION
The three conventional empirical methods of estimating item bias via differential
item functioning show strong concordance. The three methods used in the analysis,
the Simultaneous Item Bias approach, the Mantel-Haenszel Delta, and logistic
regression, were largely concordant in identifying that the majority of the 20 items
on the four test passages were not biased. A single biased item (Soccer 13) was
correctly detected by the three different objective methods.
The subjective ratings of bias suggested that novice and experienced teachers
tend to overestimate the extent of gender bias. The tendency is to see the schematic
domain of whole passages as likely to induce bias. The male teachers sampled
may simply confuse potential sensitivity issues with actual test bias. The
finding that both novice and experienced male raters identified phantom bias
also suggests a possible stereotypical assumption about knowledge domains
thought to favor male versus female test takers. The lack of empirical evidence
of item bias on 19 out of 20 experimental test items further suggests that female
Japanese test candidates are more familiar with particular schematic domains
than their male teachers might give them credit for.
TABLE 7
Analysis of Variance of Experience and Teacher Sex
Source df F p
Experience 1 3.170 .079
Teacher sex 1 9.635 .003
Experience × Teacher Sex 1 0.020 .888
Error 76
Note. Dependent variable is the effect size.
TABLE 8
Kendall's Coefficient of Concordance W

Objective DIF Methods: W = .690, p = .004, df = 19
Subjective Bias Ratings: W = .709, p = .000, df = 19

Note. DIF = differential item functioning.
The reality is that high stakes language tests in Japan are rarely screened empirically
for item bias. Current practice calls for moderation panels to conduct sensitivity
reviews of draft test materials. As many of these panels are predominantly
made up of males, the potential for the needless omission of unbiased test material
is implied by the findings of this study. A large false-positive error rate is likely to
render such hermeneutic assessments of bias inefficient in terms of cost-utility criteria.
This conclusion is derived from the observation that for many high stakes admissions
exams, the test passages and items are crafted at high cost by sequestered
test construction committees. When possible sensitivity is confused with authentic
bias in candidate test items, unbiased items and often whole passages are omitted
despite their potential validity. It is possible that the eventual homogenization of
test passages can even work against content validity, in that authentic texts representing
the wider domain of real-world usage are more likely to be excluded in
subjective sensitivity reviews. Because empirical counterevidence is rarely used to
correct the tendency to assume sensitivity is equivalent to bias, considerable resources
are wasted in drafting test passages because of the high false-positive error
rate resulting from the exclusive use of subjective moderation panels on these high
stakes foreign language examinations.
The implication of this study is that in addition to moderation reviews, empirical item bias analyses should be conducted on high stakes foreign language tests used for university admissions. Faulty items that slip through the moderation process can then be identified and omitted from the test before scoring is finalized. Given the widespread availability of item bias detection software, this approach is likely to yield, in the long run, the most cost-effective and valid means of removing biased items from high stakes language tests.
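As a sketch of what such a post hoc empirical screen might look like, the following function applies the logistic regression DIF test, comparing a baseline total-score model against a model that adds group membership and its interaction with score; the two added terms jointly capture uniform and nonuniform DIF. Variable names are hypothetical, and this is an illustration of the general technique rather than the procedure used for the analyses reported here.

```python
import numpy as np
import statsmodels.api as sm
from scipy.stats import chi2

def lr_dif(item, group, total):
    """Logistic regression DIF: likelihood-ratio test of the group and
    group-by-score terms over a baseline total-score model."""
    base = sm.Logit(item, sm.add_constant(total)).fit(disp=0)
    X = np.column_stack([total, group, total * group])
    full = sm.Logit(item, sm.add_constant(X)).fit(disp=0)
    lr = 2 * (full.llf - base.llf)        # chi-square with 2 df
    return lr, chi2.sf(lr, df=2)

# Hypothetical screening loop over an (N candidates x k items) response matrix,
# matching each item on the rest score (total minus the studied item):
# for j in range(items.shape[1]):
#     stat, p = lr_dif(items[:, j], group, items.sum(axis=1) - items[:, j])
#     # flag item j for panel review or removal if p < .05
```

Items whose likelihood-ratio statistic reaches significance would then be routed back to the moderation panel or omitted before scores are finalized.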
ACKNOWLEDGMENTS
This study was supported by a Kaken grant from the Japanese Ministry of Education and Science.
We thank Sugiyama Naoto, Tom Robb, Ishikawa Tomohito, Isono Morihiko,
Ozawa Masato, and Ishihara Satoru for their assistance in data collection.
APPENDIX
Gender Bias Survey
There is no need to provide answers to the test questions. Please inspect the test passages, graphs, and charts, and then rate whether, and to what degree, male or female students would find each test question differentially difficult.
Rating Scale:
Rate –3 if you think male students would be highly advantaged
Rate –2 if you think male students would be moderately advantaged
Rate –1 if you think male students would be slightly advantaged
Rate 0 if you think there is no differential advantage
Rate 1 if you think female students would be slightly advantaged
Rate 2 if you think female students would be moderately advantaged
Rate 3 if you think that female students would be highly advantaged
Please be sure to rate each of the 20 test questions.
About You
Please complete the survey
21) Your Gender:
1 Male 2 Female
22) Current Occupation:
Pre-Service
Pre-Service Graduate
College Faculty
In-Service High School Teacher
Other
23) In what ways do you consider language test content to give differential advantages to either male or female high school students? Please answer freely.