Fundamental considerations in language testing (Lyle F. Bachman) - Oxford Applied Linguistics

Preface

1 Introduction
Aims of the book
The climate for language testing
Research and development: needs and problems
Research and development: an agenda
Overview of the book
Notes

2 Measurement
Introduction
Definition of terms: measurement, test, evaluation
Essential measurement qualities
Properties of measurement scales
Characteristics that limit measurement
Steps in measurement
Summary
Notes
Further reading
Discussion questions

3 Uses of Language Tests
Uses of language tests in educational programs
Research uses of language tests
Features for classifying different types of language test
Summary
Further reading
Discussion questions

4 Communicative Language Ability
Introduction
Language proficiency and communicative competence
A theoretical framework of communicative language ability
Notes
Further reading
Discussion questions

5 Test Methods
Introduction
A framework of test method facets
Applications of this framework to language testing
Summary
Notes
Further reading
Discussion questions

6 Reliability
Introduction
Factors that affect language test scores
Classical true score measurement theory
Generalizability theory
Standard error of measurement: interpreting individual test scores within classical true score and generalizability theory
Item response theory
Reliability of criterion-referenced test scores
Factors that affect reliability estimates
Systematic measurement error
Summary
Notes
Further reading
Discussion questions

7 Validation
Introduction
Reliability and validity revisited
Validity as a unitary concept
The evidential basis of validity
Test bias
The consequential or ethical basis of validity
Postmortem: face validity
Notes
Further reading
Discussion questions

8 Some persistent problems and future directions
Authentic language tests
A general model for explaining performance on language tests
Summary
Notes
Further reading
Discussion questions

Bibliography


2 Measurement

Introduction
In developing language tests, we must take into account considerations and follow procedures that are characteristic of tests and measurement in the social sciences in general. Likewise, our interpretation and use of the results of language tests are subject to the same general limitations that characterize measurement in the social sciences. The purpose of this chapter is to introduce the fundamental concepts of measurement, an understanding of which is essential to the development and use of language tests. These include the terms 'measurement', 'test', and 'evaluation', and how these are distinct from each other; different types of measurement scales and their properties; the essential qualities of measures, reliability and validity; and the characteristics of measures that limit our interpretations of test results. The process of measurement is described as a set of steps which, if followed in test development, will provide the basis for both reliable test scores and valid test use.

Definition of terms: measurement, test, evaluation

The terms 'measurement', 'test', and 'evaluation' are often used synonymously; indeed they may, in practice, refer to the same activity. When we ask for an evaluation of an individual's language proficiency, for example, we are frequently given a test score. This attention to superficial similarities among these terms, however, tends to obscure the distinctive characteristics of each, and I believe that an understanding of the distinctions among the terms is vital to the proper development and use of language tests.
Measurement
Measurement in the social sciences is the process of quantifying the characteristics of persons according to explicit procedures and rules. This definition includes three distinguishing features: quantification, characteristics, and explicit rules and procedures.

Quantification involves the assigning of numbers, and this distinguishes measures from qualitative descriptions such as verbal accounts or nonverbal, visual representations. Non-numerical categories or rankings such as letter grades ('A', 'B', 'C') or labels (for example, 'excellent', 'good', 'average') may have the characteristics of measurement, and these are discussed below under 'properties of measurement scales' (pp. 26-30). However, when we actually use categories or rankings such as these, we frequently assign numbers to them in order to analyze and interpret them, and technically, it is not until we do this that they constitute measurement.

We can assign numbers to both physical and mental characteristics of persons. Physical attributes such as height and weight can be observed directly. In language testing, however, we are almost always interested in quantifying mental attributes and abilities, sometimes called traits or constructs, which can only be observed indirectly. These mental attributes include characteristics such as aptitude, intelligence, motivation, field dependence, attitude, native language, fluency in speaking, and achievement in reading comprehension.

The precise definition of 'ability' is a complex undertaking. In a very general sense, 'ability' refers to being able to do something, but the circularity of this general definition provides little help for measurement unless we can clarify what the 'something' is. John Carroll has proposed defining an ability with respect to a particular class of cognitive or mental tasks that an individual is required to perform, and 'mental ability' thus refers to performance on a set of mental tasks (Carroll, p. 268). We generally assume that there are degrees of ability and that these are associated with tasks or performances of increasing difficulty or complexity (Carroll). Thus, individuals with higher degrees of a given ability could be expected to have a higher probability of correct performance on tasks of lower difficulty or complexity, and a lower probability of correct performance on tasks of greater difficulty or complexity.
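This relationship between ability and task difficulty can be made concrete with a small numerical sketch. The logistic function used below is not part of Carroll's definition; it is simply one convenient way to model a monotonic relationship between ability and the probability of correct performance (formal models of this kind are taken up in Chapter 6), and all of the ability and difficulty values are invented for the illustration.

```python
import math

def p_correct(ability, difficulty):
    """Probability of a correct response under a simple logistic model:
    higher ability or lower difficulty -> higher probability."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# Two hypothetical test takers and two tasks (all values invented).
low_ability, high_ability = -1.0, 1.0
easy_task, hard_task = -1.5, 1.5

for ability in (low_ability, high_ability):
    for difficulty in (easy_task, hard_task):
        print(f"ability={ability:+.1f} difficulty={difficulty:+.1f} "
              f"P(correct)={p_correct(ability, difficulty):.2f}")

# The printout shows the pattern described above: for a given ability,
# easier tasks have a higher probability of correct performance, and for
# a given task, the more able test taker has the higher probability.
```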


Whatever attributes or abilities we measure, it is important to understand that it is these attributes or abilities, and not the persons themselves, that we are measuring. That is, we are far from being able to claim that a single measure or even a battery of measures can adequately characterize individual human beings in all their complexity.

The third distinguishing characteristic of measurement is that quantification must be done according to explicit rules and procedures. That is, the 'blind' or haphazard assignment of numbers to characteristics of individuals cannot be regarded as measurement. In order to be considered a measure, an observation of an attribute must be replicable, for other observers, in other contexts and with other individuals. Practically anyone can rate another person's speaking ability, for example. But while one rater may focus on pronunciation accuracy, another may find vocabulary to be the most salient feature. Or one rater may assign a rating as a percentage, while another might rate on a scale from zero to five. Ratings such as these can hardly be considered anything more than numerical summaries of the raters' personal conceptualizations of the individual's speaking ability. This is because the different raters in this case did not follow the same criteria or procedures for arriving at their ratings. Measures, then, are distinguished from such 'pseudo-measures' by the explicit procedures and rules upon which they are based. There are many different types of measures in the social sciences, including rankings, rating scales, and test scores.

Test

Carroll (1968) provides the following definition of a test:

a psychological or educational test is a procedure designed to elicit certain behavior from which one can make inferences about certain characteristics of an individual.
(Carroll 1968: 46)

From this definition, it follows that a test is a measurement instrument designed to elicit a specific sample of an individual's behavior. As one type of measurement, a test necessarily quantifies characteristics of individuals according to explicit procedures. What distinguishes a test from other types of measurement is that it is designed to obtain a specific sample of behavior. Consider the following example. The Interagency Language Roundtable (ILR) oral interview is a test of speaking consisting of (1) a set of elicitation procedures, including a sequence of activities and sets of question types and topics; and (2) a measurement scale of language proficiency ranging from a low level of '0' to a high level of '5', on which samples of oral language obtained via the elicitation procedures are rated. Each of the six scale levels is carefully defined by an extensive verbal description. A qualified ILR interviewer might be able to rate an individual's oral proficiency in a given language according to the ILR rating scale, on the basis of several years' informal contact with that individual, and this could constitute a measure of that individual's oral proficiency. This measure could not be considered a test, however, because the rater did not follow the procedures prescribed by the ILR oral interview, and consequently may not have based her ratings on the specific samples of language performance that are obtained in conducting an ILR oral interview.
I believe this distinction is an important one, since it reflects the primary justification for the use of language tests and has implications for how we design, develop, and use them. If we could count on being able to measure a given aspect of language ability on the basis of any sample of language use, however obtained, there would be no need to design language tests. However, it is precisely because any given sample of language will not necessarily enable the test user to make inferences about a given ability that we need language tests. That is, the inferences and uses we make of language test scores depend upon the sample of language use obtained. Language tests can thus provide the means for more carefully focusing on the specific language abilities that are of interest. As such, they could be viewed as supplemental to other methods of measurement. Given the limitations on measurement discussed below, and the potentially large effect of elicitation procedures on test performance, however, language tests can more appropriately be viewed as the best means of assuring that the sample of language obtained is sufficient for the intended measurement purposes, even if we are interested in very general or global abilities. That is, carefully designed elicitation procedures such as those of the ILR oral interview, those for measuring writing ability described by Jacobs et al., or those of multiple-choice tests such as the Test of English as a Foreign Language (TOEFL), provide the best assurance that scores from language tests will be reliable, and that the interpretations and uses made of these scores will be valid.




While measurement is frequently based on the naturalistic observation of behavior over a period of time, as in teachers' grades, such naturalistic observations might not include samples of behavior that manifest specific abilities or attributes. Thus a rating based on a collection of personal letters, for example, might not provide any indication of an individual's ability to write effective argumentative editorials for a news magazine. Likewise, a teacher's rating of a student's language ability based on observing interactive social language use may not be a very good indicator of how well that student can use language to perform various 'cognitive/academic' language functions (Cummins). This is not to imply that other measures are less valuable than tests, but to make the point that the value of tests lies in their capability for eliciting the specific kinds of behavior that the test user can interpret as evidence of the attributes or abilities which are of interest.

Evaluation

Evaluation can be defined as the systematic gathering of information for the purpose of making decisions (Weiss). The probability of making the correct decision in any given situation is a function not only of the ability of the decision maker, but also of the quality of the information upon which the decision is based. Everything else being equal, the more reliable and relevant the information, the better the likelihood of making the correct decision. Few of us, for example, would base educational decisions on hearsay or rumor, since we would not generally consider these to be reliable sources of information. Similarly, we frequently attempt to screen out information, such as sex and ethnicity, that we believe to be irrelevant to a particular decision. One aspect of evaluation, therefore, is the collection of reliable and relevant information. This information need not be, indeed seldom is, exclusively quantitative. Qualitative descriptions, ranging from performance profiles to letters of reference, as well as overall impressions, can provide important information for evaluating individuals, as can measures, such as ratings and test scores.
Evaluation, therefore, does not necessarily entail testing. By the same token, tests in and of themselves are not evaluative. Tests are often used for pedagogical purposes, either as a means of motivating students to study, or as a means of reviewing material taught, in which case no evaluative decision is made on the basis of the test results. Tests may also be used for purely descriptive purposes. It is only when the results of tests are used as a basis for making a decision that evaluation is involved. Again, this may seem a minor point, but it places the burden for much of the stigma that surrounds testing squarely upon the test user, rather than on the test itself. Since by far the majority of tests are used for the purpose of making decisions about individuals, I believe it is important to distinguish the information-providing function of measurement from the decision-making function of evaluation.

The relationships among measurement, tests, and evaluation are illustrated in Figure 2.1.

[Figure 2.1 Relationships among measurement, tests, and evaluation]

An example of an evaluation that does not involve either tests or measures (area '1') is the use of qualitative descriptions of student performance for diagnosing learning problems. An example of a non-test measure used for evaluation (area '2') is a teacher ranking used for assigning grades, while an example of a test used for purposes of evaluation (area '3') is the use of an achievement test to determine student progress. The most common non-evaluative uses of tests and measures are for research purposes. An example of tests that are not used for evaluation (area '4') is the use of a proficiency test as a criterion in second language acquisition research. Finally, assigning code numbers to subjects in second language research according to native language is an example of a measure that is not used for evaluation (area '5'). In summary, then, not all measures are tests, not all tests are evaluative, and not all evaluation involves either measurement or tests.

Essential measurement qualities

If we are to interpret the score on a given test as an indicator of an individual's ability, that score must be both reliable and valid. These qualities are thus essential to the interpretation and use of measures of language abilities, and they are the primary qualities to be considered in developing and using tests.

Reliability

Reliability is a quality of test scores, and a perfectly reliable score, or measure, would be one which is free from errors of measurement (American Psychological Association 1985). There are many factors other than the ability being measured that can affect performance on tests, and that constitute sources of measurement error. Individuals' performance may be affected by differences in testing conditions, fatigue, and anxiety, and they may thus obtain scores that are inconsistent from one occasion to the next. If, for example, a student receives a low score on a test one day and a high score on the same test a few days later, the test does not yield consistent results, and the scores cannot be considered reliable indicators of the individual's ability. Or suppose two raters gave widely different ratings to the same writing sample. In the absence of any other information, we would have no basis for deciding which rating to use, and consequently may regard both as unreliable. Reliability thus has to do with the consistency of measures across different times, test forms, raters, and other characteristics of the measurement context.

In any testing situation, there are likely to be several different sources of measurement error, so that the primary concerns in examining the reliability of test scores are, first, to identify the different sources of error, and then to use the appropriate empirical procedures for estimating the effects of these sources of error on test scores. The identification of potential sources of error involves making judgments based on an adequate theory of the sources of error. Determining how much these sources of error affect test scores, on the other hand, is a matter of empirical research. The different approaches to defining and empirically investigating reliability will be discussed in detail in Chapter 6.
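The notion of measurement error can be illustrated with a short simulation. The sketch below assumes the classical decomposition of an observed score into a stable component plus random error, which anticipates the treatment in Chapter 6; the particular numbers (a 'true' level of 70 and an error spread of 8) are invented for the example.

```python
import random

random.seed(1)

TRUE_SCORE = 70.0   # the (unobservable) ability level, invented for the example
ERROR_SD = 8.0      # spread of measurement error; larger values = less reliable

def observed_score():
    """One test occasion: the true score plus random measurement error."""
    return TRUE_SCORE + random.gauss(0, ERROR_SD)

scores = [observed_score() for _ in range(5)]
print("Five occasions:", [round(s, 1) for s in scores])

# With ERROR_SD = 8, the same person can plausibly score in the low 60s one
# day and near 80 another day, even though the ability measured is unchanged.
# Shrinking ERROR_SD toward 0 makes the occasions agree: a reliable measure.
```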


Validity

The most important quality of test interpretation or use is validity, or the extent to which the inferences or decisions we make on the basis of test scores are meaningful, appropriate, and useful (American Psychological Association 1985). In order for a test score to be a meaningful indicator of a particular individual's ability, we must be sure it measures that ability and very little else. Thus, in examining the meaningfulness of test scores, we are concerned with demonstrating that they are not unduly affected by factors other than the ability being tested. If test scores are strongly affected by errors of measurement, they will not be meaningful, and cannot, therefore, provide the basis for valid interpretation or use. A test score that is not reliable, therefore, cannot be valid. If test scores are affected by abilities other than the one we want to measure, they will not be meaningful indicators of that particular ability. If, for example, we ask students to listen to a lecture and then to write a short essay based on that lecture, the essays they write will be affected by both their writing ability and their ability to comprehend the lecture. Ratings of their essays, therefore, might not be valid measures of their writing ability.
In examining validity, we must also be concerned with the appropriateness and usefulness of the test score for a given purpose. A score derived from a test developed to measure the language abilities of monolingual elementary school children, for example, might not be appropriate for determining the second language proficiency of bilingual children of the same ages and grade levels. To use such a test for this latter purpose, therefore, would be highly questionable. Similarly, scores from a test designed to provide information about an individual's vocabulary knowledge might not be particularly useful for placing students in a writing program.
While reliability is a quality of test scores themselves, validity is a quality of test interpretation and use. As with reliability, the investigation of validity is both a matter of judgment and of empirical research, and involves gathering evidence and appraising the values and social consequences that justify specific interpretations or uses of test scores. There are many types of evidence that can be presented to support the validity of a given test interpretation or use, and hence many ways of investigating validity. Different types of evidence that are relevant to the investigation of validity, and approaches to collecting this evidence, are discussed in Chapter 7. Reliability and validity are both essential to the use of tests.


Neither, however, is a quality of tests themselves; reliability is a quality of test scores, while validity is a quality of the interpretations or uses that are made of test scores. Furthermore, neither is absolute, in that we can never attain perfectly error-free measures in actual practice, and the appropriateness of a particular use of a test score will depend upon many factors outside the test itself. Determining what degree of relative reliability or validity is required for a particular test context thus involves a value judgment on the part of the test user.

Properties of measurement scales

If we want to measure an attribute or ability of an individual, we need to determine what set of numbers will provide the best measurement. When we measure the loudness of someone's voice, for example, we use decibels, but when we measure temperature, we use degrees Centigrade or Fahrenheit. The sets of numbers used for measurement must be appropriate to the ability or attribute being measured, and the different ways of organizing these sets of numbers constitute scales of measurement.

Unlike physical attributes, such as height, weight, voice pitch, and temperature, we cannot directly observe intrinsic attributes or abilities, and we therefore must establish our measurement scales by definition, rather than by direct comparison. The scales we define can be distinguished in terms of four properties. A measure has the property of distinctiveness if different numbers are assigned to persons with different values on the attribute, and is ordered in magnitude if larger numbers indicate larger amounts of the attribute. If equal differences between ability levels are indicated by equal differences in numbers, the measure has equal intervals, and if a value of zero indicates the absence of the attribute, the measure has an absolute zero point.
Ideally, we would like the scales we use to have all these properties, since each property represents a different type of information, and the more information our scale includes, the more useful it will be for measurement. However, because of the nature of the abilities we wish to measure, as well as the limitations on defining and observing the behavior that we believe to be indicative of those abilities, we are not able to use scales that possess all four properties for measuring every ability. That is, not every attribute we want to measure, or quantify, falls on the same scale, and not every procedure we use for observing and quantifying behavior yields the same scale, so that it is necessary to use different scales of measurement, according to the characteristics of the attribute we wish to measure and the type of measurement procedure we use. Ratings, for example, might be considered the most appropriate way to quantify observations of speech from an oral interview, while we might believe that the number of items answered correctly on a multiple-choice test is the best way to measure knowledge of grammar. These abilities are different, as are the measurement procedures used, and consequently, the scales they yield have different properties. The way we interpret and use scores from our measures is determined, to a large extent, by the properties that characterize the measurement scales we use, and it is thus essential to both the development and the use of language tests to understand these properties and the different measurement scales they define. Measurement specialists have defined four types of measurement scales (nominal, ordinal, interval, and ratio) according to how many of these four properties they possess.


Nominal scale

As its name suggests, a nominal scale comprises numbers that are used to 'name' the classes or categories of a given attribute. That is, we can use numbers as a shorthand code for identifying different categories. If we quantified the attribute 'native language', for example, we would have a nominal scale. We could assign different code numbers to individuals with different native language backgrounds (for example, Amharic 1, Arabic 2, Bengali 3, Chinese 4, etc.) and thus create a nominal scale for this attribute. The numbers we assign are arbitrary, since it makes no difference what number we assign to what category, so long as each category has a unique number. The distinguishing characteristic of a nominal scale is that while the categories to which we assign numbers are distinct, they are not ordered with respect to each other. In the example above, although '1' (Amharic) is different from '2' (Arabic), it is neither greater than nor less than '2'. Nominal scales thus possess the property of distinctiveness. Because they quantify categories, nominal scales are also sometimes referred to as 'categorical' scales. A special case of a nominal scale is a dichotomous scale, in which the attribute has only two categories, such as 'sex' (male and female), or 'status of answer' (right and wrong) on some types of tests.
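A brief sketch can make the arbitrariness of nominal codes concrete. The codings below follow the native-language example above; the second coding and the list of speakers are invented for the illustration.

```python
# Two equally valid nominal codings of 'native language' (codes are arbitrary).
coding_a = {"Amharic": 1, "Arabic": 2, "Bengali": 3, "Chinese": 4}
coding_b = {"Amharic": 4, "Arabic": 3, "Bengali": 2, "Chinese": 1}

speakers = ["Arabic", "Arabic", "Chinese", "Amharic"]
codes_a = [coding_a[s] for s in speakers]
codes_b = [coding_b[s] for s in speakers]

# Both codings carry the only information a nominal scale provides: whether
# two speakers fall into the same category or into different ones.
print(codes_a)  # [2, 2, 4, 1]
print(codes_b)  # [3, 3, 1, 4]

# Arithmetic on nominal codes is meaningless: the 'mean native language'
# shifts with the arbitrary coding and names no category at all.
print(sum(codes_a) / len(codes_a))  # 2.25
print(sum(codes_b) / len(codes_b))  # 2.75
```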


Ordinal scale

An ordinal scale, as its name suggests, comprises the numbering of different levels of an attribute that are ordered with respect to each other. The most common example of an ordinal scale is a ranking, in which individuals are ranked 'first', 'second', 'third', and so on, according to some attribute or ability. A rating based on definitions of different levels of ability is another measurement procedure that typically yields scores that constitute an ordinal scale. The points, or levels, on an ordinal scale can be characterized as 'greater than' or 'less than' each other, and ordinal scales thus possess, in addition to the property of distinctiveness, the property of ordering. The use of subjective ratings in language tests is an example of ordinal scales, and is discussed on pp. 36 and 44-5 below.

Interval scale

An interval scale is a numbering of different levels in which the distances, or intervals, between the levels are equal. That is, in addition to the ordering that characterizes ordinal scales, interval scales consist of equal distances or intervals between ordered levels. Interval scales thus possess the properties of distinctiveness, ordering, and equal intervals. The difference between an ordinal scale and an interval scale is illustrated in Figure 2.2.

[Figure 2.2 Comparison between ordinal and interval scales: five individuals ranked 'First' to 'Fifth' on an ordinal scale (ranking) and placed at unequally spaced points on an interval scale (test score)]

In this example, the test scores indicate that these individuals are not equally distant from each other on the ability measured. This additional information is not provided by the rankings, which might be interpreted as indicating that the intervals between these five individuals' ability levels are all the same. Differences in approaches to developing ordinal and interval scales for language tests are discussed on pp. 36 and 44-5 below.
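The information that separates an interval scale from an ordinal one can be shown concretely. In the sketch below, the five test scores are invented stand-ins for those in Figure 2.2; ranking them preserves their order but discards the unequal distances between them.

```python
# Invented interval-scale test scores for five individuals (cf. Figure 2.2).
scores = {"P1": 80, "P2": 74, "P3": 73, "P4": 60, "P5": 58}

# Derive the ordinal scale: rank the individuals by score, best first.
ranked = sorted(scores, key=scores.get, reverse=True)
for rank, person in enumerate(ranked, start=1):
    print(rank, person, scores[person])

# The ranking alone says P1 > P2 > P3, but not that P1 and P2 are six
# points apart while P2 and P3 are separated by only one point: the
# equal-intervals property belongs to the scores, not to the ranks.
```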

Ratio scale

None of the scales discussed thus far has an absolute zero point, which is the distinguishing characteristic of a ratio scale. Most of the scales that are used for measuring physical characteristics have true zero points. If we looked at a bathroom scale with nothing on it, for example, we should see the pointer at '0', indicating the absence of weight on the scale. The reason we call a scale with an absolute zero point a ratio scale is that we can make comparisons in terms of ratios with such scales. For example, if I have two pounds of coffee and you have four pounds, you have twice as much coffee (by weight) as I have, and if one room is ten feet long and another thirty, the second room is three times as long as the first.

To illustrate the difference between interval and ratio scales, consider the different scales that are used for measuring temperature. Two commonly used scales are the Fahrenheit and Celsius (centigrade) scales, each of which defines zero differently. The Fahrenheit scale originally comprised a set of equal intervals between the melting point of ice, which was arbitrarily defined as 32°, and the temperature of human blood, defined as 96°. In extending the scale, the boiling point of water was found to be 212°, which has since become the upper defining point of this scale. The Fahrenheit scale thus consists of 180 equal intervals between 32° and 212°, with 0° defined simply as 32 scale points below the melting point of ice. The Fahrenheit scale, of course, extends below 0° and above 212°. The Celsius scale, on the other hand, defines 0° as the melting point of ice (at sea level), and 100° as the boiling point of water, with 100 equal intervals in between. In neither the Fahrenheit nor the Celsius scale does the zero point indicate the absence of a particular characteristic: 0°F does not indicate the absence of heat, and 0°C does not indicate the absence of water or ice. These scales thus do not constitute ratio scales, so that if it was 50° last night and 100° at noon today, it is not the case that it is twice as hot now as it was last night. If we define temperature in terms of the volume of an 'ideal' gas, however, then the absolute zero point could be defined as the point at which the gas has no volume. This is the definition that is used for the Kelvin, or absolute, scale, which is a ratio scale.
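The temperature example can be verified with a short computation. The sketch below converts the 50° and 100° Fahrenheit readings mentioned above and shows that the ratio 'twice as hot' is an artifact of the Fahrenheit zero point; only on the Kelvin scale, whose zero is absolute, is the ratio physically meaningful.

```python
def f_to_c(f):
    """Fahrenheit -> Celsius."""
    return (f - 32) * 5 / 9

def c_to_k(c):
    """Celsius -> Kelvin (absolute zero = -273.15 C)."""
    return c + 273.15

night_f, noon_f = 50.0, 100.0
print(noon_f / night_f)        # 2.0  -- 'twice as hot', but only a scale artifact

night_c, noon_c = f_to_c(night_f), f_to_c(noon_f)
print(noon_c / night_c)        # ~3.78 -- the 'ratio' changes with the scale

night_k, noon_k = c_to_k(night_c), c_to_k(noon_c)
print(noon_k / night_k)        # ~1.10 -- the physically meaningful ratio
```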



Each of the four properties discussed above provides a different type of information, and the four measurement scales are thus ordered, with respect to each other, in terms of the amount of information they can provide. For this reason, these different scales are also sometimes referred to as levels of measurement. The nominal scale is thus the lowest type of scale, or level of measurement, since it is only capable of distinguishing among different categories, while the ratio scale is the highest level, possessing all four properties and thus capable of providing the greatest amount of information. The four types of scales, or levels of measurement, along with their properties, are summarized in Table 2.1.
Property                Nominal   Ordinal   Interval   Ratio
Distinctiveness            +         +         +         +
Ordering                             +         +         +
Equal intervals                                +         +
Absolute zero point                                      +

Table 2.1 Types of measurement scales and their properties (after Allen and Yen 1979: 7)
Characteristics that limit measurement
As test developers and test users, we all sincerely want our tests to be the best measures possible. Thus there is always the temptation to interpret test results as absolute, that is, as unimpeachable evidence of the extent to which a given individual possesses the language ability in question. This is understandable, since it would certainly make educational decisions more clear-cut and research results more convincing. However, we know that our tests are not perfect indicators of the abilities we want to measure and that test results must always be interpreted with caution. The most valuable basis for keeping this clearly in mind can be found, I believe, in an understanding of the characteristics of measures of mental abilities and the limitations these characteristics place on our interpretations of test scores. These limitations are of two kinds: limitations in specification and limitations in observation and quantification.
Limitations in specification
In any language testing situation, as with any non-test situation in which language use is involved, the performance of an individual will be affected by a large number of factors, such as the testing context, the type of test tasks required, and the time of day, as well as her mental alertness at the time of the test, and her cognitive and personality characteristics. (See the discussion below of the factors that affect language test scores.) The most important factor that affects test performance, with respect to language testing, of course, is the individual's language ability, since it is language ability in which we are interested.

In order to measure a given language ability, we must be able to specify what it is, and this specification generally is at two levels. At the theoretical level, we can consider the ability as a type, and we need to define it so as to clearly distinguish it from other language abilities and from other factors in which we are not interested, but which may affect test performance. Thus, at the theoretical level we need to specify the ability in relation to, and in contrast to, other language abilities and other factors that may affect test performance.
Given the large number of different individual characteristics (cognitive, affective, physical) that could potentially affect test performance, this would be a nearly impossible task, even if all these factors were independent of each other. How much more so, given the fact that not only are the various language abilities probably interrelated, but that these interact with other abilities and factors in the testing context as well. At the operational level, we need to specify the instances of language performance that we are willing to interpret as indicators, or tokens, of the ability we wish to measure. This level of specification, then, defines the relationship between the ability and the test score, between type and token.

In the face of the complexity of, and interactions among, the factors that affect performance on language tests, we are forced to make certain simplifying assumptions, both in designing language tests and in interpreting test scores. That is, when we design a test, we cannot incorporate all the possible factors that affect performance. Rather, we attempt either to exclude or to minimize by design the effects of factors in which we are not interested, and to maximize the effects of the ability we want to measure. Likewise, in interpreting test scores, even though we know that a test taker's performance on an oral interview, for example, will be affected to some extent by the facility of the interviewer and by the subject matter covered, and that the score will depend on the consistency of the raters, we nevertheless interpret ratings based on an interview as indicators of a single factor: the individual's ability in speaking.


This indeterminacy in specifying what it is that our tests measure is a major consideration in both the development and use of language tests. From a practical point of view, it means there are virtually always more constructs or abilities involved in a given test performance than we are capable of observing or interpreting. Conversely, it implies that when we design a test to measure a given ability or abilities, or interpret a test score as an indicator of an ability, we are simplifying, or underspecifying, the factors that affect the observations we make. Whether the indeterminacy is at the theoretical level of types, where language abilities are not adequately delimited or distinguished from each other, or at the operational level of tokens, where the relationship between abilities and their behavioral manifestations is misspecified, the result will be the same: our interpretations and uses of test scores will be limited in validity.

For language testing research, this indeterminacy implies that any theory of language test performance we develop is likely to be underspecified. Measurement theory, which is discussed in Chapter 6, has developed, to a large extent, as a methodology for dealing with the problem of underspecification, or the uncontrolled effects of factors other than the abilities in which we are interested. In essence, it provides a means for estimating the effects of the various factors that we have not been able to exclude from test performance, and hence for improving both the design of tests and the interpretation of their results.
Limitations in observation and quantification

In addition to the limitations related to the underspecification of factors that affect test performance, there are characteristics of the processes of observation and quantification that limit our interpretations of test results. These derive from the fact that all measures of mental ability are necessarily indirect, incomplete, imprecise, subjective, and relative.

Indirectness

In the majority of situations where language tests are used, we are interested in measuring the test taker's underlying competence, or ability, rather than his performance on a particular occasion. That is, we are generally not interested so much in how an individual performs on a given test on a given day, as in his ability to use language at different times in a wide range of contexts. Thus, even though our measures are necessarily based on one or more individual observations of performance, or behavior, we interpret them as indicators of a more long-standing ability or competence.
I believe it is essential, if we are to properly interpret and use test results, to understand that the relationship between test scores and the abilities we want to measure is indirect. This is particularly critical since the term 'direct test' is often used to refer to a test in which performance resembles 'actual' or 'real life' language performance. Thus, writing samples and oral interviews are often referred to as 'direct' tests, since they presumably involve the use of the skills being tested. By extension, such tests are often regarded, virtually without question, as valid evidence of the presence or absence of the language ability in question. The problem with this, however, is that the use of the term 'direct' confuses the behavioral manifestation of the trait or competence with the construct itself. As with all mental measures, language tests are indirect indicators of the underlying traits in which we are interested, whether they require recognition of the correct alternative in a multiple-choice format, or the writing of an essay. Because scores from language tests are indirect indicators of ability, the valid interpretation and use of such scores depends crucially on the adequacy of the way we have specified the relationship between the test score and the ability we believe it indicates. To the extent that this relationship is not adequately specified, the interpretations and uses made of the test score may be invalid.
Incompleteness

In measuring language abilities, we are never able to observe or elicit an individual's total performance in a given language. This could only be accomplished by following an individual around with a tape recorder 24 hours a day for his entire life, which is clearly an impossible task. That is, given the extent and the variation that characterize language use, it simply is not possible for us to observe and measure every instance of an individual's use of a given language. For this reason, our measures must be based on the observation of a part of an individual's total language use. In other words, the performance we observe and measure in a language test is a sample of an individual's total performance in that language.

Since we cannot observe an individual's total language use, one of our main concerns in language testing is assuring that the sample we do observe is representative of that total use, a potentially infinite set of utterances, whether written or spoken.


If we could tape-record an individual's speech for a few hours every day for a year, we would have a reasonably representative sample of his performance, and we could expect a measure of speaking based on this sample to be very accurate. This is because the more representative our sample of an individual's performance, the more accurate a representation of his total performance it will be. But even a relatively limited sample such as this (in terms of a lifetime of language use) is generally beyond the realm of feasibility, so we may base our measure of speaking on a 30-minute sample elicited during an oral interview. In many large-scale testing contexts, even an oral interview is not possible, so we may derive our measure of speaking from an even more restricted sample of performance, such as an oral reading of a text or a non-speaking test.
Just as large, representative samples yield accurate measures, so the smaller and less representative our samples of performance are, the less accurate our measures will be. Therefore, recognizing that we almost always deal with fairly small samples of performance in language tests, it is vitally important that we incorporate into our measurement design principles or criteria that will guide us in determining what kinds of performance will be most relevant and representative of the abilities we want to measure. One approach to this might be to identify the domain of 'real-life' language use and then attempt to sample performance from this domain. In developing a test to measure how well students can read French literary criticism, for example, we could design the test so that it includes reading tasks that students of French literature actually perform in pursuing their studies.
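The relationship between the representativeness of a sample and the accuracy of the resulting measure can be illustrated with a simulation. The sketch below invents a large population of task outcomes as a stand-in for an individual's total performance; all of the numbers are arbitrary, and the point is only the pattern across sample sizes.

```python
import random

random.seed(7)

# Invented stand-in for an individual's total language performance: 10,000
# task outcomes, each scored 0-100. Observing all of them is the infeasible
# 'tape recorder for an entire life' case.
total_performance = [random.gauss(65, 12) for _ in range(10_000)]
true_level = sum(total_performance) / len(total_performance)

for n in (5, 30, 300):
    sample = random.sample(total_performance, n)
    estimate = sum(sample) / n
    print(f"sample of {n:>3}: estimate={estimate:5.1f} "
          f"(error {abs(estimate - true_level):4.1f})")

# Small samples, like a brief oral reading, can stray well away from the
# level a fully representative sample would show; larger, more
# representative samples tend to yield more accurate measures.
```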

A different approach would be to identify critical features, or components, of language use and then design test tasks that include these. This is the approach that underlies so-called 'discrete-point' language tests, which tend to focus on components such as grammar, pronunciation, and vocabulary. However, this approach need not apply only to the formal features of language, or even to a single feature of language use. It might be of interest in a given situation, for example, to design a test that focuses on an individual's ability to produce pragmatic aspects of language use, such as speech acts or implicatures, in a way that is appropriate for a given context and audience.
The approach we choose in specifying criteria for sampling language use on tests will be determined, to a great extent, by how we choose to define what it is we are testing. That is, if we choose to define test content in terms of a domain of actual language use, we


[...]


for example, is subjective, as is the setting of time limits and other administrative procedures. Finally, interpretations regarding the level of ability or the correctness of the performance on the test may be subjective. All of these subjective decisions can affect both the reliability and the validity of test results to the extent that they are sources of bias and random variation in testing procedures.
Perhaps the greatest source of subjectivity is the test taker herself, who must make an uncountable number of subjective decisions, both consciously and subconsciously, in the process of taking a test. Each test taker is likely to approach the test and the tasks it requires from a slightly different, subjective perspective, and to adopt slightly different, subjective strategies for completing those tasks. These differences among test takers further complicate the tasks of designing tests and interpreting test scores.


Relativeness

The last limitation on measures of language ability is the potential relativeness of the levels of performance or ability we wish to measure. When we base test content on domains of language use, or on the actual performance of individuals, the presence or absence of language abilities is impossible to define in an absolute sense. The concept of 'zero' language ability is a complex one, since in attempting to define it we must inevitably consider language ability as a cognitive ability, its relationship to other cognitive abilities, and whether these have true zero points. This is further complicated with respect to ability in a second or foreign language by the question of whether there are elements of the native language that are either universal to all languages or shared with the second language. Thus, although we can all think of languages of which we know not a single word or expression, this lack of knowledge of the surface features of the language may not constitute absolute 'zero' ability. Even if we were to accept the notion of 'zero' language ability, from a purely practical viewpoint we rarely, even for research purposes, attempt to measure abilities in individuals in whom we believe these abilities to be completely absent.
At the other end of the spectrum, the individual with absolutely complete language ability does not exist. From the perspective of language history, it could be argued that, given the constant change that characterizes any language system, no such system is ever static or 'complete'. From a cognitive perspective it might be argued that cognitive abilities are constantly developing, so that no cognitive
