
Technology in testing: the present and the future
J. Charles Alderson
Department of Linguistics and Modern English Language, Bowland College, Lancaster University, Lancaster
LA1 4YT, UK
Received 10 January 2000; received in revised form 9 February 2000; accepted 10 February 2000
Abstract
As developments in information technology have moved apace, and both hardware and software have become more powerful and cheaper, the long-prophesied use of IT for language testing is finally coming about. The Test of English as a Foreign Language (TOEFL) is mounted on computer. CD ROM-based versions of University of Cambridge Local Examinations Syndicate tests are available, and the Internet is beginning to be used to deliver language tests. This paper reviews the advantages and disadvantages of computer-based language tests, explores in detail developments in Internet-based testing using the examples of TOEFL and DIALANG, an innovative on-line suite of diagnostic tests and self-assessment procedures in 14 European languages, and outlines a research agenda for the next decade.
© 2000 Elsevier Science Ltd. All rights reserved.
Keywords: Information technology; Computer-based language tests; Internet; Self-assessment
1. Uses of IT in testing
Computers are beginning to be used to deliver language tests in many settings. A
computer-based version of the Test of English as a Foreign Language (TOEFL) was
introduced on a regional basis in the summer of 1998. More and more tests are
available on CD ROM, and both the Intranet and the Internet are beginning to be
used to deliver tests to users who are at a distance. For example, the English as a
Second Language Placement Examination at UCLA is currently being adapted to be
delivered over the Web, and Internet-based tests of Chinese, Japanese, and Korean
are also being developed. Fulcher (1999) reports the use of the Internet to deliver a
placement test to students and describes a pilot study to investigate potential bias
against students who lack computer familiarity or have negative attitudes towards
technology. The study also estimates the usefulness of the test as a placement
instrument by comparing the accuracy of placement with a pencil-and-paper form of
the test.
2. Disadvantages of computer-based testing
There are, of course, dangers in such innovations. Many commentators have
remarked that computer-based tests (CBTs) are currently limited in the item types
that they allow. The multiple-choice technique is ubiquitous on computer, and
indeed has enjoyed something of a resurgence in language testing, where it had
previously begun to fall into disfavour (see Alderson, 1986, for an early reference to
this problem, and Alderson, 1996, for a more recent discussion). Similarly, cloze and
gap-filling techniques are frequently used where other techniques might be more
appropriate, but are much harder to implement in a setting where responses must be
machine-scorable.
Fairly obviously, a degree of computer literacy is required if users are not to be
disadvantaged by CBTs over paper-and-pencil tests. The ability to use a mouse and a keyboard is an obvious minimal requirement. Reading text from screen is not the same thing as reading from print, and the need to move to and fro through screens is much more limiting than being able to flick back and forth through print. It is noteworthy that the Educational Testing Service (ETS) conducted a large-scale study of computer literacy in the TOEFL-taking population and of the effect of computer literacy on TOEFL performance (Kirsch et al., 1998; Taylor et al., 1998). Although the study found no difference in TOEFL performance between those who were familiar with computers and those who were not, a significant 16% of the TOEFL population had negligible computer familiarity. ETS has therefore devised a tutorial which all CBT TOEFL takers must undergo before they take the CBT TOEFL for real, in an effort to remove any possible suggestion of bias against
computer illiterates.
This tutorial is available on sampler CDs which demonstrate the nature of the new CBT TOEFL, and it is also a mandatory, though untimed, part of every administration of CBT TOEFL. The tutorial not only familiarises candidates with the various test techniques used in CBT TOEFL and allows practice in responding to such items, it also gives instruction in how to use a mouse to select, point and scroll text, and to respond to different item types. There is also an untimed Writing tutorial for those
who plan to type their essay (CBT TOEFL makes the formerly optional Test of
Written English compulsory, although candidates have the option of writing their
essay by hand, instead of word-processing it).
Whether such efforts will be necessary in the future as more and more users become computer literate is hard to predict, but currently there is considerable concern about the effect of the lack of such literacy on test performance.
Perhaps most importantly, we currently appear to be limited in which language skills can be tested on a computer. The highly valued productive skills of speaking and writing can barely be assessed in any meaningful way at present, although, as we shall see later in this article, developments are moving apace, and it may be sooner than some commentators have expected that we will also be able to test the ability to respond to open-ended productive tasks in meaningful ways.
Despite these drawbacks, there is no doubt that the use of computers in language
testing will grow massively in the next few years, and that must in part be because
computer-mounted tests offer significant advantages to users or deliverers.
3. Technical advantages of computer-based testing
One obvious advantage of computer-based testing is that it removes the need for
fixed delivery dates and locations normally required by traditional paper-and-pencil-
based testing. Group administrations are unnecessary, and users can take the test at
a time (although not necessarily at a place) of their own choosing. CBTs can be
available at any time of the day or night, thus freeing users from the constraint of
test administration or even of group-administration: users can take the test on
their own, rather than being herded into a large room at somebody else's place and
convenience.

Another advantage is that results can be available immediately after the test, unlike paper-and-pencil-based tests, which take time to be collected and marked before results can be issued. As we shall see shortly, this has potential pedagogic advantages, as well as being of obvious benefit to the users (receiving institutions, as well as
candidates).
Whilst diskette- and CD ROM-based tests clearly have such advantages, tests delivered over the Internet are even more flexible in this regard: delivery or purchase of disks is not required, and anybody with access to the Internet can take a test. Diskettes and CD ROMs are fixed in format: somebody has to decide what goes onto the diskette. Once the disk has been pressed and distributed, it cannot normally be updated. However, with tests delivered by the Internet, it is possible to allow access to a much larger database of items, which can itself be updated constantly. Indeed, using the Internet, tests can be piloted alongside live test items, they can be calibrated on the fly, and can then be turned into live items as soon as the calibration is complete and the item parameters are known. Use of the Internet means that
results can be sent immediately to designated score users, which is not possible with
diskette- and CD ROM-based tests.
CBTs can make use of specially designed templates for item construction, and indeed some companies (Questionmark, for example) market special software to allow test developers to construct tailor-made tests. And software like Authorware (copyright 1993, Macromedia Inc.) can easily be used to facilitate test development, without the need for recourse to proprietary software.
The Internet can also be used for item authoring and reviewing: item writers can
input their items into specially prepared templates, review them on screen, and they
can then be stored in a database for later review, editing and piloting. This has the
added advantage of allowing test developers to employ any item writer who has
access to the Internet, regardless of their geographical location, an asset that
has proved invaluable in the case of international projects like DIALANG (see below).
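To make the idea of template-based authoring concrete, the sketch below shows one way an item draft could be represented as a record that a web form populates and a review database stores. All field names, statuses and the workflow are illustrative assumptions, not a description of DIALANG or any commercial authoring tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ItemDraft:
    """A hypothetical record an item writer might submit through a web template;
    field names are illustrative, not those of any actual authoring system."""
    item_id: str
    skill: str                  # e.g. "reading", "listening"
    stem: str                   # the question text shown to the candidate
    options: List[str]          # answer choices for a multiple-choice item
    key: int                    # index of the correct option
    status: str = "draft"       # assumed workflow: draft -> reviewed -> piloted -> live
    reviewer_notes: List[str] = field(default_factory=list)


draft = ItemDraft(
    item_id="R001",
    skill="reading",
    stem="What is the main idea of the passage?",
    options=["A summary", "A counter-argument", "An anecdote", "A definition"],
    key=0,
)
draft.reviewer_notes.append("Option 2 overlaps with option 1; revise.")
print(draft.status, len(draft.reviewer_notes))
```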

CBTs, and especially Internet-delivered tests, can access large databases of items. This means that test security can be greatly enhanced, since tests can be created by randomly accessing items in the database and producing different combinations of items. Thus any one individual is exposed to only a tiny fraction of available items, and any compromise of items that might occur will have negligible effect.
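As an illustration of this security argument, the minimal sketch below assembles a random test form from a larger bank, so that two candidates are unlikely to see the same combination of items. The bank layout and sampling policy are assumptions made for the example only.

```python
import random


def assemble_form(item_bank, items_per_skill):
    """Draw a random test form from a larger bank.
    item_bank maps a skill name to a list of item identifiers;
    items_per_skill says how many items to sample from each skill."""
    form = []
    for skill, n in items_per_skill.items():
        form.extend(random.sample(item_bank[skill], n))
    random.shuffle(form)
    return form


# Illustrative bank: 200 reading and 150 listening item IDs.
bank = {
    "reading": [f"R{i:03d}" for i in range(200)],
    "listening": [f"L{i:03d}" for i in range(150)],
}
print(assemble_form(bank, {"reading": 5, "listening": 5}))
```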
At present there is a degree of risk in delivering high-stakes tests over the Internet
(which is why ETS has not yet made the TOEFL available over the Internet, but
instead requires users to travel to specially designated test centres). Not only might
hackers break into the database and compromise items, but the difficulties of payment for registration, and the risk of impersonation, are considerable. However, for
tests which are deliberately low stakes, the security risks are less important, as there
is little incentive for users to cheat or otherwise to fool the system.
4. Computer-adaptive testing
The most frequently mentioned advantage of computer-based testing in the literature is the use of computer adaptivity. In computer-adaptive tests, the computer estimates the user's ability level on the fly. Once it has reached a rough estimate of the candidate's ability, the computer can then select the next item to be presented to the user to match that emerging ability level: if the user gets an item right, they can be presented with a more difficult item, and if the user gets the item wrong, they can be given an easier item. In this way, users are presented with items as close as possible to their ability level. Adaptivity makes tests somewhat more efficient, since only items close to the ability of the user are presented. It also means that test security can be enhanced, since different users take essentially different tests. It does,
however, require large banks of precalibrated items, and evidence that any one test
is equivalent, at least in measurement terms, to any other test that might have been
taken by a user with the same ability. For a useful review of computer-adaptive
testing, see Chalhoub-Deville and Deville (1999).
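For readers who want to see the adaptive logic spelled out, here is a minimal sketch of a computer-adaptive loop under a one-parameter (Rasch) model: the ability estimate is updated after each response and the next item is the unused one whose difficulty lies closest to that estimate. The item bank, the crude maximum-likelihood update and the fixed test length are all illustrative assumptions rather than a description of any operational test.

```python
import math
import random


def rasch_p(theta, b):
    """Probability of a correct response under the Rasch model."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def update_theta(theta, responses, difficulties, lr=0.5, steps=20):
    """Crude maximum-likelihood update of the ability estimate by
    gradient ascent on the Rasch log-likelihood of the responses so far."""
    for _ in range(steps):
        grad = sum(score - rasch_p(theta, difficulties[i]) for i, score in responses)
        theta += lr * grad / max(len(responses), 1)
    return theta


def next_item(theta, difficulties, used):
    """Pick the unused item whose difficulty is closest to the current estimate."""
    candidates = [i for i in range(len(difficulties)) if i not in used]
    return min(candidates, key=lambda i: abs(difficulties[i] - theta))


def run_cat(difficulties, answer_fn, test_length=15):
    """Administer an adaptive test; answer_fn(i) returns 1 (right) or 0 (wrong)."""
    theta, used, responses = 0.0, set(), []
    for _ in range(test_length):
        i = next_item(theta, difficulties, used)
        used.add(i)
        responses.append((i, answer_fn(i)))
        theta = update_theta(theta, responses, difficulties)
    return theta


if __name__ == "__main__":
    bank = [random.gauss(0, 1) for _ in range(50)]   # illustrative pre-calibrated bank
    true_ability = 0.8
    simulate = lambda i: int(random.random() < rasch_p(true_ability, bank[i]))
    print("Estimated ability:", round(run_cat(bank, simulate), 2))
```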
5. Pedagogic advantages of CBTs
Computer-adaptive tests are often argued to be more user-friendly, in that they avoid users being presented with frustratingly difficult or easy items. They might thus be argued to be more pedagogically appropriate than fixed-format tests.
Indeed, I would go further and argue that the use of computers in general to deliver tests has significant pedagogic advantages and can encourage the incorporation and integration of testing and assessment directly into the teaching and learning
process. I also believe that computer-based testing will encourage and facilitate an
increased use of tests for low-stakes purposes like diagnosis.
Arguably the major pedagogic advantage of delivering tests by computer is that they can be made more user-friendly than traditional paper-and-pencil tests. In particular, they offer the possibility of giving users immediate feedback once a response has been made. The test designer can allow candidates to receive feedback after each item has been responded to, or this can be delayed until the end of the subtest, the whole test, or even after a time delay, although it is unlikely that anybody would wish to delay feedback beyond the immediate end of the test. It is generally thought that feedback given immediately after an activity has been completed is likely to be more meaningful and to have more impact than feedback which is substantially delayed. Certainly, in traditional paper-and-pencil-based tests, feedback (the test results) can be delayed for up to several months, by which point candidates are unlikely to remember what their responses were, and thus the feedback is likely to be much less meaningful.
This is not a problem in settings where the only feedback the candidates are likely
to be interested in is whether they have passed or failed the test, but even then can-
didates are likely to appreciate knowing how their performance could have been
better, where their strengths and weaknesses lie, and what they might do better next
time. Interestingly, UK examining boards are currently considering how best to
allow candidates to review their own test papers once they have been marked. This is
presumably in the interests of allowing learners to learn from their marked performance. CBTs can offer this facility with ease.
If feedback is given immediately after an item has been attempted, the possibility
exists of allowing users to make a second attempt at the item, with or without penalties for doing so in the light of feedback. One interesting question then arises: if the user gets the item right the second time, which is the true measure of ability, the performance before or after the feedback? One might argue that the second performance is a better indication, since it results from users having learned something about their first performance and thus is closer to current ability. Clearly, research into this area is needed, and providing immediate feedback and allowing second attempts raises interesting research and theoretical questions.
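One simple way of operationalising the scoring question raised above is sketched here: full credit for a correct first attempt, partial credit for a correct second attempt after feedback. The penalty value is an arbitrary assumption; choosing and justifying it is precisely the kind of research question the text identifies.

```python
def score_with_second_attempt(first_correct, second_correct=None, penalty=0.5):
    """Score an item that allows a second attempt after feedback.
    Full credit for a correct first attempt; partial credit (1 - penalty)
    for a correct second attempt; zero otherwise. The penalty is illustrative."""
    if first_correct:
        return 1.0
    if second_correct:
        return 1.0 - penalty
    return 0.0


print(score_with_second_attempt(False, True))   # prints 0.5
print(score_with_second_attempt(True))          # prints 1.0
```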
Computers can also be user-friendly in offering a range of support to test takers. On-line Help facilities are, of course, common, and could be accessed during test performance in order to clarify instructions, for example, or to allow users to see an example of what they are supposed to do, and more. In addition, clues as to an appropriate performance could be made available, and the only limit on such clues would appear to be our ability to devise clues which are helpful without
directly revealing the answer or the expected performance. On-line dictionaries can
also be made available, with the advantage of being tailor-made to the text and test
being taken, rather than being all-purpose dictionaries of the paper-based sort.
Not only can such support be made available: its use can be monitored and taken
into account in deriving test results or test scores. It is perfectly straightforward for
the computer to adjust scores on an item for the use of support, or for the machine
to report the use of support alongside the test score. The challenge to testers and
applied linguists is clearly to decide how this should be reported, and how this
aects the meaning of the scores.
In similar vein, users can be asked how confident they are that the answer they have given is correct, and this confidence rating can be used to adjust the test score, or to help interpret results on particular items (users might unexpectedly get difficult items right, and the associated confidence rating might give insight into guessing or partial knowledge).
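A minimal sketch of how a confidence rating could be folded into scoring is given below: each item score is weighted by the stated confidence, and correct answers given with very low confidence are flagged as possible guesses. The weighting scheme and threshold are illustrative assumptions only.

```python
def confidence_weighted_score(responses):
    """responses is a list of (correct, confidence) pairs, where correct is 1 or 0
    and confidence is the candidate's self-rating from 0.0 to 1.0.
    Returns a weighted score and the indices of items flagged as possible guesses.
    The weighting is illustrative only."""
    weighted, flags = 0.0, []
    for idx, (correct, confidence) in enumerate(responses):
        weighted += correct * (0.5 + 0.5 * confidence)  # half-credit floor, full credit when certain
        if correct and confidence < 0.3:
            flags.append(idx)  # right answer given with little confidence: perhaps a guess
    return weighted, flags


score, guesses = confidence_weighted_score([(1, 0.9), (1, 0.1), (0, 0.8), (1, 0.5)])
print(score, guesses)  # prints 2.25 [1]
```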
An obvious extension of this principle of asking users to give insights into
their ability is the use of self-assessment: users can be asked to estimate their ability, and they can then be tested on that ability, with the possibility, via the computer, of an immediate comparison between the self-assessment and the actual performance. Such self-assessment can be in general terms, for example, about one's ability on the skill or area being tested, or it can be very specific to the text or task about to be tested. Users can then be encouraged to reflect upon possible
reasons why their self-assessed performance on a task did or did not match their
actual performance.
I have already discussed one aspect of user friendliness allowed by CBTs, adaptivity, which allows for an extensive amount of tailoring of test to user. However, such adaptivity of tests need not be merely psychometrically driven. It is possible to conceive of a test in which the user is given the choice of taking easier or more difficult items, especially in a context where the user is given immediate feedback on their performance. Indeed, the CBT TOEFL already allows users to see an estimated
test result immediately after taking the test, and to decide whether they wish their
score to be reported to potential receiving institutions, or whether they would prefer
the score not to be reported until they have improved it.
This notion of users driving the test can be extended. For example, assuming
large banks of items, users can be allowed to choose which skill they wish to be
tested on, or which level of difficulty they take a test at. In addition, they could be allowed to choose which language they wished to see test rubrics and examples in, and they could request the language in which results are presented. They could be allowed to make self-assessments in their own language rather than in the target language and to get detailed feedback on their results in a language other than the target language. Such learner-centredness already exists in the DIALANG testing
system (see below).
6. State of the art
Given this range of possibly innovative features in CBTs, what is the current state
of the art? At present, in fact, there is little evidence of innovation. True, there are
many applications of adaptive tests, but the test methods used are largely indis-
tinguishable from those used in paper-and-pencil-based testing, and no evidence has been gathered and presented that testifies to any change in the constructs being
measured, or to increased construct validity. Certainly, there is discussion of the
increased practicality of the shortened testing time, for example, but no attempt has
been made to show either that candidates find this an advantage, or that it represents any added value in other than strictly administrative convenience. Indeed one
of the problems of computer-adaptive tests is that users cannot change their minds
about what an appropriate response might be once they have been assigned another
item. This would seem to be a backward step, and to reduce the validity of measurement. Allowing second thoughts, after all, might well be thought to tap reflective
language use, if not spontaneous response.
ETS, the developer of TOEFL, claims, however, that CBT TOEFL does contain
innovative features. The listening section, for example, now uses photos and graphics ``to create context and support the content of the mini-lectures, producing stimuli that more closely approximate `real world' situations in which people do more than just listen to voices'' (ETS, 1998). But no evidence is presented that this
has any impact on candidates or on validity, and in any case, many such visuals
could easily have been added to the paper-and-pencil-based TOEFL. However,
some innovations in test method are noteworthy. Although the traditional four-option multiple-choice predominates, in one test method candidates are required to
select a visual or part of a visual. In some questions candidates must select two
choices, usually out of four, and in others candidates are asked to match or order
objects or texts. In addition, in CBT TOEFL, candidates wear headphones, can
adjust the volume control, and are allowed to control how soon the next question is
presented. Moreover, candidates see and hear the test questions before the response
options appear: it might be argued that this encourages candidates to construct their
own answer before being distracted by irrelevant options, but no studies of this
possibility are reported.
The Structure section retains the same two item types used in the paper-and-pencil
TOEFL, but the Reading section not only uses the traditional multiple choice, it may also require candidates to select a word, phrase, sentence or paragraph in the text itself, and other questions ask candidates to insert a sentence where it fits best. Although these techniques have been used elsewhere in paper-and-pencil tests, one advantage of their computer format is that the candidate can see the result of their choice in context, before making a final decision. This may have implications for
improved validity, as might the other item types, but again this remains to be
demonstrated.
In short, although the CBT TOEFL is computer-adaptive in the Listening and
Structure sections, there is little that appears to play to the computer's strengths
and possibilities as I have described. This is not to say that such innovations may
not appear later in the TOEFL 2000 project. Indeed, Bennett (1998) claims that the
best way to innovate in computer-based testing is first to mount on computer what
can already be done in paper-and-pencil format, with possible minor improvements
allowed by the medium, in order to ensure that the basic software works well, before
innovating in test method and construct. Once the delivery mechanisms work, it is
argued, then computer-based deliveries can be developed that incorporate desirable
innovations.
A European-Union-funded computer-based diagnostic testing project, DIALANG, has adopted this cautious approach to innovation in test method. At
the same time, it has incorporated major innovations in aspects of test design other
than test method, which are deliberately intended to experiment with and to
demonstrate the possibilities of the medium. Moreover, unlike CBT TOEFL, DIALANG is available over the Internet, and thus capitalises on the advantages of Internet-based delivery referred to earlier. DIALANG incorporates self-assessment
as an integral part of diagnosis, it provides immediate feedback, not only on scores,
but also on the relationship between test results and self-assessment, and it provides
for extensive explanatory and advisory feedback to users. The language of adminis-
tration, of self-assessment, and eventually of feedback, is chosen by the test user
from a list of 14 European languages, and users can decide which skill they wish to be tested in, in any one of 14 European languages. Although current test methods only consist of various forms of multiple-choice, gap-filling and short-answer questions, DIALANG has already developed demonstrations of 18 different experimental item types which can be implemented in the future, and it demonstrates the
use of help, clue, dictionary and multiple-attempt features, as well as the option to
have or to delay immediate feedback which, I have argued above, makes computer-
based testing not only more user-friendly, but also more compatible with language
pedagogy.
DIALANG currently suffers from the limitations of IT in assessing learners' productive language abilities. However, the experimental item types include an ingenious combination of self-assessment and benchmarking which holds considerable
promise. Tasks for the elicitation of speaking and writing performances have been
developed and administered to learners (in this case, of Finnish for the writing task,
and of English for the speaking task). Performances are rated by human judges, and
those performances which achieve the greatest agreement are selected as `benchmarks'. A DIALANG user is presented with the same task, and in the case of Writing, responds to the task via the keyboard. The user's performance is then presented on screen together with the pre-rated benchmarks. The user can view various performances at the different DIALANG/Council of Europe levels, and compare
their own performance with the benchmarks. In addition, since the benchmarks are
pre-analysed, the user can choose to see raters' comments on various features of the
benchmarks, in hypertext form, and consider whether they could produce a similar
quality of features.
In the case of Speaking, the candidate is simply asked to imagine how they would
respond to the task, rather than to actually record their performance. They are then
presented with recorded benchmark performances, and are asked to estimate whether they could do better or worse than each performance. Since the performances are graded, once the candidate has self-assessed himself against a number of performances, the system can tell him roughly what level his own (imagined) performance is likely to be.

Whilst this system is clearly entirely dependent on the user's willingness to play the game, the only person being cheated, if he does not do so, is the user himself. This
is so because DIALANG is intended to be low stakes and diagnostic. Indeed, it is
precisely because DIALANG is used for diagnosis, rather than for high-stakes
admissions or employment decisions, that it is possible to be so innovative. I believe
that DIALANG and low-stakes assessment in general offer many possibilities for exciting innovation in computer-based testing.
Other interesting developments include ``e-rater'' and ``PhonePass'', both proprietary names. Automated testing of productive language abilities has so far proved difficult, as we have seen. However, e-rater has been developed by ETS, in an attempt to develop `intelligent' IT-based systems that will arrive at the same conclusions about users' writing ability as do human raters. In essence, e-rater is trained
by being fed samples of open-ended essays that have been previously scored by
human raters, and it uses natural language processing techniques to duplicate the
performance of human raters. At present, the system is working operationally on
GMAT (Graduate Management Admissions Test) essays. E-rater research is ongoing for other programmes, such as GRE (Graduate Record Examinations), and a
challenge has been issued to the language education and assessment community to
produce essays that will `fool' the system by inducing it to give too high or too low
scores compared with those given by humans. This will doubtless encounter inter-
esting problems if it is adapted to deal with second/foreign language testing situa-
tions, but current progress is extremely promising. For more information on this
project, contact Don Powers, Jill Burstein, or Karen Kukich at ETS (dpowers@ets.org). Further information on e-rater is available online.
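The general approach of training a scorer on human-rated essays can be illustrated with a deliberately crude sketch: a few surface features are extracted from each essay, and a new essay is scored by averaging the human scores of its nearest neighbours in that feature space. This is not e-rater's method, which relies on much richer natural language processing; the features, the nearest-neighbour rule and the toy training data are all assumptions for illustration.

```python
import math


def essay_features(text):
    """Crude surface features: log length, average word length, type-token ratio.
    These stand in for the far richer NLP features a real system would use."""
    words = text.split()
    if not words:
        return (0.0, 0.0, 0.0)
    avg_len = sum(len(w) for w in words) / len(words)
    ttr = len(set(w.lower() for w in words)) / len(words)
    return (math.log(len(words) + 1), avg_len, ttr)


def predict_score(essay, scored_essays, k=3):
    """Predict a score by averaging the human scores of the k training essays
    whose feature vectors are closest to the new essay."""
    target = essay_features(essay)

    def dist(item):
        return math.dist(target, essay_features(item[0]))

    nearest = sorted(scored_essays, key=dist)[:k]
    return sum(score for _, score in nearest) / len(nearest)


# Tiny illustrative training set of (essay text, human score) pairs.
training = [
    ("Short answer with few ideas.", 2),
    ("A longer essay that develops an argument with several supporting points "
     "and some varied vocabulary across its sentences.", 4),
    ("An extended, well organised discussion that states a position, supports it "
     "with examples, considers counter-arguments and concludes clearly.", 5),
]
print(round(predict_score("A medium length response with one example given.", training, k=2), 1))
```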
A second project, using IT to assess aspects of the speaking ability of second/
foreign language learners of English, has already been developed and is being
extensively tested and researched. Users of PhonePass are given sample tasks in advance, and then have to respond to similar tasks over the telephone in `interaction' with a computer. Tasks currently include reading aloud, repeating sentences, saying opposite words, and giving short answers to questions. PhonePass returns a score that reflects a candidate's ability to understand and respond appropriately to decontextualized spoken material, with 40% of the evaluation reflecting the fluency and pronunciation of the responses.
The system uses speech recognition technology to rate responses, by comparing
candidate performance to statistical models of native and non-native performance
on the tasks. From these comparisons, subscores for fluency and pronunciation are derived, and are combined with IRT-based (Item Response Theory) scores derived
from right/wrong judgements based on the exact words recognized in the spoken
responses.
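A toy sketch of how such subscores might be combined is given below, using the 40% weighting for fluency and pronunciation quoted above; the scaling of the subscores and the simple linear combination are illustrative assumptions, not PhonePass's published scoring model.

```python
def combine_speaking_score(accuracy_score, fluency, pronunciation,
                           delivery_weight=0.4):
    """Combine an accuracy score (from right/wrong judgements of the recognised
    words, scaled 0-1) with fluency and pronunciation subscores (also 0-1).
    The 40% delivery weighting follows the figure quoted in the text;
    everything else here is an illustrative assumption."""
    delivery = (fluency + pronunciation) / 2
    return (1 - delivery_weight) * accuracy_score + delivery_weight * delivery


print(round(combine_speaking_score(accuracy_score=0.85, fluency=0.7, pronunciation=0.8), 2))  # 0.81
```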
Studies have shown the test to exhibit a reliability coefficient of 0.91, a correlation with the ETS Test of Spoken English of 0.88, and a correlation with an ILR (Interagency Language Round Table) Oral Proficiency Interview of 0.77 (Bernstein, personal communication, January 2000). These are encouraging results.
The scored sample is retained on a database, classified according to the various
scores assigned. Interestingly, this enables potential employers or other users to
access the same speech sample, in order to make their own judgements about the
performance for their speci®c purposes, and to compare how their candidate has
performed with other speech samples that have been rated either the same, or higher
or lower. Like DIALANG, this system thus incorporates elements of objective
assessment with an opportunity for user-assessment. Those interested in learning
more about the system can contact PhonePass at www.ordinate.com.
7. Conclusions: the need for a research agenda
One of the claimed advantages of computer-based assessment is that computers
can store enormous amounts of potential research data, including every keystroke
made by candidates and their sequence, the time it has taken a person to respond to
a task, as well as the correctness of the response, the use of help, clue and dictionary
facilities, and much more. The challenge to researchers is to make sense of this mass of data and, indeed, to design procedures for the gathering of useful, meaningful or
at least potentially meaningful data, rather than simply trawling through everything
that can possibly be gathered in the hope that something interesting might emerge.
In other words, a research agenda is needed.
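As a concrete illustration of the kind of process data a CBT could retain for research, the sketch below logs one record per item attempt, covering correctness, response time and the use of support facilities. The record structure and field names are hypothetical.

```python
import json
import time


def log_item_event(log, candidate_id, item_id, correct, response_time_s,
                   help_opened=False, dictionary_lookups=0, attempts=1):
    """Append one per-item record of the kind a CBT could retain for research.
    All field names are illustrative assumptions, not those of any real system."""
    log.append({
        "timestamp": time.time(),
        "candidate": candidate_id,
        "item": item_id,
        "correct": correct,
        "response_time_s": response_time_s,
        "help_opened": help_opened,
        "dictionary_lookups": dictionary_lookups,
        "attempts": attempts,
    })


events = []
log_item_event(events, "C042", "R017", correct=True, response_time_s=34.2,
               dictionary_lookups=1)
print(json.dumps(events, indent=2))
```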
Such a research agenda flows fairly naturally from two sources: issues that arise in the development of CBTs, and the claimed advantages and disadvantages of
computer-based testing. To illustrate the former case, let us take computer-adaptive
testing. When designing such tests, developers have to take a number of decisions:
what should the entry point or level be, and how is this best determined for any
given population? At what point should testing cease (the so-called exit point) and
what should the criteria be that determine this? How can content balance best be
assured in tests whose main principle for adaptation is psychometric? What are the
consequences of not allowing users to skip items, and can these be ameliorated?
How can it be ensured that some items are not presented much more frequently than others (item exposure) because of their facility or their content? The measurement literature has already addressed such questions, but language testing has yet to come to
grips with these issues.
Arguably more interesting are the issues surrounding the claimed advantages and
disadvantages of CBTs. Throughout this paper, I have emphasised the lack of, and the
need for, evidence that IT-based testing presents an advance in our ability to assess
language performance or proficiency. Whilst there may be an argument for IT-based
assessment that does not improve on current modes of assessment, because of the
compensations of the medium in terms of convenience, speed, and so on, we need to
be certain that IT-based assessment does not reduce the validity of what we do.
Thus a minimal research agenda would set out to identify comparative advantages
of each form of assessment, and to establish whether IT-based assessment offers perceived or actual added value. This would apply equally to perceptions of computer-adaptive testing, as well as to the claimed benefits of speed of assessment and
immediate availability of results, for example.
A substantive research agenda would investigate the effects of providing immediate feedback on attitudes, on performance, and on the measurement of ability. The effect of the range of support facilities that can be made available needs to be examined. Not only what are the effects, but how can they be maximised or minimised? What does the provision of support imply for the validity of the tests, and for
the constructs that can be measured? What is the value of allowing learners to have
a second attempt, with or without feedback on the success of their first attempt?
What do the resulting scores tell us, should scores be adjusted in the light of such
support, and if so, how?
What additional information can we learn from the provision of self-assessment
alongside `normal' assessment and how do users perceive the contrast between self-
assessment and `normal' assessment? What is the value of confidence testing, and
can it throw more light onto the nature of the constructs?
What is needed above all is research that will reveal more about the validity of the
tests, that will enable us to estimate the effects of the test method and delivery
medium; research that will provide insights into the processes and strategies test-
takers use; studies that will enable the exploration of the constructs that are being
measured, or that might be measured. Alongside development work that explores
how the potential of the medium might best be harnessed in test methods, support,
diagnosis and feedback, we need research that investigates the nature of the most
eective and meaningful feedback; the best ways of diagnosing strengths and weak-
nesses in language use; the most appropriate and meaningful clues that might
prompt a learner's best performance; the most appropriate use and integration of
media and multimedia that will allow us to measure those constructs that might
have eluded us in more traditional forms of measurement, e.g. latencies in spon-
taneous language use, planning and execution times in task performance, speed
reading and processing time more generally.
And we need research into the impact of the use of the technology on learning,
learners and the curriculum. If IT-based assessment can help us relate assessment
more closely to the learning process, as I have claimed, exactly how does this happen, if it does?
References
Alderson, J.C., 1986. Computers in language testing. In: Leech, G.N., Candlin, C.N. (Eds.), Computers in
English Language Education and Research. Longman, London, pp. 99–110.
Alderson, J.C., 1996. Do corpora have a role in language assessment?. In: Thomas, J., Short, M.H. (Eds.),
Using Corpora for Language Research. Longman, London, pp. 248–259.
Bennett, R.E., 1998. Reinventing Assessment: Speculations on the Future of Large-scale Educational
Testing. Educational Testing Service, Princeton, NJ.
Chalhoub-Deville, M., Deville, C., 1999. Computer adaptive testing in second language contexts. Annual
Review of Applied Linguistics 19, 273–299.
ETS, 1998. Computer-Based TOEFL Score User Guide. Educational Testing Service, Princeton, NJ.
Fulcher, G., 1999. Computerising an English language placement test. English Language Teaching Journal 53, 289–299.
Kirsch, I., Jamieson, J., Taylor, C., Eignor, D., 1998. Familiarity Among TOEFL Examinees (TOEFL
Research Report 59). Educational Testing Service, Princeton, NJ.
Taylor, C., Jamieson, J., Eignor, D., Kirsch, I., 1998. The Relationship Between Computer Familiarity
and Performance on Computer-based TOEFL Tasks (TOEFL Research Report 61). Educational
Testing Service, Princeton, NJ.
