ENGLISH LANGUAGE PROFICIENCY TESTS
Language Proficiency
Language proficiency is not defined in a uniform way. Cummins (1984) notes that some
describe language proficiency as consisting of 64 distinct components, while others hold
that it consists of only a single factor. Valdés and Figueroa (1994) point out that knowing
a language involves more than mastering its pronunciation, grammar, and politeness
conventions; it also requires command of a number of interrelated components that
interact with one another depending on the context of communication.
Oller and Damico (1991) observe that the precise elements of language proficiency have
not been settled and remain a matter of debate. Nevertheless, every language proficiency
test should be based on an accurate model or definition of language proficiency. The
Council of Chief State School Officers (CCSSO) defines an English-proficient student as
one who can use the language to ask questions, understand the teacher and the reading
materials, express his or her ideas, and answer the questions posed in class. The four
language skills that contribute to language proficiency are speaking, reading, listening,
and writing.
Canales (1994) grounds the definition of language proficiency in a socio-theoretical
framework: language is not viewed as a collection of separate parts (e.g., pronunciation,
vocabulary, and grammar). Language develops within a culture and serves as a medium
for conveying that culture's beliefs and customs (consider the problem of translating
idioms). Language proficiency is dynamic and contextual (it varies with the situation, the
status of the speakers, and the topic of conversation), discursive (it requires connected
utterances), and integrative, so that communicative competence can be achieved. In other
words, language proficiency is the ability to use discrete elements of language, such as
vocabulary, discourse structure, and body language, to convey meaning.
The language skills that underlie a student's academic success include the ability to
respond to questions from classmates and teachers about specific information, to ask
follow-up questions, and to synthesize reading material. Students must understand routine
oral instructions in large-group settings as well as comments addressed to peers in small
groups. In reading, students must be able to extract information from various types of
text. In writing, they must be able to produce short answers, paragraphs, essays, and
papers. Successful language learners must also know the social and cultural conventions
that govern language use.
The conception of language proficiency sketched above has at least two features. First,
the definition accommodates all four language skills: speaking, listening, reading, and
writing. Second, each definition places language proficiency in a particular context,
namely education. As a consequence, language proficiency tests should use procedures
that reflect, as far as possible, the contextualized language used in most English-medium
classrooms.
Valdés and Figueroa (1994) argue that a language proficiency test must identify the level
of demand imposed by the context and the kinds of language ability generally used by
monolingual English-speaking students who are, for the most part, successful in that
context. On that basis, we can establish criteria for measuring the language skills of
students who are not native speakers of English in order to decide whether they should be
educated in English or in their own language. This recommendation makes sense because
language proficiency tests are intended to help educators judge accurately whether or not
a student needs support in his or her learning. Such decisions become difficult when the
tasks on a test bear little resemblance to the tasks typically assigned in most classrooms.
General Nature of Language Proficiency Tests
Oller and Damico (1991) relate language proficiency testing to three schools of thought.
The first is the discrete-point approach, which rests on the assumption that language
consists of separable components such as phonology, morphology, the lexicon, syntax,
and so on, and that each component can be further divided into distinct elements (for
example, sounds into sound classes or phonemes, syllables, morphemes, words, idioms,
and phrase structures). On this view, a language test is not valid if it mixes several skills
or structural domains (Lado, 1961). Under this model, the ideal assessment would
evaluate every domain and every skill considered important; the results could then be
combined to form an overall picture of language proficiency (p. 82).
Discrete-point language proficiency tests typically use formats such as phoneme
discrimination, in which test takers are asked to decide whether two orally presented
words are the same or different (e.g., /ten/ versus /den/). Another example is a vocabulary
test that asks students to select the correct answer from a fixed set of options.
The weaknesses of the discrete-point model of testing include:
• the difficulty of limiting a language test to a single skill (e.g., writing) without
involving other skills (e.g., reading);
• the difficulty of limiting the test to a single linguistic domain (e.g., vocabulary)
without involving other domains (e.g., phonology); and
• the difficulty of testing language without involving context or relating it to
human experience.
According to Damico and Oller (1991), these limitations gave rise to a second trend in
language testing, the integrative or holistic approach. Such tests require that language
proficiency be assessed in a rich discourse context (p. 83). The underlying assumption is
that processing or using language involves more than one component of language (e.g.,
vocabulary, grammar, gesture) and more than one skill (e.g., listening, speaking).
Following this logic, an integrative task might ask a test taker to listen to a story and then
retell it, or to listen to a story and write it down.
The third trend in language testing described by Damico and Oller (1991) is known as
pragmatic language testing. The fundamental difference between this approach and the
integrative approach is the attempt to connect the testing situation to the test taker's
experience. As Oller and Damico (1991) note, language use in normal situations involves
people, places, events, and relationships that draw on the full range of experience, and
that range is constrained by time or temporal factors. Pragmatic language tests are
therefore designed to be as close to "real life", or as authentic, as possible.
In contrast with integrative tasks, the pragmatic approach asks test takers to carry out a
listening task only under the textual and temporal conditions that characterize that
activity. For example, if test takers are to listen to a story and then retell it, the following
conditions must be met. From a pragmatic point of view, language learners do not
usually listen to recorded stories; they usually listen to stories read aloud by an adult. A
task that requires listening to a recorded story therefore does not meet pragmatic
requirements. The pragmatic approach is characterized by the following:
• Normal visual input is provided (e.g., the reader's gestures, the print on the page,
and authentic pictures connected to the story).
• Timing is flexible, so that learners have the opportunity to ask questions, draw
inferences, and react naturally to the content of the story.
• The story, its theme, the reader, and the purpose of the activity form part of the
student's experience.
Oller and Damico (1991) see the strength of pragmatic testing in the fact that all the
purposes served by discretely constructed items (diagnosis, focus, isolation) are better
achieved through rich context. As a method of linguistic analysis, the discrete-point
approach has validity, but as a practical method for assessing language skills it is
misapplied, counterproductive, and logically impossible.
If the aim is to measure proficiency in grammar, vocabulary, or pronunciation, that aim is
more likely to be achieved through a pragmatic approach to language testing than
through a discrete-point approach.
Limitations of Current Language Proficiency Tests
A language proficiency test should be grounded in a theory or model of language
proficiency. However, there is no consensus among language specialists about the nature
of language proficiency. As a result, a variety of language proficiency tests have emerged
that differ from one another in fundamental ways. More importantly, different language
proficiency tests yield different language classifications (e.g., non-English speaking,
limited English speaking, and fully English proficient) for the same students (Ulibarri,
Spencer & Rivas, 1981).
Unfortunately, it is not only the test qualities with which educators must be concerned.
Related to the design of language proficiency tests, there may be a propensity for test
developers to use a discrete point approach to language testing. Valdés and Figueroa
(1994) state:
As might be expected, instruments developed to assess the language proficiency of
"bilingual" students borrowed directly from traditions of second and foreign language
testing. Rather than integrative and pragmatic, these language assessment instruments
tended to resemble discrete-point, paper-and-pencil tests administered orally. (p. 64)
Consequently, and to the degree that the above two points are accurate, currently
available language proficiency tests not only yield questionable results about students'
language abilities, but the results are also based on the most impoverished model of
language testing.
In closing this section of the handbook, consider the advice of Spolsky (1984):
Those involved with language tests, whether they are developing tests or using their
results, have three responsibilities. The first is to avoid certainty: Anyone who claims to
have a perfect test or to be prepared to make an important decision on the basis of a
single test result is acting irresponsibly. The second is to avoid mysticism: Whenever we
hide behind authority, technical jargon, statistics or cutely labelled new constructs, we are
equally guilty. Thirdly, and this is fundamental, we must always make sure that tests, like
dangerous drugs, are accurately labelled and used with considerable care. (p. 6)
In addition, bear in mind that the above advice applies to any testing situation (e.g.,
measuring intelligence, academic achievement, self-concept), not only language
proficiency testing. Remember also that the use of standardized language proficiency
testing, in the context of language minority education, is only about two decades old.
Much remains to be learned. Finally, there is little doubt that any procedure for assessing
a learner's language proficiency must also entail the use of additional strategically
selected measures (e.g., teacher judgments, miscue analysis, writing samples).
The Tests Described
The English language proficiency tests presented in this Guide are the:
1) Basic Inventory of Natural Language (Herbert, 1979);
2) Bilingual Syntax Measure (Burt, Dulay & Hernández-Chávez, 1975);
3) Idea Proficiency Test (Dalton, 1978; 1994);
4) Language Assessment Scales (De Avila & Duncan, 1978; 1991); and
5) Woodcock-Muñoz Language Survey (1993).
Test Descriptions and Publisher Information
Figure 1:
Five Standardized English Language Proficiency Tests Included in this Handbook

Basic Inventory of Natural Language (BINL)
CHECpoint Systems, Inc., 1520 North Waterman Ave., San Bernadino, CA 92404; 1-800-635-1235
The BINL (1979) is used to generate a measure of the K-12 student's oral language
proficiency. The test must be administered individually and uses large photographs to
elicit unstructured, spontaneous language samples from the student, which must be
tape-recorded for scoring purposes. The student's language sample is scored based on
fluency, level of complexity, and average sentence length. The test can be used for more
than 32 different languages.

Bilingual Syntax Measure (BSM) I and II
Psychological Corporation, P.O. Box 839954, San Antonio, TX 78283; 1-800-228-0752
The BSM I (1975) is designed to generate a measure of the K-2 student's oral language
proficiency; the BSM II (1978) is designed for grades 3 through 12. The oral language
sample is elicited using cartoon drawings with specific questions asked by the examiner.
The student's score is based on whether or not the student produces the desired
grammatical structure in their responses. Both the BSM I and BSM II are available in
Spanish and English.

Idea Proficiency Tests (IPT)
Ballard & Tighe Publishers, 480 Atlas Street, Brea, CA 92621; 1-800-321-4332
The various forms of the IPT (1978 & 1994) are designed to generate measures of oral
proficiency and of reading and writing ability for students in grades K through adult. The
oral measure must be individually administered, but the reading and writing tests can be
administered in small groups. In general, the tests can be described as discrete-point,
measuring content such as vocabulary, syntax, and reading for understanding. All forms
of the IPT are available in Spanish and English.

Language Assessment Scales (LAS)
CTB MacMillan McGraw-Hill, 2500 Garden Road, Monterey, CA 93940; 1-800-538-9547
The various forms of the LAS (1978 & 1991) are designed to generate measures of oral
proficiency and of reading and writing ability for students in grades K through adult. The
oral measure must be individually administered, but the reading and writing tests can be
administered in small groups. In general, the tests can be described as discrete-point and
holistic, measuring content such as vocabulary, minimal pairs, listening comprehension,
and story retelling. All forms of the LAS are available in Spanish and English.

Woodcock-Muñoz Language Survey
Riverside Publishing Co., 8420 Bryn Mawr Ave., Chicago, IL 60631; 1-800-323-9540
The Language Survey (1993) is designed to generate measures of cognitive aspects of
language proficiency for oral language as well as reading and writing for individuals 48
months and older. All parts of this test must be individually administered. The test is
discrete-point in nature and measures content such as vocabulary, verbal analogies, and
letter-word identification. The Language Survey is available in Spanish and English.
Approaches to Assessment
Assessment can be broadly divided into two areas, formal and informal, but as Farr
(1991, p. 496) cautions, they really are on a continuum because both are based on student
performance. Traditional formal assessment looks at what students know at the end of a
given period of instruction. Informal assessment looks at how a student knows as well as
what he knows. Formal assessments are usually published. Informal ones are usually
teacher-developed although there are published measures, including informal reading
inventories, checklists, surveys, and interview guides. Obviously, the measure that we as
educators choose determines the information that the instrument will yield. Therefore, we
must be very clear about our purpose when we choose an assessment instrument. The
choice of assessment instrument—from teacher observation to student survey to formal
published test—should be informed by the assessor’s purpose. Selection of the wrong
instrument will not allow inferences appropriate to the assessor’s needs. Traditionally,
administrators, seeking information about students’ success in reading, selected
published, standardized tests with available normative information, such as the Iowas.
This allowed them to compare district performance with statewide and national scores
and to comply with Title I requirements. Although the comparisons may have given them
confidence in the success of local curriculums, the scores yielded little information that
would help guide instruction or curriculum design.
Tests
The preponderance of objective, norm-referenced tests traditionally have offered students
little information about themselves as learners. However, the same could be said of the
uninformed use of a teacher’s pop quiz or the misuse of the portfolio as a mere paper
repository. Traditional testing is akin to a behaviorist’s view of the learner as the passive
recipient of data. Current testing theory is based on the cognitive psychologists’ view of
the learner as an active construer of meaning from the information available from the
environment. We now know, for example, that we should not try to decontextualize test
items by using short excerpts in reading that block the reader’s use of prior knowledge to
construct new information. Short passages prevent skilled readers from using the reading
strategies they would employ with a longer passage as they become familiar with the
topic and discover the organization of the text. Current theory dictates the use of long
passages across a variety of text types and topics to gain a valid indication of reader
proficiency. We no longer depend solely on short answers, such as multiple choice, but
include open-ended items that permit test takers more latitude to display their reading
skills. Parallel issues arise in the assessment of writing. We no longer assume that
students’ abilities to revise and edit a given text reflect their abilities to generate,
organize, and elaborate original ideas. In short, editing texts is not a complete test of
writing proficiency.
Current theory holds that any test that purports to be a valid test of writing must include
opportunity for the writer to compose original, well-organized text with varied sentence
structures and rich word choice using the conventions of standard written English. New
Jersey’s new 4th-, 8th-, and 11th-grade tests, which are aligned to the language arts
literacy standards, reflect much of current theory concerning learning and testing. Not
only do they incorporate long reading passages with opportunities for open-ended
responses to different text types and theme-based topics, but they also elicit multiple
writing samples from students. In addition, they provide opportunities for students to
integrate the reading and writing processes through decision making and problem solving
in order to compose an original text using information from a reading passage as support.
The tests also honor the hallmarks of assessment outlined by Case. They are valid
because they measure what they purport to measure, that is, they provide rich contexts for
the assessment of meaningful speaking, listening, writing, reading, and viewing
behaviors. The new tests are also fair because they are aligned to the language arts
literacy standards and indicators that have been published and distributed to educators,
who will share them with their students, parents, and the community. Furthermore, this
curriculum framework provides the same audiences with vignettes and activities that
vividly translate the standards into classroom practices. Teachers can use this material to
enhance student attainment of the standards and to foster student success on the new
tests.
2.1 Common characteristics across instruments.
Bachman’s (2000) review of the literature on language tests outlines the development of
language testing over the last 20 years. He points out that while testing practice from the
mid-1960s and the 1970s tended to be based on a construction of language as skills
(listening, speaking, reading, writing) and components (grammar, vocabulary,
pronunciation), such constructions were critiqued as new approaches to the study of
language emerged. Specifically, in the 1980s, the influence of communicative approaches
to language instruction was paramount. Since applied linguists were developing
approaches to teaching that focused on the co-construction of meaning, and the
importance of context-based communication, traditional assessments (such as those
developed in the 1970s) were ill-suited for the new approach. In the 1990s, test-makers
became concerned with issues such as the development of (a) new research
methodologies, such as criterion-referenced measurement, (b) practical advances, such as
pragmatics testing, (c) factors that affect test performance, (d) authentic and performance
assessment, and (e) ethical considerations of language testing.
2.2 Language constructs represented.
The tests reviewed above are based on the assumption that language proficiency can be
measured accurately by only sampling discrete aspects such as phonology, syntax,
morphology, and lexicon. The tests rarely consider aspects of language that can be crucial
to academic success, such as pragmatic competence (Cummins, 2000). In other words,
most language proficiency tests limit the construction of language proficiency to
grammatical competence. An important flaw with this construction is that to assess
grammatical competence, tests usually rely on prescriptivist notions of grammar. For
instance, if one such type of test were to assess students’ acquisition of the English verb
system, an item like (1) below might be presented.
(1) Dad called earlier. He ___________ (might/ is/ had/ might could) stop by later this
evening.
If a student were to fill in the blank with might could, he would probably be penalized
because the Standard English verb system allows one modal verb in that position.
However, if said student were a member of the group of native English speakers who
make a distinction between (2) and (3) below, such an item would be invalid:
(2) He might stop by later this evening.
(3) He might could stop by later this evening.
While the differences in meaning are subtle and pragmatically determined, in (3) there is
less likelihood that "he" will stop by than in (2) (Wolfram and Schilling-Estes, 1998:335).
Speakers of the dialect in which sentences like (3) are common need contextual cues in
order to distinguish the forcefulness of the assertion. However, a typical language
proficiency test would not allow for nuances in meaning made by speakers of so-called
non-Standard varieties of English.
Furthermore, to limit the construction of language proficiency to a closed set of
grammatical categories negates the real need for language learners to master
communicative principles which are essential in informal and academic contexts. After
all, language learners must develop a range of communicative styles to suit their
purposes. A language learner whose repertoire is limited to academic discourse styles
cannot be considered fully communicatively competent.
Up to this point we have discussed how commonly-used tests utilize similar constructions
of language proficiency, and how this construction of language proficiency is closely
linked to prescriptivist notions of standard grammar. In the next section we discuss the
criticisms that standardized language proficiency tests have received in test reviews.
2.3 Critiques of the four most commonly-used tests.
In addition to the limitation of language proficiency to grammatical competence, other
criticisms are revealed in test reviews. These have indicated that some of the common
shortcomings are (a) that many test items are not valid (Haber, 1985; Carpenter, 1994;
Hedberg, 1995; Kao, 1998), (b) that interrater reliability is low (Crocker, 1998), and (c)
that the tests are normed on populations that are not representative of the samples of
children to whom these measures are commonly administered (Chesterfield, 1985; Haber,
1985; Shellenberger, 1985; Lopez, 2001). Table 3 includes a summary of the reviews.
THE PURPOSES OF TESTS
Language tests are generally used:
• to make inferences about individuals’ language ability;
• to make predictions about individuals’ ability to use language in contexts outside the test
itself; and
• to make decisions about individuals.
Despite this variety, tests generally share some common goals (Zucker, 2003):
• measuring what students know and can do
• improving instruction
• helping students achieve higher standards
The purpose of tests is to provide educators, students, parents, and policy makers with
information that is valid, fair, and reliable. Standardized tests provide information that
helps support four critically important tasks for educators and the public:
1. Identify the instructional needs of individual students so educators can respond with
effective, targeted teaching and appropriate instructional materials;
2. Judge students’ proficiency in essential basic skills and challenging standards and
measure their educational growth over time;
3. Evaluate the effectiveness of educational programs; and,
4. Monitor schools for educational accountability including under the NCLB Act.
In sum, tests provide information to help students learn more successfully, teachers teach
more effectively, and schools become more accountable.
There are limits to testing, however. Tests are a necessary but not the exclusive means to
evaluate current achievement and students’ growth in skills. What may be tested is not,
and cannot be, inclusive of all of the desired outcomes of instruction. Tests should be
considered a means to an end and not ends in themselves. Tests should be used in
combination with other important types of information such as teacher judgments of
student work and classroom performance plus other individual and group assessments, to
measure achievement and growth.
TYPES OF TESTS
High-Stakes Testing
High-stakes testing has consequences attached to the results. For example, high-stakes
tests can be used to determine students’ promotion from grade to grade or graduation
from high school (Resnick, 2004; Cizek, 2001). State testing to document Adequate
Yearly Progress (AYP) in accordance with NCLB is called “high-stakes” because of the
consequences to schools (and of course to students) that fail to maintain a steady increase
in achievement across the subpopulations of the schools (i.e., minority, poor, and special
education students).
Low-Stakes Testing
Low-stakes testing has no consequences outside the school, although the results may
have classroom consequences such as contributing to students’ grades. Formative
assessment is a good example of low-stakes testing.
Formative Assessment
This assessment provides information about learning in process. It consists of the weekly
quizzes, tests, and even essays given by teachers to their classes. Teachers and students
use the results of formative assessments to understand how students are progressing and
to make adjustments in instruction. Rick Stiggins calls it “day-to-day classroom
assessment” and claims evidence that it has triggered “remarkable gains in student
achievement” (Stiggins, 2004).
Summative Assessment
Summative assessment provokes most of the controversy about testing because it
includes “high-stakes, standardized” testing carried out by the states. Summative
assessment records the state of student learning at certain end points in a student’s
academic career—at the end of a school year, or at certain grades such as grades 3, 5,
8, and 11. It literally “sums up” what students have learned.
INTERPRETING TEST RESULTS
In addition to addressing concerns of reliability, validity, and fairness, test
publishers design a standardized test according to how its results will be reported and
used. The number of correctly answered questions on a test, the student’s raw score, only
has meaning in the context of the test’s interpretive framework. Types of interpretive
frameworks include Norm-referenced Testing (NRT), Criterion-referenced Testing (CRT),
and Standards-based Testing.
Norm-referenced Testing (NRT)
A standardized test designed in the NRT interpretive framework can be used to compare a
test-taker’s results to the results of a reference group that has taken the same test. To
norm a test so that results can be compared, a test publisher gathers normative data
through field trials of the test with a representative, national sample of students. To
compare groups as large as entire school systems, norm referenced tests are typically
designed to cover a broad range of what test-takers are expected to know and be able to
do within a subject area. When reporting the results of a norm-referenced test, the test-taker's raw score can be used to make a comparison to the reference group in various
ways. Two common methods for making this comparison are to report the test’s result as
a percentile rank or as a stanine.
A percentile rank (PR) reports the percentage of test-takers in the reference group whose
results fall below a given score. For example, a test-taker with a PR of 80 on a test performed better
than 80% of the corresponding reference group. The highest possible PR is 99, meaning
that the test-taker scored higher than 99% of the reference group, while the lowest PR is
1, and a PR of 50 is the average. A stanine indicates the relative standing of a test-taker’s
score in comparison to the reference group with a low of one, a high of nine, and five as
the average. Stanines 1, 2, and 3 are considered “below average”; stanines 4, 5, and 6 are
considered “average”; and stanines 7, 8, and 9 are considered “above average.” Each
stanine represents an approximately equal unit of achievement. Therefore, the difference
between stanines 2 and 4 represents about the same difference in achievement as between
stanines 5 and 7. The percentage of scores in the reference group that are classified in
each stanine is 4, 7, 12, 17, 20, 17, 12, 7, and 4 respectively. Stanines may correspond to
certain ranges of percentile ranks and are typically presented as a curve.
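The arithmetic behind these two reporting scales can be illustrated with a short sketch. This is only a toy illustration, not any publisher's actual norming procedure: the reference-group scores and the raw score are invented, percentile ranks are capped at 1-99 as described above, and the stanine boundaries follow the 4-7-12-17-20-17-12-7-4 percentage split given in the text.

```python
from bisect import bisect_left

def percentile_rank(raw_score, reference_scores):
    """Percent of reference-group scores falling below the raw score, capped to the 1-99 range."""
    ordered = sorted(reference_scores)
    below = bisect_left(ordered, raw_score)
    pr = round(100 * below / len(ordered))
    return min(99, max(1, pr))

def stanine(pr):
    """Map a percentile rank onto the nine-unit stanine scale."""
    cumulative = [4, 11, 23, 40, 60, 77, 89, 96, 100]  # running totals of 4, 7, 12, 17, 20, 17, 12, 7, 4
    for s, upper in enumerate(cumulative, start=1):
        if pr <= upper:
            return s
    return 9

# Invented reference group and raw score, for illustration only.
reference = [12, 15, 18, 20, 22, 23, 25, 26, 28, 30, 31, 33, 35, 36, 38, 40, 41, 43, 45, 47]
raw = 38
pr = percentile_rank(raw, reference)
print(f"raw score = {raw}, percentile rank = {pr}, stanine = {stanine(pr)}")
```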
Norm-Referenced and Criterion-Referenced Tests
A prospective purchaser of tests is faced with a choice, to buy norm-referenced or
criterion-referenced tests. The design and functions of each are so different that it is
necessary to discuss them in some detail.
Norm-Referenced Tests
These tests are designed to compare individual students’
achievement to that of a “norm group,” a representative
sample of his or her peers. The design is governed by the
normal or bell-shaped curve in the sense that all
elements of the test are directed towards spreading out
the results on the curve (Monetti, 2003; NASBE, 2001;
Zucker, 2003; Popham, 1999). The curve-governed
design of norm-referenced tests means that they do not
compare the students’ achievement to standards for what
they should know and be able to do—they only compare
students to other students who are assumed to be in the same norm group. The Educators’
Handbook on Effective Testing (2002) lists the norms frequently used by major testing
publishers. For example, the available norms for the Iowa Test of Basic Skills are:
districts of similar sizes, regions of the country, socio-economic status, ethnicity, and type
of school (e.g., public, Catholic, private non-Catholic) in addition to a representation of
students nationally.
Purchasers of norm-referenced tests need to ensure that the chosen norm is a useful
comparison for their students. Purchasers should also be sure that the norm has been
developed recently, because populations change rapidly. A norm including a small
percentage of English language learners can become a norm with almost 50 percent
English language learners in less than the ten-year interval before it is revised.
Results of norm-referenced tests are frequently reported in terms of percentiles: a score in
the 70th percentile means that the student has done better than 70 percent of the others in
the norm group (Monetti, 2003). Percentile rankings are often used to identify students
for various academic programs such as gifted and talented, regular, or remedial classes.
On a symmetrical bell curve, a score in the 50th percentile is the average.
Because norm-referenced tests are designed to spread students’ scores along the bell
curve, the questions asked in the tests do not necessarily represent the knowledge and
skills that all students are expected to have learned. Instead, during the test development
process, “test items answered correctly by 80 percent or more of the test takers don’t
make it past the final cut [into the final test]” writes Popham (1999).
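Popham's description of the item screen can be made concrete with a small sketch. The item difficulty values (p-values) below are fabricated for the example; operational test development uses large field-trial samples and several additional statistics.

```python
# Proportion of field-trial examinees answering each item correctly (the item "p-value").
# Under the rule Popham describes, items answered correctly by 80 percent or more of
# test takers are dropped, because they do not spread examinees along the bell curve.
field_trial_p_values = {
    "item_01": 0.92,  # too easy under the 80-percent rule
    "item_02": 0.55,
    "item_03": 0.81,  # too easy under the 80-percent rule
    "item_04": 0.48,
    "item_05": 0.67,
}

P_VALUE_CUT = 0.80

final_form = [item for item, p in field_trial_p_values.items() if p < P_VALUE_CUT]
print("items retained for the final form:", final_form)
```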
Norm-referenced tests lead to frustration on two counts. First they frustrate the teacher’s
success in teaching important knowledge and skills because students are unlikely to face
questions about that skill and knowledge on the test (Popham, 1999). Second, no group of
students can achieve at higher levels without others achieving at lower levels. Norm-referenced tests make it mathematically impossible for "all the children to be above
average” (ERS; Burley, 2002).
Criterion-referenced Testing (CRT)
Rather than compare a student's test result with the results of a reference group, criterion-referenced tests are intended to measure a level of mastery according to a specific set of
performance standards. Hence, the content of a criterion-referenced test often includes
more focused subject matter than a norm-referenced test. The test-taker’s score
corresponds to a performance level, such as basic, proficient, or advanced. NCLB
requires each state to design or select an assessment yielding results that can be used to
classify students into performance levels for the corresponding academic subject.
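The basic logic of a criterion-referenced report is a lookup against fixed cut-scores rather than against other examinees. The cut-scores and score values in this sketch are hypothetical; in practice each state sets its own cuts through a standard-setting process.

```python
def performance_level(score, cuts=((80, "advanced"), (60, "proficient"), (40, "basic"))):
    """Classify a score against hypothetical cut-scores, checked from the highest cut down."""
    for cut, label in cuts:
        if score >= cut:
            return label
    return "below basic"

for s in (35, 52, 71, 88):
    print(f"score {s} -> {performance_level(s)}")
```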
What Is The Difference Between A Criterion-Referenced Test and A Norm-Referenced
Test?
All standardized tests now administered to elementary and secondary school students
measure student achievement against a set of academic standards or curricular objectives.
The standards may be common among the states and major national academic
organizations, thus enabling national comparisons. Or, the standards may be local
standards chosen by the school district or state, which may only allow local comparisons
among students in a district or state. There are many ways to report and interpret the
results of a standardized test. One way is based on specific criteria, such as academic
skills or objectives and academic achievement standards developed at the state or local
level. For example, “She has demonstrated mastery of reading at the third-grade level” is
a determination made by a criterion-referenced test (CRT). A standardized test also can
describe a student’s performance compared to other students nationally or locally. For
example, “He reads better than 90 percent of fourth grade students nationally” is a
determination made from a norm-referenced test (NRT). A student’s score on a CRT
using local academic standards is intended to be compared only with other students who
have taken the same test. In contrast, a student’s scores on an NRT can show performance
on academic standards and also enable comparisons with students both locally and
nationally. When a local CRT is used with a national NRT, the results can be interpreted
together to obtain more comprehensive information about a student’s performance. For
example, “She is ‘proficient’ on a state mandated CRT and is performing at an academic
level that is better than 70 percent of students nationwide.”
Criterion-Referenced Tests
These tests are designed to show how students achieve in comparison to standards,
usually state standards (NASBE, 2001; Wilde, 2004; Zucker, 2003). In contrast to norm-referenced tests, it is theoretically possible for all students to achieve the highest—or the
lowest—score, because there is no attempt to compare students to each other, only to the
standards. Results are reported in levels that are typically basic, proficient, and advanced.
The test items are not chosen to sort students but to ascertain whether they have mastered
the knowledge and skills contained in the standards.
Criterion-referenced tests—sometimes, more correctly, called standards-based tests—
begin from a state’s standards, which list the knowledge and skills students are expected
to learn. Because standards are usually far more numerous than could ever be included in
a test, test designers work with teachers and content specialists to narrow down the
standards to essential knowledge and skills at the grades to be tested. They are the basis
for the development of test items.
The number of criterion-referenced tests in use at the state level has dramatically
increased since NCLB was implemented in 2001 (NCES, 2005), because they measure
achievement of the knowledge and skills required by state standards. At this writing, 44
states now use criterion-referenced assessments: 24 states use only criterion-referenced
tests, and the other 20 use both criterion-referenced tests and norm-referenced tests.
Thirteen states use “hybrid” tests, single tests that are reported both as norm-referenced
tests (in percentiles or stanines—a nine-point scale used for normalized test scores) and
as criterion-referenced tests (in basic, proficient, and advanced levels) in an attempt to
show at the same time where students score in relation to standards and in relation to a
norm group. Only one state, Iowa (home of the Iowa Test of Basic Skills, and also the
only state in the nation without state academic standards) uses a norm-referenced test
alone (Education Week 2006).
STANDARDS-BASED TESTING
Standards-based testing allows states to accomplish both objectives (NRT and CRT) at
once by incorporating elements of norm-referenced and criterion referenced testing. A
standards-based test is both normed to a reference group and aligned to a set of
performance standards. This framework, also called the augmented NRT model, enables
states to report standards-based information (content standards scores), performance
levels (cut-scores), and percentile rank information for every student. For example, a test
publisher can use a state’s academic standards to augment an existing norm-referenced
test so that the test taker’s results can be used for both comparisons to a reference group
and assigning performance levels. Typically, statewide results from the first year that a
standards-based test is administered are used to establish the test’s reference group.
Careful design by the test publisher ensures that the test is valid for measuring student
mastery of the academic standards. Because NCLB requires states to report student
performance levels while also comparing the results of specified student populations to
the results of previous years, properly designed standards-based tests are especially suited
to meet NCLB requirements.
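A student report under this augmented model simply pairs the two interpretations of the same score. The scale scores, cut-scores, and norm table below are all invented for illustration; they do not come from any actual state test.

```python
# Hypothetical augmented-NRT report: one scale score interpreted two ways.
cut_scores = [(235, "advanced"), (215, "proficient"), (200, "basic")]   # hypothetical state cuts
norm_table = {190: 23, 200: 38, 210: 52, 214: 58, 220: 67, 235: 85}     # scale score -> percentile rank

def report(scale_score):
    level = next((label for cut, label in cut_scores if scale_score >= cut), "below basic")
    return {"scale score": scale_score,
            "performance level": level,
            "percentile rank": norm_table[scale_score]}

print(report(214))  # e.g. {'scale score': 214, 'performance level': 'basic', 'percentile rank': 58}
```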
Standardized testing means that a test is “administered and scored in a predetermined,
standard manner” (Popham, 1999). Students take the same test in the same conditions at
the same time, if possible, so results can be attributed to student performance and not to
differences in the administration or form of the test (Wilde, 2004). For this reason, the
results of standardized tests can be compared across schools, districts, or states.
Standardized testing is sometimes used as a shorthand expression for machine scored
multiple-choice tests. As we will see, however, standardized tests can have almost any
format.
A standardized achievement test is, simply, a test that is developed using standard
procedures and is then administered and scored in a consistent manner for all test takers.
Students respond to identical or very similar questions under the same conditions and test
directions. The standardization of test questions, directions, conditions of testing, and
scoring is needed to make test scores comparable and to assure, as much as possible, that
test takers have equal, unbiased opportunities to demonstrate what they know and can do.
Standardization can apply to any type or format of test. However, some types of
educational tests such as classroom and teacher-developed tests are not usually
considered to be “standardized” tests because they are given under varying conditions
and are scored using variable rules. Standardized tests may be used for a variety of
purposes. One purpose of testing is to enable educators to make high-stakes decisions
about individual students through measures such as high school graduation tests. In
contrast, the annual testing provisions of the NCLB Act are used to inform schools,
teachers, and parents about student improvement in the classroom and to hold schools
and states accountable for such improvement.
How Are Standardized Tests Used?
Information from standardized tests can be used for many purposes. These purposes may
include:
Supporting instructional decisions for individual students by identifying their
instructional needs. A test may be used to diagnose a student’s strengths and weaknesses,
thus allowing the teacher or school to choose effective instructional programs for the
student.
Demonstrating students’ proficiency in basic skills and their ability to meet academic
standards. Test results are used by states to demonstrate individual student mastery of
specified levels of achievement.
Informing parents and the public about school and student performance.
States administer standardized assessments and report the results, in part to inform the
public about how well the schools and their students are progressing over time and
compared to other localities or schools. Many states and districts publish annual report
cards on school districts and individual schools. The results of the tests can motivate
education reform by informing and influencing parents to take action to improve the
quality of local schools.
Holding schools and educators accountable for student performance on tests aligned to
high standards of what students should know and be able to do.
Consequences are often attached to test results and may include school improvement
plans, technical assistance, increased or decreased funding for schools, salary bonuses,
promotions, loss of accreditation and takeovers of local schools by the state. Such
consequences are used to leverage change at the school and classroom level.
Evaluating programs. Many federal and state education programs use standardized tests
to determine if public policy objectives are being achieved, and if public funds are well spent.
Determining rewards and sanctions. Tests may be used for high-stakes purposes with
rewards and sanctions to make decisions about individual students, such as placement in
specific programs or classes, graduation from high school, or promotion to the next
grade.
TEST FORMATS
Multiple-choice questions: Many standardized tests require students to select a single
correct response
to each test question (called “items”) from among a small number of specific choices.
This format—called “multiple choice” or “selected response”—is efficient, practical, and
usually produces highly reliable results. Multiple-choice tests offer the advantages of
objectivity and uniformity in scoring, ease of administration, and low cost.
Performance assessment questions: Performance assessments require students to generate
a response to a question rather than choosing from a set of responses provided to them.
Examples include exhibitions, investigations, demonstrations, written or oral responses,
journals, and portfolios. Performance assessments can be given and scored according to
standard procedures and rules so that a test containing performance assessment questions
is a standardized test. Performance assessments typically focus on the process of problem
solving rather than on answers or solutions. Tests including performance assessments,
however, are generally less reliable, more difficult to score, and more costly than tests
using multiple choice items.
Constructed-response questions: Constructed-response items may be one type of
performance assessment, in which students are given the opportunity to fill-in-a-blank or
provide a brief written response to a question, rather than select from an array of possible
answers. Constructed-response questions are often included, along with multiple choice
questions, on a test to obtain additional and different types of information about what a
student knows or can do.
Test Question Formats
While there is no set format for all questions on standardized tests, the most common
standardized test question formats include Multiple-choice Questions and Short-answer
Questions.
Short-answer Questions
The short-answer question format, also known as the open-ended or constructed response
format, presents the test-taker with a question that is answered by a fill-in-the-blank or
short written response. Answers to constructed-response questions are hand-scored using
a rubric that allows for a range of acceptable and partially correct answers. Questions and
answers in this format provide a more sophisticated evaluation of student performance
than selected-response questions. However, the reliability of scores obtained using
constructed-response questions depends more heavily on the scoring method. Carefully
designed constructed response questions with a clear scoring rubric can provide
important information about student performance and knowledge that cannot be as
effectively demonstrated by the selected-response format.
Open-Ended Tests
These test items ask students to respond either by writing a few sentences in short answer
form, or by writing an extended essay. Open-ended questions are also known as
“constructed response” because test-takers must construct their response as opposed to
selecting a correct answer (Zucker, 2003). The advantage of open-ended items is that
they allow a student to display knowledge and apply critical thinking skills. It
is particularly difficult to assess writing ability, for example, without an essay or writing
sample.
The disadvantage is that constructed-response items require human readers, although
attempts are being made to develop computer programs to score essays (Sireci, 2000;
Rudner, 2001; Shermis, 2001). Short-answer questions can be scored by looking for key
terms since they often don’t ask for complete sentences. But many state assessments ask
for an extended essay, often in separate tests from the one used to report AYP. Companies
across the United States assemble groups of qualified people, often retired teachers when
they can get them, to read and score essays or long answers using a common rubric for
scoring (Stover, 1999).
A rubric is a guide to scoring that provides a detailed description of essays that should be
given a particular score (frequently one to six points, with six being the best). After
extensive training with models of each score, two readers rate an essay independently. If
their scores differ, a third reader reads the essay without knowing the two preceding
scores. Group scoring of essays has a long history and has proved to be remarkably
reliable (Mitchell, 1992).
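The two-reader procedure just described can be written out as a short adjudication rule. The resolution step (combining the third reading with the closer of the first two) is only one possible convention, assumed here for the sketch; programs differ in exactly how the final score is formed.

```python
def score_essay(reader1, reader2, third_reading=None):
    """Score an essay rated on a 1-6 rubric by two independent readers.

    If the readers agree, that score stands. If they differ, a third reading is
    required; here it is averaged with the closer of the first two readings
    (an assumed convention -- actual programs vary).
    """
    if reader1 == reader2:
        return float(reader1)
    if third_reading is None:
        raise ValueError("readers disagree: a third reading is required")
    closer = min((reader1, reader2), key=lambda r: abs(r - third_reading))
    return (closer + third_reading) / 2

print(score_essay(4, 4))      # agreement -> 4.0
print(score_essay(2, 4, 4))   # disagreement resolved by the third reading -> 4.0
```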
Essays and long answers have the desirable effect of promoting more writing and writing
instruction in the classroom, but they are expensive to score. Multiple-choice testing is
less expensive because it is scored by machine (ERS; NASBE, 2001). Differences in cost
can be gauged from a U.S. General Accounting Office report estimating that from 2002 to 2008, states will spend $1.9 billion on mandated testing if they use only machine-scored
multiple-choice tests. States will spend $3.9 billion if they maintain the present mixture
of multiple-choice and a few open-ended items. They will spend $5.3 billion if they
increase the use of open-ended items—including essays—making the cost of using open-ended items more than 2.5 times the amount of using multiple-choice tests alone (GAO,
2003). Clearly, the difference in cost makes testing choices difficult.
Performance Assessment
Also called authentic assessment, performance assessment challenges students to perform
a task just as it would be performed in the classroom or in life (e.g., a science experiment,
a piano recital). Performance assessment was widely promoted in the early 1990s
(Mitchell, 1992), but it is time-consuming, difficult to standardize, and expensive.
Portfolios
Portfolios are a type of performance assessment that were also popular before 2001, when
state testing in accordance with NCLB came to dominate. Portfolios are collections of
student work designed to show growth over a semester or a year. However, they are
difficult to evaluate accurately, because their production and contents can not be
standardized (Gearhart, 1993). Both portfolios and performance assessment are now used
as formative rather than summative assessment.
QUALITIES OF AN EFFECTIVE TEST
The requirements of NCLB pose a significant challenge to state educational systems: All
students must have the same chance to be successful at showing what they know and can
do in periodic, high-stakes assessments. Consequently, states must select or design high-quality tests that can be used by the general student population while meeting the special
requirements of certain groups and even the needs of individual students. Moreover, the
high stakes involved compel states to be certain that the tests accurately measure student
achievement. All standardized tests must meet psychometric (test study, design, and
administration) standards for reliability, validity, and lack of bias (Zucker, 2003; Bracey,
2002; Joint Committee on Testing Practices, 2004).
For a test to solve this combination of challenges effectively, it must be proven to be:
• Reliable – The test must produce consistent results. Reliability means that the test is so
internally consistent that a student could take it repeatedly and get approximately the
same score (a brief computational sketch follows this list).
• Valid – The test must be shown to measure what it is intended to measure.
• Unbiased – The test should not place students at a disadvantage because of gender,
ethnicity, language, or disability.
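As a companion to the reliability requirement above, the sketch below computes one common internal-consistency index, Cronbach's alpha, on a tiny invented item-response matrix. It illustrates what "internally consistent" means computationally; it is not the procedure any particular publisher uses, and real reliability studies rest on far larger samples.

```python
def cronbach_alpha(scores):
    """Cronbach's alpha for a matrix of item scores (rows = examinees, columns = items)."""
    n_items = len(scores[0])

    def sample_variance(xs):
        mean = sum(xs) / len(xs)
        return sum((x - mean) ** 2 for x in xs) / (len(xs) - 1)

    item_variances = [sample_variance([row[j] for row in scores]) for j in range(n_items)]
    total_variance = sample_variance([sum(row) for row in scores])
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)

# Invented 0/1 responses: six examinees answering five items.
responses = [
    [1, 1, 1, 0, 1],
    [1, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0],
    [1, 1, 1, 1, 0],
]
print(f"Cronbach's alpha = {cronbach_alpha(responses):.2f}")  # about 0.73 for these invented data
```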
References
ACT, Inc. & The Education Trust. (2004). On course for success: A close look at selected high school
courses that prepare all students for college and work. Washington DC: The Education Trust.
Bracey, G. W. 2002. Put to the test: An educator’s and consumer’s guide to standardized testing. (2nd ed.)
Bloomington IN: Phi Delta Kappa International.
Burley, H. (2002, February). A Measure of Knowledge. American School Board Journal,18(2).
Cannell, J. J. (1987). Nationally normed elementary achievement testing in America’s public schools: How
all fifty states are above the national average. West Virginia: Friends for Education.
Cizek, G. J. (1998). Filling in the blanks: Putting standardized tests to the test. Washington D.C.: The
Thomas B. Fordham Foundation.
Cizek, G. J. (2001, Winter). More unintended consequences of high-stakes testing. Educational
Measurement, Issues and Practice, 20(4), 19-28.
Darling-Hammond, L. (2004, June). Standards, accountability, and school reform. Teachers College Record,
106(6), 1047-1085.
Data connections: Using assessment to improve teaching and learning [CD-ROM]. (2002). Charleston, West
Virginia: Edvantia (Formerly Appalachian Educational Laboratory).
Dickinson, A. C., Friedman, M. I., Hatch, C. W., Jacobs, J. E., Nickerson, A. B., & Schnepel, K. C. (2002).
Educators’ handbook on effective testing. Columbia, SC: Institute for Evidence-Based Decision-Making in
Education.
Educational Research Service. (n.d.). Focus on high-stakes testing. Arlington VA: Educational Research
Service
Education Week, (2006). Quality Counts At 10. Washington D.C.: Editorial Projects in Education
General Accounting Office, (2003). Characteristics of tests will influence expenses: Information sharing may
help states realize efficiencies. Washington D.C.: United States General Accounting Office.
Gearhart, M., Herman, J. L., Baker, E. L., & Whittaker, A. K. (1993, July). Whose work is it? A question for the
validity of large-scale portfolio assessment. CSE Technical Report 363.
Goldberg, M. (2005, January). Test mess 2: Are we doing better a year later? Phi Delta Kappan, 86(5), 389-400.
Herman, J. L., & Baker, E. L. (2005, November). Making benchmark testing work. Educational Leadership,
63(3), 49-53.
Joint Committee on Testing Practices. (2004). Code of fair testing practices in education (Revised).
Washington D.C.: American Psychological Association.
Lemann, N. (1999). The big test. New York: Farrar, Strauss, and Giroux.
Linn, R. L. (2005, Summer). Fixing the NCLB accountability system. CRESST Policy Brief 8.
McIntire, T. (2005, April). Data: Maximize your mining, part one. Technology and Learning, 25(9).
Mitchell, R. (1992). Testing for learning: How new approaches to evaluation can improve American schools.
New York: Free Press.
Monetti, D. M., & Hinkle, K. T. (2003). Five important test interpretation skills for school counselors. ERIC
Digest. ED481472 2003-09-00.
National Association of State Boards of Education. (2001). A primer on state accountability and large-scale
assessments.
National Education Goals Panel. (1998). Talking about tests: An idea book for state leaders. Washington
DC: United States Department of Education.
National Center for Education Statistics. (2005). State education reforms. Standards, assessment, and
accountability. Table 1.5. Names and types of statewide assessments administered, by state: 2003-4 [Online
report]. Retrieved December 7, 2005.
National Center for Education Statistics. (2005, August). Online assessment in mathematics and writing:
Reports for the NAEP technology-based assessment project, research and development series. Washington
DC: United States Department of Education.
Popham, J. W. (1999, March). Why standardized tests don’t measure educational quality. Educational
Leadership, 56(6), 8-15.
Princeton Review. (2003). Testing the testers 2003: An annual ranking of state accountability systems.
Resnick, B. (2004, April). Majority of districts/schools employ “high-stakes” testing. Successful School
Marketer. Retrieved December 9, 2005.
Resnick, M. (2004). The educated student: Defining and advancing student achievement. Alexandria VA:
National School Boards Association.
Rudner, L., & Gagne, P. (2001). An overview of three approaches to scoring written essays by computer.
ERIC Digest. ED458290 2001-12-0
Shermis, M. D., Rasmussen, J. L., Rajecki, D. W., Olson, J., & Marsilio, C. (2001). All prompts are created
equal, but some prompts are more equal than others. Journal of Applied Measurement, 2(2), 154-70.
Sireci, S. G., & Rizavi, S. (2000). Comparing computerized and human scoring of students’ essays. New
York: The College Board. ERIC report number 354.
Stiggins, R. (2004, September). New assessment beliefs for a new school mission. Phi Delta Kappan, 88(1),
22-27.
Stokes, V. (2005, October). No longer a year behind. Learning and Leading with Technology, 33(2), 15-17.
Stover, D. (1999, March, 23). Who grades the essays on standardized tests? School Board News, p. 3.
Toch, T. (2006, January). Margins of Error: The Education Testing Industry in the No Child Left Behind Era.
Washington, D.C.: Education Sector.
Wilde, J. (2004, January). Definitions for the no child left behind act of 2001: Assessment. Washington DC:
National Clearinghouse for English Language Acquisition (NCELA).
Zucker, S. (2003, December). Fundamentals of standardized testing. San Antonio TX: Harcourt Assessment,
Inc.
References
American Psychological Association. (1985). Standards for Educational and Psychological Testing.
Washington, DC: American Psychological Association.
Amori, B. A., Dalton, E.F., & Tighe, P.L. (1992). IPT 1 Reading & Writing, Grades 2-3, Form 1A, English.
Brea, CA: Ballard & Tighe, Publishers.
Anastasi, A. (1988). Psychological Testing (sixth edition). New York, NY: Macmillan Publishing Company.
Ballard, W.S., Tighe, P.L., & Dalton, E. F. (1979, 1982, 1984, & 1991). Examiner's Manual IPT I, Oral Grades
K-6, Forms A, B, C, and D English. Brea, CA: Ballard & Tighe, Publishers.
Ballard, W.S., Tighe, P.L., & Dalton, E. F. (1979, 1982, 1984, & 1991). Technical Manual IPT I, Oral Grades
K-6, Forms C and D English. Brea, CA: Ballard & Tighe, Publishers.
Burt, M.K., Dulay, H.C., & Hernández-Chávez, E., (1976). Bilingual Syntax Measure I, Technical Handbook.
San Antonio, TX: Harcourt, Brace, Jovanovich, Inc.
Burt, M.K., Dulay, H.C., Hernández-Chávez, E., & Taleporos, E. (1980). Bilingual Syntax Measure II,
Technical Handbook. San Antonio, TX: Harcourt, Brace, Jovanovich, Inc.
Canale, M. (1984). On some theoretical frameworks for language proficiency. In C. Rivera (Ed.), Language
proficiency and academic achievement. Avon, England: Multilingual Matters Ltd.
Canales, J. A. (1994). Linking Language Assessment to Classroom Practices. In R. Rodriguez, N. Ramos, &
J. A. Ruiz-Escalante (Eds.) Compendium of Readings in Bilingual Education: Issues and Practices. Austin,
TX: Texas Association for Bilingual Education.
CHECpoint Systems, Inc. (1987). Basic Inventory of Natural Language Authentic Language Testing
Technical Report. San Bernardino, CA: CHECpoint Systems, Inc.
Council of Chief State School Officers (1992). Recommendations for Improving the Assessment and
Monitoring of Students with Limited English Proficiency. Alexandria, VA: Council of Chief State School
Officers, Weber Design.
CTB MacMillan McGraw-Hill (1991). LAS Preview Materials: Because Every Child Deserves to Understand
and Be Understood. Monterey, CA: CTB MacMillan McGraw-Hill.
Cummins, J. (1984). Wanted: A theoretical framework for relating language proficiency to academic
achievement among bilingual students. In C. Rivera (Ed.), Language proficiency and academic
achievement. Avon, England: Multilingual Matters Ltd.
Dalton, E. F. (1979, 1982, 1991). IPT Oral Grades K-6 Technical Manual, IDEA Oral Language Proficiency
Test Forms C and D English. Brea, CA: Ballard & Tighe, Publishers.
Dalton, E. F. & Barrett, T.J. (1992). Technical Manual IPT 1 & 2, Reading and Writing, Grades 2-6, Forms 1A
and 2A English. Brea, CA: Ballard & Tighe, Publishers.
De Avila, E.A. & Duncan, S. E. (1990). LAS, Language Assessment Scales, Oral Technical Report, English,
Forms 1C, 1D, 2C, 2D, Spanish, Forms 1B, 2B. Monterey, CA: CTB MacMillan McGraw-Hill.
De Avila, E.A. & Duncan, S. E. (1981, 1982). A Convergent Approach to Oral Language Assessment:
Theoretical and Technical Specifications on the Language Assessment Scales (LAS), Form A. Monterey, CA:
CTB McGraw-Hill.
De Avila, E.A. & Duncan, S. E. (1987, 1988, 1989, 1990). LAS, Language Assessment Scales, Oral
Administration Manual, English, Forms 2C and 2D. Monterey, CA: CTB MacMillan McGraw-Hill.
Duncan, S.E. & De Avila, E.A. (1988). Examiner's Manual: Language Assessment Scales Reading/Writing
(LAS R/W). Monterey, CA: CTB/McGraw-Hill.
Durán, R.P. (1988). Validity and Language Skills Assessment: Non-English Background Students. In H.
Wainer & H.I. Braun (Eds). Test Validity. Hillsdale, NJ: Lawrence Erlbaum Associates, Publishers.
National Commission on Testing and Public Policy. (1990). From Gatekeeper to Gateway: Transforming
Testing in America. Chestnut Hill, MA: National Commission on Testing and Public Policy.
Oller, J.W. Jr. & Damico, J.S. (1991). Theoretical considerations in the assessment of LEP students. In E.
Hamayan & J.S. Damico (Eds.), Limiting bias in the assessment of bilingual students. Austin, TX: Pro-Ed
Publications.
Rivera, C. (1995). How can we ensure equity in statewide assessment programs? Unpublished document.
Evaluation Assistance Center-East, George Washington University, Arlington, VA.
Roos, P. (1995). Rights of limited English proficient students under Federal Law -- A guide for school
administrators. Unpublished paper presented at Weber State University, Success for all Students
Conference, Ogden, UT.
Spolsky, B. (1984). The uses of language tests: An ethical envoi. In C. Rivera (Ed.), Placement procedures
in bilingual education: Education and policy issues. Avon, England: Multilingual Matters Ltd.
Ulibarri, D., Spencer, M., & Rivas, G. (1981). Language proficiency and academic achievement: A study of
language proficiency tests and their relationship to school ratings as predictors of academic achievement.
NABE Journal, Vol. V, No. 3, Spring.
Valdés, G., & Figueroa, R. (1994). Bilingualism and testing: A special case of bias. Norwood, NJ: Ablex
Publishing Corporation.
Wheeler, P. & Haertel, G.D. (1993). Resource Handbook on Performance Assessment and Measurement: A
Tool for Students, Practitioners, and Policymakers. Berkeley, CA: The Owl Press.
Woodcock, R. W. & Muñoz-Sandoval, A.F. (1993). Woodcock-Muñoz Language Survey Comprehensive
Manual. Chicago, IL: Riverside Publishing Company.
Table 3: Critiques of the four most commonly used tests

LAS
View of language: Language consists of discrete skills and elements.
Problematic aspects:
- Hedberg (1995): LAS-Oral is inadequate for placing language-minority students because of inadequate standardization procedures.
- Carpenter (1994): LAS Reading/Writing is inappropriate for making entry and exit decisions; teacher judgement would be just as valid.

IPT
View of language: Language consists of discrete skills and elements.
Problematic aspects:
- Lopez (2001): Norming procedures limit test validity for a wide range of U.S. students; there is greater emphasis on discrete aspects of language proficiency and less emphasis on pragmatic competence; no studies were conducted to investigate how test content relates to achievement.
- Ochoa (2001): The standardization sample is not representative of the range of U.S. English speakers, nor is the Spanish version representative of the range of Spanish speakers in the U.S.

WMLS
View of language: Cummins’ BICS/CALP distinction.
Problematic aspects:
- Crocker (1998): To account for construct validity, test makers rely on intercorrelations, not on an explanation of the underlying traits that the test attempts to measure.
- Kao (1998): Test makers provide insufficient information about validity; there is little explanation of the Cognitive-Academic Skills (CALP) construct.
- Schrank, Fletcher, and Guajardo Alvarado (1996)

LAB
View of language: Language consists of discrete skills and elements.
Problematic aspects:
- Chesterfield (1985): The LAB is problematic for identification of students for bilingual programs, contains unnecessary items, and is inadequate to predict success or to serve as a basis for intervention.
References
American Educational Research Association, American Psychological Association, and National Council on
Measurement in Education. (1999). Standards for educational and psychological testing. Washington, D.C.: American
Psychological Association.
August, D., & Hakuta, K. (1997). (Eds). Improving schooling for language-minority children: A research
agenda. Washington, D. C.: National Academy Press.
Bachman, L. (2000). Modern language testing at the end of the century: Assuring that what we count counts.
Language Testing, 17(1), 1-42.
Bowman, B. T., Donovan, M. S., & Burns, M. (Eds.). (2001). Eager to learn: Educating our preschoolers.
Washington, D.C.: National Academy Press.
Burt, M.K., Dulay, H.C., Hernández-Chávez, E., & Taleporos, E. (1980). Bilingual Syntax Measure II,
Technical handbook. San Antonio, TX: Harcourt, Brace, Jovanovich.
Carpenter, C. D. (1994). Review of Language Assessment Scales, Reading and Writing. Supplement to the
eleventh mental measurements yearbook. Lincoln, NE: University of Nebraska Press.
Chesterfield, K. B. (1985). Review of Language Assessment Battery. The Ninth Mental Measurements Yearbook,
Volume I. Lincoln, NE: University of Nebraska Press.
Crocker, L. (1998). Review of the Woodcock-Muñoz Language Survey. The Thirteenth Mental Measurements
Yearbook. Lincoln, NE: University of Nebraska Press.
Cummins, J. (2000). Language, power, and pedagogy: Bilingual children in the crossfire. Clevedon, UK:
Multilingual Matters Ltd.
Cummins, J., Muñoz-Sandoval, A.F., Alvarado, C.G., & Ruef, M.L. (1998). The Bilingual Verbal Ability Tests.
Itasca, IL: Riverside.
Dalton, E. F. (1991). IPT Oral Grades K-6 Technical Manual, IDEA Oral Language Proficiency Test Forms C
and D English. Brea, CA: Ballard & Tighe, Publishers.
De Avila, E.A. & Duncan, S. E. (1990). Language Assessment Scales, Oral Technical Report, English, Forms
1C, 1D, 2C, 2D, Spanish, Forms 1B, 2B. Monterey, CA: CTB MacMillan McGraw-Hill.
Del Vecchio, A., & Guerrero, M. (1995). Handbook of language proficiency tests. Albuquerque, NM: Evaluation
Assistance Center–Western Region, New Mexico Highlands University.
Garcia, E. (1985). Review of Bilingual Syntax Measure II. The Ninth Mental Measurements Yearbook Volume I.
Lincoln, NE: University of Nebraska Press.
Garcia, G.E., & Pearson, P.D. (1994). Assessment and diversity. Review of Research in Education, 20, 337-391.
Gee, J.P. (2003). Opportunity to learn: A language-based perspective on assessment. Assessment in Education,
10, 27-46.
Guyette, T. (1985). Review of Basic Inventory of Natural Language. The ninth mental measurements yearbook
Volume I. Lincoln, NE: University of Nebraska Press.
Guyette,T. (1994). Review of Language Assessment Scales, Reading and Writing. Supplement to the eleventh
mental measurements yearbook. Lincoln, NE: University of Nebraska Press.
Harris Stefanakis, E. (1998). Whose judgement counts?: Assessing bilingual children,
K-3. Portsmouth, NH: Heinemann.
Haber, L. (1985). Review of Language Assessment Scales. The ninth mental measurements yearbook Volume I.
Lincoln, NE: University of Nebraska Press.
Hedberg, N. L. (1995). Review of Language Assessment Scales-- Oral. The twelfth mental measurements
yearbook. Lincoln, NE: University of Nebraska Press.
Kao, C. (1998). Review of the Woodcock-Muñoz Language Survey. The Thirteenth Mental Measurements
yearbook. Lincoln, NE: University of Nebraska Press.
Kindler, A. (2002). Survey of states’ limited English proficiency students and available educational programs
and services: 2000-2001 summary report. Washington, D.C.: National Clearinghouse for English Language Acquisition
and Language Instruction Educational Programs.
Lopez, E. A. (2001). Review of the IDEA Oral Language Proficiency Test. The Fourteenth Mental
Measurements Yearbook. Lincoln, NE: University of Nebraska Press.
MacSwan, J., Rolstad, K., & Glass, G.V. (2002). Do some school-age children have no language? Some
problems of construct validity in the Pre-LAS Español. Bilingual Research Journal, 26, 213-238.
Macías, R. (1998). Summary Report of the Survey of the States' Limited English Proficient Students and Available
Educational Programs and Services 1995-96. Washington, D.C.: National Clearinghouse for Bilingual Education.
McLaughlin, B., Gesi Blanchard, A., & Osanai, Y. (1995). Assessing language development in bilingual
preschool children. Washington, D.C.: National Clearinghouse for Bilingual Education.
Messick, S. (1988). Validity. In R.L. Linn (Ed.), Educational measurement (Third edition). New York:
American Council on Education/Macmillan.
No Child Left Behind Act. (2001). Retrieved October 2, 2002.
Ochoa, S. H. (2001). Review of the IDEA Oral Language Proficiency Test. The Fourteenth Mental Measurements
Yearbook. Lincoln, NE: University of Nebraska Press.
Rueda, R. (in press). Student learning and assessment: Setting an agenda. In Pedraza, P. and Rivera, M. (Eds.).
National Latino/a Education Research Agenda Project.
Shellenberger, S. (1985). Review of Bilingual Syntax Measure II. The Ninth Mental Measurements Yearbook
Volume I. Lincoln, NE: University of Nebraska Press.
Tidwell, P.S. (1995). Review of Language Assessment Scales-- Oral. The twelfth mental measurements
yearbook. Lincoln, NE: University of Nebraska Press.
Valdés, G. and Figueroa, R. A. (1994). Bilingualism and testing: A special case of bias. Norwood, NJ: Ablex.
Valdés, G. (2001). Learning and not learning English: Latino students in American schools. New York:
Teachers College Press.
Wolfram, W., & Schilling-Estes, N. (1998). American English. Malden, MA: Blackwell.
Woodcock, R. W. and Muñoz-Sandoval, A.F. (1993). Woodcock-Muñoz Language Survey Comprehensive Manual.
Chicago: Riverside Publishing Company.