
VIETNAM NATIONAL UNIVERSITY
HANOI UNIVERSITY OF LANGUAGES AND INTERNATIONAL STUDIES
POST-GRADUATE DEPARTMENT





HOÀNG HỒNG TRANG



A STUDY ON VALIDITY OF 45 MINUTE TESTS FOR THE 11TH GRADE
NGHIÊN CỨU TÍNH GIÁ TRỊ CỦA BÀI KIỂM TRA 45 PHÚT TIẾNG ANH LỚP 11

M.A. COMBINED PROGRAMME THESIS




Major: Methodology
Major code: 60.14.10












HANOI - 2009
VIETNAM NATIONAL UNIVERSITY
HANOI UNIVERSITY OF LANGUAGES AND INTERNATIONAL STUDIES
POST-GRADUATE DEPARTMENT



HOÀNG HỒNG TRANG



A STUDY ON VALIDITY OF 45 MINUTE TESTS FOR THE 11TH GRADE
NGHIÊN CỨU TÍNH GIÁ TRỊ CỦA BÀI KIỂM TRA 45 PHÚT TIẾNG ANH LỚP 11

M.A. COMBINED PROGRAMME THESIS


Major: Methodology
Major code: 60.14.10
Supervisor: ASSOC. PROF. DR. VÕ ĐẠI QUANG






HANOI - 2009

TABLE OF CONTENTS

DECLARATION
ABSTRACT
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
LIST OF ABBREVIATIONS
LIST OF TABLES
INTRODUCTION
1. Rationale for the study
2. Significance of the study
3. Aims of the study
4. Scope of the study
5. Research questions
6. Organization of the study
CHAPTER 1: LITERATURE REVIEW
1.1. LANGUAGE TESTING AS PART OF APPLIED LINGUISTICS
1.1.1. Language testing – a brief history and its characteristics
1.1.2. Purposes of language testing
1.1.3. Validity in language testing
1.1.3.1. Definition and types of validity
1.1.3.2. Content validity
1.1.3.3. Construct validity
1.2. CLASS PROGRESS TESTS
1.2.1. Language tests – definition and types
1.2.2. Class progress tests as a type of achievement tests
1.3. TESTING TECHNIQUES
CHAPTER 2: METHODOLOGY OF THE STUDY
2.1. Type of research: A qualitative research
2.2. Techniques
2.2.1. Data type and data collection
2.2.2. Data analysis
CHAPTER 3: THE STUDY
3.1. THE CONTEXT OF TEACHING AND TESTING ENGLISH AT HIGH SCHOOLS IN VIETNAM
3.1.1. The methodological innovation
3.1.2. The testing innovation
3.2. AN OVERVIEW OF THE TEACHING AND TESTING OF ENGLISH LANGUAGE IN THE 11TH GRADE
3.2.1. English textbook for the 11th grade
3.2.2. Syllabus for 11th grade English language subject
3.2.3. 45 minute English language tests
CHAPTER 4: MAJOR FINDINGS
4.1. Phonetics section in 45-minute tests
4.1.1. Data concerning construct validity
4.1.2. Data concerning content validity
4.2. Grammar section in 45-minute tests
4.2.1. Data concerning construct validity
4.2.2. Data concerning content validity
4.3. Vocabulary section in 45-minute tests
4.3.1. Data concerning construct validity
4.3.2. Data concerning content validity
CONCLUSION
1. DISCUSSION OF FINDINGS AND RECOMMENDATIONS
1.1. On pronunciation testing
1.2. On grammar testing
1.3. On vocabulary testing
2. CONCLUSION
REFERENCES
APPENDICES
Copies of test papers collected

LIST OF ABBREVIATIONS

MCQ Multiple-choice question
GF Gap-filling
ER Error recognition
ST Sentence transformation
SB Sentence building
C.V. Construct Validity
V. Validity


LIST OF TABLES

Table 1: Bookmap of the English 11 textbook
Table 2: Recommended structure of a 45 minute test
Table 3: Number of pronunciation test items having their underlined parts dissimilar in letter format
Table 4: No correct answer
Table 5: Apparent correct answer
Table 6: Underlined letter(s) not corresponding to the sounds tested
Table 7: Content validity of phonetics section in Group 1 tests
Table 8: Content validity of phonetics section in Group 2 tests
Table 9: Content validity of phonetics section in Group 3 tests
Table 10: Content validity of phonetics section in Group 4 tests
Table 11: Summary of content validity of pronunciation test items of 4 test groups
Table 12: Summary of techniques for grammar testing in 30 tests
Table 13: Construct validity of grammar items of Group 1 tests
Table 14: Construct validity of grammar items of Group 2 tests
Table 15: Construct validity of grammar items of Group 3 tests
Table 16: Construct validity of grammar items of Group 4 tests
Table 17: Content of grammar component of Group 1 tests compared to the syllabus
Table 18: Content of grammar component of Group 2 tests compared to the syllabus
Table 19: Content of grammar component of Group 3 tests compared to the syllabus
Table 20: Content of grammar component of Group 4 tests compared to the syllabus
Table 21: Summary of techniques for vocabulary testing in 30 tests
Table 22: Tests having topic-relevant reading or cloze test passages
Table 23: Content validity of vocabulary test items of Group 1 tests
Table 24: Content validity of vocabulary test items of Group 2 tests
Table 25: Content validity of vocabulary test items of Group 3 tests
Table 26: Content validity of vocabulary test items of Group 4 tests

INTRODUCTION
1. Rationale for the study
Language testing, a branch of applied linguistics, has witnessed robust development over the last forty (nearly fifty) years in terms of professionalization, internationalization, cooperation and collaboration (Stansfield, 2008, p. 319). Over the course of this development, validity, together with fairness, has become a matter of increasing concern, and it is predicted that research into validity will form “the prominent paradigm for language testing in the next 20 years” (Bachman, 2000, p. 25).
In discussions of validity, much has been said about the validation of standardised tests, especially large-scale EFL tests such as TOEFL, IELTS and TOEIC (Stoynoff, 2009; Bachman et al., 1995, cited in Stansfield, 2008), since decisions based on the scores of these tests are usually considered of prime importance to test takers in both their career and life prospects. Teacher-produced tests, by contrast, receive much less attention. Studies have shown that designing a good test is a “demanding” task for teachers (Davidson and Lynch, 2002, p. 65, cited in Coniam, 2009, p. 227), both because in a language test “language is both the instrument and the object of measurement” (Bachman, 1990), which makes the careful choice of linguistic elements difficult, and because teachers lack time and resources (Popham, 1990, p. 200, cited in Coniam, 2009, p. 227). Teachers are also “unlikely to be skilled in test construction techniques” (Popham, 2001, p. 26, cited in Coniam, 2009, p. 227). This explains why the item quality of teacher-produced tests is often lower than that of standardised tests in terms of reliability (Cunningham, 1998, p. 171, cited in Coniam, 2009, p. 227), which in turn lowers the validity of test score interpretations.
Nevertheless, although teacher-produced tests are widely held to be inferior to standardised tests in quality, little factual evidence has been found to support this claim (Coniam, 2009, p. 227). Soranastaporn et al. (2005) (cited in Coniam, 2009) compared the concurrent validity of achievement tests designed by Thai language teachers against standardised tests such as TOEFL and IELTS and found low correlations between the two. Another study, conducted by Coniam into the reliability and validity of teacher-produced tests for EFL students at a university in Hong Kong, reported poor reliability and validity despite a process of test design and analysis that was rather long compared to the time teachers normally spend on designing a test (Coniam, 2009, p. 238).
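To illustrate the kind of analysis behind such concurrent-validity claims, here is a minimal Python sketch; it is not taken from the studies cited, and the scores and variable names are invented for illustration. It correlates the scores the same candidates obtained on a teacher-produced test and on a standardised test.

# A minimal sketch of a concurrent-validity check: correlate scores that the
# same candidates obtained on a teacher-produced test and a standardised test.
# The score lists below are invented for illustration.
from scipy.stats import pearsonr

teacher_test = [62, 71, 55, 80, 67, 74, 58, 69]           # teacher-produced test scores
standardised_test = [510, 560, 470, 600, 520, 575, 480, 530]  # standardised test scores

r, p = pearsonr(teacher_test, standardised_test)
print(f"Pearson r = {r:.2f} (p = {p:.3f})")
# A low r would echo the weak correlations reported by Soranastaporn et al.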
In the Vietnamese context of educational reform, textbooks at primary and secondary level have all been redesigned in structure and content to keep pace with current changes and developments in society as well as in pedagogy. English language textbooks, following this trend, began to be replaced in 2004, and the replacement process was completed in the 2008-2009 school year. Although techniques and guidelines for assessment are provided with the new textbook set, there has been no investigation into the quality of the actual tests that teachers produce and use with their students at school, or into whether teachers follow these guidelines closely. This situation calls for research into the quality of English language tests used at secondary schools, so as to gain a clearer and more accurate picture of language testing in Vietnam.
2. Significance of the study
English is learnt by over 90% of school pupils and university students in Vietnam, not counting the people learning English outside schools and universities. Assessment of the quality of teacher-produced tests will therefore lay the foundation for a valid interpretation of the quality of language education at schools, which in turn helps shape directions and guidelines for further instruction and assessment at tertiary level and at other language education centers and institutions.
On a narrower scale, the results of this quality assessment will assist in improving test item quality, leading to more reliable and valid tests.
3. Aims of the study
Within the small scope of an MA thesis, this study aims only at investigating two aspects of the validity of a common type of English test used in schools in Vietnam. In particular, it investigates the content and construct validity of the language components of the 45-minute English tests used for the 11th grade in some high schools in northern Vietnam.
4. Scope of the study
Due to time and financial constraints, the study could only focus on 45-minute tests for the 11th grade, collected from ten high schools in five provinces in the north of Vietnam. No other types of tests or other grades were investigated. The language used in those tests is English, so all findings and discussions are restricted to the English language, although the suggestions may also be useful for the teaching of other foreign languages. Furthermore, the scope of an MA thesis only allows for an investigation into two types of validity, namely content and construct validity, and the area chosen for investigation is the language components of the tests collected.
5. Research questions
In short, this research aims at answering the following questions:
1. How valid is the construct of the language components in 45-minute English tests for the 11th grade?
2. How valid is the content of the language components in 45-minute English tests for the 11th grade?
Put another way, the research seeks to find out (1) whether the test items of the language components really measure what they purport to measure, and (2) whether the content of the language components of those 45-minute tests closely follows the English 11 syllabus; that is, it investigates the construct validity and content validity of the language components of 45-minute tests.
6. Organization of the study
This research report is divided into four main parts. After the introduction with an
overview of the study comes the first part which reviews previous studies whose focus and
findings are relevant and beneficial to this one. The second part discusses methodology of
this study, including the research approach, methods of data collection and data analysis.
The third part presents the study in detail, including the context of teaching and testing
English at high schools when conducting this research, syllabus of the 11
th
grade and
information on forty-five minute tests. The fourth part reports all findings and their
discussions as well as recommendations. Finally, the report ends with the conclusion part
which summarizes the research in some main remarkable points.

CHAPTER 1: LITERATURE REVIEW
1.1. LANGUAGE TESTING AS PART OF APPLIED LINGUISTICS
1.1.1. Language testing – a brief history and its characteristics
Language testing, as I usually think of it, involves testing an examinee's level of understanding and use of the language. However, the main functions that language testing serves vary across different approaches and different periods in its history.
As Stansfield (2008) reviewed in his article in Language Testing 25, Spolsky (1978) divided the history of language testing up to his time into three periods or stages: pre-scientific, psychometric-structuralist, and integrative-sociolinguistic. In the first period, language experts were involved in the development of language tests, and their presence alone was taken as grounds for claiming that the tests were reliable and valid. This stage corresponds to the first approach to language testing, the essay-translation approach, in which the “subjective judgement of the teacher” is of utmost importance, rather than “skill or expertise” in testing (Heaton, 1988). Popular components of a language test in this stage were essay writing, translation, and grammatical analysis (Heaton, 1988).
The second period saw the dominance of structural linguistics, which explains why test items in this stage were designed to test discrete language elements (such as sounds, words, and structures) in isolation from context (Stansfield, 2008, p. 312). This came to be known as discrete-point testing, under what was named the structuralist approach to language testing; in this approach, the emphasis regarding the quality of a language test was placed on reliability and objectivity (Heaton, 1988, p. 16).
The third period, the integrative-sociolinguistic stage, gave language testing a more scientific appearance than the previous stages, as statistics began to be used in the examination of tests. John Oller, an outstanding author of this period, proclaimed that there was “a general factor” constituting language proficiency, which he called “a grammar of expectancies” and which could be “directly tested through the cloze test” (Oller, 1972; 1973; 1975; cited in Stansfield, 2008). Cloze tests and dictation, together with oral interviews, translation and essay writing, are present in most integrative tests; this came to be called the integrative approach to language testing.
It can be understood from Stansfield's (2008) review that a fourth stage should be added to Spolsky's summary of the history of language testing: one characterised by the communicative approach, in which how language is used in communication is the primary concern (Heaton, 1988). Instead of testing the four skills separately, as the structuralist approach does (a practice with little relevance to real life), the communicative approach advocates integrative assessment and the authenticity of language tasks and materials. The context of language use is also a matter of great concern. In addition, this stage witnessed a shift of concern from reliability to validity (Stansfield, p. 318), which, according to Stansfield, “brought US and European testing specialists much closer together by the early 1990s.”
Throughout its nearly fifty-year history, from the 1960s to the present, language testing has undergone several changes in character. Its nature has become “less impositional, more humanistic”, “conceived not so much to catch people out on what they do not know, but as a more neutral assessment of what they do” (McNamara, 2000, p. 4). The computerisation of language tests enables a single test to be taken almost anywhere in the world, by examinees of any nation or race, as long as there is a computer connected to the internet, and computers may tailor the content of the test to the particular abilities of candidates (as in computer-based tests such as the TOEFL CBT). The limited number of assessors, together with automatic scoring, makes language tests fairer and the interpretation of test scores more reliable and valid. Besides, the emergence of objective scoring also helps reduce test bias, and language band systems help “increase the reliability of the scoring” (Heaton, 1988, p. 20).
However, there is one unique characteristic of language tests which remains forever
unchanged, that is, “language is both the instrument and the object of measurement”
(Bachman, 1990, p. 2). In a language test, language is used to measure language ability.
Therefore, language assessment involves the assessment of not only the content and the
structure/organization of a language test, but also the language used to denote that content.
And this inevitably poses a dilemma for language testers.
1.1.2. Purposes of language testing
Language testing serves many purposes, most of which are mentioned in Heaton (1997), namely:
1. Finding out about progress: Tests that aim at identifying the extent to which
students have mastered what they have been taught are called progress tests and these are
usually regarded as “the most important kind of tests for teachers” (Heaton, 1997).
2. Encouraging students: Language learning is unusual in that students at certain levels of proficiency do not realize that they are making progress, which can, of course, discourage them. A good test can show students that they actually are moving forward, thus encouraging them to continue making efforts in their language study.
3. Finding out about learning difficulties: This particular job is often taken on by diagnostic tests, in which items are carefully designed so that students' strengths and weaknesses are clearly reflected.
4. Finding out about achievement: Achievement tests are somewhat like progress tests, but they cover a longer period of time and are often conducted at the end of a semester, school year or language course to inform educational decisions, for example promoting students to a higher level.
5. Placing students: Tests are sometimes also given to categorize students into
different groups based on their ability. Language tests are often divided into several levels
of language proficiency such as KET, PET, FCE, CAE, CPE (as in the Cambridge
rankings), or A-, B-, C- level in the Vietnamese language education system, and so on.
6. Selecting students: After the purpose of finding out about students’ ability,
strengths and weaknesses comes the task of selecting students for a job or a course.
Categorizing students is inevitably one part of identifying and selecting them.
7. Finding out about proficiency: This purpose of language tests relates closely to two purposes mentioned above, placing and selecting. Finding out about students' language proficiency is in fact one step towards making decisions concerning students' future education or future life (migration, for example). While language tests serving other purposes tend to look back at what students have learnt, proficiency tests look forward, anticipating what students will have to do, or be able to do, in the future.
Other purposes may include “program evaluation”, “providing research criteria”, or
“assessment of attitudes and sociopsychological differences” (Henning, 1987).
1.1.3. Validity in language testing
1.1.3.1. Definition and types of validity
Validity refers to “the appropriateness of a given test or any of its component parts as a measure of what it is purported to measure” (Henning, 1987). Validity is “the most important consideration in test evaluation”, according to the Standards for Educational and Psychological Testing (1985, p. 9; cited in Wright & Stone, 1999).
Traditionally, the Standards discussed three types of validity: content-related, criterion-related and construct-related, which were considered three “related facets of a single problem” (Wright & Stone, 1999).
In modern times, validity is still considered a unitary concept made up of several components, the validity of each of which contributes to the overall validity of test application and use.
Additionally, validity can be seen from both qualitative and quantitative aspects.
Qualitatively, validity includes content and construct. “These two forms of validity explain
the organization and construction of items and their use in eliciting manifestations of the
variable” (Wright & Stone, 1999). The quantitative aspects of validity, however, have no
relation to text or content, and they are rather statistical and numerical. Criterion-related
validity indeed falls into this category.
Besides, we can also speak of empirical and non-empirical kinds of validity, which correspond respectively to the quantitative and qualitative aspects mentioned above. Examples of non-empirical validity are face/content validity and response validity, while examples of empirical validity are concurrent and predictive validity, that is, criterion-related validity.
To sum up, according to Cronbach (1955, p. 297) (cited in McNamara & Roever), there is no such thing as a “valid test”, since “one cannot validate a test, but only a principle for making inferences”, and “one validates not a test, but an interpretation of data arising from a specified procedure” (Cronbach, 1971, p. 447; cited in McNamara & Roever).
1.1.3.2. Content validity
It is generally assumed that content validity deals with the representativeness and comprehensiveness of the content of a test, such that the test is a valid measure of what it is supposed to measure (Henning, 1987; Borg & Gall, 1974; Bachman, 1990; McNamara, 2000; Heaton, 1998). Therefore, in order to assess the content validity of a test, we have to look at two aspects of its content, representativeness and comprehensiveness, or in other words, content relevance and content coverage (Bachman, 1990, p. 244).
With regard to content relevance, Messick (1980, p. 1017, cited in Bachman, 1990, p. 244) suggested that investigating content relevance requires “the specification of the behavioral domain in question and the attendant specification of the task or test domain”. This means that not only the content of the test but also the setting in which the test is given, that is, the measurement procedure, is a matter of content validity. Popham (1978, cited in Bachman, 1990, p. 245) specifies the elements in test design: “what it is that the test measures”, “the attributes of the stimuli that will be presented to the test taker”, and “the nature of the responses that the test taker is expected to make”. Hambleton (1984) relates these three elements to content validity (cited in Bachman, 1990).
Concerning content coverage, test developers need to analyse closely the language tested and the course objectives (Heaton, 1998) so that there is always an apparent correspondence between the two. This is especially true of achievement tests; matters are less straightforward for proficiency tests, whose designers have to rely on their knowledge, experience and research results to decide which content to choose.
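As a concrete illustration of a content-coverage check, the following minimal Python sketch computes the share of syllabus items actually sampled by a test; the syllabus and test-item labels are invented, not drawn from the English 11 syllabus.

# A minimal sketch of a content-coverage check for an achievement test:
# what share of the syllabus items does the test actually sample?
# Topic labels are invented for illustration.
syllabus_items = {"present perfect", "passive voice", "reported speech",
                  "relative clauses", "conditionals type 2"}
test_items = {"present perfect", "relative clauses", "passive voice",
              "gerunds"}  # "gerunds" is off-syllabus here

coverage = len(syllabus_items & test_items) / len(syllabus_items)
off_syllabus = test_items - syllabus_items
print(f"Syllabus coverage: {coverage:.0%}; off-syllabus items: {off_syllabus}")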
Content validity is one component of qualitative validity as mentioned above, and it plays a central role in developing language tests for specific purposes, for which content relevance is a matter of primary concern. Usually a test is selective in content, and the method of content selection should therefore be considered carefully.
1.1.3.3. Construct validity
A construct, or psychological construct, is “an attribute, proficiency, ability or skill that happens in the human brain and is defined by established theories” (Brown, 2000, p. 9).
While content validity mostly concerns the relationship between test content and course objectives (in achievement tests) or between test content and what examinees are supposed to be able to do with language in non-test contexts (in proficiency tests), construct validity is concerned with the relationship between “performance on tests” and “a theory of abilities, or constructs” (Bachman, 1990, p. 255). A test which shows considerable correspondence between the two is said to have construct validity.
Construct validity has increasingly been viewed as a unifying concept that subsumes the other aspects of validity, namely content validity and criterion-related validity. This understanding of construct validity was first proposed by Messick (1980) (cited in Bachman, 1990, p. 241).
Construct validity is indeed the unifying concept that integrates criterion and content considerations into a common framework for testing rational hypotheses about theoretically relevant relationships. (Messick, 1980, p. 1015, cited in Bachman, 1990, p. 256)
In order to assess construct validity, the construct to be measured has to be defined first. Bachman (1997) noted that investigating construct validity requires taking into consideration both the construct definition and the characteristics of the test task.
Furthermore, according to Brown (2000), construct validity can be demonstrated via either an experimental study or an intervention study. In an experimental study, two groups are compared on their performance: one group possesses the construct and the other does not. If the group with the construct performs better than the group without it, the test is said to have construct validity. In an intervention study, a group weak in the construct is tested, then taught the construct, and later re-tested. If there is a significant difference between the results of the pre-test and the post-test, this may mean that the test has construct validity.
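Brown does not prescribe a statistic for such designs, but a paired t-test is one conventional choice; the minimal Python sketch below, with invented scores, illustrates how an intervention study's pre-test and post-test results might be compared.

# A minimal sketch of analysing the intervention design: the same group is
# tested before and after being taught the construct; a significant gain is
# taken as evidence of construct validity. Scores are invented.
from scipy.stats import ttest_rel

pre_test = [41, 38, 45, 50, 36, 43, 47, 40]
post_test = [55, 49, 58, 63, 48, 57, 60, 52]

t, p = ttest_rel(post_test, pre_test)
print(f"t = {t:.2f}, p = {p:.4f}")  # a small p suggests a genuine pre/post gain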
Additionally, in “Language Test Construction and Evaluation”, Alderson, Clapham and Wall (2001) present several approaches to construct validation, including comparison with theory, internal correlations, comparisons with students' biodata and psychological characteristics, and multitrait-multimethod analysis, the last being the most complex. This study used “comparison with theory”, assessing the characteristics of testing techniques, as its method of evaluating the construct validity of tests.
1.2. CLASS PROGRESS TESTS
1.2.1. Language tests – definition and types
Language tests can be simply understood as tests that evaluate examinees' language ability (which may include “language competence”, “strategic competence”, and “psychophysiological competence”, according to the communicative approach to language testing (Weir, 1990)). Bachman (1990) mentioned five features for categorizing language tests, each of which yields different test types. According to purpose or use, there are selection, entrance, and readiness tests (related to admission decisions); placement and diagnostic tests (identifying specific areas that need instruction); and progress, achievement, attainment, or mastery tests (concerning how well students achieve the objectives of the study program, or how they should “proceed with the program”). In terms of test content, we can distinguish theory-based tests, such as proficiency tests, from syllabus-based tests, such as achievement tests. Regarding frame of reference, there are norm-referenced and criterion-referenced tests; based on scoring procedure, subjective versus objective tests; and, considering the testing methods used, multiple-choice, completion, dictation, cloze tests, and so on. Also based on testing methods, McNamara divides tests into paper-and-pencil and performance tests.
Generally, according to Heaton (1998), most testing specialists divide tests into
achievement/attainment, proficiency, aptitude and diagnostic tests.
1.2.2. Class progress tests as a type of achievement tests
According to Henning (1987), achievement tests “are used to measure the extent of
learning in a prescribed content domain, often in accordance with explicitly stated
objectives of a learning program”. While proficiency tests are knowledge-based, achievement tests are syllabus-based; a test that is not based on a specific syllabus is therefore not an achievement test. Syllabus content and objectives are the first and foremost criteria on which achievement tests are based and assessed.
Class progress tests are a subtype of achievement tests, often referred to as progress achievement tests (as opposed to final achievement tests), and they are also the most popular test type, commonly designed by teachers in and for a specific situation (Heaton, 1998). In order to design a class progress test, a teacher has to draw on his or her knowledge of the students' ability, the objectives of the program being taught, the content of the specific part of the program to be incorporated into the test, and the available sources of test tasks.
With a view to evaluating the extent to which students have mastered what they have been taught in the program, a class progress test also provides students with a chance to show their progress, thus encouraging them to learn and to make continuous efforts in their study. It works like a teaching device which stimulates learning and reinforces what has been taught (Heaton, 1998).
Via progress tests, students realize whether they have mastered the essential knowledge, how well they have mastered it, and which language areas they should review and pay more attention to.
Unlike achievement tests, which are usually given at the end of a semester or a course, progress tests are conducted throughout the course or semester, focusing on the most recent important items that students need to acquire. Without progress tests, certain quite important items might be ignored: items that matter at the unit level may not matter enough at the program level to be included in the achievement test. Progress tests accommodate them all and are therefore a better and more comprehensive reflection of students' understanding and progress.
As advocates of continuous, formative assessment continue to grow in number, progress tests retain a central role in any educational program.
1.3. TESTING TECHNIQUES
In order to test students' language skills or language areas, test designers have to draw on different testing techniques or test methods. Testing techniques can be simply understood as “means of eliciting behaviour from candidates which will tell us about their language abilities” (Hughes, 1989, p. 59). According to Hughes, ideal testing techniques have to satisfy four requirements:
1. they will elicit behaviour which is a reliable and valid indicator of the ability in which we are interested;
2. they will elicit behaviour which can be reliably scored;
3. they are as economical of time and effort as possible;
4. they will have a beneficial backwash effect.
Regarding categorization, common testing techniques may be divided according to the language areas or skills they are applied to, for example, techniques to test grammar, vocabulary, reading, listening, writing, and speaking. Testing techniques are also classed as objective or subjective according to whether the test items are graded objectively or subjectively.
To serve the objectives of this study, this section will first discuss the differences
between objective and subjective testing. Then common types of objective and subjective
testing techniques will be presented.
To begin with, subjective and objective here refer to the scoring of tests, not the construction of tests or performance on them. Every stage in devising a test requires teachers or test designers to make subjective judgements about what to test and how to test it, and students likewise make subjective judgements when doing the tests. The only objective element is how teachers or markers grade the tests: if a test will be scored the same no matter who grades it, it is objective; otherwise, it is subjective.
Objective testing can be applied to any skill or element; however, it works far more effectively for some skills than for others. Grammar, phonology, reading, vocabulary, and listening, for example, lend themselves to objective testing, whereas writing and speaking can only be satisfactorily tested via subjective testing methods (Heaton, 1998). That explains why we come across multiple-choice grammar, vocabulary, reading and listening items far more frequently than writing ones.
However, objective testing is often criticised on the grounds that it does not allow real communicative ability to be tested; instead, students are tested on their ability to manipulate language in situations that rarely arise in everyday language use. Objective testing also leaves room for wild guessing. Even though most students base their guesses on partial knowledge (Heaton, 1998, p. 27), some may know nothing at all and simply make uneducated guesses; with four options per item, they still stand a 25% chance of getting the correct answer.
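The arithmetic behind that 25% figure is a simple expected value; the minimal Python sketch below (the item count is illustrative) shows the score a blind guesser can expect on a four-option multiple-choice section.

# Expected score for a candidate guessing blindly on four-option MCQs:
# each item is answered correctly with probability 1/4.
n_items, n_options = 40, 4
expected_correct = n_items * (1 / n_options)
print(f"Blind guessing on {n_items} items: expect {expected_correct:.0f} correct "
      f"({1 / n_options:.0%} of the section)")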
Nevertheless, the continued use of objective tests has shown that good objective tests are really useful, especially as class progress tests (Heaton). As long as objective tests are not used to measure students' communicative ability or to evaluate students' actual performance, they will continue to occupy a stable and firm position in language testing. Given the advantages and disadvantages of the two types of testing, it is recommended that a good test include both subjective and objective test items.
 Multiple-choice questions – the most common objective testing technique
Multiple-choice questions are those in which there is only one correct answer, called the key, among several options; the incorrect options are distractors, aiming to distract students from the key.
Reliable, rapid and economical scoring is the most striking characteristic of multiple-choice questions (MCQs), which explains why they are favoured in many cases (Hughes, 1989; Cohen, 1994). However, Hughes (1989) and Weir (1990) have revealed several disadvantages of MCQs:
1. The technique tests only recognition knowledge.
2. Guessing may have a considerable but unknowable effect on test scores.
3. The technique severely restricts what can be tested.
4. It is very difficult to write successful items. (Common problem areas include more than one correct answer, no correct answer, clues in the options as to which is correct, and ineffective distractors; a sanity check is sketched below.)
5. Backwash may be harmful.
6. Cheating may be facilitated.
7. There is considerable doubt about their validity as measures of language ability: answering MCQs is an unreal task, with distractors presenting choices that might otherwise not have been thought of.
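Several of the problem areas listed under point 4 can be caught mechanically before a test is administered. The minimal Python sketch below is a hypothetical helper, not a procedure from the literature cited; it checks an MCQ item for a single key, duplicated options, and a key missing from the options.

# A minimal sanity check for an MCQ item, catching some of the problem areas
# named above: no correct answer / more than one correct answer, duplicated
# options, and a key absent from the options. Item content is invented.
def check_mcq(options, keys):
    problems = []
    if len(keys) == 0:
        problems.append("no correct answer")
    elif len(keys) > 1:
        problems.append("more than one correct answer")
    if len(set(options)) != len(options):
        problems.append("duplicate options")
    if any(k not in options for k in keys):
        problems.append("key not among the options")
    return problems or ["looks well-formed"]

item = ["went", "gone", "goed", "went"]  # "went" appears twice
print(check_mcq(item, keys=["went"]))    # ['duplicate options']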

 Gap-filling:
Gap-filling is “the test in which the candidate is given a short passage in which some words or phrases have been deleted. The candidate's task is to restore the missing words” (Alderson, Clapham, Wall, 1995). Gap-filling is in fact a modified form of the cloze test which manages to avoid the cloze test's weaknesses; Weir (1990) named it “selective deletion gap-filling”. Gap-filling is very useful for testing grammar, reading comprehension, or vocabulary, since test writers can focus on the items considered important by selecting them for deletion. The difficulty in using this technique is ensuring that students are led to write the expected words in the gaps. It would be ideal if there were only one correct answer for each gap, but this is difficult to achieve. Therefore, in order to achieve marking reliability, it is essential that the number of alternative answers be kept to a minimum and that every acceptable alternative be listed in the answer key.
A banked gap-filling task can be a solution to this (Alderson, Clapham, Wall, 1995). In a banked gap-filling task, the missing words and phrases are provided, together with some distractor words, so there are more words or phrases than gaps; the students' task is simply to select the correct word for each gap.
According to Weir (1990), this technique “restricts to sampling a much more
limited range of enabling skills than do the short answer and multiple-choice formats”.
Sometimes the deleted word does not affect the sentence at all; that is, the sentence is equally good with or without it. Such cases should be avoided because they confuse students.
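The banked gap-filling format described above is straightforward to represent and score mechanically. The following minimal Python sketch shows the idea, with surplus bank words acting as distractors; the passage, bank and key are invented for illustration.

# A minimal representation of a banked gap-filling item: the bank holds more
# words than there are gaps, so the surplus words act as distractors.
passage = "She ___ to school every day, but yesterday she ___ at home."
bank = ["goes", "went", "stayed", "staying", "gone"]  # 5 words, 2 gaps
key = ["goes", "stayed"]

def score(answers, key):
    # One point per gap whose answer matches the key exactly.
    return sum(a == k for a, k in zip(answers, key))

print(score(["goes", "stayed"], key))   # 2
print(score(["went", "staying"], key))  # 0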
 Sentence transformation items:
This type of item is very useful for testing the ability to produce structures, so it can test grammatical production. It is the objective item type that “comes closest to measuring some of the skills tested in composition writing”, although transforming sentences and producing sentences are not alike.
There are two common types of sentence transformation. In the first, at least one word is given at the beginning of the new sentence, and the candidate's job is to finish the sentence so that it has exactly the same meaning as the original. In the second, the candidate is given one word to include in the new sentence; it can be placed anywhere in the new sentence, as long as the word is unchanged in form and the new sentence retains the meaning of the original.
This test format is somewhat similar to completion items in that there is often more than one correct answer. However, test designers can still be aware of all possible correct answers, and of the specific area they are testing.
This item type is more suitable for use in intermediate and advanced tests than in tests at an elementary level (Heaton, 1997, p. 101), perhaps because the elementary level involves few structures, and ones too simple to allow different ways of expressing the same thing.
According to Heaton, the major shortcoming of this item type is the lack of context.
“It is practically impossible to provide a context for items involving the rewriting of
sentences”.
Although this item type is often used in the writing section of a test, and some people refer to it as a kind of controlled writing, I still have the feeling that it involves more grammatical knowledge than writing skill, and that it is more like testing grammar production.
 Besides the above-mentioned techniques, other techniques include true/false items (a modification of multiple-choice questions), error recognition (either in multiple-choice format, just like multiple-choice questions, or with no options given, so that students have to find the mistakes themselves), and sentence building (which, like sentence transformation, tests students' production of grammar more than their writing skills).

CHAPTER 2: METHODOLOGY OF THE STUDY

2.1. TYPE OF RESEARCH: A QUALITATIVE RESEARCH
This research is conducted qualitatively in the sense that it does not aim at testing hypotheses or at generalization, but is rather “exploratory” and “discovery-oriented” (Nunan, 1992), since qualitative research “is not set out to test hypothesis” (Larsen, 1999).
Burns (1999) defines qualitative research as research conducted “to draw conclusions from the data collected to make sense of how human behaviours, situations and experiences construct realities”. When one carries out qualitative research, one wants to find out what is going on “from the actor's own frame of reference” (Nunan), that is, from the points of view of those being investigated. Besides, qualitative researchers view each individual as a unique entity, so there is no point in generalization: no theory fits all and is true of all. Because generalization is not the goal, the number of samples in qualitative research is often small. While quantitative data are usually gathered using probability sampling, in which each unit in the population stands some chance of being selected through some form of random selection, qualitative research mostly relies on non-probability sampling, which does not involve random selection and does not “depend on the rationale of probability theory” (Trochim). Moreover, each researcher is a unique individual who brings his own viewpoints into his research, so each study is in fact coloured by its researchers' individual perceptions (Trochim); thus, according to qualitative researchers, establishing external validity or objectivity in any research is pointless.
Additionally, while many researchers claim that there should be no numbers (quantification) in qualitative data, Trochim (2006) argues that “all qualitative data can be coded quantitatively”, or that “anything that is qualitative can be assigned meaningful numerical values”. Indeed, qualitative data are usually categorized in the analysis process, and the act of categorizing is quantitative in itself, which many people fail to realize (Trochim, 2006). Trochim takes this further, stating that “all quantitative data is based on qualitative judgement” and that without qualitative judgement, quantitative data are simply valueless.
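Trochim's claim that “anything that is qualitative can be assigned meaningful numerical values” is easy to illustrate. The minimal Python sketch below codes categorical validity judgements numerically and tallies them; the categories and verdicts are invented, not taken from this study's data.

# A minimal sketch of coding qualitative judgements quantitatively, in the
# spirit of Trochim's claim: categorical verdicts on test items are mapped
# to numeric codes and then counted. Verdicts are invented for illustration.
from collections import Counter

codes = {"valid": 1, "partially valid": 0.5, "invalid": 0}
verdicts = ["valid", "invalid", "valid", "partially valid", "valid"]

numeric = [codes[v] for v in verdicts]
print("Numeric codes:", numeric)
print("Tally:", Counter(verdicts))
print("Mean validity score:", sum(numeric) / len(numeric))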
