VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF LANGUAGES AND INTERNATIONAL STUDIES
FACULTY OF POST-GRADUATE STUDIES

VŨ HOÀNG DUNG

DESIGNING AN ENGLISH ACHIEVEMENT TEST FOR
THE 12TH FORM OF NON-ENGLISH MAJORS AT
LE QUY DON HIGH SCHOOL – DONG DA, HANOI
(Thiết kế một bài kiểm tra Tiếng Anh cuối kì cho học sinh lớp 12
không chuyên ngữ tại trường Trung học phổ thông Lê Quý Đôn –
Đống Đa, Hà Nội)

M.A. Minor Programme Thesis
Field: Methodology
Code: 60 14 10
Supervisor: Pham Thi Hanh, M.A.

Hanoi, August 2010


TABLE OF CONTENTS

STATEMENT OF AUTHORSHIP
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
ABSTRACT
LIST OF ABBREVIATIONS
LIST OF TABLES AND APPENDICES

Part A: INTRODUCTION
1. Rationale for the study
2. Aims and scope of the study
3. Methods of the study
4. Research questions
5. Design of the study

Part B: LITERATURE REVIEW
1. Basic Concepts of Testing
2. Roles of testing
3. Types of Test
   3.1. Achievement tests
   3.2. Placement tests
   3.3. Diagnostic tests
   3.4. Proficiency tests
   3.5. Direct versus indirect testing
   3.6. Discrete point versus integrative testing
   3.7. Norm-referenced versus criterion-referenced testing
   3.8. Objective versus subjective testing
   3.9. Communicative language testing
4. Characteristics of a Good Test
   4.1. Validity
      4.1.1. Face validity
      4.1.2. Content validity
      4.1.3. Construct validity
   4.2. Reliability
   4.3. Practicality
   4.4. Discrimination
5. Test Items
   5.1. Direct test items
   5.2. Indirect test items

Part C: THE STUDY
1. The issues influencing the current testing situation at LQD-DD High School
   1.1. The students and their backgrounds
   1.2. The English teaching staff
   1.3. English teaching and learning
   1.4. The objectives of the K-12 course
   1.5. Teaching material used for the K-12
   1.6. The content and construction required for the achievement test for the K-12
2. The current testing situation
3. The process of testing
   3.1. Determining the purpose of the test
   3.2. Planning the test
   3.3. Selecting items and tasks
   3.4. Administering the test
   3.5. Scoring and rating
   3.6. Descriptive data from test scores
      3.6.1. Frequency distribution
      3.6.2. Measures of central tendency
      3.6.3. Measures of dispersion
      3.6.4. Item analysis
4. Survey
   4.1. Data collection procedure
   4.2. Findings
5. Archiving
6. Some discussions from test result analysis

Part D: CONCLUSION
1. Conclusion
2. Recommendations

REFERENCES


LIST OF ABBREVIATIONS

1. D: Discrimination Index
2. FV: Facility Value
3. GCSE: General Certificate of Secondary Education
4. K-12: The 12th form
5. L: The number of lower group students who answered the item correctly
6. LQD – DD: Le Quy Don – Dong Da
7. MCQs: Multiple Choice Questions
8. S.D.: Standard Deviation
9. U: The number of upper group students who got the item correct


LIST OF TABLES AND APPENDICES

Table 1: Objectives and contents of the test
Table 2: The frequency of marking scores
Table 3: The measures of central tendency
Table 4: The measures of dispersion
Table 5: Facility value and Discrimination index

APPENDIX 1: The contents of the eight units taught in the 1st term of K-12
APPENDIX 2: The structure of English tests for the GCSE national examination
APPENDIX 3: Revision outline for the 1st-term test
APPENDIX 4: The 1st-term English examination for K-12
APPENDIX 5: Keys for the 1st-term English examination for K-12
APPENDIX 6: Students' performance on the test, arranged from the highest score to the lowest
APPENDIX 7: The survey questionnaire for students
APPENDIX 8: The survey questionnaire for teachers


Part A: INTRODUCTION
1. Rationale for the study
Assessment plays a significant part in the field of linguistics. It allows teachers as
well as administrators to make important decisions regarding the proficiency, placement,
and achievement of second language learners. Among the various kinds of assessment, testing is the one typically used at the end of a stage of instruction to measure students' achievement. Henning (1987, p. 1) says that the most common purpose of language tests is
“to pinpoint strengths and weaknesses in the learned abilities of the student”. Heaton
(1990, p. 9) also shares the idea that it is “to find out how well the students have mastered
the language areas and skills which have just been taught”.
Besides, teachers can use language tests to encourage students. Students can show
their ability and progress through tests. Heaton (1990, p. 10) agrees that tests are also “used
for the purpose of increasing motivation". Students who get good results will be eager for their further study, while those with weaker results will know what level of language they have reached and can set themselves more realistic goals.
Aware of these facts, the author of this study, who has been a teacher of English for about 10 years and has written many tests to evaluate the language abilities of students, has noticed that not all of those tests have produced satisfactory results. Some tests yield skewed results, meaning that too many students get bad marks or too many get very good marks. Teachers cannot use the results of such tests to categorize the students; therefore, such tests are not reliable.
Considering the problem, the author has found that it is the technique of test design that leads to this situation. When writing tests, the test writer may only feel that this or that test item is suitable for their students; he or she may never have analyzed the test design and the test results carefully in order to find the causes of the problem. At the same time, a well-designed test is always necessary for students of every language level, especially those at high school level - the elementary level, which aims at acquiring survival English.
Therefore, the author desires to gain more knowledge and experience in designing tests. In this minor thesis, she draws on the theory of testing and on the local testing situation to design an achievement test for the 12th form of non-English majors at Le Quy Don-Dong Da High School, Ha Noi.


2. Aims and scope of the study
This study seeks to use the academic knowledge in test design to write a reliable
achievement test for the 12th form of non-English majors at Le Quy Don-Dong Da High
School. Because this is an achievement test given after a study period of about four months (the 1st term of the school year), and students are required to do the test using what has recently been taught and practiced, we, the teachers of English at LQD-DD High School, expect our students to score fairly high marks: at least 70% of them should get mark 6 or above. Therefore, this study specifically seeks to answer such questions as whether the items of the test meet the requirements of the course, whether these test items are suitable for at least 70% of students to get mark 6 or above, and what
recommendations should be given to improve the testing situation so that we can build a
bank of reliable tests for the school in the coming years.

As students of grade 12 are going to take part in the national examination for General
Certificate of Secondary Education (GCSE), they should be given a chance to get used
to the structure of this kind of examination. Therefore, the study will focus on the existing
situation at LQD – DD High School, which means that the author will design a multiple
choice written achievement test only on such fields as phonetics, grammar, vocabulary,
communicative function, reading and writing skills, as directed in the book "Cấu trúc đề thi môn Ngữ văn, Lịch sử, Địa lí, Ngoại ngữ" (i.e. "The Test Structures of Literature,
History, Geography, Foreign Languages”) by Nguyen (2009, p. 104). The study will also
provide analyzed data of the test, the teachers' and students' comments on the test as well
as their suggestions for its improvement.
3. Methods of the study
The main methods employed in this study included test design, test result analysis, and a survey of teachers and students.
For the purpose of designing the test, the author drew on the theory and principles of language testing, the major characteristics of a good test (especially a good achievement test), and the testing situation at LQD-DD High School.
The test results were analyzed with the software SPSS to present the frequency distribution, central tendency and dispersion of the test scores. The analysis of test items was carried out manually.
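As an illustration only, the same descriptive statistics can be reproduced outside SPSS; the following minimal Python sketch uses a hypothetical score list (not the study's data) and also checks the 70% expectation stated in the aims.

```python
# A minimal sketch (not the study's SPSS output): descriptive statistics
# for a set of test scores. The score list is hypothetical.
from collections import Counter
from statistics import mean, median, mode, stdev

scores = [4, 5, 6, 6, 7, 7, 7, 8, 8, 9]  # hypothetical marks on a 10-point scale

# Frequency distribution (cf. section 3.6.1)
for mark, count in sorted(Counter(scores).items()):
    print(f"mark {mark}: {count} student(s)")

# Measures of central tendency (cf. section 3.6.2)
print("mean:", mean(scores), "median:", median(scores), "mode:", mode(scores))

# Measures of dispersion (cf. section 3.6.3)
print("range:", max(scores) - min(scores), "S.D.:", round(stdev(scores), 2))

# The expectation stated in the aims: at least 70% of students
# should get mark 6 or above.
share = sum(s >= 6 for s in scores) / len(scores)
print(f"{share:.0%} of students scored 6 or above (target: 70%)")
```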
Finally, the survey enabled the author to collect teachers' and students' opinions on the reliability and suitability of the test. A questionnaire in both English and Vietnamese was given to 50 students, and another in English was given to 9 teachers of English at LQD-DD High School. The questionnaires consist mostly of closed-ended questions, because quantitative data were the major interest and such questions make the survey easy to carry out. There was only one open-ended question in each questionnaire, so that further suggestions could be given to improve the quality of the test design.
4. Research questions

This study was intended to answer such research questions as:
1. What test items are appropriate to meet both the requirements of the course and the
test writer's expectations in the testing situation at LQD-DD High School? What items
are the easiest and most difficult of all?
2. What should be done to improve the test?
5. Design of the study
This thesis consists of four parts, with a list of references, and appendices.
Part A, “Introduction”, provides the rationale for the study, the aims, the scope, the
methods, the research questions and the design of the study.
Part B, “Literature Review”, discusses the basic concepts of testing, the roles of
testing, the types of test, the characteristics of a good test and the test items.
Part C, "The Study", presents six sections. The first covers the issues influencing the current testing situation at LQD-DD High School. The second describes the current testing situation. The third is the process of testing, which ends with descriptive data from the test scores. The study then goes on with the teacher and student survey, including the data collection procedure and some findings. The next section concerns the archiving of good test items. The last offers some discussion of the test result analysis.
Part D, “Conclusion”, gives some brief conclusions on the study and some
recommendations for further study.


Part B: LITERATURE REVIEW
1. Basic Concepts of Testing
Many linguists have expressed their opinions on the concept of testing. Testing is said to refer to "the specific procedures that teachers and examiners employ to try to measure ability in the language, using what learners show they know as an indicator of their ability" (Hedge, 2000, p. 378). According to Bachman (1990, p. 20), a test is a measurement instrument designed to elicit a specific sample of an individual's behaviour. Besides the term "testing", it is necessary to consider some related terms such as "evaluation", "assessment" and "measurement", as all of them are closely related.
Bachman (1990, p. 18) proposes that these terms are often used as synonyms that refer to
the same activity. However, he also points out that each of them has its own distinctive
characteristics. Therefore, tests are only one of many different types of measurement.
Hedge (2000) also agrees that:
Assessment is the more inclusive term: it refers to the general process of monitoring
or keeping track of the learners' progress. Testing is one kind of assessment, one
which is typically used at the end of a stage of instruction to measure student
achievement. (Hedge, 2000, p. 376)
Bachman (1990, p. 22) says that “tests in and of themselves are not evaluative”.
Davies (2000, pp. 170-171) shares the idea that evaluation is a more general concept than testing, but that tests remain the main instruments for the evaluation of learning in most
teaching situations.
In summary, “not all measures are tests, not all tests are evaluative, and not all
evaluation involves either measurement or tests” (Bachman, 1990, p. 24).
2. Roles of testing in teaching and learning
The roles of testing have been recognized by many researchers. McNamara (2000, p.
4) and Heaton (1988, p. 7) say that language tests play a powerful part as a tool for
teachers to find out which parts of the language program have caused difficulty to the
class. They act as gateways at important transitional moments in education, in
employment, and in moving from one country to another. Teaching and testing have an
intimate relationship, which is considered one of partnership (Hughes, 1989, p. 2), and
testing is also essential in learning. It provides students with an opportunity to show their
ability to perform certain tasks in the language (Davies, 2000, pp. 169-170). It helps



teachers to find out about students' progress in learning and their learning difficulties. It
also encourages students to learn (Heaton, 1990, pp. 9-11; Harrison, 1983, p. 1).
It is obvious that testing is an integral part of teaching and learning and is not
significantly separated from classroom activities. However, many students see tests as dark
clouds over their heads since they are afraid of getting bad results which mean bad marks
for them. Hughes (1989, p. 1), who calls the effect of testing on teaching and learning backwash, argues that "backwash can be harmful or beneficial". Backwash is harmful if
the test content and testing techniques are different from the objectives of the course.
To sum up, not only the teachers but also the students may gain benefits from testing
provided that the test content and testing techniques meet the objectives of the course and
the test is suitable for students' language competence.
3. Types of Test
In this section, types of tests which are categorized according to their purposes are
taken into consideration. Therefore, such types of tests as achievement tests, placement
tests, diagnostic tests and proficiency tests are mentioned. The section, then, goes on to
distinguish between direct and indirect testing, between discrete point and integrative
testing, between norm-referenced and criterion-referenced testing and between objective
and subjective testing. Finally, communicative language testing is presented.
3.1. Achievement tests
This kind of test interests teachers of English most, as these tests are the
commonest basis for the marks teachers give students during and at the end of each course.
Achievement tests in foreign language classes are directly related to language courses.
They attempt to assess whether a student has met the requirements of a given course, and
sometimes whether he has satisfied a language requirement at an institution. In other
words, their purpose is to measure learners' language and skill progress in relation to the
syllabus learners have been following (Krashen, 1987, p. 179; Hughes, 1989, p. 10;
Harmer, 1991, p. 321; Henning, 1987, p. 6). Achievement tests also relate to the past
because they measure what language the students have learned as a result of teaching
(McNamara, 2000, p. 7). Harrison (1983, p. 7) states that “an achievement test (also called
an attainment or summative test) looks back over a longer period of learning than the

diagnostic test, for example a year's work, or a whole course, or even a variety of different


courses”. He also proposes that an achievement test is intended to show the standard which
the learners have now reached in relation to other learners at the same level.
Achievement tests are divided into two types: final achievement tests and progress
achievement tests (Hughes, 1989, p. 10; Davies, 2000, pp. 171-172).
Final achievement tests are longer-term achievement tests than progress ones. They
are administered at the end of a course of study to check how well learners have done over
a whole course, so they are also called course tests. These tests may be written and
administered by members of teaching institutions, by official examining boards, or by
ministries of education. The content of these tests “must be related to the courses with
which they are concerned, but the nature of this relationship is a matter of disagreement
amongst language testers" (Hughes, 1989, p. 11). There are two approaches to basing the content of a final achievement test. The first is the "syllabus-content approach", which
has an obvious appeal because the test only contains what it is thought that the learners
have actually encountered and its content is based directly on a detailed course syllabus or on
the books and other materials used (Hughes, 1989, p. 11). Alderson, Clapham & Wall
(1995, p. 12) propose that the content of both progress and achievement tests is generally
based on the course syllabus or the course textbook. This means that an achievement test
should contain item types which the learners are familiar with (Harmer, 1991, p. 321). In a
reading test, for example, learners should be provided with texts and task types similar to those they have seen before. If students are faced with completely new
material, the test will not measure the learning that has been taking place, even though it
can still measure general language proficiency. Hughes (1989, p. 11), however, also points
out the disadvantage of this approach. He argues that “if the syllabus is badly designed, or
the books and other materials are badly chosen, then the results of a test can be very
misleading” and “successful performance on the test may not truly indicate successful
achievement of course objectives”. This argument leads to an alternative approach, the

second one, which bases the test content directly on the objectives of the course. The
following are a number of advantages of this approach:
1. It compels course designers to be explicit about objectives.
2. It makes it possible for performance on the test to show just how far students
have achieved those objectives.
3. This in turn puts pressure on those responsible for the syllabus and for the
selection of books and materials to ensure that these are consistent with the
course objectives.


4. Tests based on objectives work against the perpetuation of poor teaching
practice, something which course-content-based tests, almost as if part of a
conspiracy, fail to do.
(Hughes, 1989, p. 11)
Therefore, Hughes (1989) believes that to base the content of final achievement
tests on course objectives is much to be preferred. However, Hughes (1989, p. 13) also warns that where the first, syllabus-content approach is already entrenched, "not only is there likely to be natural resistance to change, but such a change may represent a threat to many people", as "a great deal of skill, tact and, possibly, political manoeuvring may be called for".
Progress achievement tests, or short-term achievement tests, which include
achievement tests at the end of the term and progress tests at the end of a unit, a fortnight,
etc., are used to check how well learners are doing after each lesson or unit, and provide
consolidation or remedial work if necessary (Davies, 2000, p. 171). They are intended to
measure the progress that learners are making. They can also help us to decide on changes
to future teaching programs where students do significantly worse in (parts of) the test than
we might have expected (Harmer, 1991, p. 321). Hughes (1989, p. 12) claims that as 'progress' is towards the achievement of course objectives, these tests should relate to
objectives. Then he suggests that to make a clear progression towards the final
achievement test based on course objectives, a series of well-defined short-term objectives

should be established. In other words, the content of progress achievement tests should
also be based on course objectives, especially the short-term objectives of the course.
In this subsection, the general concepts of achievement tests as well as the approaches to basing their content have been discussed. Each approach has its own advantages and disadvantages. Therefore, which approach is applied to achievement test design depends on the testing situation and on the agreement of the members of each teaching
institution.
3.2. Placement tests
A placement test is designed to provide information which will help to categorise
new students according to their knowledge background at the beginning of the course, so
that they can start a course at approximately the same level as the other students in the
right class in a school (Harrison, 1983, p. 4; Harmer, 1991, p. 321; Hughes, 1989, p. 14;
Heaton, 1990, p. 15). Placement tests obviously have their own characteristics. For
example, the test should consist of questions directly concerned with the specific language


skills which students will require on their course. A placement test also looks forward to
the language demands which will be made on students during their course. What is more, "a placement test should try to spread out the students' score as much as possible" in order to make it possible to divide students into groups based on their various ability levels (Heaton,
1990, p. 15).
From the above characteristics, it is possible to conclude that, unlike achievement
tests, placement tests are usually based on syllabuses and materials the learners will follow
and use when their level has been decided on. This kind of test is essential in institutions
that frequently receive new learners.
3.3. Diagnostic tests
It is pointed out by Hughes (1989, p. 13) that diagnostic tests are used to identify
students' strengths and weaknesses. They are intended to check our students' progress and
identify their difficulties, gaps in their knowledge, and skill deficiencies during a course so

as to ascertain what further teaching is necessary (Harmer, 1991, p. 321; Heaton, 1990, p.
11). In other words, diagnostic tests help to show learners what their difficulties are, where
gaps exist in their command of the language and what skills they should pay attention to.
This type of tests helps teachers to be aware of problems in order to teach effectively. Test
researchers also say that diagnostic tests can be used at the start of the course (Davies,
2000, p. 171), at the end of a unit in the course book or after a lesson designed to teach one
particular point (Harrison, 1983, p. 6).
When designing a diagnostic test, we must select areas which we think students are
likely to have problems with. However, according to Heaton (1990, p. 11), tests of
grammar and pronunciation are more suitable for diagnosing students' difficulties than
tests of skills as it is more difficult to use a skills test such as a reading test or test of free
writing to determine problem areas in a systematic way.
Concerning the basis for the content of a diagnostic test, Harrison
(1983, p. 6) argues that it must relate to specific short-term objectives and should include
further examples of the same kind of material as that used in teaching.
Given the definition, the purpose and the content approach of this kind of test, it can be said that diagnostic tests are, to some extent, the same as progress achievement tests.
3.4. Proficiency tests


It is agreed that proficiency tests are used to measure the knowledge and ability of
learners with different language training backgrounds in relation to generally accepted
standards (Harrison, 1983, p.7; Hughes, 1989, p. 9; Heaton, 1990, p.17; Harmer, 1991, p.
321; Alderson et al., 1995, p. 12; Davies, 2000, p. 172). The content of a proficiency test,
therefore, is not based on the objectives of any language course that learners may have
followed. These linguists also state that there are two types of proficiency tests. One type is
tests used for assessing learners' ability to use a language for a particular purpose. This type of test is suitable for people who need to measure their foreign language proficiency so as to determine whether it is good enough for their job or for their future study abroad. When designing this type of test, the test designer should pay careful attention to
language areas and skills that the candidate will need. The other type of proficiency tests,
by contrast, does not relate to any occupation or course of study. “The function of these
tests is to show whether candidates have reached a certain standard with respect to certain
specified abilities” (Hughes, 1989, p. 10).
To sum up, as Davies (2000, p. 172) writes, proficiency tests “are useful for the
objective evaluation of learning, and also for the indirect evaluation of course design and
teaching". Popular international proficiency testing systems include TOEFL, GMAT and IELTS. In Vietnam, proficiency tests are offered at different levels, namely A, B and C.
3.5. Direct versus indirect testing
Direct and indirect testing are two approaches to test construction. Each of them
naturally has its own advantages and disadvantages to test design.
Testing is said to be direct when it requires the candidate to perform precisely the
skill which we wish to measure in real and uncontrived communication situations (Hughes,
1989, p. 15; Henning, 1987, p. 5). In other words, if we want to measure candidates' writing ability, we should ask them to write a composition. If we want to know how well they pronounce a language, we should ask them to speak. This approach to test construction has the following advantages:
- Provided that we are clear about just what abilities we want to assess, it is relatively straightforward to create the conditions which will elicit the behaviour on which to base our judgements.
- At least in the case of the productive skills, the assessment and interpretation of students' performance is also quite straightforward.
- Since practice for the test involves practice of the skills that we wish to foster, there is likely to be a helpful backwash effect.

On the other hand, direct testing is inevitably limited to a rather small sample of tasks, which may call on a restricted and possibly unrepresentative range of grammatical structures (Hughes, 1989, pp. 15-16).

This test specialist also argues that when a direct test is used, its tasks and texts should be as authentic as possible. In fact, however, as candidates are aware that they are in a test situation, the tasks cannot be fully authentic. What is more, when candidates' productive skills are tested, the results are not wholly reliable, since the test is marked subjectively.
Indirect testing, by contrast, “attempts to measure the abilities which underlie the
skills in which we are interested” (Hughes, 1989, p. 15). Multiple choice recognition tests
are typical of this approach to language testing. The main appeal of indirect testing is that
it seems to offer the possibility of testing a representative sample of a finite number of
abilities which underlie a potentially indefinitely large number of manifestations of them.
For example, if a representative sample of grammatical structures is taken in a test, it can
be relevant for all the situations in which control of grammar is necessary. As a result of
this argument, indirect testing is preferable to direct testing because its results are more
generalisable (Hughes, 1989, p. 16).
3.6. Discrete point versus integrative testing
According to Hughes (1989, p. 16) and McNamara (2000, p. 14), discrete point
testing refers to the testing of separate, individual points of knowledge at a time, item by

item. Henning (1987, p. 5) says that the distinction between discrete point and integrative
testing was originated by John B. Carroll (1961). Discrete point tests, as a variety of
diagnostic tests, are designed to measure knowledge or performance in very restricted
areas of the target language. In this kind of tests, the points of grammar chosen for
assessment would be tested one at a time; tests of grammar would be separate from tests of
vocabulary; and material to be tested would be presented with minimal context or in an
isolated sentence. In order to test individual points, item formats of the multiple choice
question type are most suitable (McNamara, 2000, p. 14). Discrete point tests are almost
always indirect tests (Hughes, 1989, p. 17).


Integrative testing, as Hughes (1989, p. 16) points out, requires the candidate to
combine many language elements in the completion of a task. Henning (1987, p. 5) claims
that integrative tests “tap a greater variety of language abilities concurrently”. These tests,
which include tasks of writing a composition, making notes while listening to a lecture,
taking a dictation, or completing a cloze passage, tend to be direct (Hughes, 1989, pp. 16-17).
However, Henning (1987, p. 5) cites Farhady's (1979) evidence that "there are no
statistically revealing differences” between discrete point and integrative tests. The cloze
procedure, for example, in which the integrative testing method is employed is indirect
(Hughes, 1989, p. 17).
3.7. Norm-referenced versus criterion-referenced testing
A test is said to be norm-referenced when it is designed to give information about one candidate's performance in relation to that of other candidates. From the result of this kind of test, we are not told directly how well the candidate has performed on the test (Hughes, 1989, p. 17; Bachman, 1990, p. 72). Instead, we can easily compare the
candidate's performance or achievement with that of his/her peers in a larger population of
candidates. Acceptable standards of candidates‟ performance can only be determined after
the test has been conducted. These standards are considered the mean or average score of
other candidates from the same population (Henning, 1987, pp. 7-8). As a result, this kind

of test is suitable for placement purposes.
Criterion-referenced tests are the ones that tell us how well a candidate performs in a language. The purpose of these tests is to classify candidates
according to whether or not they are able to do some task or set of tasks satisfactorily
(Hughes, 1989, p. 18; Bachman, 1990, p. 74). The standard of candidates' performance is devised before the test is designed. Tests of this kind have two virtues: "they set standards meaningful in terms of what people can do, which do not change with different groups of candidates; and they motivate students to attain those standards". However, for bright students, this kind of test may not encourage them to reach higher standards after they
have easily attained the criterion level of language (Henning, 1987, p. 7).
It can be said that criterion-referenced tests are more suitable for classroom
activities than norm-referenced ones. As teachers can set certain objectives for students to
achieve in tests, they can find out students' strengths and weaknesses, and thus help them improve their language competence.
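The contrast can be made concrete with a minimal sketch; the names, scores and mark-6 cut-score below are hypothetical (the cut-score merely echoes the pass level used in the aims of this study).

```python
# Minimal sketch (hypothetical names and scores): the same raw marks read
# norm-referenced (relative to the group) and criterion-referenced
# (against a cut-score fixed before the test).
from statistics import mean, stdev

scores = {"An": 7.5, "Binh": 5.0, "Chi": 6.5, "Dung": 8.0, "Hoa": 4.0}

# Norm-referenced reading: a z-score locates each candidate relative to
# the group's mean; the "standard" emerges only after the test is given.
m, s = mean(scores.values()), stdev(scores.values())
for name, x in scores.items():
    print(f"{name}: z = {(x - m) / s:+.2f}")

# Criterion-referenced reading: a cut-score devised beforehand
# (here mark 6, a hypothetical pass level).
CUT_SCORE = 6.0
for name, x in scores.items():
    print(name, "meets criterion" if x >= CUT_SCORE else "does not meet criterion")
```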



3.8. Objective versus subjective testing
These two types of testing are distinguished according to the manner in which tests
are scored. It is agreed that objective tests are the ones that need no knowledge or training
in the examined content area from scorers. Subjective tests, by contrast, are the ones
calling for judgement of scorers (Henning, 1987, p. 4; Heaton, 1988, p. 25; Hughes, 1989,
p. 19; Heaton, 1990, p. 30).
A typical example of objective testing is multiple-choice recognition tests. As each
item in this kind of tests has only one correct answer, tests can be marked by a machine or
by an inexperienced person. Therefore, it can be said that the results from the tests are
extremely reliable. That is, a candidate will get the same mark no matter which examiner
marks the test (Heaton, 1988, p. 25). It is the convenience in marking and the great reliability of scoring that have made objective tests very popular among examining bodies responsible for testing large numbers of candidates. In Vietnam, this kind of
tests has proved its popularity in recent years; it has been used in such national
examinations as the ones for General Certificate of Secondary Education (GCSE) and
University Entrance Examinations. However, objective testing has some negative effects
that should be taken into consideration. It is more suitable for testing students' knowledge of language forms and how language works than for testing their ability to use the language (Heaton, 1988, p. 26). As a result, if students are given this kind of test
frequently, they will not learn how to use the language (Heaton, 1990, p. 33). What is
more, it is said that objective tests of multiple-choice type do not encourage students to use
their language competence to do the task but encourage them to guess the answers.
However, if each multiple-choice item has four or five alternatives, it is possible to reduce
the possibility of guessing (Heaton, 1988, p. 26).
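To illustrate the point about guessing, a standard correction-for-guessing formula (a common textbook device, not a procedure taken from this thesis) subtracts a penalty for wrong answers, where R is the number of right answers, W the number of wrong answers, and n the number of alternatives per item:

$$ S_{\text{corrected}} = R - \frac{W}{n-1} $$

On a hypothetical 40-item test with four alternatives per item, blind guessing yields an expected R = 10 and W = 30, so the corrected score is 10 - 30/3 = 0; with five alternatives, the expected raw chance score falls from 10 to 8, which is one sense in which more alternatives reduce the possibility of guessing.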
Examples of subjective tests are the ones that require candidates to write
compositions, reports, letters, answers to comprehension questions, etc. as well as require
them to talk or to make conversations using their own words. Henning (1987, p. 4) cites
Oller's (1979) opinion that "many tests, such as cloze tests permitting all grammatically
acceptable responses to systematic deletions from a context, lie somewhere between the
extremes of objectivity and subjectivity”. Obviously, subjective tests are preferable to
objective ones in that they can test language skills and certain areas of language. However,
since the answer to the question of these tests allows much freedom and requires
flexibility, scoring needs to be carried out by a competent marker or teacher (Heaton, 1990, p. 32). Moreover, as scoring involves the marker's subjective judgement, the results from such tests are often unreliable or undependable.
In summary, designing objective tests requires much more time and more careful preparation than designing subjective ones, but the former reward us with ease of marking. Which type of test is used in classroom activities, and on what occasion, depends greatly on the teachers, who should consider the most suitable type of test for their students.
3.9. Communicative language testing
Communicative language testing is another approach to testing which emphasizes
the importance of the meaning of utterances rather than their form and structure (Heaton,
1988, p. 19). McNamara points out two features of these tests:
1. They were performance tests, requiring assessment to be carried out when the
learner or candidate was engaged in an extended act of communication, either
receptive or productive, or both.
2. They paid attention to the social roles candidates were likely to assume in real
world settings, and offered a means of specifying the demands of such roles in
detail.
(McNamara, 2000, pp. 16-17)
It is clear that the second feature distinguishes communicative tests from
integrative ones: communicative tests are concerned primarily with how language is used
in communication. In other words, these tests should meet the performance conditions of
certain contexts as much as possible (Weir, 1990, p. 11). As communicative testing requires candidates to communicate under realistic linguistic, situational, cultural and affective constraints, and to perform both receptively and productively in relevant contexts (Weir, 1990, p. 12), it is also direct testing.
4. Characteristics of a Good Test
When designing tests, designers have to take various factors into consideration. These include the purpose of the test, the content of the syllabus, the course objectives, pupils' backgrounds and so on. To design a good test, besides the above factors, the characteristics of a good test should of course be included in the test designer's consideration. It is
agreed that validity, reliability, practicality and discrimination are the most important
qualities of a good test (Harrison, 1983; Henning, 1987; Heaton, 1988; Hughes, 1989;


Weir, 1990; Alderson et al., 1995; Bachman & Palmer, 1996). In the following subsections

these characteristics will be discussed in turn.
4.1. Validity
Validity of a test is known as the extent to which it measures accurately what it is
intended to measure. However, when it is carefully examined, the concept of validity
reveals several typical aspects as follows:
4.1.1. Face validity
A test is said to have face validity if it looks, on the 'face' of it, acceptable to those
involved in its development or use (Harmer, 1991, p. 322; McNamara, 2000, p. 50). That
is, face validity is concerned with what other testers, teachers and students think of the test
(Harrison, 1983, p. 11; Heaton, 1988, p. 159). For example, students might not be
convinced of the test's face validity if it included only three multiple-choice items, however reliable or practical it is thought to be (Harmer, 1991, p. 322). If the test looks sound, or
the test has good face validity, it will be acceptable to students and they will certainly try
harder. Therefore, students' motivation is maintained. The test's face validity can only be
found out by asking students whether the test was appropriate to their expectations or by
asking teachers concerned for their opinions, either formally by means of a questionnaire
or informally by discussion in class or staff room (Henning, 1987, p. 94; Harrison, 1983, p.
11).
What is more, a test which has face validity should look as if it measures what it is
supposed to (Hughes, 1989, p. 27). For example, if students' pronunciation ability is
intended to be measured, students should be asked to speak. If not, for example in the case
of testing students‟ pronunciation ability in an indirect test, or in a multiple-choice test, the
test might be thought to lack face validity. However, Hughes (1989, p. 27) also argues that
indirect tests will have face validity if novel techniques are introduced slowly with care
and with convincing explanation.
4.1.2. Content validity
Content validity has been widely discussed by many language test specialists.
Bachman (1990, p. 244) considers the content as one of the first characteristics of the test.
He says that a necessary part of validation is to demonstrate that a test is relevant to and
covers a given area of content or ability. Weir (1990, p. 24) shares the idea that particular



attention must be paid to content validity to ensure that the knowledge area included in the
test is as representative of the target domain as possible. Hughes (1989, p. 22) asserts that a
test whose content constitutes a representative sample of the language skills, structures,
etc. with which it is meant to be concerned is said to have content validity. Content
validity, according to Harrison (1983, p. 11), is concerned with “what goes into the test”.
Henning (1987, p. 94) and Davies (2000, p. 172) give an even more specific statement: they claim that the test content should be "sufficiently representative and comprehensive for
the test to be a valid measure of what it is supposed to measure”. In other words, a test
must be selective in content, depending on the course syllabus and the purpose of the
test. For example, the Passive Voice should only be given in a test if it has been practiced
by the learners. Hughes (1989, p. 23) also points out that it is necessary to determine the
content of the test by what is important to test rather than what is easy to test.
As a result of the necessity of content validity, a table of specifications - a list of
what candidates are asked to do (McNamara, 2000, p. 50) or a statement of what the
content of the test ought to be (Alderson et al., 1995, p. 173) - upon which a test is based
should be included in the procedures of test design. The content specification is important
because it ensures as far as possible that the test reflects the particular language skills and
areas in a suitable percentage weighting. That is, the test writer should quantify and
balance the test components so as to indicate the importance of each component in relation
to the others in the test (Heaton, 1988, p. 161). The table of specifications, according to
Harrison (1983, pp. 16-23) and Alderson et al. (1995, pp. 11-14), should include:
- the objectives of the test,
- the skills and language elements which should be tested,
- the sort of candidates,
- the sort of tasks that are required,
- the number of sections,
- the number of items,
- the test method that is used,
- the type of texts that are chosen,
- the rubrics or instructions for candidates,
- the length of time allowed for the test, etc.
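As an illustration only, such a table of specifications can be kept as structured data so that each component's percentage weighting is computed rather than estimated; the components and item counts below are hypothetical, loosely echoing the areas this study tests (phonetics, grammar, vocabulary, communicative function, reading, writing).

```python
# Minimal, purely illustrative sketch of a table of specifications held
# as data; the components and item counts are hypothetical.
specification = {
    "objective": "1st-term achievement test for K-12 non-English majors",
    "method": "multiple choice, four options per item",
    "time_allowed_minutes": 45,
    "components": {  # language area -> number of items
        "phonetics": 5,
        "grammar and vocabulary": 15,
        "communicative function": 5,
        "reading": 10,
        "writing": 10,
    },
}

# Percentage weighting of each component in relation to the others.
total = sum(specification["components"].values())
for area, n in specification["components"].items():
    print(f"{area}: {n} items ({n / total:.0%})")
```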
In summary, content validity is an integral characteristic of a good test. However,

its limitation is that it focuses only on the test, not on test scores (Bachman, 1990, p. 247).


4.1.3. Construct validity
According to Alderson et al. (1995, p. 183), construct validity is the most difficult
concept to explain. While content validity focuses on the test, construct validity relates directly to the meaningfulness and appropriateness of the interpretations of test scores. The term construct validity is therefore used to refer to the extent to which a given test score can be interpreted as an indicator of the abilities, or constructs, that it is intended to measure
(Bachman, 1990, p. 255; Bachman & Palmer, 1996, p. 21). Constructs, according to
Bachman (1990, p. 255), “can be viewed as definitions of abilities that permit us to state
specific hypotheses about how these abilities are or are not related to other abilities, and
about the relationship between these abilities and observed behaviour". However, Alderson et al. (1995, p. 183) cite Ebel and Frisbie's (1991, p. 108) opinion:
The term construct refers to a psychological construct, a theoretical
conceptualisation about an aspect of human behaviour that cannot be measured or
observed directly. Examples of constructs are intelligence, achievement motivation,
anxiety, achievement, attitude, dominance, and reading comprehension.
A test is said to have construct validity if it measures just the ability which it is
supposed to measure (Hughes, 1989, p. 26). It means that the test should include exercises
and tasks similar to those used in the course and correspond to the general approach of the
course. For example, students should not be required to translate a passage in a test if they
are not familiar with this activity from the course (Davies, 2000, p. 172). If a test goes against this principle, it will be seen as unfair by the teachers and the learners. In Heaton's (1988, p. 161) opinion, if a course has adopted the communicative approach as its central teaching method, a test which consists chiefly of multiple-choice items will lack construct validity.
4.2. Reliability
Reliability is another essential quality of a good test. A test cannot be considered to
have validity without having reliability (Harrison, 1983, p. 12). Moreover, reliability is
primarily important to the use of both public achievement and proficiency tests and
classroom tests (Heaton, 1988, p. 162). It is commonly agreed that for a test to be reliable,
the test designer should take the following aspects into account (Harrison, 1983; Henning,
1987; Heaton, 1988; Hughes, 1989; Weir, 1990; Harmer, 1991; Davies, 2000):
1. How far we can believe or trust the results of the test: for example, if the same test is given to the same group of students twice within two days, they should get the same results on each occasion. In other words, the results should be consistent.


2. How far the test can be marked objectively: multiple choice tests are said to be
objective by nature as they do not need any personal judgement from the scorer. On the
contrary, direct tests, where more than one answer is possible, are purely subjective as
they require scorers to use their personal judgement.
3. The length and difficulty of the test: it is fair to students if the length and difficulty of
the test given are appropriate to the time allowance and students' language competence.
4. The discriminability of the test: if items in the test have high discriminability, the test
as a whole will have greater power to discriminate among students who are low and
high in the ability of interest. That is, the test will have greater reliability.
5. The test instruction: the instructions should be clear and unambiguous for all the
students and there should be no errors in the test; for example, if the students have to 'select the best answer – a, b, c, or d', there should not actually be two or more
acceptable answers.
6. How the test is administered: differences in the time of test administration, the extent

of test administrator interaction with examinees, the prevention of cheating behaviour
and the reporting of the time remaining or interruptions and distractions for one group
of examinees and not for another may reduce reliability.
7. Personal factors: temporary changes in examinees such as fatigue, sickness and
emotional disturbance may also cause the reduction of reliability.
8. Response characteristics: this aspect concerns students' guessing. As mentioned in subsection 3.8, objective tests of the multiple-choice type are said not to encourage students to use their language competence to do the task but to encourage them to guess the answers.
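The consistency described in point 1 is conventionally quantified with a correlation coefficient between the two administrations. The following minimal sketch computes Pearson's r for hypothetical paired scores; the formula is the standard one, not a procedure taken from this thesis.

```python
# Minimal sketch: Pearson's r between two administrations of the same
# test (the test-retest view of reliability in point 1). The scores are
# hypothetical pairs for six students.
from statistics import mean

first  = [5.0, 6.5, 7.0, 8.0, 4.5, 6.0]   # first administration
second = [5.5, 6.0, 7.5, 8.0, 4.0, 6.5]   # second administration, two days later

mx, my = mean(first), mean(second)
cov  = sum((x - mx) * (y - my) for x, y in zip(first, second))
varx = sum((x - mx) ** 2 for x in first)
vary = sum((y - my) ** 2 for y in second)
r = cov / (varx * vary) ** 0.5
print(f"test-retest correlation: r = {r:.2f}")  # close to 1.0 -> consistent results
```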
4.3. Practicality
A test must be practical. That is, it must be well organized in advance to be
straightforward to administer. There are many matters concerning the practicality of a test which a test designer should take into consideration; however, the following are typical ones pointed out by Harrison (1983, pp. 12-13) and Heaton (1988, pp. 167-168):
- the length of time available for the administration of the test, as it is frequently misjudged even by an experienced test designer;
- whether or not special arrangements have to be made, for example, what happens to the rest of the class while individual speaking tests take place;
- the preparation of equipment needed for the test, such as a tape recorder, overhead projector, etc.;
- the length of time for marking and the number of teachers involved;
- the reproduction of test materials in quantity and its cost;
- how the test will be stored before the administration and between sittings of the test;
- the presentation of the test paper itself, that is, it should be printed and appear neat and tidy.

4.4. Discrimination
Discrimination is another important feature of a good test: a good test discriminates among the different students and reflects the differences in the performances of the individuals in the group (Heaton, 1988, p. 165). In other words, it is "the extent to which a test separates the students from each other" (Harrison, 1983, p. 14). The extent to which a test needs to discriminate will depend on its purpose. For example, in classroom tests, to find out how well the students have mastered the syllabus, the teacher may hope for a cluster of marks around the 80 and 90 per cent brackets (Heaton, 1988, p. 165). Briefly, to make finer discriminations among students, the items should be spread over a wide range of difficulty levels, as follows:
- extremely easy items
- very easy items
- easy items
- fairly easy items
- items below average difficulty level
- items of average difficulty level
- items above average difficulty level
- fairly difficult items
- difficult items
- very difficult items
- extremely difficult items
(Heaton, 1988, p. 167)
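As a minimal sketch of how such levels can be read off in practice, the code below applies the standard textbook formulas for the Facility Value and the Discrimination Index defined in the list of abbreviations (U and L being the numbers of upper- and lower-group students answering an item correctly, n the size of each group); the study's own computation is described in section 3.6.4, and the figures here are hypothetical.

```python
# Minimal sketch of manual item analysis (cf. section 3.6.4), using the
# quantities from the list of abbreviations. Standard textbook formulas;
# the example figures are hypothetical.
def item_analysis(U, L, n):
    fv = (U + L) / (2 * n)  # Facility Value: overall proportion correct
    d = (U - L) / n         # Discrimination Index: upper/lower separation
    return fv, d

# Hypothetical item: 14 of 15 upper-group and 5 of 15 lower-group correct.
fv, d = item_analysis(U=14, L=5, n=15)
print(f"FV = {fv:.2f}, D = {d:.2f}")  # FV = 0.63, D = 0.60

# A crude reading of FV against the spread of difficulty levels above.
if fv > 0.8:
    print("an easy item")
elif fv < 0.3:
    print("a difficult item")
else:
    print("an item of roughly average difficulty")
```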

5. Test Items
5.1. Direct test items


Direct test items are the ones that ask students to perform the communicative skills
which are being tested. They should be as much like real-life language use as possible
(Harmer, 1991, p. 322). For direct test items to be 'valid' and 'reliable', test designers need to:


- Create a 'level playing field': that is, test items need to avoid making excessive demands on the student's general or specialist knowledge.
- Replicate real-life interaction: it means that test items should be as much like real life as possible.
(Harmer, 1991, pp. 325-326)

5.2. Indirect test items
According to Harmer (1991, p. 322), indirect test items "try to measure a student's knowledge and ability by getting at what lies beneath their receptive and productive skills". In other words, they try to find out about a student's language knowledge through more controlled items such as multiple choice questions or grammar transformation items.
Indirect test items are divided into a wide range of types; however, the following are in common use:


- Multiple choice questions (MCQs): this type of test item requires students to select a correct answer from a number of given options (Weir, 1990, p. 43; Shohamy, 2002, p. 38). The initial part of a multiple choice item is the stem. The choices from which the students select their answers are known as options/responses/alternatives. One option is the answer/correct option or key, while the others are distractors. Distractors are used to distract the students who do not know the answer from the correct option (Heaton, 1988, p. 28). MCQs can provide a useful means of teaching and testing in various learning situations. They are particularly useful in measuring students' ability to recognize correct grammatical forms. They can also help both students and teachers to identify areas of difficulty (Heaton, 1988, p. 27). What is more, a test with MCQs has almost complete marker reliability, as only one of the options given in each item is correct. This also enables the scoring to be done by a computer, so this kind of test can be applied to testing a large number of students. The MCQ, however, is one of the most difficult and time-consuming types of item to construct. This kind of item "frequently does not lend itself to the testing of language as communication" (Heaton, 1988, p. 27). When MCQs are used to test …
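Because each MCQ has exactly one key, marking reduces to comparing responses against an answer key, which is why such tests can be scored by machine, as noted above. The following minimal sketch illustrates this; the items and answers are hypothetical.

```python
# Minimal sketch of machine scoring for MCQs: each item has exactly one
# key, so marking needs no judgement. Items and answers are hypothetical.
ANSWER_KEY = {1: "B", 2: "D", 3: "A", 4: "C"}  # item number -> correct option

def score(responses):
    """Count the responses that match the key; distractors earn nothing."""
    return sum(responses.get(item) == key for item, key in ANSWER_KEY.items())

candidate = {1: "B", 2: "D", 3: "C", 4: "C"}   # one student's answer sheet
print(f"score: {score(candidate)}/{len(ANSWER_KEY)}")  # score: 3/4
```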

