CHAPTER 1: INTRODUCTION
1.1. RATIONALE
A good test can be used as a valuable teaching device. Heaton (1991:5) states that
“test may be constructed primarily as devices to reinforce learning and to motivate the
student or primarily as a means of assessing the students’ performance in the language.”
According to this linguist, the relationship between testing and teaching is “so closely
interrelated that it is virtually impossible to work in either field without being constantly
concerned with the other”. For proper evaluation and assessment of the English language
learning and teaching process, testing, an important tool in educational research and program
evaluation (Lauwerys and Scanlon, 1969:2), is employed as an indispensable part
of the training program at Ngo Quyen High School (NQHS) in Hai Phong city.
However, designing a good test is not simple. Having been a teacher of English
for many years, I have been involved in designing, administering and marking many kinds
of English tests, such as progress and end-of-term tests, and have often heard teachers
and test-takers at NQHS complaining that some of the final achievement tests for 12th form
students do not faithfully reflect the real linguistic competence of the test-takers. What is
tested is not really taught and the test measures neither the achievement of the course
objectives nor the expected linguistic skills and knowledge of the students. Probably, this is
because the test writers use the tests which are designed elsewhere and are not suitable for
the students. What test writers are concerned with seems to be the reliability of the test
rather than its validity. The situation coincides with the comments made by test
researchers such as Brown (1994:373) and Hughes (1989:1) on recent language testing: “a great
deal of language testing is of very poor quality. Too often language testing has a harmful
effect on teaching and learning and too often they fail to measure accurately whatever it is
they are intended to measure”. Another reason is that language testing here has not been
paid enough attention. I have not witnessed any comprehensive or systematic
evaluation of the effectiveness and appropriateness of these tests.
For the above-mentioned reasons, the author is encouraged to undertake this minor
thesis with the aim of investigating the design of final written achievement tests for the 12th
form students at NQHS through evaluating a current final achievement test by both
students and teachers, mainly in terms of its validity. I hope that the results of the study can
then help to improve the quality of the final achievement tests for the 12th form students at
NQHS.
1.2. SCOPE OF THE STUDY
Due to limitations of time and ability, the study is confined to examining the
current final achievement test for the 12th form students at NQHS, mainly in terms of its
validity.
The study provides empirical evidence about the current final achievement test and
proposes practical suggestions for the improvement of the final tests for the 12th form
students at NQHS in general.
1.3. AIMS OF THE STUDY
The study is aimed at reporting the results of the examination of the current final
achievement test for the 12th form students at NQHS in terms of its validity. It
emphasizes analyzing the teachers’ and students’ evaluation of the test and their
suggestions for its improvement.
The specific aims of the research are:
- To investigate the NQHS English teachers’ and the 12th form students’ evaluation of the
current final achievement test in terms of its validity.
- To find out the differences and similarities (if there are any) in teachers’ and test takers’
evaluation of the test and to suggest reasons why there are such similarities and
differences.
- To provide some practical recommendations for the improvement of the final
achievement tests so as to achieve more accurate measures of students’ English
competence.
1.4. RESEARCH QUESTIONS
The research questions of the study are as follows:
- How is the current final achievement test for the 12th form students at NQHS evaluated by
both students and teachers in terms of its validity?
- What improvements are recommended by the teachers and students with regard to the
validity of the test?
1.5. METHODS OF THE STUDY
In order to achieve the above aims, a study has been carried out with the following
methodologies.
First, the author drew on the theory and principles of language testing, the
major characteristics of a good test (with special focus on test validity), the theory of achievement tests
and practical tips for writing them. From her critical reading, many reference materials have been
gathered, analyzed, and synthesized to establish a theoretical basis for evaluating the current
final achievement test for the 12th form students at NQHS.
Second, qualitative methodologies involving data collected through survey
questionnaires were employed. Two questionnaires were administered to the 12th form
students and teachers of English at NQHS in order to investigate their evaluative
comments on the current final achievement test in terms of its validity and their
suggestions for its improvement. In addition, other methods such as interviews,
informal discussions with students and teachers, and classroom testing observation were also
used to obtain further information.
1.6. STRUCTURE OF THE STUDY
The minor thesis is organized into five major chapters:
- Chapter one presents basic information such as the rationale, the aims, the research
questions, the methods, and the structure of the study.
- Chapter two reviews the related literature that provides the theoretical basis for
evaluating and building a good language test. This review consists of background on
language testing, criteria of a good test, and theory on written achievement tests, including their
two kinds and practical tips for writing them.
- Chapter three, the main part of the study, analyzes the results of the survey, including the
questionnaires and direct interviews, to find out the existing problems in designing the
current achievement test in particular and other final achievement tests in general at
NQHS.
- Chapter four proposes some suggestions for improving the design of the final
achievement tests based on the theoretical and practical study above.
- Chapter five provides a summary and suggestions for further research on the topic, and
reference materials as well.
CHAPTER 2: LITERATURE REVIEW
2.1. DEFINITION OF TESTING
A test is generally defined by Carroll (1968:46) as follows: “A psychological or educational
test is a procedure designed to elicit certain behavior from which one can make inferences
about certain characteristics of an individual”. Simply put, a test is an instrument designed
to elicit a specific sample of an individual’s behavior.
Similarly, Davies (1991:13) states that tests are operational in nature, i.e., they
are intended to measure whether or not the candidates can do certain things in English. The
“things” they are asked to do are specified at each level and represent authentic tasks of the
sort which confront language users in real life.
Genesee and Upshur (1996) look at a test as a task that measures one’s
ability to perform particular tasks. They argue that a test is, first of all, about something.
That is, it is about intelligence, or European history, or second language proficiency. In
educational terms, tests have subject matter or content. Second, a test is a task or set of
tasks that elicits observable behavior from the test taker. The test may consist of only one
task, such as writing a composition, or a set of tasks, such as in a lengthy multiple-choice
examination in which each question can be thought of as a separate task. Different test
tasks represent different methods of eliciting performance. Third, tests yield scores that
represent attributes or characteristics of individuals. In order to be meaningful, test scores
must have a frame of reference. Test scores, along with the frame of reference used to
interpret them, are referred to as measurement. Thus, tests are a form of measurement.
(p.141). In other words, content, methods and measurement are three aspects of tests. The
quality of the end-of-year tests depends on whether the content of the test is a good sample
of the relevant subject matter. If the content of a test is a poor reflection of what has been
taught or what is supposed to be learned, then performance on the test will not provide a
good indication of achievement in that subject area. What a test is measuring is a reflection
of not only its content but also the method it employs. Tests that employ different methods
are measuring somewhat different skills, no matter how similar their content might be.
Tests in education measure differences in degree. They describe how proficiently students
can read a second language or how appropriately they speak in particular social situations,
for example.
In the foreign language teaching context, a test can be defined as an educational
instrument which is designed to measure what someone can do with the foreign language
to serve a particular purpose (McNamara, 2000:11). As an instrument, a test may be responded to
differently by testees and test-users. Understanding testees’ and test-users’ responses to, and
perceptions of, tests has been a critical issue in foreign language testing. Such
understanding is even more important where learner-centredness is promoted as a
philosophical orientation in foreign language teaching.
Testing, the act of administering a test, is closely related to teaching and learning.
This relationship is discussed in the next section (section 2.2).
2.2. RELATIONSHIP BETWEEN TESTING, TEACHING AND LEARNING
With regard to the relationship between testing, teaching and learning, there have
been two extreme views.
In the past, there was a common view that testing and teaching were separate
both in theory and in practice. According to this view, a test is a necessary but
unpleasant imposition from outside the classroom: it helps to set standards but uses up
valuable class time.
Other researchers, however, acknowledge the close link between the two. For example,
Harrison (1991:7) believes that, far from being divorced from each other, testing and
teaching are closely interrelated. A test is seen as a natural extension of classroom work that can
serve as a basis for improvement.
Upshur (1971) adds that language testing both serves and is served by research in
language acquisition and language teaching. Language tests can be valuable sources of
information about the effectiveness of learning and teaching. Language teachers regularly
use tests to help diagnose student strengths and weaknesses, to assess student progress, and
to assist in evaluating student achievement. Language tests are also frequently used as
sources of information in evaluating the effectiveness of different approaches to language
teaching. As sources of feedback on learning and teaching, language tests can thus provide
useful input into the process of language teaching.
That kind of feedback is termed “backwash” by Hughes (1989) who defines the
term as “the effect of testing on teaching and learning”. He goes on to explain that testing
can have either a beneficial or a harmful effect on teaching and learning. “If a test is
regarded as important, then preparation for it can come to dominate all teaching and
learning activities. And if the test content and testing techniques are at variance with the
objectives of the course, then there is likely to be harmful backwash” (p.1). However, he
notes that the relationship between teaching and testing should be one of partnership. In other
words, we cannot expect testing simply to follow teaching, as in Davies’s (1968:5) view of a good test
as an obedient servant that follows and apes the teaching. What we should demand
of it, however, is that it should be supportive of good teaching and, where necessary, exert
a corrective influence on bad teaching. If testing always had a beneficial backwash on
teaching, it would have a much better reputation amongst teachers (Hughes, 1989:2).
Cohen (1994) discusses the effects of backwash more broadly, in terms of “how
assessment instruments affect educational practices and beliefs” (p.41). Wall and Alderson
(1993) go a little farther, arguing convincingly on the basis of extensive empirical
research that backwash has the potential to affect not only individuals but the educational
system as well.
Read (1983:2) points out: “A test can help both teachers and learners to clarify what
the learners really need to know assuming that it is unrealistic to expect them to master
everything they are presented with during a particular course.” The results of tests show
teachers not all but part of learners’ ability, which helps teachers to improve their ways of
teaching or to revise what has been taught.
According to Heaton (1988:7), “a well-constructed classroom test will provide the
students with an opportunity to demonstrate their ability to perform certain tasks in the
language and the students should be able to learn from their weakness”. Obviously, under
the influence of the tests, the students are motivated to build on what they have done well and to avoid
the mistakes and errors that they have made. The learners know how far they have
achieved the objectives of the course, so they can either move up a level or realize they have to learn
more. “A good test can sustain or enhance class morale and aid learning” (Madsen,
1983:3).
Because of the important role a test plays in either supporting or impeding teaching
and learning, it is critical that a test be supportive of good teaching. This raises the
necessity to investigate the opinions of the test users, specifically the learners and the
teachers.
2.3. TYPES OF ACHIEVEMENT TESTS
An achievement test is one of the means available to teachers and students alike of
assessing progress. According to Hughes (1990:10), “achievement tests are directly related
to language course, their purpose being to establish how successful individual students,
groups of students, or the courses themselves have been in achieving objectives”. To make
it clearer and to distinguish it from others simultaneously, Harrison (1991) stresses that
“an achievement test looks back over a longer period of learning than the diagnostic test”
(p.7). He provides a clear distinction between achievement and diagnostic tests in that
achievement tests cover a much wider range of material than diagnostic tests and relate to
long-term rather than short-term objectives. Achievement tests are designed to assess the
whole course or even a number of courses. Those students who have finished an English
course will sit the test and will be evaluated on whether or not they have learnt the material well.
Their standards and differences are judged in relation to other students at the same stage
on the basis of test results. On the other hand, diagnostic tests also look back on the previous course for
persistent errors, which then form the basis for remedial work. It can be inferred that diagnostic tests
can be used to predict and improve future teaching and learning.
Additionally, Heaton, widening the concept of achievement tests, defined
them as the ones “based on what the students are presumed to have learnt, not necessarily on
what they have actually learnt nor on what has actually been taught” (Heaton, 1991:172).
According to the time of administration and the intended objectives, achievement tests
can be subdivided into two kinds: progress achievement tests and final achievement
tests.
2.3.1. Progress achievement tests
Progress achievement tests are always administered during the course, after a
chapter or a term, and are often written by the teacher. They are based on the teaching program.
Hughes (1990:12) claims that “these tests are intended to measure the progress that students are
making.” Since “progress” here means progress in achieving course objectives, these tests should be related to
those objectives. They should show a clear progression towards the final achievement test
based on the course objectives. Then, if the syllabus and teaching methods are appropriate to
these objectives, progress tests based on short-term objectives will fit well with what has
been taught. If not, there will be pressure to create a better fit.
Progress achievement tests are supposed to help the teacher to judge the degree of
success of his or her teaching and to find out how much students have gained from
what has been taught. Accordingly, the teachers can identify the weaknesses of the learners
or diagnose the areas not properly achieved during the course of study.
In short, progress achievement tests can be regarded as a useful device that provides
the students with a good chance to perform in the target language in a positive and effective
manner and to gain additional confidence. They can be a good
preparatory and supportive step towards the final achievement test because the students
become familiar with the test types and the strategies for doing them.
2.3.2. Final achievement tests
Final achievement tests, as the name suggests, are usually formal examinations,
given at the end of the school year or at the end of the course to measure how far students
have achieved the teaching goals (Hughes, 1990:10). They may be written and
administered by ministries of education, official examining boards, or by members of
teaching institutions. The content of these tests must be related to the courses with which
they are concerned. Hughes (1990:11) suggests two approaches to designing
achievement tests: the syllabus-content approach and the objectives-based approach.
The syllabus-content approach means that the content of a final achievement test
should be based on a detailed course syllabus or on the books and other material used. Tests
designed on the basis of what the students have already learnt from the course books can be
considered fair tests. On the contrary, a badly designed syllabus or badly chosen material
that departs from the course objectives may bring about misleading results, which are
unlikely to show what students have actually achieved. When this occurs, the test
will fail to be valid in terms of the course objectives.
The objectives-based approach is to base the test content directly on the
objectives of the course. This approach has some good points. Firstly, it forces course
designers to be explicit about course objectives. Secondly, this approach can help to work
against poor teaching practice in a way that syllabus-content-based tests cannot. However, this
approach runs the risk of testing what the students have neither learned
nor prepared for.
Of the two approaches mentioned, Hughes (1990:11) favors the latter, arguing that it will provide more accurate information about individual and group
arguing that it will provide more accurate information about individual and group
achievement, and it is likely to promote a more beneficial backwash effect on teaching.
2.3.3 Roles of achievement tests
The roles of achievement tests are clearly shown by McNamara (2000:6):
“Achievement tests accumulate evidence during or at the end of a course of study in order to
see whether and where progress has been made in terms of the goals of learning.
Achievement tests should support the teaching to which they relate.” That is, achievement
tests play an important role in the teaching-learning process. Besides bearing all the
characteristics of a normal test, achievement tests can supply more accurate and fuller
information because they look back on the course students have been learning. These tests’
backwash effect can show teachers how appropriate or effective their teaching has
been. Furthermore, results obtained from achievement tests enable teachers to become
familiar with the progress of each student and with that of the class in general (Heaton,
1991:1). With final achievement tests and progress achievement tests, teachers can maintain
regular control of their classes through the test results.
This type of test works mainly as a motivation for learning. It should “encourage the
students to perform well in the target language and to gain additional confidence” (Heaton,
1990:171). When a good test is conducted and the test results are high, students may feel
encouraged and try harder. Even a bad performance can be an incentive to work harder,
because progress achievement tests are administered frequently during the course of study.
In short, the achievement test works as an assessment of both teachers’ and students’
performance during the whole course and as an encouragement to both teachers’ and
students’ progress. Thus, it cannot be neglected in any syllabus or teaching program.
2.3.4 Practical tips for writing achievement tests.
Harrison (1991) states that designing and setting an achievement test is a bigger
and more formal operation than the equivalent work for a diagnostic test. An achievement
test involves more detailed preparation and covers a wider range of material as it relates to
long-term rather than short-term objectives (p.64).
As for Harrison (1983:7), it is necessary for test writers to draw up a test
specification before writing a test. A test specification results from the process of
designing test content and test method (McNamara, 2000:31). The specifications include
information on the length, the structure of each part of the test, the type of materials, the
extent to which the candidates will have to engage with them, the source of materials, the extent to
which authentic materials may be altered, the response format, the test rubrics and how
responses are to be scored. They are usually written before the test and then the test is
written on the basis of the specifications. After the test is written, the specification should
be consulted again to see whether the test matches the objective set in the specification.
Therefore, writing specifications is an important step because it ensures that item
writers can write test items that measure appropriately whatever the test developers
intend to measure and that the range of conditions suitable for the test objectives will not be
exceeded. When writing specifications, teachers should use an index card, on the top of
which they write the test objectives, with the table of specifications below. They
should try not to repeat the wording of the objective, and should remember to increase the level of
detail in preparation for writing test items. The final step is writing the items themselves and
entering them on the back of the index card.
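To illustrate how a specification can be used to keep item writing in check, the following sketch (in Python, with section names and item counts invented purely for illustration and not taken from any NQHS test) represents a simple specification and reports where a draft set of items departs from it.

from collections import Counter

# Illustrative only: the sections and figures below are invented,
# not drawn from the NQHS final achievement tests.
specification = {
    "time_minutes": 60,
    "sections": {              # section -> number of items required
        "phonetics": 5,
        "grammar_vocabulary": 25,
        "writing": 10,
        "reading": 10,
    },
}

# A draft test: each item is tagged with the section it belongs to.
draft_items = (
    ["phonetics"] * 5
    + ["grammar_vocabulary"] * 27   # two items too many
    + ["writing"] * 10
    + ["reading"] * 8               # two items missing
)

def check_against_spec(items, spec):
    """Report where a draft test departs from its specification."""
    counts = Counter(items)
    for section, required in spec["sections"].items():
        written = counts.get(section, 0)
        if written != required:
            print(f"{section}: {written} items written, {required} required")

check_against_spec(draft_items, specification)
# grammar_vocabulary: 27 items written, 25 required
# reading: 8 items written, 10 required

Such a check is only a convenience; the substantive work of matching items to objectives still rests with the item writers.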
Harrison (1983:16) indicates the following factors to be taken into consideration
when one sets up the table of specifications for a test:
- Time: The first factor teachers should attend to is how
much can be tested in the time available for the test. They should decide on a reasonable
amount of time for the majority of the test takers to be able to complete the test. If not,
a counter-effect will occur, as the students become too panicked and fearful to do the work
under pressure of time. Students who are not given enough time will not be able to
demonstrate their full achievement. On the other hand, students who are given too
much time to do a test can treat it like a puzzle rather than an actual language test.
- Coverage: The next important factor to be taken into account is determining the test
content in terms of grammatical and functional items and skills so that it accurately
reflects the syllabus and objectives. It also involves determining whether the test
should ask for the main idea, specific details or inferences, etc.
- Test techniques: subjective and objective methods: There are many techniques for
testing both language and skills. Many of them are familiar only to teachers; they
should also be made familiar to students before being used in a test. Heaton (1988:27) takes a
similar view on test techniques by arguing that a good classroom test will usually
contain both subjective and objective test items. Each method has its own strong points
and weak points. The reason he gives for such a combination of test techniques is that it
helps to guarantee a high quality of the test. The choice of test type will depend on
what has been taught, to what extent and how, that is to say, it depends on the syllabus.
Objective items (for instance, multiple choice items, matching, true-false
items, etc.) can be marked very quickly and completely reliably because each item has only
one correct answer or a limited number of correct answers. This kind of test can
be marked by a machine or by an inexperienced person. Objective tests, therefore,
can produce reliable results and focus on accuracy and discrete items, but they
provide an assessment of only a limited range of the students’ abilities. An
objective test will be a very poor test if its items are poorly written, if irrelevant
areas and skills are emphasized simply because they are testable, or if it
is confined to language usage and neglects the communicative skills involved.
Subjective items (such as compositions, reports, letters, information
transfer, etc.), on the other hand, offer better ways of testing language skills and
certain areas of language than objective questions. Subjective tests can provide
information about the students’ wider command of communication, but that
information may be supplied somewhat haphazardly and is not always easy to
assess in a reliable way, though marking guides or performance descriptions can go
a considerable way towards reducing this unreliability.
- Format: Another factor that needs to be focused on is the test format, that is, the form the
test items are going to take. Teachers themselves have to decide the length of the test
items as well as of the whole test, the number of questions, the kind of test methods to be used
(objective or subjective), and finally the time allowance. More importantly,
some guidelines should be given to the testees when determining the test format.
Lastly, the test writers have to decide at this point whether to use an objective or
subjective format for each part of the test. This choice has important implications for the
marking of the test.
- Difficulty is another area that calls for teachers’ attention when constructing a test. It
involves choosing an appropriate level for each item or part of the test. The level of
difficulty of items included in the test should parallel that of the practice activities done
by the students during the course. The kind of variation in difficulty of test
items appropriate to placement or proficiency tests is not necessary in an achievement
test, as its primary aim is not to discriminate between strong and weak students.
- Rubrics: The test instructions should be clear and unambiguous; otherwise they will
invalidate the test by misleading the students and turning the instructions themselves into an additional,
though unintended, test item (Dangerfield, 1985:150). The students may complete the
items wrongly because they have misinterpreted the instructions. It is also advisable to
provide an example of an answered test item where the format permits (e.g. in the case
of multiple choice or sentence transformation items but not, of course, in the case of
compositions).
- Marking: Marking is an important but complicated part of the testing structure. It is
usually the last step of the whole test-designing process and enables the tester to make an
exact and true evaluation of the testees’ performance on the test. It includes the keys,
marking instructions, marking scale, etc., needed for each item and for the whole test.
- The most important point to be noted here is that the weighting of the different parts of the test
should reflect the balance of the syllabus (a small numerical sketch follows this list). Second, the weighting of marks should take
into consideration the difficulty of a test item and, to an extent, the proportion of the
overall test time that students are likely to need to complete those items. A final point in
relation to marks is that, if the test includes an element which has to be marked
subjectively, the teachers should give careful thought not only to the proportion of the total marks allotted to it,
but also to the criteria to be used for assessing that element. Even when only one
person is marking a set of test papers, it is important for reliability and consistency that
marking should be done according to guidelines of one form or another.
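As a rough numerical sketch of the point about weighting, the Python fragment below allocates a 50-mark total in proportion to teaching time; the periods-per-area figures are hypothetical and not taken from the Tieng Anh 12 syllabus.

# Hypothetical figures, for illustration only.
periods_per_area = {
    "grammar_vocabulary": 45,
    "reading": 30,
    "writing": 20,
    "phonetics": 10,
}

TOTAL_MARKS = 50
total_periods = sum(periods_per_area.values())

# Allocate marks in proportion to teaching time, rounded to whole marks.
weights = {
    area: round(TOTAL_MARKS * periods / total_periods)
    for area, periods in periods_per_area.items()
}

print(weights)
# {'grammar_vocabulary': 21, 'reading': 14, 'writing': 10, 'phonetics': 5}

In practice the proportions would also be adjusted for item difficulty and the time each part demands, as noted above.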
2.4. MAJOR CHARACTERISTICS OF A GOOD TEST
The most important consideration in designing a language test is its usefulness, and
this can be defined in terms of some basic test characteristics. To write a good test,
Harrison (1983:10) claims that it is essential for test designers to consider reliability,
validity, discrimination and practicality. These four test qualities all contribute to test
usefulness, so they cannot be evaluated independently of each other.
2.4.1. Test reliability
As for Harrison, the reliability of a test is its consistency. It is important that the
student’s score should be the same (or as nearly the same) whether he takes one version of
a test or another, and whether one person marks the test or another. Reliability also means
the consistency with which a test measures the same thing all the time. There are therefore
three aspects to reliability: the circumstances in which the test is taken, the way in which it
is marked and the uniformity of the assessment it makes.
Test reliability is considered by Moore (1992:110) as the consistency of a measurement device,
that is, the dependability and trustworthiness of that device.
Bachman (1990:24), a leading testing expert, describes reliability as “a quality of
test scores”. He points out that if a student receives a low score on a test one day and a high
score on the same test two days later, the test does not yield consistent results, and the score
cannot be considered a reliable indicator of the individual’s ability.
To sum up, reliability is a necessary characteristic of any good test. In other words,
for a test to be valid at all, it must first be reliable as a measuring instrument; that is, it
should measure consistently whatever it is supposed to measure.
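As a minimal numerical illustration of the consistency discussed above (not part of the thesis data), the sketch below estimates test-retest reliability as the Pearson correlation between two sittings of the same test, using made-up scores.

import statistics

# Made-up scores of eight students on the same test taken twice.
first_sitting  = [32, 41, 27, 45, 38, 30, 44, 36]
second_sitting = [34, 40, 25, 47, 37, 31, 42, 35]

def pearson(x, y):
    """Pearson correlation coefficient between two lists of scores."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

# A coefficient close to 1 suggests the test ranks the students
# consistently across the two sittings, i.e. good test-retest reliability.
print(round(pearson(first_sitting, second_sitting), 2))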
2.4.2. Test validity
Heaton (1991:159) defines validity as “the extent to which the test measures what
it is intended to measure”. A test is considered valid when it specifically measures what it is
supposed to assess. In other words, the interpretation made from the test results is
appropriate to the purpose of testing.
Similarly, Henning (1987) states that a test is valid if it measures accurately what it
is intended to measure. This seems simple enough. However, it is not simple to say
whether or not a test is valid, because validity has several sub-types, such as face,
content and construct validity, each of which deserves our attention. In this part I will present each
aspect in turn.
2.4.2.1. Face validity
According to Tim McNamara (2000:105) “face validity is a type of validity
referring to the degree to which a test appears to measure the knowledge or abilities it
claims to measure, as judged by an untrained observer (such as the candidate taking the test,
or the institution which plans to administer it)”. Face validity is concerned with what
teachers and students think of the test. Does it appear to them a reasonable way of
assessing the students, or does it seem trivial, or too difficult, or unrealistic? A test which
pretended to measure pronunciation ability but which did not require the candidate to
speak might be thought to lack face validity. That means, face validity concerns the appeal
of the test to the popular judgement, typically that of other testers, teachers, moderators,
and test takers.
Alderson, Clapham and Wall (1995:173) recognized face validity as an
influential factor in testing. According to them, while students’ opinions about tests are
not expert opinions, they can be important because those opinions represent the kind of response
that you can get from the people who are taking the test. If a test does not appear to be
valid to the test takers, they may not do their best, so the perceptions of non-experts are
useful.
2.4.2.2. Content validity
Content validity and face validity are considered the two types of internal
validity, that is, validity in terms of the test itself.
Harrison (1983:11) defines content validity as follows: “Content validity is concerned with
what goes into the test. The content of the test should be decided by considering the
purpose of the assessment, and then drawing up a list known as a content specification”.
This means that the test content must constitute a representative sample of the language skills,
structures or course content to be measured. In this case, the relationship between the test
items and the course objectives is always apparent. A grammar test, for instance, must be
made up of items testing knowledge or control of grammar. The test would have content
validity only if it included a proper sample of the relevant structures. Just what are the
relevant structures will depend, of course, upon the purpose of the test.
What is the importance of content validity? “First, the greater a test’s content
validity, the more likely it is to be an accurate measure of what it is supposed to measure. A
test in which major areas identified in the specification are under-represented (or not
represented at all) is unlikely to be accurate. Secondly, such a test is likely to have a
harmful backwash effect. Areas which are not tested are likely to become areas ignored in
teaching and learning” (Hughes, 1989:22).
So, in content validation, the experts should look at whether the test is
representative of the skills they are trying to test. That is to say, the experts look at the
content of the test and compare it with a statement of what the content ought to be. This
involves looking at the syllabus, in the case of an achievement test, and at the test specifications
and deciding what the test was intended to test and whether it accomplishes what it is
intended to. In other words, the content validity depends on a careful analysis of the
language being tested and of the particular course objectives.
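The comparison of test content with a statement of what the content ought to be can be pictured as a simple coverage check. The sketch below uses hypothetical topic labels (not the actual Tieng Anh 12 syllabus) and merely flags syllabus areas that a draft test leaves unrepresented; real content validation, of course, rests on expert judgement rather than on a script.

# Hypothetical topic labels, for illustration only.
syllabus_topics = {
    "passive voice", "reported speech", "conditionals",
    "relative clauses", "phrasal verbs", "word stress",
}

# Topics actually touched on by the draft test items.
tested_topics = {
    "passive voice", "reported speech", "relative clauses", "word stress",
}

uncovered = syllabus_topics - tested_topics
coverage = len(tested_topics & syllabus_topics) / len(syllabus_topics)

print(f"Coverage: {coverage:.0%}")              # Coverage: 67%
print(f"Not represented: {sorted(uncovered)}")  # ['conditionals', 'phrasal verbs']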
2.4.2.3. Construct validity
Davies (1999:33) states that “the construct validity of a language test is an
indication of how representative it is of an underlying theory of language learning.
Construct validation involves an investigation of the qualities that a test measures, thus
providing a basis for the rationale of a test”.
For Arthur Hughes (1989:26), “A test is said to have construct validity if it can be
demonstrated that it measures just the ability which it is supposed to measure”. The word
“construct” refers to any underlying ability which is hypothesized in a theory of language
ability. Take reading as an example. We construct items related to reading ability and
administer them as a pilot test. Then we take independent samples of reading ability and obtain reliable
scores. Finally, the two sets of scores are compared. If the correlation coefficient is satisfactory,
the test is said to measure reading ability. In other words, if we attempted to measure a particular
ability in a test, then that part of the test would have construct validity only if we
were able to demonstrate that we were indeed measuring just that ability.
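A minimal sketch of the comparison Hughes describes, assuming Python 3.10+ and entirely invented scores: pilot reading items are correlated with an independent, already trusted measure of reading ability, and a high coefficient is read as (weak) supporting evidence that the items tap the intended construct.

import statistics

# Invented data: ten students' scores on the pilot reading items and on
# an independent, already trusted measure of reading ability.
pilot_reading    = [12, 18, 9, 15, 20, 7, 14, 17, 11, 16]
criterion_scores = [45, 62, 38, 55, 70, 30, 52, 60, 43, 58]

# statistics.correlation (Python 3.10+) returns Pearson's r.
r = statistics.correlation(pilot_reading, criterion_scores)
print(f"r = {r:.2f}")

# A value near 1 is consistent with, though it does not prove, the claim
# that the pilot items measure reading ability.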
2.4.3. Relationship between reliability and validity.
Test researchers and developers have admitted that reliability and validity are
essential measurement qualities. This is because these are the qualities that provide the
major justification for using test scores as a basis for making inferences or
decisions (Bachman and Palmer, 1996:19).
We often think of reliability and validity as two distinct but related characteristics
of test scores. Although validity is the most important characteristic, reliability is a
necessary condition to validity. The two measurement qualities, reliability and construct
validity, are thus essential to the usefulness of any language test. Reliability is a necessary
condition for construct validity, and hence for usefulness. To be valid a test must provide
consistently accurate measurements. It must therefore be reliable.
Reliability and validity are considered two basic principles by Heaton (1990:6)
when writing useful tests. A reliable test, however, may not be valid at all. In other words,
reliability is not a sufficient condition for either construct validity or usefulness. Suppose, for
example, that we needed a test for placing individuals into different levels in an academic
writing course. A multiple-choice test of grammatical knowledge might yield very
consistent or reliable scores, but this would not be sufficient to justify using the test as a
placement test for the writing course. This is because grammatical knowledge is only one
aspect of the ability to use language to perform academic writing tasks.
It should be noted that a test could be reliable without possessing validity.
However, reliability is clearly inadequate by itself if a test does not succeed in measuring
what it is supposed to measure. Test writers cannot simply increase the
validity of an already written reliable test, because validity depends on the features of the test items that constitute it. From the outset
of test construction, therefore, test validity should be the most essential focus of all. A reliable test, in
fact, may not be quite valid. For example, a multiple-choice test may be very reliable, but
its validity is poor if it fails to measure what it intends to measure.
Furthermore, the emphasis on test validity is underlined by Hughes (1989:22):
“the greater a test’s content validity, the more likely it is to be an accurate measure of what
it is supposed to measure”. In other words, if major areas in the test specification are not identified or
not represented, the test is said to be inaccurate. Further, such an inaccurate test is likely
to have a harmful backwash effect, because the areas that are not represented or not tested will
probably be ignored in teaching and learning.
Due to the importance of validity in the test, sometimes a trade-off which is in
favor of validity at the expense of reliability is accepted. Taking this perspective, this study
focused more on the validity of the final achievement test used at Ngo Quyen High School.
2.4.4. Practicality
All tests cost time and money to prepare, administer, score and interpret. Time and
money are in limited supply, and so there is often likely to be a conflict between what
appears to be a perfect testing solution in a particular situation and considerations of
practicality.
A test must be practicable. Practicality plays a crucial role in deciding whether a
test is good or not. This characteristic of a test involves administration, scoring, stationery,
interpretation of results, etc. According to Brown (1994:253), a test is impractical if it
takes 10 hours to complete or if it is prohibitively expensive. The duration of a test, for example,
may affect its successful operation. Students will feel very tired at the end of a long test, and
their scores will surely be affected. Likewise, if the scoring is complicated, it will cost much time
and money because staff will be needed to mark students’ papers. The longer a test takes to
construct, administer and score, the higher the costs. Testers should avoid this.
In brief, “tests should be as economical as possible in time (preparation, sitting and
marking) and in cost (materials and hidden costs of time spent)” (Heaton, 1991:172).
2.4.5. Discrimination
Another important feature of a test is discrimination. Heaton (1988:165) identifies
the discrimination of a test as its capacity to discriminate between the different candidates and to
reflect the differences in the performances of the individuals in the group. If a test is either
too easy or too difficult, it cannot realize its purpose of discriminating between candidates.
Therefore, the test items must span a wide difficulty scale, ranging from “extremely easy
items” to “extremely difficult items”. Below is how the items in a test should be spread
over a wide range of difficulty levels:
- extremely easy items
- very easy items
- easy items
- fairly easy items
- items below average difficulty level
- items of average difficulty level
- items above average difficulty level
- fairly difficult items
- difficult items
- very difficult items
- extremely difficult items
Similarly, Harrison (1994:14) defines discrimination as “the extent to which a test
separates the students from each other”. The extent of the need to discriminate will vary
depending on the purpose of the test. For example, if a placement test is able to efficiently
discriminate among students, it will be much easier to divide students into suitable groups;
similarly, with an achievement or a diagnostic test, the level of each individual will
be clearly shown.
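One common way of quantifying the discrimination described above is the upper-lower group index: the proportion of high scorers answering an item correctly minus the proportion of low scorers doing so. The sketch below computes it for a single item from invented response data.

# Invented data: 1 = correct, 0 = incorrect on one item, with the twenty
# students already sorted by total test score (strongest first).
responses_by_rank = [1, 1, 1, 1, 0, 1, 1, 0, 1, 0,   # stronger half
                     0, 1, 0, 0, 1, 0, 0, 0, 1, 0]   # weaker half

def discrimination_index(responses, fraction=0.5):
    """Upper-lower index: p(correct | top group) - p(correct | bottom group)."""
    k = int(len(responses) * fraction)
    upper, lower = responses[:k], responses[-k:]
    return sum(upper) / len(upper) - sum(lower) / len(lower)

d = discrimination_index(responses_by_rank)
print(f"D = {d:.2f}")  # values near 0 (or negative) suggest the item does
                       # not separate strong from weak candidates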
Conclusion: In this chapter, I have reviewed the literature on important issues related to
language testing. These include the relationship between testing and teaching, which is
often referred to as the "backwash effect", types of achievement tests, and the characteristics of a
good test with an emphasis on four important qualities, i.e., reliability, validity,
practicality and discrimination. Of these qualities, validity, particularly content validity,
seems to be the most important determinant, as it gives the test the power to measure what is
intended to be measured.
The next chapter will present the study which includes the participants, the methods
of data collection and the data analysis.
CHAPTER 3: THE STUDY
In this chapter, the writer provides some information about the current situation of
English teaching and learning and language testing at NQHS in Hai Phong as the basic
setting for the study. The rationale for the methods chosen for the study is also presented
here. The primary focus of this chapter is the data analysis and the findings from the data.
3.1. THE SUBJECTS AND THE CURRENT ENGLISH TEACHING,
LEARNING AND TESTING SITUATIONS AT NQHS
3.1.1. Students and their backgrounds
Pupils studying at NQHS were selected from many lower secondary
schools in the city after taking the entrance examination. Most of them had been studying
English for four years at lower secondary school. Because English was not a core subject
which they had to take to enter upper secondary school, it was not paid much
attention by either teachers or students. As a result, their level of proficiency in English
was varied on the one hand, and unsatisfactory on the other.
At upper secondary schools, English is one of the six core subjects, which are
compulsory in the national examination. Therefore English has become an important
subject, especially for 12th form students. After two years of learning English at high school,
the 12th form students learn English for 105 periods, covering the last 16
units of the textbook Tieng Anh 12. They have three class hours of English every week.
As far as training targets are concerned, the primary concern of the students is
getting good marks in written achievement tests, in the national examination and even in
the university entrance examinations. Different motivations and different objectives lead to
different ways of learning and different ways of teaching. The question here is how to
test the students appropriately to meet their needs and the requirements of NQHS.
3.1.2. The English teaching staff
The English group consists of 15 teachers. All the English teachers were trained in
Vietnam and none of them has ever been trained abroad. They are well-trained and rather
professionally experienced, with at least three years of teaching. About a quarter of the teachers
have completed or are taking an M.A. course. Therefore, most teachers are qualified enough to
conduct communicative activities in a foreign language lesson. They can use English in
class quite well. They also continuously extend their knowledge of
general English and specialized subjects through self-study and in-country training
programs.
3.1.3. English teaching and learning at NQHS
Being one of the six core subjects, which are compulsory in the national
examination at the end of upper secondary school, English is paid much attention at every
school in general and at NQHS in particular.
With the renovation in education, the English program at upper secondary schools
has been redesigned recently. The seven-year English program is now used nationwide in
place of the previous three-year one. The purpose of the new program is presumably to narrow
the gap between classroom English and real-life English. The course book focuses on four
skills and provides appropriate grammar and vocabulary. According to the content of the
course book for 12th form students, after studying 16 units students are expected to have the
following abilities:
Listening: Students are able to understand passages or dialogues related to 16 topics in the
textbooks.
Speaking: Students are able to carry out conversations about culture, future life, sports
events.
Reading: Students are expected to understand the general and detailed contents of the
reading passages of 280-320 words in length about the topics of the 16 units.
Writing: Students are able to write a letter of request, describe the world in the future, give
instructions.
However, due to the lack of materials and equipment, the shortage of time, the
large-sized classes, mixed-level students, and mixed motivations and expectations for learning
English, the teachers have a lot of difficulty in conducting their teaching effectively. We
do not have a suitable laboratory for students to study listening. Whenever teachers want
their students to practice listening, they have to bring cassettes to the class. Teachers
sometimes ignore the speaking section because of the limited time. Also, writing is a
challenge for students, so they are not very interested in it.
Furthermore, the main aim of teaching English at upper secondary schools is to
help the students get the best marks in written tests. Owing to the characteristics of the
tests, teachers pay attention mainly to reading and supply students with grammar rules and
vocabulary.
3.1.4. The current testing situation at NQHS.
In every school year, the 12th form students at NQHS have to take two final written
achievement tests, one at the end of the 1st term and one at the end of the 2nd term. All teachers are asked to design
resource tests (whether or not they teach 12th form students). The tests are then collected
and the final paper is compiled from three or four resource tests by the leader of the
English group. Objective items such as multiple-choice questions have been used for
the final written achievement tests in order to achieve high test reliability and
discrimination among test takers. In addition, to make it easier for teachers to score the
examination papers, separate answer sheets are provided for the test takers to write down their
answers.
However, I have learned that most of these tests do not follow test specifications,
because none are available to the teachers. The teachers design most tests by a cut-
and-paste method, by which I mean that they use available commercial tests to compile their own
without following any principles of testing. Testing techniques have not been paid proper
attention and the role of testing in teaching and learning has not been fully
recognized. Tests, therefore, may lack some important criteria of a good test
concerning validity, reliability, format and practicality.
The first and foremost problem is that, apart from those tests that are carefully designed,
some are of poor quality, contain misspellings, or are too difficult. This is because test makers
only try to fulfill their duty without considering the effectiveness of the test on the one hand, and
most of them may not be aware of testing theory on the other. Such tests often fail to
measure accurately whatever they are intended to measure.
Moreover, test content is sometimes found to be unrelated to the objectives of the
course. The tests are likely to fail to measure some language skills such as speaking and
listening. There is no listening part in the final achievement tests. Also, teachers have no
chance to test learners’ speaking ability, except indirectly through the phonetics section of
the test, which hardly seems an accurate way of measuring
the students’ speaking skill.
More importantly, students are usually well aware of the test format. As a result, most
teaching practice and class activities are test-oriented, and what will not be
tested might be left uncovered. Students apparently develop an attitude of learning for testing
and grading only.
Despite these problems, the final achievement test at NQHS has never been empirically
evaluated by test users. This study is the first attempt to explore test users' opinions of the
test. To provide background information, Table 1 shows the structure of the test.
Table 1: The English final achievement test has been constructed as follows:
Time allowance: 60 minutes (50 multiple choice questions in total)

Part                        Questions                                             Total          Marking scale
A. Phonetics                I. Multiple choice (3 spelling, 2 word stress)        5 choices      5
B. Grammar and Vocabulary   II. Multiple choice (completing each sentence by      25 choices     25
                            choosing the most suitable word or phrase)
C. Writing                  III. Multiple choice (correcting mistakes)            5 sentences    5
                            IV. Multiple choice (sentence transformation or       5 sentences    5
                            sentence building)
D. Reading                  V. Multiple choice (gap filling)                      5 gaps         5
                            VI. Multiple choice (choosing the correct answer)     5 sentences    5
It is known that testing is one of various forms of evaluation, which provide
information on the strengths and weaknesses in the achievement of the students. Thus,
is what we are doing worthwhile? Does what we are testing
depict a true picture of our students’ abilities?
The next section provides an account of the methods used in this study to answer
the above-mentioned questions.
3.2. RESEARCH METHODS
This study used survey questionnaires and interviews as its data collection
tools. The overall purpose of the survey is to investigate the perceptions of the teachers (test
makers in the English group) and the 12th form students (test takers) at NQHS regarding the
existing final achievement test, based on the criteria of a good test discussed in section 2.4.
However, the major focus is on validity.
It is expected that the results of the survey would help to:
(1) find out the differences and similarities in teachers’ and students’ evaluations
of the test validity;
(2) achieve more accurate measures of students’ achievement with reference to the
training objectives.
As far as data collection is concerned, the key methods applied to obtain reliable information are survey
questionnaires and interviews. These methods help to collect and cross-check different kinds
of data. However, each has its own advantages and disadvantages.
3.2.2. Survey questionnaires
Two sets of survey questionnaires were administered with the assistance of 15
teachers and 200 12th-form students at NQHS.
The first objective is to find out how these subjects evaluate the current final
achievement test for 12th form students based on the criteria of a good test, and to compare
these responses in order to figure out what is similar and what is different and to make
recommendations to narrow the gaps. The survey also aims at collecting teachers’ attitudes
and suggestions towards the improvement and design of the final achievement test for
12th form students.
In examining the actual testing situation of the current 12th form achievement test, a
fourteen-item questionnaire was given to 15 teachers, and a fourteen-item questionnaire was
administered to 200 12th form students at NQHS. The questionnaire consists of one part to
collect students’ and teachers’ opinions on the whole test and on each particular section of the
test, such as the Grammar and Vocabulary, Reading comprehension and Writing sections, and
one part with questions on improvements to the final test. For the subjects to have a clear
idea of the content and be able to decide what/how to respond in a relevant way to a
certain question/situation, clear instructions were given at the beginning of the
questionnaire. For every question, informants are asked to tick (√) the most appropriate
column among “Strongly disagree”, “Disagree”, “Don’t know”, “Agree” and “Strongly
agree”. The questionnaire is anonymous, so the participants feel more comfortable
answering the questions.
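To show how responses of this kind are turned into the percentage tables reported in section 3.3, the sketch below tallies one question's Likert responses by stream; the handful of answers is invented, whereas the actual study collected 200 student questionnaires.

from collections import Counter

OPTIONS = ["Strongly disagree", "Disagree", "Don't know", "Agree", "Strongly agree"]

# Invented sample: (stream, chosen option) pairs for one question.
answers = [
    ("A", "Disagree"), ("A", "Disagree"), ("A", "Agree"), ("A", "Don't know"),
    ("D", "Agree"), ("D", "Strongly agree"), ("D", "Agree"), ("D", "Disagree"),
]

def percentages(stream):
    """Percentage of each option chosen within one stream."""
    chosen = [opt for s, opt in answers if s == stream]
    counts = Counter(chosen)
    return {opt: 100 * counts.get(opt, 0) / len(chosen) for opt in OPTIONS}

for stream in ("A", "D"):
    print(stream, {opt: round(p) for opt, p in percentages(stream).items()})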
This research method has a number of advantages. It can reach a large number of
people in a short time and, as a result, it provides a great deal of the data that
researchers need, and the survey data appear more valid because the answers are collected from
many people. Moreover, the data collected are relatively easy to summarize and
report, as all the informants answer the same questions. Most importantly, the
questionnaires give the student-informants an opportunity to express their opinions and
needs without fear of being embarrassed or of speaking their minds. Because
confidentiality is ensured by not mentioning students’ names, the student-informants are
more likely to give unbiased answers. Finally, using questionnaires is quite inexpensive.
However, this method inherently carries some disadvantages, which should not be
allowed to affect the quality of the collected data. The first drawback is that there is little provision for
the expression of unanticipated responses, since all the surveys have to follow a fixed
format. The other disadvantage lies in the common fact that questionnaires used for
research purposes are of limited utility in getting at the causes of problems or possible
solutions. Accordingly, they need to be supplemented by other methods.
3.2.3. Interviews
Interviews were conducted with the teachers of the English group and with 12th form students who had just
finished the final English achievement test administered by the school. The
interviews were primarily based on the initial analysis of the valid questionnaires and aimed to clarify
any vague information from the questionnaires. I took notes of the interviews, which were
analyzed in triangulation with the questionnaire data. The strong point of this method is
that experiences and opinions are exchanged much more openly and directly.
Nevertheless, some informants are shy and afraid of expressing their own ideas, which may
make it difficult to collect some sensitive information.
3.3. DATA ANALYSIS
This section deals with the data collected from the survey of both the teachers and
students concerning their evaluation of the current final achievement test for 12th form
students given at the end of the school year.
3.3.1. Data analysis of students’ survey questionnaires and interviews.
Two hundred questionnaires (see the questionnaire in Appendix 1) were
administered to two hundred 12th form students of NQHS.
The author intended to collect stratified data in order to identify the
differences in perceptions of the test among the students themselves. Therefore, the student
population was divided into two separate groups. Group 1 consists of 100 students whose
main subjects are math, physics and chemistry. They learn basic English only and are
referred to as the A stream (A). The other group of 100 students, who learn advanced
English, math and literature, is named the D stream (D).
The data collected from the students’ survey questionnaires, with 14 questions in
total, are divided into parts. Part 1 consists of 5 questions asking for
students’ comments on the whole test. Parts 2, 3 and 4 collect their opinions on each
particular section of the test: grammar and vocabulary (2 questions), reading
comprehension (4 questions) and writing (3 questions). Each part is displayed in a
separate table as follows:
Table 2: Students’ opinions on the whole test
(figures are percentages; A = A stream, D = D stream)

                                            Strongly       Disagree     Don’t know     Agree        Strongly
                                            disagree                                                agree
Question                                     A     D        A     D      A     D       A     D      A     D
1. The test measures what the students       0     0       73    14     12     5      10    48      5    33
   have been taught.
2. The task types given in the test are      0     0        0     0     15     3      56    68     29    29
   familiar to the students.
3. Time allowance for this test is          15     2       53    11      9     4      10    57     13    26
   enough.
4. The weighting demonstrated on the        24    44       37    24     13     7      15    13     11    12
   marking scale of the test is
   appropriate.
5. In terms of test item format, this        0     0        0     0     25     7      63    72     12    21
   English test mainly intends to
   measure the students’ grammar and
   vocabulary knowledge.
Table 2 shows the information collected with six questions in which students are
asked to state their views on whether or not the test relates to what they have been
taught (Q1), whether the task types given in the test are familiar to the students (Q2), their opinions on
the time allowance (Q3) and the marking scale of the test (Q4), the main knowledge measured
through the test (Q5), and whether the result of the test can encourage them to learn better (Q6).
As shown in Table 2, 48% and 33% of the students from the D stream agree and
strongly agree that the test measures what they have been taught, while 14% of these students
disagree and 5% have no idea. In contrast, only 15% of the students from the A stream go
along with this idea, and the proportion with opposite views is 73%. It is obvious that the test
is quite easy for students of the D stream and does not evaluate their real ability, whereas the
students from the A stream find the test far more difficult than what they have been taught.
As can be seen from the table, most of the students from both the A stream (85%) and
the D stream (87%) agree that the task types given in the test are familiar to them. No students
find them strange. Those who chose “agree” explain that the reason they can do the test
well is simply that their teachers often give them similar task types and help them practice
a lot before taking the test.
According to the statistical data, the majority of the students from the D stream (83%) think
that the time allowance for the test is enough, while a small number of students (13%) insist
that more time should be allowed. Meanwhile, it is clearly seen that
the proportion of students from the A stream expecting the time allowance to be longer is as high as
68%, since the test is quite long and difficult for them.
When asked about their opinion of the marking scale, more than half of the
students from both streams (61% and 68%) do not think it is good, whereas 26% of A-stream
students and 25% of D-stream students think that it is acceptable. For
students of the A stream, more points should be given to easier parts such as Grammar so that
they can get better results on the test, while the test results sometimes dissatisfy the students
of the D stream, as the scale seems unable to distinguish them from weaker students. In
their view, there should be more points on the reading or writing parts.
Concerning the main knowledge measured in the test, the majority of the students
(>80%) agree that the test mainly intends to measure students’ grammar and vocabulary
knowledge.
Table 3: Students’ comments on the Grammar and Vocabulary section
(figures are percentages; A = A stream, D = D stream)

                                            Strongly       Disagree     Don’t know     Agree        Strongly
                                            disagree                                                agree
Question                                     A     D        A     D      A     D       A     D      A     D
1. The Grammar and Vocabulary part is        4    23        5    29      7     4      68    24     16    20
   long enough and related to what
   students have been taught.
2. A student who is given a high score       5    13        9    20      8    33      55    21     23    13
   in the grammar and …