Evaluating the Reliability and Validity of an English Achievement Test for Third-year Non-major Students at the University of Technology, Ho Chi Minh National University and Some Suggestions for Changes


CHAPTER 1: INTRODUCTION
1.1 Rationale for choosing this topic
English plays an especially important role in the rapid development of science, technology and international relations, which has resulted in a growing need for English language learning and teaching in many parts of the world. English has become a compulsory subject in national education systems in many countries, and Vietnam, among them, considers the learning and teaching of English a major strategic tool for developing human resources and keeping pace with other countries. Therefore, at every level of education, from primary school to university and postgraduate study, learners must learn, or want to learn, English, either as a compulsory subject or as a means of accessing information technology and finding a good job. It is clear that English teaching and learning is essential for job training.
Fully aware of the importance of the English language, the University of Technology, Ho Chi Minh National University requires its students to learn it as a compulsory subject during their first three academic years. English has therefore been taught at the University of Technology since its establishment, with the aim of equipping students with an essential tool for engaging with the wider world. However, little attention has been paid to evaluating how much students acquire when they learn the language, how well they use what they have been taught, and what level of English they have reached. The evaluation only counts the percentage of students who pass the English tests and therefore says nothing about the validity, reliability or discrimination of those tests. The results of the English tests are not fully exploited. In addition, during my time as a teacher of English at the University of Technology, I have heard teachers and learners complain about the English achievement test in terms of its content and structure. As a result, the English section has decided to renew the item bank in order to make it more valid and more reliable.
For the reasons above, the author was encouraged to undertake this study, entitled “Evaluating the Reliability and Validity of an English Achievement Test for Third-year Non-major Students at the University of Technology, Ho Chi Minh National University and some suggestions for changes”, with the intention of finding out how valid and reliable the test is. More importantly, the writer hopes that the results of the study can then be applied to improve the current testing practice and to create a new, genuinely reliable item bank. The study is also intended to encourage both teachers and learners in their teaching and learning.
1.2 Scope of study
The scope of this thesis is limited to examining the existing achievement test for third-year non-English major students at the University of Technology, Ho Chi Minh National University in terms of its validity and reliability. The study presents statistical analyses of the currently used test and proposes practical suggestions for improving it. Owing to the limitations of time and research conditions, it is impossible for the author to cover all the achievement tests used for third-year students; instead, only one test is studied.
1.3 Aims of study
The major aim of the study is to evaluate the currently used achievement test for third-year non-English major students of technology, with a special focus on the test's reliability and validity. The specific aims of the research are:
 To evaluate the test validity and reliability through initial score statistics obtained
from the achievement test result of third-year students,
 To pinpoint the strengths and weaknesses of the test, and
 To provide practical suggestions for the test improvement.
1.4 Methods of study
In order to achieve the above-mentioned aims, the study has been carried out with the
following methodologies.
First, the author drew on the theory and principles of language testing, the major characteristics of a good test, especially test validity and reliability, the achievement test, and the statistical methods used in interpreting test results. Through critical reading, the writer gathered, analyzed and synthesized reference materials to build a theoretical basis for evaluating the achievement test used for third-year students in terms of its validity and reliability.

Then, a quantitative methodology was used to collect and analyze data. After collecting the data, the author employed statistical software to interpret it and to present the findings.
1.5 Research questions
This study is implemented to find answers to the following research questions:
1. Is the achievement test for third-year non-English major students at the
University of Technology, Ho Chi Minh National University reliable?
2. Is the achievement test for third-year non-English major students at the
University of Technology, Ho Chi Minh National University valid?
3. Is it necessary to make some changes to the test? If yes, what are the changes?
1.6 Design of study
The thesis is organized into four major chapters:
Chapter 1- Introduction presents such basic information as the rationale, the scope, the aims, the methods, the research questions and the design of the study.
Chapter 2- Literature Review reviews theoretical backgrounds on evaluating a test, which
includes language testing, criteria of good tests and theoretical ideas on test reliability and
validity as well as achievement tests.
Chapter 3- The study is the main part of the thesis showing the context of the study and the
detailed results obtained from collected tests and findings in response to the research
questions.
Chapter 4- Conclusion offers conclusions and practical implications for the test
improvement. In this part, the author also proposes some suggestions for further research on
the topic.
CHAPTER 2: LITERATURE REVIEW
This chapter provides an overview of the theoretical background of the study. It includes four main sections. Section 2.1 discusses the importance of testing in education. Section 2.2 is about language testing. It is followed by Section 2.3, in which the author provides a brief review of the major characteristics of a good test, with the main focus on test reliability and validity. Finally, in Section 2.4, the achievement test and its types are explored.
2.1 The importance of testing in education
Testing is an important part of every teaching and learning experience. It is a tool to measure learners’ ability, and it may create positive or negative attitudes toward the teaching and learning process. Testing reflects the teaching process and the overall training objectives. Through testing, administrators can make important decisions about the course, the syllabus, the course book, teachers, learners and administration.

Testing plays a very important part in the teaching and learning process, as it is the final stage of that process. Therefore, to take advantage of testing to measure the quality of education, administrators must build a sound and appropriate testing system. Such a system serves to evaluate learners’ ability, the suitability of the teaching methods, of the teaching and learning materials and conditions, and of the stated training objectives.
Testing and Teaching
Testing and teaching are closely related because it is impossible to work in either field without being constantly concerned with the other (Heaton, 1998: 5). In other words, Heaton implied that teaching and learning provide a great source of language material for testing to make use of; in turn, testing reinforces, encourages and perfects the teaching and learning process. Hughes (1989: 2) summarizes the relationship thus: “The proper relationship between teaching and testing is surely that of partnership”. To explain this, Hughes describes the effect of testing on teaching as backwash. If the testing has a good effect on teaching, the backwash is said to be beneficial; however, when the teaching is good and appropriate and the testing is not, we are likely to suffer from harmful backwash. Test results give both teachers and learners information for their future action, such as improving knowledge and skills, revising knowledge, or applying a new teaching method. Brown (1994: 375) shared this idea: testing is “what teachers measure or judge learners’ competence all the time and, ideally, learners measure and judge themselves”.
In short, it is undeniable that testing is an integral part of teaching and cannot be separated from the program or from the course goals. Testing has both positive and negative impacts on teaching: it provides the teacher with information on how effective his teaching has been, and he can use tests to diagnose his own efforts as well as those of his students.
Testing and Learning
Testing is a tool to “pinpoint strengths and weaknesses in the learned abilities of the student” (Henning, 1987: 1). That is, through testing, learners can find out what level they have reached and what difficulties they face. As a result, they can adjust their learning and explore more effective ways of learning. At the same time, the teacher can rely on test results to understand learners’ ability better and then improve his methods of teaching or revise knowledge. Thus, Read (1982: 2) said that “a test can help both teachers and learners to clarify what the learners really need to know”. Clearly, not only the teacher but also the learners may benefit from testing.

To sum up, tests can benefit students, teachers and even administrators by confirming the progress that has been made and showing how they can best redirect their future efforts. In addition, good tests can sustain or enhance class morale and aid learning.
2.2 Language Testing
Language testing is one form of testing, and it is also a form of measurement. Its importance in English learning has been described as follows: “properly made English
tests can help create positive attitudes toward instruction by giving students a sense of
accomplishment and a feeling that the teacher’s evaluation of them matches what he has
taught them. Good English tests also help students learn the language by requiring them to
study hard, emphasizing course objectives, and showing them where they need to improve”
(Davies, 1996: 5).
McNamara (2000) presented three main roles of language testing, which apply not only in education but in other fields as well. Firstly, language testing is considered a key to success, since test results often play a decisive role in recruitment. Secondly, it serves educational goals: according to McNamara, tests are used to place learners in a suitable course. The third role of language testing is to support research, since every researcher who wishes to study a language needs to evaluate standard tests or to design tests in that language.

Henning, for his part, suggested six purposes of language tests as follows:
 Diagnosis and Feedback: to explore strengths and weaknesses of the learners.
 Screening and Selection: to assist in the decision of who should be allowed to
participate in a particular program of instruction.
 Placement: to identify a particular performance level of the student and to place
him at an appropriate level of instruction.
 Program Evaluation: to provide information about the effectiveness of programs
of instruction.
 Providing Research Criteria: to provide a standard of judgment in a variety of
other research contexts based on language test scores.
 Assessment of Attitudes and Sociopsychological Differences: to determine the
nature, direction, and intensity of attitudes related to language acquisition.
(Henning, 1987: 1)
2.3 Major characteristics of a good test
In order to make a well-designed test, teachers have to take into account a variety of
factors such as the purpose of the test, the content of the syllabus, the pupils’ background,
the goal of administrators and so forth. Moreover, test characteristics play a very important
role in constructing a good test.
The most important consideration in determining whether a test is good or not is the use for which it is intended; that is to say, the most important quality of a test is its usefulness. Test usefulness provides a kind of metric by which test developers can evaluate not only the tests that they develop and use, but also all aspects of test development and use. Generally speaking, usefulness comprises six components: reliability, construct validity, authenticity, interactiveness, impact and practicality. It should be pointed out, however, that rather than emphasizing the tension among these different qualities, test developers need to recognize their complementarity.

Bachman and Palmer (1996) consider the criteria as qualities of test usefulness rather
than individual factors. Their idea of usefulness can be visually presented as in Figure 2.1:
Usefulness = reliability + validity + impact + authenticity + interactiveness + practicality

Figure 2.1 Usefulness (Bachman and Palmer, 1996)
Henning (1987) added more test characteristics, which he summarized in a table called A Checklist for Test Evaluation. The checklist is for rating the adequacy of a test for any given purpose.
Table 2.1 A checklist for test evaluation
Name of test ________________________________
Purpose Intended ____________________________
Test characteristic Rating (0 = highly inadequate, 10 = highly adequate)
1. Validity _______________________
2. Difficulty _______________________
3. Reliability _______________________
4. Applicability _______________________
5. Relevance _______________________
6. Replicability _______________________
7. Interpretability _______________________
8. Economy _______________________
9. Availability _______________________
10. Acceptability _______________________
________________________ Total
(Adapted from Henning, 1987: 14)
Other leading scholars in testing share similar views on test characteristics with the two scholars mentioned above. They all agree that, among these characteristics, reliability and validity are essential to the interpretation and use of measures of language abilities and are the primary qualities to be considered in developing and using tests. For this reason, the author employs these essential measurement qualities in this study to evaluate the test taken by a large number of third-year non-English major students at the University of Technology. A brief discussion of reliability and validity follows.
2.3.1 Test Reliability
Reliability has been defined in different ways by different authors. Perhaps the best way to look at reliability is as the extent to which the measurements resulting from a test reflect characteristics of those being measured. For example, reliability has elsewhere been
defined as "the degree to which test scores for a group of test takers are consistent over
repeated applications of a measurement procedure and hence are inferred to be dependable
and repeatable for an individual test taker" (Berkowitz, Wolkowitz, Fitch, and Kopriva,
2000). This definition will be satisfactory if the scores are indicative of properties of the test
takers; otherwise they will vary unsystematically and not be repeatable or dependable.
Test reliability refers to the consistency of scores students would receive on alternate
forms of the same test. Due to differences in the exact content being assessed on the alternate
forms, environmental variables such as fatigue or lighting, or student error in responding, no
two tests will consistently produce identical results. This is true regardless of how similar the
two tests are. For example, a test that includes a translation part would probably produce
different scores from one administration to another because it is subjective, and it would thus be unreliable.
Henning (1987: 10) claimed that all tests are subject to inaccuracies. The ultimate
scores gained by the test-takers only provide approximate estimations of their true abilities.
While some measurement error is unavoidable, it is possible to quantify and greatly
minimize the presence of measurement error. A test on which the scores obtained are
generally similar when it is administered to the same students with the same ability, but at a
different time is said to be a reliable test. Since test reliability is related to test length, with longer tests tending to be more reliable than shorter ones, knowledge of how important the decisions based on the examination results are can lead us to use tests with different numbers of test items.
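The thesis does not formalize this length-reliability relationship, but the standard Spearman-Brown prophecy formula quantifies it. The sketch below is a minimal illustration only; the function name and the figures are invented, not taken from the thesis.

```python
# A minimal sketch (not from the thesis) of the standard Spearman-Brown
# prophecy formula: the predicted reliability of a test lengthened by a
# factor k, given its current reliability.

def spearman_brown_prophecy(reliability: float, k: float) -> float:
    return (k * reliability) / (1 + (k - 1) * reliability)

# Illustrative figures: doubling a test whose reliability is 0.70
# raises the predicted reliability to about 0.82.
print(f"{spearman_brown_prophecy(0.70, 2):.2f}")
```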
Test reliability is considered as “a quality of test score” by Bachman (1990: 24). He
makes a further point that if a student receives a low score on a test one day and a high score on the same test two days later, the test doesn’t yield consistent results, and the score cannot be considered a reliable indicator of the individual’s ability.
Reliability can also be viewed as an indicator of the absence of random error when
the test is administered. When random error is minimal, scores can be expected to be more
consistent from administration to administration.
Sources of Error
According to Bachman (1990: 165), there are four factors that affect language test scores. The effects of these various factors on a test score can be illustrated as in Figure 2.2.
Figure 2.2 Factors that affect language test scores
We can infer from the figure that a score on a language test primarily reflects communicative language ability, but that the score is also affected by factors other than communicative language ability. These are:
 Test method facets: systematic to the extent that they are uniform from one test administration to another (Appendix 1).
 Personal attributes: individual characteristics such as cognitive style and knowledge of particular content areas, and group characteristics such as sex, race and ethnic background. These are also systematic.
 Random factors: unsystematic factors, including unpredictable and largely temporary conditions such as the test taker’s mental alertness or emotional state.
Thus, a test is considered reliable if it meets conditions such as the following:
• The results obtained by the same candidate on the same test at two different times are consistent.
• Candidates are not allowed too much freedom.
• Clear and explicit instructions are provided.
• The same test scores are given by two or three administrators.
• The test results measure the learners’ true ability.
The reliability of a test is indicated by the reliability coefficient, which is calculated by the following formula:

(1). $R_t = \frac{N}{N-1}\left(1 - \frac{\bar{X}(N - \bar{X})}{N \cdot SD^2}\right)$ (Henning, 1987)

(in which $R_t$: reliability coefficient; $N$: number of items; $\bar{X}$: mean of all scores; $SD$: standard deviation of the test scores)

$R_t$ is expressed as a number ranging between 0 and 1.00, with r = 0 revealing no reliability and r = 1.00 indicating perfect reliability. An acceptable reliability coefficient should not be below 0.90; a lower value indicates inadequate reliability. For instance, r = 0.90 on a test means that 90% of the test score is accurate while the remaining 10% consists of standard error. If r = 0.60, only 60% of the test score is reliable and the other 40% may be caused by error.

Thus, the higher the reliability coefficient is, the lower the standard error is. The
lower the standard error is, the more reliable the test scores are.
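Formula (1), with these inputs, matches what is commonly known as Kuder-Richardson Formula 21, so it can be computed from summary statistics alone. The sketch below is a minimal illustration, not part of the thesis; the function name and the example figures are invented.

```python
# Minimal sketch of formula (1): the reliability coefficient Rt from the
# number of items N, the mean of all scores and the standard deviation
# of the scores (dichotomously scored items assumed).

def reliability_coefficient(n_items: int, mean: float, sd: float) -> float:
    return (n_items / (n_items - 1)) * (1 - (mean * (n_items - mean)) / (n_items * sd ** 2))

# Invented example: a 50-item test, mean score 30, standard deviation 8.
rt = reliability_coefficient(50, 30.0, 8.0)
print(f"Rt = {rt:.2f}")  # about 0.83, below the 0.90 threshold cited above
```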
Types of reliability estimates
According to Henning (1987), there are several types of reliability estimates, each
influenced by different sources of measurement error, which may arise from bias of item
selection, from bias due to time of testing or from examiner bias. These three major sources
of bias may be addressed by corresponding methods of reliability estimate:
a. Selection of specific items:
- Parallel Form Reliability
- Internal Consistency Reliability estimates (Split Half Reliability)
- Rational equivalence
b. Time of testing:
- Test-retest Method
c. Examiner bias
- Inter-rater Reliability
Parallel form reliability indicates how consistent test scores are likely to be if a person takes two or more forms of a test. A high parallel form reliability coefficient indicates that the different forms of the test are very similar, which means that it makes virtually no difference which version of the test a person takes. On the other hand, a low parallel form reliability coefficient suggests that the different forms are probably not comparable; they may be measuring different things and therefore cannot be used interchangeably.
A formula for this method may be expressed as follows:

(2). $R_{tt} = r_{A,B}$ (Henning, 1987)

(in which $R_{tt}$: reliability coefficient; $r_{A,B}$: the correlation of form A with form B of the test when administered to the same people at the same time)
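As a minimal sketch of formula (2), the correlation of form A with form B can be computed as a Pearson correlation over the same examinees' scores. The helper and the score lists below are invented for illustration; the helper is reused by later sketches.

```python
# Minimal sketch of formula (2): parallel-form reliability as the
# Pearson correlation between two forms taken by the same examinees.
from statistics import mean, stdev

def pearson_r(xs: list[float], ys: list[float]) -> float:
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (stdev(xs) * stdev(ys))

# Invented scores of six examinees on forms A and B of the same test.
form_a = [28.0, 35.0, 22.0, 40.0, 31.0, 27.0]
form_b = [30.0, 33.0, 24.0, 38.0, 29.0, 28.0]
print(f"R_tt = {pearson_r(form_a, form_b):.2f}")
```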
Internal consistency reliability indicates the extent to which items on a test measure the
same thing. A high internal consistency reliability coefficient for a test indicates that the
items of the test are very similar to each other in content. It is important to note that the
length of a test can affect internal consistency reliability.
Split-half reliability is one variety of internal consistency estimate. The test may be split in a variety of ways; the two halves are then scored separately and correlated with each other.
A formula for the split-half method may be expressed as follows:

(3). $R_{tt} = \frac{2\, r_{A,B}}{1 + r_{A,B}}$ (Henning, 1987)

(in which $R_{tt}$: reliability estimated by the split-half method; $r_{A,B}$: the correlation of the scores from one half of the test with those from the other half)
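A minimal sketch of the split-half estimate with the correction in formula (3), assuming one common convention (an odd/even split of the items) and a 0/1 examinee-by-item score matrix; pearson_r is the helper from the parallel-form sketch above.

```python
# Minimal sketch of formula (3): split the items into two halves,
# correlate the half scores, then apply the correction 2r / (1 + r).

def split_half_reliability(item_matrix: list[list[int]]) -> float:
    """item_matrix[i][j] is examinee i's 0/1 score on item j."""
    half_a = [sum(row[0::2]) for row in item_matrix]  # odd-numbered items
    half_b = [sum(row[1::2]) for row in item_matrix]  # even-numbered items
    r_ab = pearson_r(half_a, half_b)                  # helper defined earlier
    return 2 * r_ab / (1 + r_ab)
```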
Rational equivalence is another method, which provides a coefficient of internal consistency without having to compute reliability estimates for every possible split-half combination. This method focuses on the degree to which the individual items correlate with each other.
(4). $R_{tt} = \frac{n}{n-1}\left(\frac{s_t^2 - \sum s_i^2}{s_t^2}\right)$ (Kuder-Richardson Formula 20; Henning, 1987)

(in which $R_{tt}$: the reliability estimate; $n$: the number of items; $s_t^2$: the variance of the total test scores; $\sum s_i^2$: the sum of the individual item variances, $p_i q_i$ for dichotomous items)
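A minimal sketch of formula (4), computed from a matrix of dichotomous (0/1) item scores. It assumes the conventional reading of KR-20 in which each item's variance is p(1 - p) and the total-score variance is the population variance; the function and variable names are invented.

```python
# Minimal sketch of formula (4), Kuder-Richardson 20, from 0/1 scores.

def kr20_reliability(item_matrix: list[list[int]]) -> float:
    n_examinees = len(item_matrix)
    n_items = len(item_matrix[0])
    totals = [sum(row) for row in item_matrix]
    m = sum(totals) / n_examinees
    total_var = sum((t - m) ** 2 for t in totals) / n_examinees  # s_t^2
    item_var_sum = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in item_matrix) / n_examinees  # proportion correct
        item_var_sum += p * (1 - p)                           # item variance pq
    return (n_items / (n_items - 1)) * (total_var - item_var_sum) / total_var
```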
Test-retest reliability indicates the repeatability of test scores with the passage of time.
This estimate also reflects the stability of the characteristics or constructs being measured by
the test.
The formula for this method is as follows:
(5). $R_{tt} = r_{1,2}$ (Henning, 1987)

(in which $R_{tt}$: the reliability coefficient using this method; $r_{1,2}$: the correlation of the scores at time one with those at time two for the same test used with the same persons)
Inter-rater reliability is used when scores on the test are independent estimates by two or
more judges or raters. In this case reliability is estimated as the correlation of the ratings of
one judge with those of another. This method is summarized in the following formula:
(6). $R_{tt} = \frac{n \cdot r_{A,B}}{1 + (n - 1)\, r_{A,B}}$

(in which $R_{tt}$: inter-rater reliability; $n$: the number of raters whose combined estimates form the final mark for the examinees; $r_{A,B}$: the correlation between the raters, or the average correlation among the raters if there are more than two)
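A minimal sketch of formula (6), assuming that with more than two raters r_A,B is taken as the average pairwise correlation, as the definition above indicates. pearson_r is the helper from the parallel-form sketch, and the rater marks are invented.

```python
# Minimal sketch of formula (6): reliability of the combined mark from
# n raters, given the (average) correlation between their ratings.
from itertools import combinations

def interrater_reliability(n_raters: int, r_ab: float) -> float:
    return (n_raters * r_ab) / (1 + (n_raters - 1) * r_ab)

# Invented marks from three raters for the same five examinees.
raters = [[7.0, 5.0, 8.0, 6.0, 9.0],
          [6.0, 5.0, 9.0, 7.0, 8.0],
          [7.0, 6.0, 8.0, 6.0, 9.0]]
pairs = list(combinations(raters, 2))
r_avg = sum(pearson_r(a, b) for a, b in pairs) / len(pairs)  # average correlation
print(f"R_tt = {interrater_reliability(len(raters), r_avg):.2f}")
```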
One way to improve the reliability of a test is to become aware of the test characteristics that may affect reliability. Among these characteristics are test difficulty, discriminability, item quality, and so on.
Test difficulty is calculated by the following formula:

(7). $p = \frac{Cr}{N}$

(in which $p$: difficulty; $Cr$: the sum of correct responses; $N$: the number of examinees)
According to Heaton (1988: 175), the scale for the test difficulty is as follows:
p: 0.81-1: very easy (the percentage of correct responses is 81%-100%)
p: 0.61-0.8: easy (the percentage of correct responses is 61%-80%)
p: 0.41-0.6: acceptable (the percentage of correct responses is 41%-60%)
p: 0.21-0.4: difficult (the percentage of correct responses is 21%-40%)
p: 0-0.2: very difficult (the percentage of correct responses is 0-20%)
Discriminability
The formula for item discriminability is given as follows:

(8). $D = \frac{Hc - Lc}{n}$

(in which $D$: discriminability; $Hc$: the number of correct responses in the high group; $Lc$: the number of correct responses in the low group; $n$: the number of examinees in each group)

The range of discriminability is from 0 to 1. The greater the D index is, the better the
discriminability is.
The item properties of a test can be shown visually in a table as below:
Table 2.2 Item property

Item property       Index        Interpretation
Difficulty          0.0-0.33     Difficult
                    0.33-0.67    Acceptable
                    0.67-1.00    Easy
Discriminability    0.0-0.3      Very poor
                    0.3-0.67     Low
                    0.67-1.00    Acceptable

(Henning, G., 1987)
These indices set the ground for judging the difficulty and discriminability of the items in the final achievement test chosen by the author.
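A minimal sketch of formulas (7) and (8). The thesis does not specify how the high and low groups are formed; here the top and bottom thirds of examinees ranked by total score are used, one common convention, and the data layout is the same invented 0/1 matrix as in the earlier sketches.

```python
# Minimal sketch of formulas (7) and (8): item difficulty p and item
# discriminability D from a 0/1 examinee-by-item score matrix.

def item_difficulty(item_matrix: list[list[int]], j: int) -> float:
    return sum(row[j] for row in item_matrix) / len(item_matrix)  # p = Cr / N

def item_discriminability(item_matrix: list[list[int]], j: int) -> float:
    # Assumes at least three examinees so each group is non-empty.
    ranked = sorted(item_matrix, key=sum, reverse=True)  # best examinees first
    k = len(ranked) // 3                                 # size of each group
    hc = sum(row[j] for row in ranked[:k])               # correct in high group
    lc = sum(row[j] for row in ranked[-k:])              # correct in low group
    return (hc - lc) / k                                 # D = (Hc - Lc) / n
```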
2.3.2 Test Validity
It should be noted that different scholars think of validity in different ways. Heaton (1988: 159) provides a simple but complete definition: “the validity of a test is the extent to which it measures what it is supposed to measure”. Similarly, Hughes (1989: 22) claimed that a test is said to be valid if it “measures accurately what it is intended to measure”.
The Standards for Educational and Psychological Testing (1985: 9) state
that “Validity is the most important consideration in test evaluation. The concept refers to
the appropriateness, meaningfulness, and usefulness of the specific inferences from the test
scores. Test validation is the process of accumulating evidence to support such inferences”.
Thus, to be valid, a test needs to assess learners’ ability in the specific area proposed on the basis of the aim of the test. For instance, a listening test with written multiple-choice options may lack validity if the printed choices are so difficult to read that the exam actually measures reading comprehension as much as it does listening comprehension.
Validity is classified into the following subtypes:
Content validity
This is a non-statistical type of validity that involves “the systematic examination of
the test content to determine whether it covers a representative sample of the behavior
domain to be measured” (Anastasi & Urbina, 1997: 114). A test has content validity built
into it by careful selection of which items to include. Items are chosen so that they comply
with the test specification which is drawn up through a thorough examination of the subject
domain. Content validity is very important in evaluating a test, in that “the greater a test’s content validity, the more likely it is to be an accurate measure of what it is supposed to measure” (Hughes, 1989: 22).
Construct validity
A test has construct validity if it demonstrates an association between the test scores and the prediction of a theoretical trait; intelligence tests are one example of measurement instruments that should have construct validity. Construct validity is viewed from a purely statistical perspective in much of the recent American literature (Bachman and Palmer, 1981a). It is seen principally as a matter of posterior statistical validation of whether a test has measured a construct that has a reality independent of other constructs.

To understand whether a piece of research has construct validity, three steps should be followed. First, the theoretical relationships must be specified. Second, the empirical relationships between the measures of the concepts must be examined. Third, the empirical evidence must be interpreted in terms of how it clarifies the construct validity of the particular measure being tested (Carmines & Zeller, 1991: 23).
Face validity
A test is said to have face validity if it looks as if it measures what it is supposed to measure. Anastasi (1982: 136) pointed out that face validity is not validity in the technical sense; it refers not to what the test actually measures, but to what it appears superficially to measure.

Face validity is very closely related to content validity. While content validity depends on a theoretical basis for judging whether a test assesses all domains of a given criterion, face validity concerns whether a test appears to be a good measure or not.
