

VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF LANGUAGES AND INTERNATIONAL STUDIES
FACULTY OF POST-GRADUATE STUDIES





SUBMITTED BY: TRẦN THỊ THU HƯƠNG
A thesis submitted in partial fulfillment of the requirements
for the degree of Master of Arts






EVALUATING THE VALIDITY OF THE FINAL ACHIEVEMENT TEST
FOR SECOND-YEAR NON-MAJOR STUDENTS AT ELECTRONIC-
ELECTRICAL ENGINEERING DEPARTMENT, NAM DINH
UNIVERSITY OF TECHNOLOGY EDUCATION
(Đánh giá độ giá trị của bài kiểm tra cuối kỳ cho sinh viên không
chuyên tiếng Anh năm thứ hai tại khoa Điện – Điện tử, Trường
Đại học Sư phạm Kỹ thuật Nam Định)



M.A. MINOR THESIS





Field: Language Teaching Methodology
Code: 60 14 10








HANOI, 2011

VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF LANGUAGES AND INTERNATIONAL STUDIES
FACULTY OF POST-GRADUATE STUDIES





SUBMITTED BY: TRẦN THỊ THU HƯƠNG
A thesis submitted in partial fulfillment of the requirements
for the degree of Master of Arts






EVALUATING THE VALIDITY OF THE FINAL ACHIEVEMENT TEST
FOR SECOND-YEAR NON-MAJOR STUDENTS AT ELECTRONIC-
ELECTRICAL ENGINEERING DEPARTMENT, NAM DINH
UNIVERSITY OF TECHNOLOGY EDUCATION
(Đánh giá độ giá trị của bài kiểm tra cuối kỳ cho sinh viên không
chuyên tiếng Anh năm thứ hai tại khoa Điện – Điện tử, Trường
Đại học Sư phạm Kỹ thuật Nam Định)





M.A. MINOR THESIS




Field: Language Teaching Methodology
Code: 60 14 10
Supervisor: Phạm Lan Anh, M.A.









HANOI, 2011


TABLE OF CONTENTS
DECLARATION i
ACKNOWLEDGEMENTS ii
ABSTRACT iii
LIST OF FIGURES, TABLES AND CHARTS iv
LIST OF ABBREVIATIONS v
TABLE OF CONTENTS vi
CHAPTER 1: INTRODUCTION 1
1.1. Rationale 1
1.2. Scope of the study 2
1.3. Aims of the study 2
1.4. Methods of the study 2
1.5. Research questions 3
1.6. Design of the study 3
CHAPTER 2: LITERATURE REVIEW 4
2.1. Relationship between teaching, learning and assessment 4
2.2. Purposes of formative and summative assessments 8
2.3. Achievement tests and their characteristics 9
2.3.1. Achievement tests 9
2.3.2. Characteristics of a good EGP test 11
2.3.3. Characteristics of a good ESP test 14
2.4. Face validity 15
2.4.1. Definition 15
2.4.2. Relationship between reliability and validity 16
2.4.3. Reasons for choosing face validity 17
2.5. Some measures to increase face validity 18


CHAPTER 3: THE STUDY 20
3.1. English learning and teaching at Nam Dinh University of Technology Education 20
3.1.1. Students’ backgrounds 20
3.1.2. The English teaching staff 20
3.1.3. Objectives of the English course 21
3.1.4. Checklist of the course book 22
3.1.5. Objectives of the final test 23
3.1.6. Difficulty level and discrimination of the final test 24
3.2. English testing at Nam Dinh University of Technology Education 24
3.2.1. Testing situation 24
3.2.2. The current final achievement test 25
3.3. Research methods 26
3.3.1. Survey questionnaire 26
3.3.2. Interview and informal discussion 26
3.4. Data analysis of survey questionnaires and interviews 26
3.4.1. Data analysis of the administration of the test 27
3.4.1.1. Data analysis of the format of the test 27
3.4.1.2. Data analysis of the logistics of the test 28
3.4.2. Data analysis of face validity of the test 29
3.4.2.1. Data analysis of general opinion about the test 30
3.4.2.2. Data analysis of reading comprehension task 31
3.4.2.3. Data analysis of grammar knowledge task 33
3.4.2.4. Data analysis of translation task 34
3.5. Discussion and findings 36
3.5.1. Similarities in teachers' and students' perceptions 36
3.5.1.1. Test administration 36

3.5.1.2. Face validity 36
3.5.1.2.1. General opinion about the test 36

3.5.1.2.2. Reading comprehension task 36
3.5.1.2.3. Grammar task 37
3.5.1.2.4. Translation task 37
3.5.2. Differences in teachers' and students' perceptions 37
3.5.2.1. Test administration 37
3.5.2.2. Face validity 37
3.5.2.2.1. Grammar task 37
3.5.2.2.2. Translation task 37
3.6. Suggestions to improve the final achievement test 38
CHAPTER 4: CONCLUSION 41
REFERENCES 42
APPENDICES I
APPENDIX 1 I
APPENDIX 2 V
APPENDIX 3 IX
APPENDIX 4 XIII

LIST OF FIGURES, TABLES AND CHARTS

Figures
Figure 1: Three Considerations for Test Choice.
Figure 2: The Scope of Impact of Language Tests.
Figure 3: Relationship between reliability and validity.
Tables
Table 1: ESP syllabus content allocation.
Table 2: Specification of test 12.
Table 3: Teachers' and students' comments on the format of the test.
Table 4: Teachers' and students' comments on the administration of the test.
Table 5: Teachers' and students' comments on the whole test.
Table 6: Teachers' and students' comments on students' reading comprehension ability and on the theme and instructions of the reading comprehension task.
Table 7: Teachers' and students' comments on the grammar task.
Table 8: Teachers' and students' comments on the translation task.
Charts
Chart 1: Percentage of teachers' and students' comments on what language ability the test mainly intends to measure.
Chart 2: Percentage of students' and teachers' opinions on which test items can measure their true ability.
Chart 3: Percentage distribution of the results students get in the test.
Chart 4: Percentage of test tasks that students cannot do.
Chart 5: Percentage of teachers' and students' comments on the length of the reading text.
Chart 6: Students' comments on whether or not the reading text is difficult.
Chart 7: Teachers' comments on which types of reading skills are expressed in the reading comprehension task.
Chart 8: Teachers' and students' comments on which ability the translation task requires.

LIST OF ABBREVIATIONS

NUTE: Nam Dinh University of Technology Education
EGP: English for General Purposes
ESP: English for Specific Purposes



CHAPTER 1: INTRODUCTION


1.1. Rationale
Learning English is now quite popular in Vietnam, and its popularity is increasing day by day. This is because Vietnam has recently adopted an open-door policy which encourages broadening and improving its relationships and cooperation with other countries in many aspects of life, such as the diplomatic, economic, cultural, scientific and technological areas. For a developing country like Vietnam, where modern science and technology are badly needed, English is not only a means of communication but also a key to gaining access to the latest scientific and technological achievements.
Recognizing the necessity of this global language, most schools, colleges and universities in Vietnam treat English as a main, compulsory subject that students must learn. In the English language learning and teaching process, evaluating a test is significant. Testing is part of the teaching and learning process because it provides feedback about the achievement of teaching and learning objectives for those involved in the education system. Moreover, a language test is the principal means of knowing a person's language ability. In education, and especially at the Faculty of Foreign Languages at Nam Dinh University of Technology Education (NUTE), testing students' achievement against teaching objectives is needed. Without an achievement test, it is difficult to see how rational educational decisions can be made.
NUTE is a technological university and its students' English ability is really low. Evaluating a test is also one way to improve students' learning process and results. Although many theses have addressed this problem, at NUTE it is still new and very necessary, because how to evaluate the test after each semester still receives little attention and, up to now, the process of test analysis after each examination has not been fully investigated. Consequently, students' results are getting worse and worse. As a teacher myself, I see that we teachers at NUTE stop at an experience-based level of test construction, administration and marking during and after examinations. We do not seek evaluation of our tests from other teachers and students, so test results are still not improving.



Moreover, examining a test's validity can be seen as an attempt to improve test quality. As a measure of students' achievement of learning objectives, final examinations must be valid: validity is one of the characteristics of a quality test. Therefore, "Evaluating the validity of the final achievement test for second-year non-major students at Electronic-Electrical Engineering Department, Nam Dinh University of Technology Education" was chosen in the hope that the study will be helpful to the author, the teachers, the test-takers and everyone concerned with language testing in general and the validity of an achievement test in particular. Due to limited time for collecting students' scores, this study differs from previous studies: the author focuses only on the face validity of this test. The author hopes that the results of the study can then be applied to improve the current test and to create a new, truly reliable item bank. It is also intended to encourage both teachers and learners in their teaching and learning.
1.2. Scope of the study
The scope of this thesis is limited to research on teachers' and test-takers' evaluation of the existing achievement test, in terms of its face validity, for second-year non-English-major students at the Electronic-Electrical Engineering Department, NUTE, owing to limitations in time, ability and availability of data. Moreover, it is impossible for the author to cover all the final achievement tests used, or to design a sample achievement test for second-year students. Instead, only a test specification for test 12 in semester 3 is presented.
1.3. Aims of the study
Following the scope of the research above, the aims of this research are:
1. To identify the English teachers' and students' evaluation of the existing final achievement test (test 12) at NUTE in terms of face validity.
2. To provide suggestions for test designers.
1.4. Methods of the study
In order to achieve the above aims, the study has been carried out as follows:
First, the author reviewed the theory of assessment and testing, achievement tests and the characteristics of a good achievement test, and test validity, with a special focus on face validity and some measures to increase it. From her critical reading, many reference materials were gathered, analyzed and synthesized to draw out a theoretical basis for evaluating the current test being used for 3rd-semester students in terms of its face validity.
Then, qualitative methodologies were employed, with data collected through survey questionnaires and interviews from both teachers and students at NUTE.
1.5. Research questions
This study is implemented to find answers to the following research questions:
1. What are the teachers' and test takers' (students') perceptions of the final 3rd-semester English achievement test at NUTE in terms of its face validity?
2. What suggestions can be made to improve the face validity of the final 3rd-semester English achievement test at NUTE?
1.6. Design of the study
The thesis is divided into four major chapters:
Chapter 1: Introduction presents basic information: the rationale, the scope, the aims, the methods, the research questions and the design of the study.
Chapter 2: Literature review reviews theoretical backgrounds on evaluating a test, which
includes relationship between teaching, learning and assessment, purposes of formative
and summative assessments, achievement tests, characteristics of good EGP and ESP tests,
face validity and some measures to increase face validity.
Chapter 3: The study is the main part of the thesis, presenting the context of the study and the detailed results and findings obtained from the collected data in response to the research questions. The author then gives some suggestions to improve the final achievement test.
Chapter 4: Conclusion offers conclusions and proposes some suggestions for further
research on the topic.

CHAPTER 2: LITERATURE REVIEW

This chapter provides an overview of the theoretical background of the study. It includes five main sections. Section 2.1 discusses the relationship between teaching, learning and assessment. Section 2.2 focuses on the purposes of formative and summative assessment. Section 2.3 gives a brief description of achievement tests and of the characteristics of good EGP and ESP tests. Section 2.4 then focuses on face validity. Finally, section 2.5 suggests some measures to increase face validity.
2.1. Relationship between teaching, learning and assessment
In the relationship between teaching, learning and assessment, curriculum and content standards also play an important role. Curriculum is best characterized as what should take place in the classroom: it describes the topics, themes, units and questions contained within the content standards, and the content standards provide the framework for the curriculum. Curriculum can vary from program to program, as well as from instructor to instructor. Unlike content standards, curriculum focuses on delivering the "big" ideas and concepts that the content standards identify as necessary for the learner to understand and apply. Curriculum serves as a guide for instructors, addressing teaching techniques, recommending activities, scope and sequence, and the modes of presentation considered most effective. In addition, curriculum indicates the textbooks, materials, activities and equipment that best help learners achieve the content standards. In the teaching and learning process, assessment supplies the evidence required to demonstrate that the content standards have been met. To ensure valid and reliable accountability, the assessment selected should test those standards. Clearly, assessment, curriculum and content standards are closely related: assessment provides the basis for the content standards, and the curriculum is generalized from them.
The Longman Dictionary of Language Teaching and Applied Linguistics (3rd edition) (Richards et al., 2005) defines assessment as "a systematic approach to collecting information and making inferences about the ability of a student or the quality or success of a teaching course on the basis of various sources of evidence".

Assessment is a critical link for teaching and learning, which also plays a vital role
in the process of curriculum design and teaching implementation. From the perspective of
the behavior research in classroom teaching, Richards & Nunan (1990) hold that
assessment refers to the set of processes through which we make judgments about a
learner’s level of skills and knowledge. Assessment should:
- Ensure reliability and validity;
- Provide for pre-, while- and post-testing;
- Be criterion- or standards-referenced;
- Inform instruction;
- Serve as an accountability measure;
- Be adaptable to a variety of instructional environments;
- Accommodate learners with special needs.
Various assessment measures are familiar to all, such as evaluation, examinations, questionnaires, interviews, discussion and observation, and testing is the most available means of implementing assessment in the teaching process. In Brown's (2001) view, a curriculum system needs analysis, objectives, testing, materials, teaching and evaluation. Similarly, Richards (1990) says that language curriculum development needs analysis, goals and objectives, syllabus design, methodologies, testing and evaluation. Both of them emphasize the importance of testing.
Bachman (1990: 20) defines the term "test" as "a measurement instrument designed to elicit a specific sample of an individual's behavior". This definition captures the basic and general nature of tests. Oller (1979: 1) defines a language test as an instrument that attempts to measure the extent to which students have learned in a foreign language course. From these two definitions, this research takes a language test to be a set of instruments, in the form of questions and problems, whose function is to measure an individual student's language abilities and knowledge in relation to a foreign language that he or she has learned.
A language test is a useful instrument with which educators can obtain reliable and valid information on their students' language abilities. Teachers can monitor and evaluate student learning and identify students' strengths and weaknesses to clarify what they really need to know. Students' test results can provide important feedback on how well an English course has been taught or learned, and necessary feedforward for students at the beginning of their English courses. Feedback and feedforward are very important in the teaching and learning process. The author expresses the relationship between feedback and feedforward through an example of catching a ball. When we move to catch a ball, we
must interpret our view of the ball’s movement to estimate its future trajectory. Our
attempt to catch the ball incorporates this anticipation of the ball’s movement in
determining our own movement. As the ball gets closer, or exhibits spin, we may find it
departing from the expected trajectory, and we must adjust our movement accordingly. It
means that feedforward helps teachers point out, at the beginning of the course, the anticipated problems students may have in the learning process, so that students can feel more confident in avoiding those problems and study more effectively, whereas feedback helps teachers adjust their teaching methods so that students get the best results. Feedback also helps the teacher evaluate the effectiveness of the syllabus as well as the methods and materials he or she is using. Test results become feedback on the curriculum that has been developed and implemented.
In addition, testing may have many impacts on teaching and learning. Hughes (1989: 1) calls the effect of testing on teaching and learning "backwash" and appreciates its role in the teaching-learning process. Backwash can be harmful if the test content does not match the objectives of the course: it leads to the problem of teaching one thing and testing another, and vice versa. However, backwash need not always be harmful; it can be positive, too. A test based directly on the needs of a specific group of learners will be useful for helping them perform in real life.
In view of the important role of language tests in the education system, Shohamy (2001: 2) emphasizes that "language tests need to be of high quality and follow careful rules of science of psychometrics." In other words, a good language test must give test takers accurate results with reference to the aspects of knowledge it measures. Furthermore, a high-quality language test must be reliable and valid so as to give precise information on the test takers' language ability. Language tests may differ according to the purposes of their design and how they are designed (see Figure 1).





Figure 1: Three Considerations for Test Choice (Purpose, Justification, Method)

A test's intended impacts are the effects that the test designer intends (see Figure 2). Bachman and Palmer (1996) point out that the entities potentially affected by a test include individuals (students and teachers), language classes and programs, and society.

Figure 2: The Scope of Impact of Language Tests (impact ranges from narrow to broad)
Obviously, the importance of testing cannot be denied. In particular, this research focuses on English for Specific Purposes (ESP) testing. ESP now plays an important role in university teaching and learning. Since the early 1960s, ESP has grown to become one of the most prominent areas of English foreign language teaching, a development reflected in an increasing number of publications, conferences and journals dedicated to ESP. Likewise, more traditional general English courses have given way to courses aimed at specific areas, for example English for Business Purposes. With the emergence of ESP came a strong need for testing specific groups of learners, and as a result the ESP testing movement has shown slow but definite growth over the past few years. Obviously, both ESP testing and EFL testing are indispensable in the teaching and learning process.
[Figure 2 depicts the widening scope of impact: on an individual student; on students and teachers; on students, teachers, classes and programs; on students, teachers, programs and institutions; and on students, teachers, classes, programs, institutions and society.]

To sum up, teaching, learning and assessment are correlated because testing, teaching and learning are not separate entities. A good test can be used as a valuable teaching and learning device. Teaching has always been a process of helping others to discover "new" ideas and "new" ways of organizing what they have learned. Whether this process takes place through systematic teaching, learning and testing, or through a discovery approach, testing was, and remains, an integral part of teaching and learning.
2.2. Purposes of formative and summative assessments
As said above, assessment is the process of documenting and measuring knowledge, skills, attitudes and beliefs. Many kinds of assessment are used in a course, such as continuous assessment, formative assessment, summative assessment, peer-assessment and self-assessment. In this research, however, the author focuses on the relationship between two main kinds: formative assessment and summative assessment. "As coach and facilitator, the teacher uses formative assessment to help support and enhance student learning. As judge and jury, the teacher makes summative judgments about a student's achievement" (Atkin, Black & Coffey, 2001).
Formative assessment is designed to provide feedback and feedforward to students
and instructors for the purpose of the development of teaching and learning. From a
student's perspective, formative assessment provides information on a student's
performance, how they are progressing with the skills and knowledge required by a
particular course and the problems which they will have in a course. Generally the results
of formative assessment do not contribute to a student's final grade but are purely for the
purpose of assisting students to understand their strengths and weaknesses in order to work
towards improving their overall performance. From an instructor's perspective, formative
assessment is a diagnostic tool that can be used to evaluate the effectiveness of course and
curriculum design. Formative assessment has the potential to highlight areas in which
teaching and curriculum design needs to be improved as well as any areas where teaching
methods have been very effective in improving student. The sample tests in this kind are
diagnostic test and placement test. Placement test is used at the beginning of a course to
indentify a student’s level of language and find the best class for them. Diagnostic test is
used to identify problems that students have with language. The teacher diagnoses the

9
language problems students have. It helps the teacher to plan what to teach in future and
provide students with the anticipated problems and solutions.
The purpose of summative assessment is to provide "a sampling of student
achievements which lead to a meaningful statement of what they know, understand and can
do" (Brown & Knight, 1999: 37). Generally summative assessment occurs at the end of a
topic or the end of a course in order to evaluate how well students have acquired the

knowledge and skills presented in that section or during the complete course.
The achievement test is a typical example of summative assessment.
Clearly, the relationship between formative and summative assessment is close, as expressed through their purposes: the teacher needs to use both to evaluate students' ability and to enhance the quality of teaching and learning. The teacher has to identify students' problems in order to determine their level, adjust teaching methods, and finally test how well students have acquired the lessons.
2.3. Achievement tests and their characteristics
Of the two kinds of assessment mentioned above, this research uses only summative assessment because of its purpose: the research evaluates a final ESP test, so summative assessment, and the achievement test in particular, is the appropriate choice here.
2.3.1. Achievement tests
Achievement tests play an important role in the school programs, especially in
evaluating students’ acquired language knowledge and skills during the course and they are
widely used at different school levels.
In Sparatt's (1985: 145) view, "an achievement test is one of the means available to teachers and students alike of assessing progress. It is the aim and content of an achievement test that distinguishes it from other kinds of test".
David (1999: 2) also shares an idea that “achievement refers to the mastery of what
has been learnt, what has been taught or what is in the syllabus, textbook, materials, etc.
An achievement test therefore is an instrument designed to measure what a person has
learnt within or up to a given time”.
Similarly, Brown (1994b:259) proposes a concept that “An achievement test is
related directly to classroom lesson, units or even a total curriculum. They are limited to
particular materials covered in a curriculum within a particular time frame". Unlike a progress test, an achievement test should attempt to cover as much of the syllabus as possible.

If we confine our test to only part of the syllabus, the contents of the test will not reflect all
that has been learned.

There are two kinds of achievement tests: final achievement test and progress
achievement test.
Progress achievement tests (short-term achievement tests) are administered during the course, after a chapter or a term, and are often written by the teacher. These tests are, of course, based on the teaching program. Hughes (1990: 12) claims that "these tests are intended to measure the progress that students are making". In other words, progress achievement tests are supposed to help teachers judge the degree of success of their teaching and find out how much students have gained from what has been taught. Accordingly, teachers can identify the weaknesses of the learners or diagnose the areas not properly achieved during the course of study. On the other hand, for students, this kind of test can be regarded as a useful device that gives them a good chance to perform in the target language in a positive and effective manner and to gain additional confidence. It can be a good preparatory and supportive step towards the final achievement test, because students become familiar with the tests and the strategies for doing them.
Final achievement tests (longer-term achievement tests) are those administered at the end of a course of study. They may be written and administered by ministries of education, official examining boards, or by members of teaching institutions. They are used to check how well learners have done after a whole course in terms of the objectives and content of the course. Therefore, according to Hughes (1990: 11), there are two kinds of final achievement test: the syllabus-content approach and the syllabus-objective approach.
The syllabus-content approach is based directly on a detailed course syllabus or on
the books and other material used. The test only contains what it is thought that the
students have actually encountered, and thus can be considered, in this respect at least, a
fair test. The disadvantage of this type is that if the syllabus is badly designed, or the books
and other materials are badly chosen, then the results of a test can be very misleading.
Successful performance on the test may not truly indicate successful achievement of course
objectives.
The syllabus-objective approach refers to the one in which the test contents are
based directly on the objectives of the course. This approach has some benefits. First, it


forces course designers to be explicit about course objectives. Second, the test allows students to show how far they have achieved those objectives. This in turn puts pressure on those who are
responsible for the syllabus and for the selection of books and materials to ensure that
these are consistent with the course objectives. Tests based on course objectives work
against the perpetuation of poor teaching practice. The author believes that test content
based on course objectives is much preferable, which provides more accurate information
about individual and group achievement, and is likely to promote a more beneficial
backwash effect on teaching.
2.3.2. Characteristics of a good EGP test
In order to make a well-designed test, teachers have to take into account a variety
of factors such as the purpose of the test, the content of the syllabus, the students’
background, the goal of administrators and so forth. Moreover, test characteristics play a
very important role in constructing good English for General Purpose (EGP) tests. The
most important quality of a test is its usefulness. The usefulness quality generally consists
of 4 main components: reliability, validity, practicality and washback.
Reliability has been defined in different ways by different authors. Berkowitz, Wolkowitz, Fitch and Kopriva (2000) define reliability as "the degree to which test scores for a group of test takers are consistent over repeated applications of a measurement procedure and hence are inferred to be dependable and repeatable for an individual test taker". Bachman (1990: 24) considers test reliability "a quality of test scores". Clearly, both views refer to the consistency of the scores obtained on a test. Every test should be reliable: if a group of students were to take the same test on two occasions, their results should be roughly the same, provided that nothing has happened in the interval. Thus, if the students' results are very different, the test cannot be described as reliable.
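This notion of score consistency can be sketched numerically. As a hypothetical illustration (the scores below are invented, not data from this study), test-retest reliability is often estimated as the Pearson correlation between the scores the same group of students obtains on two sittings of the same test:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two lists of scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Invented scores (out of 10) for six students on two sittings of one test.
first_sitting = [7.0, 5.5, 8.0, 4.0, 6.5, 9.0]
second_sitting = [7.5, 5.0, 8.5, 4.5, 6.0, 9.5]

# A coefficient near 1.0 means the test ranks students consistently,
# i.e. the scores are reliable.
print(round(pearson(first_sitting, second_sitting), 2))  # → 0.97
```

If the second sitting produced very different rankings, the coefficient would drop towards zero, which is exactly the situation described above as an unreliable test.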
Validity refers to the degree that a test actually measures what it was designed to
measure. Validity is often discussed under the headings: face, content, construct and
criterion-related.
Content validity

This is a non-statistical type of validity that involves "the systematic examination of the test content to determine whether it covers a representative sample of the behavior domain to be measured" (Anastasi & Urbina, 1997: 114). A test has content validity built into it by careful selection of which items to include. Items are chosen so that they comply with the test specification, which is drawn up through a thorough examination of the subject domain. Foxcroft et al. (2004: 49) note that the content validity of a test can be improved by using a panel of experts to review the test specifications and the selection of items. The experts will be able to review the items and comment on whether they cover a representative sample of the behavior domain.
Construct validity
A test has construct validity if it accurately measures a theoretical, non-observable
construct or trait. The construct validity of a test is worked out over a period of time on the
basis of an accumulation of evidence. There are a number of ways to establish construct
validity. Two methods of establishing a test’s construct validity are convergent/divergent
validation and factor analysis.
A test has convergent validity if it has a high correlation with another test that
measures the same construct. By contrast, a test’s divergent validity is demonstrated
through a low correlation with a test that measures a different construct.
Factor analysis is a complex statistical procedure which is conducted for a variety
of purposes, one of which is to assess the construct validity of a test or a number of tests.
Face validity
Hughes (1989) states that “a test is said to have face validity if it looks as if it
measures what it is supposed to measure”. Anastasi (1982: 136) pointed out that face validity
is not validity in the technical sense; it refers not to what the test actually measures, but to
what it appears superficially to measure.
Face validity is very closely related to content validity. While content validity
depends on a theoretical basis for deciding whether a test assesses all domains of a given
criterion, face validity relates to whether a test appears to be a good measure or not.

Criterion-related validity
Criterion-related validity is a concern for tests that are designed to predict
someone’s status on an external criterion measure. A test has criterion-related validity if it
is useful for predicting a person’s behavior in a specified situation. Criterion-related
validity consists of two types (Davies, 1977): concurrent validity and predictive validity.
In concurrent validation, the predictor and criterion data are collected at or about
the same time. This kind of validation is appropriate for tests designed to assess a person’s
current criterion status, such as diagnostic screening tests.

In predictive validation, the predictor scores are collected first and criterion data
are collected at some later/future point. This is appropriate for tests designed to assess a
person’s future status on a criterion.
Practicality is the ease with which a test can be constructed, administered, scored and
interpreted. A test must be carefully organized well in advance. How long will the test take?
What special arrangements have to be made (for example, what happens to the rest of the
class while individual speaking tests take place)? Is any equipment needed (tape recorder,
language lab, overhead projector)? How is the marking of the work handled? How are tests stored
between sittings? All of these practical questions help ensure the
success of a test and of testing (Heaton, 1988; Hughes, 1997; Carroll & Hall, 1985).
The last important factor in testing is the backwash or washback effect. Washback is
the effect of testing on the teaching and learning processes. Washback can be harmful or
beneficial. If a test is regarded as important, then preparation for it can dominate all
teaching and learning activities, negatively or positively. If the test content and
testing techniques are at variance with the objectives of the course, then there is likely to
be harmful washback. If the skill of writing, for example, is tested only by multiple choice
items, then there is pressure to practise such items rather than the skill of writing
itself. This harmful washback is clearly undesirable. An example that often comes up is the
effect of the university entrance examinations in Vietnam on high school language
teaching and learning. However, washback need not always be harmful; indeed it can be
positively beneficial. If an English test for first-year undergraduate students is designed on
the basis of an analysis of the English language needs of these students, includes
tasks as similar as possible to those which they would have to perform as undergraduates
(reading textbooks, taking notes during lectures, etc.) and is administered instead of one which
is entirely multiple choice, then beneficial washback can be achieved. There will be an
immediate effect on teaching and learning: the syllabus will be redesigned, new books will
be selected, classes will be conducted differently and students’ ways of learning will change
to reflect the demands of the new test.
In a nutshell, the author has just given a general overview of achievement tests
and the characteristics of a good EGP achievement test so that readers can understand how to
evaluate a final EGP achievement test.

2.3.3. Characteristics of a good ESP test
Nowadays, ESP teaching and research have achieved tremendous progress at
home and abroad. On the teaching side, they have formed the system of Vocational English
(VE: Business English, Tourism English, Hotel English, Medical English, etc.) and English
for Academic Purposes.
“ESP is not a matter of teaching specialized varieties of English. The fact that
language is used for a specific purpose does not imply that it is a special form of language,
different in kind from other forms. Though the content of learning may vary, there is no
reason to suppose that the processes of learning should be any different for the
ESP learner than for the general English learner” (Hutchinson, 1987).
From the above view, we can draw two points. The first is that ESP is one kind of English,
with its own specific language characteristics, rather than a special form applied to teach
particular items, and that the similarity between ESP and EGP is more striking than their
difference; the other is that there is no essential difference in teaching principles and
procedures between ESP and EGP. In other words, EGP is the preliminary stage for ESP, and ESP
is the advanced stage of EGP teaching. Testing and evaluation for ESP should be carried out in
accordance with the teaching contents and objectives. Therefore, only with efficient
principles and suitable teaching methods and modes can ESP stimulate the
students’ motivation for language learning, arouse their enthusiasm for learning, and
contribute to the construction of harmony between teachers and students. Clearly, ESP
tests share the qualities of all good EGP tests. This means that every ESP test consists of the
four components mentioned above: reliability, validity, practicality and washback.
However, two aspects of ESP testing may be said to distinguish it from more
general purpose language testing: authenticity of task and the interaction between language
knowledge and specific purpose content knowledge.
Authenticity of task means that the ESP test tasks should share critical features of
tasks in the target language use situation of interest to the test takers. The key to this
assessment is to present learners with tasks that resemble, in some way, those that they may
have to perform with the language in real life. Therefore, the ESP approach in testing is based
on the analysis of learners’ target language use situations and of the specialized knowledge
involved in using English for real communication.

The interaction between language knowledge and specific purpose content
knowledge is perhaps the clearest defining feature of ESP testing. In more general purpose
language teaching, the factor of background knowledge is usually seen as a confounding
variable, contributing to measurement error and to be minimized as much as possible. In
ESP testing, background knowledge is a necessary, integral part of the concept of specific
purpose language ability.
To sum up, EGP is the preliminary stage for ESP. ESP will be taught once the students have
acquired general English grammar and knowledge. ESP tests are similar to EGP tests but focus
on a specific purpose in the target language use situation, with English used for real communication.
2.4. Face validity
2.4.1. Definition
Hughes (1989) states that “a test is said to have face validity if it looks as if it
measures what it is supposed to measure”. Its appearance, in other words, is what gives a test
face validity. Face validity concerns the appeal of the test to popular (non-expert) judgment,
typically that of the

candidate, the candidate’s family and members of the public. The test is what students and
parents want and it looks familiar to them. For example, suppose that for the past eight years
the Grade 9 exam has used passages, comprehension questions and grammar exercises taken directly
from English 9, and students have prepared for the exam by memorizing the book. This year,
without warning anyone, the Foreign Language Specialist writes the exam using parallel texts
and exercises not taken directly from the book. This test lacks face validity. Face
validity is hardly a scientific concept, yet it is very important. A test which does not have
face validity may not be accepted by candidates, teachers, education authorities or
employers. In favor of this view, McNamara (2000: 133) defines face validity as the degree
of acceptability of a language test to those who are involved in its design and use. A
language test is said to be face valid only if it satisfies their expectations. Ingram (1977: 18),
as cited by Anderson et al. (1995: 289), also agrees that face validity is “surface credibility
or public acceptability”.
Ensuring the face validity of a language test is important, given that this validation
procedure is one of the major aspects of validity. The procedure of face validation
“involves an intuitive judgment about the test’s content by people whose judgment is not
necessarily expert”, as mentioned by Anderson et al. (1995: 289). Anderson et al. (1995:
172) mention that the process of face validation simply deals with how those people
comment on the appearance of the language test, although there may be little attention paid
to the content of test items. Analyzing the face validity of an English test is thus an attempt
to gather people’s opinions on whether the test looks valid as an English test or not.

2.4.2. Relationship between reliability and validity
We often think of reliability and validity as separate ideas but, in fact, they are
related to each other. Reliability and validity are the two vital characteristics that constitute
a good test. However, they have a complicated relationship.
If a test is not reliable, it cannot be valid at all. To be valid, according to Hughes
(1988: 42), “a test must provide consistently accurate measurements. It must therefore be
reliable. However, a reliable test may not be valid at all”. For example, in a writing test,
candidates might be required to translate a text of 500 words into their own language. This could
well be a reliable test, but it cannot be a valid test of writing. In short, if a test is valid, it
must also be reliable; reliability is thus a necessary but not sufficient condition for
validity. To understand this more fully, the author wants to show their relationship through the
following figure. Think of the center of the target as the concept that you are trying to
measure. Imagine that for each person you are measuring, you are taking a shot at the
target. If you measure the concept perfectly for a person, you hit the center of the
target; if you do not, you miss it. The further off you are for that person, the
further you are from the center.

Figure 3: Relationship between reliability and validity
The figure above shows three possible situations. In the first, you are hitting the
target consistently, but you are missing the center of the target. That is, you are
consistently and systematically measuring the wrong value for all respondents. This
measure is reliable, but not valid (that is, it is consistent but wrong). The second shows hits
that are randomly spread across the target; you seldom hit the center. In this
case, your measure is neither reliable nor valid. Finally, you consistently hit the center of
the target; your measure is both reliable and valid. In brief, reliability is a necessary but
not sufficient condition for validity.
2.4.3. Reasons for choosing face validity

As the relationship between reliability and validity shown above indicates, validity is an
indispensable quality of all good tests. Hughes (1982: 22) says that “the greater a test’s
content validity is, the more likely it is to be an accurate measure of what it is to measure”.
Therefore, from the outset of test construction, test validity should be the most essential
concern of all.
Validity of a language test has four facets, namely face validity, content validity,
construct validity and criterion-referenced validity. However, the author focuses on face
validity for several reasons.
Firstly, the latter three facets of validity (content validity, construct validity and
criterion-referenced validity) are excluded from this research because of the limitation of
time and resources. Anastasi (1982: 136), as cited by Weir (1990: 26), stated that “face validity
is not validity in the technical sense”. Face validation is significant in that it concerns
whether or not the test “looks valid” to those who deal with it, so the researcher
performs the analysis of face validation. Heaton (1988: 60) adds that “face validity
can provide not only a quick and reasonable guide but also a balance to too great a
concern with statistical analysis”. He points out that students’ motivation is maintained if a
test has good face validity; this plays a certain role in any test and is of great concern in this
thesis. According to Anastasi & Urbina (1997: 114), content validity is a non-statistical
type of validity that involves “the systematic examination of the test content to determine
whether it covers a representative sample of the behavior domain to be measured”. Content
validity evidence involves the degree to which the content of the test matches a content
domain associated with the construct. Obviously, establishing content validity requires a
representative sample test to analyze and compare. According to Bachman and Cohen
(1998: 50), construct validation deals with the “judgmental and empirical justifications
supporting the inferences made from test scores”. Bachman and Palmer (1996: 21) also
mention that construct validation is related to the “meaningfulness and appropriateness” of
the researcher’s interpretations of the actual test scores. Bachman (1990: 248)
mentions that criterion-referenced validity deals with demonstrating “a relationship
between test scores and some criterion which is believed as an indicator of the ability
tested”. However, this research is not provided with the actual test scores or a sample
reliable test, and thus excludes these three validation processes from its investigation.
Secondly, face validity is chosen because of its importance in society. As Hughes
(1989) says, face validity is hardly a scientific concept, yet it is very important, and a
test which does not have face validity may not be accepted by candidates, teachers,
education authorities or employers. Agreeing with this view, Huong (2000: 69) points out
that test appearance is an important consideration in test use. She supposes that useful
information to inform test development can be obtained by investigating the test takers’
perception of the appropriateness of the test and of its connection to the relevant real-life
tasks that they will later encounter. Clearly, the face validity of a test is very important in
society’s evaluation, because the latter three facets of validity belong to the specialized
domain of test designers.
This study only helps the author to design a test suitable for students’ ability at
NUTE. Therefore, the reasons discussed here are regarded as a strong impetus that
initiates this thesis into investigating the face validity of the achievement language test 12
at NUTE.
2.5. Some measures to increase face validity
Face validity is an important aspect of a test; it relates to the question of whether
non-professional testers such as parents and students think the test is appropriate. If these
non-specialists do not think the test is testing candidates’ knowledge in a suitable manner,
they may complain vociferously and the candidates may not tackle the test with the
required zeal. If the test lacks face validity, it may not work as it should and may have to
be redesigned (Alderson, Clapham & Wall, 1995). Therefore, it is necessary to suggest
measures, concerning both the administration and the content of the test, to increase face
validity, as follows:
- Test format is familiar and clear to the students;
- The quantity of questions designed in a test is suitable for the time allowance;
- Test conditions (the space and atmosphere in the testing room in particular) are
arranged to bring out students’ best performance;
