
VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF LANGUAGES AND INTERNATIONAL STUDIES
FACULTY OF POST-GRADUATE STUDIES

NGUYỄN THỊ HOA

CONCURRENT VALIDITY OF THE ENGLISH TESTS IN THE
NATIONAL SECONDARY SCHOOL LEAVING EXAMINATION,
SCHOOL YEARS 2008-2009, 2009-2010
(Tính giá trị so sánh của bài thi tốt nghiệp trung học phổ thông (THPT) môn
tiếng Anh năm học 2008-2009 và năm học 2009-2010)

M.A. Minor Programme Thesis

Major: Methodology
Code : 601410

HANOI, 2010



VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF LANGUAGES AND INTERNATIONAL STUDIES
FACULTY OF POST-GRADUATE STUDIES

NGUYỄN THỊ HOA

CONCURRENT VALIDITY OF THE ENGLISH TESTS IN THE
NATIONAL SECONDARY SCHOOL LEAVING EXAMINATION,
SCHOOL YEARS 2008-2009, 2009-2010
(Tính giá trị so sánh của bài thi tốt nghiệp trung học phổ thông (THPT) môn
tiếng Anh năm học 2008-2009 và năm học 2009-2010)

M.A. Minor Programme Thesis

Major: Methodology
Code: 601410
Supervisor: HOÀNG THỊ XUÂN HOA, Ph.D.

HANOI, 2010


TABLE OF CONTENTS

Declaration .......... i
Acknowledgements .......... ii
Abstract .......... iii
Table of contents .......... iv
List of abbreviations .......... viii
List of tables and figures .......... ix
PART A: INTRODUCTION .......... 1
1. Rationale .......... 1
2. Scope of the study .......... 2
3. Aims of the study .......... 2
4. Methods of study .......... 2
5. Research questions .......... 3
6. Design of the study .......... 3
PART B: DEVELOPMENT .......... 5
CHAPTER 1: LITERATURE REVIEW .......... 5
1.1 Achievement tests .......... 5
1.1.1 Definition .......... 5
1.1.2 Kinds of achievement tests .......... 6
1.1.3 Benefits of achievement testing .......... 6
1.2 Qualities of a good test .......... 8
1.2.1 Test reliability .......... 8
1.2.2 Practicality .......... 9
1.2.3 Comparison and Discrimination .......... 10
1.2.4 Test validity .......... 11
1.2.4.1 Types of test validity .......... 11
1.2.4.1.1 Face validity .......... 12
1.2.4.1.2 Construct validity .......... 13
1.2.4.1.3 Content validity .......... 14
1.2.4.1.4 Criterion-related validity .......... 15
1.2.4.1.4.1 Concurrent validity .......... 15
1.2.4.1.4.2 Predictive validity .......... 16
1.2.4.2 Reasons for giving more emphasis to concurrent validity .......... 16
1.3 Statistical analysis of test results .......... 17
1.3.1 Correlation .......... 17
1.3.2 Descriptive statistics .......... 18
1.3.2.1 The mean .......... 18
1.3.2.2 The mode .......... 19
1.3.2.3 The median .......... 19
1.3.2.4 The range .......... 19
1.3.2.5 The standard deviation .......... 19
1.3.3 Classical item analysis .......... 20
1.3.3.1 Item facility .......... 20
1.3.3.2 Item discrimination .......... 20
CHAPTER 2: METHODOLOGY .......... 22
2.1 Setting of the study .......... 22
2.1.1 Education in Vietnam .......... 22
2.1.2 Cao Ba Quat Gia Lam High School .......... 22
2.1.3 Nguyen Gia Thieu High School .......... 23
2.1.4 The National secondary school leaving examination .......... 24
2.1.4.1 Objectives .......... 24
2.1.4.2 Test specification .......... 24
2.2 Research methodology .......... 25
2.2.1 Participants .......... 25
2.2.2 Data collection .......... 26
2.2.3 Procedures .......... 26
2.2.4 Data analysis .......... 27
CHAPTER 3: DISCUSSION AND FINDINGS .......... 28
3.1 Analysis of data .......... 28
3.1.1 Correlation coefficient .......... 28
3.1.2 Descriptive statistics .......... 29
3.2 Interpretation of data .......... 34
PART C: CONCLUSIONS .......... 35
1. Conclusions .......... 35
2. Implications for improvement of the English test in the NSSLE .......... 36
3. Limitations and suggestions for future research .......... 37
REFERENCES .......... 39
APPENDIXES .......... I
Appendix 1: The English test in the NSSLE school year 2008-2009 .......... I
Appendix 2: The English test in the NSSLE school year 2009-2010 .......... VII
Appendix 3: Example of calculating correlation coefficient .......... XIII
Appendix 4: Example of calculating the mean .......... XIV
Appendix 5: Example of calculating the mode .......... XIV
Appendix 6: Example of calculating the median .......... XIV
Appendix 7: Example of calculating the range .......... XV
Appendix 8: Example of calculating the standard deviation .......... XV




LIST OF ABBREVIATIONS

EFL: English as a foreign language
ELT: English language teaching
L2: second language; in the context of this study, it usually refers to English
NSSLE: National Secondary School Leaving Examination



LIST OF TABLES AND FIGURES

TABLES
Table 1.1: Correlations .......... 17
Table 2.1: Test specification .......... 25
Table 2.2: Students' grade in English, school year 2009-2010 .......... 26
Table 3.1: Descriptive statistics of the two tests .......... 29

FIGURES
Figure 1.1: Factors that affect language test scores .......... 8
Figure 1.2: Practicality .......... 9
Figure 1.3: Construct validity of score interpretations .......... 13
Figure 3.1: Correlation = +0.977 .......... 28
Figure 3.2: Descriptive statistics of test one (2008-2009) .......... 30
Figure 3.3: Descriptive statistics of test two (2009-2010) .......... 30
Figure 3.4: Distribution of scores on test one .......... 32
Figure 3.5: Distribution of scores on test two .......... 33



PART A: INTRODUCTION

1. Rationale
Education has always played an important part in people's lives. In the conditions of the world economic crisis, solid knowledge and skills help people keep their present job or make it easier to find a new one. However, effective education is impossible without effective management: knowledge and skills must be checked and controlled effectively. Therefore, testing in education, an attempt to measure a person's knowledge, intelligence, or other characteristics in a systematic way, is of great significance.
Ever since the English language began to be taught in formal settings, the development of tests to assess learners' performance has been an integral part of the language learning and teaching process. Language testing, then, is central to language teaching: it provides goals for language teaching and monitors success in reaching those goals.
In Vietnam, the English test in the National Secondary School Leaving Examination (NSSLE) plays an important role in English language teaching (ELT): it stimulates student progress and evaluates students' achievement in acquiring English throughout the seven years from junior to upper secondary school. Passing or failing this test is a matter of great concern, as it decides whether a student can proceed to higher education. Besides, through this test teachers can also evaluate the effectiveness of a new teaching method or of new materials (Valette, 1977). The English tests in the NSSLE school years 2008-2009 and 2009-2010 were particularly significant because high school English teachers had a chance to look back at their successes as well as their failures in implementing the new English textbook series English 10, English 11, and English 12, which came into official use in 2004, so that they could make the necessary amendments in the following school years.
This paper is the writer's attempt to evaluate the concurrent validity of the English tests in the NSSLE school years 2008-2009 and 2009-2010 by establishing the correlation between the scores of the two tests, showing how widely the scores spread out, how closely they cluster, and how well the tests have separated students from each other. It is hoped that the results of the study can raise the awareness of English teachers in general and of those interested in making better English tests in the NSSLE in particular.
2. Scope of the study
This research focuses on the concurrent validity of the English tests in the NSSLE, school years 2008-2009 and 2009-2010, only. Other aspects of evaluating an achievement language test are therefore beyond the scope of this study.
Also, because the English tests in the NSSLE school years 2008-2009 and 2009-2010 were multiple-choice tests marked by a scoring machine which gives out only final test scores, a careful analysis of score patterns on each of the test items is out of the question.
In addition, due to limitations in time, ability, and conditions, it was impossible for the author to take a sample population that includes representatives from different geographical areas (i.e. urban, rural, island, and mountainous) as well as from a variety of ethnic groups (Kinh, Cham, H'Mong, Khmer, etc.). Therefore, this study investigates the concurrent validity of the English tests in the NSSLE, school years 2008-2009 and 2009-2010, only in the Gia Lam and Long Bien districts, where the writer is currently working.
3. Aims of the study
This study is intended to examine the concurrent validity of the English tests in the NSSLE, school years 2008-2009 and 2009-2010. It places high emphasis on investigating and analyzing test scores in order to establish the correlation coefficient between the two sets of test results, reveal the spread of scores, and determine the tests' ability to discriminate among students.




4. Methods of study
To realize the aims mentioned above, the author employed a combination of methodologies.
First, through critical reading the author gathered, analyzed, and synthesized literature relating to achievement tests in language testing and the qualities of a good language test, with a special focus on concurrent validity. This provides a theoretical basis for evaluating the two most recent English tests in the NSSLE.
Second, quantitative methodology was applied to collect data, through the author's marking of the English test in the NSSLE, school year 2008-2009, and through visits to the schools to ask for the test results of the English test in the NSSLE, school year 2009-2010.
Third, the study also made use of supporting methods, such as informal discussions and exchanges of opinion with teachers and colleagues, and consultation with an experienced and enthusiastic supervisor, to gather the information needed.
5. Research questions
This study was carried out to find answers to the following research questions:
- How are the English tests in the National Secondary School Leaving Examination, school years 2008-2009 and 2009-2010, correlated?
- How do the scores of each test cluster together?
- How do the scores of each test spread out?
- How do the tests discriminate students' achievement?

6. Design of the study
The thesis is organized into three major parts:
Part A, INTRODUCTION, presents such basic information as the rationale, the scope, the aims, the methods, the research questions, and the design of the study.
Part B, DEVELOPMENT, provides the literature review, the methodology, and the findings of the study in three corresponding chapters. In chapter one, the literature review, the theoretical background for evaluating a language test is described. This chapter also covers reasons for testing, criteria of a good language test, achievement tests, and issues of concurrent test validity. In chapter two, methodology, the setting of the study and the methodology employed to carry out the research are fully portrayed. In chapter three, discussion and findings, the results of the analyzed data regarding the correlation coefficient of the two sets of scores, the mean, the mode, the range, and the standard deviation are presented in great detail.
Part C, CONCLUSION, gives a summary of the study, its implications for the improvement of the English test in the NSSLE, its limitations, and suggestions for future research.



PART B: DEVELOPMENT
CHAPTER 1: LITERATURE REVIEW
1.1 Achievement tests
An achievement test is concerned with measuring a student's competence with regard to what has been taught or what is in the syllabus. This type of test is usually given at the end of a period of instruction and, as a result, its content is a sample of what has been included in the syllabus. Such a test is normally school-based and typically provides a check on previous learning. However, it should be borne in mind that the purpose of achievement tests should be to indicate how successful the learning experiences have been for the learner, rather than to show in what respects they were insufficient, and the tests themselves should also be firmly grounded in preceding classroom experiences in terms of activities practiced, language used, and criteria of evaluation adopted (Weir, 1993).
1.1.1 Definition
There are a variety of definitions of achievement tests. As Baker (1982) put it, achievement tests are "used presumably to assess the subject matter and skills that students have learnt". J.B. Heaton (1988) added that achievement tests should be based on "what students are presumed to have learnt, not necessarily on what they have actually learnt, nor on what has actually been taught". In the same vein, Hughes (1989) emphasized the importance of achievement tests in assessing students' success in reaching the language course's goals. Tim McNamara (2000) shared the same point of view as Hughes (1989), asserting that "achievement tests accumulate evidence during or at the end of a course of study in order to see whether and where progress has been made in terms of the goals of learning".



These definitions all have one thing in common: achievement tests are used to determine a student's academic strengths and weaknesses and to measure a student's mastery of a given subject or skill. They are directly related to language courses, their purpose being to establish how successful individual students, groups of students, or the courses themselves have been in achieving objectives.
1.1.2 Kinds of achievement tests
Hughes (1989) divided achievement tests into two types, final and progress, which can be either content- or objective-based. While final achievement tests are given upon completion of a course, progress achievement tests are delivered at a particular stage of a course to measure students' advance towards the course's objectives. Hughes (1989) also stated that content-based achievement tests are considered fair, as they test what students have already encountered, but they can be misleading in the case of a badly designed syllabus or badly chosen materials. Objective-based achievement tests, in contrast, are beneficial because they help to determine whether there is consistency between the material and the course objectives, and they promote a positive backwash effect on teaching.
Contrary to Hughes's idea, Alderson, Clapham and Wall (1995) considered achievement tests and progress tests to be two independent and distinct categories. They claimed that although progress tests and achievement tests are both content-based, they are given at different stages of the course. From my own point of view, it is a good idea to group the two types of test under one roof instead of dividing them into two, as they bear so much resemblance to each other.
McNamara (2000) asserted that achievement tests can gather evidence to show whether and what students have acquired, as well as how students advance toward the learning goals. Personally, I think that tests alone cannot reflect students' progress exactly; mental and physical health should also be taken into account.
1.1.3 Benefits of achievement testing
Achievement tests serve a variety of purposes, none of which, obviously, is to instill a sense of anxiety and frustration in students and/or teachers. Being assessed is inevitably an anxiety-provoking experience. Nevertheless, as teachers and students, we can attest that a well-written, content-valid test provides us with an opportunity to take stock of what we have learned, and to demonstrate to ourselves and to others the knowledge and skills that we have accumulated. A well-constructed test will give both the teacher and the students an appraisal of their respective achievements. It provides teachers with invaluable information regarding students' needs and abilities, and a measure of how well the students have met the course objectives.
Rather than lessening self-confidence, achievement tests have the capacity to foster it. One of our goals as teachers is to create multiple opportunities for students to experience success and to excel as language learners. Achievement tests provide one of the strongest ways in which we can help to instill and strengthen positive feelings. Such feelings towards the L2 learning experience as a whole can be fostered through tests that challenge students with items designed to emphasize what the students are able to do with the L2. The importance of administering tests with challenging items and a high degree of content validity cannot be stressed enough. Administering tests loses its value if the items do not pose a particular challenge to the students and/or if they do not adequately reflect the given body of content.
Assessment tests in the L2 classroom can foster language learning in a number of ways, including the following: (a) tests can enhance students' motivation by serving as indicators of the progress they have made; (b) tests can help students establish learning goals for themselves, both prior to and after the test; (c) tests can help students confirm their strengths and weaknesses, thus helping to promote autonomy in their learning; (d) tests can provide a degree of periodic closure to particular units, while giving students a sense of accomplishment and mastery of the specified content area; (e) tests can assist teachers in evaluating their own effectiveness; and (f) tests can foster retention of the particular content area by way of the feedback they give regarding the students' level of mastery.
In short, it is evident that assessment tests can be invaluable components of the L2 curriculum. Such tests foster the overall language development of the students, while simultaneously providing teachers with critical information regarding the students' mastery of the specified instructional domain. Achievement tests can be administered in the second language classroom without jeopardizing the interactive, communicative focus of the L2 classroom that teachers value so greatly.

1.2 Qualities of a good test
1.2.1 Test reliability
A fundamental concern in the development and use of language tests is their reliability, that is, the stability of the test as a measure. Reliability refers to the consistency of examination scores: the extent to which the test produces consistent results if different markers mark it. Put another way, a test is reliable if it is consistent within itself and across time (Alderson, Clapham, & Wall, 1995:6).
In the same vein, Bachman (1990:160) noted that reliable test scores should be free from measurement errors caused both by outside factors, such as testing conditions, and by subjective factors, such as tiredness or anxiety.

Communicative
language ability

TEST SCORE
Random
factors

Test method
facets

Personal
attributes


20


Figure 1.1: Factors that affect language test scores
(Bachman, 1990:165)

Figure 1.1 shows the effects of various factors on a test score. In this type of diagram, rectangles represent observed variables, such as test scores; ovals represent unobserved variables, or hypothesized factors; and straight lines represent hypothesized causal relationships. The result of the effects of all these factors is that whenever individuals take a language test, they are not all likely to perform equally well, and so their scores vary.
Heaton (1988) named two factors affecting the reliability of a test: the extent of the sample of material selected for testing and the administration of the test. To measure the reliability of a test, he suggested re-administering the same test after a lapse of time, or administering parallel forms of the test to the same group. He noted that in the first case it is assumed that all candidates have been treated in the same way in the interval, that is, that they have either all been taught or that none of them has. Provided that such assumptions can be made, a comparison of the two results shows how reliable the test has proved. In the second case, the parallel forms of the test should be identical in the nature of their sampling, difficulty, length, rubrics, etc. If the correlation between the two tests is high, the test can be termed reliable.
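As a rough illustration of the parallel-forms procedure Heaton describes, the sketch below correlates two administrations of a test to the same group. The scores and the 0.7 cut-off for a "high" correlation are invented for illustration; neither figure comes from Heaton.

```python
import numpy as np

# Invented scores for the same ten candidates on two parallel forms.
form_a = np.array([4.5, 6.0, 7.5, 5.0, 8.0, 3.5, 9.0, 6.5, 5.5, 7.0])
form_b = np.array([5.0, 6.5, 7.0, 4.5, 8.5, 4.0, 8.5, 6.0, 5.0, 7.5])

# np.corrcoef returns a 2x2 correlation matrix; entry [0, 1] is r(A, B).
r = np.corrcoef(form_a, form_b)[0, 1]
print(f"parallel-forms correlation r = {r:.3f}")

# Assumed rule of thumb for this sketch: treat r >= 0.7 as "high".
print("test can be termed reliable" if r >= 0.7 else "reliability is questionable")
```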
1.2.2 Practicality
Practicality is defined as "the relationship between the resources that will be required in the design, development, and use of the test and the resources that will be available for these activities" (Bachman and Palmer, 1996). This relationship can be represented as follows:

Practicality = Available resources / Required resources

If practicality ≥ 1, the test development and use is practical.
If practicality < 1, the test development and use is not practical.

Figure 1.2: Practicality (Bachman & Palmer, 1996)
Practicality is a matter of the extent to which the demands of the particular test specifications can be met within the limits of existing resources. If the resource demands of the test specifications do not exceed the available resources at any stage in test development, then the test is practical and development and test use can proceed. If the available resources are exceeded, then the test is not practical, and the developer must either modify the specifications to reduce the resources required, or increase the available resources or reallocate them so that they can be utilized more efficiently. Thus, a practical test is one whose design, development, and use do not require more resources than are available (Bachman & Palmer, 1996:36). This idea is similar to that of Harrison (1983:13), who pointed out that a test should be as economical as possible in time (preparation, sitting, and marking) and in cost (materials and hidden costs of time spent).
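The ratio in Figure 1.2 is easy to operationalize. The following minimal sketch, with invented resource figures (marker-hours), simply applies Bachman and Palmer's formula and the decision rule above:

```python
def practicality(available: float, required: float) -> float:
    """Bachman & Palmer's ratio: available resources / required resources."""
    return available / required

# Hypothetical figures: 120 marker-hours available, 150 required by the specs.
ratio = practicality(available=120.0, required=150.0)
print(f"practicality = {ratio:.2f}")
if ratio >= 1:
    print("practical: development and test use can proceed")
else:
    print("not practical: modify the specifications or increase/reallocate resources")
```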
1.2.3 Comparison and Discrimination
In a sense, all assessment is based on comparison: between one student and another, between the student as he is now and as he was earlier, or between the student's capability and the task the test requires him to perform. It is also important for assessment to have the capacity to discriminate among the different candidates and to reflect the differences in the performances of the individuals in the group (Heaton, 1988:165).
Also according to Heaton (1988:167), the differences can be seen in the spread of test scores. Briefly, the items in the test should be spread over a wide range of difficulty levels, as follows:
- extremely easy items
- very easy items
- easy items
- fairly easy items
- items below average difficulty level
- items above average difficulty level
- fairly difficult items
- difficult items
- very difficult items
- extremely difficult items

As for Harrison (1983:14), discrimination was defined as the "extent to which a test separates the students from each other". However, the extent of the need to discriminate will vary depending on the purpose of the test: whether to check students' mastery of the syllabus or to locate areas of difficulty for students.
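To see how spread and separation can be checked in practice, here is a minimal sketch of classical item analysis (the facility and discrimination indices taken up again in section 1.3.3). The response matrix, the difficulty-band cut-offs, and the top/bottom-third grouping are all this sketch's own assumptions, not Heaton's or Harrison's figures:

```python
import numpy as np

# Invented 0/1 response matrix: rows are 12 candidates, columns are 4 items.
responses = np.array([
    [1, 1, 0, 1], [1, 1, 1, 1], [0, 1, 0, 0], [1, 0, 0, 1],
    [1, 1, 1, 0], [0, 0, 0, 0], [1, 1, 0, 1], [1, 1, 1, 1],
    [0, 1, 0, 0], [1, 0, 0, 0], [1, 1, 1, 1], [0, 1, 0, 1],
])

# Item facility: the proportion of candidates answering each item correctly.
facility = responses.mean(axis=0)

# Discrimination: facility among the top scorers minus facility among the
# bottom scorers (top/bottom thirds here; the grouping is a design choice).
totals = responses.sum(axis=1)
order = np.argsort(totals)
n = len(totals) // 3
discrimination = responses[order[-n:]].mean(axis=0) - responses[order[:n]].mean(axis=0)

for i, (f, d) in enumerate(zip(facility, discrimination), start=1):
    # Assumed cut-offs for labelling Heaton-style difficulty bands.
    band = "easy" if f > 0.7 else "average" if f >= 0.3 else "difficult"
    print(f"item {i}: facility = {f:.2f} ({band}), discrimination = {d:.2f}")
```

A test whose items all cluster at one facility value would separate students poorly; a spread of difficulty bands with positive discrimination indices is what Heaton's list aims at.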
1.2.4 Test validity
The primary concern in test development and use is demonstrating not only that test scores are reliable, but that the interpretations and uses we make of test scores are valid. In examining validity, we look beyond the reliability of the test scores themselves and consider the relationships between test performance and other types of performance in other contexts. The types of performance and contexts we select for investigation are determined by the uses or interpretations we wish to make of the test results.
Alderson et al. (1995:6) stated that "validity is the extent to which a test measures what it is intended to measure: it relates to the uses made of test scores and the ways in which test scores are interpreted, and is therefore always relative to test purpose". Chapelle (1999:258) shared the same viewpoint: "Validity is considered an argument concerning test interpretation and use: the extent to which test interpretations and uses can be justified."
In test validation, we are not examining the validity of the test content, or even of the test scores themselves, but rather the validity of the way we interpret or use the information gathered through the testing procedure (Bachman, 1990).
1.2.4.1 Types of test validity
Authors in language testing have divided validity into subtypes differently. Hughes (1989) divided validity into four types: content validity, criterion-related validity, construct validity, and face validity. Alderson et al. (1995:171-183) classified validity into three types: internal validity, which has three subtypes (face validity, content validity, and response validity); external validity, which has two subtypes (concurrent validity and predictive validity); and construct validity.



However, it has been traditional to classify validity into different types, such as content, criterion, and construct validity, as Alderson et al. and Hughes did. Measurement specialists have come to view these as aspects of a unitary concept of validity that subsumes all of them. This unitary view of validity has also been clearly endorsed by the measurement profession as a whole in the most recent revision of the Standards for Educational and Psychological Testing:

Validity ... is a unitary concept. Although evidence may be accumulated in many ways, validity always refers to the degree to which that evidence supports the inferences that are made from the scores. The inferences regarding specific uses of a test are validated, not the test itself. (American Psychological Association, 1985:9)

It is still necessary to gather information about content relevance, predictive utility, and concurrent criterion relatedness in the process of developing a given test. However, it is important to recognize that none of these by itself is sufficient to demonstrate the validity of a particular interpretation or use of test scores.
In this study I employ Hughes's classification, as it is simple and clear.
1.2.4.1.1 Face validity
Face validity is the degree to which a test appears valid to the examinees who take it, the personnel who administer it, and other untrained observers. According to Alderson et al. (1995:172), "essentially face validity involves an intuitive judgement about the test's content by people whose judgement is not necessarily expert". Hughes (1989:27) put it similarly: "A test is said to have face validity if it looks as if it measures what it is supposed to measure."
Face validity is not validity in a technical sense; just because a test looks valid does not mean it is.
Face validity is nevertheless considered important. J.B. Heaton (1988:160) said that a test with good face validity can help maintain students' motivation because it makes them try harder; conversely, students may not put maximum effort into a test that does not look sound in their eyes. Hughes (1989:27) agreed with Heaton's idea, pointing out that a test which lacks face validity may not be welcomed by candidates, teachers, education authorities, and employers, and therefore may not be used. If it is used anyway, candidates may perform in a way that does not reflect their true ability.
1.2.4.1.2 Construct validity
A test has construct validity if it accurately measures a theoretical, non-observable construct or trait. The construct validity of a test is worked out over a period of time on the basis of an accumulation of evidence. J.B. Heaton (1988:161) noted that "if a test has construct validity, it is capable of measuring certain specific characteristics in accordance with a theory of language behavior and learning. This type of validity assumes the existence of certain learning theories or constructs underlying the acquisition of abilities and skills." Bachman (1990:254-5) put it more simply: there should be consistency between performance on tests and the predictions we make on the basis of a theory of abilities, or constructs. In his later publication with Palmer, Bachman (1996:21) advocated this view by saying that the interpretation of a given test score can serve as an indicator of the ability or construct we want to measure.

SCORE INTERPRETATION:

Inferences about
Domain of
language ability
generalization
(Construct definition)

construct
validity

Language ability

TEST SCORE

Interactiveness

authenticity

Characteristics of
the test task


25

Figure 1.3: Construct validity of score interpretations
( Bachman & Palmer, 1996:22)
The figure above indicates that test scores are to be interpreted appropriately as indicators
of the ability we intend to measure with respect to a specific domain of generalization.
1.2.4.1.3 Content validity
A test has content validity if it measures knowledge of the content domain it was designed to measure. Put another way, content validity is primarily concerned with "what goes into the test" (Harrison, 1983); that is, with the adequacy with which the test items representatively sample the content area to be measured. For example, a comprehensive math achievement test would lack content validity if good scores depended primarily on knowledge of English, or if it only had questions about one aspect of math (e.g., algebra).
According to J.B. Heaton (1988:160), "this kind of validity depends on a careful analysis of the language being tested and of the particular course objectives". This is to say that the test should be constructed so as to contain a representative sample of the course, and that the relationship between the test items and the course objectives should be apparent. Hughes (1989) asserted that in order to judge whether or not a test has content validity, the test specification made at an early stage of test construction should be taken into consideration: a comparison of test specification and test content is the basis for a judgment as to content validity.
Alderson et al. (1995:173) argued that "content validation involves gathering the judgement of experts: people whose judgement one is prepared to trust, even if it disagrees with one's own". This means that expert judgments (not statistics) are the primary method used to determine whether a test has content validity. Nevertheless, the test should also have a high correlation with other tests that purport to sample the same content domain.
Here we can see the most important distinction between content and face validity: in face validation, the judgment of others is not necessarily expert, while in content validation judgments from experts are gathered and trusted.

1.2.4.1.4 Criterion-related validity
Another approach to test validity is to see how far results on the test agree with those provided by some independent and highly dependable assessment of the candidate's ability. This independent assessment is thus the criterion measure against which the test is validated. According to Hughes (1989:23-5), there are essentially two kinds of criterion-related validity: concurrent validity and predictive validity.
1.2.4.1.4.1 Concurrent validity
Concurrent validity is established when the test and the criterion are administered at about the same time (Hughes, 1989:23). It refers to how well scores on a new test correspond to scores obtained on other, previously validated measures of the same skills.
According to Bachman (1990:248), information on concurrent criterion relatedness takes one of two forms: (1) examining differences in test performance among groups of individuals at different levels of language ability, or (2) examining correlations among various measures of a given ability.
Alderson et al. (1995:177) stated:

...concurrent validation involves the comparison of the test scores with some other measure for the same candidates taken at roughly the same time as the test. This other measure may be scores from a parallel version of the same test or from some other test; the candidates' self-assessments of their language abilities; or ratings of the candidate on relevant dimensions by teachers, subject specialists or other informants.... The results of the comparison are usually expressed as a correlation coefficient, ranging in value from -1.0 to +1.0. Most concurrent validity coefficients range from +.5 to +.7; higher coefficients are possible for closely related and reliable tests, but unlikely for measures like self-assessments or teacher assessments.

Bachman (1990:290) pointed out that demonstrating criterion relatedness consists of identifying an appropriate criterion behavior (another language test, or other observed language use) and then demonstrating that scores on the test are functionally related to this criterion. The criterion may occur nearly simultaneously with the test, in which case we can speak of concurrent relatedness (concurrent validity). He also asserted that the major consideration in collecting evidence of criterion relatedness is determining the appropriateness of the criterion; that is, we must ensure that the criterion is itself a valid indicator of the same abilities measured by the test in question.
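Expressed as code, Alderson et al.'s guideline range can be turned into a rough reading of a concurrent validity coefficient. The band labels below are this sketch's own wording; only the +.5 to +.7 range comes from the quotation above:

```python
def interpret_concurrent_validity(r: float) -> str:
    """Rough reading of a concurrent validity coefficient against the
    +.5 to +.7 range reported by Alderson et al. (1995:177)."""
    if r >= 0.7:
        return "above the typical range: plausible only for closely related, reliable tests"
    if r >= 0.5:
        return "within the typical +.5 to +.7 range for concurrent validation"
    return "below the typical range: weak evidence of concurrent validity"

# The coefficient later reported in Figure 3.1 for the two NSSLE tests.
print(interpret_concurrent_validity(0.977))
```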
1.2.4.1.4.2 Predictive validity
Predictive validity refers to the correlation between scores obtained on a measure, such as a proficiency test, and the language performance of the students when they use the language in the real world. In predictive validation, the predictor scores are collected first and the criterion data are collected at some later point. This is appropriate for tests designed to assess a person's future status on a criterion. One consideration in examining predictive utility is determining the importance of the test score as a predictor relative to a variety of other factors (Bachman, 1990:290).
1.2.4.2 Reasons for giving more emphasis to concurrent validity
There are several reasons why the author investigated the concurrent validity of the two English tests in the NSSLE.
Firstly, information on concurrent criterion relatedness is undoubtedly the most commonly used in language testing. If we can identify groups of individuals that are at different levels of the ability in which we are interested, we can investigate the degree to which a test of this ability accurately discriminates between these groups of individuals.
Also, it is of great interest to know whether a new test correlates with some standardized test. Concurrent validity is estimated in this study because there appears to be a validated equivalent measure that could act as a concurrent test in the present setting.
Besides, it is essential to show concurrent validity between two tests claiming to measure the same thing. Other types of validity, such as face validity (considered the least scientific method), content validity (considered difficult to measure in the social and educational sciences), and construct validity (regarded as a rather abstract concept), were not investigated in this study.
1.3 Statistical analysis of test results
According to Alderson et al. (1995), the following aspects should be taken into consideration when analyzing test scores: correlation, descriptive statistics, and classical item analysis.
1.3.1 Correlation
Correlation refers to the extent to which two sets of results agree with each other. A correlation of +1 is a perfect positive correlation, whereas a correlation of -1 is a perfect negative correlation. A correlation of 0 means there is no correlation between the two sets of scores.
Correlation can be shown through a scatter plot (an X-Y chart). Each variable is drawn on one axis of the graph, with X plotted on the vertical axis and Y plotted on the horizontal axis, or vice versa. Each X-Y pair of numbers is then graphed as a single point where the two values intersect.
The information above can be summarized as in the table below:

rho = +1.0   Strong (perfect positive):  as X goes up, Y always goes up
rho = +0.5   Weak positive:              as X goes up, Y usually tends to go up
rho = 0      No correlation:             X and Y are not correlated
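As a concrete illustration, the sketch below computes the Pearson correlation coefficient from its standard formula and draws the scatter plot described above. The two score lists are invented for illustration; they are not the study's data:

```python
import math
import matplotlib.pyplot as plt

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two sets of scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Invented scores for the same six candidates on two tests.
test_one = [5.0, 6.5, 8.0, 4.5, 7.0, 9.0]
test_two = [5.5, 6.0, 8.5, 4.0, 7.5, 8.5]
print(f"rho = {pearson(test_one, test_two):+.3f}")  # close to +1: strong positive

# Scatter plot: one point per X-Y pair of scores.
plt.scatter(test_one, test_two)
plt.xlabel("Scores on test one")
plt.ylabel("Scores on test two")
plt.title("Scatter plot of paired test scores")
plt.show()
```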

