
VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF LANGUAGES AND INTERNATIONAL STUDIES
******

NGUYỄN THỊ QUỲNH YẾN

DOCTORAL DISSERTATION

AN INVESTIGATION INTO THE CUT-SCORE VALIDITY
OF THE VSTEP.3-5 LISTENING TEST

MAJOR: ENGLISH LANGUAGE TEACHING METHODOLOGY
CODE: 9140231.01

HANOI, 2018


VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF LANGUAGES AND INTERNATIONAL STUDIES
******

NGUYỄN THỊ QUỲNH YẾN

DOCTORAL DISSERTATION

AN INVESTIGATION INTO THE CUT-SCORE VALIDITY
OF THE VSTEP.3-5 LISTENING TEST
(Nghiên cứu xác trị các điểm cắt của kết quả bài thi Nghe
Đánh giá năng lực tiếng Anh từ bậc 3 đến bậc 5 theo
Khung năng lực Ngoại ngữ 6 bậc dành cho Việt Nam)


MAJOR: ENGLISH LANGUAGE TEACHING METHODOLOGY
CODE: 9140231.01

SUPERVISORS:

1. PROF. NGUYỄN HÒA
2. PROF. FRED DAVIDSON

HANOI, 2018


This dissertation was completed at the University of Languages and
International Studies, Vietnam National University, Hanoi.

This dissertation was defended on 10th May 2018.

This dissertation can be found at:
- National Library of Vietnam
- Library and Information Center - Vietnam National University, Hanoi



DECLARATION OF AUTHORSHIP

I hereby certify that the thesis I am submitting is entirely my own original
work except where otherwise indicated. I am aware of the University's
regulations concerning plagiarism, including those regulations concerning
disciplinary actions that may result from plagiarism. Any use of the works of
any other author, in any form, is properly acknowledged at their point of use.


Date of submission:

_____________________________

Ph.D. Candidate’s Signature:

_____________________________



I certify that I have read this dissertation and that, in my opinion, it
is fully adequate in scope and quality as a dissertation for the degree
of Doctor of Philosophy.

_____________________________________________

Prof. Nguyễn Hòa
(Supervisor)

I certify that I have read this dissertation and that, in my opinion, it
is fully adequate in scope and quality as a dissertation for the degree
of Doctor of Philosophy.

_____________________________________________

Prof. Fred Davidson
(Co-supervisor)




TABLE OF CONTENTS
LIST OF FIGURES……………………………………………………………………….. viii
LIST OF TABLES………………………………………………………………………… ix
LIST OF KEY TERMS……………………………………………………………………. xiii
ABSTRACT……………………………………………………………………………….. xvii
ACKNOWLEDGMENTS…………………………………………………………………. xix
CHAPTER I: INTRODUCTION………………………………………………………... 1
1. Statement of the problem………………………………………………………………... 1
2. Objectives of the study………………………………………………………………….. 4
3. Significance of the study………………………………………………………………... 4
4. Scope of the study……………………………………………………………………….. 4
5. Statement of research questions………………………………………………………… 5
6. Organization of the study……………………………………………………………….. 5
CHAPTER II: LITERATURE REVIEW………………………………………………. 7
1. Validation in language testing…………………………………………………………... 7
   1.1. The evolution of the concept of validity……………………………………………. 7
   1.2. Aspects of validity…………………………………………………………………... 9
   1.3. Argument-based approach to validation……………………………………………. 11
2. Standard setting for an English proficiency test………………………………………… 15
   2.1. Definition of standard setting………………………………………………………. 15
   2.2. Overview of standard setting methods……………………………………………... 17
   2.3. Common elements in standard setting……………………………………………... 21
      2.3.1. Selecting a standard-setting method…………………………………………… 21
      2.3.2. Choosing a standard setting panel……………………………………………... 23
      2.3.3. Preparing performance-level descriptors……………………………………… 24
      2.3.4. Training panelists………………………………………………………………. 24
      2.3.5. Providing feedback to panelists………………………………………………... 26
      2.3.6. Compiling ratings and obtaining cut scores…………………………………… 27
      2.3.7. Evaluating standard setting…………………………………………………….. 27
   2.4. Evaluating standard setting…………………………………………………………. 28
      2.4.1. Procedural evidence……………………………………………………………. 30
      2.4.2. Internal evidence……………………………………………………………….. 32
      2.4.3. External evidence………………………………………………………………. 32
         2.4.3.1. Comparisons to other standard-setting methods…………………………... 33
         2.4.3.2. Comparisons to other sources of information……………………………... 33
         2.4.3.3. Reasonableness of cut scores………………………………………………. 34
3. Testing listening…………………………………………………………………………. 34
   3.1. Communicative language testing…………………………………………………… 34
   3.2. Listening construct………………………………………………………………….. 36
4. Statistical analysis for a language test…………………………………………………... 42
   4.1. Statistical analysis of multiple choice (MC) items…………………………………. 42
   4.2. Investigating reliability of a language test………………………………………….. 46
5. Review of validation studies…………………………………………………………….. 49
   5.1. Review of validation studies on standard setting…………………………………... 49
   5.2. Review of studies employing argument-based approach in validating language tests…………………………………………………………………………………… 52
6. Summary………………………………………………………………………………… 60
CHAPTER III: METHODOLOGY……………………………………………………... 61
1. Context of the study……………………………………………………………………... 61
   1.1. About the VSTEP.3-5 test…………………………………………………………... 61
      1.1.1. The development history of the VSTEP.3-5 test………………………………. 61
      1.1.2. The administration of the VSTEP.3-5 test in Vietnam………………………… 62
      1.1.3. Test takers……………………………………………………………………… 62
      1.1.4. Test structure and scoring rubrics……………………………………………… 62
      1.1.5. The establishment of the cut scores……………………………………………. 63
   1.2. About the VSTEP.3-5 listening test………………………………………………… 64
      1.2.1. Test purpose……………………………………………………………………. 64
      1.2.2. Test format……………………………………………………………………... 64
      1.2.3. Performance standards…………………………………………………………. 64
      1.2.4. The establishment of the cut scores of the VSTEP.3-5 listening test…………. 68
2. Building an interpretive argument for the VSTEP.3-5 listening test…………………… 68
3. Methodology…………………………………………………………………………….. 70
   3.1. Research questions………………………………………………………………….. 70
   3.2. Description of methods of the study………………………………………………... 71
      3.2.1. Analysis of the test tasks and test items………………………………………... 72
         3.2.1.1. Analysis of test tasks……………………………………………………….. 72
         3.2.1.2. Analysis of test items………………………………………………………. 73
      3.2.2. Analysis of test reliability………………………………………………………. 75
      3.2.3. Validation of cut-scores………………………………………………………… 76
         3.2.3.1. Procedural…………………………………………………………………... 76
         3.2.3.2. Internal……………………………………………………………………… 76
         3.2.3.3. External……………………………………………………………………... 77
   3.3. Description of Bookmark standard setting procedures…………………………….. 78
   3.4. Selection of participants of the study………………………………………………. 81
      3.4.1. Test takers of the early 2017 administration…………………………………… 81
      3.4.2. Participants for the Bookmark standard setting method………………………. 82
   3.5. Descriptions of tools for data analysis……………………………………………… 83
      3.5.1. Text analyzing tools……………………………………………………………. 83
         3.5.1.1. English Profile……………………………………………………………… 83
         3.5.1.2. Readable.io…………………………………………………………………. 84
      3.5.2. Speech rate analyzing tool……………………………………………………… 84
      3.5.3. Statistical analyzing tools………………………………………………………. 85
         3.5.3.1. WINSTEPS (3.92.1)………………………………………………………... 85
         3.5.3.2. Iteman 4.3…………………………………………………………………... 86
4. Summary………………………………………………………………………………… 87
CHAPTER IV: DATA ANALYSIS……………………………………………………... 89
1. Analysis of the test tasks and test items………………………………………………… 89
   1.1. Analysis of the test tasks……………………………………………………………. 89
      1.1.1. Characteristics of the test rubric……………………………………………….. 89
      1.1.2. Characteristics of the input…………………………………………………….. 94
      1.1.3. Relationship between the input and response…………………………………. 102
   1.2. Analysis of the test items…………………………………………………………… 102
      1.2.1. Overall statistics of item difficulty and item discrimination………………….. 102
      1.2.2. Item analysis……………………………………………………………………. 107
2. Analysis of the test reliability…………………………………………………………… 128
3. Analysis of the cut-scores………………………………………………………………. 130
   3.1. Procedural evidence………………………………………………………………… 130
   3.2. Internal evidence……………………………………………………………………. 131
   3.3. External evidence…………………………………………………………………… 132
CHAPTER V: FINDINGS AND DISCUSSIONS……………………………………… 145
1. The characteristics of the test tasks and test items……………………………………… 145
2. The reliability of the VSTEP.3-5 listening test…………………………………………. 151
3. The accuracy of the cut scores of the VSTEP.3-5 listening test……………………….. 151
CHAPTER VI: CONCLUSION…………………………………………………………. 154
1. Overview of the thesis…………………………………………………………………... 154
2. Contributions of the study………………………………………………………………. 157
3. Limitations of the study…………………………………………………………………. 158
4. Implications of the study………………………………………………………………... 158
5. Suggestions for further research………………………………………………………… 159
LIST OF THESIS-RELATED PUBLICATIONS…………………………………………. 161
REFERENCES……………………………………………………………………………... 162
APPENDIX 1: Structure of the VSTEP.3-5 test…………………………………………… 172
APPENDIX 2: Summary of the directness and interactiveness between the texts and the questions of the VSTEP.3-5 listening test…………………………………………………. 174
APPENDIX 3: Consent form (workshops)………………………………………………… 177
APPENDIX 4: Agenda for Bookmark standard-setting procedure………………………... 179
APPENDIX 5: Panelist recording form……………………………………………………. 180
APPENDIX 6: Evaluation form for standard-setting participants………………………… 181
APPENDIX 7: Control file for WINSTEPS……………………………………………….. 183
APPENDIX 8: Timeline of the VSTEP.3-5 test administration…………………………… 185
APPENDIX 9: List of the VSTEP.3-5 developers………………………………………… 186


LIST OF FIGURES
Figure 2.1: Model of Toulmin’s argument structure (1958, 2003)………………………… 12
Figure 2.2: Sources of variance in test scores (Bachman, 1990)…………………………... 47
Figure 2.3: Overview of interpretive argument for ESL writing course placements………. 57
Figure 4.1: Item map of the VSTEP.3-5 listening test……………………………………... 105
Figure 4.2: Graph for item 2……………………………………………………………….. 108
Figure 4.3: Graph for item 3……………………………………………………………….. 110
Figure 4.4: Graph for item 6……………………………………………………………….. 112
Figure 4.5: Graph for item 13……………………………………………………………… 115
Figure 4.6: Graph for item 14……………………………………………………………… 117
Figure 4.7: Graph for item 15……………………………………………………………… 119
Figure 4.8: Graph for item 19……………………………………………………………… 121
Figure 4.9: Graph for item 20……………………………………………………………… 123
Figure 4.10: Graph for item 28…………………………………………………………….. 125
Figure 4.11: Graph for item 34…………………………………………………………….. 126
Figure 4.12: Total score for the scored items……………………………………………… 129


LIST OF TABLES
Table 2.1: Review of standard-setting methods (Hambleton & Pitoniak, 2006)…………... 21
Table 2.2: Standard setting evaluation elements (Cizek & Bunch, 2007)…………………. 30
Table 2.3: Common steps required for standard setting (Cizek & Bunch, 2007)………….. 32
Table 2.4: A framework for defining listening task characteristics (Buck, 2001)…………. 38
Table 2.5: Criteria for item selection and interpretation of item difficulty index………….. 44
Table 2.6: Criteria for item selection and interpretation of item discrimination index……. 46
Table 2.7: General guideline for interpreting test reliability (Bachman, 2004)……………. 48
Table 2.8: Number of proficiency levels & test reliability…………………………………. 48
Table 2.9: Summary of the warrant and assumptions associated with each inference in the TOEFL interpretive argument (Chapelle et al., 2008)……………………………………... 56
Table 3.1: Structure of the VSTEP.3-5 test………………………………………………… 63
Table 3.2: The cut scores of the VSTEP.3-5 test…………………………………………... 63
Table 3.3: Performance standard of Overall Listening Comprehension (CEFR: learning, teaching, assessment)………………………………………………………………………. 65
Table 3.4: Performance standard of Understanding conversation between native speakers (CEFR: learning, teaching, assessment)……………………………………………………. 66
Table 3.5: Performance standard of Listening as a member of a live audience (CEFR: learning, teaching, assessment)…………………………………………………………….. 66
Table 3.6: Performance standard of Listening to announcements and instructions (CEFR: learning, teaching, assessment)…………………………………………………………….. 67
Table 3.7: Performance standard of Listening to audio media and recordings (CEFR: learning, teaching, assessment)…………………………………………………………….. 67
Table 3.8: The cut scores of the VSTEP.3-5 test…………………………………………... 68
Table 3.9: Criteria for item selection and interpretation of item difficulty index………….. 74
Table 3.10: Criteria for item selection and interpretation of item discrimination index…... 75
Table 3.11: Number of proficiency levels & test reliability……………………………….. 76
Table 3.12: The venue for Angoff and Bookmark standard setting method………………. 77
Table 3.13: Comparison between the Flesch-Kincaid readability analysis and the CEFR and IELTS grading systems………………………………………………………………… 85
Table 3.14: Summary of the interpretative argument for the interpretation and use of the VSTEP.3-5 listening cut-scores……………………………………………………………. 88
Table 4.1: General instruction of the VSTEP.3-5 listening test…………………………… 90
Table 4.2: Instruction for Part 1……………………………………………………………. 91
Table 4.3: Instruction for Part 2……………………………………………………………. 92
Table 4.4: Instruction for Part 3……………………………………………………………. 93
Table 4.5: Information provided in the specifications for the VSTEP.3-5 listening test….. 94
Table 4.6: Summary of the texts for items 1-8…………………………………………….. 96
Table 4.7: Description of language levels for texts of items 1-8 in the specification……... 97
Table 4.8: Summary of the texts for items 9-20…………………………………………… 98
Table 4.9: Description of language levels for texts of items 9-20 in the specification……. 99
Table 4.10: Summary of the texts for items 21-35………………………………………… 100
Table 4.11: Description of language levels for texts of items 21-35 in the specification…. 101
Table 4.12: Summary of item discrimination and item difficulty…………………………. 104
Table 4.13: Summary statistics for the flagged items……………………………………… 106
Table 4.14: Information for item 2…………………………………………………………. 108
Table 4.15: Item statistics for item 2……………………………………………………….. 109
Table 4.16: Option statistics for item 2…………………………………………………….. 109
Table 4.17: Quantile plot data for item 2…………………………………………………... 109
Table 4.18: Information for item 3…………………………………………………………. 110
Table 4.19: Item statistics for item 3……………………………………………………….. 110
Table 4.20: Option statistics for item 3…………………………………………………….. 111
Table 4.21: Quantile plot data for item 3…………………………………………………... 111
Table 4.22: Information for item 6…………………………………………………………. 112
Table 4.23: Item statistics for item 6……………………………………………………….. 112
Table 4.24: Option statistics for item 6…………………………………………………….. 113
Table 4.25: Quantile plot data for item 6…………………………………………………... 113
Table 4.26: Information for item 13………………………………………………………... 115
Table 4.27: Item statistics for item 13……………………………………………………… 115
Table 4.28: Option statistics for item 13…………………………………………………… 116
Table 4.29: Quantile plot data for item 13…………………………………………………. 116
Table 4.30: Information for item 14………………………………………………………... 118
Table 4.31: Item statistics for item 14……………………………………………………… 118
Table 4.32: Option statistics for item 14…………………………………………………… 118
Table 4.33: Quantile plot data for item 14…………………………………………………. 118
Table 4.34: Information for item 15………………………………………………………... 120
Table 4.35: Item statistics for item 15……………………………………………………… 120
Table 4.36: Option statistics for item 15…………………………………………………… 120
Table 4.37: Quantile plot data for item 15…………………………………………………. 120
Table 4.38: Information for item 19………………………………………………………... 121
Table 4.39: Item statistics for item 19……………………………………………………… 121
Table 4.40: Option statistics for item 19…………………………………………………… 122
Table 4.41: Quantile plot data for item 19…………………………………………………. 122
Table 4.42: Information for item 20………………………………………………………... 123
Table 4.43: Item statistics for item 20……………………………………………………… 123
Table 4.44: Option statistics for item 20…………………………………………………… 124
Table 4.45: Quantile plot data for item 20…………………………………………………. 124
Table 4.46: Information for item 28………………………………………………………... 125
Table 4.47: Item statistics for item 28……………………………………………………… 125
Table 4.48: Option statistics for item 28…………………………………………………… 125
Table 4.49: Quantile plot data for item 28…………………………………………………. 126
Table 4.50: Information for item 34………………………………………………………... 127
Table 4.51: Item statistics for item 34……………………………………………………… 127
Table 4.52: Option statistics for item 34…………………………………………………… 127
Table 4.53: Quantile plot data for item 34…………………………………………………. 127
Table 4.54: Summary of statistics………………………………………………………….. 129
Table 4.55: Test reliability…………………………………………………………………. 129
Table 4.56: The person reliability and item reliability of the test…………………………. 130
Table 4.57: Number of proficiency levels and test reliability……………………………… 131
Table 4.58: The test reliability of the VSTEP.3-5 listening test…………………………… 132
Table 4.59: Order of items in the booklet………………………………………………….. 133
Table 4.60: Summary of output from Round 1 of the Bookmark standard-setting procedure……………………………………………………………………………………. 135
Table 4.61: Conversion table……………………………………………………………….. 136
Table 4.62: Summary of statistics in raw score metric for Round 1………………………. 137
Table 4.63: Summary of output from Round 2 of the Bookmark standard-setting procedure……………………………………………………………………………………. 139
Table 4.64: Round 3 feedback for the Bookmark standard-setting procedure…………….. 141
Table 4.65: Summary of output from Round 3 of the Bookmark standard-setting procedure……………………………………………………………………………………. 143
Table 4.66: The cut scores set for the VSTEP.3-5 listening test by the Bookmark method……………………………………………………………………………………… 144
Table 4.67: The cut scores set for the VSTEP.3-5 listening test by the Angoff method…... 144
Table 4.68: Comparison between the results of the two standard-setting methods……….. 144


LIST OF KEY TERMS

Construct: A construct refers to the knowledge, skill or ability that is being tested.
In a more technical and specific sense, it refers to a hypothesized ability or mental
trait which cannot necessarily be directly observed or measured, for example,
listening ability. Language tests attempt to measure the different constructs which
underlie language ability.
Cut score: A score that represents achievement of the criterion, the line between
success and failure, mastery and non-mastery.
Descriptor: A brief description accompanying a band on a rating scale, which
summarizes the degree of proficiency or type of performance expected for a test
taker to achieve that particular score.
Distractor: The incorrect options in multiple-choice items.
Expert panel: A group of target language experts or subject matter experts who
provide comments about a test.
High-stakes test: A high-stakes test is any test used to make important decisions
about test takers.
Inference: A conclusion that is drawn about something based on evidence and
reasoning.
Input: Input material provided in a test task for the test taker to use in order to
produce an appropriate response.
Interpretive argument: Statements that specify the interpretation and use of the
test performances in terms of the inferences and assumptions used to get from a
person’s test performance to the conclusions and decisions based on the test results.
Item (also, test item): Each testing point in a test which is given a separate score or
scores. Examples are: one gap in a cloze test; one multiple-choice question with
three or four options; one sentence for grammatical transformation; one question to
which a sentence-length response is expected.
Key: The correct option or response to a test item.
Multiple-choice item: A type of test item which consists of a question or
incomplete sentence (stem), with a choice of answers or ways of completing the
sentence (options). The test taker’s task is to choose the correct option (key) from a
set of possibilities. There may be any number of incorrect possibilities (distractors).
Options: The range of possibilities in a multiple-choice item or matching tasks
from which the correct one (key) must be selected.
Panelist: A target language expert or subject matter expert who provides comments
about a test.
Performance level description: Brief operational definitions of the specific
knowledge, skills, or abilities that are expected of examinees whose performance on
a test results in their classification into a certain performance level; elaborations of
the achievement expectations connoted by performance level labels.
Performance level label: A hierarchical group of single words or short phrases that
are used to label the two or more performance categories created by the application
of cut scores to examinee performance on a test.
Performance standard: The abstract conceptualization of the minimum level of
performance distinguishing examinees who possess an acceptable level of
knowledge, skill, or ability judged necessary to be assigned to a category, or for
some other specific purpose, and those who do not possess that level. This term is
sometimes used interchangeably with cut score.
Proficiency test: A test which measures how much of a language someone has
learned. Proficiency tests are designed to measure the language ability of examinees
regardless of how, when, why, or under what circumstances they may have
experienced the language.


Readability: Readability is the ease with which a reader can understand a written
text. The readability of text depends on its content (the complexity of its vocabulary
and syntax) and its presentation (such as typographic aspects like font size, line
height, and line length).
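To illustrate how a readability measure of this kind is computed, the following is a minimal Python sketch of the widely used Flesch-Kincaid grade-level formula (one of the indices reported by readability tools such as Readable.io, used later in this study); the counts in the example are hypothetical.

```python
def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid grade level computed from raw counts for a text."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# Hypothetical passage: 120 words, 8 sentences, 170 syllables
print(round(flesch_kincaid_grade(120, 8, 170), 1))  # -> 7.0 (about grade 7)
```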
Reliability: The reliability of a test is the degree to which it measures consistently,
reflecting the consistency of scoring and the accuracy of the test administration
procedures.
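As a concrete illustration, the following sketch computes KR-20, a standard internal-consistency reliability estimate for dichotomously scored items such as the multiple-choice listening items analyzed in this study; the response matrix is invented for the example.

```python
import numpy as np

def kr20(responses: np.ndarray) -> float:
    """Kuder-Richardson 20 reliability for a persons-by-items 0/1 matrix."""
    k = responses.shape[1]                          # number of items
    p = responses.mean(axis=0)                      # proportion correct per item
    total_var = responses.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1.0 - (p * (1 - p)).sum() / total_var)

# Invented data: 5 test takers x 4 dichotomous items
scores = np.array([[1, 1, 1, 0],
                   [1, 0, 1, 0],
                   [1, 1, 1, 1],
                   [0, 0, 1, 0],
                   [1, 1, 0, 1]])
print(round(kr20(scores), 2))  # -> 0.51 on this toy matrix
```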
Response probability (RP) criterion: In the context of Bookmark and similar
item-mapping standard-setting procedures, the criterion used to operationalize
participants’ judgment regarding the probability of a correct response (for
dichotomously scored items) or the probability of achieving a given score point or
higher (for polytomously scored items). In practical applications, two RP criteria
appear to be used most frequently (RP50 and RP67); other RP criteria have also
been used, though considerably less frequently.
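To make the RP criterion concrete: under a Rasch scaling of the items (the model used by WINSTEPS), an item of difficulty b is mapped at the ability level where the probability of a correct response equals the chosen RP value. A minimal sketch with illustrative numbers:

```python
import math

def rasch_map_location(b: float, rp: float = 0.67) -> float:
    """Ability (in logits) at which a Rasch item of difficulty b is answered
    correctly with probability rp: theta = b + ln(rp / (1 - rp))."""
    return b + math.log(rp / (1.0 - rp))

# An item of difficulty 0.5 logits maps at its own difficulty under RP50,
# but about 0.71 logits higher under RP67.
print(rasch_map_location(0.5, 0.50))  # -> 0.5
print(rasch_map_location(0.5, 0.67))  # -> ~1.21
```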
Rubric: A set of instructions or guidelines on an exam paper.
Selected-response: An item format in which the test taker must choose the correct
answer from the alternatives provided.
Specifications (also, test specifications): A description of the characteristics of a
test, including what is tested, how it is tested, and details such as number and length
of forms, item types used.
Standard setting: A measurement activity in which a procedure is applied to
systematically gather and analyze human judgment for the purpose of deriving one
or more cut scores for a test.
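For instance, once each panelist’s judgment has been converted to a provisional cut score on the score scale, the panel’s judgments are commonly aggregated with a simple robust statistic such as the median. A minimal sketch with hypothetical values:

```python
import statistics

# Hypothetical provisional cut scores from eight panelists (one round)
panelist_cuts = [24, 26, 25, 27, 24, 28, 25, 26]

# A common aggregation rule: take the median (robust to outlying judgments)
cut_score = statistics.median(panelist_cuts)
print(cut_score)  # -> 25.5
```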
Standardized test: A standardized test is any form of test that (1) requires all test
takers to answer the same questions, or a selection of questions from a common
bank of questions, in the same way, and that (2) is scored in a “standard” or
consistent manner, which makes it possible to compare the relative performance of
individual students or groups of students.



Test form: Test forms refer to different versions of tests that are designed in the
same format and used for different administrations.
Validation: An action of checking or proving the validity or accuracy of something.
The validity of a test can only be established through a process of validation.
Validity: The degree to which a test measures what it is supposed to measure, or
can be used successfully for the purpose for which it is intended. A number of
different statistical procedures can be applied to a test to estimate its validity. Such
procedures generally seek to determine what the test measures, and how well it does
so.
Validity argument: A set of statements that provide a critical evaluation of the
interpretive argument.
Warrant: The underlying connection between the claim and evidence in an
interpretive argument.
* These key terms are taken from the glossary provided by Cizek & Bunch (2007)
and from the glossary on the website of Second Language Testing, Inc.


ABSTRACT

Standard setting is an important phase in the development of an examination
program, especially for a high-stakes test. Standard setting studies are designed to

identify reasonable cut scores and to provide backing for that choice. This study
investigated the validity of the cut scores established for a VSTEP.3-5 listening test
administered in early 2017 to 1,562 test takers by one institution permitted by the
Ministry of Education and Training, Vietnam to design and administer the
VSTEP.3-5 tests. The study adopted the current argument-based validation
approach, focusing on three main inferences that construct the validity argument:
(1) test tasks and items, (2) test reliability, and (3) cut scores. The argument is that,
for the cut scores of the VSTEP.3-5 listening test to be valid, the test tasks and test
items must first be designed in accordance with the characteristics stated in the test
specifications. Second, the listening test scores should be sufficiently reliable to
reasonably reflect test takers’ listening proficiency. Third, the cut scores must have
been reasonably established for the VSTEP.3-5 listening test.
In this study, qualitative and quantitative methods were combined and structured to
gather evidence for and against the assumptions underlying each of these three
inferences. With regard to the first and second inferences, an analysis of the test
tasks and test items was conducted, and test reliability was examined to determine
whether it fell within the acceptable range. For the third inference, concerning the
cut scores of the VSTEP.3-5 listening test, the Bookmark standard-setting method
was implemented and its results were compared with the cut scores currently
applied to the test. This study offers contributions in three areas. First, it supports
the widely held notion of validity as a unitary concept and of validation as the
process of building an interpretive argument and collecting evidence in support of
that argument. Second, it contributes to raising awareness of the importance of
evaluating the cut scores of high-stakes language tests in Vietnam so that fairness
can be ensured for all test takers. Third, it contributes to the construction of a
systematic, transparent and defensible validity argument for the VSTEP.3-5 test in
general and its listening component in particular. The results of this study provide
informative feedback on the establishment of the cut scores for the VSTEP.3-5
listening test, the test specifications, and the test development process. Positive
results provide evidence strengthening the reasonableness of the cut scores, the
specifications and the quality of the VSTEP.3-5 listening test; negative results
suggest changes or improvements to the cut scores, the specifications and the
design of the VSTEP.3-5 listening test.



ACKNOWLEDGMENTS

I would like to take this opportunity to express my heartfelt gratitude to all the
people without whom this thesis would never have been possible. Although it is just
my name on the cover, many people have contributed to the research in their own
particular way and for that I want to give them my special thanks.
First and foremost, I would like to express my whole-hearted thanks to my
supervisor, Professor Nguyen Hoa. I am so lucky to be one of his Ph.D. students. I
appreciate all his contributions of time, ideas and other assistance to make my Ph.D.
experience productive and stimulating. His enthusiasm and encouragement were
motivational for me, making my Ph.D. pursuit a short and enjoyable journey. I am
also very grateful to him for involving me in his various research projects, which
has provided me with a lot of experience in conducting this study. He has been a
tremendous mentor.

I would also like to thank my co-supervisor, Fred Davidson, Professor Emeritus
from the University of Illinois, for giving me the very first ideas, advice and
guidance on how to start my Ph.D. study. His advice on both research as well as on
my career has been invaluable.
I am especially thankful to Professor Nathan T. Carr from California State
University, Fullerton for conducting a series of workshops on designing and
analyzing language tests at the University of Languages and International Studies,
Vietnam National University - Hanoi. Being able to discuss my work with him has
been invaluable for developing my ideas. His sharing of his knowledge and
experience about language testing and assessment in general and standard-setting
methods in particular has been a great contribution to the completion of my Ph.D.
thesis.
I want to thank all of my colleagues at the University of Languages and
International Studies, Vietnam National University - Hanoi, especially my



colleagues at the Center for Language Testing and Assessment, for sharing my
workload and always cheering me up when I was down.
My sincere thanks also go to Dr. Huynh Anh Tuan, Dean of the Faculty of Postgraduate Studies, and his staff for helping me to process the paperwork and
constantly reminding me of the deadlines. Without their support and
encouragement, I would have postponed my thesis defense for one or two more
years.
Words cannot express how grateful I am to my family. I want to say thank you to
my parents and siblings for their encouragement during the time I conducted my
study.
This thesis is dedicated to my beloved husband and my daughter for their love,
endless support, encouragement and sacrifices throughout this experience.
As a final word, I would like to thank each and every individual who has been a
source of support and encouragement and helped me to achieve my goal and

complete my thesis work successfully.



CHAPTER I
INTRODUCTION
This chapter introduces the topic of the study and presents the main reasons for
choosing it. It then states the questions to be addressed within the scope of the
study. A brief overview of the organization of the thesis closes the chapter.

1. Statement of the problem
The term “cut scores” refers to the minimum scores on a standardized test,
high-stakes test or other form of assessment that separate a test score scale into two
or more regions, creating categories of performance or classifications of examinees.
Clearly, if the cut scores are not appropriately set, the results of the assessment can
come into question. For this reason, establishing cut scores for a test has been
considered an important and practical aspect of standard setting. In his discussion of
test validation, Kane (2006), besides emphasizing the importance of carefully
defining the selected cut scores, highlights the evaluation of the reasonableness of
the cut scores and states that the establishment of cut scores is a complex endeavor,
but that their validation is even more difficult.
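In operational terms, applying cut scores is a simple thresholding of the reported score scale, as the following minimal sketch shows. The numeric values here are hypothetical placeholders, not the operational VSTEP.3-5 cut scores (which are presented in Chapter III).

```python
def classify(score: float, cuts: list[tuple[str, float]]) -> str:
    """Assign a performance category given ascending (label, cut score) pairs."""
    level = "Below Level 3"
    for label, cut in cuts:          # cuts must be sorted ascending
        if score >= cut:
            level = label
    return level

# Hypothetical cut scores on a 0-10 scale (for illustration only)
cuts = [("Level 3 (B1)", 4.0), ("Level 4 (B2)", 6.0), ("Level 5 (C1)", 8.5)]
print(classify(7.2, cuts))  # -> "Level 4 (B2)"
```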
According to the Standards for Educational and Psychological Testing (AERA et
al., 1999, p. 9), validity is defined as “the degree to which evidence and theory
support the interpretation of test scores entailed by proposed uses of tests”, and test
validation is the process of making a case for the proposed interpretation and use of
test scores. This case takes the form of an argument that states a series of
propositions supporting the proposed interpretation and use of test scores and
summarizes the evidence supporting these propositions (Kane, 2006). With regard
to standard setting, since there are no “gold standards” and no “true cut scores”, to


validate established cut scores means to provide evidence in support of the
plausibility and appropriateness of the proposed cut score interpretation, and of
their credibility and defensibility (Kane et al., 1999). Worldwide, although plenty of
studies have been conducted on the validity of cut scores established for tests, these
studies mainly aim at cross-validating two different methods of standard setting and
comparing their results, rather than investigating the validity of cut scores as a
whole.
In Vietnam, the National Foreign Language 2020 Project (NFL2020) was initiated
in 2008 with the aim to “renovate the teaching and learning of foreign languages
within the national education system” so that “… by 2020, most Vietnamese
students graduating from secondary, vocational schools, colleges and universities
will be able to use a foreign language confidently in their daily communication,
their study and work in an integrated, multi-cultural and multi-lingual environment,
making foreign languages a comparative advantage of development for Vietnamese
people in the cause of industrialization and modernization for the country”
(Decision 1400/QD-TTg). Language assessment is considered a major component
of this project. The biggest achievement of this component is the emergence of the
first-ever standardized test of English proficiency (the VSTEP.3-5 test). The test
was officially released by the Ministry of Education and Training, Vietnam on 11th
March 2015. The test aims at measuring English ability across a broad language
proficiency continuum from level 3 to level 5, equivalent to CEFR (Common
European Framework of Reference for Languages) levels B1 - C1. The cut scores
of the VSTEP.3-5 test help to categorize test takers and certify them based on the
levels they achieve. These cut scores are applied to all results of the VSTEP.3-5
tests, which are supposed to be strictly built in accordance with the test
specifications.
At the moment, the results and certificates of the VSTEP.3-5 test are used by many
companies as a requirement for job positions and by many educational institutions
as a “visa” for learners to be accepted into or to graduate from an academic
program. For example, English teachers from primary schools and secondary
schools throughout Vietnam are expected to obtain level 4 in English (equivalent to
B2) while the requirement for those working in high schools, colleges and
universities is level 5 (equivalent to C1). Besides, in order to graduate from
university, English-major students need to show evidence of their English at level 5
(equivalent to C1), while the requirement for non-English-major students is level 3
(equivalent to B1). This shows that the uses of the VSTEP.3-5 test and the decisions
that are made from the test cut scores have important consequences for the
stakeholders. As with other high-stakes tests such as TOEFL, IELTS, PTE, or the
Cambridge tests, more research needs to be conducted on the test in general and on
the validity of the VSTEP.3-5 cut scores in particular in order for the test to gain
credibility and defensibility. However, so far, there have been few studies on the
VSTEP.3-5 test and there is no validation research on its cut scores.
Among the skills tested in high-stakes examinations, listening is the skill that the
fewest researchers choose to study. According to Buck (2001), the assessment of
listening ability is one of the least understood and least developed areas of language
assessment. However, Buck (2001) also states that the assessment of listening
ability is one of the most important testing aspects. In terms of standard setting and
cut-score validation, the procedure for listening tests is also much more complicated
and time-consuming than for other skills. Finally, for the author of this study,
listening is a skill of genuine interest and thus one worth exploring.
All of the reasons mentioned above have motivated the author of this doctoral
thesis to conduct a validation study on the cut scores of the VSTEP.3-5 listening
test, using the argument-based validity model proposed by Kane (2013). A validity
argument is a set of related propositions that, taken together, form an argument in
support of an intended use or interpretation of the test scores. With the deeply
rooted desire to develop a good listening proficiency test in Vietnam, this research
is expected to bring the author of this doctoral thesis a profound insight into this
specific area of interest for her future professional development.