

GIÁP THỊ AN

EVALUATING A FINAL ENGLISH READING TEST FOR THE STUDENTS AT HANOI TECHNICAL AND PROFESSIONAL SKILLS TRAINING SCHOOL – HANOI CONSTRUCTION CORPORATION

(ĐÁNH GIÁ BÀI KIỂM TRA HẾT MÔN TIẾNG ANH CHO HỌC SINH TRƯỜNG TRUNG HỌC KỸ THUẬT VÀ NGHIỆP VỤ HÀ NỘI – TỔNG CÔNG TY XÂY DỰNG HÀ NỘI)

M.A. Minor Thesis

Field: Methodology
Code: 60.14.10

Supervisor: Phùng Hà Thanh, M.A.

HANOI – 2008


ACKNOWLEDGEMENTS

I would like to express my deepest thanks to my supervisor, Ms. Phùng Hà Thanh, M.A., for the invaluable support, guidance, and timely encouragement she gave me while I was doing this research. I am truly grateful to her for her advice and suggestions right from the beginning, when this study was only in its formative stage. I would also like to send my sincere thanks to the teachers of the English Department, HATECHS, who took part in the discussion and gave insightful comments and suggestions on this paper.

My special thanks also go to the students in groups KT1, KT2, KT3, and KT4-K06 for their participation as the subjects of the study. Without them, this project could not have been so successful.

I owe a great debt of gratitude to my parents, my sisters, my husband, and especially my son, who have constantly inspired and encouraged me to complete this research.


ABSTRACT

Test evaluation is a complicated matter which has received much attention from researchers ever since the importance of language tests in assessing students' achievement was recognized. When evaluating a test, the evaluator should concentrate on the criteria of a good test, such as the mean, the difficulty level, discrimination, reliability, and validity.

In this study, the researcher chose the final reading test for students at HATECHS to evaluate, with the aim of estimating its reliability and checking its validity. This is a new test that follows the PET format and was used in the 2006–2007 school year as a procedure to assess the achievement of students at HATECHS. From the interpretation of the score data, the researcher found that the final reading test is reliable in terms of internal consistency. Face and construct validity were checked as well, and the test was concluded to be valid on the basis of the calculated validity coefficients. However, the study has limitations, which point to the researcher's directions for future studies.


LIST OF ABBREVIATIONS

HATECHS: Hanoi Technical and Professional Skills Training School


GLOSSARY

Discrimination is the spread of scores produced by a test, or the extent to which a test separates students from one another on a range of scores from high to low. The term is also used to describe the extent to which an individual multiple-choice item separates the students who do well on the test as a whole from those who do badly.

Difficulty is the extent to which a test or test item is within the ability range of a particular candidate or group of candidates.

Mean is a descriptive statistic measuring central tendency. The mean is calculated by dividing the sum of a set of scores by the number of scores. (A worked example follows at the end of this glossary.)

Median is a descriptive statistic measuring central tendency: the middle score or value in a set.

Marker (also scorer) is the judge or observer who operates a rating scale in the measurement of oral and written proficiency. The reliability of markers depends in part on the quality of their training, the purpose of which is to ensure a high degree of comparability, both inter- and intra-rater.

Mode is a descriptive statistic measuring central tendency: the most frequently occurring score or score interval in a distribution.

Raw scores are test data in their original format, not yet transformed statistically in any way (e.g. by conversion into percentages, or by adjusting for the level of difficulty of the task or any other contextual factors).

Reading comprehension test is a measure of understanding of text.

Reliability is consistency: the extent to which the scores resulting from a test are similar wherever and whenever it is taken, and whoever marks it.

Score is the numerical result of a measurement of a test taker's performance. The measure may be based on ratings, judgements, grades, or the number of test items correct.

Standard deviation is a property of the normal curve. Mathematically, it is the square root of the variance of the test scores. (See the example at the end of this glossary.)

Test analysis is the process by which data from test trials are analyzed during the test development process to evaluate individual items as well as the reliability and validity of the test as a whole. Test analysis is also carried out following test administration in order to allow the reporting of results, and it may also be conducted for research purposes.

Test item is a part of an objective test which sets the problem to be answered by the student: usually either in multiple-choice form, as a statement followed by several choices of which one is the right answer and the rest are not, or as a true/false statement which the student must judge to be either right or wrong.

Test taker is a term used to refer to any person undertaking a test or examination. Other terms commonly used in language testing are candidate, examinee, and testee.

Test-retest is the simplest method of computing test reliability; it involves administering the same test to the same group of subjects on two occasions. The time between administrations is normally limited to no more than two weeks in order to minimize the effect of learning upon true scores.

Validity is the extent to which a test measures what it is intended to measure. Test validity consists of content, face, and construct validity.


PART ONE: INTRODUCTION


1. Rationale

Testing is necessary in the process of language teaching and learning; therefore, it has gained much attention from teachers and learners. Through testing, teachers can evaluate learners' achievements in a certain learning period, assess their own teaching methods, and provide input into the process of language teaching (Bachman, 1990, p. 3). Thanks to testing, learners can also self-assess their English ability to examine whether their level of English meets the demands of employment or of studying abroad. The important role of tests makes test evaluation necessary: by evaluating tests, test designers can arrive at the best test papers for assessing their students.

Despite the importance of testing, in many schools tests are designed without following any rigorous principles or procedures; thus, their validity and reliability are open to doubt. At HATECHS, the final English course tests had been designed by teachers of the English Department at the end of the course, and some tests were used repeatedly with no adjustment. In the 2006–2007 school year, there was a change in test design: final tests were designed according to the PET (Preliminary English Test) procedure. The PET belongs to the Cambridge testing system for English Speakers of Other Languages. Based on the PET, a new final reading test was developed and used as an instrument to assess students' achievement in reading. The test was delivered to students at the end of the 2006–2007 school year, but it was never evaluated; to decide whether the test is reliable and valid, a serious study is needed. The context at HATECHS inspired the author, a teacher of English at the school, to undertake the study entitled "Evaluating a Final English Reading Test for the Students at Hanoi Technical and Professional Skills Training School", with the aim of checking the validity and reliability of the test. The author was also eager to find suggestions that would help test designers produce better and more effective tests for their students.


2. Objectives of the study

The study is aimed at evaluating the final reading test for the students at Hanoi Technical and Professional Skills Training School; the test takers are non-English majors. The results of the test will be analyzed, evaluated, and interpreted with the following aims:

- to calculate the internal consistency reliability of the test
- to check the face and construct validity of the test

3. Scope of the study

Test evaluation is a wide concept, and there are many criteria for evaluating a test. Normally, there are four major criteria that a test evaluator considers: item difficulty, discrimination, reliability, and validity. However, item difficulty and discrimination are said to be difficult to evaluate and interpret; therefore, within this study the researcher focuses on the reliability and the validity of the test as a whole.

At HATECHS, at the end of Semester 1, there is a reading achievement test, and at the end of the first year, after 120 periods of studying English, there is a final reading test. The researcher chose the final test in order to evaluate its internal consistency reliability and its face and construct validity.

4. Methodology of the study

In this study, the author evaluated the test by adopting both qualitative and quantitative methods. The research is quantitative in the sense that data were collected through the analysis of the scores of 30 randomly selected papers of students at the Faculty of Finance and Accounting. To estimate the internal consistency reliability, the Kuder-Richardson Formula 21 was used, and the Pearson correlation coefficient formula was adopted to calculate the validity coefficient. The research is qualitative in its use of a semi-structured interview with open questions, which were delivered to teachers at HATECHS at the annual meeting on the teaching syllabus and methodology. The conclusions of that discussion were used as the qualitative data of the research.
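The thesis names the two formulas but does not reproduce them. The sketch below, which is not from the study and uses hypothetical score data and an assumed item count of 35, illustrates how both statistics could be computed:

```python
import math

def kr21(scores, k):
    """Kuder-Richardson Formula 21: an internal-consistency estimate
    based only on the number of items k, the mean, and the variance
    of the total scores (population variance is used here)."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((x - mean) ** 2 for x in scores) / n
    return (k / (k - 1)) * (1 - (mean * (k - mean)) / (k * var))

def pearson(xs, ys):
    """Pearson product-moment correlation between two sets of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical totals on the final reading test and on the PET
test_scores = [22, 28, 19, 31, 25, 27, 24, 30, 21, 26]
pet_scores  = [20, 27, 18, 30, 24, 25, 23, 29, 20, 25]

print(kr21(test_scores, k=35))           # internal consistency reliability
print(pearson(test_scores, pet_scores))  # validity coefficient
```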

5. The organization of the study

The study is divided into three parts:

Part One: Introduction – presents basic information such as the rationale, the scope, the objectives, the methods, and the organization of the study.

Part Two: Development – consists of two chapters:

Chapter 1: Literature Review – reviews the literature related to language testing and test evaluation.

Chapter 2: Methodology and Results – is concerned with the methods of the study, the selection of participants, the materials, and the methods of data collection and analysis, as well as the results of the data analysis.

Part Three: Conclusion – summarizes the study and presents its limitations as well as recommendations for further studies. It is followed by the Bibliography and Appendices.


PART TWO: DEVELOPMENT


CHAPTER 1

LITERATURE REVIEW

This chapter attempts to establish the theoretical background for the study. Approaches to language testing and testing reading, as well as the literature on test evaluation, will be reviewed.

1.1. Language testing

1.1.1. Approaches to language testing

1.1.1.1. The essay translation approach

According to Heaton (1988), this approach is commonly referred to as the pre-scientific stage of language testing. In this approach, no special skill or expertise in testing is required; tests usually consist of essay writing, translation, and grammatical analysis. The tests, for Heaton, also have a heavy literary and cultural bias. He also observes critically that public examinations (i.e. secondary school leaving examinations) resulting from the essay translation approach sometimes have an aural/oral component at the upper-intermediate and advanced levels, though this has sometimes been regarded in the past as something additional and in no way an integral part of the syllabus or examination (p. 15).

1.1.1.2. The structuralist approach

"This approach is characterized by the view that language learning is chiefly concerned with the systematic acquisition of a set of habits. It draws on the work of structural linguistics, in particular the importance of contrastive analysis and the need to identify and measure the learner's mastery of the separate elements of the target language: phonology, vocabulary and grammar. Such mastery is tested using words and sentences completely divorced from any context, on the grounds that a large number of language forms can be covered in the test in a comparatively short time. The skills of listening, speaking, reading and writing are also separated from one another as much as possible because it is considered essential to test one thing at a time" (Heaton, 1988, p. 15).


According to Heaton, this approach is still valid for certain types of test and for certain purposes, such as the desire to concentrate on the testees' ability to write by attempting to separate a composition test from reading. The psychometric approach to measurement, with its emphasis on reliability and objectivity, forms an integral part of structuralist testing. Psychometrists were able to show early on that such traditional examinations as essay writing are highly subjective and unreliable. As a result, the need for statistical measures of reliability and validity came to be considered of the utmost importance in testing: hence the popularity of the multiple-choice item, a type of item which lends itself admirably to statistical analysis.

1.1.1.3. The integrative approach

Heaton (1988, p. 16) considers this approach the testing of language in context; it is thus concerned primarily with meaning and the total communicative effect of discourse. As a result, integrative tests do not seek to separate language skills into neat divisions in order to improve test reliability; instead, they are often designed to assess the learner's ability to use two or more skills simultaneously. Thus, integrative tests are concerned with a global view of proficiency: an underlying language competence or 'grammar of expectancy', which, it is argued, every learner possesses regardless of the purpose for which the language is being learnt.

Integrative testing, according to Heaton (1988), is best characterized by the use of cloze testing and dictation. Besides these, oral interviews, translation, and essay writing are also included in many integrative tests, a point frequently overlooked by those who take too narrow a view of integrative testing.

Heaton (1988) points out that the cloze procedure as a measure of reading difficulty and reading comprehension will be treated briefly in the relevant section of the chapter on testing reading comprehension. Dictation, another major type of integrative test, was previously regarded solely as a means of measuring students' listening comprehension; thus, the complex elements involved in tests of dictation were largely overlooked until fairly recently. The integrated skills involved in dictation tests include auditory discrimination, auditory memory span, spelling, the recognition of sound segments, familiarity with the grammatical and lexical patterning of the language, and overall textual comprehension.

1.1.1.4. The communicative approach

According to Heaton (1988, p. 19), "the communicative approach to language testing is sometimes linked to the integrative approaches. However, although both approaches emphasize the importance of the meaning of utterances rather than their form and structure, there are nevertheless fundamental differences between the two approaches". The communicative approach is said to be very humanistic, in the sense that each student's performance is evaluated according to his or her degree of success in performing the language tasks rather than solely in relation to the performance of other students (Heaton, 1988, p. 21).

However, the communicative approach to language testing reveals two drawbacks. First, teachers will find it difficult to assess students' ability without comparing achievement results among students. Second, the communicative approach is claimed to be somewhat unreliable because of the variety of real-life situations (Hoang, 2005, p. 8). Nevertheless, Heaton (1988) proposes a solution to this matter: in his view, to avoid the lack of reliability, very carefully drawn-up and well-established criteria must be designed, though he does not set out any criteria in detail.

In a nutshell, each approach to language testing has its weak points as well as its strong points. Therefore, a good test should incorporate features of all four approaches (Heaton, 1988, p. 15).

1.1.2. Classifications of Language Tests

Language tests may be of various types but different scholars hold different views on the types of language tests.


Henning (1987), for instance, establishes seven kinds of language tests, which can be summarized in the following table:

1. Objective vs subjective tests: Subjective tests are scored based on the raters' judgements or opinions; they are claimed to be unreliable and dependent on the rater.
2. Direct vs indirect tests: Direct tests are in the form of spoken tests (in real-life situations); indirect tests are in the form of written tests.
3. Discrete vs integrative tests: Discrete tests are used to test knowledge in restricted areas; integrative tests are used to evaluate general language knowledge.
4. Aptitude, achievement and proficiency tests: Aptitude tests (intelligence tests) are used to select students.
5. Criterion-referenced vs norm-referenced tests: In criterion-referenced tests, the instructions are designed after the tests are devised, and the tests follow the teaching objectives exactly. In norm-referenced tests, there are a large number of people from the target population, and standards of achievement such as the mean or average score are established after the course.
6. Speed tests and power tests: Speed tests consist of items, but the time allowed seems insufficient; power tests contain difficult items, but the time allowed is sufficient.
7. Others.

Table 1: Types of language tests (Source: Henning, 1987, pp. 4-9)

However, Hughes (1989) mentions two categories: kinds of tests and kinds of language testing. Basically, kinds of language testing consist of direct vs indirect testing, norm-referenced vs criterion-referenced testing, discrete vs integrative testing, and objective vs subjective testing (Hughes, 1989, pp. 14-19). Apart from this, he develops one more type called communicative language testing, which is described as the assessment of the ability to take part in acts of communication (Hughes, 1989, p. 19). Hughes also discusses kinds of tests, which can be illustrated in the following table:

1. Proficiency tests: measure sufficient command of the language for a particular purpose.
2. Achievement tests: final achievement tests are organized at the end of the course; progress achievement tests measure the students' progress during the course.
3. Diagnostic tests: find students' strengths and weaknesses and identify what further teaching is necessary.
4. Placement tests: classify students into classes at different levels.

Table 2: Types of tests (Source: Hoang, 2005, p. 13, as cited in Hughes, 1989, pp. 9-14)

Language tests are divided into two types by McNamara (2000) based on test methods and test purposes. Regarding test methods, he believes there are two basic types: traditional paper-and-pencil language tests, which are used to assess either separate components of knowledge or receptive understanding, and performance tests. Regarding test purposes, he divides language tests into achievement tests and proficiency tests.

1.2. Testing reading

Reading can be defined as the interaction between the reader and the text (Aebersold & Field, 1997). This dynamic relationship portrays the reader as creating the meaning of the text in relation to his or her prior knowledge (Anderson, 1999). Reading is one of the four main skills and plays a decisive role in the process of acquiring a language; therefore, testing reading comprehension is also important. Traditionally, testing reading has not been questioned, because of the social importance of literacy and because reading tests are considered more reliable than speaking tests.


Alderson (1996) proposes that reading teachers feel uncomfortable testing reading. In his view, although most teachers use a variety of techniques in their reading classes, they do not tend to use the same variety of techniques when they administer reading tests. Despite the variety of testing techniques, none of them is subscribed to as the best one. Alderson (1996, 2000) considers that no single method satisfies reading teachers, since each teacher has different purposes in testing. He lists a number of test techniques or formats often used in reading assessment, such as cloze tests, multiple-choice techniques, alternative objective techniques (e.g., matching techniques, ordering tasks, dichotomous items), editing tests, alternative integrated approaches (e.g., the C-test, the cloze elide test), short-answer tests (e.g., the free-recall test, the summary test, the gapped summary), and information-transfer techniques. Among the many approaches to testing reading comprehension, the three principal methods have been the cloze procedure, multiple-choice questions, and short-answer questions (Weir, 1997).

The cloze test is now a well-known and widely used integrative language test. Wilson Taylor (1953) first introduced the cloze procedure as a device for estimating the readability of a text. However, what brought the cloze procedure widespread popularity were the investigations of the cloze test as a measure of ESL proficiency (Jonz, 1976, 1990; Bachman, 1982, 1985; Brown, 1983, 1993). The results of the substantial volume of research on cloze tests have been extremely varied. Furthermore, major technical defects have been found with the procedure. Alderson (1979), for instance, showed that changes in the starting point or deletion rate affect reliability and validity coefficients. Other researchers, such as Carroll (1980), Klein-Braley (1983, 1985) and Brown (1993), have questioned the reliability and different aspects of the validity of cloze tests.

According to Heaton (1988), "the cloze test was originally intended to measure the reading difficulty level of the text. Used in this way, it is a reliable means of determining whether or not certain texts are at an appropriate level for particular groups of students" (p. 131). However, for Heaton the most common purpose of the cloze test is to measure reading comprehension. It has long been argued that cloze measures text-level processing involving the interdependence of phrases, sentences and paragraphs within the text. A true cloze is said generally to measure global reading comprehension, although insights can undoubtedly be gained into particular reading difficulties. In contrast, Cohen (1998) concludes that cloze tests do not assess global reading ability but rather local-level reading. Each researcher tends to present evidence to support his own argument; however, most agree that the cloze procedure is genuinely effective in testing reading comprehension.

Another technique that Alderson (1996, 2000), Cohen (1998), and Hughes (2003) discuss is the multiple-choice question, a common device for testing text comprehension. Ur (1996, p. 38) defines multiple-choice questions as consisting "... of a stem and a number of options (usually four), from which the testee has to select the right one". Alderson (2000, p. 211) states that multiple-choice test items are popular because they provide testers with the means to control test takers' thought processes when responding; they "… allow testers to control the range of possible answers …".

Weir (1993) points out that short-answer tests are extremely useful for testing reading comprehension. According to Alderson (1996, 2000), short-answer tests are seen as 'a semi-objective alternative to multiple choice'. Cohen (1998) argues that open-ended questions allow test takers to copy the answer from the text, but one first needs to understand the text in order to write the right answer. Test takers are supposed to answer a question briefly by drawing conclusions from the text, not just responding 'yes' or 'no', and they are supposed to infer meaning from the text before answering the question. Such tests are not easy to construct, since the tester needs to foresee all possible answers. Hughes (2003, p. 144) points out that "the best short-answer questions are those with a unique correct response". However, scoring the responses depends on thorough preparation of the answer key. Hughes (2003) proposes that this technique works well when the aim is to test the ability to identify referents.

The techniques above are those usually used in testing reading; however, it is difficult to say which is the most effective one, because that depends on the teacher's purpose in assessing the students.


1.3. Criteria in evaluating a test

Test evaluation is a complicated process that requires the analysis of a number of criteria. However, there are five main criteria by which most researchers evaluate their tests: the mean, difficulty level, discrimination, reliability, and validity.

1.3.1. The mean

According to the dictionary of language testing by Milanovic and other authors, the mean, also called the arithmetical average, is a descriptive statistic measuring central tendency. The mean is calculated by dividing the sum of a set of scores by the number of scores. Like other measures of central tendency, the mean gives an indication of the trend, or of the score which is typical of the whole group. In normal distributions the mean is closely aligned with the median and the mode. This measure is by far the most commonly used, and it is the basis of a number of statistical tests of comparison between groups commonly used in language testing (Milanovic et al., 1999, p. 118).

In language test evaluation, the mean is also a criterion that needs evaluating, because the mean score will tell you how difficult or easy the test was for the given group. This is useful for evaluators in making reasonable adjustments to the test as a whole.
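In symbols (added here for reference), for a set of scores $x_1, \dots, x_N$ the mean is

$$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$$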

1.3.2. The difficulty level

The difficulty level of a test tells you how difficult or easy each item of the test is. Difficulty also reflects the ability range of a particular candidate or group of candidates. "In language testing, most tests are designed in such a way that the majority of items are not too difficult or too easy for the relevant sample of test candidates" (Milanovic et al., 1999, p. 44).

Item difficulty requirements vary according to test purpose. In a selection test, for example, there may be no need for finely graded assessment within the 'pass' or 'fail' groups, so the most efficient test design will have a majority of items clustering near the critical cut-score. Information about item difficulty is also useful in determining the order of items on a test. Tests tend to begin with easy items in order to boost confidence and to ensure that weaker candidates do not waste valuable time on items which are too difficult for them.

For test evaluators, the difficulty level of a test should be analyzed because of its importance in deciding the sequence of items on a test. It is also one of the factors that affect the test scores of test takers.
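The section gives no formula, but item difficulty is conventionally quantified as the facility value, the proportion of candidates who answer the item correctly:

$$FV = \frac{R}{N}$$

where $R$ is the number of correct responses to the item and $N$ is the number of candidates; values close to 1 mark an easy item and values close to 0 a difficult one.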

1.3.3. Discrimination

According to Heaton, "the discrimination index of an item indicates the extent to which the item discriminates between the testees, separating the more able testees from the less able" (Heaton, 1988, p. 179). For him, the index of discrimination tells us whether those students who performed well on the whole test tended to do well or badly on each item in the test.

Similarly, in Milanovic et al.'s definition, discrimination is understood as a fundamental property of a language test in its attempt to capture the range of individual abilities; on that basis, how widely a test discriminates is an important indicator of its reliability (Milanovic et al., 1999, p. 48).

By looking at the test scores, evaluators can check discrimination. Because of its decisive role in separating test takers into stronger and weaker groups, the discrimination of a test needs to be analyzed in the process of evaluating a test.
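Although the thesis does not reproduce it, Heaton's discrimination index is commonly calculated from the top- and bottom-scoring groups on the whole test:

$$D = \frac{U - L}{n}$$

where $U$ and $L$ are the numbers of candidates in the upper and lower groups who answer the item correctly and $n$ is the number of candidates in each group; $D$ ranges from $-1$ to $+1$, with higher values indicating better discrimination.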

1.3.4. Reliability

Reliability is another aspect of a test that should be estimated by the test evaluator. "Reliability is often defined as consistency of measurement" (Bachman & Palmer, 1996, p. 19). A reliable test score will be consistent across different characteristics of the testing situation; thus, reliability can be considered a function of the consistency of scores from one set of test tasks to another. Reliability also means "the consistency with which a test measures the same thing all the time" (Harrison, 1987, p. 24).

For test evaluators, reliability can be estimated by methods such as "parallel form, split half, rational equivalence, test-retest and inter-rater reliability checks" (Milanovic et al., 1999, p. 168). According to Shohamy (1985), the types of reliability, their descriptions, and the ways to calculate them are summarized in the following table:

1. Test-retest: the extent to which test scores are stable from one administration to another, assuming no learning occurred between the two occasions. Calculated by correlating scores on the same test given on two occasions.
2. Parallel form: the extent to which two tests drawn from the same domain measure the same things. Calculated by correlating scores on two forms of the same test, on different occasions or on one occasion.
3. Internal consistency: the extent to which the test questions are related to one another and measure the same trait. Calculated with the Kuder-Richardson Formula 21.
4. Intra-rater: the extent to which the same rater is consistent in his rating from one occasion to another, or within one occasion but with different test takers. Calculated by correlating scores given by the same rater on different occasions, or on one occasion.
5. Inter-rater: the extent to which different raters agree about the assigned score or rating. Calculated by correlating ratings provided by different raters.

Table 3: Types of reliability (Source: Hoang, 2005, p. 31, as cited in Shohamy, 1985, p. 71)
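For reference (the formula is added here and is not part of the original table), Kuder-Richardson Formula 21 estimates internal consistency from the number of items $k$, the mean $\bar{x}$, and the variance $s^2$ of the total scores:

$$r_{KR21} = \frac{k}{k-1} \left( 1 - \frac{\bar{x}\,(k - \bar{x})}{k\,s^{2}} \right)$$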

However, reliability is said to be a necessary but not a sufficient quality of a test, and the reliability of a test is closely interlocked with its validity. While reliability focuses on the empirical aspects of the measurement process, validity focuses on the theoretical aspects and seeks to interweave these concepts with the empirical ones. For this reason it is easier to assess reliability than validity.

Test reliability can be analyzed by looking at the test scores. If the scores remain unchanged across the different times the test is taken, the test is said to be reliable, and vice versa. However, this depends on certain conditions and situations, such as the circumstances in which the test is taken, the way in which it is marked, and the uniformity of the assessment it makes. Evaluators therefore need to take these into account when they estimate the reliability of a test.

1.3.5. Validity


Validity is the most important consideration in test evaluation. The concept refers to the appropriateness, meaningfulness, and usefulness of the specific inferences made from test scores, and test evaluation is the process of accumulating evidence to support such inferences. Validity, however, is a unitary concept: although evidence may be accumulated in many ways, validity refers to the degree to which that evidence supports the inferences that are made from scores. It is the inferences regarding specific uses of a test that are validated, not the test itself.

Traditionally, validity evidence has been gathered in three distinct categories: content-related, criterion-related, and construct-related evidence of validity. More recent writing on validity theory stresses the importance of viewing validity as a 'unitary concept' (Messick, 1989). Thus, while validity evidence is presented here in separate categories, this categorization is principally an organizational technique for the purposes of presentation.

According to Milanovic et al. (1999), content and construct validity are conceptual, whereas concurrent and predictive (criterion-related) validity are statistical. In other words, scores obtained on the test may be used to investigate criterion-related validity, for example by relating them to other test scores or measures such as teachers' assessments or future predictions (pp. 220-221).
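Such criterion-related investigation typically rests on the Pearson product-moment correlation; for paired scores $(x_i, y_i)$ it is (formula added for reference)

$$r_{xy} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^2}}$$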

Another type of test validity, for Milanovic et al., is face validity, which refers to the degree to which a test appears to measure the knowledge or abilities it claims to measure, as judged by an untrained observer such as the candidate taking the test or the institution which plans to administer it (Milanovic et al., 1999, p. 221).

In their book, Alderson et al. (1995) divide validity into other categories: internal, external, and construct validity. Internal validity, according to them, consists of three sub-types: face, content, and response validity. External validity has two sub-types: concurrent and predictive validity. Construct validity relates to five forms: comparison with theory, internal correlations, comparison of biodata and psychological characteristics, multitrait-multimethod analysis with convergent-divergent validation, and factor analysis (Alderson et al., 1995, pp. 171-186).

Since the validity of a test receives much attention from researchers, test evaluators should take time to check validity according to the categories established by those authors and researchers. Through the test scores, evaluators check whether the test is valid or not, so that they can make sound adjustments to the test they evaluate.

Summary: In this chapter, we have attempted to establish the theoretical framework for the thesis. Language testing is one of the most important procedures for language teachers in assessing students. There are a number of approaches to language testing and to testing reading; these were discussed in the first part of the chapter. The second matter explored in the chapter was the theory of test evaluation, concerning the criteria of a test that need to be analyzed by test evaluators.


CHAPTER 2

METHODOLOGY AND RESULTS

This chapter includes the research questions, the selection of the participants who took part in the study, and the testing materials. The methods of data collection and data analysis, as well as the results, are presented afterwards.

2.1. Research questions

On the basis of the literature review, this chapter aims at answering two research questions:

1) Is the final reading test for the students at HATECHS reliable?

2) To what extent is the final reading test valid in terms of face and construct?

2.2. The participants

The students at HATECHS come from different provinces, cities, and towns in the North of Vietnam, and they are generally aged between 18 and 21. Thirty participants were chosen randomly from the first-year students at the Faculty of Finance and Accounting in the 2006–2007 school year. In addition, seven teachers of the English Department were chosen for the interview. These teachers are all female, and most have more than five years' experience of teaching English. All of them taught the students in the 2006–2007 school year.

At the school, the students take an English course in the first year. The course is a compulsory subject and is divided into two components, each lasting 60 periods. After finishing the course, students are required to reach pre-intermediate level. However, students often have varying English levels prior to the course: some have learnt English for 7 years at high school, some for 3 years, depending on the part of the country they come from, and some have never learnt English at all because at the lower levels of schooling they learned other foreign languages. It is
