
Language Testing


Structure
1. Test Development: Types of Tests, Qualities of a Good Test
2. Issues Specific to Language Tests
3. Developing Item Specifications

Language Test Development: From Test Specification to Test Use

What makes a good test?
– Validity:
  • the test fulfils its purpose,
  • the test gives you the information you want,
  • the test enables you to make well-founded decisions
– Reliability:
  • the test is precise enough for its purpose

– Practicality:
  • the test can be administered and scored in a reasonable amount of time and with reasonable use of resources
– Fairness:
  • students know the purpose of the test,
  • results are only used for decisions that they can reasonably inform

What test for what purpose?
Proficiency test:
• assesses students' knowledge of a language in general, without reference to a curriculum or syllabus,
• usually ranks students in relation to each other (a norm-referenced test),
• one of the main considerations in constructing proficiency tests is discrimination: use a mix of easy, medium, and difficult items

• this makes it possible to distinguish between students at different levels
• if the test consisted only of easy items, there would be no way to tell a medium-ability student from a high-ability student
• Examples: TOEFL, TOEIC, IELTS, university admission tests
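
Item discrimination can be put in numbers. A minimal sketch in Python of the classic upper-lower discrimination index (the function name and the counts below are our own illustration, not from the slides):

```python
def discrimination_index(upper_correct, lower_correct, group_size):
    """Upper-lower discrimination index: D = p_upper - p_lower,
    where p is the proportion answering the item correctly in the
    top-scoring and bottom-scoring groups of test takers."""
    return upper_correct / group_size - lower_correct / group_size

# Hypothetical item: 18 of 20 high scorers but only 6 of 20 low scorers
# answer correctly -- the item separates ability levels well.
print(round(discrimination_index(18, 6, 20), 2))  # 0.6
```

An item with D near 0 tells you nothing about who is stronger; a mix of difficulties keeps D usefully high across the ability range.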

Achievement test:
• assesses what students have learned,
• ranks students in terms of their degree of mastery of a curriculum or syllabus (a criterion-referenced test),
• discrimination may or may not be important for achievement tests:

– if a very well-defined body of knowledge is tested (e.g., a set of vocabulary words), it is fine simply to sample randomly from the possibilities and not worry about discrimination
– if a more abstract construct is tested (e.g., reading comprehension), discrimination is important, because only by including items at different levels of difficulty can different levels of knowledge in the students be distinguished
• Examples: mid-term and final tests in schools and universities
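
Random sampling from a well-defined body of knowledge is straightforward to do mechanically. A sketch in Python, with a hypothetical vocabulary list standing in for the syllabus content:

```python
import random

# Hypothetical well-defined body of knowledge: a course vocabulary list
VOCAB = ["abandon", "benefit", "concept", "derive", "estimate",
         "factor", "indicate", "method", "period", "require"]

random.seed(1)  # fixed seed so the draw is reproducible for this illustration
test_words = random.sample(VOCAB, 4)  # draw 4 items without replacement
print(test_words)
```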

Know your construct – Validity
• Construct: the invisible, intangible attribute about which testers are trying to collect information, e.g., English language proficiency, intelligence, religious devotion, suitability as a pilot, etc.
• The problem with constructs is that they are not directly observable, so testers must gather observable performance and then draw conclusions about the construct

• in other words, testers must collect the right kind of evidence to make statements about the construct
• the construct-evidence connection must be theoretically and empirically defensible, e.g.:
  – a test taker's time in the 100-yard dash (evidence) has no conceivable connection to their ability to comprehend spoken English (construct) => this performance provides no useful information
  – a test taker's score on a test of English listening comprehension with taped dialogs and multiple-choice questions (evidence) has a much stronger connection to their ability to comprehend spoken English (construct) => this performance provides useful information

Threats to validity
• construct underrepresentation: only an aspect of the construct is tested, not all of it, e.g., a writing test where students only produce individual sentences, not extended texts
• construct-irrelevant variance: factors that influence the measurement but are not the object of the measurement, e.g., in a listening comprehension test with a tape and multiple-choice questions, reading ability influences the result, and possibly topic knowledge as well

Sources of construct-irrelevant variance
• A test of ESL listening, where test takers listen to a short conversation and then answer multiple-choice questions
• A test of ESL writing, where test takers have 30 minutes to produce a brief essay on a general topic
• A test of ESL speaking, where test takers role-play a situation with the tester
• A test of ESL reading, where test takers read a text and then answer brief-response questions about the main point, specific information, and the author's stance

Construct Validity
• A test has construct validity if it is a way to gain useful information about the construct, so that inferences and decisions based on test scores are justifiable and defensible
• To make sure you build construct-valid tests, work backwards from decisions

• Decisions: What decisions will you make based on the scores?
• Construct: What construct underlies these decisions?
• Evidence: What evidence / information do you need to find out about the strength of the construct in a test taker?
• Measurement procedures: What kinds of testing procedures will help you gather that information?

Example: Constructing a test to assess students' learning after a semester of ESL
• Decision: Is the student ready for the next level?
• Construct: English proficiency
• Evidence: test takers' comprehension and production of academic English in the oral and written modes
• Measurement procedures: brief-response listening comprehension test with lecture stimuli, multiple-choice reading test with academic texts, oral interview, writing sample

Test items: Reliability and Practicality
• Reliability: the precision / consistency with which the test measures
• Reliability is a necessary condition for validity: an imprecise test cannot elicit useful information
• Reliability is not a sufficient condition for validity: a test may be highly precise but measure something entirely different from the construct (e.g., the 100-yard dash as a measure of ESL vocabulary knowledge)

• the more items measure the same attribute, the more precise the measurement will be => higher reliability!
• to increase reliability, many short items are better than a few long items; essays are the worst for reliability
• however, certain abilities can only be measured with long items, e.g., essay writing ability => sometimes faithful representation of the construct means a loss of reliability
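
The "more items => higher reliability" relationship is captured by the Spearman-Brown prophecy formula, which predicts reliability when a test is lengthened (or shortened) by a given factor. A minimal sketch in Python with hypothetical figures:

```python
def spearman_brown(reliability, length_factor):
    """Spearman-Brown prophecy formula: predicted reliability when a
    test is lengthened by length_factor (with comparable items)."""
    k, r = length_factor, reliability
    return k * r / (1 + (k - 1) * r)

# Doubling a test whose reliability is .70 raises the prediction...
print(round(spearman_brown(0.70, 2), 3))    # 0.824
# ...while halving it lowers the prediction.
print(round(spearman_brown(0.70, 0.5), 3))  # 0.538
```

The formula assumes the added items measure the same attribute as the existing ones, which is exactly the point the slide makes.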

Practicality
• Practicality: the ratio between resources available and resources necessary to administer and score the test
• A highly practical test requires few resources and is likely to be used, whereas an impractical test requires many resources and is much less likely to be used
• In reality, practicality is a trade-off between measurement precision and construct validity on the one hand and real-world constraints on the other:

• major considerations in practicality:
  – preparation: a test must be written, assembled, piloted, etc., so if items are easy to write and can be re-used, the test becomes more practical
  – length: a test cannot be so long that test takers get tired and lose concentration; the maximum length depends on test takers' proficiency, but a 4-hour test is the absolute maximum
  – medium: a paper-and-pencil test is much cheaper to produce than a computer-based test

  – scoring: dichotomous items (multiple choice, true / false) are much easier to score than extended writing (essay) items or speaking tests (oral proficiency interviews); a test that can be scored by machine is very practical with regard to scoring
• what if reliable measurement of the construct would require so many items that the test would become too long and impractical? => limit the scope of the construct and limit the inferences drawn from scores
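
Machine scoring of dichotomous items is trivial precisely because no judgment is involved: each response either matches the key (1) or does not (0). A minimal sketch in Python with a hypothetical answer key:

```python
# Hypothetical answer key for a 5-item multiple-choice test
ANSWER_KEY = {1: "B", 2: "D", 3: "A", 4: "C", 5: "B"}

def score_dichotomous(responses, key):
    """Score each item 1 (matches the key) or 0 -- no judgment involved."""
    return sum(1 for item, answer in responses.items() if key.get(item) == answer)

# A test taker who gets items 1, 3, and 4 right scores 3 out of 5.
print(score_dichotomous({1: "B", 2: "C", 3: "A", 4: "C", 5: "D"}, ANSWER_KEY))  # 3
```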

• item specifications are necessary to make sure items are produced in a systematic fashion and have predictable measurement properties
• even with item specifications there will still be "rogue" items that measure something other than the construct under investigation
• for advice on how to write items of different types, cf. Brown (2005) or Hughes (2003)

Validation
• once the test is built, it needs to be piloted with a small group to make sure the instructions are comprehensible and the test can be done in the time allotted
• once piloted, it needs to be revised and then run experimentally with a larger population
• validation is the collection of evidence to ensure that the test measures the construct it is supposed to measure and that inferences drawn from scores are defensible

Consequences: Fairness, Ethics and Inferences
• Scores from tests are used to make inferences about the strength of the construct in test takers, e.g., their English proficiency, and these inferences lead to decisions, e.g., whether or not to admit the test taker to a university program
• For the test to be fair, it is important to avoid test bias, i.e., test taker characteristics other than the construct influencing the scores

• test bias is present if items are easier for one group of test takers than for another, e.g., different genders, races, ages, or socio-economic backgrounds
• certain background characteristics are likely to coincide with certain constructs, e.g., test takers from rich families may have attended better schools, had better ESL instruction, and acquired higher English proficiency, and may therefore score higher on an English test: whether this is bias or not is a judgment call
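
A first screen for items that behave differently across groups is to compare item difficulty (proportion correct) between groups. A crude sketch in Python, with hypothetical 1/0 score vectors; note that a real differential item functioning (DIF) analysis, such as Mantel-Haenszel, would first match test takers on overall ability, which is exactly the "judgment call" issue the slide raises:

```python
def difficulty(item_scores):
    """Item difficulty (p-value): proportion answering correctly (1 = correct)."""
    return sum(item_scores) / len(item_scores)

def flag_possible_dif(group_a, group_b, threshold=0.10):
    """Crude screen: flag an item whose difficulty differs between two
    groups by more than the threshold. Does NOT control for ability,
    so a flag is a prompt for review, not proof of bias."""
    return abs(difficulty(group_a) - difficulty(group_b)) > threshold

# Hypothetical item: 80% correct in group A vs. 40% in group B -> flagged
print(flag_possible_dif([1, 1, 1, 0, 1], [1, 0, 0, 0, 1]))  # True
```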

• fairness increases the less judgment is involved in scoring: "objectively" scored items (like multiple-choice) are best; "subjectively" scored items (like oral proficiency interviews or essays) are more problematic, but can be improved by scoring guidelines, scorer training, and multiple scorers
• ideally, scoring would be anonymous, but that is not always possible
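
Whether multiple scorers actually agree can be checked with a simple agreement statistic. A minimal sketch in Python using exact agreement between two raters on hypothetical essay scores (more rigorous statistics, such as Cohen's kappa, correct for chance agreement):

```python
def exact_agreement(rater1, rater2):
    """Proportion of performances to which two raters assign the same score."""
    matches = sum(a == b for a, b in zip(rater1, rater2))
    return matches / len(rater1)

# Hypothetical scores from two trained raters on a 1-5 essay scale:
# they agree on 3 of 4 essays.
print(exact_agreement([4, 3, 5, 2], [4, 3, 4, 2]))  # 0.75
```

Low agreement is a signal to revise the scoring guidelines or retrain the scorers before the scores are used for decisions.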

Issues Specific to Language Tests

Competence and Performance
• Competence is idealized knowledge: what someone would be able to do under ideal conditions (no fatigue, no distraction, full concentration)
• Performance is actual production: what someone does in a real-world situation
• We can only ever observe performance and try to infer competence from it
• Performance assessments try to avoid this inference by having test takers do real-world (or real-world-like) tasks

Controlled and automatic processing
• Automatized processing is fast and effortless, and it is unavoidable in listening and speaking
• In conversations, people have to comprehend and contribute quickly, in real time; otherwise they get lost
• Controlled processing is slower and takes more effort
• It is possible in reading and writing, where there is no pressure from an ongoing interaction (however, internet chat might be different)

Testing Language Skills
• Learners' L2 competence can be divided in various ways for the purposes of assessment, for example:
  – "building blocks": grammar, phonemes, vocabulary
  – Skills: listening, reading, writing, speaking
  – Notions & functions: complaining, describing, negotiating
  – Situations: on the phone, in a shop, in a lecture
  – Genres: letter writing, giving a speech, making small talk, …

• Testing skills and the language code is context-free, which is unnatural (language use always happens in context)
• However, skills are possibly applicable across contexts
• Testing functions & notions or situations / genres is more contextualized and closer to actual language use
• However, it requires knowing in advance how and where learners will use the language (needs analysis)
• Unlike skills, there is an unlimited number of notions, functions, and situations

Issues in Skills Assessments
• Cross-contamination: most tests assess several skills at the same time, e.g.:
  – a writing test where the instructions are written out also assesses reading to an extent
  – a listening test with multiple-choice questions also assesses reading
  – any speaking test done through an oral interview also assesses listening
  – a reading test with brief-response prompts also assesses writing

• Pervasive language code effects: low grammatical competence or lack of vocabulary will affect a test taker's performance in all four skills
• This leads to higher correlations between test sections, but it is also inefficient because the same attribute is measured several times
• It may also be unfair because a test taker is punished multiple times for a lack of ability in one area
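
The "higher correlations between sections" claim can be checked directly from section scores. A minimal sketch in Python computing a Pearson correlation over hypothetical reading and listening scores for four test takers:

```python
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient between two score lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical section scores: the shared language code (grammar,
# vocabulary) pushes the two sections to rank test takers similarly.
reading = [10, 20, 30, 40]
listening = [12, 18, 33, 41]
print(round(pearson(reading, listening), 2))  # 0.99
```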

Item Specifications
• item specs provide a blueprint for an item (type)
• this helps create more similar items (which increases reliability!) and replace overused or retiring items

Components of item specs
• GD (General Description): general description of the item
• PA (Prompt Attributes): description of the prompt: what will the test question look like?
• RA (Response Attributes): description of the response: what will the test taker have to do?
• SI (Sample Item): an example item
• [SS (Specification Supplement): additional information]
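
One way to keep specs systematic is to record them in a fixed structure, so no component is forgotten. A sketch in Python using a dataclass; the class name and field contents are our own illustration, condensing the R1 spec developed on the following slides:

```python
from dataclasses import dataclass

@dataclass
class ItemSpec:
    gd: str       # General Description
    pa: str       # Prompt Attributes
    ra: str       # Response Attributes
    si: str       # Sample Item
    ss: str = ""  # optional Specification Supplement

r1 = ItemSpec(
    gd="Tests reading comprehension: gist of a non-technical text.",
    pa="300-500 word non-technical text; one multiple-choice question.",
    ra="Test taker marks the best answer on the answer sheet.",
    si="[text on test validation] What is the main point of the article?",
)
print(r1.gd)
```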

General Description (GD)
• a concise summary of what the spec is about
• sketches out the ability or criterion for which the spec is supposed to produce tasks
• GDs can be quite general…
• (R1) This item tests reading comprehension. Test takers will understand the gist of a non-technical text.
• (W1) This item tests writing. Test takers will be able to write a summary of a genuine academic lecture on a social science topic.

• … or more specific…
• (R2) This item tests reading comprehension. The test takers will demonstrate their in-depth understanding of a non-technical text; specifically:
  – they will understand the gist of the text,
  – they will be able to extract specific information,
  – they will understand the author's stance towards the issue,

  – they will understand the logical structure of the text.
• (S1) This item tests speaking. Test takers will be able to bargain for a lower price in a shop setting.
• (G1) This item tests grammar. Test takers will be able to recognize the correct tense for talking about the past, present, and future.

Prompt Attributes (PA)
• describes the task / stimulus / elicitation procedure
• summarizes what the test taker will have to do
• (R1) The test taker will read a complete, self-contained non-technical text of between 300 and 500 words. The text should deal with an academic topic like social science, the environment, psychology, political science, or business, but it should be written for a non-specialist audience.

Articles from Time or Newsweek are often suitable. The question will be asked in a multiple-choice format with a one-sentence item stem and 4 response options:
– the correct answer
– an answer focusing on a minor point in the text
– an answer overreaching the text's main point
– an answer claiming the opposite of the main point

• (S1) The test taker will be interviewed by one tester. The test taker will be given a role-play card explaining the situation of bargaining in a shop. The text on the role-play card should specify what the object is and that the goal of the interaction is to reduce the price substantially. The tester will assume the role of the shopkeeper.

The object to be bargained over can be a vase, clothing, etc., and objects in the test environment can be used in place of the imaginary object. The tester will open the interaction by quoting a price and, in the course of the interaction, will try to keep the price as high as possible while keeping the interaction going. The interaction will not take more than 5 minutes.

Response Attributes (RA)
• RAs describe what the test taker will have to do
• (R1) The test taker will mark the best answer on the answer sheet.
• (S1) The test taker will claim that the price of the object is too high and will try to bargain for a lower price. S/he will respond to the tester's reactions and obtain a significantly reduced price. S/he will be comprehensible and use appropriate language, although minor errors that do not interfere with comprehension are acceptable.

Sample Item (SI)
(R1) [text on test validation]
What is the main point of the article?
1. Validation involves the collection of evidence of test usefulness.
2. Validation depends crucially on high reliability.
3. Validation is a political-educational enterprise.
4. Validation can be done through one of a number of methods.

Problems / Challenges
• GD and PA: don't specify the prompt attributes in the general description, e.g.:
  "This item tests reading comprehension. The test taker will summarize in one paragraph the content of a non-technical text from a popular science journal containing at least one graph or diagram."

• the GD should describe the criterion or ability: it should reflect the specific part of the construct that is being tested
• the GD should focus on general description, not the specifics of test materials and responses
• PA and RA: the PA should focus on the test materials and their effects, whereas the RA should focus on the test takers' actions or interactions with the materials

