Springer Texts in Education
Kaycheng Soh
Understanding Test and Exam Results Statistically
An Essential Guide for Teachers and School Leaders
Kaycheng Soh
Singapore
Singapore
Springer Texts in Education
ISSN 2366-7672
ISSN 2366-7980 (electronic)
ISBN 978-981-10-1580-9
ISBN 978-981-10-1581-6 (eBook)
DOI 10.1007/978-981-10-1581-6
Library of Congress Control Number: 2016943820
© Springer Science+Business Media Singapore 2016
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission
or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar
methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this
publication does not imply, even in the absence of a specific statement, that such names are exempt from
the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this
book are believed to be true and accurate at the date of publication. Neither the publisher nor the
authors or the editors give a warranty, express or implied, with respect to the material contained herein or
for any errors or omissions that may have been made.
Printed on acid-free paper
This Springer imprint is published by Springer Nature
The registered company is Springer Science+Business Media Singapore Pte Ltd.
On Good (And Bad) Educational Statistics
In Lieu of a Preface
There are three kinds of lies: lies, damned lies, and statistics.
We education people are honest people, but we often use test and examination
scores in such a way that the effect is the same as that of lies: the intention is
absent, but the falsity is not.
We count 10 correctly spelt words as if we were counting 10 apples. We count 10
correctly chosen words in an MCQ test as if we were counting 10 oranges. We count
10 correctly corrected sentences as if we were counting 10 pears. Then, we add
10 + 10 + 10 = 30. We then conclude that Ben has 30 fruits, called Language.
We do the same for something we call Math (meat), and for something we call Art, or
Music, or PE (snacks). We then add fruits, meat, and snacks and call the total
Overall (edible or food). We then make important decisions using the Overall.
When doing this honestly, sincerely, and seriously, we also assume that there is
no error in counting, be it done by this or another teacher (in fact, all teachers
concerned). We also make the assumption, tacitly though, that one apple is as good
as one orange, and one cut of meat as good as one piece of moachee. Right or
wrong, life has to go on. After all, this has been done as far back as the long
forgotten days of the little red house, and since this is a tradition, there must be
nothing wrong. So, why should we begin to worry now?
A few of my class scored high, a few low, and most of them somewhere in between, reported
Miss Lim on the recent SA1 performance of her class.
A qualitative description like this one fits almost all normal groups of students.
After hearing a few more descriptions like it, Miss Lim and her colleagues
were none the wiser about their students’ performance.
When dealing with the test or examination scores of a group of students, more
specific descriptions are needed. It is here that numbers are more helpful than
words. Such numbers, given the high-sounding name statistics, help to summarize
the situation and make discussion more focused. Even a single student’s test score
has to be seen in the context of the scores of other students who have taken the
same test for it to have any meaning.
Thus, statistics are good. But that is not the whole truth; there are also bad statistics.
That is why there are such interesting titles as these: Huff, D. (1954) How to Lie
with Statistics; Runyon, R.P. (1981) How Numbers Lie; Hooke, R. (1983) How to
Tell the Liars from the Statisticians; Holmes, C.B. (1990) The Honest Truth about
Lying with Statistics; Zuberi, T. (2001) Thicker than Blood: How Racial Statistics
Lie; Best, J. (2001) Damned Lies and Statistics; and Best, J. (2004) More
Damned Lies and Statistics: How Numbers Confuse Public Issues.
These interesting and skeptical authors wrote about social statistics, statistics
used by proponents and opponents to influence social policies. None deals with
educational statistics and how it has misled teachers and school leaders to make
irreversible decisions that influence the future of the student, the school, and even
the nation.
On the other hand, people also say “Statistics don’t lie but liars use statistics.”
Obviously, there are good statistics and there are bad statistics, and we need to be
able to differentiate between them.
Good statistics are numbers that simplify a messy mass of data, bring hidden
trends to the surface, help us understand them, and facilitate informed discussion
and sound policy-making. Bad statistics, on the other hand, do the opposite and
make things even murkier and messier than they already are. This can happen
unintentionally, through a lack of correct knowledge of statistics; bad statistics
in this sense are those unintentionally misused. A rational approach to statistics,
noting that they can be good or bad, is to follow Joel Best’s advice:
Some statistics are bad, but others are pretty good, and we need statistics—good statistics—
to talk sensibly about social problems. The solution, then, is not to give up on statistics, but
to become better judges of the numbers we encounter. We need to think critically about
statistics… (Best 2001, p. 6. Emphasis added)
In the educational context, increasing attention is being paid to statistics,
which are used for planning, evaluation, and research at different levels, from the
classroom to the boardroom. However, as the use of statistics has not been part of
traditional professional development programs, many users of educational
statistics pick up ideas here and there on the job. This is practical out of necessity,
but it leaves too much to chance, and poor understanding and misuse can spread
quickly.
The notes in this collection have one shared purpose: to rectify misconceptions
which have already acquired a life of their own and to prevent those yet to be
born. The problems, issues, and examples are familiar to teachers and school
administrators and hence should prove relevant to the daily handling of numbers in
the school office as well as the classroom. The notes discuss the uses and misuses of
descriptive statistics which school administrators and teachers have to use and
interpret in the course of their normal day-to-day work. Inferential statistics are
mentioned in passing but not covered extensively because in most cases they are
irrelevant to schools, which very seldom, if ever, have numbers collected
through a random process.
The more I wrote, the more I realized that many of the misconceptions and
misuses were actually caused by misunderstanding of something more fundamental:
educational measurement. Taking test scores too literally, obsession with
decimals, and seeing too much meaning in small differences are some cases in point.
Because educational statistics is intimately tied up with educational measurement
(much more so than other social statistics), misinterpretation of test and
examination scores (marks, grades, etc.) may have at its root a lack of awareness of the
peculiar nature of educational statistics. The root causes could be one or all of these:
1. Taking test scores literally as absolute when they are in fact relative.
2. Taking test scores as equivalent when they are not.
3. Taking test scores as error-free when error is very much part of them.
(Incidentally, “test score” will mean “test and examination scores” hereafter to
avoid the clumsiness.)
These arise from the combination of two conceptual flaws. The first is a lack of
understanding of levels of measurement. Highly fallible educational measurement
(e.g., test scores) is mixed up with far less fallible physical measurement
(e.g., weight or height), so that a test score of 50 is looked at as if it were the
same as 50 kg or 50 cm. The second is a blind faith in score reliability and validity,
that is, the belief that test scores are perfectly consistent and truthful.
This indicates a need to clarify several concepts relevant to reliability,
validity, item efficiency, and levels of tests. Above all these is the question of
the consequences of using test scores, especially for students and curriculum, the
two most critical elements in schooling: what happens to them?
Statistics can be learned for its own sake as a branch of mathematics. But, that is
not the reason for teachers and school leaders to familiarize themselves with it. In
the school context, statistics are needed for proper understanding of test and
examination results (in the form of scores). Hence, statistics and measurement need
to go hand in hand so that statistics are meaningful and measurement is understood.
In fact, while statistics can stand alone without educational measurement, the
educational measurement on which tests and examinations are based cannot do without
statistics.
Most books about tests and examinations begin with concepts of measurement
and have an appendix on statistics. In this book, statistical understanding of test
scores comes first, followed by more exposition of measurement concepts. The
reversed order reflects the belief that, without first knowing how to interpret test
scores, measurement is devoid of meaning.
Anyway, statistics is a language for effective communication. To build such a
common language among educational practitioners calls for willingness to give up
non-functioning notions and needs patience to acquire new meanings for old labels.
By the way, as the notes are not meant to be academic discourse, I take the
liberty to avoid citing many references to support the arguments (not argumentative
statements but just plain statements of ideas) and take for granted the teachers’ and
school leaders’ trust in my academic integrity. Of course, I maintain my intellectual
honesty as best I can, but I stand to be corrected where I do not intentionally lie.
I would like to record my appreciation for the anonymous reviewers for their
perceptive comments on the manuscript and their useful suggestions for its
improvement. Beyond this, errors and omissions are mine.
Reference
Best, J. (2001). Damned Lies and Statistics: Untangling Numbers from the Media, Politicians, and
Activists. Berkeley: University of California Press.
Contents
Part I Statistical Interpretation of Test/Exam Results
1 On Average: How Good Are They?
1.1 Average Is Attractive and Powerful
1.2 Is Average a Good Indicator?
1.2.1 Average of Marks
1.2.2 Average of Ratings
1.3 Two Meanings of Average
1.4 Other Averages
1.5 Additional Information Is Needed
1.6 The Painful Truth of Average
2 On Percentage: How Much Are There?
2.1 Predicting with Non-perfect Certainty
2.2 Danger in Combining Percentages
2.3 Watch Out for the Base
2.4 What Is in a Percentage?
2.5 Just Think About This
Reference
3 On Standard Deviation: How Different Are They?
3.1 First, Just Deviation
3.2 Next, Standard
3.3 Discrepancy in Computer Outputs
3.4 Another Use of the SD
3.5 Standardized Scores
3.6 Scores Are not at the Same Type of Measurement
3.7 A Caution
Reference
4 On Difference: Is that Big Enough?
4.1 Meaningless Comparisons
4.2 Meaningful Comparison
4.3 Effect Size: Another Use of the SD
4.4 Substantive Meaning and Spurious Precision
4.5 Multiple Comparison
4.6 Common but Unwarranted Comparisons
References
5 On Correlation: What Is Between Them?
5.1 Correlations: Foundation of Education Systems
5.2 Correlations Among Subjects
5.3 Calculation of Correlation Coefficients
5.4 Interpretation of Correlation
5.5 Causal Direction
5.6 Cautions
5.7 Conclusion
Reference
6 On Regression: How Much Does It Depend?
6.1 Meanings of Regression
6.2 Uses of Regression
6.3 Procedure of Regression
6.4 Cautions
7 On Multiple Regression: What Is the Future?
7.1 One Use of Multiple Regression
7.2 Predictive Power of Predictors
7.3 Another Use of Multiple Regression
7.4 R-Square and Adjusted R-Square
7.5 Cautions
7.6 Concluding Note
References
8 On Ranking: Who Is the Fairest of Them All?
8.1 Where Does Singapore Stand in the World?
8.2 Ranking in Education
8.3 Is There a Real Difference?
8.4 Forced Ranking/Distribution
8.5 Combined Scores for Ranking
8.6 Conclusion
9 On Association: Are They Independent?
9.1 A Simplest Case: 2 × 2 Contingency Table
9.2 A More Complex Case: 2 × 4 Contingency Table
9.3 Even More Complex Case
9.4 If the Worse Come to the Worse
9.5 End Note
References
Part II Measurement Involving Statistics
10 On Measurement Error: How Much Can We Trust Test Scores?
10.1 An Experiment in Marking
10.2 A Score (Mark) Is not a Point
10.3 Minimizing Measurement Error
10.4 Does Banding Help?
Reference
11 On Grades and Marks: How not to Get Confused?
11.1 Same Label, Many Numbers
11.2 Two Kinds of Numbers
11.3 From Labels to Numbers
11.4 Possible Alternatives
11.5 Quantifying Written Answers
11.6 Still Confused?
Reference
12 On Tests: How Well Do They Serve?
12.1 Summative Tests
12.2 Selection Tests
12.3 Formative Tests
12.4 Diagnostic Tests
12.5 Summing up
References
13 On Item-Analysis: How Effective Are the Items?
13.1 Facility
13.2 Discrimination
13.3 Options Analysis
13.4 Follow-up
13.5 Post-assessment Analysis
13.6 Concluding Note
Reference
14 On Reliability: Are the Scores Stable?
14.1 Meaning of Reliability
14.2 Factors Affecting Reliability
14.3 Checking Reliability
14.3.1 Internal Consistency
14.3.2 Split-Half Reliability
14.3.3 Test–Retest Reliability
14.3.4 Parallel-Forms Reliability
14.4 Which Reliability and How Good Should It Be?
15 On Validity: Are the Scores Relevant?
15.1 Meaning of Validity
15.2 Relation Between Reliability and Validity
Reference
16 On Consequences: What Happens to the Students, Teachers, and Curriculum?
16.1 Consequences to Students
16.2 Consequences to Teachers
16.3 Consequences to Curriculum
16.4 Conclusion
References
17 On Above-Level Testing: What’s Right and Wrong with It?
17.1 Above-Level Testing in Singapore
17.2 Assumed Benefits
17.3 Probable (Undesirable) Consequences
17.4 Statistical Perspective
17.5 The Way Ahead
17.6 Conclusion
References
18 On Fairness: Are Your Tests and Examinations Fair?
18.1 Dimensions of Test Fairness
18.2 Ensuring High Qualities
18.3 Ensuring Test Fairness Through Item Fairness
References
Epilogue
Appendix A: A Test Analysis Report
Appendix B: A Note on the Calculation of Statistics
Appendix C: Interesting and Useful Websites
About the Author
Dr. Kaycheng Soh (1934) studied for Diploma in Educational Guidance (1965)
and Master in Education (Psychology) at the University of Manchester, UK (1970)
and was conferred the Doctor of Philosophy by the National University of
Singapore (1985) for his research on child bilingualism.
Dr. Soh started as a primary school teacher and principal, then became a teacher
educator of long-standing, and later held senior positions in the Ministry of
Education and consulted on social surveys with other Ministries in Singapore. He
served as a consultant to the Hong Kong SAR Education Bureau on the revision of its
school appraisal indicator systems. After retirement from the National Institute of
Education, Nanyang Technological University, Singapore, he actively promoted
classroom-based action research and conducted workshops for schools and the
ministry. Currently, he is the Research Consultant of the Singapore Centre for
Chinese Language.
His research focuses on creativity, language teaching, and world university
rankings, and his articles were published in international learned journals. Examples
of his recent publications are as follows:
• Soh, Kaycheng (2015). Creativity fostering teacher behavior around the world:
Annotations of studies using the CFTIndex. Cogent Education, 1−8.
This summarizes studies using the Creativity Fostering Teacher Behavior Index
he crafted and published in the Journal of Creative Behavior. The scale has been
translated into several languages and used for Ph.D. dissertations.
• Soh, Kaycheng (2013). Social and Educational Ranking: Problems and
Prospects. New York: Untested Ideas Research Centre.
The chapters are based on his journal articles dealing with several methodological and statistical issues in world university rankings and other social
rankings.
• Soh, Kaycheng, Ed. (2016). Teaching Chinese Language in Singapore:
Retrospect and Challenges. Springer.
This monograph covers many aspects of the teaching of Chinese Language in
the Singapore context, including its past, present, and future, and several surveys of
teacher perceptions, teaching strategies, and assessment literacy.
Part I
Statistical Interpretation of Test/Exam Results
Chapter 1
On Average: How Good Are They?
At the end of a jetty, there is this signboard:
WARNING
Average depth 5 meters within 50 meters
So, he dived and got a bump on the forehead.
1.1 Average Is Attractive and Powerful
The average is so attractively simple and so powerfully persuasive that we
accept it without much thought. Average is attractive because it is simple. It is
simple because it simplifies.
During the department’s post-examination meeting, performances of classes
were to be evaluated. Miss Tan reported, “My class has two 45, four 52, seven 60,
ten 68, …” The HOD stopped her at this point, “Miss Tan, can you make it
simpler?” “Yes, the average is 74.” The other teachers took turns to report: “My
class has an average of 68”; “Mine is 72”; … and “My class scored the highest; the
average is 91.” That is the magic of average. It simplifies reporting and makes
comparison and the ensuing discussion much more convenient.
The average is, of course, the total of all the scores of the students in a class
divided by the number of students in that class. Arithmetically, mathematically, or
statistically (depending on whether you like simple or big words), an average represents
the general tendency of a set of scores and, at the same time, ignores the differences
among them. Laymen call this the average; statisticians call it the mean. Average or
mean, it is an abstraction of a set of marks that represents the whole set by just
one number. Implicitly, the differences among marks are assumed to be
unimportant. The average also ignores the possibility that none of the students has
actually obtained the mark called the average. The power of the average comes from its
ability to make life easier and discussion possible. If not for the average (mean), all
teachers would have to report the way Miss Tan first did!
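The arithmetic just described can be sketched in a few lines of Python. This is only an illustration: the four marks below are hypothetical, chosen for the example, and are not taken from Miss Tan's class or from any table in this book.

```python
# Hypothetical marks for four students (illustrative only, not from the text).
marks = [50, 60, 70, 80]

# The average (mean) is the total of all marks divided by the number of students.
average = sum(marks) / len(marks)

# It is possible that no student actually obtained the mark called "average".
nobody_scored_average = average not in marks

print(average)                # 65.0
print(nobody_scored_average)  # True
```

Here the single number 65 stands for the whole set, although none of the four students scored 65.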
1.2 Is Average a Good Indicator?
It depends. Four groups of students took the same test (Table 1.1). All groups have
an average of 55. Do you think we can teach the groups the same way simply
because they have the same average?
1.2.1 Average of Marks
It is unlikely in classroom reality that all students get the same score, as in
Group A. The point is that if a group is very homogeneous, all its students can be
taught in the same way, and one size may fit all. Group B has students who are below
or around the average but with one who scores extremely high when compared with the rest.
Group C, on the other hand, has more students above the average but with one
scoring extremely low. Group D has scores spreading evenly over a wide range.
Obviously, the average is not a good indicator here because the scores spread
around the average in different ways, signaling that the groups are not the same in
the ability tested. Such subtle but important differences are masked by the average.
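The point can be checked with a short Python sketch using the marks in Table 1.1. The range (highest mark minus lowest mark) is used here only as a rough, illustrative measure of spread; it is an assumption of this sketch, not a measure introduced in the text at this point.

```python
# Marks from Table 1.1 (five students per group).
groups = {
    "A": [55, 55, 55, 55, 55],
    "B": [45, 45, 45, 55, 85],
    "C": [15, 60, 60, 70, 70],
    "D": [40, 50, 55, 60, 70],
}

def mean(marks):
    """Total of the marks divided by the number of marks."""
    return sum(marks) / len(marks)

def mark_range(marks):
    """Highest mark minus lowest mark: a rough indicator of spread (illustrative choice)."""
    return max(marks) - min(marks)

# For each group, store (average, range).
summary = {name: (mean(m), mark_range(m)) for name, m in groups.items()}

for name, (avg, spread) in summary.items():
    print(f"Group {name}: average = {avg}, range = {spread}")
```

All four groups come out with the same average of 55, yet their ranges differ (0, 40, 55, and 30 for Groups A to D), which is exactly the difference the average hides.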
Table 1.1 Test marks and averages

            Group A   Group B   Group C   Group D
Student 1   55        45        15        40
Student 2   55        45        60        50
Student 3   55        45        60        55
Student 4   55        55        70        60
Student 5   55        85        70        70
Average     55        55        55        55

Note: Students with the same number are different persons in different groups.

1.2.2 Average of Ratings
Assessment rubrics have become quite popular with teachers. So, let us take a
realistic example of rubric ratings. Two teachers assessed three students on oral
presentation. A generic five-point rubric was used. As is commonly done, the marks
awarded by the two teachers were averaged for each student (Table 1.2).

Table 1.2 Assessment marks and averages

Student   Teacher A   Teacher B   Average
X         3           3           3
Y         2           4           3
Z         1           5           3
Using the rubric independently, both teachers awarded a score of 3 to Student X;
the average is 3. Teacher A awarded a score of 2 to Student Y who got a score of 4
from Teacher B; the average is also 3. Student Z was awarded scores of 1 and 5 by
Teacher A and Teacher B, respectively; the average is again 3. Now that all three
students scored the same average of 3, do you think they are the same kind of
students? Do the differences in the marks awarded to the same student (e.g., 2 and 4
for Student Y) worry you? Obviously, the average is not a good
indicator because the two teachers did not see Student Y in the same way. They
also did not see Student Z in the same way. Incidentally, this is a question of
inter-rater consistency or reliability. In this example, the rating for Student X is
the most trustworthy and that for Student Z cannot be trusted because the two teachers
did not see eye to eye in this case.
On the five-point scale, the points are usually labeled as 1 = Poor, 2 = Weak,
3 = Average, 4 = Good, and 5 = Excellent. Thus, all three students were rated as
average, but they are different kinds of “average” students.
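A small Python sketch makes the disagreement visible, using the ratings in Table 1.2. The "gap" between the two teachers' ratings is an illustrative measure chosen for this sketch; it is not a formal index of inter-rater reliability.

```python
# Rubric ratings from Table 1.2 as (Teacher A, Teacher B) on a five-point scale.
ratings = {"X": (3, 3), "Y": (2, 4), "Z": (1, 5)}

averages = {}
gaps = {}
for student, (teacher_a, teacher_b) in ratings.items():
    # Average of the two teachers' ratings, as is commonly done.
    averages[student] = (teacher_a + teacher_b) / 2
    # Absolute difference between the two ratings (illustrative disagreement measure).
    gaps[student] = abs(teacher_a - teacher_b)

for student in ratings:
    print(f"Student {student}: average = {averages[student]}, gap = {gaps[student]}")
```

Every student has an average of 3, but the gap grows from 0 for Student X to 2 for Student Y and 4 for Student Z, so the same average rests on very different levels of agreement.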
1.3 Two Meanings of Average
In the rubric assessment example, average is used with two different though related
meanings. The first is the usual one when marks are added and then divided by the
number of, in this case, teachers. This, of course, is the mean, which is its statistical
meaning because it is the outcome of a statistical operation.
Average has a second meaning when, for instance, Mrs. Lee says, “Ben is an
average student” or when Mr. Tan describes his class as an “average class.” Here,
they used average to mean neither good nor weak, just like most other students or
classes, or nothing outstanding but also nothing worrisome. In short, average here
means typical or ordinary. Here, average is a relative label and its meaning depends
on the experiences or expectations of Mrs. Lee and Mr. Tan.
If Mrs. Lee has been teaching in a prime school, her expectation is high and Ben
is just like many other students in this school. Since Ben is a typical student in that
school, he is in fact a good or even excellent student when seen in the context of the
student population at the same class level in Singapore, or any other country.
Likewise, if there are, say, five classes at the same class level formed by ability
grouping in Mr. Tan’s school, then his so-called average class is the one in the
middle or thereabouts, that is, class C among classes A to E. Moreover, Mr. Tan’s
average class may be an excellent or a poor one in a different school, depending on
the academic standing of the school. By the same token, an average teacher in one
school may be a good or poor one in another school.
In short, average is not absolute but relative. Up to this point, we have noticed
that classes having the same average may not be the same in ability or achievement.
We have also seen that students awarded the same average marks may not have
been assessed in the same way by different teachers. The implication is that we
should not trust the average alone as an indicator of student ability or performance;
we need more information. An average standing alone can misinform and
mislead. Obviously, we need other information to help us make sense of an average.
What, then, do we need?
1.4
Other Averages
Before answering the question, one more point needs to be made. What we have
been talking about as average is only one of the several averages used in educational statistics. The average we have been discussing up to now should strictly be
called the arithmetic mean. There is also another average called the mode; it is
simply the most frequently appearing mark(s) in a set. For example, 45 appears
three out of five times in Group B; since it is the most frequent mark, it is the mode.
The mode is a quick and rough indicator of average performance, used for a quick
glance.
A more frequently used alternative to the arithmetic mean is the median. When a
set of marks is ordered, the middlemost mark is the median. For example, in Table 1.1,
when the scores of Group D are sequenced from the lowest to the highest, the middlemost
mark is 55, and it is the median of the set of five marks. Statistically, the median is a
better average than the arithmetic mean when a set of marks is “lopsided,” or,
statistically speaking, skewed. This happens when a test is too easy for a group of
students, resulting in too many high scores. The other way round is also true when a
test is too difficult and there are too many low scores. In either of these situations,
the median is a better representation of the scores.
Another situation when the median is a better representation is when there are one
or more extremely high (or low) scores and there is a large gap between such scores
and the rest. In Table 1.1, Group C has an unusually low score of 15 when the other
scores are around 65 (the mean of 60s and 70s). In this case, the mean of 55 is not
as good as the median of 60 (the middlemost score) to represent the group since 55
is an underestimation of the performance of the five students. Had Bill Gates joined
our teaching profession, the average salary of teachers, in Singapore or any other
country, would run into billions!
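The three averages can be compared with a few lines of Python. This is only a sketch: Table 1.1 is not reproduced in this section, so the marks below are assumed values, chosen to agree with the figures quoted in the text (Group B with 45 appearing three times out of five and a mean of 55; Group C with a low mark of 15, a mean of 55, and a median of 60).

```python
from statistics import mean, median, mode

# Assumed marks (Table 1.1 is not shown here); chosen to match the
# figures quoted in the text.
group_b = [45, 45, 45, 55, 85]
group_c = [15, 60, 60, 70, 70]

mode_b = mode(group_b)      # most frequent mark in Group B
mean_c = mean(group_c)      # pulled down by the single mark of 15
median_c = median(group_c)  # middlemost mark, unaffected by the 15

print(f"Group B mode: {mode_b}")
print(f"Group C mean: {mean_c}, median: {median_c}")
```

Changing the 15 to an even lower mark lowers the mean further but leaves the median where it is, which is why the median is preferred when a set of marks has an extreme score.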
1.5
Additional Information Is Needed
Let us go back to the question of the additional information we need to properly
understand and use an average.
What we need is an indication of the spread of the marks so that we know not
only what a representative mark (average or mean) is but also how widely or
narrowly the marks are spread around the average. The simplest indicator of the
spread (or variability) is the range; it is simply the difference between the highest
and the lowest marks. In Table 1.1, the range for Group A is zero since every mark
is the same 55, so the highest and lowest marks coincide. For Group B, the range is
85 − 45 = 40. For Group C, it is 70 − 15 = 55, and for Group D, 70 − 40 = 30.
What do these ranges tell us? Group A (0) is the most homogeneous, followed
by Group D (30), then Group B (40), and finally the most heterogeneous Group C
(55). As all teachers know, heterogeneous classes are more challenging to teach
because it is more difficult to pitch the teaching at a level that suits most if not all
students, since they differ so much in ability or achievement. The opposite is
true for homogeneous classes. Thus, if we look only at the averages of the classes,
we will misunderstand the different learning capabilities of the students.
While the range is another quick and rough statistic (to be paired with the mode),
the standard deviation (SD) is a formal statistic (to be elaborated in Chap. 3, On
Standard Deviation). Leaving the tedious calculation to software (in this case,
Excel), we can get the SDs for the four groups. We can then create a table (Table 1.3)
to facilitate a more meaningful discussion at the post-examination meeting.
Table 1.3 drops the individual marks of the students but presents the essential
descriptive statistics useful for discussing examination results. It shows for each
group the lowest (Min) and the highest (Max) marks, the range (Max–Min), the
mean, and the SD. Thus, the discussion will be not only about the average performance of each class but also about how much the classes and students differed in their
examination results.
You must have noticed that Group A has the smallest range (0) and the smallest SD
(0.0). On the other hand, Group C has the greatest range (55) and the greatest SD
(22.9). The other two groups have the “in-between” ranges and the “in-between”
SDs. Yes, you are right. In fact, there is a perfect match between ranges and SDs
among the groups. Since both the range and the SD are indications of the spread of
marks, this high consistency between them is expected. In short, the group with the
greatest range also has the greatest SD, and vice versa.
We will discuss this further in Chap. 3.
Table 1.3 Descriptive statistics for four groups

Group   Min   Max   Range   Mean   SD
A       55    55    0       55     0.0
B       45    85    40      55     17.3
C       15    70    55      55     22.9
D       40    70    30      55     11.2
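The figures in Table 1.3 can be reproduced with a short Python sketch instead of a spreadsheet. Two assumptions are made here: the individual marks are illustrative values (Table 1.1 is not shown in this section) chosen so that the summaries agree with Table 1.3, and the SD is taken to be the sample SD (the n − 1 divisor), which is what gives the tabled values.

```python
from statistics import mean, stdev

# Assumed marks (Table 1.1 is not shown here); chosen so that the
# summary statistics agree with Table 1.3.
groups = {
    "A": [55, 55, 55, 55, 55],
    "B": [45, 45, 45, 55, 85],
    "C": [15, 60, 60, 70, 70],
    "D": [40, 50, 55, 60, 70],
}


def describe(marks):
    """Return (min, max, range, mean, sample SD) for a list of marks."""
    lowest, highest = min(marks), max(marks)
    return lowest, highest, highest - lowest, mean(marks), round(stdev(marks), 1)


summary = {name: describe(marks) for name, marks in groups.items()}

for name, (lowest, highest, spread, average, sd) in summary.items():
    print(f"{name}: min={lowest} max={highest} range={spread} mean={average} SD={sd}")
```

All four groups share the same mean, so only the range and the SD columns distinguish them.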
1.6
The Painful Truth of Average
Before we leave the average to talk more about the SD, one final and important
point needs to be mentioned. When Professor Frank Warburton of the University of
Manchester (which was commissioned to develop the British Intelligence Scale)
was interviewed on BBC about the measurement of intelligence, he did not know
that what he said was going to shock the British public. A newspaper the
next day printed something like “Prof. Warburton says half of the British population is below average in intelligence.” (We can say the same about our Singapore
population.)
Prof. Warburton was telling the truth, nothing but the truth. The plain fact to him
(and those of us who have learned the basics of statistics) is that, by definition, the
average intelligence score (IQ 100) of a large group of unselected people is a point
on the intelligence scale that separates the top 50 % who score at or higher than the
mean (average) and the bottom 50 % who score below it. He did not mean to shock
and said nothing shocking; it was just that the British public (or rather, the newsmen) at that time interpreted average using its layman’s meaning. By the way, when
the group is large and the scores are normally distributed, the arithmetic mean and
the median coincide at the same point.
This takes us to another story. An American candidate for governor of a particular state promised his electorate that if he was returned to office, he would
guarantee that all schools in the state would become above average. We do not know
whether the voters believed him. They should not have, because the candidate could
not possibly keep his promise. The simple reason is that, statistically speaking,
when all schools in his state are uplifted, the average (mean) moves up accordingly
and there will always be schools, roughly half of them, below the state average! If he did not know
this, he made a sincere mistake; otherwise, he lied with an educational statistic.
Chapter 2
On Percentage: How Much Are There?
The principal Mrs. Fang asked, “How good is the chance that our Chinese orchestra will get
a gold medal in the Central Judging?”
The instructor Mr. Zhang replied, “Probably, 95 % chance.”
Mrs. Fang said, “That is not good enough, we should have 105 % chance.”
Obviously, there is some confusion between the concepts of percentage and probability in this short conversation. Here, percentage is used to express the expectation of certainty about an upcoming event. Both the principal and instructor spoke
about figures figuratively. Statistically, percentage as used here does not make
sense. What Mr. Zhang said was that there was a very high chance (probability) of
success, but Mrs. Fang expected more than absolute certainty of success (a probability
of p = 1.05!).
The percentage is one of the most frequently used statistics in the school and in
daily life. It could very well also be the most misunderstood and misused statistic.
2.1
Predicting with Non-perfect Certainty
When 75 of 100 students passed an examination, the passing rate is 75 %. This of
course is derived, thus
100 % × (No. of passing students)/(No. of students sat for the exam)
= 100 % × (75/100)
= 100 % × (0.75)
= 75 %
© Springer Science+Business Media Singapore 2016
K. Soh, Understanding Test and Exam Results Statistically,
Springer Texts in Education, DOI 10.1007/978-981-10-1581-6_2
In a sense, the percentage is a special kind of arithmetic mean (or what is
known as the average, in school language) where the scores obtained by students
are either 1 (pass) or 0 (fail).
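This can be checked directly. The sketch below codes the 100 students from the example as 1 (pass) or 0 (fail) and takes their mean; the variable names are ours, not the book's.

```python
from statistics import mean

# 1 = pass, 0 = fail, for the 100 students in the example above.
results = [1] * 75 + [0] * 25

# The mean of the 0/1 scores, expressed as a percentage.
pass_rate = 100 * mean(results)
print(f"passing rate: {pass_rate} %")
```

The mean of the 0/1 scores is simply the proportion of passes, so multiplying it by 100 gives the passing rate.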
Because the student intakes over the years are likely to be about the same, we can
say that our students have a 75 % chance of passing the same kind of examination,
or thereabouts. We are here using past experience to predict future happenings.
However, our prediction based on one year’s experience may not be exact because
there are many uncontrolled factors influencing what will really happen in the
following years. If it turns out to be 78 %, we have a fluctuation (statistically called
error, though not a mistake) of 3 percentage points in our prediction. The size of such error
depends on which year’s percentage we use as a basis of prediction.
Knowing that the percentages vary from year to year, it may be wiser of us to
take the average of a few years’ percentages as the basis of prediction instead of just
one year’s. Let’s say, over the past five years, the percentages are 73, 78, 75, 76,
and 74 %, and their average is 75.2 or 75 % after rounding. We can now say that,
“Based on the experience of the past five years, our students will have around 75 %
passes in the following year.” When we use the word around, we allow ourselves a
margin of error (fluctuation) in the prediction.
But the word around is vague. We need to set the upper and lower limits to that
error. We then add to and subtract from the predicted 75 % a margin.
What then is this margin? One way is to use the average deviation of the five
percentages, calculated as shown in Table 2.1. First, we find the average percentage
(75.2 %). Next, we find for each year its deviation from the five-year average; for
example, for the first year, the deviation is (73 % − 75.2 %) = −2.2 %. For all five years,
the average deviation is 0.0 because the positive and negative deviations cancel out, and this does not help.
We therefore take the absolute deviation of each year, for example, |−2.2 %| = 2.2 %. The
average of the absolute deviations is (7.2 %/5) = 1.44 %. Adding 1.44 % to the
predicted 75 %, we get 76.44 % or, after rounding, 76 %. On the other hand,
subtracting 1.44 % from 75 %, we get 73.56 % or 74 % (after rounding). Now we say,
“Based on the experience of the past five years, our students are likely to have
between 74 and 76 % passes next year.” This is a commonsensical way of making
a prediction and at the same time allowing for fluctuation.
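The calculation just described (and tabulated in Table 2.1) can be sketched in Python as follows. The five percentages come from the text; the variable names are ours.

```python
from statistics import mean

passes = [73, 78, 75, 76, 74]  # pass percentages for the five years

average = mean(passes)                       # five-year average
deviations = [p - average for p in passes]   # signed deviations from the average
mean_deviation = mean(deviations)            # positives and negatives cancel out
mean_abs_deviation = mean([abs(d) for d in deviations])

predicted = round(average)                   # the rounded prediction used in the text
lower = round(predicted - mean_abs_deviation)
upper = round(predicted + mean_abs_deviation)

print(f"average = {average:.1f} %, margin = {mean_abs_deviation:.2f} %")
print(f"likely between {lower} and {upper} % passes")
```

Because the work is done in floating point, the signed deviations may average to a number that is only approximately zero, which is why the values are formatted rather than printed raw.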
Table 2.1 Absolute average deviation

Year      Passes %   Deviation %   Absolute deviation %
1997      73         −2.2          2.2
1998      78         2.8           2.8
1999      75         −0.2          0.2
2000      76         0.8           0.8
2010      74         −1.2          1.2
Average   75.2       0.0           1.44

A more formal way is to use the standard deviation (SD) in place of the average
absolute deviation. Once the SD has been calculated for the five years’ percentages,
we use it to allow for fluctuations. If we are happy to be 95 % sure, the limits will
be 71 and 79 %. We then can say, “Based on the experience of the past five years,
we have 95 % confidence that our students are likely to have between 71 and 79 %
passes next year.” (See Chap. 3, On Standard Deviation.) Using the SD is a more
formal statistical approach because this is done with reference to the normal distribution curve, assuming that the five years’ percentages together form a good
sample of a very large number of percentages of passes of the school’s students.
Statistically speaking, the 95 % is a level of confidence, and the 71–79 % limits
together form the confidence interval. Now, for the 99 % level of confidence,
what are the limits forming the corresponding confidence interval?
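The SD-based limits can be sketched in Python as well. The text does not spell out the calculation, so this assumes the limits are the mean plus and minus z times the sample SD (n − 1 divisor), with z taken from the normal curve (about 1.96 for 95 %); under that assumption the stated limits of 71 and 79 % are reproduced. The function name `confidence_limits` is ours.

```python
from statistics import NormalDist, mean, stdev

passes = [73, 78, 75, 76, 74]  # pass percentages for the five years


def confidence_limits(values, level=0.95):
    """Return rounded (lower, upper) limits as mean -/+ z * sample SD."""
    # Two-sided z-value for the chosen level, e.g. about 1.96 for 95 %.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    margin = z * stdev(values)  # stdev uses the n - 1 divisor
    centre = mean(values)
    return round(centre - margin), round(centre + margin)


limits_95 = confidence_limits(passes, 0.95)
print(f"95 % limits: {limits_95[0]} to {limits_95[1]} % passes")
```

Calling `confidence_limits(passes, 0.99)` gives the limits for the 99 % level asked about above; a higher level of confidence always produces a wider interval.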
2.2
Danger in Combining Percentages
In the above example, we assumed that the cohorts are of the same size, or at least
very close in size (which is a more realistic assumption). However, if the group sizes are
rather different, then averaging the percentages is misleading. Table 2.2 shows for
two groups the numbers of students and passes and the percentage of passes for each group. If we add
the two percentages and divide the sum by two, (75 % + 50 %)/2, the average is
62.5 %. However, if the total number of passes is divided by the total number of
students, (80/120), the result is 66.7 %. This reminds us that when group sizes
differ, averaging percentages to get an average percentage is misleading.
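The two ways of combining the percentages in Table 2.2 can be compared side by side. The sketch below uses the figures from the table; the third calculation, weighting each percentage by its group size, is our addition to show why the pooled figure is the right one.

```python
# Figures from Table 2.2: students and passes in each group.
students = {"A": 80, "B": 40}
passes = {"A": 60, "B": 20}

# Percentage of passes within each group.
group_pct = {g: 100 * passes[g] / students[g] for g in students}

# Simple average of the two percentages: ignores group size.
simple_average = sum(group_pct.values()) / len(group_pct)

# Pooled percentage: total passes over total students.
pooled = 100 * sum(passes.values()) / sum(students.values())

# Weighting each percentage by its group size recovers the pooled figure.
weighted = sum(group_pct[g] * students[g] for g in students) / sum(students.values())

print(f"simple = {simple_average:.1f} %, pooled = {pooled:.1f} %, weighted = {weighted:.1f} %")
```

The simple average treats the 40 students in Group B as if they counted as much as the 80 in Group A; once the group sizes are used as weights, the two methods agree.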
It is a well-documented fact that generally boys do better in mathematics while
girls do better in language. In statistical terms, there is a sex–subject interaction which needs
to be taken into account when discussing achievement in such gender-related subjects.
In this example, sex is a confounding or lurking variable which cannot be ignored if
proper understanding is desired.
Incidentally, Singapore seems to be an exception where mathematics is concerned. In the 1996 Trends in International Mathematics and Science Study
(TIMSS), Singapore together with Hong Kong, Japan, and Korea headed the world
list in mathematics. However, a secondary analysis (Soh and Quek 2001) found that
Singapore girls outperformed their counterparts in the other three Asian nations,
while boys of all four countries performed on par with one another. This is another
example of Simpson’s paradox. The Singaporean girls’ advantage
shows up again in the TIMSS 2007 Report, while boys of Taipei, Hong Kong, and
Japan scored higher than Singapore’s boys. (Korea did not take part in
the 2007 study.)
Table 2.2 Calculation of percentages

Group   Number of students   No. of passes   % of passes
A       80                   60              75
B       40                   20              50
Total   120                  80              62.5 or 66.7?