Classification Systems in Orthopaedics

Donald S. Garbuz, MD, MHSc, FRCSC, Bassam A. Masri, MD, FRCSC,
John Esdaile, MD, MPH, FRCPC, and Clive P. Duncan, MD, FRCSC

Abstract

Classification systems help orthopaedic surgeons characterize a problem, suggest a potential prognosis, and offer guidance in determining the optimal treatment method for a particular condition. Classification systems also play a key role in the reporting of clinical and epidemiologic data, allowing uniform comparison and documentation of like conditions. A useful classification system is reliable and valid. Although the measurement of validity is often difficult and sometimes impractical, reliability—as summarized by intraobserver and interobserver reliability—is easy to measure and should serve as a minimum standard for validation. Reliability is measured by the kappa value, which distinguishes true agreement of various observations from agreement due to chance alone. Some commonly used classifications of musculoskeletal conditions have not proved to be reliable when critically evaluated.

J Am Acad Orthop Surg 2002;10:290-297

Dr. Garbuz is Assistant Professor, Department of Orthopaedics, University of British Columbia, Vancouver, BC, Canada. Dr. Masri is Associate Professor and Head, Division of Reconstructive Orthopaedics, University of British Columbia. Dr. Esdaile is Professor and Head, Division of Rheumatology, University of British Columbia. Dr. Duncan is Professor and Chairman, Department of Orthopaedics, University of British Columbia.

Reprint requests: Dr. Garbuz, Laurel Pavilion, Third Floor, 910 West Tenth Avenue, Vancouver, BC, Canada V5Z 4E3.

Copyright 2002 by the American Academy of Orthopaedic Surgeons.

Classifications of musculoskeletal
conditions have at least two central
functions. First, accurate classifica-
tion characterizes the nature of a
problem and then guides treatment
decision making, ultimately im-
proving outcomes. Second, accu-
rate classification establishes an
expected outcome for the natural
history of a condition or injury, thus
forming a basis for uniform report-
ing of results for various surgical
and nonsurgical treatments. This
allows the comparison of results
from different centers purportedly
treating the same entity.
A successful classification system
must be both reliable and valid.
Reliability reflects the precision of a
classification system; in general, it
refers to interobserver reliability, the
agreement between different ob-
servers. Intraobserver reliability is
the agreement of one observer’s re-
peated classifications of an entity.
The validity of a classification
system reflects the accuracy with
which the classification system
describes the true pathologic pro-
cess. A valid classification system
correctly categorizes the attribute of
interest and accurately describes the
actual process that is occurring.1 To
measure or quantify validity, the
classification of interest must be
compared to some “gold standard.”
If the surgeon is classifying bone
stock loss prior to revision hip ar-
throplasty, the gold standard could
potentially be intraoperative assess-
ment of bone loss. Validation of the
classification system would require
a high correlation between the pre-
operative radiographs and the intra-
operative findings. In this example,
the radiographic findings would be
considered “hard” data because dif-
ferent observers can confirm the
radiographic findings. Intraopera-
tive findings, on the other hand,
would be considered “soft” data be-
cause independent confirmation of
this intraoperative assessment is
often impossible. This problem with
the validation phase affects many
commonly used classification sys-
tems that are based on radiographic
criteria, and it introduces the ele-
ment of observer bias to the valida-
tion process. Because of the difficulty
of measuring validity, it is critical
that classification systems have at
least a high degree of reliability.
Assessment of Reliability
Classifications and measurements
in general must be reliable to be
assessed as valid. However, be-
cause confirming validity is diffi-
cult, many commonly used classification systems can be shown to be
reliable yet not valid. On preopera-
tive radiographs of a patient with a
hip fracture, for example, two ob-
servers may categorize the fracture
as Garden type 3. This measure-
ment is reliable because of interob-
server agreement. However, if the
intraoperative findings are of a
Garden type 4 fracture, then the
classification on radiographs, al-
though reliable, is not valid (ie, is
inaccurate). A minimum criterion
for the acceptance of any classifica-
tion or measurement, therefore, is a
high degree of both interobserver
and intraobserver reliability. Once
a classification system has been
shown to have acceptable reliability,
then testing for validity is appropri-
ate. If the degree of reliability is
low, however, then the classification
system will have limited utility.
Initial efforts to measure reliabil-
ity looked only at observed agree-
ment—the percentage of times that
different observers categorized
their observations the same. This
concept is illustrated in Figure 1, a
situation in which the two sur-
geons agree 70% of the time. In
1960, Cohen2 introduced the kappa
value (or kappa statistic) as a mea-
sure to assess agreement that oc-
curred above and beyond that
related to chance alone. Today the
kappa value and its variants are the
most accepted methods of measur-
ing observer agreement for categor-
ical data.
Figure 1 demonstrates how the
kappa value is used and how it dif-
fers from the simple measurement
of observed agreement. In this
hypothetical example, observed
agreement is calculated as the per-
centage of times both surgeons
agree whether fractures were dis-
placed or nondisplaced; it does not
take into account the fact that they
may have agreed by chance alone.
To calculate the percentage of
chance agreement, it is assumed
that each surgeon will choose a cate-
gory independently of the other.
The marginal totals are then used to
calculate the agreement expected by
chance alone; in Figure 1, this is
0.545.
To calculate the kappa value, the
observed agreement (Po) minus the
chance agreement (Pc) is divided by
the maximum possible agreement
that is not related to chance (1 − Pc):
κ = (Po − Pc) / (1 − Pc)
This example is the simplest
case of two observers and two cate-
gories. The kappa value can be
used for multiple categories and
multiple observers in a similar
manner.
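
To make the arithmetic concrete, the following Python sketch (not part of the original article; the function name and table layout are illustrative assumptions) computes the kappa value from a two-observer agreement table and reproduces the Figure 1 result.

def cohen_kappa(table):
    """Cohen's kappa for a square agreement table: table[i][j] is the number of
    cases placed in category i by observer 1 and category j by observer 2."""
    k = len(table)
    n = sum(sum(row) for row in table)
    p_o = sum(table[i][i] for i in range(k)) / n                      # observed agreement
    row_tot = [sum(table[i]) for i in range(k)]                       # observer 1 marginals
    col_tot = [sum(table[i][j] for i in range(k)) for j in range(k)]  # observer 2 marginals
    p_c = sum(row_tot[i] * col_tot[i] for i in range(k)) / n ** 2     # chance agreement
    return (p_o - p_c) / (1 - p_c)

# Figure 1 counts: rows are surgeon 1, columns are surgeon 2 (displaced, nondisplaced)
figure_1 = [[50, 15],
            [15, 20]]
print(round(cohen_kappa(figure_1), 2))  # prints 0.34, matching the worked example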
In analyzing categorical data,
which the kappa value is designed
to measure, there will be cases in
which disagreement between vari-
ous categories may not have as pro-
found an impact as disagreement
between other categories. For this
reason, categorical data are divided
into two types: nominal (unranked),
in which all categorical differences
are equally important, and ordinal
(ranked), in which disagreement
between some categories has a more
profound impact than disagreement
between other categories. An exam-
ple of nominal data is eye color; an
example of ordinal data is the AO
classification, in which each subse-
quent class denotes an increase in
severity of the fracture.
The kappa value can be un-
weighted or weighted depending
on whether the data are nominal or
ordinal. Unweighted kappa values
should always be used with un-
ranked data. When ordinal data
are being analyzed, however, a
decision must be made whether or
not to weight the kappa value.
Weighting has the advantage of
giving some credit to partial agree-
ment, whereas the unweighted
kappa value treats all disagree-
ments as equal. A good example of
appropriate use of the weighted
kappa value is in a study by
Kristiansen et al3 of interobserver
agreement in the Neer classifica-
tion of proximal humeral fractures.
This well-known classification has
four categories of fractures, from
nondisplaced or minimally dis-
placed to four-part fractures.
Weighting was appropriate in this
case because disagreement between a two-part and three-part fracture is not as serious as disagreement between a nondisplaced fracture and a four-part fracture.

                 Surgeon No. 2
Surgeon No. 1    Displaced   Nondisplaced   Total
Displaced            50           15          65
Nondisplaced         15           20          35
Total                65           35         100

Observed agreement = 0.70
Chance agreement = (65/100 × 65/100) + (35/100 × 35/100) = 0.545
Agreement beyond chance (κ) = (0.70 − 0.545)/(1 − 0.545) = 0.34

Figure 1  Hypothetical example of agreement between two orthopaedic surgeons classifying radiographs of subcapital hip fractures.

By weighting kappa values, one
can account for the different levels
of importance between levels of
disagreement. If a weighted kappa
value is determined to be appropri-
ate, the weighting scheme must be
specified in advance because the
weights chosen will dramatically
affect the kappa value. In addition,
when reporting studies that have
used a weighted kappa value, the
weighting scheme must be docu-
mented clearly. One problem with
weighting is that without uniform
weighting schemes, it is difficult to
generalize across studies. A larger sample size will allow the confidence interval to be narrower, but it does not automatically affect the number of categories.
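
To show how partial credit enters the calculation, the sketch below is a weighted variant of the kappa computation using linear disagreement weights. This is one common scheme offered purely as an illustration; the article does not prescribe any particular weighting, and the function name and example counts are hypothetical.

def weighted_kappa(table, weights=None):
    """Weighted kappa for ordinal categories. weights[i][j] is the disagreement
    penalty for the rating pair (i, j); the penalty on the diagonal is zero."""
    k = len(table)
    n = sum(sum(row) for row in table)
    if weights is None:
        # Linear weights: the penalty grows with the distance between categories
        weights = [[abs(i - j) / (k - 1) for j in range(k)] for i in range(k)]
    row_tot = [sum(table[i]) for i in range(k)]
    col_tot = [sum(table[i][j] for i in range(k)) for j in range(k)]
    observed = sum(weights[i][j] * table[i][j] / n
                   for i in range(k) for j in range(k))
    expected = sum(weights[i][j] * row_tot[i] * col_tot[j] / n ** 2
                   for i in range(k) for j in range(k))
    return 1 - observed / expected

# Hypothetical counts for a four-category system (e.g., one- to four-part fractures):
# near-diagonal disagreements are penalized less than distant ones.
ratings = [[20, 4, 1, 0],
           [3, 15, 4, 1],
           [1, 3, 12, 2],
           [0, 1, 2, 6]]
print(round(weighted_kappa(ratings), 2))

With every off-diagonal weight set to 1, this expression reduces to the unweighted kappa defined earlier.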
Although the kappa value has
become the most widely accepted
method to measure observer agree-
ment, interpretation is difficult.
Values obtained range from −1.0
(complete disagreement) through
0.0 (chance agreement) to 1.0 (com-
plete agreement). Hypothesis test-
ing has limited usefulness when the
kappa value is used because it al-
lows the researcher to see only if
obtained agreement is significantly
different from zero or chance agree-
ment, revealing nothing about the
extent of agreement. Consequently,
when kappa values are obtained for
assessing classifications of mus-
culoskeletal conditions, hypothe-
sis testing has almost no role. As
Kraemer stated, “It is insufficient to
demonstrate merely the nonran-
domness of diagnostic procedures;
one requires assurance of substantial
agreement between observations.”4
This statement is equally applicable
to classifications used in ortho-
paedics.
To assess the strength of agree-
ment obtained with a given kappa
value, two different benchmarks
have gained widespread use in
orthopaedics and other branches of
medicine. The most widely adopted
criteria for assessing the extent of
agreement are those of Landis and
Koch5:
κ > 0.80, almost perfect;
κ = 0.61 to 0.80, substantial;
κ = 0.41 to 0.60, moderate;
κ = 0.21 to 0.40, fair;
κ = 0.00 to 0.20, slight; and
κ < 0.00, poor.
Although these criteria have
gained widespread acceptance, the
values were chosen arbitrarily and
were never intended to serve as
general benchmarks. The criteria of
Svanholm et al,6
while less widely
used, are more stringent than those
of Landis and Koch and are perhaps
more practical for use in medicine.
Like Landis and Koch, Svanholm et
al chose arbitrary values:
κ ≥ 0.75, excellent;
κ = 0.51 to 0.74, good; and
κ ≤ 0.50, poor.
When reviewing reports of stud-
ies on agreement of classification
systems, readers should look at the
actual kappa value and not just at
the arbitrary categories described
here.
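
Purely as an illustration, the two arbitrary scales above can be expressed as simple lookup helpers (a sketch, not something given in the article); as noted, the numeric kappa value should still be reported rather than the label alone.

def landis_koch(kappa):
    """Label a kappa value using the (arbitrary) Landis and Koch benchmarks."""
    if kappa > 0.80:
        return "almost perfect"
    elif kappa >= 0.61:
        return "substantial"
    elif kappa >= 0.41:
        return "moderate"
    elif kappa >= 0.21:
        return "fair"
    elif kappa >= 0.00:
        return "slight"
    return "poor"

def svanholm(kappa):
    """Label a kappa value using the (arbitrary) Svanholm et al benchmarks."""
    if kappa >= 0.75:
        return "excellent"
    elif kappa > 0.50:
        return "good"
    return "poor"

print(landis_koch(0.34), "/", svanholm(0.34))  # fair / poor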
Although the interpretation of a
given kappa value is difficult, it is
clear that the higher the value, the
more reliable the classification sys-
tem. When interpreting a given
kappa value, the impact of preva-
lence and bias must be considered.
Feinstein and Cicchetti7,8 refer to
them as the two paradoxes of high
observed agreement and low kappa
values. Most important is the effect
that the prevalence (base rate) can
have on the kappa value. Preva-
lence refers to the number of times a
given category is selected. In gener-
al, as the proportion of cases in one
category approaches 0, or 100%, the
kappa value will decrease for any
given observed agreement. In
Figure 2, the same two hypothetical
orthopaedic surgeons as in Figure 1
review and categorize 100 different
radiographs. The observed agree-
ment is the same as in Figure 1, 0.70.
However, the agreement beyond
chance (kappa value) is 0.06. The
main difference between Figures 1
and 2 is the marginal totals or the
underlying prevalence of displaced and nondisplaced fractures, defined as the proportion of displaced and nondisplaced fractures.

                 Surgeon No. 2
Surgeon No. 1    Displaced   Nondisplaced   Total
Displaced            65           15          80
Nondisplaced         15            5          20
Total                80           20         100

Observed agreement = 0.70
Chance agreement = (80/100 × 80/100) + (20/100 × 20/100) = 0.68
Agreement beyond chance (κ) = (0.70 − 0.68)/(1 − 0.68) = 0.06

Figure 2  Hypothetical example of agreement between two orthopaedic surgeons classifying radiographs, with a higher prevalence of displaced fractures than in Figure 1.

If one category has a very high
prevalence, there can be paradoxi-
cal high observed agreement yet
low kappa values (although to
some extent this can be the result of
the way chance agreement is calcu-
lated). The effect of prevalence on
kappa values must be kept in mind
when interpreting studies of ob-
server variability. The prevalence,
observed agreement, and kappa
values should be clearly stated in
any report on classification reliabil-
ity. Certainly a study with a low
kappa value and extreme preva-
lence rate will not represent the
same level of disagreement as will
a low kappa value in a sample with
a balanced prevalence rate.
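
The prevalence effect can be checked numerically with a minimal self-contained sketch (kappa_2x2 is an illustrative helper, not a method from the article): both tables have an observed agreement of 0.70, yet the kappa value collapses once the marginal totals become extreme.

def kappa_2x2(a, b, c, d):
    """Kappa for a 2x2 agreement table with cells a, b (row 1) and c, d (row 2)."""
    n = a + b + c + d
    p_o = (a + d) / n                                           # observed agreement
    p_c = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2      # chance agreement
    return (p_o - p_c) / (1 - p_c)

# Same observed agreement (0.70), different prevalence of "displaced" fractures
print(round(kappa_2x2(50, 15, 15, 20), 2))  # Figure 1 marginals (65/35): 0.34
print(round(kappa_2x2(65, 15, 15, 5), 2))   # Figure 2 marginals (80/20): 0.06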
Bias (systematic difference) is the
second factor that can affect the
kappa value. Bias has a lesser effect
than does prevalence, however. As
bias increases, kappa values para-
doxically will increase, although
this is usually seen only when
kappa values are low. To assess the
extent of bias in observer agree-
ment studies, Byrt et al9 have sug-
gested measuring a bias index, but
this has not been widely adopted.
Although the kappa value, influ-
enced by prevalence and bias, mea-
sures agreement, it is not the only
measure of the precision of a classi-
fication system. Many other factors
can affect both observer agreement
and disagreement.
Sources of Disagreement
As mentioned, any given classifica-
tion system must have a high degree
of reliability or precision. The de-
gree of observer agreement obtained
is affected by many factors, includ-
ing the precision of the classification
system. To improve reliability, these
other sources of disagreement must
be understood and minimized.
Once this is done, the reliability of
the classification system itself can be
accurately estimated.
Three sources of disagreement or
variability have been described1,10:
the clinician (observer), the patient
(examined), and the procedure
(examination). Each of these can
affect the reliability of classifications
in clinical practice and studies that
examine classifications and their
reliability.
Clinician variability arises from
the process by which information is
observed and interpreted. The in-
formation can be obtained from dif-
ferent sources, such as history,
physical examination, or radio-
graphic examination. These raw
data are often then converted into
categories. Wright and Feinstein1 called the criteria used to put the
raw data into categories conversion
criteria. Disagreement can occur
when the findings are observed or
when they are organized into the
arbitrary categories commonly used
in classification systems.
An example of variability in the
observational process is the mea-
surement of the center edge angle of
Wiberg. Inconsistent choice of the
edge of the acetabulum will lead to
variations in the measurements
obtained (Fig. 3).
As a result of the emphasis on
arbitrary criteria for the various cate-
gories in a classification system, an
observer may make measurements
that do not meet all of the criteria of
a category. The observer will then
choose the closest matching category.
Another observer may disagree
about the choice of closest category
and choose another. Such variability
in the use of conversion criteria is
common and is the result of trying
to convert the continuous spectrum
of clinical data into arbitrary and
finite categories.
The particular state being mea-
sured will vary depending on
when and how it is measured. This
results in patient variability. A
good example is the variation ob-
tained in measuring the degree of
spondylolisthesis when the patient
is in a standing compared with a
supine position.11 To minimize pa-
tient variability, examinations should
be performed in a consistent, stan-
dardized fashion.
The final source of variability is
the procedure itself. This often
refers to technical aspects, such as
the taking of a radiograph. If the
exposures of two radiographs of the
same patient’s hip are different, for
example, then classification of the
degree of osteopenia, which de-
pends on the degree of exposure,
will differ as a result of the variabil-
ity. Standardization of technique
will help reduce this source of vari-
ability.
Figure 3 Anteroposterior radiograph of a
dysplastic hip, showing the difficulty in
defining the true margin of the acetabulum
when measuring the center edge angle of
Wiberg (solid lines). The apparent lateral
edge of the acetabulum (arrow) is really a
superimposition of the true anterior and
posterior portions of the superior rim of
the acetabulum. Inconsistent choice
among observers may lead to errors in
measurement.
These three sources of variation
apply to all measurement processes.
Reducing variability is not solely a matter of improving the classification system itself; that is only one aspect of improving the reliability and utility of classification systems. Understanding these sources of measurement variability and how to minimize them is critically important.1,10
Assessment of Commonly Used Orthopaedic Classification Systems
Although many classification sys-
tems have been widely adopted and
frequently used in orthopaedic sur-
gery to guide treatment decisions,
few have been scientifically tested
for their reliability. A high degree of
reliability or precision should be a
minimum requirement before any
classification system is adopted. The
results of several recent studies that
have tested various orthopaedic clas-
sifications for their intraobserver and
interobserver reliability are summa-
rized in Table 1.12-21
In general, the reliability of the
listed classification systems would
be considered low and probably
unacceptable. Despite this lack of
reliability, these systems are com-
monly used. Although Table 1 lists
only a limited number of systems,
they were chosen because they have
been subjected to reliability testing.
Many other classification systems
commonly cited in the literature
have not been tested; consequently,
there is no evidence that they are or
are not reliable. In fact, most classification systems for medical condi-
tions and injuries that have been
tested have levels of agreement that
are considered unacceptably low.22,23
There is no reason to believe that
the classification systems that have
not been tested would fare any bet-
ter. Four of the studies listed in
Table 1 are discussed in detail to
highlight the methodology that
should be used to assess the reliabil-
ity of any classification system: the
AO classification of distal radius fractures,15 the classification of acetabular bone defect in revision hip arthroplasty,13 the Severin classification of congenital dislocation of the hip,14 and the Vancouver classification of periprosthetic fractures of the femur.12
Kreder et al15 assessed the reli-
ability of the AO classification of
distal radius fractures. This classi-
fication system divides fractures
into three types based on whether
the fracture is extra-articular (type
A), partial articular (type B), or com-
plete articular (type C). These frac-
ture types can then be divided into
groups, which are further divided
into subgroups with 27 possible
combinations. Thirty radiographs
of distal radial fractures were pre-
sented to observers on two occa-
sions. Before classifying the radio-
graphs, a 30-minute review of the
AO classification was conducted.
Assessors also had a handout, which
they were encouraged to use when
classifying the fractures. There
were 36 observers in all, including
attending surgeons, clinical fellows,
residents, and nonclinicians. These
groups were chosen to ascertain
whether the type of observer had an
influence on the reliability of the
classification. In this study, an
unweighted kappa value was used.
The authors evaluated intraobserver
and interobserver reliability for AO
type, AO group, and AO subgroup.
The criteria of Landis and Koch5 were used to grade the levels of
agreement. Interobserver agree-
ment was highest for the initial AO
type, and it decreased for groups
and subgroups as the number of
categories increased. This should be
expected because, as the number of
categories increases, there is more
opportunity for disagreement.
Intraobserver agreement showed
similar results. Kappa values for
AO type ranged from 0.67 for resi-
dents to 0.86 for attending surgeons.
Again, with more detailed AO sub-
groups, kappa values decreased
progressively. When all 27 cate-
gories were included, kappa values
ranged from 0.25 to 0.42. The con-
clusions of this study were that the
use of AO types A, B, and C pro-
duced levels of reliability that were
high and acceptable. However, sub-
classification into groups and sub-
groups was unreliable. The clinical
utility of using only the three types
was not addressed and awaits fur-
ther study.
Several important aspects of this
study, aside from the results, merit
mention. This study showed that not only the classification system but also the observer is being tested. For
any classification system tested, it
is important to document the ob-
servers’ experience because this
can substantially affect reliability.
One omission in this study15 was
the lack of discussion of observed
agreement and the prevalence of
fracture categories; these factors
have a distinct effect on observer
variability.
Campbell et al13 looked at the
reliability of acetabular bone defect
classifications in revision hip arthro-
plasty. One group of observers
included the originators of the clas-
sification system. This is the ulti-
mate way to remove observer bias;
however, it lacks generalizability
because the originators would be
expected to have unusually high
levels of reliability. In this study,
preoperative radiographs of 33 hips
were shown to three different
groups of observers on two occa-
sions at least 2 weeks apart. The
groups of observers were the three
originators, three reconstructive
orthopaedic surgeons, and three
senior residents. The three classifi-
cations assessed were those attributed to Gross,24 Paprosky,25 and the American Academy of Orthopaedic Surgeons.26
The unweighted kappa value was used to assess the level of agreement.

Table 1
Intraobserver and Interobserver Agreement in Orthopaedic Classification Systems

Study | Classification | Assessors | Intraobserver Observed Agreement (%) | Intraobserver κ Value | Interobserver Observed Agreement (%) | Interobserver κ Value
Brady et al12 | Periprosthetic femur fractures (Vancouver) | Reconstructive orthopaedic surgeons, including originator; residents | — | 0.73 – 0.83* | — | 0.60 – 0.65*
Campbell et al13 | Acetabular bone defect in revision total hip (AAOS26) | Reconstructive orthopaedic surgeons, including originators | — | 0.05 – 0.75* | — | 0.11 – 0.28*
Campbell et al13 | Acetabular bone defect in revision total hip (Gross24) | Reconstructive orthopaedic surgeons, including originators | — | 0.33 – 0.55* | — | 0.19 – 0.62*
Campbell et al13 | Acetabular bone defect in revision total hip (Paprosky25) | Reconstructive orthopaedic surgeons, including originators | — | 0.27 – 0.60* | — | 0.17 – 0.41*
Ward et al14 | Congenital hip dislocation (Severin) | Pediatric orthopaedic surgeons | 45 – 61 | 0.20 – 0.44*, 0.32 – 0.59† | 14 – 61 | −0.01 – 0.42*, 0.05 – 0.55†
Kreder et al15 | Distal radius (AO) | Attending surgeons, fellows, residents, nonclinicians | — | 0.25 – 0.42* | — | 0.33*
Sidor et al16 | Proximal humerus (Neer) | Shoulder surgeon, radiologist, residents | 62 – 86 | 0.50 – 0.83* | — | 0.43 – 0.58*
Siebenrock et al17 | Proximal humerus (Neer) | Shoulder surgeons | — | 0.46 – 0.71† | — | 0.25 – 0.51†
Siebenrock et al17 | Proximal humerus (AO/ASIF) | Shoulder surgeons | — | 0.43 – 0.54† | — | 0.36 – 0.49†
McCaskie et al18 | Quality of cement grade in THA | Experts in THA, consultants, residents | — | 0.07 – 0.63* | — | −0.04*
Lenke et al19 | Scoliosis (King) | Spine surgeons | 56 – 85 | 0.34 – 0.95* | 55 | 0.21 – 0.63*
Cummings et al20 | Scoliosis (King) | Pediatric orthopaedic surgeons, spine surgeons, residents | — | 0.44 – 0.72* | — | 0.44*
Haddad et al21 | Femoral bone defect in revision total hip (AAOS,30 Mallory,28 Paprosky et al29) | Reconstructive orthopaedic surgeons | — | 0.43 – 0.62* | — | 0.12 – 0.29*

* Unweighted   † Weighted

As expected, the originators had
higher levels of intraobserver agree-
ment than did the other two observer
groups (AAOS, 0.57; Gross, 0.59;
Paprosky, 0.75). However, levels of
agreement fell markedly when tested
by surgeons other than the origina-
tors. This study underscores the im-
portance of the qualifications of the
observers in studies that measure
reliability. To test the classification
system itself, experts would be the
initial optimal choice, as was the
case in this study.13 However, even
if the originators have acceptable
agreement, this result should not be
generalized. Because most classifi-
cation systems are developed for
widespread use, reliability must be
high among all observers for a sys-
tem to have clinical utility. Hence,
although the originators of the clas-
sifications of femoral bone loss were
not included in a similar study21 at
the same center, the conclusions of
the study remain valuable with re-
spect to the reliability of femoral
bone loss classifications in the hands
of orthopaedic surgeons other than
the originators.
Ward et al14 evaluated the Severin
classification, which is used to as-
sess the radiographic appearance
of the hip after treatment for con-
genital dislocation. This system has
six main categories ranging from
normal to recurrent dislocation and
is reported to be a prognostic indi-
cator. Despite its widespread accep-
tance, it was not tested for reliability
until 1997. The authors made every
effort to test only the classification
system by minimizing other poten-
tial sources of disagreement. All
identifying markers were removed
from 56 radiographs of hips treated
by open reduction. Four fellow-
ship-trained pediatric orthopaedic
surgeons who routinely treated con-
genital dislocation of the hip inde-
pendently rated the radiographs.
Before classifying the hips, the
observers were given a detailed de-
scription of the Severin classification.
Eight weeks later, three observers
repeated the classifying exercise.
The radiographs were presented in
a different order in an attempt to
minimize recall bias. Both weighted
and unweighted kappa values were
calculated. Observed agreement
also was calculated and reported so
that the possibility of a high ob-
served agreement with a low kappa
value would be apparent. The
kappa values, whether weighted or
unweighted, were low, usually less
than 0.50. The authors of this study
used the arbitrary criteria of
Svanholm et al
6
to grade their agree-
ment and concluded that this classi-
fication scheme is unreliable and
should not be widely used. This
study demonstrated the method-
ology that should be used when test-
ing classification systems. It elimi-
nated other sources of disagreement
and focused on the precision of the
classification system itself.
The Vancouver classification of
periprosthetic femur fractures is an
example of a system that was tested
for reliability prior to its wide-
spread adoption and use.12 The first description was published in 1995.27
Shortly afterward, testing
began on the reliability and the
validity of this system. The meth-
odology was similar to that de-
scribed in the three previous stud-
ies. Reliability was acceptable for
the three experienced reconstruc-
tive orthopaedic surgeons tested,
including the originator. To assess
generalizability, three senior resi-
dents also were assessed for their
intraobserver and interobserver
reliability. The kappa values for
this group were nearly identical to
those of the three expert surgeons.
This study confirmed that the
Vancouver classification is both
reliable and valid. With these two
criteria met, this system can be rec-
ommended for widespread use and
can subsequently be assessed for its
value in guiding treatment and out-
lining prognosis.
Summary
Classification systems are tools for
identifying injury patterns, assessing
prognoses, and guiding treatment
decisions. Many classification sys-
tems have been published and wide-
ly adopted in orthopaedics without
information available on their relia-
bility. Classification systems should
consistently produce the same results.
A system should, at a minimum,
have a high degree of intraobserver
and interobserver reliability. Few
systems have been tested for this re-
liability, but those that have been
tested generally fall short of accept-
able levels of reliability. Because
most classification systems have poor
reliability, their use to differentiate
treatments and suggest outcomes is
not warranted. A system that has not
been tested cannot be assumed to be
reliable. The systems used by ortho-
paedic surgeons must be tested for
reliability, and if a system is not
found to be reliable, it should be
modified or its use seriously ques-
tioned. Improving reliability involves
looking at many components of the
classification process.1
Methodologies exist to assess
classifications, with the kappa value
the standard for measuring ob-
server reliability. Once a system is
found to be reliable, the next step is
to prove its utility. Only when a
system is shown to be reliable
should it be widely adopted by the
medical community. This should
not be construed to mean that
untested classification systems, or
those with disappointing reliability,
are without value. Systems are
needed to categorize or define sur-
gical problems before surgery in or-
der to plan appropriate approaches
and techniques. Classification sys-
tems provide a discipline to help
define pathology as well as a language to describe that pathology.
However, it is necessary to recog-
nize the limitations of existing clas-
sification systems and the need to
confirm or refine proposed preop-
erative categories by careful intra-
operative observation of the actual
findings. Furthermore, submission
of classification systems to statisti-
cal analysis highlights their inher-
ent flaws and lays the groundwork
for their improvement.
References
1. Wright JG, Feinstein AR: Improving the
reliability of orthopaedic measurements.
J Bone Joint Surg Br 1992;74:287-291.
2. Cohen J: A coefficient of agreement
for nominal scales. Educational and
Psychological Measurement 1960;20:37-46.
3. Kristiansen B, Andersen UL, Olsen CA,
Varmarken JE: The Neer classification
of fractures of the proximal humerus:
An assessment of interobserver varia-
tion. Skeletal Radiol 1988;17:420-422.
4. Kraemer HC: Extension of the kappa
coefficient. Biometrics 1980;36:207-216.
5. Landis JR, Koch GG: The measurement
of observer agreement for categorical
data. Biometrics 1977;33:159-174.
6. Svanholm H, Starklint H, Gundersen
HJ, Fabricius J, Barlebo H, Olsen S:
Reproducibility of histomorphologic
diagnoses with special reference to the
kappa statistic. APMIS 1989;97:689-698.
7. Feinstein AR, Cicchetti DV: High
agreement but low kappa: I. The prob-
lems of two paradoxes. J Clin Epidemiol
1990;43:543-549.
8. Cicchetti DV, Feinstein AR: High
agreement but low kappa: II. Resolving
the paradoxes. J Clin Epidemiol 1990;43:
551-558.
9. Byrt T, Bishop J, Carlin JB: Bias,
prevalence and kappa. J Clin Epidemiol
1993;46:423-429.
10. Clinical disagreement: I. How often it
occurs and why. Can Med Assoc J 1980;123:499-504.
11. Lowe RW, Hayes TD, Kaye J, Bagg RJ,
Luekens CA: Standing roentgeno-
grams in spondylolisthesis. Clin
Orthop 1976;117:80-84.
12. Brady OH, Garbuz DS, Masri BA,
Duncan CP: The reliability and valid-
ity of the Vancouver classification of
femoral fractures after hip replace-
ment. J Arthroplasty 2000;15:59-62.
13. Campbell DG, Garbuz DS, Masri BA,
Duncan CP: Reliability of acetabular
bone defect classification systems in
revision total hip arthroplasty. J Arthro-
plasty 2001;16:83-86.
14. Ward WT, Vogt M, Grudziak JS,
Tumer Y, Cook PC, Fitch RD: Severin
classification system for evaluation of
the results of operative treatment of
congenital dislocation of the hip: A
study of intraobserver and interob-
server reliability. J Bone Joint Surg Am
1997;79:656-663.
15. Kreder HJ, Hanel DP, McKee M,
Jupiter J, McGillivary G, Swiont-
kowski MF: Consistency of AO frac-
ture classification for the distal radius.
J Bone Joint Surg Br 1996;78:726-731.
16. Sidor ML, Zuckerman JD, Lyon T,
Koval K, Cuomo F, Schoenberg N: The
Neer classification system for proximal
humeral fractures: An assessment of
interobserver reliability and intraob-
server reproducibility. J Bone Joint Surg
Am 1993;75:1745-1750.
17. Siebenrock KA, Gerber C: The repro-
ducibility of classification of fractures
of the proximal end of the humerus.
J Bone Joint Surg Am 1993;75:1751-1755.
18. McCaskie AW, Brown AR, Thompson
JR, Gregg PJ: Radiological evaluation
of the interfaces after cemented total
hip replacement: Interobserver and
intraobserver agreement. J Bone Joint
Surg Br 1996;78:191-194.
19. Lenke LG, Betz RR, Bridwell KH, et al:
Intraobserver and interobserver reli-
ability of the classification of thoracic
adolescent idiopathic scoliosis. J Bone
Joint Surg Am 1998;80:1097-1106.
20. Cummings RJ, Loveless EA, Campbell
J, Samelson S, Mazur JM: Interobserver
reliability and intraobserver repro-
ducibility of the system of King et al.
for the classification of adolescent idio-
pathic scoliosis. J Bone Joint Surg Am
1998; 80:1107-1111.
21. Haddad FS, Masri BA, Garbuz DS,
Duncan CP: Femoral bone loss in total
hip arthroplasty: Classification and
preoperative planning. J Bone Joint
Surg Am 1999;81:1483-1498.
22. Koran LM: The reliability of clinical
methods, data and judgments (first
of two parts). N Engl J Med 1975;293:
642-646.
23. Koran LM: The reliability of clinical
methods, data and judgments (second
of two parts). N Engl J Med 1975;293:
695-701.
24. Garbuz D, Morsi E, Mohamed N,
Gross AE: Classification and recon-
struction in revision acetabular arthro-
plasty with bone stock deficiency. Clin
Orthop 1996;324:98-107.
25. Paprosky WG, Perona PG, Lawrence
JM: Acetabular defect classification
and surgical reconstruction in revision
arthroplasty: A 6-year follow-up eval-
uation. J Arthroplasty 1994;9:33-44.
26. D’Antonio JA, Capello WN, Borden LS:
Classification and management of ace-
tabular abnormalities in total hip ar-
throplasty. Clin Orthop 1989;243:126-137.
27. Duncan CP, Masri BA: Fractures of
the femur after hip replacement. Instr
Course Lect 1995;44:293-304.
28. Mallory TH: Preparation of the proxi-
mal femur in cementless total hip revi-
sion. Clin Orthop 1988;235:47-60.
29. Paprosky WG, Lawrence J, Cameron
H: Femoral defect classification:
Clinical application. Orthop Rev 1990;
19(suppl 9):9-15.
30. D’Antonio J, McCarthy JC, Bargar WL,
et al: Classification of femoral abnor-
malities in total hip arthroplasty. Clin
Orthop 1993;296:133-139.