SECTION II Pediatric Critical Care: Tools and Procedures
development in the United States are classified into four phases by the
FDA. In each progressive phase, the number of patients, complexity,
duration, and costs of the trial increase. The complete process is long;
the average interval from drug development to market is approximately 10 years.58 The longest portion of this interval is usually drug
testing in clinical trials, which collectively occur over 5 to 7 years.
Phase I trials are dosing trials, in which a small number of human subjects receive several doses of the study drug to assess its
pharmacokinetics, pharmacodynamics, and side effects in either
healthy volunteers or patients with terminal conditions with few
remaining treatment options. A phase II trial evaluates drug efficacy, usually in the patient population of ultimate interest, while
continuing to monitor safety and side effects in more subjects
over a longer period. Finally, phase III trials are rigorously designed RCTs, with strictly defined outcomes or clinical end
points, enrolling hundreds to thousands of patients across multiple centers. Phase IV trials are postmarket trials conducted following drug approval by the FDA. These studies, in addition to
consumer and clinician reporting of drug-associated safety concerns, allow for ongoing evaluation of rare adverse events associated with the drug and evaluation in new populations.58
Whom to Analyze?
All randomized studies should be analyzed based on original
group assignment, or intention to treat. By including all patients
randomized into each group (by the treatment that they were
intended to receive), all consequences of the treatment and all
benefits of balance achieved by randomization are preserved. It
can be tempting to analyze based on actual receipt or completion
of treatment, termed per-protocol analysis, which excludes patients
who crossed over between treatment groups or excludes patients
after randomization for other reasons. While these analyses will
reduce dilution and possibly identify a greater treatment effect,
they will also lose the benefits of randomization and will introduce
selection bias in the result.4 However, the most appropriate
approach to analysis depends on the trial. Pragmatic trials, which
aim to test the real-world performance of an intervention in a
broad population of patients in a clinical setting, differ from
explanatory trials, which aim to identify the biological effect of an
intervention in a more idealized setting.4 In pragmatic trials in
which loss to follow-up and incomplete adherence are an expected
part of the intervention, per-protocol analysis may help translate
trial results to alternate clinical settings in which differences
in adherence, demographics, and other factors may substantially
influence the effect of the intervention.59
Subgroup analyses focus on the effects of a treatment within a
particular group of study participants, such as women, those within
a given age stratum, or among members of a specific race. These
analyses are typically performed when there is a suspicion based on
observational data or biology that the treatment effects may differ among
groups, also known as effect modification or interaction. Since these
analyses involve smaller numbers compared with the whole trial,
they are typically underpowered to definitively identify treatment
effects. Also, performing multiple additional statistical tests increases
the likelihood of a type I error or identifying an effect by chance
simply because so many tests were done. For these reasons, subgroup
analyses should be prespecified and limited in number.60
Hypothesis Testing and Determining the Study Result
Inference and Estimate of Effect
The results of research studies are judged by their reliability and
validity. A trial is reliable if the same results are achieved when it
is repeated under the same circumstances. A study has internal
validity if the results are real and not due to bias, chance, or
confounding; it has external validity if its results can be generalized to
a broader population.
A clinical trial observes the effect of an intervention in a small
sample of patients; however, researchers want to generalize these
results to the entire (theoretical) population. Statistical analysis
allows this generalization to be made. Based on the study design
and distribution of study measurements, researchers choose an
appropriate statistical test to compare the study results against the
null hypothesis.
For binary outcome measures (e.g., mortality), the estimate of
effect is usually expressed as a relative risk or a risk ratio: the
proportion of subjects with the outcome in one group divided by the
proportion of subjects with the outcome in the second group.
Alternatively, relative odds or odds ratio may be reported. Odds are
calculated as the ratio of number of events to nonevents, and the
odds ratio is the odds of the event in one group divided by the
odds of the event in the second group. For rare events, the odds
ratio will approximate the relative risk. The odds ratio is amenable
to mathematical operations and can be generated from logistic
regression analyses, which can include adjustment for confounding
factors in calculating the estimate of effect.
95% Confidence Interval
Next, the authors consider the certainty of the estimate of effect
observed in the study. The 95% confidence interval (CI) describes
the range of true effect values that would plausibly yield the
observed effect in the study. If the estimate of effect in a study is a
relative risk reduction of 50%, then a 95% CI of 40% to 60%
indicates that one would reasonably have obtained this study result if
the true effect was anywhere from 40% to 60%. A higher-powered
study will usually achieve a narrower 95% CI, giving more certainty
about the true magnitude of effect. When comparing binary
outcomes, a CI that crosses 100% (or 1.0) for odds ratio or relative
risk translates to no statistically significant difference between
groups. Similarly, for continuous outcomes, a CI that crosses 0 (i.e.,
no difference) translates to no statistically significant difference.
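The relationship between the risk ratio, the odds ratio, and their confidence intervals can be sketched in a few lines. This is a minimal illustration, not the chapter's own method: the function name `ratio_with_ci`, the event counts, and the choice of a Wald interval computed on the log scale are all assumptions made for the example, and the sketch does not handle groups with zero events or zero nonevents.

```python
import math

def ratio_with_ci(events_a, n_a, events_b, n_b, z=1.96):
    """Risk ratio and odds ratio (group A vs. group B) with Wald CIs on the log scale."""
    non_a, non_b = n_a - events_a, n_b - events_b
    rr = (events_a / n_a) / (events_b / n_b)
    odds_ratio = (events_a / non_a) / (events_b / non_b)
    # Standard errors of log(RR) and log(OR) from the usual large-sample formulas
    se_log_rr = math.sqrt(1 / events_a - 1 / n_a + 1 / events_b - 1 / n_b)
    se_log_or = math.sqrt(1 / events_a + 1 / non_a + 1 / events_b + 1 / non_b)

    def interval(estimate, se):
        return (math.exp(math.log(estimate) - z * se),
                math.exp(math.log(estimate) + z * se))

    return {"rr": rr, "rr_ci": interval(rr, se_log_rr),
            "or": odds_ratio, "or_ci": interval(odds_ratio, se_log_or)}

# Hypothetical counts: a common outcome (15/100 vs. 30/100 events)
common = ratio_with_ci(15, 100, 30, 100)
# Hypothetical counts: a rare outcome (3/1000 vs. 6/1000 events)
rare = ratio_with_ci(3, 1000, 6, 1000)

for label, result in (("common", common), ("rare", rare)):
    print(label, {k: (round(v, 3) if isinstance(v, float)
                      else tuple(round(x, 3) for x in v))
                  for k, v in result.items()})
```

With the common outcome, the odds ratio sits noticeably further from 1.0 than the risk ratio, while with the rare outcome the two nearly coincide. In the first example both intervals exclude 1.0; in the second, with far fewer events, they do not.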
Statistical Analysis and Reporting
P Values
Statistical testing assesses whether the results support rejecting the
null hypothesis within some margin of error. Researchers calculate
the probability that the observed difference (or one greater) would
have occurred by chance, if the null hypothesis were true—the
P value. A low probability indicates that the results are unlikely to
have occurred by chance and would support rejecting the null
hypothesis. Notably, even study results with an imprecise estimate
of effect (e.g., those with a very wide 95% CI, or wide range of
true values that could be consistent with the observed results) can
meet statistical significance (i.e., P < .05).
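One way to make the definition of the P value concrete is an exact permutation test, which literally counts how often a difference at least as large as the observed one arises when group labels are assigned by chance. The sketch below is illustrative only: the function name `permutation_p_value` and the data are hypothetical, and a permutation test is swapped in here for teaching purposes rather than being a method named in the text.

```python
from itertools import combinations
from statistics import mean

def permutation_p_value(group_a, group_b):
    """Two-sided exact permutation P value for a difference in means.

    Enumerates every way of relabeling the pooled observations and returns
    the fraction of relabelings whose absolute difference in means is at
    least as large as the observed one (feasible for small samples only).
    """
    pooled = list(group_a) + list(group_b)
    n_a, n_b = len(group_a), len(group_b)
    total = sum(pooled)
    observed = abs(mean(group_a) - mean(group_b))
    as_extreme = 0
    relabelings = 0
    for indices in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in indices)
        difference = abs(sum_a / n_a - (total - sum_a) / n_b)
        if difference >= observed - 1e-12:  # tolerance for floating-point ties
            as_extreme += 1
        relabelings += 1
    return as_extreme / relabelings

# Hypothetical outcome values (e.g., days) in two groups of four patients
p_separated = permutation_p_value([1, 2, 3, 4], [5, 6, 7, 8])
p_identical = permutation_p_value([1, 2, 3], [1, 2, 3])
print(round(p_separated, 4), p_identical)
```

When the two groups are completely separated, only 2 of the 70 possible relabelings are as extreme as the observed data, so the P value is 2/70; when the groups are identical, every relabeling is at least as extreme and the P value is 1.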
The method for calculating the P value varies by study design
and outcome measure. Parametric tests assume that the outcome
measure has a normal distribution, indicating that it can be fully
described by its mean and standard deviation. For this kind of
CHAPTER 11 Essential Concepts in Clinical Trial Design and Statistical Analysis
outcome, a t-test is used to compare means from one or two
samples; the analysis of variance test is used to compare means
from more than two samples. Nonparametric tests do not depend
on a normal distribution, but because they make fewer assumptions about the distribution of the data, they are typically less
powerful. These can be more appropriate for data that are highly
skewed (e.g., length of stay, which is typically right skewed, as
some patients have very long lengths of stay) or otherwise expected to have a nonnormal distribution. Nonparametric tests
include the Wilcoxon rank sum or Mann-Whitney U test (for unpaired comparison of the median in two groups) and the Kruskal-Wallis test (for comparison of medians across multiple groups).
Different tests are used for categorical data (e.g., ethnicity,
gender, Pediatric Overall Performance Category score). The chi
square test compares observed to expected values in a 2 × 2 table.
Fisher’s exact test is similar but calculates a more accurate P value
when small cell numbers are present. McNemar’s test is used for
paired categorical data.
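The contrast between the chi square test and Fisher's exact test on a small 2 × 2 table can be sketched as follows. The function names and cell counts are hypothetical. The chi square version shown is the Pearson statistic without continuity correction, and the two-sided Fisher P value sums the probabilities of all tables no more likely than the observed one, which is one common convention among several.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi square statistic (no continuity correction) and P value, 1 df."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # For 1 degree of freedom, P(X > chi2) equals erfc(sqrt(chi2 / 2))
    return chi2, math.erfc(math.sqrt(chi2 / 2))

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher exact P value with row and column totals held fixed."""
    row1, row2, col1 = a + b, c + d, a + c

    def weight(x):
        # Unnormalized hypergeometric probability of a table with x in cell a
        return math.comb(row1, x) * math.comb(row2, col1 - x)

    observed = weight(a)
    possible = range(max(0, col1 - row2), min(row1, col1) + 1)
    as_extreme = sum(weight(x) for x in possible if weight(x) <= observed)
    return as_extreme / math.comb(row1 + row2, col1)

# Hypothetical small table: rows are groups, columns are outcome present/absent
chi2, p_chi2 = chi_square_2x2(3, 1, 1, 3)
p_fisher = fisher_exact_2x2(3, 1, 1, 3)
print(round(chi2, 3), round(p_chi2, 4), round(p_fisher, 4))
```

With cells this small, the large-sample chi square approximation gives a noticeably smaller P value than the exact calculation, which is the situation in which Fisher's exact test is preferred.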
For time-to-event data, survival curves are often used to display data. This allows investigators to handle varying times of
observation prior to events, as well as partial data from patients
who are observed for some time but not observed to have an
event during follow-up (censored data). A hazard ratio is typically reported for survival data, describing the ratio of “hazard”
rate (moment-to-moment outcome or event rates) between
groups.
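How a survival curve accommodates censored observations can be illustrated with a small Kaplan-Meier (product-limit) estimator. This is a sketch under stated assumptions: the function name `kaplan_meier` and the follow-up data are hypothetical, and patients censored at the same time as an event are treated as still at risk at that time, which is the usual convention.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time.

    times[i] is the follow-up time for patient i; events[i] is True if the
    event was observed at that time and False if the patient was censored.
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        current_time = data[i][0]
        event_count = 0
        leaving = 0
        # Gather everyone whose follow-up ends at this time
        while i < len(data) and data[i][0] == current_time:
            event_count += 1 if data[i][1] else 0
            leaving += 1
            i += 1
        if event_count:
            survival *= 1 - event_count / at_risk
            curve.append((current_time, survival))
        # Both events and censored patients leave the risk set afterward
        at_risk -= leaving
    return curve

# Hypothetical follow-up (days); False marks a censored observation
times = [2, 3, 3, 5, 8]
events = [True, False, True, True, False]
curve = kaplan_meier(times, events)
for day, probability in curve:
    print(day, round(probability, 3))
```

The censored patients contribute to the denominator for as long as they are observed and then drop out of the risk set without forcing the curve downward, which is how partial follow-up is used rather than discarded.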
Additional Sources and Mitigation of Bias
Bias results in differences between study populations that are not
due to chance. Some features of design that reduce or eliminate
bias, including study population selection, randomization, and
blinding, have already been discussed. Some additional features of
study analysis can further reduce bias.
Bias due to loss of data occurs when data from subjects are
eliminated from the final analyses. Protocol violations, postrandomization exclusion, or unequal dropout or loss of follow-up
can all result in missing data. Data that are unequally missing
between groups, or missing not at random, can introduce bias.
For example, if the study treatment was poorly tolerated by a
subgroup of the study population, then those patients might be
more likely to drop out, and their data would be missing from
the final study results. The total amount of missing data can
also impact study results. One review of 71 major RCTs in toptier medical journals identified that 13 of the trials (18%) were
missing outcome data on 20% or more of enrolled subjects.61
Imputation, sensitivity analyses, and other advanced statistical
methods can be used to explore how much the results might be
affected by missing data.
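A simple form of sensitivity analysis is to recompute the result under extreme assumptions about the missing outcomes. The sketch below is one illustrative approach (best-case and worst-case imputation for a binary outcome), not a method prescribed by the text; the function name `risk_difference_bounds` and all counts are hypothetical.

```python
def risk_difference_bounds(events_t, observed_t, missing_t,
                           events_c, observed_c, missing_c):
    """Risk difference (treatment minus control) under three assumptions.

    complete_case: ignore patients with missing outcomes.
    best_case: missing treated patients had no event; missing controls all did.
    worst_case: missing treated patients all had the event; missing controls did not.
    """
    n_t = observed_t + missing_t
    n_c = observed_c + missing_c
    return {
        "complete_case": events_t / observed_t - events_c / observed_c,
        "best_case": events_t / n_t - (events_c + missing_c) / n_c,
        "worst_case": (events_t + missing_t) / n_t - events_c / n_c,
    }

# Hypothetical trial: 100 randomized per arm, more dropout in the treatment arm
bounds = risk_difference_bounds(events_t=10, observed_t=80, missing_t=20,
                                events_c=20, observed_c=90, missing_c=10)
for assumption, difference in bounds.items():
    print(assumption, round(difference, 3))
```

In this made-up example the complete-case analysis favors treatment, but the worst-case assumption reverses the direction of the effect, which signals that the conclusion is sensitive to the missing data. Multiple imputation and related methods make less extreme, model-based assumptions.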
Additional Methods of Exploring Study Results
A clear distinction should be made between relative risk reduction and absolute risk reduction in study results. A reduction in
mortality from 60% to 20% and a reduction from 3% to 1%
both represent a relative risk reduction (1 minus the risk ratio) of 67%. However, the absolute risk reduction, or the difference in risk between groups, is substantially different: 40%
versus 2%. The number needed to treat (NNT) is a related statistic that describes what number of patients would have to be
exposed to the intervention to result in one “saved” outcome,
equal to 1 divided by absolute risk reduction. In the examples
above, the NNT would be only 2.5 patients for the intervention
that reduces mortality from 60% to 20% but would be 50 patients for the intervention that reduces mortality from 3% to
1%.7 The NNT can facilitate comparing results from different
studies and consideration of side effects, costs, and other aspects
of an intervention. It can also be adjusted for a particular patient’s baseline risk compared with the average risk of patients in
the study; among patients with twice the baseline risk of the
outcome, the NNT would be cut in half.62 These terms and additional common calculations are summarized in eTables 11.1
and 11.2.
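The arithmetic in the mortality examples above can be reproduced directly. This is a minimal sketch: the function name `risk_reduction_summary` is hypothetical, and the baseline-risk adjustment assumes that the relative effect of treatment stays constant across baseline risks, which is an assumption rather than something a single trial can confirm.

```python
def risk_reduction_summary(control_risk, treated_risk):
    """Return (relative risk reduction, absolute risk reduction, NNT)."""
    arr = control_risk - treated_risk
    rrr = 1 - treated_risk / control_risk
    nnt = 1 / arr
    return rrr, arr, nnt

# The two scenarios from the text: 60% -> 20% and 3% -> 1% mortality
high_risk = risk_reduction_summary(0.60, 0.20)
low_risk = risk_reduction_summary(0.03, 0.01)

# A patient with twice the baseline risk of the low-risk scenario,
# assuming the same risk ratio (1/3) applies
doubled = risk_reduction_summary(0.06, 0.06 * (0.01 / 0.03))

for label, (rrr, arr, nnt) in (("60% to 20%", high_risk),
                               ("3% to 1%", low_risk),
                               ("6% baseline", doubled)):
    print(f"{label}: RRR={rrr:.1%}, ARR={arr:.1%}, NNT={nnt:.1f}")
```

Both scenarios share a relative risk reduction of two-thirds, yet the NNT differs twentyfold, and doubling the baseline risk in the low-risk scenario halves the NNT from 50 to 25.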
The fragility index, another method used to describe study
results, refers to the number of outcomes (or events) that would
have to be changed to nonoutcomes in order to raise the P value
from statistically significant (classically, <.05) to nonsignificant
or to raise the likelihood that the observed study results occurred by chance beyond a threshold of acceptability.63–65 Across
43 published RCTs in pediatric critical care, a median of only
two event switches would have been required to alter the study
results from statistically significant to nonsignificant.66 This
methodology can only be applied to studies with a binary outcome and is subject to the same methodologic concerns as the
P value itself.67
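The fragility index can be computed by brute force. The sketch below follows one common convention, which is an assumption here rather than a detail given in the text: in the arm with fewer events, patients are switched from nonevent to event one at a time, and a two-sided Fisher exact P value is recalculated until it reaches the significance threshold. The function names and trial counts are hypothetical.

```python
import math

def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher exact P value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c

    def weight(x):
        return math.comb(row1, x) * math.comb(row2, col1 - x)

    observed = weight(a)
    possible = range(max(0, col1 - row2), min(row1, col1) + 1)
    return (sum(weight(x) for x in possible if weight(x) <= observed)
            / math.comb(row1 + row2, col1))

def fragility_index(events_1, n_1, events_2, n_2, alpha=0.05):
    """Switches needed in the arm with fewer events to reach P >= alpha.

    Returns 0 when the comparison is not statistically significant to begin with.
    """
    if events_1 > events_2:  # make arm 1 the arm with fewer events
        events_1, n_1, events_2, n_2 = events_2, n_2, events_1, n_1
    switches = 0
    while (events_1 < n_1 and
           fisher_two_sided(events_1, n_1 - events_1,
                            events_2, n_2 - events_2) < alpha):
        events_1 += 1  # convert one nonevent to an event
        switches += 1
    return switches

# Hypothetical small trial: 0/10 events vs. 6/10 events
print(fragility_index(0, 10, 6, 10))
# Hypothetical trial that is not significant at the outset: 3/10 vs. 5/10
print(fragility_index(3, 10, 5, 10))
```

In the first hypothetical trial a single changed outcome moves the Fisher exact P value from about .01 to about .06, giving a fragility index of 1.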
Negative Studies
The design and analysis of interventional trials is geared toward
testing a null hypothesis. Failing to reject the null hypothesis (i.e., not
finding evidence of a difference) is not the same as finding evidence of no difference. Evidence of no difference would be a study
in which the estimate of effect was close to unity, with a high
degree of confidence in the result (e.g., a narrow
CI around unity). While many clinicians refer to a “negative” study as one in
which P > .05, this indicates simply a reasonable likelihood that
the study result was identified by chance. If the estimate of effect
favored the intervention, but P > .05, this study may be merely
underpowered to identify a true effect. Clinicians should not
necessarily conclude that such a study supports equal outcomes
between treatment and control.
Conclusions
Good trial design is more important than statistical analysis. Once
a trial is completed, shortcomings in design cannot be mitigated,
whereas statistical analyses can be modified or corrected. The most
common shortcomings in trial design are the introduction of, or
failure to accommodate, bias and imprecision in estimating the
treatment effect, leading to an inability to address the initial study
question.
Key References
Koepsell TD, Weiss NS. Randomized Trials. Epidemiologic Methods:
Studying the Occurrence of Illness. 1st ed. New York: Oxford University
Press; 2003.
Piantadosi S. Clinical Trials: A Methodologic Perspective. 2nd ed. Hoboken, NJ: John Wiley and Sons, Inc; 2005.
Pocock SJ, McMurray JJ, Collier TJ. Making sense of statistics in clinical
trial reports: part 1 of a 4-part series on statistics for clinical trials.
J Am Coll Cardiol. 2015;66(22):2536-2549.
Pocock SJ, Clayton TC, Stone GW. Design of major randomized trials:
part 3 of a 4-part series on statistics for clinical trials. J Am Coll Cardiol. 2015;66(24):2757-2766.
The full reference list for this chapter is available at ExpertConsult.com.
eTABLE 11.1 Two-by-Two Table for Calculations
                      DISEASE OR OUTCOME
Test or exposure      Present      Absent
Positive              a            b
Negative              c            d
eTABLE 11.2 Selected Terms and Definitions
Absolute risk reduction (ARR): The difference in event rates in treated patients compared with control patients. Note that the order is reversed compared
with the attributable risk (see below).
ARR = [c/(c + d)] − [a/(a + b)]
Ascertainment bias: Observer bias; bias introduced by study staff or investigators knowing or being able to determine treatment group assignment in
randomized studies.
Attributable risk (AR): The effect of an exposure on the risk of disease in those exposed compared with those unexposed.
AR = (Frequency in exposed group) − (Frequency in unexposed group) = [a/(a + b)] − [c/(c + d)]
Blinding: Obscuring study treatment group assignment from individuals in a trial (patients, research study staff, investigators, and/or other clinical
providers).
Confidence interval (CI): The range of values likely to include the true value for the entire population. The standard is 95%, in which 95% of such
intervals will contain the true population mean.
Confounding: An effect of a third factor, one associated with both a predictor and an outcome (but not on the causal pathway between the predictor and
outcome), that may influence the observed effect of a predictor on outcome.
Effect modification: An effect of a third factor that influences the magnitude of the observed effect of a predictor on outcome.
Estimate of effect: The observed effect of an intervention in a particular study, usually presented along with an estimate of a range of effect sizes that
would be consistent with the study’s result (e.g., a relative risk and its 95% confidence interval).
Explanatory trials: Designed to observe the true biological effect, or efficacy, of an intervention, typically under tightly controlled circumstances.
Intention-to-treat analysis: Data are analyzed according to the groups to which subjects were assigned, regardless of what treatment subjects actually
received (analyzed as randomized).
Negative predictive value: The proportion of people with a negative test who are free of disease.
NPV = d/(c + d)
Number needed to treat (NNT): The number of patients needed to treat to achieve one outcome. It is the inverse of the absolute risk reduction.
NNT = 1/ARR = 1/([c/(c + d)] − [a/(a + b)])
Odds: The ratio of events to nonevents (i.e., chances of something happening divided by chances against something happening). This is not the same as
risk (which has a different denominator; see definition below). The odds of getting heads when flipping a coin are 1:1 (one to one).
Odds ratio (OR), or relative odds: The odds of an event in a treated patient versus the odds in a control patient. In case-control studies, relative risk (RR)
cannot be calculated because subjects are selected on the basis of outcome, not exposure. For rare outcomes (e.g., <10% of the population), RR can
be estimated by OR.
OR = (a/c)/(b/d) = ad/bc
Per-protocol analysis: Analysis based on subjects who received the intended intervention or adhered to treatment; this analysis loses the benefits of randomization but may be helpful in pragmatic studies.
Positive predictive value (PPV): The proportion of people with a positive test who have disease.
PPV = a/(a + b)
Pragmatic trials: Designed to test the effectiveness of an intervention in a real-world scenario, often involving a clinical environment and a broadly selected population.
P value: The probability that the observed difference, or a larger one, would have been found by chance in a particular study if no effect is truly present.
Sensitivity: The proportion of people with disease who have a positive test, = a/(a + c).
Specificity: The proportion of people free of disease who have a negative test, = d/(b + d).
Relative risk (RR): The risk of development of disease in the exposed group relative to those who were not exposed (also called risk ratio).
RR = (Prevalence in exposed group)/(Prevalence in unexposed group) = [a/(a + b)]/[c/(c + d)]
Relative risk reduction (RRR): Percent reduction in events in treated versus untreated groups.
RRR = (1 − [a/(a + b)]/[c/(c + d)]) × 100%
Risk (probability): The ratio of events to all possible events (i.e., the chances of something happening divided by the total number of chances). The risk
(probability) of getting heads when flipping a coin is 0.5, or 50%.
Type I error (α): The chance that a difference between treated and control groups studied is found when, in reality, there is no difference.
Type II error (β): The chance that no difference between treated and control groups studied is found when, in reality, there is a difference.
Power (1 − β): Statistical power is the ability of an experiment to observe a significant difference between groups when a difference truly exists. Power is
equal to 1 minus the type II error (β).
Validity: Internal validity refers to results that are real and not due to bias, chance, or confounding. External validity refers to results that can be generalized to a broader population.
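The test-performance definitions in eTable 11.2 map directly onto the cells of eTable 11.1. The following sketch simply applies those formulas; the function name `diagnostic_metrics` and the cell counts are hypothetical and chosen only for illustration.

```python
def diagnostic_metrics(a, b, c, d):
    """Test characteristics from a two-by-two table laid out as in eTable 11.1.

    a: test positive, disease present    b: test positive, disease absent
    c: test negative, disease present    d: test negative, disease absent
    """
    return {
        "sensitivity": a / (a + c),
        "specificity": d / (b + d),
        "ppv": a / (a + b),
        "npv": d / (c + d),
    }

# Hypothetical screening study of 1000 patients, 100 of whom have the disease
metrics = diagnostic_metrics(a=90, b=30, c=10, d=870)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Sensitivity and specificity are calculated down the columns (within disease status), whereas the predictive values are calculated across the rows (within test result), which is why predictive values change with disease prevalence and the other two do not.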
References
1. Piantadosi S. The study cohort, Treatment allocation. Clinical Trials:
A Methodologic Perspective. 2nd ed. Hoboken, NJ: John Wiley and
Sons, 2005.
2. Koepsell TD, Weiss NS. Overview of Study Designs. Epidemiologic
Methods: Studying the Occurrence of Illness. New York: Oxford University Press; 2003.
3. Pocock SJ, Clayton TC, Stone GW. Design of major randomized
trials: part 3 of a 4-part series on statistics for clinical trials.
J Am Coll Cardiol. 2015;66:2757-2766.
4. Koepsell TD, Weiss NS. Randomized Trials. Epidemiologic Methods: Studying the Occurrence of Illness. New York: Oxford University
Press; 2003.
5. Johnson N, Lilford RJ, Brazier W. At what level of collective equipoise does a clinical trial become ethical? J Med Ethics. 1991;17:30-34.
6. Doig GS, Simpson F. Efficient literature searching: a core skill for
the practice of evidence-based medicine. Intensive Care Med.
2003;29:2119-2127.
7. Pocock SJ, McMurray JJ, Collier TJ. Making sense of statistics in
clinical trial reports: part 1 of a 4-part series on statistics for clinical
trials. J Am Coll Cardiol. 2015;66:2536-2549.
8. Lin Y, Zhu M, Su Z. The pursuit of balance: an overview of covariate-adaptive randomization techniques in clinical trials.
Contemp Clin Trials. 2015;45:21-25.
9. Suresh K. An overview of randomization techniques: An unbiased
assessment of outcome in clinical research. J Hum Reprod Sci.
2011;4:8-11.
10. Horng S, Miller FG. Ethical framework for the use of sham procedures in clinical trials. Crit Care Med. 2003;31:S126-S130.
11. Savulescu J, Wartolowska K, Carr A. Randomised placebo-controlled trials of surgery: ethical analysis and guidelines. J Med Ethics. 2016;42:776-783.
12. Menon K, McNally JD, Zimmerman JJ, et al. Primary outcome
measures in pediatric septic shock trials: a systematic review. Pediatr
Crit Care Med. 2017;18:e146-e154.
13. Quartin AA, Schein RM, Kett DH, Peduzzi PN; for the Department of
Veterans Affairs Systemic Sepsis Cooperative Studies Group. Magnitude and duration of the effect of sepsis
on survival. JAMA. 1997;277:1058-1063.
14. Kaplan V, Clermont G, Griffin MF, et al. Pneumonia: Still the old
man’s friend? Arch Intern Med. 2003;163:317-323.
15. Herridge MS, Chu LM, Matte A, et al. The RECOVER program:
disability risk groups and 1-year outcome after 7 or more days of mechanical ventilation. Am J Respir Crit Care Med. 2016;194:831-844.
16. Volakli EA, Sdougka M, Drossou-Agakidou V, Emporiadou M,
Reizoglou M, Giala M. Short-term and long-term mortality following pediatric intensive care. Pediatr Int. 2012;54:248-255.
17. Pinto NP, Rhinesmith EW, Kim TY, Ladner PH, Pollack MM.
Long-term function after pediatric critical illness: results from the
survivor outcomes study. Pediatr Crit Care Med. 2017;18:e122-e130.
18. Matsumoto N, Hatachi T, Inata Y, Shimizu Y, Takeuchi M. Long-term mortality and functional outcome after prolonged paediatric
intensive care unit stay. Eur J Pediatr. 2019;178:155-160.
19. Knaus WA, Draper EA, Wagner DP, Zimmerman JE. Prognosis in
acute organ-system failure. Ann Surg. 1985;202:685-693.
20. Raffin TA. Intensive care unit survival of patients with systemic
illness. Am Rev Respir Dis. 1989;140:S28-S35.
21. Leteurtre S, Duhamel A, Salleron J, Grandbastien B, Lacroix J,
Leclerc F. PELOD-2: an update of the Pediatric logistic organ dysfunction score. Crit Care Med. 2013;41:1761-1773.
22. Matics TJ, Sanchez-Pinto LN. Adaptation and validation of a pediatric sequential organ failure assessment score and evaluation of the
Sepsis-3 definitions in critically ill children. JAMA Pediatr. 2017;171:e172352.
23. Weinstein MC, Stason WB. Foundations of cost-effectiveness analysis for health and medical practices. N Engl J Med. 1977;296:716-721.
24. Predicting outcome in ICU patients. 2nd European Consensus Conference in Intensive Care Medicine. J Intensive Care Med. 1994;20:390-397.
25. Yagiela LM, Barbaro RP, Quasney MW, et al. Outcomes and patterns of healthcare utilization after hospitalization for pediatric critical illness due to respiratory failure. Pediatr Crit Care Med.
2019;20:120-127.
26. Oczkowski WJ, Barreca S. The functional independence measure: its
use to identify rehabilitation needs in stroke survivors. Arch Phys
Med Rehabil. 1993;74:1291-1294.
27. Enright PL, Sherrill DL. Reference equations for the six-minute
walk in healthy adults. Am J Respir Crit Care Med. 1998;158:1384-1387.
28. Pollack MM, Holubkov R, Glass P, et al. Functional status scale: new
pediatric outcome measure. Pediatrics. 2009;124:e18-e28.
29. Ridley SA, Wallace PG. Quality of life after intensive care.
Anaesthesia. 1990;45:808-813.
30. Tarlov AR, Ware Jr JE, Greenfield S, Nelson EC, Perrin E, Zubkoff M.
The medical outcomes study. An application of methods for monitoring the results of medical care. JAMA. 1989;262:925-930.
31. Visser MC, Fletcher AE, Parr G, Simpson A, Bulpitt CJ. A comparison of three quality of life instruments in subjects with angina
pectoris: the sickness impact profile, the Nottingham Health Profile, and the Quality of Well Being Scale. J Clin Epidemiol. 1994;47:
157-163.
32. Kaplan RM, Atkins CJ, Timms R. Validity of a quality of well-being
scale as an outcome measure in chronic obstructive pulmonary disease.
J Chronic Dis. 1984;37:85-95.
33. Chelluri L, Grenvik AN, Silverman M. Intensive care for critically ill
elderly: Mortality, costs, and quality of life. Review of the literature.
Arch Intern Med. 1995;155:1013-1022.
34. Slatyer MA, James OF, Moore PG, Leeder SR. Costs, severity of illness and outcome in intensive care. Anaesth Intensive Care.
1986;14:381-389.
35. Varni JW, Seid M, Kurtin PS. PedsQL 4.0: reliability and validity of
the Pediatric Quality of Life Inventory version 4.0 generic core scales
in healthy and patient populations. Medical Care. 2001;39:800-812.
36. Leteurtre S, Martinot A, Duhamel A, et al. Development of a pediatric multiple organ dysfunction score: use of two strategies. Med
Decis Making. 1999;19:399-410.
37. Doughty L, Carcillo JA, Kaplan S, Janosky J. Plasma nitrite and
nitrate concentrations and multiple organ failure in pediatric sepsis.
Crit Care Med. 1998;26:157-162.
38. Leteurtre S, Martinot A, Duhamel A, et al. Validation of the paediatric logistic organ dysfunction (PELOD) score: prospective, observational, multicentre study. Lancet. 2003;362:192-197.
39. Proulx F, Fayon M, Farrell CA, Lacroix J, Gauthier M. Epidemiology of sepsis and multiple organ dysfunction syndrome in children.
Chest. 1996;109:1033-1037.
40. Graciano AL, Balko JA, Rahn DS, Ahmad N, Giroir BP. The Pediatric Multiple Organ Dysfunction Score (P-MODS): development
and validation of an objective scale to measure the severity of multiple organ dysfunction in critically ill children. Crit Care Med.
2005;33:1484-1491.
41. Typpo KV, Petersen NJ, Hallman DM, Markovitz BP, Mariscalco
MM. Day 1 multiple organ dysfunction syndrome is associated with
poor functional outcome and mortality in the pediatric intensive
care unit. Pediatr Crit Care Med. 2009;10:562-570.
42. Schlapbach LJ, Straney L, Bellomo R, MacLaren G, Pilcher D. Prognostic accuracy of age-adapted SOFA, SIRS, PELOD-2, and qSOFA
for in-hospital mortality among children with suspected infection
admitted to the intensive care unit. Intensive Care Med. 2018;44:179-188.