Fault Tolerant Computer
Architecture
Editor
Mark D. Hill, University of Wisconsin, Madison
Synthesis Lectures on Computer Architecture publishes 50- to 150-page publications on
topics pertaining to the science and art of designing, analyzing, selecting and interconnecting
hardware components to create computers that meet functional, performance and cost goals.
Fault Tolerant Computer Architecture
Daniel Sorin
2009
The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines
Luiz André Barroso and Urs Hölzle
2009
Computer Architecture Techniques for Power-Efficiency
Stefanos Kaxiras and Margaret Martonosi
2008
Chip Multiprocessor Architecture: Techniques to Improve Throughput and Latency
Kunle Olukotun, Lance Hammond, James Laudon
2007
Transactional Memory
James R. Larus, Ravi Rajwar
2007
Quantum Computing for Computer Architects
Tzvetan S. Metodi, Frederic T. Chong
2006
Synthesis Lectures on Computer
Architecture
Copyright © 2009 by Morgan & Claypool
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in
any form or by any means—electronic, mechanical, photocopy, recording, or any other except for brief quotations
in printed reviews, without the prior permission of the publisher.
Fault Tolerant Computer Architecture
Daniel Sorin
www.morganclaypool.com
ISBN: 9781598299533 paperback
ISBN: 9781598299540 ebook
DOI: 10.2200/S00192ED1V01Y200904CAC005
A Publication in the Morgan & Claypool Publishers series
SYNTHESIS LECTURES ON COMPUTER ARCHITECTURE
Lecture #5
Series Editor: Mark D. Hill, University of Wisconsin, Madison
Series ISSN
ISSN 1935-3235 print
ISSN 1935-3243 electronic
Fault Tolerant Computer
Architecture
Daniel J. Sorin
Duke University
SYNTHESIS LECTURES ON COMPUTER ARCHITECTURE #5
ABSTRACT
For many years, most computer architects have pursued one primary goal: performance. Architects
have translated the ever-increasing abundance of ever-faster transistors provided by Moore’s law
into remarkable increases in performance. Recently, however, the bounty provided by Moore’s law
has been accompanied by several challenges that have arisen as devices have become smaller,
including a decrease in dependability due to physical faults. In this book, we focus on the dependability
challenge and the fault tolerance solutions that architects are developing to overcome it. The two
main purposes of this book are to explore the key ideas in fault-tolerant computer architecture and
to present the current state of the art (over approximately the past 10 years) in academia and
industry.
KEYWORDS
fault tolerance (or fault tolerant), reliability, dependability, computer architecture, error detection,
error recovery, fault diagnosis, self-repair, autonomous, dynamic verification
Dedication
“To Deborah, Jason, and Julie”
Acknowledgments
I would like to thank my family for their support while I was writing this lecture. I would also like to
thank Mark Hill for inviting me to write this lecture and Mike Morgan for organizing the production
of the lecture. Valuable feedback on early drafts of the lecture was provided by Babak Falsafi,
Jude Rivers, and Mark Hill. I would also like to thank Lihao Xu for helping me with a question
about error coding.
Contents
1. Introduction
1.1 Goals of this Book
1.2 Faults, Errors, and Failures
1.2.1 Masking
1.2.2 Duration of Faults and Errors
1.2.3 Underlying Physical Phenomena
1.3 Trends Leading to Increased Fault Rates
1.3.1 Smaller Devices and Hotter Chips
1.3.2 More Devices per Processor
1.3.3 More Complicated Designs
1.4 Error Models
1.4.1 Error Type
1.4.2 Error Duration
1.4.3 Number of Simultaneous Errors
1.5 Fault Tolerance Metrics
1.5.1 Availability
1.5.2 Reliability
1.5.3 Mean Time to Failure
1.5.4 Mean Time Between Failures
1.5.5 Failures in Time
1.5.6 Architectural Vulnerability Factor
1.6 The Rest of This Book
1.7 References
2. Error Detection
2.1 General Concepts
2.1.1 Physical Redundancy
2.1.2 Temporal Redundancy
2.1.3 Information Redundancy
2.1.4 The End-to-End Argument
2.2 Microprocessor Cores
2.2.1 Functional Units
2.2.2 Register Files
2.2.3 Tightly Lockstepped Redundant Cores
2.2.4 Redundant Multithreading Without Lockstepping
2.2.5 Dynamic Verification of Invariants
2.2.6 High-Level Anomaly Detection
2.2.7 Using Software to Detect Hardware Errors
2.2.8 Error Detection Tailored to Specific Fault Models
2.3 Caches and Memory
2.3.1 Error Code Implementation
2.3.2 Beyond EDCs
2.3.3 Detecting Errors in Content Addressable Memories
2.3.4 Detecting Errors in Addressing
2.4 Multiprocessor Memory Systems
2.4.1 Dynamic Verification of Cache Coherence
2.4.2 Dynamic Verification of Memory Consistency
2.4.3 Interconnection Networks
2.5 Conclusions
2.6 References
3. Error Recovery
3.1 General Concepts
3.1.1 Forward Error Recovery
3.1.2 Backward Error Recovery
3.1.3 Comparing the Performance of FER and BER
3.2 Microprocessor Cores
3.2.1 FER for Cores
3.2.2 BER for Cores
3.3 Single-Core Memory Systems
3.3.1 FER for Caches and Memory
3.3.2 BER for Caches and Memory
3.4 Issues Unique to Multiprocessors