Joel Emer
December 7, 2005
6.823, L24-1
Reliable Architectures
Joel Emer
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology
Strike Changes State of a Single Bit
[Figure: a single bit, shown as 0 and 1]
Impact of Neutron Strike on a Si Device
[Figure: Transistor Device. A neutron strike passes between source and drain, leaving + and - charges along its track.]
Strikes release electron & hole pairs that can be absorbed by source & drain to alter the state of the device
• Secondary source of upsets: alpha particles from packaging
Cosmic Rays Come From Deep Space
[Figure: cosmic-ray protons (p) and neutrons (n) cascading down to Earth’s Surface]
• Neutron flux is higher at higher altitudes
– 3x - 5x increase in Denver at 5,000 feet
– 100x increase in airplanes at 30,000+ feet
Physical Solutions are hard
• Shielding?
– No practical absorbent (e.g., approximately > 10 ft of concrete), unlike alpha particles
• Technology solution: SOI?
– Partially-depleted SOI of some help, effect on logic unclear
– Fully-depleted SOI may help, but is challenging to manufacture
• Circuit level solution?
– Radiation hardened circuits can provide 10x improvement with
significant penalty in performance, area, cost
– 2-4x improvement may be possible with less penalty
Triple Modular Redundancy
(Von Neumann, 1956)
[Figure: three modules M feed a voter V, which produces Result]
• V does a majority vote on the results
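The voter's job can be sketched in a few lines. This is a minimal illustration, not the slide's hardware design: `majority_vote` and the sample module outputs are hypothetical names and values chosen for the example.

```python
from collections import Counter


def majority_vote(results):
    """Return the value produced by a strict majority of the modules.

    Raises ValueError when no value holds a strict majority.
    """
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise ValueError("no majority among module results")
    return value


# Hypothetical outputs of three modules; the second one suffered a bit flip.
outputs = [0b1011, 0b1111, 0b1011]
voted = majority_vote(outputs)  # the two agreeing modules outvote the faulty one
```

With three modules, any single faulty result is outvoted; if all three disagree, this sketch raises an error rather than guessing.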
Dual Modular Redundancy
(e.g., Binac, Stratus)
[Figure: two modules M feed a comparator C that signals Mismatch?; each module also has its own Error? signal]
• Processing stops on mismatch
• The error signal is used to decide which processor is used to restore state to the other
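The compare-and-stop behavior can be sketched as follows. This is only an illustration under simple assumptions: the modules are modeled as plain functions, and `run_in_lockstep`, `good`, and `faulty` are hypothetical names. The comparator alone cannot tell which module is wrong; the per-module Error? signals that decide this are not modeled here.

```python
class Mismatch(Exception):
    """Raised when the two redundant modules disagree."""


def run_in_lockstep(module_a, module_b, inputs):
    """Feed each input to both modules and compare their outputs.

    Processing stops at the first mismatch.
    """
    results = []
    for step, x in enumerate(inputs):
        a, b = module_a(x), module_b(x)
        if a != b:
            raise Mismatch(f"step {step}: {a!r} != {b!r}")
        results.append(a)
    return results


def good(x):
    return x + 1


def faulty(x):
    # Hypothetical fault: the low bit of the result flips when x == 2.
    return (x + 1) ^ 1 if x == 2 else x + 1
```

Running `run_in_lockstep(good, faulty, [0, 1, 2, 3])` raises `Mismatch` at step 2, while two fault-free modules run to completion.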
Pair and Spare Lockstep
(e.g., Tandem, 1975)
[Figure: a Primary pair and a Backup pair; in each pair, two modules M feed a comparator C that signals Mismatch?]
• Primary creates periodic checkpoints
• Backup restarts from checkpoint on mismatch
Redundant Multithreading
(e.g., Reinhardt, Mukherjee, 2000)
[Figure: a Leading Thread and a Trailing Thread each execute the same sequence of instructions X and writes W; a checker C compares each pair of writes and signals Fault?]
• Writes are checked
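A rough sketch of the write check, assuming each write is modeled as an (address, value) pair and both threads produce the same number of writes. `checked_writes` is a hypothetical helper for illustration; it does not model the actual hardware mechanism.

```python
def checked_writes(leading_writes, trailing_writes):
    """Compare the writes of two redundant threads, pair by pair.

    Returns (committed, fault_index). fault_index is None when every pair
    of writes agreed; otherwise it is the index of the first mismatch.
    """
    committed = []
    for i, (lead, trail) in enumerate(zip(leading_writes, trailing_writes)):
        if lead != trail:
            return committed, i  # the checker C signals a fault
        committed.append(lead)
    return committed, None


# Hypothetical write streams: the second write of the trailing thread differs.
leading = [(0x10, 1), (0x14, 2)]
trailing = [(0x10, 1), (0x14, 3)]
committed, fault_at = checked_writes(leading, trailing)
```

Only writes that both threads agree on are committed, so in this model a corrupted value is caught before it leaves the checked region.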
Component Protection
[Figure: a stored word with a Parity bit, checked to signal Error?; a stored word with ECC bits]
• Fujitsu SPARC in 130 nm technology (ISSCC 2003)
– 80% of 200k latches protected with parity
– versus very few latches protected in commodity microprocessors
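As a small illustration of parity protection, the sketch below uses a single even-parity bit over a word. `parity_bit` and `has_error` are hypothetical names, and the 8-bit data value is made up for the example; ECC is not modeled.

```python
def parity_bit(word):
    """Even-parity bit: 1 when the word holds an odd number of 1 bits."""
    return bin(word).count("1") & 1


def has_error(word, stored_parity):
    """Report a parity error when the recomputed parity disagrees."""
    return parity_bit(word) != stored_parity


data = 0b1011_0010           # hypothetical latch contents (four 1 bits)
stored = parity_bit(data)    # parity saved alongside the data
struck = data ^ (1 << 5)     # a strike flips a single bit
double = data ^ 0b11         # two flipped bits cancel out in the parity
```

Here `has_error(struck, stored)` is true while `has_error(double, stored)` is false: a single parity bit detects an odd number of flips but cannot correct them.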
Strike on a bit (e.g., in register file)
• Bit read?
– no: benign fault, no error
– yes: bit has error protection?
   – detection & correction: no error
   – no: affects program outcome?
      – yes: SDC
      – no: benign fault, no error
   – detection only: affects program outcome?
      – yes: True DUE
      – no: False DUE
SDC = Silent Data Corruption, DUE = Detected Unrecoverable Error
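The decision tree can be written as a small function. This is one reading of the flowchart; the function name `classify_strike` and the string labels for the protection level are hypothetical.

```python
def classify_strike(bit_read, protection, affects_outcome):
    """Classify the outcome of a strike on a bit.

    protection is one of "none", "detect" (detection only), or
    "correct" (detection & correction).
    """
    if not bit_read:
        return "benign fault (no error)"
    if protection == "correct":
        return "no error"
    if protection == "detect":
        return "True DUE" if affects_outcome else "False DUE"
    if protection == "none":
        return "SDC" if affects_outcome else "benign fault (no error)"
    raise ValueError(f"unknown protection level: {protection!r}")
```

Detection without correction turns would-be SDCs into True DUEs, and it also flags faults that would have been benign, which are the False DUEs.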
Metrics
• Interval-based
– MTTF = Mean Time to Failure
– MTTR = Mean Time to Repair
– MTBF = Mean Time Between Failures = MTTF + MTTR
– Availability = MTTF / MTBF
• Rate-based
– FIT = Failure in Time = 1 failure in a billion hours
– 1 year MTTF = 10^9 / (24 * 365) FIT = 114,155 FIT
– SER FIT = SDC FIT + DUE FIT
Image removed due to
copyright restrictions.
Hypothetical Example
Cache: 0 FIT
+ IQ: 100K FIT
+ FU: 58K FIT
Total of 158K FIT
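The conversions between the metrics can be checked with a short script. This sketch assumes a constant failure rate, so that MTTF in hours is 10^9 divided by the FIT rate; the helper names are hypothetical.

```python
HOURS_PER_YEAR = 24 * 365
BILLION_HOURS = 1e9


def mttf_years_to_fit(years):
    """FIT rate corresponding to a given MTTF in years."""
    return BILLION_HOURS / (years * HOURS_PER_YEAR)


def fit_to_mttf_years(fit):
    """MTTF in years corresponding to a given FIT rate."""
    return BILLION_HOURS / (fit * HOURS_PER_YEAR)


def availability(mttf, mttr):
    """Availability = MTTF / MTBF, with MTBF = MTTF + MTTR."""
    return mttf / (mttf + mttr)


# FIT rates of components add, as in the hypothetical example.
component_fit = {"Cache": 0, "IQ": 100_000, "FU": 58_000}
total_fit = sum(component_fit.values())
total_mttf_years = fit_to_mttf_years(total_fit)
```

For the hypothetical 158K FIT total, `total_mttf_years` comes out to roughly 0.72 years, and `mttf_years_to_fit(1)` rounds to the 114,155 FIT quoted for a 1 year MTTF.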
Cosmic Ray Strikes: Evidence & Reaction
• Publicly disclosed incidents
– Error logs in large servers: E. Normand, “Single Event Upset at Ground Level,” IEEE Trans. on Nucl. Sci., Vol. 43, No. 6, Dec. 1996.
– Sun Microsystems found that cosmic ray strikes on an L2 cache with defective error protection caused Sun’s flagship servers to crash: R. Baumann, IRPS Tutorial on SER, 2000.
– Cypress Semiconductor reported in 2004 that a single soft error brought a billion-dollar automotive factory to a halt once a month: Ziegler & Puchner, “SER – History, Trends, and Challenges,” Cypress, 2004.
# Vulnerable Bits Growing with Moore’s Law
[Figure: log-scale plot, 1 to 10000, against Year (2003 to 2012), with curves labeled 100% Vulnerable and 20% Vulnerable, a 1000 year MTBF Goal line, and a 12x GAP annotation]
Typical SDC goal: 1000 year MTBF
Typical DUE goal: 10-25 year MTBF
Architectural Vulnerability Factor (AVF)
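\[ \mathrm{AVF}_{bit} = \text{Probability Bit Matters} = \frac{\text{\# of Visible Errors}}{\text{\# of Bit Flips from Particle Strikes}} \]
\[ \mathrm{FIT}_{bit} = \text{intrinsic } \mathrm{FIT}_{bit} \times \mathrm{AVF}_{bit} \]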
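A worked instance of the two definitions, using made-up numbers: the counts of flips and visible errors and the intrinsic FIT value below are hypothetical, as are the function names.

```python
def avf(visible_errors, bit_flips):
    """AVF = visible errors / bit flips from particle strikes."""
    if bit_flips <= 0:
        raise ValueError("need at least one bit flip to estimate AVF")
    return visible_errors / bit_flips


def derated_fit(intrinsic_fit, avf_value):
    """FIT of a bit = intrinsic FIT of the bit * AVF of the bit."""
    return intrinsic_fit * avf_value


# Hypothetical: 250 of 1000 flips were visible; intrinsic rate 0.001 FIT/bit.
bit_avf = avf(250, 1000)
bit_fit = derated_fit(0.001, bit_avf)
```

With these numbers the AVF is 0.25, so the effective FIT of the bit is a quarter of its intrinsic FIT.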
Architectural Vulnerability Factor
Does a bit matter?
• Branch Predictor
– Doesn’t matter at all (AVF = 0%)
• Program Counter
– Almost always matters (AVF ~ 100%)
Statistical Fault Injection (SFI) with RTL
[Figure: Simulate Strike on Latch (stored value flips between 1 and 0); the fault passes through Logic to the output]
Does Fault Propagate to Architectural State?
+ Naturally characterizes all logical structures
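The idea can be mimicked on a toy model. This sketch does not use an RTL simulator: `logic` is a hypothetical stand-in for the combinational logic, in which only the low 4 of 8 latch bits reach the output, and `inject_faults` is a made-up helper that flips one random latch bit per trial.

```python
import random


def logic(latches):
    """Hypothetical logic: only the low 4 of 8 latch bits reach the output."""
    return latches & 0x0F


def inject_faults(trials, width=8, seed=0):
    """Estimate the fraction of single-bit latch flips that change the output."""
    rng = random.Random(seed)
    propagated = 0
    for _ in range(trials):
        state = rng.getrandbits(width)   # random fault-free latch contents
        bit = rng.randrange(width)       # latch bit hit by the simulated strike
        faulty = state ^ (1 << bit)
        if logic(faulty) != logic(state):
            propagated += 1              # fault reached the output
    return propagated / trials


estimate = inject_faults(2000)
```

Because half of the latch bits are masked in this toy logic, the estimate should settle near one half as the number of trials grows.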
Architecturally Correct Execution (ACE)
[Figure: data flow from Program Input to Program Outputs]
• ACE path requires only a subset of values to flow correctly through the program’s data flow graph (and the machine)
• Anything else (un-ACE path) can be derated away
Example of un-ACE instruction: Dynamically Dead Instruction
[Figure: Dynamically Dead Instruction]
Most bits of an un-ACE instruction do not affect program output
Vulnerability of a structure
AVF = fraction of cycles a bit contains ACE state
T=1
ACE% = 2/4
T=2
ACE% = 1/4
T=3
ACE% = 0/4
T=4
ACE% = 3/4
\[ \mathrm{AVF} = \frac{\text{Average number of ACE bits in a cycle}}{\text{Total number of bits in the structure}} = \frac{(2+1+0+3)/4}{4} \]
Little’s Law for ACEs
\[ N_{ace} = T_{ace} \times L_{ace} \]
\[ \mathrm{AVF} = \frac{N_{ace}}{N_{total}} \]
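The four-cycle example can be reproduced directly. In this sketch the function names are hypothetical, and reading T ace as the throughput of ACE bits and L ace as their latency in the structure is an assumption; the throughput and latency values are made up so that they match the example.

```python
def structure_avf(ace_bits_per_cycle, total_bits):
    """AVF = average number of ACE bits per cycle / total bits in the structure."""
    average_ace = sum(ace_bits_per_cycle) / len(ace_bits_per_cycle)
    return average_ace / total_bits


def littles_law_avf(ace_throughput, ace_latency, total_bits):
    """AVF via Little's Law: N_ace = T_ace * L_ace, AVF = N_ace / N_total."""
    n_ace = ace_throughput * ace_latency
    return n_ace / total_bits


# ACE bit counts at T=1..4 for the 4-bit structure in the example.
example_avf = structure_avf([2, 1, 0, 3], total_bits=4)

# Hypothetical: 0.5 ACE bits enter per cycle and each stays for 3 cycles.
littles_avf = littles_law_avf(0.5, 3, total_bits=4)
```

Both computations give 0.375, that is, an AVF of 37.5% for the example structure.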