
Reliable Architectures


Joel Emer
December 7, 2005
6.823, L24-1

Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology



Strike Changes State of a Single Bit

[Figure: a particle strike flips the value stored in a single bit, between 0 and 1.]



Impact of Neutron Strike on a Si Device
[Figure: a neutron strike on a transistor device, leaving positive and negative charges near the source and drain.]

Strikes release electron & hole pairs that can be absorbed by the source & drain to alter the state of the device

• Secondary source of upsets: alpha particles from packaging



Cosmic Rays Come From Deep Space


[Figure: cosmic-ray protons (p) and neutrons (n) cascading down to the Earth’s surface.]

• Neutron flux is higher at higher altitudes
– 3x - 5x increase in Denver at 5,000 feet
– 100x increase in airplanes at 30,000+ feet
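
To make the altitude numbers concrete, here is a minimal sketch that scales a sea-level error rate by the flux multipliers quoted above. It assumes the soft error rate scales linearly with neutron flux, which is an assumption of this example rather than a claim from the slide; the function name and the sea-level rate are hypothetical.

```python
# Flux multipliers relative to sea level, as (low, high) ranges from the slide.
FLUX_MULTIPLIER = {
    "sea level": (1.0, 1.0),
    "Denver, 5,000 ft": (3.0, 5.0),
    "airplane, 30,000+ ft": (100.0, 100.0),
}


def scaled_error_rate(sea_level_rate, location):
    """Return a (low, high) error-rate range for a location.

    Assumption: the error rate is proportional to neutron flux.
    """
    low, high = FLUX_MULTIPLIER[location]
    return sea_level_rate * low, sea_level_rate * high


# Hypothetical device with a sea-level rate of 1000 (any rate unit).
print(scaled_error_rate(1000, "Denver, 5,000 ft"))
print(scaled_error_rate(1000, "airplane, 30,000+ ft"))
```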


Physical Solutions are hard


• Shielding?
– No practical absorber for neutrons (e.g., approximately > 10 ft of concrete), unlike alpha particles

• Technology solution: SOI?
– Partially-depleted SOI of some help, effect on logic unclear
– Fully-depleted SOI may help, but is challenging to manufacture


• Circuit level solution?
– Radiation hardened circuits can provide 10x improvement with
significant penalty in performance, area, cost
– 2-4x improvement may be possible with less penalty


Triple Modular Redundancy
(Von Neumann, 1956)

[Figure: three redundant modules (M) feed a voter (V), which produces the Result.]

V does a majority vote on the results
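
A minimal software sketch of the voter, assuming each module is a function returning an integer. The names `majority_vote` and `tmr`, and the `faulty_copy` parameter used to simulate an upset, are hypothetical and only illustrate the idea.

```python
from collections import Counter


def majority_vote(results):
    """Return the value produced by a strict majority of modules."""
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise ValueError("no majority among module results")
    return value


def tmr(module, x, faulty_copy=None):
    """Run three copies of `module` on `x` and vote on the outputs.

    If `faulty_copy` is 0, 1, or 2, the low bit of that copy's output is
    flipped to simulate a single upset (illustration only).
    """
    outputs = [module(x) for _ in range(3)]
    if faulty_copy is not None:
        outputs[faulty_copy] ^= 1
    return majority_vote(outputs)


# One corrupted copy is outvoted by the other two.
print(tmr(lambda v: v + 1, 41, faulty_copy=1))
```

A single faulty module is masked; if all three outputs disagree, the voter has no majority and this sketch raises an error.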



Dual Modular Redundancy


(e.g., Binac, Stratus)
[Figure: two modules (M), each with its own Error? signal, feed a comparator (C) that raises Mismatch?]

• Processing stops on mismatch
• The error signals are used to decide which processor's state is used to restore the other
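
The recovery decision can be sketched as a small function. This is an illustration under assumptions: each module is taken to expose a boolean self-check error signal, and the returned action strings and the name `dmr_step` are hypothetical.

```python
def dmr_step(out_a, err_a, out_b, err_b):
    """Compare the outputs of modules A and B for one step.

    Returns (output, action). On a mismatch, processing stops and the
    per-module error signals decide which copy is trusted.
    """
    if out_a == out_b:
        return out_a, "continue"
    if err_a and not err_b:
        return out_b, "restore A from B"
    if err_b and not err_a:
        return out_a, "restore B from A"
    # Mismatch, but the error signals do not single out one module.
    return None, "halt"


print(dmr_step(5, False, 5, False))
print(dmr_step(4, True, 5, False))
```

With only two copies, a mismatch by itself cannot say which module is wrong; that is why the sketch falls back to halting when neither or both error signals fire.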


Pair and Spare Lockstep


(e.g., Tandem, 1975)

[Figure: a Primary pair and a Backup pair of modules (M); within each pair a comparator (C) checks the two outputs and raises Mismatch?]

• Primary creates periodic checkpoints
• Backup restarts from checkpoint on mismatch


Redundant Multithreading


(e.g., Reinhardt, Mukherjee, 2000)
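[Figure: a Leading Thread and a Trailing Thread each execute the same sequence of X and W operations; a checker (C) compares the corresponding W operations of the two threads and raises Fault? when they differ.]

• Writes are checked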



Component Protection
[Figure: stored bits protected by Parity (the check raises Error?) and stored bits protected by ECC.]

• Fujitsu SPARC in 130 nm technology (ISSCC 2003)
– 80% of 200k latches protected with parity
– versus very few latches protected in commodity microprocessors
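
As a reminder of what parity protection on a latch or word buys, here is a minimal sketch of even parity. It detects any single-bit flip but cannot correct it, and it misses double flips. The function names are hypothetical, and this models the concept only, not the Fujitsu design.

```python
def parity(word):
    """Even-parity bit of a non-negative integer: 1 if it has an odd number of 1s."""
    return bin(word).count("1") & 1


def store(word):
    """Return the word together with the parity bit computed at write time."""
    return word, parity(word)


def error_detected(word, stored_parity):
    """True if the word read back no longer matches its stored parity."""
    return parity(word) != stored_parity


word, p = store(0b1011)
print(error_detected(word, p))            # clean read
print(error_detected(word ^ 0b0100, p))   # one bit flipped by a strike
print(error_detected(word ^ 0b0110, p))   # two bits flipped
```

The last case shows the limit of parity: an even number of flips leaves the parity unchanged, so the error goes undetected.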




Strike on a bit (e.g., in register file)
Bit read?
– no: benign fault, no error
– yes: bit has error protection?
    – detection & correction: no error
    – detection only: affects program outcome?
        – yes: True DUE
        – no: False DUE
    – no protection: affects program outcome?
        – yes: SDC
        – no: benign fault, no error

SDC = Silent Data Corruption, DUE = Detected Unrecoverable Error
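
The decision flow on this slide can be written as a small classifier. This is a sketch of one reading of the flow chart; the function name and the string labels for the protection level are hypothetical.

```python
def classify_strike(bit_read, protection, affects_outcome):
    """Classify the result of a strike on a bit.

    protection: "none", "detect" (detection only), or
                "correct" (detection & correction).
    """
    if not bit_read:
        return "benign fault (no error)"
    if protection == "correct":
        return "no error"
    if protection == "detect":
        return "True DUE" if affects_outcome else "False DUE"
    if protection == "none":
        return "SDC" if affects_outcome else "benign fault (no error)"
    raise ValueError(f"unknown protection level: {protection!r}")


print(classify_strike(True, "none", True))
print(classify_strike(True, "detect", False))
```

Note the False DUE case: detection without correction reports an error even when the flipped bit would not have changed the program outcome.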


Metrics
• Interval-based
– MTTF = Mean Time to Failure
– MTTR = Mean Time to Repair
– MTBF = Mean Time Between Failures = MTTF + MTTR
– Availability = MTTF / MTBF

• Rate-based
– FIT = Failure in Time = 1 failure in a billion hours
– 1 year MTTF = 10^9 / (24 * 365) FIT = 114,155 FIT
– SER FIT = SDC FIT + DUE FIT

Image removed due to
copyright restrictions.

Hypothetical Example
Cache: 0 FIT + IQ: 100K FIT + FU: 58K FIT = Total of 158K FIT
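
The conversions between FIT and MTTF, and the additivity of FIT rates, can be checked with a few lines. This is a minimal sketch; the function names are hypothetical, and a year is taken as 24 * 365 hours as on the slide.

```python
HOURS_PER_YEAR = 24 * 365
BILLION_HOURS = 1e9


def mttf_years_to_fit(mttf_years):
    """FIT rate (failures per billion hours) for a given MTTF in years."""
    return BILLION_HOURS / (mttf_years * HOURS_PER_YEAR)


def fit_to_mttf_years(fit):
    """MTTF in years for a given FIT rate."""
    return BILLION_HOURS / (fit * HOURS_PER_YEAR)


def availability(mttf, mttr):
    """Availability = MTTF / MTBF, with MTBF = MTTF + MTTR (same time unit)."""
    return mttf / (mttf + mttr)


# FIT rates of components add; values from the hypothetical example.
total_fit = sum({"Cache": 0, "IQ": 100_000, "FU": 58_000}.values())

print(round(mttf_years_to_fit(1)))      # 1 year MTTF expressed in FIT
print(total_fit)                        # total FIT of the example
print(fit_to_mttf_years(total_fit))     # MTTF of the example, in years
```

For the hypothetical example, 158K FIT corresponds to an MTTF of a little under nine months, which is why rate-based FIT numbers are convenient: they add across structures, whereas MTTFs do not.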



Cosmic Ray Strikes: Evidence & Reaction

• Publicly disclosed incidents
– Error logs in large servers: E. Normand, “Single Event Upset at Ground Level,” IEEE Trans. on Nucl. Sci., Vol. 43, No. 6, Dec 1996.
– Sun Microsystems found that cosmic ray strikes on an L2 cache with defective error protection caused Sun’s flagship servers to crash: R. Baumann, IRPS Tutorial on SER, 2000.
– Cypress Semiconductor reported in 2004 that a single soft error brought a billion-dollar automotive factory to a halt once a month: Ziegler & Puchner, “SER – History, Trends, and Challenges,” Cypress, 2004.




# Vulnerable Bits Growing with Moore’s Law

[Figure: log-scale plot (1 to 10000) versus Year, 2003 to 2012, with curves labeled 100% Vulnerable and 20% Vulnerable, a 1000 year MTBF Goal line, and a 12x GAP annotation.]

Typical SDC goal: 1000 year MTBF

Typical DUE goal: 10-25 year MTBF



Architectural Vulnerability Factor (AVF)

AVF_bit = Probability Bit Matters
        = (# of Visible Errors) / (# of Bit Flips from Particle Strikes)

FIT_bit = intrinsic FIT_bit * AVF_bit
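
A minimal sketch of the two formulas above, with hypothetical function names and made-up input numbers chosen only to show the arithmetic.

```python
def avf_bit(visible_errors, bit_flips):
    """AVF of a bit: fraction of strike-induced bit flips that become visible errors."""
    if bit_flips <= 0:
        raise ValueError("need at least one bit flip to estimate AVF")
    return visible_errors / bit_flips


def derated_fit(intrinsic_fit, avf):
    """Effective FIT of a bit: intrinsic FIT derated by its AVF."""
    return intrinsic_fit * avf


# Made-up numbers: 25 visible errors out of 100 flips, intrinsic rate 0.001 FIT.
avf = avf_bit(25, 100)
print(avf)
print(derated_fit(0.001, avf))
```

An AVF of 0% removes a bit from the error budget entirely, while an AVF of 100% means every flip of that bit counts at its full intrinsic rate.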



Architectural Vulnerability Factor

Does a bit matter?

• Branch Predictor
– Doesn’t matter at all (AVF = 0%)

• Program Counter
– Almost always matters (AVF ~ 100%)




Statistical Fault Injection (SFI) with RTL

[Figure: a latch value is flipped (1 to 0) and the fault is followed through the downstream logic to the output.]

• Simulate strike on latch
• Does fault propagate to architectural state?

+ Naturally characterizes all logical structures
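
The idea of SFI can be illustrated on a toy circuit in place of real RTL. The sketch below is an assumption-laden stand-in: three latches feed a hypothetical logic function `(a & b) | c`, a random latch is flipped in a random state, and the output is compared against the fault-free ("golden") output. The fraction of flips that change the output is a statistical estimate of how often a strike matters.

```python
import random


def logic(latches):
    """Toy combinational logic driven by three latches (hypothetical circuit)."""
    a, b, c = latches
    return (a & b) | c


def sfi_estimate(trials, seed=0):
    """Estimate the fraction of single-latch flips that change the output."""
    rng = random.Random(seed)
    visible = 0
    for _ in range(trials):
        state = [rng.randint(0, 1) for _ in range(3)]
        golden = logic(state)
        faulty = list(state)
        faulty[rng.randrange(3)] ^= 1  # simulated strike on one latch
        if logic(faulty) != golden:
            visible += 1
    return visible / trials


print(sfi_estimate(20000))
```

For this toy circuit an exhaustive count over all 8 states and 3 latches gives 10 visible flips out of 24, so the estimate should land near 0.42; many flips are masked by the logic and never reach the output.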




Architecturally Correct Execution (ACE)

[Figure: data flow graph from Program Input to Program Outputs.]

• ACE path requires only a subset of values to flow correctly through the program’s data flow graph (and the machine)
• Anything else (un-ACE path) can be derated away


Example of un-ACE instruction:
Dynamically Dead Instruction


Most bits of an un-ACE instruction do not affect
program output




Vulnerability of a structure
AVF = fraction of cycles a bit contains ACE state


Example: a 4-bit structure observed over 4 cycles

T=1: ACE% = 2/4
T=2: ACE% = 1/4
T=3: ACE% = 0/4
T=4: ACE% = 3/4

AVF = (Average number of ACE bits in a cycle) / (Total number of bits in the structure)
    = ((2+1+0+3)/4) / 4
    = 0.375
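
The same calculation as a short sketch, using the per-cycle ACE bit counts from the four-cycle example (2, 1, 0, 3 out of 4 bits). The function name is hypothetical.

```python
def structure_avf(ace_bits_per_cycle, total_bits):
    """AVF = average number of ACE bits per cycle / total bits in the structure."""
    average_ace_bits = sum(ace_bits_per_cycle) / len(ace_bits_per_cycle)
    return average_ace_bits / total_bits


print(structure_avf([2, 1, 0, 3], total_bits=4))  # 0.375
```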


Little’s Law for ACEs


N_ace = T_ace × L_ace

AVF = N_ace / N_total
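
A minimal sketch of this relation, reading the symbols in the usual Little's Law sense: T_ace as the average rate at which ACE bits enter the structure (bits per cycle), L_ace as the average number of cycles an ACE bit stays there, and N_ace as the resulting average number of ACE bits resident. That reading, the function name, and the input numbers are assumptions of this example.

```python
def avf_from_littles_law(ace_throughput, ace_latency, total_bits):
    """AVF = N_ace / N_total, with N_ace = T_ace * L_ace (Little's Law)."""
    n_ace = ace_throughput * ace_latency
    return n_ace / total_bits


# Hypothetical structure of 4 bits: on average 0.5 ACE bits arrive per cycle
# and each stays for 3 cycles, so 1.5 ACE bits are resident on average.
print(avf_from_littles_law(ace_throughput=0.5, ace_latency=3, total_bits=4))  # 0.375
```

This form is useful when throughput and residence time are easier to measure than per-cycle occupancy.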


