
Code Design for
Dependable Systems
Theory and Practical Applications
Eiji Fujiwara
Tokyo Institute of Technology
A JOHN WILEY & SONS, INC., PUBLICATION
Copyright © 2006 by John Wiley & Sons, Inc. All rights reserved.
Published by John Wiley & Sons, Inc., Hoboken, New Jersey
Published simultaneously in Canada
No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by
any means, electronic, mechanical, photocopying, recording, scanning, or otherwise, except as permitted under
Section 107 or 108 of the 1976 United States Copyright Act, without either the prior written permission
of the Publisher, or authorization through payment of the appropriate per-copy fee to the Copyright
Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, (978) 750-8400, fax (978) 750-4470,
or on the web at www.copyright.com. Requests to the Publisher for permission should be addressed to the
Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, (201) 748-6011,
fax (201) 748-6008, or online.
Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in
preparing this book, they make no representations or warranties with respect to the accuracy or completeness
of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness
for a particular purpose. No warranty may be created or extended by sales representatives or written sales
materials. The advice and strategies contained herein may not be suitable for your situation. You should
consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss
of profit or any other commercial damages, including but not limited to special, incidental, consequential,
or other damages.
For general information on our other products and services or for technical support, please contact our
Customer Care Department within the United States at (800) 762-2974, outside the United States
at (317) 572-3993 or fax (317) 572-4002.
Wiley also publishes its books in a variety of electronic formats. Some content that appears in print may not
be available in electronic formats. For more information about Wiley products, visit our web site at
www.wiley.com.
Library of Congress Cataloging-in-Publication Data is available.
ISBN-13 978-0-471-75618-7
ISBN-10 0-471-75618-0
Printed in the United States of America
10 9 8 7 6 5 4 3 2 1
Contents
Preface ix
1 Introduction 3
1.1 Faults and Failures / 3
1.2 Error Models / 6
1.3 Error Recovery Techniques for Dependable Systems / 10
1.4 Code Design Process for Dependable Systems / 16
References / 19
2 Mathematical Background and Matrix Codes 23
2.1 Introduction to Algebra / 23
2.2 Linear Codes / 33
2.3 Basic Matrix Codes / 48
Exercises / 71
References / 75
3 Code Design Techniques for Matrix Codes 77
3.1 Minimum-Weight & Equal-Weight-Row Codes / 78
3.2 Odd-Weight-Column Codes / 82

3.3 Even-Weight-Row Codes / 84
3.4 Odd-Weight-Row Codes / 86
3.5 Rotational Codes / 87
Exercises / 92
References / 93
4 Codes for High-Speed Memories I: Bit Error Control Codes 97
4.1 Modified Hamming SEC-DED Codes / 98
4.2 Modified Double-Bit Error Correcting BCH Codes / 105
4.3 On-Chip ECCs / 110
Exercises / 123
References / 126
5 Codes for High-Speed Memories II: Byte Error Control Codes 133
5.1 Single-Byte Error Correcting (SbEC) Codes / 134
5.2 Single-Byte Error Correcting and Double-Byte Error Detecting
(SbEC-DbED) Codes / 154
5.3 Single-Byte Error Correcting and Single p-Byte within a Block
Error Detecting (SbEC-S_{p×b/B}ED) Codes / 171
Exercises / 180
References / 183
6 Codes for High-Speed Memories III: Bit / Byte Error
Control Codes 187
6.1 Single-Byte / Burst Error Detecting SEC-DED Codes / 188
6.2 Single-Byte Error Correcting and Double-Bit Error Detecting
(SbEC-DED) Codes / 217
6.3 Single-Byte Error Correcting and Double-Bit Error Correcting
(SbEC-DEC) Codes / 230
6.4 Single-Byte Error Correcting and Single-Byte Plus Single-Bit
Error Detecting (SbEC-(Sb+S)ED) Codes / 244
Exercises / 254
References / 258
7 Codes for High-Speed Memories IV: Spotty Byte Error
Control Codes 263
7.1 Spotty Byte Errors / 264
7.2 Single Spotty Byte Error Correcting (S_{t/b}EC) Codes / 264
7.3 Single Spotty Byte Error Correcting and Single-Byte Error
Detecting (S_{t/b}EC-SbED) Codes / 274
7.4 Single Spotty Byte Error Correcting and Double Spotty Byte
Error Detecting (S_{t/b}EC-D_{t/b}ED) Codes / 284
7.5 A General Class of Spotty Byte Error Control Codes / 290
Exercises / 326
References / 330
8 Parallel Decoding Burst / Byte Error Control Codes 335
8.1 Parallel Decoding Burst Error Control Codes / 336
8.2 Parallel Decoding Cyclic Burst Error Correcting Codes / 351
8.3 Transient Behavior of Parallel Encoder / Decoder Circuits
of Error Control Codes / 353
Exercises / 369

References / 370
9 Codes for Error Location: Error Locating Codes 373
9.1 Error Location of Faulty Packages and Faulty Chips / 373
9.2 Block Error Locating (S_{b/p×b}EL) Codes / 376
9.3 Single-Bit Error Correcting and Single-Block Error Locating
(SEC-S_{b/p×b}EL) Codes / 377
9.4 Single-Bit Error Correcting and Single-Byte Error Locating
(SEC-S_{e/b}EL) Codes / 389
9.5 Burst Error Locating Codes / 396
9.6 Code Conditions for Error Locating Codes / 404
Exercises / 409
References / 410
10 Codes for Unequal Error Control / Protection (UEC / UEP) 413
10.1 Error Models for UEC Codes and UEP Codes / 413
10.2 Fixed-Byte Error Control UEC Codes / 417
10.3 Burst Error Control UEC / UEP Codes / 427
10.4 Application of the UEC / UEP Codes / 439
Exercises / 457
References / 461
11 Codes for Mass Memories 465
11.1 Tape Memory Codes / 465
11.2 Magnetic Disk Memory Codes / 487
11.3 Optical Disk Memory Codes / 500
Exercises / 509

References / 512
12 Coding for Logic and System Design 517
12.1 Self-checking Concept / 518
12.2 Self-testing Checkers / 536
12.3 Self-checking ALU / 552
12.4 Self-checking Design for Computer Systems / 570
Exercises / 585
References / 590
13 Codes for Data Entry Systems 599
13.1 M-Ary Asymmetric Errors in Data Entry Systems / 599
13.2 M-Ary Asymmetric Symbol Error Correcting Codes / 600
13.3 Nonsystematic M-Ary Asymmetric Error Correcting Codes with
Deletion / Insertion / Adjacent-Symbol-Transposition Error
Correction Capabilities / 623
13.4 Codes for Two-Dimensional Matrix Symbols / 632
Exercises / 644
References / 646
14 Codes for Multiple / Distributed Storage Systems 649
14.1 MDS Array Codes Tolerating Multiple-Disk Failures / 650
14.2 Codes for Distributed Storage Systems / 661
Exercises / 675
References / 677
Index 679
Preface
Error control coding theory has been studied for over half a century, and it is still going
strong. The most recent examples are the turbo codes and the low-density
parity check codes (LDPCs). Also, during these years, error control codes have been
extensively applied to various digital systems, such as computer and communication
systems, as an essential technique to improve system reliability. In
modern-day high-speed dependable systems and semiconductor memories, high-speed
parallel decoding is essential. Error control codes suitable for high-speed parallel
decoding are typically expressed and studied in terms of parity-check matrices. For highly reliable
communication systems and disk memory systems, on the other hand, serial decoding
based on linear feedback shift registers (LFSRs) is used. Error control codes for serial
decoding are typically expressed and studied using generator polynomials. In this book,
the former codes are called matrix codes and the latter polynomial codes. So far,
traditional coding theory has been studied mainly using code generator polynomials. We
emphasize that the linear codes expressed in polynomials can always be expressed using
parity-check matrices, but the converse is not always possible. This book focuses
specifically on the design theory for matrix codes and their practical applications, an area
that has been seriously lacking in the traditional scope of coding theory investigations.
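As a concrete sketch of this point (written for this discussion, not taken from the book), consider the standard (7, 4) cyclic Hamming code with generator polynomial g(x) = x^3 + x + 1: its parity-check matrix is obtained by taking column i to be x^i mod g(x) over GF(2), after which decoding is pure matrix arithmetic.

```python
# Sketch: converting a polynomial (cyclic) code description into a
# parity-check matrix. The code is the standard (7,4) cyclic Hamming
# code with g(x) = x^3 + x + 1; variable names are illustrative.

def x_pow_mod(i, g=0b1011, deg=3):
    """Return x^i mod g(x) over GF(2), encoded as an integer bit pattern."""
    r = 1                      # the polynomial "1"
    for _ in range(i):
        r <<= 1                # multiply by x
        if (r >> deg) & 1:     # degree reached deg(g): reduce modulo g(x)
            r ^= g
    return r

# Column i of the 3 x 7 parity-check matrix H is x^i mod g(x)
H_cols = [x_pow_mod(i) for i in range(7)]
H = [[(c >> row) & 1 for c in H_cols] for row in range(3)]

def syndrome(word):
    """H * word over GF(2); zero for codewords, else it names the error column."""
    return [sum(h * b for h, b in zip(row, word)) % 2 for row in H]

codeword = [1, 1, 0, 1, 0, 0, 0]   # coefficients of g(x) itself, a valid codeword
assert syndrome(codeword) == [0, 0, 0]

received = codeword[:]
received[5] ^= 1                   # single-bit error at position 5
assert syndrome(received) == [(H_cols[5] >> r) & 1 for r in range(3)]
```

Because every column of H is a distinct nonzero vector, the syndrome of a single-bit error identifies the erroneous position; this is exactly the matrix-level view the book works with.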
In dependable computer systems, many types of error control codes have been applied
to memory subsystems and processors in order to achieve efficient and reliable data
processing and storage. Some systems could never have been realized without the
application of cost-effective error control codes, mainly very large capacity, high-speed
semiconductor memories, very high-density magnetic disk memories, and recent optical
disk memories such as compact disc (CD) and digital versatile disc (DVD). More recently
mobile digital systems have gained wide popularity, and these systems are sometimes
operated under unfavorable environments where electromagnetic noise, a-particles and
cosmic rays abound. Modern high-speed, high-density VLSI processors and semiconductor
memories are operated at low supply voltage levels, and thus with low logic signal
swing; they are therefore vulnerable to external disturbances that can induce transient
errors. Transient errors are a dominant concern in today's digital systems. Error control
coding is the most efficient and effective way to tolerate these errors, and is expected to
become ever more important in future VLSI systems.
The challenge is to choose among many different applications of error control
codes. Often a new application calls for a new type of code that can be developed most
efficiently to fit a new requirement. Matrix codes are far more flexible than
polynomial codes. Parity-check matrices can be manipulated easily. Some known
examples are column vector exchange in a matrix, the odd-weight-column matrix, the
low-density matrix, and the rotational matrix form. These manipulations of matrices
have yielded many useful codes for important applications. Polynomial codes, on the
other hand, cannot be manipulated in a similar way to fine-tune a code design. The
main reason is that the matrix code is capable of expressing various types
of code functions and thus allows for very high design flexibility. In practice, such
flexibility has led to excellent code designs, satisfying the various reliability requirements
of the dependable systems.
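One such manipulation can be checked mechanically. The following sketch uses a hypothetical (8, 4) parity-check matrix (not a matrix taken from the book) to show why the odd-weight-column construction yields SEC-DED behavior: distinct odd-weight columns force single-bit errors to give odd-weight syndromes, while double-bit errors give even-weight nonzero syndromes and so can never be mistaken for single-bit errors.

```python
# Sketch: why the odd-weight-column construction gives SEC-DED behavior.
# The eight 4-bit columns below are illustrative, chosen distinct and of
# odd weight.
from itertools import combinations

cols = [0b0001, 0b0010, 0b0100, 0b1000, 0b0111, 0b1011, 0b1101, 0b1110]

def weight(v):
    return bin(v).count("1")

# Columns are distinct and of odd weight, so a single-bit error produces
# a syndrome equal to exactly one column: correctable.
assert len(set(cols)) == len(cols)
assert all(weight(c) % 2 == 1 for c in cols)

# A double-bit error produces the XOR of two distinct columns: nonzero
# (detected) and of even weight (never equal to any odd-weight column,
# hence never miscorrected as a single-bit error).
for a, b in combinations(cols, 2):
    s = a ^ b
    assert s != 0 and weight(s) % 2 == 0
```

The same check generalizes to any column length, which is one reason this construction has been so productive in practice.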
This book builds on the author’s previous book, Error Control Coding for Computer
Systems (Prentice-Hall, 1989), and it likewise aims at introducing the latest developments
and advances in the field. As mentioned earlier, however, the book is also
unique in its concentration on the treatment of matrix codes. Unlike existing coding
theory books, this book will not burden the reader with unnecessary background on
polynomial algebra. The book includes only the mathematical background essential for
the understanding of matrix code construction and design. Such an arrangement frees up
space for the description of some fine artistry of matrix code design strategies and
techniques. Matrix code designs are presented with respect to practical applications, such
as high-speed semiconductor memories, mass memories of disks and tapes, logic circuits
and systems, data entry systems, and distributed storage systems. Also new classes of
matrix codes, such as error locating codes, spotty byte error control codes, and unequal
error control codes, are presented in their practical settings. The new parallel decoding
algorithm of the burst error control codes is demonstrated and further extended to the
generalized parallel decoding of the codes.
Chapter 1 provides background and a preview of material covered in the subsequent
chapters. First, it defines faults, errors, and failures and explains the many types of faults
and errors. This is the core knowledge needed to understand what constitutes a good
code. To design an efficient and effective code for a given application, it is important first
to know what types of errors matter, how much the system's reliability can be improved
by coding techniques, and what the constraints are on check-bit length, decoding speed,
and so forth. The matrix code design procedure is laid out in this chapter from this
standpoint. The chapter concludes with a brief introduction to the competitors of the
coding technique in dependable systems, namely conventional error recovery techniques
and / or error masking techniques.
Chapter 2 provides the fundamental mathematical background and coding theory
necessary to understand the later chapters. The chapter covers the matrix representations
of well-known error control codes, such as simple parity-check codes, cyclic codes,
Hamming codes, BCH codes, Reed-Solomon codes, and Fire codes. These codes are
manipulated in the later chapters in examples of how matrix codes satisfy the system
requirements for given applications.
Chapter 3 discusses the matrix code design techniques related to high-speed decoding,
area-efficient encoding / decoding hardware, modularized organization of encoding /
decoding circuits, and so forth.
Chapters 4, 5, 6, and 7 cover topics on matrix code design for high-speed
semiconductor memories. Depending on the application, the matrix code can be designed
to handle bit or byte errors and in some cases a mixture of both bit and byte errors. The
latter are typical errors found in large capacity semiconductor memory systems using
high-density RAM chips. Chapter 4 discusses bit error control codes, such as the modified
Hamming single-bit error correcting and double-bit error detecting (SEC-DED) codes, the
modified double-bit error correcting BCH codes, and the memory on-chip codes. For the
memory systems using byte-organized RAM chips, single-byte error correcting (SbEC)
codes, and single-byte error correcting and double-byte error detecting (SbEC-DbED)
codes, are presented in Chapter 5. The codes for the mixed type of bit errors and byte
errors are presented in Chapter 6. Among them, a byte error detecting SEC-DED code,
developed by the author and his colleague in the 1980s, has found practical application in
recent workstations. Chapter 7 presents a relatively new class of byte error control codes:
spotty byte error control codes. This class of codes has been specifically designed to fit
the large capacity memory systems that use high-density RAM chips with wide input /
output data of 8, 16, and 32 bits. Also a general class of these codes with minimum
Hamming distance d and with maximum distance separable (MDS) characteristics is
presented in this chapter. The well-known Reed-Solomon codes are included in these
generalized codes, which makes them practically and theoretically important. They will be
quite useful for future applications.
Chapter 8 presents the generalized parallel decoding algorithm for error control codes.
Initially developed for burst error control codes, this new decoding algorithm includes the
conventional parallel decoding algorithm of the existing bit / byte error correcting codes.
The generalized algorithm can also be used for multiple burst or byte error correcting
codes. The chapter takes this new algorithm and demonstrates how the parallel decoding
method can be implemented in combinational circuits. In addition, the chapter addresses
the important problem of glitches in parallel decoding circuits. Parallel decoding circuits
depend heavily on large exclusive-OR tree circuits, which are well known to readily
produce glitches. Glitches are unwanted logic signal transitions that can be generated,
propagated, and accumulated in the logic circuits and then induce noise and instability on the
power supply lines. The chapter explains why the glitches are generated, how they are
propagated and accumulated in the circuits, and how to reduce these undesirable effects.
Chapter 9 presents a new class of codes, namely error locating codes. Error location is
an error control function lying midway between error correction and error detection. An
error locating code will indicate where the errors lie but not the precise erroneous digit
positions. This type of code is useful for identifying the faulty block, faulty package, or
faulty chip, and thus enables fault isolation and reconfiguration. The chapter includes
practical codes that memory systems can use to locate faulty packages / cards. It also
provides a practical code for locating faulty chips. Both codes have the capability to
correct single-bit errors, even though the codes are mainly designed for identifying the
faulty areas. In addition, burst error locating codes are introduced here. The chapter
concludes with a precise analysis of error locating codes, with an emphasis on the code
conditions and the relation between error locating codes and error correcting / detecting
codes.
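The error-location function itself can be illustrated with a toy sketch (constructed for this explanation, with illustrative block sizes; it is not one of the error locating codes designed in the chapter): a single parity check per block flags which block is erroneous without revealing the erroneous bit position inside it.

```python
# Toy sketch of error location: one even-parity check bit per 4-bit block
# flags the erroneous block, without identifying the bit inside it.

def encode(blocks):
    """Append an even-parity check bit to each block."""
    return [block + [sum(block) % 2] for block in blocks]

def locate_errors(coded):
    """Return the indices of the blocks whose parity check fails."""
    return [i for i, blk in enumerate(coded) if sum(blk) % 2 != 0]

data = [[1, 0, 1, 1], [0, 0, 1, 0], [1, 1, 1, 0]]
coded = encode(data)
assert locate_errors(coded) == []       # no error: no block is flagged

coded[1][2] ^= 1                        # inject a single-bit error into block 1
assert locate_errors(coded) == [1]      # block 1 is flagged; the bit stays unknown
```

A real error locating code achieves this with far fewer check bits and with guaranteed behavior under a stated error model; the sketch only illustrates the function that sits between detection and correction.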
Chapter 10 presents yet another new class of codes: unequal error control (UEC) codes. In many
applications certain positions in a word have higher error rates or require more protection.
The UEC codes can indicate the area in a word having a higher error rate with stronger
error control code functions, and the area having a lower error rate with weaker error
control functions. In other words, this type of code has different code functions within a
code word, depending on the area and the associated error rate. The chapter provides
optimal codes with some UEC code functions. Similar codes exist in unequal error
protection (UEP) codes. This type of code protects the valuable information part of a word
against errors. For example, control information or address information in communication
messages or computer words, or similarly pointer information in database words, must
be more strongly protected from errors than the other parts. The chapter provides some UEP
codes that protect against burst errors and also against single-bit errors. The chapter
includes examples of UEC and UEP codes used in holographic memories and lossless
compressed data.
Chapters 11, 12, 13, and 14 present the codes for some specific systems, namely mass
memories such as magnetic tapes and disks, logic circuits and systems, data entry
systems, and distributed storage systems. Chapter 11 covers the codes designed
specifically for mass memories such as tape memories, magnetic disk memories, and
recent optical disk memories. The various modified types of Reed-Solomon codes and
adaptive parity codes are presented for the tape memories and the disk memories.
Codes for recent CDs and DVDs are also introduced. Chapter 12 addresses error
checking for logic systems using efficient error detecting codes. An important concept
of self-checking is first introduced. The chapter then clarifies how the errors in the logic
circuits and systems are detected, how the error detecting checker circuits are
implemented, how the errors in the checker itself are detected, and how the self-testing
checkers are implemented. In particular, a self-checking ALU is presented using parity-
based codes, and a self-checking design for processor systems is also demonstrated.
Chapter 13 presents the codes for data entry systems. In these systems nonbinary
symbols are routinely used, for example, in character recognition systems and in recent
two-dimensional symbols. The chapter characterizes the errors that occur in these nonbinary
symbols as asymmetric errors and presents some asymmetric error control codes. These
codes are basically nonlinear, and are designed by using elements in newly defined
rings. Also nonsystematic nonbinary asymmetric error correcting codes are designed
based on a multilevel coding method and a set-partitioning algorithm, and QR codes
and two-dimensional unidirectional clustered error correcting codes are presented for
two-dimensional matrix symbols. Chapter 14 provides the codes for distributed storage
systems connected via networks. Codes for recent RAID systems that tolerate two
disk failures are introduced, and then an efficient error recovery scheme for multiple
disk failures in the distributed storage system is discussed and implemented by using
block designs from combinatorial theory.
The introductory portion of the book, Chapters 1 and 2, together with parts of Chapters 3, 4, 5,
6, 8, 9, and 10, can be used as the text for a course at an advanced undergraduate level or
for an introductory one-semester course at the graduate level. For graduate classes and
advanced students who have a background in mathematics and logic circuits and a
rudimentary knowledge of codes, the book can be used as a whole with selected topics
from each of the chapters. Practicing engineers / designers will find useful discussions in
Chapters 6 to 14, which demonstrate, in detail, the procedure of designing sophisticated
codes in practical form. For the practicing engineer, Chapter 2 presents the mathematics and
coding theory, not in strict form but in introductory form, that are necessary for
understanding the later chapters. Many examples, figures, exercises, and references are
provided in each chapter of the book. Many attractive codes with practical code
parameters and their evaluation data on decoding hardware and error detection capabilities
are fully demonstrated. These can be used by practicing engineers as a practical guide and
handy reference.
My sincere appreciation goes to many people. Professors Jack K. Wolf of the
University of California San Diego, Hideki Imai of the University of Tokyo, T. R. N. Rao
of the University of Louisiana Lafayette, and Bella Bose of Oregon State University
encouraged me to continue my research on code design theory and to write this book.
Emeritus professor Yoshihiro Tohma of Tokyo Institute of Technology, Professors Takashi
Nanya of the University of Tokyo, Hideo Ito of Chiba University, and Jien-Chung Lo of
the University of Rhode Island gave important suggestions and valuable discussions on
research for dependable systems. Recently Professor Lo also provided valuable comments
on the final book and an important discussion on glitches (i.e., logical noise) that are
generated, propagated, and accumulated in large exclusive-OR tree circuits in the parallel
decoder of the codes. The author’s NTT colleagues, Dr. Shigeo Kaneda, now professor
at Doshisha University, and Dr. Kazumitsu Matsuzawa, now professor at Kanagawa
University, collaborated to develop practical codes for computer memories. Dr. Masato
Kitakami, now associate professor at Chiba University, Dr. Mitsuru Hamada, now
associate professor at Tamagawa University, Dr. Shuxin Jiang, Dr. Saowapa Kiattichai, Dr.
Hongyuang Chen, Dr. Kazuteru Namba, Dr. Ganesan Umanesan, Dr. Haruhiko Kaneko,
Dr. Kazuyoshi Suzuki, Mr. Tsuyoshi Tanaka, Mr. Toshihiko Kashiyama, and Mr. Hiroyuki
Ohde devoted themselves to designing the excellent codes in their master’s and / or
doctorate course programs at the Tokyo Institute of Technology. Much of the motivation
for making the codes practical was due to discussions with many researchers and engineers
in Japanese industry.
Thanks also go to the art designer, Mr. Ippei Inoh, a friend of mine, who proposed and
directed the marvelous idea of the front cover design. Ms. Tiki Ishizuka, a computer
graphic designer, arranged the wonderful fine art of this cover. In the center of the front
cover you can see "Hoh-Oh," a legendary happy bird, whose original pattern was introduced
to Japan from China more than one thousand years ago and has since appeared as an art
design in Japanese art and craft products. I sincerely hope the book will bring happiness and
pleasure to the reader.
At this point in a preface, I usually thank my wife, Sachiko, and my daughter's family,
Sayaka, Makoto, and Asuka, for encouraging me to continue this difficult project.
EIJI FUJIWARA
(Autumn 2005, at the foot of Mt. Fuji)



CONTENTS
1.1 Faults and Failures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.1 Faults . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.1.2 Failures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2 Error Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.1 Hard Errors and Soft Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.2 Random Errors, Clustered Errors, and Their Mixed-Type Errors . . . . . . 7
1.2.3 Symmetric Errors, Asymmetric Errors, and Unidirectional Errors . . . . . 9
1.2.4 Unequal Error Probability Model and Unequal Error Protection Model . 10
1.3 Error Recovery Techniques for Dependable Systems . . . . . . . . . . . . . . . . . . 10
1.3.1 Error Detection / Error Checking . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3.2 Error Recovery / Error Masking . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.4 Code Design Process for Dependable Systems . . . . . . . . . . . . . . . . . . . . . . 16
1.4.1 Code Functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.4.2 Code Design Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1
Introduction
Before designing a dependable system, we need to have enough knowledge of the system's
faults, errors, and failures, of the dependable techniques including coding techniques, and of
the design process for practical codes. This chapter provides the background on code design
for dependable systems.
1.1 FAULTS AND FAILURES
First, we need to make clear the difference between three frequently encountered technical
terms in designing dependable systems, namely faults, errors, and failures. These terms
are fully defined in [LAPR92, AVIZ04]. Faults are primarily identified as the generic
sources of abnormalities that alter the operation of circuits, devices, modules, systems,
and / or software products. Failures can arise from any type of fault. Faults are often
called defects when they occur in hardware and bugs when in software.

1.1.1 Faults
As causes of failure, faults are sometimes predictable but difficult to identify. Faults can occur
during any stage in a system's or product's life cycle: during specification, design, production,
or operation. Faults are characterized by their origin and their nature [LAPR92, GEFF02].
Origin of Faults Timing is a factor because faults arising in any one of a system's life
phases (specification, design, production, and operation) can provoke failure in the operation phase.
During the specification phase, for example, an incomplete definition of services may
lead to different interpretations by the client, the designer, and the user. Eventually, in the
operation phase, the failure becomes evident when the services provided differ from the
user’s expectations.
During the design and the production phases, for example, a designer’s lack of
sufficient knowledge of architectural levels, structural levels, and the like, may result in a
type of physical defect that induces, for example, short or open circuits.
During the operation phase, for example, an elevation of ambient temperature can cause
electronic devices and products to malfunction.
Nature of Faults During the specification and the design phases, faults that occur are called
human-made faults. During the production and the operation phases, there may occur physical
faults, hardware faults, or solid faults. Each type is due to some physical abnormality in the
component arising from aging or defective materials. Faults are of two types in their duration:
1. Permanent. These faults arise, for example, from a power supply breakdown,
defective open or short circuits, bridging or open lines, electro-migration, and so
forth. The defects in the input / output of the logical circuits or lines are called
stuck-at ‘1’ faults or stuck-at ‘0’ faults.
2. Temporal. These faults can be transient or intermittent. Transient faults occur
randomly and externally because of external noise, namely environmental disturbances
such as electromagnetic waves, but also because of external particles such as a-particles and
neutrons. Intermittent faults occur randomly but internally because of unstable or
marginally stable hardware, varying hardware or software state as a function of load
or activity, or signal coupling (i.e., crosstalk) between adjacent signal lines. Some
intermittent faults may be due to glitches [LO05], which are unpredictable spike
noise pulses occurring and propagated especially in large exclusive-OR (XOR) tree
networks (see Chapter 8). Parallel decoding circuits of error control codes with
large code lengths require large exclusive-OR tree networks, so glitches can become
serious problems. This topic will be covered in more detail in Section 8.3.
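The stuck-at model above, and the distinction between a fault and the errors it causes, can be made concrete with a small sketch (illustrative names and an 8-bit data word, not from the book): a permanent stuck-at fault produces an error only when the transmitted value differs from the stuck value, and even a single parity check bit exposes such an error.

```python
# Sketch: a bus with permanent stuck-at faults, checked by one parity bit.

def transmit(word, stuck_at_0=(), stuck_at_1=()):
    """Model bus lines where some bit positions are stuck at 0 or at 1."""
    out = list(word)
    for i in stuck_at_0:
        out[i] = 0
    for i in stuck_at_1:
        out[i] = 1
    return out

def even_parity(bits):
    return sum(bits) % 2

data = [1, 0, 1, 1, 0, 1, 0, 0]
sent = data + [even_parity(data)]            # append one even-parity check bit

# Line 2 stuck at '0': the transmitted 1 is flipped, and the check fails.
received = transmit(sent, stuck_at_0=(2,))
assert even_parity(received) != 0

# Line 1 stuck at '0': the fault is present, but bit 1 is already 0,
# so no error occurs and the check passes.
assert even_parity(transmit(sent, stuck_at_0=(1,))) == 0
```

This is why fault detection by coding is inherently probabilistic over the data being processed: the fault is permanent, but the error it causes depends on the transmitted values.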
Transient and intermittent faults are the major sources of errors in modern-day
digital systems. Some reports show that more than 60% of all failures in computer systems
are caused by transient or intermittent faults. For example, in DRAM (Dynamic Random
Access Memory) chips, transient errors result mainly from a-particles emitted by the decay
of radioactive particles in the semiconductor materials [MAY79, NOOR80, SAIH82]. One
identified source of a-particles is the lead solder balls used to attach the chip to the substrate.
As they pass through the chip, a-particles create sufficient electron-hole pairs to add
charge to the DRAM capacitor cells. These particles have low energy level, and thus have
very low probability of causing more than one memory cell to flip when the memory cells
are not packed in extreme density. In today’s ultra–high-density RAMs, not only DRAMs
but also SRAMs (Static Random Access Memories), it has been recognized that multiple
cosmic-ray-induced transient errors are a serious problem [OSAD03, 04].
Temporal errors have also been observed in microprocessor chips. The trend toward
smaller geometries in ever-shrinking semiconductor designs results in lower operating signal
voltages and higher speed operation, and therefore brings additional transient or intermittent
errors into play [KARN04]. In today's ubiquitous digital device and system environment, PDAs
and personal computers equipped with these high-speed microprocessor chips and high-
density RAM chips are even more vulnerable under worse circumstances, such as when
operated in airplanes at high altitude or near high-voltage electric power lines.
The important point is that the faults due to temporary environmental problems do not
need repair because the hardware is physically undamaged.
Cosmic rays, however, can give rise to significant transient errors, called soft errors
[KARN04, MAKI00, HAZU00, ZIEG98, MASS96, CALV94]. Figure 1.1 shows the
cosmic rays and their influence at the earth's surface level. In the cosmic environment, heavy
particles with very high energy from solar winds can penetrate the semiconductor chips in
satellite digital systems and cause more than double-bit errors [MUEL99]. Sometimes
they can cause physical faults such as latchup in CMOS circuits.
A detailed report of field testing for soft errors due to cosmic rays was presented in 1996 [ZIEG96a, 96b, 96c, OGOR96, SRIN96]. In the report, cosmic rays are defined as particles in the solar wind originating from the sun, or as galactic particles that enter the solar system, strike atmospheric atoms, and create a shower of secondary particles. Most particles produced by the shower either decay spontaneously or lose energy gradually, eventually losing all their energy in the cascade. Some of these particles may strike the earth. The cosmic rays at sea level therefore consist mostly of neutrons, protons, pions, muons, electrons, and photons. About 95% of these particles are neutrons, which carry no charge but have high energy (more than 10 MeV) sufficient to cause significant soft errors or latchups in electronic circuits; cosmic rays can thus create multiple errors. The neutron flux increases exponentially with altitude, and hence the failure rate of electronic circuits at airplane altitude is about one hundred times worse than at terrestrial level. Concrete shielding several feet thick can significantly attenuate the flux of these high-energy particles.
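The altitude dependence quoted above can be made concrete by fitting an exponential through the two flux values given in Figure 1.1. This is an illustrative two-point interpolation only, not a validated atmospheric model; the function name and the exact exponential form are assumptions.

```python
import math

# Two anchor points from Figure 1.1 (neutron flux, energy > 10 MeV):
SEA_LEVEL_FLUX = 0.01        # particles/(cm^2 * s) at 0 m
AIRPLANE_FLUX = 1.0          # particles/(cm^2 * s) at 10,000 m
AIRPLANE_ALTITUDE_M = 10_000

# Exponential fit through the two points: flux(h) = f0 * exp(k * h)
K = math.log(AIRPLANE_FLUX / SEA_LEVEL_FLUX) / AIRPLANE_ALTITUDE_M

def neutron_flux(altitude_m):
    """Illustrative exponential interpolation of neutron flux vs. altitude."""
    return SEA_LEVEL_FLUX * math.exp(K * altitude_m)
```

The fit reproduces the factor-of-one-hundred increase between sea level and airplane altitude that the text attributes to the exponential growth of neutron flux.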
Figure 1.2 shows how neutrons and other particles, including α-particles, generated by the collision of nuclei in the atmosphere can strike silicon chips and produce enough electron-hole pairs in the chips to impair their functioning.
Figure 1.1 Cosmic rays. (The figure shows cosmic rays from the cosmic zone colliding with nuclei in the atmosphere and producing secondary protons, neutrons, pions, μ-mesons, and α-particles; the neutron flux at energies above 10 MeV is about 0.01 particles/(cm²·s) at sea level and about 1.0 particles/(cm²·s) at an altitude of 10,000 m.)
FAULTS AND FAILURES 5
1.1.2 Failures
A failure is defined as nonperformance that occurs when a delivered service no longer complies with its specification [LAPR92]; it is also defined as nonperformance when the system or component is unable to perform its intended function for a specified time under specified environmental conditions [LEVE95].
Some types of failure are defined with respect to specific conditions. For example, a value failure means that the value of the delivered service does not comply with the specification, and a timing failure represents a response with incorrect timing, either faster or slower than specified. A temporary failure means erroneous behavior at a certain moment, lasting only a short time. A crash failure, or catastrophic failure, is one that stops the mission because the system is completely blocked.
1.2 ERROR MODELS
An error is a manifestation of an unexpected fault within a system that is liable to lead to system failure. The transformation of a fault into an error is called fault activation. The mechanism that creates errors in the system and finally provokes a failure is called error propagation. Before provoking a failure, errors can be masked or corrected by error control mechanisms such as error correcting codes, retries, or triple modular redundancy (TMR), and thus be recovered without inducing a system failure.
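Of the masking mechanisms just mentioned, TMR is the simplest to sketch: for bit-vector outputs it reduces to a bitwise majority vote over three module outputs. The following minimal sketch uses illustrative names and is not taken from the text.

```python
def tmr_vote(a, b, c):
    """Bitwise majority vote over three module outputs.

    Each output bit equals the value produced by at least two of the
    three modules, so an error confined to any single module is masked.
    """
    return (a & b) | (b & c) | (a & c)
```

For example, with two modules producing 0b1011 and one module corrupted by a transient bit flip to 0b1111, the vote still returns 0b1011: the error is masked and never propagates to a failure.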
A fault remains in passive mode until an error first appears in some structure of the system. This occurrence is called an initial activation, and the error is called a primitive error. In this case latency is defined as the mean time between the fault's occurrence and its initial activation as an error. Figure 1.3 presents the causal relationship between fault, error, and failure. Various types of errors can occur, and these different types are covered below.
Figure 1.2 Electron-hole pairs in a silicon chip caused by particles. (The figure shows a neutron striking an Si nucleus in the chip and displacing it by collision; the resulting charged particles, such as α-particles, protons, and electrons, generate electron-hole pairs along their tracks, while a neutron that undergoes no collision passes through the chip.)
1.2.1 Hard Errors and Soft Errors
Hard errors are caused by permanent faults; they therefore affect the system functions for a long period of time. This type of error is typically provoked by faults that appear as opens or shorts anywhere on the chips, modules, cards, or boards. Hard errors are also called permanent errors.
Soft errors, on the other hand, are caused by temporal faults, especially those resulting from external causes. Soft errors have a limited duration, meaning they interrupt system functions only for a very short time. The most likely sources of soft errors are radioactive particles and external noise; alpha particles and cosmic particles [ZIEG96a, ZIEG96b, ZIEG96c, OGOR96, SRIN96] are the major contributors mentioned previously. Therefore soft errors are also called transient errors. Intermittent errors, likewise, are provoked by intermittent faults.
1.2.2 Random Errors, Clustered Errors, and Their Mixed-Type Errors
Multiple errors that occur randomly in time and/or space are called random errors. Errors can occur in any bit position of a word with almost equal probability. This random type of error is unpredictable and is typically caused by white noise or external particles.
Errors may also cluster non-uniformly in a word; such multiple errors gather in particular, unpredictable positions in the word. Clustered errors include burst errors and byte errors. Burst errors occur typically in disk or tape memory, while byte errors are typically found in semiconductor memory. The difference lies in the data-recording medium. In disk memory, the data are recorded on a continuous surface; in semiconductor memory, the data are stored in RAM chips, and a data fragment, called a byte, is read from or stored in each chip. In disk or tape memory, defects or dust particles on the recording surface can cause burst errors anywhere in the continuous recording medium.
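The three error models can be mimicked with small fault-injection helpers operating on a word represented as a list of bits. This is an illustrative sketch: the function names, and the coin-flip corruption of byte and burst interiors, are assumptions made for the example, not definitions from the text.

```python
import random

def inject_random_errors(word, n, rng):
    """Random error model: flip n bits at uniformly chosen positions."""
    out = word[:]
    for pos in rng.sample(range(len(out)), n):
        out[pos] ^= 1
    return out

def inject_byte_error(word, byte_index, b, rng):
    """Byte error model: corrupt bits only inside one b-bit byte,
    as produced by a single faulty memory chip with b-bit output."""
    out = word[:]
    start = byte_index * b
    for pos in range(start, start + b):
        if rng.random() < 0.5:
            out[pos] ^= 1
    return out

def inject_burst_error(word, start, b, rng):
    """Burst error model: b contiguous bits affected; the first and
    last bits of the span are flipped, interior bits are arbitrary."""
    out = word[:]
    out[start] ^= 1
    out[start + b - 1] ^= 1
    for pos in range(start + 1, start + b - 1):
        if rng.random() < 0.5:
            out[pos] ^= 1
    return out
```

Running the byte and burst injectors shows the defining property of clustered errors: every corrupted position falls inside one byte boundary or one b-bit span, whereas the random injector scatters errors across the whole word.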
Failure
interface
User
Error
Fault
(Activated)
(Masked/Recovered)
Fault

Error
(Activated) (Propagated)
Fault
.
.
.
System/Module/Product
(Non-activated)
Figure 1.3 F ault, error, and failure.
ERROR MODELS 7
Clustered errors may occur in two-dimensional matrix symbols as well as in the tape or disk memory of a continuous two-dimensional recording medium. In semiconductor memory, on the other hand, byte errors may occur in a fragment of readout data, namely in a single byte, corresponding to the faulty chip. This is because each chip is physically separated and independent, and therefore a fault in one chip does not extend to the adjacent chips. Figure 1.4 illustrates the different cases of random errors, byte errors, and burst errors.
Figure 1.4 Models of random errors, byte errors, and burst errors. (The figure contrasts the three models: random errors, caused by external noise, particles, or randomly occurring permanent faults, appear at arbitrary positions in the received or readout data; a b-bit byte error is confined to the output of one faulty memory chip with b-bit output on a memory card; a b-bit burst error spans contiguous bits of readout data, caused by defects on a continuous recording medium.)
Another error model consists of mixed clustered and random errors in the operational phase. The clustered errors mentioned above are sometimes caused by physical faults due
to aging problems. However, systems and devices are more prone to damage from transient faults than from physical faults, and transient faults are a source of random errors. Therefore, when a physical fault occurs during the operational phase, both types of error, clustered and random, must be taken into account. For example, in semiconductor memories with byte-organized RAM chips, the major type of error is the transient error (i.e., the random bit error) caused by α-particles or external noise. After some time in operation, byte errors will occur due to the aging of RAM chips. Therefore both bit errors and byte errors, meaning both random errors and permanent errors, may occur separately or simultaneously. A similar situation holds for transmission systems, where both random bit errors and burst errors can occur. Chapter 6 deals with the codes that control mixed single-byte errors and random bit errors.
1.2.3 Symmetric Errors, Asymmetric Errors, and Unidirectional Errors
In binary systems the probability of errors that force 0 to 1, called 0-errors, is in general equal to that of errors forcing 1 to 0, called 1-errors. This class of errors is known as symmetric errors. When these errors occur with unequal probabilities, they are called asymmetric errors. In the binary asymmetric error model, only one type of error, either 0-errors or 1-errors, can occur, and the error type is known a priori. If both error types occur but are not mixed within a word, this class of errors is said to be unidirectional errors [BLAU93]. In binary systems these errors are caused by symmetric faults, asymmetric faults, or unidirectional faults.
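The distinction between 0-errors, 1-errors, and unidirectional errors can be checked mechanically. The following minimal sketch assumes words are given as bit lists; the function names are illustrative, not from the text.

```python
def error_directions(sent, received):
    """Return (#0-errors, #1-errors): bits forced 0 -> 1 and 1 -> 0."""
    zero_errors = sum(1 for s, r in zip(sent, received) if s == 0 and r == 1)
    one_errors = sum(1 for s, r in zip(sent, received) if s == 1 and r == 0)
    return zero_errors, one_errors

def is_unidirectional(sent, received):
    """True if 0-errors and 1-errors are not mixed within the word."""
    zero_errors, one_errors = error_directions(sent, received)
    return zero_errors == 0 or one_errors == 0
```

For instance, receiving 1111 for the transmitted word 1010 involves only 0-errors and is therefore unidirectional, whereas receiving 01 for 10 mixes both directions and is a symmetric-model error pattern.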
In nonbinary systems using the numerals 0, 1, 2, 3, …, 9, or alphanumeric symbols, asymmetric errors are the type that occur. That is, the probability of an error that forces one nonbinary symbol A to another symbol B sometimes differs from that of symbol A being forced to yet another symbol C. For example, in handwritten character recognition systems, the probability of a 7 being mistaken for a 9 is much higher than that of a 7 being mistaken for a 4, that is, p(9|7) ≫ p(4|7), where p(B|A) denotes the probability of symbol A being mistaken for symbol B. This is because the numerals 7 and 9 are close in shape whereas 7 and 4 are not so similar. Likewise, in keyboard input systems the symbols located on adjacent keys can more easily be mistyped. Figure 1.5 shows examples of these error models. In the asymmetric error model, the error graphs are not perfect and sometimes not bidirectional; in the symmetric nonbinary error model, by contrast, they are perfect and bidirectional.
If symbols are removed from or added to a word, as is sometimes caused by human mistakes (i.e., human-made faults), the resulting errors are called deletion errors or insertion errors, respectively.
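A single deletion or insertion can be recognized by comparing word lengths and then checking whether the shorter word is the longer one with exactly one symbol removed. The helper below is an illustrative sketch, not a construction from the text.

```python
def classify_length_change(codeword, received):
    """Return 'deletion', 'insertion', or None for a received word that
    differs from the codeword by exactly one removed or added symbol."""
    if len(received) == len(codeword) - 1:
        kind, longer, shorter = "deletion", codeword, received
    elif len(received) == len(codeword) + 1:
        kind, longer, shorter = "insertion", received, codeword
    else:
        return None
    # A single deletion/insertion occurred iff the shorter word equals
    # the longer word with one symbol removed at some position.
    for i in range(len(longer)):
        if longer[:i] + longer[i + 1:] == shorter:
            return kind
    return None
```

For example, receiving 111 for the codeword 1011 is consistent with a single deletion (of the 0), while receiving 1011 for the codeword 101 is consistent with a single insertion.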
Figure 1.5 Asymmetric errors in nonbinary systems. Source: [KANE04] © 2004 IEEE. (Panel (a) shows an example of an asymmetric error graph for handwritten character (numeral) recognition systems; panel (b) shows the asymmetric error graph for keyboard systems, based on the numeric keypad layout, where symbols on adjacent keys are mistaken for one another.)