
Kenneth P. Birman

Reliable Distributed Systems
Technologies, Web Services,
and Applications


Kenneth P. Birman
Cornell University
Department of Computer Science
Ithaca, NY 14853
U.S.A.


Mathematics Subject Classification (2000): 68M14, 68W15, 68M15, 68Q85, 68M12

Based on Building Secure and Reliable Network Applications, Manning Publications Co., Greenwich, © 1996.

ISBN-10 0-387-21509-3
ISBN-13 978-0-387-21509-9

Springer New York, Heidelberg, Berlin

© 2005 Springer Science+Business Media, Inc.
All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the
publisher (Springer Science+Business Media Inc., 233 Spring Street, New York, NY, 10013 USA), except for brief excerpts
in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval,
electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is
forbidden.
The use in this publication of trade names, trademarks, service marks and similar terms, even if they are not identified as


such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

Printed in the United States of America.

springeronline.com



Contents

Preface
Introduction
A User's Guide to This Book
Trademarks

PART I  Basic Distributed Computing Technologies

1  Fundamentals
   1.1  Introduction
   1.2  Components of a Reliable Distributed Computing System
        1.2.1  Communication Technology
        1.2.2  Basic Transport and Network Services
        1.2.3  Reliable Transport Software and Communication Support
        1.2.4  Middleware: Software Tools, Utilities, and Programming Languages
        1.2.5  Distributed Computing Environments
        1.2.6  End-User Applications
   1.3  Critical Dependencies
   1.4  Next Steps
   1.5  Related Reading

2  Basic Communication Services
   2.1  Communication Standards
   2.2  Addressing
   2.3  Network Address Translation
   2.4  IP Tunnelling
   2.5  Internet Protocols
        2.5.1  Internet Protocol: IP layer
        2.5.2  Transmission Control Protocol: TCP
        2.5.3  User Datagram Protocol: UDP
        2.5.4  Multicast Protocol
   2.6  Routing
   2.7  End-to-End Argument
   2.8  OS Architecture Issues: Buffering and Fragmentation
   2.9  Next Steps
   2.10 Related Reading

3  High Assurance Communication
   3.1  Notions of Correctness and High Assurance Distributed Communication
   3.2  The Many Dimensions of Reliability
   3.3  Scalability and Performance Goals
   3.4  Security Considerations
   3.5  Next Steps
   3.6  Related Reading

4  Remote Procedure Calls and the Client/Server Model
   4.1  The Client/Server Model
   4.2  RPC Protocols and Concepts
   4.3  Writing an RPC-based Client or Server Program
   4.4  The RPC Binding Problem
   4.5  Marshalling and Data Types
   4.6  Associated Services
        4.6.1  Naming Services
        4.6.2  Time Services
        4.6.3  Security Services
        4.6.4  Threads packages
        4.6.5  Transactions
   4.7  The RPC Protocol
   4.8  Using RPC in Reliable Distributed Systems
   4.9  Layering RPC over TCP
   4.10 Related Reading

5  Styles of Client/Server Computing
   5.1  Stateless and Stateful Client/Server Interactions
   5.2  Major Uses of the Client/Server Paradigm
   5.3  Distributed File Systems
   5.4  Stateful File Servers
   5.5  Distributed Database Systems
   5.6  Applying Transactions to File Servers
   5.7  Message-Queuing Systems
   5.8  Related Topics
   5.9  Related Reading

6  CORBA: The Common Object Request Broker Architecture
   6.1  The ANSA Project
   6.2  Beyond ANSA to CORBA
   6.3  Web Services
   6.4  The CORBA Reference Model
   6.5  IDL and ODL
   6.6  ORB
   6.7  Naming Service
   6.8  ENS—The CORBA Event Notification Service
   6.9  Life-Cycle Service
   6.10 Persistent Object Service
   6.11 Transaction Service
   6.12 Interobject Broker Protocol
   6.13 Properties of CORBA Solutions
   6.14 Performance of CORBA and Related Technologies
   6.15 Related Reading

7  System Support for Fast Client/Server Communication
   7.1  Lightweight RPC
   7.2  fbufs and the x-Kernel Project
   7.3  Active Messages
   7.4  Beyond Active Messages: U-Net
   7.5  Protocol Compilation Techniques
   7.6  Related Reading

PART II  Web Technologies

8  The World Wide Web
   8.1  The World Wide Web
   8.2  The Web Services Vision
   8.3  Web Security and Reliability
   8.4  Computing Platforms
   8.5  Related Reading

9  Major Web Technologies
   9.1  Components of the Web
   9.2  HyperText Markup Language
   9.3  Extensible Markup Language
   9.4  Uniform Resource Locators
   9.5  HyperText Transport Protocol
   9.6  Representations of Image Data
   9.7  Authorization and Privacy Issues
   9.8  Web Proxy Servers
   9.9  Web Search Engines and Web Crawlers
   9.10 Browser Extensibility Features: Plug-in Technologies
   9.11 Future Challenges for the Web Community
   9.12 Consistency and the Web
   9.13 Related Reading

10 Web Services
   10.1  What is a Web Service?
   10.2  Web Service Description Language: WSDL
   10.3  Simple Object Access Protocol: SOAP
   10.4  Talking to a Web Service: HTTP over TCP
   10.5  Universal Description, Discovery and Integration Language: UDDI
   10.6  Other Current and Proposed Web Services Standards
         10.6.1  WS_RELIABILITY
         10.6.2  WS_TRANSACTIONS
         10.6.3  WS_RELIABILITY
         10.6.4  WS_MEMBERSHIP
   10.7  How Web Services Deal with Failure
   10.8  The Future of Web Services
   10.9  Grid Computing: A Major Web Services Application
   10.10 Autonomic Computing: Technologies to Improve Web Services Configuration Management
   10.11 Related Readings

11 Related Internet Technologies
   11.1  File Transfer Tools
   11.2  Electronic Mail
   11.3  Network Bulletin Boards (Newsgroups)
   11.4  Instant Messaging Systems
   11.5  Message-Oriented Middleware Systems (MOMS)
   11.6  Publish-Subscribe and Message Bus Architectures
   11.7  Internet Firewalls and Network Address Translators
   11.8  Related Reading

12 Platform Technologies
   12.1  Microsoft's .NET Platform
         12.1.1  .NET Framework
         12.1.2  XML Web Services
         12.1.3  Language Enhancements
         12.1.4  Tools for Developing for Devices
         12.1.5  Integrated Development Environment
   12.2  Java Enterprise Edition
         12.2.1  J2EE Framework
         12.2.2  Java Application Verification Kit (AVK)
         12.2.3  Enterprise JavaBeans Specification
         12.2.4  J2EE Connectors
         12.2.5  Web Services
         12.2.6  Other Java Platforms
   12.3  .NET and J2EE Comparison
   12.4  Further Reading

PART III  Reliable Distributed Computing

13 How and Why Computer Systems Fail
   13.1  Hardware Reliability and Trends
   13.2  Software Reliability and Trends
   13.3  Other Sources of Downtime
   13.4  Complexity
   13.5  Detecting Failures
   13.6  Hostile Environments
   13.7  Related Reading

14 Overcoming Failures in a Distributed System
   14.1  Consistent Distributed Behavior
         14.1.1  Static Membership
         14.1.2  Dynamic Membership
   14.2  Formalizing Distributed Problem Specifications
   14.3  Time in Distributed Systems
   14.4  Failure Models and Reliability Goals
   14.5  The Distributed Commit Problem
         14.5.1  Two-Phase Commit
         14.5.2  Three-Phase Commit
         14.5.3  Quorum update revisited
   14.6  Related Reading

15 Dynamic Membership
   15.1  Dynamic Group Membership
         15.1.1  GMS and Other System Processes
         15.1.2  Protocol Used to Track GMS Membership
         15.1.3  GMS Protocol to Handle Client Add and Join Events
         15.1.4  GMS Notifications With Bounded Delay
         15.1.5  Extending the GMS to Allow Partition and Merge Events
   15.2  Replicated Data with Malicious Failures
   15.3  The Impossibility of Asynchronous Consensus (FLP)
         15.3.1  Three-Phase Commit and Consensus
   15.4  Extending our Protocol into the Full GMS
   15.5  Related Reading

16 Group Communication Systems
   16.1  Group Communication
   16.2  A Closer Look at Delivery Ordering Options
         16.2.1  Nonuniform Failure-Atomic Group Multicast
         16.2.2  Dynamically Uniform Failure-Atomic Group Multicast
         16.2.3  Dynamic Process Groups
         16.2.4  View-Synchronous Failure Atomicity
         16.2.5  Summary of GMS Properties
         16.2.6  Ordered Multicast
   16.3  Communication from Nonmembers to a Group
         16.3.1  Scalability
   16.4  Communication from a Group to a Nonmember
   16.5  Summary of Multicast Properties
   16.6  Related Reading

17 Point to Point and Multi-group Considerations
   17.1  Causal Communication Outside of a Process Group
   17.2  Extending Causal Order to Multigroup Settings
   17.3  Extending Total Order to Multigroup Settings
   17.4  Causal and Total Ordering Domains
   17.5  Multicasts to Multiple Groups
   17.6  Multigroup View Management Protocols
   17.7  Related Reading

18 The Virtual Synchrony Execution Model
   18.1  Virtual Synchrony
   18.2  Extended Virtual Synchrony
   18.3  Virtually Synchronous Algorithms and Tools
         18.3.1  Replicated Data and Synchronization
         18.3.2  State Transfer to a Joining Process
         18.3.3  Load-Balancing
         18.3.4  Primary-Backup Fault Tolerance
         18.3.5  Coordinator-Cohort Fault Tolerance
   18.4  Related Reading

19 Consistency in Distributed Systems
   19.1  Consistency in the Static and Dynamic Membership Models
   19.2  Practical Options for Coping with Total Failure
   19.3  General Remarks Concerning Causal and Total Ordering
   19.4  Summary and Conclusion
   19.5  Related Reading

PART IV  Applications of Reliability Techniques

20 Retrofitting Reliability into Complex Systems
   20.1  Wrappers and Toolkits
         20.1.1  Wrapper Technologies
         20.1.2  Introducing Robustness in Wrapped Applications
         20.1.3  Toolkit Technologies
         20.1.4  Distributed Programming Languages
   20.2  Wrapping a Simple RPC server
   20.3  Wrapping a Web Site
   20.4  Hardening Other Aspects of the Web
   20.5  Unbreakable Stream Connections
         20.5.1  Reliability Options for Stream Communication
         20.5.2  An Unbreakable Stream That Mimics TCP
         20.5.3  Nondeterminism and Its Consequences
         20.5.4  Dealing with Arbitrary Nondeterminism
         20.5.5  Replicating the IP Address
         20.5.6  Maximizing Concurrency by Relaxing Multicast Ordering
         20.5.7  State Transfer Issues
         20.5.8  Discussion
   20.6  Reliable Distributed Shared Memory
         20.6.1  The Shared Memory Wrapper Abstraction
         20.6.2  Memory Coherency Options for Distributed Shared Memory
         20.6.3  False Sharing
         20.6.4  Demand Paging and Intelligent Prefetching
         20.6.5  Fault Tolerance Issues
         20.6.6  Security and Protection Considerations
         20.6.7  Summary and Discussion
   20.7  Related Reading

21 Software Architectures for Group Communication
   21.1  Architectural Considerations in Reliable Systems
   21.2  Horus: A Flexible Group Communication System
         21.2.1  A Layered Process Group Architecture
   21.3  Protocol stacks
   21.4  Using Horus to Build a Publish-Subscribe Platform and a Robust Groupware Application
   21.5  Using Electra to Harden CORBA Applications
   21.6  Basic Performance of Horus
   21.7  Masking the Overhead of Protocol Layering
         21.7.1  Reducing Header Overhead
         21.7.2  Eliminating Layered Protocol Processing Overhead
         21.7.3  Message Packing
         21.7.4  Performance of Horus with the Protocol Accelerator
   21.8  Scalability
   21.9  Performance and Scalability of the Spread Toolkit
   21.10 Related Reading

PART V  Related Technologies

22 Security Options for Distributed Settings
   22.1  Security Options for Distributed Settings
   22.2  Perimeter Defense Technologies
   22.3  Access Control Technologies
   22.4  Authentication Schemes, Kerberos, and SSL
         22.4.1  RSA and DES
         22.4.2  Kerberos
         22.4.3  ONC Security and NFS
         22.4.4  SSL Security
   22.5  Security Policy Languages
   22.6  On-The-Fly Security
   22.7  Availability and Security
   22.8  Related Reading

23 Clock Synchronization and Synchronous Systems
   23.1  Clock Synchronization
   23.2  Timed-Asynchronous Protocols
   23.3  Adapting Virtual Synchrony for Real-Time Settings
   23.4  Related Reading

24 Transactional Systems
   24.1  Review of the Transactional Model
   24.2  Implementation of a Transactional Storage System
         24.2.1  Write-Ahead Logging
         24.2.2  Persistent Data Seen Through an Updates List
         24.2.3  Nondistributed Commit Actions
   24.3  Distributed Transactions and Multiphase Commit
   24.4  Transactions on Replicated Data
   24.5  Nested Transactions
         24.5.1  Comments on the Nested Transaction Model
   24.6  Weak Consistency Models
         24.6.1  Epsilon Serializability
         24.6.2  Weak and Strong Consistency in Partitioned Database Systems
         24.6.3  Transactions on Multidatabase Systems
         24.6.4  Linearizability
         24.6.5  Transactions in Real-Time Systems
   24.7  Advanced Replication Techniques
   24.8  Related Reading

25 Peer-to-Peer Systems and Probabilistic Protocols
   25.1  Peer-to-Peer File Sharing
         25.1.1  Napster
         25.1.2  Gnutella and Kazaa
         25.1.3  CAN
         25.1.4  CFS on Chord and PAST on Pastry
         25.1.5  OceanStore
   25.2  Peer-to-Peer Distributed Indexing
         25.2.1  Chord
         25.2.2  Pastry
         25.2.3  Tapestry and Brocade
         25.2.4  Kelips
   25.3  Bimodal Multicast Protocol
         25.3.1  Bimodal Multicast
         25.3.2  Unordered pbcast Protocol
         25.3.3  Adding CASD-style Temporal Properties and Total Ordering
         25.3.4  Scalable Virtual Synchrony Layered Over Pbcast
         25.3.5  Probabilistic Reliability and the Bimodal Delivery Distribution
         25.3.6  Evaluation and Scalability
         25.3.7  Experimental Results
   25.4  Astrolabe
         25.4.1  How it works
         25.4.2  Peer-to-Peer Data Fusion and Data Mining
   25.5  Other Applications of Peer-to-Peer Protocols
   25.6  Related Reading

26 Prospects for Building Highly Assured Web Services
   26.1  Web Services and Their Assurance Properties
   26.2  High Assurance for Back-End Servers
   26.3  High Assurance for Web Server Front-Ends
   26.4  Issues Encountered on the Client Side
   26.5  Highly Assured Web Services Need Autonomic Tools!
   26.6  Summary
   26.7  Related Reading

27 Other Distributed and Transactional Systems
   27.1  Related Work in Distributed Computing
         27.1.1  Amoeba
         27.1.2  BASE
         27.1.3  Chorus
         27.1.4  Delta-4
         27.1.5  Ensemble
         27.1.6  Harp
         27.1.7  The Highly Available System (HAS)
         27.1.8  The Horus System
         27.1.9  The Isis Toolkit
         27.1.10 Locus
         27.1.11 Manetho
         27.1.12 NavTech
         27.1.13 Paxos
         27.1.14 Phalanx
         27.1.15 Phoenix
         27.1.16 Psync
         27.1.17 Rampart
         27.1.18 Relacs
         27.1.19 RMP
         27.1.20 Spread
         27.1.21 StormCast
         27.1.22 Totem
         27.1.23 Transis
         27.1.24 The V System
   27.2  Peer-to-Peer Systems
         27.2.1  Astrolabe
         27.2.2  Bimodal Multicast
         27.2.3  Chord/CFS
         27.2.4  Gnutella/Kazaa
         27.2.5  Kelips
         27.2.6  Pastry/PAST and Scribe
         27.2.7  QuickSilver
         27.2.8  Tapestry/Brocade
   27.3  Systems That Implement Transactions
         27.3.1  Argus
         27.3.2  Arjuna
         27.3.3  Avalon
         27.3.4  Bayou
         27.3.5  Camelot and Encina
         27.3.6  Thor

Appendix: Problems
Bibliography
Index

PART I
Basic Distributed Computing Technologies

Although our treatment is motivated by the emergence of the World Wide Web and of object-oriented distributed computing platforms such as J2EE (for Java), .NET (for C# and other languages), and CORBA, the first part of the book focuses on the general technologies on which any distributed computing system relies. We review basic communication options and the basic software tools that have emerged for utilizing them and for simplifying the development of distributed applications. In the interests of generality, we cover more than just the specific technologies embodied in the Web as it exists at the time of this writing; in fact, terminology and concepts specific to the Web are not introduced until Part II. However, even in this first part, we discuss some of the most basic issues that arise in building reliable distributed systems, and we begin to establish the context within which reliability can be treated in a systematic manner.


1
Fundamentals

1.1 Introduction
Reduced to the simplest terms, a distributed computing system is a set of computer programs,
executing on one or more computers, and coordinating actions by exchanging messages. A
computer network is a collection of computers interconnected by hardware that directly
supports message passing. Most distributed computing systems operate over computer
networks, but one can also build a distributed computing system in which the components
execute on a single multitasking computer, and one can build distributed computing systems
in which information flows between the components by means other than message passing.
Moreover, there are new kinds of parallel computers, called clustered servers, which
have many attributes of distributed systems despite appearing to the user as a single machine
built using rack-mounted components. With the emergence of what people are calling “Grid
Computing,” clustered distributed systems may surge in importance. And we are just starting
to see a wave of interest in wireless sensor devices and associated computing platforms.
Down the road, much of the data pulled into some of the world’s most exciting databases
will come from sensors of various kinds, and many of the actions we’ll want to base on
the sensed data will be taken by actuators similarly embedded in the environment. All of
this activity is leading many people who do not think of themselves as distributed systems
specialists to direct attention to distributed computing.
We will use the term “protocol” in reference to an algorithm governing the exchange
of messages, by which a collection of processes coordinate their actions and communicate
information among themselves. Much as a program is a set of instructions, and a process
denotes the execution of those instructions, a protocol is a set of instructions governing the
communication in a distributed program, and a distributed computing system is the result
of executing some collection of such protocols to coordinate the actions of a collection of
processes in a network.
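To make this definition concrete, here is a minimal sketch of a two-process request/reply protocol (illustrative only; the Python queues stand in for network links, and all names are invented for this example). Each process runs its half of the "set of instructions governing the communication":

```python
import queue
import threading

# Two queues stand in for the network links between processes p and q.
to_q = queue.Queue()   # messages traveling from p to q
to_p = queue.Queue()   # messages traveling from q to p

def process_q():
    # q's half of the protocol: receive one request, send one reply.
    msg = to_q.get()
    to_p.put(("reply", msg[1] + 1))

def process_p():
    # p's half of the protocol: send a request, then await the reply.
    to_q.put(("request", 41))
    return to_p.get()

t = threading.Thread(target=process_q)
t.start()
reply = process_p()
t.join()
print(reply)   # ('reply', 42)
```

The "protocol" here is the agreement that a request carries a number and the reply carries that number incremented; the distributed computing system is the result of both processes executing their halves.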


This text is concerned with reliability in distributed computing systems. Reliability is
a very broad term that can have many meanings, including:
• Fault tolerance: The ability of a distributed computing system to recover from
component failures without performing incorrect actions.
• High availability: In the context of a fault-tolerant distributed computing system,
the ability of the system to restore correct operation, permitting it to resume providing services during periods when some components have failed. A highly available
system may provide reduced service for short periods of time while reconfiguring
itself.
• Continuous availability: A highly available system with a very small recovery time,
capable of providing uninterrupted service to its users. The reliability properties
of a continuously available system are unaffected or only minimally affected by
failures.
• Recoverability: Also in the context of a fault-tolerant distributed computing system, the
ability of failed components to restart themselves and rejoin the system, after the cause
of failure has been repaired.
• Consistency: The ability of the system to coordinate related actions by multiple components, often in the presence of concurrency and failures. Consistency underlies the
ability of a distributed system to emulate a non-distributed system.
• Scalability: The ability of a system to continue to operate correctly even as some aspect
is scaled to a larger size. For example, we might increase the size of the network on
which the system is running—doing so increases the frequency of such events as network
outages and could degrade a “non-scalable” system. We might increase numbers of users,
or numbers of servers, or load on the system. Scalability thus has many dimensions; a
scalable system would normally specify the dimensions in which it achieves scalability
and the degree of scaling it can sustain.
• Security: The ability of the system to protect data, services, and resources against misuse
by unauthorized users.
• Privacy: The ability of the system to protect the identity and locations of its users, or
the contents of sensitive data, from unauthorized disclosure.
• Correct specification: The assurance that the system solves the intended problem.
• Correct implementation: The assurance that the system correctly implements its
specification.
• Predictable performance: The guarantee that a distributed system achieves desired levels
of performance—for example, data throughput from source to destination, latencies
measured for critical paths, requests processed per second, and so forth.
• Timeliness: In systems subject to real-time constraints, the assurance that actions are
taken within the specified time bounds, or are performed with a desired degree of temporal
synchronization between the components.
Underlying many of these issues are questions of tolerating failures. Failure, too, can have
many meanings:



• Halting failures: In this model, a process or computer either works correctly, or simply
stops executing and crashes without taking incorrect actions, as a result of failure. As
the model is normally specified, there is no way to detect that the process has halted
except by timeout: It stops sending “keep alive” messages or responding to “pinging”
messages and hence other processes can deduce that it has failed.
• Fail-stop failures: These are accurately detectable halting failures. In this model, processes fail by halting. However, other processes that may be interacting with the faulty
process also have a completely accurate way to detect such failures—for example, a
fail-stop environment might be one in which timeouts can be used to monitor the status
of processes, and no timeout occurs unless the process being monitored has actually
crashed. Obviously, such a model may be unrealistically optimistic, representing an
idealized world in which the handling of failures is reduced to a pure problem of how
the system should react when a failure is sensed. If we solve problems with this model,
we then need to ask how to relate the solutions to the real world.
• Send-omission failures: These are failures to send a message that, according to the logic
of the distributed computing system, should have been sent. Send-omission failures
are commonly caused by a lack of buffering space in the operating system or network
interface, which can cause a message to be discarded after the application program has
sent it but before it leaves the sender’s machine. Perhaps surprisingly, few operating
systems report such events to the application.
• Receive-omission failures: These are similar to send-omission failures, but they occur
when a message is lost near the destination process, often because of a lack of memory
in which to buffer it or because evidence of data corruption has been discovered.
• Network failures: These occur when the network loses messages sent between certain
pairs of processes.
• Network partitioning failures: These are a more severe form of network failure, in
which the network fragments into disconnected sub-networks, within which messages
can be transmitted, but between which messages are lost. When a failure of this sort is
repaired, one talks about merging the network partitions. Network partitioning failures
are a common problem in modern distributed systems; hence, we will discuss them in
detail in Part III of this book.
• Timing failures: These occur when a temporal property of the system is violated—for
example, when a clock on a computer exhibits a value that is unacceptably far from the
values of other clocks, or when an action is taken too soon or too late, or when a message
is delayed by longer than the maximum tolerable delay for a network connection.
• Byzantine failures: This is a term that captures a wide variety of other faulty behaviors,
including data corruption, programs that fail to follow the correct protocol, and even
malicious or adversarial behaviors by programs that actively seek to force a system to
violate its reliability properties.
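The timeout-based detection described for halting failures can be sketched as follows (illustrative code, not drawn from any particular system). Note that such a detector cannot distinguish a crashed process from a merely slow one, which is precisely the gap that the idealized fail-stop model assumes away:

```python
import time

class PingFailureDetector:
    """Suspects a process once no keep-alive has arrived within a timeout."""

    def __init__(self, timeout_secs):
        self.timeout = timeout_secs
        self.last_heard = time.monotonic()

    def on_keepalive(self):
        # Called whenever a "keep alive" message or ping reply arrives.
        self.last_heard = time.monotonic()

    def suspects_failure(self):
        # True once the monitored process has been silent for too long.
        return time.monotonic() - self.last_heard > self.timeout

d = PingFailureDetector(timeout_secs=0.05)
assert not d.suspects_failure()   # we just heard from the process
time.sleep(0.1)                   # silence longer than the timeout...
assert d.suspects_failure()       # ...so the process is suspected; it may merely be slow
```

In a fail-stop environment, the final assertion would imply the process had actually crashed; in the weaker halting-failure model, it implies only a suspicion.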

An even more basic issue underlies all of these: the meaning of computation, and the model
one assumes for communication and coordination in a distributed system. Some examples
of models include these:




• Real-world networks: These are composed of workstations, personal computers, and
other computing devices interconnected by hardware. Properties of the hardware and
software components will often be known to the designer, such as speed, delay, and error
frequencies for communication devices; latencies for critical software and scheduling
paths; throughput for data generated by the system and data distribution patterns; speed
of the computer hardware, accuracy of clocks; and so forth. This information can be
of tremendous value in designing solutions to problems that might be very hard—or
impossible—in a completely general sense.
A specific issue that will emerge as being particularly important when we consider
guarantees of behavior in Part III concerns the availability, or lack thereof, of accurate temporal
information. Until the late 1980s, the clocks built into workstations were notoriously
inaccurate, exhibiting high drift rates that had to be overcome with software protocols
for clock resynchronization. There are limits on the quality of synchronization possible
in software, and this created a substantial body of research and led to a number of
competing solutions. In the early 1990s, however, the advent of satellite time sources
as part of the global positioning system (GPS) changed the picture: For the price of
an inexpensive radio receiver, any computer could obtain accurate temporal data, with
resolution in the sub-millisecond range. However, the degree to which GPS receivers
actually replace quartz-based time sources remains to be seen. Thus, real-world systems
are notable (or notorious) in part for having temporal information, but of potentially low
quality.
The architectures being proposed for networks of lightweight embedded sensors
may support high-quality temporal information, in contrast to more standard distributed
systems, which “work around” temporal issues using software protocols. For this reason,
a resurgence of interest in communication protocols that use time seems almost certain
to occur in the coming decade.
• Asynchronous computing systems: This is a very simple theoretical model used to
approximate one extreme sort of computer network. In this model, no assumptions
can be made about the relative speed of the communication system, processors, and
processes in the network. One message from a process p to a process q may be delivered
in zero time, while the next is delayed by a million years. The asynchronous model
reflects an assumption about time, but not failures: Given an asynchronous model, one
can talk about protocols that tolerate message loss, protocols that overcome fail-stop
failures in asynchronous networks, and so forth. The main reason for using the model is
to prove properties about protocols for which one makes as few assumptions as possible.
The model is very clean and simple, and it lets us focus on fundamental properties
of systems without cluttering up the analysis by including a great number of practical
considerations. If a problem can be solved in this model, it can be solved at least as
well in a more realistic one. On the other hand, the converse may not be true: We may
be able to do things in realistic systems by making use of features not available in the
asynchronous model, and in this way may be able to solve problems in real systems that
are impossible in ones that use the asynchronous model.



• Synchronous computing systems: Like the asynchronous systems, these represent an
extreme end of the spectrum. In the synchronous systems, there is a very strong concept
of time that all processes in the system share. One common formulation of the model can
be thought of as having a system-wide gong that sounds periodically; when the processes
in the system hear the gong, they run one round of a protocol, reading messages from
one another, sending messages that will be delivered in the next round, and so forth. And
these messages always are delivered to the application by the start of the next round, or
not at all.
Normally, the synchronous model also assumes bounds on communication latency
between processes, clock skew and precision, and other properties of the environment. As
in the case of an asynchronous model, the synchronous one takes an extreme point of view
because this simplifies reasoning about certain types of protocols. Real-world systems
are not synchronous—it is impossible to build a system in which actions are perfectly
coordinated as this model assumes. However, if one proves the impossibility of solving
some problem in the synchronous model, or proves that some problem requires at least
a certain number of messages in this model, one has established a sort of lower bound.
In a real-world system, things can only get worse, because we are limited to weaker
assumptions. This makes the synchronous model a valuable tool for understanding how
hard it will be to solve certain problems.
• Parallel-shared memory systems: An important family of systems is based on multiple
processors that share memory. Unlike for a network, where communication is by message
passing, in these systems communication is by reading and writing shared memory
locations. Clearly, the shared memory model can be emulated using message passing,
and can be used to implement message communication. Nonetheless, because there are
important examples of real computers that implement this model, there is considerable
theoretical interest in the model per se. Unfortunately, although this model is very rich
and a great deal is known about it, it would be beyond the scope of this book to attempt
to treat the model in any detail.
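The "gong" formulation of the synchronous model lends itself to a simple lockstep simulation. The sketch below is illustrative only (the flood-the-maximum protocol is a standard textbook exercise, not something defined above); messages sent in round r are delivered at the start of round r + 1, or not at all:

```python
def run_synchronous(processes, rounds):
    """Lockstep simulation of the synchronous model.

    `processes` maps a process id to a step function
    (pid, round_no, inbox) -> list of (dest_pid, message).
    Messages sent in round r are delivered at the start of round r + 1.
    """
    inboxes = {pid: [] for pid in processes}
    for r in range(rounds):
        next_inboxes = {pid: [] for pid in processes}
        for pid, step in processes.items():
            for dest, msg in step(pid, r, inboxes[pid]):
                next_inboxes[dest].append(msg)
        inboxes = next_inboxes   # the gong sounds; the next round begins
    return inboxes

# Example: each process floods the largest value it has seen so far.
values = {"p": 3, "q": 7, "r": 5}

def flood_max(pid, round_no, inbox):
    values[pid] = max([values[pid]] + inbox)
    return [(dest, values[pid]) for dest in values if dest != pid]

run_synchronous({pid: flood_max for pid in values}, rounds=2)
print(values)   # {'p': 7, 'q': 7, 'r': 7}
```

Two rounds suffice here because every process hears from every other process in each round; in the asynchronous model no such round-count bound could even be stated.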

1.2 Components of a Reliable Distributed Computing
System
Reliable distributed computing systems are assembled from basic building blocks. In the
simplest terms, these are just processes and messages, and if our interest were purely theoretical, it might be reasonable to stop at that. On the other hand, if we wish to apply theoretical
results in practical systems, we will need to work from a fairly detailed understanding
of how practical systems actually work. In some ways, this is unfortunate, because real
systems often include mechanisms that are deficient in ways that seem simple to fix, or
inconsistent with one another, but have such a long history (or are so deeply embedded into
standards) that there may be no way to improve on the behavior in question. Yet, if we want
to actually build reliable distributed systems, it is unrealistic to insist that we will only do
so in idealized environments that support some form of theoretically motivated structure.
The real world is heavily committed to standards, and the task of translating our theoretical
insights into practical tools that can interplay with these standards is probably the most
important challenge faced by the computer systems engineer.
It is common to think of a distributed system as operating over a layered set of network
services (see Table 1.1). It should be stated at the outset that the lower layers of this hierarchy
make far more sense than the upper ones, and when people talk about ISO compatibility or
the ISO layering, they almost always have layers below the “session” in mind, not the session
layer or those above it. Unfortunately, for decades, government procurement offices didn’t
understand this and often insisted on ISO “compatibility.” Thankfully, most such offices
have finally given up on that goal and accepted that pure ISO compatibility is meaningless
because the upper layers of the hierarchy don’t make a great deal of sense.
Table 1.1. OSI Protocol Layers

Application     The program using the communication connection
Presentation    Software to encode application data into messages and to decode on reception
Session         The logic associated with guaranteeing end-to-end properties such as reliability
Transport       Software concerned with fragmenting big messages into small packets
Network         Routing functionality, usually limited to small- or fixed-size packets
Data Link       The protocol used to send and receive packets
Physical        The protocol used to represent packets on the wire

Each layer corresponds to a software abstraction or hardware feature, and may be implemented in the application program itself, in a library of procedures to which the program
is linked, in the operating system, or even in the hardware of the communication device.
As an example, here is the layering of the International Organization for Standardization
(ISO) Open Systems Interconnection (OSI) protocol model (see Comer, Comer and Stevens
[1991, 1993], Coulouris et al., Tanenbaum):
• Application: This is the application program itself, up to the points at which it performs
communication operations.
• Presentation: This is the software associated with placing data into messages in a format
that can be interpreted by the destination process(es) to which the message will be sent
and for extracting data from messages in the destination process.
• Session: This is the software associated with maintaining connections between pairs
or sets of processes. A session may have reliability properties and may require some
form of initialization or setup, depending on the specific setting with which the user is
working. In the OSI model, the session software implements any reliability properties,
and lower layers of the hierarchy are permitted to be unreliable—for example, by losing
messages.



• Transport: The transport layer is responsible for breaking large messages into smaller
packets that respect size limits imposed by the network communication hardware. On the
incoming side, the transport layer reassembles these packets into messages, discarding
packets that are identified as duplicates, or messages for which some constituent packets
were lost in transmission.
• Network: This is the layer of software concerned with routing and low-level flow control
on networks composed of multiple physical segments interconnected by what are called
bridges and gateways.
• Data link: The data-link layer is normally part of the hardware that implements a
communication device. This layer is responsible for sending and receiving packets,
recognizing packets destined for the local machine and copying them in, discarding
corrupted packets, and other interface-level aspects of communication.
• Physical: The physical layer is concerned with representation of packets on the wire—
for example, the hardware technology for transmitting individual bits and the protocol
for gaining access to the wire if multiple computers share it.
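The symmetry between sending and receiving layers, in which each layer on the send side adds a header that its peer on the receive side strips off, can be sketched in a few lines. The layer names below follow Table 1.1; everything else (the string headers in particular) is invented purely for illustration:

```python
# Each layer is modeled as a (wrap, unwrap) pair: wrap prepends that layer's
# header on the send side, and unwrap strips it on the receive side.
def make_layer(name):
    header = f"[{name}]"
    def wrap(payload):
        return header + payload
    def unwrap(packet):
        assert packet.startswith(header), f"{name} header missing"
        return packet[len(header):]
    return wrap, unwrap

# The send side invokes the layers from the top of the stack downward...
stack = [make_layer(n) for n in ("session", "transport", "network", "datalink")]

def send(message):
    for wrap, _ in stack:
        message = wrap(message)
    return message          # what actually travels "on the wire"

def receive(packet):
    # ...and the receive side peels the headers off in the reverse order,
    # so each layer logically interacts with its peer on the remote side.
    for _, unwrap in reversed(stack):
        packet = unwrap(packet)
    return packet

wire = send("hello")
print(wire)                 # [datalink][network][transport][session]hello
assert receive(wire) == "hello"
```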
It is useful to distinguish the types of guarantees provided by the various layers: end-to-end guarantees in the case of the session, presentation, and application layers, and point-to-point guarantees for layers below these. The distinction is important in complex networks where a message may need to traverse many links to reach its destination. In such settings, a point-to-point property is one that holds only on a per-hop basis—for example, the data-link protocol is concerned with a single hop taken by the message, but not with its overall route or the guarantees that the application may expect from the communication link itself.
The session, presentation, and application layers, in contrast, impose a more complex
logical abstraction on the underlying network, with properties that hold between the end
points of a communication link that may physically extend over a complex substructure. In
Part III of this book we will discuss increasingly elaborate end-to-end properties, until we
finally extend these properties into a completely encompassing distributed communication
abstraction that embraces the distributed system as a whole and provides consistent behavior
and guarantees throughout. And, just as the OSI layering builds its end-to-end abstractions
over point-to-point ones, we will need to build these more sophisticated abstractions over
what are ultimately point-to-point properties.
As seen in Figure 1.1, each layer is logically composed of transmission logic and the
corresponding reception logic. In practice, this often corresponds closely to the implementation of the architecture—for example, most session protocols operate by imposing
a multiple session abstraction over a shared (or multiplexed) link-level connection. The
packets generated by the various higher-level session protocols can be thought of as merging
into a single stream of packets that are treated by the IP link level as a single customer for
its services.
One should not assume that the implementation of layered protocol architecture involves
some sort of separate module for each layer. Indeed, one reason that existing systems
deviate from the ISO layering is that a strict ISO-based protocol stack would be quite



Figure 1.1. Data flow in an OSI protocol stack. Each sending layer is invoked by the layer above it and passes
data off to the layer below it, and conversely on the receive-side. In a logical sense, however, each layer interacts
with its peer on the remote side of the connection—for example, the send-side session layer may add a header
to a message that the receive-side session layer strips off.
inefficient in the context of a modern operating system, where code-reuse is important and
mechanisms such as IP tunneling may want to reuse the ISO stack “underneath” what is
conceptually a second instance of the stack. Conversely, to maximize performance, the
functionality of a layered architecture is often compressed into a single piece of software,
and in some cases layers may be completely bypassed for types of messages where the
layer would take no action—for example, if a message is very small, the OSI transport layer
wouldn’t need to fragment it into multiple packets, and one could imagine a specialized
implementation of the OSI stack that omits the transport layer. Indeed, the pros and cons
of layered protocol architecture have become a major topic of debate in recent years (see
Abbott and Peterson, Braun and Diot, Clark and Tennenhouse, Karamcheti and Chien, Kay
and Pasquale).
Although the OSI layering is probably the best known such architecture, layered communication software is pervasive, and there are many other examples of layered architectures
and layered software systems. Later in this book we will see additional senses in which the
OSI layering is outdated, because it doesn’t directly address multiparticipant communication
sessions and doesn’t match very well with some new types of communication hardware, such
as asynchronous transfer mode (ATM) switching systems. In discussing this point we will see
that more appropriate layered architectures can be constructed, although they don’t match
the OSI layering very closely. Thus, one can think of layering either as a general methodology, or as something matched



to the particular layers of the OSI hierarchy. The former perspective is a popular one that
is only gaining importance with the introduction of object-oriented distributed computing
environments, which have a natural form of layering associated with object classes and
subclasses. The latter form of layering has probably become hopelessly incompatible with
standard practice by the time of this writing, although many companies and governments
continue to require that products comply with it.
It can be argued that layered communication architecture is primarily valuable as a
descriptive abstraction—a model that captures the essential functionality of a real communication system but doesn’t need to accurately reflect its implementation. The idea
of abstracting the behavior of a distributed system in order to concisely describe it or to
reason about it is a very important one. However, if the abstraction doesn’t accurately
correspond to the implementation, this also creates a number of problems for the system
designer, who now has the obligation to develop a specification and correctness proof
for the abstraction; to implement, verify, and test the corresponding software; and to
undertake an additional analysis that confirms that the abstraction accurately models the
implementation.
It is easy to see how this process can break down—for example, it is nearly inevitable
that changes to the implementation will have to be made long after a system has been
deployed. If the development process is really this complex, it is likely that the analysis of
overall correctness will not be repeated for every such change. Thus, from the perspective
of a user, abstractions can be a two-edged sword. They offer appealing and often simplified
ways to deal with a complex system, but they can also be simplistic or even incorrect. And
this bears strongly on the overall theme of reliability. To some degree, the very process of
cleaning up a component of a system in order to describe it concisely can compromise the
reliability of a more complex system in which that component is used.
Throughout the remainder of this book, we will often have recourse to models and
abstractions, in much more complex situations than the OSI layering. This will assist us in
reasoning about and comparing protocols, and in proving properties of complex distributed
systems. At the same time, however, we need to keep in mind that this whole approach
demands a sort of meta-approach, namely a higher level of abstraction at which we can
question the methodology itself, asking if the techniques by which we create reliable systems
are themselves a possible source of unreliability. When this proves to be the case, we need
to take the next step as well, asking what sorts of systematic remedies can be used to fight
these types of reliability problems.
Can well-structured distributed computing systems be built that can tolerate the failures
of their own components, or guarantee other kinds of assurance properties? In layerings such
as OSI, this issue is not really addressed, which is one of the reasons that the OSI layering
won’t work well for our purposes. However, the question is among the most important
ones that will need to be resolved if we want to claim that we have arrived at a workable
methodology for engineering reliable distributed computing systems. A methodology, then,
must address descriptive and structural issues, as well as practical ones such as the protocols
used to overcome a specific type of failure or to coordinate a specific type of interaction.


12

1. Fundamentals


1.2.1 Communication Technology
The most basic communication technology in any distributed system is the hardware support
for message passing. Although there are some types of networks that offer special properties,
most modern networks are designed to transmit data in packets with some fixed, but small,
maximum size. Each packet consists of a header, which is a data structure containing
information about the packet—its destination, route, and so forth. It contains a body, consisting
of the bytes that make up the content of the packet. And it may contain a trailer, which is
a second data structure that is physically transmitted after the header and body and would
normally consist of a checksum for the packet that the hardware computes and appends to
it as part of the process of transmitting the packet.
When a user's message is transmitted over a network, the packets actually sent on the wire include headers and trailers, and may have a fixed maximum size. Large messages are sent as multiple packets. For example, Figure 1.2 illustrates a message that has been fragmented into three packets, each containing a header and some part of the data from the original message. Not all fragmentation schemes include trailers, and in the figure no trailer is shown.

Figure 1.2. Large messages are fragmented for transmission.
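A minimal fragmentation and reassembly scheme along these lines might look as follows (the field layout, sizes, and names are invented for illustration). Each packet carries a header identifying the message and fragment, and a checksum trailer computed over the body:

```python
import zlib

MAX_BODY = 4   # an artificially small maximum body size, to force fragmentation

def fragment(msg_id, data):
    """Split a message into packets: (header, body, trailer) triples."""
    bodies = [data[i:i + MAX_BODY] for i in range(0, len(data), MAX_BODY)]
    packets = []
    for seq, body in enumerate(bodies):
        header = (msg_id, seq, len(bodies))     # which message, which fragment
        trailer = zlib.crc32(body)              # checksum appended after the body
        packets.append((header, body, trailer))
    return packets

def reassemble(packets):
    """Rebuild the message, discarding corrupted or duplicate packets."""
    (msg_id, _, total) = packets[0][0]
    fragments = {}
    for (mid, seq, _), body, trailer in packets:
        if mid == msg_id and zlib.crc32(body) == trailer:
            fragments[seq] = body               # a duplicate simply overwrites
    if len(fragments) < total:
        return None                             # a constituent packet was lost
    return b"".join(fragments[i] for i in range(total))

pkts = fragment(1, b"hello world!")
assert len(pkts) == 3                           # three fragments, as in Figure 1.2
assert reassemble(pkts) == b"hello world!"
assert reassemble(pkts[:2]) is None             # lost fragment: message discarded
```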
Modern communication hardware often permits large numbers of computers to share a
single communication fabric. For this reason, it is necessary to specify the address to which
a message should be transmitted. The hardware used for communication will therefore
normally support some form of addressing capability, by which the destination of a message
can be identified. More important to most software developers, however, are addresses
supported by the transport services available on most operating systems. These logical
addresses are a representation of location within the network, and are used to route packets
to their destinations. Each time a packet makes a “hop” over a communication link, the
sending computer is expected to copy the hardware address of the next machine in the path
into the outgoing packet. Within this book, we assume that each computer has a logical
address, but will have little to say about hardware addresses.
Readers familiar with modern networking tools will be aware that the address assigned
to a computer can change over time (particularly when the DHCP protocol is used to
dynamically assign them), that addresses may not be unique (indeed, because modern
firewalls and network address translators often “map” internal addresses used within a LAN
to external ones visible outside in a many-to-one manner, reuse of addresses is common),
and that there are even multiple address standards (IPv4 being the most common, with IPv6



Figure 1.3. The routing functionality of a modern transport protocol conceals the network topology from the
application designer.

promoted by some vendors as a next step). For our purposes in this book, we’ll set all of
these issues to the side, and similarly we’ll leave routing protocols and the design of high
speed overlay networks as topics for some other treatment.
On the other hand, there are two addressing features that have important implications
for higher-level communication software. These are the ability of the software (and often,
the underlying network hardware) to broadcast and multicast messages. A broadcast is
a way of sending a message so that it will be delivered to all computers that it reaches.
This may not be all the computers in a network, because of the various factors that can
cause a receive omission failure to occur, but, for many purposes, absolute reliability is not
required. To send a hardware broadcast, an application program generally places a special
logical address in an outgoing message that the operating system maps to the appropriate
hardware address. The message will only reach those machines connected to the hardware
communication device on which the transmission occurs, so the use of this feature requires
some knowledge of network communication topology.
A multicast is a form of broadcast that communicates to a subset of the computers that
are attached to a communication network. To use a multicast, one normally starts by creating
a new multicast group address and installing it into the hardware interfaces associated with
a communication device. Multicast messages are then sent much as a broadcast would be,
but are only accepted, at the hardware level, at those interfaces that have been instructed to
install the group address to which the message is destined. Many network routing devices
and protocols watch for multicast packets and will forward them automatically, but this is
rarely attempted for broadcast packets.
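Logically, broadcast and multicast reduce to address filtering at each interface. The toy model below involves no real networking and all names are invented for illustration; it captures the distinction that every interface attached to a shared fabric sees every packet, but accepts only those addressed to it, to the broadcast address, or to a group address it has installed:

```python
class Interface:
    def __init__(self, host_addr):
        self.host_addr = host_addr
        self.groups = set()        # multicast group addresses installed here
        self.delivered = []

    def install_group(self, group_addr):
        self.groups.add(group_addr)

    def accept(self, dest_addr, payload):
        # Hardware-level filtering: accept unicast to this host, broadcast,
        # or multicast to an installed group; otherwise drop silently.
        if dest_addr in (self.host_addr, "broadcast") or dest_addr in self.groups:
            self.delivered.append(payload)

def transmit(interfaces, dest_addr, payload):
    # The shared fabric: every attached interface sees the packet.
    for nic in interfaces:
        nic.accept(dest_addr, payload)

a, b, c = Interface("A"), Interface("B"), Interface("C")
b.install_group("G1")
c.install_group("G1")
lan = [a, b, c]

transmit(lan, "broadcast", "to-everyone")   # reaches A, B, and C
transmit(lan, "G1", "to-the-group")         # reaches only B and C

assert a.delivered == ["to-everyone"]
assert b.delivered == ["to-everyone", "to-the-group"]
```

In a real network, installing a group corresponds to an operation such as joining an IP multicast group, and the filtering is performed by the network interface hardware or the operating system.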
Chapter 2 discusses some of the most common forms of communication hardware in
detail.

