MULTI-CORE
EMBEDDED SYSTEMS


Embedded Multi-Core Systems
Series Editors

Fayez Gebali and Haytham El Miligi
University of Victoria
Victoria, British Columbia

Multi-Core Embedded Systems, Georgios Kornaros


MULTI-CORE
EMBEDDED SYSTEMS

Edited by

Georgios Kornaros

Boca Raton London New York

CRC Press is an imprint of the
Taylor & Francis Group, an informa business


MATLAB® and Simulink® are trademarks of The MathWorks, Inc. and are used with permission. The MathWorks does not warrant the accuracy of the text of exercises in this book. This book’s use or discussion of MATLAB® and Simulink® software or related products does not constitute endorsement or sponsorship by The MathWorks of a particular pedagogical approach or particular use of the MATLAB® and Simulink® software.

CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2010 by Taylor and Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business
No claim to original U.S. Government works
Printed in the United States of America on acid-free paper
10 9 8 7 6 5 4 3 2 1
International Standard Book Number: 978-1-4398-1161-0 (Hardback)
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts
have been made to publish reliable data and information, but the author and publisher cannot assume
responsibility for the validity of all materials or the consequences of their use. The authors and publishers
have attempted to trace the copyright holders of all material reproduced in this publication and apologize to
copyright holders if permission to publish in this form has not been obtained. If any copyright material has
not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented,
including photocopying, microfilming, and recording, or in any information storage or retrieval system,
without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com
or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood
Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and
registration for a variety of users. For organizations that have been granted a photocopy license by the CCC,
a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used
only for identification and explanation without intent to infringe.
Library of Congress Cataloging-in-Publication Data
Multi-core embedded systems / editor, Georgios Kornaros.

p. cm. -- (Embedded multi-core systems)
“A CRC title.”
Includes bibliographical references and index.
ISBN 978-1-4398-1161-0 (hard back : alk. paper)
1. Embedded computer systems. 2. Multiprocessors. 3. Parallel processing
(Electronic computers) I. Kornaros, Georgios. II. Title. III. Series.
TK7895.E42M848 2010
004.16--dc22 2009051515
Visit the Taylor & Francis Web site at

and the CRC Press Web site at



Contents

List of Figures
List of Tables
Foreword
Preface

1 Multi-Core Architectures for Embedded Systems
C.P. Ravikumar
1.1 Introduction
1.1.1 What Makes Multiprocessor Solutions Attractive?
1.2 Architectural Considerations
1.3 Interconnection Networks
1.4 Software Optimizations
1.5 Case Studies
1.5.1 HiBRID-SoC for Multimedia Signal Processing
1.5.2 VIPER Multiprocessor SoC
1.5.3 Defect-Tolerant and Reconfigurable MPSoC
1.5.4 Homogeneous Multiprocessor for Embedded Printer Application
1.5.5 General Purpose Multiprocessor DSP
1.5.6 Multiprocessor DSP for Mobile Applications
1.5.7 Multi-Core DSP Platforms
1.6 Conclusions
Review Questions
Bibliography

2 Application-Specific Customizable Embedded Systems
Georgios Kornaros
2.1 Introduction
2.2 Challenges and Opportunities
2.2.1 Objectives
2.3 Categorization
2.3.1 Customized Application-Specific Processor Techniques
2.3.2 Customized Application-Specific On-Chip Interconnect Techniques
2.4 Configurable Processors and Instruction Set Synthesis
2.4.1 Design Methodology for Processor Customization
2.4.2 Instruction Set Extension Techniques
2.4.3 Application-Specific Memory-Aware Customization
2.4.4 Customizing On-Chip Communication Interconnect
2.4.5 Customization of MPSoCs
2.5 Reconfigurable Instruction Set Processors
2.5.1 Warp Processing
2.6 Hardware/Software Codesign
2.7 Hardware Architecture Description Languages
2.7.1 LISATek Design Platform
2.8 Myths and Realities
2.9 Case Study: Realizing Customizable Multi-Core Designs
2.10 The Future: System Design with Customizable Architectures, Software, and Tools
Review Questions
Bibliography

3 Power Optimization in Multi-Core System-on-Chip
Massimo Conti, Simone Orcioni, Giovanni Vece and Stefano Gigli
3.1 Introduction
3.2 Low Power Design
3.2.1 Power Models
3.2.2 Power Analysis Tools
3.3 PKtool
3.3.1 Basic Features
3.3.2 Power Models
3.3.3 Augmented Signals
3.3.4 Power States
3.3.5 Application Examples
3.4 On-Chip Communication Architectures
3.5 NOCEXplore
3.5.1 Analysis
3.6 DPM and DVS in Multi-Core Systems
3.7 Conclusions
Review Questions
Bibliography

4 Routing Algorithms for Irregular Mesh-Based Network-on-Chip
Shu-Yen Lin and An-Yeu (Andy) Wu
4.1 Introduction
4.2 An Overview of Irregular Mesh Topology
4.2.1 2D Mesh Topology
4.2.2 Irregular Mesh Topology
4.3 Fault-Tolerant Routing Algorithms for 2D Meshes
4.3.1 Fault-Tolerant Routing Using Virtual Channels
4.3.2 Fault-Tolerant Routing with Turn Model
4.4 Routing Algorithms for Irregular Mesh Topology
4.4.1 Traffic-Balanced OAPR Routing Algorithm
4.4.2 Application-Specific Routing Algorithm
4.5 Placement for Irregular Mesh Topology
4.5.1 OIP Placements Based on Chen and Chiu’s Algorithm
4.5.2 OIP Placements Based on OAPR
4.6 Hardware Efficient Routing Algorithms
4.6.1 Turns-Table Routing (TT)
4.6.2 XY-Deviation Table Routing (XYDT)
4.6.3 Source Routing for Deviation Points (SRDP)
4.6.4 Degree Priority Routing Algorithm
4.7 Conclusions
Review Questions
Bibliography

5 Debugging Multi-Core Systems-on-Chip
Bart Vermeulen and Kees Goossens
5.1 Introduction
5.2 Why Debugging Is Difficult
5.2.1 Limited Internal Observability
5.2.2 Asynchronicity and Consistent Global States
5.2.3 Non-Determinism and Multiple Traces
5.3 Debugging an SoC
5.3.1 Errors
5.3.2 Example Erroneous System
5.3.3 Debug Process
5.4 Debug Methods
5.4.1 Properties
5.4.2 Comparing Existing Debug Methods
5.5 CSAR Debug Approach
5.5.1 Communication-Centric Debug
5.5.2 Scan-Based Debug
5.5.3 Run/Stop-Based Debug
5.5.4 Abstraction-Based Debug
5.6 On-Chip Debug Infrastructure
5.6.1 Overview
5.6.2 Monitors
5.6.3 Computation-Specific Instrument
5.6.4 Protocol-Specific Instrument
5.6.5 Event Distribution Interconnect
5.6.6 Debug Control Interconnect
5.6.7 Debug Data Interconnect
5.7 Off-Chip Debug Infrastructure
5.7.1 Overview
5.7.2 Abstractions Used by Debugger Software
5.8 Debug Example
5.9 Conclusions
Review Questions
Bibliography

6 System-Level Tools for NoC-Based Multi-Core Design
Luciano Bononi, Nicola Concer, and Miltos Grammatikakis
6.1 Introduction
6.1.1 Related Work
6.2 Synthetic Traffic Models
6.3 Graph Theoretical Analysis
6.3.1 Generating Synthetic Graphs Using TGFF
6.4 Task Mapping for SoC Applications
6.4.1 Application Task Embedding and Quality Metrics
6.4.2 SCOTCH Partitioning Tool
6.5 OMNeT++ Simulation Framework
6.6 A Case Study
6.6.1 Application Task Graphs
6.6.2 Prospective NoC Topology Models
6.6.3 Spidergon Network on Chip
6.6.4 Task Graph Embedding and Analysis
6.6.5 Simulation Models for Proposed NoC Topologies
6.6.6 Mpeg4: A Realistic Scenario
6.7 Conclusions and Extensions
Review Questions
Bibliography

7 Compiler Techniques for Application Level Memory Optimization for MPSoC
Bruno Girodias, Youcef Bouchebaba, Pierre Paulin, Bruno Lavigueur, Gabriela Nicolescu, and El Mostapha Aboulhamid
7.1 Introduction
7.2 Loop Transformation for Single and Multiprocessors
7.3 Program Transformation Concepts
7.4 Memory Optimization Techniques
7.4.1 Loop Fusion
7.4.2 Tiling
7.4.3 Buffer Allocation
7.5 MPSoC Memory Optimization Techniques
7.5.1 Loop Fusion
7.5.2 Comparison of Lexicographically Positive and Positive Dependency
7.5.3 Tiling
7.5.4 Buffer Allocation
7.6 Technique Impacts
7.6.1 Computation Time
7.6.2 Code Size Increase
7.7 Improvement in Optimization Techniques
7.7.1 Parallel Processing Area and Partitioning
7.7.2 Modulo Operator Elimination
7.7.3 Unimodular Transformation
7.8 Case Study
7.8.1 Cache Ratio and Memory Space
7.8.2 Processing Time and Code Size
7.9 Discussion
7.10 Conclusions
Review Questions
Bibliography

8 Programming Models for Multi-Core Embedded Software
Bijoy A. Jose, Bin Xue, Sandeep K. Shukla and Jean-Pierre Talpin
8.1 Introduction
8.2 Thread Libraries for Multi-Threaded Programming
8.3 Protections for Data Integrity in a Multi-Threaded Environment
8.3.1 Mutual Exclusion Primitives for Deterministic Output
8.3.2 Transactional Memory
8.4 Programming Models for Shared Memory and Distributed Memory
8.4.1 OpenMP
8.4.2 Thread Building Blocks
8.4.3 Message Passing Interface
8.5 Parallel Programming on Multiprocessors
8.6 Parallel Programming Using Graphic Processors
8.7 Model-Driven Code Generation for Multi-Core Systems
8.7.1 StreamIt
8.8 Synchronous Programming Languages
8.9 Imperative Synchronous Language: Esterel
8.9.1 Basic Concepts
8.9.2 Multi-Core Implementations and Their Compilation Schemes
8.10 Declarative Synchronous Language: LUSTRE
8.10.1 Basic Concepts
8.10.2 Multi-Core Implementations from LUSTRE Specifications
8.11 Multi-Rate Synchronous Language: SIGNAL
8.11.1 Basic Concepts
8.11.2 Characterization and Compilation of SIGNAL
8.11.3 SIGNAL Implementations on Distributed Systems
8.11.4 Multi-Threaded Programming Models for SIGNAL
8.12 Programming Models for Real-Time Software
8.12.1 Real-Time Extensions to Synchronous Languages
8.13 Future Directions for Multi-Core Programming
Review Questions
Bibliography

9 Operating System Support for Multi-Core Systems-on-Chips
Xavier Guérin and Frédéric Pétrot
9.1 Introduction
9.2 Ideal Software Organization
9.3 Programming Challenges
9.4 General Approach
9.4.1 Board Support Package
9.4.2 General Purpose Operating System
9.5 Real-Time and Component-Based Operating System Models
9.5.1 Automated Application Code Generation and RTOS Modeling
9.5.2 Component-Based Operating System
9.6 Pros and Cons
9.7 Conclusions
Review Questions
Bibliography

10 Autonomous Power Management in Embedded Multi-Cores
Arindam Mukherjee, Arun Ravindran, Bharat Kumar Joshi, Kushal Datta and Yue Liu
10.1 Introduction
10.1.1 Why Is Autonomous Power Management Necessary?
10.2 Survey of Autonomous Power Management Techniques
10.2.1 Clock Gating
10.2.2 Power Gating
10.2.3 Dynamic Voltage and Frequency Scaling
10.2.4 Smart Caching
10.2.5 Scheduling
10.2.6 Commercial Power Management Tools
10.3 Power Management and RTOS
10.4 Power-Smart RTOS and Processor Simulators
10.4.1 Chip Multi-Threading (CMT) Architecture Simulator
10.5 Autonomous Power Saving in Multi-Core Processors
10.5.1 Opportunities to Save Power
10.5.2 Strategies to Save Power
10.5.3 Case Study: Power Saving in Intel Centrino
10.6 Power Saving Algorithms
10.6.1 Local PMU Algorithm
10.6.2 Global PMU Algorithm
10.7 Conclusions
Review Questions
Bibliography

11 Multi-Core System-on-Chip in Real World Products
Gajinder Panesar, Andrew Duller, Alan H. Gray and Daniel Towner
11.1 Introduction
11.2 Overview of picoArray Architecture
11.2.1 Basic Processor Architecture
11.2.2 Communications Interconnect
11.2.3 Peripherals and Hardware Functional Accelerators
11.3 Tool Flow
11.3.1 picoVhdl Parser (Analyzer, Elaborator, Assembler)
11.3.2 C Compiler
11.3.3 Design Simulation
11.3.4 Design Partitioning for Multiple Devices
11.3.5 Place and Switch
11.3.6 Debugging
11.4 picoArray Debug and Analysis
11.4.1 Language Features
11.4.2 Static Analysis
11.4.3 Design Browser
11.4.4 Scripting
11.4.5 Probes
11.4.6 FileIO
11.5 Hardening Process in Practice
11.5.1 Viterbi Decoder Hardening
11.6 Design Example
11.7 Conclusions
Review Questions
Bibliography

12 Embedded Multi-Core Processing for Networking
Theofanis Orphanoudakis and Stylianos Perissakis
12.1 Introduction
12.2 Overview of Proposed NPU Architectures
12.2.1 Multi-Core Embedded Systems for Multi-Service Broadband Access and Multimedia Home Networks
12.2.2 SoC Integration of Network Components and Examples of Commercial Access NPUs
12.2.3 NPU Architectures for Core Network Nodes and High-Speed Networking and Switching
12.3 Programmable Packet Processing Engines
12.3.1 Parallelism
12.3.2 Multi-Threading Support
12.3.3 Specialized Instruction Set Architectures
12.4 Address Lookup and Packet Classification Engines
12.4.1 Classification Techniques
12.4.2 Case Studies
12.5 Packet Buffering and Queue Management Engines
12.5.1 Performance Issues
12.5.2 Design of Specialized Core for Implementation of Queue Management in Hardware
12.6 Scheduling Engines
12.6.1 Data Structures in Scheduling Architectures
12.6.2 Task Scheduling
12.6.3 Traffic Scheduling
12.7 Conclusions
Review Questions
Bibliography

Index


List of Figures

1.1 Power/performance over the years. The solid line shows the prediction by Gene Frantz. The dotted line shows the actual value for digital signal processors over the years. The ‘star’ curve shows the power dissipation for mobile devices over the years.
1.2 Performance of multi-core architectures. The x-axis shows the logarithm of the number of processors to the base 2. The y-axis shows the run-time of the multi-core for a benchmark.
1.3 Network-on-Chip architectures for an SoC.
1.4 Architecture of HiBRID multiprocessor SoC.
1.5 Architecture of VIPER multiprocessor-on-a-chip.
1.6 Architecture of a single-chip multiprocessor for video applications with four processor nodes.
1.7 Design alternates for MPOC.
1.8 Daytona general purpose multiprocessor and its processor architecture.
1.9 Chip block diagram of OMAP4430 multi-core platform.
1.10 Chip block diagram of C6474 multi-core DSP platform.
2.1 Different technologies in the era of designing embedded system-on-chip. Application-specific integrated processors (ASIPs) and reconfigurable ASIPs combine both the flexibility of general purpose computing with the efficiency in performance, power and cost of ASICs.
2.2 Optimizing embedded systems-on-chips involves a wide spectrum of techniques. Balancing across often conflicting goals is a challenging task determined mainly by the designer’s expertise rather than the properties of the embedded application.
2.3 Extensible processor core versus component-based customized SoC. Computation elements are tightly coupled with the base CPU pipeline (a), while (b), in component-based designs, intellectual property (IP) cores are integrated in SoCs using different communication architectures (bus, mesh, NoC, etc.).


2.4 Typical methodology for design space exploration of application specific processor customization. Different algorithms and metrics are applied by researchers and industry for each individual step to achieve the most efficient implementation and time to market.
2.5 A sample data flow subgraph. Usually each node is annotated with area and timing estimates before passing to a selection algorithm.
2.6 A RASIP integrating the general purpose processor with RFUs.
2.7 LISATek infrastructure based on LISA architecture specification language. Retargetable software development tools (C compiler, assembler, simulator, debugger, etc.) permit iterative exploration of varying target processor configurations.
2.8 Tensilica customization and extension design flow. Through Xplorer, Tensilica’s design environment, the designer has access to the tools needed for development of custom instructions and configuration of the base processor.
3.1 Power analysis and optimization at different levels of the design.
3.2 Complexity estimation from SystemC source code.
3.3 I2C driver instruction set.
3.4 Power dissipation model added to the functional model.
3.5 System level power modeling and analysis.
3.6 power model architecture.
3.7 Example of association between sc_module and power model.
3.8 PKtool simulation flux.
3.9 NoC performance comparison for a 16-node 2D mesh network: steady-state network average delay for three different traffic scenarios.
3.10 NoC performance comparison for a 16-node 2D mesh network: steady-state network throughput for three different traffic scenarios.
3.11 Example of probabilistic analysis. The message delay probability density referred to all messages sent and received by a NoC under traffic equally distributed with 50% of messages sent in burst and message generation intensity of 32%; network has 16 nodes, topology is 2D mesh and routing is deterministic.


3.12 Example of temporal evolution analysis. The graph shows the number of flits in a router on the top side of a 2D mesh network. Each router has a total memory capacity of 120 flits, distributed over five input and five output ports. The figure shows that, for this traffic intensity and these scenarios, the buffer configuration is oversized and performance is maintained even if the router has a smaller memory.
3.13 Example of power graph where power state is indicated over time, router by router. Dark color means high power state. The router power machine has nine power states and follows the ACPI standard: values from 1 to 4 are ON states, values from 5 to 8 are SLEEP states, and value 9 is the OFF state.
3.14 Four ON states, four SLEEP states and OFF state of the ACPI standard.
3.15 DPM and communication architectures.
3.16 Clock frequency, supply voltage and power dissipation for the different power states of the ACPI standard.
3.17 Percentage of the time the three masters and two slaves and the bus are in the different power states during simulation in a low bus traffic test case with local DPM and global DPM.
3.18 Energy and bus throughput normalized to the architecture without DPM.
3.19 Qualitative results in terms of bus throughput as a function of bus traffic intensity for different DPM architectures and bus arbitration algorithm.
3.20 Qualitative results in terms of average energy per transfer as a function of bus traffic intensity for different DPM architectures and bus arbitration algorithm.
4.1 (a) A conventional 6 × 6 2D mesh and (b) a 6 × 6 irregular mesh with 1 OIP and 31 normal-sized IPs.
4.2 Possible cycles and turns in 2D mesh.
4.3 Six turns form a cycle and allow deadlock.
4.4 The turns allowed by (a) west-first algorithm, (b) north-last algorithm, and (c) negative-first algorithm.
4.5 The six turns allowed in odd-even turn models.
4.6 A minimal routing algorithm ROUTE that is based on the odd-even turn model.
4.7 The localized algorithm to form extended faulty blocks.
4.8 Three examples to form extended faulty blocks.
4.9 E-XY routing algorithm.
4.10 Eight possible cases of the E-XY in normal mode.
4.11 Four cases of the E-XY in abnormal mode: (a) south-to-north, (b) north-to-south, (c) west-to-east, and (d) east-to-west direction.

4.12 An example to form faulty blocks for Chen and Chiu’s algorithm.
4.13 Two examples of f-rings and f-chains: (a) one f-ring and one f-chain in a 6 × 6 mesh and (b) one f-ring and eight different types of f-chains in a 10 × 10 mesh.
4.14 Pseudo codes of the procedure Message-Route Modified.
4.15 Pseudo codes of the procedure Normal-Route.
4.16 Pseudo codes of the procedure Ring-Route.
4.17 Pseudo codes of the procedure Chain-Route Modified.
4.18 Pseudo codes of the procedure Overlapped-Ring Chain Route.
4.19 Examples of Chen and Chiu’s routing algorithm: (a) the routing paths (RF, CF, and RO) in Normal-Route, and (b) two examples of Ring-Route and Chain-Route.
4.20 Traffic loads around the OIPs by using (a) Chen and Chiu’s algorithm [5] (unbalanced), (b) the extended X-Y routing algorithm [34] (unbalanced), and (c) the OAPR [21] (balanced).
4.21 The OAPR: (a) eight default routing cases and (b) some cases to detour OIPs.
4.22 Restrictions on OIP placements for the OAPR.
4.23 The OAPR design flow: (a) the routing logic in the five-port router model, (b) the flowchart of the OAPR design flow, and (c) the flowchart to update LUTs.
4.24 Overview of APSRA design methodology.
4.25 An example of APSRA methodology: (a) CG, (b) TG, (c) CDG, (d) ASCDG, and (e) the concurrency of the two loops.
4.26 An example of the routing table in the west input port of node X: (a) original routing table and (b) compressed routing table.
4.27 An example of the compressed routing table in node X with loss of adaptivity: (a) the routing table by merging destinations A and B and (b) the routing table by merging regions R1 and R3.
4.28 OIP placement with different sizes and locations.
4.29 Effect on latency with central region in NoC.
4.30 Latency for horizontal shift of positions.
4.31 Latency for vertical shift of positions.
4.32 OIP placements with different orientations.
4.33 An example of a 12 × 12 distribution graph.
4.34 Latencies of one 3 × 3 OIP placed on a 12 × 12 mesh.
4.35 Latencies of one four-unit OIP placed on a 12 × 12 mesh: (a) horizontal placements and (b) vertical placements.
4.36 (a) Routing paths without turning to destination D and (b) Routing paths with two turns to D.
4.37 TT routing algorithm for one destination D.
4.38 XYDT routing algorithm for one destination D.


4.39 Degree priority routing algorithm.
4.40 Examples showing the degrees of the nodes A, B, C, and D.
4.41 An example of the degree priority routing algorithm.
4.42 Routing tables of nodes 1, 6, 10, C, and X.

5.1 Design refinement process.
5.2 Safe asynchronous communication using a handshake.
5.3 Lack of consistent global state with multiple, asynchronous clocks.
5.4 Non-determinism in communication between clock domains.
5.5 Example of system communication via shared memory.
5.6 System traces and permanent intermittent errors.
5.7 Scope reduced to include Master 2 only.
5.8 Scope reduced to include Master 1 and Master 2 only.
5.9 Debug flow charts.
5.10 Run/stop debug methods.
5.11 Debug abstractions.
5.12 Debug hardware architecture.
5.13 Example system under debug.
5.14 Off-chip debug infrastructure with software architecture.
5.15 Physical and logical interconnectivity.
6.1 Our design space exploration approach for system-level NoC selection.
6.2 Metis-based Neato visualization of the Spidergon NoC layout.
6.3 Source file for Scotch partitioning tool.
6.4 Target file for Scotch partitioning tool.
6.5 Application models for (a) 2-rooted forest (SRF), (b) 2-rooted tree (SRT), (c) 2-node 2-rooted forest (MRF) application task graphs.
6.6 The Mpeg4 decoder task graph.
6.7 The Spidergon topology translates to simple, low-cost VLSI implementation.
6.8 Edge dilation for (a) 2-rooted and (b) 4-rooted forest, (c) 2 node-disjoint and (d) 4 node-disjoint trees, (e) 2 node-disjoint 2-routed and (f) 4 node-disjoint 4-routed forests in function of the network size.
6.9 Relative edge expansion for 12-node Mpeg4 for different target graphs.
6.10 Model of the router used in the considered NoC architectures.
6.11 Maximum throughput as a function of the network size for (a) 2-rooted forest, (b) 4-rooted forest (SRF), (c) 2-rooted tree, (d) 4-rooted tree (SRT), (e) 2-node 2-rooted forest and (f) 4-node 2-rooted forest (MRF) and different NoC topologies.

6.12 Amount of memory required by each interconnect.
6.13 (a) Task execution time and (b) average path length for Mpeg4 traffic on the considered NoC architectures.
6.14 Average throughput on router’s output port for (a) Spidergon, (b) ring, (c) mesh and (d) unbuffered crossbar architecture.
6.15 Network RTT as a function of the initiators’ offered load.
6.16 Future work: dynamic scheduling of tasks.
7.1 Input code: the depth of each loop nest Lk is n (n loops), Ak is n dimensional.
7.2 Code example and its iteration domain.
7.3 An example of loop fusion.
7.4 An example of tiling.
7.5 An example of buffer allocation.
7.6 An example of three loop nests.
7.7 Partitioning after loop fusion.
7.8 Difference between positive and lexicographically positive dependence.
7.9 Tiling technique.
7.10 Buffer allocation for array B.
7.11 Classic partitioning.
7.12 Different partitioning.
7.13 Buffer allocation for array B with new partitioning.
7.14 Sub-division of processor P1’s block.
7.15 Elimination of modulo operators.
7.16 Execution order (a) without fusion (b) after fusion and (c) after unimodular transformation.
7.17 StepNP platform.
7.18 DCache hit ratio results for four CPUs.
7.19 Processing time results for four CPUs.
8.1 Abstraction levels of multi-core software directives, utilities and tools.
8.2 Threading structure of fork-join model.
8.3 Work distribution model.
8.4 Pipeline threading model.
8.5 Scheduling threading structure.
8.6 Parallel functions in thread building blocks.
8.7 Program flow in host and device for NVIDIA CUDA.
8.8 Stream structures using filters.
8.9 OC program in Listing 8.2 distributed into two locations.
8.10 LUSTRE to TTA implementation flow.
8.11 Weakly endochronous program with diamond property.
8.12 Process-based threading model.
8.13 Fine grained thread structure of polychrony.


8.14 SDFG-based multi-threading for SIGNAL.
8.15 TAXYS tool structure with event handling and code generation [23].
8.16 Task precedence in a multi-rate real time application [37].
9.1 Example of HMC-SoC.
9.2 Ideal software organization.
9.3 Parallelization of an application.
9.4 BSP-based software organization.
9.5 BSP-based application development.
9.6 BSP-based boot-up sequence strategies.
9.7 Software organization of a GPOS-based application.
9.8 GPOS-based application development.
9.9 GPOS-based boot-up sequence.
9.10 Software organization of a generated application.
9.11 Examples of computations models.
9.12 Tasks graph with RTOS elements.
9.13 Component architecture.
9.14 Component-based OS software organization.
9.15 Example of a dependency graph.

10.1 Pipelined micro-architecture of an embedded variant of UltraSPARC T1.
10.2 Trap logic unit.
10.3 Chip block diagram.
10.4 Architecture of autonomous hardware power saving logic.
10.5 Global power management unit.
11.1 picoBus interconnect structure.
11.2 Processor structure.
11.3 VLIW and execution unit structure in each processor.
11.4 Tool flow.
11.5 Behavioral simulation instance.
11.6 Example of where-defined program analysis.
11.7 Design browser display.
11.8 Diagnostics output from 802.16 PHY.
11.9 Hardening approach.
11.10 Software implementation of Viterbi decoder and testbench.
11.11 Partially hardened implementation of Viterbi decoder and testbench.
11.12 Fully hardened implementation of Viterbi decoder and testbench.
11.13 Femtocell system.
11.14 Femtocell.
11.15 Femtocell reference board.


12.1 Taxonomy of network processing functions.
12.2 Available clock cycles for processing each packet as a function of clock frequency and link rate in average case (mean packet size of 256 bytes is assumed).
12.3 Typical architecture of integrated access devices (IADs) based on discrete components.
12.4 Typical architecture of SoC integrated network processor for access devices and residential gateways.
12.5 Evolution of switch node architectures: (a) 1st generation (b) 2nd generation (c) 3rd generation.
12.6 PDU flow in a distributed switching node architecture.
12.7 Centralized (a) and distributed (b) NPU-based switch architectures.
12.8 Generic NPU architecture.
12.9 (a) Parallel RISC NPU architecture (b) pipelined RISC NPU architecture (c) state-machine NPU architecture.
12.10 (a) Intel IXP 2800 NPU, (b) Freescale C-5e NPU.
12.11 Architecture of PRO3 reprogrammable pipeline module (RPM).
12.12 The concept of the EZchip architecture.
12.13 Block diagram of the Agere (LSI) APP550.
12.14 The PE (microengine) of the Intel IXP2800.
12.15 TCAM organization [Source: Netlogic].
12.16 Mapping of rules to a two-dimensional classifier.
12.17 iAP organization.
12.18 EZchip table lookup architecture.
12.19 Packet buffer manager on a system-on-chip architecture.
12.20 DMM architecture.
12.21 Details of internal task scheduler of NPU architecture [25].
12.22 Load balancing core implementation [25].
12.23 The Porthos NPU interconnection architecture [32].
12.24 Scheduling in context of processing path of network routing/switching nodes.
12.25 Weighted scheduling of flows/queues contending for same egress network port.
12.26 (a) Architecture extensions for programmable service disciplines. (b) Queuing requirements for multiple port support.


List of Tables

1.1 Growth of VLSI Technology over Four Decades
4.1 Rules for Positions and Orientations of OIPs
6.1 Initiator’s Average Injection Rate and Relative Ratio with Respect to UPS-AMP Node
8.1 SIGNAL Operators and Clock Relations
9.1 Solution Pros and Cons
10.1 Power Gating Status Register
10.2 Power Gating Status Register
10.3 Clock Gating Status Register
10.4 DVFS Status Register
11.1 Viterbi Decoder Transistor Estimates
12.1 DDR-DRAM Throughput Loss Using 1 to 16 Banks
12.2 Maximum Rate Serviced When Queue Management Runs on IXP 1200
12.3 Packet Command and Segment Command Pointer Manipulation Latency
12.4 Performance of DMM



Foreword
I am delighted to introduce the first book on multi-core embedded systems. My
sincere hope is that you will find the following pages valuable and rewarding.
This book is authored to address many challenging topics related to the
multi-core embedded systems research area, starting with multi-core architectures and interconnects, embedded design methodologies for multi-core
systems, to mapping of applications, programming paradigms and models of
computation on multi-core embedded systems.
With the growing complexity of embedded systems and the rapid improvements
in process technology, the development of systems-on-chip and of embedded
systems is increasingly based on the integration of multiple cores, either
homogeneous (such as processors) or heterogeneous. Modern systems are
increasingly utilizing a combination of processors (CPUs, MCUs, DSPs) which
are programmed in software, reconfigurable hardware (FPGAs, PLDs), and
custom application-specific hardware. It appears likely that the next
generation of hardware will be increasingly programmable, blending processors
and configurable hardware.
The book discusses the work done regarding the interactions among multi-core
systems, applications and software views, and processor configuration and
extension, which add a new dimension to the problem space. Multiple cores
used in concert pose a new challenge: they form a concurrent architecture
with resources to schedule and a number of concurrent processes that perform
communication, synchronization, and input and output tasks. The choice of
programming and threading models (whether symmetric or asymmetric),
communication APIs, real-time OS services, and application development are
increasingly challenging areas in the realm of modern multi-core embedded
systems-on-chip.
Beyond exploring different architectures of multi-core embedded systems and
the network-on-chip infrastructures introduced to support these SoCs in a
straightforward manner, this book also presents a number of interrelated
issues. HW/SW development, tools and verification for multi-core systems,
programming models, and models of computation for modern embedded systems
are also explored.
The book may be used either in a graduate-level course as a part of the
subject of embedded systems, computer architecture, and multi-core
systems-on-chip, or as a reference book for professionals and researchers.
It provides a clear view of the technical challenges ahead and brings more
perspectives into the discussion of multi-core embedded systems. This book
is particularly useful for engineers and professionals in industry for easy
understanding of the subject matter and as an aid in both software and
hardware development of their products.
Acknowledgments
I would like to express my sincere gratitude to all the co-authors for their
invaluable contributions, for their constructive comments, and essential
assistance throughout this project. All deserve special thanks for utilizing
their great expertise to make this book exciting.
I also wish to thank Miltos Grammatikakis for his input on chapter organization and his suggestions.
I would also like to mention my publisher, Nora Konopka, Amy Blalock,
and Iris Fahrer for their guidance in authoring and organization.
Finally, I am indebted to my family for their enduring support and encouragement throughout this long and tiring journey.
A windy Sunday morning of February 2010.
Georgios Kornaros

