
User-Level Interprocess Communication for Shared Memory Multiprocessors

BRIAN N. BERSHAD, Carnegie Mellon University
THOMAS E. ANDERSON, EDWARD D. LAZOWSKA, and HENRY M. LEVY, University of Washington

Interprocess communication (IPC) (communication between address spaces on the same machine) has become central to the design of contemporary operating systems. IPC has traditionally been the responsibility of the kernel, but kernel-based IPC has two problems. First, its performance is architecturally limited by the cost of invoking the kernel and reallocating a processor from one address space to another. Second, applications that need inexpensive communication and must provide their own thread management encounter functional and performance problems stemming from the interaction between kernel-based communication and user-level threads. These problems can be solved by moving the communication facilities out of the kernel: on a shared memory multiprocessor, cross-address space communication can be managed entirely at the user level. This paper describes URPC (User-Level Remote Procedure Call), a communication facility motivated by these observations. URPC combines fast cross-address space communication using a shared memory protocol with lightweight, user-level threads, allowing the kernel to be bypassed during communication and processor reallocation. The programmer sees threads and a conventional Remote Procedure Call interface within each address space, though the structure beneath that interface is unconventional, oriented towards user-level management; it is this structure that allows communication performance to be improved.

Index Terms—thread, local communication, multiprocessor, operating system, parallel programming, performance

Categories and Subject Descriptors: D.3.3 [Programming Languages]: Language Constructs and Features—modules, packages; D.4.1 [Operating Systems]: Process Management—concurrency, multiprocessing/multiprogramming; D.4.4 [Operating Systems]: Communications Management; D.4.6 [Operating Systems]: Security and Protection—access controls, information flow controls; D.4.7 [Operating Systems]: Organization and Design; D.4.8 [Operating Systems]: Performance—measurements

This material is based on work supported by the National Science Foundation (grants CCR-8619663, CCR-8700106, CCR-8703049, and CCR-8907666), the Washington Technology Center, and Digital Equipment Corporation (the Systems Research Center and the External Research Program). Bershad was supported by an AT&T Ph.D. Scholarship, and Anderson by an IBM Graduate Fellowship.

Authors' addresses: B. N. Bershad, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213; T. E. Anderson, E. D. Lazowska, and H. M. Levy, Department of Computer Science and Engineering, FR-35, University of Washington, Seattle, WA 98195.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.

© 1991 ACM 0734-2071/91/0500-0175 $01.50

ACM Transactions on Computer Systems, Vol. 9, No. 2, May 1991, Pages 175-198.


General Terms: Design, Performance, Measurement

Additional Key Words and Phrases: Modularity, remote procedure call, small-kernel operating system

1. INTRODUCTION

Efficient interprocess communication is central to the design of contemporary operating systems [16, 23]. An efficient communication facility encourages system decomposition across address space boundaries. Decomposed systems have several advantages over more monolithic ones, including failure isolation (address space boundaries prevent a fault in one module from "leaking" into another), extensibility (new modules can be added to the system without having to modify existing ones), and modularity (interfaces are enforced by mechanism rather than by convention).

Although address spaces can be a useful structuring device, the extent to which they can be used depends on the performance of the communication primitives. If cross-address space communication is slow, the structuring benefits that come from decomposition are difficult to justify to end users, whose primary concern is system performance, and who treat the entire operating system as a "black box" [18] regardless of its internal structure. Consequently, designers are forced to coalesce weakly related subsystems into the same address space, trading away failure isolation, extensibility, and modularity for performance.

Interprocess communication has traditionally been the responsibility of the operating system kernel. However, kernel-based communication has two problems:

—Architectural performance barriers. The performance of kernel-based synchronous communication is architecturally limited by the cost of invoking the kernel and reallocating a processor from one address space to another. In our earlier work on Lightweight Remote Procedure Call (LRPC) [10], we show that it is possible to reduce the overhead of a kernel-mediated cross-address space call to nearly the limit possible on a conventional processor architecture: the time to perform a cross-address space LRPC is only slightly greater than that required to twice invoke the kernel and have it reallocate a processor from one address space to another. The efficiency of LRPC comes from taking a "common case" approach to communication, thereby avoiding unnecessary synchronization, kernel-level thread management, and data copying for calls between address spaces on the same machine. The majority of LRPC's overhead (70 percent) can be attributed directly to the fact that the kernel mediates every cross-address space call.
—Interaction between kernel-based communication and high-performance user-level threads. The performance of a parallel application running on a multiprocessor can strongly depend on the efficiency of thread management operations. Medium and fine-grained parallel applications must use a thread management system implemented at the user level to obtain satisfactory performance [6, 36]. Communication and thread management have strong interdependencies, though, and the cost of partitioning them across protection boundaries (kernel-level communication and user-level thread management) is high in terms of performance and system complexity.
On a shared memory multiprocessor, these problems have a solution: eliminate the kernel from the path of cross-address space communication. Because address spaces can share memory directly, shared memory can be used as the data transfer channel. Because a shared memory multiprocessor has more than one processor, processor reallocation can often be avoided by taking advantage of a processor already active in the target address space. User-level management of cross-address space communication results in improved performance over kernel-mediated approaches because:

—Messages are sent between address spaces directly, without invoking the kernel.

—Unnecessary processor reallocation between address spaces is eliminated, reducing call overhead and helping to preserve valuable cache and TLB contexts across calls.

—When processor reallocation does prove to be necessary, its overhead can be amortized over several independent calls.

—The parallelism inherent in the sending and receiving of a message can be exploited, improving call performance.

User-Level Remote Procedure Call (URPC), the communication system described in this paper, isolates from one another the three components of interprocess communication: processor reallocation, thread management, and data transfer. The abstraction presented to the programmer is a safe and efficient Remote Procedure Call between address spaces on the same machine, implemented using a combination of shared memory and user-level thread management. Only processor reallocation requires kernel involvement; thread management and data transfer do not. Thread management and interprocess communication are done by application-level libraries, rather than by the kernel.
URPC's approach stands in contrast to conventional "small kernel" systems in which the kernel is responsible for address spaces, thread management, and interprocess communication. Moving communication and thread management to the user level leaves the kernel responsible only for the mechanisms that allocate processors to address spaces. For reasons of performance and flexibility, this is an appropriate division of responsibility for operating systems of shared memory multiprocessors. (URPC can also be appropriate for uniprocessors running multithreaded applications.)

The latency of a simple cross-address space procedure call is 93 μsecs using a URPC implementation running on the Firefly, DEC SRC's multiprocessor workstation [37]. Operating as a pipeline, two processors can complete one call every 53 μsecs. On the same multiprocessor hardware, kernel-based RPC facilities that involve the kernel are markedly slower, without providing adequate support for user-level thread management. LRPC, a high performance kernel-based RPC implementation, takes 157 μsecs. The Fork operation for the user-level threads that accompany URPC takes 43 μsecs, while the Fork operation for the kernel-level threads that accompany LRPC takes over a millisecond. To put these figures in perspective, a same-address space procedure call takes 7 μsecs on the Firefly, and a protected kernel invocation (trap) takes 20 μsecs.

We describe the mechanics of URPC in more detail in the next section. In Section 3 we discuss the design rationale behind URPC. We discuss performance in Section 4. In Section 5 we survey related work. Finally, in Section 6 we present our conclusions.
2. A USER-LEVEL REMOTE PROCEDURE CALL FACILITY

In many contemporary uniprocessor and multiprocessor operating systems, applications communicate by sending messages across narrow channels or ports that support only a small number of operations (create, send, receive, destroy). Messages permit communication between programs on the same machine, separated by address space boundaries, or between programs on different machines, separated by physical boundaries. Although messages are a powerful communication primitive, they represent a control and data-structuring device that is foreign to traditional Algol-like languages: within an address space, operating services are invoked with synchronous procedure call, data typing, and shared memory, while communication between address spaces is with untyped, asynchronous messages.

Programmers of message-passing systems who must use one of the many popular Algol-like languages as a systems-building platform must think, program, and structure according to two quite different programming paradigms. Consequently, almost every mature message-based operating system also includes support for Remote Procedure Call (RPC) [11], enabling messages to serve as the underlying transport mechanism beneath a procedure call interface. Nelson defines RPC as the synchronous language-level transfer of control between programs in disjoint address spaces whose primary communication medium is a narrow channel [30]. The definition of RPC is silent about the operation of that narrow channel and how the processor scheduling (reallocation) mechanisms interact with data transfer. URPC exploits this silence in two ways:
—Messages are passed between address spaces through logical channels kept in memory that is pair-wise shared between the client and server. The channels are created and mapped once for every client/server pairing, and are used for all subsequent communication between the two address spaces, so that several interfaces can be multiplexed over the same channel. The integrity of data passed through the shared memory channel is ensured by a combination of the pair-wise mapping (message authentication is implied statically) and the URPC runtime system in each address space (message correctness is verified dynamically).

—Thread management is implemented entirely at the user level and is integrated with the user-level machinery that manages the message channels. A user-level thread sends a message to another address space by directly enqueuing the message on the appropriate outgoing channel. No kernel calls are necessary to send a call or reply message.
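The structure implied by these two points can be made concrete with a short sketch. The following C fragment shows one plausible shape for a pair-wise mapped channel and its kernel-free send; the names (urpc_channel, channel_try_send) and the fixed queue geometry are our own illustrative assumptions, not the paper's actual interface:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <string.h>

    #define QUEUE_DEPTH 16            /* assumed; the paper gives no depth */
    #define MSG_WORDS   64

    typedef struct {
        atomic_flag lock;             /* test-and-set lock on this end     */
        unsigned    head, tail;       /* ring buffer cursors               */
        unsigned    msg[QUEUE_DEPTH][MSG_WORDS];
    } msg_queue;

    /* One logical channel, pair-wise mapped into exactly two address
     * spaces at bind time: calls flow one way, replies the other.     */
    typedef struct {
        msg_queue call;               /* client -> server                  */
        msg_queue reply;              /* server -> client                  */
    } urpc_channel;

    /* Enqueue a message with no kernel call.  The lock is nonspinning
     * (see Section 3.2.1): if it is busy, return and let the caller do
     * other work rather than spin-wait.                                */
    bool channel_try_send(msg_queue *q, const unsigned *m)
    {
        if (atomic_flag_test_and_set(&q->lock))
            return false;             /* lock busy: never wait for it      */
        bool ok = (q->tail - q->head) < QUEUE_DEPTH;
        if (ok)
            memcpy(q->msg[q->tail++ % QUEUE_DEPTH], m, sizeof q->msg[0]);
        atomic_flag_clear(&q->lock);
        return ok;
    }

Because the queue memory is mapped only into the two communicating address spaces, a corrupted message can harm only the parties to that channel; correctness is still verified dynamically by the runtime, as noted above.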

Although a cross-address space procedure call is synchronous from the perspective of the programmer, it is asynchronous at and beneath the level of thread management. When a thread in a client invokes a procedure in a server, it blocks waiting for the reply signifying the procedure's return; while blocked, its processor can run another ready thread in the same address space. In our earlier system, LRPC, the blocked thread and the ready thread were really the same; the thread just crosses an address space boundary. In contrast, URPC always tries to schedule another thread from the same address space on the client thread's processor. This is a scheduling operation that can be handled entirely by the user-level thread management system. When the reply arrives, the blocked client thread can be rescheduled on any of the processors allocated to the client's address space, again without kernel intervention. Similarly, execution of the call on the server side can be done by a processor already executing in the context of the server's address space, and need not occur synchronously with the call.
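A minimal sketch of this structure follows, with the FastThreads-style operations (thread_yield, thread_block_on_reply) as assumed placeholder names rather than the system's documented interface:

    /* Assumed user-level thread operations (illustrative names): */
    extern void thread_yield(void);                   /* run another ready thread */
    extern void thread_block_on_reply(void *channel); /* stop until reply arrives */
    extern bool channel_try_send_call(void *channel, const unsigned *args);
    extern void channel_copy_reply(void *channel, unsigned *results);

    /* Client side of a cross-address space call: synchronous to the
     * caller, asynchronous underneath.  No kernel traps on this path. */
    void urpc_call(void *channel, const unsigned *args, unsigned *results)
    {
        while (!channel_try_send_call(channel, args))
            thread_yield();           /* channel busy: let another thread run */

        /* Block this thread; the user-level scheduler switches the
         * processor to another ready thread in the same address space.
         * When the reply is detected, this thread goes back on the ready
         * queue and may resume on any processor allocated to the client. */
        thread_block_on_reply(channel);
        channel_copy_reply(channel, results);
    }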
By preferentially scheduling threads within the same address space, URPC takes advantage of the fact that there is significantly less overhead involved in switching a processor to another thread in the same address space (we will call this context switching) than in reallocating it to a thread in another address space (we will call this processor reallocation). Processor reallocation requires changing the mapping registers that define the virtual address space context in which a processor is executing. On conventional processor architectures, these mapping registers are protected and can only be accessed in privileged kernel mode.

Several components contribute to the high overhead of processor reallocation. There are scheduling costs to decide the address space to which a processor should be reallocated; immediate costs to update the virtual memory mapping registers and to transfer the processor between address spaces; and long-term costs due to the diminished performance of the cache and translation lookaside buffer (TLB) that occurs when locality shifts from one address space to another [3]. Although there is a long-term cost associated with context switching within the same address space, that cost is less than when processors are frequently reallocated between address spaces [20].

To demonstrate the relative overhead involved in switching contexts between two threads in the same address space versus reallocating processors between address spaces, we introduce the concept of a minimal latency same-address space context switch. On the C-VAX, the minimal latency same-address space context switch requires saving and then restoring the machine's general-purpose registers, and takes about 15 μsecs. In contrast, reallocating the processor from one address space to another takes about 55 μsecs on the C-VAX, without including the long-term cost. Because of the cost difference between context switching and processor reallocation, URPC strives instead to avoid processor reallocation whenever possible.
Processor reallocation is sometimes necessary with URPC, though. If a client invokes a procedure in a server that has an insufficient number of processors to handle the call (e.g., the server's processors are busy doing other work), the client's processors may idle for some time awaiting the reply. This is a load-balancing problem. URPC considers an address space with pending incoming messages on its channel to be "underpowered." An idle processor in the client address space can balance the load by reallocating itself to a server that has pending (unprocessed) messages from that client. Processor reallocation requires that the kernel be invoked to change the processor's virtual memory context to that of the underpowered address space. That done, the kernel upcalls into a server routine that handles the client's outstanding requests. After these have been processed, the processor is returned to the client address space via the kernel.

The responsibility for detecting incoming messages and scheduling threads to handle them belongs to special, low-priority threads that are part of a URPC runtime library linked into each address space. Processors scan for incoming messages only when they would otherwise be idle.

Figure 1 illustrates a sample execution timeline for URPC, with time proceeding downward. One client, an editor, and two servers, a window manager (WinMgr) and a file cache manager (FCMgr), are shown. Each is in a separate address space. Two threads in the editor, T1 and T2, invoke procedures in the window manager and the file cache manager. Initially, the editor has one processor allocated to it, the window manager has one, and the file cache manager has none. T1 first makes a cross-address space call into the window manager by sending and then attempting to receive a message. The window manager receives the message and begins processing. In the meantime, the thread scheduler in the editor has context switched from T1 (which is blocked waiting to receive a message) to another thread T2. Thread T2 initiates a procedure call to the file cache manager by sending a message and waiting to receive a reply. T1 can be unblocked because the window manager has sent a reply message back to the editor. Thread T1 then calls into the file cache manager and blocks. The file cache manager now has two pending messages from the editor but no processors with which to handle them. Threads T1 and T2 are blocked in the editor, which has no other threads waiting to run. The editor's thread management system detects the outgoing pending messages and donates a processor to the file cache manager, which can then receive, process, and reply to the editor's two incoming calls before returning the processor back to the editor. At this point, the two incoming reply messages from the file cache manager can be handled. T1 and T2 each terminate when they receive their replies.

Fig. 1. URPC timeline. [The figure shows three columns, the editor, the window manager (WinMgr), and the file cache manager (FCMgr), with sends, receives, context switches, the reallocation of a processor to the file cache manager, and its return to the editor, with time proceeding downward.]
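The scanning and donation behavior shown in the timeline can be sketched as the idle loop of the low-priority scanner threads. All names below are illustrative assumptions, with Processor_Donate standing in for the kernel's Processor.Donate operation (Section 3.1.2):

    extern int  nchannels;
    extern struct urpc_channel *channels[];   /* one per client/server pairing */
    extern bool have_incoming(struct urpc_channel *);
    extern void run_handler(struct urpc_channel *);  /* dispatch a thread      */
    extern int  underpowered_peer(void);      /* peer with unanswered messages */
    extern void Processor_Donate(int address_space); /* kernel call            */

    /* Run by special, low-priority threads; executes only when the
     * processor would otherwise be idle. */
    void idle_scan(void)
    {
        for (;;) {
            for (int i = 0; i < nchannels; i++)
                if (have_incoming(channels[i]))
                    run_handler(channels[i]); /* work arrived: no kernel needed */

            /* Nothing runnable here: balance load by reallocating this
             * processor to a peer that has pending messages from us.   */
            int peer = underpowered_peer();
            if (peer >= 0)
                Processor_Donate(peer);
        }
    }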
URPC's treatment of pending messages is analogous to the treatment of runnable threads waiting on a ready queue. Each represents an execution context in need of attention from a physical processor; otherwise idle processors poll the queues looking for work in the form of these execution contexts. There are two main differences between the operation of message channels and thread ready queues. First, URPC enables one address space to spawn work in another address space by enqueuing a message (an execution context consisting of some control information and arguments or results).

Second, URPC enables processors in one address space to execute in the context of another through an explicit reallocation if the workload between them is imbalanced.

Fig. 2. The software components of URPC. [The figure shows, in each of the client and server address spaces, the application above its stubs, the stubs above the URPC package, and URPC above FastThreads; the two URPC packages are connected by a shared memory message channel, and processors move between the FastThreads layers via reallocation.]
2.1 The User View

It is important to emphasize that all of the URPC mechanisms described in this section exist "under the covers" of quite ordinary looking Modula2+ [34] interfaces. The RPC paradigm provides the freedom to implement the control and data transfer mechanisms in ways that are best matched to the underlying hardware. URPC consists of two software packages used by stubs and application code. One package, called FastThreads [5], provides lightweight threads that are managed at the user level and scheduled on top of middleweight kernel threads. The second package, called URPC, implements the channel management and message primitives described in this section. The URPC package lies directly beneath the stubs and closely interacts with FastThreads, as shown in Figure 2.

3. RATIONALE FOR THE URPC DESIGN

In this section we discuss the design rationale behind URPC. In brief, this rationale is based on the observation that there are several independent components to a cross-address space call, and each can be implemented separately. The main components are the following:

—Thread management: blocking the caller's thread, running a thread through the procedure in the server's address space, and resuming the caller's thread on return;
—Data transfer: moving arguments between the client and server address spaces; and

—Processor reallocation: ensuring that there is a physical processor to handle the client's call in the server and the server's reply in the client.

In conventional message systems, these three components are combined beneath a kernel interface, leaving the kernel responsible for each. However, thread management and data transfer do not require kernel assistance; only processor reallocation does. In the three subsections that follow we describe how the components of a cross-address space call can be isolated from one another, and the benefits that arise from such a separation.
3.1 Processor Reallocation

Processor reallocation has a potentially serious effect on the performance of a call, and on the delay in the processing of other threads. URPC attempts to reduce the frequency with which processor reallocation occurs through the use of an optimistic reallocation policy. At call time, URPC optimistically assumes the following:

—The client has other work to do, in the form of runnable threads in the client's address space.

—The server has, or will soon have, a processor with which it can service incoming messages.

In terms of performance, URPC's optimistic assumptions can pay off in several ways. The first assumption makes it possible to do an inexpensive context switch between user-level threads during the blocking phase of a cross-address space call. The second assumption enables a URPC to execute in parallel with threads in the client's address space while avoiding a processor reallocation. An implication of both assumptions is that it is possible to amortize the cost of a single processor reallocation across several outstanding calls between the same address spaces. For example, if two threads in the same client make calls into the same server in succession, then a single processor reallocation can handle both.

A user-level approach to communication and thread management is appropriate for shared memory multiprocessors where applications rely on aggressive multithreading to exploit parallelism while at the same time compensating for multiprogramming effects and memory latencies [2], where a few key operating system services are the target of the majority of all application calls [8], and where operating system functions are affixed to specific processing nodes for the sake of locality [31].

In contrast to URPC, contemporary kernel structures are not well suited for use on shared memory multiprocessors:

—Kernel-level communication and thread management facilities rely on pessimistic processor reallocation policies, and are unable to exploit concurrency within an application to reduce the overhead of communication. Handoff scheduling [13] underscores this pessimism: a single kernel operation blocks the client and reallocates its processor directly to the server.
Although handoff scheduling does improve performance, the improvement is limited by the cost of kernel invocation and processor reallocation.

—In a traditional operating system kernel, designed for a uniprocessor but running on a multiprocessor, kernel resources are logically centralized but distributed over the many processors in the system. For a medium- to large-scale shared memory multiprocessor such as the Butterfly [7], Alewife [4], or DASH [25], URPC's user-level orientation to operating system design localizes system resources to those processors where the resources are in use, relaxing the performance bottleneck that comes from relying on centralized kernel data structures. This bottleneck is due to the contention for logical resources (locks) and physical resources (memory and interconnect bandwidth) that results when a few data structures, such as thread run queues and message channels, are shared by a large number of processors. By distributing the communication and thread management functions to the address spaces (and hence processors) that use them, URPC can also reduce the contention directly.

URPC can be appropriate on uniprocessors running multithreaded applications, where inexpensive threads can be used to express the logical and physical concurrency within a problem. Low overhead threads and communication make it possible to overlap even small amounts of external computation. Further, multithreaded applications that are able to benefit from delayed reallocation can do so without having to develop their own communication protocols [19]. Although reallocation will eventually be necessary on a uniprocessor, it can be delayed by scheduling within an address space for as long as possible.
3.1.1 The Optimistic Assumptions Won't Always Hold. In cases where the optimistic assumptions do not hold, it is necessary to invoke the kernel to force a processor reallocation from one address space to another. Examples of where it is inappropriate to rely on URPC's optimistic processor reallocation policy are single-threaded applications, real-time applications (where latency must be bounded), high-latency I/O operations (where it is best to initiate the I/O operation early, since it will take a long time for the call to complete), and priority invocations (where the thread executing the cross-address space call is of high priority). To handle these situations, URPC allows the client's address space to force a processor reallocation to the server's, even though there might still be runnable threads in the client's.
3.1.2 The Kernel Handles Processor Reallocation. The kernel implements the mechanism that supports processor reallocation. When an idle processor discovers an underpowered address space, it invokes a procedure called Processor.Donate, passing in the identity of the address space to which the processor (on which the invoking thread is running) should be reallocated. Processor.Donate transfers control down through the kernel and then up to a prespecified address in the receiving space. The identity of the donating address space is made known to the receiver by the kernel.
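As a rough C rendering of this interface (the paper names the operation Processor.Donate but gives no concrete signature, so the types below are assumptions):

    typedef int address_space_id;

    /* Trap into the kernel: switch this processor's virtual memory
     * context to `target` and resume execution at target's prespecified
     * upcall address.  This is the only kernel mechanism URPC needs.  */
    extern void Processor_Donate(address_space_id target);

    /* Registered once per address space.  The kernel upcalls here when
     * a processor is donated, passing the donor's identity so that the
     * receiver knows whose outstanding requests to service.           */
    void upcall_entry(address_space_id donor);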

3.1.3 Voluntary Return of Processors Cannot Be Guaranteed. A service interface defines a contract between a client and a server. In the case of URPC, as with traditional RPC systems, implicit in the contract is that the server obey the policies that determine when a processor is to be returned back to the client. The URPC communication library implements the following policy in the server: upon receipt of a processor from a client address space, return the processor when all outstanding messages from the client have generated replies, or when the server determines that the client has become "underpowered" (there are outstanding messages and one of the server's processors is idle).

Although URPC's runtime libraries implement a specific protocol, there is no way to enforce that protocol. Just as with kernel-based systems, once the kernel transfers control of a processor to an application, there is no guarantee that the application will voluntarily return the processor to the client from the procedure that the client invoked. The server could, for example, use the processor to handle requests from other clients, even though this was not what the client had intended.

It is necessary to ensure that applications receive a fair share of the available processing power. URPC's direct processor reallocation deals only with the problem of load balancing between applications that are communicating with one another. Independent of communication, a multiprogrammed system requires policies and mechanisms that balance the load between noncommunicating (or noncooperative) address spaces. Applications, for example, must not be able to starve one another out for processors, and servers must not be able to delay clients indefinitely by not returning processors. Preemptive policies, which forcibly reallocate processors from one address space to another, are therefore necessary to ensure that applications make progress.

A processor reallocation policy should be work conserving, in that no high-priority thread waits for a processor while a lower priority thread runs, and, by implication, that no processor idles when there is work for it to do anywhere in the system, even if the work is in another address space. The specifics of how to enforce this constraint in a system with user-level threads are beyond the scope of this paper, but are discussed by Anderson et al. in [6].

The direct processor reallocation done in URPC can be thought of as an optimization of a work-conserving policy. A processor idling in the address space of a URPC client can determine which address spaces are not responding to that client's calls, and therefore which address spaces are, from the standpoint of the client, the most eligible to receive a processor. There is no reason for the client to first voluntarily relinquish the processor to a global processor allocator, only to have the allocator then determine to which address space the processor should be reallocated. This is a decision that can be easily made by the client itself.
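Continuing the sketch from Section 3.1.2, the server-side return policy described above can be rendered as the body of the upcall handler; the helper names are assumptions:

    typedef int address_space_id;
    extern void Processor_Donate(address_space_id target);   /* kernel call */
    extern struct urpc_channel *channel_for(address_space_id donor);
    extern int  unanswered_calls(struct urpc_channel *);
    extern bool this_space_has_idle_processor(void);
    extern void serve_one_call(struct urpc_channel *);  /* receive, run, reply */

    /* Keep the donated processor only while the donor still has calls
     * awaiting replies, and give it back early if we have spare
     * capacity of our own.  The return is voluntary: nothing here can
     * be enforced, which is exactly the point of Section 3.1.3.       */
    void upcall_entry(address_space_id donor)
    {
        struct urpc_channel *ch = channel_for(donor);
        while (unanswered_calls(ch) > 0 && !this_space_has_idle_processor())
            serve_one_call(ch);
        Processor_Donate(donor);      /* hand the processor back */
    }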
3.2 Data Transfer Using Shared Memory

The requirements of a communication system can be relaxed, and its efficiency improved, when it is layered beneath a procedure call interface rather than exposed to programmers directly. In particular, arguments that are part of a cross-address space procedure call can be passed using shared memory while still guaranteeing safe interaction between mutually suspicious subsystems.

Shared memory message channels do not increase the "abusability factor" of client-server interactions. As with traditional RPC, clients and servers can still overload one another, deny service, provide bogus results, and violate communication protocols (e.g., fail to release channel locks, or corrupt channel data structures). And, as with traditional RPC, it is up to higher level protocols to ensure that lower level abuses filter up to the application in a well-defined manner (e.g., by raising a call-failed exception or by closing down the channel).

In URPC, the safety of communication based on shared memory is the responsibility of the stubs. The arguments of a URPC are passed in message buffers that are allocated and pair-wise mapped during the binding phase that precedes the first cross-address space call between a client and server. On receipt of a message, the stubs unmarshal the message's data into procedure parameters, doing whatever copying and checking is necessary to ensure the application's safety.

Traditionally, safe communication is implemented by having the kernel copy data from one address space to another. Such an implementation is necessary when application programmers deal directly with raw data in the form of messages. But, when standard runtime facilities and stubs are used, as is the case with URPC, having the kernel copy data between address spaces is neither necessary nor sufficient to guarantee safety.

Copying in the kernel is not necessary because programming languages are implemented to pass parameters between procedures on the stack, the heap, or in registers. When data is passed between address spaces, none of these storage areas can, in general, be used directly by both the client and the server. Therefore, the stubs must copy the data into and out of the memory used to move arguments between address spaces. Safety is not increased by first doing an extra kernel-level copy of the data.

Copying in the kernel is not sufficient for ensuring safety when using type-safe languages such as Modula2+ or Ada [1], since each actual parameter must be checked by the stubs for conformity with the type of its corresponding formal. Without such checking, for example, a client could crash the server by passing in an illegal (e.g., out of range) value for a parameter.

These points motivate the use of pair-wise shared memory for cross-address space communication. Pair-wise shared memory can be used to transfer data between address spaces more efficiently, but just as safely, as messages that are copied by the kernel between address spaces.

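As a small illustration of the kind of conformity check the stubs must make (the subrange type and all names here are invented for the example):

    /* A formal parameter declared as a subrange, e.g. [0..7] in Modula2+ */
    #define PRIO_MIN 0
    #define PRIO_MAX 7

    extern void server_set_priority(int prio);       /* the real procedure   */
    extern void reply_call_failed(void *channel);    /* error back to client */

    /* Server stub: copy the argument out of the shared buffer first,
     * then check the private copy.  Checking the shared location
     * directly would let a malicious client change the value between
     * the check and the use. */
    void stub_set_priority(void *channel, const int *shared_arg)
    {
        int prio = *shared_arg;                      /* unmarshal (copy)     */
        if (prio < PRIO_MIN || prio > PRIO_MAX) {
            reply_call_failed(channel);              /* reject, don't crash  */
            return;
        }
        server_set_priority(prio);
    }

A kernel-level copy of the same bytes would not have caught the out-of-range value; only the stub, which knows the parameter's type, can.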
3.2.1 Controlling Channel Access. Data flows between the URPC packages in different address spaces over a bidirectional shared memory queue with test-and-set locks on either end. To prevent processors from waiting indefinitely on message channels, the locks are nonspinning; i.e., the lock protocol is simply: if the lock is free, acquire it, or else go on to something else, and never spin-wait. The rationale here is that the receiver of a message should never have to delay while the sender holds the lock. If the sender stands to benefit from the receipt of a message (as on call), then it is up to the sender to ensure that locks are not held indefinitely; if the receiver stands to benefit (as on return), then the sender could just as easily prevent the benefits from being realized by not sending the message in the first place.
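In C11 terms, the protocol amounts to a try-lock that is never retried in place. A minimal sketch, with our own names:

    #include <stdatomic.h>
    #include <stdbool.h>

    typedef struct { atomic_flag held; } channel_lock;

    /* Acquire only if free; the caller must be prepared for failure
     * and go do something else, e.g., poll a different channel.     */
    static bool lock_try_acquire(channel_lock *l)
    {
        return !atomic_flag_test_and_set_explicit(&l->held,
                                                  memory_order_acquire);
    }

    static void lock_release(channel_lock *l)
    {
        atomic_flag_clear_explicit(&l->held, memory_order_release);
    }

    /* Typical receiver-side use:
     *
     *   if (lock_try_acquire(&q->lock)) {
     *       ... dequeue one message ...
     *       lock_release(&q->lock);
     *   }
     *   -- otherwise: move on; never spin-wait on the channel.
     */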

3.3 Cross-Address Space Procedure Call and Thread Management

The calling semantics of a cross-address space procedure call, like those of a normal procedure call, are synchronous with respect to the thread performing the call. Consequently, there is a strong interaction between the software system that manages threads and the one that manages cross-address space communication. Each underlying communication function (send and receive) involves a corresponding thread management function (start and stop). On call, the client thread sends a message to the server, which causes a thread in the server to be started; the client thread stops, waiting on a receive for the reply. When the server thread finishes, it sends a response back to the client, which causes the client's waiting thread to be started; the server's thread stops again on a receive, waiting for the next message. In this subsection we explore the relationship between cross-address space communication and thread management. In brief, we argue the following:

—High performance thread management facilities are necessary for fine-grained parallel programs.

—While thread management facilities can be provided either in the kernel or at the user level, high-performance thread management can only be provided at the user level.

—The close interaction between communication and thread management can be exploited to achieve extremely good performance for both, when both are implemented together at the user level.
3.3.1 Concurrent Programming and Thread Management. Concurrency can simplify the structure of parallel programs. Multiple threads within an address space can be entirely internal to an application, as with a parallel program on a multiprocessor, or the concurrency can be external, as with an application that needs to overlap some amount of its own computation with activity on its behalf in another address space. In either case, the programmer needs access to a thread management system that makes it possible to create, schedule, and destroy threads of control. Support for thread management can be described in terms of a cost continuum on which there are three major points of reference: heavyweight, middleweight, and lightweight, for which thread management overheads are on the order of thousands, hundreds, and tens of μsecs, respectively.

Kernels supporting heavyweight threads [26, 32, 35] make no distinction between a thread, the dynamic component of a program, and its address space, the static component. Threads and address spaces are created, scheduled, and destroyed as single entities. Because an address space has a large amount of baggage, such as open file descriptors, page tables, and accounting state that must be manipulated during thread management operations, operations on heavyweight threads are costly, taking tens of milliseconds.

Many contemporary operating system kernels provide support for middleweight, or kernel-level, threads [12, 36]. In contrast to heavyweight threads, address spaces and middleweight threads are decoupled, so that a single address space can have many middleweight threads. As with heavyweight threads, though, the kernel is responsible for implementing the thread management functions, and directly schedules an application's threads on the available physical processors. With middleweight threads, the kernel defines a general programming interface for use by all concurrent applications.

Unfortunately, the cost of this generality affects the performance of fine-grained parallel applications, even those for which the features are unnecessary. Features such as time slicing, or saving and restoring floating point registers when switching contexts between threads, can be sources of performance overhead for all parallel applications. A kernel-level thread management system must also protect itself from malfunctioning or malfeasant programs. Expensive (relative to procedure call) protected kernel traps are required to invoke any thread management operation, and the kernel must treat each invocation suspiciously, checking arguments and access privileges to ensure legitimacy.

In addition to the direct overheads that contribute to the high cost of kernel-level thread management, there is an indirect, but more pervasive, effect of thread management policy on program performance. There are a large number of parallel programming models, and within these, a wide variety of scheduling disciplines that are most appropriate (for performance) to a given model [25, 33]. Performance, though, is strongly influenced by the choice of interfaces, data structures, and algorithms used to implement threads, so a single model represented by one style is unlikely to have an implementation that is efficient for all parallel programs.

In response to the costs of kernel-level threads, programmers have turned to lightweight threads that execute in the context of the middleweight or heavyweight threads provided by the kernel, but that are managed at the user level by a library linked in with each application [9, 38]. Lightweight thread management implies two-level scheduling, since the application library schedules lightweight threads on top of weightier threads, which are themselves being scheduled by the kernel.

Two-level schedulers attack both the direct and indirect costs of kernel threads. Directly, it is possible to implement efficient user-level thread management functions because they are accessible through a simple procedure call rather than a trap, need not be bulletproofed against application errors, and can be customized to provide only the level of support needed by a given application. The cost of thread management operations in a user-level scheduler can be orders of magnitude less than for kernel-level threads.
3.3.2 The Problem with Having Functions at Two Levels. Threads block for two reasons: (1) to synchronize their activities within an address space (e.g., while controlling access to critical sections), and (2) to wait for external events in other address spaces (e.g., while reading keystrokes from a window manager). If threads are implemented at the user level (necessary for inexpensive and flexible concurrency), but communication is implemented in the kernel, then the performance benefits of inexpensive user-level scheduling and context switching are lost when threads synchronize with applications in separate address spaces. In effect, each synchronizing communication operation must be implemented and executed at two levels: (1) once in the application, so that the user-level scheduler accurately reflects the scheduling state of the application, and again (2) in the kernel, so that applications awaiting kernel-mediated communication activity are properly notified. In contrast, when communication and thread management are moved together out of the kernel and implemented at the user level, the user-level system can control the scheduling and synchronization needed by the communication machinery with the same primitives that are used by applications.
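The contrast can be seen in a sketch of a user-level wait when both facilities live at one level: a single state change, visible to both the scheduler and the communication machinery, with no kernel duplicate. The names are assumed:

    typedef struct thread thread;
    extern thread *current_thread;
    extern void enqueue_waiter(void *channel, thread *t); /* channel wait list */
    extern void switch_to_next_ready(void);               /* user-level switch */

    /* Blocking for a reply updates the scheduler's state and the
     * channel's state in one user-level operation; nothing has to be
     * repeated in the kernel, because the kernel is not on the
     * communication path. */
    void wait_for_reply(void *channel)
    {
        enqueue_waiter(channel, current_thread);
        switch_to_next_ready();   /* resumes here once the reply arrives */
    }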
3.4 Summary of Rationale

This section has described the design rationale behind URPC. The main advantages of a user-level approach to communication are that calls can be made without invoking the kernel and without unnecessarily reallocating processors between address spaces. Further, when processor reallocations do turn out to be necessary, their cost can be amortized over multiple calls. Finally, user-level communication can be cleanly integrated with user-level thread management facilities, so that fine-grained parallel programs can inexpensively synchronize between, as well as within, address spaces.
4. THE PERFORMANCE OF URPC

This section describes the performance of URPC on the Firefly, an experimental multiprocessor workstation built by DEC's Systems Research Center. A Firefly can have as many as six C-VAX microprocessors; the Firefly used to collect the measurements described in this paper had four C-VAX processors, plus one MicroVAX-II dedicated to I/O, and 32 megabytes of main memory.
4.1 User-Level Thread Management Performance

In this section we compare the performance of thread management primitives when implemented at the user level and in the kernel. The kernel-level threads are those provided by Taos, the Firefly's native operating system.

Table I shows the cost of several thread management operations implemented at the user level and at the kernel level. PROCEDURE CALL invokes the null procedure and provides a baseline value for the machine's performance. FORK creates, starts, and terminates a thread. FORK;JOIN is like FORK, except that the thread which creates the new thread blocks until the forked thread completes. YIELD forces a thread to make a full trip through the scheduling machinery; when a thread yields the processor, it is blocked (state saved), made ready (enqueued on the ready queue), scheduled

(dequeued), and unblocked (state restored), and then continued. ACQUIRE;RELEASE acquires and then releases a blocking (nonspinning) lock for which there is no contention. PINGPONG has two threads "pingpong" back and forth on a condition variable, blocking and unblocking one another by means of paired signal and wait operations. Each pingpong cycle blocks, schedules, and unblocks two threads in succession.

Table I. Comparative Performance of Thread Management Operations (μsecs)

    Test                 FastThreads    Taos Threads
    PROCEDURE CALL             7               7
    FORK                      43            1192
    FORK;JOIN                102            1574
    YIELD                     37              57
    ACQUIRE;RELEASE           27              27
    PINGPONG                  53             271

Each test was run a large number of times as part of a loop, and the elapsed time was divided by the loop limit to determine the values in the table. The FastThreads tests were constrained to run on a single processor. The kernel-level threads had no such constraint; Taos caches recently executed kernel-level threads on idle processors to reduce wakeup latency. Because the tests were run on an otherwise idle machine, the caching optimization worked well and we saw little variance in the measurements. The table demonstrates the performance advantage of implementing thread management at the user level, where the operations have an overhead on the order of a few procedure calls, rather than a few hundred, as when threads are provided by the kernel.
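The measurement method described above translates directly into a loop of the following shape. This is a modern POSIX-clock rendering of the paper's procedure, with thread_yield standing in for whichever operation is under test:

    #include <stdio.h>
    #include <time.h>

    extern void thread_yield(void);   /* operation under test, e.g. YIELD */

    #define ITERS 100000L

    int main(void)
    {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < ITERS; i++)
            thread_yield();
        clock_gettime(CLOCK_MONOTONIC, &t1);

        /* elapsed time divided by the loop limit gives the per-op cost */
        double usecs = (t1.tv_sec  - t0.tv_sec)  * 1e6 +
                       (t1.tv_nsec - t0.tv_nsec) / 1e3;
        printf("YIELD: %.2f usecs per operation\n", usecs / ITERS);
        return 0;
    }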
4.2 URPC Component Breakdown

In the case when an address space has enough processing power to handle all incoming messages without requiring a processor reallocation, the work done during a URPC can be broken down into four components: send (enqueuing a message on an outgoing channel); poll (detecting the channel on which a message has arrived); receive (dequeueing the message); and dispatch (dispatching the appropriate thread to handle an incoming message). Table II shows the time taken by each component.

Table II. Breakdown of a URPC

    Component    Client (μsecs)    Server (μsecs)
    send               18                13
    poll                6                 6
    receive            10                 9
    dispatch           20                25
    Total              54                53

The total processing times are almost the same for the client and the server. Nearly half of that time goes towards thread management (dispatch), demonstrating the influence that thread management overhead has on communication performance.

The send and receive times shown in Table II do not reflect the cost of marshaling data into and out of the buffers passed between address spaces. On the Firefly, marshaling adds on average about one μsec of latency for each additional word of data passed in a call. Because most cross-address space procedure calls pass only a small amount of data [8, 16, 17, 24], and the data is copied through shared memory, the performance of the components shown in Table II is not strongly influenced by the cost of data transfer.
4.3 Call Latency and Throughput

The figures in Table II for latency and throughput are for the case when there is no need for processor reallocation. Two other important metrics, though, are load dependent. Both call latency and call throughput depend on the number of client processors, C, the number of server processors, S, and the number of runnable threads in the client's address space, T.

To evaluate latency and throughput, we ran a series of tests in which we timed how long it took T client threads to make 100,000 "Null" procedure calls into the server, using different values of T, C, and S. Each client thread issues its calls inside a tight loop. The Null procedure call takes no arguments, computes nothing, and returns no results; it exposes the performance characteristics of a round-trip cross-address space control transfer.

Figures 3 and 4 graph call latency and throughput as a function of T. Each line on the graphs represents a different combination of S and C. Latency is measured as the time from when a thread calls into the Null stub to when control returns from the stub, and depends on the speed of the message primitives and the length of the thread ready queues in the client and server. When the number of caller threads increases (T > C + S), call latency in the client exceeds the total call processing times (see Table II), since each call must wait for a free processor in the client or the server, or both. Throughput, on the other hand, is less sensitive to T. Except for the special case where S = 0, throughput as a function of T improves until T > C + S, when the number of calling threads exceeds the number of client and server processors. At that point, the processors are kept completely busy, so there can be no further improvement in throughput by increasing T.

When the number of client and server processors is 1 (T = C = S = 1), call latency is 93 μsecs. This is "pure" latency, in the sense that it reflects the basic processing required to do a round-trip call.¹ Unfortunately, the latency includes a large amount of time in the client and server that is wasted due to idling while each waits for the next call or reply message.
¹ The 14 μsec discrepancy between the observed latency and the expected latency (54 + 53, cf. Table II) is due to a low-level scheduling optimization that allows the caller thread's context to remain loaded on the processor if the ready queue is empty and there is no other work to be done anywhere in the system. In this case, the idle loop executes in the context of the caller thread, enabling a fast wakeup when the reply message arrives. At T = C = S, the optimization is enabled. The optimization is also responsible for the small "spike" in the throughput curves shown in Figure 4.

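The test series just described has, in outline, the following shape. Here T, C, and S are set per run; urpc_Null stands for the generated stub, and thread_fork and threads_join_all for the FastThreads operations; all names are assumptions:

    extern void urpc_Null(void);               /* Null stub: no args, no results */
    extern void thread_fork(void (*fn)(void)); /* user-level FORK (cf. Table I)  */
    extern void threads_join_all(void);

    enum { T = 4, CALLS = 100000 };            /* T is varied along the x axis   */

    static void caller(void)
    {
        for (int i = 0; i < CALLS / T; i++)
            urpc_Null();                       /* tight loop of Null calls       */
    }

    void run_experiment(void)
    {
        /* Elapsed time across all T callers, divided into the 100,000
         * calls, gives latency per call; its inverse gives calls per
         * second, as plotted in Figures 3 and 4.                      */
        for (int i = 0; i < T; i++)
            thread_fork(caller);
        threads_join_all();
    }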
Fig. 3. URPC call latency. [The figure plots call latency in μsecs (0 to 1500) against the number of client threads T, with one curve for each combination of C and S.]

At T = 2, C = 1, S = 1, latency increases by only 20 percent (to 112 μsecs), but throughput increases by nearly 75 percent (from 10,300 to 17,850 calls per second). This increase reflects the fact that message processing can be done in parallel between the client and server. Operating as a two-stage pipeline, the client and server complete one call every 53 μsecs.

For larger values of S and C, the same analysis holds. As long as T is sufficiently large to keep the processors in the client and server address spaces busy, throughput is maximized. Note that throughput for S = C = 2 is not twice as large as for S = C = 1 (23,800 vs. 17,850 calls per second, a 33 percent improvement). Contention for the critical sections that manage access to the thread and message queues in each address space limits throughput over a single channel. Of the 148 instructions required to do a round-trip message transfer, approximately half are part of critical sections guarded by two separate locks kept on a per-channel basis. With four processors, the critical sections are therefore nearly always occupied, so there is slowdown due to queueing at the critical sections. This factor does not constrain the aggregate rate of URPCs between multiple clients and servers, since they use different channels.

When S = 0, throughput and latency are at their worst, due to the need to reallocate processors frequently between the client and server. When T = 1, round-trip latency is 375 μsecs. Every URPC call requires two traps and two processor reallocations. At this point, URPC performs worse than LRPC (157 μsecs).
Fig. 4. URPC call throughput. [The figure plots calls per second (up to about 25,000) against the number of client threads T, with one curve for each combination of C and S.]

We consider this comparison further in Section 5.5. Although call performance is worse, URPC's thread management primitives retain their superior performance (see Table I). As T increases, though, the trap and reallocation overheads can be amortized over outstanding calls. Throughput improves steadily, and latency, once the initial shock of processor reallocation has been absorbed, worsens only because of reallocation and scheduling delays.
4.4 Summary

We have described the performance of the communication and thread management primitives for an implementation of URPC on the Firefly multiprocessor. The performance figures demonstrate the advantages of moving traditional operating system functions out of the kernel and implementing them at the user level. Our approach stands in contrast to traditional operating system methodology. Normally, functionality is pushed into the kernel in order to improve performance, but at the cost of reduced flexibility. With URPC, we are pushing functionality out of the kernel to improve both performance and flexibility.
5. WORK RELATED TO URPC

In this section we examine several other communication systems in the light of URPC. The common theme underlying these other systems is that they reduce the kernel's role in communication in an effort to improve performance.
5.1 Sylvan

Sylvan [14] relies on architectural support to implement the Thoth [15] message-passing primitives. Special coprocessor instructions are used to access these primitives, so that messages can be passed between Thoth tasks without kernel intervention. A pair of 68020 processors, using a coprocessor to implement the communication primitives, can execute a send-receive-reply cycle on a zero-byte message (i.e., the Null call) in 48 μsecs.

Although Sylvan outperforms URPC, the improvement is small considering the use of special-purpose hardware. Sylvan's coprocessor takes care of enqueueing and dequeueing messages and updating thread control blocks when a thread becomes runnable or blocked. Software on the main processor handles thread scheduling and context switching. As Table II shows, though, thread management can be responsible for a large portion of the round-trip processing time. Consequently, a special-purpose coprocessor can offer only limited performance gains.

5.2 Yackos

Yackos [22] is similar to URPC in that it strives to bypass the kernel during communication and is intended for use on shared memory multiprocessors. Messages are passed between address spaces by means of a special message-passing process that shares memory with all communicating processes. There is no automatic load balancing among the processors cooperating to provide a service. Address spaces are single threaded, and clients communicate with a Yackos process running on a single processor. That process is responsible for forwarding the message to a less busy process if necessary. In URPC, address spaces are multithreaded, so a message sent from one address space to another can be fielded by any processor allocated to the receiving address space.

Further, URPC on a multiprocessor in the best case (no kernel involvement) outperforms Yackos by a factor of two for a round-trip call (93 vs. 200 μsecs).² The difference in communication performance is due mostly to the indirection through the interposing Yackos message-passing process. This indirection results in increased latency and data copying. By taking advantage of the client/server pairing implicit in an RPC relationship, URPC avoids the indirection through the use of pair-wise shared message queues. In the pessimistic case (synchronous processor allocation on every message send), URPC is over thirteen times more efficient (375 vs. 5000 μsecs) than Yackos.
5.3 Promises

Argus [28] provides the programmer with a mechanism called a Promise [27] that can be used to make an asynchronous cross-address space procedure call.

² Yackos runs on the Sequent Symmetry, which uses 80386 processors that are somewhat faster than the C-VAX processors used in the Firefly.