Adaptive Cache Aware Multiprocessor Scheduling
Framework
(Research Masters)
A THESIS SUBMITTED TO
THE FACULTY OF SCIENCE AND TECHNOLOGY
OF

QUEENSLAND UNIVERSITY OF TECHNOLOGY

IN FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF

RESEARCH MASTER

Huseyin Gokseli Arslan
Faculty of Science and Technology
Queensland University of Technology
September 2011





Copyright in Relation to This Thesis
© Copyright 2011 by Huseyin Gokseli Arslan. All rights reserved.

Statement of Original Authorship
The work contained in this thesis has not been previously submitted to meet requirements for an
award at this or any other higher education institution. To the best of my knowledge and belief,
the thesis contains no material previously published or written by another person except where
due reference is made.
Signature:


Date:


This thesis is dedicated to my dearest family and my beloved one
for their love and endless support.



Abstract

Computer resource allocation is a significant challenge, particularly for multiprocessor
systems, whose shared computing resources must be allocated among co-runner processes and
threads. Efficient resource allocation yields a highly efficient and stable multiprocessor
system and good individual thread performance, whereas poor resource allocation causes
significant performance bottlenecks even in systems with abundant computing resources. This
thesis proposes a cache aware adaptive closed loop scheduling framework as an efficient
resource allocation strategy for this highly dynamic resource management problem, which
requires instant estimation of highly uncertain and unpredictable resource patterns.
Many different approaches to this highly dynamic resource allocation problem have been
developed, but neither the dynamic nature nor the time-varying and uncertain characteristics
of the problem is well considered. These approaches employ either static and dynamic
optimization methods or advanced scheduling algorithms such as the Proportional Fair (PFair)
scheduling algorithm. Some of these approaches, which do consider the dynamic nature of
multiprocessor systems, apply only a basic closed loop system; hence, they fail to take the
time-varying behavior and uncertainty of the system into account. Therefore, further research
into multiprocessor resource allocation is required.
Our closed loop cache aware adaptive scheduling framework takes the resource availability
and the resource usage patterns into account by measuring time-varying factors such as cache
miss counts, stalls and instruction counts. More specifically, the cache usage pattern of a
thread is identified using the QR recursive least square (RLS) algorithm and cache miss count
time series statistics. For the identified cache resource dynamics, the framework enforces
instruction fairness for the threads. Fairness in the context of our research project is
defined as resource allocation equity, which reduces co-runner thread dependence in a shared
resource environment. In this way, instruction count degradation due to shared cache resource
conflicts is overcome.
In this respect, our closed loop cache aware adaptive scheduling framework contributes to
the research field in two major and three minor aspects. The two major contributions lead
to the cache aware scheduling system. The first major contribution is the development of
the execution fairness algorithm, which reduces the co-runner cache impact on thread
performance. The second major contribution is the development of the relevant mathematical
models, such as the thread execution pattern and cache access pattern models, which formulate
the execution fairness algorithm in terms of mathematical quantities.
Following the development of the cache aware scheduling system, our adaptive self-tuning
control framework is constructed to add an adaptive closed loop aspect to the cache aware
scheduling system. This control framework consists of two main components: the parameter
estimator and the controller design module. The first minor contribution is the development
of the parameter estimator; the QR Recursive Least Square (RLS) algorithm is applied in our
closed loop cache aware adaptive scheduling framework to estimate the highly uncertain and
time-varying cache resource patterns of threads. The second minor contribution is the design
of the controller design module; the algebraic controller design algorithm, Pole Placement,
is used to design the relevant controller, which is able to provide the desired time-varying
control action. The adaptive self-tuning control framework and the cache aware scheduling
system together constitute our final framework, the closed loop cache aware adaptive
scheduling framework. The third minor contribution is the validation of the efficiency of
this cache aware adaptive closed loop scheduling framework in overcoming the co-runner cache
dependency. The time-series statistical counters are developed for the M-Sim Multi-Core
Simulator, and the theoretical findings and mathematical formulations are implemented as
MATLAB m-files. In this way, the overall framework is tested and the experiment outcomes are
analyzed. According to our experiment outcomes, it is concluded that our closed loop cache
aware adaptive scheduling framework successfully drives the co-runner cache dependent thread
instruction count to the co-runner independent instruction count with an error margin of up
to 25% when the cache is highly utilized. In addition, the thread cache access pattern is
estimated with 75% accuracy.


Keywords

Multiprocessor Scheduling, Adaptive Control Theory, Recursive Least Square, Cache-Aware
Adaptive Scheduling Framework



Acknowledgments

I gratefully acknowledge the contributions of my principal supervisor, Assoc. Prof. Glen Tian,
my associate supervisor, Dr. Ross Hayward, and Queensland University of Technology.




Table of Contents

Abstract
Keywords
Acknowledgments
Nomenclature
List of Figures
List of Tables

1 Introduction
    1.1 Background and Motivation
    1.2 Research Problem
    1.3 Scope and Limitation
    1.4 Contributions
    1.5 Thesis Outline

2 Literature Review
    2.1 Multi-Core Chip Level Multiprocessor System Architecture
        2.1.1 Core Structure
        2.1.2 Memory Architecture
        2.1.3 Core Diversity and Parallelism
    2.2 Cache Architecture and Policies
        2.2.1 Cache Architecture
        2.2.2 Cache Performance Indicator: Cache Miss
    2.3 Multiprocessor Scheduling
        2.3.1 Multiprocessor Scheduling Taxonomy
        2.3.2 Real-Time Multi-Core Multiprocessor Scheduling Algorithms
    2.4 Cache-Aware Multi-Core Chip Level Multiprocessor Scheduling
        2.4.1 Cache-Fair Multi-Core CMP Scheduling
        2.4.2 Adaptive Cache-Aware CMP Scheduling
    2.5 Modern Control Theory for Scheduling Problems
        2.5.1 Adaptive Control
        2.5.2 Parameter Estimation
        2.5.3 Control System Design Algorithms

3 System Model Adaptive Control
    3.1 Theoretical Background of Dynamic System Model
        3.1.1 State-Space Theory
        3.1.2 Adaptive Control Theory
    3.2 Development of Thread Execution Pattern Model
    3.3 Control Framework Selection
    3.4 Adaptive Control Framework Development
    3.5 Concluding Remarks

4 Parameter Estimation in Adaptive Control
    4.1 Theoretical Background of Parameter Estimation
    4.2 Least Square Parameter Estimation
    4.3 Adaptive Weighted QR Recursive Least Square Algorithms
        4.3.1 Theoretical Background
        4.3.2 Formulation and Theoretical Conclusions
        4.3.3 Complexity Analysis of QR-RLS Algorithm
    4.4 Concluding Remarks

5 Algebraic Controller Design Methods
    5.1 Introduction and Background
    5.2 Theoretical Formulation
        5.2.1 Deadbeat Controller Design
        5.2.2 Pole Placement Controller Design
    5.3 Concluding Remarks

6 Experimental Setup and Simulation
    6.1 Experiment Design
    6.2 Experiment Scenarios
        6.2.1 Development of Experiment Constraints
        6.2.2 Design Strategy: Two Stage Experiment
    6.3 Experiments
        6.3.1 Experiment Setup
        6.3.2 Experiment Outcomes
        6.3.3 Analysis and Evaluation of Experiments
    6.4 Concluding Remarks

7 Conclusions and Recommendations
    7.1 Summary
    7.2 Future Work and Recommendations
        7.2.1 Heterogeneous Multiprocessor Architecture Resource Allocation Problem
        7.2.2 Statistical Pre-processing of Real-Time Statistical Information
        7.2.3 Robust Adaptive Control Theory
        7.2.4 Theoretical Analysis of the Scheduling Framework

Bibliography


Nomenclature

Abbreviations

ALU       Arithmetic Logic Unit
AppC      Application Controller
APPC      Adaptive Pole Placement
ATD       Auxiliary Tag Directory
CMP       Chip Level Multiprocessor
CMR       Co-runner Miss Rate
CP        Cache Miss Penalty
CPI       Cycles Per Instruction
BQ        Bus Queue
BQD       Buffer Queue Delay
DARMA     Dynamic Auto Regressive Moving Average
DBQ       Data Bus Queue
DM        Deadline-Monotonic (Scheduling)
DMHA      Dynamic Miss Handling Architecture
EDF       Earliest-Deadline-First (Scheduling)
ER-PF     Early Release Proportional Fair (Scheduling)
ER-PD     Early Release Predictive Deadline (Scheduling)
FAMHA     Fair Adaptive Miss Handling Architecture
FCP       Finite Cache Penalty
FIFO      First In First Out
FP        Fixed Priority
FS        Fair Speedup
FSI       Fair Speedup Improvement
IC        Instruction Count
ILP       Instruction Level Parallelism
IPC       Inter Processor Communication
IPs       Interference Points
LFU       Least Frequently Used (Scheduling)
LLF       Least Laxity First (Scheduling)
LRU       Least Recently Used (Scheduling)
LS        Least Square
LTI       Linear Time Invariant
LTV       Linear Time Varying
MAD       Memory Access Delay
MHA       Miss Handling Architecture
MIMO      Multiple Input Multiple Output
MLP       Memory Level Parallelism
MSHR      Miss Status Holding Register
MR        Miss Rate
MRC       Miss Rate Curves
MRU       Most Recently Used
NUMA      Non-Uniform Memory Access
PAN       Pre-Actuation Negotiator
PD        Pseudo Deadline (Scheduling)
PF        Proportional Fair (Scheduling)
Pfair     Proportional Fair (Scheduling)
PI        Proportional-Integral (Control/Controller)
PID       Proportional-Integral-Derivative (Control/Controller)
RBQ       Request Bus Queue
RLS       Recursive Least Square
RM        Rate-Monotonic (Scheduling)
SD        Service Differentiation
SHARP     Shared Resource Partitioning
SIMD      Single Instruction Multiple Data
SISO      Single Input Single Output
SMP       Symmetric Multiprocessor
SMT       Simultaneous Multithreading
SPM       Static Parametric Model
UMA       Uniform Memory Access
VLIW      Very Long Instruction Word
VT        Vertical Threading

Symbols                                                                   Chapter

A, B, C             System State Space Model Coefficient Matrices         Ch3
A                   System Plant Denominator Polynomial                   Ch5
a_i                 ith coefficient term of A                             Ch5
A_c                 Desired Closed Loop Characteristic Polynomial         Ch5
A_y                 Maximum Overshoot (Closed Loop Characteristic)        Ch5
B                   System Plant Numerator Polynomial                     Ch5
b_i                 ith coefficient term of B                             Ch5
C                   Controller                                            Ch3
CMC                 Co-Runner Miss Count Vector                           Ch3,5
cmc                 Co-Runner Miss Count                                  Ch3,5
D_GDes              Desired Closed Loop Denominator Polynomial            Ch5
D_IC                IC_CacheDedicated Denominator Polynomial              Ch5
E                   Error Tracking Function                               Ch5
f_e(x, θ)           Prediction Error Distribution                         Ch4
IC_CacheDedicated   Cache Dedicated Closed Loop System                    Ch5
IC_Error            Instruction Count Error                               Ch3
G                   System Plant Transfer Function                        Ch3
G_Desired           Desired Closed Loop Transfer Function                 Ch5
G.M                 Gain Margin                                           Ch5
GP                  Guaranteed Percentage                                 Ch2
l_inf               Infinity Norm                                         Ch3
l_2                 Euclidean Norm                                        Ch3
L(q)                Linear Filter                                         Ch4
M*                  System Model Set                                      Ch4
M(θ)                System Model Set Element                              Ch4
M                   Fairness Metric                                       Ch2
m                   Normalization Signal                                  Ch3
MC                  Miss Count Vector                                     Ch3
mc                  Miss Count                                            Ch3
m̂c(k)               Miss Count Estimate                                   Ch4
m̂c(k)               Miss Count Estimate Vector                            Ch4
mc_q(k)             Triangularized Miss Count Vector                      Ch4
MPR                 Miss Prediction Rate                                  Ch2
M_perf              Performance Fairness Metric                           Ch2
M_Miss              Cache Fairness Metric                                 Ch2
N_GDes              Desired Closed Loop Numerator Polynomial              Ch5
N_IC                IC_CacheDedicated Numerator Polynomial                Ch5
Q(k)                Givens Rotation Matrix                                Ch4
Q_θi(k)             Givens Rotation Matrices with Rotation Angle θ_i      Ch4
Q                   Controller Transfer Function Numerator Polynomial     Ch5
q_i                 ith coefficient term of Q                             Ch5
P                   Controller Transfer Function Denominator Polynomial   Ch5
p_i                 ith coefficient term of P                             Ch5
P_i^out             Performance Metric                                    Ch2
P_i^ref             Performance Reference Metric                          Ch2
R                   Autocorrelation Matrix                                Ch4
s_i                 Continuous Time Roots of Polynomial                   Ch5
T_ded               Execution Time Dedicated Cache Environment            Ch2
T_ovl               Overlap Operation Cycles                              Ch2
T_o                 Sampling Period                                       Ch5
T_pri               Private Operation Cycles                              Ch2
T_shr               Execution Time Shared Cache Environment               Ch2
T_vul               Vulnerable Operation Cycles                           Ch2
tr                  Trace                                                 Ch4
u                   Control Input                                         Ch3,5
u_c                 Input Command (Closed Loop)                           Ch5
U(k)                Triangularized Input Data Matrix                      Ch4
V[k]                Random Noise Component                                Ch3
V_N(θ, Z^N)         Norm or Criterion Cost Function                       Ch4
w_i                 Requested Cache Ways                                  Ch2
W                   Available Cache Ways                                  Ch2
W                   Parameter Weight Vector                               Ch4
W[k]                Random Noise Component                                Ch3
w_n                 Normalized Frequency                                  Ch5
w(k)                Plant Parameter Vector (QR RLS Algorithm)             Ch4
w                   Plant Parameter Coefficients                          Ch4
x                   State Space Variable                                  Ch2,3
x                   State Space Vector                                    Ch2,3
x̂                   State Space Variable Estimate                         Ch2,3
y                   System Output                                         Ch3,4,5,6
ŷ(t|θ)              Prediction                                            Ch4
z_i                 Discrete Time Polynomial Roots                        Ch5

Greek Letters

δ_CPUCYCLE          Additional CPU Cycles                                 Ch3
                    Infinity Norm                                         Ch3
ϕ_i(∞)              Ideal Instruction Per Cycle (w_i = ∞)                 Ch2
ψ_i(t)              Predicted Number of Cache Ways                        Ch2
Θ                   Weighting Factor                                      Ch2
θ(t)                Plant Parameter Vector (Adaptive Control)             Ch3
θ(t)*               Unknown Plant Parameter Vector (Adaptive Control)     Ch3
θ_i                 Givens Rotation Angles                                Ch4
ϕ                   Regression Input (Regressor)                          Ch3,4
ϕ(k)                Input Regression Vector                               Ch4
ψ                   Input Data Matrix                                     Ch4
φ                   Regression Vector                                     Ch3
Γ                   Adaptive Gain                                         Ch3
ϵ(t, θ*)            Model Prediction Error                                Ch4
ε                   Error Vector (Least Square)                           Ch4
ε                   Posterior Error Vector                                Ch4
ξ^d(k)              RLS Cost Function                                     Ch4
ζ                   Damping Ratio                                         Ch5