Adaptive Cache Aware Multiprocessor Scheduling
Framework
(Research Masters)
A THESIS SUBMITTED TO
THE FACULTY OF SCIENCE AND TECHNOLOGY
OF
QUEENSLAND UNIVERSITY OF TECHNOLOGY
IN FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF
RESEARCH MASTER
Huseyin Gokseli Arslan
Faculty of Science and Technology
Queensland University of Technology
September 2011
Copyright in Relation to This Thesis
© Copyright 2011 by Huseyin Gokseli Arslan. All rights reserved.
Statement of Original Authorship
The work contained in this thesis has not been previously submitted to meet requirements for an
award at this or any other higher education institution. To the best of my knowledge and belief,
the thesis contains no material previously published or written by another person except where
due reference is made.
Signature:
Date:
This thesis is dedicated to my dearest family and my beloved one
for their love and endless support.
Abstract
Computer resource allocation represents a significant challenge, particularly for multiprocessor
systems, in which shared computing resources must be allocated among co-runner processes and
threads. While efficient resource allocation yields a highly efficient and stable multiprocessor
system and good individual thread performance, poor resource allocation causes significant
performance bottlenecks even in systems with ample computing resources. This thesis proposes a
cache aware adaptive closed loop scheduling framework as an efficient resource allocation
strategy for this highly dynamic resource management problem, which requires instant estimation
of highly uncertain and unpredictable resource patterns.
Many different approaches to this highly dynamic resource allocation problem have been
developed, but neither its dynamic nature nor its time-varying and uncertain characteristics
are well considered. These approaches employ either static and dynamic optimization methods or
advanced scheduling algorithms such as the Proportional Fair (PFair) scheduling algorithm. Some
of these approaches, which do consider the dynamic nature of multiprocessor systems, apply only
a basic closed loop system; hence, they fail to take the time-varying behavior and uncertainty
of the system into account. Therefore, further research into multiprocessor resource allocation
is required.
Our closed loop cache aware adaptive scheduling framework takes the resource availability
and the resource usage patterns into account by measuring time-varying factors such as cache
miss counts, stalls and instruction counts. More specifically, the cache usage pattern of each
thread is identified using the QR Recursive Least Square (RLS) algorithm and cache miss count
time series statistics. For the identified cache resource dynamics, our closed loop cache aware
adaptive scheduling framework enforces instruction fairness for the threads. Fairness in the
context of our research project is defined as resource allocation equity, which reduces
co-runner thread dependence in a shared resource environment. In this way, instruction count
degradation due to shared cache resource conflicts is overcome.
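To make the identification step concrete, the following is a minimal sketch of recursive least squares applied to a miss count series. It uses the conventional exponentially weighted RLS recursion rather than the QR-based variant developed in this thesis, and the first-order model mc(k) = a mc(k-1) + b cmc(k-1), the synthetic data, the forgetting factor and all function names are illustrative assumptions only.

```python
# Conventional exponentially weighted RLS (NOT the QR-RLS variant of the thesis).
# The first-order model, the synthetic data and all names are illustrative assumptions.

def rls_update(w, P, phi, y, lam=0.98):
    """One RLS step for the linear model y ~ w . phi with forgetting factor lam."""
    n = len(w)
    # P * phi (P is kept symmetric, so this also equals (phi^T P)^T)
    p_phi = [sum(P[i][j] * phi[j] for j in range(n)) for i in range(n)]
    denom = lam + sum(phi[i] * p_phi[i] for i in range(n))
    gain = [v / denom for v in p_phi]
    # A priori prediction error, computed with the old parameter estimate
    err = y - sum(w[i] * phi[i] for i in range(n))
    w_new = [w[i] + gain[i] * err for i in range(n)]
    # Covariance update: P <- (P - gain * phi^T P) / lam
    P_new = [[(P[i][j] - gain[i] * p_phi[j]) / lam for j in range(n)]
             for i in range(n)]
    return w_new, P_new


def synthetic_counts(n=200, a=0.8, b=0.5):
    """Noise-free miss counts mc driven by a hypothetical co-runner series cmc."""
    cmc = [float((37 * k) % 101) for k in range(n)]  # deterministic, varying input
    mc = [100.0]
    for k in range(1, n):
        mc.append(a * mc[k - 1] + b * cmc[k - 1])
    return mc, cmc


def identify(mc, cmc, lam=0.98):
    """Estimate (a, b) in mc(k) = a*mc(k-1) + b*cmc(k-1) sample by sample."""
    w = [0.0, 0.0]
    P = [[1000.0, 0.0], [0.0, 1000.0]]  # large initial covariance: weak prior
    for k in range(1, len(mc)):
        phi = [mc[k - 1], cmc[k - 1]]
        w, P = rls_update(w, P, phi, mc[k], lam)
    return w


if __name__ == "__main__":
    mc, cmc = synthetic_counts()
    print(identify(mc, cmc))
```

On this noise-free synthetic series the estimates should settle close to the generating values a = 0.8 and b = 0.5; with measured miss counts the accuracy depends on noise and on how quickly the pattern changes.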
In this respect, our closed loop cache aware adaptive scheduling framework contributes to
the research field in two major and three minor aspects. The two major contributions lead
to the cache aware scheduling system. The first major contribution is the development of
the execution fairness algorithm, which reduces the co-runner cache impact on thread
performance. The second major contribution is the development of the relevant mathematical
models, such as the thread execution pattern and cache access pattern models, which formulate
the execution fairness algorithm in terms of mathematical quantities.
Following the development of the cache aware scheduling system, our adaptive self-tuning
control framework is constructed to add an adaptive closed loop aspect to the cache aware
scheduling system. This control framework consists of two main components: the parameter
estimator and the controller design module. The first minor contribution is the development
of the parameter estimator; the QR Recursive Least Square (RLS) algorithm is applied in our
closed loop cache aware adaptive scheduling framework to estimate the highly uncertain and
time-varying cache resource patterns of threads. The second minor contribution is the design
of the controller design module; an algebraic controller design algorithm, Pole Placement, is
used to design the relevant controller, which is able to provide the desired time-varying
control action. The adaptive self-tuning control framework and the cache aware scheduling
system together constitute our final framework, the closed loop cache aware adaptive scheduling
framework. The third minor contribution is the validation of the efficiency of this cache aware
adaptive closed loop scheduling framework in overcoming the co-runner cache dependency.
Time-series statistical counters are developed for the M-Sim Multi-Core Simulator, and the
theoretical findings and mathematical formulations are implemented as MATLAB m-files. In this
way, the overall framework is tested and the experiment outcomes are analyzed. According to our
experiment outcomes, it is concluded that our closed loop cache aware adaptive scheduling
framework successfully drives the co-runner cache dependent thread instruction count toward the
co-runner independent instruction count, with an error margin of up to 25% when the cache is
highly utilized. In addition, the thread cache access pattern is estimated with 75% accuracy.
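As a rough illustration of the controller design idea, the sketch below places the single closed loop pole of a first-order discrete-time plant y(k+1) = a y(k) + b u(k) and drives the output to a reference value. It is a simplified stand-in for the polynomial Pole Placement design developed in this thesis; the plant order, the numeric values, the control law u(k) = k_ff r - k_fb y(k) and the function names are all assumptions made for illustration.

```python
# Pole placement for a first-order discrete-time plant (illustrative assumption;
# the thesis designs a polynomial controller for the identified plant).

def pole_placement_gains(a, b, pole):
    """Gains for u(k) = k_ff * r - k_fb * y(k) on the plant y(k+1) = a*y(k) + b*u(k).

    The closed loop becomes y(k+1) = pole * y(k) + (1 - pole) * r,
    so the output converges to r whenever |pole| < 1.
    """
    if b == 0:
        raise ValueError("plant input gain b must be non-zero")
    k_fb = (a - pole) / b       # moves the pole from a to the desired location
    k_ff = (1.0 - pole) / b     # gives unit steady-state gain from r to y
    return k_fb, k_ff


def simulate(a, b, pole, reference, steps=50, y0=0.0):
    """Run the closed loop and return the output trajectory."""
    k_fb, k_ff = pole_placement_gains(a, b, pole)
    y = y0
    trace = [y]
    for _ in range(steps):
        u = k_ff * reference - k_fb * y
        y = a * y + b * u
        trace.append(y)
    return trace


if __name__ == "__main__":
    # Hypothetical plant parameters, e.g. as delivered by a parameter estimator
    trace = simulate(a=0.9, b=0.4, pole=0.5, reference=1000.0)
    print(trace[:5], trace[-1])
```

In a self-tuning arrangement the values of a and b would be replaced at every sampling instant by the current parameter estimates, so the gains change over time even though the desired pole stays fixed.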
Keywords
Multiprocessor Scheduling, Adaptive Control Theory, Recursive Least Square, Cache-Aware
Adaptive Scheduling Framework
Acknowledgments
I gratefully acknowledge the contributions of my principal supervisor, Assoc. Prof. Glen Tian,
my associate supervisor, Dr. Ross Hayward, and Queensland University of Technology.
Table of Contents
Abstract
Keywords
Acknowledgments
Nomenclature
List of Figures
List of Tables
1 Introduction
  1.1 Background and Motivation
  1.2 Research Problem
  1.3 Scope and Limitation
  1.4 Contributions
  1.5 Thesis Outline
2 Literature Review
  2.1 Multi-Core Chip Level Multiprocessor System Architecture
    2.1.1 Core Structure
    2.1.2 Memory Architecture
    2.1.3 Core Diversity and Parallelism
  2.2 Cache Architecture and Policies
    2.2.1 Cache Architecture
    2.2.2 Cache Performance Indicator: Cache Miss
  2.3 Multiprocessor Scheduling
    2.3.1 Multiprocessor Scheduling Taxonomy
    2.3.2 Real-Time Multi-Core Multiprocessor Scheduling Algorithms
  2.4 Cache-Aware Multi-Core Chip Level Multiprocessor Scheduling
    2.4.1 Cache-Fair Multi-Core CMP Scheduling
    2.4.2 Adaptive Cache-Aware CMP Scheduling
  2.5 Modern Control Theory for Scheduling Problems
    2.5.1 Adaptive Control
    2.5.2 Parameter Estimation
    2.5.3 Control System Design Algorithms
3 System Model Adaptive Control
  3.1 Theoretical Background of Dynamic System Model
    3.1.1 State-Space Theory
    3.1.2 Adaptive Control Theory
  3.2 Development of Thread Execution Pattern Model
  3.3 Control Framework Selection
  3.4 Adaptive Control Framework Development
  3.5 Concluding Remarks
4 Parameter Estimation in Adaptive Control
  4.1 Theoretical Background of Parameter Estimation
  4.2 Least Square Parameter Estimation
  4.3 Adaptive Weighted QR Recursive Least Square Algorithms
    4.3.1 Theoretical Background
    4.3.2 Formulation and Theoretical Conclusions
    4.3.3 Complexity Analysis of QR-RLS Algorithm
  4.4 Concluding Remarks
5 Algebraic Controller Design Methods
  5.1 Introduction and Background
  5.2 Theoretical Formulation
    5.2.1 Deadbeat Controller Design
    5.2.2 Pole Placement Controller Design
  5.3 Concluding Remarks
6 Experimental Setup and Simulation
  6.1 Experiment Design
  6.2 Experiment Scenarios
    6.2.1 Development of Experiment Constraints
    6.2.2 Design Strategy: Two Stage Experiment
  6.3 Experiments
    6.3.1 Experiment Setup
    6.3.2 Experiment Outcomes
    6.3.3 Analysis and Evaluation of Experiments
  6.4 Concluding Remarks
7 Conclusions and Recommendations
  7.1 Summary
  7.2 Future Work and Recommendations
    7.2.1 Heterogeneous Multiprocessor Architecture Resource Allocation Problem
    7.2.2 Statistical Pre-processing of Real-Time Statistical Information
    7.2.3 Robust Adaptive Control Theory
    7.2.4 Theoretical Analysis of the Scheduling Framework
Bibliography
Nomenclature
Abbreviations
ALU       Arithmetic Logic Unit
AppC      Application Controller
APPC      Adaptive Pole Placement
ATD       Auxiliary Tag Directory
CMP       Chip Level Multiprocessor
CMR       Co-runner Miss Rate
CP        Cache Miss Penalty
CPI       Cycles Per Instruction
BQ        Bus Queue
BQD       Buffer Queue Delay
DARMA     Dynamic Auto Regressive Moving Average
DBQ       Data Bus Queue
DM        Deadline-Monotonic (Scheduling)
DMHA      Dynamic Miss Handling Architecture
EDF       Earliest-Deadline-First (Scheduling)
ER-PF     Early Release Proportional Fair (Scheduling)
ER-PD     Early Release Predictive Deadline (Scheduling)
FAMHA     Fair Adaptive Miss Handling Architecture
FCP       Finite Cache Penalty
FIFO      First In First Out
FP        Fixed Priority
FS        Fair Speedup
FSI       Fair Speedup Improvement
IC        Instruction Count
ILP       Instruction Level Parallelism
IPC       Inter Processor Communication
IPs       Interference Points
LFU       Least Frequently Used (Scheduling)
LLF       Least Laxity First (Scheduling)
LRU       Least Recently Used (Scheduling)
LS        Least Square
LTI       Linear Time Invariant
LTV       Linear Time Varying
MAD       Memory Access Delay
MHA       Miss Handling Architecture
MIMO      Multiple Input Multiple Output
MLP       Memory Level Parallelism
MSHR      Miss Status Holding Register
MR        Miss Rate
MRC       Miss Rate Curves
MRU       Most Recently Used
NUMA      Non-Uniform Memory Access
PAN       Pre-Actuation Negotiator
PD        Pseudo Deadline (Scheduling)
PF        Proportional Fair (Scheduling)
Pfair     Proportional Fair (Scheduling)
PI        Proportional-Integral (Control/Controller)
PID       Proportional-Integral-Derivative (Control/Controller)
RBQ       Request Bus Queue
RLS       Recursive Least Square
RM        Rate-Monotonic (Scheduling)
SD        Service Differentiation
SHARP     Shared Resource Partitioning
SIMD      Single Instruction Multiple Data
SISO      Single Input Single Output
SMP       Symmetric Multiprocessor
SMT       Simultaneous Multithreading
SPM       Static Parametric Model
UMA       Uniform Memory Access
VLIW      Very Long Instruction Word
VT        Vertical Threading
Symbols
Symbol              Description                                           Chapter
A, B, C             System State Space Model Coefficient Matrices         Ch3
A                   System Plant Denominator Polynomial                   Ch5
ai                  ith coefficient term of A                             Ch5
Ac                  Desired Closed Loop Characteristic Polynomial         Ch5
Ay                  Maximum Overshoot (Closed Loop Characteristic)        Ch5
B                   System Plant Numerator Polynomial                     Ch5
bi                  ith coefficient term of B                             Ch5
C                   Controller                                            Ch3
CMC                 Co-Runner Miss Count Vector                           Ch3,5
cmc                 Co-Runner Miss Count                                  Ch3,5
DGDes               Desired Closed Loop Denominator Polynomial            Ch5
DIC                 ICCacheDedicated Denominator Polynomial               Ch5
E                   Error Tracking Function                               Ch5
fe(x, θ)            Prediction Error Distribution                         Ch4
ICCacheDedicated    Cache Dedicated Closed Loop System                    Ch5
ICError             Instruction Count Error                               Ch3
G                   System Plant Transfer Function                        Ch3
GDesired            Desired Closed Loop Transfer Function                 Ch5
G.M                 Gain Margin                                           Ch5
GP                  Guaranteed Percentage                                 Ch2
linf                Infinity Norm                                         Ch3
l2                  Euclidean Norm                                        Ch3
L(q)                Linear Filter                                         Ch4
M*                  System Model Set                                      Ch4
M(θ)                System Model Set Element                              Ch4
M                   Fairness Metric                                       Ch2
m                   Normalization Signal                                  Ch3
MC                  Miss Count Vector                                     Ch3
mc                  Miss Count                                            Ch3
m̂c(k)               Miss Count Estimate                                   Ch4
m̂c(k)               Miss Count Estimate Vector                            Ch4
mcq(k)              Triangularized Miss Count Vector                      Ch4
MPR                 Miss Prediction Rate                                  Ch2
Mperf               Performance Fairness Metric                           Ch2
MMiss               Cache Fairness Metric                                 Ch2
NGDes               Desired Closed Loop Numerator Polynomial              Ch5
NIC                 ICCacheDedicated Numerator Polynomial                 Ch5
Q(k)                Givens Rotation Matrix                                Ch4
Qθi(k)              Givens Rotation Matrices with Rotation Angle θi       Ch4
Q                   Controller Transfer Function Numerator Polynomial     Ch5
qi                  ith coefficient term of Q                             Ch5
P                   Controller Transfer Function Denominator Polynomial   Ch5
pi                  ith coefficient term of P                             Ch5
Piout               Performance Metric                                    Ch2
Piref               Performance Reference Metric                          Ch2
R                   Autocorrelation Matrix                                Ch4
si                  Continuous Time Roots of Polynomial                   Ch5
Tded                Execution Time Dedicated Cache Environment            Ch2
Tovl                Overlap Operation Cycles                              Ch2
To                  Sampling Period                                       Ch5
Tpri                Private Operation Cycles                              Ch2
Tshr                Execution Time Shared Cache Environment               Ch2
Tvul                Vulnerable Operation Cycles                           Ch2
tr                  trace                                                 Ch4
u                   Control Input                                         Ch3, Ch5
uc                  Input Command (Closed Loop)                           Ch5
U(k)                Triangularized Input Data Matrix                      Ch4
V[k]                Random Noise Component                                Ch3
VN(θ, Z^N)          Norm or Criterion Cost Function                       Ch4
wi                  Requested Cache Ways                                  Ch2
W                   Available Cache Ways                                  Ch2
W                   Parameter Weight Vector                               Ch4
W[k]                Random Noise Component                                Ch3
wn                  Normalized Frequency                                  Ch5
w(k)                Plant Parameter Vector (QR RLS Algorithm)             Ch4
w                   Plant Parameter Coefficients                          Ch4
x                   State Space Variable                                  Ch2,3
x                   State Space Vector                                    Ch2,3
x̂                   State Space Variable Estimate                         Ch2,3
y                   System Output                                         Ch3,4,5,6
ŷ(t|θ)              Prediction                                            Ch4
zi                  Discrete Time Polynomial Roots                        Ch5
Greek Letters
δCPUCYCLE           Additional CPU Cycles                                 Ch3
∞                   Infinity Norm                                         Ch3
ϕi(∞)               Ideal Instruction Per Cycle (wi = ∞)                  Ch2
ψi(t)               Predicted Number of Cache Ways                        Ch2
Θ                   Weighting Factor                                      Ch2
θ(t)                Plant Parameter Vector (Adaptive Control)             Ch3
θ(t)*               Unknown Plant Parameter Vector (Adaptive Control)     Ch3
θi                  Givens Rotation Angles                                Ch4
ϕ                   Regression Input (Regressor)                          Ch3,4
ϕ(k)                Input Regression Vector                               Ch4
ψ                   Input Data Matrix                                     Ch4
φ                   Regression Vector                                     Ch3
Γ                   Adaptive Gain                                         Ch3
ϵ(t, θ*)            Model Prediction Error                                Ch4
ε                   Error Vector (Least Square)                           Ch4
ε                   Posterior Error Vector                                Ch4
ξd(k)               RLS Cost Function                                     Ch4
ζ                   Damping Ratio                                         Ch5