Computer Architecture Exercises

Chapter 1.
Ex. 1. Consider two different implementations, M1 and M2, of the same instruction set. There are
three classes of instructions (A, B, and C) in the instruction set. M1 has a clock rate of 80
MHz and M2 has a clock rate of 100 MHz. The average number of cycles for each instruction
class and their frequencies (for a typical program) are as follows:

a) Calculate the average CPI for each machine, M1 and M2.
b) Calculate the average MIPS rating for each machine, M1 and M2.
c) Which machine has the smaller MIPS rating? Which individual instruction class CPI would you
need to change, and by how much, for this machine to match or exceed the performance of the
machine with the higher MIPS rating? (You may change the CPI of only one instruction class
on the slower machine.)
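As a hedged illustration of the quantities asked for above, the sketch below computes a weighted-average CPI and a MIPS rating. The instruction mix in `example_mix` is hypothetical and is not the table from this exercise; the function names are likewise illustrative.

```python
def average_cpi(mix):
    """Weighted-average CPI for a list of (frequency, cycles) pairs."""
    total = sum(freq for freq, _ in mix)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("instruction frequencies must sum to 1")
    return sum(freq * cycles for freq, cycles in mix)


def mips_rating(clock_hz, cpi):
    """MIPS = clock rate / (CPI * 10^6)."""
    return clock_hz / (cpi * 1e6)


# Hypothetical mix (NOT the exercise's table): 50% at 1 cycle,
# 30% at 2 cycles, 20% at 4 cycles, on an 80 MHz clock.
example_mix = [(0.5, 1), (0.3, 2), (0.2, 4)]
cpi = average_cpi(example_mix)
rating = mips_rating(80e6, cpi)
print(f"CPI = {cpi:.2f}, MIPS = {rating:.2f}")
```

The same two helpers can be applied to each machine once the class frequencies and cycle counts are known.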
Ex. 2. (Amdahl’s law question) Suppose you have a machine which executes a program consisting
of 50% floating point multiply, 20% floating point divide, and the remaining 30% are from
other instructions.
a) Management wants the machine to run 4 times faster. You can make the divide run at most
3 times faster and the multiply run at most 8 times faster. Can you meet management’s goal
by making only one improvement, and which one?
b) Dogbert has now taken over the company removing all the previous managers. If you make
both the multiply and divide improvements, what is the speed of the improved machine
relative to the original machine?
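A minimal sketch of Amdahl's law for one or several independent enhancements follows. It assumes the percentages describe fractions of the original execution time; the numbers in the usage lines are made up for illustration and are not the values from this exercise.

```python
def amdahl_speedup(enhancements):
    """Overall speedup for a list of (time_fraction, speedup_factor) pairs.

    Each fraction is the share of ORIGINAL execution time that the
    enhancement applies to; the fractions must not overlap.
    """
    enhanced = sum(fraction for fraction, _ in enhancements)
    if not 0.0 <= enhanced <= 1.0 + 1e-12:
        raise ValueError("time fractions must sum to at most 1")
    if any(factor <= 0 for _, factor in enhancements):
        raise ValueError("speedup factors must be positive")
    new_time = (1.0 - enhanced) + sum(f / s for f, s in enhancements)
    return 1.0 / new_time


# Illustrative values only: 40% of the time sped up 2x,
# then additionally 10% of the time sped up 5x.
single = amdahl_speedup([(0.4, 2.0)])
both = amdahl_speedup([(0.4, 2.0), (0.1, 5.0)])
print(single, both)
```

Letting a speedup factor grow very large shows the upper bound that the untouched fraction imposes on the overall speedup.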
Ex. 3. Suppose that we can improve the floating point instruction performance of a machine by a
factor of 15 (the same floating point instructions run 15 times faster on this new machine).
What percentage of the instructions must be floating point to achieve a speedup of at least 4?
Ex. 4. Just like we defined MIPS rating, we can also define something called the MFLOPS rating
which stands for Millions of Floating Point operations per Second. If Machine A has a higher
MIPS rating than that of Machine B, then does Machine A necessarily have a higher MFLOPS
rating in comparison to Machine B? Note: the MIPS rating is defined by:
MIPS = (Clock Rate) / (CPI × 10^6)


Ex. 5. Assume that a design team is considering enhancing a machine by adding MMX (multimedia
extension instruction) hardware to a processor. When a computation is run in MMX mode
on the MMX hardware, it is 10 times faster than the normal mode of execution. Call the
percentage of time that could be spent using the MMX mode the percentage of media
enhancement.
a) What percentage of media enhancement is needed to achieve an overall speedup of 2?
b) What percentage of the run-time is spent in MMX mode if a speedup of 2 is achieved? (Hint: You
will need to calculate the new overall time.)
c) What percentage of the media enhancement is needed to achieve one-half the maximum
speedup attainable from using the MMX mode?
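Questions such as Ex. 3 and Ex. 5 run Amdahl's law backwards: given a target speedup, find the fraction of time that must be enhanced. The sketch below solves S = 1 / ((1 - f) + f/k) for f. The function name and the sample numbers are hypothetical and do not come from the exercises.

```python
def required_fraction(target_speedup, enhancement_factor):
    """Fraction f of original time that must use the enhancement.

    Solves S = 1 / ((1 - f) + f / k) for f, which gives
    f = (1 - 1/S) / (1 - 1/k).
    """
    if enhancement_factor <= 1:
        raise ValueError("enhancement factor must exceed 1")
    if not 1 <= target_speedup <= enhancement_factor:
        raise ValueError("target speedup must lie between 1 and the factor")
    return (1 - 1 / target_speedup) / (1 - 1 / enhancement_factor)


# Illustrative values only: a 4x enhancement and a target speedup of 1.5.
f = required_fraction(1.5, 4.0)
check = 1 / ((1 - f) + f / 4.0)  # plug f back into Amdahl's law
print(f, check)
```

Note that f is measured against the original run time; the share of the new, shorter run time spent in enhanced mode is a different number.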
Ex. 6. If processor A has a higher clock rate than processor B, and processor A also has a higher
MIPS rating than processor B, explain whether processor A will always execute faster than
processor B. Suppose that there are two implementations of the same instruction set
architecture. Machine A has a clock cycle time of 20ns and an effective CPI of 1.5 for some
program, and machine B has a clock cycle time of 15ns and an effective CPI of 1.0 for the
same program. Which machine is faster for this program, and by how much? Note: the MIPS
rating is defined by:
MIPS = (Clock Rate) / (CPI × 10^6)
Ex. 7. Suppose a program segment consists of a purely sequential part which takes 25 cycles to
execute, and an iterated loop which takes 100 cycles per iteration. Assume the loop
iterations are independent, and cannot be further parallelized. If the loop is to be executed
100 times, what is the maximum speedup possible using an infinite number of processors
(compared to a single processor)?
Ex. 8. Computer A has an overall CPI of 1.3 and can be run at a clock rate of 600 MHz. Computer B
has a CPI of 2.5 and can be run at a clock rate of 750 MHz. We have a particular program we
wish to run. When compiled for computer A, this program has exactly 100,000 instructions.
How many instructions would the program need to have when compiled for Computer B, in
order for the two computers to have exactly the same execution time for this program?
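The relationship behind this kind of question is execution time = instruction count × CPI / clock rate. A small sketch, using invented numbers rather than those of Computer A or B:

```python
def execution_time(instruction_count, cpi, clock_hz):
    """CPU time in seconds: instructions * cycles per instruction / (cycles per second)."""
    if clock_hz <= 0:
        raise ValueError("clock rate must be positive")
    return instruction_count * cpi / clock_hz


# Illustrative values only: 200,000 instructions, CPI 2.0, 500 MHz clock.
t = execution_time(200_000, 2.0, 500e6)
print(f"{t * 1e6:.1f} microseconds")
```

Setting the execution times of two machines equal and solving for the unknown instruction count follows directly from this formula.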
Ex. 9. The design team for a simple, single-issue processor is choosing between a pipelined and a
non-pipelined implementation. Here are some design parameters for the two possibilities:

Parameter                      Pipelined Version    Non-pipelined Version
Clock Rate                     500 MHz              350 MHz
CPI for ALU instructions       1                    1
CPI for Control instructions   2                    1
CPI for Memory instructions    2.7                  1
a) For a program with 20% ALU instructions, 10% control instructions and 75% memory
instructions, which design will be faster? Give a quantitative CPI average for each case.
b) For a program with 80% ALU instructions, 10% control instructions and 10% memory
instructions, which design will be faster? Give a quantitative CPI average for each case.
Ex. 10. A designer wants to improve the overall performance of a given machine with respect to a
target benchmark suite and is considering an enhancement X that applies to 50% of the
original dynamically-executed instructions, and speeds each of them up by a factor of 3. The
designer’s manager has some concerns about the complexity and the cost-effectiveness of X
and suggests that the designer should consider an alternative enhancement Y. Enhancement
Y, if applied only to some (as yet unknown) fraction of the original dynamically-executed
instructions, would make them only 75% faster. Determine what percentage of all
dynamically-executed instructions should be optimized using enhancement Y in order to
achieve the same overall speedup as obtained using enhancement X.
Ex. 11. Prior to the early 1980s, machines were built with more and more complex instruction sets.
MIPS is a RISC machine. Why has there been a move toward RISC machines and away from
complex instruction set machines?
Chapter 2.
Ex. 12. Translate the following code into MIPS assembly:
x = x + y + z - q;
Assume that x, y, z, q are stored in registers $s1-$s4, respectively.
Ex. 13. In MIPS assembly, write an assembly language version of the following C code segment:
int A[100], B[100];
for (i=1; i < 100; i++) {
A[i] = A[i-1] + B[i];
}
At the beginning of this code segment, the only values in registers are the base address of arrays A
and B in registers $a0 and $a1. Avoid the use of multiplication instructions–they are unnecessary.
Ex. 14. Consider the following assembly code for parts a) and b).
r1 = 99
Loop:
r1 = r1 - 1
branch r1 > 0, Loop
halt
a) During the execution of the above code, how many dynamic instructions are executed?
b) Assuming a standard single-cycle machine running at 100 kHz, how long will the above code take to
complete?
Ex. 15. Convert the C function below to MIPS assembly language. Make sure that your assembly
language code could be called from a standard C program (that is to say, make sure you
follow the MIPS calling conventions).
unsigned int sum(unsigned int n)
{
if (n == 0) return 0;
else return n + sum(n-1);
}

This machine has no delay slots. The stack grows downward (toward lower memory addresses).
The following registers are used in the calling convention:

Ex. 16. In the snippet of MIPS assembler code below, how many times is instruction memory
accessed? How many times is data memory accessed? (Count only accesses to memory, not
registers.)
lw $v1, 0($a0)
addi $v0, $v0, 1
sw $v1, 0($a1)
addi $a0, $a0, 1
Ex. 17. Use the register and memory values in the table below for the next questions. Assume a 32-
bit machine. Assume each of the following questions starts from the table values; that is, DO
NOT use value changes from one question as propagating into future parts of the question.

a) Give the values of R1, R2, and R3 after this instruction: add R3, R2, R1
b) What values will be in R1 and R3 after this instruction is executed: load R3, 12(R1)
c) What values will be in the registers after this instruction is executed: addi R2, R3, #16
Ex. 18. Loop Unrolling and Fibonacci: Consider the following pseudo-C code to compute the fifth
Fibonacci number (F(5)).
int a,b,i,t;
a=b=1; /* Set a and b to F(2) and F(1) respectively */
for(i=0;i<3;i++)
{
t=a; /* save F(n-1) to a temporary location */
a+=b; /* F(n) = F(n-1) + F(n-2) */
b=t; /* set b to F(n-1) */
}
One observation that a compiler might make is that the loop construction is somewhat unnecessary.
Since the range of the loop index is fixed, one can unroll the loop by simply writing three
iterations of the loop one after the other without the intervening increment/comparison on i. For
example, the above could be written as:
int a,b,t;
a=b=1;
t=a;
a+=b;
b=t;
t=a;
a+=b;
b=t;
t=a;
a+=b;
b=t;
a) Convert the pseudo-C code for both of the snippets above into reasonably efficient MIPS code.
Represent each variable of the pseudo-C program with a register. Try to follow the pseudo-C
code as closely as possible (i.e. the first snippet should have a loop in it, while the second should
not).
b) Now suppose that instead of the fifth Fibonacci number we decided to compute the 20th. How
many static instructions would there be in the first version and how many would there be in the
unrolled version? What about dynamic instructions? You do not need to write out the assembly
for this part.
Ex. 19. In MIPS assembly, write an assembly language version of the following C code segment:
for (i = 0; i < 98; i++) {
C[i] = A[i + 1] - A[i] * B[i + 2];
}

Arrays A, B and C start at memory location A000hex, B000hex and C000hex respectively. Try to
reduce the total number of instructions and the number of expensive instructions such as multiplies.
Ex. 20. Suppose that a new MIPS instruction, called bcp, was designed to copy a block of words
from one address to another. Assume that this instruction requires that the starting address
of the source block be in register $t1 and that the destination address be in $t2. The
instruction also requires that the number of words to copy be in $t3 (which is > 0).
Furthermore, assume that the values of these registers as well as register $t4 can be
destroyed in executing this instruction (so that the registers can be used as temporaries to
execute the instruction).

Do the following: Write the MIPS assembly code to implement a block copy without this instruction.
Write the MIPS assembly code to implement a block copy with this instruction. Estimate the total
cycles necessary for each realization to copy 100 words on the multicycle machine.
Ex. 21. This problem covers 4-bit binary multiplication. Fill in the table for the Product, Multiplier
and Multiplicand for each step. You need to provide the DESCRIPTION of the step being
performed (shift left, shift right, add, no add). The value of M (Multiplicand) is 1011; Q
(Multiplier) is initially 1010.
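The step sequence can be traced with a short simulation. The sketch below assumes the classic sequential multiplier in which the multiplicand shifts left and the multiplier shifts right on every iteration; the table for this exercise is not reproduced here, so the exact hardware variant is an assumption. The operands in the usage line are illustrative, not those of the exercise.

```python
def shift_add_multiply(multiplicand, multiplier, bits=4):
    """Unsigned shift-and-add multiplication, returning (product, steps).

    Each step is a tuple (description, product, multiplicand, multiplier)
    recorded AFTER the described action has been applied.
    """
    mask = (1 << bits) - 1
    if not (0 <= multiplicand <= mask and 0 <= multiplier <= mask):
        raise ValueError("operands must fit in the given number of bits")
    m = multiplicand  # conceptually held in a 2*bits-wide register
    q = multiplier
    product = 0
    steps = []
    for _ in range(bits):
        if q & 1:  # test the least significant multiplier bit
            product += m
            steps.append(("add", product, m, q))
        else:
            steps.append(("no add", product, m, q))
        m <<= 1
        steps.append(("shift left multiplicand", product, m, q))
        q >>= 1
        steps.append(("shift right multiplier", product, m, q))
    return product, steps


# Illustrative operands only: 0011 x 0101.
product, steps = shift_add_multiply(0b0011, 0b0101)
for description, p, m, q in steps:
    print(f"{description:26s} P={p:08b} M={m:08b} Q={q:04b}")
```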

Ex. 22. This problem covers floating-point IEEE format.
a) List four floating-point operations that cause NaN to be created.
b) Assuming single precision IEEE 754 format, what decimal number is represented by this word:
1 01111101 00100000000000000000000
(Hint: remember to use the biased form of the exponent.)
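One way to check a hand decoding is to split the word into its three fields and compare against the value the standard library reports. The sketch below uses Python's `struct` module; the helper name `decode_single` is hypothetical, and the bit pattern in the usage line is a different word from the one in this exercise.

```python
import struct


def decode_single(bit_string):
    """Split a 32-bit IEEE 754 word into fields and return its value.

    Returns (sign, unbiased_exponent, fraction_field, value). The unbiased
    exponent is only meaningful for normalized numbers (exponent field
    between 1 and 254).
    """
    bits = bit_string.replace(" ", "")
    if len(bits) != 32 or set(bits) - {"0", "1"}:
        raise ValueError("expected exactly 32 binary digits")
    sign = int(bits[0])
    exponent_field = int(bits[1:9], 2)
    fraction_field = int(bits[9:], 2)
    # Reinterpret the same 32 bits as a big-endian float.
    value = struct.unpack(">f", int(bits, 2).to_bytes(4, "big"))[0]
    return sign, exponent_field - 127, fraction_field, value


# Illustrative word only: sign 0, exponent field 128, fraction 0.5.
decoded = decode_single("0 10000000 10000000000000000000000")
print(decoded)
```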
Ex. 23. The floating-point format to be used in this problem is an 8-bit IEEE 754 normalized format
with 1 sign bit, 4 exponent bits, and 3 mantissa bits. It is identical to the 32-bit and 64-bit
formats in terms of the meaning of fields and special encodings. The exponent field employs
a bias-7 coding. The bit fields in a number are (sign, exponent, mantissa). Assume that we
use unbiased rounding to the nearest even, as specified in the IEEE floating point standard.
a) Encode the following numbers in the 8-bit IEEE format:
i) 0.0011011 (binary)
ii) 6.0 (decimal)
b) Perform the computation 1.011 (binary) + 0.0011011 (binary)
c) Decode the following 8-bit IEEE number into its decimal value: 1 1010 101
d) Decide which number in each of the following pairs is greater in value (the numbers are in 8-bit IEEE
754 format):
i) 0 0100 100 and 0 0100 111
ii) 0 1100 100 and 1 1100 101
e) In the 32-bit IEEE format, what is the encoding for negative zero?
f) In the 32-bit IEEE format, what is the encoding for positive infinity?
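For small formats like this one, a generic decoder is a convenient way to check answers. The sketch below is a hypothetical helper, not part of the exercise; it assumes the usual IEEE 754 conventions for the all-zeros exponent (zero and subnormals) and the all-ones exponent (infinity and NaN), consistent with the statement that special encodings match the 32-bit format. The bit patterns in the usage lines are illustrative.

```python
def decode_minifloat(bit_string, exp_bits=4, man_bits=3, bias=7):
    """Decode a (sign, exponent, mantissa) bit string into a Python float."""
    bits = bit_string.replace(" ", "")
    if len(bits) != 1 + exp_bits + man_bits or set(bits) - {"0", "1"}:
        raise ValueError("bit string has the wrong length or characters")
    sign = -1.0 if bits[0] == "1" else 1.0
    exponent = int(bits[1:1 + exp_bits], 2)
    mantissa = int(bits[1 + exp_bits:], 2)
    scale = 1 << man_bits
    if exponent == (1 << exp_bits) - 1:  # all ones: infinity or NaN
        return float("nan") if mantissa else sign * float("inf")
    if exponent == 0:  # all zeros: zero or subnormal, no hidden 1
        return sign * (mantissa / scale) * 2.0 ** (1 - bias)
    return sign * (1 + mantissa / scale) * 2.0 ** (exponent - bias)


# Illustrative encodings only.
three = decode_minifloat("0 1000 100")       # 1.100 x 2^1
minus_half = decode_minifloat("1 0110 000")  # -1.000 x 2^-1
infinity = decode_minifloat("0 1111 000")
print(three, minus_half, infinity)
```

Calling the same helper with `exp_bits=3, man_bits=4, bias=4` would cover a 1-3-4 excess-4 layout, under the same assumptions about special encodings.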
Ex. 24. The floating-point format to be used in this problem is a normalized format with 1 sign bit,
3 exponent bits, and 4 mantissa bits. The exponent field employs an excess-4 coding. The bit
fields in a number are (sign, exponent, mantissa). Assume that we use unbiased rounding to
the nearest even, as specified in the IEEE floating point standard.
a) Encode the following numbers in the above format:
i) 1.0 (binary)
ii) 0.0011011 (binary)
Note: The guard bit is an extra bit that is added at the least significant bit position during an
arithmetic operation to prevent loss of significance. The round bit is a second extra bit, to the
right of the guard bit, that is used during a floating point arithmetic operation to prevent loss of
precision during intermediate additions. The sticky bit keeps a record of any 1’s that have been
shifted to the right beyond the guard and round bits.
b) Using 32-bit IEEE 754 single precision floating point with one (1) sign bit, eight (8)
exponent bits and twenty-three (23) mantissa bits, show the representation of -11/16
(-0.6875).
c) What is the smallest positive (not including +0) representable number in 32-bit IEEE 754
single precision floating point? Show the bit encoding and the value in base 10 (fraction or
decimal OK).
Ex. 25. Perform the following operations by converting the operands to 2’s complement binary
numbers and then doing the addition or subtraction shown. Please show all work in binary,
operating on 16-bit numbers.

a) 3 + 12
b) 13 – 2
c) 5 – 6
d) -7 – (-7)
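A short sketch for checking 16-bit two's complement work follows. Subtraction is treated as addition of the negated operand, and any carry out of the top bit is discarded. The helper names and the sample operands are illustrative and are not taken from parts a) through d).

```python
def to_twos_complement(value, bits=16):
    """Return the two's complement bit string of a signed integer."""
    low, high = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if not low <= value <= high:
        raise ValueError("value does not fit in the given width")
    return format(value & ((1 << bits) - 1), f"0{bits}b")


def add_twos_complement(a, b, bits=16):
    """Add two signed integers modulo 2^bits; returns (bit string, signed value)."""
    mask = (1 << bits) - 1
    raw = (a + b) & mask  # the carry out of the top bit is dropped
    signed = raw - (1 << bits) if raw & (1 << (bits - 1)) else raw
    return format(raw, f"0{bits}b"), signed


# Illustrative operands only: 9 - 10 computed as 9 + (-10).
minus_three = to_twos_complement(-3)
difference = add_twos_complement(9, -10)
print(minus_three, difference)
```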
Ex. 26. Define the WiMPY precision IEEE 754 floating point format to be:

where each ’X’ represents one bit. Convert each of the following WiMPY floating point numbers to
decimal:
a) 00000000
b) 11011010
c) 01110000
Ex. 27. This problem covers 4-bit binary unsigned division (similar to Fig. 3.11 in the text). Fill in
the table for the Quotient, Divisor and Dividend for each step. You need to provide the
DESCRIPTION of the step being performed (shift left, shift right, sub). The value of Divisor is
4 (0100, with additional 0000 bits shown for right shift), Dividend is 6 (initially loaded into
the Remainder).

Ex. 28. We’re going to look at some ways in which binary arithmetic can be unexpectedly useful.
For this problem, all numbers will be 8-bit, signed, and in 2’s complement.
a) For x = 8, compute x & (−x). (& here refers to bitwise-and, and − refers to arithmetic negation.)
b) For x = 36, compute x & (−x).
c) Explain what the operation x & (−x) does.
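The mechanics of the expression can be checked with a few lines of code. The sketch below masks every intermediate result to 8 bits so that Python's unbounded integers behave like an 8-bit register; the sample value 12 is illustrative and is not one of the values asked about above.

```python
def and_with_negation(x, bits=8):
    """Return (x & -x, bits of x, bits of -x) in a fixed-width register."""
    mask = (1 << bits) - 1
    negated = (-x) & mask  # two's complement negation, truncated to the register width
    result = x & negated
    return result, format(x & mask, f"0{bits}b"), format(negated, f"0{bits}b")


# Illustrative value only.
result, x_bits, neg_bits = and_with_negation(12)
print(x_bits, neg_bits, result)
```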
Ex. 29. Data representation
a) Find the decimal representation of the unsigned fixed-point number 10110.110 (binary).
b) Find the unsigned fixed-point binary representation of 106.375 (decimal).
c) Can an arbitrary decimal number be converted to binary fixed-point form without loss of
precision?
Ex. 30. Data representation
a) Convert the decimal numbers 3.4 and 2.4 to binary fixed-point form using 4 digits to the left of the
binary point and 4 digits to the right. Add the two numbers. Determine the relative error.
b) Which hexadecimal number does 0110 0110 0011 1111 (binary) correspond to?
Ex. 31. Find the 8-bit binary representation of -86:
a) Using sign and magnitude
b) Using 1’s complement representation
c) Using 2’s complement representation
d) Using excess-127 (biased) representation

Ex. 32.
a) Convert the decimal numbers a = 3.4 and b = 10.25 to IEEE 754 single precision floating-point
representation.
b) Add a and b.
c) Multiply a and b.
Ex. 33.
Describe a method for multiplying a number in 2’s complement representation by 127 without using a
multiplier. Convert 127 (decimal) to an 8-bit 2’s complement binary number. Determine the value of
127^2 (express the result as a 16-bit number).
Ex. 34.
Design a barrel shifter that performs an arithmetic left shift of a 4-bit number by 1, 0, -1, or -2 bits. The
shift amount is given as a 2’s complement binary number.
Ex. 35.
Use simple logic gates and a 32-bit adder with carry-in and carry-out bits.

a) Design a circuit to subtract two 32-bit unsigned numbers. The circuit has two 32-bit inputs and one 32-bit output.
In addition, the circuit has an output n (negative). n = 1 indicates that the difference is negative and cannot be
represented as an unsigned number.
b) Design a circuit to compare two 32-bit signed numbers a and b. Both numbers are represented in
sign-and-magnitude form. The circuit has one output l (less). When l = 1, a < b.
c) Design a circuit to compare two single precision floating-point numbers.
Ex. 36.
Consider a ripple-carry adder built from 16 one-bit full adders, as shown in the following figure:

Each gate has a delay of 1 unit. The input signals are applied at time 0. Compute the time t_ar at which
the sum and carry signals reach a stable state.

Chapter 3.
Ex. 37. For the MIPS datapath shown below, several lines are marked with “X”. For each one:
• Describe in words the negative consequence of cutting this line relative to the working,
unmodified processor.
• Provide a snippet of code that will fail
• Provide a snippet of code that will still work

Ex. 38. Consider the following assembly language code:
I0: ADD R4 = R1 + R0;
I1: SUB R9 = R3 - R4;
I2: ADD R4 = R5 + R6;
I3: LDW R2 = MEM[R3 + 100];
I4: LDW R2 = MEM[R2 + 0];
I5: STW MEM[R4 + 100] = R2;
I6: AND R2 = R2 & R1;
I7: BEQ R9 == R1, Target;
I8: AND R9 = R9 & R1;
Consider a pipeline with forwarding, hazard detection, and 1 delay slot for branches. The pipeline is
the typical 5-stage IF, ID, EX, MEM, WB MIPS design. For the above code, complete the pipeline
diagram below (instructions on the left, cycles on top). Write IF, ID, EX, MEM, or WB for each
instruction in the appropriate boxes. Assume that there are two levels of bypassing, that the
second half of the decode stage performs a read of source registers, and that the first half of the
write-back stage writes to the register file. Label all data stalls (draw an X in the box). Label all data
forwards that the forwarding unit detects (draw an arrow from the stage handing off the data to the
stage receiving the data). What is the final execution time of the code?

      T0   T1   T2   T3   T4   T5   T6   T7   T8   T9   T10  T11  T12  T13
I0
I1
I2
I3
I4
I5
I6
I7
I8

Ex. 39. Structural, data and control hazards typically require a processor pipeline to stall. Listed
below are a series of optimization techniques implemented in a compiler or a processor
pipeline designed to reduce or eliminate stalls due to these hazards. For each of the
following optimization techniques, state which pipeline hazards it addresses and how it
addresses them. Some optimization techniques may address more than one hazard, so be sure
to include explanations for all addressed hazards.
a) Branch prediction
b) Instruction scheduling
c) Delay slots
d) Increasing the availability of functional units (ALUs, adders, etc.)
e) Caches
Ex. 40. Branch Prediction. Consider the following sequence of actual outcomes for a single static
branch. T means the branch is taken. N means the branch is not taken. For this question,
assume that this is the only branch in the program.
T T T N T N T T T N T N T T T N T N
a) Assume that we try to predict this sequence with a BHT using one-bit counters. The counters in
the BHT are initialized to the N state. Which of the branches in this sequence would be
mispredicted?
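A one-bit predictor simply remembers the last outcome of the branch and predicts that it will repeat. The sketch below simulates that behaviour and reports the positions (counted from 0) of the mispredicted branches. The function name and the short outcome string are illustrative; the sequence from the exercise is deliberately not used.

```python
def one_bit_mispredictions(outcomes, initial="N"):
    """Return 0-based positions where a one-bit predictor is wrong.

    `outcomes` is a whitespace-separated string of T (taken) and N (not taken).
    """
    state = initial  # the single bit: the last observed outcome
    missed = []
    for position, actual in enumerate(outcomes.split()):
        if actual not in ("T", "N"):
            raise ValueError("outcomes must be T or N")
        if actual != state:
            missed.append(position)
        state = actual  # always update to the most recent outcome
    return missed


# Illustrative sequence only.
missed = one_bit_mispredictions("T T N T")
print(missed)
```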

Ex. 41. The classic 5-stage pipeline seen in Section 4.5 is IF, ID, EX, MEM, WB. This pipeline is
designed specifically to execute the MIPS instruction set. MIPS is a load-store architecture
that performs one memory operation per instruction, hence a single MEM stage in the
pipeline suffices. Also, its most common addressing mode is register displacement
addressing. The EX stage is placed before the MEM stage to allow it to be used for address
calculation. In this question we will consider a variation in the MIPS instruction set and the
interactions of this variation with the pipeline structure. The particular variation we are
considering involves swapping the MEM and EX stages, creating a pipeline that looks like
this: IF, ID, MEM, EX, WB. This change has two effects on the instruction set. First, it
prevents us from using register displacement addressing (there is no longer an EX in front
of MEM to accomplish this). However, in return we can use instructions with one memory
input operand, i.e., register-memory instructions. For instance: multf_m f0,f2,(r2) multiplies
the contents of register f2 and the value at memory location pointed to by r2, putting the
result in f0.
a) Dropping the register displacement addressing mode is potentially a big loss, since it is the mode
most frequently used in MIPS. Why is it so frequent? Give two popular software constructs
whose implementation uses register displacement addressing (i.e., uses displacement
addressing with non-zero displacements).
b) What is the difference between a dependence and a hazard?
Ex. 42. This is a three-part question about critical path calculation. Consider a simple single-cycle
implementation of MIPS ISA. The operation times for the major functional components for
this machine are as follows:

Below is a copy of the MIPS single-cycle datapath design. In this implementation the clock
cycle is determined by the longest possible path in the machine. The critical paths for the
different instruction types that need to be considered are: R-format, load-word, and
store-word. All instructions have the same instruction fetch and decode steps. The basic register
transfers of the instructions are:
i) Fetch/Decode: Instruction <- IMEM[PC];
ii) R-type: R[rd] <- R[rs] op R[rt]; PC <- PC + 4;
iii) load: R[rt] <- DMEM[R[rs] + signext(offset)]; PC <- PC + 4;
iv) store: DMEM[R[rs] + signext(offset)] <- R[rt]; PC <- PC + 4;

a) In the table below, indicate the components that determine the critical path for the respective
instruction, in the order in which they occur on the critical path. If a component is used but is not part of the
critical path of the instruction (i.e., it operates in parallel with another component), it should not be
in the table. The register file is used for reading and for writing; it will appear twice for some
instructions. All instructions begin by reading the PC register with a latency of 2 ns.

b) Place the latencies of the components that you have decided for the critical path of each
instruction in the table below. Compute the sum of each of the component latencies for each
instruction.

c) Use the total latency column to derive the following critical path information:
i) Given the data path latencies above, which instruction determines the overall machine
critical path (latency)?
ii) What will be the resultant clock cycle time of the machine based on the critical path
instruction?
iii) At what frequency will the machine run?
Ex. 43. This problem covers your knowledge of branch prediction.
The figure below illustrates three possible predictors.

i) Last taken predicts taken when 1
ii) Up-Down (saturating counter) predicts taken when 11 and 10
iii) Automata A3 predicts taken when 11 and 10
Fill out the tables below for each branch predictor. The execution pattern for
the branch is NTNNTTTN.
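The up-down (saturating counter) predictor can be traced with a short simulation. This sketch assumes the conventional two-bit scheme: the state counts up on a taken branch and down on a not-taken branch, saturating at 00 and 11, and predicts taken in states 10 and 11. The initial state is an assumption because the figure is not reproduced here, and the automaton A3 is not modelled. The pattern in the usage line is illustrative.

```python
def two_bit_counter(outcomes, initial=0):
    """Simulate a two-bit saturating counter over a string of T/N outcomes.

    Returns (prediction_rate, trace); each trace entry is
    (state_bits_before, predicted, actual).
    """
    if not outcomes:
        raise ValueError("need at least one outcome")
    state = initial  # 0..3 correspond to 00, 01, 10, 11
    correct = 0
    trace = []
    for actual in outcomes:
        predicted = "T" if state >= 2 else "N"
        correct += predicted == actual
        trace.append((format(state, "02b"), predicted, actual))
        if actual == "T":
            state = min(3, state + 1)  # saturate at 11
        else:
            state = max(0, state - 1)  # saturate at 00
    return correct / len(outcomes), trace


# Illustrative pattern only, starting from state 00 (an assumption).
rate, trace = two_bit_counter("TTNN")
print(rate, trace)
```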



Calculate the prediction rates of the three branch predictors:


Ex. 44. Pipelining is used because it improves instruction throughput. Increasing the level of
pipelining cuts the amount of work performed at each pipeline stage, allowing more
instructions to exist in the processor at the same time and individual instructions to
complete at a more rapid rate. However, throughput will not improve as pipelining is
increased indefinitely. Give two reasons for this.
Ex. 45. Consider a MIPS machine with a 5-stage pipeline with a cycle time of 10ns. Assume that you
are executing a program where a fraction, f, of all instructions immediately follow a load
upon which they are dependent.
b) With forwarding enabled, what is the total execution time for N instructions, in terms of f?
c) Consider a scenario where the MEM stage, along with its pipeline registers, needs 12 ns. There
are now two options: add another MEM stage so that there are MEM1 and MEM2 stages, or
increase the cycle time to 12 ns so that the MEM stage fits within the new cycle time and the
number of pipeline stages remains unaffected. For a program mix with the above characteristics,
when is the first option better than the second? Your answer should be based on the value of f.
d) Embedded processors have two different memory regions – a faster scratchpad memory and a
slower normal memory. Assume that in the 6 stage machine (with MEM1 and MEM2 stages),
there is a region of memory that is faster and for which the correct value is obtained at the end
of the MEM1 stage itself while the rest of the memory needs both MEM1 and MEM2 stages. For
the sake of simplicity assume that there are two load instructions load.fast and load.slow that
indicate which memory region is accessed. If 40% of the fraction f mentioned above get their
value from the fast memory, how does the answer to the previous question change?
Ex. 46. Imagine an instruction whose function is to read four adjacent 32-bit words from memory
and place them into four specified 32-bit architectural registers. Assuming the 5-stage
pipeline is filled with these instructions and these instructions ONLY, what is the minimum
number of register file read and write ports that would be required? Why?
Ex. 47. Pipelining and Bypass. In this question we will explore how bypassing affects program
execution performance. To begin, consider the standard MIPS 5-stage pipeline, shown in the
figure below for reference. For this question, we will use the following code to
evaluate the pipeline’s performance:

add $t2, $s1, $sp
lw $t1, 0($t1)
addi $t2, $t1, 7
add $t1, $s2, $sp
lw $t1, 0($t1)
addi $t1, $t1, 9
sub $t1, $t1, $t2
a) What is the load-use latency for the standard MIPS 5-stage pipeline?
b) Once again, using the standard MIPS pipeline, identify whether the value for each register
operand is coming from the bypass or from the register file. For clarity, please write REG or
BYPASS in each box.

c) How many cycles will the program take to execute on the standard MIPS pipeline?
d) Assume, due to circuit constraints, that the bypass wire from the memory stage back to the
execute stage is omitted from the pipeline. What is the load-use latency for this modified
pipeline?
e) Identify whether the value for each register operand is coming from the bypass or from the
register file for the modified pipeline. For clarity, please write REG or BYPASS in each box.

f) How long does the program take to execute on the modified pipeline?
Ex. 48.
Find all data dependences and draw the dependence graph for the program fragment below. Which
dependences lead to data hazards if there is no forwarding? Which of these hazards can be resolved
by forwarding?
add $2, $5, $4
add $4, $2, $5
sw $5, 100($2)
add $3, $2, $4
Ex. 49.
Consider the following program fragment:

add $7, $6, $5
lw $6, 100($7)
sub $7, $6, $8

How many clock cycles are needed to execute the code above? Draw a pipeline diagram illustrating the
execution of the code on a pipelined architecture. Indicate where forwarding is needed to reduce
stalls.
Ex. 50.
Consider a program consisting of 100 lw instructions, each of which has a data dependence on the
previous instruction. Compute the CPI when executing this program.
Ex. 51.
Consider the following code:

lw $0,0($1)
add $0,$0,$3
lw $2,4($1)
add $2,$2,$3
lw $1,0($4)
sw $0,0($1)
sw $2,4($1)

Schedule the instructions above so that no stalls are needed.
Ex. 52.
a) Assume the processor uses delayed branching and that no additional stalls are needed for
branch instructions. Translate the following C code into MIPS instructions. For simplicity,
assume the variables s1, s2, s3, s4 are stored in the corresponding registers $s1, $s2, $s3, $s4.
if (s1<s2) {
s3=s1;
}
else {
s4=s2;
}
b) Suppose the processor also has the two instructions movz (move if zero) and movn (move if not
zero). For example, movz $s1, $s2, $s3 copies $s2 to $s1 if $s3 is 0. Translate the C code above
into assembly code that does not use conditional branch instructions.
c) Compare the execution times of the two assembly code sequences above. Which one takes less
time? Explain why.
Ex. 53. The following code is executed on a processor with delayed branching:

slt $t0,$s1,$s2
beq $t0,$zero,S_2
j End
addi $s3,$s1,0
S_2: addi $s4,$s2,0
End:

Determine the order in which the instructions execute in the two cases: branch taken and branch not
taken.
Ex. 54. Consider the following code:
1. lw $f1,16($s2)
2. add $f7,$f1,$f5
3. sub $f8,$f1,$f6
4. or $f9,$f5,$f1
5. mult $f5,$f8,$f1
6. bne $f9,$f7,target
7. add $f1,$f10,$f5
8. sub $f10,$f2,$f7
9. sw $f1,24($s1)
10. sub $f5,$f1,$f10
11. target:

Draw a dependence graph showing the data dependences, anti-dependences, and output dependences in the
code above. Use the line numbers to the left of the instructions as the names of the nodes in the graph.

Chapter 4.
Ex. 1. Caches and Address Translation. Consider a 64-byte cache with 8 byte blocks, an
associativity of 2 and LRU block replacement. Virtual addresses are 16 bits. The cache is
physically tagged. The processor has 16KB of physical memory.
a) What is the total number of tag bits?
b) For the following sequence of references, label the cache misses. Also, label each miss as
being either a compulsory miss, a capacity miss, or a conflict miss. The addresses are given
in octal (each digit represents 3 bits). Assume the cache initially contains block addresses
000, 010, 020, 030, 040, 050, 060, and 070, which were accessed in that order.


c) Which of the following techniques are aimed at reducing the cost of a miss: dividing the
current block into sub-blocks, a larger block size, the addition of a second level cache, the
addition of a victim buffer, early restart with critical word first, a writeback buffer, skewed
associativity, software prefetching, the use of a TLB, and multi-porting.
d) Why are the first level caches usually split (instructions and data are in different caches)
while the L2 is usually unified (instructions and data are both in the same cache)?
Ex. 2. Assume the following 10-bit address sequence generated by the microprocessor:

The cache uses 4 bytes per block. Assume a 2-way set associative cache design that uses the LRU
algorithm (with a cache that can hold a total of 4 blocks). Assume that the cache is initially empty. First
determine the TAG, SET, and BYTE OFFSET fields and fill in the table above. In the figure below, clearly mark
the TAG, Least Recently Used (LRU), and HIT/MISS information for each access. Then
derive the hit ratio for the access sequence.






Ex. 3.
a) Why is miss rate not a good metric for evaluating cache performance? What is the
appropriate metric? Give its definition. What is the reason for using a combination of first-level
and second-level caches rather than using the same chip area for a larger first-level cache?
b) The original motivation for using virtual memory was “compatibility”. What does that mean
in this context? What are two other motivations for using virtual memory?
c) What are the two characteristics of program memory accesses that caches exploit?
d) What are three types of cache misses?
Ex. 4. Design a 128KB direct-mapped data cache that uses a 32-bit address and 16 bytes per block.
Calculate the following:
a) How many bits are used for the byte offset?
b) How many bits are used for the set (index) field?
c) How many bits are used for the tag?
Ex. 5. Design an 8-way set associative cache that has 16 blocks and 32 bytes per block. Assume a
32-bit address. Calculate the following:
a) How many bits are used for the byte offset?
b) How many bits are used for the set (index) field?
c) How many bits are used for the tag?
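The field widths asked for in Ex. 4 and Ex. 5 follow from the cache geometry: the offset covers the bytes in a block, the index selects a set, and the tag is whatever remains of the address. The sketch below assumes power-of-two sizes and byte addressing; the function name and the configuration in the usage line are hypothetical and differ from both exercises.

```python
def cache_fields(cache_bytes, block_bytes, ways, address_bits=32):
    """Return (offset_bits, index_bits, tag_bits) for a set associative cache."""
    blocks = cache_bytes // block_bytes
    sets = blocks // ways
    for name, n in (("block size", block_bytes), ("number of sets", sets)):
        if n < 1 or n & (n - 1):
            raise ValueError(f"{name} must be a power of two")
    offset_bits = block_bytes.bit_length() - 1  # log2 of the block size
    index_bits = sets.bit_length() - 1          # log2 of the number of sets
    tag_bits = address_bits - index_bits - offset_bits
    return offset_bits, index_bits, tag_bits


# Illustrative configuration only: 64 KiB, 64-byte blocks, 4-way.
fields = cache_fields(64 * 1024, 64, 4)
print(fields)
```

A direct-mapped cache is the special case `ways=1`, and a fully associative cache has a single set and therefore zero index bits.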
Ex. 6. This question covers cache and pipeline performance analysis.
a) Write the formula for the average memory access time assuming one level of cache
memory:
b) For a data cache with a 92% hit rate and a 2-cycle hit latency, calculate the average memory
access latency. Assume that the latency to memory and the cache miss penalty together total 124
cycles. Note: The cache must be accessed after memory returns the data.
c) Calculate the performance of a processor taking into account stalls due to data cache and
instruction cache misses. The data cache (for loads and stores) is the same as described in
part b), and 30% of instructions are loads and stores. The instruction cache has a hit rate of
90% with a miss penalty of 50 cycles. Assume the base CPI using a perfect memory system
is 1.0. Calculate the CPI of the pipeline, assuming everything else is working perfectly.
Assume a load never stalls a dependent instruction, and assume the processor must wait
for stores to finish when they miss the cache. Finally, assume that instruction cache misses
and data cache misses never occur at the same time. Show your work.
• Calculate the additional CPI due to the icache stalls.
• Calculate the additional CPI due to the dcache stalls.
• Calculate the overall CPI for the machine.
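The two quantities used in Ex. 6 can be sketched as small C functions. This is a minimal sketch, assuming the usual single-level model in which the hit time is paid on every access and the miss penalty is added on a miss. The function names `amat` and `stall_cpi` are hypothetical. Which data accesses actually stall the pipeline in part c) depends on how the load and store assumptions are read, so the sketch leaves the accesses-per-instruction value as a parameter.

```c
#include <assert.h>

/* Average memory access time for one cache level:
 * AMAT = hit time + miss rate * miss penalty (all in cycles). */
double amat(double hit_time, double miss_rate, double miss_penalty)
{
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);
    return hit_time + miss_rate * miss_penalty;
}

/* Extra CPI caused by cache misses:
 * stalling accesses per instruction * miss rate * miss penalty. */
double stall_cpi(double accesses_per_instr, double miss_rate,
                 double miss_penalty)
{
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);
    return accesses_per_instr * miss_rate * miss_penalty;
}
```

Under this model, `amat(2.0, 0.08, 124.0)` evaluates to 11.92 cycles for part b), and `stall_cpi(1.0, 0.10, 50.0)` evaluates to 5.0 for the instruction cache, assuming one fetch per instruction.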
Ex. 7. A processor has a 32-byte memory and an 8-byte direct-mapped cache. Table 0 shows the
current state of the cache. Write hit or miss under each address in the memory
reference sequence below. Show the new state of the cache for each miss in a new table,
label the table with the address, and circle the change. Calculate the hit and miss rates.





Ex. 8. A processor has a 32-byte memory and an 8-byte 4-way set-associative cache. Table 0 shows
the current state of the cache. Use the Least Recently Used replacement policy. Write hit or
miss under each address in the memory reference sequence below. Show the new state
of the cache for each miss in a new table, label the table with the address, and circle the
change. Calculate the hit and miss rates.








Ex. 9. How many total SRAM bits are required to implement a 256KB four-way set-associative
cache? The cache is physically indexed and has 64-byte blocks. Assume that there
are 4 extra bits per entry: 1 valid bit, 1 dirty bit, and 2 LRU bits for the replacement policy.
Assume that the physical address is 50 bits wide.
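The bookkeeping for Ex. 9 can be sketched in C. This is a minimal sketch, assuming that the 256KB figure counts data bytes only, that "KB" means 1024 bytes, and that the tag is taken from the same 50-bit physical address (the exercise states only that the cache is physically indexed). The names `log2_exact` and `cache_sram_bits` are hypothetical.

```c
#include <assert.h>

/* Return log2(n) for a power of two n. */
static unsigned log2_exact(unsigned long long n)
{
    unsigned bits = 0;

    assert(n != 0 && (n & (n - 1)) == 0);  /* n must be a power of two */
    while (n > 1) {
        n >>= 1;
        bits++;
    }
    return bits;
}

/* Total SRAM bits = number of blocks * (data + tag + extra bits). */
unsigned long long cache_sram_bits(unsigned long long data_bytes,
                                   unsigned ways,
                                   unsigned block_bytes,
                                   unsigned addr_bits,
                                   unsigned extra_bits_per_entry)
{
    unsigned long long blocks = data_bytes / block_bytes;
    unsigned long long sets = blocks / ways;
    unsigned offset_bits = log2_exact(block_bytes);
    unsigned index_bits = log2_exact(sets);
    unsigned tag_bits = addr_bits - index_bits - offset_bits;
    unsigned long long bits_per_entry =
        8ULL * block_bytes + tag_bits + extra_bits_per_entry;

    return blocks * bits_per_entry;
}
```

With the exercise's numbers there are 4096 blocks in 1024 sets, so the tag is 50 - 10 - 6 = 34 bits, each entry holds 512 + 34 + 4 = 550 bits, and `cache_sram_bits(256ULL * 1024, 4, 64, 50, 4)` returns 2,252,800 under these assumptions.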
Ex. 10. Caches: Misses and Hits
int i;
int a[1024*1024];
int x = 0;
for (i = 0; i < 1024; i++)
{
    x += a[i] + a[1024*i];
}
Consider the code snippet above. Suppose that it is executed on a system with a 2-way set-associative
16KB data cache with 32-byte blocks, 32-bit words, and an LRU replacement policy. Assume
that int is word-sized. Also assume that the address of a is 0x0, that i and x are in registers, and that the
cache is initially empty. How many data cache misses are there? How many hits are there?
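One way to check a hand count for Ex. 10 is a small cache simulator. This is a minimal sketch, assuming the geometry in the exercise (16KB / 32 bytes = 512 blocks, 2 ways, so 256 sets) and assuming that a[i] is read before a[1024*i] in each iteration; C does not fix the evaluation order of the two operands, so that order is an assumption. The names `cache`, `cache_access`, and `run_ex10` are hypothetical.

```c
#include <assert.h>
#include <string.h>

enum { BLOCK_BYTES = 32, NUM_SETS = 256, WAYS = 2 };

/* Hypothetical 2-way cache model: tags, valid bits, and one LRU
 * pointer per set naming the way to replace next. */
struct cache {
    unsigned long tag[NUM_SETS][WAYS];
    int valid[NUM_SETS][WAYS];
    int lru_way[NUM_SETS];
    unsigned long hits;
    unsigned long misses;
};

/* Simulate one read of the byte address addr. */
void cache_access(struct cache *c, unsigned long addr)
{
    unsigned long block = addr / BLOCK_BYTES;
    unsigned long set = block % NUM_SETS;
    unsigned long tag = block / NUM_SETS;
    int way;

    for (way = 0; way < WAYS; way++) {
        if (c->valid[set][way] && c->tag[set][way] == tag) {
            c->hits++;
            c->lru_way[set] = 1 - way;  /* the other way is now LRU */
            return;
        }
    }
    way = c->lru_way[set];              /* fill or evict the LRU way */
    c->valid[set][way] = 1;
    c->tag[set][way] = tag;
    c->lru_way[set] = 1 - way;
    c->misses++;
}

/* Replay the data accesses of the Ex. 10 loop with a at address 0x0. */
void run_ex10(struct cache *c)
{
    unsigned long i;

    assert(c != NULL);
    memset(c, 0, sizeof *c);            /* start with an empty cache */
    for (i = 0; i < 1024; i++) {
        cache_access(c, 4UL * i);            /* a[i], assumed first */
        cache_access(c, 4UL * 1024UL * i);   /* a[1024*i] */
    }
}
```

Under these assumptions the simulation reports 1151 misses and 897 hits out of 2048 accesses: the a[i] stream misses once per 32-byte block (128 misses), and every a[1024*i] access except the one at i = 0 touches a new block.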
Ex. 11. Describe the general characteristics of a program that would exhibit very little temporal
and spatial locality with regard to instruction fetches. Provide an example of such a
program (pseudo-code is fine). Also, describe the cache effects of excessive unrolling. Use
the terms static instructions and dynamic instructions in your description.
Ex. 12. You are given an empty 16KB 2-way set-associative cache with LRU replacement and 32-byte
blocks on a machine with 4-byte words and 32-bit addresses. Describe in mathematical
terms a memory read address sequence that yields each of the following Hit/Miss patterns. If
such a sequence is impossible, state why. Some sample sequences are given:
Hit/Miss pattern          Address sequence being accessed
Miss, Hit, Hit, Miss      0, 0, 0, 32
Miss, (Hit)*              0, 0*
(Hit)*
(Miss)*
(Miss, Hit)*
Ex. 13. Assume an instruction cache miss rate for gcc of 2% and a data cache miss rate of 4%. If a
machine has a CPI of 2 without any memory stalls and the miss penalty is 40 cycles for all
misses, determine how much faster a machine would run with a perfect cache that never
missed. Assume 36% of instructions are loads/stores.
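The calculation in Ex. 13 can be laid out as a short worked equation. This is a sketch, assuming one instruction fetch per instruction and the same instruction count and clock rate for both machines, so that the speedup equals the ratio of the CPIs.

```latex
\begin{align*}
\text{Instruction miss cycles per instruction} &= 1 \times 0.02 \times 40 = 0.80 \\
\text{Data miss cycles per instruction} &= 0.36 \times 0.04 \times 40 = 0.576 \\
\text{CPI}_{\text{with stalls}} &= 2 + 0.80 + 0.576 = 3.376 \\
\text{Speedup with a perfect cache} &= \frac{3.376}{2} = 1.688
\end{align*}
```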
Ex. 14. Consider the following piece of code:
int x = 0, y = 0; // The compiler puts x in r1 and y in r2.
int i; // The compiler puts i in r3.
int A[4096]; // A is in memory at address 0x10000
