- 姓名:陳丕祐
- 學號:403410061
Reorder the instructions to improve performance of the code in Figure 3.48. Assume the two-pipe machine in Exercise 3.3 and that the out-of-order completion issues of Exercise 3.4 have been dealt with successfully. Just worry about observing true data dependences and functional unit latencies for now. How many cycles does your reordered code take?
20 cycle.
這個問題首先要考量的是,有兩個pipeline可以執行指令。故我們可以將彼此沒有依賴關係的指令分開來,以增加效率。
| Group 1 | Group 2 |
|---|---|
| LD F2, 0(RX) | LD F4, 0(Ry) |
| DIVD F8, F2, F0 | ADDD F4, F0, F4 |
| MULTD F2, F6, F2 | SD F4, 0(Ry) |
| ADDD F10, F8, F2 | ADDI Rx, Rx, #8 |
| ADDI Ry, Ry, #8 | |
| SUB R20, R4, Rx | |
| BNZ R20, Loop |
詳細的執行順序如下
PipeLine 1
Loop:
LD F2, 0(RX)
<Latency for LD>
<Latency for LD>
<Latency for LD>
<Latency for LD>
DIVD F8, F2, F0
MULTD F2, F6, F2
<Latency for DIVD and MULTD>
<Latency for DIVD and MULTD>
<Latency for DIVD and MULTD>
<Latency for DIVD and MULTD>
<Latency for DIVD and MULTD>
<Latency for DIVD>
<Latency for DIVD>
<Latency for DIVD>
<Latency for DIVD>
<Latency for DIVD>
<Latency for DIVD>
ADDD F10, F8, F2
<NOP>PipeLine 2
Loop:
LD F4, 0(Ry)
<Latency for LD>
<Latency for LD>
<Latency for LD>
<Latency for LD>
ADDD F4, F0, F4
<Latency for ADDD>
SD F4, 0(Ry)
<Latency for SD>
ADDI Rx, Rx, #8
ADDI Ry, Ry, #8
<Latency for ADDI>
<NOP>
<NOP>
<NOP>
<NOP>
<NOP>
SUB R20, R4, Rx
BNZ R20, Loop
<Latency for BNZ>Let's consider what dynamic scheduling might achieve here. Assume a microarchitecture as shown in Figure 3.55. Assume that the arithmetic-logical units(ALUs) can do all arthmetic ops (MULTD, DIVD, ADDD, ADDI, SUB) and branches, and that the Reservation Station (RS) can dispatch at most one operation to each functional unit per cycle (one op to each ALU plus one memory op to the LD/ST).
a. Suppose all of the instructions from the sequence in Figure 3.48 are present in the RS, with no renaming having been done. Highlight any instructions in the code where register renaming would improve performance. (Hint: Look for read-after-write and write-after-write hazards. Assume the same functional unit latencies as in Figure 3.48.)
使用 register renaming 技巧可以改進的 hazard 有 WAW 與 WAR。所以我們的工作是在這個程式碼之中找出具有WAW或者WAR的部份。
1: LD F2, 0(Rx)
2: DIVD F8, F2, F0
3: MULTD F2, F6, F2
4: LD F4, 0(Ry)
5: ADDD F4, F0, F4
6: ADDD F10, F8, F2
7: ADDI Rx, Rx, #8
8: ADDI Ry, Ry, #8
9: SD F4, 0(Ry)
10: SUB R20, R4, Rx
11: BNZ R20, LoopRead-after-write 利用換名的方式我想並沒有辦法解決 hazards,雖然hint中有但是我沒有並列入答案之中。
對於上述的程式碼而言
- RAW (register renaming 不能解決)
- 第一行與第二行
- 第一行跟第三行
- 第二行跟第六行
- 第二行與遞六行
- 第三行與第六行
- 第四行與第五行
- 第七行跟第十行
- 第八行跟第九行
- WAW
- 第一行跟第三行 (但是同時有RAW的hazard發生)
- 第四行與第五行 (但是同時有RAW的hazard發生)
我覺得沒有任何一行code可以藉由register renaming這個方法進行效能的優化。
程式碼之間的資料有彼此相互依賴的關係(RAW)
b. Suppose the register-renamed version of the code from part(a) is resident in the RS in clock cycle N, with latencies as given in Figure 3.48. Show how the RS should dispatch these instructions out of order, clock by clock, to obtain optimal performance on this code. (Assume the same RS restrictions as in part(a). Also assume that results must be written into the RS before they're available for use - no bypassing.) How many clock cycles does the code sequence take?
c. Part (b) lets the RS try to optimally schedule these instructions. But in reality, the whole instruction sequence of interest is not usually present in the RS. Instead, various events clear the RS, and as a new code sequence streams in from the decoder, the RS must choose to dispatch what it has. Suppose that the RS is empty. In cycle 0, the first two register-renamed instructions of this sequence appear in the RS. Assume it takes one clock cycle to dispatch any op, and assume that the front end (decoder/register-renamer) will continue to supply two new instructions per clock cycle. Show the cycle-by-cycle order of dispatch of the RS. How many clock cycles does this code sequence require now?
d. if you wanted to improve the results of part (c), which would have helped most: (1) Another ALU? (2) Another LD/ST unit? (3) Full bypassing of ALU results to subsequenct operations? or (4) Cutting the longest latency in half? What's the speedup?
e. Now let's consider speculation, the act of fetching, decoding, and executing beyond one or more conditional branches. Our motivation to do this is twofold: The dispatch schedule we came up with in part (c) had lots of nops, and we known computers spend most of their time executing loops (witch implies the branch back to the top of the loop is pretty predictable). Loops tell us where to find more work to do; our sparse dispatch schedule suggests e have opportunities to do some of that work earlier than before. In part (d) you found the critical path through the loop. Imagine folding a second copy of that path onto the schedule you got in part (b). How many more clock cycles would be required to do two loops' worth of word (assuming all instructions are resident in the RS?) (Assume all functional units are fully pipelined.)

