CS 152 Computer Architecture and Engineering Lecture 4 - Pipelining
Short Description
Download CS 152 Computer Architecture and Engineering Lecture 4 - Pipelining...
Description
CS 152 Computer Architecture and Engineering
Lecture 4 - Pipelining
Dr. George Michelogiannakis EECS, University of California at Berkeley CRD, Lawrence Berkeley National Laboratory http://inst.eecs.berkeley.edu/~cs152
1/28/2016
CS152, Spring 2016
Last time in Lecture 3 Microcoding became less attractive as gap between RAM and ROM speeds reduced Complex instruction sets difficult to pipeline, so difficult to increase performance as gate count grew Load-Store RISC ISAs designed for efficient pipelined implementations – Very similar to vertical microcode – Inspired by earlier Cray machines (more on these later)
Iron Law explains architecture design space – Trade instructions/program, cycles/instruction, and time/cycle
1/28/2016
CS152, Spring 2016
2
Question of the Day Why a five stage pipeline?
1/28/2016
CS152, Spring 2016
3
An Ideal Pipeline stage 1
stage 2
stage 3
stage 4
All objects go through the same stages No sharing of resources between any two stages Propagation delay through all pipeline stages is equal The scheduling of an object entering the pipeline is not affected by the objects in other stages
These conditions generally hold for industrial assembly lines, but instructions depend on each other! 1/28/2016
CS152, Spring 2016
4
Pipelined RISC-V To pipeline RISC-V: First build RISC-V without pipelining with CPI=1 Next, add pipeline registers to reduce cycle time while maintaining CPI=1
1/28/2016
CS152, Spring 2016
5
Lecture 3: Unpipelined Datapath for RISC-V PCSel br rind jabs pc+4
RegWriteEn
MemWrite
WBSel
0x4 Add Add clk
PC clk
Br Logic Bcomp?
we rs1 rs2 rd1 wa wd rd2
addr inst
Inst. Memory
ALU
GPRs
clk
we addr rdata
Data Memory
Imm Select
wdata ALU Control
OpCode WASel
1/28/2016
ImmSel
CS152, Spring 2016
Op2Sel
6
Lecture 3: Hardwired Control Table Opcode ALU
ImmSel
Op2Sel
MemWr
RFWen
pc+4 pc+4 pc+4 pc+4
BEQtrue
BrType12
*
*
no
no
*
*
br
BEQfalse
BrType12
* * *
* *
no
*
no no
no yes yes
* PC PC
* rd rd
pc+4 jabs rind
Op2Sel= Reg / Imm
ALU ALU Mem *
rd rd rd *
BsType12
JALR
yes yes yes no
PCSel
SW
* *
no no no yes
WASel
LW
JAL
Func Op + +
WBSel
* IType12 IType12
ALUi
Reg Imm Imm Imm
FuncSel
WBSel = ALU / Mem / PC PCSel = pc+4 / br / rind / jabs
Correction since L3: “J” has been removed 1/28/2016
CS152, Spring 2016
7
Pipelined Datapath 0x4 Add
addr
PC
rdata
Inst. Memory
fetch phase
IR
we rs1 rs2 rd1 wa wd rd2 GPRs
ALU
Data Memory
Imm Select
decode & Reg-fetch phase
we addr rdata wdata
execute phase
memory phase
write -back phase
Clock period can be reduced by dividing the execution of an instruction into multiple cycles tC > max {tIM, tRF, tALU, tDM, tRW} ( = tDM probably) However, CPI will increase unless instructions are pipelined 1/28/2016
CS152, Spring 2016
8
“Iron Law” of Processor Performance Time = Instructions Cycles Time Program Program * Instruction * Cycle Instructions per program depends on source code, compiler technology, and ISA Cycles per instructions (CPI) depends on ISA and µarchitecture Time per cycle depends upon the µarchitecture and base technology Lecture 2 Lecture 3 Lecture 4 1/28/2016
Microarchitecture Microcoded Single-cycle unpipelined Pipelined
CS152, Spring 2016
CPI >1 1 1
cycle time short long short
9
CPI Examples Microcoded machine 7 cycles 5 cycles
Inst 1
10 cycles
Inst 2
Time
Inst 3
3 instructions, 22 cycles, CPI=7.33 Unpipelined machine Inst 1
Inst 2
Inst 3
3 instructions, 3 cycles, CPI=1 Pipelined machine Inst 1 Inst 2 Inst 3 1/28/2016
3 instructions, 3 cycles, CPI=1
5-stage pipeline CPI≠5!!!
CS152, Spring 2016
10
Technology Assumptions • A small amount of very fast memory (caches) backed up by a large, slower memory • Fast ALU (at least for integers) • Multiported Register files (slower!) Thus, the following timing assumption is reasonable
tIM ~= tRF ~= tALU ~= tDM ~= tRW A 5-stage pipeline will be focus of our detailed design - some commercial designs have over 30 pipeline stages to do an integer add! 1/28/2016
CS152, Spring 2016
11
5-Stage Pipelined Execution 0x4 Add
addr rdata
PC
we rs1 rs2 rd1 wa wd rd2 GPRs
IR
Inst. Memory
I-Fetch (IF)
ALU
Data Memory
Imm Select
wdata
Decode, Reg. Fetch Execute (EX) (ID)
time instruction1 instruction2 instruction3 instruction4 instruction5 1/28/2016
we addr rdata
t0 IF1
t1 ID1 IF2
t2 EX1 ID2 IF3
t3 MA1 EX2 ID3 IF4
CS152, Spring 2016
t4 WB1 MA2 EX3 ID4 IF5
Write -Back (WB)
Memory (MA) t5
t6
t7
....
WB2 MA3 WB3 EX4 MA4 WB4 ID5 EX5 MA5 WB5 12
5-Stage Pipelined Execution Resource Usage Diagram 0x4 Add
addr rdata
PC
we rs1 rs2 rd1 ws wd rd2 GPRs
IR
Inst. Memory
Resources
I-Fetch (IF)
1/28/2016
we addr rdata
ALU
Data Memory
Imm Select
wdata
Decode, Reg. Fetch Execute (ID) (EX)
time IF ID EX MA WB
t0 I1
t1 I2 I1
t2 I3 I2 I1
t3 I4 I3 I2 I1
CS152, Spring 2016
t4 I5 I4 I3 I2 I1
Write -Back (WB)
Memory (MA)
t5
t6
t7
....
I5 I4 I3 I2
I5 I4 I3
I5 I4
I5 13
Pipelined Execution: ALU Instructions 0x4
IR
Add
PC
addr
inst IR
Inst Memory
we rs1 rs2 rd1 wa wd rd2
GPRs
IR
IR
A ALU
Y
B
we addr rdata
Data Memory wdata
Imm Select
R
wdata MD1
MD2
Not quite correct! We need an Instruction Reg (IR) for each stage 1/28/2016
CS152, Spring 2016
14
Pipelined RISC-V Datapath without jumps F
D
E
M
IR
IR
W IR
0x4 Add
RegWriteEn
PC
addr
inst IR
Inst Memory
we rs1 rs2 rd1 wa wd rd2
ALU Control
MemWrite
A ALU
GPRs
Y
we addr rdata
B
Data Memory wdata
Imm Select
R
wdata MD1
ImmSel
WBSel
MD2
Op2Sel
Control Points Need to Be Connected 1/28/2016
CS152, Spring 2016
15
Instructions interact with each other in pipeline An instruction in the pipeline may need a resource being used by another instruction in the pipeline structural hazard An instruction may depend on something produced by an earlier instruction – Dependence may be for a data value
data hazard – Dependence may be for the next instruction’s address control hazard (branches, exceptions)
1/28/2016
CS152, Spring 2016
16
Resolving Structural Hazards Structural hazard occurs when two instructions need same hardware resource at same time – Can resolve in hardware by stalling newer instruction till older instruction finished with resource
A structural hazard can always be avoided by adding more hardware to design – E.g., if two instructions both need a port to memory at same time, could avoid hazard by adding second port to memory
Our 5-stage pipeline has no structural hazards by design – Thanks to RISC-V ISA, which was designed for pipelining
1/28/2016
CS152, Spring 2016
17
Data Hazards x1 …
x4 x1 … 0x4
IR
Add
PC
addr
inst IR
Inst Memory
we rs1 rs2 rd1 wa wd rd2
GPRs
A ALU
Y
B
rdata
R
wdata
wdata MD1
1/28/2016
we addr
Data Memory
Imm Select
... x1 x0 + 10 x4 x1 + 17 ...
IR
IR
MD2
x1 is stale. Oops! CS152, Spring 2016
18
How Would You Resolve This? What can you do if your project buddy has not completed their part, which you need to start? Three options – – – –
1/28/2016
Wait (stall) Speculate on the value of the deliverable Bypass: ask them for what you need before his/her final deliverable (ask him/her to work harder?)
CS152, Spring 2016
19
Resolving Data Hazards (1)
Strategy 1: Wait for the result to be available by freezing earlier pipeline stages interlocks
1/28/2016
CS152, Spring 2016
20
Feedback to Resolve Hazards
FB1
FB2
stage 1
FB3
stage 2
FB4
stage 3
stage 4
Later stages provide dependence information to earlier stages which can stall (or kill) instructions • Controlling a pipeline in this manner works provided the instruction at stage i+1 can complete without any interference from instructions in stages 1 to i (otherwise deadlocks may occur) 1/28/2016
CS152, Spring 2016
21
Interlocks to resolve Data Hazards Stall Condition
bubble
0x4 Add
PC
addr
inst IR
Inst Memory
... x1 x0 + 10 x4 x1 + 17 ... 1/28/2016
we rs1 rs2 rd1 wa wd rd2
GPRs
IR
A ALU
Y
B
we addr rdata
Data Memory
Imm Select
R
wdata
wdata MD1
CS152, Spring 2016
IR
IR
MD2
22
Stalled Stages and Pipeline Bubbles time t0 t1 t2 t3 t4 t5 (I1) x1 (x0) + 10 IF1 ID1 EX1 MA1 WB1 (I2) x4 (x1) + 17 IF2 ID2 ID2 ID2 ID2 (I3) IF3 IF3 IF3 IF3 (I4) stalled stages (I5)
Resource Usage
IF ID EX MA WB
time t0 t1 I1 I2 I1
t2 I3 I2 I1
t3 I3 I2 I1
t4 I3 I2 I1
t5 I3 I2 -
t6
CS152, Spring 2016
....
EX2 MA2 WB2 ID3 EX3 MA3 WB3 IF4 ID4 EX4 MA4 WB4 IF5 ID5 EX5 MA5 WB5
t6 I4 I3 I2 -
- 1/28/2016
t7
t7 I5 I4 I3 I2 -
.... I5 I4 I3 I2
I5 I4 I3
I5 I4
I5
pipeline bubble 23
Interlock Control Logic stall
wa Cstall rs2 ? rs1
bubble
0x4 Add
PC
addr
inst IR
Inst Memory
we rs1 rs2 rd1 wa wd rd2
GPRs
IR
IR
IR
A ALU
Y
B
we addr rdata
Data Memory
Imm Select
R
wdata
wdata MD1
MD2
Compare the source registers of the instruction in the decode stage with the destination register of the uncommitted instructions. 1/28/2016
CS152, Spring 2016
24
Interlock Control Logic ignoring jumps & branches stall
wa we
Cstall rs1 ? rs2 re1 re2 Cre
0x4 Add
PC
addr
inst IR
Inst Memory
Cdest
Cdest
bubble
we rs1 rs2 rd1 wa wd rd2
GPRs
wa
wa we
we
IR
IR
IR
A ALU
Y
B
we addr rdata
Data Memory wdata
Imm Select
R
wdata MD1
MD2
Should we always stall if an rs field matches some rd? not every instruction writes a register => we not every instruction reads a register => re 1/28/2016
CS152, Spring 2016
25
Source & Destination Registers func7
rs2
rs1
func3
rd
opcode
immediate12
rs1
func3
rd
opcode
imm
rs1
func3 imm opcode
rs2
Jump Offset[19:0]
ALU ALUI LW SW Bcond
JAL JALR 1/28/2016
rd
rd
View more...
Comments