CS 152 Computer Architecture and Engineering Lecture 4 - Pipelining

June 20, 2018 | Author: Anonymous | Category: Engineering & Technology, Computer Science
Share Embed Donate


Short Description

Download CS 152 Computer Architecture and Engineering Lecture 4 - Pipelining...

Description

CS 152 Computer Architecture and Engineering

Lecture 4 - Pipelining

Dr. George Michelogiannakis EECS, University of California at Berkeley CRD, Lawrence Berkeley National Laboratory http://inst.eecs.berkeley.edu/~cs152

1/28/2016

CS152, Spring 2016

Last time in Lecture 3  Microcoding became less attractive as gap between RAM and ROM speeds reduced  Complex instruction sets difficult to pipeline, so difficult to increase performance as gate count grew  Load-Store RISC ISAs designed for efficient pipelined implementations – Very similar to vertical microcode – Inspired by earlier Cray machines (more on these later)

 Iron Law explains architecture design space – Trade instructions/program, cycles/instruction, and time/cycle

1/28/2016

CS152, Spring 2016

2

Question of the Day  Why a five stage pipeline?

1/28/2016

CS152, Spring 2016

3

An Ideal Pipeline stage 1

   

stage 2

stage 3

stage 4

All objects go through the same stages No sharing of resources between any two stages Propagation delay through all pipeline stages is equal The scheduling of an object entering the pipeline is not affected by the objects in other stages

These conditions generally hold for industrial assembly lines, but instructions depend on each other! 1/28/2016

CS152, Spring 2016

4

Pipelined RISC-V  To pipeline RISC-V:  First build RISC-V without pipelining with CPI=1  Next, add pipeline registers to reduce cycle time while maintaining CPI=1

1/28/2016

CS152, Spring 2016

5

Lecture 3: Unpipelined Datapath for RISC-V PCSel br rind jabs pc+4

RegWriteEn

MemWrite

WBSel

0x4 Add Add clk

PC clk

Br Logic Bcomp?

we rs1 rs2 rd1 wa wd rd2

addr inst

Inst. Memory

ALU

GPRs

clk

we addr rdata

Data Memory

Imm Select

wdata ALU Control

OpCode WASel

1/28/2016

ImmSel

CS152, Spring 2016

Op2Sel

6

Lecture 3: Hardwired Control Table Opcode ALU

ImmSel

Op2Sel

MemWr

RFWen

pc+4 pc+4 pc+4 pc+4

BEQtrue

BrType12

*

*

no

no

*

*

br

BEQfalse

BrType12

* * *

* *

no

*

no no

no yes yes

* PC PC

* rd rd

pc+4 jabs rind

Op2Sel= Reg / Imm

ALU ALU Mem *

rd rd rd *

BsType12

JALR

yes yes yes no

PCSel

SW

* *

no no no yes

WASel

LW

JAL

Func Op + +

WBSel

* IType12 IType12

ALUi

Reg Imm Imm Imm

FuncSel

WBSel = ALU / Mem / PC PCSel = pc+4 / br / rind / jabs

Correction since L3: “J” has been removed 1/28/2016

CS152, Spring 2016

7

Pipelined Datapath 0x4 Add

addr

PC

rdata

Inst. Memory

fetch phase

IR

we rs1 rs2 rd1 wa wd rd2 GPRs

ALU

Data Memory

Imm Select

decode & Reg-fetch phase

we addr rdata wdata

execute phase

memory phase

write -back phase

Clock period can be reduced by dividing the execution of an instruction into multiple cycles tC > max {tIM, tRF, tALU, tDM, tRW} ( = tDM probably) However, CPI will increase unless instructions are pipelined 1/28/2016

CS152, Spring 2016

8

“Iron Law” of Processor Performance Time = Instructions Cycles Time Program Program * Instruction * Cycle  Instructions per program depends on source code, compiler technology, and ISA  Cycles per instructions (CPI) depends on ISA and µarchitecture  Time per cycle depends upon the µarchitecture and base technology Lecture 2 Lecture 3 Lecture 4 1/28/2016

Microarchitecture Microcoded Single-cycle unpipelined Pipelined

CS152, Spring 2016

CPI >1 1 1

cycle time short long short

9

CPI Examples Microcoded machine 7 cycles 5 cycles

Inst 1

10 cycles

Inst 2

Time

Inst 3

3 instructions, 22 cycles, CPI=7.33 Unpipelined machine Inst 1

Inst 2

Inst 3

3 instructions, 3 cycles, CPI=1 Pipelined machine Inst 1 Inst 2 Inst 3 1/28/2016

3 instructions, 3 cycles, CPI=1

5-stage pipeline CPI≠5!!!

CS152, Spring 2016

10

Technology Assumptions • A small amount of very fast memory (caches) backed up by a large, slower memory • Fast ALU (at least for integers) • Multiported Register files (slower!) Thus, the following timing assumption is reasonable

tIM ~= tRF ~= tALU ~= tDM ~= tRW A 5-stage pipeline will be focus of our detailed design - some commercial designs have over 30 pipeline stages to do an integer add! 1/28/2016

CS152, Spring 2016

11

5-Stage Pipelined Execution 0x4 Add

addr rdata

PC

we rs1 rs2 rd1 wa wd rd2 GPRs

IR

Inst. Memory

I-Fetch (IF)

ALU

Data Memory

Imm Select

wdata

Decode, Reg. Fetch Execute (EX) (ID)

time instruction1 instruction2 instruction3 instruction4 instruction5 1/28/2016

we addr rdata

t0 IF1

t1 ID1 IF2

t2 EX1 ID2 IF3

t3 MA1 EX2 ID3 IF4

CS152, Spring 2016

t4 WB1 MA2 EX3 ID4 IF5

Write -Back (WB)

Memory (MA) t5

t6

t7

....

WB2 MA3 WB3 EX4 MA4 WB4 ID5 EX5 MA5 WB5 12

5-Stage Pipelined Execution Resource Usage Diagram 0x4 Add

addr rdata

PC

we rs1 rs2 rd1 ws wd rd2 GPRs

IR

Inst. Memory

Resources

I-Fetch (IF)

1/28/2016

we addr rdata

ALU

Data Memory

Imm Select

wdata

Decode, Reg. Fetch Execute (ID) (EX)

time IF ID EX MA WB

t0 I1

t1 I2 I1

t2 I3 I2 I1

t3 I4 I3 I2 I1

CS152, Spring 2016

t4 I5 I4 I3 I2 I1

Write -Back (WB)

Memory (MA)

t5

t6

t7

....

I5 I4 I3 I2

I5 I4 I3

I5 I4

I5 13

Pipelined Execution: ALU Instructions 0x4

IR

Add

PC

addr

inst IR

Inst Memory

we rs1 rs2 rd1 wa wd rd2

GPRs

IR

IR

A ALU

Y

B

we addr rdata

Data Memory wdata

Imm Select

R

wdata MD1

MD2

Not quite correct! We need an Instruction Reg (IR) for each stage 1/28/2016

CS152, Spring 2016

14

Pipelined RISC-V Datapath without jumps F

D

E

M

IR

IR

W IR

0x4 Add

RegWriteEn

PC

addr

inst IR

Inst Memory

we rs1 rs2 rd1 wa wd rd2

ALU Control

MemWrite

A ALU

GPRs

Y

we addr rdata

B

Data Memory wdata

Imm Select

R

wdata MD1

ImmSel

WBSel

MD2

Op2Sel

Control Points Need to Be Connected 1/28/2016

CS152, Spring 2016

15

Instructions interact with each other in pipeline  An instruction in the pipeline may need a resource being used by another instruction in the pipeline  structural hazard  An instruction may depend on something produced by an earlier instruction – Dependence may be for a data value

 data hazard – Dependence may be for the next instruction’s address  control hazard (branches, exceptions)

1/28/2016

CS152, Spring 2016

16

Resolving Structural Hazards  Structural hazard occurs when two instructions need same hardware resource at same time – Can resolve in hardware by stalling newer instruction till older instruction finished with resource

 A structural hazard can always be avoided by adding more hardware to design – E.g., if two instructions both need a port to memory at same time, could avoid hazard by adding second port to memory

 Our 5-stage pipeline has no structural hazards by design – Thanks to RISC-V ISA, which was designed for pipelining

1/28/2016

CS152, Spring 2016

17

Data Hazards x1 …

x4  x1 … 0x4

IR

Add

PC

addr

inst IR

Inst Memory

we rs1 rs2 rd1 wa wd rd2

GPRs

A ALU

Y

B

rdata

R

wdata

wdata MD1

1/28/2016

we addr

Data Memory

Imm Select

... x1  x0 + 10 x4  x1 + 17 ...

IR

IR

MD2

x1 is stale. Oops! CS152, Spring 2016

18

How Would You Resolve This?  What can you do if your project buddy has not completed their part, which you need to start?  Three options – – – –

1/28/2016

Wait (stall) Speculate on the value of the deliverable Bypass: ask them for what you need before his/her final deliverable (ask him/her to work harder?)

CS152, Spring 2016

19

Resolving Data Hazards (1)

Strategy 1: Wait for the result to be available by freezing earlier pipeline stages  interlocks

1/28/2016

CS152, Spring 2016

20

Feedback to Resolve Hazards

FB1

FB2

stage 1

FB3

stage 2

FB4

stage 3

stage 4

 Later stages provide dependence information to earlier stages which can stall (or kill) instructions • Controlling a pipeline in this manner works provided the instruction at stage i+1 can complete without any interference from instructions in stages 1 to i (otherwise deadlocks may occur) 1/28/2016

CS152, Spring 2016

21

Interlocks to resolve Data Hazards Stall Condition

bubble

0x4 Add

PC

addr

inst IR

Inst Memory

... x1  x0 + 10 x4  x1 + 17 ... 1/28/2016

we rs1 rs2 rd1 wa wd rd2

GPRs

IR

A ALU

Y

B

we addr rdata

Data Memory

Imm Select

R

wdata

wdata MD1

CS152, Spring 2016

IR

IR

MD2

22

Stalled Stages and Pipeline Bubbles time t0 t1 t2 t3 t4 t5 (I1) x1 (x0) + 10 IF1 ID1 EX1 MA1 WB1 (I2) x4 (x1) + 17 IF2 ID2 ID2 ID2 ID2 (I3) IF3 IF3 IF3 IF3 (I4) stalled stages (I5)

Resource Usage

IF ID EX MA WB

time t0 t1 I1 I2 I1

t2 I3 I2 I1

t3 I3 I2 I1

t4 I3 I2 I1

t5 I3 I2 -

t6

CS152, Spring 2016

....

EX2 MA2 WB2 ID3 EX3 MA3 WB3 IF4 ID4 EX4 MA4 WB4 IF5 ID5 EX5 MA5 WB5

t6 I4 I3 I2 -

-  1/28/2016

t7

t7 I5 I4 I3 I2 -

.... I5 I4 I3 I2

I5 I4 I3

I5 I4

I5

pipeline bubble 23

Interlock Control Logic stall

wa Cstall rs2 ? rs1

bubble

0x4 Add

PC

addr

inst IR

Inst Memory

we rs1 rs2 rd1 wa wd rd2

GPRs

IR

IR

IR

A ALU

Y

B

we addr rdata

Data Memory

Imm Select

R

wdata

wdata MD1

MD2

Compare the source registers of the instruction in the decode stage with the destination register of the uncommitted instructions. 1/28/2016

CS152, Spring 2016

24

Interlock Control Logic ignoring jumps & branches stall

wa we

Cstall rs1 ? rs2 re1 re2 Cre

0x4 Add

PC

addr

inst IR

Inst Memory

Cdest

Cdest

bubble

we rs1 rs2 rd1 wa wd rd2

GPRs

wa

wa we

we

IR

IR

IR

A ALU

Y

B

we addr rdata

Data Memory wdata

Imm Select

R

wdata MD1

MD2

Should we always stall if an rs field matches some rd? not every instruction writes a register => we not every instruction reads a register => re 1/28/2016

CS152, Spring 2016

25

Source & Destination Registers func7

rs2

rs1

func3

rd

opcode

immediate12

rs1

func3

rd

opcode

imm

rs1

func3 imm opcode

rs2

Jump Offset[19:0]

ALU ALUI LW SW Bcond

JAL JALR 1/28/2016

rd

rd
View more...

Comments

Copyright © 2017 NANOPDF Inc.
SUPPORT NANOPDF