

# EITF20: Computer Architecture Part3.2.1: Pipeline - 3

Liang Liu liang.liu@eit.lth.se



Lund University / EITF20/ Liang Liu 2016

#### Outline

- Reiteration
- Dynamic scheduling Tomasulo
- Superscalar, VLIW
- Speculation
- ILP limitations
- What we have done so far (pipeline)



# Instruction level parallelism (ILP)

- ILP: Overlap execution of unrelated instructions: Pipelining
- Pipeline CPI = Ideal CPI + Structural stalls + Data hazard stalls + Control stalls
- Two main approaches:
  - DYNAMIC  $\Rightarrow$  hardware detects parallelism
  - STATIC  $\Rightarrow$  software detects parallelism
  - Often a mix between both



## **Dependencies**

- Two instructions must be independent in order to execute in parallel
- There are three general types of dependencies that limit parallelism:
  - Data dependencies (RAW)
  - Name dependencies (WAR, WAW)
  - Control dependencies
- Dependencies are properties of the program
- Whether a dependency leads to a hazard or not is a property of the pipeline implementation



# **Dynamic branch prediction**

#### Branches limit performance because:

- Branch penalties
- Limit to available Instruction Level Parallelism

#### Dynamic branch prediction to predict the outcome of conditional branches

- Branch history table
- Branch target buffer

#### Benefits:

- Reduce the time to when the branch condition is known
- Reduce the time to calculate the branch target address



### **Scoreboard pipeline**

- **Scoreboarding** is to execute an instruction as early as possible
- Instructions execute out-of-order when there are sufficient resources and no data dependencies
- A scoreboard is a hardware unit that keeps track of
  - the instructions that are in the process of being executed
  - the functional units that are doing the executing
  - and the **registers** that will hold the results of those units
- A scoreboard centrally performs all hazard detection and instruction control





# **Scoreboard functionality**

#### **Issue:** An instruction is issued if:

- The needed functional unit is free (there is no **structural hazard**)
- No functional unit has a destination operand equal to the destination of the instruction (resolves WAW hazards)
- **Read:** Wait until no data hazards, then read operands
  - Performed in parallel for all functional units
  - Resolves RAW hazards dynamically
- **EX:** Normal execution
  - Notify the scoreboard when ready
- **Write:** The instruction can update destination if:
  - All earlier instructions have read their operands (resolves WAR hazards)



#### **Factors that limits performance**

#### The scoreboard technique is limited by:

- The amount of parallelism available in code
- The number of scoreboard entries (window size)
  - A large window can look ahead across more instructions
- The number and types of functional units
  - Contention for functional units leads to structural hazards
- The presence of anti-dependencies and output dependencies
  - These lead to WAR and WAW hazards that are handled by stalling the instruction in the Scoreboard
- Number of data-paths to registers

# Tomasulo's algorithm addresses the last two limitations.



### Outline

#### Reiteration

#### Dynamic scheduling - Tomasulo

- Superscalar, VLIW
- Speculation
- ILP limitations
- What we have done so far



#### **Tomasulo algorithm**

Another dynamic instruction scheduling algorithm

For IBM 360/91, a few years after the CDC 6600 (Scoreboard)

**Goal:** High performance without compiler support

R. M. Tomasulo

#### An Efficient Algorithm for Exploiting Multiple Arithmetic Units

Abstract: This paper describes the methods employed in the floating-point area of the System/360 Model 91 to exploit the existence of multiple execution units. Basic to these techniques is a simple common data busing and register tagging scheme which permits simultaneous execution of independent instructions while preserving the essential precedences inherent in the instruction stream. The common data bus improves performance by efficiently utilizing the execution units without requiring specially optimized code. Instead, the hardware, by 'looking ahead' about eight instructions, automatically optimizes the program execution on a local basis.

The application of these techniques is not limited to floating-point arithmetic or System/360 architecture. It may be used in almost any computer having multiple execution units and one or more 'accumulators.' Both of the execution units, as well as the associated storage buffers, multiple accumulators and input/output buses, are extensively checked.



# **Registr renaming**



#### Potential WAR harzard on F6

If ADDD finishes before DIVD starts

#### Register renaming

- Another temperoray register (FT) instead of F6
- Any subsequent uses of F6 should be replaced by FT (until the next wirte to F6)

#### Register renaming can be done:

- statically by the compiler
- dynamically by the hardware



## **Tomasulo orgnizations**



SIG \*

0

#### **Reservation stations**

- Op:Operation to perform (e.g., + or –)
- □ Vj, Vk: Value (instead of reg specifier) of Source operands
- Qj, Qk: Reservation stations (instead of FU) producing source registers (value to be written)
  - Note: Qj,Qk=0 => ready
  - V and Q filed are mutual exclusive
- Busy: Indicates reservation station or FU is busy
- Register result status—Indicates which RS will write each register
  - Blank when no pending instructions that will write that register

| Functional u | init stat | tus  |    | src 1 | src 2 | RS for j | RS for k |
|--------------|-----------|------|----|-------|-------|----------|----------|
| Time         | Name      | Busy | Ор | Vj    | Vk    | Qj       | Qk       |
|              | Add1      | No   |    |       |       |          |          |
|              | Add2      | No   |    |       |       |          |          |
|              | Add3      | No   |    |       |       |          |          |
|              | Mult1     | No   |    |       |       |          |          |
|              | Mult2     | No   |    |       |       |          |          |



# Three stages of Tomasulo algorithm

- Issue get instruction from instruction Queue
  - If matching reservation station free (no structural hazard)
  - Instruction is issued together with its operands values or RS point (register rename, handle WAR, WAW)
- **Execution** operate on operands (EX)
  - When both operands are ready, then execute (handle RAW)
  - If not ready, watch Common Data Bus (CDB) for operands (snooping)
- Write result finish execution (WB)
  - Write on CDB to all awaiting RS, regs (forwarding)
  - Mark reservation station available
  - Data Bus
    - Normal Bus: data + destination
    - Common Data Bus: data + source (snooping)





RVM·CARC

 $\langle \mathfrak{Q} \rangle$ 





| Instruction statusExec. WriteInstructionjkIssue compl. resultLoad buffersLDF6 $34+$ R2134LDF2 $45+$ R324Ioad1MULTD F0F2F43Ioad2VesSUBDF8F6F24Ioad3NoDIVDF10F0F6F2Ioad3NoEunctional unit statussrc 1src 2RS for j RS for kTimeNameBusyOpVjVkQjAdd1YesSubM(R2+34)Load2Add3NoNoIoad2No                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| $ \begin{array}{c ccccccccccccccccccccccccccccccccccc$                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
| MULTD F0F2F430Load2YesR3+45SUBD F8F6F24Load3NoDIVD F10F0F6F6NoADDD F6F8F2Image: Strick 1 strick 2 rest of the strick of the s |
| SUBD F8 F6 F2 4 Load3 No   DIVD F10 F0 F6 F6 ADDD F6 F8 F2 Image: Single state                                                |
| DIVD F10 F0 F6<br>ADDD F6 F8 F2<br><u>Functional unit status</u> src 1 src 2 RS for j RS for k<br>Time Name Busy Op Vj Vk Qj Qk<br>Add1 Yes Sub M(R2+34) Load2<br>Add2 No                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     |
| ADDD F6 F8 F2   Functional unit status src 1 src 2 RS for j RS for k   Time Name Busy Op Vj Vk Qj Qk   Add1 Yes Sub M(R2+34) Load2   Add2 No No No                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
| Functional unit status src 1 src 2 RS for j RS for k   Time Name Busy Op Vj Vk Qj Qk   Add1 Yes Sub M(R2+34) Load2   Add2 No No                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
| Time Name Busy Op Vj Vk Qj Qk<br>Add1 Yes Sub M(R2+34) Load2<br>Add2 No                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
| Time Name Busy Op Vj Vk Qj Qk<br>Add1 Yes Sub M(R2+34) Load2<br>Add2 No                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
| Add1YesSubM(R2+34)Load2Add2No                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 |
| Add2 No                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
| Add3 No                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
| Mult1 Yes Mult F4 Load2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       |
| Mult2 No                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
| Register result status                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
| F0 F2 F4 F6 F8 F10 F30                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                        |
| FU Mult1 Load2 - Add1                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         |
| Clock: 4                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
|                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |

|          |          |         |         |       | _        |          |          |          |        |        |                                        |
|----------|----------|---------|---------|-------|----------|----------|----------|----------|--------|--------|----------------------------------------|
| Instruc  | ction s  | tatus   |         |       | Exec.    | Write    |          |          |        |        |                                        |
| Instruct | ion      | j       | k       | Issue | compl.   | result   |          |          | Load I | buffer | S                                      |
| LD       | F6       | 34+     | R2      | 1     | 3        | 4        |          | Time     |        | Busy   | Address                                |
| LD       | F2       | 45+     | Rз      | 2     | 4        | 5        |          |          | Load1  | No     |                                        |
| MULTD    | FO       | F2      | F4      | 3     |          |          |          |          | Load2  | No     |                                        |
| SUBD     | F8       | F6      | F2      | 4     |          |          |          |          | Load3  | No     |                                        |
| DIVD     | F10      | FO      | F6      | 5     |          |          |          |          |        |        |                                        |
| ADDD     | F6       | F8      | F2      |       |          |          |          |          |        |        |                                        |
|          |          |         |         |       |          |          |          |          |        |        |                                        |
| Functi   | onal u   | nit sta | tus     |       | src 1    | src 2    | RS for j | RS for k |        |        |                                        |
|          | Time     | Name    | Busy    | Ор    | Vj       | Vk       | Qj .     | Qk       |        |        |                                        |
|          | 2        | Add1    | Yes     | Sub   | M(R2+34) | M(R3+45) |          |          | 1      |        |                                        |
|          |          | Add2    | No      |       |          |          |          |          |        |        |                                        |
|          |          | Add3    | No      |       |          |          |          |          |        |        |                                        |
|          | 10       | Mult1   | Yes     | Mult  | M(R3+45) | F4       | -        |          |        |        |                                        |
|          |          | Mult2   | Yes     | Div   |          | F6       | Mult1    |          |        |        |                                        |
|          |          |         |         |       |          |          |          |          | 4      |        |                                        |
| Registe  | r result | status  |         |       |          |          |          |          |        |        |                                        |
|          |          |         | -<br>F0 | F2    | F4       | F6       | F8       | F10      |        | F30    |                                        |
|          |          | FU      | Mult1   | -     |          |          | Add1     | Mult2    |        |        | 1                                      |
| Clock:   | 5        |         |         |       |          |          |          |          |        |        | -                                      |
|          |          |         |         |       |          |          |          |          |        |        | 22                                     |
|          |          |         |         |       |          |          |          |          |        |        | $\left  \overline{S} \right ^{\nabla}$ |

NN V

| Instrue    | ction s     | tatus      |           |       | Exec.    | Write    |          |          |        |         |                                          |
|------------|-------------|------------|-----------|-------|----------|----------|----------|----------|--------|---------|------------------------------------------|
| Instruct   | tion        | i          | k         | Issue | compl.   | result   |          |          | Load b | ouffers | S                                        |
| LD         | F6          | 34+        | R2        | 1     | 3        | 4        |          | Time     | 10     | Busy    | Address                                  |
| LD         | F2          | 45+        | R3        | 2     | 4        | 5        |          |          | Load1  | No      |                                          |
| MULTD      | FO          | F2         | F4        | 3     |          |          |          |          | Load2  | No      |                                          |
| SUBD       | F8          | F6         | F2        | 4     | 7        |          |          |          | Load3  | No      |                                          |
| DIVD       | F10         | FO         | F6        | 5     |          |          |          |          |        |         |                                          |
| ADDD       | F6          | F8         | F2        | 6     |          |          |          |          |        |         |                                          |
| Functi     | onal u      | nit stat   | tus       |       | src 1    | src 2    | RS for j | RS for k | c      |         |                                          |
|            | Time        | Name       | Busy      | Ор    | Vj       | Vk       | Qj       | Qk       |        |         |                                          |
|            | 0           | Add1       | Yes       | Sub   | M(R2+34) | M(R3+45) |          |          |        |         |                                          |
|            |             | Add2       | Yes       | Add   |          | F2       | Add1     |          |        |         |                                          |
|            |             | Add3       | No        |       |          |          |          |          |        |         |                                          |
|            | 8           | Mult1      | Yes       | Mult  | M(R3+45) | F4       |          |          |        |         |                                          |
|            |             | Mult2      | Yes       | Div   |          | F6       | Mult1    |          |        |         |                                          |
| Registe    | r result    | status     |           |       |          |          |          |          | -      |         |                                          |
|            |             |            | FO        | F2    | F4       | F6       | F8       | F10      |        | F30     |                                          |
|            |             | FU         | Mult1     |       |          | Add2     | Add1     | Mult2    |        |         | 1                                        |
| Clock:     | 7           |            |           |       |          |          |          |          |        |         | •                                        |
|            |             |            |           |       |          |          |          |          |        |         | N. N |
| Lund Unive | ersity / El | ITF20/ Lia | ng Liu 20 | 16    |          |          |          |          |        |         | WILL                                     |

| Instruc  | ction s  | tatus   |       |       | Exec.    | Write  |         |           |       |         |                                           |
|----------|----------|---------|-------|-------|----------|--------|---------|-----------|-------|---------|-------------------------------------------|
| Instruct | tion     | j       | k     | lssue | compl.   | result |         |           | Load  | buffers | 3                                         |
| LD       | F6       | 34+     | R2    | 1     | 3        | 4      | I       | Time      |       | Busy    | Address                                   |
| LD       | F2       | 45+     | Rз    | 2     | 4        | 5      |         |           | Load1 | No      |                                           |
| MULTD    | FO       | F2      | F4    | 3     |          |        |         |           | Load2 | No      |                                           |
| SUBD     | F8       | F6      | F2    | 4     | 7        | 8      |         |           | Load3 | No      |                                           |
| DIVD     | F10      | FO      | F6    | 5     |          |        |         |           |       |         |                                           |
| ADDD     | F6       | F8      | F2    | 6     | 10       |        |         |           |       |         |                                           |
| Functi   |          | nit etc |       |       | ere 1    | ero 2  | DO (and | DO (an la |       |         |                                           |
| Functi   |          |         |       |       | src 1    | src 2  |         | RS for k  | Ca    | n we v  | vrite the                                 |
|          | Time     | Name    | Busy  | Ор    | Vj       | Vk     | Qj      | Qk        | res   | ult of  | ADDD?                                     |
|          |          | Add1    | No    |       |          |        |         |           |       |         |                                           |
|          | 0        | Add2    | Yes   | Add   | F6-F2    | F2     |         |           |       |         |                                           |
|          |          | Add3    | No    |       |          |        |         |           |       |         |                                           |
|          | 5        | Mult1   | Yes   | Mult  | M(R3+45) | F4     |         |           |       |         |                                           |
|          |          | Mult2   | Yes   | Div   |          | F6     | Mult1   |           |       |         |                                           |
|          |          |         |       |       |          |        |         |           |       |         |                                           |
| Registe  | r result | status  | _     |       |          |        |         |           |       |         |                                           |
|          |          |         | F0    | F2    | F4       | F6     | F8      | F10       |       | F30     |                                           |
|          |          | FU      | Mult1 |       |          | Add2   |         | Mult2     |       |         |                                           |
| Clock:   | 10       |         |       |       |          |        |         |           |       |         |                                           |
|          |          |         |       |       |          |        |         |           |       |         | $\left  \overline{\mathcal{A}} \right  =$ |

N7

# **Elimation of WAR hazard**



ADDD can safely finish before DIVD has read register F6 because:

- DIVD has renamed register F6 to the reservation station
- LD broadcasts its result on the Common Data Bus



| Instruc  | ction s  | tatus   |       |       | Exec.    | Write  |          |          |        |         |         |
|----------|----------|---------|-------|-------|----------|--------|----------|----------|--------|---------|---------|
| Instruct | tion     | j       | k     | Issue | compl.   | result |          |          | Load I | buffers | S       |
| LD       | F6       | 34+     | R2    | 1     | 3        | 4      | I        | Time     |        | Busy    | Address |
| LD       | F2       | 45+     | Rз    | 2     | 4        | 5      |          |          | Load1  | No      |         |
| MULTD    | FO       | F2      | F4    | 3     |          |        |          |          | Load2  | No      |         |
| SUBD     | F8       | F6      | F2    | 4     | 7        | 8      |          |          | Load3  | No      |         |
| DIVD     | F10      | FO      | F6    | 5     |          |        |          |          |        |         |         |
| ADDD     | F6       | F8      | F2    | 6     | 10       | 11     |          |          |        |         |         |
|          |          |         |       |       |          |        | -        |          |        |         |         |
| Functi   | onal u   | nit sta | tus   |       | src 1    | src 2  | RS for j | RS for k | ζ.     |         |         |
|          | Time     | Name    | Busy  | Ор    | Vj       | Vk     | Qj       | Qk       |        |         |         |
|          |          | Add1    | No    |       |          |        |          |          | ]      |         |         |
|          |          | Add2    | No    |       |          |        |          |          |        |         |         |
|          |          | Add3    | No    |       |          |        |          |          |        |         |         |
|          | 4        | Mult1   | Yes   | Mult  | M(R3+45) | F4     |          |          |        |         |         |
|          |          | Mult2   | Yes   | Div   |          | F6     | Mult1    |          |        |         |         |
|          |          |         |       |       |          |        |          |          | -      |         |         |
| Registe  | r result | status  | _     |       |          |        |          |          |        |         |         |
|          |          |         | FO    | F2    | F4       | F6     | F8       | F10      |        | F30     | _       |
|          |          | FU      | Mult1 |       |          |        |          | Mult2    |        |         | ]       |
| Clock:   | 11       |         |       |       |          |        |          |          |        |         | 101     |

| Instruc  | ction s       | status   |           |       | Exec.  | Write  |        |            |        |         |         |
|----------|---------------|----------|-----------|-------|--------|--------|--------|------------|--------|---------|---------|
| Instruct | ion           | i        | k         | Issue | compl. | result |        |            | Load I | buffers | 5       |
| LD       | F6            | 34+      | R2        | 1     | 3      | 4      | I      | Time       |        | Busy    | Address |
| LD       | F2            | 45+      | Rз        | 2     | 4      | 5      |        |            | Load1  | No      |         |
| MULTD    | FO            | F2       | F4        | 3     | 15     | 16     |        |            | Load2  | No      |         |
| SUBD     | F8            | F6       | F2        | 4     | 7      | 8      |        |            | Load3  | No      |         |
| DIVD     | F10           | FO       | F6        | 5     |        |        |        |            |        |         |         |
| ADDD     | F6            | F8       | F2        | 6     | 10     | 11     |        |            |        |         |         |
|          |               |          |           |       |        |        | -      |            |        |         |         |
| Functi   | <u>onal u</u> | init sta | tus       |       | src 1  | src 2  | RS for | j RS for k | C C    |         |         |
|          | Time          | Name     | Busy      | Ор    | Vj     | Vk     | Qj     | Qk         |        |         |         |
|          |               | Add1     | No        |       |        |        |        |            |        |         |         |
|          |               | Add2     | No        |       |        |        |        |            |        |         |         |
|          |               | Add3     | No        |       |        |        |        |            |        |         |         |
|          |               | Mult1    | No        |       |        |        |        |            |        |         |         |
|          | 40            | Mult2    | Yes       | Div   | FO     | F6     | -      |            |        |         |         |
|          |               |          |           |       |        |        |        |            | -      |         |         |
| Registe  | r result      | t status | _         |       |        |        |        |            |        |         |         |
|          |               |          | FO        | F2    | F4     | F6     | F8     | F10        |        | F30     | _       |
|          |               | FU       | -         |       |        |        |        | Mult2      |        |         | ]       |
| Clock:   | 16            |          |           |       |        |        |        |            |        |         |         |
|          |               |          |           |       |        |        |        |            |        |         | U V     |
|          | voreity       |          | ionaliu   | 2016  |        |        |        |            |        |         | WARO    |
| Lund Uni | versity /     | EIIF20/  | Liang Liu | 2010  |        |        |        |            |        |         | 18 VIII |

25

| Instruc   | tion s    | tatus     |           |       | Exec.  | Write  |          |          |       |         |         |
|-----------|-----------|-----------|-----------|-------|--------|--------|----------|----------|-------|---------|---------|
| Instructi | ion       | j         | k         | lssue | compl. | result |          |          | Load  | buffers | 3       |
| LD        | F6        | 34+       | R2        | 1     | 3      | 4      | I        | Time     |       | Busy    | Address |
| LD        | F2        | 45+       | Rз        | 2     | 4      | 5      |          |          | Load1 | No      |         |
| MULTD     | FO        | F2        | F4        | 3     | 15     | 16     |          |          | Load2 | No      |         |
| SUBD      | F8        | F6        | F2        | 4     | 7      | 8      |          |          | Load3 | No      |         |
| DIVD      | F10       | FO        | F6        | 5     | 56     | 57     |          |          |       |         |         |
| ADDD      | F6        | F8        | F2        | 6     | 10     | 11     |          |          |       |         |         |
|           |           |           |           |       |        |        | -        |          |       |         |         |
| Function  | onal u    | nit sta   | tus       |       | src 1  | src 2  | RS for j | RS for k | C C   |         |         |
|           | Time      | Name      | Busy      | Ор    | Vj     | Vk     | Qj       | Qk       | _     |         |         |
|           |           | Add1      | No        |       |        |        |          |          | ]     |         |         |
|           |           | Add2      | No        |       |        |        |          |          |       |         |         |
|           |           | Add3      | No        |       |        |        |          |          |       |         |         |
|           |           | Mult1     | No        |       |        |        |          |          |       |         |         |
|           |           | Mult2     | No        |       |        |        |          |          |       |         |         |
|           |           |           |           |       |        |        |          |          | -     |         |         |
| Register  | r result  | status    |           |       |        |        |          |          |       |         |         |
|           |           |           | F0        | F2    | F4     | F6     | F8       | F10      |       | F30     |         |
|           |           | FU        |           |       |        |        |          | -        |       |         | ]       |
| Clock:    | 57        |           |           |       |        |        |          |          |       |         | •       |
|           |           |           |           |       |        |        |          |          |       |         | U V     |
|           |           |           |           |       |        |        |          |          |       |         | N.W.    |
| Lund Univ | versity / | EITF20/ l | _iang Liu | 2016  |        |        |          |          |       |         |         |

### **Comparing: Scoreboard example, CP62**

| Instruc  | tion s        | <u>status</u> |      |       | Read | Exec.  | Write  |        |        |     |      |
|----------|---------------|---------------|------|-------|------|--------|--------|--------|--------|-----|------|
| Instruct | ion           | j             | k    | Issue | ops  | compl. | result | _      |        |     |      |
| LD       | F6            | 34+           | R2   | 1     | 2    | 3      | 4      | 1      |        |     |      |
| LD       | F2            | 45+           | R3   | 5     | 6    | 7      | 8      |        |        |     |      |
| MULTD    | F0            | F2            | F4   | 6     | 9    | 19     | 20     |        |        |     |      |
| SUBD     | F8            | F6            | F2   | 7     | 9    | 11     | 12     |        |        |     |      |
| DIVD     | F10           | FO            | F6   | 8     | 21   | 61     | 62     |        |        |     |      |
| ADDD     | F6            | F8            | F2   | 13    | 14   | 16     | 22     | 1      |        |     |      |
|          |               |               |      |       |      |        |        |        |        |     |      |
| Function | <u>onal ι</u> | unit stat     | us   |       | dest | src 1  | src 2  | FUsrc1 | FUsrc2 | Fj? | Fk?  |
|          | Time          | Name          | Busy | Ор    | Fi   | Fj     | Fk     | Qj     | Qk     | Rj  | Rk   |
|          |               | Integer       | No   |       |      |        |        |        |        |     |      |
|          |               | Mult1         | No   |       |      |        |        |        |        |     |      |
|          |               | Mult2         | No   |       |      |        |        |        |        |     |      |
|          |               | Add           | No   |       |      |        |        |        |        |     |      |
|          |               | Divide        | No   |       |      |        |        |        |        |     |      |
|          |               |               |      |       |      |        |        |        |        |     |      |
| Registe  | r resul       | t status      |      |       |      |        |        |        |        |     |      |
|          |               |               | F0   | F2    | F4   | F6     | F8     | F10    |        | F30 | -    |
|          |               | FU            |      |       |      |        |        | -      |        |     |      |
| Clock:   | 62            |               |      |       |      |        |        |        |        |     |      |
|          |               |               |      |       |      |        |        |        |        |     | 10%  |
|          |               |               |      |       |      |        |        |        |        |     | A A  |
|          |               |               |      |       |      |        |        |        |        |     | W.   |
|          |               |               |      |       |      |        |        |        |        |     | 1510 |

#### **Tomasulo vs Scoreboard**

**Differences between Tomasulo Algorithm and Scoreboard** 

- Control and buffers distributed with Function Units versus centralized in scoreboard
- Registers in instructions replaced by pointers to reservation stations
  - Register renaming, helps avoid WAR and WAW hazards
  - More reservation stations than registers; so allow optzns compilers can't do
  - Operands stays in register in Scoreboard (stall for WAR and WAW)

#### Common Data Bus broadcasts results to all FUs (forwarding!)



### **Tomasulo summary**

#### **Instructions:** move from decoder to reservation stations

- In program order
- Dependences can be correctly recorded
- Distributed harzard detection

#### Significant increase in HW cost

#### Benefits

- Register renaming, remove WAW, WAR hazard
- Out-of-order execution, completion
- Tolerates unpredictable delays (especially for cache)
- Compile for one pipeline run effectively on another



## Outline

Reiteration

Dynamic scheduling - Tomasulo

#### Superscalar, VLIW

- Speculation
- ILP limitations
- What we have done so far



# Getting CPI <1 !?

Issuing multiple (independent) instructions per clock cycle

Superscalar: varying number of instructions/cycle (1-8) scheduled by compiler or HW

- IBM Power5, Pentium 4, Sun SuperSparc, DEC Alpha
- Simple hardware, complicated compiler (static) or...
- Very complex hardware but simple for compiler (dynamic)
- Very Long Instruction Word (VLIW): fixed number of instructions (3-5) scheduled by the compiler
  - HP/Intel IA-64, Itanium
  - Simple hardware, difficult for compiler
  - High performance through extensive compiler optimization



# **Approches for multiple issuing**

|             | Issue   | Hazard detection | Scheduling    | Characteristics<br>/examples           |
|-------------|---------|------------------|---------------|----------------------------------------|
| Superscalar | dynamic | HW               | static        | in-order execution<br>ARM              |
| Superscalar | dynamic | HW               | dynamic       | out-of-order<br>execution              |
| Superscalar | dynamic | HW               | dynamic       | speculation<br>Pentium 4<br>IBM power5 |
| VLIW        | static  | compiler         | static        | TI C6x                                 |
| EPIC        | static  | compiler         | mostly static | Itanium                                |



### VLIW

- A number of functional units that independently execute instructions in parallel.
- The compiler decides which instructions can execute in parallel
- No hazard detection needed



#### **VLIW instuction format - Itanium**





#### **VLIW** architecture- Itanium



0

# **VLIW** limitations



#### Limited Instruction Level Parallelism

 With *n* functional units and *k* pipeline stages we need *n* \* *k* independent instructions to utilize the hardware

#### Memory and register bandwidth

• With increasing number of functional units, the number of ports needed at the memory or register file must increase to prevent structural hazards

#### Code size

- Compiler scheduled pipeline "bubbles" take up space in the instruction
- Need more aggressive loop unrolling to work well which also increases code size

#### No binary code compatibility

Different pipeline stages and # function units requires different code

#### Synchronization

Stall in one FU cause the entire processor to stall



# Outline

Reiteration

- Dynamic scheduling Tomasulo
- □ Superscalar, VLIW

#### Speculation

- ILP limitations
- What we have done so far



## Hardware-base speculation

Trying to exploit more ILP (e.g., multiple issue) while maintaining control dependencies becomes a burden

#### Overcome control dependencies

- By speculating on the outcome of branches and executing the program as if our guesses were correct
- Need to handle incorrect guesses

#### Methodologies:

- Dynamic branch prediction: allows instruction scheduling across branches (no exe)
- Dynamic scheduling: take advantage of ILP (wait for clear branch)
- <u>Speculation</u>: execute instructions before all control dependencies are resolved (basically data flow)



## Hardware vs software speculation

#### Advantages:

- Dynamic branch prediction is often better than static which limits the performance of SW speculation
- HW speculation can maintain a precise exception model
- Can achieve higher performance on older code (without recompilation)

#### Main disadvantage:

 Extremely complex implementation and extensive need for hardware resources



## Implementing speculation

#### 🗖 Key idea

- Allow instructions to execute out of order
- Force instructions to commit in order
- Prevent any irrevocable action (such as updating state or taking an exception) until an instruction commits

#### Strategies:

- Must separate bypassing of results among instructions from actual completion (write-back) of instructions
- Instruction commit updates register or memory when instruction no longer speculative

#### Need to add re-order buffer

 Hold the results of inst. that have finished exe but have not commiteed



#### **Tomasulo extended to support speculation**





## **ROB (reorder buffer)**

| entry | instruction type | destination | value | ready |
|-------|------------------|-------------|-------|-------|
| 1     |                  |             |       |       |
| 2     |                  |             |       |       |
|       |                  |             |       |       |
| n     |                  |             |       |       |

#### Contains 4 fields:

- Instruction type indicates whether branch, store, or register op
- **Destination field** memory or register
- Value field hold the inst. result until commit
- <u>Ready flag</u> indicates instruction has completed operation
- Every instruction has a ROB entry until it commits
  - Therefore tag results using ROB entry number
  - The renaming function of the reservation stations is partially replaced by the ROB



## **ROB (reorder buffer)**

#### **When MUL.D is ready to commit**

| L.D   | F6,32(R2) |
|-------|-----------|
| L.D   | F2,44(R3) |
| MUL.D | F0,F2,F4  |
| SUB.D | F8,F2,F6  |
| DIV.D | F10,F0,F6 |
| ADD.D | F6,F8,F2  |

| Reorder buffer |      |           |           |              |             |                    |
|----------------|------|-----------|-----------|--------------|-------------|--------------------|
| Entry          | Busy | Instructi | on        | State        | Destination | Value              |
| 1              | No   | L.D       | F6,32(R2) | Commit       | F6          | Mem[32 + Regs[R2]] |
| 2              | No   | L.D       | F2,44(R3) | Commit       | F2          | Mem[44 + Regs[R3]] |
| 3              | Yes  | MUL.D     | F0,F2,F4  | Write result | F0          | #2 × Regs [F4]     |
| 4              | Yes  | SUB.D     | F8,F2,F6  | Write result | F8          | #2 - #1            |
| 5              | Yes  | DIV.D     | F10,F0,F6 | Execute      | F10         |                    |
| 6              | Yes  | ADD.D     | F6,F8,F2  | Write result | F6          | #4 + #2            |

#### **Reservation stations**

| Name  | Busy | Ор    | Vj                 | Vk                 | Qj | Qk | Dest | Α |
|-------|------|-------|--------------------|--------------------|----|----|------|---|
| Load1 | No   |       |                    |                    |    |    |      |   |
| Load2 | No   |       |                    |                    |    |    |      |   |
| Add1  | No   |       |                    |                    |    |    |      |   |
| Add2  | No   |       |                    |                    |    |    |      |   |
| Add3  | No   |       |                    |                    |    |    |      |   |
| Mult1 | No   | MUL.D | Mem[44 + Regs[R3]] | Regs[F4]           |    |    | #3   |   |
| Mult2 | Yes  | DIV.D |                    | Mem[32 + Regs[R2]] | #3 |    | #5   |   |
|       |      |       |                    |                    |    |    |      |   |



## **Four-step speculation**

#### Issue:

- Get instruction from instruction queue and issue if reservation station and ROB slots available – sometimes called dispatch
- Send operands or ROB entry #

#### Execution – operate on operands (EX)

- If both operands ready: execute; if not, watch CDB for result;
- When both operands are in reservation station: execute

#### Write result – complete execution

- Write on CDB to all awaiting FUs (RSs) & ROB (tagged by ROB entry #)
- Mark reservation station available

#### Commit – update register with reorder result

- When instr. is at head of ROB & result is present & no longer speculative; update register with result (or store to memory) and remove instr. from ROB;
- handle mis-speculations and precise exceptions

#### **Four-step speculation**

#### Commit – branch prediction wrong

- When branch instr. is at head of ROB & incorrect prediction (or exception): remove all instr. from reorder buffer (flush); restart execution at correct instruction
- Expensive ⇒ try to recover as early as possible (delay in ROB)
- Performance sensitive to branch prediction/speculation (waste computation power & time if wrong)



## Sandy bridge microarchitecture





## Outline

Reiteration

- Dynamic scheduling Tomasulo
- Superscalar, VLIW
- Speculation
- ILP limitations

What we have done so far



# How much performance can be get from instruction-level parallelism?



47 Lund University / EITF20/ Liang Liu 2016

## A model of an idea processor

#### Assumptions for ideal/perfect machine to start:

- Register renaming: infinite virtual registers => all register WAW & WAR hazards are avoided
- Branch prediction: perfect; no mispredictions
- Jump prediction: all jumps perfectly predicted
- 2 & 3 => machine with perfect speculation & an unbounded buffer of instructions available
- *Memory*: addresses are known, perfect caches;

#### There are only true data dependencies left!

## **Upper limit to ILP (# of instruction per cycle)**



## Window size impact



#### **Branch impact**





### **Register # impact**



0

## Summary

Software (compiler) tricks:

- Loop unrolling
- Static instruction scheduling (with register renaming)
- ... and more

Hardware tricks:

- Dynamic instruction scheduling
- Dynamic branch prediction
- Multiple issue Superscalar, VLIW
- Speculative execution
- ... and more



## Outline

Reiteration

- Dynamic scheduling Tomasulo
- Superscalar, VLIW
- Speculation
- ILP limitations

#### What we have done so far



## **Summary pipeline - method**

| Dependency         | Hazard   | Method                      |  |  |
|--------------------|----------|-----------------------------|--|--|
| Data               | RAW      | Forwarding, Scheduling,     |  |  |
| Name               | WAR, WAW | Register Renaming           |  |  |
| Control            | Control  | Branch Prediction,          |  |  |
|                    |          | Speculation, Delayed branch |  |  |
| Precise exceptions |          | in-order commit             |  |  |
| ILP                |          | Scheduling,                 |  |  |
|                    |          | Loop unrolling              |  |  |



### **Basic 5-stage pipeline**





56 Lund University / EITF20/ Liang Liu 2016

## **Pipeline with several FUs**





### **Scoreboard pipeline**





## **Tomasulo pipeline**





## **Tomasulo pipeline with speculation**





## **Summary pipeline - implementation**

| Problem           | Simple               | Scoreboard | Tomasulo       | Tomasulo +<br>Speculation |
|-------------------|----------------------|------------|----------------|---------------------------|
|                   | Static Sch           | D          | ynamic Schedul | ing                       |
| RAW               |                      |            |                |                           |
|                   |                      |            |                | _                         |
| WAR               |                      |            |                |                           |
| WAW               |                      |            |                | _                         |
| Exceptions        |                      |            | I _            |                           |
| Issue             |                      |            |                |                           |
| Execution         |                      |            |                | _                         |
| Completion        |                      | 1          | ,              |                           |
| Structural        |                      |            |                | _                         |
| hazard            |                      |            |                |                           |
| Control           | -                    |            |                |                           |
| hazard            |                      | . ·        |                |                           |
|                   |                      |            |                | LA LA                     |
| Lund University / | EITF20/ Liang Liu 20 | 16         |                |                           |