## Lecture 7: EITF20 Computer Architecture

#### Anders Ardö

EIT - Electrical and Information Technology, Lund University

November 21, 2012

A. Ardö, EIT

Lecture 7: EITF20 Computer Architecture

November 21, 2012 1 / 42

#### Outline

- Reiteration

A. Ardö, EIT

Lecture 7: EITF20 Computer Architecture

# Memory hierarchy



AIM: Fast as cache; Large as disk; Cheap as possible

#### Who cares?

- 1980: no cache in microprocessors
- 1995: 2-level caches in a processor package
- 2000: 2-level caches on a processor die
- 2003: 3-level caches on a processor die



A. Ardö, EIT

Lecture 7: EITF20 Computer Architecture

November 21, 2012 3 / 42

A. Ardö, EIT

Lecture 7: EITF20 Computer Architecture

## IBM z196 5.2GHz microprocessor chip



A. Ardö, EIT

Lecture 7: EITF20 Computer Architecture

November 21, 2012

#### Cache measures

- hit rate = no of accesses that hit/no of accesses
  - close to 1, more convenient with
- miss rate = 1.0 hit rate
- hit time: cache access time plus time to determine hit/miss
- miss penalty: time to replace a block
  - measured in ns or number of clock cycles depends on:
    - latency: time to get first word
    - bandwidth: time to transfer block

out-of-order execution can hide some of the miss penalty

- Average memory access time
  - = hit time + miss rate \* miss penalty

# Why does caching work?

#### The Principle of Locality

- Temporal locality (Locality in Time): If an item is referenced, it will tend to be referenced again soon.
- Spatial locality (Locality in space): If an item is referenced, items whose addresses are close, tend to be referenced soon.



# 4 questions for the Memory Hierarchy

- Q1: Where can a block be placed in the upper level? (Block placement)
- Q2: How is a block found if it is in the upper level? (Block identification)
- Q3: Which block should be replaced on a miss? (Block replacement)
- Q4: What happens on a write? (Write strategy)

A. Ardö, EIT Lecture 7: EITF20 Computer Architecture November 21, 2012 7 / 42

A. Ardö, EIT

Lecture 7: EITF20 Computer Architecture

November 21, 2012 8 / 42

# Lecture 7 agenda

Chapter 5.2, (App C.2-C.3) in "Computer Architecture" Cache optimizations

- Reiteration
- Cache performance optimization
- Bandwidth increase
- Reduce hit time
- Reduce miss penalty
- Reduce miss rate
- Summary

A. Ardö, EIT

Lecture 7: EITF20 Computer Architecture

November 21, 2012 9 / 42

A. Ardö, EIT

Lecture 7: EITF20 Computer Architecture

November 21, 2012 10 / 42

# Cache Performance

Execution Time =

$$IC*(CPI_{execution} + \frac{\text{mem accesses}}{\text{instruction}}*\text{miss rate}*\text{miss penalty})*T_{C}$$

Three ways to increase performance:

- Reduce miss rate
- Reduce miss penalty
- Reduce hit time

However, remember:

A. Ardö, EIT

**Execution time is the only true measure!** 

- Cache performance optimization

# Cache optimizations

|               | Hit  | Band- | Miss    | Mi <b>s</b> s | HW         |
|---------------|------|-------|---------|---------------|------------|
|               | time | width | penalty | rate          | complexity |
| Simple        | +    |       |         | -             | 0          |
| Addr. transl. | +    |       |         |               | 1          |
| Way-predict   | +    |       |         |               | 1          |
| Trace         | +    |       |         |               | 3          |
| Pipelined     | -    | +     |         |               | 1          |
| Banked        |      | +     |         |               | 1          |
| Nonblocking   |      | +     | +       |               | 3          |
| Early start   |      |       | +       |               | 2          |
| Merging write |      |       | +       |               | 1          |
| Multilevel    |      |       | +       |               | 2          |
| Read priority |      |       | +       |               | 1          |
| Prefetch      |      |       | +       | +             | 2-3        |
| Victim        |      |       | +       | +             | 2          |
| Compiler      |      |       |         | +             | 0          |
| Larger block  |      |       | -       | +             | 0          |
| Larger cache  | -    |       |         | +             | 1          |
| Associativity | -    |       |         | +             | 1          |

Lecture 7: EITF20 Computer Architecture

November 21, 2012 11 / 42

A. Ardö, EIT Lecture 7: EITF20 Computer Architecture

## Outline

- Bandwidth increase

A. Ardö, EIT Lecture 7: EITF20 Computer Architecture November 21, 2012 13 / 42

# Bandwidth increase

- Pipelined cache
  - greater penalty on misspredicted branches
- Multibanked caches
  - works best when even spread of accesses across banks
  - sequential interleaving

# Cache optimizations

|               | Hit<br>time | Band-<br>width | Miss penalty | Miss<br>rate | HW complexity |
|---------------|-------------|----------------|--------------|--------------|---------------|
| Simple        | +           |                |              | -            | 0             |
| Addr. transl. | +           |                |              |              | 1             |
| Way-predict   | +           |                |              |              | 1             |
| Trace         | +           |                |              |              | 3             |
| Pipelined     | -           | +              |              |              | 1             |
| Banked        |             | +              |              |              | 1             |
| Nonblocking   |             | +              | +            |              | 3             |
| Early start   |             |                | +            |              | 2             |
| Merging write |             |                | +            |              | 1             |
| Multilevel    |             |                | +            |              | 2             |
| Read priority |             |                | +            |              | 1             |
| Prefetch      |             |                | +            | +            | 2-3           |
| Victim        |             |                | +            | +            | 2             |
| Compiler      |             |                |              | +            | 0             |
| Larger block  |             |                | -            | +            | 0             |
| Larger cache  | -           |                |              | +            | 1             |
| Associativity | -           |                |              | +            | 1             |

Outline

- Reduce hit time

## Cache optimizations

|               | Hit<br>time | Band-<br>width | Miss penalty   | Miss<br>rate   | HW complexity |
|---------------|-------------|----------------|----------------|----------------|---------------|
| Simple        | +           |                |                | -              | 0             |
| Addr. transl. | +           |                |                |                | 1             |
| Way-predict   | +           |                |                |                | 1             |
| Trace         | +           |                |                |                | 3             |
| Pipelined     | -           | +              |                |                | 1             |
| Banked        |             | +              |                |                | 1             |
| Nonblocking   |             | +              | +              |                | 3             |
| Early start   |             |                | +              |                | 2             |
| Merging write |             |                | +              |                | 1             |
| Multilevel    |             |                | +              |                | 2             |
| Read priority |             |                | +              |                | 1             |
| Prefetch      |             |                | +              | +              | 2-3           |
| Victim        |             |                | +              | +              | 2             |
| Compiler      |             |                |                | +              | 0             |
| Larger block  |             |                | -              | +              | 0             |
| Larger cache  | -           |                |                | +              | 1             |
| Associativity | -           |                |                | +              | 1             |
| A. Ardö, EIT  |             | Lecture 7:     | EITF20 Compute | r Architecture | November 21   |

## Reduce hit time 2: Address translation

 Processor uses virtual addresses (VA) while caches and main memory use physical addresses (PA)



Use the virtual address to index the cache in parallel



# Reduce hit time 1: KISS (Keep It Simple, Stupid)

Hit time critical since it affects clock rate.

- Smaller and simpler is faster:
  - Fits on-chip (→ avoids off-chip time penalties)
  - A direct mapped cache allows data fetch and tag check to proceed in parallel



#### Reduce hit time 3: Address translation

Use virtual addresses to both index cache and tag check



Processes have different virtual address spaces

- Flush cache between context switches or
- Extend tag with process-identifier tag (PID)

Two virtual addresses may map to the same physical address – synonyms or aliases

- HW guarantees that every cache block have a unique physical address
- Some OS uses "page colouring" to ensure that the cache index is unique

A. Ardő, EIT Lecture 7: EITF20 Computer Architecture November 21, 2012 19 / 42 A. Ardő, EIT Lecture 7: EITF20 Computer Architecture November 21, 2012 20 / 42

#### Reduce hit time 4: trace caches

- Trace caches
  - Dynamically find a sequence of executed instructions (including taken branches) to make up a cache block
  - Complicated address mapping
  - Avoids header and trailer overhead of normal cache schemes
  - The same instruction can be stored several times in the cache (different traces)
  - Used in Intel NetBurst (P4)

A. Ardö, EIT

Lecture 7: EITF20 Computer Architecture

November 21, 2012 21 / 42

A. Ardö, EIT

Lecture 7: EITF20 Computer Architecture

November 21, 2012 22 / 42

# Outline

- Reduce miss penalty

#### Reduce hit time: various tricks

- Way prediction combines the speed of a direct mapped cache with the conflict miss reductions of set-associative caches
  - Hardware prediction of which block in the set that will be accessed next (Compare branch prediction)
  - If prediction correct low latency otherwise longer latency
  - Prediction accurancy  $\approx$  85 %
- Pseudo associativity (i) check first a direct mapped entry, if miss (ii) check a second entry
  - Two hit times (fast, slow)
  - If hit rate in the fast entry is low, the performance will be dictated by the slow hit time

#### Cache optimizations

|               | Hit<br>time | Band-<br>width | Miss penalty | Miss<br>rate | HW complexity |
|---------------|-------------|----------------|--------------|--------------|---------------|
| Simple        | +           |                | politality   | -            | 0             |
| Addr. transl. | +           |                |              |              | 1             |
| Way-predict   | +           |                |              |              | 1             |
| Trace         | +           |                |              |              | 3             |
| Pipelined     | -           | +              |              |              | 1             |
| Banked        |             | +              |              |              | 1             |
| Nonblocking   |             | +              | +            |              | 3             |
| Early start   |             |                | +            |              | 2             |
| Merging write |             |                | +            |              | 1             |
| Multilevel    |             |                | +            |              | 2             |
| Read priority |             |                | +            |              | 1             |
| Prefetch      |             |                | +            | +            | 2-3           |
| Victim        |             |                | +            | +            | 2             |
| Compiler      |             |                |              | +            | 0             |
| Larger block  |             |                | -            | +            | 0             |
| Larger cache  | -           |                |              | +            | 1             |
| Associativity | -           |                |              | +            | 1             |

A. Ardö, EIT Lecture 7: EITF20 Computer Architecture November 21, 2012 23 / 42

A. Ardö, EIT

Lecture 7: EITF20 Computer Architecture

November 21, 2012 24 / 42

# Reduce miss penalty 1: Multilevel caches

#### The more the merrier

Use several levels of cache memory:

- The 1st level cache fast and small ⇒ on-chip
- 2nd level cache can be made much larger and set-associative to reduce capacity and conflict misses
- ... and so on for 3rd and 4th level caches

#### On-chip or Off-chip?

Today 3 levels on-chip; 4th comming

#### Performance:

- Average Access time<sub>| 1</sub> = Hit<sub>| 1</sub> + Miss rate<sub>| 1</sub> \* Miss penalty<sub>| 1</sub>
- Miss penalty<sub>1 1</sub> = Hit\_block<sub>1 2</sub> + Miss rate<sub>1 2</sub> \* Miss penalty<sub>1 2</sub>

Design issue: multilevel inclusion or exclusion?

A. Ardö, EIT

Lecture 7: EITF20 Computer Architecture

November 21, 2012 25 / 42

## Multilevel caches: examples

|                |     | C      | ache | SP | EC   |      |
|----------------|-----|--------|------|----|------|------|
| CPU            | CP  | L1     | L2   | L3 | Int  | FP   |
|                | GHz | KB     | KB   | MB |      |      |
| FX-51          | 2.2 | 64+64  | 1024 | -  | 1376 | 1371 |
| Itanium 2      | 1.5 | 16+16  | 256  | 6  | 1322 | 2119 |
| Pentium 4      | 3.2 | 12+8   | 512  | -  | 1221 | 1271 |
| (Pentium 4 EE) | 3.2 | 12+8   | 512  | 2  | 1583 | 1475 |
| Core i7        | 3.5 | 32+32  | 256  | 8  |      |      |
| Phenom II      | 3   | 128    | 512  | 8  |      |      |
| AMD Bulldozer  | 4   | 16+64  | 2048 | 8  |      |      |
| IBM z196       | 5.2 | 64+128 | 1536 | 24 |      |      |

#### Multilevel caches: execution time



## Reduce miss penalty 2: Write buffers, Read priority

#### Preference, Companionship

- Giving priority to read misses over writes
- Merging write buffer

|     |            | _ |          |   |          |   |          |   |          |
|-----|------------|---|----------|---|----------|---|----------|---|----------|
| Wri | te address | V |          | V |          | V |          | V |          |
|     | 100        | 1 | Mem[100] | 0 |          | 0 |          | О |          |
|     | 108        | 1 | Mem[108] | 0 |          | 0 |          | О |          |
|     | 116        | 1 | Mem[116] | 0 |          | 0 |          | О |          |
|     | 124        | 1 | Mem[124] | 0 |          | 0 |          | О |          |
|     |            |   |          |   |          |   |          |   | -        |
| Wri | te address |   |          | V |          | V |          | V |          |
|     | 100        | 1 | Mem[100] | 1 | Mem[108] | 1 | Mem[116] | 1 | Mem[124] |
|     |            | 0 |          | 0 |          | О |          | О |          |
|     |            | О |          | 0 |          | О |          | О |          |
|     |            | О |          | 0 |          | 0 |          | О |          |

A. Ardö, EIT

November 21, 2012 Lecture 7: EITF20 Computer Architecture

## Reduce miss penalty 3: other tricks

#### **Impatience**

- Early restart and critical word first
  - Early restart fetch words in normal order but restart processor as soon as requested word has arrived
  - Critical word first fetch the requested word first. Overlap CPU execution with filling the rest of the cache block

Increases perfomance mainly with large block sizes.

A. Ardö, EIT

Lecture 7: EITF20 Computer Architecture

A. Ardö, EIT

Lecture 7: EITF20 Computer Architecture

November 21, 2012 30 / 42

## Reduce miss rate/penalty: prefetching

Goal: overlap execution with speculative prefetching to cache

- *Hardware prefetching* If there is a miss for block *X*, fetch also block X+1, X+2,... X+d
  - d=1 ⇒ one-block lookahead. Used in Alpha 21064, Pentium 4



- **Software prefetching** load data before it is needed:
  - Register prefetching
  - Cache prefetching requires special instructions (Used in Itanium)
  - Both requires lockup free (nonblocking) caches
  - Can be faulting or nonfaulting (virtual memory)

# Reduce miss penalty 4: Nonblocking caches

Nonblocking cache ≡ lockup-free cache

- (+) Permit other cache operations to proceed when a miss has occurred
- (+) May further lower the effective miss penalty if multiple misses can overlap
- (-) The cache has to book-keep all outstanding references -Increases cache controler complexity

Good for out-of-order pipelined CPUs

The presence of true data dependencies may limit performance

## Outline

- Reduce miss rate

# Cache optimizations

|               | Hit<br>time | Band-<br>width | Miss penalty    | Miss<br>rate | HW<br>complexity |
|---------------|-------------|----------------|-----------------|--------------|------------------|
| Simple        | +           |                |                 | -            | 0                |
| Addr. transl. | +           |                |                 |              | 1                |
| Way-predict   | +           |                |                 |              | 1                |
| Trace         | +           |                |                 |              | 3                |
| Pipelined     | -           | +              |                 |              | 1                |
| Banked        |             | +              |                 |              | 1                |
| Nonblocking   |             | +              | +               |              | 3                |
| Early start   |             |                | +               |              | 2                |
| Merging write |             |                | +               |              | 1                |
| Multilevel    |             |                | +               |              | 2                |
| Read priority |             |                | +               |              | 1                |
| Prefetch      |             |                | +               | +            | 2-3              |
| Victim        |             |                | +               | +            | 2                |
| Compiler      |             |                |                 | +            | 0                |
| Larger block  |             |                | -               | +            | 0                |
| Larger cache  | -           |                |                 | +            | 1                |
| Associativity | -           |                |                 | +            | 1                |
| A. Ardö, EIT  |             | Lecture 7: I   | EITF20 Computer | Architecture | November 21, 2   |

#### Reduce misses 1: increase block size

- Increased block size utilises the spatial locality
- Too big blocks increases capacity miss rate
- Big blocks also increases miss penalty

Beware - impact on average memory access time



#### Reduce miss rate

#### The three C's:

- Compulsory misses in an infinite cache
- Capacity misses in a fully associative cache
- Conflict misses in an N-way associative cache

How do we reduce the number of misses?

- Change cache size?
- Change block size?
- Change associativity?
- Change compiler?
- Other tricks!

#### Which of the three C's are affected?

A. Ardö, EIT

A. Ardö, EIT

Lecture 7: EITF20 Computer Architecture

November 21, 2012 34 / 42

## Reduce misses 2: change associativity

**Rule of thumb**: A direct mapped cache of size N has the same miss rate as a 2-way set associative cache of size N/2



A. Ardö, EIT Lecture 7: EITF20 Computer Architecture

## Reduce misses 3: Compiler optimizations

#### Basic idea: Reorganize code to increase locality!

• Merging arrays:

| Before         | After                      |
|----------------|----------------------------|
| int val[SIZE]; | Struct merge { int val;    |
| int key[SIZE]; | int key; }                 |
|                | struct merge newarr[SIZE]; |

Loop interchange:

| Before                                                                           | After                                |
|----------------------------------------------------------------------------------|--------------------------------------|
| for (j=0; j <n; j++)<="" td=""><td>for (i=0; i<n; i++)<="" td=""></n;></td></n;> | for (i=0; i <n; i++)<="" td=""></n;> |
| for (i=0; i <n; i++)<="" td=""><td>for (j=0; j<n; j++)<="" td=""></n;></td></n;> | for (j=0; j <n; j++)<="" td=""></n;> |
| A[i,j] =                                                                         | A[i,j] =                             |

Blocking

A. Ardö, EIT

Lecture 7: EITF20 Computer Architecture

November 21, 2012 37 / 42

## Reduce misses 4: Victim cache



#### Recycling

- How to combine fast hit time of direct mapped yet still avoid conflict misses?
- Add buffer to place data discarded from cache
- Jouppi: a 4-entry victim cache removed 25% of conflict misses for a 4 Kbyte direct mapped data cache
- Used in AMD Athlon, HP and Alpha machines

© 2003 Elsevier Science (USA). All rights reserved.

A. Ardö, EIT

# Reduce misses 3: Compiler optimizations

# **Summary of Compiler Optimizations to Reduce Cache Misses**



#### Outline

- Summary

Lecture 7: EITF20 Computer Architecture November 21, 2012 39 / 42 A. Ardö, EIT Lecture 7: EITF20 Computer Architecture November 21, 2012 40 / 42

# Cache Performance

Execution Time =

$$IC*(CPI_{execution} + \frac{\text{mem accesses}}{\text{instruction}}*\text{miss rate}*\text{miss penalty})*T_{C}$$

Three ways to increase performance:

- Reduce miss rate
- Reduce miss penalty
- Reduce hit time
- ... and increase bandwidth

However, remember:

**Execution time is the only true measure!** 

A. Ardö, EIT Lecture 7: EITF20 Computer Architecture November 21, 2012 41 / 42

# Cache optimizations

|               | Hit<br>time | Band-<br>width | Miss penalty | Miss<br>rate | HW complexity |
|---------------|-------------|----------------|--------------|--------------|---------------|
| Simple        | +           |                |              | -            | 0             |
| Addr. transl. | +           |                |              |              | 1             |
| Way-predict   | +           |                |              |              | 1             |
| Trace         | +           |                |              |              | 3             |
| Pipelined     | -           | +              |              |              | 1             |
| Banked        |             | +              |              |              | 1             |
| Nonblocking   |             | +              | +            |              | 3             |
| Early start   |             |                | +            |              | 2             |
| Merging write |             |                | +            |              | 1             |
| Multilevel    |             |                | +            |              | 2             |
| Read priority |             |                | +            |              | 1             |
| Prefetch      |             |                | +            | +            | 2-3           |
| Victim        |             |                | +            | +            | 2             |
| Compiler      |             |                |              | +            | 0             |
| Larger block  |             |                | -            | +            | 0             |
| Larger cache  | -           |                |              | +            | 1             |
| Associativity | -           |                |              | +            | 1             |

A. Ardö, EIT Lecture 7: EITF20 Computer Architecture November 21, 2012 42 / 42