

Computer Architecture, EITF20 Exercise

| Problem 1 | L |
|-----------|---|
|-----------|---|

a) Pair each concept from table **I** with the concept from table **II** that is most closely associated. (Each answer should be a number-letter pair.) (10)

| Ì I |                    |                     | II                                                                 |  |  |
|-----|--------------------|---------------------|--------------------------------------------------------------------|--|--|
| 1.  | BTB                | A.                  | output dependency                                                  |  |  |
| 2.  | WAR                | B.                  | large, cheap, slow                                                 |  |  |
| 3.  | Trace cache        | C.                  | pipeline optimization technique                                    |  |  |
| 4.  | out-of-order       | D.                  | Amdahl's law                                                       |  |  |
| 5.  | little-endian      | E.                  | static optimization for memory system                              |  |  |
| 6.  | RAID               | F.                  | jump prediction technique                                          |  |  |
| 7.  | CDB                | G.                  | the least significant unit is stored at the lowest memory address  |  |  |
| 8.  | ROB                | H.                  | precise exception                                                  |  |  |
| 9.  | Procedure inlining | I.                  | virtual memory                                                     |  |  |
| 10. | Interleaved        | J.                  | hard disk performance and reliability                              |  |  |
|     |                    | K. reduce miss rate |                                                                    |  |  |
|     |                    | L.                  | reduce hit time                                                    |  |  |
|     |                    | M.                  | the least significant unit is stored at the highest memory address |  |  |
|     |                    | N.                  | hazard/conflict                                                    |  |  |
|     |                    | 0.                  | forwarding                                                         |  |  |
|     |                    | P.                  | higher memory bandwidth                                            |  |  |
|     |                    |                     |                                                                    |  |  |

Briefly (1-3 sentences) describe each of the following items/concepts concerning computer architecture. Where relevant also give an example.

| b) Replacement algorithm | (1) |
|--------------------------|-----|
| c) Critical word first   | (1) |
| d) Precise exceptions    | (1) |
| e) Speculative execution | (1) |
| f) TLB                   | (1) |

## Problem 2

- a) Describe the concept "memory hierarchy", and state why it is important. State the function of each part, normally used hardware components, and what problems they solve (if any). (5)
- b) Describe the cache technologies direct mapping, set-associative mapping, and fully associative mapping, including how addresses can be viewed as consisting of a number of fields. Highlight the differences.

Consider the following (MIPS) program and pipeline:

- 1. DADD R1,R2,R3 ; R1 = R2 + R3
- 2. DSUB R4,R1,R5
- 3. AND R6,R1,R7
- 4. OR R8,R1,R9
- 5. XOR R4,R1,R10



- a) Identify and state all dependencies in the program. Which of these dependencies leads to hazards? (3)
- b) What is the difference between a dependency and a hazard? (1)
- c) How many clock cycles does it take to execute all instructions in the above program on the given pipeline? (1)
- d) What changes to the pipeline can eliminate the data hazards caused by the data dependencies in the code? (If needed you can use the attached figure at the end of this exam, to visualize your changes to the pipeline.)

We are interested to quickly compare improvements of 3 mutually exclusive enhancements to a pipelined processor (ideal CPI=1) together with a unified cache. The cache system have a miss-penalty of 50 clock cycles. Enhancements and data regarding our intended application are:

- 1. Enhancing integer computation performance by a factor 1.8 (assume no effects on miss-ratio or clock cycle time)
- 2. Decreasing execution time for data access instructions by a factor 2 (assume no effects on miss-ratio or clock cycle time)
- 3. Changing cache associativity from 128 sets, associativity 1 to 32 sets, associativity 4.

| Instruction mix  |                       |  |  |
|------------------|-----------------------|--|--|
| instruction type | % of all instructions |  |  |
| load             | 23.75                 |  |  |
| store            | 9.66                  |  |  |
| uncond branch    | 5.84                  |  |  |
| cond branch      | 18.45                 |  |  |
| int computation  | 42.30                 |  |  |

| Cache miss-ratio |               |          |          |  |  |
|------------------|---------------|----------|----------|--|--|
| No. of           | Associativity |          |          |  |  |
| $\mathbf{sets}$  | 1             | 2        | 4        |  |  |
| 16               | 0.778109      | 0.594727 | 0.303048 |  |  |
| 32               | 0.597252      | 0.312289 | 0.136385 |  |  |
| 64               | 0.397689      | 0.145433 | 0.083915 |  |  |
| 128              | 0.207784      | 0.088052 | 0.051590 |  |  |

Clock cycle time dependence on associativity is:

|    |                           | sociat | ativity  |     |  |
|----|---------------------------|--------|----------|-----|--|
| s: |                           | 1      | <b>2</b> | 4   |  |
|    | relative clock cycle time | 1      | 1.04     | 1.1 |  |

Which of the 3 enhancements will give the highest speedup?(Show all calculations!) (10)

## Problem 5

Consider following part of an assembly-language program:

| 1: MOV R5, R7       | ;Move R5=R7              |
|---------------------|--------------------------|
| 2: LW R8, (R5)      | ;Load word R8=Mem(R5)    |
| 3: MUL R9, R8, R5   | ;Multiply R9=R8*R5       |
| 3: ADDDI R5, R5, #4 | ;Add immediate R5=R5+imm |
| 4: LW R9, (R5)      | ;Load Word R9=Mem(R5)    |
| 5: SUB R1, R3, R2   | ;Subtract R1=R3-R2       |

All instructions take one cycle in the EXE-stage except MUL which takes 5 cycles in the EXE-stage.

- Draw a table (clock number vs instruction number with entries showing the pipeline stage) for the execution pattern of the above program in a standard (with the exception for MUL) 5-stage pipeline without forwarding. (5)
- Discuss improvements in execution timing if you were to execute it on a pipeline with 2 arithmetic functional units, register renaming and reorder buffers (Tomasulo). (5)

- a) Show the address bit partitioning into fields for a memory system with parameters: Memory size = 1GB
  Cache size = 512 KB
  Block size = 32 Bytes
  Set associative with 4 blocks per set.
- b) List the micro-operations (4 steps in the following picture) done during a successful cache access (2)

(2)



c) Algorithm 0 is running with several different cache options. Table 1 is the default parameter setting.

Find any questionable results in Table 2 and explain why (assume the result for default parameter setting is correct). (3)

Fill in the missing information in Table 2 and comment on your answer. (3)

(no calculation needed, more concept check)

Table 1: Default cache parameters

|          | I-Cache             | D-Cache             | Memory                                   |
|----------|---------------------|---------------------|------------------------------------------|
|          |                     |                     | read cycles = $50$ , write cycles = $50$ |
| Default  | cache size $=16$    | cache size $=16$    | write policy = Write Through             |
| settings | block size $= 2$    | block size $= 2$    | write buffersize $= 0$                   |
|          | blocks in set $= 1$ | blocks in set $= 1$ | replacement policy = Random              |

Table 2: Experiment results

| Settings                  | Algorithm   | I-Cache  | D-Cache  | Simulation |
|---------------------------|-------------|----------|----------|------------|
|                           |             | Hit rate | Hit rate | time       |
| All                       | Algorithm 0 | 0.5      | 0.8      | 177123     |
| default                   |             |          |          |            |
| I- and D-cache size=32    | Algorithm 0 | 0.4      | 0.9      | 37321      |
| Others: default           |             |          |          |            |
| Blocksize = 8             | Algorithm 0 | 0.8      |          |            |
| Others: default           |             |          |          |            |
| D-cache replacementpolicy | Algorithm 0 | 0.5      | 0.7      | 155103     |
| =FIFO. Others: default    |             |          |          |            |