



### Introduction

- 733-800 MHz clock
  - 0.18-micron CMOS process technology
- 2 extended, 2 single precision FMACs
  - Execution up to 8 SP flops/cycle, 6 GFLOPS
- >20x Pentium Pro
- 3-level cache hierarchy
  - Split L1 and Unified L2 on die
  - Unified L3 on separate die but same container
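
The peak floating-point figure on this slide follows from the clock rate and the flops per cycle. A minimal sketch of that arithmetic, where the function name `peak_gflops` is only illustrative:

```python
def peak_gflops(clock_mhz: float, flops_per_cycle: int) -> float:
    """Peak rate in GFLOPS = clock (Hz) * flops per cycle / 1e9."""
    return clock_mhz * 1e6 * flops_per_cycle / 1e9


# The slide quotes 8 single-precision flops per cycle at 733-800 MHz.
for mhz in (733, 800):
    print(f"{mhz} MHz: {peak_gflops(mhz, 8):.2f} GFLOPS")
```

At 800 MHz this gives 6.4 GFLOPS, which the slide rounds to 6.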

ITANIUM

### Introduction

- 64-byte line size
- Page Sizes up to 256MB
- Full 64-Bit computing
- Full IA-32 binary compatibility in hardware
  - Shared Resources: ALU, registers, Data Cache
  - IA-32 Engine: Dynamic execution
- Instruction set architecture (Marco)
- Instruction stream (Ganesh)
- Data stream and IA-32 compatibility (Karthik)






# Outline

- Introduction to the ISA
- Expressing parallelism
- Creating parallelism
- Techniques and instructions
- Compatibility
- Observations


## Why & How

- Goal
  - Bring ILP features to a general purpose microprocessor, flexibility
- Techniques
  - Predication
  - Speculation
  - Large register files
  - Register rotation
  - HW exception deferral
  - Software pipelining
- RISC/CISC basic architecture of HP's PA-RISC, but ...





- Instruction types, 4 units
- Long branches, long immediate

| Instruction Type | Description     | Execution Unit Type |
|------------------|-----------------|---------------------|
| A                | Integer ALU     | I-unit or M-unit    |
| I                | Non-ALU integer | I-unit              |
| M                | Memory          | M-unit              |
| F                | Floating-point  | F-unit              |
| B                | Branch          | B-unit              |
| L+X              | Extended        | I-unit/B-unit       |

## Expressing Parallelism

- Not only bundles, but also
  - Compound conditionals
  - Multi-way branches

Compound conditional: `if ((a==0) || (b<=5) || (c!=d) || (f & 0x2)) { r3 = 8; }`

<pre>
cmp.ne p1 = r0, r0
add t = -5, b;;
cmp.eq.or p1 = 0,a
cmp.ge.or p1 = 0,t
cmp.ne.or p1 = c,d
tbit.or p1 = 1,f,1;;
(p1) mov r3 = 8
</pre>

Multi-way branch:

<pre>
{ .mii
    cmp.ne p1,p2 = r1,r2;
    cmp.ne p3,p4 = 4, r5;
    cmp.lt p5,p6 = r8,r9;
}
{ .bbb
    (p1) br.cond label1
    (p3) br.cond label2
    (p5) br.call b4 = label3
}
// Fall through code here
</pre>
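
The parallel-compare sequence on this slide can be modelled in a few lines of Python. This is only a sketch of the semantics. It assumes that each `.or` compare sets the target predicate when its condition holds and otherwise leaves it unchanged, and that the third compare tests `c != d` as in the C condition. The names `compound_or` and `run` are hypothetical.

```python
def compound_or(a: int, b: int, c: int, d: int, f: int) -> bool:
    """Model of: if ((a==0) || (b<=5) || (c!=d) || (f & 0x2)) r3 = 8."""
    p1 = False                     # cmp.ne p1 = r0, r0  (never true: clears p1)
    t = b - 5                      # add t = -5, b

    # The four compares do not depend on each other, so they can sit in one
    # instruction group. Each one can only set p1, never clear it.
    conditions = [
        0 == a,                    # cmp.eq.or p1 = 0, a
        0 >= t,                    # cmp.ge.or p1 = 0, t   (same as b <= 5)
        c != d,                    # cmp.ne.or p1 = c, d
        ((f >> 1) & 1) == 1,       # tbit.or   p1 = 1, f, 1 (tests bit 1 of f)
    ]
    for holds in conditions:
        if holds:
            p1 = True
    return p1


def run(a: int, b: int, c: int, d: int, f: int, r3: int = 0) -> int:
    """Apply the predicated move and return the resulting r3."""
    if compound_or(a, b, c, d, f):  # (p1) mov r3 = 8
        r3 = 8
    return r3
```

Because no compare reads a result produced by another, the order of the list above does not matter, which is what lets the hardware evaluate them in the same cycle.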















Table 4-22. Branch Types

| Mnemonic          | Function                         | Branch Condition                      | Target Address     |
|-------------------|----------------------------------|---------------------------------------|--------------------|
| br.cond or br     | Conditional branch               | Qualifying predicate                  | IP-rel or Indirect |
| br.call           | Conditional procedure call       | Qualifying predicate                  | IP-rel or Indirect |
| br.ret            | Conditional procedure return     | Qualifying predicate                  | Indirect           |
| br.ia             | Invoke the IA-32 instruction set | Unconditional                         | Indirect           |
| br.cloop          | Counted loop branch              | Loop count                            | IP-rel             |
| br.ctop, br.cexit | Modulo-scheduled counted loop    | Loop count and Epilog count           | IP-rel             |
| br.wtop, br.wexit | Modulo-scheduled while loop      | Qualifying predicate and Epilog count | IP-rel             |
| brl.cond or brl   | Long conditional branch          | Qualifying predicate                  | IP-rel             |
| brl.call          | Long conditional procedure call  | Qualifying predicate                  | IP-rel             |
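
To make the "Loop count" branch condition concrete, here is a small sketch of how a `br.cloop`-style counted loop behaves. It assumes the loop count register (LC) is loaded with the trip count minus one before the loop, and that the branch is taken and LC decremented while LC is non-zero. The function name `run_counted_loop` is hypothetical.

```python
def run_counted_loop(trip_count: int) -> int:
    """Run a loop closed by a br.cloop-style branch; return body executions."""
    if trip_count < 1:
        raise ValueError("trip_count must be at least 1")

    lc = trip_count - 1        # assumed setup: LC = trip count - 1
    executed = 0
    while True:
        executed += 1          # loop body
        if lc != 0:            # br.cloop: taken while LC is non-zero
            lc -= 1
        else:
            break              # LC == 0: fall through, loop ends
    return executed
```

No compare instruction or general register is needed for the loop test. The count lives in LC and the branch itself updates it.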





- Branch Predict Instructions (brp)
- Hints: strategy (Table 4-26), prefetch (Table 4-27), deallocate (Table 4-28)

Table 4-26. Whether Prediction Hint on Branches

| Completer | Whether Hint      | Operation                                                                                                     |
|-----------|-------------------|---------------------------------------------------------------------------------------------------------------|
| spnt      | Static Not-Taken  | Ignore this branch, do not allocate prediction resources for this branch.                                     |
| sptk      | Static Taken      | Always predict taken, do not allocate prediction resources for this branch.                                   |
| dpnt      | Dynamic Not-Taken | Use dynamic prediction hardware. If no dynamic history information exists for this branch, predict not-taken. |

Table 4-27. Sequential Prefetch Hint on Branches

| Completer | Sequential Prefetch Hint | Operation                                                                                                            |
|-----------|--------------------------|----------------------------------------------------------------------------------------------------------------------|
| few       | Prefetch few lines       | When prefetching code at the branch target, stop prefetching after a few (implementation-dependent number of) lines. |
| many      | Prefetch many lines      | When prefetching code at the branch target, prefetch more lines (also an implementation-dependent number).           |

Table 4-28. Predictor Deallocation Hint

| Completer | Operation                     |
|-----------|-------------------------------|
| none      | Don't deallocate              |
| clr       | Deallocate branch information |
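
The three "whether" hints listed above amount to a small decision rule. The sketch below models only that rule. The function name `predict_taken` and the `history` parameter (the dynamic predictor's guess, or `None` when no history exists) are assumptions for illustration, not part of the architecture definition.

```python
from typing import Optional


def predict_taken(hint: str, history: Optional[bool]) -> bool:
    """Return the predicted direction for a branch carrying the given hint."""
    if hint == "spnt":
        return False               # static not-taken: no predictor resources used
    if hint == "sptk":
        return True                # static taken: no predictor resources used
    if hint == "dpnt":
        # Dynamic: follow the hardware predictor, defaulting to not-taken
        # when there is no history for this branch.
        return history if history is not None else False
    raise ValueError(f"unknown hint: {hint}")
```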





Table 4-12. Memory Access Instructions

| General           | Floating-point (normal) | Floating-point (load pair) | Operation                 |
|-------------------|-------------------------|----------------------------|---------------------------|
| ld                | ldf                     | ldfp                       | Load                      |
| ld.s              | ldf.s                   | ldfp.s                     | Speculative load          |
| ld.a              | ldf.a                   | ldfp.a                     | Advanced load             |
| ld.sa             | ldf.sa                  | ldfp.sa                    | Speculative advanced load |
| ld.c.nc, ld.c.clr | ldf.c.nc, ldf.c.clr     | ldfp.c.nc, ldfp.c.clr      | Check load                |
| ld.c.clr.acq      |                         |                            | Ordered check load        |
| ld.acq            |                         |                            | Ordered load              |
| ld.bias           |                         |                            | Biased load               |
| ld8.fill          | ldf.fill                |                            | Register fill             |
| st                | stf                     |                            | Store                     |
| st.rel            |                         |                            | Ordered store             |
| st8.spill         | stf.spill               |                            | Register spill            |
| cmpxchg           |                         |                            | Compare and exchange      |
| xchg              |                         |                            | Exchange memory and GR    |
| fetchadd          |                         |                            | Fetch and add             |
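
The advanced load and check load rows work as a pair through the ALAT (Advanced Load Address Table). The toy model below sketches the idea: an advanced load records its address, a later store to the same address removes the record, and the check load reloads only if the record is gone. The class and method names are hypothetical, and the model compares whole addresses, which is a simplification of what real hardware tracks.

```python
class Alat:
    """Toy model of the Advanced Load Address Table."""

    def __init__(self) -> None:
        self.entries = {}                      # target register -> address

    def advanced_load(self, memory: dict, reg: str, addr: int) -> int:
        """ld.a: load early and remember which address the register came from."""
        self.entries[reg] = addr
        return memory[addr]

    def store(self, memory: dict, addr: int, value: int) -> None:
        """st: write memory and drop any ALAT entry for the same address."""
        memory[addr] = value
        self.entries = {r: a for r, a in self.entries.items() if a != addr}

    def check_load(self, memory: dict, reg: str, addr: int, current: int) -> int:
        """ld.c: keep the early value if the entry survived, else reload."""
        if self.entries.get(reg) == addr:
            return current
        return memory[addr]


mem = {0x100: 1, 0x200: 2}
alat = Alat()
early = alat.advanced_load(mem, "r4", 0x100)          # hoisted above the stores
alat.store(mem, 0x200, 7)                             # no alias: entry survives
no_alias = alat.check_load(mem, "r4", 0x100, early)
alat.store(mem, 0x100, 9)                             # alias: entry removed
after_alias = alat.check_load(mem, "r4", 0x100, early)
```

This is what lets the compiler move a load above a store whose address it cannot disambiguate at compile time.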









## Code Density

- Causes
  - Avg. 43 bits per instruction (32 for RISC)
  - Added instructions (alloc, chk)
  - Fix-up
- Biggest impact
  - Decreasing hit rate on caches
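
The 43-bit average can be reproduced from the bundle format. The sketch below assumes the IA-64 encoding of three instruction slots per 128-bit bundle, which is not stated on the slide, and ignores NOP padding and the added instructions, which make real code larger still. The constant and function names are illustrative.

```python
BUNDLE_BITS = 128            # assumed: one IA-64 bundle
SLOTS_PER_BUNDLE = 3         # assumed: three instruction slots per bundle
RISC_INSTRUCTION_BITS = 32   # from the slide


def bits_per_instruction() -> float:
    """Average encoding cost of one instruction slot, template included."""
    return BUNDLE_BITS / SLOTS_PER_BUNDLE


def size_ratio_vs_risc() -> float:
    """How much larger one slot is than a 32-bit RISC instruction."""
    return bits_per_instruction() / RISC_INSTRUCTION_BITS


print(f"{bits_per_instruction():.1f} bits per instruction")
print(f"{size_ratio_vs_risc():.2f}x a 32-bit RISC instruction")
```

Under these assumptions each instruction costs about a third more bits than a 32-bit RISC instruction before any padding, which is consistent with the cache hit rate concern above.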

## **Observations**

- Synergetic
  - ld.sa, data dependences in software pipelining
- Compiler
  - Template
  - Grouping
  - Explicit prefetching
  - ld.a
- X86 common SW base (aggressive)
- 20-30% improvement over RISC is claimed


## Instruction Stream

- Overview of EPIC hardware
- I-Stream
  - Pipeline
  - I-Cache
  - Prefetch & Fetch
  - Branch prediction
  - Issue (Instruction dispersal & delivery)







### I-Cache; I-TLB

- I-Cache
  - 16 KB
  - 4-way set associative
  - Fully pipelined
- I-TLB
  - 64 entries
  - Single cycle
  - Fully associative
  - On-chip page walker
- I-Cache filters prefetch requests
- Both enhanced with an additional port
  - To check for a miss





















## Recap - Execution Units

- 17 units + ALAT
  - 4 ALU
  - 4 MMX
  - 2 + 2 FMAC
  - 2 Load/Store
  - 3 branch
- Issue Ports
  - 2 I
  - 2 M
  - 2 F
  - 3 B

### **Register Files**

- Integer
  - 128 64-bit
  - 8 read ports (2 x 2 I units, 2 x 2 M units)
  - 6 write ports (1 x 2 I units, 2 x 2 Loads A.I)
- Floating Point
  - 128 82-bit (double extended)
  - 8 read ports (2 x 2 F units, 2 x 2 M units)
  - 4 write ports (2 x 2 F units, 2 x 2 M units)
- Predicate
  - 64 1-bit, "broadside" R/W
  - 15 read ports (2 x 6 M, F, I units & 3B units)
  - 11 write ports
    - (2 x 2 M units, 2 x 2 I units, 2 x 1 F unit, 1 x 1 Reg. Rot.)



## Operand Delivery - WLD/REG Stages

- Register Read
  - WLD (Word Line Decode) - begin access
  - REG - Read Registers
  - WLD - frequency increase?
- Register Scoreboard
  - Hazard detection
  - Stall only dependent instructions
  - Include predicates
  - Defer stalls

<pre>
cmp.eq r1, r2 -&gt; p1, p3
(p1) ld4 [r3] -&gt; r4
add r4, r1 -&gt; r5       (no dependence if p1=0)
</pre>
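
The "include predicates" point can be sketched as a scoreboard that marks a destination register as pending only when the producing instruction's predicate is true. This is an illustrative model rather than the actual hardware logic, and the `Scoreboard` class and its method names are hypothetical.

```python
class Scoreboard:
    """Tracks registers whose results are still outstanding."""

    def __init__(self) -> None:
        self.pending = set()

    def issue(self, dest: str, predicate: bool = True) -> None:
        """Mark dest as pending only if the instruction will really write it."""
        if predicate:
            self.pending.add(dest)

    def complete(self, dest: str) -> None:
        """The result has arrived, so dependents may proceed."""
        self.pending.discard(dest)

    def must_stall(self, sources: list) -> bool:
        """A consumer stalls only if one of its sources is pending."""
        return any(src in self.pending for src in sources)


def add_stalls(p1: bool) -> bool:
    """Does 'add r4, r1 -> r5' stall behind '(p1) ld4 [r3] -> r4'?"""
    sb = Scoreboard()
    sb.issue("r4", predicate=p1)       # (p1) ld4 [r3] -> r4
    return sb.must_stall(["r4", "r1"])
```

With `p1` false the load never writes `r4`, so the `add` sees no hazard, matching the "no dependence if p1=0" note on the slide.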

## Operand Delivery

- Deferred Stall
  - Stall actually in EXE stage
  - Clock frequency
  - Operand read over - can't re-read
  - Snoop the register bypass network

# Execution

- Deferred Stall
- Execute
  - Writes turned off at retirement for false predicates
  - Different latencies: out-of-order "execution"
  - In-order retire scoreboard







![](_page_28_Figure_1.jpeg)

![](_page_29_Figure_0.jpeg)

![](_page_29_Figure_1.jpeg)

![](_page_30_Figure_0.jpeg)

- L1 Data
  - 16 KB, 4-way, 32-byte lines
  - Write through, no write allocate
  - Dual ported, 2-cycle load latency
- L2, on chip, unified
  - 96 KB, 6-way, 64-byte lines, write back, write allocate
  - Dual ported, 6 cycles Int, 9 cycles FP load latencies
  - MESI protocol for coherence
- L3, off chip, on package, unified
  - 4 MB, 4-way, 64-byte lines
  - 21-24 cycle latency, 128-bit bus
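
The number of sets in each cache follows from the sizes, associativities and line sizes listed above. A short sketch of that arithmetic, where the helper name `num_sets` and the dictionary layout are only illustrative:

```python
KB = 1024
MB = 1024 * KB


def num_sets(size_bytes: int, ways: int, line_bytes: int) -> int:
    """Sets = total lines / associativity."""
    lines = size_bytes // line_bytes
    if lines % ways != 0:
        raise ValueError("line count is not a multiple of the associativity")
    return lines // ways


# (size, ways, line size) taken from the slide
caches = {
    "L1D": (16 * KB, 4, 32),
    "L2": (96 * KB, 6, 64),
    "L3": (4 * MB, 4, 64),
}

for name, geometry in caches.items():
    print(name, num_sets(*geometry), "sets")
```

The 96 KB size of L2 stops looking odd once it is paired with 6 ways: the set count is still a power of two (256), so indexing stays a simple bit-field extraction.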

- Hints
  - FP NT1 = Int NT2
  - Bias - Easier MESI

Table 2. Implementation of cache hints.

| Hint         | Semantics                | L1 response       | L2 response                    | L3 response                   |
|--------------|--------------------------|-------------------|--------------------------------|-------------------------------|
| NTA          | Nontemporal (all levels) | Don't allocate    | Allocate, mark as next replace | Don't allocate                |
| NT2          | Nontemporal (2 levels)   | Don't allocate    | Allocate, mark as next replace | Normal allocation             |
| NT1          | Nontemporal (1 level)    | Don't allocate    | Normal allocation              | Normal allocation             |
| T1 (default) | Temporal                 | Normal allocation | Normal allocation              | Normal allocation             |
| Bias         |                          | Normal allocation | Allocate into exclusive state  | Allocate into exclusive state |

## Rest of the Processor

- System Bus
  - 64-bit, 2.1 GB/s
  - Multidrop, split-transaction bus
  - Up to 56 outstanding transactions
  - Optimized MESI protocol
  - Glue-less multiprocessor support (up to 4)
- IA-32 control
- ECC/Parity coverage of processor and bus
  - Read-only structures: parity
  - Data: ECC


![](_page_32_Figure_0.jpeg)

![](_page_32_Picture_1.jpeg)

# Conclusions

- Complexity shift to compilers
- Methods to express compile time information
- Large register files, EPIC specific Hardware
- Optimized FPUs for multimedia applications
- Large L3 cache
- Reliability and performance server side

"Neat design, Let us see if it succeeds"
