#### Eliminating Dead Code via Speculative Microcode Transformations

Logan Moody, Wei Qi, Abdolrasoul Sharifi, Layne Berry, Joey Rudek, Jayesh Gaur, Jeff Parkhurst, Sreenivas Subramoney, Kevin Skadron, Ashish Venkat









#### Motivation



#### Overview of the Framework







#### Conclusion



## The Landscape of Modern Computing







Software (rapidly evolving, increasingly complex)

UNIVERSITY OF VIRGINIA




Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten. New plot and data collected for 2010-2017 by K. Rupp.





#### Hardware (stagnation of single-thread performance)


#### **Software** (a substantial chunk of our workloads is inherently sequential)

Despite advances in compiler technology, a considerable chunk of wasteful computation still persists even in highly machine-tuned code.



**Optimizable at compile-time** 




Not optimizable at compile-time




But what if the values of array x are predictable at run-time?
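The slide's original code example did not survive extraction, so the following is only a hypothetical illustration of the question above. It assumes a loop whose branch and multiply depend on an array `x` that is unknown at compile time but turns out to hold the same value on every run-time access; the function names and the choice of `0` as the predicted value are made up for this sketch.

```python
def scale_generic(x, y):
    """Baseline: the branch outcome depends on run-time data in x."""
    out = []
    for xi, yi in zip(x, y):
        if xi == 0:              # not foldable statically: x is unknown
            out.append(0)
        else:
            out.append(xi * yi)
    return out


def scale_speculative(x, y, predicted=0):
    """Speculate that every x[i] equals `predicted` (0 in this sketch).

    Under that assumption the compare and multiply are dead work.
    The all(...) check stands in for hardware verification of the
    prediction; on a mismatch we fall back to the generic path.
    """
    if predicted == 0 and all(xi == predicted for xi in x):
        return [0] * min(len(x), len(y))
    return scale_generic(x, y)
```

Both functions return the same result for any input; the speculative one simply skips the per-element work when the prediction holds.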








#### Overview of the Framework





#### Intel Front-end

Legacy Decode and µop Cache





#### Step 1: Hot Code Detection

Identify regions of hot code in the µop cache.
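The slides do not say how hotness is measured. As a minimal sketch, assume a per-region hit counter keyed by the region's head PC and a fixed threshold; `HotRegionDetector`, `record_fetch`, and the threshold value are all hypothetical names and numbers.

```python
from collections import Counter

HOT_THRESHOLD = 4  # hypothetical value, not taken from the talk


class HotRegionDetector:
    """Counts µop-cache hits per region, keyed by the region's head PC."""

    def __init__(self, threshold=HOT_THRESHOLD):
        self.threshold = threshold
        self.hits = Counter()

    def record_fetch(self, head_pc):
        """Record one hit; return True exactly once, when the region turns hot.

        A True result is the point at which an optimization request
        would be generated for this region.
        """
        self.hits[head_pc] += 1
        return self.hits[head_pc] == self.threshold
```

Returning `True` only on the crossing fetch keeps a region from being requested repeatedly while it stays hot.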














#### Step 2: Generate Request for Hot Code Region

Request optimization from the Code Compaction Unit.







#### Step 3: Perform Optimizations

Track register context and prediction sources, processing one µop per cycle.

Example µop sequence (operands only partially recoverable from the slide): two `ld` µops, an `addi`, and a `beq`, with the operand fragments `t2, [ADDR + 8]` and `t3, t2, 2`, followed by `foo: add t4, t5, t6`.
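Tracking a register context while walking the region one µop at a time can be sketched as follows. The tuple encoding of µops, the `predicted_loads` dictionary standing in for a value predictor, and the decision to keep a predicted load in the stream so it can be verified are all assumptions of this sketch, not details confirmed by the slides.

```python
def compact(uops, predicted_loads):
    """Walk the region one µop at a time, tracking known register values.

    uops: tuples such as ("ld", dst, addr), ("addi", dst, src, imm),
          or ("add", dst, src1, src2) (a made-up encoding).
    predicted_loads: {addr: value} from a hypothetical value predictor.
    Returns (kept_uops, register_context, prediction_sources).
    """
    known = {}     # register context: register -> speculatively known constant
    sources = []   # predictions that must be verified later
    kept = []      # µops that survive compaction
    for op in uops:
        kind, dst = op[0], op[1]
        if kind == "ld" and op[2] in predicted_loads:
            # Data invariant: adopt the predicted value. The load is kept
            # so the prediction can be checked (an assumption here).
            known[dst] = predicted_loads[op[2]]
            sources.append(op[2])
            kept.append(op)
        elif kind == "addi" and op[2] in known:
            # Constant propagation and folding: result known, µop dropped.
            known[dst] = known[op[2]] + op[3]
        elif kind == "add" and op[2] in known and op[3] in known:
            known[dst] = known[op[2]] + known[op[3]]
        else:
            # Result unknown: dst no longer holds a known constant.
            known.pop(dst, None)
            kept.append(op)
    return kept, known, sources
```

With `[("ld", "t2", "ADDR+8"), ("addi", "t3", "t2", 2), ("add", "t4", "t5", "t6")]` and a predicted value of 5 for `ADDR+8`, the `addi` is folded away and `t3` is recorded as 7 in the register context.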

















































#### Step 3: Perform Optimizations (Branch Elimination)
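Once both operands of a branch are speculatively known, its direction can be resolved and the branch µop dropped. The sketch below assumes a hypothetical `("beq", lhs, rhs, target_pc)` encoding and a `known` register context mapping registers to constants.

```python
def fold_beq(known, lhs, rhs):
    """Return True (taken), False (not taken), or None if undecidable."""
    if lhs in known and rhs in known:
        return known[lhs] == known[rhs]
    return None


def next_pc(known, beq, fallthrough_pc):
    """Return the PC to continue compacting from, or None to keep the branch."""
    _, lhs, rhs, target_pc = beq
    outcome = fold_beq(known, lhs, rhs)
    if outcome is None:
        return None  # outcome unknown: the branch stays in the stream
    return target_pc if outcome else fallthrough_pc
```

When `next_pc` returns a PC, the compacted stream continues at that address without the branch µop.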







#### Step 4: Dump Live-outs

To maintain correct register state, we must dump the live-outs.
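One way to picture this step: any register whose producing µop was eliminated still has to hold the right value once the region ends. The sketch below assumes those registers are collected in an `eliminated` dictionary and that a move-immediate µop, here called `movi`, is available; both are hypothetical.

```python
def dump_live_outs(eliminated):
    """Emit one move-immediate µop per register whose producer was removed.

    eliminated: {register: constant} for values that were folded away
    inside the region and so were never written by a real µop.
    """
    return [("movi", reg, value) for reg, value in sorted(eliminated.items())]
```

Sorting by register name only makes the output deterministic for the example; it does not reflect any ordering requirement from the talk.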














#### Step 5: Write to Optimized Partition

If there was sufficient shrinkage, write the compacted sequence to the optimized partition.
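The slides do not state what counts as sufficient shrinkage. A minimal sketch, assuming shrinkage is the fraction of µops removed and using a made-up threshold:

```python
MIN_SHRINKAGE = 0.25  # hypothetical threshold, not from the talk


def should_commit(original_len, optimized_len, min_shrinkage=MIN_SHRINKAGE):
    """Decide whether the compacted region is worth storing."""
    if original_len <= 0:
        return False
    shrinkage = (original_len - optimized_len) / original_len
    return shrinkage >= min_shrinkage
```

For example, shrinking a region from 8 µops to 5 removes 37.5% of them and passes this threshold, while shrinking it to 7 removes only 12.5% and does not.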







#### Subsequent Executions

Next time the head PC is fetched, probe both partitions and perform profitability analysis.
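The profitability analysis itself is not spelled out on the slide. As one possible reading, the sketch below streams from the optimized partition only when an optimized version exists and every prediction source it depends on is currently confident. The dictionary layout, the `confidence` map, and the 0.9 cutoff are assumptions.

```python
def select_stream(head_pc, unoptimized, optimized, confidence, min_confidence=0.9):
    """Probe both partitions and pick which µop stream to fetch from.

    unoptimized: {head_pc: uops}
    optimized:   {head_pc: (uops, prediction_sources)}
    confidence:  {prediction_source: float in [0, 1]}
    """
    entry = optimized.get(head_pc)
    if entry is not None:
        uops, sources = entry
        if all(confidence.get(s, 0.0) >= min_confidence for s in sources):
            return "optimized", uops
    return "unoptimized", unoptimized[head_pc]
```

A region with no optimized entry, or with any low-confidence prediction source, falls back to the unoptimized stream.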










#### Squashing and Recovery

If a prediction source is mispredicted, we must redirect execution to the unoptimized sequence.
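The recovery rule can be sketched as a check of each prediction source against the value actually observed at execution. The function name, the dictionaries, and the choice to restart at the region's head PC are assumptions of this sketch.

```python
def verify_region(head_pc, predicted, actual):
    """Compare predicted values with the values observed at execution.

    predicted / actual: {prediction_source: value}
    Returns ("commit", None) if every prediction held, otherwise
    ("squash", head_pc) to restart from the unoptimized sequence.
    """
    for source, value in predicted.items():
        if actual.get(source) != value:
            return "squash", head_pc
    return "commit", None
```

A single mismatching source is enough to squash the whole optimized region in this model.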





#### Optimizations:

- Data Invariant Identification
- Control Invariant Identification
- Constant Folding
- Constant Propagation
- Branch Folding
- Inlining Live Outs










The majority of code compaction occurs within short, hot regions of code





Benchmarks with high data and control predictability benefit the most from SCC





SCC reduces energy consumption even on applications that see no speedup





#### Conclusion



- An aggressive scheme of dead code elimination implemented entirely within the processor front-end
- Minimally invasive (incurring just 1.5% area overhead)
- Provides as much as 18% speedup (average of 6%) for SPEC applications
- Significant energy savings due to aggressive dead code elimination (an average of 12%)
- We also studied how sensitive the approach is to the choice of branch and value predictors
  - Aggressive prediction enables aggressive compaction but also increases the risk of squashing, which suggests a balanced approach.



# Thanks!

# Questions?

www.github.com/logangregorym/gem5-changes

lgm4xn@virginia.edu

#### **Extensions to the Micro-op Cache**



Line selection logic extended to select line with highest profitability score
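A minimal sketch of the extended selection, assuming each tag-matching line carries a numeric profitability score; how the score is computed is not given on the slide, and `select_line` is a hypothetical name.

```python
def select_line(candidates):
    """Pick the tag-matching line with the highest profitability score.

    candidates: list of (line_id, score) pairs.
    Returns the chosen line_id, or None when nothing matched.
    Ties go to the earliest candidate in the list.
    """
    if not candidates:
        return None
    return max(candidates, key=lambda c: c[1])[0]
```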



#### Fetch State Machine



Additional states and transitions added to handle streaming from optimized partition