# Many-Core Design from a Thermal Perspective

Wei Huang<sup>‡</sup>, Mircea R. Stan<sup>†</sup>, Karthik Sankaranarayanan<sup>‡</sup>

Robert J. Ribando\*, and Kevin Skadron<sup>‡</sup>

Departments of Computer Science<sup>‡</sup>, Electrical and Computer Engineering<sup>†</sup>, Mechanical and Aerospace Engineering<sup>\*</sup> University of Virginia, Charlottesville, VA 22904

# ABSTRACT

Air cooling limits have been a major design challenge in recent years for integrated circuits. Multi-core exacerbates thermal challenges because power scales with the number of cores, but also creates new opportunities for temperature-aware design, because multi-core designs offer more design parameters than single-core designs. This paper investigates the relationship between core size and on-chip hot spot temperature and shows that with the same power density, smaller cores are cooler than larger cores due to a spatial low-pass filtering effect of temperature. This phenomenon suggests that designs exploiting low-pass filtering can dissipate more power within the same cooling budget than contemporary designs.

# **Categories and Subject Descriptors**

B.7.2 Hardware [Design Aids]:

# **General Terms**

Design

# **Keywords**

temperature, many-core design, thermal design power, performance

# 1. INTRODUCTION

Semiconductor technology scaling presents severe thermal challenges. Area is scaling down faster than power due to limited supply voltage scalability, growing leakage challenges, and non-ideal interconnect scaling [1]. At the same time, the inability to improve single-thread performance without unreasonable power dissipation has led manufacturers to stop trying to extract instruction-level parallelism (ILP) and instead focus on integrating multiple, possibly simpler cores on a single die. A variety of multi-core products are available today, and all high-performance PC and server processors are multi-core and even many-core.

The many-core paradigm, however, is worrisome from a thermal design standpoint. Many-core allows simpler cores with lower power per core than aggressive ILP cores. Yet total power scales up linearly with the number of cores. Assuming that pricing power requires manufacturers to maintain die area and raise clock rate from generation to generation, not only power density but also total power will rise. Hence, with density doubling every generation, constant area, and voltage supply only dropping 2.5% per generation [1],  $P = CV^2 f$  implies that total power rises at least 50% per generation, assuming continued improvements in circuit delay and hence frequency. Clearly this exponential growth will outstrip the limits of affordable air cooling in a short time. Maintaining Moore's Law within reasonable cooling budgets therefore requires us to find techniques that allow higher thermal design power (TDP) within a fixed cooling budget—TDP scalability. (TDP represents the maximum amount of power the cooling package in a processor is required to dissipate.)

This paper shows that the many-core architecture plays a vital role in coping with these scaling challenges. We have two levers. The first is the choice of core sophistication and hence power per core. However, our ability to simplify cores is limited by the nature of the von Neumann datapath and the per-thread performance that a particular market demands. The second lever is layout: placement of high-power-density elements to maximize thermal uniformity and maximize efficiency of the cooling solution. Layout is independent of core sophistication. Even a single, complex core can be broken into chunks whose placement optimizes thermal uniformity [2]. [3] and [4] have suggested that interleaving high-power-density circuit elements with low-power-density storage elements achieves further benefits by using the interleaved cool elements as virtual "lateral heat sinks".

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

DAC 2008, June 8-13, 2008, Anaheim, California, USA.

Copyright 2008 ACM 978-1-60558-115-6/08/0006 ...\$5.00.

In particular, this paper focuses on the second lever mentioned above and shows that layout can substantially increase the efficiency of the cooling solution. Specifically:

- 1. We show that adopting the many-core design style allows *significantly more* total thermal design power (TDP) than traditional designs. The increase in TDP thus implies that a many-core design can improve performance by simply burning more power without the thermal hazards seen by single- or dual-core designs with the same TDP.
- 2. We also propose a closed-form analytical model to derive hot spot temperature of a homogeneous many-core design. The model is based on a *spatial temperature low-pass filtering effect* which states that with the same power density, power sources with smaller sizes (corresponding to a higher spatial frequency) are cooler than larger power sources (corresponding to a lower spatial frequency).
- 3. Our analysis can help select optimal core size and core sophistication. Cores that are individually weaker but allow greater TDP may be the right choice. GPUs are one example of such a design philosophy.

Overall, this work suggests that temperature-aware design can gain important benefits from *TDP-scalable* designs and motivates this as a valuable direction for future research.

# 2. RELATED WORK

The power and thermal analysis of multi-core designs has been considered by other researchers. For example, the power and energy efficiency of a multi-core design was shown in [5]. The multi-core architecture design space was explored and the power and thermal impacts on design choices were shown in [6]. [7] investigates thermal management techniques for multi-core designs. [8] performs architecture-level simulations under thermal constraints for multi-core designs. This paper instead takes advantage of the underlying heat transfer theory and targets the scaling trend of thermal design power for many-core designs. Temperature-aware and layout-sensitive floorplanning at the architecture level has also been investigated in [2, 9], but these works did not consider the fact that layout can be made independent of core sophistication. Our work suggests that greater benefit in TDP can be achieved by further refining existing temperature-aware layout techniques.

A unique aspect of our work is that we present an analytical model to quantify many-core design hot spot temperature and the allowed thermal design power as a function of the number of cores. In [8], a thermal model is also proposed without further considering the complicated heat spreading within silicon and package, omitting important details and making it hard to extend to many-core designs. Other existing thermal modeling tools such as HotSpot [10] do not provide direct analytical design insights and are not as efficient as a closed-form analytical thermal model.

This research was supported in part by NSF grant nos. CNS-0551630, CNS-0509245; a MARCO/IFC

# 3. A MOTIVATING EXAMPLE: RELIEVED TDP IN MANY-CORE DESIGNS

Consider two simple designs with the same silicon area: one has a dual-core architecture, the other has a 220-core architecture, and the cores in each design are homogeneous. We further assume that half of the chip area is occupied by L2 and lower-level caches that are placed with the cores in a checkerboard fashion (2x2 and 21x21, respectively), and that the caches generate negligible power densities compared to the cores, as shown in Fig. 1. The assumption that roughly half the area is occupied by L2 and lower-level caches is consistent with recent designs such as IBM POWER5 [11] and Intel Core 2 Duo [12]. The checkerboard core-cache layout arrangement greatly alleviates core-to-core thermal coupling. Whether the caches are shared or private does not greatly affect the way they can be laid out, and a cache of any reasonable size can be banked and placed almost anywhere on the die (e.g. Intel Itanium2-6M [13]), so a checkerboard layout is a legitimate option (and, as we will show, a very good one).

The choice of 220 cores for this example is based on rough estimates of scaling trends. Assuming that we scale from a dual-core design, each of the two cores occupies a quarter of the chip area, and the remaining half of the chip area is secondary caches. We further assume that such a design has a typical 20mm×20mm chip size. Many-core designs of the future are likely to use simpler cores than contemporary complex cores, thus we assume a one-time core architecture shift from the dual-core design to the many-core design, resulting in a down-scaling of the core area. For example, according to core area data in [14], when scaling from EV6 (i.e. Alpha 21264) down to EV4, for the same technology node a $\sim 0.125$ scaling factor due to the change in architecture complexity is observed. In addition, due to Moore's Law, a $\sim 0.5$ area scaling factor exists between successive generations of CMOS technology. Combining the two scaling factors above, the size of a single core will possibly become 100 times smaller in less than four generations $(0.125 * (0.5)^4 \approx 0.01)$. Since ITRS predicts relatively constant chip area across generations, the same 400mm<sup>2</sup> chip would accommodate about 200 such cores. This corresponds to a $20 \times 20$ checkerboard; for our example, we choose $21 \times 21$ since an odd number of divisions makes the floorplan more symmetric. Although this number may seem high, for some applications this is already the norm. (E.g. the nVIDIA GeForce 8800GTX GPU already has 128 simple scalar cores, and the next generation seems likely to double that.)
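The core-count estimate can be checked with a few lines of arithmetic. The sketch below is illustrative only: the constant and function names are ours, and it assumes, as stated in the text, that each original core occupies a quarter of a 400mm<sup>2</sup> die. Note that the exact four-generation factor (about 0.0078) would fit more cores than the rounded $\sim 0.01$ factor used for the estimate.

```python
# Rough core-count estimate from the scaling argument (illustrative sketch).
CHIP_AREA_MM2 = 20.0 * 20.0   # typical 20mm x 20mm die
CACHE_FRACTION = 0.5          # half the die is L2 and lower-level cache
ARCH_SCALE = 0.125            # EV6 -> EV4 core-area factor from [14]
TECH_SCALE_PER_GEN = 0.5      # area factor per technology generation


def core_count(core_area_scale):
    """Number of scaled cores that fit in the non-cache half of the die."""
    core_region = CHIP_AREA_MM2 * (1.0 - CACHE_FRACTION)   # 200 mm^2
    dual_core_area = core_region / 2.0                      # 100 mm^2 per core
    return core_region / (dual_core_area * core_area_scale)


def checkerboard_cores(side):
    """Cores on a side x side checkerboard when cores take the smaller color."""
    return (side * side) // 2


exact_scale = ARCH_SCALE * TECH_SCALE_PER_GEN ** 4   # 0.0078125

print(core_count(0.01))          # rounded ~100x shrink used in the text
print(core_count(exact_scale))   # exact factor after four full generations
print(checkerboard_cores(20), checkerboard_cores(21), checkerboard_cores(2))
```

With the rounded factor this gives 200 cores, and a $21 \times 21$ checkerboard with cores on the less numerous color gives the 220 cores used in the example.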



Figure 1: Dual-core and 220-core designs. The cores and the caches are placed in a checkerboard fashion. Shaded areas correspond to cores that dissipate power (Alpha EV6 core without L2 cache is shown as an example in the dual-core floorplan).

If we apply 110W and 1W to each core (i.e.  $1W/mm^2$  of power density for cores, assuming uniform within-core power distribution and neglecting the cache power) for the dual-core and 220-core designs respectively, we have the same 220W total power for both designs. ITRS predicts increased power density due to non-ideal scaling and  $1W/mm^2$  is a reasonable hot spot power density for contemporary designs. HotSpot 4.0 [3] is used to find the peak temperature rise with respect to ambient temperature. For a typical heatsink convection thermal resistance of 0.1K/W, we find that the dual-core design has a peak temperature rise of 43.3°C, whereas the 220-core design has only 37.1°C. This is because the 220-core design has much smaller cores and a more uniform power distribution, thus less severe hot spot temperatures as we will see in Section 4.

Alternatively, if we try to find the total TDP for each design that results in the same peak hot spot temperature rise of $37.1^{\circ}$C, we get 188W TDP for the dual-core design, and 220W TDP for the 220-core design, a 17% increase in power budget. If a more advanced cooling solution is used (e.g. $R_{conv} = 0.05 K/W$), a 25% increase in TDP will be seen in the 220-core design! Even for a design with a moderate cooling solution (e.g. $R_{conv} = 0.5K/W$), we can still see a 5.8% increase in TDP for the 220-core design. These results are listed in Table 1. Alternatively, we can fix the temperature rise and find the corresponding TDPs for both designs with different values of $R_{conv}$. This should result in the same percentage of TDP gain for the 220-core design because the temperature rise is strictly proportional to TDP. This will be apparent in Eq. 3 in Section 4.2.

| $R_{conv}$ (K/W) | 220-core TDP (W) | Temp. rise (°C) | Dual-core TDP (W) | TDP improvement of many-core (%) |
|------------------|------------------|-----------------|-------------------|----------------------------------|
| 0.05             | 220              | 25.4            | 176               | 25                               |
| 0.1              | 220              | 37.1            | 188               | 17                               |
| 0.5              | 220              | 125.0           | 208               | 5.8                              |

Table 1: For the same hot spot temperature, the many-core design allows greater thermal design power (TDP). A better cooling package yields a larger TDP benefit.
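Because the temperature rise scales linearly with total power, the dual-core TDP in Table 1 can be cross-checked from the two HotSpot results quoted for $R_{conv} = 0.1K/W$. The following is a small sketch of that arithmetic; the function names are ours and are not part of HotSpot.

```python
# Cross-check of Table 1, assuming temperature rise is proportional to TDP.

def tdp_for_target_rise(ref_power_w, ref_rise_c, target_rise_c):
    """Scale a reference power so that the peak rise equals target_rise_c."""
    return ref_power_w * target_rise_c / ref_rise_c


def tdp_gain_percent(many_core_tdp_w, dual_core_tdp_w):
    """Extra TDP the many-core design allows at equal hot spot temperature."""
    return (many_core_tdp_w / dual_core_tdp_w - 1.0) * 100.0


# HotSpot results at R_conv = 0.1 K/W: the dual-core design rises 43.3 C
# at 220 W, and the 220-core design rises 37.1 C at 220 W.
dual_tdp = tdp_for_target_rise(220.0, 43.3, 37.1)
print(f"dual-core TDP for a 37.1 C rise: {dual_tdp:.1f} W")

# Rows of Table 1: (R_conv in K/W, many-core TDP in W, dual-core TDP in W)
table_1 = [(0.05, 220.0, 176.0), (0.1, 220.0, 188.0), (0.5, 220.0, 208.0)]
for r_conv, many_w, dual_w in table_1:
    print(f"R_conv={r_conv}: gain = {tdp_gain_percent(many_w, dual_w):.1f}%")
```

The scaled dual-core power comes out close to the 188W listed in Table 1, and the three gains round to the 25%, 17% and 5.8% in the table.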

This relief in TDP for many-core designs is important. On one hand, if thermal reliability is the major concern in determining the TDP (i.e. not considering energy savings), a many-core design's performance can be boosted to a greater degree than that of single- and multi-core designs by allowing more TDP without worrying about thermal hazards that would appear otherwise. On the other hand, if more performance is not desired (e.g. in real-time applications), a many-core design requires a smaller TDP budget while preserving the same performance, and can therefore use a cheaper thermal package and reduce system cost.

With the above example, it is clearly important to quantitatively investigate how the number of cores, or core size, affects the on-chip hot spot temperature for the same power density. Performing simulations in HotSpot and other thermal tools only yields point solutions. Therefore, an analytical model that accurately derives hot spot temperatures of a many-core design is desired.

# 4. TEMPERATURE MODEL FOR HOMOGENEOUS MANY-CORE DESIGNS

In this section, we present the temperature spatial frequency low-pass filtering theory showing the relationship between heat source size and hot spot temperature, and then derive and validate a model calculating the hot spot temperatures for homogeneous many-core designs. Some implications of the theory are also discussed.

## 4.1 Spatial Temperature Low-Pass Filter

Starting with the traditional temporal frequency-domain analysis for a first-order electrical RC circuit, we utilize the analogy between the temporal frequency (in  $s^{-1}$  or Hz) and the spatial frequency (in  $m^{-1}$ ) to extend the analysis from time to space as well as from electrical domain to thermal domain.

We know that a first order electrical RC circuit (Fig. 2(a)) is a low-pass filter, that is, the voltage drop across the capacitor tracks the input voltage  $V_s(t)$  at low frequency, and is increasingly attenuated at higher frequency. The equivalent impedance of this circuit is  $Z_{eq} = Z_R ||Z_C = R||(\frac{1}{j\omega C})$ , with  $Z_{eq} = R$  at DC, and approaching zero at high frequencies. The resistor R determines the "DC" component of the output voltage, whereas the capacitor determines the "AC" component.

In space, there is also such a "low-pass filtering" effect for temperature distribution. Here, we extend the temporal frequency analysis to the one-dimensional spatial frequency domain. Consider a sinusoidal heat flux (i.e. power density) of q(x), which causes a sinusoidal temperature distribution $T(x) = T_0 e^{j(\omega_s x + \phi_s)}$, where $\omega_s = 2\pi/\lambda$ is the spatial radian frequency, and x is the position in the 1-D space. According to Fourier's Law of heat transfer,

$$q(x) = k \frac{dT(x)}{dx} = k \frac{d}{dx} T_0 e^{j(\omega_s x + \phi_s)} = j\omega_s k T(x) \tag{1}$$

where k is the thermal conductivity. Notice the similarity between Eq. 1 and the impedance of an electrical capacitor  $(\frac{1}{j\omega C})$ . This leads us to some quantity analogous to the electrical capacitor in the spatial domain for heat transfer, which can be interpreted as a *thermal spatial capacitive impedance*:

$$Z_{Cs} = \frac{T}{q} = \frac{1}{j\omega_s k} = \frac{1}{j\omega_s C_s} \tag{2}$$

where  $C_s$  is defined as *thermal spatial capacitance* (notice that  $C_s$  is completely unrelated to the thermal capacitance  $C_{th}$  that determines the *transient* heat transfer), and  $Z_{Cs}$  is the "thermal spatial capacitive impedance".

Eq. 2 is used when there is an AC component, with spatial frequency $\omega_s$, in the applied heat flux. In the case where there is only DC heat flux, Fourier's Law leads to the traditional definition of thermal resistance: $Z_{Rs} = \frac{t_{iso}}{k}$, where $t_{iso}$ is the distance from the active silicon surface to the isotherm surface in the package. From the above derivation, we can reach a first-order spatial thermal series "$R_s C_s$" circuit similar to Fig. 2(a). Fig. 2(b) shows a more intuitive Norton equivalent circuit. The heat flux generated by the active silicon layer, q(x), models the non-uniform distribution of power density across the chip. The DC component in the spatial temperature distribution is determined by $Z_{Rs}$, whereas the AC component is determined by $Z_{Cs}$. In addition, the total equivalent thermal spatial impedance is $Z_{eq_s} = Z_{Rs} ||Z_{Cs}$.



Figure 2: (a) A first-order electrical RC circuit. (b) The Norton equivalent first-order thermal spatial "RC" circuit.

We can see that for low spatial frequencies (power sources with large dimensions), the thermal impedance is close to the DC component, that is, the lumped $R_{th} = t_{iso}/(kA)$ that we usually see (A is the corresponding vertical heat conduction area). But for high spatial frequencies (power sources with small dimensions), the impedance attenuates to smaller values due to the presence of the thermal spatial "capacitance". This explains the spatial temperature low-pass filtering effect: structures with tiny dimensions have lower peak temperature compared to their larger counterparts with the same applied power density. Intuitively, a tiny heat source even with a high power density does not significantly increase the total power dissipation that the package has to remove, thus temperature rise in the package is almost negligible. On the other hand, a large heat source with high power density results in a significant rise in total power dissipation, which in turn leads to a significant temperature rise at the heat sink and the heat spreader, hence the increase of average and peak silicon temperatures.
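The attenuation can be illustrated numerically with a single-stage version of the spatial RC circuit. The sketch below uses the impedances $Z_{Rs} = t_{iso}/k$ and $Z_{Cs} = 1/(j\omega_s k)$ from the derivation and, as an assumption borrowed from the core-size convention of Section 4.2, maps a heat source of size $L$ to $\omega_s = 2\pi/(2L)$. The function names and the sample value of $t_{iso}$ are ours and purely hypothetical. This is not the 3-ladder model behind Fig. 3, so its numbers should not be expected to match that figure.

```python
import math


def spatial_impedance(omega_s, t_iso, k):
    """Magnitude of Z_Rs || Z_Cs for one spatial RC stage (omega_s > 0)."""
    z_r = t_iso / k                    # DC (resistive) part
    z_c = 1.0 / (1j * omega_s * k)     # spatial "capacitive" part
    return abs(z_r * z_c / (z_r + z_c))


def source_frequency(source_size):
    """Spatial radian frequency of a source whose wavelength is twice its size."""
    return 2.0 * math.pi / (2.0 * source_size)


def attenuation(source_size, t_iso, k):
    """|Z_eq_s| / Z_Rs, which equals 1 / sqrt(1 + (omega_s * t_iso)^2)."""
    return spatial_impedance(source_frequency(source_size), t_iso, k) / (t_iso / k)


K_SI = 100.0      # W/(m-K), silicon thermal conductivity
T_ISO = 1.0e-3    # m, hypothetical isotherm distance for illustration only

for size_mm in (10.0, 1.0, 0.1):
    ratio = attenuation(size_mm * 1e-3, T_ISO, K_SI)
    print(f"{size_mm:5.1f} mm source: |Z_eq|/Z_Rs = {ratio:.3f}")
```

Large sources stay near the DC resistance, while the ratio falls steadily as the source shrinks, which is the low-pass behavior described above.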

Because heat transfer in the x and y lateral directions is orthogonal, as determined by the 2-D form of Fourier's Law, the above derivation can be easily extended into two-dimensional space with similar results.

One limitation of the above analysis is that it does not take into account the vertical temperature gradient in the chip and the package. A more accurate analysis would use multiple $R_s C_s$ ladders or, ideally, a distributed thermal spatial $R_s C_s$ circuit. Fig. 3 shows the comparison between the proposed granularity analysis (3-ladder spatial $R_s C_s$ circuit) and ANSYS simulations for different heat source sizes. Note that the spatial frequency and equivalent thermal impedance are both normalized. The low-pass temperature filtering effect for the relationship between heat source size and peak temperature is strong. For example, assuming the isotherm thickness is 4mm, a heat source of 0.1mm size has a normalized spatial frequency of 40, which corresponds to $0.045 \times$ the peak resistance from Fig. 3 and hence a tiny $0.045 \times$ peak temperature rise.



Figure 3: Comparison of 3-ladder thermal spatial RC model and ANSYS simulation for different heat source sizes (Both axes are in log scale).

## 4.2 Homogeneous Many-Core Thermal Model

In this section, we use the above spatial temperature low-pass filtering theory to derive an analytical model for hot spot temperature of many-core designs.

If we consider that all the cores are homogeneous and each core in a many-core design is a uniform heat source, the size of a core directly relates to the hot spot temperature of the chip. Thus a first-order many-core hot spot temperature model as a function of the number of cores (n) can be written as follows,

$$T_{max} = \text{TDP}\left(R_{conv} + \frac{t_{si} - t_{iso}(n)}{kA} + \frac{t_{iso}(n)}{k} \frac{1}{A(1 - \text{Ca}(n))} \left| \frac{1}{1 + j\omega_s \tau_s} \right| \right) \tag{3}$$

where TDP is the total thermal design power, $R_{conv}$ is the heatsink-to-ambient convection thermal resistance, A is the total chip area, and Ca(n) is a function with values in the range (0,1) that models the fraction of chip area occupied by L2 and lower-level caches. Therefore $L_{core} = \sqrt{A \frac{1-Ca(n)}{n}}$ is the size of one core. The term $\left|\frac{1}{1+j\omega_s\tau_s}\right|$ models the low-pass temperature filtering effect, where $\omega_s = \frac{2\pi}{2L_{core}}$ is the spatial frequency and $\tau_s = 0.5R_sC_s = 0.5t_{iso}(n)$; the 0.5 factor accounts for the difference between the aforementioned distributed and lumped RC constants.

Eq. 3 states that the peak temperature of a homogeneous many-core system can be calculated by adding the temperature rise from the air to the isotherm surface inside the package (the first two terms) to the temperature rise from the isotherm surface to the silicon surface (the third term). [15] observed this composition as well, but here the third term is governed by the presented spatial temperature low-pass filtering effect caused by the small core size. $t_{iso}(n)$ is the isotherm thickness, which is a function of the number of cores, and $t_{si}$ is the total equivalent silicon thickness that combines the thickness of TIM, spreader and heatsink.
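As a reading aid, Eq. 3 can be transcribed term by term into a short function. This is only a sketch under stated assumptions: the function and parameter names are ours, the cache fraction is passed as a constant rather than a function Ca(n), and the isotherm thickness $t_{iso}(n)$ must be supplied by the caller because its derivation is given in [16] and not here. The result is treated as a temperature rise above ambient, and the $t_{iso}$ value in the usage lines is a hypothetical placeholder.

```python
import math


def hot_spot_rise(tdp_w, r_conv, n_cores, t_iso_m, area_m2, cache_frac, k, t_si_m):
    """Peak temperature rise from Eq. 3 (SI units; t_iso supplied by caller)."""
    # Core edge length: L_core = sqrt(A * (1 - Ca) / n)
    l_core = math.sqrt(area_m2 * (1.0 - cache_frac) / n_cores)
    omega_s = 2.0 * math.pi / (2.0 * l_core)    # spatial frequency
    tau_s = 0.5 * t_iso_m                       # 0.5 * R_s * C_s
    low_pass = 1.0 / math.hypot(1.0, omega_s * tau_s)   # |1 / (1 + j w tau)|

    r_package = (t_si_m - t_iso_m) / (k * area_m2)      # second term
    r_silicon = (t_iso_m / k) / (area_m2 * (1.0 - cache_frac)) * low_pass
    return tdp_w * (r_conv + r_package + r_silicon)


# Sample inputs: a 441 mm^2 die, half cache, silicon k = 100 W/(m-K),
# t_si = 2.7 mm, and a HYPOTHETICAL t_iso of 0.5 mm for both designs.
common = dict(area_m2=441e-6, cache_frac=0.5, k=100.0, t_si_m=2.7e-3)
many = hot_spot_rise(220.0, 0.1, 220, 0.5e-3, **common)
dual = hot_spot_rise(220.0, 0.1, 2, 0.5e-3, **common)
print(f"220-core rise: {many:.1f} C, dual-core rise: {dual:.1f} C")
```

Because $t_{iso}$ is a placeholder here and in reality depends on n, the printed values only show the trend (smaller cores give a lower rise) and are not meant to reproduce the paper's numbers.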

To validate the accuracy of Eq. 3, we compare to the HotSpot results presented in Section 3 for the dual-core and 220-core designs with $R_{conv} = 0.1 K/W$. Here, Ca(n) = 0.5 since half of the chip area is occupied by the caches, $A = 441 \text{mm}^2$, TDP=220W, k=100W/(m-K) for silicon, and $t_{si} = 2.7$mm according to the default package values in HotSpot. Because the derivation of Eq. 3 is completely independent of HotSpot and HotSpot 4.0 has been extensively validated against ANSYS [3], HotSpot makes a good reference. Table 2 shows Eq. 3 to be accurate, especially if the core number n is large (0.5% error). The more noticeable error for the dual-core design (11.1% error) is caused by the fact that for the two large and hot cores, the assumption that the center sink-to-air surface is isothermal is not accurate. More detailed thermal simulation is needed to determine the model's error for the case of a few cores and a good cooling package (i.e. small $R_{conv}$). However, since the model targets many-core designs (mostly with tens or hundreds of cores), the error in designs with a few cores is not critical.

| number of cores | model (°C) | HotSpot (°C) | error (%) |
|-----------------|------------|--------------|-----------|
| 220-core        | 37.3       | 37.1         | 0.5       |
| dual-core       | 48.1       | 43.3         | 11.1      |

Table 2: Comparison of the proposed model (Eq. 3) with HotSpot when  $R_{conv} = 0.1 K/W$ . For many-core design, the model is accurate. The model is less accurate for fewer cores.
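The error column of Table 2 is the relative difference between the model and HotSpot, taking HotSpot as the reference. A minimal sketch of that calculation follows; the helper name is ours.

```python
def percent_error(model_c, reference_c):
    """Relative error of the model against the reference, in percent."""
    return abs(model_c - reference_c) / reference_c * 100.0


# (design, model rise in C, HotSpot rise in C) from Table 2
for name, model_c, hotspot_c in [("220-core", 37.3, 37.1), ("dual-core", 48.1, 43.3)]:
    print(f"{name}: {percent_error(model_c, hotspot_c):.1f}% error")
```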

## 4.3 Implications of the Model

Fig. 4 shows a plot of hot spot temperature vs. number of cores from Eq. 3 with half the chip area occupied by cooler caches. Thermal design powers of 200W and 250W are applied respectively. When the core number approaches infinity, i.e. truly uniform power distribution across the chip, a uniform chip temperature is obtained and there are no particular hot spots. When the number of cores is about 2-4 or in the thousands and beyond, the hot spot temperature does not change much. For the range of ten to a few hundred cores, the hot spot temperature is quite sensitive to the number of cores; therefore, a potential opportunity exists in this region for optimization between thermal design power, performance, and package cost. From Fig. 4, we can also confirm what we previously observed from HotSpot simulations in Section 3: a many-core design at a higher thermal design power (250W in this example) can have the same hot spot temperature as a fewer-core design which can tolerate much less thermal design power (200W).



Figure 4: Chip peak temperature as a function of number of cores for TDP=200W and 250W, with L2 caches occupying fixed half chip area.

Another observation from the model in Eq. 3 is that when the package-to-ambient convection thermal resistance $(R_{conv})$ dominates the silicon-to-package thermal resistance, the on-chip peak temperature is not sensitive to the number of cores. It is instead determined more by the total power of the chip, which confirms a similar observation in [8]. This is the case for most low-cost designs that usually have only natural convection as the cooling method. It is also a common fallacy to use power density as a proxy for temperature. Eq. 3 shows that the hot spot temperature is determined not just by the power-density-related second and third terms (proportional to TDP/A). The total-power-related first term also plays an important role. For example, in a low-cost many-core design, it is possible that the total power is fixed but the increase in the number of cores leads to more power density for each core due to the increase in secondary cache area. However, the low-pass filtering effect combined with the dominant package-to-air thermal resistance may yield a lower peak temperature.

There are many other interesting questions to be answered regarding the implications of Eq. 3. For example, with the many-core design shift and the thermal spatial low-pass filtering effect, one may wonder if it is practical to perform thermal analysis at the core granularity rather than at the within-core block granularity. According to our preliminary results, it is likely to be so [16]. Another example is to observe the effect of cache area on peak temperatures by varying Ca(n) rather than fixing it at 0.5. Our preliminary results indicate that the hot spot temperature of a homogeneous many-core design is not monotonically related to the cache area. More details about the implications of the model can be found in an extended discussion of this paper [16], which also presents more explanation of Eq. 3 and the derivation of $t_{iso}(n)$.

# 5. LIMITATIONS AND FUTURE WORK

It might seem that with the proposed model in Eq. 3, it is no longer necessary to run more detailed HotSpot-like thermal simulations. This is usually not the case. The model presented in this paper still has the following limitations:

- 1. All cores are assumed to be homogeneous with uniform power. For heterogeneous many-core designs, further analytical modeling or detailed thermal simulations are needed.
- 2. All the cores are assumed to dissipate the same power all the time. This is not always true due to different activity factors and the application of DVS or clock-gating among cores. However, the proposed model already takes care of the worst-case combination of core activities and is enough to decide TDP and package choices.
- 3. Different types of cooling solutions may have different impacts on TDP.
- 4. As mentioned earlier, the analytical model in this paper has more error when the number of cores is small. This is due to the fact that the assumption that the isotherm surface always appears in the package may not be valid for a small number of cores dissipating a lot of power.
- 5. The simulations and derivations here do not consider the role of the on-chip network. Interconnect density, and hence the area, power and performance overheads all go up as number of cores increases. Interconnect networks' power and thermal impacts need to be carefully considered in many-core designs.

All the above limitations are important to address and will be interesting future work.

# 6. CONCLUSION

It is important to understand how many-core design options interact with thermal and power limits of modern scaled CMOS technologies in order to maintain Moore's Law scaling. In this paper, we present a theoretical analysis of the relationship between core size and peak temperature, and propose a quantitative model to estimate many-core chip hot spot temperature as a function of number of cores. We find that many-core design has the potential advantages of significantly relieving the thermal design power constraint, and hence a performance boost or cheaper system cost.

# 7. REFERENCES

- [1] The International Technology Roadmap for Semiconductors (ITRS), 2005.
- [2] K. Sankaranarayanan et al. A Case for Thermal-Aware Floorplanning at the Microarchitectural Level. *The Journal of ILP*, vol. 7, October 2005.
- [3] W. Huang et al. An Improved HotSpot Block-Based Thermal Model with Granularity Considerations. *WDDD Workshop, in conjunction with ISCA*, June 2007.
- [4] K. Etessam-Yazdani et al. Impact of Power Granularity on Chip Thermal Modeling. *ITHERM*, June 2006.
- [5] P. Kongetira et al. NIAGARA: A 32-Way Multithreaded SPARC Processor. *IEEE Micro*, 25(2):21–29, March-April 2005.
- [6] M. Monchiero et al. Design Space Exploration for Multicore Architectures: A power/performance/thermal view. *ICS*, June 2006.
- [7] J. Donald et al. Techniques for Multicore Thermal Management: Classification and New Exploration. *ISCA*, June 2006.
- [8] Y. Li et al. CMP Design Space Exploration Subject to Physical Constraints. *HPCA*, 2006.
- [9] Y. Han et al. Temperature-aware floorplanning. *TACS Workshop in conj. with ISCA*, 2005.
- [10] K. Skadron et al. Temperature-aware Microarchitecture: Modeling and Implementation. *ACM TACO*, 1(1):94–125, March 2004.
- [11] J. Clabes et al. Design and Implementation of the Power5 Microprocessor. *ISSCC*, February 2004.
- [12] N. Sakran et al. The Implementation of the 65nm Dual-Core 64b Merom Processor. *ISSCC*, February 2007.
- [13] S. Rusu et al. Itanium 2 Processor 6M: Higher Frequency and Larger L3 Cache. *IEEE Micro*, 24(2):10–18, March-April 2004.
- [14] R. Kumar et al. Heterogeneous Chip Multiprocessors. *IEEE Computer*, 38(11):32–38, May 2005.
- [15] Y. Li et al. Performance, Energy, and Thermal Considerations for SMT and CMP Architectures. *HPCA*, February 2005.
- [16] W. Huang et al. Many-Core Design from a Thermal Perspective: Extended Analysis and Results. Technical Report CS-2008-05, University of Virginia, Computer Science Department, April 2008.