This is not just a fad! The "frequency wall" appears to be a fundamental technological constraint, and the "ILP wall" appears to be a fundamental property of many applications. Together, the frequency and ILP walls mean that improving the performance of individual cores is unlikely to be cost effective. Integrating multiple cores, on the other hand, offers a straightforward path for continued exponential performance improvements (although contention for memory and interconnect will create scaling challenges).
The question is how to use multiple cores. Simply replicating today's superscalar cores, such as Pentiums or Opterons, will provide short-term benefits as the operating system is able to run distinct or loosely-coupled processes concurrently. Few workloads, however, scale naturally to large numbers of these large, heavyweight cores. Considerable parallelism can be found in many important applications, such as photo/video editing, image processing, games, simulations, and data analysis--even in applications with irregular task-level parallelism and irregular data-access patterns. Realizing this parallelism, however, will require new multicore architectures and the adoption of parallel programming, perhaps assisted by new parallel-programming models. Critics may argue that parallel computing has yet to become mainstream despite decades of research, because in most cases it is so much harder than sequential programming. Yet flat single-core performance and the presence of commodity, affordable multiprocessor hardware create new incentives that allow much greater optimism for widespread adoption of parallel programming. The lower cost and greater convenience of on-chip communication also simplify some of the classic problems in parallel-program design.
Nevertheless, parallel programming generally remains excruciating, especially when irregular or recursively-defined data structures are involved. In the general case, the programmer must spend considerable effort to understand the potential concurrency in the application, devise a parallel algorithm, manage synchronization and ordering, and ensure that no data races or deadlocks can occur. Usually, limited thread contexts and the high cost of software-managed threads mean that the programmer must also map the natural decomposition to a less natural, coarser one that better suits the hardware and its high cost of communication, and do so in a parameterized way that allows for portability. Worse yet, in order to realize any performance benefits at all, today we expect programmers to explicitly manage numerous hardware details that affect data placement, thread placement, and thread scheduling. This is simply too much for many programmers and drastically reduces productivity even for the most experienced. This burden may raise the implementation cost to a level where parallelization is simply not cost-effective. The social cost is high: projects aborted, product ideas canceled, benefits of interactivity forgone, and so on. On the other hand, this means that convenience may be a more important consideration than peak performance for many programmers, and this consideration may in turn determine competitive advantage.
For many if not most tasks with significant parallelism, an architecture emphasizing throughput over single-thread latency will likely give better performance and energy efficiency. This argues for large numbers of small, multithreaded cores--an approach that has recently been referred to as manycore design. The questions of how to architect such a system and what programming model(s) to support are wide open but of burning importance in shaping the next 10-20 years of personal computing devices. On the other hand, manycore architectures offer new opportunities to simplify parallel programming.
The LAVAlab here at the University of Virginia is exploring a variety of aspects of manycore design. This project and the arguments presented here grew out of a combination of our ongoing work on the role of power and thermal limits in multicore design, and graduate seminars in fall 2006 and spring 2007. Our fundamental conclusion is that scalability should not sacrifice programmability.
The architecture of general-purpose manycore chips must conveniently support a wide range of programming styles and languages. Architectures that enforce a particular model of parallel programming are likely to fail--especially if that model requires practices that do not come naturally to the average programmer. What is needed is a convenient architecture: one that conveniently supports a wide range of parallel programming models and languages, conveniently supports porting of legacy programs, and conveniently provides reasonable but robust performance benefits. Developing a correct, straightforward implementation of a parallel program is hard enough. The architecture should not force programmers to program in unintuitive ways and wrestle with a variety of hardware details, dodging "performance cliffs" just to get any reasonable speedup. In many cases, any approach that yields the desired performance with minimal effort will be preferred, even at the expense of less efficient hardware. Human productivity is often far more valuable than scalable performance or energy efficiency.
There is nothing wrong with hardware support for new programming models or languages, even those that are restrictive or limited to specific application domains. However, history suggests that there will never be a universal solution, and in any case, the existence of legacy codes, legacy languages, and "legacy brains" means that adoption of new paradigms will be slow. Architects therefore cannot rely on languages or compilers to deal with programmer convenience. General-purpose chips that only support new programming techniques, or only certain forms of parallelism (e.g. data parallelism at the expense of task parallelism), will be at a substantial disadvantage. (Obviously, this argument does not apply to domain-specific chips that were never intended for general-purpose programming.)
The problem of "legacy brains" is arguably the biggest concern! Until we train all programmers to think naturally in a parallel paradigm, we must deal with the fact that the vast majority of programmers will not be expert programmers from the HPC community, nor even computer scientists at all, but rather application experts who may not adopt new languages as readily as computer scientists. Training current and future programmers to think and program in parallel is unquestionably important. Yet the vast body of existing programmers needs to achieve benefits from multicore now. In many cases, such users will not find parallel programming easy. They will want to minimize the effort to achieve some application-specific objective.
The main problem we face now is one of building an ecosystem around parallel programming. This requires an architecture that supports a variety of programming styles and languages, including strong support for legacy languages. The company that provides the most convenient architecture with the lowest initial effort will have a huge competitive advantage.
The preceding considerations argue for the following research directions in manycore computer architecture:
Accelerators, such as cryptographic, signal-processing, packet-processing, and even graphics engines, create special challenges. It may be sufficient to continue using these accelerators for the narrow application domains for which they were intended. Given a fixed die size, however, this represents an opportunity cost, in terms of more general-purpose computing density or storage capacity. If accelerators must be included, the key challenge is to harness their special capabilities and special programming models in a convenient programming abstraction.
All the preceding capabilities must be achieved while maintaining acceptable power and thermal characteristics. Despite the severity of the power wall, we contend that programmability must be the primary concern: spending some chip energy may be worthwhile if it lowers programmer "activation energy"! A power-efficient chip won't do much good if there's no market for it or if a competing chip is able to capture greater market share. On the other hand, power delivery, battery life, and cooling costs can also be make-or-break factors for a product. This puts the burden for managing the power wall on architects. We must find good abstractions that provide a simple, language-neutral model that is easy to reason about and provides reasonable performance, yet does not require power-hungry implementations. Streams are an interesting example, because for some applications, they provide an excellent programming abstraction. They also lend themselves to very efficient hardware implementation, by allowing bulk transfers in the memory hierarchy and by facilitating direct producer-consumer chaining, thus avoiding costly writes to centralized storage. However, streams, like many of the other innovations in languages, tools, and architectures today, do not yet address the above challenges for important applications involving irregularly-structured data or irregular reference patterns, such as graphs and trees. Ultimately, the architecture must be extensible as new optimizations such as streaming are developed, while retaining convenience in the general case.
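The appeal of the stream abstraction can be sketched in purely software terms: a pipeline of stages passes each item directly from producer to consumer, so intermediate results are never materialized in a central buffer. A minimal sketch using Python generators (the stage functions are hypothetical, chosen only to illustrate the chaining):

```python
def produce(n):
    # Producer stage: emits a stream of values.
    for i in range(n):
        yield i

def scale(stream, k):
    # Intermediate stage: consumes each item as soon as it is produced,
    # so no intermediate array is ever materialized.
    for x in stream:
        yield x * k

def accumulate(stream):
    # Consumer stage: reduces the stream to a single value.
    total = 0
    for x in stream:
        total += x
    return total

# Chain the stages: data flows produce -> scale -> accumulate, item by item.
result = accumulate(scale(produce(10), 3))
print(result)  # 3 * (0 + 1 + ... + 9) = 135
```

The same dataflow structure is what lets stream hardware chain producers directly to consumers; the software analogue simply makes the idea concrete.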
In short, we want to argue against the eat-your-spinach attitude in which we make programming even harder or only expose the speedups from parallel hardware to those willing to use new languages (the spinach) in order to make it easy to design scalable hardware (a benefit that is no more likely to convince the average programmer than Popeye could convince me to eat spinach as a child!). We need to provide an evolutionary path by helping programmers focus as much as possible on the inherent algorithmic and structural choices in writing a parallel program, and as little as possible on the non-idealities posed by the underlying hardware.
Some recent trends to simplify the hardware design for scalability, in exchange for a more difficult programming model, fail the "spinach" test. For example, an architecture mandating explicit data movement adds daunting complexity, especially for irregular parallelism. This is likely to slow adoption compared to an architecture offering some reasonable but less efficient automatic data movement as a default and offering explicit data movement as a higher-performance mode when needed. Giving programmers the option is quite another matter: convenience is maintained by default, while the programmer can drill down to explicit control when necessary. This offers a more friendly paradigm to beginners and experts alike and is more likely to gain widespread acceptance.
In our initial investigation into these issues, we are pursuing just a few activities to support the preceding goals.
Coordinated support for data- and task-parallel code: Support for task parallelism requires a MIMD design, while data parallelism argues for a SIMD design. We are exploring architectures that can reconcile both, with a special focus on reconciling the radically different memory-reference characteristics that these workload categories present.
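The distinction between the two styles can be illustrated in software terms: a data-parallel phase applies one operation across many elements (SIMD-style), while a task-parallel phase runs unrelated operations concurrently (MIMD-style). A toy sketch in Python (the kernels are hypothetical, chosen only for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):
    # Data-parallel kernel: the same operation applied to every element.
    return x * x

def count_words(text):
    # A task unlike the others: different code, different data.
    return len(text.split())

with ThreadPoolExecutor(max_workers=4) as pool:
    # Data parallelism (SIMD-style): one operation, many data elements.
    squares = list(pool.map(square, range(8)))

    # Task parallelism (MIMD-style): unrelated operations run concurrently.
    f1 = pool.submit(count_words, "the quick brown fox")
    f2 = pool.submit(sum, range(100))
    results = (f1.result(), f2.result())

print(squares)  # [0, 1, 4, 9, 16, 25, 36, 49]
print(results)  # (4, 4950)
```

A convenient architecture should make both phases efficient in one program, despite their very different memory-reference behavior: the data-parallel phase streams predictably through memory, while the independent tasks touch unrelated data.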
Core Federation: Multicore chips benefit from one or more high-performance cores for serial code or code with limited parallelism. But mixing high-ILP cores with small, simple, scalar cores imposes a fixed partitioning among task types: only workloads with the proper ratio of tasks will use the resources most efficiently. Instead, it is possible to federate small, simple cores to provide the same dynamic scheduling capability as a traditional out-of-order processor. The goal here is to coordinate the activities of groups of simple, in-order processors to provide out-of-order execution with virtually no extra hardware.
Optimal core complexity: We are continuing our investigation of power/throughput/area tradeoffs for different core types. Our early work suggested that superscalar, out-of-order-execution (OO) cores are generally superior to simple cores, but this work did not fully account for the role of heavy multithreading. Our hypothesis is that simple, in-order cores are only advantageous if heavily multithreaded; in other words, heavy multithreading benefits in-order cores much more than OO cores.
APIs for exposing producer-consumer relationships: This allows the hardware to better direct data from producer to consumer while bypassing the costly use of centralized storage.
Hardware support for managing data locality: We are exploring integrated mechanisms to coordinate prefetching, locality management, and prevention of false sharing.
Runtime load balancing: Fine-grained threads offer the flexibility to balance load by rate-matching producers and consumers, and to cope with architectures that are heterogeneous either by design or due to manufacturing variations.
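The benefit of fine-grained work units for load balancing can be sketched with a shared work queue: each worker claims the next small task as soon as it finishes the last one, so faster workers naturally absorb more of the load. A toy sketch (the per-worker delay is a stand-in for heterogeneous core speeds, not a model of real hardware):

```python
import queue
import threading
import time

def worker(tasks, results, delay):
    # Pull the next fine-grained task whenever free; faster workers
    # (smaller delay) naturally claim more tasks, balancing the load.
    while True:
        try:
            item = tasks.get_nowait()
        except queue.Empty:
            return
        time.sleep(delay)              # simulate a slower or faster core
        results.append(item * item)    # list.append is atomic in CPython

tasks = queue.Queue()
for i in range(100):
    tasks.put(i)

results = []
threads = [threading.Thread(target=worker, args=(tasks, results, d))
           for d in (0.0, 0.001, 0.002)]  # three "cores" of differing speed
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(results))  # 100: all tasks complete regardless of worker speed
```

No static partitioning is needed: the same code balances load whether the workers differ by design or by manufacturing variation, which is exactly the flexibility that fine-grained threads provide.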
Smart memory controllers: The advent of on-chip memory controllers effectively creates a new type of accelerator which can be used to reduce off-chip and intra-chip data traffic.
Temperature-aware design for multicore chips: In particular, we are exploring the role of floorplanning and L2 placement in mitigating core hotspots.
Application development: In order to better understand the design tradeoffs in a realistic context, we are developing parallel implementations (using several different APIs) of applications ranging from image processing to discrete-event simulation to graph- and tree-based algorithms.