This is not just a fad! The "frequency wall" appears to be a fundamental technological constraint, and the "ILP wall" appears to be a fundamental property of many applications. Together, the frequency and ILP walls mean that improving the performance of individual cores is unlikely to be cost effective. Integrating multiple cores, on the other hand, offers a straightforward path for continued exponential performance improvements (although contention for memory and interconnect will create scaling challenges).
The question is how to use multiple cores. Simply replicating today's superscalar cores, such as Pentiums or Opterons, will provide short-term benefits as the operating system is able to run distinct or loosely-coupled processes concurrently. Few workloads, however, scale naturally to large numbers of these large, heavyweight cores. Considerable parallelism can be found in many important applications, such as photo/video editing, image processing, games, simulations, and data analysis--even in applications with irregular task-level parallelism and irregular data-access patterns. Realizing this parallelism, however, will require new multicore architectures and the adoption of parallel programming, perhaps assisted by new parallel-programming models. Critics may argue that parallel computing has yet to become mainstream despite decades of research, because in most cases it is so much harder than sequential programming. Yet flat single-core performance and the presence of commodity, affordable multiprocessor hardware create new incentives that allow much greater optimism for widespread adoption of parallel programming. The lower cost and greater convenience of on-chip communication also simplify some of the classic problems in parallel-program design.
Nevertheless, parallel programming generally remains excruciating, especially when irregular or recursively-defined data structures are involved. In the general case, the programmer must spend considerable effort to understand the potential concurrency in the application, devise a parallel algorithm, manage synchronization and ordering, and ensure that no data races or deadlocks can occur. Usually, limited thread contexts and the high cost of software-managed threads mean that the programmer must also map the natural decomposition to a less natural, coarser one that better suits the hardware and its high cost of communication, and do so in a parameterized way that allows for portability. Worse yet, in order to realize any performance benefits at all, today we expect programmers to explicitly manage numerous hardware details that affect data placement, thread placement, and thread scheduling. This is simply too much for many programmers and drastically reduces productivity even for the most experienced. This burden may raise the implementation cost to a level where parallelization is simply not cost-effective. The social cost is high: projects aborted, product ideas canceled, benefits of interactivity forgone, and so on. On the other hand, this means that convenience may be a more important consideration than peak performance for many programmers, and this consideration may in turn determine competitive advantage.
For many if not most tasks with significant parallelism, an architecture emphasizing throughput over single-thread latency will likely give better performance and energy efficiency. This argues for large numbers of small, multithreaded cores--an approach that has recently been referred to as manycore design. The questions of how to architect such a system and what programming model(s) to support are wide open but of burning importance in shaping the next 10-20 years of personal computing devices. On the other hand, manycore architectures offer new opportunities to simplify parallel programming.
The LAVAlab here at the University of Virginia is exploring a variety of aspects of manycore design. This project and the arguments presented here grew out of a combination of our ongoing work on the role of power and thermal limits in multicore design, and graduate seminars in fall 2006 and spring 2007. Our fundamental conclusion is that scalability should not sacrifice programmability.
The architecture of general-purpose manycore chips must conveniently support a wide range of programming styles and languages. Architectures that enforce a particular model of parallel programming are likely to fail--especially if that model requires practices that do not come naturally to the average programmer. What is needed is a convenient architecture: one that conveniently supports a wide range of parallel programming models and languages, conveniently supports porting of legacy programs, and conveniently provides reasonable but robust performance benefits. Developing a correct, straightforward implementation of a parallel program is hard enough. The architecture should not force programmers to program in unintuitive ways and wrestle with a variety of hardware details, dodging "performance cliffs" just to get any reasonable speedup. In many cases, any approach that yields the desired performance with minimal effort will be preferred, even at the expense of less efficient hardware. Human productivity is often far more valuable than scalable performance or energy efficiency.
There is nothing wrong with hardware support for new programming models or languages, even those that are restrictive or limited to specific application domains. However, history suggests that there will never be a universal solution, and in any case, the existence of legacy codes, legacy languages, and "legacy brains" means that adoption of new paradigms will be slow. Architects therefore cannot rely on languages or compilers to deal with programmer convenience. General-purpose chips that only support new programming techniques, or only certain forms of parallelism (e.g. data parallelism at the expense of task parallelism), will be at a substantial disadvantage. (Obviously, this argument does not apply to domain-specific chips that were never intended for general-purpose programming.)
The problem of "legacy brains" is arguably the biggest concern! Until we train all programmers to think naturally in a parallel paradigm, we must deal with the fact that the vast majority of programmers will not be expert programmers from the HPC community, nor even computer scientists at all, but rather application experts who may not adopt new languages as readily as computer scientists. Training current and future programmers to think and program in parallel is unquestionably important. Yet the vast body of existing programmers needs to achieve benefits from multicore now. In many cases, such users will not find parallel programming easy. They will want to minimize the effort to achieve some application-specific objective.
The main problem we face now is one of building an ecosystem around parallel programming. This requires an architecture that supports a variety of programming styles and languages, including strong support for legacy languages. The company that provides the most convenient architecture with the lowest initial effort will have a huge competitive advantage.
The preceding considerations argue for the following research directions in manycore computer architecture:
Accelerators, such as cryptographic, signal-processing, packet-processing, and even graphics engines, create special challenges. It may be sufficient to continue using these accelerators for the narrow application domains for which they were intended. Given a fixed die size, however, this represents an opportunity cost, in terms of more general-purpose computing density or storage capacity. If accelerators must be included, the key challenge is to harness their special capabilities and special programming models in a convenient programming abstraction.
All the preceding capabilities must be achieved while maintaining acceptable power and thermal characteristics. Despite the severity of the power wall, we contend that programmability must be the primary concern: spending some chip energy may be worthwhile if it lowers programmer "activation energy"! A power-efficient chip won't do much good if there's no market for it or if a competing chip is able to capture greater market share. On the other hand, power delivery, battery life, and cooling costs can also be make-or-break factors for a product. This puts the burden for managing the power wall on architects. We must find good abstractions that provide a simple, language-neutral model that is easy to reason about and provides reasonable performance, yet does not require power-hungry implementations. Streams are an interesting example, because for some applications, they provide an excellent programming abstraction. They also lend themselves to very efficient hardware implementation, by allowing bulk transfers in the memory hierarchy and by facilitating direct producer-consumer chaining, thus avoiding costly writes to centralized storage. However, streams, like many of the other innovations in languages, tools, and architectures today, do not yet address the above challenges for important applications involving irregularly-structured data or irregular reference patterns, such as graphs and trees. Ultimately, the architecture must be extensible as new optimizations such as streaming are developed, while retaining convenience in the general case.
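The appeal of the stream abstraction can be sketched in purely software terms: a pipeline of stages passes each item directly from producer to consumer, so intermediate results are never materialized in a central buffer. A minimal sketch using Python generators (the stage functions are hypothetical, chosen only to illustrate the chaining):

```python
def produce(n):
    # Producer stage: emits a stream of values.
    for i in range(n):
        yield i

def scale(stream, k):
    # Intermediate stage: consumes each item as soon as it is produced,
    # so no intermediate array is ever materialized.
    for x in stream:
        yield x * k

def accumulate(stream):
    # Consumer stage: reduces the stream to a single value.
    total = 0
    for x in stream:
        total += x
    return total

# Chain the stages: data flows produce -> scale -> accumulate, item by item.
result = accumulate(scale(produce(10), 3))
print(result)  # 3 * (0 + 1 + ... + 9) = 135
```

The same dataflow structure is what lets stream hardware chain producers directly to consumers; the software analogue simply makes the idea concrete.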
In short, we want to argue against the eat-your-spinach attitude in which we make programming even harder or only expose the speedups from parallel hardware to those willing to use new languages (the spinach) in order to make it easy to design scalable hardware (a benefit that is no more likely to convince the average programmer than Popeye could convince me to eat spinach as a child!). We need to provide an evolutionary path by helping programmers focus as much as possible on the inherent algorithmic and structural choices in writing a parallel program, and as little as possible on the non-idealities posed by the underlying hardware.
Some recent trends to simplify the hardware design for scalability, in exchange for a more difficult programming model, fail the "spinach" test. For example, an architecture mandating explicit data movement adds daunting complexity, especially for irregular parallelism. This is likely to slow adoption compared to an architecture offering some reasonable but less efficient automatic data movement as a default and offering explicit data movement as a higher-performance mode when needed. Giving programmers the option is quite another matter: convenience is maintained by default, while the programmer can drill down to explicit control when necessary. This offers a more friendly paradigm to beginners and experts alike and is more likely to gain widespread acceptance.
In our initial investigation into these issues, we are pursuing just a few activities to support the preceding goals.
Coordinated support for data- and task-parallel code: Support for task parallelism requires a MIMD design, while data parallelism argues for a SIMD design. We are exploring architectures that can reconcile both, with a special focus on reconciling the radically different memory-reference characteristics that these workload categories present.
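The distinction between the two styles can be illustrated in software terms: a data-parallel phase applies one operation across many elements (SIMD-style), while a task-parallel phase runs unrelated operations concurrently (MIMD-style). A toy sketch in Python (the kernels are hypothetical, chosen only for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):
    # Data-parallel kernel: the same operation applied to every element.
    return x * x

def count_words(text):
    # A task unlike the others: different code, different data.
    return len(text.split())

with ThreadPoolExecutor(max_workers=4) as pool:
    # Data parallelism (SIMD-style): one operation, many data elements.
    squares = list(pool.map(square, range(8)))

    # Task parallelism (MIMD-style): unrelated operations run concurrently.
    f1 = pool.submit(count_words, "the quick brown fox")
    f2 = pool.submit(sum, range(100))
    results = (f1.result(), f2.result())

print(squares)  # [0, 1, 4, 9, 16, 25, 36, 49]
print(results)  # (4, 4950)
```

A convenient architecture should make both phases efficient in one program, despite their very different memory-reference behavior: the data-parallel phase streams predictably through memory, while the independent tasks touch unrelated data.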
Core Federation: Multicore chips benefit from one or more high-performance cores for serial code or code with limited parallelism. But mixing high-ILP cores with small, simple, scalar cores imposes a fixed partitioning among task types: only workloads with the proper ratio of tasks will use the resources most efficiently. Instead, it is possible to federate small, simple cores to provide the same dynamic scheduling capability as a traditional out-of-order processor. The goal here is to coordinate the activities of groups of simple, in-order processors to provide out-of-order execution with virtually no extra hardware.
Optimal core complexity: We are continuing our investigation of power/throughput/area tradeoffs for different core types. Our early work suggested that superscalar, out-of-order-execution (OO) cores are generally superior to simple cores, but this work did not fully account for the role of heavy multithreading. Our hypothesis is that simple, in-order cores are only advantageous if heavily multithreaded; in other words, heavy multithreading benefits in-order cores much more than OO cores.
APIs for exposing producer-consumer relationships: This allows the hardware to better direct data from producer to consumer while bypassing the costly use of centralized storage.
Hardware support for managing data locality: We are exploring integrated mechanisms to coordinate prefetching, locality management, and prevention of false sharing.
Runtime load balancing: Fine-grained threads offer the flexibility to balance load by rate-matching producers and consumers, and to cope with architectures that are heterogeneous either by design or due to manufacturing variations.
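The benefit of fine-grained work units for load balancing can be sketched with a shared work queue: each worker claims the next small task as soon as it finishes the last one, so faster workers naturally absorb more of the load. A toy sketch (the per-worker delay is a stand-in for heterogeneous core speeds, not a model of real hardware):

```python
import queue
import threading
import time

def worker(tasks, results, delay):
    # Pull the next fine-grained task whenever free; faster workers
    # (smaller delay) naturally claim more tasks, balancing the load.
    while True:
        try:
            item = tasks.get_nowait()
        except queue.Empty:
            return
        time.sleep(delay)              # simulate a slower or faster core
        results.append(item * item)    # list.append is atomic in CPython

tasks = queue.Queue()
for i in range(100):
    tasks.put(i)

results = []
threads = [threading.Thread(target=worker, args=(tasks, results, d))
           for d in (0.0, 0.001, 0.002)]  # three "cores" of differing speed
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(results))  # 100: all tasks complete regardless of worker speed
```

No static partitioning is needed: the same code balances load whether the workers differ by design or by manufacturing variation, which is exactly the flexibility that fine-grained threads provide.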
Smart memory controllers: The advent of on-chip memory controllers effectively creates a new type of accelerator which can be used to reduce off-chip and intra-chip data traffic.
Temperature-aware design for multicore chips: In particular, we are exploring the role of floorplanning and L2 placement in mitigating core hotspots.
Application development: In order to better understand the design tradeoffs in a realistic context, we are developing parallel implementations (using several different APIs) of applications ranging from image processing to discrete-event simulation to graph- and tree-based algorithms.