Computer Science Colloquia
Friday, December 2, 2011
Duane Merrill
Advisor: Andrew Grimshaw
Attending Faculty: Kevin Skadron (Chair); Kim Hazelwood; David Luebke; and Hossein Haj-Hariri.
Olsson Hall, Room 236D, 3:00 pm
Ph.D. Dissertation Presentation
Allocation-oriented Algorithm Design for GPU Computing
ABSTRACT
The wide data-parallelism of GPU processor design facilitates the
execution of large quantities of concurrent, fine-grained tasks with
unprecedented performance and efficiency. However, contemporary opinion
regards GPU architecture as being poorly suited for many computations
whose parallelizations would impose dynamic dependences between
processing elements. This dissertation specifically focuses on such
problems whose efficient solutions require cooperative allocation, i.e.,
their operation entails dynamic data placement within shared structures.
The contribution of this thesis is a set of design patterns and reusable
tunable software primitives that foster the construction of cooperative
parallelizations that are significantly faster, more efficient, more
flexible, and more performance-portable than the state-of-the-art.
Whereas concurrent CPU programs typically leverage fine-grained atomic
operations for coordinating data placement, we demonstrate the advantages
of parallel prefix sum as a high performance, bulk-synchronous alternati
for cooperative allocation. We construct very efficient algorithms for
global and local prefix sum by employing flexible granularity coarsening
techniques for properly balancing serial and parallel workloads. The
large efficiency gains permit kernel fusion techniques whereby
application-specific logic can be incorporated within our prefix sum
constructs at significantly reduced (or negligible) overhead.
To demonstrate the viability of our methods, we construct cooperative
GPU implementations for a variety of parallel list-processing primitives
(including reduction, prefix scan, duplicate removal, histogram, and
reduce-by-key), evaluating their performance across a wide spectrum of
problem sizes, types, and target architectures. To achieve such
performance-portability, we present a policy-based autotuning design
idiom where we leave unbound many of the program details having opaque
performance consequences.
Finally, we showcase high performance implementations for two archetypical
problems in Computer Science: sorting and breadth-first search (BFS).
We achieve excellent performance, demonstrating multiple factors of
speedup for both problem genres over prior work for GPU and conventional
multi-core platforms.