- 1 Your Task
- 2 On the skeleton code
- 3 Hints
Examine our supplied skeleton code (last updated 17 Feb 2019) which implements Conway’s Game of Life. See below for a summary of our skeleton code does and how to use it. (Note that there is
lifedirectory on some old versions the VM which you should delete and ignore; make sure you don’t confuse it with this skeleton code.)
Create a parallel version that uses pthreads. We recommend using pthreads barriers to synchronize between the threads, but you may also do something else that does not involve busy-waiting if you prefer. You must reuse the existing
LifeBoardclass. Your implementation must have the prototype
void simulate_life_parallel(int threads, LifeBoard &state, int steps)
like the unimplemented version you should find
life-parallel.cc(though you can implement it in other files, so long as you modify the Makefile accordingly). Your implementation can (and should!) span across multiple functions. You can look in
life-serial.ccto see a reference, single-threaded implementation, which you are welcome to borrow from in producing your implementation.
The function you need to implement take a number of threads that should do the computation. Your code should create the appropriate number of threads, then do as many iterations of the simulation as requested, then let the threads finish. You may not create new threads for each iteration of the simulation.
It must be possible to call your implementation on different
LifeBoards simulatenously from multiple threads and have the results on each of the
LifeBoards computed in parallel. This means that, generally, your implementation should not use global varibales.
One way you can test your version is by uncommenting the corresponding entries in the
main.ccand running script
time-examples.sh. When you are first developing, you may wish to simply run some of the
time-examples.shwith a smaller iteration count.
When running on a machine with multiple cores (if your VM only has one virtual CPU, it doesn’t count), you should notice speedup when running your barrier implementation with more threads.
We may test your code by substituting our own versions of
main.cc, so please do not add anything to
main.ccthat would prevent this from working.
If you have previously taken CoA 2, in which you did a very similar assignment, then as an additional requirement you must at most one call to
barrier_wait(or an equivalent function) per iteration. See the third hint under parallelizing life below for help on how to do this.
If you have previously taken CoA 2, then you also have additional requirements on the following pool assignment. We expect that you will start this assignment early in order to have time to complete these requirements since you are able to reuse work you previously did for the life assignmentin CoA 2.
Submit your code as a .tar.gz archive. You may find the supplied
make submittarget helpful for this. (If you created any new source files for your solution, please verify that they are included in this .tar.gz archive.)
On the skeleton code
Running the supplied code
Conway’s Game of Life is a simulation over a grid of cells, each of which is “dead” or “alive”. The simulation is done in “generations”, with a simple rule (see the wikipedia article or our code) determining what state a cell is in in the next iteration, based on the state of the cells around it in the current iteration. Our supplied code has a working serial implementation, except that the grid is fixed size and cells at the edge of the grid are always dead to avoid simulating an arbitrarily large grid.
If you download the skeleton code and run
makethis will produce a binary called
./life. If you run
./lifeyou will receive a usage message.
We also produce a binary called
./life-asan, which is compiled with AddressSanitizer to help you detect and debug memory errors. Since AddressSanitizer substantially slows down the prorgam, you should not trust performance results from this version of the program, but we strongly recommend testing with it.
We have some example input files in the
inputdirectory. You can open these up in a text editor. The first line of these files consists of the size of the pattern grid and the following lines consist of its contents, where
.or ` ` represents a “dead” cell and anything else represents a “live” cell.
This input format is parsed by the
operator>>(for iostreams) in
When run in
./lifebinary runs the serial implementation and produces the result, for example:
./life 0 input/make-a serial-result
input/make-aafter 0 generations, which will be the same as the original input.
./life 1 input/make-a serial-result
input/make-aafter 1 generations, and you can supply larger numbers to see later. For the
make-apattern, generations numbers around 120 are interesting.
The Serial Implementation
life.h, represents the grid of cells as a one-dimensional array. The
atmember function of this class can be used to access the cell at a given
ycoordinate. In this representation rows are stored together.
simulate_life_serial, our serial implementation, which looks in at 3x3 “window” around each cell to determine the next state. This update code works by making two life boards, one representing the next state, and one representing the current state. After each generation, these boards are swapped. The implementation of swapping for
LifeBoardallows swapping to occur without actually copying the cell values (the pointers in the corresponding vectors are swapped), so this is efficient.
simulate_life_serialtakes a LifeBoard reference and modifies the referred-to LifeBoard to return its result.
LifeBoardis specialized to avoid copying all the cells of the board — instead pointers internal to the
vectorinside the LifeBoard will be swapped.
Running and Timing Parallel Implementations
./lifedoes not run your parallel code until you change
main.ccand run something like
./life input/SOME-FILENAME time(not
main.cchas an array of functions
test_functionsthat you should change to include the parallel implementation you are implementing. We have supplied commented-out definitions you can use.
After changing the array of
./life 10 input/make-a time
will time all the versions of the life simulation listed in
test_functionsand verify that they have the correct result. When running an implementation, first each version is run once and the result is checked for correctness. The implementation is not timed if this produces an error. Then, timing runs each implementation many times in attempt to obtain a result that does not have “noise” from the machine. Each result is checked for correctness versus the reference implementation.
To avoid errors where you neglect to edit
test_functionsor where your parallel implementation terminates the program/thread, make sure you check for a line of output with the timing of your parallel version.
If you run with
time, the main function will print out the final results of your parallel versions whenever they fail to match the serial implemnetation.
The list of test functions in
main.ccexpects each test function to have the prototype
void single_threaded_function(LifeBoard &board, int num_iterations)
but your multithreaded versions will have the prototype
void multithreaded_function(int num_threads, LifeBoard &board, int num_iterations)
to accomodate this the commented out code uses the C++ standard library’s std::bind to create wrapper functions for
multithreaded_functionfor various numbers of threads. For example:
function = std::bind(&foo, 42, _1, _2); function(a, b);
is equivalent to doing
foo(42, a, b);
Scripts to Assist Testing
./lifetiming commands which may be useful for testing.
convert_from_rle.pyis a program converts from the popular “RLE” format for life patterns you may be able to find online into the input format our code expects.
The official POSIX documentation is available at http://pubs.opengroup.org/onlinepubs/9699919799/.
You can make this official documentation available through commands like
man pthread_cond_waiton your VM (or any sufficiently Ubuntu-like system) by running
sudo apt install manpages-posix-dev.
This tutorial from LLNL is one, perhaps friendlier source of information (though it does not cover barriers, see the next section for that.)
Another reference source is Chapter 27 of Operating Systems: Three Easy Pieces along surrounding chapters
Understanding POSIX’s barriers
POSIX barriers have the following API:
int pthread_barrier_init(pthread_barrier_t *barrier, const pthread_barrierattr_t *attr, unsigned count).
Initialize the barrier.
countis the number of threads that will use the barrier.
attrare “attributes” to configure how the barrier works. For this assignment, you can pass “NULL” for
int pthread_barrier_destroy(pthread_barrier_t *barrier)
Deallocate any resources allocated for a barrier.
int pthread_barrier_wait(pthread_barrier_t *barrier)
Wait for exactly
init) threads to call
wait()before returning. After
countthreads have called wait,
waitwill returns for all threads and the state of the barrier is reset. These threads can then immediately use the barrier again to wait for all threads to reach a certain place.
All these functions return an int that can, among other things, be used to know if an error occurred.
You should choose some way to split up the grid between threads. You will get best performance if each thread works on a part of the grid that is contiguous in memory, so you get better locality within a thread. This also will avoid performance problems associated with two processors trying to modify data in different parts of the same cache block.
To compute the value of a cell in generation i, one needs the value of its neighbors in generation i-1. Barriers are one way to make sure that the values from generation i-1 are available before any thread starts computing generation i.
The supplied code calls
swap()to exchange boards. If one uses that code as is, one must ensure that all threads stop accessing the boards before the swap happens and don’t start accessing the boards again until the swap finishes.
An alternative, which calls wait on barriers fewer times each iteration, is to avoid swapping by having an “even” state and “odd” state and choose which board to write to based on the generation number. (In even iterations, threads would use the “odd” version of the state to write to the “even” state; and in odd iterations, threads would use the “even” version of the state to write to the “odd” state.) In this way, rather than waiting for the swap to finish before starting the next generation, threads just need to wait for all other threads to have finished the current generation.
You should be able to reuse almost all of the
simulate_life_serialcode. You will probably need to split it into at least two functions — one that represents the work done in seperate threads and one that spawns the threads and waits for them to finish.
On testing in a way where you can see speedup
If you are using a VM and the VM settings only provide one CPU, multiple threads won’t provide speedup. You can adjust your VM settings to have multiple CPUs (assuming you are running on a multicore machine), which will probably require rebooting.
Alternately, you should have a department account, which you can use via the instructions here which should access a shared Linux machine with multiple cores. You will need to use
module load gccbefore trying to compile or run your program on these machines.
Alternately, you can run “natively” outside a VM on your machine, assuming you have a machine with pthreads. (OS X has this natively, on Windows, the Windows Subsystem for Linux probably gives access to it.)