Changelog:

Your Task

  1. Examine our supplied skeleton code (last updated 24 Feb 2022) which implements Conway’s Game of Life. See below for a summary of our skeleton code does and how to use it.

  2. Create a parallel version that uses pthreads. We recommend using pthreads barriers (see below) to synchronize between the threads, but you may also do something else that does not involve busy-waiting if you prefer. You must reuse the existing LifeBoard class. Your implementation must have the prototype

    void simulate_life_parallel(int threads, LifeBoard &state, int steps)
    

    like the unimplemented version you should find life-parallel.cc (though you can implement it in other files, so long as you modify the Makefile accordingly). Your implementation can (and should!) span across multiple functions. You can look in life-serial.cc to see a reference, single-threaded implementation, which you are welcome to borrow from in producing your implementation.

    The function you need to implement takes as an argument a number of threads that should do the computation. Your code should create the appropriate number of threads, then do as many steps of the simulation as requested, then let the threads finish.

    (As long as the LifeBoard you are passed has sufficiently many rows and columns, you must use exactly the requested number of threads requested to perform the computation. If it has too few rows or columns, it is permissible for fewer threads than requested to perform parts of the computation.)

    Your implementation:

    • Must not create new threads for each step of the simulation. (A single set of threads should perform the computation for all steps of a single call to simulate_life_parallel.)

    • Must be possible to call on different LifeBoards simulatenously from multiple threads. This means that, generally, your implementation should not use global variables.

    • When run on a system with sufficiently many physical cores with a LifeBoard that has substantially more rows and columns than the number of requested threads and a sufficient number of iterations, should achieve significant speedup proportional to the number of threads.

      It is okay if you do not achieve significant speedup when the number of threads is larger than or similar to the number of rows or columns. (This can include some threads doing no computations in this case.)

      Also, with a small board simulated for a small number of iterations, the total amount of computation to may be similar to the amount of time needed to setup the threads, so it is fine if you do not get much speedup in that case.

      (If you perform optimizations that improve your single threaded performance, which is not required or expected, you should still get additional speedup from using more threads.)

    • Compute the correct answer, even when the number of threads is large compared to the size of the LifeBoard. The answer computed should not vary due to race conditions, etc.

    One way you can test your version is by uncommenting the corresponding entries in the test_functions table in main.cc and running script time-examples.sh. When you are first developing, you may wish to simply run some of the ./life commands from time-examples.sh with a smaller iteration count.

    We will test your code by substituting our own versions of main.cc, so do not add anything to main.cc that would prevent this from working.

  3. If you have previously taken CoA 2, in which you did a very similar assignment, then as an additional requirement you must at most one call to barrier_wait (or any equivalent operation) per iteration. See the third hint under parallelizing life below for help on how to do this.

    If you have previously taken CoA 2, then you also have additional requirements on the following pool assignment. We expect that you will start this assignment early in order to have time to complete these requirements since you are able to reuse work you previously did for the life assignmentin CoA 2.

  4. Submit your code as a .tar.gz archive. You may find the supplied make submit target helpful for this. (If you created any new source files for your solution, please verify that they are included in this .tar.gz archive.)

On the skeleton code

Running the supplied code

  1. Conway’s Game of Life is a simulation over a grid of cells, each of which is “dead” or “alive”. The simulation is done in “generations”, with a simple rule (see the wikipedia article or our code) determining what state a cell is in in the next iteration, based on the state of the cells around it in the current iteration. Our supplied code has a working serial implementation. In our implementation (and therefore your parallel implementations), unlike more general implementations of the Game Of Life, the grid is fixed size and cells at the edge of the grid are always dead to avoid simulating an arbitrarily large grid.

  2. If you download the skeleton code and run make this will produce a binary called ./life. If you run ./life you will receive a usage message.

  3. If you are using OS X, unfortunately the OS X pthreads library does include barriers, so you may want to use a Linux system instead, see note on OS X and barriers below.

  4. We also produce two binaries that compile your program “sanitizers” that check for common errors that may not consistently show up in normal testing:

    • one called ./life-tsan, which is compiled with ThreadSanitizer to help you detect and debug race conditions.

    • one called ./life-asan, which is compiled with AddressSanitizer to help you detect and debug memory errors.

    Do not use these to measure performance: they both hurt performance significantly (probably in part by adding additional synchronization) and our timing code for these versions takes less measurements (to make these easier to use for debugging). But they should find errors more consistently than the faster ./life program.

    Using these sanitizers would be our first recommendation in the event of a segfault or possible data race.

  5. We have some example input files in the input directory. You can open these up in a text editor. The first line of these files consists of the size of the pattern grid and the following lines consist of its contents, where . or ` ` represents a “dead” cell and anything else represents a “live” cell.

    This input format is parsed by the operator<< and operator>> (for iostreams) in life.cc.

  6. When run in serial-result mode the ./life binary runs the serial implementation and produces the result, for example:

    ./life 0 input/make-a serial-result
    

    outputs input/make-a after 0 generations, which will be the same as the original input.

    ./life 1 input/make-a serial-result
    

    outputs input/make-a after 1 generations, and you can supply larger numbers to see later. For the make-a pattern, generations numbers around 120 are interesting.

The Serial Implementation

  1. The class LifeBoard, defined life.h, represents the grid of cells as a one-dimensional array. The at member function of this class can be used to access the cell at a given x and y coordinate. In this representation rows are stored together.

  2. life-serial.cc contains simulate_life_serial, our serial implementation, which looks in at 3x3 “window” around each cell to determine the next state. This update code works by making two life boards, one representing the next state, and one representing the current state. After each generation, these boards are swapped. The implementation of swapping for LifeBoard allows swapping to occur without actually copying the cell values (the pointers in the corresponding vectors are swapped), so this is efficient.

  3. simulate_life_serial takes a LifeBoard reference and modifies the referred-to LifeBoard to return its result.

  4. The swap function for LifeBoard is specialized to avoid copying all the cells of the board — instead pointers internal to the vector inside the LifeBoard will be swapped.

Running and Timing Parallel Implementations

  1. ./life does not run your parallel code until you change main.cc and run something like ./life input/SOME-FILENAME time (not serial-result).

  2. main.cc has an array of functions test_functions that you should change to include the parallel implementation you are implementing. We have supplied commented-out definitions you can use.

  3. After changing the array of test_functions in main.cc,

    ./life 10 input/make-a time
    

    will time all the versions of the life simulation listed in test_functions and verify that they have the correct result. When running an implementation, first each version is run once and the result is checked for correctness. The implementation is not timed if this produces an error. Then, timing runs each implementation many times in attempt to obtain a result that does not have “noise” from the machine. Each result is checked for correctness versus the reference implementation.

    To avoid errors where you neglect to edit test_functions or where your parallel implementation terminates the program/thread, make sure you check for a line of output with the timing of your parallel version.

    Similarly, you can run under the (slower) error-checking versions by replacing ./life with ./life-tsan or ./life-asan in the command above. If your parallel code gets the wrong answer from ./life, I recommend looking for messages about data races for ./life-tsan as an early debugging step.

  4. If you run with time-and-result instead of time, the main function will print out the serial results and the final results of your parallel versions whenever they fail to match the serial implementation.

  5. The list of test functions in main.cc expects each test function to have the prototype

    void single_threaded_function(LifeBoard &board, int num_iterations)
    

    but your multithreaded versions will have the prototype

    void multithreaded_function(int num_threads, LifeBoard &board, int num_iterations)
    

    to accomodate this the commented out code uses the C++ standard library’s std::bind to create wrapper functions for multithreaded_function for various numbers of threads. For example:

    function = std::bind(&foo, 42, _1, _2);
    function(a, b);
    

    is equivalent to doing

    foo(42, a, b);
    

Scripts to Assist Testing

  1. sanitizer-test.sh contains several commands to run life-tsan and life-asan for testing correctness (but not for testing performance).

  2. time-examples.sh runs several ./life timing commands which may be useful for testing (both correctness and performance).

  3. convert_from_rle.py is a program converts from the popular “RLE” format for life patterns you may be able to find online into the input format our code expects.

Notes on testing

  1. The results of the testing scripts cannot be used to verify some requirements of the assignment (for example, that new threads are not created for each step of the simulation, that simulatenous calls to simulate_life_parallel will produce correct results).

  2. Since the number of threads used is controlled by main.cc, if you want to test that your behavior with unusual ratios of threads to board size is correct, you may need to edit main.cc to test with an appropriate number of threads. You might also construct specially sized input files for this purpose.

Hints

Pthreads documentation

  1. The official POSIX documentation is available at http://pubs.opengroup.org/onlinepubs/9699919799/.

    You can make this official documentation available through commands like man pthread_cond_wait on your VM (or any sufficiently Ubuntu-like system) by running sudo apt install manpages-posix-dev.

  2. This tutorial from LLNL is one, perhaps friendlier source of information (though it does not cover barriers, see the next section for that.)

  3. Another reference source is Chapter 27 of Operating Systems: Three Easy Pieces along surrounding chapters

Understanding POSIX’s barriers

  1. Barriers generally.

    Barriers are a synchronization construct typically used to allow a group of threads to wait for each other to get the same place in a shared computation. Each time a thread waits on a barrier, it pauses until all threads have started waiting on the barrier. Then, all threads are allowed to proceed and the barrier is reset so it can be used again.

  2. Note On OS X and POSIX barriers. The OS X version of pthreads does not include them. I suggest using a Linux department machine (instructions here; run module load gcc to get a recent compiler) or an x86 Linux VM if you have an OS X machine.

  3. POSIX API.

    • int pthread_barrier_init(pthread_barrier_t *barrier, const pthread_barrierattr_t *attr, unsigned count).

      Initialize the barrier. count is the number of threads that will use the barrier. attr are “attributes” to configure how the barrier works. For this assignment, you can pass “NULL” for attr.

    • int pthread_barrier_destroy(pthread_barrier_t *barrier)

      Deallocate any resources allocated for a barrier.

    • int pthread_barrier_wait(pthread_barrier_t *barrier)

      Wait for exactly count (set by init) threads to call wait() before returning. After count threads have called wait, wait will returns for all threads and the state of the barrier is reset. These threads can then immediately use the barrier again to wait for all threads to reach a certain place.

    • All these functions return an int that can, among other things, be used to know if an error occurred. When no error has occurred, pthread_barrier_init and pthread_barrier_destroy will return 0, and pthread_barrier_wait returns either 0 or the constant PTHREAD_BARRIER_SERIAL_THREAD.

Parallelizing Life

  1. You should choose some way to split up the grid between threads. You will get best performance if each thread works on a part of the grid that is contiguous in memory, so you get better locality within a thread. This also will avoid performance problems associated with two processors trying to modify data in different parts of the same cache block.

  2. To compute the value of a cell in generation i, one needs the value of its neighbors in generation i-1. Barriers are one way to make sure that the values from generation i-1 are available before any thread starts computing generation i.

  3. The supplied code calls swap() to exchange boards. If one uses that code as is, one must ensure that all threads stop accessing the boards before the swap happens and don’t start accessing the boards again until the swap finishes as in:

    diagram showing timeline with state and next_state, with operations using state to set next_state interleaved with swap operations

    An alternative, which calls wait on barriers fewer times each iteration, is to avoid swapping by having an “even” state and “odd” state and choose which board to write to based on the generation number. In even iterations, threads would use the “odd” version of the state to write to the “even” state; and in odd iterations, threads would use the “even” version of the state to write to the “odd” state, yielding a pattern like:

    diagram showing timeline with odd and even items, with operations using odd to set even, then using even to set odd

    In this way, rather than waiting for the swap to finish before starting the next generation, threads just need to wait for all other threads to have finished the current generation.

  4. You should be able to reuse almost all of the simulate_life_serial code. You will probably need to split it into at least two functions — one that represents the work done in seperate threads and one that spawns the threads and waits for them to finish.

On pointers to vector elements

  1. If you create a pointer to an element of a vector, then that pointer is only valid so long as the vector doesn’t resize its internal array.

Thread-safety and LifeBoard

  1. The LifeBoard internally stores cells, represented as unsigned chars, in a vector. As long as the vector is not resized (or being swapped with another vector) and no thread is reading a cell while it is being written to, it is safe for two threads to write to two different cells simulatenously in the same LifeBoard. (This is gaurentee the C++ standard requires of the vector class.)

On testing in a way where you can see speedup

  1. If you are using a VM and the VM settings only provide one CPU, multiple threads won’t provide speedup. You can adjust your VM settings to have multiple CPUs (assuming you are running on a multicore machine), which will probably require rebooting the VM.

    If you’re using VirtualBox and it hangs on booting after a message like “Booting SMP configuration:”, try powering off the VM and changing the setting under the “System” (option on the right), then “Acceleration” (tab on the top), and changing the paravirtualization interface to “Hyper-V”.

    Alternately, you should have a department account, which you can use via the instructions here which should access a shared Linux machine with multiple cores. You will need to use module load gcc before trying to compile or run your program on these machines.

    Alternately, you can run “natively” outside a VM on your machine, assuming you have a machine with pthreads. (OS X has this natively, on Windows, the Windows Subsystem for Linux probably gives access to it.)

  2. If you run a very quick test case, like input/make-a with only 10 iterations, it may be that the amount of time required to setup threads is small compared to the amount of time required for the computation. A good starting point for tests to which run for a reasonable amount of time is time-examples.sh, though performance will vary from machine to machine.

Some causes of not getting speedup

  1. Make sure you are running with multiple physical cores available. (See previous section.)

  2. Check how you are dividing up the board. Note that if you are accidentally computing part multiple times, you may still get the right answer.

  3. Check if you are doing complicated calculations to determine which (y, x) coordinates a thread will handle. In particular, integer division can take a similar amount of time to the entire calculation of a particular life cell’s value.