Changelog:

5 Apr 2023: correct wrong URL for readings link in tips
6 Apr 2023: update Makefile to actually build life-tsan and have a main.cc that properly processes its arguments
6 Apr 2023 5pm: correct link to main.c to not be broken link
6 Apr 2023: specify to create one fewer thread if the main thread is doing its share of the computations
6 Apr 2023 11pm: add links to copies of sanitizer-test.sh, which was also missing from the skeleton code
7 Apr 2023: add note about restrictions on timing of pthread_barrier_destroy due to libc issue
7 Apr 2023: be more explicit that paralle version needs to match serial one
10 Apr 2023: add information re: using debugger to diagnose hangs to hints

If you downloaded the skeleton code before 6 April around 4pm, then there are some minor issues with that skeleton code:

the Makefile does not build the ThreadSanitizer-version of the testing program life-tsan. You can download an updated Makefile to correct this.

the main.c doesn’t properly distinguish between the time and time-and-result arguments; you can download an updated main.c to correct this.

the sanitizer-test.sh file was not included

1 Your Task

1.1 Before you code

Examine the supplied skeleton code [last updated 2023-04-06 3:50pm] which implements Conway’s Game of Life. You should only edit life-parallel.c, but that interacts with many other files in the package.

Make sure you look at input/README.md, as well as life.h and any other files you wish.

The provided code will compile and run. We encourage experimenting with it before adding any code of your own.

Make sure you understand how simulate_life_serial works. You’ll be making a (more complicated) parallel version of this, and are unlikely to be successful if you don’t understand this starting point.

1.2 Implement parallel Life

Create a parallel version (using pthreads) in life-parallel.c. You must use the provided header

void simulate_life_parallel(int threads, LifeBoard *state, int steps)

life-serial.c contains a correct, working single-threaded implementation you are welcome to borrow from. You should write helper functions as appropriate to keep your code well-organized.

Additional constraints:

Your implementation must be correct (result in the expected final board configuration, matching the serial version) for all boards and iteration counts.
You must call pthread_create exactly threads times (assuming you are computing cell values only in new threads you create; if you also compute cell values in the main (initial) thread, call it threads-1 times instead). Do not create more or fewer threads, nor create new threads for each iteration of the simulation.
We must be able to call simulate_life_parallel from multiple threads concurrently. Do not use global variables or other thread-unsafe practices.
You must have no memory leaks or other memory errors. The provided Makefile builds a version with the address sanitizer enabled automatically as life-asan; this must run without errors.
ThreadSanitizer should not detect likely race conditions or other thread API errors in your code. The provided Makefile builds a version with the thread sanitizer enabled automatically as life-tsan.
You must call the appropriate pthread_***_destroy for each pthread_***_init you call to reclaim any pthreads-allocated memory.
1. On portal/our testing machines, pthread_barrier_destroy won’t work properly unless called after all pthread_barrier_wait calls have returned. (This is a bug in the pthread_barrier implementation on those machines.)
On a multi-core machine, your parallel code must be noticeably faster than the serial code when run on sufficiently large boards for sufficiently many iterations.
On sufficiently large boards (both enough rows and enough columns), you should split up the work roughly evenly between the threads. (With large sizes, no thread should end up doing more than a few percent more computations than another.)
And, of course, you are still bound by all the usual course policies.
You must not use OpenMP (subject of one of our labs) in your solution: we are trying to measure your understanding of the lower-level tools OpenMP uses.

1.3 Test your code.

Uncomment the lines of main.c that are marked as appropriate to uncomment after simulate_life_parallel is written. Then test your code, as e.g. by running

./life to see the usage hints.
life-asan 10 input/make-a time to check for memory leaks and other memory errors (but do not trust its timing, the address sanitizer slows things down a lot).
life-tsan 10 input/make-a time to check for race conditions and some other types of synchornization errors (but do not trust its timing, the address sanitizer slows things down a lot).
- if you get an error about a race condition with pthread_barrier_destroy and pthread_barrier_wait, see the first caveat below.
If you downloaded the skeleton code before 6 April, your Makefile may not properly build life-tsan; you can download an updated Makefile here.
./life 0 input/o0075 serial-result to ensure you can load an example file.
./life 110 input/o0075 serial-result to ensure you can simulate an example file.
./life 10 input/make-a time to time the make-a file with 10 steps.
./time-examples.sh to get a sense of how you are doing on parallel performance.
./sanitizer-test.sh to try a bunch of sizes in AddressSanitizer, ThreadSanitizer

If you downloaded the skeleton code before 6 April, your Makefile may not properly build life-tsan or include sanitizer-test.sh; you can download an updated Makefile here and sanitizer-test.sh here.

2 Tips

This assignment was designed to be a natural fit for barriers. You are strongly encourage to get a barrier-based solution working first before attempting any other approaches.
You should choose some way to split up the grid between threads. You will get best performance if each thread works on a part of the grid that is contiguous in memory, so you get better locality within a thread. This also will avoid performance problems associated with two processors trying to modify data in different parts of the same cache block.
To compute the value of a cell in generation i, one needs the value of its neighbors in generation i-1. Barriers are one way to make sure that the values from generation i-1 are available before any thread starts computing generation i.
The supplied code calls LB_swap() to exchange boards. If one uses that code as is, one must ensure that all threads stop accessing the boards before the swap happens and don’t start accessing the boards again until the swap finishes.

An alternative, which calls wait on barriers fewer times each iteration, is to avoid swapping by having an even state and odd state and choose which board to write to based on the generation number. (In even iterations, threads would use the odd version of the state to write to the even state; and in odd iterations, threads would use the even version of the state to write to the odd state.) In this way, rather than waiting for the swap to finish before starting the next generation, threads just need to wait for all other threads to have finished the current generation.
You should be able to reuse almost all of the simulate_life_serial code. You will probably need to split it into at least two functions — one that represents the work done in separate threads and one that spawns the threads and waits for them to finish.
If you find yourself tempted to use a global variable (such as a global mutex or barrier), you can usually get away with adding it as a field of a struct passed in by reference to your thread function’s void * parameter instead.
If your code hangs, you can run it in a debugger until it hangs, and then interrupt it (like with control-C) and examine where each thread is.

For doing this in LLDB, see their tutorial on examining thread state.

In GDB, you can use a command like thread apply all backtrace to get a backtrace from all threads, and thread THREAD-ID to switch to debugging the thread with a particular ID; see also the manual for debugging with threads in GDB.

3 Caveats

The version of pthread_barrier_destroy on portal and our testing machines requires that you do not call pthread_barrier_destroy until you are sure that all threads have first returned from ptherad_barrier_wait. This is a bug in the those machine’s version of the libc, see this mailing list post.
MacOS and OS X ship with only a subset of the pthreads library, notably excluding pthread_barrier_t and its associated functions. You should probably use portal, a virtual machine, etc.