We used the Hermes machines of Computer Science, University of Virginia (UVA), as the back-end platforms for the multicore compiler. The Computer Science cluster at UVA has four Hermes machines with identical configurations, and we ran the tests on whichever machine was available. Each 64-core Hermes machine consists of four 16-core AMD Opteron 6276 server processors in the four sockets of a Dell 0W13NR motherboard. Each CPU has 64 GB of RAM, for a total of 256 GB of main memory. An AMD Opteron 6276 has three cache levels: a 6 MB L3 cache segment is shared among a group of 8 cores, each pair of cores shares a 2 MB L2 cache, and each core has its own 16 KB L1 cache. Note that a CPU has only 8 floating point units despite having 16 cores, because each pair of cores shares one floating point unit. The PCubeS description of a Hermes machine is shown in the figure below.
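The sharing relationships described above fix how many instances of each hardware resource a machine has. The following C++ sketch makes that arithmetic explicit. The struct and function names are illustrative assumptions, not part of the multicore compiler or of PCubeS; only the numeric figures come from the description above.

```cpp
#include <cassert>

// Per-machine figures taken from the hardware description in the text.
// All identifiers here are hypothetical and exist only for illustration.
struct HermesTopology {
    int cpus;            // sockets per machine
    int cores_per_cpu;   // cores per AMD Opteron 6276
    int cores_per_l3;    // cores sharing one 6 MB L3 cache segment
    int cores_per_l2;    // cores sharing one 2 MB L2 cache
    int cores_per_fpu;   // cores sharing one floating point unit
    int ram_gb_per_cpu;  // main memory attached to each CPU
};

// The configuration reported for a Hermes machine.
inline HermesTopology hermes() {
    return HermesTopology{4, 16, 8, 2, 2, 64};
}

inline int total_cores(const HermesTopology& t) {
    return t.cpus * t.cores_per_cpu;
}

inline int total_ram_gb(const HermesTopology& t) {
    return t.cpus * t.ram_gb_per_cpu;
}

// Number of L3 segments, L2 caches, and floating point units per CPU,
// derived by dividing the core count by the sharing group size.
inline int l3_segments_per_cpu(const HermesTopology& t) {
    return t.cores_per_cpu / t.cores_per_l3;
}

inline int l2_caches_per_cpu(const HermesTopology& t) {
    return t.cores_per_cpu / t.cores_per_l2;
}

inline int fpus_per_cpu(const HermesTopology& t) {
    return t.cores_per_cpu / t.cores_per_fpu;
}

// Sanity check against the totals stated in the text.
inline void check_stated_totals() {
    const HermesTopology t = hermes();
    assert(total_cores(t) == 64);    // "64-core Hermes machine"
    assert(total_ram_gb(t) == 256);  // "256 GB of main memory"
    assert(fpus_per_cpu(t) == 8);    // "only 8 floating point units"
}
```

Under these figures, each CPU has 2 L3 segments and 8 L2 caches, so a machine exposes 8 L3 segments, 32 L2 caches, and 32 floating point units across its 64 cores.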
The native C++ compiler used to generate the executables is gcc version 4.9.4 (Ubuntu 4.9.4-2ubuntu1~14.04.1). All code was compiled with the -O3 optimization flag enabled. Finally, note that the Hermes machines have NUMA memory allocation enabled, but the multicore compiler allocates data structures assuming uniform memory. Hence, the IT executables incur an inherent inefficiency in memory accesses that the sequential reference implementations do not suffer.