Tips on how to get the most from DISCOVAR on your hardware.
If you have an Intel-based system, make sure that hyperthreading is disabled. Hyperthreading takes each physical core and splits it into two logical cores that share the same execution resources and caches. Whilst this may give a small boost in some circumstances, it generally degrades performance when running heavily parallelized, I/O- and memory-bound code like DISCOVAR. You can check the status of your machine by examining /proc/cpuinfo.
If it is not possible to disable hyperthreading on your system, then consider running DISCOVAR with NUM_THREADS set to the number of physical cores, not logical ones:
Discovar NUM_THREADS=<physical core count> ...
The Linux scheduler is hyperthreading aware and will attempt to assign each thread to a separate physical core, avoiding contention for the core’s resources.
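As a rough sketch of how the physical core count can be read from /proc/cpuinfo, the Python helper below counts distinct (physical id, core id) pairs. It assumes the usual x86 Linux layout, in which each logical CPU has its own blank-line-separated entry with "processor", "physical id" and "core id" fields. The function name count_cores is ours, and on virtual machines or other architectures these fields may be absent.

```python
def count_cores(cpuinfo_text):
    """Return (logical, physical) core counts parsed from /proc/cpuinfo text."""
    logical = 0
    physical = set()
    # Each logical CPU is described by one blank-line-separated block.
    for block in cpuinfo_text.strip().split("\n\n"):
        fields = {}
        for line in block.splitlines():
            if ":" in line:
                key, _, value = line.partition(":")
                fields[key.strip()] = value.strip()
        if "processor" not in fields:
            continue
        logical += 1
        # A physical core is identified by its socket and its core number.
        if "physical id" in fields and "core id" in fields:
            physical.add((fields["physical id"], fields["core id"]))
    # Assumption: if the topology fields are missing, fall back to the
    # logical count rather than reporting zero physical cores.
    return logical, (len(physical) or logical)
```

On a Linux machine, passing the contents of /proc/cpuinfo to count_cores gives both counts. If the logical count is larger than the physical one, hyperthreading is most likely enabled, and the physical count is the value to use for NUM_THREADS.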
The behavior of malloc on your system can greatly affect the performance of DISCOVAR. In heavily multithreaded code, lock contention can arise during access to the shared heap, decreasing the parallelization efficiency. This can be avoided by making malloc thread aware, with more than one heap to reduce locking. The downside is an increase in the amount of memory required.
A number of thread-aware malloc implementations have been developed (TCMalloc, Hoard, and ptmalloc), and this functionality has also been introduced into the standard GNU C Library (glibc) malloc. We have observed considerable differences in runtime and runtime variability across malloc implementations. However, as most users are not able to swap in a different malloc, we have concentrated on getting the most out of glibc malloc.
Although glibc malloc has been thread aware for some time, only recently was this enabled by default. Prior to this it had to be turned on using the environment variable:
We therefore recommend that all our users set this to ensure that malloc is running in multi-threaded mode.
In our experiments we have discovered erratic behavior in some more recent releases of glibc malloc, resulting in large variations in runtime. This appears to be due to lock contention and is related to the way malloc determines how many independent heaps (called arenas) it requires. The solution was to fix the maximum number of malloc arenas at the start, which we were able to do programmatically as of revision 46659. If you had previously noticed unexplained runtime variations, then try the latest code.
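If you cannot update to a revision that fixes the arena count itself, glibc also lets you cap the number of arenas from outside the program through the MALLOC_ARENA_MAX environment variable. This is a glibc feature rather than a DISCOVAR option, and the sketch below is only an assumption about how you might apply it when launching a run from Python. The helper name env_with_arena_limit and the value 8 in the usage note are illustrative, not recommendations.

```python
import os


def env_with_arena_limit(max_arenas, base_env=None):
    """Return a copy of the environment with glibc's arena cap set."""
    # Copy so that the caller's environment is left untouched.
    env = dict(os.environ if base_env is None else base_env)
    # MALLOC_ARENA_MAX is read by glibc malloc when the process starts.
    env["MALLOC_ARENA_MAX"] = str(max_arenas)
    return env
```

The returned dictionary can be passed as the env argument of subprocess.run, for example env=env_with_arena_limit(8). Whether a given cap helps is something to measure on your own hardware.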
DISCOVAR gives the best performance when running one thread per core. The Linux scheduler will try to assign each thread to its own core, but running threads may still jump from core to core, resulting in a loss of cache coherency, which potentially decreases performance.
We have found that the performance drop is marginal on our NUMA hardware, but it may be larger on more exotic machines, particularly software-based NUMA machines that share memory across a cluster. One solution is to bind threads to specific cores, preventing the scheduler from moving them around. If you think your hardware might benefit from this, you can control it using the following environment variable:
GOMP_CPU_AFFINITY="2 5 10-15 19-25:2"
In this example, the threads will be distributed to cores 2 and 5, 10 through 15, and then 19 through 25 counting by 2, all in a round-robin fashion. Make sure the NUM_THREADS argument is in agreement with the core list.
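To check that NUM_THREADS agrees with the core list, it can help to expand the affinity string into explicit core numbers. The Python sketch below handles the three forms used in the example: a single core, a range M-N, and a range with a stride M-N:S. The helper name expand_affinity is ours, and the parser covers only these forms, not every syntax that your OpenMP runtime may accept.

```python
def expand_affinity(spec):
    """Expand a GOMP_CPU_AFFINITY-style string into a list of core numbers."""
    cores = []
    # Entries may be separated by spaces or commas.
    for item in spec.replace(",", " ").split():
        core_range, _, stride = item.partition(":")
        first, _, last = core_range.partition("-")
        start = int(first)
        stop = int(last) if last else start   # a single core is a range of one
        step = int(stride) if stride else 1   # the default stride is 1
        cores.extend(range(start, stop + 1, step))
    return cores


cores = expand_affinity("2 5 10-15 19-25:2")
print(cores)       # [2, 5, 10, 11, 12, 13, 14, 15, 19, 21, 23, 25]
print(len(cores))  # 12
```

The example string therefore names 12 cores, so NUM_THREADS=12 would be in agreement with it.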
More cores do not always translate into shorter runtimes, and may even degrade performance. As the number of concurrent threads increases, so too does the chance of lock contention, file and memory I/O bottlenecks, and cache coherency problems. The optimal number of threads/cores to use will depend on your hardware and, to a lesser extent, the size of the region being processed. The only way to find the sweet spot is to experiment. We use all the cores on our 48- and 64-core machines without difficulty, but you might find otherwise on your hardware.
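One way to run such an experiment is to time the same job at several thread counts. The Python sketch below is a minimal harness under stated assumptions: sweep_threads and discovar_command are hypothetical helper names, and only the NUM_THREADS argument is taken from the text above. The remaining DISCOVAR arguments for your data set must be added by hand.

```python
import subprocess
import time


def sweep_threads(make_command, thread_counts):
    """Run one command per thread count and return {threads: wall seconds}."""
    timings = {}
    for threads in thread_counts:
        start = time.perf_counter()
        # check=True stops the sweep if a run fails.
        subprocess.run(make_command(threads), check=True)
        timings[threads] = time.perf_counter() - start
    return timings


def discovar_command(threads):
    # Hypothetical invocation: append the other arguments your run needs.
    return ["Discovar", f"NUM_THREADS={threads}"]
```

Calling sweep_threads(discovar_command, [8, 16, 32, 48]) would then return the wall-clock time for each setting. Because runtimes can vary from run to run, repeating each setting a few times gives a more reliable picture.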