Tuesday, March 10, 2009

Performance improvement of parallel applications with user-space spinlocks

Mutex locks provided by the operating system are expensive to create and acquire, and high contention can dramatically reduce or even eliminate the advantage of parallelism.

GCC provides some atomic built-in operations, directly mapped to the atomic instructions provided by the underlying hardware. These are the same instructions used to implement the traditional locking primitives (mutexes), but they can also be used to implement user-space locking primitives and save a significant amount of overhead.
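
For instance, a shared counter can be updated from many threads without any lock at all, using __sync_fetch_and_add(). This little sketch is mine and is not part of the example below:

#include <pthread.h>
#include <stdio.h>

/* A shared counter updated without any mutex: the __sync builtins map
 * directly to the CPU's atomic read-modify-write instructions
 * (e.g. lock xadd on x86). */
static volatile int counter;

static void *worker(void *arg)
{
	int i;

	for (i = 0; i < 1000000; i++)
		__sync_fetch_and_add(&counter, 1);
	return NULL;
}

int main(void)
{
	pthread_t t[4];
	int i;

	for (i = 0; i < 4; i++)
		pthread_create(&t[i], NULL, worker, NULL);
	for (i = 0; i < 4; i++)
		pthread_join(t[i], NULL);
	/* no increment can be lost: this always prints 4000000 */
	printf("%d\n", counter);
	return 0;
}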

However, this is not always the best solution: user-space spinlocks have their own drawbacks and overheads. Spinlocks are good when another CPU holds the lock and is likely to release it as soon as possible. In general, when there are more threads than physical CPUs, a spinlock simply wastes the CPU until the OS decides to preempt it. Moreover, with spinlocks, the CPU spins (really? :)) forever trying to acquire the lock. This leads to more power consumption and more heat to be dissipated, so this could be a really _bad_ solution for many embedded / ultra-portable devices (this is the same reason why asynchronous interrupt-driven handlers are better than polling on embedded devices). [ Actually, there are also some memory barrier and out-of-order execution troubles, but we will not discuss this topic now... maybe I'll report some details in another post... ].
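
A common compromise, just as a sketch and not what the example below does, is a lock that spins only a bounded number of times and then gives the CPU back to the scheduler (the helper name and the spin count here are mine):

#include <sched.h>

/* Spin for a while hoping that the owner (running on another CPU)
 * releases the lock quickly; if it does not, yield instead of burning
 * the CPU until the scheduler preempts us. */
static inline void lock_spin_yield(volatile int *l)
{
	int spins;

	for (;;) {
		for (spins = 0; spins < 1000; spins++)
			if (!*l && !__sync_lock_test_and_set(l, 1))
				return;
		sched_yield();	/* let the other threads run */
	}
}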

Anyway, in the following example we will look at a typical problem where user-space spinlocks bring a clear performance benefit (the power consumption issue is still there, but we don't care about it in this case).

By setting the USER_SPINLOCK option in the Makefile it is possible to choose between (0) the traditional pthread_mutex primitives and (1) a custom user-space spinlock implementation.

The problem is the classical dining philosophers problem: each thread repeatedly acquires two adjacent locks, and odd and even threads take them in opposite order to avoid a circular wait (and thus a deadlock).

=== Makefile ===

N_CPUS=$(shell getconf _NPROCESSORS_ONLN)
CACHELINE_SIZE=$(shell getconf LEVEL1_DCACHE_LINESIZE)

USER_SPINLOCK=1

TARGET=userspace-spinlock

all:
	gcc -g -O3 -o $(TARGET) \
		-DN_CPUS=$(N_CPUS) \
		-DCACHELINE_SIZE=$(CACHELINE_SIZE) \
		-DUSER_SPINLOCK=$(USER_SPINLOCK) \
		$(TARGET).c -lpthread

clean:
	rm -f $(TARGET)
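
Note that a variable set on the make command line overrides the value in the Makefile, so both variants can be tried without editing anything (sudo is needed because the program switches to the SCHED_RR real-time scheduler):

$ make USER_SPINLOCK=0 && sudo ./userspace-spinlock   # pthread_mutex version
$ make USER_SPINLOCK=1 && sudo ./userspace-spinlock   # user-space spinlocks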

=== userspace-spinlock.c ===

#define _GNU_SOURCE

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <sched.h>
#include <errno.h>
#include <time.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/syscall.h>

static pthread_t threads[N_CPUS];
static int input[N_CPUS];

#if USER_SPINLOCK
/* One lock per thread, spaced CACHELINE_SIZE entries apart so that each
 * lock ends up in its own cache line and false sharing is avoided. */
static int shared[N_CPUS * CACHELINE_SIZE];

static inline void lock(int *l)
{
	/* spin, atomically trying to change the lock word from 0 to 1 */
	while (__sync_lock_test_and_set(l, 1))
		;
}

static inline void unlock(int *l)
{
	/* store 0 with release semantics: the proper counterpart of
	 * __sync_lock_test_and_set() */
	__sync_lock_release(l);
}
#else /* USER_SPINLOCK */
/* Same layout with pthread mutexes; static storage is zero-initialized,
 * which on Linux/glibc is equivalent to PTHREAD_MUTEX_INITIALIZER. */
static pthread_mutex_t shared[N_CPUS * CACHELINE_SIZE];

static inline void lock(pthread_mutex_t *l)
{
	pthread_mutex_lock(l);
}

static inline void unlock(pthread_mutex_t *l)
{
	pthread_mutex_unlock(l);
}
#endif /* USER_SPINLOCK */

/*
 * From GETTID(2):
 *
 * Glibc does not provide a wrapper for this system call; call it using
 * syscall(2).
 */
static inline pid_t gettid(void)
{
	return syscall(SYS_gettid);
}

static void *thread(void *arg)
{
	int n = *(int *)arg;
	int first, second;
	cpu_set_t cmask;
	pid_t pid = gettid();
	int i = 1E7;	/* lock/unlock iterations per thread */

	/* Set CPU affinity */
	CPU_ZERO(&cmask);
	CPU_SET(n % N_CPUS, &cmask);
	if (sched_setaffinity(pid, sizeof(cmask), &cmask) < 0) {
		fprintf(stderr,
			"could not set cpu affinity to core %d.\n", n % N_CPUS);
		exit(1);
	}
	/*
	 * Acquire two locks and avoid deadlock; see also the dining
	 * philosophers problem.
	 */
	if (n % 2) {
		first = (n + 1) % N_CPUS;
		second = n;
	} else {
		first = n;
		second = (n + 1) % N_CPUS;
	}
	while (i) {
		lock(&shared[first * CACHELINE_SIZE]);
		lock(&shared[second * CACHELINE_SIZE]);
		i--;
		unlock(&shared[second * CACHELINE_SIZE]);
		unlock(&shared[first * CACHELINE_SIZE]);
	}
	return NULL;
}

int main(int argc, char **argv)
{
	struct sched_param sched;
	int i;

	fprintf(stdout, "running on %d cpus\n", N_CPUS);
	/* Run the whole process as a real-time (SCHED_RR) task */
	sched.sched_priority = sched_get_priority_max(SCHED_RR);
	if (sched_setscheduler(0, SCHED_RR, &sched) == -1) {
		perror("error setting SCHED_RR");
		return -1;
	}
	for (i = 0; i < N_CPUS; i++) {
		input[i] = i;
		if (pthread_create(&threads[i], NULL, thread, &input[i]) != 0) {
			perror("pthread_create failed");
			exit(1);
		}
	}
	for (i = 0; i < N_CPUS; i++)
		pthread_join(threads[i], NULL);

	return 0;
}


And here are the results of a run on my laptop with an Intel(R) Core(TM)2 CPU U7600 @ 1.20GHz:

=== USER_SPINLOCK=0 (use pthread_mutex) ===
$ /usr/bin/time -v sudo ./userspace-spinlock
running on 2 cpus
Command being timed: "sudo ./userspace-spinlock"
User time (seconds): 17.01
System time (seconds): 17.09
Percent of CPU this job got: 190%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:17.91
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 0
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 658
Voluntary context switches: 8426
Involuntary context switches: 31
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

=== USER_SPINLOCK=1 (use user-space spinlocks) ===
$ /usr/bin/time -v sudo ./userspace-spinlock
running on 2 cpus
Command being timed: "sudo ./userspace-spinlock"
User time (seconds): 13.12
System time (seconds): 0.04
Percent of CPU this job got: 191%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:06.89
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 0
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 657
Voluntary context switches: 3
Involuntary context switches: 17
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0


The elapsed time with the traditional pthread_mutex locking is 17.91 sec; with user-space spinlocks the execution takes only 6.89 sec! This means a speed-up of ~2.6!

As said above a disadvantage is the spinning of the CPUs that leads to a greater power consumption. This behaviour can be seen also looking at the voluntary context switches: 8426 in the pthread_mutex case and only 3 with user-space spinlocks (note: if you look at the code you can see that we're running real-time threads, this is the reason of this really low number of context switches). This also means that the whole system is really more reactive if the applications use pthread_mutex primitives, but if cases when reactiveness is not our goal (or we care about the reactiveness of few specific applications) we know that we can achieve a good speed-up of our parallel applications using user-space spinlocks.