q Module

This is a semi-literate document for the module q.c and it's associated files. This module implements a fast, lock-free, non-blocking queue.

To the extent possible under law, the contributors to this project have waived all copyright and related or neighboring rights to this work. This work is published from: United States. The project home is https://github.com/fordsfords/semlit/tree/gh-pages. To contact me, Steve Ford, project owner, you can find my email address at http://geeky-boy.com. Can't see it? Keep looking.

q Module

Introduction

This document describes the internals of the q.c module, and is intended to be read by a programmer who wants to understand, maintain or reuse the code. Before reading this documentation, you are expected to have a good user-level understanding of C and queuing in general.

Characteristics of the queue implemented by q.c:

Fixed size (size specified at queue creation time).
Non-blocking (enqueuing to a full queue returns immediately with status; dequeuing from an empty queue returns immediately with status).
Single Producer, Single Consumer (SPSC).
No dynamic memory allocates or frees during enqueue and dequeue operations. Messages are void pointers (null pointers are allowed).
High performance. On my Macbook Air, streaming data through a queue averages 11 ns per message (queue size=32), while ping-pong latency is 69 ns (one-way).
Tested on Mac OSX 10.9 (Mavericks) and Linux 2.6 and 3.5 kernels. At present, I only recommend the 64-bit x86 processor family, due to the fact that I take advantage of its programmer-friendly memory model. In the future I hope to generalize it to be efficient on other processor types.

The module is written in standard C, and is therefore usable by both C and C++ programs.

There are several source files:

q_c.txt - (right-click and save as "q.c") code for the q module.
q_h.txt - (right-click and save as "q.h") header file for q module.
q_selftest_c.txt - (right-click and save as "q_selftest.c") unit test code.
q_selftest_h.txt - (right-click and save as "q_selftest.h") header for unit test.
q_perf_c.txt - (right-click and save as "q_perf.c") performance measurement tool.

So, what are the ".sldoc" and ".slsrc" files in the doc directory? Short answer: you can ignore them if you are a user of the q module. Longer answer for maintainers of q: see section Developers.

Q API

The API is documented in q.h:

00024  /* Most q APIs return "qerr_t".  These definitions must
00025   * be kept in sync with the "qerrs" string array in "q.c". */
00026  #define QERR_OK 0         /* Success */
00027  #define QERR_BUG1 1       /* Internal software bug - should never happen */
00028  #define QERR_BUG2 2       /* Internal software bug - should never happen */
00029  #define QERR_BADSIZE 3    /* q_size parameter invalid */
00030  #define QERR_MALLOCERR 4  /* No memory available */
00031  #define QERR_FULL 5       /* No room in queue */
00032  #define QERR_EMPTY 6      /* No messages in queue */
00033  #define LAST_QERR 6   /* Set to value of last "QERR_*" definition */
00034  
00035  qerr_t q_create(q_t **rtn_q, unsigned int q_size);
00036  /* Create an instance of a queue.
00037   * rtn_q  : Pointer to caller's queue instance handle.
00038   * q_size : Number of queue elements to allocate.  Must be > 1 and a power
00039   *          of 2.  Due to the nature of the algorithm used, a maximum of
00040   *          q_size - 1 elements can actually be stored in the queue.
00041   * Returns QERR_OK on success, or other QERR_* value on error. */
00042  
00043  qerr_t q_delete(q_t *q);
00044  /* Delete an instance of a queue.
00045   * q : Queue instance handle.
00046   * Returns QERR_OK on success, or other QERR_* value on error. */
00047  
00048  qerr_t q_enq( q_t *q, void *m);
00049  /* Add a message to the queue.
00050   * q : Queue instance handle.
00051   * m : Message to enqueue.
00052   * Returns QERR_OK on success, QERR_FULL if queue full, or other QERR_* value on error. */
00053  
00054  qerr_t q_deq(q_t *q, void **rtn_m);
00055  /* Remove a message from the queue.
00056   * q     : Queue instance handle.
00057   * rtn_m : Pointer to caller's message handle.
00058   * Returns QERR_OK on success, QERR_EMPTY if queue empty, or other QERR_* value on error. */
00059  
00060  int q_is_empty(q_t *q);
00061  /* Returns 1 if queue is empty (contains no messages), 0 otherwise.
00062   * q : Queue instance handle. */
00063  
00064  int q_is_full(q_t *q);
00065  /* Returns 1 if queue is full (contains q_size-1 message), 0 otherwise.
00066   * q : Queue instance handle. */
00067  
00068  char *q_qerr_str(qerr_t qerr);
00069  /* Returns a string representation of a queue API return error code.
00070   * qerr : value returned by most q APIs indicating success or faiure.
00071   * (See q.h for list of QERR_* definitions.) */

Queue Algorithm

Although the code contained herein was written from scratch by Steve Ford in 2014, the algorithm was influenced by John D. Valois' 1994 paper, "Implementing Lock-Free Queues" (Valois, J.: Implementing lock-free queues. In: Proceedings of the Seventh
International Conference on Parallel and Distributed Computing Systems. (1994) 64-69)

This queue implementation is based on an array of N elements (where N must be a power of 2). There is a queue head that points to the oldest message (which will be dequeued next), and a queue tail, which points to an empty element where the next enqueue will go.

At initialization, the queue looks like this:

empty queue with head and tail pointing to msgs[0]

empty queue with head and tail pointing to msgs[0]

All empty elements MUST be set to unused, and the element pointed to by tail must ALWAYS be empty. The slot pointed to by head is unused ONLY when the queue is empty. The algorithm MUST guarantee that an empty queue always has the head and tail pointing to the same element, and that element (actually all elements) set to unused.

After a message m1 is enqueued, the queue looks like this:

tail now points at msgs[1]

Message m1 was written to the tail, and the tail was incremented.

Now enqueue m2 and m3:

two more enqueues, queue is full

This queue algorithm defines full to be when there is exactly one empty element. Thus, in an N-element array, you can only store N-1 messages. Since tail must point to an empty element, the queue can be detected as being full when the element *after* the tail is used. In this case, the next element after tail is msgs[0], which still contains message m1.

Now let's dequeue a message:

dequeue m1, head points at msgs[1]

When message m1 is dequeued, its array element MUST be set to unused. Then head is incremented.

Note that this algorithm does not have a straight-forward method of calculating the queue length (i.e. the number of messages currently in the queue). You might be tempted to subtract the head from the tail, but it gets complicated when tail wraps and head has not. Even if you handle that properly, threading issues makes it dangerous to access both variables. It is much better to simply provide "is_full" and "is_empty" functions, which are typically what applications really want to know when they ask for the queue length.

Threading Considerations

To make a high-performing lockless queue, it is important to minimize the memory which is shared between the threads. This queue is designed for single-producer, single-consumer. So you only have two threads to consider.

The way the algorithm is designed, the enqueue thread accesses the tail pointer, NEVER the head. The dequeue thread accesses the head pointer, NEVER the tail. By making sure head and tail are on different cache lines, there is no memory contention for the head and tail pointers. The msgs array is the obvious point of sharing, so it is declared volatile.

(Aside: there is debate whether volatile should ever be used. Many programmers prefer to use compiler barriers, as their semantics are more-exactly defined. Unfortunately, there isn't an efficient and portable way to code a compiler barrier. And while there are some ambiguities on the exact semantics of volatile, for the purposes of this q module, those ambiguities don't matter. My empirical measurements indicate that this algorithm performs about the same, regardless whether volatile or compiler barriers are used. One disadvantage of compiler barriers is that you must place them very carefully, which leaves opportunity for error. Also, a compiler barrier prevents reordering and optimization of ALL variables, both shared and private. Volatile limits its effect to only those variables that it is applied to, and extends those effects to every access to those variables. I use volatile for this algorithm.)

(Another aside: the compiler barrier v.s. volatile debate is sometimes confused with hardware memory barriers. They are two different issues which must be treated differently. Compiler barriers and volatile variables control the order and optimization of the assembly language instructions which the compiler generates. Hardware memory barriers controls how the CPU and cache hardware is allowed to reorder physical memory accesses differently from the order of the assembly language instructions.)

Another threading consideration is the number of shared variables which must be manipulated to operate on the queue. Each check of fullness and emptiness requires only a single read. The enqueue function looks at the element *after* the tail to determine if the queue is full. The dequeue function looks at the head element to determine if the queue is empty (head element unused).

Alternate Algorithms

Under the directory "alternatives" are a series of alternative implementations: q1.c to q5.c. Three of the five produce longer timings for both streaming and ping-pong. Two of the, q2.c and q4.c, produce slightly better times for streaming, but worse times for ping-pong. I am sticking with q.c because I would expect queues to tend to be empty most of the time.

Module explanation: `q.c`

The enqueue and dequeue functions are the heart of the lockless algorithm. In the code fragments below, click on a line number to display the same code in-context on the right-side of the screen.

Function q_enq

To enqueue a message:

00141  qerr_t q_enq(q_t *q, void *m)
00142  {
00143      unsigned int tail = (unsigned)(q->enq_cnt & q->size_mask);
00144  
00145      /* Queue must always have at least one empty slot.  Make sure that
00146       * after the current tail is filled, the next slot will be empty. */
00147      unsigned int next_tail = (tail + 1) & q->size_mask;
00148      if (q->msgs[next_tail].in_use) { return QERR_FULL; }  /* Queue is full, item not added */
00149  
00150      q->msgs[tail].d = (void * volatile)m;
00151      q->msgs[tail].in_use = 1;
00152      q->enq_cnt++;
00153      return QERR_OK;
00154  }  /* q_enq */

The tail pointer is defined as the number of enqueues mod the number of elements in the msgs array. By constraining the array size to a power of two, the tail can be quickly calculated as the number of enqueues bit-wise ANDed with the the number of array elements minus 1:

00143      unsigned int tail = (unsigned)(q->enq_cnt & q->size_mask);

Function: q_deq

To dequeue a message:

00158  qerr_t q_deq(q_t *q, void **rtn_m)
00159  {
00160      unsigned int head = (unsigned)(q->deq_cnt & q->size_mask);
00161      if (! q->msgs[head].in_use) { return QERR_EMPTY; }
00162  
00163      *rtn_m = (void *)q->msgs[head].d;
00164      q->msgs[head].in_use = 0;  /* mark it as empty */
00165      q->deq_cnt++;
00166      return QERR_OK;
00167  }  /* q_deq */

The head pointer is defined as the number of dequeues mod the number of elements in the msgs array. As with q_enq, a simple bit-wise AND can be used:

00160      unsigned int head = (unsigned)(q->deq_cnt & q->size_mask);

Structure: q_s

Now let's take a look at the queue data structure:

00048  /* message element */
00049  struct q_msg_s {
00050      void *d;
00051      int in_use;
00052  };
00053  typedef struct q_msg_s q_msg_t;
00054  
00055  /* q.h contains an empty forward definition of "q_s", and defines "q_t" */
00056  struct q_s {
00057      unsigned int enq_cnt;     /* count of successful messages enqueued (tail pointer) */
00058      char enq_pad[CACHE_LINE_SIZE - (sizeof(unsigned int))];  /* align next var on cache line */
00059  
00060      unsigned int deq_cnt;     /* count of successful messages dequeued (head pointer) */
00061      char deq_pad[CACHE_LINE_SIZE - (sizeof(unsigned int))];  /* align next var on cache line */
00062  
00063      q_msg_t * volatile msgs;  /* Array of "q_size" elements */
00064      unsigned int size_mask;  /* Number of msgs elements minus 1 */
00065      /* make total size a multiple of cache line size, to prevent interference with whatever comes after */
00066      char final_pad[CACHE_LINE_SIZE - ( sizeof(unsigned int) + sizeof(void **) )];
00067  };  /* struct q_s */

Note that this structure is private to q.c. The API sees an empty forward reference and type in q.h:

00021  struct q_s;                       /* forward (opaque) definition */
00022  typedef struct q_s q_t;           /* object type of queue instance */

To improve threading efficiency, the variables used by the enqueue and dequeue threads are separated and placed on independent cache lines. Cache lines are typically 64 bytes long, but I allocated 128 to be safe. (This is defined in the build script bld.sh.) This is used to include enough padding bytes to align to cache lines:

00058      char enq_pad[CACHE_LINE_SIZE - (sizeof(unsigned int))];  /* align next var on cache line */

Note that it is important to keep the number and types of variables in a cache line synchronized with the padding calculation. This is a maintenance burden, but does reap performance benefits.

Function: q_create

The q_create() function is pretty straight-forward, but there are a few little gems.

To assist in debugging, a set of strings are defined corresponding to the return status error codes (those strings available via the q_qerr_str() function). The manifest constants are defined in q.h:

00024  /* Most q APIs return "qerr_t".  These definitions must
00025   * be kept in sync with the "qerrs" string array in "q.c". */
00026  #define QERR_OK 0         /* Success */
00027  #define QERR_BUG1 1       /* Internal software bug - should never happen */
00028  #define QERR_BUG2 2       /* Internal software bug - should never happen */
00029  #define QERR_BADSIZE 3    /* q_size parameter invalid */
00030  #define QERR_MALLOCERR 4  /* No memory available */
00031  #define QERR_FULL 5       /* No room in queue */
00032  #define QERR_EMPTY 6      /* No messages in queue */
00033  #define LAST_QERR 6   /* Set to value of last "QERR_*" definition */

The strings are defined in q.c:

00033  /* This list of strings must be kept in sync with the
00034   * corresponding "QERR_*" constant definitions in "q.h".
00035   * It is used by the q_qerr_str() function. */
00036  static char *qerrs[] = {
00037      "QERR_OK",
00038      "QERR_BUG1",
00039      "QERR_BUG2",
00040      "QERR_BADSIZE",
00041      "QERR_MALLOCERR",
00042      "QERR_FULL",
00043      "QERR_EMPTY",
00044      "BAD_QERR", NULL};
00045  #define BAD_QERR (sizeof(qerrs)/sizeof(qerrs[0]) - 2)

When a maintainer is modifying the q module, care must be taken when adding or changing the return status error codes to keep the two synchronized. To aid in detecting mistakes, q_create() does a sanity check:

00090      /* Sanity check the error code definitions and strings */
00091      if (LAST_QERR != BAD_QERR - 1) { return QERR_BUG1; }  /* the QERR_* are out of sync with qerrs[] */

The unit test code in q_selftest.c also does some sanity checking:

00309      if (strcmp(q_qerr_str(QERR_OK), "QERR_OK") != 0) { ERR("q_qerr_str OK"); }
00310      if (strcmp(q_qerr_str(QERR_BUG1), "QERR_BUG1") != 0) { ERR("q_qerr_str BUG1"); }
00311      if (strcmp(q_qerr_str(QERR_BUG2), "QERR_BUG2") != 0) { ERR("q_qerr_str BUG2"); }
00312      if (strcmp(q_qerr_str(QERR_BADSIZE), "QERR_BADSIZE") != 0) { ERR("q_qerr_str BADSIZE"); }
00313      if (strcmp(q_qerr_str(QERR_MALLOCERR), "QERR_MALLOCERR") != 0) { ERR("q_qerr_str MALLOCERR"); }
00314      if (strcmp(q_qerr_str(QERR_FULL), "QERR_FULL") != 0) { ERR("q_qerr_str FULL"); }
00315      if (strcmp(q_qerr_str(QERR_EMPTY), "QERR_EMPTY") != 0) { ERR("q_qerr_str EMPTY"); }
00316      if (strcmp(q_qerr_str(LAST_QERR), "QERR_EMPTY") != 0) { ERR("q_qerr_str LAST_QERR"); }
00317      if (strcmp(q_qerr_str(LAST_QERR+1), "BAD_QERR") != 0) { ERR("q_qerr_str BAD_QERR"); }
00318      if (strcmp(q_qerr_str(-1), "BAD_QERR") != 0) { ERR("q_qerr_str -1"); }
00319      printf("FYI %s[%d]: q_qerr_str OK\n", __FILE__, __LINE__);

Another debugging aid is the fact that an extra element is allocated for the msgs array:

00102      /* Allocate message storage array (one extra unused element for fence) */
00103      q->msgs = NULL;
00104      perr = posix_memalign((void **)&q->msgs, CACHE_LINE_SIZE, (q_size + 1) * sizeof(q->msgs[0]) );

A bit later, the ex extra element is set to Q_FENCE:

00107      q->msgs[q_size].d = (void *)Q_FENCE;  /* used by unit tests to insure no overflow */

Q_FENCE is a manifest constant random number (generated at random.org):

00031  #define Q_FENCE 0x7d831f20   /* used for testing (random number generated at random.org) */

This element and value are only used to sanity check the queue by detecting queue overrun (which the algorithm should never allow). This is done in q_delete():

00130      if (q->msgs[q->size_mask + 1].d != (void *)Q_FENCE) { return QERR_BUG1; }

Module explanation: q_selftest.c, q_selftest.h

The files q_selftest.h and q_selftest.c are conditionally included by q.c:

00024  /* If built with "-DSELFTEST" then include extras for unit testing. */
00025  #ifdef SELFTEST
00026  #  include "q_selftest.h"
00027  #endif

...

00187  /* If built with "-DSELFTEST" then include main() for unit testing. */
00188  #ifdef SELFTEST
00189  #include "q_selftest.c"
00190  #endif

They are for unit testing, and are compiled with q.c so that they can have access to internal definitions for white box testing. The build script bld.sh does a special unit test build with "-DSELFTEST" to enable the inclusion.

It is important to understand that since the queue module implementation code is in the same compilation unit as the unit test code, the optimizer in-lines many of the functions, and performs significant compile-time optimizations as a result. Thus, the performance numbers printed by the unit test code are not representative of what an external module would see.

One reason for doing the unit tests in the same compilation unit as the module is that certain subtle bugs can be found that way. For example, if you remove "volatile" from the declaration of q_t.msgs, the functions will appear to work most of the time. But the q_selftest.c function deq_thread() will end up in an infinite loop. This is because q_deq() checks the contents of msgs[0].in_use and finds it unused and returns QERR_EMPTY. Then deq_thread() ends up looping back and calling q_deq() again. But since the function is in-lined, the compiler notices that none of the relevant variables are modified in the empty code path, so it optimizes the function out completely.

Module explanation: q_perf.c

The q_perf.c module consists of the same streaming and ping-pong performance tests as in q_selftest.c, minus the sanity checking. However, since the q_perf.c is built as a separate compilation unit, it calls the functions rather than allowing them to be in-lined. Thus, the performance measurements are more representative.

The streaming test tends to measure performance where the queue is usually non-empty, and is frequently full, whereas the ping-pong test has the queues almost-always empty. Even though the streaming numbers are more-impressive looking, I suspect that in most use cases, the queue will spend more time empty than non-empty, leading me to choose the current queue implementation (which is the fastest ping-pong of the algorithms I've tested).

Developers

The web-based software documentation you are looking at is called "SemiLiterate Documentation". The system which generates the doc is called "semlit", and is freely available at https://github.com/fordsfords/semlit/tree/gh-pages.

If you are a user of the q module, you probably don't care about any of this and you can probably ignore it. However, if you are a maintainer, then you will want to be able to update and re-generate the doc.

One very important fact: DON'T DIRECTLY MODIFY THE *.c AND *.h FILES! Those files are automatically generated from corresponding doc/*_c.slsrc and doc/*_h.slsrc files. To update the q module's code, you should modify the *.slsrc files, and then generate the .c and .h files by running bld.sh.

So, first thing to do is download and install semlit. Make sure that semlit.sh is in your PATH (otherwise the q module's "bld.sh" will skip the semlit stuff). Then use an html editor (like SeaMonkey) to modify doc/*.sldoc files, and a code editor (like vim) to modify doc/*.slsrc files. Finally, run bld.sh to generate the files.

The scripts used by developers are:

bld.sh - build the package.
bld_semlit.sh - build the semi-literate documentation package (optional).
clean.sh - remove temporary files.
tst.sh - run the unit tests.