q Module
This is a semi-literate
document for the module q.c and it's associated
files. This module implements a fast, lock-free,
non-blocking queue.
Copyright 2014,2015 Steven Ford http://geeky-boy.com and
licensed as public domain (CC0):
To the extent possible under law, the contributors to this project
have waived all copyright and related or neighboring rights to
this work. This work is published from: United States. The project
home is https://github.com/fordsfords/semlit/tree/gh-pages.
To contact me, Steve Ford, project owner, you can find my email
address at http://geeky-boy.com.
Can't see it? Keep looking.
- q Module
- Introduction
- Q API
- Queue Algorithm
- Threading Considerations
- Alternate Algorithms
- Module explanation: q.c
- Function q_enq
- Function: q_deq
- Structure: q_s
- Function: q_create
- Module explanation: q_selftest.c,
q_selftest.h
- Module explanation: q_perf.c
- Developers
Introduction
This document describes the internals of the q.c
module, and is intended to be read by a programmer who wants to
understand, maintain or reuse the code. Before reading this
documentation, you are expected to have a good user-level
understanding of C and queuing in general.
Characteristics of the queue implemented by q.c:
- Fixed size (size specified at queue creation time).
- Non-blocking (enqueuing to a full queue returns immediately
with status; dequeuing from an empty queue returns immediately
with status).
- Single Producer, Single Consumer (SPSC).
- No dynamic memory allocates or frees during enqueue and
dequeue operations. Messages are void pointers (null
pointers are allowed).
- High performance. On my Macbook Air, streaming data
through a queue averages 11 ns per message (queue size=32),
while ping-pong latency is 69 ns (one-way).
- Tested on Mac OSX 10.9 (Mavericks) and Linux 2.6 and 3.5
kernels. At present, I only recommend the 64-bit x86
processor family, due to the fact that I take advantage of its
programmer-friendly memory model. In the future I hope to
generalize it to be efficient on other processor types.
The module is written in standard C, and is therefore usable by
both C and C++ programs.
There are several source files:
- q_c.txt
- (right-click and save as "q.c") code for the q
module.
- q_h.txt
- (right-click and save as "q.h") header file for q
module.
- q_selftest_c.txt
- (right-click and save as "q_selftest.c") unit test
code.
- q_selftest_h.txt
- (right-click and save as "q_selftest.h") header for
unit test.
- q_perf_c.txt
- (right-click and save as "q_perf.c") performance
measurement tool.
So, what are the ".sldoc" and ".slsrc" files in
the doc directory? Short answer: you can ignore them if
you are a user of the q module. Longer answer for maintainers
of q: see section Developers.
Q API
The API is documented in q.h:
00024 /* Most q APIs return "qerr_t". These definitions must
00025 * be kept in sync with the "qerrs" string array in "q.c". */
00026 #define QERR_OK 0 /* Success */
00027 #define QERR_BUG1 1 /* Internal software bug - should never happen */
00028 #define QERR_BUG2 2 /* Internal software bug - should never happen */
00029 #define QERR_BADSIZE 3 /* q_size parameter invalid */
00030 #define QERR_MALLOCERR 4 /* No memory available */
00031 #define QERR_FULL 5 /* No room in queue */
00032 #define QERR_EMPTY 6 /* No messages in queue */
00033 #define LAST_QERR 6 /* Set to value of last "QERR_*" definition */
00034
00035 qerr_t q_create(q_t **rtn_q, unsigned int q_size);
00036 /* Create an instance of a queue.
00037 * rtn_q : Pointer to caller's queue instance handle.
00038 * q_size : Number of queue elements to allocate. Must be > 1 and a power
00039 * of 2. Due to the nature of the algorithm used, a maximum of
00040 * q_size - 1 elements can actually be stored in the queue.
00041 * Returns QERR_OK on success, or other QERR_* value on error. */
00042
00043 qerr_t q_delete(q_t *q);
00044 /* Delete an instance of a queue.
00045 * q : Queue instance handle.
00046 * Returns QERR_OK on success, or other QERR_* value on error. */
00047
00048 qerr_t q_enq( q_t *q, void *m);
00049 /* Add a message to the queue.
00050 * q : Queue instance handle.
00051 * m : Message to enqueue.
00052 * Returns QERR_OK on success, QERR_FULL if queue full, or other QERR_* value on error. */
00053
00054 qerr_t q_deq(q_t *q, void **rtn_m);
00055 /* Remove a message from the queue.
00056 * q : Queue instance handle.
00057 * rtn_m : Pointer to caller's message handle.
00058 * Returns QERR_OK on success, QERR_EMPTY if queue empty, or other QERR_* value on error. */
00059
00060 int q_is_empty(q_t *q);
00061 /* Returns 1 if queue is empty (contains no messages), 0 otherwise.
00062 * q : Queue instance handle. */
00063
00064 int q_is_full(q_t *q);
00065 /* Returns 1 if queue is full (contains q_size-1 message), 0 otherwise.
00066 * q : Queue instance handle. */
00067
00068 char *q_qerr_str(qerr_t qerr);
00069 /* Returns a string representation of a queue API return error code.
00070 * qerr : value returned by most q APIs indicating success or faiure.
00071 * (See q.h for list of QERR_* definitions.) */
Queue Algorithm
Although the code contained herein was written from scratch by Steve
Ford in 2014, the algorithm was influenced by John D. Valois' 1994
paper, "Implementing
Lock-Free Queues" (Valois, J.: Implementing
lock-free queues. In: Proceedings of the Seventh
International Conference on Parallel and Distributed Computing
Systems. (1994) 64-69)
This queue implementation is based on an array of N elements (where
N must be a power of 2). There is a queue head that points to
the oldest message (which will be dequeued next), and a queue tail,
which points to an empty element where the next enqueue will go.
At initialization, the queue looks like this:
![empty queue with head and tail pointing to msgs[0]](q_pics1.png)
All empty elements MUST be set to unused, and the element pointed to
by tail must ALWAYS be empty. The slot pointed to by head is
unused ONLY when the queue is empty. The algorithm MUST
guarantee that an empty queue always has the head and tail pointing
to the same element, and that element (actually all elements) set to
unused.
After a message m1 is enqueued, the queue looks like this:
![tail now points at msgs[1]](q_pics2.png)
Message m1 was written to the tail, and the tail was incremented.
Now enqueue m2 and m3:

This queue algorithm defines full to be when there is exactly one
empty element. Thus, in an N-element array, you can only store
N-1 messages. Since tail must point to an empty element, the
queue can be detected as being full when the element *after* the
tail is used. In this case, the next element after tail is
msgs[0], which still contains message m1.
Now let's dequeue a message:
![dequeue m1, head points at msgs[1]](q_pics4.png)
When message m1 is dequeued, its array element MUST be set to
unused. Then head is incremented.
Note that this algorithm does not have a straight-forward method of
calculating the queue length (i.e. the number of messages currently
in the queue). You might be tempted to subtract the head from
the tail, but it gets complicated when tail wraps and head has
not. Even if you handle that properly, threading issues makes
it dangerous to access both variables. It is much better to
simply provide "is_full" and "is_empty" functions, which are
typically what applications really want to know when they ask for
the queue length.
Threading
Considerations
To make a high-performing lockless queue, it is important to
minimize the memory which is shared between the threads. This
queue is designed for single-producer, single-consumer. So you
only have two threads to consider.
The way the algorithm is designed, the enqueue thread accesses the
tail pointer, NEVER the head. The dequeue thread accesses the
head pointer, NEVER the tail. By making sure head and tail are
on different cache lines, there is no memory contention for the head
and tail pointers. The msgs array is the obvious
point of sharing, so it is declared volatile.
(Aside: there is debate whether volatile should ever be used.
Many programmers prefer to use compiler barriers, as their semantics
are more-exactly defined. Unfortunately, there isn't an
efficient and portable way to code a compiler barrier. And
while there are some ambiguities on the exact semantics of volatile,
for the purposes of this q module, those ambiguities don't
matter. My empirical measurements indicate that this algorithm
performs about the same, regardless whether volatile or compiler
barriers are used. One disadvantage of compiler barriers is
that you must place them very carefully, which leaves opportunity
for error. Also, a compiler barrier prevents reordering and
optimization of ALL variables, both shared and private.
Volatile limits its effect to only those variables that it is
applied to, and extends those effects to every access to those
variables. I use volatile for this algorithm.)
(Another aside: the compiler barrier v.s. volatile debate is
sometimes confused with hardware memory barriers. They are two
different issues which must be treated differently. Compiler
barriers and volatile variables control the order and optimization
of the assembly language instructions which the compiler
generates. Hardware memory barriers controls how the CPU and
cache hardware is allowed to reorder physical memory accesses
differently from the order of the assembly language instructions.)
Another threading consideration is the number of shared variables
which must be manipulated to operate on the queue. Each check
of fullness and emptiness requires only a single read. The
enqueue function looks at the element *after* the tail to determine
if the queue is full. The dequeue function looks at the head
element to determine if the queue is empty (head element unused).
Alternate
Algorithms
Under the directory "alternatives" are a series of
alternative implementations: q1.c to q5.c.
Three of the five produce longer timings for both streaming and
ping-pong. Two of the, q2.c and q4.c,
produce slightly better times for streaming, but worse times for
ping-pong. I am sticking with q.c because I would expect
queues to tend to be empty most of the time.
Module explanation:
q.c
The enqueue and dequeue functions are the heart of the lockless
algorithm. In the code fragments below, click on a line
number to display the same code in-context on the right-side of
the screen.
Function q_enq
To enqueue a message:
00141 qerr_t q_enq(q_t *q, void *m)
00142 {
00143 unsigned int tail = (unsigned)(q->enq_cnt & q->size_mask);
00144
00145 /* Queue must always have at least one empty slot. Make sure that
00146 * after the current tail is filled, the next slot will be empty. */
00147 unsigned int next_tail = (tail + 1) & q->size_mask;
00148 if (q->msgs[next_tail].in_use) { return QERR_FULL; } /* Queue is full, item not added */
00149
00150 q->msgs[tail].d = (void * volatile)m;
00151 q->msgs[tail].in_use = 1;
00152 q->enq_cnt++;
00153 return QERR_OK;
00154 } /* q_enq */
The tail pointer is
defined as the number of enqueues mod the number of elements in the
msgs array. By constraining the array size to a power
of two, the tail can be quickly calculated as the number of enqueues
bit-wise ANDed with the the number of array elements minus 1:
00143 unsigned int tail = (unsigned)(q->enq_cnt & q->size_mask);
Function: q_deq
To dequeue a message:
00158 qerr_t q_deq(q_t *q, void **rtn_m)
00159 {
00160 unsigned int head = (unsigned)(q->deq_cnt & q->size_mask);
00161 if (! q->msgs[head].in_use) { return QERR_EMPTY; }
00162
00163 *rtn_m = (void *)q->msgs[head].d;
00164 q->msgs[head].in_use = 0; /* mark it as empty */
00165 q->deq_cnt++;
00166 return QERR_OK;
00167 } /* q_deq */
The head pointer is
defined as the number of dequeues mod the number of elements in the
msgs array. As with q_enq, a simple
bit-wise AND can be used:
00160 unsigned int head = (unsigned)(q->deq_cnt & q->size_mask);
Structure: q_s
Now let's take a look at the queue data structure:
00048 /* message element */
00049 struct q_msg_s {
00050 void *d;
00051 int in_use;
00052 };
00053 typedef struct q_msg_s q_msg_t;
00054
00055 /* q.h contains an empty forward definition of "q_s", and defines "q_t" */
00056 struct q_s {
00057 unsigned int enq_cnt; /* count of successful messages enqueued (tail pointer) */
00058 char enq_pad[CACHE_LINE_SIZE - (sizeof(unsigned int))]; /* align next var on cache line */
00059
00060 unsigned int deq_cnt; /* count of successful messages dequeued (head pointer) */
00061 char deq_pad[CACHE_LINE_SIZE - (sizeof(unsigned int))]; /* align next var on cache line */
00062
00063 q_msg_t * volatile msgs; /* Array of "q_size" elements */
00064 unsigned int size_mask; /* Number of msgs elements minus 1 */
00065 /* make total size a multiple of cache line size, to prevent interference with whatever comes after */
00066 char final_pad[CACHE_LINE_SIZE - ( sizeof(unsigned int) + sizeof(void **) )];
00067 }; /* struct q_s */
Note that this structure is private to q.c.
The API sees an empty forward reference and type in q.h:
00021 struct q_s; /* forward (opaque) definition */
00022 typedef struct q_s q_t; /* object type of queue instance */
To improve threading efficiency, the variables used by the enqueue
and dequeue threads are separated and placed on independent cache
lines. Cache lines are typically 64 bytes long, but I
allocated 128 to be safe. (This is defined in the build script
bld.sh.) This is used to include enough padding bytes
to align to cache lines:
00058 char enq_pad[CACHE_LINE_SIZE - (sizeof(unsigned int))]; /* align next var on cache line */
Note that it is
important to keep the number and types of variables in a cache line
synchronized with the padding calculation. This is a
maintenance burden, but does reap performance benefits.
Function: q_create
The q_create() function is pretty straight-forward, but
there are a few little gems.
To assist in debugging, a set of strings are defined corresponding
to the return status error codes (those strings available via the q_qerr_str()
function). The manifest constants are defined in q.h:
00024 /* Most q APIs return "qerr_t". These definitions must
00025 * be kept in sync with the "qerrs" string array in "q.c". */
00026 #define QERR_OK 0 /* Success */
00027 #define QERR_BUG1 1 /* Internal software bug - should never happen */
00028 #define QERR_BUG2 2 /* Internal software bug - should never happen */
00029 #define QERR_BADSIZE 3 /* q_size parameter invalid */
00030 #define QERR_MALLOCERR 4 /* No memory available */
00031 #define QERR_FULL 5 /* No room in queue */
00032 #define QERR_EMPTY 6 /* No messages in queue */
00033 #define LAST_QERR 6 /* Set to value of last "QERR_*" definition */
The strings are defined in q.c:
00033 /* This list of strings must be kept in sync with the
00034 * corresponding "QERR_*" constant definitions in "q.h".
00035 * It is used by the q_qerr_str() function. */
00036 static char *qerrs[] = {
00037 "QERR_OK",
00038 "QERR_BUG1",
00039 "QERR_BUG2",
00040 "QERR_BADSIZE",
00041 "QERR_MALLOCERR",
00042 "QERR_FULL",
00043 "QERR_EMPTY",
00044 "BAD_QERR", NULL};
00045 #define BAD_QERR (sizeof(qerrs)/sizeof(qerrs[0]) - 2)
When a maintainer is modifying the q
module, care must be taken when adding or changing the return status
error codes to keep the two synchronized. To aid in detecting
mistakes, q_create() does a sanity check:
00090 /* Sanity check the error code definitions and strings */
00091 if (LAST_QERR != BAD_QERR - 1) { return QERR_BUG1; } /* the QERR_* are out of sync with qerrs[] */
The unit test code in q_selftest.c
also does some sanity checking:
00309 if (strcmp(q_qerr_str(QERR_OK), "QERR_OK") != 0) { ERR("q_qerr_str OK"); }
00310 if (strcmp(q_qerr_str(QERR_BUG1), "QERR_BUG1") != 0) { ERR("q_qerr_str BUG1"); }
00311 if (strcmp(q_qerr_str(QERR_BUG2), "QERR_BUG2") != 0) { ERR("q_qerr_str BUG2"); }
00312 if (strcmp(q_qerr_str(QERR_BADSIZE), "QERR_BADSIZE") != 0) { ERR("q_qerr_str BADSIZE"); }
00313 if (strcmp(q_qerr_str(QERR_MALLOCERR), "QERR_MALLOCERR") != 0) { ERR("q_qerr_str MALLOCERR"); }
00314 if (strcmp(q_qerr_str(QERR_FULL), "QERR_FULL") != 0) { ERR("q_qerr_str FULL"); }
00315 if (strcmp(q_qerr_str(QERR_EMPTY), "QERR_EMPTY") != 0) { ERR("q_qerr_str EMPTY"); }
00316 if (strcmp(q_qerr_str(LAST_QERR), "QERR_EMPTY") != 0) { ERR("q_qerr_str LAST_QERR"); }
00317 if (strcmp(q_qerr_str(LAST_QERR+1), "BAD_QERR") != 0) { ERR("q_qerr_str BAD_QERR"); }
00318 if (strcmp(q_qerr_str(-1), "BAD_QERR") != 0) { ERR("q_qerr_str -1"); }
00319 printf("FYI %s[%d]: q_qerr_str OK\n", __FILE__, __LINE__);
Another debugging aid is the fact that an extra element is
allocated for the msgs array:
00102 /* Allocate message storage array (one extra unused element for fence) */
00103 q->msgs = NULL;
00104 perr = posix_memalign((void **)&q->msgs, CACHE_LINE_SIZE, (q_size + 1) * sizeof(q->msgs[0]) );
A
bit later, the ex extra element is set to Q_FENCE:
00107 q->msgs[q_size].d = (void *)Q_FENCE; /* used by unit tests to insure no overflow */
Q_FENCE is a manifest constant
random number (generated at random.org):
00031 #define Q_FENCE 0x7d831f20 /* used for testing (random number generated at random.org) */
This element and value are only used
to sanity check the queue by detecting queue overrun (which the
algorithm should never allow). This is done in q_delete():
00130 if (q->msgs[q->size_mask + 1].d != (void *)Q_FENCE) { return QERR_BUG1; }
Module explanation:
q_selftest.c, q_selftest.h
The files q_selftest.h and q_selftest.c are
conditionally included by q.c:
00024 /* If built with "-DSELFTEST" then include extras for unit testing. */
00025 #ifdef SELFTEST
00026 # include "q_selftest.h"
00027 #endif
...
00187 /* If built with "-DSELFTEST" then include main() for unit testing. */
00188 #ifdef SELFTEST
00189 #include "q_selftest.c"
00190 #endif
They are for unit testing, and are compiled with q.c so
that they can have access to internal definitions for white box
testing. The build script bld.sh does a special unit
test build with "-DSELFTEST" to enable the inclusion.
It is important to understand that since the queue module
implementation code is in the same compilation unit as the unit test
code, the optimizer in-lines many of the functions, and performs
significant compile-time optimizations as a result. Thus, the
performance numbers printed by the unit test code are not
representative of what an external module would see.
One reason for doing the unit tests in the same compilation unit as
the module is that certain subtle bugs can be found that way.
For example, if you remove "volatile" from the declaration
of q_t.msgs, the functions will appear to work most of the
time. But the q_selftest.c function deq_thread()
will end up in an infinite loop. This is because q_deq()
checks the contents of msgs[0].in_use and finds it unused
and returns QERR_EMPTY. Then deq_thread()
ends up looping back and calling q_deq() again. But
since the function is in-lined, the compiler notices that none of
the relevant variables are modified in the empty code path, so it
optimizes the function out completely.
Module
explanation: q_perf.c
The q_perf.c module consists of the same streaming and
ping-pong performance tests as in q_selftest.c, minus the
sanity checking. However, since the q_perf.c is
built as a separate compilation unit, it calls the functions rather
than allowing them to be in-lined. Thus, the performance
measurements are more representative.
The streaming test tends to measure performance where the queue is
usually non-empty, and is frequently full, whereas the ping-pong
test has the queues almost-always empty. Even though the
streaming numbers are more-impressive looking, I suspect that in
most use cases, the queue will spend more time empty than non-empty,
leading me to choose the current queue implementation (which is the
fastest ping-pong of the algorithms I've tested).
Developers
The web-based software documentation you are looking at is called
"SemiLiterate Documentation". The system which generates the
doc is called "semlit", and is freely available at https://github.com/fordsfords/semlit/tree/gh-pages.
If you are a user of the q module, you probably don't care about any
of this and you can probably ignore it. However, if you are a
maintainer, then you will want to be able to update and re-generate
the doc.
One very important fact: DON'T DIRECTLY MODIFY THE *.c AND *.h
FILES! Those files are automatically generated from
corresponding doc/*_c.slsrc and doc/*_h.slsrc
files. To update the q module's code, you should modify the *.slsrc
files, and then generate the .c and .h
files by running bld.sh.
So, first thing to do is download and install semlit. Make sure
that semlit.sh is in your PATH (otherwise the q module's "bld.sh"
will skip the semlit stuff). Then use an html editor (like SeaMonkey)
to modify doc/*.sldoc files, and a code editor (like vim) to modify doc/*.slsrc
files. Finally, run bld.sh to generate the files.
The scripts used by developers are:
- bld.sh - build the package.
- bld_semlit.sh - build the semi-literate documentation
package (optional).
- clean.sh - remove temporary files.
- tst.sh - run the unit tests.