The document proposes a new concurrent FIFO queue algorithm, the Lock-Free Enqueue and Combining Dequeue (LECD) queue, which aims to improve scalability without parameter tuning. It integrates lock-free techniques for enqueue operations with combining techniques for dequeue operations. Experimental results on an 80-thread system show that the LECD queue outperforms the widely used Michael and Scott queue by 14.3 times on average, the best-performing combining-based queue by 1.6 times, and the best-performing x86-dependent lock-free queue by 1.7 percent. The LECD queue also avoids drawbacks of prior approaches, such as requiring dedicated memory management schemes or being restricted to the x86 architecture.
Integrating Lock-Free and Combining Techniques for a
Practical and Scalable FIFO Queue
ABSTRACT:
Concurrent FIFO queues can be generally classified into lock-free queues and
combining-based queues. Lock-free queues require manual parameter tuning to
control the contention level of parallel execution, while combining-based queues
encounter a bottleneck of single-threaded sequential combiner executions at a high
concurrency level. In this paper, we introduce a different approach using both lock-
free techniques and combining techniques synergistically to design a practical and
scalable concurrent queue algorithm. As a result, we have achieved high scalability
without any parameter tuning: in our experiments on an 80-thread system, our
queue algorithm outperforms, on average, the most widely used Michael and Scott
queue by 14.3 times, the best-performing combining-based queue by 1.6 times, and
the best-performing x86-dependent lock-free queue by 1.7
percent. In addition, we designed our algorithm in such a way that the life cycle of
a node is the same as that of its element. This has huge advantages over prior work:
efficient implementation is possible without dedicated memory
management schemes, which are supported only in some languages, may cause a
performance bottleneck, or are patented. Moreover, the synchronized life cycle
between an element and its node enables application developers to further optimize
memory management.
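To illustrate what this synchronized life cycle can buy in practice, the following is a minimal C sketch (the type names are purely illustrative and not taken from the paper) of an element that embeds its queue node, so that the element and its node share one allocation and are freed together:

#include <stdatomic.h>

/* Queue link that lives and dies with the element containing it. */
struct lecd_node {
    _Atomic(struct lecd_node *) next;
};

/* Hypothetical application element with the queue node embedded in it:
 * enqueuing it needs no separate node allocation, and freeing the element
 * after it is dequeued frees its node along with it. */
struct task {
    struct lecd_node link;   /* embedded queue node */
    int id;                  /* ... application payload ... */
};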
EXISTING SYSTEM:
FIFO queues are among the most fundamental and highly studied concurrent
data structures. They are essential building blocks of libraries [1], [2], [3], runtimes
for pipeline parallelism [4], [5], and high performance tracing systems [6]. These
queues can be categorized according to whether they are based on static allocation
of a circular array or on dynamic allocation in a linked list, and whether or not they
support multiple enqueuers and dequeuers. This paper focuses on dynamically
allocated FIFO queues supporting multiple enqueuers and dequeuers. Although
extensive research has been performed to develop scalable and practical concurrent
queues, there are two remaining problems that limit wider practical use: (1)
scalability is still limited at high levels of concurrency, or is difficult to achieve
without sophisticated parameter tuning; and (2) the use of dedicated memory
management schemes for the safe reclamation of removed nodes imposes
unnecessary overhead and limits the further optimization of memory management
[7], [8], [9]. In terms of scalability, avoiding the contended hot spots, Head and
Tail, is the fundamental principle in designing concurrent queues. In this regard,
there are two seemingly contradictory approaches. Lock-free approaches use fine-
grained synchronization to maximize the degree of parallelism and thus improve
performance, whereas combining approaches serialize pending operations through a
single combiner thread to reduce synchronization costs on those hot spots.
The MS queue presented by Michael and Scott [10] is the most well-known
algorithm, and many works to improve the MS queue have been proposed [10],
[11], [12], [13], [14]. They use compare-and-swap (CAS) to update Head and Tail.
With CAS, however, only one of the contending threads will succeed, and all the
other threads will fail and retry until they succeed. Since the failing threads use not
only computational resources, but also the memory bus, which is a shared resource
in cache-coherent multiprocessors, they also slow down the succeeding thread. A
common way to reduce such contention is to use an exponential backoff scheme
[10], [15], which spreads out CAS retries over time. Unfortunately, this process
requires manual tuning of the backoff parameters for a particular workload and
machine combination. To avoid this disadvantage, Morrison and Afek [16]
proposed a lock-free queue based on the fetch-and-add (F&A) atomic instruction,
which always succeeds but is x86-dependent. Though the x86 architecture has a
large market share, as the core count increases, industry and academia have
actively developed various processor architectures to achieve high scalability and
low power consumption, and thus more portable solutions are needed.
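To make the tuning problem concrete, here is a minimal C11 sketch, with illustrative names not taken from any cited implementation, of an MS-style enqueue: a CAS retry loop whose contention is throttled by exponential backoff, where the backoff bounds are exactly the kind of per-workload, per-machine parameters that must be tuned by hand.

#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical node type; only the fields needed for the sketch. */
struct node {
    void *value;
    _Atomic(struct node *) next;
};

/* Busy-wait for roughly *delay iterations, then grow the delay. */
static void backoff(unsigned *delay, unsigned max_delay)
{
    for (volatile unsigned i = 0; i < *delay; i++)
        ;                              /* a real queue would use a pause hint */
    if (*delay < max_delay)
        *delay *= 2;                   /* exponential growth, capped */
}

/* MS-queue-style enqueue: retry CAS until it succeeds.  The constants
 * min_delay and max_delay must be tuned per workload and machine, which
 * is the drawback the proposed queue is designed to avoid. */
void enqueue_cas_with_backoff(_Atomic(struct node *) *tail, struct node *n)
{
    unsigned delay = 1;                /* min_delay */
    const unsigned max_delay = 1024;   /* max_delay */

    atomic_store(&n->next, NULL);
    for (;;) {
        struct node *last = atomic_load(tail);
        struct node *next = atomic_load(&last->next);
        if (next == NULL) {
            /* Try to link the new node after the current last node. */
            if (atomic_compare_exchange_weak(&last->next, &next, n)) {
                /* Swing Tail; a failure here is benign (another thread
                 * already advanced it), so no retry is needed. */
                atomic_compare_exchange_strong(tail, &last, n);
                return;
            }
        } else {
            /* Tail is lagging behind; help advance it. */
            atomic_compare_exchange_strong(tail, &last, next);
        }
        backoff(&delay, max_delay);    /* losing threads wait before retrying */
    }
}

Failing threads spin in this loop and keep issuing CAS instructions, so choosing min_delay and max_delay poorly either wastes memory bandwidth or adds needless latency; this is the parameter-tuning burden referred to above.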
PROPOSED SYSTEM:
In this regard, we propose a linearizable concurrent queue algorithm: Lock-free
Enqueue and Combining Dequeue (LECD) queue. In our LECD queue, enqueue
operations are performed in a lock-free manner, and dequeue operations are
performed in a combining manner. We carefully designed the LECD queue so that
it requires neither retrying atomic instructions nor tuning
parameters for limiting the contention level; a SWAP instruction used in the
enqueue operation always succeeds and a CAS instruction used in the dequeue
operation is designed not to require retrying when it fails. Using the combining
techniques in the dequeue operation significantly improves scalability and has
additional advantages together with the lock-free enqueue operation: (1) by virtue
of the concurrent enqueue operations, we can prevent the combiner from becoming
a performance bottleneck, and (2) the higher combining degree incurred by the
concurrent enqueue operations makes the combining operation more efficient. To
our knowledge, the LECD queue is the first concurrent data structure that uses both
lock-free and combining techniques.
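As a rough illustration of the enqueue side described above, here is a minimal C11 sketch assuming a hypothetical node and queue layout (the names and exact layout are not taken from the paper): a single SWAP (atomic exchange) on Tail that always succeeds, followed by an uncontended link from the previous tail.

#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical node layout for this sketch. */
struct lecd_node {
    void *value;
    _Atomic(struct lecd_node *) next;
};

struct lecd_queue {
    /* Tail initially points at a permanent dummy node and is only ever
     * updated with SWAP, never with a CAS retry loop. */
    _Atomic(struct lecd_node *) tail;
};

/* Lock-free enqueue: one SWAP on Tail that always succeeds, then an
 * uncontended link from the previous tail to the new node. */
void lecd_enqueue(struct lecd_queue *q, struct lecd_node *n, void *value)
{
    n->value = value;
    atomic_store(&n->next, NULL);

    /* Publish the new node as the tail and obtain a private copy of the
     * previous tail; SWAP never fails, so there is no retry and no backoff. */
    struct lecd_node *prev = atomic_exchange(&q->tail, n);

    /* Only this thread holds prev, so linking it is uncontended. */
    atomic_store(&prev->next, n);
}

Because the exchange hands each enqueuer a private copy of the previous tail, no two threads ever compete for the same pointer update, which is why neither a retry loop nor a backoff parameter is involved.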
Using dedicated memory management schemes is another aspect that
hinders the wider practical use of prior concurrent queues. The fundamental
reason for using dedicated schemes is the mismatch in the end of life
between an element and its node. In this regard, we made two non-
traditional, but practical, design decisions. First, to fundamentally avoid
read/reclaim races, the LECD queue is designed so that only one thread accesses
Head and Tail at a time. Since dequeue operations are serialized by a single-
threaded combiner, only one thread accesses Head at a time. Also, a thread
always accesses Tail through a private copy obtained as a result of a SWAP
operation, so there is no contention on accessing the private copy. Second,
we introduce a permanent dummy node technique to synchronize the end of
life between an element and its node. We do not recycle the last dequeued
node as a dummy node. Instead, the initially allocated dummy node is
permanently used by updating Head’s next instead of Head in dequeue
operations. Our approach has huge advantages over prior work. Efficient
implementations of the LECD queue are possible, even in languages which
do not support GC, without using dedicated memory management schemes.
Also, synchronizing the end of life between an element and its node opens
up opportunities for further optimization, such as by embedding a node into
its element.
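The dequeue side can be pictured with a flat-combining-style sketch; this is a simplified stand-in for the paper's actual combining protocol, with hypothetical names throughout. It shows the request list, the combiner election, and the permanent dummy node, but omits the interaction with Tail when the queue drains (which is where the non-retrying CAS mentioned earlier comes in).

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct lecd_node {
    void *value;
    _Atomic(struct lecd_node *) next;
};

/* A dequeue request that its owner publishes and then waits on. */
struct deq_req {
    void *result;                        /* filled in by the combiner */
    _Atomic(bool) done;
    _Atomic(struct deq_req *) next_req;
};

struct lecd_queue {
    struct lecd_node dummy;              /* permanent dummy node, never freed */
    _Atomic(struct lecd_node *) tail;    /* updated only by SWAP in enqueue */
    _Atomic(struct deq_req *) reqs;      /* pending dequeue requests */
    _Atomic(bool) combining;             /* is some thread currently combining? */
};

/* Remove one node by updating the dummy's next pointer; the dummy itself
 * is never replaced, so a removed node's life ends with its element.
 * (The case where the removed node is also the tail is omitted here.) */
static void *serve_one(struct lecd_queue *q)
{
    struct lecd_node *first = atomic_load(&q->dummy.next);
    if (first == NULL)
        return NULL;                     /* queue is empty */
    atomic_store(&q->dummy.next, atomic_load(&first->next));
    return first->value;
}

void *lecd_dequeue(struct lecd_queue *q)
{
    /* 1. Publish a dequeue request on the shared request list. */
    struct deq_req my;
    my.result = NULL;
    atomic_init(&my.done, false);
    atomic_init(&my.next_req, NULL);
    struct deq_req *old = atomic_load(&q->reqs);
    do {
        atomic_store(&my.next_req, old);
    } while (!atomic_compare_exchange_weak(&q->reqs, &old, &my));

    /* 2. Wait until the request is served; meanwhile try to become the
     *    combiner.  A failed CAS here is harmless: the current combiner
     *    will serve this request, so there is nothing to retry. */
    while (!atomic_load(&my.done)) {
        bool expect = false;
        if (!atomic_compare_exchange_strong(&q->combining, &expect, true))
            continue;                    /* someone else is combining */

        /* We are the combiner: take the whole batch of pending requests
         * and serve them sequentially; only the combiner touches Head. */
        struct deq_req *r = atomic_exchange(&q->reqs, NULL);
        while (r != NULL) {
            struct deq_req *nxt = atomic_load(&r->next_req);
            r->result = serve_one(q);     /* NULL means the queue was empty */
            atomic_store(&r->done, true); /* the owner may return after this */
            r = nxt;
        }
        atomic_store(&q->combining, false);
    }
    return my.result;
}

Because only the combiner ever follows the dummy's next pointer, there are no read/reclaim races on removed nodes, and since the dummy is never replaced, a dequeued node can be freed or reused together with its element immediately, without a dedicated reclamation scheme.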
We compared our queues to the state-of-the-art lock-free queues [10], [16]
and combining-based queues [19] in a system with 80 hardware threads.
Experimental results show that, except for the queue [16] that supports only the
Intel x86 architecture, the LECD queue performs best. Even in comparison
with the x86-dependent queue [16], whose parameter is manually tuned
beforehand, the LECD queue outperforms it in some
benchmark configurations. Since the LECD queue requires neither
parameter tuning nor dedicated memory management schemes, we expect
that the LECD queue can be easily adopted and used in practice.
SOFTWARE IMPLEMENTATION:
Modelsim 6.0
Xilinx 14.2