% Build this document with LuaTeX, a modern Unicode-aware LaTeX engine
% that uses system TTF and OTF font files.
% This is needed for the fontspec, microtype, and nolig packages.
\newcommand{\bodyfontsize}{10bp}
\newcommand{\bodybaselineskip}{12bp}
\RequirePackage[table]{xcolor}
% We're using KOMA Script to hand-tune footnotes and TOC appearance.
% It should be available in your texlive distribution,
% which is how most distros package LaTeX.
\documentclass[fontsize=10pt, oneside]{scrartcl}
% Margins: see http://practicaltypography.com/page-margins.html and
% http://practicaltypography.com/line-length.html
% We're aiming for 80-ish characters per line.
\usepackage[
letterpaper,
footnotesep=\bodybaselineskip,
left=0.75in,right=0.75in,top=1in,bottom=1in,
]{geometry}
% Font specification.
\usepackage[no-math]{fontspec}
\usepackage[fleqn]{amsmath}
\usepackage[italic]{mathastext}
\usepackage{polyglossia}
\setdefaultlanguage[variant=american]{english}
\usepackage{microtype} % Font expansion, protrusion, and other goodness
% Disable ligatures across grapheme boundaries
% (see the package manual for details.)
\usepackage[english]{selnolig}
% Use symbols for footnotes, resetting each page
\usepackage[perpage,bottom,symbol*]{footmisc}
% Don't use a sans font for description labels.
\addtokomafont{descriptionlabel}{\rmfamily\mdseries}
\setkomafont{disposition}{\rmfamily}
\setkomafont{section}{\large\itshape}
\setkomafont{subsection}{\normalsize\itshape}
\renewcommand*\thesection{\upshape\arabic{section}}
\usepackage{enumitem}
% Custom footer
\usepackage{scrlayer-scrpage}
\setkomafont{pagefoot}{\sffamily\upshape}
\pagestyle{scrheadings}
\usepackage{minted} % Syntax highlighting via Pygments
\usepackage{graphicx}
\usepackage[font=footnotesize,justification=raggedright]{caption}
\usepackage{tikz} % Duck and cover.
\newcommand{\codesize}{\fontsize{\bodyfontsize}{\bodybaselineskip}}
% Syntax highlighting for Arm asm (minted doesn't do this well)
\usepackage{listings}
\lstset{
basicstyle=\ttfamily\codesize\selectfont,
keywordstyle=\color{darkGreen}\bfseries,
commentstyle=\textcolor[rgb]{0.25,0.50,0.50}
}
% listings definitions for Arm assembly.
% Get them from https://github.com/sysprog21/arm-assembler-latex-listings
\usepackage{lstlangarm} % See above
\usepackage{changepage} % For adjustwidth
\usepackage{metalogo} % for \LuaLaTeX
\input{lib/codeblock}
\setlength\columnsep{2em}
%\setlength\parskip{0em}
\setlength\parindent{1.5em}
\title{Concurrency Primer\footnote{%
The original title was ``What every systems programmer should know about concurrency''.}
}
\author{Matt Kline and Ching-Chun (Jim) Huang}
\date{\today}
% Custom footer
% Hyperlinks
\usepackage[unicode,pdfusetitle]{hyperref}
\usepackage{xcolor}
\definecolor{darkGreen}{HTML}{008000}
\hypersetup{
colorlinks=true, % Use colors
linkcolor=violet, % Intra-doc links
urlcolor=blue % URLs are blue
}
% Use \punckern to overlap periods, commas, and footnote markers
% for a tighter look.
% Care should be taken to not make it too tight - f" and the like can overlap
% if you're not careful.
\newcommand{\punckern}{\kern-0.4ex}
% For placing commas close to, or under, quotes they follow.
% We're programmers, and we blatantly disregard American typographical norms
% to put the quotes inside, but we can at least make it look a bit nicer.
\newcommand{\quotekern}{\kern-0.5ex}
% Create an unbreakable string of text in a monospaced font.
% Useful for `command --line --args`
\newcommand{\monobox}[1]{\mbox{\texttt{#1}}}
\newcommand{\keyword}[1]{\monobox{\color{darkGreen}#1}}
% C++ looks nicer if the ++ is in a monospace font and raised a bit.
% Also, use uppercase numbers to match the capital C.
\newcommand{\cplusplus}{C\kern-0.1ex\raisebox{0.15ex}{\texttt{++}}}
\newcommand{\clang}{C}
\newcommand{\csharp}{C\raisebox{0.25ex}{\#}}
\newcommand{\fig}[1]{Figure~\ref{#1}}
% Italicize new terms
\newcommand{\introduce}[1]{\textit{#1}}
\newcommand{\secref}[1]{\hyperref[#1]{\textsc{\S}\ref*{#1}}}
% See http://tex.stackexchange.com/a/68310
\makeatletter
\let\runauthor\@author
\let\rundate\@date
\let\runtitle\@title
\makeatother
% Spend a bit more time to get better word spacing.
% See http://tex.stackexchange.com/a/52855/92465
\emergencystretch=1em
\begin{document}
% Custom title instead of \maketitle
\begin{center}
\Large \runtitle
\bigskip
\large
\runauthor
\smallskip
\normalsize
\rundate
\end{center}
\bigskip
\begin{center}
\large \bfseries\itshape Abstract
\end{center}
\smallskip
\noindent
Systems programmers are acquainted with tools such as mutexes, semaphores, and condition variables.
But how do these tools work, and how do we write concurrent code in their absence,
for example, when working in an embedded environment beneath the operating system,
or when hard time constraints prohibit blocking?
Furthermore, since the compiler and hardware often conspire to transform code into an unanticipated order,
how do multithreaded programs work at all? Concurrency is a complex and counterintuitive topic,
but let us endeavor to explore its fundamental principles.
\bigskip
\section{Background}
\label{background}
Modern computers execute multiple instruction streams concurrently.
On single-core systems, these streams alternate, sharing the \textsc{CPU} in brief time slices.
Multi-core systems, however, allow several streams to run in parallel.
These streams are known by various names such as processes, threads, tasks,
interrupt service routines (ISR), among others, yet many of the same principles govern them all.
Despite the development of numerous sophisticated abstractions by computer scientists,
these instruction streams—hereafter referred to as ``\emph{threads}'' for simplicity—primarily interact through shared state.
Proper functioning hinges on understanding the sequence in which threads read from and write to memory.
Consider a simple scenario where thread \textit{A} communicates an integer to other threads:
it writes the integer to a variable and then sets a flag, signaling other threads to read the newly stored value.
This operation could be conceptualized in code as follows:
\begin{ccode}
int v;
bool v_ready = false;
void threadA()
{
// Write the value
// and set its ready flag.
v = 42;
v_ready = true;
}
\end{ccode}
\begin{ccode}
void threadB()
{
// Await a value change and read it.
while (!v_ready) { /* wait */ }
const int b_v = v;
// Do something with b_v...
}
\end{ccode}
We must ensure that other threads only observe \textit{A}'s write to \cc|v_ready| \emph{after A's} write to \cc|v|.
If another thread can ``see'' \cc|v_ready| becoming true before observing \cc|v| becoming $42$,
this simple scheme will not work correctly.
One might assume it is straightforward to ensure this order,
yet the reality is often more complex.
First, any optimizing compiler will restructure your code to enhance performance on its target hardware.
The primary objective is to maintain the operational effect within \emph{the current thread},
allowing reads and writes to be rearranged to prevent pipeline stalls\footnote{%
Most \textsc{CPU} architectures execute segments of multiple instructions concurrently to improve throughput (refer to \fig{fig:pipeline}).
A stall, or suspension of forward progress, occurs when an instruction must wait for the result of a preceding one in the pipeline.} or to optimize data locality.\punckern\footnote{%
\textsc{RAM} accesses data not byte by byte, but in larger units known as \introduce{cache lines}.
Grouping frequently used variables on the same cache line means they are processed together,
significantly boosting performance. However, as discussed in \secref{shared-resources},
this strategy can lead to complications when cache lines are shared across cores.}
Variables may be allocated to the same memory location if their usage does not overlap.
Furthermore, calculations might be performed speculatively ahead of a branch decision and subsequently discarded if the branch prediction proves incorrect.\punckern\footnote{%
Profile-guided optimization (PGO) often employs this strategy.}
%(These sorts of optimizations sometimes called the ``as-if'' rule in \cplusplus{}.)
Even without compiler alterations,
we would face challenges because our hardware complicates matters further!
Modern \textsc{CPU}s operate in a fashion far more complex than what traditional pipelined methods,
like those depicted in \fig{fig:pipeline}, suggest.
They are equipped with multiple data paths tailored for various instruction types and schedulers that reorder and direct instructions through these paths.
\includegraphics[keepaspectratio,width=0.7\linewidth]{images/pipeline}
\captionof{figure}{A traditional five-stage \textsc{CPU} pipeline with fetch, decode, execute, memory access, and write-back stages.
Modern designs are much more complicated, often reordering instructions on the fly.}
\label{fig:pipeline}
It is quite common to form oversimplified views about memory operations.
Picturing a multi-core processor setup might lead us to envision a model similar to \fig{fig:ideal-machine},
wherein each core alternately accesses and manipulates the system's memory.
\includegraphics[keepaspectratio, width=0.8\linewidth]{images/ideal-machine}
\captionof{figure}{An idealized multi-core processor where cores
take turns accessing a single shared set of memory.}
\label{fig:ideal-machine}
The reality is far from straightforward.
Although processor speeds have surged exponentially in recent decades,
\textsc{RAM} has struggled to match pace,
leading to a significant gap between the execution time of an instruction and the time required to fetch its data from memory.
To mitigate this, hardware designers have incorporated increasingly complex hierarchical caches directly onto the \textsc{CPU} die.
Additionally, each core often features a \introduce{store buffer} to manage pending writes while allowing further instructions to proceed.
Ensuring this memory system remains \introduce{coherent},
thus allowing writes made by one core to be observable by others even when utilizing different caches,
presents a significant challenge.
\includegraphics[keepaspectratio, width=0.8\linewidth]{images/mp-cache}
\captionof{figure}{A common memory hierarchy for modern multi-core processors}
\label{fig:dunnington}
The myriad complexities of multithreaded programs on multi-core \textsc{CPU}s mean that there is no uniform concept of ``now''.
Establishing some semblance of order among threads requires a concerted effort involving the hardware,
compiler, programming language, and your application.
Let's delve into our options and the tools necessary for this endeavor.
\section{Enforcing law and order}
\label{seqcst}
Establishing order in multithreaded programs varies across different \textsc{CPU} architectures.
For years, systems languages like \clang{} and \cplusplus{} lacked built-in concurrency mechanisms,
compelling developers to rely on assembly or compiler-specific extensions.
This gap was bridged in 2011 when the \textsc{ISO} standards for both languages introduced synchronization tools.
Provided these tools are used correctly,
the compiler ensures that neither its optimization processes nor the \textsc{CPU} will perform reorderings that could lead to data races.\punckern\footnote{%
The ISO~\clang{11} standard adopted its concurrency features,
almost directly, from the \cplusplus{11} standard.
Thus, the functionalities discussed should be the same in both languages,
with some minor syntactical differences favoring \cplusplus{} for clarity.
}
To ensure our earlier example functions as intended,
the ``ready'' flag must utilize an \introduce{atomic type}.
\begin{ccode}
#include <stdatomic.h>
int v = 0;
atomic_bool v_ready = false;
void *threadA()
{
v = 42;
v_ready = true;
}
\end{ccode}
\begin{ccode}
int b_v;
void *threadB()
{
while(!v_ready) { /* wait */ }
b_v = v;
/* Do something */
}
\end{ccode}
The \clang{} and \cplusplus{} standard libraries define a series of these types in \cc|<stdatomic.h>| and \cpp|<atomic>|,
respectively.
They look and act just like the integer types they mirror (e.g., \monobox{bool}~\textrightarrow~\monobox{atomic\_bool},
\monobox{int}~\textrightarrow~\monobox{atomic\_int}, etc.),
but the compiler ensures that other variables' loads and stores are not reordered around theirs.
Informally, we can think of atomic variables as rendezvous points for threads.
By making \monobox{v\_ready} atomic,
\monobox{v = 42}\, is now guaranteed to happen before \monobox{v\_ready = true}\, in thread~\textit{A},
just as \monobox{b\_v = v}\, must happen after reading \monobox{v\_ready}\,
in thread~\textit{B}.
Formally, atomic types establish a \textit{single total modification order} where,
``[\ldots] the result of any execution is the same as if the reads and writes occurred in some order, and the operations of each individual processor appear in this sequence in the order specified by its program.''
This model, defined by Leslie Lamport in 1979,
is called \introduce{sequential consistency}.
Notice that using atomic variables as an lvalue expression, such as \monobox{v\_ready = true} and \monobox{while(!v\_ready)}, is a convenient alternative to explicitly using \monobox{atomic\_load} or \monobox{atomic\_store}.\punckern\footnote{%
Atomic loads and stores are not necessarily generated as special atomic instructions.
Under a weaker consistency model, they may be ordinary loads and stores,
and their code generation can vary across architectures.
Check out \href{https://llvm.org/docs/Atomics.html\#atomics-and-codegen}{LLVM's documentation} for an example of how this is handled.}
As stated in C11 6.7.2.4 and 6.7.3, the properties associated with atomic types are meaningful only for expressions that are
lvalues.
Lvalue-to-rvalue conversion (which models a memory read from an atomic location to a CPU register) strips atomicity along with other qualifiers.
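For illustration, the flag operations from the example above can be written with the explicit functions; both spellings default to sequentially consistent ordering:
\begin{ccode}
#include <stdatomic.h>
#include <stdbool.h>

extern atomic_bool v_ready;

void explicit_forms(void)
{
    /* Same effect as the lvalue forms used earlier. */
    atomic_store(&v_ready, true);       /* v_ready = true; */
    bool ready = atomic_load(&v_ready); /* ... = v_ready;  */
    (void) ready;
}
\end{ccode}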
\section{Atomicity}
\label{atomicity}
But order is only one of the vital ingredients for inter-thread communication.
The other is what atomic types are named for: atomicity.
Something is \introduce{atomic} if it cannot be divided into smaller parts.
If threads do not use atomic reads and writes to share data, we are still in trouble.
Consider a program with two threads.
One thread processes a list of files, incrementing a counter each time it finishes working on one.
The other thread handles the user interface, periodically reading the counter to update a progress bar.
If that counter is a 64-bit integer, we cannot access it atomically on 32-bit machines,
since doing so requires two loads or stores to read or write the entire value.
If we are particularly unlucky, the first thread could be halfway through writing the counter when the second thread reads it,
receiving garbage.
These unfortunate occasions are called \introduce{torn reads and writes}.
If reads and writes to the counter are atomic, however, our problem disappears.
We can see that, compared to the difficulties of establishing the right order,
atomicity is fairly straightforward:
just make sure that any variables used for thread synchronization
are no larger than the \textsc{CPU} word size.
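As a minimal sketch of the progress-counter scenario (\monobox{process\_one\_file} and \monobox{draw\_bar} are hypothetical helpers), making the counter atomic rules out torn accesses, though a 32-bit target may implement it with a lock, as \secref{atomictype} explains:
\begin{ccode}
#include <stdatomic.h>
#include <stdint.h>

extern void process_one_file(int i); /* hypothetical */
extern void draw_bar(uint64_t done); /* hypothetical */

_Atomic uint64_t files_done; /* shared progress counter */

void file_worker(int n)
{
    for (int i = 0; i < n; ++i) {
        process_one_file(i);
        atomic_store(&files_done, (uint64_t) i + 1); /* never torn */
    }
}

void ui_thread(void)
{
    draw_bar(atomic_load(&files_done)); /* never torn */
}
\end{ccode}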
\includegraphics[keepaspectratio, width=0.8\linewidth]{images/atomicity}
\captionof{figure}{A flowchart depicting how two concurrent programs communicate and coordinate through a shared resource to achieve a goal.}
\label{fig:atomicity}
\fig{fig:atomicity} summarizes the concepts of the first three sections.
In \secref{background}, we observe the importance of maintaining the correct order of operations, $t3 \to t4 \to t5 \to t6 \to t7$, so that two concurrent programs can function as expected.
In \secref{seqcst}, we see how two concurrent programs communicate to guarantee the ordering $t5 \to t6$.
In \secref{atomicity}, we understand that certain operations must be treated as a single atomic step to ensure the orderings $t3 \to t4 \to t5$ and $t6 \to t7$.
\section{Arbitrarily-sized ``atomic'' types}
\label{atomictype}
Along with \cc|atomic_int| and friends,
\cplusplus{} provides the template \cpp|std::atomic<T>| for defining arbitrary atomic types.
\clang{}, lacking a similar language feature but wanting to provide the same functionality,
added an \keyword{\_Atomic} keyword.
If \texttt{T} is larger than the machine's word size,
the compiler and the language runtime automatically surround the variable's reads and writes with locks.
If you want to make sure this is not happening,\punckern\footnote{%
\ldots which is most of the time,
since we are usually using atomic operations to avoid locks in the first place.}
you can check with:
\begin{cppcode}
std::atomic<Foo> bar;
ASSERT(bar.is_lock_free());
\end{cppcode}
In most cases,\punckern\footnote{%
The language standards permit atomic types to be \emph{sometimes} lock-free.
This might be necessary for architectures that do not guarantee atomicity for unaligned reads and writes.}
this information is known at compile time.
Consequently, \cplusplus{17} added \cpp|is_always_lock_free|:
\begin{cppcode}
static_assert(std::atomic<Foo>::is_always_lock_free);
\end{cppcode}
\section{Read-modify-write}
\label{rmw}
So far we have introduced the importance of order and atomicity.
In \secref{seqcst}, we saw how an atomic object keeps the compiler from reordering a single store or load with respect to other operations in its thread.
Only after establishing the correct order within each thread can we pursue a correct order across threads.
Having achieved that, we can explore how concurrent threads coordinate and collaborate smoothly.
In \secref{atomicity}, we saw the need for atomicity to ensure that a group of operations is not only sequentially executed but also completes without interruption by operations from other threads.
Together, these establish a correct order of operations across different threads.
\includegraphics[keepaspectratio, width=0.6\linewidth]{images/atomic-rmw}
\captionof{figure}{Exchange, Test and Set, Fetch and…, and Compare and Swap can all be transformed into atomic RMW operations, ensuring that sequences like $t1 \to t2 \to t3$ become a single atomic step.}
\label{fig:atomic-rmw}
Atomic loads and stores are all well and good when we do not need to consider the previous state of atomic variables, but sometimes we need to read a value, modify it, and write it back as a single atomic step.
As shown in \fig{fig:atomic-rmw}, the modification is based on the previous state that is visible for reading, and the result is then written back.
A complete \introduce{read-modify-write} operation is performed atomically to ensure visibility to subsequent operations.
Furthermore, for communication between concurrent threads, a shared resource is required, as shown in \fig{fig:atomicity}.
Think back to the discussion in previous sections.
In order for concurrent threads to collaborate on operating a shared resource, we need a way to communicate.
Thus, the need for a channel for communication arises with the appearance of the shared resource.
As discussed earlier, the process of accessing shared resources responsible for communication must also ensure both order and non-interference.
To prevent the recursive protection of shared resources,
atomic operations can be introduced for the shared resources responsible for communication, as shown in \fig{fig:atomic-types}.
There are a few common \introduce{read-modify-write} (\textsc{RMW}) operations that make these compound operations a single atomic step.
In \cplusplus{}, they are represented as member functions of \cpp|std::atomic<T>|.
In \clang{}, they are freestanding functions.
\includegraphics[keepaspectratio, width=1\linewidth]{images/atomic-types}
\captionof{figure}{Test and Set (Left) and Compare and Swap (Right) leverage their functionality of checking and their atomicity to make other RMW operations perform atomically.
The red color represents atomic RMW operations, while the blue color represents RMW operations that behave atomically.}
\label{fig:atomic-types}
\subsection{Exchange}
\label{exchange}
\introduce{Exchange} transforms an \textsc{RMW} into modifying a private variable first,
and then directly swapping the private variable with the shared variable.
Only the second step must be a single atomic step:
the Read that loads the shared variable, followed by the Modify and Write that exchange it with the private variable.
This allows programmers to extensively modify the private variable beforehand and write it to the shared variable only when necessary.
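A minimal sketch of this pattern, assuming a hypothetical shared configuration word:
\begin{ccode}
#include <stdatomic.h>

atomic_int shared_config; /* hypothetical shared variable */

void update_config(void)
{
    int draft = 0;
    /* Extensively modify the private draft beforehand... */
    draft |= 0x1;
    /* ...then publish it in one atomic step, receiving the
     * previous value of the shared variable in return. */
    int old = atomic_exchange(&shared_config, draft);
    (void) old; /* may be inspected or reclaimed */
}
\end{ccode}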
\subsection{Test and set}
\label{Testandset}
\introduce{Test-and-set} works on a Boolean value:
we read it, set it to \cpp|true|, and provide the value it held beforehand.
\clang{} and \cplusplus{} offer a type dedicated to this purpose, called \monobox{atomic\_flag}.
The initial value of an \monobox{atomic\_flag} is indeterminate until it is initialized with the \monobox{ATOMIC\_FLAG\_INIT} macro.
\introduce{Test-and-set} operations are not limited to \textsc{RMW} functions;
they can also be used to construct a simple spinlock.
In this scenario, the flag acts as a shared resource for communication between threads.
Thus, spinlock implemented with \introduce{Test-and-set} operations ensures that entire \textsc{RMW} operations on shared resources are performed atomically, as shown in \fig{fig:atomic-types}.
\label{spinlock}
\begin{ccode}
atomic_flag af = ATOMIC_FLAG_INIT;
void lock()
{
while (atomic_flag_test_and_set(&af)) { /* wait */ }
}
void unlock() { atomic_flag_clear(&af); }
\end{ccode}
If we call \cc|lock()| and the previous value is \cc|false|,
we are the first to acquire the lock,
and can proceed with exclusive access to whatever the lock protects.
If the previous value is \cc|true|,
someone else has acquired the lock and we must wait until they release it by clearing the flag.
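Usage is then straightforward; a brief sketch guarding a hypothetical non-atomic counter with the lock above:
\begin{ccode}
int counter; /* protected by af */

void increment_counter(void)
{
    lock();    /* spin until test-and-set returns false */
    ++counter; /* exclusive access in the critical section */
    unlock();  /* clear the flag so others may enter */
}
\end{ccode}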
\subsection{Fetch and…}
\introduce{Fetch-and…} transforms an \textsc{RMW} into directly modifying the shared variable (with addition, subtraction,
or bitwise \textsc{AND}, \textsc{OR}, \textsc{XOR}) and returning its previous value,
all as part of a single atomic operation.
Compared with \introduce{Exchange} (\secref{exchange}),
\introduce{Fetch-and…} suits cases where programmers only need to make a simple modification to the shared variable.
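For example, a unique ticket dispenser needs nothing more than a single \monobox{atomic\_fetch\_add}; a minimal sketch:
\begin{ccode}
#include <stdatomic.h>

atomic_int next_ticket;

int take_ticket(void)
{
    /* Atomically add 1 and return the previous value,
     * so every caller receives a distinct ticket. */
    return atomic_fetch_add(&next_ticket, 1);
}
\end{ccode}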
\subsection{Compare and swap}
\label{cas}
Finally, we have \introduce{compare-and-swap} (\textsc{CAS}),
sometimes called \introduce{compare-and-exchange}.
It allows us to conditionally exchange a value \emph{if} its previous value matches the expected one.
In \clang{} and \cplusplus{}, as noted in C11 7.17.7.4, \textsc{CAS} resembles the following,
if it were executed atomically:
\begin{ccode}
/* A is an atomic type. C is the non-atomic type corresponding to A */
bool atomic_compare_exchange_strong(A *obj, C *expected, C desired)
{
    if (memcmp(obj, expected, sizeof(*obj)) == 0) {
        memcpy(obj, &desired, sizeof(*obj));
        return true;
    } else {
        memcpy(expected, obj, sizeof(*obj));
        return false;
    }
}
\end{ccode}
\begin{samepage}
\noindent The \cpp|_strong| suffix may leave you wondering if there is a corresponding ``weak'' \textsc{CAS}.
Indeed, there is. However, we will delve into that topic later in \secref{spurious-llsc-failures}.
\end{samepage}
Because \textsc{CAS} compares against an expected value,
its use extends beyond \textsc{RMW} functions.
Here is how it works: first, read the shared resource and use that value as the expected value.
Modify a private variable, and then \textsc{CAS}: compare the current shared variable with the expected one.
If they match, the modification is known to be exclusive, so write by swapping the shared variable with the private variable.
If they do not match, another thread has interfered.
In that case, update the expected value with the current shared value and retry the modification in a loop.
This iterative process allows \textsc{CAS} to serve as a communication mechanism between threads,
ensuring that entire \textsc{RMW} operations on shared resources are performed atomically.
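A minimal sketch of this retry loop, atomically doubling a shared value:
\begin{ccode}
#include <stdatomic.h>

atomic_int shared;

void double_shared(void)
{
    /* Read the shared value as the expected value. */
    int expected = atomic_load(&shared);
    /* On failure, the current shared value is written back
     * into `expected`, and we retry with it. */
    while (!atomic_compare_exchange_weak(&shared, &expected,
                                         expected * 2))
        ;
}
\end{ccode}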
As shown in \fig{fig:atomic-types}, compared with \introduce{Test-and-set} (\secref{Testandset}),
a thread that employs \textsc{CAS} can use the shared resource itself for the check.
It uses atomic \textsc{CAS} to ensure that the Modify step is atomic,
coupled with a while loop so that the entire \textsc{RMW} behaves atomically.
However, atomic \textsc{RMW} operations here are merely a programming tool for achieving program logic correctness.
Whether they execute as actual atomic instructions depends on how the compiler translates them for the particular hardware instruction set.
\introduce{Exchange}, \introduce{Fetch-and-Add}, \introduce{Test-and-set}, and \textsc{CAS} are, at the instruction level, different styles of atomic \textsc{RMW} instructions.
An \textsc{ISA} may provide only some of them,
leaving the rest for compilers to synthesize.
For example, in IA-32/64 and IBM System/360/z architectures,
\introduce{Test-and-set} functionality is directly supported by hardware instructions,
while x86 has \monobox{XCHG} and \monobox{XADD} for \introduce{Exchange} and \introduce{Fetch-and-Add} and implements \introduce{Test-and-set} with \monobox{XCHG}.
Arm, in another style, provides \textsc{LL/SC} (Load Linked/Store Conditional) flavored instructions for all of these operations,
with dedicated \textsc{CAS} instructions added in Armv8/v9-A.
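For instance, on an \textsc{ISA} that offers only \textsc{CAS}, a compiler might synthesize \introduce{Fetch-and-Add} roughly as in this hedged sketch:
\begin{ccode}
#include <stdatomic.h>

int fetch_add_via_cas(atomic_int *obj, int arg)
{
    int expected = atomic_load(obj);
    /* Retry until no other thread slips in between the load
     * and the swap; `expected` is refreshed on each failure. */
    while (!atomic_compare_exchange_weak(obj, &expected,
                                         expected + arg))
        ;
    return expected; /* the value before the addition */
}
\end{ccode}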
\subsection{Example}
\label{rmw_example}
The following example is a simplified implementation of a thread pool that demonstrates the use of the \clang{}11 atomics library.
\inputminted{c}{./examples/rmw_example.c}
The output of the program is:
\begin{ccode}
PI calculated with 100 terms: 3.141592653589793
\end{ccode}
\textbf{Exchange}
In function \monobox{thread\_pool\_destroy}, \monobox{atomic\_exchange(\&thrd\_pool->state, cancelled)} reads the current state and replaces it with ``cancelled''.
A warning message is printed if the pool is destroyed while workers are still ``running''.
If the exchange is not performed atomically, we may initially get the state as ``running''. Subsequently, a thread could set the state to ``cancelled'' after finishing the last one, resulting in a false warning.
\textbf{Test and set}
In this example, the scenario is as follows:
first, the main thread acquires the lock \monobox{future->flag}, setting it to true,
which is akin to creating a job and then transferring its ownership to the worker.
Subsequently, the main thread is blocked until the worker clears the flag.
This indicates that the main thread will wait until the worker completes the job and returns ownership back to the main thread, which ensures correct cooperation.
\textbf{Fetch and…}
In the function \monobox{thread\_pool\_destroy}, \monobox{atomic\_fetch\_and} is utilized as a means to set the state to ``idle''.
Yet, in this case, it is not necessary, as the pool needs to be reinitialized for further use regardless.
Its return value could be further utilized, for instance, to report the previous state and perform additional actions.
\textbf{Compare and swap}
Once threads are created in the thread pool as workers, they will continuously search for jobs to do.
Jobs are taken from the tail of the job queue.
To take a job without having it stolen by another worker halfway through, we need to atomically update the pointer to the last job.
Otherwise, the last job is subject to a data race.
The while loop in the function \monobox{worker},
\begin{ccode}
while (!atomic_compare_exchange_weak(&thrd_pool->head->prev, &job,
                                     job->prev)) {
}
\end{ccode}
keeps trying to claim the job atomically until it succeeds.
Built-in post-increment and post-decrement operators and compound assignments on atomic objects, such as \monobox{++} and \monobox{+=}, are also read-modify-write atomic operations with total sequentially consistent ordering.
They behave equivalently to a \cc|do while| \textsc{CAS} loop. See the \clang{}11 standard, 6.5.2.4 and 6.5.16.2, for more details.
What if claiming a job, which updates \cc|thrd_pool->head->prev|, is not done atomically?
Two or more threads could have races updating \cc|thrd_pool->head->prev| and working on the same job.
Data races are undefined behavior in \clang{}11 and \cplusplus{}11.
Working on the same job can lead to duplication of the calculation of \cc|job->future->result|,
use after free and double free on the job.
But even when jobs are claimed atomically, a thread may still end up holding a job that has already been freed.
This is a defect of the example code.
Jobs in the example are dynamically allocated and freed after a worker finishes each one.
This can leave dangling pointers in workers that are still holding and attempting to claim the job.
If jobs are to be dynamically allocated, safe memory reclamation should be implemented for such shared objects.
\textsc{RCU}, hazard pointers, and reference counting are the major ways of solving this problem.
\subsection{Further improvements}
At the beginning of \secref{rmw}, we described how a global total order is established by combining local order and inter-thread order imposed by atomic objects.
But should every object, including non-atomic ones, participate in a single global order established by atomic objects?
\introduce{Sequential consistency} solves the ordering problem in \secref{seqcst}, but it may force too much ordering, since some ordinary operations do not require it.
By default, atomic operations in the \clang{}11 atomics library use \monobox{memory\_order\_seq\_cst}; operations suffixed with \monobox{\_explicit} accept an additional argument specifying which memory order to use.
How to leverage memory orders to optimize performance will be covered later in \secref{lock-example}.
\section{Shared Resources}
\label{shared-resources}
From \secref{rmw}, we have understood that there are two types of shared resources that need to be considered.
The first type is shared resources that concurrent threads will access in order to collaborate to achieve a goal.
The second type is shared resources that serve as a communication channel for concurrent threads,
ensuring correct access to shared resources.
However, all of these considerations stem from a programming perspective,
where we only distinguish between shared resources and private resources.
Given all the complexities to consider, modern hardware adds another layer to the puzzle,
as depicted in \fig{fig:dunnington}.
Remember, memory moves between the main \textsc{RAM} and the \textsc{CPU} in segments known as cache lines.
These cache lines also represent the smallest unit of data transferred between cores and caches.
When one core writes a value and another reads it,
the entire cache line containing that value must be transferred from the first core's cache(s) to the second core's cache(s),
ensuring a coherent ``view'' of memory across cores. This dynamic can significantly affect performance.
This slowdown is even more insidious when it occurs between unrelated variables that happen to be placed on the same cache line,
as shown in \fig{fig:false-sharing}.
When designing concurrent data structures or algorithms,
this \introduce{false sharing} must be taken into account.
One way to avoid it is to pad atomic variables with a cache line of private data,
but this is obviously a large space-time trade-off.
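A minimal sketch of such padding, assuming 64-byte cache lines:
\begin{ccode}
#include <stdatomic.h>

/* Alignment forces each counter onto its own cache line
 * (the struct size also rounds up to 64 bytes), so writes
 * from one thread do not invalidate the other's line. */
struct padded_counter {
    _Alignas(64) atomic_int value;
};

struct padded_counter counter_a; /* written by thread A */
struct padded_counter counter_b; /* written by thread B */
\end{ccode}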
\includegraphics[keepaspectratio, width=0.6\linewidth]{images/false-sharing}
\captionof{figure}{Processor 1 and Processor 2 operate independently on variables A and B.
Simultaneously, they read the cache line containing these two variables.
In the next time step, each processor modifies A and B in their private L1 cache separately.
Subsequently, both processors write their modified cache line to the shared L2 cache.
At this moment, the expansion of the scope of shared resources to encompass cache lines highlights the importance of considering cache coherence issues.}
\label{fig:false-sharing}
Beyond the data that threads deliberately share,
we also need to consider shared resources that serve as a communication channel, e.g., a spinlock (see \secref{spinlock}).
Processors using locks as a communication channel also need to transfer the cache line.
When a processor broadcasts the release of a lock,
multiple processors on different chips attempt to acquire the lock simultaneously.
To ensure a consistent state of the lock across all private L1 cache lines,
which is a part of cache coherence,
the cache line containing the lock will be continually transferred among the caches of those cores.
Unless the critical sections are considerably lengthy,
the time spent managing this cache line movement could exceed the time spent within the critical sections themselves,\punckern\footnote{%
This situation underlines how some systems may experience a cache miss that is substantially more costly than an atomic \textsc{RMW} operation,
as discussed in Paul~E.\ McKenney's
\href{https://www.youtube.com/watch?v=74QjNwYAJ7M}{talk from CppCon~2017}
for a deeper exploration.}
despite the algorithm's non-blocking nature.
With these high communication costs, only one processor succeeds in acquiring the lock again in the case of a mutex or spinlock, as shown in \fig{fig:spinlock}.
The other processors that failed to acquire the lock continue to wait,
yielding little practical benefit (only one processor makes progress) at significant communication overhead.
This disparity severely limits the scalability of the spinlock.
\includegraphics[keepaspectratio, width=0.9\linewidth]{images/spinlock}
\captionof{figure}{Three processors use a lock as a communication channel to ensure correct access to the shared L2 cache.
Processors 2 and 3 are trying to acquire a lock that is held by processor 1.
Therefore, when processor 1 unlocks,
the state of lock needs to be updated on other processors' private L1 cache.}
\label{fig:spinlock}
\section{Concurrency tools and synchronization mechanisms}
\label{concurrency-tool}
Atomic loads, stores, and \textsc{RMW} operations are the building blocks for every single concurrency tool.
It is useful to split those tools into two camps:
\introduce{blocking} and \introduce{lockless}.
As mentioned in \secref{rmw}, threads can use blocking tools to communicate with each other,
and these tools can even assist in synchronization between threads.
The blocking mechanism is quite simple:
a thread makes its own progress by blocking others.
However, this simplicity can cause threads to pause for unpredictable durations, which in turn affects the progress of the overall system.
Take a mutex as an example:
it requires threads to access shared data sequentially.
If a thread locks the mutex and another attempts to lock it too,
the second thread must wait, or \introduce{block},
until the first one unlocks it, regardless of the wait time.
Additionally, blocking mechanisms are prone to \introduce{deadlock} and \introduce{livelock},
issues that lead to the system becoming immobilized as threads perpetually wait on each other.
Suppose the first thread acquires one mutex while the second thread locks another,
and then each attempts to lock the mutex held by the other.
Neither can proceed, and the deadlock occurs.
Therefore, we can see that deadlock occurs when different threads acquire locks in incompatible orders,
leading to system immobilization as threads perpetually wait on each other.
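A minimal sketch of that incompatible ordering, using \textsc{POSIX} threads:
\begin{ccode}
#include <pthread.h>

pthread_mutex_t m1 = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t m2 = PTHREAD_MUTEX_INITIALIZER;

void *thread1(void *arg)
{
    pthread_mutex_lock(&m1);
    pthread_mutex_lock(&m2); /* waits while thread2 holds m2 */
    pthread_mutex_unlock(&m2);
    pthread_mutex_unlock(&m1);
    return arg;
}

void *thread2(void *arg)
{
    pthread_mutex_lock(&m2);
    pthread_mutex_lock(&m1); /* waits while thread1 holds m1 */
    pthread_mutex_unlock(&m1);
    pthread_mutex_unlock(&m2);
    return arg;
}
\end{ccode}
If each thread acquires its first mutex before the other releases it, both block forever; acquiring locks in a single global order avoids this.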
Additionally, in \secref{shared-resources},
we can see another problem with the lock: its scalability is limited.
After understanding the issue that blocking mechanisms are prone to,
we try to achieve synchronization between threads without lock.
Consider the program below, first executed by a single thread:
\begin{cppcode}
while (x == 0)
x = 1 - x;
\end{cppcode}
When executed by a single thread, these operations complete within a finite time.
However, with two threads executing concurrently,
if each thread's \cpp|x = 1 - x| is immediately undone by the other's,
the value of \cpp|x| observed at each test remains 0 and neither loop ever exits, which is a livelock.
Therefore, even without any locks in concurrent threads,
we still cannot guarantee that the overall system can make progress toward achieving the programmer's goals.
Consequently, we should not focus on comparing which communication tools or synchronization mechanisms are better,
but rather on exploring how to effectively use these tools in a given scenario to facilitate smooth communication between threads and achieve the programmer's goals.
\section{Lock-free}
In \secref{concurrency-tool}, we explored different mechanisms based on the characteristics of concurrency tools,
as described in \secref{atomicity} and \secref{rmw}.
In this section, we explore strategies that help programmers design concurrent programs
in which the threads collectively ensure progress of the overall system while also improving scalability,
which is the original goal of designing a concurrent program.
First of all, we must figure out the scope of our problem.
Understanding the relationship between the progress of each thread and the progress of the entire system is necessary.
\subsection{Type of progress}
Consider the scenario where many concurrent threads collaborate and each thread is divided into many operations. We can distinguish the following progress guarantees.
\textbf{Wait-Free} Every operation in every thread will be completed within a limited time.
This also implies that each operation contributes to the overall progress of the system.
\textbf{Lock-Free} At any given moment, among all operations in every thread,
at least one operation contributes to the overall progress of the system.
However, it does not guarantee that starvation will not occur.
\textbf{Obstruction-Free} At any given time, if there is only a single thread operating without interference from other threads,
its instructions can be completed within a finite time. However, when threads are working concurrently,
it does not guarantee progress.
The three guarantees relate as follows:
obstruction-free includes lock-free, and lock-free includes wait-free.
Wait-freedom is the strongest guarantee,
allowing each thread to make progress without ever being blocked by other threads.
\includegraphics[keepaspectratio, width=1 \linewidth]{images/progress-type}
\captionof{figure}{In a wait-free system, each thread is guaranteed to make progress at every moment because no thread can block others.
This ensures that the overall system can always make progress.
In a lock-free system, at Time 1, Thread 1 may cause other threads to wait while it performs its operation.
However, even if Thread 1 suspends at Time 2, it does not subsequently block other threads.
This allows Thread 2 to make progress at Time 3, ensuring that the overall system continues to make progress even if one thread is suspended.
In an obstruction-free system, when Thread 1 is suspended at Time 2,
it causes other threads to be blocked as a result. This means that by Time 3,
Thread 2 and Thread 3 are still waiting, preventing the system from making progress thereafter.
Therefore, obstruction-free systems may halt progress if one thread is suspended,
leading to the potential blocking of other threads and even stalling the system.}
\label{fig:progress-type}
The main goal is that the whole system,
which contains all concurrent threads,
is always making forward progress.
To achieve this goal, we rely on concurrency tools,
including atomic operations and operations that behave atomically, as described in \secref{rmw}.
Additionally, we carefully select a synchronization mechanism, as described in \secref{concurrency-tool},
which may involve shared resources for communication (e.g., a spinlock), as described in \secref{shared-resources}.
Furthermore, we design our program with appropriate data structures and algorithms.
Therefore, lock-free doesn't mean we cannot use any lock;
we just need to ensure that the blocking mechanism will not limit the scalability and that the system can avoid the problems described in \secref{concurrency-tool} (e.g., long time of waiting, deadlock).
Next, we take the single-producer, multiple-consumer (\textsc{SPMC}) problem as an example to demonstrate how to achieve fully lock-free programming by improving the implementation step by step.\punckern\footnote{%
The first three solutions, \secref{spmc-solution1}, \secref{spmc-solution2}, and \secref{spmc-solution3}, are drawn from Herb Sutter's
\href{https://youtu.be/c1gO9aB9nbs?si=7qJs-0qZAVqLHr1P}{talk from CppCon~2014}.}
In this problem, one producer generates tasks and adds them to a job queue,
while multiple consumers take tasks from the job queue and execute them.
\subsection{SPMC solution - lock-based}
\label{spmc-solution1}
First, consider a lock-based algorithm.
At any time, only one consumer can hold the lock to access the job queue,
because the lock here is a mutex, that is, a mutual-exclusion lock.
Other consumers attempting to access the job queue are blocked until that consumer releases the lock.
The following text explains the meaning of each state in \fig{fig:spmc-solution1}.

\textbf{state 1}: The producer adds tasks to the job queue while multiple consumers wait for tasks to become available, each ready to take on any job that appears in the queue.

\textbf{state 2} $\to$ \textbf{state 3}: After adding a task to the job queue,
the producer releases the mutex and wakes up the consumers,
which have been trying to acquire the lock on the job queue.

\textbf{state 3} $\to$ \textbf{state 4}: Consumer 1 acquires the mutex for the job queue,
retrieves a task from it, and then releases the mutex.

\textbf{state 5}: Next, the other consumers acquire the mutex for the job queue in turn,
only to find no tasks in it,
because the producer has not added more tasks.

\textbf{state 6}: Consequently, the consumers wait on a condition variable.
During this time, the consumers are not busy-waiting but sleeping until the producer wakes them up;
the condition variable is an advanced mechanism built on the mutex.
\includegraphics[keepaspectratio, width=0.6\linewidth]{images/spmc-solution1}
\captionof{figure}{The interaction between the producer and consumer in SPMC Solution 1,
including their state transitions.}
\label{fig:spmc-solution1}
This implementation is not lock-free for two reasons.
First, if the producer suspends,
the consumers have no jobs available,
so they block and the entire system halts,
making it merely obstruction-free, as shown in \fig{fig:progress-type}.
Second, the consumers concurrently access a shared resource: the job.
If one consumer acquires the lock on the job queue but is suddenly suspended before completing and unlocking,
the other consumers are blocked.
Meanwhile, the producer keeps adding jobs, yet the system fails to make any progress,
which again is merely obstruction-free.
Therefore, neither aspect of this implementation is lock-free.
\subsection{SPMC solution - lock-based and lock-free}
\label{spmc-solution2}
As described in \secref{spmc-solution1}, there is a problem when the producer suspends;
the whole system cannot make any progress.
Additionally, consumers contend for the lock of the job queue to access the job;
however, after they acquire the lock, they may still need to wait when the queue is empty.
To solve this issue, the introduction of lock-based and lock-free algorithm is presented in this section.
The following text explains the meaning of each state in \fig{fig:spmc-solution2}.

\textbf{state 0}: The producer prepares all the jobs in advance.

\textbf{state 1}: Consumer 1 acquires the lock on the job queue, takes a job, and releases the lock.

\textbf{state 2}: When consumer 2 acquires the lock, it is guaranteed to find jobs still in the queue.

With this approach, once a consumer obtains the lock on the job queue,
a job is guaranteed to be available unless all jobs have already been taken by other consumers.
Thus, no consumer waits for lack of jobs;
the only wait is for acquiring the lock on the job queue.
\includegraphics[keepaspectratio, width=0.7\linewidth]{images/spmc-solution2}
\captionof{figure}{The interaction between the producer and consumer in Solution 2,
including their state transitions.}
\label{fig:spmc-solution2}
This implementation is both lock-based and lock-free.
The algorithm is designed so that the producer adds all jobs to the job queue before the consumers begin taking them.
This ensures that if the producer suspends or adds jobs slowly,
the consumers are not blocked by the lack of a job;
they simply conclude that they have finished all the jobs the producer added.
In that respect, this implementation qualifies as lock-free, as shown in \fig{fig:progress-type}.
The part that takes a job remains lock-based, not lock-free,
for the same second reason described in \secref{spmc-solution1}.
\subsection{SPMC solution - fully lock-free}
\label{spmc-solution3}
As described in \secref{shared-resources},
communication between processors across a chip occurs through cache lines,
which incurs high costs. Using locks further decreases overall performance and limits scalability.
When locks are necessary for concurrent threads to communicate,
reducing the shared resources used for communication (e.g., a spinlock or mutex) and their granularity is crucial.
Therefore, to achieve fully lock-free operation, we change the data structure to reduce the granularity of the locks.
\includegraphics[keepaspectratio, width=1\linewidth]{images/spmc-solution3}
\captionof{figure}{The left side shows that the lock protects the entire job queue to ensure exclusive access to its head for multiple threads.
The right side illustrates that each thread has its own slot for accessing jobs,
not only achieving exclusivity through data structure but also eliminating the need for shared resources for communication.}
\label{fig:spmc-solution3}
Providing each consumer with their own unique slot to access jobs addresses the problem at its root,
directly avoiding competition.
By doing so, consumers no longer rely on a shared resource for communication.
Consequently, other consumers will not be blocked by a suspended consumer holding a lock.
This approach ensures that the system maintains its progress,
as each consumer operates independently within their own slot,
which is lock-free, as shown in \fig{fig:progress-type}.
\subsection{SPMC solution - fully lock-free with CAS}
\label{SPMC-solution4}
In addition to reducing granularity,
there is another way to avoid the problem, described in \secref{spmc-solution2},
of a consumer acquiring the lock on the job queue and then being suspended, blocking the other consumers.
As described in \secref{cas}, we can use \textsc{CAS} with a loop so that the write operation achieves semantic atomicity.
\secref{spmc-solution2} uses a shared resource (an advanced form of mutex) for blocking synchronization:
the thread holding the lock causes the other threads to wait until it releases the lock.
With \textsc{CAS}, by contrast, threads that fail to update the shared variable can still continue to execute Read and Modify.
Therefore, if one thread is blocked on a failed \textsc{CAS},
another thread must have made progress,
which is lock-free, as shown in \fig{fig:progress-type}.
As described in \secref{spmc-solution2}, a blocking mechanism based on a mutex allows only one thread to be active when accessing the job queue.
Although \textsc{CAS} lets threads continue to execute Read and Modify,
this does not increase overall progress,
because that work is wasted whenever the atomic \textsc{CAS} fails.
Thus, lock-free algorithms are not inherently faster than blocking ones.
The reason for using lock-free techniques is to ensure that one blocked thread
does not cause other threads to block,
thereby ensuring that the overall system makes progress over a long period of time.
\subsection{Conclusion about lock-free}
In conclusion,
we can see that both blocking and lockless approaches have their place in software development.
They serve different purposes with their own design philosophies.
When performance is a key consideration, it is crucial to profile your application,
take advantage of every concurrency tool or mechanism, and accompany them with appropriate data structures and algorithms.
The performance impact varies with numerous factors, such as thread count and CPU architecture specifics.
Balancing complexity and performance is essential in concurrency,
a domain fraught with challenges.
\subsection{ABA problem}
CAS has been introduced as one of the read-modify-write operations.
However, the target object not changing does not necessarily mean that no other threads modified it halfway through.
If the target object is changed by another thread and then changed back, the comparison still sees the expected value.
In this case, the target object has indeed been modified, yet the operation proceeds as if it had not, compromising its atomicity.
We call this \introduce{ABA problem}.
Consider the following scenario:
\inputminted{c}{./examples/simple_aba_example.c}
The execution result would be:
\begin{ccode}
A: v = 42
B: v = 47
B: v = 42
A: v = 52
\end{ccode}
In the example provided, the ABA problem results in thread A being unaware that variable \cc{v} has been altered.
Since the comparison sees \cc{v} unchanged, \cc{v + 10} is swapped in.
Here, sleeping is used only to force the ABA problem to occur.
In a real-world scenario, instead of sleeping, thread A could be paused by a context switch to other tasks, including preemption by higher-priority tasks.
This example seems harmless, but things can get nasty when atomic \textsc{RMW} operations are used on more complex data structures.
In a broader context, the ABA problem occurs when changes happen between the load and the comparison, but the comparison cannot detect that the state of the target object is not the latest, yielding a false positive.
Returning to the thread pool example in \secref{rmw}, it contains an ABA problem as well.
In the \monobox{worker} function, a thread tries to claim a job:
\begin{ccode}
job_t *job = atomic_load(&thrd_pool->head->prev);
...
while (!atomic_compare_exchange_weak(&thrd_pool->head->prev, &job,
                                     job->prev))
    ;
\end{ccode}
Consider the following scenario:
\begin{enumerate}
\item There is only one job left.
\item Thread A loads the pointer to the job by \cc{atomic_load()}.
\item Thread A is preempted.
\item Thread B claims the job and successfully updates \cc{thrd_pool->head->prev}.
\item Thread B sets thread pool state to idle.
\item Main thread finishes waiting and adds more jobs.
\item The memory allocator reuses the recently freed memory as new job addresses.
\item By coincidence, the first job added has the same address as the one thread A held.
\item Thread A is back in the running state. The comparison succeeds, so it updates \cc{thrd_pool->head->prev} with the old \cc{job->prev}, which is already a dangling pointer.
\item Another thread loads the dangling pointer from \cc{thrd_pool->head->prev}.
\end{enumerate}
Notice that even though \cc{job->prev} is not explicitly loaded before the comparison, the compiler could place the load before it.
In the end, the dangling pointer could either point to garbage or trigger a segmentation fault.
It could be even worse if a nested ABA problem occurred in thread B.
Also, the chance of allocating a job at the same address is higher when using a memory pool, giving the ABA problem more opportunities to occur.
In fact, pre-allocated memory should be used to achieve lock-freedom anyway, since \monobox{malloc} may involve a mutex in multi-threaded environments.
Because comparison alone cannot determine whether the target object has changed, \textsc{CAS} can return true as a false positive.
Thus, the atomicity that \textsc{CAS} is supposed to provide is not guaranteed.
The general concept for solving this problem is to add more information so that different states are distinguishable, and then decide whether to act on the old state or retry with the new state.
If acting on the old state is chosen, then safe memory reclamation should be considered, as the memory may have already been freed by other threads.
More aggressively, one might consider a programming paradigm in which operations on the target object have no side effects that modify it.
In a later section, we will introduce a different way of implementing atomic \textsc{RMW} operations using \textsc{LL/SC} instructions; the exclusive status provided by \textsc{LL/SC} avoids the pitfalls introduced by comparison.
To make different states distinguishable, a common solution is to increment a version number each time the target object is changed.
By bundling the target object and the version number into one comparison, every change yields a distinguishable result.
Given a sufficiently large version number, there should be no repeated versions in practice.
There are multiple methods for storing the version number, depending on the evaluation of the duration before a version number wraps around.
In the thread pool example, the target object is a pointer. The unused bits in a pointer can be utilized to store the version number.
In addition to embedding the version number into a pointer, we could consider utilizing an additional 32-bit or 64-bit value next to the target object for the version number.
It requires the compare-and-swap instruction to be capable of comparing a wider size at once.
Sometimes, this is referred to as \introduce{double-width compare-and-swap}.
On x86-64 processors, atomic instructions that load or store more than a \textsc{CPU} word require additional hardware support;
you can use \monobox{grep cx16 /proc/cpuinfo} to check whether the processor supports 16-byte compare-and-swap.
For hardware that does not support the desired size, software implementations, which may involve locks, are used instead, as mentioned in \secref{atomictype}.
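A hedged sketch of the version-number idea, bundling a pointer and a counter into one double-width \textsc{CAS} (the names here are illustrative, not taken from the example code):
\begin{ccode}
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    void *ptr;     /* the target object */
    uintptr_t ver; /* incremented on every update */
} tagged_ptr;

/* 16 bytes on 64-bit targets; needs -mcx16 on x86-64 and
 * may fall back to a lock on other hardware. */
_Atomic tagged_ptr head;

bool replace_head(tagged_ptr expected, void *desired)
{
    tagged_ptr next = {desired, expected.ver + 1};
    /* Fails if either the pointer or the version moved, so an
     * A->B->A history is caught by the version mismatch. */
    return atomic_compare_exchange_strong(&head, &expected, next);
}
\end{ccode}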
Back to the example: the ABA problem in the following code is fixed by using a version number that increments each time a job is added to the empty queue. On x86-64, add the compiler flag \monobox{-mcx16} to enable 16-byte compare-and-swap in the \monobox{worker} function.
\inputminted{c}{./examples/rmw_example_aba.c}
Notice that in \cc{struct idle_job}, a union is used for type punning, bundling the pointer and the version number for compare-and-swap.
Directly casting a job pointer to a pointer to a 16-byte object would be undefined behavior (the alignments differ), so type punning is used instead.
With this technique, \cc{struct idle_job} can still be accessed normally elsewhere, minimizing code modification.
Compiler optimizations are conservative about type punning, but it is acceptable for atomic operations.
See \secref{fusing}.
Another way to prevent the ABA problem in the example is to use a safe memory reclamation mechanism.
In contrast to acting on the old state, as mentioned previously, the address of a job is not freed until no thread is using it.
This prevents the memory allocator or memory pool from reusing the address and causing problems.
\section{Sequential consistency on weakly-ordered hardware}
Different hardware architectures provide different ordering guarantees, or \introduce{memory models}.
For instance, x64 architecture\punckern\footnote{%
Also known as x86-64, x64 is a 64-bit extension of the x86 instruction set, officially unveiled in 1999.
This extension heralded the introduction of two novel operation modes:
64-bit mode for leveraging the full potential of 64-bit processing and compatibility mode for maintaining support for 32-bit applications.
Initially developed by AMD and publicly released in 2000, the x64 architecture has since been adopted by Intel and VIA,
signaling a unified industry shift towards 64-bit computing.
This wide adoption marked the effective obsolescence of the Intel Itanium architecture (IA-64),
despite its initial design to supersede the x86 architecture.
} is known to be \introduce{strongly-ordered},
generally ensuring a global sequence for loads and stores in most scenarios.
Conversely, architectures like \textsc{Arm} are considered \introduce{weakly-ordered},
meaning one should not expect loads and stores to follow the program sequence without explicit instructions to the \textsc{CPU}.
These instructions, known as \introduce{memory barriers}, are essential to prevent the reordering of these operations.
It is helpful to see how atomic operations work in a weakly-ordered system,
both to understand what's happening in hardware,
and to see why the \clang{} and \cplusplus{} concurrency models were designed as they were.\punckern\footnote{%
It is worth noting that the concepts we discuss here are not specific to \clang{} and \cplusplus{}.
Other systems programming languages like D and Rust have converged on similar models.}
Let's examine \textsc{Arm}, since it is both popular and straightforward.
Consider the simplest atomic operations: loads and stores.
Given some \cc|atomic_int foo|,
\newline