GlobalAtomicObject - Scalable Atomic Operations using Descriptor Tables

# Global Atomic Object

(PR available: #6717 )...


## Details

Below, I go over some details that are present in the current implementation

### Pointer Compression

This is the common case, but certainly not a novel concept in and of itself [nor to
Chapel itself](https://github.com/chapel-lang/chapel/blob/master/runtime/include/chpl-wide-ptr-fns.h), however it is sufficient for most cases where the number of locales
fit within 16 bits. It takes advantage of the fact that all known architectures and
operating systems only use 48 bits of the virtual address space, meaning we are free
to embed the 16 bits of the locale id in the upper 16 bits. This means for 65K locales,
we get the benefit of using atomics for a single 64-bit atomic operation, making
atomics extremely fast.

### Descriptor Tables

This could be a new design pattern really, but the idea, while novel in its application,
is simple in theory. We maintain a table of objects, one for each locale, of which
we use the index and locale id as a descriptor (a pseudo-pointer). When we want to
atomically operate using a class instance, we first hash the class instance (it's 64-bit address)
and perform an `insertIfNotPresent` operation on the target locale, and then embed
the locale id in the upper 32 bits, and the index into the descriptor table into the
lower 32-bits. Compression is the most expensive part, as it involves doing the above
every time (although see improvements for a huge improvement), but decompression
is cheap in that it can easily directly index into the descriptor table.

### Performance

I compare GlobalAtomicObject (using descriptor tables) performance against 'Sync Var',
the all-mighty `sync` variable, which currently gets a handicap by just acquiring the lock,
perform some arbitrary cheap operation, then immediately release it.
This is the 'baseline' which we must beat or else the implementation is useless.
The next is `Atomic` on a 64-bit value, which is the 'ceiling', and can be seen as how
much performance we can get for normal pointer compression versus keeping tables like this.
It has two benchmarks: 'single' which is a single atomic operation, and 'LLSC' which is
a pseudo Load-Linked Store-Conditional operation. LLSC looks like such...

```chpl
var x = atomicVar.read();
atomicVar.compareExchangeStrong(x, f(x));
```

There is no retrying, just a one-and-done atomic operation (which I count as one).
In 'Global Atomic', we use similar atomic operations as 'Atomic' does, although
a few things need to be considered. While for 'Single', it performs a `write`
(because a `read` is not interesting since it is a basic lookup in a locale's table),
for data in *local* memory, which shows the ideal. Now 'LLSC' can perform operations for
data that is *remote*, meaning that each time, its possible that during compression
of the data, we may need to jump to other locales. It shows a more average case, but
not the worse case when data to `compareExchangeStrong` is both remote. Currently,
'LLSC' for `GlobalAtomicObject` demonstrates 3 atomic operations: the `read` which
is a simple lookup, and compression of both variables passed to `compareExchangeStrong`
which one is local, the other remote.

It should be noted again that with pointer compression and a smaller number of nodes,
performance is excellent. However, when one has more than the maximum allowed, it is
still manageable and *better* than a `sync` variable is. It should also be emphasized that
**the runtime of `GlobalAtomicObject` with pointer compression is equivalent to `Atomic`**.


I present two graphs: one with respect to runtime, and another in terms of raw operations
per second.

#### Time

![image](https://user-images.githubusercontent.com/4269990/27997537-f84f8e78-64c7-11e7-93b2-7a98f4c6fcdd.png)

This shows the runtime of the benchmark. As can be seen, raw atomic operations grossly
beats anything else, although in terms of atomic LL/SC, it does lag enough to demonstrate
the power of global atomics. A single global atomic, while is magnitudes slower than a
single 64-bit atomic, is only 2x slower than the LLSC.

Imagine if we had attempted to implement some MCAS operation to update both locations using
descriptors. At best we would succeed within 4 LL/SC operations without obstruction
(2 to set descriptors, 2 to update), of which a single Global Atomic variable would beat.
Of course, in reality, the operation would require multiple retries, in the hundreds to
thousands with real workloads that placed contention on it. This shows that, in terms of
global atomics, this is the *best* option currently, as a global atomic is guaranteed to succeed in one operation.


#### Operations Per Second

![image](https://user-images.githubusercontent.com/4269990/27997538-005aee82-64c8-11e7-9bb9-2079f72442a7.png)

This shows the raw operations per second. Notice that I removed 64-bit atomics entirely as they
shadowed over everything else to the point that the performance boost wouldn't be noticeable.
As can be seen however, both a single global atomic operation, and even the LL/SC combo
operation with a potential *remote* reference (meaning an added `on` statement) beats a normal
`sync` variable. In every case, `GlobalAtomicObject` wins outright, but performance penalties from
remote references are very apparent and should be avoided at all costs.

## Usage

Sample usage of `GlobalAtomicObject` can be seen below (and yes, it does work).

```chpl
class C { var c : int; }
var a = new C(1);
var b = new C(2);
var x = new GlobalAtomicObject(C); // atomic C
var result : bool;

x.write(a);
result = x.compareExchange(a, b);
assert(result);
assert(x.read() == b);

// Note that you can call 'delete' on the object while having it still be present
// in the descriptor table. This may be okay for most cases where memory is not
// an issue since reused addresses will hash to the same node and that node will
// already point to the valid object the user is attempting to insert. However if
// memory is an issue, one can call '_delete' to ensure that the node itself is removed
// and can be recycled for later use.
delete a;
delete b;

// Is Safe because we only use the pointer itself. However, in cases where 'a' and 'b'
// can be reused by the same GlobalAtomicObject, then this could cause undefined behavior
// where another concurrent task adds the same pointer. In those cases, you should call
// '_delete' before 'delete'.
x._delete(a);
x._delete(b);

// As well, when GlobalAtomicObject goes out of scope, all nodes it had in use also
// get deleted...
```

`GlobalAtomicObject(C)` can easily become `atomic C`. If the language were to be
modified, it wouldn't be too surprising to see the above become the below (which
of course lacks the excessive comments), which is an exemplary ideal that @mppf
insisted on:

``` chapel
class C { ... }
proc test() {
   var a = new C(1);
   var b = new C(2);
   var x: atomic C;
   var result:bool;
   x.write(a);
   result = x.compareExchange(a, b);
   assert(result);
   assert(x.read() == b);
   delete a;
   delete b;
}
```

Where `delete` can call some hook which also removes it from the table if it has not yet been
reclaimed. I believe this can become a reality with little effort.

## Improvements

Below, I go over some potential improvements that can be made by others who are up to it.

### Pointer Compression + Descriptor Tables

It may be possible to combine normal pointer compression for the lower locales that fit
within 16 bits. Combining both means we allow better performance on average, even for
cases when they have more than 16 bits worth of locales, any locales with an id
less than 2^16 get the power of atomics, while the ones greater than or equal
to 2^16 get to use the much-slower-but-faster-than-sync-variables descriptors.

See next idea for how to differentiate.

### Zone Compression

A theoretical model where we use a more dynamic type of compression which attempts
to uses the lowest 32 bits for the virtual address, and divide the upper 32 bits
into zone bits and locale bits. A 'zone' is a base 64-bit offset identified by
zone id, which is allocated when a new unique id is found. With optimizations
such as caching the zone mappings for other locales (I.E: First time a unique upper 32
bits are found, check if we have it cached, if not found jump to the owning locale
and add it if not already, and update what we have cached, etc.)

The benefit here is that we get to keep using 64-bit pointer compression, and the amount
of unique zones that can be kept track of depend on the number of bits needed to keep
track of the locale id.

"What happens when we run out of zones?" - We use the descriptor
tables again. We can just search the unique zones for the target locale first, if the
zone isn't found and we can't create a new one, we just use descriptors.

"How do we differentiate between Zone Compression, Pointer Compression, and Descriptor
Tables?" - Its possible to use the lowest 2 bits. This means lowering the potential
maximum segments for descriptor tables, but it allows a much more dynamic compression
strategy.

### Caching Descriptors into Objects

This is a change that isn't theoretical but requires changes with the language itself.
Majority of the overhead of descriptor tables is hashing the object, which requires jumping
to the owning locales, but if the descriptors were cached then repeated accesses would
be just about as fast as atomics. If it were cached, the benefits of recycling memory
to be used in atomics would be tenfold (or hundredfold, as the benchmarks show).

### Concurrent-Safe Implementation

The table utilizes a simplistic skip list implementation on a segmented memory pool.
The memory pool allows indexing by allowing an easy translation from a 32 bit
index into a segment and segment data index. It utilizes a simple bitmap (`BigInteger`).
The implementation of the memory pool allows wait-free reads (hence indexing is fine)
but requires synchronization to recycle/reuse memory. The skip list used to manage
memory also requires synchronization, and a lock-free/wait-free implementation requires
proper memory management such as Hazard Pointers which requires thread-local or task-local
storage. Hence, synchronization is needed and we use a simple local `sync` variable.

If someone managed to make everything concurrent, it could improve performance further
by 5 - 10x at least.


**Edit:**

Removed Introduction to avoid further derailment of discussion.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

GlobalAtomicObject - Scalable Atomic Operations using Descriptor Tables #6663

Global Atomic Object

Details

Pointer Compression

Descriptor Tables

Performance

Time

Operations Per Second

Usage

Improvements

Pointer Compression + Descriptor Tables

Zone Compression

Caching Descriptors into Objects

Concurrent-Safe Implementation

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

GlobalAtomicObject - Scalable Atomic Operations using Descriptor Tables #6663

Description

Global Atomic Object

Details

Pointer Compression

Descriptor Tables

Performance

Time

Operations Per Second

Usage

Improvements

Pointer Compression + Descriptor Tables

Zone Compression

Caching Descriptors into Objects

Concurrent-Safe Implementation

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions