
Commit 5ecbf9d

Merge pull request #56 from synopse/main
abouchez / mormot: updated README
2 parents 37b40a1 + efb82c2 commit 5ecbf9d

File tree: 2 files changed (+69, -33 lines)


entries/abouchez/README.md

Lines changed: 67 additions & 32 deletions
@@ -20,31 +20,35 @@ I am very happy to share decades of server-side performance coding techniques us
 
 Here are the main ideas behind this implementation proposal:
 
-- **mORMot** makes cross-platform and cross-compiler support simple (e.g. `TMemMap`, `TDynArray.Sort`,`TTextWriter`, `SetThreadCpuAffinity`, `crc32c`, `ConsoleWrite` or command-line parsing);
-- Will memmap the entire 16GB file at once into memory (so won't work on 32-bit OS, but reduce syscalls);
-- Process file in parallel using several threads (configurable, with `-t=16` by default);
-- Fed each thread from 64MB chunks of input (because thread scheduling is unfair, it is inefficient to pre-divide the size of the whole input file into the number of threads);
-- Each thread manages its own data, so there is no lock until the thread is finished and data is consolidated;
-- Each station information (name and values) is packed into a record of exactly 16 bytes, with no external pointer/string, to match the CPU L1 cache size (64 bytes) for efficiency;
-- Use a dedicated hash table for the name lookup, with crc32c perfect hash function - no name comparison nor storage is needed;
-- Store values as 16-bit or 32-bit integers (i.e. temperature multiplied by 10);
+- **mORMot** makes cross-platform and cross-compiler support simple - e.g. `TMemMap`, `TDynArray.Sort`, `TTextWriter`, `SetThreadCpuAffinity`, `crc32c`, `ConsoleWrite` or command-line parsing;
+- The entire 16GB file is `memmap`ed at once into memory - it won't work on a 32-bit OS, but it avoids any `read` syscall or memory copy;
+- Process the file in parallel using several threads - configurable via the `-t=` switch, the default being the total number of CPUs reported by the OS;
+- Input is fed into each thread as 64MB chunks (see the dispatcher sketch below): because thread scheduling is unbalanced, it is inefficient to pre-divide the whole input file into one slice per thread;
+- Each thread manages its own `Station[]` data, so there is no lock until the thread is finished and data is consolidated;
+- Each `Station[]` entry is packed into a record of exactly 16 bytes, with no external pointer/string, to leverage the CPU L1 cache line size (64 bytes) for efficiency (see the record sketch below);
+- Maintain a `StationHash[]` hash table for the name lookup, with the crc32c perfect hash function - no name comparison nor name storage is needed with a perfect hash (see below);
+- On Intel/AMD/AARCH64 CPUs, *mORMot* uses hardware SSE4.2 opcodes for this crc32c computation;
+- Store values as 16-bit or 32-bit integers, i.e. the temperature multiplied by 10;
 - Parse temperatures with a dedicated code (expects single decimal input values - see the parsing sketch below);
+- The station names are stored as UTF-8 pointers to the memmap location where they first appear, in `StationName[]`, to be emitted eventually for the final output, not during temperature parsing;
 - No memory allocation (e.g. no transient `string` or `TBytes`) nor any syscall is done during the parsing process to reduce contention and ensure the process is only CPU-bound and RAM-bound (we checked this with `strace` on Linux);
 - Pascal code was tuned to generate the best possible asm output on FPC x86_64 (which is our target);
 - Can optionally output timing statistics and resultset hash value on the console to debug and refine settings (with the `-v` command line switch);
 - Can optionally set each thread affinity to a single core (with the `-a` command line switch).
 
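As an illustration of the 64MB chunk feeding described in the list above, here is a minimal Free Pascal sketch of such a dispatcher. The identifiers, the critical-section protection and the boundary rule are assumptions for this example only, not the actual brcmormot.lpr code:

```
program chunksketch;

{$mode objfpc}

const
  CHUNKSIZE = 64 shl 20;   // 64MB per request, as described in the bullet above

var
  cs: TRTLCriticalSection; // protects the shared cursor below
  nextStart: Int64 = 0;    // next raw offset to hand out

// Hypothetical dispatcher: each worker thread asks for the next raw 64MB
// slice, and both ends are then moved to the first byte after a line feed,
// so consecutive chunks never overlap and never split a line.
function GetNextChunk(buf: PAnsiChar; bufSize: Int64;
  out start, stop: Int64): boolean;
var
  nominal: Int64;
begin
  EnterCriticalSection(cs);
  nominal := nextStart;
  inc(nextStart, CHUNKSIZE);
  LeaveCriticalSection(cs);
  result := nominal < bufSize;
  if not result then
    exit; // no input left for this worker
  start := nominal;
  if start <> 0 then
    while (start < bufSize) and (buf[start - 1] <> #10) do
      inc(start); // skip the partial line owned by the previous chunk
  stop := nominal + CHUNKSIZE;
  if stop >= bufSize then
    stop := bufSize
  else
    while (stop < bufSize) and (buf[stop - 1] <> #10) do
      inc(stop); // finish the line crossing the 64MB boundary
end;

begin
  InitCriticalSection(cs);
  // worker threads would loop over GetNextChunk() here
  DoneCriticalSection(cs);
end.
```

Because both chunk ends follow the same "first byte after a line feed" rule, a slow thread simply asks for fewer chunks - which is the point of not pre-dividing the whole file into one slice per thread.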

+If you are not convinced by the "perfect hash" trick, you can define the `NOPERFECTHASH` conditional, which forces full name comparison, but is noticeably slower. Our algorithm is safe with the official dataset, and gives the expected final result - which was the goal of this challenge: compute the right data reduction in as little time as possible, with all possible hacks and tricks. A "perfect hash" is a well known hacking pattern, used when the dataset is validated in advance. And since our CPUs offer `crc32c`, which is perfect for our dataset... let's use it! https://en.wikipedia.org/wiki/Perfect_hash_function ;)
+
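To make the 16-byte record and the hash-only lookup more concrete, here is a minimal Free Pascal sketch. The type, field and table names are illustrative assumptions - the actual entry keeps a separate `StationHash[]` index and a `StationName[]` pointer array, as described above - but the 16-byte layout and the crc32c-only matching are the point:

```
program stationsketch;

{$mode objfpc}

type
  // Hypothetical 16-byte per-station accumulator: four of them fit in one
  // 64-byte cache line, and the station name itself is never stored here.
  TStationSketch = packed record
    NameHash: cardinal;  // crc32c of the station name (a perfect hash here)
    Count: cardinal;     // number of measurements seen by this thread
    Sum: integer;        // running sum of temperatures, stored as value * 10
    Min, Max: smallint;  // extremes as 16-bit integers, also value * 10
  end;

const
  HASHSIZE = 1 shl 17;   // power-of-two table, larger than the station count

var
  Station: array[0 .. HASHSIZE - 1] of TStationSketch;

// Open-addressing lookup: since crc32c is a perfect hash for the dataset,
// matching the 32-bit hash is enough - no name comparison is needed.
function FindStation(hash: cardinal): PtrInt;
begin
  result := hash and (HASHSIZE - 1);
  while (Station[result].NameHash <> 0) and
        (Station[result].NameHash <> hash) do
    result := (result + 1) and (HASHSIZE - 1);
  Station[result].NameHash := hash; // claim the slot on first use
end;

begin
  writeln('record size = ', SizeOf(TStationSketch), ' bytes'); // expects 16
  writeln('slot for some hash: ', FindStation($12345678));
end.
```

Four such records fit in a single 64-byte cache line, which is what the next section is about.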
 ## Why L1 Cache Matters
 
-The "64 bytes cache line" trick is quite unique among all implementations of the "1brc" I have seen in any language - and it does make a noticeable difference in performance.
+Taking great care of the "64 bytes cache line" is quite unique among all implementations of the "1brc" I have seen in any language - and it does make a noticeable difference in performance.
 
 The L1 cache is well known in the performance hacking literature to be the main bottleneck for any efficient in-memory process. If you want things to go fast, you should flatter your CPU L1 cache.
 
 Min/max values will be reduced as 16-bit smallint - resulting in a temperature range of -3276.8..+3276.7 which seems fair on our planet according to the IPCC. ;)
 
-In our first attempt, we stored the name into the `Station[]` array, so that each entry is 64 bytes long exactly. But since `crc32c` is a perfect hash function for our dataset, we could just store the 32-bit hash instead, for higher performance. On Intel/AMD/AARCH64 CPUs, we use hardware opcodes for this crc32c computation.
+In our first attempt (see "Notes about the Old Version" below), we stored the name into the `Station[]` array, so that each entry is exactly 64 bytes long. But since `crc32c` is a perfect hash function for our dataset, it is enough to just store the 32-bit hash instead, and not the actual name.
 
-See https://en.wikipedia.org/wiki/Perfect_hash_function for reference.
+Note that if we reduce the number of stations from 41343 to 400, the performance is much higher, even with the same 16GB file as input. The reason is that since 400 x 16 = 6400 bytes, each dataset can fit entirely in each core's L1 cache. No slower L2/L3 cache is involved, therefore performance is better. The cache memory seems to be the bottleneck of our code - which is a good sign.
 
 ## Usage
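As a complement to the "Parse temperatures with a dedicated code" bullet above, here is a minimal sketch of such a parser, assuming the challenge's fixed format (optional sign, one or two integer digits, a dot, exactly one decimal digit) and returning the value multiplied by 10. It is an illustration, not the actual brcmormot.lpr routine:

```
program parsesketch;

{$mode objfpc}

// Parse a temperature like '-12.3' into an integer value * 10 (here -123),
// assuming the challenge format: optional '-', 1..2 digits, '.', 1 digit.
// No floating point and no RTL conversion routine is involved.
// The name and signature are illustrative, not the actual brcmormot.lpr code.
function ParseTemp10(p: PAnsiChar): integer;
var
  neg: boolean;
begin
  neg := p^ = '-';
  if neg then
    inc(p);
  result := ord(p^) - ord('0');                  // first integer digit
  inc(p);
  if p^ <> '.' then
  begin
    result := result * 10 + ord(p^) - ord('0');  // optional second digit
    inc(p);
  end;
  inc(p);                                        // skip the '.'
  result := result * 10 + ord(p^) - ord('0');    // single decimal digit
  if neg then
    result := -result;
end;

begin
  writeln(ParseTemp10('-12.3'));  // -123
  writeln(ParseTemp10('5.7'));    // 57
end.
```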

@@ -71,7 +75,7 @@ We will use these command-line switches for local (dev PC), and benchmark (chall
 
 ## Local Analysis
 
-On my PC, it takes less than 5 seconds to process the 16GB file with 8/10 threads.
+On my PC, it takes less than 3 seconds to process the 16GB file with 8/10 threads.
 
 Let's compare `abouchez` with a solid multi-threaded entry using file buffer reads and no memory map (like `sbalazs`), using the `time` command on Linux:
 
@@ -88,9 +92,9 @@ real 0m25,330s
 user 6m44,853s
 sys 0m31,167s
 ```
-We used 20 threads for both executable, because it was giving the best results for each program on our PC.
+We defined 20 threads for both executables, because our PC CPU has 20 threads in total, and using them all seems to achieve the best results.
 
-Apart from the obvious global "wall" time reduction (`real` numbers), the raw parsing and data gathering in the threads match the number of threads and the running time (`user` numbers), and no syscall is involved by `abouchez` thanks to the memory mapping of the whole file (`sys` numbers, which contain only memory page faults).
+Apart from the obvious global "wall" time reduction (`real` numbers), the raw parsing and data gathering in the threads match the number of threads and the running time (`user` numbers), and no syscall is involved by `abouchez` thanks to the memory mapping of the whole file (the `sys` numbers, which contain only memory page faults, are much lower).
 
 The `memmap()` feature makes the initial/cold `abouchez` call slower, because it needs to cache all measurements data from file into RAM (I have 32GB of RAM, so the whole data file will remain in memory, as on the benchmark hardware):
 ```
@@ -120,7 +124,7 @@ The `-v` verbose mode makes such testing easy. The `hash` value can quickly chec
 
 ## Benchmark Integration
 
-Every system is quite unique, especially about its CPU multi-thread abilities. For instance, my Intel Core i5 has both P-cores and E-cores so its threading model is pretty unfair. The Zen architecture should be more balanced.
+Every system is quite unique, especially regarding its CPU multi-threading abilities. For instance, my Intel Core i5 has both P-cores and E-cores, so its threading model is pretty unbalanced. The Zen architecture should be more balanced.
 
 So we first need to find out which options best leverage the hardware the benchmark runs on.
 
@@ -147,9 +151,52 @@ This `-t=1` run is for fun: it will run the process in a single thread. It will
 
 Our proposal has been run on the benchmark hardware, using the full automation.
 
-TO BE COMPLETED - NUMBERS BELOW ARE FOR THE OLD VERSION:
+Here are some numbers, with 16 threads:
+```
+-- SSD --
+Benchmark 1: abouchez
+Time (mean ± σ): 2.095 s ± 0.044 s [User: 21.486 s, System: 1.752 s]
+Range (min … max): 2.017 s … 2.135 s 10 runs
+```
+
+With 24 threads:
+```
+-- SSD --
+Benchmark 1: abouchez
+Time (mean ± σ): 1.944 s ± 0.014 s [User: 28.686 s, System: 1.909 s]
+Range (min … max): 1.924 s … 1.974 s 10 runs
+```
 
-With 30 threads (on a busy system):
+With 32 threads:
+```
+-- SSD --
+Benchmark 1: abouchez
+Time (mean ± σ): 1.768 s ± 0.012 s [User: 33.286 s, System: 2.067 s]
+Range (min … max): 1.743 s … 1.782 s 10 runs
+```
+
+If we try with 32 threads and thread affinity (`-a` option):
+```
+Time (mean ± σ): 1.771 s ± 0.010 s [User: 33.415 s, System: 2.056 s]
+Range (min … max): 1.760 s … 1.786 s 10 runs
+```
+
+So it sounds like we could just run the benchmark with the `-t=32` option and achieve the best performance. Thread affinity is no silver bullet here, so we had better stay away from it, and let the OS decide about thread scheduling.
+
+The Ryzen CPU has 16 cores with 32 threads, and it makes sense that each thread only has to manage a small amount of data per item (a 16-byte `Station[]` item), so we can leverage all cores and threads.
+
+
+## Notes about the "Old" Version
+
+In the same `src` sub-folder, you will find our first attempt at this challenge, as `brcmormotold.lpr`. Compared to the "final/new" version, it did store the name as a "shortstring" within its `Station[]` record, to fill exactly the 64-byte cache line size.
+
+It was already very fast, but since `crc32c` is a perfect hash function, we finally decided to just store the 32-bit hash, and not the name itself.
+
+You could disable our tuned asm in the project source code, and lose about 10% by using the general purpose *mORMot* `crc32c()` and `CompareMem()` functions, which already run SSE2/SSE4.2 tuned assembly. No custom asm is needed in the "new" version: we directly use the *mORMot* functions.
+
+There is a "*pure mORMot*" name lookup version available if you undefine the `CUSTOMHASH` conditional, which is around 40% slower, because it needs to copy the name onto the stack before using `TDynArrayHashed`, and has a little more overhead.
+
+For reference, here are the numbers of this "old" version, with 30 threads (on a busy benchmark system):
 ```
 -- SSD --
 Benchmark 1: abouchez
@@ -162,7 +209,7 @@ Benchmark 1: abouchez
 Range (min … max): 3.497 s … 3.789 s 10 runs
 ```
 
-Later on, only the SSD values are shown, because the HDD version triggered the systemd watchdog, which killed the shell and its benchmark executable. But we can see that once the data is loaded from disk into the RAM cache, there is no difference with a `memmap` file on SSD and HDD. Linux is a great Operating System for sure.
+In fact, only the SSD values matter. We can see that once the data is loaded from disk into the RAM cache, there is no difference between a `memmap` file on SSD and on HDD. Linux is a great Operating System for sure.
 
 With 24 threads:
 ```
@@ -187,20 +234,8 @@ Benchmark 1: abouchez
 Time (mean ± σ): 3.227 s ± 0.017 s [User: 39.731 s, System: 1.875 s]
 Range (min … max): 3.206 s … 3.253 s 10 runs
 ```
+It is a known fact from experiment that forcing thread affinity is not a good idea, and it is always much better to let any modern Operating System do the thread scheduling to the CPU cores, because it has a much better knowledge of the actual system load and status - even on a "fair" CPU architecture like AMD Zen. For a "pure CPU" process, affinity may help a little. But for our "old" process, working outside of the L1 cache limits, we had better let the OS decide.
 
-So it sounds like if we should just run the benchmark with the `-t=16` option.
-
-It may be as expected:
-
-- The Ryzen CPU has 16 cores with 32 threads, and it makes sense that using only the "real" cores with CPU+RAM intensive work is enough to saturate them;
-- It is a known fact from experiment that forcing thread affinity is not a good idea, and it is always much better to let any modern Linux Operating System schedule the threads to the CPU cores, because it has a much better knowledge of the actual system load and status. Even on a "fair" CPU architecture like AMD Zen.
-
-## Old Version
-
-TO BE DETAILED (WITH NUMBERS?)
-
-You could disable our tuned asm in the project source code, and loose about 10% by using general purpose *mORMot* `crc32c()` and `CompareMem()` functions, which already runs SSE2/SSE4.2 tune assembly.
-
-There is a "*pure mORMot*" name lookup version available if you undefine the `CUSTOMHASH` conditional, which is around 40% slower, because it needs to copy the name into the stack before using `TDynArrayHashed`, and has a little more overhead.
+So with this "old" version, it was decided to use `-t=16`. The "old" version uses a whole cache line (64 bytes) for its `Station[]` record, so it is probably the one consuming too much CPU cache, which is why more than 16 threads does not make a difference with it. Whereas our "new" version, with its `Station[]` of only 16 bytes, can use `-t=32` with benefits. The cache memory access is likely to be the bottleneck from now on.
 
 Arnaud :D

entries/abouchez/src/brcmormot.lpr

Lines changed: 2 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -420,7 +420,8 @@ function TBrcMain.SortedText: RawUtf8;
 affinity := Executable.Command.Option(
   ['a', 'affinity'], 'force thread affinity to a single CPU core');
 Executable.Command.Get(
-  ['t', 'threads'], threads, '#number of threads to run', 16);
+  ['t', 'threads'], threads, '#number of threads to run',
+  SystemInfo.dwNumberOfProcessors);
 help := Executable.Command.Option(['h', 'help'], 'display this help');
 if Executable.Command.ConsoleWriteUnknown then
   exit
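With this change, the `-t=` default now follows the CPU count reported by the OS, while the switch still overrides it. A hypothetical invocation (the exact argument syntax is documented in the README's Usage section, which this diff does not show):

```
./abouchez measurements.txt -t=32 -v
```

Here `measurements.txt` stands for the input file, and `-v` prints the timing statistics and resultset hash mentioned in the README.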
