While Gunnar Morling originally posed The One Billion Row Challenge for Java developers, over time folks have started implementing it in different languages which are showcased here.
While reviewing the results in Java I ran across one implementation in C that made me wonder what I could do with Crystal.
See discussion in Crystal forum here. This led me to create TODO list of things to try.
See post in Gunnar's "Show & Tell" discussion area here
Updated 2024 Feb 11
On a PC with AMD Ryzen 7 7735HS CPU, 16 cores, 32 GB, and running Linux, comparing with some other 1brc
contenders there's still room to do. See the TODO list for changes so far and possible further improvements.
See this CSV for the detailed results.
relative | command | Lang | mean (s) | stddev |
---|---|---|---|---|
1.0x | serkan-ozal | Java | 1.1661091954200002 | 0.10359078275160513 |
dannyvankooten/analyze | C | 1.24137285062 | 0.003261490844545827 | |
merykittyunsafe.sh | Java | 1.34270993842 | 0.036888174169103186 | |
merykitty.sh | Java | 1.51770504842 | 0.021277280568716347 | |
1.74x slower | 1brc_parallel_ptr4 16 48 | Crystal | 2.03081785402 | 0.06009663760371807 |
1.77x slower | 1brc_parallel_mmap2b 16 32 | Crystal | 2.06915509142 | 0.01875921473395469 |
It's worth noting that there is no significant different between my two best results; given the way the file is read in parallel and in chunks, using mmap
or not doesn't make a big difference.
None. Plain ol' Crystal.
shards build --release -Dpreview_mt
Use if you have
brew
on macOS or Linux.
First install crops
and then
ops up
ops cbr -Dpreview_mt
Status | Implementation | Description | Performance |
---|---|---|---|
Fastest | 1brc_parallel_ptr4 |
Variant of ptr3b using long word name parsing |
slightly faster |
1brc_parallel_ptr3b |
Variant of ptr3 using optimized hash map |
much faster | |
1brc_parallel_ptr3 |
Variant of ptr2 which only launched N fibres and loops over D / N chunks. |
faster | |
1brc_parallel_ptr2b |
Variant of ptr2 using optimized hash map |
much faster | |
1brc_parallel_ptr2 |
Variant using page-aligned part size in anticipation of mmap -based implementation. |
faster again | |
1brc_parallel_ptr |
Replaces Slice with Pointer to the buffer, to remove bounds checking when parsing. |
faster | |
Fastest | 1brc_parallel_mmap3 |
Variant of mmap2b using long word name parsing |
slightly faster |
1brc_parallel_mmap2b |
Variant of mmap2 using optimized hash map |
much faster | |
1brc_parallel_mmap2 |
Variant of mmap which only launched N fibres and loops over D / N chunks. |
slightly faster but not always | |
1brc_parallel_mmap |
Using mmap to load the file |
faster | |
1brc_parallel2 |
Variant using page-aligned part size in anticipation of mmap -based implementation. |
faster | |
1brc_parallel |
Parallel multi-threaded implementation that chunks up the file and spawns fibres to process them | much faster | |
1brc_serial2 |
Serial implementation optimized to use byte slices (Bytes ) |
a little faster, but still slow. | |
Slowest | 1brc_serial1 |
A very simple serial implementation using String lines |
slowest. |
While you are welcome to run the serial implementatios, my focus from now on will on the parallel implementations.
The parallel implementations,
- given the buffer division is D (via
BUF_DIV_DENOM
), and - given the number of threads is N (via
CRYSTAL_WORKERS
), and - given N < q where q is the number of chunks based on
file_size / (Int32::MAX / D)
works as follows:
- spawns q fibres, and
- allocates N buffers, and
- processes N chunks concurrently.
Note that since D is the denominator, we have the following number of chunks (and their sizes) based on D.
D | q chunks | chunk size |
---|---|---|
4 | 26 | 512 MB |
5 | 33 | 410 MB |
6 | 39 | 341 MB |
7 | 46 | 293 MB |
8 | 52 | 256 MB |
12 | 78 | 171 MB |
16 | 104 | 128 MB |
24 | 156 | 85 MB |
32 | 208 | 64 MB |
48 | 312 | 43 MB |
A script
run.sh
is provided to conveniently run one of the implementations and specify the concurrency values.
Make sure you have measurements.txt
in the current folder, and then execute ./run.sh 1brc_parallel 32 24
to run the implementation with the specific threads (32) and buffer division (24).
If your machine is different from mine (see results below), send me your results.
See the various file in /perfdata
for results and analyses over time.
Bug reports and sugestions are welcome.
This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the code of conduct.
This project is available as open source under the terms of the CC-BY-4.0 license.
- nogginly - creator and maintainer