1BRC: One Billion Row Challenge in Python

Python implementation of Gunnar Morling's One Billion Row Challenge.

Creating the measurements file with 1B rows

First install the Python requirements:

python3 -m pip install -r requirements.txt

The script createMeasurements.py will create the measurement file:

usage: createMeasurements.py [-h] [-o OUTPUT] [-r RECORDS]

Create measurement file

optional arguments:
  -h, --help            show this help message and exit
  -o OUTPUT, --output OUTPUT
                        Measurement file name (default is "measurements.txt")
  -r RECORDS, --records RECORDS
                        Number of records to create (default is 1_000_000_000)
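
The help text above has the shape that Python's argparse module prints. A minimal sketch of a parser exposing the same two options is given here; the function name build_parser is hypothetical, and the real createMeasurements.py may be organised differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical helper mirroring the options in the usage text above."""
    parser = argparse.ArgumentParser(
        prog="createMeasurements.py",
        description="Create measurement file",
    )
    parser.add_argument(
        "-o", "--output",
        default="measurements.txt",
        help='Measurement file name (default is "measurements.txt")',
    )
    parser.add_argument(
        "-r", "--records",
        type=int,  # converts the command-line string to an integer
        default=1_000_000_000,
        help="Number of records to create (default is 1_000_000_000)",
    )
    return parser
```

Calling build_parser().parse_args(["-r", "1000"]) would yield output "measurements.txt" and records 1000, which is handy for creating a small file to test against.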

Example:

% python3 createMeasurements.py
Creating measurement file 'measurements.txt' with 1,000,000,000 measurements...
100%|█████████████████████████████████████████| 100/100 [01:15<00:00,  1.32it/s]
Created file 'measurements.txt' with 1,000,000,000 measurements in 75.86 seconds

Be patient: generating the file can take more than a minute.

Speeding up the generation of the measurements file could be another challenge 🙂
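
As a starting point for that challenge, here is a minimal sketch of batched file generation. It assumes the `station;temperature` line format of the original challenge. The station names, the temperature range, the batch size and the function name write_measurements are illustrative assumptions, not taken from createMeasurements.py.

```python
import random

# Hypothetical station names, for illustration only.
STATIONS = ["Hamburg", "Lisbon", "Nairobi", "Oslo", "Tokyo"]


def write_measurements(path, records, batch_size=10_000, seed=None):
    """Write `records` lines of 'station;temperature' to `path`.

    Lines are built in batches and written with one call per batch,
    which avoids issuing a separate write() for every record.
    """
    rng = random.Random(seed)
    with open(path, "w", encoding="utf-8") as f:
        remaining = records
        while remaining > 0:
            n = min(batch_size, remaining)
            batch = [
                f"{rng.choice(STATIONS)};{rng.uniform(-99.9, 99.9):.1f}\n"
                for _ in range(n)
            ]
            f.write("".join(batch))
            remaining -= n
```

Whether batching helps, and by how much, depends on the interpreter and the disk, so it is worth timing against the existing script before drawing conclusions.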

Performance (on a MacBook Pro M1 32GB)

Interpreter	Script	user (s)	system (s)	cpu	total (s)
python3	calculateAveragePolars.py	77.84	3.64	703%	11.585
pypy3	calculateAveragePypy.py	~~139.15~~ 135.25	~~3.02~~ 2.92	~~699%~~ 735%	~~20.323~~ 18.782
python3	calculateAverageDuckDB.py	186.78	4.21	806%	23.673
pypy3	calculateAverage.py	~~284.90~~ 242.89	~~9.12~~ 6.28	~~749%~~ 780%	~~39.236~~ 31.926
python3	calculateAverage.py	~~378.54~~ 329.20	~~6.94~~ 3.77	~~747%~~ 793%	~~51.544~~ 41.941
python3	calculateAveragePypy.py	~~573.77~~ 510.93	~~2.70~~ 1.88	~~787%~~ 793%	~~73.170~~ 64.660
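
The cpu column appears to be derived from the other three: user plus system time divided by the wall-clock total, truncated to a whole percentage, which is how a shell `time` builtin typically reports it. That relationship is inferred from the numbers rather than stated by the table, and the helper name cpu_percent is hypothetical.

```python
def cpu_percent(user: float, system: float, total: float) -> int:
    """CPU utilisation as a truncated percentage of wall-clock time."""
    return int((user + system) / total * 100)


# Values taken from the table above.
print(cpu_percent(77.84, 3.64, 11.585))   # 703 (python3, calculateAveragePolars.py)
print(cpu_percent(135.25, 2.92, 18.782))  # 735 (pypy3, calculateAveragePypy.py)
```

Values well above 100% indicate that the work was spread over several cores; 700% to 800% corresponds to roughly seven to eight cores busy on average.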

The script calculateAveragePolars.py was suggested by Taufan in this post.

The script calculateAveragePypy.py was created by donalm. When run with pypy3 it is roughly 2x faster than the initial script (calculateAverage.py), and it even beats the DuckDB implementation (calculateAverageDuckDB.py).

Olivier Scalbert made a simple but remarkably effective suggestion that improved performance by 15% on average (the table above has been updated, with the earlier timings struck through). Thank you 🙂

His suggestion was to change from:

if measurement < result[location][0]:
    result[location][0] = measurement
if measurement > result[location][1]:
    result[location][1] = measurement
result[location][2] += measurement
result[location][3] += 1

to:

_result = result[location]
if measurement < _result[0]:
    _result[0] = measurement
if measurement > _result[1]:
    _result[1] = measurement
_result[2] += measurement
_result[3] += 1

Python can be surprising sometimes.
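
One plausible explanation: every `result[location]` expression is a separate dictionary lookup (hash the key, probe the table), so the first version performs at least four lookups per row, while the second performs one and then works on the inner list directly. The sketch below applies the same pattern in a self-contained function. The function name aggregate, the `station;value` input format and the `[min, max, sum, count]` layout are assumptions inferred from the snippets above, not the exact code in calculateAverage.py.

```python
def aggregate(lines):
    """Return {location: [min, max, sum, count]} for 'station;value' lines."""
    result = {}
    for line in lines:
        location, _, raw = line.strip().partition(";")
        measurement = float(raw)
        _result = result.get(location)  # single dictionary lookup per row
        if _result is None:
            # First time this location is seen: seed all four slots.
            result[location] = [measurement, measurement, measurement, 1]
            continue
        if measurement < _result[0]:
            _result[0] = measurement
        if measurement > _result[1]:
            _result[1] = measurement
        _result[2] += measurement
        _result[3] += 1
    return result


stats = aggregate(["Oslo;1.5", "Oslo;-2.0", "Lima;20.0"])
print(stats["Oslo"])  # [-2.0, 1.5, -0.5, 2]
```

Because `_result` refers to the same list object stored in the dictionary, the in-place updates are visible through `result` without writing anything back.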

Compare results

Run compare.sh if you want to check that all the scripts produce the same output.
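
The contents of compare.sh are not shown here. As a rough Python equivalent of the idea, the sketch below compares already-captured outputs by hashing them. The function name find_mismatches and the choice to key outputs by script name are illustrative assumptions.

```python
import hashlib


def find_mismatches(outputs):
    """Return the names whose output differs from the first entry's output.

    `outputs` maps a script name to the text that script printed.
    Leading and trailing whitespace is ignored in the comparison.
    """
    digests = {
        name: hashlib.sha256(text.strip().encode("utf-8")).hexdigest()
        for name, text in outputs.items()
    }
    names = list(digests)
    if not names:
        return []
    reference = digests[names[0]]
    return [name for name in names[1:] if digests[name] != reference]
```

The outputs could be captured with `subprocess.run([...], capture_output=True, text=True).stdout`, assuming each script prints its result to standard output.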

