Commit: Added polars
ifnesi committed Jan 8, 2024
1 parent 689cc43 commit d655a35
Showing 2 changed files with 35 additions and 1 deletion.
5 changes: 4 additions & 1 deletion README.md
@@ -7,13 +7,16 @@ Python implementation of Gunnar's 1 billion row challenge:
## Performance (on a MacBook Pro M1 32GB)
| Interpreter | Script | user (s) | system (s) | cpu | total (s) |
| ----------- | ------ | -------- | ---------- | --- | --------- |
| python3 | calculateAveragePolars.py | 77.84 | 3.64 | 703% | 11.585 |
| pypy3 | calculateAveragePypy.py | ~~139.15~~<br>135.25 | ~~3.02~~<br>2.92 | ~~699%~~<br>735% | ~~20.323~~<br>18.782 |
| python3 | calculateAverageDuckDB.py | 186.78 | 4.21 | 806% | 23.673 |
| pypy3 | calculateAverage.py | ~~284.90~~<br>242.89 | ~~9.12~~<br>6.28 | ~~749%~~<br>780% | ~~39.236~~<br>31.926 |
| python3 | calculateAverage.py | ~~378.54~~<br>329.20 | ~~6.94~~<br>3.77 | ~~747%~~<br>793% | ~~51.544~~<br>41.941 |
| python3 | calculateAveragePypy.py | ~~573.77~~<br>510.93 | ~~2.70~~<br>1.88 | ~~787%~~<br>793% | ~~73.170~~<br>64.660 |

The script `calculateAveragePolars.py` was suggested by [Taufan](https://github.com/mtaufanr) in this [post](https://github.com/gunnarmorling/1brc/discussions/62#discussioncomment-8026402).

The script `calculateAveragePypy.py` was created by [donalm](https://github.com/donalm); it is more than 2x faster than the initial script (`calculateAverage.py`) when run in pypy3, and is even capable of beating the [DuckDB](https://duckdb.org/) implementation (`calculateAverageDuckDB.py`).

[Olivier Scalbert](https://github.com/oscalbert) made a simple but remarkably effective suggestion that improved performance by an average of 15% (the table above has been updated), thank you :slightly_smiling_face:

31 changes: 31 additions & 0 deletions calculateAveragePolars.py
@@ -0,0 +1,31 @@
import polars as pl


# Read data file
df = pl.scan_csv(
    "measurements.txt",
    separator=";",
    has_header=False,
    with_column_names=lambda cols: ["station_name", "measurement"],
)

# Group data
grouped = (
    df.group_by("station_name")
    .agg(
        pl.min("measurement").alias("min_measurement"),
        pl.mean("measurement").alias("mean_measurement"),
        pl.max("measurement").alias("max_measurement"),
    )
    .sort("station_name")
    .collect(streaming=True)
)

# Print final results
print("{", end="")
for data in grouped.iter_rows():
    print(
        f"{data[0]}={data[1]:.1f}/{data[2]:.1f}/{data[3]:.1f}",
        end=", ",
    )
# The two backspaces erase the trailing ", " left by the last iteration
print("\b\b} ")
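
For reference, the per-station min/mean/max aggregation that the Polars query expresses can be sketched in plain Python. This is a minimal illustration of the computation only (it is not the repository's `calculateAverage.py`, and the helper name `aggregate` is invented here); the real challenge input has one `station;measurement` line per row in `measurements.txt`.

```python
from collections import defaultdict


def aggregate(lines):
    """Compute per-station (min, mean, max) from 'station;measurement' lines."""
    stats = defaultdict(list)
    for line in lines:
        station, value = line.split(";")
        stats[station].append(float(value))
    # Sort by station name, matching the query's .sort("station_name")
    return {
        station: (min(values), sum(values) / len(values), max(values))
        for station, values in sorted(stats.items())
    }


sample = ["Hamburg;10.0", "Bulawayo;5.5", "Hamburg;20.0"]
print(aggregate(sample))
```

The Polars version performs the same grouping and aggregation, but lazily and in streaming mode (`collect(streaming=True)`), so the 1-billion-row file never has to fit in memory at once.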
