Added donalm's enhancements
ifnesi committed Jan 5, 2024
1 parent d20c618 commit 87d494f
Showing 2 changed files with 59 additions and 46 deletions.
17 changes: 15 additions & 2 deletions README.md
@@ -1,3 +1,16 @@
-From Gunnar's 1 billion rows challenge (https://github.com/gunnarmorling/1brc)
+# 1BRC: One Billion Row Challenge in Python

-Python implementation
+Python implementation of Gunnar's 1 billion rows challenge:
+- https://www.morling.dev/blog/one-billion-row-challenge
+- https://github.com/gunnarmorling/1brc
+
+## Performance (on a MacBook Pro M1 32GB)
+| Interpreter | Script | user | system | cpu | total |
+| ----------- | ------ | ---- | ------ | --- | ----- |
+| pypy3 | calculateAveragePypy.py | 139.15s | 3.02s | 699% | 20.323 |
+| python3 | calculateAverageDuckDB.py | 186.78s | 4.21s | 806% | 23.673 |
+| pypy3 | calculateAverage.py | 284.90s | 9.12s | 749% | 39.236 |
+| pypy3 | calculateAverage.py | 286.33s | 9.57s | 746% | 39.665 |
+| python3 | calculateAverage.py | 378.54s | 6.94s | 747% | 51.544 |
+
+The file `calculateAveragePypy.py` was created by [donalm](https://github.com/donalm). It is an improved version of the initial script (`calculateAverage.py`) that runs about 2x faster under pypy3 and even beats the [DuckDB](https://duckdb.org/) implementation, `calculateAverageDuckDB.py`.
88 changes: 44 additions & 44 deletions calculateAveragePypy.py
@@ -1,4 +1,4 @@
-# time python3 calculateAverage.py
+# time pypy3 calculateAveragePypy.py
 import os
 import multiprocessing as mp

@@ -58,61 +58,61 @@ def _process_file_chunk(
     file_name: str,
     chunk_start: int,
     chunk_end: int,
+    blocksize: int = 1024 * 1024,
 ) -> dict:
     """Process each file chunk in a different process"""
     result = dict()
-    blocksize = 1024 * 1024
-    fh = open(file_name, "rb")
-    byte_count = chunk_end - chunk_start
-    fh.seek(chunk_start)
-    tail = b""

-    location = None
+    with open(file_name, "r+b") as fh:
+        fh.seek(chunk_start)

-    while byte_count:
-        if blocksize > byte_count:
-            blocksize = byte_count
-        byte_count = byte_count - blocksize
+        tail = b""
+        location = None
+        byte_count = chunk_end - chunk_start

-        data = tail + fh.read(blocksize)
+        while byte_count > 0:
+            if blocksize > byte_count:
+                blocksize = byte_count
+            byte_count -= blocksize
+
+            index = 0
+            data = tail + fh.read(blocksize)
+            while data:
+                if location is None:
+                    try:
+                        semicolon = data.index(b";", index)
+                    except ValueError:
+                        tail = data[index:]
+                        break
+
+                    location = data[index:semicolon]
+                    index = semicolon + 1

-        index = 0
-        while data:
-            if location is None:
                 try:
-                    semicolon = data.index(b";", index)
+                    newline = data.index(b"\n", index)
                 except ValueError:
                     tail = data[index:]
                     break

-                location = data[index:semicolon]
-                index = semicolon + 1
-
-            try:
-                newline = data.index(b"\n", index)
-            except ValueError:
-                tail = data[index:]
-                break
-
-            value = float(data[index:newline])
-            index = newline + 1
-
-            if location not in result:
-                result[location] = [
-                    value,
-                    value,
-                    value,
-                    1,
-                ] # min, max, sum, count
-            else:
-                if value < result[location][0]:
-                    result[location][0] = value
-                if value > result[location][1]:
-                    result[location][1] = value
-                result[location][2] += value
-                result[location][3] += 1
-
-            location = None
+                value = float(data[index:newline])
+                index = newline + 1
+
+                if location not in result:
+                    result[location] = [
+                        value,
+                        value,
+                        value,
+                        1,
+                    ] # min, max, sum, count
+                else:
+                    if value < result[location][0]:
+                        result[location][0] = value
+                    if value > result[location][1]:
+                        result[location][1] = value
+                    result[location][2] += value
+                    result[location][3] += 1
+
+                location = None

     return result
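
Note on the block-reading loop: the function reads its chunk in `blocksize` pieces and carries any incomplete record forward in `tail`, so a record split across two reads is still parsed once. The sketch below is a simplified illustration of that idea, not the committed code. It assumes well-formed `name;value\n` records, looks for the newline first and then splits on `;` (the diff instead tracks `location` across blocks), and the name `parse_blocks` and the sample data are hypothetical.

```python
import io


def parse_blocks(stream, byte_count, blocksize=16):
    """Aggregate `name;value\\n` records read from `stream` in fixed-size blocks.

    Hypothetical helper for illustration only. Returns a dict of
    {name: [min, max, sum, count]}, the per-location layout used in the diff.
    """
    result = {}
    tail = b""  # bytes of a record that was cut off by the previous read
    while byte_count > 0:
        size = min(blocksize, byte_count)
        byte_count -= size
        data = tail + stream.read(size)
        tail = b""
        index = 0
        while True:
            newline = data.find(b"\n", index)
            if newline == -1:
                # Incomplete record: keep it for the next block.
                tail = data[index:]
                break
            location, _, raw = data[index:newline].partition(b";")
            value = float(raw)  # float() accepts ASCII bytes
            index = newline + 1
            stats = result.get(location)
            if stats is None:
                result[location] = [value, value, value, 1]
            else:
                if value < stats[0]:
                    stats[0] = value
                if value > stats[1]:
                    stats[1] = value
                stats[2] += value
                stats[3] += 1
    return result


if __name__ == "__main__":
    sample = b"Oslo;1.5\nRome;20.0\nOslo;-3.0\nRome;22.5\n"
    # A small block size forces records to straddle read boundaries.
    print(parse_blocks(io.BytesIO(sample), len(sample), blocksize=7))
```

Running it with different `blocksize` values on the same input should give the same aggregates, which is a quick way to check the carry-over logic.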
