Commit

Merge pull request #3 from fizmat/correctness
Fix a bug of defining chunks in bytes then using them in utf8 text mode
ifnesi authored Mar 27, 2024
2 parents 741cd19 + 8cc288a commit 01c9b8a
Showing 4 changed files with 28 additions and 21 deletions.
6 changes: 6 additions & 0 deletions .gitignore
@@ -13,3 +13,9 @@ measurements*.txt
 
 # DuckDB files
 *.ddb
+
+# Files for comparing results
+duckdb.txt
+polars.txt
+pypy.txt
+python.txt
30 changes: 12 additions & 18 deletions README.md
@@ -7,7 +7,7 @@ Python implementation of Gunnar's 1 billion row challenge:
 ## Creating the measurements file with 1B rows
 
 First install the Python requirements:
-```
+```shell
 python3 -m pip install -r requirements.txt
 ```

@@ -17,33 +17,23 @@ usage: createMeasurements.py [-h] [-o OUTPUT] [-r RECORDS]
 Create measurement file
-options:
+optional arguments:
 -h, --help show this help message and exit
 -o OUTPUT, --output OUTPUT
-Measurement file name (Default is measurements.txt)
+Measurement file name (default is "measurements.txt")
 -r RECORDS, --records RECORDS
-Number of records to create (Default is 1000000000)
+Number of records to create (default is 1_000_000_000)
 ```
 
 Example:
 ```
 % python3 createMeasurements.py
 Creating measurement file 'measurements.txt' with 1,000,000,000 measurements...
-- Wrote 10,000,000 measurements in 8.92 seconds
-- Wrote 20,000,000 measurements in 17.82 seconds
-- Wrote 30,000,000 measurements in 26.73 seconds
-- Wrote 40,000,000 measurements in 35.54 seconds
-- Wrote 50,000,000 measurements in 44.36 seconds
-- Wrote 60,000,000 measurements in 53.07 seconds
-.
-.
-.
-- Wrote 980,000,000 measurements in 880.98 seconds
-- Wrote 990,000,000 measurements in 889.99 seconds
-Created file 'measurements.txt' with 1,000,000,000 measurements in 898.92 seconds
+100%|█████████████████████████████████████████| 100/100 [01:15<00:00, 1.32it/s]
+Created file 'measurements.txt' with 1,000,000,000 measurements in 75.86 seconds
 ```
 
-Be patient as it can take more than 15 minutes to have the file generated.
+Be patient as it can take more than a minute to have the file generated.
 
 Maybe as another challenge is to speed up the generation of the measurements file :slightly_smiling_face:

@@ -84,4 +74,8 @@ _result[2] += measurement
 _result[3] += 1
 ```
 
-Python can be surprising sometimes.
+Python can be surprising sometimes.
+
+## Compare results
+
+Run `compare.sh` if you want to check that all the scripts produce the same output.
6 changes: 3 additions & 3 deletions calculateAverage.py
@@ -61,13 +61,13 @@ def _process_file_chunk(
 ) -> dict:
     """Process each file chunk in a different process"""
     result = dict()
-    with open(file_name, "r") as f:
+    with open(file_name, "rb") as f:
         f.seek(chunk_start)
         for line in f:
             chunk_start += len(line)
             if chunk_start > chunk_end:
                 break
-            location, measurement = line.split(";")
+            location, measurement = line.split(b";")
             measurement = float(measurement)
             if location not in result:
                 result[location] = [
@@ -118,7 +118,7 @@ def process_file(
     print("{", end="")
     for location, measurements in sorted(result.items()):
         print(
-            f"{location}={measurements[0]:.1f}/{(measurements[2] / measurements[3]) if measurements[3] !=0 else 0:.1f}/{measurements[1]:.1f}",
+            f"{location.decode('utf8')}={measurements[0]:.1f}/{(measurements[2] / measurements[3]) if measurements[3] !=0 else 0:.1f}/{measurements[1]:.1f}",
             end=", ",
         )
     print("\b\b} ")
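
The commit title describes chunks that are defined in bytes but consumed in UTF-8 text mode. The sketch below is one way to see why that mismatch matters; it is not code from this repository. The sample rows, the `lines_before` helper, and the in-memory streams are assumptions made for illustration. It runs the same offset-tracking loop over the same data in binary mode and in text mode and counts how many rows each accepts before passing a byte-based chunk boundary.

```python
import io

# Hypothetical sample: 20 identical rows whose station name contains a non-ASCII character.
ROW = "Zürich;12.3\n"
DATA = (ROW * 20).encode("utf8")


def lines_before(chunk_end: int, binary: bool) -> int:
    """Count the rows the chunk loop accepts before its running offset passes chunk_end."""
    raw = io.BytesIO(DATA)
    # Binary mode yields bytes; text mode yields decoded str.
    f = raw if binary else io.TextIOWrapper(raw, encoding="utf8")
    offset = 0
    accepted = 0
    for line in f:
        offset += len(line)  # bytes in binary mode, characters in text mode
        if offset > chunk_end:
            break
        accepted += 1
    return accepted


# A chunk boundary computed in bytes: the end of the 13th row.
chunk_end = 13 * len(ROW.encode("utf8"))

print(lines_before(chunk_end, binary=True))   # offsets counted in bytes
print(lines_before(chunk_end, binary=False))  # offsets counted in characters
```

In this sample each row is 13 bytes but only 12 characters, so the text-mode loop undercounts its position and accepts a row that lies past the boundary. If the next chunk starts at that boundary, the extra row would be processed twice. How far the counts drift on real data depends on how many multi-byte station names a chunk contains.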
7 changes: 7 additions & 0 deletions compare.sh
@@ -0,0 +1,7 @@
+python calculateAverage.py > python.txt
+python calculateAveragePypy.py > pypy.txt
+python calculateAveragePolars.py > polars.txt
+python calculateAverageDuckDB.py > duckdb.txt
+git diff --no-index --word-diff=porcelain python.txt pypy.txt
+git diff --no-index --word-diff=porcelain python.txt polars.txt
+git diff --no-index --word-diff=porcelain python.txt duckdb.txt
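
For readers who want the same check without `git`, the comparison step could be approximated in Python roughly as follows. This is a sketch under assumptions, not part of the commit: the `diff_outputs` helper is hypothetical, and it assumes each result is a single line of `, `-separated `station=min/mean/max` entries, as the `print` calls in `calculateAverage.py` suggest.

```python
import difflib


def diff_outputs(reference: str, candidate: str) -> list[str]:
    """Return unified-diff lines between two result strings, one station entry per line."""
    ref_entries = reference.strip().split(", ")
    cand_entries = candidate.strip().split(", ")
    return list(
        difflib.unified_diff(
            ref_entries, cand_entries, "reference", "candidate", lineterm=""
        )
    )


# Made-up sample results; real input would come from python.txt, pypy.txt, and so on.
same = diff_outputs("{A=1.0/2.0/3.0, B=0.0/0.0/0.0}", "{A=1.0/2.0/3.0, B=0.0/0.0/0.0}")
changed = diff_outputs("{A=1.0/2.0/3.0, B=0.0/0.0/0.0}", "{A=1.0/2.0/3.0, B=0.0/0.1/0.0}")

print(same)
print("\n".join(changed))
```

To compare real files, the two arguments could be read with `pathlib.Path("python.txt").read_text(encoding="utf8")` and the equivalent call for the other output. An empty list means the two scripts agree.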
