Faster create #2

Merged 8 commits on Mar 25, 2024

Conversation

fizmat (Contributor) commented Mar 24, 2024

It takes 1130s on my machine (Apple M1) to generate the data.

I hoped that generating the measurements with numpy in batches would be enough for a good speedup, but writing the records in a loop is also very slow. Total time: ~760s.

I tried using pandas for the CSV output, but it was slower.

Writing CSV with polars is very fast, but converting the numpy array of randomly selected station names into a polars column is slow. Total time: ~480s.

Sampling the stations with polars instead of numpy avoids this slow conversion, so creating the 1brc dataset now takes just 71s.

ifnesi (Owner) commented Mar 25, 2024

Hi @fizmat, that is really cool. Thank you very much for your contribution.

@ifnesi merged commit 741cd19 into ifnesi:main on Mar 25, 2024

2 participants