
Parquet output format #870

Open
mschulist wants to merge 2 commits into birdnet-team:main from mschulist:mschulist-parquet-output

Conversation

@mschulist

This PR adds parquet as an output format for analyze. It uses PyArrow to handle all of the parquet reading and writing.

Currently, it creates a separate "table" for each timestamp, which might result in suboptimal compression when there are few results per timestamp. There are a few options to improve this:

  • Have a buffer that creates a new table for every $n$ rows (slightly more complex implementation, but still not too bad); a sketch of this option follows this list.
  • Put all rows in a single table (which might use a lot of memory for large datasets).
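
Here is a minimal sketch of the buffered option, using pyarrow.parquet.ParquetWriter to write one row group each time the buffer fills. The schema, column names, and flush threshold below are illustrative assumptions, not the PR's actual code.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Assumed flat result schema; the real column set in the PR may differ.
SCHEMA = pa.schema(
    [
        ("filename", pa.string()),
        ("start_time", pa.float32()),
        ("end_time", pa.float32()),
        ("scientific_name", pa.string()),
        ("common_name", pa.string()),
        ("confidence", pa.float32()),
    ]
)


class BufferedParquetWriter:
    """Accumulate result rows and write one row group per `flush_rows` rows."""

    def __init__(self, path: str, flush_rows: int = 10_000):
        self._writer = pq.ParquetWriter(path, SCHEMA, compression="zstd")
        self._flush_rows = flush_rows
        self._rows: list[dict] = []

    def add_row(self, row: dict) -> None:
        self._rows.append(row)
        if len(self._rows) >= self._flush_rows:
            self._flush()

    def _flush(self) -> None:
        # Convert the buffered rows to a table and append it as one row group.
        if self._rows:
            table = pa.Table.from_pylist(self._rows, schema=SCHEMA)
            self._writer.write_table(table)
            self._rows = []

    def close(self) -> None:
        self._flush()
        self._writer.close()
```

Larger row groups give the encoder more repeated values to work with, which is where most of the compression gain over per-timestamp tables would come from.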

Combining the results into a single file (with --combine_results) does make the output much smaller, but it would be ideal to get good compression without this extra step.
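
As an aside, even without --combine_results, multiple per-file parquet outputs can be read back as one logical table via pyarrow.dataset. The directory path and confidence column below are assumptions for illustration only:

```python
import pyarrow.dataset as ds

# Treat all parquet files under a (hypothetical) output directory as one dataset.
dataset = ds.dataset("output/", format="parquet")

# Load only the rows of interest; the filter is pushed down to the file readers.
high_conf = dataset.to_table(filter=ds.field("confidence") >= 0.5)
print(high_conf.num_rows)
```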

Either way, I have found that parquet's columnar compression works particularly well on classifier outputs because the data is highly repetitive (e.g. filenames are repeated across many rows). For large datasets, parquet should provide a significant reduction in file size.
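
To make the compression point concrete, here is a toy illustration (not taken from the PR) of how parquet's default dictionary encoding stores each distinct filename only once per row group:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Toy data: two distinct filenames repeated over 2,000 rows.
table = pa.table(
    {
        "filename": ["recording_001.wav"] * 1_000 + ["recording_002.wav"] * 1_000,
        "confidence": [0.5] * 2_000,
    }
)

# use_dictionary=True is the default; it is spelled out here to highlight the mechanism.
pq.write_table(table, "toy.parquet", compression="zstd", use_dictionary=True)

# Inspect the file metadata to see the (small) compressed column sizes.
print(pq.ParquetFile("toy.parquet").metadata)
```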

This is somewhat related to #230 as well.

@Josef-Haupt
Member

Sounds good. The birdnet lib also has a parquet output, and we are currently replacing the core of the analyzer with the lib anyway. We can merge the PR, and I'll update the code in #867 to match it.
