
Conversation

@hagenw (Member) commented Feb 1, 2024

The main goal of this pull request is to speed up loading, saving, and parsing of the dependency table.

To achieve this, we switch to pyarrow.Table for representing the dependencies.
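
For illustration, here is a minimal sketch of the two reading paths compared in the benchmarks below (the file name `deps.csv` is an assumption, not the actual audb code):

```python
import pandas as pd
import pyarrow.csv

# Read the dependency table directly with pandas
df = pd.read_csv("deps.csv")

# Read it with pyarrow into a pyarrow.Table,
# optionally converting to pandas afterwards
table = pyarrow.csv.read_csv("deps.csv")
df_from_arrow = table.to_pandas()
```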

Benchmark loading and saving dependency files

Reading a dependency file with 1,000,000 entries from CSV, pickle, or parquet

| Destination | CSV | pickle | parquet |
| --- | --- | --- | --- |
| pandas.DataFrame | 1.15 s | 0.19 s | 0.37 s |
| pyarrow.Table -> pandas.DataFrame | 0.41 s | | 0.39 s |
| pyarrow.Table | 0.05 s | | 0.05 s |
| pandas.DataFrame -> pyarrow.Table | | 0.47 s | |

Writing a dependency file with 1,000,000 entries to CSV, pickle, or parquet

| Origin | CSV | pickle | parquet |
| --- | --- | --- | --- |
| pandas.DataFrame | 1.96 s | 0.70 s | 0.47 s |
| pandas.DataFrame -> pyarrow.Table | 0.47 s | | 0.77 s |
| pyarrow.Table | 0.25 s | | 0.24 s |
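
A hypothetical sketch of how such read timings could be collected (the file names, repetition count, and helper functions are assumptions, not the benchmark script used for the numbers above):

```python
import timeit

import pandas as pd
import pyarrow.csv
import pyarrow.parquet


def read_pandas_csv():
    return pd.read_csv("deps.csv")


def read_pyarrow_csv():
    return pyarrow.csv.read_csv("deps.csv")


def read_pyarrow_parquet():
    return pyarrow.parquet.read_table("deps.parquet")


for func in [read_pandas_csv, read_pyarrow_csv, read_pyarrow_parquet]:
    # Average wall-clock time over a few repetitions
    elapsed = timeit.timeit(func, number=5) / 5
    print(f"{func.__name__}: {elapsed:.2f} s")
```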

Conclusions

  • pyarrow.Table should be used when reading/writing CSV files
  • the fastest solution would be to represent dependencies as pyarrow.Table instead of pandas.DataFrame

Benchmarking single methods

| Method | pyarrow.Table | pandas.DataFrame |
| --- | --- | --- |
| `Dependency.__call__()` | 0.315 s | 0.000 s |
| `Dependency.__contains__()` | 0.001 s | 0.000 s |
| `Dependency.__getitem__()` | 0.001 s | 0.000 s |
| `Dependency.__len__()` | 0.000 s | 0.000 s |
| `Dependency.__str__()` | 0.006 s | 0.006 s |
| `Dependency.archives` | 0.124 s | 0.413 s |
| `Dependency.attachments` | 0.019 s | 0.021 s |
| `Dependency.attachment_ids` | 0.022 s | 0.022 s |
| `Dependency.files` | 0.039 s | 0.029 s |
| `Dependency.media` | 0.090 s | 0.094 s |
| `Dependency.removed_media` | 0.097 s | 0.092 s |
| `Dependency.table_ids` | 0.022 s | 0.030 s |
| `Dependency.tables` | 0.018 s | 0.021 s |
| `Dependency.archive(1000 files)` | 0.884 s | 0.005 s |
| `Dependency.bit_depth(1000 files)` | 1.044 s | 0.004 s |
| `Dependency.channels(1000 files)` | 1.018 s | 0.004 s |
| `Dependency.checksum(1000 files)` | 0.963 s | 0.004 s |
| `Dependency.duration(1000 files)` | 1.299 s | 0.004 s |
| `Dependency.format(1000 files)` | 1.037 s | 0.004 s |
| `Dependency.removed(1000 files)` | 1.507 s | 0.004 s |
| `Dependency.sampling_rate(1000 files)` | 1.116 s | 0.004 s |
| `Dependency.type(1000 files)` | 1.271 s | 0.004 s |
| `Dependency.version(1000 files)` | 0.886 s | 0.004 s |
| `Dependency._add_attachment()` | 0.090 s | 0.073 s |
| `Dependency._add_media(1000 files)` | 0.044 s | 0.068 s |
| `Dependency._add_meta()` | 0.112 s | 0.118 s |
| `Dependency._drop()` | 0.026 s | 0.209 s |
| `Dependency._remove()` | 0.057 s | 0.062 s |
| `Dependency._update_media()` | 0.103 s | 0.064 s |
| `Dependency._update_media_version(1000 files)` | 1.043 s | 0.008 s |

Conclusion

Using pyarrow.Table (or a polars.DataFrame) is faster for certain column-based operations, but far too slow when accessing single rows. So we should not switch, but stay with pandas.DataFrame to represent the dependency table.
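
To make the difference behind these numbers concrete, here is a small illustrative sketch of the two access patterns on a toy table (the column names and the `__index_level_0__` index column are assumptions, not the actual Dependency internals):

```python
import pandas as pd
import pyarrow as pa
import pyarrow.compute as pc

# Toy dependency table with the file path as index
df = pd.DataFrame(
    {"archive": ["a1", "a2"], "checksum": ["abc", "def"]},
    index=["f1.wav", "f2.wav"],
)
table = pa.Table.from_pandas(df, preserve_index=True)

# Column-based operation (compare Dependency.archives):
# a single vectorized pass over one column
archives_pandas = set(df["archive"])
archives_pyarrow = set(table.column("archive").to_pylist())

# Row-based operation (compare Dependency.checksum(file)):
# cheap label lookup in pandas, a full filter pass per call in pyarrow
checksum_pandas = df.at["f1.wav", "checksum"]
mask = pc.equal(table.column("__index_level_0__"), "f1.wav")
checksum_pyarrow = table.filter(mask).column("checksum")[0].as_py()
```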

@hagenw (Member, Author) commented May 3, 2024

We decided against storing the dependency table internally as pyarrow.Table and opted for pandas.DataFrame instead, using pyarrow.Table only as an intermediate representation when reading/writing a file, see #372.
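
A minimal sketch of that approach (function names and signatures are illustrative only; see #372 for the actual implementation):

```python
import pandas as pd
import pyarrow
import pyarrow.csv
import pyarrow.parquet


def load_dependencies(path: str) -> pd.DataFrame:
    # pyarrow is only used to read the file quickly;
    # the result is handed over to pandas
    if path.endswith(".parquet"):
        table = pyarrow.parquet.read_table(path)
    else:
        table = pyarrow.csv.read_csv(path)
    return table.to_pandas()


def save_dependencies(df: pd.DataFrame, path: str) -> None:
    # pandas.DataFrame is converted to pyarrow.Table just for writing
    table = pyarrow.Table.from_pandas(df, preserve_index=False)
    if path.endswith(".parquet"):
        pyarrow.parquet.write_table(table, path)
    else:
        pyarrow.csv.write_csv(table, path)
```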

@hagenw hagenw closed this May 3, 2024
@hagenw hagenw mentioned this pull request May 30, 2024