Provide table dataframe hashing in public API? #447

@hagenw

In #419 we introduced storing tables as parquet files and created a custom hashing method for them, because audformat.utils.hash() does not consider column names and does not provide consistent hashes across different pandas versions.

The hashing steps, which could be moved into a public function, are:

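import hashlib
import pyarrow as pa

# Convert the dataframe to a pyarrow table, storing the index as regular columns
table = pa.Table.from_pandas(self.df.reset_index(), preserve_index=False)
# Create hash of table from its schema (column names and types) and its content
table_hash = hashlib.md5()
table_hash.update(_schema_hash(table))
table_hash.update(_dataframe_hash(self.df))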

The only downside might be that we already have audformat.utils.hash(), which currently supports index, series, and dataframe objects. In our own tools, we only use it to hash an index. So the question is how we should name the new hash function and how it should be positioned relative to audformat.utils.hash().
