
Add option to delete temporary files (e.g. extracted files) when loading dataset #2604

@thomwolf


I'm loading a dataset consisting of 44 GB of compressed JSON files.

When loading the dataset with the JSON script, extracting the files creates about 200 GB of uncompressed files before creating the 180 GB of Arrow cache tables.

Having a simple way to delete the extracted files after usage (or even better, to stream the extraction and deletion) would be nice to avoid disk clutter.

I can maybe tackle this one in the JSON script unless you want a more general solution.
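
To make the two options concrete, here is a minimal sketch using only the Python standard library. It is not the library's API: the helper names `extracted_temporarily` and `stream_json_lines` are hypothetical, and it assumes the files are gzip-compressed JSON Lines (one record per line), which may not match the actual data.

```python
import gzip
import json
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def extracted_temporarily(compressed_path):
    """Option 1 (hypothetical helper): extract to a temp dir, delete it on exit."""
    tmp_dir = tempfile.mkdtemp(prefix="extracted_")
    out_path = os.path.join(tmp_dir, "data.jsonl")
    try:
        # Decompress in chunks so the whole file is never held in memory.
        with gzip.open(compressed_path, "rb") as src, open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        yield out_path
    finally:
        # Remove the extracted copy even if the caller raised an exception.
        shutil.rmtree(tmp_dir, ignore_errors=True)


def stream_json_lines(compressed_path):
    """Option 2 (hypothetical helper): parse records directly from the .gz file.

    No extracted copy is written to disk. Assumes one JSON object per line.
    """
    with gzip.open(compressed_path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank lines
                yield json.loads(line)
```

With the first option the uncompressed copy exists only inside the `with` block; with the second, peak extra disk usage is zero, at the cost of decompressing again on every pass over the data.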


Labels

enhancement (New feature or request)
