Understanding the behaviour of `load_from_disk` #4617

nihaljn · 2022-07-01T23:59:10Z

nihaljn
Jul 1, 2022

I have a very large dataset with 32M examples stored as an .arrow table using save_to_disk. When I use load_from_disk to load this dataset for the first time (e.g., the first time after a reboot), it's really slow and takes > 10 minutes to complete. Every subsequent call to load_from_disk is very fast and completes in a fraction of a second. Why does this happen? Is this due to some caching in memory? Can the cache be written to disk instead?

mariosasko · 2022-07-15T10:26:55Z

mariosasko
Jul 15, 2022
Collaborator

Hi! You can check #2252 to find more info on this behavior. We plan to add an option to shard arrow files on save to address this.

1 reply

mariosasko Sep 20, 2023
Collaborator

save_to_disk now shards the arrow file into smaller files by default :)
