Dataset Decompression Improvements #512

Merged
merged 2 commits into from
Dec 16, 2024

Conversation

cmacdonald
Contributor

@cmacdonald cmacdonald commented Dec 16, 2024

comments as per commit messages

@seanmacavaney
Collaborator

What's the motivation for delaying the decompression? From what I can tell, it's to allow the progress bar? At least for compressed tarfiles, this has the side effect of needing to decompress the entire file twice, which can get pretty expensive.

I also suspect that tarfile is smart enough that it won't re-scan the entire file if the next record is the requested file. But this might be worth double-checking.

IIRC, zip files don't have either of these problems, due to how the file metadata is stored.
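
As a rough illustration of the zip point: a zip archive keeps a central directory of its members, so Python's `zipfile` can seek straight to one entry without decompressing the others. The sketch below is not taken from this PR; `build_zip` and `read_member` are hypothetical helper names, and the archive is built in memory purely for demonstration.

```python
import io
import zipfile


def build_zip(files):
    """Build an in-memory zip archive from a {name: bytes} mapping (demo helper)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    buf.seek(0)
    return buf


def read_member(zip_buf, name):
    """Read a single member; zipfile locates it through the central directory."""
    with zipfile.ZipFile(zip_buf) as zf:
        return zf.read(name)


files = {"a.txt": b"alpha", "b.txt": b"beta", "c.txt": b"gamma"}
archive = build_zip(files)

# Fetching the last member does not require decompressing the earlier ones.
last = read_member(archive, "c.txt")
```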

@cmacdonald
Contributor Author

Motivation is to avoid reopening a tar file for every file requested. I'm pretty sure tarfiles aren't random access - they were designed for tapes...
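
A minimal sketch of the general idea, assuming the set of wanted files is known up front: open the compressed tar once in streaming mode and pick out every requested member in a single sequential pass, rather than reopening and rescanning the archive per file. This is not the code from this PR; `build_tar_gz` and `extract_wanted` are hypothetical helper names, and the archive is built in memory only for demonstration.

```python
import io
import tarfile


def build_tar_gz(files):
    """Build an in-memory .tar.gz from a {name: bytes} mapping (demo helper)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tf:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


def extract_wanted(tar_buf, wanted):
    """Collect all wanted members in one forward pass over the archive."""
    wanted = set(wanted)
    found = {}
    # "r|gz" reads the archive as a forward-only stream, so it is
    # decompressed at most once and never seeked backwards.
    with tarfile.open(fileobj=tar_buf, mode="r|gz") as tf:
        for member in tf:
            if member.isfile() and member.name in wanted:
                # In stream mode the data must be read before advancing.
                found[member.name] = tf.extractfile(member).read()
                if len(found) == len(wanted):
                    break  # stop early once everything has been found
    return found


source = {"a.txt": b"alpha", "b.txt": b"beta", "c.txt": b"gamma"}
result = extract_wanted(build_tar_gz(source), ["a.txt", "c.txt"])
```

The trade-off is that members come back in archive order, not in the order they were requested, so callers need to key the results by name.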

@seanmacavaney
Collaborator

Ahhh, okay, now I get what this is doing. Yes, this is way better than before.

@seanmacavaney seanmacavaney changed the title Dataset improvements Dataset Decompression Improvements Dec 16, 2024
@cmacdonald cmacdonald merged commit f9e4bf7 into master Dec 16, 2024
23 checks passed
@cmacdonald cmacdonald deleted the dataset_improvements branch December 17, 2024 10:05
2 participants