Replies: 13 comments 4 replies
-
This is something I think would be super neat too, but it's unclear what the performance impact might be.
-
Yeah, it can potentially be a notable issue. I will note that there are ways to decompress files that use deflate, assuming whatever you're using can still work with a decompressed stream. AdvanceCOMP can handle gz, zip, and png files, for example, and qpdf can be used to decompress PDFs.
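For what it's worth, here is a minimal sketch (assuming zlib, with a made-up payload and guessed compression levels) of why decompressing a deflate stream is the easy part, while getting the original bytes back bit for bit is not:

```cpp
// Sketch only: round-trip a zlib-wrapped deflate stream and check whether
// naive recompression reproduces the original bytes. In practice it often
// will not, unless the original encoder's level/strategy are known exactly.
#include <zlib.h>

#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    const unsigned char original[] = "example payload example payload example payload";

    // 1) Compress once to simulate a pre-existing stream whose parameters we don't know.
    uLongf stream_len = compressBound(sizeof(original));
    std::vector<unsigned char> stream(stream_len);
    compress2(stream.data(), &stream_len, original, sizeof(original), 1);

    // 2) Decompressing it is the easy part.
    uLongf raw_len = sizeof(original);
    std::vector<unsigned char> raw(raw_len);
    uncompress(raw.data(), &raw_len, stream.data(), stream_len);

    // 3) Recompress with a guessed level and compare bit for bit.
    uLongf redo_len = compressBound(raw_len);
    std::vector<unsigned char> redo(redo_len);
    compress2(redo.data(), &redo_len, raw.data(), raw_len, 9);  // guessed parameters

    bool identical = redo_len == stream_len &&
                     std::equal(stream.begin(), stream.begin() + stream_len, redo.begin());
    std::printf("bit-exact recompression: %s\n", identical ? "yes" : "no");
}
```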
-
I have now, and re-compressing zip is pretty efficient.
-
Another alternative to precomp or AdvanceCOMP is xtool: https://github.com/Razor12911/xtool
-
I've been following along, but I'm unsure about what the actual feature request is. From what I've read, the idea seems to be that data would be run through something like precomp when the image is created, and upon accessing the archive in the mounted filesystem (or upon extraction), the data would be run through the equivalent of a precomp restore. I think this is an interesting idea, but there are a few issues I anticipate, random access being one of them.

Maybe the solution for the random access problem would be simply to not try to optimize for it. Rather, assume that archives will mostly be read sequentially, and if someone does random access, it's just going to be painfully slow. I do wonder how well maintained precomp and its potential alternatives are, given that the latest release of precomp is from early 2019.
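To make the random access concern a bit more concrete, here is a rough sketch (all names hypothetical, nothing to do with actual DwarFS internals) of why a read at an arbitrary offset of a restore-on-the-fly file is expensive without extra index data:

```cpp
// Hypothetical sketch (not DwarFS internals): serving a pread()-style request
// from a file that is stored in expanded form and must be re-compressed on
// read. Without per-block restart points, every random read has to replay the
// stream from offset 0 up to the requested range, i.e. O(offset) work per call.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Stand-in for "re-run the original compressor over the expanded data".
// A real deflate encoder is strictly sequential in the same way; this fake
// just emits a deterministic pattern so the sketch is self-contained.
struct sequential_recompressor {
    size_t pos = 0;
    void reset() { pos = 0; }
    size_t produce(uint8_t* buf, size_t want) {
        for (size_t i = 0; i < want; ++i) buf[i] = static_cast<uint8_t>(pos + i);
        pos += want;
        return want;
    }
};

size_t pread_restored(sequential_recompressor& rc, uint8_t* dst, size_t size, size_t offset) {
    rc.reset();
    std::vector<uint8_t> scratch(64 * 1024);
    size_t skipped = 0;
    while (skipped < offset) {  // the painful part: skip everything before `offset`
        size_t chunk = std::min(scratch.size(), offset - skipped);
        skipped += rc.produce(scratch.data(), chunk);
    }
    return rc.produce(dst, size);  // finally produce the bytes the caller asked for
}

int main() {
    sequential_recompressor rc;
    uint8_t buf[16];
    size_t n = pread_restored(rc, buf, sizeof(buf), 1 << 20);  // read 16 bytes at 1 MiB
    std::printf("read %zu bytes, first byte = %u\n", n, buf[0]);
}
```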
-
My own viewpoint is that this would be a useful thing, but not necessarily something that should live within dwarfs. I've looked at precomp's code and it doesn't seem to offer any form of clear API, so it would likely be a notable effort to make a filesystem that recompresses precomp'd files. I will say that this can be useful for, for example, folders with large numbers of deflated files, either in nonstandard containers or files that need to be kept in their original compressed state (where AdvanceCOMP or qpdf can't simply be used to decompress them).
-
Okay, this is likely a can of worms and certainly something that requires more thought. I'll move it to discussions for now.
-
I just had another look at the precomp code. Bits like this scare me:
The fact that it creates myriads of temp files scares me. The fact that it's basically a single 8.5k-line source file scares me. The fact that it's likely unmaintained scares me. As much as I like the idea and as much as I appreciate the cleverness that has gone into it, I don't see myself adding support for precomp to DwarFS. If this were a (preferably maintained) library with a streaming API, I'd definitely consider it. preflate looks kind of nice, but alas, it also seems unmaintained.
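For the record, a sketch of roughly the kind of streaming interface that would make such a library easy to embed; this is a hypothetical shape, not precomp's or preflate's actual API:

```cpp
// Hypothetical interface sketch: what a recompression library with a streaming
// API might look like from a filesystem's point of view. Neither precomp nor
// preflate exposes exactly this; it only illustrates the "feed chunks in, get
// chunks out, no temp files" requirement.
#include <cstdint>
#include <functional>
#include <span>

class stream_transform {
public:
    using sink = std::function<void(std::span<const uint8_t>)>;
    virtual ~stream_transform() = default;

    // Expand side (at image creation time): consume the compressed input
    // incrementally, emit expanded data plus whatever reconstruction metadata
    // is needed to reverse the step later.
    virtual void expand(std::span<const uint8_t> chunk, const sink& out) = 0;

    // Restore side (on read/extraction): consume expanded data plus metadata,
    // emit a bit-exact copy of the original stream.
    virtual void restore(std::span<const uint8_t> chunk, const sink& out) = 0;

    // Flush any buffered state at end of stream.
    virtual void finish(const sink& out) = 0;
};
```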
-
You can look into https://github.com/Razor12911/xtool; it has a streaming API.
-
It just occurred to me that at least adding
-
I've been reading this thread and, as much as I love precomp, I concur: it's far from ready to be used in something as precise and stable as dwarfs. Nor do I believe it is a good idea to include it as-is, because dwarfs is primarily a filesystem, not an archiver. The good news is that, at the most basic level, precomp is just a parser and a wrapper around external libraries. One could rewrite it in Python or pretty much any other language if need be, which I'm slowly doing in my own spare time. Or, as I believe to be the case here, just borrow some ideas and adapt its functionality.

How would one go about implementing this in dwarfs? That's probably the tricky part. As I see it, there are at least two major groups of 'transforms': the ones that need a file to be intact to work (JPEG or MP3, for example), and the ones you can feed chunks of arbitrary size without affecting functionality. So, in order to include these transforms, you'd probably need to rewrite a lot of important stuff. But! If you do, I can guarantee dwarfs files will be dramatically smaller, and depending on design choices, they won't even be any slower to access, at least compared with the current version when going for best compression. Some lossless transforms that can be included:

I'm leaving aside buggy stuff like packMP3 for obvious reasons (but those can be included if there is a round-trip check at compression time).

So, just before segmenting, there needs to be a parsing stage to find pre-processable stuff: JPEGs, ZIPs, and such. After applying the desired transform (optionally recursively), dwarfs can start deduping. Then, just before passing blocks to the final compressor, some other transforms are applied, for example BCJ2 or some fast rep finder like LZP (this one helps by increasing the amount of data that can fit in a single block).

About performance: even precomp right now is pretty fast at compressing and decompressing, especially if you consider that the ratio you can reach with it is on a whole other level than what you can hope for with 'normal' methods. This is, obviously, thinking about an archiver, not a real-time filesystem.

What do you think about all this @mhx?
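A minimal illustration of the 'parsing stage' mentioned above: scanning a buffer for well-known magic bytes (the JPEG SOI marker and the zip local file header) before segmenting. A real detector would obviously have to validate far more than the signature and determine the stream length; this is just a sketch:

```cpp
// Sketch of a pre-segmentation scan for candidate embedded streams.
// Only the magic bytes are checked here; a real implementation would
// parse headers to confirm the match and find where the stream ends.
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

struct candidate { size_t offset; const char* kind; };

std::vector<candidate> find_candidates(const uint8_t* data, size_t size) {
    std::vector<candidate> out;
    for (size_t i = 0; i + 4 <= size; ++i) {
        if (data[i] == 0xFF && data[i + 1] == 0xD8 && data[i + 2] == 0xFF) {
            out.push_back({i, "jpeg"});           // JPEG SOI marker plus next marker byte
        } else if (data[i] == 'P' && data[i + 1] == 'K' &&
                   data[i + 2] == 0x03 && data[i + 3] == 0x04) {
            out.push_back({i, "zip-local-file"}); // zip local file header (PK\x03\x04)
        }
    }
    return out;
}

int main() {
    const uint8_t blob[] = {0x00, 'P', 'K', 0x03, 0x04, 0x00, 0xFF, 0xD8, 0xFF, 0xE0};
    for (auto& c : find_candidates(blob, sizeof(blob)))
        std::printf("%s at offset %zu\n", c.kind, c.offset);
}
```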
-
There seems to be an actively developed class of programs called "file optimizers" that could lead us to a modern precomp replacement. Their goal is to recompress without quality loss. The part we need to add is reconstructing the poorly compressed original file, bit for bit. They have all the logic for identifying the file type from its data, and the glue needed to send it to the right decompressor. If those decompressors were able to return the original compression parameters and rebuild a bit-exact replica of the original file, then you would have a complete solution.

Here's an optimizer, written in Windows C++, with 90+ external compressors: https://nikkhokkho.sourceforge.io/static.php?page=FileOptimizer

And one for Linux:
-
As is, I agree with you. I see it as 2/3rds of the solution (identification and decompression). The unimplemented third is bitwise recompression. My initial thought was to analyze and extract the needed compression parameters, compressor version, etc. from each file type. But we'd still have a metadata issue: current time, date, file permissions, etc. would all have to be spoofed to get certain archivers to create an exact replica... Which led me to this eureka moment: we have a powerful deduplicator at our disposal, and the big difference between a file that we compress and the original compressed file is a small bit of metadata. Here's my idea:

It's a lot of work, but this technique would cover every past and future compression format in one shot. There could be a config file where users add custom external compressors, along with file extensions and common compression parameters. The list could include multiple versions of a single compressor as well. It could even handle encrypted files, as long as the dwarfs archive and config (with plaintext passwords) are secured or encrypted. That would be a huge benefit for modern-day Android backups where LUKS encryption is used. The gotcha formats will be those that stream-compress, with metadata mid-stream, and some that are multithreaded/indeterminate.

Update: I'm experimenting with zip now, and not having any luck with block matching yet. It is either a mid-stream metadata format, or my parameters aren't right. Zip would be one of the biggest formats to solve, because it is used in so many other formats (Android APKs, Microsoft docs, etc.)
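To illustrate the "let the deduplicator absorb the difference" idea, here's a hedged sketch: after recompressing the expanded payload with some guessed external configuration (not shown), record only the byte ranges where the result diverges from the original archive member, so everything else can dedupe against the recompressed data. All names here are made up:

```cpp
// Hypothetical sketch: compare the recompressed candidate against the original
// file and keep only the differing ranges (timestamps, header fields, ...), so
// that segmenting/dedup can reuse the matching bulk of the data.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct diff_range {
    size_t offset;                        // where the original diverges
    std::vector<uint8_t> original_bytes;  // bytes to patch back in on restore
};

// The original length must be stored alongside these ranges; restoring is then
// "recompress, truncate/extend to the original length, apply the patches".
std::vector<diff_range> diff_against_original(const std::vector<uint8_t>& original,
                                              const std::vector<uint8_t>& recompressed) {
    std::vector<diff_range> patches;
    size_t overlap = std::min(original.size(), recompressed.size());
    size_t i = 0;
    while (i < overlap) {
        if (original[i] == recompressed[i]) { ++i; continue; }
        size_t start = i;
        while (i < overlap && original[i] != recompressed[i]) ++i;
        patches.push_back({start, {original.begin() + start, original.begin() + i}});
    }
    if (original.size() > overlap) {  // original tail the recompressor didn't produce
        patches.push_back({overlap, {original.begin() + overlap, original.end()}});
    }
    return patches;
}
```

The assumption baked in here is that the external recompressor is deterministic; as noted above, the stream-compressing and multithreaded formats are exactly where that breaks down.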
-
Not sure if you know of it, but https://github.com/schnaader/precomp-cpp
Development stalled two years ago; it seems the MP3 compression has a critical bug left unfixed, but besides that nothing major from what I saw.
I have not tested its efficiency yet due to the lack of directory support, but of course dwarfs could take care of that since it compresses files one by one anyway.
I will test whether zstd benefits from this a bit later and send my results. Just wanted to share the idea first; I might not be aware of all the challenges in implementing it.
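In case it helps with that test, here's a small sketch using libzstd's one-shot API to compare how well the same data compresses before and after being expanded; how the two buffers are filled (running precomp, reading the files) is left out, and the level is an arbitrary choice:

```cpp
// Sketch: compare zstd's one-shot compression on the original file bytes vs.
// the precomp-expanded bytes. Filling the two buffers is outside this snippet.
#include <zstd.h>

#include <cstdint>
#include <cstdio>
#include <vector>

size_t zstd_compressed_size(const std::vector<uint8_t>& in, int level) {
    std::vector<uint8_t> out(ZSTD_compressBound(in.size()));
    size_t n = ZSTD_compress(out.data(), out.size(), in.data(), in.size(), level);
    return ZSTD_isError(n) ? 0 : n;
}

int main() {
    std::vector<uint8_t> original;  // bytes of the original file
    std::vector<uint8_t> expanded;  // bytes of the same file after running precomp on it
    const int level = 19;           // arbitrary level for the comparison

    std::printf("original: %zu -> %zu bytes\n", original.size(),
                zstd_compressed_size(original, level));
    std::printf("expanded: %zu -> %zu bytes\n", expanded.size(),
                zstd_compressed_size(expanded, level));
}
```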