Replies: 13 comments 4 replies
-
This is something I think would be super neat too, but it's unclear what the performance impact might be.
-
Yeah, it can potentially be a notable issue. I will note that there are ways to decompress files that use deflate, assuming whatever you're using can still work with a decompressed stream. AdvanceCOMP can handle gz, zip, and png files, for example, and qpdf can be used to decompress PDFs.
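For what it's worth, here is a minimal sketch (assuming zlib, with a made-up payload and guessed compression levels) of why decompressing a deflate stream is the easy part, while getting the original bytes back bit for bit is not:

```cpp
// Sketch only: round-trip a zlib-wrapped deflate stream and check whether
// naive recompression reproduces the original bytes. In practice it often
// will not, unless the original encoder's level/strategy are known exactly.
#include <zlib.h>

#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    const unsigned char original[] = "example payload example payload example payload";

    // 1) Compress once to simulate a pre-existing stream whose parameters we don't know.
    uLongf stream_len = compressBound(sizeof(original));
    std::vector<unsigned char> stream(stream_len);
    compress2(stream.data(), &stream_len, original, sizeof(original), 1);

    // 2) Decompressing it is the easy part.
    uLongf raw_len = sizeof(original);
    std::vector<unsigned char> raw(raw_len);
    uncompress(raw.data(), &raw_len, stream.data(), stream_len);

    // 3) Recompress with a guessed level and compare bit for bit.
    uLongf redo_len = compressBound(raw_len);
    std::vector<unsigned char> redo(redo_len);
    compress2(redo.data(), &redo_len, raw.data(), raw_len, 9);  // guessed parameters

    bool identical = redo_len == stream_len &&
                     std::equal(stream.begin(), stream.begin() + stream_len, redo.begin());
    std::printf("bit-exact recompression: %s\n", identical ? "yes" : "no");
}
```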
-
I have now, and re-compressing zip is pretty efficient.
-
Another alternative to precomp or AdvanceCOMP is xtool: https://github.com/Razor12911/xtool
-
I've been following along, but I'm unsure about what the actual feature request is. From what I've read, the idea seems to be that data would be run through something like precomp when the image is created, and upon accessing the archive in the mounted filesystem (or upon extraction), the data would be run through the equivalent of a precomp restore. I think this is an interesting idea, but there are a few issues I anticipate, random access being one of them.

Maybe the solution for the random access problem would be simply to not try to optimize for it. Rather, assume that archives will mostly be read sequentially, and if someone does random access, it's just going to be painfully slow. I do wonder how well maintained precomp and its potential alternatives are, given that the latest release of precomp is from early 2019.
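To make the random access concern a bit more concrete, here is a rough sketch (all names hypothetical, nothing to do with actual DwarFS internals) of why a read at an arbitrary offset of a restore-on-the-fly file is expensive without extra index data:

```cpp
// Hypothetical sketch (not DwarFS internals): serving a pread()-style request
// from a file that is stored in expanded form and must be re-compressed on
// read. Without per-block restart points, every random read has to replay the
// stream from offset 0 up to the requested range, i.e. O(offset) work per call.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Stand-in for "re-run the original compressor over the expanded data".
// A real deflate encoder is strictly sequential in the same way; this fake
// just emits a deterministic pattern so the sketch is self-contained.
struct sequential_recompressor {
    size_t pos = 0;
    void reset() { pos = 0; }
    size_t produce(uint8_t* buf, size_t want) {
        for (size_t i = 0; i < want; ++i) buf[i] = static_cast<uint8_t>(pos + i);
        pos += want;
        return want;
    }
};

size_t pread_restored(sequential_recompressor& rc, uint8_t* dst, size_t size, size_t offset) {
    rc.reset();
    std::vector<uint8_t> scratch(64 * 1024);
    size_t skipped = 0;
    while (skipped < offset) {  // the painful part: skip everything before `offset`
        size_t chunk = std::min(scratch.size(), offset - skipped);
        skipped += rc.produce(scratch.data(), chunk);
    }
    return rc.produce(dst, size);  // finally produce the bytes the caller asked for
}

int main() {
    sequential_recompressor rc;
    uint8_t buf[16];
    size_t n = pread_restored(rc, buf, sizeof(buf), 1 << 20);  // read 16 bytes at 1 MiB
    std::printf("read %zu bytes, first byte = %u\n", n, buf[0]);
}
```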
-
My own viewpoint is that this would be a useful thing, but not necessarily something that should live within dwarfs. I've looked at precomp's code and it doesn't seem to offer any form of clear API, so it would likely be a notable effort to make a filesystem that recompresses precomp'd files. I will say that this can be useful for, for example, folders with large numbers of deflated files, either in nonstandard containers or files that need to be kept in their original compressed state (where AdvanceCOMP or qpdf can't simply be used to decompress them).
-
Okay, this is likely a can of worms and certainly something that requires more thought. I'll move it to discussions for now.
-
I just had another look at the precomp code. Bits like this scare me:
The fact that it creates myriads of temp files scares me. The fact that it's basically a single 8.5k-line source file scares me. The fact that it's likely unmaintained scares me. As much as I like the idea and as much as I appreciate the cleverness that has gone into it, I don't see myself adding support for precomp to DwarFS. If this were a (preferably maintained) library with a streaming API, I'd definitely consider it. preflate looks kind of nice, but alas, it also seems unmaintained.
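For the record, a sketch of roughly the kind of streaming interface that would make such a library easy to embed; this is a hypothetical shape, not precomp's or preflate's actual API:

```cpp
// Hypothetical interface sketch: what a recompression library with a streaming
// API might look like from a filesystem's point of view. Neither precomp nor
// preflate exposes exactly this; it only illustrates the "feed chunks in, get
// chunks out, no temp files" requirement.
#include <cstdint>
#include <functional>
#include <span>

class stream_transform {
public:
    using sink = std::function<void(std::span<const uint8_t>)>;
    virtual ~stream_transform() = default;

    // Expand side (at image creation time): consume the compressed input
    // incrementally, emit expanded data plus whatever reconstruction metadata
    // is needed to reverse the step later.
    virtual void expand(std::span<const uint8_t> chunk, const sink& out) = 0;

    // Restore side (on read/extraction): consume expanded data plus metadata,
    // emit a bit-exact copy of the original stream.
    virtual void restore(std::span<const uint8_t> chunk, const sink& out) = 0;

    // Flush any buffered state at end of stream.
    virtual void finish(const sink& out) = 0;
};
```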
-
You can look into https://github.com/Razor12911/xtool; it has a streaming API.
-
It just occurred to me that at least adding
-
I've been reading this thread and, as much as I love precomp, I concur: it's far from ready to be used in something as precise and stable as dwarfs. Nor do I believe it is a good idea to include it as-is, because dwarfs is primarily a filesystem, not an archiver. The good news is that, at the most basic level, precomp is just a parser and a wrapper around external libraries. One could rewrite it in Python or pretty much any other language if need be, which I'm slowly doing in my own spare time. Or, as I believe to be the case here, just borrow some ideas and adapt its functionality.

How would one go about implementing this in dwarfs? That's probably the tricky part. As I see it, there are at least two major groups of 'transforms': the ones that need a file to be intact to work (JPEG or MP3, for example), and the ones you can feed chunks of arbitrary size without affecting functionality. So, in order to include these transforms, you'd probably need to rewrite a lot of important stuff. But! If you do, I can guarantee dwarfs files will be dramatically smaller, and depending on design choices, they won't even be any slower to access, at least compared with the current version when going for best compression. Some lossless transforms that can be included:

I'm leaving aside buggy stuff like packMP3 for obvious reasons (but those can be included if there is a round-trip check at compression time).

So, just before segmenting, there needs to be a parsing stage to find pre-processable stuff: JPEGs, ZIPs, and such. After applying the desired transform (optionally recursively), dwarfs can start deduping. Then, just before passing blocks to the final compressor, some other transforms are applied, for example BCJ2 or some fast rep finder like LZP (this one helps by increasing the amount of data that can fit in a single block).

About performance: even precomp right now is pretty fast at compressing and decompressing, especially if you consider that the ratio you can reach with it is on a whole other level than what you can hope for with 'normal' methods. This is, obviously, thinking about an archiver, not a real-time filesystem.

What do you think about all this @mhx?
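A minimal illustration of the 'parsing stage' mentioned above: scanning a buffer for well-known magic bytes (the JPEG SOI marker and the zip local file header) before segmenting. A real detector would obviously have to validate far more than the signature and determine the stream length; this is just a sketch:

```cpp
// Sketch of a pre-segmentation scan for candidate embedded streams.
// Only the magic bytes are checked here; a real implementation would
// parse headers to confirm the match and find where the stream ends.
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

struct candidate { size_t offset; const char* kind; };

std::vector<candidate> find_candidates(const uint8_t* data, size_t size) {
    std::vector<candidate> out;
    for (size_t i = 0; i + 4 <= size; ++i) {
        if (data[i] == 0xFF && data[i + 1] == 0xD8 && data[i + 2] == 0xFF) {
            out.push_back({i, "jpeg"});           // JPEG SOI marker plus next marker byte
        } else if (data[i] == 'P' && data[i + 1] == 'K' &&
                   data[i + 2] == 0x03 && data[i + 3] == 0x04) {
            out.push_back({i, "zip-local-file"}); // zip local file header (PK\x03\x04)
        }
    }
    return out;
}

int main() {
    const uint8_t blob[] = {0x00, 'P', 'K', 0x03, 0x04, 0x00, 0xFF, 0xD8, 0xFF, 0xE0};
    for (auto& c : find_candidates(blob, sizeof(blob)))
        std::printf("%s at offset %zu\n", c.kind, c.offset);
}
```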
-
There seems to be an actively developed class of programs called "file optimizers" that could lead us to a modern precomp replacement. Their goal is to recompress without quality loss. The part we need to add is reconstructing the poorly compressed original file, bit for bit. They have all the logic for identifying the file type from its data, and the glue needed to send it to the right decompressor. If those decompressors were able to return the original compression parameters and rebuild a bit-exact replica of the original file, then you would have a complete solution.

Here's an optimizer, written in Windows C++, with 90+ external compressors: https://nikkhokkho.sourceforge.io/static.php?page=FileOptimizer

And one for Linux:
-
As is, I agree with you. I see it as 2/3rds of the solution (identification and decompression). The unimplemented third is bitwise recompression. My initial thought was to analyze and extract the needed compression parameters, compressor version, etc. from each file type. But we'd still have a metadata issue: current time, date, file permissions, etc. would all have to be spoofed to get certain archivers to create an exact replica... Which led me to this eureka moment: we have a powerful deduplicator at our disposal, and the big difference between a file that we compress and the original compressed file is a small bit of metadata. Here's my idea:

It's a lot of work, but this technique would cover every past and future compression format in one shot. There could be a config file where users add custom external compressors, along with file extensions and common compression parameters. The list could include multiple versions of a single compressor as well. It could even handle encrypted files, as long as the dwarfs archive and config (with plaintext passwords) are secured or encrypted. That would be a huge benefit for modern-day Android backups where LUKS encryption is used. The gotcha formats will be those that stream-compress, with metadata mid-stream, and some that are multithreaded/indeterminate.

Update: I'm experimenting with zip now, and not having any luck with block matching yet. It is either a mid-stream metadata format, or my parameters aren't right. Zip would be one of the biggest formats to solve, because it is used in so many other formats (Android APKs, Microsoft docs, etc.)
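To illustrate the "let the deduplicator absorb the difference" idea, here's a hedged sketch: after recompressing the expanded payload with some guessed external configuration (not shown), record only the byte ranges where the result diverges from the original archive member, so everything else can dedupe against the recompressed data. All names here are made up:

```cpp
// Hypothetical sketch: compare the recompressed candidate against the original
// file and keep only the differing ranges (timestamps, header fields, ...), so
// that segmenting/dedup can reuse the matching bulk of the data.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct diff_range {
    size_t offset;                        // where the original diverges
    std::vector<uint8_t> original_bytes;  // bytes to patch back in on restore
};

// The original length must be stored alongside these ranges; restoring is then
// "recompress, truncate/extend to the original length, apply the patches".
std::vector<diff_range> diff_against_original(const std::vector<uint8_t>& original,
                                              const std::vector<uint8_t>& recompressed) {
    std::vector<diff_range> patches;
    size_t overlap = std::min(original.size(), recompressed.size());
    size_t i = 0;
    while (i < overlap) {
        if (original[i] == recompressed[i]) { ++i; continue; }
        size_t start = i;
        while (i < overlap && original[i] != recompressed[i]) ++i;
        patches.push_back({start, {original.begin() + start, original.begin() + i}});
    }
    if (original.size() > overlap) {  // original tail the recompressor didn't produce
        patches.push_back({overlap, {original.begin() + overlap, original.end()}});
    }
    return patches;
}
```

The assumption baked in here is that the external recompressor is deterministic; as noted above, the stream-compressing and multithreaded formats are exactly where that breaks down.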
-
Not sure if you know of it, but https://github.com/schnaader/precomp-cpp
Development stalled two years ago; it seems the MP3 compression has a critical bug left unfixed, but besides that nothing major from what I saw.
I have not tested its efficiency yet due to the lack of directory support, but of course dwarfs could take care of that since it compresses files one by one anyway.
I will test whether zstd benefits from this a bit later and send my results. Just wanted to share the idea first; I might not be aware of all the challenges in implementing it.
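In case it helps with that test, here's a small sketch using libzstd's one-shot API to compare how well the same data compresses before and after being expanded; how the two buffers are filled (running precomp, reading the files) is left out, and the level is an arbitrary choice:

```cpp
// Sketch: compare zstd's one-shot compression on the original file bytes vs.
// the precomp-expanded bytes. Filling the two buffers is outside this snippet.
#include <zstd.h>

#include <cstdint>
#include <cstdio>
#include <vector>

size_t zstd_compressed_size(const std::vector<uint8_t>& in, int level) {
    std::vector<uint8_t> out(ZSTD_compressBound(in.size()));
    size_t n = ZSTD_compress(out.data(), out.size(), in.data(), in.size(), level);
    return ZSTD_isError(n) ? 0 : n;
}

int main() {
    std::vector<uint8_t> original;  // bytes of the original file
    std::vector<uint8_t> expanded;  // bytes of the same file after running precomp on it
    const int level = 19;           // arbitrary level for the comparison

    std::printf("original: %zu -> %zu bytes\n", original.size(),
                zstd_compressed_size(original, level));
    std::printf("expanded: %zu -> %zu bytes\n", expanded.size(),
                zstd_compressed_size(expanded, level));
}
```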