[COMPRESS-706] Add support for reading LHA archive format #690
base: master
Conversation
garydgregory
left a comment
Thank you for the PR. I'll review once the builds are green.
Force-pushed bea59e4 to c87e781
Extracted from apache/commons-compress#690 and modified
garydgregory
left a comment
Hello @fkjellberg
The build is still failing.
fkjellberg
I'm working on fixing the build issues. Headers level 0 and 1 have a timestamp in MS-DOS format and I'm using … I'm not sure if we should make the information available somehow, e.g. in a flag, if the … Now that I finally fixed the time zone test failures, I found a lot of other build failures due to not using …
Force-pushed 0294e87 to ff63891
@garydgregory I think I fixed all the build issues now.
Hello @fkjellberg Would you please rebase on git master to avoid the one failing test? I'll review your question later. TY!
Force-pushed ff63891 to 27b359a
@garydgregory I rebased the PR.
garydgregory
left a comment
Hello @fkjellberg
Thank you for your updates.
If you see updates on git master, please rebase. The build in this branch has some failures unrelated to this new code, so you can ignore those. Changes in master may help, but nothing at the time I am writing this. The master builds are green.
I've left some comments throughout.
In general, when parsing input, you must assume an end-of-file can happen at any time in the input stream.
You can also assume this will be a new attack vector used to try to cause OutOfMemoryError and StackOverflowError in the VM, so see the call sites of MemoryLimitException for examples of how to guard the code.
TY!
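As a hedged illustration of that guidance (not code from this PR): a sketch of rejecting an untrusted length field before allocating, and of treating a short read as a hard error. MemoryLimitException and IOUtils.read are existing commons-compress APIs; the method and constant names below are made up.

import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import org.apache.commons.compress.MemoryLimitException;
import org.apache.commons.compress.utils.IOUtils;

// Hypothetical guard: extendedHeaderSize comes straight from the (untrusted) stream.
private static final int MAX_EXTENDED_HEADER_SIZE = 1 << 16;

private byte[] readExtendedHeader(final InputStream in, final int extendedHeaderSize) throws IOException {
    if (extendedHeaderSize < 0 || extendedHeaderSize > MAX_EXTENDED_HEADER_SIZE) {
        // Reject attacker-controlled sizes before allocating anything.
        throw new MemoryLimitException(extendedHeaderSize / 1024, MAX_EXTENDED_HEADER_SIZE / 1024);
    }
    final byte[] data = new byte[extendedHeaderSize];
    if (IOUtils.read(in, data) < extendedHeaderSize) {
        // EOF can happen at any point in a corrupted stream.
        throw new EOFException("Truncated extended header");
    }
    return data;
}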
 * supports lh4, lh5, lh6 and lh7 compression methods.
 *
 * This implementation is based on the documentation that can be found at
 * https://github.com/jca02266/lha/blob/master/Hacking_of_LHa
Please add a URL to a specification so maintainers know where to look for canonical definitions, especially if a spec version matters. A URL to a random GitHub page doesn't feel authoritative IMO.
I don't think a formal specification document exists for this compression format. I have not found any. This file contains some notes from someone who reverse-engineered the format; it's written in Japanese, so I used Google Translate to read and understand it. I removed this reference and I don't have anything better to replace it with.
Yes, I have that link in the javadoc for the archive parsing code in LhaArchiveInputStream. It only documents the archive format and not the LH5 compression method.
ppkarwasz
left a comment
Hi @fkjellberg,
Thanks for the PR! I haven’t had the chance to do a full review yet, but here are a few suggestions that could help improve it:
@ppkarwasz @garydgregory Thanks for the feedback. I'll work on fixing the raised issues in the coming days.
Force-pushed 27b359a to 042af17
Hi @ppkarwasz @garydgregory! I think I've fixed all the issues you raised now, except for the MemoryLimitException. I need to better understand how it's supposed to be used.
Force-pushed 09eb110 to c6834e7
garydgregory
left a comment
Hi @fkjellberg
Yeah, don't worry about the memory limit stuff. We need to look over the whole library.
There are some unresolved EOF issues though.
TY!
 * This is an implementation of a static Huffman compressor input stream for LHA files that
 * supports lh4, lh5, lh6 and lh7 compression methods.
 */
abstract class AbstractLhStaticHuffmanCompressorInputStream extends CompressorInputStream implements InputStreamStatistics {
Huffman coding is already used in several compressors in Commons Compress (DEFLATE, DEFLATE64, BZip2, Brotli, …), and so are variants of LZ77.
Rather than introducing new ad-hoc helpers for Huffman or LZSS, it would be better to refactor common logic into shared utilities. This helps reduce duplication and makes maintenance easier across codecs.
I’ve started work on a reusable Huffman helper in #701. It hasn’t yet been wired into DEFLATE64, but if it also fits LHA’s needs, we could merge that first and then build LHA on top of it.
For the LZSS side, could the implementation be expressed as extensions to lz77support instead of standalone classes?
Yes, I agree that the Huffman code from different compressors should be aligned for better code reuse. That was my plan to work on after this PR had been merged. You beat me to it. :-)
I reused the BinaryTree from the ExplodingInputStream class in the zip package, which is extremely similar to the LHx compressors in this PR. I think that the Huffman code from the zip package, this PR, Deflate64's HuffmanDecoder and the implementation in BZip2CompressorInputStream should be aligned and reuse as much code as possible. What seems to differ the most is how the tree is stored in the bit stream and how the Huffman tree is reconstructed during decompression, so I guess that has to be implementation specific.
I looked at the AbstractLZ77CompressorInputStream for reuse while working on this code, but decided not to use it. There is more similarity between this implementation and the ExplodingInputStream, so I think they are better candidates for future refactoring.
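As a neutral illustration of the shared logic being discussed (not code from this PR or from #701): a minimal sketch of turning a list of code lengths into canonical Huffman codes, which is the construction step these codecs have in common; how the lengths are stored in the bit stream is what stays format specific.

// Minimal sketch: build canonical Huffman codes from per-symbol code lengths.
// Symbols with length 0 get no code. Names here are illustrative.
static int[] canonicalCodes(final int[] codeLengths) {
    int maxLen = 0;
    for (final int len : codeLengths) {
        maxLen = Math.max(maxLen, len);
    }
    // Count how many symbols use each code length.
    final int[] lengthCount = new int[maxLen + 1];
    for (final int len : codeLengths) {
        if (len > 0) {
            lengthCount[len]++;
        }
    }
    // First code of each length, derived from the counts of shorter codes.
    final int[] nextCode = new int[maxLen + 1];
    int code = 0;
    for (int len = 1; len <= maxLen; len++) {
        code = (code + lengthCount[len - 1]) << 1;
        nextCode[len] = code;
    }
    // Assign codes to symbols in increasing symbol order.
    final int[] codes = new int[codeLengths.length];
    for (int symbol = 0; symbol < codeLengths.length; symbol++) {
        final int len = codeLengths[symbol];
        if (len > 0) {
            codes[symbol] = nextCode[len]++;
        }
    }
    return codes;
}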
final byte[] buffer = new byte[HEADER_GENERIC_MINIMUM_HEADER_LENGTH];
final int len = IOUtils.read(in, buffer);
if (len == 0) {
    // EOF
    return null;
} else if (len == 1 && buffer[0] == 0) {
    // Last byte of the file is zero indicating no more entries
    return null;
} else if (len < HEADER_GENERIC_MINIMUM_HEADER_LENGTH) {
    throw new ArchiveException("Invalid header length");
}

final ByteBuffer header = ByteBuffer.wrap(buffer).order(ByteOrder.LITTLE_ENDIAN);
I noticed that while most of the code uses ByteBuffer, there are still a few places where a byte[] is allocated and then wrapped. Would it make sense to consistently use ByteBuffer throughout?
As you pointed out in #690 (comment), ByteBuffer is useful here. If we stick with it exclusively, we could also have this class read directly from a ReadableByteChannel created from the InputStream:
ReadableByteChannel channel = Channels.newChannel(inputStream);
This would remove the need for intermediate byte[] allocations.
I don’t think it’s necessary to allocate multiple ByteBuffer instances here. As far as I understand:
- Level-0 through Level-2 base headers are at most 256 bytes.
- Extended headers for Level-1 and Level-2 are also limited to 2^16 bytes.
Given that, it should be sufficient to keep just two buffers as class fields: one for the base header and one for extended headers, instead of creating new buffers each time.
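A minimal sketch of the channel-based reading being suggested, reusing one ByteBuffer for the fixed-size base header; Channels, ReadableByteChannel and ByteBuffer are the JDK APIs (java.nio / java.nio.channels), HEADER_GENERIC_MINIMUM_HEADER_LENGTH is the constant from the quoted PR code, and the 256-byte capacity reflects the size limit stated above.

// Illustrative only: read the fixed-size base header through a channel instead
// of allocating a fresh byte[] per header.
final ReadableByteChannel channel = Channels.newChannel(inputStream);
final ByteBuffer baseHeader = ByteBuffer.allocate(256).order(ByteOrder.LITTLE_ENDIAN);

baseHeader.clear().limit(HEADER_GENERIC_MINIMUM_HEADER_LENGTH);
while (baseHeader.hasRemaining()) {
    if (channel.read(baseHeader) < 0) {
        break; // EOF before the header was complete; caller decides how to handle it
    }
}
baseHeader.flip();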
For level 1, you don't know the size of all extended headers upfront. You need to read them one by one to figure out how many they are and the final total size of the header. I collect them in a list to be able to calculate the CRC of the entire header later on.
For level 2, you'll find the total size of the header once you have read the minimal header (22 bytes) to figure out which header level is used.
For level 3 (not supported in this PR), the extended headers can be much larger.
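To make the level-1 flow concrete, a rough sketch of the loop being described, assuming (per the format notes referenced earlier) that each extended header ends with the two-byte little-endian size of the next one and a size of zero terminates the chain; readUInt16 and readBytes are hypothetical helpers.

// Rough sketch only: collect level-1 extended headers one by one so the total
// header size and its CRC can be computed afterwards.
final List<byte[]> extendedHeaders = new ArrayList<>();
int nextSize = readUInt16(in); // size of the first extended header, 0 means there is none
while (nextSize != 0) {
    if (nextSize < 2) {
        throw new IOException("Invalid extended header size " + nextSize);
    }
    final byte[] extendedHeader = readBytes(in, nextSize); // reads exactly nextSize bytes or throws EOFException
    extendedHeaders.add(extendedHeader); // kept so the CRC of the whole header can be computed later
    // The last two bytes (little-endian) give the size of the following extended header.
    nextSize = (extendedHeader[nextSize - 2] & 0xFF) | ((extendedHeader[nextSize - 1] & 0xFF) << 8);
}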
protected String getPathname(final ByteBuffer buffer, final int pathnameLength) {
    final byte[] pathnameBuffer = new byte[pathnameLength];
Am I correct that pathnameLength is always < 2¹⁶? If so, it might be helpful to note this in a comment: it makes it clear to readers that new byte[pathnameLength] won't risk creating an unexpectedly large buffer. Alternatively, you could use char as the type.
Yes, except for the level 3 header. Support for level 3 is not included in this PR, but I have it sitting in a branch for a future PR once I'm able to write some tests for it.
For level 0 and 1 headers (unless an extended header overrides it), the file length is at most 256 bytes.
The LHA archive format is a really old format mostly used on MS-DOS, Amiga and Atari back in the 80s and 90s. I think it makes sense to introduce a smaller hard limit here, e.g. max 2048 or 4096 bytes. What do you think, @ppkarwasz?
The exact value of the limit doesn't really matter, as long as it is reasonable and does not cause OOMs on corrupted streams.
libarchive uses a 1 MiB limit on the pathname, which is also fine. With @garydgregory we thought about introducing a library-wide limit, maybe based on FileSystem#getMaxPathLength().
I added a check that throws an error if the pathname is longer than 4096 bytes. I've downloaded a lot of files from Aminet and Funet for testing, and the longest pathname I've seen so far is 89 characters.
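A minimal sketch of such a guard, assuming it sits right before the pathname buffer allocation quoted earlier; the exact constant and message in the PR may differ.

if (pathnameLength > 4096) { // hard limit mentioned above; illustrative only
    throw new ArchiveException("Pathname length " + pathnameLength + " exceeds the 4096 byte limit");
}
final byte[] pathnameBuffer = new byte[pathnameLength];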
// Split the pathname into parts by 0xFF bytes
final StringBuilder pathnameStringBuilder = new StringBuilder();
int start = 0;
for (int i = 0; i < pathnameLength; i++) {
    if (pathnameBuffer[i] == (byte) 0xFF) {
        if (i > start) {
            // Decode the path segment into a string using the specified charset and append it to the result
            pathnameStringBuilder.append(new String(pathnameBuffer, start, i - start, getCharset())).append(fileSeparatorChar);
        }
        start = i + 1; // Move start to the next segment
    }
}
Since almost the entire class uses NIO, should we also use CharsetDecoder and CharBuffer here?
Possibly, but I can only find CharsetDecoder.decode(ByteBuffer), which decodes the entire ByteBuffer, with no way of selecting just a part of it. What am I missing?
Looking at the implementation, the decoder decodes the buffer from position() to limit(). We could use that with something like:
// Decode the pathname and replace any 0xFF bytes with the file separator char
final CharBuffer charBuffer = getCharBuffer().clear();
final CharsetDecoder decoder = getCharset().newDecoder();
int start = buffer.position();
final int limit = buffer.limit();
for (int i = start; i < limit; i++) {
if (buffer.get(i) == (byte) 0xFF) {
if (i > start) {
// Decode the path segment into the char buffer using the specified charset
decode(decoder, buffer.duplicate().position(start).limit(i), charBuffer);
charBuffer.append(fileSeparatorChar);
}
start = i + 1; // Move start to the next segment
buffer.position(start);
}
}

where the decode helper checks for encoding errors:
private void decode(final CharsetDecoder decoder, ByteBuffer in, CharBuffer out) throws ArchiveException {
throwIfError(decoder.decode(in, out, true));
throwIfError(decoder.flush(out));
}
private void throwIfError(final CoderResult result) throws ArchiveException {
if (result.isError()) {
try {
result.throwException();
} catch (final Exception e) {
throw new ArchiveException("Invalid path segment", (Throwable) e);
}
}
}
I don't think we should throw an error for decoding issues. There may be consumers that will never write the output to a file system but just want to consume the decompressed stream, e.g. Apache Tika or training AI models. Throwing an exception would force them to try a number of different encodings until finding the right one. The current code replaces unknown characters with the Unicode replacement character \uFFFD.
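If the CharsetDecoder route were taken anyway, the same lenient behaviour can be kept by configuring replacement instead of error reporting; a minimal sketch using only JDK APIs, where getCharset() and segmentBuffer (the ByteBuffer slice for one path segment) are taken from the surrounding discussion.

import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

// Replace malformed or unmappable bytes with U+FFFD instead of throwing,
// matching the behaviour described above. decode() still declares
// CharacterCodingException, but with REPLACE it will not be thrown in practice.
final CharsetDecoder decoder = getCharset().newDecoder()
        .onMalformedInput(CodingErrorAction.REPLACE)
        .onUnmappableCharacter(CodingErrorAction.REPLACE);
final String segment = decoder.decode(segmentBuffer).toString();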
Force-pushed c6834e7 to b56bfb5
Thanks for implementing LHA support. I have done a quick fuzzing run of the current implementation and found some files with unexpected exceptions. See the attached zip lha-fuzzing.zip. There might be duplicated errors in the files. I used the following code for the fuzz test:

try (LhaArchiveInputStream input = new LhaArchiveInputStream(new ByteArrayInputStream(data))) {
    LhaArchiveEntry entry;
    while ((entry = input.getNextEntry()) != null) {
        IOUtils.copyLarge(input, NullOutputStream.nullOutputStream());
    }
} catch (IOException e) {
    // ignore
}
Hi @theobisproject, thanks! Since this PR introduces an entire new archive format, it will be a marathon, not a sprint.
Force-pushed b56bfb5 to 0cd8e18
Hi @theobisproject, thanks for helping out with testing the PR. You exposed some weak spots in the code regarding handling of corrupt archives. I've added better checks now and all your corrupt archives now throw either ArchiveException or CompressorException.
Thanks for fixing the corrupt archives. I made a new run with the changes and found 6 more, all attached in the zip.
Force-pushed a52076d to 40a5373
Thanks for testing once again. Four of the files throw …
Force-pushed 40a5373 to 067d583
Thanks for your contribution to Apache Commons! Your help is appreciated!
Before you push a pull request, review this list:
Run a successful build using the default Maven goal with mvn; that's mvn on the command line by itself.

This is a read-only implementation of the LHA/LZH archive format supporting a subset (but the most commonly used) of the compression algorithms (lh0, lh4, lh5, lh6, lh7).
There is currently support for LHA header levels 0, 1 and 2. I added support for header level 3 but since I have not been able to create such archives myself, I removed the support for now. I may create another PR in the future adding it. It seems only the lha tool on OS/2 ever supported level 3 headers and I don't have access to that platform myself.
A large number of archives, mainly for the Amiga and Atari platforms, were downloaded from FUNET and AMINET, and most of them can be successfully decompressed. The small number of archives that don't work are corrupt in one way or another, and the lha tools I tried are not able to decompress them either. Some very old archives use lh1 compression, which is currently not supported by this implementation.
I tried to align the implementation with other similar decompression algorithms found in the commons-compress repo and reuse code wherever possible. The ExplodingInputStream from the zip implementation and the HuffmanDecoder from deflate64 are using similar compression algorithms.

I ended up reusing the CircularBuffer from the zip implementation and I've refactored it into the utils package to make it available for reuse. I added some checks to make sure parts of the buffer that have not been read yet were never overwritten and that distance never goes farther back than the size of the buffer. This is a separate commit in the PR and could possibly be reviewed and merged separately. The HuffmanDecoder from deflate64 is using DecodingMemory, which is similar to CircularBuffer, but it would require some refactoring of HuffmanDecoder to use CircularBuffer instead.
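A standalone sketch of the two guards described above (illustrative, not the refactored CircularBuffer itself): a back-reference may not reach further back than the buffer size, and a copy may not overwrite bytes that were produced but not yet read. Here buffer, writePos and available() stand for the ring buffer's storage, write position and count of unread bytes.

// Illustrative guarded LZSS back-reference copy in a ring buffer.
void copy(final int distance, final int length) throws IOException {
    if (distance <= 0 || distance > buffer.length) {
        throw new IOException("Illegal distance " + distance);
    }
    if (available() + length > buffer.length) {
        throw new IOException("Copy would overwrite bytes that have not been read yet");
    }
    int source = (writePos - distance + buffer.length) % buffer.length;
    for (int i = 0; i < length; i++) {
        buffer[writePos] = buffer[source];
        writePos = (writePos + 1) % buffer.length;
        source = (source + 1) % buffer.length;
    }
}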
The ExplodingInputStream and deflate64's HuffmanDecoder are both using binary trees for Huffman decoding. I ended up using the BinaryTree from ExplodingInputStream, but since the storage and construction of the actual tree differs, I copied the code into the LHA implementation and kept it package private for now. The common code of this class could possibly go to the utils package and the code to build the tree kept in each implementation package in a future refactoring. The HuffmanDecoder could also possibly use such a refactored BinaryTree implementation.

I also added a CRC-16 implementation. It is needed in both the archiver and the compressor packages so I put it in the utils package as a public class.
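Since the exact CRC variant is not spelled out above, here is a minimal sketch of the CRC-16 commonly documented for LHA (reflected polynomial 0xA001, initial value 0); the actual utility class in the PR may differ, for example by using a lookup table.

// Minimal bit-by-bit CRC-16 sketch (polynomial 0x8005 reflected = 0xA001, init 0x0000).
static int crc16(final byte[] data, final int offset, final int length) {
    int crc = 0x0000;
    for (int i = offset; i < offset + length; i++) {
        crc ^= data[i] & 0xFF;
        for (int bit = 0; bit < 8; bit++) {
            crc = (crc & 1) != 0 ? (crc >>> 1) ^ 0xA001 : crc >>> 1;
        }
    }
    return crc & 0xFFFF;
}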