Conversation

@fkjellberg (Contributor)

Thanks for your contribution to Apache Commons! Your help is appreciated!

Before you push a pull request, review this list:

  • Read the contribution guidelines for this project.
  • Read the ASF Generative Tooling Guidance if you use Artificial Intelligence (AI).
  • I used AI to create any part of, or all of, this pull request.
  • Run a successful build using the default Maven goal with mvn; that's mvn on the command line by itself.
  • Write unit tests that match behavioral changes, where the tests fail if the changes to the runtime are not applied. This may not always be possible, but it is a best-practice.
  • Write a pull request description that is detailed enough to understand what the pull request does, how, and why.
  • Each commit in the pull request should have a meaningful subject line and body. Note that a maintainer may squash commits during the merge process.

This is a read-only implementation of the LHA/LZH archive format supporting a subset (the most commonly used ones) of the compression algorithms (lh0, lh4, lh5, lh6, lh7).

There is currently support for LHA header levels 0, 1 and 2. I added support for header level 3 but since I have not been able to create such archives myself, I removed the support for now. I may create another PR in the future adding it. It seems only the lha tool on OS/2 ever supported level 3 headers and I don't have access to that platform myself.

A large number of archives were downloaded from FUNET and AMINET, mainly for the Amiga and Atari platforms, and most of them can be successfully decompressed. The small number of archives that don't work are corrupt in one way or another, and the lha tools I tried are not able to decompress them either. Some very old archives use lh1 compression, which is currently not supported by this implementation.

I tried to align the implementation with other similar decompression algorithms found in the commons-compress repo and to reuse code wherever possible. The ExplodingInputStream from the zip implementation and the HuffmanDecoder from deflate64 use similar compression algorithms.

I ended up reusing the CircularBuffer from the zip implementation and refactored it into the utils package to make it available for reuse. I added checks to make sure that parts of the buffer that have not been read yet are never overwritten and that a distance never reaches farther back than the size of the buffer. This is a separate commit in the PR and could possibly be reviewed and merged separately. The HuffmanDecoder from deflate64 uses a DecodingMemory that is similar to CircularBuffer, but it would require some refactoring of HuffmanDecoder to use CircularBuffer instead.
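
For illustration, the distance check is conceptually something like the following sketch; the class and method names here are made up and this is not the actual CircularBuffer API:

final class DistanceCheckSketch {
    private final int size; // capacity of the circular buffer

    DistanceCheckSketch(final int size) {
        this.size = size;
    }

    // Reject back-references that reach beyond the data the buffer can hold.
    void checkDistance(final int distance) {
        if (distance <= 0 || distance > size) {
            throw new IllegalArgumentException("Distance exceeds buffer size");
        }
    }
}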

The ExplodingInputStream and deflate64's HuffmanDecoder both use binary trees for Huffman decoding. I ended up using the BinaryTree from ExplodingInputStream, but since the storage and construction of the actual tree differ, I copied the code into the LHA implementation and kept it package-private for now. The common code of this class could possibly go to the utils package, with the code to build the tree kept in each implementation package in a future refactoring. The HuffmanDecoder could also possibly use such a refactored BinaryTree implementation.
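
At decode time all of these classes do essentially the same thing: read one bit at a time and walk the tree until a leaf is reached. A generic sketch with illustrative types, not the actual BinaryTree API:

import java.io.IOException;

final class TreeDecodeSketch {
    // Illustrative node type: a leaf carries a symbol, an inner node has two children.
    static final class Node {
        Node zero;       // child followed on a 0 bit
        Node one;        // child followed on a 1 bit
        int symbol = -1; // >= 0 only for leaves
    }

    interface BitSource {
        int nextBit() throws IOException; // returns 0 or 1
    }

    // Walk the tree one bit at a time until a leaf is reached.
    static int decodeSymbol(final Node root, final BitSource bits) throws IOException {
        Node node = root;
        while (node.symbol < 0) {
            node = bits.nextBit() == 0 ? node.zero : node.one;
        }
        return node.symbol;
    }
}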

I also added a CRC-16 implementation. It is needed in both the archiver and the compressor packages so I put it in the utils package as a public class.
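
For reference, reading an archive with the new classes looks roughly like the sketch below (the same shape as the fuzzing snippet later in this thread); the package name and the getName() accessor are assumptions made for illustration:

import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

// Assumed package; adjust to wherever the classes end up.
import org.apache.commons.compress.archivers.lha.LhaArchiveEntry;
import org.apache.commons.compress.archivers.lha.LhaArchiveInputStream;

public class LhaReadSketch {
    public static void main(final String[] args) throws IOException {
        try (InputStream in = new BufferedInputStream(Files.newInputStream(Paths.get(args[0])));
                LhaArchiveInputStream archive = new LhaArchiveInputStream(in)) {
            LhaArchiveEntry entry;
            while ((entry = archive.getNextEntry()) != null) {
                // The stream is positioned at the decompressed data of the current entry.
                System.out.println(entry.getName() + ": " + archive.readAllBytes().length + " bytes");
            }
        }
    }
}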

@garydgregory (Member) left a comment


@fkjellberg

Thank you for the PR. I'll review once the builds are green.

@fkjellberg force-pushed the COMPRESS-706-add-lha-support branch 2 times, most recently from bea59e4 to c87e781 (August 17, 2025 19:47)
garydgregory added a commit to apache/commons-codec that referenced this pull request Aug 17, 2025

@garydgregory (Member) left a comment


Hello @fkjellberg

The build is still failing.

@garydgregory (Member)

@fkjellberg
Please rebase on git master to pick up the latest changes, which include new JaCoCo settings.

@fkjellberg (Contributor, Author)

I'm working on fixing the build issues. Header levels 0 and 1 have a timestamp in MS-DOS format, and I'm using ZipUtil.dosToJavaTime() to convert it. That method uses the system default time zone because no other time zone is known. Header level 2 has a Unix timestamp based on the epoch. Header level 1 may also have an extended header with a Unix epoch timestamp that overwrites the original MS-DOS timestamp in the ArchiveEntry.

I'm not sure if we should make the information available somehow, e.g. in a flag, indicating whether ArchiveEntry.getLastModifiedDate() returns a java.util.Date based on the MS-DOS timestamp (system time zone) or on the Unix epoch (UTC). What do you think, @garydgregory?
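
For reference, the two conversion paths are roughly the following; ZipUtil.dosToJavaTime() is the existing helper mentioned above, while the wrapper class and method names are only illustrative:

import java.util.Date;

import org.apache.commons.compress.archivers.zip.ZipUtil;

final class LhaTimestampSketch {
    // Header levels 0 and 1: 32-bit MS-DOS date/time, interpreted in the
    // system default time zone by ZipUtil.dosToJavaTime().
    static Date fromDosTime(final long dosTime) {
        return ZipUtil.dosToJavaTime(dosTime);
    }

    // Header level 2 (and the level-1 extended header): seconds since the
    // Unix epoch, which is time-zone independent.
    static Date fromUnixTime(final long unixSeconds) {
        return new Date(unixSeconds * 1000L);
    }
}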

Now that I've finally fixed the time zone test failures, I've found a lot of other build failures due to not using final, etc. I will work on fixing them, but I'm not sure I'll be able to do it today.

@fkjellberg force-pushed the COMPRESS-706-add-lha-support branch 3 times, most recently from 0294e87 to ff63891 (August 18, 2025 20:17)

@fkjellberg (Contributor, Author)

@garydgregory I think I fixed all the build issues now.

@garydgregory (Member)

garydgregory commented Aug 20, 2025

Hello @fkjellberg

Would you please rebase on git master to avoid the one failing test?

I'll review your question later.

TY!

@fkjellberg force-pushed the COMPRESS-706-add-lha-support branch from ff63891 to 27b359a (August 20, 2025 16:05)

@fkjellberg (Contributor, Author)

@garydgregory I rebased the PR

@garydgregory (Member) left a comment


Hello @fkjellberg

Thank you for your updates.

If you see updates on git master, please rebase. The build in this branch has some failures unrelated to this new code, so you can ignore those. Changes in master may help, but there are none at the time I am writing this. The master builds are green.

I've left some comments throughout.

In general, when parsing input, you must assume an end-of-file can happen at any time in the input stream.

You can also assume this will be a new attack vector to try and cause OutOfMemoryError and StackOverflowError from the VM, so see the call sites of MemoryLimitException for examples on how to guard the code.
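
For illustration, a guard of this kind could look like the sketch below; the helper name and the limit are illustrative, and the call sites of MemoryLimitException show the real pattern used in the library:

import java.io.IOException;
import java.io.InputStream;

import org.apache.commons.compress.archivers.ArchiveException;
import org.apache.commons.compress.utils.IOUtils;

final class HeaderGuardSketch {
    private static final int MAX_HEADER_SIZE = 1 << 16; // arbitrary illustrative bound

    static byte[] readExactly(final InputStream in, final int declaredSize) throws ArchiveException, IOException {
        // Bound any size taken from the stream before allocating a buffer for it.
        if (declaredSize < 0 || declaredSize > MAX_HEADER_SIZE) {
            throw new ArchiveException("Declared header size " + declaredSize + " is out of range");
        }
        final byte[] buffer = new byte[declaredSize];
        final int read = IOUtils.read(in, buffer);
        // EOF can happen at any point, so a short read means a truncated archive.
        if (read < declaredSize) {
            throw new ArchiveException("Truncated header: expected " + declaredSize + " bytes, got " + read);
        }
        return buffer;
    }
}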

TY!

* supports lh4, lh5, lh6 and lh7 compression methods.
*
* This implementation is based on the documentation that can be found at
* https://github.com/jca02266/lha/blob/master/Hacking_of_LHa

Member

Please add a URL to a specification so maintainers know where to look for canonical definitions, especially if a spec version matters. A URL to a random GitHub page doesn't feel authoritative IMO.

Contributor Author

I don't think a formal specification document exists for this compression format; I have not found any. This file contains notes from someone who reverse-engineered the format, and it's written in Japanese, so I used Google Translate to be able to read and understand it. I removed this reference, and I don't have anything better to replace it with.

This comment was marked as outdated.

Contributor Author

Yes, I have that link in the javadoc for the archive parsing code in LhaArchiveInputStream. It only documents the archive format and not the LH5 compression method.

@ppkarwasz (Contributor) left a comment

Hi @fkjellberg,

Thanks for the PR! I haven’t had the chance to do a full review yet, but here are a few suggestions that could help improve it:

@fkjellberg (Contributor, Author)

@ppkarwasz @garydgregory Thanks for the feedback. I'll work on fixing the raised issues in the coming days.

@fkjellberg force-pushed the COMPRESS-706-add-lha-support branch from 27b359a to 042af17 (August 23, 2025 15:41)

@fkjellberg (Contributor, Author)

Hi @ppkarwasz, @garydgregory! I think I've fixed all the issues you raised now, except for the MemoryLimitException. I need to better understand how it's supposed to be used.

@fkjellberg force-pushed the COMPRESS-706-add-lha-support branch from 09eb110 to c6834e7 (August 24, 2025 20:26)

@garydgregory (Member) left a comment

Hi @fkjellberg
Yeah, don't worry about the memory limit stuff. We need to look over the whole library.
There are some unresolved EOF issues though.
TY!

* This is an implementation of a static Huffman compressor input stream for LHA files that
* supports lh4, lh5, lh6 and lh7 compression methods.
*/
abstract class AbstractLhStaticHuffmanCompressorInputStream extends CompressorInputStream implements InputStreamStatistics {

Contributor

Huffman coding is already used in several compressors in Commons Compress (DEFLATE, DEFLATE64, BZip2, Brotli, …), and so are variants of LZ77.

Rather than introducing new ad-hoc helpers for Huffman or LZSS, it would be better to refactor common logic into shared utilities. This helps reduce duplication and makes maintenance easier across codecs.

I’ve started work on a reusable Huffman helper in #701. It hasn’t yet been wired into DEFLATE64, but if it also fits LHA’s needs, we could merge that first and then build LHA on top of it.

For the LZSS side, could the implementation be expressed as extensions to lz77support instead of standalone classes?

Contributor Author

Yes, I agree that the Huffman code from the different compressors should be aligned for better code reuse. That was something I had planned to work on after this PR was merged. You beat me to it. :-)

I reused the BinaryTree from the ExplodingInputStream class in the zip package, which is extremely similar to the LHx compressors in this PR. I think the Huffman code from the zip package, this PR, deflate64's HuffmanDecoder and the implementation in Bzip2CompressorInputStream should be aligned to reuse as much code as possible. What seems to differ the most is how the tree is stored in the bit stream and how the Huffman tree is reconstructed during decompression, so I guess that has to be implementation-specific.

I looked at the AbstractLZ77CompressorInputStream for reuse while working on this code, but decided not to use it. There is more similarity between this implementation and the ExplodingInputStream, so I think they are better candidates for future refactoring.

* supports lh4, lh5, lh6 and lh7 compression methods.
*
* This implementation is based on the documentation that can be found at
* https://github.com/jca02266/lha/blob/master/Hacking_of_LHa

This comment was marked as outdated.

Comment on lines +217 to +289
final byte[] buffer = new byte[HEADER_GENERIC_MINIMUM_HEADER_LENGTH];
final int len = IOUtils.read(in, buffer);
if (len == 0) {
    // EOF
    return null;
} else if (len == 1 && buffer[0] == 0) {
    // Last byte of the file is zero indicating no more entries
    return null;
} else if (len < HEADER_GENERIC_MINIMUM_HEADER_LENGTH) {
    throw new ArchiveException("Invalid header length");
}

final ByteBuffer header = ByteBuffer.wrap(buffer).order(ByteOrder.LITTLE_ENDIAN);

Contributor

I noticed that while most of the code uses ByteBuffer, there are still a few places where a byte[] is allocated and then wrapped. Would it make sense to consistently use ByteBuffer throughout?

As you pointed out in #690 (comment), ByteBuffer is useful here. If we stick with it exclusively, we could also have this class read directly from a ReadableByteChannel created from the InputStream:

ReadableByteChannel channel = Channels.newChannel(inputStream);

This would remove the need for intermediate byte[] allocations.

I don’t think it’s necessary to allocate multiple ByteBuffer instances here. As far as I understand:

  • Level-0 through Level-2 base headers are at most 256 bytes.
  • Extended headers for Level-1 and Level-2 are also limited to 2^16 bytes.

Given that, it should be sufficient to keep just two buffers as class fields: one for the base header and one for extended headers, instead of creating new buffers each time.
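
A sketch of that shape, assuming IOUtils.readFully(ReadableByteChannel, ByteBuffer) from commons-compress and using illustrative buffer sizes and field names:

import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;

import org.apache.commons.compress.utils.IOUtils;

final class HeaderBufferSketch {
    private final ReadableByteChannel channel;
    // Two reusable little-endian buffers: one for base headers, one for extended headers.
    private final ByteBuffer baseHeader = ByteBuffer.allocate(256).order(ByteOrder.LITTLE_ENDIAN);
    private final ByteBuffer extendedHeader = ByteBuffer.allocate(1 << 16).order(ByteOrder.LITTLE_ENDIAN);

    HeaderBufferSketch(final InputStream in) {
        this.channel = Channels.newChannel(in);
    }

    ByteBuffer readBaseHeader(final int length) throws IOException {
        baseHeader.clear().limit(length);
        IOUtils.readFully(channel, baseHeader); // fills the buffer or fails on a truncated stream
        baseHeader.flip();
        return baseHeader;
    }
}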

Contributor Author

For level 1, you don't know the size of all extended headers upfront. You need to read them one by one to figure out how many they are and the final total size of the header. I collect them in a list to be able to calculate the CRC of the entire header later on.

For level 2, you'll find the total size of the header once you have read the minimal header (22 bytes) to figure out which header level is used.

For level 3 (not supported in this PR), the extended headers can be much larger.
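
For level 1 the read loop is roughly the sketch below; it assumes each extended header carries the size of the next one in its last two bytes (little-endian), so treat it as an illustration of the control flow rather than of the exact field layout:

import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

import org.apache.commons.compress.utils.IOUtils;

final class ExtendedHeaderLoopSketch {
    static List<byte[]> readExtendedHeaders(final InputStream in, final int firstSize) throws IOException {
        // Collected so the CRC of the entire header can be computed afterwards.
        final List<byte[]> headers = new ArrayList<>();
        int nextSize = firstSize;
        while (nextSize != 0) {
            if (nextSize < 2) {
                throw new IOException("Invalid extended header size: " + nextSize);
            }
            final byte[] extended = new byte[nextSize];
            if (IOUtils.read(in, extended) < nextSize) {
                throw new IOException("Truncated extended header");
            }
            headers.add(extended);
            // Assumption: the size of the next extended header sits in the last two bytes.
            nextSize = (extended[nextSize - 2] & 0xFF) | (extended[nextSize - 1] & 0xFF) << 8;
        }
        return headers;
    }
}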

Comment on lines 448 to 540
protected String getPathname(final ByteBuffer buffer, final int pathnameLength) {
    final byte[] pathnameBuffer = new byte[pathnameLength];

Contributor

Am I correct that pathnameLength is always < 2¹⁶? If so, it might be helpful to note this in a comment: it makes it clear to readers that new byte[pathnameLength] won't risk creating an unexpectedly large buffer. Alternatively, you could use char as the type.

Contributor Author

Yes, except for the level 3 header. Support for level 3 is not included in this PR, but I have it sitting in a branch for a future PR once I'm able to write some tests for it.

For level 0 and 1 headers (unless an extended header overrides it), the pathname length is at most 256 bytes.

The LHA archive format is a really old format mostly used on MS-DOS, Amiga and Atari back in the 80s and 90s. I think it makes sense to introduce a smaller hard limit here, e.g. a maximum of 2048 or 4096 bytes. What do you think, @ppkarwasz?

Contributor

The exact value of the limit doesn't really matter, as long as it is reasonable and does not cause OOMs on corrupted streams.

libarchive uses a 1 MiB limit on the pathname, which is also fine. With @garydgregory we thought about introducing a library-wide limit, maybe based on FileSystem#getMaxPathLength().

Contributor Author

I added a check and will throw an error if the pathname is longer than 4096 bytes. I've downloaded a lot of files from Aminet and Funet for testing and the longest pathname I've seen so far is 89 characters.
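
Presumably something along these lines; the constant name and message are illustrative:

import org.apache.commons.compress.archivers.ArchiveException;

final class PathnameLimitSketch {
    private static final int MAX_PATHNAME_LENGTH = 4096;

    // Reject pathname lengths outside a reasonable bound before allocating.
    static void checkPathnameLength(final int pathnameLength) throws ArchiveException {
        if (pathnameLength < 0 || pathnameLength > MAX_PATHNAME_LENGTH) {
            throw new ArchiveException("Pathname length " + pathnameLength + " exceeds the maximum of "
                    + MAX_PATHNAME_LENGTH + " bytes");
        }
    }
}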

Comment on lines +452 to +555
// Split the pathname into parts by 0xFF bytes
final StringBuilder pathnameStringBuilder = new StringBuilder();
int start = 0;
for (int i = 0; i < pathnameLength; i++) {
    if (pathnameBuffer[i] == (byte) 0xFF) {
        if (i > start) {
            // Decode the path segment into a string using the specified charset and append it to the result
            pathnameStringBuilder.append(new String(pathnameBuffer, start, i - start, getCharset())).append(fileSeparatorChar);
        }

        start = i + 1; // Move start to the next segment
    }
}

Contributor

Since almost the entire class uses NIO, should we also use CharsetDecoder and CharBuffer here?

Contributor Author

Possibly, but I only find CharsetDecoder.decode(ByteBuffer), which decodes the entire ByteBuffer, and I see no way of selecting just a part of it. What am I missing?

Contributor

Looking at the implementation, the decoder decodes the buffer from position() to limit(). We could use that with something like:

// Decode the pathname and replace any 0xFF bytes with the file separator char
final CharBuffer charBuffer = getCharBuffer().clear();
final CharsetDecoder decoder = getCharset().newDecoder();
int start = buffer.position();
for (int i = start; i < limit; i++) {
    if (buffer.get(i) == (byte) 0xFF) {
        if (i > start) {
            // Decode the path segment into the char buffer using the specified charset
            decode(decoder, buffer.duplicate().position(start).limit(i), charBuffer);
            charBuffer.append(fileSeparatorChar);
        }
        start = i + 1; // Move start to the next segment
        buffer.position(start);
    }
}

where the decode helper checks for encoding errors:

private void decode(final CharsetDecoder decoder, ByteBuffer in, CharBuffer out) throws ArchiveException {
    throwIfError(decoder.decode(in, out, true));
    throwIfError(decoder.flush(out));
}

private void throwIfError(final CoderResult result) throws ArchiveException {
    if (result.isError()) {
        try {
            result.throwException();
        } catch (final Exception e) {
            throw new ArchiveException("Invalid path segment", (Throwable) e);
        }
    }
}

Contributor Author

I don't think that we should throw an error for decoding issues. There may be consumers that never write the output to a file system but just want to consume the decompressed stream, e.g. Apache Tika or AI model training. Throwing an exception would force them to try a number of different encodings until they find the right one. The current code replaces unknown characters with the Unicode replacement character \uFFFD.
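
If the CharsetDecoder route is taken, the same lenient behaviour can be kept by configuring the decoder to replace malformed or unmappable input rather than report it, for example:

import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

final class LenientDecoderSketch {
    // Substitutes the replacement character instead of throwing on bad input.
    static CharsetDecoder lenientDecoder(final Charset charset) {
        return charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
    }
}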

@fkjellberg force-pushed the COMPRESS-706-add-lha-support branch from c6834e7 to b56bfb5 (August 29, 2025 18:02)

@theobisproject (Contributor)

Thanks for implementing LHA support. I have done a quick fuzzing run of the current implementation and found some files with unexpected exceptions. See the attached zip lha-fuzzing.zip. There might be duplicated errors in the files.

I used the following code for the fuzz test:

try (LhaArchiveInputStream input = new LhaArchiveInputStream(new ByteArrayInputStream(data))) {
    LhaArchiveEntry entry;
    while ((entry = input.getNextEntry()) != null) {
        IOUtils.copyLarge(input, NullOutputStream.nullOutputStream());
    }
} catch (IOException e) {
    // ignore
}

@ppkarwasz (Contributor)

Hi @theobisproject,

Thanks! Since this PR introduces an entire new archive format, it will be a marathon, not a sprint.
It would be nice if you could run the tests again once the obvious errors have been removed.

@fkjellberg force-pushed the COMPRESS-706-add-lha-support branch from b56bfb5 to 0cd8e18 (September 6, 2025 14:13)

@fkjellberg (Contributor, Author)

Hi @theobisproject,

Thanks for helping out with testing the PR. You exposed some weak spots in the code regarding the handling of corrupt archives. I've added better checks now, and all of your corrupt archives now throw either ArchiveException or CompressorException.

@theobisproject (Contributor)

Thanks for fixing the handling of the corrupt archives. I made a new run with the changes and found 6 more, all attached in the zip.
lha-fuzzing-2.zip

@fkjellberg force-pushed the COMPRESS-706-add-lha-support branch from a52076d to 40a5373 (September 16, 2025 18:52)

@fkjellberg (Contributor, Author)

Hi @theobisproject

Thanks for testing once again. Four of the files throw java.lang.IllegalArgumentException: Distance exceeds buffer size, which basically means that the compressed data stream is corrupt. The other two exposed an issue with parsing a corrupt level 2 header, and I've pushed a fix for that now.

@fkjellberg force-pushed the COMPRESS-706-add-lha-support branch from 40a5373 to 067d583 (September 20, 2025 14:14)

@ppkarwasz mentioned this pull request on Sep 25, 2025