[COMPRESS-706] Add support for reading LHA archive format #690
base: master
Conversation
garydgregory
left a comment
Thank you for the PR. I'll review once the builds are green.
Force-pushed bea59e4 to c87e781
Extracted from apache/commons-compress#690 and modified
garydgregory
left a comment
Hello @fkjellberg
The build is still failing.
fkjellberg
I'm working on fixing the build issues. Headers level 0 and 1 have a timestamp in MS-DOS format and I'm using … I'm not sure if we should make the information available somehow, e.g. in a flag, if the … Now that I finally fixed the time zone test failures, I found a lot of other build failures due to not using …
Force-pushed 0294e87 to ff63891
@garydgregory I think I fixed all the build issues now.
Hello @fkjellberg Would you please rebase on git master to avoid the one failing test? I'll review your question later. TY!
Force-pushed ff63891 to 27b359a
@garydgregory I rebased the PR.
garydgregory
left a comment
Hello @fkjellberg
Thank you for your updates.
If you see updates on git master, please rebase. The build in this branch has some failures unrelated to this new code, so you can ignore those. Changes in master may help, but nothing at the time I am writing this. The master builds are green.
I've left some comments throughout.
In general, when parsing input, you must assume an end-of-file can happen at any time in the input stream.
You can also assume this will be a new attack vector used to try to cause OutOfMemoryError and StackOverflowError in the VM, so see the call sites of MemoryLimitException for examples of how to guard the code.
TY!
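As a hedged illustration of that guidance (not code from this PR): a sketch of rejecting an untrusted length field before allocating, and of treating a short read as a hard error. MemoryLimitException and IOUtils.read are existing commons-compress APIs; the method and constant names below are made up.

import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import org.apache.commons.compress.MemoryLimitException;
import org.apache.commons.compress.utils.IOUtils;

// Hypothetical guard: extendedHeaderSize comes straight from the (untrusted) stream.
private static final int MAX_EXTENDED_HEADER_SIZE = 1 << 16;

private byte[] readExtendedHeader(final InputStream in, final int extendedHeaderSize) throws IOException {
    if (extendedHeaderSize < 0 || extendedHeaderSize > MAX_EXTENDED_HEADER_SIZE) {
        // Reject attacker-controlled sizes before allocating anything.
        throw new MemoryLimitException(extendedHeaderSize / 1024, MAX_EXTENDED_HEADER_SIZE / 1024);
    }
    final byte[] data = new byte[extendedHeaderSize];
    if (IOUtils.read(in, data) < extendedHeaderSize) {
        // EOF can happen at any point in a corrupted stream.
        throw new EOFException("Truncated extended header");
    }
    return data;
}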
 * supports lh4, lh5, lh6 and lh7 compression methods.
 *
 * This implementation is based on the documentation that can be found at
 * https://github.com/jca02266/lha/blob/master/Hacking_of_LHa
Please add a URL to a specification so maintainers know where to look for canonical definitions, especially if a spec version matters. A URL to a random GitHub page doesn't feel authoritative IMO.
I don't think a formal specification document exists for this compression format. I have not found any. This file contains some notes from someone who reverse-engineered the format; it's written in Japanese, so I used Google Translate to read and understand it. I removed this reference and I don't have anything better to replace it with.
Yes, I have that link in the javadoc for the archive parsing code in LhaArchiveInputStream. It only documents the archive format and not the LH5 compression method.
ppkarwasz
left a comment
Hi @fkjellberg,
Thanks for the PR! I haven’t had the chance to do a full review yet, but here are a few suggestions that could help improve it:
@ppkarwasz @garydgregory Thanks for the feedback. I'll work on fixing the raised issues in the coming days.
Force-pushed 27b359a to 042af17
Hi @ppkarwasz @garydgregory! I think I've fixed all the issues you raised now, except for the MemoryLimitException. I need to better understand how it's supposed to be used.
Force-pushed 09eb110 to c6834e7
garydgregory
left a comment
Hi @fkjellberg
Yeah, don't worry about the memory limit stuff. We need to look over the whole library.
There are some unresolved EOF issues though.
TY!
 * This is an implementation of a static Huffman compressor input stream for LHA files that
 * supports lh4, lh5, lh6 and lh7 compression methods.
 */
abstract class AbstractLhStaticHuffmanCompressorInputStream extends CompressorInputStream implements InputStreamStatistics {
Huffman coding is already used in several compressors in Commons Compress (DEFLATE, DEFLATE64, BZip2, Brotli, …), and so are variants of LZ77.
Rather than introducing new ad-hoc helpers for Huffman or LZSS, it would be better to refactor common logic into shared utilities. This helps reduce duplication and makes maintenance easier across codecs.
I’ve started work on a reusable Huffman helper in #701. It hasn’t yet been wired into DEFLATE64, but if it also fits LHA’s needs, we could merge that first and then build LHA on top of it.
For the LZSS side, could the implementation be expressed as extensions to lz77support instead of standalone classes?
Yes, I agree that the Huffman code from different compressors should be aligned for better code reuse. That was my plan to work on after this PR had been merged. You beat me to it. :-)
I reused the BinaryTree from the ExplodingInputStream class in the zip package, which is extremely similar to the LHx compressors in this PR. I think that the Huffman code from the zip package, this PR, Deflate64's HuffmanDecoder and the implementation in BZip2CompressorInputStream should be aligned and reuse as much code as possible. What seems to differ the most is how the tree is stored in the bit stream and how the Huffman tree is reconstructed during decompression, so I guess that has to be implementation specific.
I looked at the AbstractLZ77CompressorInputStream for reuse while working on this code, but decided not to use it. There is more similarity between this implementation and the ExplodingInputStream, so I think they are better candidates for future refactoring.
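As a neutral illustration of the shared logic being discussed (not code from this PR or from #701): a minimal sketch of turning a list of code lengths into canonical Huffman codes, which is the construction step these codecs have in common; how the lengths are stored in the bit stream is what stays format specific.

// Minimal sketch: build canonical Huffman codes from per-symbol code lengths.
// Symbols with length 0 get no code. Names here are illustrative.
static int[] canonicalCodes(final int[] codeLengths) {
    int maxLen = 0;
    for (final int len : codeLengths) {
        maxLen = Math.max(maxLen, len);
    }
    // Count how many symbols use each code length.
    final int[] lengthCount = new int[maxLen + 1];
    for (final int len : codeLengths) {
        if (len > 0) {
            lengthCount[len]++;
        }
    }
    // First code of each length, derived from the counts of shorter codes.
    final int[] nextCode = new int[maxLen + 1];
    int code = 0;
    for (int len = 1; len <= maxLen; len++) {
        code = (code + lengthCount[len - 1]) << 1;
        nextCode[len] = code;
    }
    // Assign codes to symbols in increasing symbol order.
    final int[] codes = new int[codeLengths.length];
    for (int symbol = 0; symbol < codeLengths.length; symbol++) {
        final int len = codeLengths[symbol];
        if (len > 0) {
            codes[symbol] = nextCode[len]++;
        }
    }
    return codes;
}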
final byte[] buffer = new byte[HEADER_GENERIC_MINIMUM_HEADER_LENGTH];
final int len = IOUtils.read(in, buffer);
if (len == 0) {
    // EOF
    return null;
} else if (len == 1 && buffer[0] == 0) {
    // Last byte of the file is zero indicating no more entries
    return null;
} else if (len < HEADER_GENERIC_MINIMUM_HEADER_LENGTH) {
    throw new ArchiveException("Invalid header length");
}

final ByteBuffer header = ByteBuffer.wrap(buffer).order(ByteOrder.LITTLE_ENDIAN);
I noticed that while most of the code uses ByteBuffer, there are still a few places where a byte[] is allocated and then wrapped. Would it make sense to consistently use ByteBuffer throughout?
As you pointed out in #690 (comment), ByteBuffer is useful here. If we stick with it exclusively, we could also have this class read directly from a ReadableByteChannel created from the InputStream:
ReadableByteChannel channel = Channels.newChannel(inputStream);
This would remove the need for intermediate byte[] allocations.
I don’t think it’s necessary to allocate multiple ByteBuffer instances here. As far as I understand:
- Level-0 through Level-2 base headers are at most 256 bytes.
- Extended headers for Level-1 and Level-2 are also limited to 2^16 bytes.
Given that, it should be sufficient to keep just two buffers as class fields: one for the base header and one for extended headers, instead of creating new buffers each time.
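A minimal sketch of the channel-based reading being suggested, reusing one ByteBuffer for the fixed-size base header; Channels, ReadableByteChannel and ByteBuffer are the JDK APIs (java.nio / java.nio.channels), HEADER_GENERIC_MINIMUM_HEADER_LENGTH is the constant from the quoted PR code, and the 256-byte capacity reflects the size limit stated above.

// Illustrative only: read the fixed-size base header through a channel instead
// of allocating a fresh byte[] per header.
final ReadableByteChannel channel = Channels.newChannel(inputStream);
final ByteBuffer baseHeader = ByteBuffer.allocate(256).order(ByteOrder.LITTLE_ENDIAN);

baseHeader.clear().limit(HEADER_GENERIC_MINIMUM_HEADER_LENGTH);
while (baseHeader.hasRemaining()) {
    if (channel.read(baseHeader) < 0) {
        break; // EOF before the header was complete; caller decides how to handle it
    }
}
baseHeader.flip();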
For level 1, you don't know the size of all extended headers upfront. You need to read them one by one to figure out how many they are and the final total size of the header. I collect them in a list to be able to calculate the CRC of the entire header later on.
For level 2, you'll find the total size of the header once you have read the minimal header (22 bytes) to figure out which header level is used.
For level 3 (not supported in this PR), the extended headers can be much larger.
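To make the level-1 flow concrete, a rough sketch of the loop being described, assuming (per the format notes referenced earlier) that each extended header ends with the two-byte little-endian size of the next one and a size of zero terminates the chain; readUInt16 and readBytes are hypothetical helpers.

// Rough sketch only: collect level-1 extended headers one by one so the total
// header size and its CRC can be computed afterwards.
final List<byte[]> extendedHeaders = new ArrayList<>();
int nextSize = readUInt16(in); // size of the first extended header, 0 means there is none
while (nextSize != 0) {
    if (nextSize < 2) {
        throw new IOException("Invalid extended header size " + nextSize);
    }
    final byte[] extendedHeader = readBytes(in, nextSize); // reads exactly nextSize bytes or throws EOFException
    extendedHeaders.add(extendedHeader); // kept so the CRC of the whole header can be computed later
    // The last two bytes (little-endian) give the size of the following extended header.
    nextSize = (extendedHeader[nextSize - 2] & 0xFF) | ((extendedHeader[nextSize - 1] & 0xFF) << 8);
}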
protected String getPathname(final ByteBuffer buffer, final int pathnameLength) {
    final byte[] pathnameBuffer = new byte[pathnameLength];
Am I correct that pathnameLength is always < 2¹⁶? If so, it might be helpful to note this in a comment: it makes it clear to readers that new byte[pathnameLength] won't risk creating an unexpectedly large buffer. Alternatively, you could use char as the type.
Yes, except for the level 3 header. Support for level 3 is not included in this PR, but I have it sitting in a branch for a future PR once I'm able to write some tests for it.
For level 0 and 1 headers (unless an extended header overrides it), the file length is at most 256 bytes.
The LHA archive format is a really old format mostly used on MS-DOS, Amiga and Atari back in the 80s and 90s. I think it makes sense to introduce a smaller hard limit here, e.g. max 2048 or 4096 bytes. What do you think, @ppkarwasz?
The exact value of the limit doesn't really matter, as long as it is reasonable and does not cause OOMs on corrupted streams.
libarchive uses a 1 MiB limit on the pathname, which is also fine. With @garydgregory we thought about introducing a library-wide limit, maybe based on FileSystem#getMaxPathLength().
I added a check that throws an error if the pathname is longer than 4096 bytes. I've downloaded a lot of files from Aminet and Funet for testing, and the longest pathname I've seen so far is 89 characters.
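A minimal sketch of such a guard, assuming it sits right before the pathname buffer allocation quoted earlier; the exact constant and message in the PR may differ.

if (pathnameLength > 4096) { // hard limit mentioned above; illustrative only
    throw new ArchiveException("Pathname length " + pathnameLength + " exceeds the 4096 byte limit");
}
final byte[] pathnameBuffer = new byte[pathnameLength];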
// Split the pathname into parts by 0xFF bytes
final StringBuilder pathnameStringBuilder = new StringBuilder();
int start = 0;
for (int i = 0; i < pathnameLength; i++) {
    if (pathnameBuffer[i] == (byte) 0xFF) {
        if (i > start) {
            // Decode the path segment into a string using the specified charset and append it to the result
            pathnameStringBuilder.append(new String(pathnameBuffer, start, i - start, getCharset())).append(fileSeparatorChar);
        }
        start = i + 1; // Move start to the next segment
    }
}
Since almost the entire class uses NIO, should we also use CharsetDecoder and CharBuffer here?
Possibly, but I can only find CharsetDecoder.decode(ByteBuffer), which decodes the entire ByteBuffer, with no way of selecting just a part of it. What am I missing?
Looking at the implementation, the decoder decodes the buffer from position() to limit(). We could use that with something like:
// Decode the pathname and replace any 0xFF bytes with the file separator char
final CharBuffer charBuffer = getCharBuffer().clear();
final CharsetDecoder decoder = getCharset().newDecoder();
int start = buffer.position();
final int limit = buffer.limit();
for (int i = start; i < limit; i++) {
if (buffer.get(i) == (byte) 0xFF) {
if (i > start) {
// Decode the path segment into the char buffer using the specified charset
decode(decoder, buffer.duplicate().position(start).limit(i), charBuffer);
charBuffer.append(fileSeparatorChar);
}
start = i + 1; // Move start to the next segment
buffer.position(start);
}
}

where the decode helper checks for encoding errors:
private void decode(final CharsetDecoder decoder, ByteBuffer in, CharBuffer out) throws ArchiveException {
throwIfError(decoder.decode(in, out, true));
throwIfError(decoder.flush(out));
}
private void throwIfError(final CoderResult result) throws ArchiveException {
if (result.isError()) {
try {
result.throwException();
} catch (final Exception e) {
throw new ArchiveException("Invalid path segment", (Throwable) e);
}
}
}
I don't think we should throw an error for decoding issues. There may be consumers that will never write the output to a file system but just want to consume the decompressed stream, e.g. Apache Tika or training AI models. Throwing an exception would force them to try a number of different encodings until finding the right one. The current code replaces unknown characters with the Unicode replacement character \uFFFD.
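If the CharsetDecoder route were taken anyway, the same lenient behaviour can be kept by configuring replacement instead of error reporting; a minimal sketch using only JDK APIs, where getCharset() and segmentBuffer (the ByteBuffer slice for one path segment) are taken from the surrounding discussion.

import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

// Replace malformed or unmappable bytes with U+FFFD instead of throwing,
// matching the behaviour described above. decode() still declares
// CharacterCodingException, but with REPLACE it will not be thrown in practice.
final CharsetDecoder decoder = getCharset().newDecoder()
        .onMalformedInput(CodingErrorAction.REPLACE)
        .onUnmappableCharacter(CodingErrorAction.REPLACE);
final String segment = decoder.decode(segmentBuffer).toString();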
Force-pushed c6834e7 to b56bfb5
Thanks for implementing LHA support. I have done a quick fuzzing run of the current implementation and found some files with unexpected exceptions. See the attached zip lha-fuzzing.zip. There might be duplicated errors in the files. I used the following code for the fuzz test:

try (LhaArchiveInputStream input = new LhaArchiveInputStream(new ByteArrayInputStream(data))) {
    LhaArchiveEntry entry;
    while ((entry = input.getNextEntry()) != null) {
        IOUtils.copyLarge(input, NullOutputStream.nullOutputStream());
    }
} catch (IOException e) {
    // ignore
}
Hi @theobisproject, thanks! Since this PR introduces an entire new archive format, it will be a marathon, not a sprint.
Force-pushed b56bfb5 to 0cd8e18
Hi @theobisproject, thanks for helping out with testing the PR. You exposed some weak spots in the code regarding handling of corrupt archives. I've added better checks now and all your corrupt archives now throw either ArchiveException or CompressorException.
Thanks for fixing the corrupt archives. I made a new run with the changes and found 6 more, all attached in the zip.
Force-pushed a52076d to 40a5373
Thanks for testing once again. Four of the files throw …
Force-pushed 40a5373 to 067d583
Thanks for your contribution to Apache Commons! Your help is appreciated!
Before you push a pull request, review this list:
Run a successful build using the default Maven goal with mvn; that's mvn on the command line by itself.

This is a read-only implementation of the LHA/LZH archive format supporting a subset (but the most commonly used) of the compression algorithms (lh0, lh4, lh5, lh6, lh7).
There is currently support for LHA header levels 0, 1 and 2. I added support for header level 3 but since I have not been able to create such archives myself, I removed the support for now. I may create another PR in the future adding it. It seems only the lha tool on OS/2 ever supported level 3 headers and I don't have access to that platform myself.
A large number of archives, mainly for the Amiga and Atari platforms, were downloaded from FUNET and AMINET, and most of them can be successfully decompressed. The small number of archives that don't work are corrupt in one way or another, and the lha tools I tried are not able to decompress them either. Some very old archives use lh1 compression, which is currently not supported by this implementation.
I tried to align the implementation with other similar decompression algorithms found in the commons-compress repo and reuse code wherever possible. The ExplodingInputStream from the zip implementation and the HuffmanDecoder from deflate64 are using similar compression algorithms.

I ended up reusing the CircularBuffer from the zip implementation and I've refactored it into the utils package to make it available for reuse. I added some checks to make sure parts of the buffer that have not been read yet were never overwritten and that distance never goes farther back than the size of the buffer. This is a separate commit in the PR and could possibly be reviewed and merged separately. The HuffmanDecoder from deflate64 is using DecodingMemory, which is similar to CircularBuffer, but it would require some refactoring of HuffmanDecoder to use CircularBuffer instead.
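A standalone sketch of the two guards described above (illustrative, not the refactored CircularBuffer itself): a back-reference may not reach further back than the buffer size, and a copy may not overwrite bytes that were produced but not yet read. Here buffer, writePos and available() stand for the ring buffer's storage, write position and count of unread bytes.

// Illustrative guarded LZSS back-reference copy in a ring buffer.
void copy(final int distance, final int length) throws IOException {
    if (distance <= 0 || distance > buffer.length) {
        throw new IOException("Illegal distance " + distance);
    }
    if (available() + length > buffer.length) {
        throw new IOException("Copy would overwrite bytes that have not been read yet");
    }
    int source = (writePos - distance + buffer.length) % buffer.length;
    for (int i = 0; i < length; i++) {
        buffer[writePos] = buffer[source];
        writePos = (writePos + 1) % buffer.length;
        source = (source + 1) % buffer.length;
    }
}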
The ExplodingInputStream and deflate64's HuffmanDecoder are both using binary trees for Huffman decoding. I ended up using the BinaryTree from ExplodingInputStream, but since the storage and construction of the actual tree differs, I copied the code into the LHA implementation and kept it package private for now. The common code of this class could possibly go to the utils package and the code to build the tree kept in each implementation package in a future refactoring. The HuffmanDecoder could also possibly use such a refactored BinaryTree implementation.

I also added a CRC-16 implementation. It is needed in both the archiver and the compressor packages so I put it in the utils package as a public class.
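Since the exact CRC variant is not spelled out above, here is a minimal sketch of the CRC-16 commonly documented for LHA (reflected polynomial 0xA001, initial value 0); the actual utility class in the PR may differ, for example by using a lookup table.

// Minimal bit-by-bit CRC-16 sketch (polynomial 0x8005 reflected = 0xA001, init 0x0000).
static int crc16(final byte[] data, final int offset, final int length) {
    int crc = 0x0000;
    for (int i = offset; i < offset + length; i++) {
        crc ^= data[i] & 0xFF;
        for (int bit = 0; bit < 8; bit++) {
            crc = (crc & 1) != 0 ? (crc >>> 1) ^ 0xA001 : crc >>> 1;
        }
    }
    return crc & 0xFFFF;
}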