Skip to content

Fix concurrency issue in IOUtils.skip#801

Merged
garydgregory merged 4 commits intomasterfrom
fix/io-utils-skip
Oct 17, 2025
Merged

Fix concurrency issue in IOUtils.skip#801
garydgregory merged 4 commits intomasterfrom
fix/io-utils-skip

Conversation

@ppkarwasz
Copy link
Contributor

This patch addresses a concurrency problem in IOUtils.skip, as reported in COMPRESS-666 and COMPRESS-697.

Previously, IOUtils.skip relied on InputStream#read to skip bytes, using a buffer shared across all threads. Although IOUtils.skip itself does not consume the data read, certain InputStream implementations (e.g. ChecksumInputStream) may process that data internally.

In concurrent scenarios, this shared buffer could be overwritten by another thread between the read and the subsequent internal processing (such as checksum calculation), leading to incorrect behavior.

This change reverts commit c12eaff and restores the use of a per-thread buffer in IOUtils.skip, ensuring thread safety and correct behavior in concurrent environments.

This patch addresses a concurrency problem in `IOUtils.skip`, as reported in [COMPRESS-666](https://issues.apache.org/jira/browse/COMPRESS-666) and [COMPRESS-697](https://issues.apache.org/jira/browse/COMPRESS-697).

Previously, `IOUtils.skip` relied on `InputStream#read` to skip bytes, using a buffer shared across **all** threads. Although `IOUtils.skip` itself does not consume the data read, certain `InputStream` implementations (e.g. `ChecksumInputStream`) may process that data internally.

In concurrent scenarios, this shared buffer could be overwritten by another thread between the `read` and the subsequent internal processing (such as checksum calculation), leading to incorrect behavior.

This change reverts commit c12eaff and restores the use of a **per-thread buffer** in `IOUtils.skip`, ensuring thread safety and correct behavior in concurrent environments.
@ppkarwasz
Copy link
Contributor Author

@garydgregory,

Do you happen to recall the rationale behind commit c908984, which zeroes out the per-thread buffers before each use? I’m not seeing a clear reason for this extra step, since:

  • the buffers are private,
  • they’re only used by skip and copyLarge,
  • and in both cases they appear to be properly initialized with data before being read.

@garydgregory
Copy link
Member

Hello @ppkarwasz
Thank you for the PR. I'm OK with it but I want to note that we've gone back and forth on this one in the past. Do you think this addresses https://issues.apache.org/jira/browse/IO-802 ?

@ppkarwasz
Copy link
Contributor Author

I believe this PR does address a key part of IO-802: namely, it prevents shared buffer mutation between skipFully() (or skip) and concurrent reads of decompressing streams, which was the root of the multithreading failure observed.

I’m not entirely sure how InflaterInputStream itself uses the buffer, but its subclasses like GZIPInputStream and ZipInputStream definitely update a CRC as they read data. If another thread were to perform a skip operation on a different InputStream instance while sharing the same buffer, it could corrupt the data being validated and cause CRC verification to fail.

However, I don’t think this PR completely closes every corner case. In particular:

  1. Reentrancy / nested skip calls
    Some streams (for example from Commons Compress, such as GzipCompressorInputStream) may internally call IOUtils.skip(...) or skipFully(...) from within their own read(...) implementations.
    Even with a ThreadLocal that internal skip uses the same buffer as an external skip.

  2. Thread-local / buffer ownership tracking
    To guard against those cases, we should adopt a pattern similar to Log4j’s: maintain a flag to detect whether the local buffer is in use.

I’m happy to adapt this PR to follow that pattern.

@garydgregory
Copy link
Member

garydgregory commented Oct 17, 2025

Hi @ppkarwasz

Do you happen to recall the rationale behind commit c908984, which zeroes out the per-thread buffers before each use?

There was a concern that in a multi-tenant server/application, when multiple applications are served (sequentially, or not) on reused threads, one app might be able to read data written by a different app, creating a security issue.

I see:

        Arrays.fill(SCRATCH_BYTE_BUFFER_WO, (byte) 0);
        Arrays.fill(SCRATCH_CHAR_BUFFER_WO, (char) 0);

and calls to fill0() have been removed.

I think we need to keep zeroing out data to be on the safe side, unless you see something I don't.

I’m happy to adapt this PR to follow that pattern.

Let's not make it more complex for now.

@ppkarwasz
Copy link
Contributor Author

I see:

        Arrays.fill(SCRATCH_BYTE_BUFFER_WO, (byte) 0);
        Arrays.fill(SCRATCH_CHAR_BUFFER_WO, (char) 0);

and calls to fill0() have been removed.

These were only removed, because the buffers themselves have been removed.

I’m happy to adapt this PR to follow that pattern.

Let's not make it more complex for now.

Adding a reentrancy guard to the thread-local buffers is a simple and well-established pattern: we’ve been using it in Log4j for years. I implemented it in 0bf22e0 by:

  • Moving the thread locals into a dedicated class, so they are only instantiated on first use (lazy initialization).
  • Using only JDK-provided classes to avoid potential memory leaks in application server environments.

@ppkarwasz ppkarwasz requested a review from Copilot October 17, 2025 12:28
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes a concurrency issue in IOUtils.skip by implementing per-thread scratch buffers instead of shared global buffers. The change addresses race conditions where multiple threads could corrupt each other's data when using checksum-aware InputStreams.

Key changes:

  • Introduced ScratchBufferHolder with per-thread buffer management and reentrant access support
  • Replaced shared global buffers with thread-local buffers that require explicit acquisition and release
  • Added comprehensive concurrent testing for various IOUtils methods

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
src/main/java/org/apache/commons/io/IOUtils.java Implements new ScratchBufferHolder class and updates buffer usage throughout IOUtils methods
src/main/java/org/apache/commons/io/CopyUtils.java Updates to use new buffer acquisition/release pattern
src/test/java/org/apache/commons/io/IOCaseTest.java Updates existing tests to use new buffer API
src/test/java/org/apache/commons/io/IOUtilsConcurrentTest.java Adds comprehensive concurrent testing for IOUtils methods
src/changes/changes.xml Documents the concurrency fix

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

ppkarwasz and others added 2 commits October 17, 2025 14:33
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@garydgregory garydgregory merged commit f76ee12 into master Oct 17, 2025
21 checks passed
@garydgregory
Copy link
Member

Merged, ty @ppkarwasz

@ppkarwasz ppkarwasz deleted the fix/io-utils-skip branch October 18, 2025 20:33
ppkarwasz added a commit that referenced this pull request Oct 18, 2025
Commit 0698bd9 introduced convenient `AutoCloseable` usage for `ScratchBytes` and `ScratchChars`. However, it also introduced a **classloader memory leak risk** in application server environments by storing custom wrapper instances directly in a `ThreadLocal`.

This PR keeps the ergonomic `AutoCloseable` pattern while eliminating the classloader leak risk:

* Store **only primitive buffers** (`byte[]` / `char[]`) in the `ThreadLocal`, not custom classes.
* Introduce two types of `ScratchBytes` / `ScratchChars` instances:

  * **Global instance** (`buffer == null`) that fetches its buffer from the `ThreadLocal`.
  * **Reentrant instances** (`buffer != null`) for nested usage without interfering with shared buffers.

**Note:** While this revision keeps the readability of using the `AutoCloseable` API, it also introduces a performance regression compared to the original #801 design: retrieving a buffer now requires two `ThreadLocal` lookups: once in `get()` and once in `array()`. The original design avoided this overhead intentionally. Since these classes are package-private and used in performance-sensitive paths, we should carefully weigh the trade-off between API convenience and runtime cost.
garydgregory pushed a commit that referenced this pull request Oct 19, 2025
Commit 0698bd9 introduced convenient `AutoCloseable` usage for `ScratchBytes` and `ScratchChars`. However, it also introduced a **classloader memory leak risk** in application server environments by storing custom wrapper instances directly in a `ThreadLocal`.

This PR keeps the ergonomic `AutoCloseable` pattern while eliminating the classloader leak risk:

* Store **only primitive buffers** (`byte[]` / `char[]`) in the `ThreadLocal`, not custom classes.
* Introduce two types of `ScratchBytes` / `ScratchChars` instances:

  * **Global instance** (`buffer == null`) that fetches its buffer from the `ThreadLocal`.
  * **Reentrant instances** (`buffer != null`) for nested usage without interfering with shared buffers.

**Note:** While this revision keeps the readability of using the `AutoCloseable` API, it also introduces a performance regression compared to the original #801 design: retrieving a buffer now requires two `ThreadLocal` lookups: once in `get()` and once in `array()`. The original design avoided this overhead intentionally. Since these classes are package-private and used in performance-sensitive paths, we should carefully weigh the trade-off between API convenience and runtime cost.
ppkarwasz added a commit to apache/commons-compress that referenced this pull request Oct 19, 2025
This change deprecates `readFully` and `skip` in `o.a.c.compress.utils.IOUtils` in favor of their Commons IO counterparts. Beyond code reuse, this offers two benefits:

* `readFully` had several overloads with inconsistent semantics. All read as many bytes as possible (like `read`), but only one threw on EOF (as `readFully` should).
* `skip` previously used a per-call byte buffer, unlike the Commons IO version. Since apache/commons-io#801 upstreamed the concurrency fix, this workaround is no longer needed.

**Note**: As `o.a.c.compress.utils.IOUtils` is now rarely used, it is always referenced by FQCN to avoid confusion.
ppkarwasz added a commit to apache/commons-compress that referenced this pull request Oct 19, 2025
This change deprecates `readFully` and `skip` in `o.a.c.compress.utils.IOUtils` in favor of their Commons IO counterparts. Beyond code reuse, this offers two benefits:

* `readFully` had several overloads with inconsistent semantics. All read as many bytes as possible (like `read`), but only one threw on EOF (as `readFully` should).
* `skip` previously used a per-call byte buffer, unlike the Commons IO version. Since apache/commons-io#801 upstreamed the concurrency fix, this workaround is no longer needed.

**Note**: As `o.a.c.compress.utils.IOUtils` is now rarely used, it is always referenced by FQCN to avoid confusion.
ppkarwasz added a commit to apache/commons-compress that referenced this pull request Oct 19, 2025
This change deprecates `readFully` and `skip` in `o.a.c.compress.utils.IOUtils` in favor of their Commons IO counterparts. Beyond code reuse, this offers two benefits:

* `readFully` had several overloads with inconsistent semantics. All read as many bytes as possible (like `read`), but only one threw on EOF (as `readFully` should).
* `skip` previously used a per-call byte buffer, unlike the Commons IO version. Since apache/commons-io#801 upstreamed the concurrency fix, this workaround is no longer needed.

**Note**: As `o.a.c.compress.utils.IOUtils` is now rarely used, it is always referenced by FQCN to avoid confusion.
garydgregory pushed a commit to apache/commons-compress that referenced this pull request Oct 19, 2025
* Deprecate `IOUtils.readFully` and `IOUtils.skip`

This change deprecates `readFully` and `skip` in `o.a.c.compress.utils.IOUtils` in favor of their Commons IO counterparts. Beyond code reuse, this offers two benefits:

* `readFully` had several overloads with inconsistent semantics. All read as many bytes as possible (like `read`), but only one threw on EOF (as `readFully` should).
* `skip` previously used a per-call byte buffer, unlike the Commons IO version. Since apache/commons-io#801 upstreamed the concurrency fix, this workaround is no longer needed.

**Note**: As `o.a.c.compress.utils.IOUtils` is now rarely used, it is always referenced by FQCN to avoid confusion.

* fix: unnecessary qualifier

* fix: remove qualified name
garydgregory added a commit to apache/commons-compress that referenced this pull request Nov 7, 2025
* Migrate to Apache Commons IO 2.21.0-SNAPSHOT

* Avoid redirects for snapshots in this branch

* Use `IOUtils.checkIndexFromSize` for argument validation (#716)

This change replaces custom argument checks in all `InputStream` and `OutputStream` implementations with the common utility method `IOUtils.checkIndexFromSize`.
This ensures consistent validation logic across the codebase.

Some implementations are intentionally left unchanged: cases where the arguments are immediately delegated to methods such as `InputStream#read`, `OutputStream#write`, or `ByteBuffer#wrap`, which already perform equivalent checks internally.

* Set commons-parent to 88

* remove: make `TarUtils` final and clean up internal methods (#712)

* remove: make `TarUtils` final and clean up internal methods

`TarUtils` already had a private constructor and was only used within the `o.a.c.compress.archivers.tar` package. This PR makes the class explicitly final and simplifies its internal API:

* Mark `TarUtils` as `final`.
* Change `protected` methods to package-private (they were never usable outside the package due to the private constructor).
* Remove deprecated non-`public` methods.
* Simplify `parsePaxHeaders`:

  * Require a non-negative `headerSize`. The special value `-1` (“unlimited”) was only used in tests.
  * Update tests to supply a valid `headerSize`.
  * Standardize error handling for invalid PAX 0.0 sparse records: all parsing errors now throw `ArchiveException` (previously one case threw `IOException`).

A changelog entry is included even though these are internal changes, to give users a reference point in case any unintended side effects arise.

* fix: make `TarUtils` final

This change actually makes `TarUtils` final and introduces a JapiCmp configuration that ignores all `CLASS_NOW_FINAL` changes.

* fix: don't call tests with `Integer.MAX_VALUE`

* fix: bump `japicmp` to version `0.24.0`

This bumps `japicmp` to the new version `0.24.0`, which ignores changes to protected method in non-extensible classes such as `TarUtils`.

* Update japicmp version from 0.24.0 to 0.24.1

* fix: clean up POM

* fix: enforcer rule to clean up temporary override

* Test `project.parent.version` resolution on CI

* Try ignoring white space

* Try newest Enforcer version

* Remove enforcer rule

---------

Co-authored-by: Gary Gregory <garydgregory@users.noreply.github.com>

* Add `AbstractArchiveBuilder` for unified archiver support (#721)

* Introduce `AbstractArchiveBuilder` for unified archiver support

This PR introduces a new `AbstractArchiveBuilder`, which serves as a shared foundation for all archive input streams, including `SevenZFile`, `TarFile`, and `ZipFile`.

This refactoring also leverages the new `ChannelOrigin` from Commons IO 2.21.0, streamlining the builder implementations for `SevenZFile`, `TarFile`, and `ZipFile`.

### Additional improvements

* **Constructor unification:** All constructors now delegate to a single constructor that accepts a builder.
* **Resource safety:** If an error occurs during construction (or within the builder’s `get()` method), the underlying resource is now always closed. Previously, resources provided via `SeekableByteChannel` were not closed on error, potentially leaving channels in an incoherent state.
* **Deprecation cleanup:** All references to deprecated constructors and methods have been removed. Compatibility is verified through `LegacyConstructorsTest`, which ensures the builder pattern behaves equivalently to the removed constructors.

* fix: Checkstyle violation

* fix: remove unused members

* fix: Windows failure

* Disable JDK 26-ea tests (#724)

All JDK 26-ea tests currently fail because PMD does not yet support this version (see [job 52084346962](https://github.com/apache/commons-compress/actions/runs/18292910119/job/52084346962) for an example).

While it’s valuable to stay ahead of upcoming JDK releases, having test jobs that are guaranteed to fail for reasons outside of Commons Compress or the JDK itself adds little benefit.

This PR disables JDK 26-ea tests until pmd/pmd#5871 is resolved, which will likely require upgrading to ASM 9.9 (released only a few days ago). Tests can be re-enabled once PMD adds support for Java 26.

* Bump org.apache.commons:commons-parent from 88 to 89

No longer need to override JApiCmp

* [COMPRESS-711] Fix incorrect CPIO checksum verification (#725)

This PR fixes an issue in `CpioArchiveInputStream` where the checksum was computed using the wrong slice of the buffer. Previously, it always used bytes `0` through `len`, ignoring the specified `off` parameter. As a result, checksum verification could fail when `read()` was called with a non-zero offset.

### Changes

* Corrected checksum calculation to use the actual range `off` through `off + len`.
* Added a regression test to ensure checksum verification works correctly with non-zero offsets.

* Sort members and remove test clutter

* Move action to fix section

* Fix Javadoc

* Add `ArchiveFile` abstraction for file-based archives (#709)

* feat: introduce `ArchiveFile` abstraction for file-based archives

This PR adds a new `ArchiveFile` interface to unify the handling of file-based archive utilities such as `SevenZFile`, `TarFile`, and `ZipFile`.

Although these classes target different archive formats, they share several core characteristics:

* All are `Closeable`.
* Each provides the same method to open an `InputStream` for a given entry (`InputStream getInputStream(T)` where `T extends ArchiveEntry`).
* Historically, their `getEntries()` methods returned incompatible types. This PR introduces a common `List<? extends T> entries()` method, aligning with `java.util.zip.ZipFile` in name but offering a modern return value.
* The `ZipFile#stream()` method (added in 1.28.0) is now part of this abstraction.

This change establishes a consistent, format-agnostic API for working with archive files, reducing duplication and improving discoverability for users.

* fix: add `IOIterator<T>` interface

* fix: failing tests

* fix: comment on usage of `getNextTarEntry`

* empty commit to trigger CI

* empty commit to trigger CI (2)

* fix: use `asIterable()` to provide `unwrap()`

* Sort members

* Javadoc

Use longer lines

* Narrow test exception typing

* Use final

* Add configurable maximum entry name length (#710)

* feat: Add configurable maximum entry name length

This change introduces a configurable limit on archive entry names.

Although formats like **AR**, **CPIO**, and **TAR** permit arbitrarily long path names, real operating systems and file systems impose much stricter limits:

* Individual path segments (file names) are typically limited to 255 bytes.
* Full paths are usually capped at a few KiB (e.g. 1024 bytes on macOS via `MAX_PATH`).

#### What’s new

* Added a new common builder, `AbstractArchiveBuilder`, inserted in the hierarchy between archive stream builders and `AbstractStreamBuilder`.
* Introduced a new configuration option: `setMaxEntryLength`.
* Default value is `Short.MAX_VALUE`, which is higher than any realistic OS limit.
* Enforced the limit across:

  * All `ArchiveInputStream` implementations
  * `SevenZFile`, `TarFile`, and `ZipFile`
* Added a dedicated test suite to verify:

  * Entry names up to `Short.MAX_VALUE` are handled correctly.
  * Entries exceeding a lowered limit result in an exception.

* fix: reformat

* fix: AbstractArchiveBuilder javadoc

* fix: extract PAX_NAME_KEY and PAX_LINK_NAME_KEY

* fix: merge errors

* Empty commit to trigger CI

* fix: replace `readRange` + size check with `IOUtils.toByteArray`**

Simplifies PAX header parsing by using `IOUtils.toByteArray` from Commons IO instead of `IOUtils.readRange` followed by a manual size check.

The previous size check was effectively unreachable: oversized PAX key/value pairs are validated earlier, and truncated reads are already handled by the input stream. This change improves code clarity and test coverage without altering behavior.

* simplify file name parsing in `SevenZFile`

This change streamlines file name parsing by:

* Reading directly from the existing `ByteBuffer` rather than copying data into a temporary byte array.
* Replacing the temporary byte array with a `StringBuilder` to accumulate the decoded UTF-16 characters.

This reduces intermediate allocations and simplifies the code path without changing behavior.

* fix: PMD failure

* Restrict visibility of unintentionally exposed APIs (#730)

This PR removes or hides several members that leaked into the public surface, were never meant for external use and were **never released**. There is **no functional behavior change** to the library; this only corrects API visibility and duplication.

## What changed

* **`SevenZFile.SOFT_MAX_ARRAY_LENGTH`**

  * **Change:** Removed.
  * **Rationale:** Duplicates the constant already available in Commons IO’s `IOUtils`.

* **`SevenZFile.toNonNegativeInt(...)`**

  * **Change:** Visibility reduced (internal helper).
  * **Rationale:** Not part of the supported API; only used internally.

* **`SeekableInMemoryByteChannel.getSize()`**

  * **Change:** Removed public alias.
  * **Rationale:** Only used in tests; behavior diverges from `size()` after channel closure and shouldn’t be exposed.

* **`ElementValue.BYTES`**

  * **Change:** Migrated to caller class.
  * **Rationale:** Had a single call site in another package; not a public contract.

* Post merge clean ups

- Use final
- Fix errant curly
- Sort members
- Reduce vertical space

* Declare `IOException` on archive `InputStream` constructors (#731)

> [!CAUTION]
> **Source-incompatible** (callers may need to add `throws IOException` or a catch).
> **Binary-compatible** (the `throws` clause isn’t part of the JVM descriptor).

Several `ArchiveInputStream` implementations either

- must read/validate bytes up front (e.g., magic headers), or
- may fail immediately when the underlying stream is unreadable.

Today we’re inconsistent:

* Formats **without a global signature** (e.g., **CPIO**, **TAR**) historically didn’t read in the constructor, so no `IOException` was declared.
* Other formats that **do need early bytes** either wrapped `IOException` in `ArchiveException` (**ARJ**, **DUMP**) or deferred the read to the first `getNextEntry()` (**AR**, **ZIP**).

This makes error handling uneven for users and complicates eager validation.

* All archive `InputStream` constructors now declare `throws IOException`.

* **ARJ** and **DUMP**: stop wrapping `IOException` in `ArchiveException` during construction; propagate the original `IOException`.

* **AR**: move reading of the global signature into the constructor (eager validation).

No behavioral change is intended beyond surfacing `IOException` at construction time, where appropriate.

For the ARJ format this was discussed in #723 (comment).

> [!NOTE]
> Version `1.29.0` already introduces source-incompatible changes in other methods, by adding checked exceptions.

* Javadoc

* Rename private instance variable

* 7z: enforce reference limits on `Folder` parsing (#729)

7z: enforce reference limits on `Folder` parsing

This change aligns `Folder` parsing with the limits defined in the official 7-Zip implementation
([`7zIn.cpp`](https://github.com/ip7z/7zip/blob/main/CPP/7zip/Archive/7z/7zIn.cpp)):

* Maximum coders per folder: **64**
* Maximum input streams per coder: **64**
* Maximum output streams per coder: **1**
* Maximum total input streams per folder: **64**

These bounds are consistent with the reference behavior and are safe because:

* Other 7z implementations use the same or stricter limits.
* No supported coder uses multiple inputs or outputs.
* Custom coder definitions are not supported in this implementation.

By enforcing these limits, the parser becomes simpler and more predictable,
and redundant dynamic size checks can be removed.

* Improve sparse file handling performance (#715)

* Improve sparse file handling performance

Previously, sparse files were processed recursively. On highly fragmented files, this led to deep recursion and significant inefficiency.

This change replaces the recursive approach with an iterative strategy, which scales better for files with many fragments. It also introduces generated tests that simulate sparse files with very high fragmentation to ensure correctness and performance under stress.

* fix: remove unused method

* fix: simplify input streams

* fix: error message

* Fix failing tests

* Sort members

* Simplify writing ustart trailer

- Javadoc

* ARJ: correct byte accounting and truncation errors (#723)

* ARJ: correct byte accounting and truncation errors

* `getBytesRead()` could drift from the actual archive size after a full read.
* Exceptions on truncation errors were inconsistent or missing.
* `DataInputStream` (big-endian) forced ad-hoc helpers for ARJ’s little-endian fields.

* **Accurate byte accounting:** count all consumed bytes across main/file headers, variable strings, CRCs, extended headers, and file data. `getBytesRead()` now matches the archive length at end-of-stream.
* **Consistent truncation handling:**

  * Truncation in the **main (archive) header**, read during construction, now throws an `ArchiveException` **wrapping** an `EOFException` (cause preserved).
  * Truncation in **file headers or file data** is propagated as a plain `EOFException` from `getNextEntry()`/`read()`.
* **Endianness refactor:** replace `DataInputStream` with `EndianUtils`, removing several bespoke helpers and making intent explicit.

* Add assertion that `getBytesRead()` equals the archive size after full consumption.
* Parameterized truncation tests at key boundaries (signature, basic/fixed header sizes, end of fixed/basic header, CRC, extended-header length, file data) verifying the exception contract above.

* fix: failing legacy test

* fix: checkstyle error

* fix: remove `EndianUtils` static import

The static import makes it harder to distinguish calls that need to count bytes from those that do not.

* Fix failing test

* Sort methods

* Remove unused method

* Refactor common test pattern

Use final

* ARJ: strict header validation and selfExtracting option (#728)

* ARJ: correct byte accounting and truncation errors

* `getBytesRead()` could drift from the actual archive size after a full read.
* Exceptions on truncation errors were inconsistent or missing.
* `DataInputStream` (big-endian) forced ad-hoc helpers for ARJ’s little-endian fields.

* **Accurate byte accounting:** count all consumed bytes across main/file headers, variable strings, CRCs, extended headers, and file data. `getBytesRead()` now matches the archive length at end-of-stream.
* **Consistent truncation handling:**

  * Truncation in the **main (archive) header**, read during construction, now throws an `ArchiveException` **wrapping** an `EOFException` (cause preserved).
  * Truncation in **file headers or file data** is propagated as a plain `EOFException` from `getNextEntry()`/`read()`.
* **Endianness refactor:** replace `DataInputStream` with `EndianUtils`, removing several bespoke helpers and making intent explicit.

* Add assertion that `getBytesRead()` equals the archive size after full consumption.
* Parameterized truncation tests at key boundaries (signature, basic/fixed header sizes, end of fixed/basic header, CRC, extended-header length, file data) verifying the exception contract above.

* fix: failing legacy test

* fix: checkstyle error

* fix: remove `EndianUtils` static import

The static import makes it harder to distinguish calls that need to count bytes from those that do not.

* ARJ: strict header validation and `selfExtracting` option

Today `ArjArchiveInputStream` keeps scanning past invalid headers, assuming self-extracting stubs. That can hide corruption.

This PR:

* Introduces a `selfExtracting` ARJ archive option (default **false**).

  * **false:** no scanning; parse strictly from the first byte. Any invalid/truncated header fails fast.
  * **true:** scan only to locate the Main Archive Header (AMH), then switch to **strict mode**. All subsequent headers must be contiguous and valid.

**Behavioral change**

Previously, we might “skip over” bad data. Now we **only** allow a discovery scan for AMH (when opted in); everything after must validate or fail.

* fix: simplify rethrowing

* Fix failing test

* Sort methods

* Remove unused method

* Sort members

* Fix remove unused method

* Extract `MAX_BASIC_HEADER_SIZE` constant

* fix: exception message

* Javadoc

- Normalize error messages
- Use final
- Reduce vertical whitespace

* 7z: unsigned number parsing and improved header validation (#734)

* 7z: unsigned number parsing and improved header validation

The 7z file format specification defines only **unsigned numbers** (`UINT64`, `REAL_UINT64`, `UINT32`). However, the current implementation allows parsing methods like `readUint64`, `getLong`, and `getInt` to return negative values and then handles those inconsistently in downstream logic.

This PR introduces a safer and more specification-compliant number parsing model.

### Key changes

* **Strict unsigned number parsing**

  * Parsing methods now *never* return negative numbers.
  * `readUint64`, `readUint64ToIntExact`, `readRealUint64`, and `readUint32` follow the terminology from `7zFormat.txt`.
  * Eliminates scattered negative-value checks that previously compensated for parsing issues.

* **Improved header integrity validation**

  * Before large allocations, the size is now validated against the **actual available data in the header** as well as the memory limit.
  * Prevents unnecessary or unsafe allocations when the archive is corrupted or truncated.

* **Correct numeric type usage**

  * Some fields represent 7z numbers as 64-bit values but are constrained internally to Java `int` limits.
  * These are now declared as `int` to signal real constraints in our implementation.

* **Consistent error handling**
  Parsing now throws only three well-defined exception types:

  | Condition                                                              | Exception                                    |
  | ---------------------------------------------------------------------- | -------------------------------------------- |
  | Declared structure exceeds `maxMemoryLimitKiB`                         | `MemoryLimitException`                       |
  | Missing data inside header (truncated or corrupt)                      | `ArchiveException("Corrupted 7z archive")`   |
  | Unsupported numeric values (too large for implementation) | `ArchiveException("Unsupported 7z archive")` |

  Note: `EOFException` is no longer used: a header with missing fields is not “EOF,” it is **corrupted**.

This PR lays groundwork for safer parsing and easier future maintenance by aligning number handling with the actual 7z specification and making header parsing behavior *predictable and robust*.

* fix: explain 3 bytes for Folder

* fix: comment memory check again

* fix: remove unused import

---------

Co-authored-by: Gary Gregory <garydgregory@users.noreply.github.com>

* Normalize some error messages

* Simplify `PackingUtils` using `IOUtils` (#737)

This change simplifies `PackingUtils` by leveraging `IOUtils` from Commons IO for common I/O operations, reducing boilerplate and improving readability.

* Fix grammar

* 7z: optimize header loading (#735)

* 7z: optimize header loading

This change improves the efficiency of 7z header parsing:

* Reads the **Signature Header** in a single ByteBuffer instead of multiple small reads, reducing overhead.
* Uses a `MappedByteBuffer` to load the **Next Header** when the archive is backed by a `FileChannel`, improving performance for large headers by avoiding unnecessary copies.

No new tests are added, as the existing test suite already exercises the affected header loading paths sufficiently.

* fix: rename `computeChecksum` to `crc32`

* fix: improve error messages

* Deprecate `IOUtils.readFully` and `IOUtils.skip` (#736)

* Deprecate `IOUtils.readFully` and `IOUtils.skip`

This change deprecates `readFully` and `skip` in `o.a.c.compress.utils.IOUtils` in favor of their Commons IO counterparts. Beyond code reuse, this offers two benefits:

* `readFully` had several overloads with inconsistent semantics. All read as many bytes as possible (like `read`), but only one threw on EOF (as `readFully` should).
* `skip` previously used a per-call byte buffer, unlike the Commons IO version. Since apache/commons-io#801 upstreamed the concurrency fix, this workaround is no longer needed.

**Note**: As `o.a.c.compress.utils.IOUtils` is now rarely used, it is always referenced by FQCN to avoid confusion.

* fix: unnecessary qualifier

* fix: remove qualified name

* Use `consume` instead of `skip` (#738)

Replace `IOUtils.skip(in, Long.MAX_VALUE)`` with `IOUtils.consume(in)`` for clarity and intent, removing the need for a magic constant.

* Use HTTPS in URL

* Test with `commons-io` 2.21.0 RC1

* fix: remove staging repository

---------

Co-authored-by: Piotr P. Karwasz <piotr@github.copernik.eu>
Co-authored-by: Piotr P. Karwasz <pkarwasz-github@apache.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants