Conversation

@paulnoalhyt (Collaborator) commented Oct 3, 2025

  • I have reviewed the OFRAK contributor guide and attest that this pull request is in accordance with it.
  • I have made or updated a changelog entry for the changes in this pull request.

One sentence summary of this PR (This should go in the CHANGELOG!)

Reimplement the CPIO components using libarchive.

Link to Related Issue(s)

Closes #616

Please describe the changes in your request.

Updated the CPIO components to use libarchive instead of the cpio command. The advantage is that unpacking is done in memory; files aren't actually created on disk. This fixes the two issues documented in #616. I added regression tests for those two issues, plus a full unpack-pack-unpack test to make sure all metadata is preserved.
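For context, a minimal sketch of what in-memory unpacking with the libarchive-c bindings looks like (illustration only, not the component's actual code):

# Sketch: read a CPIO archive entirely in memory with libarchive-c,
# so no files are ever written to disk during unpacking.
import libarchive


def read_cpio_in_memory(cpio_bytes: bytes) -> dict:
    """Map each entry path to its raw contents without extracting to disk."""
    contents = {}
    with libarchive.memory_reader(cpio_bytes) as archive:
        for entry in archive:
            if entry.isdir:
                continue
            # get_blocks() streams the entry's data as in-memory chunks
            contents[str(entry.pathname)] = b"".join(entry.get_blocks())
    return contents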

Anyone you think should look at this, specifically?

No

@paulnoalhyt changed the title from "Bugfix/cpio libarchive" to "Reimplement CPIO components using libarchive" on Oct 3, 2025
@whyitfor (Contributor) commented Oct 3, 2025

@paulnoalhyt, does this mean that cpio can be removed from ofrak_core/Dockerstub?

@whyitfor (Contributor) commented Oct 6, 2025

@paulnoalhyt, it looks like there are issues with libarchive not being installed on Windows. It sounds like choco offers a libarchive package -- you could try updating the CI steps for Windows to test this out, or try another approach. Either way, there should be a check for whether the library works, so that OFRAK can run without it installed.
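For illustration, a guarded import along these lines (hypothetical, not code from this PR) would let OFRAK keep running when the library is missing:

# Hypothetical guard (not from this PR): detect whether the libarchive
# bindings and the native shared library are both usable before
# registering the CPIO components.
try:
    import libarchive  # noqa: F401
except Exception:  # the exact error depends on the platform and bindings version
    LIBARCHIVE_INSTALLED = False
else:
    LIBARCHIVE_INSTALLED = True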

@whyitfor (Contributor) left a review:

Added some comments/questions

Comment on lines 197 to 199
Note: libarchive supports reading many CPIO variants but only supports
writing a subset. We map to the closest supported write format.
Available write formats: 'cpio' (generic), 'cpio_newc' (SVR4 newc)
Contributor:

  • Is this a limitation of libarchive or the python bindings?
  • Does this make packing lossy (are we encountering changes when repacking)?
  • Can we help add support to handle this?

@paulnoalhyt (Collaborator, Author) replied on Oct 6, 2025:

This is a limitation of libarchive itself (not the Python bindings). See https://linux.die.net/man/5/libarchive-formats:

The libarchive library can read a number of common cpio variants and can write ''odc'' and ''newc'' format archives

Worst case, a CPIO archive could be repacked with a different header format, but the actual file data and metadata would be the same. I don't think there is much we can do.
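For illustration, repacking in memory to one of the two writable formats could look roughly like this (a sketch assuming the libarchive-c custom_writer and add_file_from_memory APIs; not the PR's packer):

# Sketch: libarchive reads many CPIO variants but only writes the
# 'cpio' (odc) and 'cpio_newc' (SVR4 newc) formats, so repacking maps
# the input variant to the closest of those two.
import io

import libarchive


def repack_as_newc(files: dict) -> bytes:
    """Write {path: data} pairs into a newc-format CPIO archive in memory."""
    buffer = io.BytesIO()
    with libarchive.custom_writer(buffer.write, "cpio_newc") as archive:
        for path, data in files.items():
            archive.add_file_from_memory(path, len(data), data)
    return buffer.getvalue()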

assert patched_data == EXPECTED_DATA


async def test_unpacking_root(ofrak_context: OFRAKContext):
Contributor:

Could this test (and the others) be written using the OFRAK testing patterns?

Collaborator (Author):

I removed some tests that overlapped with others. For the ones I kept, the issue is that FilesystemPackUnpackVerifyPattern expects files to be flushed to disk, which is exactly what I'm trying to avoid with this move to libarchive.

Comment on lines 182 to 183
# For symlinks, skip size check - libarchive CPIO writer doesn't preserve symlink size
# This is a known limitation of libarchive's add_file_from_memory for CPIO symlinks
Contributor:

Link for this? Is this something we could fix?

Collaborator (Author):

This took me a while to investigate! The issue is a combination of the libarchive-c Python bindings and libarchive itself: symlinks are created as hardlinks and then ignored during archive writing. I opened an issue on the libarchive-c bindings repo: Changaco/python-libarchive-c#143. I have a workaround in place until this is fixed in the bindings.

@paulnoalhyt (Collaborator, Author) commented Oct 8, 2025

Another issue I need to spend time on: the rdev field of an entry is not preserved by our packer/unpacker.
[EDIT] The device number is now preserved; see commit a30e30c.
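For context, st_rdev is the packed device number of a device node, which is the metadata the packer/unpacker now round-trips. A standard-library illustration (Unix only, unrelated to the PR's code):

# Illustration only: split and rebuild the device number of a device node.
import os

st = os.stat("/dev/null")
major, minor = os.major(st.st_rdev), os.minor(st.st_rdev)
assert os.makedev(major, minor) == st.st_rdev
print(f"/dev/null is character device {major}:{minor}")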

Comment on lines +37 to +44
try:
    # On Windows, if libarchive is not installed, use the DLL from the `extractcode-libarchive` python package.
    import sys

    os.environ["LIBARCHIVE"] = os.path.join(
        sys.prefix, "lib", "site-packages", "extractcode_libarchive", "lib", "libarchive-13.dll"
    )
    import libarchive
Collaborator (Author):

Since choco doesn't provide a libarchive package, I found a dirty solution: install the extractcode-libarchive Python package, which ships libarchive-13.dll. When the libarchive-c Python bindings are loaded, they check the LIBARCHIVE environment variable, so by setting it, the bindings can use that DLL. This seems like the easiest solution if we don't want to compile libarchive ourselves or ship a DLL with OFRAK. Let me know what you think.

Member:

Since libarchive will presumably become a core part of many of our packers and unpackers after this change is merged, I wonder if we should cross-build the DLLs and push Python wheels with them to PyPI ourselves? If it takes less than a day to get our own builds pushed to PyPI, I think it is probably worthwhile. (@whyitfor can weigh in on this.)

The cross-compiling part isn't difficult with GitHub Actions. As a quick experiment, I already did it here using the Zig toolchain. What remains is just the "build the wheel and push it" part, which I think might not be that hard. To see what is generated, look at the Actions build log.

Contributor:

@rbs-jacob, I like this idea, but let's keep it separate from this PR. Can you make an issue? Possible options include: a) pushing our own wheels, or b) helping the libarchive-c maintainer do this for the library.

Member:

Happy to make an issue.

To be clear: you're saying that, for now, we should stick with the extractcode-libarchive Python package hack used here? If so, I agree. Just want to confirm.

@whyitfor (Contributor) left a review:

Looking good! Please see a few comments. Once they are resolved, I think this is ready to go!

Comment on lines +59 to +60
choco_package="", # libarchive is not available on choco. If it is not installed, it will use the DLL provided by the `extractcode-libarchive` python package.
)
Contributor:

Suggested change:
- choco_package="", # libarchive is not available on choco. If it is not installed, it will use the DLL provided by the `extractcode-libarchive` python package.
- )
+ choco_package="", # libarchive is not available on choco. If libarchive is not installed, OFRAK will use the DLL provided by the `extractcode-libarchive` python package.
+ )

Comment on lines +30 to +42
cpio_r = await ofrak_context.create_root_resource("root.cpio", b"", (CpioFilesystem,))
cpio_r.add_view(CpioFilesystem(archive_type=CpioArchiveType.NEW_ASCII))
await cpio_r.save()
cpio_v = await cpio_r.view_as(CpioFilesystem)
# This also tests packing and unpacking a root file
await cpio_v.add_file(
    path=CPIO_ENTRY_NAME,
    data=INITIAL_DATA,
    file_stat_result=os.stat_result((0o644, 0, 0, 1, 0, 0, 0, 0, 0, 0)),
    file_xattrs=None,
)
await cpio_r.pack_recursively()
return cpio_r
Contributor:

The point of this function was to create a file without OFRAK in order to test OFRAK. Can you build this file once using the cpio utility and then add it as a test asset to Git LFS?
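For reference, the one-off asset could be generated roughly like this (a sketch of the suggestion, not part of the PR; the file names are placeholders):

# Sketch: build root.cpio once with the system `cpio` utility so the test
# input is produced independently of OFRAK, then commit it via Git LFS.
import subprocess

with open("hello.txt", "wb") as f:
    f.write(b"hello world\n")

# `cpio -o` reads a newline-separated file list on stdin and writes the
# archive to stdout; `-H newc` selects the SVR4 newc header format.
with open("root.cpio", "wb") as out:
    subprocess.run(
        ["cpio", "-o", "-H", "newc"],
        input=b"hello.txt\n",
        stdout=out,
        check=True,
    )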

- When the user selects the "Decompilation" tab in the GUI, the pane is updated with the decompiled code automatically, without having to click "Analyze" first. ([#639](https://github.com/redballoonsecurity/ofrak/pull/639))
- Use a single source of truth for the package version ([#640](https://github.com/redballoonsecurity/ofrak/pull/640))
- Update the behavior of `get_only_descendant_as_view`, `get_descendants_as_view`, `get_ancestors_as_view`, and `get_only_ancestor_as_view` to retrieve all resources that match the filter. ([#642](https://github.com/redballoonsecurity/ofrak/pull/642))
- Reimplement the CPIO components using `libarchive` ([#647](https://github.com/redballoonsecurity/ofrak/pull/647))
Contributor:

This should be added to the Unreleased section! You also need to bump the VERSION in ofrak_core/version.py.

Comment on lines +217 to +226
format_map = {
    CpioArchiveType.NEW_ASCII: "cpio_newc",
    CpioArchiveType.OLD_ASCII: "cpio",
    CpioArchiveType.CRC_ASCII: "cpio_newc",
    CpioArchiveType.BINARY: "cpio",
    CpioArchiveType.TAR: "cpio",
    CpioArchiveType.USTAR: "ustar",
    CpioArchiveType.HPBIN: "cpio",
    CpioArchiveType.HPODC: "cpio",
}
Contributor:

@paulnoalhyt, I think I'd feel more comfortable if we didn't do this mapping here.

I think the behavior we want is the following:

  1. If someone tries to pack a CPIO archive and the archive type is not supported for writing, we raise an error explaining that to the user.
  2. The error can tell them how to do the mapping themselves: they can update the attributes on the CPIO archive to the desired output format before calling pack. We could even implement an archive format modifier to make this change.

This still gives users the ability to pack, but they have to explicitly acknowledge that they are changing the archive format.

What do you think?
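For illustration, that behavior could look roughly like this (a sketch with hypothetical names; CpioArchiveType is the enum already defined in this PR, and libarchive can only write the odc ("cpio") and newc ("cpio_newc") variants):

# Sketch of the proposal above (hypothetical helper, not the PR's code).
WRITABLE_FORMATS = {
    CpioArchiveType.OLD_ASCII: "cpio",
    CpioArchiveType.NEW_ASCII: "cpio_newc",
}


def resolve_write_format(archive_type: CpioArchiveType) -> str:
    if archive_type not in WRITABLE_FORMATS:
        raise NotImplementedError(
            f"libarchive cannot write {archive_type.name} CPIO archives. "
            "Update the CpioFilesystem attributes to OLD_ASCII or NEW_ASCII "
            "(e.g. with an archive format modifier) before packing."
        )
    return WRITABLE_FORMATS[archive_type]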

Development

Successfully merging this pull request may close these issues.

Unpacking CPIO filesystem fails on MacOS because of permissions