Conversation

@paulnoalhyt (Collaborator) commented Oct 3, 2025

  • I have reviewed the OFRAK contributor guide and attest that this pull request is in accordance with it.
  • I have made or updated a changelog entry for the changes in this pull request.

One sentence summary of this PR (This should go in the CHANGELOG!)

Reimplement the CPIO components using libarchive.

Link to Related Issue(s)

Closes #616

Please describe the changes in your request.

Updated the CPIO components to use libarchive instead of the cpio command. The advantage is that unpacking is done in memory; files aren't actually created on disk. This fixes the two issues documented in #616. I added regression tests for those two issues, plus a full unpack-pack-unpack test to make sure all metadata is preserved.
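For context, a minimal sketch of what in-memory unpacking with the libarchive-c bindings looks like (illustration only, not the component's actual code):

# Sketch: read a CPIO archive entirely in memory with libarchive-c,
# so no files are ever written to disk during unpacking.
import libarchive


def read_cpio_in_memory(cpio_bytes: bytes) -> dict:
    """Map each entry path to its raw contents without extracting to disk."""
    contents = {}
    with libarchive.memory_reader(cpio_bytes) as archive:
        for entry in archive:
            if entry.isdir:
                continue
            # get_blocks() streams the entry's data as in-memory chunks
            contents[str(entry.pathname)] = b"".join(entry.get_blocks())
    return contents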

Anyone you think should look at this, specifically?

No

@paulnoalhyt changed the title from "Bugfix/cpio libarchive" to "Reimplement CPIO components using libarchive" on Oct 3, 2025
@whyitfor (Contributor) commented Oct 3, 2025

@paulnoalhyt, does this mean that cpio can be removed from ofrak_core/Dockerstub?

@whyitfor (Contributor) commented Oct 6, 2025

@paulnoalhyt, it looks like there are issues with libarchive not being installed on Windows. It sounds like choco offers a libarchive package -- you could try updating the CI steps for Windows to test this out, or try another approach. Either way, there should be a check for whether the library works, so that OFRAK can run without it installed.
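For illustration, a guarded import along these lines (hypothetical, not code from this PR) would let OFRAK keep running when the library is missing:

# Hypothetical guard (not from this PR): detect whether the libarchive
# bindings and the native shared library are both usable before
# registering the CPIO components.
try:
    import libarchive  # noqa: F401
except Exception:  # the exact error depends on the platform and bindings version
    LIBARCHIVE_INSTALLED = False
else:
    LIBARCHIVE_INSTALLED = True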

@whyitfor (Contributor) left a review:

Added some comments/questions

Comment on lines 197 to 199
Note: libarchive supports reading many CPIO variants but only supports
writing a subset. We map to the closest supported write format.
Available write formats: 'cpio' (generic), 'cpio_newc' (SVR4 newc)
Contributor:

  • Is this a limitation of libarchive or the python bindings?
  • Does this make packing lossy (are we encountering changes when repacking)?
  • Can we help add support to handle this?

@paulnoalhyt (Collaborator, Author) replied on Oct 6, 2025:

This is a limitation of libarchive itself (not the Python bindings). See https://linux.die.net/man/5/libarchive-formats:

The libarchive library can read a number of common cpio variants and can write ''odc'' and ''newc'' format archives

Worst case, a CPIO archive could be repacked with a different header format, but the actual file data and metadata would be the same. I don't think there is much we can do.
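For illustration, repacking in memory to one of the two writable formats could look roughly like this (a sketch assuming the libarchive-c custom_writer and add_file_from_memory APIs; not the PR's packer):

# Sketch: libarchive reads many CPIO variants but only writes the
# 'cpio' (odc) and 'cpio_newc' (SVR4 newc) formats, so repacking maps
# the input variant to the closest of those two.
import io

import libarchive


def repack_as_newc(files: dict) -> bytes:
    """Write {path: data} pairs into a newc-format CPIO archive in memory."""
    buffer = io.BytesIO()
    with libarchive.custom_writer(buffer.write, "cpio_newc") as archive:
        for path, data in files.items():
            archive.add_file_from_memory(path, len(data), data)
    return buffer.getvalue()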

assert patched_data == EXPECTED_DATA


async def test_unpacking_root(ofrak_context: OFRAKContext):
Contributor:

Could this test (and the others) be written using the OFRAK testing patterns?

Collaborator (Author):

I removed some tests that overlapped with others. For the ones I kept, the issue is that FilesystemPackUnpackVerifyPattern expects files to be flushed to disk, which is exactly what I'm trying to avoid with this move to libarchive.

Comment on lines 182 to 183
# For symlinks, skip size check - libarchive CPIO writer doesn't preserve symlink size
# This is a known limitation of libarchive's add_file_from_memory for CPIO symlinks
Contributor:

Link for this? Is this something we could fix?

Collaborator (Author):

This took me a while to investigate! The issue is a combination of the libarchive-c Python bindings and libarchive itself: symlinks are created as hardlinks and then ignored during archive writing. I opened an issue on the libarchive-c bindings repo: Changaco/python-libarchive-c#143. I have a workaround in place until this is fixed in the bindings.

@paulnoalhyt (Collaborator, Author) commented Oct 8, 2025

Another issue I need to spend time on: the rdev field of an entry is not preserved by our packer/unpacker.
[EDIT] The device number is now preserved; see commit a30e30c.
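For context, st_rdev is the packed device number of a device node, which is the metadata the packer/unpacker now round-trips. A standard-library illustration (Unix only, unrelated to the PR's code):

# Illustration only: split and rebuild the device number of a device node.
import os

st = os.stat("/dev/null")
major, minor = os.major(st.st_rdev), os.minor(st.st_rdev)
assert os.makedev(major, minor) == st.st_rdev
print(f"/dev/null is character device {major}:{minor}")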

Comment on lines +37 to +44
try:
    # On Windows, if libarchive is not installed, use the DLL from the `extractcode-libarchive` python package.
    import sys

    os.environ["LIBARCHIVE"] = os.path.join(
        sys.prefix, "lib", "site-packages", "extractcode_libarchive", "lib", "libarchive-13.dll"
    )
    import libarchive
Collaborator (Author):

Since choco doesn't provide a libarchive package, I found a dirty solution: install the extractcode-libarchive Python package, which ships libarchive-13.dll. When the libarchive-c Python bindings are loaded, they check the LIBARCHIVE environment variable, so by setting it, the bindings can use that DLL. This seems like the easiest solution if we don't want to compile libarchive ourselves or ship a DLL with OFRAK. Let me know what you think.

Member:

Since libarchive will presumably become a core part of many of our packers and unpackers after this change is merged, I wonder if we should cross-build the DLLs and push Python wheels with them to PyPI ourselves? If it takes less than a day to get our own builds pushed to PyPI, I think it is probably worthwhile. (@whyitfor can weigh in on this.)

The cross-compiling part isn't difficult with GitHub Actions. As a quick experiment, I already did it here using the Zig toolchain. What remains is just the "build the wheel and push it" part, which I think might not be that hard. To see what is generated, look at the Actions build log.

Contributor:

@rbs-jacob, I like this idea, but let's keep it separate from this PR. Can you make an issue? Possible options include: a) pushing our own wheels, or b) helping the libarchive-c maintainer do this for the library.

Member:

Happy to make an issue.

To be clear: you're saying that, for now, we should stick with the extractcode-libarchive Python package hack used here? If so, I agree. Just want to confirm.

@whyitfor (Contributor) left a review:

Looking good! Please see a few comments. Once they are resolved, I think this is ready to go!

Comment on lines +59 to +60
choco_package="", # libarchive is not available on choco. If it is not installed, it will use the DLL provided by the `extractcode-libarchive` python package.
)
Contributor:

Suggested change:
- choco_package="", # libarchive is not available on choco. If it is not installed, it will use the DLL provided by the `extractcode-libarchive` python package.
- )
+ choco_package="", # libarchive is not available on choco. If libarchive is not installed, OFRAK will use the DLL provided by the `extractcode-libarchive` python package.
+ )

Comment on lines +30 to +42
cpio_r = await ofrak_context.create_root_resource("root.cpio", b"", (CpioFilesystem,))
cpio_r.add_view(CpioFilesystem(archive_type=CpioArchiveType.NEW_ASCII))
await cpio_r.save()
cpio_v = await cpio_r.view_as(CpioFilesystem)
# This also tests packing and unpacking a root file
await cpio_v.add_file(
    path=CPIO_ENTRY_NAME,
    data=INITIAL_DATA,
    file_stat_result=os.stat_result((0o644, 0, 0, 1, 0, 0, 0, 0, 0, 0)),
    file_xattrs=None,
)
await cpio_r.pack_recursively()
return cpio_r
Contributor:

The point of this function was to create a file without OFRAK in order to test OFRAK. Can you build this file once using the cpio utility and then add it as a test asset to Git LFS?
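For reference, the one-off asset could be generated roughly like this (a sketch of the suggestion, not part of the PR; the file names are placeholders):

# Sketch: build root.cpio once with the system `cpio` utility so the test
# input is produced independently of OFRAK, then commit it via Git LFS.
import subprocess

with open("hello.txt", "wb") as f:
    f.write(b"hello world\n")

# `cpio -o` reads a newline-separated file list on stdin and writes the
# archive to stdout; `-H newc` selects the SVR4 newc header format.
with open("root.cpio", "wb") as out:
    subprocess.run(
        ["cpio", "-o", "-H", "newc"],
        input=b"hello.txt\n",
        stdout=out,
        check=True,
    )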

- When the user selects the "Decompilation" tab in the GUI, the pane is updated with the decompiled code automatically, without having to click "Analyze" first. ([#639](https://github.com/redballoonsecurity/ofrak/pull/639))
- Use a single source of truth for the package version ([#640](https://github.com/redballoonsecurity/ofrak/pull/640))
- Update the behavior of `get_only_descendant_as_view`, `get_descendants_as_view`, `get_ancestors_as_view`, and `get_only_ancestor_as_view` to retrieve all resources that match the filter. ([#642](https://github.com/redballoonsecurity/ofrak/pull/642))
- Reimplement the CPIO components using `libarchive` ([#647](https://github.com/redballoonsecurity/ofrak/pull/647))
Contributor:

This should be added to the Unreleased section! You also need to bump the VERSION in ofrak_core/version.py.

Comment on lines +217 to +226
format_map = {
    CpioArchiveType.NEW_ASCII: "cpio_newc",
    CpioArchiveType.OLD_ASCII: "cpio",
    CpioArchiveType.CRC_ASCII: "cpio_newc",
    CpioArchiveType.BINARY: "cpio",
    CpioArchiveType.TAR: "cpio",
    CpioArchiveType.USTAR: "ustar",
    CpioArchiveType.HPBIN: "cpio",
    CpioArchiveType.HPODC: "cpio",
}
Contributor:

@paulnoalhyt, I think I'd feel more comfortable if we didn't do this mapping here.

I think the behavior we want is the following:

  1. If someone tries to pack a CPIO archive and the archive type is not supported for writing, we raise an error explaining that to the user.
  2. The error can tell them how to do the mapping themselves: they can update the attributes on the CPIO archive to the desired output format before calling pack. We could even implement an archive format modifier to make this change.

This still gives users the ability to pack, but they have to explicitly acknowledge that they are changing the archive format.

What do you think?
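For illustration, that behavior could look roughly like this (a sketch with hypothetical names; CpioArchiveType is the enum already defined in this PR, and libarchive can only write the odc ("cpio") and newc ("cpio_newc") variants):

# Sketch of the proposal above (hypothetical helper, not the PR's code).
WRITABLE_FORMATS = {
    CpioArchiveType.OLD_ASCII: "cpio",
    CpioArchiveType.NEW_ASCII: "cpio_newc",
}


def resolve_write_format(archive_type: CpioArchiveType) -> str:
    if archive_type not in WRITABLE_FORMATS:
        raise NotImplementedError(
            f"libarchive cannot write {archive_type.name} CPIO archives. "
            "Update the CpioFilesystem attributes to OLD_ASCII or NEW_ASCII "
            "(e.g. with an archive format modifier) before packing."
        )
    return WRITABLE_FORMATS[archive_type]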

Development

Successfully merging this pull request may close these issues.

Unpacking CPIO filesystem fails on MacOS because of permissions