-
Couldn't load subscription status.
- Fork 144
Reimplement CPIO components using libarchive #647
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
|
@paulnoalhyt, does this mean that |
|
@paulnoalhyt, looks like there are issues with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some comments/questions
| Note: libarchive supports reading many CPIO variants but only supports | ||
| writing a subset. We map to the closest supported write format. | ||
| Available write formats: 'cpio' (generic), 'cpio_newc' (SVR4 newc) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Is this a limitation of libarchive or the python bindings?
- Does this make packing lossy (are we encountering changes when repacking)?
- Can we help add support to handle this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a limitation from libarchive itself (not the python bindings). See https://linux.die.net/man/5/libarchive-formats:
The libarchive library can read a number of common cpio variants and can write ''odc'' and ''newc'' format archives
Worst case scenario an CPIO could be repacked to a different format (header's format), but the actual files data and metadata would be the same. I don't think there is much we can do.
| assert patched_data == EXPECTED_DATA | ||
|
|
||
|
|
||
| async def test_unpacking_root(ofrak_context: OFRAKContext): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this (and other) tests be written using the OFRAK testing patterns?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed some tests that were overlapping with others. For the ones I left, the issue is that the FilesystemPackUnpackVerifyPattern expect files being flushed to disk, which is exactly what I'm trying to avoid with this move to libarchive.
| # For symlinks, skip size check - libarchive CPIO writer doesn't preserve symlink size | ||
| # This is a known limitation of libarchive's add_file_from_memory for CPIO symlinks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Link for this? Is this something we could fix?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This took me a while to investigate! The issue was a combination of libarchive-c python bindings, and libarchive code itself. The issue is that symlinks are created as hardlinks, then ignored during archive writing. I created an issue on the libarchive-c python bindings repo: Changaco/python-libarchive-c#143. I found a fix for this, until this is fixed in the bindings.
Co-authored-by: Wyatt <53830972+whyitfor@users.noreply.github.com>
|
Another issue that I need to spend time on is that the |
| try: | ||
| # On Windows, if libarchive is not installed, use the DLL from the `extractcode-libarchive` python package. | ||
| import sys | ||
|
|
||
| os.environ["LIBARCHIVE"] = os.path.join( | ||
| sys.prefix, "lib", "site-packages", "extractcode_libarchive", "lib", "libarchive-13.dll" | ||
| ) | ||
| import libarchive |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As choco didn't provide a libarchive package, I found a dirty solution: install the extractcode-libarchive python package, which comes with libarchive-13.dll. When the libarchive-c python bindings are loaded, there is a check for the LIBARCHIVE environment variable. So by setting that, the bindings can use that DLL. That appears to me like the easiest solution if we don't want to compile the libarchive library ourselves, or ship a DLL in ofrak. Let me know what you think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since libarchive may presumably become a core part of many of our packers and unpackers after this change is merged, I wonder if we should cross-build the DLLs and push Python wheels with them to PyPI ourselves? If it takes less than a day to get our own builds pushed up to PyPI, I think it is probably worthwhile. (@whyitfor can weigh in on this.)
The cross-compiling part isn't difficult with GitHub actions. As a quick experiment, I did it already here using the Zig toolchain. What remains to be done there is just the "build the wheel and push it" part, which I think might not be that hard. To see what is generated, look at the Actions build log.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rbs-jacob, I like this idea, but let's keep it separate from the PR. Can you make an issue? Possible options would include a) pushing our own wheels b) helping the libarchive-c maintainer do this with the library.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Happy to make an issue.
To be clear: you're saying that, for now, we should stick with the extractcode-libarchive Python package hack used here? If so, I agree. Just want to confirm.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good! please see a few comments. Once they are resolved, I think this is ready to go!
| choco_package="", # libarchive is not available on choco. If it is not installed, it will use the DLL provided by the `extractcode-libarchive` python package. | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| choco_package="", # libarchive is not available on choco. If it is not installed, it will use the DLL provided by the `extractcode-libarchive` python package. | |
| ) | |
| choco_package="", # libarchive is not available on choco. If libarchive is not installed, OFRAK will use the DLL provided by the `extractcode-libarchive` python package. | |
| ) |
| cpio_r = await ofrak_context.create_root_resource("root.cpio", b"", (CpioFilesystem,)) | ||
| cpio_r.add_view(CpioFilesystem(archive_type=CpioArchiveType.NEW_ASCII)) | ||
| await cpio_r.save() | ||
| cpio_v = await cpio_r.view_as(CpioFilesystem) | ||
| # This also tests packing and unpacking a root file | ||
| await cpio_v.add_file( | ||
| path=CPIO_ENTRY_NAME, | ||
| data=INITIAL_DATA, | ||
| file_stat_result=os.stat_result((0o644, 0, 0, 1, 0, 0, 0, 0, 0, 0)), | ||
| file_xattrs=None, | ||
| ) | ||
| await cpio_r.pack_recursively() | ||
| return cpio_r |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The point of this function was to create a file without OFRAK to test OFRAK. Can you build this file once, using the cpio utility and then add it as a test asset to gitlfs?
| - When the user selects the "Decompilation" tab in the GUI, the pane is updated with the decompiled code automatically, without having to click "Analyze" first. ([#639](https://github.com/redballoonsecurity/ofrak/pull/639)) | ||
| - Use a single source of truth for the package version ([#640](https://github.com/redballoonsecurity/ofrak/pull/640)) | ||
| - Update the behavior of `get_only_descendant_as_view`, `get_descendants_as_view`, `get_ancestors_as_view`, and `get_only_ancestor_as_view` to retrieve all resources that match the filter. ([#642](https://github.com/redballoonsecurity/ofrak/pull/642)) | ||
| - Reimplement the CPIO components using `libarchive` ([#647](https://github.com/redballoonsecurity/ofrak/pull/647)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be added to unreleased! You also need to bump the VERSION in ofrak_core/version.py.
| format_map = { | ||
| CpioArchiveType.NEW_ASCII: "cpio_newc", | ||
| CpioArchiveType.OLD_ASCII: "cpio", | ||
| CpioArchiveType.CRC_ASCII: "cpio_newc", | ||
| CpioArchiveType.BINARY: "cpio", | ||
| CpioArchiveType.TAR: "cpio", | ||
| CpioArchiveType.USTAR: "ustar", | ||
| CpioArchiveType.HPBIN: "cpio", | ||
| CpioArchiveType.HPODC: "cpio", | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@paulnoalhyt, I think I'd feel more comfortable if we didn't do this mapping here.
I think the behavior we want is the following:
- If someone tries to pack a CPIO archive, and the archive type is not supported, we raise an error explaining that to the user.
- This error can tell them how to do this mapping themselves (they can update the attributes on the CPIO archive to the desired output format before calling pack. We could even implement an archive format modifier to do this change.
This still gives the users the ability to pack, but they need to explicitly know they are changing the archive format.
What do you think?
One sentence summary of this PR (This should go in the CHANGELOG!)
Reimplement the CPIO components using
libarchive.Link to Related Issue(s)
Closes #616
Please describe the changes in your request.
Updated the CPIO components to use
libarchiveinstead of thecpiocommand. The advantage is that unpacking is done in memory, files aren't actually created on disk. This fixes the 2 issues documented in #616. I added regression tests for the 2 aforementioned issues, plus a full unpack-pack-unpack test to make sure all metadata is preserved.Anyone you think should look at this, specifically?
No