Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix unpacking of filenames with contains UTF-8 characters #52

Merged
merged 3 commits into from
Mar 3, 2021

Conversation

otavio
Copy link
Member

@otavio otavio commented Feb 28, 2021

Change from the C to C.UTF-8 locale, allowing libarchive to handle
filenames in UTF-8. We restrict to change LC_CTYPE only, since
libarchive only needs the charset set.

See on libarchive Website for a more complete description of the issue:

libarchive/libarchive#587
https://github.com/libarchive/libarchive/wiki/Filenames

Once we complete the uncompress operation, we restore the original
LC_CTYPE after extraction to avoid side effects.

Signed-off-by: Otavio Salvador otavio@ossystems.com.br

@coveralls
Copy link

coveralls commented Feb 28, 2021

Pull Request Test Coverage Report for Build 618426689

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 11 of 15 (73.33%) changed or added relevant lines in 2 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage increased (+0.9%) to 82.586%

Changes Missing Coverage Covered Lines Changed/Added Lines %
src/ffi/generated.rs 0 4 0.0%
Files with Coverage Reduction New Missed Lines %
../../../../../usr/share/rust/.cargo/registry/src/github.com-1ecc6299db9ec823/futures-channel-0.3.13/src/mpsc/mod.rs 1 89.33%
Totals Coverage Status
Change from base Build 608762615: 0.9%
Covered Lines: 313
Relevant Lines: 379

💛 - Coveralls

@afdw
Copy link
Collaborator

afdw commented Feb 28, 2021

But does it mean that user code in callbacks (for example, functions on source and target arguments, which implement Read and Write) are called with the changed (UTF-8) locale instead of the original one?

pub fn uncompress_data<R, W>(source: R, target: W) -> Result<usize> 
where
    R: Read,
    W: Write, 

I think this is not ideal?

Also, does it mean that libarchive functions are called with a different locale when reading or writing archive and when opening or closing archive?

@otavio
Copy link
Member Author

otavio commented Feb 28, 2021

Yes, @afdw. This was the best choice I have found up to now. Do you see another way to solve this?

I'd love to have a second pair of eyes on the issue and a proposed alternative.

@afdw
Copy link
Collaborator

afdw commented Feb 28, 2021

I think what can be done is the following: on every transition of execution to libarchive, locale is changed to UTF-8; on every transition of execution from libarchive, locale is restored.

This means the following:

  1. When calling any libarchive: before the call, locale is changed to UTF-8; after the call, locale is restored.
  2. When libarchive calls any our callback: when entering our callback, locale is restored; when exiting our callback, locale is changed to UTF-8.

One the ways to guarantee both of these is creating a module with the same API as the ffi one, acting as a wrapper for it.

@otavio
Copy link
Member Author

otavio commented Feb 28, 2021

In general, not all functions need to take this into consideration. I am aware of archive_entry_pathname.

Your idea isn't bad. In fact, it's a really clever idea. We found this error while unpacking a rootfs during an OTA update so we'll first confirm this current approach works and if it does, we can rework it as you said.

@otavio
Copy link
Member Author

otavio commented Feb 28, 2021

By the way, I did not find how to do the same thing on Windows. Are you aware of some way to do it?

@afdw
Copy link
Collaborator

afdw commented Feb 28, 2021

By the way, I did not find how to do the same thing on Windows. Are you aware of some way to do it?

Restoring locale? Yes, after reading the Microsoft's documentation on setlocale, I think I am.

You can copy the string returned by setlocale to restore that part of the program's locale information.

A null pointer that's passed as the locale argument tells setlocale to query instead of to set the international environment. If the locale argument is a null pointer, the program's current locale setting isn't changed. Instead, setlocale returns a pointer to the string that's associated with the category of the thread's current locale.

So, I think something like that should do the trick:

#[cfg(windows)]
unsafe fn set_utf8_locale() -> std::ffi::CString {
    let locale = b".UTF-8\0";

    let old_locale = std::ffi::CStr::from_ptr(libc::setlocale(libc::LC_CTYPE, std::ptr::null())).to_owned();

    libc::setlocale(
        libc::LC_CTYPE,
        std::ffi::CStr::from_bytes_with_nul_unchecked(locale).as_ptr(),
    );

    old_locale
}

#[cfg(windows)]
unsafe fn restore_locale(old_locale: std::ffi::CString) {
    libc::setlocale(libc::LC_CTYPE, old_locale.as_ptr());
}

@otavio otavio force-pushed the issue-utf8-pathname branch 2 times, most recently from 233a1fd to 560d9ee Compare March 1, 2021 00:42
@otavio
Copy link
Member Author

otavio commented Mar 1, 2021

@afdw mostly, yes. However, it seems we are getting a segfault. I tried using Option, and also passing the pointer directly, but it is failing. Do you have a Windows device to test it and see if you find the culprit?

@otavio
Copy link
Member Author

otavio commented Mar 1, 2021

@afdw I tried:

#[cfg(windows)]
unsafe fn set_utf8_locale() -> Option<ffi::CString> {
    let locale = b".UTF-8\0";

    let old_locale = {
        let old_locale = libc::setlocale(libc::LC_CTYPE, ptr::null());
        if old_locale.is_null() {
            None
        } else {
            Some(ffi::CStr::from_ptr(old_locale).to_owned())
        }
    };

    libc::setlocale(
        libc::LC_CTYPE,
        ffi::CStr::from_bytes_with_nul_unchecked(locale).as_ptr(),
    );

    old_locale
}

#[cfg(windows)]
unsafe fn restore_locale(old_locale: Option<ffi::CString>) {
    if let Some(old_locale) = old_locale {
        libc::setlocale(libc::LC_CTYPE, old_locale.as_ptr());
    }
}

However, it seems to segfault.

@otavio
Copy link
Member Author

otavio commented Mar 1, 2021

@innobead, could you take a look at this? As you seem to be using the library in Windows, it'd be nice if you could help.

@otavio otavio force-pushed the issue-utf8-pathname branch 2 times, most recently from 0ab5b16 to 6191f16 Compare March 1, 2021 02:33
@otavio
Copy link
Member Author

otavio commented Mar 1, 2021

I finally got it working. The code is much simpler, and we essentially rely on Windows to do the right thing (if possible, heh). I ended using:

#[cfg(windows)]
unsafe fn set_utf8_locale() -> *mut libc::c_char {
    let locale = b".UTF-8\0";

    let old_locale = libc::setlocale(libc::LC_CTYPE, std::ptr::null());

    libc::setlocale(libc::LC_CTYPE, locale.as_ptr() as *const libc::c_char);

    old_locale
}

#[cfg(windows)]
unsafe fn restore_locale(old_locale: *mut libc::c_char) {
    libc::setlocale(libc::LC_CTYPE, old_locale);
}

We need to confirm this work on the embedded device before the merge. It was a long journey, but it seems we got it right after all.

I want to thank @afdw for his nice idea of wrapping the FFI method and his help in figuring how to do the locale setting in Windows; really appreciated.

@otavio
Copy link
Member Author

otavio commented Mar 1, 2021

This seems close but it still fails sometimes.

image

@afdw
Copy link
Collaborator

afdw commented Mar 1, 2021

I finally got it working. The code is much simpler, and we essentially rely on Windows to do the right thing (if possible, heh). I ended using:

#[cfg(windows)]
unsafe fn set_utf8_locale() -> *mut libc::c_char {
    let locale = b".UTF-8\0";

    let old_locale = libc::setlocale(libc::LC_CTYPE, std::ptr::null());

    libc::setlocale(libc::LC_CTYPE, locale.as_ptr() as *const libc::c_char);

    old_locale
}

#[cfg(windows)]
unsafe fn restore_locale(old_locale: *mut libc::c_char) {
    libc::setlocale(libc::LC_CTYPE, old_locale);
}

The reason why I did not write it initially like that was:

You can copy the string returned by setlocale to restore that part of the program's locale information. Global or thread local storage is used for the string returned by setlocale. Later calls to setlocale overwrite the string, which invalidates string pointers returned by earlier calls.

So it seems to me that libc::setlocale(libc::LC_CTYPE, locale.as_ptr() as *const libc::c_char); invalides the pointer returned by libc::setlocale(libc::LC_CTYPE, std::ptr::null())?

But I am not sure if I understand the documentation correctly here.

@afdw
Copy link
Collaborator

afdw commented Mar 1, 2021

@afdw mostly, yes. However, it seems we are getting a segfault. I tried using Option, and also passing the pointer directly, but it is failing. Do you have a Windows device to test it and see if you find the culprit?

For me, PKG_CONFIG_ALLOW_CROSS=1 WINEPATH=/usr/x86_64-w64-mingw32/bin cargo test --target x86_64-pc-windows-gnu -- --skip uncompress_to_dir_with_utf8_pathname runs OK with my code, but this is on Wine (not Windows). Also, uncompress_to_dir_with_utf8_pathname never works on Wine for me.

Were you getting a segfault on uncompress_to_dir_with_utf8_pathname, or also on other tests with this locale restoring code?

@otavio
Copy link
Member Author

otavio commented Mar 1, 2021

So it seems to me that libc::setlocale(libc::LC_CTYPE, locale.as_ptr() as *const libc::c_char); invalides the pointer returned by libc::setlocale(libc::LC_CTYPE, std::ptr::null())?

So the code (#52 (comment)) is the closest to the correct approach I've tested.

Were you getting a segfault on uncompress_to_dir_with_utf8_pathname, or also on other tests with this locale restoring code?

Just on uncompress_to_dir_with_utf8_pathname test. All others run just fine.

What puzzles me is it is flaky as 1 of 3 Windows runs has complete successfully, so it points to some concurrency or other indeterministic issue.

@afdw
Copy link
Collaborator

afdw commented Mar 1, 2021

Were you getting a segfault on uncompress_to_dir_with_utf8_pathname, or also on other tests with this locale restoring code?

Just on uncompress_to_dir_with_utf8_pathname test. All others run just fine.

What puzzles me is it is flaky as 1 of 3 Windows runs has complete successfully, so it points to some concurrency or other indeterministic issue.

Do you have a backtrace? If not, I will probably need to install a Windows VM to test it myself, maybe I will be able to say something (as I can not get any code version to run this test case on Wine, probably due to some unrelated issue).

Another thing which is I think worth trying is wrapping every method in change-to-UTF-8/restore locale code, not just archive_read_next_header.

@otavio
Copy link
Member Author

otavio commented Mar 1, 2021

Do you have a backtrace? If not, I will probably need to install a Windows VM to test it myself, maybe I will be able to say something (as I can not get any code version to run this test case on Wine, probably due to some unrelated issue).

I don't. It is not shown in the CI job.

Another thing which is I think worth trying is wrapping every method in change-to-UTF-8/restore locale code, not just archive_read_next_header.

Indeed; this is worth checking.

@otavio
Copy link
Member Author

otavio commented Mar 3, 2021

@Jonathas-Conceicao if CI passes, please invert the order of patches so it updates the container version first.

Jonathas-Conceicao and others added 3 commits March 3, 2021 15:07
Signed-off-by: Jonathas-Conceicao <jonathas.conceicao@ossystems.com.br>
This commit is a preparation to a bigger change to wrapper few `ffi`
calls.

We are adding the generated bindings inside an inner `ffi::generated`
module and re-exporting it as `ffi` so requiring no code changes to use
it.

This also change the symbols visibility to be pub(crate) as we do not
intent people to call the `ffi` module directly.

Signed-off-by: Otavio Salvador <otavio@ossystems.com.br>
Change from the C to C.UTF-8 locale, allowing libarchive to handle
filenames in UTF-8. We restrict to change LC_CTYPE only, since
libarchive only needs the charset set.

See on libarchive Website for a more complete description of the issue:

  libarchive/libarchive#587
  https://github.com/libarchive/libarchive/wiki/Filenames

Once we complete the uncompress operation, we restore the original
LC_CTYPE after extraction to avoid side effects.

Signed-off-by: Otavio Salvador <otavio@ossystems.com.br>
@Jonathas-Conceicao Jonathas-Conceicao merged commit f4f0668 into master Mar 3, 2021
@Jonathas-Conceicao Jonathas-Conceicao deleted the issue-utf8-pathname branch March 3, 2021 18:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

4 participants