open error: Gigi Rf.res Illegal byte sequence #81

Open
cmcginty opened this issue Mar 26, 2023 · 8 comments

Comments

@cmcginty

cmcginty commented Mar 26, 2023

Extracting the "Amped - Freestyle Snowboarding" ISO produces an error on macOS.

Running with LC_ALL=C produced the same result.

$ locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=
@JayFoxRox
Member

This file is probably supposed to be called "Gigi Rüf.res", named after the Snowboarder.
Your filesystem probably does not support the umlaut.
You likely can't fix this via the locale, because it's a filesystem limitation.

So either the file can't be extracted, or it has to be renamed (either way, resulting in a potentially unusable dump for many purposes).

With a bit of luck/work on extract-xiso, the umlaut can likely be preserved, but it would still affect the filename (meaning the game / tools probably wouldn't find it).
As a workaround, you could create / mount a different filesystem which does support it.

However, the solution likely depends on why you are extracting the XISO in the first place.

@rapperskull
Contributor

rapperskull commented Mar 27, 2023

Any modern FS supports Unicode file names, so I doubt that is the problem. I tested the reported ISO under Linux and the problem also occurs there when listing files (Windows is affected too when an incompatible character set is used).

The underlying problem is extract-xiso treating filenames as character sequences, without any understanding of the character set. The solution would be to convert the filenames to Unicode (UTF-8 would probably be the chosen encoding), and let the FS decide how to rename incompatible characters. Since we're dealing with pretty standard characters, there shouldn't be any loss when converting back and forth, but it's not a guarantee.

EDIT: To be clear, it's not that the file system does not support the character ü, but that the Windows-1252 representation of the character (0xFC) is considered invalid, probably because the ASCII (7-bit) character set is in use.
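To illustrate the point about 0xFC, here is a minimal Python sketch (not extract-xiso code). It assumes the name is stored in the image as the Windows-1252 bytes for "Gigi Rüf.res"; `raw_name`, `decode_xiso_name` and `is_valid_in` are hypothetical names used only for this example.

```python
# Assumed raw bytes of the directory entry: "Gigi Rüf.res" in Windows-1252.
raw_name = b"Gigi R\xfcf.res"


def decode_xiso_name(raw: bytes) -> str:
    """Interpret raw name bytes as Windows-1252 text."""
    return raw.decode("cp1252")


def is_valid_in(raw: bytes, charset: str) -> bool:
    """Return True if the raw bytes are a valid sequence in `charset`."""
    try:
        raw.decode(charset)
        return True
    except UnicodeDecodeError:
        return False


# The byte 0xFC is neither valid 7-bit ASCII nor a valid UTF-8 sequence,
# which is why passing the raw bytes through unchanged can be rejected.
print(is_valid_in(raw_name, "ascii"))
print(is_valid_in(raw_name, "utf-8"))

# Decoding as Windows-1252 first, then encoding as UTF-8, gives valid bytes.
name = decode_xiso_name(raw_name)
utf8_name = name.encode("utf-8")
print(utf8_name)

# For this name the conversion is reversible.
print(utf8_name.decode("utf-8").encode("cp1252") == raw_name)
```

The round trip works here because ü exists in both charsets; as noted above, that is not guaranteed for every name once the file system gets to rename characters.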

@rapperskull
Contributor

rapperskull commented Mar 27, 2023

I fixed the issue in my development branch (#80). It needs some testing, though.

Remarks:

  • It will probably still fail on systems that don't support the UTF-8 charset.
  • In extract mode, non-ASCII names will result in a "best fit" when the file system doesn't support Unicode. This means that extracting an ISO and re-creating it from the extracted files does not guarantee identical file names.
  • In listing mode, and in general when printing the file names on the terminal, the characters not in the user charset will be shown as a "best fit". This does not affect the creation/rewriting/extraction of the ISO, and is purely a limitation of the terminal.
  • In create mode, characters outside the Windows-1252 character set will be replaced by spaces.
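The create-mode remark can be sketched roughly as follows. This is only an illustration of the described behaviour in Python, not the code from #80; `encode_for_xiso` is a hypothetical helper name.

```python
def encode_for_xiso(name: str) -> bytes:
    """Encode a host file name as Windows-1252, replacing every character
    that Windows-1252 cannot represent with a single space."""
    out = bytearray()
    for ch in name:
        try:
            out += ch.encode("cp1252")
        except UnicodeEncodeError:
            out += b" "  # character is outside Windows-1252
    return bytes(out)


print(encode_for_xiso("Gigi Rüf.res"))  # ü fits into Windows-1252
print(encode_for_xiso("日本.bin"))       # both CJK characters become spaces
```

Because several different characters collapse to the same space, two distinct host names could end up identical in the image; the sketch does not try to detect that.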

@JayFoxRox
Member

To be clear, it's not that the file system does not support the character ü, but that the Windows-1252 representation of the character (0xFC) is considered invalid

Right, I'm not saying you can't visually represent the names, I'm saying you are making it binary incompatible, because all tools (modding suites, emulators, tools which transfer files to your Xbox / default.xbe, other XISO tools, ...) have to agree on how they handle this (= use existing byte sequences on filesystems which can have them? always convert? prefix illegal sequences with non-printing modifiers and add a visual umlaut? ...).

Hence:

With a bit of luck/work on extract-xiso, the umlaut can likely be preserved, but it would still affect the filename (meaning the game / tools probably wouldn't find it).


I've also seen at least one game (Furious Karting maybe?) which uses small files, which also acts as copy-protection, because extracting each file in the XISO to FATX (or many other filesystems) requires a full sector, so the size of the game explodes. It's not hard to imagine that certain games use specific filenames which aren't easily supported in other filesystems as protection, too.
These are situations which are hard to solve without affecting at least some of the programs which work with these extracted files.

However, it depends on what you want to do with those files, because other losses during extraction include the loss of metadata and file offsets within the image (which, again, might be used by the copy protection).

@rapperskull
Contributor

Right, I'm not saying you can't visually represent the names, I'm saying you are making it binary incompatible, because all tools (modding suites, emulators, tools which transfer files to your Xbox / default.xbe, other XISO tools, ...) have to agree on how they handle this (= use existing byte sequences on filesystems which can have them? always convert? prefix illegal sequences with non-printing modifiers and add a visual umlaut? ...).

I got the point, and this is a problem that goes far beyond the OG Xbox, since Unicode support is basically always broken. And when it's not broken, the implicit "best fit" approach makes you lose information without even realizing it.
Back on track, in FileZilla you can select the charset of the server you're trying to connect to, so it should be possible to transfer all Windows-1252 characters without problems, and all other characters will be mapped with the "best fit" approach. And this isn't affected by how the filesystem stores file names¹, since the system calls will always return the name encoded in the charset you requested.
Of course this does not apply to every tool, and problems can and will arise. However I don't think that other tools being broken is an excuse to behave wrongly or outright crash.

With a bit of luck/work on extract-xiso, the umlaut can likely be preserved, but it would still affect the filename (meaning the game / tools probably wouldn't find it).

That would be a problem of the tool, not ours.

I've also seen at least one game (Furious Karting maybe?) which uses small files, which also acts as copy-protection, because extracting each file in the XISO to FATX (or many other filesystems) requires a full sector, so the size of the game explodes. It's not hard to imagine that certain games use specific filenames which aren't easily supported in other filesystems as protection, too. These are situations which are hard to solve without affecting at least some of the programs which work with these extracted files.

I don't see how this is a problem, since files still use at least one sector in XISOs. Sure, those are 2048-byte sectors, versus the typical 4096-byte sector, but the game will at most double in size compared to the ISO size. Not ideal, but not too bad either.
I tried extracting Furious Karting (http://redump.org/disc/23905/) and it took a long time, but these are the results:

  • 41,266 files and 2,117 folders
  • Size: 2.89 GB (3,113,473,764 bytes)
  • Size on disk: 2.93 GB (3,153,113,088 bytes)

Not a huge difference, so I don't know if you were talking about another game. Just curious, though.
Of course trying to copy the extracted game via FTP would be a nightmare.
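For reference, the size arithmetic above can be sketched in Python. The two totals are the figures reported in this comment; `size_on_disk` is a hypothetical helper that assumes every non-empty file occupies a whole number of sectors (or clusters) and ignores directory and metadata overhead.

```python
def size_on_disk(file_sizes, cluster_size):
    """Sum of file sizes after rounding each file up to whole clusters."""
    # -(-a // b) is ceiling division for non-negative integers.
    return sum(-(-size // cluster_size) * cluster_size for size in file_sizes)


# Toy example: the same three small files on 2048-byte and 4096-byte units.
toy_files = [1, 2048, 2049]
print(size_on_disk(toy_files, 2048))
print(size_on_disk(toy_files, 4096))

# Overhead for the Furious Karting extraction, using the reported totals.
reported_size = 3_113_473_764
reported_on_disk = 3_153_113_088
overhead = reported_on_disk - reported_size
print(overhead, f"{overhead / reported_size:.2%}")
```

With the reported totals the overhead comes to 39,639,324 bytes, a bit under 1.3% of the data size.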

However, it depends on what you want to do with those files, because other losses during extraction include the loss of metadata and file offsets within the image (which, again, might be used by the copy protection).

I agree. Some loss of information will always occur; the important thing is to know where and when it occurs, and plan accordingly.

¹ with the exception of APFS on Sierra and probably some more exotic OS/FS combinations.

@JayFoxRox
Member

Sure, those are 2048-byte sectors, versus the typical 4096-byte sector

Sounds right, now I wonder if I misremember or if it has just been a bug in some tooling. 🤔

The solution would be to convert the filenames to Unicode (UTF-8 would probably be the chosen encoding), and let the FS decide how to rename incompatible characters
[...]
Of course this does not apply to every tool, and problems can and will arise. However I don't think that other tools being broken is an excuse to behave wrongly or outright crash.
[...]
That would be a problem of the tool, not ours.

Agreed, although I'm not 100% sure we should enforce UTF-8. I think it might be better to allow the user to set a charset (but defaulting to UTF-8).

Anyhow, if FileZilla has support for re-interpreting charsets, that solves a lot of the problems already for a common use-case.
Cxbx-Reloaded is probably another huge use-case (not sure how that currently deals with it).
Not sure about modding tools, but I'm fine with them being responsible for using the correct charset to decode/encode filenames.

@rapperskull
Contributor

Agreed, although I'm not 100% sure we should enforce UTF-8. I think it might be better to allow the user to set a charset (but defaulting to UTF-8).

We don't need to support any user charset, the OS does it for us (mostly).

There are three charsets the OS knows of:

  • the program charset: this is set by us, defaults to "C", and is the language we're speaking with the OS
  • the terminal charset: this is the charset the user terminal is currently set to
  • the filesystem charset: this is the charset the FS uses to store file names

And then there's the charset of the data (in our case always Windows-1252, since it's the one used by the Xbox), which the OS knows nothing about.

When we want to talk to the OS, we need to translate¹ the data charset to the program charset. At this point, if we're printing a filename, the OS translates it to the terminal charset and prints it. If we're creating a file, instead, the OS translates it to the FS charset and creates the file.

¹Translating from one charset to the other works like this: for every character in the source string, if the destination charset can represent the character, convert it to the new encoding, otherwise use a "best fit" approach (i.e. ü becomes u).
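A rough Python sketch of the translation rule in the footnote above. Real operating systems use their own best-fit tables; this example only approximates them with Unicode decomposition, and `best_fit` plus the "?" placeholder are assumptions made for illustration.

```python
import unicodedata


def _encodable(ch: str, charset: str) -> bool:
    """Return True if `charset` can represent the character."""
    try:
        ch.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


def best_fit(text: str, charset: str) -> str:
    """Keep characters the destination charset can represent; otherwise
    fall back to the base letter(s) of the character, or '?' if none fit."""
    out = []
    for ch in text:
        if _encodable(ch, charset):
            out.append(ch)
            continue
        # Decompose (ü -> u + combining diaeresis) and drop combining marks.
        decomposed = unicodedata.normalize("NFKD", ch)
        kept = [c for c in decomposed
                if not unicodedata.combining(c) and _encodable(c, charset)]
        out.append("".join(kept) or "?")
    return "".join(out)


print(best_fit("Gigi Rüf.res", "ascii"))  # ü is mapped to u
print(best_fit("Gigi Rüf.res", "utf-8"))  # unchanged, UTF-8 can represent ü
```

The ASCII case shows the information loss discussed earlier: once ü has become u, the original byte 0xFC cannot be recovered from the translated name.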

Two observations:

  1. The program charset should be able to represent every character we need, otherwise it becomes a "bottleneck"
  2. The best case scenario would be if the program charset is Windows-1252, as in this case we wouldn't need to translate the data charset

Unfortunately, not all systems support CP1252, so we still need to find a "universal" program charset. UTF-8 is the best candidate because it's ALMOST universal. We could try to support more charsets, but that would require handling the translations between CP1252 and those charsets, and we could only ever support charsets that can actually represent all CP1252 characters.

One change I will make to my implementation is to fall back to Windows-1252 if UTF-8 is not supported, since, as I understand it, UTF-8 support has only been available since a specific Windows 10 version.
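The selection logic described here could look roughly like this in Python. It is a sketch under assumptions, not the implementation in #80: `supported` is a hypothetical set of charset names the system offers, and the candidate order (UTF-8 first, then Windows-1252) follows the fallback described above.

```python
def cp1252_characters() -> str:
    """All characters Python's cp1252 codec can decode."""
    chars = []
    for byte in range(256):
        try:
            chars.append(bytes([byte]).decode("cp1252"))
        except UnicodeDecodeError:
            pass  # a few byte values are unassigned in this codec
    return "".join(chars)


def can_represent_cp1252(charset: str) -> bool:
    """True if `charset` can encode every Windows-1252 character, so it
    would not become a bottleneck as the program charset."""
    try:
        cp1252_characters().encode(charset)
        return True
    except (UnicodeEncodeError, LookupError):
        return False


def pick_program_charset(supported):
    """Prefer UTF-8, fall back to Windows-1252, else give up (None)."""
    for candidate in ("utf-8", "cp1252"):
        if candidate in supported and can_represent_cp1252(candidate):
            return candidate
    return None


print(pick_program_charset({"utf-8", "cp1252"}))
print(pick_program_charset({"cp1252"}))
print(can_represent_cp1252("latin-1"))  # lacks e.g. the euro sign
```

The `can_represent_cp1252` check mirrors the constraint from the previous paragraph: any additional charset would only be usable if it covers all CP1252 characters.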

@rapperskull
Contributor

I implemented the change and added a comment on #80 to explain a problem. I don't know what the best approach would be though.
