screenshot: long filename handling improvements #10052

Open
wants to merge 1 commit into
base: master

rtldg
Contributor

@rtldg rtldg commented Apr 4, 2022

For screenshot filenames, it was possible for the basename to be
longer than what filesystems generally support.

On Linux, the limit is 255 bytes. On Windows, it is 255 wchar_t units.
truncate_long_base_filename therefore truncates the basename so that
basename + extension is <= 255 bytes.
It also makes sure the cut does not leave an incomplete UTF-8 sequence in the filename.

For testing, filling screenshot-template= with 3-byte or 4-byte
UTF-8 codepoints is best, such as "ウ" (3-byte) or "🌂" (4-byte).
Example: 84 * strlen("ウ") + strlen(".jpg") == 256.
The last "ウ" is removed, leaving a basename of 83 "ウ" characters
plus ".jpg", totalling 253 bytes.
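
A minimal sketch of the truncation arithmetic described above, assuming valid UTF-8 input. It is not the PR's code: SKETCH_NAME_MAX and sketch_truncate_basename are hypothetical names, and the 255-byte limit is the assumption stated in the description.

```c
#include <stddef.h>
#include <string.h>

#define SKETCH_NAME_MAX 255  // assumed per-component limit in bytes

// Truncate s in place so that strlen(s) + 1 + ext_len <= SKETCH_NAME_MAX,
// never cutting through a multi-byte UTF-8 sequence.
static void sketch_truncate_basename(char *s, size_t ext_len)
{
    if (ext_len + 1 >= SKETCH_NAME_MAX) { // no room left for a base name
        s[0] = '\0';
        return;
    }
    size_t max_bytes = SKETCH_NAME_MAX - (ext_len + 1); // +1 for '.'
    if (strlen(s) <= max_bytes)
        return;
    size_t cut = max_bytes;
    // Back up while the byte at the cut is a continuation byte (10xxxxxx),
    // so the cut lands on the start of a sequence.
    while (cut > 0 && ((unsigned char)s[cut] & 0xC0) == 0x80)
        cut--;
    s[cut] = '\0';
}
```

With 84 "ウ" (252 bytes) and ext_len 3 for "jpg", the cut backs up from byte 251 to byte 249, which leaves 83 characters.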

I only tested on Windows 10 21H2 x64. Here are some copy-and-paste screenshot-template values:

screenshot-template="ウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウウ"
screenshot-template="aaa🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂"
screenshot-template="aa🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂"
screenshot-template="a🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂"

@rtldg
Contributor Author

rtldg commented Apr 4, 2022

I'll have to correct the trim_invalid_utf8 function. I realized I wrote it for something else that had different input string guarantees, so it only trims the last codepoint if it's missing the last byte of the sequence.

edit: Backtracking the string some and using bstr_parse_utf8_code_length to see if it goes past the NUL terminator is probably what I'll do for it.

That's done
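
A rough sketch of the backtracking idea, not the code that was pushed. It uses a hypothetical helper, sketch_utf8_seq_len, as a stand-in for bstr_parse_utf8_code_length, and assumes the string was valid UTF-8 before it was cut.

```c
#include <stddef.h>
#include <string.h>

// Expected length of a UTF-8 sequence given its first byte, or -1 if the
// byte cannot start a sequence (hypothetical stand-in helper).
static int sketch_utf8_seq_len(unsigned char b)
{
    if (b < 0x80) return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    if ((b & 0xF8) == 0xF0) return 4;
    return -1;
}

// If the last sequence in s would run past the NUL terminator, chop it off.
static void sketch_trim_incomplete_utf8(char *s)
{
    size_t len = strlen(s);
    if (len == 0)
        return;
    // Walk back over at most 3 continuation bytes to the last lead byte.
    size_t i = len - 1;
    int back = 0;
    while (i > 0 && back < 3 && ((unsigned char)s[i] & 0xC0) == 0x80) {
        i--;
        back++;
    }
    int need = sketch_utf8_seq_len((unsigned char)s[i]);
    if (need < 0 || i + (size_t)need > len)
        s[i] = '\0';
}
```

This handles a sequence missing one, two, or three trailing bytes, rather than only the last byte.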

@rtldg rtldg force-pushed the rtldg-screenshot-long-filenames branch 2 times, most recently from b8eb5bb to 0738e2e Compare April 5, 2022 08:20
@rtldg rtldg force-pushed the rtldg-screenshot-long-filenames branch from 0738e2e to 8d62bb6 Compare January 17, 2023 13:26
@rtldg rtldg force-pushed the rtldg-screenshot-long-filenames branch from 8d62bb6 to a89f2dd Compare August 1, 2023 05:05
@sfan5
Member

sfan5 commented Aug 20, 2023

I'd take the UTF-8 fix without the win32-specific changes. There's #12119 for that now.

@rtldg rtldg force-pushed the rtldg-screenshot-long-filenames branch from a89f2dd to ca0c65a Compare August 21, 2023 05:08
@rtldg
Contributor Author

rtldg commented Aug 21, 2023

Force-pushed to remove the Win32 changes and to make a couple of the screenshot file writing functions use mp_fopen to benefit from #12119, since they were using fopen previously.

@sfan5 sfan5 self-requested a review August 21, 2023 08:04
video/image_writer.c (outdated, resolved)
@@ -127,6 +127,30 @@ static void append_filename(char **s, const char *f)
    talloc_free(append);
}

static void trim_invalid_utf8(char *s, size_t len)
Member

Why would you want to "fix" invalid names? If the name is invalid then mpv should fail when it tries to use it.

There's no end to trying to "fix" bad names, and mpv should not try that, except if there's a very good reason.

Contributor Author

@rtldg rtldg Aug 21, 2023

truncate_long_base_filename can leave the basename with an invalid codepoint so trim_invalid_utf8 is used to remove such a thing. (this is described in the PR text above)

@rtldg rtldg force-pushed the rtldg-screenshot-long-filenames branch from ca0c65a to 25c1a4c Compare August 21, 2023 08:44
player/screenshot.c (outdated, resolved)
player/screenshot.c (outdated, resolved)
player/screenshot.c (outdated, resolved)
@rtldg rtldg force-pushed the rtldg-screenshot-long-filenames branch from 25c1a4c to a7b5097 Compare August 21, 2023 10:41
@rtldg rtldg force-pushed the rtldg-screenshot-long-filenames branch from a7b5097 to 03e91b6 Compare August 21, 2023 10:51
// If truncation produces an invalid UTF-8 codepoint, then chop that off.
static void truncate_long_base_filename(char *s, const size_t ext_len)
{
    const size_t max_utf8_bytes = 255 - (ext_len + 1); // ext_len+1 for '.'
Member

255 is a magic number. It should not be hardcoded. It should probably be MAX_PATH.

On Windows, MAX_PATH can be more than 255, because it has enough space to hold the UTF-8 encoding of 260 wchar_t elements.

Also, if ext_len is 255 or more, then max_utf8_bytes will wrap around to a huge number...
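
A small sketch of the wraparound being described and one possible guard. The names are hypothetical and the 255 limit is simply the value used in the diff above.

```c
#include <stddef.h>

#define SKETCH_NAME_MAX 255  // assumed per-component limit

// Bytes left for the base name once '.' and the extension are accounted for.
// Returns 0 instead of wrapping when the extension alone fills the limit.
static size_t sketch_max_base_bytes(size_t ext_len)
{
    if (ext_len >= SKETCH_NAME_MAX - 1)
        return 0;
    return SKETCH_NAME_MAX - (ext_len + 1);
}
```

Without the guard, 255 - (ext_len + 1) is computed in size_t, so an ext_len of 255 yields the largest size_t value rather than -1.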

Contributor Author

@rtldg rtldg Aug 21, 2023

It should not be MAX_PATH because MAX_PATH is not the basename length. MAX_PATH has to hold the drive letter + ":\" + up to 256 wchar_t/char elements of path + the NUL terminator.

Per Microsoft's "Maximum Path Length Limitation" documentation:

> This type of path is composed of components separated by backslashes, each up to the value returned in the lpMaximumComponentLength parameter of the GetVolumeInformation function (this value is commonly 255 characters).

Checking GetVolumeInformation isn't worth the bother, though.
All relevant filesystems use 255 as the limit for a single path component (including on non-Windows OSes).

If ext_len is 255 let mpv blow up because that's an absurd case to care about.

Member

> It should not be MAX_PATH

Which is why I said probably, i.e. you should figure out whether it should be MAX_PATH or something else, like MAX_NAME.

The point is that 255 should not be hardcoded. It should be appropriate for the current platform, and if it's not MAX_PATH and not MAX_NAME then you should figure out what it needs to be, without calling APIs.

It should probably be some existing constant of the platform, and not hardcoded inside this function.

> If ext_len is 255 let mpv blow up because that's an absurd case to care about.

In your applications maybe. Not in mpv.

You mean that the examples you gave, which should be fixed, are not absurd cases, and so we should really care about them, like this?

screenshot-template="a🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂"

Luckily for you though, it won't blow up, but it will also not work.

It's not rocket science. Please fix it correctly.

Contributor Author

> Which is why I said probably, I.e. you should figure whether it should be MAX_PATH or something else, like MAX_NAME.

Which is what I did and why 255.

> The point is that 255 should not be hardcoded. It should be appropriate for the current platform, and if it's not MAX_PATH and not MAX_NAME then you should figure out what it needs to be, without calling APIs.

Don't hardcode but also figure it out without calling APIs? Okay...

> > If ext_len is 255 let mpv blow up because that's an absurd case to care about.
>
> In your applications maybe. Not in mpv.

The extension comes from mpv. If mpv decides to use 255 character long extensions then that is mpv's fault.

screenshot-template="a🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂🌂"

> Luckily for you though, it won't blow up, but it will also not work.
> It's not rocket science. Please fix it correctly.

You're missing something if you think that won't work, or why it was listed as an example for testing removal of invalidated UTF-8 codepoints due to truncation.

Member

@avih avih Aug 21, 2023

> It should probably be some existing constant of the platform

I don't know if it's available.

We have this:

mpv/osdep/io.c

Lines 539 to 547 in ef4c6df

// Windows' MAX_PATH/PATH_MAX/FILENAME_MAX is fixed to 260, but this limit
// applies to unicode paths encoded with wchar_t (2 bytes on Windows). The UTF-8
// version could end up bigger in memory. In the worst case each wchar_t is
// encoded to 3 bytes in UTF-8, so in the worst case we have:
// wcslen(wpath) * 3 <= strlen(utf8path)
// Thus we need MP_PATH_MAX as the UTF-8/char version of PATH_MAX.
// Also make sure there's free space for the terminating \0.
// (For codepoints encoded as UTF-16 surrogate pairs, UTF-8 has the same length.)
#define MP_PATH_MAX (FILENAME_MAX * 3 + 1)

But it's only used privately in this C file, when enumerating files in a directory.

Not sure how to solve this in general. I don't think we should change the global MAX_PATH either.

Contributor Author

NAME_MAX seems to be available in limits.h for Linux/macOS as 255. It's also in my mingw64/msys2 limits.h, but behind a _POSIX_ ifdef. The BSDs have MAXNAMELEN, which is 255 as far as I can tell.

So the options are something like this, hardcoding 255, or calling out to GetVolumeInformation/pathconf(_PC_NAME_MAX):

#include <limits.h>
#ifndef NAME_MAX
 #ifdef MAXNAMELEN
  #define NAME_MAX MAXNAMELEN
 #else
  #define NAME_MAX 255
 #endif
#endif

Contributor

@kasper93 kasper93 Aug 22, 2023

For win32 you can use _MAX_FNAME https://learn.microsoft.com/en-us/cpp/c-runtime-library/path-field-limits

Note that _MAX_FNAME includes space for terminating null, while NAME_MAX does not.

EDIT: As I mentioned in the other PR, shouldn't we enable long path support in the manifest on Windows?

Contributor Author

> Note that _MAX_FNAME includes space for terminating null, while NAME_MAX does not.

I didn't include it because of that, but yeah, a Windows-specific define could be (_MAX_FNAME-1), which is 255.
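
One way the constants mentioned in this thread could be folded into a single define, as a sketch only. SKETCH_NAME_MAX is a hypothetical name, and the MAXNAMELEN branch only takes effect if a system header has already defined it.

```c
#include <limits.h>   // NAME_MAX on Linux/macOS (may sit behind a POSIX feature macro)
#include <stdlib.h>   // _MAX_FNAME in the Windows CRT

#if defined(_WIN32) && defined(_MAX_FNAME)
 #define SKETCH_NAME_MAX (_MAX_FNAME - 1)  // _MAX_FNAME counts the terminating NUL
#elif defined(NAME_MAX)
 #define SKETCH_NAME_MAX NAME_MAX
#elif defined(MAXNAMELEN)
 #define SKETCH_NAME_MAX MAXNAMELEN
#else
 #define SKETCH_NAME_MAX 255               // fallback, same value as hardcoding
#endif
```

Every branch is expected to come out as 255 on the platforms discussed here, so this mainly documents where the number comes from.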

Contributor

> EDIT: And just as I mentioned in the other PR. Shouldn't we set long paths support in manifest on Windows?

In my opinion, yes.

@sfan5
Member

sfan5 commented Aug 22, 2023

After an IRC discussion it turned out that choosing the right length to truncate at is hard, because the limit is not simply a byte count on every OS.
The maximum is 255 bytes on Linux and 255 UTF-16 (or UCS-2?) code units on Windows and macOS.
However, truncating at 255*3 bytes can also be too late on Windows, since 300 UTF-8 bytes can contain more than 255 UTF-16 code units...

I see three options:

  1. drop this PR and continue to consider it user error if they configure a template that is too long
  2. truncate at 255*3 on win32 and 255 elsewhere (and hope for the best)
  3. truncate to 255 (and restrict win32 more than necessary)
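
To illustrate why a byte count and a UTF-16 unit count can disagree, here is a small sketch that counts the UTF-16 code units a UTF-8 string would occupy. The function name is hypothetical and valid UTF-8 input is assumed; it is not mpv code.

```c
#include <stddef.h>

// Number of UTF-16 code units needed for the UTF-8 string s.
// 1- to 3-byte sequences map to one unit, 4-byte sequences to a surrogate pair.
static size_t sketch_utf16_units(const char *s)
{
    size_t units = 0;
    for (const unsigned char *p = (const unsigned char *)s; *p; p++) {
        if ((*p & 0xC0) == 0x80)
            continue;                  // continuation byte, already counted
        units += (*p >= 0xF0) ? 2 : 1; // lead byte of a 4-byte sequence
    }
    return units;
}
```

300 ASCII bytes come out as 300 units, which is already over a 255-unit limit, while 255 "ウ" (765 bytes) come out as exactly 255 units.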

@rtldg
Contributor Author

rtldg commented Aug 22, 2023

> drop this PR and continue to consider it user error if they configure a template that is too long

Subtitle text via %{sub-text} can leave that out of the user's hands at times.

> truncate at 255*3 on win32 and 255 elsewhere (and hope for the best)

Converting the basename with mp_from_utf8, truncating the new wchar_t*, and then converting back with mp_to_utf8 can be done for a more precise truncation to 255 wchar_ts. I had whipped up some code for this (below) but haven't given it enough testing, since I think option 3 (truncating to 255 UTF-8 bytes) is best for interoperability between operating systems (and network file shares). I even think it'd be worthwhile to use the same ILLEGAL_FILENAME_CHARS everywhere instead of separate ones for WIN32 and non-WIN32.

edit: doesn't append the basename back to the path or anything so would need to be edited to do that

#ifdef _WIN32
static char *truncate_long_base_filename(char *s, const size_t ext_len)
{
    const size_t max_units = 255 - (ext_len + 1); // +1 for '.'
    char *basename = mp_basename(s);
    wchar_t *w = mp_from_utf8(NULL, basename);
    size_t len = wcslen(w);

    if (len <= max_units) {
        talloc_free(w);
        return s;
    }

    talloc_free(s);

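    // The cut keeps w[0..max_units-1]. If the last kept unit is a high
    // surrogate, its low half would be cut off, so drop the high half too.
    if (max_units > 0 && IS_HIGH_SURROGATE(w[max_units-1]))
        w[max_units-1] = 0;
    else
        w[max_units] = 0; // the cut falls on a code point boundary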

    char *res = mp_to_utf8(NULL, w);
    talloc_free(w);
    return res;
}
#else
/* "regular" truncate_long_base_filename here... */
#endif
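
The surrogate handling can be exercised without Windows headers using a plain uint16_t buffer. This is a portable sketch with hypothetical names, not part of the PR; it checks the last kept unit, w[cut - 1], so that a high surrogate is never left without its low half.

```c
#include <stddef.h>
#include <stdint.h>

static int sketch_is_high_surrogate(uint16_t u)
{
    return u >= 0xD800 && u <= 0xDBFF;
}

// Shorten the NUL-terminated UTF-16 buffer w (len units long) to at most
// max_units code units without leaving a lone high surrogate at the end.
// Returns the new length.
static size_t sketch_truncate_utf16(uint16_t *w, size_t len, size_t max_units)
{
    if (len <= max_units)
        return len;
    size_t cut = max_units;
    if (cut > 0 && sketch_is_high_surrogate(w[cut - 1]))
        cut--; // the pair would be split, so drop its high half as well
    w[cut] = 0;
    return cut;
}
```

For example, "a🌂" is the three units 0x0061 0xD83C 0xDF02; truncating it to two units leaves just "a".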

@sfan5
Member

sfan5 commented Aug 23, 2023

> Subtitle text via %{sub-text} can leave that out of the user's hands at times.

You have to admit that this is a niche use case. Users may very well write a script to correctly take screenshots named after subtitle text if they want to do that.

> Converting the basename with mp_from_utf8, truncating the new wchar_t*, and then back with mp_to_utf8 can be done [...]

Sure, but this is a good example of the complicated platform-specific support code that I'd like to avoid.

> I even think it'd be worthwhile to use the same ILLEGAL_FILENAME_CHARS instead of separate ones for WIN32 and non-WIN32.

Terrible idea IMO.

@rtldg
Contributor Author

rtldg commented Aug 23, 2023

> > Subtitle text via %{sub-text} can leave that out of the user's hands at times.
>
> You have to admit that this is a niche usecase. Users may very well write a script to correctly take screenshots named after subtitle text if they want to do that.

That's much more work than just throwing %{sub-text} into screenshot-template.

> > Converting the basename with mp_from_utf8, truncating the new wchar_t*, and then back with mp_to_utf8 can be done [...]
>
> Sure, but this is a good example for platform-specific complicated support code that I'd like to avoid.

I'd like to avoid it too, especially since it could cause file access issues if you were to have a filename on NTFS that is longer than 255 UTF-8 bytes when accessed from Linux.

> > I even think it'd be worthwhile to use the same ILLEGAL_FILENAME_CHARS instead of separate ones for WIN32 and non-WIN32.
>
> Terrible idea IMO.

Networked file shares and also file access issues from Linux -> Windows again.

@kasper93
Contributor

kasper93 commented May 7, 2024

Honestly, I think long/unsupported filenames should be rejected with an error for the user to act on. Truncating them implicitly doesn't really help anyone.
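
A rough sketch of what rejecting instead of truncating could look like. The function name and the 255-byte limit are assumptions for illustration; how mpv would actually report the error is not covered here.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_NAME_MAX 255  // assumed per-component limit in bytes

// Returns 0 if "<base>.<ext>" fits in one path component, or ENAMETOOLONG
// so the caller can report the problem instead of silently truncating.
static int sketch_check_name_len(const char *base, const char *ext)
{
    size_t total = strlen(base) + 1 + strlen(ext); // +1 for '.'
    return total <= SKETCH_NAME_MAX ? 0 : ENAMETOOLONG;
}
```

Like the byte-based truncation discussed earlier, this only checks bytes, so it would be stricter than necessary on Windows.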

@kasper93 kasper93 added the priority:on-ice may be revisited later label May 7, 2024