time.strftime() and Unicode characters on Windows #52551

Closed
AndiDogold mannequin opened this issue Apr 3, 2010 · 18 comments · Fixed by #125193
Assignees
Labels
3.12, 3.13, 3.14, extension-modules, OS-windows, stdlib, topic-unicode, type-bug

Comments

@AndiDogold
Mannequin

AndiDogold mannequin commented Apr 3, 2010

BPO 8304
Nosy @terryjreedy, @pfmoore, @abalkin, @vstinner, @ericvsmith, @tjguk, @ezio-melotti, @shimizukawa, @zware, @eryksun, @zooba

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


GitHub fields:

assignee = None
closed_at = None
created_at = <Date 2010-04-03.15:08:42.673>
labels = ['3.8', '3.9', 'extension-modules', 'expert-unicode', 'type-bug', '3.10', 'library', 'OS-windows']
title = 'time.strftime() and Unicode characters on Windows'
updated_at = <Date 2021-03-08.19:17:29.084>
user = 'https://bugs.python.org/AndiDogold'

bugs.python.org fields:

activity = <Date 2021-03-08.19:17:29.084>
actor = 'eryksun'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Extension Modules', 'Library (Lib)', 'Unicode', 'Windows']
creation = <Date 2010-04-03.15:08:42.673>
creator = 'AndiDog_old'
dependencies = []
files = []
hgrepos = []
issue_num = 8304
keywords = []
message_count = 16.0
messages = ['102269', '102298', '102310', '102332', '102335', '159341', '222667', '226114', '251554', '251558', '251560', '255043', '255133', '388241', '388277', '388286']
nosy_count = 12.0
nosy_names = ['terry.reedy', 'paul.moore', 'belopolsky', 'vstinner', 'eric.smith', 'tim.golden', 'ezio.melotti', 'AndiDog_old', 'shimizukawa', 'zach.ware', 'eryksun', 'steve.dower']
pr_nums = []
priority = 'normal'
resolution = None
stage = None
status = 'open'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue8304'
versions = ['Python 3.8', 'Python 3.9', 'Python 3.10']

Linked PRs

@AndiDogold
Mannequin Author

AndiDogold mannequin commented Apr 3, 2010

There is inconsistent behavior in time.strftime, comparing Python 2.6 and 3.1. In 3.1, non-ASCII Unicode characters seem to get dropped whereas in 2.6 you can keep them using the necessary Unicode-to-UTF8 workaround.

This should be fixed if it isn't intended behavior.

Python 2.6

>>> time.strftime(u"%d\u200F%A".encode("utf-8"), time.gmtime()).decode("utf-8")
u'03\u200fSaturday'
>>> time.strftime(u"%d\u0041%A".encode("utf-8"), time.gmtime()).decode("utf-8")
u'03ASaturday'

Python 3.1

>>> time.strftime("%d\u200F%A", time.gmtime())
''
>>> len(time.strftime("%d\u200F%A", time.gmtime()))
0
>>> time.strftime("%d\u0041%A", time.gmtime())
'03ASaturday'

@AndiDogold AndiDogold mannequin added the stdlib, topic-unicode, and type-bug labels Apr 3, 2010
@ezio-melotti
Member

This seems to be fixed now, on both 3.1 and 3.2.
Can you try with 3.1.2 and see if it works?
What operating system are you using?

@ezio-melotti
Member

Actually the bug seems related to Windows.

@AndiDogold
Mannequin Author

AndiDogold mannequin commented Apr 4, 2010

Just installed Python 3.1.2, same problem. I'm using Windows XP SP2 with two Python installations (2.6.4 and now 3.1.2).

@AndiDogold
Mannequin Author

AndiDogold mannequin commented Apr 4, 2010

Definitely a Windows problem. I did this on Visual Studio 2008:

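    #include <stdio.h>
    #include <time.h>
    #include <wchar.h>

    int main(void)
    {
        wchar_t out[1000];
        time_t currentTime;
        time(&currentTime);
        struct tm *timeStruct = gmtime(&currentTime);
        size_t ret = wcsftime(out, 1000, L"%d%A", timeStruct);  /* ASCII-only format */
        wprintf(L"ret = %u, out = (%ls)\n", (unsigned)ret, out);
        ret = wcsftime(out, 1000, L"%d\u200f%A", timeStruct);   /* same format plus U+200F */
        wprintf(L"ret = %u, out = (%ls)\n", (unsigned)ret, out);
        return 0;
    }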

and the output was

    ret = 8, out = (04Sunday)
    ret = 0, out = ()

Python really shouldn't use any so-called standard functions on Windows. They never work as expected ^^...

@vstinner
Member

> Actually the bug seems related to Windows.

See also issue bpo-10653: wcsftime() doesn't format time zones correctly, so Python 3 uses strftime() instead.

@BreamoreBoy
Mannequin

BreamoreBoy mannequin commented Jul 10, 2014

Using 3.4.1 and 3.5.0 I get:-

>>> time.strftime("%d\u200F%A", time.gmtime())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'locale' codec can't encode character '\u200f' in position 2: Illegal byte sequence

@terryjreedy
Member

I verified Mark's 3.4.1 result with Idle.

It strikes me as a bug that a function that maps a unicode format string to a unicode string with interpolations added should ever encode the format to bytes, let alone using an encoding that fails or loses information. It is especially weird given that % formatting does not even work (at present) for bytes.

It seems to me that strftime should never encode the non-special parts of the format text. Instead, it could split the format (re.split) into a list of alternating '%x' pairs and running text segments, replace the '%x' entries with the proper entries, and return the list joined back into a string. Some replacements would be locale dependent, others not.
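
A minimal sketch of the split-and-join idea described above, not the approach CPython eventually adopted. It assumes that only single-letter directives (%d, %A, %% and so on) need handling; strftime_split is a hypothetical helper name, and platform-specific flags such as %-d or %#d are not covered.

```python
import re
import time

# One capturing group, so re.split() keeps the directives in the result.
# Only ASCII-letter directives and the literal "%%" are recognised here.
_DIRECTIVE = re.compile(r"(%[A-Za-z%])")


def strftime_split(fmt, t=None):
    """Format fmt piece by piece so literal text never reaches C strftime()."""
    if t is None:
        t = time.localtime()
    parts = _DIRECTIVE.split(fmt)
    # Even indexes hold literal text, odd indexes hold the matched directives.
    for i in range(1, len(parts), 2):
        parts[i] = time.strftime(parts[i], t)
    return "".join(parts)


# The literal U+200F passes through untouched; only "%d" and "%A" are
# handed to time.strftime().
print(ascii(strftime_split("%d\u200f%A", time.gmtime())))
```

Because each directive is formatted separately, the literal segments are never encoded with the locale encoding, at the cost of one strftime() call per directive.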

(Just wondering, are the locale names of days and months restricted to ASCII bytes, or are they unrestricted Unicode using native characters?)

@vstinner vstinner changed the title from "strftime and Unicode characters" to "time.strftime() and Unicode characters on Windows" Oct 1, 2014
@BreamoreBoy
Mannequin

BreamoreBoy mannequin commented Sep 24, 2015

@alexander what is your take on this please? I can confirm that it is still a problem on Windows in 3.5.0.

@abalkin
Member

abalkin commented Sep 25, 2015

Mark, I am no expert on Windows. I believe Victor is most knowledgeable in this area.

@ericvsmith
Member

The problem is definitely that:
format = PyUnicode_EncodeLocale(format_arg, "surrogateescape");
fails on Windows.

Windows is using strftime, not wcsftime. It's not using wcsftime because of bpo-10653.

If I force Windows to use wcsftime, this particular example works:
>>> time.strftime("%d\u200F%A", time.gmtime())
'25\u200fFriday'

I haven't looked at bpo-10653 enough to understand if it's still a problem with the new Visual C++. Maybe it is: I only tested with my default US locale.
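
The encoding failure described above can be approximated in pure Python. This is only a rough sketch: PyUnicode_EncodeLocale() goes through the C library's conversion rather than Python's codecs, and cp1252 is assumed here as a stand-in for a typical Western European ANSI code page. locale_encodable is a hypothetical helper name.

```python
def locale_encodable(fmt, encoding):
    """Return True if every character of fmt can be encoded with encoding."""
    try:
        fmt.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# U+0041 ("A") exists in cp1252, U+200F (RIGHT-TO-LEFT MARK) does not.
print(locale_encodable("%d\u0041%A", "cp1252"))  # True
print(locale_encodable("%d\u200F%A", "cp1252"))  # False
```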

@shimizukawa
Mannequin

shimizukawa mannequin commented Nov 21, 2015

I've implemented a workaround for Sphinx:

>>> time.strftime(u'%Y 年'.encode('unicode-escape').decode(), *args).encode().decode('unicode-escape')
2015 年

https://github.com/sphinx-doc/sphinx/blob/8ae43b9fd/sphinx/util/osutil.py#L175
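
For reference, the same round trip wrapped in a function. This is a sketch rather than the Sphinx code: strftime_escaped is a hypothetical name, and the final step uses the backslashreplace error handler (a change from the one-liner above) so that non-ASCII text produced by the directives themselves, such as localized month names, is escaped before being decoded again.

```python
import time


def strftime_escaped(fmt, t=None):
    """Smuggle non-ASCII format text through strftime() as ASCII escapes."""
    if t is None:
        t = time.localtime()
    # Non-ASCII characters become \xNN, \uNNNN or \UNNNNNNNN escapes, and
    # literal backslashes are doubled, so the format is pure ASCII.
    ascii_fmt = fmt.encode("unicode-escape").decode("ascii")
    out = time.strftime(ascii_fmt, t)
    # Escape any non-ASCII output from the directives, then undo all escapes.
    return out.encode("ascii", "backslashreplace").decode("unicode-escape")


print(strftime_escaped("%Y \u5e74", time.gmtime(0)))
```

One limitation of this trick: a "%" placed directly before a non-ASCII character ends up in front of a backslash, which strftime() may treat as an invalid directive.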

@eryksun
Contributor

eryksun commented Nov 23, 2015

The problem from bpo-10653 is that internally the CRT encodes the time zone name using the ANSI codepage (i.e. the default system codepage). wcsftime decodes this string using mbstowcs (i.e. multibyte string to wide-character string), which uses Latin-1 in the C locale. In other words, in the C locale on Windows, mbstowcs just casts the byte values to wchar_t.

With the new Universal CRT, strftime is implemented by calling wcsftime, so the accepted solution for bpo-10653 is broken in 3.5+. A simple way around the problem is to switch back to using wcsftime and temporarily (or permanently) set the thread's LC_CTYPE locale to the system default. This makes the internal mbstowcs call use the ANSI codepage. Note that on POSIX platforms 3.x already sets the default via setlocale(LC_CTYPE, "") in Python/pylifecycle.c. Why not set this for all platforms that have setlocale?

> I only tested with my default US locale.

If your system locale uses codepage 1252 (a superset of Latin-1), then you can still test this on a per thread basis if your system has additional language packs. For example:

    import ctypes

    kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)

    if kernel32.GetModuleHandleW('ucrtbased'):  # debug build
        crt = ctypes.CDLL('ucrtbased', use_errno=True)
    else:
        crt = ctypes.CDLL('ucrtbase', use_errno=True)

    MUI_LANGUAGE_NAME = 8
    LC_CTYPE = 2

    class tm(ctypes.Structure):
        pass

    crt._gmtime64.restype = ctypes.POINTER(tm)

    # set a Russian locale for the current thread    
    kernel32.SetThreadPreferredUILanguages(MUI_LANGUAGE_NAME,
                                           'ru-RU\0', None)
    crt._wsetlocale(LC_CTYPE, 'ru-RU')
    # update the time zone name based on the thread locale
    crt._tzset() 

    # get a struct tm *
    ltime = ctypes.c_int64()
    crt._time64(ctypes.byref(ltime))
    tmptr = crt._gmtime64(ctypes.byref(ltime))

    # call wcsftime using C and Russian locales 
    buf = (ctypes.c_wchar * 100)()
    crt._wsetlocale(LC_CTYPE, 'C')
    size = crt.wcsftime(buf, 100, '%Z\r\n', tmptr)
    tz1 = buf[:size]
    crt._wsetlocale(LC_CTYPE, 'ru-RU')
    size = crt.wcsftime(buf, 100, '%Z\r\n', tmptr)
    tz2 = buf[:size]

    hcon = kernel32.GetStdHandle(-11)
    pn = ctypes.pointer(ctypes.c_uint())
    >>> _ = kernel32.WriteConsoleW(hcon, tz1, len(tz1), pn, None)
    Âðåìÿ â ôîðìàòå UTC
    >>> _ = kernel32.WriteConsoleW(hcon, tz2, len(tz2), pn, None)
    Время в формате UTC

The first result demonstrates the ANSI => Latin-1 mojibake problem in the C locale. You can encode this result as Latin-1 and then decode it back as codepage 1251:

    >>> tz1.encode('latin-1').decode('1251') == tz2
    True

But transcoding isn't a general workaround since the format string shouldn't be restricted to ANSI, unless you can smuggle the Unicode through like Takayuki showed.
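
The ANSI => Latin-1 mojibake can also be reproduced without Windows or ctypes, using only Python's codecs. A small sketch, taking the Russian time zone name from the output above and assuming code page 1251 as the ANSI code page:

```python
name = "Время в формате UTC"

# What the CRT stores internally: the name encoded in the ANSI code page.
raw = name.encode("cp1251")

# What mbstowcs() effectively does in the C locale: each byte value is
# cast to a wide character, which is the same as decoding as Latin-1.
mojibake = raw.decode("latin-1")
print(mojibake)

# Undoing the damage requires knowing the original code page.
repaired = mojibake.encode("latin-1").decode("cp1251")
print(repaired == name)
```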

@eryksun
Contributor

eryksun commented Mar 7, 2021

Update since msg255133:

Python 3.8+ now calls setlocale(LC_CTYPE, "") at startup in Windows, as 3.x has always done in POSIX. So decoding the output of C strftime("%Z") with PyUnicode_DecodeLocaleAndSize() 'works' again, since both default to the process code page. The latter is usually the system code page, unless overridden to UTF-8 in the application manifest.

But calling C strftime() as a workaround is still a fragile solution, since it requires that the process code page is able to encode the process or thread UI language. In general, the system code page, the current user locale, and current user preferred language are independent settings in Windows.

Calling C strftime() also unnecessarily limits the format string to characters in the current LC_CTYPE locale encoding, which requires hacky workarounds.

Starting with Windows 10 v2004 (build 19041), ucrt uses an internal wide-character version of the time-zone name that gets returned by an internal __wide_tzname() call and used for "%Z" in wcsftime(). The wide-character value gets updated by _tzset() and kept in sync with _tzname.

If Python switched to using wcsftime() in Windows 10 2004+, then the current locale encoding would no longer be a problem for any UI language.

Also, bpo-36779 switched to setting time.tzname by directly calling WinAPI GetTimeZoneInformation(). time.tzset() should be implemented in order to keep the value of time.tzname in sync with time.strftime("%Z").

@eryksun eryksun added the extension-modules, 3.8 (EOL), 3.9, and 3.10 labels Mar 7, 2021
@vstinner
Member

vstinner commented Mar 8, 2021

> time.tzset() should be implemented

I'm not sure of what you mean. The function is implemented:

static PyObject *
time_tzset(PyObject *self, PyObject *unused)
{
    PyObject* m;

    m = PyImport_ImportModuleNoBlock("time");
    if (m == NULL) {
        return NULL;
    }

    tzset();

    /* Reset timezone, altzone, daylight and tzname */
    if (init_timezone(m) < 0) {
         return NULL;
    }
    Py_DECREF(m);
    if (PyErr_Occurred())
        return NULL;

    Py_RETURN_NONE;
}

@eryksun
Contributor

eryksun commented Mar 8, 2021

> I'm not sure of what you mean. The function is implemented:

My comment was limited to Windows, for which time.tzset() has never been implemented. Since Python has its own implementation of time.tzname in Windows, it should also implement time.tzset() to allow refreshing the value. Actually, ucrt implements C _tzset(), so the implementation of time.tzset() in Windows also has to call C _tzset() to update _tzname (and also ucrt's new private __wide_tzname), in addition to calling GetTimeZoneInformation() to update its own time.tzname value.

Another difference between Python's time.tzname and C strftime("%Z") is that ucrt will use the TZ environment variable, but Python's implementation of time.tzname in Windows does not.
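
To illustrate the platform difference discussed here, a small sketch that refreshes time.tzname from the TZ environment variable where time.tzset() exists and reports its absence otherwise. refresh_tzname is a hypothetical helper name, and the behaviour on Windows follows the description in this comment rather than anything tested here.

```python
import os
import time


def refresh_tzname(tz=None):
    """Re-read the time zone from TZ; return time.tzname, or None if
    time.tzset() is unavailable (as described above for Windows)."""
    if tz is not None:
        os.environ["TZ"] = tz
    if not hasattr(time, "tzset"):
        return None
    time.tzset()
    return time.tzname


print(refresh_tzname("UTC"))
```

Where time.tzset() is missing, changing TZ after startup has no way to reach time.tzname, which is the gap the comment above points out.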

@ezio-melotti ezio-melotti transferred this issue from another repository Apr 10, 2022
@serhiy-storchaka serhiy-storchaka self-assigned this Oct 5, 2024
@serhiy-storchaka serhiy-storchaka removed the 3.10, 3.9, and 3.8 (EOL) labels Oct 5, 2024
@serhiy-storchaka serhiy-storchaka added the 3.12, 3.13, and 3.14 labels Oct 5, 2024
@serhiy-storchaka
Member

A couple more bugs.

On platforms without wcsftime():

>>> print(ascii(time.strftime('\udcf0\udc9f\udc90\udc8d')))
'\U0001f40d'

The result depends on the locale encoding. The above was for UTF-8.

I expect a similar result for time.strftime('\ud83d\udc0d') on platforms with wcsftime() and 16-bit wchar_t.
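
Both recombination effects can be shown with Python's codecs alone, without calling strftime(). This sketch only mirrors the encode/decode round trips involved (UTF-8 with surrogateescape, and UTF-16 code units for a 16-bit wchar_t); it is not the code path inside the time module.

```python
# Four lone surrogates, each standing for one undecodable byte.
escaped = "\udcf0\udc9f\udc90\udc8d"
raw = escaped.encode("utf-8", "surrogateescape")
print(raw)  # b'\xf0\x9f\x90\x8d'

# Decoding those bytes again yields a single character, U+1F40D.
recombined = raw.decode("utf-8", "surrogateescape")
print(ascii(recombined))  # '\U0001f40d'

# A surrogate pair written as two separate code points collapses the same
# way when it passes through 16-bit code units.
pair = "\ud83d\udc0d"
joined = pair.encode("utf-16-le", "surrogatepass").decode("utf-16-le")
print(len(pair), len(joined))  # 2 1
```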

serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this issue Oct 9, 2024
Fix time.strftime(), the strftime() method and formatting of the
datetime classes datetime, date and time.

* Characters not encodable in the current locale are now acceptable in
  the format string.
* Surrogate pairs and sequence of surrogatescape-encoded bytes are no
  longer recombinated.
* Embedded null character no longer terminates the format string.

This fixes also pythongh-78662 and pythongh-124531.
@serhiy-storchaka serhiy-storchaka linked a pull request Oct 9, 2024 that will close this issue
serhiy-storchaka added a commit that referenced this issue Oct 17, 2024
serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this issue Oct 17, 2024
…5193)

(cherry picked from commit ad3eac1)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this issue Oct 17, 2024
(cherry picked from commit ad3eac1)
@serhiy-storchaka
Member

serhiy-storchaka commented Oct 17, 2024

Python's time.strftime() now passes only ASCII strings to C's strftime(), so there are no encoding errors. But the result of strftime() can still contain non-ASCII data, and it needs to be decoded. I think that switching to wcsftime() would make decoding less prone to locale misconfiguration errors.
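
A quick way to see how a given interpreter handles the cases listed in the commit message is to probe them directly. This is only a sketch: probe is a hypothetical helper, and no particular result is assumed for interpreters without the fix.

```python
import time


def probe(fmt):
    """Return the formatted string, or the exception name if formatting fails."""
    try:
        return time.strftime(fmt, time.gmtime(0))
    except (UnicodeError, ValueError) as exc:
        return type(exc).__name__


for fmt in (
    "%Y\u200f%m",    # character that may not be encodable in the locale
    "%Y\0%m",        # embedded null character
    "\ud83d\udc0d",  # surrogate pair written as two code points
):
    print(ascii(fmt), "->", ascii(probe(fmt)))
```

With the fix applied, each format should come back with its literal text preserved around the expanded directives.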

serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this issue Oct 17, 2024
serhiy-storchaka added a commit that referenced this issue Oct 17, 2024
…5657)

(cherry picked from commit ad3eac1)
miss-islington pushed a commit to miss-islington/cpython that referenced this issue Oct 17, 2024
…5193) (pythonGH-125657)

(cherry picked from commit 08ccbb9)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
(cherry picked from commit ad3eac1)
serhiy-storchaka added a commit that referenced this issue Oct 17, 2024
…5657) (GH-125661)

(cherry picked from commit 08ccbb9)
(cherry picked from commit ad3eac1)

Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>