bpo-34454: Fix issue with non-UTF8 separator strings #8862

pganssle · 2018-08-22T16:31:26Z

It is possible to pass a string that cannot be encoded as UTF-8 as the separator in datetime.isoformat, but the current C implementation of datetime.fromisoformat starts by encoding its input as UTF-8, which fails even for some valid strings.

In the special case of non-UTF-8 separators, we replace the separator character with T before encoding as UTF-8, so that encoding errors only occur on invalid ISO 8601 strings, and are handled as a standard ValueError (as would occur in the pure Python version).
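To make the failure mode concrete, here is a minimal Python sketch (not taken from the PR). The helper name `utf8_encodable` is hypothetical; the sketch only shows that `isoformat()` can produce a string that the UTF-8 codec refuses to encode.

```python
from datetime import datetime

dt = datetime(2018, 8, 22, 16, 31, 26)

# isoformat() accepts any single character as the separator,
# including a lone surrogate such as U+D800.
dtstr = dt.isoformat(sep="\ud800")


def utf8_encodable(s):
    """Return True if s can be encoded as UTF-8 (hypothetical helper)."""
    try:
        s.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True
```

With the default separator the output encodes without problems; with the surrogate separator it does not, which is the case the C code has to handle before it encodes the string.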

bpo-34454: Implementation of the fix without significant performance problems.

https://bugs.python.org/issue34454

pganssle · 2018-08-22T16:32:53Z

@taleinat If you want to use this, feel free to rebase against my branch. This PR is mainly because as part of figuring out how to make your PR fast, I actually re-wrote your PR, so it seemed easier to just push my changes.

pganssle · 2018-08-22T20:57:13Z

I've merged in the tests and NEWS from #8859, but I now think this PR should be merged instead of that one.

Comparing performance (using the script from this comment) of this PR (updated after sanitize_isoformat_str refactor):

datetime constructor:                1192.5ns
fromisoformat:                       561.3ns
fromisoformat (special characters):  599.7ns
fromisoformat (non-utf8):            1289.5ns
fromisoformat (fail, non-utf8):      3501.5ns
fromisoformat (fail, utf8):          1738.7ns

Compared with #8859:

datetime constructor:                1165.1ns
fromisoformat:                       520.7ns
fromisoformat (special characters):  1153.1ns
fromisoformat (non-utf8):            1165.8ns
fromisoformat (fail, non-utf8):      2815.5ns
fromisoformat (fail, utf8):          1648.3ns

It's much faster in at least one common(ish) case (utf-8) and essentially the same performance in all other cases. IMO, this one also is more readable, since it's essentially equivalent to:

def new_isoformat(dtstr):
    if len(dtstr) > 10 and is_surrogate(dtstr[10]):
        dtstr = "%sT%s" % (dtstr[0:10], dtstr[11:])
    return old_isoformat_minus_segfaults(dtstr)

It does not require the more complicated fast-path/slow-path branching in #8859 and proliferation of intermediate PyObjects (and associated refcounts) is kept to an absolute minimum.
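For readers who want to run the idea, the pseudocode above can be sketched in pure Python as follows. This is an illustration under assumptions, not the C code from the PR: `is_surrogate` and `sanitize_isoformat_str` are stand-in names, and the real parsing is delegated to the standard `datetime.fromisoformat`.

```python
from datetime import datetime


def is_surrogate(ch):
    """True if ch is a surrogate code point (U+D800..U+DFFF)."""
    return 0xD800 <= ord(ch) <= 0xDFFF


def sanitize_isoformat_str(dtstr):
    """Replace a surrogate separator at index 10 with 'T'.

    Strings without a surrogate at index 10 are returned unchanged,
    so the common case does no copying.
    """
    if len(dtstr) > 10 and is_surrogate(dtstr[10]):
        return dtstr[:10] + "T" + dtstr[11:]
    return dtstr
```

The sanitized string can then be handed to a parser that assumes UTF-8-encodable input, for example `datetime.fromisoformat(sanitize_isoformat_str(s))`.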

pganssle · 2018-08-22T20:57:47Z

CC @abalkin @serhiy-storchaka

taleinat

Looks good, just a few small details to amend.

taleinat · 2018-08-22T21:04:27Z

Modules/_datetimemodule.c

+_sanitize_isoformat_str(PyObject* dtstr, unsigned char * needs_decref) {
+    Py_ssize_t len = PyUnicode_GET_LENGTH(dtstr);
+    *needs_decref = 0;
+    if (len < 10 || !Py_UNICODE_IS_SURROGATE(PyUnicode_READ_CHAR(dtstr, 10))) {


This should be len < 11 or len <= 10.

taleinat · 2018-08-22T21:04:58Z

Misc/NEWS.d/next/Library/2018-08-22-21-59-08.bpo-34454.z7uG4b.rst

+Fix the .fromisoformat() methods of datetime types crashing when given
+unicode with non-UTF-8-encodable code points.  Specifically,
+datetime.fromisoformat() now accepts surrogate unicode code points used as
+the separator.


We should mention @izbyshev in this NEWS entry.

taleinat · 2018-08-22T21:07:47Z

Modules/_datetimemodule.c

@@ -4839,6 +4852,41 @@ datetime_combine(PyObject *cls, PyObject *args, PyObject *kw)
    return result;
 }

+
+static PyObject *
+_sanitize_isoformat_str(PyObject *dtstr, unsigned char *needs_decref) {


IMO needs_decref should be an int, not unsigned char.

Makes no difference to me. Ideally it would be bool but I guess that's not a thing in C?

taleinat · 2018-08-22T21:09:57Z

Modules/_datetimemodule.c

+    // the separator; to allow datetime_fromisoformat to make the simplifying
+    // assumption that all valid strings can be encoded in UTF-8, this function
+    // replaces any surrogate character separators with `T`.
+    Py_ssize_t len = PyUnicode_GET_LENGTH(dtstr);


Using PyUnicode_GET_LENGTH requires having called PyUnicode_READY before. In this case (and below), just use PyUnicode_GetLength().

taleinat · 2018-08-22T21:12:53Z

Modules/_datetimemodule.c

+    // replaces any surrogate character separators with `T`.
+    Py_ssize_t len = PyUnicode_GET_LENGTH(dtstr);
+    *needs_decref = 0;
+    if (len < 10 || !Py_UNICODE_IS_SURROGATE(PyUnicode_READ_CHAR(dtstr, 10))) {


Are you sure that only surrogates can cause failure to encode as UTF-8?
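One way to check this from Python (a quick sketch, not part of the PR) is to try encoding every code point with the strict UTF-8 codec and collect the failures. The function name `unencodable_code_points` is made up for this example.

```python
def unencodable_code_points():
    """Return all code points whose single-character str fails strict UTF-8 encoding."""
    bad = []
    for cp in range(0x110000):  # the full Unicode code point range
        try:
            chr(cp).encode("utf-8")
        except UnicodeEncodeError:
            bad.append(cp)
    return bad
```

In CPython the failures should be exactly the surrogate range U+D800 to U+DFFF, which suggests that checking for surrogates is sufficient for the separator position.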

taleinat · 2018-08-22T21:16:34Z

Modules/_datetimemodule.c

+invalid_string_error:
+    PyErr_Format(PyExc_ValueError, "Invalid isoformat string: %R", dtstr);
+
+finally:


"finally" can be confusing since this doesn't happen upon success. I would just use "error".

bedevere-bot · 2018-08-22T21:17:52Z

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

taleinat · 2018-08-22T21:20:41Z

Modules/_datetimemodule.c

+    // replaces any surrogate character separators with `T`.
+    Py_ssize_t len = PyUnicode_GET_LENGTH(dtstr);
+    *needs_decref = 0;
+    if (len < 10 || !Py_UNICODE_IS_SURROGATE(PyUnicode_READ_CHAR(dtstr, 10))) {


This should be len < 11 or len <= 10.
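The off-by-one is easy to see in Python terms: reading index 10 is only valid when the string has at least 11 characters. The sketch below uses two hypothetical helpers to contrast the guard from the diff with the corrected one.

```python
DATE_LEN = 10  # len("YYYY-MM-DD")


def has_separator(dtstr):
    """Corrected guard: index 10 exists only if len(dtstr) >= 11."""
    return len(dtstr) > DATE_LEN


def buggy_has_separator(dtstr):
    """Mirrors the `len < 10` guard, which lets length-10 strings through."""
    return not (len(dtstr) < DATE_LEN)


date_only = "2018-08-22"  # exactly 10 characters, no separator
```

For `date_only`, the buggy guard would go on to read position 10, which in Python raises IndexError and in C reads past the last character.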

pganssle · 2018-08-22T21:32:39Z

I believe all the changes are fixed. Thanks for the review @taleinat

izbyshev · 2018-08-22T21:31:37Z

Modules/_datetimemodule.c

+        return NULL;
+    }
+
+    PyObject *str_out = PyUnicode_FromFormat("%UT%U", left, right);


I haven't checked it, but creating a copy of dtstr (with PyUnicode_New/PyUnicode_CopyCharacters or similar) and using PyUnicode_WriteChar to replace the separator might be both faster and simpler than splitting the string and then joining it again.

OK, refactored. You are right, it is both simpler and faster.

Great, thanks!

taleinat · 2018-08-23T08:13:15Z

Modules/_datetimemodule.c

+    int needs_decref = 0;
+    dtstr = _sanitize_isoformat_str(dtstr, &needs_decref);
+
+    Py_ssize_t len = PyUnicode_GetLength(dtstr);


The PyUnicode_GetLength() here is now unnecessary.

taleinat · 2018-08-23T08:15:47Z

@pganssle, for your consideration: If you also do the UTF-8 encoding in _sanitize_isoformat_str (renamed appropriately) and return a const char * and length, you can avoid the conditional Py_DECREF() in the fast-path at the end of datetime_fromisoformat().

izbyshev · 2018-08-23T09:29:03Z

Modules/_datetimemodule.c

@@ -4848,9 +4889,17 @@ datetime_fromisoformat(PyObject* cls, PyObject *dtstr) {
        return NULL;
    }

-    Py_ssize_t len;
+    int needs_decref = 0;
+    dtstr = _sanitize_isoformat_str(dtstr, &needs_decref);


dtstr should be checked for NULL.

Good catch, C programming is hard. :(

Fixed now.

pganssle · 2018-08-23T11:20:45Z

@taleinat I'm not entirely sure, but I think that wouldn't work; when the RC of the temporary dtstr reaches 0 it would be deleted, and I think that the temporary dtstr is managing the memory for dt_ptr. If that's not how it works, I'm not sure what would be managing that memory, since I'm not allocating any memory for it, or freeing it later.

taleinat · 2018-08-23T11:31:24Z

@pganssle, you're right, I hadn't considered that. Better to leave it as it is then.

Just remove the PyUnicode_GetLength() call and it should be ready to go in.

Also, you're welcome to add yourself to the NEWS section, i.e. "Patch by Paul Ganssle".

It is possible to pass a non-UTF-8 string as a separator in datetime.isoformat, but the current implementation starts by decoding to UTF-8, which will fail even for some valid strings. In the special case of non-UTF-8 separators, we take a performance hit by encoding the string as ASCII and replacing any invalid characters with ?.

pganssle · 2018-08-23T13:14:06Z

@taleinat Fixed the duplicate PyUnicode_GetLength and the missing NULL check. I don't really need to be mentioned in the NEWS, plus I think it would be complicated to properly assign credit for this patch, as it was a collaborative effort between me, you and @izbyshev.

taleinat · 2018-08-23T13:30:52Z

@pganssle, looks good!

I'm a core dev and will merge this so my name will be on it anyways.

miss-islington · 2018-08-23T15:06:24Z

Thanks @pganssle for the PR, and @taleinat for merging it 🌮🎉.. I'm working now to backport this PR to: 3.7.
🐍🍒⛏🤖

…gate code points (pythonGH-8862)

The current C implementations **crash** if the input includes a surrogate Unicode code point, which is not possible to encode in UTF-8.

Important notes:

1. It is possible to pass a non-UTF-8 string as a separator to the `.isoformat()` methods.
2. The pure-Python `datetime.fromisoformat()` implementation accepts strings with a surrogate as the separator.

In `datetime.fromisoformat()`, in the special case of non-UTF-8 separators, this implementation will take a performance hit by making a copy of the input string and replacing the separator with 'T'.

Co-authored-by: Alexey Izbyshev <izbyshev@ispras.ru>
Co-authored-by: Paul Ganssle <paul@ganssle.io>
(cherry picked from commit 096329f)

Co-authored-by: Paul Ganssle <pganssle@users.noreply.github.com>

bedevere-bot · 2018-08-23T15:06:37Z

GH-8877 is a backport of this pull request to the 3.7 branch.

serhiy-storchaka · 2018-08-24T12:59:19Z

Modules/_datetimemodule.c

@@ -4839,6 +4852,33 @@ datetime_combine(PyObject *cls, PyObject *args, PyObject *kw)
    return result;
 }

+static PyObject *
+_sanitize_isoformat_str(PyObject *dtstr, int *needs_decref) {


{ should be on new line.

serhiy-storchaka · 2018-08-24T13:02:18Z

Modules/_datetimemodule.c

+    // the separator; to allow datetime_fromisoformat to make the simplifying
+    // assumption that all valid strings can be encoded in UTF-8, this function
+    // replaces any surrogate character separators with `T`.
+    Py_ssize_t len = PyUnicode_GetLength(dtstr);


PyUnicode_GetLength() can set an exception and return -1.

Hm, good call. For some reason I had the impression that Py_ssize_t was an unsigned integer.

serhiy-storchaka · 2018-08-24T13:04:15Z

Modules/_datetimemodule.c

+        return dtstr;
+    }
+
+    PyObject *str_out = PyUnicode_New(len, PyUnicode_MAX_CHAR_VALUE(dtstr));


_PyUnicode_Copy() could be used.

But I'm not sure that arbitrary characters (including lone surrogates) should be accepted as a date-time separator. For example, 2018-08-24016:10:00 looks pretty confusing.

When we developed datetime.fromisoformat, the contract of the function was that it should act as the inverse of datetime.isoformat, meaning that it should satisfy

datetime.fromisoformat(dt.isoformat(*args, **kwargs)) == dt

for all values of dt, args and kwargs. Since dt.isoformat(sep='\ud800') and dt.isoformat(sep='0') are valid, we need to accept any character in the separator position in order to comply with the contract of the function.

It may be worth bringing up the appropriate contract in the datetime-SIG mailing list, but it was decided to use a very simple contract so that it's very simple to define what is and is not the correct behavior for this function (which could otherwise grow unwieldy).
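The round-trip contract described above can be spot-checked with a short Python sketch. The helper `roundtrips` and the chosen separators are illustrative only, and the surrogate case assumes an interpreter that already includes this fix.

```python
from datetime import datetime


def roundtrips(dt, **isoformat_kwargs):
    """Check fromisoformat(dt.isoformat(...)) == dt for one set of arguments."""
    return datetime.fromisoformat(dt.isoformat(**isoformat_kwargs)) == dt


dt = datetime(2018, 8, 24, 16, 10, 0)

# Ordinary separators, a digit, and a lone surrogate.
separators = ["T", " ", "0", "\ud800"]
results = {sep: roundtrips(dt, sep=sep) for sep in separators}
```

Every entry in `results` is expected to be True if the contract holds, including the confusing-looking digit separator.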

serhiy-storchaka · 2018-08-24T13:14:25Z

Modules/_datetimemodule.c

    return dt;
+
+invalid_string_error:
+    PyErr_Format(PyExc_ValueError, "Invalid isoformat string: %R", dtstr);


This error message may contain the sanitized string rather than the original one.

Oh good point. As part of the fixup I will change how this works so that dtstr is not re-assigned.
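The problem can be illustrated with a small Python sketch (hypothetical function names, not the C code): if the sanitized value overwrites the original variable, the error message reports 'T' where the caller actually passed a surrogate.

```python
def sanitize(dtstr):
    """Replace a surrogate separator at index 10 with 'T' (illustrative only)."""
    if len(dtstr) > 10 and 0xD800 <= ord(dtstr[10]) <= 0xDFFF:
        return dtstr[:10] + "T" + dtstr[11:]
    return dtstr


def error_message_reassigned(dtstr):
    """Overwrites dtstr, so the message shows the sanitized string."""
    dtstr = sanitize(dtstr)
    return "Invalid isoformat string: %r" % (dtstr,)


def error_message_preserved(dtstr):
    """Keeps the sanitized copy under a separate name."""
    sanitized = sanitize(dtstr)  # this copy would be passed to the parser
    return "Invalid isoformat string: %r" % (dtstr,)
```

Keeping the sanitized copy under its own name means the message always reflects what the caller passed in.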

pganssle · 2018-08-24T13:46:11Z

@serhiy-storchaka I'll make a second PR with the cleanup.

the-knights-who-say-ni added the CLA signed label Aug 22, 2018

bedevere-bot added the awaiting review label Aug 22, 2018

pganssle mentioned this pull request Aug 22, 2018

bpo-34454: fix crash in .fromisoformat() methods when given inputs with surrogate code points #8859

Closed

pganssle force-pushed the fromisoformat_fix_nonutf8_crash branch 3 times, most recently from 0baa78c to 71eeb20 Compare August 22, 2018 20:45

pganssle force-pushed the fromisoformat_fix_nonutf8_crash branch from 71eeb20 to b5eeba0 Compare August 22, 2018 21:04

taleinat requested changes Aug 22, 2018

bedevere-bot added awaiting changes and removed awaiting review labels Aug 22, 2018

taleinat reviewed Aug 22, 2018

pganssle force-pushed the fromisoformat_fix_nonutf8_crash branch 2 times, most recently from 2162a43 to f73b230 Compare August 22, 2018 21:31

izbyshev reviewed Aug 22, 2018

taleinat reviewed Aug 23, 2018

izbyshev reviewed Aug 23, 2018

pganssle force-pushed the fromisoformat_fix_nonutf8_crash branch from e45c28a to b89d4f5 Compare August 23, 2018 13:11

pganssle and others added 5 commits August 23, 2018 09:11

Fix non-UTF8 crash for (date|time)_fromisoformat

85a99ca

Previously this would end up dereferencing a NULL pointer if the PyUnicode_AsUTF8AndSize call failed, this makes it so that the same error as any other parsing error is raised.

Refactor non-UTF-8 sanitization

dd82aa0

This increases performance for valid non-UTF-8 strings by avoiding an error condition, and minimizes the impact on the rest of the algorithm.

Add tests for surrogate code points

c24388e

Co-authored-by: Alexey Izbyshev <izbyshev@ispras.ru> Co-authored-by: Paul Ganssle <paul@ganssle.io>

Add news entry for bpo-34454

a0246a0

Refactor sanitize_isoformat_str

b89d4f5

Rather than splitting the string at position 10 and re-joining it with PyUnicode_Format, this copies the original unicode object and overwrites the separator character. Co-Authored-By: Alexey Izbyshev <izbyshev@ispras.ru>

taleinat approved these changes Aug 23, 2018

bedevere-bot added awaiting merge and removed awaiting changes labels Aug 23, 2018

added mention of patch author in NEWS

160e779

taleinat added the needs backport to 3.7 and type-bug labels Aug 23, 2018

taleinat merged commit 096329f into python:master Aug 23, 2018

bedevere-bot removed the awaiting merge label Aug 23, 2018

bedevere-bot removed the needs backport to 3.7 label Aug 23, 2018

taleinat mentioned this pull request Aug 23, 2018

bpo-34454: datetime: Fix crash on PyUnicode_AsUTF8AndSize() failure #8850

Closed

serhiy-storchaka reviewed Aug 24, 2018

pganssle mentioned this pull request Aug 27, 2018

bpo-34454: Clean up datetime.fromisoformat surrogate handling #8959

Merged

taleinat mentioned this pull request Oct 23, 2018

bpo-34482: Add tests for proper handling of non-UTF-8-encodable strin… #8878

Merged
