-
-
Notifications
You must be signed in to change notification settings - Fork 31.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
bpo-34454: Fix issue with non-UTF8 separator strings #8862
bpo-34454: Fix issue with non-UTF8 separator strings #8862
Conversation
@taleinat If you want you want to use this feel free to rebase against my branch. This PR is mainly because as part of figuring out how to make your PR fast, I actually re-wrote your PR, so it seemed easier to just push my changes. |
0baa78c
to
71eeb20
Compare
I've merged in the tests and NEWS from #8859, but I now think this PR should be merged instead of that one. Comparing performance (using the script from this comment) of this PR (updated after sanitize_isoformat_str refactor):
Compared with #8859:
It's much faster in at least one common(ish) case (utf-8) and essentially the same performance in all other cases. IMO, this one also is more readable, since it's essentially equivalent to: def new_isoformat(dtstr):
if len(dtstr) > 10 and is_surrogate(dtstr[10]):
dtstr = "%sT%s" % (dtstr[0:10], dtstr[11:])
return old_isoformat_minus_segfaults(dtstr) It does not require the more complicated fast-path/slow-path branching in #8859 and proliferation of intermediate PyObjects (and associated refcounts) is kept to an absolute minimum. |
71eeb20
to
b5eeba0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, just a few small details to amend.
Modules/_datetimemodule.c
Outdated
_sanitize_isoformat_str(PyObject* dtstr, unsigned char * needs_decref) { | ||
Py_ssize_t len = PyUnicode_GET_LENGTH(dtstr); | ||
*needs_decref = 0; | ||
if (len < 10 || !Py_UNICODE_IS_SURROGATE(PyUnicode_READ_CHAR(dtstr, 10))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be len < 11
or len < 10
.
Fix the .fromisoformat() methods of datetime types crashing when given | ||
unicode with non-UTF-8-encodable code points. Specifically, | ||
datetime.fromisoformat() now accepts surrogate unicode code points used as | ||
the separator. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should mention @izbyshev in this NEWS entry.
Modules/_datetimemodule.c
Outdated
@@ -4839,6 +4852,41 @@ datetime_combine(PyObject *cls, PyObject *args, PyObject *kw) | |||
return result; | |||
} | |||
|
|||
|
|||
static PyObject * | |||
_sanitize_isoformat_str(PyObject *dtstr, unsigned char *needs_decref) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO needs_decref
should be an int
, not unsigned char
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes no difference to me. Ideally it would be bool
but I guess that's not a thing in C?
Modules/_datetimemodule.c
Outdated
// the separator; to allow datetime_fromisoformat to make the simplifying | ||
// assumption that all valid strings can be encoded in UTF-8, this function | ||
// replaces any surrogate character separators with `T`. | ||
Py_ssize_t len = PyUnicode_GET_LENGTH(dtstr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using PyUnicode_GET_LENGTH
requires having called PyUnicode_READY
before. In this case (and below), just use PyUnicode_GetLength()
.
Modules/_datetimemodule.c
Outdated
// replaces any surrogate character separators with `T`. | ||
Py_ssize_t len = PyUnicode_GET_LENGTH(dtstr); | ||
*needs_decref = 0; | ||
if (len < 10 || !Py_UNICODE_IS_SURROGATE(PyUnicode_READ_CHAR(dtstr, 10))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you sure that only surrogates can cause failure to encode as UTF-8?
Modules/_datetimemodule.c
Outdated
invalid_string_error: | ||
PyErr_Format(PyExc_ValueError, "Invalid isoformat string: %R", dtstr); | ||
|
||
finally: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"finally" can be confusing since this doesn't happen upon success. I would just use "error".
A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated. Once you have made the requested changes, please leave a comment on this pull request containing the phrase |
Modules/_datetimemodule.c
Outdated
// replaces any surrogate character separators with `T`. | ||
Py_ssize_t len = PyUnicode_GET_LENGTH(dtstr); | ||
*needs_decref = 0; | ||
if (len < 10 || !Py_UNICODE_IS_SURROGATE(PyUnicode_READ_CHAR(dtstr, 10))) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be len < 11
or len <= 10
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
2162a43
to
f73b230
Compare
I believe all the changes are fixed. Thanks for the review @taleinat |
Modules/_datetimemodule.c
Outdated
return NULL; | ||
} | ||
|
||
PyObject *str_out = PyUnicode_FromFormat("%UT%U", left, right); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't checked it, but creating a copy of dtstr
(with PyUnicode_New
/ PyUnicode_CopyCharacters
or similar) and using PyUnicode_WriteChar
to replace the separator might be both faster and simpler than splitting the string and then joining it again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, refactored. You are right, it is both simpler and faster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great, thanks!
Modules/_datetimemodule.c
Outdated
int needs_decref = 0; | ||
dtstr = _sanitize_isoformat_str(dtstr, &needs_decref); | ||
|
||
Py_ssize_t len = PyUnicode_GetLength(dtstr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PyUnicode_GetLength()
here is now unnecessary.
@pganssle, for your consideration: If you also do the UTF-8 encoding in |
Modules/_datetimemodule.c
Outdated
@@ -4848,9 +4889,17 @@ datetime_fromisoformat(PyObject* cls, PyObject *dtstr) { | |||
return NULL; | |||
} | |||
|
|||
Py_ssize_t len; | |||
int needs_decref = 0; | |||
dtstr = _sanitize_isoformat_str(dtstr, &needs_decref); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dtstr
should be checked for NULL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, C programming is hard. :(
Fixed now.
@taleinat I'm not entirely sure, but I think that wouldn't work; when the RC of the temporary |
@pganssle, you're right, I hadn't considered that. Better to leave it as it is then. Just remove the Also, you're welcome to add yourself to the NEWS section, i.e. "Patch by Paul Ganssle". |
e45c28a
to
b89d4f5
Compare
It is possible to pass a non-UTF-8 string as a separator in datetime.isoformat, but the current implementation starts by decoding to UTF-8, which will fail even for some valid strings. In the special case of non-UTF-8 separators, we take a performance hit by encoding the string as ASCII and replacing any invalid characters with ?.
Previously this would end up dereferencing a NULL pointer if the PyUnicode_AsUTF8AndSize call failed, this makes it so that the same error as any other parsing error is raised.
This increases performance for valid non-UTF-8 strings by avoiding an error condition, and minimizes the impact on the rest of the algorithm.
Co-authored-by: Alexey Izbyshev <izbyshev@ispras.ru> Co-authored-by: Paul Ganssle <paul@ganssle.io>
Rather than splitting the string at position 10 and re-joining it with PyUnicode_Format, this copies the original unicode object and overwrites the separator character. Co-Authored-By: Alexey Izbyshev <izbyshev@ispras.ru>
@pganssle, looks good! I'm a core dev and will merge this so my name will be on it anyways. |
…gate code points (pythonGH-8862) The current C implementations **crash** if the input includes a surrogate Unicode code point, which is not possible to encode in UTF-8. Important notes: 1. It is possible to pass a non-UTF-8 string as a separator to the `.isoformat()` methods. 2. The pure-Python `datetime.fromisoformat()` implementation accepts strings with a surrogate as the separator. In `datetime.fromisoformat()`, in the special case of non-UTF-8 separators, this implementation will take a performance hit by making a copy of the input string and replacing the separator with 'T'. Co-authored-by: Alexey Izbyshev <izbyshev@ispras.ru> Co-authored-by: Paul Ganssle <paul@ganssle.io> (cherry picked from commit 096329f) Co-authored-by: Paul Ganssle <pganssle@users.noreply.github.com>
GH-8877 is a backport of this pull request to the 3.7 branch. |
…gate code points (GH-8862) The current C implementations **crash** if the input includes a surrogate Unicode code point, which is not possible to encode in UTF-8. Important notes: 1. It is possible to pass a non-UTF-8 string as a separator to the `.isoformat()` methods. 2. The pure-Python `datetime.fromisoformat()` implementation accepts strings with a surrogate as the separator. In `datetime.fromisoformat()`, in the special case of non-UTF-8 separators, this implementation will take a performance hit by making a copy of the input string and replacing the separator with 'T'. Co-authored-by: Alexey Izbyshev <izbyshev@ispras.ru> Co-authored-by: Paul Ganssle <paul@ganssle.io> (cherry picked from commit 096329f) Co-authored-by: Paul Ganssle <pganssle@users.noreply.github.com>
@@ -4839,6 +4852,33 @@ datetime_combine(PyObject *cls, PyObject *args, PyObject *kw) | |||
return result; | |||
} | |||
|
|||
static PyObject * | |||
_sanitize_isoformat_str(PyObject *dtstr, int *needs_decref) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
{
should be on new line.
// the separator; to allow datetime_fromisoformat to make the simplifying | ||
// assumption that all valid strings can be encoded in UTF-8, this function | ||
// replaces any surrogate character separators with `T`. | ||
Py_ssize_t len = PyUnicode_GetLength(dtstr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PyUnicode_GetLength()
can set an exception and return -1
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, good call. For some reason I had the impression that Py_ssize_t
was an unsigned integer.
return dtstr; | ||
} | ||
|
||
PyObject *str_out = PyUnicode_New(len, PyUnicode_MAX_CHAR_VALUE(dtstr)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_PyUnicode_Copy()
could be used.
But I'm not sure that arbitrary character (including lone surrogates) should be accepted as a date-time separator. For example 2018-08-24016:10:00 looks pretty confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we developed datetime.fromisoformat
, the contract of the function was that it should act as the inverse of datetime.isoformat
, meaning that it should satisfy
datetime.fromisoformat(dt.isoformat(*args, **kwargs)) == dt
for all values of dt
, args
and kwargs
. Since dt.isoformat(sep='\ud800')
and dt.isoformat(sep='0')
are valid, we need to accept any character in the separator position in order to comply with the contract of the function.
It may be worth bringing up the appropriate contract in the datetime-SIG
mailing list, but it was decided to use a very simple contract so that it's very simple to define what is and is not the correct behavior for this function (which could otherwise grow unwieldy).
return dt; | ||
|
||
invalid_string_error: | ||
PyErr_Format(PyExc_ValueError, "Invalid isoformat string: %R", dtstr); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This error message can contain not original string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh good point. As part of the fixup I will change how this works so that dtstr
is not re-assigned.
@serhiy-storchaka I'll make a second PR with the cleanup. |
It is possible to pass a non-UTF-8 string as a separator in
datetime.isoformat
, but the current implementation starts by decoding to UTF-8, which will fail even for some valid strings.In the special case of non-UTF-8 separators, we replace the separator character with
T
before encoding as UTF-8, so that encoding errors only occur on invalid ISO 8601 strings, and are handled as a standardValueError
(as would occur in the pure Python version).bpo-34454: Implementation of the fix without significant performance problems.
https://bugs.python.org/issue34454