Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bpo-34454: Fix issue with non-UTF8 separator strings #8862

Merged
merged 7 commits into from
Aug 23, 2018

Conversation

pganssle
Copy link
Member

@pganssle pganssle commented Aug 22, 2018

It is possible to pass a non-UTF-8 string as a separator in datetime.isoformat, but the current implementation starts by decoding to UTF-8, which will fail even for some valid strings.

In the special case of non-UTF-8 separators, we replace the separator character with T before encoding as UTF-8, so that encoding errors only occur on invalid ISO 8601 strings, and are handled as a standard ValueError (as would occur in the pure Python version).

bpo-34454: Implementation of the fix without significant performance problems.

https://bugs.python.org/issue34454

@pganssle
Copy link
Member Author

@taleinat If you want you want to use this feel free to rebase against my branch. This PR is mainly because as part of figuring out how to make your PR fast, I actually re-wrote your PR, so it seemed easier to just push my changes.

@pganssle
Copy link
Member Author

pganssle commented Aug 22, 2018

I've merged in the tests and NEWS from #8859, but I now think this PR should be merged instead of that one.

Comparing performance (using the script from this comment) of this PR (updated after sanitize_isoformat_str refactor):

datetime constructor:                1192.5ns
fromisoformat:                       561.3ns
fromisoformat (special characters):  599.7ns
fromisoformat (non-utf8):            1289.5ns
fromisoformat (fail, non-utf8):      3501.5ns
fromisoformat (fail, utf8):          1738.7ns

Compared with #8859:

datetime constructor:                1165.1ns
fromisoformat:                       520.7ns
fromisoformat (special characters):  1153.1ns
fromisoformat (non-utf8):            1165.8ns
fromisoformat (fail, non-utf8):      2815.5ns
fromisoformat (fail, utf8):          1648.3ns

It's much faster in at least one common(ish) case (utf-8) and essentially the same performance in all other cases. IMO, this one also is more readable, since it's essentially equivalent to:

def new_isoformat(dtstr):
    if len(dtstr) > 10 and is_surrogate(dtstr[10]):
        dtstr = "%sT%s" % (dtstr[0:10], dtstr[11:])
    return old_isoformat_minus_segfaults(dtstr)

It does not require the more complicated fast-path/slow-path branching in #8859 and proliferation of intermediate PyObjects (and associated refcounts) is kept to an absolute minimum.

@pganssle
Copy link
Member Author

CC @abalkin @serhiy-storchaka

@pganssle pganssle force-pushed the fromisoformat_fix_nonutf8_crash branch from 71eeb20 to b5eeba0 Compare August 22, 2018 21:04
Copy link
Contributor

@taleinat taleinat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just a few small details to amend.

_sanitize_isoformat_str(PyObject* dtstr, unsigned char * needs_decref) {
Py_ssize_t len = PyUnicode_GET_LENGTH(dtstr);
*needs_decref = 0;
if (len < 10 || !Py_UNICODE_IS_SURROGATE(PyUnicode_READ_CHAR(dtstr, 10))) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be len < 11 or len < 10.

Fix the .fromisoformat() methods of datetime types crashing when given
unicode with non-UTF-8-encodable code points. Specifically,
datetime.fromisoformat() now accepts surrogate unicode code points used as
the separator.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should mention @izbyshev in this NEWS entry.

@@ -4839,6 +4852,41 @@ datetime_combine(PyObject *cls, PyObject *args, PyObject *kw)
return result;
}


static PyObject *
_sanitize_isoformat_str(PyObject *dtstr, unsigned char *needs_decref) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO needs_decref should be an int, not unsigned char.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes no difference to me. Ideally it would be bool but I guess that's not a thing in C?

// the separator; to allow datetime_fromisoformat to make the simplifying
// assumption that all valid strings can be encoded in UTF-8, this function
// replaces any surrogate character separators with `T`.
Py_ssize_t len = PyUnicode_GET_LENGTH(dtstr);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using PyUnicode_GET_LENGTH requires having called PyUnicode_READY before. In this case (and below), just use PyUnicode_GetLength().

// replaces any surrogate character separators with `T`.
Py_ssize_t len = PyUnicode_GET_LENGTH(dtstr);
*needs_decref = 0;
if (len < 10 || !Py_UNICODE_IS_SURROGATE(PyUnicode_READ_CHAR(dtstr, 10))) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure that only surrogates can cause failure to encode as UTF-8?

invalid_string_error:
PyErr_Format(PyExc_ValueError, "Invalid isoformat string: %R", dtstr);

finally:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"finally" can be confusing since this doesn't happen upon success. I would just use "error".

@bedevere-bot
Copy link

A Python core developer has requested some changes be made to your pull request before we can consider merging it. If you could please address their requests along with any other requests in other reviews from core developers that would be appreciated.

Once you have made the requested changes, please leave a comment on this pull request containing the phrase I have made the requested changes; please review again. I will then notify any core developers who have left a review that you're ready for them to take another look at this pull request.

// replaces any surrogate character separators with `T`.
Py_ssize_t len = PyUnicode_GET_LENGTH(dtstr);
*needs_decref = 0;
if (len < 10 || !Py_UNICODE_IS_SURROGATE(PyUnicode_READ_CHAR(dtstr, 10))) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be len < 11 or len <= 10.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@pganssle pganssle force-pushed the fromisoformat_fix_nonutf8_crash branch 2 times, most recently from 2162a43 to f73b230 Compare August 22, 2018 21:31
@pganssle
Copy link
Member Author

I believe all the changes are fixed. Thanks for the review @taleinat

return NULL;
}

PyObject *str_out = PyUnicode_FromFormat("%UT%U", left, right);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't checked it, but creating a copy of dtstr (with PyUnicode_New/ PyUnicode_CopyCharacters or similar) and using PyUnicode_WriteChar to replace the separator might be both faster and simpler than splitting the string and then joining it again.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, refactored. You are right, it is both simpler and faster.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, thanks!

int needs_decref = 0;
dtstr = _sanitize_isoformat_str(dtstr, &needs_decref);

Py_ssize_t len = PyUnicode_GetLength(dtstr);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PyUnicode_GetLength() here is now unnecessary.

@taleinat
Copy link
Contributor

@pganssle, for your consideration: If you also do the UTF-8 encoding in _sanitize_isoformat_str (renamed appropriately) and return a const char * and length, you can avoid the conditional Py_DECREF() in the fast-path at the end of datetime_fromisoformat().

@@ -4848,9 +4889,17 @@ datetime_fromisoformat(PyObject* cls, PyObject *dtstr) {
return NULL;
}

Py_ssize_t len;
int needs_decref = 0;
dtstr = _sanitize_isoformat_str(dtstr, &needs_decref);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dtstr should be checked for NULL.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, C programming is hard. :(

Fixed now.

@pganssle
Copy link
Member Author

pganssle commented Aug 23, 2018

@taleinat I'm not entirely sure, but I think that wouldn't work; when the RC of the temporary dtstr reaches 0 it would be deleted, and I think that the temporary dtstr is managing the memory for dt_ptr. If that's not how it works, I'm not sure what would be managing that memory, since I'm not allocating any memory for it, or freeing it later.

@taleinat
Copy link
Contributor

@pganssle, you're right, I hadn't considered that. Better to leave it as it is then.

Just remove the PyUnicode_GetLength() call and it should be ready to go in.

Also, you're welcome to add yourself to the NEWS section, i.e. "Patch by Paul Ganssle".

@pganssle pganssle force-pushed the fromisoformat_fix_nonutf8_crash branch from e45c28a to b89d4f5 Compare August 23, 2018 13:11
pganssle and others added 5 commits August 23, 2018 09:11
It is possible to pass a non-UTF-8 string as a separator in
datetime.isoformat, but the current implementation starts by decoding to
UTF-8, which will fail even for some valid strings.

In the special case of non-UTF-8 separators, we take a performance hit
by encoding the string as ASCII and replacing any invalid characters
with ?.
Previously this would end up dereferencing a NULL pointer if the
PyUnicode_AsUTF8AndSize call failed, this makes it so that the same
error as any other parsing error is raised.
This increases performance for valid non-UTF-8 strings by avoiding
an error condition, and minimizes the impact on the rest of the
algorithm.
Co-authored-by: Alexey Izbyshev <izbyshev@ispras.ru>
Co-authored-by: Paul Ganssle <paul@ganssle.io>
Rather than splitting the string at position 10 and re-joining it with
PyUnicode_Format, this copies the original unicode object and overwrites
the separator character.

Co-Authored-By: Alexey Izbyshev <izbyshev@ispras.ru>
@pganssle
Copy link
Member Author

@taleinat Fixed the duplicate PyUnicode_GetLength and the missing NULL check. I don't really need to be mentioned in the NEWS, plus I think it would be complicated to properly assign credit for this patch, as it was a collaborative effort between me, you and @izbyshev.

@taleinat
Copy link
Contributor

@pganssle, looks good!

I'm a core dev and will merge this so my name will be on it anyways.

@taleinat taleinat added needs backport to 3.7 type-bug An unexpected behavior, bug, or error labels Aug 23, 2018
@taleinat taleinat merged commit 096329f into python:master Aug 23, 2018
@miss-islington
Copy link
Contributor

Thanks @pganssle for the PR, and @taleinat for merging it 🌮🎉.. I'm working now to backport this PR to: 3.7.
🐍🍒⛏🤖

miss-islington pushed a commit to miss-islington/cpython that referenced this pull request Aug 23, 2018
…gate code points (pythonGH-8862)

The current C implementations **crash** if the input includes a surrogate
Unicode code point, which is not possible to encode in UTF-8.

Important notes:

1.  It is possible to pass a non-UTF-8 string as a separator to the
    `.isoformat()` methods.
2.  The pure-Python `datetime.fromisoformat()` implementation accepts
    strings with a surrogate as the separator.

In `datetime.fromisoformat()`, in the special case of non-UTF-8 separators,
this implementation will take a performance hit by making a copy of the
input string and replacing the separator with 'T'.

Co-authored-by: Alexey Izbyshev <izbyshev@ispras.ru>
Co-authored-by: Paul Ganssle <paul@ganssle.io>
(cherry picked from commit 096329f)

Co-authored-by: Paul Ganssle <pganssle@users.noreply.github.com>
@bedevere-bot
Copy link

GH-8877 is a backport of this pull request to the 3.7 branch.

miss-islington added a commit that referenced this pull request Aug 23, 2018
…gate code points (GH-8862)

The current C implementations **crash** if the input includes a surrogate
Unicode code point, which is not possible to encode in UTF-8.

Important notes:

1.  It is possible to pass a non-UTF-8 string as a separator to the
    `.isoformat()` methods.
2.  The pure-Python `datetime.fromisoformat()` implementation accepts
    strings with a surrogate as the separator.

In `datetime.fromisoformat()`, in the special case of non-UTF-8 separators,
this implementation will take a performance hit by making a copy of the
input string and replacing the separator with 'T'.

Co-authored-by: Alexey Izbyshev <izbyshev@ispras.ru>
Co-authored-by: Paul Ganssle <paul@ganssle.io>
(cherry picked from commit 096329f)

Co-authored-by: Paul Ganssle <pganssle@users.noreply.github.com>
@@ -4839,6 +4852,33 @@ datetime_combine(PyObject *cls, PyObject *args, PyObject *kw)
return result;
}

static PyObject *
_sanitize_isoformat_str(PyObject *dtstr, int *needs_decref) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

{ should be on new line.

// the separator; to allow datetime_fromisoformat to make the simplifying
// assumption that all valid strings can be encoded in UTF-8, this function
// replaces any surrogate character separators with `T`.
Py_ssize_t len = PyUnicode_GetLength(dtstr);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PyUnicode_GetLength() can set an exception and return -1.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, good call. For some reason I had the impression that Py_ssize_t was an unsigned integer.

return dtstr;
}

PyObject *str_out = PyUnicode_New(len, PyUnicode_MAX_CHAR_VALUE(dtstr));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_PyUnicode_Copy() could be used.

But I'm not sure that arbitrary character (including lone surrogates) should be accepted as a date-time separator. For example 2018-08-24016:10:00 looks pretty confusing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we developed datetime.fromisoformat, the contract of the function was that it should act as the inverse of datetime.isoformat, meaning that it should satisfy

datetime.fromisoformat(dt.isoformat(*args, **kwargs)) == dt

for all values of dt, args and kwargs. Since dt.isoformat(sep='\ud800') and dt.isoformat(sep='0') are valid, we need to accept any character in the separator position in order to comply with the contract of the function.

It may be worth bringing up the appropriate contract in the datetime-SIG mailing list, but it was decided to use a very simple contract so that it's very simple to define what is and is not the correct behavior for this function (which could otherwise grow unwieldy).

return dt;

invalid_string_error:
PyErr_Format(PyExc_ValueError, "Invalid isoformat string: %R", dtstr);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error message can contain not original string.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh good point. As part of the fixup I will change how this works so that dtstr is not re-assigned.

@pganssle
Copy link
Member Author

@serhiy-storchaka I'll make a second PR with the cleanup.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type-bug An unexpected behavior, bug, or error
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants