Skip to content

pybind11::str::raw_str simplification (for Python 2) #2367

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

rwgk
Copy link
Collaborator

@rwgk rwgk commented Aug 5, 2020

For Python 2, convert directly to unicode (instead of converting to str first, followed by decoding).

Simpler code, and more obviously equivalent to unicode(...) in the interpreter.

PR is #2366 is meant to be a preparation for this PR.
They could be merged in any order, but it's best to merge #2366 first, re-base and re-test this PR, then merge.

Py_XDECREF(str_value); str_value = unicode;
PyObject *str_value = PyObject_Unicode(op);
#else
PyObject *str_value = PyObject_Str(op);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have a ton of compatibility macros in common.h. I'd define a new one to avoid this macro branching here. Something like PYBIND11_OBJECT_TO_STRING.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have a ton of compatibility macros in common.h. I'd define a new one to avoid this macro branching here. Something like PYBIND11_OBJECT_TO_STRING.

Thanks Boris, I'm inclined to take your suggestions, although it makes it a (slightly) sprawling change.
I just looked: the needed compatibility macro doesn't exist already, I'll have to add it.
Checking with @YannickJadoul and @EricCousineau-TRI : what's your recommendation, local change as-is or adding the compatibility macro?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be fine with a compatibility macro living in common.h. The main concern about macros is some sort of deprecation mechanism when it's no longer necessary, but I'm sure there are several ways to resolve that.

Copy link
Collaborator

@YannickJadoul YannickJadoul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe (haven't tested this explicitly with this code), you'll break the use case where a utf-8-encoded bytes object is passed. Currently, pybind11 decodes it to (Python 2's) unicode type with "utf-8" encoding.
This PR won't do that anymore, since PyObject_Unicode is supposedly the same as unicode(o) (according to the docs):

>>> print(u"abcd\u0259")
abcdə
>>> unicode(u"abcd\u0259".encode("utf-8"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc9 in position 4: ordinal not in range(128)
>>> unicode(u"abcd\u0259".encode("utf-8"), "utf-8")
u'abcd\u0259'

Another thing to watch out for is custom types with __str__/__unicode__ in Python 2. Currently, __str__ would be called, then that result converted to unicode. This PR would immediately call __unicode__ (arguably better, I'd say, but it is a change).

@@ -930,11 +930,10 @@ class str : public object {
private:
/// Return string representation -- always returns a new reference, even if already a str
static PyObject *raw_str(PyObject *op) {
PyObject *str_value = PyObject_Str(op);
if (!str_value) throw error_already_set();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one line went missing but is important! (Unless this can never fail?)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Important to add a test for this as well)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nvm. Just noticed this will be done by the code generated by PYBIND11_OBJECT_CVT ;-)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one line went missing but is important! (Unless this can never fail?)

Oops, sorry, that was an accident (manual cherry-picking). I fixed it quick to get that part out of the way. More changes later.

@YannickJadoul
Copy link
Collaborator

Another thing to watch out for is custom types with __str__/__unicode__ in Python 2. Currently, __str__ would be called, then that result converted to unicode. This PR would immediately call __unicode__ (arguably better, I'd say, but it is a change).

This seems to be OK. It seems like unicode(o) still falls back to __str__ if __unicode__ isn't present.

>>> # This is Python 2
>>> class X:
...     def __str__(self):
...             return "Hello from __str__"
... 
>>> str(X())
'Hello from __str__'
>>> unicode(X())
u'Hello from __str__'
>>> class Y:
...     def __unicode__(self):
...             return u"Hello from __unicode__"
... 
>>> unicode(Y())
u'Hello from __unicode__'
>>> str(Y())
'<__main__.Y instance at 0x7f707209d1e0>'

So this is better, I believe, because first __unicode__ will be tried. Though it might still be considered a breaking change?

The thing to watch out for is the same as above. "utf-8" is still not the default encoding:

>>> # This is Python 2
>>> class Z:
...     def __str__(self):
...             return u"He\u0259llo from __str__".encode('utf-8')
... 
>>> str(Z())
'He\xc9\x99llo from __str__'
>>> unicode(Z())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc9 in position 2: ordinal not in range(128)

For Python 2, convert directly to unicode (instead of converting to str first, followed by encoding).

Simpler code, and more obviously equivalent to `unicode(...)` in the interpreter.
@rwgk rwgk force-pushed the pybind11_str_raw_str_simplification branch from 5e76fc6 to 603367d Compare August 5, 2020 15:56
@rwgk
Copy link
Collaborator Author

rwgk commented Aug 5, 2020

Though it might still be considered a breaking change?

Oh wow, I didn't realize this, thanks!
Maybe I'll just leave this alone then. My feeling is it's not helpful to deviate from "normal" Python 2 behavior (generates surprises), but then again, this isn't important enough a battle to pick.

@YannickJadoul
Copy link
Collaborator

Oh wow, I didn't realize this, thanks!
Maybe I'll just leave this alone then. My feeling is it's not helpful to deviate from "normal" Python 2 behavior (generates surprises), but then again, this isn't important enough a battle to pick.

I'm ... yeah, it's always a bit arbitrary which encoding to pick? I'm not sure if it every occurs, so we might just as well try making the change (or at least proposing it?).
The more I think about it, the more I quite like the more obvious: "py::str is Python 2's unicode, so we're calling unicode(...)", but it seems pybind11 has taken the stance that all strings ought to be encoded as UTF-8 (similar to std::string conventions).

@bstaletic
Copy link
Collaborator

but it seems pybind11 has taken the stance that all strings ought to be encoded as UTF-8

Yup

@jbarlow83
Copy link
Contributor

How long is pybind11 intending to support Python 2?

@EricCousineau-TRI
Copy link
Collaborator

How long is pybind11 intending to support Python 2?

My understanding (from discussions with others) is that Python 2 will continue to be supported in pybind11 as long as the supporting infrastructure is not too much of burden (namely CI, package managers).
While Python 2 support complicates some of the code (as seen in this PR), that complexity isn't all too bad in terms of maintenance.
@wjakob Please feel free to correct me here if I misspoke.

@rwgk rwgk closed this Aug 11, 2020
@rwgk rwgk deleted the pybind11_str_raw_str_simplification branch August 11, 2020 01:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants