pybind11::str::raw_str simplification (for Python 2) #2367

rwgk · 2020-08-05T01:24:42Z

For Python 2, convert directly to unicode (instead of converting to str first, followed by decoding).

Simpler code, and more obviously equivalent to unicode(...) in the interpreter.

PR #2366 is meant as preparation for this PR.
The two could be merged in either order, but it's best to merge #2366 first, rebase and retest this PR, then merge.

bstaletic · 2020-08-05T06:16:25Z

include/pybind11/pytypes.h

-        Py_XDECREF(str_value); str_value = unicode;
+        PyObject *str_value = PyObject_Unicode(op);
+#else
+        PyObject *str_value = PyObject_Str(op);


We already have a ton of compatibility macros in common.h. I'd define a new one to avoid this macro branching here. Something like PYBIND11_OBJECT_TO_STRING.


Thanks Boris, I'm inclined to take your suggestion, although it makes this a (slightly) sprawling change.
I just looked: the needed compatibility macro doesn't exist yet, so I'll have to add it.
Checking with @YannickJadoul and @EricCousineau-TRI: what's your recommendation, the local change as-is or adding the compatibility macro?

I'd be fine with a compatibility macro living in common.h. My main concern with such macros is having some deprecation mechanism for when they are no longer necessary, but I'm sure there are several ways to handle that.

YannickJadoul

I believe (though I haven't tested this explicitly with this code) that you'll break the use case where a utf-8-encoded bytes object is passed. Currently, pybind11 decodes it to (Python 2's) unicode type with "utf-8" encoding.
This PR won't do that anymore, since PyObject_Unicode is supposedly the same as unicode(o) (according to the docs), which decodes with the default "ascii" codec:

>>> print(u"abcd\u0259")
abcdə
>>> unicode(u"abcd\u0259".encode("utf-8"))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc9 in position 4: ordinal not in range(128)
>>> unicode(u"abcd\u0259".encode("utf-8"), "utf-8")
u'abcd\u0259'
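
The same difference can be reproduced outside Python 2. The following is only a minimal sketch, assuming Python 3 as a stand-in (bytes plays the role of Python 2's str, and str the role of unicode). The two function names are hypothetical and are not pybind11 APIs; they merely model the two decoding choices.

```python
# Python 3 stand-in for the two Python 2 conversion paths discussed above.
# Both function names are hypothetical; neither is part of pybind11.


def decode_like_current_raw_str(raw: bytes) -> str:
    """Model of the existing behaviour: decode the byte string as UTF-8."""
    return raw.decode("utf-8")


def decode_like_py2_unicode(raw: bytes) -> str:
    """Model of Python 2's unicode(raw): decode with the default 'ascii' codec."""
    return raw.decode("ascii")


utf8_bytes = "abcd\u0259".encode("utf-8")

# The UTF-8 path round-trips the non-ASCII character.
print(decode_like_current_raw_str(utf8_bytes) == "abcd\u0259")

# The ASCII path rejects the first byte outside the 7-bit range.
try:
    decode_like_py2_unicode(utf8_bytes)
except UnicodeDecodeError as exc:
    print("UnicodeDecodeError at position", exc.start)
```

Pure-ASCII input behaves identically on both paths, so only callers passing non-ASCII byte strings would notice the change.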

Another thing to watch out for is custom types with __str__/__unicode__ in Python 2. Currently, __str__ would be called, then that result converted to unicode. This PR would immediately call __unicode__ (arguably better, I'd say, but it is a change).

YannickJadoul · 2020-08-05T15:22:25Z

include/pybind11/pytypes.h

@@ -930,11 +930,10 @@ class str : public object {
 private:
    /// Return string representation -- always returns a new reference, even if already a str
    static PyObject *raw_str(PyObject *op) {
-        PyObject *str_value = PyObject_Str(op);
-        if (!str_value) throw error_already_set();


This one line went missing but is important! (Unless this can never fail?)

(Important to add a test for this as well)

Nvm. Just noticed this will be done by the code generated by PYBIND11_OBJECT_CVT ;-)

This one line went missing but is important! (Unless this can never fail?)

Oops, sorry, that was an accident (manual cherry-picking). I fixed it quickly to get that part out of the way. More changes later.

YannickJadoul · 2020-08-05T15:51:47Z

Another thing to watch out for is custom types with __str__/__unicode__ in Python 2. Currently, __str__ would be called, then that result converted to unicode. This PR would immediately call __unicode__ (arguably better, I'd say, but it is a change).

This seems to be OK. It seems like unicode(o) still falls back to __str__ if __unicode__ isn't present.

>>> # This is Python 2
>>> class X:
...     def __str__(self):
...             return "Hello from __str__"
... 
>>> str(X())
'Hello from __str__'
>>> unicode(X())
u'Hello from __str__'
>>> class Y:
...     def __unicode__(self):
...             return u"Hello from __unicode__"
... 
>>> unicode(Y())
u'Hello from __unicode__'
>>> str(Y())
'<__main__.Y instance at 0x7f707209d1e0>'

So this is better, I believe, because first __unicode__ will be tried. Though it might still be considered a breaking change?
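
The lookup order can be modelled in a few lines. This is a rough sketch, assuming Python 3 as a stand-in for Python 2: `py2_unicode` and `current_raw_str` are hypothetical helpers (not Python or pybind11 functions), `__unicode__` has no special meaning in Python 3, and `__str__` is made to return bytes purely to imitate a Python 2 byte string.

```python
# Python 3 model of the two conversion strategies compared above.
# py2_unicode and current_raw_str are hypothetical helpers for illustration.


def _call_str(obj):
    # Call __str__ directly: Python 3's str() would reject a bytes result.
    return type(obj).__str__(obj)


def py2_unicode(obj) -> str:
    """Model of unicode(obj): prefer __unicode__, else __str__ decoded as ASCII."""
    unicode_method = getattr(type(obj), "__unicode__", None)
    if unicode_method is not None:
        return unicode_method(obj)
    result = _call_str(obj)
    return result.decode("ascii") if isinstance(result, bytes) else result


def current_raw_str(obj) -> str:
    """Model of the existing path: always __str__, then decode as UTF-8."""
    result = _call_str(obj)
    return result.decode("utf-8") if isinstance(result, bytes) else result


class X:
    def __str__(self):
        return b"Hello from __str__"


class Y:
    def __unicode__(self):
        return "Hello from __unicode__"


class Both:
    def __str__(self):
        return b"from __str__"

    def __unicode__(self):
        return "from __unicode__"


print(py2_unicode(X()))          # falls back to __str__
print(py2_unicode(Y()))          # uses __unicode__
print(py2_unicode(Both()))       # __unicode__ wins
print(current_raw_str(Both()))   # existing path ignores __unicode__
```

In this model, a type that defines both methods is the only case where the two strategies pick a different method, which is the behaviour change in question.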

The thing to watch out for is the same as above. "utf-8" is still not the default encoding:

>>> # This is Python 2
>>> class Z:
...     def __str__(self):
...             return u"He\u0259llo from __str__".encode('utf-8')
... 
>>> str(Z())
'He\xc9\x99llo from __str__'
>>> unicode(Z())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc9 in position 2: ordinal not in range(128)


rwgk · 2020-08-05T16:02:31Z

Though it might still be considered a breaking change?

Oh wow, I didn't realize this, thanks!
Maybe I'll just leave this alone then. My feeling is it's not helpful to deviate from "normal" Python 2 behavior (generates surprises), but then again, this isn't important enough a battle to pick.

YannickJadoul · 2020-08-05T16:06:12Z

Oh wow, I didn't realize this, thanks!
Maybe I'll just leave this alone then. My feeling is it's not helpful to deviate from "normal" Python 2 behavior (generates surprises), but then again, this isn't important enough a battle to pick.

I'm ... yeah, it's always a bit arbitrary which encoding to pick? I'm not sure if it ever occurs, so we might just as well try making the change (or at least proposing it?).
The more I think about it, the more I like the more obvious reading: "py::str is Python 2's unicode, so we're calling unicode(...)", but it seems pybind11 has taken the stance that all strings ought to be encoded as UTF-8 (similar to std::string conventions).

bstaletic · 2020-08-05T17:29:14Z

but it seems pybind11 has taken the stance that all strings ought to be encoded as UTF-8

Yup

jbarlow83 · 2020-08-09T05:30:25Z

How long is pybind11 intending to support Python 2?

EricCousineau-TRI · 2020-08-09T15:56:34Z

How long is pybind11 intending to support Python 2?

My understanding (from discussions with others) is that Python 2 will continue to be supported in pybind11 as long as the supporting infrastructure (namely CI and package managers) is not too much of a burden.
While Python 2 support complicates some of the code (as seen in this PR), that complexity isn't too bad in terms of maintenance.
@wjakob Please feel free to correct me here if I misspoke.

rwgk mentioned this pull request Aug 5, 2020

Adding tests specifically to exercise pybind11::str::raw_str. #2366

Merged

rwgk requested review from bstaletic, EricCousineau-TRI and YannickJadoul August 5, 2020 06:02

bstaletic reviewed Aug 5, 2020


YannickJadoul reviewed Aug 5, 2020


pybind11::str::raw_str simplification (for Python 2)

603367d

For Python 2, convert directly to unicode (instead of converting to str first, followed by encoding). Simpler code, and more obviously equivalent to `unicode(...)` in the interpreter.

rwgk force-pushed the pybind11_str_raw_str_simplification branch from 5e76fc6 to 603367d Compare August 5, 2020 15:56

rwgk closed this Aug 11, 2020

rwgk deleted the pybind11_str_raw_str_simplification branch August 11, 2020 01:06

YannickJadoul mentioned this pull request Aug 14, 2020

Tracking PR: intermediate state of completed str/bytes cleaning up PRs #2348

Closed

4 tasks

rwgk mentioned this pull request Feb 10, 2023

FWD pybind11 google/pybind11clif#2367

Closed
