
gh-128213: fast path for bytes creation from list and tuple #128214

Open · wants to merge 7 commits into base: main

Conversation

@blhsing (Contributor) commented Dec 24, 2024:

Benchmark using python -m timeit -n 100 -s 'a = __import__("random").choices(range(256), k=1000000)' 'bytes(a)' showing a ~31% reduction in time consumed:

100 loops, best of 5: 3.21 msec per loop # current
100 loops, best of 5: 2.23 msec per loop # this PR

With k=10000, a ~30% time reduction:

100 loops, best of 5: 31.7 usec per loop # current
100 loops, best of 5: 22.3 usec per loop # this PR

With k=100, a ~27% time reduction:

100 loops, best of 5: 410 nsec per loop # current
100 loops, best of 5: 299 nsec per loop # this PR

@picnixz (Contributor) left a comment:

Benchmark using python -m timeit -n 100 -s 'a = [40] * 100000' 'bytes(a)' showing a ~81% increase in performance:

Can we have better benchmarks, namely:

  • Benchmarks for small lists (< 100 items), medium-sized lists (< 10k items), and large lists (> 100k items). It is important to know how this affects other paths, and this should be properly reflected in the NEWS entry.
  • Are the benchmarks on a DEBUG or a RELEASE build (possibly PGO/LTO)? Benchmarks on DEBUG builds are not really meaningful; it's better to use a PGO build, or at least a release build (-O3).
  • Can we check with lists that don't all have the same value? Namely, use a = [random.randint(0, 255) for _ in range(N)]. Please also check whether int subclasses use the fast path.

return Py_None; // None as fallback sentinel to the slow path
}
int overflow;
long value = PyLong_AsLongAndOverflow(items[i], &overflow);
Contributor:

If we're still assuming a long object, we can just use PyLong_AsInt instead. We don't care about an overflow as we're only interested in values in [0, 255].

Contributor (Author):

Thanks. I copied this code from bytearray but you're absolutely right that PyLong_AsInt is a much better fit here.

@blhsing (Contributor, Author), Dec 25, 2024:

An existing test with -sys.maxsize failed with PyLong_AsInt. A fix would be to check for OverflowError and re-raise it as ValueError, but I think it's easier to simply revert to PyLong_AsLongAndOverflow and let the range check that follows raise ValueError when the value is out of byte range.
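
For illustration, a minimal sketch of the approach described above, with a hypothetical helper name and error convention: overflow is folded into the same range check that rejects any other out-of-range value with ValueError.

#include <Python.h>

/* Hypothetical helper: convert one item to a byte, treating overflow the
 * same as any other out-of-range value so ValueError is raised, as described
 * above. Returns 0 on success, -1 with an exception set on failure. */
static int
byte_from_item(PyObject *item, char *out)
{
    int overflow;
    long value = PyLong_AsLongAndOverflow(item, &overflow);
    if (value == -1 && PyErr_Occurred()) {
        return -1;
    }
    if (overflow || value < 0 || value >= 256) {
        PyErr_SetString(PyExc_ValueError,
                        "bytes must be in range(0, 256)");
        return -1;
    }
    *out = (char)value;
    return 0;
}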

Contributor:

Previously, we used PyNumber_AsSsize_t (which invokes _PyNumber_Index, hence possibly creating side effects), so this will be a small behavioral change. While I understand Guido's comment, I'm wondering whether we should keep the old behaviour (though I don't know how it could be useful in production, namely making bytes() have side effects on lists when converting one of their elements invokes __index__).

Note that the current code also prevents crashes by temporarily INCREF-ing the item before calling PyNumber_AsSsize_t.

@blhsing (Contributor, Author), Dec 26, 2024:

Since this PR is not supposed to alter behavior, I'll keep the current behavior here and file a separate bug against it. It should be considered a bug because, as Guido pointed out, a list can potentially be mutated inside __index__, resulting in the item-copy loop accessing freed memory. However unlikely such code is in the real world, it can still happen and cause a crash.

Contributor:

You don't get a crash because the list size is checked at every iteration (though what should perhaps also be checked is that the list pointer is not NULL). We INCREF the item before calling __index__ on it, so it shouldn't cause crashes. INCREF-ing beforehand is a trick also used against use-after-free issues and evil mutations. We can easily check whether this crashes as follows:

class EvilInt:
    def __index__(self):
        x.clear()
        return 0

x = [1, 2, EvilInt(), 4]
bytes(x)

and this does not crash.

Contributor (Author):

Ah, now I see what the INCREF and DECREF are doing there. I'll add them back in a bit as well, then. Thanks.
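
As a reference, a minimal sketch of the protective pattern discussed in this thread, with an illustrative helper name: it holds a strong reference to the item across the call, since __index__ may run arbitrary Python code that mutates the source list and drops the item's last reference.

#include <Python.h>

/* Sketch only: INCREF the borrowed item before a call that can execute
 * arbitrary __index__ code, so the item stays alive even if that code clears
 * or shrinks the list. The caller must still re-check the list size on every
 * iteration. */
static Py_ssize_t
item_as_ssize_t(PyObject *list, Py_ssize_t i)
{
    PyObject *item = PyList_GET_ITEM(list, i);  /* borrowed reference */
    Py_INCREF(item);
    Py_ssize_t value = PyNumber_AsSsize_t(item, NULL);
    Py_DECREF(item);
    return value;  /* -1 with an exception set on error */
}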

@blhsing (Contributor, Author) commented Dec 25, 2024:

  • Benchmarks for small lists (< 100 items), medium-sized lists (< 10k items), and large lists (> 100k items). It is important to know how this affects other paths, and this should be properly reflected in the NEWS entry.
  • Are the benchmarks on a DEBUG or a RELEASE build (possibly PGO/LTO)? Benchmarks on DEBUG builds are not really meaningful; it's better to use a PGO build, or at least a release build (-O3).
  • Can we check with lists that don't all have the same value? Namely, use a = [random.randint(0, 255) for _ in range(N)]. Please also check whether int subclasses use the fast path.

Right. I've now updated my benchmarks accordingly, using a stripped RELEASE build (--enable-optimizations --with-lto=full).

@@ -0,0 +1,3 @@
Speed up :class:`bytes` creation from :class:`list` and :class:`tuple` of integers. Benchmarks show that from a list with 1000000 random numbers the time to create a bytes object is reduced by around 31%, or 30% with 10000 numbers, or 27% with 100 numbers.
Contributor:

Can we have the pyperf benchmarks on the PR as well? (Namely, the nice table with the two columns and the diffs, as well as the benchmark script. Thanks.)

@@ -0,0 +1,3 @@
Speed up :class:`bytes` creation from :class:`list` and :class:`tuple` of integers. Benchmarks show that from a list with 1000000 random numbers the time to create a bytes object is reduced by around 31%, or 30% with 10000 numbers, or 27% with 100 numbers.

Contributor:

IIRC, NEWS should not contain an empty line.

@picnixz (Contributor) commented Dec 25, 2024:

For benchmarks, we prefer having a comparison in terms of mean and standard deviation rather than the best of 5, which could just be "good" data points. As such, it's better to use pyperf or hyperfine (also, 100 loops is IMO not sufficient).

value = PyNumber_AsSsize_t(item, NULL);
if (value == -1 && PyErr_Occurred())
char *str = PyBytes_AS_STRING(bytes);
PyObject *const *items = PySequence_Fast_ITEMS(x);
Member:

If we're going for performance, then we can do even better. PySequence_Fast_ITEMS will call PyList_Check, but we already know that it's exactly a list or tuple here.

Contributor (Author):

But all we know is that it is either a list or a tuple; we still don't know which of the two it is, so a PyList_Check call is still in order.

Member:

Right, but PyList_Check is extra work because it will check for a subclass. It's not really going to be noticeable, but it's something to think about :)

Contributor:

Alternatively, you can just duplicate the functions as we did previously. This is not really an issue IMO, and we could, for instance, use the fact that tuples are immutable to avoid the INCREF/DECREF on values.
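
As a rough sketch of the exact-type idea, with a hypothetical helper name: since the caller already guarantees x is exactly a list or exactly a tuple, it can fetch the item array with the type-specific accessors, mirroring what PySequence_Fast_ITEMS does minus the PyList_Check.

#include <assert.h>
#include <Python.h>

/* Hypothetical helper: the caller already guarantees x is exactly a list or
 * exactly a tuple, so fetch the item array directly and skip the PyList_Check
 * that PySequence_Fast_ITEMS would perform on every use. */
static PyObject **
known_sequence_items(PyObject *x, Py_ssize_t *size)
{
    if (PyTuple_CheckExact(x)) {
        *size = PyTuple_GET_SIZE(x);
        return ((PyTupleObject *)x)->ob_item;
    }
    assert(PyList_CheckExact(x));
    *size = PyList_GET_SIZE(x);
    return ((PyListObject *)x)->ob_item;
}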

@blhsing (Contributor, Author) commented Dec 26, 2024:

For benchmarks, we prefer having a comparison in terms of mean and standard deviation rather than the best of 5, which could just be "good" data points. As such, it's better to use pyperf or hyperfine (also, 100 loops is IMO not sufficient).

I see. If it's the norm here then I will propose in Discourse for timeit to include mean and standard deviation in its CLI output as an option. I'll update the benchmarks with pyperf later.

@picnixz (Contributor) commented Dec 26, 2024:

If it's the norm here then I will propose in Discourse for timeit to include mean and standard deviation in its CLI output as an option

It's more than just including the mean and the standard deviation, actually. timeit is not always sufficient for micro-benchmarks like these and does not allow you to calibrate your CPU or compare against reference implementations. Using pyperf is in general the preferred way to get a nice comparison table (it also lets you compare multiple statements in one go).

But yes, timeit having the mean and the standard deviation would still be a nice improvement IMO, though it would depend on whether timeit is meant to remain a "minimal" tool or not. pyperf has a timeit command that is used exactly like timeit, so core devs may not think we need to make timeit more advanced.

@picnixz (Contributor) commented Dec 26, 2024:

Oh, by the way, I just remembered something: a naive k=0 and k=1 benchmark would also be interesting (namely [] and ()). I see that the smaller the size, the smaller the improvement (which is kind of expected), and we seem to get a ~30% improvement overall (considering the best of 5 loops).

I'm really interested in the pyperf benchmarks because they might differ (maybe the best of 5s are way faster than the average runs)

if (value == -1 && PyErr_Occurred())
char *str = PyBytes_AS_STRING(bytes);
PyObject *const *items = PySequence_Fast_ITEMS(x);
Py_BEGIN_CRITICAL_SECTION_SEQUENCE_FAST(x);
Member:

You need to acquire the critical section before calling PySequence_Fast_ITEMS and PySequence_Fast_GET_SIZE. Attempting to read mutable data without a lock isn't thread safe.
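
A minimal sketch of the requested ordering, assuming Py_END_CRITICAL_SECTION_SEQUENCE_FAST is the matching closing macro and that the internal pycore_critical_section.h header is available (the function name and loop body are placeholders): the size and item array are only read after the critical section has been entered.

#include <Python.h>
#include "pycore_critical_section.h"  /* CPython-internal header */

/* Sketch only: enter the critical section first, then read the size and the
 * item array, so a concurrent mutation on the free-threaded build cannot
 * invalidate them mid-loop. Real code checks types, errors and the 0..255
 * range as discussed elsewhere in this PR. */
static void
fill_from_known_sequence(char *str, PyObject *x)
{
    Py_BEGIN_CRITICAL_SECTION_SEQUENCE_FAST(x);
    Py_ssize_t size = PySequence_Fast_GET_SIZE(x);
    PyObject *const *items = PySequence_Fast_ITEMS(x);
    for (Py_ssize_t i = 0; i < size; i++) {
        str[i] = (char)PyLong_AsLong(items[i]);  /* placeholder conversion */
    }
    Py_END_CRITICAL_SECTION_SEQUENCE_FAST();
}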

PyObject *const *items = PySequence_Fast_ITEMS(x);
Py_BEGIN_CRITICAL_SECTION_SEQUENCE_FAST(x);
for (Py_ssize_t i = 0; i < size; i++) {
if (!PyLong_Check(items[i])) {
Member:

PyLong_CheckExact probably fits better here, and is faster!

Contributor:

(The previous code wasn't doing an exact check so we shouldn't change it)

Member:

Well, it wouldn't be a breaking change, just a performance loss for the (very niche!) set of cases that use special ints. I'm also slightly worried that non-exact ints might have some nasty side effects that we aren't anticipating here (e.g. can they mess with the Py_ssize_t value?)

@picnixz (Contributor), Dec 26, 2024:

They can mess with Py_ssize_t values, but only through __index__, and we already check that with PyNumber_AsSsize_t. The problem is wider, though: np.int32() values can be passed to bytes([...]), and without this check we would hurt the performance of numpy-related code, which is something we don't want.

Member:

Is casting to bytes a common thing to do with numpy integers, or is that speculation? (I see your point, I'm just sort of gauging what the cost-benefit would be here.)

Contributor:

I'd say yes, if we're considering serialization or introspection. I imagine we can run into this when people work with images, because their arrays won't necessarily be pure Python lists but np.ndarray objects, which may have non-primitive data types. I'm not aware of usage in the wild, though. But IMO, since we want to improve performance overall, we shouldn't penalize existing users; if we can give up a little of the performance gain in exchange for a stable result, that's good.

Nonetheless, if we want to fall back to a slow path for non-exact ints, benchmarks should show how performance is impacted (a simple timeit could be sufficient, but I'd be more comfortable with a more precise benchmark).
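
To make the trade-off concrete, here is one way the fallback could look if non-exact ints were routed to the slow path, as debated above; the function name is hypothetical, and the Py_None sentinel follows the diff fragment quoted earlier ("None as fallback sentinel to the slow path"). Whether to test with PyLong_Check or PyLong_CheckExact is exactly the open question here.

#include <Python.h>

/* Hypothetical sketch: fast path only for exact ints; anything else (e.g.
 * np.int32 or another __index__ type) makes the caller fall back to the
 * generic slow path, preserving today's behaviour and side effects. */
static PyObject *
bytes_from_int_sequence(PyObject *const *items, Py_ssize_t size)
{
    PyObject *bytes = PyBytes_FromStringAndSize(NULL, size);
    if (bytes == NULL) {
        return NULL;
    }
    char *str = PyBytes_AS_STRING(bytes);
    for (Py_ssize_t i = 0; i < size; i++) {
        if (!PyLong_CheckExact(items[i])) {
            Py_DECREF(bytes);
            Py_RETURN_NONE;  /* sentinel: caller takes the slow path */
        }
        int overflow;
        long value = PyLong_AsLongAndOverflow(items[i], &overflow);
        if (overflow || value < 0 || value >= 256) {
            Py_DECREF(bytes);
            PyErr_SetString(PyExc_ValueError,
                            "bytes must be in range(0, 256)");
            return NULL;
        }
        str[i] = (char)value;
    }
    return bytes;
}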

@picnixz dismissed their stale review (December 26, 2024 09:26):

Most changes were addressed (although I'm waiting for a pyperf comparison, but this can wait)
