bpo-38644: Add _PyObject_VectorcallTstate() #17052

Merged
merged 1 commit into from
Nov 8, 2019

@vstinner
Member

vstinner commented Nov 5, 2019

  • Add _PyObject_VectorcallTstate() function: similar to
    _PyObject_Vectorcall() with tstate parameter
  • Add tstate parameter to _PyObject_MakeTpCall()

https://bugs.python.org/issue38644

@vstinner
Member Author

vstinner commented Nov 5, 2019

@jdemeyer: Would it be possible to change _PyObject_Vectorcall() and _PyObject_MakeTpCall() to add a tstate parameter? Or is it better to add new functions? The functions are private, so we don't provide any backward compatibility guarantee.

@encukou
Member

encukou commented Nov 5, 2019

Please benchmark this very carefully on Windows.

One note from Mark was that there's a performance loss on Windows when passing more than 4 arguments in C. That's the main reason why Vectorcall combines the "number of args" and "flags" into a single argument.
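
To make the "single argument" point concrete, here is a minimal Python sketch of how a count and a flag can share one integer. It assumes a 64-bit size_t. The constant mirrors CPython's PY_VECTORCALL_ARGUMENTS_OFFSET, while the helper names pack_nargsf and unpack_nargsf are made up for this illustration and are not CPython functions:

```python
# Illustration only: models the C-level bit packing in plain Python.
SIZE_T_BITS = 64  # assumption: 64-bit platform

# Mirrors CPython's PY_VECTORCALL_ARGUMENTS_OFFSET: the top bit of a size_t.
PY_VECTORCALL_ARGUMENTS_OFFSET = 1 << (SIZE_T_BITS - 1)


def pack_nargsf(nargs, arguments_offset=False):
    """Combine the positional-argument count and the flag into one value."""
    if not 0 <= nargs < PY_VECTORCALL_ARGUMENTS_OFFSET:
        raise ValueError("nargs does not fit below the flag bit")
    flag = PY_VECTORCALL_ARGUMENTS_OFFSET if arguments_offset else 0
    return nargs | flag


def unpack_nargsf(nargsf):
    """Split the value again; the masking matches what PyVectorcall_NARGS() does."""
    nargs = nargsf & ~PY_VECTORCALL_ARGUMENTS_OFFSET
    flag = bool(nargsf & PY_VECTORCALL_ARGUMENTS_OFFSET)
    return nargs, flag
```

Packing this way keeps the call at four C arguments (callable, args, nargsf, kwnames) instead of five.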

@vstinner
Member Author

vstinner commented Nov 7, 2019

tl;dr: I don't see any significant performance difference when running microbenchmarks on Linux.

The Linux x86-64 ABI allows passing up to six integer or pointer function parameters in registers: https://en.wikipedia.org/wiki/X86_calling_conventions#System_V_AMD64_ABI
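
As a rough illustration of why the extra tstate parameter matters more on Windows than on Linux, the sketch below assigns the parameters of the two functions to integer argument registers under each calling convention. It is a simplified model: it ignores inlining and any other compiler optimization, and the helper name place_arguments is hypothetical.

```python
# Integer/pointer argument registers, in order, for each calling convention.
INTEGER_ARG_REGISTERS = {
    "sysv_amd64": ("rdi", "rsi", "rdx", "rcx", "r8", "r9"),  # Linux x86-64
    "win64": ("rcx", "rdx", "r8", "r9"),                     # Microsoft x64
}

# Parameter lists of the two functions; every parameter is a pointer or size_t.
VECTORCALL_PARAMS = ("callable", "args", "nargsf", "kwnames")
VECTORCALL_TSTATE_PARAMS = ("tstate",) + VECTORCALL_PARAMS


def place_arguments(params, convention):
    """Return (register -> parameter mapping, parameters left for the stack)."""
    regs = INTEGER_ARG_REGISTERS[convention]
    in_registers = dict(zip(regs, params))
    on_stack = list(params[len(regs):])
    return in_registers, on_stack


for convention in INTEGER_ARG_REGISTERS:
    _, spilled = place_arguments(VECTORCALL_TSTATE_PARAMS, convention)
    print(convention, "stack arguments:", spilled)
```

In this model the five-parameter variant still fits in registers on Linux, while on Windows its last parameter goes to the stack.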

I ran a benchmark on Linux (Fedora 30) with CPU isolation (isolcpus=2,3,6,7 rcu_nocbs=2,3,6,7 passed on the Linux kernel command line).

EDIT: I compiled Python with PGO+LTO: make distclean; ./configure --enable-optimizations --with-lto && make.

I used all benchmarks matching "bench*call*.py" in my https://github.com/vstinner/pymicrobench project, a collection of Python microbenchmarks:

bench_call_method.py
bench_call_method_slots.py
bench_call_method_unknown.py
bench_call_simple.py
bench_fastcall_builtins.py
bench_fastcall_bytes_join.py
bench_fastcall_call_pyinit_kwargs.py
bench_fastcall_c_method.py
bench_fastcall_deque_methods.py
bench_fastcall_dict_methods.py
bench_fastcall_partial.py
bench_fastcall_slots.py
bench_fastcall_str_methods.py
bench_fastcall_struct.py

If I ignore differences smaller than 5%, all results are shown as "Not significant". These are microbenchmarks. IMHO on a microbenchmark, a significant difference should be at least 10% (-10% or +10%).
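
The threshold rule described above can be sketched in a few lines of Python. This is a simplified model, not pyperf's implementation: pyperf also applies its own statistical significance test, which is left out here, and the function name classify is hypothetical.

```python
def classify(ref, patch, min_speed=5.0):
    """Label a timing change, ignoring changes below min_speed percent.

    ref and patch are mean timings in the same unit; lower is faster.
    """
    change = (patch - ref) / ref * 100.0  # positive means the patch is slower
    if abs(change) < min_speed:
        return "not significant"
    return "slower" if change > 0 else "faster"


# call_method on Linux: 8.86 ms (ref) vs 9.04 ms (patch), about +2%.
print(classify(8.86, 9.04))
```

Raising min_speed to 10.0 applies the stricter 10% rule to the same pair of values.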

=== bench_call_method.json ===
Not significant (1): call_method

=== bench_call_method_slots.json ===
Not significant (1): call_method_slots

=== bench_call_method_unknown.json ===
Not significant (1): call_method_unknown

=== bench_call_simple.json ===
Not significant (1): call_simple

=== bench_fastcall_builtins.json ===
Not significant (7): filter(lambda x: x, list(range(1000))); map(lambda x: x, list(range(1000))); sorted(list(range(1000)), key=lambda x: x); namedtuple.attr; object.__setattr__(obj, "x", 1); object.__getattribute__(obj, "x"); getattr(1, "real")

=== bench_fastcall_bytes_join.json ===
Not significant (2): b"".join((b"hello", b"world")); b"".join((b"hello", b"world") * 100)

=== bench_fastcall_call_pyinit_kwargs.json ===
Not significant (3): call_pyinit_kw1; call_pyinit_kw5; call_pyinit_kw10

=== bench_fastcall_c_method.json ===
Not significant (3): b"".decode(); b"".decode("ascii"); [0].count(0)

=== bench_fastcall_deque_methods.json ===
Not significant (4): collections.deque.rotate(); collections.deque.rotate(1); collections.deque([None]).index(None); collections.deque.insert()

=== bench_fastcall_dict_methods.json ===
Not significant (2): {1: 2}.get(1); {1: 2}.get(7, None)

=== bench_fastcall_partial.json ===
Not significant (7): partial Python, 1+1 arg; partial Python, 2+0 arg; partial Python, 5+1 arg; partial Python, 5+5 arg; partial C VARARGS, 1+1 arg; partial C VARARGS, 2+0 arg; partial C FASTCALL, 1+0 arg

=== bench_fastcall_slots.json ===
Not significant (2): Python __int__: int(obj); Python __getitem__: obj[0]

=== bench_fastcall_str_methods.json ===
Not significant (1): "a".replace("x", "y")

=== bench_fastcall_struct.json ===
Not significant (2): int.to_bytes(1, 4, "little"); struct.pack("i", 1)

Raw results:

=== bench_call_method.json ===
+-------------+----------------------------+------------------------------+
| Benchmark   | ref/bench_call_method.json | patch/bench_call_method.json |
+=============+============================+==============================+
| call_method | 8.86 ms                    | 9.04 ms: 1.02x slower (+2%)  |
+-------------+----------------------------+------------------------------+

=== bench_call_method_slots.json ===
+-------------------+----------------------------------+------------------------------------+
| Benchmark         | ref/bench_call_method_slots.json | patch/bench_call_method_slots.json |
+===================+==================================+====================================+
| call_method_slots | 8.97 ms                          | 8.84 ms: 1.02x faster (-1%)        |
+-------------------+----------------------------------+------------------------------------+

=== bench_call_method_unknown.json ===
Not significant (1): call_method_unknown

=== bench_call_simple.json ===
Not significant (1): call_simple

=== bench_fastcall_builtins.json ===
+--------------------------------------------+----------------------------------+------------------------------------+
| Benchmark                                  | ref/bench_fastcall_builtins.json | patch/bench_fastcall_builtins.json |
+============================================+==================================+====================================+
| sorted(list(range(1000)), key=lambda x: x) | 50.5 us                          | 49.0 us: 1.03x faster (-3%)        |
+--------------------------------------------+----------------------------------+------------------------------------+
| object.__setattr__(obj, "x", 1)            | 80.3 ns                          | 79.6 ns: 1.01x faster (-1%)        |
+--------------------------------------------+----------------------------------+------------------------------------+
| object.__getattribute__(obj, "x")          | 61.6 ns                          | 62.0 ns: 1.01x slower (+1%)        |
+--------------------------------------------+----------------------------------+------------------------------------+

Not significant (4): filter(lambda x: x, list(range(1000))); map(lambda x: x, list(range(1000))); namedtuple.attr; getattr(1, "real")

=== bench_fastcall_bytes_join.json ===
+--------------------------------------+------------------------------------+--------------------------------------+
| Benchmark                            | ref/bench_fastcall_bytes_join.json | patch/bench_fastcall_bytes_join.json |
+======================================+====================================+======================================+
| b"".join((b"hello", b"world") * 100) | 5.09 us                            | 5.14 us: 1.01x slower (+1%)          |
+--------------------------------------+------------------------------------+--------------------------------------+

Not significant (1): b"".join((b"hello", b"world"))

=== bench_fastcall_call_pyinit_kwargs.json ===
+------------------+--------------------------------------------+----------------------------------------------+
| Benchmark        | ref/bench_fastcall_call_pyinit_kwargs.json | patch/bench_fastcall_call_pyinit_kwargs.json |
+==================+============================================+==============================================+
| call_pyinit_kw1  | 201 ns                                     | 197 ns: 1.02x faster (-2%)                   |
+------------------+--------------------------------------------+----------------------------------------------+
| call_pyinit_kw5  | 322 ns                                     | 317 ns: 1.02x faster (-2%)                   |
+------------------+--------------------------------------------+----------------------------------------------+
| call_pyinit_kw10 | 532 ns                                     | 530 ns: 1.00x faster (-0%)                   |
+------------------+--------------------------------------------+----------------------------------------------+

=== bench_fastcall_c_method.json ===
+--------------+----------------------------------+------------------------------------+
| Benchmark    | ref/bench_fastcall_c_method.json | patch/bench_fastcall_c_method.json |
+==============+==================================+====================================+
| b"".decode() | 29.7 ns                          | 29.2 ns: 1.02x faster (-2%)        |
+--------------+----------------------------------+------------------------------------+
| [0].count(0) | 34.6 ns                          | 33.4 ns: 1.03x faster (-3%)        |
+--------------+----------------------------------+------------------------------------+

Not significant (1): b"".decode("ascii")

=== bench_fastcall_deque_methods.json ===
+---------------------------------------+---------------------------------------+-----------------------------------------+
| Benchmark                             | ref/bench_fastcall_deque_methods.json | patch/bench_fastcall_deque_methods.json |
+=======================================+=======================================+=========================================+
| collections.deque.rotate()            | 39.5 ns                               | 39.0 ns: 1.01x faster (-1%)             |
+---------------------------------------+---------------------------------------+-----------------------------------------+
| collections.deque([None]).index(None) | 62.2 ns                               | 63.0 ns: 1.01x slower (+1%)             |
+---------------------------------------+---------------------------------------+-----------------------------------------+
| collections.deque.insert()            | 446 ns                                | 442 ns: 1.01x faster (-1%)              |
+---------------------------------------+---------------------------------------+-----------------------------------------+

Not significant (1): collections.deque.rotate(1)

=== bench_fastcall_dict_methods.json ===
+---------------+--------------------------------------+----------------------------------------+
| Benchmark     | ref/bench_fastcall_dict_methods.json | patch/bench_fastcall_dict_methods.json |
+===============+======================================+========================================+
| {1: 2}.get(1) | 37.7 ns                              | 37.1 ns: 1.02x faster (-2%)            |
+---------------+--------------------------------------+----------------------------------------+

Not significant (1): {1: 2}.get(7, None)

=== bench_fastcall_partial.json ===
+----------------------------+---------------------------------+-----------------------------------+
| Benchmark                  | ref/bench_fastcall_partial.json | patch/bench_fastcall_partial.json |
+============================+=================================+===================================+
| partial Python, 2+0 arg    | 54.1 ns                         | 54.8 ns: 1.01x slower (+1%)       |
+----------------------------+---------------------------------+-----------------------------------+
| partial Python, 5+5 arg    | 113 ns                          | 111 ns: 1.01x faster (-1%)        |
+----------------------------+---------------------------------+-----------------------------------+
| partial C VARARGS, 1+1 arg | 121 ns                          | 117 ns: 1.03x faster (-3%)        |
+----------------------------+---------------------------------+-----------------------------------+
| partial C VARARGS, 2+0 arg | 88.7 ns                         | 90.4 ns: 1.02x slower (+2%)       |
+----------------------------+---------------------------------+-----------------------------------+

Not significant (3): partial Python, 1+1 arg; partial Python, 5+1 arg; partial C FASTCALL, 1+0 arg

=== bench_fastcall_slots.json ===
+----------------------------+-------------------------------+---------------------------------+
| Benchmark                  | ref/bench_fastcall_slots.json | patch/bench_fastcall_slots.json |
+============================+===============================+=================================+
| Python __int__: int(obj)   | 103 ns                        | 104 ns: 1.02x slower (+2%)      |
+----------------------------+-------------------------------+---------------------------------+
| Python __getitem__: obj[0] | 70.6 ns                       | 73.6 ns: 1.04x slower (+4%)     |
+----------------------------+-------------------------------+---------------------------------+

=== bench_fastcall_str_methods.json ===
+-----------------------+-------------------------------------+---------------------------------------+
| Benchmark             | ref/bench_fastcall_str_methods.json | patch/bench_fastcall_str_methods.json |
+=======================+=====================================+=======================================+
| "a".replace("x", "y") | 40.1 ns                             | 40.8 ns: 1.02x slower (+2%)           |
+-----------------------+-------------------------------------+---------------------------------------+

=== bench_fastcall_struct.json ===
+------------------------------+--------------------------------+----------------------------------+
| Benchmark                    | ref/bench_fastcall_struct.json | patch/bench_fastcall_struct.json |
+==============================+================================+==================================+
| int.to_bytes(1, 4, "little") | 49.3 ns                        | 49.1 ns: 1.00x faster (-0%)      |
+------------------------------+--------------------------------+----------------------------------+
| struct.pack("i", 1)          | 72.8 ns                        | 69.6 ns: 1.05x faster (-4%)      |
+------------------------------+--------------------------------+----------------------------------+

Script to run the benchmark:

set -e -x
SRC=~/myprojects/pymicrobench
ENV=env
rm -rf $ENV
./python -m venv $ENV
PYTHON=$ENV/bin/python
$PYTHON -m pip install pyperf
for script in $(cd $SRC; ls bench*call*.py); do
    $PYTHON $SRC/$script -v -o ${script:0:-3}.json
done

I moved all .json to ref/ (reference Python) or patch/ (patched Python).

Script to compare results:

set -e
for name in $(cd ref; ls *.json); do
    echo "=== $name ==="
    python3 -m pyperf compare_to ref/$name patch/$name --table #--min-speed=5
    echo
done

@encukou
Member

encukou commented Nov 7, 2019

That's not surprising. Are you planning to run Windows benchmarks as well?

@vstinner
Member Author

vstinner commented Nov 7, 2019

> That's not surprising. Are you planning to run Windows benchmarks as well?

I'm now trying to redo the same benchmark on Windows.

@vstinner
Member Author

vstinner commented Nov 7, 2019

It's hard for me to interpret microbenchmark results on Windows because I don't know how to minimize the standard deviation.

I installed psutil, so pyperf calls proc.nice(psutil.REALTIME_PRIORITY_CLASS) to set the benchmark process to the highest priority.

All differences are smaller than 10%. Some microbenchmarks are faster, some are slower, but that may just be benchmark "noise".


Benchmarks on Windows. I used the following commands to build Python:

git clean -fdx # clean checkout
PCbuild\build.bat -p x64 --pgo

I cloned https://github.com/vstinner/pymicrobench. To create the venv, I used:

rmdir /q /s env
python -m venv env
env\Scripts\python -m pip install pyperf
echo "now install psutil manually"

I downloaded https://files.pythonhosted.org/packages/03/9a/95c4b3d0424426e5fd94b5302ff74cea44d5d4f53466e1228ac8e73e14b4/psutil-5.6.5.tar.gz and extracted using "python -m tarfile -e psutil-5.6.5.tar.gz". I hacked its setup.py to add include_dirs=[r"\vstinner\python\master\PC"] to the Windows Extension. I installed it using the python.exe of the venv:

# in psutil source
\vstinner\python\master\env\Scripts\python setup.py install

Then I ran benchmarks using:

env\Scripts\python \vstinner\pymicrobench\bench_call_method.py -o patch\bench_call_method.json -v
env\Scripts\python \vstinner\pymicrobench\bench_call_method_slots.py -o patch\bench_call_method_slots.json -v
env\Scripts\python \vstinner\pymicrobench\bench_call_method_unknown.py -o patch\bench_call_method_unknown.json -v
env\Scripts\python \vstinner\pymicrobench\bench_call_simple.py -o patch\bench_call_simple.json -v
env\Scripts\python \vstinner\pymicrobench\bench_fastcall_builtins.py -o patch\bench_fastcall_builtins.json -v
env\Scripts\python \vstinner\pymicrobench\bench_fastcall_bytes_join.py -o patch\bench_fastcall_bytes_join.json -v
env\Scripts\python \vstinner\pymicrobench\bench_fastcall_call_pyinit_kwargs.py -o patch\bench_fastcall_call_pyinit_kwargs.json -v
env\Scripts\python \vstinner\pymicrobench\bench_fastcall_c_method.py -o patch\bench_fastcall_c_method.json -v
env\Scripts\python \vstinner\pymicrobench\bench_fastcall_deque_methods.py -o patch\bench_fastcall_deque_methods.json -v
env\Scripts\python \vstinner\pymicrobench\bench_fastcall_dict_methods.py -o patch\bench_fastcall_dict_methods.json -v
env\Scripts\python \vstinner\pymicrobench\bench_fastcall_partial.py -o patch\bench_fastcall_partial.json -v
env\Scripts\python \vstinner\pymicrobench\bench_fastcall_slots.py -o patch\bench_fastcall_slots.json -v
env\Scripts\python \vstinner\pymicrobench\bench_fastcall_str_methods.py -o patch\bench_fastcall_str_methods.json -v
env\Scripts\python \vstinner\pymicrobench\bench_fastcall_struct.py -o patch\bench_fastcall_struct.json -v
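
The commands above all follow one pattern, so they could also be generated rather than typed. A minimal sketch, assuming the same directory layout as in the commands; the names BENCH_DIR, PYTHON and build_command are hypothetical and only the standard library is used:

```python
from pathlib import PureWindowsPath

# Assumed locations, copied from the commands above.
BENCH_DIR = PureWindowsPath(r"\vstinner\pymicrobench")
PYTHON = PureWindowsPath(r"env\Scripts\python")


def build_command(script_name, outdir="patch"):
    """Build the command line that runs one benchmark and writes its JSON."""
    stem = PureWindowsPath(script_name).stem  # "bench_x.py" -> "bench_x"
    script = BENCH_DIR / script_name
    output = PureWindowsPath(outdir) / f"{stem}.json"
    return f"{PYTHON} {script} -o {output} -v"


for name in ("bench_call_method.py", "bench_call_simple.py"):
    print(build_command(name))
```

Passing outdir="ref" would produce the matching commands for the reference build.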

Comparison ignoring differences smaller than 5%:


vstinner@WIN C:\vstinner\python\master>compare

vstinner@WIN C:\vstinner\python\master>env\scripts\python -m pyperf compare_to ref\bench_call_method.json patch\bench_call_method.json --table --min-speed=5
Not significant (1): call_method

vstinner@WIN C:\vstinner\python\master>env\scripts\python -m pyperf compare_to ref\bench_call_method_slots.json patch\bench_call_method_slots.json --table --min-speed=5
Not significant (1): call_method_slots

vstinner@WIN C:\vstinner\python\master>env\scripts\python -m pyperf compare_to ref\bench_call_method_unknown.json patch\bench_call_method_unknown.json --table --min-speed=5
Not significant (1): call_method_unknown

vstinner@WIN C:\vstinner\python\master>env\scripts\python -m pyperf compare_to ref\bench_call_simple.json patch\bench_call_simple.json --table --min-speed=5
Not significant (1): call_simple

vstinner@WIN C:\vstinner\python\master>env\scripts\python -m pyperf compare_to ref\bench_fastcall_builtins.json patch\bench_fastcall_builtins.json --table --min-speed=5
+--------------------------------------------+----------------------------------+------------------------------------+
| Benchmark                                  | ref\bench_fastcall_builtins.json | patch\bench_fastcall_builtins.json |
+============================================+==================================+====================================+
| filter(lambda x: x, list(range(1000)))     | 79.0 us                          | 83.5 us: 1.06x slower (+6%)        |
+--------------------------------------------+----------------------------------+------------------------------------+
| sorted(list(range(1000)), key=lambda x: x) | 70.4 us                          | 75.2 us: 1.07x slower (+7%)        |
+--------------------------------------------+----------------------------------+------------------------------------+
| getattr(1, "real")                         | 60.9 ns                          | 57.4 ns: 1.06x faster (-6%)        |
+--------------------------------------------+----------------------------------+------------------------------------+

Not significant (4): map(lambda x: x, list(range(1000))); namedtuple.attr; object.__setattr__(obj, "x", 1); object.__getattribute__(obj, "x")

vstinner@WIN C:\vstinner\python\master>env\scripts\python -m pyperf compare_to ref\bench_fastcall_bytes_join.json patch\bench_fastcall_bytes_join.json --table --min-speed=5
Not significant (2): b"".join((b"hello", b"world")); b"".join((b"hello", b"world") * 100)

vstinner@WIN C:\vstinner\python\master>env\scripts\python -m pyperf compare_to ref\bench_fastcall_call_pyinit_kwargs.json patch\bench_fastcall_call_pyinit_kwargs.json --table --min-speed=5
Not significant (3): call_pyinit_kw1; call_pyinit_kw5; call_pyinit_kw10

vstinner@WIN C:\vstinner\python\master>env\scripts\python -m pyperf compare_to ref\bench_fastcall_c_method.json patch\bench_fastcall_c_method.json --table --min-speed=5
Not significant (3): b"".decode(); b"".decode("ascii"); [0].count(0)

vstinner@WIN C:\vstinner\python\master>env\scripts\python -m pyperf compare_to ref\bench_fastcall_deque_methods.json patch\bench_fastcall_deque_methods.json --table --min-speed=5
+----------------------------+---------------------------------------+-----------------------------------------+
| Benchmark                  | ref\bench_fastcall_deque_methods.json | patch\bench_fastcall_deque_methods.json |
+============================+=======================================+=========================================+
| collections.deque.rotate() | 71.1 ns                               | 64.4 ns: 1.10x faster (-9%)             |
+----------------------------+---------------------------------------+-----------------------------------------+

Not significant (3): collections.deque.rotate(1); collections.deque([None]).index(None); collections.deque.insert()

vstinner@WIN C:\vstinner\python\master>env\scripts\python -m pyperf compare_to ref\bench_fastcall_dict_methods.json patch\bench_fastcall_dict_methods.json --table --min-speed=5
Not significant (2): {1: 2}.get(1); {1: 2}.get(7, None)

vstinner@WIN C:\vstinner\python\master>env\scripts\python -m pyperf compare_to ref\bench_fastcall_partial.json patch\bench_fastcall_partial.json --table --min-speed=5
+-------------------------+---------------------------------+-----------------------------------+
| Benchmark               | ref\bench_fastcall_partial.json | patch\bench_fastcall_partial.json |
+=========================+=================================+===================================+
| partial Python, 2+0 arg | 100 ns                          | 92.3 ns: 1.09x faster (-8%)       |
+-------------------------+---------------------------------+-----------------------------------+
| partial Python, 5+5 arg | 160 ns                          | 151 ns: 1.06x faster (-6%)        |
+-------------------------+---------------------------------+-----------------------------------+

Not significant (5): partial Python, 1+1 arg; partial Python, 5+1 arg; partial C VARARGS, 1+1 arg; partial C VARARGS, 2+0 arg; partial C FASTCALL, 1+0 arg

vstinner@WIN C:\vstinner\python\master>env\scripts\python -m pyperf compare_to ref\bench_fastcall_slots.json patch\bench_fastcall_slots.json --table --min-speed=5
Not significant (2): Python __int__: int(obj); Python __getitem__: obj[0]

vstinner@WIN C:\vstinner\python\master>env\scripts\python -m pyperf compare_to ref\bench_fastcall_struct.json patch\bench_fastcall_struct.json --table --min-speed=5
Not significant (2): int.to_bytes(1, 4, "little"); struct.pack("i", 1)

vstinner@WIN C:\vstinner\python\master>env\scripts\python -m pyperf compare_to ref\bench_fastcall_str_methods.json patch\bench_fastcall_str_methods.json --table --min-speed=5
Not significant (1): "a".replace("x", "y")

Full comparison (without --min-speed):


vstinner@WIN C:\vstinner\python\master>env\scripts\python -m pyperf compare_to ref\bench_call_method.json patch\bench_call_method.json --table
Not significant (1): call_method

vstinner@WIN C:\vstinner\python\master>env\scripts\python -m pyperf compare_to ref\bench_call_method_slots.json patch\bench_call_method_slots.json --table
Not significant (1): call_method_slots

vstinner@WIN C:\vstinner\python\master>env\scripts\python -m pyperf compare_to ref\bench_call_method_unknown.json patch\bench_call_method_unknown.json --table
+---------------------+------------------------------------+--------------------------------------+
| Benchmark           | ref\bench_call_method_unknown.json | patch\bench_call_method_unknown.json |
+=====================+====================================+======================================+
| call_method_unknown | 15.7 ms                            | 16.3 ms: 1.04x slower (+4%)          |
+---------------------+------------------------------------+--------------------------------------+

vstinner@WIN C:\vstinner\python\master>env\scripts\python -m pyperf compare_to ref\bench_call_simple.json patch\bench_call_simple.json --table
+-------------+----------------------------+------------------------------+
| Benchmark   | ref\bench_call_simple.json | patch\bench_call_simple.json |
+=============+============================+==============================+
| call_simple | 10.9 ms                    | 11.3 ms: 1.03x slower (+3%)  |
+-------------+----------------------------+------------------------------+

vstinner@WIN C:\vstinner\python\master>env\scripts\python -m pyperf compare_to ref\bench_fastcall_builtins.json patch\bench_fastcall_builtins.json --table
+--------------------------------------------+----------------------------------+------------------------------------+
| Benchmark                                  | ref\bench_fastcall_builtins.json | patch\bench_fastcall_builtins.json |
+============================================+==================================+====================================+
| filter(lambda x: x, list(range(1000)))     | 79.0 us                          | 83.5 us: 1.06x slower (+6%)        |
+--------------------------------------------+----------------------------------+------------------------------------+
| map(lambda x: x, list(range(1000)))        | 80.4 us                          | 82.5 us: 1.03x slower (+3%)        |
+--------------------------------------------+----------------------------------+------------------------------------+
| sorted(list(range(1000)), key=lambda x: x) | 70.4 us                          | 75.2 us: 1.07x slower (+7%)        |
+--------------------------------------------+----------------------------------+------------------------------------+
| object.__setattr__(obj, "x", 1)            | 113 ns                           | 110 ns: 1.03x faster (-3%)         |
+--------------------------------------------+----------------------------------+------------------------------------+
| getattr(1, "real")                         | 60.9 ns                          | 57.4 ns: 1.06x faster (-6%)        |
+--------------------------------------------+----------------------------------+------------------------------------+

Not significant (2): namedtuple.attr; object.__getattribute__(obj, "x")

vstinner@WIN C:\vstinner\python\master>env\scripts\python -m pyperf compare_to ref\bench_fastcall_bytes_join.json patch\bench_fastcall_bytes_join.json --table
+--------------------------------+------------------------------------+--------------------------------------+
| Benchmark                      | ref\bench_fastcall_bytes_join.json | patch\bench_fastcall_bytes_join.json |
+================================+====================================+======================================+
| b"".join((b"hello", b"world")) | 88.8 ns                            | 86.8 ns: 1.02x faster (-2%)          |
+--------------------------------+------------------------------------+--------------------------------------+

Not significant (1): b"".join((b"hello", b"world") * 100)

vstinner@WIN C:\vstinner\python\master>env\scripts\python -m pyperf compare_to ref\bench_fastcall_call_pyinit_kwargs.json patch\bench_fastcall_call_pyinit_kwargs.json --table
Not significant (3): call_pyinit_kw1; call_pyinit_kw5; call_pyinit_kw10

vstinner@WIN C:\vstinner\python\master>env\scripts\python -m pyperf compare_to ref\bench_fastcall_c_method.json patch\bench_fastcall_c_method.json --table
+--------------+----------------------------------+------------------------------------+
| Benchmark    | ref\bench_fastcall_c_method.json | patch\bench_fastcall_c_method.json |
+==============+==================================+====================================+
| [0].count(0) | 55.1 ns                          | 53.6 ns: 1.03x faster (-3%)        |
+--------------+----------------------------------+------------------------------------+

Not significant (2): b"".decode(); b"".decode("ascii")

vstinner@WIN C:\vstinner\python\master>env\scripts\python -m pyperf compare_to ref\bench_fastcall_deque_methods.json patch\bench_fastcall_deque_methods.json --table
+---------------------------------------+---------------------------------------+-----------------------------------------+
| Benchmark                             | ref\bench_fastcall_deque_methods.json | patch\bench_fastcall_deque_methods.json |
+=======================================+=======================================+=========================================+
| collections.deque.rotate()            | 71.1 ns                               | 64.4 ns: 1.10x faster (-9%)             |
+---------------------------------------+---------------------------------------+-----------------------------------------+
| collections.deque([None]).index(None) | 93.4 ns                               | 89.3 ns: 1.05x faster (-4%)             |
+---------------------------------------+---------------------------------------+-----------------------------------------+

Not significant (2): collections.deque.rotate(1); collections.deque.insert()

vstinner@WIN C:\vstinner\python\master>env\scripts\python -m pyperf compare_to ref\bench_fastcall_dict_methods.json patch\bench_fastcall_dict_methods.json --table
Not significant (2): {1: 2}.get(1); {1: 2}.get(7, None)

vstinner@WIN C:\vstinner\python\master>env\scripts\python -m pyperf compare_to ref\bench_fastcall_partial.json patch\bench_fastcall_partial.json --table
+-----------------------------+---------------------------------+-----------------------------------+
| Benchmark                   | ref\bench_fastcall_partial.json | patch\bench_fastcall_partial.json |
+=============================+=================================+===================================+
| partial Python, 1+1 arg     | 101 ns                          | 97.4 ns: 1.03x faster (-3%)       |
+-----------------------------+---------------------------------+-----------------------------------+
| partial Python, 2+0 arg     | 100 ns                          | 92.3 ns: 1.09x faster (-8%)       |
+-----------------------------+---------------------------------+-----------------------------------+
| partial Python, 5+1 arg     | 130 ns                          | 125 ns: 1.04x faster (-4%)        |
+-----------------------------+---------------------------------+-----------------------------------+
| partial Python, 5+5 arg     | 160 ns                          | 151 ns: 1.06x faster (-6%)        |
+-----------------------------+---------------------------------+-----------------------------------+
| partial C VARARGS, 1+1 arg  | 156 ns                          | 153 ns: 1.02x faster (-2%)        |
+-----------------------------+---------------------------------+-----------------------------------+
| partial C VARARGS, 2+0 arg  | 120 ns                          | 116 ns: 1.03x faster (-3%)        |
+-----------------------------+---------------------------------+-----------------------------------+
| partial C FASTCALL, 1+0 arg | 40.5 ns                         | 39.4 ns: 1.03x faster (-3%)       |
+-----------------------------+---------------------------------+-----------------------------------+

vstinner@WIN C:\vstinner\python\master>env\scripts\python -m pyperf compare_to ref\bench_fastcall_slots.json patch\bench_fastcall_slots.json --table
+--------------------------+-------------------------------+---------------------------------+
| Benchmark                | ref\bench_fastcall_slots.json | patch\bench_fastcall_slots.json |
+==========================+===============================+=================================+
| Python __int__: int(obj) | 161 ns                        | 155 ns: 1.03x faster (-3%)      |
+--------------------------+-------------------------------+---------------------------------+

Not significant (1): Python __getitem__: obj[0]

vstinner@WIN C:\vstinner\python\master>env\scripts\python -m pyperf compare_to ref\bench_fastcall_struct.json patch\bench_fastcall_struct.json --table
+---------------------+--------------------------------+----------------------------------+
| Benchmark           | ref\bench_fastcall_struct.json | patch\bench_fastcall_struct.json |
+=====================+================================+==================================+
| struct.pack("i", 1) | 82.3 ns                        | 85.3 ns: 1.04x slower (+4%)      |
+---------------------+--------------------------------+----------------------------------+

Not significant (1): int.to_bytes(1, 4, "little")

vstinner@WIN C:\vstinner\python\master>env\scripts\python -m pyperf compare_to ref\bench_fastcall_str_methods.json patch\bench_fastcall_str_methods.json --table
+-----------------------+-------------------------------------+---------------------------------------+
| Benchmark             | ref\bench_fastcall_str_methods.json | patch\bench_fastcall_str_methods.json |
+=======================+=====================================+=======================================+
| "a".replace("x", "y") | 68.9 ns                             | 70.5 ns: 1.02x slower (+2%)           |
+-----------------------+-------------------------------------+---------------------------------------+

The problem is that I don't know how to run benchmarks on Windows. For example, on bench_fastcall_partial.json, the std dev is between +- 0.8 ns and +- 9 ns:

partial Python, 1+1 arg: Mean +- std dev: [ref\bench_fastcall_partial.json] 101 ns +- 3 ns -> [patch\bench_fastcall_partial.json] 97.4 ns +- 5.4 ns: 1.03x faster (-3%)
partial Python, 2+0 arg: Mean +- std dev: [ref\bench_fastcall_partial.json] 100 ns +- 6 ns -> [patch\bench_fastcall_partial.json] 92.3 ns +- 5.3 ns: 1.09x faster (-8%)
partial Python, 5+1 arg: Mean +- std dev: [ref\bench_fastcall_partial.json] 130 ns +- 6 ns -> [patch\bench_fastcall_partial.json] 125 ns +- 6 ns: 1.04x faster (-4%)
partial Python, 5+5 arg: Mean +- std dev: [ref\bench_fastcall_partial.json] 160 ns +- 7 ns -> [patch\bench_fastcall_partial.json] 151 ns +- 4 ns: 1.06x faster (-6%)
partial C VARARGS, 1+1 arg: Mean +- std dev: [ref\bench_fastcall_partial.json] 156 ns +- 7 ns -> [patch\bench_fastcall_partial.json] 153 ns +- 9 ns: 1.02x faster (-2%)
partial C VARARGS, 2+0 arg: Mean +- std dev: [ref\bench_fastcall_partial.json] 120 ns +- 8 ns -> [patch\bench_fastcall_partial.json] 116 ns +- 5 ns: 1.03x faster (-3%)
partial C FASTCALL, 1+0 arg: Mean +- std dev: [ref\bench_fastcall_partial.json] 40.5 ns +- 0.8 ns -> [patch\bench_fastcall_partial.json] 39.4 ns +- 1.9 ns: 1.03x faster (-3%)

When I read 100 ns +- 6 ns -> 92.3 ns +- 5.3 ns: 1.09x faster (-8%), it's not easy for me to confirm whether the difference is really significant because of the "large" std dev: on 100 ns, a 6 ns std dev is large, knowing that the speedup is only 7.7 ns.
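
One way to reason about the question above is to compare the difference of the means against the standard error of that difference, rather than against the raw std dev of each run. The following is only a minimal sketch using Welch's t statistic computed from the summary numbers quoted above. The number of values per run (`N_VALUES`) is an assumption, since the output above does not state it, and the helper name `welch_t` is hypothetical; this is not necessarily how pyperf itself decides significance.

```python
import math


def welch_t(mean_a, std_a, n_a, mean_b, std_b, n_b):
    """Return Welch's t statistic for two samples given summary statistics."""
    # Standard error of the difference between the two sample means.
    standard_error = math.sqrt(std_a ** 2 / n_a + std_b ** 2 / n_b)
    return (mean_a - mean_b) / standard_error


# Mean and std dev (in ns) from the "partial Python, 2+0 arg" line above.
REF_MEAN, REF_STD = 100.0, 6.0
PATCH_MEAN, PATCH_STD = 92.3, 5.3

# ASSUMPTION: number of measured values in each run; not given in the output.
N_VALUES = 60

t = welch_t(REF_MEAN, REF_STD, N_VALUES, PATCH_MEAN, PATCH_STD, N_VALUES)
print(f"difference: {REF_MEAN - PATCH_MEAN:.1f} ns, t statistic: {t:.2f}")
```

Under that assumed sample count, the t statistic comes out well above 2, so a 7.7 ns difference can be distinguishable even when each run has a std dev of about 6 ns. With far fewer values per run, or with values that are not independent (for example because of system noise that drifts over time), the same means and std devs would be much less convincing.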

* Add _PyObject_VectorcallTstate() function: similar to
  _PyObject_Vectorcall(), but with tstate parameter
* Add tstate parameter to _PyObject_MakeTpCall()
@vstinner
Member Author

vstinner commented Nov 8, 2019

I removed _PyObject_MakeTpCallTstate(): I modified _PyObject_MakeTpCall() to add a tstate parameter instead. You should not call _PyObject_MakeTpCall() directly. But I chose to leave _PyObject_Vectorcall() unchanged, and add _PyObject_VectorcallTstate().

@vstinner vstinner changed the title from "[WIP] bpo-38644: Add _PyObject_VectorcallTstate()" to "bpo-38644: Add _PyObject_VectorcallTstate()" Nov 8, 2019
@vstinner vstinner merged commit 7e43373 into python:master Nov 8, 2019
@vstinner vstinner deleted the call_tstate2 branch November 8, 2019 09:05
@vstinner
Member Author

vstinner commented Nov 8, 2019

I didn't notice any significant performance overhead from this change, nor any speedup. The net effect seems to be zero. This change is more about correctness than performance. See https://bugs.python.org/issue36710 for the rationale.

@markshannon
Member

I'm not really keen on this change, especially the change to PyObject_Vectorcall.
It now fetches the thread state from the runtime, then passes it as an argument to code that would have fetched it from the runtime anyway, or that just ignores it.
I don't see how this could ever be faster, nor do I see how it is more correct.

What would be helpful from a performance point of view is to fix up _PyThreadState_GET() so it doesn't do so much pointer chasing, and maybe to eliminate the memory fence, by storing the thread state in a thread-local variable (GCC's __thread, or C11 _Thread_local).
The current means of accessing the thread state does seem rather convoluted, whereas accessing it from a thread-local variable is quite efficient (at least with GCC): https://godbolt.org/z/z-vNPN

@vstinner
Member Author

@markshannon: "I'm not really keen on this change, (...)"

I started a thread on python-dev: https://mail.python.org/archives/list/python-dev@python.org/thread/PQBGECVGVYFTVDLBYURLCXA3T7IPEHHO/. I invite you to participate in the discussion there ;-)

jacobneiltaylor pushed a commit to jacobneiltaylor/cpython that referenced this pull request Dec 5, 2019
shihai1991 pushed a commit to shihai1991/cpython that referenced this pull request Jan 31, 2020