Description
We recently saw a big performance regression on the telco benchmark when the decimal module was moved to multi-phase init.
Accessing state is now much slower than before.
Anecdotally, accessing a global now takes 7 dependent loads instead of 1. (@mdboom do you have a link for this?)
If we observe that what we need in order to replace (C) global variables is per-interpreter variables, not per-module ones, we can design an API that needs far fewer indirections.
This API is largely stolen from HPy with a few tweaks for better performance. https://docs.hpyproject.org/en/stable/api-reference/hpy-global.html
typedef struct { uintptr_t index; } PyGlobal;
/* Declare a global */
#define PyGLOBAL_DECLARE(NAME) PyGlobal NAME = PY_GLOBAL_INIT;
/* Initialize a global. This must be called at least once per process.
 * This function is idempotent, so it can be called whenever a module is loaded. */
void PyGlobal_Init(PyGlobal *name);
PyObject *PyGlobal_Load(PyGlobal name);
void PyGlobal_Store(PyGlobal name, PyObject *value);
Implementation
Each interpreter state has a reference to an array of PyObject * pointers.
PyGlobal_Init() initializes the global to some non-zero index and makes sure that each interpreter has a table large enough to store that index.
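To make the index allocation concrete, here is a minimal standalone sketch of what an init function along these lines might look like. Everything in it is hypothetical: DemoGlobal, DemoInterp, demo_ensure_capacity and demo_global_init are stand-ins invented for illustration, not CPython names. It only grows the table of the one interpreter passed in, and it ignores locking, both of which a real implementation would have to handle.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins; none of these names exist in CPython. */
typedef struct { uintptr_t index; } DemoGlobal;
#define DEMO_GLOBAL_INIT { 0 }      /* index 0 means "not yet initialized" */

typedef struct {
    void **globals_table;           /* one slot per initialized global */
    size_t table_size;              /* number of slots allocated */
} DemoInterp;

static uintptr_t next_index = 1;    /* process-wide counter; 0 is reserved */

/* Grow the interpreter's table so that `index` is a valid slot.
 * Returns 0 on success, -1 on allocation failure. */
static int
demo_ensure_capacity(DemoInterp *interp, uintptr_t index)
{
    if (index < interp->table_size) {
        return 0;
    }
    size_t new_size = interp->table_size ? interp->table_size : 8;
    while (new_size <= index) {
        new_size *= 2;
    }
    void **table = realloc(interp->globals_table, new_size * sizeof(void *));
    if (table == NULL) {
        return -1;
    }
    /* Newly added slots start out empty. */
    for (size_t i = interp->table_size; i < new_size; i++) {
        table[i] = NULL;
    }
    interp->globals_table = table;
    interp->table_size = new_size;
    return 0;
}

/* Idempotent: the index is assigned only on the first call. */
static int
demo_global_init(DemoGlobal *global, DemoInterp *interp)
{
    if (global->index == 0) {
        global->index = next_index++;
    }
    return demo_ensure_capacity(interp, global->index);
}
```

Reserving index 0 as the "uninitialized" marker is what lets a repeated call detect that the global already has a slot.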
Then load and store can be implemented as follows:
PyObject *
PyGlobal_Load(PyGlobal name)
{
    return Py_NewRef(_PyThreadState_GET()->globals_table[name.index]);
}

void
PyGlobal_Store(PyGlobal name, PyObject *value)
{
    PyObject **table = _PyThreadState_GET()->globals_table;
    PyObject *tmp = table[name.index];
    table[name.index] = Py_NewRef(value);
    Py_XDECREF(tmp);
}
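As a rough illustration of the per-interpreter behavior, the following standalone mock mirrors the load and store logic above without depending on Python.h. All names are hypothetical stand-ins: DemoObject plays the role of PyObject, DemoInterp the interpreter state, and current_interp whatever _PyThreadState_GET() would reach. The fixed-size table and plain refcount field are simplifications.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for PyObject, PyGlobal and the interpreter state. */
typedef struct { long refcnt; int value; } DemoObject;
typedef struct { uintptr_t index; } DemoGlobal;
typedef struct { DemoObject *globals_table[4]; } DemoInterp;

/* Stand-in for the state reached through _PyThreadState_GET(). */
static DemoInterp *current_interp;

static DemoObject *
demo_new_ref(DemoObject *obj)
{
    obj->refcnt++;
    return obj;
}

static void
demo_xdecref(DemoObject *obj)
{
    if (obj != NULL) {
        obj->refcnt--;
    }
}

static DemoObject *
demo_global_load(DemoGlobal name)
{
    return demo_new_ref(current_interp->globals_table[name.index]);
}

static void
demo_global_store(DemoGlobal name, DemoObject *value)
{
    DemoObject **table = current_interp->globals_table;
    DemoObject *tmp = table[name.index];
    /* Install the new value before releasing the old one. */
    table[name.index] = demo_new_ref(value);
    demo_xdecref(tmp);
}

/* Store different objects under the same global in two interpreters,
 * then read them back; each interpreter sees only its own value. */
static int
demo_isolation(void)
{
    DemoInterp a = {{NULL}}, b = {{NULL}};
    DemoObject x = {1, 10}, y = {1, 20};
    DemoGlobal g = {1};

    current_interp = &a;
    demo_global_store(g, &x);
    current_interp = &b;
    demo_global_store(g, &y);

    current_interp = &a;
    int from_a = demo_global_load(g)->value;    /* 10 */
    current_interp = &b;
    int from_b = demo_global_load(g)->value;    /* 20 */
    return from_a * 100 + from_b;
}
```

The same DemoGlobal index is used in both interpreters; only the table it indexes into changes, which is why a load costs a single array access once the current state is in hand.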