Allows to introspect Python modules from cdylib: first step #3977

Tpt · 2024-03-22T09:21:51Z

This is a first step toward introspecting Python modules built by PyO3.

A missing piece in the story listed in #2454 is how tools like Maturin move the introspection information generated by PyO3 into the type stub files included in the built wheels.

I see three approaches for it:

1. Make pyo3-macros generate a file with the stubs after having processed all macros of the crate. This has the advantage of being self-contained in the crate but falls short in cases like Python classes declared in one crate but exposed in another crate: there is no guarantee that the proc macros of a crate and its dependencies are compiled in the same process, or that proc macros will still be able to write files in the future (for example with the proposal to run them in a WASM sandbox).
2. Make pyo3-macros export a C function that returns the stubs. The built libraries can then be loaded by Maturin, which would call the function and write the stubs to a file. I wrote a quick experiment with a patch to PyO3 that generates a public C ABI __pyo3_stubs_MODULE_NAME function. However, for the build system to execute it, a compatible Python interpreter must be present to link with and a compatible CPU or VM to run it, making generation when doing cross-compilation very hard. I guess it's what Python Interface (.pyi) generation and runtime inspection #2454 was heading toward.
3. Add the introspection data in a custom section of the built cdylib. It requires the introspection data to be completely static. However, we can easily build a library that extracts and parses these sections and generates the stub file. Build tools like Maturin would just need to call this library to get the stubs. Cross-compilation support is easy as soon as a parser for the built cdylib format exists. However, this might make the generated binaries bigger because of the extra data. This is what the MR experiments with.
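
As a rough illustration of the second approach, here is a minimal sketch of a library exporting its stubs through a C ABI function. The function name follows the `__pyo3_stubs_MODULE_NAME` pattern mentioned above; the module name `my_module`, the stub text and the `read_stubs` helper are assumptions made for this example, not what the experimental patch actually generates.

```rust
use std::ffi::CStr;
use std::os::raw::c_char;

// Hypothetical stub text for a module named `my_module`. A real macro would
// assemble it from the items it processed. The trailing NUL byte lets a C
// caller treat the data as a C string.
static MY_MODULE_STUBS: &[u8] = b"def answer() -> int: ...\n\0";

/// Exported with the C ABI so that a build tool can load the library and call it.
/// On toolchains older than Rust 1.82, write `#[no_mangle]` instead.
#[unsafe(no_mangle)]
pub extern "C" fn __pyo3_stubs_my_module() -> *const c_char {
    MY_MODULE_STUBS.as_ptr() as *const c_char
}

/// What the build tool would do once it has resolved and called the symbol
/// (hypothetical helper).
pub fn read_stubs(ptr: *const c_char) -> String {
    // SAFETY: the pointer comes from the function above, which returns a
    // NUL-terminated static buffer.
    unsafe { CStr::from_ptr(ptr) }.to_string_lossy().into_owned()
}
```

The sketch also shows why cross-compilation is awkward here: the build tool has to load and execute the library, so it needs a host that can run the target's code.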

Architecture:

Each PyO3 element (pymodule, pyclass, pyfunction...) generates a segment in the pyo3_data0 section of the built binary that contains a JSON "chunk" with the introspection data. The code is in pyo3-macros-backend/src/introspection.rs. I had to use a bad hack to generate the segments properly via Rust static elements.
JSON chunks can refer to other chunks via global ids. These ids are stored in PYO3_INTROSPECTION_ID constants, allowing the code building the JSON chunk to get the global id of e.g. a class C via C::PYO3_INTROSPECTION_ID. This allows chunks to refer to other chunks easily (e.g. to list module elements). A bad hack is used to generate the ids (TypeId::of would have been a nicer approach but is not const on stable Rust yet).
The 0 at the end of pyo3_data0 is a version number, allowing breaking changes in the introspection format.
The pyo3-introspection crate parses the binary using goblin (a library also used by Maturin), fetches all the pyo3_data0 segments (only implemented for macOS Mach-O in this experiment), and builds nice data structures.
Not done yet: the pyo3-introspection crate would implement a to_stubs function converting the data structures to type stubs.
The pyo3-introspection crate has an integration test doing introspection of the pyo3-pytests library.
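
To make the architecture above more concrete, here is a minimal sketch of how a JSON chunk could be assembled at compile time from literal pieces plus another element's id constant, and then placed in a custom section. The `Counter` type, its id value, the JSON shape and the helper functions are assumptions for illustration only; the real macro output differs, and this sketch only sets the section on Linux (ELF) and macOS (Mach-O).

```rust
/// Number of bytes needed to hold all parts.
const fn total_len(parts: &[&[u8]]) -> usize {
    let mut len = 0;
    let mut i = 0;
    while i < parts.len() {
        len += parts[i].len();
        i += 1;
    }
    len
}

/// Concatenates byte slices at compile time into a fixed-size array.
const fn concat<const N: usize>(parts: &[&[u8]]) -> [u8; N] {
    let mut out = [0u8; N];
    let mut pos = 0;
    let mut i = 0;
    while i < parts.len() {
        let part = parts[i];
        let mut j = 0;
        while j < part.len() {
            out[pos] = part[j];
            pos += 1;
            j += 1;
        }
        i += 1;
    }
    out
}

// Hypothetical class: the macro would give each element a global id constant.
pub struct Counter;
impl Counter {
    pub const PYO3_INTROSPECTION_ID: &'static str = "my_module.Counter";
}

// Module chunk that lists its member by id (illustrative JSON shape).
const PARTS: &[&[u8]] = &[
    b"{\"type\":\"module\",\"name\":\"my_module\",\"members\":[\"",
    Counter::PYO3_INTROSPECTION_ID.as_bytes(),
    b"\"]}",
];
const LEN: usize = total_len(PARTS);

// `#[used]` asks the compiler to keep the static even if nothing reads it.
// Mach-O needs a "segment,section" pair. Before Rust 1.82, drop `unsafe(...)`.
#[used]
#[cfg_attr(target_os = "macos", unsafe(link_section = "__TEXT,pyo3_data0"))]
#[cfg_attr(target_os = "linux", unsafe(link_section = "pyo3_data0"))]
pub static MODULE_CHUNK: [u8; LEN] = concat::<LEN>(PARTS);
```

Because everything is evaluated in const context, the chunk is plain static data in the binary and a separate tool can read it without executing any code from the library.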

Current limitations:

WASM is not supported yet, only PE (Windows), Mach-O (macOS) and ELF (Linux and some other *nix).
The only introspection data created is the list of functions, classes and submodules inside a module created using the new declarative module syntax.
This approach requires converting FromPyObject::type_input into an associated constant or a const function, and similarly for IntoPy::type_output. This is mandatory in order to use them in the static values added to the generated binary.
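
The last limitation can be illustrated with a small sketch. The `InputTypeHint` trait below is a hypothetical stand-in for the type-hint part of `FromPyObject`, not an existing PyO3 trait; it only shows that an associated constant can be read while building a `static`, whereas an ordinary trait method cannot be called there on stable Rust.

```rust
/// Hypothetical stand-in for the type-hint part of `FromPyObject`.
pub trait InputTypeHint {
    /// Python annotation for values accepted as input.
    const INPUT_TYPE: &'static str;
}

impl InputTypeHint for i64 {
    const INPUT_TYPE: &'static str = "int";
}

impl InputTypeHint for String {
    const INPUT_TYPE: &'static str = "str";
}

// Values a macro could embed for `fn f(x: i64, name: String)`.
// Both initializers are evaluated at compile time.
pub static ARG_X_TYPE: &str = <i64 as InputTypeHint>::INPUT_TYPE;
pub static ARG_NAME_TYPE: &str = <String as InputTypeHint>::INPUT_TYPE;
```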

davidhewitt · 2024-03-23T22:30:56Z

Thanks for moving this forward! The idea of using custom data sections is new to me. I see the upsides of it, though I am slightly worried by the extra complexity of needing to worry about linker details in yet another way.

1. Make pyo3-macros generate a file with the stubs after having processed all macros of the crate.

I agree that having the macros generate file(s) is unlikely to be the right solution 👍

2. However, for the build system to execute it, a compatible Python interpreter must be present to link with and a compatible CPU or VM to run it, making generation when doing cross-compilation very hard. I guess it's what Python Interface (.pyi) generation and runtime inspection #2454 was heading toward.

For the library which converts the metadata to .pyi files, I think rather than expecting the build system like maturin to execute it, it would work better if we asked users to keep their .pyi file committed in the repository and they used the library to generate / update the file as part of their test suite. I think in that case then local development should be enough and cross-compiling can be ignored from the picture.

In fact, I wonder if, for option (3) explored here, using a test to generate and update the .pyi file is still better than running it during maturin build. Here's why:

I think it's easier for users to understand what their .pyi file contains if it's committed in the repository rather than generated by maturin as part of the packaging process.
With a test, users are hopefully reminded in CI if they forgot to regenerate their .pyi files.
No need for us to include this code / information from the binary if it's only needed during development. (Probably we use a cargo feature which users only enable in their dev-dependencies?)
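
A minimal sketch of what such a test helper could look like, assuming the generated stub text is already available as a string. The `sync_stubs` function and the `StubStatus` enum are hypothetical names for this example; nothing like them exists in PyO3 or Maturin as far as this thread shows.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Outcome of comparing freshly generated stubs with the committed file.
#[derive(Debug, PartialEq)]
pub enum StubStatus {
    UpToDate,
    Updated,
    Outdated,
}

/// `generated` would come from the (hypothetical) stub generator.
/// With `update = true` the file is rewritten; otherwise a mismatch is
/// only reported so that a test or CI job can fail.
pub fn sync_stubs(path: &Path, generated: &str, update: bool) -> io::Result<StubStatus> {
    // A missing file is treated like an outdated one.
    let committed = match fs::read_to_string(path) {
        Ok(text) => Some(text),
        Err(e) if e.kind() == io::ErrorKind::NotFound => None,
        Err(e) => return Err(e),
    };
    if committed.as_deref() == Some(generated) {
        return Ok(StubStatus::UpToDate);
    }
    if update {
        fs::write(path, generated)?;
        Ok(StubStatus::Updated)
    } else {
        Ok(StubStatus::Outdated)
    }
}
```

In a test suite, one would call it with `update = false` in CI and assert that the result is `StubStatus::UpToDate`, and with `update = true` locally to refresh the committed file.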

If we agree that using a test to update stubs is a good solution, then I think the choice between runtime code like (2) and data segments like (3) is probably just influenced by whatever is easier for us to implement. We might even be able to swap back and forth between these two options as an implementation detail as we learn.

Tpt · 2024-03-25T09:47:41Z

If we agree that using a test to update stubs is a good solution, then I think the choice between runtime code like (2) and data segments like (3) is probably just influenced by whatever is easier for us to implement. We might even be able to swap back and forth between these two options as an implementation detail as we learn.

Yes! Having played a bit with both, runtime code like (2) is way easier (the difference between this MR and #2454 is quite significant).

If I try to summarize the pros of each approach:

Approach 3: add introspection to the cdylib and let maturin write the stubs on build:

no extra user code/configuration: they upgrade maturin and pyo3 and it just works, instead of having to set up a test, make extension-module not a default...
support of Rust features: if a feature is enabled, extra stubs might get generated and if it is disabled stubs won't get generated. This seems impossible to get with approach 4
easier Python version and platform customization: stubs are generated per Python version and platform

Approach 4: use a test to write/update stubs:

hopefully simpler code in PyO3 (no data segment generation and retrieval from the binary...)
easier debugging and change inspection: stubs are committed as part of the repository content
Python version and platform specific elements are explicit in the stubs. But it might be painful to implement because it might be a bit of a cat-and-mouse game with Rust #[cfg] annotations.
smaller generated cdylib (no introspection data in them)

davidhewitt · 2024-03-26T23:46:53Z

You make a very good point that automated tests to update a .pyi file struggle with cfg declarations. I suppose one way around that would be to make it possible to have multiple .pyi files committed for different Python / OS combinations and select the correct one in packaging.

One thing that I expect is that .pyi files might want to include extra user customisations beyond what the autogenerated stubs contain. For example, I expect generic code may be tricky to define on the Rust side and may benefit from some handwritten customisation of stubs. (e.g. making a user-defined dict[K, V] class.) I don't have a good answer to how we can solve that.

Overall I don't have a good sense of whether option 3 or 4 is better. In an ideal world we might offer both options. Which one do you think would meet your needs better at present? Maybe we start by implementing that and we learn a lot by doing so!

Tpt · 2024-03-27T09:50:02Z

One thing that I expect is that .pyi files might want to include extra user customisations beyond what the autogenerated stubs contain. For example, I expect generic code may be tricky to define on the Rust side and may benefit from some handwritten customisation of stubs. (e.g. making a user-defined dict[K, V] class.) I don't have a good answer to how we can solve that.

I would love to avoid having people handwrite customizations in stubs because it makes automatically updating the stubs when the Rust code changes very hard. Imho automatically updating stubs to reflect changes in the Rust code is the main value proposition of auto-generating stubs in the first place.

An idea: add entry points in PyO3 macros to extend the stubs. For example (rough idea, not sure about the actual details):

#[pymodule]
#[pyo3(extra_stub_content = "
K = TypeVar('K')
V = TypeVar('V')
")]
mod module {
    #[pyclass(stub_parent = [collections::abc::Mapping::<K,V>])]
    struct CustomDict;

    #[pymethods]
    impl CustomDict {
        #[pyo3(signature = (key: K), return_type = V | None)]
        fn get(&self, key: &Bound<'_, PyAny>) -> Option<Bound<'_, PyAny>> {
            // Body omitted in this sketch.
            None
        }
    }
}

would generate

K = TypeVar('K')
V = TypeVar('V')

class CustomDict(collections.abc.Mapping[K,V]):
    def get(self, key: K) -> V | None: ...

This way stubs would stay auto generated but can be improved by the author.
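
To give an idea of how extra content supplied by the author and auto-generated declarations could be combined, here is a minimal sketch of a stub renderer. The `Class` and `Method` structures and both functions are hypothetical and far simpler than whatever pyo3-introspection would need; they only show the merging step.

```rust
/// Hypothetical description of a method; `self` is added by the renderer.
pub struct Method {
    pub name: String,
    pub args: Vec<(String, String)>, // (argument name, annotation)
    pub returns: String,
}

/// Hypothetical description of a class collected by the macros.
pub struct Class {
    pub name: String,
    pub parents: Vec<String>,
    pub methods: Vec<Method>,
}

/// Renders one class declaration in .pyi syntax.
pub fn class_to_stub(class: &Class) -> String {
    let mut out = format!("class {}", class.name);
    if !class.parents.is_empty() {
        out.push_str(&format!("({})", class.parents.join(", ")));
    }
    out.push_str(":\n");
    if class.methods.is_empty() {
        out.push_str("    ...\n");
    }
    for m in &class.methods {
        let mut params = vec!["self".to_string()];
        params.extend(m.args.iter().map(|(n, t)| format!("{}: {}", n, t)));
        out.push_str(&format!(
            "    def {}({}) -> {}: ...\n",
            m.name,
            params.join(", "),
            m.returns
        ));
    }
    out
}

/// Puts the author-provided extra content first, then the generated classes.
pub fn module_to_stub(extra_stub_content: &str, classes: &[Class]) -> String {
    let mut out = String::new();
    let extra = extra_stub_content.trim();
    if !extra.is_empty() {
        out.push_str(extra);
        out.push_str("\n\n");
    }
    let rendered: Vec<String> = classes.iter().map(class_to_stub).collect();
    out.push_str(&rendered.join("\n"));
    out
}
```

The author-provided text is copied verbatim, so the generator does not need to understand type variables; it only has to accept annotations as opaque strings.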

A possible way to mix options 3 and 4:

When building a wheel, Maturin uses the existing stubs file by default or, if there is none, falls back to automated introspection (option 3).
Maturin provides a generate-stubs command that takes input arguments like --target and --features and generates a stubs file, allowing users to pick option 4 by committing the command result to git. Users are also able to use this command to check that the stubs committed to git are up to date (run the command again and check if anything changed) or to run some validation on the stubs in CI using e.g. mypy.stubtest even if the stubs are not committed to git.

davidhewitt · 2024-03-27T11:04:36Z

Agreed that having the proc macros be able to collect all the necessary information would be nice. I think only time will tell whether they can meet all user needs!

I'm slightly wary of coupling to maturin for all stub generation, because some projects use setuptools-rust or their own build options for good reasons. I think that offering maturin generate-stubs as an alternative development command would be a good way to avoid the problem for projects that don't want to do their packaging with maturin.

cc @messense do you see any concerns with adding this to maturin?

So what are the next steps here? Do you want me to start reviewing this code, or will you push more first?

Regarding the data sections, I happened to hear yesterday that UniFFI's proc macros can do something similar about shipping definitions in the shared library, so it might be interesting to look at / ask them how that was implemented.

messense · 2024-03-27T11:19:17Z

do you see any concerns with adding this to maturin?

No concern, I think a generate-stubs command will be very useful for users wanting to commit pyi files in git. We can also add a --check option to fail the command when the existing pyi file is outdated.

Tpt · 2024-03-27T17:05:30Z

Thank you!

Agreed that having the proc macros be able to collect all the necessary information would be nice. I think only time will tell whether they can meet all user needs!

Yes! My hope is to cover as many as possible.

I'm slightly wary of coupling to maturin for all stub generation, because some projects use setuptools-rust or their own build options for good reasons. I think that offering maturin generate-stubs as an alternative development command would be a good way to avoid the problem for projects that don't want to do their packaging with maturin.

In addition to maturin generate-stubs, I would go one step further and suggest we also follow the approach started in this MR, i.e. build a stand-alone pyo3-introspection crate that contains all the code to generate the actual stubs from the binary annotations. This way other build systems would be able to ship integrated automated stub generation if they want. This would also enable other tools, like breaking change checking based on the introspection data.

Regarding the data sections, I happened to hear yesterday that UniFFI's proc macros can do something similar about shipping definitions in the shared library, so it might be interesting to look at / ask them how that was implemented.

Thank you! I'm going to have a look at it.

So what are the next steps here? Do you want me to start reviewing this code, or will you push more first?

I think the current draft already shows the relevant direction, so a very high-level code review to check whether it's going in the right direction would be welcome. Maybe wait for me to have a look at UniFFI, I might change this MR a bit if I find interesting things there. Thank you!

Rigidity · 2024-03-27T18:43:14Z

This is very exciting, looking forward to being able to generate type stubs! Currently we have this lengthy and hard to maintain Python script for doing so, which we have to update by hand: https://github.com/Chia-Network/chia_rs/blob/main/wheel/generate_type_stubs.py

This would be a major improvement. Happy to help out however I can (testing, implementation, whatever) as time allows, to hopefully get this out the door 😄

Tpt · 2024-03-27T20:06:27Z

@Rigidity Thank you! I plan to work on this MR to get the basics done. Then there will be a lot of features to incrementally add on top of it (support for all PyO3 features...), so help with coding and testing will be very welcome!

Tpt · 2024-05-10T15:13:39Z

Sorry for the very long reaction delay (a lot of priorities + vacations).

Regarding the data sections, I happened to hear yesterday that UniFFI's proc macros can do something similar about shipping definitions in the shared library, so it might be interesting to look at / ask them how that was implemented.

I had a look at UniFFI. They basically use the same approach as us: embedding the metadata in the binary and then parsing the binary using the same goblin library. The major difference I see is that my prototype puts the metadata into a custom named section set using link_section, whereas UniFFI uses regular const without a custom link_section. I would tend to think custom link_section is a bit clearer when analyzing the binary but I am not sure how much it changes things.

@davidhewitt If you have time, could you have a quick look at the MR to see if the global design is going in a good direction? If yes, I will fix the many shortcuts I took and get the MR ready for review.

davidhewitt

Thanks, yes I'm happy with us proceeding with this approach!

I'm definitely convinced by the technical direction here for the generation process; the main risk I still see is how to give users the full power to customise the stubs with generic args etc.

I think we can learn that piece by piece as we proceed.

davidhewitt · 2024-06-08T07:52:48Z

I would tend to think custom link_section is a bit clearer when analyzing the binary but I am not sure how much it changes things.

I think so too; I also imagine we might want to strip the custom link section after the stubs have been extracted, it feels like it's probably easier to do that by having a distinct section.

Tpt · 2024-06-10T06:58:50Z

I'm definitely convinced by the technical direction here for the generation process; the main risk I still see is how to give users the full power to customise the stubs with generic args etc.

I think we can learn that piece by piece as we proceed.

Thank you! I agree on the risk. My guess is that we will introduce a set of macro arguments for that, but getting them right won't be easy.

Tpt · 2025-02-20T17:02:36Z

Note: I am more than happy to rebase this MR if someone is willing to review it

abrisco · 2025-02-20T17:33:24Z

@davidhewitt do you know if you or any other maintainers will have time in the foreseeable future?

bschoenmaeckers · 2025-02-28T13:53:48Z

Do you have any estimates on how much this will increase the size of the binary, especially for larger projects? Did you consider using something more concise like CBOR since this data doesn't need to be human-readable once it's embedded?

Tpt · 2025-02-28T14:00:48Z

@bschoenmaeckers This is a good point. My assumption is that most people who care about binary size will strip the binary before publishing it, but that might not always be true. Moving to CBOR is a great idea but might make the data building code a bit more complex (it already quite abuses the const expression evaluation mechanism).

bschoenmaeckers · 2025-03-03T09:05:29Z

@bschoenmaeckers This is a good point. My assumption is that most people who care about binary size will strip the binary before publishing it, but that might not always be true. Moving to CBOR is a great idea but might make the data building code a bit more complex (it already quite abuses the const expression evaluation mechanism).

Let's get this merged first and look into optimizing the format if necessary in the next iteration.

davidhewitt · 2025-03-03T09:31:11Z

Yes. I finally have some promising Fridays ahead so I will do my utmost to review this in the next week or two. It is long overdue and deserves to move forwards.

abrisco · 2025-03-03T22:03:03Z

@Tpt do you wanna rebase in response to #3977 (comment) ?

Tpt · 2025-03-04T08:21:18Z

@davidhewitt Amazing! I am quite packed this week but I will do the rebase early next week.

@abrisco Thanks for the ping

davidhewitt · 2025-03-14T21:33:20Z

With sincerest apologies my childcare fell through today so I did not get the productive day I had hoped for. Next week is disrupted for other reasons. This PR is still my top priority to review however I think the realistic timeframe is now Friday in two weeks' time.

davidhewitt

Thanks, I have finally reviewed. This looks good to me to move forward with, and I am very sorry I blocked this for so long.

Just a few small tidy ups, and then let's merge and proceed with building this out 👍

davidhewitt · 2025-03-28T11:41:28Z

pyo3-macros-backend/src/introspection.rs

+
+#[derive(Default)]
+struct ConcatenationBuilder {
+    elements: Vec<TokenStream>,


It looks like these are either paths (referring to string constants), or just chunks of strings. Would it make sense to build an enum over these instead of using TokenStream?

Indeed, thank you for spotting this. Done.

I have reverted my code to an enum over String and TokenStream because in the follow-up MR on type signatures I will need to inject more complex expressions than just a path. Is that fine with you?

Sure thing, we'll stick with that and can always reconsider later 👍

davidhewitt · 2025-03-28T12:00:00Z

pyo3-macros-backend/src/introspection.rs

+        let mut content = ConcatenationBuilder::default();
+        self.add_to_serialization(&mut content);
+        let content = content.into_token_stream(pyo3_crate_path);


It seems to me that the content is defined to be JSON payloads matching the Chunk enum in pyo3-introspection? Maybe we should document / link to that, it took me a while to find...

Indeed. I have added a line to the module doc comments. Is it good enough?

Tpt · 2025-03-28T13:41:51Z

@davidhewitt Thank you so much for the review. I have applied your suggested changes and rebased the MR.

Unless you prefer that I do something else first, my next step is to rebase and polish the type signature extraction. I have a very rough draft on my hard drive.

davidhewitt

Sounds good to me! Let's do that in incremental follow-ups; this PR is long overdue for merge so let's get some progress now 👍

davidhewitt · 2025-03-28T21:13:39Z

Thanks again for all your patience while I went through a very busy time 😮‍💨

purepani · 2025-03-28T22:37:05Z

I have nothing to say besides
YAY

abrisco · 2025-03-28T22:47:09Z

Thanks to everyone that was involved!!!

Tpt mentioned this pull request Mar 27, 2024

Upgrades to PyO3 0.21 oxigraph/oxigraph#837

Merged

Tpt force-pushed the stub-generation-static branch 4 times, most recently from e82fc35 to cb4cfa7 Compare May 10, 2024 15:13

Tpt force-pushed the stub-generation-static branch 2 times, most recently from ed37f96 to 2f0d8dc Compare June 7, 2024 08:25

davidhewitt reviewed Jun 8, 2024

Tpt force-pushed the stub-generation-static branch 10 times, most recently from 6ee4940 to 82e672f Compare June 10, 2024 13:37

Tpt self-assigned this Mar 4, 2025

Merge branch 'main' into stub-generation-static

4b2b8cc

Tpt force-pushed the stub-generation-static branch 2 times, most recently from d35274f to 683e3be Compare March 4, 2025 20:04

Support cfg with module members

a5e309a

Tpt force-pushed the stub-generation-static branch from 683e3be to a5e309a Compare March 4, 2025 20:12

davidhewitt reviewed Mar 28, 2025

Tpt added 2 commits March 28, 2025 13:39

Merge remote-tracking branch 'upstream/main' into stub-generation-static

9e4c68a

Code review feedback

cb3b0c1

Revert back to TokenStream inside of ConcatenationBuilder

f783b6e

davidhewitt approved these changes Mar 28, 2025

davidhewitt added this pull request to the merge queue Mar 28, 2025

Merged via the queue into PyO3:main with commit 27178e8 Mar 28, 2025
52 checks passed

Tpt deleted the stub-generation-static branch March 29, 2025 07:23

This was referenced Apr 2, 2025

range end index out of range error in pyo3-introspection #5023

Closed

Functions not being included in generated type stubs #5033

Closed
