@OriolAbril
Contributor

First attempt at grammar extension to support
the examples in
https://pandas.pydata.org/docs/development/contributing_docstring.html#section-3-parameters.

The only one that doesn't parse is "float, decimal.Decimal or None". I'm not sure if it
is possible to "look ahead" for an "or", or to start from the rightmost comma and try to parse
what precedes it as a type: if that works, go ahead; otherwise move one comma to the left and try again.

That being said, dict of {str : int} parses fully, but it doesn't take into account
that the types left of the colon are key types and those right of the colon are value types. I have no idea if this
should happen at the grammar level, in Python processing, or both.

Lastly, I did some changes to literals to make sure there can be no confusion between
dict subtypes or literals (colons being inside the curly brackets being the only indicator
seemed like a bad idea). I think this is also a closer match to numpydoc, as from how I understand
the description, {} for literals should only be used when only a handful of options are allowed
and therefore is incompatible with type information of any kind.

@OriolAbril OriolAbril mentioned this pull request Jun 22, 2024
Member

@lagru lagru left a comment

Thanks @OriolAbril!


The only one that doesn't parse is "float, decimal.Decimal or None". I'm not sure if it is possible to "look ahead" for an "or", or to start from the rightmost comma and try to parse what precedes it as a type: if that works, go ahead; otherwise move one comma to the left and try again.

I don't think I want that to work. Especially since

float or decimal.Decimal or None, optional, extra info

is perfectly "human readable" and something like

float, decimal.Decimal or None, optional, extra info

not so much. 🤔
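
A minimal sketch of why the comma-separated form is harder to handle, assuming fields are separated by plain commas and contain no brackets. `split_doctype` is a hypothetical helper for illustration only, not docstub's implementation:

```python
def split_doctype(doctype: str) -> dict:
    """Split a docstring type description at commas.

    Hypothetical helper: the first field is the type expression, an
    "optional" field may follow, and whatever remains is extra info.
    Assumes no commas inside brackets (e.g. no `dict[str, int]`).
    """
    fields = [field.strip() for field in doctype.split(",")]
    type_expr, rest = fields[0], fields[1:]
    optional = bool(rest) and rest[0] == "optional"
    if optional:
        rest = rest[1:]
    return {
        # Alternatives inside the type expression are joined by "or".
        "types": [t.strip() for t in type_expr.split(" or ")],
        "optional": optional,
        "extra_info": ", ".join(rest),
    }


print(split_doctype("float or decimal.Decimal or None, optional, extra info"))
print(split_doctype("float, decimal.Decimal or None, optional, extra info"))
```

With the "or"-only spelling, all three types end up in `types`. With the comma spelling, only `float` is recognized as a type, and `decimal.Decimal or None, optional, extra info` is treated as extra info, unless the parser backtracks over commas as described above.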

Parameters
----------
a1 : {"A", "B", "C"}
a2 : {0 or "index", 1 or "columns", None}, default None
Member

This pandas type syntax seems a bit dubious. I guess this is equivalent to

Suggested change
-a2 : {0 or "index", 1 or "columns", None}, default None
+a2 : {0, "index", 1, "columns", None}, default None

and the alternating "or" is for grouping equivalent values?

This might be a case I'd leave a third party to configure itself and not support it directly in docstub.

Contributor Author

I think they do use "or" to indicate literals with an equivalent meaning, and the comma to separate literals with different meanings. I have never used "or" in literals, though.
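
A minimal sketch of that reading, where "or" and "," both just separate allowed values once grouping is ignored. `literal_from_braces` is a hypothetical helper, and it assumes that string values contain neither commas nor the text " or ":

```python
import ast
from typing import Literal, get_args


def literal_from_braces(text: str):
    """Turn `{0 or "index", None}` into `Literal[0, "index", None]`.

    Hypothetical helper; the grouping expressed by "or" is discarded.
    """
    text = text.strip()
    if not (text.startswith("{") and text.endswith("}")):
        raise ValueError(f"expected a brace-enclosed set of literals: {text!r}")
    values = []
    for group in text[1:-1].split(","):
        for item in group.split(" or "):
            # literal_eval only accepts Python literals, never arbitrary code.
            values.append(ast.literal_eval(item.strip()))
    return Literal[tuple(values)]


flat = literal_from_braces('{0 or "index", 1 or "columns", None}')
print(get_args(flat))
```

Both the pandas spelling and the flattened spelling from the suggested change produce the same `Literal` under this reading.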

?start : doctype

-doctype : type_or ("," optional)? ("," extra_info)?
+doctype : (literals | type_or) ("," optional)? ("," extra_info)?
Member

Lastly, I did some changes to literals to make sure there can be no confusion between
dict subtypes or literals (colons being inside the curly brackets being the only indicator
seemed like a bad idea). I think this is also a closer match to numpydoc, as from how I understand
the description, {} for literals should only be used when only a handful of options are allowed
and therefore is incompatible with type information of any kind.

Restricting literals to the top level is probably sensible? Though, currently it's nice that something like

dict[{"a", "b"}, int] -> dict[Literal["a", "b"], int]

works. Do you find that readable?

Though,

dict of {{"a", "b"}: int} -> dict[Literal["a", "b"], int]

working is something. 😅

Member

See 5a28828.

Contributor Author

I had never seen nor considered that option, but now that I think about it, there are a couple of places where I could use it. If you use it or feel strongly about it, maybe we could use something similar to arrays for mappings in the sense a subset of names are allowed, and only after one of those names can curly brackets indicate two subtypes separated by a colon. My guess is that dict and mapping alone will cover 90% of the cases; maybe mutablemapping could also be there.

Plus a way to extend those names for both dict and array (to allow tensor, for example, in projects that use it).

Member

maybe we could use something similar to arrays for mappings in the sense a subset of names are allowed

I think it might be more confusing if we restricted who can use the mapping of {KT: VT} syntax? 🤔

Contributor Author

Happy to keep literals as top level option only

-container_of : NAME "of" type_or
+container_of : NAME "of" ( type_or | dict_subtypes )
+
+dict_subtypes : "{" type_or ":" type_or "}"
Member

Actually, you made me realize that we can streamline this and get rid of dict_subtypes and even the existing container_of!

contains: "[" type_or ("," type_or)* "]"
        | "[" type_or "," PY_ELLIPSES "]"
        | "of" type
        | "of" "(" type_or ("," type_or)* ")"
        | "of" "{" type_or ":" type_or "}"

With that setup, multiple types inside a container have to be enclosed in (...). That gets rid of the ambiguity with the top-level "or".
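
A minimal sketch of how the three "of" forms in that rule could map to subscripted annotations. It uses plain string handling instead of the Lark grammar, handles a single un-nested container only, and `translate_of` is a hypothetical name that is not part of docstub:

```python
def translate_of(doctype: str) -> str:
    """Rewrite `NAME of ...` as `NAME[...]` for a single, un-nested container.

    Hypothetical helper covering:
      NAME of T            -> NAME[T]
      NAME of (T1, T2)     -> NAME[T1, T2]
      NAME of {KT : VT}    -> NAME[KT, VT]
    """
    name, sep, rest = doctype.partition(" of ")
    if not sep:
        return doctype.strip()
    rest = rest.strip()
    if rest.startswith("(") and rest.endswith(")"):
        items = rest[1:-1].split(",")
    elif rest.startswith("{") and rest.endswith("}"):
        key, colon, value = rest[1:-1].partition(":")
        if not colon:
            raise ValueError(f"expected '{{KT : VT}}' in {doctype!r}")
        items = [key, value]
    elif " or " in rest:
        # The bare form takes exactly one type; alternatives need (...).
        raise ValueError(f"wrap alternatives in parentheses: {doctype!r}")
    else:
        items = [rest]
    # Inside brackets, "or" becomes a union.
    inner = ", ".join(item.strip().replace(" or ", " | ") for item in items)
    return f"{name.strip()}[{inner}]"


for example in ("list of int", "list of (int or float)", "dict of {str : int}"):
    print(example, "->", translate_of(example))
```

In the actual grammar, "list of int or float" is not an error: the "or" is simply handled one level up, outside the container. This helper rejects it only because it deals with one container expression at a time.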

(BTW amazing that GitHub highlights Lark syntax!)

Member

See 3908f3f.

Contributor Author

That is great. I'll also open an issue or PR to numpydoc itself with these at some point. I have never known how "list of int or float" is supposed to be interpreted: (list of int) or float, or list of (int or float).

Member

Intuitively I'd say (list of int) or float. I don't think numpydoc worries about those yet and maybe they don't need to.
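
A minimal sketch of that precedence, assuming "or" only separates alternatives when it appears outside any brackets. `split_top_level_or` is a hypothetical helper, not docstub code:

```python
def split_top_level_or(doctype: str) -> list[str]:
    """Split on " or " that is not nested inside (), [] or {}."""
    parts, current, depth = [], "", 0
    i = 0
    while i < len(doctype):
        ch = doctype[i]
        if ch in "([{":
            depth += 1
        elif ch in ")]}":
            depth -= 1
        if depth == 0 and doctype.startswith(" or ", i):
            # Top-level separator: close the current alternative.
            parts.append(current)
            current = ""
            i += len(" or ")
            continue
        current += ch
        i += 1
    parts.append(current)
    return [part.strip() for part in parts]


print(split_top_level_or("list of int or float"))
print(split_top_level_or("list of (int or float)"))
```

The first call yields two alternatives, "list of int" and "float", which matches the (list of int) or float reading. The parenthesized form stays a single alternative.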

Part of the aim behind docstub is also to create some kind of standard, with the understanding that "hey, if you want something more custom, you need to configure it yourself".

I don't remember who, but someone from numpydoc told me at some point that they'd be happy to go with whatever recommendation docstub settles on.

-container_of : NAME "of" type_or
+container_of : NAME "of" ( type_or | dict_subtypes )
+
+dict_subtypes : "{" type_or ":" type_or "}"
Member

That being said, dict of {str : int} parses fully, but it doesn't take into account that the types left of the colon are key types and those right of the colon are value types. I have no idea if this should happen at the grammar level, in Python processing, or both.

It doesn't have to, because Python's type annotation for dicts, dict[key_type, value_type], only distinguishes key types from value types by the order in which they appear. Since the order in {key_type : value_type} is the same, we don't have to do anything.
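
A small check of that point using only the standard library (Python 3.9+): the subscripted dict type stores its arguments positionally, so the key type is simply the first argument and the value type the second.

```python
from typing import get_args, get_origin

# Equivalent of the docstring form `dict of {str : int}`.
annotation = dict[str, int]

# Arguments come back in the order they were written: key first, value second.
key_type, value_type = get_args(annotation)

print(get_origin(annotation), key_type, value_type)
```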

@lagru
Member

lagru commented Jun 23, 2024

Note, I've opted to incorporate your suggestions and add them to the WIP #2. The classic PR-based contribution workflow may be a bit too clunky while I'm still very much refactoring and extending the prototype.

@OriolAbril
Contributor Author

OriolAbril commented Jun 23, 2024

Sounds great, just wanted to get the ball rolling.

I forgot to comment on the parsing of defaults. How would you feel about allowing a space in addition to the colon and equals sign, thus changing to "default" ("=" | ":")? literal? All three are allowed and equivalent according to numpydoc. I am not sure parsing of defaults plays any role, but I figured I'd mention it.

@lagru
Member

lagru commented Jun 23, 2024

Happy to use "default" ("=" | ":")? literal. 👍
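
A minimal sketch of the three equivalent spellings, using a regular expression instead of the Lark rule. `parse_default` is a hypothetical helper, and it assumes the default value is a plain Python literal:

```python
import ast
import re

# "default" followed by "=", ":" or at least one space, then the literal.
DEFAULT_RE = re.compile(r"default(?:\s*[=:]\s*|\s+)(?P<literal>.+)")


def parse_default(text: str):
    """Return the default value from e.g. `default None` or `default=None`."""
    match = DEFAULT_RE.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a default declaration: {text!r}")
    # literal_eval only accepts Python literals, never arbitrary code.
    return ast.literal_eval(match["literal"])


for spelling in ("default None", "default=None", "default: None"):
    print(repr(spelling), "->", parse_default(spelling))
```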

@OriolAbril
Contributor Author

I think this can be closed now. Let me know if at some point you want me to test the other PR

@OriolAbril OriolAbril closed this Jun 28, 2024
@lagru lagru added the enhancement New feature or functionality label Sep 19, 2024