Skip to content

Replace factory functions with classes. #81

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

udifuchs
Copy link
Contributor

Element() and ElementTree() were factory functions until now. They are now defined as classes.

With this change, we can annotate code with the public Element class, instead of using the private _Element class.
For example:

def print_tree(tree: e.ElementTree[e.Element]) -> None: ...

There is a related PR for lxml:
lxml/lxml#405

In the lxml PR, _Element is a virtual subclass of Element. This means that when we create an Element it actually creates an _Element class, but the created object is also an instance of Element:

el = Element()
reveal_type(el)  # _Element
isinstance(element, Element)  # True
isinstance(element, _Element)  # True

I don't know if it is possible to have the equivalent of a virtual subclass in a type stub. Therefore, I just made Element a type alias of _Element.

The lxms stubs should work with all existing versions of lxml. But I am not sure it make sense to merge this PR until the related PR for lxml is merged.

Element() and ElementTree() were factory functions until now.
They are now defined as classes.

With this change, we can annotate code with the public Element class,
instead of using the private _Element class.
For example:
```
def print_tree(tree: e.ElementTree[e.Element]) -> None: ...
```

There is a related PR for lxml:
lxml/lxml#405

In the lxml PR, _Element is a virtual subclass of Element.
This means that when we create an Element it actually creates an _Element class,
but the created object is also an instance of Element:
```
el = Element()
reveal_type(el)  # _Element
isinstance(element, Element)  # True
isinstance(element, _Element)  # True
```

I don't know if it is possible to have the equivalent of a virtual subclass
in a type stub. Therefore, I just made Element a type alias of _Element.

The lxms stubs should work with all existing versions of lxml.
But I am not sure it make sense to merge this PR until the related PR for lxml
is merged.
@abelcheung abelcheung added the enhancement New feature or request label Feb 22, 2025
@abelcheung
Copy link
Owner

abelcheung commented Feb 22, 2025

... not sure it make sense to merge this PR until the related PR for lxml is merged.

Yeah, it makes more sense to wait for lxml release which includes your change, then I make a new types-lxml release including this PR.

I have taken a glimpse of your lxml PR. If Element becomes a superclass, there is an alternative approach that I want to try out too. While making it a TypeAlias of _Element should most likely work fine, I'd like to try a more ambitious change -- allow Element to be the canonical type and phase out _Element wherever convenient. It has been a common complaint that a "private" type (with underscore) is used in mission critical way, where most code using lxml will probably fail to annotate if _Element and _ElementTree were not available. This is also the reason why PyCharm support is not very usable for now, as PyCharm has no method autocompletion for private classes.

Such change needs some time for testing; luckily there's spare time until lxml releases new version. In worst case, I can apply your PR first and refine later, or maybe have a review in the coming days.

@udifuchs
Copy link
Contributor Author

I was thinking of breaking this change over several PRs:

  1. Make Element a type alias of _Element instead of a factory function (this PR.)
  2. Replace most occurrences of _Element with Element in the stubs.
  3. Switch roles between Element and _Element, making Element a class and _Element a type alias of Element.
  4. Remove _Element completely from the type stubs.

I can easily create a PR for step 2 or add it to this PR. The only down side is that it changes almost 100 lines of code, so it makes it harder to see what are interesting changes in this PR.

Step 3 subtly breaks the runtime tests unless one uses a patched version of lxml. I think that the problem is in the tests, but this is not the important point. What is important is that step 3 has the potential of breaking existing code that does run-time type checking. It is probably very rare, but we shouldn't make a release of types-lxml with step 3 before lxml 6.0.0 (with the Element change) is released.

Step 4 would essentially break any existing code that uses types-lxml. It might be possible to implement it in a few years, but maybe an _Element type alias should be kept forever.

@abelcheung
Copy link
Owner

I'm aware that your PR in lxml just got merged. Will give it some thought after I obtain a new wheel build of 6.0 alpha.

@abelcheung
Copy link
Owner

Things don't seem right for me:

>>> from lxml import etree
>>> etree.LXML_VERSION
(6, 0, 0, -200)
>>> root = etree.Element('foo')
>>> type(root).__mro__
(<class 'lxml.etree._Element'>, <class 'object'>)

Are you expecting etree.Element to be a superclass of _Element?

@udifuchs
Copy link
Contributor Author

Are you expecting etree.Element to be a superclass of _Element?

Sort of. See lxml/lxml#405 for the full details.
etree.Element is an Abstract Base Class:

class Element(ABC):

_Element is not an actual subclass of Element, it is just registered as a virtual subclass of Element.

# Register _Element as a virtual subclass of Element
Element.register(_Element)

This way Element can be upgraded from a factory function to a class that behaves the way we expect a class to behave:

>>> element = Element("test")
>>> type(element)
<class 'lxml.etree._Element'>
>>> isinstance(element, Element)
True
>>> issubclass(_Element, Element)
True

And at the same time _Element stays a cython class, which is an internal implementation detail and does not depend directly on Element.

For the types stubs, it is a bit simpler. Element is just a type alias to _Element. Technically, it would be possible to get rid of _Element completely, but this would break backward compatibility.

@udifuchs
Copy link
Contributor Author

PR lxml/lxml#461 - Enable subscripting generic types in annotations - was just merged into lxml. This means that with lxml-6.0 one could write:

element_tree: etree.ElementTree[etree.Element]

and Python would accept it even if you do not use: from __future__ import annotations

It would be really nice if the lxml stubs would also allow this syntax for the lxml-6.0 release.

@abelcheung
Copy link
Owner

PR lxml/lxml#461 - Enable subscripting generic types in annotations - was just merged into lxml.

Sure, I noticed the merge yesterday. It was a surprise, given that similar effort has been rejected before.

@udifuchs
Copy link
Contributor Author

I was able to reproduce the error locally by running the latest version of pyright. I am not sure what version of pyright is used by github actions.

The last commit removes the duplicate definition of makeelement from HTMLElement. There are other duplicate definitions in HTMLElement that could be removed. But this problem is not related to this PR and should be handled in a different PR.

@abelcheung
Copy link
Owner

I was able to reproduce the error locally by running the latest version of pyright. I am not sure what version of pyright is used by github actions.

Yeah, the pyright errors only occur since 1.1.399, and that applies to main branch as well. So it's not your fault.

The last commit removes the duplicate definition of makeelement from HTMLElement. There are other duplicate definitions in HTMLElement that could be removed.

So I bet you don't want to only apply a single fix locally, only to conflict with a larger scale warning fix I'm going to apply to main. Maybe you can revert your last 2 commits and merge main branch again when I have commited the fix.

The duplicate definitions are another story. I added them for very specific reason. Take .makeelement() for example:

>>> from lxml.html import fromstring
>>> root = fromstring("<html><body><br/></body></html>")
>>> elem = root.makeelement("form")
>>> type(elem)
<class 'lxml.html.FormElement'>

>>> elem = root.makeelement("label")
>>> type(elem)
<class 'lxml.html.LabelElement'>

If root node is HtmlElement, its .makeelement() factory always produce HtmlElement or some of its subclass, but we can't tell which one from typing alone. So I add override to make sure it produces HtmlElement, not XML element (_Element). All other "duplicates" are there to mandate the methods output HtmlElement or expects HtmlElement as argument.

@udifuchs
Copy link
Contributor Author

Your example with makeelement only works for runtime checks, so the stub definitions are not relevant.

Static checks work correctly without the duplicate definition thanks to the Self mechanism:

from lxml.html import HtmlElement
html = HtmlElement()
reveal_type(html.makeelement)  # Revealed type is "type[lxml.html._element.HtmlElement]"

In any case, I reverted my last two commits like you asked.

@abelcheung
Copy link
Owner

In any case, I reverted my last two commits like you asked.

Thanks, I have just committed a fix, and you can rebase your work on that.

Your example with makeelement only works for runtime checks, so the stub definitions are not relevant.
Static checks work correctly without the duplicate definition thanks to the Self mechanism:

The issue here is, I am taking care of real world usage when designing those incompatible overrides. The usage of Self in _Element methods and non-usage in HtmlElement demonstrates 2 different user cases.

  • For _Element, developers are expected to not subclass or create a single subclass of _Element while parsing their XML tree. So Self can refer to either vanilla _Element or their own custom element as they see fit.
  • For HtmlElement, Self is detrimental to its usage. This time I use __getitem__ for easier understanding, which means elem[0] is the first immmediate child subelement of a parent element. If the override were not there, then the type of elem[0] for a subclass of HtmlElement, say FormElement, would be another FormElement in terms of static type. This implies illegal html where all elements of an HTML form are themselves forms.

When fixing stub for lxml.html submodule, the behavior of various subclasses of HtmlElement have to be taken into account too.

@abelcheung
Copy link
Owner

Ah sorry, by "revert" I mean

git reset --hard 5d970bd9da39237ebb02ed5d7dae967cd85b0269

then force push the branch, so the conflict is completely gone.

@udifuchs
Copy link
Contributor Author

I merged your commits to my branch.

Notice that multi-subclass.patch did not have any conflicts, but it still references _ElementFactory. I am not sure how you use it, so I am not sure how to fix and test it.

@abelcheung
Copy link
Owner

Notice that multi-subclass.patch did not have any conflicts, but it still references _ElementFactory.

No problem, I'll redo the patch and commit directly to your branch.

@udifuchs
Copy link
Contributor Author

lxml 6.0 was released a few weeks ago with the change that Element is now a class instead of a factory function.

Is there anything holding back merging this PR?

@abelcheung
Copy link
Owner

@udifuchs Sorry for late reply.

Is there anything holding back merging this PR?

Yeah, a few reasons actually.

  1. Perhaps the PR started too long ago. I have found an existing feature removed during previous merge of repo head in this PR. Need some additional check to make sure there aren't any more stuff accidentally removed.
  2. All _ElementFactory definitions are converted to type like:
    -         makeelement: _ElementFactory[_ET_co],
    +         makeelement: type[_ET_co],
    Usage of type indicates makeelement is a class itself, while in reality they are only class methods. Although .makeelement and the new Element perform the same thing (calling _makeElement cython function internally), the nature of .makeelement is completely modified in PR. I'm a bit worried about that.
  3. This is the most important: the change is incompatible with lxml 5.x. I have to declare types-lxml supports lxml 6.0 only if I merge this PR right now. How to resolve this dilemma is open for discussion.

@udifuchs
Copy link
Contributor Author

2. All `_ElementFactory` definitions are converted to `type` like:
   ```diff
   -         makeelement: _ElementFactory[_ET_co],
   +         makeelement: type[_ET_co],
   ```
     
   Usage of `type` indicates `makeelement` is a class itself, while in reality they are only class methods. Although `.makeelement` and the new `Element` perform the same thing (calling `_makeElement` cython function internally), the nature of `.makeelement` is completely modified in PR. I'm a bit worried about that.

There is not much difference between a class and a factory function. Both are considered callable and both return a class instance. The only relevant difference is that only on a class you can call isinstance.

3. This is the most important: the change is incompatible with lxml 5.x. I have to declare `types-lxml` supports lxml 6.0 _only_ if I merge this PR right now. How to resolve this dilemma is open for discussion.

AFAIK, this PR is fully compatible with both lxml 5.x and 6.0. Do you have any examples of incompatible code?

@abelcheung
Copy link
Owner

OK, I should have mentioned I was working on 2 PRs together, so it was not apparent to you: for example, mypy.stubtest checks the nature of lxml.etree.ElementTree itself, and it will complain about stub not implemented incorrectly for lxml 5.x -- etree.ElementTree is a class in stub now after merging this PR, but is a function when running tests against 5.x.

I can silent the warnings and see if it works in real world. As long as nobody complains, it's good to go.

@abelcheung
Copy link
Owner

abelcheung commented Aug 8, 2025

The merge with main was becoming a mess because there have been too many commits in between. Let me roll back to prestine state and work from there instead.

element: None = ...,
*,
file: _t._FileReadSource,
parser: CustomTargetParser[_T],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚫 [mypy] reported by reviewdog 🐶
Name "CustomTargetParser" is not defined [name-defined]

*,
file: _t._FileReadSource,
parser: CustomTargetParser[_T],
) -> _T: ...
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚫 [mypy] reported by reviewdog 🐶
A function returning TypeVar should receive at least one argument containing the same TypeVar [type-var]

element: None = ...,
*,
file: _t._FileReadSource,
parser: CustomTargetParser[_T],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚫 [pyright] reported by reviewdog 🐶
Type of parameter "parser" is unknown (reportUnknownParameterType)

element: None = ...,
*,
file: _t._FileReadSource,
parser: CustomTargetParser[_T],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚫 [pyright] reported by reviewdog 🐶
"CustomTargetParser" is not defined (reportUndefinedVariable)

@abelcheung
Copy link
Owner

@udifuchs It turns out rebasing a large change against almost 200 commits turns out to be disastrous, with way too many conflict resolution that I have lost track somewhere in between.

Can you rework the patch against current head and submit a new PR? BTW I have added a few patches to this PR, you will need to fix those parts in new PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
Status: On Hold
Development

Successfully merging this pull request may close these issues.

2 participants