Advice for type checking Unions of recursively defined TypedDicts #9362

clause · 2024-10-30T15:06:20Z


I'm working on a small compiler that has multiple similar tree representations. For example:

from typing import Literal, TypedDict, Union

type Surface = Union[
    Int,
    Add[Surface, Surface],
    Subtract[Surface, Surface],
    Bool,
    If[Surface, Surface, Surface],
    And[Surface, Surface],
]

type Core = Union[
    Int,
    Add[Core, Core],
    Subtract[Core, Core],
    Bool,
    If[Core, Core, Core],
]

where the elements of the unions (nodes) are defined as follows:

class Int(TypedDict, total=False):
    tag: Literal["int"]
    value: int


class Add[E1, E2](TypedDict, total=False):
    tag: Literal["+"]
    operands: tuple[E1, E2]


class Subtract[E1, E2](TypedDict, total=False):
    tag: Literal["-"]
    operands: tuple[E1, E2]


class Bool(TypedDict, total=False):
    tag: Literal["bool"]
    value: bool


class If[Predicate, Consequent, Alternative](TypedDict, total=False):
    tag: Literal["if"]
    predicate: Predicate
    consequent: Consequent
    alternative: Alternative


class And[E1, E2](TypedDict, total=False):
    tag: Literal["and"]
    operands: tuple[E1, E2]

Many of the compiler passes are simple transformations between the representations. For example:

def shrink(
    expr: Surface,
) -> Core:
    match expr:
        case {"tag": ("int" | "bool")}:
            return expr

        case {"tag": "+", "operands": [e1, e2]}:
            return {**expr, "operands": (shrink(e1), shrink(e2))}
        case {"tag": "-", "operands": [e1, e2]}:
            return {**expr, "operands": (shrink(e1), shrink(e2))}

        # case {"tag": ("+" | "-"), "operands": [e1, e2]}:
        #     return {**expr, "operands": (shrink(e1), shrink(e2))}

        case {
            "tag": "if",
            "predicate": predicate,
            "consequent": consequent,
            "alternative": alternative,
        }:
            return {
                **expr,
                "predicate": shrink(predicate),
                "consequent": shrink(consequent),
                "alternative": shrink(alternative),
            }
        case {"tag": "and", "operands": [e1, e2]}:
            return If(
                tag="if",
                predicate=shrink(e1),
                consequent=shrink(e2),
                alternative=Bool(tag="bool", value=False),
            )
        case _:
            raise ValueError(f"unhandled expression: {expr}")

The reason for total=False and {**expr, ...} is to allow for and to propagate optional information (e.g., source locations) in the nodes.
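As a runtime-only illustration of the propagation described above, here is a minimal sketch that uses plain dicts in place of the TypedDict nodes. The "location" key is a hypothetical extra field, not something the node definitions above declare; a static type checker would likely require it to be declared on the TypedDict before accepting it.

```python
# Plain dicts stand in for the TypedDict nodes. "location" is a hypothetical
# extra key that the node definitions above do not declare.
int_one = {"tag": "int", "value": 1, "location": (1, 0)}
int_two = {"tag": "int", "value": 2, "location": (1, 4)}
add = {"tag": "+", "operands": (int_one, int_two), "location": (1, 2)}

# Unpack first, then override: keys written after **add replace the
# unpacked ones, and every other key is copied unchanged.
rewritten = {**add, "operands": (int_two, int_one)}

print(rewritten["location"])              # (1, 2)
print(rewritten["operands"][0]["value"])  # 2
print(add["operands"][0]["value"])        # 1, the original node is untouched
```

The copy is shallow, so the child nodes are shared between the old and new dicts, which is harmless as long as nodes are treated as immutable.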

The code above works fine, but there are many situations where nodes should be handled similarly. For non-recursive nodes (e.g., Int, Bool) I can combine the cases and the type checker is happy. However, for recursive nodes (e.g., Add, Subtract), I cannot combine them. For example, if the cases for Add and Subtract in the above example are replaced with the commented-out code, it no longer type checks.

Is there a way I can write this code so that the similar cases are merged and the type checker is satisfied without resorting to something like cast? Or is it just too complex?

Also, why is the final case _ necessary? Why isn't the match known to be exhaustive?

Answered by erictraut


erictraut · 2024-10-30T21:05:55Z

erictraut
Oct 30, 2024
Maintainer

Recursive nodes require a static type checker to employ bidirectional type inference to infer a result that is compatible with the return type (Core). When the "expected type" for bidirectional type inference involves a union, it succeeds only if the expression can be evaluated using one of the subtypes of the union. Your code is creating a situation where the expression must evaluate to more than one subtype.

def func(t1: Literal["+"], t2: Literal["+", "-"]):
    # This works because the dictionary expression evaluates
    # to `Add[Core, Core]`, which is one of the subtypes of the
    # expected type union.
    x1: Add[Core, Core] | Subtract[Core, Core] = {
        "tag": t1,
        "operands": ({"tag": "int", "value": 1}, {"tag": "int", "value": 1}),
    }

    # This does not work because the dictionary expression does not
    # evaluate to either `Add[Core, Core]` or `Subtract[Core, Core]`.
    x2: Add[Core, Core] | Subtract[Core, Core] = {
        "tag": t2,
        "operands": ({"tag": "int", "value": 1}, {"tag": "int", "value": 1}),
    }

I don't know of any static type checkers (in any language) that support this. Off the top of my head, I can't think of an algorithm that would enable this, but it's possible that such an algorithm exists. If it does, it would undoubtedly be extremely expensive computationally, probably infeasible in practice.

Why isn't the match known to be exhaustive?

There are a couple of reasons for this. The first is that your TypedDict definitions have total=False, which means that none of the items are required. That means an empty dictionary, or a dictionary with only a "tag" key but no "operands" key, would satisfy these structural type definitions.
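The effect of total=False can be inspected at runtime. The sketch below is one possible way to keep the discriminating keys required while still allowing optional extras. It assumes Python 3.9 or later; the names LooseInt, StrictInt, and the "location" field are hypothetical and do not appear in the question.

```python
from typing import Literal, TypedDict


class LooseInt(TypedDict, total=False):
    # Same shape as Int in the question: every key is optional.
    tag: Literal["int"]
    value: int


class _IntRequired(TypedDict):
    # total=True is the default, so both keys are required here.
    tag: Literal["int"]
    value: int


class StrictInt(_IntRequired, total=False):
    # Only the keys declared in this subclass are optional.
    location: tuple[int, int]  # hypothetical source-location field


print(LooseInt.__required_keys__)   # frozenset()
print(sorted(StrictInt.__required_keys__))  # ['tag', 'value']
print(sorted(StrictInt.__optional_keys__))  # ['location']

# An empty dict is a valid LooseInt, which is why a match over such nodes
# cannot be proven exhaustive by looking at the "tag" key alone.
empty: LooseInt = {}
```

Splitting the definition this way addresses only the first reason given above; it does not change how a checker narrows the fall-through case.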

The second reason is that pyright's type narrowing algorithm for mapping types based on unions of tagged TypedDicts isn't being as smart as it could be in the negative (fall-through) case. It's not narrowing the fall-through type as much as it could in the case where the case mapping pattern is filtering on more than one key. The normal pattern when using tagged TypedDicts is to filter only on the tag key. Determining whether this is safe to narrow further when the filter involves multiple keys is technically possible, but it would require a bunch of extra complex logic. No one has requested this feature, and other type checkers do not implement it. It's something I would be willing to implement if I receive signal from enough pyright users that it's something they'd like to see. It's enough work (and carries with it enough regression risk) that I can't justify adding it for just one pyright user.

3 replies

clause Oct 31, 2024
Author

Thanks for the detailed reply. I was hoping it could be a simpler tweak to the type equivalence algorithm. The following version gave me hope:

from typing import Literal, TypedDict, ReadOnly, Union

type Surface = Union[
    Int,
    # works with this definition
    Primitive[Literal["+", "-"], tuple[Surface, Surface]],
    # not this one
    # Primitive[Literal["+"], tuple[Surface, Surface]],
    # Primitive[Literal["-"], tuple[Surface, Surface]],
    Bool,
    If[Surface, Surface, Surface],
    Primitive[Literal["and"], tuple[Surface, Surface]],
]

type Core = Union[
    Int,
    # works with this definition
    Primitive[Literal["+", "-"], tuple[Core, Core]],
    # not this one
    # Primitive[Literal["+"], tuple[Core, Core]],
    # Primitive[Literal["-"], tuple[Core, Core]],
    Bool,
    If[Core, Core, Core],
]


class Primitive[Operator, Operands](TypedDict):
    tag: ReadOnly[Literal["primitive"]]
    operator: ReadOnly[Operator]
    operands: ReadOnly[Operands]


class Int(TypedDict):
    tag: ReadOnly[Literal["int"]]
    value: ReadOnly[int]


class Bool(TypedDict):
    tag: ReadOnly[Literal["bool"]]
    value: ReadOnly[bool]


class If[Predicate, Consequent, Alternative](TypedDict):
    tag: ReadOnly[Literal["if"]]
    predicate: ReadOnly[Predicate]
    consequent: ReadOnly[Consequent]
    alternative: ReadOnly[Alternative]


def shrink(
    expr: Surface,
) -> Core:
    match expr:
        case {"tag": ("int" | "bool")}:
            return expr

        case {"tag": "primitive", "operator": ("+" | "-"), "operands": [e1, e2]}:
            return {**expr, "operands": (shrink(e1), shrink(e2))}

        # also works
        # case {"tag": "primitive", "operator": "+", "operands": [e1, e2]}:
        #     return {**expr, "operands": (shrink(e1), shrink(e2))}

        # case {"tag": "primitive", "operator": "-", "operands": [e1, e2]}:
        #     return {**expr, "operands": (shrink(e1), shrink(e2))}

        case {
            "tag": "if",
            "predicate": predicate,
            "consequent": consequent,
            "alternative": alternative,
        }:
            return {
                **expr,
                "predicate": shrink(predicate),
                "consequent": shrink(consequent),
                "alternative": shrink(alternative),
            }

        case {"tag": "primitive", "operator": "and", "operands": [e1, e2]}:
            return If(
                tag="if",
                predicate=shrink(e1),
                consequent=shrink(e2),
                alternative=Bool(tag="bool", value=False),
            )

        case _:
            raise ValueError(f"unhandled expression: {expr}")

But I don't want to have to place all of the operators that I might want to match together in the definition (e.g., not would need to go with + and - if I wanted to be able to treat them all the same).

Why are the following definitions not equivalent (assuming unions could only have one child)? I'm guessing the intervening Primitive is the issue, but I could see some normalization that replaces a nested literal with more than one argument with copies of the type expression, one per argument.

type Expr = Union[
    Primitive[Literal["+", "-"], tuple[Expr, Expr]],
]

type Expr = Union[
    Primitive[Literal["+"], tuple[Expr, Expr]],
    Primitive[Literal["-"], tuple[Expr, Expr]],
]

erictraut Oct 31, 2024
Maintainer

Why are the following definitions not equivalent

In the Python type system, the type X[A | B] is not equivalent to the type X[A] | X[B]. For example, the type Sequence[int | str] describes a different set of runtime values than Sequence[int] | Sequence[str].
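The difference can be made concrete with a runtime approximation of the two types. The helper names below are hypothetical, and isinstance checks are only a rough stand-in for what a static checker reasons about.

```python
from typing import Sequence, Tuple


def fits_sequence_of_union(values: Sequence[object], types: Tuple[type, ...]) -> bool:
    # Approximates Sequence[A | B]: each element may independently be A or B.
    return all(isinstance(v, types) for v in values)


def fits_union_of_sequences(values: Sequence[object], types: Tuple[type, ...]) -> bool:
    # Approximates Sequence[A] | Sequence[B]: one type must fit every element.
    return any(all(isinstance(v, t) for v in values) for t in types)


mixed = [1, "a"]
print(fits_sequence_of_union(mixed, (int, str)))   # True
print(fits_union_of_sequences(mixed, (int, str)))  # False

uniform = [1, 2]
print(fits_sequence_of_union(uniform, (int, str)))   # True
print(fits_union_of_sequences(uniform, (int, str)))  # True
```

The mixed list is accepted by the first form and rejected by the second, so the two types cannot be treated as interchangeable.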

I don't want to have to place all of the operators that I might want to match together in the definition

I recommend creating one node for binary operators ("add", "sub", "mul", "and", etc.) and one node for unary operators ("negate", "not", etc.). I've written dozens of parsers and compilers over the years, and this is the pattern I've always used. For example, here's the equivalent data structures in pyright's parser: BinaryOperationNode and UnaryOperationNode.

You might also consider switching from TypedDicts to dataclasses (preferably frozen), which may work better for your use case.

clause Oct 31, 2024
Author

I agree that Binary and Unary nodes make sense, but this is for an undergraduate compilers class that is based on: https://mitpress.mit.edu/9780262047760/essentials-of-compilation/. The source language is built up incrementally and it seems pedagogically beneficial to have a clear distinction between the elements of each extension to the language rather than updating existing definitions.

I've tried using dataclasses but it loses the ability to have "extra" attributes (e.g. source location). It also still seemed to have the issue of not being able to handle similar nodes together. The replace function felt like it might be a good approach but I had issues getting it to work in a type safe manner (e.g., the type parameters can't be changed).

from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Foo[T]:
    value: T


f1: Foo[int] = Foo(5)
# Type "Foo[int]" is not assignable to declared type "Foo[str]"
#  "Foo[int]" is not assignable to "Foo[str]"
#    Type parameter "T@Foo" is invariant, but "int" is not the same as "str"
f2: Foo[str] = replace(f1, value="x")
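One possible way around the fixed type parameter is a hand-written method that builds a new instance explicitly, so its signature can state the new type. This is only a sketch: with_value and the line field are hypothetical names, and it uses the older Generic/TypeVar spelling so that it also runs on Python versions before 3.12.

```python
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar

T = TypeVar("T")
U = TypeVar("U")


@dataclass(frozen=True)
class Foo(Generic[T]):
    value: T
    line: Optional[int] = None  # hypothetical "extra" attribute

    def with_value(self, value: U) -> "Foo[U]":
        # Construct a fresh instance so the result is typed by the new
        # value, and copy the remaining fields across by hand.
        return Foo(value, self.line)


f1: Foo[int] = Foo(5, line=12)
f2: Foo[str] = f1.with_value("x")
print(f2)  # Foo(value='x', line=12)
```

The cost is that each node class needs its own method and every additional field must be copied explicitly, which replace would otherwise do automatically.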

The type checking also seemed to be inconsistent. For example, consider the following workup:

from dataclasses import dataclass, replace
from typing import assert_type, Union

type Surface = Union[
    Int,
    Add[Surface, Surface],
    Subtract[Surface, Surface],
    Bool,
    If[Surface, Surface, Surface],
    And[Surface, Surface],
]

type Core = Union[
    Int,
    Add[Core, Core],
    Subtract[Core, Core],
    Bool,
    If[Core, Core, Core],
    And[Core, Core],
]


@dataclass(frozen=True)
class Int:
    value: int


@dataclass(frozen=True)
class Add[E1, E2]:
    operands: tuple[E1, E2]


@dataclass(frozen=True)
class Subtract[E1, E2]:
    operands: tuple[E1, E2]


@dataclass(frozen=True)
class Bool:
    value: bool


@dataclass(frozen=True)
class If[Predicate, Consequent, Alternative]:
    predicate: Predicate
    consequent: Consequent
    alternative: Alternative


@dataclass(frozen=True)
class And[E1, E2]:
    operands: tuple[E1, E2]


def shrink(
    expr: Surface,
) -> Core:
    match expr:
        case Int() | Bool():
            return expr

        case Add([e1, e2]) | Subtract([e1, e2]):
            result = replace(expr, operands=(shrink(e1), shrink(e2)))
            # this fails:
            # "assert_type" mismatch: expected "Int | Add[Core, Core] | Subtract[Core, Core] |
            # Bool | If[Core, Core, Core] | And[Core, Core]" but received "Add[Int | ... |
            # Subtract[Surface, Surface] | Bool | If[Surface, Surface, Surface] |
            # And[Surface, Surface], Int | ... | Subtract[Surface, Surface] | Bool |
            # If[Surface, Surface, Surface] | And[Surface, Surface]] | Subtract[Int |
            # Add[Surface, Surface] | ... | Bool | If[Surface, Surface, Surface] |
            # And[Surface, Surface], Int | Add[Surface, Surface] | ... | Bool |
            # If[Surface, Surface, Surface] | And[Surface, Surface]]"
            assert_type(result, Core)
            # but this is fine
            return result

        case If(predicate, consequent, alternative):
            return If(shrink(predicate), shrink(consequent), shrink(alternative))

        case And([e1, e2]):
            return If(shrink(e1), shrink(e2), Bool(False))
