Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add some basic subscript type inference #13562

Merged
merged 5 commits into from
Sep 30, 2024
Merged

Add some basic subscript type inference #13562

merged 5 commits into from
Sep 30, 2024

Conversation

charliermarsh
Copy link
Member

Summary

Just for tuples and strings -- the easiest cases. I think most of the rest require generic support?

@charliermarsh charliermarsh added the red-knot Multi-file analysis & type inference label Sep 30, 2024
}
}
// TODO some slices on strings should resolve to type errors rather than literals
(Type::StringLiteral(_), _) => Type::LiteralString,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the general philosophy around when we should error in a case like this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a few things to note here.

Issues with LiteralString

According to a strict reading of PEP 675, I don't think this is a legal place where we can infer LiteralString, since the PEP states:

As per the Rationale, we also infer LiteralString in the following cases:

[...]

  • Literal-preserving methods: In Appendix C, we have provided an exhaustive list of str methods that preserve the LiteralString type.

str.__getitem__ is not listed in Appendix C of the PEP, so I suppose that means that we're not allowed to infer LiteralString for x, y or z in a snippet like this:

def foo(a: Literal["foo"], b: LiteralString, c: int):
    x = a[c]
    y = b[42]
    z = b[c]

I'm not sure what the status of Appendix C is, though... it doesn't seem to have been copied into the relevant section of the typing spec, so perhaps it isn't canonical anymore? Or maybe that was just an omission when writing the spec? There's also the question of typeshed's stubs: typeshed doesn't include a LiteralString overload in its stub for str.__getitem__, although it includes LiteralString overloads for many of its other str methods. I'm not sure if there's a reason for that or not -- even if not, it may be desirable to be in sync with typeshed here.

For now I think I'd just delete this branch -- continuing to fallback to Unknown is ~fine for now.

Emitting diagnostics

In general, if we can't infer that an operation creates a Literal type (whether that's LiteralString or StringLiteral), we should fall back to typeshed's stubs. Here that means falling back to builtins.str.__getitem__ -- you'd do something like value_ty.to_meta_type(db).member(db, "__getitem__").call(&[value_ty, slice_ty]) to figure out what the type of the subscript expression is. Except there are some complications ;) Take a look at Type::iterate for a more complete example of how we lookup dunder methods elsewhere, try to call them, and emit a diagnostic if the dunder method eithere doesn't exist or isn't callable.

/// Given the type of an object that is iterated over in some way,
/// return the type of objects that are yielded by that iteration.
///
/// E.g., for the following loop, given the type of `x`, infer the type of `y`:
/// ```python
/// for y in x:
/// pass
/// ```
fn iterate(self, db: &'db dyn Db) -> IterationOutcome<'db> {
if let Type::Tuple(tuple_type) = self {
return IterationOutcome::Iterable {
element_ty: UnionType::from_elements(db, &**tuple_type.elements(db)),
};
}
// `self` represents the type of the iterable;
// `__iter__` and `__next__` are both looked up on the class of the iterable:
let iterable_meta_type = self.to_meta_type(db);
let dunder_iter_method = iterable_meta_type.member(db, "__iter__");
if !dunder_iter_method.is_unbound() {
let CallOutcome::Callable {
return_ty: iterator_ty,
} = dunder_iter_method.call(db, &[])
else {
return IterationOutcome::NotIterable {
not_iterable_ty: self,
};
};
let dunder_next_method = iterator_ty.to_meta_type(db).member(db, "__next__");
return dunder_next_method
.call(db, &[])
.return_ty(db)
.map(|element_ty| IterationOutcome::Iterable { element_ty })
.unwrap_or(IterationOutcome::NotIterable {
not_iterable_ty: self,
});
}
// Although it's not considered great practice,
// classes that define `__getitem__` are also iterable,
// even if they do not define `__iter__`.
//
// TODO(Alex) this is only valid if the `__getitem__` method is annotated as
// accepting `int` or `SupportsIndex`
let dunder_get_item_method = iterable_meta_type.member(db, "__getitem__");
dunder_get_item_method
.call(db, &[])
.return_ty(db)
.map(|element_ty| IterationOutcome::Iterable { element_ty })
.unwrap_or(IterationOutcome::NotIterable {
not_iterable_ty: self,
})
}

Copy link
Contributor

@carljm carljm Sep 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't agree with the pedantic reading of PEP 675 (or typeshed) as a rationale for removing this branch. Many useful things are simply overlooked. If a more precise type inference is clearly correct in terms of the runtime semantics, and there's nothing in the typing spec or conformance suite explicitly prohibiting it, then I think we should feel totally free to go ahead and do it.

It is clearly correct wrt the runtime semantics to infer LiteralString as the result of indexing into a LiteralString or StringLiteral.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second thought, given the security rationale for LiteralString, there may be an argument that untrusted-user-supplied indices could allow turning a "safe" LiteralString into an "unsafe" one via careful cropping. I have mixed feelings about this, because there's also type-system value to LiteralString (we know it is really a built-in str, not a subclass), and in type-system terms, the inference of LiteralString is correct. And the security issue seems fairly contrived. But that might be why inference of LiteralString on __getitem__ is not in the PEP?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is clearly correct wrt the runtime semantics to infer LiteralString as the result of indexing into a LiteralString or StringLiteral.

I agree, but the fact that PEP-675 is so clear that the list of string methods in Appendix C is valid is meant to be "exhaustive" made me worried that there was something I was misunderstanding :-) and until recently, PEP-675 was the spec -- it's not clear to me whether Appendix C was deliberately kept out of the typing spec or not.

This branch was anyway incorrect -- StringLiteral cannot be indexed into with objects of arbitraty types, which is what this branch allowed before Charlie removed it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So to clarify... We need to add a branch for __getitem__ (in general -- I plan to do this in a separate PR), and slices into StringLiteral (with non-int literal indexes) should be handled by that branch? Or do we want to resolve to LiteralString?

Copy link
Contributor

@carljm carljm Sep 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After further thought, I don't really buy the possible security argument I mentioned above for not inferring LiteralString: you can easily construct many other cases where the contents of a LiteralString, though always originating from a literal string in the source code, could be influenced in some way by untrusted user input (e.g. determining whether some concatenation does or doesn't occur.)

But that said, I don't think this is critical, and it is probably best to just rely on typeshed for this. So we could at some point consider proposing a change to typeshed with a LiteralString overload for str.__getitem__, but I don't think it's worth an extra special case in red-knot to diverge from typeshed here.

So yeah, I would say let's just add the branch that uses the return type from __getitem__, as Alex suggested. (And yeah, fine for that to be a separate follow-up if you prefer.

(I suspect, but don't know, that the reason Appendix C isn't in the type spec is that it was considered to already be reflected in the typeshed change.)

Copy link
Contributor

github-actions bot commented Sep 30, 2024

ruff-ecosystem results

Linter (stable)

✅ ecosystem check detected no linter changes.

Linter (preview)

✅ ecosystem check detected no linter changes.

}
}
// TODO some slices on strings should resolve to type errors rather than literals
(Type::StringLiteral(_), _) => Type::LiteralString,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a few things to note here.

Issues with LiteralString

According to a strict reading of PEP 675, I don't think this is a legal place where we can infer LiteralString, since the PEP states:

As per the Rationale, we also infer LiteralString in the following cases:

[...]

  • Literal-preserving methods: In Appendix C, we have provided an exhaustive list of str methods that preserve the LiteralString type.

str.__getitem__ is not listed in Appendix C of the PEP, so I suppose that means that we're not allowed to infer LiteralString for x, y or z in a snippet like this:

def foo(a: Literal["foo"], b: LiteralString, c: int):
    x = a[c]
    y = b[42]
    z = b[c]

I'm not sure what the status of Appendix C is, though... it doesn't seem to have been copied into the relevant section of the typing spec, so perhaps it isn't canonical anymore? Or maybe that was just an omission when writing the spec? There's also the question of typeshed's stubs: typeshed doesn't include a LiteralString overload in its stub for str.__getitem__, although it includes LiteralString overloads for many of its other str methods. I'm not sure if there's a reason for that or not -- even if not, it may be desirable to be in sync with typeshed here.

For now I think I'd just delete this branch -- continuing to fallback to Unknown is ~fine for now.

Emitting diagnostics

In general, if we can't infer that an operation creates a Literal type (whether that's LiteralString or StringLiteral), we should fall back to typeshed's stubs. Here that means falling back to builtins.str.__getitem__ -- you'd do something like value_ty.to_meta_type(db).member(db, "__getitem__").call(&[value_ty, slice_ty]) to figure out what the type of the subscript expression is. Except there are some complications ;) Take a look at Type::iterate for a more complete example of how we lookup dunder methods elsewhere, try to call them, and emit a diagnostic if the dunder method eithere doesn't exist or isn't callable.

/// Given the type of an object that is iterated over in some way,
/// return the type of objects that are yielded by that iteration.
///
/// E.g., for the following loop, given the type of `x`, infer the type of `y`:
/// ```python
/// for y in x:
/// pass
/// ```
fn iterate(self, db: &'db dyn Db) -> IterationOutcome<'db> {
if let Type::Tuple(tuple_type) = self {
return IterationOutcome::Iterable {
element_ty: UnionType::from_elements(db, &**tuple_type.elements(db)),
};
}
// `self` represents the type of the iterable;
// `__iter__` and `__next__` are both looked up on the class of the iterable:
let iterable_meta_type = self.to_meta_type(db);
let dunder_iter_method = iterable_meta_type.member(db, "__iter__");
if !dunder_iter_method.is_unbound() {
let CallOutcome::Callable {
return_ty: iterator_ty,
} = dunder_iter_method.call(db, &[])
else {
return IterationOutcome::NotIterable {
not_iterable_ty: self,
};
};
let dunder_next_method = iterator_ty.to_meta_type(db).member(db, "__next__");
return dunder_next_method
.call(db, &[])
.return_ty(db)
.map(|element_ty| IterationOutcome::Iterable { element_ty })
.unwrap_or(IterationOutcome::NotIterable {
not_iterable_ty: self,
});
}
// Although it's not considered great practice,
// classes that define `__getitem__` are also iterable,
// even if they do not define `__iter__`.
//
// TODO(Alex) this is only valid if the `__getitem__` method is annotated as
// accepting `int` or `SupportsIndex`
let dunder_get_item_method = iterable_meta_type.member(db, "__getitem__");
dunder_get_item_method
.call(db, &[])
.return_ty(db)
.map(|element_ty| IterationOutcome::Iterable { element_ty })
.unwrap_or(IterationOutcome::NotIterable {
not_iterable_ty: self,
})
}

crates/red_knot_python_semantic/src/types/infer.rs Outdated Show resolved Hide resolved
crates/red_knot_python_semantic/src/types/infer.rs Outdated Show resolved Hide resolved
@charliermarsh
Copy link
Member Author

Okay thanks, will re-request review once I've addressed feedback.

@charliermarsh
Copy link
Member Author

Actually, it should be fine to merge in its current state. I'll look at doing the dunder fallback and diagnostics in a separate PR to keep this small.

@charliermarsh
Copy link
Member Author

Added diagnostics, though unsure if there are established patterns around the rule names, granularity, etc.

@AlexWaygood
Copy link
Member

Added diagnostics, though unsure if there are established patterns around the rule names, granularity, etc.

It's a bit of a wild west right now, I wouldn't worry about that too much! I'm sure we'll clean them up at some point (and make the diagnostics less stringly typed etc.). Micha's proposals he shared during the on-site go some way to addressing this.

@charliermarsh
Copy link
Member Author

Sounds good, I figured as much :)

Copy link
Contributor

@carljm carljm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thank you! One minor naming nit, and the issue of negative indexing.

crates/red_knot_python_semantic/src/types/infer.rs Outdated Show resolved Hide resolved
crates/red_knot_python_semantic/src/types/infer.rs Outdated Show resolved Hide resolved
// TODO handle variable-length tuples
(Type::Tuple(tuple_ty), Type::IntLiteral(int)) => {
let elements = tuple_ty.elements(self.db);
usize::try_from(int)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Python allows negative indexing, which indexes from the end of the container:

>>> t = (1, 2, 3)
>>> t[-1]
3

So we should handle that here. I would say we don't have to do it in this PR, and we can just add a TODO, but I don't like that in the meantime this would mean we'd wrongly emit an out-of-bounds error.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll handle it here, thanks.

Copy link
Contributor

@carljm carljm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! LGTM.

I think there would be a possible implementation where we extract a helper function for the actual indexing-or-None for each type, and then the negative case could just add .len() to the index and try that function. But there's really not that much duplication in this version, so I'm not sure that would be worth it, happy with this as-is.

@charliermarsh charliermarsh merged commit c9c748a into main Sep 30, 2024
20 checks passed
@charliermarsh charliermarsh deleted the charlie/sub branch September 30, 2024 20:50
@charliermarsh
Copy link
Member Author

Thanks for the thorough reviews, much appreciated! I'll look at the __getitem__ part next.

charliermarsh added a commit that referenced this pull request Oct 1, 2024
## Summary

Follow-up to #13562, to add
support for "arbitrary" subscript operations.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
red-knot Multi-file analysis & type inference
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants