Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

concat of types Iterable[Any] cause Pyright to think subsequent code is unreachable because of recent stubs change #888

Closed
torext opened this issue Mar 13, 2024 · 39 comments · Fixed by #890
Assignees
Labels
Reshaping Concat, Merge/Join, Stack/Unstack, Explode

Comments

@torext
Copy link

torext commented Mar 13, 2024

Describe the bug
In the following code the last line is wrongly considered to be unreachable by Pyright:

from typing import Any

import pandas as pd


def generate_series() -> Any:
    return pd.Series([1, 2, 3])


df = pd.concat({k: generate_series() for k in range(10)}, axis=1)

print("Unreachable?!")

This is caused by the recent change #858 adding a stub returning Never. This is presumably a misuse of Never or the stub as a whole should be moved to the very end so that concat on types Iterable[Any] can first match a stub with valid return.

To Reproduce
See microsoft/pylance-release#5631 and microsoft/pylance-release#5630 for issues reported on the Pylance repo.

Please complete the following information:

  • OS: Linux
  • OS Version 5.15.0-46-generic
  • Python 3.11.6
  • Pylance v2024.3.1
  • version of installed pandas-stubs: the one shipped with above Pylance version
@Dr-Irv
Copy link
Collaborator

Dr-Irv commented Mar 13, 2024

Interesting. We were trying to prevent Iterable[None], but in the process blocked Iterable[Any].

Asking @twoertwein to look at this.

@Dr-Irv Dr-Irv added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Mar 13, 2024
@debonte
Copy link

debonte commented Mar 13, 2024

We were trying to prevent Iterable[None]...

Note that returning Never from an overload means that with that combination of arguments, the function will never return (ex. it always raises an exception, or loops infinitely). It doesn't mean, "this combination of argument values is illegal".

@Dr-Irv
Copy link
Collaborator

Dr-Irv commented Mar 13, 2024

We were trying to prevent Iterable[None]...

Note that returning Never from an overload means that with that combination of arguments, the function will never return (ex. it always raises an exception, or loops infinitely). It doesn't mean, "this combination of argument values is illegal".

In this case, if you do pd.concat([None]), pandas does raise an exception. So our meaning is correct here.

If we move that overload down, then it will never match, because pd.concat([pd.Series(), None]) is valid, which matches the second overload, and we don't have a way to say "you can't pass an iterable of just None, but mixing None with Series or DataFrame is OK"

The main issue here is the "wildcard" nature of Any . It's unclear to me whether we should be supporting the use of parameters that are typed as Any by users and passed into methods declared in the stubs, because that defeats the purpose of type checking, IMHO.

@torext
Copy link
Author

torext commented Mar 13, 2024

I'd rather say the main issue here is that you're trying to support a case (i.e. all None args) that is very unlikely to occur in practice in a way that needs to be caught by a static type checker and where an appropriate runtime error (which is already there) doesn't suffice.

Instead when one omits type annotations from function returns it seems that many type checkers will implicitly consider the return type to be Any, unless otherwise inferable. The example I gave above is purely artifical to highlight the problem, but I've encountered the problem in practice with very generic code that doesn't have type annotations.

Given type annotations are purely optional, I'd say no combination of type checkers and stubs should ever declare code unreachable when no type annotations are present and no obvious inference of unreachability can be made (like e.g. if False:...).

@debonte
Copy link

debonte commented Mar 13, 2024

It's unclear to me whether we should be supporting the use of parameters that are typed as Any by users and passed into methods declared in the stubs, because that defeats the purpose of type checking, IMHO.

This scenario may be more interesting than explicit Any:

pd.concat([])
print("Unreachable?!")

FWIW, Pylance had two issues filed on this (one with 5 up votes) within about 12 hours of shipping this change.

@Dr-Irv
Copy link
Collaborator

Dr-Irv commented Mar 13, 2024

I'd rather say the main issue here is that you're trying to support a case (i.e. all None args) that is very unlikely to occur in practice in a way that needs to be caught by a static type checker and where an appropriate runtime error (which is already there) doesn't suffice.

I think that is debatable. If the type checker can find the error before getting to runtime, that's a good thing.

Instead when one omits type annotations from function returns it seems that many type checkers will implicitly consider the return type to be Any, unless otherwise inferable. The example I gave above is purely artifical to highlight the problem, but I've encountered the problem in practice with very generic code that doesn't have type annotations.

Yes, that's true, but the question is whether our goal is to support well-typed code that calls pandas, or untyped code that calls pandas. We've chosen the former from a philosophical point of view.

Given type annotations are purely optional, I'd say no combination of type checkers and stubs should ever declare code unreachable when no type annotations are present and no obvious inference of unreachability can be made (like e.g. if False:...).

The unreachable conclusion made by pylance within VSCode is their choice. If you run your code through mypy or command line pyright, then it passes fine without any errors.

The python typing page says the following about Never:

This can be used to define a function that should never be called, or a function that never returns:

That's exactly what we used it for.

@Dr-Irv
Copy link
Collaborator

Dr-Irv commented Mar 13, 2024

It's unclear to me whether we should be supporting the use of parameters that are typed as Any by users and passed into methods declared in the stubs, because that defeats the purpose of type checking, IMHO.

This scenario may be more interesting than explicit Any:

pd.concat([])
print("Unreachable?!")

And that code IS unreachable!! If you try to execute pd.concat([]), you will get an exception!

FWIW, Pylance had two issues filed on this (one with 5 up votes) within about 12 hours of shipping this change.

@debonte
Copy link

debonte commented Mar 13, 2024

And that code IS unreachable!! If you try to execute pd.concat([]), you will get an exception!

Ah, ok. I thought the exception was only thrown if all of the collection's (one or more) elements were None. I didn't realize it was also thrown for empty collections.

This can be used to define a function that should never be called, or a function that never returns:

That's exactly what we used it for.

I believe the "never be called" case is talking about using Never to annotate function parameters, not function return types.

@torext
Copy link
Author

torext commented Mar 13, 2024

Yes, that's true, but the question is whether our goal is to support well-typed code that calls pandas, or untyped code that calls pandas. We've chosen the former from a philosophical point of view.

Apologies for my ignorance, but shouldn't untyped code be considered the default case? Unless these stubs are widely used by different IDEs and type checkers then what is the point of making them? And unless the latter don't behave sanely and usefuly for the base-case of untyped code, then what's the point of those in turn?

My point being, again, it seems to me that you are prioritising here a very particular edge case purely for the sake of prioritising that niche case. I on the other hand have work to do, and I need a solid cross-OS IDE for various languages and in particular Python.

Apologies for the likely palpable frustration on my side, but I hope I'm making it clear where I'm coming from on this. You said you guys made a philosophical choice, I'm instead urging you to consider pragmatism.

@Dr-Irv
Copy link
Collaborator

Dr-Irv commented Mar 13, 2024

I believe the "never be called" case is talking about using Never to annotate function parameters, not function return types.

That's unclear. See https://mypy.readthedocs.io/en/latest/more_types.html#the-noreturn-type with respect to NoReturn. But then the current typing docs at https://docs.python.org/3/library/typing.html say about Never "This can be used to define a function that should never be called, or a function that never returns:" and "NoReturn can also be used as a bottom type, a type that has no values. Starting in Python 3.11, the Never type should be used for this concept instead. Type checkers should treat the two equivalently."

So our interpretation of Never is correct - it means a function that never returns, i.e., if you pass those parameters, you don't get a return - it can just die via sys.exit() or raise an Exception

@Dr-Irv
Copy link
Collaborator

Dr-Irv commented Mar 13, 2024

Apologies for my ignorance, but shouldn't untyped code be considered the default case?

No!!!

When my team had a large untyped code, and we then converted it to have types, the static typing checks actually found bugs that we didn't know existed.

Here's a really simple example:

s = pd.Series([3.4, 2.3])
r: Any = 3.2
s.round(r)

This code will fail, because the argument for round() should be an int. But because r is declared to be Any, the type checker says it is valid code.

If you remove the Any declaration, then the type checker sees that you passed an incorrect argument.

The problem with using Any is that it matches anything. The advantage of using type checking is that it allows you to find errors in your code prior to runtime. But you only get that advantage if you start introducing types into your code. Using Any doesn't allow you to obtain the benefits of type checking.

@twoertwein
Copy link
Member

@torext I cannot reproduce your error on main with the latest pyright (even when adding # pyright: strict) - it does not complain about unreachable code! It, however, reports Type of "concat" is partially unknown (reportUnknownMemberType).

@twoertwein
Copy link
Member

twoertwein commented Mar 14, 2024

I understand that the Any case is problematic! Does VSCode in its default configuration complain about this or only when users request more strict(er) type checking? I believe mypy does, by default, not report unreachable code.

If it is "only" a non-default issue, I would expect that people who care about type checks (which is persumably why they enabled a non-default setting) to also minimize the number of Anys in their code - so it shouldn't be an issue?

If it is an issue in the default configuration, we should change it.

@debonte
Copy link

debonte commented Mar 14, 2024

Pyright reports unreachability as a diagnostic hint. In VS Code that causes the unreachable code to be dimmed. You won't see a diagnostic in the Problems pane. You'll see this even with typeCheckingMode set to off. Other editors may behave differently. I expect that Pyright doesn't emit a reachability diagnostic when run on the command line.

image

@twoertwein
Copy link
Member

Pyright reports unreachability as a diagnostic hint. In VS Code that causes the unreachable code to be dimmed. You won't see a diagnostic in the Problems pane. You'll see this even with typeCheckingMode set to off. Other editors may behave differently. I expect that Pyright doesn't emit a reachability diagnostic when run on the command line.

Thank you for the explanation - I'm using pyright on the command line. So, is it an issue in PyLance default configuration?

@debonte
Copy link

debonte commented Mar 14, 2024

So, is it an issue in PyLance default configuration?

It's an issue in the default configurations of both Pylance and Pyright (when run as a language server).

@twoertwein
Copy link
Member

How does pyright/pylance handle "Unknown" in this case? Is it also flagged as unreachable?

I think Any is actually not the problematic case (only when people explicitly add Any) - the more problematic case is probably "Unknown" (which is probably quite rare thanks to Pyright's great type inference for even missing return annotations).

Assuming Unknown triggers the unreachable diagnostic: I'm open to removing the None-overload - but I'm surprised that people expect pd.concat([]) to work. This means there is a benefit to this overload, making it tricky to judge whether we want to rather have a false positive or a false negative.

when run as a language server

Is there an option to get these warnings when running Pyright outside an editor? If so, we can enable that on the CI.

@debonte
Copy link

debonte commented Mar 14, 2024

How does pyright/pylance handle "Unknown" in this case? Is it also flagged as unreachable?

Unknown is a special case of Any, so yes the behavior would be the same.

I'm surprised that people expect pd.concat([]) to work. This means there is a benefit to this overload...

It would be better if the illegal cases could be reported as errors by a type checker. Returning Never and flagging the subsequent code as unreachable makes the user aware that something is wrong, but then they have to dig into the stubs to understand the problem. Whereas an error at the call site explaining that the args are invalid would be more helpful.

Unfortunately, I don't think the Python typing system currently has a way to represent the concept of an Iterable[Series | None] where at least one of the elements is non-None.

@erictraut might have ideas.

Is there an option to get these warnings when running Pyright outside an editor? If so, we can enable that on the CI.

No. Currently they're filtered out. If you think it's important, you could file an enhancement request.

@Dr-Irv
Copy link
Collaborator

Dr-Irv commented Mar 14, 2024

This is a problem with pylance. See microsoft/pylance-release#5640 (comment) for a simple example that illustrates the issue on code that has nothing to do with pandas-stubs. pylance is doing unreachable analysis even if type checking is off. The only way it can do that is by peeking into the stubs.

I'm closing this issue here.

@Dr-Irv Dr-Irv closed this as completed Mar 14, 2024
@torext
Copy link
Author

torext commented Mar 14, 2024

The issue you created (microsoft/pylance-release#5640 (comment)) doesn't address the bug I'm reporting here! I'm reporting here an issue where concaton a type Iterable[Any]returns a stub with return type Never. Clearly even the most hard-core type-enthusiast can see this is wrong: it's simply not true that such a call will never return. It only sometimes won't return (in fact only in the very rare case that all objects in the iterable are None).

I don't have type checking set to off in Pylance and I don't plan to. I want some basic type checking and stubs that do the right thing, even if there's some Anys in my code. You seriously cannot hope to catch all bugs and errors with static type checking; if that was the goal, then Python as a whole should've been designed differently (e.g. statically typed).

@Dr-Irv
Copy link
Collaborator

Dr-Irv commented Mar 14, 2024

The issue you created (microsoft/pylance-release#5640 (comment)) doesn't address the bug I'm reporting here! I'm reporting here an issue where concaton a type Iterable[Any]returns a stub with return type Never. Clearly even the most hard-core type-enthusiast can see this is wrong: it's simply not true that such a call will never return. It only sometimes won't return (in fact only in the very rare case that all objects in the iterable are None).

This is what I call the "Goldilocks problem." If the stubs are too wide, then you don't catch errors in typical code. If the stubs are too narrow, then the type checker flags errors on valid code. So we try to find something that is "just right", i.e., something in the middle that will catch the most typical use cases.

You're arguing that the stub in this case is too narrow. I think it is better that we don't accept untyped arguments, which Any represents. You can add a #type: ignore statement to your code, or make sure that the arguments you pass to pd.concat() are all typed.

IMHO, type stubs are there to help people who have declared types in their code. Any represents a case where that is not true.

@Dr-Irv
Copy link
Collaborator

Dr-Irv commented Mar 14, 2024

The issue you created (microsoft/pylance-release#5640 (comment)) doesn't address the bug I'm reporting here! I'm reporting here an issue where concaton a type Iterable[Any]returns a stub with return type Never. Clearly even the most hard-core type-enthusiast can see this is wrong: it's simply not true that such a call will never return. It only sometimes won't return (in fact only in the very rare case that all objects in the iterable are None).

One other point here. The type checker (pyright) does NOT report an issue with your code that uses Any. If you run pyright from the command line, you will get no errors. The issue here is that the way that pyright is embedded in pylance in VSCode is reporting the code as unreachable. IMHO, they should have an option to turn off that analysis to get around this problem.

@torext
Copy link
Author

torext commented Mar 14, 2024

Alright, I give up. I'm thoroughly convinced at this point that I've made a clear and strong argument here that you guys are far from "trying to find something that is 'just right'" and rather obsessing over a narrow corner-case in a way that provably isn't helping anyone in this particular instance. Rationalize it all you will, fact is people are having issues because of your changes and are hacking the stubs locally to circumvent them.

Also the fact remains that logically what you did is plain wrong: it's not true that concat never returns anything on Iterable[Any]input, and, because of your changes, Pyright/the stubs do instead report that. All the discussion about how then Pylance interprets that report from Pyright/the stubs is then a red herring IMO. If anything you could start an argument that Pyright's method of returning the first matching stub is limited, but I'm not participating.

@Dr-Irv
Copy link
Collaborator

Dr-Irv commented Mar 14, 2024

it's not true that concat never returns anything on Iterable[Any]input, and, be

It only returns something if you have Iterable[DataFrame | Series | None], but not Iterable[None]. It does NOT return something if you passed in Iterable[int], for example.

Consider the following code:

from typing import Iterable, Any

import pandas as pd


def foo() -> Iterable[Any]:
    return [3, 4, 5]


pd.concat(foo())

print("unreachable")

Both pyright (command line) and mypy will NOT report an issue with this code. However, it will fail. It just turns out that when you have this code inside of VS Code, it greys out print("unreachable").

From a type checking perspective, we've done the right thing, because type checkers don't report unreachable code. VS Code does. That's an issue with that tool, not the type stubs.

Pyright/the stubs do instead report that.

No, pyright does NOT report that. If you run the code with command line pyright, no errors are reported.

The issue here is that within VS Code, pyright is configured differently and that's why I opened up the issue there.

@debonte
Copy link

debonte commented Mar 14, 2024

From a type checking perspective, we've done the right thing, because type checkers don't report unreachable code. VS Code does. That's an issue with that tool, not the type stubs.

Consider the following code:

# pyright: basic

from typing import Iterable, Any

import pandas as pd


def foo() -> Iterable[Any]:
    return [3, 4, 5]

x: int = "str"

pd.concat(foo())

y: int = "str"

Although pyright (on the command line) does not report that the last line is unreachable, it also doesn't report that there's a typing error there -- pyright will not report diagnostics within unreachable code.

Pyright run on the code above emits this:

  c:\issues\pylance\5630\test2.py:11:10 - error: Expression of type "Literal['str']" cannot be assigned to declared type "int"
    "Literal['str']" is incompatible with "int" (reportAssignmentType)
1 error, 0 warnings, 0 informations

Whereas, if the concat call is commented out, it emits this:

  c:\issues\pylance\5630\test2.py:11:10 - error: Expression of type "Literal['str']" cannot be assigned to declared type "int"
    "Literal['str']" is incompatible with "int" (reportAssignmentType)
  c:\issues\pylance\5630\test2.py:15:10 - error: Expression of type "Literal['str']" cannot be assigned to declared type "int"
    "Literal['str']" is incompatible with "int" (reportAssignmentType)
2 errors, 0 warnings, 0 informations

So the call to concat results in type checking false negatives in the subsequent code.

I'd argue that the Never concat overload is causing more pain than value for users.

@Dr-Irv Dr-Irv reopened this Mar 14, 2024
@Dr-Irv
Copy link
Collaborator

Dr-Irv commented Mar 14, 2024

That's an interesting example, but it points out that it is the VSCode/pylance integration that is problematic.

So the call to concat results in type checking false negatives in the subsequent code.

That's only true if you do it within VS Code! Which is my point - the reachability analysis in VS Code is what is problematic.

From the command line on the above code for both pyright and mypy:

> pyright issue888.py
issue888.py
issue888.py:9:10 - error: Expression of type "Literal['str']" cannot be assigned to declared type "int"
    "Literal['str']" is incompatible with "int" (reportAssignmentType)
issue888.py:13:10 - error: Expression of type "Literal['str']" cannot be assigned to declared type "int"
    "Literal['str']" is incompatible with "int" (reportAssignmentType)
2 errors, 0 warnings, 0 informations

> mypy issue888.py
issue888.py:9: error: Incompatible types in assignment (expression has type "str", variable has type "int")  [assignment]
issue888.py:13: error: Incompatible types in assignment (expression has type "str", variable has type "int")  [assignment]
Found 2 errors in 1 file (checked 1 source file)

Reopening to see if @twoertwein has any further thoughts on this.

@debonte
Copy link

debonte commented Mar 14, 2024

That's only true if you do it within VS Code! Which is my point - the reachability analysis in VS Code is what is problematic.

From the command line on the above code for both pyright and mypy:

Hmm, that's not what I'm seeing. I'm guessing you ran pyright without access to the pandas stubs? Or with old stubs?

I copied the diagnostic output above from runs of pyright on the command line. VS Code was not involved.

> pyright test2.py
c:\issues\pylance\5630\test2.py
  c:\issues\pylance\5630\test2.py:11:10 - error: Expression of type "Literal['str']" cannot be assigned to declared type "int"
    "Literal['str']" is incompatible with "int" (reportAssignmentType)
1 error, 0 warnings, 0 informations

@Dr-Irv
Copy link
Collaborator

Dr-Irv commented Mar 14, 2024

Hmm, that's not what I'm seeing. I'm guessing you ran pyright without access to the pandas stubs? Or with old stubs?

I ran with the released version of pandas-stubs. If I run with the version we currently have on main, which I think is the version you are distributing with pylance now, I do replicate what you see, although mypy is analyzing the full file and reporting all the errors and not ignoring unreachable code.

Let's see what @twoertwein has to say.

@twoertwein
Copy link
Member

I think pandas-dev/pandas#57846 would be the ideal solution as it would avoid the false-negatives and false-positives mentioned in this issue: the Never overload would no longer be needed and Nones are no longer accepted.

As long as there is no pushback at pandas, I wouldn't mind making the change here at pandas-stubs before pandas releases a new version that has this change

@Dr-Irv
Copy link
Collaborator

Dr-Irv commented Mar 14, 2024

As long as there is no pushback at pandas, I wouldn't mind making the change here at pandas-stubs before pandas releases a new version that has this change

Or we can go ahead and make the change to not accept None in the list of iterables, and see if we get any complaints about that.

@debonte
Copy link

debonte commented Mar 14, 2024

mypy is analyzing the full file and reporting all the errors and not ignoring unreachable code.

That's because mypy is evaluating pd.concat(foo()) as Any rather than Never. But mypy has it's own issues with this Never overload. Consider this:

from typing import Iterable, Any

import pandas as pd

#
# concat with Iterable[Any]
#
def foo() -> Iterable[Any]:
    return [3, 4, 5]

x = pd.concat(foo())
reveal_type(x) # Any

#
# concat with Iterable[None]
#
reveal_type(pd.concat([None])) # Never
y: Any = pd.concat([None])
reveal_type(y) # Mypy never reveals this type

#
# Obvious error
#
z: int = "str"

Mypy emits the following. Note that there's no error for the last line and no result for reveal_type(y):

test5.py:12: note: Revealed type is "Any"
test5.py:17: note: Revealed type is "Never"
Success: no issues found in 1 source file

But if you comment out this part:

#
# concat with Iterable[None]
#
reveal_type(pd.concat([None])) # Never
y: Any = pd.concat([None])
reveal_type(y) # Mypy never reveals this type

...you'll get:

test5.py:12: note: Revealed type is "Any"
test5.py:24: error: Incompatible types in assignment (expression has type "str", variable has type "int")
  [assignment]
Found 1 error in 1 file (checked 1 source file)

@Dr-Irv
Copy link
Collaborator

Dr-Irv commented Mar 15, 2024

For those following this thread, I found a fix in #890 that supports our use case (disallow Iterable[None]) and also avoid the "unreachable code" aspect if the type checker infers the argument to be list[Any]

@Multihuntr
Copy link

Multihuntr commented Mar 20, 2024

Sorry, I know that this is closed, but, I want to understand, so...

Why is it important to explicitly create an overloaded function signature for the case the caller provides the wrong input types?

To make a contrived example, why not also have one for concat(objs: int, ...) -> Never? I would have thought that the return type when providing the wrong input type was always implicitly Never.

@Dr-Irv
Copy link
Collaborator

Dr-Irv commented Mar 20, 2024

Why is it important to explicitly create an overloaded function signature for the case the caller provides the wrong input types?

To make a contrived example, why not also have one for concat(objs: int, ...) -> Never? I would have thought that the return type when providing the wrong input type was always implicitly Never.

The way the type checkers work is that if your object has a type that is not Any, then if the stub doesn't have a matching overload for the specified type, it will see that as incorrect code, but that is ONLY if you turn on type checking. If you don't turn on type checking, then pylance is still using the stubs, but not doing the checks, so that it can analyze unreachable code.

Here is an example where I did a screenshot:
image

For the first call, the type checker shows it is incorrect, but the call to func2() then creates unreachable code. If I were to turn off type checking, then the pop up message wouldn't appear, but the code at the bottom would still be unreachable.

@Multihuntr
Copy link

The way the type checkers work is that if your object has a type that is not Any, then if the stub doesn't have a matching overload for the specified type, it will see that as incorrect code, but that is ONLY if you turn on type checking.

Ok, I'm with you. Giving type hints allow type checkers to see if you have given an incorrect type. If you have labelled your types, and they don't match, then the type checker flags it and an editor like VSCode will show it with a red squiggly line. Cool, good, normal type checking.

If you don't turn on type checking, then pylance is still using the stubs, but not doing the checks, so that it can analyze unreachable code.

Sure, makes sense. Whichever signature matches an Any case is the default when type checking is off. If your Any case says it will Never return, and you get a false positive (incorrectly state there will be a problem).

I think I understand what you are showing me. But, my question was actually maybe a little more about the design philosophy. If type checkers can already tell you that the types don't match any of the signatures, then why is there a signature that catches the case that it would not match any of the other signatures? Obviously incorrect types will cause problems; that's why type checkers flag it as a problem.

Or to put it another way: what is the situation where a type checker telling you there's an error in your program is insufficient?

@Dr-Irv
Copy link
Collaborator

Dr-Irv commented Mar 20, 2024

I think I understand what you are showing me. But, my question was actually maybe a little more about the design philosophy. If type checkers can already tell you that the types don't match any of the signatures, then why is there a signature that catches the case that it would not match any of the other signatures? Obviously incorrect types will cause problems; that's why type checkers flag it as a problem.

This more relates to what we are trying to do for pd.concat() with pandas-stubs.

We want to accept Iterable[pd.DataFrame | None], but NOT accept Iterable[None]

So to put it another way, it is acceptable to pass a list where you mix DataFrame, Series and None, but it is not acceptable to pass a list of only None.

If we only had Iterable[pd.DataFrame | None] in the stubs, then the type checker wouldn't catch [None, None] as an argument.

Or to put it another way: what is the situation where a type checker telling you there's an error in your program is insufficient?

I don't understand what you mean by "insufficient". My team uses the stubs for pandas (and other libraries) to help us flag incorrect code. There are occasions where the code is correct, and the type checker will say it isn't. That's usually an issue with the stubs, and in some cases it is because the type of the object cannot be fully inferred. For example, if you have a DataFrame variable df with columns "a" and "b", then we know that df["a"] is a Series, but we don't know if that series contains integers, floats, strings, etc. So the type of df["a"] is "partially unknown" (you will see Series[Any] if you do a reveal_type())

There are also occasions where the type checker will say the code is correct, but the code will fail at runtime. In those cases, it's either because variables of our program don't have all the types set up right (or a variable is inferred as Any), or it could be because it was impossible for the stubs to say "I except all these types, but NOT certain types".

A good example is with parameters that need to be a sequence of strings. If we declare the type of a parameter as Sequence[str], then any string will match. That's incorrect. So we created a type alias called SequenceNotStr that corresponds to being a sequence that is not a pure string. There are most likely cases in pandas stubs where we need to use SequenceNotStr instead of Sequence[str] to determine whether we mean a sequence is allowed versus either a pure string OR a sequence of strings. Maybe someday someone will get around to looking at all of our uses of Sequence[str] and determine the places where it needs to be replaced.

@Multihuntr
Copy link

Multihuntr commented Mar 20, 2024

It seems like you are really using the typing system to do a lot!

We want to accept Iterable[pd.DataFrame | None], but NOT accept Iterable[None]

Ah, I see. My initial instinct tells me this should be a runtime check instead. But, of course, if you can do it with typing, why not do it with typing? I misunderstood the purpose of the signature, and thus, this realisation answers my query.

Although, from my understanding of inferred types, this really only covers the case that the user creates a list that is guaranteed to only contain Nones. Which is a pretty weird edge case to check for. But, whatever, if you detect can such subcases as being problematic in typing, then why not?

Side note: I actually would have expected passing Iterable[None] to return an empty DataFrame. Less special cases are better.

There are also occasions where the type checker will say the code is correct, but the code will fail at runtime.

If I recall correctly, those are called "runtime errors". (Sorry, I had to. I do appreciate you answering my questions, but this was too good an opportunity)

Anyway, I see what you mean that it is an issue with pylance. Either the argument is strictly an Iterable[None], in which case the code will always error, or the argument is not Iterable[None] (e.g. Iterable[Any]), in which case pylance shouldn't be considering that overloaded signature for our code. The signature is obviously not incorrect, it's just being picked up for the wrong inputs.

Thanks for answering my questions. I learned something!

@Dr-Irv
Copy link
Collaborator

Dr-Irv commented Mar 20, 2024

It seems like you are really using the typing system to do a lot!

I have discovered lots of bugs by converting untyped code to typed code. In addition, turning on type checking has prevented runtime errors from occurring.

My initial instinct tells me this should be a runtime check instead.

It is a runtime check in pandas.

But, of course, if you can do it with typing, why not do it with typing?

It's more that if we can catch it with typing, then we can avoid the runtime error from occurring.

Although, from my understanding of inferred types, this really only covers the case that the user creates a list that is guaranteed to only contain Nones. Which is a pretty weird edge case to check for.

Agreed, but it is what the pandas source does.

Side note: I actually would have expected passing Iterable[None] to return an empty DataFrame. Less special cases are better.

You could contribute to the discussion at pandas-dev/pandas#57846 . It's not clear why the design of the pandas API accepts None, but that's been around for a while, and this issue here led to the issue there being created.

If I recall correctly, those are called "runtime errors". (Sorry, I had to. I do appreciate you answering my questions, but this was too good an opportunity)

A rose is a rose by any other name!

Thanks for answering my questions. I learned something!

You're welcome.

@lgonzalez-silen
Copy link

The fix took care of most of the concat calls in my codebase (thanks!), but there's one stubborn case in one of my files for for an Iterable[None | Any] accumulator in a loop:

import pandas as pd


# no param
def foo():
    return pd.DataFrame()


df_acc = None
df_new = foo()
pd.concat([df_acc, df_new])
print("reachable")


# the param makes the second concat call unreachable if the return is stored
def foo2(df):
    return pd.DataFrame()


pd.concat([df_acc, foo2()])  # direct use works
print("reachable")

# the param makes the second concat call unreachable
df_new = foo2()
pd.concat([df_acc, df_new])  # var use breaks
print("unreachable")

I'll open a new issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Reshaping Concat, Merge/Join, Stack/Unstack, Explode
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants