Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make int_range() and int_ranges() work with no inputs and default to int_range(0, pl.len()) #17810

Open
deanm0000 opened this issue Jul 23, 2024 · 8 comments
Labels
enhancement New feature or an improvement of an existing feature needs decision Awaiting decision by a maintainer

Comments

@deanm0000
Copy link
Collaborator

Description

It seems from questions in the discord and on SO that int_range is mostly used as an index so I think it'd be a little more convenient if we could drop typing in pl.len()

So if we have df=pl.DataFrame({'a':[2,5,8]})

then instead of df.with_columns(z=pl.int_range(pl.len())) we could just do df.with_columns(z=pl.int_range())

Of course in this specific example it'd be quicker to do df.with_row_index('z') but in the cases where we're using it in an over or whatever it's just a small QOL improvement. Since it currently raises when there are no inputs, there shouldn't be any backward compatibility issues with this change.

I can make a PR to do this, just want to make sure it's what people want.

@deanm0000 deanm0000 added enhancement New feature or an improvement of an existing feature needs decision Awaiting decision by a maintainer labels Jul 23, 2024
@cmdlineluser
Copy link
Contributor

Coincidentally, the pl.row_index() issue received a bump earlier today.

#12420

@deanm0000
Copy link
Collaborator Author

Actually, not a coincidence I saw the bump and it reminded me of asking for this default.

@deanm0000
Copy link
Collaborator Author

@ritchie46 is this one ok to do or do you have any objections?

@Oreilles
Copy link

Oreilles commented Jul 24, 2024

This feels counter-intuitive to me: I would expect a range function called without specified bounds to raise.
Adding pl.row_index() (with the same implementation you propose) to the API would make more sense in my opinion.

EDIT: Furthermore, pl.int_range() could not be used as a join expression since it doesn't return an elementwise expression (see #12420 (comment))

@mcrumiller
Copy link
Contributor

Chiming in, I think pl.index() is better than pl.row_index(). It's simpler, and all horizontal expressions use horizontal_, so the "row" part is unnecessary.

It also seems to be a super obvious operation and very common to use, why is there any argument about it?

@deanm0000
Copy link
Collaborator Author

@mcrumiller I, for one, am not arguing against pl.index or pl.row_index or whatever other name it could have. It's just that those were already ruled out so this is a middle ground.

@Oreilles Lots of things are counterintuitive until you get used to them. That said, I don't really share that sense of it being counterintuitive since it exists in a context that has a length that is a natural default so why not make it the default?

@Oreilles
Copy link

Intuitiveness certainely is subjective, but using pl.int_range() to generate an index has some drawbacks that would cripple its use case compared to an actual row_index function.

  • int_range doesn't reflex the intent in the way row_index or with_row_index does (would with_int_range make sense ? Maybe, but arguably less intuitively).
  • int_range defaults to Int64 and index columns generated with with_row_index are UInt32, so they cannot be merged on by default.
  • int_range cannot be used as a join expression... which would likely be the primary use case for using it instead of with_row_index (see Add pl.row_num() as syntactic sugar for pl.int_range(0, pl.count()) #12420 (comment))

@deanm0000
Copy link
Collaborator Author

@Oreilles your arguments seem to be for having another expression to generate a numeric index rather than against having int_range have defaults when no parameters are passed.

It already defaults to starting at 0 when only 1 parameter is passed so why not have it default the length parameter to the context length when no parameters are input?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or an improvement of an existing feature needs decision Awaiting decision by a maintainer
Projects
None yet
Development

No branches or pull requests

4 participants