[BREAKING] fix isagg to correctly use a fast path #2357
Could you combine `validate_aggregate` with `check_aggregate`? Basically, we have a fallback which returns `f`, and for some combinations of function and type we return an optimized object.
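The fallback-plus-fast-path shape being proposed can be sketched like this (a hedged illustration only; `check_aggregate` here and the `FastSum` marker are stand-ins, not the actual DataFrames.jl internals):

```julia
# Illustrative sketch: a single entry point that returns the function
# itself as a fallback, and an optimized object for known combinations.
struct FastSum end  # stand-in for an optimized reduction object

# Fallback: no fast path, just return the function unchanged.
check_aggregate(f, v::AbstractVector) = f

# Optimized path for a specific function/eltype combination.
check_aggregate(::typeof(sum), v::AbstractVector{<:Number}) = FastSum()

check_aggregate(sum, [1, 2, 3])   # returns FastSum()
check_aggregate(sum, ["a", "b"])  # falls back to sum itself
```

Dispatch picks the optimized method only when both the function and the element type match; everything else flows through the generic fallback.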
I agree. And I did it now, but it was much easier to handle them separately when developing changes :)
I think these methods work for any type. We just have a faster path when `isconcretetype(S) && hasmethod($initf, Tuple{S})`, but we don't require `typemin` and `typemax` in general. In particular, we have code for `CategoricalArray`.
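The guard quoted above can be illustrated in plain Julia (a sketch; `initf` is a placeholder for whatever initializer a given reduction interpolates as `$initf`, here `zero`):

```julia
# Hypothetical stand-in for the reduction's initializer function.
initf = zero

# Fast path only when S is concrete and the initializer has a method for it.
fast_path(::Type{S}) where {S} = isconcretetype(S) && hasmethod(initf, Tuple{S})

fast_path(Int)      # true: concrete, and zero(::Int) exists
fast_path(Integer)  # false: abstract type
fast_path(String)   # false: no zero(::String) method
```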
I think we need this restriction:
but I will make the signature looser.
Damn, again that multi-column issue... So yeah, the fast path needs to be disabled when the column can contain `MULTI_COLS_TYPE` or `AbstractVector`.
Indeed this case is a bit ugly, as code that generates a single column when a data frame stores only scalars will suddenly create multiple columns when it stores e.g. named tuples. Of course this is a more general problem than `first` and optimized reductions. I wonder whether we should require explicitly stating that you want the result to be destructured into multiple columns, e.g. via `:x => MultiCol(x -> ...)`. Without this, we basically assume that only scalars should be stored in data frames, or many things may break.
We can indeed discuss this; it relates to what @pdeffebach wants, namely that we allow destructuring into multiple columns.

Just to be clear, the `MULTI_COLS_TYPE` part is not really problematic in my opinion: it just makes sure we throw an error (as we disallow it now), so that in the future we can process it correctly. This is what we try to do consistently everywhere, so that post-1.0 changes here will be non-breaking, as they currently throw an error (this is your pet trick to handle SemVer AFAICT 😄).

A more problematic case is `AbstractVector`, as it was simply inconsistent between the slow and fast paths (the fast path was not expanding it, while the slow path does pseudo-broadcasting). So there are actually two questions, one of them being whether to add a `MultiCol` wrapper in the future or just allow multiple columns to be returned and silently accept it (I thought @pdeffebach wanted to silently accept this).

Although it is a legacy from the very distant past: if I were designing it now I would never expand anything, neither in rows nor in columns, just store one result per group in `combine` and then tell users to use `flatten` to flatten it. But this is a completely different design in comparison to what we have now, so this is just a side comment.
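The "store one result per group, then `flatten`" design can already be approximated with the current API (a hedged sketch, assuming DataFrames ≥ 0.21, where `flatten` is available):

```julia
using DataFrames

df = DataFrame(g = [1, 1, 2], x = [10, 20, 30])

# Store one vector-valued result per group (each cell holds the vector)...
res = combine(groupby(df, :g), :x => (v -> [v .+ 1]) => :xs)

# ...and let the user expand it explicitly instead of auto-expanding rows:
flat = flatten(res, :xs)
```

Here `res` has one row per group, and `flatten` performs the row expansion only when explicitly requested.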
Actually I had forgotten that we don't allow returning `MULTI_COLS_TYPE` currently, so at least that's safe. `AbstractVector` remains problematic though. I wonder whether there are strong use cases for pseudo-broadcasting. It would be safer to make this opt-in for 1.0; otherwise we don't allow working with data frames whose cells contain vectors.

(BTW, instead of `MultiCol`, maybe we could reuse `AsTable`, but wrapping the returned value this time.)
I have thought about it when cleaning up the rules of pseudo-broadcasting recently. Here is what we have on 0.21:

So in short, we allow working with data frames that contain vectors: when a transformation returns such a vector per group (e.g. `first`), `Ref` can be used, as in Base, to protect the result.
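The `Ref` protection mentioned above can be shown with a small example (a sketch of the 0.21-era semantics described here, where a returned `Ref` is unwrapped and stored as a single cell rather than pseudo-broadcasted):

```julia
using DataFrames

df = DataFrame(g = [1, 1, 2], x = [10, 20, 30])
gd = groupby(df, :g)

# Without protection a returned vector is spread across rows;
# wrapping it in Ref keeps the whole vector in one cell per group.
protected = combine(gd, :x => (v -> Ref(v)) => :xvec)
```

This mirrors how `Ref` protects a container from broadcasting in Base Julia.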
Let's continue this discussion in a separate issue?
OK. Can you please open the issue explaining what you would want to change? (Given the three key considerations: 1) currently we unwrap vectors, 2) `Ref` protects, 3) in the future we want to add support for passing multiple columns.)
It's too bad that irrationals don't define `zero` and `one` so that `zero(x) + zero(x)` has the same type as `x + x`... I don't remember offhand why they return `Bool`, but that may have to do with the fact that irrationals promote to whatever type they are combined with.
I have not designed it 😄, but actually it is inconsistent:

and you get `Float64`. I have commented in JuliaLang/julia#36978 to keep track of it.
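A minimal reproduction of the type inconsistency in plain Julia (my own example, not the snippet from the original comment):

```julia
# zero(::Irrational) returns a Bool in Base, so the "zero" of π does not
# add up to the same type that π itself produces under addition.
zero(π)                     # false, i.e. a Bool
zero(π) + zero(π)           # 0, an Int (Bool + Bool promotes to Int)
π + π                       # 6.283185307179586, a Float64
typeof(zero(π) + zero(π)) == typeof(π + π)  # false
```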
Why this special case? `last(skipmissing(x))` doesn't work outside DataFrames (maybe it should, though), and it doesn't seem to work with any type in DataFrames either.
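For reference, the Base behaviour under discussion (a minimal example; the exact error depends on the Julia version):

```julia
v = [1, 2, missing]

first(skipmissing(v))   # 1: works, since `first` only needs iteration
# last(skipmissing(v)) # throws, because `SkipMissing` defines no
#                      # `lastindex`, which `last` relies on for O(1) access
```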
This is a special case because it works in the fast path but fails in the slow path. I have changed the comment.
Hmm, it works only with `GroupedDataFrame` input, not with `DataFrame` input. I don't get why.

Also, I thought the goal of this PR was to make the slow and fast paths behave the same, so shouldn't this always throw an error whatever the eltype?
I did not see the benefit of `last∘skipmissing` throwing an error in cases where we can efficiently produce a correct result. The fact that `last∘skipmissing` errors in Base is only due to the `O(1)` restriction in the `last` docstring; e.g. `first∘skipmissing` works in Base because its docstring does not require `O(1)`. I think we should change Base so that `last∘skipmissing` works.

It should work in `combine` both for `GroupedDataFrame` and `DataFrame`, and with `select` on `GroupedDataFrame` (pseudo-broadcasting is applied in this case). It is expected to fail for `select` on `DataFrame`. Do you observe a different behaviour?
Yeah, `last(skipmissing(x))` and `lastindex(skipmissing(x))` should probably be defined when `x` is an `AbstractArray`, since they don't require going over the whole collection.

I get this:
Yeah, I had forgotten about this path. So you have:

which is unfortunate.

I would not touch it now though, as it is not easy to fix.

In the long term we should rewrite the whole engine for `select`/`transform`/`combine` to be unified. Now, unfortunately, I have designed it in stages: when initially implementing `select` and `transform` we had not envisioned that we would have the full ecosystem we have now, so we essentially have two processing engines, one for `combine` on a `GroupedDataFrame` and the other for `select` on a `DataFrame`.

This should be implemented in a way that `select`/`transform`/`combine` on a `DataFrame` are always essentially calls to `GroupedDataFrame` on a group formed by no columns (a single group). Additional things to remember when implementing it: `source_columns => function => target_col_name`
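The "group formed by no columns" idea can be approximated with the current API (a hedged sketch; it assumes `groupby` accepts an empty column selector and yields a single group covering all rows):

```julia
using DataFrames

df = DataFrame(x = [1, 2, 3])

# Grouping on zero columns produces a single group containing every row,
# so combine on it should match combine on the DataFrame itself.
gd = groupby(df, Symbol[])
combine(gd, :x => sum) == combine(df, :x => sum)
```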
OK. No hurry, maybe just copy your comment to an issue to keep your analysis available.
All this is tracked in separate issues.