Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix depwarns of getindex #1585

Merged
merged 14 commits into from
Nov 8, 2018
Merged

Conversation

bkamins
Copy link
Member

@bkamins bkamins commented Nov 1, 2018

I am in the process of cleaning up after getindex revision. I will continue as this is not finished (but it gives
the idea of changes).

A thing to discuss from what I have already changed is for hcat! - if we want a target functionality to put views or copies if RHS of hcat! is a SubDataFrame.

@bkamins
Copy link
Member Author

bkamins commented Nov 2, 2018

@nalimilan Before I move forward I would like to discuss one design issue. In the transition period I need to write specialized code for SubDataFrame before it is defined in the package (e.g. in src/abstractdataframe/abstractdataframe.jl`). What do you think would be the best approach to handle this?

@nalimilan
Copy link
Member

I'd rather define duplicated methods in deprecated.jl, so that the main code remains clean and they are easy to remove. That should also ensure SubDataFrame has already been defined.

@bkamins
Copy link
Member Author

bkamins commented Nov 2, 2018

Good point - thanks.
So - for now I am left with the question about hcat!. The status is that currently hcat! for DataFrame passes a column without copying, but it does copy SubDataFrame columns.

The question is if we want to keep it
or
we should pass a view of a vector constituting the column of a SubDataFrame (so we would be consistent, in that hcat! produces an object that allows for modification of source data frames, but the downside is that we would have columns that are views not normal vector).

@nalimilan - do you have any opinion on this?

@nalimilan
Copy link
Member

Tricky question! If we make copies automatically then there will be no (simple) way of concatenating a SubDataFrame while preserving views; conversely, if we don't make copies automatically, people can easily and efficiently call copy manually. As you said, it would also be consistent in that modifying the result would modify the inputs. So these are two arguments in favor of preserving views.

OTOH I'm not sure it's a very common need, and it makes things a bit more complex (creating SubArray columns). In terms of performance, preserving views is faster for one-off operations, but slower for repeated operations. In particular, if people use hcat! with by, we'd better preserve views or performance will be terrible.

Overall I'd lean towards preserving views, but it's a tough call.

@bkamins
Copy link
Member Author

bkamins commented Nov 2, 2018

I am slowly moving forward. I make hcat! as you suggested.
Now the other point for you to look at is groupby where we make:

    sdf = df[cols]

selection. The tricky part is if we want to have sdf to be a SubDataFrame if df is a SubDataFrame (this is what would happen if I did not add a check as currently in this PR). This is related to groupby performance on SubDataFrame (maybe it is better that sdf is a DataFrame always - this is what is currently implemented in this PR).

@nalimilan
Copy link
Member

I'm not sure it really matters whether we make a copy or not in groupby. Since we go over the grouping columns twice, copying could actually be a bit faster. Anyway I expect the difference to be negligible compared with the cost of grouping and combining (note that this is orthogonal to the fact that we pass a SubDataFrame to the user-provided function). So I'd go with the simplest solution.

@test all(ismissing, df[2])
@test all(ismissing, df[3])
# TODO: enable those tests after getindex deprecation period
# @test typeof(df[:, 1]) == Vector{Union{Int, Missing}}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use @test_broken for these?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not broken - the test passes, but prints warnings (massively) and I understood that you wanted to avoid printing unnecessary warnings.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's OK if the tests print warnings, as long as they are correct in the long term. Better have good tests that are temporarily verbose than having to enable them manually later.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I understand we go the following route:

  • it is not a problem if tests print warnings;
  • however, we want to avoid printing warnings by library functions in normal usage.

Yes?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right.

@nalimilan
Copy link
Member

The amount of code that needs duplicating is terrifying. I wonder whether it wouldn't be simpler even for users to change the behavior without a deprecation. Indeed in most situations the deprecation will force people to write sdf[:, col] even if they don't actually need a copy. It's probably relatively rare to rely on the copying behavior. Though of course it could break code without notice. What do you think?

@bkamins
Copy link
Member Author

bkamins commented Nov 3, 2018

I propose that I finish this PR (I will remove WIP then).
Then we can have a look how much copying was performed.
Actually - given your proposal - the code that is not needed is not that hard to track 😄.

However, I would retain deprecation - even for a short period to simplify debugging (even something what Julia 0.7 and 1.0 had - actually it worked well from my experience).

The big question is if we wait for setindex! changes and improved performance of groupby before the release (I would wait for it, but I am not sure what is the status of #1571).

@bkamins bkamins changed the title WIP: fix depwarns of getindex Fix depwarns of getindex Nov 3, 2018
@bkamins
Copy link
Member Author

bkamins commented Nov 3, 2018

@nalimilan This should be good for a reivew now.
I have moved all offending code to src/deprecated.jl.
Also I have commented out some old tests and some new tests that throw a warning (I can uncomment them, but it it easier to read console log with them commented if you would want to).
The only tests that are massively offending that I retained turned on are for view construction. The upcoming change of meaning view(df, x) produces a lot of warnings but actually this is the change that is most likely to lead to subtle errors in the future.

data = [df[name] for df in dfs]
# TODO: replace with commented out code after getindex deprecation
# data = [df[name] for df in dfs]
data = [df isa DataFrame ? df[name] : df[:, name] for df in dfs]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't it be better to always use view here (even if that's temporary)? Same for completecases and maybe other places.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could do this, but this would be breaking.
That is: currently df[name] when df is a SubDataFrame produces a copy and so I used df[:, name] to still make a copy temporarily (it will be a view after getindex deprecation).

Do you think making a view now would be better?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But here it's an implementation detail which shouldn't have any user-visible consequences, right?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah - right. I mixed it up with hcat! (this PR is really long). I will push a PR changing copy to view in places where it is not user visible.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have pushed a PR implementing what you have requested. Actually it is a good way to test if something will not break in the future. The only limitation we have is with vcat with which you have started (I have noted it in the code).

@nalimilan
Copy link
Member

OK, thanks, let's do this then, but remove the deprecations more quickly than we usually do.

@bkamins
Copy link
Member Author

bkamins commented Nov 6, 2018

@nalimilan This should be cleaned up now for the next round of checking.

The only method that is left problematic is Base.getindex(itr::Cols, inds...) = getindex(itr.df, inds...) as it is sometimes called internally and occasionally can throw a warning, but this should not be much.

# the code below assumes that only DataFrame and SubDataFrame
# are subtypes of AbstractDataFrame
# it should be removed ASAP after deprecation
data = [df isa DataFrame ? df[name] : view(parent(df)[name], rows(df)) for df in dfs]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It shouldn't matter much since it's temporary, but better call view only for SubDataFrame, so that the fallback works for any type. Then you don't need the scary comment.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I cannot do this here because SubDataFrame is not defined at this stage of DataFrames.jl loading. And I would have to move the whole function much later in the loading process to make it work, which I think should not be done since it is only a temporary change.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Argh. Well, let's say we will remove this soon anyway...

@bkamins
Copy link
Member Author

bkamins commented Nov 8, 2018

Good to merge (I do not want to do it myself, as it is too delicate 😄)?

@nalimilan nalimilan merged commit d102854 into JuliaData:master Nov 8, 2018
@nalimilan
Copy link
Member

So scary...

@bkamins bkamins deleted the fix_getindex_depwarns branch November 8, 2018 19:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants