Unify eachcol and columns functions #1590
Conversation
I had to add […]

This is still WIP as I have to think if we want to make […]

This should be good to have a look at now. I decided to leave the types only iterators (not […]). Two notes: […]

I have fixed Julia 0.7 issues with broadcasting in tests.
src/abstractdataframe/iteration.jl (Outdated)

  """
- struct DFColumnIterator{T <: AbstractDataFrame}
+ struct DFColumnIterator{T<:AbstractDataFrame, C}
Instead of C being a Boolean, it could be either AbstractVector or Tuple{Symbol,AbstractVector}. That way you could do <: AbstractArray{C} here, and you wouldn't need to define eltype and such below.

BTW, it's a good occasion to move from a tuple to a pair. That shouldn't affect most uses but it's more natural here.
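For illustration, a rough sketch of that idea (the struct name is made up for this example, not the actual implementation, and column access is written in the itr.df[j] style used in the diff above):

using DataFrames

# Hypothetical sketch: carry the element type C in the signature so that
# eltype, length and friends come from the AbstractVector{C} supertype.
struct ColumnIteratorSketch{T<:AbstractDataFrame, C} <: AbstractVector{C}
    df::T
end

Base.size(itr::ColumnIteratorSketch) = (size(itr.df, 2),)

# with C a Pair type, indexing could yield name => column pairs:
Base.getindex(itr::ColumnIteratorSketch{<:AbstractDataFrame, <:Pair}, j::Int) =
    names(itr.df)[j] => itr.df[j]

# with C == AbstractVector, it could yield the bare columns:
Base.getindex(itr::ColumnIteratorSketch{<:AbstractDataFrame, <:AbstractVector}, j::Int) =
    itr.df[j]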
That said, if we keep the special behavior of map which returns a DataFrame, maybe we'd better not make this iterator an AbstractVector. Same for the row iterator if we want to do something similar in the future.
@nalimilan Thank you for the comments. They touch on things I was not sure about myself.

Actually the crucial design issue is whether or not to make them AbstractVector subtypes. This is tempting, and then they should carry the eltype in the signature (actually - interestingly - this is the only way it is possible to implement it AFAIK) and the design would be simplified. If we went for this I was thinking about renaming the types to DFRowVector and DFColumnVector to better convey what we produce.

But we have this special map behavior that produces a DataFrame from eachcol. If we went for AbstractVector subtyping I would remove this functionality. If we feel that we like this functionality I would prefer to create a new specialized method for it, e.g. with the signature colmap(::Union{Type, Function}, ::AbstractDataFrame), and not overload map with a method that breaks its intuitive contract.

Also, I do not see much sense in doing the same for eachrow (we would not know how to transform it to a DataFrame later).

In summary, having thought of it I would:
- make both types AbstractVector subtypes
- deprecate map for eachcol
- add a colmap function that does what map for eachcol now does (a rough sketch follows below)

When we agree if this is the way to go (or decide something else) I will fix this PR.
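A minimal sketch of the proposed colmap (the name and signature come from the proposal above; this is not an existing function, and column access is written in the df[name] style used in this era of the API):

using DataFrames

# Hypothetical colmap: apply f to every column and rebuild a DataFrame,
# keeping the original column names - i.e. what map over eachcol does today.
function colmap(f::Union{Type, Function}, df::AbstractDataFrame)
    res = DataFrame()
    for name in names(df)
        res[name] = f(df[name])
    end
    return res
end

# e.g. colmap(col -> col .^ 2, DataFrame(a = 1:3, b = 4:6))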
OK, so we need to choose between these two solutions:
- Define map(f, columns(df)) to return a data frame. We could add map(f, rows(df)) later if we want and that would be consistent. People have to use comprehensions to get a vector instead.
- Keep the standard behavior of map(f, columns(df)), i.e. have it return a vector. This is basically colwise, but with a more standard name (and we could deprecate it; the broadcasting form f.(columns(df)) is even more compact). We need another way to apply a function to each column of a data frame and get a data frame back: maybe mapcols(f, df), and mapcols!(f, df) for the in-place variant (the naming would be consistent with other column-wise functions, cf. #1514, Review row vs. column orientation of API). Then map(f, df) would apply a function to each row (this is what JuliaDB supports). See the short example just below this list.
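For illustration, the vector-returning forms mentioned in option 2 would look like this (a sketch, assuming columns(df) iterates the bare columns and map keeps its standard contract):

using DataFrames

df = DataFrame(a = 1:3, b = 4:6)

# all of these return a plain Vector, one element per column:
map(sum, columns(df))               # option 2 behavior of map
sum.(columns(df))                   # the broadcasting form mentioned above
[sum(col) for col in columns(df)]   # the comprehension form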
> Also, I do not see much sense in doing the same for eachrow (we would not know how to transform it to a DataFrame later).

That would be quite simple AFAICT: require the user-provided function to return a named tuple, and fill a data frame with that like we now do in map(::Function, ::GroupedDataFrame).
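For example, something along these lines (a sketch of the idea only, reusing the generator-to-DataFrame construction mentioned later in this thread, not the actual GroupedDataFrame code):

using DataFrames

df = DataFrame(x = 1:3, y = 4:6)

# user-provided function returning a named tuple per row
f(row) = (s = row[:x] + row[:y], d = row[:x] - row[:y])

# fill a data frame from the named tuples, one output row per input row
DataFrame(f(row) for row in eachrow(df))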
Cc: @piever to discuss the consistency with JuliaDB.
For me to understand: there are ColumnIterator and RowIterator types, and the decision is whether one should overload map on them to return a DataFrame rather than a vector of columns or a vector of rows?

In JuliaDB map acts by default on the rows and returns a table. We also have a map_rows which basically would correspond to map_rows(f, iter...) = DataFrame(f(i) for i in iter...). It is useful when f returns NamedTuples but iter... are not a DataFrame (they could be vectors or iterators), for example map_rows((x, y) -> (s = x+y, d = x-y), rand(100), rand(100)). Could this be added to DataFrames as well? In which case I think mapcols / map_rows is a strange pair. What could be a better pair of names?

For consistency with JuliaDB, option 2 seems much better (also, to keep map and filter consistent). I thought there was a colwise or similar in JuliaDB but apparently I'm misremembering. It looks like currently one has to do:
julia> t = table((a = rand(100), b = randn(100)));
julia> f(v) = partialsort(v, 1:10)
f (generic function with 1 method)
julia> table(map(f, columns(t)))
Table with 10 rows, 2 columns:
a           b
────────────────────
0.00392291  -2.52979
0.0221383   -2.01489
0.029867    -1.92054
0.0462468   -1.89799
0.0720458   -1.5951
0.0824684   -1.31414
0.105048    -1.25503
0.108803    -1.19262
0.1204      -1.13395
0.136053    -1.13125
This I think is not so bad, but maybe it'd be more awkward in DataFrames because DataFrame(map(f, columns(t))) would lose the column names?
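For illustration, one way to keep the names on the DataFrames side would be to rebuild the frame from the transformed columns plus names(df) (a sketch, assuming the two-argument DataFrame(columns, names) constructor and the df[name] column access of this era):

using DataFrames

df = DataFrame(a = rand(100), b = randn(100))
f(v) = partialsort(v, 1:10)

# transform each column, then reattach the original names
cols = [f(df[name]) for name in names(df)]
DataFrame(cols, names(df))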
> In JuliaDB map acts by default on the rows and returns a table. We also have a map_rows which basically would correspond to map_rows(f, iter...) = DataFrame(f(i) for i in iter...). It is useful when f returns NamedTuples but iter... are not a DataFrame (they could be vectors or iterators), for example map_rows((x, y) -> (s = x+y, d = x-y), rand(100), rand(100)). Could this be added to DataFrames as well? In which case I think mapcols / map_rows is a strange pair. What could be a better pair of names?

I don't really see the advantage of providing a map_rows function if DataFrame(f(i) for i in iter...) does the same thing TBH. And using a generator like that sounds more Julian (it's like creating dicts).
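A small usage example of that generator idiom (illustrative values only):

using DataFrames

x, y = rand(100), rand(100)

# the generator plays the role a map_rows function would have played
DataFrame((s = a + b, d = a - b) for (a, b) in zip(x, y))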
> For consistency with JuliaDB, option 2 seems much better (also, to keep map and filter consistent).

OK. That's good since that's the approach @bkamins has implemented. ;-)

But the example you show goes a bit against this statement AFAICT: map(f, columns(t)) returns a NamedTuple in JuliaDB, not a vector (which is why names are lost). That's logical since columns(t) gives a NamedTuple, but that would be different from data frames. That said, it's not a big deal. Maybe JuliaDB could support mapcols just for consistency and convenience?
Yep, maybe JuliaDB could deprecate map_rows as "unnecessary". The easiest is probably to add a mapcols(f, t; select = All()) method to JuliaDB. As I mentioned above, I'm afraid this creates a bit of cacophony on the JuliaDB side (mapcols and map_rows have different underscore styles and are not related in the obvious way). The cleanest is probably to deprecate (or at least not export) map_rows and rename setcol, popcol, pushcol ... to setcols, popcols, pushcols (as they accept multiple columns at once) and add mapcols to the collections. But this PR shouldn't worry about that, as all the changes to make things consistent should come from JuliaDB.
Makes sense. It's good to anticipate what API we both want to converge on in the long term.
Having thought about it, I will go on with what I have suggested and then put it under review.

OK - in 52e9010 you have what had to be changed to better support this intermediate deprecation.
Branch force-pushed from b33456e to e42b8d3.
@nalimilan Having tried what you propose, I now have a basic example where […]. So - in my opinion - we have to be breaking here, unless you have some better recommendation.

Hmm... I guess we can also override […]
src/abstractdataframe/iteration.jl (Outdated)

Base.size(itr::DataFrameColumns) = (size(itr.df, 2),)
Base.IndexStyle(::Type{<:DataFrameColumns}) = Base.IndexLinear()

function Base.getindex(itr::DataFrameColumns{<:AbstractDataFrame,
@propagate_inbounds wouldn't hurt here and below.
AFAIK it will not change anything significantly, as we do bounds checking anyway when we try to access column number j from df (note that df is an AbstractDataFrame, so we do not know at this point how it stores data).
Yes, but propagating this is better anyway, and the data frame implementation could use @boundscheck at some point.
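For reference, a toy illustration of what the annotation does (not DataFrames code; ColWrapper is a made-up type just to show the propagation):

using Base: @propagate_inbounds

# a minimal wrapper around a vector of columns
struct ColWrapper
    cols::Vector{Vector{Float64}}
end

# with @propagate_inbounds, an @inbounds at the call site is propagated into
# the body, so the inner w.cols[j] access can skip its own bounds check
@propagate_inbounds Base.getindex(w::ColWrapper, j::Int) = w.cols[j]

w = ColWrapper([rand(3), rand(3)])
@inbounds w[2]   # caller promises j is in bounds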
I have opened a separate issue #1594 for this, as this PR is complex enough and I am not sure if it will help anything (performance-wise, disabling bounds checking saves us nanoseconds, but then we have a dynamic dispatch on the result, which is orders of magnitude more expensive).
I'm not saying it matters, but it's cheap to add, so...
OK - I will add it, but I will leave the issue open for further analysis.
I will override […]. In summary: after we merge this, what other PRs would you want to finish before tagging a release? After you tag a release I will immediately submit PRs (probably it is better to have several for different types of deprecations) cleaning up all this mess.

Note that […]
@@ -138,11 +138,17 @@ function Base.iterate(itr::DataFrameColumns{<:AbstractDataFrame,
    return (_names(itr.df)[j] => itr.df[j], j + 1)
end

# TODO: remove this after deprecation period of getindex of DataFrameColumns
Why not put this in deprecated.jl?
It will be simpler for me to clean it up later (and I assume I will do it just after the coming release). deprecated.jl has 62 kB and almost 2000 LOC now, and it is actually easy to forget to fix something there.
I'd like to merge #1591 and finish a PR to speed up grouping on categorical arrays. Then yes, it would be nice to clean up all this mess.
test/grouping.jl (Outdated)

# f4(df) = [maximum(df[:c]), minimum(df[:c])]
# f5(df) = reshape([maximum(df[:c]), minimum(df[:c])], 2, 1)
# f6(df) = [maximum(df[:c]) minimum(df[:c])]
f1(df) = DataFrame(cmax = maximum(df[:c]))
Isn't this backwards?
Yes - but I thought this is what you wanted me to do when you said that this will be handled in a separate PR. I will revert it. Please hold off on reviewing for several minutes - I am pushing changes following your earlier comments and will let you know when I am finished.
@nalimilan - thank you for your patience with this PR. I believe that having first-class views of columns and rows of a […]. I have additionally opened #1595 to discuss the future of […]
You're welcome!
If there are no additional comments on this PR I will merge it tomorrow.
Implementing what was discussed in #1586.