Adding Vector conversion for a DataFrame #1461

bkamins · 2018-07-24T20:09:04Z

Every once in a while I need to access columns of a DataFrame as a vector of vectors. This is exactly what df.columns is, but of course we should not expose it. On the other hand eachcol is not very user friendly now.

What we could do:

add something like Vector(df::DataFrame) = copy(df.columns) conversion;
redefine eachcol to be more usable.

Actually I prefer the first option as we already have Matrix conversion. Any thoughts?

The text was updated successfully, but these errors were encountered:

pdeffebach · 2018-07-27T14:46:51Z

I think we should add an eachcol operation that doesn't return a (name, col) tuple, rather just iterating through columns, so it is easier to use like a vector of vectors.

bkamins · 2018-07-27T15:02:42Z

So the todo would be:

update current eachcol to support new iteration protocol
add kwarg usenames to eachcol that is true by default; if it is false then do not pass names in the iteration
make DFColumnIterator have a second parameter that is true or false and keeps information which style we want (so that we can dispatch on it)
define map in such a way that if usenames is false a standard map is used (not the custom one that is present currently)

If we are OK with this plan I can implement it.

pdeffebach · 2018-07-27T15:15:59Z

Would changing the output of the iterator based on a keyword argument lead to type instability? If it is true, it returns a tuple, if not, it returns a vector?

bkamins · 2018-07-27T15:22:10Z

The idea is the following:

struct DFColumnIterator{U, T <: AbstractDataFrame}
    df::T
end
eachcol(df::T; usenames::Bool=true) where T<: AbstractDataFrame= DFColumnIterator{usenames, T}(df)

and then in any function you specify:

somefunction(itr::DFColumnIterator{true}) = ...

or

somefunction(itr::DFColumnIterator{false}) = ...

And it will be type stable AFAIK.

pdeffebach · 2018-07-27T15:28:01Z

I think that is a good idea. maybe names instead of usenames. It kind of makes sense to define another iterator but I don't know what the name would be.

bkamins · 2018-07-27T15:29:02Z

Ah - I see what you mean - we could use Val if we find that the compiler complains. But this should be a small union and those should be handled efficiently. (EDIT: this refers to your earlier question)

pdeffebach · 2018-07-27T15:36:34Z

It might be hard to make this as fast as df[i] for i in 1:ncol(df)

using BenchmarkTools
df = DataFrame(rand(100,100));

function newarray(df::DataFrame, f::Function)
       [f(df[i]) for i in 1:ncol(df)]
end

function newarrayiter(df::DataFrame, f::Function)
       [f(col) for (name, col) in eachcol(df)]
end

julia> @btime newarray($df, mean);
  7.931 μs (104 allocations: 2.52 KiB)

julia> @btime newarrayiter($df, mean);
  30.791 μs (203 allocations: 5.59 KiB)

Perhaps this performance difference will go away if we stop returning name.

bkamins · 2018-07-27T15:54:37Z

This in tests is actually faster:

struct Test{T <: AbstractDataFrame}
    df::T
end
ec(df::AbstractDataFrame) = Test(df)
Base.length(itr::Test) = ncol(itr.df)

function Base.iterate(itr::Test, state::Int=0)
    state += 1
    ncol(itr.df) < state && return nothing
    (itr.df[state], state)
end

than [f(df[i]) for i in 1:ncol(df)]

pdeffebach · 2018-07-27T16:11:05Z

Cool. let's add the functionality then, and if we decide we need it to be a separate iterator rather than a parametric type we can always do that.

bkamins · 2018-07-27T16:14:39Z

OK - @nalimilan - do you have any comments before the PR?

pdeffebach · 2018-07-27T19:25:56Z

This actually might be a good time to think about #1335 and type stability of columns. What if DataFrames isn't type stable, but its iterator is?

bkamins · 2018-07-27T19:31:50Z

Yes - this is the issue, but fortunately only noticeable if work done on a column is small; if the work is large enough then it is usually delegated to a function that works as barrier-function.

pdeffebach · 2018-07-27T19:55:47Z

True. I've been benchmarking some large aggregate operations recently and I think that inference is a barrier to high performance for this. aggregate iterates, so maybe a decent amount of bottlenecks could be gotten rid of with just a typed iterator without the need for a NamedTuple implementation. But I don't know if typed iterators are a thing.

nickeubank · 2018-07-27T19:58:56Z

This actually might be a good time to think about #1335 and type stability of columns. What if DataFrames isn't type stable, but its iterator is?

Sorry, I recognize this is a touch off topic, but not sure where its best to bring it up: seems like a lack of type-stability of columns underlies many of the performance issues for DataFrames (if I understand what's going on). Is moving to type-stable columns a subject of discussion? If so, where?

pdeffebach · 2018-07-27T20:06:53Z

I would say #1335 and #1256 have this discussion. Then there is nl/typed which has some code for an implementation.

bkamins · 2018-07-27T20:16:19Z

And #744 - and old issue that is still open

nickeubank · 2018-07-27T20:17:27Z

thx

nalimilan · 2019-01-21T09:03:45Z

@bkamins Do you think something still needs to be done here?

bkamins · 2019-01-21T10:06:13Z

It can be closed given our current implementation (there is still a deprecation period finished by #1613, but we will not lose track of it). The type stability issue is important, but I would discuss it in the other threads.

bkamins closed this as completed Jan 21, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adding Vector conversion for a DataFrame #1461

Adding Vector conversion for a DataFrame #1461

bkamins commented Jul 24, 2018

pdeffebach commented Jul 27, 2018 •

edited

Loading

bkamins commented Jul 27, 2018 •

edited

Loading

pdeffebach commented Jul 27, 2018

bkamins commented Jul 27, 2018

pdeffebach commented Jul 27, 2018

bkamins commented Jul 27, 2018 •

edited

Loading

pdeffebach commented Jul 27, 2018

bkamins commented Jul 27, 2018

pdeffebach commented Jul 27, 2018

bkamins commented Jul 27, 2018

pdeffebach commented Jul 27, 2018

bkamins commented Jul 27, 2018

pdeffebach commented Jul 27, 2018

nickeubank commented Jul 27, 2018

pdeffebach commented Jul 27, 2018

bkamins commented Jul 27, 2018

nickeubank commented Jul 27, 2018

nalimilan commented Jan 21, 2019

bkamins commented Jan 21, 2019

Adding Vector conversion for a DataFrame #1461

Adding Vector conversion for a DataFrame #1461

Comments

bkamins commented Jul 24, 2018

pdeffebach commented Jul 27, 2018 • edited Loading

bkamins commented Jul 27, 2018 • edited Loading

pdeffebach commented Jul 27, 2018

bkamins commented Jul 27, 2018

pdeffebach commented Jul 27, 2018

bkamins commented Jul 27, 2018 • edited Loading

pdeffebach commented Jul 27, 2018

bkamins commented Jul 27, 2018

pdeffebach commented Jul 27, 2018

bkamins commented Jul 27, 2018

pdeffebach commented Jul 27, 2018

bkamins commented Jul 27, 2018

pdeffebach commented Jul 27, 2018

nickeubank commented Jul 27, 2018

pdeffebach commented Jul 27, 2018

bkamins commented Jul 27, 2018

nickeubank commented Jul 27, 2018

nalimilan commented Jan 21, 2019

bkamins commented Jan 21, 2019

pdeffebach commented Jul 27, 2018 •

edited

Loading

bkamins commented Jul 27, 2018 •

edited

Loading

bkamins commented Jul 27, 2018 •

edited

Loading