Unify eachcol and columns functions #1590

Merged: 39 commits, merged Nov 14, 2018
Changes shown below are from 18 of the 39 commits.
Commits
4f9c61a  unify eachcol and columns (bkamins, Nov 8, 2018)
65a2d1e  clean up deprecated code (bkamins, Nov 8, 2018)
c9b8e88  add paren (bkamins, Nov 8, 2018)
ebc6835  add rawcolumns and fix DataFrameStream (bkamins, Nov 8, 2018)
97af6cc  make DFColumnIterator and DFRowIterator subtypes of AbstractArray (bkamins, Nov 8, 2018)
a470817  avoid using map! on columns (bkamins, Nov 8, 2018)
17b28a0  change eltypes (bkamins, Nov 9, 2018)
c3663bf  test fixes (bkamins, Nov 9, 2018)
5a2e30c  cleanup accessor methods (bkamins, Nov 9, 2018)
4da925c  fix typo (bkamins, Nov 9, 2018)
1cd3d3c  fix another typo (bkamins, Nov 9, 2018)
0fd91aa  further fix getindex of iterators (bkamins, Nov 9, 2018)
7f10bba  fix test (bkamins, Nov 9, 2018)
717f47b  qualify depwarn (bkamins, Nov 9, 2018)
22f2b90  final fixes (bkamins, Nov 9, 2018)
32f2fb2  change broadcasting to map in tests (bkamins, Nov 10, 2018)
4bd4f3c  further Julia 0.7 fixes (bkamins, Nov 10, 2018)
f8a4dab  further Julia 0.7 fixes (bkamins, Nov 10, 2018)
f584ffb  Wording (nalimilan, Nov 10, 2018)
9bda15d  Update src/abstractdataframe/iteration.jl (nalimilan, Nov 11, 2018)
a317e77  Update src/abstractdataframe/iteration.jl (nalimilan, Nov 11, 2018)
a776b64  go for AbstractVector subtyping (bkamins, Nov 11, 2018)
c882551  fix tests (bkamins, Nov 11, 2018)
4536033  fix subtyping (bkamins, Nov 11, 2018)
91edc19  documentation cleanup (bkamins, Nov 11, 2018)
f57af37  Update docs/src/lib/types.md (nalimilan, Nov 11, 2018)
4566656  apply review comments (bkamins, Nov 11, 2018)
ceb697c  Merge branch 'df_col_iteration' of https://github.com/bkamins/DataFra… (bkamins, Nov 11, 2018)
82aa836  revert test to a more terse form (bkamins, Nov 11, 2018)
52e9010  improve deprecation period (bkamins, Nov 11, 2018)
e42b8d3  fix typos (bkamins, Nov 11, 2018)
b63ccba  Update src/abstractdataframe/iteration.jl (nalimilan, Nov 12, 2018)
4576df6  Update src/abstractdataframe/iteration.jl (nalimilan, Nov 12, 2018)
213106f  fixes after a code review (bkamins, Nov 12, 2018)
d9c8a30  small fixes (bkamins, Nov 12, 2018)
e1cb969  fix collect signature (bkamins, Nov 12, 2018)
912f1b4  Merge branch 'master' into df_col_iteration (bkamins, Nov 13, 2018)
2363ef5  add mapcols tests (bkamins, Nov 13, 2018)
6d1eb90  allow @inbounds and re-enable some tests (bkamins, Nov 13, 2018)
1 change: 1 addition & 0 deletions docs/src/lib/functions.md
@@ -29,6 +29,7 @@ meltdf

```@docs
allowmissing!
columns
completecases
describe
disallowmissing!
16 changes: 13 additions & 3 deletions docs/src/lib/types.md
@@ -33,9 +33,19 @@ and reflects changes done to the parent after the creation of the view.
Typically objects of the `DataFrameRow` type are encountered when returned by the `eachrow` function.
In the future accessing a single row of a data frame via `getindex` or `view` will return a `DataFrameRow`.

Additionally the `eachrow` and `eachcol` functions return values of the `DFRowIterator` and `DFColumnIterator` types respectively.
Those types are not exported and should not be constructed directly.
They respectively serve as iterators over rows and columns of an `AbstractDataFrame`.
Additionally, the `eachrow` function returns a value of the `DFRowIterator` type, which
serves as an iterator over the rows of an `AbstractDataFrame`, returning `DataFrameRow` objects.

Similarly, the `eachcol` and `columns` functions return a value of the `DFColumnIterator` type, which
serves as an iterator over the columns of an `AbstractDataFrame`.
The difference between the return values of `eachcol` and `columns` is the following:

* The `eachcol` function returns a value of the `DFColumnIterator{<:AbstractDataFrame, true}` type, which is an
  iterator returning a tuple containing the column name and the column value.
* The `columns` function returns a value of the `DFColumnIterator{<:AbstractDataFrame, false}` type, which is an
  iterator returning the column value only.

The `DFRowIterator` and `DFColumnIterator` types are not exported and should not be constructed directly.
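
For example, a minimal usage sketch of the difference, assuming the iteration behavior described above:

```julia
using DataFrames

df = DataFrame(a = 1:2, b = 3:4)

# `eachcol` yields (column name, column vector) tuples:
for (name, col) in eachcol(df)
    println(name, " => ", sum(col))
end

# `columns` yields the column vectors only:
for col in columns(df)
    println(sum(col))
end
```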

## Types specification

1 change: 1 addition & 0 deletions src/DataFrames.jl
@@ -27,6 +27,7 @@ export AbstractDataFrame,
aggregate,
by,
categorical!,
columns,
colwise,
combine,
completecases,
23 changes: 4 additions & 19 deletions src/abstractdataframe/abstractdataframe.jl
@@ -81,22 +81,6 @@ abstract type AbstractDataFrame end
##
##############################################################################

struct Cols{T <: AbstractDataFrame} <: AbstractVector{AbstractVector}
df::T
end
function Base.iterate(itr::Cols, st=1)
st > length(itr.df) && return nothing
return (itr.df[st], st + 1)
end
Base.length(itr::Cols) = length(itr.df)
Base.size(itr::Cols, ix) = ix==1 ? length(itr) : throw(ArgumentError("Incorrect dimension"))
Base.size(itr::Cols) = (length(itr.df),)
Base.IndexStyle(::Type{<:Cols}) = IndexLinear()
Base.getindex(itr::Cols, inds...) = getindex(itr.df, inds...)

# N.B. where stored as a vector, 'columns(x) = x.vector' is a bit cheaper
columns(df::T) where {T <: AbstractDataFrame} = Cols{T}(df)

Base.names(df::AbstractDataFrame) = names(index(df))
_names(df::AbstractDataFrame) = _names(index(df))

@@ -218,7 +202,7 @@ eltypes(df)
```

"""
eltypes(df::AbstractDataFrame) = map!(eltype, Vector{Type}(undef, size(df,2)), columns(df))
eltypes(df::AbstractDataFrame) = [eltype(col) for col in columns(df)]

Base.size(df::AbstractDataFrame) = (nrow(df), ncol(df))
function Base.size(df::AbstractDataFrame, i::Integer)
@@ -1097,7 +1081,8 @@ julia> repeat(df, inner = 2, outer = 3)
```
"""
Base.repeat(df::AbstractDataFrame; inner::Integer = 1, outer::Integer = 1) =
map(x -> repeat(x, inner = inner, outer = outer), eachcol(df))
DataFrame(map(x -> repeat(x, inner = inner, outer = outer), columns(df)),
names(df))

"""
repeat(df::AbstractDataFrame, count::Integer)
@@ -1127,7 +1112,7 @@ julia> repeat(df, 2)
```
"""
Base.repeat(df::AbstractDataFrame, count::Integer) =
map(x -> repeat(x, count), eachcol(df))
DataFrame(map(x -> repeat(x, count), columns(df)), names(df))

##############################################################################
##
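As an aside, a minimal sketch of the pattern the updated `repeat` methods above rely on, assuming `columns(df)` iterates plain column vectors as introduced in this PR:

```julia
using DataFrames

df = DataFrame(a = 1:2, b = ["x", "y"])

# Map over the raw column vectors, then rebuild a DataFrame with the original names.
rep_cols = map(col -> repeat(col, inner = 2, outer = 3), columns(df))
df_rep = DataFrame(rep_cols, names(df))
```

The explicit `DataFrame(..., names(df))` wrapping is needed because, unlike the old `map(f, eachcol(df))` behavior, mapping over `columns(df)` yields an ordinary vector of columns rather than a data frame.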
15 changes: 7 additions & 8 deletions src/abstractdataframe/io.jl
@@ -213,11 +213,11 @@ struct DataFrameStream{T}
columns::T
header::Vector{String}
end
DataFrameStream(df::DataFrame) = DataFrameStream(Tuple(columns(df)), string.(names(df)))
DataFrameStream(df::DataFrame) = DataFrameStream(Tuple(rawcolumns(df)), string.(names(df)))

# DataFrame Data.Source implementation
Data.schema(df::DataFrame) =
Data.Schema(Type[eltype(A) for A in columns(df)], string.(names(df)), size(df, 1))
Data.Schema(Type[eltype(A) for A in rawcolumns(df)], string.(names(df)), size(df, 1))

Data.isdone(source::DataFrame, row, col, rows, cols) = row > rows || col > cols
function Data.isdone(source::DataFrame, row, col)
@@ -276,26 +276,25 @@ function DataFrame(sch::Data.Schema{R}, ::Type{S}=Data.Field,
# to the # of rows in the source
newsize = ifelse(S == Data.Column || !R, 0,
ifelse(append, sinkrows + sch.rows, sch.rows))
foreach(col->resize!(col, newsize), columns(sink))
foreach(col->resize!(col, newsize), rawcolumns(sink))
sch.rows = newsize
end
# take care of a possible reference from source by adding to WeakRefStringArrays
if !isempty(reference)
foreach(col-> col isa WeakRefStringArray && push!(col.data, reference),
sink.columns)
rawcolumns(sink))
end
sink = DataFrameStream(sink)
DataFrameStream(sink)
else
# allocating a fresh DataFrame Sink; append is irrelevant
# for Data.Column or unknown # of rows in Data.Field, we only ever append!,
# so just allocate empty columns
rows = ifelse(S == Data.Column, 0, ifelse(!R, 0, sch.rows))
names = Data.header(sch)
sink = DataFrameStream(
Tuple(allocate(types[i], rows, reference) for i = 1:length(types)), names)
sch.rows = rows
DataFrameStream(Tuple(allocate(types[i], rows, reference)
for i = 1:length(types)), names)
end
return sink
end

DataFrame(sink, sch::Data.Schema, ::Type{S}, append::Bool;
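The io.jl changes above swap `columns(df)` for `rawcolumns(df)` in internal code (commit ebc6835 adds `rawcolumns`). A plausible reading, sketched under the assumption that a `DataFrame` stores its columns in a `columns` field (as the `sink.columns` access above suggests); the helper name and definition below are illustrative, not the PR's actual code:

```julia
using DataFrames

# Assumption: internal code wants the underlying vector of column vectors so it
# can mutate columns directly (e.g. resize!), bypassing the iterator types.
rawcolumns_sketch(df::DataFrame) = getfield(df, :columns)
```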
101 changes: 74 additions & 27 deletions src/abstractdataframe/iteration.jl
@@ -15,7 +15,7 @@ with each row represented as a `DataFrameRow`.

A value of this type is returned by the [`eachrow`](@link) function.
"""
struct DFRowIterator{T <: AbstractDataFrame}
struct DFRowIterator{T<:AbstractDataFrame}
df::T
end

@@ -27,25 +27,32 @@ with each row represented as a `DataFrameRow`.
"""
eachrow(df::AbstractDataFrame) = DFRowIterator(df)

Base.size(itr::DFRowIterator) = (size(itr.df, 1), )
Base.size(itr::DFRowIterator, ix) =
ix == 1 ? length(itr) : throw(ArgumentError("Incorrect dimension"))
Base.length(itr::DFRowIterator) = size(itr.df, 1)
Base.firstindex(itr::DFRowIterator) = 1
Base.lastindex(itr::DFRowIterator) = length(itr)

function Base.iterate(itr::DFRowIterator, i=1)
i > size(itr.df, 1) && return nothing
return (DataFrameRow(itr.df, i), i + 1)
end

Base.eltype(::DFRowIterator{T}) where {T} = DataFrameRow{T}
Base.size(itr::DFRowIterator) = (size(itr.df, 1), )
Base.length(itr::DFRowIterator) = size(itr.df, 1)
Base.getindex(itr::DFRowIterator, i) = DataFrameRow(itr.df, i)

# Iteration by columns
"""
DFColumnIterator{<:AbstractDataFrame}
DFColumnIterator{<:AbstractDataFrame, C}

Iterator over columns of an `AbstractDataFrame`.
Each returned value is a tuple consisting of column name and column vector.

A value of this type is returned by the [`eachcol`](@link) function.
If `C` is `true` (a value returned by the [`eachcol`](@link) function)
then each returned value is a tuple consisting of column name and column vector.
If `C` is `false` (a value returned by the [`columns`](@link) function)
then each returned value is a column vector.
"""
struct DFColumnIterator{T <: AbstractDataFrame}
struct DFColumnIterator{T<:AbstractDataFrame, C}
Review discussion on this struct definition:

nalimilan (reviewer):

Instead of C being a Boolean, it could be either AbstractVector or Tuple{Symbol,AbstractVector}. That way you could do <: AbstractArray{C} here, and you wouldn't need to define eltype and such below.

BTW, it's a good occasion to move from a tuple to a pair. That shouldn't affect most uses but it's more natural here.

nalimilan:

That said, if we keep the special behavior of map which returns a DataFrame, maybe we'd better not make this iterator an AbstractVector. Same for the row iterator if we want to do something similar in the future.

bkamins (author):

@nalimilan Thank you for the comments. They touch on things I was not sure about myself.

Actually the crucial design issue is whether to make them AbstractVector subtypes or not. This is tempting, and then they should carry the eltype in the signature (actually, interestingly, this is the only way it is possible to implement it AFAIK) and the design would be simplified. If we went for this I was thinking about renaming the types to DFRowVector and DFColumnVector to better convey what we produce.

But we have this special map behavior that produces a DataFrame from eachcol. If we went for AbstractVector subtyping I would remove this functionality. If we feel that we like its functionality I would prefer to create a new specialized method for it, e.g. one with the signature colmap(::Union{Type, Function}, ::AbstractDataFrame), rather than overload map with a method that breaks its intuitive contract.

Also, I do not see much sense in doing the same for eachrow (we would not know how to transform the result into a DataFrame later).

In summary, having thought about it, I would:

* make both types AbstractVector subtypes
* deprecate map for eachcol
* add a colmap function that does what map for eachcol now does

When we agree whether this is the way to go (or decide something else) I will fix this PR.

nalimilan (Nov 11, 2018):

OK, so we need to choose between these two solutions:

1. Define map(f, columns(df)) to return a data frame. We could add map(f, rows(df)) later if we want and that would be consistent. People have to use comprehensions to get a vector instead.
2. Keep the standard behavior of map(f, columns(df)), i.e. have it return a vector. This is basically colwise, but with a more standard name (and we could deprecate it; the broadcasting form f.(columns(df)) is even more compact). We need another way to apply a function to each column of a data frame and get a data frame back: maybe mapcols(f, df), and mapcols!(f, df) for the in-place variant (the naming would be consistent with other column-wise functions, cf. "Review row vs. column orientation of API" #1514). Then map(f, df) would apply a function to each row (this is what JuliaDB supports).

> Also, I do not see much sense in doing the same for eachrow (we would not know how to transform the result into a DataFrame later).

That would be quite simple AFAICT: require the user-provided function to return a named tuple, and fill a data frame with that like we now do in map(::Function, ::GroupedDataFrame).

Cc: @piever to discuss the consistency with JuliaDB.

piever:

For me to understand: there are ColumnIterator and RowIterator types, and the decision is whether one should overload map on them to return a DataFrame rather than a vector of columns or a vector of rows?

In JuliaDB map acts by default on the rows and returns a table. We also have a map_rows which basically would correspond to map_rows(f, iter...) = DataFrame(f(i) for i in iter...). It is useful when f returns NamedTuples but iter... are not a DataFrame (they could be vectors or iterators), for example map_rows((x, y) -> (s = x+y, d = x-y), rand(100), rand(100)). Could this be added to DataFrames as well? In which case I think mapcols / map_rows is a strange pair. What could be a better pair of names?

For consistency with JuliaDB, option 2 seems much better (also, to keep map and filter consistent). I thought there was a colwise or similar in JuliaDB but apparently I'm misremembering. It looks like currently one has to do:

julia> t = table((a = rand(100), b = randn(100)));

julia> f(v) = partialsort(v, 1:10)
f (generic function with 1 method)

julia> table(map(f, columns(t)))
Table with 10 rows, 2 columns:
a           b
────────────────────
0.00392291  -2.52979
0.0221383   -2.01489
0.029867    -1.92054
0.0462468   -1.89799
0.0720458   -1.5951
0.0824684   -1.31414
0.105048    -1.25503
0.108803    -1.19262
0.1204      -1.13395
0.136053    -1.13125

This I think is not so bad, but maybe it'd be more awkward in DataFrames because DataFrame(map(f, columns(t))) would lose the column names?

nalimilan:

I don't really see the advantage of providing a map_rows function if DataFrame(f(i) for i in iter...) does the same thing TBH. And using a generator like that sounds more Julian (it's like creating dicts).

> For consistency with JuliaDB, option 2 seems much better (also, to keep map and filter consistent).

OK. That's good since that's the approach @bkamins has implemented. ;-)

But the example you show goes a bit against this statement AFAICT: map(f, columns(t)) returns a NamedTuple in JuliaDB, not a vector (which is why names are lost). That's logical since columns(t) gives a NamedTuple, but that would be different from data frames. That said, it's not a big deal. Maybe JuliaDB could support mapcols just for consistency and convenience?

piever (Nov 12, 2018):

Yep, maybe JuliaDB could deprecate map_rows as "unnecessary". The easiest is probably to add a mapcols(f, t; select = All()) method to JuliaDB. As I mentioned above, I'm afraid this creates a bit of cacophony on the JuliaDB side (mapcols and map_rows have different underscore styles and are not related in the obvious way). The cleanest is probably to deprecate (or at least not export) map_rows and rename setcol, popcol, pushcol... to setcols, popcols, pushcols (as they accept multiple columns at once) and add mapcols to the collections. But this PR shouldn't worry about that, as all the changes needed to make things consistent should come from JuliaDB.

nalimilan:

Makes sense. It's good to anticipate what API we both want to converge to in the long term.
df::T
end

@@ -55,10 +62,6 @@ end
Return a `DFColumnIterator` that iterates an `AbstractDataFrame` column by column.
Iteration returns a tuple consisting of column name and column vector.

`DFColumnIterator` has a custom implementation of the `map` function which
returns a `DataFrame` and assumes that a function argument passed do
the `map` function accepts takes only a column vector.

**Examples**

```jldoctest
@@ -72,29 +75,73 @@ julia> df = DataFrame(x=1:4, y=11:14)
│ 3 │ 3 │ 13 │
│ 4 │ 4 │ 14 │

julia> map(sum, eachcol(df))
1×2 DataFrame
julia> collect(eachcol(df))
2-element Array{Tuple{Symbol,Any},1}:
(:x, [1, 2, 3, 4])
(:y, [11, 12, 13, 14])
```
"""
eachcol(df::T) where T<: AbstractDataFrame = DFColumnIterator{T, true}(df)

"""
columns(df::AbstractDataFrame)

Return a `DFColumnIterator` that iterates an `AbstractDataFrame` column by column.
Iteration returns a column vector.
bkamins marked this conversation as resolved.
Show resolved Hide resolved

**Examples**

```jldoctest
julia> df = DataFrame(x=1:4, y=11:14)
4×2 DataFrame
│ Row │ x │ y │
│ │ Int64 │ Int64 │
├─────┼───────┼───────┤
│ 1 │ 1 │ 11 │
│ 2 │ 2 │ 12 │
│ 3 │ 3 │ 13 │
│ 4 │ 4 │ 14 │

julia> collect(columns(df))
2-element Array{AbstractArray{T,1} where T,1}:
[1, 2, 3, 4]
[11, 12, 13, 14]
```
"""
eachcol(df::AbstractDataFrame) = DFColumnIterator(df)
columns(df::T) where T<: AbstractDataFrame = DFColumnIterator{T, false}(df)

function Base.iterate(itr::DFColumnIterator, j=1)
Base.length(itr::DFColumnIterator) = size(itr.df, 2)
Base.size(itr::DFColumnIterator) = (size(itr.df, 2),)
Base.size(itr::DFColumnIterator, ix) =
ix == 1 ? length(itr) : throw(ArgumentError("Incorrect dimension"))
Base.firstindex(itr::DFColumnIterator) = 1
Base.lastindex(itr::DFColumnIterator) = length(itr)

function Base.iterate(itr::DFColumnIterator{<:AbstractDataFrame,true}, j=1)
j > size(itr.df, 2) && return nothing
return ((_names(itr.df)[j], itr.df[j]), j + 1)
end
Base.eltype(::DFColumnIterator) = Tuple{Symbol, AbstractVector}
Base.size(itr::DFColumnIterator) = (size(itr.df, 2), )
Base.length(itr::DFColumnIterator) = size(itr.df, 2)
Base.getindex(itr::DFColumnIterator, j) = itr.df[j]
function Base.map(f::Union{Function,Type}, dfci::DFColumnIterator)
# note: `f` must return a consistent length
res = DataFrame()
for (n, v) in eachcol(dfci.df)
res[n] = f(v)

Base.eltype(::DFColumnIterator{<:AbstractDataFrame,true}) =
Tuple{Symbol, AbstractVector}

function Base.getindex(itr::DFColumnIterator{<:AbstractDataFrame,true}, j)
# TODO: change to the way getindex for false is defined below after deprecation
if !(j isa Integer) || j isa Bool
Base.depwarn("calling getindex on DFColumnIterator{<:AbstractDataFrame,true} " *
"object will only accept integer indexing and will return " *
"a tuple of column name and column value in the future.", :getindex)
end
res
itr.df[j]
end

function Base.iterate(itr::DFColumnIterator{<:AbstractDataFrame,false}, j=1)
j > size(itr.df, 2) && return nothing
return (itr.df[j], j + 1)
end

Base.eltype(::DFColumnIterator{<:AbstractDataFrame,false}) = AbstractVector
Base.getindex(itr::DFColumnIterator{<:AbstractDataFrame,false}, j::Integer) =
itr.df[j]
Base.getindex(itr::DFColumnIterator{<:AbstractDataFrame,false}, j::Bool) =
throw(ArgumentError("invalid index $j of type Bool"))
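
Per the commit list above ("add mapcols tests") and the review discussion, this PR moves toward a mapcols function instead of the special map method removed in this file. A minimal sketch of the idea, modeled on that removed map method; the helper name and details are hypothetical and need not match the merged implementation:

```julia
using DataFrames

# Hypothetical sketch: apply `f` to every column and collect the results into a
# new DataFrame, keeping the column names; `f` must return values of a
# consistent length. The actual `mapcols` merged into DataFrames may differ.
function mapcols_sketch(f, df::AbstractDataFrame)
    res = DataFrame()
    for (n, v) in eachcol(df)   # iterate (column name, column vector) tuples
        res[n] = f(v)
    end
    return res
end

# Example: mapcols_sketch(sum, df) returns a one-row data frame of column sums.
```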