Unify eachcol and columns functions #1590

Merged
merged 39 commits into from
Nov 14, 2018
4f9c61a
unify eachcol and columns
bkamins Nov 8, 2018
65a2d1e
clean up deprecated code
bkamins Nov 8, 2018
c9b8e88
add paren
bkamins Nov 8, 2018
ebc6835
add rawcolumns and fix DataFrameStream
bkamins Nov 8, 2018
97af6cc
make DFColumnIterator and DFRowIterator subtypes of AbstractArray
bkamins Nov 8, 2018
a470817
avoid using map! on columns
bkamins Nov 8, 2018
17b28a0
change eltypes
bkamins Nov 9, 2018
c3663bf
test fixes
bkamins Nov 9, 2018
5a2e30c
cleanup accessor methods
bkamins Nov 9, 2018
4da925c
fix typo
bkamins Nov 9, 2018
1cd3d3c
fix another typo
bkamins Nov 9, 2018
0fd91aa
further fix getindex of iterators
bkamins Nov 9, 2018
7f10bba
fix test
bkamins Nov 9, 2018
717f47b
qualify depwarn
bkamins Nov 9, 2018
22f2b90
final fixes
bkamins Nov 9, 2018
32f2fb2
change broadcasting to map in tests
bkamins Nov 10, 2018
4bd4f3c
further Julia 0.7 fixes
bkamins Nov 10, 2018
f8a4dab
further Julia 0.7 fixes
bkamins Nov 10, 2018
f584ffb
Wording
nalimilan Nov 10, 2018
9bda15d
Update src/abstractdataframe/iteration.jl
nalimilan Nov 11, 2018
a317e77
Update src/abstractdataframe/iteration.jl
nalimilan Nov 11, 2018
a776b64
go for AbstractVector subtyping
bkamins Nov 11, 2018
c882551
fix tests
bkamins Nov 11, 2018
4536033
fix subtyping
bkamins Nov 11, 2018
91edc19
documentation cleanup
bkamins Nov 11, 2018
f57af37
Update docs/src/lib/types.md
nalimilan Nov 11, 2018
4566656
apply review comments
bkamins Nov 11, 2018
ceb697c
Merge branch 'df_col_iteration' of https://github.com/bkamins/DataFra…
bkamins Nov 11, 2018
82aa836
revert test to a more terse form
bkamins Nov 11, 2018
52e9010
improve deprecation period
bkamins Nov 11, 2018
e42b8d3
fix typos
bkamins Nov 11, 2018
b63ccba
Update src/abstractdataframe/iteration.jl
nalimilan Nov 12, 2018
4576df6
Update src/abstractdataframe/iteration.jl
nalimilan Nov 12, 2018
213106f
fixes after a code review
bkamins Nov 12, 2018
d9c8a30
small fixes
bkamins Nov 12, 2018
e1cb969
fix collect signature
bkamins Nov 12, 2018
912f1b4
Merge branch 'master' into df_col_iteration
bkamins Nov 13, 2018
2363ef5
add mapcols tests
bkamins Nov 13, 2018
6d1eb90
allow @inbounds and re-enable some tests
bkamins Nov 13, 2018
apply review comments
bkamins committed Nov 11, 2018
commit 4566656658ac2abb4002fda9a82c4993f81378c7
16 changes: 8 additions & 8 deletions docs/src/lib/types.md
@@ -33,20 +33,20 @@ and reflects changes done to the parent after the creation of the view.
Typically objects of the `DataFrameRow` type are encountered when returned by the `eachrow` function.
In the future accessing a single row of a data frame via `getindex` or `view` will return a `DataFrameRow`.

Additionally, the `eachrow` function returns a value of the `DFRowVector` type, which
Additionally, the `eachrow` function returns a value of the `DataFrameRows` type, which
serves as an iterator over rows of an `AbstractDataFrame`, returning `DataFrameRow` objects.

Similarly, the `eachcol` and `columns` functions return a value of the `DFColumnVector` type, which
Similarly, the `eachcol` and `columns` functions return a value of the `DataFrameColumns` type, which
serves as an iterator over columns of an `AbstractDataFrame`.
The difference between the return value of `eachcol` and `columns` is the following:

* The `eachcol` function returns a value of the `DFColumnVector{<:AbstractDataFrame, true}` type, which is an
* The `eachcol` function returns a value of the `DataFrameColumns{<:AbstractDataFrame, true}` type, which is an
iterator returning a pair containing the column name and the column vector.
* The `columns` function returns a value of the `DFColumnVector{<:AbstractDataFrame, false}` type, which is an
* The `columns` function returns a value of the `DataFrameColumns{<:AbstractDataFrame, false}` type, which is an
iterator returning the column vector only.

The `DFRowVector` and `DFColumnVector` types are subtypes of `AbstractVector` and support its interface
with the exception that they are read only. Note, that they are not exported and should not be constructed directly,
The `DataFrameRows` and `DataFrameColumns` types are subtypes of `AbstractVector` and support its interface
with the exception that they are read only. Note that they are not exported and should not be constructed directly,
but rather obtained through the `eachrow`, `eachcol` and `columns` functions.

## Types specification
@@ -57,6 +57,6 @@ DataFrame
DataFrameRow
GroupedDataFrame
SubDataFrame
DFRowVector
DFColumnVector
DataFrameRows
DataFrameColumns
```
14 changes: 7 additions & 7 deletions src/abstractdataframe/io.jl
@@ -213,11 +213,11 @@ struct DataFrameStream{T}
columns::T
header::Vector{String}
end
DataFrameStream(df::DataFrame) = DataFrameStream(Tuple(rawcolumns(df)), string.(names(df)))
DataFrameStream(df::DataFrame) = DataFrameStream(Tuple(_columns(df)), string.(names(df)))

# DataFrame Data.Source implementation
Data.schema(df::DataFrame) =
Data.Schema(Type[eltype(A) for A in rawcolumns(df)], string.(names(df)), size(df, 1))
Data.Schema(Type[eltype(A) for A in _columns(df)], string.(names(df)), size(df, 1))

Data.isdone(source::DataFrame, row, col, rows, cols) = row > rows || col > cols
function Data.isdone(source::DataFrame, row, col)
@@ -276,24 +276,24 @@ function DataFrame(sch::Data.Schema{R}, ::Type{S}=Data.Field,
# to the # of rows in the source
newsize = ifelse(S == Data.Column || !R, 0,
ifelse(append, sinkrows + sch.rows, sch.rows))
foreach(col->resize!(col, newsize), rawcolumns(sink))
foreach(col->resize!(col, newsize), _columns(sink))
sch.rows = newsize
end
# take care of a possible reference from source by adding to WeakRefStringArrays
if !isempty(reference)
foreach(col-> col isa WeakRefStringArray && push!(col.data, reference),
rawcolumns(sink))
_columns(sink))
end
DataFrameStream(sink)
return DataFrameStream(sink)
else
# allocating a fresh DataFrame Sink; append is irrelevant
# for Data.Column or unknown # of rows in Data.Field, we only ever append!,
# so just allocate empty columns
rows = ifelse(S == Data.Column, 0, ifelse(!R, 0, sch.rows))
names = Data.header(sch)
sch.rows = rows
DataFrameStream(Tuple(allocate(types[i], rows, reference)
for i = 1:length(types)), names)
return DataFrameStream(Tuple(allocate(types[i], rows, reference)
for i = 1:length(types)), names)
end
end

49 changes: 32 additions & 17 deletions src/abstractdataframe/iteration.jl
@@ -8,47 +8,50 @@

# Iteration by rows
"""
DFRowVector{T<:AbstractDataFrame} <: AbstractVector{DataFrameRow{T}}
DataFrameRows{T<:AbstractDataFrame} <: AbstractVector{DataFrameRow{T}}

Iterator over rows of an `AbstractDataFrame`,
with each row represented as a `DataFrameRow`.

A value of this type is returned by the [`eachrow`](@ref) function.
"""
struct DFRowVector{T<:AbstractDataFrame} <: AbstractVector{DataFrameRow{T}}
struct DataFrameRows{T<:AbstractDataFrame} <: AbstractVector{DataFrameRow{T}}
df::T
end

"""
eachrow(df::AbstractDataFrame)

Return a `DFRowVector` that iterates an `AbstractDataFrame` row by row,
Return a `DataFrameRows` that iterates an `AbstractDataFrame` row by row,
with each row represented as a `DataFrameRow`.
"""
eachrow(df::AbstractDataFrame) = DFRowVector(df)
eachrow(df::AbstractDataFrame) = DataFrameRows(df)

Base.size(itr::DFRowVector) = (size(itr.df, 1), )
Base.IndexStyle(::Type{<:DFRowVector}) = Base.IndexLinear()
Base.getindex(itr::DFRowVector, i::Int) = DataFrameRow(itr.df, i)
Base.size(itr::DataFrameRows) = (size(itr.df, 1), )
Base.IndexStyle(::Type{<:DataFrameRows}) = Base.IndexLinear()
@inline function Base.getindex(itr::DataFrameRows, i::Int)
@boundscheck checkbounds(itr, i)
return DataFrameRow(itr.df, i)
end
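
The `@boundscheck` form added above lets callers opt out of the check with `@inbounds`. A minimal sketch of that pattern, using a hypothetical `ToyRows` wrapper (not a DataFrames.jl type) so that it runs in plain Julia:

```julia
# Hypothetical read-only wrapper, invented for illustration only.
struct ToyRows <: AbstractVector{Int}
    data::Vector{Int}
end

Base.size(r::ToyRows) = (length(r.data),)
Base.IndexStyle(::Type{ToyRows}) = Base.IndexLinear()

# The bounds check sits in a @boundscheck block so that a caller's
# @inbounds can elide it once this method is inlined.
@inline function Base.getindex(r::ToyRows, i::Int)
    @boundscheck checkbounds(r, i)
    return @inbounds r.data[i]
end

# A caller that iterates over valid indices only and may skip the check.
function total(r::ToyRows)
    s = 0
    @inbounds for i in eachindex(r)
        s += r[i]
    end
    return s
end

rows = ToyRows([1, 2, 3])
row_total = total(rows)

# Without @inbounds the check still runs and rejects a bad index.
out_of_range_caught = try
    rows[4]
    false
catch err
    err isa BoundsError
end
```

The `@inline` annotation matters here: `@inbounds` in a caller only removes the `@boundscheck` block when `getindex` is actually inlined into that caller.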

# Iteration by columns
"""
DFColumnVector{<:AbstractDataFrame, V} <: AbstractVector{V}
DataFrameColumns{<:AbstractDataFrame, V} <: AbstractVector{V}

Iterator over columns of an `AbstractDataFrame`.
If `V` is `Pair{Symbol,AbstractVector}` (which is the case when calling
[`eachcol`](@ref)) then each returned value is a pair consisting of
column name and column vector. If `V` is `AbstractVector` (a value returned by
the [`columns`](@ref) function) then each returned value is a column vector.
"""
struct DFColumnVector{T<:AbstractDataFrame, V} <: AbstractVector{V}
struct DataFrameColumns{T<:AbstractDataFrame, V} <: AbstractVector{V}
df::T
end

"""
eachcol(df::AbstractDataFrame)

Return a `DFColumnVector` that iterates an `AbstractDataFrame` column by column.
Return a `DataFrameColumns` that iterates an `AbstractDataFrame` column by column.
Iteration returns a pair consisting of column name and column vector.

**Examples**
@@ -71,12 +74,12 @@ julia> collect(eachcol(df))
```
"""
eachcol(df::T) where T<: AbstractDataFrame =
DFColumnVector{T, Pair{Symbol, AbstractVector}}(df)
DataFrameColumns{T, Pair{Symbol, AbstractVector}}(df)

"""
columns(df::AbstractDataFrame)

Return a `DFColumnVector` that iterates an `AbstractDataFrame` column by
Return a `DataFrameColumns` that iterates an `AbstractDataFrame` column by
column, yielding column vectors.

**Examples**
@@ -96,17 +99,29 @@ julia> collect(columns(df))
2-element Array{AbstractArray{T,1} where T,1}:
[1, 2, 3, 4]
[11, 12, 13, 14]

julia> sum.(columns(df))
2-element Array{Int64,1}:
10
50

julia> map(columns(df)) do col
maximum(col) - minimum(col)
end
2-element Array{Int64,1}:
3
3
```
"""
columns(df::T) where T<: AbstractDataFrame =
DFColumnVector{T, AbstractVector}(df)
DataFrameColumns{T, AbstractVector}(df)

Base.size(itr::DFColumnVector) = (size(itr.df, 2),)
Base.IndexStyle(::Type{<:DFColumnVector}) = Base.IndexLinear()
Base.getindex(itr::DFColumnVector{<:AbstractDataFrame,
Base.size(itr::DataFrameColumns) = (size(itr.df, 2),)
Base.IndexStyle(::Type{<:DataFrameColumns}) = Base.IndexLinear()
Base.getindex(itr::DataFrameColumns{<:AbstractDataFrame,
Pair{Symbol, AbstractVector}}, j::Int) =
_names(itr.df)[j] => itr.df[j]
Base.getindex(itr::DFColumnVector{<:AbstractDataFrame,AbstractVector}, j::Int) =
Base.getindex(itr::DataFrameColumns{<:AbstractDataFrame,AbstractVector}, j::Int) =
itr.df[j]
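
For readers following the diff, the element-type parameter `V` is what selects between the two `getindex` methods above. A minimal, self-contained sketch of that pattern follows; `ToyFrame` and `ToyColumns` are hypothetical stand-ins invented for illustration and are not part of DataFrames.jl.

```julia
# Hypothetical stand-in for a data frame: column names plus column vectors.
struct ToyFrame
    names::Vector{Symbol}
    cols::Vector{AbstractVector}
end

# Read-only column view; V is the element type and picks the getindex method.
struct ToyColumns{V} <: AbstractVector{V}
    df::ToyFrame
end

Base.size(itr::ToyColumns) = (length(itr.df.cols),)
Base.IndexStyle(::Type{<:ToyColumns}) = Base.IndexLinear()

# V == AbstractVector: yield the column vector only.
Base.getindex(itr::ToyColumns{AbstractVector}, j::Int) = itr.df.cols[j]

# V == Pair{Symbol, AbstractVector}: yield name => column.
Base.getindex(itr::ToyColumns{Pair{Symbol, AbstractVector}}, j::Int) =
    itr.df.names[j] => itr.df.cols[j]

toy = ToyFrame([:a, :b], AbstractVector[[1, 2, 3, 4], [11, 12, 13, 14]])
plain = ToyColumns{AbstractVector}(toy)
named = ToyColumns{Pair{Symbol, AbstractVector}}(toy)

# The AbstractVector interface gives map, length, first, etc. for free.
column_sums = map(sum, plain)
first_name = first(named[1])
```

Because no `setindex!` method is defined, assignment through either view raises an error, which is one way to get the read-only behaviour the documentation describes.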

"""
52 changes: 26 additions & 26 deletions src/dataframe/dataframe.jl
@@ -230,10 +230,10 @@ end
##############################################################################

index(df::DataFrame) = getfield(df, :colindex)
rawcolumns(df::DataFrame) = getfield(df, :columns)
_columns(df::DataFrame) = getfield(df, :columns)

# note: these type assertions are required to pass tests
nrow(df::DataFrame) = ncol(df) > 0 ? length(rawcolumns(df)[1])::Int : 0
nrow(df::DataFrame) = ncol(df) > 0 ? length(_columns(df)[1])::Int : 0
ncol(df::DataFrame) = length(index(df))

##############################################################################
@@ -247,13 +247,13 @@ const ColumnIndex = Union{Integer, Symbol}
# df[SingleColumnIndex] => AbstractVector, the same vector
function Base.getindex(df::DataFrame, col_ind::ColumnIndex)
selected_column = index(df)[col_ind]
return rawcolumns(df)[selected_column]
return _columns(df)[selected_column]
end

# df[MultiColumnIndex] => DataFrame
function Base.getindex(df::DataFrame, col_inds::AbstractVector)
selected_columns = index(df)[col_inds]
new_columns = rawcolumns(df)[selected_columns]
new_columns = _columns(df)[selected_columns]
return DataFrame(new_columns, Index(_names(df)[selected_columns]))
end

@@ -263,7 +263,7 @@ Base.getindex(df::DataFrame, col_inds::Colon) = copy(df)
# df[SingleRowIndex, SingleColumnIndex] => Scalar
function Base.getindex(df::DataFrame, row_ind::Integer, col_ind::ColumnIndex)
selected_column = index(df)[col_ind]
return rawcolumns(df)[selected_column][row_ind]
return _columns(df)[selected_column][row_ind]
end

# df[SingleRowIndex, MultiColumnIndex] => DataFrame (will be DataFrameRow)
@@ -274,7 +274,7 @@ function Base.getindex(df::DataFrame, row_ind::Integer, col_inds::AbstractVector
Base.depwarn("Selecting a single row from a `DataFrame` will return a `DataFrameRow` in the future. " *
"To get a `DataFrame` use `df[row_ind:row_ind, col_inds]`.", :getindex)
selected_columns = index(df)[col_inds]
new_columns = AbstractVector[[dv[row_ind]] for dv in rawcolumns(df)[selected_columns]]
new_columns = AbstractVector[[dv[row_ind]] for dv in _columns(df)[selected_columns]]
return DataFrame(new_columns, Index(_names(df)[selected_columns]))
end

@@ -285,20 +285,20 @@ function Base.getindex(df::DataFrame, row_ind::Integer, ::Colon)
end
Base.depwarn("Selecting a single row from a `DataFrame` will return a `DataFrameRow` in the future. " *
"To get a `DataFrame` use `df[row_ind:row_ind, :]`.", :getindex)
new_columns = AbstractVector[[dv[row_ind]] for dv in rawcolumns(df)]
new_columns = AbstractVector[[dv[row_ind]] for dv in _columns(df)]
return DataFrame(new_columns, copy(index(df)))
end

# df[MultiRowIndex, SingleColumnIndex] => AbstractVector, copy
function Base.getindex(df::DataFrame, row_inds::AbstractVector, col_ind::ColumnIndex)
selected_column = index(df)[col_ind]
return rawcolumns(df)[selected_column][row_inds]
return _columns(df)[selected_column][row_inds]
end

# df[MultiRowIndex, MultiColumnIndex] => DataFrame
function Base.getindex(df::DataFrame, row_inds::AbstractVector, col_inds::AbstractVector)
selected_columns = index(df)[col_inds]
new_columns = AbstractVector[dv[row_inds] for dv in rawcolumns(df)[selected_columns]]
new_columns = AbstractVector[dv[row_inds] for dv in _columns(df)[selected_columns]]
return DataFrame(new_columns, Index(_names(df)[selected_columns]))
end

@@ -312,7 +312,7 @@ end

# df[MultiRowIndex, :] => DataFrame
function Base.getindex(df::DataFrame, row_inds::AbstractVector, ::Colon)
new_columns = AbstractVector[dv[row_inds] for dv in rawcolumns(df)]
new_columns = AbstractVector[dv[row_inds] for dv in _columns(df)]
return DataFrame(new_columns, copy(index(df)))
end

@@ -351,15 +351,15 @@ function insert_single_column!(df::DataFrame,
dv = isa(v, AbstractRange) ? collect(v) : v
if haskey(index(df), col_ind)
j = index(df)[col_ind]
rawcolumns(df)[j] = dv
_columns(df)[j] = dv
else
if typeof(col_ind) <: Symbol
push!(index(df), col_ind)
push!(rawcolumns(df), dv)
push!(_columns(df), dv)
else
if ncol(df) + 1 == Int(col_ind)
push!(index(df), nextcolname(df))
push!(rawcolumns(df), dv)
push!(_columns(df), dv)
else
throw(ArgumentError("Cannot assign to non-existent column: $col_ind"))
end
@@ -370,7 +370,7 @@ end

function insert_single_entry!(df::DataFrame, v::Any, row_ind::Real, col_ind::ColumnIndex)
if haskey(index(df), col_ind)
rawcolumns(df)[index(df)[col_ind]][row_ind] = v
_columns(df)[index(df)[col_ind]][row_ind] = v
return v
else
error("Cannot assign to non-existent column: $col_ind")
@@ -382,7 +382,7 @@ function insert_multiple_entries!(df::DataFrame,
row_inds::AbstractVector{<:Real},
col_ind::ColumnIndex)
if haskey(index(df), col_ind)
rawcolumns(df)[index(df)[col_ind]][row_inds] .= v
_columns(df)[index(df)[col_ind]][row_inds] .= v
return v
else
error("Cannot assign to non-existent column: $col_ind")
@@ -616,7 +616,7 @@ function Base.setindex!(df::DataFrame,
new_df::DataFrame,
row_inds::Colon,
col_inds::Colon=Colon())
setfield!(df, :columns, copy(rawcolumns(new_df)))
setfield!(df, :columns, copy(_columns(new_df)))
setfield!(df, :colindex, copy(index(new_df)))
df
end
@@ -639,7 +639,7 @@ Base.setindex!(df::DataFrame, v, ::Colon, col_inds) =
##
##############################################################################

Base.empty!(df::DataFrame) = (empty!(rawcolumns(df)); empty!(index(df)); df)
Base.empty!(df::DataFrame) = (empty!(_columns(df)); empty!(index(df)); df)

"""
Insert a column into a data frame in place.
@@ -728,7 +728,7 @@ function insertcols!(df::DataFrame, col_ind::Int, name_col::Pair{Symbol, <:Abstr
end
end
insert!(index(df), col_ind, name)
insert!(rawcolumns(df), col_ind, item)
insert!(_columns(df), col_ind, item)
df
end

@@ -749,12 +749,12 @@ end

# A copy of a DataFrame points to the original column vectors but
# gets its own Index.
Base.copy(df::DataFrame) = DataFrame(copy(rawcolumns(df)), copy(index(df)))
Base.copy(df::DataFrame) = DataFrame(copy(_columns(df)), copy(index(df)))

# Deepcopy is recursive -- if a column is a vector of DataFrames, each of
# those DataFrames is deepcopied.
function Base.deepcopy(df::DataFrame)
DataFrame(deepcopy(rawcolumns(df)), deepcopy(index(df)))
DataFrame(deepcopy(_columns(df)), deepcopy(index(df)))
end
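
The difference between the two copy semantics described in the comments above can be seen without DataFrames at all. The sketch below uses a plain vector of column vectors as a stand-in for the internal column storage; the variable names are illustrative only.

```julia
# Stand-in for the column storage of a data frame.
cols = AbstractVector[[1, 2, 3], ["a", "b", "c"]]

shallow = copy(cols)      # new outer container, same inner column vectors
deep = deepcopy(cols)     # inner column vectors are duplicated as well

# Mutating a column through the shallow copy is visible in the original.
shallow[1][1] = 100

# Adding a column to the shallow copy does not affect the original container.
push!(shallow, [true, false, true])

shared_value = cols[1][1]      # reflects the mutation made via `shallow`
isolated_value = deep[1][1]    # untouched by the mutation
```

This mirrors why a copied data frame can gain or lose columns independently of its source while in-place edits to an existing column show up in both.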

##############################################################################
@@ -766,7 +766,7 @@ end
function deletecols!(df::DataFrame, inds::Vector{Int})
for ind in sort(inds, rev = true)
if 1 <= ind <= ncol(df)
splice!(rawcolumns(df), ind)
splice!(_columns(df), ind)
delete!(index(df), ind)
else
throw(ArgumentError("Can't delete a non-existent DataFrame column"))
@@ -779,7 +779,7 @@ deletecols!(df::DataFrame, c::Any) = deletecols!(df, index(df)[c])

function deleterows!(df::DataFrame, ind::Union{Integer, UnitRange{Int}})
for i in 1:ncol(df)
rawcolumns(df)[i] = deleteat!(rawcolumns(df)[i], ind)
_columns(df)[i] = deleteat!(_columns(df)[i], ind)
end
df
end
@@ -805,7 +805,7 @@ function deleterows!(df::DataFrame, ind::AbstractVector{Int})
keep[ikeep:end] = idf:n

for i in 1:ncol(df)
rawcolumns(df)[i] = rawcolumns(df)[i][keep]
_columns(df)[i] = _columns(df)[i][keep]
end
df
end
@@ -1011,11 +1011,11 @@ function Base.push!(df::DataFrame, iterable::Any)
i = 1
for t in iterable
try
push!(rawcolumns(df)[i], t)
push!(_columns(df)[i], t)
catch
#clean up partial row
for j in 1:(i - 1)
pop!(rawcolumns(df)[j])
pop!(_columns(df)[j])
end
msg = "Error adding $t to column :$(_names(df)[i]). Possible type mis-match."
throw(ArgumentError(msg))
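
The hunk above shows only part of `push!`, so here is a standalone sketch of the same roll-back idea: if appending a value to column `i` fails, the values already appended to columns `1` through `i - 1` are popped so that all columns keep equal length. The helper `push_row!` is hypothetical and operates on a plain vector of columns rather than on a `DataFrame`.

```julia
# Hypothetical helper, assuming `row` has exactly one value per column.
function push_row!(cols::Vector{<:AbstractVector}, row)
    length(row) == length(cols) ||
        throw(ArgumentError("row length does not match number of columns"))
    i = 1
    for t in row
        try
            push!(cols[i], t)
        catch
            # Clean up the partial row: undo the pushes that already succeeded.
            for j in 1:(i - 1)
                pop!(cols[j])
            end
            throw(ArgumentError("Error adding $t to column $i. Possible type mismatch."))
        end
        i += 1
    end
    return cols
end

cols = AbstractVector[[1, 2], ["a", "b"]]
push_row!(cols, (3, "c"))          # succeeds: both columns grow to length 3

# The second value cannot be converted to String, so the push is rolled back.
rolled_back = try
    push_row!(cols, (4, 5))
    false
catch err
    err isa ArgumentError
end
```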
@@ -1083,7 +1083,7 @@ function permutecols!(df::DataFrame, p::AbstractVector)
if !(length(p) == size(df, 2) && isperm(p))
throw(ArgumentError("$p is not a valid column permutation for this DataFrame"))
end
permute!(rawcolumns(df), p)
permute!(_columns(df), p)
@inbounds permute!(index(df), p)
df
end