Creating an empty DataFrame is unwieldy and has unexpected behavior #1569

coreywoodfield · 2018-10-16T01:14:08Z

The canonical way of creating an empty DataFrame according to the docs (see the Constructing Row by Row header) is a bit unwieldy and has unexpected behavior.

The constructor that creates columns based on keyword arguments seems to be mostly for demo purposes, so I'll use the constructor that takes in a dictionary (or a vararg parameter of pairs) in my examples.

Suppose I want to create two DataFrames to hold measurements before and after some event. Getting the data for each row is not trivial, so the overhead of creating a dataframe row by row is not an issue for me. It's the most convenient way to do it and won't have a big impact on the runtime of my program, so I'll create both dataframes row by row. Since they have the same columns, I'll make one column dictionary, and create two dataframes with the same dictionary, as shown in the following example.

columnnames = ["x", "y", "z"]
columns = [Symbol(col) => Float64[] for col in columnnames]
df1 = DataFrame(columns...)
df2 = DataFrame(columns...)

Then, if I add a row to df1, since the same Float64 vectors in columns are used in the underlying data storage of both dataframes, it also adds the row to df2. This is very unexpected, and took me a little while to figure out what exactly was going on.

The problem could be fixed simply by changing line 100 of src/dataframe/dataframe.jl from return new(columns, colindex) to return new(map(copy, columns), colindex), however, I think these constructors are still awkward to use.

The best current alternative is using DataFrame([Float64, Float64, Float64], [:x, :y, :z], 0) but I still don't like this, for two reasons. First, the names of the columns are separated from the types, which I'm not a fan of--I'd prefer it if I could pass in a dictionary mapping column names to column types (if you try to do that now it creates a column with the type itself in the first row), and second, the argument for the number of rows seems superfluous--given the current API of DataFrames, I don't see what the use cases would be for initializing an empty dataframe with a given number of rows. It makes sense if you know beforehand what size your data will be, but then (as far as I can tell) there's no convenient way to set a row to something. So I'd like it if there were some setindex! methods that paralleled the available push! methods (e.g. setindex!(df::DataFrame, row::Union{AbstractDict, NamedTuple}, row_ind::Real, col_ind::Colon)).

So in conclusion, I think this should be fixed. We could just add a copy when dataframes get constructed with empty vectors, to protect the user, or we could extend the API to make it more intuitive. I'd propose adding constructors that take in a dictionary or a list of pairs where the column values are types instead of vectors, with an optional size parameter, and also adding methods to setindex! so that it can be used the same way push! can be, to set rows in empty dataframes of a predetermined size so that creating a dataframe row by row isn't quite as slow.

I'm happy to help implement any of my suggestions, but I just didn't want to start working on them before knowing that they would likely be approved.

The text was updated successfully, but these errors were encountered:

nalimilan · 2018-10-16T20:40:04Z

Honestly I don't see the problem. Yes, if you use the same vectors to construct two data frames, they will share the same values. But that's not a bug, it's a feature. Copying is slow and most of the time there's no reason to do that (vectors are not always empty). Once you know how it works (and if you don't know it you'll discover it soon enough), you can just call deepcopy if you happen to need two data frames.

I guess we could allow DataFrame(x=>Int, y=>Float64, size=0), but I'm not sure it's worth it, as the API is already quite complex, with lots of different constructors.

OTOH we should definitely support named tuples (and maybe dicts, why not) with setindex!.

coreywoodfield · 2018-10-16T21:19:20Z

I agree that the API is complex, with a lot of constructors, so I suppose I'll defer on that count. I do recognize that copying is slow, that's why I suggested it only be done if the vectors are in fact empty. I guess I didn't clarify that, but line 100 of dataframe.jl is the case where all input vectors are empty.

I'll work on expanding the setindex! API to support the operations that push! does.

nalimilan · 2018-10-17T08:26:20Z

I do recognize that copying is slow, that's why I suggested it only be done if the vectors are in fact empty. I guess I didn't clarify that, but line 100 of dataframe.jl is the case where all input vectors are empty.

But then this makes behavior hard to predict. You'd need to anticipate that if your vectors can be empty they will be copied, which may have many consequences on the code. This is the kind of trick that introduces subtle bugs which are missed by tests and only appear in production.

bkamins · 2019-07-12T20:00:32Z

Copying on construction is now a default so I am closing it. If I have missed something please reopen.

bkamins · 2020-04-02T08:14:11Z

I just got the following message:

the deepcopied dataframe is an alias for the one that was copied.

(I guess it was deleted, but just for completeness and future reference let me comment on it)

DataFrames.jl does not override the definition of deepcopy from Base, so deepcopying a data frame creates a completely freshly allocated data frame.

Note though that there is a difference between DataFrame constructor and deepcopy for views (probably this is something that is to be expected but I just want to be clear here):

julia> df = DataFrame(rand(3,3))
3×3 DataFrame
│ Row │ x1       │ x2        │ x3       │
│     │ Float64  │ Float64   │ Float64  │
├─────┼──────────┼───────────┼──────────┤
│ 1   │ 0.628808 │ 0.261069  │ 0.156343 │
│ 2   │ 0.183679 │ 0.49501   │ 0.653551 │
│ 3   │ 0.708212 │ 0.0398483 │ 0.113054 │

julia> dfv = @view df[1:2, 1:2]
2×2 SubDataFrame
│ Row │ x1       │ x2       │
│     │ Float64  │ Float64  │
├─────┼──────────┼──────────┤
│ 1   │ 0.628808 │ 0.261069 │
│ 2   │ 0.183679 │ 0.49501  │

julia> dfv_dc = deepcopy(dfv)
2×2 SubDataFrame
│ Row │ x1       │ x2       │
│     │ Float64  │ Float64  │
├─────┼──────────┼──────────┤
│ 1   │ 0.628808 │ 0.261069 │
│ 2   │ 0.183679 │ 0.49501  │

julia> parent(dfv) === parent(dfv_dc)
false

julia> DataFrame(dfv)
2×2 DataFrame
│ Row │ x1       │ x2       │
│     │ Float64  │ Float64  │
├─────┼──────────┼──────────┤
│ 1   │ 0.628808 │ 0.261069 │
│ 2   │ 0.183679 │ 0.49501  │

(in particular note that deepcopy of the view also deepcopies the parent.

lewisl · 2020-04-02T16:22:04Z

It was deleted. The problem was getting caught by Revise not being recursive (by design). Deepcopy of either the header array or the entire template dataframe absolutely worked. From: Bogumił Kamiński <notifications@github.com> Reply-To: "JuliaData/DataFrames.jl" <reply@reply.github.com> Date: Thursday, April 2, 2020 at 1:14 AM To: "JuliaData/DataFrames.jl" <DataFrames.jl@noreply.github.com> Cc: Lewis Levin <lewis@neilson-levin.org>, Comment <comment@noreply.github.com> Subject: Re: [JuliaData/DataFrames.jl] Creating an empty DataFrame is unwieldy and has unexpected behavior (#1569) I just got the following message: the deepcopied dataframe is an alias for the one that was copied. (I guess it was deleted, but just for completeness and future reference let me comment on it) DataFrames.jl does not override the definition of deepcopy from Base, so deepcopying a data frame creates a completely freshly allocated data frame. Note though that there is a difference between DataFrame constructor and deepcopy for views (probably this is something that is to be expected but I just want to be clear here): julia> df = DataFrame(rand(3,3)) 3×3 DataFrame │ Row │ x1 │ x2 │ x3 │ │ │ Float64 │ Float64 │ Float64 │ ├─────┼──────────┼───────────┼──────────┤ │ 1 │ 0.628808 │ 0.261069 │ 0.156343 │ │ 2 │ 0.183679 │ 0.49501 │ 0.653551 │ │ 3 │ 0.708212 │ 0.0398483 │ 0.113054 │ julia> dfv = @view df[1:2, 1:2] 2×2 SubDataFrame │ Row │ x1 │ x2 │ │ │ Float64 │ Float64 │ ├─────┼──────────┼──────────┤ │ 1 │ 0.628808 │ 0.261069 │ │ 2 │ 0.183679 │ 0.49501 │ julia> dfv_dc = deepcopy(dfv) 2×2 SubDataFrame │ Row │ x1 │ x2 │ │ │ Float64 │ Float64 │ ├─────┼──────────┼──────────┤ │ 1 │ 0.628808 │ 0.261069 │ │ 2 │ 0.183679 │ 0.49501 │ julia> parent(dfv) === parent(dfv_dc) false julia> DataFrame(dfv) 2×2 DataFrame │ Row │ x1 │ x2 │ │ │ Float64 │ Float64 │ ├─────┼──────────┼──────────┤ │ 1 │ 0.628808 │ 0.261069 │ │ 2 │ 0.183679 │ 0.49501 │ (in particular note that deepcopy of the view also deepcopies the parent. — You are receiving this because you commented. Reply to this email directly, view it on GitHub<#1569 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AAIYWLJH5JAKQF7PCWVACR3RKRCOFANCNFSM4F3VTEAQ>.

coreywoodfield mentioned this issue Oct 16, 2018

Extend setindex api #1571

Closed

jlperla mentioned this issue Oct 18, 2018

Support for construction from a named tuple? #1573

Closed

bkamins mentioned this issue Jan 15, 2019

DataFrames.jl roadmap #1678

Closed

31 tasks

bkamins closed this as completed Jul 12, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Creating an empty DataFrame is unwieldy and has unexpected behavior #1569

Creating an empty DataFrame is unwieldy and has unexpected behavior #1569

coreywoodfield commented Oct 16, 2018 •

edited

Loading

nalimilan commented Oct 16, 2018

coreywoodfield commented Oct 16, 2018

nalimilan commented Oct 17, 2018

bkamins commented Jul 12, 2019

bkamins commented Apr 2, 2020

lewisl commented Apr 2, 2020 via email

Creating an empty DataFrame is unwieldy and has unexpected behavior #1569

Creating an empty DataFrame is unwieldy and has unexpected behavior #1569

Comments

coreywoodfield commented Oct 16, 2018 • edited Loading

nalimilan commented Oct 16, 2018

coreywoodfield commented Oct 16, 2018

nalimilan commented Oct 17, 2018

bkamins commented Jul 12, 2019

bkamins commented Apr 2, 2020

lewisl commented Apr 2, 2020 via email

coreywoodfield commented Oct 16, 2018 •

edited

Loading