Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Creating an empty DataFrame is unwieldy and has unexpected behavior #1569

Closed
coreywoodfield opened this issue Oct 16, 2018 · 6 comments
Closed

Comments

@coreywoodfield
Copy link

coreywoodfield commented Oct 16, 2018

The canonical way of creating an empty DataFrame according to the docs (see the Constructing Row by Row header) is a bit unwieldy and has unexpected behavior.

The constructor that creates columns based on keyword arguments seems to be mostly for demo purposes, so I'll use the constructor that takes in a dictionary (or a vararg parameter of pairs) in my examples.

Suppose I want to create two DataFrames to hold measurements before and after some event. Getting the data for each row is not trivial, so the overhead of creating a dataframe row by row is not an issue for me. It's the most convenient way to do it and won't have a big impact on the runtime of my program, so I'll create both dataframes row by row. Since they have the same columns, I'll make one column dictionary, and create two dataframes with the same dictionary, as shown in the following example.

columnnames = ["x", "y", "z"]
columns = [Symbol(col) => Float64[] for col in columnnames]
df1 = DataFrame(columns...)
df2 = DataFrame(columns...)

Then, if I add a row to df1, since the same Float64 vectors in columns are used in the underlying data storage of both dataframes, it also adds the row to df2. This is very unexpected, and took me a little while to figure out what exactly was going on.

The problem could be fixed simply by changing line 100 of src/dataframe/dataframe.jl from return new(columns, colindex) to return new(map(copy, columns), colindex), however, I think these constructors are still awkward to use.

The best current alternative is using DataFrame([Float64, Float64, Float64], [:x, :y, :z], 0) but I still don't like this, for two reasons. First, the names of the columns are separated from the types, which I'm not a fan of--I'd prefer it if I could pass in a dictionary mapping column names to column types (if you try to do that now it creates a column with the type itself in the first row), and second, the argument for the number of rows seems superfluous--given the current API of DataFrames, I don't see what the use cases would be for initializing an empty dataframe with a given number of rows. It makes sense if you know beforehand what size your data will be, but then (as far as I can tell) there's no convenient way to set a row to something. So I'd like it if there were some setindex! methods that paralleled the available push! methods (e.g. setindex!(df::DataFrame, row::Union{AbstractDict, NamedTuple}, row_ind::Real, col_ind::Colon)).

So in conclusion, I think this should be fixed. We could just add a copy when dataframes get constructed with empty vectors, to protect the user, or we could extend the API to make it more intuitive. I'd propose adding constructors that take in a dictionary or a list of pairs where the column values are types instead of vectors, with an optional size parameter, and also adding methods to setindex! so that it can be used the same way push! can be, to set rows in empty dataframes of a predetermined size so that creating a dataframe row by row isn't quite as slow.

I'm happy to help implement any of my suggestions, but I just didn't want to start working on them before knowing that they would likely be approved.

@nalimilan
Copy link
Member

Honestly I don't see the problem. Yes, if you use the same vectors to construct two data frames, they will share the same values. But that's not a bug, it's a feature. Copying is slow and most of the time there's no reason to do that (vectors are not always empty). Once you know how it works (and if you don't know it you'll discover it soon enough), you can just call deepcopy if you happen to need two data frames.

I guess we could allow DataFrame(x=>Int, y=>Float64, size=0), but I'm not sure it's worth it, as the API is already quite complex, with lots of different constructors.

OTOH we should definitely support named tuples (and maybe dicts, why not) with setindex!.

@coreywoodfield
Copy link
Author

I agree that the API is complex, with a lot of constructors, so I suppose I'll defer on that count. I do recognize that copying is slow, that's why I suggested it only be done if the vectors are in fact empty. I guess I didn't clarify that, but line 100 of dataframe.jl is the case where all input vectors are empty.

I'll work on expanding the setindex! API to support the operations that push! does.

@nalimilan
Copy link
Member

I do recognize that copying is slow, that's why I suggested it only be done if the vectors are in fact empty. I guess I didn't clarify that, but line 100 of dataframe.jl is the case where all input vectors are empty.

But then this makes behavior hard to predict. You'd need to anticipate that if your vectors can be empty they will be copied, which may have many consequences on the code. This is the kind of trick that introduces subtle bugs which are missed by tests and only appear in production.

@bkamins
Copy link
Member

bkamins commented Jul 12, 2019

Copying on construction is now a default so I am closing it. If I have missed something please reopen.

@bkamins bkamins closed this as completed Jul 12, 2019
@bkamins
Copy link
Member

bkamins commented Apr 2, 2020

I just got the following message:

the deepcopied dataframe is an alias for the one that was copied.

(I guess it was deleted, but just for completeness and future reference let me comment on it)

DataFrames.jl does not override the definition of deepcopy from Base, so deepcopying a data frame creates a completely freshly allocated data frame.

Note though that there is a difference between DataFrame constructor and deepcopy for views (probably this is something that is to be expected but I just want to be clear here):

julia> df = DataFrame(rand(3,3))
3×3 DataFrame
│ Row │ x1       │ x2        │ x3       │
│     │ Float64  │ Float64   │ Float64  │
├─────┼──────────┼───────────┼──────────┤
│ 1   │ 0.628808 │ 0.261069  │ 0.156343 │
│ 2   │ 0.183679 │ 0.49501   │ 0.653551 │
│ 3   │ 0.708212 │ 0.0398483 │ 0.113054 │

julia> dfv = @view df[1:2, 1:2]
2×2 SubDataFrame
│ Row │ x1       │ x2       │
│     │ Float64  │ Float64  │
├─────┼──────────┼──────────┤
│ 1   │ 0.628808 │ 0.261069 │
│ 2   │ 0.183679 │ 0.49501  │

julia> dfv_dc = deepcopy(dfv)
2×2 SubDataFrame
│ Row │ x1       │ x2       │
│     │ Float64  │ Float64  │
├─────┼──────────┼──────────┤
│ 1   │ 0.628808 │ 0.261069 │
│ 2   │ 0.183679 │ 0.49501  │

julia> parent(dfv) === parent(dfv_dc)
false

julia> DataFrame(dfv)
2×2 DataFrame
│ Row │ x1       │ x2       │
│     │ Float64  │ Float64  │
├─────┼──────────┼──────────┤
│ 1   │ 0.628808 │ 0.261069 │
│ 2   │ 0.183679 │ 0.49501  │

(in particular note that deepcopy of the view also deepcopies the parent.

@lewisl
Copy link

lewisl commented Apr 2, 2020 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants