-
Notifications
You must be signed in to change notification settings - Fork 367
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Creating an empty DataFrame is unwieldy and has unexpected behavior #1569
Comments
Honestly I don't see the problem. Yes, if you use the same vectors to construct two data frames, they will share the same values. But that's not a bug, it's a feature. Copying is slow and most of the time there's no reason to do that (vectors are not always empty). Once you know how it works (and if you don't know it you'll discover it soon enough), you can just call I guess we could allow OTOH we should definitely support named tuples (and maybe dicts, why not) with |
I agree that the API is complex, with a lot of constructors, so I suppose I'll defer on that count. I do recognize that copying is slow, that's why I suggested it only be done if the vectors are in fact empty. I guess I didn't clarify that, but line 100 of I'll work on expanding the |
But then this makes behavior hard to predict. You'd need to anticipate that if your vectors can be empty they will be copied, which may have many consequences on the code. This is the kind of trick that introduces subtle bugs which are missed by tests and only appear in production. |
Copying on construction is now a default so I am closing it. If I have missed something please reopen. |
I just got the following message:
(I guess it was deleted, but just for completeness and future reference let me comment on it) DataFrames.jl does not override the definition of Note though that there is a difference between
(in particular note that |
It was deleted. The problem was getting caught by Revise not being recursive (by design). Deepcopy of either the header array or the entire template dataframe absolutely worked.
From: Bogumił Kamiński <notifications@github.com>
Reply-To: "JuliaData/DataFrames.jl" <reply@reply.github.com>
Date: Thursday, April 2, 2020 at 1:14 AM
To: "JuliaData/DataFrames.jl" <DataFrames.jl@noreply.github.com>
Cc: Lewis Levin <lewis@neilson-levin.org>, Comment <comment@noreply.github.com>
Subject: Re: [JuliaData/DataFrames.jl] Creating an empty DataFrame is unwieldy and has unexpected behavior (#1569)
I just got the following message:
the deepcopied dataframe is an alias for the one that was copied.
(I guess it was deleted, but just for completeness and future reference let me comment on it)
DataFrames.jl does not override the definition of deepcopy from Base, so deepcopying a data frame creates a completely freshly allocated data frame.
Note though that there is a difference between DataFrame constructor and deepcopy for views (probably this is something that is to be expected but I just want to be clear here):
julia> df = DataFrame(rand(3,3))
3×3 DataFrame
│ Row │ x1 │ x2 │ x3 │
│ │ Float64 │ Float64 │ Float64 │
├─────┼──────────┼───────────┼──────────┤
│ 1 │ 0.628808 │ 0.261069 │ 0.156343 │
│ 2 │ 0.183679 │ 0.49501 │ 0.653551 │
│ 3 │ 0.708212 │ 0.0398483 │ 0.113054 │
julia> dfv = @view df[1:2, 1:2]
2×2 SubDataFrame
│ Row │ x1 │ x2 │
│ │ Float64 │ Float64 │
├─────┼──────────┼──────────┤
│ 1 │ 0.628808 │ 0.261069 │
│ 2 │ 0.183679 │ 0.49501 │
julia> dfv_dc = deepcopy(dfv)
2×2 SubDataFrame
│ Row │ x1 │ x2 │
│ │ Float64 │ Float64 │
├─────┼──────────┼──────────┤
│ 1 │ 0.628808 │ 0.261069 │
│ 2 │ 0.183679 │ 0.49501 │
julia> parent(dfv) === parent(dfv_dc)
false
julia> DataFrame(dfv)
2×2 DataFrame
│ Row │ x1 │ x2 │
│ │ Float64 │ Float64 │
├─────┼──────────┼──────────┤
│ 1 │ 0.628808 │ 0.261069 │
│ 2 │ 0.183679 │ 0.49501 │
(in particular note that deepcopy of the view also deepcopies the parent.
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub<#1569 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/AAIYWLJH5JAKQF7PCWVACR3RKRCOFANCNFSM4F3VTEAQ>.
|
The canonical way of creating an empty
DataFrame
according to the docs (see the Constructing Row by Row header) is a bit unwieldy and has unexpected behavior.The constructor that creates columns based on keyword arguments seems to be mostly for demo purposes, so I'll use the constructor that takes in a dictionary (or a vararg parameter of pairs) in my examples.
Suppose I want to create two
DataFrame
s to hold measurements before and after some event. Getting the data for each row is not trivial, so the overhead of creating a dataframe row by row is not an issue for me. It's the most convenient way to do it and won't have a big impact on the runtime of my program, so I'll create both dataframes row by row. Since they have the same columns, I'll make one column dictionary, and create two dataframes with the same dictionary, as shown in the following example.Then, if I add a row to
df1
, since the sameFloat64
vectors incolumns
are used in the underlying data storage of both dataframes, it also adds the row todf2
. This is very unexpected, and took me a little while to figure out what exactly was going on.The problem could be fixed simply by changing line 100 of
src/dataframe/dataframe.jl
fromreturn new(columns, colindex)
toreturn new(map(copy, columns), colindex)
, however, I think these constructors are still awkward to use.The best current alternative is using
DataFrame([Float64, Float64, Float64], [:x, :y, :z], 0)
but I still don't like this, for two reasons. First, the names of the columns are separated from the types, which I'm not a fan of--I'd prefer it if I could pass in a dictionary mapping column names to column types (if you try to do that now it creates a column with the type itself in the first row), and second, the argument for the number of rows seems superfluous--given the current API of DataFrames, I don't see what the use cases would be for initializing an empty dataframe with a given number of rows. It makes sense if you know beforehand what size your data will be, but then (as far as I can tell) there's no convenient way to set a row to something. So I'd like it if there were somesetindex!
methods that paralleled the availablepush!
methods (e.g.setindex!(df::DataFrame, row::Union{AbstractDict, NamedTuple}, row_ind::Real, col_ind::Colon)
).So in conclusion, I think this should be fixed. We could just add a
copy
when dataframes get constructed with empty vectors, to protect the user, or we could extend the API to make it more intuitive. I'd propose adding constructors that take in a dictionary or a list of pairs where the column values are types instead of vectors, with an optional size parameter, and also adding methods tosetindex!
so that it can be used the same waypush!
can be, to set rows in empty dataframes of a predetermined size so that creating a dataframe row by row isn't quite as slow.I'm happy to help implement any of my suggestions, but I just didn't want to start working on them before knowing that they would likely be approved.
The text was updated successfully, but these errors were encountered: