Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do we need CSV.read? #2309

Closed
bkamins opened this issue Jun 29, 2020 · 12 comments · Fixed by JuliaData/CSV.jl#687
Closed

Do we need CSV.read? #2309

bkamins opened this issue Jun 29, 2020 · 12 comments · Fixed by JuliaData/CSV.jl#687
Labels

Comments

@bkamins
Copy link
Member

bkamins commented Jun 29, 2020

It has been defined in CSV.jl but now is deprecated. Do we need this function or:

CSV.File(...) |> DataFrame

is enough.

Please comment here to have a record of the dicussion.

@ararslan
Copy link
Member

CSV.read is pleasantly concise and the name matches other comparable software more closely, e.g. read.csv in R, read_csv in Pandas, etc. DataFrame(CSV.File(...)) is just not discoverable for users coming from other software. I understand the desire for separation of concerns and making CSV less opinionated about the particular tabular structure it uses, but this really seems like a huge blow to usability to me. Not to mention that omitting the ! in DataFrame!(CSV.File(...)), which is quite easy to do, will make unnecessary copies...

@bkamins
Copy link
Member Author

bkamins commented Jun 29, 2020

Not to mention that omitting the ! in DataFrame!(CSV.File(...)), which is quite easy to do, will make unnecessary copies...

Most of the time you want to omit !. Without it the resulting DataFrame has read-only columns.


The problem with doing a type piracy over CSV.read in DataFrames.jl is that some other package supporting Tables.jl interface could wish to use the same name.

@quinnj - maybe the solution is to define CSV.read in CSV.jl in the following way:

CSV.read(sink, file, args...; kwargs...) = CSV.File(file, args...; kwargs...) |> sink

In such a case CSV.read would be discoverable for the users and they just would learn that they need to specify sink?

@quinnj
Copy link
Member

quinnj commented Jun 29, 2020

Most of the time you want to omit !. Without it the resulting DataFrame has read-only columns.

This isn't accurate as of the 0.7 CSV.jl release; with 0.7, you get back fully mutable arrays.

@bkamins
Copy link
Member Author

bkamins commented Jun 29, 2020

This isn't accurate

Ah - it is plainly wrong :(. I have not tested this enough yet.

So now indeed using DataFrames! should be preferred if I understand things correctly. You only get SentinelArray if you have a non-PooledArray result that has missing values and still SentinelArray is resizeable and mutable. Is this correct?

@oxinabox
Copy link
Contributor

I don't see the need to privledge CSV.jl over any other Table type.
Its nice and clear DataFame(table) works for any Table,
and CSV.File is the way to get a table from a CSV.
And there are ways to get tables from other sources that are similar.

Its not like its much extra typing, CSV has a very short package name.

@piever
Copy link

piever commented Jun 29, 2020

maybe the solution is to define CSV.read in CSV.jl in the following way:

CSV.read(sink, file, args...; kwargs...) = CSV.File(file, args...; kwargs...) |> sink

CSV.read(DataFrame, file) is quite nice and explicit. Another option could be to simply have:

CSV.read(file; kwargs...) = CSV.File(file; kwargs...)

The idea is that CSV.File(file) is a perfectly valid table already (you can iterate over rows, access columns with getproperty), and can be converted to any other table if needed. CSV.read is IMO a more easily discoverable name than CSV.File, so it could be the public interface.

@iamed2
Copy link

iamed2 commented Jun 30, 2020

I think having both CSV.read(sink, file) and CSV.read(file) would be best for CSV. If DataFrames then wanted to be more performant they could override CSV.read as:

CSV.read(::Type{DataFrame}, file; kwargs...) = DataFrame!(CSV.File(file; kwargs))

@oxinabox
Copy link
Contributor

to match Base.read it would want to be CSV.read(file, sink)

@nilshg
Copy link
Contributor

nilshg commented Jun 30, 2020

Very torn on this one - on the one hand I'm fully in the camp that says DataFrame(CSV.File(...)) is great because it introduces users to a better way of thinking about what CSV can offer, and how independent and composable things in Julia are, but on the other hand I already fear the time I'll spend on Discourse, SO, Slack and Zulip to argue with people claiming "Julia will never be a serious language until it offers a CSV.read function"...

On balance it therefore seems to me that the CSV.read(file, sink) option is the best of both worlds, as it exposes a public API that meets user expectations while at the same time getting people to explicitly think about the concept of a sink and what they actually need in terms of postprocessing.

@anandijain
Copy link
Contributor

anandijain commented Jul 10, 2020

I'm also kinda torn. Coming from pandas, I initially thought it was sort of absurd to remove/change CSV.read.

I've come around though, @bkamins has convinced me how valuable a concise API is.

That said CSV |> DataFrame is so commonplace (imo) that it should take very few characters, otherwise it runs the risk of being newcomer and REPL unfriendly.

CSV.read(file, sink) definitely seems to be the move here.

quinnj added a commit to JuliaData/CSV.jl that referenced this issue Jul 10, 2020
Fixes JuliaData/DataFrames.jl#2309. After much
discussion, people really like the `CSV.read` function call naming, so
it was decided to keep it around, while still allowing a break from
DataFrames.jl as a dependency.

I also realized that `CSV.read` does indeed provide one bit of
value/functionality: we can wrap `CSV.File` in `Tables.CopiedColumns`.
What this means is that columnar sinks can safely use the columns passed
without needing to make copies; i.e. they can assume ownership of the
columns. In `CSV.read`, the user is essentially saying, "I want to make
a `CSV.File` and pass it directly to `sink`" which also implies that
`CSV.File` doesn't need to "own" its own columns.

The only question left in my mind is, with the 1.0 release, what to do
with `CSV.read(file)` when no `sink` argument is passed. Suggestions
have included just returning a `CSV.File`, since it's a valid table
anyway. Or allowing DataFrames.jl to define the no-sink-arg version and
return a `DataFrame`; that one's a bit awkward as a form of "blessed"
type piracy, but could also be useful for users as convenience (they
just have to remember to do `using DataFrames` before trying to use it).
The other awkward part is that we're currently warning users of the
deprecation and that they should explicitly spell out `DataFrame`
whereas we might not actually require that.
@quinnj
Copy link
Member

quinnj commented Jul 10, 2020

PR up! JuliaData/CSV.jl#687. Feel free to comment there on implementation, or here on any other questions/concerns.

@quinnj
Copy link
Member

quinnj commented Jul 28, 2020

FYI, CSV.read now gives a deprecation that users need to do CSV.read(input, DataFrame) explicitly. Depending on when #1764 is implemented, we could consider taking a dependency on the DataFramesCore package and make that the default sink again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

8 participants