Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove DataFrames dependency #19

Closed
wants to merge 1 commit into from
Closed

Remove DataFrames dependency #19

wants to merge 1 commit into from

Conversation

piever
Copy link

@piever piever commented Mar 26, 2019

This removes the DataFrames dependency and outputs the data as a named tuple of columns, e.g.

julia> iris = readcsv("/home/pietro/Data/examples/iris.csv")
(SepalLength = [5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9    6.7, 6.9, 5.8, 6.8, 6.7, 6.7, 6.3, 6.5, 6.2, 5.9], SepalWidth = [3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1    3.1, 3.1, 2.7, 3.2, 3.3, 3.0, 2.5, 3.0, 3.4, 3.0], PetalLength = [1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5    5.6, 5.1, 5.1, 5.9, 5.7, 5.2, 5.0, 5.2, 5.4, 5.1], PetalWidth = [0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1    2.4, 2.3, 1.9, 2.3, 2.5, 2.3, 1.9, 2.0, 2.3, 1.8], Species = ["setosa", "setosa", "setosa", "setosa", "setosa", "setosa", "setosa", "setosa", "setosa", "setosa"    "virginica", "virginica", "virginica", "virginica", "virginica", "virginica", "virginica", "virginica", "virginica", "virginica"])

The rationale is that a named tuple of columns is already considered a table by the Tables interface and it can be converted to any other table type by:

using DataFrames: DataFrame
DataFrame(nt)
using StructArrays: StructArray
StructArray(nt)
using IndexedTables: table
table(nt)
using TypedTables: Table
Table(nt)

So it feels strange to have to load DataFrames (which is a fair amount of code) when one is using this parser from another table package (like StructArrays, IndexedTables or TypedTables).

The main downside is that users of DataFrames have to do DataFrame(readcsv(filename)).

@codecov-io
Copy link

codecov-io commented Mar 26, 2019

Codecov Report

Merging #19 into master will increase coverage by 0.21%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #19      +/-   ##
==========================================
+ Coverage   76.32%   76.54%   +0.21%     
==========================================
  Files           4        4              
  Lines         980      989       +9     
==========================================
+ Hits          748      757       +9     
  Misses        232      232
Impacted Files Coverage Δ
src/TableReader.jl 90.32% <100%> (+0.32%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8061739...6a6d49e. Read the comment docs.

@bicycle1885
Copy link
Owner

Thank you for your pull request. I greatly appreciate your work.
However, I cannot accept this change for two reasons; one reason is a design choice and the other is a practical reason.

The design goal of this package in my mind is the readr package of R, which offers very simple and concise API and so I always use it to read tabular data files. I think people always want to keep easy things as easy as possible. What this package does is very easy: reading a tabular data as a table-like object. To this end, the named tuple is a little bit awkward because it doesn't offer methods that make it like a table such as joining, subsetting, sorting, and so on. The DataTable is arguably the de-facto standard of table-like structures in the Julia community, and thus I chose that type as a default return value of this package.

The second reason is that the named tuple is not good at handling very wide tables. I've tried this branch to read a file with ~20,000 columns and it made Julia hang for indefinitely long time (so I killed the process after 10 minutes or so). The main reason I have created this package is that I need to handle this kind of wide tables.

@piever
Copy link
Author

piever commented Mar 27, 2019

Thanks for the explanation! I generally work with "long and thin tables" (say no more than 20 columns but up to 1_000_000 rows) and didn't imagine that even just creating the named tuple at the end would crash things. If you are looking for an untyped container of columns then DataFrame is a very good choice.

@piever piever closed this Mar 27, 2019
@JeffBezanson
Copy link

I would like to see DataFrames.jl become more lightweight --- closer to just defining the data structure and a bare minimum of operations. That would greatly help compile and load times for packages that want to ultimately convert CSVs to other kinds of structures. A NamedTuple of vectors is unfortunately not a great solution since (aside from julia performance problems) it doesn't really have anything like a table interface, despite what Tables.jl might claim. A StructArray would be better of course, but I'd also just settle for a super lightweight DataFrames.

@davidanthoff
Copy link

https://github.com/queryverse/QueryTables.jl is probably as light weight as it gets. It is still WIP because I can't make up my mind whether it should be entirely read only or not. Also, it is tied to DataValue, which some folks of course don't like.

I would like to see DataFrames.jl become more lightweight

I mostly want it to stop making breaking changes every few months. There is so much code out there that relies on its current behavior. The pain these redesigns cause the casual user is just enormous. I think we don't see them in the forums/slack etc., but I have about a dozen of them in my lab and I think there are few things we can do to turn them away from julia more effectively than breaking things like the DataFrame API with regularity.

@JeffBezanson
Copy link

https://github.com/queryverse/QueryTables.jl is probably as light weight as it gets. It is still WIP because I can't make up my mind whether it should be entirely read only or not. Also, it is tied to DataValue, which some folks of course don't like.

It would be great to have a widely-used table type whose base implementation is that simple. I don't mind DataValue, but that table type is so simple it seems to me it doesn't really need to be tied to it.

Can we make DataFrames significantly less than its current 9000+ LOC, and make it not depend on things like StatsBase and CategoricalArrays, without breaking its API?

@piever
Copy link
Author

piever commented Mar 30, 2019

I completely agree that several packages (CSV, RDatasets, now TableReaders) become a bit of a harder sell (as a dependency) because they require DataFrames (only so that they could output a DataFrame) which is still somewhat heavyweight. I imagine the DataFrames package is largish as it has a user-base that expect to do using DataFrames and have everything available. Maybe this could be fixed by simply moving the type DataFrame to a lightweight DataFramesBase which is then reexported by DataFrames? Especially for a package like TableReaders that has a fast startup time as a selling point this could help a lot.

On a somewhat tangential note, IndexedTables could also use a lightweight untyped table structure to modify columns. To modify many columns (replace them with new vectors, add or remove columns), at the moment it converts to a custom ColDict, modifies all columns there and then recreates the original data structure, but I think it'd be nice if this "modifiable table" was just a DataFrame. I think that a good table type needs to be able to be converted easily from untyped (to e.g. edit columns types) to fully typed (to iterate rows efficiently).

@JeffBezanson
Copy link

That's an excellent idea! We should pull the core type into a new DataFramesBase, which DataFrames depends on. Users of DataFrames will be totally unaffected. Packages that want a lighter-weight dependency can switch to DataFramesBase. Sounds perfect to me.

@bicycle1885
Copy link
Owner

bicycle1885 commented Mar 30, 2019

I quickly estimated the load time occupied by DataFrames.jl. The total load time is roughly 3.40 seconds and 3.06 seconds are from DataFrames.jl; other packages are 0.24 seconds in total. So, 90% of the load time is occupied by DataFrames.jl! If we could make it a much more light-weighted package, the benefit would be substantial.

~/w/TableReader (master|…) $ julia -e '@time using TableReader'
  3.402209 seconds (7.41 M allocations: 406.798 MiB, 3.98% gc time)
~/w/TableReader (master|…) $ julia -e '@time using DataFrames'
  3.063209 seconds (7.07 M allocations: 386.460 MiB, 4.29% gc time)
~/w/TableReader (master|…) $ julia -e '@time using TranscodingStreams, CodecZlib, CodecZstd, CodecXz'
  0.240461 seconds (265.42 k allocations: 15.634 MiB)

@piever
Copy link
Author

piever commented Apr 1, 2019

Issue opened: JuliaData/DataFrames.jl#1764

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants