Remove DataFrames dependency #19
Codecov Report
@@            Coverage Diff             @@
##           master      #19      +/-   ##
==========================================
+ Coverage   76.32%   76.54%   +0.21%
==========================================
  Files           4        4
  Lines         980      989       +9
==========================================
+ Hits          748      757       +9
  Misses        232      232
Thank you for your pull request. I greatly appreciate your work. The design goal I have in mind for this package is the readr package of R, which offers a very simple and concise API, and so I always use it to read tabular data files. I think people always want to keep easy things as easy as possible. What this package does is very easy: reading tabular data as a table-like object. To this end, the named tuple is a little bit awkward because it doesn't offer methods that make it like a table, such as joining, subsetting, sorting, and so on. The second reason is that the named tuple is not good at handling very wide tables. I tried this branch on a file with ~20,000 columns and it made Julia hang for an indefinitely long time (I killed the process after 10 minutes or so). The main reason I created this package is that I need to handle this kind of wide table.
Thanks for the explanation! I generally work with "long and thin tables" (say no more than 20 columns but up to 1_000_000 rows) and didn't imagine that even just creating the named tuple at the end would crash things. If you are looking for an untyped container of columns then
I would like to see DataFrames.jl become more lightweight --- closer to just defining the data structure and a bare minimum of operations. That would greatly help compile and load times for packages that want to ultimately convert CSVs to other kinds of structures. A NamedTuple of vectors is unfortunately not a great solution since (aside from julia performance problems) it doesn't really have anything like a table interface, despite what Tables.jl might claim. A StructArray would be better of course, but I'd also just settle for a super lightweight DataFrames.
https://github.com/queryverse/QueryTables.jl is probably as lightweight as it gets. It is still WIP because I can't make up my mind whether it should be entirely read-only or not. Also, it is tied to
I mostly want it to stop making breaking changes every few months. There is so much code out there that relies on its current behavior. The pain these redesigns cause the casual user is just enormous. I think we don't see them in the forums/slack etc., but I have about a dozen of them in my lab and I think there are few things we can do to turn them away from julia more effectively than breaking things like the
It would be great to have a widely-used table type whose base implementation is that simple. I don't mind Can we make DataFrames significantly less than its current 9000+ LOC, and make it not depend on things like StatsBase and CategoricalArrays, without breaking its API?
I completely agree that several packages (CSV, RDatasets, now TableReaders) become a bit of a harder sell (as a dependency) because they require DataFrames (only so that they could output a On a somewhat tangential note, IndexedTables could also use a lightweight untyped table structure to modify columns. To modify many columns (replace them with new vectors, add or remove columns), at the moment it converts to a custom
That's an excellent idea! We should pull the core type into a new DataFramesBase, which DataFrames depends on. Users of DataFrames will be totally unaffected. Packages that want a lighter-weight dependency can switch to DataFramesBase. Sounds perfect to me.
I quickly estimated the load time occupied by DataFrames.jl. The total load time is roughly 3.40 seconds, of which 3.06 seconds come from DataFrames.jl; the other packages take 0.24 seconds in total. So, 90% of the load time is occupied by DataFrames.jl! If we could make it a much more lightweight package, the benefit would be substantial.
Issue opened: JuliaData/DataFrames.jl#1764
This removes the DataFrames dependency and outputs the data as a named tuple of columns, e.g.
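For illustration, a named tuple of columns might look like the sketch below. The column names and values are made up and are not output from this branch; only Base Julia is used.

```julia
# Hypothetical result of reading a small CSV file: one vector per column,
# keyed by column name. Names and values are invented for illustration.
table = (id = [1, 2, 3], name = ["a", "b", "c"], score = [0.5, 1.5, 2.5])

# Columns are accessed by name, like fields of a struct.
ids = table.id

# Column names and the row count can be recovered with Base functions only.
column_names = keys(table)          # tuple of Symbols
row_count = length(first(table))    # length of the first column
```

Note that every column name and element type becomes part of the tuple's type, which is relevant to the wide-table concern raised in the discussion.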
The rationale is that a named tuple of columns is already considered a table by the Tables interface and it can be converted to any other table type by:
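As a rough sketch of that idea (not necessarily the snippet the PR showed), the following assumes Tables.jl is installed and uses a made-up two-column table. It checks that a named tuple of vectors is accepted as a table and round-trips it through a row-oriented form.

```julia
using Tables  # assumes Tables.jl is available in the environment

# Hypothetical column table; names and values are invented for illustration.
cols = (id = [1, 2, 3], name = ["a", "b", "c"])

# A NamedTuple of vectors satisfies the Tables.jl interface as-is.
is_table = Tables.istable(cols)

# Convert to a vector of NamedTuples (one per row) and back to columns.
rows = Tables.rowtable(cols)
back = Tables.columntable(rows)
```

Other table types that accept any Tables.jl source in their constructor can consume `cols` the same way, for example `DataFrame(cols)` when DataFrames is loaded.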
So it feels strange to have to load DataFrames (which is a fair amount of code) when one is using this parser from another table package (like StructArrays, IndexedTables or TypedTables).
The main downside is that users of DataFrames have to do DataFrame(readcsv(filename)).