-
Notifications
You must be signed in to change notification settings - Fork 367
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
move DataFrame type definition to lightweight package #1764
Comments
Methods that overload Base functions and other extend functions from other packages would probably need to move to the new lightweight package as well (type piracy). |
I am OK to go this way, but could you be more specific what you expect that can be stripped? I am asking, because I expect that most of the things will have to stay anyway. The majority of dependencies are needed by various |
Yes, I'm in favor of both cleaning up DataFrames dependency (I think the StatsBase could be inside a Requires.jl block, for example) and moving the core type definition + constructor to a separate package (DataFrameTypes.jl?), I guess we'd need |
I think it's fine to have some constructors defined in DataFramesBase and some in DataFrames, if that helps. |
I see the reasoning, but wouldn't it be weird if CSV parsers returned a We could probably get rid of a few dependencies like CategoricalArrays and PooledArrays by using Requires.jl, but apparently it currently incurs a significant slowdown at load time. |
Actually most of the dependencies can be lazily loaded and most of them will be never loaded in a normal workflow. Maybe this is the way to go?. |
One issue is that AFAIK we cannot load CategoricalArrays automatically when calling |
Depends how we define "most" :) I think it is highly valuable to have a standard data structure like I hate to say this, but |
EDIT: Also we rely on |
I'd be happy to pick a place to draw the line. We do that in API design all the time. I also humbly submit that if something as byzantine as Requires.jl is needed to concatenate tables, we are doing something wrong. |
Let me write down the DataFrames.jl dependency list and where they are used (if I have missed something please add here):
|
I am a bit confused by this point. I imagine most users do
OTOH I appreciate that splitting the package without type piracy may be tricky. I imagine that the things to take out of DataFramesBase for sure are the functions defined in DataFrames that are not imported from elsewhere ( I'm not sure that IteratorInterfaceExtensions and TableTraits constructors are needed as direct dependencies now that Tables "incorporates them". Sorting is a bit trickier though: I've had similar difficulties with StructArrays: it was difficult to define efficient sorting by multiple columns without extra dependencies (and in the end I needed PooledArrays). |
@piever even if functions are usually not reexported, methods are implicitly (e.g. Do we actually know what makes loading slow? Is it DataFrames per se, one of more dependencies, or the fact that Tables.jl uses Requires? I think we should experiment with this. For example,
This was precisely the goal of defining a generic interface in Tables.jl. In an ideal world I would like CSV and TableReaders to return a Therefore an argument for not splitting DataFrames is that it will annoy @JeffBezanson so much that he will implement a way to avoid that (e.g. that's possible in R, though I guess it may be more difficult in Julia). :-) In the meantime, as I suggested above, I would just require people to write |
This proposal would leave DataFrames alone. I agree it looks possible to stop using Requires.jl in Tables. I think there are a few reasons DataFrames is bulky:
|
Yet that requires keeping the two packages in sync, which is painful for development. That's not too terrible if we just move out the type definitions, but as I said I would find it confusing that things like
What determines the time to load a package? I guess it's more the number of methods (and which methods) than the number of lines? I'd like to see a few experiments to see which dependencies are the most costly. Deprecations can go away at any time, but we've had complaints that we changed too many things quickly and that users may have code that only runs every few months. Again, this can be benchmarked. |
A minimal DataFramesBase.jl specification would be:
This would have no dependencies. This set would allow to create a Conclusion:
Now regarding dependency costs (each call on clean environment):
|
Load time is one thing, but I think the real problem is more conceptual (e.g. I'd be perfectly happy if we can just speed up julia to reduce the load time for the same code). It should be possible to read a csv file or make some simple tabular data without needing StatsBase, DataStructures, and CategoricalArrays. I see it as a design and code factoring issue, which currently happens to have load time (and code size) consequences as well. Yes, the number and nature of methods is important. In fact, the real cost is often not measured as part of the load time, since adding methods can cause us to recompile code later (e.g the repl gets laggy again for a while). It can also cause issues like JuliaLang/julia#31535 (though fortunately that has been fixed). Having lots of unnecessary methods loaded has other minor consequences too, like polluting tab-completion results. |
I can see it would be a problem to constantly need to commit to both packages, but my impression is that DataFramesBase would not need to change very much if at all. Is that true? Or is there some other problem? |
If we decide on a minimum set of operations in TableOperations.jl that all tabular formats need to support, then we can create an abstract type That seems like ultimately what we want, but it would be a long way off and users would have to put up with compile times for a while. Also this makes something like #1458 adding meta-data to dataframes very hard, since now it's the defacto "minimum" table implementation that we cannot add features to. |
Maybe let us try to discuss the different heavy dependencies:
So in short - now DataFrames.jl loads in 3 seconds - we can speed up loading time by 1 second easily (if we decide to rename |
It seems odd to me that Tables.jl, which is supposed to be the ultra light interface package is one of the packages with the highest impact on loading time. |
I have the impression that there are some basic methods that are needed from "categorical array" type of packages and WeakRefStrings that complicate the scenario. Tables provides a method to allocate a vector (to collect an iterable into a column based table) based on the element type so that some special types get special vectors (for example I wonder whether there is a good solution for the |
@JeffBezanson StatsBase is supposed to go away anyway. We could move Regarding CategoricalArrays and PooledArrays, we notably depend on them to implement optimized functions (grouping, hashing). It would sound natural to me to use Requires for that. That doesn't sound like a conceptual issue at all. @piever noted StructArrays has the same issue. Do you suggest we also create StructArraysBase? I rather think the performance problem should be fixed.
@JeffBezanson It depends on what you put there, but even the list @bkamins gave above covers a lot of code. Note that we're currently redesigning quite a few fundamental things, which is why I don't see splitting the package into pieces as a high priority thing. Hopefully things will stabilize later, but it's almost certain is that many changes will require adding version bounds, which is painful (as we already experience with StatsBase-StatsModels-GLM).
@KristofferC AFAIK that's due to Requires (as we discussed the other day). |
Reminds me of the situation in Images.jl, where they also have a metadata-carrying type. The solution adopted there was to allow any array to be used as an image, and make the image-with-metadata type an optional wrapper. Hopefully something similar could be done here. But ok --- it sounds like things are already moving in the right direction, and we will get pretty far by de-coupling from StatsBase and adding
Fair question. But StructArrays has exactly 2 dependencies (Requires, PooledArrays) and ~700 LOC, compared to DataFrames with 11 dependencies and 9000 LOC. However I agree that the dependency on PooledArrays is ugly. Ideally we could have some abstractions allowing both PooledArrays and CategoricalArrays to provide optimized functionality when they are used as columns. |
|
TableTraits.jl provides an ultralight table interop story: Less than 50 LOC, and no business with Requires.jl (which I think packages that are low in the stack should avoid at all costs, not just for the performance impact but especially because it significantly complicates the dependency versioning story, IMHO). |
I've opened JuliaLang/julia#31601 with the above proposal on how to solve the need for
I'll just add my thoughts here (sorry for off topic) as I think it's a relevant comparison. I still aim to keep the package lightweight. To get rid of WeakRefStrings in the Requires block and PooledArrays I would need the following two things:
I like supporting the Tables interface, but when I added it there was some pushback so I put it in a Requires block, but this doesn't feel optimal either: I'm just not sure what the optimal solution is. |
💯 As an abstraction that's intended for general "common denominator" use, Tables should not be wilfully bypassing dependency resolution. |
Tables could still have the packages as dependencies in the Project.toml file, just chose to conditionally load them. Installing packages is cheap. |
I've added a proposal to address this in JuliaLang/julia#31606. |
https://discourse.julialang.org/t/sluggish-performance-when-loading-in-modules-dataframes-csv/17857 Slow loading time on a startup package really make people think julia is slow and would decrease potential users. Hope this gets fixed soon. |
Surprisingly, most of that is due to CSV.jl AFAICT. Anyway, there's been some progress on that front via https://github.com/JuliaData/DataAPI.jl. |
Just to connect some pieces here, the idea of splitting up Since data manipulation syntax is something that people can be quite opinionated about (myself certainly among them), and because the nomenclature that different paradigms use often overlaps (namely |
As noted earlier:
|
One more thought. @dgkf - actually it might be easier for you to get going by using a bit different approach. You can create a package of whatever name you like and reexport from it only those functions from DataFrames.jl that you want while you can define your own functions even with the same names that DataFrames.jl defines in whatever way you like without any problems. The only restriction here is that functions that are defined in Base cannot be handled this way, but I guess you would treat them as low-level anyway (and we tried to keep them consistent with Base as much as possible with these functions). It seems to me that taking this approach would be much less work than decoupling "internals" of DataFrames.jl from high-level API and in this way you could field-test your ideas more quickly and easily. |
Thanks, @bkamins for hearing me out on this one. I feel like I either entered the conversation at exactly the right time before a v1.0 release or exactly the wrong time as to throw a huge wrench into everyone's plans - maybe both! I'm very sorry for that! I think that as far as API experimentation goes, it's certainly a manageable space currently. My priority would be to separate out the underlying |
Discussion is highly appreciated. The issue is that you are touching points that are difficult to get right and different options have pros and cons (and people with different backgrounds tend to significantly differ in their opinions), which means we cannot expect to reach definitive conclusions quickly and easily (see e.g. discussions around E.g. having DataFrames.jl split into two packages is appealing but we need to understand the benefits of it, as there are clear and significant costs, as having two packages makes development of the ecosystem much harder (you have to synchronize releases of both packages, CI is harder etc.). And then the benefits and costs should be compared and decision what to do made.
This is exactly the problem currently no one came with a good solution. That is why I have proposed to start with a simpler approach - if you would consider doing it.
In this way you can easily experiment what is the minimal set of functions that you feel should go into each of the packages + you can see what else (instead of removed functionality) can be implemented as an alternative. The key thing in this approach is that even if someone does not like e.g. |
Two additional thoughts:
Thank you for your support and thoughts! |
Thanks @bkamins, you're right. This probably isn't my place to contribute so early on in getting a feel for the package trajectory. I'd be happy to contribute to something smaller - please keep me posted on where I can help. I only offered my support for this particular issue so that the limiting factor isn't someone's time as I see it as a pretty critical step for the package ecosystem. I completely agree with your vision for the package. As a new user myself, the "not spending time investigating the package ecosystem" and wanting something that "will just work" really speak to my experience. I wouldn't say I expected all the bells and whistles to come from |
Here is my view on this discussion (as a large scale user of DataFrames.jl in my lab group, not as the queryverse author): I think the number 1 priority should be to get to a 1.0 and stability, ideally with as little breaking from the status quo as possible. The API that we have at this moment in DataFrames.jl is literally used by thousands, if not tens of thousands pieces of code, so I think the benefits of not breaking those pieces of code far outweigh pretty much all other considerations. I think there will always be disagreement on what the ideal approach in terms of surface API for a package like DataFrames.jl should be, but I really hope that we would consider that debate as done and over for DataFrames.jl at this point, and just consider that it is what it is at this point (namely a package that is not a minimal data structure, but a one-stop package that includes a lot of manipulation functionality), simply because anything else is just too disruptive for the end-users of this package, and there are too many of them. That does not mean that we shouldn't experiment with lots and lots of other models, designs etc. in this space! I'm a huge fan of trying to come up with new designs etc. BUT, I think that should happen in other packages, not in DataFrames.jl. At this point we have really excellent interop of table types in the ecosystem, so if someone wants to introduce a lightweight table data structure only package that doesn't have much beyond that, then it is actually super easy to pull that off, and then implement a few basic interfaces and interop with pretty much everything that is out there in terms of table support. Integration with file IO, plotting, viewing etc. all pretty much comes at very, very minimal code cost at this point. And then we can see how that catches on or not, without disrupting the huge existing user base of DataFrames.jl. Here is my view as the queryverse author: from that end it really doesn't matter :) Query.jl and all the other packages in queryverse have worked with all tabular data structures that exist in Julia land for more than two years now, so I think that is a pretty clear demonstration that one can create alternate APIs that are not tied to a particular tabular data structure in the system we have right now. Query.jl works with DataFrames.jl, but it equally well works with TypedTables, IndexedTables, JuliaDB.jl, whatever. If a user wants to have a tabular data type that is more light weight than DataFrames.jl and wants to use it with Query.jl, they can. If they prefer the batteries included approach of DataFrames.jl and want to use it with Query.jl, they also can. I see no reason why other manipulation frameworks that take a different API approach than Query.jl couldn't work exactly the same way. So I think the status quo is already such that it allows really easy experimentation with N table types and M manipulation frameworks, and have all N to M combinations just work. |
@davidanthoff - thank you for the comment. I must say that it really clarifies for me some final design concerns we recently have trying to ship out 1.0 release. Essentially it says that when in doubt we should try not to break things. |
I don't have enough experience with Julia to make decisions regarding the breakdown of the package(s), but I figured I could help by trying to make the decision-making process a bit easier. I made this script to create a table of exported DataFrames.jl functions, and for methods that extend base functions, itemize the signature and keyword arguments. If there's other data people would like to see in here or find an issue, please let me know - I like doing this type of code analysis work. You get a table that looks something like this (showing 5 of ~350 rows):
|
I'll just comment here that I largely agree with @davidanthoff's comments that we definitely want to avoid breakage in DataFrames at this point, and that we have pretty powerful interfaces that mean things are decently abstracted anyway. In my mind, I love efforts like the Selections.jl package from @Drvi , which aim to take functionality, currently specific to DataFrames, and make it more general to all table types. There's power there because you have lightweight "table types" like |
My experience from last several months of trying to push out DataFrames.jl to 1.0 release (like #2206 discussion currently) lead me to thinking that actually we should split "core" of DataFrames.jl and "utility functions". Tentatively I would keep DataFrames.jl name for the package with utility functions and DataFramesBase.jl for the core (but this can be discussed). The issue is (and I guess most of people who commented in this thread were clear about it from the start 😄):
What I would keep in DataFramesBase.jl:
What notably I would exclude from DataFramesBase.jl and keep in DataFrames.jl:
If we wanted to do this this would be done post 0.21 release. Then we could quickly have DataFramesBase.jl 1.0 release (needed for package developers to have a firm base) and DataFrames.jl would not have a pressure to stabilize as there are constant discussions what would be best in the "utility functions" layer. EDIT Also then it would be much easier to maintain DataFramesBase.jl with a limited number of contributors and I would try to encourage more people to contribute to DataFrames.jl "utility part". |
I would very much support this plan. |
Some more thoughts before I move with the implementation. I think we have two options:
Do you have an opinion which one of them would be better? In general the (easy) option (only type definitions) should be enough for package development, but user would need to load DataFrames.jl later anyway for things to work. |
See discussion starting from this comment. The key issue is that some CSV reading packages like to have a
DataFrame
as a default output: it is probably the most popular tabular data format in Julia and is preferrable to the "typed alternatives" when the data is wide (NamedTuple
based alternatives have poor performance when the number of columns is in the thousands). In practice, both CSV.jl and TableReader.jl have a dependency on DataFrames which they only use so that they can output aDataFrame
. Unfortunately, this causes a slowdown when loading those packages due to the complexity of the code ofDataFrames
. It also makes those packages less appealing as dependencies for tabular data packages that rely on alternative formats (like IndexedTables, StructArrays or TypedTables).A proposed solution is to split the
DataFrame
definition and constructors to a DataFramesBase packages which would have basically zero dependencies and would be very fast to load, so that CSV and TableReader can just depend on DataFramesBase. DataFrames would then reexportDataFramesBase.DataFrame
so this change would not affect users.As an added bonus, I suspect this may make it more appealing to port IndexedTables from their current "mutable table" structure (a dictionary of columns) to just using a
DataFrame
: convertingIndexedTable
toDataFrame
, replacing a few columns in various ways and converting back toIndexedTable
is a useful workflow.The text was updated successfully, but these errors were encountered: