-
Notifications
You must be signed in to change notification settings - Fork 367
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
remove CategoricalArrays dependency, for performance #2321
Comments
This is the major point why #1764 is a priority (after we stabilize the API). As I guess there is no hope for solving these issues on the Julia Base level, an alternative that has been recently considered is to avoid having to define the functionality described in JuliaData/CategoricalArrays.jl#177 that causes the problems. Just to comment why CategoricalArrays.jl is a dependency:
|
Why can't things just work on a |
I would wait for @nalimilan to get back on-line to comment on this 😄. As you 👍 in JuliaLang/julia#32613 (comment) - removing these
so in a generic code you anyway have to check if you are not getting However, this change is hugely breaking, and I am sure @nalimilan has spent days thinking about it. In particular in JuliaData/CategoricalArrays.jl#177 (comment) I have posted an idea that maybe would solve the issue without having to be breaking. |
I've tried investigating this to see what are the problematic methods, and I couldn't replicate @vtjnash's findings on Julia master from two weeks ago nor on 1.4.2. My machine appears to be much slower than yours, but the gain from removing CategoricalArrays isn't that clear (-15%). At any rate, the code is much faster than on Julia 1.4.2.
@vtjnash What did you remove exactly in DataFrames to get rid of CategoricalArrays? What Julia version did you use? FWIW, I've pushed my code to the nl/removeCategoricalArrays branch. |
@vtjnash - you can index My tests are on Ubuntu 20.04 LTS, Julia Version 1.6.0-DEV.538: With CategoricalArrays.jl:
and without CategoricalArrays.jl (used my own way to remove the dependency)
|
I think you meant to @KristofferC there. And I was only looking at the absolute time; since I can make the overall relative percentage arbitrarily small it's not a particularly convenient metric. |
Here are a few timings from various tests I've just made (on Antarctic):
So it looks like we can go a long way by just dropping deprecated methods and restricting supported types to a fixed list. This restriction avoids defining methods for |
I suggest that we drop CategoricalArrays support from DataFrames for the 1.0 release. Removing it now leaves us open to being able to bring it, or something like it but with a slightly different (breakingly different) API back later. |
Actually my hope is that DataAPI.jl can allow to use CategoricalArrays.jl in DataFrames.jl without DataFrames.jl being aware of it most of the time (to be confirmed by @nalimilan) |
could you say what would change for users?
EDIT: You were once considering factoring out |
by default it will be ordered alphabetically. It will be opt-in to use and
Other than that we hope that in user-facing code things will not change (there will be some internal changes, but not user visible). |
Given discussions with @nalimilan it seems to be possible to:
|
This can be closed. CategoricalArrays.jl is no longer a dependency. |
Because CategoricalArrays gets loaded, due to issues such as JuliaData/CategoricalArrays.jl#177, everything you do with DataFrames can become a lot slower. For example:
With CategoricalArrays:
Without CategoricalArrays:
The text was updated successfully, but these errors were encountered: