Formalize API for column vectors #2567

Open
pdeffebach opened this issue Nov 30, 2020 · 4 comments
@pdeffebach
Contributor

I feel like the question about data frames with distributed arrays comes up a lot. My impression is that we don't know for sure whether a Dagger array etc. can "just work" as a column in a DataFrame.

I think I might try to write a custom vector type, put it in a data frame, and see how many functions I can call on it before it gets converted to a normal Vector. Then we can assess to what extent DataFrames can support Dask-like operations just by changing the vector type.
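A minimal sketch of what such an experiment could look like. `TracedVector` is a hypothetical wrapper type invented for illustration (it is not part of DataFrames.jl) and implements only the basic AbstractArray interface. `copycols = false` is passed because Base's generic `copy` allocates a plain `Array` unless the type defines its own `copy` or `similar`. The script just prints the column type after a few operations, so one can see where the wrapper is replaced.

```julia
using DataFrames

# Hypothetical wrapper type (an assumption for this sketch): it forwards the
# minimal AbstractArray interface to an ordinary Vector and nothing else.
struct TracedVector{T} <: AbstractVector{T}
    data::Vector{T}
end

Base.size(v::TracedVector) = size(v.data)
Base.IndexStyle(::Type{<:TracedVector}) = IndexLinear()
Base.getindex(v::TracedVector, i::Int) = v.data[i]
Base.setindex!(v::TracedVector, x, i::Int) = (v.data[i] = x; v)

# copycols = false asks DataFrames.jl to store the column object as passed
# instead of copying it.
df = DataFrame(x = TracedVector([1, 2, 3, 4]), g = [1, 1, 2, 2], copycols = false)

# Print the column type after each operation to see whether the wrapper survives.
println("stored column:  ", typeof(df.x))
println("df[:, :x]:      ", typeof(df[:, :x]))
println("after filter:   ", typeof(filter(:x => >(1), df).x))
println("after sort:     ", typeof(sort(df, :x).x))
println("after copy(df): ", typeof(copy(df).x))
```

Defining `Base.similar` and `Base.copy` for the wrapper would be the natural next step, since many of these operations allocate their result through those functions.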

@quinnj
Member

quinnj commented Nov 30, 2020

This is a great idea; in particular, it would be great to document which functions/methods are expected to work along with how they're used in DataFrames in different operations. Happy to help with this effort.

@bkamins added the non-breaking (the proposed change is not breaking) and performance labels Nov 30, 2020
@bkamins added this to the 1.0 milestone Nov 30, 2020
@bkamins
Member

bkamins commented Nov 30, 2020

The first candidates to break are fast aggregations like combine(gdf, :x => sum). In general, whenever DataFrames.jl "internally" creates a column, it is likely to assume that it is a "standard" vector. Similarly, in many operations DataFrames.jl internally creates Vectors for processing data (see e.g. the GroupedDataFrame struct definition).
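For context, a small sketch contrasting the two call styles on an ordinary column. Passing `sum` directly lets DataFrames.jl use its optimized reduction, while wrapping it in an anonymous function goes through the generic path, where the function is called once per group. The column names and data here are made up for illustration.

```julia
using DataFrames

df = DataFrame(g = [1, 1, 2, 2], x = [1.0, 2.0, 3.0, 4.0])
gdf = groupby(df, :g)

# Passing `sum` itself: eligible for the optimized aggregation.
fast = combine(gdf, :x => sum)

# Wrapping it in an anonymous function: handled by the generic per-group path.
generic = combine(gdf, :x => (v -> sum(v)) => :x_sum)

# Both give the same numbers; only the internal code path differs.
println(fast)
println(generic)

# Inspect what the user function receives on the generic path.
println(combine(gdf, :x => (v -> typeof(v)) => :arg_type))
```

A custom column type only has a chance of being seen by user code on the generic path; the optimized path works on the column through DataFrames.jl internals.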

Having said that, I think it should be doable to add "distributed" support to DataFrames.jl in the long run. However, we would probably need some API that communicates to DataFrames.jl how the distribution is performed (if you have distributed vectors, you most likely want to process them in a way that takes this into account).

@pdeffebach
Contributor Author

Yeah, I have no idea how distributed computing works, or threading for that matter. Still, I will put this on the to-do list for winter break / procrastination from school.

@nalimilan
Member

Somewhat related is whether we preserve the container types of input columns: #2569

I don't think DataFrames has very specific requirements for columns: apart from the issue of one-based indexing, which we should investigate if somebody cares, things should work as long as the AbstractArray interface is implemented. It probably won't be fast for distributed arrays, though, since we use for i in eachindex(col) loops a lot.
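To illustrate the performance concern, here is a sketch in plain Julia (no DataFrames.jl involved). `CountingVector` and `loop_sum` are hypothetical names invented for this example: the type implements the minimal AbstractArray interface and counts element reads, standing in for a column where every scalar `getindex` could mean a remote fetch.

```julia
# Hypothetical vector type (an assumption for this sketch) that counts how
# often a single element is read.
mutable struct CountingVector{T} <: AbstractVector{T}
    data::Vector{T}
    reads::Int
end
CountingVector(data::Vector{T}) where {T} = CountingVector{T}(data, 0)

Base.size(v::CountingVector) = size(v.data)
Base.IndexStyle(::Type{<:CountingVector}) = IndexLinear()
function Base.getindex(v::CountingVector, i::Int)
    v.reads += 1          # for a distributed array this could be a remote fetch
    return v.data[i]
end

# The loop pattern mentioned above: one scalar getindex per element.
function loop_sum(col::AbstractVector)
    s = zero(eltype(col))
    for i in eachindex(col)
        s += col[i]
    end
    return s
end

col = CountingVector(collect(1:1000))
total = loop_sum(col)
println("sum = ", total, ", scalar reads = ", col.reads)
```

The loop is correct for any type implementing the interface, but it performs one element access per row, which is the access pattern that tends to be slow for distributed storage.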

@bkamins closed this as completed Dec 1, 2020
@bkamins reopened this Dec 1, 2020
@bkamins mentioned this issue Mar 4, 2021
19 tasks
@bkamins modified the milestones: 1.0, 1.x Mar 25, 2021