feat: to_dask (or more generally to_batches?) #9891

Open
jcrist opened this issue Aug 21, 2024 · 5 comments
Labels: feature (Features or general enhancements), io (Issues related to input and/or output)

Comments

@jcrist (Member) commented Aug 21, 2024

Opening this mostly for discussion.

Say all your data lives in <big cloud provider db>. After doing some selecting/filtering/transforming, you want to export your data out of the DB and into a different distributed system like dask (or spark or others) to do some operations (ML training for example) that can't as easily be executed purely in the database backend.

Some of our backends provide efficient means for distributed batch retrieval. By this I mean a way to fetch query results in parallel (perhaps across a distributed system) rather than streaming them back through the client. In these cases, conversion of a result set to a distributed object (like a dask.dataframe) could be done fairly efficiently, and in a way that the user can't easily compose using existing API methods.


Systems that support this natively (see the sketch after this list for a concrete example):

  • dask
  • spark
  • bigquery
  • snowflake
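
For context, this is roughly what the native mechanism looks like today in the Snowflake Python connector (a sketch, not Ibis API; conn and the table name are assumptions):

# Sketch of native distributed batch retrieval via the Snowflake Python
# connector. `conn` is assumed to be an existing snowflake.connector
# connection, and the table name is made up.
cur = conn.cursor()
cur.execute("SELECT * FROM my_table")

# Each ResultBatch is a small, pickleable handle to one chunk of the result
# set; no data is downloaded until to_pandas()/to_arrow() is called, so the
# batches themselves can be shipped to remote workers.
batches = cur.get_result_batches()
dfs = [batch.to_pandas() for batch in batches]  # or call this on the workers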

We could support this as a general method even for systems where it's inefficient, but I'm not sure we'd want to do that. Better to error than to quietly pipe data through the client and back out to a cluster (and a user can fairly easily write that code themselves).


We could expose this as a to_dask method on an expression that does all the fiddly bits and returns a dask.dataframe object.

Alternatively (or additionally), we could generalize this to a to_batches (or a better name) method that returns a list of Batch objects, each of which has to_pandas/to_arrow/to_polars methods for fetching that partition as the specified type. These would be pickleable and could be distributed to any distributed system (dask/spark/ray/...).
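
To sketch the shape of such an object (all names here are illustrative, not a committed API):

from typing import Protocol

import pandas as pd
import polars as pl
import pyarrow as pa


class Batch(Protocol):
    """One partition of a query result.

    Instances should be cheap to pickle so they can be shipped to workers;
    the data itself is only fetched when a conversion method is called.
    """

    def to_arrow(self) -> pa.Table: ...

    def to_pandas(self) -> pd.DataFrame: ...

    def to_polars(self) -> pl.DataFrame: ...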

Conversion to a dask dataframe would then be something like:

import dask.dataframe as dd

batches = expr.to_batches()
ddf = dd.from_map(lambda batch: batch.to_pandas(), batches, meta=expr.schema().to_pandas())
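
For comparison, distributing the same hypothetical batches with ray might look something like this (again, to_batches and Batch are only the proposed names from above, not an existing API):

import ray

ray.init()

@ray.remote
def load_partition(batch):
    # Each worker fetches its own partition directly from the backend,
    # never routing the data through the client.
    return batch.to_pandas()

futures = [load_partition.remote(batch) for batch in expr.to_batches()]
partitions = ray.get(futures)
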
@jcrist changed the title from "feat: to_dask (or more generally to_partitions?)" to "feat: to_dask (or more generally to_batches?)" on Aug 21, 2024
@gforsyth added the feature and io labels on Aug 22, 2024
@ncclementi (Contributor) commented:

I like the idea of a general to_batches, and then, if needed or requested, we could add backend-specific APIs like to_dask_df / to_spark_df.

@cpcloud (Member) commented Aug 22, 2024

I think starting with to_dask makes sense. Supporting a general batching API doesn't (yet) seem worth the effort.

@jitingxu1 (Contributor) commented:

Hi @jcrist, thank you for creating this. Transferring data directly between the compute backend and another cluster, bypassing the client, is crucial for efficient ML training.

Starting with to_dask and then adding a general to_batches would be perfect. We could connect the compute backend to different kinds of training clusters, such as PyTorch or TensorFlow.

Please let me know if you need any help from me.

@cpcloud (Member) commented Aug 23, 2024

Not seeing exactly what to_batches is getting us here. Is this motivated by an ibis-ml use case?

@jitingxu1 (Contributor) commented Aug 27, 2024

to_batches could be more general: we could convert the batches to different formats, e.g. dask dataframes, tensors, or torch.

It would be great if we could pass the data from other backends to the training cluster without going through the client.

One direct use case is IbisML: we could demo large-scale training using Spark or BigQuery plus XGBoost/torch.
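
As a rough illustration of that (Batch/to_batches are the hypothetical API proposed above, and the column names are placeholders), the batches could feed a torch DataLoader without the client ever touching the data:

import torch
from torch.utils.data import DataLoader, IterableDataset


class BatchDataset(IterableDataset):
    """Stream rows from the hypothetical Batch objects as torch tensors."""

    def __init__(self, batches, feature_cols, label_col):
        self.batches = batches
        self.feature_cols = feature_cols
        self.label_col = label_col

    def __iter__(self):
        for batch in self.batches:
            # The partition is fetched here, on whichever worker runs the loader.
            df = batch.to_pandas()
            X = torch.tensor(df[self.feature_cols].to_numpy(), dtype=torch.float32)
            y = torch.tensor(df[self.label_col].to_numpy(), dtype=torch.float32)
            yield from zip(X, y)


# "f1", "f2", and "label" are placeholder column names for this sketch.
loader = DataLoader(BatchDataset(expr.to_batches(), ["f1", "f2"], "label"), batch_size=256)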
