feat: to_dask (or more generally to_batches?) #9891

Open
jcrist opened this issue Aug 21, 2024 · 5 comments
Labels: feature (Features or general enhancements), io (Issues related to input and/or output)

Comments

@jcrist (Member) commented Aug 21, 2024

Opening this mostly for discussion.

Say all your data lives in <big cloud provider db>. After doing some selecting/filtering/transforming, you want to export your data out of the DB and into a different distributed system like dask (or spark or others) to do some operations (ML training for example) that can't as easily be executed purely in the database backend.

Some of our backends provide efficient means for distributed batch retrieval. By this I mean a way to fetch query results in parallel (perhaps across a distributed system) rather than streaming them back through the client. In these cases, conversion of a result set to a distributed object (like a dask.dataframe) could be done fairly efficiently, and in a way that the user can't easily compose using existing API methods.


Systems that support this natively (see the sketch after this list for a concrete example):

  • dask
  • spark
  • bigquery
  • snowflake
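
For context, this is roughly what the native mechanism looks like today in the Snowflake Python connector (a sketch, not Ibis API; conn and the table name are assumptions):

# Sketch of native distributed batch retrieval via the Snowflake Python
# connector. `conn` is assumed to be an existing snowflake.connector
# connection, and the table name is made up.
cur = conn.cursor()
cur.execute("SELECT * FROM my_table")

# Each ResultBatch is a small, pickleable handle to one chunk of the result
# set; no data is downloaded until to_pandas()/to_arrow() is called, so the
# batches themselves can be shipped to remote workers.
batches = cur.get_result_batches()
dfs = [batch.to_pandas() for batch in batches]  # or call this on the workers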

We could support this as a general method even for systems where it's inefficient, but I'm not sure we'd want to do that. Better to error than to quietly pipe data through the client and back out to a cluster (and a user can fairly easily write that code themselves).


We could expose this as a to_dask method on an expression that does all the fiddly bits and returns a dask.dataframe object.

Alternatively (or additionally), we could generalize this to a to_batches (or a better name) method that returns a list of Batch objects, each of which has to_pandas/to_arrow/to_polars methods for fetching that partition as the specified type. These would be pickleable and could be distributed to any distributed system (dask/spark/ray/...).
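
To sketch the shape of such an object (all names here are illustrative, not a committed API):

from typing import Protocol

import pandas as pd
import polars as pl
import pyarrow as pa


class Batch(Protocol):
    """One partition of a query result.

    Instances should be cheap to pickle so they can be shipped to workers;
    the data itself is only fetched when a conversion method is called.
    """

    def to_arrow(self) -> pa.Table: ...

    def to_pandas(self) -> pd.DataFrame: ...

    def to_polars(self) -> pl.DataFrame: ...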

Conversion to a dask dataframe would then be something like:

import dask.dataframe as dd

batches = expr.to_batches()
ddf = dd.from_map(lambda batch: batch.to_pandas(), batches, meta=expr.schema().to_pandas())
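
For comparison, distributing the same hypothetical batches with ray might look something like this (again, to_batches and Batch are only the proposed names from above, not an existing API):

import ray

ray.init()

@ray.remote
def load_partition(batch):
    # Each worker fetches its own partition directly from the backend,
    # never routing the data through the client.
    return batch.to_pandas()

futures = [load_partition.remote(batch) for batch in expr.to_batches()]
partitions = ray.get(futures)
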
@jcrist changed the title from "feat: to_dask (or more generally to_partitions?)" to "feat: to_dask (or more generally to_batches?)" on Aug 21, 2024
@gforsyth added the feature and io labels on Aug 22, 2024
@ncclementi (Contributor) commented:

I like the idea of a general to_batches, and then, if needed or requested, we could add backend-specific APIs like to_dask_df / to_spark_df.

@cpcloud (Member) commented Aug 22, 2024

I think starting with to_dask makes sense. Supporting a general batching API doesn't (yet) seem worth the effort.

@jitingxu1 (Contributor) commented:

Hi @jcrist, thank you for creating this. Transferring data directly between the compute backend and another cluster, bypassing the client, is crucial for efficient ML training.

Starting with to_dask and then adding a general to_batches would be perfect. We could connect the compute backend to different kinds of training clusters, such as PyTorch or TensorFlow.

Please let me know if you need any help from me.

@cpcloud (Member) commented Aug 23, 2024

Not seeing exactly what to_batches is getting us here. Is this motivated by an ibis-ml use case?

@jitingxu1 (Contributor) commented Aug 27, 2024

to_batches could be more general: we could convert the batches to different formats, e.g. dask dataframes, tensors, or torch.

It would be great if we could pass the data from other backends to the training cluster without going through the client.

One direct use case is IbisML: we could demo large-scale training using Spark or BigQuery plus XGBoost/torch.
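
As a rough illustration of that (Batch/to_batches are the hypothetical API proposed above, and the column names are placeholders), the batches could feed a torch DataLoader without the client ever touching the data:

import torch
from torch.utils.data import DataLoader, IterableDataset


class BatchDataset(IterableDataset):
    """Stream rows from the hypothetical Batch objects as torch tensors."""

    def __init__(self, batches, feature_cols, label_col):
        self.batches = batches
        self.feature_cols = feature_cols
        self.label_col = label_col

    def __iter__(self):
        for batch in self.batches:
            # The partition is fetched here, on whichever worker runs the loader.
            df = batch.to_pandas()
            X = torch.tensor(df[self.feature_cols].to_numpy(), dtype=torch.float32)
            y = torch.tensor(df[self.label_col].to_numpy(), dtype=torch.float32)
            yield from zip(X, y)


# "f1", "f2", and "label" are placeholder column names for this sketch.
loader = DataLoader(BatchDataset(expr.to_batches(), ["f1", "f2"], "label"), batch_size=256)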
