
data.table spark/databases interface #1828

Closed
ysgit opened this issue Aug 26, 2016 · 26 comments
Labels: feature request · someday · top request (One of our most-requested issues)

Comments

@ysgit

ysgit commented Aug 26, 2016

data.table is awesome, but most people don't have 100 GB of memory to handle really large data sets in memory.

Big progress has been made in the last couple of years on making the Apache Spark framework available from R. Two such projects are Apache's SparkR and RStudio's sparklyr. Both provide a dplyr-style interface to Spark's data processing engine.
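For concreteness, here is roughly what that dplyr-style route looks like today via sparklyr (a sketch; the connection master, file path, and column names are illustrative):

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")                 # or a cluster master
trades <- spark_read_csv(sc, "trades", "trades.csv")  # register a Spark table
trades |>
  filter(qty > 0) |>                                  # executed by Spark, not R
  group_by(ticker) |>
  summarise(vol = sum(qty, na.rm = TRUE)) |>
  collect()                                           # pull the small result into R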

As a heavy data.table user, it would be amazing if there were a data.table interface for Spark. That would make it incredibly easy for data scientists to migrate their projects from smaller CSV-style data sets to the huge data sets that Spark can process.

A classic data pipeline for me is:

  1. Bring the data into R by CSV
  2. Do some pre-processing (filters, joins, aggregation, feature extraction) of the data using data.table
  3. Build a model using one of R's many machine learning packages

I want to be able to migrate this to:

  1. Connect to data on a Hadoop cluster
  2. Do some pre-processing (filters, joins, aggregation, feature extraction) of the data using data.table's spark interface
  3. Build a model using one of Spark's many machine learning algorithms.
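For reference, the first pipeline in plain data.table looks roughly like this (a sketch; the file name, column names, and model are illustrative):

library(data.table)

dt <- fread("trades.csv")                          # 1. bring the data into R by CSV
dt <- dt[qty > 0]                                  # 2. filter ...
dt[, day_vol := sum(qty), by = .(ticker, date)]    #    ... and derive a feature by group
fit <- lm(price ~ day_vol, data = dt)              # 3. fit a model

The wish is that step 1 becomes a connection to the cluster while steps 2 and 3 keep exactly this syntax.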
@mattdowle
Member

Thanks for the encouragement. Fully agree.
Yes there are lots of efforts in this space. Note that I (and now Jan Gorecki) work at H2O.
I gave a recent presentation here: https://www.youtube.com/watch?v=5X7h1rZGVs0
The slides are here: https://github.com/Rdatatable/data.table/wiki/Presentations
Just to check: had you seen these before, before we discuss further?

@sskarkhanis

I really hope DT syntax is available someday. Personally, I prefer the DT syntax and would like to use it consistently in Spark rather than cringe at dplyr.
Fingers crossed...

@MichaelChirico
Member

I'm curious what people want out of this, exactly.

Just to be able to use [ on an RDD like you would on a data.table (namely, i/j/by)?

Certainly the full functionality is a long way off, but I imagine it wouldn't be too earth-shaking to make an idiom for filtering, grouping, even joining, by sending the syntax within [] to the corresponding operations in SparkR.

In particular, this would just amount to (in essence) aliasing SparkR functions in a syntax friendlier for data.table regulars.

Is this what people have in mind?
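A rough sketch of that aliasing idea, assuming a hypothetical spark_dt wrapper class around a SparkR DataFrame (the class, the sdf attribute, and the single-expression i/j/by handling are all illustrative, not an existing API):

library(SparkR)

`[.spark_dt` <- function(x, i, j, by) {
  df <- attr(x, "sdf")   # the wrapped SparkR DataFrame
  # let bare column names in i/j/by resolve to SparkR Column objects
  cols <- setNames(lapply(columns(df), function(nm) df[[nm]]), columns(df))
  env <- list2env(cols, envir = new.env(parent = parent.frame()))
  if (!missing(i))  df <- filter(df, eval(substitute(i), env))    # i  -> filter()
  if (!missing(by)) df <- groupBy(df, eval(substitute(by), env))  # by -> groupBy()
  if (!missing(j))  df <- agg(df, V1 = eval(substitute(j), env))  # j  -> agg()
  df
}

So sdt[b == 2, sum(a), b] would evaluate, in effect, as agg(groupBy(filter(df, df$b == 2), df$b), V1 = sum(df$a)).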

@jangorecki
Member

No updates; there is a lot to develop in data.table itself, so external interfacing is not a high priority now. Instead of just a Spark integration, it makes more sense to integrate with dplyr, something like dplyr.table (the inverse of dtplyr). Then any dplyr backend would work with data.table syntax.

@DavidArenburg
Member

DavidArenburg commented Aug 12, 2018

@jangorecki The problem is that such an interface would also be slow, full of bugs, and very unstable, as the dplyr interface changes on an almost daily basis, and sometimes they change the whole idiom at once, as they did with lazyeval > rlang > tidyeval and G-d knows what else; I lost track long ago. Not to mention that Hadley, who once stated (with a "tongue in cheek") that data.table uses "cryptic shortcuts", now masks a few of these shortcuts and suddenly doesn't consider them so cryptic anymore. In short, creating such an API would be a full-time job, IMO.

I think migrating a few main functionalities from data.table, and adding more as time allows, would be much safer and easier.

@jangorecki
Member

jangorecki commented Apr 24, 2019

@DavidArenburg Agreed; thus I would suggest waiting at least until dplyr 1.0 before starting any serious development of such a dplyr.table interface.

@jangorecki jangorecki changed the title [Request] data.table spark interface data.table spark interface Apr 6, 2020
@jangorecki jangorecki removed the High label Jun 3, 2020
@jangorecki jangorecki added the top request One of our most-requested issues label Jun 27, 2020
@griipen

griipen commented Oct 27, 2020

Have you arrived at a conceivable roadmap for a Spark integration project (a reverse dtplyr or any other form), given that dplyr 1.0 has been released? It would be great to hear your thoughts now that some time has passed.

@ColeMiller1
Contributor

@jangorecki If the dplyr.table approach is key, or any backend for that matter, it seems like tighter integration with data.table would be necessary. Take i, which can be:

  1. A character vector used, via NSE, for key matching on a data.table
  2. A logical vector of length 1 or nrow(dt), with potential for notjoin NSE
  3. A numeric vector (or single-column numeric matrix), with potential for notjoin NSE
  4. A list-like that produces a join or anti-join

While working on #4585, I also wrote functions to process the isub to its end points, but did not include them in the PR because, with all the variables needed to process the isub, it was not clean. However, to implement a backend, a function to process the isub would be useful so that NSE is handled consistently. Otherwise, it would be very easy for dplyr.table's i processing to fall out of sync with data.table's.
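To illustrate why a shared helper would matter, a hypothetical sketch of a tiny isub classifier (classify_isub() is illustrative; data.table exports no such function, and the real rules also evaluate i with the table's columns in scope):

classify_isub <- function(isub, env = parent.frame()) {
  # a leading `!` marks a notjoin / anti-join in data.table syntax
  if (is.call(isub) && identical(isub[[1L]], as.name("!")))
    return(paste("notjoin of", classify_isub(isub[[2L]], env)))
  val <- eval(isub, env)
  if (is.character(val))    "key match"
  else if (is.logical(val)) "logical subset"
  else if (is.numeric(val)) "row numbers"
  else if (is.list(val))    "join (list-like)"
  else stop("unsupported i type")
}

classify_isub(quote(!c("a", "b")))  # "notjoin of key match"
classify_isub(quote(seq_len(3)))    # "row numbers"

Any backend hand-rolling these rules instead of calling one exported helper is exactly how the two implementations would drift apart.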

@jangorecki
Member

That makes sense, but for that we would have to write multiple new helpers that encapsulate the internal logic for interpreting input arguments, and then export them so such a tool can easily mimic this logic. Describing our current API with helpers is not a trivial task. See related #852.

@drag05

drag05 commented Nov 15, 2020

@jangorecki
I sincerely hope data.table will become a viable solution on its own and help completely avoid the dplyr/tidyverse fluff when (and not only when!) interacting with large datasets and out-of-memory computing. So far I find data.table a marvel of clarity and efficiency: almost perfect, with excellent integration with mlr3verse, to give an example.

@jangorecki
Member

jangorecki commented Jan 7, 2024

The proposed dplyr.table does not need to interact with data.table at all and can be a completely standalone package.
All it needs is to mimic data.table's API:

DT[ subset|order, select, groupby ]

That makes it much easier to deliver than trying to fit the translation inside [.data.table.

Then possible usage could look like this:

library(data.table)
dt = data.table(a=1:4, b=1:2)

library(dplyr.table)
dp = as.dplyr.table(dt)

all.equal(
  dt[, sum(a), b],                      # evaluated by data.table
  dp[, sum(a), b] |> as.data.table()    # evaluated by the dplyr backend
)

The latter, dp, could theoretically be any dplyr backend: Spark, duckdb (personally I like duckdb a lot, but the lack of a user-friendly API still pushes me away), and so on.
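A minimal sketch of what that standalone mimicry could look like on top of plain dplyr verbs (every name here, including the dplyr.table class, is hypothetical; by is assumed to be a single column and j an aggregating expression):

library(dplyr)
library(rlang)

as.dplyr.table <- function(x) structure(list(tbl = x), class = "dplyr.table")

`[.dplyr.table` <- function(x, i, j, by) {
  tbl <- x$tbl                                               # any dplyr backend table
  if (!missing(i))  tbl <- filter(tbl, !!enquo(i))           # subset  -> filter()
  if (!missing(by)) tbl <- group_by(tbl, !!enquo(by))        # groupby -> group_by()
  if (!missing(j))  tbl <- summarise(tbl, V1 = !!enquo(j))   # j       -> summarise()
  as.dplyr.table(tbl)
}

as.data.table.dplyr.table <- function(x, ...)
  data.table::as.data.table(collect(x$tbl))

Naming the unnamed j result V1 even matches data.table's default, so the all.equal() check above has a chance of passing.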

@jangorecki jangorecki changed the title data.table spark interface data.table spark/databases interface Jan 7, 2024
@grantmcdermott
Contributor

grantmcdermott commented Jan 7, 2024

I might be missing something (sorry: long, old thread with several hidden replies). But since we're ultimately talking about syntax masking/mimicking, wouldn't it be easier in the long run to create something like a database.table package that translates the [i, j, by] into the appropriate backend(s)? Or, perhaps easier for conversion, going through the new DT(i, j, by) functional syntax that Matt introduced not so long ago.

Having a dedicated database.table frontend package that directly controls the syntax generics is probably more aligned with the data.table way than going through d(b)plyr in the end.

(This might be Jan's point, so again apologies if I'm just quibbling over the name.)

+1 on DuckDB, although I do think their SQL API is much better than the alternatives.
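To make the database.table idea concrete, a toy sketch of translating [i, j, by] directly into SQL text (translate_ijby() is hypothetical, and deparse() only happens to produce valid SQL for trivially overlapping expressions like these; a real package would need a proper expression translator, as dbplyr has):

translate_ijby <- function(table, i, j, by) {
  sel <- deparse(substitute(j))                              # j  -> select list
  grp <- if (missing(by)) NULL else deparse(substitute(by))  # by -> grouping column
  sql <- sprintf("SELECT %s FROM %s", paste(c(grp, sel), collapse = ", "), table)
  if (!missing(i))   sql <- paste(sql, "WHERE", deparse(substitute(i)))  # i -> WHERE
  if (!is.null(grp)) sql <- paste(sql, "GROUP BY", grp)
  sql
}

translate_ijby("trades", qty > 0, sum(qty), ticker)
# "SELECT ticker, sum(qty) FROM trades WHERE qty > 0 GROUP BY ticker"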

@jangorecki
Member

Integrating various sources/targets is a lot of development and maintenance; therefore doing a single integration with dplyr, and through it getting the other backends, feels much more likely to be achieved. If we want to target only Spark, or only duckdb, then I agree it's better to translate directly rather than via dplyr.

Off topic: DT() has been pulled back from the exported API for the moment.

@tdhock
Member

tdhock commented Jan 8, 2024

I agree that it would be possible and preferable to implement this in a separate package, which hopefully would get the seal of approval (#5723).

@tdhock
Member

tdhock commented Jan 8, 2024

Also, based on the new governance, this is out of scope -- "Functionality that is out of current scope...Manipulating out-of-memory data, e.g. data stored on disk or remote SQL DB, (as opposed e.g. to sqldf / dbplyr)" -- and consensus seems to be that this should be implemented in another package, so I am closing this issue. (Feel free to re-open if I have misunderstood.)

@lucasmation

lucasmation commented Jan 30, 2024

Bump for the "dplyr.table (inverse of dtplyr)" idea. For me, that would be a game changer, letting me use arrow with the DT syntax. I naively added a feature request for DT syntax on the arrow GitHub page.
