Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Basic Arrow.jl-based collect and createDataFrame #115

Draft
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

exyi
Copy link
Contributor

@exyi exyi commented Sep 10, 2022

Functions collect_arrow, collect_tuples, and collect_df are provided,
which all use Arrow.jl and Spark's Arrow support to transfer data
from Spark to Julia. collect_arrow returns the raw Arrow.jl table,
collect_df returns the DataFrame from DataFrames.jl, collect_tuples
returns a simple Vector of named tuples.

createDataFrame now has overloads which accept a DataFrame or abstract Table

This version create a temporary file for each transfer, but I actually think it's
preferable in many ways to socket based transfer:

  • Simpler :)
  • Arrow.jl will mmap the file, so it can in theory handle sligtly-larger-than-RAM
    datasets
  • or, if you have /tmp in tmpfs (RAM-disk), it will just mmap the chunk of memory,
    without additional copying on Julia side

However, if you think sockets would be preferable I can change it (PySpark and SparkR use sockets)

Few things are missing

  • I included 2 versions for both collectToArrow and fromArrow, since I couldn't yet decide which is better.

    • One collectToArrow is basically the same one as I used last year. The other comes from looking at PySpark currently does it.
    • Would be nice to benchmark them, I don't currently have meaningful data for this as I left the company when we used Spark
  • I added DataFrames.jl to the dependencies, but it doesn't seems right to depend on this fairly non-trivial library just for collect_df. Does Julia support something like optional dependencies? It would seem nicer to import DataFrames only if it's already installed, otherwise error-out in collect_df

Functions collect_arrow, collect_tuples, and collect_df are provided,
which all use Arrow.jl and Spark's Arrow support to transfer data
from Spark to Julia. collect_arrow returns the raw Arrow.jl table,
collect_df returns the DataFrame from DataFrames.jl, collect_tuples
returns a simple Vector of named tuples.

createDataFrame now has overloads which accept a DataFrame or abstract Table

This version create a temporary file for each transfer, but I actually think it's
preferable in many ways to socket based transfer:
* Simpler :)
* Arrow.jl will mmap the file, so it can in theory handle sligtly-larger-than-RAM
  datasets
* or, if you have /tmp in tmpfs (RAM-disk), it will just mmap the chunk of memory,
  without additional copying on Julia side

This commit still includes 2 versions for both collectToArrow and fromArrow,
since I couldn't yet decide which is better
@dfdx
Copy link
Owner

dfdx commented Sep 10, 2022

Thanks a lot! It's great to see Arrow getting back to Spark.jl!

This version create a temporary file for each transfer, but I actually think it's
preferable in many ways to socket based transfer:

I'm totally fine with files. If I remember correctly, PySpark uses (used?) both - sockets and files in different places or with different settings. Anyway, I realized that it's not necessarily a good idea to follow PySpark or SparkR design since many decisions in them were made specifically to that languages or due to certain conditions that may not apply to our case. So let's do whatever is the best for Julia.

Would be nice to benchmark them, I don't currently have meaningful data for this as I left the company when we used Spark

Can't we just generate, say, a big Parquet file?

Does Julia support something like optional dependencies? It would seem nicer to import DataFrames only if it's already installed, otherwise error-out in collect_df

Requires.jl was designed to do exactly this, though I haven't used it for years now and don't know its status.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants