Basic Arrow.jl-based collect and createDataFrame #115

Draft
wants to merge 2 commits into base: main

Commits on Sep 10, 2022

  1. 171658a
  2. Basic Arrow.jl-based collect and createDataFrame

    Functions collect_arrow, collect_tuples, and collect_df are provided,
    which all use Arrow.jl and Spark's Arrow support to transfer data
    from Spark to Julia. collect_arrow returns the raw Arrow.jl table,
    collect_df returns a DataFrame from DataFrames.jl, and collect_tuples
    returns a simple Vector of named tuples.
    
    createDataFrame now has overloads which accept a DataFrame or abstract Table
    
    This version creates a temporary file for each transfer, but I actually think it's
    preferable in many ways to socket-based transfer:
    * Simpler :)
    * Arrow.jl will mmap the file, so it can in theory handle slightly-larger-than-RAM
      datasets
    * or, if you have /tmp in tmpfs (RAM-disk), it will just mmap the chunk of memory,
      without additional copying on the Julia side
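
To make the temporary-file approach concrete, here is a minimal sketch of the Julia side only. It is not the code from this PR: the Spark side is replaced by a small in-memory table written with `Arrow.write`, and the names `source`, `path`, `tbl`, `df` and `rows` are made up for illustration. The three results loosely mirror what the commit message describes for collect_arrow, collect_df and collect_tuples.

```julia
using Arrow, DataFrames, Tables

# Assumption: this small table stands in for data that Spark would
# normally write out in Arrow format on the JVM side.
source = (id = [1, 2, 3], name = ["a", "b", "c"])

# One temporary file per transfer, as described above.
path = tempname() * ".arrow"
Arrow.write(path, source)

# Reading from a file path lets Arrow.jl memory-map the file
# instead of loading it into a separate buffer first.
tbl = Arrow.Table(path)        # raw Arrow.jl table
df = DataFrame(tbl)            # DataFrames.jl view of the same data
rows = Tables.rowtable(tbl)    # Vector of NamedTuples

println(rows[1])
println(size(df))
```

If /tmp is on tmpfs, `tempname()` will typically place the file in RAM, which corresponds to the last bullet above.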
    
    This commit still includes two versions of both collectToArrow and fromArrow,
    since I couldn't yet decide which is better.
    exyi committed Sep 10, 2022
    f826c30