Basic Arrow.jl-based collect and createDataFrame #115

Functions collect_arrow, collect_tuples, and collect_df are provided, which all use Arrow.jl and Spark's Arrow support to transfer data from Spark to Julia. collect_arrow returns the raw Arrow.jl table, collect_df returns a DataFrame from DataFrames.jl, and collect_tuples returns a simple Vector of named tuples.

createDataFrame now has overloads which accept a DataFrame or an abstract Table.

This version creates a temporary file for each transfer, but I actually think it's preferable in many ways to socket-based transfer:

* Simpler :)
* Arrow.jl will mmap the file, so it can in theory handle slightly-larger-than-RAM datasets
* or, if you have /tmp in tmpfs (RAM disk), it will just mmap the chunk of memory, without additional copying on the Julia side

This commit still includes two versions of both collectToArrow and fromArrow, since I couldn't yet decide which is better.
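To illustrate the Julia side of the temporary-file approach, here is a minimal sketch that does not involve Spark at all. It assumes the file has already been produced (here it is written locally with Arrow.write as a stand-in for what the JVM side would write), and then reads it back in the three shapes that roughly correspond to collect_arrow, collect_df, and collect_tuples. The sample data, the file path, and the variable names are hypothetical and not taken from the PR.

```julia
using Arrow, DataFrames, Tables

# Stand-in for the data Spark would hand over; in the PR the file is
# produced on the JVM side, here we write it ourselves for illustration.
source = (id = [1, 2, 3], name = ["a", "b", "c"])

# Hypothetical temporary file used for the transfer.
path = tempname() * ".arrow"
Arrow.write(path, source)

# Opening by path lets Arrow.jl memory-map the file instead of
# reading it fully into memory (roughly what collect_arrow returns).
tbl = Arrow.Table(path)

# Wrap the Arrow table as a DataFrame (roughly collect_df).
df = DataFrame(tbl)

# Materialize rows as a Vector of NamedTuples (roughly collect_tuples).
rows = Tables.rowtable(tbl)
```

If the temporary directory lives on tmpfs, the same code applies unchanged; only the location returned by tempname() differs.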

Commits on Sep 10, 2022