
ARROW-4466: [Rust] [DataFusion] Add support for Parquet data source #3851

Closed
wants to merge 38 commits

Conversation

andygrove (Member):

I'm sure I'll need some guidance on this one from @sunchao or @liurenjie1024, but I am keen to get Parquet support added for primitive types so that I can actually use DataFusion and Arrow in production at some point.

andygrove (Member Author):

@liurenjie1024 should I use the row reader for now?

andygrove (Member Author):

row_iter seems like the safest path for now, but I don't know how to check for null values.

andygrove (Member Author):

Switched back to column readers and fixed bugs. Ready for a first review (this is still WIP).

andygrove changed the title to [WIP] ARROW-4466: [Rust] [DataFusion] Add support for Parquet data source on Mar 10, 2019
andygrove requested a review from kszucs on Mar 10, 2019
@@ -751,6 +752,22 @@ impl Schema {
"fields": self.fields.iter().map(|field| field.to_json()).collect::<Vec<Value>>(),
})
}

/// Create a new schema by applying a projection to this schema's fields
pub fn projection(&self, projection: &Vec<usize>) -> Result<Arc<Schema>> {
Member:

nit: can we use &[usize] instead of &Vec<usize>?


impl ParquetFile {
pub fn open(file: File, projection: Option<Vec<usize>>) -> Result<Self> {
println!("open()");
Member:

let's remove these printlns?


impl ParquetTable {
pub fn new(filename: &str) -> Self {
let file = File::open(filename).unwrap();
Member:

is it safe to unwrap here? what if the input file doesn't exist?
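
The PR later resolves this with a fallible try_new constructor; a minimal sketch of that shape (with a stand-in struct) might look like:

use std::fs::File;

// Stand-in; the real ParquetTable holds more state.
struct ParquetTable {
    file: File,
}

impl ParquetTable {
    // Propagate the io::Error instead of unwrapping, so a missing input
    // file surfaces as an Err rather than a panic.
    pub fn try_new(filename: &str) -> std::io::Result<Self> {
        let file = File::open(filename)?;
        Ok(ParquetTable { file })
    }
}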

physical_type,
..
} => {
let arrow_type = match physical_type {
Member:

We should look at the logical type instead of the physical type here.

Member Author:

Do you mean basic_info.logical_type()? If so, the value is NONE for every column in the test Parquet file.

Member:

In the NONE case it should be converted to the corresponding type, e.g., PhysicalType::INT32 -> int, PhysicalType::INT64 -> long, etc. Again you can check here or here for reference.
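
A sketch of that fallback mapping (the function name is illustrative, following the PhysicalType::INT32 -> int, PhysicalType::INT64 -> long correspondence above):

use arrow::datatypes::DataType;
use parquet::basic::Type as PhysicalType;

// Fallback for a NONE logical type: map the physical type directly to the
// corresponding Arrow type, rejecting anything that needs logical-type info.
fn physical_to_arrow(t: PhysicalType) -> Result<DataType, String> {
    match t {
        PhysicalType::BOOLEAN => Ok(DataType::Boolean),
        PhysicalType::INT32 => Ok(DataType::Int32),
        PhysicalType::INT64 => Ok(DataType::Int64),
        PhysicalType::FLOAT => Ok(DataType::Float32),
        PhysicalType::DOUBLE => Ok(DataType::Float64),
        other => Err(format!("no default Arrow type for {:?}", other)),
    }
}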

Reply:

Fixed. Now using parquet::reader::schema::parquet_to_arrow_schema


Ok(Field::new(basic_info.name(), arrow_type, false))
}
Type::GroupType { basic_info, fields } => Ok(Field::new(
Member:

Group type can also represent list, map, etc. We should not convert them all to struct.

Reply:

Fixed. Now using parquet::reader::schema::parquet_to_arrow_schema, and also added checks for non-primitive types, returning an Err in those cases.

}
}

fn load_next_row_group(&mut self) {
Member:

Maybe we should return a Result from this function? Also, in the else branch we should not just panic.
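
A toy sketch of the suggested shape, with hypothetical field names: return a Result and surface an Err from the else branch instead of panicking.

// Hypothetical stand-in for the reader's row-group state.
struct RowGroupState {
    row_group_index: usize,
    num_row_groups: usize,
}

impl RowGroupState {
    fn load_next_row_group(&mut self) -> Result<(), String> {
        if self.row_group_index < self.num_row_groups {
            // ... create the column readers for this row group here ...
            self.row_group_index += 1;
            Ok(())
        } else {
            Err("no more row groups to load".to_string())
        }
    }
}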


match r.read_batch(
self.batch_size,
None,
Member:

We should not ignore definition levels and repetition levels; otherwise nulls and nested data types may not be handled properly.

Member Author:

I don't really understand what I need to do here. For now I think we should limit support to simple types so I'm not worried about nesting yet, but I do want to support null values.

sunchao (Member), Mar 10, 2019:

In that case, you should only pass in None for def_levels when you know that the column is required. Otherwise, you should pass in a mutable slice of batch_size length, which will be filled up with the def_levels for the values. Note that the number of values filled in by this method will always be equal to or less than the batch_size.

For instance, if batch_size is 10 and there are 3 null values, then def_levels will contain 10 entries while values will only contain 7 entries (they occupy the first 7 slots of the input values slice; the rest will just be the default value).

After calling this method, you'll need to inspect the def_levels vector to find the null values. A value is non-null iff the corresponding def_level equals max_def_level.

An example can be found here.
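
A minimal sketch of this recipe for a nullable INT32 column, assuming the read_batch signature described above and the arrow builder API used elsewhere in this PR (the function name and String error type are illustrative):

use std::sync::Arc;
use arrow::array::Int32Array;
use arrow::builder::Int32Builder;
use parquet::column::reader::ColumnReaderImpl;
use parquet::data_type::Int32Type;

// Read one batch from a possibly-nullable INT32 column, expanding the
// packed values into an Arrow array with nulls. `max_def_level` comes
// from the column descriptor.
fn read_nullable_i32(
    reader: &mut ColumnReaderImpl<Int32Type>,
    batch_size: usize,
    max_def_level: i16,
) -> Result<Arc<Int32Array>, String> {
    let mut values: Vec<i32> = vec![0; batch_size];
    let mut def_levels: Vec<i16> = vec![0; batch_size];

    // values_read < levels_read whenever nulls are present
    let (values_read, levels_read) = reader
        .read_batch(batch_size, Some(&mut def_levels[..]), None, &mut values)
        .map_err(|e| format!("read_batch failed: {:?}", e))?;

    let mut builder = Int32Builder::new(levels_read);
    let mut next_value = 0;
    for i in 0..levels_read {
        if def_levels[i] == max_def_level {
            // non-null: take the next packed value
            builder
                .append_value(values[next_value])
                .map_err(|e| format!("append failed: {:?}", e))?;
            next_value += 1;
        } else {
            builder
                .append_null()
                .map_err(|e| format!("append failed: {:?}", e))?;
        }
    }
    // next_value should now equal values_read
    Ok(Arc::new(builder.finish()))
}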

Contributor:

For a simple POC, I think it's better to ignore nulls and nested types. For null elements you need a spaced reader.

Member Author:

Thanks for the help. This makes sense. I am going to implement this using generics; otherwise the code will be too verbose.

Member Author:

Generics are too hard because the Parquet and Arrow crates have different data types... I tried:

trait ArrowReader<T> where T: ArrowPrimitiveType {
    fn read(&mut self, batch_size: usize, is_nullable: bool) -> Result<Arc<PrimitiveArray<T>>>;
}

impl<T,P> ArrowReader<T> for ColumnReaderImpl<P> where T: ArrowPrimitiveType, P: parquet::data_type::DataType {
    fn read(&mut self, batch_size: usize, is_nullable: bool) -> Result<Arc<PrimitiveArray<T>>> {

        // create read buffer
        let mut read_buffer: Vec<P::T> =
            Vec::with_capacity(batch_size);

        for _ in 0..batch_size {
            read_buffer.push(T::default_value());
        }

        let (values_read, _) = self.read_batch(
            batch_size,
            None,
            None,
            &mut read_buffer,
        )?;

        let mut builder = PrimitiveBuilder::<T>::new(values_read);
        builder.append_slice(&read_buffer[0..values_read])?;
        Ok(Arc::new(builder.finish()))
    }
}

but I get errors like this:

    = note: expected type `&[<T as arrow::datatypes::ArrowPrimitiveType>::Native]`
               found type `&[<P as parquet::data_type::DataType>::T]`

I guess I'll do macros for now.

nevi-me (Contributor), Mar 13, 2019:

Hi @andygrove,

You could rewrite the above as:

trait ArrowReader<T> where T: ArrowPrimitiveType {
    fn read(&mut self, batch_size: usize, is_nullable: bool) -> Result<Arc<PrimitiveArray<T>>>;
}

impl<A,P> ArrowReader<A> for ColumnReaderImpl<P> 
where
    A: ArrowPrimitiveType, 
    P: parquet::data_type::DataType,
    // the problem is that we didn't have trait bounds that allow converting between Parquet native and Arrow native types
    P::T: std::convert::From<A::Native>,
    A::Native: std::convert::From<P::T>,
{
    fn read(&mut self, batch_size: usize, is_nullable: bool) -> Result<Arc<PrimitiveArray<A>>> {

        // create read buffer
        let mut read_buffer: Vec<P::T> =
            Vec::with_capacity(batch_size);

        for _ in 0..batch_size {
            // convert from Arrow native to Parquet native
            read_buffer.push(A::default_value().into());
        }

        let (values_read, _) = self.read_batch(
            batch_size,
            None,
            None,
            &mut read_buffer,
        )?;

        let mut builder = PrimitiveBuilder::<A>::new(values_read);
        // need to convert the vec of Parquet native types to Arrow native types
        // alternatively we could implement std::convert::From<Vec<P::T>> for Vec<A::Native>
        let converted_buffer: Vec<A::Native> = read_buffer.into_iter().map(|v| v.into()).collect();
        builder.append_slice(&converted_buffer[0..values_read])?;
        Ok(Arc::new(builder.finish()))
    }
}

This compiles, but I don't know where in your code I'd put this so that I can test if it's doing the right thing.

@sunchao @liurenjie1024 might it benefit us to change the DataType::T below to DataType::Native like we have with Arrow? I had to change the T in the above code to A because P::T was confusing, where P::Native would have been easier to read.

// parquet/src/data_type.rs
macro_rules! make_type {
    ($name:ident, $physical_ty:path, $native_ty:ty, $size:expr) => {
        pub struct $name {}

        impl DataType for $name {
            type T = $native_ty;

            fn get_physical_type() -> Type {
                $physical_ty
            }

            fn get_type_size() -> usize {
                $size
            }
        }
    };
}

Contributor:

@andygrove here's the implementation as a PR against your fork: andygrove#1

Member Author:

Wow, thanks @nevi-me !

andygrove (Member Author):

@sunchao I fixed the nits but I need some guidance on checking against logical types and how definition/repetition levels work when you have some time. Maybe you can point me to some code I can learn from?

sunchao (Member), Mar 10, 2019:

@andygrove Yes. To convert group type, you can check here, which actually implements the conversion from Parquet to Arrow schema. For an introduction on definition/repetition levels, you can read this article, which I found pretty helpful.

liurenjie1024 (Contributor) left a review comment:

Sorry for the late reply. I left some comments; for a simple implementation without nulls and nested types, you can refer to my previous implementation: https://github.com/liurenjie1024/zeus/blob/arrow/zeus-server/src/storage/blizard_storage/reader.rs


match r.read_batch(
self.batch_size,
None,
Contributor:

For a simple POC, I think it's better to ignore nulls and nested types. For null elements you need a spaced reader.

}
}

fn to_arrow(t: &Type) -> Result<Field> {
Contributor:

Maybe you can use the schema converter in parquet::reader::schema::parquet_to_arrow_schema?

Member Author:

done

@@ -177,6 +177,7 @@ impl ParquetTypeConverter {
PhysicalType::BOOLEAN => Ok(DataType::Boolean),
PhysicalType::INT32 => self.to_int32(),
PhysicalType::INT64 => self.to_int64(),
PhysicalType::INT96 => self.to_int64(),
Member Author:

INT96 is a timestamp in nanoseconds, and when reading to Arrow I am converting to a timestamp in milliseconds.
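
For context, a sketch of that conversion under the usual INT96 layout (two little-endian u32 words of nanoseconds-of-day followed by a u32 Julian day, where 2_440_588 is the Julian day of the Unix epoch). The name int96_to_millis is illustrative; the PR's actual helper, convert_int96_timestamp, appears further down:

const JULIAN_DAY_OF_EPOCH: i64 = 2_440_588; // Julian day of 1970-01-01
const SECONDS_PER_DAY: i64 = 86_400;
const MILLIS_PER_SECOND: i64 = 1_000;

// Convert an INT96 value (three u32 words) to milliseconds since the epoch.
fn int96_to_millis(v: &[u32; 3]) -> i64 {
    let nanos_of_day = (v[0] as u64) | ((v[1] as u64) << 32);
    let days_since_epoch = v[2] as i64 - JULIAN_DAY_OF_EPOCH;
    let seconds = days_since_epoch * SECONDS_PER_DAY;
    seconds * MILLIS_PER_SECOND + (nanos_of_day / 1_000_000) as i64
}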

andygrove (Member Author):

@sunchao @liurenjie1024 This is ready for another review. I am checking if the field is nullable, and if so, I pass def levels.

If values_read == levels_read then there are no null values and I read as usual.

If there are null values, I return an Err, so I think this makes it safe for the first pass; we can add null support in a follow-up PR.

I would also like to work with you both on a generic arrow reader.

fn create_binary_array(b: &Vec<ByteArray>, row_count: usize) -> Result<Arc<BinaryArray>> {
let mut builder = BinaryBuilder::new(b.len());
for j in 0..row_count {
let slice = b[j].slice(0, b[j].len());
Member:

Why do we need to call slice here?

Member Author:

removed

None,
&mut read_buffer,
)?;
let mut builder = $BUILDER::new(levels_read);
Member:

Can we extract this common code out of the if statement:

let mut builder = $BUILDER::new(values_read);
builder.append_slice(&read_buffer[0..values_read])?;
Arc::new(builder.finish())

Member Author:

Refactored this to remove duplication; I was prematurely optimizing.

macro_rules! read_column {
($SELF:ident, $R:ident, $INDEX:expr, $BUILDER:ident, $TY:ident, $DEFAULT:expr) => {{
//TODO: should be able to get num_rows in row group instead of defaulting to batch size
let mut read_buffer: Vec<$TY> = Vec::with_capacity($SELF.batch_size);
Member:

you can replace this with:

let mut read_buffer: Vec<$TY> = vec![$DEFAULT; $SELF.batch_size];

/// Execute query and return result set as tab delimited string
fn execute(ctx: &mut ExecutionContext, sql: &str) -> String {
let results = ctx.sql(&sql, DEFAULT_BATCH_SIZE).unwrap();
let plan = ctx.create_logical_plan(&sql).unwrap();
println!("Plan: {:?}", plan);
Member:

nit: remove the println?

andygrove (Member Author):

@sunchao @nevi-me @liurenjie1024 I cleaned the code up to remove duplication, added null support, and addressed other feedback.

I have been testing null support manually so far since there are no suitable test files in this repo. I will create a separate PR for creating a suitable file to test with.


use crate::datasource::{RecordBatchIterator, ScanResult, Table};
use crate::execution::error::{ExecutionError, Result};
use arrow::builder::{BinaryBuilder, Int64Builder};
Contributor:

nit: we could condense some of these imports into fewer lines

Member Author:

done

&mut read_buffer,
)?;

let mut builder = Int64Builder::new(levels_read);
nevi-me (Contributor), Mar 14, 2019:

EDIT: You can ignore the below; it doesn't work.

What do you think of using TimestampMillisecondBuilder for millisecond timestamps? It should be a drop-in replacement for Int64Builder.

}

/// convert a parquet timestamp in nanoseconds to a timestamp with milliseconds
fn convert_int96_timestamp(v: &[u32]) -> i64 {
Contributor:

As DataType::Timestamp supports nanosecond precision, might it not be better to keep the resolution at nanosecond level and use a TimestampNanosecond? @xhochy do we have to worry about int96 conversion semantics from apache/parquet-format#49?

Contributor:

@andygrove I'll submit another PR that addresses this; please see https://gist.github.com/nevi-me/574038fdec8e9c207f661813789d58fb

Contributor:

Here is the fix: andygrove#2

Member:

Yes, INT96 is that weird. Please ensure that you don't write them by default ;)

Member Author:

Thanks @nevi-me and @xhochy !

andygrove (Member Author):

@sunchao @liurenjie1024 @nevi-me Ready for re-review. I compared the schema converter with the C++ implementation and made it consistent.

)?;

let mut builder =
TimestampNanosecondBuilder::new(levels_read);
Contributor:

Sorry about the back and forth @andygrove; the builder should be TimestampMillisecondBuilder in this instance, as convert_int96_timestamp still returns milliseconds. Alternatively we could change seconds * MILLIS_PER_SECOND + nanoseconds / 1_000_000 to seconds * MILLIS_PER_SECOND * 1_000_000 + nanoseconds to keep nano precision (which I prefer).

The original JIRA ticket that deprecated INT96 (https://issues.apache.org/jira/browse/PARQUET-323) based this on the reasoning that nanosecond precision is 'rarely a real requirement'.

Member Author:

Good catch. Thanks. Since the schema converter says that INT96 is a timestamp in nanoseconds (just like the C++ impl), we should convert to nanoseconds, so I updated as you suggested.

let array = batch
.column(0)
.as_any()
.downcast_ref::<TimestampNanosecondArray>()
Contributor:

This will downcast to milliseconds per the related comment. As an aside, the nice thing with temporal arrays now is that we can do the below (which isn't necessary for this test):

let mut values = vec![];
for i in 0..batch.num_rows() {
    values.push(array.datetime(i));
}

assert_eq!("2001-03-31 12:00:00, ...", format!("{:?}", values));

@@ -751,6 +752,22 @@ impl Schema {
"fields": self.fields.iter().map(|field| field.to_json()).collect::<Vec<Value>>(),
})
}

/// Create a new schema by applying a projection to this schema's fields
pub fn projection(&self, projection: &[usize]) -> Result<Arc<Schema>> {
Contributor:

Does this belong in datafusion as a free function? It doesn't seem like this will be used within the arrow sub-crate?

Member Author:

I'm 50/50 on this but I moved it to datafusion for now.
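
A minimal sketch of the projection helper as a free function, with a String error standing in for DataFusion's own error type:

use std::sync::Arc;
use arrow::datatypes::{Field, Schema};

// Build a new schema containing only the fields selected by `projection`,
// failing on an out-of-range column index instead of panicking.
fn schema_projection(schema: &Schema, projection: &[usize]) -> Result<Arc<Schema>, String> {
    let mut fields: Vec<Field> = Vec::with_capacity(projection.len());
    for &i in projection {
        if i >= schema.fields().len() {
            return Err(format!("projection index {} out of bounds", i));
        }
        fields.push(schema.field(i).clone());
    }
    Ok(Arc::new(Schema::new(fields)))
}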

}

impl ParquetTable {
pub fn try_new(filename: &str) -> Result<Self> {
Contributor:

nit: Maybe filename should be of type AsRef<Path> just like File::open? (probably does not need to be included in this PR)

Member Author:

Sure. I'd prefer to look at that as a separate PR as you suggested.
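
The suggested signature would mirror std::fs::File::open; a sketch with a stand-in struct:

use std::fs::File;
use std::path::Path;

// Stand-in; the real ParquetTable holds more state.
struct ParquetTable {
    file: File,
}

impl ParquetTable {
    // Accept anything path-like (&str, String, &Path, PathBuf), just as
    // File::open does.
    pub fn try_new<P: AsRef<Path>>(filename: P) -> std::io::Result<Self> {
        let file = File::open(filename)?;
        Ok(ParquetTable { file })
    }
}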

nevi-me (Contributor) left a review comment:

Thanks @andygrove

andygrove (Member Author):

Thanks @nevi-me.

@sunchao @paddyhoran I think this is ready to go now?

sunchao (Member) left a review comment:

Thanks @andygrove. Looks much better!

is_nullable: bool,
) -> Result<Arc<PrimitiveArray<A>>> {
// create read buffer
let mut read_buffer: Vec<P::T> = vec![A::default_value().into(); batch_size];
Member:

nit: can replace A::default_value().into() with P::T::default() to save an into() call.

Member Author:

I had problems making this change. Will look at this as a separate PR too.

if self.row_group_index < self.reader.num_row_groups() {
let reader = self.reader.get_row_group(self.row_group_index)?;

self.column_readers = Vec::with_capacity(self.projection.len());
Member:

nit: can we call self.column_readers.clear() here?

row_group_index: 0,
projection_schema: projected_schema,
projection,
batch_size: 64 * 1024,
Member:

nit: extract this as a constant.

is_nullable,
)?
}
ColumnReader::Int64ColumnReader(ref mut r) => {
Member:

Should we look at the logical type for the int64? It may need to be converted to a timestamp or decimal. Same for int32.

Member Author:

Yes, that's a good point.

Member:

Thanks. It's fine to do it in a separate PR though.

Member Author:

How would you feel about doing this as separate PRs after this is merged?

Member:

Yes, that's fine with me.

Member Author:

Cool. Once you and/or Paddy have approved this one, I'll merge and start on the date/time support.


fn next(&mut self) -> Result<Option<RecordBatch>> {
// advance the row group reader if necessary
if self.current_row_group.is_none() {
Member:

At some point we may need to think about small row groups versus a big batch size; we may need to read across row group boundaries.

})
}

fn load_next_row_group(&mut self) -> Result<()> {
Member:

Can we have a test for this case as well, i.e., loading multiple row groups?

Member Author:

done

sunchao (Member), Mar 15, 2019:

LGTM, pending CI.
