Skip to content

Support round trip reading / writing Arrow Duration type to parquet #7482

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
May 9, 2025

Conversation

Liyixin95
Copy link
Contributor

@Liyixin95 Liyixin95 commented May 8, 2025

@github-actions github-actions bot added the parquet Changes to the parquet crate label May 8, 2025
@alamb
Copy link
Contributor

alamb commented May 8, 2025

CI Failure should be fixed by @crepererum 's PR:

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Liyixin95 -- this looks really nice. I poked around and actually found that there were some pre-existing tests ❤️

I also tested this PR manually using the duration.parquet file from apache/parquet-testing#80 and the following code:

Details

fn main() {
    let file = File::open("/Users/andrewlamb/Downloads/duration.parquet").unwrap();
    let reader = ParquetRecordBatchReaderBuilder::try_new(file)
        .unwrap()
        .build()
        .unwrap();
    
    for batch in reader {
        let batch = batch.unwrap();
        println!("Read batch schema: {:?}", batch.schema());
        println!("Read batch num rows: {:?}", batch.num_rows());
        println!("Read batch pretty:\n{}", pretty_format_batches(&[batch]).unwrap());

    }
}

Before this PR:

Field { name: "a", data_type: Int64:

warning: `parquet_test` (bin "parquet_test") generated 1 warning
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.04s
     Running `target/debug/parquet_test`
Read batch schema: Schema { fields: [Field { name: "a", data_type: Int64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }
Read batch num rows: 100
Read batch pretty:
+-------------+
| a           |
+-------------+
| 86400000000 |
...

After this PR (the schema shows duration):

Field { name: "a", data_type: Duration(Microsecond),

Read batch schema: Schema { fields: [Field { name: "a", data_type: Duration(Microsecond), nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }], metadata: {} }
Read batch num rows: 100
Read batch pretty:
+----------+
| a        |
+----------+
| PT86400S |

@@ -2471,25 +2471,21 @@ mod tests {
}

#[test]
#[should_panic(expected = "Converting Duration to parquet not supported")]
fn duration_second_single_column() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found there were already existing tests for the feature ❤️ (thanks @carols10cents )

@alamb alamb changed the title read and write duration type Support round trip reading / writeing Arrow Duration type to parquet May 8, 2025
@alamb alamb changed the title Support round trip reading / writeing Arrow Duration type to parquet Support round trip reading / writing Arrow Duration type to parquet May 8, 2025
@Liyixin95
Copy link
Contributor Author

@alamb I poked around and actually found that there were some pre-existing tests

Oh, Thanks. I didn't notice there were some pre-existing tests. I just create the pr and go to sleep.

@alamb
Copy link
Contributor

alamb commented May 9, 2025

@alamb I poked around and actually found that there were some pre-existing tests

Oh, Thanks. I didn't notice there were some pre-existing tests. I just create the pr and go to sleep.

I didn't realize there were any tests either until I checked out the branch locally and they failed for me 😆

@alamb alamb merged commit 689897e into apache:main May 9, 2025
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

write duration to parquet but read as int64
2 participants