Support DurationType in cudf parquet reader via arrow:schema #15617

Merged
61 commits
053f7da
Read duration type in cudf parquet via arrow:schema
mhaseeb123 Apr 30, 2024
aa4e9bb
reverting an inadvertently removed code line.
mhaseeb123 Apr 30, 2024
6c67c28
clang-format changes
mhaseeb123 Apr 30, 2024
0e6fc4a
Update cpp/src/io/parquet/reader_impl_helpers.cpp
mhaseeb123 Apr 30, 2024
a6eca13
Co-walk arrow and parquet schema
mhaseeb123 May 1, 2024
ced5dd9
fixing copyrights
mhaseeb123 May 1, 2024
b192352
fix the hardcoded if conditions for duration type
mhaseeb123 May 1, 2024
18d5e6c
add boolean check for arrow type columns
mhaseeb123 May 1, 2024
8f55983
add basic testing for duration type
mhaseeb123 May 1, 2024
6883c7e
revert clangd induced formatting
mhaseeb123 May 1, 2024
ab5cacd
more reverting clangd
mhaseeb123 May 1, 2024
649148c
remove raw for loops, verify equal fields at each schema level
mhaseeb123 May 2, 2024
416dbbd
Remove flatbuffer files. Add flatbuffers via CMake
mhaseeb123 May 2, 2024
c5a7b0e
Make arrow schema use in PQ reader optional. Add tests.
mhaseeb123 May 2, 2024
6f18766
minor updates for better readability
mhaseeb123 May 2, 2024
e4b9e74
Merge branch 'branch-24.06' into arrow-schema-support-pq-reader
mhaseeb123 May 2, 2024
dc7564a
fix arrow schema walk to handle list type columns. Add more pytests
mhaseeb123 May 3, 2024
0c4e7c4
add comments for the dummy node hack
mhaseeb123 May 3, 2024
0514b5c
Adding `map` type to parquet testing.
mhaseeb123 May 3, 2024
a1f8fe7
relocate files, fix copyrights and ruff checks
mhaseeb123 May 6, 2024
a36c1c6
minor fix for verify copyright hook
mhaseeb123 May 6, 2024
59d84f4
Merge branch 'branch-24.06' into arrow-schema-support-pq-reader
mhaseeb123 May 6, 2024
6b9bde5
update copyright messages
mhaseeb123 May 6, 2024
041ff76
Merge branch 'arrow-schema-support-pq-reader' of https://github.com/m…
mhaseeb123 May 6, 2024
cb691dd
segfault-proof the `validate_schemas` method
mhaseeb123 May 6, 2024
59610cd
Merge branch 'branch-24.06' into arrow-schema-support-pq-reader
mhaseeb123 May 6, 2024
ed83908
C++ friendly base64 encoder/decoder implementations
mhaseeb123 May 7, 2024
fbd3356
minor updates
mhaseeb123 May 7, 2024
b93c2c0
fix the erroneous inequality check to equality
mhaseeb123 May 7, 2024
d01f94c
use string find instead of custom function for better speed
mhaseeb123 May 7, 2024
b8c338b
optimize base64 encode
mhaseeb123 May 7, 2024
e47bbfb
fix minor signed comparison error
mhaseeb123 May 7, 2024
0b5ec61
speed optimization for decoder
mhaseeb123 May 7, 2024
83a13a7
Apply suggestions from code review
mhaseeb123 May 8, 2024
69be7db
applying suggestions from reviewers
mhaseeb123 May 8, 2024
0d41d99
minor updates from reviewer suggestions
mhaseeb123 May 8, 2024
56bbc15
add ctests for base64 encoder and decoder
mhaseeb123 May 8, 2024
bd54430
minor comments update
mhaseeb123 May 9, 2024
e954b45
Apply styling suggestions from code review
mhaseeb123 May 9, 2024
b870359
minor updates and better styling
mhaseeb123 May 9, 2024
c34c248
adding const to decode_ipc_message fn
mhaseeb123 May 9, 2024
dda87d1
avoid returning raw pointer in decode_ipc_message
mhaseeb123 May 9, 2024
e9f441d
move base64 definitions to a source file and add it to cmake
mhaseeb123 May 10, 2024
ac85ecc
apply suggestions from the reviews
mhaseeb123 May 10, 2024
45261f1
Apply suggestions from code review
mhaseeb123 May 10, 2024
f92fcc8
improve round trip tests for thorough arrow schema testing plus minor…
mhaseeb123 May 10, 2024
1c36d36
Update cpp/src/io/parquet/reader_impl_helpers.cpp
mhaseeb123 May 10, 2024
336574a
minor syntactical updates to tests
mhaseeb123 May 10, 2024
b0289b8
Apply suggestions from code review
mhaseeb123 May 13, 2024
3a602cc
small improvements and using zip iterator instead of counting iterato…
mhaseeb123 May 13, 2024
63b4df3
Merge branch 'branch-24.06' into arrow-schema-support-pq-reader
vuule May 13, 2024
7fbbea0
Remove explicit check for dtypes as already being done
mhaseeb123 May 13, 2024
6ab3b17
move `use_arrow_schema` to the end of parameters
mhaseeb123 May 14, 2024
4d74b24
Update tests to construct `expected` and use `assert_eq` for dtypes
mhaseeb123 May 14, 2024
a80f562
Remove `use_arrow_schema` from public Python APIs.
mhaseeb123 May 14, 2024
4e368d8
Remove `use_arrow_schema` from Cython API args as well
mhaseeb123 May 14, 2024
93ec789
Throw some Nulls in python tests
mhaseeb123 May 14, 2024
09eadcf
Merge branch 'branch-24.06' into arrow-schema-support-pq-reader
galipremsagar May 14, 2024
1d94cc8
Merge remote-tracking branch 'upstream/branch-24.06' into arrow-schem…
mhaseeb123 May 14, 2024
50d0b77
Update .pre-commit-config.yaml
galipremsagar May 14, 2024
56b2edc
Merge branch 'branch-24.06' into arrow-schema-support-pq-reader
mhaseeb123 May 15, 2024
Throw some Nulls in python tests
mhaseeb123 committed May 14, 2024
commit 93ec78978deb0c0bd1b3b424740c3c47e38861f4
14 changes: 7 additions & 7 deletions python/cudf/cudf/tests/test_parquet.py
@@ -3251,9 +3251,9 @@ def test_parquet_reader_roundtrip_with_arrow_schema():
  # round trip duration types (timedelta64) across Parquet read and write.
  pdf = pd.DataFrame(
  {
- "s": pd.Series([1234, 3456, 32442], dtype="timedelta64[s]"),
- "ms": pd.Series([1234, 3456, 32442], dtype="timedelta64[ms]"),
- "us": pd.Series([1234, 3456, 32442], dtype="timedelta64[us]"),
+ "s": pd.Series([None, None, None], dtype="timedelta64[s]"),
+ "ms": pd.Series([1234, None, 32442], dtype="timedelta64[ms]"),
+ "us": pd.Series([None, 3456, None], dtype="timedelta64[us]"),
  "ns": pd.Series([1234, 3456, 32442], dtype="timedelta64[ns]"),
  "duration_list": list(
  [
@@ -3262,12 +3262,12 @@ def test_parquet_reader_roundtrip_with_arrow_schema():
  datetime.timedelta(minutes=7),
  ],
  [
- datetime.timedelta(minutes=7, seconds=4),
- datetime.timedelta(minutes=7),
+ None,
+ None,
  ],
  [
  datetime.timedelta(minutes=7, seconds=4),
- datetime.timedelta(minutes=7),
+ None,
  ],
  ]
  ),
@@ -3307,7 +3307,7 @@ def test_parquet_reader_roundtrip_structs_with_arrow_schema():
  },
  "StreamId": "12345678",
  "Duration": datetime.timedelta(minutes=4),
- "Offset": 12,
+ "Offset": None,
  "Resource": [
  {
  "Name": "ZoneName",
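
The data shape this commit exercises can be sketched with the standard library alone: integers interpreted at a given duration unit, with `None` standing in for a null. This is only an illustration under stated assumptions. The helper `to_durations` and the table `UNIT_TO_KWARG` are hypothetical names, not part of cudf or of `test_parquet.py`, and the `ns` column is left out because `datetime.timedelta` has microsecond resolution.

```python
import datetime

# datetime.timedelta keyword for each unit it can represent exactly.
# "ns" is omitted on purpose: timedelta only has microsecond resolution.
# (Hypothetical table, used for illustration only.)
UNIT_TO_KWARG = {"s": "seconds", "ms": "milliseconds", "us": "microseconds"}


def to_durations(values, unit):
    """Interpret integers as durations in `unit`; None passes through as a null."""
    if unit not in UNIT_TO_KWARG:
        raise ValueError(f"unsupported unit: {unit!r}")
    kwarg = UNIT_TO_KWARG[unit]
    return [
        None if v is None else datetime.timedelta(**{kwarg: v})
        for v in values
    ]


# Same values and null positions as the updated "s", "ms" and "us" columns.
columns = {
    "s": to_durations([None, None, None], "s"),
    "ms": to_durations([1234, None, 32442], "ms"),
    "us": to_durations([None, 3456, None], "us"),
}

for name, col in columns.items():
    nulls = sum(v is None for v in col)
    print(f"{name}: {nulls} null(s) out of {len(col)}")
```

The three columns cover an all-null column, a null in the middle, and nulls at both ends, which is the mix the updated test feeds through the Parquet round trip.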