Add databricks direct unload file containing complex map key #87
Thank you @njaremko 🙏 Could you please add some comments about what this file contains and how you created it? Ideally you could follow the model of https://github.com/apache/parquet-testing/blob/master/data/README.md. Also, I noticed the file is 8 KB. Given how widely this repo is cloned and copied, is there any way to make the example file smaller?
79cc811 to 6eea094
I've removed the unneeded columns, and the file is 2 KB now. I've also updated the README.
6eea094 to f0d64c2
f0d64c2 to 54bffa8
What are the next steps to get this merged?
Thank you @njaremko. This looks good to me.
I dumped the schema and layout of data/complex_map_key.parquet and they look good to me.
Note that I am not a committer on parquet, so I cannot commit this PR. Perhaps @wgtmac or @emkornfield could take a look.
Also, FYI, I don't think we need to gate the fix in arrow-rs on this PR. I will comment on apache/arrow-rs#7769 as well.
parquet-layout data/complex_map_key.parquet
required group field_id=-1 spark_schema {
  required group field_id=-1 map_nested (Map) {
    repeated group field_id=-1 key_value {
      required binary field_id=-1 key (String);
      required group field_id=-1 value (Map) {
        repeated group field_id=-1 key_value {
          required binary field_id=-1 key (String);
          required binary field_id=-1 value (String);
        }
      }
    }
  }
  required group field_id=-1 map_nested_array (Map) {
    repeated group field_id=-1 key_value {
      required group field_id=-1 key (List) {
        repeated group field_id=-1 list {
          required int32 field_id=-1 element;
        }
      }
      required group field_id=-1 value (Map) {
        repeated group field_id=-1 key_value {
          required binary field_id=-1 key (String);
          required int32 field_id=-1 value;
        }
      }
    }
  }
}
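For readers less familiar with how a nested Parquet schema maps to physical columns, here is a minimal sketch in plain Python. It is not tied to any Parquet library. The tuple-based `SCHEMA` model and the `leaf_columns` helper are hypothetical names introduced only for illustration. It walks the schema printed above and lists each leaf column path, along with the maximum definition and repetition levels under the usual Dremel-style rule that a `repeated` node adds one to both levels and a `required` node adds nothing.

```python
# Hypothetical, simplified model of the schema printed above.
# Each node is (name, repetition, children); a leaf has no children.
REQ, REP = "required", "repeated"

SCHEMA = ("spark_schema", REQ, [
    ("map_nested", REQ, [
        ("key_value", REP, [
            ("key", REQ, []),
            ("value", REQ, [
                ("key_value", REP, [
                    ("key", REQ, []),
                    ("value", REQ, []),
                ]),
            ]),
        ]),
    ]),
    ("map_nested_array", REQ, [
        ("key_value", REP, [
            ("key", REQ, [
                ("list", REP, [
                    ("element", REQ, []),
                ]),
            ]),
            ("value", REQ, [
                ("key_value", REP, [
                    ("key", REQ, []),
                    ("value", REQ, []),
                ]),
            ]),
        ]),
    ]),
])


def leaf_columns(node, path=(), def_level=0, rep_level=0, is_root=True):
    """Yield (dotted_path, max_def_level, max_rep_level) for every leaf."""
    name, repetition, children = node
    if not is_root:
        # The root group is not part of a column path and adds no levels.
        path = path + (name,)
        if repetition == REP:
            def_level += 1
            rep_level += 1
        elif repetition == "optional":
            def_level += 1
    if not children:
        yield ".".join(path), def_level, rep_level
        return
    for child in children:
        yield from leaf_columns(child, path, def_level, rep_level, False)


COLUMNS = list(leaf_columns(SCHEMA))
for index, (path, max_def, max_rep) in enumerate(COLUMNS):
    print(f"Column {index}: {path} (max def={max_def}, max rep={max_rep})")
```

Under these assumptions the walk yields six leaf paths, with the list-typed map key appearing as `map_nested_array.key_value.key.list.element`.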
and
File Name: data/complex_map_key.parquet
Version: 1.0
Created By: parquet-mr version 1.12.3-databricks-0002 (build 2484a95dbe16a0023e3eb29c201f99ff9ea771ee)
Total rows: 1
Number of RowGroups: 1
Number of Real Columns: 2
Number of Columns: 6
Number of Selected Columns: 6
Column 0: map_nested.key_value.key (BYTE_ARRAY / String / UTF8)
Column 1: map_nested.key_value.value.key_value.key (BYTE_ARRAY / String / UTF8)
Column 2: map_nested.key_value.value.key_value.value (BYTE_ARRAY / String / UTF8)
Column 3: map_nested_array.key_value.key.list.element (INT32)
Column 4: map_nested_array.key_value.value.key_value.key (BYTE_ARRAY / String / UTF8)
Column 5: map_nested_array.key_value.value.key_value.value (INT32)
--- Row Group: 0 ---
--- Total Bytes: 256 ---
--- Total Compressed Bytes: 266 ---
--- Rows: 1 ---
Column 0
Values: 1, Null Values: 0, Distinct Values: 0
Max: a, Min: a
Compression: SNAPPY, Encodings: PLAIN
Uncompressed Size: 40, Compressed Size: 42
Column 1
Values: 1, Null Values: 0, Distinct Values: 0
Max: b, Min: b
Compression: SNAPPY, Encodings: PLAIN
Uncompressed Size: 42, Compressed Size: 44
Column 2
Values: 1, Null Values: 0, Distinct Values: 0
Max: c, Min: c
Compression: SNAPPY, Encodings: PLAIN
Uncompressed Size: 42, Compressed Size: 44
Column 3
Values: 2, Null Values: 0, Distinct Values: 0
Max: 2, Min: 1
Compression: SNAPPY, Encodings: PLAIN
Uncompressed Size: 45, Compressed Size: 47
Column 4
Values: 1, Null Values: 0, Distinct Values: 0
Max: green, Min: green
Compression: SNAPPY, Encodings: PLAIN
Uncompressed Size: 46, Compressed Size: 46
Column 5
Values: 1, Null Values: 0, Distinct Values: 0
Max: 5, Min: 5
Compression: SNAPPY, Encodings: PLAIN
Uncompressed Size: 41, Compressed Size: 43
--- Values ---
key |key |value |element |key   |value |
a   |b   |c     |1       |green |5     |
    |    |      |2       |      |      |
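To make the dump easier to read, here is a rough sketch of the single row it appears to describe. The row shape is inferred from the value counts and the Values table above (one entry per map, and two `element` values forming one list key). It is an assumption, not output from any Parquet reader. `ROW` and `flatten` are hypothetical names, and the list key is modeled as a tuple only because Python lists cannot be dict keys.

```python
# Inferred from the dump above; an assumption about the row's shape,
# not something read from the file. The array key [1, 2] is modeled as
# a tuple because Python lists are unhashable.
ROW = {
    "map_nested": {"a": {"b": "c"}},
    "map_nested_array": {(1, 2): {"green": 5}},
}


def flatten(row):
    """Return the leaf values per column, in the column order of the dump."""
    columns = [[] for _ in range(6)]
    for key, inner in row["map_nested"].items():
        columns[0].append(key)
        for inner_key, inner_value in inner.items():
            columns[1].append(inner_key)
            columns[2].append(inner_value)
    for array_key, inner in row["map_nested_array"].items():
        # One map entry whose key is a list contributes one value per element.
        columns[3].extend(array_key)
        for inner_key, inner_value in inner.items():
            columns[4].append(inner_key)
            columns[5].append(inner_value)
    return columns


LEAVES = flatten(ROW)
for index, values in enumerate(LEAVES):
    print(f"Column {index}: {len(values)} value(s): {values}")
```

If the inference is right, the per-column counts line up with the `Values:` lines in the dump. Column 3 is the only one with two values because its single map key is the two-element list.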
Do we actually need the map_nested field if we just want to add complex key/value types?
| File | Description |
|------|-------------|
| complex_map_key.parquet | Contains a map with an array key. |
Is it worth describing the exact schema of the file at this line?
Required for 7769