njaremko
Required for apache/arrow-rs#7769.

@alamb
Contributor

alamb commented Jun 24, 2025

Thank you @njaremko 🙏

Could you please add some comments about what this file contains and how you created it?

Ideally you could follow the model of https://github.com/apache/parquet-testing/blob/master/data/README.md

Also, I noticed the file is 8 KB. Given how widely this repo is cloned and copied, is there any way to make the example file smaller?

@njaremko njaremko force-pushed the nathan_06-24-add_databricks_direct_unload_file_containing_complex_map_key branch from 79cc811 to 6eea094 Compare June 25, 2025 19:26
@njaremko
Author

njaremko commented Jun 25, 2025

I've removed the unneeded columns, and the file is 2 KB now. I've also updated the README.

@njaremko njaremko force-pushed the nathan_06-24-add_databricks_direct_unload_file_containing_complex_map_key branch from 6eea094 to f0d64c2 Compare June 25, 2025 19:30
@njaremko njaremko force-pushed the nathan_06-24-add_databricks_direct_unload_file_containing_complex_map_key branch from f0d64c2 to 54bffa8 Compare June 25, 2025 19:31
@njaremko
Author

What are the next steps to get this merged?

Contributor

@alamb alamb left a comment

Thank you @njaremko. This looks good to me.

I dumped the schema and layout of data/complex_map_key.parquet (output below) and it looks good to me.

Note that I am not a committer on parquet, so I cannot commit this PR. Perhaps @wgtmac or @emkornfield could take a look.

Also, FYI, I don't think we need to gate the fix in arrow-rs on this PR. I will comment on apache/arrow-rs#7769 as well.

parquet-layout data/complex_map_key.parquet
required group field_id=-1 spark_schema {
  required group field_id=-1 map_nested (Map) {
    repeated group field_id=-1 key_value {
      required binary field_id=-1 key (String);
      required group field_id=-1 value (Map) {
        repeated group field_id=-1 key_value {
          required binary field_id=-1 key (String);
          required binary field_id=-1 value (String);
        }
      }
    }
  }
  required group field_id=-1 map_nested_array (Map) {
    repeated group field_id=-1 key_value {
      required group field_id=-1 key (List) {
        repeated group field_id=-1 list {
          required int32 field_id=-1 element;
        }
      }
      required group field_id=-1 value (Map) {
        repeated group field_id=-1 key_value {
          required binary field_id=-1 key (String);
          required int32 field_id=-1 value;
        }
      }
    }
  }
}

and

File Name: data/complex_map_key.parquet
Version: 1.0
Created By: parquet-mr version 1.12.3-databricks-0002 (build 2484a95dbe16a0023e3eb29c201f99ff9ea771ee)
Total rows: 1
Number of RowGroups: 1
Number of Real Columns: 2
Number of Columns: 6
Number of Selected Columns: 6
Column 0: map_nested.key_value.key (BYTE_ARRAY / String / UTF8)
Column 1: map_nested.key_value.value.key_value.key (BYTE_ARRAY / String / UTF8)
Column 2: map_nested.key_value.value.key_value.value (BYTE_ARRAY / String / UTF8)
Column 3: map_nested_array.key_value.key.list.element (INT32)
Column 4: map_nested_array.key_value.value.key_value.key (BYTE_ARRAY / String / UTF8)
Column 5: map_nested_array.key_value.value.key_value.value (INT32)
--- Row Group: 0 ---
--- Total Bytes: 256 ---
--- Total Compressed Bytes: 266 ---
--- Rows: 1 ---
Column 0
  Values: 1, Null Values: 0, Distinct Values: 0
  Max: a, Min: a
  Compression: SNAPPY, Encodings: PLAIN
  Uncompressed Size: 40, Compressed Size: 42
Column 1
  Values: 1, Null Values: 0, Distinct Values: 0
  Max: b, Min: b
  Compression: SNAPPY, Encodings: PLAIN
  Uncompressed Size: 42, Compressed Size: 44
Column 2
  Values: 1, Null Values: 0, Distinct Values: 0
  Max: c, Min: c
  Compression: SNAPPY, Encodings: PLAIN
  Uncompressed Size: 42, Compressed Size: 44
Column 3
  Values: 2, Null Values: 0, Distinct Values: 0
  Max: 2, Min: 1
  Compression: SNAPPY, Encodings: PLAIN
  Uncompressed Size: 45, Compressed Size: 47
Column 4
  Values: 1, Null Values: 0, Distinct Values: 0
  Max: green, Min: green
  Compression: SNAPPY, Encodings: PLAIN
  Uncompressed Size: 46, Compressed Size: 46
Column 5
  Values: 1, Null Values: 0, Distinct Values: 0
  Max: 5, Min: 5
  Compression: SNAPPY, Encodings: PLAIN
  Uncompressed Size: 41, Compressed Size: 43
--- Values ---
key                           |key                           |value                         |element                       |key                           |value                         |
a                             |b                             |c                             |1                             |green                         |5                             |
2                             |
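
For readers following along, here is a rough sketch of how the two top-level fields ("Number of Real Columns: 2") expand into the six leaf columns listed in the dump. It is not part of the PR or of any Parquet tooling: the `Node` class and `leaves` helper are hypothetical names made up for illustration, and the schema is transcribed by hand from the dump above. The level arithmetic assumes the usual Parquet rule that each repeated ancestor adds one repetition level and each non-required node adds one definition level.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for one Parquet schema node (not a real library type)."""
    name: str
    repetition: str  # "required", "optional", or "repeated"
    children: List["Node"] = field(default_factory=list)


def leaves(node, path=(), rep=0, definition=0):
    """Yield (dotted_path, max_repetition_level, max_definition_level) per leaf column."""
    for child in node.children:
        child_path = path + (child.name,)
        # A repeated node raises both levels; an optional node raises only the definition level.
        child_rep = rep + (child.repetition == "repeated")
        child_def = definition + (child.repetition != "required")
        if child.children:
            yield from leaves(child, child_path, child_rep, child_def)
        else:
            yield ".".join(child_path), child_rep, child_def


# Schema transcribed by hand from the dump above (logical-type annotations omitted).
schema = Node("spark_schema", "required", [
    Node("map_nested", "required", [
        Node("key_value", "repeated", [
            Node("key", "required"),
            Node("value", "required", [
                Node("key_value", "repeated", [
                    Node("key", "required"),
                    Node("value", "required"),
                ]),
            ]),
        ]),
    ]),
    Node("map_nested_array", "required", [
        Node("key_value", "repeated", [
            Node("key", "required", [
                Node("list", "repeated", [Node("element", "required")]),
            ]),
            Node("value", "required", [
                Node("key_value", "repeated", [
                    Node("key", "required"),
                    Node("value", "required"),
                ]),
            ]),
        ]),
    ]),
])

columns = list(leaves(schema))
for index, (column_path, rep_level, def_level) in enumerate(columns):
    print(f"Column {index}: {column_path} (max rep={rep_level}, max def={def_level})")
```

Under these assumptions, the list-typed map key (Column 3, `map_nested_array.key_value.key.list.element`) sits under two repeated groups, which fits the dump reporting two values (1 and 2) for that column in a single row.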

Member

@wgtmac wgtmac left a comment

Do we actually need the map_nested field if we just want to add complex key/value types?


| File | Description |
|----------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| complex_map_key.parquet | Contains a map with an array key. |
Member

Is it worth describing the exact schema of the file at this line?
