Description
What is the problem the feature request solves?
We would like Comet to fully support complex types (arrays, structs, and maps). This issue is for tracking all of the individual issues.
Google doc: https://docs.google.com/document/d/1eiDFEScPjxBMahJW6lmBI8JjVlI6CwhiJgkTSsTvPVY/edit?usp=sharing
Implement new native scans based on DataFusion's DataSourceExec
We now have new native_datafusion and native_iceberg_compat scans that use DataFusion's DataSourceExec, which already supports complex types.
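For anyone who wants to try the new scans while these issues are open, the scan implementation is chosen through Spark configuration. The following is a minimal sketch, not an official recipe: it assumes the `spark.comet.scan.impl` setting accepts the three scan names used in this issue and that Comet is enabled through `spark.plugins` and `spark.comet.enabled`. The helper `comet_scan_conf` is hypothetical and exists only for illustration.

```python
# Hypothetical helper (not part of Comet): builds spark-submit "--conf" flags
# that select one of the scan implementations discussed in this issue.
# Assumption: the config keys below match the Comet version you are running.
KNOWN_SCANS = ("native_comet", "native_datafusion", "native_iceberg_compat")


def comet_scan_conf(scan_impl: str) -> list[str]:
    """Return spark-submit flags enabling Comet with the given scan."""
    if scan_impl not in KNOWN_SCANS:
        raise ValueError(
            f"unknown scan {scan_impl!r}; expected one of {KNOWN_SCANS}"
        )
    settings = {
        "spark.plugins": "org.apache.spark.CometPlugin",
        "spark.comet.enabled": "true",
        "spark.comet.scan.impl": scan_impl,
    }
    flags: list[str] = []
    for key, value in settings.items():
        flags += ["--conf", f"{key}={value}"]
    return flags


if __name__ == "__main__":
    # Print the flags so they can be pasted after `spark-submit` or `spark-shell`.
    print(" ".join(comet_scan_conf("native_datafusion")))
```

Rejecting unknown names up front keeps a typo from being passed through to Spark as if it were a valid scan name.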
We need to fix the remaining Spark SQL test failures:
- Spark SQL test failures in native_iceberg_compat mode #1542
- Spark SQL test failures in native_datafusion scan #1545
Known issues:
- Parquet scan NATIVE_DATAFUSION and NATIVE_ICEBERG_COMPAT fail to read uint8, uint16 negative values correctly #1348
- [native_datafusion] Spark SQL failure "select nested field from a complex map key using map_keys" #1754
- [native_datafusion] PARQUET_FIELD_ID_READ_ENABLED is not respected #1758
- [native_datafusion] No support for default values for Parquet columns #1750
- native_datafusion/native_iceberg_compat scans case sensitive #1574
- ParquetEncryptionITCase fails with native_iceberg_compat #1488
Other scan-related work
These items may not be relevant to all users, but some environments need more work before the new DataSourceExec-based scans can be used. Comet's current default native_comet scan is JVM-based and relies on Hadoop data source functionality that is not available in DataFusion.
- Support reading data from S3 using native_datafusion Parquet scanner #1766
- Wrap Hadoop file readers in JNI so that we can call from Rust, to support use cases such as encryption
- Add Parquet column index support #1082
- Custom Authentication for cloud storage
- HDFS Support
Supporting expressions that operate on complex types
- Expressions
  - Array
    - [EPIC] Add support for all array expressions #1042
    - Update to_json to support arrays
    - Implement CAST from array to string
  - Struct
  - Map
    - [EPIC] Add support for all Map functions #1044
    - Implement CAST from Map to String
    - Add map support to to_json
- Array
Performance
- Create benchmarks for complex types
Testing
- Test native_datafusion and native_iceberg_compat with all supported Java, Spark, and Scala versions #1486
- Test native_datafusion and native_iceberg_compat with Spark SQL tests #1489
- Fuzz testing
Older / related issues:
- Use parquet crate for decoding Parquet data into Arrow arrays #1040
- Support complex datatypes in Comet Scan #434
Describe the potential solution
No response
Additional context
No response