Description
Is your feature request related to a problem or challenge?
arrow-rs is in the process of gaining support for Parquet modular encryption - see apache/arrow-rs#7278. It would be useful to be able to read and write encrypted Parquet files with DataFusion, but it's not clear how to integrate this feature due to the complex configuration required.
Examples of this complex configuration are:
- Users may require different encryption or decryption keys to be specified per Parquet file
- The encryption and decryption keys specified may depend on the file schema
- The encryption keys may need to be generated per file by interacting with a user's key management service (KMS)
- Decryption keys may need to be retrieved dynamically based on the metadata read from Parquet files and require interaction with a KMS. This process would be opaque to DataFusion, but requires the FileDecryptionProperties in arrow-rs to be created with a callback that can't be represented as a string configuration option (Allow retrieving Parquet decryption keys based on the key metadata arrow-rs#7257).
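To illustrate why the last point can't be reduced to a string option, here is a minimal sketch of a key retrieval callback. The KeyRetriever trait and InMemoryKms type below are hypothetical stand-ins written for this example, not the arrow-rs API, and a real implementation would call out to an actual KMS rather than a HashMap:

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a key retrieval callback: given the key metadata
/// stored in a Parquet file, return the decryption key.
pub trait KeyRetriever: Send + Sync {
    fn retrieve_key(&self, key_metadata: &[u8]) -> Result<Vec<u8>, String>;
}

/// Hypothetical KMS client, backed by an in-memory map for illustration only.
pub struct InMemoryKms {
    keys: HashMap<Vec<u8>, Vec<u8>>,
}

impl InMemoryKms {
    pub fn new(keys: HashMap<Vec<u8>, Vec<u8>>) -> Self {
        Self { keys }
    }
}

impl KeyRetriever for InMemoryKms {
    fn retrieve_key(&self, key_metadata: &[u8]) -> Result<Vec<u8>, String> {
        // Look up the key by the metadata read from the file.
        self.keys
            .get(key_metadata)
            .cloned()
            .ok_or_else(|| format!("no key found for metadata {:?}", key_metadata))
    }
}
```

The retriever carries state (here a map, in practice a KMS connection), which is why it needs to be passed as an object or closure rather than a string.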
I have an example of what using a KMS to read and write encrypted files might look like, but this isn't yet merged in arrow-rs: https://github.com/adamreeve/arrow-rs/blob/7afb60e1ee0e4c190468c153b252324235a63d96/parquet/examples/round_trip_encrypted_parquet.rs
Currently all Parquet format options can be easily encoded as strings or primitive types, and live in datafusion-common, which has an optional dependency on the parquet crate, although TableParquetOptions is always defined even if the parquet feature is disabled.
We're experimenting with using encryption in DataFusion by adding encoded keys to the ParquetOptions struct, but this is quite limited and doesn't support the more complex configuration options I mention above.
Describe the solution you'd like
One solution might be to allow users to arbitrarily customize the Parquet writing and reading options, e.g. with something like:
--- a/datafusion/common/src/config.rs
+++ b/datafusion/common/src/config.rs
@@ -1615,6 +1615,12 @@ pub struct TableParquetOptions {
/// )
/// ```
pub key_value_metadata: HashMap<String, Option<String>>,
+ /// Callback to modify the Parquet WriterPropertiesBuilder with custom configuration
+ #[cfg(feature = "parquet")]
+ pub writer_configuration: Option<Arc<dyn Fn(WriterPropertiesBuilder) -> WriterPropertiesBuilder>>,
+ /// Callback to modify the Parquet ArrowReaderOptions with custom configuration
+ #[cfg(feature = "parquet")]
+ pub read_configuration: Option<Arc<dyn Fn(ArrowReaderOptions) -> ArrowReaderOptions>>,
}
impl TableParquetOptions {
These callbacks would probably need some other inputs too, like the file schema. This would allow DataFusion users to specify encryption-specific options without DataFusion itself needing to know about them, and might be useful for applying other Parquet options that aren't already exposed in DataFusion. This also supports generating different encryption properties per file.
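As a rough sketch of what a callback that also receives the file schema might look like, the example below uses hypothetical stand-in types (WriterPropsBuilder, FileSchema, TableOptions) instead of the real parquet and DataFusion types, and the "secret_" column prefix is purely an assumption for illustration:

```rust
use std::sync::Arc;

/// Hypothetical stand-in for the parquet crate's WriterPropertiesBuilder.
#[derive(Clone, Debug, Default, PartialEq)]
pub struct WriterPropsBuilder {
    pub encryption_key: Option<Vec<u8>>,
    pub encrypted_columns: Vec<String>,
}

/// Hypothetical stand-in for the Arrow schema of the file being written.
#[derive(Clone, Debug)]
pub struct FileSchema {
    pub columns: Vec<String>,
}

/// Callback that receives the file schema along with the builder.
pub type WriterConfigFn =
    Arc<dyn Fn(&FileSchema, WriterPropsBuilder) -> WriterPropsBuilder + Send + Sync>;

/// Hypothetical stand-in for TableParquetOptions.
#[derive(Clone, Default)]
pub struct TableOptions {
    pub writer_configuration: Option<WriterConfigFn>,
}

impl TableOptions {
    /// Build writer properties for one file, applying the user callback if set.
    pub fn writer_props_for(&self, schema: &FileSchema) -> WriterPropsBuilder {
        let builder = WriterPropsBuilder::default();
        match &self.writer_configuration {
            Some(configure) => configure(schema, builder),
            None => builder,
        }
    }
}

/// Example user callback: encrypt every column whose name starts with "secret_".
pub fn encrypt_sensitive_columns(key: Vec<u8>) -> WriterConfigFn {
    Arc::new(move |schema: &FileSchema, mut builder: WriterPropsBuilder| {
        builder.encryption_key = Some(key.clone());
        builder.encrypted_columns = schema
            .columns
            .iter()
            .filter(|name| name.starts_with("secret_"))
            .cloned()
            .collect();
        builder
    })
}
```

Because the callback runs once per file, it can return different properties for each schema it sees.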
TableParquetOptions can currently be created from environment variables, which wouldn't be possible for these extra fields, but I don't think that should be a problem?
Another minor issue is that TableParquetOptions implements PartialEq, and I don't think it would be possible to sanely implement equality while allowing custom callbacks like this.
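For what it's worth, one fallback would be to compare callbacks by pointer identity, which is well defined but arguably not "sane" equality, since two separately constructed callbacks with identical behaviour compare unequal. A minimal sketch, where ReaderOptions and ReaderCallback are hypothetical stand-in types rather than anything in DataFusion or arrow-rs:

```rust
use std::fmt;
use std::sync::Arc;

/// Hypothetical stand-in for the parquet crate's ArrowReaderOptions.
#[derive(Clone, Debug, Default, PartialEq)]
pub struct ReaderOptions {
    pub decryption_key: Option<Vec<u8>>,
}

/// Wrapper around a callback so it can sit in a struct that derives PartialEq.
#[derive(Clone)]
pub struct ReaderCallback(Arc<dyn Fn(ReaderOptions) -> ReaderOptions + Send + Sync>);

impl ReaderCallback {
    pub fn new(f: impl Fn(ReaderOptions) -> ReaderOptions + Send + Sync + 'static) -> Self {
        Self(Arc::new(f))
    }

    pub fn apply(&self, options: ReaderOptions) -> ReaderOptions {
        (self.0)(options)
    }
}

impl PartialEq for ReaderCallback {
    fn eq(&self, other: &Self) -> bool {
        // Compare only the data addresses of the two allocations, ignoring
        // the vtable part of the wide pointers.
        Arc::as_ptr(&self.0) as *const () == Arc::as_ptr(&other.0) as *const ()
    }
}

impl fmt::Debug for ReaderCallback {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Closures can't be printed, so emit a placeholder.
        f.write_str("ReaderCallback(..)")
    }
}
```

With this, clones of the same callback compare equal, and everything else compares unequal.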
Describe alternatives you've considered
@alamb also suggested in delta-io/delta-rs#3300 that it could be possible to use an Arc<dyn Any> to allow passing more complex configuration types through TableParquetOptions.
I'm not sure exactly what this would look like though. Maybe the option would still hold a callback function but just hidden behind the Any trait, or maybe we would want to limit this to encryption-specific configuration options, but I think we'd need to maintain the ability to generate ArrowReaderOptions and WriterProperties per file.
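A minimal sketch of how the Arc<dyn Any> approach could work, assuming an opaque extension slot on the options struct that the Parquet integration downcasts when it needs the configuration. TableOptions, EncryptionConfig, and the field names here are hypothetical and only illustrate the downcasting mechanics:

```rust
use std::any::Any;
use std::sync::Arc;

/// Hypothetical stand-in for TableParquetOptions with an opaque extension slot.
#[derive(Clone, Default)]
pub struct TableOptions {
    pub extension: Option<Arc<dyn Any + Send + Sync>>,
}

impl TableOptions {
    /// Store any user-defined configuration value in the extension slot.
    pub fn with_extension<T: Any + Send + Sync>(value: T) -> Self {
        let extension: Arc<dyn Any + Send + Sync> = Arc::new(value);
        Self {
            extension: Some(extension),
        }
    }
}

/// Hypothetical encryption configuration that the options struct itself
/// knows nothing about.
pub struct EncryptionConfig {
    pub key_id: String,
}

/// Recover the concrete configuration, if present and of the expected type.
pub fn encryption_config(options: &TableOptions) -> Option<&EncryptionConfig> {
    options
        .extension
        .as_deref()?
        .downcast_ref::<EncryptionConfig>()
}
```

The stored value could equally be a struct holding a callback, which would keep the ability to generate properties per file while hiding the parquet types from datafusion-common.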
Additional context
No response