Skip to content

feat: Parquet modular encryption #16351

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 38 commits into
base: main
Choose a base branch
from

Conversation

corwinjoy
Copy link

@corwinjoy corwinjoy commented Jun 10, 2025

Which issue does this PR close?

What changes are included in this PR?

This PR adds support for encryption in DataFusion’s Parquet implementation. The changes introduce new configuration options for file encryption and decryption properties, update various components (including proto conversion, file reading/writing, and tests), and add an end-to-end encrypted Parquet example.

Are these changes tested?

Tests and examples have been added to demonstrate and test functionality versus Parquet modular encryption. These could use feedback since there may be additional DataFusion usage cases that should be covered.

Are there any user-facing changes?

Additional options have been added to allow encryption/decryption configuration. We are soliciting additional feedback on how to handle key columns in a way that best fits the existing API.

Catalog of changes via copilot

Show a summary per file
File Description
docs/source/user-guide/configs.md Added documentation for new encryption and decryption properties.
datafusion/sqllogictest/test_files/information_schema.slt Updated SQL logic tests to include encryption/decryption settings.
datafusion/proto/src/logical_plan/file_formats.rs Added default None values for encryption properties in proto.
datafusion/proto-common/src/from_proto/mod.rs Updated conversion to set encryption properties to None.
datafusion/datasource-parquet/src/source.rs Passed file decryption properties from table options.
datafusion/datasource-parquet/src/opener.rs Disabled page index when encryption is enabled.
datafusion/datasource-parquet/src/file_format.rs Propagated decryption properties in schema and statistics fetching.
datafusion/core/tests/parquet/mod.rs Added the encryption test module reference.
datafusion/core/tests/parquet/encryption.rs Added tests for round-trip encryption.
datafusion/core/tests/parquet/custom_reader.rs Updated custom reader to pass None for decryption properties.
datafusion/core/src/datasource/file_format/parquet.rs Added extra parameters for decryption in metadata/statistics fetching.
datafusion/core/src/dataframe/parquet.rs Added roundtrip test for encrypted Parquet files.
datafusion/common/src/file_options/parquet_writer.rs Updated writer options to support encryption properties.
datafusion/common/Cargo.toml Added the hex dependency.
datafusion-examples/examples/parquet_encrypted.rs Introduced an example that reads/writes encrypted Parquet files.
datafusion-examples/README.md Updated examples list to include the encrypted Parquet demo.
Cargo.toml Enabled encryption feature for the parquet crate.

corwinjoy and others added 30 commits May 29, 2025 19:04
Co-authored-by: Adam Reeve <adreeve@gmail.com>
Co-authored-by: Adam Reeve <adreeve@gmail.com>
2. Fixed unused header warning in config.rs.
3. Fix test case in encryption.rs to call conversion to ConfigFileDecryption properties correctly.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Add an example to read and write encrypted parquet files.
@github-actions github-actions bot added documentation Improvements or additions to documentation core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) common Related to common crate proto Related to proto crate datasource Changes to the datasource crate labels Jun 10, 2025
@corwinjoy
Copy link
Author

@adamreeve @rok

@@ -60,11 +60,11 @@ pub async fn main() -> Result<()> {
Options::Cancellation(opt) => opt.run().await,
Options::Clickbench(opt) => opt.run().await,
Options::H2o(opt) => opt.run().await,
Options::Imdb(opt) => opt.run().await,
Options::Imdb(opt) => Box::pin(opt.run()).await,
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

requested by clippy

// Any hex encoded values must be pre-encoded using
// hex::encode() before calling set.
if key.starts_with("column_keys_as_hex.") {
let k = match key.split(".").collect::<Vec<_>>()[..] {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could use some feedback on how to do the column keys. Originally, I had used a separator of '::' to match what is done with metadata fields. But TableParquetOptions redirects all '::' delimitors as seen here.
https://github.com/corwinjoy/datafusion/blob/a81855fcbf3cfb63512c1ba124e1ebbfd5e6b15c/datafusion/common/src/config.rs#L2100

So, I'm not quite sure what to do here. For now, we use '.' to separate columns.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the encryption related settings were directly set on the TableParquetOptions or a crypto/encryption namespace rather than in ParquetOptions then I think we could avoid this issue. But then they'd probably need to be included in ParquetReadOptions too to work with SessionContext::read_parquet (see related comment at https://github.com/apache/datafusion/pull/16351/files#r2136718671).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've opened a PR against your branch that implements the suggestion to move these configuration options under a new field in TableParquetOptions: corwinjoy#5. I think this worked quite nicely and simplified the ConfigField implementations

pub column_metadata_as_hex: HashMap<String, String>,
pub aad_prefix_as_hex: String,
pub store_aad_prefix: bool, // default = false
}
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We create a separate Config struct, then use From methods to convert back and forth from the underlying parquet FileEncryptionProperties.

pub file_decryption_properties: Option<ConfigFileDecryptionProperties>, default = None

/// Optional file encryption properties
pub file_encryption_properties: Option<ConfigFileEncryptionProperties>, default = None
Copy link
Author

@corwinjoy corwinjoy Jun 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adamreeve and I are not completely sure where these settings should go. On the session context there's only a way to set the "global" ParquetOptions but not TableParquetOptions, which contains extra table-specific settings.

It does feel a bit wrong to put file-specific decryption properties in the execution context (see later examples). Eg. if users were reading two different encrypted Parquet files in one query they might need to set different decryption properties for each file, so setting them in the execution context wouldn't work. At the moment I think this scenario would require creating separate listing tables and specifying TableParquetOptions. That's an edge case so maybe I'm overthinking this, but maybe being able to set file decryption properties in ParquetReadOptions would be a good idea?

This doesn't really fit all that well with the reader options that Parquet has, though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe @metesynnada or @berkaysynnada have some ideas of how to do this

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@adamreeve Has a nice PR to move this all to a crypto namespace which cleans this up a lot. We are still debating a bit, since we want to understand the impact downstream for tools like delta-rs.
corwinjoy#5

.unwrap();

for (i, col_name) in column_names.iter().enumerate() {
let key = format!("file_encryption_properties.column_keys_as_hex.{col_name}");
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note use of '.' as separator for column name, as mentioned above.

@corwinjoy
Copy link
Author

@alamb One piece I would like to solicit feedback on is if there is a way to leverage the existing tests to more thoroughly vet encryption. What I mean by that, is that we uncovered a read bug when using filters in a query, and I worry that there could be other edge cases that might not be covered. What I would like to do is take an encrypted parquet file and then run the datafusion SQL tests over it (and maybe other operation tests). This would help to make sure that all the SQL operations are really covered. And maybe in addition, somehow double-check things like statistics and bloom filters? Anyway, I'm hoping there is a way to leverage the existing test suite to cover these cases. Any suggestions?

@mbutrovich
Copy link
Contributor

Thank you and @adamreeve for driving so much of the modular encryption work! I'll take a look at this branch this week and see how this might get Comet supporting modular encryption within Spark, or if any obvious gaps jump out at me.

@alamb
Copy link
Contributor

alamb commented Jun 16, 2025

I am sorry I haven't had a chance to review this yet. It would be great if @mbutrovich could also take a look. I have this on my list to review but I haven't been able to find the time yet

Move encryption and decryption configuration options into a separate crypto namespace
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
common Related to common crate core Core DataFusion crate datasource Changes to the datasource crate documentation Improvements or additions to documentation proto Related to proto crate sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support integration with Parquet modular encryption
4 participants