-
Notifications
You must be signed in to change notification settings - Fork 1.5k
feat: Parquet modular encryption #16351
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…uite, column encryption is broken.
Co-authored-by: Adam Reeve <adreeve@gmail.com>
Co-authored-by: Adam Reeve <adreeve@gmail.com>
…operties to use references.
… "." instead of "::"
2. Fixed unused header warning in config.rs. 3. Fix test case in encryption.rs to call conversion to ConfigFileDecryption properties correctly.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Add an example to read and write encrypted parquet files.
@@ -60,11 +60,11 @@ pub async fn main() -> Result<()> { | |||
Options::Cancellation(opt) => opt.run().await, | |||
Options::Clickbench(opt) => opt.run().await, | |||
Options::H2o(opt) => opt.run().await, | |||
Options::Imdb(opt) => opt.run().await, | |||
Options::Imdb(opt) => Box::pin(opt.run()).await, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
requested by clippy
datafusion/common/src/config.rs
Outdated
// Any hex encoded values must be pre-encoded using | ||
// hex::encode() before calling set. | ||
if key.starts_with("column_keys_as_hex.") { | ||
let k = match key.split(".").collect::<Vec<_>>()[..] { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could use some feedback on how to do the column keys. Originally, I had used a separator of '::' to match what is done with metadata fields. But TableParquetOptions redirects all '::' delimitors as seen here.
https://github.com/corwinjoy/datafusion/blob/a81855fcbf3cfb63512c1ba124e1ebbfd5e6b15c/datafusion/common/src/config.rs#L2100
So, I'm not quite sure what to do here. For now, we use '.' to separate columns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the encryption related settings were directly set on the TableParquetOptions
or a crypto
/encryption
namespace rather than in ParquetOptions
then I think we could avoid this issue. But then they'd probably need to be included in ParquetReadOptions
too to work with SessionContext::read_parquet
(see related comment at https://github.com/apache/datafusion/pull/16351/files#r2136718671).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've opened a PR against your branch that implements the suggestion to move these configuration options under a new field in TableParquetOptions
: corwinjoy#5. I think this worked quite nicely and simplified the ConfigField implementations
datafusion/common/src/config.rs
Outdated
pub column_metadata_as_hex: HashMap<String, String>, | ||
pub aad_prefix_as_hex: String, | ||
pub store_aad_prefix: bool, // default = false | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We create a separate Config struct, then use From methods to convert back and forth from the underlying parquet FileEncryptionProperties
.
datafusion/common/src/config.rs
Outdated
pub file_decryption_properties: Option<ConfigFileDecryptionProperties>, default = None | ||
|
||
/// Optional file encryption properties | ||
pub file_encryption_properties: Option<ConfigFileEncryptionProperties>, default = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@adamreeve and I are not completely sure where these settings should go. On the session context there's only a way to set the "global" ParquetOptions
but not TableParquetOptions
, which contains extra table-specific settings.
It does feel a bit wrong to put file-specific decryption properties in the execution context (see later examples). Eg. if users were reading two different encrypted Parquet files in one query they might need to set different decryption properties for each file, so setting them in the execution context wouldn't work. At the moment I think this scenario would require creating separate listing tables and specifying TableParquetOptions
. That's an edge case so maybe I'm overthinking this, but maybe being able to set file decryption properties in ParquetReadOptions
would be a good idea?
This doesn't really fit all that well with the reader options that Parquet has, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe @metesynnada or @berkaysynnada have some ideas of how to do this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@adamreeve Has a nice PR to move this all to a crypto namespace which cleans this up a lot. We are still debating a bit, since we want to understand the impact downstream for tools like delta-rs.
corwinjoy#5
datafusion/common/src/config.rs
Outdated
.unwrap(); | ||
|
||
for (i, col_name) in column_names.iter().enumerate() { | ||
let key = format!("file_encryption_properties.column_keys_as_hex.{col_name}"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note use of '.' as separator for column name, as mentioned above.
@alamb One piece I would like to solicit feedback on is if there is a way to leverage the existing tests to more thoroughly vet encryption. What I mean by that, is that we uncovered a read bug when using filters in a query, and I worry that there could be other edge cases that might not be covered. What I would like to do is take an encrypted parquet file and then run the datafusion SQL tests over it (and maybe other operation tests). This would help to make sure that all the SQL operations are really covered. And maybe in addition, somehow double-check things like statistics and bloom filters? Anyway, I'm hoping there is a way to leverage the existing test suite to cover these cases. Any suggestions? |
Thank you and @adamreeve for driving so much of the modular encryption work! I'll take a look at this branch this week and see how this might get Comet supporting modular encryption within Spark, or if any obvious gaps jump out at me. |
I am sorry I haven't had a chance to review this yet. It would be great if @mbutrovich could also take a look. I have this on my list to review but I haven't been able to find the time yet |
Move encryption and decryption configuration options into a separate crypto namespace
Which issue does this PR close?
What changes are included in this PR?
This PR adds support for encryption in DataFusion’s Parquet implementation. The changes introduce new configuration options for file encryption and decryption properties, update various components (including proto conversion, file reading/writing, and tests), and add an end-to-end encrypted Parquet example.
Are these changes tested?
Tests and examples have been added to demonstrate and test functionality versus Parquet modular encryption. These could use feedback since there may be additional DataFusion usage cases that should be covered.
Are there any user-facing changes?
Additional options have been added to allow encryption/decryption configuration. We are soliciting additional feedback on how to handle key columns in a way that best fits the existing API.
Catalog of changes via copilot
Show a summary per file