-
Notifications
You must be signed in to change notification settings - Fork 16.4k
Encrypt all string trigger attributes #36801
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| def _decrypt(_value: Any) -> Any: | ||
| if isinstance(_value, str): | ||
| return fernet.decrypt(_value.encode("utf-8")).decode("utf-8") | ||
| if isinstance(_value, dict): | ||
| return {k: _decrypt(v) for k, v in _value.items()} | ||
| if isinstance(_value, list): | ||
| return [_decrypt(v) for v in _value] | ||
| if isinstance(_value, tuple): | ||
| return tuple(_decrypt(v) for v in _value) | ||
| return _value | ||
|
|
||
| decrypted_kwargs = {} | ||
| for key, value in trigger_row.kwargs.items(): | ||
| decrypted_kwargs[key] = _decrypt(value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not all the attributes?
fernet encrypt/decrypt only bytes, it will be complicated to convert all the types to bytes (without a 3rd party lib like pickle), also we cannot convert the dict to string because there is no guarantee that it's json serialzable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm. I am not sure. That will complicate rotation of the fernet key.
Actually if you look at the kwargs definition in the Trigger model, this field is ExtendedJson which (look at it) already serializes whatever is passed to it as string (and this is how it stores it).
So actually what we would need to do is we should - I think modify ExtendedJson to encrypt/decrypt the serialized string just before saving/after retrieving it - and there also all the fernet_rotation could be implemented as well much faster (without having to deserialize the data - it will just need to be re-encrypted. Also this ExtendedJson field type could handle migration from previous airflow version and simply use the original string if decryption fails.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also one more thing here - encrypting all strings vs. the whole blob might be a performance hit and it will produce strange results if strings are small. I had a discussion about it with @bolkedebruin - if you have quite a number o of small strings that have predictable values encrypted, there might be ways that this can be utilised to guess the encryption keys (rainbow tables and the like) - in this case it's not likely to be possible because fernet key is generally a rather long random bate array, but I think a good security practice is to encrypt whole blob rather than individual fields witin it (especially if some of those fields are small and predictable).
I do not have an exact attack in mind - it's just intelligent guess/intuition that in this case encrypting all kwargs fields serialized as a single blob will have much better performance and security properties.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I agree with the performance argument, I'll try to implement it that way.
|
|
||
| Triggers are serialized and stored in the database, so they can be re-instantiated on any triggerer process. This means that any sensitive information you pass to a trigger will be stored in the database. | ||
| If you want to pass sensitive information to a trigger, you can encrypt it before passing it to the trigger, and decrypt it inside the trigger, or update the argument name in the ``serialize`` method by adding ``encrypted__`` as a prefix, and Airflow will automatically encrypt the argument before storing it in the database, and decrypt it when it is read from the database. | ||
| Since Airflow 2.8.1, all the string values in the serialized trigger are encrypted before storing them in the database, and decrypted when they are read from the database. This means that the sensitive information is not stored in plain text in the database anymore. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ephraimbuddy in case you want to move it to 2.9.0, we need to commit this suggestion before merging.
| Since Airflow 2.8.1, all the string values in the serialized trigger are encrypted before storing them in the database, and decrypted when they are read from the database. This means that the sensitive information is not stored in plain text in the database anymore. | |
| Since Airflow 2.9.0, all the string values in the serialized trigger are encrypted before storing them in the database, and decrypted when they are read from the database. This means that the sensitive information is not stored in plain text in the database anymore. |
|
Two things:
|
jscheffl
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two things that are not 100% clear to me:
- Where to handle the key rotation as mentioned by @potiuk ?
- How about existing triggers if Airflow is upgraded, will running/open tasks on triggers fail after upgrade as previous data is not encrypted?
Haha, hitting the same two things in parallel :-D |
After the doc you created about the upgrade process, I think the safest solution is keeping this improvement to 2.9.0, where we recommend shutting down the Airflow components during the upgrade to a different minor version.
I will check it. |
potiuk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a feeling we should do it more "calmly" for 2.8.2.
Also I think if we do this automated encryption it's not a feature - merely an implementation detail, but _encrypted was a new feature.
Why? Because if any Trigger would like to store things in _encrypted and rely on it being well, encrypted, they will have to add (somehow) >= VERSION_OF_AIRFLOW_WHERE_IT_WAS RELEASED
If we encrypt by default, then all Triggers will use it automatically and this will become implementation detail (and essentially a bugfix)
See my above comments. I think it can be both - 2.8.2 and handling migration transparently with relative ease if we implement it by implementing it at ExtendedJson level |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can solve the migration problem
Yes, let's keep it to 2.8.2, I will try to implement a full dict encryption to improve the performance, and for the migration, could you check the migration script (will update it after updating the encryption method) |
|
What if we encrypt entire payload? By the same way as we done in Connection Extra? By doing that it required to convert existed type field in DB from JSON (text depends on DB) to binary type. And this could be done during migration with convertion/encryption existed payload. |
Quite Agree. I also thought the same - see my comment #36801 (comment) and our earlier discusion with @jscheffl #36492 (comment) First of all - I think indeed there is no particular reason to use JSON type here - we are using ExtendedJSON type which implements airflow serialization, but that is on its own weird - our serialization serializes more than JSON - AND it uses much proprietary JSON that is generally not suitable to using as JSON source of data - the way it is implemented (and the fact it uses JSON) is merely an implementation detail - so using JSON field to store it, makes very little sense, because (as @jscheffl mentioned in his comments) it anyhow requires the data to be deserialized using our Python code to be "useful" so we generaly lose any DB-level support to read and query that json data. So it makes very little sense to store it in JSON field (and we are not using anywhere the fact that it is JSON field). IMHO we should convert it to a simple TEXT or even BLOB field. Secondly the security property/ performance - let me repeat from what I wrote above - I have no hard data, but intuitively it's much more secure and faster to encrypt the whole blob rather than multiple individual fields in it. It likely takes a lot more memory and CPU (and enthropy which is also important performance factor for encryption) to convert and encrypt multiple individual (sometimes short) strings in a data structure, than it is to encrypt the whole string representation in a single operation. Especially if you consider the usual padding for short strings, and include the fact that encrypting multiple short string generally decreases "security" properties of these (think dictionary/rainbow attacks - it makes much more sense to encrypt the whole blob - especially if you have no need/possibility to read the remaining unencrypted fields for all your data. I made the same comments in #35867 by @bolkedebruin - who wants to add encryption for serde primitives, which I think is generally bad idea and we should instead encrypt the whole serialized blob. Its faster, more secure,and we could even add a nice tool that will decrypt it for diagnostic purposes if needed. And most of all - it is far more future-proof because you do not have to rely on individuals who write serialziation to mark all the fields that require to be serialized (the last point is somewhat mitigated by @hussein-awala here by serializing all strings, but I think it solves just a subset of it). |
|
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions. |
|
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions. |
|
Closed in favor of #38233 |
related: #36492
This PR encrypts all the string attributes in the trigger instead of encrypting only the ones that start with
encrypted__prefix; it also encrypts the string values stored in the dict, list, tuple or any nested object of these types.cc: @potiuk @jscheffl @dstandish @ephraimbuddy