Closed
Description
What would you like to happen?
Add a transform that uses a secret from a secret manager to encrypt all GroupByKey (GBK) data before it is sent to a runner. This allows users to avoid unencrypted data at rest.
Exact support is still runner dependent (the runner needs to guarantee that it won't otherwise materialize PCollections), but this transform makes it possible to offer the capability.
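The idea can be illustrated with a minimal, pure-Python sketch. It is not Beam's implementation: the names `group_by_encrypted_key`, `toy_encrypt`, and `toy_decrypt` are hypothetical, the secret is passed in directly rather than fetched from a secret manager, and the HMAC-based stream cipher is only a stand-in for real authenticated encryption (for example AES-GCM), so it should not be used in production. It assumes keys and values are JSON-serializable. The point it shows is that the grouping step only ever sees an opaque HMAC token and ciphertext.

```python
import hashlib
import hmac
import json
import os
from collections import defaultdict


def _keystream(secret: bytes, nonce: bytes, length: int) -> bytes:
    """Derive `length` pseudo-random bytes from HMAC-SHA256 in counter mode."""
    out = b""
    counter = 0
    while len(out) < length:
        block = b"enc:" + nonce + counter.to_bytes(8, "big")
        out += hmac.new(secret, block, hashlib.sha256).digest()
        counter += 1
    return out[:length]


def toy_encrypt(secret: bytes, plaintext: bytes) -> bytes:
    """Toy cipher (illustration only): random nonce + XOR with the keystream."""
    nonce = os.urandom(16)
    stream = _keystream(secret, nonce, len(plaintext))
    return nonce + bytes(a ^ b for a, b in zip(plaintext, stream))


def toy_decrypt(secret: bytes, blob: bytes) -> bytes:
    """Inverse of toy_encrypt."""
    nonce, body = blob[:16], blob[16:]
    stream = _keystream(secret, nonce, len(body))
    return bytes(a ^ b for a, b in zip(body, stream))


def group_by_encrypted_key(pairs, secret: bytes) -> dict:
    """Group (key, value) pairs while the grouping step sees only opaque bytes."""
    # Stage 1 (before the shuffle): replace each key with a deterministic
    # HMAC token so equal keys still group together, and encrypt key and value.
    shuffled = []
    for key, value in pairs:
        key_bytes = json.dumps(key, sort_keys=True).encode("utf-8")
        value_bytes = json.dumps(value, sort_keys=True).encode("utf-8")
        token = hmac.new(secret, b"tok:" + key_bytes, hashlib.sha256).digest()
        payload = (toy_encrypt(secret, key_bytes), toy_encrypt(secret, value_bytes))
        shuffled.append((token, payload))

    # Stage 2 (stands in for the runner's GBK): group by the opaque token.
    groups = defaultdict(list)
    for token, payload in shuffled:
        groups[token].append(payload)

    # Stage 3 (after the shuffle): decrypt the key once and every value.
    result = {}
    for payloads in groups.values():
        key = json.loads(toy_decrypt(secret, payloads[0][0]))
        result[key] = [json.loads(toy_decrypt(secret, v)) for _, v in payloads]
    return result


if __name__ == "__main__":
    demo_secret = os.urandom(32)  # in practice this would come from a secret manager
    print(group_by_encrypted_key([("a", 1), ("b", 2), ("a", 3)], demo_secret))
```

The key token has to be deterministic so that equal keys land in the same group, while the encrypted key and values use a fresh nonce per element.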
Subparts:
- Add GBEK transform in Python - Add GroupByEncryptedKey transform #36213
- Add GBEK transform in Java - Java GroupByEncryptedKey #36217
- Support pipeline option to auto-replace GBK with GBEK in Python - Add pipeline option to enforce gbek #36321
- Support pipeline option to auto-replace GBK with GBEK in Java - Add pipeline option to force GBEK (Java) #36346
- CombinePerKey can break this workflow, since some workflows will lift it to the top level. We should not allow CombineValues transform replacement when using GBEK, which can be done by not using the COMBINE_PER_KEY urn - CombinePerKey with gbek (Python) #36382 and CombinePerKey with gbek (Java) #36408
- Make sure the pipeline option works in an x-lang setting (e.g. Java GBK in a Python pipeline) - Use consistent encoding for GBEK across languages #36431, x-lang GroupByEncryptedKey (Java to Python) #36418, Add some x-lang gbek tests (Python to Java) #36457, and Fix passing pipeline options to external transforms #36443
- Add gbek option to the def runnerV2CommonPipelineOptions = [ and opts=( definitions and run postcommit suites (no need to merge). Set up a similar suite which runs a few integration tests. Led to PRs - Handle null keys in gbek #36505 and Softens the GBEK determinism requirement #36495
- (Dataflow only) Set a service option to tell the service that gbek enforcement is in place - Add use_gbek service option when gbek option used #36452
Future work: Support a more universal pipeline option which runners can opt into/out of for this concept, like #36214 (comment)
Issue Priority
Priority: 2 (default / most feature requests should be filed as P2)
Issue Components
- Component: Python SDK
- Component: Java SDK
- Component: Go SDK
- Component: Typescript SDK
- Component: IO connector
- Component: Beam YAML
- Component: Beam examples
- Component: Beam playground
- Component: Beam katas
- Component: Website
- Component: Infrastructure
- Component: Spark Runner
- Component: Flink Runner
- Component: Samza Runner
- Component: Twister2 Runner
- Component: Hazelcast Jet Runner
- Component: Google Cloud Dataflow Runner