-
Notifications
You must be signed in to change notification settings - Fork 5.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Consider predicate columns for encryption #15187
Conversation
@@ -470,6 +481,43 @@ public CounterStat getHighMemorySplitSource() | |||
return concat(partitionBatches); | |||
} | |||
|
|||
private static Optional<Set<HiveColumnHandle>> mergeRequestedAndPredicateColumns(Optional<Set<HiveColumnHandle>> requestedColumns, Set<HiveColumnHandle> predicateColumns) | |||
{ | |||
if (!requestedColumns.isPresent() || predicateColumns.isEmpty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you remind me again why requested columns is optional? requested columns being optional and predicate columns not makes the logic here kind of confusing. Is it possible to always provide it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am just going with the structure of HiveTableLayoutHandle
and in this class - requestedColumns
is optional and predicateColumns
is not. Given this is where I get the data from, I don't think it is possible for me to always provide requestedColumns
return requestedColumns; | ||
} | ||
|
||
Map<String, Set<HiveColumnHandle>> nameToColumnHandles = new HashMap<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use the groupingBy collector https://docs.oracle.com/javase/8/docs/api/java/util/stream/Collectors.html#groupingBy-java.util.function.Function-java.util.stream.Collector-
continue; | ||
} | ||
|
||
if (finalHandle.getHiveType().getCategory() != STRUCT || finalHandle.getRequiredSubfields().isEmpty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why can we short circuit for non-structs? is it because there will never be any subfields? I think the shortcircuiting is more confusing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, because there will never be subfields in NON-structs. Well, there might be key, value stuff for maps, but that doesn't really impact encryption implementation since either the full map/array is encrypted or not.
.map(columnHandles -> { | ||
HiveColumnHandle finalHandle = null; | ||
for (HiveColumnHandle columnHandle : columnHandles) { | ||
if (finalHandle == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this whole section be simplified to
if (finalHandle == null || finalHandle.getRequiredSubfields.isEmpty() || columnHandle.getRequiredSubfields.isEmpty()) {
finalHandle = columnHandle;
}
else {
finalHandle = (HiveColumnHandle) finalHandle.withRequiredSubfields(ImmutableList.copyOf(ImmutableSet.copyOf(
ImmutableList.<Subfield>builder().addAll(finalHandle.getRequiredSubfields()).addAll(columnHandle.getRequiredSubfields()).build())));
}
Also, can we do it as a reduce on the stream instead of having a forloop inside the stream? (if we use groupingBy, don't even need to collect to map first)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for advise on grouping by and reduce - this has made the code a lot simpler now.
87d4e2e
to
eec44be
Compare
In encryption we pass the list of columns that are being read to help with getting the credentials only for those columns and allow for more granular access control. Uptil now, only the columns being requested in the select statement were the only ones being considered. This commit adds support for columns in the predicate as well. We now consider the predicate columns as well. A special case to callout is `ROW`. Presto currently supports encryption granularity of a subfield, however, it is non-trivial to get the exact subfields being accessed for a `ROW`. This PR puts the whole `ROW` column if any subfield is being used in the predicate. This is a limitation for now.
eec44be
to
4f86b23
Compare
@rschlussel - all the tests are green - so feel free to merge it whenever you get a chance to :) |
@VisibleForTesting | ||
static Optional<Set<HiveColumnHandle>> mergeRequestedAndPredicateColumns(Optional<Set<HiveColumnHandle>> requestedColumns, Set<HiveColumnHandle> predicateColumns) | ||
{ | ||
if (!requestedColumns.isPresent() || predicateColumns.isEmpty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mayankgarg1990 Is this the right logic? For a SELECT count(*) FROM t WHERE a > 10
we can have requestedColumns empty and predicateColumns containing "a". In this case, don't we need to make sure user has access to "a" ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In this case - I am checking for requestedColumns.isPresent()
and the situation you are talking about will actually be requestedColumns.get().isEmpty()
.
So this check mainly ensures that due to some reason if we don't end up getting the requestedColumns field populated at all, the encryption provider will now try to get the credentials for all columns.
In encryption we pass the list of columns that are being read to help with
getting the credentials only for those columns and allow for more granular
access control.
Uptil now, only the columns being requested in the select statement were
the only ones being considered. This commit adds support for columns in
the predicate as well. We now consider the predicate columns as well.
A special case to callout is
ROW
. Presto currently supports encryptiongranularity of a subfield, however, it is non-trivial to get the exact
subfields being accessed for a
ROW
. This PR puts the wholeROW
columnif any subfield is being used in the predicate. This is a limitation for
now.
Test plan - It is very difficult to test this change in a unit test in its current format. A
bigger refactor is needed to be able to create tables with dummy encryption providers.
For this specific PR, I performed extensive tests against Facebook infrastructure and
ensured that all specific situations are succeeding.