Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider predicate columns for encryption #15187

Merged
merged 1 commit into from
Sep 21, 2020

Conversation

mayankgarg1990
Copy link
Contributor

In encryption we pass the list of columns that are being read to help with
getting the credentials only for those columns and allow for more granular
access control.

Uptil now, only the columns being requested in the select statement were
the only ones being considered. This commit adds support for columns in
the predicate as well. We now consider the predicate columns as well.

A special case to callout is ROW. Presto currently supports encryption
granularity of a subfield, however, it is non-trivial to get the exact
subfields being accessed for a ROW. This PR puts the whole ROW column
if any subfield is being used in the predicate. This is a limitation for
now.

Test plan - It is very difficult to test this change in a unit test in its current format. A
bigger refactor is needed to be able to create tables with dummy encryption providers.

For this specific PR, I performed extensive tests against Facebook infrastructure and
ensured that all specific situations are succeeding.

@@ -470,6 +481,43 @@ public CounterStat getHighMemorySplitSource()
return concat(partitionBatches);
}

private static Optional<Set<HiveColumnHandle>> mergeRequestedAndPredicateColumns(Optional<Set<HiveColumnHandle>> requestedColumns, Set<HiveColumnHandle> predicateColumns)
{
if (!requestedColumns.isPresent() || predicateColumns.isEmpty()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you remind me again why requested columns is optional? requested columns being optional and predicate columns not makes the logic here kind of confusing. Is it possible to always provide it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am just going with the structure of HiveTableLayoutHandle and in this class - requestedColumns is optional and predicateColumns is not. Given this is where I get the data from, I don't think it is possible for me to always provide requestedColumns

return requestedColumns;
}

Map<String, Set<HiveColumnHandle>> nameToColumnHandles = new HashMap<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

continue;
}

if (finalHandle.getHiveType().getCategory() != STRUCT || finalHandle.getRequiredSubfields().isEmpty()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why can we short circuit for non-structs? is it because there will never be any subfields? I think the shortcircuiting is more confusing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, because there will never be subfields in NON-structs. Well, there might be key, value stuff for maps, but that doesn't really impact encryption implementation since either the full map/array is encrypted or not.

.map(columnHandles -> {
HiveColumnHandle finalHandle = null;
for (HiveColumnHandle columnHandle : columnHandles) {
if (finalHandle == null) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this whole section be simplified to

if (finalHandle == null || finalHandle.getRequiredSubfields.isEmpty() || columnHandle.getRequiredSubfields.isEmpty()) {
   finalHandle = columnHandle;
}
else {
   finalHandle = (HiveColumnHandle) finalHandle.withRequiredSubfields(ImmutableList.copyOf(ImmutableSet.copyOf(
                                    ImmutableList.<Subfield>builder().addAll(finalHandle.getRequiredSubfields()).addAll(columnHandle.getRequiredSubfields()).build())));
}

Also, can we do it as a reduce on the stream instead of having a forloop inside the stream? (if we use groupingBy, don't even need to collect to map first)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for advise on grouping by and reduce - this has made the code a lot simpler now.

In encryption we pass the list of columns that are being read to help with
getting the credentials only for those columns and allow for more granular
access control.

Uptil now, only the columns being requested in the select statement were
the only ones being considered. This commit adds support for columns in
the predicate as well. We now consider the predicate columns as well.

A special case to callout is `ROW`. Presto currently supports encryption
granularity of a subfield, however, it is non-trivial to get the exact
subfields being accessed for a `ROW`. This PR puts the whole `ROW` column
if any subfield is being used in the predicate. This is a limitation for
now.
@mayankgarg1990
Copy link
Contributor Author

@rschlussel - all the tests are green - so feel free to merge it whenever you get a chance to :)

@rschlussel rschlussel merged commit fa513e2 into prestodb:master Sep 21, 2020
@VisibleForTesting
static Optional<Set<HiveColumnHandle>> mergeRequestedAndPredicateColumns(Optional<Set<HiveColumnHandle>> requestedColumns, Set<HiveColumnHandle> predicateColumns)
{
if (!requestedColumns.isPresent() || predicateColumns.isEmpty()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mayankgarg1990 Is this the right logic? For a SELECT count(*) FROM t WHERE a > 10 we can have requestedColumns empty and predicateColumns containing "a". In this case, don't we need to make sure user has access to "a" ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this case - I am checking for requestedColumns.isPresent() and the situation you are talking about will actually be requestedColumns.get().isEmpty().

So this check mainly ensures that due to some reason if we don't end up getting the requestedColumns field populated at all, the encryption provider will now try to get the credentials for all columns.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants