Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Support for user-defined enums #14691

Closed
wants to merge 13 commits into from

Conversation

daniel-ohayon
Copy link
Contributor

@daniel-ohayon daniel-ohayon commented Jun 22, 2020

The goal of this PR is to enable basic support for enum user-defined types (UDTs) in Presto. Enums already exist in Postgres and MySQL, and this would be a first step toward generic UDT support.

Requirements

  1. Support for both integer-valued (64-bit) and string-valued enums
  2. Enum types should be provided by plugins and kept up-to-date as they change (without the need to restart the server)
  3. Enum literals should be supported, with case-insensitive type names, but case-sensitive keys eg Mood 'HAPPY'
  4. CAST from primitive types to enums should be supported, failing if the provided value is not part of the enum
  5. NULLs are allowed
  6. Both enum keys and values must be unique for a given enum
  7. Enum definition should be forwarded to clients (including the Presto CLI) for custom rendering
  8. The fact that a given column is an enum should be persisted in table schemas
  9. You can compare enum values of the same type (EQUAL, NOT_EQUAL operators)
  10. [?] You can compare an enum value to a literal of the underlying primitive type, but not to a reference, i.e. my_enum_column = 4 is OK, but my_enum_column = my_int_column is not. This would be in line with enum behavior in languages like Hack, and it would facilitate backwards-compatibility of existing queries that do not use enum literals
  11. [?] Support for ephemeral enum definitions forwarded by client session parameters

Implementation

  • Extend FunctionNamespaceManager to provide types
  • Cast functions generated on-the-fly for enums

TODO

[ ] Disambiguate key lookup from value lookup with string enums
[ ] Persist enum info to schema (separate PR)
[ ] Support =, != for enums
[ ] Support literal comparison
[ ] Move getTypeDynamically to TypeRegistry

@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Jun 22, 2020

CLA Check
The committers are authorized under a signed CLA.

@daniel-ohayon daniel-ohayon requested a review from rongrong June 22, 2020 03:17
@@ -69,7 +69,10 @@ private FixJsonDataUtils() {}
}
requireNonNull(columns, "columns is null");
List<TypeSignature> signatures = columns.stream()
.map(column -> parseTypeSignature(column.getType()))
.map(column -> {
TypeSignature displayType = column.getTypeMetadata().getDisplayType();
Copy link
Contributor Author

@daniel-ohayon daniel-ohayon Jun 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in this file, we use the type signature client-side to determine what kind of data we're dealing with.
Since the client doesn't know about the enum types, if it encounters a String value, it will default to base64 encoding as per:

if (value instanceof String) {
    return Base64.getDecoder().decode((String) value);
}

For string-valued enums, this results in gibberish in the CLI output.
This is why I'm passing this extra info about the display type

if (rowBuffer.size() >= MAX_BUFFERED_ROWS) {
flush(false);
}
}

private List<?> prettyPrintRow(List<?> row, List<Column> columns)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually this doesn't work with complex types (row, map, array...)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rebuild switch-case logic here

@Override
public boolean isComparable()
{
return false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Enum types are not comparable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They should be, but I haven't found a way to register the comparison functions dynamically (like we do for CAST) in a way that I'm happy with yet.

@@ -685,6 +685,16 @@ private boolean isMoreSpecificThan(ApplicableFunction left, ApplicableFunction r
return boundVariables.isPresent();
}

// Called by TypeRegistry to look up types from FunctionNamespaceManagers
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: move to TypeRegistry

@daniel-ohayon daniel-ohayon changed the title [WIP] Support for user-defined enums [WIP] Support for user-defined enums [language] Jun 23, 2020
@daniel-ohayon daniel-ohayon marked this pull request as draft June 26, 2020 00:59
@daniel-ohayon daniel-ohayon changed the title [WIP] Support for user-defined enums [language] [WIP] Support for user-defined enums Jun 26, 2020
import java.util.Map;
import java.util.stream.Collectors;

public class IntegerEnumType
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we call it LongEnumType so there's no confusion? Integer can also mean specifically the SQL INTEGER type which is 32-bit (though using 32-bit is probably enough for an enum type so that would be an option as well?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's go for LongEnumType!

public interface EnumType<T>
extends Type
{
Map<String, T> getEntries();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getEntries sounds like it should return a list. Maybe explicitly call it getEnumNameToValueMap or just getEnumMap or getMap.

@@ -23,12 +23,12 @@

import static java.util.Collections.singletonList;

public final class VarcharType
public class VarcharType
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You probably want to refactor VarcharType into AbstractVarcharType and VarcharType (similar to AbstractLongType and BigintType) and keep this as final. That way you also avoid the weird need to provide a type signature for the constructor.

@@ -901,6 +904,14 @@ private SpecializedFunctionKey doGetSpecializedFunctionKey(Signature signature)
Iterable<SqlFunction> candidates = getFunctions(null, signature.getName());
// search for exact match
Type returnType = typeManager.getType(signature.getReturnType());

if (returnType instanceof EnumType && CastType.CAST.getCastName().equals(signature.getName())) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this logic in BuiltInFunctionNamespaceManager?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reorganized it a bit in #14728 to have the enum-specific logic only live in EnumOperators (but we still have a call from BuiltinFunctionNamespaceManager - I think it makes sense to have logic there to handle those UDT functions on-the-fly since it's a "core" feature which is independent of any specific UDT).

name,
type.getTypeSignature().toString(),
new ClientTypeSignature(type.getTypeSignature()),
type.getTypeMetadata());
}

public Column(String name, TypeSignature signature)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is what you can do: remove this constructor (Column(String name, TypeSignature signature)). There's no real use case for this one. Create a Type class in client (I think call it Type is fine, but if it feels too confusing we can call it ClientType), which wraps the string type, client type signature and enum maps.

if (rowBuffer.size() >= MAX_BUFFERED_ROWS) {
flush(false);
}
}

private List<?> prettyPrintRow(List<?> row, List<Column> columns)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might just call this function processRow or transform.

@daniel-ohayon daniel-ohayon deleted the dohayon/enums-v4 branch November 11, 2020 23:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants