Allow loading custom Catalog implementation in Spark and Flink #1640
Conversation
Review thread on flink/src/main/java/org/apache/iceberg/flink/FlinkCatalogFactory.java (outdated, resolved)
Thanks Jack. I like the following approach: ...
My main concern with this is how configuration is currently passed using Hadoop. I think we also want to pass options from the catalog config in Flink and Spark, where users can pass properties like ...
I like the Map over Configuration suggestion as well. In #1587 I made the constructor take ... Has anyone thought of how to do this for the ...?
@rymurr, for your ...: figuring out how we want to do this shouldn't be too difficult. We just found it easier to keep the existing behavior for the last release since there weren't other catalogs at the time.
(force-pushed from d3201f2 to 17f1c5c, then from 17f1c5c to d3ec643)
fix rebase issue, reopen PR
Review thread on flink/src/main/java/org/apache/iceberg/flink/CatalogLoader.java (outdated, resolved)
```
@@ -58,13 +59,19 @@
// Can not just use "type", it conflicts with CATALOG_TYPE.
public static final String ICEBERG_CATALOG_TYPE = "catalog-type";
public static final String ICEBERG_CATALOG_TYPE_HADOOP = "hadoop";
public static final String ICEBERG_CATALOG_TYPE_HIVE = "hive";
public static final String ICEBERG_CATALOG_TYPE_CUSTOM = "custom";
```
Instead of using `type=custom` and `impl=com.example.Catalog`, why not just combine them into `type=com.example.Catalog`? We can try to load the type as an implementation class if it isn't a well-known name like "hive".
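To make the suggestion concrete, here is a minimal sketch of how a loader could treat the `type` value either as a well-known name or as a fully qualified `Catalog` implementation class. The class and method names below are illustrative assumptions, not the code in this PR; as the description further down shows, the merged change ultimately keeps a separate `catalog-impl` property rather than overloading `type`.

```java
import java.util.Locale;
import org.apache.iceberg.catalog.Catalog;

// Illustrative sketch only: resolve "type" as a well-known name first,
// and otherwise try to load it as a Catalog implementation class.
public class CatalogTypeResolver {

  static Catalog resolve(String type) {
    String name = type.toLowerCase(Locale.ROOT);
    if (name.equals("hive") || name.equals("hadoop")) {
      return loadBuiltInCatalog(name); // existing switch-based behavior
    }

    // Not a well-known name: interpret it as a class like "com.example.Catalog"
    // and instantiate it reflectively with a no-arg constructor.
    try {
      return (Catalog) Class.forName(type).getDeclaredConstructor().newInstance();
    } catch (ReflectiveOperationException | ClassCastException e) {
      throw new IllegalArgumentException("Cannot load catalog implementation: " + type, e);
    }
  }

  private static Catalog loadBuiltInCatalog(String name) {
    // placeholder for the existing hive/hadoop construction logic
    throw new UnsupportedOperationException("built-in catalog loading elided: " + name);
  }
}
```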
@aokolnychyi and @RussellSpitzer, do you have an opinion here?
Yeah, this sounds like a cleaner way to go; the only disadvantage is that we are overloading the term `type`. I have updated the code.
Review thread on spark2/src/main/java/org/apache/iceberg/spark/source/IcebergSource.java (outdated, resolved)
@jackye1995, I'm happy with the implementation and behavior (other than changing ...).
(force-pushed from 2ddfeb7 to c7ce41a, then from c7ce41a to 485dea0)
Interesting, but it should still be compatible by using ...
I was suggesting using the narrower function instead of a map here as well.
What's the drawback of using a map? It is a bit narrower to use the function, but it is also limiting. Catalogs can't use ...
Just a lot of extra methods exposed: iteration, put, etc.
I think the advantage of using a ...
@rdblue @jacques-n, any thoughts?
I'm fine with it as is. Some background: I'm not especially supportive of maps in interfaces since they are so broad. I've had bad experiences in the past where we've had to move maps to external storage (say DynamoDB) and having a broad use of the Map interface hurt us in reworking things.
I think it is okay to use a map here. Like I said, even if we had to copy values to create a map to pass in here, that happens once per session and is not an unreasonable amount of overhead.
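For context on the map-versus-function debate above, here is a rough sketch contrasting the two shapes the initialization hook could take. The interface and parameter names are assumptions for illustration, not the interface that was merged.

```java
import java.util.Map;
import java.util.function.Function;

// Illustrative sketch of the two options debated above.
public interface CatalogInitialization {

  // Option A: pass the full property map. Broader surface (iteration, put, size, etc.),
  // but a catalog can forward properties it does not know about to other components.
  void initialize(String name, Map<String, String> properties);

  // Option B: pass a narrower lookup function. Smaller surface, but the catalog
  // can only ask for keys it already expects.
  void initialize(String name, Function<String, String> propertyLookup);
}
```

The thread settles on the map, on the grounds that copying values to build it happens once per session and the overhead is negligible.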
As we are having multiple new Catalog implementations added to Iceberg, we need a way to load those Catalogs in Spark and Flink easily. Currently there is a simple switch branch that chooses between `hive` and `hadoop` catalogs. This approach requires the `iceberg-spark` and `iceberg-flink` modules to take a dependency on the catalog implementation modules. This would potentially bring in many unnecessary dependencies as more and more cloud providers try to add support for Iceberg.

This PR proposes the following way to load custom Catalog implementations:

- the `type` of a catalog can be `hive` or `hadoop` to keep existing behaviors
- if `catalog-impl` is set, `type` is ignored and we will load the catalog based on the class value
- the custom catalog is initialized through `initialize()` and can implement `Configurable` to read Hadoop configuration

For example, a `GlueCatalog` will be used in Spark like the following:
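The example snippet that originally followed this sentence did not survive extraction. Below is a minimal sketch of what such a Spark configuration could look like, assuming the Spark `spark.sql.catalog.<name>` property convention, the Iceberg `SparkCatalog` plugin, and a hypothetical `com.example.GlueCatalog` implementation class; only the `catalog-impl` property name comes from this PR's description.

```java
import org.apache.spark.sql.SparkSession;

public class CustomCatalogExample {
  public static void main(String[] args) {
    // Sketch only: the catalog name, warehouse path, and GlueCatalog class path
    // are illustrative assumptions, not values taken from the PR.
    SparkSession spark = SparkSession.builder()
        .appName("iceberg-custom-catalog-example")
        .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.my_catalog.catalog-impl", "com.example.GlueCatalog")
        .config("spark.sql.catalog.my_catalog.warehouse", "s3://my-bucket/warehouse")
        .getOrCreate();

    // Tables in the custom catalog can then be addressed as my_catalog.db.table.
    spark.sql("SHOW NAMESPACES IN my_catalog").show();
  }
}
```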