Dell: Add document. #4993
Conversation
Fix the issue of `DellClientFactory`.
@samredai, can you take a look?
`docs/integrations/dell.md` (Outdated)
> This session will show you how to use Iceberg with Dell ECS. Dell ECS provides several features that are more appropriate for Iceberg:
>
> 1. Append operation for file writer.
Would you mind providing more description or a Dell doc link for those two features? I think we need to let others know more about the differences between Dell ECS and other vendors' object storage services.
`docs/integrations/dell.md` (Outdated)
> For example, to use the Dell ECS catalog with Flink 1.14, you should create a Flink environment like:
>
> ```sh
> wget https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-flink-runtime-1.14/0.14.0/iceberg-flink-runtime-1.14-0.14.0.jar
> ```
Since we haven't released Iceberg 0.14.0 yet, we can't verify this Dell ECS integration end to end. Is it possible to wget the latest snapshot artifact from the Maven repo?
Because the dependencies are also added in this commit, I don't know which version I should use.
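For anyone landing on this thread later, one way to answer the snapshot question is sketched below. This is a hedged sketch: the repository layout and the `0.14.0-SNAPSHOT` version string are assumptions, so check the Apache snapshots repository for the paths that actually exist.

```sh
# Inspect the snapshot metadata to find the latest timestamped build
# (layout and version are assumptions, not verified values).
BASE=https://repository.apache.org/content/repositories/snapshots/org/apache/iceberg
curl -s "$BASE/iceberg-flink-runtime-1.14/0.14.0-SNAPSHOT/maven-metadata.xml"
# Then download the timestamped jar listed there, e.g.:
# wget "$BASE/iceberg-flink-runtime-1.14/0.14.0-SNAPSHOT/iceberg-flink-runtime-1.14-0.14.0-<timestamp>-<build>.jar"
```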
`docs/integrations/dell.md` (Outdated)
> ```
> 'ecs.s3.secret-access-key' = 'xxxxxxxxxxxxxxxx')
> ```
>
> Then, `USE CATALOG my_catalog`, `SHOW DATABASES`, and `SHOW TABLES` to fetch the namespaces and tables of the catalog.
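The quoted lines above show only the tail of the catalog DDL. For context, a fuller statement might look like the sketch below, run through the Flink SQL client; the `catalog-impl` class name, endpoint, and credentials are placeholders assumed for illustration, not values confirmed in this PR.

```sh
# Write the catalog DDL to a file and execute it with the Flink SQL client
# (the -f flag is available in Flink 1.13+). Every value below is a placeholder.
cat > /tmp/create_ecs_catalog.sql <<'SQL'
CREATE CATALOG my_catalog WITH (
  'type' = 'iceberg',
  'catalog-impl' = 'org.apache.iceberg.dell.ecs.EcsCatalog',
  'warehouse' = 'ecs://bucket-a/namespace-a',
  'ecs.s3.endpoint' = 'http://10.0.0.1:9020',
  'ecs.s3.access-key-id' = 'my-user',
  'ecs.s3.secret-access-key' = 'xxxxxxxxxxxxxxxx'
);
SQL
./bin/sql-client.sh -f /tmp/create_ecs_catalog.sql
```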
Is it possible to provide a section for HiveCatalog + Dell ECS? In my experience, many people just want to replace HDFS storage with an object storage service while still maintaining their tables in HiveCatalog.
`docs/integrations/dell.md` (Outdated)
> ### Limitations
>
> When you use the catalog with Dell ECS only, you should care about these limitations:
Is the Dell ECS catalog the catalog that you Dell ECS folks recommend? In real cases, I see that people prefer HiveCatalog or another centralized metastore to collect all their tables so that they can track table changes in a global view.
Yes. We wish to provide an all-in-one product.
`docs/integrations/dell.md` (Outdated)
> For example, to use the Dell ECS catalog with Spark 3.2.1, you should create a Spark session like:
>
> ```sh
> spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.14.0,com.emc.ecs:object-client-bundle:3.3.2 \
> ```
The different mechanisms for setting the Spark configuration feel like material more suited for a quickstart. Since the properties are listed in the Connection parameters section above, here I think it just needs to describe the prefix added to set these in Spark (`ecs.s3.endpoint` is set as `spark.sql.catalog.<catalog-name>.ecs.s3.endpoint`).

In a quickstart, we can include this in a tabbed box with spark-sql, spark-shell, and pyspark examples to give something easy to copy & paste.
I've removed the examples for `spark-sql` and Flink `sql-client.sh` to avoid redundancy.
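To make the prefix rule concrete for readers of this thread, here is a hedged spark-sql sketch. The catalog name `my_catalog` and all values are placeholders, and the `catalog-impl` class name is an assumption; `org.apache.iceberg.spark.SparkCatalog` is Iceberg's standard Spark catalog adapter.

```sh
# Each connection parameter gains the spark.sql.catalog.<catalog-name>. prefix,
# e.g. ecs.s3.endpoint becomes spark.sql.catalog.my_catalog.ecs.s3.endpoint.
spark-sql \
  --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.my_catalog.catalog-impl=org.apache.iceberg.dell.ecs.EcsCatalog \
  --conf spark.sql.catalog.my_catalog.warehouse=ecs://bucket-a \
  --conf spark.sql.catalog.my_catalog.ecs.s3.endpoint=http://10.0.0.1:9020 \
  --conf spark.sql.catalog.my_catalog.ecs.s3.access-key-id=my-user \
  --conf spark.sql.catalog.my_catalog.ecs.s3.secret-access-key=xxxxxxxxxxxxxxxx
```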
`docs/integrations/dell.md` (Outdated)
> # Iceberg Dell Integrations
>
> ## Dell ECS Integrations
This h2 title and the h1 title above it feel redundant. I would consolidate them into a single title such as `# Dell ECS` or `# Dell ECS Catalog`. Also, I think the word "Integration" refers more to Iceberg's incorporation into compute/query engines such as Spark or Trino. I could be wrong, but I think Dell ECS is more of an "Implementation".
In previous PRs, the reviewers let me use `dell` as the module name. I also used `dell` as the document title and used the h2 title for ECS. Should I change this document to ECS only? And "Integration" will be changed to "Implementation".
On second thought, I think integration makes sense here. I think I'm getting caught up in the fact that this is an implementation of the catalog interface, but you're right--integration is more consistent with how we've been using this elsewhere and overall makes sense (this integrates Iceberg and ECS).
Thanks for @samredai's help. My English is bad. If you have other suggestions, I'll fix them to make this document more readable.
LGTM. I'm thinking more broadly about how we organize integrations into various tools, frameworks, and vendors to make it more intuitive for readers. I may end up including some refactoring/relocating here but that shouldn't hold up this PR.
> import java.util.Map;
>
> public interface DellClientFactory {
> public interface DellClientFactory extends Serializable {
Can you move code changes out of a documentation PR? These should be in a code PR for people that cherry-pick commits.
`spark/v3.2/build.gradle` (Outdated)
> @@ -213,6 +213,7 @@ project(":iceberg-spark:iceberg-spark-runtime-${sparkMajorVersion}_${scalaVersion}") {
>     exclude group: 'commons-logging', module: 'commons-logging'
>   }
>   implementation project(':iceberg-hive-metastore')
>   implementation project(':iceberg-dell')
Adding this module to the Spark runtime should be discussed and done in a separate PR, not in a documentation PR. Can you bring this up on the dev list and summarize what it pulls in and changes?
You'll also need to update the LICENSE and NOTICE files.
I think the documentation is probably close to being ready, but this should not mix code changes, runtime Jar additions, and documentation in a single PR.
Since this PR is replaced by newer ones, I'm going to close it.
@rdblue
No, if this is only docs, then we can review it. Thanks, @wang-x-xia.
@rdblue If you have any comments about the document, please let me know. This PR needs your approval now. 😂
`docs/integrations/dell.md` (Outdated)
> Iceberg can be used with Dell's Enterprise Object Storage (ECS) by using the ECS catalog since 0.14.0.
>
> Dell ECS has many features that make Iceberg a highly compatible table format, such as append operations for file writers and content addressable storage (CAS) for table commits. See [Dell ECS](https://www.dell.com/en-us/dt/storage/ecs/index.htm) for more information on Dell ECS.
I'm not sure what it means to be a "highly compatible table format". I think this should just have the second sentence, "See Dell ECS for more information."
Removed!
`docs/integrations/dell.md` (Outdated)
> ### Connection parameters
>
> When using Dell ECS with Iceberg, these configuration parameters are required:
Where?
`docs/integrations/dell.md` (Outdated)
> | ecs://bucket-a | Use the whole bucket as the data |
> | ecs://bucket-a/namespace-a | Use a prefix to access the data only in this specific namespace |
>
> When you provide the `warehouse`, the last `/` will be ignored. `ecs://bucket-a` is the same as `ecs://bucket-a/`.
Can you specifically state all of the required properties and where to set them? I think these are catalog properties.
The following examples will show how to configure these properties.
`docs/integrations/dell.md` (Outdated)
> ### Runtime dependencies
>
> The Iceberg `runtime` jar supports different versions of Spark and Flink. If the version was not matched in the example, please check the related documents for Spark and Flink.
This shouldn't start a section. It's probably a good note (if your version is different, make sure you have the right runtime Jar!).
Removed!
`docs/integrations/dell.md` (Outdated)
> from pyspark.sql import SparkSession
>
> spark = SparkSession.builder \
>     .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.14.0,com.emc.ecs:object-client-bundle:3.3.2")
Isn't the `iceberg-dell` jar required as well?
Added. When I wrote this document, I assumed all the PRs would be merged. But now no PR has been merged into 0.14.0, so I also changed the version to 0.15.0. 😂
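Reflecting the reply above, the package list with the `iceberg-dell` artifact included might look like the sketch below; the `org.apache.iceberg:iceberg-dell` coordinates are an assumption based on the module name, and 0.15.0 follows the author's version note.

```sh
# Assumed coordinates for the iceberg-dell artifact; verify against Maven Central.
spark-sql --packages \
  org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.15.0,org.apache.iceberg:iceberg-dell:0.15.0,com.emc.ecs:object-client-bundle:3.3.2
```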
`docs/integrations/dell.md` (Outdated)
> When you use the catalog with Dell ECS only, you should care about these limitations:
>
> 1. `RENAME` statements are supported without other protections. When you try to rename a table, you need to guarantee all commits are finished in the original table.
> 2. `RENAME` statements only rename the table without moving any data files. This can lead to a table's data being stored in a path outside of the configured warehouse path.
I don't think that table rename is supported. Wouldn't that break all metadata pointers?
It's just a rename at the catalog layer. Only the entry object that stores the latest version will be renamed; its contents are not changed. So the data controlled by Iceberg won't be changed, and no pointers will be broken.
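For illustration, a rename issued through Spark SQL might look like the sketch below (catalog, database, and table names are placeholders). Per the limitation quoted above, all in-flight commits on the source table should finish first.

```sh
# Renames only the catalog entry object; data and metadata files stay in place.
spark-sql -e "ALTER TABLE my_catalog.my_db.old_table RENAME TO my_catalog.my_db.new_table"
```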
2. Add the iceberg-dell jar. 3. Simplify the sentences of the Parameters part.
@rdblue
Sorry for the late response, @wang-x-xia. I remember @wang-x-xia pinged me in other IM tools, but I missed tracking this document, so sorry. Let me take a look and get this merged if it's ready.
`docs/integrations/dell.md` (Outdated)
> @@ -0,0 +1,132 @@
> ---
Is this document path still valid? I see the current docs directory doesn't have any integrations folder now. Would you mind updating this PR?
The document has been moved and the header fixed~
`docs/integrations/dell.md` (Outdated)
> Iceberg can be used with Dell's Enterprise Object Storage (ECS) by using the ECS catalog since 0.14.0.
>
> Dell ECS has many features that make Iceberg a highly compatible table format, such as append operations for file writers and content addressable storage (CAS) for table commits. See [Dell ECS](https://www.dell.com/en-us/dt/storage/ecs/index.htm) for more information on Dell ECS.
Would you mind attaching the ECS CAS link for more detailed information?
This part is already removed in the latest version.
# Conflicts: # docs/dell.md
`docs/integrations/dell.md` (Outdated)
> To use the Dell ECS catalog with Flink, you first must create a Flink environment.
>
> ```python
> import requests
> ```
Is it good enough to provide a Python script to download the Flink jars? I see the script needs the `pyflink` and `requests` dependencies, and we don't provide any details about how to run the Python script. I'm afraid most people won't be able to run this Python script successfully on their hosts.

I think it's simple enough to wget the Flink jars in a two-line bash script and show the following SQL execution using Flink SQL.
Now the scripts are bash scripts.
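The two-line bash approach suggested above might look like this sketch; the first URL appears earlier in this PR, and the second is derived from the `com.emc.ecs:object-client-bundle:3.3.2` coordinates using the standard Maven repository layout.

```sh
# Download the Iceberg Flink runtime and the Dell ECS client bundle.
wget https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-flink-runtime-1.14/0.14.0/iceberg-flink-runtime-1.14-0.14.0.jar
wget https://repo1.maven.org/maven2/com/emc/ecs/object-client-bundle/3.3.2/object-client-bundle-3.3.2.jar
```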
`docs/integrations/dell.md` (Outdated)
> The related problems of catalog usage:
>
> 1. The [pyspark.sql.SparkSession.catalog](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.SparkSession.catalog.html#pyspark.sql.SparkSession.catalog) won't access the 3rd-party catalog of Spark, so please use DDL SQL to list all tables and namespaces.
The link is 404 now.
The link is removed.
`docs/integrations/dell.md` (Outdated)
> For example, to use the Dell ECS catalog with Spark 3.2.1, you should create a Spark session like:
>
> ```python
> ```
Let's just use a Java/Scala or bash example to keep consistency with other pages.
Now the scripts are bash scripts.
`docs/dell.md` (Outdated)
> The related problems of catalog usage:
>
> 1. The `pyspark.sql.SparkSession.catalog` won't access the 3rd-party catalog of Spark, so please use DDL SQL to list all tables and namespaces.
Does only `pyspark.sql.SparkSession.catalog` have this problem? How about Java & Scala?
All languages are the same. I have fixed this statement.
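As a sketch of the DDL-based workaround, listing can be done with SQL statements instead of the `SparkSession` catalog API; `my_catalog` and `my_db` are placeholder names.

```sh
# Works the same whether the session is driven from Python, Java, or Scala,
# since the statements run through Spark SQL rather than the catalog API.
spark-sql -e "SHOW NAMESPACES IN my_catalog"
spark-sql -e "SHOW TABLES IN my_catalog.my_db"
```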
> Then, you can run `USE CATALOG my_catalog`, `SHOW DATABASES`, and `SHOW TABLES` to fetch the namespaces and tables of the catalog.
Open question: should we also add a Hive section? I guess many people still use Hive on Tez to process their data; it's okay if you think we don't need to. Thanks.
It's a good idea. Relevant tests are being prepared, and I'd like to merge this document first.
LGTM, thanks @wang-x-xia for the contribution!
The PR has 3 parts: the documentation, the code changes, and adding the `:iceberg:dell` module to the runtime package.