
Dell: Add document. #4993

Merged
14 commits merged into apache:master
Sep 7, 2022

Conversation

wang-x-xia
Contributor

The PR has 3 parts:

  1. Add a document of the ECS how-to.
  2. Fix bugs found when testing the catalog in production.
  3. Add the `:iceberg:dell` module to the runtime package.

@rdblue
Contributor

rdblue commented Jun 12, 2022

@samredai, can you take a look?


This section will show you how to use Iceberg with Dell ECS. Dell ECS provides several features that are well suited to Iceberg:

1. Append operation for file writer.
Member

Would you mind providing more description or a Dell doc link for those two features? I think we need to let others know more about the differences between Dell ECS and other vendors' object storage services.

For example, to use the Dell ECS catalog with Flink 1.14, you should create a Flink environment like:

```sh
wget https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-flink-runtime-1.14/0.14.0/iceberg-flink-runtime-1.14-0.14.0.jar
```
Member

Since we still haven't released Iceberg 0.14.0, we can't verify and try out this Dell ECS integration. Is it possible to wget the latest snapshot artifact from the Maven repo?

Contributor Author

Because the dependencies are also added in this commit, I don't know which version I should use.

'ecs.s3.secret-access-key' = 'xxxxxxxxxxxxxxxx')
```

Then, run `USE CATALOG my_catalog`, `SHOW DATABASES`, and `SHOW TABLES` to fetch the namespaces and tables of the catalog.
Member

Is it possible to provide a section for HiveCatalog + Dell ECS? In my experience, many people just want to replace HDFS storage with an object storage service while still maintaining their tables in HiveCatalog.


### Limitations

When you use the catalog with Dell ECS only, you should be aware of these limitations:
Member

Is the Dell ECS catalog the catalog that you at Dell ECS recommend? In real cases, I see people prefer HiveCatalog or another centralized metastore to collect all their tables so that they can track table changes in a global view.

Contributor Author

Yes. We wish to provide an all-in-one product.

For example, to use the Dell ECS catalog with Spark 3.2.1, you should create a Spark session like:

```sh
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.14.0,com.emc.ecs:object-client-bundle:3.3.2 \
Contributor

The different mechanisms for setting the Spark configuration feel like material more suited for a quickstart. Since the properties are listed in the Connection parameters section above, here I think it just needs to describe the prefix added to set these in Spark (`ecs.s3.endpoint` is set as `spark.sql.catalog.<catalog-name>.ecs.s3.endpoint`).

In a quickstart, we can include this in a tabbed box with spark-sql, spark-shell, and pyspark examples to give something easy to copy and paste.
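The prefixing described above can be sketched with a small, hypothetical helper (the function name and the example endpoint are made up for illustration; the `spark.sql.catalog.<catalog-name>.` prefix is Spark's standard convention for passing properties to a named catalog):

```python
def to_spark_conf(catalog_name, catalog_props):
    """Prefix plain catalog properties (e.g. ecs.s3.endpoint) with
    spark.sql.catalog.<catalog-name>. so Spark routes them to that catalog."""
    prefix = f"spark.sql.catalog.{catalog_name}."
    return {prefix + key: value for key, value in catalog_props.items()}

conf = to_spark_conf("my_catalog", {
    "ecs.s3.endpoint": "https://ecs.example.com:9021",  # hypothetical endpoint
    "ecs.s3.access-key-id": "xxx",
})
# conf now contains keys like spark.sql.catalog.my_catalog.ecs.s3.endpoint
```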

Contributor Author

I've removed the examples for spark-sql and the Flink sql-client.sh to avoid redundancy.


# Iceberg Dell Integrations

## Dell ECS Integrations
Contributor

This h2 title and the h1 title above it feel redundant. I would consolidate them into a single title such as `# Dell ECS` or `# Dell ECS Catalog`. Also, I think the word "Integration" refers more to Iceberg's incorporation into compute/query engines such as Spark or Trino. I could be wrong, but I think Dell ECS is more of an "Implementation".

Contributor Author

In previous PRs, the reviewers had me use dell as the module name.
I also used dell as the document title, with an h2 title for ECS. Should I change this document to ECS only?

I will also change "Integration" to "Implementation".

Contributor

On second thought, I think integration makes sense here. I was getting caught up in the fact that this is an implementation of the catalog interface, but you're right: integration is more consistent with how we've been using the term elsewhere and overall makes sense (this integrates Iceberg and ECS).

@wang-x-xia
Contributor Author

Thanks for @samredai's help. My English is not great, so if you have other suggestions, I'll apply them to make this document more readable.

Contributor

@samredai samredai left a comment


LGTM. I'm thinking more broadly about how we organize integrations with various tools, frameworks, and vendors to make things more intuitive for readers. I may end up doing some refactoring/relocating here, but that shouldn't hold up this PR.

import java.util.Map;

-public interface DellClientFactory {
+public interface DellClientFactory extends Serializable {
Contributor

Can you move code changes out of a documentation PR? These should be in a code PR for people that cherry-pick commits.

@@ -213,6 +213,7 @@ project(":iceberg-spark:iceberg-spark-runtime-${sparkMajorVersion}_${scalaVersio
exclude group: 'commons-logging', module: 'commons-logging'
}
implementation project(':iceberg-hive-metastore')
implementation project(':iceberg-dell')
Contributor

Adding this module to the Spark runtime should be discussed and done in a separate PR, not in a documentation PR. Can you bring this up on the dev list and summarize what it pulls in and changes?

You'll also need to update the LICENSE and NOTICE files.

Contributor

@rdblue rdblue left a comment


I think the documentation is probably close to being ready, but this should not mix code changes, runtime Jar additions, and documentation in a single PR.

@github-actions github-actions bot removed the DELL label Jun 16, 2022
@wang-x-xia
Contributor Author

@rdblue
The changes have been reverted.
I've created a new PR for the bugfix:
#5059
And a new PR for the runtime JAR:
#5060

@rdblue
Contributor

rdblue commented Jun 29, 2022

Since this PR is replaced by newer ones, I'm going to close it.

@rdblue rdblue closed this Jun 29, 2022
@wang-x-xia
Contributor Author

@rdblue
This PR is for documentation only. I moved all the other changes to new PRs, keeping just the document here.
Should I create a new PR for the document?

@rdblue rdblue reopened this Jun 30, 2022
@rdblue
Contributor

rdblue commented Jun 30, 2022

No, if this is only docs, then we can review it. Thanks, @wang-x-xia.

@wang-x-xia
Contributor Author

@rdblue If you have some comments about the document, please let me know. This PR needs your approval now.😂


Since 0.14.0, Iceberg can be used with Dell's Enterprise Object Storage (ECS) via the ECS catalog.

Dell ECS has many features that make Iceberg a highly compatible table format, such as append operations for file writers and content addressable storage (CAS) for table commits. See [Dell ECS](https://www.dell.com/en-us/dt/storage/ecs/index.htm) for more information on Dell ECS.
Contributor

I'm not sure what it means to be a "highly compatible table format". I think this should just have the second sentence, "See Dell ECS for more information."

Contributor Author

Removed!


### Connection parameters

When using Dell ECS with Iceberg, these configuration parameters are required:
Contributor

Where?

| Warehouse path | Description |
| -------------- | ----------- |
| ecs://bucket-a | Use the whole bucket as the data |
| ecs://bucket-a/namespace-a | Use a prefix to access the data only in this specific namespace |

When you provide the `warehouse`, any trailing `/` is ignored: `ecs://bucket-a` is the same as `ecs://bucket-a/`.
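The trailing-slash behavior can be sketched as follows (a minimal, hypothetical helper for illustration, not the actual Iceberg implementation):

```python
def normalize_warehouse(uri: str) -> str:
    """Drop a single trailing '/' so ecs://bucket-a and ecs://bucket-a/
    resolve to the same warehouse location."""
    return uri[:-1] if uri.endswith("/") else uri

# Both spellings resolve to the same location.
print(normalize_warehouse("ecs://bucket-a/"))  # ecs://bucket-a
print(normalize_warehouse("ecs://bucket-a"))   # ecs://bucket-a
```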
Contributor

Can you specifically state all of the required properties and where to set them? I think these are catalog properties.

Contributor Author

The following examples will show how to configure these properties.


### Runtime dependencies

The Iceberg `runtime` jar supports different versions of Spark and Flink. If your version does not match the example, please check the related Spark or Flink documentation.
Contributor

This shouldn't start a section. It's probably a good note (if your version is different, make sure you have the right runtime Jar!).

Contributor Author

Removed!

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.14.0,com.emc.ecs:object-client-bundle:3.3.2")
Contributor

Isn't the iceberg-dell Jar required as well?

Contributor Author

Added.
When I wrote this document, I assumed all the PRs would be merged.
But no PR has been merged into 0.14.0, so I've also bumped the version to 0.15.0.
😂

When you use the catalog with Dell ECS only, you should be aware of these limitations:

1. `RENAME` statements are supported without additional protections. Before renaming a table, you must ensure that all commits to the original table have finished.
2. `RENAME` statements only rename the table without moving any data files. This can lead to a table's data being stored in a path outside of the configured warehouse path.
Contributor

I don't think that table rename is supported. Wouldn't that break all metadata pointers?

Contributor Author

It's just a rename at the catalog layer.
Only the entry object that stores the latest version pointer is renamed; its contents are not changed.
So the data controlled by Iceberg isn't touched, and no pointers are broken.
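The pointer-only rename described here can be sketched with a toy in-memory catalog (purely illustrative; the real ECS catalog stores each pointer as an object in the bucket, not as a dict entry):

```python
class ToyCatalog:
    """Maps table identifiers to the location of their latest metadata file."""

    def __init__(self):
        self.pointers = {}

    def create(self, name, metadata_location):
        self.pointers[name] = metadata_location

    def rename(self, old, new):
        # Only the catalog entry moves; the metadata location (and every
        # data/metadata file it references) is left untouched.
        self.pointers[new] = self.pointers.pop(old)

cat = ToyCatalog()
cat.create("db.t1", "ecs://bucket-a/db/t1/metadata/v3.metadata.json")
cat.rename("db.t1", "db.t2")
# db.t2 still points at the original t1 paths: nothing was copied or rewritten,
# which is also why renamed tables can live outside the warehouse path.
```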

2. Add the iceberg-dell jar.
3. Simplify the sentences of the Parameters part.
@wang-x-xia
Contributor Author

@rdblue
Could you please confirm whether the next release is 0.14.1 or 0.15?
If the next release is 0.14.1, I'll change the jar version to 0.14.1.
And if the document has any other problems, I can fix those too! 😂

@openinx
Member

openinx commented Sep 5, 2022

Sorry for the late response, @wang-x-xia. I remember you pinged me in another IM tool, but I lost track of this document, so sorry. Let me take a look and get this merged if it's ready.

@@ -0,0 +1,132 @@
---
Member

Is this document path still valid? I saw that the current docs directory doesn't have any integrations now. Would you mind updating this PR?

Contributor Author

The document has been moved and the header fixed~


Since 0.14.0, Iceberg can be used with Dell's Enterprise Object Storage (ECS) via the ECS catalog.

Dell ECS has many features that make Iceberg a highly compatible table format, such as append operations for file writers and content addressable storage (CAS) for table commits. See [Dell ECS](https://www.dell.com/en-us/dt/storage/ecs/index.htm) for more information on Dell ECS.
Member

Would you mind attaching the ECS CAS link for more detailed information?

Contributor Author

This part is already removed in the latest version.

To use the Dell ECS catalog with Flink, you must first create a Flink environment.

```python
import requests
Member

Is it good enough to provide a Python script to download the Flink jars? The script needs the pyflink and requests dependencies, and we don't provide any details about how to run it. I'm afraid most people won't be able to run this Python script successfully on their hosts.

I think it's simple enough to wget the Flink jars in a two-line bash script and show the subsequent SQL execution using Flink SQL.

Contributor Author

Now the scripts are bash scripts.


Some known issues of catalog usage:

1. The [pyspark.sql.SparkSession.catalog](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.SparkSession.catalog.html#pyspark.sql.SparkSession.catalog) won't access Spark's third-party catalogs, so please use DDL SQL to list all tables and namespaces.
Member

The link is 404 now.

Contributor Author

The link is removed.


For example, to use the Dell ECS catalog with Spark 3.2.1, you should create a Spark session like:

```python
Member

Let's just use a Java/Scala or bash example to keep consistency with the other pages.

Contributor Author

Now the scripts are bash scripts.

docs/dell.md Outdated

Some known issues of catalog usage:

1. The `pyspark.sql.SparkSession.catalog` won't access Spark's third-party catalogs, so please use DDL SQL to list all tables and namespaces.
Member

Does only `pyspark.sql.SparkSession.catalog` have this problem? How about Java & Scala?

Contributor Author

All languages are the same. I have fixed this statement.

```

Then, you can run `USE CATALOG my_catalog`, `SHOW DATABASES`, and `SHOW TABLES` to fetch the namespaces and tables of the catalog.

Member

Open question: should we also add a Hive section? I guess many people still use Hive on Tez to process their data. It's okay if you think we don't need one. Thanks.

Contributor Author

It's a good idea. Relevant tests are being prepared, and I'd like to merge this document first.

Member

@openinx openinx left a comment


LGTM, thanks @wang-x-xia for the contribution !

@openinx openinx merged commit 9a80238 into apache:master Sep 7, 2022