
Dell: Add document. #4993

Merged
14 commits merged into apache:master
Sep 7, 2022

Conversation

wang-x-xia
Contributor

The PR has 3 parts:

  1. Add a document of the ECS how-to.
  2. Fix bugs found when testing the catalog in production.
  3. Add the `:iceberg:dell` module to the runtime package.

@rdblue
Contributor

rdblue commented Jun 12, 2022

@samredai, can you take a look?


This section will show you how to use Iceberg with Dell ECS. Dell ECS provides several features that are well suited to Iceberg:

1. Append operation for file writer.
Member

Would you mind providing more description or a Dell doc link for those two features? I think we need to let others know more about the differences between Dell ECS and other vendors' object storage services.

For example, to use the Dell ECS catalog with Flink 1.14, you should create a Flink environment like:

```sh
wget https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-flink-runtime-1.14/0.14.0/iceberg-flink-runtime-1.14-0.14.0.jar
```
Member

Since we still haven't released Iceberg 0.14.0, we can't verify and try out this Dell ECS integration. Is it possible to wget the latest snapshot artifact from the Maven repo?

Contributor Author

Because the dependencies are also added in this commit, I don't know which version I should use.

'ecs.s3.secret-access-key' = 'xxxxxxxxxxxxxxxx')
```

Then, run `USE CATALOG my_catalog`, `SHOW DATABASES`, and `SHOW TABLES` to fetch the namespaces and tables of the catalog.
Member

Is it possible to provide a section for HiveCatalog + Dell ECS? In my experience, many people just want to replace HDFS storage with an object storage service while still maintaining their tables in HiveCatalog.


### Limitations

When you use the catalog with Dell ECS only, you should be aware of these limitations:
Member

Is the Dell ECS catalog the catalog that you at Dell ECS recommend? In real cases, I see people prefer HiveCatalog or another centralized metastore to collect all their tables so that they can track table changes in a global view.

Contributor Author

Yes. We wish to provide an all-in-one product.

For example, to use the Dell ECS catalog with Spark 3.2.1, you should create a Spark session like:

```sh
spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.14.0,com.emc.ecs:object-client-bundle:3.3.2 \
Contributor

The different mechanisms for setting the Spark configuration feel like material more suited for a quickstart. Since the properties are listed in the Connection parameters section above, here I think it just needs to describe the prefix added to set these in Spark (`ecs.s3.endpoint` is set as `spark.sql.catalog.<catalog-name>.ecs.s3.endpoint`).

In a quickstart, we can include this in a tabbed box with spark-sql, spark-shell, and pyspark examples to give something easy to copy and paste.
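The prefixing described above can be sketched with a small, hypothetical helper (the function name and the example endpoint are made up for illustration; the `spark.sql.catalog.<catalog-name>.` prefix is Spark's standard convention for passing properties to a named catalog):

```python
def to_spark_conf(catalog_name, catalog_props):
    """Prefix plain catalog properties (e.g. ecs.s3.endpoint) with
    spark.sql.catalog.<catalog-name>. so Spark routes them to that catalog."""
    prefix = f"spark.sql.catalog.{catalog_name}."
    return {prefix + key: value for key, value in catalog_props.items()}

conf = to_spark_conf("my_catalog", {
    "ecs.s3.endpoint": "https://ecs.example.com:9021",  # hypothetical endpoint
    "ecs.s3.access-key-id": "xxx",
})
# conf now contains keys like spark.sql.catalog.my_catalog.ecs.s3.endpoint
```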

Contributor Author

I've removed the examples for spark-sql and the Flink sql-client.sh to avoid redundancy.


# Iceberg Dell Integrations

## Dell ECS Integrations
Contributor

This h2 title and the h1 title above it feel redundant. I would consolidate them into a single title such as `# Dell ECS` or `# Dell ECS Catalog`. Also, I think the word "Integration" refers more to Iceberg's incorporation into compute/query engines such as Spark or Trino. I could be wrong, but I think Dell ECS is more of an "Implementation".

Contributor Author

In previous PRs, the reviewers had me use dell as the module name.
I also used dell as the document title, with an h2 title for ECS. Should I change this document to ECS only?

I will also change "Integration" to "Implementation".

Contributor

On second thought, I think integration makes sense here. I was getting caught up in the fact that this is an implementation of the catalog interface, but you're right: integration is more consistent with how we've been using the term elsewhere and overall makes sense (this integrates Iceberg and ECS).

@wang-x-xia
Contributor Author

Thanks for @samredai's help. My English is not great, so if you have other suggestions, I'll apply them to make this document more readable.

Contributor

@samredai samredai left a comment


LGTM. I'm thinking more broadly about how we organize integrations with various tools, frameworks, and vendors to make things more intuitive for readers. I may end up doing some refactoring/relocating here, but that shouldn't hold up this PR.

import java.util.Map;

-public interface DellClientFactory {
+public interface DellClientFactory extends Serializable {
Contributor

Can you move code changes out of a documentation PR? These should be in a code PR for people that cherry-pick commits.

@@ -213,6 +213,7 @@ project(":iceberg-spark:iceberg-spark-runtime-${sparkMajorVersion}_${scalaVersio
exclude group: 'commons-logging', module: 'commons-logging'
}
implementation project(':iceberg-hive-metastore')
implementation project(':iceberg-dell')
Contributor

Adding this module to the Spark runtime should be discussed and done in a separate PR, not in a documentation PR. Can you bring this up on the dev list and summarize what it pulls in and changes?

You'll also need to update the LICENSE and NOTICE files.

Contributor

@rdblue rdblue left a comment


I think the documentation is probably close to being ready, but this should not mix code changes, runtime Jar additions, and documentation in a single PR.

@github-actions github-actions bot removed the DELL label Jun 16, 2022
@wang-x-xia
Contributor Author

@rdblue
The changes have been reverted.
I've created a new PR for the bugfix:
#5059
And a new PR for the runtime JAR:
#5060

@rdblue
Contributor

rdblue commented Jun 29, 2022

Since this PR is replaced by newer ones, I'm going to close it.

@rdblue rdblue closed this Jun 29, 2022
@wang-x-xia
Contributor Author

@rdblue
This PR is for documentation only. I moved all the other changes to new PRs, keeping just the document here.
Should I create a new PR for the document?

@rdblue rdblue reopened this Jun 30, 2022
@rdblue
Contributor

rdblue commented Jun 30, 2022

No, if this is only docs, then we can review it. Thanks, @wang-x-xia.

@wang-x-xia
Contributor Author

@rdblue If you have some comments about the document, please let me know. This PR needs your approval now.😂


Since 0.14.0, Iceberg can be used with Dell's Enterprise Object Storage (ECS) via the ECS catalog.

Dell ECS has many features that make Iceberg a highly compatible table format, such as append operations for file writers and content addressable storage (CAS) for table commits. See [Dell ECS](https://www.dell.com/en-us/dt/storage/ecs/index.htm) for more information on Dell ECS.
Contributor

I'm not sure what it means to be a "highly compatible table format". I think this should just have the second sentence, "See Dell ECS for more information."

Contributor Author

Removed!


### Connection parameters

When using Dell ECS with Iceberg, these configuration parameters are required:
Contributor

Where?

| Warehouse path | Description |
| -------------- | ----------- |
| ecs://bucket-a | Use the whole bucket as the data |
| ecs://bucket-a/namespace-a | Use a prefix to access the data only in this specific namespace |

When you provide the `warehouse`, any trailing `/` is ignored: `ecs://bucket-a` is the same as `ecs://bucket-a/`.
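The trailing-slash behavior can be sketched as follows (a minimal, hypothetical helper for illustration, not the actual Iceberg implementation):

```python
def normalize_warehouse(uri: str) -> str:
    """Drop a single trailing '/' so ecs://bucket-a and ecs://bucket-a/
    resolve to the same warehouse location."""
    return uri[:-1] if uri.endswith("/") else uri

# Both spellings resolve to the same location.
print(normalize_warehouse("ecs://bucket-a/"))  # ecs://bucket-a
print(normalize_warehouse("ecs://bucket-a"))   # ecs://bucket-a
```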
Contributor

Can you specifically state all of the required properties and where to set them? I think these are catalog properties.

Contributor Author

The following examples will show how to configure these properties.


### Runtime dependencies

The Iceberg `runtime` jar supports different versions of Spark and Flink. If your version does not match the example, please check the related Spark or Flink documentation.
Contributor

This shouldn't start a section. It's probably a good note (if your version is different, make sure you have the right runtime Jar!).

Contributor Author

Removed!

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .config("spark.jars.packages", "org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.14.0,com.emc.ecs:object-client-bundle:3.3.2")
Contributor

Isn't the iceberg-dell Jar required as well?

Contributor Author

Added.
When I wrote this document, I assumed all the PRs would be merged.
But no PR has been merged into 0.14.0, so I've also bumped the version to 0.15.0.
😂

When you use the catalog with Dell ECS only, you should be aware of these limitations:

1. `RENAME` statements are supported without additional protections. Before renaming a table, you must ensure that all commits to the original table have finished.
2. `RENAME` statements only rename the table without moving any data files. This can lead to a table's data being stored in a path outside of the configured warehouse path.
Contributor

I don't think that table rename is supported. Wouldn't that break all metadata pointers?

Contributor Author

It's just a rename at the catalog layer.
Only the entry object that stores the latest version pointer is renamed; its contents are not changed.
So the data controlled by Iceberg isn't touched, and no pointers are broken.
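The pointer-only rename described here can be sketched with a toy in-memory catalog (purely illustrative; the real ECS catalog stores each pointer as an object in the bucket, not as a dict entry):

```python
class ToyCatalog:
    """Maps table identifiers to the location of their latest metadata file."""

    def __init__(self):
        self.pointers = {}

    def create(self, name, metadata_location):
        self.pointers[name] = metadata_location

    def rename(self, old, new):
        # Only the catalog entry moves; the metadata location (and every
        # data/metadata file it references) is left untouched.
        self.pointers[new] = self.pointers.pop(old)

cat = ToyCatalog()
cat.create("db.t1", "ecs://bucket-a/db/t1/metadata/v3.metadata.json")
cat.rename("db.t1", "db.t2")
# db.t2 still points at the original t1 paths: nothing was copied or rewritten,
# which is also why renamed tables can live outside the warehouse path.
```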

2. Add the iceberg-dell jar.
3. Simplify the sentences of the Parameters part.
@wang-x-xia
Contributor Author

@rdblue
Could you please confirm whether the next release is 0.14.1 or 0.15?
If the next release is 0.14.1, I'll change the jar version to 0.14.1.
And if the document has any other problems, I can fix those too! 😂

@openinx
Member

openinx commented Sep 5, 2022

Sorry for the late response, @wang-x-xia. I remember you pinged me in another IM tool, but I lost track of this document, so sorry. Let me take a look and get this merged if it's ready.

@@ -0,0 +1,132 @@
---
Member

Is this document path still valid? I saw that the current docs directory doesn't have any integrations now. Would you mind updating this PR?

Contributor Author

The document has been moved and the header fixed~


Since 0.14.0, Iceberg can be used with Dell's Enterprise Object Storage (ECS) via the ECS catalog.

Dell ECS has many features that make Iceberg a highly compatible table format, such as append operations for file writers and content addressable storage (CAS) for table commits. See [Dell ECS](https://www.dell.com/en-us/dt/storage/ecs/index.htm) for more information on Dell ECS.
Member

Would you mind attaching the ECS CAS link for more detailed information?

Contributor Author

This part is already removed in the latest version.

To use the Dell ECS catalog with Flink, you must first create a Flink environment.

```python
import requests
Member

Is it good enough to provide a Python script to download the Flink jars? The script needs the pyflink and requests dependencies, and we don't provide any details about how to run it. I'm afraid most people won't be able to run this Python script successfully on their hosts.

I think it's simple enough to wget the Flink jars in a two-line bash script and show the subsequent SQL execution using Flink SQL.

Contributor Author

Now the scripts are bash scripts.


Some known issues of catalog usage:

1. The [pyspark.sql.SparkSession.catalog](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.SparkSession.catalog.html#pyspark.sql.SparkSession.catalog) won't access Spark's third-party catalogs, so please use DDL SQL to list all tables and namespaces.
Member

The link is 404 now.

Contributor Author

The link is removed.


For example, to use the Dell ECS catalog with Spark 3.2.1, you should create a Spark session like:

```python
Member

Let's just use a Java/Scala or bash example to keep consistency with the other pages.

Contributor Author

Now the scripts are bash scripts.

docs/dell.md Outdated

Some known issues of catalog usage:

1. The `pyspark.sql.SparkSession.catalog` won't access Spark's third-party catalogs, so please use DDL SQL to list all tables and namespaces.
Member

Does only `pyspark.sql.SparkSession.catalog` have this problem? How about Java & Scala?

Contributor Author

All languages are the same. I have fixed this statement.

```

Then, you can run `USE CATALOG my_catalog`, `SHOW DATABASES`, and `SHOW TABLES` to fetch the namespaces and tables of the catalog.

Member

Open question: should we also add a Hive section? I guess many people still use Hive on Tez to process their data. It's okay if you think we don't need one. Thanks.

Contributor Author

It's a good idea. Relevant tests are being prepared, and I'd like to merge this document first.

Member

@openinx openinx left a comment


LGTM, thanks @wang-x-xia for the contribution !

@openinx openinx merged commit 9a80238 into apache:master Sep 7, 2022