[#1135] improvement(docs): Add docs about tables advanced feature like partitioning #1203

yuqi1129 · 2023-12-19T08:17:08Z

What changes were proposed in this pull request?

Add docs about the details of table partitioning, bucketing, and sorting order.

Why are the changes needed?

The document is mandatory for users.

Fix: #1135

Does this PR introduce any user-facing change?

N/A

How was this patch tested?

N/A

docs/advanced-table-feature.md

docs/manage-metadata-using-gravitino.md

qqqttt123 · 2023-12-21T03:31:18Z

docs/manage-metadata-using-gravitino.md

+| `list`            | Partition the table by a list value                          | Any                                                                                 | Any         | `{"strategy":"list","fieldNames":[["dt"],["city"]]}`             | `Transforms.list(new String[] {"dt", "city"})`  | `PARTITION BY list(dt, city)`      |
+| `range`           | Partition the table by a range value                         | Any                                                                                 | Any         | `{"strategy":"range","fieldName":["dt"]}`                        | `Transforms.range(20, "score")`                 | `PARTITION BY range(score)`        |
+
+Except the strategies above, you can use other functions strategies to partition the table, for example, the strategy can be `{"strategy":"functionName","fieldName":["score"]}`. The `functionName` can be any function name that you can use in SQL, for example, `{"strategy":"toDate","fieldName":["score"]}` is equivalent to `PARTITION BY toDate(score)` in SQL.


You should add this to the document.

All transforms must return null for a null input value.

I'm not so sure about this point, as it currently applies only to Iceberg. I need to check this for Hive.

OK, if we don't follow this, should we explain null input?

I'm not so sure about this point, as it currently applies only to Iceberg. I need to check this for Hive.

Can you help confirm this, @mchades?

I believe this is not determined by Gravitino, but rather depends on the underlying catalog.
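To make the null-input question concrete, here is a minimal sketch of the Iceberg-style rule quoted above ("all transforms must return null for a null input value"). `NullSafeTransformSketch` and `nullSafe` are hypothetical names used only for illustration; they are not part of Gravitino or any catalog API, and other catalogs such as Hive may handle null differently.

```java
import java.time.LocalDateTime;
import java.util.function.Function;

public class NullSafeTransformSketch {
  // Hypothetical helper (not a Gravitino API): wraps a transform so that
  // a null input produces a null output instead of being passed through.
  static <T, R> Function<T, R> nullSafe(Function<T, R> transform) {
    return value -> value == null ? null : transform.apply(value);
  }

  public static void main(String[] args) {
    // An hour(dt)-like transform, used here only as an example.
    Function<LocalDateTime, Integer> hour =
        nullSafe((LocalDateTime dt) -> dt.getHour());

    System.out.println(hour.apply(LocalDateTime.of(2023, 12, 21, 3, 31)));
    System.out.println(hour.apply(null));
  }
}
```

If the behavior really is decided by the underlying catalog, a wrapper like this would live on the catalog side rather than in Gravitino itself.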

docs/manage-metadata-using-gravitino.md

mchades

The Expression section should be an independent chapter, making it easier for Partitioning, Bucket, and sortOrder to reference it.

docs/manage-metadata-using-gravitino.md

mchades · 2023-12-25T02:38:18Z

docs/manage-metadata-using-gravitino.md

+The `score`, `dt`, and `city` appearing in the table below refer to the field names in a table.
+:::
+
+| Function strategy | Description                                                  | Source types                                                                        | Result type | Json example                                                     | Java example                                    | Equivalent SQL semantics           |


Actually, Gravitino does not care about the source type and result type; the type limitation depends on the underlying catalog.

@qqqttt123 What's your opinion?

For a transform, source type and result type are important. Gravitino may not care. But users will care about it.

How about adding links to different catalog partitioning docs?

For the same type and partitioning strategy, it may be feasible in catalogA but prohibited in catalogB, as this is likely dependent on the catalog's implicit type conversion strategy.

This may require a few more PRs to refine it, so I will add an issue about it later.

docs/manage-metadata-using-gravitino.md

yuqi1129 · 2023-12-25T04:04:24Z

The Expression section should be an independent chapter, making it easier for Partitioning, Bucket, and sortOrder to reference it.

The first version I did used a separate chapter to describe it. It seems a bit isolated, so I merged it into this file.

docs/table-partitioning-bucketing-sort-order.md

yuqi1129 · 2023-12-26T02:06:14Z

@jerryshao
Can you spare some time to review the PR?

justinmclean

Needs some minor improvements. Also there are several double blank lines.

docs/manage-metadata-using-gravitino.md

docs/table-partitioning-bucketing-sort-order.md

yuqi1129 · 2023-12-27T13:32:56Z

@jerryshao Can you spare some time to review the PR?

@jerryshao
Please take time to review it, thanks.

docs/table-partitioning-bucketing-sort-order.md

jerryshao · 2023-12-28T11:47:12Z

docs/table-partitioning-bucketing-sort-order.md

+| Bucket strategy | Description                                                                                                                   | JSON     | Java             |
+|-----------------|-------------------------------------------------------------------------------------------------------------------------------|----------|------------------|
+| hash            | Bucket table using hash. Gravitino will distribute table data into buckets based on the hash value of the key.                | `hash`   | `Strategy.HASH`  |
+| range           | Bucket table using range. Gravitino will distribute table data into buckets based on a specified range or interval of values. | `range`  | `Strategy.RANGE` |
+| even            | Bucket table using even.  Gravitino will distribute table data, ensuring an equal distribution of data.                       | `even`   | `Strategy.EVEN`  |
+
+- Number. It defines how many buckets you use to bucket the table.
+- Function arguments. It defines the arguments of the strategy above, Gravitino supports the following three kinds of arguments, for more, you can refer to Java class [FunctionArg](https://github.com/datastrato/gravitino/blob/main/common/src/main/java/com/datastrato/gravitino/dto/rel/expressions/FunctionArg.java) and [DistributionDTO](https://github.com/datastrato/gravitino/blob/main/common/src/main/java/com/datastrato/gravitino/dto/rel/DistributionDTO.java) to use more complex function arguments.
+
+| Expression type | JSON example                                                   | Java example                                                                              | Equivalent SQL semantics | Description                       | 
+|-----------------|----------------------------------------------------------------|-------------------------------------------------------------------------------------------|--------------------------|-----------------------------------|
+| field           | `{"type":"field","fieldName":["score"]}`                       | `FieldReferenceDTO.of("score")`                                                           | `score`                  | The field reference value `score` |
+| function        | `{"type":"function","functionName":"hour","fieldName":["dt"]}` | `new FuncExpressionDTO.Builder().withFunctionName("hour").withFunctionArgs("dt").build()` | `hour(dt)`               | The function value `hour(dt)`     |
+| constant        | `{"type":"literal","value":10, "dataType": "integer"}`         | `new LiteralDTO.Builder().withValue("10").withDataType(Types.IntegerType.get()).build()`  | `10`                     | The integer literal `10`          |


The content here has no introduction to bucketing. Directly introducing Strategy, Number... makes it very hard for users to understand what they are. You should add a basic introduction to bucketing and its required fields.

This applies not only to bucketing, but also to partitioning and sort ordering.

This part is in the manage-metadata-using-gravitino.md and I have updated it.

I mean you'd better have a paragraph introducing all the required fields to combine partitioning, bucketing, and sort ordering. For a user who doesn't have any background in these things, directly introducing Strategy, Number, and others is not so easy to understand.
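As an illustration of what such an introduction could convey, here is a rough sketch of how the strategy (`hash`), the number of buckets, and the bucketing field fit together. `HashBucketSketch` and `bucketFor` are hypothetical names, and the use of Java's `hashCode` is an assumption made for the example; the actual hash function is chosen by the underlying catalog, not by this code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

public class HashBucketSketch {
  // Hypothetical helper: maps a key to a bucket index in [0, numBuckets).
  // floorMod keeps the index non-negative even for negative hash codes.
  static int bucketFor(Object key, int numBuckets) {
    if (numBuckets <= 0) {
      throw new IllegalArgumentException("numBuckets must be positive");
    }
    return Math.floorMod(Objects.hashCode(key), numBuckets);
  }

  public static void main(String[] args) {
    int numBuckets = 4; // the "Number" field of the distribution

    // One list of row ids per bucket.
    List<List<Integer>> buckets = new ArrayList<>();
    for (int i = 0; i < numBuckets; i++) {
      buckets.add(new ArrayList<>());
    }

    // Bucket rows by the "id" field, which plays the role of the function argument.
    for (int id = 1; id <= 10; id++) {
      buckets.get(bucketFor(id, numBuckets)).add(id);
    }

    for (int i = 0; i < numBuckets; i++) {
      System.out.println("bucket " + i + ": " + buckets.get(i));
    }
  }
}
```

The point of the sketch is only that rows with the same key always land in the same bucket, and that the bucket count is fixed up front.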

docs/table-partitioning-bucketing-sort-order.md

yuqi1129 · 2024-01-02T03:26:20Z

@jerryshao
Please take a look if the modification is reasonable, thanks.

jerryshao · 2024-01-02T03:55:41Z

docs/table-partitioning-bucketing-sort-order.md

+<Tabs>
+<TabItem value="shell" label="Shell">
+
+```shell


I see some of the docs use bash; I think you'd better unify them all.

I remember we use bash throughout the doc manage-metadata-using-gravitino.md; maybe I should change them all.

We use shell in other places, so better to change to shell.

Using bash or shell may affect the code highlighting, although I am not entirely certain. I suggest that you take a look at the actual effect on the website.

I have tested locally and it seems fine to use shell.

2. Change the language of some code blocks from `bash` to `shell`

jerryshao · 2024-01-02T07:06:23Z

docs/table-partitioning-bucketing-sort-order.md

+    new DistributionDTO.Builder()
+      .withStrategy(Strategy.HASH)
+      .withNumber(4)
+      .withArgs(FieldReferenceDTO.of("id"))
+      .build(),
+    // SORTED BY score asc
+    new SortOrderDTO[] {
+      SortOrders.of(FieldReferenceDTO.of("score"), SortDirection.ASCENDING, NullOrdering.NULLS_LAST)
+    });


I think we should not use xxxDTO, instead we should use some methods in APIs, right? @mchades

Yes, there are several static methods in APIs, xxxDTO is simply a representation of the intermediate result of JSON deserialization.

The method isn't yet available in the APIs.

So, do I need to add it to this PR or use another PR later?

use com.datastrato.gravitino.rel.expressions.NamedReference#field(java.lang.String)

Please fix it here.

There are two follow-up items:

"Column" should have an implementation used by client, not DTO, @mchades can you please work on this?

Make sure all the IT code uses APIs, not DTOs, for partitioning/sorting/distribution. @yuqi1129 can you please work on this?

Please fix it here.

done

Please fix it here.

done

"Column" should have an implementation used by client, not DTO

Tracked by #1292

jerryshao · 2024-01-02T14:46:30Z

@mchades would you please help to review again?

…e partitioning (#1203) ### What changes were proposed in this pull request? Add docs about the details of table partitioning, bucketing, and sorting order. ### Why are the changes needed? The document is mandatory for users. Fix: #1135 ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? N/A --------- Co-authored-by: Jerry Shao <jerryshao@datastrato.com>

Add docs about tables advanced feature like partitioning

3049470