-
Notifications
You must be signed in to change notification settings - Fork 319
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[#1135] improvement(docs): Add docs about tables advanced feature like partitioning #1203
Conversation
| `list` | Partition the table by a list value | Any | Any | `{"strategy":"list","fieldNames":[["dt"],["city"]]}` | `Transforms.list(new String[] {"dt", "city"})` | `PARTITION BY list(dt, city)` | | ||
| `range` | Partition the table by a range value | Any | Any | `{"strategy":"range","fieldName":["dt"]}` | `Transforms.range(20, "score")` | `PARTITION BY range(score)` | | ||
|
||
Except the strategies above, you can use other functions strategies to partition the table, for example, the strategy can be `{"strategy":"functionName","fieldName":["score"]}`. The `functionName` can be any function name that you can use in SQL, for example, `{"strategy":"toDate","fieldName":["score"]}` is equivalent to `PARTITION BY toDate(score)` in SQL. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should add this to the document.
All transforms must return null for a null input value.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not so sure about this point as it's only appliable to Iceberg currently, I need to check this for Hive
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, if we don't follow this, should we explain null input?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not so sure about this point as it's only appliable to Iceberg currently, I need to check this for
Hive
.
Can you help confirm this, @mchades?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this is not determined by Gravitino, but rather depends on the underlying catalog.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Expression
section should be an independent chapter, making it easier for Partitioning
, Bucket
, and sortOrder
to reference it.
The `score`, `dt`, and `city` appearing in the table below refer to the field names in a table. | ||
::: | ||
|
||
| Function strategy | Description | Source types | Result type | Json example | Java example | Equivalent SQL semantics | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, Gravitino do not care about the source type and result type, the type limitation depends on the underlying catalog
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@qqqttt123 What's your opinion?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For a transform, source type and result type are important. Gravitino may not care. But users will care about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about adding links to different catalog partitioning docs?
For the same type and partitioning strategy, it may be feasible in catalogA
but prohibited in catalogB
, as this is likely dependent on the catalog's implicit type conversion strategy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK for me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This may require a few more PRs to refine it, so I will add an issue about it later.
The first version I did used a separate chapter to describe it. It seems a bit isolated, so I merged it into this file. |
@jerryshao |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Needs some minor improvements. Also there are several double blank lines.
@jerryshao |
| Bucket strategy | Description | JSON | Java | | ||
|-----------------|-------------------------------------------------------------------------------------------------------------------------------|----------|------------------| | ||
| hash | Bucket table using hash. Gravitino will distribute table data into buckets based on the hash value of the key. | `hash` | `Strategy.HASH` | | ||
| range | Bucket table using range. Gravitino will distribute table data into buckets based on a specified range or interval of values. | `range` | `Strategy.RANGE` | | ||
| even | Bucket table using even. Gravitino will distribute table data, ensuring an equal distribution of data. | `even` | `Strategy.EVEN` | | ||
|
||
- Number. It defines how many buckets you use to bucket the table. | ||
- Function arguments. It defines the arguments of the strategy above, Gravitino supports the following three kinds of arguments, for more, you can refer to Java class [FunctionArg](https://github.com/datastrato/gravitino/blob/main/common/src/main/java/com/datastrato/gravitino/dto/rel/expressions/FunctionArg.java) and [DistributionDTO](https://github.com/datastrato/gravitino/blob/main/common/src/main/java/com/datastrato/gravitino/dto/rel/DistributionDTO.java) to use more complex function arguments. | ||
|
||
| Expression type | JSON example | Java example | Equivalent SQL semantics | Description | | ||
|-----------------|----------------------------------------------------------------|-------------------------------------------------------------------------------------------|--------------------------|-----------------------------------| | ||
| field | `{"type":"field","fieldName":["score"]}` | `FieldReferenceDTO.of("score")` | `score` | The field reference value `score` | | ||
| function | `{"type":"function","functionName":"hour","fieldName":["dt"]}` | `new FuncExpressionDTO.Builder().withFunctionName("hour").withFunctionArgs("dt").build()` | `hour(dt)` | The function value `hour(dt)` | | ||
| constant | `{"type":"literal","value":10, "dataType": "integer"}` | `new LiteralDTO.Builder().withValue("10").withDataType(Types.IntegerType.get()).build()` | `10` | The integer literal `10` | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The content here has no any introduction about bucketing, directly introducing Strategy, Number... is very hard for user to understand what is it. You should have a basic introduction about the bucketing, and the required fields of bucketing.
Also not only for bucketing, but also for partitioning and sort ordering.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This part is in the manage-metadata-using-gravitino.md
and I have updated it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean you'd better have a paragraph introducing all the required fields to combine partitioning, bucketing and sort ordering. For a user who doesn't have any background of these things, directly introducing Strategy
, Number
and others seems not so easy to understand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added.
@jerryshao |
<Tabs> | ||
<TabItem value="shell" label="Shell"> | ||
|
||
```shell |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some of the doc I see using bash
, I think you'd better unifying them all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I remember we use bash
thorough the doc manmage-metadata-using-gravitino.md
, maybe I should change them all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We use shell
in other places, so better to change to shell
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using bash
or shell
may affect the code highlighting, although I am not entirely certain. I suggest that you take a look at the actual effect on the website.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have tested locally and it seems fine to use shell
.
2. Change the language of some code blocks from `bash` to `shell`
new DistributionDTO.Builder() | ||
.withStrategy(Strategy.HASH) | ||
.withNumber(4) | ||
.withArgs(FieldReferenceDTO.of("id")) | ||
.build(), | ||
// SORTED BY name asc | ||
new SortOrderDTO[] { | ||
SortOrders.of(FieldReferenceDTO.of("score"), SortDirection.ASCENDING, NullOrdering.NULLS_LAST) | ||
}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should not use xxxDTO, instead we should use some methods in APIs, right? @mchades
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, there are several static methods in APIs, xxxDTO
is simply a representation of the intermediate result of JSON deserialization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use com.datastrato.gravitino.rel.expressions.NamedReference#field(java.lang.String)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please fix it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please fix it here.
done
Please fix it here.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- "Column" should have an implementation used by client, not DTO
Tracked by #1292
@mchades would you please help to review again? |
…e partitioning (#1203) ### What changes were proposed in this pull request? Add docs about the details of table partitioning, bucketing, and sorting order. ### Why are the changes needed? The document is mandatory for users. Fix: #1135 ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? N/A --------- Co-authored-by: Jerry Shao <jerryshao@datastrato.com>
What changes were proposed in this pull request?
Add docs about the details of table partitioning, bucketing, and sorting order.
Why are the changes needed?
The document is mandatory for users.
Fix: #1135
Does this PR introduce any user-facing change?
N/A
How was this patch tested?
N/A