[ARCTIC-761] update README.md for v0.4 and optimize setup docs #865

Merged: 5 commits, Dec 6, 2022
38 changes: 19 additions & 19 deletions README.md
@@ -1,28 +1,27 @@
![logo](site/docs/ch/images/arctic_logo_for_git.png)

Welcome to Arctic. Arctic is a streaming lakehouse system open sourced by NetEase.
Arctic adds more real-time capabilities on top of Iceberg and Hive, and provides stream-batch unified, out-of-the-box metadata services for DataOps,
making data lakes much more usable and practical.
Arctic is a LakeHouse management system built on open architecture: on top of open data lake formats, it provides additional optimizations for streaming and upsert scenarios, as well as a set of pluggable self-optimizing mechanisms and management services. Arctic helps various data platforms, tools, and products quickly build out-of-the-box, stream-and-batch-unified LakeHouses.

## What is arctic
Arctic is a streaming lakehouse service built on top of the Apache Iceberg table format.
Through Arctic, users benefit from optimized CDC, streaming updates, fresh OLAP, and more on engines like Flink, Spark, and Trino.
Combined with the efficient offline processing capabilities of data lakes, Arctic serves scenarios where streaming and batch workloads are fused.
At the same time, self-optimizing, concurrent conflict resolution, and standard management tools effectively reduce the burden of data lake management and optimization on users.

Currently, Arctic is a LakeHouse management system on top of the Iceberg format. Benefiting from the thriving ecosystem of Apache Iceberg, Arctic can be used on all kinds of data lakes, on premises or in the cloud, with a variety of engines. Several concepts should be understood before going deeper:

![Introduce](site/docs/ch/images/introduce_arctic.png)

Arctic services are presented by deploying AMS, which can be considered a replacement for HMS (Hive Metastore), or an HMS for Iceberg.
Arctic uses Iceberg as the base table format, but instead of hacking the Iceberg implementation, it uses Iceberg as a library.
Arctic's open overlay architecture helps large-scale offline data lakes upgrade quickly to real-time data lakes, without worrying about compatibility issues with the original data lakes,
enabling data lakes to serve real-time analysis, real-time risk control, real-time training, feature engineering, and other scenarios.
- AMS and optimizers - Arctic Management Service provides management features, including self-optimizing mechanisms running on optimizers, which can be scaled on demand and scheduled on different platforms.
- Multiple formats — Arctic uses formats the way MySQL or ClickHouse use storage engines to serve different scenarios. Two formats have been available since Arctic v0.4:
    * Iceberg format — learn more about Iceberg format details and usage with different engines: [Iceberg Docs](https://iceberg.apache.org/docs/latest/)
    * Mixed streaming format - if you are interested in advanced features like auto-bucket, LogStore, Hive compatibility, strict PK constraints, etc., learn about Arctic's [Mixed Iceberg format](https://arctic.netease.com/ch/concetps/table-formats/#mixed-iceberg-format) and [Mixed Hive format](https://arctic.netease.com/ch/concetps/table-formats/#mixed-hive-format)

## Arctic features

* Efficient streaming updates based on primary keys
* Automatic data bucketing and self-optimizing for performance and efficiency
* Encapsulates the data lake and message queue into a unified table to achieve lower-latency computing
* Provides standardized metrics, a dashboard, and related management tools for the streaming lakehouse
* Supports Spark and Flink for reading and writing data, and Trino for querying data
* 100% compatible with the Iceberg / Hive table format and syntax
* Provides transactional guarantees for concurrent streaming and batch writes
- Defining keys - supports defining primary keys with strict constraints, with more key types to come in the future
- Self-optimizing - asynchronous self-optimizing mechanisms, transparent to users, keep the lakehouse fresh and healthy
- Management features - a dashboard UI supporting catalog/table management, a SQL terminal, and all kinds of metrics
- Format compatibility - Hive/Iceberg format compatibility means writing and reading through the native Hive/Iceberg connectors
- Better data pipeline SLA - using a LogStore such as Kafka to accelerate streaming data pipelines to millisecond/second latency
- Better OLAP performance - provides an auto-bucket feature for better compaction and merge-on-read performance
- Concurrent conflict resolution - Flink and Spark can write data concurrently without worrying about conflicts

## Modules

@@ -60,6 +59,7 @@ Arctic is built using Maven with Java 1.8 and Java 11 (only for the `trino` module).
```
* To invoke a build and run tests: `mvn package -P toolchain`
* To skip tests: `mvn -DskipTests package -P toolchain`
* To package without the `trino` module and the Java 11 dependency: `mvn clean package -DskipTests -pl '!trino'`
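
For reference, the individual commands above can be combined into a build session like the sketch below. The JDK path is an assumption and should be adjusted to your environment; the `toolchain` profile still expects JDK 8 and JDK 11 entries in `~/.m2/toolchains.xml`.

```shell
# Sketch: full build with the toolchain profile (assumes JDK 8 and JDK 11 are
# configured in ~/.m2/toolchains.xml), followed by a faster build that skips
# tests and the Java 11-only trino module. The JAVA_HOME path is an assumption.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk
mvn package -P toolchain                    # build all modules and run tests
mvn clean package -DskipTests -pl '!trino'  # skip tests and the trino module
```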

## Engines supported

@@ -73,7 +73,7 @@ Arctic supports multiple processing engines as below:

## Quickstart

Visit [https://arctic.netease.com/ch/quickstart/quick-demo/](https://arctic.netease.com/ch/quickstart/quick-demo/) to quickly explore what arctic can do.
Visit [https://arctic.netease.com/ch/docker-quickstart/](https://arctic.netease.com/ch/docker-quickstart/) to quickly explore what arctic can do.

## Join Community
If you are interested in lakehouses or data lake formats, welcome to join our community. We welcome organizations, teams, and individuals to grow together, and we sincerely hope to help users make better use of data lake formats through open source.
30 changes: 14 additions & 16 deletions site/config/ch/mkdocs.yml
@@ -52,21 +52,19 @@ nav:
- Overview: index.md
- Concepts:
- Catalogs: concepts/catalogs.md
- Table formats: concepts/table-formats.md
- Table Formats: concepts/table-formats.md
- Self-optimizing: concepts/self-optimizing.md
- Table watermark: concepts/table-watermark.md
- Table Watermark: concepts/table-watermark.md
- Quick Start:
- Setup:
- Setup from docker: quickstart/setup/setup-from-docker.md
- Setup from binary release: quickstart/setup/setup-from-binary-release.md
- Quick demo: quickstart/quick-demo.md
- CDC ingestion: quickstart/cdc-ingestion.md
- Admin guide:
- Setup: quickstart/setup.md
- Quick Demo: quickstart/quick-demo.md
- CDC Ingestion: quickstart/cdc-ingestion.md
- Admin Guides:
- Deployment: guides/deployment.md
- Managing catalogs: guides/managing-catalogs.md
- Managing tables: guides/managing-tables.md
- Managing optimizers: guides/managing-optimizers.md
- Using Kyuubi by terminal: guides/using-kyuubi.md
- Managing Catalogs: guides/managing-catalogs.md
- Managing Tables: guides/managing-tables.md
- Managing Optimizers: guides/managing-optimizers.md
- Using Kyuubi By Terminal: guides/using-kyuubi.md
- Metrics: guides/metrics.md
- Configurations: configurations.md
- Flink integration:
@@ -77,17 +75,17 @@ nav:
- Flink DataStream: flink/flink-ds.md
- Using Kafka as Logstore: flink/hidden-kafka.md
- Flink CDC to Arctic: flink/flink-cdc-to-arctic.md
- Spark integration:
- Spark Integration:
- Getting Started: spark/spark-get-started.md
- Spark DDL: spark/spark-ddl.md
- Spark DML: spark/spark-dml.md
- Spark DataFrame: spark/spark-dataframe.md
- MPP integration:
- MPP Integrations:
- Trino: mpp/trino.md
- Impala: mpp/impala.md
- Benchmark:
- Report: benchmark/benchmark.md
- How to benchmark: benchmark/benchmark-step.md
- Benchmark Report: benchmark/benchmark.md
- How To Benchmark: benchmark/benchmark-step.md
- Roadmap: roadmap.md
- Contributing: contribute.md

21 changes: 15 additions & 6 deletions site/docs/ch/concepts/self-optimizing.md
@@ -40,18 +40,27 @@ self-optimizing.target-size defines the target output size of major optimizing,

![Minor optimizing](../images/concepts/minor_optimizing.png){:height="80%" width="80%"}

The goal of minor optimizing is to merge fragment files into segment files as quickly as possible to mitigate read amplification, so minor optimizing is scheduled fairly frequently once a certain amount of fragment files has accumulated.
Major optimizing processes segment and fragment files together and deduplicates part or all of the data by primary key in the process. For tables with primary keys, major optimizing usually improves read performance more noticeably than minor optimizing, and its lower scheduling frequency effectively mitigates write amplification. Full optimizing merges all files in the target space into one file and is a special case of major optimizing:
The goal of minor optimizing is to mitigate the read amplification problem, which involves two tasks:

* Merging fragment files into segment files as quickly as possible; when small files accumulate, minor optimizing executes fairly frequently
* Converting write-friendly (WriteStore) file formats into read-friendly (ReadStore) file formats: for the Mixed format this is the conversion from ChangeStore to BaseStore, and for the Iceberg format it is the conversion from eq-delete files to pos-delete files

After minor optimizing has executed many times, a table space will contain quite a few segment files. Although the read efficiency of segment files can meet performance requirements in many cases:

* A considerable total amount of delete data may have accumulated on the individual segment files
* There may be a lot of data duplicated on the primary key across segments

At this point, what affects read performance is no longer the read amplification caused by small files and file formats, but rather the excessive garbage data that has to be merged during merge-on-read. Arctic therefore introduces major optimizing here to clean up garbage data by merging segment files, keeping the amount of garbage data at a read-friendly ratio. In general, since minor optimizing has already performed multiple rounds of deduplication, major optimizing is not scheduled frequently, which avoids the write amplification problem. In addition, full optimizing merges all files in the target space into one file and is a special case of major optimizing:

![Major optimizing](../images/concepts/major_optimizing.png){:height="80%" width="80%"}

The input/output relationships of minor, major, and full optimizing are shown in the table below:
The design of major and minor optimizing draws on the generational design of garbage collection algorithms. The execution logic of the two kinds of optimizing is the same: both perform file merging, data deduplication, and conversion from the WriteStore format to the ReadStore format. The input/output relationships of minor, major, and full optimizing are shown in the table below:

| Self-optimizing type | Input space | Output space | Input file types | Output file types |
|:----------|:----------|:----------|:----------|:----------|
| minor | fragment | fragment/segment | insert, delete | insert, delete |
| major | fragment, segment | segment | insert, delete | insert, delete |
| full | fragment, segment | segment | insert, delete | insert |
| minor | fragment | fragment/segment | insert, eq-delete, pos-delete | insert, pos-delete |
| major | fragment, segment | segment | insert, eq-delete, pos-delete | insert, pos-delete |
| full | fragment, segment | segment | insert, eq-delete, pos-delete | insert |
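
As a rough illustration of how self-optimizing is tuned per table, the sketch below sets the `self-optimizing.target-size` property mentioned at the top of this section via Spark SQL; the `spark-sql` invocation, the catalog/database/table names, and the byte value are assumptions for illustration only.

```shell
# Hypothetical sketch: raise the target output size of major optimizing to 128 MB
# for a single table. The catalog/database/table names are placeholders and the
# value (in bytes) is only an example; see the Configurations page for defaults.
spark-sql -e "ALTER TABLE local_catalog.db.sample SET TBLPROPERTIES ('self-optimizing.target-size'='134217728')"
```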


## Self-optimizing quota
Binary file modified site/docs/ch/images/concepts/major_optimizing.png
Binary file modified site/docs/ch/images/concepts/minor_optimizing.png
Binary file modified site/docs/ch/images/introduce_arctic.png
9 changes: 2 additions & 7 deletions site/docs/ch/quickstart/cdc-ingestion.md
@@ -1,10 +1,5 @@

This document walks you through the process of synchronizing MySQL data changes to Arctic via Flink CDC. This part of the documentation uses the TPCC dataset provided by the
[lakehouse-benchmark](https://github.com/NetEase/lakehouse-benchmark) project
to simulate reads and writes against MySQL under a real business workload, and uses the
[lakehouse-benchmark-ingestion](https://github.com/NetEase/lakehouse-benchmark-ingestion) tool to synchronize data from MySQL to Arctic through the binlog.
You need to deploy the cluster in advance via [Setup from docker](./setup/setup-from-docker.md),
and complete the parts of [Quick Demo](./quick-demo.md) that create a catalog and start an optimizer.
This document demonstrates the process of synchronizing MySQL data changes to Arctic via Flink CDC. It uses the TPCC dataset provided by the [lakehouse-benchmark](https://github.com/NetEase/lakehouse-benchmark) project to simulate reads and writes against MySQL under a real business workload, and uses the [lakehouse-benchmark-ingestion](https://github.com/NetEase/lakehouse-benchmark-ingestion) tool to synchronize data from MySQL to Arctic through the binlog. You need to deploy the cluster in advance via [Setup from Docker-Compose](./setup.md#setup-from-docker-compose), and complete the parts of [Quick demo](./quick-demo.md) that create a catalog and start an optimizer.


### Step 1. initialize tables
Expand Down Expand Up @@ -53,7 +48,7 @@ docker exec -it lakehouse-benchmark java \
This command keeps executing OLTP operations against the test database until the program exits.
You can then return to the Terminal page of AMS and use Spark SQL to see that the data changes on MySQL are continuously synchronized to the Arctic table by the ingestion job.

> The checkpoint interval of the ingestion job is 60s, so data changes in the Arctic data lake lag behind MySQL by 60s.
???+note "The checkpoint interval of the ingestion job is 60s, so data changes in the Arctic data lake lag behind MySQL by 60s."


### Step 4. check table result