[Doc] update delta lake faq (#43143)
(cherry picked from commit a0e5859)
amber-create authored and mergify[bot] committed Mar 26, 2024
1 parent 97f1b17 commit e0b12f1
Showing 2 changed files with 33 additions and 5 deletions.
20 changes: 17 additions & 3 deletions docs/en/data_source/datalake_faq.md
@@ -6,11 +6,11 @@ displayed_sidebar: "English"

This topic describes some frequently asked questions (FAQ) about data lakes and provides solutions to these issues. Some metrics mentioned in this topic can be obtained only from the profiles of the SQL queries. To obtain the profiles of SQL queries, you must specify `set enable_profile=true`.

## Slow HDFS nodes
## Slow HDFS DataNodes

### Issue description

When you access the data files stored in your HDFS cluster, you may find a huge difference between the values of the `__MAX_OF_FSIOTime` and `__MIN_OF_FSIOTime` metrics from the profiles of the SQL queries you run, which indicates slow HDFS nodes. The following example is a typical profile that indicates an HDFS node slowdown issue:
When you access the data files stored in your HDFS cluster, you may find a huge difference between the values of the `__MAX_OF_FSIOTime` and `__MIN_OF_FSIOTime` metrics from the profiles of the SQL queries you run. This indicates that some DataNodes in the HDFS cluster are slow. The following example is a typical profile that indicates a slow HDFS DataNode issue:

```plaintext
- InputStream: 0
@@ -38,13 +38,27 @@ When you access the data files stored in your HDFS cluster, you may find a huge

You can use one of the following solutions to resolve this issue:

- **[Recommended]** Enable the [data cache](../data_source/data_cache.md) feature, which eliminates the impact of slow HDFS nodes on queries by automatically caching the data from external storage systems to the BEs of your StarRocks cluster.
- **[Recommended]** Enable the [data cache](../data_source/data_cache.md) feature, which eliminates the impact of slow HDFS DataNodes on queries by automatically caching the data from external storage systems to the BEs of your StarRocks cluster.
- **[Recommended]** Shorten the timeout duration between the HDFS client and DataNode. This solution is suitable when Data Cache cannot help resolve the slow HDFS DataNode issue.
- Enable the [Hedged Read](https://hadoop.apache.org/docs/r2.8.3/hadoop-project-dist/hadoop-common/release/2.4.0/RELEASENOTES.2.4.0.html) feature. With this feature enabled, if a read from a block is slow, StarRocks starts a new read that runs in parallel with the original read against a different replica of the block. Whichever of the two reads returns first, the other is cancelled. **The Hedged Read feature can help accelerate reads, but it also significantly increases heap memory consumption on Java virtual machines (JVMs). Therefore, if your physical machines provide a small memory capacity, we recommend that you do not enable the Hedged Read feature.**
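Before applying any of the solutions above, you can confirm the symptom programmatically by comparing the two metrics from the profile. A minimal sketch (the profile snippet and the 10x threshold below are illustrative assumptions, not StarRocks defaults):

```python
# Flag a suspicious spread between the slowest and fastest file-system I/O
# times in a query profile. The profile text format and the 10x cutoff are
# illustrative assumptions, not part of StarRocks.
import re

def fsio_spread(profile_text):
    """Return (max_time, min_time) parsed from the __MAX_OF_FSIOTime and
    __MIN_OF_FSIOTime lines of a profile."""
    def grab(name):
        m = re.search(rf"{name}:\s*([\d.]+)", profile_text)
        return float(m.group(1)) if m else None
    return grab("__MAX_OF_FSIOTime"), grab("__MIN_OF_FSIOTime")

# Hypothetical profile excerpt for illustration.
profile = """
  - __MAX_OF_FSIOTime: 1200.0
  - __MIN_OF_FSIOTime: 20.0
"""

max_t, min_t = fsio_spread(profile)
slow_datanode_suspected = min_t > 0 and max_t / min_t > 10  # 10x gap: arbitrary cutoff
print(slow_datanode_suspected)
```

A large max/min ratio alone does not prove a slow DataNode, but it is a cheap first check before enabling Data Cache or tuning timeouts.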

#### [Recommended] Data Cache

See [Data Cache](../data_source/data_cache.md).

#### [Recommended] Shorten timeout duration between HDFS client and DataNode

Configure the `dfs.client.socket-timeout` property in the `hdfs-site.xml` file to shorten the timeout duration between the HDFS client and DataNodes. (The default timeout is 60 seconds, which is relatively long.) This way, when StarRocks encounters a slow DataNode, its connection request times out quickly and is redirected to another DataNode. The following example sets a 5-second timeout:

```xml
<configuration>
<property>
<name>dfs.client.socket-timeout</name>
<value>5000</value>
</property>
</configuration>
```
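The effect of a shorter socket timeout can be pictured as a simple failover loop: the client gives each DataNode only a few seconds before moving on to the next replica. The sketch below is illustrative only; the function names and the simulated cluster are made up and are not the HDFS client API:

```python
# Illustrative failover loop: with a short per-node timeout, a slow DataNode
# costs at most `timeout_s` seconds before the client tries the next replica.
def read_with_failover(replicas, read_fn, timeout_s=5.0):
    """Try each replica in turn; `read_fn(node, timeout_s)` is expected to
    raise TimeoutError when the node does not respond within timeout_s."""
    last_err = None
    for node in replicas:
        try:
            return read_fn(node, timeout_s)
        except TimeoutError as err:
            last_err = err  # slow node: move on to the next replica
    raise last_err

# Simulated cluster: dn1 is a slow DataNode, dn2 answers normally.
def fake_read(node, timeout_s):
    if node == "dn1":
        raise TimeoutError(f"{node} exceeded {timeout_s}s")
    return f"block data from {node}"

print(read_with_failover(["dn1", "dn2"], fake_read))
```

With a 60-second default, the same loop would stall for a full minute on every slow replica, which is why shortening the timeout helps even without caching.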

#### Hedged Read

Use the following parameters (supported from v3.0 onwards) in the BE configuration file `be.conf` to enable and configure the Hedged Read feature in your HDFS cluster.
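The hedged-read behavior described above (a second read launched in parallel, first responder wins, loser cancelled) can be sketched with `concurrent.futures`. The replica names and delays below are invented for illustration and do not reflect StarRocks internals:

```python
# Minimal hedged read: issue reads against two replicas in parallel and take
# whichever answers first; the pending read is cancelled (best effort).
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED
import time

def hedged_read(read_fn, primary, backup):
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = {pool.submit(read_fn, replica) for replica in (primary, backup)}
        done, pending = wait(futures, return_when=FIRST_COMPLETED)
        for f in pending:
            f.cancel()  # best effort: an already-running read cannot be interrupted
        return next(iter(done)).result()

def fake_read(replica):
    # The primary replica is slow here, so the hedged read to the backup wins.
    delay = {"dn-slow": 0.5, "dn-fast": 0.01}[replica]
    time.sleep(delay)
    return f"data from {replica}"

print(hedged_read(fake_read, "dn-slow", "dn-fast"))
```

Note how the losing read's thread still runs to completion in this sketch; this mirrors why hedged reads trade extra resource consumption (threads, JVM heap in the real HDFS client) for lower tail latency.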
18 changes: 16 additions & 2 deletions docs/zh/data_source/datalake_faq.md
@@ -10,7 +10,7 @@ displayed_sidebar: "Chinese"

### Issue description

When you access data files stored on HDFS and find a large difference between the values of the `__MAX_OF_FSIOTime` and `__MIN_OF_FSIOTime` metrics in the profile of a SQL query, slow HDFS nodes exist in the current environment. The profile shown below is a typical slow HDFS node scenario:
When you access data files stored on HDFS and find a large difference between the values of the `__MAX_OF_FSIOTime` and `__MIN_OF_FSIOTime` metrics in the profile of a SQL query, some DataNodes in the HDFS cluster are slow. The profile shown below is a typical slow HDFS DataNode scenario:

```plaintext
- InputStream: 0
@@ -36,15 +36,29 @@

### Solutions

Currently there are two solutions:
Currently there are three solutions:

- **[Recommended]** Enable [Data Cache](../data_source/data_cache.md). By automatically caching remote data on BE nodes, it eliminates the impact of slow HDFS DataNodes on queries.
- **[Recommended]** Shorten the timeout duration between the HDFS client and DataNodes. This is suitable for scenarios where Data Cache does not take effect.
- Enable the [Hedged Read](https://hadoop.apache.org/docs/r2.8.3/hadoop-project-dist/hadoop-common/release/2.4.0/RELEASENOTES.2.4.0.html) feature. With this feature enabled, if reading from a block is slow, StarRocks starts a new read task that runs in parallel with the original one and reads from a replica of the target block. Whichever read task returns first, the other is cancelled. **Hedged Read can accelerate reads, but it also significantly increases heap memory consumption on the Java virtual machine (JVM). Therefore, if your physical machines have a small memory capacity, we recommend that you do not enable Hedged Read.**

#### [Recommended] Data Cache

See [Data Cache](../data_source/data_cache.md).

#### [Recommended] Shorten the timeout duration between the HDFS client and DataNodes

You can configure the `dfs.client.socket-timeout` property in `hdfs-site.xml` to shorten the timeout duration between the HDFS client and DataNodes (the default timeout is 60s, which is relatively long). This way, when StarRocks encounters a slow DataNode, the request times out quickly and is redirected to a new DataNode. The following example configures a 5-second timeout:

```xml
<configuration>
<property>
<name>dfs.client.socket-timeout</name>
<value>5000</value>
</property>
</configuration>
```

#### Hedged Read

Use the following parameters (supported from v3.0 onwards) in the BE configuration file `be.conf` to enable and configure the Hedged Read feature for your HDFS cluster.
