[feature] add Apache HDFS monitor (#1920)
Co-authored-by: zhangshenghang <shenghang.zhang@avrisdigital.com>
Co-authored-by: zhangshenghang <admin@hadoop.wiki>
Co-authored-by: crossoverJie <crossoverJie@gmail.com>
Co-authored-by: yqxxgh <42080876+yqxxgh@users.noreply.github.com>
Co-authored-by: Ceilzcx <48920254+Ceilzcx@users.noreply.github.com>
Co-authored-by: aias00 <rokkki@163.com>
Co-authored-by: tomsun28 <tomsun28@outlook.com>
8 people authored May 6, 2024
1 parent f793178 commit 49fbf2a
Showing 7 changed files with 1,059 additions and 0 deletions.
56 changes: 56 additions & 0 deletions home/docs/help/hdfs_datanode.md
@@ -0,0 +1,56 @@
---
id: hdfs_datanode
title: Monitoring Apache HDFS DataNode
sidebar_label: Apache HDFS DataNode
keywords: [big data monitoring system, distributed file system monitoring, Apache HDFS DataNode monitoring]
---

> HertzBeat collects and monitors metrics from Apache HDFS DataNodes.
**Protocol Used: HTTP**

## Pre-monitoring Operations

Obtain the HTTP monitoring port of the Apache HDFS DataNode. It is the port in the value of the `dfs.datanode.http.address` configuration property.
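
The value of `dfs.datanode.http.address` is a `host:port` pair, so the port has to be split off before it goes into the monitor form. A minimal sketch, assuming the value has already been read from the cluster configuration; the function name is hypothetical and is not part of HertzBeat or Hadoop:

```python
def port_from_http_address(address: str) -> int:
    """Return the port from a 'host:port' value such as '0.0.0.0:50075'."""
    # rpartition splits on the last colon, so a bracketed IPv6 host
    # such as '[::1]:50075' is handled as well.
    host, sep, port = address.strip().rpartition(":")
    if not sep or not port.isdigit():
        raise ValueError(f"expected 'host:port', got {address!r}")
    return int(port)


print(port_from_http_address("0.0.0.0:50075"))  # prints 50075
```

The same helper would work for a NameNode address, since only the text after the last colon is used.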

## Configuration Parameters

| Parameter Name | Parameter Description |
| ----------------- |-------------------------------------------------------|
| Target Host | IPv4 address, IPv6 address, or domain name of the target to be monitored, without a protocol prefix (for example, no `http://`). |
| Port | Monitoring port number for Apache HDFS DataNode, default is 50075. |
| Query Timeout | Timeout for querying Apache HDFS DataNode, in milliseconds, default is 6000 milliseconds. |
| Metrics Collection Interval | Time interval for monitoring data collection, in seconds, minimum interval is 30 seconds. |
| Probe Before Monitoring | Whether to probe and check monitoring availability before adding. |
| Description/Remarks | Additional description and remarks for this monitoring. |
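
The defaults and limits in the table above can be written down as a small validation step. This is only an illustrative sketch: the class and field names are hypothetical and do not correspond to HertzBeat's own configuration model, and the 30-second interval default is a placeholder taken from the documented minimum.

```python
from dataclasses import dataclass


@dataclass
class DataNodeMonitorParams:
    host: str                 # IPv4, IPv6, or domain name, no protocol prefix
    port: int = 50075         # documented default DataNode monitoring port
    timeout_ms: int = 6000    # documented default query timeout
    interval_s: int = 30      # assumption: documented minimum used as default

    def validate(self) -> None:
        """Raise ValueError if a field violates the rules in the table."""
        if "://" in self.host:
            raise ValueError("host must not include a protocol prefix")
        if not 0 < self.port < 65536:
            raise ValueError("port must be between 1 and 65535")
        if self.timeout_ms <= 0:
            raise ValueError("timeout must be positive")
        if self.interval_s < 30:
            raise ValueError("collection interval must be at least 30 seconds")


DataNodeMonitorParams(host="10.0.0.5").validate()  # passes silently
```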

### Metrics Collected

#### Metric Set: FSDatasetState

| Metric Name | Metric Unit | Metric Description |
| ------------ | ----------- | ------------------------------ |
| DfsUsed | GB | DataNode HDFS usage |
| Remaining | GB | Remaining space on DataNode HDFS |
| Capacity | GB | Total capacity of DataNode HDFS |
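
The three values above are listed in GB. The sketch below shows one way such values could be derived, assuming the raw figures arrive in bytes and that 1 GB means 1024³ bytes; both points are assumptions, and the helper name is hypothetical.

```python
BYTES_PER_GB = 1024 ** 3  # assumption: binary gigabytes


def dataset_state_gb(bean: dict) -> dict:
    """Convert the FSDatasetState byte counters to GB, rounded to 2 decimals."""
    names = ("DfsUsed", "Remaining", "Capacity")
    return {name: round(bean[name] / BYTES_PER_GB, 2) for name in names}


# Made-up sample values, not real measurements.
sample = {
    "DfsUsed": 5 * BYTES_PER_GB,
    "Remaining": 15 * BYTES_PER_GB,
    "Capacity": 20 * BYTES_PER_GB,
}
print(dataset_state_gb(sample))
# prints {'DfsUsed': 5.0, 'Remaining': 15.0, 'Capacity': 20.0}
```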

#### Metric Set: JvmMetrics

| Metric Name | Metric Unit | Metric Description |
| ---------------------- | ----------- | ----------------------------------------- |
| MemNonHeapUsedM | MB | Current usage of NonHeapMemory by JVM |
| MemNonHeapCommittedM   | MB          | NonHeapMemory currently committed to the JVM |
| MemHeapUsedM | MB | Current usage of HeapMemory by JVM |
| MemHeapCommittedM | MB | Committed size of HeapMemory by JVM |
| MemHeapMaxM | MB | Maximum size of HeapMemory configured in JVM |
| MemMaxM | MB | Maximum memory available for JVM at runtime |
| ThreadsRunnable | Count | Number of threads in RUNNABLE state |
| ThreadsBlocked | Count | Number of threads in BLOCKED state |
| ThreadsWaiting | Count | Number of threads in WAITING state |
| ThreadsTimedWaiting    | Count       | Number of threads in TIMED_WAITING state  |
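
A common derived figure from these metrics is heap utilisation, that is `MemHeapUsedM` as a share of `MemHeapMaxM`. A minimal sketch, assuming the metric set has been loaded into a dictionary keyed by metric name; the function name is hypothetical:

```python
def heap_usage_percent(jvm: dict) -> float:
    """Return used heap as a percentage of the maximum heap."""
    max_mb = jvm["MemHeapMaxM"]
    # Guard against a missing or undefined maximum to avoid division by zero.
    if max_mb <= 0:
        raise ValueError("MemHeapMaxM must be positive")
    return 100.0 * jvm["MemHeapUsedM"] / max_mb


# Made-up sample values, not real measurements.
print(heap_usage_percent({"MemHeapUsedM": 256.0, "MemHeapMaxM": 1024.0}))  # prints 25.0
```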

#### Metric Set: runtime

| Metric Name | Metric Unit | Metric Description |
| ------------ | ----------- | ------------------ |
| StartTime | | Startup time |
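
`StartTime` has no unit in the table. If it is reported as milliseconds since the Unix epoch, which is how the JVM's runtime bean reports its start time but is an assumption for this monitor, the uptime can be derived as sketched here with a hypothetical helper:

```python
from datetime import datetime, timedelta, timezone


def uptime(start_time_ms: int, now: datetime) -> timedelta:
    """Return how long the process has been running at the instant `now`."""
    started = datetime.fromtimestamp(start_time_ms / 1000, tz=timezone.utc)
    return now - started


# Made-up example: a process started on 2024-05-06 00:00 UTC, checked 36 hours later.
started = datetime(2024, 5, 6, tzinfo=timezone.utc)
start_time_ms = int(started.timestamp() * 1000)
print(uptime(start_time_ms, started + timedelta(hours=36)))  # prints 1 day, 12:00:00
```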
92 changes: 92 additions & 0 deletions home/docs/help/hdfs_namenode.md
@@ -0,0 +1,92 @@
---
id: hdfs_namenode
title: Monitoring Apache HDFS NameNode
sidebar_label: Apache HDFS NameNode
keywords: [big data monitoring system, distributed file system monitoring, HDFS NameNode monitoring]
---

> HertzBeat collects and monitors metrics from Apache HDFS NameNodes.
**Protocol Used: HTTP**

## Pre-Monitoring Actions

Obtain the HTTP port of the HDFS NameNode, over which its JMX metrics are exposed. It is the port in the value of the `dfs.namenode.http-address` configuration property.
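
Because the protocol is HTTP, the port can be checked by hand before adding the monitor. The sketch below assumes the metrics are read from Hadoop's `/jmx` servlet with its `qry` parameter and that the bean of interest is `Hadoop:service=NameNode,name=FSNamesystem`; whether HertzBeat issues exactly this request is an assumption. The host name and the function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_jmx_url(host: str, port: int, bean: str) -> str:
    """Build a /jmx URL that selects a single MBean through `qry`."""
    query = urlencode({"qry": bean})
    return f"http://{host}:{port}/jmx?{query}"


def first_bean(payload: dict) -> dict:
    """Return the first entry of the response's top-level 'beans' list."""
    beans = payload.get("beans", [])
    if not beans:
        raise LookupError("no matching MBean in response")
    return beans[0]


def fetch_bean(host: str, port: int, bean: str, timeout_s: float = 6.0) -> dict:
    """Fetch one MBean over HTTP; 6 s mirrors the documented 6000 ms default."""
    with urlopen(build_jmx_url(host, port, bean), timeout=timeout_s) as response:
        return first_bean(json.load(response))


# Only the URL is built here; no request is sent.
print(build_jmx_url("namenode.example.com", 50070,
                    "Hadoop:service=NameNode,name=FSNamesystem"))
```

Calling `fetch_bean` against a reachable NameNode would return the bean as a dictionary keyed by attribute name. An IPv6 literal would need to be wrapped in square brackets before being passed as `host`.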

## Configuration Parameters

| Parameter Name | Parameter Description |
| ------------------ |--------------------------------------------------------|
| Target Host | IPv4 address, IPv6 address, or domain name of the target to be monitored, without a protocol prefix (for example, no `http://`). |
| Port | The monitoring port number of the HDFS NameNode, default is 50070. |
| Query Timeout | Timeout for querying the HDFS NameNode, in milliseconds, default is 6000 milliseconds. |
| Metrics Collection Interval | Time interval for collecting monitoring data, in seconds, minimum interval is 30 seconds. |
| Probe Before Monitoring | Whether to probe and check the availability of monitoring before adding it. |
| Description/Remarks | Additional description and remarks for this monitoring. |

### Collected Metrics

#### Metric Set: FSNamesystem

| Metric Name | Metric Unit | Metric Description |
| --------------------------- | ----------- | ------------------------------------- |
| CapacityTotal | | Total cluster storage capacity |
| CapacityTotalGB | GB | Total cluster storage capacity |
| CapacityUsed | | Used cluster storage capacity |
| CapacityUsedGB | GB | Used cluster storage capacity |
| CapacityRemaining | | Remaining cluster storage capacity |
| CapacityRemainingGB | GB | Remaining cluster storage capacity |
| CapacityUsedNonDFS | | Non-HDFS usage of cluster capacity |
| TotalLoad | | Total client connections in the cluster |
| FilesTotal | | Total number of files in the cluster |
| BlocksTotal                 |             | Total number of blocks                |
| PendingReplicationBlocks | | Number of blocks awaiting replication |
| UnderReplicatedBlocks | | Number of blocks with insufficient replicas |
| CorruptBlocks | | Number of corrupt blocks |
| ScheduledReplicationBlocks | | Number of blocks scheduled for replication |
| PendingDeletionBlocks | | Number of blocks awaiting deletion |
| ExcessBlocks | | Number of excess blocks |
| PostponedMisreplicatedBlocks| | Number of misreplicated blocks postponed for processing |
| NumLiveDataNodes | | Number of live data nodes in the cluster |
| NumDeadDataNodes | | Number of data nodes marked as dead |
| NumDecomLiveDataNodes | | Number of decommissioned live nodes |
| NumDecomDeadDataNodes | | Number of decommissioned dead nodes |
| NumDecommissioningDataNodes | | Number of nodes currently being decommissioned |
| TransactionsSinceLastCheckpoint | | Number of transactions since the last checkpoint |
| LastCheckpointTime | | Time of the last checkpoint |
| PendingDataNodeMessageCount|             | Number of DataNode messages queued on the standby NameNode |
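
Several of these metrics are natural alert conditions. The sketch below flags a few of them, assuming the metric set is available as a dictionary keyed by metric name. The selection of metrics, the 85% capacity limit, and the function name are illustrative assumptions and not HertzBeat defaults.

```python
from typing import List


def namesystem_warnings(fs: dict, used_percent_limit: float = 85.0) -> List[str]:
    """Return human-readable warnings for a FSNamesystem sample."""
    warnings = []
    # Any non-zero value of these counters usually deserves a look.
    for name in ("CorruptBlocks", "UnderReplicatedBlocks", "NumDeadDataNodes"):
        if fs.get(name, 0) > 0:
            warnings.append(f"{name}={fs[name]}")
    total = fs.get("CapacityTotal", 0)
    if total > 0:
        used_percent = 100.0 * fs.get("CapacityUsed", 0) / total
        if used_percent > used_percent_limit:
            warnings.append(f"CapacityUsed={used_percent:.1f}%")
    return warnings


# Made-up sample values, not real measurements.
sample = {
    "CapacityTotal": 1000,
    "CapacityUsed": 900,
    "CorruptBlocks": 2,
    "UnderReplicatedBlocks": 0,
    "NumDeadDataNodes": 0,
}
print(namesystem_warnings(sample))  # prints ['CorruptBlocks=2', 'CapacityUsed=90.0%']
```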

#### Metric Set: RPC

| Metric Name | Metric Unit | Metric Description |
| ------------------------- | ----------- | -------------------------- |
| ReceivedBytes | | Data receiving rate |
| SentBytes | | Data sending rate |
| RpcQueueTimeNumOps | | RPC call rate |
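
The descriptions above speak of rates. If the underlying values are cumulative counters, which is an assumption here, a per-second rate can be derived from two consecutive samples as in this sketch. The function name is hypothetical.

```python
from typing import Optional


def per_second_rate(previous: float, current: float, interval_s: float) -> Optional[float]:
    """Return the counter increase per second between two samples.

    Returns None when the counter went backwards, which typically
    means the monitored process restarted between the samples.
    """
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    if current < previous:
        return None
    return (current - previous) / interval_s


# Made-up samples taken 30 seconds apart (the minimum collection interval).
print(per_second_rate(1000, 4000, 30))  # prints 100.0
```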

#### Metric Set: runtime

| Metric Name | Metric Unit | Metric Description |
| ------------------------- | ----------- | ------------------- |
| StartTime | | Start time |

#### Metric Set: JvmMetrics

| Metric Name | Metric Unit | Metric Description |
| ------------------------- | ----------- | ------------------- |
| MemNonHeapUsedM | MB | Current usage of NonHeapMemory by JVM |
| MemNonHeapCommittedM | MB | Committed NonHeapMemory by JVM |
| MemHeapUsedM | MB | Current usage of HeapMemory by JVM |
| MemHeapCommittedM | MB | Committed HeapMemory by JVM |
| MemHeapMaxM | MB | Maximum HeapMemory configured for JVM |
| MemMaxM | MB | Maximum memory that can be used by JVM |
| GcCountParNew | Count | Number of ParNew GC events |
| GcTimeMillisParNew | Milliseconds| Time spent in ParNew GC |
| GcCountConcurrentMarkSweep| Count | Number of ConcurrentMarkSweep GC events|
| GcTimeMillisConcurrentMarkSweep | Milliseconds | Time spent in ConcurrentMarkSweep GC |
| GcCount | Count | Total number of GC events |
| GcTimeMillis | Milliseconds| Total time spent in GC events |
| ThreadsRunnable | Count | Number of threads in RUNNABLE state |
| ThreadsBlocked | Count | Number of threads in BLOCKED state |
| ThreadsWaiting | Count | Number of threads in WAITING state |
| ThreadsTimedWaiting       | Count       | Number of threads in TIMED_WAITING state|
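
`GcCount` and `GcTimeMillis` are most useful in combination, for example as the average time per GC event. A minimal sketch, assuming the metric set has been loaded into a dictionary keyed by metric name; the function name is hypothetical:

```python
def average_gc_pause_ms(jvm: dict) -> float:
    """Return the mean time per GC event in milliseconds (0.0 if no GC ran)."""
    count = jvm.get("GcCount", 0)
    if count <= 0:
        return 0.0
    return jvm.get("GcTimeMillis", 0) / count


# Made-up sample values, not real measurements.
print(average_gc_pause_ms({"GcCount": 50, "GcTimeMillis": 1500}))  # prints 30.0
```

The same calculation applies to the ParNew and ConcurrentMarkSweep pairs when those collectors are in use.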
@@ -0,0 +1,56 @@
---
id: hdfs_datanode
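title: Monitoring Apache HDFS DataNode
sidebar_label: Apache HDFS DataNode
keywords: [big data monitoring system, distributed file system monitoring, Apache HDFS DataNode monitoring]
---

> HertzBeat monitors the metrics of Apache HDFS DataNode nodes.
**Protocol used: HTTP**

## Pre-monitoring Operations

Obtain the HTTP monitoring port of the Apache HDFS DataNode. Value: `dfs.datanode.http.address`

## Configuration Parameters

| Parameter Name   | Parameter Description                 |
| ---------------- |---------------------------------------|
| Target Host      | IPv4 address, IPv6 address, or domain name of the monitored peer, without a protocol prefix. |
| Port             | Monitoring port number of the Apache HDFS DataNode, default is 50075. |
| Query Timeout    | Timeout for querying the Apache HDFS DataNode, in milliseconds, default is 6000 milliseconds. |
| Metrics Collection Interval | Time interval for collecting monitoring data, in seconds, minimum interval is 30 seconds. |
| Probe Before Monitoring | Whether to probe and check monitoring availability before adding the monitor. |
| Description/Remarks | Additional description and remarks for this monitor. |

### Collected Metrics

#### Metric Set: FSDatasetState

| Metric Name                | Metric Unit | Metric Description                |
| -------------------------- | -------- | ------------------------------------ |
| DfsUsed                    | GB       | DataNode HDFS usage                  |
| Remaining                  | GB       | Remaining space on DataNode HDFS     |
| Capacity                   | GB       | Total capacity of DataNode HDFS      |

#### Metric Set: JvmMetrics

| Metric Name              | Metric Unit | Metric Description                |
| ------------------------ | -------- | ------------------------------------ |
| MemNonHeapUsedM          | MB       | NonHeapMemory currently used by the JVM |
| MemNonHeapCommittedM     | MB       | NonHeapMemory currently committed to the JVM |
| MemHeapUsedM             | MB       | HeapMemory currently used by the JVM |
| MemHeapCommittedM        | MB       | HeapMemory currently committed to the JVM |
| MemHeapMaxM              | MB       | Maximum HeapMemory configured for the JVM |
| MemMaxM                  | MB       | Maximum memory available to the JVM at runtime |
| ThreadsRunnable          | Count    | Number of threads in RUNNABLE state  |
| ThreadsBlocked           | Count    | Number of threads in BLOCKED state   |
| ThreadsWaiting           | Count    | Number of threads in WAITING state   |
| ThreadsTimedWaiting      | Count    | Number of threads in TIMED_WAITING state |

#### Metric Set: runtime

| Metric Name         | Metric Unit | Metric Description |
| --------------------| -------- | ----------------- |
| StartTime           |          | Startup time      |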
@@ -0,0 +1,93 @@
---
id: hdfs_namenode
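title: Monitoring Apache HDFS NameNode
sidebar_label: Apache HDFS NameNode
keywords: [big data monitoring system, distributed file system monitoring, Apache HDFS NameNode monitoring]
---

> HertzBeat monitors the metrics of Apache HDFS NameNode nodes.
**Protocol used: HTTP**

## Pre-monitoring Operations

Obtain the HTTP monitoring port of the Apache HDFS NameNode. Value: `dfs.namenode.http-address`

## Configuration Parameters

| Parameter Name   | Parameter Description                 |
| ---------------- |---------------------------------------|
| Target Host      | IPv4 address, IPv6 address, or domain name of the monitored peer, without a protocol prefix. |
| Port             | Monitoring port number of the HDFS NameNode, default is 50070. |
| Query Timeout    | Timeout for querying the HDFS NameNode, in milliseconds, default is 6000 milliseconds. |
| Metrics Collection Interval | Time interval for collecting monitoring data, in seconds, minimum interval is 30 seconds. |
| Probe Before Monitoring | Whether to probe and check monitoring availability before adding the monitor. |
| Description/Remarks | Additional description and remarks for this monitor. |

### Collected Metrics

#### Metric Set: FSNamesystem

| Metric Name                | Metric Unit | Metric Description                |
| -------------------------- | -------- | ------------------------------------ |
| CapacityTotal              |          | Total cluster storage capacity       |
| CapacityTotalGB            | GB       | Total cluster storage capacity       |
| CapacityUsed               |          | Used cluster storage capacity        |
| CapacityUsedGB             | GB       | Used cluster storage capacity        |
| CapacityRemaining          |          | Remaining cluster storage capacity   |
| CapacityRemainingGB        | GB       | Remaining cluster storage capacity   |
| CapacityUsedNonDFS         |          | Cluster capacity used by non-HDFS data |
| TotalLoad                  |          | Number of client connections across the cluster |
| FilesTotal                 |          | Total number of files in the cluster |
| BlocksTotal                |          | Total number of blocks               |
| PendingReplicationBlocks   |          | Number of blocks awaiting replication |
| UnderReplicatedBlocks      |          | Number of blocks with insufficient replicas |
| CorruptBlocks              |          | Number of corrupt blocks             |
| ScheduledReplicationBlocks |          | Number of blocks scheduled for replication |
| PendingDeletionBlocks      |          | Number of blocks awaiting deletion   |
| ExcessBlocks               |          | Number of excess blocks              |
| PostponedMisreplicatedBlocks |        | Number of misreplicated blocks postponed for processing |
| NumLiveDataNodes           |          | Number of live data nodes            |
| NumDeadDataNodes           |          | Number of data nodes marked as dead  |
| NumDecomLiveDataNodes      |          | Number of decommissioned live nodes  |
| NumDecomDeadDataNodes      |          | Number of decommissioned dead nodes  |
| NumDecommissioningDataNodes |         | Number of nodes currently being decommissioned |
| TransactionsSinceLastCheckpoint |     | Number of transactions since the last checkpoint |
| LastCheckpointTime         |          | Time of the last checkpoint          |
| PendingDataNodeMessageCount |         | Number of DataNode messages queued on the standby NameNode |

#### Metric Set: RPC

| Metric Name         | Metric Unit | Metric Description  |
| ------------------- | -------- | ---------------------- |
| ReceivedBytes       |          | Data receiving rate    |
| SentBytes           |          | Data sending rate      |
| RpcQueueTimeNumOps  |          | RPC call rate          |

#### Metric Set: runtime

| Metric Name         | Metric Unit | Metric Description |
| --------------------| -------- | ----------------- |
| StartTime           |          | Startup time      |

#### Metric Set: JvmMetrics

| Metric Name              | Metric Unit | Metric Description |
| ------------------------ | -------- | ---------------- |
| MemNonHeapUsedM          | MB       | NonHeapMemory currently used by the JVM |
| MemNonHeapCommittedM     | MB       | NonHeapMemory currently committed to the JVM |
| MemHeapUsedM             | MB       | HeapMemory currently used by the JVM |
| MemHeapCommittedM        | MB       | HeapMemory currently committed to the JVM |
| MemHeapMaxM              | MB       | Maximum HeapMemory configured for the JVM |
| MemMaxM                  | MB       | Maximum memory available to the JVM at runtime |
| GcCountParNew            | Count    | Number of young-generation (ParNew) GC events |
| GcTimeMillisParNew       | Milliseconds | Time spent in young-generation (ParNew) GC |
| GcCountConcurrentMarkSweep | Count  | Number of old-generation (ConcurrentMarkSweep) GC events |
| GcTimeMillisConcurrentMarkSweep | Milliseconds | Time spent in old-generation (ConcurrentMarkSweep) GC |
| GcCount                  | Count    | Total number of GC events |
| GcTimeMillis             | Milliseconds | Total time spent in GC |
| ThreadsRunnable          | Count    | Number of threads in RUNNABLE state |
| ThreadsBlocked           | Count    | Number of threads in BLOCKED state |
| ThreadsWaiting           | Count    | Number of threads in WAITING state |
| ThreadsTimedWaiting      | Count    | Number of threads in TIMED_WAITING state |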

2 changes: 2 additions & 0 deletions home/sidebars.json
@@ -221,6 +221,8 @@
"help/hadoop",
"help/hbase_master",
"help/hbase_regionserver",
"help/hdfs_namenode",
"help/hdfs_datanode",
"help/iotdb",
"help/hive",
"help/airflow",