HDDS-1935. Improve the visibility with Ozone Insight tool #1255
💔 -1 overall
This message was automatically generated.
Let us sync up some time. If I get an overview of the code layout, it will be easier for me to review this. I really appreciate you doing this. Thank you ... I will sync with you when you are back
This is a very useful addition @elek. Is there any documentation or slides that I can look at to understand this more?
I think this is very useful. Found a few minor issues while testing it, see code comments.
It seems that the log level is kept at debug/trace after the `ozone insight log` command is finished (^C). I would expect it to be restored to avoid spamming logs. Maybe it's specific to the docker environment, where logs go to the console.
I also observed an occasional (probably harmless) `EofException` around `WriterAppender.append`.
.../java/org/apache/hadoop/ozone/protocolPB/ScmBlockLocationProtocolServerSideTranslatorPB.java
hadoop-ozone/insight/src/main/java/org/apache/hadoop/ozone/insight/om/KeyManagerInsight.java
hadoop-ozone/insight/src/main/java/org/apache/hadoop/ozone/insight/LogSubcommand.java
hadoop-ozone/insight/src/main/java/org/apache/hadoop/ozone/insight/MetricsSubCommand.java
hadoop-ozone/insight/src/main/java/org/apache/hadoop/ozone/insight/ConfigurationSubCommand.java
hadoop-ozone/insight/src/main/java/org/apache/hadoop/ozone/insight/List.java
I think all the CLI parameters are well documented. But I will definitely create new doc pages if the patch is accepted. @anuengineer @arp7: Do you have any more comments? Can you please review?
Can you please rebase this patch? The patch is not applying cleanly to the head of trunk.
It could be a camel-case problem on OS X. I think the camel-case usage was not consistent earlier. According to GitHub there is no rebase problem. Can you please try deleting your local insight folder and retrying?
…lServerSideTranslatorPB Co-Authored-By: Doroszlai, Attila <6454655+adoroszlai@users.noreply.github.com>
Co-Authored-By: Doroszlai, Attila <6454655+adoroszlai@users.noreply.github.com>
Visibility is a key aspect of operating any Ozone cluster. We need better visibility to improve correctness and performance. While distributed tracing is a good tool for improving the visibility of performance, we have no powerful tool that can be used to check the internal state of the Ozone cluster and debug certain correctness issues.
To improve the visibility of the internal components, I propose introducing a new command-line application: `ozone insight`.
The new tool will show the selected metrics / logs / configuration for any of the internal components (like replication-manager, pipeline, etc.).
For each insight point we can define the required logs and log levels, metrics and configuration, so the tool can display only the component-specific information during debugging.
h2. Usage
First we can check the available insight points:
{code}
bash-4.2$ ozone insight list
Available insight points:
scm.node-manager SCM Datanode management related information.
scm.replica-manager SCM closed container replication manager
scm.event-queue Information about the internal async event delivery
scm.protocol.block-location SCM Block location protocol endpoint
scm.protocol.container-location Planned insight point which is not yet implemented.
scm.protocol.datanode Planned insight point which is not yet implemented.
scm.protocol.security Planned insight point which is not yet implemented.
scm.http Planned insight point which is not yet implemented.
om.key-manager OM Key Manager
om.protocol.client Ozone Manager RPC endpoint
om.http Planned insight point which is not yet implemented.
datanode.pipeline[id] More information about one ratis datanode ring.
datanode.rocksdb More information about one ratis datanode ring.
s3g.http Planned insight point which is not yet implemented.
{code}
Insight points can define configuration, metrics and/or logs. Configuration can be displayed based on the configuration objects:
{code}
ozone insight config scm.protocol.block-location
Configuration for `scm.protocol.block-location` (SCM Block location protocol endpoint)

The hostname or IP address used by the SCM block client endpoint to bind
The port number of the Ozone SCM block client service.
The address of the Ozone SCM block client service. If not defined value of ozone.scm.client.address is used
{code}
Metrics can be retrieved from the Prometheus endpoint:
{code}
ozone insight metrics scm.protocol.block-location
Metrics for `scm.protocol.block-location` (SCM Block location protocol endpoint)

RPC connections
  Open connections: 0
  Dropped connections: 0
  Received bytes: 0
  Sent bytes: 0

RPC queue
  RPC average queue time: 0.0
  RPC call queue length: 0

RPC performance
  RPC processing time average: 0.0
  Number of slow calls: 0

Message type counters
  Number of AllocateScmBlock: 0
  Number of DeleteScmKeyBlocks: 0
  Number of GetScmInfo: 2
  Number of SortDatanodes: 0
{code}
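As a rough illustration of how selected metrics could be picked out of a Prometheus text-format response, here is a minimal sketch. The class name `PromMetricsFilter`, the `demo_` metric names and the `port` label are assumptions made up for the example; they are not taken from the patch or from the real metric names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: keep only selected metrics from a Prometheus text-format body. */
class PromMetricsFilter {

  /**
   * Parses lines of the form "name{labels} value" and returns those whose
   * name starts with the given prefix. Assumes no trailing timestamps and
   * plain numeric values (special values such as +Inf are not handled).
   */
  static Map<String, Double> filter(String body, String prefix) {
    Map<String, Double> result = new LinkedHashMap<>();
    for (String raw : body.split("\n")) {
      String line = raw.trim();
      // Skip blank lines and "# HELP" / "# TYPE" comment lines.
      if (line.isEmpty() || line.startsWith("#")) {
        continue;
      }
      // The sample value follows the last space; labels stay in the key.
      int idx = line.lastIndexOf(' ');
      if (idx < 0 || !line.startsWith(prefix)) {
        continue;
      }
      result.put(line.substring(0, idx).trim(),
          Double.parseDouble(line.substring(idx + 1)));
    }
    return result;
  }

  public static void main(String[] args) {
    // Hypothetical sample body; real metric names will differ.
    String body = "# TYPE demo_rpc_open_connections gauge\n"
        + "demo_rpc_open_connections{port=\"9863\"} 0\n"
        + "demo_other_metric 5\n";
    filter(body, "demo_rpc_")
        .forEach((name, value) -> System.out.println(name + ": " + value));
  }
}
```

Filtering on the client side keeps the server untouched, which fits the goal of reusing the existing endpoint as-is.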
Log levels can be adjusted with the existing logLevel servlet, and logs can be collected / streamed via a simple logstream servlet:
{code}
ozone insight log scm.node-manager
[SCM] 2019-08-08 12:42:37,392 [DEBUG|org.apache.hadoop.hdds.scm.node.SCMNodeManager|SCMNodeManager] Processing node report from [datanode=ozone_datanode_1.ozone_default]
[SCM] 2019-08-08 12:43:37,392 [DEBUG|org.apache.hadoop.hdds.scm.node.SCMNodeManager|SCMNodeManager] Processing node report from [datanode=ozone_datanode_1.ozone_default]
[SCM] 2019-08-08 12:44:37,392 [DEBUG|org.apache.hadoop.hdds.scm.node.SCMNodeManager|SCMNodeManager] Processing node report from [datanode=ozone_datanode_1.ozone_default]
[SCM] 2019-08-08 12:45:37,393 [DEBUG|org.apache.hadoop.hdds.scm.node.SCMNodeManager|SCMNodeManager] Processing node report from [datanode=ozone_datanode_1.ozone_default]
[SCM] 2019-08-08 12:46:37,392 [DEBUG|org.apache.hadoop.hdds.scm.node.SCMNodeManager|SCMNodeManager] Processing node report from [datanode=ozone_datanode_1.ozone_default]
{code}
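To make the log-level handling concrete, the sketch below builds the URLs for the hadoop-common logLevel servlet (which takes `log` and `level` query parameters) and restores the previous level once the streaming is done. The class `LogLevelSketch`, the `httpGet` callback, the host `scm.example:9876` and the `INFO` restore level are all assumptions for illustration, not code from the patch. It needs Java 10+ for `URLEncoder.encode(String, Charset)`.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;

/** Sketch: raise a logger's level for the duration of a task, then restore it. */
class LogLevelSketch {

  /** Builds the logLevel servlet URL that sets one logger to one level. */
  static String levelUrl(String baseUrl, String logger, String level) {
    return baseUrl + "/logLevel?log="
        + URLEncoder.encode(logger, StandardCharsets.UTF_8)
        + "&level=" + URLEncoder.encode(level, StandardCharsets.UTF_8);
  }

  /**
   * Sets the level, runs the body, and always issues the restore call.
   * httpGet is a hypothetical callback that performs the HTTP request.
   */
  static void withLevel(String baseUrl, String logger, String level,
      String restoreLevel, Consumer<String> httpGet, Runnable body) {
    httpGet.accept(levelUrl(baseUrl, logger, level));
    try {
      body.run();
    } finally {
      httpGet.accept(levelUrl(baseUrl, logger, restoreLevel));
    }
  }

  public static void main(String[] args) {
    // Host, port and restore level are assumptions for this example.
    withLevel("http://scm.example:9876",
        "org.apache.hadoop.hdds.scm.node.SCMNodeManager", "DEBUG", "INFO",
        url -> System.out.println("GET " + url),
        () -> System.out.println("... stream logs here ..."));
  }
}
```

Note that a `finally` block does not run when the JVM is interrupted with ^C; for that case the restore call would have to be registered with `Runtime.getRuntime().addShutdownHook(...)` instead.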
The verbose mode can display the raw messages as well:
{code}
[SCM] 2019-08-08 13:16:37,398 [DEBUG|org.apache.hadoop.hdds.scm.node.SCMNodeManager|SCMNodeManager] Processing node report from [datanode=ozone_datanode_1.ozone_default]
[SCM] 2019-08-08 13:16:37,400 [TRACE|org.apache.hadoop.hdds.scm.node.SCMNodeManager|SCMNodeManager] HB is received from [datanode=ozone_datanode_1.ozone_default]:
storageReport {
  storageUuid: "DS-bffe6bee-1166-4502-acf5-57fc16c5aa98"
  storageLocation: "/data/hdds"
  capacity: 470282264576
  scmUsed: 16384
  remaining: 205695963136
  storageType: DISK
  failed: false
}
{code}
h2. Use cases
Ozone insight can be used for any kind of debugging. Some example problems from yesterday:
Due to a cache problem, volumes were created twice, without any error the second time. With this tool I can check the state of the internal cache, or check whether the volume is added to the rocksdb itself.
After fixing this problem we found a DNS caching issue. The OM responded with an error, but it was not clear where the error was propagated from (it was created in OzoneManagerProtocolClientSideTranslatorPB.handleError). By checking the traffic between SCM and OM, it is easy to track the origin of a specific error.
After fixing this problem we found a pipeline problem (reported later as HDDS-1933). With this tool I could check the content of the reports and messages sent to the pipeline manager.
h2. Implementation
We can implement the tool without any significant code change, as it uses existing features:
* the `/prom` endpoint
* the `/logLevel` servlet endpoint (from hadoop-common)

A new interface can be introduced for `InsightPoint`s, where all the affected logs/levels, metrics and config classes can be defined for each component.
The Prometheus servlet endpoint can be changed to be turned on by default.
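One possible shape for such an interface is sketched below. All type and method names (`InsightPointSketch`, `getLoggers`, `getMetricNames`, `getConfigClasses`, the registry class and the `demo_` metric name) are hypothetical and chosen only for illustration; the interface in the patch may look different. The logger name, levels and description are taken from the `scm.node-manager` examples above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical interface; method names are illustrative only. */
interface InsightPointSketch {
  String getDescription();

  /** Logger name mapped to the level to enable while observing. */
  Map<String, String> getLoggers(boolean verbose);

  /** Names of the metrics to display for this point. */
  List<String> getMetricNames();

  /** Configuration classes whose keys should be displayed. */
  List<Class<?>> getConfigClasses();
}

/** Example point modelled on the scm.node-manager entry. */
class NodeManagerInsightSketch implements InsightPointSketch {
  @Override
  public String getDescription() {
    return "SCM Datanode management related information.";
  }

  @Override
  public Map<String, String> getLoggers(boolean verbose) {
    // Verbose mode also enables TRACE messages (raw protobuf content).
    return Collections.singletonMap(
        "org.apache.hadoop.hdds.scm.node.SCMNodeManager",
        verbose ? "TRACE" : "DEBUG");
  }

  @Override
  public List<String> getMetricNames() {
    // Placeholder metric name, not a real one.
    return Arrays.asList("demo_node_manager_metric");
  }

  @Override
  public List<Class<?>> getConfigClasses() {
    return Collections.emptyList();
  }
}

/** Maps insight point names to their definitions, as `list` would need. */
class InsightRegistrySketch {
  static Map<String, InsightPointSketch> points() {
    Map<String, InsightPointSketch> points = new LinkedHashMap<>();
    points.put("scm.node-manager", new NodeManagerInsightSketch());
    return points;
  }

  public static void main(String[] args) {
    points().forEach((name, point) ->
        System.out.println(name + "  " + point.getDescription()));
  }
}
```

Keeping the loggers, metrics and config classes behind one interface means each subcommand (log, metrics, config) only needs the insight point name to find what to display.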
See: https://issues.apache.org/jira/browse/HDDS-1935