Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tikv, server: turn on grpc channelz to help diagnosis and add wait conn establish metric #16263

Merged
merged 4 commits into from
Apr 14, 2020
Merged

Conversation

lysu
Copy link
Contributor

@lysu lysu commented Apr 9, 2020

What problem does this PR solve?

Problem Summary:

channelz is one of two officially recommended tool to debug grpc question(another is log, tidb enable it by #14140).

and officially test said that turn it on by default is very low overhead, so maybe we can try it.

official view tool need install npm and start other server.....very hard to use

but there are a unofficial tool https://github.com/kazegusuri/channelzcli seems be useful.

after this Patch, TiDB status-port(default 10080) will export channelz service, then we can use https://github.com/kazegusuri/channelzcli:

  1. which channels TiDB used to connect other
./channelzcli -k --addr 0.0.0.0:10080 list channel                   
ID	Name                                                                            	State	Channel	SubChannel	Calls	Success	Fail	LastCall
1	127.0.0.1:2379                                                                  	READY	0      	1         	47    	46    	0     	8s      
4	endpoint://client-7dd43e13-d160-405d-aa72-d2fea023ba8e/127.0.0.1:2379           	READY	0      	1         	173   	173   	0     	8s      
7	127.0.0.1:40500                                                                 	READY	0      	1         	10    	9     	0     	21m     
9	127.0.0.1:40500                                                                 	READY	0      	1         	11    	10    	0     	21m     
11	127.0.0.1:40500                                                                 	READY	0      	1         	11    	10    	0     	21m     
13	127.0.0.1:40500                                                                 	READY	0      	1         	11    	10    	0     	20m     
22	endpoint://client-2618b91a-09c6-40ee-8dcd-ab67d4f783ba/127.0.0.1:2379           	READY	0      	2         	194   	192   	0     	5s    

127.0.0.1:40500 is tikv and 127.0.0.1:2379 is pd

  1. channel detail
./channelzcli -k --addr 0.0.0.0:10080 describe channel 7
Name:     	127.0.0.1:40500
ChannelID:	7
State:    	READY
Target:   	127.0.0.1:40500
Calls:
  Started:        	2
  Succeeded:      	0
  Failed:         	1
  LastCallStarted:	2020-04-09 16:17:41.317816978 +0000 UTC
Socket:   	<none>
Channels:   	<none>
Subchannels:
  ID	Name	State	Start	Succeeded	Failed
  8		READY	2	0	1
Trace:
  NumEvents:	46
  CreationTimestamp:	2020-04-09 16:16:46.19125706 +0000 UTC
  Events
    Severity	Description                                                                     	Timestamp
    INFO	Channel Connectivity change to CONNECTING                                       	2020-04-09 16:17:03.889785185 +0000 UTC
    INFO	Channel Connectivity change to TRANSIENT_FAILURE                                	2020-04-09 16:17:03.890119389 +0000 UTC
    INFO	Channel Connectivity change to CONNECTING                                       	2020-04-09 16:17:05.039407607 +0000 UTC
    INFO	Channel Connectivity change to TRANSIENT_FAILURE                                	2020-04-09 16:17:05.039660693 +0000 UTC
    INFO	Channel Connectivity change to CONNECTING                                       	2020-04-09 16:17:06.693333846 +0000 UTC
    INFO	Channel Connectivity change to TRANSIENT_FAILURE                                	2020-04-09 16:17:06.694109686 +0000 UTC
    INFO	Channel Connectivity change to CONNECTING                                       	2020-04-09 16:17:09.042878093 +0000 UTC
    INFO	Channel Connectivity change to TRANSIENT_FAILURE                                	2020-04-09 16:17:09.04324 +0000 UTC
    INFO	Channel Connectivity change to CONNECTING                                       	2020-04-09 16:17:12.42886261 +0000 UTC
    INFO	Channel Connectivity change to TRANSIENT_FAILURE                                	2020-04-09 16:17:12.429131451 +0000 UTC
    INFO	Channel Connectivity change to CONNECTING                                       	2020-04-09 16:17:14.859276276 +0000 UTC
    INFO	Channel Connectivity change to TRANSIENT_FAILURE                                	2020-04-09 16:17:14.85938234 +0000 UTC
    INFO	Channel Connectivity change to CONNECTING                                       	2020-04-09 16:17:18.209358109 +0000 UTC
    INFO	Channel Connectivity change to TRANSIENT_FAILURE                                	2020-04-09 16:17:18.20993323 +0000 UTC
    INFO	Channel Connectivity change to CONNECTING                                       	2020-04-09 16:17:21.187080916 +0000 UTC
    INFO	Channel Connectivity change to TRANSIENT_FAILURE                                	2020-04-09 16:17:21.18741749 +0000 UTC
    INFO	Channel Connectivity change to CONNECTING                                       	2020-04-09 16:17:24.321505023 +0000 UTC
    INFO	Channel Connectivity change to TRANSIENT_FAILURE                                	2020-04-09 16:17:24.321754044 +0000 UTC
    INFO	Channel Connectivity change to CONNECTING                                       	2020-04-09 16:17:27.014237302 +0000 UTC
    INFO	Channel Connectivity change to TRANSIENT_FAILURE                                	2020-04-09 16:17:27.014638801 +0000 UTC
    INFO	Channel Connectivity change to CONNECTING                                       	2020-04-09 16:17:29.947948483 +0000 UTC
    INFO	Channel Connectivity change to TRANSIENT_FAILURE                                	2020-04-09 16:17:29.948568087 +0000 UTC
    INFO	Channel Connectivity change to CONNECTING                                       	2020-04-09 16:17:32.519156992 +0000 UTC
    INFO	Channel Connectivity change to TRANSIENT_FAILURE                                	2020-04-09 16:17:32.519477234 +0000 UTC
    INFO	Channel Connectivity change to CONNECTING                                       	2020-04-09 16:17:35.416572618 +0000 UTC
    INFO	Channel Connectivity change to TRANSIENT_FAILURE                                	2020-04-09 16:17:35.41690579 +0000 UTC
    INFO	Channel Connectivity change to CONNECTING                                       	2020-04-09 16:17:38.220759965 +0000 UTC
    INFO	Channel Connectivity change to TRANSIENT_FAILURE                                	2020-04-09 16:17:38.221381207 +0000 UTC
    INFO	Channel Connectivity change to CONNECTING                                       	2020-04-09 16:17:41.316467407 +0000 UTC
    INFO	Channel Connectivity change to READY                                            	2020-04-09 16:17:41.317780751 +0000 UTC

we can see fail/success count and connection event log without change tidb code

and read more about tools in channelzcli's Readme.

and maybe we can integrate channelzcli https://github.com/kazegusuri/channelzcli/blob/master/channelz/client.go into TiDB-Dashboard in future

What is changed and how it works?

What's Changed:

  • turn on channelz by default
  • add metric for wait conn establish

How it Works:

Related changes

  • Need to cherry-pick to the release branch

Check List

Tests

  • Manual test (add detailed scripts or steps below)

Side effects

  • Performance regression
    • wait bench result

Release note


This change is Reviewable

@lysu lysu changed the title tikv/server: turn on grpc channelz and add wait conn establish metric tikv/server: turn on grpc channelz to help diagnosis and add wait conn establish metric Apr 9, 2020
@lysu
Copy link
Contributor Author

lysu commented Apr 9, 2020

/bench

@lysu lysu changed the title tikv/server: turn on grpc channelz to help diagnosis and add wait conn establish metric tikv, server: turn on grpc channelz to help diagnosis and add wait conn establish metric Apr 9, 2020
@sre-bot
Copy link
Contributor

sre-bot commented Apr 9, 2020

Benchmark Report

Run Sysbench Performance Test on VMs

@@                               Benchmark Diff                               @@
================================================================================
--- tidb: 2a2bf377aed3bc16ce5873a30f4c6d9b15dc6c0e
+++ tidb: 784d997e3faeb595da4e2a23502c0410778ad6eb
tikv: a00b00877db7ecf9ac60a1a99d1bcc7a6b84cc24
pd: 8b8f19f2de7ae75d5d41d2c932cf6c05a7f45e10
================================================================================
oltp_update_index:
    * QPS: 4341.03 ± 0.55% (std=19.64) delta: -0.09% (p=0.593)
    * Latency p50: 29.46 ± 0.46% (std=0.10) delta: -0.02%
    * Latency p99: 54.86 ± 5.49% (std=1.87) delta: 2.78%
            
oltp_insert:
    * QPS: 6943.63 ± 0.38% (std=20.51) delta: -0.81% (p=0.425)
    * Latency p50: 18.43 ± 0.37% (std=0.05) delta: 0.81%
    * Latency p99: 32.53 ± 1.81% (std=0.59) delta: -0.51%
            
oltp_read_write:
    * QPS: 14257.47 ± 0.18% (std=18.50) delta: -0.36% (p=0.846)
    * Latency p50: 179.95 ± 0.21% (std=0.27) delta: 0.38%
    * Latency p99: 341.12 ± 2.70% (std=9.21) delta: -0.27%
            
oltp_point_select:
    * QPS: 41613.22 ± 0.21% (std=64.98) delta: 0.57% (p=0.814)
    * Latency p50: 3.07 ± 0.22% (std=0.00) delta: -0.62%
    * Latency p99: 10.79 ± 2.27% (std=0.16) delta: 0.00%
            
oltp_update_non_index:
    * QPS: 4749.22 ± 0.03% (std=1.14) delta: -0.23% (p=0.304)
    * Latency p50: 26.95 ± 0.04% (std=0.01) delta: 0.24%
    * Latency p99: 42.36 ± 1.20% (std=0.36) delta: 3.06%
            

@lysu lysu marked this pull request as ready for review April 10, 2020 02:10
@lysu lysu requested review from tiancaiamao, coocood and jackysp April 10, 2020 02:10
@tiancaiamao
Copy link
Contributor

LGTM

@tiancaiamao tiancaiamao added the status/LGT1 Indicates that a PR has LGTM 1. label Apr 10, 2020
@jackysp
Copy link
Member

jackysp commented Apr 10, 2020

It is better to upgrade grafana script in the same PR.

@codecov
Copy link

codecov bot commented Apr 13, 2020

Codecov Report

Merging #16263 into master will decrease coverage by 0.2361%.
The diff coverage is 100.0000%.

@@               Coverage Diff                @@
##             master     #16263        +/-   ##
================================================
- Coverage   80.6338%   80.3976%   -0.2362%     
================================================
  Files           506        506                
  Lines        137730     136137      -1593     
================================================
- Hits         111057     109451      -1606     
- Misses        18147      18164        +17     
+ Partials       8526       8522         -4     

@pingcap pingcap deleted a comment from sre-bot Apr 13, 2020
@pingcap pingcap deleted a comment from sre-bot Apr 13, 2020
@pingcap pingcap deleted a comment from sre-bot Apr 13, 2020
@lysu
Copy link
Contributor Author

lysu commented Apr 13, 2020

/run-monitor-test

@sre-bot
Copy link
Contributor

sre-bot commented Apr 13, 2020

Visit the grafana server at: http://172.16.5.5:31411, it will last for 5 hours

@lysu
Copy link
Contributor Author

lysu commented Apr 14, 2020

/build

@lysu
Copy link
Contributor Author

lysu commented Apr 14, 2020

@jackysp monitor fixed, PTAL~

Copy link
Member

@jackysp jackysp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@jackysp
Copy link
Member

jackysp commented Apr 14, 2020

/merge

@sre-bot sre-bot added the status/can-merge Indicates a PR has been approved by a committer. label Apr 14, 2020
@sre-bot
Copy link
Contributor

sre-bot commented Apr 14, 2020

/run-all-tests

@sre-bot
Copy link
Contributor

sre-bot commented Apr 14, 2020

@lysu merge failed.

@jackysp
Copy link
Member

jackysp commented Apr 14, 2020

/merge

@sre-bot
Copy link
Contributor

sre-bot commented Apr 14, 2020

Your auto merge job has been accepted, waiting for 16008, 16009, 16175, 16232, 16233, 15738, 16132, 16133, 16218, 15585

@sre-bot
Copy link
Contributor

sre-bot commented Apr 14, 2020

/run-all-tests

@sre-bot
Copy link
Contributor

sre-bot commented Apr 14, 2020

@lysu merge failed.

@lysu
Copy link
Contributor Author

lysu commented Apr 14, 2020

/build

@lysu
Copy link
Contributor Author

lysu commented Sep 17, 2020

/run-cherry-picker

@ti-srebot
Copy link
Contributor

cherry pick to release-4.0 in PR #20077

lysu added a commit to ti-srebot/tidb that referenced this pull request Sep 23, 2020
Signed-off-by: ti-srebot <ti-srebot@pingcap.com>
ti-srebot added a commit that referenced this pull request Oct 3, 2020
…nn establish metric (#16263) (#20077)

Signed-off-by: ti-srebot <ti-srebot@pingcap.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/tikv status/can-merge Indicates a PR has been approved by a committer. status/LGT1 Indicates that a PR has LGTM 1. type/usability
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants