Gearmand performance worse than Python gear in our production, is some configuration missing / not correct ? #393

Open
pythonerdog opened this issue Jun 18, 2024 · 8 comments

Comments

@pythonerdog

In our production environment, Zuul acts as the gear client and the Jenkins gearman plugin acts as the gear worker (each Jenkins node executor is registered as a gear worker).

  1. gearmand version: 1.1.21
    gearmand -t 0 --job-retries 1 --keepalive --keepalive-idle 600 --keepalive-count 9 --keepalive-interval 75 --verbose DEBUG -p 4730
  2. Python gear: 0.16.0 (https://pypi.org/project/gear/)
    GearServer(4730, host="0.0.0.0", statsd_prefix='zuul.geard', keepalive=True, tcp_keepidle=100, tcp_keepintvl=30, tcp_keepcnt=5)

Both run in the same production environment, on Kubernetes.
With gearmand, 13000 tasks are consumed per hour.
With Python gear, 24000 tasks are consumed per hour.
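
For scale, here is a small sketch that converts the two hourly figures above into per-second rates and a ratio. It only restates the numbers from this report; nothing here is measured independently.

```python
# Throughput figures quoted in the report, in tasks per hour.
GEARMAND_TASKS_PER_HOUR = 13_000
PYTHON_GEAR_TASKS_PER_HOUR = 24_000


def per_second(tasks_per_hour: int) -> float:
    """Convert an hourly task count to tasks per second."""
    return tasks_per_hour / 3600


gearmand_rate = per_second(GEARMAND_TASKS_PER_HOUR)
gear_rate = per_second(PYTHON_GEAR_TASKS_PER_HOUR)
ratio = PYTHON_GEAR_TASKS_PER_HOUR / GEARMAND_TASKS_PER_HOUR

print(f"gearmand:    {gearmand_rate:.2f} tasks/s")  # 3.61
print(f"Python gear: {gear_rate:.2f} tasks/s")      # 6.67
print(f"ratio:       {ratio:.2f}x")                 # 1.85
```

Both rates are only a few tasks per second, so the gap is more likely to come from per-packet latency than from raw server capacity.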

Do you have any comments on this? Thanks very much.

PS: Next, we will try enabling gearmand multi-threading with the "-t" parameter for further verification.

@esabol
Member

esabol commented Jun 18, 2024

Well, I imagine -t 0 is the problem. It's not the default, and I wouldn't recommend that for anyone unless they are encountering some weirdness. You've hamstrung gearmand with that setting alone.

@esabol
Member

esabol commented Jun 18, 2024

You're also not comparing the same keepalive settings. I doubt it matters much, but you should compare the two implementations with the same settings.
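
To make the difference concrete, here is a rough sketch that puts the two keepalive configurations from the original report side by side. The `Keepalive` class is purely illustrative, and the "time to detect a dead peer" formula (idle time plus probe count times probe interval) is the usual approximation for TCP keepalive on Linux, not something taken from either server's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Keepalive:
    idle: int      # seconds of silence before the first probe
    interval: int  # seconds between probes
    count: int     # unanswered probes before the connection is dropped

    def detection_seconds(self) -> int:
        """Approximate time to notice a dead peer: idle + count * interval."""
        return self.idle + self.count * self.interval


# Values copied from the two command lines in the original report.
gearmand = Keepalive(idle=600, interval=75, count=9)
python_gear = Keepalive(idle=100, interval=30, count=5)

print(gearmand.detection_seconds())     # 1275
print(python_gear.detection_seconds())  # 250
```

With these values a dead peer would linger roughly five times longer under the gearmand settings, which is one more reason to align them before comparing.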

@esabol
Member

esabol commented Jun 18, 2024

--verbose DEBUG is also doing an excessive amount of logging for a production environment. Either get rid of that option entirely or at least change it to --verbose INFO.

Kubernetes? Are you using the Docker image from https://hub.docker.com/r/artefactual/gearmand/ ?

@pythonerdog
Author

Thanks, esabol.
We will rerun the test with the same parameters. The keepalive settings may be a suspect, and so may "-t 0".
I also noticed two other parameters, "-b" and "-f":
-b [ --backlog ] arg (=32) Number of backlog connections for listen.
-f [ --file-descriptors ] arg Number of file descriptors to allow for the process (total connections will be slightly less). Default is max allowed for user.

What do you think about leaving these two parameters at their default values, and what is the potential impact?

We build the Docker image ourselves:
RUN wget https://github.com/gearman/gearmand/releases/download/1.1.21/gearmand-1.1.21.tar.gz \
    && tar -zxvf gearmand-1.1.21.tar.gz \
    && cd gearmand-1.1.21 \
    && ./configure --with-boost-libdir=/usr/lib/x86_64-linux-gnu/ --enable-ssl \
    && make \
    && make install \
    && gearmand --help \
    && rm -rf gearmand-1.1.21 \
    && rm -rf gearmand-1.1.21.tar.gz
The gearmand process is managed by supervisor.

Thanks again for your quick support

@SpamapS
Member

SpamapS commented Jun 20, 2024

-b will only matter if you have a lot of churn in workers/clients.

from man listen:

   The backlog argument defines the maximum length to which the queue of pending connections for sockfd may grow. If a connection request arrives when the queue is full, the client may receive an error with an indication of ECONNREFUSED or, if the underlying protocol supports retransmission, the request may be ignored so that a later reattempt at connection succeeds.
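
As an illustration of where that argument ends up, here is a minimal sketch of a TCP listener in Python that passes a backlog to `listen()`. The `open_listener` helper is hypothetical and is not how gearmand itself is implemented; the value 32 simply mirrors the default shown in the `--help` text quoted earlier. On Linux the kernel may also cap the effective value at `net.core.somaxconn`.

```python
import socket


def open_listener(port: int = 0, backlog: int = 32) -> socket.socket:
    """Open a TCP listener whose pending-connection queue is capped by `backlog`."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("127.0.0.1", port))  # port 0 lets the OS pick a free port
    sock.listen(backlog)            # same argument that `man listen` describes
    return sock


listener = open_listener(backlog=32)
print("listening on port", listener.getsockname()[1])
listener.close()
```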

For -f, that's likely not important unless you're seeing socket/file errors. Open file limits are mostly meant to stop runaway processes from eating up kernel resources. The only open files gearmand is going to use are sockets, plus a handful for things like logs or local sqlite files if you're using a background queue plugin. The user-level ulimit is the highest the limit can go, so this option would only serve to reduce it anyway.
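
If you want to see what that user-level limit actually is inside the container, a quick check along these lines should work on Linux. This is a sketch using Python's standard `resource` module, run as the same user that starts gearmand; it reports the limit of the Python process, which is assumed here to match what gearmand would inherit under supervisor.

```python
import resource

# Soft limit is what the process gets by default; hard limit is the ceiling.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)


def describe(limit: int) -> str:
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if limit == resource.RLIM_INFINITY else str(limit)


print(f"soft limit: {describe(soft)}")
print(f"hard limit: {describe(hard)}")
```

With roughly 1,600 workers per function as shown later in this thread, a soft limit of 1024 would be worth a second look; a limit in the tens of thousands would not.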

@esabol
Member

esabol commented Jun 21, 2024

> -b will only matter if you have a lot of churn in workers/clients.

And it seems to me like that could be the case if one is doing performance testing with a trivial worker. So a higher value might be better in this arbitrary scenario?

@pythonerdog
Author

Hi @SpamapS and @esabol

We ran several trials on our production CI. During busy development hours there are usually over 20k gear tasks.
One abnormal case is that the C gearmand runs very slowly even though there are only a few gear tasks.

For example, a client submitting a task takes over 1s:
DEBUG 2024-09-07 15:36:05.331334 [ 9 ] Received GEARMAN_SUBMIT_JOB_HIGH -> libgearman-server/thread.cc:311
DEBUG 2024-09-07 15:36:06.809641 [ proc ] PACKET COMMAND: GEARMAN_SUBMIT_JOB_HIGH -> libgearman-server/server.cc:122
=> about 1.5s between receiving the packet and processing it
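
The exact gap can be computed from the two timestamps. A minimal sketch, with the two log lines above trimmed to the timestamp and message; the `log_time` helper is just for illustration:

```python
from datetime import datetime

# The two lines quoted above, trimmed to the timestamp and message.
received = "DEBUG 2024-09-07 15:36:05.331334 [ 9 ] Received GEARMAN_SUBMIT_JOB_HIGH"
processed = "DEBUG 2024-09-07 15:36:06.809641 [ proc ] PACKET COMMAND: GEARMAN_SUBMIT_JOB_HIGH"


def log_time(line: str) -> datetime:
    """Parse the 'YYYY-MM-DD HH:MM:SS.ffffff' timestamp that follows the level."""
    _level, date, clock, *_rest = line.split()
    return datetime.strptime(f"{date} {clock}", "%Y-%m-%d %H:%M:%S.%f")


delay = (log_time(processed) - log_time(received)).total_seconds()
print(f"{delay:.6f} s")  # 1.478307 s
```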

After checking the server debug log:
DEBUG 2024-09-07 15:36:05.331334 [ 9 ] Received GEARMAN_SUBMIT_JOB_HIGH -> libgearman-server/thread.cc:311
DEBUG 2024-09-07 15:36:05.331339 [ proc ] PACKET COMMAND: GEARMAN_CAN_DO -> libgearman-server/server.cc:122
DEBUG 2024-09-07 15:36:05.331341 [ 6 ] 10.175.51.166:23220 Watching POLLIN -> libgearman-server/gearmand_thread.cc:151
DEBUG 2024-09-07 15:36:05.331342 [ proc ] Registering function: build:production/24r2/test-thor-nr-pdsch-ctrl-gcc-release-03 -> libgearman-server/server.cc:526
DEBUG 2024-09-07 15:36:05.331348 [ proc ] PACKET COMMAND: GEARMAN_CAN_DO -> libgearman-server/server.cc:122
DEBUG 2024-09-07 15:36:05.331350 [ proc ] Registering function: build:production/master/sct-thor-clang64-cpri-fdd-wb-fr1-nr-dl-01:middleweight -> libgearman-server/server.cc:526
DEBUG 2024-09-07 15:36:05.331346 [ 9 ] 10.254.7.244:43488 Watching POLLIN -> libgearman-server/gearmand_thread.cc:151
DEBUG 2024-09-07 15:36:05.331355 [ proc ] PACKET COMMAND: GEARMAN_CAN_DO -> libgearman-server/server.cc:122
DEBUG 2024-09-07 15:36:05.331357 [ proc ] Registering function: build:production/23r1/sct-loki-gcc64-rtm-lte-dl-fdd-03 -> libgearman-server/server.cc:526
DEBUG 2024-09-07 15:36:05.331363 [ proc ] PACKET COMMAND: GEARMAN_CAN_DO -> libgearman-server/server.cc:122
DEBUG 2024-09-07 15:36:05.331364 [ proc ] Registering function: build:production/master/sct-thor-clang64-ecpri-tdd-fr1-nr-cpri-fdd-fr1-nr-cpri-fdd-lte:1exec -> libgearman-server/server.cc:526
DEBUG 2024-09-07 15:36:05.331368 [ proc ] PACKET COMMAND: GEARMAN_CAN_DO -> libgearman-server/server.cc:122
DEBUG 2024-09-07 15:36:05.331369 [ proc ] Registering function: build:production/23r3/sct-thor-clang64-cpri-fdd-nb-fr1-nr-ul-01:middleweight -> libgearman-server/server.cc:526
DEBUG 2024-09-07 15:36:05.331370 [ 2 ] 10.175.51.100:15019 Ready POLLIN -> libgearman-server/gearmand_con.cc:138
DEBUG 2024-09-07 15:36:05.331374 [ proc ] PACKET COMMAND: GEARMAN_CAN_DO -> libgearman-server/server.cc:122
DEBUG 2024-09-07 15:36:05.331374 [ 2 ] read 12 bytes -> libgearman-server/io.cc:810
DEBUG 2024-09-07 15:36:05.331376 [ proc ] Registering function: build:production/24r1/test-thor-nr-pucch-f1-fxp -> libgearman-server/server.cc:526
DEBUG 2024-09-07 15:36:05.331376 [ 2 ] Gear unpack -> libgearman-server/plugins/protocol/gear/protocol.cc:117
DEBUG 2024-09-07 15:36:05.331378 [ 2 ] Received GEARMAN_GRAB_JOB_UNIQ -> libgearman-server/thread.cc:311
DEBUG 2024-09-07 15:36:05.331380 [ proc ] PACKET COMMAND: GEARMAN_CAN_DO -> libgearman-server/server.cc:122
DEBUG 2024-09-07 15:36:05.331380 [ 2 ] 10.175.51.100:15019 Watching POLLIN -> libgearman-server/gearmand_thread.cc:151
DEBUG 2024-09-07 15:36:05.331382 [ proc ] Registering function: build:production/master/sct-thor-clang64-cpri-fdd-fr1-cpri-tdd-fr1-nr-new-agent:middleweight -> libgearman-server/server.cc:526
DEBUG 2024-09-07 15:36:05.331385 [ proc ] PACKET COMMAND: GEARMAN_CAN_DO -> libgearman-server/server.cc:122
DEBUG 2024-09-07 15:36:05.331386 [ proc ] Registering function: build:production/23r3/test-thor-clang64-sctlite:middleweight -> libgearman-server/server.cc:526
DEBUG 2024-09-07 15:36:05.331392 [ proc ] PACKET COMMAND: GEARMAN_CAN_DO -> libgearman-server/server.cc:122
.......
DEBUG 2024-09-07 15:36:06.807743 [ proc ] PACKET COMMAND: GEARMAN_PRE_SLEEP -> libgearman-server/server.cc:122
DEBUG 2024-09-07 15:36:06.807940 [ proc ] PACKET COMMAND: GEARMAN_GRAB_JOB_UNIQ -> libgearman-server/server.cc:122
DEBUG 2024-09-07 15:36:06.808553 [ proc ] PACKET COMMAND: GEARMAN_GRAB_JOB_UNIQ -> libgearman-server/server.cc:122
DEBUG 2024-09-07 15:36:06.808560 [ 10 ] Received RUN wakeup event -> libgearman-server/gearmand_thread.cc:633
DEBUG 2024-09-07 15:36:06.808577 [ 10 ] send() 12 bytes to peer -> libgearman-server/io.cc:407
DEBUG 2024-09-07 15:36:06.808581 [ 10 ] Sent NO_JOB -> libgearman-server/thread.cc:356
DEBUG 2024-09-07 15:36:06.808913 [ 10 ] 10.175.51.100:30472 Ready POLLIN -> libgearman-server/gearmand_con.cc:138
DEBUG 2024-09-07 15:36:06.808920 [ 10 ] read 12 bytes -> libgearman-server/io.cc:810
DEBUG 2024-09-07 15:36:06.808923 [ 10 ] Gear unpack -> libgearman-server/plugins/protocol/gear/protocol.cc:117
DEBUG 2024-09-07 15:36:06.808926 [ 10 ] Received GEARMAN_PRE_SLEEP -> libgearman-server/thread.cc:311
DEBUG 2024-09-07 15:36:06.808931 [ 10 ] 10.175.51.100:30472 Watching POLLIN -> libgearman-server/gearmand_thread.cc:151
DEBUG 2024-09-07 15:36:06.808973 [ proc ] PACKET COMMAND: GEARMAN_PRE_SLEEP -> libgearman-server/server.cc:122
DEBUG 2024-09-07 15:36:06.808980 [ 10 ] Received RUN wakeup event -> libgearman-server/gearmand_thread.cc:633
DEBUG 2024-09-07 15:36:06.808995 [ 10 ] send() 12 bytes to peer -> libgearman-server/io.cc:407
DEBUG 2024-09-07 15:36:06.808997 [ 10 ] Sent NO_JOB -> libgearman-server/thread.cc:356
DEBUG 2024-09-07 15:36:06.809370 [ 10 ] 10.175.51.166:43854 Ready POLLIN -> libgearman-server/gearmand_con.cc:138
DEBUG 2024-09-07 15:36:06.809375 [ 10 ] read 12 bytes -> libgearman-server/io.cc:810
DEBUG 2024-09-07 15:36:06.809377 [ 10 ] Gear unpack -> libgearman-server/plugins/protocol/gear/protocol.cc:117
DEBUG 2024-09-07 15:36:06.809380 [ 10 ] Received GEARMAN_PRE_SLEEP -> libgearman-server/thread.cc:311
DEBUG 2024-09-07 15:36:06.809383 [ 10 ] 10.175.51.166:43854 Watching POLLIN -> libgearman-server/gearmand_thread.cc:151
DEBUG 2024-09-07 15:36:06.809480 [ proc ] PACKET COMMAND: GEARMAN_PRE_SLEEP -> libgearman-server/server.cc:122
DEBUG 2024-09-07 15:36:06.809628 [ proc ] PACKET COMMAND: GEARMAN_WORK_DATA -> libgearman-server/server.cc:122
DEBUG 2024-09-07 15:36:06.809636 [ proc ] PACKET COMMAND: GEARMAN_WORK_STATUS -> libgearman-server/server.cc:122
DEBUG 2024-09-07 15:36:06.809641 [ proc ] PACKET COMMAND: GEARMAN_SUBMIT_JOB_HIGH -> libgearman-server/server.cc:122
DEBUG 2024-09-07 15:36:06.809644 [ proc ] Received submission, function:build:production/master/sct-thor-abip-cpri-fdd-lte-ul-capacity-13 unique:95b3b2c512fc4229bb8773b7744ec842 with 2 arguments -> libgearman-server/server.cc:252

-- most of that time is spent processing CAN_DO and PRE_SLEEP packets
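
To back this up with numbers rather than by eye, something like the following could tally what the "proc" thread handles. It is a sketch that assumes the log format shown in the excerpt above; the sample lines are copied from it, and in practice you would pass the lines of the real log file instead.

```python
import re
from collections import Counter

# Matches lines such as "[ proc ] PACKET COMMAND: GEARMAN_CAN_DO -> ...".
PACKET_RE = re.compile(r"\[\s*proc\s*\] PACKET COMMAND: (GEARMAN_\w+)")


def count_commands(lines):
    """Count PACKET COMMAND entries handled by the proc thread, per command."""
    counts = Counter()
    for line in lines:
        match = PACKET_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


log_sample = [
    "DEBUG 2024-09-07 15:36:05.331339 [ proc ] PACKET COMMAND: GEARMAN_CAN_DO -> libgearman-server/server.cc:122",
    "DEBUG 2024-09-07 15:36:05.331348 [ proc ] PACKET COMMAND: GEARMAN_CAN_DO -> libgearman-server/server.cc:122",
    "DEBUG 2024-09-07 15:36:06.807743 [ proc ] PACKET COMMAND: GEARMAN_PRE_SLEEP -> libgearman-server/server.cc:122",
    "DEBUG 2024-09-07 15:36:06.809641 [ proc ] PACKET COMMAND: GEARMAN_SUBMIT_JOB_HIGH -> libgearman-server/server.cc:122",
]

for command, n in count_commands(log_sample).most_common():
    print(command, n)
```

Restricting the input to the lines between the two SUBMIT_JOB_HIGH timestamps would show how many packets were queued ahead of the submission.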

workers register like this:

build:production/24r2/xxxxx-lte-newagent 0 0 1671
build:production/24r2/xxxxx-ul-capacity-15 0 0 1671
build:production/master/xxxxx-sctlite-dl 0 0 1529
build:production/24r3/xxxxx-ul-02:lightweight 0 0 1671
build:production/24r1/xxxxx-dl-s02-04 0 0 1671
build:production/24r2/xxxxx-release:middleweight 0 0 636
build:production/24r2/xxxxx-15:lightweight 0 0 1671
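
For reading these lines programmatically, here is a small parsing sketch. It assumes they are in the layout of the Gearman admin `status` output, that is: function name, queued jobs, running jobs, available workers. That column meaning is an assumption, since it is not stated above, and the `FunctionStatus` type is made up for the example.

```python
from typing import NamedTuple


class FunctionStatus(NamedTuple):
    name: str
    queued: int   # assumed: jobs waiting in the queue
    running: int  # assumed: jobs currently running
    workers: int  # assumed: workers able to run this function


def parse_status(line: str) -> FunctionStatus:
    """Split one status line into its name and three trailing integers."""
    name, queued, running, workers = line.rsplit(maxsplit=3)
    return FunctionStatus(name, int(queued), int(running), int(workers))


status_sample = """\
build:production/24r2/xxxxx-lte-newagent 0 0 1671
build:production/master/xxxxx-sctlite-dl 0 0 1529
build:production/24r2/xxxxx-release:middleweight 0 0 636
"""

rows = [parse_status(line) for line in status_sample.splitlines() if line.strip()]
for row in sorted(rows, key=lambda r: r.workers, reverse=True):
    print(row.name, row.workers)
```

If the last column really is the worker count, every worker sends one CAN_DO per function it registers, which would explain the long CAN_DO runs in the debug log.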

It seems there is only one thread, "proc", processing the received packets, one by one. As a result, many tasks (launching tests) cannot be submitted in time, because each submission has to wait for the gear server's response (the job handle).

Is there any special configuration or method to make it fast?

I have an idea: submit tasks asynchronously in the client, still using "submit_job", and handle the gear server's response (to get the job handle) via a callback or a similar function. I am not sure whether this is feasible, and I will verify it in a test environment. I would greatly appreciate any comments.
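
A rough sketch of that idea, using only the Python standard library. The `submit_job` function below is a hypothetical stand-in for the real blocking client call, not the API of any Gearman library, and the handle format is made up. Whether a real client connection can safely be shared across threads like this is also an assumption that would need checking.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor


def submit_job(function_name: str, payload: bytes) -> str:
    """Hypothetical blocking submit: waits for the server, returns a job handle."""
    time.sleep(0.01)  # stands in for the wait on the server's reply
    return f"H:fake:{function_name}"


handles = {}  # function name -> job handle, filled in by the callback


def submit_async(pool: ThreadPoolExecutor, function_name: str, payload: bytes) -> Future:
    """Queue a submission and record its handle once the server has answered."""
    future = pool.submit(submit_job, function_name, payload)

    def on_done(done: Future) -> None:
        handles[function_name] = done.result()  # re-raises if the submit failed

    future.add_done_callback(on_done)
    return future


with ThreadPoolExecutor(max_workers=8) as pool:
    for i in range(20):
        submit_async(pool, f"build:example-{i}", b"{}")

# Leaving the `with` block waits for every submission to finish.
print(len(handles))  # 20
```

This only keeps the client from blocking on each handle; it does not make the server answer any sooner.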

Big thanks for your continuous support

@esabol
Member

esabol commented Sep 18, 2024

@pythonerdog asked:

Is there any special configuration or method to make it fast ?

Beyond what we've already told you? Probably not. What command-line options are you currently using to start gearmand?

20K tasks seems kind of crazy. Is that in a single job submission? If so, it seems reasonable to me that that would take 1.5 seconds.


3 participants