Setting topic permission under load #2952
Comments
Now verified the same issue on a full Pulsar setup.
I think the problem might be related to concurrent updates on the namespace policy. We can probably handle the bad version and retry, or put all the updates for the same namespace on an ordered executor.
Yes, the ZooKeeper bad version implies that Pulsar has concurrent updates. At least a retry is needed. I don't know if a queue would help; do you only have one entry point? Otherwise, you would still have potential concurrent updates.
Harald
Hi,
Just a comment: is it not more normal to give status 503 Service Unavailable for a retry scenario? Status 409 Conflict implies that the client has sent a faulty request, which is not the case here, and sometimes it can be hard to distinguish this conflict from an actual conflict, forcing the client to still parse the response text. Or is this ZooKeeper conflict handled inside Pulsar with a retry?
Harald
…On Fri, 28 Dec 2018, 19:15 Sijie Guo wrote:
Closed #2952 via 1fa81ca.
@HaraldGustafsson When a bad version happens, it means there were concurrent updates to the namespace policy. It is hard to determine what the real conflict is and what kind of conflict it is, so returning a CONFLICT status code sounds like the right approach to me. IMO, Pulsar applications using the Pulsar admin API should catch this exception and retry, because the applications are better placed to resolve the conflict. Does that align with what you are thinking?
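For illustration, a minimal client-side sketch of this catch-and-retry approach, assuming the conflict surfaces as HTTP 409 through PulsarAdminException (the retry budget and backoff values are arbitrary placeholders):

```java
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.client.admin.PulsarAdminException;
import org.apache.pulsar.common.policies.data.AuthAction;

import java.util.Set;

public class GrantWithRetry {
    // Retries a topic permission grant that fails with 409 Conflict,
    // backing off between attempts. Attempt count and backoff are placeholders.
    static void grantPermission(PulsarAdmin admin, String topic, String role,
                                Set<AuthAction> actions) throws Exception {
        int maxAttempts = 5;
        long backoffMs = 100;
        for (int attempt = 1; ; attempt++) {
            try {
                admin.topics().grantPermission(topic, role, actions);
                return;
            } catch (PulsarAdminException e) {
                // 409 means the namespace policy was updated concurrently; retry.
                if (e.getStatusCode() != 409 || attempt == maxAttempts) {
                    throw e;
                }
                Thread.sleep(backoffMs);
                backoffMs *= 2; // exponential backoff
            }
        }
    }
}
```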
I understand the reasoning, but from the API it is not obvious that changing permissions on different topics in the same namespace can produce this concurrency conflict (and should be retried), if you don't know that all topic permissions are stored in the namespace policy. At least I did not know that before digging a bit deeper. When I realised it, I needed to queue such operations on the client side and add retries, since multiple workers might attempt changes at the same time. So when setting up many topics, with several configurations like retention, TTL, permissions, etc., everything has to be queued and each call incurs an RTT delay. In my setup, for each admin request we receive, we need to queue around 10 admin requests to Pulsar, and with retries even more. Would it be possible to introduce a larger namespace policy update method that changes multiple things simultaneously? In my case the conflict was often due to changing several things in the namespace policy.
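A rough sketch of this kind of client-side queuing, assuming one single-threaded executor per namespace (the class and method names are made up for illustration; retention, TTL, and other policy updates would be submitted the same way):

```java
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.AuthAction;

import java.util.EnumSet;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class NamespaceAdminQueue {
    // All policy updates for one namespace go through this single-threaded
    // executor, so they never race against each other on the client side.
    private final ExecutorService queue = Executors.newSingleThreadExecutor();
    private final PulsarAdmin admin;

    NamespaceAdminQueue(PulsarAdmin admin) {
        this.admin = admin;
    }

    Future<?> grant(String topic, String role) {
        return queue.submit(() -> {
            admin.topics().grantPermission(topic, role,
                    EnumSet.of(AuthAction.produce, AuthAction.consume));
            return null;
        });
    }
}
```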
@HaraldGustafsson I see your point now. Let me reopen this issue and think through a better solution for addressing such conflicts.
@HaraldGustafsson From my point of view, a retry may not be a good solution, because the retry could also fail again. Maybe a queue, or allowing only one request at a time, is a better idea; as far as I know, kafka-manager changes policy this way. @sijie what do you think about this?
I was thinking about the same thing. That sgtm.
@sijie I'll try adding one global write lock for this request first; I think that could be enough for policy changes.
@foreversunyao sgtm
@sijie For this issue, I am considering creating a persistent znode in ZooKeeper if it does not exist, with the znode named after the namespace, and using it to implement a lock for permission changes. But I am not sure which path I should use; what do you think? Current path:
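For illustration only, a per-namespace lock over ZooKeeper could be expressed with Apache Curator's InterProcessMutex; Curator and the lock path shown here are assumptions made for the sketch, not an existing Pulsar mechanism or path:

```java
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.recipes.locks.InterProcessMutex;

import java.util.concurrent.TimeUnit;

public class NamespacePolicyLock {
    // Serializes policy updates for one namespace across processes by taking
    // a ZooKeeper-backed mutex whose node is named after the namespace.
    private final InterProcessMutex mutex;

    NamespacePolicyLock(CuratorFramework zk, String namespace) {
        // "/pulsar-policy-locks" is a placeholder path, not a real Pulsar path.
        this.mutex = new InterProcessMutex(zk, "/pulsar-policy-locks/" + namespace);
    }

    void withLock(Runnable update) throws Exception {
        if (!mutex.acquire(30, TimeUnit.SECONDS)) { // arbitrary timeout
            throw new IllegalStateException("could not acquire namespace policy lock");
        }
        try {
            update.run();
        } finally {
            mutex.release();
        }
    }
}
```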
@HaraldGustafsson @sijie @merlimat Right now, I think we could have two other ways to handle the retry besides this one:
Any ideas? Thanks
Any news here? I have topics with ~10 different roles, each with different permissions. I manage them with the Pulsar Terraform provider, which basically does 10 POSTs in a row. In a new cluster with no load it was almost certain that already the second POST would fail. Also, the Terraform provider internally uses pulsarctl, which does not have any retry mechanism, although it could, at least while the internals of the server are not fixed?
I am in the exact same situation as @jaak-pruulmann-sympower - have you found any workaround? I'm interested in contributing a fix to this since it's a pretty frequent pain point for me; I like the idea of handling the
Edit: just noticed a ZK sequential node was mentioned; this still needs a timeout, since the caller has to wait for the queue item to be processed, and if the queue gets stuck the caller will need to time out. Still a fan of just retrying on BadVersionException to handle the majority of cases, where maybe tens or hundreds of updates occur concurrently.
Expected behavior
Do a REST call to the admin interface for setting topic permissions like:
body:
Should work every time and return status 204.
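The exact REST call and body were not captured above; as a rough equivalent, granting topic permissions through the Java admin client could look like the sketch below (service URL, topic, role, and actions are placeholders):

```java
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.AuthAction;

import java.util.EnumSet;

public class GrantExample {
    public static void main(String[] args) throws Exception {
        // Placeholder admin service URL, topic, and role.
        try (PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("https://broker.example.com:8443")
                .build()) {
            admin.topics().grantPermission(
                    "persistent://my-tenant/my-namespace/my-topic",
                    "my-role",
                    EnumSet.of(AuthAction.produce, AuthAction.consume));
        }
    }
}
```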
Actual behavior
When this is done under heavy load, setting up permissions for many topics in parallel, some calls result in the exception logged below and status code 500.
Steps to reproduce
Set up a tenant and namespace, then run the set-permission call for many different topics in that namespace in parallel. I use the same role and actions for all of them.
System configuration
pulsar: 2.2.0
This error occurred when running the broker in standalone mode in a Docker container supplied on Docker Hub, with authorisation and authentication via client certs. I have not tested on our full Pulsar setup yet.
Logs