Brokers crash if all bookies are full #6969
Comments
@trexinc Thanks for reporting this issue. Could you please help collect the broker logs from when this error happens?
@trexinc If you have function workers running along with brokers: function workers use Pulsar topics for metadata management, so if the BookKeeper cluster is not writable, the function workers cannot produce messages, which in turn prevents the brokers from starting up. We can think about adding retry logic to the function worker so that it keeps retrying until it is able to produce the messages.
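For illustration only, the retry idea above could look roughly like the sketch below. It is not the function worker's actual code: `produce_with_retry`, its parameters, and the `produce` callable are hypothetical names, and exponential backoff is just one reasonable choice of retry policy.

```python
import time


def produce_with_retry(produce, initial_delay=0.5, max_delay=30.0,
                       max_attempts=None, sleep=time.sleep):
    """Call `produce` until it succeeds, backing off between attempts.

    `produce` is a hypothetical zero-argument callable that raises an
    exception while the storage layer is not writable. With
    max_attempts=None the loop retries indefinitely.
    """
    delay = initial_delay
    attempt = 0
    while True:
        attempt += 1
        try:
            return produce()
        except Exception:
            # Give up only if the caller set a limit and it is reached.
            if max_attempts is not None and attempt >= max_attempts:
                raise
            sleep(delay)
            # Double the wait each time, capped at max_delay.
            delay = min(delay * 2, max_delay)
```

The `sleep` parameter is injectable so the backoff can be exercised without real waiting.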
@sijie Our function workers run on separate pods, not along with the brokers. @jiazhai Unfortunately the log of the first crash wasn't saved. All subsequent logs showed the crash caused by "Broker-znode owned by different zk-session", even if I stop all brokers but one. I didn't see any other interesting logs.
We will try to set up a separate environment where we can reproduce this issue on demand without affecting others. It reproduces easily on our active environment, so hopefully it will reproduce on a dedicated one as well.
@trexinc Interesting. It would be good to get the logs so we can help you analyze them.
@trexinc Were you using "small volumes" with large ingestion? Better put, could you fill ~10% of your total bookies in < 10 seconds? We faced a similar issue where the cluster filled all the bookies before the ReadOnly safety check could even be performed, and the cluster ended up partially unusable for some functions.
Closed as stale. Please open a new issue if it's still relevant to the maintained versions.
Happens with both 2.5.0 and 2.5.1
Running distributed Pulsar on k8s, with several bookies, brokers, function workers, and proxies.
If the bookies get completely full (because of a bug with retention, #6935), brokers go into a crash loop, making it impossible to remove large topics or troubleshoot.
As a workaround we add another bookie and then clear large topics, but I would expect brokers not to crash, or maybe even to go into some emergency mode where only the admin API is available.