Brokers crash if all bookies are full #6969

Closed · trexinc opened this issue May 15, 2020 · 7 comments
Labels: area/broker, lifecycle/stale, type/bug

Comments

@trexinc (Contributor) commented May 15, 2020

Happens with both 2.5.0 and 2.5.1.
We run a distributed Pulsar deployment on k8s with several bookies, brokers, function workers, and proxies.
If the bookies get completely full (because of a retention bug, #6935), the brokers start to crash-loop, making it impossible to remove large topics or troubleshoot.
As a workaround we add another bookie and then clear the large topics, but I would expect brokers not to crash, or maybe even to go into some emergency mode where only the admin API is available.
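
A minimal sketch of that workaround on k8s, assuming the bookies run in a StatefulSet; the StatefulSet name, replica count, and topic name below are placeholders for illustration:

```shell
# Add one more bookie so the cluster has writable capacity again
# ("bookie" is a placeholder StatefulSet name; adjust the replica count).
kubectl scale statefulset bookie --replicas=4

# Once the brokers recover, drop the oversized topic
# (tenant/namespace/topic are placeholders).
bin/pulsar-admin topics delete persistent://my-tenant/my-ns/large-topic
```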

@jiazhai (Member) commented May 18, 2020

@trexinc Thanks for reporting this issue. Could you please collect the broker logs from when this error happens?

@sijie (Member) commented May 18, 2020

@trexinc If you have function workers running along with brokers: function workers use Pulsar topics for metadata management, so if the BookKeeper cluster is not writable, the function workers cannot produce messages, which in turn prevents the brokers from starting up. We can think about adding retry logic to the function worker so that it retries until it is able to produce its messages.
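
A minimal sketch of that retry idea using the Pulsar Java client; the service URL, topic, and backoff values are placeholders, and this illustrates the proposal rather than the actual function-worker code:

```java
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.PulsarClientException;

public class RetryProduceSketch {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")  // placeholder URL
                .build();
        Producer<byte[]> producer = client.newProducer()
                .topic("persistent://public/functions/metadata")  // placeholder topic
                .create();

        byte[] payload = "worker-metadata".getBytes();
        long backoffMs = 1_000;
        while (true) {
            try {
                producer.send(payload);
                break;  // succeeds once BookKeeper is writable again
            } catch (PulsarClientException e) {
                // BookKeeper may be full/read-only: back off and retry
                // instead of failing worker (and broker) startup.
                Thread.sleep(backoffMs);
                backoffMs = Math.min(backoffMs * 2, 60_000);
            }
        }
        producer.close();
        client.close();
    }
}
```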

@trexinc (Contributor, Author) commented May 19, 2020

@sijie our function workers run in separate pods, not alongside the brokers.

@jiazhai unfortunately the log of the first crash wasn't saved. All subsequent logs show the crash was caused by "Broker-znode owned by different zk-session", even if I stop all brokers but one. I didn't see any other interesting logs.
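
For anyone hitting the same "Broker-znode owned by different zk-session" error, one way to inspect the broker znodes and the sessions owning them; the paths follow Pulsar's default metadata layout, and the ZooKeeper address and znode name below are placeholders:

```shell
# Open a ZooKeeper shell against the metadata store (host:port is a placeholder).
bin/pulsar zookeeper-shell -server zookeeper:2181

# Inside the shell: list the ephemeral broker znodes, then check the
# ephemeralOwner session of a specific one.
ls /loadbalance/brokers
stat /loadbalance/brokers/broker-0.broker.default.svc.cluster.local:8080
```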

@trexinc (Contributor, Author) commented May 19, 2020

We will try to set up a separate environment where we can replicate this issue on demand without affecting others. It reproduces easily on our active environment, so hopefully it will replicate on a dedicated one as well.

@sijie (Member) commented May 19, 2020

@trexinc Interesting. It would be good to get the logs so we can help you analyze them.

@ckdarby commented May 25, 2020

@trexinc Were you using "small volumes" with large ingestion? Put better: could you fill ~10% of your total bookie capacity in under 10 seconds?

We faced a similar issue where the cluster filled all the bookies before the ReadOnly safety check could even run, and the cluster ended up partially unusable for some functions.
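
The relevant knobs for this failure mode are BookKeeper's disk-pressure settings; the parameter names below are real bookkeeper.conf options, but the values are illustrative, not recommendations:

```properties
# Let a full bookie turn read-only instead of erroring out.
readOnlyModeEnabled=true
# Fraction of disk usage at which the bookie refuses writes.
diskUsageThreshold=0.95
# Soft limit that logs a warning first.
diskUsageWarnThreshold=0.90
# Interval (ms) between disk checks; a long interval can let fast
# ingestion blow past the threshold between checks.
diskCheckInterval=5000
```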

@tisonkun (Member) commented Dec 6, 2022

Closed as stale. Please open a new issue if it's still relevant to the maintained versions.

@tisonkun closed this as not planned on Dec 6, 2022.