[BUG] clusterer segfault when DB entry for clusterer can't be resolved and node_id for the local node is higher than the remote which can't be resolved #3473
Comments
@bogdan-iancu, I have a few more comments to add. I found out why I was unable to reproduce this issue in my testing environment: it is an ordering issue (the node_id of the local node is higher than the node_id of the remote node). I haven't drilled down into the query that clusterer runs, but I assume it orders by node_id ASC. In my case, the lower node_id was that of the remote node, and that entry failed DNS resolution. Additionally, I've found the correct place to make opensips quit gracefully without core-dumping (but I'm not sure if you want a more thorough fix). See PR #3474
Hey @tommybrecher, could you please test the fix I just pushed? If I get your ACK, I can do the backport. Thanks!
Hi @bogdan-iancu, opensips did not segfault this time, but I'm not sure that the status of clusterer is correct. I only see the address that failed to resolve (node_id 20) in the table, and it appears we never got to the next clusterer entry (node_id 200) that I originally posted in this issue. Is this expected?
@tommybrecher, sorry for the delay here. What are the error logs you see at startup? The code has explicit logic NOT to discard the local node: if that node cannot be added, the load will fail.
Hi @bogdan-iancu, no issues at all about the delay. I understand the nature of open source and appreciate the time and effort that you put into this :) I would have to re-test this in the lab to confirm what's being logged. I'll try to get that done today.
On failing to add a node (to the cluster), continue, as long as that node is not the current node. And if giving up on loading the whole cluster info, be sure we clean up and properly report this to the layer above - the last thing we should do is to report success (on load) but have partial, inconsistent data. Fixes #3473 (cherry picked from commit 784e1a4)
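The behavior described in this commit message can be sketched roughly as below. This is a minimal illustration, not the actual clusterer code: the `node`, `cluster` and `row` types, the `resolvable` field (standing in for "the URL resolved via DNS"), and the functions `load_cluster` and `free_cluster` are all hypothetical names made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified types; NOT the real OpenSIPS clusterer structures. */
struct node {
    int id;
    struct node *next;
};

struct cluster {
    struct node *nodes;        /* list of nodes that were added successfully */
    struct node *current_node; /* entry describing the local instance */
};

/* One DB row, reduced to what the sketch needs (hypothetical layout). */
struct row {
    int node_id;
    int resolvable; /* stand-in for "the url field resolved via DNS" */
};

/* Release everything so no partial cluster info is left behind. */
static void free_cluster(struct cluster *cl)
{
    struct node *n, *next;

    if (!cl)
        return;
    for (n = cl->nodes; n; n = next) {
        next = n->next;
        free(n);
    }
    free(cl);
}

/* Returns a loaded cluster, or NULL when the local node could not be added. */
static struct cluster *load_cluster(const struct row *rows, size_t n_rows,
                                    int my_id)
{
    struct cluster *cl;
    size_t i;

    assert(rows != NULL || n_rows == 0);

    cl = calloc(1, sizeof *cl);
    if (!cl)
        return NULL;

    for (i = 0; i < n_rows; i++) {
        struct node *n;

        if (!rows[i].resolvable) {
            if (rows[i].node_id == my_id) {
                /* the local node failed: give up, clean up, report failure */
                free_cluster(cl);
                return NULL;
            }
            continue; /* a remote node failed: skip it and keep loading */
        }

        n = calloc(1, sizeof *n);
        if (!n) {
            free_cluster(cl);
            return NULL;
        }
        n->id = rows[i].node_id;
        n->next = cl->nodes;
        cl->nodes = n;
        if (n->id == my_id)
            cl->current_node = n;
    }

    /* Assumption of this sketch: a cluster without a local node is a failure. */
    if (!cl->current_node) {
        free_cluster(cl);
        return NULL;
    }
    return cl;
}
```

With rows for node_id 20 (unresolvable) and node_id 200 (resolvable) and a local id of 200, this sketch skips node 20 and still sets `current_node`; if the local row is the unresolvable one, it returns NULL instead of a partially filled cluster.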
OpenSIPS version you are running
Describe the bug
When starting opensips, if a DNS entry exists in the url field of the clusterer DB table which can't be resolved (missing DNS entry), opensips will segfault in sync.c:97 (queue_sync_request).
This happens because cluster->current_node is NULL, resulting in a segmentation fault when trying to access ->flags.
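To illustrate the failure mode, here is a minimal sketch of a defensive check before touching the flags of the local node. The types, the `NODE_FLAG_EXAMPLE` bit and the function `mark_local_node` are hypothetical and are not taken from sync.c. A guard like this only avoids the crash at one call site; as noted under "Additional context", other code paths perform the same access.

```c
#include <assert.h>
#include <stddef.h>

#define NODE_FLAG_EXAMPLE (1u << 0) /* hypothetical flag bit */

/* Hypothetical, simplified types; NOT the real OpenSIPS structures. */
struct node {
    unsigned int flags;
};

struct cluster {
    struct node *current_node; /* NULL if the local node was never loaded */
};

/* Returns 0 on success, -1 if the cluster has no local node. */
static int mark_local_node(struct cluster *cl)
{
    if (!cl || !cl->current_node)
        return -1; /* would have been a NULL dereference without the check */

    cl->current_node->flags |= NODE_FLAG_EXAMPLE;
    assert(cl->current_node->flags & NODE_FLAG_EXAMPLE);
    return 0;
}
```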
To Reproduce
opensips.cfg
Clusterer table
Expected behavior
Relevant System Logs
backtrace full
OS/environment information
Additional context
if (cluster->current)
but ran into other issues in other areas of the code where the same access is attempted (timer, etc.), and after multiple attempts just got opensips deadlocked.