
[BUG] clusterer segfault when DB entry for clusterer can't be resolved and node_id for the local node is higher than the remote which can't be resolved #3473

Open
tommybrecher opened this issue Sep 18, 2024 · 5 comments

@tommybrecher

OpenSIPS version you are running

  • OpenSIPS 3.6-dev, master branch (4d79a71)
root@ubuntu:~/build/opensips# /usr/local/opensips-upstream/sbin/opensips -V
version: opensips 3.6.0-dev (x86_64/linux)
flags: STATS: On, SHM_EXTRA_STATS, EXTRA_DEBUG, DISABLE_NAGLE, USE_MCAST, SHM_MMAP, PKG_MALLOC, Q_MALLOC, F_MALLOC, HP_MALLOC, DBG_MALLOC, CC_O0, FAST_LOCK-ADAPTIVE_WAIT, DBG_LOCK
ADAPTIVE_WAIT_LOOPS=1024, MAX_RECV_BUFFER_SIZE 262144, MAX_LISTEN 16, MAX_URI_SIZE 1024, BUF_SIZE 65535
poll method support: poll, epoll, sigio_rt, select.
git revision: 4d79a71ee
main.c compiled on 14:02:57 Sep 18 2024 with cc 9

Describe the bug

When OpenSIPS starts, if the url field of a clusterer DB row contains a hostname that cannot be resolved (missing DNS entry), OpenSIPS segfaults in sync.c:97 (queue_sync_request).
This happens because cluster->current_node is NULL, so dereferencing it to read ->flags causes a segmentation fault.

sync.c (crash site at line 97, marked with >):

  83                         return node->node_id;
  84                 }
  85
  86                 lock_release(node->lock);
  87         }
  88
  89         return 0;
  90 }
  91
  92 int queue_sync_request(cluster_info_t *cluster, struct local_cap *lcap)
  93 {
  94         lock_get(cluster->lock);
  95         lcap->flags |= CAP_SYNC_PENDING;
  96
 >97         if (cluster->current_node->flags & NODE_IS_SEED)
  98                 gettimeofday(&lcap->sync_req_time, NULL);
  99
(gdb) p *cluster
$1 = {
  cluster_id = 1,
  no_nodes = 0,
  node_list = 0x0,
  current_node = 0x0,
  send_sock = 0x0,
  lock = 0x7fc5bda518a8,
  top_version = 0,
  capabilities = 0x7fc5bda560f0,
  next = 0x0
}
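
For illustration only, here is a minimal sketch of a bare NULL guard at the crash site. This is not the upstream fix, and as noted under Additional context below, a guard alone proved insufficient; the sketch just shows the shape of the problem. One detail worth noting: any early return must release cluster->lock first, otherwise every later lock_get(cluster->lock) caller (timer handlers, etc.) blocks forever.

/* Sketch only -- a bare NULL guard at the crash site (sync.c:97),
 * not the actual OpenSIPS patch. */
int queue_sync_request(cluster_info_t *cluster, struct local_cap *lcap)
{
        lock_get(cluster->lock);
        lcap->flags |= CAP_SYNC_PENDING;

        /* current_node is NULL when the node list failed to load,
         * e.g. after the DNS resolution failure described above */
        if (!cluster->current_node) {
                lock_release(cluster->lock);  /* never bail out holding the lock */
                return -1;
        }

        if (cluster->current_node->flags & NODE_IS_SEED)
                gettimeofday(&lcap->sync_req_time, NULL);

        /* ... remainder of queue_sync_request() unchanged ... */
}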

To Reproduce

opensips.cfg

####### Global Parameters #########
log_level=4
xlog_level=4
stderror_enabled=no
syslog_enabled=yes
syslog_facility=LOG_LOCAL0
udp_workers=4
socket=udp:10.192.200.39:5060
socket=bin:10.192.200.39:5566

####### Modules Section ########

#set module path
mpath="/usr/local/opensips-upstream/lib64/opensips/modules/"
loadmodule "signaling.so"
loadmodule "sl.so"
loadmodule "tm.so"
loadmodule "rr.so"
loadmodule "maxfwd.so"
loadmodule "sipmsgops.so"
loadmodule "mi_fifo.so"
loadmodule "proto_udp.so"
loadmodule "proto_bin.so"
loadmodule "clusterer.so"
loadmodule "dialog.so"
loadmodule "db_mysql.so"

# configure modules
# TM
modparam("tm", "fr_timeout", 5)
modparam("tm", "fr_inv_timeout", 30)
modparam("tm", "restart_fr_on_each_reply", 0)
modparam("tm", "onreply_avp_mode", 1)
# RR
modparam("rr", "append_fromtag", 0)
# MI_FIFO
modparam("mi_fifo", "fifo_name", "/run/opensips/opensips_fifo")
modparam("mi_fifo", "fifo_mode", 0666)
# CLUSTERER
modparam("clusterer", "my_node_id", 200)
modparam("clusterer", "seed_fallback_interval", 300)
modparam("clusterer", "db_url", "mysql://******:******@localhost/opensips_upstream")
# DIALOG
modparam("dialog", "db_mode", 3)
modparam("dialog", "db_url", "mysql://******:******@localhost/opensips_upstream")
modparam("dialog", "profiles_with_value", "trunkCalls;")
modparam("dialog", "dialog_replication_cluster", 1)

####### Routing Logic ########
route{
        if (!mf_process_maxfwd_header(10)) {
                send_reply(483,"Too Many Hops");
                exit;
        }
        if (has_totag()) {
                if ( is_method("ACK") && t_check_trans() ) {
                        t_relay();
                        exit;
                }
                if ( !loose_route() ) {
                        send_reply(404,"Not here");
                        exit;
                }
                route(relay);
                exit;
        }
        if (is_method("CANCEL")) {
                if (t_check_trans())
                        t_relay();
                exit;
        }
        t_check_trans();
        if ( !(is_method("REGISTER")  ) ) {
                if (is_myself("$fd")) {
                } else {
                        if (!is_myself("$rd")) {
                                send_reply(403,"Relay Forbidden");
                                exit;
                        }
                }
        }
        if (loose_route()) {
                xlog("L_ERR", "Attempt to route with preloaded Route's [$fu/$tu/$ru/$ci]");
                if (!is_method("ACK"))
                        send_reply(403,"Preload Route denied");
                exit;
        }
        if (!is_method("REGISTER|MESSAGE"))
                record_route();
        if (!is_myself("$rd")) {
                append_hf("P-hint: outbound\r\n"); 
                route(relay);
        }
        if (is_method("PUBLISH|SUBSCRIBE")) {
                send_reply(503, "Service Unavailable");
                exit;
        }
        if ($rU==NULL) {
                # request with no Username in RURI
                send_reply(484,"Address Incomplete");
                exit;
        }
        route(relay);
}

route[relay] {
        # for INVITEs enable some additional helper routes
        if (is_method("INVITE")) {
                t_on_branch("per_branch_ops");
                t_on_reply("handle_nat");
                t_on_failure("missed_call");
        }

        if (!t_relay()) {
                send_reply(500,"Internal Error");
        }
        exit;
}
branch_route[per_branch_ops] {
        xlog("new branch at $ru\n");
}
onreply_route[handle_nat] {
        xlog("incoming reply\n");
}
failure_route[missed_call] {
        if (t_was_cancelled()) {
                exit;
        }
}

Clusterer table

INSERT INTO clusterer (id, cluster_id, node_id, url, state, no_ping_retries, priority, sip_addr, flags, description) VALUES
(1, 1, 200, 'bin:10.192.200.39:5566', 1, 3, 50, '10.192.200.39', 'seed', 'local_node'),
(2, 1, 20, 'bin:XXXXX8-XXX2g.XX.XXX.XXXX.XXXXXXXXXXXXXX.XXX:5566', 1, 3, 50, '10.192.200.39', 'seed', 'neighbor');

Expected behavior

Relevant System Logs

Sep 18 16:28:45 [236214] DBG:core:daemonize: waiting for status code from children
Sep 18 16:28:45 [236214] DBG:core:count_module_procs: modules require 1 extra processes
Sep 18 16:28:45 [236214] DBG:core:count_init_child_processes: 16 children are going to be inited
Sep 18 16:28:45 [236217] DBG:core:install_sigs: override SIGSEGV handler: success
Sep 18 16:28:45 [236217] DBG:core:set_core_dump: core dump limits set to 18446744073709551615
Sep 18 16:28:45 [236217] NOTICE:core:main: version: opensips 3.6.0-dev (x86_64/linux)
Sep 18 16:28:45 [236217] NOTICE:core:main: using 32 MB of shared memory, allocator: Q_MALLOC_DBG
Sep 18 16:28:45 [236217] NOTICE:core:main: using 16 MB of private process memory, allocator: Q_MALLOC_DBG
Sep 18 16:28:45 [236217] INFO:core:init_reactor_size: reactor size 1024 (using up to 0.04Mb of memory per process)
Sep 18 16:28:45 [236217] DBG:core:ipc_register_handler: IPC type 0 [RPC] registered with handler (nil)
Sep 18 16:28:45 [236217] DBG:core:__search_avp_map: looking for [serial_branch] avp  - found -1
Sep 18 16:28:45 [236217] DBG:core:new_avp_alias: added alias serial_branch with id 1
Sep 18 16:28:45 [236217] DBG:core:find_mod_export: <trace_bind_api> in module proto_hep not found
Sep 18 16:28:45 [236217] DBG:core:trace_prot_bind: <proto_hep> has no bind api function
Sep 18 16:28:45 [236217] DBG:core:try_load_trace_api: No tracing module used!
Sep 18 16:28:45 [236217] INFO:core:evi_publish_event: Registered event <E_CORE_THRESHOLD(1)>
Sep 18 16:28:45 [236217] INFO:core:evi_publish_event: Registered event <E_CORE_SHM_THRESHOLD(2)>
Sep 18 16:28:45 [236217] INFO:core:evi_publish_event: Registered event <E_CORE_PKG_THRESHOLD(3)>
Sep 18 16:28:45 [236217] INFO:core:evi_publish_event: Registered event <E_CORE_PROC_AUTO_SCALE(4)>
Sep 18 16:28:45 [236217] INFO:core:evi_publish_event: Registered event <E_CORE_TCP_DISCONNECT(5)>
Sep 18 16:28:45 [236217] DBG:core:init_mod: initializing module tm
Sep 18 16:28:45 [236217] INFO:tm:mod_init: TM - initializing...
Sep 18 16:28:45 [236217] DBG:core:get_flag_id_by_name: Flag name is null!
Sep 18 16:28:45 [236217] DBG:tm:init_callid: Call-ID initialization: '36ece506127557a6'
Sep 18 16:28:45 [236217] DBG:tm:lock_initialize: lock initialization started
Sep 18 16:28:45 [236217] DBG:tm:tm_init_timers: creating 0 parallel timer structures
Sep 18 16:28:45 [236217] DBG:tm:mod_init: timer set shift is 3
Sep 18 16:28:45 [236217] DBG:core:MD5StringArray: MD5 calculated: d0d1c9455982febf81eb76e31906bf37
Sep 18 16:28:45 [236217] DBG:core:register_pv_context: Registered new context: request / 0x7fc5bd1fa3f0
Sep 18 16:28:45 [236217] DBG:core:register_pv_context: Context not found
Sep 18 16:28:45 [236217] DBG:core:register_pv_context: Registered new context: reply / 0x7fc5bd1fa261
Sep 18 16:28:45 [236217] DBG:core:register_pv_context: Context not found
Sep 18 16:28:45 [236217] DBG:core:add_pv_context: Registered new context: reply
Sep 18 16:28:45 [236217] DBG:core:__search_avp_map: looking for [uac_ctx] avp  - found -1
Sep 18 16:28:45 [236217] DBG:core:new_avp_alias: added alias uac_ctx with id 2
Sep 18 16:28:45 [236217] DBG:tm:tm_init_cluster: tm_replication_cluster not set - not engaging!
Sep 18 16:28:45 [236217] DBG:core:init_mod: registering stats for tm
Sep 18 16:28:45 [236217] DBG:core:init_mod: register MI for tm
Sep 18 16:28:45 [236217] DBG:core:init_mod: initializing module sl
Sep 18 16:28:45 [236217] INFO:sl:mod_init: Initializing StateLess engine
Sep 18 16:28:45 [236217] DBG:core:init_mod: registering stats for sl
Sep 18 16:28:45 [236217] DBG:core:init_mod: initializing module signaling
Sep 18 16:28:45 [236217] NOTICE:signaling:mod_init: initializing module ...
Sep 18 16:28:45 [236217] DBG:core:find_mod_cmd_export_t: found <load_tm> in module tm [/usr/local/opensips-upstream/lib64/opensips/modules/]
Sep 18 16:28:45 [236217] DBG:core:find_mod_cmd_export_t: found <load_sl> in module sl [/usr/local/opensips-upstream/lib64/opensips/modules/]
Sep 18 16:28:45 [236217] DBG:core:init_mod: initializing module rr
Sep 18 16:28:45 [236217] INFO:rr:mod_init: rr - initializing
Sep 18 16:28:45 [236217] DBG:core:init_mod: initializing module maxfwd
Sep 18 16:28:45 [236217] INFO:maxfwd:mod_init: initializing...
Sep 18 16:28:45 [236217] DBG:core:init_mod: initializing module sipmsgops
Sep 18 16:28:45 [236217] INFO:sipmsgops:mod_init: initializing...
Sep 18 16:28:45 [236217] DBG:core:init_mod: initializing module mi_fifo
Sep 18 16:28:45 [236217] DBG:mi_fifo:mi_mod_init: testing fifo existence ...
Sep 18 16:28:45 [236217] DBG:core:init_mod: initializing module proto_udp
Sep 18 16:28:45 [236217] INFO:core:mod_init: initializing UDP-plain protocol
Sep 18 16:28:45 [236217] DBG:core:init_mod: initializing module proto_bin
Sep 18 16:28:45 [236217] INFO:proto_bin:mod_init: initializing BIN protocol
Sep 18 16:28:45 [236217] DBG:core:init_mod: initializing module clusterer
Sep 18 16:28:45 [236217] INFO:clusterer:mod_init: Clusterer module - initializing
Sep 18 16:28:45 [236217] DBG:core:find_mod_export: found <db_bind_api> in module db_mysql [/usr/local/opensips-upstream/lib64/opensips/modules/]
Sep 18 16:28:45 [236217] DBG:core:db_bind_mod: using db bind api for db_mysql
Sep 18 16:28:45 [236217] DBG:core:db_do_init: connection 0x7fc5bf2c4a30 not found in pool
Sep 18 16:28:45 [236217] DBG:db_mysql:db_mysql_connect: opening connection: mysql://xxxx:xxxx@localhost/opensips_upstream
Sep 18 16:28:45 [236217] DBG:db_mysql:db_mysql_connect: connection type is Localhost via UNIX socket
Sep 18 16:28:45 [236217] DBG:db_mysql:db_mysql_connect: protocol version is 10
Sep 18 16:28:45 [236217] DBG:db_mysql:db_mysql_connect: server version is 5.5.5-10.3.39-MariaDB-0ubuntu0.20.04.2
Sep 18 16:28:45 [236217] DBG:core:db_do_init: connection 0x7fc5bf2c4a30 inserted in pool as 0x7fc5bf2c4cc0
Sep 18 16:28:45 [236217] DBG:db_mysql:mysql_raise_event: event not registered -1
Sep 18 16:28:45 [236217] DBG:core:db_new_result: allocate 48 bytes for result set at 0x7fc5bf2c5388
Sep 18 16:28:45 [236217] DBG:db_mysql:db_mysql_get_columns: 1 columns returned from the query
Sep 18 16:28:45 [236217] DBG:core:db_allocate_columns: allocate 28 bytes for result columns at 0x7fc5bf2c5420
Sep 18 16:28:45 [236217] DBG:db_mysql:db_mysql_get_columns: RES_NAMES(0x7fc5bf2c5428)[0]=[table_version]
Sep 18 16:28:45 [236217] DBG:db_mysql:db_mysql_get_columns: use DB_INT result type
Sep 18 16:28:45 [236217] DBG:core:db_allocate_rows: allocate 48 bytes for result rows and values at 0x7fc5bf2c54a8
Sep 18 16:28:45 [236217] DBG:db_mysql:db_mysql_str2val: converting INT [4]
Sep 18 16:28:45 [236217] DBG:core:db_free_columns: freeing result columns at 0x7fc5bf2c5420
Sep 18 16:28:45 [236217] DBG:core:db_free_rows: freeing 1 rows
Sep 18 16:28:45 [236217] DBG:core:db_free_row: freeing row values at 0x7fc5bf2c54b8
Sep 18 16:28:45 [236217] DBG:core:db_free_rows: freeing rows at 0x7fc5bf2c54a8
Sep 18 16:28:45 [236217] DBG:core:db_free_result: freeing result set at 0x7fc5bf2c5388
Sep 18 16:28:45 [236217] DBG:clusterer:load_db_info: DB query - retrieve the list of clusters in which the local node runs
Sep 18 16:28:45 [236217] DBG:db_mysql:mysql_raise_event: event not registered -1
Sep 18 16:28:45 [236217] DBG:core:db_new_result: allocate 48 bytes for result set at 0x7fc5bf2c5388
Sep 18 16:28:45 [236217] DBG:db_mysql:db_mysql_get_columns: 1 columns returned from the query
Sep 18 16:28:45 [236217] DBG:core:db_allocate_columns: allocate 28 bytes for result columns at 0x7fc5bf2c5420
Sep 18 16:28:45 [236217] DBG:db_mysql:db_mysql_get_columns: RES_NAMES(0x7fc5bf2c5428)[0]=[cluster_id]
Sep 18 16:28:45 [236217] DBG:db_mysql:db_mysql_get_columns: use DB_INT result type
Sep 18 16:28:45 [236217] DBG:core:db_allocate_rows: allocate 48 bytes for result rows and values at 0x7fc5bf2c54a8
Sep 18 16:28:45 [236217] DBG:db_mysql:db_mysql_str2val: converting INT [1]
Sep 18 16:28:45 [236217] DBG:clusterer:load_db_info: 1 rows found in clusterer
Sep 18 16:28:45 [236217] DBG:core:db_free_columns: freeing result columns at 0x7fc5bf2c5420
Sep 18 16:28:45 [236217] DBG:core:db_free_rows: freeing 1 rows
Sep 18 16:28:45 [236217] DBG:core:db_free_row: freeing row values at 0x7fc5bf2c54b8
Sep 18 16:28:45 [236217] DBG:core:db_free_rows: freeing rows at 0x7fc5bf2c54a8
Sep 18 16:28:45 [236217] DBG:core:db_free_result: freeing result set at 0x7fc5bf2c5388
Sep 18 16:28:45 [236217] DBG:clusterer:load_db_info: DB query - retrieve nodes info
Sep 18 16:28:45 [236217] DBG:db_mysql:mysql_raise_event: event not registered -1
Sep 18 16:28:45 [236217] DBG:core:db_new_result: allocate 48 bytes for result set at 0x7fc5bf2c5388
Sep 18 16:28:45 [236217] DBG:db_mysql:db_mysql_get_columns: 10 columns returned from the query
Sep 18 16:28:45 [236217] DBG:core:db_allocate_columns: allocate 280 bytes for result columns at 0x7fc5bf2c5420
Sep 18 16:28:45 [236217] DBG:db_mysql:db_mysql_get_columns: RES_NAMES(0x7fc5bf2c5470)[0]=[id]
Sep 18 16:28:45 [236217] DBG:db_mysql:db_mysql_get_columns: use DB_INT result type
Sep 18 16:28:45 [236217] DBG:db_mysql:db_mysql_get_columns: RES_NAMES(0x7fc5bf2c5480)[1]=[cluster_id]
Sep 18 16:28:45 [236217] DBG:db_mysql:db_mysql_get_columns: use DB_INT result type
Sep 18 16:28:45 [236217] DBG:db_mysql:db_mysql_get_columns: RES_NAMES(0x7fc5bf2c5490)[2]=[node_id]
Sep 18 16:28:45 [236217] DBG:db_mysql:db_mysql_get_columns: use DB_INT result type
Sep 18 16:28:45 [236217] DBG:db_mysql:db_mysql_get_columns: RES_NAMES(0x7fc5bf2c54a0)[3]=[url]
Sep 18 16:28:45 [236217] DBG:db_mysql:db_mysql_get_columns: use DB_STRING result type
Sep 18 16:28:45 [236217] DBG:db_mysql:db_mysql_get_columns: RES_NAMES(0x7fc5bf2c54b0)[4]=[state]
Sep 18 16:28:45 [236217] DBG:db_mysql:db_mysql_get_columns: use DB_INT result type
Sep 18 16:28:45 [236217] DBG:db_mysql:db_mysql_get_columns: RES_NAMES(0x7fc5bf2c54c0)[5]=[no_ping_retries]
Sep 18 16:28:45 [236217] DBG:db_mysql:db_mysql_get_columns: use DB_INT result type
Sep 18 16:28:45 [236217] DBG:db_mysql:db_mysql_get_columns: RES_NAMES(0x7fc5bf2c54d0)[6]=[priority]
Sep 18 16:28:45 [236217] DBG:db_mysql:db_mysql_get_columns: use DB_INT result type
Sep 18 16:28:45 [236217] DBG:db_mysql:db_mysql_get_columns: RES_NAMES(0x7fc5bf2c54e0)[7]=[sip_addr]
Sep 18 16:28:45 [236217] DBG:db_mysql:db_mysql_get_columns: use DB_STRING result type
Sep 18 16:28:45 [236217] DBG:db_mysql:db_mysql_get_columns: RES_NAMES(0x7fc5bf2c54f0)[8]=[flags]
Sep 18 16:28:45 [236217] DBG:db_mysql:db_mysql_get_columns: use DB_STRING result type
Sep 18 16:28:45 [236217] DBG:db_mysql:db_mysql_get_columns: RES_NAMES(0x7fc5bf2c5500)[9]=[description]
Sep 18 16:28:45 [236217] DBG:db_mysql:db_mysql_get_columns: use DB_STRING result type
Sep 18 16:28:45 [236217] DBG:core:db_allocate_rows: allocate 672 bytes for result rows and values at 0x7fc5bf2c6858
Sep 18 16:28:45 [236217] DBG:db_mysql:db_mysql_str2val: converting INT [2]
Sep 18 16:28:45 [236217] DBG:db_mysql:db_mysql_str2val: converting INT [1]
Sep 18 16:28:45 [236217] DBG:db_mysql:db_mysql_str2val: converting INT [20]
Sep 18 16:28:45 [236217] DBG:db_mysql:db_mysql_str2val: converting STRING [bin:XXXXX8-XXX2g.XX.XXX.XXXX.XXXXXXXXXXXXXX.XXX:5566]
Sep 18 16:28:45 [236217] DBG:db_mysql:db_mysql_str2val: converting INT [1]
Sep 18 16:28:45 [236217] DBG:db_mysql:db_mysql_str2val: converting INT [3]
Sep 18 16:28:45 [236217] DBG:db_mysql:db_mysql_str2val: converting INT [50]
Sep 18 16:28:45 [236217] DBG:db_mysql:db_mysql_str2val: converting STRING [10.192.200.39]
Sep 18 16:28:45 [236217] DBG:db_mysql:db_mysql_str2val: converting STRING [seed]
Sep 18 16:28:45 [236217] DBG:db_mysql:db_mysql_str2val: converting STRING [neighbor]
Sep 18 16:28:45 [236217] DBG:db_mysql:db_mysql_str2val: converting INT [1]
Sep 18 16:28:45 [236217] DBG:db_mysql:db_mysql_str2val: converting INT [1]
Sep 18 16:28:45 [236217] DBG:db_mysql:db_mysql_str2val: converting INT [200]
Sep 18 16:28:45 [236217] DBG:db_mysql:db_mysql_str2val: converting STRING [bin:10.192.200.39:5566]
Sep 18 16:28:45 [236217] DBG:db_mysql:db_mysql_str2val: converting INT [1]
Sep 18 16:28:45 [236217] DBG:db_mysql:db_mysql_str2val: converting INT [3]
Sep 18 16:28:45 [236217] DBG:db_mysql:db_mysql_str2val: converting INT [50]
Sep 18 16:28:45 [236217] DBG:db_mysql:db_mysql_str2val: converting STRING [10.192.200.39]
Sep 18 16:28:45 [236217] DBG:db_mysql:db_mysql_str2val: converting STRING [seed]
Sep 18 16:28:45 [236217] DBG:db_mysql:db_mysql_str2val: converting STRING [local_node]
Sep 18 16:28:45 [236217] DBG:clusterer:load_db_info: 2 rows found in clusterer
Sep 18 16:28:45 [236217] DBG:clusterer:add_node_info: adding node: XXXXX8-XXX2g.XX.XXX.XXXX.XXXXXXXXXXXXXX.XXX:5566 (df: 5555)
Sep 18 16:28:45 [236217] DBG:core:sip_resolvehost: has port -> do A record lookup!
Sep 18 16:28:45 [236217] ERROR:clusterer:add_node_info: Cannot resolve host: XXXXX8-XXX2g.XX.XXX.XXXX.XXXXXXXXXXXXXX.XXX
Sep 18 16:28:45 [236217] ERROR:clusterer:load_db_info: Unable to add node info to backing list
Sep 18 16:28:45 [236217] DBG:core:pool_remove: removing connection from the pool
Sep 18 16:28:45 [236217] INFO:core:evi_publish_event: Registered event <E_CLUSTERER_REQ_RECEIVED(6)>
Sep 18 16:28:45 [236217] INFO:core:evi_publish_event: Registered event <E_CLUSTERER_RPL_RECEIVED(7)>
Sep 18 16:28:45 [236217] INFO:core:evi_publish_event: Registered event <E_CLUSTERER_NODE_STATE_CHANGED(8)>
Sep 18 16:28:45 [236217] DBG:core:_create_new_group: adding new group [clusterer]
Sep 18 16:28:45 [236217] DBG:core:sr_register_identifier: adding new identifier [node_states] to group [clusterer]
Sep 18 16:28:45 [236217] DBG:core:sr_register_identifier: adding new identifier [sharing_tags] to group [clusterer]
Sep 18 16:28:45 [236217] INFO:core:evi_publish_event: Registered event <E_CLUSTERER_SHARING_TAG_CHANGED(9)>
Sep 18 16:28:45 [236217] DBG:core:init_mod: registering stats for clusterer
Sep 18 16:28:45 [236217] DBG:core:init_mod: register MI for clusterer
Sep 18 16:28:45 [236217] DBG:core:init_mod: initializing module db_mysql
Sep 18 16:28:45 [236217] DBG:db_mysql:mysql_mod_init: mysql: MySQL client version is 8.0.39
Sep 18 16:28:45 [236217] INFO:core:evi_publish_event: Registered event <E_MYSQL_CONNECTION(10)>
Sep 18 16:28:45 [236217] DBG:core:init_mod: initializing module dialog
Sep 18 16:28:45 [236217] INFO:dialog:mod_init: Dialog module - initializing
Sep 18 16:28:45 [236217] INFO:core:evi_publish_event: Registered event <E_DLG_STATE_CHANGED(11)>
Sep 18 16:28:45 [236217] DBG:dialog:add_profile_definitions: creating profile <trunkCalls> 
Sep 18 16:28:45 [236217] DBG:core:find_mod_cmd_export_t: found <load_tm> in module tm [/usr/local/opensips-upstream/lib64/opensips/modules/]
Sep 18 16:28:45 [236217] DBG:core:find_mod_cmd_export_t: found <load_rr> in module rr [/usr/local/opensips-upstream/lib64/opensips/modules/]
Sep 18 16:28:45 [236217] DBG:core:find_mod_cmd_export_t: found <load_clusterer> in module clusterer [/usr/local/opensips-upstream/lib64/opensips/modules/]
Sep 18 16:28:45 [236217] DBG:core:sr_register_identifier: adding new identifier [cap:dialog-dlg-repl] to group [clusterer]
Sep 18 16:28:45 [236217] DBG:clusterer:cl_register_cap: Registered capability: dialog-dlg-repl
Sep 18 16:28:45 [236217] DBG:clusterer:cl_request_sync: requesting dialog-dlg-repl sync in cluster 1
Sep 18 16:28:45 [236217] DBG:clusterer:cl_request_sync: donor node not found
Sep 18 16:28:45 [236217] CRITICAL:core:sig_usr: segfault in attendant (starter) process!
Sep 18 16:28:45 [236217] DBG:core:restore_segv_handler: restoring SIGSEGV handler...
Sep 18 16:28:45 [236217] DBG:core:restore_segv_handler: successfully restored system SIGSEGV handler
Sep 18 16:28:45 [236214] DBG:core:wait_status_code: read code 28 (0 byte)
Sep 18 16:28:45 [236214] INFO:core:daemonize: pre-daemon process exiting with -1

backtrace full

#0  queue_sync_request (cluster=0x7fc5bda51800, lcap=0x7fc5bda560f0) at sync.c:97
        __FUNCTION__ = "queue_sync_request"
#1  0x00007fc5bd0f2003 in cl_request_sync (capability=0x7fc5bd0dea10 <dlg_repl_cap>, cluster_id=1, from_cb=0) at sync.c:174
        cluster = 0x7fc5bda51800
        lcap = 0x7fc5bda560f0
        source_id = 0
        rc = -1
        __FUNCTION__ = "cl_request_sync"
#2  0x00007fc5bd08bc53 in dlg_init_clustering () at dlg_replication.c:160
        __FUNCTION__ = "dlg_init_clustering"
#3  0x00007fc5bd06cbc6 in mod_init () at dialog.c:843
        n = 59809476
        __FUNCTION__ = "mod_init"
#4  0x000055a21d03b87f in init_mod (m=0x7fc5bf2b6980, skip_others=0) at sr_module.c:761
        dep = 0x0
        __FUNCTION__ = "init_mod"
#5  0x000055a21d03b73f in init_mod (m=0x7fc5bf2b7450, skip_others=0) at sr_module.c:742
        dep = 0x55a21d3e33e0 <cdb_slow_queries>
        __FUNCTION__ = "init_mod"
#6  0x000055a21d03bd10 in init_modules () at sr_module.c:821
        currentMod = 0x0
        ret = 32764
        __FUNCTION__ = "init_modules"
#7  0x000055a21d037055 in main (argc=4, argv=0x7ffc6bcdf728) at main.c:880
        result = 0
        pred_cmp = 0x55a21d033932 <op_neq>
        c = -1
        n = 12
        tmp = 0xf0b5ff <error: Cannot access memory at address 0xf0b5ff>
        tmp_len = 59809476
        port = -1
        proto = 0
        protos_no = 2
        options = 0x55a21d1dad48 "A:f:cCm:M:b:l:n:N:rRvdDFEVhw:t:u:g:p:P:G:W:o:a:k:s:"
        ret = -1
        seed = 2615788149
        rfd = 3
        procs_no = 8388608
        __FUNCTION__ = "main"

OS/environment information

  • Debian 10 && Ubuntu 20.04
  • OpenSIPS installation: 2.4.9 up to master branch (4d79a71)
  • other relevant information:

Additional context

  • Issue originally reported in Slack #general.
  • Attempted to patch this by guarding the statement with if (cluster->current_node), but ran into other issues in other areas of the code where the same access is attempted (timer, etc.), and after multiple attempts just got OpenSIPS deadlocked.
  • Addresses and IPs are redacted for obvious reasons; I can provide further information over Slack direct message (we can do a screen-share session, etc.).
@tommybrecher

@bogdan-iancu, a few more comments to add. I have found out why I was unable to reproduce this issue in my testing environment: cluster_list does have a ->current_cluster in my lab example.

I have pinpointed the problem as an ordering issue: the node_id of the local node is higher than the node_id of the remote node.

I haven't drilled down into the query that clusterer runs, but I assume it is ordering by node_id ASC.

In my case, since the lower node_id belonged to the remote node and that node failed DNS resolution, ->current_node was still NULL afterwards, resulting in the segmentation fault.
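
To make the ordering concrete, here is a simplified sketch of the pre-fix load path; the function and variable names are approximations inferred from the logs above (load_db_info, add_node_info), not the literal OpenSIPS code:

/* Approximate sketch of load_db_info()'s per-row loop (pre-fix).
 * With rows ordered by node_id ASC, the unresolvable remote node
 * (node_id 20) is processed before the local node (node_id 200). */
for (i = 0; i < no_rows; i++) {
        if (add_node_info(&new_info, &cluster, int_vals, str_vals) < 0) {
                LM_ERR("Unable to add node info to backing list\n");
                /* pre-fix behaviour: abort the whole load here, before the
                 * local node's row is ever processed, leaving
                 * cluster->current_node == NULL for queue_sync_request() */
                goto error;
        }
}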

Additionally, I've found the right place to make OpenSIPS quit gracefully without core-dumping (though I'm not sure whether you want a more thorough fix).

See PR #3474

@tommybrecher tommybrecher changed the title [BUG] clusterer segfault when DB entry for clusterer can't be resolved [BUG] clusterer segfault when DB entry for clusterer can't be resolved and node_id for the local node is higher than the remote which can't be resolved Sep 18, 2024
@bogdan-iancu

Hey @tommybrecher, could you please test the fix I just pushed? If I get your ACK, I can do the backport. Thanks!

@bogdan-iancu bogdan-iancu reopened this Sep 23, 2024
@bogdan-iancu bogdan-iancu self-assigned this Sep 23, 2024
@bogdan-iancu bogdan-iancu added this to the 3.6-dev milestone Sep 23, 2024
@tommybrecher

Hi @bogdan-iancu, OpenSIPS did not segfault this time, but I'm not sure the reported clusterer status is correct.

I only see the address that failed to resolve (node_id 20) in the listing; it appears we never got to the next clusterer entry (node_id 200), posted originally in this issue.

Is this expected?

{
    "Clusters": [
        {
            "cluster_id": 1,
            "Nodes": [
                {
                    "node_id": 20,
                    "db_id": 2,
                    "url": "bin:XXXXX8-XX2g.XX.XXX.XXXX.XXXXXXXXXXXXXX.XXX:5566",
                    "link_state": "Down",
                    "state": "enabled",
                    "next_hop": "none",
                    "description": "neighbor"
                }
            ]
        }
    ]
}

@bogdan-iancu

@tommybrecher, sorry for the delay here. What error logs do you see at startup? The code has some explicit logic NOT to discard the local node itself: if that node cannot be added, the load will fail.

@tommybrecher

Hi @bogdan-iancu, no worries at all about the delay; I understand the nature of open source and appreciate the time and effort you put into this :)

I would have to re-test this in the lab to confirm what's being logged. I'll try to get that done today.

liviuchircu pushed a commit that referenced this issue Oct 23, 2024
On failing to add a node (to the cluster), continue, as long as that node is not the current node.
And if giving up on loading the whole cluster info, be sure we clean up and properly report this to the layer above; the last thing we should do is report success (on load) while holding partial, inconsistent data.

Fixes #3473

(cherry picked from commit 784e1a4)
liviuchircu pushed a commit that referenced this issue Oct 23, 2024
On failing to add a node (to the cluster), continue, as long as that node is not the current node.
And if giving up on loading the whole cluster info, be sure we clean up and properly report this to the layer above; the last thing we should do is report success (on load) while holding partial, inconsistent data.

Fixes #3473

(cherry picked from commit 784e1a4)
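
In sketch form, the behaviour these commits describe might look like the following (illustrative pseudo-C against the loop sketched earlier, not the literal diff; NODE_ID_COL and current_id are hypothetical names):

/* Post-fix behaviour, sketched: a node that cannot be added is skipped,
 * unless it is the local node itself, in which case the whole load is
 * abandoned and the failure reported to the layer above. */
if (add_node_info(&new_info, &cluster, int_vals, str_vals) < 0) {
        if (int_vals[NODE_ID_COL] == current_id) {
                /* the local node is mandatory -- clean up and fail the load */
                LM_ERR("failed to add the local node, giving up on cluster\n");
                goto error;
        }
        /* a broken remote node is no longer fatal */
        LM_ERR("cannot add node, skipping it\n");
        continue;
}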