
Infinite connection retry when starting two training processes #3839

Closed

Description

How you are using LightGBM?

We use LightGBM in our project Mars for distributed model training. When we test our modules with separate processes to mock a distributed environment, the two processes sometimes fail to connect to each other, and LightGBM retries the connection in an infinite loop.

We found that all ports are opened successfully. The cause of the connection failure is that in

TcpSocket cur_socket;
int connect_fail_delay_time = connect_fail_retry_first_delay_interval;
for (int i = 0; i < connect_fail_retry_cnt; ++i) {
  if (cur_socket.Connect(client_ips_[out_rank].c_str(), client_ports_[out_rank])) {
    break;
  } else {
    Log::Warning("Connecting to rank %d failed, waiting for %d milliseconds", out_rank, connect_fail_delay_time);
    std::this_thread::sleep_for(std::chrono::milliseconds(connect_fail_delay_time));
    connect_fail_delay_time = static_cast<int>(connect_fail_delay_time * connect_fail_retry_delay_factor);
  }
}

when a connection attempt fails, the same socket handle is reused for the next attempt, the OS reports a bad file descriptor, and no later attempt can succeed.

We created a PR that fixes this by recreating the socket handle on every connection attempt.
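The idea of using a fresh socket for every attempt can be illustrated with a short Python sketch. This is not the C++ patch itself: `connect_with_retry` and its parameters are hypothetical names chosen for illustration, and it relies only on the standard `socket` module.

```python
import socket
import time


def connect_with_retry(host, port, retries=5, first_delay_ms=20, factor=1.3):
    """Try to connect, creating a brand-new socket for every attempt.

    Hypothetical helper: it mirrors the retry loop quoted above, except
    that a failed socket is closed and never reused.
    """
    delay_ms = first_delay_ms
    for _ in range(retries):
        # A new handle per attempt, so a failed connect cannot poison the next one.
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.connect((host, port))
            return sock
        except OSError:
            sock.close()
            time.sleep(delay_ms / 1000.0)
            delay_ms = int(delay_ms * factor)
    # All attempts failed.
    return None
```

The caller owns the returned socket and is responsible for closing it. A `None` result means every attempt failed.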

Choose one of the following components

  • Python package

Environment info

Operating System: macOS 10.15.7

CPU/GPU model: Intel Core i7

Python version: 3.8.5

LightGBM version or commit hash: ac706e1

Error message and / or logs

[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Info] Trying to bind port 36838...
[LightGBM] [Info] Binding port 36838 succeeded
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Info] Listening...
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 200 milliseconds
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Info] Trying to bind port 36839...
[LightGBM] [Info] Binding port 36839 succeeded
[LightGBM] [Info] Listening...
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 260 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 338 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 439 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 570 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 741 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 963 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 1251 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 1626 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 2113 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 2746 milliseconds

Reproducible example(s)

We made a minimal example to demonstrate the issue.

import random
from concurrent.futures import ProcessPoolExecutor

import lightgbm
import numpy as np

N_ROWS = 10000
N_COLS = 10


def fit_part(x, y, ports, idx):
    params = dict(
        machines=','.join([f'127.0.0.1:{port}' for port in ports]),
        time_out=3600,
        num_machines=len(ports),
        local_listen_port=ports[idx],
        tree_learner='data',
    )
    model = lightgbm.LGBMRegressor(**params)
    model.fit(x, y)
    return model


def main():
    rs = np.random.RandomState(0)
    start_port = random.randint(10000, 60000)
    ports = [start_port, start_port + 1]

    X = rs.rand(N_ROWS, N_COLS)
    y = rs.rand(N_ROWS)

    proc_pool = ProcessPoolExecutor(2)

    try:
        f1 = proc_pool.submit(fit_part, X[:N_ROWS // 2, :], y[:N_ROWS // 2], ports, 0)
        f2 = proc_pool.submit(fit_part, X[N_ROWS // 2:N_ROWS, :], y[N_ROWS // 2:N_ROWS], ports, 1)

        f1.result()
        f2.result()
    except KeyboardInterrupt:
        pass


if __name__ == '__main__':
    main()