Description
Opened on Jan 24, 2021
How you are using LightGBM?
We use LightGBM in our project Mars to train models in a distributed fashion. When we test our modules with separate processes to mock a distributed environment, two processes sometimes fail to connect to each other, and LightGBM retries the connection in an infinite loop.
We found that all ports are opened successfully. The cause of the connection failure is in
LightGBM/src/network/linkers_socket.cpp
Lines 200 to 210 in ac706e1
When a connection attempt fails, the same socket handle is reused for the next attempt. The OS then reports a bad file descriptor, so no later connection attempt can succeed.
We created a PR that recreates the socket handle on every connection attempt.
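To illustrate the idea behind the fix, here is a minimal Python sketch of a retry loop that opens a fresh socket for every attempt. It is not the code from the PR and not LightGBM's C++ implementation. The function name `connect_with_retry` and all of its parameters and defaults are hypothetical and chosen only for this example.

```python
import socket
import time


def connect_with_retry(host, port, max_attempts=5, first_wait=0.2, factor=1.3):
    """Try to connect, creating a brand-new socket for every attempt.

    Returns a connected socket, or None if every attempt failed.
    Names and defaults are illustrative assumptions, not LightGBM's.
    """
    wait = first_wait
    for attempt in range(max_attempts):
        # Fresh handle per attempt: a socket left over from a failed
        # connect() is never reused.
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.connect((host, port))
            return sock
        except OSError:
            # Discard the failed handle before retrying.
            sock.close()
            if attempt + 1 < max_attempts:
                time.sleep(wait)
                wait *= factor
    return None
```

The key point is that `socket.socket(...)` is called inside the loop, so a failed attempt cannot leave the next one with an unusable descriptor.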
Choose one of the following components
- Python package
Environment info
Operating System: macOS 10.15.7
CPU/GPU model: Intel Core i7
Python version: 3.8.5
LightGBM version or commit hash: ac706e1
Error message and / or logs
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Info] Trying to bind port 36838...
[LightGBM] [Info] Binding port 36838 succeeded
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Info] Listening...
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 200 milliseconds
[LightGBM] [Warning] Set TCP_NODELAY failed
[LightGBM] [Info] Trying to bind port 36839...
[LightGBM] [Info] Binding port 36839 succeeded
[LightGBM] [Info] Listening...
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 260 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 338 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 439 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 570 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 741 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 963 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 1251 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 1626 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 2113 milliseconds
[LightGBM] [Warning] Connecting to rank 1 failed, waiting for 2746 milliseconds
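The wait times in the log above look like an exponential backoff that starts at 200 ms and grows by a factor of 1.3 with the result truncated to an integer. This is inferred from the logged values only, not from the LightGBM source. A small sketch that reproduces the sequence under that assumption (the function name `backoff_schedule` is hypothetical):

```python
def backoff_schedule(first_wait_ms=200, retries=11):
    """Return the wait times (ms) for a 1.3x backoff with integer truncation.

    The factor and starting value are assumptions read off the log.
    """
    waits = []
    wait = first_wait_ms
    for _ in range(retries):
        waits.append(wait)
        # Multiply by 1.3 using integer arithmetic, truncating the result.
        wait = wait * 13 // 10
    return waits


print(backoff_schedule())
# [200, 260, 338, 439, 570, 741, 963, 1251, 1626, 2113, 2746]
```

The waits keep growing while every attempt fails, which is consistent with the connection never recovering once the socket handle has become invalid.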
Reproducible example(s)
We made a minimal example that demonstrates the issue.
import random
from concurrent.futures import ProcessPoolExecutor

import lightgbm
import numpy as np

N_ROWS = 10000
N_COLS = 10


def fit_part(x, y, ports, idx):
    # Each worker listens on its own port and is given the full machine list.
    params = dict(
        machines=','.join([f'127.0.0.1:{port}' for port in ports]),
        time_out=3600,
        num_machines=len(ports),
        local_listen_port=ports[idx],
        tree_learner='data',
    )
    model = lightgbm.LGBMRegressor(**params)
    model.fit(x, y)
    return model


def main():
    rs = np.random.RandomState(0)
    # Two adjacent ports on localhost, one per worker process.
    start_port = random.randint(10000, 60000)
    ports = [start_port, start_port + 1]
    X = rs.rand(N_ROWS, N_COLS)
    y = rs.rand(N_ROWS)

    proc_pool = ProcessPoolExecutor(2)
    try:
        # Each process trains on one half of the data.
        f1 = proc_pool.submit(fit_part, X[:N_ROWS // 2, :], y[:N_ROWS // 2], ports, 0)
        f2 = proc_pool.submit(fit_part, X[N_ROWS // 2:N_ROWS, :], y[N_ROWS // 2:N_ROWS], ports, 1)
        f1.result()
        f2.result()
    except KeyboardInterrupt:
        pass


if __name__ == '__main__':
    main()