Description
I represent a team using Node.js within Microsoft. When running Node.js on machines under heavy load, we have found that some Node.js processes fail because an assertion fails in deps/uv/src/win/tty.c. This is the failing assertion (edited for brevity):
static void uv__tty_console_signal_resize(void) {
  ...
  uv_mutex_lock(&uv__tty_console_resize_mutex);
  assert(uv__tty_console_width != -1 && uv__tty_console_height != -1);  /* <-- this fails */
  if (width != uv__tty_console_width || height != uv__tty_console_height) {
    ...
  }
}
I believe the root cause is a race condition in uv_console_init():
void uv_console_init(void) {
  if (uv_sem_init(&uv_tty_output_lock, 1))
    abort();
  uv__tty_console_handle = CreateFileW(L"CONOUT$",
                                       GENERIC_READ | GENERIC_WRITE,
                                       FILE_SHARE_WRITE,
                                       0,
                                       OPEN_EXISTING,
                                       0,
                                       0);
  if (uv__tty_console_handle != INVALID_HANDLE_VALUE) {
    CONSOLE_SCREEN_BUFFER_INFO sb_info;
    /* <-- this starts a task in a thread pool */
    QueueUserWorkItem(uv__tty_console_resize_message_loop_thread,
                      NULL,
                      WT_EXECUTELONGFUNCTION);
    uv_mutex_init(&uv__tty_console_resize_mutex);
    if (GetConsoleScreenBufferInfo(uv__tty_console_handle, &sb_info)) {
      uv__tty_console_width = sb_info.dwSize.X;
      uv__tty_console_height = sb_info.srWindow.Bottom - sb_info.srWindow.Top + 1;
    }
  }
}
This code queues a task on a thread pool and only then queries the console size. If the thread pool task starts quickly enough, it runs the code that queries the console buffer size and checks for a resize before the first query in uv_console_init() has stored the width and height. At that point both values are still -1, so the assertion fails.
Also, the worker thread can call uv_mutex_lock(&uv__tty_console_resize_mutex) before the mutex has been initialized, which would be another source of crashes.
We see this on build machines, where we spawn 140,000+ Node.js processes on machines with very large CPU counts (128 or more cores). It is more common when running VMs than when running on bare metal (where we have rarely seen this).
We are running Node.js v18.13.0. I have checked the sources, and this issue appears to be present in v18.13.0 and all later versions, up to and including main.
The fix should be to move the QueueUserWorkItem call after the if (GetConsoleScreenBufferInfo(...)) { ... } block. That should guarantee that the mutex is initialized, and that the first call to GetConsoleScreenBufferInfo has occurred, before the resize thread can run.
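The proposed ordering can be modeled portably. The sketch below is not libuv code and is not the actual patch: it uses POSIX threads in place of QueueUserWorkItem, and every name in it (console_init_fixed, console_join_worker, resize_worker, query_console_size) as well as the fixed 80x25 console size is a hypothetical stand-in chosen for illustration. It only shows the intended sequence: initialize the mutex, store the console size, and start the worker last.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical stand-ins for the libuv statics shown in this issue. */
static pthread_mutex_t console_resize_mutex;
static int console_width = -1;
static int console_height = -1;
static pthread_t resize_thread;

/* Stand-in for GetConsoleScreenBufferInfo(): reports a fixed 80x25 console.
 * Returns nonzero on success. */
static int query_console_size(int* width, int* height) {
  *width = 80;
  *height = 25;
  return 1;
}

/* Stand-in for the resize worker. It locks the mutex and evaluates the same
 * invariant as the failing assertion, storing the result in *arg. */
static void* resize_worker(void* arg) {
  int* saw_valid_size = arg;
  pthread_mutex_lock(&console_resize_mutex);
  *saw_valid_size = (console_width != -1 && console_height != -1);
  pthread_mutex_unlock(&console_resize_mutex);
  return NULL;
}

/* Proposed ordering: mutex first, then the size query, then the worker.
 * As in the original code, a failed query leaves the size at -1.
 * Returns 0 on success, -1 if the mutex or the thread cannot be created. */
int console_init_fixed(int* saw_valid_size) {
  int width;
  int height;

  if (pthread_mutex_init(&console_resize_mutex, NULL) != 0)
    return -1;

  if (query_console_size(&width, &height)) {
    console_width = width;
    console_height = height;
  }

  /* The worker is started last, so it cannot observe an uninitialized
   * mutex or a size that has not been queried yet. */
  if (pthread_create(&resize_thread, NULL, resize_worker, saw_valid_size) != 0)
    return -1;
  return 0;
}

/* Waits for the worker to finish and releases the mutex. */
void console_join_worker(void) {
  pthread_join(resize_thread, NULL);
  pthread_mutex_destroy(&console_resize_mutex);
}
```

A caller would pass the address of an int to console_init_fixed(), call console_join_worker(), and then read the flag. Because the worker is created only after the mutex and size are set up, the flag should be 1 whenever the size query succeeded, regardless of how quickly the worker is scheduled.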