print(f"[{ip}] [{datetime.now()}] 💥 Exception during exec_command: {e}")
178
+
ssh.close()
179
+
return ip, private_ip, False
180
+
181
+
finally:
182
+
stdin.close()
183
+
stdout.close()
184
+
stderr.close()
185
+
186
+
187
+
ssh.close()
188
+
transport = ssh.get_transport()
189
+
if transport is not None:
190
+
transport.close()
191
+
print(f"Installation completed on {ip}")
192
+
return ip, private_ip, True
193
+
194
+
```

In the latest test run I had to increase the swap to 20GB to prevent the VPS from hanging. Once that was done, the 450/25 test used about 17GB of the 20GB swap and I was able to see 7 failures with the code above.

Correlating the GitLab console logs and the debug output above with the process-level and main pooling-level orchestration logs revealed the following:

7 Failures Isolated

- All occurred between timestamps [T+14min to T+16min], deep into the 450/25 cycle
- 5 match the ghost profile: no `Installation completed` marker, no exit status, hung `recv()` with no output
- 2 exited prematurely but did execute `exec_command()`, indicating a socket or fork failure downstream

Ghost Indicators:

- Threads hung at `channel.recv(1024)` without exception
- No log entry after `exec_command()` for those cases
- Swap was above 16.8GB, suggesting buffer delay or scheduler stalls, not memory depletion

Process-Level Logs:

- Missing heartbeat timestamps on 5 processes → confirms they stalled post-connection (a heartbeat sketch follows after these traces)
- The others show clean command dispatch but dead ends in channel echo collection

Main Log Trace:

- Dispatcher launched all 25 in that burst correctly
- 5 stuck at recv with no retries
- No thread timeout logic triggered → perfect test case for inserting watchdogs later
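
Those heartbeat timestamps are what make the stalls visible, so it is worth sketching what a heartbeat-emitting wait loop could look like. This is only an illustration, not the actual worker code: `ip`, the log format, and the thresholds are assumptions, and `channel` is assumed to be an already-open Paramiko channel.

```
import time
from datetime import datetime

def drain_with_heartbeat(channel, ip, beat_every=30, give_up_after=900):
    """Read command output while logging periodic heartbeats, so a ghost
    shows up as a gap (or a 'giving up' line) in the per-process log."""
    start = time.time()
    last_beat = start
    chunks = []
    while not channel.exit_status_ready() or channel.recv_ready():
        if channel.recv_ready():
            chunks.append(channel.recv(1024))
        now = time.time()
        if now - last_beat >= beat_every:
            print(f"[{ip}] [{datetime.now()}] heartbeat: still waiting on channel output")
            last_beat = now
        if now - start > give_up_after:
            print(f"[{ip}] [{datetime.now()}] watchdog: no completion after {give_up_after}s, giving up")
            break
        time.sleep(1)
    return b"".join(chunks)
```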
These are the 7 failed threads:

| Thread ID | Timestamp | Type | Symptom | Suggested Fix |
|-----------|-----------|------|---------|---------------|

| Failure Type | Symptom | Indicator | Suggested Fix |
|--------------|---------|-----------|---------------|
| Timeout | Channel blocks too long | `socket.timeout` raised | Retry with exponential backoff |
| Stall (Ghost) | Output never received, no error | `None` from `recv()` or hangs | Watchdog timer + force exit |
| Early Exit | Channel closes before command completes | `exit_status_ready()` too soon | Check exit code + validate output |
### Failure type one: Timeout

```
channel.settimeout(90)
channel.exec_command(cmd)
stdout = channel.recv(1024)  # <--- blocks too long, then raises socket.timeout
```

- A `socket.timeout` exception
- No output received within the expected window
- Easily caught with `try/except`

A possible fix to introduce into the code:
```
for attempt in range(3):
    try:
        stdout = channel.recv(1024)
        break
    except socket.timeout:
        time.sleep(5)
```

Also consider tightening or relaxing `settimeout()` depending on system load.
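
The fixed 5-second sleep above can also be swapped for an exponential backoff, with the channel timeout relaxed a little on each attempt. This is a sketch, not the repo's code: the helper name and the specific numbers are arbitrary, and `channel` is assumed to be an already-open Paramiko channel.

```
import socket
import time

def recv_with_backoff(channel, base_timeout=90, attempts=3):
    """Retry a blocking recv with exponential backoff, widening the
    channel timeout slightly on each attempt."""
    for attempt in range(attempts):
        channel.settimeout(base_timeout * (attempt + 1))
        try:
            return channel.recv(1024)
        except socket.timeout:
            delay = 5 * (2 ** attempt)  # 5s, 10s, 20s between attempts
            print(f"recv timed out on attempt {attempt + 1}, backing off {delay}s")
            time.sleep(delay)
    raise RuntimeError("no output received after retries with backoff")
```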
### Failure type two: Stall failures (Ghost)

The channel call doesn’t raise an error; it just sits indefinitely or returns `None`.

```
stdout = channel.recv(1024)  # returns None, never throws
```

- No exception is raised
- The thread appears "alive" but makes no progress
- Worker pool gets clogged silently

For these we first need to detect them:

- Use forensic timestamps to show that:
  - The `exec_command()` was sent
  - No output was received after a reasonable delay
- Inject manual inactivity watchdogs:

```
start = time.time()
while True:
    if channel.recv_ready():
        stdout = channel.recv(1024)
        break
    if time.time() - start > 90:
        raise RuntimeError("SSH stall detected")
    time.sleep(1)
```
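
The same deadline idea can be enforced one level up, at the pooling/orchestration layer, so a ghost worker gets reported and its slot reclaimed instead of silently clogging the pool. Below is a minimal sketch using `multiprocessing.Pool` with a per-task timeout; the `worker` stub, the argument shape, and the 1200-second deadline are assumptions, not the repo's actual `tomcat_worker_wrapper`.

```
import multiprocessing

def worker(args):
    # Hypothetical stand-in for the real per-host worker
    # (the actual tomcat_worker_wrapper signature may differ).
    ip, private_ip = args
    return ip, private_ip, True

def dispatch(batch, pool_size=25, deadline=1200):
    """Give each task a hard deadline so a stalled worker is flagged
    as failed instead of blocking the rest of the 450/25 run."""
    results = []
    with multiprocessing.Pool(processes=pool_size) as pool:
        pending = [(args, pool.apply_async(worker, (args,))) for args in batch]
        for args, res in pending:
            try:
                # Deadlines are checked sequentially here, which is
                # coarse but good enough for a watchdog.
                results.append(res.get(timeout=deadline))
            except multiprocessing.TimeoutError:
                # The stuck process is only reported; the pool's
                # terminate() on exit cleans it up.
                print(f"watchdog: {args[0]} exceeded {deadline}s, marking failed")
                results.append((args[0], args[1], False))
    return results

if __name__ == "__main__":
    print(dispatch([("10.0.0.1", "172.31.0.1"), ("10.0.0.2", "172.31.0.2")]))
```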
### Failure type three: Early exit

Here, the process exits before completing SSH output collection.

These are caused by the following:

- Fork pressure
- Swap delays
- Broken pipe in the SSH stream

Detection involves the following:

- `"Installation completed"` never logged
- Unexpected thread exit without traceback
- `channel.exit_status_ready()` == True but with `None` output

Intercept these with the following code:

```
status = channel.recv_exit_status()  # blocks until the remote command exits
if status != 0:
    logger.warning("Command exited early or failed silently")
```
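
To pair the exit-code check with output validation (the "check exit code + validate output" fix from the table above), something like the sketch below could be used. The `TOMCAT_INSTALL_OK` marker, the function name, and the logger setup are hypothetical, since the real install script's output format isn't shown here.

```
import logging

logger = logging.getLogger(__name__)

def validate_run(channel, collected_output, ip, marker="TOMCAT_INSTALL_OK"):
    """Fail the run unless the command exited 0 AND the expected
    marker actually appeared in the collected output."""
    status = channel.recv_exit_status()  # blocks until the remote command exits
    if status != 0:
        logger.warning("[%s] command exited with status %s", ip, status)
        return False
    if marker not in collected_output:
        logger.warning("[%s] exit status 0 but expected output marker missing", ip)
        return False
    return True
```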
## UPDATES: part 13: Hyper-scaling of processes (400+), SSH connect issues and troubleshooting and log forensics and correlation

After doing forensic correlation between the process-level logs, the `main()` process orchestration logs and the GitLab console logs […] that this has in a susceptible area of the code below.

The logs consisted of a 450/0 (450 processes with no pooling) and a 450/25 (450 processes with 25 of them pooled). Both of these tests revealed the weakness in the area of code below.

The main area of code that needs to be refactored is below. This area of code is located in `install_tomcat`, which is in `tomcat_worker`, which is in `tomcat_worker_wrapper`, which is called from `main()` through the