
Commit 136c9e9

Author: Ubuntu (committed)

updated the README with forensic analysis and new SSH REFACTOR 3 code to address the first of 3 issues: the ghosts

1 parent cb6506f

2 files changed: +433, -21 lines


README.md

Lines changed: 329 additions & 2 deletions
@@ -25,6 +25,334 @@ pertain to the optimizations to this module.
## UPDATES: part 15 Hyper-scaling of processes 450/25: REFACTOR SSH 3, phase 1, to deal with the ghosts

This is the first phase of refactoring to deal with the ghost issue described in UPDATES part 14 below.

Refactor for Retry + Watchdog Logic

This version:

- Adds a retry loop around `exec_command` execution
- Wraps `stdout.read()` and `stderr.read()` with a watchdog timer using `recv_ready()`
- Monitors for stalls and retries intelligently

Initially run this with a 450/25 test and then ramp it up to 480/25.

The refactored code is below:
```

## REFACTOR SSH 3 – Phase 1: Retry + Watchdog Protection:

from datetime import datetime
import time

WATCHDOG_TIMEOUT = 90
RETRY_LIMIT = 3
SLEEP_BETWEEN_ATTEMPTS = 5

def read_output_with_watchdog(stream, label, ip):
    # Poll the channel instead of blocking on read(); give up if no data
    # arrives within WATCHDOG_TIMEOUT seconds so a ghost cannot hang the worker.
    start = time.time()
    collected = b''
    while True:
        if stream.channel.recv_ready():
            try:
                collected += stream.read()
                break
            except Exception as e:
                print(f"[{ip}] ⚠️ Failed reading {label}: {e}")
                break
        if time.time() - start > WATCHDOG_TIMEOUT:
            print(f"[{ip}] ⏱️ Watchdog timeout on {label} read.")
            break
        time.sleep(1)
    return collected.decode()

for idx, command in enumerate(commands):
    for attempt in range(RETRY_LIMIT):
        stdin = stdout = stderr = None  # keeps the finally block safe if exec_command itself fails
        try:
            print(f"[{ip}] [{datetime.now()}] Command {idx+1}/{len(commands)}: {command} (Attempt {attempt + 1})")
            stdin, stdout, stderr = ssh.exec_command(command, timeout=60)

            stdout.channel.settimeout(WATCHDOG_TIMEOUT)
            stderr.channel.settimeout(WATCHDOG_TIMEOUT)

            stdout_output = read_output_with_watchdog(stdout, "STDOUT", ip)
            stderr_output = read_output_with_watchdog(stderr, "STDERR", ip)

            print(f"[{ip}] [{datetime.now()}] STDOUT: '{stdout_output.strip()}'")
            print(f"[{ip}] [{datetime.now()}] STDERR: '{stderr_output.strip()}'")

            if "E: Package 'tomcat9' has no installation candidate" in stderr_output:
                print(f"[{ip}] ❌ Tomcat install failure.")
                ssh.close()
                return ip, private_ip, False

            if "WARNING:" in stderr_output:
                print(f"[{ip}] ⚠️ Warning ignored: {stderr_output.strip()}")
                stderr_output = ""

            if stderr_output.strip():
                print(f"[{ip}] ❌ Non-warning stderr received.")
                ssh.close()
                return ip, private_ip, False

            print(f"[{ip}] ✅ Command succeeded.")
            time.sleep(20)
            break  # Command succeeded, no need to retry

        except Exception as e:
            print(f"[{ip}] 💥 Exception during exec_command: {e}")
            time.sleep(SLEEP_BETWEEN_ATTEMPTS)

        finally:
            for stream in (stdin, stdout, stderr):
                if stream is not None:
                    stream.close()

ssh.close()
transport = ssh.get_transport()
if transport:
    transport.close()
print(f"Installation completed on {ip}")
return ip, private_ip, True
```
## UPDATES: part 14 Hyper-scaling of processes 450/25, SSH connect issues, log forensics and initial code refactoring

After using the REFACTOR SSH 1 code block, this block was used to dig deeper into the issue.

```
## REFACTOR SSH 2:

from datetime import datetime
import time

for idx, command in enumerate(commands):
    for attempt in range(3):
        try:
            print(f"[{ip}] [{datetime.now()}] Command {idx+1}/{len(commands)}: {command} (Attempt {attempt + 1})")
            stdin, stdout, stderr = ssh.exec_command(command, timeout=60)

            ## Add this timeout code to detect why some instances are silently failing without hitting my except block below.
            ## This will force them out of the try block into the except block.

            # 🔒 Ensure the VPS doesn’t hang forever waiting on output
            stdout.channel.settimeout(90)
            stderr.channel.settimeout(90)

            stdout_output = stdout.read().decode()
            stderr_output = stderr.read().decode()

            print(f"[{ip}] [{datetime.now()}] STDOUT: '{stdout_output.strip()}'")
            print(f"[{ip}] [{datetime.now()}] STDERR: '{stderr_output.strip()}'")

            if "E: Package 'tomcat9' has no installation candidate" in stderr_output:
                print(f"[{ip}] [{datetime.now()}] ❌ Package install failure. Exiting early.")
                ssh.close()
                return ip, private_ip, False

            if "WARNING:" in stderr_output:
                print(f"[{ip}] [{datetime.now()}] ⚠️ Warning ignored: {stderr_output.strip()}")
                stderr_output = ""

            if stderr_output.strip():
                print(f"[{ip}] [{datetime.now()}] ❌ Non-warning error output. Command failed.")
                ssh.close()
                return ip, private_ip, False

            print(f"[{ip}] [{datetime.now()}] ✅ Command succeeded.")
            time.sleep(20)

        except Exception as e:
            print(f"[{ip}] [{datetime.now()}] 💥 Exception during exec_command: {e}")
            ssh.close()
            return ip, private_ip, False

        finally:
            stdin.close()
            stdout.close()
            stderr.close()

ssh.close()
transport = ssh.get_transport()
if transport is not None:
    transport.close()
print(f"Installation completed on {ip}")
return ip, private_ip, True
```
In the latest test run the swap had to be increased to 20GB to prevent the VPS from hanging. Once that was done, the 450/25 test used about 17GB of the 20GB swap, and 7 failures were visible with the code above.

Correlating the gitlab console logs and the debug output above with the process-level and main pooling-level orchestration logs revealed the following:

7 Failures Isolated

- All occurred between timestamps [T+14min to T+16min], deep into the 450/25 cycle
- 5 match the ghost profile: no `Installation completed` marker, no exit status, hung `recv()` with no output
- 2 exited prematurely but did execute `exec_command()`, indicating a socket or fork failure downstream

Ghost Indicators:

- Threads hung at `channel.recv(1024)` without exception
- No log entry after `exec_command()` for those cases
- Swap was above 16.8GB, suggesting buffer delay or scheduler stalls, not memory depletion

Process-Level Logs:

- Missing heartbeat timestamps on 5 processes → confirms they stalled post-connection
- The others show clean command dispatch but dead ends in channel echo collection

Main Log Trace:

- Dispatcher launched all 25 in that burst correctly
- 5 stuck at `recv()` with no retries
- No thread timeout logic triggered → perfect test case for inserting watchdogs later

These are the 7 failed threads:

| Thread ID | Timestamp | Type       | Symptom                                 | Suggested Fix      |
|-----------|-----------|------------|-----------------------------------------|--------------------|
| #112      | T+14:02   | Ghost      | No output, stuck at `recv()`            | Watchdog + retries |
| #127      | T+14:45   | Early Exit | `exec_command` done, no completion log  | Exit code check    |
| #135      | T+15:04   | Ghost      | Silent thread, no further output        | Force termination  |
| #142      | T+15:17   | Ghost      | Stalled post-command                    | Retry + backoff    |
| #158      | T+15:39   | Ghost      | Socket hung, swap rising                | Launch stagger     |
| #163      | T+15:52   | Early Exit | Command dropped early                   | Retry w/ delay     |
| #170      | T+16:02   | Ghost      | No output captured                      | Watchdog           |

Based upon this, the code solution involves implementing changes in the three areas described below.
Fixing the Ghost and Early Exit cases would have resolved the 7 thread failures in the last run.

Three types of general failures:

| Failure Type  | What Happens                             | Python Symptom                  | Fix Strategy                       |
|---------------|------------------------------------------|---------------------------------|------------------------------------|
| Timeout       | Channel blocks too long                  | `socket.timeout` raised         | Retry with exponential backoff     |
| Stall (Ghost) | Output never received, no error          | `None` from `recv()` or hangs   | Watchdog timer + force exit        |
| Early Exit    | Channel closes before command completes  | `exit_status_ready()` too soon  | Check exit code + validate output  |
### Failure type one: Timeout

```
channel.settimeout(90)
channel.exec_command(cmd)
stdout = channel.recv(1024)  # <--- blocks too long, then raises socket.timeout
```

Symptoms:

- A `socket.timeout` exception
- No output received within the expected window
- Easily caught with `try/except`

Possible fix to introduce into the code:

```
for attempt in range(3):
    try:
        stdout = channel.recv(1024)
        break
    except socket.timeout:
        time.sleep(5)
```

Also consider tightening or relaxing `settimeout()` depending on system load.
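The fix-strategy table above pairs timeouts with retry plus exponential backoff rather than a fixed sleep. A minimal sketch of that idea, assuming the same `channel` object as the snippet above and a hypothetical `BASE_DELAY` constant:

```
import socket
import time

BASE_DELAY = 5        # hypothetical starting delay in seconds
MAX_ATTEMPTS = 3

stdout = None
for attempt in range(MAX_ATTEMPTS):
    try:
        stdout = channel.recv(1024)   # assumes settimeout() was already applied to `channel`
        break
    except socket.timeout:
        delay = BASE_DELAY * (2 ** attempt)   # back off 5s, 10s, 20s instead of a fixed sleep
        print(f"recv timed out (attempt {attempt + 1}), retrying in {delay}s")
        time.sleep(delay)
```

The growing delay gives a heavily loaded VPS time to recover between attempts instead of hammering the channel at a fixed interval.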
### Failure type two: Stall failures (Ghost)

The channel call doesn’t raise an error; it just sits indefinitely or returns `None`.

```
stdout = channel.recv(1024)  # returns None, never throws
```

Symptoms:

- No exception is raised
- The thread appears "alive" but makes no progress
- The worker pool gets clogged silently

For these we need to first detect them:

- Use forensic timestamps (your specialty!) to show that the `exec_command()` was sent and that no output was received after a reasonable delay
- Inject manual inactivity watchdogs:

```
start = time.time()
while True:
    if channel.recv_ready():
        stdout = channel.recv(1024)
        break
    if time.time() - start > 90:
        raise RuntimeError("SSH stall detected")
    time.sleep(1)
```
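The fix-strategy table pairs the watchdog with a force exit. A minimal sketch of how a caller might do that, assuming the watchdog loop above is wrapped in a hypothetical `wait_for_output()` helper and that `channel` comes from `ssh.exec_command()`:

```
import time

def wait_for_output(channel, timeout=90):
    # Hypothetical helper wrapping the inactivity watchdog shown above.
    start = time.time()
    while True:
        if channel.recv_ready():
            return channel.recv(1024)
        if time.time() - start > timeout:
            raise RuntimeError("SSH stall detected")
        time.sleep(1)

try:
    output = wait_for_output(channel)
except RuntimeError as stall:
    print(f"Stall detected: {stall}; forcing the channel closed")
    channel.close()   # force the hung channel shut so the worker slot is freed
    # the surrounding retry loop can now re-dispatch the command or mark the host as failed
```

Closing the hung channel frees the worker so the pool does not get clogged silently.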
### Failure type three: Early exit

Here, the process exits before completing SSH output collection.

These are caused by the following:

- Fork pressure
- Swap delays
- A broken pipe in the SSH stream

Detection involves the following:

- `"Installation completed"` never logged
- Unexpected thread exit without a traceback
- `channel.exit_status_ready()` == True but with `None` output

Intercept these with the following code:

```
status = channel.recv_exit_status()
if status != 0:
    logger.warning("Command exited early or failed silently")
```
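The fix-strategy table suggests pairing the exit-code check with output validation. A minimal, hedged sketch of that combination, assuming `channel`, `stdout_output`, `command`, and `ip` from the refactored loop, plus a hypothetical `EXPECTED_MARKER` string for output the command is known to print on success:

```
EXPECTED_MARKER = "Setting up tomcat9"   # hypothetical success string, not part of the original code

status = channel.recv_exit_status()      # blocks until the remote command reports an exit code
if status != 0:
    print(f"[{ip}] ❌ '{command}' exited with status {status}; treating this attempt as failed")
elif EXPECTED_MARKER not in stdout_output:
    # Exit code 0 but the expected output never arrived: the early-exit signature, so retry
    print(f"[{ip}] ⚠️ '{command}' returned status 0 but '{EXPECTED_MARKER}' is missing from its output")
```

An exit status of 0 with missing output is exactly the early-exit signature described above, so it is flagged for retry rather than trusted as a success.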
## UPDATES: part 13: Hyper-scaling of processes (400+), SSH connect issues and troubleshooting and log forensics and correlation

After doing forensic correlation between the process level logs, the main() process orchestration logs and the gitlab

@@ -35,8 +363,7 @@ that this has in a susceptible area of the code below.

The logs consisted of a 450/0 (450 processes with no pooling) and a 450/25 (450 processes with 25 of them pooled). Both of
these tests revealed the weakness in the area of code below.

The main area of code that needs to be refactored is below. This area of code is located in install_tomcat which is in tomcat_worker which is in tomcat_worker_wrapper which is called from main() through the
multiprocessing.Pool method below:

```
