test point pscheduler failures #1466

Open
rhclopes opened this issue Aug 19, 2024 · 6 comments

@rhclopes

I have seen pScheduler-related services failing several times on three different perfSONAR test point hosts.

I am attaching troubleshooting results for two of those hosts.
failures.txt

Also attached are logs for one of those servers.

systemctl.log

Regards,
Raul

@timchown

Furthermore, the available memory and CPU utilisation (very high load) for the affected systems, correlated with the service outage periods, are quite interesting.

These are both testpoint installations. The toolkit installations seem fine.

ps-small (a 1G small node in Szymon's PMP mesh):
See https://ps-mesh.perf.ja.net/grafana/d/eb96563b-0d93-4910-be3c-5be331a00339/perfsonar-host-metrics?orgId=1&var-host=ps-small-slough.perf.ja.net&var-node_name=ps-london-bw.perf.ja.net&var-node_ip=All&from=now-7d&to=now.

And for Imperial (a full node):
https://ps-mesh.perf.ja.net/grafana/d/eb96563b-0d93-4910-be3c-5be331a00339/perfsonar-host-metrics?orgId=1&var-host=lt2ps00-bw.grid.hep.ph.ic.ac.uk&var-node_name=lt2ps00.grid.hep.ph.ic.ac.uk&var-node_ip=All&from=now-7d&to=now

@rhclopes
Author

Attaching logs obtained today.

postgresql-15-main.log.2024-08-27.txt
pscheduler.log-Aug27.txt

Services started failing around 10:09.

I restarted PostgreSQL around 13:30, then all services around 13:50.

@rhclopes
Author

rhclopes commented Aug 27, 2024

I haven't attached syslog because I am always wary about posting syslog on a public website. We can share it somehow if Mark and team need it.

@rhclopes
Author

I intended to attach statistics collected over 20 hours on resource use (sampled with something along the lines of the sketch after this list), including

  • number of open files;
  • the open files themselves;
  • number of processes owned by pscheduler, postgresql, perfsonar;
  • running processes, reverse-sorted by memory.
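
For reference, a collection loop of roughly this shape produces that kind of data. This is only an illustrative sketch, not the script actually used: the psutil dependency, the user list, and the five-minute interval are all assumptions.

```python
#!/usr/bin/env python3
# Illustrative sampler (not the script used for the attached data):
# every five minutes, count processes and open file descriptors for the
# users of interest so a leak shows up as a steadily growing series.
import time

import psutil  # assumed to be installed

USERS = ("pscheduler", "postgres", "perfsonar")

def snapshot():
    counts = {user: {"procs": 0, "fds": 0} for user in USERS}
    for proc in psutil.process_iter(["username", "num_fds"]):
        user = proc.info["username"]
        if user in counts:
            counts[user]["procs"] += 1
            counts[user]["fds"] += proc.info["num_fds"] or 0
    return counts

while True:
    print(int(time.time()), snapshot(), flush=True)
    time.sleep(300)
```

A per-user count like this keeps the output tiny; the bulk of the attached 473 MB file presumably comes from the full open-file and process listings.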

Problem: the statistics file is 473 MB, too much for GitHub. Is there a place where I can upload it?

The data seems to show a trend where the pscheduler user is acquiring resources and never releasing them.

  • 48K open files yesterday at 13:00, 100K+ today around 08:00.
  • 40 processes owned by pscheduler yesterday, 349 now.

This testpoint was a member of the PMP dashboard. A month ago it was reinstalled and never re-subscribed to PMP. Yet the PMP giant node keeps contacting this node. I wonder if this can lead to a non-RAII situation (resources acquired and never released).

How could an archiver unsubscribe a node that is dead? Does that make sense?

@rhclopes
Author

nohup.out-last1000.gz

@mfeit-internet2
Member

This is being caused by twin memory leaks, one in the API and one in the runner. The one in the API has been identified and there's a candidate fix for it. I'm working on the runner.
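
For anyone hitting the same symptom: the API-side leak described above comes down to database connections not being reused, and the committed fix caches one connection per thread. A minimal sketch of that pattern, assuming psycopg2 and an invented `db_connection()` helper (this is not the actual pScheduler code), looks like:

```python
import threading

import psycopg2  # pScheduler's database is PostgreSQL; psycopg2 assumed here

_local = threading.local()

def db_connection(dsn="dbname=pscheduler"):
    """Return this thread's cached connection, opening it at most once.

    Re-opens only if the cached connection has been closed, instead of
    creating (and leaking) a new connection on every call.
    """
    conn = getattr(_local, "conn", None)
    if conn is None or conn.closed:
        conn = psycopg2.connect(dsn)
        _local.conn = conn
    return conn
```

The "check cursor closed attribute correctly" item in the commit below is presumably the companion detail: psycopg2 exposes `closed` on both connections and cursors, and testing the wrong one can make code think a live connection needs replacing (or the reverse).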

mfeit-internet2 added a commit that referenced this issue Sep 4, 2024
* Closing branch

* Properly cache thread-local database connection object.  #1466

* Check cursor closed attribute correctly.  #1466

* Don't let leaky API processes live longer than 30 minutes.  #1466

* Remove closed branch file from earlier mistake

---------

Co-authored-by: Andy Lake <andy@es.net>
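
The "don't let leaky API processes live longer than 30 minutes" item above is a containment measure rather than a fix. The general shape of that pattern is a worker retiring itself so its supervisor replaces it and leaked memory is reclaimed; the sketch below is only an illustration with invented names, not the actual pScheduler implementation.

```python
import sys
import time

MAX_LIFETIME = 30 * 60   # seconds; value taken from the commit message
_started = time.time()

def maybe_retire():
    """Call between requests: exit once this process has lived too long.

    The supervising process manager (pre-fork server, systemd, etc.) starts
    a fresh replacement, so memory leaked slowly by this process is returned
    to the system instead of accumulating indefinitely.
    """
    if time.time() - _started > MAX_LIFETIME:
        sys.exit(0)
```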