CS-1187 add systemd and cgroups integration #60

Merged: 66 commits, Jul 22, 2025

jgabler-hpc
Contributor

No description provided.

jgabler-hpc and others added 30 commits April 23, 2025 19:28
ernst-bablick previously approved these changes Jul 22, 2025
@jgabler-hpc jgabler-hpc merged commit e66d328 into master Jul 22, 2025
1 check passed
ernst-bablick pushed a commit that referenced this pull request Jul 29, 2025
* EH: CS-1188 control daemons with systemd

* avoid endless loop in case an invalid slice is given in the autoinstall template
// BelongsTo: CS-1188

* EH: CS-1192 at startup of daemons output the cgroups slice the service is running in

* fixed typo "deamon"

* EH: CS-1223 with systemd integration, move sge_shepherd processes out of the sge_execd service cgroup

* the sd_bus method StartTransientUnit only starts a job that creates the unit
and returns before the action has actually finished.
We need to wait for the job to finish.
// BelongsTo: CS-1123

* - do not report systemd as the init system on ulx-*: systemd support cannot be built into sge_execd there because libsystemd.so is too old
- fixed broken build on CentOS 8

* - sd_bus error was not reported to the caller
- error messages were truncated at 100 characters; introduced the SFN4 macro for 400-character strings

* fixed non-unique message ids

* EH: CS-1291 move shepherd child to its own scope

* shepherd tried to use systemd on hosts that have the systemd library but do not use systemd as init system (Antix Linux)

* EH: CS-1292 get job online usage information via systemd

* tried to connect to systemd on hosts without systemd

* errors in StartTransientUnit were not always propagated to caller

* EH: CS-1294 set job limits via systemd

* EH: CS-1315 set binding via systemd

* cleanup

* EH: CS-1295 set device isolation via systemd

* EH: CS-1241 add profiling information for systemd operations

* - execd profiling could not be disabled again
- cleanup, moved code to own module
// BelongsTo: CS-1241

* EH: CS-1318 allow to run jobs under systemd control even if sge_execd itself is not started as systemd service

* EH: CS-1319 make running jobs under systemd control configurable

* added ENABLE_SYSTEMD to sge_conf.5 man page
// BelongsTo: CS-1319

* EH: CS-1322 the job specific scopes need to contain the toplevel slice name to be unique

* EH: CS-1300 do not add and handle the additional group id for jobs running under systemd

* BF: CS-1325 possible race condition between calling StartTransientUnit and waiting for the corresponding job to finish

* EH: CS-1296 kill jobs via systemd

* EH: CS-1321 allow to configure a hybrid usage data collection (both via systemd and the pdc)

* fixed memory leaks

* BF: CS-1335 need special handling for interrupted system call

* EH: CS-1342 add systemd specific settings (toplevel slice name) to the installation guide

* cleanup and added systemd integration to the release notes

* cleanup

* - addressed review comments
- fixed a race condition leading to multiple execd children trying to create the shepherds.scope

* added more details of the systemd integration to the release notes

* addressed review comments

* refactoring and documentation with Doxygen headers

* EH: CS-1408 USAGE_COLLECTION mode must be kept consistent for running jobs

* EH: CS-1419 disable systemd integration if sge_execd is started as non privileged user

* with HYBRID usage collection, non-systemd hosts didn't report cpu and rss

* reprioritization code was broken by systemd integration
// SeeAlso: CS-1421

* - improved diagnostics when ptf job / osjob cannot be found
- enforce cleanup in execd only when KEEP_ACTIVE is changed to FALSE

* BF: CS-1019 sge_execd logs errors when running tightly integrated parallel jobs

* BF: CS-1425 backup/restore does not handle $SGE_ROOT/$SGE_CELL/slice_name

* BF: CS-1429 sge_qmaster can segfault on qdel -f

* BF: CS-1019 sge_execd logs errors when running tightly integrated parallel jobs

* BF: CS-1430 running tightly integrated parallel jobs leaves systemd slices
// + additional cleanup

* fix to the fix for CS-1019

* added missing files
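
The StartTransientUnit entries above (CS-1123, CS-1325) describe a general pattern: the call only queues a job and returns its object path, and systemd later announces completion with a JobRemoved signal (job id, job path, unit name, result). To avoid missing that signal, the caller has to subscribe before issuing the call. The following is a minimal sketch of that ordering only. It is not the code from this pull request and makes no real D-Bus calls: `FakeManager`, `subscribe_job_removed`, `start_unit_and_wait` and the unit name `example.scope` are hypothetical names chosen for illustration.

```python
import queue
import threading


class FakeManager:
    """Hypothetical stand-in for the systemd manager object (no real D-Bus)."""

    def __init__(self):
        self._subscribers = []
        self._next_id = 0
        self._lock = threading.Lock()

    def subscribe_job_removed(self, callback):
        """Register a callback that receives (id, job_path, unit, result)."""
        with self._lock:
            self._subscribers.append(callback)

    def start_transient_unit(self, unit):
        """Queue a job and return its path; the job finishes on another thread."""
        with self._lock:
            self._next_id += 1
            job_id = self._next_id
        job_path = f"/org/freedesktop/systemd1/job/{job_id}"
        worker = threading.Thread(
            target=self._finish, args=(job_id, job_path, unit)
        )
        worker.start()
        # Simulate the worst case: the job completes (and the signal fires)
        # before the caller has even received the job path.
        worker.join()
        return job_path

    def _finish(self, job_id, job_path, unit):
        with self._lock:
            subscribers = list(self._subscribers)
        for callback in subscribers:
            callback(job_id, job_path, unit, "done")


def start_unit_and_wait(manager, unit, timeout=5.0):
    """Subscribe first, then start the unit, then wait for the matching job."""
    signals = queue.Queue()
    # Subscribing before the call means an early signal is buffered, not lost.
    manager.subscribe_job_removed(lambda *args: signals.put(args))
    job_path = manager.start_transient_unit(unit)
    while True:
        # Raises queue.Empty if no signal arrives within `timeout` seconds.
        _job_id, path, _unit, result = signals.get(timeout=timeout)
        if path == job_path:  # ignore signals that belong to other jobs
            return result


if __name__ == "__main__":
    print(start_unit_and_wait(FakeManager(), "example.scope"))
```

If the subscription were made only after `start_transient_unit` returned, the simulated signal would already have been delivered and the wait would time out, which is the kind of race CS-1325 refers to.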

---------

Co-authored-by: Joachim Gabler <joga.oge@gabler-net.de>