prod-feng/compile-and-setup-pam_slurm_adopt.so-module

compile-and-setup-pam_slurm_adopt.so-module

Tested on Rocky Linux 9 with cgroup v2 and Slurm 23.02.8.

cgroup v1 works too, but cgroup v2 is preferred: it seems to handle the processes more cleanly (or at least displays them more neatly).

1 Download the Slurm source code

Use the same version (at least the same major and minor numbers) as the Slurm installed on your HPC cluster.
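The version check described above can be sketched as a pair of small shell helpers. The function names `major_minor` and `same_release` are hypothetical and only illustrate the comparison. The commented usage assumes `sinfo --version` prints a line of the form `slurm 23.02.8`.

```shell
# Print the major.minor part of a Slurm version string, e.g. 23.02.8 -> 23.02
major_minor() {
    printf '%s\n' "$1" | cut -d. -f1,2
}

# Succeed only when both version strings share the same major.minor part.
same_release() {
    [ "$(major_minor "$1")" = "$(major_minor "$2")" ]
}

# Usage on a cluster node (the source version 23.02.8 is just an example):
#   installed=$(sinfo --version | awk '{print $2}')
#   same_release "$installed" 23.02.8 && echo "source matches the installed Slurm"
```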

2 Figure out the default paths your installed Slurm binary packages were built with

For example, the prefix=xxx value used when the package was built. You can open an installed Slurm library, e.g. "vi libslurm.so.39", and search for "slurm.conf".
Once you have this information, run configure with the same settings:

>./configure  --prefix=/usr/local  --sysconfdir=/opt/etc/slurm --libdir=/opt/lib64

The most important option is "--prefix=". The other paths can be derived from it, or you can set them explicitly.
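As an alternative to opening the library in vi, the compiled-in "slurm.conf" path can be pulled out of the binary with grep. This is a minimal sketch, assuming GNU grep. The helper name `find_conf_paths` is hypothetical, and the library location in the usage comment is an assumption that depends on your installation.

```shell
# List every absolute path ending in "slurm.conf" embedded in a binary file.
# -a treats the binary as text, -o prints only the matching part.
find_conf_paths() {
    LC_ALL=C grep -a -o '/[[:alnum:]_./-]*slurm\.conf' "$1" | sort -u
}

# Usage (library location is an assumption, adjust to your installation):
#   find_conf_paths /opt/lib64/libslurm.so.39
```

The directory part of the printed path is the value to pass as --sysconfdir.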

>cd contribs/pam_slurm_adopt

>gmake

You do not need to run "make install": the package is already installed, and you do not want to overwrite its files. The compiled "pam_slurm_adopt.so" is in the "contribs/pam_slurm_adopt/.libs" folder.

Copy it to the "/lib64/security/" folder on the compute nodes.
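The copy step can be sketched with `install`, which sets the file mode in the same command. The helper name `install_module` and the mode 0755 are assumptions. The destination defaults to the "/lib64/security" folder named above.

```shell
# Copy a freshly built pam_slurm_adopt.so into the PAM module directory.
# $1: path to the built module, $2: destination directory (optional).
install_module() {
    src=$1
    dest_dir=${2:-/lib64/security}
    install -m 0755 "$src" "$dest_dir/pam_slurm_adopt.so"
}

# Usage, run as root on each compute node from the Slurm source tree:
#   install_module contribs/pam_slurm_adopt/.libs/pam_slurm_adopt.so
```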

3 Update the PAM files:

#/etc/pam.d/sshd
#Add as the last line

account    required     pam_slurm_adopt.so log_level=debug5

#Comment out every "pam_systemd.so" line in all of the PAM files

/etc/pam.d/password-auth:#-session     optional      pam_systemd.so
/etc/pam.d/runuser-l:#-session optional pam_systemd.so
/etc/pam.d/system-auth:#-session     optional      pam_systemd.so
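The two PAM edits above could be scripted roughly as follows. This is a sketch, not a tested deployment script: the helper names `add_adopt_line` and `comment_pam_systemd` are hypothetical, and `sed -i.bak` is used so that each edited file keeps a backup copy next to it.

```shell
# Append the pam_slurm_adopt account line unless the file already mentions the module.
add_adopt_line() {
    grep -q 'pam_slurm_adopt\.so' "$1" ||
        printf '%s\n' 'account    required     pam_slurm_adopt.so log_level=debug5' >> "$1"
}

# Comment out every active pam_systemd.so line; lines already starting with # are left alone.
comment_pam_systemd() {
    sed -i.bak -E 's/^([^#].*pam_systemd\.so.*)$/#\1/' "$1"
}

# Usage, run as root on a compute node:
#   add_adopt_line /etc/pam.d/sshd
#   for f in /etc/pam.d/password-auth /etc/pam.d/runuser-l /etc/pam.d/system-auth; do
#       comment_pam_systemd "$f"
#   done
```

Both helpers are safe to run more than once: the first skips files that already reference the module, and the second does not touch lines that are already commented.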

4 Modify pam_slurm_adopt.c if needed

If you cannot determine or reproduce paths such as "--prefix=", "--sysconfdir" or "--libdir", you can instead hard-code the location of your Slurm config file in the source file "pam_slurm_adopt.c".

>cd contribs/pam_slurm_adopt

> vi pam_slurm_adopt.c

#change line 843

        slurm_conf_init("/opt/etc/myslurm/myslurm.conf"); /* was: slurm_conf_init(NULL); */
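Because the line number can differ between Slurm versions, the same change could be applied by matching the call text instead of editing line 843 by hand. This is a sketch, assuming the source contains the literal text `slurm_conf_init(NULL)`. The helper name `patch_conf_path` is hypothetical, and the config path is the example one used above.

```shell
# Replace slurm_conf_init(NULL) with a hard-coded config path, keeping a .bak copy.
# $1: path to pam_slurm_adopt.c, $2: absolute path of the Slurm config file.
patch_conf_path() {
    src=$1
    conf=$2
    sed -i.bak "s|slurm_conf_init(NULL)|slurm_conf_init(\"$conf\")|" "$src"
}

# Usage, from contribs/pam_slurm_adopt:
#   patch_conf_path pam_slurm_adopt.c /opt/etc/myslurm/myslurm.conf
#   grep -n 'slurm_conf_init' pam_slurm_adopt.c   # confirm the change before building
```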

>gmake

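Then copy pam_slurm_adopt.so to the /lib64/security folder on the compute nodes.

Once the module is in place, an SSH login to a node where you have no running job is refused: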

$ ssh node005
Access denied by pam_slurm_adopt: you have no active jobs on this node
Login not allowed: no running jobs and no WLM allocations

Connection closed by 192.168.0.205 port 22

If you have a job running on the node, you should be able to log in via SSH. All processes spawned by this new SSH session will be adopted into your Slurm job's cgroup and controlled by it, so they share the job's resource limits (GPU devices, RAM, etc.). Once the job finishes, the SSH session is terminated immediately.

To check whether it works as expected, run this command on the compute node:

$ systemd-cgls

Then check that the extra SSH session's processes have been adopted into the proper Slurm job:

......
  ├─slurmstepd.scope … (#5289)
  │ → user.invocation_id: 29f6242c7a2c4590ae1da8976195058b
  │ → user.delegate: 1
  │ ├─job_1651 (#2688323)
  │ │ ├─step_0 (#2688543)
  │ │ │ ├─slurm (#2688631)
  │ │ │ │ └─2739202 slurmstepd: [1651.0]
  │ │ │ └─user (#2688587)
  │ │ │   └─task_0 (#2688719)
  │ │ │     ├─2739212 /bin/bash
  │ │ │     └─2739337 top        <-------------my srun session
  │ │ └─step_extern (#2688367)
  │ │   ├─slurm (#2688455)
  │ │   │ └─2739195 slurmstepd: [1651.extern]
  │ │   └─user (#2688411)
  │ │     └─task_special (#2688499)
  │ │       ├─2739199 sleep 100000000
  │ │       ├─2739273 sshd: feng [priv]    <----my SSH session
  │ │       ├─2739278 sshd: feng@pts/1
  │ │       ├─2739279 -bash
  │ │       ├─2739338 systemd-cgls
  .......
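A quicker check than reading the whole tree is to look at the cgroup path of the current shell. This is a sketch that assumes the cgroup path contains a `job_<id>` component, as in the tree above. The helper name `cgroup_job_id` is hypothetical.

```shell
# Print the Slurm job ID found in a cgroup file (default: the current process).
# Prints nothing when the process is not inside a job_<id> cgroup.
cgroup_job_id() {
    grep -o 'job_[0-9][0-9]*' "${1:-/proc/self/cgroup}" | head -n 1 | cut -d_ -f2
}

# Usage inside the SSH session on the compute node:
#   cgroup_job_id          # prints the job ID the session was adopted into
#   cgroup_job_id /proc/2739279/cgroup   # same check for another PID
```

For the tree shown above, running it inside the SSH session should print 1651.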
