Description
Unfortunately, I have not been able to get remote job submission to work properly. I am running pyiron 0.4.7, installed via conda, and pyiron_base 0.5.32, installed via pip from the git repo at that tag.
I have followed the steps in the docs for sending jobs to an HPC cluster over ssh. As the documentation suggests, I have set DISABLE_DATABASE=TRUE for this.
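To confirm that the setting is actually picked up on a given machine, one option is to parse the configuration file directly. The following is only a minimal sketch: it assumes the option sits in the `[DEFAULT]` section of an INI-style pyiron configuration file, and the example contents, the paths, and the helper name `database_disabled` are placeholders rather than anything taken from my setup or from pyiron itself.

```python
import configparser

# Hypothetical configuration contents; the paths are placeholders.
EXAMPLE_CONFIG = """
[DEFAULT]
PROJECT_PATHS = ~/pyiron/projects
RESOURCE_PATHS = ~/pyiron/resources
DISABLE_DATABASE = True
"""


def database_disabled(config_text: str) -> bool:
    """Return True if DISABLE_DATABASE is switched on in the given config text."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    # getboolean accepts true/false, yes/no, on/off and 1/0, case-insensitively.
    return parser["DEFAULT"].getboolean("DISABLE_DATABASE", fallback=False)


print(database_disabled(EXAMPLE_CONFIG))
```

Running the same check on both the local machine and the cluster would show whether the two sides agree on the database setting.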
Right now I am testing whether I can get a simple minimization job to work, which I submit via
import numpy as np
# `project` and `lj_potential` are assumed to be defined earlier in the session.
n_type1 = 11
n_type2 = 1
box_length = 22.2
potential = lj_potential
minim = project.create_job(
    "LammpsWL",
    f"minimization{n_type1}_{n_type2}",
    delete_existing_job=True,
)
# Cubic box with randomly placed Ar and Kr atoms as the starting structure.
unit_cell = box_length * np.eye(3)
positions = np.random.random((n_type1 + n_type2, 3)) * box_length
random_start = project.create.structure.atoms(
    elements=n_type1 * ["Ar"] + n_type2 * ["Kr"],
    positions=positions,
    cell=unit_cell,
    pbc=True,
)
minim.structure = random_start
minim.potential = potential
minim.calc_minimize(
    ionic_energy_tolerance=0.0,
    ionic_force_tolerance=1e-4,
    e_tol=None,
    f_tol=None,
    max_iter=100000,
    pressure=None,
    n_print=100,
    style="cg",
)
# Submit through the remote queue instead of running locally.
minim.server.queue = "queue_one"
minim.server.cpus = 1
minim.server.run_mode = "queue"
minim.run()
This successfully pushes the job to the cluster, into exactly the working directory I expect, and it also runs flawlessly until the job status changes to 'collect'.
During the collection, the following error can be seen in the output:
IndexError: list index out of range (pyiron_base/database/filetable.py line 121)
This seems to be caused by the job table expecting the job to have id 1 (if I check project.db._job_table, this is the only job id that exists on the HPC cluster). However, the job name in the Slurm queue is pi_5123 or something similar, probably because that is the id the job would have received on my local machine, from which I submitted it. Hence, the entire communication between the machines breaks at this point.
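The mismatch I suspect can be reproduced in isolation. The sketch below is not pyiron's actual code: `remote_job_table`, `job_id_from_queue_name`, and `lookup_job` are hypothetical stand-ins, and the table contents are made up to mirror the situation described above (a single entry with id 1 on the cluster, and an id of 5123 taken from the queue name).

```python
# Hypothetical stand-in for the file-based job table on the cluster.
remote_job_table = [{"id": 1, "job": "minimization11_1", "status": "collect"}]


def job_id_from_queue_name(queue_name: str) -> int:
    """Extract the numeric id from a queue job name such as 'pi_5123'."""
    prefix, _, number = queue_name.partition("_")
    if prefix != "pi" or not number.isdigit():
        raise ValueError(f"unexpected queue job name: {queue_name!r}")
    return int(number)


def lookup_job(table, job_id):
    """Return the table entry with the given id."""
    matches = [row for row in table if row["id"] == job_id]
    # Indexing an empty list raises IndexError when no entry has this id.
    return matches[0]


requested_id = job_id_from_queue_name("pi_5123")
try:
    lookup_job(remote_job_table, requested_id)
except IndexError as err:
    print(f"lookup for id {requested_id} failed: {err}")
```

If the collect step does something along these lines, any id that was assigned on the local machine would fail as soon as it differs from the ids in the table on the cluster.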
Is there something in the setup that I have missed? Should I somehow set the id on my local machine to start at 0 again?
On a related note: is it possible to submit a job series ("Flexible" pyiron jobs, which are connected by step functions) over ssh?