
Can't easily find core files #2

Open
G-Ragghianti opened this issue Feb 7, 2024 · 2 comments


@G-Ragghianti

No description provided.


@abouteiller

A few more details here.

ulimit -c 0 is the default, which is good and should stay that way. If core files are needed, the user should request them with ulimit -c unlimited.
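
To illustrate the opt-in approach, here is a minimal sketch of a wrapper that raises the core size limit for a single command only, so the default of 0 stays in place for everything else. The function name with_cores is hypothetical, and whether the raised limit actually reaches the remote task depends on how Slurm is configured to propagate resource limits, which is an assumption here.

```shell
# Hypothetical helper: run one command with core dumps enabled.
# The subshell keeps the change local, so the calling shell's limit
# (0 by default) is left untouched.
with_cores() {
    ( ulimit -c unlimited && "$@" )
}

# Possible usage (assumes Slurm propagates the soft limit to the task):
#   with_cores srun -wleconte -n1 testing_redistribution
```

If the hard limit forbids raising the soft limit, the ulimit call fails and the command is not run, which makes the problem visible instead of silently producing no core.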

What happens when we run with core generation enabled

Running srun -wleconte -n1 testing_redistribution...

Core files are created in /var/lib/systemd/coredump/core.testing_redistr.1003.87f81996462b4acbb4e80cb69dbe57c2.1901042.1707242127000000.zst on the system on which the task runs (e.g., Leconte, when the user is on Methane).
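
The file name above carries enough information to tell which crash it belongs to. As a rough sketch, assuming the layout is core.<comm>.<uid>.<boot-id>.<pid>.<timestamp>.zst (inferred from the example, not confirmed for every systemd version), a small helper can split it into fields. The function name parse_core_name is hypothetical.

```shell
# Hypothetical helper: split a systemd-coredump file name into its fields.
# Assumed layout: core.<comm>.<uid>.<boot-id>.<pid>.<timestamp>.zst
# Fields are peeled off from the right because <comm> may itself contain dots.
parse_core_name() {
    name=${1##*/}          # strip the directory part
    name=${name%.zst}      # strip the compression suffix, if present
    name=${name#core.}     # strip the fixed prefix
    timestamp=${name##*.}; name=${name%.*}
    pid=${name##*.};       name=${name%.*}
    bootid=${name##*.};    name=${name%.*}
    uid=${name##*.};       name=${name%.*}
    comm=$name
    printf 'comm=%s uid=%s pid=%s\n' "$comm" "$uid" "$pid"
}
```

Note that the command name appears as testing_redistr rather than testing_redistribution: the kernel truncates it to 15 characters.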

Investigating the bug is then only possible by running a series of commands to find the name of the core file, decompress it, and run gdb on it:

srun -wleconte ls /var/lib/systemd/coredump
srun -wleconte unzstd /var/lib/systemd/coredump/core.testing_redistr.1003.87f81996462b4acbb4e80cb69dbe57c2.1901042.1707242127000000.zst -o $HOME/core
gdb testing_redistribution core
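
Until something better is in place, the lookup step could be scripted. The sketch below picks the most recently modified core for a given executable from a directory; the function name newest_core is hypothetical, and it assumes the file naming shown above, with the command name truncated to 15 characters. It has to run on the node that holds the cores.

```shell
# Hypothetical helper: print the newest core file in a directory for a given
# executable. Assumes names of the form core.<comm>.<uid>.<boot-id>.<pid>.<timestamp>[.zst],
# where <comm> is the executable name truncated to 15 characters.
newest_core() {
    dir=$1
    comm=$(printf '%.15s' "${2##*/}")   # basename, truncated like the kernel does
    # ls -t sorts by modification time, newest first
    ls -t "$dir"/core."$comm".* 2>/dev/null | head -n 1
}

# Possible usage inside a script launched with srun on the target node:
#   core=$(newest_core /var/lib/systemd/coredump testing_redistribution)
#   unzstd "$core" -o "$HOME/core"
```

This only shortens the manual procedure; it does not remove the need to decompress the file before handing it to gdb.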

What we'd like

  1. Core files are collected on a shared scratch filesystem that is easily referenced as /cores.
  2. Core files can be passed to gdb directly. Compressing them with zstandard saves space, but our gdb version cannot read compressed cores. Maybe we need a newer version of the gdb Spack package?

We had a similar setup on Saturn that can probably be replicated here. In particular the scratch filesystem was using relaxed locking/consistency semantics to avoid NFS freaking out when written to from multiple nodes at the same time.
