Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

jobtap history plugin throws errors #5475

Closed
garlick opened this issue Sep 27, 2023 · 6 comments · Fixed by #5608
Closed

jobtap history plugin throws errors #5475

garlick opened this issue Sep 27, 2023 · 6 comments · Fixed by #5608

Comments

@garlick
Copy link
Member

garlick commented Sep 27, 2023

In one of @vsoch's cloudy experiments, in the output of a failing intel MPI job, we noticed two jobtap errors:

jobtap: job.new: callback returned error
jobtap: job.inactive.add: callback returned error

The only jobtap in-tree that registers job.inactive-add is history. Maybe there is something to run down here or at least a couple places where log messages could be added to give us something to go on next time.

@grondo
Copy link
Contributor

grondo commented Dec 5, 2023

Just reporting that another user hit this same issue when trying to set up a small system instance. (using flux RPMs on TOSS)

Looking at the code, the only way the history plugin returns an error from these callbacks is if hola_list_insert() fails:

        if (!hola_list_insert (hist->users, int2ptr (userid), entry, false)) {
            job_entry_destroy (entry);
            return -1;
        }

Maybe adding a flux_log_error() at these spots would help run down the issue.

@grondo
Copy link
Contributor

grondo commented Dec 5, 2023

Ah, this user reports they are running jobs (and the flux-broker) as uid 0, which will cause hola_list_insert() to fail since int2ptr(0) == NULL. Perhaps we should actively discourage running the flux-broker as root by checking for that in the broker. However, for now maybe the history plugin should just ignore root jobs, or should add a check specific to uid 0 and give a meaningful error message.

@vsoch
Copy link
Member

vsoch commented Dec 5, 2023

I see this all the time in containers (with root) and I just ignore it.

@grondo
Copy link
Contributor

grondo commented Dec 5, 2023

Well now we know you can suppress the errors with flux jobtap unload .history

@vsoch
Copy link
Member

vsoch commented Dec 5, 2023

Nice! I'll add that to my containers.

@garlick
Copy link
Member Author

garlick commented Dec 6, 2023

(uid_t)-1 is apparently reserved per POSIX so maybe we can just map root to that within the cache?

garlick added a commit to garlick/flux-core that referenced this issue Dec 6, 2023
Problem: the history jobtap plugin uses an integer userid as
a hash key, but the hash implementation won't accept a NULL key,
so the root history cannot be stored.

This results in errors like:

jobtap: job.new: callback returned error
jobtap: job.inactive.add: callback returned error

Map userid 0 to (uid_t)-1 since that value is reserved in POSIX.

Fixes flux-framework#5475
garlick added a commit to garlick/flux-core that referenced this issue Dec 6, 2023
Problem: the job history plugin uses an integer userid as hash key,
but the hash implementation won't accept a key (NULL) for userid 0.

This results in errors like:

jobtap: job.new: callback returned error
jobtap: job.inactive.add: callback returned error

Map userid 0 to (uid_t)-1 since that value is reserved in POSIX.

Fixes flux-framework#5475
garlick added a commit to garlick/flux-core that referenced this issue Dec 6, 2023
Problem: the job history plugin uses an integer userid as hash key,
but the hash implementation won't accept a key (NULL) for userid 0.

This results in errors like:

jobtap: job.new: callback returned error
jobtap: job.inactive.add: callback returned error

Map userid 0 to (uid_t)-1 since that value is reserved in POSIX.

Fixes flux-framework#5475
garlick added a commit to garlick/flux-core that referenced this issue Dec 6, 2023
Problem: the job history plugin uses an integer userid as hash key,
but the hash implementation won't accept a key (NULL) for userid 0.

This results in errors like:

jobtap: job.new: callback returned error
jobtap: job.inactive.add: callback returned error

Map userid 0 to (uid_t)-1 since that value is reserved in POSIX.

Fixes flux-framework#5475
garlick added a commit to garlick/flux-core that referenced this issue Dec 7, 2023
Problem: the job history plugin uses an integer userid as hash key,
but the hash implementation won't accept a key (NULL) for userid 0.

This results in errors like:

jobtap: job.new: callback returned error
jobtap: job.inactive.add: callback returned error

Map userid 0 to (uid_t)-1 since that value is reserved in POSIX.

Fixes flux-framework#5475
@mergify mergify bot closed this as completed in #5608 Dec 7, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants