
logs: better capturing of Dask and job logs in error situations #739

Open
@tiborsimko

Description

Current behaviour

When a workflow using Dask fails, the errors may not be fully captured and exposed back to users.

Here is one observation from running the reana-demo-dask-coffea example.

The status shows that the workflow failed:

$ rcg status -w dask-coffea-serial-kubernetes
NAME                            RUN_NUMBER   CREATED               STARTED               ENDED                 STATUS   PROGRESS
dask-coffea-serial-kubernetes   1            2025-01-28T10:02:38   2025-01-28T10:02:51   2025-01-28T10:04:03   failed   0/1

The logs show the same failure, without revealing the underlying cause:

$ rcg logs -w dask-coffea-serial-kubernetes

==> Workflow engine logs
2025-01-28 10:02:51,384 | root | MainThread | INFO | Publishing step:0, cmd: python analysis.py, total steps 1 to MQ
2025-01-28 10:04:03,456 | root | MainThread | INFO | Workflow a98dfa47-b7a2-4979-b75f-44ee4b224776 finished. Files available at /var/reana/users/00000000-0000-0000-0000-000000000000/workflows/a98dfa47-b7a2-4979-b75f-44ee4b224776.



==> Job logs
==> Step: process
==> Workflow ID: a98dfa47-b7a2-4979-b75f-44ee4b224776
==> Compute backend: Kubernetes
==> Job ID: reana-run-job-d60a2b71-c263-46c5-8bb1-06793e9083b4
==> Docker image: docker.io/coffeateam/coffea-dask-cc7:0.7.22-py3.10-g7f049
==> Command: python analysis.py
==> Status: failed
==> Started: 2025-01-28T10:02:51
==> Finished: 2025-01-28T10:03:59
==> Logs:

...


                -------------------------------------------------------------------
                -------------------------------------------------------------------
                ----------------         DASK WORKER LOGS          ----------------
                -------------------------------------------------------------------
                -------------------------------------------------------------------




                /usr/local/lib/python3.10/site-packages/distributed/cli/dask_worker.py:264: FutureWarning: dask-worker is deprecated and will be removed in a future release; use `dask worker` instead
  warnings.warn(
2025-01-28 10:02:56,570 - distributed.nanny - INFO -         Start Nanny at: 'tcp://10.244.0.62:44707'
2025-01-28 10:02:57,354 - distributed.worker - INFO -       Start worker at:    tcp://10.244.0.62:41989
2025-01-28 10:02:57,354 - distributed.worker - INFO -          Listening to:    tcp://10.244.0.62:41989
2025-01-28 10:02:57,354 - distributed.worker - INFO -           Worker name: reana-run-dask-a98dfa47-default-worker-1b35478060
2025-01-28 10:02:57,354 - distributed.worker - INFO -          dashboard at:           10.244.0.62:8788
2025-01-28 10:02:57,354 - distributed.worker - INFO - Waiting to connect to: tcp://reana-run-dask-a98dfa47-scheduler.default.svc.cluster.local:8786
2025-01-28 10:02:57,354 - distributed.worker - INFO - -------------------------------------------------
2025-01-28 10:02:57,354 - distributed.worker - INFO -               Threads:                         16
2025-01-28 10:02:57,354 - distributed.worker - INFO -                Memory:                   2.00 GiB
2025-01-28 10:02:57,354 - distributed.worker - INFO -       Local Directory: /tmp/dask-worker-space/worker-qrbfmwwd
2025-01-28 10:02:57,354 - distributed.worker - INFO - -------------------------------------------------
2025-01-28 10:02:57,360 - distributed.worker - INFO -         Registered to: tcp://reana-run-dask-a98dfa47-scheduler.default.svc.cluster.local:8786
2025-01-28 10:02:57,360 - distributed.worker - INFO - -------------------------------------------------
2025-01-28 10:02:57,361 - distributed.core - INFO - Starting established connection to tcp://reana-run-dask-a98dfa47-scheduler.default.svc.cluster.local:8786
/usr/local/lib/python3.10/site-packages/coffea/nanoevents/schemas/nanoaod.py:201: RuntimeWarning: Missing cross-reference index for Electron_photonIdx => Photon
  warnings.warn(
/usr/local/lib/python3.10/site-packages/coffea/nanoevents/schemas/nanoaod.py:201: RuntimeWarning: Missing cross-reference index for FatJet_subJetIdx1 => SubJet
  warnings.warn(
/usr/local/lib/python3.10/site-packages/coffea/nanoevents/schemas/nanoaod.py:201: RuntimeWarning: Missing cross-reference index for FatJet_subJetIdx2 => SubJet
  warnings.warn(
/usr/local/lib/python3.10/site-packages/coffea/nanoevents/schemas/nanoaod.py:201: RuntimeWarning: Missing cross-reference index for FsrPhoton_muonIdx => Muon
  warnings.warn(
/usr/local/lib/python3.10/site-packages/coffea/nanoevents/schemas/nanoaod.py:201: RuntimeWarning: Missing cross-reference index for Jet_electronIdx1 => Electron
  warnings.warn(
/usr/local/lib/python3.10/site-packages/coffea/nanoevents/schemas/nanoaod.py:201: RuntimeWarning: Missing cross-reference index for Jet_electronIdx2 => Electron
  warnings.warn(
/usr/local/lib/python3.10/site-packages/coffea/nanoevents/schemas/nanoaod.py:201: RuntimeWarning: Missing cross-reference index for Jet_muonIdx1 => Muon
  warnings.warn(
/usr/local/lib/python3.10/site-packages/coffea/nanoevents/schemas/nanoaod.py:201: RuntimeWarning: Missing cross-reference index for Jet_muonIdx2 => Muon
  warnings.warn(
/usr/local/lib/python3.10/site-packages/coffea/nanoevents/schemas/nanoaod.py:201: RuntimeWarning: Missing cross-reference index for Muon_fsrPhotonIdx => FsrPhoton
  warnings.warn(
/usr/local/lib/python3.10/site-packages/coffea/nanoevents/schemas/nanoaod.py:201: RuntimeWarning: Missing cross-reference index for Photon_electronIdx => Electron
  warnings.warn(

However, when one consults the Kubernetes pod logs directly, one can see that the root cause is visible in the job pod:

$ k logs reana-run-job-d60a2b71-c263-46c5-8bb1-06793e9083b4-rvwr9
Defaulted container "job" out of: job, krb5-renew, krb5-init (init)
Matplotlib created a temporary cache directory at /tmp/matplotlib-t6r42k3b because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
[######                                Traceback (most recent call last):
  File "/var/reana/users/00000000-0000-0000-0000-000000000000/workflows/a98dfa47-b7a2-4979-b75f-44ee4b224776/analysis.py", line 65, in <module>
    output, metrics = run(fileset, "Events", processor_instance=Processor())
  File "/usr/local/lib/python3.10/site-packages/coffea/processor/executor.py", line 1700, in __call__
    wrapped_out = self.run(fileset, processor_instance, treename)
  File "/usr/local/lib/python3.10/site-packages/coffea/processor/executor.py", line 1848, in run
    wrapped_out, e = executor(chunks, closure, None)
  File "/usr/local/lib/python3.10/site-packages/coffea/processor/executor.py", line 974, in __call__
    else _decompress(work.result())
  File "/usr/local/lib/python3.10/site-packages/distributed/client.py", line 287, in result
    raise exc.with_traceback(tb)
  File "/usr/local/lib/python3.10/site-packages/coffea/processor/executor.py", line 221, in __call__
    out = self.function(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/coffea/processor/executor.py", line 1367, in automatic_retries
    raise e
  File "/usr/local/lib/python3.10/site-packages/coffea/processor/executor.py", line 1336, in automatic_retries
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/coffea/processor/executor.py", line 1575, in _work_function
    filecontext = uproot.open(
  File "/usr/local/lib/python3.10/site-packages/uproot/reading.py", line 141, in open
    file = ReadOnlyFile(
  File "/usr/local/lib/python3.10/site-packages/uproot/reading.py", line 580, in __init__
    self._source = Source(
  File "/usr/local/lib/python3.10/site-packages/uproot/source/xrootd.py", line 272, in __init__
    self._open()
  File "/usr/local/lib/python3.10/site-packages/uproot/source/xrootd.py", line 275, in _open
    self._resource = XRootDResource(self._file_path, self._timeout)
  File "/usr/local/lib/python3.10/site-packages/uproot/source/xrootd.py", line 83, in __init__
    self._open()
  File "/usr/local/lib/python3.10/site-packages/uproot/source/xrootd.py", line 92, in _open
    self._xrd_error(status)
  File "/usr/local/lib/python3.10/site-packages/uproot/source/xrootd.py", line 118, in _xrd_error
    raise OSError(
OSError: XRootD error: [ERROR] Operation expired
in file root://eospublic.cern.ch//eos/root-eos/benchmark/Run2012B_SingleMu.root

This error did not seem to have been captured and exposed back to the user.
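
For illustration, the job pod logs shown above could be retrieved programmatically via the Kubernetes Python client and attached to the job record when a job fails. This is only a minimal sketch, not REANA's actual code: the helper name and the prefix matching are assumptions, while the pod name and the "job" container come from the example above.

from kubernetes import client, config
from kubernetes.client.rest import ApiException


def fetch_job_pod_logs(job_name, namespace="default"):
    """Return the logs of the main "job" container of a REANA run-job pod."""
    config.load_incluster_config()  # use config.load_kube_config() outside the cluster
    core_v1 = client.CoreV1Api()
    # Run-job pods carry the job name plus a random suffix, so match by prefix.
    for pod in core_v1.list_namespaced_pod(namespace=namespace).items:
        if pod.metadata.name.startswith(job_name):
            try:
                return core_v1.read_namespaced_pod_log(
                    name=pod.metadata.name,
                    namespace=namespace,
                    container="job",  # skip the krb5-renew/krb5-init containers
                )
            except ApiException as exc:
                return "Could not read logs of pod {}: {}".format(pod.metadata.name, exc)
    return "No pod found for job {}".format(job_name)


# Hypothetical usage with the job from this report:
# print(fetch_job_pod_logs("reana-run-job-d60a2b71-c263-46c5-8bb1-06793e9083b4"))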

Expected behaviour

The user should see in the logs that the workflow failed due to an XRootD "Operation expired" data access error.

Note

It looks like we may need to handle exceptional situations more carefully when capturing Dask worker logs and REANA job logs and merging them, so that the job error is not lost in the process; see the sketch below.
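
For example, a defensive merge along the following lines would keep the job traceback visible even if one of the two log sources cannot be fetched. This is only a sketch under assumptions: the callables get_job_logs and get_dask_worker_logs are hypothetical stand-ins for whatever retrieval REANA actually performs.

import logging

# Banner taken from the captured log output above.
DASK_BANNER = "----------------         DASK WORKER LOGS          ----------------"


def merge_job_and_dask_logs(get_job_logs, get_dask_worker_logs):
    """Concatenate job and Dask worker logs without losing one because the other failed."""
    sections = []
    try:
        sections.append(get_job_logs())
    except Exception:
        logging.exception("Could not retrieve job logs")
        sections.append("(job logs could not be retrieved)")
    sections.append(DASK_BANNER)
    try:
        sections.append(get_dask_worker_logs())
    except Exception:
        logging.exception("Could not retrieve Dask worker logs")
        sections.append("(Dask worker logs could not be retrieved)")
    return "\n\n".join(sections)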
