Current behaviour
When a workflow using Dask fails, the errors may not be fully captured and exposed back to users.
Here is one observation from running the reana-demo-dask-coffea example.
The status shows that the workflow failed:
```console
$ rcg status -w dask-coffea-serial-kubernetes
NAME                            RUN_NUMBER   CREATED               STARTED               ENDED                 STATUS   PROGRESS
dask-coffea-serial-kubernetes   1            2025-01-28T10:02:38   2025-01-28T10:02:51   2025-01-28T10:04:03   failed   0/1
```
The logs likewise report the failure, but without surfacing its cause:
```console
$ rcg logs -w dask-coffea-serial-kubernetes
==> Workflow engine logs
2025-01-28 10:02:51,384 | root | MainThread | INFO | Publishing step:0, cmd: python analysis.py, total steps 1 to MQ
2025-01-28 10:04:03,456 | root | MainThread | INFO | Workflow a98dfa47-b7a2-4979-b75f-44ee4b224776 finished. Files available at /var/reana/users/00000000-0000-0000-0000-000000000000/workflows/a98dfa47-b7a2-4979-b75f-44ee4b224776.

==> Job logs
==> Step: process
==> Workflow ID: a98dfa47-b7a2-4979-b75f-44ee4b224776
==> Compute backend: Kubernetes
==> Job ID: reana-run-job-d60a2b71-c263-46c5-8bb1-06793e9083b4
==> Docker image: docker.io/coffeateam/coffea-dask-cc7:0.7.22-py3.10-g7f049
==> Command: python analysis.py
==> Status: failed
==> Started: 2025-01-28T10:02:51
==> Finished: 2025-01-28T10:03:59
==> Logs:
...
-------------------------------------------------------------------
-------------------------------------------------------------------
----------------        DASK WORKER LOGS        ----------------
-------------------------------------------------------------------
-------------------------------------------------------------------
/usr/local/lib/python3.10/site-packages/distributed/cli/dask_worker.py:264: FutureWarning: dask-worker is deprecated and will be removed in a future release; use `dask worker` instead
  warnings.warn(
2025-01-28 10:02:56,570 - distributed.nanny - INFO - Start Nanny at: 'tcp://10.244.0.62:44707'
2025-01-28 10:02:57,354 - distributed.worker - INFO - Start worker at: tcp://10.244.0.62:41989
2025-01-28 10:02:57,354 - distributed.worker - INFO - Listening to: tcp://10.244.0.62:41989
2025-01-28 10:02:57,354 - distributed.worker - INFO - Worker name: reana-run-dask-a98dfa47-default-worker-1b35478060
2025-01-28 10:02:57,354 - distributed.worker - INFO - dashboard at: 10.244.0.62:8788
2025-01-28 10:02:57,354 - distributed.worker - INFO - Waiting to connect to: tcp://reana-run-dask-a98dfa47-scheduler.default.svc.cluster.local:8786
2025-01-28 10:02:57,354 - distributed.worker - INFO - -------------------------------------------------
2025-01-28 10:02:57,354 - distributed.worker - INFO - Threads: 16
2025-01-28 10:02:57,354 - distributed.worker - INFO - Memory: 2.00 GiB
2025-01-28 10:02:57,354 - distributed.worker - INFO - Local Directory: /tmp/dask-worker-space/worker-qrbfmwwd
2025-01-28 10:02:57,354 - distributed.worker - INFO - -------------------------------------------------
2025-01-28 10:02:57,360 - distributed.worker - INFO - Registered to: tcp://reana-run-dask-a98dfa47-scheduler.default.svc.cluster.local:8786
2025-01-28 10:02:57,360 - distributed.worker - INFO - -------------------------------------------------
2025-01-28 10:02:57,361 - distributed.core - INFO - Starting established connection to tcp://reana-run-dask-a98dfa47-scheduler.default.svc.cluster.local:8786
/usr/local/lib/python3.10/site-packages/coffea/nanoevents/schemas/nanoaod.py:201: RuntimeWarning: Missing cross-reference index for Electron_photonIdx => Photon
  warnings.warn(
/usr/local/lib/python3.10/site-packages/coffea/nanoevents/schemas/nanoaod.py:201: RuntimeWarning: Missing cross-reference index for FatJet_subJetIdx1 => SubJet
  warnings.warn(
/usr/local/lib/python3.10/site-packages/coffea/nanoevents/schemas/nanoaod.py:201: RuntimeWarning: Missing cross-reference index for FatJet_subJetIdx2 => SubJet
  warnings.warn(
/usr/local/lib/python3.10/site-packages/coffea/nanoevents/schemas/nanoaod.py:201: RuntimeWarning: Missing cross-reference index for FsrPhoton_muonIdx => Muon
  warnings.warn(
/usr/local/lib/python3.10/site-packages/coffea/nanoevents/schemas/nanoaod.py:201: RuntimeWarning: Missing cross-reference index for Jet_electronIdx1 => Electron
  warnings.warn(
/usr/local/lib/python3.10/site-packages/coffea/nanoevents/schemas/nanoaod.py:201: RuntimeWarning: Missing cross-reference index for Jet_electronIdx2 => Electron
  warnings.warn(
/usr/local/lib/python3.10/site-packages/coffea/nanoevents/schemas/nanoaod.py:201: RuntimeWarning: Missing cross-reference index for Jet_muonIdx1 => Muon
  warnings.warn(
/usr/local/lib/python3.10/site-packages/coffea/nanoevents/schemas/nanoaod.py:201: RuntimeWarning: Missing cross-reference index for Jet_muonIdx2 => Muon
  warnings.warn(
/usr/local/lib/python3.10/site-packages/coffea/nanoevents/schemas/nanoaod.py:201: RuntimeWarning: Missing cross-reference index for Muon_fsrPhotonIdx => FsrPhoton
  warnings.warn(
/usr/local/lib/python3.10/site-packages/coffea/nanoevents/schemas/nanoaod.py:201: RuntimeWarning: Missing cross-reference index for Photon_electronIdx => Electron
  warnings.warn(
```
However, when one consults the Kubernetes logs of the job pod, one can see the root cause:
```console
$ k logs reana-run-job-d60a2b71-c263-46c5-8bb1-06793e9083b4-rvwr9
Defaulted container "job" out of: job, krb5-renew, krb5-init (init)
Matplotlib created a temporary cache directory at /tmp/matplotlib-t6r42k3b because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.
[###### Traceback (most recent call last):
  File "/var/reana/users/00000000-0000-0000-0000-000000000000/workflows/a98dfa47-b7a2-4979-b75f-44ee4b224776/analysis.py", line 65, in <module>
    output, metrics = run(fileset, "Events", processor_instance=Processor())
  File "/usr/local/lib/python3.10/site-packages/coffea/processor/executor.py", line 1700, in __call__
    wrapped_out = self.run(fileset, processor_instance, treename)
  File "/usr/local/lib/python3.10/site-packages/coffea/processor/executor.py", line 1848, in run
    wrapped_out, e = executor(chunks, closure, None)
  File "/usr/local/lib/python3.10/site-packages/coffea/processor/executor.py", line 974, in __call__
    else _decompress(work.result())
  File "/usr/local/lib/python3.10/site-packages/distributed/client.py", line 287, in result
    raise exc.with_traceback(tb)
  File "/usr/local/lib/python3.10/site-packages/coffea/processor/executor.py", line 221, in __call__
    out = self.function(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/coffea/processor/executor.py", line 1367, in automatic_retries
    raise e
  File "/usr/local/lib/python3.10/site-packages/coffea/processor/executor.py", line 1336, in automatic_retries
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/coffea/processor/executor.py", line 1575, in _work_function
    filecontext = uproot.open(
  File "/usr/local/lib/python3.10/site-packages/uproot/reading.py", line 141, in open
    file = ReadOnlyFile(
  File "/usr/local/lib/python3.10/site-packages/uproot/reading.py", line 580, in __init__
    self._source = Source(
  File "/usr/local/lib/python3.10/site-packages/uproot/source/xrootd.py", line 272, in __init__
    self._open()
  File "/usr/local/lib/python3.10/site-packages/uproot/source/xrootd.py", line 275, in _open
    self._resource = XRootDResource(self._file_path, self._timeout)
  File "/usr/local/lib/python3.10/site-packages/uproot/source/xrootd.py", line 83, in __init__
    self._open()
  File "/usr/local/lib/python3.10/site-packages/uproot/source/xrootd.py", line 92, in _open
    self._xrd_error(status)
  File "/usr/local/lib/python3.10/site-packages/uproot/source/xrootd.py", line 118, in _xrd_error
    raise OSError(
OSError: XRootD error: [ERROR] Operation expired
in file root://eospublic.cern.ch//eos/root-eos/benchmark/Run2012B_SingleMu.root
```
This error does not seem to have been captured and exposed back to the user.
Expected behaviour
The user should see in the logs that the workflow failed due to an XRootD "Operation expired" data access error.
Note
It looks like we may need to handle exceptional situations when capturing both Dask worker logs and REANA job logs and merging them.
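As a rough illustration of such defensive merging, each log source could be collected independently, so that a failure while fetching one source (e.g. a Dask worker pod that is already gone) does not swallow the error recorded by another. A minimal sketch; the function names and banner format are illustrative, not the actual REANA implementation:

```python
from typing import Callable

def merge_logs(sources: dict[str, Callable[[], str]]) -> str:
    """Concatenate logs from several sources, keeping whatever is
    retrievable even if fetching some sources raises."""
    parts = []
    for name, fetch in sources.items():
        try:
            content = fetch()
        except Exception as exc:  # keep going on any per-source failure
            content = f"(could not retrieve {name}: {exc})"
        parts.append(f"==> {name}\n{content}")
    return "\n".join(parts)

def broken_fetch() -> str:
    # Simulates a source that fails, e.g. a deleted worker pod.
    raise RuntimeError("worker pod already deleted")

merged = merge_logs({
    "Job logs": lambda: "OSError: XRootD error: [ERROR] Operation expired",
    "Dask worker logs": broken_fetch,
})
print(merged)
```

With this approach, the XRootD error from the job pod would survive in the merged output even when the Dask worker logs cannot be retrieved.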