-
Notifications
You must be signed in to change notification settings - Fork 1
Closed
Labels
bug:stall reportReport of a backup2datalad run getting stuck or stallingReport of a backup2datalad run getting stuck or stalling
Description
I noticed that https://github.com/dandi/dandisets says that last commit was 3 weeks ago. But we have lots of recently updated dandisets:

on drogon I see only some not pushed (why?) updates to 000026 in dandisets superdataset
dandi@drogon:/mnt/backup/dandi/dandisets$ git log github/draft..draft
commit a60c7bf88d82a350f538bd83594c163a3b29e51e (HEAD -> draft)
Author: DANDI Team <team@dandiarchive.org>
Date: Wed Mar 27 03:25:33 2024 -0400
1 updated (000026)
000026:
- [backups2datalad] 22 files added, 10 files updated
commit df1dce3a6fe1aa954dd818ce54420ab31b07e71d
Author: DANDI Team <team@dandiarchive.org>
Date: Tue Mar 26 04:10:35 2024 -0400
1 updated (000026)
000026:
- [backups2datalad] 229 files added, 29 files updated
and git status shows that MANY dandisets have uncommitted changes (new commits OR modified content!!) on drogon
dandi@drogon:/mnt/backup/dandi/dandisets$ git status
Refresh index: 100% (612/612), done.
On branch draft
Your branch is ahead of 'github/draft' by 2 commits.
(use "git push" to publish your local commits)
Changes not staged for commit:
(use "git add <file>..." to update what will be committed)
(use "git restore <file>..." to discard changes in working directory)
(commit or discard the untracked or modified content in submodules)
modified: 000018 (new commits)
modified: 000024 (new commits)
modified: 000029 (modified content)
modified: 000030 (new commits)
modified: 000031 (new commits)
modified: 000032 (new commits)
modified: 000033 (new commits)
modified: 000038 (new commits)
modified: 000040 (new commits)
modified: 000042 (new commits)
...so most likely recent changes on handling embargoed/public dandisets broke this...
Then I looked at ps on drogon -- and apparently there is still a process which runs in --verify mode since Mar 07!
As such it seems to be stuck:
dandi 2645263 0.0 0.0 2484 404 ? Ss Mar07 0:00 /bin/sh -c chronic flock -E 0 -e -n /home/dandi/.run/backup2datalad-cron-nonzarr.lock bash -c '/mnt/backup/dandi/dandisets/tools/backups2datalad-update-cron --mode verify'
dandi 2645264 0.0 0.0 19748 5512 ? S Mar07 7:09 /usr/bin/perl /usr/bin/chronic flock -E 0 -e -n /home/dandi/.run/backup2datalad-cron-nonzarr.lock bash -c /mnt/backup/dandi/dandisets/tools/backups2datalad-update-cron --mode verify
dandi 2645317 0.0 0.0 5548 544 ? S Mar07 0:00 /usr/bin/flock -E 0 -e -n /home/dandi/.run/backup2datalad-cron-nonzarr.lock bash -c /mnt/backup/dandi/dandisets/tools/backups2datalad-update-cron --mode verify
dandi 2645318 0.0 0.0 6952 2536 ? S Mar07 0:00 /bin/bash /mnt/backup/dandi/dandisets/tools/backups2datalad-update-cron --mode verify
dandi 2650110 0.0 0.0 6952 1240 ? S Mar07 0:00 /bin/bash /mnt/backup/dandi/dandisets/tools/backups2datalad-update-cron --mode verify
dandi 2650139 0.0 0.3 1790364 200180 ? Sl Mar07 1:21 /home/dandi/miniconda3/envs/dandisets-2/bin/python3.1 /home/dandi/miniconda3/envs/dandisets-2/bin/backups2datalad -l WARNING --backup-root /mnt/backup/dandi --config tools/backups2datalad.cfg.yaml update-from-backup --workers 5 -e 000(026|108|243)$ --mode verify
dandi 2650140 0.0 0.0 6384 2272 ? S Mar07 0:00 grep -v nothing to save, working tree clean
py-spy'ing that process gives no useful detail
then I recalled that we have logs! and this is what I see
(dandisets-2) dandi@drogon:/mnt/backup/dandi/dandisets$ tail -f .git/dandi/backups2datalad/2024.03.07.06.50.32Z.log
2024-03-29T03:12:15-0400 [WARNING ] backups2datalad: Retrying GET request to /dandisets/000729/versions/draft/ in 4292.823660 seconds as it raised PoolTimeout:
2024-03-29T03:34:24-0400 [WARNING ] backups2datalad: Retrying GET request to /dandisets/000726/versions/draft/ in 8515.649940 seconds as it raised PoolTimeout:
2024-03-29T04:12:28-0400 [WARNING ] backups2datalad: Retrying GET request to /dandisets/000727/versions/draft/ in 7882.410218 seconds as it raised PoolTimeout:
2024-03-29T04:19:15-0400 [WARNING ] backups2datalad: Retrying GET request to /dandisets/000728/versions/draft/ in 8402.162677 seconds as it raised PoolTimeout:
2024-03-29T04:24:48-0400 [WARNING ] backups2datalad: Retrying GET request to /dandisets/000729/versions/draft/ in 8413.961004 seconds as it raised PoolTimeout:
2024-03-29T05:34:07-0400 [WARNING ] backups2datalad: Retrying GET request to /dandisets/000725/versions/draft/ in 15890.092814 seconds as it raised PoolTimeout:
2024-03-29T05:57:19-0400 [WARNING ] backups2datalad: Retrying GET request to /dandisets/000726/versions/draft/ in 17122.589867 seconds as it raised PoolTimeout:
2024-03-29T06:24:50-0400 [WARNING ] backups2datalad: Retrying GET request to /dandisets/000727/versions/draft/ in 17075.343311 seconds as it raised PoolTimeout:
2024-03-29T06:40:18-0400 [WARNING ] backups2datalad: Retrying GET request to /dandisets/000728/versions/draft/ in 16793.199329 seconds as it raised PoolTimeout:
2024-03-29T06:46:02-0400 [WARNING ] backups2datalad: Retrying GET request to /dandisets/000729/versions/draft/ in 17087.073695 seconds as it raised PoolTimeout:
some seems to be open (000726) and some embargoed (000729) but even for embargoed endpoint seems to work just fine:
❯ curl -X 'GET' 'https://api.dandiarchive.org/api/dandisets/000729/versions/draft/' -H 'accept: application/json'
{"detail":"Authentication credentials were not provided."}%
so seems like a bug somewhere.
Metadata
Metadata
Assignees
Labels
bug:stall reportReport of a backup2datalad run getting stuck or stallingReport of a backup2datalad run getting stuck or stalling

