Commit 98518de

Merge pull request #215 from ISISComputingGroup/adr_7
Draft ADR 7 (Document decisions relating to file archiving)
2 parents 7b8d6b9 + b0c10db commit 98518de

4 files changed

+290
-2
lines changed
Lines changed: 283 additions & 0 deletions
@@ -0,0 +1,283 @@
# 7. Output file archiving

## Status

Draft

## Context

Our bluesky implementation contains bluesky callbacks which produce scientist-facing output files, for example:
- [Human-readable scan result files](/callbacks/file_writing): {py:obj}`HumanReadableFileCallback <ibex_bluesky_core.callbacks.HumanReadableFileCallback>`
- [Fitting results](/fitting/livefit_logger): {py:obj}`LiveFitLogger <ibex_bluesky_core.callbacks.LiveFitLogger>`
- [Plot PNGs](#plot_png_saver): {py:obj}`PlotPNGSaver <ibex_bluesky_core.callbacks.PlotPNGSaver>`

In addition, we have a [developer-facing callback for diagnostics](/callbacks/docs_logging_callback),
{py:obj}`DocLoggingCallback <ibex_bluesky_core.callbacks.DocLoggingCallback>`.

The above callbacks produce files on disk in response to a bluesky scan. These files contain valuable data and so we
need to consider how these files are archived for the long term. This must align with the
[ISIS Data Policy](https://www.isis.stfc.ac.uk/pages/data-policy.aspx). We should make an attempt to align with
[FAIR principles](https://www.go-fair.org/fair-principles/).

According to the definitions in the [ISIS Data Policy](https://www.isis.stfc.ac.uk/pages/data-policy.aspx), the data
generated by bluesky is generally either "facility generated reduced data" or "metadata".

This ADR is concerned with the location in which these bluesky output files are stored, and the archiving infrastructure
which is therefore used to keep these files for the long term.

---

At the time of writing this ADR, in June 2025, the scientist-facing files are being written to
```
...\inst$\<instrument>\user\bluesky_scans\<rb_number>\
```

This location has some disadvantages:
- It is a network location, which means that a site network break will cause bluesky scans to fail to run
- It is not a location designed for long-term scientifically useful data - for example in terms of data integrity
- It is not necessarily accessible from downstream systems such as Topcat

Therefore, we would like to define a different, more suitable, location into which bluesky output files can be written.

---

Some representative use-cases are presented below, showing how data is expected to be used by scientists (click to
expand each use case):

<details>
<summary>1 Bluesky scan, no neutron runs (e.g. scanning against a block)</summary>

```{mermaid}
sequenceDiagram
actor PI
participant NDX
participant Archive
participant TopCat
note over PI:Start of RBNumber experiment
PI ->> NDX: Start bluesky scan
note over PI: Time Passes
note over NDX: Bluesky scan ends
note over NDX: creates scan.ascii and scan.nxs
NDX ->> Archive: Sends scan.ascii and scan.nxs
TopCat ->> Archive: Collects scan.ascii and scan.nxs
note over PI: 5 months later
PI ->> TopCat: Show me my data
TopCat ->> PI: Provides access to scan.ascii and scan.nxs
note over PI: 1 year later
PI ->> TopCat: Show me my data
TopCat ->> PI: Provides access to scan.nxs
```
</details>

<details>
<summary>1 Bluesky scan, aborted neutron runs</summary>

```{mermaid}
sequenceDiagram
actor PI
participant NDX
participant Archive
participant TopCat as Online Catalogue
note over PI:Start of RBNumber experiment
PI ->> NDX: Start bluesky scan
note over NDX: DAE run started by scan <br/> Time passes <br/> Required data gathered in scan documents <br/> Abort DAE run
note over NDX: DAE run started by scan <br/> Time passes <br/> Required data gathered in scan documents <br/> Abort DAE run
note over NDX: DAE run started by scan <br/> Time passes <br/> Required data gathered in scan documents <br/> Abort DAE run
note over NDX: Bluesky scan ends
note over NDX: creates scan.ascii and scan.nxs
NDX ->> Archive: Sends scan.ascii and scan.nxs
TopCat ->> Archive: Collects scan.ascii and scan.nxs
note over PI: 5 months later
PI ->> TopCat: Show me my data
TopCat ->> PI: Provides access to scan.ascii and scan.nxs
note over PI: 1 year later
PI ->> TopCat: Show me my data
TopCat ->> PI: Provides access to scan.nxs
```
</details>

<details>
<summary>1 Bluesky scan, one neutron run</summary>

```{mermaid}
sequenceDiagram
actor PI
participant NDX
participant Archive
participant TopCat
note over PI:Start of RBNumber experiment
PI ->> NDX: Start bluesky scan
note over NDX: Bluesky scan starts DAE run
note over PI: Time Passes
note over NDX: Bluesky scan ends DAE run <br/> Bluesky scan ends
par
note over NDX: creates runnumber.nxs with DAE and SE data
and
note over NDX: creates scan.ascii and scan.nxs
end
NDX ->> Archive: Sends runnumber.nxs, scan.ascii, and scan.nxs
TopCat ->> Archive: Collects runnumber.nxs, scan.ascii, and scan.nxs
note over PI: 5 months later
PI ->> TopCat: Show me my data
TopCat ->> PI: Provides access to runnumber.nxs, scan.ascii, and scan.nxs
note over PI: 1 year later
PI ->> TopCat: Show me my data
TopCat ->> PI: Provides access to runnumber.nxs and scan.nxs
```
</details>

<details>
<summary>1 Bluesky scan, N neutron runs</summary>

```{mermaid}
sequenceDiagram
actor PI
participant NDX
participant Archive
participant TopCat
note over PI:Start of RBNumber experiment
PI ->> NDX: Start bluesky scan
note over NDX: Bluesky scan starts DAE run
note over PI: Time Passes
note over NDX: Bluesky scan ends DAE run
note over NDX: creates runnumber.nxs with DAE and SE data
NDX ->> Archive: Sends runnumber.nxs
TopCat ->> Archive: Collects runnumber.nxs
note over PI: Time Passes
note over NDX: Bluesky scan starts DAE run
note over PI: Time Passes
note over NDX: Bluesky scan ends DAE run
note over NDX: creates runnumber+1.nxs with DAE and SE data
NDX ->> Archive: Sends runnumber+1.nxs
TopCat ->> Archive: Collects runnumber+1.nxs
note over NDX: Bluesky scan ends
NDX ->> Archive: Sends scan.ascii and scan.nxs
TopCat ->> Archive: Collects scan.ascii and scan.nxs
note over PI: 5 months later
PI ->> TopCat: Show me my data
TopCat ->> PI: Provides access to runnumber.nxs, runnumber+1.nxs, scan.ascii, and scan.nxs
note over PI: 1 year later
PI ->> TopCat: Show me my data
TopCat ->> PI: Provides access to runnumber.nxs, runnumber+1.nxs, and scan.nxs
```
</details>

<details>
<summary>1 Bluesky scan, neutron/muon runs on multiple instruments</summary>

```{mermaid}
sequenceDiagram
actor PI
participant NDX-A
participant NDX-B
participant NDX-C
participant Archive
participant TopCat
note over PI:Start of RBNumber experiment
PI ->> NDX-A: Start bluesky scan
NDX-A ->> NDX-B: Start DAE run
NDX-A ->> NDX-C: Start DAE run
note over PI: Time Passes
NDX-B ->> NDX-A: Provides summary run data
NDX-C ->> NDX-A: Provides summary run data
NDX-A ->> NDX-B: End DAE run
note over NDX-B: creates runnumberB.nxs with DAE and SE data
NDX-B ->> Archive: Sends runnumberB.nxs
TopCat ->> Archive: Collects runnumberB.nxs
NDX-A ->> NDX-C: End DAE run
note over NDX-C: creates runnumberC.nxs with DAE and SE data
NDX-C ->> Archive: Sends runnumberC.nxs
TopCat ->> Archive: Collects runnumberC.nxs
note over NDX-A: Bluesky scan ends
NDX-A ->> Archive: Sends scan.ascii and scan.nxs
TopCat ->> Archive: Collects scan.ascii and scan.nxs
note over PI: 5 months later
PI ->> TopCat: Show me my data
TopCat ->> PI: Provides access to runnumberB.nxs, runnumberC.nxs, scan.ascii, and scan.nxs
note over PI: 1 year later
PI ->> TopCat: Show me my data
TopCat ->> PI: Provides access to runnumberB.nxs, runnumberC.nxs, and scan.nxs
```
</details>

## Present

The following people have been involved in discussions leading up to this ADR:

- Tom
- Chris M-S
- George
- Kathryn
- Jack H
- CK (Reflectometry)

This document was additionally reviewed in a regular Thursday code-review slot by the whole IBEX team.

## Decisions

### File-writing location

Bluesky should write data into the `c:\data\RB<rb_number>\bluesky_scans\` folder during a scan.
File naming itself will keep its current scheme (timestamped files).

This location was chosen because it mirrors the archiving setup used by neutron cameras on IMAT.

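
As a rough illustration of the folder layout decided above, the sketch below builds the per-experiment output folder. The helper name `scan_output_dir` and its `data_root` parameter are hypothetical and are not part of `ibex_bluesky_core`; file naming is left out because it keeps its existing scheme.

```python
from pathlib import PureWindowsPath


def scan_output_dir(rb_number: str, data_root: str = r"c:\data") -> PureWindowsPath:
    """Return the folder that bluesky output files for one experiment go into.

    Hypothetical helper: it only mirrors the layout described in this ADR,
    i.e. <data_root>/RB<rb_number>/bluesky_scans.
    """
    return PureWindowsPath(data_root) / f"RB{rb_number}" / "bluesky_scans"


# Example: the folder for a made-up RB number.
print(scan_output_dir("12345"))
```

`PureWindowsPath` is used so that the path is formatted with Windows separators regardless of the platform the sketch is run on.
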
### Attributes & checksums

Bluesky should mark files as read-only, using Windows file attributes, when it has finished writing them. This is so
that the archiving process can unambiguously tell whether a file has finished being written. It also reduces the
likelihood that a file is accidentally modified.

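
One possible way to set and check the read-only marker from Python is sketched below. The function names `mark_finished` and `is_finished` are hypothetical. The sketch relies on the documented behaviour that, on Windows, `os.chmod` can only change a file's read-only flag (via `stat.S_IREAD` and `stat.S_IWRITE`).

```python
import os
import stat
from pathlib import Path


def mark_finished(path: Path) -> None:
    """Mark a fully written file as read-only (hypothetical helper)."""
    # On Windows this sets the read-only attribute; on POSIX it leaves
    # only the owner-read permission bit set.
    os.chmod(path, stat.S_IREAD)


def is_finished(path: Path) -> bool:
    """Return True if the file no longer has the owner-write bit set."""
    return not (path.stat().st_mode & stat.S_IWRITE)
```

A writer would call `mark_finished` as its last step, and the archiving process would skip any file for which `is_finished` returns `False`.
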
Checksums should be generated, either at the point when the data is initially generated, or by the archiving process
just before it first copies or moves a file.

We have agreed that checksums should be generated for bluesky data, as is already done for DAE data. These checksums are
useful to check for data corruption, which might occur in transit, or in-place on instrument computers or archive servers.
A number of checksumming approaches have been considered, and no approach has been chosen yet. The options discussed
are:
- **Use Windows alternate data streams**. This is how checksums are done in existing DAE `.raw` files. It has the
  advantage that it is relatively simple to implement, but the disadvantage that alternate streams do not map nicely
  onto Linux file systems.
- **Generate one checksum per file**, for example `file.txt` would also have an associated `file.sha1.txt` containing the
  checksum. The advantage is that this is simple to implement and platform-agnostic. The disadvantage is that it doubles
  the number of files visible in the archive area.
- **Generate a single checksum file** containing the checksums of all bluesky data, at a coarser granularity (for
  example by RB number or by cycle). It is currently unclear exactly how this approach would be implemented, and at what
  point these checksums would be moved to the archive.

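
To make the second option (one checksum per file) concrete, here is a minimal sketch using only the Python standard library. It is illustrative only, since no checksumming approach has been chosen: the function names are hypothetical, and the sidecar content (a bare hex digest followed by a newline) is an assumption rather than an agreed format.

```python
import hashlib
from pathlib import Path


def write_sha1_sidecar(data_file: Path) -> Path:
    """Write <stem>.sha1.txt next to data_file and return the sidecar path."""
    digest = hashlib.sha1()
    with data_file.open("rb") as f:
        # Hash in chunks so that large files are not loaded into memory at once.
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    # file.txt -> file.sha1.txt, following the naming example in the text.
    sidecar = data_file.with_name(f"{data_file.stem}.sha1.txt")
    sidecar.write_text(digest.hexdigest() + "\n")
    return sidecar


def verify_sha1_sidecar(data_file: Path, sidecar: Path) -> bool:
    """Return True if data_file still matches the digest stored in sidecar."""
    expected = sidecar.read_text().split()[0]
    return hashlib.sha1(data_file.read_bytes()).hexdigest() == expected
```

The same verification could be run on the instrument before the move and again on the archive after it, which covers both the in-transit and in-place corruption cases mentioned above.
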
### Moving to the ISIS archive

An automated cron task will look for read-only Bluesky output files, and their associated checksums, in `c:\data` at
regular short intervals (for example, 1 minute), and will move them to:
- The ISIS data archive, under `autoreduced/bluesky_scans`. The `autoreduced` folder already exists on the archive.
- The data cache disk on the instrument, under `c:\data\Export only\RB<rb_number>\bluesky_scans`.

Data on the cache disk, under `Export only`, is kept on the instrument for a short period (usually 24 hours), and then
deleted by existing processes.

This is run as a cron task so that, if the network happens to be unavailable at the time when a scan ends, the copy
process will catch up when the network becomes available again. This cron task will only move files which sit within
a `bluesky_scans` folder, to prevent it from interfering with other non-bluesky files.

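
The behaviour described above could be sketched as a single pass of the task, as below. This is not the actual archiving script: the function name, the choice to preserve each file's path relative to the source root at every destination, and the rule of skipping files already under a destination are all assumptions made for illustration.

```python
import os
import shutil
import stat
from pathlib import Path


def move_finished_files(source_root: Path, destinations: list[Path]) -> list[Path]:
    """Move read-only files found in bluesky_scans folders to every destination.

    Returns the moved paths, relative to source_root. A file whose copy fails
    (for example because the network is down) is left in place for the next run.
    """
    moved = []
    # Take a snapshot of the tree first, because files are deleted while looping.
    for path in sorted(source_root.rglob("*")):
        if not path.is_file():
            continue
        relative = path.relative_to(source_root)
        # Only touch files that sit within a bluesky_scans folder.
        if "bluesky_scans" not in relative.parent.parts:
            continue
        # Assumption: never re-process files already placed in a destination.
        if any(path.is_relative_to(dest) for dest in destinations):
            continue
        # Files that are still writable are assumed to be in progress.
        if path.stat().st_mode & stat.S_IWRITE:
            continue
        try:
            for dest in destinations:
                target = dest / relative
                target.parent.mkdir(parents=True, exist_ok=True)
                shutil.copyfile(path, target)
        except OSError:
            continue  # Retry on the next scheduled run.
        # Clear the read-only flag so that the source can be deleted on Windows.
        os.chmod(path, stat.S_IWRITE)
        path.unlink()
        moved.append(relative)
    return moved
```

Because a failed copy leaves the source file untouched, running this repeatedly gives the catch-up behaviour described above without any extra bookkeeping.
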
Creating a new `bluesky_scans` folder alongside the existing `autoreduced` folder was considered, but was felt to be
unachievable - it would require too much work relative to using the existing `autoreduced` folder.

### File formats

At present, our scan file output format is explicitly designed to be "human-readable" (and, in fact, the callback which
generates these files is called
{py:obj}`HumanReadableFileCallback <ibex_bluesky_core.callbacks.HumanReadableFileCallback>`).

[Issue 26](https://github.com/ISISComputingGroup/ibex_bluesky_core/issues/26) will implement
machine-readable files, using a format such as `.hdf5` or `.nxs`. These files will sit alongside the existing
human-readable files; it is acknowledged that while machine-readable files are better from a data preservation and
archiving standpoint, we will need to retain the human-readable files to support quick browsing by scientists without
using special software.

## Consequences

- Bluesky output data will be stored in a location suitable for long-term, scientifically useful, data. This addresses
  data integrity and availability concerns.
- Bluesky scans will no longer be reliant on a network location being available to run a scan.
- The initial location where bluesky writes data (`c:\data\RB<rb_number>\bluesky_scans`) will not be the same as its
  final location (the `autoreduced` folder on the ISIS archive). This is also true for current DAE data, as generated
  by the ISISICP.

doc/callbacks/plotting.md

Lines changed: 1 addition & 0 deletions
@@ -83,6 +83,7 @@ Due to an implementation detail of {py:obj}`matplotlib.pyplot.pcolormesh`,
 the plot will only appear once at least *two* rows of data have been collected.
 :::
 
+{#plot_png_saver}
 ## Saving plots to PNG files
 
 `ibex_bluesky_core` provides a {py:obj}`PlotPNGSaver<ibex_bluesky_core.callbacks.PlotPNGSaver>` callback to save plots on a run stop to PNG files, which saves them to the default output file location unless a filepath is explicitly given.

doc/conf.py

Lines changed: 5 additions & 2 deletions
@@ -29,7 +29,7 @@
     ("py:obj", r"^.*\.T.*_co$"),
 ]
 
-myst_enable_extensions = ["dollarmath", "strikethrough", "colon_fence"]
+myst_enable_extensions = ["dollarmath", "strikethrough", "colon_fence", "attrs_block"]
 suppress_warnings = ["myst.strikethrough"]
 
 extensions = [
@@ -43,7 +43,10 @@
     "sphinx.ext.intersphinx",
     # Add links to source code in API docs
     "sphinx.ext.viewcode",
+    # Mermaid diagrams
+    "sphinxcontrib.mermaid",
 ]
+mermaid_d3_zoom = True
 napoleon_google_docstring = True
 napoleon_numpy_docstring = False
 
@@ -70,7 +73,7 @@
 html_favicon = "favicon.svg"
 
 autoclass_content = "both"
-myst_heading_anchors = 3
+myst_heading_anchors = 7
 autodoc_preserve_defaults = True
 
 intersphinx_mapping = {

pyproject.toml

Lines changed: 1 addition & 0 deletions
@@ -63,6 +63,7 @@ doc = [
     "sphinx_rtd_theme",
     "myst_parser",
     "sphinx-autobuild",
+    "sphinxcontrib-mermaid",
 ]
 dev = [
     "ibex_bluesky_core[doc]",
