|
| 1 | +# 7. Output file archiving |
| 2 | + |
| 3 | +## Status |
| 4 | + |
| 5 | +Draft |
| 6 | + |
| 7 | +## Context |
| 8 | + |
| 9 | +Our bluesky implementation contains bluesky callbacks which produce scientist-facing output files, for example: |
| 10 | +- [Human-readable scan result files](/callbacks/file_writing): {py:obj}`HumanReadableFileCallback <ibex_bluesky_core.callbacks.HumanReadableFileCallback>` |
| 11 | +- [Fitting results](/fitting/livefit_logger): {py:obj}`LiveFitLogger <ibex_bluesky_core.callbacks.LiveFitLogger>` |
| 12 | +- [Plot PNGs](#plot_png_saver): {py:obj}`PlotPNGSaver <ibex_bluesky_core.callbacks.PlotPNGSaver>` |
| 13 | + |
| 14 | +In addition, we have a [developer-facing callback for diagnostics](/callbacks/docs_logging_callback), |
| 15 | +{py:obj}`DocLoggingCallback <ibex_bluesky_core.callbacks.DocLoggingCallback>`. |
| 16 | + |
| 17 | +The above callbacks produce files on disk in response to a bluesky scan. These files contain valuable data and so we |
| 18 | +need to consider how these files are archived for the long term. This must align with the |
| 19 | +[ISIS Data Policy](https://www.isis.stfc.ac.uk/pages/data-policy.aspx). We should make an attempt to align with |
| 20 | +[FAIR principles](https://www.go-fair.org/fair-principles/). |
| 21 | + |
| 22 | +According to the definitions in the [ISIS Data Policy](https://www.isis.stfc.ac.uk/pages/data-policy.aspx), the data |
| 23 | +generated by bluesky is generally either "facility generated reduced data" or "metadata". |
| 24 | + |
| 25 | +This ADR is concerned with the location in which these bluesky output files are stored, and the archiving infrastructure |
| 26 | +which is therefore used to keep these files for the long term. |
| 27 | + |
| 28 | +--- |
| 29 | + |
| 30 | +At the time of writing this ADR, in June 2025, the scientist-facing files are being written to |
| 31 | +``` |
| 32 | +...\inst$\<instrument>\user\bluesky_scans\<rb_number>\ |
| 33 | +``` |
| 34 | + |
| 35 | +This location has some disadvantages: |
| 36 | +- It is a network location, which means that a site network break will cause bluesky scans to fail to run |
| 37 | +- It is not a location designed for long-term scientifically useful data - for example in terms of data integrity |
| 38 | +- It is not necessarily accessible from downstream systems such as TopCat |
| 39 | + |
| 40 | +Therefore, we would like to define a different, more suitable, location into which bluesky output files can be written. |
| 41 | + |
| 42 | +--- |
| 43 | + |
| 44 | +Some representative use-cases are presented below, showing how data is expected to be used by scientists (click to |
| 45 | +expand each use case): |
| 46 | + |
| 47 | +<details> |
| 48 | +<summary>1 Bluesky scan, no neutron runs (e.g. scanning against a block)</summary> |
| 49 | + |
| 50 | +```{mermaid} |
| 51 | +sequenceDiagram |
| 52 | +actor PI |
| 53 | +participant NDX |
| 54 | +participant Archive |
| 55 | +participant TopCat |
| 56 | +note over PI:Start of RBNumber experiment |
| 57 | +PI ->> NDX: Start bluesky scan |
| 58 | +note over PI: Time Passes |
| 59 | +note over NDX: Bluesky scan ends |
| 60 | +note over NDX: creates scan.ascii and scan.nxs |
| 61 | +NDX ->> Archive: Sends scan.ascii and scan.nxs |
| 62 | +TopCat ->> Archive: Collects scan.ascii and scan.nxs |
| 63 | +note over PI: 5 months later |
| 64 | +PI ->> TopCat: Show me my data |
| 65 | +TopCat ->> PI: Provides access to scan.ascii and scan.nxs |
| 66 | +note over PI: 1 year later |
| 67 | +PI ->> TopCat: Show me my data |
| 68 | +TopCat ->> PI: Provides access to scan.nxs |
| 69 | +``` |
| 70 | +</details> |
| 71 | + |
| 72 | +<details> |
| 73 | +<summary>1 Bluesky scan, aborted neutron runs</summary> |
| 74 | + |
| 75 | +```{mermaid} |
| 76 | +sequenceDiagram |
| 77 | +actor PI |
| 78 | +participant NDX |
| 79 | +participant Archive |
| 80 | +participant TopCat as Online Catalogue |
| 81 | +note over PI:Start of RBNumber experiment |
| 82 | +PI ->> NDX: Start bluesky scan |
| 83 | +note over NDX: DAE run started by scan <br/> Time passes <br/> Required data gathered in scan documents <br/> Abort DAE run |
| 84 | +note over NDX: DAE run started by scan <br/> Time passes <br/> Required data gathered in scan documents <br/> Abort DAE run |
| 85 | +note over NDX: DAE run started by scan <br/> Time passes <br/> Required data gathered in scan documents <br/> Abort DAE run |
| 86 | +note over NDX: Bluesky scan ends |
| 87 | +note over NDX: creates scan.ascii and scan.nxs |
| 88 | +NDX ->> Archive: Sends scan.ascii and scan.nxs |
| 89 | +TopCat ->> Archive: Collects scan.ascii and scan.nxs |
| 90 | +note over PI: 5 months later |
| 91 | +PI ->> TopCat: Show me my data |
| 92 | +TopCat ->> PI: Provides access to scan.ascii and scan.nxs |
| 93 | +note over PI: 1 year later |
| 94 | +PI ->> TopCat: Show me my data |
| 95 | +TopCat ->> PI: Provides access to scan.nxs |
| 96 | +``` |
| 97 | +</details> |
| 98 | + |
| 99 | +<details> |
| 100 | +<summary>1 Bluesky scan, one neutron run</summary> |
| 101 | + |
| 102 | +```{mermaid} |
| 103 | +sequenceDiagram |
| 104 | +actor PI |
| 105 | +participant NDX |
| 106 | +participant Archive |
| 107 | +participant TopCat |
| 108 | +note over PI:Start of RBNumber experiment |
| 109 | +PI ->> NDX: Start bluesky scan |
| 110 | +note over NDX: Bluesky scan starts DAE run |
| 111 | +note over PI: Time Passes |
| 112 | +note over NDX: Bluesky scan ends DAE run <br/> Bluesky scan ends |
| 113 | +par |
| 114 | +note over NDX: creates runnumber.nxs with DAE and SE data |
| 115 | +and |
| 116 | +note over NDX: creates scan.ascii and scan.nxs |
| 117 | +end |
| 118 | +NDX ->> Archive: Sends runnumber.nxs, scan.ascii, and scan.nxs |
| 119 | +TopCat ->> Archive: Collects runnumber.nxs, scan.ascii, and scan.nxs |
| 120 | +note over PI: 5 months later |
| 121 | +PI ->> TopCat: Show me my data |
| 122 | +TopCat ->> PI: Provides access to runnumber.nxs, scan.ascii, and scan.nxs |
| 123 | +note over PI: 1 year later |
| 124 | +PI ->> TopCat: Show me my data |
| 125 | +TopCat ->> PI: Provides access to runnumber.nxs and scan.nxs |
| 126 | +``` |
| 127 | +</details> |
| 128 | + |
| 129 | +<details> |
| 130 | +<summary>1 Bluesky scan, N neutron runs</summary> |
| 131 | + |
| 132 | +```{mermaid} |
| 133 | +sequenceDiagram |
| 134 | +actor PI |
| 135 | +participant NDX |
| 136 | +participant Archive |
| 137 | +participant TopCat |
| 138 | +note over PI:Start of RBNumber experiment |
| 139 | +PI ->> NDX: Start bluesky scan |
| 140 | +note over NDX: Bluesky scan starts DAE run |
| 141 | +note over PI: Time Passes |
| 142 | +note over NDX: Bluesky scan ends DAE run |
| 143 | +note over NDX: creates runnumber.nxs with DAE and SE data |
| 144 | +NDX ->> Archive: Sends runnumber.nxs |
| 145 | +TopCat ->> Archive: Collects runnumber.nxs |
| 146 | +note over PI: Time Passes |
| 147 | +note over NDX: Bluesky scan starts DAE run |
| 148 | +note over PI: Time Passes |
| 149 | +note over NDX: Bluesky scan ends DAE run |
| 150 | +note over NDX: creates runnumber+1.nxs with DAE and SE data |
| 151 | +NDX ->> Archive: Sends runnumber+1.nxs |
| 152 | +TopCat ->> Archive: Collects runnumber+1.nxs |
| 153 | +note over NDX: Bluesky scan ends |
| 154 | +NDX ->> Archive: Sends scan.ascii and scan.nxs |
| 155 | +TopCat ->> Archive: Collects scan.ascii and scan.nxs |
| 156 | +note over PI: 5 months later |
| 157 | +PI ->> TopCat: Show me my data |
| 158 | +TopCat ->> PI: Provides access to runnumber.nxs, runnumber+1.nxs, scan.ascii, and scan.nxs |
| 159 | +note over PI: 1 year later |
| 160 | +PI ->> TopCat: Show me my data |
| 161 | +TopCat ->> PI: Provides access to runnumber.nxs, runnumber+1.nxs, and scan.nxs |
| 162 | +``` |
| 163 | +</details> |
| 164 | + |
| 165 | +<details> |
| 166 | +<summary>1 Bluesky scan, neutron/muon runs on multiple instruments</summary> |
| 167 | + |
| 168 | +```{mermaid} |
| 169 | +sequenceDiagram |
| 170 | +actor PI |
| 171 | +participant NDX-A |
| 172 | +participant NDX-B |
| 173 | +participant NDX-C |
| 174 | +participant Archive |
| 175 | +participant TopCat |
| 176 | +note over PI:Start of RBNumber experiment |
| 177 | +PI ->> NDX-A: Start bluesky scan |
| 178 | +NDX-A ->> NDX-B: Start DAE run |
| 179 | +NDX-A ->> NDX-C: Start DAE run |
| 180 | +note over PI: Time Passes |
| 181 | +NDX-B ->> NDX-A: Provides summary run data |
| 182 | +NDX-C ->> NDX-A: Provides summary run data |
| 183 | +NDX-A ->> NDX-B: End DAE run |
| 184 | +note over NDX-B: creates runnumberB.nxs with DAE and SE data |
| 185 | +NDX-B ->> Archive: Sends runnumberB.nxs |
| 186 | +TopCat ->> Archive: Collects runnumberB.nxs |
| 187 | +NDX-A ->> NDX-C: End DAE run |
| 188 | +note over NDX-C: creates runnumberC.nxs with DAE and SE data |
| 189 | +NDX-C ->> Archive: Sends runnumberC.nxs |
| 190 | +TopCat ->> Archive: Collects runnumberC.nxs |
| 191 | +note over NDX-A: Bluesky scan ends |
| 192 | +NDX-A ->> Archive: Sends scan.ascii and scan.nxs |
| 193 | +TopCat ->> Archive: Collects scan.ascii and scan.nxs |
| 194 | +note over PI: 5 months later |
| 195 | +PI ->> TopCat: Show me my data |
| 196 | +TopCat ->> PI: Provides access to runnumberB.nxs, runnumberC.nxs, scan.ascii, and scan.nxs |
| 197 | +note over PI: 1 year later |
| 198 | +PI ->> TopCat: Show me my data |
| 199 | +TopCat ->> PI: Provides access to runnumberB.nxs, runnumberC.nxs, and scan.nxs |
| 200 | +``` |
| 201 | +</details> |
| 202 | + |
| 203 | +## Present |
| 204 | + |
| 205 | +The following people have been involved in discussions leading up to this ADR: |
| 206 | + |
| 207 | +- Tom |
| 208 | +- Chris M-S |
| 209 | +- George |
| 210 | +- Kathryn |
| 211 | +- Jack H |
| 212 | +- CK (Reflectometry) |
| 213 | + |
| 214 | +This document was additionally reviewed in a regular Thursday code-review slot by the whole IBEX team. |
| 215 | + |
| 216 | +## Decisions |
| 217 | + |
| 218 | +### File-writing location |
| 219 | + |
| 220 | +Bluesky should write data into the `c:\data\RB<rb_number>\bluesky_scans\` folder during a scan. |
| 221 | +File naming itself will keep its current scheme (timestamped files). |
| 222 | + |
| 223 | +This location was chosen because it mirrors the archiving setup used by neutron cameras on IMAT. |
| 224 | + |
| 225 | +### Attributes & checksums |
| 226 | + |
| 227 | +Bluesky should mark files as read-only, using Windows file attributes, when it has finished writing them. This is so |
| 228 | +that the archiving process can unambiguously tell whether a file has finished being written. It also reduces the |
| 229 | +likelihood that a file is accidentally modified. |
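As a minimal sketch of the read-only marking described above (not the actual `ibex_bluesky_core` implementation), a callback could clear the write permission bits once it has closed a file. On Windows, Python's `Path.chmod` maps this onto the read-only file attribute. The helper names `mark_read_only` and `is_read_only` are hypothetical.

```python
import stat
from pathlib import Path

# All three POSIX write bits; on Windows only the owner bit is meaningful.
WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH


def mark_read_only(path: Path) -> None:
    """Clear the write bits; on Windows this sets the read-only file attribute."""
    path.chmod(path.stat().st_mode & ~WRITE_BITS)


def is_read_only(path: Path) -> bool:
    """Return True if no write bit is set, i.e. the writer has finished with the file."""
    return not path.stat().st_mode & WRITE_BITS
```

An archiving process could then call `is_read_only` to skip any file that is still being written.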
| 230 | + |
| 231 | +Checksums should be generated, either at the point when the data is initially generated, or by the archiving process |
| 232 | +just before it first copies or moves a file. |
| 233 | + |
| 234 | +We have agreed that checksums should be generated for bluesky data, as is already done for DAE data. These checksums are |
| 235 | +useful to check for data corruption, which might occur in transit, or in-place on instrument computers or archive servers. |
| 236 | +A number of checksumming approaches have been considered, and no approach has been chosen yet. The options discussed |
| 237 | +are: |
| 238 | +- **Use Windows alternate data streams**. This is how checksums are done in existing DAE `.raw` files. It has the |
| 239 | +advantage that it is relatively simple to implement, but the disadvantage that alternate streams do not map nicely |
| 240 | +onto Linux file systems. |
| 241 | +- **Generate one checksum per file**, for example `file.txt` would also have an associated `file.sha1.txt` containing the |
| 242 | +checksum. The advantage is that this is simple to implement and platform-agnostic. The disadvantage is that it doubles |
| 243 | +the number of files visible in the archive area. |
| 244 | +- **Generate a single checksum file** containing the checksums of all bluesky data, at a higher level of granularity (for |
| 245 | +example by RB number or by cycle). It is currently unclear exactly how this approach would be implemented, and at what |
| 246 | +point these checksums would be moved to the archive. |
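To make the "one checksum per file" option concrete, here is a hedged sketch using Python's standard `hashlib`. No approach has been chosen, so this is purely illustrative: the helper names are hypothetical, and the sidecar name is built by replacing the data file's suffix with `.sha1.txt`, which matches the `file.txt` example above but would collide if two data files shared a stem.

```python
import hashlib
from pathlib import Path


def sha1_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-1 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha1()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def sidecar_for(data_file: Path) -> Path:
    """Assumed naming scheme: file.txt -> file.sha1.txt."""
    return data_file.with_suffix(".sha1.txt")


def write_sidecar(data_file: Path) -> Path:
    """Write the checksum of data_file next to it and return the sidecar path."""
    sidecar = sidecar_for(data_file)
    sidecar.write_text(sha1_of(data_file) + "\n")
    return sidecar


def verify(data_file: Path) -> bool:
    """Recompute the checksum and compare it with the stored sidecar value."""
    return sidecar_for(data_file).read_text().strip() == sha1_of(data_file)
```

Calling `verify` after a copy or move would detect corruption that occurred in transit.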
| 247 | + |
| 248 | +### Moving to the ISIS archive |
| 249 | + |
| 250 | +An automated cron task will look for read-only Bluesky output files, and their associated checksums, in `c:\data` at |
| 251 | +regular short intervals (for example, 1 minute), and will move them to: |
| 252 | +- The ISIS data archive, under `autoreduced/bluesky_scans`. The `autoreduced` folder already exists on the archive. |
| 253 | +- The data cache disk on the instrument, under `c:\data\Export only\RB<rb_number>\bluesky_scans`. |
| 254 | + |
| 255 | +Data on the cache disk, under `Export only`, is kept on the instrument for a short period (usually 24 hours), and then |
| 256 | +deleted by existing processes. |
| 257 | + |
| 258 | +This is run as a cron task so that, if the network happens to be unavailable at the time when a scan ends, the copy |
| 259 | +process will catch up when the network becomes available again. This cron task will only move files which sit within |
| 260 | +a `bluesky_scans` folder, to prevent it from interfering with other non-bluesky files. |
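One pass of such a task could look roughly like the sketch below. It is an illustration under stated assumptions, not the agreed implementation: the function names are hypothetical, the archive destination is passed in as a single directory, and "finished" is taken to mean "read-only". A file is deleted from its source folder only after both copies succeed, so a network failure simply leaves it for the next pass.

```python
import shutil
import stat
from pathlib import Path

WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH


def finished_files(data_root: Path) -> list[Path]:
    """Read-only files directly inside <data_root>/RB*/bluesky_scans/."""
    return sorted(
        p
        for p in data_root.glob("RB*/bluesky_scans/*")
        if p.is_file() and not p.stat().st_mode & WRITE_BITS
    )


def archive_pass(data_root: Path, archive_dir: Path, export_root: Path) -> list[Path]:
    """Copy finished files to both destinations, then remove the originals."""
    moved = []
    for src in finished_files(data_root):
        rb_folder = src.parent.parent.name  # e.g. "RB<rb_number>"
        export_dir = export_root / rb_folder / "bluesky_scans"
        try:
            archive_dir.mkdir(parents=True, exist_ok=True)
            export_dir.mkdir(parents=True, exist_ok=True)
            shutil.copyfile(src, archive_dir / src.name)
            shutil.copyfile(src, export_dir / src.name)
        except OSError:
            # Destination unavailable (e.g. network down): retry on the next pass.
            continue
        # Windows refuses to delete read-only files, so restore the write bit first.
        src.chmod(src.stat().st_mode | stat.S_IWUSR)
        src.unlink()
        moved.append(src)
    return moved
```

Because the glob only matches folders named `bluesky_scans` under an `RB*` directory, other files in the data area, including the `Export only` tree, are never touched.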
| 261 | + |
| 262 | +Creating a new `bluesky_scans` folder alongside the existing `autoreduced` folder was considered, but was felt to be |
| 263 | +unachievable - it would require too much work relative to using the existing `autoreduced` folder. |
| 264 | + |
| 265 | +### File formats |
| 266 | + |
| 267 | +At present, our scan file output format is explicitly designed to be "human-readable" (and, in fact, the callback which |
| 268 | +generates these files is explicitly called |
| 269 | +{py:obj}`HumanReadableFileCallback <ibex_bluesky_core.callbacks.HumanReadableFileCallback>`). |
| 270 | + |
| 271 | +We have [issue 26](https://github.com/ISISComputingGroup/ibex_bluesky_core/issues/26) which will implement |
| 272 | +machine-readable files, using a format such as `.hdf5` or `.nxs`. These files will sit alongside the existing |
| 273 | +human-readable files; it is acknowledged that while machine-readable files are better from a data preservation and |
| 274 | +archiving standpoint, we will need to retain the human-readable files to support quick browsing by scientists without |
| 275 | +using special software. |
| 276 | + |
| 277 | +## Consequences |
| 278 | + |
| 279 | +- Bluesky output data will be stored in a location suitable for long-term, scientifically useful, data. This includes |
| 280 | +data integrity and availability concerns. |
| 281 | +- Bluesky scans will no longer be reliant on a network location being available to run a scan. |
| 282 | +- The initial location where bluesky writes data (`c:\data\RB<rb_number>\bluesky_scans`) will not be the same as its final |
| 283 | +location (the `autoreduced` folder on the ISIS archive). This is also true for current DAE data, as generated by the ISISICP. |