-
Notifications
You must be signed in to change notification settings - Fork 251
Prairie fixes #217
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Prairie fixes #217
Conversation
|
It looks like @rleigh-dundee has a more thorough solution to the The only tricky issue is how to handle the series metadata (if any). This PR may actually have a minor issue there; I think as written the series metadata from the old |
|
There are still outstanding issues with the bugfixes this PR provides. I will investigate and update the PR when these problems have been resolved. |
|
For the record, I am tracking my own progress on this work using the LOCI Software Trac. Relevant tickets: And while I am at it, I will probably also investigate as part of this branch: |
|
Note that PR #215 has now been merged, so you may wish to rebase and revert f1daac0. If what is in CoreMetadata on develop isn't sufficient or doesn't work quite right, then we can discuss with @rleigh-dundee how best to fix. |
This avoids a spurious "Content is not allowed in prolog" error that Java normally emits when attempting to parse an XML document with UTF-8 encoding that includes a BOM: http://www.rgagnon.com/javadetails/java-handle-utf8-file-with-bom.html
The usage of parallel arrays makes it very difficult to guarantee robust behavior in the case of missing and malformed data. The code is also harder than necessary to understand. The new Prairie metadata structure provides a "1st class citizen" for accessing Prairie-specific information. It will also ease later migration to the SCIFIO structure.
This change utilizes the new PrairieMetadata data structure, improving
on the behavior of the previous reader implementation as follows:
1. PrairieView is capable of acquiring multiple stage positions (i.e.,
multiple Images), but does not differentiate them in the metadata in
any way. Rather, all time points at all positions are simply numbered
sequentially as "cycles" and conflated.
However, we can disentangle stage positions from time points by
comparing stage positions. If we notice a repeating pattern of stage
positions, we record them as multiple series in the core metadata.
PrairieView acquires planes for all positions at a given time point
before moving on to the next time point. Hence, the rasterization
order is always P faster, T slower; e.g.:
P1T1, P2T1, P3T1, P1T2, P2T2, P3T2
2. PrairieView numbers each Sequence (i.e., stage position / time point
pair) with a "cycle" value, each Frame (i.e., focal plane of a
Sequence) with an "index" value, and each File (i.e., channel of a
Frame) with a "channel" value. These values are 1-based.
In our experience, the XML is always recorded in sequential order.
However, the new implementation does not assume that will always be
the case. Rather, it attempts to gracefully handle missing Sequences,
Frames and Files, using the cycle, index and channel metadata as the
canonical definition of how each TIFF file fits into the dataset. One
useful consequence of this flexibility is that the reader now
supports partial Prairie datasets: if the run is interrupted before
completion, PrairieView produces a partial dataset, with no XML
elements or TIFF files beyond the point of interruption. This is
useful if, for example, the desired phenomenon or occurrence is
observed taking place before the acquisition is complete; there may
be no reason to continue the run after that point.
|
This branch now offers a near-total rewrite of the Prairie reader, which should hopefully be more robust and maintainable. However, I am still debugging the behavior with respect to a few of our sample datasets, so it is not quite ready for merging. That said, if anyone has time to review it at a conceptual level, comments are welcome. |
|
[testng] Total tests run: 158752, Failures: 31912, Skips: 125356 |
|
@rleigh-dundee: Utterly unsurprising. As I said, the code is ready for review at a conceptual level. There are known test failures. Some of those are bugs, and some are legitimate changes in behavior from the previous code. |
|
checkUTF8: Should BOM stripping be done in sanitizeXML? Maybe additionally to this change if sanitizeXML can't be used here directly--it could also filter out the invalid values in addition to its current blacklist? |
|
@rleigh-dundee: Good idea. The problem with By the way, I also tried a suggested hack of interpreting XML 1.0 as XML 1.1, which supposedly is backward compatible but allows more characters, but unfortunately ran into a bizarre Java XML parsing problem with 1.1 (attribute values were mangled, and who knows what else). So I'd like to stick to the current sanitization strategy, except make it work more generally with any I will do that as part of my additional work on this branch! |
|
I have now fixed several bugs introduced by these changes. Still to do:
|
If linesPerFrame or pixelsPerLine is missing, we fall back to using the first TIFF file's Y or X size, respectively.
This method returns the list of Sequences ordered by key. It will be useful to handle cases where the cycle numbering does not increment one at a time (e.g., for metadata with cycle=2,5,8,11,... or even for non-linear increment patterns).
PrairieView is capable of producing data where the cycle numbering does not increment one at a time (e.g., for metadata with cycle=2,5,8,11,...). We have not yet observed any datasets with variable, non-linear increment patterns (e.g., cycle=2,4,5,11,12,...), but this new approach should handle that too. The new approach works by explicitly creating and sorting a list of existing sequences, then using that list as the basis for sizeT & sizeP, rather than assuming that Sequence#getCycleCount() will be sensible.
In the case where a Sequence is flagged as a TSeries, the Frames of that Sequence must be treated as time points rather than focal planes. This logic was previously not fully propagated through the code.
|
I took a look at fixing the The Even if we did create a 100% correct TL;DR: I am not going to change |
|
@ctrueden, got the |
|
Apologies for being unclear about the status. I have not yet had time to perform any further tests. In particular, I have not tested that stitching still works with the Fiji Stitching plugins. I have also not manually tested many more Prairie datasets. And it is difficult to rely on the automated tests because the behavior has changed—specifically, datasets that used to be interpreted as a single 5D Image are now split into multiple Images based on the StagePosition values. This makes the behavior more consistent with how Bio-Formats treats multi-position datasets of other formats. Perhaps it would be best to ask for assistance in testing at this point. Would someone (@melissalinkert?) have any time within the next few days to update the .bioformats metadata files? Any other thoughts on how to proceed with testing of these changes? I would very much like to get these changes in by the end of the month if possible, so suggestions would be most welcome. |
|
I won't have time this week yet, but barring any travel issues should be able to at least make decent progress on testing and updating the .bioformats files on Sunday. At the moment, though, this won't merge without conflict. Can you please merge in openmicroscopy/develop and push (double-checking that it still compiles)? The conflicts are likely due to the |
Conflicts: components/bio-formats/src/loci/formats/in/PrairieReader.java The conflicts are due to the FormatReader#core variable changing from a CoreMetadata[] to a List<CoreMetadata>. PrairieReader.java has been left intact in accordance with the prairie-fixes branch; the core data structure changes will be addressed in the following commit.
It required an update to reflect the new data structure: List<CoreMetadata> rather than CoreMetadata[].
|
@melissalinkert: Merged in develop. The merge favors the prairie-fixes version of PrairieReader.java, which results in compile errors due to the change to FormatReader#core. So I added one more commit on top that updates PrairieReader to use the new core data structure. I did some quick testing and all still appears to work. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any reason not to use return Boolean.parseBoolean(value); here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 82d9fcc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The acquisition date needs to be set before returning, as it counts as "minimum metadata".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in 8f70e62.
|
Aside from above comments, there are a number of legitimate test failures. I've gone through and updated (or created) .bioformats files as seems to be appropriate, but there are places where the configuration is correct and the tests fail (e.g. the The biggest problem seems to be with situations where multiple channels are defined, but a non-consecutive subset are acquired. It should be a matter of respecting the One last thing is that some of the datasets in |
Thanks to Melissa Linkert for noticing.
Thanks to Melissa Linkert for pointing this out.
I somehow missed this before, so it was always null.
|
OK, most of the small issues are addressed. Still working on the missing channels issue. Thanks for the pointer there, @melissalinkert. Once that issue is fixed, I will double check the situation with inverted coordinates. |
This is required to properly handle cases where some channels are active and others are not, as defined in the CFG metadata. For example, we have datasets with "*_Ch1_*.tif" and "*_Ch3_*.tif" planes acquired, but no "*_Ch2_*.tif" files.
We no longer care about which channel min/maxes exist per frame, as we
rely totally on the CFG metadata to determine the desired channels.
To be absolutely sure that all of our sample Prairie datasets have
channel metadata given in CFG, I did a quick check:
find . -name '*.cfg' -print0 | xargs -0 grep -L channel_
And indeed they do. If we come across a dataset in the future with this
information missing, it would be pretty easy to synthesize based on the
per-frame channel min/maxes again, but until then, we don't need it.
|
The channel handling has been improved as @melissalinkert suggested above. I investigated the stage position inversion issue, but there is an unfortunate inconsistency. Check out these two datasets:
With the current logic, dataset 1 stitches correctly, but dataset 2 has inverted X coordinates. We could change Note that even if we cannot get it right 100% of the time, the "Grid/Collection stitching" plugin has the option of inverting X and/or Y coordinates, to correct for such situations. I also tested both of these datasets with the 4.4.5 release, and the situation is the same as described above: dataset 1 stitches correctly and dataset 2 is incorrect. (As a side note: with the 4.4.5 release, the Bio-Formats "Stitch tiles" option does not work with Prairie datasets. But the So in short, while I would welcome ideas on how to address the issue, this PR does not introduce a regression, so let's not delay the merge on that account. |
More specifically, we ignore the X stage position inversion flag, because our sample data does not appear to respect it anyway.
|
It occurred to me that there was an incredibly simple solution: ignore the X stage position inversion flag completely, since it does not appear to be respected in the I took at look at the other datasets in But until we get another sample dataset which clearly disproves the current approach, let's use it. |
|
I believe all issues with this PR are addressed. If further automated testing reveals no other issues, it is ready for merge. |
|
Nobody is waiting on me for anything, right? Just needs further testing and review? |
|
On my list to review this week, I just haven't had a chance yet. |
|
--test prairie Adding to the build. Fine to merge if the Thursday evening builds are green (since I wasn't fast enough to get this in for all of tonight's builds). |
This branch includes two improvements to the Prairie reader:
Partial dataset handling
It seems that PrairieView is capable of producing partial datasets. In particular, we received some sample data that is missing both channels of the last focal plane of the last time point. To handle such cases more robustly, we now pad the image count so that the ZCT product will match, and then return 0-filled planes for any missing files.
Test data for the partial datasets issue can be found at:
For example:
Multiple stage positions
PrairieView is capable of acquiring multiple stage positions (i.e., multiple Images), but does not differentiate them in the metadata in any way. Rather, all time points at all positions are simply numbered sequentially as "cycles" and conflated.
However, we can disentangle stage positions from time points after the fact by comparing stage positions. If we notice a repeating pattern of stage positions, we update the core metadata to reflect that fact.
Test data for the multiple stage position issue can be found at: