Prairie fixes #217

ctrueden · 2012-11-14T20:37:00Z

This branch includes two improvements to the Prairie reader:

Partial dataset handling

It seems that PrairieView is capable of producing partial datasets. In particular, we received some sample data that is missing both channels of the last focal plane of the last time point. To handle such cases more robustly, we now pad the image count so that the ZCT product will match, and then return 0-filled planes for any missing files.

Test data for the partial datasets issue can be found at:

prairie/julie/bill-bement

For example:

prairie/julie/bill-bement/3/TSeries-10092012-1446-001

Multiple stage positions

PrairieView is capable of acquiring multiple stage positions (i.e., multiple Images), but does not differentiate them in the metadata in any way. Rather, all time points at all positions are simply numbered sequentially as "cycles" and conflated.

However, we can disentangle stage positions from time points after the fact by comparing stage positions. If we notice a repeating pattern of stage positions, we update the core metadata to reflect that fact.

Test data for the multiple stage position issue can be found at:

prairie/kristin/sampleset-001

joshmoore · 2012-11-15T08:19:55Z

@ctrueden , the new constructor conflicts with @rleigh-dundee 's in gh-215. Could you take a look at that and possibly suggest a resolution?

ctrueden · 2012-11-15T17:45:53Z

It looks like @rleigh-dundee has a more thorough solution to the CoreMetadata limitations than I do here. So I am happy to rebase this branch over PR #215, either before or after it has been merged. All this PR needs to be functional is an easy way to make a copy of an existing CoreMetadata.

The only tricky issue is how to handle the series metadata (if any). This PR may actually have a minor issue there; I think as written the series metadata from the old CoreMetadata gets dropped when restructuring to multiple series.

ctrueden · 2012-11-20T21:32:28Z

There are still outstanding issues with the bugfixes this PR provides. I will investigate and update the PR when these problems have been resolved.

ctrueden · 2012-11-26T20:30:09Z

For the record, I am tracking my own progress on this work using the LOCI Software Trac. Relevant tickets:

And while I am at it, I will probably also investigate as part of this branch:

#770: Prairie reader is very slow in populating metadata

melissalinkert · 2012-11-28T01:24:03Z

Note that PR #215 has now been merged, so you may wish to rebase and revert f1daac0. If what is in CoreMetadata on develop isn't sufficient or doesn't work quite right, then we can discuss with @rleigh-dundee how best to fix.

This avoids a spurious "Content is not allowed in prolog" error that Java normally emits when attempting to parse an XML document with UTF-8 encoding that includes a BOM: http://www.rgagnon.com/javadetails/java-handle-utf8-file-with-bom.html

The usage of parallel arrays makes it very difficult to guarantee robust behavior in the case of missing and malformed data. The code is also harder than necessary to understand. The new Prairie metadata structure provides a "1st class citizen" for accessing Prairie-specific information. It will also ease later migration to the SCIFIO structure.

This change utilizes the new PrairieMetadata data structure, improving on the behavior of the previous reader implementation as follows: 1. PrairieView is capable of acquiring multiple stage positions (i.e., multiple Images), but does not differentiate them in the metadata in any way. Rather, all time points at all positions are simply numbered sequentially as "cycles" and conflated. However, we can disentangle stage positions from time points by comparing stage positions. If we notice a repeating pattern of stage positions, we record them as multiple series in the core metadata. PrairieView acquires planes for all positions at a given time point before moving on to the next time point. Hence, the rasterization order is always P faster, T slower; e.g.: P1T1, P2T1, P3T1, P1T2, P2T2, P3T2 2. PrairieView numbers each Sequence (i.e., stage position / time point pair) with a "cycle" value, each Frame (i.e., focal plane of a Sequence) with an "index" value, and each File (i.e., channel of a Frame) with a "channel" value. These values are 1-based. In our experience, the XML is always recorded in sequential order. However, the new implementation does not assume that will always be the case. Rather, it attempts to gracefully handle missing Sequences, Frames and Files, using the cycle, index and channel metadata as the canonical definition of how each TIFF file fits into the dataset. One useful consequence of this flexibility is that the reader now supports partial Prairie datasets: if the run is interrupted before completion, PrairieView produces a partial dataset, with no XML elements or TIFF files beyond the point of interruption. This is useful if, for example, the desired phenomenon or occurrence is observed taking place before the acquisition is complete; there may be no reason to continue the run after that point.

ctrueden · 2012-12-19T16:29:28Z

This branch now offers a near-total rewrite of the Prairie reader, which should hopefully be more robust and maintainable. However, I am still debugging the behavior with respect to a few of our sample datasets, so it is not quite ready for merging. That said, if anyone has time to review it at a conceptual level, comments are welcome.

ghost · 2012-12-20T15:01:38Z

[testng] Total tests run: 158752, Failures: 31912, Skips: 125356

ctrueden · 2012-12-20T15:04:58Z

@rleigh-dundee: Utterly unsurprising. As I said, the code is ready for review at a conceptual level. There are known test failures. Some of those are bugs, and some are legitimate changes in behavior from the previous code.

ghost · 2012-12-20T15:17:59Z

checkUTF8: Should BOM stripping be done in sanitizeXML? Maybe additionally to this change if sanitizeXML can't be used here directly--it could also filter out the invalid values in addition to its current blacklist?

ctrueden · 2012-12-20T16:07:31Z

@rleigh-dundee: Good idea. The problem with sanitizeXML is that it requires an input String which can be very wasteful for large blocks of XML. I have an idea for a subclass of FilterInputStream that takes care of everything: the BOM hack, as well as sanitizing the illegal characters.

By the way, I also tried a suggested hack of interpreting XML 1.0 as XML 1.1, which supposedly is backward compatible but allows more characters, but unfortunately ran into a bizarre Java XML parsing problem with 1.1 (attribute values were mangled, and who knows what else). So I'd like to stick to the current sanitization strategy, except make it work more generally with any InputStream.

I will do that as part of my additional work on this branch!

ctrueden · 2013-01-11T22:25:45Z

I have now fixed several bugs introduced by these changes.

Still to do:

Fix the sanitizeXML code, as discussed with @rleigh-dundee above.
Test reading of more Prairie datasets.
Verify that stitching still works properly in various cases.
Verify that integration tests still pass. In some cases, the .bioformats metadata will be incorrect and need updating, though.

If linesPerFrame or pixelsPerLine is missing, we fall back to using the first TIFF file's Y or X size, respectively.

This method returns the list of Sequences ordered by key. It will be useful to handle cases where the cycle numbering does not increment one at a time (e.g., for metadata with cycle=2,5,8,11,... or even for non-linear increment patterns).

PrairieView is capable of producing data where the cycle numbering does not increment one at a time (e.g., for metadata with cycle=2,5,8,11,...). We have not yet observed any datasets with variable, non-linear increment patterns (e.g., cycle=2,4,5,11,12,...), but this new approach should handle that too. The new approach works by explicitly creating and sorting a list of existing sequences, then using that list as the basis for sizeT & sizeP, rather than assuming that Sequence#getCycleCount() will be sensible.

In the case where a Sequence is flagged as a TSeries, the Frames of that Sequence must be treated as time points rather than focal planes. This logic was previously not fully propagated through the code.

ctrueden · 2013-01-14T20:56:03Z

I took a look at fixing the sanitizeXML implementation as discussed above. Unfortunately, there are several complicating factors that make me think it's not worth our time, unless/until it comes up later in performance tuning.

The XMLTools.sanitizeXML strategy replaces individual invalid characters with spaces. I thought we could have a FilterInputStream extension (XMLSanitizer or some such) that does the same thing in its overridden read method(s). However, InputStreams work with bytes, not characters. To work with characters properly across different encodings, we would instead need a FilterReader. I briefly implemented one, but quickly realized it becomes more complicated if an invalid &# character sequence falls across the boundary of two read calls. A call to super.read may populate the char array such that it terminates with &, making it unclear whether the next character (which has yet to be read by the super implementation) will be a # or not. If it is going to be, the & should become a space, but if not, it should stay an ampersand. Unfortunately, we cannot know a priori. A more complex solution is then required to keep the sanitization working both correctly and efficiently. The current implementation (which operates on entire Strings) works for now, so I would rather not worry about premature optimization.

Even if we did create a 100% correct XMLSanitizer somehow which extends FilterReader, it would still not be usable with XMLTools.parseDOM(InputStream); a Reader is not an InputStream after all. We would need to add an XMLTools.parseDOM(InputSource) since an InputSource can wrap a Reader. However, I am wary of that route: it has several layers (FileInputStream, InputStreamReader, XMLSanitizer, InputSource, ...) and in the past, I have experienced bugs with Java XML parsing when an InputStreamReader is involved in the equation.

TL;DR: I am not going to change checkBOM or sanitizeXML for this PR. They work fine as-is. Let's file a separate ticket for later if we want to improve the behavior or performance.

joshmoore · 2013-01-21T08:29:59Z

@ctrueden, got the TL;DR, but any other commits left from the other 3 bullet points above?

ctrueden · 2013-01-22T03:13:04Z

Apologies for being unclear about the status. I have not yet had time to perform any further tests. In particular, I have not tested that stitching still works with the Fiji Stitching plugins. I have also not manually tested many more Prairie datasets. And it is difficult to rely on the automated tests because the behavior has changed—specifically, datasets that used to be interpreted as a single 5D Image are now split into multiple Images based on the StagePosition values. This makes the behavior more consistent with how Bio-Formats treats multi-position datasets of other formats.

Perhaps it would be best to ask for assistance in testing at this point. Would someone (@melissalinkert?) have any time within the next few days to update the .bioformats metadata files? Any other thoughts on how to proceed with testing of these changes? I would very much like to get these changes in by the end of the month if possible, so suggestions would be most welcome.

melissalinkert · 2013-01-22T20:33:34Z

I won't have time this week yet, but barring any travel issues should be able to at least make decent progress on testing and updating the .bioformats files on Sunday.

At the moment, though, this won't merge without conflict. Can you please merge in openmicroscopy/develop and push (double-checking that it still compiles)? The conflicts are likely due to the core variable in FormatReader changing from an array to a java.util.List.

Conflicts: components/bio-formats/src/loci/formats/in/PrairieReader.java The conflicts are due to the FormatReader#core variable changing from a CoreMetadata[] to a List<CoreMetadata>. PrairieReader.java has been left intact in accordance with the prairie-fixes branch; the core data structure changes will be addressed in the following commit.

It required an update to reflect the new data structure: List<CoreMetadata> rather than CoreMetadata[].

ctrueden · 2013-01-24T22:47:28Z

@melissalinkert: Merged in develop. The merge favors the prairie-fixes version of PrairieReader.java, which results in compile errors due to the change to FormatReader#core. So I added one more commit on top that updates PrairieReader to use the new core data structure. I did some quick testing and all still appears to work.

melissalinkert · 2013-01-27T10:49:27Z

components/bio-formats/src/loci/formats/in/PrairieMetadata.java

Any reason not to use return Boolean.parseBoolean(value); here?

Done in 82d9fcc.

melissalinkert · 2013-01-27T11:39:35Z

components/bio-formats/src/loci/formats/in/PrairieReader.java

The acquisition date needs to be set before returning, as it counts as "minimum metadata".

Done in 8f70e62.

melissalinkert · 2013-01-28T14:46:34Z

Aside from above comments, there are a number of legitimate test failures. I've gone through and updated (or created) .bioformats files as seems to be appropriate, but there are places where the configuration is correct and the tests fail (e.g. the prairie/misc directory).

The biggest problem seems to be with situations where multiple channels are defined, but a non-consecutive subset are acquired. It should be a matter of respecting the channel_0 (and channel_1, channel_2, etc.) keys in the .cfg file. prairie/misc/TSeries-pollendouble-004 shows this - only channels 0 and 2 are really acquired, but the SizeC is recorded as 3 (with warnings that the files for channel 1 are missing).

One last thing is that some of the datasets in prairie/joe/x,y inversion test still appear to have the wrong values for PositionX and PositionY. Opening e.g. 1mg-dish1-wide-003 in Fiji with "Stitch tiles" checked shows the issue.

Thanks to Melissa Linkert for noticing.

Thanks to Melissa Linkert for pointing this out.

I somehow missed this before, so it was always null.

ctrueden · 2013-01-30T21:52:57Z

OK, most of the small issues are addressed. Still working on the missing channels issue. Thanks for the pointer there, @melissalinkert. Once that issue is fixed, I will double check the situation with inverted coordinates.

This is required to properly handle cases where some channels are active and others are not, as defined in the CFG metadata. For example, we have datasets with "*_Ch1_*.tif" and "*_Ch3_*.tif" planes acquired, but no "*_Ch2_*.tif" files.

We no longer care about which channel min/maxes exist per frame, as we rely totally on the CFG metadata to determine the desired channels. To be absolutely sure that all of our sample Prairie datasets have channel metadata given in CFG, I did a quick check: find . -name '*.cfg' -print0 | xargs -0 grep -L channel_ And indeed they do. If we come across a dataset in the future with this information missing, it would be pretty easy to synthesize based on the per-frame channel min/maxes again, but until then, we don't need it.

ctrueden · 2013-01-31T17:56:11Z

The channel handling has been improved as @melissalinkert suggested above.

I investigated the stage position inversion issue, but there is an unfortunate inconsistency. Check out these two datasets:

prairie/joe/TSeries-03232012-0934-006
prairie/joe/x,y inversion test/1mg-dish1-wide-003

With the current logic, dataset 1 stitches correctly, but dataset 2 has inverted X coordinates. We could change PrairieMetadata#isInvertX() to return an inverted value, which would fix dataset 2, but break dataset 1. So it seems PrairieView is inconsistent about how it records this information. Or else there is another piece of metadata we could use to differentiate these two cases. Or else the 1mg-dish1-wide-003 metadata was manually edited (though I doubt it).

Note that even if we cannot get it right 100% of the time, the "Grid/Collection stitching" plugin has the option of inverting X and/or Y coordinates, to correct for such situations.

I also tested both of these datasets with the 4.4.5 release, and the situation is the same as described above: dataset 1 stitches correctly and dataset 2 is incorrect. (As a side note: with the 4.4.5 release, the Bio-Formats "Stitch tiles" option does not work with Prairie datasets. But the prairie-fixes branch makes it work properly, due to the new stage position splitting logic. It's much faster for testing than using the "Grid/Collection stitching" plugin! Though I did test both, of course.)

So in short, while I would welcome ideas on how to address the issue, this PR does not introduce a regression, so let's not delay the merge on that account.

More specifically, we ignore the X stage position inversion flag, because our sample data does not appear to respect it anyway.

ctrueden · 2013-01-31T18:17:03Z

It occurred to me that there was an incredibly simple solution: ignore the X stage position inversion flag completely, since it does not appear to be respected in the 1mg-dish1-wide-003 dataset. With this change, both of the above two datasets stitch properly in both X and Y.

I took at look at the other datasets in x,y inversion test for corroboration, but personally, I cannot visually tell which way the tiles are supposed to align, regardless of X and/or Y inverted or not. All configurations look wrong to me, which makes me question the source data. So unfortunately there is no guidance there.

But until we get another sample dataset which clearly disproves the current approach, let's use it.

ctrueden · 2013-01-31T18:17:40Z

I believe all issues with this PR are addressed. If further automated testing reveals no other issues, it is ready for merge.

ctrueden · 2013-02-19T19:49:20Z

Nobody is waiting on me for anything, right? Just needs further testing and review?

melissalinkert · 2013-02-19T20:00:18Z

On my list to review this week, I just haven't had a chance yet.

melissalinkert · 2013-02-20T23:53:33Z

--test prairie

Adding to the build. Fine to merge if the Thursday evening builds are green (since I wasn't fast enough to get this in for all of tonight's builds).

Prairie fixes

ctrueden added 3 commits December 18, 2012 10:43

ctrueden added 5 commits January 11, 2013 16:26

PrairieReader: avoid NPE from missing attributes

7e20e7c

If linesPerFrame or pixelsPerLine is missing, we fall back to using the first TIFF file's Y or X size, respectively.

PrairieReader: fix bug in logging statement

006ea82

PrairieReader: Add PrairieMetadata#getSequences

59e2e95

This method returns the list of Sequences ordered by key. It will be useful to handle cases where the cycle numbering does not increment one at a time (e.g., for metadata with cycle=2,5,8,11,... or even for non-linear increment patterns).

PrairieReader: fix handling of inverted Z/T

b35f7d7

In the case where a Sequence is flagged as a TSeries, the Frames of that Sequence must be treated as time points rather than focal planes. This logic was previously not fully propagated through the code.

ctrueden added 2 commits January 24, 2013 16:22

PrairieReader: fix core metadata population

71e2dde

It required an update to reflect the new data structure: List<CoreMetadata> rather than CoreMetadata[].

melissalinkert reviewed Jan 27, 2013
View reviewed changes

ctrueden added 6 commits January 29, 2013 12:51

PrairieReader: use Boolean.parseBoolean

82d9fcc

PrairieReader: use Locations instead of Files

d9869a5

PrairieReader: reset singleTiffMode when closing

97ab65e

PrairieReader: fix bugs in sequence indexing

5b59ed3

Thanks to Melissa Linkert for noticing.

PrairieReader: always populate AcquisitionDate

8f70e62

Thanks to Melissa Linkert for pointing this out.

PrairieReader: parse acquisition date

53e2a92

I somehow missed this before, so it was always null.

ctrueden added 4 commits January 31, 2013 10:50

PrairieReader: add missing final keywords

db6e354

PrairieReader: parse active channels from CFG file

10b34aa

PrairieReader: use active channels metadata

d643002

This is required to properly handle cases where some channels are active and others are not, as defined in the CFG metadata. For example, we have datasets with "*_Ch1_*.tif" and "*_Ch3_*.tif" planes acquired, but no "*_Ch2_*.tif" files.

PrairieReader: add a hack for X position inversion

cf70030

More specifically, we ignore the X stage position inversion flag, because our sample data does not appear to respect it anyway.

melissalinkert added a commit that referenced this pull request Feb 22, 2013

Merge pull request #217 from ctrueden/prairie-fixes

79a0786

Prairie fixes

melissalinkert merged commit 79a0786 into ome:develop Feb 22, 2013

ctrueden deleted the prairie-fixes branch February 25, 2013 16:54

ctrueden mentioned this pull request May 7, 2013

Prairie fixes #508

Merged

Prairie fixes #217

Prairie fixes #217

Uh oh!

Conversation

ctrueden commented Nov 14, 2012

Partial dataset handling

Multiple stage positions

Uh oh!

joshmoore commented Nov 15, 2012

Uh oh!

ctrueden commented Nov 15, 2012

Uh oh!

ctrueden commented Nov 20, 2012

Uh oh!

ctrueden commented Nov 26, 2012

Uh oh!

melissalinkert commented Nov 28, 2012

Uh oh!

ctrueden commented Dec 19, 2012

Uh oh!

ghost commented Dec 20, 2012

Uh oh!

ctrueden commented Dec 20, 2012

Uh oh!

ghost commented Dec 20, 2012

Uh oh!

ctrueden commented Dec 20, 2012

Uh oh!

ctrueden commented Jan 11, 2013

Uh oh!

ctrueden commented Jan 14, 2013

Uh oh!

joshmoore commented Jan 21, 2013

Uh oh!

ctrueden commented Jan 22, 2013

Uh oh!

melissalinkert commented Jan 22, 2013

Uh oh!

ctrueden commented Jan 24, 2013

Uh oh!

melissalinkert Jan 27, 2013

Choose a reason for hiding this comment

Uh oh!

ctrueden Jan 29, 2013

Choose a reason for hiding this comment

Uh oh!

melissalinkert Jan 27, 2013

Choose a reason for hiding this comment

Uh oh!

ctrueden Jan 29, 2013

Choose a reason for hiding this comment

Uh oh!

melissalinkert commented Jan 28, 2013

Uh oh!

ctrueden commented Jan 30, 2013

Uh oh!

ctrueden commented Jan 31, 2013

Uh oh!

ctrueden commented Jan 31, 2013

Uh oh!

ctrueden commented Jan 31, 2013

Uh oh!

ctrueden commented Feb 19, 2013

Uh oh!

melissalinkert commented Feb 19, 2013

Uh oh!

melissalinkert commented Feb 20, 2013

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants