Skip to content

Conversation

@ctrueden
Copy link
Member

This branch includes two improvements to the Prairie reader:

Partial dataset handling

It seems that PrairieView is capable of producing partial datasets. In particular, we received some sample data that is missing both channels of the last focal plane of the last time point. To handle such cases more robustly, we now pad the image count so that the ZCT product will match, and then return 0-filled planes for any missing files.

Test data for the partial datasets issue can be found at:

prairie/julie/bill-bement

For example:

prairie/julie/bill-bement/3/TSeries-10092012-1446-001

Multiple stage positions

PrairieView is capable of acquiring multiple stage positions (i.e., multiple Images), but does not differentiate them in the metadata in any way. Rather, all time points at all positions are simply numbered sequentially as "cycles" and conflated.

However, we can disentangle stage positions from time points after the fact by comparing stage positions. If we notice a repeating pattern of stage positions, we update the core metadata to reflect that fact.

Test data for the multiple stage position issue can be found at:

prairie/kristin/sampleset-001

@joshmoore
Copy link
Member

@ctrueden , the new constructor conflicts with @rleigh-dundee 's in gh-215. Could you take a look at that and possibly suggest a resolution?

@ctrueden
Copy link
Member Author

It looks like @rleigh-dundee has a more thorough solution to the CoreMetadata limitations than I do here. So I am happy to rebase this branch over PR #215, either before or after it has been merged. All this PR needs to be functional is an easy way to make a copy of an existing CoreMetadata.

The only tricky issue is how to handle the series metadata (if any). This PR may actually have a minor issue there; I think as written the series metadata from the old CoreMetadata gets dropped when restructuring to multiple series.

@ctrueden
Copy link
Member Author

There are still outstanding issues with the bugfixes this PR provides. I will investigate and update the PR when these problems have been resolved.

@ctrueden
Copy link
Member Author

For the record, I am tracking my own progress on this work using the LOCI Software Trac. Relevant tickets:

And while I am at it, I will probably also investigate as part of this branch:

@melissalinkert
Copy link
Member

Note that PR #215 has now been merged, so you may wish to rebase and revert f1daac0. If what is in CoreMetadata on develop isn't sufficient or doesn't work quite right, then we can discuss with @rleigh-dundee how best to fix.

This avoids a spurious "Content is not allowed in prolog" error that
Java normally emits when attempting to parse an XML document with UTF-8
encoding that includes a BOM:

http://www.rgagnon.com/javadetails/java-handle-utf8-file-with-bom.html
The usage of parallel arrays makes it very difficult to guarantee robust
behavior in the case of missing and malformed data. The code is also
harder than necessary to understand. The new Prairie metadata structure
provides a "1st class citizen" for accessing Prairie-specific
information. It will also ease later migration to the SCIFIO structure.
This change utilizes the new PrairieMetadata data structure, improving
on the behavior of the previous reader implementation as follows:

1. PrairieView is capable of acquiring multiple stage positions (i.e.,
   multiple Images), but does not differentiate them in the metadata in
   any way. Rather, all time points at all positions are simply numbered
   sequentially as "cycles" and conflated.

   However, we can disentangle stage positions from time points by
   comparing stage positions. If we notice a repeating pattern of stage
   positions, we record them as multiple series in the core metadata.

   PrairieView acquires planes for all positions at a given time point
   before moving on to the next time point. Hence, the rasterization
   order is always P faster, T slower; e.g.:

     P1T1, P2T1, P3T1, P1T2, P2T2, P3T2

2. PrairieView numbers each Sequence (i.e., stage position / time point
   pair) with a "cycle" value, each Frame (i.e., focal plane of a
   Sequence) with an "index" value, and each File (i.e., channel of a
   Frame) with a "channel" value. These values are 1-based.

   In our experience, the XML is always recorded in sequential order.
   However, the new implementation does not assume that will always be
   the case. Rather, it attempts to gracefully handle missing Sequences,
   Frames and Files, using the cycle, index and channel metadata as the
   canonical definition of how each TIFF file fits into the dataset. One
   useful consequence of this flexibility is that the reader now
   supports partial Prairie datasets: if the run is interrupted before
   completion, PrairieView produces a partial dataset, with no XML
   elements or TIFF files beyond the point of interruption. This is
   useful if, for example, the desired phenomenon or occurrence is
   observed taking place before the acquisition is complete; there may
   be no reason to continue the run after that point.
@ctrueden
Copy link
Member Author

This branch now offers a near-total rewrite of the Prairie reader, which should hopefully be more robust and maintainable. However, I am still debugging the behavior with respect to a few of our sample datasets, so it is not quite ready for merging. That said, if anyone has time to review it at a conceptual level, comments are welcome.

@ghost
Copy link

ghost commented Dec 20, 2012

[testng] Total tests run: 158752, Failures: 31912, Skips: 125356

@ctrueden
Copy link
Member Author

@rleigh-dundee: Utterly unsurprising. As I said, the code is ready for review at a conceptual level. There are known test failures. Some of those are bugs, and some are legitimate changes in behavior from the previous code.

@ghost
Copy link

ghost commented Dec 20, 2012

checkUTF8: Should BOM stripping be done in sanitizeXML? Maybe additionally to this change if sanitizeXML can't be used here directly--it could also filter out the invalid values in addition to its current blacklist?

@ctrueden
Copy link
Member Author

@rleigh-dundee: Good idea. The problem with sanitizeXML is that it requires an input String which can be very wasteful for large blocks of XML. I have an idea for a subclass of FilterInputStream that takes care of everything: the BOM hack, as well as sanitizing the illegal characters.

By the way, I also tried a suggested hack of interpreting XML 1.0 as XML 1.1, which supposedly is backward compatible but allows more characters, but unfortunately ran into a bizarre Java XML parsing problem with 1.1 (attribute values were mangled, and who knows what else). So I'd like to stick to the current sanitization strategy, except make it work more generally with any InputStream.

I will do that as part of my additional work on this branch!

@ctrueden
Copy link
Member Author

I have now fixed several bugs introduced by these changes.

Still to do:

  • Fix the sanitizeXML code, as discussed with @rleigh-dundee above.
  • Test reading of more Prairie datasets.
  • Verify that stitching still works properly in various cases.
  • Verify that integration tests still pass. In some cases, the .bioformats metadata will be incorrect and need updating, though.

If linesPerFrame or pixelsPerLine is missing, we fall back to using the
first TIFF file's Y or X size, respectively.
This method returns the list of Sequences ordered by key. It will be
useful to handle cases where the cycle numbering does not increment one
at a time (e.g., for metadata with cycle=2,5,8,11,... or even for
non-linear increment patterns).
PrairieView is capable of producing data where the cycle numbering does
not increment one at a time (e.g., for metadata with
cycle=2,5,8,11,...). We have not yet observed any datasets with
variable, non-linear increment patterns (e.g., cycle=2,4,5,11,12,...),
but this new approach should handle that too.

The new approach works by explicitly creating and sorting a list of
existing sequences, then using that list as the basis for sizeT & sizeP,
rather than assuming that Sequence#getCycleCount() will be sensible.
In the case where a Sequence is flagged as a TSeries, the Frames of that
Sequence must be treated as time points rather than focal planes. This
logic was previously not fully propagated through the code.
@ctrueden
Copy link
Member Author

I took a look at fixing the sanitizeXML implementation as discussed above. Unfortunately, there are several complicating factors that make me think it's not worth our time, unless/until it comes up later in performance tuning.

The XMLTools.sanitizeXML strategy replaces individual invalid characters with spaces. I thought we could have a FilterInputStream extension (XMLSanitizer or some such) that does the same thing in its overridden read method(s). However, InputStreams work with bytes, not characters. To work with characters properly across different encodings, we would instead need a FilterReader. I briefly implemented one, but quickly realized it becomes more complicated if an invalid &# character sequence falls across the boundary of two read calls. A call to super.read may populate the char array such that it terminates with &, making it unclear whether the next character (which has yet to be read by the super implementation) will be a # or not. If it is going to be, the & should become a space, but if not, it should stay an ampersand. Unfortunately, we cannot know a priori. A more complex solution is then required to keep the sanitization working both correctly and efficiently. The current implementation (which operates on entire Strings) works for now, so I would rather not worry about premature optimization.

Even if we did create a 100% correct XMLSanitizer somehow which extends FilterReader, it would still not be usable with XMLTools.parseDOM(InputStream); a Reader is not an InputStream after all. We would need to add an XMLTools.parseDOM(InputSource) since an InputSource can wrap a Reader. However, I am wary of that route: it has several layers (FileInputStream, InputStreamReader, XMLSanitizer, InputSource, ...) and in the past, I have experienced bugs with Java XML parsing when an InputStreamReader is involved in the equation.

TL;DR: I am not going to change checkBOM or sanitizeXML for this PR. They work fine as-is. Let's file a separate ticket for later if we want to improve the behavior or performance.

@joshmoore
Copy link
Member

@ctrueden, got the TL;DR, but any other commits left from the other 3 bullet points above?

@ctrueden
Copy link
Member Author

Apologies for being unclear about the status. I have not yet had time to perform any further tests. In particular, I have not tested that stitching still works with the Fiji Stitching plugins. I have also not manually tested many more Prairie datasets. And it is difficult to rely on the automated tests because the behavior has changed—specifically, datasets that used to be interpreted as a single 5D Image are now split into multiple Images based on the StagePosition values. This makes the behavior more consistent with how Bio-Formats treats multi-position datasets of other formats.

Perhaps it would be best to ask for assistance in testing at this point. Would someone (@melissalinkert?) have any time within the next few days to update the .bioformats metadata files? Any other thoughts on how to proceed with testing of these changes? I would very much like to get these changes in by the end of the month if possible, so suggestions would be most welcome.

@melissalinkert
Copy link
Member

I won't have time this week yet, but barring any travel issues should be able to at least make decent progress on testing and updating the .bioformats files on Sunday.

At the moment, though, this won't merge without conflict. Can you please merge in openmicroscopy/develop and push (double-checking that it still compiles)? The conflicts are likely due to the core variable in FormatReader changing from an array to a java.util.List.

Conflicts:
	components/bio-formats/src/loci/formats/in/PrairieReader.java

The conflicts are due to the FormatReader#core variable changing from a
CoreMetadata[] to a List<CoreMetadata>. PrairieReader.java has been left
intact in accordance with the prairie-fixes branch; the core data
structure changes will be addressed in the following commit.
It required an update to reflect the new data structure:
List<CoreMetadata> rather than CoreMetadata[].
@ctrueden
Copy link
Member Author

@melissalinkert: Merged in develop. The merge favors the prairie-fixes version of PrairieReader.java, which results in compile errors due to the change to FormatReader#core. So I added one more commit on top that updates PrairieReader to use the new core data structure. I did some quick testing and all still appears to work.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason not to use return Boolean.parseBoolean(value); here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 82d9fcc.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The acquisition date needs to be set before returning, as it counts as "minimum metadata".

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 8f70e62.

@melissalinkert
Copy link
Member

Aside from above comments, there are a number of legitimate test failures. I've gone through and updated (or created) .bioformats files as seems to be appropriate, but there are places where the configuration is correct and the tests fail (e.g. the prairie/misc directory).

The biggest problem seems to be with situations where multiple channels are defined, but a non-consecutive subset are acquired. It should be a matter of respecting the channel_0 (and channel_1, channel_2, etc.) keys in the .cfg file. prairie/misc/TSeries-pollendouble-004 shows this - only channels 0 and 2 are really acquired, but the SizeC is recorded as 3 (with warnings that the files for channel 1 are missing).

One last thing is that some of the datasets in prairie/joe/x,y inversion test still appear to have the wrong values for PositionX and PositionY. Opening e.g. 1mg-dish1-wide-003 in Fiji with "Stitch tiles" checked shows the issue.

@ctrueden
Copy link
Member Author

OK, most of the small issues are addressed. Still working on the missing channels issue. Thanks for the pointer there, @melissalinkert. Once that issue is fixed, I will double check the situation with inverted coordinates.

This is required to properly handle cases where some channels are active
and others are not, as defined in the CFG metadata. For example, we have
datasets with "*_Ch1_*.tif" and "*_Ch3_*.tif" planes acquired, but no
"*_Ch2_*.tif" files.
We no longer care about which channel min/maxes exist per frame, as we
rely totally on the CFG metadata to determine the desired channels.

To be absolutely sure that all of our sample Prairie datasets have
channel metadata given in CFG, I did a quick check:

    find . -name '*.cfg' -print0 | xargs -0 grep -L channel_

And indeed they do. If we come across a dataset in the future with this
information missing, it would be pretty easy to synthesize based on the
per-frame channel min/maxes again, but until then, we don't need it.
@ctrueden
Copy link
Member Author

The channel handling has been improved as @melissalinkert suggested above.

I investigated the stage position inversion issue, but there is an unfortunate inconsistency. Check out these two datasets:

  1. prairie/joe/TSeries-03232012-0934-006
  2. prairie/joe/x,y inversion test/1mg-dish1-wide-003

With the current logic, dataset 1 stitches correctly, but dataset 2 has inverted X coordinates. We could change PrairieMetadata#isInvertX() to return an inverted value, which would fix dataset 2, but break dataset 1. So it seems PrairieView is inconsistent about how it records this information. Or else there is another piece of metadata we could use to differentiate these two cases. Or else the 1mg-dish1-wide-003 metadata was manually edited (though I doubt it).

Note that even if we cannot get it right 100% of the time, the "Grid/Collection stitching" plugin has the option of inverting X and/or Y coordinates, to correct for such situations.

I also tested both of these datasets with the 4.4.5 release, and the situation is the same as described above: dataset 1 stitches correctly and dataset 2 is incorrect. (As a side note: with the 4.4.5 release, the Bio-Formats "Stitch tiles" option does not work with Prairie datasets. But the prairie-fixes branch makes it work properly, due to the new stage position splitting logic. It's much faster for testing than using the "Grid/Collection stitching" plugin! Though I did test both, of course.)

So in short, while I would welcome ideas on how to address the issue, this PR does not introduce a regression, so let's not delay the merge on that account.

More specifically, we ignore the X stage position inversion flag,
because our sample data does not appear to respect it anyway.
@ctrueden
Copy link
Member Author

It occurred to me that there was an incredibly simple solution: ignore the X stage position inversion flag completely, since it does not appear to be respected in the 1mg-dish1-wide-003 dataset. With this change, both of the above two datasets stitch properly in both X and Y.

I took at look at the other datasets in x,y inversion test for corroboration, but personally, I cannot visually tell which way the tiles are supposed to align, regardless of X and/or Y inverted or not. All configurations look wrong to me, which makes me question the source data. So unfortunately there is no guidance there.

But until we get another sample dataset which clearly disproves the current approach, let's use it.

@ctrueden
Copy link
Member Author

I believe all issues with this PR are addressed. If further automated testing reveals no other issues, it is ready for merge.

@ctrueden
Copy link
Member Author

Nobody is waiting on me for anything, right? Just needs further testing and review?

@melissalinkert
Copy link
Member

On my list to review this week, I just haven't had a chance yet.

@melissalinkert
Copy link
Member

--test prairie

Adding to the build. Fine to merge if the Thursday evening builds are green (since I wasn't fast enough to get this in for all of tonight's builds).

melissalinkert added a commit that referenced this pull request Feb 22, 2013
@melissalinkert melissalinkert merged commit 79a0786 into ome:develop Feb 22, 2013
@ctrueden ctrueden deleted the prairie-fixes branch February 25, 2013 16:54
@ctrueden ctrueden mentioned this pull request May 7, 2013
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants