Full text model layout features: BLOCKSTART missing, if very first block token is a new line #712
At least for some documents, the first token of a block seems to be a line feed.
In that case the line feed is filtered out:
if (TextUtilities.filterLine(text)) {
    n++;
    continue;
}
But when the next "real" token is then processed, n is no longer 0 but 1. Therefore it does not enter the main BLOCKSTART branch:
if (n == 0) {
    features.lineStatus = "LINESTART";
    // be sure that previous token is closing a line, except if it's a starting line
    if (previousFeatures != null) {
        if (!previousFeatures.lineStatus.equals("LINESTART"))
            previousFeatures.lineStatus = "LINEEND";
    }
    if (token != null)
        lineStartX = token.getX();
    features.blockStatus = "BLOCKSTART";
} else if (n == tokens.size() - 1) {
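To make the effect easier to see outside of GROBID, here is a minimal, self-contained sketch of the same index-based check. It is not the actual GROBID code: the class name BlockStartRepro, the helper isFiltered (a simplified stand-in for TextUtilities.filterLine that only matches line breaks) and the plain string tokens are assumptions made for illustration, and the truncated else-if branch is left out.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BlockStartRepro {
    // Assumption: simplified stand-in for TextUtilities.filterLine,
    // treating only line breaks as filtered tokens.
    static boolean isFiltered(String text) {
        return text.equals("\n") || text.equals("\r");
    }

    // Mirrors the index-based check: a token is BLOCKSTART only when n == 0.
    static List<String> blockStatuses(List<String> tokens) {
        List<String> statuses = new ArrayList<>();
        int n = 0;
        while (n < tokens.size()) {
            String text = tokens.get(n);
            if (isFiltered(text)) {
                n++; // the skipped token still advances the index
                continue;
            }
            statuses.add(n == 0 ? "BLOCKSTART" : "BLOCKIN");
            n++;
        }
        return statuses;
    }

    public static void main(String[] args) {
        // Block without a leading line feed: first token is BLOCKSTART.
        System.out.println(blockStatuses(Arrays.asList("We", "also", "looked")));
        // Block with a leading line feed: no token is labelled BLOCKSTART.
        System.out.println(blockStatuses(Arrays.asList("\n", "We", "also", "looked")));
    }
}
```

With a leading line feed, the first kept token is seen at n == 1, so every token in the block ends up as BLOCKIN.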
Example document
475335v1 (DOI: 10.1101/475335)
bioRxiv XML
<sec id="s3c">
<title>Epidemic synchrony and annual phase coherence</title>
<p>
We explored correlations between dengue time series in different regions. Both epidemic synchrony and phase coherence were higher for closer regions and declined with distance (
<xref rid="fig3" ref-type="fig">Fig 3</xref>
). For the Urban-2 (n = 161) spatial level, epidemic synchrony reached the average countrywide correlation at approximately 1,260 kilometres (
<xref rid="fig3" ref-type="fig">Fig 3A</xref>
). This synchrony length represents a substantial part of Brazil’s dimensions as the country extends 4,395 kilometres north to south and 4,319 kilometres west to east. The coherence length had a higher value of 1,590 kilometres (
<xref rid="fig3" ref-type="fig">Fig 3B</xref>
), suggesting that agreement in dengue seasonality spreads further than correlations of epidemic curves.
</p>
<fig id="fig3" position="float" fig-type="figure">
<label>Fig 3.</label>
<caption>
<title>
Epidemic synchrony and annual phase coherence between Brazilian Urban-2 regions.
</title>
<p>
Epidemic synchrony (A) and annual phase coherence (B) summarised using nonparametric spline covariance function. Solid blue line describes the mean pairwise correlation from the data and the dotted lines represent the 95% envelope for bootstrapped correlations of case and annual phase angle time series, respectively. Red line indicates global countrywide correlation.
</p>
</caption>
<graphic xlink:href="475335_fig3.tif"/>
</fig>
<p>
We also looked at epidemic synchrony and annual phase coherence at other spatial levels (S4 and S5 Figs, respectively) and found that both synchrony and coherence lengths tend to decrease for smaller spatial resolutions and stabilise at 1,240 km and 1,500 km.
</p>
</sec>
The text "We also looked at epidemic synchrony..." (line 216) doesn't get the BLOCKSTART feature (it gets BLOCKIN instead), even though it is in its own block (but with a line feed as the first token, as described above).
I could try to submit a fix PR for it.
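One possible direction, as a rough sketch only: decide the block start from whether a token is the first one that was not filtered out, rather than from n == 0. The class name BlockStartFixSketch, the flag firstRealToken and the helper isFiltered (a simplified stand-in for TextUtilities.filterLine) are hypothetical names, and the real loop tracks more state (lineStatus, previousFeatures, lineStartX) that would presumably need the same treatment.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BlockStartFixSketch {
    // Assumption: simplified stand-in for TextUtilities.filterLine,
    // treating only line breaks as filtered tokens.
    static boolean isFiltered(String text) {
        return text.equals("\n") || text.equals("\r");
    }

    static List<String> blockStatuses(List<String> tokens) {
        List<String> statuses = new ArrayList<>();
        // Hypothetical flag: true until the first non-filtered token is seen.
        boolean firstRealToken = true;
        for (String text : tokens) {
            if (isFiltered(text)) {
                continue; // filtered tokens no longer influence the block start
            }
            statuses.add(firstRealToken ? "BLOCKSTART" : "BLOCKIN");
            firstRealToken = false;
        }
        return statuses;
    }

    public static void main(String[] args) {
        // The leading line feed is skipped and "We" is labelled BLOCKSTART.
        System.out.println(blockStatuses(Arrays.asList("\n", "We", "also", "looked")));
    }
}
```

A flag keeps the decision independent of how many tokens were skipped, so it would also cover blocks that start with several filtered tokens.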
/cc @kermitt2