Closed
Description
While iterating through the paragraph index (lucene-index-cord19-paragraph-2020-04-17), I identified some outliers:
- Some documents, e.g. docid 'ij3ncdb' (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4896250/), have a very long abstract. Because the abstract is appended to every paragraph entry in the index, the index size explodes.
- There is one odd document: its Cord-uid is 'hwjkbpqp' and its DOI is '10.1007/s12529-010-9106-9'. It has an empty abstract but 5849 paragraphs, which makes it an extreme outlier in terms of index size per document.
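For anyone who wants to reproduce this kind of check, here is a minimal sketch of how such outliers could be flagged. It is not the script I used and it does not touch the Lucene index directly: it assumes the index entries have already been dumped into `(paragraph_docid, abstract, paragraph_text)` tuples, and that paragraph docids look like `<cord_uid>.<paragraph number>`. The function names and the thresholds are hypothetical.

```python
from collections import defaultdict


def summarize_paragraph_index(entries):
    """Aggregate per-document statistics from paragraph-level entries.

    `entries` is an iterable of (paragraph_docid, abstract, paragraph_text).
    Assumption: a paragraph docid is "<cord_uid>.<number>", so the part
    before the first "." identifies the source document.
    """
    stats = defaultdict(
        lambda: {"paragraphs": 0, "abstract_chars": 0, "duplicated_chars": 0}
    )
    for docid, abstract, _paragraph in entries:
        base = docid.split(".", 1)[0]
        s = stats[base]
        s["paragraphs"] += 1
        s["abstract_chars"] = max(s["abstract_chars"], len(abstract))
        # The abstract is stored again with every paragraph entry,
        # so its length is paid once per paragraph.
        s["duplicated_chars"] += len(abstract)
    return dict(stats)


def find_outliers(stats, max_paragraphs=1000, max_duplicated_chars=1_000_000):
    """Return docids exceeding either (hypothetical) threshold, sorted."""
    return sorted(
        base
        for base, s in stats.items()
        if s["paragraphs"] > max_paragraphs
        or s["duplicated_chars"] > max_duplicated_chars
    )
```

With this, a document with a long abstract and many paragraphs shows up through `duplicated_chars`, while a document with an empty abstract and thousands of paragraphs shows up through `paragraphs`.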