GROBID splits sentences, puts second half in a figure description #1160

mariadelmarq · 2024-08-29T07:54:48Z

Potential error case; I'm not sure whether the article is open access (i.e., whether it can be used for training). The PDF file is from: https://link.springer.com/article/10.1007/s12144-016-9469-4.

The PDF looks like this:

Whereas GROBID appears to split the text inside this section:

and arbitrarily puts the second half into a figure description:

lfoppiano · 2024-09-01T19:52:17Z

Hi @mariadelmarq, and thanks again for reporting the issue. This is a recurring issue in the fulltext, and likely going to be solved by #963.

mariadelmarq · 2024-09-01T23:43:04Z

Thanks heaps, @lfoppiano! Do you have a rough timeline for the next release? No pressure at all, of course, it's just a great package and I would like to know whether a new iteration will be out before the end of the project I'm working on, later this year. Thanks again for all your work on this!

lfoppiano · 2024-09-02T04:04:45Z

Hi @mariadelmarq, we are currently working on releasing version 0.8.1 (#1123). We've been facing an issue with the JVM that requires processing a large number of PDF documents, and this is taking more time. The change I mentioned is going to be next year.

vegarab · 2024-09-02T09:30:52Z

Hi, @mariadelmarq, @lfoppiano I've been facing similar issues this past week and was about to enquire myself.

I'm seeing very simple, plain PDFs (clean front page, pages that are typically just a subheader plus a paragraph, clear bibliography in a standard format) being parsed incorrectly. Mostly, text disappears into figure descriptions, with sentences split in the middle.
The same happens with sentences pulled into tables and full paragraphs pulled into bibliography elements.
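
A rough way to spot these cases in the TEI output is sketched below. It assumes the usual GROBID TEI layout (`figure` elements containing a `figDesc`, in the TEI namespace) and relies on a guess of mine: a genuine caption rarely starts with a lowercase letter, whereas the tail of a split sentence usually does. The function name is made up for illustration, and the heuristic will miss splits that happen to start with a capitalised word.

```python
import xml.etree.ElementTree as ET
from typing import List

# TEI namespace used by GROBID's XML output
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def suspicious_fig_descs(tei_xml: str) -> List[str]:
    """Return figDesc texts that look like the tail of a split sentence."""
    root = ET.fromstring(tei_xml)
    flagged = []
    for fig_desc in root.iterfind(".//tei:figure/tei:figDesc", TEI_NS):
        # Collect all nested text and normalise whitespace
        text = " ".join("".join(fig_desc.itertext()).split())
        # Heuristic (an assumption, not a GROBID rule): captions rarely
        # begin with a lowercase letter, sentence continuations often do
        if text and text[0].islower():
            flagged.append(text)
    return flagged
```

Running this over a batch of TEI files gives a quick list of documents worth inspecting by hand.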

I mostly experience this with non-English PDFs, typically German.

> The change I mentioned is going to be next year.

Is there any way to contribute to speed up the work on this? I've found that GROBID is the best solution for full-text extraction from scholarly PDFs/documents. Or do you recommend any other way of extracting fulltexts that is less involved than the GROBID biblio and header extraction? Not looking for bibliography data or headers, just the clean paragraph-level text from the documents, removing any metadata, footers, author info, etc. etc.

Thanks

lfoppiano · 2024-09-12T12:22:04Z

Hi @vegarab, I'm assuming you are dealing with scientific articles.
One solution would be to create additional training data for the GROBID models. This could help to improve the results. I did not see any German documents in the fulltext training data, so I think even one or two could already improve the results.

Unfortunately, creating new training data can appear complicated at first. There are two steps: a) generate pre-annotated training data, and b) correct it following the guidelines. Refer to the documentation.

Since the GROBID models work in cascade, you will have to start from the segmentation model and work through the cascade. I explained this in another issue here.
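
For step a), a minimal sketch of how the batch run could be scripted from Python. It assumes the `createTraining` batch command and the `-gH`, `-dIn`, `-dOut` and `-exe` options of the GROBID one-jar as I recall them from the batch documentation, so please check them against the version you have built. The function name and all paths are placeholders, not part of GROBID.

```python
from pathlib import Path
from typing import List


def create_training_command(jar: Path, grobid_home: Path,
                            pdf_dir: Path, out_dir: Path) -> List[str]:
    """Build the batch command that generates pre-annotated training files.

    The flags are assumed from the GROBID batch documentation; verify them
    against your GROBID version before relying on this.
    """
    return [
        "java", "-Xmx4G",
        "-jar", str(jar),           # path to the grobid-core one-jar (placeholder)
        "-gH", str(grobid_home),    # grobid-home directory
        "-dIn", str(pdf_dir),       # directory with the input PDFs
        "-dOut", str(out_dir),      # directory for the generated training files
        "-exe", "createTraining",
    ]

# Usage (not executed here):
#   import subprocess
#   subprocess.run(create_training_command(jar, home, pdfs, out), check=True)
```

The generated files then need to be corrected by hand, beginning with the segmentation ones, before moving on to the fulltext ones.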

Unfortunately, I don't have time to work on the training data at the moment, but I can help you with the process if needed.

lfoppiano · 2024-09-16T12:06:31Z

Adding additional cases here.

lfoppiano · 2024-12-04T08:25:10Z

Another example, with body text misclassified as figure:

34	34	3	34	34	34	4	34	34	34	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	LOWERFONT	0	0	NOCAPS	ALLDIGIT	0	NOPUNCT	9	10	0	NUMBER	1	1	I-<citation_marker>
DNA	dna	D	DN	DNA	DNA	A	NA	DNA	DNA	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	HIGHERFONT	0	0	ALLCAP	NODIGIT	0	NOPUNCT	9	10	0	NUMBER	0	0	<figure>
libraries	libraries	l	li	lib	libr	s	es	ies	ries	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	10	0	NUMBER	0	0	<figure>
will	will	w	wi	wil	will	l	ll	ill	will	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	10	0	NUMBER	0	0	<figure>
be	be	b	be	be	be	e	be	be	be	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	10	0	NUMBER	0	0	<figure>
con	con	c	co	con	con	n	on	con	con	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	10	0	NUMBER	0	0	<figure>
-	-	-	-	-	-	-	-	-	-	BLOCKIN	LINEEND	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	ALLCAP	NODIGIT	1	HYPHEN	9	10	0	NUMBER	0	0	<figure>
structed	structed	s	st	str	stru	d	ed	ted	cted	BLOCKIN	LINESTART	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	10	0	NUMBER	0	0	<figure>
using	using	u	us	usi	usin	g	ng	ing	sing	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	10	0	NUMBER	0	0	<figure>
the	the	t	th	the	the	e	he	the	the	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	10	0	NUMBER	0	0	<figure>
Illumina	illumina	I	Il	Ill	Illu	a	na	ina	mina	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	9	10	0	NUMBER	0	0	<figure>
DNA	dna	D	DN	DNA	DNA	A	NA	DNA	DNA	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	ALLCAP	NODIGIT	0	NOPUNCT	9	10	0	NUMBER	0	0	<figure>
Prep	prep	P	Pr	Pre	Prep	p	ep	rep	Prep	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	9	10	0	NUMBER	0	0	<figure>
with	with	w	wi	wit	with	h	th	ith	with	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	10	0	NUMBER	0	0	<figure>
Enrichment	enrichment	E	En	Enr	Enri	t	nt	ent	ment	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	9	10	0	NUMBER	0	0	<figure>
,	,	,	,	,	,	,	,	,	,	BLOCKIN	LINEEND	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	ALLCAP	NODIGIT	1	COMMA	9	10	0	NUMBER	0	0	<figure>
Tagmentation	tagmentation	T	Ta	Tag	Tagm	n	on	ion	tion	BLOCKIN	LINESTART	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	9	10	0	NUMBER	0	0	<figure>
kit	kit	k	ki	kit	kit	t	it	kit	kit	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	10	0	NUMBER	0	0	<figure>
and	and	a	an	and	and	d	nd	and	and	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	10	0	NUMBER	0	0	<figure>
IDT	idt	I	ID	IDT	IDT	T	DT	IDT	IDT	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	ALLCAP	NODIGIT	0	NOPUNCT	9	10	0	NUMBER	0	0	<figure>
xGen	xgen	x	xG	xGe	xGen	n	en	Gen	xGen	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	10	0	NUMBER	0	0	<figure>
Exome	exome	E	Ex	Exo	Exom	e	me	ome	xome	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	9	10	0	NUMBER	0	0	<figure>
Research	research	R	Re	Res	Rese	h	ch	rch	arch	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	9	10	0	NUMBER	0	0	<figure>
Panel	panel	P	Pa	Pan	Pane	l	el	nel	anel	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	9	10	0	NUMBER	0	0	<figure>
v2	v2	v	v2	v2	v2	2	v2	v2	v2	BLOCKIN	LINEEND	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	CONTAINSDIGITS	0	NOPUNCT	9	10	0	NUMBER	0	0	<figure>
with	with	w	wi	wit	with	h	th	ith	with	BLOCKIN	LINESTART	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	11	0	NUMBER	0	0	<figure>
xGen	xgen	x	xG	xGe	xGen	n	en	Gen	xGen	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	11	0	NUMBER	0	0	<figure>
Universal	universal	U	Un	Uni	Univ	l	al	sal	rsal	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	9	11	0	NUMBER	0	0	<figure>
Blockers	blockers	B	Bl	Blo	Bloc	s	rs	ers	kers	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	9	11	0	NUMBER	0	0	<figure>
-	-	-	-	-	-	-	-	-	-	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	ALLCAP	NODIGIT	1	HYPHEN	9	11	0	NUMBER	0	0	<figure>
NXT	nxt	N	NX	NXT	NXT	T	XT	NXT	NXT	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	ALLCAP	NODIGIT	0	NOPUNCT	9	11	0	NUMBER	0	0	<figure>
Mix	mix	M	Mi	Mix	Mix	x	ix	Mix	Mix	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	9	11	0	NUMBER	0	0	<figure>
and	and	a	an	and	and	d	nd	and	and	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	11	0	NUMBER	0	0	<figure>
dual	dual	d	du	dua	dual	l	al	ual	dual	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	11	0	NUMBER	0	0	<figure>
unique	unique	u	un	uni	uniq	e	ue	que	ique	BLOCKEND	LINEEND	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	11	0	NUMBER	0	0	<figure>
barcodes	barcodes	b	ba	bar	barc	s	es	des	odes	BLOCKSTART	LINESTART	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	5	1	NUMBER	0	0	<figure>
.	.	.	.	.	.	.	.	.	.	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	ALLCAP	NODIGIT	1	DOT	9	5	1	NUMBER	0	0	<figure>
Paired	paired	P	Pa	Pai	Pair	d	ed	red	ired	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	9	5	1	NUMBER	0	0	<figure>
-	-	-	-	-	-	-	-	-	-	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	ALLCAP	NODIGIT	1	HYPHEN	9	5	1	NUMBER	0	0	<figure>
end	end	e	en	end	end	d	nd	end	end	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	5	1	NUMBER	0	0	<figure>
sequencing	sequencing	s	se	seq	sequ	g	ng	ing	cing	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	5	1	NUMBER	0	0	<figure>
(	(	(	(	(	(	(	(	(	(	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	ALLCAP	NODIGIT	1	OPENBRACKET	9	5	1	NUMBER	0	0	<figure>
2	2	2	2	2	2	2	2	2	2	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	ALLDIGIT	1	NOPUNCT	9	5	1	NUMBER	1	0	<figure>
×	×	×	×	×	×	×	×	×	×	BLOCKIN	LINEIN	LINEINDENT	NEWFONT	SAMEFONTSIZE	0	0	ALLCAP	NODIGIT	1	NOPUNCT	9	5	1	NUMBER	0	0	<figure>
150	150	1	15	150	150	0	50	150	150	BLOCKIN	LINEIN	LINEINDENT	NEWFONT	SAMEFONTSIZE	0	0	NOCAPS	ALLDIGIT	0	NOPUNCT	9	5	1	NUMBER	0	0	<figure>
bps	bps	b	bp	bps	bps	s	ps	bps	bps	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	5	1	NUMBER	0	0	<figure>
)	)	)	)	)	)	)	)	)	)	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	ALLCAP	NODIGIT	1	ENDBRACKET	9	5	1	NUMBER	0	0	<figure>
on	on	o	on	on	on	n	on	on	on	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	5	1	NUMBER	0	0	<figure>
the	the	t	th	the	the	e	he	the	the	BLOCKIN	LINEEND	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	5	1	NUMBER	0	0	<figure>

And then:

DNA	dna	D	DN	DNA	DNA	A	NA	DNA	DNA	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	HIGHERFONT	0	0	ALLCAP	NODIGIT	0	NOPUNCT	9	10	0	NUMBER	0	0	<figDesc>
libraries	libraries	l	li	lib	libr	s	es	ies	ries	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	10	0	NUMBER	0	0	<figDesc>
will	will	w	wi	wil	will	l	ll	ill	will	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	10	0	NUMBER	0	0	<figDesc>
be	be	b	be	be	be	e	be	be	be	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	10	0	NUMBER	0	0	<figDesc>
con	con	c	co	con	con	n	on	con	con	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	10	0	NUMBER	0	0	<figDesc>
-	-	-	-	-	-	-	-	-	-	BLOCKIN	LINEEND	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	ALLCAP	NODIGIT	1	HYPHEN	9	10	0	NUMBER	0	0	<figDesc>
structed	structed	s	st	str	stru	d	ed	ted	cted	BLOCKIN	LINESTART	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	10	0	NUMBER	0	0	<figDesc>
using	using	u	us	usi	usin	g	ng	ing	sing	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	10	0	NUMBER	0	0	<figDesc>
the	the	t	th	the	the	e	he	the	the	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	10	0	NUMBER	0	0	<figDesc>
Illumina	illumina	I	Il	Ill	Illu	a	na	ina	mina	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	9	10	0	NUMBER	0	0	<figDesc>
DNA	dna	D	DN	DNA	DNA	A	NA	DNA	DNA	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	ALLCAP	NODIGIT	0	NOPUNCT	9	10	0	NUMBER	0	0	<figDesc>
Prep	prep	P	Pr	Pre	Prep	p	ep	rep	Prep	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	9	10	0	NUMBER	0	0	<figDesc>
with	with	w	wi	wit	with	h	th	ith	with	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	10	0	NUMBER	0	0	<figDesc>
Enrichment	enrichment	E	En	Enr	Enri	t	nt	ent	ment	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	9	10	0	NUMBER	0	0	<figDesc>
,	,	,	,	,	,	,	,	,	,	BLOCKIN	LINEEND	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	ALLCAP	NODIGIT	1	COMMA	9	10	0	NUMBER	0	0	<figDesc>
Tagmentation	tagmentation	T	Ta	Tag	Tagm	n	on	ion	tion	BLOCKIN	LINESTART	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	9	10	0	NUMBER	0	0	<figDesc>
kit	kit	k	ki	kit	kit	t	it	kit	kit	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	10	0	NUMBER	0	0	<figDesc>
and	and	a	an	and	and	d	nd	and	and	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	10	0	NUMBER	0	0	<figDesc>
IDT	idt	I	ID	IDT	IDT	T	DT	IDT	IDT	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	ALLCAP	NODIGIT	0	NOPUNCT	9	10	0	NUMBER	0	0	<figDesc>
xGen	xgen	x	xG	xGe	xGen	n	en	Gen	xGen	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	10	0	NUMBER	0	0	<figDesc>
Exome	exome	E	Ex	Exo	Exom	e	me	ome	xome	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	9	10	0	NUMBER	0	0	<figDesc>
Research	research	R	Re	Res	Rese	h	ch	rch	arch	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	9	10	0	NUMBER	0	0	<figDesc>
Panel	panel	P	Pa	Pan	Pane	l	el	nel	anel	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	9	10	0	NUMBER	0	0	<figDesc>
v2	v2	v	v2	v2	v2	2	v2	v2	v2	BLOCKIN	LINEEND	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	CONTAINSDIGITS	0	NOPUNCT	9	10	0	NUMBER	0	0	<figDesc>
with	with	w	wi	wit	with	h	th	ith	with	BLOCKIN	LINESTART	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	11	0	NUMBER	0	0	<figDesc>
xGen	xgen	x	xG	xGe	xGen	n	en	Gen	xGen	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	11	0	NUMBER	0	0	<figDesc>
Universal	universal	U	Un	Uni	Univ	l	al	sal	rsal	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	9	11	0	NUMBER	0	0	<figDesc>
Blockers	blockers	B	Bl	Blo	Bloc	s	rs	ers	kers	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	9	11	0	NUMBER	0	0	<figDesc>
-	-	-	-	-	-	-	-	-	-	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	ALLCAP	NODIGIT	1	HYPHEN	9	11	0	NUMBER	0	0	<figDesc>
NXT	nxt	N	NX	NXT	NXT	T	XT	NXT	NXT	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	ALLCAP	NODIGIT	0	NOPUNCT	9	11	0	NUMBER	0	0	<figDesc>
Mix	mix	M	Mi	Mix	Mix	x	ix	Mix	Mix	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	9	11	0	NUMBER	0	0	<figDesc>
and	and	a	an	and	and	d	nd	and	and	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	11	0	NUMBER	0	0	<figDesc>
dual	dual	d	du	dua	dual	l	al	ual	dual	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	11	0	NUMBER	0	0	<figDesc>
unique	unique	u	un	uni	uniq	e	ue	que	ique	BLOCKEND	LINEEND	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	11	0	NUMBER	0	0	<figDesc>
barcodes	barcodes	b	ba	bar	barc	s	es	des	odes	BLOCKSTART	LINESTART	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	5	1	NUMBER	0	0	<figDesc>
.	.	.	.	.	.	.	.	.	.	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	ALLCAP	NODIGIT	1	DOT	9	5	1	NUMBER	0	0	<figDesc>
Paired	paired	P	Pa	Pai	Pair	d	ed	red	ired	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	9	5	1	NUMBER	0	0	<figDesc>
-	-	-	-	-	-	-	-	-	-	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	ALLCAP	NODIGIT	1	HYPHEN	9	5	1	NUMBER	0	0	<figDesc>
end	end	e	en	end	end	d	nd	end	end	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	5	1	NUMBER	0	0	<figDesc>
sequencing	sequencing	s	se	seq	sequ	g	ng	ing	cing	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	5	1	NUMBER	0	0	<figDesc>
(	(	(	(	(	(	(	(	(	(	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	ALLCAP	NODIGIT	1	OPENBRACKET	9	5	1	NUMBER	0	0	<figDesc>
2	2	2	2	2	2	2	2	2	2	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	ALLDIGIT	1	NOPUNCT	9	5	1	NUMBER	1	0	<figDesc>
×	×	×	×	×	×	×	×	×	×	BLOCKIN	LINEIN	LINEINDENT	NEWFONT	SAMEFONTSIZE	0	0	ALLCAP	NODIGIT	1	NOPUNCT	9	5	1	NUMBER	0	0	<figDesc>
150	150	1	15	150	150	0	50	150	150	BLOCKIN	LINEIN	LINEINDENT	NEWFONT	SAMEFONTSIZE	0	0	NOCAPS	ALLDIGIT	0	NOPUNCT	9	5	1	NUMBER	0	0	<figDesc>
bps	bps	b	bp	bps	bps	s	ps	bps	bps	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	5	1	NUMBER	0	0	<figDesc>
)	)	)	)	)	)	)	)	)	)	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	ALLCAP	NODIGIT	1	ENDBRACKET	9	5	1	NUMBER	0	0	<figDesc>
on	on	o	on	on	on	n	on	on	on	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	5	1	NUMBER	0	0	<figDesc>
the	the	t	th	the	the	e	he	the	the	BLOCKIN	LINEEND	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	5	1	NUMBER	0	0	<figDesc>
Illumina	illumina	I	Il	Ill	Illu	a	na	ina	mina	BLOCKIN	LINESTART	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	9	5	1	NUMBER	0	0	<figDesc>
NovaSeq	novaseq	N	No	Nov	Nova	q	eq	Seq	aSeq	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	9	5	1	NUMBER	0	0	<figDesc>
6000	6000	6	60	600	6000	0	00	000	6000	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	ALLDIGIT	0	NOPUNCT	9	5	1	NUMBER	0	0	<figDesc>
System	system	S	Sy	Sys	Syst	m	em	tem	stem	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	9	5	1	NUMBER	0	0	<figDesc>
at	at	a	at	at	at	t	at	at	at	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	5	1	NUMBER	0	0	<figDesc>
100×	100×	1	10	100	100×	×	0×	00×	100×	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	ALLCAP	CONTAINSDIGITS	0	NOPUNCT	9	5	1	NUMBER	0	0	<figDesc>
depth	depth	d	de	dep	dept	h	th	pth	epth	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	5	1	NUMBER	0	0	<figDesc>
for	for	f	fo	for	for	r	or	for	for	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	5	1	NUMBER	0	0	<figDesc>
exome	exome	e	ex	exo	exom	e	me	ome	xome	BLOCKIN	LINEEND	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	5	1	NUMBER	0	0	<figDesc>
sequencing	sequencing	s	se	seq	sequ	g	ng	ing	cing	BLOCKIN	LINESTART	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	5	1	NUMBER	0	0	<figDesc>
.	.	.	.	.	.	.	.	.	.	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	ALLCAP	NODIGIT	1	DOT	9	5	1	NUMBER	0	0	<figDesc>
Library	library	L	Li	Lib	Libr	y	ry	ary	rary	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	NOPUNCT	9	5	1	NUMBER	0	0	<figDesc>
preparation	preparation	p	pr	pre	prep	n	on	ion	tion	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	5	1	NUMBER	0	0	<figDesc>
and	and	a	an	and	and	d	nd	and	and	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	5	1	NUMBER	0	0	<figDesc>
sequencing	sequencing	s	se	seq	sequ	g	ng	ing	cing	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	5	1	NUMBER	0	0	<figDesc>
will	will	w	wi	wil	will	l	ll	ill	will	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	5	1	NUMBER	0	0	<figDesc>
be	be	b	be	be	be	e	be	be	be	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	5	1	NUMBER	0	0	<figDesc>
per	per	p	pe	per	per	r	er	per	per	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	5	1	NUMBER	0	0	<figDesc>
-	-	-	-	-	-	-	-	-	-	BLOCKIN	LINEEND	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	ALLCAP	NODIGIT	1	HYPHEN	9	5	1	NUMBER	0	0	<figDesc>
formed	formed	f	fo	for	form	d	ed	med	rmed	BLOCKIN	LINESTART	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	5	1	NUMBER	0	0	<figDesc>
for	for	f	fo	for	for	r	or	for	for	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	5	1	NUMBER	0	0	<figDesc>
all	all	a	al	all	all	l	ll	all	all	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	5	1	NUMBER	0	0	<figDesc>
family	family	f	fa	fam	fami	y	ly	ily	mily	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	5	1	NUMBER	0	0	<figDesc>
members	members	m	me	mem	memb	s	rs	ers	bers	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	5	1	NUMBER	0	0	<figDesc>
at	at	a	at	at	at	t	at	at	at	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	5	1	NUMBER	0	0	<figDesc>
the	the	t	th	the	the	e	he	the	the	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	5	1	NUMBER	0	0	<figDesc>
same	same	s	sa	sam	same	e	me	ame	same	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	5	1	NUMBER	0	0	<figDesc>
time	time	t	ti	tim	time	e	me	ime	time	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	5	1	NUMBER	0	0	<figDesc>
to	to	t	to	to	to	o	to	to	to	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	5	1	NUMBER	0	0	<figDesc>
minimize	minimize	m	mi	min	mini	e	ze	ize	mize	BLOCKIN	LINEEND	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	5	1	NUMBER	0	0	<figDesc>
potential	potential	p	po	pot	pote	l	al	ial	tial	BLOCKIN	LINESTART	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	5	1	NUMBER	0	0	<figDesc>
artifactual	artifactual	a	ar	art	arti	l	al	ual	tual	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	5	1	NUMBER	0	0	<figDesc>
differences	differences	d	di	dif	diff	s	es	ces	nces	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	5	1	NUMBER	0	0	<figDesc>
due	due	d	du	due	due	e	ue	due	due	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	5	1	NUMBER	0	0	<figDesc>
to	to	t	to	to	to	o	to	to	to	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	5	1	NUMBER	0	0	<figDesc>
sample	sample	s	sa	sam	samp	e	le	ple	mple	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	5	1	NUMBER	0	0	<figDesc>
preparation	preparation	p	pr	pre	prep	n	on	ion	tion	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	5	1	NUMBER	0	0	<figDesc>
.	.	.	.	.	.	.	.	.	.	BLOCKIN	LINEEND	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	ALLCAP	NODIGIT	1	DOT	9	5	1	NUMBER	0	0	<figDesc>
DNA	dna	D	DN	DNA	DNA	A	NA	DNA	DNA	BLOCKIN	LINESTART	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	ALLCAP	NODIGIT	0	NOPUNCT	9	6	1	NUMBER	0	0	<figDesc>
samples	samples	s	sa	sam	samp	s	es	les	ples	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	6	1	NUMBER	0	0	<figDesc>
will	will	w	wi	wil	will	l	ll	ill	will	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	6	1	NUMBER	0	0	<figDesc>
be	be	b	be	be	be	e	be	be	be	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	6	1	NUMBER	0	0	<figDesc>
stored	stored	s	st	sto	stor	d	ed	red	ored	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	6	1	NUMBER	0	0	<figDesc>
at	at	a	at	at	at	t	at	at	at	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	6	1	NUMBER	0	0	<figDesc>
-	-	-	-	-	-	-	-	-	-	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	ALLCAP	NODIGIT	1	HYPHEN	9	6	1	NUMBER	0	0	<figDesc>
80°	80°	8	80	80°	80°	°	0°	80°	80°	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	ALLCAP	CONTAINSDIGITS	0	NOPUNCT	9	6	1	NUMBER	0	0	<figDesc>
centigrade	centigrade	c	ce	cen	cent	e	de	ade	rade	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	6	1	NUMBER	0	0	<figDesc>
to	to	t	to	to	to	o	to	to	to	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	6	1	NUMBER	0	0	<figDesc>
allow	allow	a	al	all	allo	w	ow	low	llow	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	6	1	NUMBER	0	0	<figDesc>
for	for	f	fo	for	for	r	or	for	for	BLOCKIN	LINEEND	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	6	1	NUMBER	0	0	<figDesc>
future	future	f	fu	fut	futu	e	re	ure	ture	BLOCKIN	LINESTART	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	6	1	NUMBER	0	0	<figDesc>
verification	verification	v	ve	ver	veri	n	on	ion	tion	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	6	1	NUMBER	0	0	<figDesc>
studies	studies	s	st	stu	stud	s	es	ies	dies	BLOCKIN	LINEIN	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	NOPUNCT	9	6	1	NUMBER	0	0	<figDesc>
.	.	.	.	.	.	.	.	.	.	BLOCKEND	LINEEND	LINEINDENT	SAMEFONT	SAMEFONTSIZE	0	0	ALLCAP	NODIGIT	1	DOT	9	6	1	NUMBER	0	0	<figDesc>

And the output is a figure with no attributes besides the caption:

DNA libraries will be constructed using the Illumina DNA Prep with Enrichment, Tagmentation kit and IDT xGen Exome Research Panel v2 with xGen Universal Blockers-NXT Mix and dual unique barcodes. Paired-end sequencing (2 × 150 bps) on the Illumina NovaSeq 6000 System at 100× depth for exome sequencing. Library preparation and sequencing will be performed for all family members at the same time to minimize potential artifactual differences due to sample preparation. DNA samples will be stored at -80° centigrade to allow for future verification studies.

Here it seems that everything gets classified as figDesc only, so we might detect this edge case earlier during fulltext processing and avoid trying to generate a figure that is incomplete.
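
The check could look roughly like the sketch below. It is written in Python only to illustrate the logic (GROBID itself is Java), and it assumes the labelled output has the same layout as the dumps above: one token per row, with the label in the last tab-separated column. `labels_of` and `is_caption_only` are hypothetical names, not existing GROBID functions.

```python
from typing import Iterable, List


def labels_of(rows: Iterable[str]) -> List[str]:
    """Return the label of each non-empty row, without any leading 'I-' prefix."""
    labels = []
    for row in rows:
        if not row.strip():
            continue
        # The label is assumed to be the last tab-separated column
        label = row.rstrip().split("\t")[-1]
        labels.append(label[2:] if label.startswith("I-") else label)
    return labels


def is_caption_only(rows: Iterable[str]) -> bool:
    """True when every token of a figure candidate was labelled <figDesc>."""
    return set(labels_of(rows)) == {"<figDesc>"}
```

A candidate for which `is_caption_only` returns true has no head, label or graphic content, so it could be handed back to the body text instead of being emitted as a figure. Whether that is the right fallback is a design question this sketch does not settle.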

PDF: pub.1160333290.pdf

lfoppiano · 2024-12-06T13:17:04Z

@mariadelmarq I'm working on a different issue (#1206), but the fix there for tables also fixes the issue you reported on the Elsevier paper. It's in the branch #1207, which is still WIP, but it seems to correct the clearest cases of table misclassification. The issue is in fact text classified as table (not as figure, which is more difficult to validate).

lfoppiano added the error cases Some error/test case for future improvements label Sep 1, 2024

lfoppiano added the models:fulltext label Nov 1, 2024

lfoppiano added the licence:needs_CC-BY The articles are not CC-BY label Nov 12, 2024

lfoppiano mentioned this issue Dec 6, 2024

Misclassified tables and/or figures maybe tossed incorrectly #1206

Open
