Skip to content

GateNLP/atrium-text_extraction

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

3 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ATRIUM PDF Text Extraction

Converts PDFs into the ATRIUM JSON format described below:

#########################################################
# T4.1.2 PDF report text extraction -
# suggested JSON output per PDF report.
# Reports may be represented by separate JSON files, 
# or as separate records in a JSONL format file.
# Note - we don't need to include text for individual 
# identified sections as we have full text plus start/end 
# character positions of these sections
#########################################################
{
	"meta": {
		(any metadata about the source document)
	},
	"text": (full text contents of the document),
	"sections": [
		{ 
			"start": (number),	<-- character position
			"end": (number),	<-- character position
			"type": (string) 	<-- e.g. "title", "subtitle", "abstract", "conclusions", "references" etc.
		}
	]
	
}

About

ATRIUM: PDF Text Extraction

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages