Version 4.1 (2024)
These scripts collect and update the SIBiLS collections. In 2024, they are deployed on Denver.
The general principle follows the mechanism used by the NCBI to update MEDLINE or PMC:
- a one-time baseline of all available documents
- updates, generated on demand (usually daily), to take into account changes in the collections: documents added, modified or deleted.
Each collection has its own directory, with a script to generate a baseline and a script to generate updates.
For NCBI collections, the baseline and update files are provided on the NCBI FTP (ftp.ncbi.nlm.nih.gov). Generating the baseline therefore involves processing all the baseline archives; generating an update involves processing all the update files that have not yet been processed.
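As an illustration of this bookkeeping, here is a minimal sketch assuming the MEDLINE layout on ftp.ncbi.nlm.nih.gov (pubmed/baseline and pubmed/updatefiles) and a hypothetical local directory holding one log file per processed archive; the helper names are illustrative, not the actual scripts.

    import os
    from ftplib import FTP

    LOG_DIR = "logs"   # hypothetical: one log file per processed archive

    def list_remote(ftp_dir):
        """List the .xml.gz archives available in an NCBI FTP directory."""
        ftp = FTP("ftp.ncbi.nlm.nih.gov")
        ftp.login()                      # anonymous access
        ftp.cwd(ftp_dir)
        names = [n for n in ftp.nlst() if n.endswith(".xml.gz")]
        ftp.quit()
        return sorted(names)

    def pending_updates():
        """Update archives for which no log file has been written yet."""
        done = {f.rsplit(".", 1)[0] for f in os.listdir(LOG_DIR)}
        return [n for n in list_remote("pubmed/updatefiles")
                if n.replace(".xml.gz", "") not in done]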
For other collections, a baseline is generated by processing all the documents available. An update is then generated by processing the delta between the baseline and the present. This delta can be determined in various ways. For collections deposited by their owner on the SIBiLS FTP, the file movements tracked by our FTP server determine the updates.
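When no NCBI-style update files exist, the delta can also be computed directly from the document inventories. The sketch below uses hypothetical structures mapping a document id to a version marker (modification date, checksum, ...):

    def compute_delta(baseline_docs, current_docs):
        """Delta between the baseline and the present state of a collection.

        baseline_docs / current_docs map a document id to a version marker
        (hypothetical structures, for illustration only).
        """
        added    = [i for i in current_docs if i not in baseline_docs]
        deleted  = [i for i in baseline_docs if i not in current_docs]
        modified = [i for i in current_docs
                    if i in baseline_docs and current_docs[i] != baseline_docs[i]]
        return added, modified, deleted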
Documents are grouped together in bundles of several thousand, as NLM does. When a bundle is processed, a log file is generated with the bundle identifier. For each bundle, two files are generated and uploaded to the Denver FTP for downstream processing:
- a bib file, which contains bibliographic information and the available text of the documents. Bib files also contain identifiers of deleted documents. The information is delivered as JSON, with each value being a text or a list of texts, with the exception of the PMC collection, which contains deeper fields reflecting the hierarchical structure of the documents.
- a sen file, which contains the sentences of the documents, intended for the annotation process. These sentences are obtained with the sentence_splitter.py script. A sketch of both record types is given after this list.
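The exact schema of the bib and sen files is defined by the collection scripts; the sketch below only illustrates the shape described above, with invented field names and newline-delimited JSON assumed.

    import json

    # Hypothetical bib record: flat JSON, each value a text or a list of texts
    # (PMC records additionally keep the hierarchical article structure).
    bib_record = {
        "_id": "12345678",
        "title": "Example title",
        "abstract": "Example abstract.",
        "authors": ["A. Author", "B. Author"],
    }

    # Hypothetical sen record: the same document split into sentences
    # by sentence_splitter.py, for the annotation pipeline.
    sen_record = {
        "_id": "12345678",
        "sentences": ["Example abstract.", "Second sentence."],
    }

    with open("bundle_0001.bib", "w") as bib, open("bundle_0001.sen", "w") as sen:
        bib.write(json.dumps(bib_record) + "\n")
        sen.write(json.dumps(sen_record) + "\n")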
Copies of the data are saved in a local MongoDB for visualization.
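With pymongo, keeping such a copy could look like the following sketch; the database and collection names here are illustrative.

    from pymongo import MongoClient, ReplaceOne

    client = MongoClient("mongodb://localhost:27017")   # local MongoDB
    col = client["SIBiLS_KB"]["medline"]                # illustrative database/collection names

    def save_copy(records):
        """Upsert the bundle's records so the local copy mirrors what was uploaded."""
        ops = [ReplaceOne({"_id": r["_id"]}, r, upsert=True) for r in records]
        if ops:
            col.bulk_write(ops)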
MEDLINE citations are delivered by NLM in a dedicated XML format. These XML files are parsed using a third-party Python module (pubmed_parser), which has been modified to take into account the needs of SIBiLS.
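With the unmodified upstream pubmed_parser, reading one MEDLINE archive looks roughly as follows; the file name is only an example, and the SIBiLS fork adds fields on top of the returned dictionaries.

    import pubmed_parser as pp

    # Each citation becomes a flat dictionary (pmid, title, abstract, authors, ...).
    citations = pp.parse_medline_xml("pubmed25n0001.xml.gz")
    for c in citations:
        print(c["pmid"], c["title"])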
PMC items are parsed using the PAM module. This module retains the hierarchical structure of the documents, so that the article can be viewed in BiodiversityPMC.
Additional data (the suppdata collection) are collected from gz archives stored on the NCBI FTP. The NCBI file_list provides the path to the archive for each pmcid. Only pmcids already in the pmc collection are processed for this collection. The corresponding archive is downloaded and decompressed. The text is then extracted from the various file types, as in the dispatch sketch after this list:
- for jpg files, we use a local API (https://ocrweb.text-analytics.ch/) based on Tesseract.
- for pdf, we use the Python module PyPDF2.
- for tables (xlsx, xls and csv), we use a local extractor developed by Nona based on the Python module pandas.
- for Word documents (doc, docx), we use the Python modules textract and docx2txt (https://textract.readthedocs.io/en/stable/, https://pypi.org/project/docx2txt/).
- for html and xml, we extract the text using the Python module BeautifulSoup (https://pypi.org/project/beautifulsoup4/).
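A condensed sketch of this dispatch, using the current module APIs (PdfReader is the PyPDF2 3.x name); the OCR call to the local Tesseract API and the textract path for legacy .doc files are left out because their exact invocation is site-specific.

    import pandas as pd
    import docx2txt
    from PyPDF2 import PdfReader
    from bs4 import BeautifulSoup

    def extract_text(path):
        """Extract plain text from a supplementary file, dispatching on its extension."""
        ext = path.lower().rsplit(".", 1)[-1]
        if ext == "pdf":
            reader = PdfReader(path)
            return "\n".join(page.extract_text() or "" for page in reader.pages)
        if ext in ("xlsx", "xls", "csv"):
            df = pd.read_csv(path) if ext == "csv" else pd.read_excel(path)
            return df.to_csv(index=False)            # flatten the table to text
        if ext == "docx":
            return docx2txt.process(path)
        if ext in ("html", "htm", "xml"):
            with open(path, encoding="utf-8", errors="ignore") as f:
                return BeautifulSoup(f.read(), "html.parser").get_text(" ")
        return ""                                    # jpg (OCR) and other types handled elsewhere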
Bib records for suppdata are enriched with the metadata of their parent articles.
It took one year to process all the suppdata files. This baseline is stored in the local MongoDB, DB SIBiLS_KB, collection pmcsuppdata. When a baseline is generated, the documents are taken from this collection. Updates are then made from the entries added to the file-list.csv file.
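A sketch of that update step, assuming the file list carries the "File" and "Accession ID" columns of the NCBI OA file list and that the pmcid is stored as the MongoDB _id; both assumptions may need adjusting.

    import csv
    from pymongo import MongoClient

    col = MongoClient("mongodb://localhost:27017")["SIBiLS_KB"]["pmcsuppdata"]

    def new_suppdata_entries(file_list_path):
        """PMCIDs listed in file-list.csv whose archives are not yet in pmcsuppdata."""
        known = set(col.distinct("_id"))              # assumes pmcid is stored as _id
        with open(file_list_path, newline="") as f:
            for row in csv.DictReader(f):
                pmcid = row["Accession ID"]           # column name from the NCBI OA file list
                if pmcid not in known:
                    yield pmcid, row["File"]          # path to the gz archive on the NCBI FTP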
Plazi treatments are inserted, modified or deleted by Plazi on the Denver FTP. During a baseline, all current versions are collected and processed. Then, every day, file movements are tracked using FTP server logs to find out which files have been added, modified or deleted.
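A sketch of this tracking, assuming a vsftpd-style transfer log in which upload and delete operations appear as "OK UPLOAD:" and "OK DELETE:" lines with a quoted file path; the patterns will need adapting to the real log format.

    import re

    UPLOAD = re.compile(r'OK UPLOAD:.*?"(?P<path>[^"]+\.xml)"')
    DELETE = re.compile(r'OK DELETE:.*?"(?P<path>[^"]+\.xml)"')

    def moves_since_last_run(log_path):
        """Treatment files added or modified, and those deleted, since the last run."""
        changed, deleted = set(), set()
        with open(log_path) as log:
            for line in log:
                if (m := UPLOAD.search(line)):
                    changed.add(m.group("path"))
                    deleted.discard(m.group("path"))
                elif (m := DELETE.search(line)):
                    deleted.add(m.group("path"))
                    changed.discard(m.group("path"))
        return changed, deleted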
XML treatments are parsed using dedicated local functions.
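Those functions are local to SIBiLS; as a generic stand-in, the root attributes and flattened text of one treatment file can be pulled out with BeautifulSoup's XML parser (lxml required), without assuming anything about the Plazi element names.

    from bs4 import BeautifulSoup

    def parse_treatment(path):
        """Very reduced stand-in for the dedicated parsing functions:
        keep the root attributes and the flattened text of one treatment file."""
        with open(path, encoding="utf-8") as f:
            soup = BeautifulSoup(f.read(), "xml")      # needs lxml installed
        root = soup.find()                             # first (root) element
        return {
            "attributes": dict(root.attrs) if root else {},
            "text": soup.get_text(" ", strip=True),
        }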