sibils/collections


sibils-collections

Version 4.1 (2024)

These scripts are used to collect and update SIBiLS collections. In 2024, they are deployed in Denver.

General information

The general principle is based on the mechanism used by the NCBI to update MEDLINE or PMC:

  • a one-time baseline of all available documents
  • updates, generated on demand (usually daily), to take account of changes in the collections: documents added, modified or deleted.

General mechanism: Baseline & updates

Each collection has its own directory, with a script to generate a baseline and a script to generate updates.

For NCBI collections, the baseline and update files are provided on the NCBI FTP (ftp.ncbi.nlm.nih.gov). Generating the baseline therefore involves processing all the baseline archives; generating an update involves processing all the update files that have not yet been processed.

For other collections, a baseline is generated by processing all the documents available. An update is then generated by processing the delta between the baseline and the present. How this delta is found depends on the collection. For collections deposited by their owner on the SIBiLS FTP, the file movements tracked by our FTP server determine the updates.
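The "not yet processed" selection described above can be sketched as a set difference between what the source offers and what has already been logged. This is only an illustration: the function name and the file names are hypothetical, and the actual scripts may track processed files differently.

```python
def pending_updates(available, processed):
    """Return the update files that have not been processed yet, in name order.

    `available` is the list of update files offered by the source;
    `processed` is the list of files already handled (hypothetical bookkeeping).
    """
    done = set(processed)
    return sorted(name for name in available if name not in done)


if __name__ == "__main__":
    # Illustrative file names only.
    available = ["update_0003.xml.gz", "update_0001.xml.gz", "update_0002.xml.gz"]
    processed = ["update_0001.xml.gz"]
    print(pending_updates(available, processed))
```

Sorting by name assumes that file names increase with time, so that updates are applied in order.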

General mechanism: Files generated

Documents are grouped together in bundles of several thousand, as NLM does. When a bundle is processed, a log file is generated with the bundle identifier. For each bundle, two files are generated and uploaded to the Denver FTP for further processing:

  • a bib file, which contains bibliographic information and the available text of the documents. Bib files also contain identifiers of deleted documents. The information is delivered as JSON, with each value being a text or a list of texts, with the exception of the PMC collection, which contains deeper fields reflecting the hierarchical structure of the documents.
  • a sen file, which contains the sentences of the documents, intended for the annotation process. These sentences are obtained using the sentence_splitter.py script.
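As a rough sketch of the two outputs listed above, the code below groups documents into bundles and builds one bib payload and one sen payload per bundle. Everything here is an assumption made for illustration: the field names (`id`, `title`, `abstract`, `deleted`), the JSON layout, and the regex splitter, which is only a stand-in for sentence_splitter.py.

```python
import json
import re
from itertools import islice


def bundles(docs, size):
    """Yield successive lists of at most `size` documents."""
    it = iter(docs)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def naive_sentences(text):
    """Placeholder splitter: cut after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def build_files(bundle, deleted_ids=()):
    """Return (bib_json, sen_json) strings for one bundle (hypothetical layout)."""
    fields = ("title", "abstract")
    bib = {
        "documents": [{"id": d["id"], **{f: d[f] for f in fields}} for d in bundle],
        "deleted": list(deleted_ids),  # identifiers of deleted documents
    }
    sen = [
        {"id": d["id"], "field": f, "sentences": naive_sentences(d[f])}
        for d in bundle
        for f in fields
    ]
    return json.dumps(bib), json.dumps(sen)


if __name__ == "__main__":
    docs = [{"id": str(i), "title": "A title.", "abstract": "One. Two."} for i in range(5)]
    for n, bundle in enumerate(bundles(docs, 2)):
        bib_json, sen_json = build_files(bundle)
        print(n, len(bundle), len(bib_json), len(sen_json))
```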

Copies of the data are saved in a local MongoDB for visualization.

Specific mechanisms: MEDLINE

MEDLINE citations are delivered by NLM in a dedicated XML format. These XML files are parsed using a third-party Python module (pubmed_parser), which has been modified to meet the needs of SIBiLS.

Specific mechanisms: PMC & author manuscripts

PMC items are parsed using the PAM module. This module retains the hierarchical structure of the documents, so that the article can be viewed in BiodiversityPMC.

Specific mechanisms: PMC suppdata

Supplementary data are collected from a gz archive stored on the NCBI FTP. The NCBI file_list provides the path to the archive for each pmcid. Only pmcids already in the pmc collection are processed in this collection. The corresponding archive is downloaded and decompressed. The text is then extracted from various file types.

Bib records for suppdata are enriched with the metadata of their parent article.

Processing all the suppdata files took one year. The resulting baseline is stored in the local MongoDB, DB SIBiLS_KB, collection pmcsuppdata. For a baseline, the documents are taken from this collection. Updates are then made according to the items added to the file-list.csv file.
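The selection of suppdata items to update could look roughly like the sketch below: keep the file list entries whose pmcid is already in the pmc collection and has not been processed yet. The column names `pmcid` and `path` are hypothetical and do not claim to match the real NCBI file list, and the two sets stand in for lookups that the actual scripts presumably run against MongoDB.

```python
import csv
import io


def new_suppdata_entries(file_list_csv, known_pmcids, processed_pmcids):
    """Return (pmcid, archive_path) pairs that still need processing.

    `file_list_csv` is CSV text with hypothetical columns `pmcid` and `path`.
    """
    reader = csv.DictReader(io.StringIO(file_list_csv))
    return [
        (row["pmcid"], row["path"])
        for row in reader
        if row["pmcid"] in known_pmcids and row["pmcid"] not in processed_pmcids
    ]


if __name__ == "__main__":
    # Illustrative content only.
    sample = "pmcid,path\nPMC1,dir/a.tar.gz\nPMC2,dir/b.tar.gz\nPMC3,dir/c.tar.gz\n"
    print(new_suppdata_entries(sample, {"PMC1", "PMC2"}, {"PMC1"}))
```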

Specific mechanisms: Plazi treatments & Pensoft collection

Plazi treatments are inserted, modified or deleted by Plazi on the Denver FTP. During a baseline, all current versions are collected and processed. Then, every day, file movements are tracked using the FTP server logs to find out which files have been added, modified or deleted.
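One way to turn a day of file movements into an update is to reduce the chronological events to a single net change per file. The sketch below assumes the log has already been parsed into `(action, path)` pairs with the actions `added`, `modified` and `deleted`; the real log format and the actual reduction rules used by the scripts are not shown here.

```python
def net_changes(events):
    """Collapse chronological (action, path) events into one net action per path."""
    state = {}
    for action, path in events:
        previous = state.get(path)
        if action == "deleted":
            if previous == "added":
                # Added and deleted within the same window: nothing to report.
                del state[path]
            else:
                state[path] = "deleted"
        elif action == "added":
            # A file deleted and re-added is reported as a modification.
            state[path] = "modified" if previous == "deleted" else "added"
        else:  # "modified"
            # A new file that is modified afterwards is still a new file.
            state[path] = "added" if previous == "added" else "modified"
    return state


if __name__ == "__main__":
    events = [
        ("added", "a.xml"),
        ("modified", "a.xml"),
        ("modified", "b.xml"),
        ("deleted", "b.xml"),
    ]
    print(net_changes(events))
```

Reducing events this way avoids processing a file several times when it changes more than once between two updates.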

XML treatments are parsed using dedicated local functions.
