Version 4.1 (2024)
These scripts collect and update the SIBiLS collections. In 2024, they are deployed on Denver.
The general principle follows the mechanism used by the NCBI to update MEDLINE or PMC:
- a one-time baseline of all available documents
- updates, generated on demand (usually daily), to take into account changes in the collections: documents added, modified or deleted.
Each collection has its own directory, with a script to generate a baseline and a script to generate updates.
For NCBI collections, the baseline and update files are provided on the NCBI FTP (ftp.ncbi.nlm.nih.gov). Generating the baseline therefore involves processing all the baseline archives; generating an update involves processing all the update files that have not yet been processed.
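As an illustration of this bookkeeping, here is a minimal sketch assuming the MEDLINE layout on ftp.ncbi.nlm.nih.gov (pubmed/baseline and pubmed/updatefiles) and a hypothetical local directory holding one log file per processed archive; the helper names are illustrative, not the actual scripts.

    import os
    from ftplib import FTP

    LOG_DIR = "logs"   # hypothetical: one log file per processed archive

    def list_remote(ftp_dir):
        """List the .xml.gz archives available in an NCBI FTP directory."""
        ftp = FTP("ftp.ncbi.nlm.nih.gov")
        ftp.login()                      # anonymous access
        ftp.cwd(ftp_dir)
        names = [n for n in ftp.nlst() if n.endswith(".xml.gz")]
        ftp.quit()
        return sorted(names)

    def pending_updates():
        """Update archives for which no log file has been written yet."""
        done = {f.rsplit(".", 1)[0] for f in os.listdir(LOG_DIR)}
        return [n for n in list_remote("pubmed/updatefiles")
                if n.replace(".xml.gz", "") not in done]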
For other collections, a baseline is generated by processing all the documents available. An update is then generated by processing the delta between the baseline and the present. This delta can be determined in various ways. For collections deposited by their owner on the SIBiLS FTP, the file movements tracked by our FTP server determine the updates.
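When no NCBI-style update files exist, the delta can also be computed directly from the document inventories. The sketch below uses hypothetical structures mapping a document id to a version marker (modification date, checksum, ...):

    def compute_delta(baseline_docs, current_docs):
        """Delta between the baseline and the present state of a collection.

        baseline_docs / current_docs map a document id to a version marker
        (hypothetical structures, for illustration only).
        """
        added    = [i for i in current_docs if i not in baseline_docs]
        deleted  = [i for i in baseline_docs if i not in current_docs]
        modified = [i for i in current_docs
                    if i in baseline_docs and current_docs[i] != baseline_docs[i]]
        return added, modified, deleted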
Documents are grouped together in bundles of several thousand, as NLM does. When a bundle is processed, a log file is generated with the bundle identifier. For each bundle, two files are generated and uploaded to the Denver FTP for downstream processing:
- a bib file, which contains bibliographic information and the available text of the documents. Bib files also contain identifiers of deleted documents. The information is delivered as JSON, with each value being a text or a list of texts, with the exception of the PMC collection, which contains deeper fields reflecting the hierarchical structure of the documents.
- a sen file, which contains the sentences of the documents, intended for the annotation process. These sentences are obtained with the sentence_splitter.py script. A sketch of both record types is given after this list.
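The exact schema of the bib and sen files is defined by the collection scripts; the sketch below only illustrates the shape described above, with invented field names and newline-delimited JSON assumed.

    import json

    # Hypothetical bib record: flat JSON, each value a text or a list of texts
    # (PMC records additionally keep the hierarchical article structure).
    bib_record = {
        "_id": "12345678",
        "title": "Example title",
        "abstract": "Example abstract.",
        "authors": ["A. Author", "B. Author"],
    }

    # Hypothetical sen record: the same document split into sentences
    # by sentence_splitter.py, for the annotation pipeline.
    sen_record = {
        "_id": "12345678",
        "sentences": ["Example abstract.", "Second sentence."],
    }

    with open("bundle_0001.bib", "w") as bib, open("bundle_0001.sen", "w") as sen:
        bib.write(json.dumps(bib_record) + "\n")
        sen.write(json.dumps(sen_record) + "\n")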
Copies of the data are saved in a local MongoDB for visualization.
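With pymongo, keeping such a copy could look like the following sketch; the database and collection names here are illustrative.

    from pymongo import MongoClient, ReplaceOne

    client = MongoClient("mongodb://localhost:27017")   # local MongoDB
    col = client["SIBiLS_KB"]["medline"]                # illustrative database/collection names

    def save_copy(records):
        """Upsert the bundle's records so the local copy mirrors what was uploaded."""
        ops = [ReplaceOne({"_id": r["_id"]}, r, upsert=True) for r in records]
        if ops:
            col.bulk_write(ops)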
MEDLINE citations are delivered by NLM in a dedicated XML format. These XML files are parsed using a third-party Python module (pubmed_parser), which has been modified to take into account the needs of SIBiLS.
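With the unmodified upstream pubmed_parser, reading one MEDLINE archive looks roughly as follows; the file name is only an example, and the SIBiLS fork adds fields on top of the returned dictionaries.

    import pubmed_parser as pp

    # Each citation becomes a flat dictionary (pmid, title, abstract, authors, ...).
    citations = pp.parse_medline_xml("pubmed25n0001.xml.gz")
    for c in citations:
        print(c["pmid"], c["title"])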
PMC items are parsed using the PAM module. This module retains the hierarchical structure of the documents, so that the article can be viewed in BiodiversityPMC.
Additional data (the suppdata collection) are collected from gz archives stored on the NCBI FTP. The NCBI file_list provides the path to the archive for each pmcid. Only pmcids already in the pmc collection are processed for this collection. The corresponding archive is downloaded and decompressed. The text is then extracted from the various file types, as in the dispatch sketch after this list:
- for jpg files, we use a local API (https://ocrweb.text-analytics.ch/) based on Tesseract.
- for pdf, we use the Python module PyPDF2.
- for tables (xlsx, xls and csv), we use a local extractor developed by Nona based on the Python module pandas.
- for Word documents (doc, docx), we use the Python modules textract and docx2txt (https://textract.readthedocs.io/en/stable/, https://pypi.org/project/docx2txt/).
- for html and xml, we extract the text using the Python module BeautifulSoup (https://pypi.org/project/beautifulsoup4/).
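A condensed sketch of this dispatch, using the current module APIs (PdfReader is the PyPDF2 3.x name); the OCR call to the local Tesseract API and the textract path for legacy .doc files are left out because their exact invocation is site-specific.

    import pandas as pd
    import docx2txt
    from PyPDF2 import PdfReader
    from bs4 import BeautifulSoup

    def extract_text(path):
        """Extract plain text from a supplementary file, dispatching on its extension."""
        ext = path.lower().rsplit(".", 1)[-1]
        if ext == "pdf":
            reader = PdfReader(path)
            return "\n".join(page.extract_text() or "" for page in reader.pages)
        if ext in ("xlsx", "xls", "csv"):
            df = pd.read_csv(path) if ext == "csv" else pd.read_excel(path)
            return df.to_csv(index=False)            # flatten the table to text
        if ext == "docx":
            return docx2txt.process(path)
        if ext in ("html", "htm", "xml"):
            with open(path, encoding="utf-8", errors="ignore") as f:
                return BeautifulSoup(f.read(), "html.parser").get_text(" ")
        return ""                                    # jpg (OCR) and other types handled elsewhere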
Bib records for suppdata are enriched with the metadata of their parent articles.
It took one year to process all the suppdata files. This baseline is stored in the local MongoDB, DB SIBiLS_KB, collection pmcsuppdata. When a baseline is generated, the documents are taken from this collection. Updates are then made from the entries added to the file-list.csv file.
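A sketch of that update step, assuming the file list carries the "File" and "Accession ID" columns of the NCBI OA file list and that the pmcid is stored as the MongoDB _id; both assumptions may need adjusting.

    import csv
    from pymongo import MongoClient

    col = MongoClient("mongodb://localhost:27017")["SIBiLS_KB"]["pmcsuppdata"]

    def new_suppdata_entries(file_list_path):
        """PMCIDs listed in file-list.csv whose archives are not yet in pmcsuppdata."""
        known = set(col.distinct("_id"))              # assumes pmcid is stored as _id
        with open(file_list_path, newline="") as f:
            for row in csv.DictReader(f):
                pmcid = row["Accession ID"]           # column name from the NCBI OA file list
                if pmcid not in known:
                    yield pmcid, row["File"]          # path to the gz archive on the NCBI FTP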
Plazi treatments are inserted, modified or deleted by Plazi on the Denver FTP. During a baseline, all current versions are collected and processed. Then, every day, file movements are tracked using FTP server logs to find out which files have been added, modified or deleted.
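A sketch of this tracking, assuming a vsftpd-style transfer log in which upload and delete operations appear as "OK UPLOAD:" and "OK DELETE:" lines with a quoted file path; the patterns will need adapting to the real log format.

    import re

    UPLOAD = re.compile(r'OK UPLOAD:.*?"(?P<path>[^"]+\.xml)"')
    DELETE = re.compile(r'OK DELETE:.*?"(?P<path>[^"]+\.xml)"')

    def moves_since_last_run(log_path):
        """Treatment files added or modified, and those deleted, since the last run."""
        changed, deleted = set(), set()
        with open(log_path) as log:
            for line in log:
                if (m := UPLOAD.search(line)):
                    changed.add(m.group("path"))
                    deleted.discard(m.group("path"))
                elif (m := DELETE.search(line)):
                    deleted.add(m.group("path"))
                    changed.discard(m.group("path"))
        return changed, deleted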
XML treatments are parsed using dedicated local functions.
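Those functions are local to SIBiLS; as a generic stand-in, the root attributes and flattened text of one treatment file can be pulled out with BeautifulSoup's XML parser (lxml required), without assuming anything about the Plazi element names.

    from bs4 import BeautifulSoup

    def parse_treatment(path):
        """Very reduced stand-in for the dedicated parsing functions:
        keep the root attributes and the flattened text of one treatment file."""
        with open(path, encoding="utf-8") as f:
            soup = BeautifulSoup(f.read(), "xml")      # needs lxml installed
        root = soup.find()                             # first (root) element
        return {
            "attributes": dict(root.attrs) if root else {},
            "text": soup.get_text(" ", strip=True),
        }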