Release 1.1 · blekhmanlab/rxivist

API

The /db directory has been added to document the pre-built Docker images now being released with the Rxivist database dumps.
The author_translations database table is no longer used to redirect outdated author profile page URLs to the new ones.

The methods to retrieve a preprint's date of publication have been pulled into the web crawler properly. Previously, they were used only to collect data for the Rxivist preprint; they are now part of regular data collection, toggled by a new option in the config file.
More command-line options for launching the spider. Primarily, running python spider.py refresh no longer requires the ID of a single preprint, and will launch a regular refresh session.
More nuanced handling of errors encountered when querying the publication status of a preprint. If too many errors are encountered when calling this endpoint, the spider no longer bails on the entire session; that feature is simply turned off for that run.
Fix for a bug in which scraped DOI information was not validated appropriately.
Workaround for counting the number of recognized papers when searching for new preprints—previously, we used a new URL to indicate that a revision had been posted, which caused problems when bioRxiv changed the format of all their URLs. The new way is less accurate, but less fragile.
Increased the default cap on the number of articles refreshed per category in a single run. This cap is also now doubled automatically for the neuroscience collection.
Removal of several irrelevant utilities—a sitemap builder, for example.
Reduced excessively verbose logging when searching for publication status.
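
As a rough illustration of two of the changes above (the per-category refresh cap that doubles for neuroscience, and the publication-status check that switches itself off after repeated errors), here is a minimal sketch. It is not the spider's actual code: the RefreshSession class, the refresh function, the lookup callable, and the numeric defaults are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class RefreshSession:
    """Hypothetical state for one refresh run; not the spider's real class."""
    per_category_cap: int = 100   # assumed default; the release gives no number
    max_pub_errors: int = 3       # assumed error threshold
    pub_errors: int = 0
    check_publication: bool = True

    def cap_for(self, category: str) -> int:
        # The neuroscience collection gets twice the default cap.
        if category == "neuroscience":
            return self.per_category_cap * 2
        return self.per_category_cap

    def publication_status(
        self, doi: str, lookup: Callable[[str], str]
    ) -> Optional[str]:
        """Return the lookup result, or None if the check is off or fails."""
        if not self.check_publication:
            return None
        try:
            return lookup(doi)
        except Exception:
            # Broad catch for brevity; real code would catch specific errors.
            self.pub_errors += 1
            if self.pub_errors >= self.max_pub_errors:
                # Too many failures: disable the feature, keep the session.
                self.check_publication = False
            return None


def refresh(
    session: RefreshSession,
    articles_by_category: Dict[str, List[str]],
    lookup: Callable[[str], str],
) -> List[str]:
    """Refresh up to the cap in each category; return the DOIs processed."""
    processed = []
    for category, dois in articles_by_category.items():
        for doi in dois[: session.cap_for(category)]:
            session.publication_status(doi, lookup)
            processed.append(doi)
    return processed
```

In this sketch, a lookup that always raises stops being called once max_pub_errors is reached, while the loop over categories carries on to the end of the run.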