The backup file in this repository is a copy of the production database powering Rxivist.org, a website that helps readers find interesting preprints by using download and Twitter metrics to sort papers posted to bioRxiv.org.
Beginning with the 2019-02-13 snapshot, all database dumps here should have a corresponding container image hosted on Docker Hub. If you are unfamiliar with SQL databases, using the most recent image there is likely simpler than loading the data manually from this repository.
- Download the `docker-compose.yml` file from this repository. You don't need the snapshot itself if you are using the pre-built Docker container; the data is already baked in.
- Install Docker, a free containerization platform. Containers are used for all the Rxivist components to limit external dependencies and side-effects from installation of tools such as PostgreSQL.
- Start the Docker daemon.
- Open a command terminal, navigate to the directory in which you've stored `docker-compose.yml`, and run the command `docker-compose up`.
- In your browser, navigate to `http://localhost:8080`; this should show you the login screen for pgAdmin, a web application for interacting with SQL databases. (Both this web app and the database are only accessible from your computer.) The username and password are both `postgres`.
- Once you are logged in, look in the left-hand toolbar; right-click on "Servers" and select "Create Server." Give this connection a name, then, in the "Connection" tab of the "Create - Server" dialog, enter `rxdb` in the field for "Host name/address." All other fields should be correct. Click "Save."
- The new server connection should now appear in the "Servers" tree in the left sidebar. Open it, and select the `postgres` database from under the "Databases" category. Right-clicking on the `postgres` database and selecting the "Query Tool..." option will open a window with which you can submit queries to the database. For examples, see below.
If you want to load the snapshot directly, rather than trusting the contents of the `rxivist_data` container image, any conventional method of restoring Postgres backups should work properly for these snapshots. The instructions below are for launching a Postgres server on a local workstation and importing the data.
- Download the database snapshot from this repository. It's the file called `rxivist.backup`.
- Install Docker, a free containerization platform. Containers are used for all the Rxivist components to limit external dependencies and side-effects from installation of tools such as PostgreSQL.
- Start the Docker daemon.
- Open a command terminal and start a PostgreSQL server as a background process: `docker run --name rxdb -e POSTGRES_PASSWORD=asdf -d -p 127.0.0.1:5432:5432 postgres`. This should result in a database server being available on your workstation at `localhost:5432`, the standard Postgres port.
- While you can use the `psql` tool to interact with the database from within the container, those unfamiliar with SQL databases may have an easier time using pgAdmin 4, a free tool that adds a browser-based interface with which to run commands. Instead of installing the software on your computer, you can launch it in a container by going to your command line and running `docker run -p 8080:80 -e "PGADMIN_DEFAULT_EMAIL=postgres" -e "PGADMIN_DEFAULT_PASSWORD=postgres" -d dpage/pgadmin4`.
- In your browser, navigate to `http://localhost:8080`. The login page presented here is for the pgAdmin tool itself, not the database. The username and password are both `postgres`.
- On the left side, right-click on "Servers" and select "Create Server." Give this connection a name, then, in the "Connection" tab of the "Create - Server" dialog, enter the properties of the PostgreSQL database you created in step 4. "Host name/address" should be `localhost`; most other values can remain unchanged. For the "Password" field, enter whatever password you specified in the Docker command in step 4 (in the example, it's `asdf`). Click "Save."
- The new server connection should now appear in the "Servers" tree in the left sidebar. Open it, and select the `postgres` database from under the "Databases" category. Right-click on `postgres` and select the "Restore..." option.
- This should open the "Restore" dialog. For "Filename," click on the ellipsis on the right side of the box and find the `rxivist.backup` file that you downloaded in step 1.
- Switch to the "Restore options" tab of this dialog box. Most of these values can remain unchanged, but three values need to be changed from "No" to "Yes": "Pre-data," "Data" and "Owner."
- Click the blue "Restore" button.
- Once this process is complete, your "postgres" database should have two schemas: "public," which doesn't have data in it, and "prod," which contains the data presented on Rxivist.org.
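The pgAdmin "Restore" dialog is a front-end for `pg_restore`, so the same import can be done entirely from the command line. The commands below are a sketch under assumptions, not the documented procedure: they assume the container is named `rxdb` as in step 4 and that the snapshot is restored into the default `postgres` database.

```shell
# Copy the snapshot into the container, then restore it.
# pg_restore's defaults roughly match the dialog settings above:
# pre-data and data sections are restored and ownership is preserved.
docker cp rxivist.backup rxdb:/tmp/rxivist.backup
docker exec rxdb pg_restore -U postgres -d postgres --verbose /tmp/rxivist.backup
```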
(For those looking to create a snapshot of their own: snapshots are created using the "Backup" dialog with the "Yes" option selected for "Pre-data," "Data," "Blobs," "Owner," "Privilege," "With OID(s)" and "Verbose messages.")
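Those "Backup" dialog options correspond roughly to `pg_dump` flags, so a command-line equivalent might look like the following sketch. Note that the `--oids` flag was removed in PostgreSQL 12, so this form only applies to older server versions; the container name `rxdb` is again assumed from step 4.

```shell
# Custom-format dump with blobs, OIDs, and verbose output,
# approximating the pgAdmin "Backup" dialog settings above.
docker exec rxdb pg_dump -U postgres -Fc --blobs --oids --verbose -f /tmp/rxivist.backup postgres
docker cp rxdb:/tmp/rxivist.backup ./rxivist.backup
```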
These queries are based on ones used to generate the figures that appear in our preprint, *Tracking the popularity and outcomes of all bioRxiv preprints*.
Total papers:

```sql
SELECT COUNT(id) FROM prod.articles;
```

Total neuroscience papers:

```sql
SELECT COUNT(id) FROM prod.articles WHERE collection='neuroscience';
```

Total authors:

```sql
SELECT COUNT(id) FROM prod.authors;
```
Submissions per month:

```sql
SELECT EXTRACT(YEAR FROM posted)||'-'||lpad(EXTRACT(MONTH FROM posted)::text, 2, '0') AS month,
  COUNT(id) AS submissions
FROM prod.articles
GROUP BY 1
ORDER BY 1;
```
Submissions per month, per category:

```sql
SELECT EXTRACT(YEAR FROM posted)||'-'||lpad(EXTRACT(MONTH FROM posted)::text, 2, '0') AS date,
  REPLACE(collection, '-', ' ') AS collection,
  COUNT(id) AS submissions
FROM prod.articles
GROUP BY 1,2
ORDER BY 1,2;
```
Downloads per month, per category:

```sql
SELECT article_traffic.year||'-'||lpad(article_traffic.month::text, 2, '0') AS date,
  SUM(article_traffic.pdf) AS downloads,
  REPLACE(articles.collection, '-', ' ') AS collection
FROM prod.article_traffic
LEFT JOIN prod.articles ON article_traffic.article=articles.id
GROUP BY 1,3
ORDER BY 1,3;
```
Downloads per month:

```sql
SELECT month, year, SUM(pdf) AS downloads
FROM prod.article_traffic
GROUP BY year, month
ORDER BY year, month;
```
Publication rate by month:

```sql
SELECT month, posted, published, published::decimal/posted AS rate
FROM (
  SELECT EXTRACT(YEAR FROM a.posted)||'-'||lpad(EXTRACT(MONTH FROM a.posted)::text, 2, '0') AS month,
    COUNT(a.id) AS posted,
    COUNT(p.doi) AS published
  FROM prod.articles a
  LEFT JOIN prod.article_publications p ON a.id=p.article
  GROUP BY month
) AS counts
ORDER BY month;
```
Interval between posting to bioRxiv and date of publication:

```sql
SELECT a.id, EXTRACT(YEAR FROM a.posted) AS year, REPLACE(a.collection, '-', ' ') AS collection,
  p.date AS published, (p.date-a.posted) AS interval
FROM prod.articles a
INNER JOIN prod.publication_dates p ON a.id=p.article
WHERE p.date > '1900-01-01' -- dummy value used for unknown dates
ORDER BY interval DESC;
```
Downloads in first month on bioRxiv:

```sql
SELECT a.id, t.month, t.year, t.pdf AS downloads
FROM prod.articles a
LEFT JOIN prod.article_traffic t
  ON a.id=t.article
  AND t.year = (
    SELECT MIN(year)
    FROM prod.article_traffic
    WHERE article = a.id
  )
  AND t.month = (
    SELECT MIN(month)
    FROM prod.article_traffic
    WHERE article = a.id
      AND year = (
        SELECT MIN(year)
        FROM prod.article_traffic
        WHERE article = a.id
      )
  )
ORDER BY a.id;
```
Papers per author:

```sql
SELECT a.id, REPLACE(a.name, ',', ' ') AS name, COUNT(DISTINCT p.article) AS papers, COUNT(DISTINCT e.email) AS emails
FROM prod.authors a
INNER JOIN prod.article_authors p
  ON a.id=p.author
LEFT JOIN prod.author_emails e
  ON a.id=e.author
GROUP BY 1
ORDER BY 3 DESC;
```
Preprints published within a week of posting:

```sql
SELECT COUNT(id)
FROM (
  SELECT a.id, (p.date-a.posted) AS interval
  FROM prod.articles a
  INNER JOIN prod.publication_dates p ON a.id=p.article
  WHERE p.date > '1900-01-01' -- dummy value used for unknown dates
) AS intervals
WHERE interval >= 0 AND interval <= 7;
```
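Any of the queries above can also be run without pgAdmin and exported as CSV using `psql`. For example, to export submissions per month, again assuming the database container from step 4 is named `rxdb`:

```shell
# \copy streams the query result to stdout as CSV with a header row.
docker exec rxdb psql -U postgres -c "\copy (SELECT EXTRACT(YEAR FROM posted)||'-'||lpad(EXTRACT(MONTH FROM posted)::text, 2, '0') AS month, COUNT(id) AS submissions FROM prod.articles GROUP BY 1 ORDER BY 1) TO STDOUT CSV HEADER"
```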