Skip to content
This repository has been archived by the owner on Mar 1, 2023. It is now read-only.

Latest commit

 

History

History

db

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 
 
 
 
 

Rxivist database snapshots

The backup file in this repository is a copy of the production database powering Rxivist.org, a website that helps readers find interesting preprints by using download and Twitter metrics to sort papers posted to bioRxiv.org.

PostgreSQL snapshot restoration

Using the pre-built Docker container

Beginning with the 2019-02-13 snapshot, all database dumps here should have a corresponding container image hosted on Docker Hub. If you are unfamiliar with SQL databases, using the most recent image there is likely simpler than loading the data manually from this repository.

  1. Download the docker-compose.yml file from this repository. You don't need the snapshot itself if you are using the pre-built Docker container; the data is already baked in.
  2. Install Docker, a free containerization platform. Containers are used for all the Rxivist components to limit external dependencies and side-effects from installation of tools such as PostgreSQL.
  3. Start the Docker daemon.
  4. Open a command terminal and navigate to the directory in which you've stored docker-compose.yml. Run the command docker-compose up
  5. In your browser, navigate to http://localhost:8080; this should show you the login screen for pgAdmin, a web application for interacting with SQL databases. (Both this web app and the database are only accessible from your computer.) The username and password are both postgres.
  6. Once you are logged in, look in the the left-hand toolbar; right-click on "Servers," and select "Create Server." Give this connection a name, then, in the "Connection" tab of the "Create - Server" dialog, enter rxdb in the field for "Host name/address." All other fields should be correct. Click "Save."
  7. The new server connections should now appear in the "Servers" tree in the left sidebar. Open this, and select the postgres database from under the Databases category. Right-clicking on the postgres database and selecting the "Query Tool..." option will open a window with which you can submit queries to the database. For examples, see below.

Loading the database backup manually

If you want to load the snapshot directly, rather than trusting the contents of the rxivist_data container image, any conventional method of restoring Postgres backups should work properly for these snapshots. The instructions below are for launching a Postgres server on a local workstation and importing the data.

  1. Download the database snapshot from this repository. It's the file called rxivist.backup.
  2. Install Docker, a free containerization platform. Containers are used for all the Rxivist components to limit external dependencies and side-effects from installation of tools such as PostgreSQL.
  3. Start the Docker daemon.
  4. Open a command terminal and start a PostgreSQL server as a background process: docker run --name rxdb -e POSTGRES_PASSWORD=asdf -d -p 127.0.0.1:5432:5432 postgres This should result in a database server being available on your workstation at localhost:5432, the standard Postgres port.
  5. While you can use the psql tool to interact with the database from within the container, those unfamiliar with SQL databases may have an easier time using pgAdmin 4, a free tool that adds a browser-based interface with which to run commands. Instead of installing the software on your computer, you can launch it in a container by going to your command line and running docker run -p 8080:80 -e "PGADMIN_DEFAULT_EMAIL=postgres" -e "PGADMIN_DEFAULT_PASSWORD=postgres" -d dpage/pgadmin4
  6. In your browser, navigate to http://localhost:8080. The login page presented here is for the pgAdmin tool itself—not the database. The username and password are both postgres.
  7. On the left side, right-click on "Servers," and select "Create Server." Give this connection a name, then, in the "Connection" tab of the "Create - Server" dialog, enter the properties of the PostgreSQL database you created in step 4. "Host name/address" should be localhost; most other values can remain unchanged. For the "Password" field, enter whatever password you specified in the Docker command in step 4 (in the example, it's asdf.) Click "Save."
  8. The new server connections should now appear in the "Servers" tree in the left sidebar. Open this, and select the postgres database from under the Databases category. Right-click on postgres, and select the "Restore..." option.
  9. This should open the "Restore" dialog. For "Filename," click on the ellipsis on the right side of the box and find the rxivist.backup file that you downloaded in step 1.
  10. Switch to the "Restore options" tab of this dialog box. Most of these values can remain unchanged, but three values need to be changed from "No" to "Yes": "Pre-data," "Data" and "Owner."
  11. Click the blue "Restore" button.
  12. Once this process is complete, your "postgres" database should have two schemas: "public," which doesn't have data in it, and "prod," which contains the data presented on Rxivist.org.

(For those looking to create a snapshot of their own, Snapshots are created using the "Backup" dialog with the "Yes" option selected for "Pre-data," "Data," "Blobs," "Owner," "Privilege," "With OID(s)" and "Verbose messages.")

Example queries

These queries are based on ones used to generate the figures that appear in our preprint, Tracking the popularity and outcomes of all bioRxiv preprints.

Total papers: SELECT COUNT(id) FROM prod.articles

Total neuroscience papers: SELECT COUNT(id) FROM prod.articles WHERE collection='neuroscience'

Total authors: SELECT COUNT(id) FROM prod.authors

Submissions per month:

SELECT EXTRACT(YEAR FROM posted)||'-'||lpad(EXTRACT(MONTH FROM posted)::text, 2, '0') AS month,
	COUNT(id) AS submissions
FROM prod.articles
GROUP BY 1
ORDER BY 1;

Submissions per month, per category:

SELECT EXTRACT(YEAR FROM posted)||'-'||lpad(EXTRACT(MONTH FROM posted)::text, 2, '0') AS date,
	REPLACE(collection, '-', ' ') AS collection,
	COUNT(id) AS submissions
FROM prod.articles
GROUP BY 1,2
ORDER BY 1,2;

Downloads per month, per category:

SELECT article_traffic.year||'-'||lpad(article_traffic.month::text, 2, '0') AS date,
	SUM(article_traffic.pdf) AS month,
	REPLACE(articles.collection, '-', ' ') AS collection
FROM prod.article_traffic
LEFT JOIN prod.articles ON article_traffic.article=articles.id
GROUP BY 1,3
ORDER BY 1,3;

Downloads per month:

SELECT month, year, sum(pdf) AS downloads
FROM prod.article_traffic
GROUP BY year, month
ORDER BY year, month

Publication rate by month:

SELECT month, posted, published, published::decimal/posted AS rate
FROM (
  SELECT EXTRACT(YEAR FROM a.posted)||'-'||lpad(EXTRACT(MONTH FROM a.posted)::text, 2, '0') AS month,
	COUNT(a.id) AS posted,
  COUNT(p.doi) AS published
  FROM prod.articles a
  LEFT JOIN prod.article_publications p ON a.id=p.article
  GROUP BY month
  ORDER BY month
) AS counts

Interval between posting to bioRxiv and date of publication:

SELECT a.id, EXTRACT(YEAR FROM a.posted) AS year, REPLACE(a.collection, '-', ' ') AS collection,
	p.date AS published, (p.date-a.posted) AS interval
FROM prod.articles a
INNER JOIN prod.publication_dates p ON a.id=p.article
WHERE p.date > '1900-01-01' --- Dummy value used for unknown values
ORDER BY interval DESC

Downloads in first month on bioRxiv:

SELECT a.id, t.month, t.year, t.pdf AS downloads
FROM prod.articles a
LEFT JOIN prod.article_traffic t
  ON a.id=t.article
  AND t.year = (
    SELECT MIN(year)
    FROM prod.article_traffic t
    WHERE t.article = a.id
  )
  AND t.month = (
    SELECT MIN(month)
    FROM prod.article_traffic t
    WHERE a.id=t.article AND
      year = (
        SELECT MIN(year)
        FROM prod.article_traffic t
        WHERE t.article = a.id
      )
  )
ORDER BY id

Papers per author:

SELECT a.id, REPLACE(a.name, ',', ' ') AS name, COUNT(DISTINCT p.article) AS papers, COUNT(DISTINCT e.email) AS emails
FROM prod.authors a
INNER JOIN prod.article_authors p
  ON a.id=p.author
LEFT JOIN prod.author_emails e
  ON a.id=e.author
GROUP BY 1
ORDER BY 3 DESC

Preprints published within a week of posting:

SELECT COUNT(id)
FROM (
	SELECT a.id, EXTRACT(YEAR FROM a.posted) AS year, REPLACE(a.collection, '-', ' ') AS collection,
	p.date AS published, (p.date-a.posted) AS interval
	FROM prod.articles a
	INNER JOIN prod.publication_dates p ON a.id=p.article
	WHERE p.date > '1900-01-01' --- Dummy value used for unknown values
	ORDER BY interval DESC
) AS intervals
WHERE interval >=0 AND interval <= 7