ReCiterDB

Summary
Functionality
Architecture
- Nightly Pipeline
- Directory Structure
Technical
Database Schema
- Key Tables
- Stored Procedures
Components
- Setup
- Update (Nightly Pipeline)
Operations
- Common Tasks
- Troubleshooting
Configuration
More on the ReCiter suite of applications

Summary

ReCiterDB is an open source MariaDB database and set of tools that stores publication lists and computes bibliometric statistics for an academic institution's faculty and other people of interest. ReCiterDB is designed to be populated by person and publication data from ReCiter (a machine learning-driven publication suggestion engine) and from third party sources such as NIH's iCite and Digital Science's Altmetric services. The data in the system can be viewed using the ReCiter Publication Manager web application, or it can serve as a stand alone reporting database. For more on the functionality in Publication Manager, see that repository.

This repository contains:

A MariaDB schema for ReCiterDB
Stored procedures for computing bibliometric summary tables with zero-downtime atomic table swaps
Python and shell scripts for importing data from ReCiter, NIH iCite, and Altmetric
Docker and Kubernetes configs for automated nightly deployment on AWS EKS

Functionality

In conjunction with data from ReCiter, ReCiterDB has been used to answer questions such as the following:

Senior-authored academic articles in Department of Anesthesiology
Percentage of full-time faculty publications that were indexed in PubMed with an ORCID identifier
Publications by full-time faculty added in the past week
h5 index of full-time faculty
Which active full-time faculty does any given faculty cite most often on their papers?
Which faculty publish the most frequently on cancer, overall and by proportion of their total scholarly output?
What percent of papers published by a given faculty are in collaboration with existing members of the Cancer Center?
What are the most influential cancer-related papers by members of the Cancer Center?
Finally, a variety of person-level bibliometric statistics are available through a bibliometric report that can be generated on demand (see sample)

Architecture

Nightly Pipeline

The system runs as a Kubernetes CronJob. Entry point is run_all.py:

CronJob (daily)
 └─ run_all.py
     ├─ 1. executeFeatureGenerator.py    Trigger ReCiter ML scoring API
     ├─ 2. retrieveArticles.py           Pull person/article data from S3 + DynamoDB
     ├─ 3. retrieveNIH.py                NIH iCite API → analysis_nih (atomic swap)
     ├─ 4. run_nightly_indexing.sh        Run populateAnalysisSummaryTables_v2()
     │      ├─ Polls analysis_job_log every 3s for progress
     │      ├─ Auto-retries 3x with 60s backoff
     │      └─ Auto-restores from backup on failure
     ├─ 5. abstractImport.py             PubMed abstracts from DynamoDB
     └─ 6. conflictsImport.py            COI statements from DynamoDB

Key patterns:

Atomic table swap: Data is loaded into staging tables (_new suffix), validated against production row counts, then swapped via RENAME TABLE (zero downtime). Used by both retrieveNIH.py and the main stored procedure.
Validation gate: retrieveNIH.py requires staging to have >= 80% of production's unique row count before swapping, preventing bad data from reaching production.
Set-based h-index: Uses ROW_NUMBER() window functions instead of loop-based computation.
Bulk loading: LOAD DATA LOCAL INFILE for 10-100x faster imports vs. row-by-row inserts.

Directory Structure

ReCiterDB/
├── setup/                                         # Database initialization
│   ├── setupReciterDB.py                          # Orchestrator: runs schema SQL
│   ├── createDatabaseTableReciterDb.sql           # All table DDL
│   ├── createEventsProceduresReciterDb.sql        # All stored procedures
│   ├── insertBaselineDataReciterDb.sql            # Seed data (roles, special chars, journals)
│   ├── populateAnalysisSummaryTables_v2.sql       # Main nightly SP (standalone copy)
│   ├── setup_job_logging.sql                      # Job log table DDL
│   ├── log_progress.sql                           # Progress logging SP
│   ├── cleanup_staging_tables_v2.sql              # Cleanup SP
│   └── restore_from_backup_v2.sql                 # Recovery SP
│
├── update/                                        # Daily ETL pipeline
│   ├── run_all.py                                 # EKS orchestrator (entry point)
│   ├── run_nightly_indexing.sh                    # SP runner with monitoring/retry
│   ├── retrieveArticles.py                        # S3 + DynamoDB article fetcher
│   ├── retrieveNIH.py                             # NIH iCite fetcher (atomic swap)
│   ├── retrieveAltmetric.py                       # Altmetric API fetcher
│   ├── updateReciterDB.py                         # Bulk loader (LOAD DATA LOCAL INFILE)
│   ├── dataTransformer.py                         # ReCiter JSON → CSV
│   ├── abstractImport.py                          # Abstract importer from DynamoDB
│   ├── conflictsImport.py                         # COI statement importer
│   └── executeFeatureGenerator.py                 # Triggers ReCiter scoring API
│
├── kubernetes/                                    # K8s deployment
│   ├── k8-cronjob.yaml                            # CronJob definition
│   ├── k8-configmap.yaml                          # Non-secret config
│   └── k8-secrets.yaml                            # Secret template (placeholders only)
│
├── Dockerfile                                     # Python 3.12-slim container
├── buildspec.yml                                  # CodeBuild → ECR
├── k8-buildspec.yml                               # CodeBuild → EKS image update
└── requirements.txt                               # Python dependencies

Technical

Prerequisites

MariaDB (not MySQL). The stored procedures use MariaDB-specific features (e.g., REGEXP_REPLACE with BINARY modifier).
Populated instance of ReCiter. This is where all the person and publication data live.
ReCiter Publication Manager (optional). Needed for viewing data through a web interface.

Installation

Locally

Clone this repository.
Install MariaDB and create the database:

CREATE DATABASE IF NOT EXISTS `reciterdb` DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
CREATE USER 'admin' IDENTIFIED BY '<your password>';
GRANT ALL PRIVILEGES ON *.* TO 'admin';

Set environment variables (see Environment Variables below).
Install dependencies and run setup:

pip install -r requirements.txt
cd setup && python3 setupReciterDB.py

Run the daily update:

python3 update/run_all.py

On AWS (EKS)

The project is packaged as a Docker container and deployed as a Kubernetes CronJob. See kubernetes/ for deployment configs and Dockerfile for the container definition. For full infrastructure setup, see ReCiter-CDK.

Credentials

Credentials must never be committed to this repository. All secrets are managed through:

Kubernetes Secrets — kubernetes/k8-secrets.yaml is a template with placeholders. Actual secrets are stored in the EKS cluster.
Environment variables — All scripts read credentials from env vars at runtime.
AWS Secrets Manager — Docker Hub credentials for builds are fetched from Secrets Manager during CodeBuild.

For local development, export env vars in your terminal session or use a MySQL options file (~/.my.cnf with chmod 600).

Environment Variables

Variable	Purpose	Required
`DB_HOST`	MariaDB hostname	Yes
`DB_USERNAME`	Database user	Yes
`DB_PASSWORD`	Database password	Yes
`DB_NAME`	Database name (e.g., `reciterdb`)	Yes
`AWS_ACCESS_KEY_ID`	For S3/DynamoDB access	Yes
`AWS_SECRET_ACCESS_KEY`	For S3/DynamoDB access	Yes
`AWS_DEFAULT_REGION`	AWS region (e.g., `us-east-1`)	Yes
`URL`	ReCiter feature generator endpoint	For feature generation
`API_KEY`	ReCiter API auth token	For feature generation
`LOG_FILE`	Log file path	No
`S3_BUCKET`	S3 bucket for log archival	No
`S3_KEY_PREFIX`	S3 key prefix for logs	No
`SCRIPT_TIMEOUT_SECONDS`	Pipeline timeout (default: 15000s)	No

Database Schema

Key Tables

Summary tables (populated by the nightly stored procedure, read by Publication Manager):

Table	Purpose
`analysis_summary_article`	One row per pmid: title, journal, citations, NIH percentile, RCR, impact scores
`analysis_summary_author`	One row per person+pmid: author list string, author position (first/last)
`analysis_summary_author_list`	One row per author+pmid: individual author name and rank
`analysis_summary_person`	One row per person: h-index (NIH + Scopus), article counts, percentile rankings

Data source tables (populated by ETL scripts):

Table	Purpose
`person_article`	ReCiter scoring output (~80 columns of evidence per person+pmid pair)
`person_article_author`	Author list with rank, targetAuthor flag, ORCID, equalContrib
`analysis_nih`	NIH iCite metrics: RCR, percentile, citation count, field citation rate. UNIQUE KEY on pmid.
`analysis_nih_cites` / `analysis_nih_cites_clin`	Citation networks (multiple rows per pmid)
`analysis_altmetric`	Altmetric social metrics

Admin tables (managed by Publication Manager):

Table	Purpose
`admin_users`, `admin_roles`, `admin_users_roles`	Role-based access control
`admin_feedback_log`	User curation actions
`admin_settings`	UI configuration (JSON)
`analysis_summary_person_scope`	Defines who gets bibliometric analysis
`analysis_override_author_position`	Manual author position corrections

Reference data (loaded once at setup):

Table	Purpose
`journal_impact_scimago`, `journal_impact_alternative`	Journal rankings
`journal_nlm`	NLM catalog
`analysis_special_characters`	294 Unicode-to-RTF character mappings

See setup/createDatabaseTableReciterDb.sql for the complete schema.

Stored Procedures

All defined in setup/createEventsProceduresReciterDb.sql.

Nightly indexing:

Procedure	Purpose
`populateAnalysisSummaryTables_v2()`	Main job: rebuild all `analysis_summary_*` tables with atomic swap
`log_progress()`	Write progress entries to `analysis_job_log`
`cleanup_staging_tables_v2()`	Drop `*_new` staging tables on failure
`restore_from_backup_v2()`	Rename `*_backup` tables back to production
`check_analysis_summary_status()`	Show row counts and backup table status
`view_job_progress(job_id)`	Show all log entries for a job

Report generation:

Procedure	Purpose
`generateBibliometricReport(personID)`	Full bibliometric report for one person
`generatePubsRTF()`	RTF publication list
`generateEmailNotifications()`	New publication email alerts

Admin:

Procedure	Purpose
`populateAnalysisSummaryPersonScopeTable()`	Populate the scope table (who gets bibliometrics)
`updateCurateSelfRole()`	Update Curator_Self role permissions

Components

Setup

File	Frequency	Purpose
`setupReciterDB.py`	At initial setup	Runs schema SQL to create tables
`createDatabaseTableReciterDb.sql`	At initial setup	Creates all database tables
`createEventsProceduresReciterDb.sql`	At initial setup	Creates all stored procedures
`insertBaselineDataReciterDb.sql`	At initial setup	Seed data: roles, special characters, journal rankings, NLM catalog
`populateAnalysisSummaryTables_v2.sql`	Reference	Standalone copy of the main nightly stored procedure

Update (Nightly Pipeline)

File	Purpose
`run_all.py`	EKS orchestrator: runs all pipeline steps in sequence with timeout enforcement, memory logging, and S3 log upload
`retrieveArticles.py`	Fetches person and article data from S3 and DynamoDB in batches
`retrieveNIH.py`	Fetches NIH iCite metrics in batches of 150; loads to staging table with validation and atomic swap
`retrieveAltmetric.py`	Fetches Altmetric scores for articles published in the last 2 years
`updateReciterDB.py`	Bulk data loader using `LOAD DATA LOCAL INFILE` with retry and reconnect logic
`dataTransformer.py`	Transforms ReCiter JSON output to CSV format for all `person_*` tables
`run_nightly_indexing.sh`	Calls `populateAnalysisSummaryTables_v2()` with progress monitoring, auto-retry, and auto-restore
`abstractImport.py`	Imports PubMed abstracts from DynamoDB (parallel batch fetches)
`conflictsImport.py`	Imports conflict-of-interest statements from DynamoDB
`executeFeatureGenerator.py`	Triggers ReCiter feature generator API with rate limiting and metrics

Operations

Common Tasks

Run the stored procedure manually (rebuilds all summary tables, ~20 minutes):

CALL populateAnalysisSummaryTables_v2();

Monitor progress:

SELECT step, substep, status, rows_affected, created_at
FROM analysis_job_log
WHERE job_id = (SELECT MAX(job_id) FROM analysis_job_log)
ORDER BY id DESC LIMIT 10;

Check table health:

-- Row counts
SELECT 'analysis_summary_article' t, COUNT(*) n FROM analysis_summary_article
UNION ALL SELECT 'analysis_summary_person', COUNT(*) FROM analysis_summary_person
UNION ALL SELECT 'analysis_nih', COUNT(*) FROM analysis_nih;

-- Check for duplicates in analysis_nih (total should equal uniq)
SELECT COUNT(*) AS total, COUNT(DISTINCT pmid) AS uniq FROM analysis_nih;

Restore from backup (if the SP fails mid-swap):

CALL restore_from_backup_v2();

View CronJob status:

kubectl -n reciter get cronjob reciterdb
kubectl -n reciter get jobs --sort-by=.metadata.creationTimestamp | tail -5
kubectl -n reciter logs job/<job-name> 2>&1 | tail -50

Troubleshooting

"Bibliometric Analysis" page is blank in Publication Manager: The UI reads from analysis_summary_article. If NIH columns are NULL, the stored procedure needs to be re-run after analysis_nih is refreshed:

CALL populateAnalysisSummaryTables_v2();

Validation rejects NIH table swap: retrieveNIH.py compares staging row count against production. If production has duplicate pmids, validation uses COUNT(DISTINCT pmid) to detect this automatically. Check with:

SELECT COUNT(*) AS total, COUNT(DISTINCT pmid) AS uniq FROM analysis_nih;

Stored procedure appears stuck: The SP acquires a named lock. If a prior run crashed without releasing it:

SELECT IS_FREE_LOCK('populateAnalysisSummaryTables_v2_lock');
-- 0 = locked, 1 = free
SELECT RELEASE_LOCK('populateAnalysisSummaryTables_v2_lock');

Configuration

Define scope of bibliometrics. As an administrator, you have control over the people for whom the system calculates person-level bibliometrics. This allows for download of a person's bibliometric analysis complete with comparisons to institutional peers. To do this, update the populateAnalysisSummaryPersonScopeTable stored procedure which populates the analysis_summary_person_scope table. Here at Weill Cornell Medicine, we consider only full-time employed faculty (i.e., person_person_type.personType = academic-faculty-weillfulltime).
Importing additional journal-level metrics (optional). ReCiterDB ships with journal impact data from Scimago Journal Rank. If you have another journal level impact metric, which uses ISSN as a primary key, it can be imported into the journal_impact_alternative table.

More on the ReCiter suite of applications

As the figure describes, the ReCiter suite of applications can fully manage many key steps in institutional publication management.

Repository	Required?	Purpose
ReCiter	Yes	Core ML scoring engine: stores identity info, retrieves articles from PubMed/Scopus, estimates authorship likelihood, shares data through web services
ReCiter PubMed Retrieval Tool	Yes	Retrieve and normalize publication data from PubMed
ReCiter Scopus Retrieval Tool	No	Retrieve and normalize publication data from Scopus
ReCiter Publication Manager	No	Web UI for curating publications and generating bibliometric reports
ReCiterDB	For Publication Manager	Scripts for retrieving data from ReCiter and a relational database for publication and bibliometric data
ReCiter-CDK	No	AWS CDK infrastructure-as-code for deploying the entire ReCiter ecosystem

Name		Name	Last commit message	Last commit date
Latest commit History 341 Commits
files		files
kubernetes		kubernetes
setup		setup
update		update
.gitignore		.gitignore
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md
buildspec.yml		buildspec.yml
init.py		init.py
k8-buildspec.yml		k8-buildspec.yml
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ReCiterDB

Summary

Functionality

Architecture

Nightly Pipeline

Directory Structure

Technical

Prerequisites

Installation

Locally

On AWS (EKS)

Credentials

Environment Variables

Database Schema

Key Tables

Stored Procedures

Components

Setup

Update (Nightly Pipeline)

Operations

Common Tasks

Troubleshooting

Configuration

More on the ReCiter suite of applications

About

Uh oh!

Releases 1

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

ReCiterDB

Summary

Functionality

Architecture

Nightly Pipeline

Directory Structure

Technical

Prerequisites

Installation

Locally

On AWS (EKS)

Credentials

Environment Variables

Database Schema

Key Tables

Stored Procedures

Components

Setup

Update (Nightly Pipeline)

Operations

Common Tasks

Troubleshooting

Configuration

More on the ReCiter suite of applications

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages