A comprehensive Python toolkit for retrieving and analyzing statistical data from the RCSB Protein Data Bank via API. Extracts information on experimental techniques, organism sources, resolution distributions, folding types, and yearly deposit trends.
PDB Explorer provides automated data extraction from the RCSB Protein Data Bank, enabling researchers to quickly gather comprehensive statistics about deposited protein structures. The tool queries the PDB API and generates structured datasets in both JSON and CSV formats for downstream analysis.
- Comprehensive Data Collection: Retrieves 11 different categories of PDB statistics
- Temporal Analysis: Tracks deposits by year from 1970 to present
- Experimental Methods: Analyzes distribution of techniques (X-ray, Cryo-EM, NMR, etc.)
- Organism Information: Extracts both expression systems and source organisms
- Structural Data: Collects resolution distributions and residue count statistics
- Classification Systems: Retrieves folding types and structural classifications (CATH, SCOP, ECOD)
- Protein Analysis: Identifies most commonly deposited proteins
- Multiple Export Formats: Saves data as JSON and individual CSV files
- Executive Summary: Generates comprehensive statistical overview
Total number of structures deposited in the PDB
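As a rough illustration of how a total count could be requested, the sketch below builds a count-only request body for the RCSB PDB Search API v2. It is not taken from this project's source: the function names `build_count_query` and `fetch_total_structures` are hypothetical, and the endpoint URL and field names follow my reading of RCSB's public documentation, so verify them there before relying on them.

```python
import json

# Public Search API v2 endpoint as listed in RCSB's documentation (assumption:
# this project may build its requests differently).
SEARCH_URL = "https://search.rcsb.org/rcsbsearch/v2/query"


def build_count_query() -> dict:
    """Build a request body that asks only for the number of matching entries."""
    return {
        # A terminal text query without parameters is documented to match every entry.
        "query": {"type": "terminal", "service": "text"},
        "return_type": "entry",
        # return_counts asks for the total only, without the list of identifiers.
        "request_options": {"return_counts": True},
    }


def fetch_total_structures(timeout: float = 30.0) -> int:
    """POST the query and read the total from the response (needs network access)."""
    import requests  # imported lazily so the builder above works without it

    response = requests.post(SEARCH_URL, json=build_count_query(), timeout=timeout)
    response.raise_for_status()
    return int(response.json()["total_count"])


if __name__ == "__main__":
    # Print the request body only; no network call is made here.
    print(json.dumps(build_count_query(), indent=2))
```

Keeping the payload builder separate from the network call makes the request easy to inspect or log before anything is sent.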
Annual deposit counts from 1970 to present
Distribution of experimental methods:
- X-ray Diffraction
- Electron Microscopy (Cryo-EM)
- Solution NMR
- Electron Crystallography
- Neutron Diffraction
- Solid-State NMR
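Once technique counts are available, turning them into percentage shares is a small step. The sketch below assumes a plain dictionary of technique name to count; `technique_shares` is a hypothetical helper, not part of this project, and the numbers in the example are made up for illustration.

```python
def technique_shares(counts: dict) -> dict:
    """Return each technique's share of the total in percent, largest first."""
    total = sum(counts.values())
    if total == 0:
        # Avoid dividing by zero when no counts are available.
        return {name: 0.0 for name in counts}
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return {name: round(100.0 * count / total, 2) for name, count in ordered}


if __name__ == "__main__":
    # Illustrative counts only, not real PDB numbers.
    example = {"Solution NMR": 5, "X-ray Diffraction": 80, "Electron Microscopy": 15}
    for name, share in technique_shares(example).items():
        print(f"{name}: {share}%")
```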
Top organisms used for protein expression
Organisms from which proteins originate
Structure resolution ranges in Angstroms
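To make the idea of resolution ranges concrete, here is a minimal sketch that assigns individual resolution values to labelled ranges. The bin edges and the names `EDGES`, `range_label`, and `resolution_histogram` are assumptions chosen for illustration; the tool itself may obtain ready-made ranges from the API rather than binning values locally.

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical bin edges in Angstroms; the ranges used by the tool may differ.
EDGES = [1.0, 1.5, 2.0, 2.5, 3.0]


def range_label(value: float, edges=EDGES) -> str:
    """Return the label of the half-open range [low, high) containing value."""
    index = bisect_right(edges, value)
    if index == 0:
        return f"< {edges[0]}"
    if index == len(edges):
        return f">= {edges[-1]}"
    return f"{edges[index - 1]} - {edges[index]}"


def resolution_histogram(values, edges=EDGES) -> dict:
    """Count how many values fall into each labelled range."""
    return dict(Counter(range_label(value, edges) for value in values))


if __name__ == "__main__":
    # Illustrative values only.
    print(resolution_histogram([0.9, 2.2, 2.4, 3.5]))
```

The same approach would work for residue counts by swapping in integer edges.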
Number of residues per structure categorized by ranges
- Polymer entity types
- Structural composition
Protein folding classifications and domain architectures
Distribution across classification systems (ECOD, CATH, SCOP2, SCOP, GlyGen)
Most frequently deposited proteins in the PDB
Python 3.7 or higher
pip install requests pandas matplotlib seaborn
Clone the repository:
git clone https://github.com/yourusername/pdb_explorer.git
cd pdb_explorer
Open pdb_database_assessment.ipynb in Jupyter Notebook or Google Colab and run all cells.
import requests
import json
import pandas as pd
from datetime import datetime
# Import and run the main function
from pdb_explorer import main
# Execute the analysis
results = main()
The script generates the following files:
- pdb_statistics_complete.json: Complete dataset with all statistics
- deposits_by_year.csv
- experimental_techniques.csv
- expression_organisms.csv
- source_organisms.csv
- resolution_angstroms.csv
- number_of_residues.csv
- entity_types.csv
- structural_composition.csv
- folding_types.csv
- structural_classifications.csv
- most_common_proteins.csv
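The export step can be pictured roughly as follows. This is a sketch using only the standard library, not the project's actual implementation (which lists pandas as a dependency and may use it for CSV output). The function name `export_results` and the CSV column names `category` and `count` are assumptions.

```python
import csv
import json
from pathlib import Path


def export_results(results: dict, out_dir: str) -> list:
    """Write one JSON file with everything and one CSV per dictionary-valued key."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []

    # Complete dataset in a single JSON file.
    json_path = out / "pdb_statistics_complete.json"
    json_path.write_text(json.dumps(results, indent=2), encoding="utf-8")
    written.append(json_path.name)

    for key, value in results.items():
        if not isinstance(value, dict):
            continue  # scalars such as total_structures stay in the JSON file only
        csv_path = out / f"{key}.csv"
        with csv_path.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(["category", "count"])  # assumed column names
            writer.writerows(value.items())
        written.append(csv_path.name)
    return written
```

Calling `export_results(results, "output")` with the dictionary returned by `main()` would then produce the JSON file plus one CSV per category.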
- Structural Biology Research: Analyze trends in protein structure determination
- Method Development: Compare experimental technique adoption over time
- Database Statistics: Generate comprehensive PDB statistics for publications
- Organism Distribution: Study which organisms are most represented
- Resolution Analysis: Understand quality distribution of deposited structures
- Temporal Trends: Track PDB growth and method evolution
- Educational Purposes: Teach students about structural biology databases
This tool uses the RCSB PDB Search API v2. For more information:
The main function returns a dictionary with the following keys:
- total_structures: Total count (integer)
- deposits_by_year: Dictionary with year as key and count as value
- experimental_techniques: Dictionary with technique name and count
- expression_organisms: Dictionary with organism name and count
- source_organisms: Dictionary with organism name and count
- resolution_angstroms: Dictionary with resolution range and count
- number_of_residues: Dictionary with residue range and count
- entity_types: Dictionary with entity type and count
- structural_composition: Dictionary with composition type and count
- folding_types: Dictionary with folding type and count
- structural_classifications: Dictionary with classification system and count
- most_common_proteins: Dictionary with protein description and count
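Given a dictionary shaped like the one described above, a couple of small helpers cover common follow-up questions such as "which organisms are most represented?" or "how has the archive grown?". The helpers `top_n` and `cumulative_by_year` are hypothetical and not part of this project, and the sample data is a placeholder rather than real PDB output.

```python
def top_n(counts: dict, n: int = 5) -> list:
    """Return the n (name, count) pairs with the highest counts."""
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)[:n]


def cumulative_by_year(deposits_by_year: dict) -> dict:
    """Turn annual deposit counts into a running total, ordered by year."""
    running = 0
    cumulative = {}
    # key=int lets this work whether years are stored as strings or integers.
    for year in sorted(deposits_by_year, key=int):
        running += deposits_by_year[year]
        cumulative[year] = running
    return cumulative


if __name__ == "__main__":
    # Placeholder data in the documented shape; not real PDB numbers.
    results = {
        "deposits_by_year": {"2021": 30, "2019": 10, "2020": 20},
        "source_organisms": {"Organism A": 7, "Organism B": 12, "Organism C": 3},
    }
    print(top_n(results["source_organisms"], n=2))
    print(cumulative_by_year(results["deposits_by_year"]))
```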
This project is licensed under the MIT License - see the LICENSE file for details.
- RCSB Protein Data Bank for providing the comprehensive API
- The structural biology community for maintaining this invaluable resource
If you use this tool in your research, please cite:
PDB Explorer (2025). A Python toolkit for RCSB Protein Data Bank statistical analysis.
GitHub repository: https://github.com/madsondeluna/pdb_explorer
- v1.0.0 (2025-10-12)
  - Initial release
  - Support for 11 different statistical categories
  - JSON and CSV export functionality
  - Executive summary generation
- API rate limits may affect large-scale queries
- Some PDB entries may have incomplete metadata
- Add data visualization capabilities
- Implement caching for faster repeated queries
- Add support for custom date ranges
- Include more detailed structural analysis
- Add command-line interface
Q: How long does the analysis take?
A: Typically 2-5 minutes depending on your internet connection and the current PDB API load.
Q: Can I analyze specific subsets of the PDB?
A: Currently, the tool analyzes the entire PDB. Custom filtering is planned for future releases.
Q: Is an API key required?
A: No, the RCSB PDB API is freely accessible without authentication.
Q: What is the data update frequency?
A: The tool queries real-time data from the PDB, so results reflect the current state of the database.
Note: This tool queries the RCSB PDB API. Please be respectful of their resources and implement appropriate rate limiting for large-scale analyses.
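One simple way to follow the advice above is to enforce a minimum pause between consecutive requests. The sketch below is a generic throttle, not something this project ships: the class name `Throttle` and the half-second interval in the example are assumptions, and RCSB does not prescribe this particular value here.

```python
import time


class Throttle:
    """Ensure at least min_interval seconds pass between consecutive calls to wait()."""

    def __init__(self, min_interval: float, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock  # injectable so the class can be tested without real delays
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first one

    def wait(self) -> None:
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                # Called too soon after the previous request: pause for the difference.
                self._sleep(remaining)
                now = self._clock()
        self._last = now


if __name__ == "__main__":
    throttle = Throttle(min_interval=0.5)  # assumed interval, adjust as needed
    for category in ["deposits_by_year", "experimental_techniques"]:
        throttle.wait()  # call this right before each API request
        print(f"would request {category} now")
```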
Made with ❤️ and countless cups of coffee!
Coded from my bedroom (which also happens to be my office)...