PDB Explorer analyzes the Protein Data Bank through automated API queries, extracting key statistics including experimental methodologies, source and expression organisms, structural resolution, and classification data. Generates organized datasets in JSON and CSV formats.

madsondeluna/pdb_explorer

PDB Explorer

A comprehensive Python toolkit for retrieving and analyzing statistical data from the RCSB Protein Data Bank via API. Extracts information on experimental techniques, organism sources, resolution distributions, folding types, and yearly deposit trends.

Overview

PDB Explorer provides automated data extraction from the RCSB Protein Data Bank, enabling researchers to quickly gather comprehensive statistics about deposited protein structures. The tool queries the PDB API and generates structured datasets in both JSON and CSV formats for downstream analysis.

Features

  • Comprehensive Data Collection: Retrieves 11 different categories of PDB statistics
  • Temporal Analysis: Tracks deposits by year from 1970 to present
  • Experimental Methods: Analyzes distribution of techniques (X-ray, Cryo-EM, NMR, etc.)
  • Organism Information: Extracts both expression systems and source organisms
  • Structural Data: Collects resolution distributions and residue count statistics
  • Classification Systems: Retrieves folding types and structural classifications (CATH, SCOP, ECOD)
  • Protein Analysis: Identifies most commonly deposited proteins
  • Multiple Export Formats: Saves data as JSON and individual CSV files
  • Executive Summary: Generates comprehensive statistical overview

Data Retrieved

1. Total Structures

Total number of structures deposited in the PDB

2. Deposits by Year

Annual deposit counts from 1970 to present

3. Experimental Techniques

Distribution of experimental methods:

  • X-ray Diffraction
  • Electron Microscopy (Cryo-EM)
  • Solution NMR
  • Electron Crystallography
  • Neutron Diffraction
  • Solid-State NMR

4. Expression Organisms

Top organisms used for protein expression

5. Source Organisms

Organisms from which proteins originate

6. Resolution Distribution

Structure resolution ranges in Angstroms

7. Residue Count Distribution

Number of residues per structure categorized by ranges

8. Secondary Structure Types

  • Polymer entity types
  • Structural composition

9. Folding Types

Protein folding classifications and domain architectures

10. Structural Classifications

Distribution across classification systems (ECOD, CATH, SCOP2, SCOP, GlyGen)

11. Most Common Proteins

Most frequently deposited proteins in the PDB

Getting Started

Prerequisites

Python 3.7 or higher

Required Libraries

pip install requests pandas matplotlib seaborn

Cloning

Clone the repository:

git clone https://github.com/madsondeluna/pdb_explorer.git
cd pdb_explorer

Usage

Run in Jupyter Notebook

Open pdb_database_assessment.ipynb in Jupyter Notebook or Google Colab and run all cells.

Run as Python Script

# Import the entry point; this assumes the pdb_explorer module is
# importable from the current directory
from pdb_explorer import main

# Run the full analysis; main() returns a dictionary of statistics
results = main()

Output Files

The script generates the following files:

JSON Output

  • pdb_statistics_complete.json - Complete dataset with all statistics

CSV Outputs

  • deposits_by_year.csv
  • experimental_techniques.csv
  • expression_organisms.csv
  • source_organisms.csv
  • resolution_angstroms.csv
  • number_of_residues.csv
  • entity_types.csv
  • structural_composition.csv
  • folding_types.csv
  • structural_classifications.csv
  • most_common_proteins.csv
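
The files listed above can be read back for inspection once a run has finished. The following is a minimal sketch, not part of PDB Explorer: the helper name `load_outputs` is hypothetical, the output directory is assumed to be the current working directory, and each CSV is assumed to have a header row. It uses only the standard library so that it makes no assumption about column names; `pandas.read_csv` would work equally well.

```python
import csv
import json
from pathlib import Path

# File names as listed in this README; their location is an assumption.
JSON_NAME = "pdb_statistics_complete.json"
CSV_NAMES = [
    "deposits_by_year.csv",
    "experimental_techniques.csv",
    "expression_organisms.csv",
    "source_organisms.csv",
    "resolution_angstroms.csv",
    "number_of_residues.csv",
    "entity_types.csv",
    "structural_composition.csv",
    "folding_types.csv",
    "structural_classifications.csv",
    "most_common_proteins.csv",
]


def load_outputs(directory="."):
    """Hypothetical helper: load whichever output files exist in `directory`.

    Returns (stats, tables): `stats` is the parsed JSON file or None if it is
    missing; `tables` maps each CSV's base name to a list of row dictionaries.
    """
    base = Path(directory)

    stats = None
    json_path = base / JSON_NAME
    if json_path.exists():
        with json_path.open(encoding="utf-8") as handle:
            stats = json.load(handle)

    tables = {}
    for name in CSV_NAMES:
        path = base / name
        if not path.exists():
            continue  # skip files that were not generated
        with path.open(newline="", encoding="utf-8") as handle:
            tables[path.stem] = list(csv.DictReader(handle))

    return stats, tables
```

Calling `stats, tables = load_outputs(".")` after a run would give the full statistics dictionary plus one table per CSV that was found; missing files are skipped rather than treated as errors.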

Use Cases

  • Structural Biology Research: Analyze trends in protein structure determination
  • Method Development: Compare experimental technique adoption over time
  • Database Statistics: Generate comprehensive PDB statistics for publications
  • Organism Distribution: Study which organisms are most represented
  • Resolution Analysis: Understand quality distribution of deposited structures
  • Temporal Trends: Track PDB growth and method evolution
  • Educational Purposes: Teach students about structural biology databases

API Documentation

This tool uses the RCSB PDB Search API v2. For more information, see the official RCSB PDB Search API documentation.
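
This README does not show the queries PDB Explorer sends, so the following is only a sketch of how a count-by-category request to the Search API v2 might look. The endpoint URL, the `exptl.method` attribute, and the response field names (`facets`, `buckets`, `label`, `population`) reflect my reading of the RCSB documentation and should be checked against it; the function names are hypothetical. The sketch uses `urllib.request` from the standard library to stay dependency-free; `requests.post` could be used instead.

```python
import json
import urllib.request

# Assumed endpoint for the RCSB PDB Search API v2; verify in the official docs.
SEARCH_URL = "https://search.rcsb.org/rcsbsearch/v2/query"


def build_facet_request(attribute, facet_name="counts"):
    """Build a request body asking for per-value counts of one attribute.

    The query part matches experimentally determined entries; the facet asks
    the service to group the matches by `attribute` and count each group.
    """
    return {
        "query": {
            "type": "terminal",
            "service": "text",
            "parameters": {
                "attribute": "rcsb_entry_info.structure_determination_methodology",
                "operator": "exact_match",
                "value": "experimental",
            },
        },
        "return_type": "entry",
        "request_options": {
            "facets": [
                {
                    "name": facet_name,
                    "aggregation_type": "terms",
                    "attribute": attribute,
                }
            ]
        },
    }


def run_search(body, timeout=30):
    """POST the request body as JSON and return the decoded response."""
    request = urllib.request.Request(
        SEARCH_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def parse_facet(response, facet_name="counts"):
    """Turn one facet of the response into a {label: count} dictionary."""
    for facet in response.get("facets", []):
        if facet.get("name") == facet_name:
            return {b["label"]: b["population"] for b in facet.get("buckets", [])}
    return {}
```

Under these assumptions, `parse_facet(run_search(build_facet_request("exptl.method", "methods")), "methods")` would yield a dictionary shaped like the `experimental_techniques` entry described in this README.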

Data Structure

The main function returns a dictionary with the following keys:

  • total_structures: Total count (integer)
  • deposits_by_year: Dictionary with year as key and count as value
  • experimental_techniques: Dictionary with technique name and count
  • expression_organisms: Dictionary with organism name and count
  • source_organisms: Dictionary with organism name and count
  • resolution_angstroms: Dictionary with resolution range and count
  • number_of_residues: Dictionary with residue range and count
  • entity_types: Dictionary with entity type and count
  • structural_composition: Dictionary with composition type and count
  • folding_types: Dictionary with folding type and count
  • structural_classifications: Dictionary with classification system and count
  • most_common_proteins: Dictionary with protein description and count
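
Because most entries in the returned dictionary map a label to a count, a couple of small helpers are enough for quick summaries. This is a sketch under that assumption; `top_n` and `share` are hypothetical names, not functions provided by PDB Explorer, and the sample values are placeholders rather than real PDB counts.

```python
def top_n(category, n=10):
    """Return the n (label, count) pairs with the highest counts."""
    ranked = sorted(category.items(), key=lambda item: item[1], reverse=True)
    return ranked[:n]


def share(category):
    """Convert counts into fractions of the category total."""
    total = sum(category.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in category.items()}


# Placeholder data shaped like one category of the results dictionary.
example = {"method_a": 3, "method_b": 5, "method_c": 2}

print(top_n(example, n=2))   # [('method_b', 5), ('method_a', 3)]
print(share(example))        # {'method_a': 0.3, 'method_b': 0.5, 'method_c': 0.2}
```

With real output, the same calls would apply to any label-to-count entry, for example `top_n(results["source_organisms"], n=5)`.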

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • RCSB Protein Data Bank for providing the comprehensive API
  • The structural biology community for maintaining this invaluable resource

Citation

If you use this tool in your research, please cite:

PDB Explorer (2025). A Python toolkit for RCSB Protein Data Bank statistical analysis.
GitHub repository: https://github.com/madsondeluna/pdb_explorer

Version History

  • v1.0.0 (2025-10-12)
    • Initial release
    • Support for 11 different statistical categories
    • JSON and CSV export functionality
    • Executive summary generation

Known Issues

  • API rate limits may affect large-scale queries
  • Some PDB entries may have incomplete metadata
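
One common way to cope with transient failures such as rate-limit responses is to retry with exponential backoff. The sketch below is not taken from PDB Explorer; `call_with_retries` is a hypothetical helper, and the attempt count and delays are arbitrary starting values. It retries on `OSError`, which is the base class of both `urllib.error.URLError` and `requests.exceptions.RequestException`.

```python
import time


def call_with_retries(func, max_attempts=4, base_delay=1.0,
                      retry_on=(OSError,), sleep=time.sleep):
    """Call func() and retry on failure, doubling the wait each time.

    Waits base_delay, 2 * base_delay, 4 * base_delay, ... seconds between
    attempts and re-raises the last error once max_attempts is reached.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the original error
            sleep(base_delay * 2 ** attempt)
```

Any zero-argument callable can be wrapped this way, for example `call_with_retries(lambda: fetch_category("exptl.method"))`, where `fetch_category` stands in for whatever function performs the actual request.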

Future Enhancements

  • Add data visualization capabilities
  • Implement caching for faster repeated queries
  • Add support for custom date ranges
  • Include more detailed structural analysis
  • Add command-line interface

FAQ

Q: How long does the analysis take?
A: Typically 2-5 minutes, depending on your internet connection and the current PDB API load.

Q: Can I analyze specific subsets of the PDB?
A: Currently, the tool analyzes the entire PDB. Custom filtering is planned for future releases.

Q: Is an API key required?
A: No, the RCSB PDB API is freely accessible without authentication.

Q: What is the data update frequency?
A: The tool queries real-time data from the PDB, so results reflect the current state of the database.


Note: This tool queries the RCSB PDB API. Please be respectful of their resources and implement appropriate rate limiting for large-scale analyses.
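
For scripts that issue many requests in a loop, a simple client-side throttle is one way to follow the note above. This is a minimal sketch with a hypothetical `Throttle` class; the 0.5 second interval in the usage note is an arbitrary placeholder, not a limit published by RCSB.

```python
import time


class Throttle:
    """Ensure at least `min_interval` seconds pass between calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        """Block until the minimum interval since the last call has elapsed."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A typical pattern is to create `throttle = Throttle(0.5)` once and call `throttle.wait()` immediately before each request.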


Made with ❤️ and countless cups of coffee!
Coded from my bedroom (which also happens to be my office)...

