command line tool to extract consensus regions from sets of homologous structures

DessimozLab/corecut

 
 

CoreCut

Extract structural cores from protein families using Foldseek alignments.

Overview

CoreCut is a command-line tool that identifies and extracts conserved structural cores from families of protein structures. It uses Foldseek for rapid structural alignment and comparison, then identifies regions that are consistently aligned across the family to define a structural "core". The tool outputs the core regions as well as the N-terminal and C-terminal variable regions as separate PDB files.

Features

  • Fast structural comparison using Foldseek
  • Automatic core identification based on alignment consistency
  • Flexible thresholds for core definition
  • Multi-format output: CSV results and separated PDB structures
  • Easy installation via pip
  • Command-line interface for batch processing

Installation

Prerequisites

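CoreCut requires Foldseek to be installed and accessible, either on your PATH or through the --foldseek-path option. See the Foldseek project documentation for installation instructions.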

Install CoreCut

# Install from PyPI (when available)
pip install corecut

# Or install from source
git clone https://github.com/DessimozLab/corecut.git
cd corecut
pip install .

# For development (quoted so that shells such as zsh do not expand the brackets)
pip install -e ".[dev]"

Quick Start

# Basic usage
corecut /path/to/pdb_files/

# Custom thresholds
corecut /path/to/pdb_files/ --hit-thresh 0.9 --min-thresh 0.7

# Specify output directory
corecut /path/to/pdb_files/ --output-dir /path/to/results/

Usage

Command Line Interface

corecut input_directory [options]

Options

  • input_directory: Directory containing PDB files
  • --output-dir, -o: Output directory (default: same as input)
  • --hit-thresh: Proportion of structures in which a position must be aligned to be included in the core (default: 0.8)
  • --min-thresh: Fallback proportion, used when no position meets --hit-thresh (default: 0.6)
  • --foldseek-path: Path to foldseek executable (default: 'foldseek')
  • --foldseek-results: Use existing Foldseek results file
  • --core-folder: Name for core structures folder (default: 'core_structs/')
  • --nter-folder: Name for N-terminal folder (default: 'nter_structs/')
  • --cter-folder: Name for C-terminal folder (default: 'cter_structs/')
  • --output-csv: Name of output CSV file (default: 'core_extraction_results.csv')
  • --overwrite: Overwrite existing Foldseek results
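
For batch processing from Python, the options above can be assembled into an argument list and passed to subprocess. This is a minimal sketch, not part of CoreCut: the helper names build_corecut_command and run_corecut are hypothetical, and it assumes --overwrite is a plain switch that takes no value.

```python
import shlex
import subprocess
from typing import List, Optional


def build_corecut_command(
    input_dir: str,
    output_dir: Optional[str] = None,
    hit_thresh: float = 0.8,
    min_thresh: float = 0.6,
    foldseek_path: str = "foldseek",
    overwrite: bool = False,
) -> List[str]:
    """Assemble a corecut argument list from the documented options."""
    cmd = [
        "corecut",
        input_dir,
        "--hit-thresh", str(hit_thresh),
        "--min-thresh", str(min_thresh),
        "--foldseek-path", foldseek_path,
    ]
    if output_dir is not None:
        cmd += ["--output-dir", output_dir]
    if overwrite:
        # Assumption: --overwrite is a boolean switch without an argument.
        cmd.append("--overwrite")
    return cmd


def run_corecut(cmd: List[str]) -> None:
    """Run the command; raises CalledProcessError on a non-zero exit status."""
    subprocess.run(cmd, check=True)


if __name__ == "__main__":
    command = build_corecut_command(
        "my_proteins/", output_dir="results/", hit_thresh=0.9, min_thresh=0.7
    )
    # Print the shell-quoted command instead of running it.
    print(shlex.join(command))
```

Keeping command construction separate from execution makes it easy to log or inspect the exact call before launching a long run.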

Example Workflow

  1. Prepare your structures: Place all PDB files in a single directory

    ls my_proteins/
    # protein1.pdb  protein2.pdb  protein3.pdb  ...
  2. Run CoreCut:

    corecut my_proteins/ --output-dir results/
  3. Examine results:

    ls results/
    # core_extraction_results.csv  # Summary of core boundaries
    # core_structs/                # Core region PDB files
    # nter_structs/                # N-terminal region PDB files  
    # cter_structs/                # C-terminal region PDB files
    # foldseek_results.m8          # Raw Foldseek alignments
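
The raw alignment file can be inspected with a few lines of Python. The sketch below assumes foldseek_results.m8 uses Foldseek's default tab-separated columns (query, target, fident, alnlen, mismatch, gapopen, qstart, qend, tstart, tend, evalue, bits); CoreCut may request a different layout, in which case M8_COLUMNS needs adjusting. The function name read_m8 and the sample record are illustrative only.

```python
import csv
import io
from typing import Dict, Iterator, TextIO

# Assumption: Foldseek's default tabular output columns.
M8_COLUMNS = [
    "query", "target", "fident", "alnlen", "mismatch", "gapopen",
    "qstart", "qend", "tstart", "tend", "evalue", "bits",
]
INT_COLUMNS = {"alnlen", "mismatch", "gapopen", "qstart", "qend", "tstart", "tend"}


def read_m8(handle: TextIO) -> Iterator[Dict[str, object]]:
    """Yield one dict per alignment, with coordinate columns converted to int."""
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < len(M8_COLUMNS):
            continue  # skip blank or truncated lines
        record: Dict[str, object] = dict(zip(M8_COLUMNS, row))
        for name in INT_COLUMNS:
            record[name] = int(row[M8_COLUMNS.index(name)])
        yield record


if __name__ == "__main__":
    # Made-up record for illustration; real files come from the corecut run.
    sample = "protein1\tprotein2\t0.412\t150\t80\t2\t12\t160\t8\t155\t1.2e-10\t310\n"
    for hit in read_m8(io.StringIO(sample)):
        print(hit["query"], hit["target"], hit["qstart"], hit["qend"])
```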

Output Files

CSV Results File

Contains core boundary information for each protein:

  • min: Start position of core region
  • max: End position of core region
  • len: Total length of protein
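
As an illustration of how these columns might be used, the sketch below computes the fraction of each protein that falls inside its core. It makes two assumptions that the column list above does not confirm: that the file carries an identifier column (here given the hypothetical name structure) and that min and max are inclusive positions.

```python
import csv
import io
from typing import Dict, TextIO


def core_fractions(handle: TextIO, id_column: str = "structure") -> Dict[str, float]:
    """Map each structure identifier to core length divided by total length."""
    fractions: Dict[str, float] = {}
    for row in csv.DictReader(handle):
        start, end, length = int(row["min"]), int(row["max"]), int(row["len"])
        core_len = end - start + 1  # assumption: boundaries are inclusive
        fractions[row[id_column]] = core_len / length
    return fractions


if __name__ == "__main__":
    # Made-up rows for illustration; the identifier column name is hypothetical.
    sample = "structure,min,max,len\nprotein1,21,120,200\nprotein2,5,104,125\n"
    for name, fraction in core_fractions(io.StringIO(sample)).items():
        print(f"{name}: {fraction:.2f} of residues in core")
```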

Structure Directories

  • core_structs/: PDB files containing only the conserved core regions
  • nter_structs/: PDB files containing N-terminal variable regions
  • cter_structs/: PDB files containing C-terminal variable regions
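
The relationship between the three directories and the CSV boundaries can be sketched as a simple partition of residue positions. This is an illustrative helper, not CoreCut code: split_regions is a hypothetical name, and it assumes 1-based, inclusive boundaries with contiguous numbering.

```python
from typing import Dict, Optional, Tuple

Range = Optional[Tuple[int, int]]


def split_regions(core_start: int, core_end: int, length: int) -> Dict[str, Range]:
    """Partition positions 1..length into N-terminal, core, and C-terminal ranges.

    A terminal region is None when the core reaches that end of the protein.
    """
    if not 1 <= core_start <= core_end <= length:
        raise ValueError("expected 1 <= core_start <= core_end <= length")
    nter = (1, core_start - 1) if core_start > 1 else None
    cter = (core_end + 1, length) if core_end < length else None
    return {"nter": nter, "core": (core_start, core_end), "cter": cter}


if __name__ == "__main__":
    print(split_regions(21, 120, 200))
```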

Algorithm

  1. Structural Comparison: Uses Foldseek to perform all-vs-all structural alignments
  2. Alignment Mapping: Maps alignment regions to positions for each structure
  3. Core Identification: Identifies positions aligned in ≥ hit_thresh proportion of structures
  4. Fallback Logic: If no core found, uses positions aligned in ≥ min_thresh proportion
  5. Structure Extraction: Uses BioPython to extract core and terminal regions
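
Steps 2 to 4 can be illustrated with a small pure-Python sketch for a single structure. It is a simplification under stated assumptions, not CoreCut's implementation: the names position_coverage and find_core are hypothetical, each alignment is reduced to a (target, qstart, qend) tuple with 1-based inclusive coordinates, gaps inside an alignment are ignored, and the core is reported as the span from the first to the last retained position.

```python
from typing import Iterable, List, Optional, Set, Tuple

Alignment = Tuple[str, int, int]  # (target name, qstart, qend), 1-based inclusive


def position_coverage(length: int, alignments: Iterable[Alignment]) -> List[int]:
    """Count, for each query position, how many distinct targets align to it."""
    covering: List[Set[str]] = [set() for _ in range(length)]
    for target, qstart, qend in alignments:
        for pos in range(max(qstart, 1), min(qend, length) + 1):
            covering[pos - 1].add(target)
    return [len(targets) for targets in covering]


def find_core(
    length: int,
    alignments: Iterable[Alignment],
    n_targets: int,
    hit_thresh: float = 0.8,
    min_thresh: float = 0.6,
) -> Optional[Tuple[int, int]]:
    """Return (start, end) of the core, trying hit_thresh first, then min_thresh."""
    if n_targets <= 0:
        raise ValueError("n_targets must be positive")
    counts = position_coverage(length, alignments)
    for thresh in (hit_thresh, min_thresh):
        kept = [i + 1 for i, c in enumerate(counts) if c / n_targets >= thresh]
        if kept:
            return kept[0], kept[-1]
    return None  # no position reaches even the fallback threshold


if __name__ == "__main__":
    # Made-up alignments of one 10-residue query against five other structures.
    hits = [("b", 1, 8), ("c", 3, 9), ("d", 3, 8), ("e", 4, 10), ("f", 3, 7)]
    print(find_core(10, hits, n_targets=5))  # → (3, 8)
```

In the example, positions 3 to 8 are each aligned to at least four of the five targets, so they pass the default 0.8 threshold without needing the fallback.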

Dependencies

  • Python ≥ 3.8
  • Foldseek (external dependency)
  • pandas ≥ 1.3.0
  • numpy ≥ 1.20.0
  • biopython ≥ 1.79
  • tqdm ≥ 4.60.0

Contributing

We welcome contributions! Please see our contributing guidelines for details.

License

MIT License - see LICENSE file for details.

Citation

If you use CoreCut in your research, please cite:

CoreCut: A tool for extracting structural cores from protein families
