Extract structural cores from protein families using Foldseek alignments.
CoreCut is a command-line tool that identifies and extracts conserved structural cores from families of protein structures. It uses Foldseek for rapid structural alignment and comparison, then identifies regions that are consistently aligned across the family to define a structural "core". The tool outputs the core regions as well as the N-terminal and C-terminal variable regions as separate PDB files.
- Fast structural comparison using Foldseek
- Automatic core identification based on alignment consistency
- Flexible thresholds for core definition
- Multi-format output: CSV results and separated PDB structures
- Easy installation via pip
- Command-line interface for batch processing
CoreCut requires Foldseek to be installed and reachable, either on your PATH or through the --foldseek-path option. Install Foldseek from:
- Foldseek GitHub
- Or via conda:
conda install -c conda-forge foldseek
# Install from PyPI (when available)
pip install corecut
# Or install from source
git clone https://github.com/corecut/corecut.git
cd corecut
pip install .
# For development
pip install -e .[dev]
# Basic usage
corecut /path/to/pdb_files/
# Custom thresholds
corecut /path/to/pdb_files/ --hit-thresh 0.9 --min-thresh 0.7
# Specify output directory
corecut /path/to/pdb_files/ --output-dir /path/to/results/
corecut input_directory [options]
- input_directory: Directory containing PDB files
- --output-dir, -o: Output directory (default: same as input)
- --hit-thresh: Proportion of structures needed for core inclusion (default: 0.8)
- --min-thresh: Minimum proportion if hit-thresh not met (default: 0.6)
- --foldseek-path: Path to foldseek executable (default: 'foldseek')
- --foldseek-results: Use existing Foldseek results file
- --core-folder: Name for core structures folder (default: 'core_structs/')
- --nter-folder: Name for N-terminal folder (default: 'nter_structs/')
- --cter-folder: Name for C-terminal folder (default: 'cter_structs/')
- --output-csv: Name of output CSV file (default: 'core_extraction_results.csv')
- --overwrite: Overwrite existing Foldseek results
- Prepare your structures: Place all PDB files in a single directory.
ls my_proteins/
# protein1.pdb protein2.pdb protein3.pdb ...
- Run CoreCut:
corecut my_proteins/ --output-dir results/
- Examine results:
ls results/
# core_extraction_results.csv   # Summary of core boundaries
# core_structs/                 # Core region PDB files
# nter_structs/                 # N-terminal region PDB files
# cter_structs/                 # C-terminal region PDB files
# foldseek_results.m8           # Raw Foldseek alignments
The output CSV (core_extraction_results.csv by default) contains core boundary information for each protein:
- min: Start position of core region
- max: End position of core region
- len: Total length of protein
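As a rough illustration of how the min, max, and len columns relate to each other, the sketch below derives the core length and the sizes of the two flanking regions for each row. It makes several assumptions that are not confirmed by this README: positions are treated as 1-based and inclusive, and the `name` column and the `core_summary` helper are hypothetical names used only for this example.

```python
import csv
import io


def core_summary(csv_text):
    """Derive per-protein region sizes from min/max/len columns.

    Assumes 1-based, inclusive positions (an assumption, not documented).
    The "name" column is a hypothetical identifier column.
    """
    summary = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        start, end, total = int(row["min"]), int(row["max"]), int(row["len"])
        core_len = end - start + 1
        summary.append(
            {
                "name": row["name"],
                "core_len": core_len,
                "nter_len": start - 1,    # residues before the core
                "cter_len": total - end,  # residues after the core
                "core_fraction": core_len / total,
            }
        )
    return summary


# Made-up example row, not real CoreCut output.
example_csv = "name,min,max,len\nprotein1,30,90,150\n"
summary = core_summary(example_csv)
```

In this made-up row the three region sizes add up to the full protein length, which is a quick sanity check worth running on real output.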
- core_structs/: PDB files containing only the conserved core regions
- nter_structs/: PDB files containing N-terminal variable regions
- cter_structs/: PDB files containing C-terminal variable regions
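To make the three-way split concrete, here is a minimal format-level sketch of dividing a structure into N-terminal, core, and C-terminal parts by residue number. It is not CoreCut's implementation (CoreCut uses BioPython for extraction); it only reads the residue sequence number from columns 23 to 26 of ATOM/HETATM records, and it assumes that the core boundaries refer to the residue numbers in the PDB file. The `split_pdb_by_range` helper is a hypothetical name.

```python
def split_pdb_by_range(pdb_lines, start, end):
    """Split ATOM/HETATM records into nter/core/cter by residue number.

    start and end are the inclusive core boundaries (assumed to match
    the residue numbering in the PDB file).
    """
    regions = {"nter": [], "core": [], "cter": []}
    for line in pdb_lines:
        if not line.startswith(("ATOM", "HETATM")):
            continue  # ignore headers, TER, END, and other records
        resseq = int(line[22:26])  # PDB columns 23-26: residue sequence number
        if resseq < start:
            regions["nter"].append(line)
        elif resseq <= end:
            regions["core"].append(line)
        else:
            regions["cter"].append(line)
    return regions


# Ten made-up CA atoms, one per residue, numbered 1 to 10.
toy_pdb = [
    f"ATOM  {i:5d}  CA  ALA A{i:4d}    {float(i):8.3f}{0.0:8.3f}{0.0:8.3f}  1.00  0.00"
    for i in range(1, 11)
]
regions = split_pdb_by_range(toy_pdb, start=4, end=7)
```

With a core of residues 4 to 7, the toy structure splits into three N-terminal, four core, and three C-terminal records.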
- Structural Comparison: Uses Foldseek to perform all-vs-all structural alignments
- Alignment Mapping: Maps alignment regions to positions for each structure
- Core Identification: Identifies positions aligned in ≥ hit_thresh proportion of structures
- Fallback Logic: If no core is found at hit_thresh, falls back to positions aligned in ≥ min_thresh proportion
- Structure Extraction: Uses BioPython to extract core and terminal regions
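The mapping, thresholding, and fallback steps above can be sketched as follows. This is one plausible reading of the description, not CoreCut's actual code: it assumes each hit is reduced to a (query, target, qstart, qend) tuple with 1-based inclusive positions, that self-hits are ignored, and that the proportion at a position is the fraction of the other structures aligned there. The function names are hypothetical.

```python
from collections import defaultdict


def coverage_by_position(hits, n_structures):
    """Map each query position to the fraction of other structures aligned there.

    hits: iterable of (query, target, qstart, qend), 1-based inclusive.
    """
    covered = defaultdict(lambda: defaultdict(set))
    for query, target, qstart, qend in hits:
        if query == target:
            continue  # skip self-hits (assumption)
        for pos in range(qstart, qend + 1):
            covered[query][pos].add(target)
    others = max(n_structures - 1, 1)
    return {
        query: {pos: len(targets) / others for pos, targets in positions.items()}
        for query, positions in covered.items()
    }


def core_bounds(fractions, hit_thresh=0.8, min_thresh=0.6):
    """Return (min, max) of positions meeting hit_thresh, else min_thresh, else None."""
    for thresh in (hit_thresh, min_thresh):
        kept = [pos for pos, frac in fractions.items() if frac >= thresh]
        if kept:
            return min(kept), max(kept)
    return None


# Made-up alignments of structure A against four other family members.
hits = [
    ("A", "B", 10, 100),
    ("A", "C", 20, 120),
    ("A", "D", 15, 90),
    ("A", "E", 30, 110),
]
fractions = coverage_by_position(hits, n_structures=5)
bounds = core_bounds(fractions["A"])  # (30, 90): only these positions are covered by all four

# No position reaches 0.8 here, so the 0.6 fallback applies.
fallback = core_bounds({1: 0.7, 2: 0.65, 3: 0.5})  # (1, 2)
```

Returning only the minimum and maximum qualifying positions mirrors the min and max columns of the output CSV; a stricter variant could require the core to be contiguous.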
- Python ≥ 3.8
- Foldseek (external dependency)
- pandas ≥ 1.3.0
- numpy ≥ 1.20.0
- biopython ≥ 1.79
- tqdm ≥ 4.60.0
We welcome contributions! Please see our contributing guidelines for details.
MIT License - see LICENSE file for details.
If you use CoreCut in your research, please cite:
CoreCut: A tool for extracting structural cores from protein families
- Issues: GitHub Issues
