A Python-based pipeline for analyzing chromatin loops from BEDPE files, identifying CTCF binding motifs at loop anchors, and classifying loops by their orientation patterns.
This pipeline processes chromatin loop data to:
- Extract and analyze CTCF binding motifs at loop anchor regions using FIMO motif scanning
- Classify loops into 7 categories based on CTCF anchor presence and orientation patterns
- Analyze loop length distributions with statistical significance testing across categories
- Generate publication-quality visualizations comparing loop metrics across classifications
# Create environment
mamba env create -f environment.yml
mamba activate loops-qc
# Execute pipeline
python src/main.py configs/my_analysis.yaml
Loops are classified into 7 categories based on CTCF anchor presence and orientation:
| Category | Description | CTCF Orientation |
|---|---|---|
| Zero anchors | No CTCF binding motif at either anchor | - |
| One anchor | CTCF motif at only one anchor | - |
| All two-anchor | All loops with CTCF motifs at both anchors, regardless of orientation | - |
| Convergent | CTCF motifs facing each other | F-R |
| Tandem forward | Both CTCF motifs forward | F-F |
| Tandem reverse | Both CTCF motifs reverse | R-R |
| Divergent | CTCF motifs facing away from each other | R-F |
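
The mapping in the table can be sketched as a small lookup. This is a minimal illustration, not the pipeline's own code: the function name `classify_loop`, the use of `"+"`/`"-"` strands for forward/reverse, and the use of `None` for an anchor without a motif are all assumptions made for the example.

```python
from typing import Optional

# Orientation pairs from the table: (left anchor, right anchor).
# "+" is treated as forward (F) and "-" as reverse (R); this strand
# encoding is an assumption, not taken from the pipeline source.
ORIENTATION_CATEGORIES = {
    ("+", "-"): "Convergent",      # F-R
    ("+", "+"): "Tandem forward",  # F-F
    ("-", "-"): "Tandem reverse",  # R-R
    ("-", "+"): "Divergent",       # R-F
}


def classify_loop(left: Optional[str], right: Optional[str]) -> list[str]:
    """Return every category a loop falls into.

    `left` and `right` are the CTCF motif strands at the upstream and
    downstream anchors, or None when no motif was found at that anchor.
    """
    n_anchors = sum(strand is not None for strand in (left, right))
    if n_anchors == 0:
        return ["Zero anchors"]
    if n_anchors == 1:
        return ["One anchor"]
    # Loops with CTCF at both anchors belong to the pooled category
    # and to exactly one orientation-specific category.
    return ["All two-anchor", ORIENTATION_CATEGORIES[(left, right)]]
```

Returning a list reflects that "All two-anchor" is a superset of the four orientation categories, so a two-anchor loop is counted in both.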
- Loop length distributions compared across all categories
- Significance testing (t-tests) with markers: `*` = p < 0.05, `**` = p < 0.01, `***` = p < 0.001
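
The marker thresholds above amount to a simple p-value lookup. The sketch below is illustrative only: `significance_marker` is a hypothetical helper name, and returning an empty string for non-significant results is an assumption, since no marker is listed for p ≥ 0.05. The p-value itself would come from whatever t-test routine the pipeline uses (for example `scipy.stats.ttest_ind`, which is also an assumption here).

```python
def significance_marker(p_value: float) -> str:
    """Map a p-value to the star marker listed above."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError(f"p-value must lie in [0, 1], got {p_value!r}")
    if p_value < 0.001:
        return "***"
    if p_value < 0.01:
        return "**"
    if p_value < 0.05:
        return "*"
    # Assumption: non-significant comparisons get no marker.
    return ""
```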
When a Hi-C cooler file is provided:
- Aggregate Peak Analysis (APA) pools Hi-C contact frequencies around loop anchor regions for each category
- Symmetrization corrects for Hi-C matrix storage format bias
- Comparison heatmaps visualize loop strength and interaction patterns by category
- Enrichment statistics quantify contact enrichment at anchor regions
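
To make the APA steps concrete, here is a dependency-free sketch of the general idea: average equally sized contact windows, optionally symmetrize, and summarize enrichment. It is not the pipeline's implementation. The function names are hypothetical, averaging with the transpose is only one common way to make a matrix symmetric, and the peak-to-lower-left ratio is one widely used APA summary that may differ from the statistics the pipeline reports. Which corner counts as "lower-left" depends on matrix orientation, which is assumed here.

```python
import math
from statistics import mean

Matrix = list[list[float]]


def aggregate(windows: list[Matrix]) -> Matrix:
    """Element-wise mean of equally sized square windows, one per loop."""
    size = len(windows[0])
    return [
        [mean(w[i][j] for w in windows) for j in range(size)]
        for i in range(size)
    ]


def symmetrize(matrix: Matrix) -> Matrix:
    """Average a square matrix with its transpose (assumed approach)."""
    size = len(matrix)
    return [
        [(matrix[i][j] + matrix[j][i]) / 2 for j in range(size)]
        for i in range(size)
    ]


def peak_to_lower_left(matrix: Matrix, corner: int = 1) -> float:
    """Center pixel divided by the mean of the lower-left corner block."""
    size = len(matrix)
    center = matrix[size // 2][size // 2]
    corner_mean = mean(
        matrix[i][j]
        for i in range(size - corner, size)
        for j in range(corner)
    )
    # Avoid dividing by an empty background.
    return center / corner_mean if corner_mean else math.nan
```

With real data, a NumPy-based version would be more typical. Plain lists are used here only to keep the example self-contained.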
Analysis produces organized outputs:
- loop_length_violin_by_category.png - Violin plot comparing loop length distributions across all 7 categories with statistical significance markers
- apa_comparison_heatmaps.png - APA (Aggregate Peak Analysis) heatmaps showing Hi-C contact enrichment around loop anchors for each category (generated if Hi-C file is provided)
- generated_files/ - 7 category-specific BEDPE files + statistics CSVs
- temp/ - Intermediate files for debugging (BED, FASTA, FIMO results)
- apa_results/ - Per-category APA matrices and statistics (if APA analysis enabled)
# Input BEDPE file
input:
  bedpe: "/net/noble/vol4/user/gmurtaza/doulatov_hic_data/shared/hicfoundation_outputs/Bcell/10000/low_res_lc/HiCFoundation_loop_0.1.bedpe"
  has_header: true # Set to false if BEDPE file has no header row
# Output configuration
output:
  directory: "outputs/Bcell_res:10000_threshold:0.1/"
# Loop filtering options
loops:
  include_interchromosomal: false # false = intra-chromosomal only, true = all loops
# Motif analysis (optional - omit or set to null to skip)
motif_analysis:
  fasta: /net/noble/vol4/user/gmurtaza/doulatov_hic_data/4DN_supporting_files/GRCh38_no_alt_analysis_set_GCA_000001405.15.fa # Path to reference genome FASTA file
  motif: /net/noble/vol1/home/gmurtaza/Documents/2025_gmurtaza_loops-qc/motifs/MA0139.2.meme # Path to motif file in MEME format
  fdr_threshold: 0.25
  fimo_threshold: "1e-5"
  optimize_convergent: false
# APA analysis (optional - omit or set to null to skip)
apa_analysis:
  hic_file: /net/noble/vol4/user/gmurtaza/doulatov_hic_data/processed/Bcell/Bcell.multires.cool # Path to .cool or .h5 Hi-C contact matrix file
  window_size: 25000 # Window size in bp around anchors (±25kb)
  resolution: 10000 # Resolution of Hi-C matrix in bp
  nproc: 8 # Number of threads for pileup computation
  min_loop_length: 200000 # Minimum loop length in bp for APA analysis (e.g., 10000 for 10kb). Set to null to disable
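
The optional sections and filters in this configuration can be read as a few simple rules. The sketch below shows one way they might be interpreted once the YAML has been parsed into a dictionary (for instance with PyYAML's `yaml.safe_load`, which is an assumption about how the file is read). The helper names are hypothetical, and treating the minimum loop length as inclusive is also an assumption.

```python
from typing import Any, Optional


def enabled_steps(config: dict[str, Any]) -> dict[str, bool]:
    """Report which optional stages run: a section that is omitted
    or set to null is skipped."""
    return {
        "motif_analysis": config.get("motif_analysis") is not None,
        "apa_analysis": config.get("apa_analysis") is not None,
    }


def keep_loop(chrom1: str, chrom2: str, include_interchromosomal: bool) -> bool:
    """Apply loops.include_interchromosomal: false keeps only loops
    whose anchors share a chromosome."""
    return include_interchromosomal or chrom1 == chrom2


def long_enough_for_apa(length_bp: int, min_loop_length: Optional[int]) -> bool:
    """Apply apa_analysis.min_loop_length; None (null) disables the filter.
    Assumption: the threshold is inclusive."""
    return min_loop_length is None or length_bp >= min_loop_length
```

For example, a config whose `apa_analysis` key is `null` would report `{"motif_analysis": True, "apa_analysis": False}` as long as `motif_analysis` is populated.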