Skip to content

Detailed usage with examples

Sasha edited this page May 13, 2025 · 2 revisions

Best Practices for Allele-Specific RNA‑seq Analysis

This document provides a comprehensive guide for running the fastq2allelictabs pipeline and related tools for allele‑specific RNA‑seq analysis. It covers installation, preprocessing of reference files, processing of FASTQ files into allelic count tables, and the downstream analysis of allelic imbalance. Detailed worked examples are provided for each major step.


Overview

The workflow is divided into three main steps:

  1. References Preprocessing
    Create parental pseudogenomes, update VCFs, and prepare combined (chimeric) references for both the sample and spike‑in (if used).
  2. FASTQ to Allelic Gene Count Tables Processing
    Align sequencing reads to allele‑specific references, assign reads to alleles, and generate allelic count tables per gene.
  3. Allelic Imbalance and Differential Allelic Imbalance Analysis
    Compute technical overdispersion (iQCC), generate allelic imbalance tables, and perform differential analysis.

Installation and Setup

  • Containers:
    Use Apptainer (formerly Singularity) to pull the required images:

    • ASE Container:
      apptainer pull ase.sif docker://sasha/ase:latest
    • Example Data and References Container:
      apptainer pull ase_data.sif docker://sasha/ase_data:latest
    • Extract Example Data:
      apptainer exec ase_data.sif cp -r /app/example_data /app/example_ref /app/pt3_example_data .
  • Disk Space Considerations:
    Reference files generated at Step 1 may exceed 20 GB. These are only needed until count tables are produced.

  • Working Environment:
    Containers run on Linux. If using Windows Subsystem for Linux (WSL), ensure your working directory is on an ext4 partition (e.g., within your WSL home directory).

  • Processing Options:

    • Sample + Spike‑in (Recommended):
      Use when an RNA spike‑in is added before library preparation.
    • Sample Only (2+ replicates):
      Use when multiple replicates of the sample are available and no spike‑in is used.

Step 1: References Preprocessing

1.1 Creating Parental Pseudogenomes and VCF Files

You will need:

  • A reference genome in FASTA format (e.g., GRCh38.fa).
  • The corresponding gene annotation in GTF format.
  • A VCF file with phased (heterozygous) SNP calls for your organism (donor sample or F1 cross).

Process:

  • Generate Heterozygous VCF:
    Prepare a VCF containing heterozygous positions with allele1 reassigned as the reference (A1) and allele2 as the alternative (A2).

  • Generate Pseudoreference FASTAs:
    Create allele‑specific FASTA files (A1 and A2) by inserting SNPs from the VCF into the reference genome.

  • Annotation:
    Use your existing GTF (or SAF if using non‑gene intervals) for subsequent steps.

Tip: Ensure chromosome names are consistent across all files (FASTA, VCF, and GTF).

A one‑script solution is provided in the pipeline:

python3 pseudoreferences_creation/prepare_pseudoreference.py \
  --PSEUDOREF True --HETVCF True \
  --pseudoref_dir /full/path/to/pseudo/ref/ \
  --vcf_dir /full/path/to/vcf/outputs/ \
  --ref /full/path/to/reference.fa \
  --name_ind donor_name \
  --vcf_ind /full/path/to/donor.vcf

1.2 Creating Mixed References for Sample and Spike‑in

When using a spike‑in (e.g., a mouse RNA added to a human sample), the reference files for both organisms must be merged without chromosome name collisions.

Key Steps:

  • Rename Chromosomes:
    Append unique suffixes (e.g., “_human”, “_mouse”) to chromosome names in the FASTA, VCF, and GTF files.
    Example command snippet:
    awk -v suffix="human" '{print $1 "\t" $1 suffix}' your.fasta.fai > names.txt
    bcftools annotate --rename-chrs names.txt your.vcf -Oz -o renamed.vcf
  • Combine Files:
    Concatenate the renamed sample and spike‑in FASTAs, VCFs, and GTFs. For FASTAs, rebuild the index afterward:
    cat sample_renamed.fa spikein_renamed.fa > combined.fa.tmp
    samtools faidx combined.fa.tmp --length 80 > combined.fa
    samtools faidx combined.fa
    rm combined.fa.tmp*

1.3 Creating STAR Indexes

For each allele (or combined reference when using spike‑in), generate a STAR index:

STAR --runMode genomeGenerate \
     --genomeDir /path/to/index_dir/ \
     --genomeFastaFiles combined.fa \
     --sjdbGTFfile combined.gtf \
     --sjdbOverhang [read_length-1] \
     --runThreadN 4

1A. Worked Example: Reference Preparation

This example demonstrates a typical spike‑in experiment (using samples from two individuals spiked with 129xCASTF1 mouse RNA).

  1. Setup Working Directory and Pull Containers:

    work_dir=runASE
    mkdir -p $work_dir; cd $work_dir
    apptainer pull ase.sif docker://sasha/ase:latest
    apptainer pull ase_data.sif docker://sasha/ase_data:latest
    apptainer exec ase_data.sif cp -r /app/example_data /app/example_ref /app/pt3_example_data .
  2. Create Pseudoreferences:

    For each donor, run:

    apptainer exec --no-home --bind $(pwd):/mnt ase.sif python3 \
      /opt/fastq2allelictabs/pseudoreferences_creation/prepare_pseudoreference.py \
      --PSEUDOREF True --HETVCF True \
      --pseudoref_dir /mnt/pseudoref \
      --vcf_dir /mnt/pseudoref \
      --ref /mnt/example_ref/GRCh38_chr15/GRCh38.chr15.fa \
      --name_ind Stemcell_CE0006419 \
      --vcf_ind /mnt/example_ref/CE0006419_chr15/Stemcell_CE0006419.individual.SNP.chr15.vcf

    (Repeat for each donor.)

  3. Combine and Rename References:

    Use the provided rename_chromosomes.sh and combine_sample_spikein_files.sh scripts to:

    • Rename sample and spike‑in references (e.g., append “_human” or “_mouse”).
    • Combine files into chimeric references.
  4. Index with STAR:
    Follow the STAR index command above for each combined pseudoreference.


Step 2: FASTQ → Allelic Gene Count Tables Processing

2.1 Preparing the Sample Table

Create a TSV file (e.g., example_ase.tsv) where each row describes a sample. Required columns include:

  • sampleID, sample name, sequencing platform, read length, sample and spike‑in reference names, spike‑in proportion, chimeric name
  • Paths:
    Paths to pseudoreference directories (A1 and A2), FASTQ files (read1 and read2), and output directories.
  • Suffixes:
    Suffixes identifying chromosome names for sample and spike‑in in the reference.

2.2 Alignment and Allelic Resolving

For each sample in the table, the pipeline will:

  1. Align Reads:
    Run STAR alignments on both allele‑specific pseudogenomes.
  2. Allele Resolution:
    Assign reads to A1, A2, or ambiguous (NA).

Use the provided script:

for i in $(cut -f1 example_ase.tsv | tail -n +2); do
  fastq2allelicbams.sh example_ase.tsv $i
done

2.3 Capturing Alignment Statistics

Gather alignment and haplotyping statistics by running:

fastq2allelicbams_stats.sh example_ase.tsv

This creates a new table (e.g., example_ase_info.tsv) with detailed statistics.

2.4 Creating Allelic Gene Count Tables

Finally, generate gene count tables from the allelic BAM files:

allelicbams2genecounts.sh example_ase_info.tsv

FeatureCounts is run on each pair (sample and spike‑in) to produce tables with allelic counts per gene.


2A. Worked Example: Generating Allelic Count Tables

Using the example_ase.tsv file from the example data:

  1. Step 2.1 – Align and Resolve Alleles:

    fastq2allelicbams=/opt/fastq2allelictabs/fastq2allelicbams.sh
    TAB=example_data/example_ase.tsv
    for i in $(cut -f1 ${TAB} | tail -n +2); do
      apptainer exec --no-home --bind $(pwd):/mnt ase.sif \
        $fastq2allelicbams /mnt/${TAB} $i
    done
  2. Step 2.2 – Capture Alignment Statistics:

    fastq2allelicbams_stats=/opt/fastq2allelictabs/fastq2allelicbams_stats.sh
    apptainer exec --no-home --bind $(pwd):/mnt ase.sif \
      $fastq2allelicbams_stats /mnt/${TAB}
  3. Step 2.3 – Generate Allelic Count Tables:

    allelicbams2genecounts=/opt/fastq2allelictabs/allelicbams2genecounts.sh
    apptainer exec --no-home --bind $(pwd):/mnt ase.sif \
      $allelicbams2genecounts /mnt/${TAB::-4}_info.tsv

Step 3: Allelic Gene Count Tables → Allelic Imbalance Analysis

3.1 Preparing the R Environment

Install the necessary R packages:

install.packages('tidyverse')
install.packages('devtools')
devtools::install_github("gimelbrantlab/Qllelic")
devtools::install_github("gimelbrantlab/ControlFreq")

3.2 Allelic Imbalance Analysis Options

Depending on your experimental design, choose one of the following:

  • No Technical Replication/No Spike‑in:
    Use a reasonable default (e.g., Q = 1, binomial) and skip iQCC computation.
  • Technical Replication:
    Group replicates, compute iQCC (technical overdispersion), then generate allelic imbalance tables.
  • RNA Spike‑ins:
    Group samples by identical spike‑in RNA; compute iQCC on spike‑in chromosomes, then apply to the full sample.

3.3 Analysis Procedures

  1. Compute iQCC (Overdispersion):
    Use ControlFreq::compute_iQCC_for_selected_samples() on grouped samples.
  2. Make Allelic Imbalance Tables:
    Use Qllelic::PerformBinTestAIAnalysisForConditionNPoint_knownCC() with the computed iQCC values.
  3. Differential Allelic Imbalance Testing:
    For comparing two conditions, use Qllelic::PerformBinTestAIAnalysisForTwoConditions_knownCC().

3A. Worked Example: Analysis of (Differential) Allelic Imbalance

This example uses a processed allelic count table from the full dataset (located in pt3_example_data).

  1. Change to the Example Data Directory:

    cd pt3_example_data
  2. Run the R Markdown Analysis:

    apptainer exec --no-home --bind $(pwd):/mnt ../ase.sif \
      Rscript -e "rmarkdown::render('/mnt/ase.Rmd')"
  3. Resulting Files:
    The process generates:

    • An allelic imbalance table per sample (each gene compared against the null hypothesis of a.i. = 0.5).
    • A differential allelic imbalance table (comparing two samples). These files have names like *ai_table.tsv. For the example data run, the output files are
    • example_ase.CE0006419_vs_CE0006653.diff_ai_table.tsv
    • example_ase.CE0006653.ai_table.tsv
    • example_ase.CE0006419.ai_table.tsv
  4. Interpreting the Results:
    Review the output tables. For further details, see related wiki pages:


Additional Notes and Resources

  • Cleanup:
    Once allelic count tables are generated, you may delete pseudoreference files to free disk space.
  • Graphical Overview:
    Refer to the original pipeline outline for a visual representation of the workflow.
  • Further Documentation:
    Visit the ControlFreq Wiki for more details and troubleshooting tips.

Feel free to adjust paths, parameters, and options based on your experimental design and computing environment. Contributions and updates to this documentation are welcome.


This document is intended as a living resource. For the most up‑to‑date usage examples and troubleshooting, check the associated GitHub repositories and wiki pages.


End of document.