-
Couldn't load subscription status.
- Fork 0
Detailed usage with examples
This document provides a comprehensive guide for running the fastq2allelictabs pipeline and related tools for allele‑specific RNA‑seq analysis. It covers installation, preprocessing of reference files, processing of FASTQ files into allelic count tables, and the downstream analysis of allelic imbalance. Detailed worked examples are provided for each major step.
The workflow is divided into three main steps:
-
References Preprocessing
Create parental pseudogenomes, update VCFs, and prepare combined (chimeric) references for both the sample and spike‑in (if used). -
FASTQ to Allelic Gene Count Tables Processing
Align sequencing reads to allele‑specific references, assign reads to alleles, and generate allelic count tables per gene. -
Allelic Imbalance and Differential Allelic Imbalance Analysis
Compute technical overdispersion (iQCC), generate allelic imbalance tables, and perform differential analysis.
-
Containers:
Use Apptainer (formerly Singularity) to pull the required images:-
ASE Container:
apptainer pull ase.sif docker://sasha/ase:latest
-
Example Data and References Container:
apptainer pull ase_data.sif docker://sasha/ase_data:latest
-
Extract Example Data:
apptainer exec ase_data.sif cp -r /app/example_data /app/example_ref /app/pt3_example_data .
-
ASE Container:
-
Disk Space Considerations:
Reference files generated at Step 1 may exceed 20 GB. These are only needed until count tables are produced. -
Working Environment:
Containers run on Linux. If using Windows Subsystem for Linux (WSL), ensure your working directory is on an ext4 partition (e.g., within your WSL home directory). -
Processing Options:
-
Sample + Spike‑in (Recommended):
Use when an RNA spike‑in is added before library preparation. -
Sample Only (2+ replicates):
Use when multiple replicates of the sample are available and no spike‑in is used.
-
Sample + Spike‑in (Recommended):
You will need:
- A reference genome in FASTA format (e.g.,
GRCh38.fa). - The corresponding gene annotation in GTF format.
- A VCF file with phased (heterozygous) SNP calls for your organism (donor sample or F1 cross).
Process:
-
Generate Heterozygous VCF:
Prepare a VCF containing heterozygous positions with allele1 reassigned as the reference (A1) and allele2 as the alternative (A2). -
Generate Pseudoreference FASTAs:
Create allele‑specific FASTA files (A1 and A2) by inserting SNPs from the VCF into the reference genome. -
Annotation:
Use your existing GTF (or SAF if using non‑gene intervals) for subsequent steps.
Tip: Ensure chromosome names are consistent across all files (FASTA, VCF, and GTF).
A one‑script solution is provided in the pipeline:
python3 pseudoreferences_creation/prepare_pseudoreference.py \
--PSEUDOREF True --HETVCF True \
--pseudoref_dir /full/path/to/pseudo/ref/ \
--vcf_dir /full/path/to/vcf/outputs/ \
--ref /full/path/to/reference.fa \
--name_ind donor_name \
--vcf_ind /full/path/to/donor.vcfWhen using a spike‑in (e.g., a mouse RNA added to a human sample), the reference files for both organisms must be merged without chromosome name collisions.
Key Steps:
-
Rename Chromosomes:
Append unique suffixes (e.g., “_human”, “_mouse”) to chromosome names in the FASTA, VCF, and GTF files.
Example command snippet:awk -v suffix="human" '{print $1 "\t" $1 suffix}' your.fasta.fai > names.txt bcftools annotate --rename-chrs names.txt your.vcf -Oz -o renamed.vcf
-
Combine Files:
Concatenate the renamed sample and spike‑in FASTAs, VCFs, and GTFs. For FASTAs, rebuild the index afterward:cat sample_renamed.fa spikein_renamed.fa > combined.fa.tmp samtools faidx combined.fa.tmp --length 80 > combined.fa samtools faidx combined.fa rm combined.fa.tmp*
For each allele (or combined reference when using spike‑in), generate a STAR index:
STAR --runMode genomeGenerate \
--genomeDir /path/to/index_dir/ \
--genomeFastaFiles combined.fa \
--sjdbGTFfile combined.gtf \
--sjdbOverhang [read_length-1] \
--runThreadN 4This example demonstrates a typical spike‑in experiment (using samples from two individuals spiked with 129xCASTF1 mouse RNA).
-
Setup Working Directory and Pull Containers:
work_dir=runASE mkdir -p $work_dir; cd $work_dir apptainer pull ase.sif docker://sasha/ase:latest apptainer pull ase_data.sif docker://sasha/ase_data:latest apptainer exec ase_data.sif cp -r /app/example_data /app/example_ref /app/pt3_example_data .
-
Create Pseudoreferences:
For each donor, run:
apptainer exec --no-home --bind $(pwd):/mnt ase.sif python3 \ /opt/fastq2allelictabs/pseudoreferences_creation/prepare_pseudoreference.py \ --PSEUDOREF True --HETVCF True \ --pseudoref_dir /mnt/pseudoref \ --vcf_dir /mnt/pseudoref \ --ref /mnt/example_ref/GRCh38_chr15/GRCh38.chr15.fa \ --name_ind Stemcell_CE0006419 \ --vcf_ind /mnt/example_ref/CE0006419_chr15/Stemcell_CE0006419.individual.SNP.chr15.vcf
(Repeat for each donor.)
-
Combine and Rename References:
Use the provided
rename_chromosomes.shandcombine_sample_spikein_files.shscripts to:- Rename sample and spike‑in references (e.g., append “_human” or “_mouse”).
- Combine files into chimeric references.
-
Index with STAR:
Follow the STAR index command above for each combined pseudoreference.
Create a TSV file (e.g., example_ase.tsv) where each row describes a sample. Required columns include:
- sampleID, sample name, sequencing platform, read length, sample and spike‑in reference names, spike‑in proportion, chimeric name
-
Paths:
Paths to pseudoreference directories (A1 and A2), FASTQ files (read1 and read2), and output directories. -
Suffixes:
Suffixes identifying chromosome names for sample and spike‑in in the reference.
For each sample in the table, the pipeline will:
-
Align Reads:
Run STAR alignments on both allele‑specific pseudogenomes. -
Allele Resolution:
Assign reads to A1, A2, or ambiguous (NA).
Use the provided script:
for i in $(cut -f1 example_ase.tsv | tail -n +2); do
fastq2allelicbams.sh example_ase.tsv $i
doneGather alignment and haplotyping statistics by running:
fastq2allelicbams_stats.sh example_ase.tsvThis creates a new table (e.g., example_ase_info.tsv) with detailed statistics.
Finally, generate gene count tables from the allelic BAM files:
allelicbams2genecounts.sh example_ase_info.tsvFeatureCounts is run on each pair (sample and spike‑in) to produce tables with allelic counts per gene.
Using the example_ase.tsv file from the example data:
-
Step 2.1 – Align and Resolve Alleles:
fastq2allelicbams=/opt/fastq2allelictabs/fastq2allelicbams.sh TAB=example_data/example_ase.tsv for i in $(cut -f1 ${TAB} | tail -n +2); do apptainer exec --no-home --bind $(pwd):/mnt ase.sif \ $fastq2allelicbams /mnt/${TAB} $i done
-
Step 2.2 – Capture Alignment Statistics:
fastq2allelicbams_stats=/opt/fastq2allelictabs/fastq2allelicbams_stats.sh apptainer exec --no-home --bind $(pwd):/mnt ase.sif \ $fastq2allelicbams_stats /mnt/${TAB}
-
Step 2.3 – Generate Allelic Count Tables:
allelicbams2genecounts=/opt/fastq2allelictabs/allelicbams2genecounts.sh apptainer exec --no-home --bind $(pwd):/mnt ase.sif \ $allelicbams2genecounts /mnt/${TAB::-4}_info.tsv
Install the necessary R packages:
install.packages('tidyverse')
install.packages('devtools')
devtools::install_github("gimelbrantlab/Qllelic")
devtools::install_github("gimelbrantlab/ControlFreq")Depending on your experimental design, choose one of the following:
-
No Technical Replication/No Spike‑in:
Use a reasonable default (e.g., Q = 1, binomial) and skip iQCC computation. -
Technical Replication:
Group replicates, compute iQCC (technical overdispersion), then generate allelic imbalance tables. -
RNA Spike‑ins:
Group samples by identical spike‑in RNA; compute iQCC on spike‑in chromosomes, then apply to the full sample.
-
Compute iQCC (Overdispersion):
UseControlFreq::compute_iQCC_for_selected_samples()on grouped samples. -
Make Allelic Imbalance Tables:
UseQllelic::PerformBinTestAIAnalysisForConditionNPoint_knownCC()with the computed iQCC values. -
Differential Allelic Imbalance Testing:
For comparing two conditions, useQllelic::PerformBinTestAIAnalysisForTwoConditions_knownCC().
This example uses a processed allelic count table from the full dataset (located in pt3_example_data).
-
Change to the Example Data Directory:
cd pt3_example_data -
Run the R Markdown Analysis:
apptainer exec --no-home --bind $(pwd):/mnt ../ase.sif \ Rscript -e "rmarkdown::render('/mnt/ase.Rmd')"
-
Resulting Files:
The process generates:- An allelic imbalance table per sample (each gene compared against the null hypothesis of a.i. = 0.5).
- A differential allelic imbalance table (comparing two samples). These files have names like *ai_table.tsv. For the example data run, the output files are
- example_ase.CE0006419_vs_CE0006653.diff_ai_table.tsv
- example_ase.CE0006653.ai_table.tsv
- example_ase.CE0006419.ai_table.tsv
-
Interpreting the Results:
Review the output tables. For further details, see related wiki pages:
-
Cleanup:
Once allelic count tables are generated, you may delete pseudoreference files to free disk space. -
Graphical Overview:
Refer to the original pipeline outline for a visual representation of the workflow. -
Further Documentation:
Visit the ControlFreq Wiki for more details and troubleshooting tips.
Feel free to adjust paths, parameters, and options based on your experimental design and computing environment. Contributions and updates to this documentation are welcome.
This document is intended as a living resource. For the most up‑to‑date usage examples and troubleshooting, check the associated GitHub repositories and wiki pages.
End of document.