Skip to content

Molecule iteration

Buys de Barbanson edited this page Feb 22, 2021 · 18 revisions

The SingleCellMultiOmics API heavily relies on the Molecule object.

A Molecule is a entity which usually represents a single molecule in an experiment. The Molecule is a class which refers to the original molecular template (DNA/RNA). Usually the original template molecule is amplified and sequenced, resulting in one or more sequenced fragments which are evidence for the existence and properties of the Molecule.

A Fragment holds one or more reads which where sequenced as part of a single fragment. This is usually read 1 and read 2 of a Illumina read. The reads are stored as PySAM objects.

How fragments are associated to molecules is protocol specific. A variety of classes encode ways of associating fragments to molecules. Usually a combination of the unique molecular identifier (UMI) and mapping coordinates of the fragments are used to associate fragments to a molecule.

The example below iterates over all NLAIII molecules in my_bam.bam and writes valid molecules to output.bam

import pysam
import singlecellmultiomics.molecule
import singlecellmultiomics.fragment
import pysamiterators
alignments = pysam.AlignmentFile('my_bam.bam')
reference = pysamiterators.iterators.CachedFasta( pysam.FastaFile('my_reference.fasta' ) )

with pysam.AlignmentFile('output.bam' , "wb",header=alignments.header) as output:
    for i,molecule in  enumerate( 
            singlecellmultiomics.molecule.MoleculeIterator(
                alignments=alignments,
                moleculeClass=singlecellmultiomics.molecule.NlaIIIMolecule,
                fragmentClass=singlecellmultiomics.fragment.NLAIIIFragment,
                molecule_class_args={
                    'reference':reference,
                    'site_has_to_be_mapped':True
                }      
            )):
        molecule.write_pysam(output)

Obtain the allele

To obtain the allele for your molecules you will need a VCF file which contains variants which can be used to distinguish the alleles. Make sure to use a BAM file which is generated by mapping to a variant masked reference, which you can generate using fastaMaskVariants.py.

Then we can prepare the AlleleResolver which will try to assign an allele to a molecule. If the VCF file contains more than two samples use the select_samples argument to select to which alleles the molecule should be assigned.

allele_resolver = singlecellmultiomics.alleleTools.AlleleResolver(
          'my_variants.vcf', 
          select_samples=['AlleleA','AlleleB'],
          lazyLoad=True)

Then we can set up the molecule iterator as usual, and use the molecule.get_allele(allele_resolver) to obtain the allele.