applicativesystem/genome-alignment

genomic-alignment-extraction

  • A scalable genomic fraction aligner and extractor for large-scale alignment of genomes and transcriptomes. It distributes the work across cores to extract the aligned regions.
  • The aligned regions can also be passed to the length plotter and used as training data for specific machine-learning applications.
  • This repository contains two programs: one for one-to-one genome alignment and extraction, which takes one reference genome and one target genome, and one that takes one target genome and as many reference genomes as you want.
  • In the one-to-one case, such as bacterial or organelle genome alignments, it provides the target genome start and end, the reference genome start and end, the coordinates, the reverse coordinates, and the sequences.
  • In the one-to-many case, such as aligning a specific transcript against a larger database of transcripts or a genome, it provides the target start, end, coordinates, and sequences, and returns a list of lists of named tuples with the fields referencename and sequenceextracted.
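
The one-to-many return shape described above (a list of lists of named tuples) can be sketched as follows. This is a minimal illustration, not the repository's code: only the field names `referencename` and `sequenceextracted` come from the text, while the function name `extract_one_to_many`, the input layout, and the 1-based inclusive coordinates are assumptions.

```python
from collections import namedtuple

# Field names follow the README wording; everything else here is hypothetical.
Extracted = namedtuple("Extracted", ["referencename", "sequenceextracted"])


def extract_one_to_many(references, hits):
    """Group extracted regions by reference.

    references: dict mapping a reference name to its sequence string.
    hits: iterable of (referencename, start, end), assumed 1-based inclusive.
    Returns one inner list per reference, each holding Extracted tuples.
    """
    grouped = {}
    for name, start, end in hits:
        region = references[name][start - 1:end]
        grouped.setdefault(name, []).append(Extracted(name, region))
    return list(grouped.values())


# Toy sequences, invented for the example.
references = {"refA": "ACGTACGTAC", "refB": "TTTTGGGGCC"}
hits = [("refA", 1, 4), ("refB", 5, 8), ("refA", 7, 10)]
result = extract_one_to_many(references, hits)
```

Named tuples keep each record readable (`record.referencename`) while still unpacking like a plain tuple.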
def genomeExtraction(alignmentgenome=False,
                     reference_genome=False,
                     target_genome=False,
                     percent_match=False):
    """
    Function: genomeExtraction
    Summary: Takes a genome alignment file from LASTZ or BLAST, in the format
    shown below, and extracts the aligned reference and target genome regions.
    Regions are extracted from both the positive and the negative strand; for
    the negative strand the sequences are reversed. Alignments can optionally be
    filtered by a given percentage match, and all coordinates and sequences are
    written as tab-delimited files.
    Examples:
           genome   query_size  aligned_start  aligned_end  matches  mismatches  %_aligned  %_matched  chromosome  strand  start     end
        0  taeGut2  8000        1              8000         7788     60          100        97.35      chr1        +       5637454   5646796
        1  taeGut2  8000        6295           7425         1095     22          14.14      96.82      chr1        +       5642077   5643357
        2  taeGut2  8000        6332           6383         45       4           0.65       86.54      chr1        -       1365309   1365357
        3  taeGut2  8000        6351           6387         34       1           0.46       91.89      chr1        +       2096236   2096628
        4  taeGut2  8000        6351           6383         31       2           0.41       93.94      chr1        +       7208240   7208272
        5  taeGut2  8000        6357           6383         26       1           0.34       96.3       chr1        -       3293331   3293357
        6  taeGut2  8000        3498           3526         25       1           0.36       86.21      chr1        -       71676009  71676034
        7  taeGut2  8000        3602           3624         22       1           0.29       95.65      chr1        +       12695042  12695064
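    Attributes:
        @param (alignmentgenome) default=False: alignment file from LASTZ or BLAST in the format shown above
        @param (reference_genome) default=False: reference genome from which the aligned regions are extracted
        @param (target_genome) default=False: target genome from which the aligned regions are extracted
        @param (percent_match) default=False: percentage match used to filter the alignments
    Returns: InsertHere
    """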
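
The steps the docstring describes (read the alignment table, filter by percentage match, extract each region, reverse it on the negative strand) can be sketched as below. This is a hedged illustration rather than the repository's implementation: the name `extract_regions`, the in-memory string inputs, the toy data, and the 1-based inclusive coordinates are all assumptions, and the toy table omits the leading index column shown in the example above.

```python
import csv
import io


def extract_regions(alignment_text, genome_seq, percent_match=None):
    """Return (start, end, strand, sequence) for each alignment row kept."""
    reader = csv.DictReader(io.StringIO(alignment_text), delimiter="\t")
    regions = []
    for row in reader:
        # Optional filter on the %_matched column.
        if percent_match is not None and float(row["%_matched"]) < percent_match:
            continue
        start, end = int(row["start"]), int(row["end"])
        seq = genome_seq[start - 1:end]  # assumes 1-based inclusive coordinates
        if row["strand"] == "-":
            # Plain reversal, as the docstring states; substitute a reverse
            # complement here if that is what the analysis needs.
            seq = seq[::-1]
        regions.append((start, end, row["strand"], seq))
    return regions


# Toy data (not from the README) so the example runs without genome files.
toy_alignment = (
    "genome\t%_matched\tchromosome\tstrand\tstart\tend\n"
    "toy\t97.35\tchr1\t+\t1\t4\n"
    "toy\t86.54\tchr1\t-\t5\t8\n"
    "toy\t99.00\tchr1\t-\t2\t5\n"
)
toy_genome = "ACGTTGCA"
kept = extract_regions(toy_alignment, toy_genome, percent_match=90)
```

With a threshold of 90, the second toy row is dropped and the third is returned reversed.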

Gaurav Sablok
University of Potsdam
Potsdam, Germany