nickgreensgithub/find_longest_sequences

Find longest sequences

A simple tool written in C++ for use in AutoTax. It dereplicates and filters the sequences in an input FASTA file: a sequence is removed if a longer sequence is found that matches it with 100% identity. The relative order of the remaining sequences is maintained.

U and T sequence characters are treated as if they're the same.
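
The rule above can be illustrated with a minimal sketch. This is not the tool's actual implementation: it is a simple quadratic version without the length sorting, length index, or OpenMP threading the tool reports using. The `Record` struct and the `normalise` and `dereplicate` functions are hypothetical names introduced for illustration. The sketch assumes that "100% identical" means the shorter sequence occurs as an exact substring of the longer one, and that among exact duplicates the first occurrence is kept, which is consistent with the example further down.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record type: one FASTA entry (header without '>', plus sequence).
struct Record {
    std::string header;
    std::string sequence;
};

// Hypothetical helper: map U/u to T/t so RNA and DNA spellings compare equal.
std::string normalise(std::string s) {
    for (char &c : s) {
        if (c == 'U') c = 'T';
        else if (c == 'u') c = 't';
    }
    return s;
}

// Keep a record unless (a) a strictly longer sequence contains it as an exact
// substring (assumption), or (b) an identical sequence appeared earlier.
// Input order is preserved in the result.
std::vector<Record> dereplicate(const std::vector<Record> &records) {
    std::vector<std::string> keys;
    keys.reserve(records.size());
    for (const Record &r : records) keys.push_back(normalise(r.sequence));

    std::vector<Record> kept;
    for (std::size_t i = 0; i < records.size(); ++i) {
        bool redundant = false;
        for (std::size_t j = 0; j < records.size() && !redundant; ++j) {
            if (i == j) continue;
            if (keys[j].size() > keys[i].size()) {
                // A longer sequence fully covers this one.
                redundant = keys[j].find(keys[i]) != std::string::npos;
            } else if (keys[j].size() == keys[i].size() && j < i) {
                // Exact duplicate of an earlier record.
                redundant = keys[j] == keys[i];
            }
        }
        if (!redundant) kept.push_back(records[i]);
    }
    return kept;
}
```

The all-pairs comparison is only meant to make the rule easy to read; it would be too slow for large inputs.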

Compilation:

cmake .
make

Usage:

findLongSeqs <input_file> <output_file> <threads>

Dependencies:

  • OpenMP

Example usage

input_file.fa

>one
GAT
>two
EGACA
>three
GAAB
>four
GAAT
>five
GAATA
>six
GATACR
>seven
EGATA
>eight
EGATA
>nine
GAACA
>ten
EGATA
>eleven
GATU
>twelve
GATT

Running with 1 thread:

./findLongSeqs ./input_file.fa ./output_file.fa 1

Reading input file...
done
Sorting sequences by length for faster searching...
done
Creating sequence length index...
done
Dereplicating sequences...
done
Original sequence count: 10
{         █          █          █          █    █} 100%
Final sequence count: 6
Writing output file
done
Process took: 0 seconds to complete

output_file.fa

>two
EGACA
>three
GAAB
>five
GAATA
>six
GATACR
>seven
EGATA
>nine
GAACA
>eleven
GATU
