A simple tool written in C++ for use in AutoTax. It dereplicates and filters the sequences in an input FASTA file: a sequence is removed if a longer sequence is found that is 100% identical to it, and the relative order of the remaining sequences is maintained.
The sequence characters U and T are treated as the same character.
cmake .
make
findLongSeqs <input_file> <output_file> <threads>
- OpenMP
input_file.fa
>one
GAT
>two
EGACA
>three
GAAB
>four
GAAT
>five
GAATA
>six
GATACR
>seven
EGATA
>eight
EGATA
>nine
GAACA
>ten
EGATA
>eleven
GATU
>twelve
GATT
Running with 1 thread:
./findLongSeqs ./input_file.fa ./output_file.fa 1
Reading input file...
done
Sorting sequences by length for faster searching...
done
Creating sequence length index...
done
Dereplicating sequences...
done
Original sequence count: 10
{ █ █ █ █ █} 100%
Final sequence count: 6
Writing output file
done
Process took: 0 seconds to complete
output_file.fa
>two
EGACA
>three
GAAB
>five
GAATA
>six
GATACR
>seven
EGATA
>nine
GAACA
>eleven
GATU