© 2018 Merly Escalona (merlyescalona@uvigo.es)
Phylogenomics Lab, University of Vigo, Spain
This program reduces the number of files and the total file size of a SimPhy run.
It concatenates all loci sequences into a single multiple sequence alignment file for each of the
FASTA outputs (sequences with gaps, and sequences without gaps (*_TRUE.fasta)). Loci are joined
with runs of N's (as long as desired, set with the -n/--nsize parameter). Gene tree files
are collapsed into a single gzipped tab-separated file with 2 columns:
filename tree
where filename is the basename of the gene tree file without its extension (e.g. g_trees00001.tree -> g_trees00001)
and tree is the content of that file.
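The two-column layout described above can be sketched in Python as follows. This is only an illustration, not the code the program itself runs: the function names `collapse_gene_trees` and `read_gene_trees` are hypothetical, and the sketch assumes each gene tree file holds a single tree on one line.

```python
import gzip
import os


def collapse_gene_trees(tree_paths, output_path):
    """Write one 'filename<TAB>tree' row per gene tree file into a gzipped TSV.

    Hypothetical helper; assumes one single-line tree per input file.
    """
    with gzip.open(output_path, "wt") as out:
        for path in sorted(tree_paths):
            # Basename without extension, e.g. g_trees00001.tree -> g_trees00001
            name = os.path.splitext(os.path.basename(path))[0]
            with open(path) as handle:
                tree = handle.read().strip()
            out.write(f"{name}\t{tree}\n")


def read_gene_trees(tsv_path):
    """Read the gzipped TSV back into a {filename: tree} dictionary."""
    with gzip.open(tsv_path, "rt") as handle:
        return dict(line.rstrip("\n").split("\t", 1) for line in handle)
```

Reading the file back with `read_gene_trees` gives a dictionary keyed by the original basenames, so individual trees can still be looked up after the separate files are gone.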
- The program assumes the output of a SimPhy simulation and follows its hierarchical folder structure and sequence labeling.
To learn more about the simulation pipeline scenario, see SimPhy's repository and/or:
- Mallo D, de Oliveira Martins L, Posada D (2016) SimPhy: Phylogenomic Simulation of Gene, Locus and Species Trees. Syst. Biol. 65(2) 334-344. doi: http://dx.doi.org/10.1093/sysbio/syv082
- SimPhy folder path
- prefix of the existing FASTA files
- (optional) length of the N sequence that will be used to separate the sequences when concatenated
- Modifications are made in place: files are concatenated and gzipped in the same SimPhy folder, and the original files are then removed.
- Clone this repository
git clone git@github.com:merlyescalona/simphycompress.git
- Change your current directory to the downloaded folder:
cd simphycompress
- Install:
python setup.py install --user
Required arguments:
-s <path>, --simphy-path <path>: Path of the SimPhy folder.
-ip <input_prefix>, --input-prefix <input_prefix>: Prefix of the FASTA filenames.
Optional arguments:
-n <N_seq_size>, --nsize <N_seq_size>: Number of N's introduced to separate the selected sequences. If the parameter is not set, the output file per replicate is a multiple sequence alignment file; otherwise, the output is a single-sequence file per replicate consisting of the selected reference sequences concatenated and separated by as many N's as set for this parameter.
-l <log_level>, --log <log_level>: Level of log messages shown on standard output. The entire log is stored in a separate file.
- Values: ['DEBUG', 'INFO', 'WARNING', 'ERROR'].
- Default: 'INFO'.
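To illustrate what the -n/--nsize value controls, here is a minimal Python sketch of joining sequences with a run of N's. The function name `join_with_ns` is hypothetical and the sketch only shows the separator logic, not how the program reads or writes FASTA files.

```python
def join_with_ns(sequences, nsize):
    """Concatenate sequences, separating consecutive ones with nsize N's.

    Hypothetical helper for illustration only.
    """
    separator = "N" * nsize
    return separator.join(sequences)


# Two short sequences separated by three N's
print(join_with_ns(["ACGT", "GG"], 3))  # prints ACGTNNNGG
```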
Information arguments:
-v, --version: Show program's version number and exit.
-h, --help: Show this help message and exit.