svkazakov edited this page Nov 19, 2014 · 26 revisions

Welcome to the MetaFast wiki!

Fast metagenome analysis toolkit, version 0.1.0.

Authors:

  • Software: Sergey Kazakov and Vladimir Ulyantsev, ITMO University, Saint-Petersburg.
  • Testing: Veronika Dubinkina and Alexandr Tyakht, SRI of Physical-Chemical Medicine, Moscow.
  • Idea: Dmitry Alexeev, SRI of Physical-Chemical Medicine, Moscow.

Description

MetaFast (FAST METAgenome analysis toolkit) is software for calculating various statistics of metagenome sequences and building a distance matrix between them.

The latest stable release can be downloaded from http://github.com/ulyantsev/metafast/releases. You need only metafast.sh to run metafast on Linux (don't forget chmod a+x metafast.sh), or only metafast.bat to run it on Windows.

Input files

MetaFast accepts input sequence files in FASTQ and FASTA formats. Input files can also be compressed with gzip or bzip2.

Examples

metafast.sh -i sample_1.fastq sample_2.fastq: runs metafast on Linux with default parameters on two samples, with reads in sample_1.fastq and sample_2.fastq.

metafast.sh -i data/*.fastq: runs metafast on Linux on all read files with the extension .fastq in the data directory.

metafast.sh -m 4G -k 7 -b 0 -l 8 -b1 3 -i test_data/tinytest_A.fastq test_data/tinytest_B.fastq: runs metafast on Linux on two samples, with reads in tinytest_A.fastq and tinytest_B.fastq, using the following settings:
-m 4G: use 4 GB of memory;
-k 7: use a k-mer size of 7 nucleotides;
-b 0: the maximal frequency for a k-mer to be assumed erroneous is 0 (all k-mers are good);
-l 8: only sequences with at least 8 nucleotides will be added to a component;
-b1 3: the minimum component size is 3 different k-mers.
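
To make the meaning of -k and -b more concrete, here is a minimal Python sketch of k-mer counting with a frequency cutoff. It is an illustration only, not metafast's actual implementation: the function names `count_kmers` and `drop_erroneous` and the example reads are hypothetical, and the sketch assumes that a k-mer is treated as erroneous when its count is less than or equal to the -b value.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every substring of length k across all reads."""
    counts = Counter()
    for read in reads:
        # A read shorter than k yields an empty range and contributes nothing.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def drop_erroneous(counts, max_bad_frequency):
    """Keep only k-mers seen more often than max_bad_frequency (assumed rule)."""
    return {kmer: n for kmer, n in counts.items() if n > max_bad_frequency}


# Made-up reads, used only for illustration.
reads = ["ACGTACGTAC", "CGTACGTACG"]
counts = count_kmers(reads, k=7)
good = drop_erroneous(counts, max_bad_frequency=0)
print(len(good) == len(counts))  # with a cutoff of 0, no k-mer is dropped
```

With a cutoff of 0 no k-mer can be at or below the threshold, which matches the "all k-mers are good" note for -b 0.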

Output

Once metafast has finished, the working directory will contain the following results:

  • workDir/matrices/dist_matrix_<date>_<time>.txt: distance matrix between samples (based on Bray–Curtis dissimilarity);
  • workDir/kmer-counter-many/stats/<in_file>.stat.txt: k-mer frequency statistics;
  • workDir/components-cutter/components-stat-<b1>-<b2>.txt: statistics of the extracted components;
  • workDir/features-calculator/vectors/<in_file>.vec: characteristic vector for the input file in_file (the number of k-mers in the input file from each extracted component).
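
As a rough illustration of how a Bray–Curtis dissimilarity can be computed from two characteristic vectors, consider the following Python sketch. It uses the standard definition, the sum of absolute differences divided by the sum of all counts. The function name `bray_curtis` and the example vectors are hypothetical; the sketch does not read .vec files (their layout is not described here) and makes no claim about any normalization metafast may apply before comparison.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two equal-length count vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    total = sum(u) + sum(v)
    if total == 0:
        # Two all-zero vectors are treated as identical (a choice made for this sketch).
        return 0.0
    return sum(abs(a - b) for a, b in zip(u, v)) / total


# Made-up per-component k-mer counts for two samples.
sample_a = [6, 2]
sample_b = [2, 2]
print(bray_curtis(sample_a, sample_b))
```

The value ranges from 0 (identical vectors) to 1 (no shared components).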

Running metafast

Usage: metafast [<Launch options>] [<Input parameters>]

Input parameters:

  • -i, --reads <args>
    List of read files from a single environment. FASTQ, BINQ, and FASTA files are accepted; gzip- and bzip2-compressed files are allowed too.
  • -k, --k <arg>
    K-mer size (in nucleotides, maximum 31 due to implementation details). The default value is 31 nucleotides.
  • -b, --maximal-bad-frequency <arg>
    Maximal frequency for a k-mer to be assumed erroneous. The default value is 1 k-mer.
  • -l, --min-seq-len <arg>
    Minimum sequence length to be added to a component (in nucleotides). The default value is 100 nucleotides.
  • --matrix-file <arg>
    Resulting distance matrix file. The default value is <work_dir>/matrices/dist_matrix_<date>_<time>.txt.
  • --stats-dir <arg>
    Directory with k-mer statistics. The default value is <work_dir>/kmer-counter-many/stats.
  • -bp, --bottom-cut-percent <arg>
    Percentage of k-mers to be assumed erroneous while building sequences in seq-builder. If specified, --maximal-bad-frequency is not used in the sequence builder.
  • -b1, --min-component-size <arg>
    Minimum component size in component-cutter (in k-mers). The default value is 1000 k-mers.
  • -b2, --max-component-size <arg>
    Maximum component size in component-cutter (in k-mers). The default value is 10000 k-mers.
  • -wn, --without-names
    Do not label the matrix rows and columns with the given file names.

Launch options:

  • -ts, --tools
    Print available tools.
  • -t, --tool <arg>
    Run a specific tool (specified by name). The default tool is matrix-builder.
  • -m, --memory <arg>
    Memory to use (values with suffix, for example: 1500M, 4G, etc.). By default metafast uses 95% of free memory on Linux and 90% of free memory on Windows.
  • -p, --available-processors <arg>
    Available processors. By default metafast uses all processors in the computer.
  • -w, --work-dir <arg>
    Working directory. The default working directory is workDir/ in the current directory.
  • -c, --continue
    Continue the previous run in the working directory (no input parameters need to be set other than the working directory itself).
  • -s, --start <arg>
    First stage to force-run (old results are overwritten).
  • -f, --finish <arg>
    Last stage to run.
  • -ea, --enable-assertions
    Enable assertions. If metafast behaves strangely, you can use this flag for additional checks while it runs. By default assertions are disabled.
  • -v, --verbose
    Enable debug output.
  • -h, --help
    Print short help message.
  • -ha, --help-all
    Print full help message.
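
When metafast is driven from a script, it can help to assemble the command line programmatically. The Python sketch below builds an argument list from a few of the options listed above (-m, -w, -k, -i). The helper name `build_command` and the default script path `./metafast.sh` are assumptions made for this example; only flags documented on this page are used.

```python
def build_command(reads, k=None, memory=None, work_dir=None, script="./metafast.sh"):
    """Build a metafast argument list; the script path is an assumed default."""
    cmd = [script]
    # Launch options come first, following the usage line.
    if memory is not None:
        cmd += ["-m", memory]          # e.g. "1500M" or "4G"
    if work_dir is not None:
        cmd += ["-w", work_dir]
    if k is not None:
        if k > 31:
            raise ValueError("k-mer size is limited to 31")
        cmd += ["-k", str(k)]
    # -i takes a list of files, so it is placed last.
    cmd += ["-i", *reads]
    return cmd


cmd = build_command(["sample_1.fastq", "sample_2.fastq"], k=7, memory="4G")
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the Python standard library to launch the run.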

System requirements

  • RAM: metafast requires 2 to 2.5 times as much memory as the size of the largest uncompressed FASTQ file being processed.
  • Hard disk space: metafast requires 25 to 30% of the total size of the uncompressed FASTQ files being processed.
  • Software: Java Runtime Environment 1.6 or higher is required to run metafast (most computers already have it installed).
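
The RAM and disk rules of thumb above can be turned into a quick estimate. The following Python sketch is a back-of-the-envelope calculation, not a guarantee: the function name `estimate_resources` is hypothetical, sizes are in bytes, and it assumes the RAM figure is driven by the single largest uncompressed input file while the disk figure is driven by the total of all uncompressed inputs.

```python
def estimate_resources(uncompressed_sizes):
    """Return (low, high) byte estimates for RAM and disk from file sizes in bytes."""
    largest = max(uncompressed_sizes)
    total = sum(uncompressed_sizes)
    return {
        # 2 to 2.5 times the largest uncompressed file.
        "ram": (largest * 2, largest * 5 // 2),
        # 25 to 30 percent of the total uncompressed size.
        "disk": (total * 25 // 100, total * 30 // 100),
    }


# Made-up sizes: one 4 GB file and one 2 GB file.
estimate = estimate_resources([4_000_000_000, 2_000_000_000])
print(estimate["ram"], estimate["disk"])
```

The upper RAM bound is a reasonable starting point when choosing a value for -m.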