Scripts in this directory, 'sandbox', are various utility or trial scripts that we have not fully tested. They are also not under semantic versioning, so their functionality and command line arguments may change without notice.
We are still triaging and documenting the various scripts.
Awaiting promotion to scripts:
- calc-error-profile.py - calculate a per-base "error profile" for shotgun sequencing data, w/o a reference. (Used/tested in 2014 paper on semi-streaming algorithms)
- count-kmers.py - output k-mer counts for multiple input files.
- count-kmers-single.py - output k-mer counts for a single k-mer file.
- correct-errors.py - streaming error correction.
- unique-kmers.py - estimate the number of k-mers present in a file with the HyperLogLog low-memory probabilistic cardinality estimation algorithm.
Scripts with recipes:
- calc-median-distribution.py - plot coverage distribution; see khmer-recipes #1
- collect-reads.py - subsample reads until a particular average coverage; see khmer-recipes #2
- saturate-by-median.py - calculate collector's curve on shotgun sequencing; see khmer-recipes #4
- slice-reads-by-coverage.py - extract reads based on coverage; see khmer-recipes #1
To keep, document, and build recipes for:
- make-coverage.py - RPKM calculation script
- assemstats3.py - print out assembly statistics
- build-sparse-graph.py - code for building a sparse graph (by Camille Scott)
- calc-best-assembly.py - calculate the "best assembly" - used in metagenome protocol
- collect-variants.py - used in a gist
- extract-single-partition.py - extract all the sequences that belong to a specific partition, from a file with multiple partitions
- filter-below-abund.py - like filter-abund, but trim off high-abundance k-mers
- filter-median-and-pct.py - see blog post on Trinity in silico norm (http://ivory.idyll.org/blog/trinity-in-silico-normalize.html)
- filter-median.py - see blog post on Trinity in silico norm (http://ivory.idyll.org/blog/trinity-in-silico-normalize.html)
- graph-size.py - filter reads based on size of connected graph
- memusg - memory usage analysis
- multi-rename.py - rename sequences from multiple files with a common prefix
- normalize-by-median-pct.py - see blog post on Trinity in silico norm (http://ivory.idyll.org/blog/trinity-in-silico-normalize.html)
- print-stoptags.py - print out the stoptag k-mers
- print-tagset.py - print out the tagset k-mers
- renumber-partitions.py - systematically renumber partitions
- shuffle-reverse-rotary.py - FASTA file shuffler for larger FASTA files
- split-fasta.py - break a FASTA file up into smaller chunks
- stoptags-by-position.py - print out where stoptags tend to occur
- strip-partition.py - clear off partition information
- subset-report.py - report stats on pmap files
- sweep-files.py - various ways to extract reads based on k-mer overlap
- sweep-out-reads-with-contigs.py - various ways to extract reads based on k-mer overlap
- sweep-reads.py - various ways to extract reads based on k-mer overlap
- sweep-reads2.py - various ways to extract reads based on k-mer overlap
- sweep-reads3.py - various ways to extract reads based on k-mer overlap
- write-trimmomatic.py - used to build Trimmomatic command lines in khmer-protocols
Good ideas to rewrite using newer tools/approaches:
- assembly-diff.py - find sequences that differ between two assemblies
- assembly-diff-2.py - find subsequences that differ between two assemblies
- bloom-count.py - count # of unique k-mers; should be reimplemented with HyperLogLog, Renamed from bloom_count.py in commit 4788c31
- split-sequences-by-length.py - break up short reads by length
Present in commit ff7f047b5b0d9acb6c1eb73d54cfd39c9e3d1393 but removed thereafter:
- abundance-hist-by-position.py - look at abundance of k-mers by position within read; use with fasta-to-abundance-hist.py
- fasta-to-abundance-hist.py - generate abundance of k-mers by position within reads; use with abundance-hist-by-position.py
- find-high-abund-kmers.py - extract high-abundance k-mers into a list
- hi-lo-abundance-by-position.py - look at high and low-abundance k-mers by position within read
- stoptag-abundance-hist.py - print out abundance histogram of stoptags
Present in commit 19b0a09353cddc45070edcf1283cae2c83c13b0e but removed thereafter:
- bloom-count-intersection.py - look at unique and disjoint #s of k-mers, renamed from bloom_count_intersection.py in commit 4788c31.
Present in commit d295bc847 but removed thereafter:
- combine-pe.py - combine partitions based on shared PE reads.
- compare-partitions.py - compare read membership in partitions.
- count-within-radius.py - calculating graph density by position with seq
- degree-by-position.py - calculating graph degree by position in seq
- dn-identify-errors.py - prototype script to identify errors in reads based on diginorm principles
- ec.py - new error correction foo
- error-correct-pass2.py - new error correction foo
- find-unpart.py - something to do with finding unpartitioned sequences
- normalize-by-align.py - new error correction foo
- read_aligner.py - new error correction foo
- shuffle-fasta.py - FASTA file shuffler for small FASTA files
- to-casava-1.8-fastq.py - convert reads to different Casava format
- uniqify-sequences.py - print out paths that are unique in the graph
- write-interleave.py - is this used by any protocol etc?
Present in commit 691b0b3ae but removed thereafter:
- annotate-with-median-count.py - replaced by count-median.py
- assemble-individual-partitions.py - better done with parallel
- assemstats.py - statistics gathering; see assemstats3.
- assemstats2.py - statistics gathering; see assemstats3.
- abund-ablate-reads.py - trim reads of high abundance k-mers.
- bench-graphsize-orig.py - benchmarking script for graphsize elimination
- bench-graphsize-th.py - benchmarking script for graphsize elimination
- bin-reads-by-abundance.py - see slice-reads-by-coverage.py
- bowtie-parser.py - parse bowtie map file
- calc-degree.py - various k-mer statistics
- calc-kmer-partition-counts.py - various k-mer statistics
- calc-kmer-read-abunds.py - various k-mer statistics
- calc-kmer-read-stats.py - various k-mer statistics
- calc-kmer-to-partition-ratio.py - various k-mer statistics
- calc-sequence-entropy.py - calculate per-sequence entropy
- choose-largest-assembly.py - see calc-best-assembly.py
- consume-and-traverse.py - replaced by load-graph.py
- contig-coverage.py - calculate coverage of contigs by k-mers
- count-circum-by-position.py - k-mer graph statistics by position within read
- count-density-by-position.py - k-mer graph stats by position within read
- count-distance-to-volume.py - k-mer stats from graph
- count-median-abund-by-partition.py - count median k-mer abundance by partition;
- count-shared-kmers-btw-assemblies.py - count shared k-mers between assemblies;
- ctb-iterative-bench-2-old.py - old benchmarking code
- ctb-iterative-bench.py - old benchmarking code
- discard-high-abund.py - discard reads by coverage; see slice-reads-by-coverage.py
- discard-pre-high-abund.py - discard reads by coverage; see slice-reads-by-coverage.py
- do-intertable-part.py - unused partitioning method
- do-partition-2.py - replaced by scripts/do-partition.py
- do-partition-stop.py - replaced by scripts/do-partition.py
- do-partition.py - moved to scripts/
- do-subset-merge.py - replaced by scripts/merge-partitions.py
- do-th-subset-calc.py - unused benchmarking scripts
- do-th-subset-load.py - unused benchmarking scripts
- do-th-subset-save.py - unused benchmarking scripts
- extract-surrender.py - no longer used partitioning feature
- extract-with-median-count.py - see slice-reads-by-coverage.py
- fasta-to-fastq.py - just a bad idea
- filter-above-median.py - replaced by filter-below-abund.py
- filter-abund-output-by-length.py - replaced by filter-abund/filter-below-abund
- filter-area.py - trim highly connected k-mers
- filter-degree.py - trim highly connected k-mers
- filter-density-explosion.py - trim highly connected k-mers
- filter-if-present.py - replaced by filter-abund and others
- filter-max255.py - remove reads w/high-abundance k-mers.
- filter-min2-multi.py - remove reads w/low-abundance k-mers
- filter-sodd.py - no longer used partitioning feature
- filter-subsets-by-partsize.py - deprecated way to filter out partitions by size
- get-occupancy.py - utility script no longer needed
- get-occupancy2.py - utility script no longer needed
- graph-partition-separate.py - deprecated graph partitioning stuff
- graph-size-circum-trim.py - experimental mods to graph-size.py
- graph-size-degree-trim.py - experimental mods to graph-size.py
- graph-size-py.py - experimental mods to graph-size.py
- join_pe.py - silly attempts to deal with PE interleaving?
- keep-stoptags.py - trim at stoptags
- label-pairs.py - deprecated PE fixing script
- length-dist.py - deprecated length distribution calc script
- load-ht-and-tags.py - load and examine hashtable & tags
- multi-abyss.py - better done with parallel
- make-coverage-by-position-for-node.py - deprecated coverage calculation
- make-coverage-histogram.py - build coverage histograms
- make-random.py - make random DNA; see dbg-graph-null project.
- make-read-stats.py - see readstats.py
- multi-stats.py - see readstats.py
- multi-velvet.py - better done with parallel
- normalize-by-min.py - normalize by min k-mer abundance in seq; just a bad idea
- occupy.py - no longer needed utility script
- parse-bowtie-pe.py - no longer needed utility script
- parse-stats.py - partition stats
- partition-by-contig.py - various approaches to partitioning
- partition-by-contig2.py - various approaches to partitioning
- partition-size-dist-running.py - various approaches to partitioning
- partition-size-dist.py - various approaches to partitioning
- path-compare-to-vectors.py - ??
- print-exact-abund-kmer.py - ??
- print-high-density-kmers.py - display high abundance k-mers
- quality-trim-pe.py - no longer needed utility script
- quality-trim.py - no longer needed utility script
- reformat.py - FASTA sequence description line reformatter for partitioned files
- remove-N.py - eliminate sequences that have Ns in them
- softmask-high-abund.py - softmask high abundance sequences (convert ACGT to acgt)
- split-fasta-on-circum.py - various ways of breaking sequences on graph properties
- split-fasta-on-circum2.py - various ways of breaking sequences on graph properties
- split-fasta-on-circum3.py - various ways of breaking sequences on graph properties
- split-fasta-on-circum4.py - various ways of breaking sequences on graph properties
- split-fasta-on-degree-th.py - various ways of breaking sequences on graph properties
- split-fasta-on-degree.py - various ways of breaking sequences on graph properties
- split-fasta-on-density.py - various ways of breaking sequences on graph properties
- split-N.py - truncate sequences on N
- split-reads-on-median-diff.py - various ways of breaking sequences on graph properties
- summarize.py - sequence stats calculator
- sweep_perf.py - benchmarking tool
- test_scripts.py - old test file
- traverse-contigs.py - deprecated graph traversal stuff
- traverse-from-reads.py - deprecated graph traversal stuff
- validate-partitioning.py - unneeded test