smoove

smoove simplifies and speeds up calling and genotyping structural variants (SVs) from short reads. It also improves specificity by removing many spurious alignment signals that indicate low-level noise and often contribute to false calls.

There is a blog post describing smoove in more detail.

It supports both small cohorts in a single command and population-level calling in 4 steps, 2 of which are parallel by sample.

It requires:

lumpy and lumpy_filter

And optionally (but all highly recommended):

svtyper: to genotype SVs
svtools: required for large cohorts
samtools: for CRAM support
gsort: to sort final VCF
bgzip+tabix: to compress and index final VCF
mosdepth: to remove high-coverage regions
bcftools: version 1.5 or higher for VCF indexing and filtering.

Running smoove without any arguments will show which of these are found so they can be added to the PATH as needed.
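As a rough stand-in for that check (this is not smoove's own implementation), the sketch below looks up the tools listed above on the PATH. The helper name check_tools is hypothetical; the program names are the ones given in the requirements.

```shell
# check_tools is a hypothetical helper, not part of smoove.
# It reports which of the named programs are on PATH and
# returns non-zero when at least one is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}

# Tool names are taken from the requirements listed above.
check_tools lumpy lumpy_filter svtyper svtools samtools gsort \
    bgzip tabix mosdepth bcftools ||
    echo "add the missing tools to PATH before running smoove"
```

Running smoove with no arguments remains the authoritative check; this only tests whether the executables can be found.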

smoove will:

parallelize calls to lumpy_filter to extract split and discordant reads required by lumpy
further filter the output of lumpy_filter to remove high-coverage, spurious regions and user-specified chroms like 'hs37d5'; it will also remove reads that we've found are likely spurious signals. After this, it will remove singleton reads (where the mate was removed by one of the previous filters) from the discordant BAMs. This makes lumpy much faster and less memory-hungry.
calculate per-sample metrics for mean, standard deviation, and distribution of insert size as required by lumpy.
stream output of lumpy directly into multiple svtyper processes for parallel-by-region genotyping while lumpy is still running.
sort, compress, and index final VCF.

installation

you can get smoove and all dependencies via (a large) docker image:

docker pull brentp/smoove
docker run -it brentp/smoove smoove -h

Or, you can download a smoove binary from here: https://github.com/brentp/smoove/releases

usage

small cohorts (n < ~ 40)

for small cohorts it's possible to get a jointly-called, genotyped VCF in a single command.

smoove call -x --name my-cohort --exclude $bed --fasta $fasta -p $threads --genotype /path/to/*.bam

output will go to ./my-cohort-smoove.genotyped.vcf.gz

the --exclude option ($bed above) is optional but can be used to remove problematic regions.
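Because the exclude BED is optional, a wrapper script may want to add the flag only when a file is given. The following is a minimal dry-run sketch: small_cohort_cmd is a hypothetical function and ref.fa, bad.bed, a.bam and b.bam are placeholder file names. It only prints the command, using the flags shown above.

```shell
# Hypothetical dry-run wrapper: print the `smoove call` command for a
# small cohort, adding --exclude only when a BED file is supplied.
# Usage: small_cohort_cmd NAME FASTA THREADS EXCLUDE_BED_OR_EMPTY BAM...
small_cohort_cmd() {
    name=$1; fasta=$2; threads=$3; exclude=$4
    shift 4
    # Remaining arguments are the BAM files; put --genotype before them.
    set -- --genotype "$@"
    if [ -n "$exclude" ]; then
        set -- --exclude "$exclude" "$@"
    fi
    echo smoove call -x --name "$name" --fasta "$fasta" -p "$threads" "$@"
}

# Without an exclude file (placeholder inputs):
small_cohort_cmd my-cohort ref.fa 4 "" a.bam b.bam
# With an exclude file:
small_cohort_cmd my-cohort ref.fa 4 bad.bed a.bam b.bam
```

Removing the leading echo inside the function would run the command instead of printing it.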

population calling

For population-level calling (large cohorts) the steps are:

For each sample, call genotypes:

smoove call --outdir results-smoove/ --name $sample --fasta $fasta -p $threads --genotype /path/to/$sample.bam

output will go to results-smoove/$sample-smoove.genotyped.vcf.gz

Get the union of sites across all samples (this can be parallelized across as many CPUs or machines as needed):

# this will create ./merged.sites.vcf.gz
smoove merge --name merged -f $fasta --outdir ./ results-smoove/*.genotyped.vcf.gz

Genotype each sample at those sites (this can be parallelized across as many CPUs or machines as needed):

smoove genotype -x -p 1 --name $sample-joint --outdir results-genotyped/ --fasta $fasta --vcf merged.sites.vcf.gz /path/to/$sample.bam

Paste all the single-sample VCFs, which have the same number of variants, to get a single, squared, joint-called file:

smoove paste --name $cohort results-genotyped/*.vcf.gz

(optional) annotate the variants with overlapping exons and UTRs from a GFF, and annotate high-quality heterozygotes:

smoove annotate --gff Homo_sapiens.GRCh37.82.gff3.gz $cohort.smoove.square.vcf.gz | bgzip -c > $cohort.smoove.square.anno.vcf.gz

This adds an SHQ (Smoove Het Quality) tag to the FORMAT field of every sample: a value of 4 is a high-quality call, a value of 1 is low quality, and -1 is non-het. It also adds an MSHQ (Mean SHQ) tag to the INFO field, which is the mean SHQ score across all heterozygous samples for that variant.

As a first pass, users can look for variants with MSHQ > 3
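One way to apply such a first-pass threshold is sketched below with awk on uncompressed VCF text. filter_mshq is a hypothetical helper and the two records are made-up illustration data; the sketch assumes MSHQ appears in the INFO column as MSHQ=<number>.

```shell
# Hypothetical helper: keep header lines and records whose INFO/MSHQ
# is greater than the threshold given as the first argument.
# Reads VCF text on stdin, writes the filtered VCF text to stdout.
filter_mshq() {
    awk -F '\t' -v min="$1" '
        /^#/ { print; next }
        {
            # INFO is column 8; its entries are separated by ";".
            n = split($8, kv, ";")
            for (i = 1; i <= n; i++) {
                if (kv[i] ~ /^MSHQ=/) {
                    if (substr(kv[i], 6) + 0 > min + 0) print
                    break
                }
            }
        }'
}

# Made-up two-record example: only the record at POS 100 passes MSHQ > 3.
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n1\t100\t.\tN\t<DEL>\t.\t.\tSVTYPE=DEL;MSHQ=3.5\n1\t200\t.\tN\t<DUP>\t.\t.\tSVTYPE=DUP;MSHQ=1.2\n' |
    filter_mshq 3
```

On a real file this could be fed from a decompressed stream of $cohort.smoove.square.anno.vcf.gz. With bcftools installed, an include expression such as bcftools view -i 'MSHQ>3' should give an equivalent filter and is likely the more robust choice.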

Troubleshooting

A panic with a message like Segmentation fault (core dumped) | bcftools view -O z -c 1 -o most likely means you have an old version of bcftools (see #10).
smoove will write to the system TMPDIR. For large cohorts, make sure to set this to something with a lot of space, e.g. export TMPDIR=/path/to/big
smoove requires a recent version of lumpy and lumpy_filter, so build those from source or get the most recent bioconda version.
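To rule out the old-bcftools case before a long run, the installed version can be compared against the 1.5 minimum stated in the requirements. This is a sketch only: version_ge is a hypothetical helper, and it assumes the first line of bcftools --version has the form "bcftools <version>".

```shell
# Hypothetical helper: succeed when dotted version $1 is >= $2,
# comparing each dot-separated field numerically.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, "."); nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            if (x[i] + 0 > y[i] + 0) exit 0
            if (x[i] + 0 < y[i] + 0) exit 1
        }
        exit 0
    }'
}

if command -v bcftools >/dev/null 2>&1; then
    # Assumes the first line looks like "bcftools 1.9".
    found=$(bcftools --version | awk 'NR == 1 { print $2 }')
    if version_ge "$found" 1.5; then
        echo "bcftools $found is new enough"
    else
        echo "bcftools $found is older than 1.5" >&2
    fi
else
    echo "bcftools not found on PATH" >&2
fi
```

Comparing field by field matters here: a plain string or decimal comparison would rank 1.10 below 1.5.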
