A fix for HHfilter #182

Merged
merged 1 commit into from
Jan 16, 2020


@huhlim (Contributor) commented Jan 14, 2020

There was a bug when two sequences had no overlapping columns at all, for example:

>seq_0
AAAA----AAAA
>seq_1
----BBBB----

Here cov_kj = 0 and diff = 0, but diff_suff is non-zero, so seq_1 is filtered out even though the two sequences do not overlap.

An alternative fix would be:
if (diff < diff_suff && float(diff) <= diff_min_frac * cov_kj && cov_kj > 0)
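To make the failure mode concrete, here is a minimal, self-contained sketch of the check on the two sequences above. Only the two conditions are taken from this PR description. The `PairStats` struct, the `comparePair` helper, the way `cov_kj` and `diff` are counted, and the values chosen for `diff_suff` and `diff_min_frac` are illustrative assumptions, not the actual HH-suite code. The sketch reads a true condition as "the sequence is treated as redundant and filtered out".

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for the per-pair statistics used by the filter.
struct PairStats {
    int cov_kj;  // columns where both sequences have a residue (assumed meaning)
    int diff;    // covered columns where the residues differ (assumed meaning)
};

// Hypothetical helper: compare two aligned sequences column by column.
PairStats comparePair(const std::string& a, const std::string& b) {
    PairStats s{0, 0};
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i) {
        if (a[i] == '-' || b[i] == '-') continue;  // skip columns with a gap
        ++s.cov_kj;
        if (a[i] != b[i]) ++s.diff;
    }
    return s;
}

// Condition without the guard: true even when nothing overlaps.
bool isRedundantUnguarded(const PairStats& s, int diff_suff, float diff_min_frac) {
    return s.diff < diff_suff && float(s.diff) <= diff_min_frac * s.cov_kj;
}

// Condition from the PR description, with the extra cov_kj > 0 guard.
bool isRedundantGuarded(const PairStats& s, int diff_suff, float diff_min_frac) {
    return s.diff < diff_suff && float(s.diff) <= diff_min_frac * s.cov_kj &&
           s.cov_kj > 0;
}

// Walks through the non-overlapping example with illustrative thresholds.
void runExample() {
    const int diff_suff = 1;           // assumed non-zero value
    const float diff_min_frac = 0.1f;  // assumed value

    PairStats none = comparePair("AAAA----AAAA", "----BBBB----");
    assert(none.cov_kj == 0 && none.diff == 0);
    assert(isRedundantUnguarded(none, diff_suff, diff_min_frac));  // wrongly filtered
    assert(!isRedundantGuarded(none, diff_suff, diff_min_frac));   // kept

    PairStats same = comparePair("AAAA", "AAAA");
    assert(isRedundantGuarded(same, diff_suff, diff_min_frac));    // still filtered
}
```

With zero overlap both sides of the second comparison are 0, so `0 <= 0` holds and only the `cov_kj > 0` guard prevents the sequence from being dropped. Identical overlapping sequences are still treated as redundant.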

@martin-steinegger martin-steinegger merged commit 318cbc1 into soedinglab:master Jan 16, 2020
@martin-steinegger (Member)

Great catch! Thank you Lim.
You actually fixed two bugs now. I will backport this to MMseqs2 as well.

martin-steinegger added a commit to soedinglab/MMseqs2 that referenced this pull request Jan 16, 2020
RuoshiZhang added a commit to soedinglab/spacepharer that referenced this pull request May 12, 2020
46c843895 Update combine pval agg-mode 3
67d610136 Disable fancy progress bars on travis to reduce output
203a21736 Updated two more tests to use tighter ROC thresholds
a9052f449 Update regression with tighter bounds for ROC tests
c62736a6d Correctly parse keys from data files in filterdb --filter-file This was causing a linsearch instability
fe007cb4e Use MultiParam for gapOpen, gapExtend costs
3513001d3 Add easy-rbh workflow
d0d3032e9 Fix RBH search if using -a to show alignments
ce1a43bf1 Merge branch 'master' of https://github.com/soedinglab/mmseqs2
ea24e4934 Fix issues with abs. path if using aria2c
5228745f5 Improve --alignment-mode parameter description and make it a non expert parameter
fffa9b10e Fix various inconsistencies and usability issues with alignall: * alignall alignment-mode did not correspond to align alignment-mode * add-backtrace did not do anything, has to be specified now if backtrace is needed * Did return a alignment db type even though it is incompatible with that type, uses generic for now * various parameters were passed but unused   - zdrop and scorebias are used now (however see below)   - realign, alt ali, max accept/reject, wrapped are now gone
290668474 Fix wrong warning
813d81f29 Update regression
264d78117 Switch greedy clustering algorithm back to old idea
c09f6574e Improve nucleotide clustering workflow
38a737708 Set k-mers in linclust to 0 for the nucleotide clustering
7df6e3f75 Replace characters that can not be reversed by N in extract frames
e9678f625 Update regression
f886e868f Add nucleotide support to cluster (workflow nucleotide_clustering), clust module will infer identity automatically if missing, Improve low. mem. greedy incremental algorithm, Update regression
5f8735872 Add kmers-per-sequence-scale to linsearch
0310eb607 Change --kmer-per-seq-scale to a multi parameter, add error if cluster is called with a nucleotide sequence
e258bc8d8 Fix #299 PDB70 database creation was not working
7095f37e4 Add support reverse complemente in rescorediagonal --rescore-mode 0 and 1
61ca48883 Fix result2dnamsa
70d014e41 Add search-type 4 to Search
462f24cbb Add module result2dnamsa
5670d990e Fix regression error
e4451d591 Add result direction parameter to kmersearch
12c499dcd Fix reverse sequences issues in linclust and linsearch
44499c3ce Update filterdb regression test
807b4a56a Fix issue soedinglab/MMseqs2#290. Filterdb checked for mode == true but mode was 2.
24479bc27 Fix Docker
a578f52a7 Fix char signedness on PPC
a0d64a989 Update regression
a07a266f9 Working on PPC64LE support
09734177c Remove remaining _mm_shuffle_epi32
cdef78a69 Merge pull request #285 from hgsommer/misc_small
283c8d03f Replace goto end in ssw
6bfc50281 Fix c/p mistake in convertalignments
e61da3447 Fix spelling of 'length'
9a63760fa Replace nested ternary operator
4349b5c6e Avoid repeatedly checking for profile db types
c170a11f5 Call MsaFilter::shuffleSequences() from MsaFilter::filter()
ef49ba220 Return value from MsaFilter::filter()
d155dc36c Replace int by bool literals for bool variable
ec6722adc Align headings with column in PSSMCalculator::printProfile()
548a9bd68 Avoid forward declaration of ScoreMatrix
d0fbe471f Do some cleanup in StripedSmithWaterman.cpp
91d1aeddc Replace check for zero-sized containers by empty()
e47b8eed9 Remove superfluous parameter from ssw_init()
250b1221d Simplify return statements
4fe1116ae Remove counting zero scores in Sequence::mapProfile()
4303728b5 Replace multiplication by zero
1bd602420 Remove increment by zero
e4d4389f2 Move check for exit condition in front of allocations
556d26d1a Clean up function signatures in MultipleAlignment
3863af9ac Move include back to header to restore build
e1208493a Remove unused TmpResult score field
1fd4db8f2 Die if DBReader cannot reopen files (e.g. no more file handles left)
1e21b87ba Purge sequenceLookup early since its recreate in split databases
40854ddcd Prefiltering and CacheFriendlyOperations refactoring
2433e086b WASM work in progress
14014cd0e Fix prefilter overflow instability
e0f971848 Add conda forge to conda install instructions
aa175d636 Fix off by one in kmermatcher soedinglab/MMseqs2#274 (comment)
d1607bc8a Remove LINE_MAX
eca2155d7 Clear string buffer instead of reassigning in swapresults
0f4645edd Fix wrong reverse marking in linsearch reported by UBSAN
5b612a327 Missing mpi binaries for travis regression
83d22417a Next try for ARM compiler flags
7ad122f0a Missed a few variables
ac7914bea Do not require a cmake variable to build ARM
0dcfaadbb Update regression to fix broken samtools call on ARM
29927b4c4 More NEON fixes, we assume signed chars, ARM uses unsigned by default
7760220ff Next try to get the ARM regression to work
cc6d0d52b Add hack to not break travis log size limit
5408c3d10 Try to get NEON to compile
83192cabd Fix search workflow parameters printed twice
f6f001c8c Fix new clang-10 warnings and further travis fixes
259e64341 llvm-10 alias is not whitelisted in travis yet
b1249fd54 Fix errors in Travis YAML from previous commit
18486d4c5 Update travis - use native aarch64 for neon - use xenial - shorten script
98c37f3c3 shortend MultiParam usage, improved line breaks in usage
c9be07f1a Add gcc-9 to travis
2e5fb309a Fix travis clang build
d5865c894 Remove MultiParam g++-9 warning
73679835b Rework target split merging
ca5869397 Fix RESSIZE issue in slice search if sequences are used
491900b99 Improve usage text of cluster/linclust
0166850a2 Remove old greedy incremental clustering code and just run the memory efficient version instead.
15163e64c Fix Verbosity in workflows
aa78af463 Fix issue soedinglab/MMseqs2#274
7846dfce3 fixed clang template error
e1206371c extended MultiParam class, replaced ScoreMatrixFile type by MultiParam<char*>
b88b54756 rewrite alphabetSize as multi parameter
ecb4e35d4 started template class MultiParam to store sequence type specific values
e1a1c1226 changed dbtype comparision in AlignmentSymmetry
2a829aef7 Replace symlinkat call with getcwd/chdir/symlink/chdir to fix Conda build using macOS 10.9 SDK
28e83e8d5 Add OpenMP include to DBReader
fb00aa0c3 Fix realloc issue while IndexTable creation of profiles
504e5021f Take max. seq. len of query and target db in prefilter and alignment
16e235214 Fix bug if seq. len > max seq. length in Alignment
80d0187de Fix asan issue
751f5c19f Make ZDROP an expert parameter, change description text
1b6edd0d4 Rework x detection (SIMD)
9677254ab Merge branch 'master' of https://github.com/soedinglab/mmseqs2
1ac1e6866 Fix max seq issues in prefilter
cb737033c Reset download strategy to not use aria2c for the NCBI download
c95f3ee0e fixed ksw2 test
72b95c0ce Error if we cannot download from NCBI
1d0aad50b Fix databases not piecing togehter all kalamari accessions
516723d53 Merge branch 'master' of https://github.com/soedinglab/MMseqs2
d81b6cca5 added zdrop parameter to control banded nucleotide alignment
e2e39a971 Add Kalamari Contaminants database
c0c538ea3 Various fixes in databases script
08cc95b3a Fix createtaxdb redownloading when taxdump already exists
018eb3498 Remove a bit whitespace in front of each parameter in usage message
8aa7513de add aggregatetax example, fix typos
8bcd7c740 Fix typo
8e581b762 Rework usage texts
7dc25764a Hide most parameters from createindex
2baa609e8 Add examples to many modules
00a7d7696 fixed bugs for long or wrapped nucleotide sequences
a4bdcb478 eggNOG profiles should not depend on the deleted MSAs
4c7830954 Fix eggNOG database construction
f7a5599c8 Cleanup not needed files immediately in databases workflow
3ed3690d4 Fix downloads always restarting in databases workflow
4cfac9a8a Fix aria warning with more than 16 connections
e0a00e10d Revert "Use SW instead of BandedNucAln if we don't have diagonals"
7ac966b2e Fix result2msa could fail if it was writing compressed output
95729ac7c Fix wrong output DB type written in alignall
f899e7c7a Use SW instead of BandedNucAln if we don't have diagonals
c08d9fa8e Allow parameter descriptions to span multiple lines
57868498e MMseqs2 is not limited to proteins, update README to reflect that
11818b0a2 Cleanup hiding parameters in workflows
c481cea60 Remove some useless includes
2f64aeeb8 Fix databases timestamp appending instead of overwriting
ae9e9e329 Add eggNOG setup procedure to databases
31c8e5d50 Shorten two short parameter descriptions
2f49d3e3e Read header from lookup in msa2profile if available
1356869b0 add option to reverese profile dbs
ac3482e80 More issues with zlib and tar2db
aaafafe43 Fix tar2db keys
c751d9e2f More tar2db fixes
a9c93014c Fix variadic input to tar2db
51a761305 Add tar2db module to convert content of any tar to a DB
96f9a91e5 Use nedmalloc on Windows/Cygwin
73f5c2a2d Add databases workflow to README
5a7ac9e54 make align output consistent
c5ebe5297 fixed setcover cluster mode (by fixing bug in similarity reading for short aln results e.g. hamming distance aln)
481696b5f Fix databases output
c6b4a57a8 Beginning cleaning up parameter descriptions
a9552a177 Show default value of bool parameters
af89c4677 Add a proposed example text structure
9c17f4eba Rework module description texts, better categories, shorten all descriptions, prepare to replace long descriptions with examples
00ff199e8 Add Resfinder DB
f1011ecb4 Fix krona again marked as vendored
02001ab03 missing mode resulted in different top1
4375463bc Header db should not have to be a unsplit db
edccbf33f Actually fix extractorfs lookup creation
041e8e558 Improve README
a8f2c7bad Remove correct workflow script in createtaxdb.sh
26c8202a9 print createdb cmd line again
df02bae34 Refactor createseqfiledb, remove stringstream
2523ebe1a do not write null byte
af847a724 Fix clang warning from DBConcat
ef1ec596f extend dbconcat to handle auxillary files
528bd2134 not needed
dec1b9215 Silence warning in GCC 4.8 casting function to void*
2d44c886d Fix extractorfs not being able to create lookup
ffe66afac Replace isnumber with isdigit. Add more tests to TestTaxExpr
fbe09867e Rework Taxon Expr parsing
f58329ef5 Add constructor to define custom functions to ExpressionParser
b6ef07281 Initialize expressionparser per thread, was not thread safe
f966bfa62 Fix reallocation issue in BandedAlignment
bbd3c2bb7 Add +1 to realloc in BandedNucleotideAligner but not to length
6b6e82ae6 Add +1 to realloc in mapSequence
75e2c8ec4 Fix off by one issues in realloc in rescorediagonal and BandedNucleotideAligner
afd14c8c2 First step to get rid of maxSeqLen
13ca612db Fix allocation issue in kermatcher if sequences are longer than > 2^16
62de5ba93 Fix off by one in computation for splits in kmermatcher
35e95d180 Change int_sequence to char (big change)
ecf82f2f4 Revert "Temporarily disable soft split mode for createdb in easy workflows"
d19219dd4 Merge branch 'master' of https://github.com/soedinglab/mmseqs2
1a0d898ec Fix softlink issue in createdb soedinglab/MMseqs2#265
13e0fe466 Temporarily disable soft split mode for createdb in easy workflows
4487b6e14 Fix view module to work with softlinked createdb dbs
c1e9eb0e3 Fix MPI issue if only one server is used
e781c3fe5 fix MPI compile error
9bcff2844 Fix Filter2 bug of HH-suite in MMseqs2 soedinglab/hh-suite#182
01db79d33 Fix some bugs in splitting handling
d9a887453 Fix memory splitting issues in kmermatcher, kmerindexdb
37880f083 Fix MPI in kmermatcher and indexdb
bee93123f Update regression
03a89ff1c Merge branch 'master' of https://github.com/soedinglab/mmseqs2
6ca967362 Update the way how k-mers are extracted in kmermatcher. Extraction should be now ~3 times faster.
f1388309d Introducing databases workflow to automatically setup and download common databases
d78fdbb06 Add progress to convertmsa
18acba224 Do not recreate _mapping file if it already exists in createtaxdb
63a373f5a Skip validations steps correctly if a input db is neither INPUT nor OUTPUT
d95caa1a7 Allow modules with zero parameters
9f8aff948 Allow modules to handle -h or --help themselves
cf5691f92 Typo
8ebc9d16b fixed access mode
31895414d Clarify parameter help in createdb
f644744a8 Merge branch 'master' of https://github.com/soedinglab/mmseqs2
c287719d9 Remove check for profiles for splice serach. It should also work with sequence databases.
c75fe9acf regression submodule w filtertaxseqdb
7587a872f Add one more missing check in kmermatcher
8d4e9f4fc Remove +1 from size in initKmerPositionMemory
aca141e95 Fix shellcheck error in splicesearch
8bdff50e1 Move +1 from initKmerPositionMemory outside
f12821e35 Merge branch 'master' of https://github.com/soedinglab/mmseqs2
d74b76ca5 Avoid overflow in kmermatcher if split is needed
fd90ff2c3 Move compiled data resources into subfolders
2fd9f25d2 Merge branch 'master' of https://github.com/soedinglab/mmseqs2
b439ce831 Make the slice search applicable to other databases types, not just profiles
589a2e276 Fix apply crashing on empty entries
82542a6ac Merge branch 'master' of https://github.com/soedinglab/mmseqs2
c0acdd8f3 Fix memory leak in createsubdb.
5129a956d Validate taxonomic ranks and make input/output formats consistent
53bb55b38 Fix issues in hash function soedinglab/MMseqs2#252
764c4a3e7 Fix lca message
c013a6929 Fix LCA output message
a1206690d Change db validator from result2stats
714f5b4fb Replace mmaped input file with std c io in createsubdb
6e43e9413 Add remove .source file to rmdb
3e58bb85b Fix result2flat soedinglab/MMseqs2#261
3e27833db Revert easycluster.sh back to result2flat. Reason is that createsubdb can not handle soft linked sequence databases (input.0 -> input.fas)
33354680f Merge branch 'master' of https://github.com/soedinglab/mmseqs2
1e92fb504 Replace result2repseq and result2flat with createsubdb and convert2fasta
55bcdd303 single step clustering could potential cluster unrelated sequences due to hash collisions
fdd0646b1 Fix clusthash issues with parallelization and nucl input
e62a1c717 Merge branch 'master' of https://github.com/soedinglab/mmseqs2
1336b7ad2 Add MSA to allDb and allDbAndFlat
48a037a2e Update Prefiltering.cpp
a1adbf52d Fix warning: Remove useless copy constructor from Matcher::result_t
d3ca42657 Remove truncatedCounter variable in QueryMatcher
4647525ec Show full help text if "Error in argument " occurs
4149ae457 Remove annoying message in prefilter (truncated result). Move it to the statistics section.
d5aab5b86 Update regression
1f1e049e6 Fix output of unclassified hits in convertalis
83ff5c601 Fix permission issues for tmp directory
cce6e6714 add support to output taxon in easy-search when using an indexed database
f200bdd62 Merge branch 'master' of https://github.com/soedinglab/mmseqs2
6f28a29ae Fix seg. fault if all sequences could be classified
473d60580 Update batches
b52668f6e Add chat icon
af54c8e8e Update README.md
7eb6a0b70 Makde addtaxonomy more resilient against invalid taxonomy mappings
3482b0e91 Merge pull request #260 from RuoshiZhang/master
36f49f5b5 Fix issue in memory computation for split
bcb97d63f Update README.md
abcd97de7 write same number of fields even if no hit
38e102181 Update regression to hopefully fix windows failure
f41511465 Fix spelling error
1fd24924e Add a search-type 4 for trans-trans search returning a nucl backtrace in offsetalignment
31f6d7ac3 add aggragatetax to assign set tax by majority vote
b6e8ee239 allow more dbtypes in swapdb
c9d02ef21 add option to view rank index
49db7258e typo fix
9c32930f3 Merge branch 'master' of github.com:soedinglab/MMseqs2
17b5494fe Fix auto detection of dbtype in createdb
8831df81d Merge branch 'master' of github.com:soedinglab/MMseqs2
be1a9822c Fix createseqfiledb soedinglab/MMseqs2#258
02be0c4ea Fix summarizeresult to support reverse position in alignment
7ef586276 added filtertaxseqdb
00f2fd2b8 added mode for all but index
127db8c6d minor tidying for filtertaxdb
8144e7653 Merge branch 'master' of github.com:soedinglab/MMseqs2
48f77fa7d Fix ASan issue in filterdb
d722d5724 Fix warning in filterdb
4a4e6ea15 Update regression test for filterdb
31a7dc124 filterdb --join-db ignores lines it cannot join instead of crash
6c6faa96d filterdb's --extract-lines works together with --trim-to-one-column
12bee8142 filterdb can filter by rows with value within percentage #249
5c919ab95 Allow double parameters separately from floats in parsing
f9be8a88d Remove broken filterdb paths
1dc04f5e1 Refactoring of filterdb
90e3a9aaf Fix bug for enforced dbtypes in createdb
a4cee78db New regression to check stdin support
17ec97c78 Add stdin support to easy workflows
76c9e7c36 Fix compiler warnings in KSeqWrapper
0cc45536b Overwrite dbtype correctly in createdb
c0045182b Add stdin to createdb
02a88e438 use https instead of ftp for downloading taxdb data
a33bd27f4 offsetalignments now correctly returns a nucleotide backtrace if needed
456e1b5ab include VTML40 in binary for easier access
775de3850 Add missed target .source file for reading in convertalis
c08c071b2 Overload patterncompiler isMatch for pos of match
ba6aa8d12 avoid appending extra tabs besthitperset

git-subtree-dir: lib/mmseqs
git-subtree-split: 46c8438958edccd8fd09640eb174e2449529e4df