strobealign: Too many reference sequences. #421
Hi, I am using the latest version 0.13, and when mapping reads against co-assembled contigs, no SAM file is created. This message appears in the log file:

This is strobealign 0.13.0
Estimated read length: 144 bp
Time reading reference: 34.04 s
Reference size: 15899.10 Mbp (17425146 contigs; largest: 1.20 Mbp)
strobealign: Too many reference sequences. Current maximum is 16777215

It seems the indexing step is not working. The same version worked when mapping the same samples against individually assembled contigs. Here is the log file:

This is strobealign 0.13.0
Estimated read length: 145 bp
Time reading reference: 2.20 s
Reference size: 849.75 Mbp (996534 contigs; largest: 0.74 Mbp)
Indexing ...
Time counting seeds: 1.23 s
Time generating seeds: 3.40 s
Time sorting seeds: 4.05 s
Time generating hash table index: 1.83 s
Total time indexing: 10.52 s
Running in paired-end mode
Done!
Total mapping sites tried: 114025071
Total calls to ssw: 16525791
Inconsistent NAM ends: 52038
Tried NAM rescue: 10875413
Mates rescued by alignment: 3786350
Total time mapping: 330.52 s.
Total time reading read-file(s): 43.19 s.
Total time creating strobemers: 40.19 s.
Total time finding NAMs (non-rescue mode): 79.54 s.
Total time finding NAMs (rescue mode): 1.79 s.
Total time sorting NAMs (candidate sites): 1.94 s.
Total time base level alignment (ssw): 104.23 s.
Total time writing alignment to files: 0.00 s.

Thanks in advance for your time and help.

Comments
By removing contigs < 500 bp, the number of contigs is lower than 16777215 and it works. Why is 16777215 the limit? Would it be possible to modify it?
Hi, the limit exists because the indexing data structure that strobealign uses reserves 24 bits for the contig index. The maximum number of contigs is thus 2^24 - 1 = 16777215. This is currently a hard limit that cannot be changed by command-line parameters. We have been thinking about changing the way the index is stored in memory (for example in #227), and some of these changes could free up bits that could then be used for the contig index, but we have not made any decisions.
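For illustration only, here is a minimal sketch (not strobealign's actual code; the names are made up) of why a 24-bit contig index caps the number of reference sequences at 16777215:

#include <cstdint>
#include <cstdio>

int main() {
    // Assumed: 24 bits reserved for the contig index.
    const int CONTIG_INDEX_BITS = 24;
    const uint64_t MAX_CONTIGS = (uint64_t{1} << CONTIG_INDEX_BITS) - 1;
    std::printf("Maximum number of contigs: %llu\n",
                static_cast<unsigned long long>(MAX_CONTIGS));  // prints 16777215
}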
Given the recent interest in aligning/mapping to large metagenomics datasets, I would consider even allocating a full 32 bits for the contig ID and stealing the 8 bits for the strobe offset from the 64-bit hash value. Having the 8-bit offset as the lowest 8 bits of the hash value may even be a feature, since we can look for/prefer strobemer matches of matching length when there are multiple matches (this is a bullet point in the development document).
With multi-context seeds, the hash value is split into a 40-bit base hash and a 24-bit auxiliary hash. Which one shall we reduce by 8 bits? I.e., split up like 32/24/8 or 40/16/8? Also, the mcs bit (which denotes whether the first or second strobe is the main one) needs to be put somewhere. It probably doesn't matter whether we take it from the hash or the reference index. It may be slightly nicer in the code if we take it from the reference index.
I guess we can't use any of the upper N bits that are unused after indexing, because we need to allocate the bits already at the start. We also need to assert that N < B (N prefix lookup bits, B base hash bits), otherwise our partial search won't work. I think 32 is too low once we index things >10 Gbp; I would do 40/16/8 or maybe 38/18/8 as default. On the single extra bit I have no strong opinion (yet :)). 2^32 is needed for positions in chromosomes of a few genomes, like the mistletoe genome, but we probably want to tailor more to metagenomic datasets with more references. If it's equally big surgery in the code to put it in the position or the reference index, I would opt for allocating it to the position. But I am also fine with the reference index. Locating it in the hash might be the best choice.
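For illustration only, a rough sketch of the 40/16/8 split discussed here (the prefix width N and all names are assumptions, not strobealign's actual implementation):

#include <cstdint>

// Assumed widths for a 40/16/8 split of the 64-bit hash word.
constexpr int BASE_HASH_BITS = 40;  // B: base hash
constexpr int AUX_HASH_BITS  = 16;  // auxiliary hash
constexpr int OFFSET_BITS    = 8;   // strobe2 offset
constexpr int PREFIX_BITS    = 28;  // N: prefix lookup bits (example value)

// The partial search requires the lookup prefix to lie entirely
// within the base hash, i.e. N < B.
static_assert(PREFIX_BITS < BASE_HASH_BITS, "need N < B");

uint64_t base_hash(uint64_t word) {
    return word >> (AUX_HASH_BITS + OFFSET_BITS);  // top 40 bits
}

uint64_t aux_hash(uint64_t word) {
    return (word >> OFFSET_BITS) & ((uint64_t{1} << AUX_HASH_BITS) - 1);
}

uint8_t strobe2_offset(uint64_t word) {
    return word & 0xFF;  // lowest 8 bits
}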
This changes the RefRandstrobe struct in the following way. Previous layout:
- 40 bits base hash
- 24 bits auxiliary hash
- 32 bits position
- 23 bits reference/contig index
- 8 bits strobe2 offset
- 1 bit first_strobe_is_main flag

We take 8 bits away from the auxiliary hash and use those bits for the reference index. New layout:
- 40 bits base hash
- 16 bits auxiliary hash
- 8 bits strobe2 offset
- 32 bits position
- 31 bits reference/contig index
- 1 bit first_strobe_is_main flag

Closes #421
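As a rough sketch only (bit-field packing is compiler-dependent, and the real RefRandstrobe may not use bit fields at all), the new layout corresponds to two 64-bit words:

#include <cstdint>

// Word 1: 40 + 16 + 8 = 64 bits; word 2: 32 + 31 + 1 = 64 bits.
struct RefRandstrobeSketch {
    uint64_t base_hash : 40;            // base hash
    uint64_t aux_hash : 16;             // auxiliary hash
    uint64_t offset : 8;                // strobe2 offset
    uint64_t position : 32;             // position within the contig
    uint64_t ref_index : 31;            // reference/contig index
    uint64_t first_strobe_is_main : 1;  // mcs flag
};
static_assert(sizeof(RefRandstrobeSketch) == 16, "two 64-bit words");

With 31 bits, the reference index can address up to 2^31 - 1 = 2147483647 contigs instead of 16777215.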
I guess it may be possible somehow, but it's probably inefficient. Now that you mention it, and unrelated to this issue: one idea could be to store the number of occurrences of a hash in there. Then we would no longer need to use linear or binary search to find how many occurrences of a hash value there are.
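A hypothetical sketch of that idea (widths and names assumed): if, say, the top 8 bits of each stored word turn out to be unused after indexing, a saturating occurrence count could be packed there and read back directly:

#include <cstdint>

// Assumed: the top 8 bits are free after indexing.
constexpr int COUNT_BITS = 8;
constexpr int VALUE_BITS = 64 - COUNT_BITS;
constexpr uint64_t VALUE_MASK = (uint64_t{1} << VALUE_BITS) - 1;

// Pack a count (saturated at 255) into the unused top bits.
uint64_t with_count(uint64_t value, uint64_t count) {
    const uint64_t c = count > 255 ? 255 : count;
    return (c << VALUE_BITS) | (value & VALUE_MASK);
}

// Read the count back without any linear or binary search.
uint64_t count_of(uint64_t word) {
    return word >> VALUE_BITS;
}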
Something like 38/17/1/8 (base, aux, mcs flag, offset)?
Funny that you mention it. I initially suggested this when implementing the AEMB index. I think Shaojun even implemented counts in there in one version. I don't know why we never used that version; perhaps because there was no large speedup (on some genome he tried it on?) while it added extra steps to the indexing. But for repetitive genomes, e.g. maize and rye, I think it would speed things up. So yes, good suggestion.
Thinking ahead on development, perhaps do 38/17/8/1 (base, aux, offset, mcs flag)? One idea we had was that for very repetitive seeds, we at least want to prefer matches with the same length in reference and query. The mcs flag bit should then be ignored for matches. The seed length (i.e., the offset) could be used when checking whether a seed is filtered (check if there are >1000 hits with identical length, not just an identical "hash"), or when randomly selecting a subset of hits (something we discussed in a meeting).
@luispedro should be CC'ed here as well, in regards to:
> Funny that you mention it. I initially suggested this when implementing the AEMB index. I think Shaojun even implemented counts in there in one version. I don't know why we never used that version, perhaps because there was not a large speedup (on some genome he tried on?) while adding extra steps to the indexing. But for repetitive genomes I think it would speed it up, e.g. maize and rye. So yes, good suggestion.
I think we may be talking about two slightly different ideas. In what I currently have in PR #455, the number of bits for hash values is just reduced from 64 to 55 bits. That the 9 bits that become available are used for the offset and the mcs flag is a detail that is internal to the RefRandstrobe struct. If I understand correctly, what you are suggesting is that the mcs offset essentially becomes part of the hash itself. This is something that IMO should be implemented and evaluated separately. It's easy to switch the order of things around at that point.
Yes and no. We can say that the hash is now only 55 bits (for some of the core lookup operations in strobealign). But when needed (e.g. for very repetitive seeds), one can use the extra 8 offset bits to narrow down the hits. So not strictly part of the hash, but still related. I agree that it should be implemented and evaluated separately. If we decide to test it (which I think would be interesting), my idea was that if we do 38/17/8/1 and sort by the full 64-bit integer (previously simply the hash value of the seed), the offsets would be sorted as well, which enables our regular binary and linear searches over the offsets too.
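A small sketch of that sorting idea, using the 38/17/8/1 widths from above (everything else is assumed, not strobealign's actual code): sorting by the full 64-bit word keeps entries with an equal 55-bit hash adjacent and, within each group, ordered by offset, so a binary search can also narrow hits down to a specific seed length:

#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

constexpr int FLAG_BITS   = 1;  // mcs flag, lowest bit
constexpr int OFFSET_BITS = 8;  // strobe2 offset

// Sorting the packed words orders them by (55-bit hash, offset, flag),
// since the hash occupies the most significant bits.
void sort_entries(std::vector<uint64_t>& words) {
    std::sort(words.begin(), words.end());
}

// Binary search for all entries with a given 55-bit hash and offset,
// ignoring the lowest (mcs flag) bit.
std::pair<std::vector<uint64_t>::const_iterator,
          std::vector<uint64_t>::const_iterator>
hits_with_length(const std::vector<uint64_t>& words,
                 uint64_t hash55, uint64_t offset) {
    const uint64_t key = (hash55 << (OFFSET_BITS + FLAG_BITS)) | (offset << FLAG_BITS);
    return {std::lower_bound(words.begin(), words.end(), key),       // flag = 0
            std::upper_bound(words.begin(), words.end(), key | 1)};  // flag = 1
}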