Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

strobealign: Too many reference sequences. #421

Open
lfdelzam opened this issue Apr 30, 2024 · 10 comments · May be fixed by #455
Open

strobealign: Too many reference sequences. #421

lfdelzam opened this issue Apr 30, 2024 · 10 comments · May be fixed by #455

Comments

@lfdelzam
Copy link

Hi, I am using the latest version 0.13, and when mapping reads against co-assembled contigs, no sam file is created. This message appears in the log file:

This is strobealign 0.13.0
Estimated read length: 144 bp
Time reading reference: 34.04 s
Reference size: 15899.10 Mbp (17425146 contigs; largest: 1.20 Mbp)
strobealign: Too many reference sequences. Current maximum is 16777215

It seems the indexing step is not working.

The same version worked when mapping the same samples against individually assembled contigs. Here is the log file:
This is strobealign 0.13.0
Estimated read length: 145 bp
Time reading reference: 2.20 s
Reference size: 849.75 Mbp (996534 contigs; largest: 0.74 Mbp)
Indexing ...
Time counting seeds: 1.23 s
Time generating seeds: 3.40 s
Time sorting seeds: 4.05 s
Time generating hash table index: 1.83 s
Total time indexing: 10.52 s
Running in paired-end mode
Done!
Total mapping sites tried: 114025071
Total calls to ssw: 16525791
Inconsistent NAM ends: 52038
Tried NAM rescue: 10875413
Mates rescued by alignment: 3786350
Total time mapping: 330.52 s.
Total time reading read-file(s): 43.19 s.
Total time creating strobemers: 40.19 s.
Total time finding NAMs (non-rescue mode): 79.54 s.
Total time finding NAMs (rescue mode): 1.79 s.
Total time sorting NAMs (candidate sites): 1.94 s.
Total time base level alignment (ssw): 104.23 s.
Total time writing alignment to files: 0.00 s.

Thanks in advance for your time and help.

@lfdelzam
Copy link
Author

by removing contigs < 500bs, the number of contigs is lower than 16777215 and it works. Why is 16777215 the limit? would it be possible to modify it?

@marcelm
Copy link
Collaborator

marcelm commented Apr 30, 2024

Hi, the limit is so because the indexing data structure that strobealign uses reserves 24 bits for the contig index. The maximum number of contigs is thus $2^{24}=16777216$. (I think the message should say 16777216 instead of 16777215.)

This is currently a hard limit that cannot be changed by command-line parameters.

We have been thinking about changing the way the index is stored in memory (for example in #227) and some of these changes could free up some bits that could then be used for the contig index, but we have not made any decisions.

@ksahlin
Copy link
Owner

ksahlin commented Apr 30, 2024

Given the recent interest in aligning/mapping to large metagenomics datasets, I would consider even allocating a full 32 bits for the contig ID, and steal the 8-bits for the strobe offset from the 64bit hash value. Having the 8 bits offset as the lowest 8 bits in the hash value may even be a feature since we can look for/prefer strobemer matches with matching length if multiple matches (this is a bulletpoint in the development document).

@marcelm
Copy link
Collaborator

marcelm commented Oct 22, 2024

With multi-context seeds, the hash value is split into a 40 bit base hash and 24 bit auxiliary hash. Which one shall we reduce by 8 bits? I.e., split up like 32/24/8 or 40/16/8?

Also, the mcs bit (that denotes whether first or second strobe is main) needs to be put somewhere. It probably doesn’t matter whether we take it from the hash or the reference index. It may be slightly nicer in the code if we take it from the reference index. I guess $2^{32}$ contigs is still plenty? But taking it from the hash is also not hard.

@ksahlin
Copy link
Owner

ksahlin commented Oct 22, 2024

I guess we can't use any of the upper N bits that are unused after indexing, because we need to allocate the bits already at start. We also need to assert that N < B (N prefix lookup bits, B base hash bits), otherwise our partial search wont work. I think 32 is too low once we index things >10Gb. I would do 40/16/8 or maybe 38/18/8 as default.

the single extra bit I have no strong opinion on (yet:) 2^32 is needed for positions in chromosomes from a few genomes like the mistletoe genome, but probably we want to tailor more to metagenomic datasets with more references. If it's equally big surgery in code to put it in the position and reference index, I would opt for allocating it in the position allocation. But I am also fine with the ref index. Locating it in the hash might be the best choice.

marcelm added a commit that referenced this issue Oct 22, 2024
This changes the RefRandstrobe struct in the following way.

Previously, bits were allocated in the following way:
- 40 bits base hash
- 24 bits auxiliary hash
- 32 bits position
- 23 bits reference/contig index
- 8 bits strobe2 offset
- 1 bit first_strobe_is_main flag

We take 8 bits away from the auxiliary hash and use those bits for the reference
index. New layout:

- 40 bits base hash
- 16 bits auxiliary hash
- 8 bits strobe2 offset
- 32 bits position
- 31 bits reference/contig index
- 1 bit first_strobe_is_main flag

Closes #421
marcelm added a commit that referenced this issue Oct 22, 2024
This changes the RefRandstrobe struct in the following way.

Previous layout:
- 40 bits base hash
- 24 bits auxiliary hash
- 32 bits position
- 23 bits reference/contig index
- 8 bits strobe2 offset
- 1 bit first_strobe_is_main flag

We take 8 bits away from the auxiliary hash and use those bits for the reference
index. New layout:

- 40 bits base hash
- 16 bits auxiliary hash
- 8 bits strobe2 offset
- 32 bits position
- 31 bits reference/contig index
- 1 bit first_strobe_is_main flag

Closes #421
@marcelm marcelm linked a pull request Oct 22, 2024 that will close this issue
@marcelm
Copy link
Collaborator

marcelm commented Oct 23, 2024

I guess we can't use any of the upper N bits that are unused after indexing, because we need to allocate the bits already at start.

I guess it may be possible somehow, but it’s probably inefficient. Now that you mention it, and unrelated to this issue: One idea could be to store the number of occurrences of a hash in there. Then we would no longer need to use linear or binary search to find how many hashes there are of a value.

the single extra bit I have no strong opinion on (yet:) 2^32 is needed for positions in chromosomes from a few genomes like the mistletoe genome

It seems that even $2^{32}$ wouldn’t be enough for the mistletoe genome because most (all?) chromosomes are larger than 4 Gbp. However, in the publicly available reference, the chromosomes are split up into parts that are 2 Gbp. I think they want to ensure that it is possible to use these contigs with tools that use signed integer indices, where that is the maximum size (Java doesn’t have unsigned integers, for example). Not sure what the conclusion is ...

Locating it in the hash might be the best choice

Something like 38/17/1/8 (base, aux, mcs flag, offset)?

@ksahlin
Copy link
Owner

ksahlin commented Oct 23, 2024

unrelated to this issue: One idea could be to store the number of occurrences of a hash in there. Then we would no longer need to use linear or binary search to find how many hashes there are of a value.

Funny that you mention it. I initially suggested this when implementing the AEMB index. I think Shaojun even implemented counts in there in one version. I don't know why we never used that version, perhaps because there was not a large speedup (on some genome he tried on?) while adding extra steps to the indexing. But for repetitive genomes I think it would speed it up, e.g. maize and rye. So yes, good suggestion.

Something like 38/17/1/8 (base, aux, mcs flag, offset)?

Thinking ahead on development, perhaps do 38/17/8/1 (base, aux, mcs flag, offset). One idea we had was that for very repetitive seeds, we at least want to prefer matches with the same length in ref and query. the leftmost bit for mcs flag should then be ignored for matches. The seed length (ie offset) be used when checking if filtered (check if >1000 hits with identical length, not just identical "hash"), or when selecting a subset of hits randomly (something we discussed in a meeting).

@ksahlin
Copy link
Owner

ksahlin commented Oct 23, 2024

@luispedro should be CCed also here in regards to:

Funny that you mention it. I initially suggested this when implementing the AEMB index. I think Shaojun even implemented counts in there in one version. I don't know why we never used that version, perhaps because there was not a large speedup (on some genome he tried on?) while adding extra steps to the indexing. But for repetitive genomes I think it would speed it up, e.g. maize and rye. So yes, good suggestion.

@marcelm
Copy link
Collaborator

marcelm commented Oct 23, 2024

Something like 38/17/1/8 (base, aux, mcs flag, offset)?

Thinking ahead on development, perhaps do 38/17/8/1 (base, aux, mcs flag, offset). One idea we had was that for very repetitive seeds, we at least want to prefer matches with the same length in ref and query. the leftmost bit for mcs flag should then be ignored for matches. The seed length (ie offset) be used when checking if filtered (check if >1000 hits with identical length, not just identical "hash"), or when selecting a subset of hits randomly (something we discussed in a meeting).

I think we may be talking about two slightly different ideas.

In what I currently have in PR #455, the number of bits for hash values is just reduced from 64 to 55 bits. That the 9 bits that become available are used for the offset and mcs flag is a detail that is internal to RefRandstrobe. What is used for comparisons and for lookup is this 55 bit hashvalue. It does not matter in which order the mcs flag and offset bit are.

If I understand correctly, what you are suggesting is that the mcs offset essentially becomes part of the hash itself. This is something that IMO should be implemented and evaluated separately. It’s easy to switch the order of things around at that point.

@ksahlin
Copy link
Owner

ksahlin commented Oct 23, 2024

I think we may be talking about two slightly different ideas...If I understand correctly, what you are suggesting is that the mcs offset essentially becomes part of the hash itself.

Yes and no. We can say that the hash is now only 55 bits (for some of the core look-up operations in strobealign). But when needed (e.g. for very repetitive seeds), one can use the extra 8 offset bits to narrow down the hits. So not strictly part of the hash but still related.

I agree that it should be implemented and evaluated separately. If we decide to test it (which I think would be interesting) my idea was that if we do 38/17/8/1 and sort by the first 64-bit INT (previously simply the hash value of the seed) the offsets would be sorted which enables our regular binary and linear searches over also the offsets .

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants