Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MEM/SMEM in target coordinate space #4

Closed
AndreaGuarracino opened this issue Oct 2, 2024 · 14 comments
Closed

MEM/SMEM in target coordinate space #4

AndreaGuarracino opened this issue Oct 2, 2024 · 14 comments
Labels
enhancement New feature or request

Comments

@AndreaGuarracino
Copy link

AndreaGuarracino commented Oct 2, 2024

For each MEM/SMEM found in a query, is it also possible to get in output in which sequences and positions of the pangenome it was found?

@lh3
Copy link
Owner

lh3 commented Oct 2, 2024

Possible, but will be times slower.

@ekg
Copy link

ekg commented Oct 3, 2024

What about a single (random) position?

@AndreaGuarracino
Copy link
Author

AndreaGuarracino commented Oct 3, 2024 via email

@lh3
Copy link
Owner

lh3 commented Oct 3, 2024

Finding one position will be faster, but with a naive implementation it will probably be slower than MEM finding itself. There should be faster ways with standard suffix array samples. I will try.

r-index or sr-index will be faster but will take a lot more effort.

@lh3
Copy link
Owner

lh3 commented Oct 6, 2024

Just added mem -p to output up to -p positions. The logic behind the implementation is complex but I haven't extensively tested the correctness. Let me know if you see issues. Thanks.

@lh3 lh3 added the enhancement New feature or request label Oct 6, 2024
@lh3
Copy link
Owner

lh3 commented Oct 6, 2024

The positions in output are unpredictable and should be close to, but not exactly, random.

@AndreaGuarracino
Copy link
Author

I tried, but I get a seg fault when using -p:

ropebwt3 build scerevisiae8.fa -bo scerevisiae8.fmr
[M::main_build::0.195*1.01] read 192511286 symbols from file 'scerevisiae8.fa'
[M::main_build::9.092*2.23] constructed partial BWT for 192511286 symbols
[M::main_build::9.461*2.23] encoded the partial BWT for 192511286 symbols
[M::main] Version: 3.7-r244-dirty
[M::main] CMD: ropebwt3 build -bo scerevisiae8.fmr scerevisiae8.fa
[M::main] Real time: 9.531 sec; CPU: 21.201 sec; Peak RSS: 2.175 GB
ropebwt3 mem -l15 scerevisiae8.fmr 1read.fq  
[M::mr_restore] ($, A, C, G, T, N) = (272, 59461631, 36672377, 36672377, 59461631, 242998)
[M::rb3_fmi_load_all::0.038*1.03] loaded the BWT
simulated.1     14      30      8
simulated.1     16      32      8
simulated.1     39      54      7
[M::worker_pipeline::0.039*1.03] processed 1 sequences
[M::main] Version: 3.7-r244-dirty
[M::main] CMD: ropebwt3 mem -l15 scerevisiae8.fmr 1read.fq
[M::main] Real time: 0.045 sec; CPU: 0.046 sec; Peak RSS: 0.062 GB
ropebwt3 mem -l15 scerevisiae8.fmr 1read.fq -p 1
[M::mr_restore] ($, A, C, G, T, N) = (272, 59461631, 36672377, 36672377, 59461631, 242998)
[M::rb3_fmi_load_all::0.040*1.04] loaded the BWT
[1]    51729 segmentation fault (core dumped) ropebwt3 mem -l15 scerevisiae8.fmr 1read.fq -p

With gdb:

gdb --args ropebwt3 mem -l15 scerevisiae8.fmr 1read.fq -p 1
GNU gdb (Ubuntu 14.0.50.20230907-0ubuntu1) 14.0.50.20230907-git
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ropebwt3...
(gdb) run
Starting program: ropebwt3 mem -l15 scerevisiae8.fmr 1read.fq -p 1

This GDB supports auto-downloading debuginfo from the following URLs:
  <https://debuginfod.ubuntu.com>
Enable debuginfod for this session? (y or [n]) n
Debuginfod has been disabled.
To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[M::mr_restore] ($, A, C, G, T, N) = (272, 59461631, 36672377, 36672377, 59461631, 242998)
[M::rb3_fmi_load_all::0.044*1.05] loaded the BWT
[New Thread 0x7ffff38bc6c0 (LWP 51861)]
[New Thread 0x7fffebfff6c0 (LWP 51862)]
[New Thread 0x7ffff30bb6c0 (LWP 51863)]
[New Thread 0x7ffff28ba6c0 (LWP 51864)]

Thread 4 "ropebwt3" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffff30bb6c0 (LWP 51863)]
ssa_add_intv (ssa=ssa@entry=0x0, aux=aux@entry=0x7ffff30bacf0, lo=lo@entry=147357453, hi=hi@entry=147357461, off=off@entry=0) at ssa.c:141
141             int64_t k = ((lo - m) >> ssa->ss << ssa->ss) + m;
(gdb) bt
#0  ssa_add_intv (ssa=ssa@entry=0x0, aux=aux@entry=0x7ffff30bacf0, lo=lo@entry=147357453, hi=hi@entry=147357461, off=off@entry=0) at ssa.c:141
#1  0x000055555558ff7f in rb3_ssa_multi (km=0x7fffec008fe0, f=f@entry=0x7fffffffd420, ssa=0x0, lo=147357453, hi=147357461, max_sa=<optimized out>, sa=0x7ffff18b7dc8) at ssa.c:168
#2  0x000055555559264b in worker_for_seq (data=<optimized out>, i=<optimized out>, tid=<optimized out>) at search.c:127
#3  0x000055555557d3c3 in ktf_worker (data=0x7fffec0090e0) at kthread.c:47
#4  0x00007ffff7c97b5a in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:444
#5  0x00007ffff7d285fc in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78

@lh3
Copy link
Owner

lh3 commented Oct 8, 2024

Have you generated suffix array samples and the sequence names? See README. Of course, ropebwt3 should throw an error rather than segfault.

@AndreaGuarracino
Copy link
Author

Oops, thanks! After also creating a sampled suffix array and a list of sequence names and lengths, I get the expected output:

ropebwt3 mem -l15 scerevisiae8.fmd 1read.fq -p 3 | column -t

simulated.1  14  30  8  3  SGDref#1#chrXIV:+:458823  DBVPG6765#1#chrXIV:+:446379  UWOPS034614#1#chrXIV:+:446906
simulated.1  16  32  8  3  YPS128#1#chrX:+:327075    Y12#1#chrX:+:330139          SGDref#1#chrX:+:336423
simulated.1  39  54  7  3  Y12#1#chrXIV:-:560052     DBVPG6044#1#chrXIV:-:542558  S288C#1#chrXIV:-:543079

@lh3
Copy link
Owner

lh3 commented Oct 14, 2024

Great! Thanks for the confirmation. I will close this issue. Let me know if you see problems.

@lh3 lh3 closed this as completed Oct 14, 2024
@ekg
Copy link

ekg commented Oct 14, 2024

Thanks @lh3, this is extremely useful!

@AndreaGuarracino
Copy link
Author

@lh3, what does it mean when the reported reference position in -1?

A00404:155:HV27LDSXX:3:2515:17011:18552/2       121     150     89      1       NA20129#1#JAHEPE010000077.1:3240254-3317239:+:-1

lh3 added a commit that referenced this issue Oct 18, 2024
@lh3
Copy link
Owner

lh3 commented Oct 18, 2024

This is a bug. Now fixed at HEAD.

@lh3
Copy link
Owner

lh3 commented Oct 18, 2024

FYI: the latest v3.8 release now has this feature. Thanks again for the suggestion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants