-
Notifications
You must be signed in to change notification settings - Fork 164
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow custom geometry single-cell protocols #734
base: master
Are you sure you want to change the base?
Conversation
@Gaura one thought, since the regex doesnt change, can we precomputed that just once? |
I don't have a great handle on the code yet, but precomputing the regex object as Rob suggested is essential. |
Great suggestion, thanks @rob-p and @gmarcais . Somehow, I missed it. I added it in the latest commit. Speed now from 3 runs:
50% improvement over the past results, i.e., about 33% slower than specific protocol flag now. Although, ideally I should have ran the earlier tests thrice but the sd is small so results should be valid. Nonetheless, I'll do more speed tests with versions in the future. Let me know what other thoughts you have and what else have I missed. I have some minor improvements in mind too. |
Can you give us the regular expression that are generated and used for matching? |
Sure.
For other custom geometries, these can be obtained by uncommenting lines 1473 and 1474 in |
Does it make sense to have '^' and/or '$' around the regex? Having anchors usually speeds up a regex. Otherwise, I am not sure. Have we tested other libraries than boost::regexp? |
I can try '^' and '$' after lunch. I didn't test any other library, since boost was already a pre-requisite for salmon and my focus was on getting it to work first. But now that it is done, other libraries can be tried. However, at this moment, I'm not clear about the effort and speed-up ratio. |
Including '^' didn't change the speed, and '$' can't be added as there are extra bases (not in protocol spec) at end which can vary in number.
|
Thanks @Gaura! @gmarcais — I don't have a good sense about how boost stacks up against other regex libraries, but I think it's competitive. Actually modern C++ has built-in regex support in the standard library, but I've heard many bad things about the standard implementation. I wonder if you think @gmarcais that the approximate difference between hand-rolled extraction code and regex extraction that we see here is in line with what you'd expect. Would you expect a 1/3 slowdown to the overall procedure by using regex extraction? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
b293731
to
66b6f68
Compare
Hi @rob-p! Thanks for all the great suggestions and comments. I have addressed all of them. I also tested the speed after the changes. The test was done 3 times in the same way as done earlier.
The real time in case 1. was 57.92 ± 0.57s and for case 2. it was 68.75 ± 5.68s, which is about 19% slower. |
1/3 loss in performance seems significant, given that presumably the code does something else than just parsing UMIs. I am looking at Boost own comparison and benchmarks, and on long inputs (20MB) it is competitive with PCRE2. But with short inputs (20-30 characters) PCRE2 is consistently faster (by about 30% 🤔 ). And if PCRE2 is feature full, not sure it is the fastest either, especially for simple regexp. |
@Gaura So with the changes implemented, custom geometry is ~19% slower here than the hand-coded sci-seq3 protocol (improved from ~1/3 slower); is that correct? That's a nice improvement. @gmarcais — do you think it's worth testing out PCRE2? Most of these regexes are very short — and if boost is ~20% slower than PCRE2 and we are ~20% slower than the custom parsing code .... maybe that's the whole gap? Any idea how difficult this would be to try? |
I don't know how difficult it would be to use PCRE2, I have never used it directly. Can we write a thin struct wrapper structure that is defined as the boost object by default, but with Keeping with boost, we should also try https://www.boost.org/doc/libs/1_78_0/doc/html/xpressive.html . That does not add a new dependency. Don't know about the speed. |
My little testing: https://github.com/OceanGenomics/RegexBench Some results on matching 1 million short strings with ~90% positive match and ~10% random strings:
retwo is RE2. It is surprisingly highly affected by the number of captures. With 4 captures as above, it is the slowest. Without any it is as fast as grep. And xpressive is very slow, while I expected it to be the fastest! |
Very interesting @gmarcais. I wonder if/how allocations have an effect here. Are all methods using pre-allocated space to store their captures? Also, is xpressive drunk? |
I explicitly preallocate the output array/vector for pcre and re2. Boost regex doesn't seem to offer that (at least, I don't know). Regarding xpressive: yeah, what a disappointment. And I don't actually save the capture with xpressive. I thought the automaton was entirely generated and optimize at compile time. Apparently creating an automaton with C++ template system must be really hard because the generated code is garbage. Or I am using it wrong. In any case, xpressive as I use it is entirely static (I haven't tested the dynamic version). So it is not useful in our case. I was just curious if it could match hand crafted code. What was I thinking! |
I wonder of the preallocation capabilities of pcre2 might offer a bigger benefit over a larger workload. For small patterns processing hundreds of millions of reads, maybe there is a bigger difference? |
- introduce exclude 'x' to parsing - change positions with lengths
- read returned only when read regex search is a success - only check reads that have barcode and umi for bc and umi - if regex search fails in bc, return false
While this has languished somewhat as we try to figure out what regex engine to include and how best to package it, I wanted to mention that I am attempting something similar in rust (which has a canonical regex crate which, I believe, is supposed to be among the fast ones). That repo is over on seq_geom_xform and it relies on seq_geom_parser. I like @Gaura's geometry string specification, so we're going with that for time time being. If you want to chime in or start a discussion on either of those repos, please do let me know if you have any other thoughts on this generalized scheme. The purpose of the --Rob |
Just an update, this approach currently seems to be working. We can stream sciseq3 data through our |
This PR introduces a new feature that will allow users to specify custom single cell protocols and use with alevin. Custom Geometry (--custom-geo) should be used when:
1. Barcode or umi have variable lengths
2. There is known fixed sequence in the reads
3. There is some sequence to be excluded
From the input peglib spec it creates a regex. Boost regex library is used to parse the reads.
Apart from small tests on multiple outlier cases, it was tested on a sci-RNA-seq3 sample SRR7827207 for speed. For this the spec is
--custom-geo 1{b[9-10]f[CAGAGC]u[8]b[10]}2{r}
. It says:The barcodes are concatenated in the output and a padding sequence is added to make the length as max length + 1. The extra base is added so that we don't introduce spurious matches in barcode. For example, if the barcodes have length 3-4 bp and the two barcodes are
ATG
andATGA
, after padding they will beATGAC
andATGAA
resp. Adding justA
to shorter barcode would result in a spurious match.Since
--custom-geo
uses regex, it is slower than protocol specific flag. The time with 8 threads on a large Ubuntu 20 machine:--sciseq3
:--custom-geo
Notably, it is about 66% slower. However, it allows support for almost all current and future protocols.
There will be a tutorial shortly on how and when to use this and how is it different from other flags such as
--umi-geometry
,--read-geometry
and--bc-geomtery
. There is scope of speed improvement in the future along with integration of all custom geometry processing protocols.