One-bit encoding for genotypes #809


Merged · 1 commit · Apr 5, 2023

Conversation

@jeromekelleher (Member)

WIP

Doesn't look too bad if we abstract things a bit, will have a look at the C API to see if it ports across.

@codecov (bot) commented Mar 4, 2023

Codecov Report

Merging #809 (dde4e26) into main (18b51ae) will decrease coverage by 0.63%.
The diff coverage is 83.11%.

❗ Current head dde4e26 differs from pull request most recent head 77e3466. Consider uploading reports for the commit 77e3466 to get more accurate results

@@            Coverage Diff             @@
##             main     #809      +/-   ##
==========================================
- Coverage   93.34%   92.71%   -0.63%     
==========================================
  Files          17       17              
  Lines        5662     5806     +144     
  Branches     1016     1036      +20     
==========================================
+ Hits         5285     5383      +98     
- Misses        247      291      +44     
- Partials      130      132       +2     
Flag     Coverage Δ
C        92.71% <83.11%> (-0.63%) ⬇️
python   95.90% <78.65%> (-0.39%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files           Coverage Δ
tsinfer/formats.py       96.28% <9.52%> (-1.25%) ⬇️
lib/ancestor_builder.c   83.12% <85.71%> (-4.46%) ⬇️
lib/err.c                100.00% <100.00%> (ø)
tsinfer/algorithm.py     98.56% <100.00%> (+0.08%) ⬆️
tsinfer/inference.py     98.57% <100.00%> (+0.01%) ⬆️

... and 1 file with indirect coverage changes


@benjeffery (Member)

Looking good - let me know when I should run some quick tests.

@jeromekelleher (Member Author)

I think this is basically working @benjeffery, but I haven't quite got the tests to pass. It should be at least informative to run this on a decent-sized SampleData, if you want to give it a quick try.

@jeromekelleher (Member Author)

Use like generate_ancestors(params as before, genotype_encoding=1)

@jeromekelleher (Member Author)

Aha, that should be it. Curious to see what the perf cost/benefit of this is!

@benjeffery (Member)

Not looking good so far: ancestor_builder_finalise is taking a lot longer. I'm not sure how much longer, as it's still running after 8 min (that step usually takes a few seconds). Attaching gdb shows the process is in unpackbits. I'll try a more optimised version of that function.

@jeromekelleher (Member Author)

Not looking good so far, ancestor_builder_finalise is taking a lot longer.

Hmm, I can see how that might happen. OK, I'll see what I can do.

@benjeffery (Member)

This seems to be better:

void unpackbits(const uint8_t *restrict source, size_t len, int8_t *restrict dest) {
    uint64_t MAGIC = 0x8040201008040201ULL;
    uint64_t MASK  = 0x8080808080808080ULL;
    size_t dest_index = 0;
    for (size_t i = 0; i < len; i++) {
        uint64_t t = ((MAGIC*source[i]) & MASK) >> 7;
        *(uint64_t*)&dest[dest_index] = t;
        dest_index += 8;
    }
}

It's running now, and has got past the ancestor_builder_finalise step.

@benjeffery (Member)

Whoops, maybe that isn't doing the right thing, as some tests are failing now. It was running fast until it failed, though!

@jeromekelleher (Member Author)

Currently refactoring to do selective decoding in some places. We can drop a more optimised function in later.

@jeromekelleher (Member Author)

Here's a version with partial decoding @benjeffery. You might see a change in the amount of memory we're grabbing at the start; hopefully that won't be a showstopper for the first pass, though.

@jeromekelleher (Member Author)

Looks like this is working pretty well. Here's my informal perf testing, based on a simulation with 10K samples and 78K sites:

  • main: time 1:20, max resident 760M
  • this branch, genotype_encoding=0: time 1:31, max resident 1.0G
  • this branch, genotype_encoding=1: time 1:20, max resident 460M

So, we're about as fast as the current head and save quite a bit of memory! The memory isn't optimised at the moment; I'll need to take a bit more care with that.

But I think the approach basically works now and we don't really need any major changes, so I'll try to get this mergeable ASAP.

@jeromekelleher (Member Author)

It would be good to get some confirmation on real data as well though @benjeffery, if you can get it working on the datasets you're working on?

@jeromekelleher (Member Author)

Hmm, might have made a retrograde step with the memory usage, will have another look.

@benjeffery (Member)

I also get:

CRITICAL:tsinfer.threads:Exception occured in thread; exiting
CRITICAL:tsinfer.threads:Traceback (most recent call last):
  File "/home/benj/projects/tsinfer/tsinfer/threads.py", line 59, in thread_target
    worker(index)
  File "/home/benj/projects/tsinfer/tsinfer/inference.py", line 1214, in build_worker
    start, end = self.ancestor_builder.make_ancestor(focal_sites, a)
_tsinfer.LibraryError: Bad focal site.

@jeromekelleher (Member Author) commented Mar 10, 2023

Right, crap, we're reusing a decode buffer across threads.

Can you try again with 1 thread?

@jeromekelleher (Member Author)

that should fix it

@benjeffery (Member)

Right, here's some results. Generating ancestors for 1000g with 8 threads (not all fully used):
(793,802 sites × 3,202 samples × 2 ploidy = 5.08GB byte-encoded / 635MB bit-packed)

Code version          wallclock   builder RAM   process resident RAM
1kg (byte)            8:31        n/a           7.2GB
bitpack (byte)        8:58        4.7GB         8.7GB
bitpack (1bit)        8:56        608.7MB       4.3GB
bitpack-magic (1bit)  6:58        608.7MB       4.0GB

So with the "magic" unpack I posted in #809 (comment) we're faster than the normal code, which I guess makes sense, as you can hold more data in the CPU cache and need less main-memory bandwidth. The ~3.5GB of RAM is a constant term from the ancestor write buffers, I think, so shouldn't worry us.

I will now try this code on the GEL chr 22.

@jeromekelleher (Member Author)

Awesome!

Since the speed diff isn't massive, we can add the bitpack-magic change in a follow-up. It needs some care about buffer sizes, I think, and probably a few choice tests to make sure we don't overflow on awkward cases.

@benjeffery (Member)

GEL chr22 in 3:05:51 using 76GB RAM using the bitpack-magic branch. Awesome work on this one, that's hopefully the last big blocker removed!

@jeromekelleher (Member Author)

Great!

Don't use the magic version for actual inference though; there's definitely a buffer overflow and undefined behaviour. Easy fix though (just a case of making sure the genotype encode buffer size is divisible by 64 rather than 8).

@hyanwong (Member)

BTW, presumably if we have missing data we can use a 2-bit encoding for twice the RAM cost? Would this be useful for others, or hard to code up? No reason to do it now, but maybe an issue for the future?

@jeromekelleher (Member Author)

Yeah, that's the plan

@hyanwong (Member)

Yeah, that's the plan

Shall I open an issue?

@jeromekelleher jeromekelleher force-pushed the bit-pack branch 2 times, most recently from 77fdb30 to 30f4b49 Compare March 28, 2023 14:50
@jeromekelleher jeromekelleher marked this pull request as ready for review March 28, 2023 14:50
@jeromekelleher (Member Author)

I think this is ready for final review and merging now.

I've only added the most basic documentation: since we'll also want to consider the mmap stuff for doing this at scale, we may as well wait till that's in before discussing perf in a more accessible way.

@jeromekelleher (Member Author)

The implementation has changed slightly @benjeffery, so may be worth rerunning some basic benchmarks just to be sure.

@benjeffery (Member)

OK, will re-run the benchmarks on 1kg.

@jeromekelleher (Member Author)

Can you open an issue to track adding your 64-bit unpack implementation? It should be a nice, simple follow-up here.

@benjeffery (Member)

#816

@jeromekelleher (Member Author)

I think this should be ready to go @benjeffery?

@benjeffery (Member) left a comment

LGTM, couple of comments.
There is a block of uncovered code in the data_equal method. I assume that's because all the tests pass, but it might be nice to add one set that is unequal by the last criterion to be checked?

@jeromekelleher (Member Author)

Hopefully ready to go now

@mergify mergify bot merged commit 2b6e898 into tskit-dev:main Apr 5, 2023
3 participants