Read BGEN files #36

tomwhite · 2020-06-08T12:18:14Z

This change adds support for reading a BGEN file as a GenotypeDosageDataset object.

It uses the PyBGEN library, which is pure Python, so it may need optimization work for large BGEN files (we shall see). The advantage over bgen-reader is that PyBGEN uses BGEN index files, whereas bgen-reader uses its own 'metafile'. The main problem I saw with bgen-reader is that it opens a new file handle for every variant it reads, while PyBGEN opens one per batch of variants being read and uses the index to seek to each variant.
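To illustrate the access pattern described above (one file open per batch, with an index used to seek to each record), here is a minimal toy sketch. It does not use PyBGEN or the BGEN format: the record layout, the in-memory offset index, and the names `RECORD`, `write_toy_file`, `read_batch`, and `demo` are all hypothetical, chosen only to show the idea with the standard library.

```python
import os
import struct
import tempfile

# Hypothetical fixed-size record: variant id (int32) and a dosage (float32).
# Real BGEN records are variable-length and compressed; this is only a stand-in.
RECORD = struct.Struct("<if")


def write_toy_file(path, records):
    """Write records to `path` and return a list of their byte offsets (the 'index')."""
    offsets = []
    with open(path, "wb") as f:
        for variant_id, dosage in records:
            offsets.append(f.tell())
            f.write(RECORD.pack(variant_id, dosage))
    return offsets


def read_batch(path, offsets, start, stop):
    """Read records [start, stop) with a single open, seeking via the index."""
    out = []
    with open(path, "rb") as f:  # one open per batch, not one per variant
        for offset in offsets[start:stop]:
            f.seek(offset)
            out.append(RECORD.unpack(f.read(RECORD.size)))
    return out


def demo():
    """Write ten toy records, then read back the batch covering records 3 to 5."""
    records = [(i, i * 0.5) for i in range(10)]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "toy.bin")
        offsets = write_toy_file(path, records)
        return read_batch(path, offsets, 3, 6)
```

The per-variant alternative would move the `open` call inside the loop, which is the cost the batch approach avoids.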

notebooks/platform/xarray/lib/io/pybgen_backend.py

tomwhite · 2020-06-09T15:10:59Z

BTW I spent some time today using this code to load some larger files. I was able to convert a BGEN with 100K variants and 1000 samples to Zarr. I also converted 1KG ChrX, which took about 5 minutes on my 4-core machine.

So I think this can be merged if you are happy with it now. (BTW I don't know if I have commit permissions yet.)

eric-czech · 2020-06-09T15:24:38Z

Added write permission for you @tomwhite

Read BGEN files

fdb7fe6

eric-czech reviewed Jun 8, 2020
