Read BGEN files #36

Merged
merged 2 commits into from
Jun 9, 2020

tomwhite
Contributor

@tomwhite tomwhite commented Jun 8, 2020

This change adds support for reading a BGEN file as a GenotypeDosageDataset object.

It uses the PyBGEN library, which is pure Python, so it may need work to optimize it for large BGEN files (we shall see). Its advantage over bgen-reader is that PyBGEN uses BGEN index files, whereas bgen-reader uses its own 'metafile'. The main problem I saw with bgen-reader is that it opens a new file handle for every variant it reads, while PyBGEN opens one file handle for each batch of variants being read and uses the index to seek to the right position.
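To illustrate the access pattern described above (one file open per batch, with an index used to seek to each record), here is a minimal sketch using only the standard library. It is not PyBGEN's code and does not read real BGEN data: the toy file layout, the in-memory `(offset, size)` index, and the helper names `write_toy_file` and `read_batch` are all assumptions made for illustration.

```python
import os
import tempfile


def write_toy_file(path, records):
    """Write byte records back to back and return a list of (offset, size) pairs.

    Hypothetical stand-in for a variant file plus its index; not the BGEN format.
    """
    index = []
    with open(path, "wb") as f:
        for rec in records:
            index.append((f.tell(), len(rec)))
            f.write(rec)
    return index


def read_batch(path, index, start, stop):
    """Open the file once, then seek to each record in [start, stop)."""
    out = []
    with open(path, "rb") as f:  # a single open for the whole batch
        for offset, size in index[start:stop]:
            f.seek(offset)  # the index tells us where each record starts
            out.append(f.read(size))
    return out


# Build a toy file with ten variable-length records, then read records 3..5.
records = [f"variant-{i}".encode() * (i + 1) for i in range(10)]
path = os.path.join(tempfile.mkdtemp(), "toy.bin")
index = write_toy_file(path, records)
batch = read_batch(path, index, 3, 6)
```

Opening the file once per batch keeps the number of open/close calls proportional to the number of batches rather than the number of variants, which is the difference between the two libraries noted above.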

@tomwhite
Contributor Author

tomwhite commented Jun 9, 2020

BTW I spent some time today using this code to load some larger files. I was able to convert a BGEN file with 100K variants and 1000 samples to Zarr. The same worked for 1KG ChrX, which took about 5 minutes on my 4-core machine.

So I think this can be merged if you are happy with it now. (BTW I don't know if I have commit permissions yet.)

@eric-czech eric-czech merged commit 605b1fa into related-sciences:master Jun 9, 2020
@eric-czech
Collaborator

Added write permission for you @tomwhite

@tomwhite tomwhite mentioned this pull request Jun 10, 2020