-
Notifications
You must be signed in to change notification settings - Fork 30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cram file support? #123
Comments
@robertaboukhalil I believe CRAM is pretty difficult for us to support, did I recall that correctly? |
I had trouble with CRAM support in the past because samtools kept trying to download the reference from ebi.ac.uk despite me trying to tell it to use a local fasta file. That was a couple years ago and using samtools 1.10 so it's worth another try with a newer version. |
@dnil Can you share a small CRAM file with a reference FASTA we can test with? |
Sure, I could downsample or make a small extract I guess, but GitHub doesn't allow exchange very large files. Please suggest an upload destination/mode and I'll try to give you something! 😊 Or perhaps easier just to grab a NIST or 1000G one? According to their alignment readme it should be on GRCh38DH - i e |
Thanks @dnil, that's really useful! I tried it with those files and I see a similar issue: it first downloads ~2MB from the CRAM file but then tries to download the entire FASTA file from the URL. This is even though it only needs a small part of the FASTA, the FASTA is indexed, and I explicitly specify the index path with We might have better luck with something like cram.js? <script src="https://biowasm.com/cdn/v3/aioli.js"></script>
<script type="module">
const CLI = await new Aioli([{ tool: "samtools", version: "1.17" }]);
const [path_cram, path_crai, path_fa, path_fai] = await CLI.mount([
"https://42basepairs.com/download/s3/1000genomes/data/HG00099/alignment/HG00099.alt_bwamem_GRCh38DH.20150718.GBR.low_coverage.cram",
"https://42basepairs.com/download/s3/1000genomes/data/HG00099/alignment/HG00099.alt_bwamem_GRCh38DH.20150718.GBR.low_coverage.cram.crai",
"https://42basepairs.com/download/s3/1000genomes/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa",
"https://42basepairs.com/download/s3/1000genomes/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fai",
]);
const output = await CLI.exec(`samtools view -T ${path_fa} -t ${path_fai} ${path_cram} chr17:7,667,421-7,688,490`);
// const output = await CLI.exec(`samtools view -T ${path_fa}##idx##${path_fai} ${path_cram} chr17:7,667,421-7,688,490`);
console.log(output);
</script> |
Thank you for trying! Cool wasm system! I wonder if samtools was trying to make a 'REF_CACHE'? And if one could feed it one over https? It kind of sounds like it, but no direct examples on https://www.htslib.org/workflow/cram.html. |
From this page: If no REF_PATH is defined, both REF_PATH and REF_CACHE will be automatically set (see above), but if REF_PATH is defined and REF_CACHE not then no local cache is used I tried setting REF_PATH but it's still ignoring it and downloading the whole file 🤷♂️ |
Yes, I confirm the same on the js console/Safari Sources. Too bad! When testing the same command as above, also with the same URLs, with samtools 1.21 or 1.18 from a local shell, they seem way too quick to download any large files (screenshots below). I'm sure there are all kinds of intricacies on how A faint hope is that it was different with 1.17? Is wasm-compiling a new
|
Unfortunately using the same example with the biowasm samtools 1.21 candidate did not solve the issue out of the box. I am however still curious to test the same with a local mirror of the I made a silly test with mounting a
in between mounting, and running the command, which didn't work. How does one do it? Thank you kindly for the support! |
This is what I tried previously: const CLI = await new Aioli([
"samtools/1.17",
"coreutils/env/8.32"
]);
const [path_cram, path_crai, path_fa, path_fai] = await CLI.mount([
"https://42basepairs.com/download/s3/1000genomes/data/HG00099/alignment/HG00099.alt_bwamem_GRCh38DH.20150718.GBR.low_coverage.cram",
"https://42basepairs.com/download/s3/1000genomes/data/HG00099/alignment/HG00099.alt_bwamem_GRCh38DH.20150718.GBR.low_coverage.cram.crai",
"https://42basepairs.com/download/s3/1000genomes/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa",
"https://42basepairs.com/download/s3/1000genomes/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fai",
]);
console.log(await CLI.exec("env REF_PATH=/data/shared"))
console.log(await CLI.exec("env"))
const output = await CLI.exec(`samtools view -T ${path_fa} -t ${path_fai} ${path_cram} chr17:7,667,421-7,688,490`);
console.log(output); but it still downloads the whole file |
I see! Nice with the env tool, but how did we populate the One way or the other we are trying to emulate the behaviour of http://www.ebi.ac.uk/ena/cram/md5. The common version to avoid hammering them when running like large scale stuff is to first locally run the samtools script Just mounting the url as a dir would be too naive - I still tested below. Is there like a way to have dynamic endpoint URLs mounted, or to crawl urls to mount all encountered endpoints as files?
|
Hmm, why do we need to populate that folder? According to this page, "if REF_PATH is defined and REF_CACHE not then no local cache is used". By setting REF_PATH, we're disabling the cache, so that shouldn't be the problem, right? The issue I'm seeing is that samtools is reading through the entire FASTA file for some reason, instead of using a small subset of it 🤔 |
Oh oups... the problem is that I'm using the wrong .fai URL 🤦 This works just fine: // Initialize Aioli with samtools and seqtk
const CLI = await new Aioli([
"samtools/1.17",
"coreutils/env/8.32"
], { debug: true });
const [path_cram, path_crai, path_fa, path_fai] = await CLI.mount([
"https://42basepairs.com/download/s3/1000genomes/data/HG00099/alignment/HG00099.alt_bwamem_GRCh38DH.20150718.GBR.low_coverage.cram",
"https://42basepairs.com/download/s3/1000genomes/data/HG00099/alignment/HG00099.alt_bwamem_GRCh38DH.20150718.GBR.low_coverage.cram.crai",
"https://42basepairs.com/download/s3/1000genomes/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa",
"https://42basepairs.com/download/s3/1000genomes/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa.fai",
]);
console.log(await CLI.exec("env REF_PATH=/shared/data/"))
console.log(await CLI.exec("env"))
const output = await CLI.exec(`samtools view -T ${path_fa} -t ${path_fai} ${path_cram} chr17:7,667,421-7,688,490`);
console.log(output); And it only downloads a small subset of the files! I guess because I used the wrong FAI URL, samtools was rebuilding the index by downloading the whole FASTA file 🤦 🤦 🤦 |
This is sweet; should be enough for a start. I think it will be useful to test and merge the I imagined for a "production" version one would want to solve the So, how do we move forward? I could make a PR to start it off perhaps, but I have only skimmed the codebase so I am likely to need some help sooner or later. |
Yes, I'll look at merging the samtools 1.21 version, there's an unexpected build error on previous versions, so I'll look into that more closely.
The browser should already cache requests when downloading subsets of FASTA files, so that shouldn't be an issue.
We don't have a lot of bandwidth right now to work on Ribbon, though we might later in the year. So feel free to work on it but can't guarantee we'll be able to help and review it in a timely manner :) |
Hi, very nice tool - well done! Any plans for supporting cram files? It looked like you are using
samtools
for handling alignments, but one would then presumably need to pass along a reference genome file/URL. Cheers!The text was updated successfully, but these errors were encountered: