
cram file support? #123

Open
dnil opened this issue Sep 20, 2024 · 15 comments

@dnil

dnil commented Sep 20, 2024

Hi, very nice tool - well done! Any plans for supporting CRAM files? It looks like you are using samtools for handling alignments, but one would then presumably need to pass along a reference genome file/URL. Cheers!

@MariaNattestad
Owner

@robertaboukhalil I believe CRAM is pretty difficult for us to support; do I recall that correctly?
Was it something about how samtools assumes certain things about the reference that are hard to replicate in a web environment?

@robertaboukhalil
Collaborator

I had trouble with CRAM support in the past because samtools kept trying to download the reference from ebi.ac.uk despite my telling it to use a local FASTA file. That was a couple of years ago with samtools 1.10, so it's worth another try with a newer version.
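
For context, outside the browser the usual way to avoid the fallback to the EBI MD5 server is to pass the reference explicitly; a minimal sketch, with placeholder paths:

# Decode a CRAM region against a local reference; -T supplies the FASTA so
# htslib should not need to resolve sequences via the default ebi.ac.uk REF_PATH.
samtools view -T /path/to/ref.fa /path/to/aln.cram chr17:7,667,421-7,688,490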

@robertaboukhalil
Collaborator

@dnil Can you share a small CRAM file with a reference FASTA we can test with?

@dnil
Author

dnil commented Sep 23, 2024

Sure, I could downsample or make a small extract I guess, but GitHub doesn't allow exchanging very large files. Please suggest an upload destination/method and I'll try to give you something! 😊

Or perhaps it's easier to just grab a NIST or 1000G one?
E.g. here:
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/data/GBR/HG00099/alignment/HG00099.alt_bwamem_GRCh38DH.20150718.GBR.low_coverage.cram

According to their alignment README it should be on GRCh38DH, i.e. ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa

@robertaboukhalil
Collaborator

Thanks @dnil, that's really useful!

I tried it with those files and I see a similar issue: it first downloads ~2MB from the CRAM file but then tries to download the entire FASTA file from the URL. This is even though it only needs a small part of the FASTA, the FASTA is indexed, and I explicitly specify the index path with -t or ##idx## 🤔

We might have better luck with something like cram.js?

<script src="https://biowasm.com/cdn/v3/aioli.js"></script>
<script type="module">
    const CLI = await new Aioli([{ tool: "samtools", version: "1.17" }]);
    const [path_cram, path_crai, path_fa, path_fai] = await CLI.mount([
        "https://42basepairs.com/download/s3/1000genomes/data/HG00099/alignment/HG00099.alt_bwamem_GRCh38DH.20150718.GBR.low_coverage.cram",
        "https://42basepairs.com/download/s3/1000genomes/data/HG00099/alignment/HG00099.alt_bwamem_GRCh38DH.20150718.GBR.low_coverage.cram.crai",
        "https://42basepairs.com/download/s3/1000genomes/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa",
        "https://42basepairs.com/download/s3/1000genomes/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fai",
    ]);

    const output = await CLI.exec(`samtools view -T ${path_fa} -t ${path_fai} ${path_cram} chr17:7,667,421-7,688,490`);
    // const output = await CLI.exec(`samtools view -T ${path_fa}##idx##${path_fai} ${path_cram} chr17:7,667,421-7,688,490`);
    console.log(output);
</script>

@dnil
Author

dnil commented Sep 23, 2024

Thank you for trying! Cool wasm system! I wonder if samtools was trying to build a 'REF_CACHE'? And whether one could feed it one over HTTPS? It kind of sounds like it, but there are no direct examples on https://www.htslib.org/workflow/cram.html.

@robertaboukhalil
Collaborator

From this page: "If no REF_PATH is defined, both REF_PATH and REF_CACHE will be automatically set (see above), but if REF_PATH is defined and REF_CACHE not then no local cache is used."

I tried setting REF_PATH but it's still ignoring it and downloading the whole file 🤷‍♂️

@dnil
Author

dnil commented Oct 20, 2024

Yes, I can confirm the same in the JS console/Safari Sources. Too bad!

When testing the same command as above, with the same URLs, using samtools 1.21 or 1.18 from a local shell, it finishes far too quickly to be downloading any large files (screenshots below). I'm sure there are all kinds of intricacies in how samtools/htslib decides whether to issue range requests. I feel I know too little about the Aioli mounts and htslib to understand the issue at this point. 😞

A faint hope is that this behaved differently in 1.17? Is wasm-compiling a newer samtools version a lot of work? From their release notes, htslib did at least some rework of CRAM access when they fixed the CRAM bugs in the versions leading up to 1.18. Those bugs alone are probably reason enough to use a later version if CRAM access is intended.

time samtools view -T https://42basepairs.com/download/s3/1000genomes/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa -t https://42basepairs.com/download/s3/1000genomes/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa.fai https://42basepairs.com/download/s3/1000genomes/data/HG00099/alignment/HG00099.alt_bwamem_GRCh38DH.20150718.GBR.low_coverage.cram chr17:7,667,421-7,688,490

[Screenshot 2024-10-20 at 19:40:31]

[Screenshot 2024-10-20 at 19:38:39]

@dnil
Author

dnil commented Oct 23, 2024

Unfortunately using the same example with the biowasm samtools 1.21 candidate did not solve the issue out of the box.

I am, however, still curious to test the same thing with a local mirror of the REF_PATH MD5s set up. As far as I understand, that is more or less how it is supposed to work for remote references (e.g. samtools/samtools#837), so it would be a more proper test, since we would not really see attempts to download from EBI (http://www.ebi.ac.uk/ena/cram/md5). If I understand this correctly, we should not be able to access that from within the assembled module without "mounting" it.

I made a silly test mounting a REF_PATH containing the MD5s for the GRCh38 reference, but again I don't know enough about Aioli/Emscripten to set it properly. I naively tried

CLI.preRun = () => { ENV.REF_PATH = ref_path };

in between mounting and running the command, but it didn't work. How does one do it? Thank you kindly for the support!

@robertaboukhalil
Collaborator

This is what I tried previously:

const CLI = await new Aioli([
	"samtools/1.17",
	"coreutils/env/8.32"
]);

const [path_cram, path_crai, path_fa, path_fai] = await CLI.mount([
	"https://42basepairs.com/download/s3/1000genomes/data/HG00099/alignment/HG00099.alt_bwamem_GRCh38DH.20150718.GBR.low_coverage.cram",
	"https://42basepairs.com/download/s3/1000genomes/data/HG00099/alignment/HG00099.alt_bwamem_GRCh38DH.20150718.GBR.low_coverage.cram.crai",
	"https://42basepairs.com/download/s3/1000genomes/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa",
	"https://42basepairs.com/download/s3/1000genomes/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fai",
]);

console.log(await CLI.exec("env REF_PATH=/data/shared"))
console.log(await CLI.exec("env"))

const output = await CLI.exec(`samtools view -T ${path_fa} -t ${path_fai} ${path_cram} chr17:7,667,421-7,688,490`);
console.log(output);

but it still downloads the whole file

@dnil
Author

dnil commented Oct 23, 2024

I see! Nice trick with the env tool, but how did we populate /data/shared with the output from the samtools Perl script? That should be close to 3 GB of reference genome data?

One way or the other, we are trying to emulate the behaviour of http://www.ebi.ac.uk/ena/cram/md5.

The common approach, to avoid hammering them when running large-scale jobs, is to first run the samtools script seq_cache_populate.pl locally, which splits the FASTA reference into convenient hash-keyed chunks. It is similar to the ENA service, but by default uses subfolders instead of a full URL lookup (a rough sketch of that workflow is below).
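
A minimal sketch of that local setup, assuming seq_cache_populate.pl from samtools' misc/ directory and an illustrative cache path:

# Split the reference into MD5-named chunks under a local cache directory.
seq_cache_populate.pl -root /some_dir/cache GRCh38_full_analysis_set_plus_decoy_hla.fa

# Point htslib at that cache for lookups (REF_PATH) and for writing back
# any sequences it has to fetch (REF_CACHE).
export REF_PATH='/some_dir/cache/%2s/%2s/%s'
export REF_CACHE='/some_dir/cache/%2s/%2s/%s'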

Just mounting the URL as a directory is probably too naive, but I still tested it below. Is there a way to mount dynamic endpoint URLs, or to crawl URLs so that all encountered endpoints are mounted as files?

<script src="https://biowasm.com/cdn/v3/aioli.js"></script>
<script type="module">
const CLI = await new Aioli(["coreutils/env/8.32", "coreutils/ls/8.32", {
    tool: "samtools",
    version: "1.21",
    urlPrefix: "http://localhost:80/biowasm/build/samtools/1.21",
}]);

const [path_cram, path_crai, path_fa, path_fai, ref_path] = await CLI.mount([
    "https://42basepairs.com/download/s3/1000genomes/data/HG00099/alignment/HG00099.alt_bwamem_GRCh38DH.20150718.GBR.low_coverage.cram",
    "https://42basepairs.com/download/s3/1000genomes/data/HG00099/alignment/HG00099.alt_bwamem_GRCh38DH.20150718.GBR.low_coverage.cram.crai",
    "https://42basepairs.com/download/s3/1000genomes/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa",
    "https://42basepairs.com/download/s3/1000genomes/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fai",
    "http://localhost/refs/cache"
]);

console.log(await CLI.exec(`env REF_PATH=${ref_path}/%2s/%2s/%s`))
console.log(await CLI.exec("env"))

console.log(await CLI.exec(`ls ${ref_path}/`))
console.log(await CLI.exec(`ls ${ref_path}/00`))

const output = await CLI.exec(`samtools view -T ${path_fa} -t ${path_fai} ${path_cram} chr17:7,667,421-7,688,490`);
console.log(output);

document.getElementById("output").innerHTML = output;
</script>

<h4>Output of <code>samtools view cram file</code>:</h4>
<pre id="output">Loading...</pre>

[Screenshot 2024-10-23 at 17:24:06]

[Screenshot 2024-10-23 at 17:24:25]

@robertaboukhalil
Collaborator

Hmm, why do we need to populate that folder? According to this page, "if REF_PATH is defined and REF_CACHE not then no local cache is used". By setting REF_PATH, we're disabling the cache, so that shouldn't be the problem, right? The issue I'm seeing is that samtools is reading through the entire FASTA file for some reason, instead of using a small subset of it 🤔

@robertaboukhalil
Collaborator

robertaboukhalil commented Oct 23, 2024

Oh oops... the problem is that I'm using the wrong .fai URL 🤦

This works just fine:

// Initialize Aioli with samtools and the coreutils env tool
const CLI = await new Aioli([
	"samtools/1.17",
	"coreutils/env/8.32"
], { debug: true });

const [path_cram, path_crai, path_fa, path_fai] = await CLI.mount([
	"https://42basepairs.com/download/s3/1000genomes/data/HG00099/alignment/HG00099.alt_bwamem_GRCh38DH.20150718.GBR.low_coverage.cram",
	"https://42basepairs.com/download/s3/1000genomes/data/HG00099/alignment/HG00099.alt_bwamem_GRCh38DH.20150718.GBR.low_coverage.cram.crai",
	"https://42basepairs.com/download/s3/1000genomes/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa",
	"https://42basepairs.com/download/s3/1000genomes/technical/reference/GRCh38_reference_genome/GRCh38_full_analysis_set_plus_decoy_hla.fa.fai",
]);
console.log(await CLI.exec("env REF_PATH=/shared/data/"))
console.log(await CLI.exec("env"))

const output = await CLI.exec(`samtools view -T ${path_fa} -t ${path_fai} ${path_cram} chr17:7,667,421-7,688,490`);
console.log(output);

And it only downloads a small subset of the files!

[image]

I guess because I used the wrong FAI URL, samtools was rebuilding the index by downloading the whole FASTA file 🤦 🤦 🤦
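
For completeness: the index needs to be the .fa.fai that actually matches the FASTA; if samtools cannot find a usable one it rebuilds it itself, which is presumably why the whole file was being read here. Locally, a matching index can be generated like this (path is a placeholder):

samtools faidx /path/to/GRCh38_full_analysis_set_plus_decoy_hla.fa   # writes the matching .fa.fai next to the FASTA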

@dnil
Author

dnil commented Nov 1, 2024

This is sweet; it should be enough for a start. I think it will be useful to test and merge the samtools 1.21 PR on biowasm anyway, to pick up the htslib CRAM fixes made after 1.17.

I imagine that for a "production" version one would want to solve the REF_CACHE issue as well, but as long as the user does not inspect too many loci it should be fine. The references are certainly smaller than the read piles. If it becomes an issue, it should be solvable with some form of locally shared, Perl-script-generated cache, or a remote one that is not the default EBI one.

So, how do we move forward? I could make a PR to start it off perhaps, but I have only skimmed the codebase so I am likely to need some help sooner or later.

@robertaboukhalil
Collaborator

robertaboukhalil commented Nov 1, 2024

Yes, I'll look at merging the samtools 1.21 version; there's an unexpected build error on previous versions, so I'll look into that more closely.

I imagined for a "production" version one would want to solve the REF_CACHE issue as well, but as long as the user does not inspect too many loci it should be fine

The browser should already cache requests when downloading subsets of FASTA files, so that shouldn't be an issue.

So, how do we move forward? I could make a PR to start it off perhaps

We don't have a lot of bandwidth right now to work on Ribbon, though we might later in the year. So feel free to work on it, but we can't guarantee we'll be able to help or review it in a timely manner :)
