Presence/absence matrix suggestion #30
Comments
The desired matrix can be produced using the normal AMRFinder output as input. |
Hi Irene, There are a few points you bring up. I'll discuss one-by-one:
We have over 4,600 gene symbols that could be printed, do you want a matrix with 4,000 0's in columns or 4,000 zeros in rows? I'm not sure I understand the matrix format you're describing. I can create a line of shell script to print out a uniq'd list of the gene symbols that are returned from AMRFinderPlus. Note that genes with "PARTIAL" should probably be excluded since they are less likely to be functional.
We do provide options to select minimum blast thresholds, but those options are primarily for testing and development, and they are not how we recommend running AMRFinderPlus. The decision tree AMRFinderPlus uses to select a gene symbol and name is a bit more complex and based on a tree of gene nomenclature including curated HMM searches, blastx, blastp, and blastn searches with curated blast cutoffs for some genes where the defaults are inappropriate.
So this is making me think that you mean a gene symbol per column maybe? Not all gene symbols are defined by sequences, some higher level nodes would be defined by HMMs. Based on how I often use AMRFinderPlus I'm thinking the following might be helpful to you and produce output a bit closer to what you're looking for. Here's a line of shell to get a list of gene symbols and names for an assembly on the command-line. Here I'm using the example sequence test_dna.fa.
To just get a list of gene symbols skipping partials internal to contigs:
Obviously much more sophisticated manipulations are possible with short shell scripts, but I'm not sure exactly what you're looking for. Thanks for your suggestions, |
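As a rough Python sketch of the gene-list idea in the comment above (this is not the shell one-liner from the original comment, which is not reproduced here): it reads AMRFinderPlus tab-delimited output and returns the unique gene symbols, skipping partial hits internal to contigs. The column names `Gene symbol` and `Method`, and the convention that internal partials have a method starting with `PARTIAL` while contig-end partials start with `PARTIAL_CONTIG_END`, are assumptions about the output format; adjust them to match your version.

```python
import csv


def gene_symbols(lines, symbol_col="Gene symbol", method_col="Method"):
    """Return the sorted, unique gene symbols from AMRFinderPlus TSV lines.

    Column names are assumptions about the output header; pass different
    names if your AMRFinderPlus version labels them differently.
    """
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    symbols = set()
    for row in reader:
        method = row[method_col]
        # Assumed convention: "PARTIALX"/"PARTIALP" are partial hits internal
        # to a contig; "PARTIAL_CONTIG_END..." hits are cut off by the contig
        # boundary and are kept.
        internal_partial = method.startswith("PARTIAL") and not method.startswith(
            "PARTIAL_CONTIG_END"
        )
        if not internal_partial:
            symbols.add(row[symbol_col])
    return sorted(symbols)
```

The function accepts any iterable of lines, so an open file handle works, e.g. `with open("amrfinder_output.tsv") as fh: print("\n".join(gene_symbols(fh)))`, where `amrfinder_output.tsv` is a placeholder file name.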
Sorry, I didn't explain that well. As the database is huge, a 1/0 matrix with every gene symbol would not be very intuitive because of the high number of rows and columns, as you said. But let me suggest another option for visualizing the results of several samples at the same time: for example, a data frame with the gene symbols for each Element type (columns) and each genome (rows). That information could be shown in different ways. For that, something would need to be added to the output so the tool knows which sequences correspond to each genome. Perhaps the input file names could be used. Here is an example of what I mean, AMR_matrix.txt, built from these results, AMR_output.txt (A, B and C are genome names). I joined several outputs into one file after adding a column to each single file with the file name, which corresponds to the genome ID. Such a column would be very useful in your output for users who want to join outputs into one file and still know which results correspond to each genome. I am just thinking of a way to visualize the results for a set of genomes and compare them quickly. Your tool would be excellent if you considered data visualization, too. Thanks for your consideration. |
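The joining step described in this comment (one output per genome, merged into a single file with an extra genome column) could be sketched in Python roughly as follows. The function name `combine_outputs` and the `Genome` column label are hypothetical choices for illustration, not part of AMRFinderPlus, and the sketch assumes every per-genome file has the same tab-delimited header.

```python
import csv


def combine_outputs(outputs, genome_col="Genome"):
    """Merge per-genome AMRFinderPlus TSV outputs into one table.

    `outputs` maps a genome ID (for example, derived from the input file
    name) to an iterable of TSV lines. Returns (header, rows), where each
    row is prefixed with its genome ID.
    """
    header = None
    rows = []
    for genome, lines in outputs.items():
        reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
        file_header = next(reader, None)
        if file_header is None:
            continue  # completely empty file: nothing to merge
        if header is None:
            header = [genome_col] + file_header
        elif header[1:] != file_header:
            raise ValueError(f"header of {genome!r} differs from earlier files")
        for row in reader:
            if row:  # skip blank lines
                rows.append([genome] + row)
    return header, rows
```

In practice the genome ID could be taken from each file name, for instance with `pathlib.Path(path).stem`, and the merged table written back out with `csv.writer(..., delimiter="\t")`.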
Hi Irene, Thanks for posting the examples. Now I understand what you're talking about. We do ourselves transform the AMRFinderPlus output to a format similar to what you are describing for the Pathogen Detection Isolates Browser, so I see the utility. I think this is probably better handled by post-processing the AMRFinderPlus output, with the possible exception of adding a 'sample name' option. I do a fair amount of AMRFinderPlus analysis myself, and certainly have my own ways of transforming the output to be useful to me. There have also been other requests for a friendlier output format (E.g., #25). We should create a page with scripts or command-lines to modify the output in different ways. For example here's a little awk script to get gene lists from a file like your AMR_output.txt:
Arjun |
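For readers who prefer Python, a post-processing step of the kind discussed in this comment might look like the sketch below. It is not the awk script from the original comment, which is not reproduced here. It takes a merged file with one genome column and builds a genome-by-element-type table of gene lists. The column names `Genome`, `Element type`, and `Gene symbol` are assumptions about the merged file's header.

```python
import csv
from collections import defaultdict


def genes_by_element_type(lines, genome_col="Genome",
                          type_col="Element type", symbol_col="Gene symbol"):
    """Return {genome: {element type: sorted gene symbols}} from TSV lines."""
    table = defaultdict(lambda: defaultdict(set))
    for row in csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        table[row[genome_col]][row[type_col]].add(row[symbol_col])
    return {
        genome: {etype: sorted(genes) for etype, genes in per_type.items()}
        for genome, per_type in table.items()
    }


def format_table(table):
    """Render the nested dict as TSV text: genomes in rows, types in columns."""
    types = sorted({etype for per_type in table.values() for etype in per_type})
    out = ["\t".join(["Genome"] + types)]
    for genome in sorted(table):
        # "-" marks an element type with no hits in this genome.
        cells = [",".join(table[genome].get(etype, [])) or "-" for etype in types]
        out.append("\t".join([genome] + cells))
    return "\n".join(out)
```

Keeping the grouping and the rendering separate makes it easy to swap the text table for another view (a spreadsheet export, a heatmap) without touching the parsing.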
Hi Arjun, That helped me, thanks. I had a look at #25 and that could be an enhancement, too. I also agree that some example command lines for modifying the output would help users with no programming experience, I'm sure. Irene |
I just use abricate --summary by translating the amrfinderplus output (which I generate per fasta file) to abricate compatible files
|
Once the output is created, it would be nice if a 1/0 matrix representing the presence/absence of each gene were generated. Users could select minimum coverage and identity thresholds for considering a gene as present. That way, results would be summarized and easy to read. You could include as many columns as you want from the original output as extra information in the matrix. What do you think?
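A minimal Python sketch of the requested 1/0 matrix, built by post-processing a merged output file rather than by AMRFinderPlus itself. The column names (`Genome`, `Gene symbol`, `% Coverage of reference sequence`, `% Identity to reference sequence`) are assumptions about the file header, and treating non-numeric values (assumed to occur for HMM-only hits) as passing the thresholds is a choice made here for illustration. Note that the maintainers' reply in this thread does not recommend custom thresholds as the normal way to interpret results, so the defaults below apply no filtering.

```python
import csv
from collections import defaultdict


def _passes(value, minimum):
    """True if `value` is >= minimum; non-numeric values are kept (assumption)."""
    try:
        return float(value) >= minimum
    except ValueError:
        return True


def presence_absence(lines, min_coverage=0.0, min_identity=0.0,
                     genome_col="Genome", symbol_col="Gene symbol",
                     cov_col="% Coverage of reference sequence",
                     ident_col="% Identity to reference sequence"):
    """Return (genes, matrix): matrix maps each genome to a list of 0/1 flags."""
    present = defaultdict(set)
    for row in csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        genes_here = present[row[genome_col]]  # registers genomes with no passing hit
        if _passes(row[cov_col], min_coverage) and _passes(row[ident_col], min_identity):
            genes_here.add(row[symbol_col])
    genes = sorted(set().union(*present.values()))
    matrix = {
        genome: [1 if gene in present[genome] else 0 for gene in genes]
        for genome in sorted(present)
    }
    return genes, matrix
```

With thousands of possible gene symbols, restricting the columns to genes actually observed in the input (as done here) keeps the matrix readable.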