VADR predicted nested genes, prevents submission to ENA #54

Open
taltman opened this issue Dec 23, 2021 · 4 comments

taltman commented Dec 23, 2021

ENA's validation rejected my submission, apparently because of this part of the feature table:

19094   20750   gene
                        gene    N
19094   20750   CDS
                        product nucleocapsid phosphoprotein
                        protein_id      NODE_1_length_10623_cov_925.238_7
19115   19838   gene
                        gene    N2
19115   19838   CDS
                        product nucleocapsid phosphoprotein 2
                        protein_id      NODE_1_length_10623_cov_925.238_8

Is this the intended behavior of VADR?

@taltman changed the title from "VADR predicted nested genes" to "VADR predicted nested genes, prevents submission to ENA" on Dec 23, 2021
@nawrockie
Member

What was the issue exactly? The protein_id values? If so, there's a --noprotid option that will get rid of them. If it's not that, let me know what the problem is; there may be a way around it.

@nawrockie
Member

Ah, I see from the issue title that the problem is that they are nested. Can you send me the .minfo file used with v-annotate.pl?

@taltman
Author

taltman commented Dec 27, 2021

Hi @nawrockie, I'm using the pan-Coronavirus model, version 1.3:

Please let me know if I misunderstood what you were asking for. Thanks!

@nawrockie
Member

nawrockie commented Dec 29, 2021

It looks like the best-matching model for your sequence must be NC_006577, because that is the only model with an N2 gene. The NC_006577 RefSeq has N2 nested within N, as shown in the .minfo file, so that's why VADR is annotating it in your sequence:

FEATURE NC_006577 type:"gene" coords:"28320..29645:+" parent_idx_str:"GBNULL" gene:"N"
FEATURE NC_006577 type:"CDS" coords:"28320..29645:+" parent_idx_str:"GBNULL" gene:"N" product:"nucleocapsid phosphoprotein"
FEATURE NC_006577 type:"gene" coords:"28342..28959:+" parent_idx_str:"GBNULL" gene:"N2"
FEATURE NC_006577 type:"CDS" coords:"28342..28959:+" parent_idx_str:"GBNULL" gene:"N2" product:"nucleocapsid phosphoprotein 2"
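For anyone who wants to check a .minfo file for this situation before annotating, here is a rough Python sketch (not part of VADR). It assumes each FEATURE line carries key:"value" pairs like the four lines above and that coords look like `start..stop:strand`, possibly comma-separated for multi-segment features. The function names `parse_feature`, `span`, and `nested_genes` are made up for this illustration, and strand is ignored.

```python
import re
from typing import Dict, List, Tuple

# The four FEATURE lines quoted above, copied verbatim.
MINFO_LINES = [
    'FEATURE NC_006577 type:"gene" coords:"28320..29645:+" parent_idx_str:"GBNULL" gene:"N"',
    'FEATURE NC_006577 type:"CDS" coords:"28320..29645:+" parent_idx_str:"GBNULL" gene:"N" product:"nucleocapsid phosphoprotein"',
    'FEATURE NC_006577 type:"gene" coords:"28342..28959:+" parent_idx_str:"GBNULL" gene:"N2"',
    'FEATURE NC_006577 type:"CDS" coords:"28342..28959:+" parent_idx_str:"GBNULL" gene:"N2" product:"nucleocapsid phosphoprotein 2"',
]


def parse_feature(line: str) -> Dict[str, str]:
    """Return the key:"value" pairs of one FEATURE line as a dict."""
    return dict(re.findall(r'(\w+):"([^"]*)"', line))


def span(coords: str) -> Tuple[int, int]:
    """Outer (min, max) positions of a coords string; strand is ignored."""
    positions = [int(p) for p in re.findall(r"\d+", coords)]
    return min(positions), max(positions)


def nested_genes(lines: List[str]) -> List[Tuple[str, str]]:
    """Return (inner, outer) gene-name pairs where inner lies within outer."""
    genes = []
    for line in lines:
        if not line.startswith("FEATURE"):
            continue
        feature = parse_feature(line)
        if feature.get("type") == "gene":
            genes.append((feature.get("gene", "?"), span(feature["coords"])))

    pairs = []
    for inner_name, (a, b) in genes:
        for outer_name, (c, d) in genes:
            # Contained in the other span, and not the identical span.
            if inner_name != outer_name and c <= a and b <= d and (a, b) != (c, d):
                pairs.append((inner_name, outer_name))
    return pairs


if __name__ == "__main__":
    print(nested_genes(MINFO_LINES))
```

On the lines quoted above this reports N2 as nested within N, which matches the annotation VADR produced.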

If ENA does not allow nested CDS and gene features in submissions, you can either remove the N2 annotations manually from your .tbl file, or make a new .minfo file for VADR with N2 removed and use it to redo the annotation, whichever is easier.
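If you would rather script the first option than edit by hand, the following Python sketch shows one possible way to do it. It is an illustration under assumptions, not a VADR feature: it assumes the usual tab-delimited feature table layout (feature lines start in the first column with start, stop, and feature key; qualifier lines are indented), and, because the CDS in the excerpt from this issue has no gene qualifier of its own, it drops every feature that shares the exact coordinates of the named gene. The function names `split_features`, `feature_span`, and `drop_gene` are hypothetical.

```python
from typing import List, Tuple

# Excerpt from this issue, with tabs restored (assumed layout).
EXCERPT = """\
19094\t20750\tgene
\t\t\tgene\tN
19094\t20750\tCDS
\t\t\tproduct\tnucleocapsid phosphoprotein
\t\t\tprotein_id\tNODE_1_length_10623_cov_925.238_7
19115\t19838\tgene
\t\t\tgene\tN2
19115\t19838\tCDS
\t\t\tproduct\tnucleocapsid phosphoprotein 2
\t\t\tprotein_id\tNODE_1_length_10623_cov_925.238_8
"""


def split_features(tbl_text: str) -> List[List[str]]:
    """Group lines into blocks: a '>' header line, or a feature plus its qualifiers."""
    blocks: List[List[str]] = []
    for line in tbl_text.splitlines():
        if not line.strip():
            continue  # blank lines are dropped
        is_header = line.startswith(">")
        # A feature line starts in column 1 and has start, stop, and a feature key.
        is_feature = (not is_header and not line[0].isspace()
                      and len(line.split()) >= 3)
        if is_header or is_feature or not blocks:
            blocks.append([line])
        else:
            blocks[-1].append(line)  # qualifier or extra-segment line
    return blocks


def feature_span(block: List[str]) -> Tuple[str, ...]:
    """First two columns (start, stop) of a block's first line, as strings."""
    return tuple(block[0].split()[:2])


def drop_gene(tbl_text: str, gene_name: str) -> str:
    """Remove the named gene and every feature with the same start/stop."""
    blocks = split_features(tbl_text)
    doomed = set()
    for block in blocks:
        head = block[0].split()
        if block[0].startswith(">") or len(head) < 3 or head[2] != "gene":
            continue
        for qualifier in block[1:]:
            fields = qualifier.split()
            if len(fields) >= 2 and fields[0] == "gene" and fields[1] == gene_name:
                doomed.add(feature_span(block))
    kept = [b for b in blocks
            if b[0].startswith(">") or feature_span(b) not in doomed]
    return "\n".join(line for b in kept for line in b) + "\n"


if __name__ == "__main__":
    print(drop_gene(EXCERPT, "N2"), end="")
```

Matching on coordinates is a heuristic: it would also drop an unrelated feature that happens to share the same start and stop, so it is worth checking the filtered table before submitting.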

Let me know if that addresses your question or not.
