Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Gene bed and gene info #12

Open
pintoa1-mskcc opened this issue Jun 27, 2023 · 7 comments
Open

Gene bed and gene info #12

pintoa1-mskcc opened this issue Jun 27, 2023 · 7 comments

Comments

@pintoa1-mskcc
Copy link

pintoa1-mskcc commented Jun 27, 2023

EDIT:

I think a better way to phrase this:

When a gene_name/symbol from NCBI in the gene info does not belong to the gene bed file (ensembl version outdated so synonyms changed/areas remapped/discovered,etc), a warning is spit out:

if gene_name != "NA":
print >> sys.stderr, "Warnning: gene name", gene_name, "is not in current annotation."
return GeneInterval([])

What happens to the gene in this instance? We are seeing a lot of NoHeadGene/truncated annotations. Gene order will probably not change from the caller in these cases... (since score seems to be dependent on the annotation in the gene bed file) Is anything else effected?

@pintoa1-mskcc
Copy link
Author

Hi again,

I've been investigating this issue and I see that when a gene_name does not exist in the gene_bed file, something does wrong with the gene order algorithm which assigns the fusion category/type. Do you all have any suggestions on how to find the issue?

@mike8115
Copy link
Member

Hey there,

Sorry for going dark for so long. This completely fell off my radar.

From the main Metafusion.sh script, it looks like gene_bed is used twice.

if [ $genome_fasta ]; then
echo Annotate cff, extract sequence surrounding breakpoint
python reann_cff_fusion.py --cff $cff --gene_bed $gene_bed --ref_fa $genome_fasta > $outdir/$(basename $cff).reann.WITH_SEQ
else
echo Annotate cff, no extraction of sequence surrounding breakpoint
python reann_cff_fusion.py --cff $cff --gene_bed $gene_bed > $outdir/$(basename $cff).reann.NO_SEQ
fi

if [ $annotate_exons -eq 1 ] && [ $exons -eq 1 ]; then
echo Add adjacent exons to cff
python $fusiontools/extract_closest_exons.py $cff $gene_bed $genome_fasta > $outdir/$(basename $cff).exons
fi

That said, it's possible that the error is coming from downstream if there's a problem when reading the transformed data. Perhaps there's an uncaught case that's producing a default value.

Do you happen to know which output file is the first one affected? That might help us pinpoint which function/script needs to be looked at.

Cheers,
Michael

@pintoa1-mskcc
Copy link
Author

pintoa1-mskcc commented Jul 31, 2023

The part that I saw having the effect is during reann_cff_fusion.py.
I think this is because of faulty logic during an if else statement here:

for gname1 in genes1:
for bpann1 in genes1[gname1]:
score1, is_on_boundary1, close_to_boundary1 = self.__cal_score(bpann1, "head")
if score1 == max_t2[0] and gname1==self.t_gene1:
max_t1 = score1, is_on_boundary1, close_to_boundary1, bpann1
if score1 > max_t1[0]:
max_t1 = score1, is_on_boundary1, close_to_boundary1, bpann1
for gname2 in genes2:
for bpann2 in genes2[gname2]:
score2, is_on_boundary2, close_to_boundary2 = self.__cal_score(bpann2, "tail")
#print(score2, is_on_boundary2, close_to_boundary2)
if score2 == max_t2[0] and gname2==self.t_gene2:
max_t2 = score2, is_on_boundary2, close_to_boundary2, bpann2
elif score2 > max_t2[0]:
max_t2 = score2, is_on_boundary2, close_to_boundary2, bpann2

Particularly:
if score1 == max_t2[0] and gname1==self.t_gene1:
During this step, if self.t_gene1 does not exist in the list of genes you are iterating over (genes1), then you will only ever change the gene order when the score increases, but if the score is always the lowest possible something seems to fall out...

Also I just noticed its an if -> if statement for score1, but elsif for score 2

@mike8115
Copy link
Member

Ah, I remember asking one of the authors about Line 877. Unfortunately, I never got a satisfactory response about that. The equality comparison between score1 and max_t2 looks like typo to me. As you mentioned previously, it would make more sense to compare to max_t1.

Also I just noticed its an if -> if statement for score1, but elsif for score 2

My guess is that there were multiple contributors and someone noticed the score1 == max_t2[0] expression. Using two separate ifs makes (slightly) more sense than an else-if... however, neither of them really makes a lot of sense to begin with. If we correct line 877, then both for-loops should work the same because the two if-conditions are mutually exclusive.

if self.t_gene1 does not exist in the list of genes you are iterating over (genes1), then you will only ever change the gene order when the score increases, but if the score is always the lowest possible something seems to fall out

You'd be correct in that regard. If all the scores are 0 and t_gene1 is not in the bed file, then it looks like the self.reann_gene1 becomes "". Perhaps there could be a check to prevent this, but I'm not sure what would be a good fallback value.

@pintoa1-mskcc
Copy link
Author

Upon further analysis, I found that the issue I was facing was actually a dual issue:

  • If a gene name is not in the gene bed, all genes that are within range of the chromosome:breakpoint are analyzed. Gene name that creates the highest scoring gene order will be selected for reann gene. Any annotations from here on will be based on potentially different gene than originally called

  • if there is NO entry in the gene bed within range for the called chr:bp, no gene name will be selected for reann.

A flag for when these things occurs to tell the analyst that the fusion may not be the correct annotation in these cases would be very helpful.

@pintoa1-mskcc
Copy link
Author

Additionally I think and gname1==self.t_gene1: this is unnecessary because earlier the tool filtered the input gene list to select the gene_name that is self.t_gene1 if self.t_gene1 is in the gene list.

@mike8115
Copy link
Member

mike8115 commented Aug 2, 2023

A flag for when these things occurs to tell the analyst that the fusion may not be the correct annotation in these cases would be very helpful

I think that should be possible if you edit the code. Since that section of code is a method of CffFusion, you could probably create a new attribute for that flag. However, you'll also need to edit downstream to check for that flag. Alternatively, you could edit the gene name to include a unique suffix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants