Gene bed and gene info #12

pintoa1-mskcc · 2023-06-27T19:35:42Z

EDIT:

I think a better way to phrase this:

When a gene_name/symbol from NCBI in the gene info does not belong to the gene bed file (ensembl version outdated so synonyms changed/areas remapped/discovered,etc), a warning is spit out:

MetaFusion/scripts/pygeneann_MetaFusion.py

Lines 1792 to 1795 in 81df512

    
           if gene_name != "NA": 
        
               print >> sys.stderr, "Warnning: gene name", gene_name, "is not in current annotation." 
        
           return GeneInterval([])

What happens to the gene in this instance? We are seeing a lot of NoHeadGene/truncated annotations. Gene order will probably not change from the caller in these cases... (since score seems to be dependent on the annotation in the gene bed file) Is anything else effected?

pintoa1-mskcc · 2023-07-31T16:54:33Z

Hi again,

I've been investigating this issue and I see that when a gene_name does not exist in the gene_bed file, something does wrong with the gene order algorithm which assigns the fusion category/type. Do you all have any suggestions on how to find the issue?

mike8115 · 2023-07-31T17:18:01Z

Hey there,

Sorry for going dark for so long. This completely fell off my radar.

From the main Metafusion.sh script, it looks like gene_bed is used twice.

MetaFusion/scripts/MetaFusion.sh

Lines 96 to 102 in 81df512

    
           if [ $genome_fasta ]; then  
        
             echo Annotate cff, extract sequence surrounding breakpoint 
        
             python reann_cff_fusion.py --cff $cff --gene_bed $gene_bed --ref_fa $genome_fasta > $outdir/$(basename $cff).reann.WITH_SEQ 
        
           else  
        
             echo Annotate cff, no extraction of sequence surrounding breakpoint 
        
             python reann_cff_fusion.py --cff $cff --gene_bed $gene_bed > $outdir/$(basename $cff).reann.NO_SEQ 
        
           fi

MetaFusion/scripts/MetaFusion.sh

Lines 113 to 116 in 81df512

    
           if [ $annotate_exons -eq 1 ] && [ $exons -eq 1 ]; then 
        
             echo Add adjacent exons to cff 
        
             python $fusiontools/extract_closest_exons.py $cff $gene_bed $genome_fasta  > $outdir/$(basename $cff).exons 
        
           fi

That said, it's possible that the error is coming from downstream if there's a problem when reading the transformed data. Perhaps there's an uncaught case that's producing a default value.

Do you happen to know which output file is the first one affected? That might help us pinpoint which function/script needs to be looked at.

Cheers,
Michael

pintoa1-mskcc · 2023-07-31T17:37:16Z

The part that I saw having the effect is during reann_cff_fusion.py.
I think this is because of faulty logic during an if else statement here:

MetaFusion/scripts/pygeneann_MetaFusion.py

Lines 874 to 888 in 81df512

    
           for gname1 in genes1: 
        
               for bpann1 in genes1[gname1]: 
        
                   score1, is_on_boundary1, close_to_boundary1 = self.__cal_score(bpann1, "head")   
        
                   if score1 == max_t2[0] and gname1==self.t_gene1: 
        
                       max_t1 = score1, is_on_boundary1, close_to_boundary1, bpann1 
        
                   if score1 > max_t1[0]: 
        
                       max_t1 = score1, is_on_boundary1, close_to_boundary1, bpann1 
        
           for gname2 in genes2: 
        
               for bpann2 in genes2[gname2]: 
        
                   score2, is_on_boundary2, close_to_boundary2 = self.__cal_score(bpann2, "tail")   
        
                   #print(score2, is_on_boundary2, close_to_boundary2) 
        
                   if score2 == max_t2[0] and gname2==self.t_gene2: 
        
                       max_t2 = score2, is_on_boundary2, close_to_boundary2, bpann2 
        
                   elif score2 > max_t2[0]: 
        
                       max_t2 = score2, is_on_boundary2, close_to_boundary2, bpann2

Particularly:
if score1 == max_t2[0] and gname1==self.t_gene1:
During this step, if self.t_gene1 does not exist in the list of genes you are iterating over (genes1), then you will only ever change the gene order when the score increases, but if the score is always the lowest possible something seems to fall out...

Also I just noticed its an if -> if statement for score1, but elsif for score 2

mike8115 · 2023-07-31T19:54:26Z

Ah, I remember asking one of the authors about Line 877. Unfortunately, I never got a satisfactory response about that. The equality comparison between score1 and max_t2 looks like typo to me. As you mentioned previously, it would make more sense to compare to max_t1.

Also I just noticed its an if -> if statement for score1, but elsif for score 2

My guess is that there were multiple contributors and someone noticed the score1 == max_t2[0] expression. Using two separate ifs makes (slightly) more sense than an else-if... however, neither of them really makes a lot of sense to begin with. If we correct line 877, then both for-loops should work the same because the two if-conditions are mutually exclusive.

if self.t_gene1 does not exist in the list of genes you are iterating over (genes1), then you will only ever change the gene order when the score increases, but if the score is always the lowest possible something seems to fall out

You'd be correct in that regard. If all the scores are 0 and t_gene1 is not in the bed file, then it looks like the self.reann_gene1 becomes "". Perhaps there could be a check to prevent this, but I'm not sure what would be a good fallback value.

pintoa1-mskcc · 2023-08-01T14:18:19Z

Upon further analysis, I found that the issue I was facing was actually a dual issue:

If a gene name is not in the gene bed, all genes that are within range of the chromosome:breakpoint are analyzed. Gene name that creates the highest scoring gene order will be selected for reann gene. Any annotations from here on will be based on potentially different gene than originally called
if there is NO entry in the gene bed within range for the called chr:bp, no gene name will be selected for reann.

A flag for when these things occurs to tell the analyst that the fusion may not be the correct annotation in these cases would be very helpful.

pintoa1-mskcc · 2023-08-01T15:20:17Z

Additionally I think and gname1==self.t_gene1: this is unnecessary because earlier the tool filtered the input gene list to select the gene_name that is self.t_gene1 if self.t_gene1 is in the gene list.

mike8115 · 2023-08-02T19:28:58Z

A flag for when these things occurs to tell the analyst that the fusion may not be the correct annotation in these cases would be very helpful

I think that should be possible if you edit the code. Since that section of code is a method of CffFusion, you could probably create a new attribute for that flag. However, you'll also need to edit downstream to check for that flag. Alternatively, you could edit the gene name to include a unique suffix.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Gene bed and gene info #12

Gene bed and gene info #12

pintoa1-mskcc commented Jun 27, 2023 •

edited

Loading

pintoa1-mskcc commented Jul 31, 2023

mike8115 commented Jul 31, 2023

pintoa1-mskcc commented Jul 31, 2023 •

edited

Loading

mike8115 commented Jul 31, 2023

pintoa1-mskcc commented Aug 1, 2023

pintoa1-mskcc commented Aug 1, 2023

mike8115 commented Aug 2, 2023

Gene bed and gene info #12

Gene bed and gene info #12

Comments

pintoa1-mskcc commented Jun 27, 2023 • edited Loading

pintoa1-mskcc commented Jul 31, 2023

mike8115 commented Jul 31, 2023

pintoa1-mskcc commented Jul 31, 2023 • edited Loading

mike8115 commented Jul 31, 2023

pintoa1-mskcc commented Aug 1, 2023

pintoa1-mskcc commented Aug 1, 2023

mike8115 commented Aug 2, 2023

pintoa1-mskcc commented Jun 27, 2023 •

edited

Loading

pintoa1-mskcc commented Jul 31, 2023 •

edited

Loading