Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(#741): Pretty print #742

Open
wants to merge 33 commits into
base: main
Choose a base branch
from

Conversation

andreasprlic
Copy link
Member

@andreasprlic andreasprlic commented Jun 10, 2024

This (draft) PR is an attempt to add a context-visualization framework on top of the tooling that we already have in hgvs. Unless somebody has a better suggestion for a name, for now I called it "pretty_print".

Features:

  1. It provides a datacompiler class that merges together all the data needed for the visualization. (no need for uta_align here). This could be used e.g. by a future effort to developer alternative visualisations such as SVG graphics, or hooked up on a web site.
  2. For now there is only a text-based rendering framework that has been added here. It offers a set of elemental "renderer" objects that are responsible for providing the text for one line in the final text.
  3. There are several unit tests that contain variants with various features. They are also the reason why this PR is still only in draft. The fixture that gets compiled for this unit test (tests/cache-py3.hdp) becomes really big and some suggestions for what to do about this would be helpful. Perhaps disable most of the tests? Or offer an ipython runbook with examples? Any feedback/suggestion is welcome here.
  4. There's a configuration class where some parameters around the display can be modified
  5. On a different thread we were discussing improvements around repeats. I wrote a basic repeat-detection script, and added the result of that as a FYI to the visualization here too, so we can see how well this works.

Examples:

In [1]: hgvs_g = "NC_000005.10:g.123346517_123346518insATTA"

In [2]: var_g = parse(hgvs_g)

In [3]: print(pretty.display(var_g))
hgvs_g    : NC_000005.10:g.123346517_123346518insATTA
hgvs_c    : NM_001166226.1:c.*1_*2insTAAT
hgvs_p    : NP_001159698.1:p.?
         :   123,346,500         123,346,520         123,346,540
chrom pos :   |    .    |    .    |    .    |    .    |    .
seq    -> : ATAAAGCTTTTCCAAATGTTATTAATTACTGGCATTGCTTTTTGCCAA
region    :                     |------|
tx seq <- : TATTTCGAAAAGGTTTACAATAATTAATGACCGTAACGAAAAACGGTT
tx pos    :  |    .    |    .   |   |    .    |    .    |
         :  *20       *10      *1  2880      2870      2860
aa seq <- :                      TerAsnSerAlaAsnSerLysAlaLeu
aa pos    :                         |||            ...
         :                         960
ref>alt   : ATTA[2]>ATTA[3]

Here also a screenshot how on the terminal the text can be color coded:

Screenshot 2024-06-30 at 22 53 19

…ranscripts, new option to show reverse-chrom strand sequence.
# Conflicts:
#	tests/data/cache-py3.hdp
@andreasprlic andreasprlic marked this pull request as ready for review July 1, 2024 05:54
@andreasprlic andreasprlic requested a review from a team as a code owner July 1, 2024 05:54
@andreasprlic
Copy link
Member Author

Note: this PR adds two new unit tests. They are pulling a lot of data from UTA and seqrepo, and as such they are slow. Should I tag them and exclude from the CI runs? Also, I did not want to upload an updated cache file for these tests, because the size of the test cache would have jumped from 1MB to ~600MB.

Copy link

github-actions bot commented Aug 1, 2024

This PR is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the stale Issue is stale and subject to automatic closing label Aug 1, 2024
@davmlaw
Copy link
Contributor

davmlaw commented Aug 1, 2024

Wow, that's pretty neat and it's impressive you were able to do all of that without bringing in any dependencies (other than test framework)

From a very quick skim, it looks like you define your own data classes and do a lot of your own maths - I guess the worry would be if HGVS code is wrong, or yours is, or they're just different, then it would render differently than what's internally going on under the hood. Maybe there's no other way to do it, not sure - I'll try and block out some time next week to look over it

@github-actions github-actions bot removed the stale Issue is stale and subject to automatic closing label Aug 2, 2024
@andreasprlic
Copy link
Member Author

andreasprlic commented Aug 12, 2024

Yes, the datacompilerclass is doing some calculations, but it is pretty much based around the internals of the hgvs library.

One idea is that there could be other renderers that are not text based and that could e.g. create SVG graphics, or other types of visualizations. To enable this, this introduces a model-view-control pattern:

The view (renderer) should be decoupled from the data. Therefore the newly added models here are basic dataclasses to store the knowledge for one specific position in the final display. The renderer classes are using these dataclasses to print the output. For the controller (datacompiler), the goal is to stitch together all the data around the context of a variant and store the results in the models.

One addition to this is the repeats class. This is derived from a conversation ( #113 ) about improving the support in the hgvs library for variants in repetitive regions. The current core of the library won't be easily able to detect repeats in my opinion, but the fully-justified data calculations that are happening as part of the datacompiler make it relatively straightforward to identify basic repeats (that are not interrupted). I found it insightful for my own purposes to add this to the provided output in pretty_print and as such this ended up in here.

@reece
Copy link
Member

reece commented Aug 26, 2024

From 8/26 maintainers' meeting: @andreasprlic: yes, please tag to exclude CI testing.

Also, we decided to keep the similar function https://github.com/biocommons/hgvs/blob/main/src/hgvs/utils/context.py for now. If someone cares, we can combine functionality later.

Copy link
Member

@reece reece left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is amazing. It's way better than src/hgvs/utils/context.py. Thank you.

I have a minor suggestion/request for discussion: Having hgvs.pretty_print and hgvs.pretty as siblings seems a bit confusing, and leads to (IMO) oddities like the renderer importing from a pretty_print.py.

Instead, I think a better structure might be to:

  • put everything under pretty
  • rename renderer to console
  • put the color constants under console

I think that will create a better separate of the backend logic (immediately under pretty/) and the console renderer (under pretty/console/).

What say you?

@reece reece linked an issue Sep 2, 2024 that may be closed by this pull request
Copy link

gitguardian bot commented Sep 2, 2024

⚠️ GitGuardian has uncovered 1 secret following the scan of your pull request.

Please consider investigating the findings and remediating the incidents. Failure to do so may lead to compromising the associated services or software components.

Since your pull request originates from a forked repository, GitGuardian is not able to associate the secrets uncovered with secret incidents on your GitGuardian dashboard.
Skipping this check run and merging your pull request will create secret incidents on your GitGuardian dashboard.

🔎 Detected hardcoded secret in your pull request
GitGuardian id GitGuardian status Secret Commit Filename
7486813 Triggered PostgreSQL Credentials d4f2ac4 docs/contributing.rst View secret
🛠 Guidelines to remediate hardcoded secrets
  1. Understand the implications of revoking this secret by investigating where it is used in your code.
  2. Replace and store your secret safely. Learn here the best practices.
  3. Revoke and rotate this secret.
  4. If possible, rewrite git history. Rewriting git history is not a trivial act. You might completely break other contributing developers' workflow and you risk accidentally deleting legitimate data.

To avoid such incidents in the future consider


🦉 GitGuardian detects secrets in your source code to help developers and security teams secure the modern development process. You are seeing this because you or someone else with access to this repository has authorized GitGuardian to scan your pull request.

@andreasprlic
Copy link
Member Author

... with the most recent changes, I believe we are ready for another round of reviews.

@andreasprlic
Copy link
Member Author

In the latest revision, the default view is always to show transcripts 5'->3'. That means for reverse strand transcripts, we are flipping the orientation for the forward-strand. (behavior can be enabled/disabled with a config property)

Screenshot 2024-10-02 at 21 43 49

Copy link

github-actions bot commented Nov 3, 2024

This PR is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the stale Issue is stale and subject to automatic closing label Nov 3, 2024
@ahwagner ahwagner removed the stale Issue is stale and subject to automatic closing label Nov 3, 2024
@ahwagner
Copy link
Member

ahwagner commented Nov 3, 2024

Removing stale for now and re-requesting review from prior reviewers. @reece and @davmlaw please take another look and confirm that your requested changes have been adequately addressed.

@davmlaw
Copy link
Contributor

davmlaw commented Nov 13, 2024

@andreasprlic I've been playing a bit and I can see this being really useful!

A few nit-picks:

Suggest changing PrettyPrint.__init__ to be initialised with AssemblyMapper

Advantages are to support any future builds (eg T2T), or DataProviders that only support 1 assembly

At the moment, you instantiate 2 assembly mappers for hard-coded 37/38 and then only pick one, using default_assembly. "default" is not really right - default is usually for a fall back, but this is the only choice ever used, so you only ever use 1 of the AssemblyMappers. If you passed in "GRCh37.p13" (or anything that doesn't match "GRCh37") it would give you the GRCh38 one due to the else

So - change would be to remove default_assembly and alt_aln_method, replace with assembly_mapper. I think it's ok to be explicit on what you want, but if you want to add a default, you could use Optional[AssemblyMapper] = None. Then if assembly_mapper is None, create one using 37/splign one

This can then remove all the code that tests default_assembly

Off by 1 error at very start of + strand transcript sequence

This appears OK with - strand, but on + strand if you pick the 1st base of transcript, eg "NM_198689.2:n.1C>G" it appears the sequence only start rendering 1 base after where it should, eg NM_198689.2 starts with CATCTCCTCCAG (so missing "C") below:

In [79]: print(pp.display(var_c))
hgvs_g    : NC_000021.9:g.44600597C>G
hgvs_c    : NM_198689.2(KRTAP10-7):c.-25C>G
hgvs_p    : NP_941962.1:p.?
          :    44,600,590          44,600,610          44,600,630
chrom pos :    |    .    |    .    |    .    |    .    |    .    |    .  
seq    -> : CACTCACTCCCATCTCCTCCAGTTCAATCCCCAGCATGGCTGCGTCCACTATGTCTGTCTG
region    :           G                                                  
tx seq -> :            atctcctccagttcaatccccagcATGGCTGCGTCCACTATGTCTGTCTG
tx pos    :                |    .    |    .    |   .    |    .    |    . 
          :                -20       -10       1        10        20
aa seq -> :                                    MetAlaAlaSerThrMetSerValCy
aa pos    :                                                ...           
          :                                    1                        

Exception with intergenic g.HGVS

hgvs/pretty/datacompiler.py:354, in DataCompiler.data(self, var_g, var_c_or_n, display_start, display_end)
    351     pd = reversed(position_details)
    352     position_details = list(pd)
--> 354 is_rna = var_c_or_n.ac.startswith("NR_") # not sure how to check this for ENSTs

AttributeError: 'NoneType' object has no attribute 'ac'

fix would be to replace it with is_rna = var_c_or_n and var_c_or_n.ac.startswith("NR_")

On the comment not sure how to check this for ENSTs - perhaps it's better to just store whether the type is 'n' or 'c'? That's what ultimately you care about in hgvs.pretty.console.tx_pos.TxRulerRenderer.display

            c_pos = pdata.c_pos
            interval = pdata.c_interval
            if is_rna:
                c_pos = pdata.n_pos
                interval = pdata.n_interval

…ariants, and only one assembly mapper on pretty print.
@andreasprlic
Copy link
Member Author

andreasprlic commented Nov 16, 2024

Thanks, these are great comments. Thank you for looking into the details of this. Luckily this was easy to address. PrettyPrint now only has one assembly mapper, and I fixed the problem with the first base on the transcript, as well as with intergenic variants. Also added unit tests for these two issues to the test_pretty_print (even if it is disabled for the CI, but it works locally).

@andreasprlic
Copy link
Member Author

@reece any more requests for changes from your side?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Context view improvements
4 participants