Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SEAB-5117: Display formatted notebook #1747

Closed

Conversation

svonworl
Copy link
Contributor

@svonworl svonworl commented Apr 3, 2023

Description
This PR displays a non-interactive rendition of a Jupyter notebook in the "Code" tab of its entry page. Our goal is to provide a human-readable representation that helps the user to understand what the notebook does, rather than to validate, verify integrity, or pinpoint problems. To that end, we intentionally keep the parsing loose and try to gracefully handle/ignore malformed/unknown constructs and other errors.

At the start of this ticket, I evaluated two existing options to format notebooks: nbconvert. rejected because it's python which we would've run on the server, and notebookjs, rejected because it felt a bit immature and didn't support at least one fundamental newer feature (attached images).

Primarily, a notebook consists of a series of "cells":

  • Markdown cells contain a notebook variant of Markdown
  • Code cells contain iPython source code and corresponding outputs.
  • Outputs typically represent text streams (stdout, etc), or richer objects which are represented by a "mime bundle" (see below).

Currently, this implementation displays images that are part of the notebook: that is, images which are completely embedded in the notebook as base64 blobs and referenced in the markdown (see below) or as a code cell output. See the screenshots for an example of such a notebook, the images within are attached to and 100%-included in the notebook. Does this run afoul of our policy on serving images from the Dockstore domain? Please discuss.

Jupyter notebooks support "attached images", which are referenced via the typical markdown image syntax, but with the url attachment:<id>. Attached images, as well as display_data code cell outputs, are encoded in the notebook as a "mime bundle", which represents a single resource by several mime types and maps each to the corresponding image data/text/etc. Our code selects the "best" type as the first match in a list of "allowed" mime types. All of the allowed types except for text/html, are purposefully inert [relatively speaking] to avoid security hijinx.

Notebook markdown also supports embedded TeX equations, which are delimited by $, such as:
$ E = mc^2 $

This implementation introduces a FormattedNotebookComponent that:

  1. retrieves the notebook (primary descriptor).
  2. parses the notebook json.
  3. converts the notebook cells to HTML, resulting in a list of divs with the following classes and contents:
  • markdown: rendered markdown
  • code: python source code
  • count: execution count (ex: [1]:)
  • output: output of a code cell.
  1. sets the innerHTML of the top level notebook div to the above HTML. Because the content is generated dynamically, the corresponding styles appear near the end of the the global css file, as rules for the class notebook .
  2. syntax highlights the code cell source div elements.
  3. formats the embedded equations in the markdown div elements.

Layout is in a two column grid, the left column of which contains the "count" divs and collapses to fit the width of its elements.

The code was designed so that as part of step 3 above, non-HTML user input is escaped, and all user input is sanitized via our existing markdown wrapper DOMPurify invocation. Thus, any user-supplied HTML, which can surface either in the markdown or in a code cell output, should be converted to a form as secure as the results of our historical markdown sanitization process.

We intentionally use type any to refer to the chunks of json that we propagate through the parser. The methods that process the chunks are written to fail gracefully if they don't have the expected structure...

We use Prism to syntax highlight the code cell source, and MathJax to format embedded equations, Both are applied on the appropriate DOM elements, near the end of the process. The Prism output is lightly sanitized, with an invocation that conserves the classes that Prism attaches to associate the code spans with styles. The input to MathJax is sanitized, but we don't sanitize the output with DOMPurify, because it mangles the math. Instead, we include the ui/safe module, and set up a non-permissive configuration which ostensibly should reduce the XSS potential: https://docs.mathjax.org/en/latest/options/safe.html

Sonarcloud complains about some regexps, but they seem ok to me. Please take a look and try to figure out how they could go wrong.

When Google Colab saves to github, it inserts a cell at the top of the notebook that contains a gaudy "view on colab" image and link. So as to maintain control of the "launch with" process, this implementation filters and does not display it. It remains in the notebook source file, of course.

In the compileMarkdown method, we do some markdown pre-processing to handle the vagaries of notebook-specific markdown, such as support for attached images and the unfortunate overlap of TeX/markdown syntax (for example, \\ has a special meaning as both a TeX construct and a markdown escape). There's likely a way to wire a solution deeper into marked and leverage context (are we in a code block? etc) to implement this more elegantly/correctly, but what's there works well enough that it's probably ok for now.

While this PR is in review, I'll write some followup tickets to improve the markdown preprocessing/handling, improve the styling, make some miscellaneous enhancements, and address other issues.

Review Instructions
Register some notebooks, pull up their entry pages, click their "Code" tabs, and confirm that they look ok. You can find some test notebooks in: https://github.com/svonworl/test-notebooks

Issue
https://ucsc-cgl.atlassian.net/browse/SEAB-5117

Screenshots

The "What is.." notebook:
Screen Shot 2023-04-04 at 10 39 42 AM

Example equation typesetting notebook:
Screen Shot 2023-04-04 at 10 30 18 AM

Image output in the ibm-mlb notebook:
Screen Shot 2023-04-04 at 10 32 31 AM

Security
We're displaying lots of user-supplied content, and we added a new 3rd party library to process it, so we need to be careful to avoid injection vulnerabilities (XSS).

Please make sure that you've checked the following before submitting your pull request. Thanks!

  • Check that your code compiles by running npm run build
  • Ensure that the PR targets the correct branch. Check the milestone or fix version of the ticket.
  • If this is the first time you're submitting a PR or even if you just need a refresher, consider reviewing our style guide
  • Do not bypass Angular sanitization (bypassSecurityTrustHtml, etc.), or justify why you need to do so
  • (see above) If displaying markdown, use the markdown-wrapper component, which does extra sanitization
  • Do not use cookies, although this may change in the future
  • Run npm audit and ensure you are not introducing new vulnerabilities
  • (see above) Do due diligence on new 3rd party libraries, checking for CVEs
  • (see above) Don't allow user-uploaded images to be served from the Dockstore domain
  • If this PR is for a user-facing feature, create and link a documentation ticket for this feature (usually in the same milestone as the linked issue). Style points if you create a documentation PR directly and link that instead.
  • Check whether this PR disables tests. If it legitimately needs to disable a test, create a new ticket to re-enable it in a specific milestone.

@svonworl svonworl self-assigned this Apr 4, 2023
@denis-yuen
Copy link
Member

See the screenshots for an example of such a notebook, the images within are attached to and 100%-included in the notebook. Does this run afoul of our policy on serving images from the Dockstore domain? Please discuss.

highlighting for @david4096

<p>For a Jupyter notebook, it'll be the contents of the code/documentation cells, nicely-formatted.</p>
<p>For other kinds of notebooks, it might be something different.</p>
</div>
<ng-template matTabContent>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A bit of bikeshedding, but after looking at the screenshots, "Code" seems ambiguous. After all, I would have guessed that the "Files" tab would display code.
This looks like a "Preview" like if you were editing a readme or "Rendered"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Preview" sounds good!

Copy link
Collaborator

@coverbeck coverbeck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put in a couple of comments, but I have a broader thought.

I think you should follow the more standard Angular way of doing things, in particular leveraging the HTML templates more, and doing less direct DOM manipulation. It would also be the same pattern our other components use.

Is your way more efficient and does it execute more quickly? Most likely.

But it's also different than how we approach other components. I know some of the 3rd party components we use, e.g., the code editor, are probably
doing something similar under the hood, and we don't see it because we're just using them as a component.

But all this JS code, particularly because we're going to be showing user input, is risky. It looks like you've tried to cover all cases, and maybe you have, but security stuff is hard, and it's better to follow best practices. See https://snyk.io/blog/angular-security-best-practices/

We intentionally use type any to refer to the chunks of json that we propagate through the parser. The methods that process the chunks are written to fail gracefully if they don't have the expected structure...

I think you should substantially reduce or eliminate the use of any. We have an aspiration (and a ticket) to get rid of any one day, and we should at least avoid introducing more uses of it in the meantime.

this.notebookTarget?.nativeElement.replaceChildren(); // Remove the current formatted notebook.
this.loading = true;
this.displayError = false;
// The next line cancels any previous request that is still in progess,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo in progress

*/
createFormattedNotebookElement(notebook: string): any {
// Create the notebook container div.
const element: any = this.document.createElement('div');
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't you have something like this in the HTML template:

<div class="notebook" *ngIf="<condition>" [innerHtml]="sanitizedHtml">
<!--More stuff...-->
</div>

I find this more readable than tracing through TypeScript code to see what HTML will be generated.

Also I think it would be preferable to not have the the [innerHtml] here as in my example, but flesh out the template more -- I just used this as an example since it's the first case I ran into.

Copy link
Contributor Author

@svonworl svonworl Apr 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couldn't you have something like this in the HTML template:

<div class="notebook" *ngIf="<condition>" [innerHtml]="sanitizedHtml">
<!--More stuff...-->
</div>

I find this more readable than tracing through TypeScript code to see what HTML will be generated.

Also I think it would be preferable to not have the the [innerHtml] here as in my example, but flesh out the template more -- I just used this as an example since it's the first case I ran into.

Would "More stuff" in the above example be the "Couldn't display the template" message? Or are you proposing that the HTML template itself should contain at least some of the logic that steps through the notebook cells and converts them to output?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would "More stuff" in the above example be the "Couldn't display the template" message? Or are you proposing that the HTML template itself should contain at least some of the logic that steps through the notebook cells and converts them to output?

I'm proposing that the HTML template itself contain more of the logic and HTML.

@sonarqubecloud
Copy link

sonarqubecloud bot commented Apr 5, 2023

SonarCloud Quality Gate failed.    Quality Gate failed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot E 2 Security Hotspots
Code Smell A 1 Code Smell

No Coverage information No Coverage information
0.0% 0.0% Duplication

@svonworl
Copy link
Contributor Author

svonworl commented Apr 5, 2023

I put in a couple of comments, but I have a broader thought.

I think you should follow the more standard Angular way of doing things, in particular leveraging the HTML templates more, and doing less direct DOM manipulation. It would also be the same pattern our other components use.

Is your way more efficient and does it execute more quickly? Most likely.

But it's also different than how we approach other components. I know some of the 3rd party components we use, e.g., the code editor, are probably doing something similar under the hood, and we don't see it because we're just using them as a component.

I understand what you are saying. Here's some observations and more thoughts.

Yes, it would be good to harness Angular's sanitizer in addition to DOMPurify, which the code is not doing currently. We could add this to the existing code by calling Angular's sanitizer directly, or via [innerHTML], but...

There are the following complications:

  1. Angular's sanitizer removes most of the formatting from MathJax output, rendering it unusable. The Angular sanitizer is not customizable, so if we use [innerHTML] in the HTML template, MathJax output dies, unless we can figure out a way to identify and conditionally bypass the sanitizer for only MathJax output (hard and risky).
  2. The easiest way to invoke Prism and MathJax is to point them at a DOM element to be styled. It may be possible to invoke them on HTML, but we'd probably need to implement some support code that would be relatively nitty-gritty and error-prone.

Assuming that we want to highlight syntax and format equations in our notebook preview, the upshot is that even if we transition to a "more Angular" approach, we'll still probably need to do some DOM manipulation.

Another observation is that as we go "more Angular", the processing logic doesn't go away, it migrates into the HTML templates.

In the continuum of possible improvements to this code that address Charles' concerns, I can see at least three distinct approaches:

  1. Lift the sanitization up the call stack so it happens near the end of the formatting process. Modify the current code so it generates a list of HTML chunks, each of which can be marked either safe (controlled input) or unsafe (user-supplied input). All chunks are sanitized, either lightly for safe HTML or heavily for unsafe HTML, to generate the final HTML, right before we assign to an element.

  2. Change the code so it dynamically generates Angular components, similarly to how it generates divs at the moment. In this solution, there would be Angular components that correspond to markdown cells, execution counts, code source, and outputs, but the processing logic would remain in the parent component class and be invoked by ngOnChanges.

  3. Move [most of] the processing code into Angular HTML templates, which will iterate over cells and delegate to subcomponents, which may in turn delegate to subcomponents...

I like Approach 1 and don't mind Approach 2. Approach 3 is going to be harder than it seems, because of some of our coding constraints (no functions in templates!) and the need to propagate required values down the "component stack". As mentioned before, none of the approaches entirely obviate the need for DOM manipulation, but Approaches 2 and 3 encapsulate it in the components that need it. The amount of boilerplate increases as we define more Angular components.

Then, there's the potential performance implications, which I'm glossing over at the moment. They may not matter, they may kill Approaches 2 and 3, or somewhere in between.

Comment on lines +1397 to +1553
white-space: pre;
}

.markdown,
.output {
padding: 0;
}

.markdown p:first-child {
margin-top: 0;
}

.markdown p:last-child {
margin-bottom: 0;
}

.source {
padding: 1em;
background-color: #fcf8fa;
border: solid 1px #f2edf0;
}

.source pre,
.output pre {
padding: 0;
margin: 0;
border: none;
background-color: inherit;
color: inherit;
}

// These rules target the html generated by the Prism syntax highlighter.
// This particular style was derived from a css file from the Prism distribution.
code {
.token.comment,
.token.prolog,
.token.doctype,
.token.cdata {
color: #8ac;
}

.token.punctuation {
opacity: 0.7;
}

.token.namespace {
opacity: 0.7;
}

.token.property,
.token.tag,
.token.boolean,
.token.number,
.token.constant,
.token.symbol {
color: #000;
}

.token.selector,
.token.attr-name,
.token.string,
.token.char,
.token.builtin,
.token.inserted {
color: #57a;
}

.token.operator,
.token.entity,
.token.url,
.token.variable {
color: #940;
}

.token.atrule,
.token.attr-value,
.token.keyword {
color: #060;
}

.token.regex,
.token.important {
color: #e90;
}

.token.variable,
.token.important,
.token.bold {
font-weight: bold;
}
.token.italic {
font-style: italic;
}

.token.deleted {
color: red;
}
}
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good comments here, suggestion: because this is a large chunk of styles all pertaining to notebooks, I think it would be a good idea to create a separate file like notebooks.scss and import it into styles.scss similar to featured-content.scss which also styles external html (just for separation since these styles won't be used elsewhere):

@import 'featured-content.scss';

@import is actually going to be phased out in a few years so it's recommended to use @use instead. Just tried it locally and styles rendered correctly. Eventually we should switch over our other @imports here as well

Copy link
Collaborator

@coverbeck coverbeck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the detailed explanation. Let me rephrase my thoughts, even though it's probably repetitive, just because it helps me flesh out things for myself.

  1. The generated HTML is currently a black box to me. The only way I can visualize it is by tracing through the TS code, and that's hard. I stopped at <div class="notebook"... :). Or by running it and looking at in browser tools (which I'm not setup to do, because I haven't set up a notebook yet). When I look at your style sheet, I have no idea how to correlate those styles to the HTML. I find this makes maintenance harder, both because we can't visualize the HTML, and because this component is so different than our others.
  2. There are the security concerns. Yes, we have to execute some arbitrary user code and create some DOM elements directly, but it's hard for me to tell just by looking at the TS code what's going on. And security stuff is hard; Angular has a team of programmers working on it, probably a dedicated security team, issues reported by users, and a lot of work and time put into it over the years (and they probably still have a bug or two :)). We're just a small team starting this feature from scratch. So I think it's important try to stay within the Angular framework as much as possible, and clearly demarcate when we are bypassing it.

Approach 3 is going to be harder than it seems, because of some of our coding constraints (no functions in templates!)

To clarify, this is not a weird Dockstore team constraint, it's a standard Angular best practice. And can be done with a pipe. I don't think this in itself is a significant obstacle, although there may be other issues.

and the need to propagate required values down the "component stack"

Perhaps/probably I'm not getting it, and that may be a function of me not having a sense of the HTML. I feel like it wouldn't be that hard to have at least have some "scaffolding" HTML, but maybe I'm wrong.

Now that I've laid out my thoughts and concerns, I'll defer to your judgement. I'm not hands-on enough with this to assertively be able to say which way is best.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depending on the solution based on our other discussion, can you please add a test to show arbitrary JS code is not executed? Standard is to verify an alert() doesn't execute, although I don't know how you'd do that in a test.

<div class="mt-4 ml-4 mr-4">
<app-formatted-notebook
*ngIf="workflow && selectedVersion && (extendedWorkflow$ | async)"
[workflow]="workflow"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IDE is giving me compiler errors for this and next line. We have this turned off at build time because there are too many of the errors, sigh, but we should try to avoid introducing new ones. Might be tricky here though.

@svonworl svonworl removed request for david4096 and kathy-t April 6, 2023 17:16
@svonworl svonworl marked this pull request as draft April 6, 2023 17:18
@denis-yuen denis-yuen changed the base branch from develop to release/1.14 April 20, 2023 15:02
@svonworl svonworl closed this May 10, 2023
@denis-yuen denis-yuen deleted the feature/seab-5117/display-formatted-notebook branch December 4, 2023 15:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants