SEAB-5117: Display formatted notebook #1747

svonworl · 2023-04-03T21:17:18Z

Description
This PR displays a non-interactive rendition of a Jupyter notebook in the "Code" tab of its entry page. Our goal is to provide a human-readable representation that helps the user to understand what the notebook does, rather than to validate, verify integrity, or pinpoint problems. To that end, we intentionally keep the parsing loose and try to gracefully handle/ignore malformed/unknown constructs and other errors.

At the start of this ticket, I evaluated two existing options to format notebooks: nbconvert, rejected because it's Python, which we would have had to run on the server, and notebookjs, rejected because it felt a bit immature and didn't support at least one fundamental newer feature (attached images).

Primarily, a notebook consists of a series of "cells":

Markdown cells contain a notebook variant of Markdown
Code cells contain IPython source code and corresponding outputs.
Outputs typically represent text streams (stdout, etc), or richer objects which are represented by a "mime bundle" (see below).
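
As a rough illustration of the structure described above, here is a partial TypeScript model of the notebook JSON, loosely following the nbformat v4 layout. The type names and the `summarizeCell` helper are hypothetical and are not taken from this PR; other cell and output kinds (for example, `error` outputs) are omitted.

```typescript
// A loose, partial model of the notebook JSON (a sketch, not the PR's types).
// Notebook text fields may be a single string or a list of lines.
type MultilineText = string | string[];

interface StreamOutput {
  output_type: 'stream';
  name: string; // typically 'stdout' or 'stderr'
  text: MultilineText;
}

interface BundleOutput {
  output_type: 'display_data' | 'execute_result';
  data: { [mimeType: string]: unknown }; // the "mime bundle"
}

type Output = StreamOutput | BundleOutput;

interface MarkdownCell {
  cell_type: 'markdown';
  source: MultilineText;
  // Attached images: attachment name -> mime bundle of base64 data.
  attachments?: { [name: string]: { [mimeType: string]: string } };
}

interface CodeCell {
  cell_type: 'code';
  source: MultilineText;
  execution_count: number | null;
  outputs: Output[];
}

type Cell = MarkdownCell | CodeCell;

// Hypothetical helper: a one-line summary of a cell, useful when debugging.
function summarizeCell(cell: Cell): string {
  if (cell.cell_type === 'markdown') {
    return 'markdown';
  }
  const kinds = cell.outputs.map((o) => o.output_type).join(', ');
  const count = cell.execution_count === null ? ' ' : String(cell.execution_count);
  return 'code [' + count + ']: ' + (kinds || 'no outputs');
}
```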

Currently, this implementation displays images that are part of the notebook: that is, images which are completely embedded in the notebook as base64 blobs and referenced in the markdown (see below) or as a code cell output. See the screenshots for an example of such a notebook; the images within are attached to and 100%-included in the notebook. Does this run afoul of our policy on serving images from the Dockstore domain? Please discuss.

Jupyter notebooks support "attached images", which are referenced via the typical markdown image syntax, but with the URL attachment:<id>. Attached images, as well as display_data code cell outputs, are encoded in the notebook as a "mime bundle", which represents a single resource by several mime types and maps each to the corresponding image data/text/etc. Our code selects the "best" type as the first match in a list of "allowed" mime types. All of the allowed types, except for text/html, are purposefully inert [relatively speaking] to avoid security hijinks.
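
The "first match in an allow-list" selection could look roughly like the sketch below. The contents and ordering of `ALLOWED_MIME_TYPES` and the name `selectBestMimeType` are assumptions made for illustration; the PR's actual list is not reproduced here.

```typescript
// Hypothetical allow-list, in order of preference (an assumption, not the PR's list).
const ALLOWED_MIME_TYPES: string[] = ['image/png', 'image/jpeg', 'image/gif', 'text/html', 'text/plain'];

type MimeBundle = { [mimeType: string]: unknown };

// Returns the first allowed mime type that the bundle contains, or undefined
// if the bundle is malformed or contains no allowed type.
function selectBestMimeType(bundle: unknown, allowed: string[] = ALLOWED_MIME_TYPES): string | undefined {
  if (typeof bundle !== 'object' || bundle === null) {
    return undefined; // Fail gracefully on malformed input.
  }
  const candidate = bundle as MimeBundle;
  for (const mimeType of allowed) {
    if (Object.prototype.hasOwnProperty.call(candidate, mimeType)) {
      return mimeType;
    }
  }
  return undefined;
}
```

Because the list is walked in preference order, a bundle that offers both a PNG and a plain-text fallback resolves to the PNG, and a bundle that offers only disallowed types (say, application/javascript) resolves to nothing and is skipped.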

Notebook markdown also supports embedded TeX equations, which are delimited by $, such as:
$ E = mc^2 $
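
To make the delimiter handling concrete, here is a small sketch of how single-dollar math spans might be located while honoring backslash-escaped dollars. It is an illustration only: `findInlineMath` is a hypothetical name, and the sketch ignores display math ($$...$$) and markdown code blocks, which a real preprocessor would need to consider.

```typescript
interface MathSpan {
  start: number; // index of the opening '$'
  end: number;   // index just past the closing '$'
  tex: string;   // text between the delimiters
}

// Scans for $...$ spans. A backslash escapes the next character,
// so "\$" is never treated as a delimiter.
function findInlineMath(markdown: string): MathSpan[] {
  const spans: MathSpan[] = [];
  let open = -1;
  for (let i = 0; i < markdown.length; i++) {
    const ch = markdown.charAt(i);
    if (ch === '\\') {
      i++; // Skip the escaped character.
      continue;
    }
    if (ch === '$') {
      if (open < 0) {
        open = i;
      } else {
        spans.push({ start: open, end: i + 1, tex: markdown.substring(open + 1, i) });
        open = -1;
      }
    }
  }
  return spans;
}
```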

This implementation introduces a FormattedNotebookComponent that:

1. retrieves the notebook (primary descriptor).
2. parses the notebook JSON.
3. converts the notebook cells to HTML, resulting in a list of divs with the following classes and contents:

   markdown: rendered markdown
   code: python source code
   count: execution count (ex: [1]:)
   output: output of a code cell.

4. sets the innerHTML of the top level notebook div to the above HTML. Because the content is generated dynamically, the corresponding styles appear near the end of the global css file, as rules for the class notebook.
5. syntax highlights the code cell source div elements.
6. formats the embedded equations in the markdown div elements.

Layout is in a two column grid, the left column of which contains the "count" divs and collapses to fit the width of its elements.

The code was designed so that as part of step 3 above, non-HTML user input is escaped, and all user input is sanitized via our existing markdown wrapper DOMPurify invocation. Thus, any user-supplied HTML, which can surface either in the markdown or in a code cell output, should be converted to a form as secure as the results of our historical markdown sanitization process.
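
The escaping half of that design can be sketched as follows. This is a minimal illustration, assuming string-based HTML generation; `escapeHtml`, `div`, and `renderCodeCell` are hypothetical names, and the separate DOMPurify sanitization pass is not shown.

```typescript
// Escapes text so that it is displayed literally rather than parsed as HTML.
// '&' is replaced first so that the entities added afterwards are not re-escaped.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

// Wraps already-safe inner HTML in a div with the given class.
function div(className: string, innerHtml: string): string {
  return '<div class="' + escapeHtml(className) + '">' + innerHtml + '</div>';
}

// Hypothetical rendering of a code cell's execution count and source.
// The source is user-supplied non-HTML text, so it is escaped before wrapping.
function renderCodeCell(executionCount: number | null, source: string): string {
  const label = executionCount === null ? '[ ]:' : '[' + executionCount + ']:';
  return div('count', escapeHtml(label)) + div('code', '<pre><code>' + escapeHtml(source) + '</code></pre>');
}
```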

We intentionally use type any to refer to the chunks of json that we propagate through the parser. The methods that process the chunks are written to fail gracefully if they don't have the expected structure...

We use Prism to syntax highlight the code cell source, and MathJax to format embedded equations. Both are applied to the appropriate DOM elements, near the end of the process. The Prism output is lightly sanitized, with an invocation that conserves the classes that Prism attaches to associate the code spans with styles. The input to MathJax is sanitized, but we don't sanitize the output with DOMPurify, because it mangles the math. Instead, we include the ui/safe module, and set up a non-permissive configuration which ostensibly should reduce the XSS potential: https://docs.mathjax.org/en/latest/options/safe.html
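
For reference, a non-permissive MathJax v3 configuration along these lines might look like the sketch below. The specific `'none'` values and the `inlineMath` delimiters are assumptions chosen for illustration, not the settings used in this PR; the option names follow the MathJax safe-extension documentation linked above.

```typescript
// MathJax v3 reads its configuration from a global `MathJax` object
// that must be defined before the library itself is loaded.
const mathJaxConfig = {
  loader: { load: ['ui/safe'] }, // Load the safe extension.
  tex: {
    // Single-dollar inline math is not enabled by default in MathJax v3.
    inlineMath: [['$', '$']],
  },
  options: {
    safeOptions: {
      allow: {
        // Each setting accepts 'all', 'safe', or 'none'; 'none' is the strictest.
        URLs: 'none',
        classes: 'none',
        cssIDs: 'none',
        styles: 'none',
      },
    },
  },
};

// In the browser, before loading MathJax (not executed in this sketch):
//   (window as any).MathJax = mathJaxConfig;
```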

Sonarcloud complains about some regexps, but they seem ok to me. Please take a look and try to figure out how they could go wrong.

When Google Colab saves to GitHub, it inserts a cell at the top of the notebook that contains a gaudy "view on colab" image and link. So as to maintain control of the "launch with" process, this implementation filters out that cell and does not display it. It remains in the notebook source file, of course.
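
One plausible way to detect such a cell is sketched below. The heuristic (a leading markdown cell whose source mentions both the Colab host and the badge image file) and the function names are assumptions for illustration; the PR's actual filter may use different criteria.

```typescript
// Joins a cell's source, which may be a string or a list of lines.
function cellSourceText(cell: unknown): string {
  if (typeof cell !== 'object' || cell === null) {
    return '';
  }
  const source = (cell as { source?: unknown }).source;
  if (typeof source === 'string') {
    return source;
  }
  if (Array.isArray(source)) {
    return source.join('');
  }
  return '';
}

// Assumed heuristic: a markdown cell that links to Colab and shows its badge image.
function isColabBadgeCell(cell: unknown): boolean {
  if (typeof cell !== 'object' || cell === null) {
    return false;
  }
  if ((cell as { cell_type?: unknown }).cell_type !== 'markdown') {
    return false;
  }
  const text = cellSourceText(cell);
  return text.indexOf('colab.research.google.com') >= 0 && text.indexOf('colab-badge.svg') >= 0;
}

// Drops the badge cell only when it is the first cell of the notebook.
function withoutColabBadge(cells: unknown[]): unknown[] {
  return cells.length > 0 && isColabBadgeCell(cells[0]) ? cells.slice(1) : cells;
}
```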

In the compileMarkdown method, we do some markdown pre-processing to handle the vagaries of notebook-specific markdown, such as support for attached images and the unfortunate overlap of TeX/markdown syntax (for example, \\ has a special meaning as both a TeX construct and a markdown escape). There's likely a way to wire a solution deeper into marked and leverage context (are we in a code block? etc) to implement this more elegantly/correctly, but what's there works well enough that it's probably ok for now.
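
As an illustration of the attached-image part of that preprocessing, the sketch below rewrites attachment: references into data: URLs before the markdown is compiled. It is not the PR's compileMarkdown code: `inlineAttachedImages` is a hypothetical name, the regular expression ignores image titles and does not know whether it is inside a code block, and whether data: URLs survive the later sanitization step is left as an assumption.

```typescript
// Attachment name -> mime bundle of base64-encoded data.
type Attachments = { [name: string]: { [mimeType: string]: string } };

// Replaces ![alt](attachment:name) with ![alt](data:<mime>;base64,<data>),
// using the first allowed mime type present in the attachment's bundle.
// References that cannot be resolved are left untouched.
function inlineAttachedImages(markdown: string, attachments: Attachments, allowed: string[]): string {
  const pattern = /!\[([^\]]*)\]\(attachment:([^)\s]+)\)/g;
  return markdown.replace(pattern, (match: string, alt: string, name: string): string => {
    if (!Object.prototype.hasOwnProperty.call(attachments, name)) {
      return match;
    }
    const bundle = attachments[name];
    for (const mimeType of allowed) {
      const data = bundle[mimeType];
      if (typeof data === 'string') {
        // Base64 data may be wrapped across lines; remove the whitespace.
        return '![' + alt + '](data:' + mimeType + ';base64,' + data.replace(/\s/g, '') + ')';
      }
    }
    return match;
  });
}
```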

While this PR is in review, I'll write some followup tickets to improve the markdown preprocessing/handling, improve the styling, make some miscellaneous enhancements, and address other issues.

Review Instructions
Register some notebooks, pull up their entry pages, click their "Code" tabs, and confirm that they look ok. You can find some test notebooks in: https://github.com/svonworl/test-notebooks

Issue
https://ucsc-cgl.atlassian.net/browse/SEAB-5117

Screenshots

The "What is.." notebook:

Example equation typesetting notebook:

Image output in the ibm-mlb notebook:

Security
We're displaying lots of user-supplied content, and we added a new 3rd party library to process it, so we need to be careful to avoid injection vulnerabilities (XSS).

Please make sure that you've checked the following before submitting your pull request. Thanks!

Check that your code compiles by running npm run build
Ensure that the PR targets the correct branch. Check the milestone or fix version of the ticket.
If this is the first time you're submitting a PR or even if you just need a refresher, consider reviewing our style guide
Do not bypass Angular sanitization (bypassSecurityTrustHtml, etc.), or justify why you need to do so
(see above) If displaying markdown, use the markdown-wrapper component, which does extra sanitization
Do not use cookies, although this may change in the future
Run npm audit and ensure you are not introducing new vulnerabilities
(see above) Do due diligence on new 3rd party libraries, checking for CVEs
(see above) Don't allow user-uploaded images to be served from the Dockstore domain
If this PR is for a user-facing feature, create and link a documentation ticket for this feature (usually in the same milestone as the linked issue). Style points if you create a documentation PR directly and link that instead.
Check whether this PR disables tests. If it legitimately needs to disable a test, create a new ticket to re-enable it in a specific milestone.

denis-yuen · 2023-04-04T17:46:26Z

See the screenshots for an example of such a notebook; the images within are attached to and 100%-included in the notebook. Does this run afoul of our policy on serving images from the Dockstore domain? Please discuss.

highlighting for @david4096

denis-yuen · 2023-04-04T17:56:55Z

src/app/workflow/workflow.component.html

-                <p>For a Jupyter notebook, it'll be the contents of the code/documentation cells, nicely-formatted.</p>
-                <p>For other kinds of notebooks, it might be something different.</p>
-              </div>
+              <ng-template matTabContent>


A bit of bikeshedding, but after looking at the screenshots, "Code" seems ambiguous. After all, I would have guessed that the "Files" tab would display code.
This looks more like a "Preview" (as when you're editing a README) or "Rendered".

"Preview" sounds good!

coverbeck

I put in a couple of comments, but I have a broader thought.

I think you should follow the more standard Angular way of doing things, in particular leveraging the HTML templates more, and doing less direct DOM manipulation. It would also be the same pattern our other components use.

Is your way more efficient and does it execute more quickly? Most likely.

But it's also different than how we approach other components. I know some of the 3rd party components we use, e.g., the code editor, are probably doing something similar under the hood, and we don't see it because we're just using them as a component.

But all this JS code, particularly because we're going to be showing user input, is risky. It looks like you've tried to cover all cases, and maybe you have, but security stuff is hard, and it's better to follow best practices. See https://snyk.io/blog/angular-security-best-practices/

We intentionally use type any to refer to the chunks of json that we propagate through the parser. The methods that process the chunks are written to fail gracefully if they don't have the expected structure...

I think you should substantially reduce or eliminate the use of any. We have an aspiration (and a ticket) to get rid of any one day, and we should at least avoid introducing more uses of it in the meantime.
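
One possible way to keep the loose, fail-gracefully parsing while avoiding `any` is to accept `unknown` and narrow it with small type guards, roughly as sketched here. The helper names are hypothetical, and this is only one option among several.

```typescript
// Narrows an unknown value to a plain (non-array) object.
function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null && !Array.isArray(value);
}

// Notebook text may be a string or a list of strings; anything else becomes ''.
function joinText(value: unknown): string {
  if (typeof value === 'string') {
    return value;
  }
  if (Array.isArray(value)) {
    return value.filter((line): line is string => typeof line === 'string').join('');
  }
  return '';
}

// Parses the notebook and returns its object-valued cells, or [] on any problem.
function getCells(notebookJson: string): Record<string, unknown>[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(notebookJson);
  } catch (e) {
    return []; // Malformed JSON: fail gracefully.
  }
  if (!isRecord(parsed) || !Array.isArray(parsed.cells)) {
    return [];
  }
  return parsed.cells.filter(isRecord);
}
```

With this shape, the compiler forces each field access to be checked, while unexpected structures still degrade to empty output instead of throwing.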

coverbeck · 2023-04-04T18:38:19Z

src/app/notebook/formatted-notebook.component.ts

+    this.notebookTarget?.nativeElement.replaceChildren(); // Remove the current formatted notebook.
+    this.loading = true;
+    this.displayError = false;
+    // The next line cancels any previous request that is still in progess,


Typo in progress

coverbeck · 2023-04-04T20:16:36Z

src/app/notebook/formatted-notebook.component.ts

+   */
+  createFormattedNotebookElement(notebook: string): any {
+    // Create the notebook container div.
+    const element: any = this.document.createElement('div');


Couldn't you have something like this in the HTML template:

<div class="notebook" *ngIf="<condition>" [innerHtml]="sanitizedHtml">  </div>

I find this more readable than tracing through TypeScript code to see what HTML will be generated.

Also I think it would be preferable to not have the [innerHtml] here as in my example, but flesh out the template more -- I just used this as an example since it's the first case I ran into.

Would "More stuff" in the above example be the "Couldn't display the template" message? Or are you proposing that the HTML template itself should contain at least some of the logic that steps through the notebook cells and converts them to output?

I'm proposing that the HTML template itself contain more of the logic and HTML.

sonarqubecloud · 2023-04-05T01:29:42Z

SonarCloud Quality Gate failed.

0 Bugs
0 Vulnerabilities
2 Security Hotspots
1 Code Smell

No Coverage information
0.0% Duplication

svonworl · 2023-04-05T03:05:00Z

I put in a couple of comments, but I have a broader thought.

I think you should follow the more standard Angular way of doing things, in particular leveraging the HTML templates more, and doing less direct DOM manipulation. It would also be the same pattern our other components use.

Is your way more efficient and does it execute more quickly? Most likely.

But it's also different than how we approach other components. I know some of the 3rd party components we use, e.g., the code editor, are probably doing something similar under the hood, and we don't see it because we're just using them as a component.

I understand what you are saying. Here are some observations and more thoughts.

Yes, it would be good to harness Angular's sanitizer in addition to DOMPurify, which the code is not doing currently. We could add this to the existing code by calling Angular's sanitizer directly, or via [innerHTML], but...

There are the following complications:

Angular's sanitizer removes most of the formatting from MathJax output, rendering it unusable. The Angular sanitizer is not customizable, so if we use [innerHTML] in the HTML template, MathJax output dies, unless we can figure out a way to identify and conditionally bypass the sanitizer for only MathJax output (hard and risky).
The easiest way to invoke Prism and MathJax is to point them at a DOM element to be styled. It may be possible to invoke them on HTML, but we'd probably need to implement some support code that would be relatively nitty-gritty and error-prone.

Assuming that we want to highlight syntax and format equations in our notebook preview, the upshot is that even if we transition to a "more Angular" approach, we'll still probably need to do some DOM manipulation.

Another observation is that as we go "more Angular", the processing logic doesn't go away, it migrates into the HTML templates.

In the continuum of possible improvements to this code that address Charles' concerns, I can see at least three distinct approaches:

Lift the sanitization up the call stack so it happens near the end of the formatting process. Modify the current code so it generates a list of HTML chunks, each of which can be marked either safe (controlled input) or unsafe (user-supplied input). All chunks are sanitized, either lightly for safe HTML or heavily for unsafe HTML, to generate the final HTML, right before we assign to an element.
Change the code so it dynamically generates Angular components, similarly to how it generates divs at the moment. In this solution, there would be Angular components that correspond to markdown cells, execution counts, code source, and outputs, but the processing logic would remain in the parent component class and be invoked by ngOnChanges.
Move [most of] the processing code into Angular HTML templates, which will iterate over cells and delegate to subcomponents, which may in turn delegate to subcomponents...

I like Approach 1 and don't mind Approach 2. Approach 3 is going to be harder than it seems, because of some of our coding constraints (no functions in templates!) and the need to propagate required values down the "component stack". As mentioned before, none of the approaches entirely obviate the need for DOM manipulation, but Approaches 2 and 3 encapsulate it in the components that need it. The amount of boilerplate increases as we define more Angular components.
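
To make Approach 1 more concrete, here is a minimal sketch of the "list of chunks" idea. All names are hypothetical, and the two sanitizers are passed in as functions so that the sketch does not assume any particular library API. Each chunk is assumed to be a well-formed HTML fragment, because a real sanitizer will typically close or drop dangling tags.

```typescript
// A piece of generated HTML plus a flag recording where it came from.
interface HtmlChunk {
  html: string;
  trusted: boolean; // true: built from controlled input; false: user-supplied
}

type Sanitizer = (html: string) => string;

// Sanitizes every chunk, lightly or heavily depending on its origin,
// and joins the results right before they are assigned to an element.
function assembleHtml(chunks: HtmlChunk[], light: Sanitizer, heavy: Sanitizer): string {
  return chunks.map((chunk) => (chunk.trusted ? light(chunk.html) : heavy(chunk.html))).join('');
}

// Stand-in sanitizers for demonstration ONLY. They are NOT real sanitizers;
// in practice these would be the light and heavy DOMPurify invocations.
const lightStandIn: Sanitizer = (html) => html;
const heavyStandIn: Sanitizer = (html) => html.replace(/</g, '&lt;').replace(/>/g, '&gt;');

const demoHtml = assembleHtml(
  [
    { html: '<div class="count">[1]:</div>', trusted: true },
    { html: '<img src=x onerror=alert(1)>', trusted: false },
  ],
  lightStandIn,
  heavyStandIn
);
```

The point of the design is that the trusted/untrusted decision is recorded where the chunk is created, but enforced in one place at the end.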

Then, there are the potential performance implications, which I'm glossing over at the moment. They may not matter, they may kill Approaches 2 and 3, or somewhere in between.

KimberleyChong · 2023-04-05T15:47:30Z

src/styles.scss

+    white-space: pre;
+  }
+
+  .markdown,
+  .output {
+    padding: 0;
+  }
+
+  .markdown p:first-child {
+    margin-top: 0;
+  }
+
+  .markdown p:last-child {
+    margin-bottom: 0;
+  }
+
+  .source {
+    padding: 1em;
+    background-color: #fcf8fa;
+    border: solid 1px #f2edf0;
+  }
+
+  .source pre,
+  .output pre {
+    padding: 0;
+    margin: 0;
+    border: none;
+    background-color: inherit;
+    color: inherit;
+  }
+
+  // These rules target the html generated by the Prism syntax highlighter.
+  // This particular style was derived from a css file from the Prism distribution.
+  code {
+    .token.comment,
+    .token.prolog,
+    .token.doctype,
+    .token.cdata {
+      color: #8ac;
+    }
+
+    .token.punctuation {
+      opacity: 0.7;
+    }
+
+    .token.namespace {
+      opacity: 0.7;
+    }
+
+    .token.property,
+    .token.tag,
+    .token.boolean,
+    .token.number,
+    .token.constant,
+    .token.symbol {
+      color: #000;
+    }
+
+    .token.selector,
+    .token.attr-name,
+    .token.string,
+    .token.char,
+    .token.builtin,
+    .token.inserted {
+      color: #57a;
+    }
+
+    .token.operator,
+    .token.entity,
+    .token.url,
+    .token.variable {
+      color: #940;
+    }
+
+    .token.atrule,
+    .token.attr-value,
+    .token.keyword {
+      color: #060;
+    }
+
+    .token.regex,
+    .token.important {
+      color: #e90;
+    }
+
+    .token.variable,
+    .token.important,
+    .token.bold {
+      font-weight: bold;
+    }
+    .token.italic {
+      font-style: italic;
+    }
+
+    .token.deleted {
+      color: red;
+    }
+  }
+}
+


Good comments here, suggestion: because this is a large chunk of styles all pertaining to notebooks, I think it would be a good idea to create a separate file like notebooks.scss and import it into styles.scss similar to featured-content.scss which also styles external html (just for separation since these styles won't be used elsewhere):

dockstore-ui2/src/styles.scss

Line 24 in 274e8de

@import 'featured-content.scss';

@import is actually going to be phased out in a few years so it's recommended to use @use instead. Just tried it locally and styles rendered correctly. Eventually we should switch over our other @imports here as well

coverbeck

Thanks for the detailed explanation. Let me rephrase my thoughts, even though it's probably repetitive, just because it helps me flesh out things for myself.

The generated HTML is currently a black box to me. The only way I can visualize it is by tracing through the TS code, and that's hard. I stopped at <div class="notebook"... :). Or by running it and looking at it in browser tools (which I'm not set up to do, because I haven't set up a notebook yet). When I look at your style sheet, I have no idea how to correlate those styles to the HTML. I find this makes maintenance harder, both because we can't visualize the HTML, and because this component is so different than our others.
There are the security concerns. Yes, we have to execute some arbitrary user code and create some DOM elements directly, but it's hard for me to tell just by looking at the TS code what's going on. And security stuff is hard; Angular has a team of programmers working on it, probably a dedicated security team, issues reported by users, and a lot of work and time put into it over the years (and they probably still have a bug or two :)). We're just a small team starting this feature from scratch. So I think it's important to try to stay within the Angular framework as much as possible, and clearly demarcate when we are bypassing it.

Approach 3 is going to be harder than it seems, because of some of our coding constraints (no functions in templates!)

To clarify, this is not a weird Dockstore team constraint, it's a standard Angular best practice. And can be done with a pipe. I don't think this in itself is a significant obstacle, although there may be other issues.
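
For example, the logic that a template would otherwise call as a function can live in a pipe's transform method. The sketch below is a plain class so that it runs standalone; in Angular it would additionally carry the `@Pipe` decorator from `@angular/core` and implement `PipeTransform`. The pipe name and its behavior are hypothetical.

```typescript
// In Angular, this class would be declared as:
//   @Pipe({ name: 'joinSource' })
//   export class JoinSourcePipe implements PipeTransform { ... }
// The decorator is omitted here so the sketch has no framework dependency.
class JoinSourcePipe {
  // Joins a notebook "source" field, which may be a string or a list of lines.
  transform(source: unknown): string {
    if (typeof source === 'string') {
      return source;
    }
    if (Array.isArray(source)) {
      return source.filter((line): line is string => typeof line === 'string').join('');
    }
    return '';
  }
}

// Hypothetical template usage, replacing a function call in the template:
//   <pre>{{ cell.source | joinSource }}</pre>
```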

and the need to propagate required values down the "component stack"

Perhaps/probably I'm not getting it, and that may be a function of me not having a sense of the HTML. I feel like it wouldn't be that hard to at least have some "scaffolding" HTML, but maybe I'm wrong.

Now that I've laid out my thoughts and concerns, I'll defer to your judgement. I'm not hands-on enough with this to assertively be able to say which way is best.

coverbeck · 2023-04-05T17:10:18Z

src/app/notebook/formatted-notebook.component.spec.ts

Depending on the solution based on our other discussion, can you please add a test to show arbitrary JS code is not executed? Standard is to verify an alert() doesn't execute, although I don't know how you'd do that in a test.

coverbeck · 2023-04-05T17:49:41Z

src/app/workflow/workflow.component.html

+                <div class="mt-4 ml-4 mr-4">
+                  <app-formatted-notebook
+                    *ngIf="workflow && selectedVersion && (extendedWorkflow$ | async)"
+                    [workflow]="workflow"


IDE is giving me compiler errors for this and next line. We have this turned off at build time because there are too many of the errors, sigh, but we should try to avoid introducing new ones. Might be tricky here though.

svonworl added 30 commits March 22, 2023 11:57

scaffold

0be8a50

refactor into MarkdownWrapperService, add proof of concept code

0527038

develop further

faa38c5

develop

456ab9e

refactor lightly

3f9f4dc

generalize formatting by type

b01d978

insert TODOs for further work, clean up

dcf22f6

add mime bundle, baseUrl input

3298777

remove unused file

d2a9289

remove unused file #2

9530cb0

add baseUrl input setting, adjust test

758eb83

style notebook divs, add exception handling on failure

fdde446

make notebook css more scss-ey

121549f

correct execution count, lighten count label

fd6e4f2

add support for img alt and title attrs

d9bf2dd

refactor mime bundle processor

2c5001b

rename method

ce2c56b

add math support

9e94ed5

add syntax highlighting

e5617bc

set language type of code cells, misc improvements

c96564f

refactor escaping code

7f3c173

refactor

eb5113f

honor source/outputs_hidden, add syntax highlighting styles, refactor

9f57869

simplify, fix css problem

fdce41c

add test skeleton

227980b

improve template html

4e9ea22

restructure loop

133cc4c

complete basic unit tests, fix bugs

0188552

fix intermittent syntax highlighting

e7c8bc2

fix misc bugs

a389da8

svonworl added 4 commits April 3, 2023 18:41

remove unused stuff

d4d5577

improve backslashed dollar handling

c27419e

improve regexps

66e292d

improve regexps, add light sanitize of prism outputs

96afbcb

svonworl self-assigned this Apr 4, 2023

svonworl added 3 commits April 4, 2023 08:55

refactor unit test

67e967d

clean up unit test

0212b63

add comments, refactor lightly, fix regexp bug

4ce9395

svonworl requested review from coverbeck, denis-yuen, david4096, kathy-t and KimberleyChong April 4, 2023 17:43

denis-yuen reviewed Apr 4, 2023

fix spelling error

d042404

coverbeck reviewed Apr 4, 2023

svonworl added 2 commits April 4, 2023 15:37

change notebook Code tab to Preview tab

44389b3

fix test oopsie

05a5cde

KimberleyChong reviewed Apr 5, 2023

coverbeck reviewed Apr 5, 2023

svonworl removed request for david4096 and kathy-t April 6, 2023 17:16

svonworl marked this pull request as draft April 6, 2023 17:18

svonworl mentioned this pull request Apr 17, 2023

SEAB-5117: Formatted "preview" notebook with "more Angular" style #1757

Merged

8 tasks

denis-yuen changed the base branch from develop to release/1.14 April 20, 2023 15:02

svonworl closed this May 10, 2023

denis-yuen deleted the feature/seab-5117/display-formatted-notebook branch December 4, 2023 15:30
