
Feat add puppeteer #54

Merged

LilaKelland merged 22 commits from feat-add-puppeteer into main on Nov 1, 2023

Conversation

@LilaKelland (Collaborator) commented Oct 27, 2023:

Added accessibility checks (the web-endpoint-puppeteer-checks module, though I feel we could use a better name for it).

  • Modified the git clone repo service to extract endpoints from .product.yaml and publish them to the 'ClonedRepoEvent' NATS subject, though this will obviously change with the new structure.
  • web-endpoint-puppeteer-checks subscribes to 'ClonedRepoEvent',
  • then cycles through webEndpoints and filters for web-type endpoints only (i.e. ones that serve HTML, but not GraphiQL).
  • Slugs are found for each endpoint and concatenated with the endpoint itself to form the page URL, which Puppeteer then scans for accessibility (see the sketch below).
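
A minimal sketch of that flow (not the module's actual code), assuming the nats, puppeteer, and @axe-core/puppeteer packages; isWebEndpoint is a stand-in for the HTML-vs-GraphiQL filter, getSlugs is the helper added in this PR, and the server URL is a placeholder:

import { connect, JSONCodec } from 'nats'
import puppeteer from 'puppeteer'
import { AxePuppeteer } from '@axe-core/puppeteer'

const nc = await connect({ servers: 'nats://nats:4222' }) // placeholder server URL
const jc = JSONCodec()
const sub = nc.subscribe('ClonedRepoEvent.>')

for await (const message of sub) {
  const { webEndpoints } = jc.decode(message.data)

  // isWebEndpoint is a hypothetical stand-in for "serves HTML, not GraphiQL"
  const htmlEndpoints = webEndpoints.filter(isWebEndpoint)

  const browser = await puppeteer.launch()
  for (const endpoint of htmlEndpoints) {
    const page = await browser.newPage()
    // Collect slugs linked from the root page, then run axe against each page.
    const slugs = await getSlugs(endpoint, page, browser)
    for (const slug of slugs) {
      const pageToEvaluate = new URL(slug, endpoint).href
      await page.goto(pageToEvaluate)
      const axeReport = await new AxePuppeteer(page).analyze()
      console.log(pageToEvaluate, axeReport.violations.length, 'violations')
    }
    await page.close()
  }
  await browser.close()
}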

Other changes

  • Dev container: added a global Puppeteer install as the node user, added a markdown linter, and installed the libraries Chrome needs on Debian. (Update: just realizing that I should probably be adding a Docker container for the module and doing the install there... should I do this before approval?)

TODO

  • Documentation on the checks and response.
  • Get the mutation to save to the database.
  • Tests.

… Still errored, so added --no-sandbox per the docs (though we may not want to do this in prod... but now it works).
…so added a global install of puppeteer, which seems to be required. Will remove the Chrome install in future iterations, as I'm not sure it's needed if puppeteer isn't global.
… reformatted results and installed graphql-request to use the API in future.
…globally as the node user, and added the extra packages Debian needs to run Chrome. Will rebuild to confirm it works.
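
For reference, a minimal sketch of the kind of launch call the --no-sandbox commit above refers to; the exact flags used in the module may differ:

import puppeteer from 'puppeteer'

// '--no-sandbox' lets Chrome start inside the dev container, but it disables
// Chrome's sandbox, so it's worth revisiting before running this in prod.
const browser = await puppeteer.launch({
  args: ['--no-sandbox', '--disable-setuid-sandbox'],
})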
@Collinbrown95 (Contributor) left a comment:

This is fantastic, great job!

I have only a few small comments, but I'd like to have a quick meeting before merging where we review the behaviour on a few concrete examples (e.g. nats pub "ClonedRepoEvent.>" '{"webEndpoints": ["https://www.canada.ca/en/public-health.html"]}').

In the spirit of 1 PR = 1 feature = 1 issue, I'd like to open some issues for the future TODO items, some of which are blocked because the GraphQL API hasn't been merged to the main branch yet.

scanners/web-endpoint-puppeteer-checks/README.md (outdated review thread, resolved)
scanners/web-endpoint-puppeteer-checks/index.js (outdated review thread, resolved)
for await (const message of sub) {
const clonedRepoEventPayload = await jc.decode(message.data)
const { webEndpoints } = clonedRepoEventPayload
const productName = message.subject.split('.').pop() // last NATs subject token
Contributor:

Can you explain the logic behind this line?

When I run this locally, message.subject.split('.').pop() gives me > as a result, which I think is the same regardless of the URL we are analyzing.

Collaborator Author:

Yes - it was a way to pass the product name (as the last token of the subject from github-cloned-repo) that would be used when saving to the database (https://docs.nats.io/nats-concepts/subjects#wildcards). When you subscribe to a NATS subject ending in '.>', the '>' wildcard matches everything that starts with that subject prefix. The product name can just as easily be passed as part of the payload (as I imagine we'll do with the graph pieces). It was just a placeholder, since we're not yet saving to the database.
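
A small sketch of the two options (illustrative only; the connection setup, product name, and payload shape are placeholders):

import { connect, JSONCodec } from 'nats'

const nc = await connect({ servers: 'nats://nats:4222' }) // placeholder URL
const jc = JSONCodec()

// Option 1: the publisher puts the product name in the last subject token...
const productName = 'my-product' // placeholder
nc.publish(`ClonedRepoEvent.${productName}`, jc.encode({ webEndpoints: [] }))

// ...and the subscriber to 'ClonedRepoEvent.>' reads it back from the subject.
const sub = nc.subscribe('ClonedRepoEvent.>')
for await (const message of sub) {
  console.log(message.subject.split('.').pop()) // 'my-product'
}

// Note: publishing to the literal subject 'ClonedRepoEvent.>' (as in the
// nats pub test above) makes .pop() return '>', matching what was observed.
// Option 2: carry productName inside the payload so the subject isn't load-bearing.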

webEndpointResults[webEndpoint][pageToEvaluate] = axeReport
}

accessibilityResults.push(webEndpointResults)
Contributor:

My instinct is that this is roughly where we would publish to the GraphQL API (which doesn't exist on the main branch yet).

My thought is that we would accumulate all slugs associated with a root web URL and attach that "list" of accessibility report results to a field in the Endpoint object.

Let's discuss this in more detail when we meet about this PR.
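
A rough sketch of what that publish step might look like with the graphql-request package installed in this PR; the API URL, mutation name, fields, and JSON scalar are placeholders, since the GraphQL API hasn't been merged yet (webEndpoint and webEndpointResults refer to the variables in the snippet above):

import { GraphQLClient, gql } from 'graphql-request'

// Hypothetical API location and mutation; the real schema doesn't exist yet.
const client = new GraphQLClient(process.env.GRAPHQL_URL)

const ADD_ACCESSIBILITY_REPORTS = gql`
  mutation AddAccessibilityReports($endpoint: String!, $reports: JSON!) {
    addAccessibilityReports(endpoint: $endpoint, reports: $reports) {
      id
    }
  }
`

// One request per root web endpoint, attaching all of its per-slug axe reports.
await client.request(ADD_ACCESSIBILITY_REPORTS, {
  endpoint: webEndpoint,
  reports: webEndpointResults[webEndpoint],
})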

Collaborator Author:

Ah - yes that does make sense to split them out rather than having the entire response for all the endpoints combined before saving - thanks for catching!

@@ -0,0 +1,33 @@
async function getSlugs(url, page, browser) {
Contributor:

I'd like to discuss this logic in more detail when we meet.

We don't necessarily need to change anything in this PR, but I'd like to take the time to review how this works along with an example.

Typically, web crawling like this requires some kind of recursive logic like "visit every page to check links, then for every link found visit that page to check links...", where you "bottom out" once you've built the entire tree of pages starting at the root URL.

On the other hand, your current approach might get most of the possible links with only a single page request, which would involve significantly less network traffic.
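
For concreteness, a sketch of what the single-request approach could look like; the getSlugs signature is from this PR, but the body below is an assumption, not the actual implementation:

// Visit only the root URL and collect same-origin links from that single page.
// (browser is unused here; it's kept only to match the signature in the PR.)
async function getSlugs(url, page, browser) {
  await page.goto(url, { waitUntil: 'networkidle2' })
  const hrefs = await page.$$eval('a[href]', (anchors) => anchors.map((a) => a.href))
  const origin = new URL(url).origin
  // Deduplicated same-origin paths; links that only appear on sub-pages are
  // missed, which is the trade-off described in the comment above.
  const slugs = hrefs
    .filter((href) => href.startsWith(origin))
    .map((href) => new URL(href).pathname)
  return [...new Set(slugs)]
}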

@LilaKelland (Collaborator Author) commented Oct 31, 2023:

Oh interesting - it seemed like a lot of requests, so happy to refactor it if that makes sense. Although I might be missing sub-slugs the way you describe...

Contributor:

In the spirit of minimizing complexity, I'm inclined to say leave it for now and we can refactor if we find a significant fraction of links are being missed without the recursive approach.

@@ -0,0 +1,17 @@
// fs docs node https://nodejs.org/docs/v0.3.4/api/fs.html
import * as fs from 'fs';
Contributor:

Similar to the comment on the README.md document, I'm wondering if logic for parsing product.yaml should live in the "graph updater" component rather than the "endpoint scanner" component.

The idea being that "graph updater" handles stuff related to the graph structure, and "endpoint scanner" handles stuff related to attributes of given endpoints (e.g. accessibility scans).
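
For context, a minimal sketch of the parsing being discussed, assuming a js-yaml dependency and a .product.yaml that carries a webEndpoints list (the real schema may differ):

import * as fs from 'fs'
import yaml from 'js-yaml'

// Read the cloned repo's .product.yaml and pull out its web endpoints.
function readWebEndpoints(repoPath) {
  const doc = yaml.load(fs.readFileSync(`${repoPath}/.product.yaml`, 'utf8'))
  return doc.webEndpoints ?? []
}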

Collaborator Author:

Yes, absolutely - this was from before last week's conversation, when I was just getting this working.

…_URL (though may be different once the API is implemented)
…e docs! only printing out message and html for nodes for incomplete (untested) and violations
@LilaKelland (Collaborator Author):

Pushing to main - there are still some open issues to address here. Have added them into issue #56 to address in the near future.

@LilaKelland LilaKelland merged commit a885625 into main Nov 1, 2023
@LilaKelland LilaKelland deleted the feat-add-puppeteer branch November 1, 2023 18:38