parser: limit maximum number of tokens #3684

Merged
merged 2 commits into from
Aug 8, 2022

Conversation

IvanGoncharov
Member

@IvanGoncharov IvanGoncharov commented Jul 28, 2022

Motivation: Parser CPU and memory usage are linear in the number of tokens in a document; however, in extreme cases they become quadratic due to memory exhaustion.
On my machine, this happens on queries with 2k tokens.
For example:

{ a a <repeat 2k times> a }

It takes 741ms on my machine.
But if we create a document of the same size with a smaller number of tokens, parsing is a lot faster.
Example:

{ a(arg: "a <repeat 2k times> a") }

Now it takes only 17ms to process, which is 43 times faster.

If we only limit document size, the limit has to be small, since it takes only two bytes to create a token, e.g. ` a` (a space followed by a one-letter name).
But that would cause problems for legitimate documents with long tokens (comments, descriptions, strings, long names, etc.).
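
To make the size-versus-token-count point concrete, here is a minimal sketch that counts tokens in the two example documents. It uses a toy counter written for this illustration, not the graphql-js lexer, and the function name `countTokens` and the repeat count of 2000 are assumptions.

```typescript
// Toy token counter: NOT the graphql-js lexer, just enough to compare the two examples.
// Recognises names, string literals (no escape handling), and one-character punctuators.
function countTokens(source: string): number {
  let count = 0;
  let i = 0;
  while (i < source.length) {
    const ch = source.charAt(i);
    if (ch === ' ' || ch === '\t' || ch === '\n' || ch === '\r' || ch === ',') {
      i++; // ignored characters produce no token
    } else if (ch === '"') {
      // A string is a single token no matter how long it is.
      const close = source.indexOf('"', i + 1);
      i = close === -1 ? source.length : close + 1;
      count++;
    } else if (/[_A-Za-z]/.test(ch)) {
      // A name is a single token; consume all of its characters.
      while (i < source.length && /[_0-9A-Za-z]/.test(source.charAt(i))) {
        i++;
      }
      count++;
    } else {
      i++; // treat anything else as a one-character punctuator
      count++;
    }
  }
  return count;
}

const manyTokens = '{ ' + 'a '.repeat(2000) + '}';
const oneLongToken = '{ a(arg: "' + 'a '.repeat(2000) + '") }';

console.log(manyTokens.length, countTokens(manyTokens)); // 4003 2002
console.log(oneLongToken.length, countTokens(oneLongToken)); // 4014 8
```

Both documents are about 4 KB, yet one has roughly 250 times as many tokens as the other, which is why a byte limit is a poor proxy here.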

That's why this PR adds a mechanism to limit the number of tokens in the parsed document.
Also, the exact same mechanism is implemented in graphql-java, see:
graphql-java/graphql-java#2549
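
As a rough illustration of the mechanism, the sketch below counts tokens in the same loop that consumes them and throws once the budget is exceeded. It is a self-contained toy, not the code from this PR: `nextToken`, `consumeTokens`, the `maxTokens` parameter, and the error text are names chosen for this example.

```typescript
interface Token {
  text: string;
  end: number; // offset just past the token
}

// Toy lexer step (NOT graphql-js): finds the next string, name, or
// one-character punctuator at or after `pos`; returns null at end of input.
function nextToken(source: string, pos: number): Token | null {
  const pattern = /"[^"]*"|[_A-Za-z][_0-9A-Za-z]*|[^\s,]/g;
  pattern.lastIndex = pos;
  const match = pattern.exec(source);
  if (match === null) {
    return null;
  }
  return { text: source.slice(match.index, pattern.lastIndex), end: pattern.lastIndex };
}

// Consumes tokens one at a time, the way a parser advances its lexer, and
// throws as soon as the count exceeds `maxTokens` (no limit when undefined).
function consumeTokens(source: string, maxTokens?: number): string[] {
  const consumed: string[] = [];
  let token = nextToken(source, 0);
  while (token !== null) {
    consumed.push(token.text);
    if (maxTokens !== undefined && consumed.length > maxTokens) {
      throw new Error(`Document contains more than ${maxTokens} tokens; parsing cancelled.`);
    }
    token = nextToken(source, token.end);
  }
  return consumed;
}

console.log(consumeTokens('{ a b c }', 5).length); // 5

try {
  consumeTokens('{ a b c }', 4);
} catch (error) {
  console.log(error instanceof Error ? error.message : String(error));
}
```

Because the check sits in the advance step, a document that exceeds the limit is rejected after reading only `maxTokens + 1` tokens, regardless of its total size.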

I also tried the alternative approach of counting nodes, and it gives a
slightly better approximation of how many resources would be consumed.
However, unlike tokens, AST nodes are an implementation detail of graphql-js,
so the count is impossible to replicate in other implementations (e.g. to compute
this number on a client).

@IvanGoncharov IvanGoncharov added the PR: feature 🚀 requires increase of "minor" version number label Jul 28, 2022
@IvanGoncharov IvanGoncharov requested review from yaacovCR and a team July 28, 2022 16:10
@github-actions

@github-actions run-benchmark

@saihaj Please, see benchmark results here: https://github.com/graphql/graphql-js/runs/7580332109?check_suite_focus=true#step:6:1

Member

@saihaj saihaj left a comment

I think this should live in server-side code, not in a reference implementation. Users of the library can choose to limit the tokens in the onParse phase.

@yaacovCR
Contributor

yaacovCR commented Aug 2, 2022

I think by having the ability to throw an error within parse, you don't need a separate pre-parse step that counts the number of tokens?

@yaacovCR
Contributor

yaacovCR commented Aug 2, 2022

@IvanGoncharov looks good to me, I suggested changes to the wording on a few of the comments if you think an improvement.

Co-authored-by: Yaacov Rydzinski  <yaacovCR@gmail.com>
@IvanGoncharov
Member Author

looks good to me, I suggested changes to the wording on a few of the comments if you think an improvement.

Thanks, @yaacovCR I merged those.

I think this should live in server-side code, not in a reference implementation. Users of the library can choose to limit the tokens in the onParse phase.

@saihaj The problem here is that parse is sync, so, as @yaacovCR pointed out, the only other option is to count tokens as a separate step. But that would have a performance impact.
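
The performance argument can be sketched by counting how many tokens each strategy has to lex. This is a toy model with hypothetical names (`scan`, `preCountThenParse`, `limitWhileParsing`); the second `scan` call merely stands in for a real parse, and no actual timings are implied.

```typescript
// Toy lexer loop (NOT graphql-js): calls `visit` after each token with the
// running count and returns the total. `visit` may throw to stop early.
function scan(source: string, visit: (tokensSoFar: number) => void): number {
  const pattern = /"[^"]*"|[_A-Za-z][_0-9A-Za-z]*|[^\s,]/g;
  let produced = 0;
  while (pattern.exec(source) !== null) {
    produced++;
    visit(produced);
  }
  return produced;
}

interface Cost {
  lexed: number; // total tokens lexed across all passes
  rejected: boolean;
}

// Strategy 1: count tokens in a separate pass, then parse (lexing again).
function preCountThenParse(source: string, limit: number): Cost {
  const total = scan(source, () => {});
  if (total > limit) {
    return { lexed: total, rejected: true };
  }
  const again = scan(source, () => {}); // stands in for the real parse
  return { lexed: total + again, rejected: false };
}

// Strategy 2: enforce the limit while parsing; stop at the first token over budget.
function limitWhileParsing(source: string, limit: number): Cost {
  let lexed = 0;
  try {
    scan(source, (n) => {
      lexed = n;
      if (n > limit) {
        throw new Error('token limit exceeded');
      }
    });
    return { lexed, rejected: false };
  } catch (_error) {
    return { lexed, rejected: true };
  }
}

const doc = '{ ' + 'a '.repeat(10) + '}'; // 12 tokens

console.log(preCountThenParse(doc, 100), limitWhileParsing(doc, 100));
console.log(preCountThenParse(doc, 5), limitWhileParsing(doc, 5));
```

In this model an accepted document is lexed twice by the pre-count strategy and once by the in-parser check, while a rejected document is lexed in full by the pre-count but only up to `limit + 1` tokens by the in-parser check.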

@saihaj
Member

saihaj commented Aug 2, 2022

Maybe we should consider making parse async too 🤔

@michaelstaib
Member

@IvanGoncharov will you have a default max token size or is that just a new option that you add and leave the setting up to the user?

@IvanGoncharov
Member Author

IvanGoncharov commented Aug 4, 2022

will you have a default max token size or is that just a new option that you add and leave the setting up to the user?

@michaelstaib The idea is to leave the default limit to higher-level libraries.
Since we use parse for all types of documents (e.g., SDL files), we can't choose one limit that will serve all use cases.
That said, all tools/libraries that implement pipelines (e.g., server libraries) are encouraged to set some limit by default.
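
One way a higher-level library might apply this is sketched below. Everything here is hypothetical: the `DocumentKind` type, the `resolveTokenLimit` helper, and the value 1000 are placeholders for illustration, not recommendations from this PR.

```typescript
type DocumentKind = 'request' | 'sdl';

// Placeholder value for illustration only; the PR does not recommend a number.
const DEFAULT_REQUEST_TOKEN_LIMIT = 1000;

// Hypothetical helper a server library could use to decide which limit to pass
// to the parser. An explicit setting always wins; otherwise client requests get
// the default and trusted SDL files are left unlimited.
function resolveTokenLimit(kind: DocumentKind, configured?: number): number | undefined {
  if (configured !== undefined) {
    return configured;
  }
  return kind === 'request' ? DEFAULT_REQUEST_TOKEN_LIMIT : undefined;
}

console.log(resolveTokenLimit('request')); // 1000
console.log(resolveTokenLimit('sdl')); // undefined
console.log(resolveTokenLimit('sdl', 50000)); // 50000
```

Keeping the default out of the parser and in the pipeline layer lets untrusted requests be capped while large schema files still parse without configuration.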

@IvanGoncharov
Member Author

Maybe we should consider making parse async too 🤔

@saihaj This is a much bigger task.
Also, I'm not sure what API such a parser would have.
The AST is not an array but a tree, so I'm not sure what return value it would have.
Note: by the parser being sync, I meant that you can't interrupt it, so you get a full tree.
If the parser simply returns a promise of the AST, that doesn't change the situation in that respect.

I propose merging this as a solution to the problem and returning to the discussion if we have an async parser in the future.

@IvanGoncharov IvanGoncharov merged commit 9df9079 into graphql:main Aug 8, 2022
@IvanGoncharov IvanGoncharov deleted the pr_branch4 branch August 8, 2022 17:02
IvanGoncharov added a commit to IvanGoncharov/graphql-js that referenced this pull request Aug 16, 2022
Backport of graphql#3684
IvanGoncharov added a commit to IvanGoncharov/graphql-js that referenced this pull request Aug 16, 2022
Backport of graphql#3684