Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

uri format regexp is validating invalid URI #408

Closed
vmaurin opened this issue Jun 15, 2021 · 18 comments
Closed

uri format regexp is validating invalid URI #408

vmaurin opened this issue Jun 15, 2021 · 18 comments

Comments

@vmaurin
Copy link
Contributor

vmaurin commented Jun 15, 2021

The regexp used by the "uri" format validation doesn't really validate that the string is a proper URI, i.e there is a chance passing a validated string to the java URI class will result into an exception

An example invalid URI that passes the validation is https://gitlab.com/?args=invalid|value

Should the regexp be more complex ? Should the validation be made with the java URI class (then catching the exception to validate) ?

@vmaurin
Copy link
Contributor Author

vmaurin commented Jun 15, 2021

The regexp in question is this one

COMMON_BUILTIN_FORMATS.add(pattern("uri", "(^[a-zA-Z][a-zA-Z0-9+-.]*:[^\\s]*$)|(^//[^\\s]*$)"));

@stevehu
Copy link
Contributor

stevehu commented Jun 16, 2021

@vmaurin The regexp definitely has room to improve. The reason I didn't use the Java URI class to verify the URL is due to performance. Most users are using this library to validate the HTTP request and it needs to be as fast as possible. I would suggest improving the regexp to make it cover more validation scenarios. The other option is to provide both options but default to regexp and with a configuration flag switch to Java URL class. What do you think?

@vmaurin
Copy link
Contributor Author

vmaurin commented Jun 16, 2021

I totally agree with the performance issue on relying on try/catch for validation, and also after a quick research, it sounds that URI java class is not perfect regarding validation either. My personal pick would be as you suggested, to enrich a bit the regexp. There are several example online, with this one as the most complete validation http://jmrware.com/articles/2009/uri_regexp/URI_regex.html but it might be also overkill. There are also some middle ground, like this one ^([a-z][a-z0-9+.-]+):(\/\/([^@]+@)?([a-z0-9.\-_~]+)(:\d+)?)?((?:[a-z0-9-._~]|%[a-f0-9]|[!$&'()*+,;=:@])+(?:\/(?:[a-z0-9-._~]|%[a-f0-9]|[!$&'()*+,;=:@])*)*|(?:\/(?:[a-z0-9-._~]|%[a-f0-9]|[!$&'()*+,;=:@])+)*)?(\?(?:[a-z0-9-._~]|%[a-f0-9]|[!$&'()*+,;=:@]|[/?])+)?(\#(?:[a-z0-9-._~]|%[a-f0-9]|[!$&'()*+,;=:@]|[/?])+)?$ that could be implemented, I just don't know what should be the position of json-schema-validator on this regards

@stevehu
Copy link
Contributor

stevehu commented Jun 16, 2021

It looks good to me. Would you like to submit a PR to replace the old one? Thanks.

vmaurin pushed a commit to vmaurin/json-schema-validator that referenced this issue Jun 16, 2021
The json schema uri format should follow rfc3986 for uri validation. I
have replaced the current regex with a more restrictive one in order to
have a stronger validation according the RFC

refs networknt#408
@vmaurin
Copy link
Contributor Author

vmaurin commented Jun 16, 2021

I just did but I ran in an other use case. So far, the protocol relative URL were considered valid, that is not the case anymore with the new regex. According json schema, we should follow the URI rfc, then I guess forbid protocol relative URL, but not sure it is something you desire

@stevehu
Copy link
Contributor

stevehu commented Jun 16, 2021

What is the protocol-relative URL? Could you please provide an example? Thanks. We want to make sure it is backward compatible if possible. Otherwise, some users will have a broken application after upgrade the library.

@vmaurin
Copy link
Contributor Author

vmaurin commented Jun 16, 2021

It is actually a breaking change, see in my PR here https://github.com/networknt/json-schema-validator/pull/409/files#diff-cd18df339efd8831f215e37b5248aeb7dede17bac8013db8830dee30cd59b593L200 . The value asserted "valid" is not actually a valid URI, so there are two options :

  • breaking backward compatibility but being more respectful of the json schema specs
  • keeping backward compatibility but validating a URI that is not a correct one according the RFC

@vmaurin
Copy link
Contributor Author

vmaurin commented Jun 18, 2021

@stevehu I am waiting on your call. Any more restrictive regexp won't be backward compatible by design, it will invalidate value that were considered valid before. Either you want the library to become stricter along time, ensuring good validation but no backward compatibility regarding validation, or you want to keep the library as lax as today, but give options to be more strict if needed. Depending on your decision, I will adapt the PR

@stevehu
Copy link
Contributor

stevehu commented Jun 18, 2021

We want to maintain backward compatibility; however, we cannot stop enhancing the library. Please submit a PR with your updated regexp. Thanks.

@vmaurin
Copy link
Contributor Author

vmaurin commented Jun 18, 2021

Ok so the PR is open here #409

stevehu pushed a commit that referenced this issue Jun 18, 2021
The json schema uri format should follow rfc3986 for uri validation. I
have replaced the current regex with a more restrictive one in order to
have a stronger validation according the RFC

refs #408

Co-authored-by: Vincent Maurin <vincent.maurin@vectaury.io>
@stevehu stevehu changed the title "uri" format regexp is validating invalid URI uri format regexp is validating invalid URI Jun 18, 2021
@stevehu stevehu closed this as completed Jun 22, 2021
@kamtschatka
Copy link

Because I just ran into the issue, that google font URLs are considered invalid:
Sample: "https://fonts.googleapis.com/css?family=Open+Sans:400,700|Source+Code+Pro:300,600|Titillium+Web:400,600,700"
Doc also says to use "|" and "," in the URLs: https://developers.google.com/fonts/docs/getting_started

The RFC says it is invalid, but seems like google didn't get the memo.
If you navigate to this URL in Chrome/Firefox and then access location.href it is also returned without the | and , URL encoded.

@stevehu
Copy link
Contributor

stevehu commented Oct 28, 2024

If you get the address in the above URL, it is already encoded as a valid URL.

https://fonts.googleapis.com/css?family=Open+Sans:400,700%7CSource+Code+Pro:300,600%7CTitillium+Web:400,600,700

@kamtschatka
Copy link

no it is not. If you copy/paste it into the browser it shows like this:
image

If you click on the link above it is URL encoded, because github does that.

@stevehu
Copy link
Contributor

stevehu commented Oct 28, 2024

I right-clicked the URL and copied the link address.

@kamtschatka
Copy link

as stated: this is github encoding the URL. Use the mouse to mark the URL and copy it.

@stevehu
Copy link
Contributor

stevehu commented Oct 28, 2024

I don't know who encoded the URL. What I mean is only the valid URL should be used in JSON schema.

@vmaurin
Copy link
Contributor Author

vmaurin commented Oct 28, 2024

The regexp solution for validating a URI is not perfect here, but for sure a URI with | in the query string is not valid according the RFC and then doesn't fit the uri format restriction of json schema too, so it feels normal json schema validator is rejecting these URIs.
Here as @stevehu mentioned, either you pass a valid URI by percent encoding the pipes, or you change your json schema to relax the constraints

@kamtschatka
Copy link

yes, i have already created my own format, but what I am saying is: browsers AND the java implementation are not agreeing with the regex (which is RFC compliant from what I can see). As always when it comes to the web, the rules are not always applied.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants