Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework the URI parser with Angstrom #142

Merged
merged 5 commits into from
Oct 8, 2020

Conversation

anmonteiro
Copy link
Contributor

This diff reworks the current parser based on regexes in favor of one
that uses Angstrom.

It also removes the dependency on ocaml-re, moving the regular
expressions to a new package uri-re.

The current state of this work aready passes the unit tests, but it might need to further work to get it into a mergeable state. I'm opening a PR to get any and all feedback. Thanks in advance!

This happens to fix #21.

This diff reworks the current parser based on regexes in favor of one
that uses Angstrom.

It also removes the dependency on `ocaml-re`, moving the regular
expressions to a new package `uri-re`.

This happens to fix mirage#21.
@dinosaure
Copy link
Member

Interesting work. I would like to now that you translated already available regexp to angstorm parser or you took what is described by the RFC as is?

@anmonteiro
Copy link
Contributor Author

@dinosaure This is mostly based on what the previous implementation did. I was reading the spec along with the work and noticed that we don't follow it too closely in some places.

My rationale for preserving the same assumptions was that it's already a fairly big change, and I didn't want to introduce more noise than necessary.

I think this sets up a good base to iterate on the parts that need to follow the spec more cloesly. What do you think?

@dinosaure
Copy link
Member

I think this sets up a good base to iterate on the parts that need to follow the spec more cloesly. What do you think?

I agree with you but ocaml-uri is widely used by many people. So we should have something to test your PR with others packages and see what happens first. Just to avoid any large break 👍 .

@avsm
Copy link
Member

avsm commented Dec 31, 2019

Thanks for the excellent work on this PR! I'm just getting back from vacation and can review it more over the next week.

I'm in favour of merging and releasing an Angstrom version that is as close to the behaviour of uri as currently stands, and then improving the specification compat in subsequent releases. How about setting up an AFL fuzz test to check for behavioural differences between uri-re and uri to spot any places where it differs?

@avsm
Copy link
Member

avsm commented Jan 1, 2020

I've pushed avsm/ocaml-uri@4bf0535 over to https://github.com/avsm/ocaml-uri/tree/angstrom-parser that implements the differential fuzzer. So far, it's only found one difference: the handling of \n in inputs. The old implementation would terminate the parse upon hitting a newline, but the new one will urlencode the \n and continue consuming input. We could make it terminate in the new implementation to maintain existing behaviour, and subsequently allow them in a later release.

@anmonteiro
Copy link
Contributor Author

This is great! I can add you to my fork if you wanna push there. I can look into fixing that difference soon too.

@avsm
Copy link
Member

avsm commented Jan 1, 2020

pushed :)

@avsm
Copy link
Member

avsm commented Apr 2, 2020

Any further thoughts on that difference Antonio? It would be good to get this into a release

@anmonteiro
Copy link
Contributor Author

Thanks for the ping -- I had forgotten about this but I'm still interested in it. Give me a couple days and I'll look into this.

@avsm
Copy link
Member

avsm commented Apr 3, 2020

Much appreciated @anmonteiro! No rush of course :-)

@anmonteiro
Copy link
Contributor Author

@avsm Just pushed a commit that fixes the differences in newline handling between the old (regex) and new (Angstrom) parsers. Also added a compatibility test.

@hcarty
Copy link
Contributor

hcarty commented Jun 30, 2020

Thank you for working on this @anmonteiro!

@dinosaure
Copy link
Member

Let's revive this PR.

@dinosaure dinosaure mentioned this pull request Sep 28, 2020
@dinosaure dinosaure merged commit 8634a69 into mirage:master Oct 8, 2020
dinosaure added a commit to dinosaure/opam-repository that referenced this pull request Oct 14, 2020
CHANGES:

* sexp: use the sexplib v0.13 ppx directives (@avsm, mirage/ocaml-uri#143).
* rework the URI parser with `angstrom` (@anmonteiro, review @avsm & @dinosaure, mirage/ocaml-uri#142).
* add simple fuzzer tests between `angstrom` parser and _legacy_ parser (with `re.posix`, mirage/ocaml-uri#142)
* add support of modifying pct encoding (with a custom one) (@orbitz, review @anmonteiro, @tmcgilchrist, @avsm & @dinosaure, mirage/ocaml-uri#147)
* allow the selection of generic set of safe characters (with `Generic`) (@madroach, review @dinosaure, mirage/ocaml-uri#141)
dinosaure added a commit to dinosaure/opam-repository that referenced this pull request Oct 14, 2020
CHANGES:

* sexp: use the sexplib v0.13 ppx directives (@avsm, mirage/ocaml-uri#143).
* rework the URI parser with `angstrom` (@anmonteiro, review @avsm & @dinosaure, mirage/ocaml-uri#142).
* add simple fuzzer tests between `angstrom` parser and _legacy_ parser (with `re.posix`, mirage/ocaml-uri#142)
* add support of modifying pct encoding (with a custom one) (@orbitz, review @anmonteiro, @tmcgilchrist, @avsm & @dinosaure, mirage/ocaml-uri#147)
* allow the selection of generic set of safe characters (with `Generic`) (@madroach, review @dinosaure, mirage/ocaml-uri#141)
@anmonteiro anmonteiro deleted the anmonteiro/angstrom-parser branch August 16, 2021 04:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Is it possible to split ocaml-uri to avoid the dependency on ocaml-re ?
4 participants