
Potential for URI parsing performance improvement. #151

@samoconnor


HTTP.jl uses the http_parser_parse_url function to parse URLs.

function http_parser_parse_url(url::String)

I believe this code is based on ngx_http_parse.c from NGINX. @quinnj is that right?

I recently added some more URI parsing tests based on https://github.com/cweb/url-testing/blob/master/urls.json and, while debugging them, wrote a simple regex pattern based on the regex from RFC 3986.
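For readers who have not seen it, here is a minimal sketch of how the generic reference regex printed in RFC 3986 (Appendix B) splits a URI into its five top-level components. The helper name `split_uri` and the returned named tuple are assumptions made for illustration; they are not part of HTTP.jl, and the pattern used in URIs.jl is a more detailed variant.

```julia
# Reference regex as printed in RFC 3986, Appendix B.
# Capture groups: 2 = scheme, 4 = authority, 5 = path, 7 = query, 9 = fragment.
const rfc3986_regex =
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"

# Hypothetical helper (not an HTTP.jl function): split a URI reference
# into its components, using "" for components that are absent.
function split_uri(str::AbstractString)
    # Every group in the pattern is optional, so a match is always found.
    m = match(rfc3986_regex, str)
    part(i) = m[i] === nothing ? "" : String(m[i])
    return (scheme = part(2), authority = part(4), path = part(5),
            query = part(7), fragment = part(9))
end

u = split_uri("http://www.ics.uci.edu/pub/ietf/uri/#Related")
println(u.scheme, " | ", u.authority, " | ", u.path, " | ", u.fragment)
```

Note that this sketch does not distinguish an empty component from a missing one, which a real parser may need to do.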

It turns out that the simple regex parser is faster than http_parser_parse_url.

Running test/uri_benchmark.jl shows that the regex parser runs in 47% of the time taken by http_parser_parse_url:

  3.058562 seconds (19.64 M allocations: 748.444 MiB, 2.00% gc time)
http_parser_parse_url parsed 204 urls 10000 times in 3059.0 ms
  1.436758 seconds (18.69 M allocations: 1.159 GiB, 6.28% gc time)
regex_parse parsed 204 urls 10000 times in 1437.0 ms (47.0%)
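The comparison above can be reproduced in outline with a small timing loop. This is only a sketch of the approach, not the contents of test/uri_benchmark.jl: the helper `time_parser`, the placeholder URL list, and the two stand-in parser functions are all assumptions.

```julia
# Hypothetical timing helper: run `parse_fn` over every URL `n` times
# and return the elapsed wall-clock time in seconds.
function time_parser(parse_fn, urls, n)
    return @elapsed for _ in 1:n, u in urls
        parse_fn(u)
    end
end

# Placeholder inputs and stand-in parsers, for illustration only.
urls = ["http://example.com/a?b=c#d", "/relative/path"]
baseline  = time_parser(u -> split(u, '/'), urls, 1_000)
candidate = time_parser(u -> match(r"^([^?#]*)", u), urls, 1_000)
println("candidate / baseline = ",
        round(100 * candidate / baseline, digits = 1), "%")

# The percentage in the output above follows the same arithmetic:
reported = round(100 * 1437.0 / 3059.0, digits = 1)
println("reported ratio: ", reported, "%")
```

The first call to each function includes compilation time, so a careful comparison would run each parser once before timing it.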

The regex parser is in URIs.jl here:

HTTP.jl/src/URIs.jl

Lines 101 to 121 in 6ee7083

const uri_reference_regex =
    r"""^
    (?: ([^:/?#]+) :) ?                 # 1. scheme
    (?: // (?: ([^/?#@]*) @) ?          # 2. userinfo
           (?| (?: \[ ([^\]]+) \] )     # 3. host (ipv6)
             | ([^:/?#\[]*) )           # 3. host
           (?: : ([^/?#]+) ) ? ) ?      # 4. port
    ([^?#]*)                            # 5. path
    (?: \?([^#]*) ) ?                   # 6. query
    (?: [#](.*) ) ?                     # 7. fragment
    $"""x

const empty = SubString("", 1, 0)

function regex_parse(::Type{URI}, str::AbstractString)
    m = match(uri_reference_regex, str)
    if m == nothing
        return emptyuri
    end
    return URI(str, (c = m[1]) == nothing ? empty : c,
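One detail of the pattern worth spelling out is the `(?| ... )` branch-reset group around the host: it makes both alternatives (bracketed IPv6 literal and plain host) fill the same capture group, so the host is always group 3 in the full regex. A minimal sketch of the idea, using a cut-down pattern whose name `host_regex` is an assumption for illustration:

```julia
# Cut-down host pattern (illustrative only). In extended mode ("x")
# unescaped whitespace outside character classes is ignored.
# Because of (?| ... ), both alternatives capture into group 1.
const host_regex = r"^(?| \[ ([^\]]+) \] | ([^:/?#\[]*) )"x

ipv6  = match(host_regex, "[::1]:8080")
plain = match(host_regex, "example.com:80")

println(ipv6[1])   # host taken from inside the brackets
println(plain[1])  # host taken up to the first ':' '/' '?' or '#'
```

Without the branch reset, the two alternatives would be numbered as separate groups and the caller would have to check both.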
