Reduce overhead to check if a host is an IP Address #9095

Merged: 7 commits, Sep 9, 2024

@bdraco bdraco commented Sep 9, 2024

What do these changes do?

The IP address regexes are fairly complex and account for ~35% of the time spent in update_headers (https://github.com/aio-libs/aiohttp/blob/master/aiohttp/client_reqrep.py#L361).

The IP address checks do not need to validate that an address is well formed, only that the host is not a domain name (see #9095 (comment)), which means they can be much simpler.

Timings in seconds per 1,000,000 calls, the timeit default (more details below)

['ipv4 string - decode + isdigit', 0.05164358299225569]
['ipv4 string - regex', 0.24710870906710625]
['ipv4 binary - decode + isdigit', 0.07746362499892712]
['ipv4 binary - regex', 0.2532113748602569]
['ipv6 string - check for :', 0.00942283309996128]
['ipv6 string - regex', 1.4924992499873042]
['ipv6 binary - check for :', 0.11242458294145763]
['ipv6 binary - regex', 1.4484847909770906]
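
The rows above compare two cheap operations against the regexes: stripping the dots and calling isdigit() for IPv4, and looking for a colon for IPv6. A minimal sketch of helpers built on those two operations might look like this. The function names and the way bytes are handled are illustrative assumptions, not necessarily the code merged in this PR.

```python
from typing import Union

HostType = Union[str, bytes]


def looks_like_ipv4(host: HostType) -> bool:
    """Return True if host is made up only of digits and dots (no validation)."""
    if isinstance(host, bytes):
        # Mirrors the "decode + isdigit" rows in the timings.
        # Assumes ASCII input; anything else raises UnicodeDecodeError.
        host = host.decode("ascii")
    return host.replace(".", "").isdigit()


def looks_like_ipv6(host: HostType) -> bool:
    """Return True if host contains a colon (no validation)."""
    if isinstance(host, bytes):
        return b":" in host
    return ":" in host


def looks_like_ip(host: HostType) -> bool:
    """Return True if host looks like either address family."""
    return looks_like_ipv4(host) or looks_like_ipv6(host)
```

These helpers deliberately accept malformed input such as 999.999.999.999; they only separate "looks numeric or has a colon" from "looks like a domain name".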

Are there changes in behavior for the user?

no

Is it a substantial burden for the maintainers to support this?

no

[Images: before: is_ip_address_before; after: is_ipv4_address_after, is_ip_address_after]

codecov bot commented Sep 9, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.30%. Comparing base (a6dd415) to head (e7a21cc).
Report is 6 commits behind head on master.

✅ All tests successful. No failed tests found.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #9095   +/-   ##
=======================================
  Coverage   98.30%   98.30%           
=======================================
  Files         107      107           
  Lines       34403    34403           
  Branches     4074     4081    +7     
=======================================
  Hits        33819    33819           
  Misses        412      412           
  Partials      172      172           
Flag Coverage Δ
CI-GHA 98.19% <100.00%> (ø)
OS-Linux 97.86% <100.00%> (ø)
OS-Windows 96.27% <100.00%> (ø)
OS-macOS 97.53% <100.00%> (-0.02%) ⬇️
Py-3.10.11 97.63% <100.00%> (ø)
Py-3.10.14 97.56% <100.00%> (ø)
Py-3.11.9 97.79% <100.00%> (ø)
Py-3.12.5 97.91% <100.00%> (ø)
Py-3.9.13 97.52% <100.00%> (ø)
Py-3.9.19 97.46% <100.00%> (ø)
Py-pypy7.3.16 97.07% <100.00%> (ø)
VM-macos 97.53% <100.00%> (-0.02%) ⬇️
VM-ubuntu 97.86% <100.00%> (ø)
VM-windows 96.27% <100.00%> (ø)

@bdraco added the labels backport-3.10 (trigger automatic backporting to the 3.10 release branch by Patchback robot) and backport-3.11 (trigger automatic backporting to the 3.11 release branch by Patchback robot) Sep 9, 2024

Dreamsorcerer commented Sep 9, 2024

If this is a performance concern, wouldn't it make more sense to make this a much simpler heuristic? I believe we're only trying to decide if the user is attempting a request via an IP or a domain. There doesn't seem to be any real need to validate that the syntax is correct.

e.g. If we invert it to check if it looks like a domain, we could probably get away with something like if ":" in host or host.rsplit(".", maxsplit=1)[-1][0].isdigit(). Then it shouldn't be a domain according to the DNS syntax: https://www.rfc-editor.org/rfc/rfc1034#section-3.5
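
As a rough sketch, the one-liner suggested above could be wrapped in a function like the one below. The name is_probably_ip is hypothetical. One detail differs from the quoted expression: indexing the last label with [0] raises IndexError for an empty host or a host with a trailing dot, so the sketch slices with [:1] instead.

```python
def is_probably_ip(host: str) -> bool:
    """Heuristic: True when host does not look like a DNS name.

    A colon cannot appear in a domain name, and the check assumes
    that the last label of a domain (the TLD) never starts with a digit.
    """
    if ":" in host:
        return True
    last_label = host.rsplit(".", maxsplit=1)[-1]
    # [:1] yields "" for an empty label, and "".isdigit() is False,
    # so "" and "example.com." are handled without raising.
    return last_label[:1].isdigit()
```

With this version, "1.2.3.4" and "::1" are treated as IP addresses, while "example.com", "1password.com" and "26.0.0.73.com" are treated as domain names.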

bdraco commented Sep 9, 2024

> If this is a performance concern, wouldn't it make more sense to make this a much simpler heuristic? I believe we're only trying to decide if the user is attempting a request via an IP or a domain. There doesn't seem to be any real need to validate that the syntax is correct.

> e.g. If we invert it to check if it looks like a domain, we could probably get away with something like if ":" in host or host.rsplit(".", maxsplit=1)[-1][0].isdigit(). Then it shouldn't be a domain according to the DNS syntax: rfc-editor.org/rfc/rfc1034#section-3.5

Yeah, I think that makes a lot more sense. I'll audit all usage to see if we actually care whether it's valid or not.

bdraco commented Sep 9, 2024

is_ip_address is only used in cookies

if not self._unsafe and is_ip_address(hostname):

hostname comes from yarl as raw_host, it cannot contain a port

return not is_ip_address(hostname)

hostname comes from yarl as raw_host, it cannot contain a port

if is_ip_address(hostname):

--

if hostname and not self._is_domain_match(domain, hostname):
-- comes from yarl, will never contain a port
--
def clear_domain(self, domain: str) -> None:
-- it's passed in, so the caller would have to incorrectly add a port here, which seems unlikely

is_ipv4_address is never called directly

is_ipv6_address is only used in client_reqrep to add []

if helpers.is_ipv6_address(netloc):

hostname comes from yarl as raw_host, it cannot contain a port

if helpers.is_ipv6_address(connect_host):

hostname comes from yarl as raw_host, it cannot contain a port

I don't think any of the use cases need to validate
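
To illustrate why a bare colon check is enough for the bracket use case, here is a small sketch. The function host_header_value is hypothetical and is not aiohttp's code. It relies on the same assumption stated in the audit: the host arrives without a port, so any colon must belong to an IPv6 literal.

```python
from typing import Optional


def host_header_value(host: str, port: Optional[int] = None) -> str:
    """Build a host[:port] string, bracketing hosts that contain a colon."""
    # Assumption: host never carries a port, so ":" implies an IPv6 literal.
    netloc = f"[{host}]" if ":" in host else host
    if port is not None:
        netloc = f"{netloc}:{port}"
    return netloc
```

For example, host_header_value("2001:db8::1", 8080) gives "[2001:db8::1]:8080", while domain names and IPv4 addresses pass through without brackets.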

bdraco commented Sep 9, 2024

It's the IPv6 regex that is heavy:

['ipv4 string - decode + isdigit', 0.05164358299225569]
['ipv4 string - regex', 0.24710870906710625]
['ipv4 binary - decode + isdigit', 0.07746362499892712]
['ipv4 binary - regex', 0.2532113748602569]
['ipv6 string - check for :', 0.00942283309996128]
['ipv6 string - regex', 1.4924992499873042]
['ipv6 binary - check for :', 0.11242458294145763]
['ipv6 binary - regex', 1.4484847909770906]

import re
import timeit

_ipv4_pattern = (
    r"^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}"
    r"(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$"
)

_ipv6_pattern = (
    r"^(?:(?:(?:[A-F0-9]{1,4}:){6}|(?=(?:[A-F0-9]{0,4}:){0,6}"
    r"(?:[0-9]{1,3}\.){3}[0-9]{1,3}$)(([0-9A-F]{1,4}:){0,5}|:)"
    r"((:[0-9A-F]{1,4}){1,5}:|:)|::(?:[A-F0-9]{1,4}:){5})"
    r"(?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])\.){3}"
    r"(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9]?[0-9])|(?:[A-F0-9]{1,4}:){7}"
    r"[A-F0-9]{1,4}|(?=(?:[A-F0-9]{0,4}:){0,7}[A-F0-9]{0,4}$)"
    r"(([0-9A-F]{1,4}:){1,7}|:)((:[0-9A-F]{1,4}){1,7}|:)|(?:[A-F0-9]{1,4}:){7}"
    r":|:(:[A-F0-9]{1,4}){7})$"
)
_ipv4_regex = re.compile(_ipv4_pattern)
_ipv6_regex = re.compile(_ipv6_pattern, flags=re.IGNORECASE)
_ipv4_regexb = re.compile(_ipv4_pattern.encode("ascii"))
_ipv6_regexb = re.compile(_ipv6_pattern.encode("ascii"), flags=re.IGNORECASE)

# Time each cheap check against the matching regex, for str and bytes inputs.
# timeit.timeit() runs each statement 1,000,000 times by default, so every
# printed value is the total number of seconds for one million calls.
host_str = "1.2.3.4"
print(['ipv4 string - decode + isdigit', timeit.timeit('host_str.replace(".","").isdigit()', globals=locals())])
print(['ipv4 string - regex', timeit.timeit('_ipv4_regex.search(host_str)', globals=locals())])
host_bin = b"1.2.3.4"
print(['ipv4 binary - decode + isdigit', timeit.timeit('host_bin.decode("ascii").replace(".","").isdigit()', globals=locals())])
print(['ipv4 binary - regex', timeit.timeit('_ipv4_regexb.search(host_bin)', globals=locals())])
# Rebind the same names to an IPv6 sample for the remaining rows.
host_str = "2001:db8::ff00:42:8329"
print(['ipv6 string - check for :', timeit.timeit('":" in host_str', globals=locals())])
print(['ipv6 string - regex', timeit.timeit('_ipv6_regex.search(host_str)', globals=locals())])
host_bin = b"2001:db8::ff00:42:8329"
print(['ipv6 binary - check for :', timeit.timeit('b":" in host_bin', globals=locals())])
print(['ipv6 binary - regex', timeit.timeit('_ipv6_regexb.search(host_bin)', globals=locals())])
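
The benchmark measures speed only. As a complementary sketch, the cheap checks can be compared with strict parsing from the standard library's ipaddress module to see where the two disagree. The helper names and the sample list are assumptions chosen for illustration.

```python
import ipaddress
from typing import List


def heuristic_is_ip(host: str) -> bool:
    """The cheap checks from the benchmark: a colon, or only digits and dots."""
    return ":" in host or host.replace(".", "").isdigit()


def strict_is_ip(host: str) -> bool:
    """Strict validation using the standard library parser."""
    try:
        ipaddress.ip_address(host)
    except ValueError:
        return False
    return True


def disagreements(samples: List[str]) -> List[str]:
    """Return the hosts on which the heuristic and the strict parser differ."""
    return [h for h in samples if heuristic_is_ip(h) != strict_is_ip(h)]


SAMPLES = [
    "1.2.3.4",
    "2001:db8::ff00:42:8329",
    "example.com",
    "1password.com",
    "999.1.1.1",  # numeric but out of range
    "1.2.3",  # numeric but too few octets
]

print(disagreements(SAMPLES))
```

The two checks differ only on malformed numeric hosts such as 999.1.1.1 and 1.2.3, which the heuristic accepts. That matches the point made in this thread: the callers need to tell an IP-like host from a domain name, not to validate the address.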

Dreamsorcerer commented

> host_str.replace(".","").isdigit()

Strictly speaking, you could even do host_str[0].isdigit(), though I was initially thinking of checking the TLD specifically, to be on the safer side.

bdraco commented Sep 9, 2024

> host_str.replace(".","").isdigit()

> Strictly speaking, you could even do host_str[0].isdigit(), though I was initially thinking of checking the TLD specifically, to be on the safer side.

I don't think we can do that, since domains are allowed to start with a number, e.g. 1password.com.

I'm not sure there are any TLDs that end with a number, so host_str[-1].isdigit() might work... but I wonder if that could change.

Dreamsorcerer commented

> I don't think we can do that since domains are allowed to start with a number 1password.com

Oh, that actually violates the DNS syntax in the spec...

Dreamsorcerer commented

> I don't think we can do that since domains are allowed to start with a number 1password.com

> Oh, that actually violates the DNS syntax in the spec...

Nevermind, it was updated in https://www.rfc-editor.org/rfc/rfc1101 (section 3.1)

psf-chronographer bot added the label bot:chronographer:provided (there is a change note present in this PR) Sep 9, 2024
Dreamsorcerer commented

> I don't think we can do that since domains are allowed to start with a number 1password.com

> Oh, that actually violates the DNS syntax in the spec...

> Nevermind, it was updated in https://www.rfc-editor.org/rfc/rfc1101 (section 3.1)

It says that 26.0.0.73.COM is not a valid domain, but it's not clear how it reaches that conclusion.
But I suspect what you've got will be good enough.

bdraco commented Sep 9, 2024

> I don't think we can do that since domains are allowed to start with a number 1password.com

> Oh, that actually violates the DNS syntax in the spec...

> Nevermind, it was updated in rfc-editor.org/rfc/rfc1101 (section 3.1)

> It says that 26.0.0.73.COM is not a valid domain, but not clear how it reaches that conclusion.. But, I suspect what you've got will be good enough.

host[-1].isdigit() would be a tiny bit faster, but since the IPv6 check goes from 1.4924992499873042 to 0.00942283309996128 it probably doesn't matter much; the IPv4 case wasn't much of a problem to begin with. It's also unclear whether we could eventually end up with something like .co3 as a TLD, though I didn't see anything like that in https://publicsuffix.org/list/public_suffix_list.dat. TLDs can also be IDNA encoded; I'm not sure whether that could somehow end in a digit.
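
The three candidate IPv4 checks discussed in this thread can be compared side by side with a short sketch. The function names are made up for illustration, and example.co3 uses the hypothetical .co3 TLD mentioned above rather than a real one.

```python
def first_char_is_digit(host: str) -> bool:
    """host[0].isdigit(), written with a slice so an empty host is safe."""
    return host[:1].isdigit()


def last_char_is_digit(host: str) -> bool:
    """host[-1].isdigit(), written with a slice so an empty host is safe."""
    return host[-1:].isdigit()


def all_numeric(host: str) -> bool:
    """host.replace(".", "").isdigit(): only digits and dots."""
    return host.replace(".", "").isdigit()


HOSTS = ["1.2.3.4", "1password.com", "26.0.0.73.com", "example.co3"]

for host in HOSTS:
    print(
        f"{host:16}"
        f" first={first_char_is_digit(host)!s:5}"
        f" last={last_char_is_digit(host)!s:5}"
        f" all_numeric={all_numeric(host)}"
    )
```

Only the first-character check misclassifies 1password.com and 26.0.0.73.com as IP addresses, and only the last-character check would misclassify a host under a digit-terminated TLD such as the hypothetical example.co3. The all-numeric check avoids both cases at a small extra cost.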

@bdraco bdraco marked this pull request as ready for review September 9, 2024 18:05

bdraco commented Sep 9, 2024

Tested on a few production Home Assistant instances. There are lots of places that use cookies and IP addresses, so this has had some good real-world testing.

@bdraco bdraco merged commit ffcf9dc into master Sep 9, 2024
34 of 35 checks passed
@bdraco bdraco deleted the is_ip_address branch September 9, 2024 19:40

patchback bot commented Sep 9, 2024

Backport to 3.10: 💚 backport PR created

✅ Backport PR branch: patchback/backports/3.10/ffcf9dc4ea157adc5b7b5b31b6cc69f37d533122/pr-9095

Backported as #9096

🤖 @patchback
I'm built with octomachinery and
my source is open — https://github.com/sanitizers/patchback-github-app.

patchback bot commented Sep 9, 2024

Backport to 3.11: 💚 backport PR created

✅ Backport PR branch: patchback/backports/3.11/ffcf9dc4ea157adc5b7b5b31b6cc69f37d533122/pr-9095

Backported as #9097

patchback bot pushed a commit that referenced this pull request Sep 9, 2024
bdraco added a commit that referenced this pull request Sep 9, 2024
… is an IP Address (#9096)

Co-authored-by: J. Nick Koston <nick@koston.org>
bdraco added a commit that referenced this pull request Sep 9, 2024
… is an IP Address (#9097)

Co-authored-by: J. Nick Koston <nick@koston.org>