Skip to content
This repository has been archived by the owner on Jun 20, 2024. It is now read-only.

feat: improved http limits and dns cache #46

Merged
merged 3 commits into from
Feb 21, 2023

Conversation

lidel
Copy link
Collaborator

@lidel lidel commented Feb 18, 2023

This PR

  • improves HTTP limits to allow more concurrency per L1 peer and in general
  • removes TLS cert validation when we retrieve blocks from strn.pl because it has no value:
    • bifrost-gw will fetch raw blocks and discard if received data does not match expected CID
    • caboose should not care if TLS cert is valid, nor if the name from cert matches DNS name, because we get raw IPs from orchestrator anyway, and fake it in TLS client 🙈
    • we don't want to break gateway if Saturn ever changes the name in their certs and our client starts erroring on TLS handshake
    • tldr: only value of TLS in strn.pl is that we can use HTTP/2, nothing more
  • adds DNS cache that will not remove stale records if DNS zone suddenly disappears

Closes #34

(this was pushed very late Friday, needs some well rested eyes next week)

// to save CPU and to avoid catastrophic failure when
// Saturn L1s suddenly switch to certs with different DNS name.
InsecureSkipVerify: true,
// ServerName: "strn.pl",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it makes sense to still pin this, so that for now servers that fail to some generic nginx or other domain don't get queried

this is more about whether we are more worried about the orchestrator or the 'untrusted' nodes, i suppose

Copy link
Collaborator Author

@lidel lidel Feb 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am strongly against pinning this, it opens vector for DoS.

  • no catastrophy: untrusted nodes or IPs being repurposed for something else won't break caboose. if L1s can't provide useful response, they will get weighted down, and over time removed by orchestrator.
  • catastrophy: we should be worried about TLS certs being rotated into different DNS name and our HTTP client failing to use L1s due to TLS validation error caused by hardcoded server name. This is a real risk [can elaborate in private why]

dns_cache.go Outdated Show resolved Hide resolved
// Saturn L1s suddenly switch to certs with different DNS name.
InsecureSkipVerify: true,
// ServerName: "strn.pl",
},
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I recently noticed in RAPIDE, I belive the golang std HTTP2 client use 1.5~2x more CPU (this is kinda eyeballing, turning off http2 improved the profiles by thoses ranges).
My guess is that you will keep the same connections to mostly the same saturn nodes, I think using HTTP1.1 exclusively will be beneficial. This also does not have HOL issues.

Suggested change
},
},
TLSNextProto: make(map[string]func(authority string, c *tls.Conn) http.RoundTripper),

Such optimization at this stage might be out of place given more important oddities are still here.

Copy link
Collaborator Author

@lidel lidel Feb 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@guanzo is it possible to update L1s and disable redirect from http:// to https:// if request is for verifiable CAR or a Block?

We would like to run tests and see if paying the cost of TLS gives us any perf benefit at all.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some transport encryption so that a passive network observer can't see what CIDs are being accessed seems still useful

Copy link
Collaborator Author

@lidel lidel Feb 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see value here: everything is in the clear text on DHT, and here you get CIDs comingled together across all gateway users.

Until we have fully privacy-preserving DHT and IPNI, TLS here is just a security theater. Attacker can see way more by passively running a DHT server (IPs of exact user requesting CID, and not gateway run by PL).

Is the idea to keep TLS here as “aspirational” baseline for the time when we have full end-to-end privacy?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • these queries are largely going to hit the L1 cache where they won't go back to the DHT - this lets you hide which CIDs are popular at the gateway
  • arguing 'we can't improve privacy anywhere until we have it everywhere' doesn't seem like a winning strategy either. Unless this is a huge reduction in performance it doesn't seem like this hurts.

Copy link
Collaborator Author

@lidel lidel Feb 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack. Even if we don't want to do this, measuring how much it could buy us feels like a useful thing to know.
We can't measure it without disabling redirect I mentioned. But ok to park this for now (can do it at any time on staging env).

Copy link
Collaborator

@guanzo guanzo Feb 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible to update L1s and disable redirect from http:// to https:// if request is for verifiable CAR or a Block?

doable, ya. Although if the perf difference is significant, maybe this is something you can benchmark locally with a saturn fork?

Copy link
Collaborator Author

@lidel lidel Feb 21, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I won't be able to replicate real world traffic locally, but we have such traffic mirrored on bifrost-gw-staging env.
@guanzo if you could disable mentioned redirect (for car / block requests), we will be able to test there.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lidel should be disabled in prod now filecoin-saturn/L1-node@108e0c6

dns_cache.go Outdated Show resolved Hide resolved
If persistOnFailure is true, stale
entries will not be removed on failed lookups
@lidel lidel merged commit 5393c4c into main Feb 21, 2023
@lidel lidel deleted the feat/improved-http-limits-and-dns-cache branch February 21, 2023 22:00
Comment on lines +57 to +63
// Saturn use TLS in controversial ways, which sooner or
// later will force them to switch away to different domain
// name and certs, in which case they will break us. Since
// we are fetching raw blocks and dont really care about
// TLS cert being legitimate, let's disable verification
// to save CPU and to avoid catastrophic failure when
// Saturn L1s suddenly switch to certs with different DNS name.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if this is correct:

  1. It doesn't save any significant CPU. All crypto operations are still the same. You're only saving the check if the chain ends in the root CA store, and some trivial string matching on the domain name.
  2. Saturn switching to a different cert should be fine, as long as they're doing a redirect. If the plan is to use a cert for domain X on domain Y, well, that's just plain wrong.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack on 1, skipping these checks is just a bonus.
As for 2, given last week's incidents, and internal conversations, bifrost-gateway does not take any unnecessary risk with Saturn's TLS/DNS setup.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Don't die when strn.pl DNS fails to resolve
6 participants