Skip to content

Conversation

s6eskand
Copy link
Contributor

@s6eskand s6eskand commented May 21, 2025

Description

There currently is no way to connect to a given gRPC endpoint using TLS without manually providing a CA certificate file, certificate file, and/or key file. This causes issues when using managed certificates where it is not possible to connect via TLS to a secured endpoint unless the relevant secrets are distributed across workers and manually accessed via the config.

The proposed changes add an option for the ClientTlsConfig of use_native_roots. This will ensure that tonic will use the root native certs for it's tls config. If the field use_native_roots is present ca, cert, and key files will be ignored if provided but are not required. This adds the tonic tls-native-roots feature.

NOTE: The ca_file field in the nativelink_config::stores::ClientTlsConfig is no longer required.

Type of change

  • New feature (non-breaking change which adds functionality)
  • This change requires a documentation update

How Has This Been Tested?

Tested connecting remote store and the scheduler using different configs for the tls_config. Also added unit tests for tls_utils.rs
Cases:

  1. tls_config is not specified -> no behaviour change
  2. tls_config is specified but empty -> Raises error
  3. tls_config specified with use_native_roots option true -> uses native roots and successfully creates connection
  4. tls_config specified with use_native_roots option false -> Raises same error as 2
  5. tls_config specified with combinations of ca_file, cert_file, key_file -> no behaviour change

Checklist

  • Updated documentation if needed
  • Tests added/amended
  • bazel test //... passes locally
  • PR is contained in a single commit, using git amend see some docs

This change is Reviewable

@CLAassistant
Copy link

CLAassistant commented May 21, 2025

CLA assistant check
All committers have signed the CLA.

@s6eskand
Copy link
Contributor Author

@MarcusSorealheis

@MarcusSorealheis
Copy link
Collaborator

Wow. This is brilliant! Thank you @s6eskand. I'm testing the changes locally and then will definitely stamp. Give me a moment. We'll also need to cut a point release to bring this change into a released version cc @luis-munoz @jaroeichler

Copy link
Collaborator

@MarcusSorealheis MarcusSorealheis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this is almost perfect, but I'm not sure either comment should block this PR for expediency's sake. There are three items that I think are worth discussing at some point, or quickly.

(1) we need to document some example of a config for each tls option (certificate, key, native root CA). This might be better done in a blogpost explaining the different options.

(2) For our safety and usability backlog, we might consider a warning to customers if they have no tls_config defined. I'm not sure that is a good change at this time because I need to do more thinking on it. Trust is for the customer to decide.

(3) Does it make sense to test and account for a garbage ca file? I think this is a novice yet honest mistake. I don't think we need to optimize for this user quite yet, but we should think about them. At a minimum, I want to have a helpful error a la "Invalid CA. Must be a valid .pem or .crt file."

I am curious what the others think, but I'm ready to approve this because it works.

Reviewable status: 0 of 2 LGTMs obtained, and 0 of 6 files reviewed, and pending CI: Installation / macos-15, Local / lre-rs / macos-15, NativeLink.com Cloud / Remote Cache / macos-15, Remote / lre-cc / xlarge-ubuntu-24.04, Remote / lre-rs / xlarge-ubuntu-24.04, windows-2022 / stable

Copy link
Collaborator

@MarcusSorealheis MarcusSorealheis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: 1 of 2 LGTMs obtained, and 0 of 6 files reviewed, and pending CI: Installation / macos-15, Local / lre-rs / macos-15, NativeLink.com Cloud / Remote Cache / macos-15, Remote / lre-cc / xlarge-ubuntu-24.04, Remote / lre-rs / xlarge-ubuntu-24.04, windows-2022 / stable

@s6eskand
Copy link
Contributor Author

(1) we need to document some example of a config for each tls option (certificate, key, native root CA). This might be better done in a blogpost explaining the different options.

(2) For our safety and usability backlog, we might consider a warning to customers if they have no tls_config defined. I'm not sure that is a good change at this time because I need to do more thinking on it. Trust is for the customer to decide.

(3) Does it make sense to test and account for a garbage ca file? I think this is a novice yet honest mistake. I don't think we need to optimize for this user quite yet, but we should think about them. At a minimum, I want to have a helpful error a la "Invalid CA. Must be a valid .pem or .crt file."

@MarcusSorealheis

  1. Agree, let me know how I can help! Blog post is immensely useful, I also think a small example in the nativelink-config examples is also beneficial.
  2. I think what would be helpful here is raising an error if the uri specifies a secure scheme but no TLS configuration is provided. The connection will error out anyway so better to give a more defined reason. I added this to the changes along with some unit tests please review and let me know your thoughts.
  3. tonic already does some validation for cert and key files. Here's an example error when trying to use a dummy ca_file in my worker config: tonic::transport::Error(Transport, ConnectError(Custom { kind: InvalidData, error: InvalidCertificate(UnknownIssuer) }))

@MarcusSorealheis
Copy link
Collaborator

I don't think we need to say something about invalid cert today because of tonic's validation, but I also believe it can be helpful in the future to provide an even easier to understand warning that customers can take specific action on. I don't want our beginning users to need to think about Tonic too much, as great as it is.

Copy link
Contributor

@aaronmondal aaronmondal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot!

cc @jaroeichler I noticed that we use webpki_roots in the S3 which probably causes other issues like the one fixed here down the line, i.e. the s3 client doesn't use the infrastructure that this PR touches. Could you look into implementing something similar for that? I believe contrary to tonic, the aws packages already support native-platform-verifier.

Reviewed 3 of 6 files at r1, 2 of 2 files at r2, all commit messages.
Reviewable status: 1 of 2 LGTMs obtained, and 5 of 6 files reviewed, and 3 discussions need to be resolved


-- commits line 2 at r2:
nit: Add the described behavior change from the PR in the commit message body.


nativelink-util/Cargo.toml line 54 at r2 (raw file):

serde = { version = "1.0.219", default-features = false }
sha2 = { version = "0.10.8", default-features = false }
tempfile = "3.20.0"

nit: Add this as tempfile = { version = "xxx", default-features = false } and add the enabled features explicitly. I.e. tempfile = { version = "xxx", default-features = false, features = ["getrandom"] }


nativelink-util/Cargo.toml line 66 at r2 (raw file):

tokio-util = { version = "0.7.14" }
tonic = { version = "0.13.0", features = [
  "tls-native-roots",

FYI I noticed that tonic upstream still appears to use rustls-native-certs which has been superseded by rustls-platform-verifier. We might fix this upstream in the future.


nativelink-util/src/tls_utils.rs line 27 at r2 (raw file):

    if config.use_native_roots == Some(true) {
        return Ok(Some(

In case 3 of your described behavior change, this look like it would entirely ignore the other ca config options. That's fine, but it could lead to UX issues where one is wondering why some specified ca file is not being used.

I think the general behavior is fine, but let's e.g. print a warning (use tracing::warn and then warn!("XXXX") in this branch if any of the other ca options are set so that the user gets informed about those options being ignored.


nativelink-util/tests/tls_utils_test.rs line 20 at r2 (raw file):

use nativelink_util::tls_utils::{endpoint_from, load_client_config};
use tempfile::NamedTempFile;

FYI Thanks a lot for these nice tests ❤️

@MarcusSorealheis
Copy link
Collaborator

MarcusSorealheis commented May 22, 2025

The warn is perfect @s6eskand, thank you!

There currently is no way to connect to a given gRPC endpoint using TLS without manually providing a CA certificate file, certificate file, and/or key file. This causes issues when using managed certificates where it is not possible to connect via TLS to a secured endpoint unless the relevant secrets are distributed across workers and manually accessed via the config.

The proposed changes add an option for the ClientTlsConfig of use_native_roots. This will ensure that tonic will use the root native certs for it's tls config. If the field use_native_roots is present ca, cert, and key files will be ignored if provided but are not required. This adds the tonic tls-native-roots feature.

NOTE: The ca_file field in the nativelink_config::stores::ClientTlsConfig is no longer required.
@s6eskand s6eskand force-pushed the add-support-for-native-certs branch from 2c5ac2a to e7e8cd7 Compare May 22, 2025 08:21
Copy link
Contributor

@jaroeichler jaroeichler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work, thanks!

Reviewed 6 of 6 files at r1, 1 of 2 files at r2, 2 of 2 files at r3, all commit messages.
Reviewable status: 1 of 2 LGTMs obtained, and all files reviewed, and pending CI: Analyze (javascript-typescript), Bazel Dev / macos-15, Cargo Dev / ubuntu-24.04, Installation / macos-15, NativeLink.com Cloud / Remote Cache / macos-15, Publish image, Web Platform Deployment / ubuntu-24.04, and 4 discussions need to be resolved


-- commits line 4 at r3:
nit: Limit to 72 characters per line.

@MarcusSorealheis
Copy link
Collaborator

I reinitiated all of the jobs that GitHub stopped doing because they failed.

Copy link
Contributor

@aaronmondal aaronmondal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

:lgtm:

Reviewed 1 of 6 files at r1, 2 of 2 files at r3, all commit messages.
Reviewable status: :shipit: complete! 2 of 2 LGTMs obtained, and all files reviewed


-- commits line 4 at r3:

Previously, jaroeichler wrote…

nit: Limit to 72 characters per line.

Done. I'll add this during merge.

@aaronmondal aaronmondal merged commit cd3f993 into TraceMachina:main May 22, 2025
37 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants