rustc can produce non-deterministic crates that can't be mixed together #89904

Closed

Description

We're using nix to build Rust crates and benefit from binary caches, which are usually populated by our CI. I'll assume a slight familiarity with nix, but I can elaborate when necessary. The details are not too important, but it means that the following scenario can happen:

Consider a crate C that depends on crate B that depends on crate A (that depends on std). Now, the following sequence of events happens.

  • Start with empty binary cache
  • Developer hacks on code and ends up building only crate A locally. Let's call this build result A-local. This goes into the /nix/store and A is never rebuilt again (it is in an immutable location); it is always reused as long as all the inputs (sources, dependencies) are identical.
  • Developer is happy and pushes his code. CI starts.
  • CI does a run wanting to build more of the project, which in our case is crates A, B and C. CI is also using nix and the binary cache is currently empty.
  • CI builds A: this has exactly the same inputs and so is given exactly the same hash. Let's call it A-cache. It pushes A-cache to some binary cache (in S3 for example).
  • CI also builds B using the A-cache it just built; let's say it's called B-cache. It pushes it to the binary cache too.
  • Lastly, CI builds C as C-cache and pushes it to the binary cache. Everything is great, CI is green.
  • The developer now wants C on the local machine; perhaps it is a binary crate. However, the developer first makes some changes to the C source code, maybe adding some debug info. So what do we have to build?
    • A doesn't need building: it's already stored locally as A-local in local /nix/store
    • B is needed to build C: luckily B exists in a binary cache as B-cache so we just download it.
    • C source code changed so we have to build it, we have A-local and B-cache in the dependency tree now.

Does the build succeed? Most of the time, yes. But sometimes, it fails if rustc produced two incompatible results for A: that is, if A-local and A-cache differ, we're in trouble. The developer can't use A-local with B-cache: only A-cache is usable.
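
To make the scenario concrete, here is a toy model of it. This is only a sketch: `Artifact`, `Store`, `deps_match` and `scenario` are made-up names, the `u64` values stand in for whatever hash rustc embeds in each rlib, and nothing here is actual nix or rustc API. The point it illustrates is that the store key depends only on the inputs, while B remembers the output of the A it was compiled against.

```rust
use std::collections::HashMap;

/// A build result. `content` stands in for the hash rustc embeds in the rlib.
#[derive(Debug)]
struct Artifact {
    content: u64,
    /// Content hashes of the dependencies this artifact was compiled against.
    built_against: Vec<u64>,
}

/// An input-addressed store: the key is derived from the inputs only,
/// never from the bytes that the build actually produced.
#[derive(Default)]
struct Store {
    paths: HashMap<String, Artifact>,
}

impl Store {
    /// Paths are immutable: an existing path is never replaced.
    fn add_if_missing(&mut self, input_key: &str, artifact: Artifact) {
        self.paths.entry(input_key.to_string()).or_insert(artifact);
    }
}

/// True when every dependency hash recorded in `dependent` matches
/// the artifact that is actually available for that dependency.
fn deps_match(dependent: &Artifact, available: &[&Artifact]) -> bool {
    dependent.built_against.len() == available.len()
        && dependent
            .built_against
            .iter()
            .zip(available)
            .all(|(want, have)| *want == have.content)
}

/// Replays the sequence of events above and reports whether the
/// developer's local A is usable together with the downloaded B.
fn scenario(a_local_hash: u64, a_cache_hash: u64) -> bool {
    let mut local = Store::default();
    // The developer builds A locally first.
    local.add_if_missing("a-input-hash", Artifact { content: a_local_hash, built_against: vec![] });

    // CI builds A and B; B records the hash of the A it saw.
    let a_cache = Artifact { content: a_cache_hash, built_against: vec![] };
    let b_cache = Artifact { content: 42, built_against: vec![a_cache.content] };

    // The developer substitutes from the cache. A already exists locally
    // under the same key, so A-cache is never fetched.
    local.add_if_missing("a-input-hash", a_cache);
    local.add_if_missing("b-input-hash", b_cache);

    let a = &local.paths["a-input-hash"];
    let b = &local.paths["b-input-hash"];
    deps_match(b, &[a])
}
```

With a deterministic compiler, `scenario(1, 1)` is the only case that can occur and the build of C goes through. `scenario(1, 2)` models the failure: the key is the same, but the A that B was built against is not the A on disk.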

An error message might look like:

error[E0460]: found possibly newer version of crate `parse_display` which `trader` depends on
  --> src/c/foo.rs:21:5
   |
21 | use b::something;
   |     ^^^^^^
   |
   = note: perhaps that crate needs to be recompiled?
   = note: the following crate versions were found:
           crate `a`: /nix/store/a-local-input-hash/lib/liba-local-38cc57346a.rlib
           crate `b`: /nix/store/b-local-input-hash/lib/libb-local-f69511fec1.rlib

error: aborting due to previous error

In my real case, A is the parse_display crate, B is our local crate called trader and C is another crate in the workspace that depends on trader. As B (trader) came from the binary cache, it is binary-identical and we only have to focus on A.

Investigation

First, let's see what rustc says:

 INFO rustc_metadata::creader resolving dep crate parse_display hash: `5963747a8c8f099e` extra filename: `-38cc57346a`
 INFO rustc_metadata::creader resolving crate `parse_display`
 INFO rustc_metadata::creader falling back to a load
 INFO rustc_metadata::locator lib candidate: target/deps/libparse_display-38cc57346a.rlib
 INFO rustc_metadata::locator rlib reading metadata from: /nix/store/pfy5c0x8jbx44xfnsrqb5yp5sz5ahpxx-rust_parse-display-0.5.2-lib/lib/libparse_display-38cc57346a.rlib
 INFO rustc_metadata::locator Rejecting via hash: expected 5963747a8c8f099e got 5f612fa7f4240092
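
When the log is long, the rejection line can be picked out mechanically. The following is a small sketch; `parse_hash_rejection` is a hypothetical helper, and the line format is simply copied from the output above, so it may not match other rustc versions.

```rust
/// Extracts `(expected, got)` from a log line of the form
/// "... Rejecting via hash: expected <hex> got <hex>".
/// Returns `None` for any line that does not have that shape.
fn parse_hash_rejection(line: &str) -> Option<(u64, u64)> {
    // Everything after the marker should be "<expected> got <got>".
    let rest = line.split("Rejecting via hash: expected ").nth(1)?;
    let mut parts = rest.split_whitespace();
    let expected = parts.next()?;
    if parts.next()? != "got" {
        return None;
    }
    let got = parts.next()?;
    Some((
        u64::from_str_radix(expected, 16).ok()?,
        u64::from_str_radix(got, 16).ok()?,
    ))
}
```

Feeding it the last line of the log above yields the pair `5963747a8c8f099e` / `5f612fa7f4240092` as integers, which is handy for comparing many runs.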

Looks like the hash it expects is not what it gets. Back to this later. Let's try to look for something obvious in the .rlib.

$ sha256sum libparse_display-38cc57346a.rlib /nix/store/pfy5c0x8jbx44xfnsrqb5yp5sz5ahpxx-rust_parse-display-0.5.2-lib/lib/libparse_display-38cc57346a.rlib
c26bf883248419284608aa1595043e67a9a8a137a62e345172cd87fc112e32a2  libparse_display-38cc57346a.rlib
05af37f3ed3bd95c001f0754e5102589e2d03640de4c65b33093dfcac63b7f2f  /nix/store/pfy5c0x8jbx44xfnsrqb5yp5sz5ahpxx-rust_parse-display-0.5.2-lib/lib/libparse_display-38cc57346a.rlib
$ diff <(nm ./libparse_display-38cc57346a.rlib) <(nm /nix/store/pfy5c0x8jbx44xfnsrqb5yp5sz5ahpxx-rust_parse-display-0.5.2-lib/lib/libparse_display-38cc57346a.rlib)
nm: lib.rmeta: no symbols
nm: lib.rmeta: no symbols

[shana@aya:/tmp/foo]$ diff <(objdump -x ./libparse_display-38cc57346a.rlib) <(objdump -x /nix/store/pfy5c0x8jbx44xfnsrqb5yp5sz5ahpxx-rust_parse-display-0.5.2-lib/lib/libparse_display-38cc57346a.rlib)
1c1
< In archive ./libparse_display-38cc57346a.rlib:
---
> In archive /nix/store/pfy5c0x8jbx44xfnsrqb5yp5sz5ahpxx-rust_parse-display-0.5.2-lib/lib/libparse_display-38cc57346a.rlib:
155c155
<   0 .rmeta        000096d9  0000000000000000  0000000000000000  00000040  2**0
---
>   0 .rmeta        000096da  0000000000000000  0000000000000000  00000040  2**0

[shana@aya:/tmp/foo]$ diff <(strings ./libparse_display-38cc57346a.rlib) <(strings /nix/store/pfy5c0x8jbx44xfnsrqb5yp5sz5ahpxx-rust_parse-display-0.5.2-lib/lib/libparse_display-38cc57346a.rlib)

If I compare binaries side-by-side, it looks like the two files are 1 byte misaligned in first section and have minor differences later on. I will include both files for your own inspection too.

 228   │ 000038c0: 635f 756e 7769 6e64 94d9 ef8b cdb7 abd5 5600 0111 2d30 3666 3031 6163 3235 3738 6264 6139 3409 6f6e 6365 5f63 656c 6cd2 8bbc ec86 a5dc ab8b 0100 020b 2d35 6636  c_unwind........V...-06f01ac2578bda94.once_cell.............-5f6             000038c0: 635f 756e 7769 6e64 94d9 ef8b cdb7 abd5 5600 0111 2d30 3666 3031 6163 3235 3738 6264 6139 3409 6f6e 6365 5f63 656c 6cd2 8bbc ec86 a5dc ab8b 0100 020b 2d35 6636  c_unwind........V...-06f01ac2578bda94.once_cell.............-5f6
 229   │ 00003900: 3138 6161 3539 6605 7265 6765 78a2 84b8 b4ed bbff 9332 0002 0b2d 6230 3132 3039 6331 6531 0c72 6567 6578 5f73 796e 7461 788b a4cc c6c0 ddf3 a7de 0100 020b 2d35  18aa59f.regex........2...-b01209c1e1.regex_syntax.............-5             00003900: 3138 6161 3539 6605 7265 6765 78a2 84b8 b4ed bbff 9332 0002 0b2d 6230 3132 3039 6331 6531 0c72 6567 6578 5f73 796e 7461 788b a4cc c6c0 ddf3 a7de 0100 020b 2d35  18aa59f.regex........2...-b01209c1e1.regex_syntax.............-5
 230   │ 00003940: 3238 3233 6265 3866 380c 6168 6f5f 636f 7261 7369 636b 9aca c888 a2e8 82ca 2c00 020b 2d35 3438 6335 3563 3461 3606 6d65 6d63 6872 c7f1 a28d f4a2 84c0 0a00 020b  2823be8f8.aho_corasick........,...-548c55c4a6.memchr............             00003940: 3238 3233 6265 3866 380c 6168 6f5f 636f 7261 7369 636b 9aca c888 a2e8 82ca 2c00 020b 2d35 3438 6335 3563 3461 3606 6d65 6d63 6872 c7f1 a28d f4a2 84c0 0a00 020b  2823be8f8.aho_corasick........,...-548c55c4a6.memchr............
 231   │ 00003980: 2d31 3331 6563 3830 6434 3114 7061 7273 655f 6469 7370 6c61 795f 6465 7269 7665 ce9a f3ad b9c0 86a2 8b01 0000 0b2d 6335 6132 6166 3538 3438 0000 001d 2f90 b8af  -131ec80d41.parse_display_derive.............-c5a2af5848..../...           | 00003980: 2d31 3331 6563 3830 6434 3114 7061 7273 655f 6469 7370 6c61 795f 6465 7269 7665 9fda dfa4 e3d3 f580 1700 000b 2d63 3561 3261 6635 3834 3800 0000 1d2f 90b8 af64  -131ec80d41.parse_display_derive............-c5a2af5848..../...d
 232   │ 000039c0: 6448 ea7a e3a5 752d 3bbf 2d01 0001 001d 2f90 b8af 6448 eaa4 cc8c bfdb f1b9 2301 0003 0373 7464 001d 2f90 b8af 6448 eaee aad1 d9c3 701f 6101 0001 011d 2f90 b8af  dH.z..u-;.-...../...dH........#....std../...dH......p.a...../...           | 000039c0: 48ea 7ae3 a575 2d3b bf2d 0100 0100 1d2f 90b8 af64 48ea a4cc 8cbf dbf1 b923 0100 0303 7374 6400 1d2f 90b8 af64 48ea eeaa d1d9 c370 1f61 0100 0101 1d2f 90b8 af64  H.z..u-;.-...../...dH........#....std../...dH......p.a...../...d
 233   │ 00003a00: 6448 ea6f ff54 d9ae a053 9f01 0001 021d 2f90 b8af 6448 eaa0 3fcf ee88 19b8 c901 0001 031d 2f90 b8af 6448 eae6 0668 5fbb d337 7901 0001 041d 2f90 b8af 6448 ea21  dH.o.T...S....../...dH..?.........../...dH...h_..7y...../...dH.!           | 00003a00: 48ea 6fff 54d9 aea0 539f 0100 0102 1d2f 90b8 af64 48ea a03f cfee 8819 b8c9 0100 0103 1d2f 90b8 af64 48ea e606 685f bbd3 3779 0100 0104 1d2f 90b8 af64 48ea 2185  H.o.T...S....../...dH..?.........../...dH...h_..7y...../...dH.!.
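
One possible reading of the dump, offered as an assumption rather than a confirmed diagnosis: if rustc's metadata encoder writes integers as unsigned LEB128 (variable-length, 7 bits per byte), then the bytes right after the name `parse_display_derive` would be one integer, plausibly that dependency's hash, and a different value could occupy a different number of bytes. The sketch below locates the first differing offset between two buffers and decodes one LEB128 integer. `first_difference`, `read_uleb128`, `LEFT` and `RIGHT` are names invented here; the two byte arrays are copied from row 231 of the dump, starting right after `parse_display_derive`.

```rust
/// Offset of the first byte at which `a` and `b` differ, if any.
/// A length mismatch counts as a difference at the end of the shorter slice.
fn first_difference(a: &[u8], b: &[u8]) -> Option<usize> {
    match a.iter().zip(b).position(|(x, y)| x != y) {
        Some(i) => Some(i),
        None if a.len() != b.len() => Some(a.len().min(b.len())),
        None => None,
    }
}

/// Decodes one unsigned LEB128 integer from the start of `bytes`.
/// Returns the value and the number of bytes consumed, or `None` if the
/// input is truncated or does not fit in 64 bits.
fn read_uleb128(bytes: &[u8]) -> Option<(u64, usize)> {
    let mut value: u64 = 0;
    for (i, &byte) in bytes.iter().enumerate() {
        let shift = 7 * i as u32;
        if shift >= 64 {
            return None;
        }
        let low = (byte & 0x7f) as u64;
        // The tenth byte may only contribute the single remaining bit.
        if shift == 63 && low > 1 {
            return None;
        }
        value |= low << shift;
        // A clear high bit marks the last byte of the integer.
        if byte & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    None
}

/// Bytes following "parse_display_derive" in the left column of row 231.
const LEFT: [u8; 13] = [
    0xce, 0x9a, 0xf3, 0xad, 0xb9, 0xc0, 0x86, 0xa2, 0x8b, 0x01, 0x00, 0x00, 0x0b,
];
/// Bytes following "parse_display_derive" in the right column of row 231.
const RIGHT: [u8; 12] = [
    0x9f, 0xda, 0xdf, 0xa4, 0xe3, 0xd3, 0xf5, 0x80, 0x17, 0x00, 0x00, 0x0b,
];
```

Under that assumption, `read_uleb128(&LEFT)` consumes 10 bytes and `read_uleb128(&RIGHT)` consumes 9, and both are followed by the same `00 00 0b` before `-c5a2af5848`. That would account for the 1-byte shift and would point at the recorded value for parse_display_derive as the first thing that differs.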

Of course, it's very difficult to tell anything by just looking at raw binary. I tried to add extra logging to rustc itself so that I could see which values it gets during hashing: if the hash inputs were different, we'd see which ones and be able to work backwards from there. Looking around, I think the crate_hash function is the place to change:

diff --git a/compiler/rustc_middle/src/hir/map/mod.rs b/compiler/rustc_middle/src/hir/map/mod.rs
index 392372fad53..dafd15b5a46 100644
--- a/compiler/rustc_middle/src/hir/map/mod.rs
+++ b/compiler/rustc_middle/src/hir/map/mod.rs
@@ -968,6 +968,7 @@ pub(super) fn index_hir<'tcx>(tcx: TyCtxt<'tcx>, (): ()) -> &'tcx IndexedHir<'tc
 pub(super) fn crate_hash(tcx: TyCtxt<'_>, crate_num: CrateNum) -> Svh {
     assert_eq!(crate_num, LOCAL_CRATE);
 
+    tracing::info!("Starting to hash {:?}", crate_num);
     // We can access untracked state since we are an eval_always query.
     let mut hcx = tcx.create_stable_hashing_context();
 
@@ -984,7 +985,9 @@ pub(super) fn crate_hash(tcx: TyCtxt<'_>, crate_num: CrateNum) -> Svh {
             Some((def_path_hash, hasher.finish()))
         })
         .collect();
+    tracing::info!("pre-sort {:?}: {:?}", crate_num, hir_body_nodes);
     hir_body_nodes.sort_unstable_by_key(|bn| bn.0);
+    tracing::info!("post-sort {:?}: {:?}", crate_num, hir_body_nodes);
 
     let node_hashes = hir_body_nodes.iter().fold(
         Fingerprint::ZERO,
@@ -992,8 +995,11 @@ pub(super) fn crate_hash(tcx: TyCtxt<'_>, crate_num: CrateNum) -> Svh {
             combined_fingerprint.combine(def_path_hash.0.combine(fingerprint))
         },
     );
+    tracing::info!("node_hashes {:?}: {:?}", crate_num, node_hashes);
+
 
     let upstream_crates = upstream_crates(tcx);
+    tracing::info!("upstream_crates {:?}: {:?}", crate_num, upstream_crates);
 
     // We hash the final, remapped names of all local source files so we
     // don't have to include the path prefix remapping commandline args.
@@ -1009,17 +1015,26 @@ pub(super) fn crate_hash(tcx: TyCtxt<'_>, crate_num: CrateNum) -> Svh {
         .map(|source_file| source_file.name_hash)
         .collect();
 
+    tracing::info!("source_file_names before {:?}: {:?}", crate_num, source_file_names);
     source_file_names.sort_unstable();
+    tracing::info!("source_file_names after {:?}: {:?}", crate_num, source_file_names);
 
     let mut stable_hasher = StableHasher::new();
+    tracing::info!("node_hashes {:?}: {:?}", crate_num, node_hashes);
     node_hashes.hash_stable(&mut hcx, &mut stable_hasher);
+    tracing::info!("upstream_crates {:?}: {:?}", crate_num, upstream_crates);
     upstream_crates.hash_stable(&mut hcx, &mut stable_hasher);
+    tracing::info!("source_file_names {:?}: {:?}", crate_num, source_file_names);
     source_file_names.hash_stable(&mut hcx, &mut stable_hasher);
+    tracing::info!("opts {:?}: {:?}", crate_num, tcx.sess.opts.dep_tracking_hash(true));
     tcx.sess.opts.dep_tracking_hash(true).hash_stable(&mut hcx, &mut stable_hasher);
+    tracing::info!("local_stable_crate_id {:?}: {:?}", crate_num, tcx.sess.local_stable_crate_id());
     tcx.sess.local_stable_crate_id().hash_stable(&mut hcx, &mut stable_hasher);
+    tracing::info!("non_exported_macro_attrs {:?}: {:?}", crate_num, tcx.untracked_crate.non_exported_macro_attrs);
     tcx.untracked_crate.non_exported_macro_attrs.hash_stable(&mut hcx, &mut stable_hasher);
 
     let crate_hash: Fingerprint = stable_hasher.finish();
+    tracing::info!("crate_hash {:?}: {:?}", crate_num, crate_hash);
     Svh::new(crate_hash.to_smaller_hash())
 }

I built the patched compiler with the following config.toml:

# Includes one of the default files in src/bootstrap/defaults
profile = "compiler"
changelog-seen = 2
[rust]
channel = "stable"

Sadly, this doesn't work, for an ironically similar reason: building this compiler locally gives a new std hash, which means I can't re-invoke the original rustc command with the original .rlibs, as those depend on a std with a different hash...
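
The reason a different std hash poisons everything downstream can be shown with a toy model. This sketch is not rustc's algorithm: it uses `DefaultHasher` from the standard library instead of rustc's `StableHasher`, and `toy_crate_hash` and `chain` are made-up names. It only mirrors the structure visible in the diff above, where the hashes of the upstream crates are one of the inputs to a crate's own hash.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Toy stand-in for a crate hash: mixes the crate's own content with
/// the hashes of all of its upstream crates.
fn toy_crate_hash(own_content: &str, upstream_hashes: &[u64]) -> u64 {
    let mut sorted = upstream_hashes.to_vec();
    // Sorting makes the result independent of the order dependencies are listed in.
    sorted.sort_unstable();
    let mut hasher = DefaultHasher::new();
    own_content.hash(&mut hasher);
    sorted.hash(&mut hasher);
    hasher.finish()
}

/// Hashes the chain std -> A -> B for a given std hash and returns (A, B).
fn chain(std_hash: u64) -> (u64, u64) {
    let a = toy_crate_hash("crate a", &[std_hash]);
    let b = toy_crate_hash("crate b", &[std_hash, a]);
    (a, b)
}
```

`chain(1)` and `chain(2)` differ in both components even though the sources of A and B are unchanged, which is why rlibs built against one std cannot be loaded by a compiler that ships another.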

At this point I decided to open this ticket: how can I debug this further? I am more or less getting ready to gut rustc to the point that I can call crate_hash myself or read the rlibs directly, but it seems like a lot of work and I'm not confident it's even going to work. I didn't find any tool to read rlib files either.

If we can find what the difference between the two files is, presumably we can work backwards and find why rustc produced two different results.

Here are the files: different_rlibs.tar.gz. local is the one I built locally and cache is the one from the binary cache, built by the CI machine.

Meta

rustc --version --verbose:

rustc 1.55.0 (c8dfcfe04 2021-09-06)

I originally suspected that codegen-units might be the culprit: some concurrent resource getting modified non-deterministically, such as a counter handing out IDs that made it into the output. But no, the issue occurred even with codegen-units=1.

It is difficult to verify or replicate the bug because it happens fairly rarely, so looking at the rlibs I attached above is probably the fastest way to figure something out.
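
For anyone who wants to try reproducing a rare mismatch anyway, a brute-force harness is one option. The sketch below is generic on purpose: `digest` and `distinct_outputs` are hypothetical helpers, and `build` is any closure that performs one clean build and returns the bytes of the resulting artifact (for example by spawning the compiler and reading the rlib back, which is left out here).

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::BTreeMap;
use std::hash::{Hash, Hasher};

/// Cheap, non-cryptographic digest; good enough to tell outputs apart.
fn digest(bytes: &[u8]) -> u64 {
    let mut hasher = DefaultHasher::new();
    bytes.hash(&mut hasher);
    hasher.finish()
}

/// Runs `build` `runs` times and counts how often each distinct
/// output digest was produced.
fn distinct_outputs<F: FnMut() -> Vec<u8>>(mut build: F, runs: usize) -> BTreeMap<u64, usize> {
    let mut seen = BTreeMap::new();
    for _ in 0..runs {
        *seen.entry(digest(&build())).or_insert(0) += 1;
    }
    seen
}
```

A reproducible build yields a map with a single entry; more than one entry means the same inputs produced different outputs, and the counts give a rough idea of how rare the odd result is.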
