Initial fix for #1529 #1582

Merged
merged 2 commits into main from fix-1529 on Apr 26, 2022

Conversation

asonnino
Contributor

@asonnino asonnino commented Apr 25, 2022

See #1529 for details.

Note that this pull request by itself does not solve the issue; we also need to land MystenLabs/narwhal#158. After the Narwhal PR lands, I need to update a number of Cargo.toml files in Sui to fully fix the issue.

Edit:

These changes are necessary, but they still do not seem to fix the issue. The CPU consumption goes back to 100% after a while (a few minutes rather than immediately). The Narwhal primary seems to be the problem (not the worker or consensus).

@asonnino asonnino added this to the DevNet milestone Apr 25, 2022
@asonnino asonnino linked an issue Apr 25, 2022 that may be closed by this pull request
@asonnino asonnino added the Type: Bug, Status: Needs Review, and sui-node labels Apr 25, 2022
@asonnino asonnino self-assigned this Apr 25, 2022
@asonnino asonnino requested a review from mystenmark April 25, 2022 18:55
@asonnino asonnino assigned huitseeker and unassigned huitseeker Apr 25, 2022
@asonnino asonnino requested review from gdanezis and huitseeker April 25, 2022 18:55
@asonnino asonnino enabled auto-merge (squash) April 25, 2022 18:56
@asonnino asonnino disabled auto-merge April 25, 2022 18:57
@mystenmark
Contributor

Is changing the port numbers necessary for this fix? It's fine if it is - but knowing nothing about the real source of the problem, it's hard for me to see why changing port assignments would be necessary.

@asonnino
Contributor Author

asonnino commented Apr 25, 2022

> Is changing the port numbers necessary for this fix? It's fine if it is - but knowing nothing about the real source of the problem, it's hard for me to see why changing port assignments would be necessary.

One of the sources of the problem is that some of the Narwhal tasks fail to start (the primary or the worker, depending on the configs and port allocation). The primary (or the worker) then keeps trying to connect to its counterpart, using all available CPU. Narwhal assumes that the primary runs in the same data center (or on the same machine) as its workers and thus implements no exponential backoff when failing to connect to them.

There has been some code refactoring in Sui that seems to have messed up the config.
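
To illustrate the failure mode described above, here is a minimal sketch (not Narwhal's actual networking code; the function name and constants are invented) of the kind of bounded exponential backoff that is missing from the primary/worker reconnect path:

use std::time::Duration;
use tokio::{net::TcpStream, time::sleep};

// Hypothetical reconnect helper: sleep between attempts instead of retrying
// in a tight loop, which is what burns 100% CPU when the peer never comes up.
async fn connect_with_backoff(addr: &str) -> TcpStream {
    let mut delay = Duration::from_millis(50);
    loop {
        match TcpStream::connect(addr).await {
            Ok(stream) => return stream,
            Err(_) => {
                sleep(delay).await;
                // Double the delay up to a 5-second cap so an unreachable
                // peer costs almost no CPU.
                delay = (delay * 2).min(Duration::from_secs(5));
            }
        }
    }
}

With a hot retry loop and no sleep, a single misconfigured or unreachable peer is enough to pin a core at 100%, which matches the symptom described in #1529.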

@asonnino asonnino enabled auto-merge (squash) April 25, 2022 21:58
@asonnino asonnino changed the title from "Fix 1529" to "Initial fix for #1529" Apr 25, 2022
@asonnino asonnino requested review from akichidis and LefKok April 25, 2022 22:09
@mystenmark
Contributor

So Narwhal uses the port numbers to decide whether its peers are running locally or not?

If we change the port numbers, we will have to coordinate that with https://github.com/MystenLabs/sui-operations/blob/main/docker/validator/Dockerfile#L50 for the next deploy. This is not a big deal, but as we get closer to testnet such changes may become increasingly hard to manage (at least until we use some sort of discovery mechanism that doesn't rely on putting ports in config files).

@asonnino
Contributor Author

> So Narwhal uses the port numbers to decide whether its peers are running locally or not?
>
> If we change the port numbers, we will have to coordinate that with https://github.com/MystenLabs/sui-operations/blob/main/docker/validator/Dockerfile#L50 for the next deploy. This is not a big deal, but as we get closer to testnet such changes may become increasingly hard to manage (at least until we use some sort of discovery mechanism that doesn't rely on putting ports in config files).

No, it does not care whether its peers are remote or not. We were just re-using some of the ports for both Sui and the Narwhal primary.

@velvia
Contributor

velvia commented Apr 26, 2022

@asonnino and others: I see there is a change here to increase the spacing between Narwhal ports to +100.
For the AWS deployment there is something I'm trying where I need to vary the port numbers between validators, so I'm wondering if we can constrain the range of port numbers for deployment purposes. I.e., I need to know the maximum range of ports we can use for Narwhal; from here it looks like it would be 401, so it would be safe if validator 2 uses validator 1's base port + 500, etc.

If there isn't a good reason to space by 100, could we use a smaller spacing, say +10/+20, for Narwhal? Thanks
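
To make the arithmetic concrete, here is a rough sketch of the constraint being discussed (the constant and helper are hypothetical, derived only from the +100..+400 offsets this PR introduces):

// Offsets used in this PR relative to a validator's base port
// (see the diff further down); purely illustrative constants.
const NARWHAL_PORT_OFFSETS: [u16; 4] = [100, 200, 300, 400];

// Hypothetical check: with validator base ports spaced `spacing` apart,
// adjacent validators cannot collide as long as the spacing exceeds the
// largest Narwhal offset.
fn spacing_is_safe(spacing: u16) -> bool {
    spacing > *NARWHAL_PORT_OFFSETS.iter().max().unwrap()
}

fn main() {
    assert!(spacing_is_safe(500));  // validator 2 = validator 1 + 500 is fine
    assert!(!spacing_is_safe(400)); // +400 would land on the last offset
}

Under that reading, the span of Narwhal ports is 400 above the base, so any per-validator spacing of 401 or more (e.g. +500 or +1000) avoids collisions.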

Contributor

@akichidis akichidis left a comment

I am approving to make sure that you are not blocked, @asonnino, but please have a look at the comment about the port changes.

Comment on lines +103 to +107
let consensus_parameters = ConsensusParameters {
max_header_delay: 5_000,
max_batch_delay: 5_000,
..ConsensusParameters::default()
};
Contributor

Maybe not for this PR, but let's consider reading those from a file. Eventually we might want to get them from env variables injected into the Docker image, but let's discuss that.
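
For reference, a minimal sketch of the env-variable option (the variable names are hypothetical, and it assumes the ConsensusParameters struct from the snippet above is in scope):

// Hypothetical: read the two delays from the environment, falling back to
// the hard-coded 5_000 ms defaults used in this PR when a variable is
// unset or malformed.
fn consensus_parameters_from_env() -> ConsensusParameters {
    let read_ms = |key: &str, default: u64| {
        std::env::var(key)
            .ok()
            .and_then(|v| v.parse::<u64>().ok())
            .unwrap_or(default)
    };
    ConsensusParameters {
        max_header_delay: read_ms("NARWHAL_MAX_HEADER_DELAY_MS", 5_000),
        max_batch_delay: read_ms("NARWHAL_MAX_BATCH_DELAY_MS", 5_000),
        ..ConsensusParameters::default()
    }
}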

Contributor Author

There is an issue for this: #1553.

Comment on lines +179 to +191
+        primary_to_primary: format!("{}:{}", x.host, x.port + 100).parse().unwrap(),
+        worker_to_primary: format!("{}:{}", x.host, x.port + 200).parse().unwrap(),
     };
     let workers = [(
         /* worker_id */ 0,
         WorkerAddresses {
-            primary_to_worker: format!("{}:{}", x.host, x.port + 3)
+            primary_to_worker: format!("{}:{}", x.host, x.port + 300)
                 .parse()
                 .unwrap(),
             transactions: x.consensus_address,
-            worker_to_worker: format!("{}:{}", x.host, x.port + 4).parse().unwrap(),
+            worker_to_worker: format!("{}:{}", x.host, x.port + 400)
+                .parse()
+                .unwrap(),
Contributor

Given that there is a conflict, couldn't we just change the x.port parameter? E.g., if it runs on 12000, maybe make it run on 10200, so we avoid touching those here. I can't evaluate at the moment whether that would fix the issue, but I am thinking that we might get the same effect with minimal changes.

Contributor Author

I am not sure how the Docker files are written (perhaps they open a range of ports, so changing the base may be a problem). I just tried to fix some errors that people have been seeing in the simplest way. All of this needs to be properly fixed; I will bring up the discussion at an upcoming meeting.

Contributor

My understanding is that this fix is to allow running a cluster locally (multiple nodes on one machine). In Docker there should be no need to worry about port conflicts because each node has its own IP.

Contributor Author

Cool, so if I get it right, this (ugly) fix should work for everything we need for DevNet.

Contributor

@asonnino I'm still not clear on what the original conflict is. I don't think you would get conflicts from Docker itself; can you clarify how you ran into the original conflicts? In which environment, and what do you mean by running a cluster locally? If it is the standard Sui validator command, you just need +10 separation, not +100...

Contributor Author

Yes, a separation of +10 should work. Sadly, I don't remember exactly what the problem was (I have been extinguishing fires for the last week). I think there was a problem with tracing, plus one primary trying to boot on the same port as a Sui authority.

@asonnino asonnino merged commit 30a199d into main Apr 26, 2022
@asonnino asonnino deleted the fix-1529 branch April 26, 2022 08:48
@velvia
Contributor

velvia commented Apr 26, 2022

@asonnino and others: I'm about done with deploy changes that would let you/us easily test cluster network changes using Docker Compose; what is tested would be a preview of what is deployed. In the meantime, if we can keep the ports to a fixed range, that would be much appreciated. I'm changing the port spacing between nodes to be around 1000, so if we can keep the range of Narwhal ports to < +1000 that would be cool. :)

@asonnino
Contributor Author

Cool, then we don't load consensus ports from a file for DevNet, right?

@velvia
Contributor

velvia commented Apr 26, 2022

@asonnino sorry, I'm not sure I get your question... consensus ports are, according to this change, a fixed distance from the base port, right? The base port will be set from a file and by environment variable. For certain reasons I need to test them at different ports for each validator in DevNet.

@asonnino
Contributor Author

Yes, exactly. So basically we don't touch the consensus port config anymore (not before DevNet, at least)?
