Initial fix for #1529 #1582
Conversation
Is changing the port numbers necessary for this fix? It's fine if it is, but knowing nothing about the real source of the problem, it's hard for me to see why changing port assignments would be necessary.
One of the sources of the problem is that some of the Narwhal tasks fail to start (the primary or the worker, depending on the configs and port allocation). The primary (or the worker) then keeps trying to connect to the other, using all the CPU if needed. Narwhal assumes that the primary runs in the same data center (or on the same machine) as its workers and thus implements no exponential backoff when failing to connect to them. Some code refactoring in Sui seems to have messed up the config.
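The missing backoff described above could be added with a capped exponential retry loop. The sketch below is illustrative only: `connect_with_backoff` and its constants are hypothetical names, not Narwhal's actual API, and the closure stands in for the real dial logic.

```rust
use std::thread;
use std::time::Duration;

/// Retry a fallible connect with capped exponential backoff instead of
/// spinning at full CPU. `connect` is a stand-in for the real dial logic.
fn connect_with_backoff<T, E>(
    mut connect: impl FnMut() -> Result<T, E>,
    base: Duration,
    cap: Duration,
    max_attempts: u32,
) -> Result<T, E> {
    let mut delay = base;
    let mut last_err = None;
    for _ in 0..max_attempts {
        match connect() {
            Ok(conn) => return Ok(conn),
            Err(e) => {
                last_err = Some(e);
                thread::sleep(delay);
                // Double the delay on each failure, up to the cap.
                delay = (delay * 2).min(cap);
            }
        }
    }
    Err(last_err.expect("max_attempts must be > 0"))
}

fn main() {
    // Succeed on the third attempt; earlier failures back off briefly.
    let mut attempts = 0;
    let result = connect_with_backoff(
        || {
            attempts += 1;
            if attempts < 3 { Err("refused") } else { Ok("connected") }
        },
        Duration::from_millis(1),
        Duration::from_millis(8),
        10,
    );
    assert_eq!(result, Ok("connected"));
    assert_eq!(attempts, 3);
    println!("connected after {} attempts", attempts);
}
```

With a cap in place, a worker whose primary never comes up sleeps most of the time instead of pegging a CPU core.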
So Narwhal uses the port numbers to decide whether its peers are running locally or not? If we change the port numbers we will have to coordinate that with https://github.com/MystenLabs/sui-operations/blob/main/docker/validator/Dockerfile#L50 for the next deploy. This is not a big deal, but as we get closer to testnet such changes may become increasingly hard to manage. (At least until we are using some sort of discovery mechanism that doesn't rely on putting ports in config files.)
No, it does not care whether its peers are remote or not. We were just re-using some of the same ports for Sui and for the Narwhal primary.
@asonnino and others: I see there is a change here to increase the spacing between Narwhal ports to +100. If there isn't a good reason to space by 100, could we use a smaller spacing, say +10/+20, for Narwhal? Thanks
I am approving to make sure that you are not blocked @asonnino, but please have a look at the comment about the port changes.
let consensus_parameters = ConsensusParameters {
    max_header_delay: 5_000,
    max_batch_delay: 5_000,
    ..ConsensusParameters::default()
};
Maybe not for this PR, but let's consider reading those from a file. Eventually we might want to get them from env variables injected into the Docker image, but let's discuss that.
There is an issue for this #1553
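Reading those delays from a config file instead of hard-coding them could look like the minimal sketch below. The field names mirror the snippet above, but the `key = value` format and the `from_config` helper are illustrative assumptions, not the actual design tracked in #1553 (a real implementation would likely use serde/TOML).

```rust
use std::collections::HashMap;

/// Illustrative subset of the consensus parameters hard-coded above.
#[derive(Debug, PartialEq)]
struct ConsensusParameters {
    max_header_delay: u64, // milliseconds
    max_batch_delay: u64,  // milliseconds
}

impl Default for ConsensusParameters {
    fn default() -> Self {
        Self { max_header_delay: 100, max_batch_delay: 100 }
    }
}

impl ConsensusParameters {
    /// Parse `key = value` lines, falling back to defaults for missing keys.
    fn from_config(text: &str) -> Self {
        let map: HashMap<&str, u64> = text
            .lines()
            .filter_map(|line| line.split_once('='))
            .filter_map(|(k, v)| v.trim().parse().ok().map(|v| (k.trim(), v)))
            .collect();
        let defaults = Self::default();
        Self {
            max_header_delay: *map.get("max_header_delay").unwrap_or(&defaults.max_header_delay),
            max_batch_delay: *map.get("max_batch_delay").unwrap_or(&defaults.max_batch_delay),
        }
    }
}

fn main() {
    let cfg = ConsensusParameters::from_config(
        "max_header_delay = 5000\nmax_batch_delay = 5000\n",
    );
    assert_eq!(
        cfg,
        ConsensusParameters { max_header_delay: 5_000, max_batch_delay: 5_000 }
    );
    println!("{:?}", cfg);
}
```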
        primary_to_primary: format!("{}:{}", x.host, x.port + 100).parse().unwrap(),
        worker_to_primary: format!("{}:{}", x.host, x.port + 200).parse().unwrap(),
    };
    let workers = [(
        /* worker_id */ 0,
        WorkerAddresses {
-           primary_to_worker: format!("{}:{}", x.host, x.port + 3)
+           primary_to_worker: format!("{}:{}", x.host, x.port + 300)
                .parse()
                .unwrap(),
            transactions: x.consensus_address,
-           worker_to_worker: format!("{}:{}", x.host, x.port + 4).parse().unwrap(),
+           worker_to_worker: format!("{}:{}", x.host, x.port + 400)
                .parse()
                .unwrap(),
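The diff above derives every Narwhal address from a single base port plus a fixed offset. The scheme can be summarized with the sketch below; the constant and function names are hypothetical, chosen only to mirror the offsets in the diff.

```rust
/// Fixed offsets from a node's base port, mirroring the diff above.
const PRIMARY_TO_PRIMARY: u16 = 100;
const WORKER_TO_PRIMARY: u16 = 200;
const PRIMARY_TO_WORKER: u16 = 300;
const WORKER_TO_WORKER: u16 = 400;

/// Build a "host:port" address from a base port and a channel offset.
fn addr(host: &str, base_port: u16, offset: u16) -> String {
    format!("{}:{}", host, base_port + offset)
}

fn main() {
    let (host, base) = ("127.0.0.1", 12_000);
    assert_eq!(addr(host, base, PRIMARY_TO_PRIMARY), "127.0.0.1:12100");
    assert_eq!(addr(host, base, WORKER_TO_PRIMARY), "127.0.0.1:12200");
    assert_eq!(addr(host, base, PRIMARY_TO_WORKER), "127.0.0.1:12300");
    assert_eq!(addr(host, base, WORKER_TO_WORKER), "127.0.0.1:12400");
}
```

Because all four channels hang off one base port, two nodes on the same machine only need distinct base ports to avoid any collision.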
Given that there is a conflict, couldn't we just change the x.port parameter? E.g. if it runs on 12000, maybe make it run on 10200, so we avoid touching those here. I can't evaluate at the moment whether that would fix the issue, but I am thinking we might get the same effect with minimal changes.
I am not sure how the Docker files are written (perhaps they open a range of ports, so changing the base may be a problem). I just tried to fix the errors that people have been seeing in the simplest way. All of this needs to be properly fixed; I will bring up the discussion at a future meeting.
My understanding is that this fix is to allow running a cluster locally (multiple nodes on one machine). In docker there should be no need to worry about port conflicts because each node is on its own IP.
Cool, so this (ugly) fix should work for everything we need for DevNet if I get it right
@asonnino still not clear on what the original conflict is. I don't think you would get conflicts from Docker itself; can you clarify how you ran into the original conflicts? In which environment, and what do you mean by running a cluster locally? If it is the standard Sui validator command, you just need +10 separation, not +100.
Yes, a separation of +10 should work. Sadly I don't remember exactly what the problem was (I have been extinguishing fires for the last week). I think there was a problem with tracing, plus one primary trying to boot on the same port as a Sui authority.
@asonnino and others: I'm about done with deploy changes that would let you/us easily test cluster network changes using Docker Compose. What is tested would be a preview of what is deployed. In the meantime, if we can keep the Narwhal ports within a fixed range, that would be much appreciated. I'm changing the port spacing between nodes to be 1000, so if we can keep the range of Narwhal ports to < +1000 that would be cool. :)
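The constraint above is easy to check arithmetically: with node base ports spaced 1000 apart and Narwhal offsets of at most +400, no two validators' derived ports can overlap. A small sketch (the base port 12000 and node count are illustrative assumptions):

```rust
use std::collections::HashSet;

fn main() {
    // Numbers from the thread: nodes spaced 1000 apart, offsets +100..+400.
    let node_spacing: u16 = 1_000;
    let offsets: [u16; 4] = [100, 200, 300, 400];

    // Every offset must fit inside a single node's slice of the port space.
    assert!(offsets.iter().all(|&o| o < node_spacing));

    // Derived ports for four hypothetical validators; all must be unique.
    let mut seen = HashSet::new();
    for node in 0..4u16 {
        let base = 12_000 + node * node_spacing;
        for &offset in &offsets {
            let port = base + offset;
            assert!(seen.insert(port), "port collision at {}", port);
        }
    }
    println!("no collisions across {} derived ports", seen.len());
}
```

A +100 spacing with a max offset of +400 therefore stays comfortably under the per-node +1000 budget.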
Cool, then we don't load consensus ports from a file for DevNet, right?
@asonnino sorry, I'm not sure I get your question. Consensus ports are, according to this change, a fixed distance from the base port, right? The base port will be set from a file and by environment variable. For certain reasons I need to test them at different ports for each validator in DevNet.
Yes, exactly. So basically we don't touch the consensus port config any more (not before DevNet, at least)?
See #1529 for details.
Note that this pull request by itself does not solve the issue; we also need to land MystenLabs/narwhal#158. After the Narwhal PR lands, I need to update a number of Cargo.toml files in Sui to fully fix the issue.
Edit: These changes are necessary but they still do not seem to fix the issue. The CPU consumption goes back to 100% after a while (a few minutes rather than immediately). The Narwhal primary (rather than the worker or consensus) seems to be the problem.