Initial fix for #1529 #1582

Merged
merged 2 commits into main from fix-1529 on Apr 26, 2022

Conversation

asonnino
Contributor

@asonnino asonnino commented Apr 25, 2022

See #1529 for details.

Note that this pull request by itself does not solve the issue; we also need to land MystenLabs/narwhal#158. After the Narwhal PR lands, I need to update a number of Cargo.toml files in Sui to fully fix the issue.

Edit:

These changes are necessary, but they still do not seem to fix the issue. The CPU consumption goes back to 100% after a while (a few minutes rather than immediately). The Narwhal primary seems to be the problem (not the worker or consensus).

@asonnino asonnino added this to the DevNet milestone Apr 25, 2022
@asonnino asonnino linked an issue Apr 25, 2022 that may be closed by this pull request
@asonnino asonnino added the Type: Bug, Status: Needs Review, and sui-node labels Apr 25, 2022
@asonnino asonnino self-assigned this Apr 25, 2022
@asonnino asonnino requested a review from mystenmark April 25, 2022 18:55
@asonnino asonnino assigned huitseeker and unassigned huitseeker Apr 25, 2022
@asonnino asonnino requested review from gdanezis and huitseeker April 25, 2022 18:55
@asonnino asonnino enabled auto-merge (squash) April 25, 2022 18:56
@asonnino asonnino disabled auto-merge April 25, 2022 18:57
@mystenmark
Contributor

Is changing the port numbers necessary for this fix? It's fine if it is - but knowing nothing about the real source of the problem, it's hard for me to see why changing port assignments would be necessary.

@asonnino
Contributor Author

asonnino commented Apr 25, 2022

> Is changing the port numbers necessary for this fix? It's fine if it is - but knowing nothing about the real source of the problem, it's hard for me to see why changing port assignments would be necessary.

One of the sources of the problem is that some of the Narwhal tasks fail to start (the primary or the worker, depending on the configs and port allocation). The primary (or the worker) then keeps trying to connect to its counterpart, using all available CPU. Narwhal assumes that the primary runs in the same data center (or on the same machine) as its workers and thus implements no exponential backoff when failing to connect to them.

There has been some code refactoring in Sui that seems to have messed up the config.
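
To illustrate the failure mode described above, here is a minimal sketch (not Narwhal's actual networking code; the function name and constants are invented) of the kind of bounded exponential backoff that is missing from the primary/worker reconnect path:

use std::time::Duration;
use tokio::{net::TcpStream, time::sleep};

// Hypothetical reconnect helper: sleep between attempts instead of retrying
// in a tight loop, which is what burns 100% CPU when the peer never comes up.
async fn connect_with_backoff(addr: &str) -> TcpStream {
    let mut delay = Duration::from_millis(50);
    loop {
        match TcpStream::connect(addr).await {
            Ok(stream) => return stream,
            Err(_) => {
                sleep(delay).await;
                // Double the delay up to a 5-second cap so an unreachable
                // peer costs almost no CPU.
                delay = (delay * 2).min(Duration::from_secs(5));
            }
        }
    }
}

With a hot retry loop and no sleep, a single misconfigured or unreachable peer is enough to pin a core at 100%, which matches the symptom described in #1529.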

@asonnino asonnino enabled auto-merge (squash) April 25, 2022 21:58
@asonnino asonnino changed the title from "Fix 1529" to "Initial fix for #1529" Apr 25, 2022
@asonnino asonnino requested review from akichidis and LefKok April 25, 2022 22:09
@mystenmark
Contributor

So Narwhal uses the port numbers to decide whether its peers are running locally or not?

If we change the port numbers, we will have to coordinate that with https://github.com/MystenLabs/sui-operations/blob/main/docker/validator/Dockerfile#L50 for the next deploy. This is not a big deal, but as we get closer to testnet such changes may become increasingly hard to manage (at least until we use some sort of discovery mechanism that doesn't rely on putting ports in config files).

@asonnino
Contributor Author

> So Narwhal uses the port numbers to decide whether its peers are running locally or not?
>
> If we change the port numbers, we will have to coordinate that with https://github.com/MystenLabs/sui-operations/blob/main/docker/validator/Dockerfile#L50 for the next deploy. This is not a big deal, but as we get closer to testnet such changes may become increasingly hard to manage (at least until we use some sort of discovery mechanism that doesn't rely on putting ports in config files).

No, it does not care whether its peers are remote or not. We were just re-using some of the ports for both Sui and the Narwhal primary.

@velvia
Contributor

velvia commented Apr 26, 2022

@asonnino and others: I see there is a change here to increase the spacing between Narwhal ports to +100.
For the AWS deployment there is something I'm trying where I need to vary the port numbers between validators, so I'm wondering if we can constrain the range of port numbers for deployment purposes. I.e., I need to know the maximum range of ports we can use for Narwhal; from here it looks like it would be 401, so it would be safe if validator 2 uses validator 1's base port + 500, etc.

If there isn't a good reason to space by 100, could we use a smaller spacing, say +10/+20, for Narwhal? Thanks
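
To make the arithmetic concrete, here is a rough sketch of the constraint being discussed (the constant and helper are hypothetical, derived only from the +100..+400 offsets this PR introduces):

// Offsets used in this PR relative to a validator's base port
// (see the diff further down); purely illustrative constants.
const NARWHAL_PORT_OFFSETS: [u16; 4] = [100, 200, 300, 400];

// Hypothetical check: with validator base ports spaced `spacing` apart,
// adjacent validators cannot collide as long as the spacing exceeds the
// largest Narwhal offset.
fn spacing_is_safe(spacing: u16) -> bool {
    spacing > *NARWHAL_PORT_OFFSETS.iter().max().unwrap()
}

fn main() {
    assert!(spacing_is_safe(500));  // validator 2 = validator 1 + 500 is fine
    assert!(!spacing_is_safe(400)); // +400 would land on the last offset
}

Under that reading, the span of Narwhal ports is 400 above the base, so any per-validator spacing of 401 or more (e.g. +500 or +1000) avoids collisions.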

Contributor

@akichidis akichidis left a comment

I am approving to make sure that you are not blocked, @asonnino, but please have a look at the comment about the port changes.

Comment on lines +103 to +107
let consensus_parameters = ConsensusParameters {
max_header_delay: 5_000,
max_batch_delay: 5_000,
..ConsensusParameters::default()
};
Contributor

Maybe not for this PR, but let's consider reading those from a file. Eventually we might want to get them from env variables injected into the Docker image, but let's discuss that.
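
For reference, a minimal sketch of the env-variable option (the variable names are hypothetical, and it assumes the ConsensusParameters struct from the snippet above is in scope):

// Hypothetical: read the two delays from the environment, falling back to
// the hard-coded 5_000 ms defaults used in this PR when a variable is
// unset or malformed.
fn consensus_parameters_from_env() -> ConsensusParameters {
    let read_ms = |key: &str, default: u64| {
        std::env::var(key)
            .ok()
            .and_then(|v| v.parse::<u64>().ok())
            .unwrap_or(default)
    };
    ConsensusParameters {
        max_header_delay: read_ms("NARWHAL_MAX_HEADER_DELAY_MS", 5_000),
        max_batch_delay: read_ms("NARWHAL_MAX_BATCH_DELAY_MS", 5_000),
        ..ConsensusParameters::default()
    }
}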

Contributor Author

There is an issue for this: #1553.

Comment on lines +179 to +191
+        primary_to_primary: format!("{}:{}", x.host, x.port + 100).parse().unwrap(),
+        worker_to_primary: format!("{}:{}", x.host, x.port + 200).parse().unwrap(),
     };
     let workers = [(
         /* worker_id */ 0,
         WorkerAddresses {
-            primary_to_worker: format!("{}:{}", x.host, x.port + 3)
+            primary_to_worker: format!("{}:{}", x.host, x.port + 300)
                 .parse()
                 .unwrap(),
             transactions: x.consensus_address,
-            worker_to_worker: format!("{}:{}", x.host, x.port + 4).parse().unwrap(),
+            worker_to_worker: format!("{}:{}", x.host, x.port + 400)
+                .parse()
+                .unwrap(),
Contributor

Given that there is a conflict, couldn't we just change the x.port parameter? E.g., if it runs on 12000, maybe make it run on 10200, so we avoid touching those here. I can't evaluate at the moment whether that would fix the issue, but I am thinking that we might get the same effect with minimal changes.

Contributor Author

I am not sure how the Docker files are written (perhaps they open a range of ports, so changing the base may be a problem). I just tried to fix some errors that people have been seeing in the simplest way. All of this needs to be properly fixed; I will bring up the discussion at an upcoming meeting.

Contributor

My understanding is that this fix is to allow running a cluster locally (multiple nodes on one machine). In Docker there should be no need to worry about port conflicts because each node has its own IP.

Contributor Author

Cool, so if I get it right, this (ugly) fix should work for everything we need for DevNet.

Contributor

@asonnino I'm still not clear on what the original conflict is. I don't think you would get conflicts from Docker itself; can you clarify how you ran into the original conflicts? In which environment, and what do you mean by running a cluster locally? If it is the standard Sui validator command, you just need +10 separation, not +100...

Contributor Author

Yes, a separation of +10 should work. Sadly, I don't remember exactly what the problem was (I have been extinguishing fires for the last week). I think there was a problem with tracing, plus one primary trying to boot on the same port as a Sui authority.

@asonnino asonnino merged commit 30a199d into main Apr 26, 2022
@asonnino asonnino deleted the fix-1529 branch April 26, 2022 08:48
@velvia
Contributor

velvia commented Apr 26, 2022

@asonnino and others: I'm about done with deploy changes that would let you/us easily test cluster network changes using Docker Compose; what is tested would be a preview of what is deployed. In the meantime, if we can keep the ports to a fixed range, that would be much appreciated. I'm changing the port spacing between nodes to be around 1000, so if we can keep the range of Narwhal ports to < +1000 that would be cool. :)

@asonnino
Contributor Author

Cool, then we don't load consensus ports from a file for DevNet, right?

@velvia
Contributor

velvia commented Apr 26, 2022

@asonnino sorry, I'm not sure I get your question... consensus ports are, according to this change, a fixed distance from the base port, right? The base port will be set from a file and by environment variable. For certain reasons I need to test them at different ports for each validator in DevNet.

@asonnino
Contributor Author

Yes, exactly. So basically we don't touch the consensus port config anymore (not before DevNet, at least)?
