[Rest Server] stop/start/genesis endpoints #454
Conversation
I am not familiar with the use case, but IMHO start/stop/genesis shouldn't be exposed by the rest service; start/stop/genesis is for provisioning and starting/stopping the network, so it should be a server-side thing?
sui/src/config.rs
Outdated
@@ -435,3 +436,24 @@ impl Config for NetworkConfig {
        &self.config_path
    }
}

pub struct PortAllocator {
Let's move this to utils.rs?
This utils.rs file is a code smell for me--I think modules should have a descriptive name and only contain functionality related to that name. I would be fine with (e.g.) network_utils, ports.rs, etc., but vanilla utils is a bit much.
I could move the Config trait into config.rs and rename utils.rs to network_utils.rs and move PortAllocator there?
There is already a network_utils directory, by the way; you could consider having a port_allocator.rs there?
Works for me, I will move it to the network_utils dir.
Agreed, I should have made note of this, but these endpoints are just for debugging. They will make it easier to restart the network when it's hosted on AWS, especially for anyone outside of our group that may be trialing it before or after GDC. Post-GDC we can discuss whether we want to decouple the rest server from the network and have the two be started separately, OR we could have the network genesis and start happen during startup of the rest server. The former makes more sense, as it would allow us to make changes to the rest server without having to bring down the network.
Yay dropshot!
Thanks for this PR. Let me know if the long comments help plz :)
sui/src/rest_server.rs
Outdated
for _ in 0..num_authorities {
    let (address, key_pair) = get_key_pair();
    let info = AuthorityPrivateInfo {
        address,
        key_pair,
        host: "127.0.0.1".to_string(),
        port: match port_allocator.next_port() {
            Some(port) => port,
            None => {
                return Err(HttpError::for_client_error(
                    None,
                    hyper::StatusCode::CONFLICT,
                    String::from(
                        "Could not create authority because there were no free ports",
                    ),
                ))
            }
        },
        db_path: PathBuf::from(format!("./authorities_db/{:?}", address)),
    };
    authority_info.push(AuthorityInfo {
        address,
        host: info.host.clone(),
        base_port: info.port,
    });
    authorities.insert(info.address, 1);
    network_config.authorities.push(info);
}
Could we modularize the setup bits that are already in sui::genesis() for this?
I agree, I should be calling into this code rather than copying the genesis code from sui_commands. I will do that refactoring in a separate PR if that is okay with you?
sui/src/rest_server.rs
Outdated
fn sync_client_state(client_state: &mut ClientState<AuthorityClient>) -> Option<HttpError> {
    match cb_thread::scope(|scope| {
        scope
            .spawn(|_| {
                // synchronize with authorities
                let rt = Runtime::new().unwrap();
                rt.block_on(async move { client_state.sync_client_state().await })
            })
            .join()
    }) {
OK, here's my theory about what happened:
1. You've set out to write a run-of-the-mill blocking function.
2. As you've noted, ClientState::sync_client_state takes an exclusive reference to client_state, and you've tried to reproduce this in your function, taking a &mut ClientState<AuthorityClient>. [1]
3. Then you've tried to spawn a thread to call the sync, but noticed it was requiring a 'static lifetime on the client_state, because it can't guarantee the reference it embarks will have a lifetime sufficient for the arbitrary lifetime of the thread. [2]
4. So you've looked for an abstraction that limits the lifetime required by the thread, and found crossbeam's scoped threads, which lets you do a short-lived borrow of the authority client: cool!
5. But then you got stuck on calling an async method from that thread, which makes you create an entirely new runtime.
If I'm correct about this, here are a few points:
- on 1 & 2: if you're going to run a long-lived process that potentially never stops, it would be OK to actually require a long lifetime from the caller by taking a 'static reference, or taking a copy of the input you require ...
- ... but asynchronous scoped threads aren't a thing, so your thread::spawn is blocking [3], and what you're looking at is a dedicated thread.
- ... and you're immediately joining the thread, thereby blocking your main thread on the back end of that dedicated thread. 😅
- as you can probably guess by now, you can dispense with the threading ceremony and just call this async function in your main thread. And you've already figured out how to do that!
fn sync_client_state(client_state: &mut ClientState<AuthorityClient>) -> Option<HttpError> {
    // synchronize with authorities
    let rt = Runtime::new().unwrap(); // TODO: remove or change to an expect
    let res = rt.block_on(async move { client_state.sync_client_state().await });
    res.err().map(|err| HttpError::for_client_error(
        None,
        hyper::StatusCode::FAILED_DEPENDENCY,
        format!("Sync error: {:?}", err),
    ))
}
- left as a TODO to the author: do you need a blocking function here? If yes, why? if not, can you make this function shorter?
Footnotes
[1] In general, a commendable instinct, but read on. ↩
[2] The point is you could return from our function, from which the spawned thread is borrowing some data, and the borrow checker is protecting you from a use after free here. ↩
[3] Congrats on not going for the one crate out there that claims to do this, but is actually unsound. ↩
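As a side note, a minimal sketch of what the fully-async variant could look like, assuming the calling endpoint handler is itself async (types and error mapping mirror the snippet above; this is not code from the PR):

// Hypothetical fully-async variant: no extra Runtime, no block_on.
// Assumes the caller is already running inside the server's tokio runtime.
async fn sync_client_state(
    client_state: &mut ClientState<AuthorityClient>,
) -> Result<(), HttpError> {
    client_state
        .sync_client_state()
        .await
        .map_err(|err| {
            HttpError::for_client_error(
                None,
                hyper::StatusCode::FAILED_DEPENDENCY,
                format!("Sync error: {:?}", err),
            )
        })
}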
I actually did see what you were describing in your theory above, and after being yelled at by the compiler sufficiently I was able to get it working (which may require some feedback in a future PR). However, the crossbeam scoped thread was to solve the following issue, which only happens at runtime:
thread 'tokio-runtime-worker' panicked at 'Cannot start a runtime from within a runtime. This happens because a function (like block_on) attempted to block the current thread while the thread is being used to drive asynchronous tasks.'
After hours of research online about starting a runtime within a runtime, this is the best I found. The only other option was NOT to start a runtime from within a runtime, which did not seem possible for our use case. But I am open to other approaches to solve this.
Let's forget about runtimes in runtimes and my theories: what are you trying to do?
Step 0:
You've set out to write a run-of-the-mill blocking function.
If this is correct, why?
do you need a blocking function here? If yes, why? if not, can you make this function shorter?
This is the same question as the above.
You were correct, I was doing too much. Thanks for the simplification/explanation.
sui/src/rest_server.rs
Outdated
server_context.server_lock.store(true, Ordering::SeqCst);
thread::spawn({
    move || {
        rt.block_on(join_all(handles));
    }
});
So here you are already in an asynchronous function, and you launch a thread by hand, pass it a manually-crafted runtime, and make it call an async function. [1] That async function happens to be a composite of the result of the futures you collected into handles, but that's the gist of it.
My point is: this is doing something very specific and extraordinary, what is the intent?
Footnotes
[1] We in general use multi-threaded Tokio runtimes that support scheduling on the OS's available threads for non-blocking tasks. ↩
The intent was to run the authorities in a background thread and ideally then be able to kill that thread (and in turn kill the authorities) when the /sui/stop endpoint is called. I have not been able to figure out how to execute that whole flow properly in Rust.
I was hoping to use server_lock: Arc<AtomicBool> to accomplish this, but I cannot poll that resource for changes within the thread I spawned because rt.block_on(join_all(handles)) is blocking that thread. When I started considering starting another thread to wrap this thread and then poll there, I decided I was crazy, so I punted this until I could get some assistance 😅
Getting help when it feels like you're going crazy is the right move.
First, let's simplify the problem by setting aside cancellation for a minute.
You may want to look at tokio tasks (as opposed to bare async futures), and see if (considering my footnote above) they wouldn't correspond to your desires better than an explicit OS thread.
If they do, then let's look at cancellation:
- the task's JoinHandle comes with an abort
- tasks are also futures (launched differently), so you can use Abortable
- tokio has a page on Graceful shutdown
The core idea is the same: use a message-passing channel to transmit your cancel signal, and make a wrapping task embark both the receiving end of the channel and the core task. The wrapping task can listen to both sources of signal (your task returning or the cancel signal) and act accordingly. Here's the doc on select, which includes a specific mention of the cancellation pattern.
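To make that concrete, here is a minimal sketch of the select-based cancellation pattern (hypothetical names; not code from this PR):

use tokio::sync::oneshot;

// Run a long-lived future until either it completes on its own or a cancel
// signal arrives on a oneshot channel; the unfinished future is dropped,
// which cancels it.
async fn run_until_cancelled<F>(work: F, cancel_rx: oneshot::Receiver<()>)
where
    F: std::future::Future<Output = ()>,
{
    tokio::select! {
        _ = work => {
            // the wrapped task finished by itself
        }
        _ = cancel_rx => {
            // cancel requested (or the sender was dropped): stop here
        }
    }
}

A /sui/start handler could stash the corresponding oneshot::Sender in the server context, and the /sui/stop handler would send on it (or simply drop it) to trigger the shutdown.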
Was able to swap out the previous implementation with tokio tasks and now I can properly kill the authorities and restart the server from within the endpoints. Thanks for the tips
sui/src/utils.rs
Outdated
pub fn next_port(&mut self) -> Option<u16> {
    for port in self.next_port..65535 {
        if TcpListener::bind(("127.0.0.1", port)).is_ok() {
            self.next_port = port + 1;
            return Some(port);
        }
    }
    None
}
So this is only valuable if you want to specifically reach for deterministic port allocation and are kinda-sorta hoping you'll find a clean-ish port occupation picture at the moment of call. [1] You should probably take a usize argument to let the caller tell you at which port number they'd like to start their port search.
But in general, the Linux API (and we run authorities on Linux) tells you that if you try to bind on port 0, you'll get a dynamic port allocation. This might be simpler?
Footnotes
[1] This might work in a container. ↩
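For reference, a minimal sketch of the bind-to-port-0 alternative mentioned above (standard library only; not code from this PR):

use std::net::TcpListener;

// Binding to port 0 lets the OS pick any free port; the chosen port can be
// read back from the listener. Note the port is only reserved while the
// listener is alive, so re-binding it later can still race with other processes.
fn os_assigned_port() -> std::io::Result<u16> {
    let listener = TcpListener::bind(("127.0.0.1", 0))?;
    Ok(listener.local_addr()?.port())
}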
I will make a note of this and include it in the refactoring that goes along with modularizing genesis, to reduce the scope of this PR.
Just a quick comment - for debug routes, wouldn't it be better to put them in a special debug namespace?
@arun-koshy what's the status of this?
Met with Francois on Friday to go over his comments on the proper ways to do the async portion of this in Rust. Have the pseudocode for it, working on implementing it today. Should have it ready for review soon.
genesis_config_path: String::from("genesis.conf"),
wallet_config_path: String::from("wallet.conf"),
network_config_path: String::from("./network.conf"),
authority_db_path: String::from("./authorities_db"),
This is brittle and somewhat panic-inducing: IIUC, if something goes wrong we blow up on a started server that the user has hit an endpoint of, not on starting the server itself. I would at least open an issue for the following:
- Instances of NetworkConfig, GenesisConfig, etc. should be created in the ServerContext constructor.
- That constructor should hit all panics due to invalid or non-existent paths, as well as basic validations (e.g. empty quorum).
- Ideally, those should re-use the config creation logic from the client (which should be made reusable).
Moreover, all of this setup is re-done from scratch on every endpoint hit. Is there a reason why this should happen?
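A rough sketch of that direction, with placeholder types standing in for the real config structs (none of this is the PR's actual API):

use std::path::{Path, PathBuf};

// Placeholder standing in for GenesisConfig / NetworkConfig / WalletConfig.
struct ParsedConfig {
    path: PathBuf,
}

impl ParsedConfig {
    fn read(path: &Path) -> std::io::Result<Self> {
        // A real implementation would parse and validate the file; here we only
        // check it exists, so that bad paths fail at construction time.
        std::fs::metadata(path)?;
        Ok(ParsedConfig { path: path.to_path_buf() })
    }
}

// Hypothetical ServerContext holding parsed configs rather than path strings:
// path and validation errors surface once, in new(), instead of in each endpoint.
struct ServerContext {
    genesis_config: ParsedConfig,
    network_config: ParsedConfig,
    wallet_config: ParsedConfig,
}

impl ServerContext {
    fn new(dir: &Path) -> std::io::Result<Self> {
        Ok(ServerContext {
            genesis_config: ParsedConfig::read(&dir.join("genesis.conf"))?,
            network_config: ParsedConfig::read(&dir.join("network.conf"))?,
            wallet_config: ParsedConfig::read(&dir.join("wallet.conf"))?,
        })
    }
}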
We are only exposing these endpoints because we plan on hosting instances of the rest server for game developers, which means that if they want to reset the SUI network the rest server is using, they need to be able to do so using the endpoints. In the long term none of this will matter, because we shouldn't be starting the Sui network from within the rest server anyway. I can open an issue discussing the future design of the rest server post-GDC.
Moreover, all of this setup is re-done from scratch on every endpoint hit. Is there a reason why this should happen?
Because the server context is instantiated once here, all other changes to the server context happen via the endpoints. There are checks in the endpoints to see if the network is already running or the configs are already created.
if they wanted to reset the SUI network the rest server is using, they need to be able to do so using the endpoints
What's wrong with having, as we seem to, one endpoint that starts the network and calls the constructor of the server's state, and one that tears it down? It would be a contract with our early test developers that nothing interesting is going to happen if they don't hit that particular bootstrap first, and that they'll carry over the network's state if they don't tear it down before a restart.
Then we can have real state, with configuration that contains actual data rather than recreating everything on the fly.
My issue isn't with the re-startability; it's with the multiplication of panic-prone config-read and scaffolding code in each endpoint handler.
for authority in &network_config.authorities {
    let server = sui_commands::make_server(
        authority,
        &committee,
        vec![],
        &[],
        network_config.buffer_size,
    )
    .await
    .map_err(|error| {
        custom_http_error(
            StatusCode::CONFLICT,
            format!("Unable to make server: {error}"),
        )
    })?;
    handles.push(async move {
        match server.spawn().await {
            Ok(server) => Ok(server),
            Err(err) => {
                return Err(custom_http_error(
                    StatusCode::FAILED_DEPENDENCY,
                    format!("Failed to start server: {}", err),
                ));
            }
        }
    })
}

let num_authorities = handles.len();
info!("Started {} authorities", num_authorities);

while let Some(spawned_server) = handles.next().await {
    server_context
        .authority_handles
        .lock()
        .unwrap()
        .push(task::spawn(async {
            if let Err(err) = spawned_server.unwrap().join().await {
                error!("Server ended with an error: {}", err);
            }
        }));
}
it seems we're maintaining a copy of this in sui_commands, called start_network
I am handling the async portion of this differently than start_network does. I am maintaining a hold of the JoinHandle so that I can abort it later during start. I don't think reusing sui_commands makes sense here, but correct me if I am wrong.
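For context, a minimal sketch of that hold-and-abort pattern (hypothetical names; the actual handler wiring in the PR differs):

use std::sync::{Arc, Mutex};
use tokio::task::JoinHandle;

// Shared between handlers: /sui/start pushes authority task handles here,
// /sui/stop drains and aborts them.
type AuthorityHandles = Arc<Mutex<Vec<JoinHandle<()>>>>;

async fn stop_authorities(handles: AuthorityHandles) {
    // Take the handles out under the lock, then await outside of it.
    let drained: Vec<JoinHandle<()>> = handles.lock().unwrap().drain(..).collect();
    for handle in drained {
        handle.abort();
        // Awaiting observes either normal completion or the cancellation.
        let _ = handle.await;
    }
}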
Fair enough. Additionally, as we've seen, the pattern of what's going on in start_network and genesis in sui_commands doesn't make sense: the task waits on each launched server to finish (rather than wait for each server to start) here:
https://github.com/MystenLabs/fastnft/blob/dba291bfac0d9a9811e9a296f3f4712744ef2df6/sui/src/sui_commands.rs#L86-L88
As we've seen, that's not really usable in a task, which should keep a handle rather than block on completion.
Would you and @patrickkuo open an issue to track the refactoring of start_network in a way that:
- has clear semantics on starting the server,
- makes this usable in a tokio task (by returning the handle of the SpawnedServer rather than blocking on it),
- de-duplicates the code between the present function and sui_commands?
Some comments on error handling
@huitseeker any more thoughts on this PR?
Code-wise it looks OK to me, although I still think it's strange to have an endpoint to start/stop the network; maybe down the line we should give partners access to the binaries so they can start their own network locally, to avoid having this service...
I will let @huitseeker make the final call.
Accepting to unblock. I left comments that I believe will help in controlling the amount of technical debt we incur from this area:
- making sure the endpoint API looks like what it will in the future, when it does not start the network. That is:
  - Building a ServerContext that contains configurations, rather than strings,
  - Setting up those configurations in one place (e.g. a constructor) to reflect our temporary need for start/stop endpoints;
- making sure we do not maintain two parallel sets of code for network operations, one for the CLI, one for the REST server.
  - I expect this would involve refactoring what's in sui_commands to make better async sense.
It would be great to see issues tracking this.
Thank you for your great feedback on the PR @patrickkuo & @huitseeker. I have opened two follow-up issues that have stemmed from this PR: Issue#552 & Issue#553.
This is the first of a series of PRs that will implement the client service API described as part of the GDC Eng Deliverables.
The endpoints that are included in this PR are:
curl --location --request POST 'localhost:5000/debug/sui/stop'
curl --location --request POST 'localhost:5000/debug/sui/start'
curl --location --request POST 'localhost:5000/debug/sui/genesis' \
  --data-raw '{ "num_authorities": 4, "num_objects": 5 }'
Known Issues (for future PRs)