
[Rest Server] stop/start/genesis endpoints #454

Merged — 9 commits from client-service-api into main on Feb 24, 2022

Conversation

Contributor

@arun-koshy arun-koshy commented Feb 15, 2022

This is the first of a series of PRs that will implement the client service API described as part of the GDC Eng Deliverables.

The endpoints that are included in this PR are:

  • Stop: Stops authorities, deletes configurations & databases
    curl --location --request POST 'localhost:5000/debug/sui/stop'
  • Start: Starts servers with specified genesis configuration
    curl --location --request POST 'localhost:5000/debug/sui/start'
  • Genesis: Creates the genesis configuration. Run this before you start the server.
    curl --location --request POST 'localhost:5000/debug/sui/genesis' \
      --data-raw '{ "num_authorities": 4, "num_objects": 5 }'

Known Issues (for future PRs)

  • Need to refactor out sui_commands::genesis code and share it instead of copying it all into the rest server.
  • Look into using dynamic ports for port allocator

Contributor

@patrickkuo patrickkuo left a comment

I am not familiar with the use case, but IMHO start/stop/genesis shouldn't be exposed by the rest services; start/stop/genesis is for provisioning and starting/stopping the network, so it should be a server-side thing?

@@ -435,3 +436,24 @@ impl Config for NetworkConfig {
&self.config_path
}
}

pub struct PortAllocator {
Contributor

Let's move this to utils.rs?

Collaborator

@sblackshear sblackshear Feb 15, 2022

This utils.rs file is a code smell for me--I think modules should have a descriptive name and only contain functionality related to that name. I would be fine with (e.g.) network_utils, ports.rs, etc., but a vanilla utils is a bit much.

Contributor Author

I could move the Config trait into config.rs and rename utils.rs to network_utils.rs and move PortAllocator there?

Contributor

There is already a network_utils directory, by the way; you could consider having a port_allocator.rs there?

Contributor Author

Works for me, I will move it to the network_utils dir.

@arun-koshy
Contributor Author

I am not familiar with the use case, but IMHO start/stop/genesis shouldn't be exposed by the rest services; start/stop/genesis is for provisioning and starting/stopping the network, so it should be a server-side thing?

Agreed, I should have made note of this, but these endpoints are just for debugging. They will make it easier to restart the network when it's hosted on AWS, especially for anyone outside of our group who may be trialing it before or after GDC.

Post-GDC we can discuss whether we want to decouple the rest server from the network and have the two started separately, OR have the network genesis and start happen during startup of the rest server. The former makes more sense, since it would allow us to make changes to the rest server without having to bring down the network.

Contributor

@huitseeker huitseeker left a comment

Yay dropshot!

Thanks for this PR. Let me know if the long comments help plz :)

Comment on lines 156 to 183
for _ in 0..num_authorities {
    let (address, key_pair) = get_key_pair();
    let info = AuthorityPrivateInfo {
        address,
        key_pair,
        host: "127.0.0.1".to_string(),
        port: match port_allocator.next_port() {
            Some(port) => port,
            None => {
                return Err(HttpError::for_client_error(
                    None,
                    hyper::StatusCode::CONFLICT,
                    String::from(
                        "Could not create authority because there were no free ports",
                    ),
                ))
            }
        },
        db_path: PathBuf::from(format!("./authorities_db/{:?}", address)),
    };
    authority_info.push(AuthorityInfo {
        address,
        host: info.host.clone(),
        base_port: info.port,
    });
    authorities.insert(info.address, 1);
    network_config.authorities.push(info);
}
Contributor

Could we modularize the setup bits that are already in sui::genesis() for this?

Contributor Author

I agree, I should be calling into this code rather than copying the genesis code from sui_commands. I will do that refactoring in a separate PR if that is okay with you?

Comment on lines 406 to 415
fn sync_client_state(client_state: &mut ClientState<AuthorityClient>) -> Option<HttpError> {
    match cb_thread::scope(|scope| {
        scope
            .spawn(|_| {
                // synchronize with authorities
                let rt = Runtime::new().unwrap();
                rt.block_on(async move { client_state.sync_client_state().await })
            })
            .join()
    }) {
Contributor

@huitseeker huitseeker Feb 15, 2022

OK, here's my theory about what happened:

  1. You've set out to write a run-of-the-mill blocking function.
  2. As you've noted, ClientState::sync_client_state takes an exclusive reference to client_state, and you've tried to reproduce this in your function, taking a &mut ClientState<AuthorityClient>. [1]
  3. Then you've tried to spawn a thread to call the sync, but noticed it was requiring a 'static lifetime on the client_state, because it can't guarantee the reference it embarks will have a lifetime sufficient for the arbitrary lifetime of the thread. [2]
  4. So you've looked for an abstraction that limits the lifetime required by the thread, and found crossbeam's scoped threads, which lets you do a short-lived borrow of the authority client: cool!
  5. But then you got stuck on calling an async method from that thread, which makes you create an entirely new runtime.

If I'm correct about this, here are a few points:

  • on 1 & 2: if you're going to run a long-lived process that potentially never stops, it would be OK to actually require a long lifetime from the caller by taking a 'static reference, or taking a copy of the input you require ...
  • ... but asynchronous scoped threads aren't a thing, so your thread::spawn is blocking [3], and what you're looking at is a dedicated thread.
  • ... and you're immediately joining the thread, thereby blocking your main thread on the back end of that dedicated thread. 😅
  • as you can probably guess by now, you can dispense with the threading ceremony and just call this async function in your main thread. And you've already figured out how to do that!
fn sync_client_state(client_state: &mut ClientState<AuthorityClient>) -> Option<HttpError> {
    // synchronize with authorities
    let rt = Runtime::new().unwrap(); // TODO: remove or change to an expect
    let res = rt.block_on(async move { client_state.sync_client_state().await });
    res.err().map(|err| HttpError::for_client_error(
        None,
        hyper::StatusCode::FAILED_DEPENDENCY,
        format!("Sync error: {:?}", err),
    ))
}
  • left as a TODO to the author: do you need a blocking function here? If yes, why? If not, can you make this function shorter?

Footnotes

  1. In general, a commendable instinct, but read on.

  2. The point is you could return from our function, from which the spawned thread is borrowing some data, and the borrow checker is protecting you from a use after free here.

  3. Congrats on not going for the one crate out there that claims to do this, but is actually unsound.

Contributor Author

@arun-koshy arun-koshy Feb 16, 2022

I actually did see what you were describing in your theory above, and after being yelled at by the compiler sufficiently I was able to get it working (which may require some feedback in a future PR). However, the crossbeam scoped thread was to solve the following issue, which only happens at runtime:

thread 'tokio-runtime-worker' panicked at 'Cannot start a runtime from within a runtime. This happens because a function (like block_on) attempted to block the current thread while the thread is being used to drive asynchronous tasks.'

After hours of research online about starting a runtime within a runtime, this is the best I found. The only other option was NOT to start a runtime from within a runtime, which did not seem possible for our use case. But I am open to other approaches to solve this.
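
For reference, a minimal sketch (assuming a multi-threaded tokio runtime, with a generic future standing in for the sync call) of the two usual ways around this panic: await the future directly from the async context, or reuse the runtime you are already inside via block_in_place instead of constructing a nested one.

use tokio::runtime::Handle;

// (1) Preferred: the handler is already async, so just await the future --
//     no new Runtime, no block_on.
async fn sync_in_async(client_fut: impl std::future::Future<Output = ()>) {
    client_fut.await;
}

// (2) If a synchronous call path really must block, reuse the existing
//     runtime instead of building a nested one. block_in_place panics on the
//     current-thread runtime, hence the multi-threaded assumption.
fn sync_blocking(client_fut: impl std::future::Future<Output = ()>) {
    tokio::task::block_in_place(|| Handle::current().block_on(client_fut));
}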

Contributor

@huitseeker huitseeker Feb 16, 2022

Let's forget about runtimes in runtimes and my theories: what are you trying to do?
Step 0:

You've set out to write a run-of-the-mill blocking function.

If this is correct, why?

do you need a blocking function here? If yes, why? if not, can you make this function shorter?

This is the same question as the above.

Contributor Author

You were correct, I was doing too much. Thanks for the simplification/explanation.

Comment on lines 323 to 328
server_context.server_lock.store(true, Ordering::SeqCst);
thread::spawn({
    move || {
        rt.block_on(join_all(handles));
    }
});
Contributor

So here you are already in an asynchronous function, and you launch a thread by hand, pass it a manually-crafted runtime, and make it call an async function. [1] That async function happens to be a composite of the result of the futures you collected into handles, but that's the gist of it.

My point is: this is doing something very specific and extraordinary, what is the intent?

Footnotes

  1. We in general use multi-threaded Tokio runtimes that support scheduling on the OS's available threads for non-blocking tasks.
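
To illustrate the footnote, a sketch (not the PR's code; server_futures is a hypothetical stand-in for the futures collected into handles above): inside an async handler, the runtime that is already running can schedule the joined futures as a task, with no hand-made thread and no second Runtime.

use futures::future::join_all;

// Hand the collected futures to the runtime we are already on; must be
// called from within a tokio runtime. The returned JoinHandle can be stored,
// awaited later, or aborted.
fn spawn_servers(
    server_futures: Vec<impl std::future::Future<Output = ()> + Send + 'static>,
) -> tokio::task::JoinHandle<()> {
    tokio::spawn(async move {
        join_all(server_futures).await;
    })
}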

Contributor Author

The intent was to run the authorities in a background thread and ideally then be able to kill that thread (and in turn kill the authorities) when the /sui/stop endpoint is called. I have not been able to figure out how to execute that whole flow properly in Rust.

I was hoping to use server_lock: Arc<AtomicBool> to accomplish this, but I cannot poll that resource for changes within the thread I spawned, because rt.block_on(join_all(handles)) blocks that thread. When I started considering starting another thread to wrap this thread and then poll there, I decided I was crazy, so I punted this until I could get some assistance 😅

Contributor

@huitseeker huitseeker Feb 16, 2022

Getting help when it feels like you're going crazy is the right move.

First, let's simplify the problem by setting aside cancellation for a minute.
You may want to look at tokio tasks (as opposed to bare async futures), and see if (considering my footnote above) they wouldn't correspond to your desires better than an explicit OS thread.

If they do, then let's look at cancellation:

The core idea is the same: use a message-passing channel to transmit your cancel signal, and make a wrapping task embark both the receiving end of the channel and the core task. The wrapping task can listen to both sources of signal (your task returning or the cancel signal) and act accordingly. Here's the doc on select, which includes a specific mention of the cancellation pattern.
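
A minimal sketch of that cancellation pattern (hypothetical names; work stands in for the composite authority future): a wrapping task selects over the core work and the receiving end of a oneshot channel, and the caller keeps the sender as the stop switch.

use tokio::sync::oneshot;

fn spawn_cancellable(
    work: impl std::future::Future<Output = ()> + Send + 'static,
) -> oneshot::Sender<()> {
    let (cancel_tx, cancel_rx) = oneshot::channel::<()>();
    tokio::spawn(async move {
        tokio::select! {
            _ = work => { /* the work finished on its own */ }
            _ = cancel_rx => { /* cancel signal received; the work future is dropped */ }
        }
    });
    // A /stop handler would call `cancel_tx.send(())` (or simply drop the sender).
    cancel_tx
}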

Contributor Author

Was able to swap out the previous implementation with tokio tasks, and now I can properly kill the authorities and restart the server from within the endpoints. Thanks for the tips.
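
As a rough sketch of what that can look like (hypothetical type; the actual PR stores its handles in the server context): keep the JoinHandles returned by tokio::spawn, and have the stop endpoint abort them.

use tokio::task::JoinHandle;

struct AuthorityHandles {
    handles: Vec<JoinHandle<()>>,
}

impl AuthorityHandles {
    // Called from the /stop path: cancel every authority task at its next
    // await point and drop the handles.
    fn stop_all(&mut self) {
        for handle in self.handles.drain(..) {
            handle.abort();
        }
    }
}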

sui/src/utils.rs Outdated
Comment on lines 59 to 67
pub fn next_port(&mut self) -> Option<u16> {
    for port in self.next_port..65535 {
        if TcpListener::bind(("127.0.0.1", port)).is_ok() {
            self.next_port = port + 1;
            return Some(port);
        }
    }
    None
}
Contributor

So this is only valuable if you want to specifically reach for deterministic port allocation and are kinda-sorta hoping you'll find a clean-ish port occupation picture at the moment of call. [1] You should probably take a usize argument to let the caller tell you at which port number they'd like to start their port search.

But in general, the Linux API (and we run authorities on Linux) tells you that if you try to bind on port 0, you'll get a dynamic port allocation. This might be simpler?

Footnotes

  1. This might work in a container.
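
A sketch of the port-0 alternative, standard library only (note the OS-assigned port is only reserved while the listener is held open):

use std::net::TcpListener;

fn os_assigned_port() -> std::io::Result<u16> {
    // Binding to port 0 asks the OS to pick any free port; read it back
    // from the listener's local address.
    let listener = TcpListener::bind(("127.0.0.1", 0))?;
    Ok(listener.local_addr()?.port())
}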

Contributor Author

I will make a note of this and include it in the refactoring that goes along with modularizing genesis, to reduce the scope of this PR.

@velvia
Contributor

velvia commented Feb 19, 2022

Just a quick comment - for debug routes, wouldn't it be better to put them in a special debug namespace?

@sblackshear
Collaborator

@arun-koshy what's the status of this?

@arun-koshy
Contributor Author

@arun-koshy what's the status of this?

Met with Francois on Friday to go over his comments on the proper ways to do the async portion of this in Rust. Have the pseudocode for it, working on implementing it today. Should have it ready for review soon.

@oxade oxade mentioned this pull request Feb 22, 2022
Comment on lines +81 to +84
genesis_config_path: String::from("genesis.conf"),
wallet_config_path: String::from("wallet.conf"),
network_config_path: String::from("./network.conf"),
authority_db_path: String::from("./authorities_db"),
Contributor

This is brittle and somewhat panic-inducing: IIUC, if something goes wrong we blow up on a started server that the user has hit an endpoint of, not on starting the server itself. I would at least open an issue for the following:

  1. Instances of NetworkConfig, GenesisConfig, etc. should be created in the ServerContext constructor.
  2. That constructor should be where we hit all panics due to invalid or non-existent paths, and where basic validations happen (e.g. empty quorum).
  3. Ideally, those should re-use the config creation logic from the client (which should be made reusable).

Moreover, all of this setup is re-done from scratch on every endpoint hit. Is there a reason why this should happen?
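
A hedged sketch of the shape being suggested, with placeholder types standing in for the crate's GenesisConfig and NetworkConfig: parse and validate configuration once in a constructor, so a bad path fails when the server starts rather than inside whichever handler a user happens to hit.

use std::path::{Path, PathBuf};

// Placeholder config types for illustration only.
struct GenesisConfig { path: PathBuf }
struct NetworkConfig { path: PathBuf }

struct ServerContext {
    genesis_config: GenesisConfig,
    network_config: NetworkConfig,
}

impl ServerContext {
    fn new(genesis: &Path, network: &Path) -> Result<Self, String> {
        // Fail fast at construction time instead of panicking per request.
        for path in [genesis, network] {
            if !path.exists() {
                return Err(format!("missing config file: {}", path.display()));
            }
        }
        Ok(Self {
            genesis_config: GenesisConfig { path: genesis.to_path_buf() },
            network_config: NetworkConfig { path: network.to_path_buf() },
        })
    }
}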

Contributor Author

We are only exposing these endpoints because we plan on hosting instances of the rest server for game developers, which means that if they want to reset the Sui network the rest server is using, they need to be able to do so using the endpoints. In the long term none of this will matter, because we shouldn't be starting the Sui network from within the rest server anyway. I can open an issue discussing the future design of the rest server post-GDC.

Moreover, all of this setup is re-done from scratch on every endpoint hit. Is there a reason why this should happen?

Because the server context is instantiated once here, all other changes to the server context happen via the endpoints. There are checks in the endpoints to see if the network is already running or the configs are already created.

Contributor

if they wanted to reset the SUI network the rest server is using, they need to be able to do so using the endpoints

What's wrong with having, as we seem to, one endpoint that starts the network and calls the constructor of the server's state, and one that tears it down? It would be a contract with our early test developers that nothing interesting is going to happen if they don't hit that particular bootstrap first, and that they'll carry over the network's state if they don't tear it down before a restart.

Then we can have real state, with configuration that contains actual data rather than recreating everything on the fly.

My issue isn't with the re-startability; it's with the multiplication of panic-prone config-reading and scaffolding code in each endpoint handler.

Comment on lines +212 to +253
for authority in &network_config.authorities {
    let server = sui_commands::make_server(
        authority,
        &committee,
        vec![],
        &[],
        network_config.buffer_size,
    )
    .await
    .map_err(|error| {
        custom_http_error(
            StatusCode::CONFLICT,
            format!("Unable to make server: {error}"),
        )
    })?;
    handles.push(async move {
        match server.spawn().await {
            Ok(server) => Ok(server),
            Err(err) => {
                return Err(custom_http_error(
                    StatusCode::FAILED_DEPENDENCY,
                    format!("Failed to start server: {}", err),
                ));
            }
        }
    })
}

let num_authorities = handles.len();
info!("Started {} authorities", num_authorities);

while let Some(spawned_server) = handles.next().await {
    server_context
        .authority_handles
        .lock()
        .unwrap()
        .push(task::spawn(async {
            if let Err(err) = spawned_server.unwrap().join().await {
                error!("Server ended with an error: {}", err);
            }
        }));
}
Contributor

It seems we're maintaining a copy of this in sui_commands, called start_network.

Contributor Author

@arun-koshy arun-koshy Feb 23, 2022

I am handling the async portion of this differently than start_network does. I am holding on to the JoinHandle so that I can abort it later during start. I don't think reusing sui_commands makes sense here, but correct me if I am wrong.

Contributor

Fair enough. Additionally, as we've seen, the pattern of what's going on in start_network and genesis in sui_commands doesn't make sense: the task waits on each launched server to finish (rather than waiting for each server to start) here:
https://github.com/MystenLabs/fastnft/blob/dba291bfac0d9a9811e9a296f3f4712744ef2df6/sui/src/sui_commands.rs#L86-L88
As we've seen, that's not really usable in a task, which should keep a handle rather than block on completion.

Would you and @patrickkuo open an issue to track the refactoring of start_network in a way that:

  • has clear semantics on starting the server,
  • makes this usable in a tokio task (by returning the handle of the SpawnedServer rather than blocking on it; see the sketch below),
  • de-duplicates the code between the present function and sui_commands?
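
A sketch of the shape the second bullet asks for (hypothetical signature; server_futures stands in for the spawned-server futures): start_network would hand back task handles instead of awaiting them, so a caller such as the REST server can store or abort them.

use tokio::task::JoinHandle;

// Must be called from within a tokio runtime; each server future becomes a
// task whose handle the caller keeps.
fn start_network(
    server_futures: Vec<impl std::future::Future<Output = ()> + Send + 'static>,
) -> Vec<JoinHandle<()>> {
    server_futures.into_iter().map(tokio::spawn).collect()
}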

Contributor

@patrickkuo patrickkuo left a comment

Some comments on error handling

@oxade
Contributor

oxade commented Feb 23, 2022

@huitseeker any more thoughts on this PR?

Contributor

@patrickkuo patrickkuo left a comment

Code-wise it looks ok to me, although I still think it's strange to have an endpoint to start/stop the network. Maybe down the line we should give partners access to the binaries so they can start their own network locally, to avoid having this service...

I will let @huitseeker make the final call.

Contributor

@huitseeker huitseeker left a comment

Accepting to unblock. I left comments that I believe will help in controlling the amount of technical debt we incur from this area:

  • making sure the endpoint API looks like what it will in the future, when it does not start the network. That is:

    • Building a ServerContext that contains configurations, rather than strings,
    • Setting up those configurations in one place (e.g. a constructor) to reflect our temporary need for start/stop endpoints
  • making sure we do not maintain two parallel sets of code for network operations, one for the CLI, one for the REST server.

    • I expect this would involve refactoring what's in sui_commands to make better async sense.

    It would be great to see issues tracking this.

@arun-koshy
Contributor Author

arun-koshy commented Feb 24, 2022

Thank you for your great feedback on the PR, @patrickkuo & @huitseeker. I have opened two follow-up issues that have stemmed from this PR: issues #552 and #553.

@arun-koshy arun-koshy merged commit f13ed53 into main Feb 24, 2022
@arun-koshy arun-koshy deleted the client-service-api branch February 24, 2022 18:54
mwtian pushed a commit that referenced this pull request Sep 12, 2022
* fix: turn off logging stderr+out to file

Removing the log file so we can just log to the container
stdout and stderr to be picked up by Loki or other means.

* fix: fixed depends_on to the actual loki service name

* chore: updated README.md about logs

Specifically we changed the container entry.sh script to not
redirect the logs to files.  This enables log scraping directly
from the containers into log collection services like Loki.
mwtian pushed a commit to mwtian/sui that referenced this pull request Sep 29, 2022
* fix: turn off logging stderr+out to file

Removing the log file so we can just log to the container
stdout and stderr to be picked up by Loki or other means.

* fix: fixed depends_on to the actual loki service name

* chore: updated README.md about logs

Specifically we changed the container entry.sh script to not
redirect the logs to files.  This enables log scraping directly
from the containers into log collection services like Loki.