Background
The lead of the storage group needs an automated way to learn which storage providers are having trouble. Right now, there are tools that allow the lead to perform certain live benchmarks, I believe primarily on downloading content. However, this is an active step that has to be initiated manually, and therefore probably happens relatively rarely. The most acute symptom today is that uploads at times fail frequently, which is a very bad experience for creators, who are by far the most important next audience segment for Joystream. Right now, this is being solved by having application operators (e.g. @kdembler) manually reach out to the lead, which does not scale. Preemptive detection of such failures through active interrogation would largely solve this problem.
To be clear, the purpose of this tooling is not to detect active adversarial providers, but rather faults that providers themselves are unaware of due to misconfiguration, resource exhaustion or other unintentional factors.
Proposal
The proposal is an online service that continuously interrogates storage providers and reports the results to a third-party data warehouse through some API, and, importantly, also notifies the lead operating the infrastructure about failures. It's not clear whether this notification automation should be outsourced to the warehouse or implemented directly in the service, but it has to be part of the overall package in some way.
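As a rough illustration of the two outputs described above, here is a minimal Python sketch, assuming the service hands each check result to two injected callables. The names `build_record`, `dispatch`, `send_to_warehouse` and `notify_lead` are hypothetical, as is the record layout; the actual warehouse API and notification channel are left open by this issue.

```python
import json
from datetime import datetime, timezone
from typing import Callable, Optional


def build_record(provider: str, failure: Optional[str],
                 now: Optional[datetime] = None) -> dict:
    """Build one check result. `failure` is None when the check passed."""
    if now is None:
        now = datetime.now(timezone.utc)
    return {
        "provider": provider,
        "ok": failure is None,
        "failure": failure,
        "checked_at": now.isoformat(),
    }


def dispatch(record: dict,
             send_to_warehouse: Callable[[str], None],
             notify_lead: Callable[[str], None]) -> None:
    """Always report to the warehouse; notify the lead only on failure."""
    # Hypothetical sink: whatever API the data warehouse ends up exposing.
    send_to_warehouse(json.dumps(record))
    if not record["ok"]:
        # Hypothetical sink: e.g. a chat webhook or email, not decided here.
        notify_lead(
            f"Storage provider {record['provider']} failed upload check: "
            f"{record['failure']}"
        )
```

Keeping both sinks as plain callables leaves open whether alerting lives in the service or in the warehouse: in the latter case `notify_lead` could simply be a no-op.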
The scope of this tool can grow over time, but the most important part of the initial scope is to check whether uploading assets works. Failures could be any of the following:
- inability to resolve host
- inability to connect to host
- inability to initiate upload
- upload progresses too slowly
- upload is prematurely terminated by the host
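The failure classes listed above could be distinguished roughly as in the following Python sketch. It is not tied to the real storage node API: `start_upload` is a hypothetical callable supplied by the caller that raises if the upload cannot be initiated and otherwise yields the cumulative number of bytes sent. The names `probe_upload`, `Failure` and `ProbeResult`, the throughput threshold and the timeout are likewise assumptions for illustration.

```python
import enum
import socket
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


class Failure(enum.Enum):
    RESOLVE = "inability to resolve host"
    CONNECT = "inability to connect to host"
    INITIATE = "inability to initiate upload"
    TOO_SLOW = "upload progresses too slowly"
    TERMINATED = "upload is prematurely terminated by the host"


@dataclass
class ProbeResult:
    host: str
    failure: Optional[Failure]  # None means the probe succeeded
    detail: str = ""


def probe_upload(host: str, port: int,
                 start_upload: Callable[[str, int], Iterable[int]],
                 payload_size: int, min_bytes_per_sec: float,
                 timeout: float = 5.0,
                 resolve=socket.getaddrinfo,
                 connect=socket.create_connection,
                 clock=time.monotonic) -> ProbeResult:
    # Step 1: DNS resolution.
    try:
        resolve(host, port)
    except socket.gaierror as exc:
        return ProbeResult(host, Failure.RESOLVE, str(exc))

    # Step 2: plain TCP connection.
    try:
        connect((host, port), timeout=timeout).close()
    except OSError as exc:
        return ProbeResult(host, Failure.CONNECT, str(exc))

    # Step 3: start the upload (hypothetical callable, must raise on failure).
    try:
        progress = iter(start_upload(host, port))
    except Exception as exc:
        return ProbeResult(host, Failure.INITIATE, str(exc))

    # Step 4: watch progress; each item is the cumulative bytes sent so far.
    started = clock()
    sent = 0
    try:
        for sent in progress:
            elapsed = clock() - started
            if elapsed > 0 and sent / elapsed < min_bytes_per_sec:
                return ProbeResult(host, Failure.TOO_SLOW,
                                   f"{sent} bytes in {elapsed:.1f}s")
    except OSError as exc:
        # Connection dropped mid-upload, e.g. reset by the host.
        return ProbeResult(host, Failure.TERMINATED, str(exc))

    if sent < payload_size:
        return ProbeResult(host, Failure.TERMINATED,
                           f"only {sent} of {payload_size} bytes sent")
    return ProbeResult(host, None)
```

The `resolve`, `connect` and `clock` parameters default to the standard library functions and exist only so the probe can be exercised without network access. A real probe would likely also want a grace period before applying the throughput check.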
The simplest approach is to have the service use a predefined membership and channel for such interrogations, and upload assets only to that channel. Cleanup of the uploaded assets can be done regularly by the same service. The channel should be set as unlisted in its metadata so that it does not appear in apps.
Question
I also wonder whether we have proper tooling to allow storage providers to detect a subset of such failures on their end, and if such detection actually results in pushing data out to the operator through some channel.