Description
As we plan to add distributed training to ML.NET, we have to consider the fault tolerance of individual worker nodes. For FastTree, fault tolerance for individual workers has two requirements:
- Failed FastTree workers must be restarted from the current state of the computation
- Non-failing workers must respond to failures in the IParallelTraining components*
*The form of this response depends on how fault tolerance is implemented.
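
To make the first requirement concrete, here is a minimal, language-neutral sketch of checkpoint-and-resume, written in Python for brevity. It is not ML.NET code and does not use IParallelTraining; `Checkpoint`, `WorkerFailure`, `run_worker`, and `train_with_restart` are hypothetical names, the "trees" are placeholder strings, and the in-memory dictionary stands in for whatever durable store an actual implementation would choose.

```python
from dataclasses import dataclass, field


@dataclass
class Checkpoint:
    """Hypothetical snapshot of a worker's progress (not an ML.NET type)."""
    iteration: int = 0
    trees: list = field(default_factory=list)


class WorkerFailure(Exception):
    """Simulated crash of a worker."""


def run_worker(store, worker_id, num_iterations, fail_at=None):
    """Build one placeholder tree per iteration, checkpointing after each one.

    `store` is a dict standing in for durable checkpoint storage (assumption).
    """
    ckpt = store.get(worker_id, Checkpoint())
    # Work on a copy so a crash mid-iteration cannot corrupt the saved state.
    trees = list(ckpt.trees)
    for i in range(ckpt.iteration, num_iterations):
        if fail_at is not None and i == fail_at:
            raise WorkerFailure(f"worker {worker_id} failed at iteration {i}")
        trees.append(f"tree-{i}")
        # Persist progress only after the iteration has fully completed.
        store[worker_id] = Checkpoint(iteration=i + 1, trees=list(trees))
    return trees


def train_with_restart(num_iterations, fail_at):
    """Run a worker, and on failure restart it from its last checkpoint."""
    store = {}
    try:
        return run_worker(store, 0, num_iterations, fail_at=fail_at)
    except WorkerFailure:
        # The restarted worker resumes at the saved iteration, not at zero.
        return run_worker(store, 0, num_iterations)
```

In this sketch the restarted worker repeats only the iteration that was interrupted. How the non-failing workers wait for or react to that restart is the open design question noted in the footnote, and is deliberately left out.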