Build a simulator to estimate the performance overhead of distributed ML model training under a given configuration.
Try to profile the operators and train an XGBoost model to estimate the slowdown caused by overlap between a communication operator and a compute operator.
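
A minimal sketch of how the two pieces could fit together, under stated assumptions: the simulator walks a list of training steps and, for every step where communication overlaps compute, asks a pluggable slowdown predictor for a stretch factor. All names here (`Step`, `SlowdownModel`, `constant_slowdown`, `simulate`, `overhead`) are hypothetical, and the cost model (both operators stretched by the same factor, step time equal to the longer of the two) is an illustrative simplification, not a measured result.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    compute_ms: float  # compute-operator time when run in isolation
    comm_ms: float     # communication-operator time when run in isolation
    overlap: bool      # True if the two operators run concurrently


# Hypothetical interface: (compute_ms, comm_ms) -> slowdown factor (>= 1.0).
# A trained regressor could be wrapped to match this signature.
SlowdownModel = Callable[[float, float], float]


def constant_slowdown(factor: float) -> SlowdownModel:
    """Placeholder predictor that always returns the same factor."""
    return lambda compute_ms, comm_ms: factor


def simulate(steps: List[Step], slowdown: SlowdownModel) -> float:
    """Return the estimated total time in ms for all steps."""
    total = 0.0
    for s in steps:
        if s.overlap:
            # Assumption: both operators are stretched by the same factor
            # and the step ends when the slower one finishes.
            f = slowdown(s.compute_ms, s.comm_ms)
            total += max(s.compute_ms * f, s.comm_ms * f)
        else:
            # No overlap: the operators run back to back.
            total += s.compute_ms + s.comm_ms
    return total


def overhead(steps: List[Step], slowdown: SlowdownModel) -> float:
    """Fractional overhead relative to compute-only time."""
    ideal = sum(s.compute_ms for s in steps)
    return simulate(steps, slowdown) / ideal - 1.0


if __name__ == "__main__":
    steps = [Step(10.0, 4.0, True), Step(10.0, 4.0, False)]
    model = constant_slowdown(1.2)
    print(f"total: {simulate(steps, model):.1f} ms")
    print(f"overhead: {overhead(steps, model):.2%}")
```

`constant_slowdown` only stands in for the learned model. Once profiling data exists, a fitted regressor's prediction could be wrapped in a function with the same `(compute_ms, comm_ms)` signature and passed to `simulate` unchanged; the choice of input features is left open here.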