Open
Description
To optimize forward-pass latency, it would be good to schedule GC to run in between model executions. This won't improve QPS, since the amortized GC cost is the same, but it would lower the per-batch latency.
```python
import gc
gc.collect()
```
We should spin up a background thread that periodically iterates over all of the interpreter threads, locks each one between executions, and runs the GC. It may also be worth explicitly disabling automatic GC on the individual interpreter threads so collections can't trigger during a forward pass.
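A minimal single-process sketch of this idea, assuming a hypothetical `run_forward_pass` standing in for the model execution: automatic GC is disabled, each forward pass holds a lock, and a background thread runs `gc.collect()` only while holding that same lock, so collection happens strictly between executions. (In a real multi-interpreter setup each interpreter would have its own GC state and its own lock; this sketch only shows the scheduling pattern.)

```python
import gc
import threading
import time

# Lock serializing forward passes against manual collections.
execution_lock = threading.Lock()
stop_event = threading.Event()

def gc_daemon(interval_s=0.1):
    """Periodically run a full collection, only between model executions."""
    while not stop_event.is_set():
        time.sleep(interval_s)
        with execution_lock:  # no forward pass can run while we collect
            gc.collect()

def run_forward_pass(batch):
    """Hypothetical stand-in for a model execution; GC cannot run inside."""
    with execution_lock:
        return [x * 2 for x in batch]  # placeholder for model(batch)

gc.disable()  # automatic collections won't trigger during forward passes
daemon = threading.Thread(target=gc_daemon, daemon=True)
daemon.start()

out = run_forward_pass([1, 2, 3])

stop_event.set()
gc.enable()
```

Note that `gc.disable()` is process-wide in CPython, not per-thread, which is part of why per-interpreter control would be needed for the scheme described above.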
Context: