Closed
Description
Imagine this: I have a few pods running on an inf1.6xlarge instance. Each pod has a Neuron runtime running as a sidecar. When each pod asks for 2 inf1 devices, there's a very high chance that one of them will get 2 devices with non-contiguous logical IDs. Here's an example of what I get:
bash-4.2# neuron-rtd -g unix:/sock/neuron.sock -x
nrtd[17]: [NRTD:nrtd_main] nrtd build using:1.0.6222.0
nrtd[17]: [NRTD:InitTongas] MLA logical Ids must be contiguous
nrtd[17]: [NRTD:nrtd_main] Failed to initialize devices: 0000:00:1c.0 0000:00:1f.0 , attempt: 1
neuron-rtd[17]: [TDRV:tdrv_destory] TDRV not initialized
This causes the pod to enter a crash loop, which is undesirable.
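To make the failure condition concrete, here is a rough Python sketch of what "contiguous" means for a set of logical IDs. It is only an illustration: `is_contiguous` is a hypothetical helper, not part of neuron-rtd, and it assumes logical IDs are plain integers (for example 0 to 3 on a 4-device instance).

```python
def is_contiguous(logical_ids):
    """Return True if the IDs form one unbroken run, e.g. [1, 2] or [0, 1, 2, 3].

    Hypothetical helper for illustration; assumes integer logical IDs.
    """
    ids = sorted(set(logical_ids))
    if not ids:
        return False  # no devices allocated at all
    # An unbroken run of n distinct integers spans exactly n values.
    return ids[-1] - ids[0] + 1 == len(ids)


# A pod that was handed devices 0 and 1 would pass this check,
# while a pod that was handed devices 0 and 3 would not.
print(is_contiguous([0, 1]))
print(is_contiguous([0, 3]))
```

A pod given two devices with a gap between their IDs would fail this check, which matches the "must be contiguous" error in the log above.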
The only workaround I see at the moment is to have each pod request either 1 device or exactly as many Inferentia chips as the instance has (4 in this case).
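A quick way to sanity-check that reasoning is to enumerate every possible allocation. The sketch below assumes the devices have integer logical IDs 0 to N-1 and that a pod may be handed any subset of the requested size; `always_contiguous_sizes` is a hypothetical name used only for illustration.

```python
from itertools import combinations


def always_contiguous_sizes(num_devices):
    """List the request sizes for which every possible allocation is contiguous.

    Assumes logical IDs 0..num_devices-1 and that any subset may be allocated.
    """
    safe = []
    for k in range(1, num_devices + 1):
        # combinations() yields sorted tuples, so a tuple is an unbroken
        # run exactly when its span (last - first + 1) equals its length.
        if all(c[-1] - c[0] + 1 == k for c in combinations(range(num_devices), k)):
            safe.append(k)
    return safe


print(always_contiguous_sizes(4))  # [1, 4]
```

For 4 devices, only requests of 1 or 4 can never end up non-contiguous; a request of 2 or 3 can always land on a set of IDs with a gap.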
@mrnikwaws any idea what could be done here?