Fix visual observation tensor indexing for Unity inference #6239
base: develop
Conversation
This change corrects the tensor indexing calculation in TensorExtensions.Index() to properly support CHW (channels-height-width) format used by both Unity's observation writers and ONNX models during inference.
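To make the two layouts concrete, here is a small Python sketch of the flat-index arithmetic (illustrative only; the actual fix is in the C# TensorExtensions.Index(), and the function names and shapes here are made up for the example):

```python
# Illustrative sketch: flat-index formulas for CHW vs HWC layouts.

def index_chw(c, h, w, H, W):
    # Channels-first: each channel is a contiguous H*W plane.
    return c * (H * W) + h * W + w

def index_hwc(c, h, w, W, C):
    # Channels-last: each pixel's C channel values are contiguous.
    return h * (W * C) + w * C + c

# The same (c, h, w) coordinate maps to different flat offsets
# under each layout, so mixing them up scrambles the observation.
H, W, C = 4, 5, 3
print(index_chw(1, 2, 3, H, W))   # 1*20 + 2*5 + 3 = 33
print(index_hwc(1, 2, 3, W, C))   # 2*15 + 3*3 + 1 = 40
```

If the writer stores values at CHW offsets but the reader computes HWC offsets, every element except a few coincidental overlaps lands in the wrong place.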
Solved it for me
Hi, thank you for submitting the PR. I'll take a look after Release 4.
Hi, I took a look at the issue you reported. It's not a bug in ML-Agents core code, but rather a missing implementation detail in the custom visual sensor: it needs to match the preprocessing expectations established during training.

During training, ML-Agents converts visual observations to PNG format before sending them to the Python trainer, which flips the image vertically. To compensate for this during Unity ONNX inference, ML-Agents' built-in visual sensors (like CameraSensor) use the WriteTexture() method in ObservationWriter.cs, which includes a compensatory flip to match the training data format. Your custom visual sensor likely bypasses this by writing pixel values directly without the flip compensation.

The fix is to either use writer.WriteTexture(texture, grayscale) if you're working with Texture2D data, or manually implement the vertical flip in your custom Write() method by iterating from height-1 down to 0 and using (height - h - 1) for your actual data indexing.

This also explains why switching to Vector observations works: it bypasses the entire visual preprocessing pipeline, including the flip compensation. For reference, the visual examples like Visual3DBall and VisualFoodCollector use Unity's built-in CameraSensor or RenderTextureSensor components, which handle the image flipping correctly through the WriteTexture() method.
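As a hedged illustration of the flip compensation described above, here is a Python sketch of a custom Write() that reads source rows bottom-up. A plain dict stands in for ML-Agents' ObservationWriter, and the (c, h, w) coordinate order is an assumption for this example:

```python
# Hypothetical sketch of the (height - h - 1) flip compensation.
# 'writer' is a dict standing in for the real ObservationWriter.

def write_flipped(writer, pixels, height, width, channels):
    for h in range(height):
        src_h = height - h - 1          # read source rows bottom-up
        for w in range(width):
            for c in range(channels):
                writer[(c, h, w)] = pixels[src_h][w][c]

# 2x2 grayscale image: top row [1, 2], bottom row [3, 4].
pixels = [[[1], [2]], [[3], [4]]]
out = {}
write_flipped(out, pixels, height=2, width=2, channels=1)
print(out[(0, 0, 0)], out[(0, 0, 1)])  # 3 4  (bottom row written first)
```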
Thank you for the detailed response! However, I believe there might be some confusion about the nature of the problem I encountered. My custom visual sensor doesn't use PNG compression:

```csharp
public ObservationSpec GetObservationSpec() => ObservationSpec.Visual(channels, height, width, ObservationType.Default);
public CompressionSpec GetCompressionSpec() => CompressionSpec.Default();
public byte[] GetCompressedObservation() => null;
```

I'm also using the ObservationWriter correctly according to its signature. Your suggestion about vertical image flipping doesn't explain why my fix worked, because I didn't flip the image; I fundamentally changed the tensor indexing order from HWC to CHW. If it were just a vertical flip issue, the fix would involve changing only the height index (e.g. using height - h - 1), not reordering the dimensions.

I think this is a data layout problem, not an image orientation problem. Since my custom sensor bypasses PNG compression and uses the standard ObservationWriter interface correctly, shouldn't the tensor indexing in Unity inference match the same CHW format that Unity's visual sensors use during training? The fact that changing HWC→CHW indexing solved the problem suggests this was indeed a tensor layout mismatch, not a vertical flip compensation issue.
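The distinction drawn here can be checked numerically: a vertical flip permutes whole rows within the same layout, whereas HWC→CHW relocates every element to a different flat offset. A small NumPy sketch (shapes chosen arbitrarily for illustration):

```python
import numpy as np

H, W, C = 4, 5, 3
hwc = np.arange(H * W * C).reshape(H, W, C)

flipped = hwc[::-1, :, :]      # vertical flip: still HWC layout
chw = hwc.transpose(2, 0, 1)   # relayout: channels-first

# The flip only reorders rows: the first row of the flipped image is
# the last row of the original, byte-for-byte.
print(np.array_equal(flipped.reshape(-1)[:W * C], hwc[H - 1].reshape(-1)))  # True
# The transpose changes the flat memory order entirely.
print(np.array_equal(chw.reshape(-1), hwc.reshape(-1)))                     # False
```

So a fix that only reorders dimensions, rather than reversing the height index, points at a layout mismatch rather than an orientation one.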
Summary
Problem
The indexing formula computed flat offsets assuming HWC layout, while Unity's visual sensors write data in CHW format.
Solution
Corrected TensorExtensions.Index() to properly calculate CHW tensor indices that match Unity's observation writer format.