
✨ Add waveform capture #54

Open

yhs0602 wants to merge 3 commits into main from feat/waveform_capture

Conversation

@yhs0602 (Owner) commented Jan 18, 2026

Pull Coverage Badge

Summary by CodeRabbit

  • New Features

    • Added in-process audio loopback capture and inclusion of raw PCM waveform + metadata in observations.
    • Introduced Python wrappers for raw waveform, mel spectrogram and MFCC outputs.
    • Added a bimodal wrapper to combine vision and audio observations.
  • Integration

    • Observation schema extended to carry audio waveform data and timestamps.
  • Chores

    • Expanded ignore list to exclude generated loopback/source artifacts.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai bot commented Jan 18, 2026

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Adds OpenAL SOFT_loopback-based audio loopback capture to the Minecraft environment, embeds captured PCM into observations via a new AudioWaveform protobuf message, integrates capture lifecycle via a SoundEngine mixin, and provides Python wrappers to decode/transform waveform data (raw PCM, mel spectrogram, MFCC).
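
For orientation, here is a minimal usage sketch of the new wrapper (hedged: the environment factory call is an assumption for illustration; only the wrapper import path and parameters come from this PR):

import craftground  # factory API assumed for illustration
from craftground.wrappers.waveform import WaveformWrapper

base_env = craftground.make()  # a CraftGround environment instance (assumption)
env = WaveformWrapper(base_env, output_format="raw", normalize=True)
obs, info = env.reset()
waveform = obs["waveform"]  # float32 PCM, shape (channels, buffer_samples)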

Changes

Cohort / File(s) / Summary

Gitignore
.gitignore
Add ignore for src/craftground/MinecraftEnv/minecraft_sources/.

Audio loopback implementation
src/craftground/MinecraftEnv/src/main/java/com/kyhsgeekcode/minecraftenv/AudioLoopbackCapturer.kt
New Kotlin singleton implementing OpenAL SOFT_loopback initialization, renderSamples(), PCM byte conversion, getters, and cleanup.

Lifecycle integration (mixin)
src/craftground/MinecraftEnv/src/main/java/com/kyhsgeekcode/minecraftenv/mixin/SoundEngineMixin.java, src/craftground/MinecraftEnv/src/main/resources/com.kyhsgeekcode.minecraftenv.mixin.json
New mixin injected after SoundEngine.init() and before SoundEngine.close() to initialize/close AudioLoopbackCapturer; mixin registered in client mixins config.

Observation emission
src/craftground/MinecraftEnv/src/main/java/com/kyhsgeekcode/minecraftenv/MinecraftEnv.kt
Call into capturer each world tick; populate observation with audio_waveform fields (pcm_data, sample_rate, channels, bits_per_sample, num_samples, timestamp).

Protobuf schema & generated bindings
src/proto/observation_space.proto, src/craftground/MinecraftEnv/.../proto/AudioWaveformKt.kt, src/craftground/MinecraftEnv/.../proto/ObservationSpace.java, src/craftground/MinecraftEnv/.../proto/ObservationSpaceMessageKt.kt, src/craftground/proto/observation_space_pb2.py, src/craftground/proto/observation_space_pb2.pyi
Add new AudioWaveform message; add audio_waveform field (number 45) to ObservationSpaceMessage; update generated Kotlin/Java/Python protobuf bindings and DSL accessors.

Python wrappers
src/craftground/wrappers/waveform.py
New WaveformWrapper and BimodalWaveformWrapper: decode interleaved 16-bit PCM, provide raw/spectrogram/MFCC outputs (librosa optional), define observation space and silence fallbacks.

Sequence Diagram(s)

sequenceDiagram
    participant SoundEngine as SoundEngine
    participant Mixin as SoundEngineMixin
    participant Capturer as AudioLoopbackCapturer
    participant OpenAL as OpenAL

    SoundEngine->>Mixin: init() [TAIL]
    Mixin->>Capturer: initialize(mainDevice)
    Capturer->>OpenAL: check ALC_SOFT_loopback support
    OpenAL-->>Capturer: support status
    alt supported
        Capturer->>OpenAL: alcLoopbackOpenDeviceSOFT()
        OpenAL-->>Capturer: loopbackDevice
        Capturer->>OpenAL: alcCreateContext(...)
        OpenAL-->>Capturer: loopbackContext
        Capturer->>Capturer: allocate buffers, mark initialized
        Capturer-->>Mixin: initialized
    else not supported
        Capturer-->>Mixin: initialization failed
    end
    SoundEngine->>Mixin: close() [HEAD]
    Mixin->>Capturer: close()
    Capturer->>OpenAL: destroy context, close device

sequenceDiagram
    participant Game as Game Loop
    participant Env as MinecraftEnv
    participant Capturer as AudioLoopbackCapturer
    participant Proto as Protobuf
    participant Agent as Python Wrapper

    Game->>Env: world tick
    Env->>Capturer: renderSamples()
    Capturer->>Capturer: make loopback context current
    Capturer->>OpenAL: alcRenderSamplesSOFT(...)
    Capturer->>Capturer: convert shorts -> little-endian bytes
    Capturer-->>Env: pcm bytes + metadata
    Env->>Proto: build AudioWaveform message
    Env-->>Game: observation includes audio_waveform
    Game->>Agent: observation delivered
    Agent->>Agent: decode pcm, optional librosa processing

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Poem

🐰 I hopped into the loopback stream tonight,
Tiny ears tuned to OpenAL's light,
PCM ripples, mel blooms in flight,
From game to bytes, a waveform delight,
Hooray—Minecraft hums in audio bright! 🎶

🚥 Pre-merge checks | ✅ 1 passed | ❌ 2 failed

❌ Failed checks (2 warnings)

Description check (⚠️ Warning)
Explanation: The PR description contains only a coverage badge template placeholder and lacks substantive information about the changes, rationale, testing, or implementation details.
Resolution: Replace the placeholder description with details about what waveform capture does, how it integrates with the MinecraftEnv, any testing performed, and relevant implementation notes.

Docstring Coverage (⚠️ Warning)
Explanation: Docstring coverage is 54.11%, below the required threshold of 80.00%.
Resolution: Write docstrings for the functions that are missing them to satisfy the coverage threshold.

✅ Passed checks (1 passed)

Title check (✅ Passed)
Explanation: The title '✨ Add waveform capture' clearly and concisely summarizes the main change: adding audio waveform capture functionality across Java, Kotlin, Python, and protobuf components.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 ast-grep (0.40.5)
src/craftground/MinecraftEnv/src/main/java/com/kyhsgeekcode/minecraftenv/proto/ObservationSpace.java

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai bot left a comment

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/craftground/MinecraftEnv/src/main/java/com/kyhsgeekcode/minecraftenv/MinecraftEnv.kt (1)

49-49: Remove unused import flagged by ktlint.

The pipeline indicates this import is unused.

🔧 Proposed fix
-import net.minecraft.world.World
🤖 Fix all issues with AI agents
In
`@src/craftground/MinecraftEnv/src/main/java/com/kyhsgeekcode/minecraftenv/AudioLoopbackCapturer.kt`:
- Line 10: Remove the unused import java.nio.ByteBuffer from the
AudioLoopbackCapturer.kt file; open the AudioLoopbackCapturer class/companion
and delete the unused import line so ktlint no longer flags ByteBuffer as unused
and build/lint passes.
- Around line 5-6: Remove the unused imports ALC11 and ALCCapabilities from
AudioLoopbackCapturer.kt; locate the import lines at the top of the file (the
import statements for org.lwjgl.openal.ALC11 and
org.lwjgl.openal.ALCCapabilities) and delete them, then re-run ktlint/gradle
build to confirm no remaining references to these symbols and that the file
compiles cleanly.

In `@src/craftground/wrappers/waveform.py`:
- Around line 110-115: The spectrogram Box is incorrectly bounded with
low=0.0/high=np.inf; update the spectrogram observation space creation in the
code path where self.output_format == "spectrogram" (the audio_space
construction using self.n_mels and time_frames) to use realistic dB bounds
(e.g., low=-80.0, high=0.0, dtype=np.float32) to match power_to_db output and
mirror the BimodalWaveformWrapper spectrogram bounds; apply the same change to
the other spectrogram Box construction referenced around the later occurrence
(the block at lines ~194-197).
- Around line 94-135: The observation_space declared in _setup_observation_space
can diverge from actual outputs because _extract_waveform mutates buffer_samples
and channels at runtime; update the code so that either (A) after
_extract_waveform updates buffer_samples/channels you call
_setup_observation_space() to rebuild observation_space, or (B) keep
observation_space fixed and enforce that _extract_waveform returns observations
padded/truncated to match the declared space (pad with zeros or trim samples and
reshape mono/stereo to match channels), and make this normalization explicit in
_extract_waveform; reference _setup_observation_space, _extract_waveform,
observation_space, buffer_samples and channels and apply the chosen approach
consistently where buffer_samples/channels are changed (end of the update block
around lines 232-237).
- Around line 398-410: When audio_format=="spectrogram" in WaveformWrapper, do
not silently fall back when librosa is missing; instead check for librosa
availability during WaveformWrapper.__init__ (or the class constructor) and
either raise an ImportError (or ValueError) describing that librosa is required
for spectrogram mode or emit a clear warning via the existing logger; update the
runtime check in the spectrogram branch (the code that imports librosa and calls
librosa.feature.melspectrogram / librosa.power_to_db) to assume librosa is
present so the import error cannot be silently ignored.
🧹 Nitpick comments (8)
src/craftground/MinecraftEnv/src/main/java/com/kyhsgeekcode/minecraftenv/mixin/SoundEngineMixin.java (1)

6-6: Remove unused import.

ALC10 is not used directly in this mixin; the OpenAL operations are delegated to AudioLoopbackCapturer.

🔧 Proposed fix
-import org.lwjgl.openal.ALC10;
src/craftground/wrappers/waveform.py (7)

48-48: Unused kwargs parameter.

The **kwargs parameter is accepted but never used. If it's intended for future extensibility, consider documenting it; otherwise, remove it to avoid confusion.

♻️ Suggested fix
     def __init__(
         self,
         env: gym.Env,
         output_format: Literal["raw", "spectrogram", "mfcc"] = "raw",
         normalize: bool = True,
         n_mels: int = 64,
         n_fft: int = 1024,
         hop_length: int = 512,
         include_subtitle: bool = False,
-        **kwargs,
     ):

84-92: Add exception chaining for better debugging.

When re-raising a new exception from within an except block, use from err to preserve the original traceback, or from None to explicitly suppress it.

♻️ Suggested fix
         if output_format in ["spectrogram", "mfcc"]:
             try:
                 import librosa
 
                 self._librosa = librosa
-            except ImportError:
+            except ImportError as err:
                 raise ImportError(
                     f"librosa is required for output_format='{output_format}'. "
                     "Install it with: pip install librosa"
-                )
+                ) from err

130-133: Placeholder code for include_subtitle does nothing.

The include_subtitle parameter is accepted and stored, but the corresponding logic is a no-op. Either implement the feature or remove the parameter to avoid misleading users.


277-277: Prefix unused variables with underscore.

The obs variable is intentionally unused. Prefix it with _ to signal this and silence linter warnings.

♻️ Suggested fix
     def step(
         self, action: WrapperActType
     ) -> tuple[WrapperObsType, SupportsFloat, bool, bool, dict[str, Any]]:
         """Execute action and return observation with waveform."""
-        obs, reward, terminated, truncated, info = self.env.step(action)
+        _obs, reward, terminated, truncated, info = self.env.step(action)
         ...

     def reset(
         ...
     ) -> tuple[WrapperObsType, dict[str, Any]]:
         """Reset environment and return initial observation with waveform."""
-        obs, info = self.env.reset(seed=seed, options=options)
+        _obs, info = self.env.reset(seed=seed, options=options)

Also applies to: 296-296


324-324: Unused kwargs parameter.

Same issue as in WaveformWrapper - the **kwargs parameter is unused.


412-434: Prefix unused obs variables with underscore.

Same issue as in WaveformWrapper - the obs variables at lines 413 and 429 are unused.

♻️ Suggested fix
     def step(self, action: WrapperActType):
-        obs, reward, terminated, truncated, info = self.env.step(action)
+        _obs, reward, terminated, truncated, info = self.env.step(action)
         ...

     def reset(
         self, *, seed: Optional[int] = None, options: Optional[dict[str, Any]] = None
     ):
-        obs, info = self.env.reset(seed=seed, options=options)
+        _obs, info = self.env.reset(seed=seed, options=options)

375-410: Consider extracting shared PCM decoding logic.

BimodalWaveformWrapper._decode_audio duplicates much of the PCM decoding logic from WaveformWrapper._decode_pcm. Consider extracting a shared helper function to reduce duplication.
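
A hedged sketch of what such a shared helper could look like (name and exact behavior are assumptions, not the PR's code):

import numpy as np

def decode_interleaved_pcm16(pcm_bytes: bytes, channels: int, normalize: bool = True) -> np.ndarray:
    """Decode 16-bit little-endian interleaved PCM into (channels, samples)."""
    audio = np.frombuffer(pcm_bytes, dtype="<i2").astype(np.float32)
    if normalize:
        audio /= 32768.0
    if channels > 1:
        remainder = len(audio) % channels
        if remainder:
            audio = np.pad(audio, (0, channels - remainder))  # complete a partial frame
        audio = audio.reshape(-1, channels).T
    return audio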

Comment on lines +5 to +6
import org.lwjgl.openal.ALC11
import org.lwjgl.openal.ALCCapabilities

⚠️ Potential issue | 🟡 Minor

Remove unused imports flagged by ktlint.

Pipeline indicates ALC11 and ALCCapabilities are unused.

🔧 Proposed fix
-import org.lwjgl.openal.ALC11
-import org.lwjgl.openal.ALCCapabilities
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
import org.lwjgl.openal.ALC11
import org.lwjgl.openal.ALCCapabilities
🧰 Tools
🪛 GitHub Actions: Format Check for Kotlin

[error] 5-5: ktlint: Unused import (standard:no-unused-imports).


[error] 6-6: ktlint: Unused import (standard:no-unused-imports).

🤖 Prompt for AI Agents
In
`@src/craftground/MinecraftEnv/src/main/java/com/kyhsgeekcode/minecraftenv/AudioLoopbackCapturer.kt`
around lines 5 - 6, Remove the unused imports ALC11 and ALCCapabilities from
AudioLoopbackCapturer.kt; locate the import lines at the top of the file (the
import statements for org.lwjgl.openal.ALC11 and
org.lwjgl.openal.ALCCapabilities) and delete them, then re-run ktlint/gradle
build to confirm no remaining references to these symbols and that the file
compiles cleanly.

import org.lwjgl.openal.SOFTLoopback
import org.lwjgl.system.MemoryStack
import org.lwjgl.system.MemoryUtil
import java.nio.ByteBuffer

⚠️ Potential issue | 🟡 Minor

Remove unused import flagged by ktlint.

Pipeline indicates ByteBuffer is unused.

🔧 Proposed fix
-import java.nio.ByteBuffer
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
import java.nio.ByteBuffer
🧰 Tools
🪛 GitHub Actions: Format Check for Kotlin

[error] 10-10: ktlint: Unused import (standard:no-unused-imports).

🤖 Prompt for AI Agents
In
`@src/craftground/MinecraftEnv/src/main/java/com/kyhsgeekcode/minecraftenv/AudioLoopbackCapturer.kt`
at line 10, Remove the unused import java.nio.ByteBuffer from the
AudioLoopbackCapturer.kt file; open the AudioLoopbackCapturer class/companion
and delete the unused import line so ktlint no longer flags ByteBuffer as unused
and build/lint passes.

Comment on lines +94 to +135
def _setup_observation_space(self):
    """Setup the observation space based on output format."""
    if self.output_format == "raw":
        # Raw PCM: (channels, samples) or (samples,) for mono
        if self.channels == 1:
            shape = (self.buffer_samples,)
        else:
            shape = (self.channels, self.buffer_samples)

        audio_space = gym.spaces.Box(
            low=-1.0 if self.normalize else -32768,
            high=1.0 if self.normalize else 32767,
            shape=shape,
            dtype=np.float32,
        )

    elif self.output_format == "spectrogram":
        # Mel spectrogram: (n_mels, time_frames)
        time_frames = self.buffer_samples // self.hop_length + 1
        audio_space = gym.spaces.Box(
            low=0.0, high=np.inf, shape=(self.n_mels, time_frames), dtype=np.float32
        )

    elif self.output_format == "mfcc":
        # MFCC: (n_mfcc, time_frames)
        time_frames = self.buffer_samples // self.hop_length + 1
        n_mfcc = 13  # Standard number of MFCCs
        audio_space = gym.spaces.Box(
            low=-np.inf, high=np.inf, shape=(n_mfcc, time_frames), dtype=np.float32
        )
    else:
        raise ValueError(f"Unknown output_format: {self.output_format}")

    # Create composite observation space
    spaces = {"waveform": audio_space}

    if self.include_subtitle:
        # Add subtitle info if requested
        # This would need to be coordinated with the existing sound wrapper
        pass

    self.observation_space = gym.spaces.Dict(spaces)

⚠️ Potential issue | 🟠 Major

Observation space shape may not match actual observations.

The observation space is initialized with default values (buffer_samples=2205, channels=2), but _extract_waveform updates these instance variables at runtime (lines 232-237) without rebuilding the observation space. This causes a mismatch between the declared observation_space shape and the actual observation shape, which violates Gymnasium's contract and can break downstream code that relies on observation_space.

Consider either:

  1. Making audio parameters fixed and documenting them as requirements
  2. Calling _setup_observation_space() after updating parameters (though this has its own issues with dynamic spaces)
  3. Padding/truncating observations to match the declared space

Also applies to: 232-237
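
A hedged sketch of option 3, padding or truncating decoded audio to the declared space (helper name is an assumption):

import numpy as np

def fit_to_length(audio: np.ndarray, buffer_samples: int) -> np.ndarray:
    """Zero-pad or truncate along the sample axis to a fixed length."""
    n = audio.shape[-1]
    if n < buffer_samples:
        pad = [(0, 0)] * (audio.ndim - 1) + [(0, buffer_samples - n)]
        return np.pad(audio, pad)
    return audio[..., :buffer_samples]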

🧰 Tools
🪛 Ruff (0.14.13)

125-125: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
In `@src/craftground/wrappers/waveform.py` around lines 94 - 135, The
observation_space declared in _setup_observation_space can diverge from actual
outputs because _extract_waveform mutates buffer_samples and channels at
runtime; update the code so that either (A) after _extract_waveform updates
buffer_samples/channels you call _setup_observation_space() to rebuild
observation_space, or (B) keep observation_space fixed and enforce that
_extract_waveform returns observations padded/truncated to match the declared
space (pad with zeros or trim samples and reshape mono/stereo to match
channels), and make this normalization explicit in _extract_waveform; reference
_setup_observation_space, _extract_waveform, observation_space, buffer_samples
and channels and apply the chosen approach consistently where
buffer_samples/channels are changed (end of the update block around lines
232-237).

Comment on lines +110 to +115
elif self.output_format == "spectrogram":
    # Mel spectrogram: (n_mels, time_frames)
    time_frames = self.buffer_samples // self.hop_length + 1
    audio_space = gym.spaces.Box(
        low=0.0, high=np.inf, shape=(self.n_mels, time_frames), dtype=np.float32
    )

⚠️ Potential issue | 🟡 Minor

Spectrogram observation space bounds mismatch.

The spectrogram observation space declares low=0.0 (line 114), but power_to_db returns values in dB scale which are typically negative (e.g., -80 to 0 dB). The BimodalWaveformWrapper correctly uses low=-80.0, high=0.0 for its spectrogram space (line 365).

♻️ Suggested fix
         elif self.output_format == "spectrogram":
             # Mel spectrogram: (n_mels, time_frames)
             time_frames = self.buffer_samples // self.hop_length + 1
             audio_space = gym.spaces.Box(
-                low=0.0, high=np.inf, shape=(self.n_mels, time_frames), dtype=np.float32
+                low=-80.0, high=0.0, shape=(self.n_mels, time_frames), dtype=np.float32
             )

Also applies to: 194-197
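
A standalone snippet to verify the suggested bounds (relies only on documented librosa defaults): with ref=np.max, librosa.power_to_db clips its output to [max - top_db, max], i.e. [-80.0, 0.0] dB at the default top_db=80.0.

import numpy as np
import librosa

mel = np.abs(np.random.randn(64, 5)).astype(np.float32)  # stand-in mel power matrix
db = librosa.power_to_db(mel, ref=np.max)
assert db.max() <= 0.0 and db.min() >= -80.0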

🤖 Prompt for AI Agents
In `@src/craftground/wrappers/waveform.py` around lines 110 - 115, The spectrogram
Box is incorrectly bounded with low=0.0/high=np.inf; update the spectrogram
observation space creation in the code path where self.output_format ==
"spectrogram" (the audio_space construction using self.n_mels and time_frames)
to use realistic dB bounds (e.g., low=-80.0, high=0.0, dtype=np.float32) to
match power_to_db output and mirror the BimodalWaveformWrapper spectrogram
bounds; apply the same change to the other spectrogram Box construction
referenced around the later occurrence (the block at lines ~194-197).

Comment on lines +398 to +410
if self.audio_format == "spectrogram":
    try:
        import librosa

        mono = np.mean(audio, axis=0) if audio.ndim == 2 else audio
        mel_spec = librosa.feature.melspectrogram(
            y=mono, sr=self.sample_rate, n_mels=self.n_mels
        )
        return librosa.power_to_db(mel_spec, ref=np.max).astype(np.float32)
    except ImportError:
        pass

return audio.astype(np.float32)

⚠️ Potential issue | 🟡 Minor

Silent fallback on missing librosa could mask configuration errors.

When audio_format="spectrogram" is requested but librosa is unavailable, the code silently falls back to raw audio (lines 407-408). This could confuse users who expect spectrogram output. Consider either raising an error in __init__ (like WaveformWrapper does) or at least logging a warning.

♻️ Suggested fix - validate in __init__
     def __init__(
         ...
     ):
         super().__init__(env)
         self.x_dim = x_dim
         self.y_dim = y_dim
         self.audio_format = audio_format
         self.n_mels = n_mels
+
+        # Validate librosa availability for spectrogram
+        if audio_format == "spectrogram":
+            try:
+                import librosa  # noqa: F401
+            except ImportError as err:
+                raise ImportError(
+                    "librosa is required for audio_format='spectrogram'. "
+                    "Install it with: pip install librosa"
+                ) from err
🤖 Prompt for AI Agents
In `@src/craftground/wrappers/waveform.py` around lines 398 - 410, When
audio_format=="spectrogram" in WaveformWrapper, do not silently fall back when
librosa is missing; instead check for librosa availability during
WaveformWrapper.__init__ (or the class constructor) and either raise an
ImportError (or ValueError) describing that librosa is required for spectrogram
mode or emit a clear warning via the existing logger; update the runtime check
in the spectrogram branch (the code that imports librosa and calls
librosa.feature.melspectrogram / librosa.power_to_db) to assume librosa is
present so the import error cannot be silently ignored.

Copilot AI left a comment

Pull request overview

This pull request adds waveform audio capture capability to CraftGround using the OpenAL SOFT_loopback extension. This enables reinforcement learning agents to receive raw audio waveform data from Minecraft, complementing the existing subtitle-based sound system.

Changes:

  • Added protobuf message definition for audio waveform data (PCM format with metadata)
  • Implemented Python wrappers (WaveformWrapper and BimodalWaveformWrapper) for audio observation processing with support for raw PCM, spectrogram, and MFCC formats
  • Added Java/Kotlin audio loopback capture system using OpenAL SOFT_loopback extension to capture game audio

Reviewed changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated 17 comments.

Show a summary per file

src/proto/observation_space.proto
Defines AudioWaveform protobuf message with PCM data and audio format metadata

src/craftground/wrappers/waveform.py
Implements WaveformWrapper and BimodalWaveformWrapper for processing audio observations

src/craftground/MinecraftEnv/src/main/resources/com.kyhsgeekcode.minecraftenv.mixin.json
Registers SoundEngineMixin for audio capture initialization

src/craftground/MinecraftEnv/src/main/java/com/kyhsgeekcode/minecraftenv/mixin/SoundEngineMixin.java
Hooks into Minecraft's sound engine to initialize audio loopback

src/craftground/MinecraftEnv/src/main/java/com/kyhsgeekcode/minecraftenv/MinecraftEnv.kt
Integrates audio waveform data into observation messages

src/craftground/MinecraftEnv/src/main/java/com/kyhsgeekcode/minecraftenv/AudioLoopbackCapturer.kt
Implements audio capture using OpenAL loopback device

.gitignore
Excludes Minecraft decompiled sources directory


Comment on lines +231 to +237
# Update audio parameters from actual data
if hasattr(audio_waveform, "sample_rate") and audio_waveform.sample_rate > 0:
    self.sample_rate = audio_waveform.sample_rate
if hasattr(audio_waveform, "channels") and audio_waveform.channels > 0:
    self.channels = audio_waveform.channels
if hasattr(audio_waveform, "num_samples") and audio_waveform.num_samples > 0:
    self.buffer_samples = audio_waveform.num_samples
Copilot AI Jan 18, 2026

The observation space is set up during initialization with default audio parameters (sample_rate=44100, channels=2, buffer_samples=2205), but these parameters can be updated dynamically in _extract_waveform() based on actual audio data. This creates a mismatch where the observation space shape may not match the actual data shape if the audio parameters differ from defaults. Consider either enforcing fixed parameters or dynamically updating the observation space when parameters change.

Suggested change
# Update audio parameters from actual data
if hasattr(audio_waveform, "sample_rate") and audio_waveform.sample_rate > 0:
    self.sample_rate = audio_waveform.sample_rate
if hasattr(audio_waveform, "channels") and audio_waveform.channels > 0:
    self.channels = audio_waveform.channels
if hasattr(audio_waveform, "num_samples") and audio_waveform.num_samples > 0:
    self.buffer_samples = audio_waveform.num_samples
# Use configured audio parameters; do not mutate them at runtime to avoid
# mismatches with the observation space. If needed, metadata from
# audio_waveform can be read into local variables here without changing
# self.sample_rate, self.channels, or self.buffer_samples.

Comment on lines +169 to +176
// Convert to byte array for transmission
buffer.rewind()
lastCapturedSamples = ByteArray(BUFFER_SIZE_BYTES)
for (i in 0 until SAMPLES_PER_RENDER * CHANNELS) {
    val sample = buffer.get(i)
    lastCapturedSamples[i * 2] = (sample.toInt() and 0xFF).toByte()
    lastCapturedSamples[i * 2 + 1] = ((sample.toInt() shr 8) and 0xFF).toByte()
}
Copilot AI Jan 18, 2026

The byte conversion in renderSamples() uses buffer.get(i) in a loop to manually convert shorts to bytes. This is inefficient and error-prone. Consider using buffer.asShortBuffer() or a more direct bulk conversion method. Additionally, the manual little-endian conversion ((sample.toInt() and 0xFF) for low byte, (sample.toInt() shr 8) for high byte) may not handle signed values correctly - Kotlin's Byte is signed (-128 to 127), which could cause issues with the conversion.
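
One way to probe the signedness concern from Python is to cross-check the mask-and-shift split against numpy's little-endian int16 encoding (standalone snippet, not part of the PR):

import numpy as np

samples = np.array([-32768, -1, 0, 1, 32767], dtype=np.int16)
manual = bytearray()
for s in samples:
    v = int(s)
    manual.append(v & 0xFF)         # low byte, mirrors (sample.toInt() and 0xFF)
    manual.append((v >> 8) & 0xFF)  # high byte, mirrors ((sample.toInt() shr 8) and 0xFF)
assert bytes(manual) == samples.astype("<i2").tobytes()  # passes, including negatives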


import com.google.protobuf.ByteString
import org.lwjgl.openal.ALC10
import org.lwjgl.openal.ALC11
Copilot AI Jan 18, 2026

The import statement includes org.lwjgl.openal.ALC11 but ALC11 is never used in the code. This import should be removed.

Suggested change
import org.lwjgl.openal.ALC11

Comment on lines +303 to +434
class BimodalWaveformWrapper(gym.Wrapper):
    """
    Wrapper that combines vision and raw audio waveform observations.

    This is similar to BimodalWrapper but uses raw waveform instead of
    subtitle-based sound encoding.

    Observation space:
        {
            "vision": Box(0, 255, (height, width, 3), uint8),
            "audio": Box(-1, 1, (channels, samples), float32)
        }
    """

    def __init__(
        self,
        env: gym.Env,
        x_dim: int,
        y_dim: int,
        audio_format: Literal["raw", "spectrogram"] = "raw",
        n_mels: int = 64,
        **kwargs,
    ):
        """
        Initialize BimodalWaveformWrapper.

        Args:
            env: CraftGround environment
            x_dim: Image width
            y_dim: Image height
            audio_format: "raw" for waveform, "spectrogram" for mel spectrogram
            n_mels: Number of mel bands if using spectrogram
        """
        super().__init__(env)
        self.x_dim = x_dim
        self.y_dim = y_dim
        self.audio_format = audio_format
        self.n_mels = n_mels

        # Audio parameters
        self.sample_rate = 44100
        self.channels = 2
        self.buffer_samples = 2205

        # Setup observation space
        vision_space = gym.spaces.Box(
            low=0,
            high=255,
            shape=(y_dim, x_dim, 3),
            dtype=np.uint8,
        )

        if audio_format == "raw":
            audio_space = gym.spaces.Box(
                low=-1.0,
                high=1.0,
                shape=(self.channels, self.buffer_samples),
                dtype=np.float32,
            )
        else:  # spectrogram
            time_frames = self.buffer_samples // 512 + 1
            audio_space = gym.spaces.Box(
                low=-80.0, high=0.0, shape=(n_mels, time_frames), dtype=np.float32
            )

        self.observation_space = gym.spaces.Dict(
            {
                "vision": vision_space,
                "audio": audio_space,
            }
        )

    def _decode_audio(self, info: dict) -> np.ndarray:
        """Decode audio from observation info."""
        obs_info = info.get("obs")
        if obs_info is None or not hasattr(obs_info, "audio_waveform"):
            return np.zeros((self.channels, self.buffer_samples), dtype=np.float32)

        audio_waveform = obs_info.audio_waveform
        if audio_waveform is None or not hasattr(audio_waveform, "pcm_data"):
            return np.zeros((self.channels, self.buffer_samples), dtype=np.float32)

        pcm_data = audio_waveform.pcm_data
        if len(pcm_data) == 0:
            return np.zeros((self.channels, self.buffer_samples), dtype=np.float32)

        # Decode 16-bit PCM
        audio = np.frombuffer(pcm_data, dtype=np.int16).astype(np.float32) / 32768.0

        # Reshape for stereo
        if self.channels == 2 and len(audio) >= 2:
            if len(audio) % 2 != 0:
                audio = audio[:-1]
            audio = audio.reshape(-1, 2).T

        if self.audio_format == "spectrogram":
            try:
                import librosa

                mono = np.mean(audio, axis=0) if audio.ndim == 2 else audio
                mel_spec = librosa.feature.melspectrogram(
                    y=mono, sr=self.sample_rate, n_mels=self.n_mels
                )
                return librosa.power_to_db(mel_spec, ref=np.max).astype(np.float32)
            except ImportError:
                pass

        return audio.astype(np.float32)

    def step(self, action: WrapperActType):
        obs, reward, terminated, truncated, info = self.env.step(action)

        rgb = info.get("pov", np.zeros((self.y_dim, self.x_dim, 3), dtype=np.uint8))
        audio = self._decode_audio(info)

        return (
            {"vision": rgb, "audio": audio},
            reward,
            terminated,
            truncated,
            info,
        )

    def reset(
        self, *, seed: Optional[int] = None, options: Optional[dict[str, Any]] = None
    ):
        obs, info = self.env.reset(seed=seed, options=options)

        rgb = info.get("pov", np.zeros((self.y_dim, self.x_dim, 3), dtype=np.uint8))
        audio = self._decode_audio(info)

        return {"vision": rgb, "audio": audio}, info
Copilot AI Jan 18, 2026

The new BimodalWaveformWrapper class lacks test coverage. Given that the repository has comprehensive test coverage for other components (see tests/python/unit/), consider adding tests for the bimodal observation handling, audio decoding, and the interaction between vision and audio components.
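
A hedged starting point for such a test (the use of __new__ to bypass environment construction is an assumption about how the class can be exercised in isolation):

import numpy as np
from types import SimpleNamespace
from craftground.wrappers.waveform import BimodalWaveformWrapper

def test_decode_audio_raw_shape():
    w = BimodalWaveformWrapper.__new__(BimodalWaveformWrapper)  # skip __init__
    w.channels, w.buffer_samples, w.audio_format = 2, 4, "raw"
    pcm = np.array([1, -1, 2, -2, 3, -3, 4, -4], dtype=np.int16).tobytes()
    info = {"obs": SimpleNamespace(audio_waveform=SimpleNamespace(pcm_data=pcm))}
    audio = w._decode_audio(info)
    assert audio.shape == (2, 4)
    assert audio.dtype == np.float32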


// Convert to byte array for transmission
buffer.rewind()
lastCapturedSamples = ByteArray(BUFFER_SIZE_BYTES)
Copilot AI Jan 18, 2026

The lastCapturedSamples ByteArray is reallocated on every call to renderSamples() (line 171), even though the size is constant (BUFFER_SIZE_BYTES). This creates unnecessary garbage collection pressure. Consider allocating it once during initialization and reusing it.

Suggested change
lastCapturedSamples = ByteArray(BUFFER_SIZE_BYTES)
if (lastCapturedSamples.size != BUFFER_SIZE_BYTES) {
    lastCapturedSamples = ByteArray(BUFFER_SIZE_BYTES)
}

import com.google.protobuf.ByteString
import org.lwjgl.openal.ALC10
import org.lwjgl.openal.ALC11
import org.lwjgl.openal.ALCCapabilities
Copilot AI Jan 18, 2026

The import statement includes org.lwjgl.openal.ALCCapabilities but ALCCapabilities is never used in the code. This import should be removed.

Suggested change
import org.lwjgl.openal.ALCCapabilities

Comment on lines +25 to +301
class WaveformWrapper(gym.Wrapper):
    """
    Wrapper that extracts audio waveform data from CraftGround observations.

    The audio data is captured via OpenAL SOFT_loopback extension on the
    Minecraft side and transmitted via Protobuf.

    Attributes:
        sample_rate: Audio sample rate in Hz (typically 44100)
        channels: Number of audio channels (1=mono, 2=stereo)
        buffer_samples: Number of samples per channel in each observation
        output_format: "raw" for PCM data, "spectrogram" for mel spectrogram
    """

    def __init__(
        self,
        env: gym.Env,
        output_format: Literal["raw", "spectrogram", "mfcc"] = "raw",
        normalize: bool = True,
        n_mels: int = 64,
        n_fft: int = 1024,
        hop_length: int = 512,
        include_subtitle: bool = False,
        **kwargs,
    ):
        """
        Initialize the WaveformWrapper.

        Args:
            env: The CraftGround environment to wrap
            output_format: Output format for audio data:
                - "raw": Raw PCM samples as float32 array
                - "spectrogram": Mel spectrogram (requires librosa)
                - "mfcc": MFCC features (requires librosa)
            normalize: Whether to normalize audio to [-1, 1] range
            n_mels: Number of mel bands for spectrogram
            n_fft: FFT window size for spectrogram
            hop_length: Hop length for spectrogram
            include_subtitle: Whether to also include subtitle-based sound info
        """
        super().__init__(env)
        self.output_format = output_format
        self.normalize = normalize
        self.n_mels = n_mels
        self.n_fft = n_fft
        self.hop_length = hop_length
        self.include_subtitle = include_subtitle

        # Default audio parameters (will be updated from actual data)
        self.sample_rate = 44100
        self.channels = 2
        self.buffer_samples = 2205  # ~50ms at 44100Hz

        # Setup observation space based on output format
        self._setup_observation_space()

        # Lazy load librosa if needed
        self._librosa = None
        if output_format in ["spectrogram", "mfcc"]:
            try:
                import librosa

                self._librosa = librosa
            except ImportError:
                raise ImportError(
                    f"librosa is required for output_format='{output_format}'. "
                    "Install it with: pip install librosa"
                )

    def _setup_observation_space(self):
        """Setup the observation space based on output format."""
        if self.output_format == "raw":
            # Raw PCM: (channels, samples) or (samples,) for mono
            if self.channels == 1:
                shape = (self.buffer_samples,)
            else:
                shape = (self.channels, self.buffer_samples)

            audio_space = gym.spaces.Box(
                low=-1.0 if self.normalize else -32768,
                high=1.0 if self.normalize else 32767,
                shape=shape,
                dtype=np.float32,
            )

        elif self.output_format == "spectrogram":
            # Mel spectrogram: (n_mels, time_frames)
            time_frames = self.buffer_samples // self.hop_length + 1
            audio_space = gym.spaces.Box(
                low=0.0, high=np.inf, shape=(self.n_mels, time_frames), dtype=np.float32
            )

        elif self.output_format == "mfcc":
            # MFCC: (n_mfcc, time_frames)
            time_frames = self.buffer_samples // self.hop_length + 1
            n_mfcc = 13  # Standard number of MFCCs
            audio_space = gym.spaces.Box(
                low=-np.inf, high=np.inf, shape=(n_mfcc, time_frames), dtype=np.float32
            )
        else:
            raise ValueError(f"Unknown output_format: {self.output_format}")

        # Create composite observation space
        spaces = {"waveform": audio_space}

        if self.include_subtitle:
            # Add subtitle info if requested
            # This would need to be coordinated with the existing sound wrapper
            pass

        self.observation_space = gym.spaces.Dict(spaces)

    def _decode_pcm(self, pcm_bytes: bytes) -> np.ndarray:
        """
        Decode raw PCM bytes to numpy array.

        Args:
            pcm_bytes: Raw PCM data (16-bit signed, little-endian)

        Returns:
            Audio samples as float32 array, shape (channels, samples) or (samples,)
        """
        if len(pcm_bytes) == 0:
            # Return silence if no audio data
            if self.channels == 1:
                return np.zeros(self.buffer_samples, dtype=np.float32)
            else:
                return np.zeros((self.channels, self.buffer_samples), dtype=np.float32)

        # Decode 16-bit signed PCM
        audio = np.frombuffer(pcm_bytes, dtype=np.int16)

        # Convert to float32
        audio = audio.astype(np.float32)

        # Normalize to [-1, 1] if requested
        if self.normalize:
            audio = audio / 32768.0

        # Reshape for stereo
        if self.channels == 2:
            # Interleaved stereo: L R L R L R ...
            if len(audio) % 2 == 0:
                audio = audio.reshape(-1, 2).T  # Shape: (2, samples)
            else:
                # Handle odd length by padding
                audio = np.pad(audio, (0, 1))
                audio = audio.reshape(-1, 2).T

        return audio

    def _compute_spectrogram(self, audio: np.ndarray) -> np.ndarray:
        """Compute mel spectrogram from audio."""
        if self._librosa is None:
            raise RuntimeError("librosa not available")

        # Use mono for spectrogram
        if audio.ndim == 2:
            audio = np.mean(audio, axis=0)

        # Compute mel spectrogram
        mel_spec = self._librosa.feature.melspectrogram(
            y=audio,
            sr=self.sample_rate,
            n_mels=self.n_mels,
            n_fft=self.n_fft,
            hop_length=self.hop_length,
        )

        # Convert to dB scale
        mel_spec_db = self._librosa.power_to_db(mel_spec, ref=np.max)

        return mel_spec_db.astype(np.float32)

    def _compute_mfcc(self, audio: np.ndarray) -> np.ndarray:
        """Compute MFCC features from audio."""
        if self._librosa is None:
            raise RuntimeError("librosa not available")

        # Use mono for MFCC
        if audio.ndim == 2:
            audio = np.mean(audio, axis=0)

        # Compute MFCCs
        mfcc = self._librosa.feature.mfcc(
            y=audio,
            sr=self.sample_rate,
            n_mfcc=13,
            n_fft=self.n_fft,
            hop_length=self.hop_length,
        )

        return mfcc.astype(np.float32)

    def _extract_waveform(self, info: dict) -> np.ndarray:
        """Extract and process waveform from observation info."""
        obs_info = info.get("obs")
        if obs_info is None:
            return self._get_silence()

        # Check if audio waveform is available
        if not hasattr(obs_info, "audio_waveform") or obs_info.audio_waveform is None:
            return self._get_silence()

        audio_waveform = obs_info.audio_waveform

        # Update audio parameters from actual data
        if hasattr(audio_waveform, "sample_rate") and audio_waveform.sample_rate > 0:
            self.sample_rate = audio_waveform.sample_rate
        if hasattr(audio_waveform, "channels") and audio_waveform.channels > 0:
            self.channels = audio_waveform.channels
        if hasattr(audio_waveform, "num_samples") and audio_waveform.num_samples > 0:
            self.buffer_samples = audio_waveform.num_samples

        # Get PCM data
        pcm_data = (
            audio_waveform.pcm_data if hasattr(audio_waveform, "pcm_data") else b""
        )

        # Decode PCM
        audio = self._decode_pcm(pcm_data)

        # Process based on output format
        if self.output_format == "raw":
            return audio
        elif self.output_format == "spectrogram":
            return self._compute_spectrogram(audio)
        elif self.output_format == "mfcc":
            return self._compute_mfcc(audio)
        else:
            return audio

    def _get_silence(self) -> np.ndarray:
        """Return silence array matching the expected shape."""
        if self.output_format == "raw":
            if self.channels == 1:
                return np.zeros(self.buffer_samples, dtype=np.float32)
            else:
                return np.zeros((self.channels, self.buffer_samples), dtype=np.float32)
        elif self.output_format == "spectrogram":
            time_frames = self.buffer_samples // self.hop_length + 1
            return np.zeros((self.n_mels, time_frames), dtype=np.float32)
        elif self.output_format == "mfcc":
            time_frames = self.buffer_samples // self.hop_length + 1
            return np.zeros((13, time_frames), dtype=np.float32)
        else:
            return np.zeros(self.buffer_samples, dtype=np.float32)

    def step(
        self, action: WrapperActType
    ) -> tuple[WrapperObsType, SupportsFloat, bool, bool, dict[str, Any]]:
        """Execute action and return observation with waveform."""
        obs, reward, terminated, truncated, info = self.env.step(action)

        waveform = self._extract_waveform(info)

        return (
            {"waveform": waveform},
            reward,
            terminated,
            truncated,
            info,
        )

    def reset(
        self,
        *,
        seed: Optional[int] = None,
        options: Optional[dict[str, Any]] = None,
    ) -> tuple[WrapperObsType, dict[str, Any]]:
        """Reset environment and return initial observation with waveform."""
        obs, info = self.env.reset(seed=seed, options=options)

        waveform = self._extract_waveform(info)

        return {"waveform": waveform}, info

Copilot AI Jan 18, 2026

The new WaveformWrapper class lacks test coverage. Given that the repository has comprehensive test coverage for other components (see tests/python/unit/), consider adding tests for at least the critical functionality: PCM decoding, audio format conversions (raw, spectrogram, MFCC), handling of missing audio data, and observation space consistency.
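
A hedged sketch of one such unit test for the PCM path (again assuming __new__ can be used to sidestep environment construction):

import numpy as np
from craftground.wrappers.waveform import WaveformWrapper

def test_decode_pcm_stereo_normalized():
    w = WaveformWrapper.__new__(WaveformWrapper)  # skip __init__
    w.channels, w.normalize, w.buffer_samples = 2, True, 2
    pcm = np.array([16384, -16384, 32767, -32768], dtype=np.int16).tobytes()
    audio = w._decode_pcm(pcm)
    assert audio.shape == (2, 2)         # (channels, samples)
    assert np.isclose(audio[0, 0], 0.5)  # 16384 / 32768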

Comment on lines +135 to +147
/**
* Render audio samples from the loopback device.
* Should be called each tick to capture the latest audio.
*
* Note: This requires the loopback context to be current when sources are playing.
* In practice, this means we need to either:
* 1. Use a shared context approach
* 2. Mirror all sound commands to the loopback device
* 3. Use OS-level audio loopback as fallback
*
* For now, this demonstrates the API usage. Full integration requires
* mirroring the SoundSystem's source operations to the loopback context.
*/
Copilot AI Jan 18, 2026

The comment on lines 139-146 describes incomplete behavior. It states "For now, this demonstrates the API usage. Full integration requires mirroring the SoundSystem's source operations to the loopback context." This suggests that the current implementation may not capture actual game audio correctly. This limitation should be clearly documented in the main docstring or as a warning in the code, as users might expect the audio capture to work fully.

Comment on lines +113 to +114
audio_space = gym.spaces.Box(
    low=0.0, high=np.inf, shape=(self.n_mels, time_frames), dtype=np.float32
Copilot AI Jan 18, 2026

The spectrogram bounds in the observation space are set to low=0.0, high=np.inf, but in _compute_spectrogram() the mel spectrogram is converted to dB scale which can produce negative values (power_to_db returns values typically in the range [-80, 0] dB). This mismatch between the observation space definition and actual values could cause issues with RL algorithms that rely on correct bounds.

Suggested change
audio_space = gym.spaces.Box(
    low=0.0, high=np.inf, shape=(self.n_mels, time_frames), dtype=np.float32
# dB-scaled mel spectrogram from librosa.power_to_db is typically in [-80, 0] dB
audio_space = gym.spaces.Box(
    low=-80.0, high=0.0, shape=(self.n_mels, time_frames), dtype=np.float32

Comment on lines +407 to +408
except ImportError:
    pass
Copilot AI Jan 18, 2026

When audio_format is "spectrogram" but librosa import fails, the method falls through and returns raw audio data instead (line 410). This creates a shape mismatch with the observation space: the space expects shape (n_mels, time_frames) for spectrograms but the returned data has shape (channels, buffer_samples). Either raise an error when librosa is unavailable for spectrogram format, or return zeros with the correct spectrogram shape.

Suggested change
except ImportError:
    pass
except ImportError as exc:
    raise RuntimeError(
        "BimodalWaveformWrapper is configured with audio_format='spectrogram' "
        "but the 'librosa' package is not installed. Install librosa or "
        "use audio_format='raw'."
    ) from exc

@coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/craftground/proto/observation_space_pb2.py (1)

1-6: Update line 3 comment to reflect that generated protobuf code is intentionally committed.

The file header at line 3 states "NO CHECKED-IN PROTOBUF GENCODE", but this file—along with action_space_pb2.py and initial_environment_pb2.py—is intentionally committed to the repository. Remove this auto-generated comment or replace it with a note explaining the project's decision to commit pre-generated protobuf code.

🧹 Nitpick comments (1)
src/craftground/MinecraftEnv/src/main/java/com/kyhsgeekcode/minecraftenv/proto/ObservationSpace.java (1)

12660-13582: Consider capping raw PCM payload size upstream.

pcm_data can grow quickly; enforce a max buffer length (or segmenting) in the producer to prevent oversized observations or memory spikes.
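
A hedged consumer-side illustration of the idea (the real cap belongs in the Kotlin producer; the constant and helper are assumptions):

MAX_PCM_BYTES = 2 * 2 * 44100  # 1 s of 16-bit stereo at 44.1 kHz

def checked_pcm(pcm_bytes: bytes) -> bytes:
    """Reject oversized audio payloads before decoding."""
    if len(pcm_bytes) > MAX_PCM_BYTES:
        raise ValueError(f"pcm_data too large: {len(pcm_bytes)} bytes")
    return pcm_bytes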
