Conversation
Note: Other AI code review bot(s) detected. CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Adds OpenAL SOFT_loopback-based audio loopback capture to the Minecraft environment, embeds captured PCM into observations via a new AudioWaveform protobuf message, integrates the capture lifecycle via a SoundEngine mixin, and provides Python wrappers to decode/transform waveform data (raw PCM, mel spectrogram, MFCC).
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant SoundEngine as SoundEngine
    participant Mixin as SoundEngineMixin
    participant Capturer as AudioLoopbackCapturer
    participant OpenAL as OpenAL
    SoundEngine->>Mixin: init() [TAIL]
    Mixin->>Capturer: initialize(mainDevice)
    Capturer->>OpenAL: check ALC_SOFT_loopback support
    OpenAL-->>Capturer: support status
    alt supported
        Capturer->>OpenAL: alcLoopbackOpenDeviceSOFT()
        OpenAL-->>Capturer: loopbackDevice
        Capturer->>OpenAL: alcCreateContext(...)
        OpenAL-->>Capturer: loopbackContext
        Capturer->>Capturer: allocate buffers, mark initialized
        Capturer-->>Mixin: initialized
    else not supported
        Capturer-->>Mixin: initialization failed
    end
    SoundEngine->>Mixin: close() [HEAD]
    Mixin->>Capturer: close()
    Capturer->>OpenAL: destroy context, close device
```
```mermaid
sequenceDiagram
    participant Game as Game Loop
    participant Env as MinecraftEnv
    participant Capturer as AudioLoopbackCapturer
    participant Proto as Protobuf
    participant Agent as Python Wrapper
    Game->>Env: world tick
    Env->>Capturer: renderSamples()
    Capturer->>Capturer: make loopback context current
    Capturer->>OpenAL: alcRenderSamplesSOFT(...)
    Capturer->>Capturer: convert shorts -> little-endian bytes
    Capturer-->>Env: pcm bytes + metadata
    Env->>Proto: build AudioWaveform message
    Env-->>Game: observation includes audio_waveform
    Game->>Agent: observation delivered
    Agent->>Agent: decode pcm, optional librosa processing
```
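The last step in the diagram — decoding the PCM payload on the Python side — can be sketched as follows. This is a minimal illustration assuming 16-bit little-endian interleaved stereo as described in the walkthrough; `pcm_bytes` stands in for the protobuf `pcm_data` field and is not a name from the PR:

```python
import numpy as np

def decode_pcm_stereo(pcm_bytes: bytes) -> np.ndarray:
    """Decode 16-bit LE interleaved stereo PCM to (2, samples) float32 in [-1, 1]."""
    audio = np.frombuffer(pcm_bytes, dtype="<i2").astype(np.float32) / 32768.0
    return audio.reshape(-1, 2).T  # de-interleave into (left, right) rows

# Example: two stereo frames (L=0, R=32767) and (L=-32768, R=0)
pcm = np.array([0, 32767, -32768, 0], dtype="<i2").tobytes()
decoded = decode_pcm_stereo(pcm)
# decoded[0] is the left channel, decoded[1] the right channel
```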
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed (2 warnings)
Warning: There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it is a critical failure.

🔧 ast-grep (0.40.5): src/craftground/MinecraftEnv/src/main/java/com/kyhsgeekcode/minecraftenv/proto/ObservationSpace.java
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/craftground/MinecraftEnv/src/main/java/com/kyhsgeekcode/minecraftenv/MinecraftEnv.kt (1)
49-49: Remove unused import flagged by ktlint. The pipeline indicates this import is unused.
🔧 Proposed fix

```diff
-import net.minecraft.world.World
```
🤖 Fix all issues with AI agents
In
`@src/craftground/MinecraftEnv/src/main/java/com/kyhsgeekcode/minecraftenv/AudioLoopbackCapturer.kt`:
- Line 10: Remove the unused import java.nio.ByteBuffer from the
AudioLoopbackCapturer.kt file; open the AudioLoopbackCapturer class/companion
and delete the unused import line so ktlint no longer flags ByteBuffer as unused
and build/lint passes.
- Around line 5-6: Remove the unused imports ALC11 and ALCCapabilities from
AudioLoopbackCapturer.kt; locate the import lines at the top of the file (the
import statements for org.lwjgl.openal.ALC11 and
org.lwjgl.openal.ALCCapabilities) and delete them, then re-run ktlint/gradle
build to confirm no remaining references to these symbols and that the file
compiles cleanly.
In `@src/craftground/wrappers/waveform.py`:
- Around line 110-115: The spectrogram Box is incorrectly bounded with
low=0.0/high=np.inf; update the spectrogram observation space creation in the
code path where self.output_format == "spectrogram" (the audio_space
construction using self.n_mels and time_frames) to use realistic dB bounds
(e.g., low=-80.0, high=0.0, dtype=np.float32) to match power_to_db output and
mirror the BimodalWaveformWrapper spectrogram bounds; apply the same change to
the other spectrogram Box construction referenced around the later occurrence
(the block at lines ~194-197).
- Around line 94-135: The observation_space declared in _setup_observation_space
can diverge from actual outputs because _extract_waveform mutates buffer_samples
and channels at runtime; update the code so that either (A) after
_extract_waveform updates buffer_samples/channels you call
_setup_observation_space() to rebuild observation_space, or (B) keep
observation_space fixed and enforce that _extract_waveform returns observations
padded/truncated to match the declared space (pad with zeros or trim samples and
reshape mono/stereo to match channels), and make this normalization explicit in
_extract_waveform; reference _setup_observation_space, _extract_waveform,
observation_space, buffer_samples and channels and apply the chosen approach
consistently where buffer_samples/channels are changed (end of the update block
around lines 232-237).
- Around line 398-410: When audio_format=="spectrogram" in WaveformWrapper, do
not silently fall back when librosa is missing; instead check for librosa
availability during WaveformWrapper.__init__ (or the class constructor) and
either raise an ImportError (or ValueError) describing that librosa is required
for spectrogram mode or emit a clear warning via the existing logger; update the
runtime check in the spectrogram branch (the code that imports librosa and calls
librosa.feature.melspectrogram / librosa.power_to_db) to assume librosa is
present so the import error cannot be silently ignored.
🧹 Nitpick comments (8)
src/craftground/MinecraftEnv/src/main/java/com/kyhsgeekcode/minecraftenv/mixin/SoundEngineMixin.java (1)
6-6: Remove unused import.

`ALC10` is not used directly in this mixin; the OpenAL operations are delegated to `AudioLoopbackCapturer`.

🔧 Proposed fix

```diff
-import org.lwjgl.openal.ALC10;
```

src/craftground/wrappers/waveform.py (7)
48-48: Unused `kwargs` parameter.

The `**kwargs` parameter is accepted but never used. If it's intended for future extensibility, consider documenting it; otherwise, remove it to avoid confusion.

♻️ Suggested fix

```diff
 def __init__(
     self,
     env: gym.Env,
     output_format: Literal["raw", "spectrogram", "mfcc"] = "raw",
     normalize: bool = True,
     n_mels: int = 64,
     n_fft: int = 1024,
     hop_length: int = 512,
     include_subtitle: bool = False,
-    **kwargs,
 ):
```
84-92: Add exception chaining for better debugging.

When re-raising a new exception from within an `except` block, use `from err` to preserve the original traceback, or `from None` to explicitly suppress it.

♻️ Suggested fix

```diff
 if output_format in ["spectrogram", "mfcc"]:
     try:
         import librosa

         self._librosa = librosa
-    except ImportError:
+    except ImportError as err:
         raise ImportError(
             f"librosa is required for output_format='{output_format}'. "
             "Install it with: pip install librosa"
-        )
+        ) from err
```
130-133: Placeholder code for `include_subtitle` does nothing.

The `include_subtitle` parameter is accepted and stored, but the corresponding logic is a no-op. Either implement the feature or remove the parameter to avoid misleading users.
277-277: Prefix unused variables with underscore.

The `obs` variable is intentionally unused. Prefix it with `_` to signal this and silence linter warnings.

♻️ Suggested fix

```diff
 def step(
     self, action: WrapperActType
 ) -> tuple[WrapperObsType, SupportsFloat, bool, bool, dict[str, Any]]:
     """Execute action and return observation with waveform."""
-    obs, reward, terminated, truncated, info = self.env.step(action)
+    _obs, reward, terminated, truncated, info = self.env.step(action)
 ...
 def reset(
     ...
 ) -> tuple[WrapperObsType, dict[str, Any]]:
     """Reset environment and return initial observation with waveform."""
-    obs, info = self.env.reset(seed=seed, options=options)
+    _obs, info = self.env.reset(seed=seed, options=options)
```

Also applies to: 296-296
324-324: Unused `kwargs` parameter.

Same issue as in `WaveformWrapper`: the `**kwargs` parameter is unused.
412-434: Prefix unused `obs` variables with underscore.

Same issue as in `WaveformWrapper`: the `obs` variables at lines 413 and 429 are unused.

♻️ Suggested fix

```diff
 def step(self, action: WrapperActType):
-    obs, reward, terminated, truncated, info = self.env.step(action)
+    _obs, reward, terminated, truncated, info = self.env.step(action)
 ...
 def reset(
     self, *, seed: Optional[int] = None, options: Optional[dict[str, Any]] = None
 ):
-    obs, info = self.env.reset(seed=seed, options=options)
+    _obs, info = self.env.reset(seed=seed, options=options)
```
375-410: Consider extracting shared PCM decoding logic.

`BimodalWaveformWrapper._decode_audio` duplicates much of the PCM decoding logic from `WaveformWrapper._decode_pcm`. Consider extracting a shared helper function to reduce duplication.
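One possible shape for such a shared helper is sketched below. This is an illustration, not the repository's code; the function name `decode_pcm16` and its defaults are assumptions:

```python
import numpy as np

def decode_pcm16(pcm_bytes: bytes, channels: int = 2, normalize: bool = True) -> np.ndarray:
    """Decode 16-bit LE PCM into float32, shaped (channels, samples) or (samples,)."""
    audio = np.frombuffer(pcm_bytes, dtype="<i2").astype(np.float32)
    if normalize:
        audio /= 32768.0
    if channels > 1:
        if len(audio) % channels != 0:  # pad a partial final frame with silence
            audio = np.pad(audio, (0, channels - len(audio) % channels))
        audio = audio.reshape(-1, channels).T
    return audio

# Both wrappers could then delegate, e.g.:
#   WaveformWrapper._decode_pcm          -> decode_pcm16(b, self.channels, self.normalize)
#   BimodalWaveformWrapper._decode_audio -> decode_pcm16(b, self.channels, True)
```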
```kotlin
import org.lwjgl.openal.ALC11
import org.lwjgl.openal.ALCCapabilities
```
Remove unused imports flagged by ktlint.
Pipeline indicates ALC11 and ALCCapabilities are unused.
🔧 Proposed fix

```diff
-import org.lwjgl.openal.ALC11
-import org.lwjgl.openal.ALCCapabilities
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
🧰 Tools
🪛 GitHub Actions: Format Check for Kotlin
[error] 5-5: ktlint: Unused import (standard:no-unused-imports).
[error] 6-6: ktlint: Unused import (standard:no-unused-imports).
🤖 Prompt for AI Agents
In
`@src/craftground/MinecraftEnv/src/main/java/com/kyhsgeekcode/minecraftenv/AudioLoopbackCapturer.kt`
around lines 5 - 6, Remove the unused imports ALC11 and ALCCapabilities from
AudioLoopbackCapturer.kt; locate the import lines at the top of the file (the
import statements for org.lwjgl.openal.ALC11 and
org.lwjgl.openal.ALCCapabilities) and delete them, then re-run ktlint/gradle
build to confirm no remaining references to these symbols and that the file
compiles cleanly.
```kotlin
import org.lwjgl.openal.SOFTLoopback
import org.lwjgl.system.MemoryStack
import org.lwjgl.system.MemoryUtil
import java.nio.ByteBuffer
```
Remove unused import flagged by ktlint.
Pipeline indicates ByteBuffer is unused.
🔧 Proposed fix

```diff
-import java.nio.ByteBuffer
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
🧰 Tools
🪛 GitHub Actions: Format Check for Kotlin
[error] 10-10: ktlint: Unused import (standard:no-unused-imports).
🤖 Prompt for AI Agents
In
`@src/craftground/MinecraftEnv/src/main/java/com/kyhsgeekcode/minecraftenv/AudioLoopbackCapturer.kt`
at line 10, Remove the unused import java.nio.ByteBuffer from the
AudioLoopbackCapturer.kt file; open the AudioLoopbackCapturer class/companion
and delete the unused import line so ktlint no longer flags ByteBuffer as unused
and build/lint passes.
```python
def _setup_observation_space(self):
    """Setup the observation space based on output format."""
    if self.output_format == "raw":
        # Raw PCM: (channels, samples) or (samples,) for mono
        if self.channels == 1:
            shape = (self.buffer_samples,)
        else:
            shape = (self.channels, self.buffer_samples)

        audio_space = gym.spaces.Box(
            low=-1.0 if self.normalize else -32768,
            high=1.0 if self.normalize else 32767,
            shape=shape,
            dtype=np.float32,
        )

    elif self.output_format == "spectrogram":
        # Mel spectrogram: (n_mels, time_frames)
        time_frames = self.buffer_samples // self.hop_length + 1
        audio_space = gym.spaces.Box(
            low=0.0, high=np.inf, shape=(self.n_mels, time_frames), dtype=np.float32
        )

    elif self.output_format == "mfcc":
        # MFCC: (n_mfcc, time_frames)
        time_frames = self.buffer_samples // self.hop_length + 1
        n_mfcc = 13  # Standard number of MFCCs
        audio_space = gym.spaces.Box(
            low=-np.inf, high=np.inf, shape=(n_mfcc, time_frames), dtype=np.float32
        )
    else:
        raise ValueError(f"Unknown output_format: {self.output_format}")

    # Create composite observation space
    spaces = {"waveform": audio_space}

    if self.include_subtitle:
        # Add subtitle info if requested
        # This would need to be coordinated with the existing sound wrapper
        pass

    self.observation_space = gym.spaces.Dict(spaces)
```
Observation space shape may not match actual observations.
The observation space is initialized with default values (buffer_samples=2205, channels=2), but _extract_waveform updates these instance variables at runtime (lines 232-237) without rebuilding the observation space. This causes a mismatch between the declared observation_space shape and the actual observation shape, which violates Gymnasium's contract and can break downstream code that relies on observation_space.
Consider either:
- Making audio parameters fixed and documenting them as requirements
- Calling `_setup_observation_space()` after updating parameters (though this has its own issues with dynamic spaces)
- Padding/truncating observations to match the declared space
Also applies to: 232-237
🧰 Tools
🪛 Ruff (0.14.13)
125-125: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
In `@src/craftground/wrappers/waveform.py` around lines 94 - 135, The
observation_space declared in _setup_observation_space can diverge from actual
outputs because _extract_waveform mutates buffer_samples and channels at
runtime; update the code so that either (A) after _extract_waveform updates
buffer_samples/channels you call _setup_observation_space() to rebuild
observation_space, or (B) keep observation_space fixed and enforce that
_extract_waveform returns observations padded/truncated to match the declared
space (pad with zeros or trim samples and reshape mono/stereo to match
channels), and make this normalization explicit in _extract_waveform; reference
_setup_observation_space, _extract_waveform, observation_space, buffer_samples
and channels and apply the chosen approach consistently where
buffer_samples/channels are changed (end of the update block around lines
232-237).
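Option (B) above — normalizing every observation to the declared space — could look roughly like the sketch below. `_fit_to_space` is a hypothetical helper name, not code from the PR:

```python
import numpy as np

def _fit_to_space(audio: np.ndarray, channels: int, buffer_samples: int) -> np.ndarray:
    """Pad with zeros or truncate so the array matches the declared Box shape."""
    if channels == 1:
        audio = audio.reshape(-1)  # force mono layout
        out = np.zeros(buffer_samples, dtype=np.float32)
        n = min(len(audio), buffer_samples)
        out[:n] = audio[:n]
        return out
    audio = np.atleast_2d(audio)
    out = np.zeros((channels, buffer_samples), dtype=np.float32)
    rows = min(audio.shape[0], channels)
    cols = min(audio.shape[1], buffer_samples)
    out[:rows, :cols] = audio[:rows, :cols]
    return out

# A short observation is zero-padded; a long one is truncated:
short = _fit_to_space(np.ones((2, 100), dtype=np.float32), 2, 2205)
long_ = _fit_to_space(np.ones((2, 5000), dtype=np.float32), 2, 2205)
```

With this in place, `buffer_samples` and `channels` can stay fixed at construction time and the `observation_space` contract holds regardless of what the capture side delivers.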
```python
elif self.output_format == "spectrogram":
    # Mel spectrogram: (n_mels, time_frames)
    time_frames = self.buffer_samples // self.hop_length + 1
    audio_space = gym.spaces.Box(
        low=0.0, high=np.inf, shape=(self.n_mels, time_frames), dtype=np.float32
    )
```
Spectrogram observation space bounds mismatch.
The spectrogram observation space declares low=0.0 (line 114), but power_to_db returns values in dB scale which are typically negative (e.g., -80 to 0 dB). The BimodalWaveformWrapper correctly uses low=-80.0, high=0.0 for its spectrogram space (line 365).
♻️ Suggested fix

```diff
 elif self.output_format == "spectrogram":
     # Mel spectrogram: (n_mels, time_frames)
     time_frames = self.buffer_samples // self.hop_length + 1
     audio_space = gym.spaces.Box(
-        low=0.0, high=np.inf, shape=(self.n_mels, time_frames), dtype=np.float32
+        low=-80.0, high=0.0, shape=(self.n_mels, time_frames), dtype=np.float32
     )
```

Also applies to: 194-197
🤖 Prompt for AI Agents
In `@src/craftground/wrappers/waveform.py` around lines 110 - 115, The spectrogram
Box is incorrectly bounded with low=0.0/high=np.inf; update the spectrogram
observation space creation in the code path where self.output_format ==
"spectrogram" (the audio_space construction using self.n_mels and time_frames)
to use realistic dB bounds (e.g., low=-80.0, high=0.0, dtype=np.float32) to
match power_to_db output and mirror the BimodalWaveformWrapper spectrogram
bounds; apply the same change to the other spectrogram Box construction
referenced around the later occurrence (the block at lines ~194-197).
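The reasoning behind the `-80.0`/`0.0` bounds can be checked without librosa: `power_to_db(S, ref=np.max)` computes `10 * log10(S / S.max())` and clips to the top `top_db` dB (80 by default), so its maximum is exactly 0 dB and its floor is -80 dB. A minimal numpy re-derivation, assuming librosa's documented defaults (`amin=1e-10`, `top_db=80.0`):

```python
import numpy as np

def power_to_db_ref_max(S: np.ndarray, top_db: float = 80.0) -> np.ndarray:
    """Numpy equivalent of librosa.power_to_db(S, ref=np.max) with default top_db."""
    log_spec = 10.0 * np.log10(np.maximum(S, 1e-10) / S.max())
    return np.maximum(log_spec, log_spec.max() - top_db)

spec = np.array([[1e-9, 0.5, 1.0]])
db = power_to_db_ref_max(spec)
# The peak maps to 0 dB and everything below the top_db window is clipped to -80 dB,
# so a Box with low=0.0 can never contain these values.
```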
```python
if self.audio_format == "spectrogram":
    try:
        import librosa

        mono = np.mean(audio, axis=0) if audio.ndim == 2 else audio
        mel_spec = librosa.feature.melspectrogram(
            y=mono, sr=self.sample_rate, n_mels=self.n_mels
        )
        return librosa.power_to_db(mel_spec, ref=np.max).astype(np.float32)
    except ImportError:
        pass

return audio.astype(np.float32)
```
Silent fallback on missing librosa could mask configuration errors.
When audio_format="spectrogram" is requested but librosa is unavailable, the code silently falls back to raw audio (lines 407-408). This could confuse users who expect spectrogram output. Consider either raising an error in __init__ (like WaveformWrapper does) or at least logging a warning.
♻️ Suggested fix - validate in __init__

```diff
 def __init__(
     ...
 ):
     super().__init__(env)
     self.x_dim = x_dim
     self.y_dim = y_dim
     self.audio_format = audio_format
     self.n_mels = n_mels
+
+    # Validate librosa availability for spectrogram
+    if audio_format == "spectrogram":
+        try:
+            import librosa  # noqa: F401
+        except ImportError as err:
+            raise ImportError(
+                "librosa is required for audio_format='spectrogram'. "
+                "Install it with: pip install librosa"
+            ) from err
```
+ ) from err🤖 Prompt for AI Agents
In `@src/craftground/wrappers/waveform.py` around lines 398 - 410, When
audio_format=="spectrogram" in WaveformWrapper, do not silently fall back when
librosa is missing; instead check for librosa availability during
WaveformWrapper.__init__ (or the class constructor) and either raise an
ImportError (or ValueError) describing that librosa is required for spectrogram
mode or emit a clear warning via the existing logger; update the runtime check
in the spectrogram branch (the code that imports librosa and calls
librosa.feature.melspectrogram / librosa.power_to_db) to assume librosa is
present so the import error cannot be silently ignored.
Pull request overview
This pull request adds waveform audio capture capability to CraftGround using the OpenAL SOFT_loopback extension. This enables reinforcement learning agents to receive raw audio waveform data from Minecraft, complementing the existing subtitle-based sound system.
Changes:
- Added protobuf message definition for audio waveform data (PCM format with metadata)
- Implemented Python wrappers (WaveformWrapper and BimodalWaveformWrapper) for audio observation processing with support for raw PCM, spectrogram, and MFCC formats
- Added Java/Kotlin audio loopback capture system using OpenAL SOFT_loopback extension to capture game audio
Reviewed changes
Copilot reviewed 6 out of 7 changed files in this pull request and generated 17 comments.
Show a summary per file
| File | Description |
|---|---|
| src/proto/observation_space.proto | Defines AudioWaveform protobuf message with PCM data and audio format metadata |
| src/craftground/wrappers/waveform.py | Implements WaveformWrapper and BimodalWaveformWrapper for processing audio observations |
| src/craftground/MinecraftEnv/src/main/resources/com.kyhsgeekcode.minecraftenv.mixin.json | Registers SoundEngineMixin for audio capture initialization |
| src/craftground/MinecraftEnv/src/main/java/com/kyhsgeekcode/minecraftenv/mixin/SoundEngineMixin.java | Hooks into Minecraft's sound engine to initialize audio loopback |
| src/craftground/MinecraftEnv/src/main/java/com/kyhsgeekcode/minecraftenv/MinecraftEnv.kt | Integrates audio waveform data into observation messages |
| src/craftground/MinecraftEnv/src/main/java/com/kyhsgeekcode/minecraftenv/AudioLoopbackCapturer.kt | Implements audio capture using OpenAL loopback device |
| .gitignore | Excludes Minecraft decompiled sources directory |
```python
# Update audio parameters from actual data
if hasattr(audio_waveform, "sample_rate") and audio_waveform.sample_rate > 0:
    self.sample_rate = audio_waveform.sample_rate
if hasattr(audio_waveform, "channels") and audio_waveform.channels > 0:
    self.channels = audio_waveform.channels
if hasattr(audio_waveform, "num_samples") and audio_waveform.num_samples > 0:
    self.buffer_samples = audio_waveform.num_samples
```
The observation space is set up during initialization with default audio parameters (sample_rate=44100, channels=2, buffer_samples=2205), but these parameters can be updated dynamically in _extract_waveform() based on actual audio data. This creates a mismatch where the observation space shape may not match the actual data shape if the audio parameters differ from defaults. Consider either enforcing fixed parameters or dynamically updating the observation space when parameters change.
```diff
-# Update audio parameters from actual data
-if hasattr(audio_waveform, "sample_rate") and audio_waveform.sample_rate > 0:
-    self.sample_rate = audio_waveform.sample_rate
-if hasattr(audio_waveform, "channels") and audio_waveform.channels > 0:
-    self.channels = audio_waveform.channels
-if hasattr(audio_waveform, "num_samples") and audio_waveform.num_samples > 0:
-    self.buffer_samples = audio_waveform.num_samples
+# Use configured audio parameters; do not mutate them at runtime to avoid
+# mismatches with the observation space. If needed, metadata from
+# audio_waveform can be read into local variables here without changing
+# self.sample_rate, self.channels, or self.buffer_samples.
```
```kotlin
// Convert to byte array for transmission
buffer.rewind()
lastCapturedSamples = ByteArray(BUFFER_SIZE_BYTES)
for (i in 0 until SAMPLES_PER_RENDER * CHANNELS) {
    val sample = buffer.get(i)
    lastCapturedSamples[i * 2] = (sample.toInt() and 0xFF).toByte()
    lastCapturedSamples[i * 2 + 1] = ((sample.toInt() shr 8) and 0xFF).toByte()
}
```
The byte conversion in renderSamples() uses buffer.get(i) in a loop to manually convert shorts to bytes. This is inefficient and error-prone. Consider using buffer.asShortBuffer() or a more direct bulk conversion method. Additionally, the manual little-endian conversion ((sample.toInt() and 0xFF) for low byte, (sample.toInt() shr 8) for high byte) may not handle signed values correctly - Kotlin's Byte is signed (-128 to 127), which could cause issues with the conversion.
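The sign concern can be sanity-checked outside Kotlin. In Python, the same mask-and-shift scheme produces exactly the little-endian bytes numpy emits for signed 16-bit samples, including negative values, which suggests the remaining benefit of a bulk conversion is efficiency rather than correctness. A small check (a sketch mirroring the loop above, not the repository's code):

```python
import numpy as np

def shorts_to_le_bytes(samples):
    """Mask-and-shift conversion mirroring the Kotlin loop in renderSamples()."""
    out = bytearray(len(samples) * 2)
    for i, s in enumerate(samples):
        out[i * 2] = s & 0xFF             # low byte
        out[i * 2 + 1] = (s >> 8) & 0xFF  # high byte
    return bytes(out)

samples = [0, 1, -1, 32767, -32768]
manual = shorts_to_le_bytes(samples)
bulk = np.array(samples, dtype="<i2").tobytes()  # the bulk-conversion alternative
# manual == bulk for the full signed range, -32768..32767
```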
```kotlin
import com.google.protobuf.ByteString
import org.lwjgl.openal.ALC10
import org.lwjgl.openal.ALC11
```
The import statement includes org.lwjgl.openal.ALC11 but ALC11 is never used in the code. This import should be removed.
```diff
-import org.lwjgl.openal.ALC11
```
```python
class BimodalWaveformWrapper(gym.Wrapper):
    """
    Wrapper that combines vision and raw audio waveform observations.

    This is similar to BimodalWrapper but uses raw waveform instead of
    subtitle-based sound encoding.

    Observation space:
        {
            "vision": Box(0, 255, (height, width, 3), uint8),
            "audio": Box(-1, 1, (channels, samples), float32)
        }
    """

    def __init__(
        self,
        env: gym.Env,
        x_dim: int,
        y_dim: int,
        audio_format: Literal["raw", "spectrogram"] = "raw",
        n_mels: int = 64,
        **kwargs,
    ):
        """
        Initialize BimodalWaveformWrapper.

        Args:
            env: CraftGround environment
            x_dim: Image width
            y_dim: Image height
            audio_format: "raw" for waveform, "spectrogram" for mel spectrogram
            n_mels: Number of mel bands if using spectrogram
        """
        super().__init__(env)
        self.x_dim = x_dim
        self.y_dim = y_dim
        self.audio_format = audio_format
        self.n_mels = n_mels

        # Audio parameters
        self.sample_rate = 44100
        self.channels = 2
        self.buffer_samples = 2205

        # Setup observation space
        vision_space = gym.spaces.Box(
            low=0,
            high=255,
            shape=(y_dim, x_dim, 3),
            dtype=np.uint8,
        )

        if audio_format == "raw":
            audio_space = gym.spaces.Box(
                low=-1.0,
                high=1.0,
                shape=(self.channels, self.buffer_samples),
                dtype=np.float32,
            )
        else:  # spectrogram
            time_frames = self.buffer_samples // 512 + 1
            audio_space = gym.spaces.Box(
                low=-80.0, high=0.0, shape=(n_mels, time_frames), dtype=np.float32
            )

        self.observation_space = gym.spaces.Dict(
            {
                "vision": vision_space,
                "audio": audio_space,
            }
        )

    def _decode_audio(self, info: dict) -> np.ndarray:
        """Decode audio from observation info."""
        obs_info = info.get("obs")
        if obs_info is None or not hasattr(obs_info, "audio_waveform"):
            return np.zeros((self.channels, self.buffer_samples), dtype=np.float32)

        audio_waveform = obs_info.audio_waveform
        if audio_waveform is None or not hasattr(audio_waveform, "pcm_data"):
            return np.zeros((self.channels, self.buffer_samples), dtype=np.float32)

        pcm_data = audio_waveform.pcm_data
        if len(pcm_data) == 0:
            return np.zeros((self.channels, self.buffer_samples), dtype=np.float32)

        # Decode 16-bit PCM
        audio = np.frombuffer(pcm_data, dtype=np.int16).astype(np.float32) / 32768.0

        # Reshape for stereo
        if self.channels == 2 and len(audio) >= 2:
            if len(audio) % 2 != 0:
                audio = audio[:-1]
            audio = audio.reshape(-1, 2).T

        if self.audio_format == "spectrogram":
            try:
                import librosa

                mono = np.mean(audio, axis=0) if audio.ndim == 2 else audio
                mel_spec = librosa.feature.melspectrogram(
                    y=mono, sr=self.sample_rate, n_mels=self.n_mels
                )
                return librosa.power_to_db(mel_spec, ref=np.max).astype(np.float32)
            except ImportError:
                pass

        return audio.astype(np.float32)

    def step(self, action: WrapperActType):
        obs, reward, terminated, truncated, info = self.env.step(action)

        rgb = info.get("pov", np.zeros((self.y_dim, self.x_dim, 3), dtype=np.uint8))
        audio = self._decode_audio(info)

        return (
            {"vision": rgb, "audio": audio},
            reward,
            terminated,
            truncated,
            info,
        )

    def reset(
        self, *, seed: Optional[int] = None, options: Optional[dict[str, Any]] = None
    ):
        obs, info = self.env.reset(seed=seed, options=options)

        rgb = info.get("pov", np.zeros((self.y_dim, self.x_dim, 3), dtype=np.uint8))
        audio = self._decode_audio(info)

        return {"vision": rgb, "audio": audio}, info
```
The new BimodalWaveformWrapper class lacks test coverage. Given that the repository has comprehensive test coverage for other components (see tests/python/unit/), consider adding tests for the bimodal observation handling, audio decoding, and the interaction between vision and audio components.
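A starting point for such coverage could exercise the audio decoding path with a fake protobuf-like object, without booting Minecraft. The sketch below inlines the core of `_decode_audio` into a stand-in class (`_FakeWrapper` and the test names are hypothetical; a real test would import the wrapper and live under tests/python/unit/):

```python
import numpy as np
from types import SimpleNamespace

class _FakeWrapper:
    """Stands in for BimodalWaveformWrapper's audio attributes in this sketch."""
    channels = 2
    buffer_samples = 4
    audio_format = "raw"

    def _decode_audio(self, info):
        # Inlined core of the wrapper's raw-PCM decode logic.
        obs_info = info.get("obs")
        if obs_info is None or getattr(obs_info, "audio_waveform", None) is None:
            return np.zeros((self.channels, self.buffer_samples), dtype=np.float32)
        pcm = obs_info.audio_waveform.pcm_data
        audio = np.frombuffer(pcm, dtype=np.int16).astype(np.float32) / 32768.0
        return audio.reshape(-1, 2).T

def test_decode_audio_missing_obs_returns_silence():
    audio = _FakeWrapper()._decode_audio({})
    assert audio.shape == (2, 4) and not audio.any()

def test_decode_audio_decodes_interleaved_stereo():
    pcm = np.array([0, 16384, -16384, 0], dtype="<i2").tobytes()
    fake_obs = SimpleNamespace(
        audio_waveform=SimpleNamespace(pcm_data=pcm, channels=2, num_samples=2)
    )
    audio = _FakeWrapper()._decode_audio({"obs": fake_obs})
    assert audio.shape == (2, 2)
    assert audio[1, 0] == 0.5  # right channel, first frame
```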
```kotlin
// Convert to byte array for transmission
buffer.rewind()
lastCapturedSamples = ByteArray(BUFFER_SIZE_BYTES)
```
The lastCapturedSamples ByteArray is reallocated on every call to renderSamples() (line 171), even though the size is constant (BUFFER_SIZE_BYTES). This creates unnecessary garbage collection pressure. Consider allocating it once during initialization and reusing it.
```diff
-lastCapturedSamples = ByteArray(BUFFER_SIZE_BYTES)
+if (lastCapturedSamples.size != BUFFER_SIZE_BYTES) {
+    lastCapturedSamples = ByteArray(BUFFER_SIZE_BYTES)
+}
```
```kotlin
import com.google.protobuf.ByteString
import org.lwjgl.openal.ALC10
import org.lwjgl.openal.ALC11
import org.lwjgl.openal.ALCCapabilities
```
The import statement includes org.lwjgl.openal.ALCCapabilities but ALCCapabilities is never used in the code. This import should be removed.
```diff
-import org.lwjgl.openal.ALCCapabilities
```
| class WaveformWrapper(gym.Wrapper): | ||
| """ | ||
| Wrapper that extracts audio waveform data from CraftGround observations. | ||
|
|
||
| The audio data is captured via OpenAL SOFT_loopback extension on the | ||
| Minecraft side and transmitted via Protobuf. | ||
|
|
||
| Attributes: | ||
| sample_rate: Audio sample rate in Hz (typically 44100) | ||
| channels: Number of audio channels (1=mono, 2=stereo) | ||
| buffer_samples: Number of samples per channel in each observation | ||
| output_format: "raw" for PCM data, "spectrogram" for mel spectrogram | ||
| """ | ||
|
|
||
| def __init__( | ||
| self, | ||
| env: gym.Env, | ||
| output_format: Literal["raw", "spectrogram", "mfcc"] = "raw", | ||
| normalize: bool = True, | ||
| n_mels: int = 64, | ||
| n_fft: int = 1024, | ||
| hop_length: int = 512, | ||
| include_subtitle: bool = False, | ||
| **kwargs, | ||
| ): | ||
| """ | ||
| Initialize the WaveformWrapper. | ||
|
|
||
| Args: | ||
| env: The CraftGround environment to wrap | ||
| output_format: Output format for audio data: | ||
| - "raw": Raw PCM samples as float32 array | ||
| - "spectrogram": Mel spectrogram (requires librosa) | ||
| - "mfcc": MFCC features (requires librosa) | ||
| normalize: Whether to normalize audio to [-1, 1] range | ||
| n_mels: Number of mel bands for spectrogram | ||
| n_fft: FFT window size for spectrogram | ||
| hop_length: Hop length for spectrogram | ||
| include_subtitle: Whether to also include subtitle-based sound info | ||
| """ | ||
| super().__init__(env) | ||
| self.output_format = output_format | ||
| self.normalize = normalize | ||
| self.n_mels = n_mels | ||
| self.n_fft = n_fft | ||
| self.hop_length = hop_length | ||
| self.include_subtitle = include_subtitle | ||
|
|
||
| # Default audio parameters (will be updated from actual data) | ||
| self.sample_rate = 44100 | ||
| self.channels = 2 | ||
| self.buffer_samples = 2205 # ~50ms at 44100Hz | ||
|
|
||
| # Setup observation space based on output format | ||
| self._setup_observation_space() | ||
|
|
||
| # Lazy load librosa if needed | ||
| self._librosa = None | ||
| if output_format in ["spectrogram", "mfcc"]: | ||
| try: | ||
| import librosa | ||
|
|
||
| self._librosa = librosa | ||
| except ImportError: | ||
| raise ImportError( | ||
| f"librosa is required for output_format='{output_format}'. " | ||
| "Install it with: pip install librosa" | ||
| ) | ||
|
|
||
| def _setup_observation_space(self): | ||
| """Setup the observation space based on output format.""" | ||
| if self.output_format == "raw": | ||
| # Raw PCM: (channels, samples) or (samples,) for mono | ||
| if self.channels == 1: | ||
| shape = (self.buffer_samples,) | ||
| else: | ||
| shape = (self.channels, self.buffer_samples) | ||
|
|
||
| audio_space = gym.spaces.Box( | ||
| low=-1.0 if self.normalize else -32768, | ||
| high=1.0 if self.normalize else 32767, | ||
| shape=shape, | ||
| dtype=np.float32, | ||
| ) | ||
|
|
||
| elif self.output_format == "spectrogram": | ||
| # Mel spectrogram: (n_mels, time_frames) | ||
| time_frames = self.buffer_samples // self.hop_length + 1 | ||
| audio_space = gym.spaces.Box( | ||
| low=0.0, high=np.inf, shape=(self.n_mels, time_frames), dtype=np.float32 | ||
| ) | ||
|
|
||
| elif self.output_format == "mfcc": | ||
| # MFCC: (n_mfcc, time_frames) | ||
| time_frames = self.buffer_samples // self.hop_length + 1 | ||
| n_mfcc = 13 # Standard number of MFCCs | ||
| audio_space = gym.spaces.Box( | ||
| low=-np.inf, high=np.inf, shape=(n_mfcc, time_frames), dtype=np.float32 | ||
| ) | ||
| else: | ||
| raise ValueError(f"Unknown output_format: {self.output_format}") | ||
| | ||
| # Create composite observation space | ||
| spaces = {"waveform": audio_space} | ||
| | ||
| if self.include_subtitle: | ||
| # Add subtitle info if requested | ||
| # This would need to be coordinated with the existing sound wrapper | ||
| pass | ||
| | ||
| self.observation_space = gym.spaces.Dict(spaces) | ||
| | ||
| def _decode_pcm(self, pcm_bytes: bytes) -> np.ndarray: | ||
| """ | ||
| Decode raw PCM bytes to numpy array. | ||
| | ||
| Args: | ||
| pcm_bytes: Raw PCM data (16-bit signed, little-endian) | ||
| | ||
| Returns: | ||
| Audio samples as float32 array, shape (channels, samples) or (samples,) | ||
| """ | ||
| if len(pcm_bytes) == 0: | ||
| # Return silence if no audio data | ||
| if self.channels == 1: | ||
| return np.zeros(self.buffer_samples, dtype=np.float32) | ||
| else: | ||
| return np.zeros((self.channels, self.buffer_samples), dtype=np.float32) | ||
| | ||
| # Decode 16-bit signed PCM | ||
| audio = np.frombuffer(pcm_bytes, dtype=np.int16) | ||
| | ||
| # Convert to float32 | ||
| audio = audio.astype(np.float32) | ||
| | ||
| # Normalize to [-1, 1] if requested | ||
| if self.normalize: | ||
| audio = audio / 32768.0 | ||
| | ||
| # Reshape for stereo | ||
| if self.channels == 2: | ||
| # Interleaved stereo: L R L R L R ... | ||
| if len(audio) % 2 == 0: | ||
| audio = audio.reshape(-1, 2).T # Shape: (2, samples) | ||
| else: | ||
| # Handle odd length by padding | ||
| audio = np.pad(audio, (0, 1)) | ||
| audio = audio.reshape(-1, 2).T | ||
| | ||
| return audio | ||
| | ||
| def _compute_spectrogram(self, audio: np.ndarray) -> np.ndarray: | ||
| """Compute mel spectrogram from audio.""" | ||
| if self._librosa is None: | ||
| raise RuntimeError("librosa not available") | ||
| | ||
| # Use mono for spectrogram | ||
| if audio.ndim == 2: | ||
| audio = np.mean(audio, axis=0) | ||
| | ||
| # Compute mel spectrogram | ||
| mel_spec = self._librosa.feature.melspectrogram( | ||
| y=audio, | ||
| sr=self.sample_rate, | ||
| n_mels=self.n_mels, | ||
| n_fft=self.n_fft, | ||
| hop_length=self.hop_length, | ||
| ) | ||
| | ||
| # Convert to dB scale | ||
| mel_spec_db = self._librosa.power_to_db(mel_spec, ref=np.max) | ||
| | ||
| return mel_spec_db.astype(np.float32) | ||
| | ||
| def _compute_mfcc(self, audio: np.ndarray) -> np.ndarray: | ||
| """Compute MFCC features from audio.""" | ||
| if self._librosa is None: | ||
| raise RuntimeError("librosa not available") | ||
| | ||
| # Use mono for MFCC | ||
| if audio.ndim == 2: | ||
| audio = np.mean(audio, axis=0) | ||
| | ||
| # Compute MFCCs | ||
| mfcc = self._librosa.feature.mfcc( | ||
| y=audio, | ||
| sr=self.sample_rate, | ||
| n_mfcc=13, | ||
| n_fft=self.n_fft, | ||
| hop_length=self.hop_length, | ||
| ) | ||
| | ||
| return mfcc.astype(np.float32) | ||
| | ||
| def _extract_waveform(self, info: dict) -> np.ndarray: | ||
| """Extract and process waveform from observation info.""" | ||
| obs_info = info.get("obs") | ||
| if obs_info is None: | ||
| return self._get_silence() | ||
| | ||
| # Check if audio waveform is available | ||
| if not hasattr(obs_info, "audio_waveform") or obs_info.audio_waveform is None: | ||
| return self._get_silence() | ||
| | ||
| audio_waveform = obs_info.audio_waveform | ||
| | ||
| # Update audio parameters from actual data | ||
| if hasattr(audio_waveform, "sample_rate") and audio_waveform.sample_rate > 0: | ||
| self.sample_rate = audio_waveform.sample_rate | ||
| if hasattr(audio_waveform, "channels") and audio_waveform.channels > 0: | ||
| self.channels = audio_waveform.channels | ||
| if hasattr(audio_waveform, "num_samples") and audio_waveform.num_samples > 0: | ||
| self.buffer_samples = audio_waveform.num_samples | ||
| | ||
| # Get PCM data | ||
| pcm_data = ( | ||
| audio_waveform.pcm_data if hasattr(audio_waveform, "pcm_data") else b"" | ||
| ) | ||
| | ||
| # Decode PCM | ||
| audio = self._decode_pcm(pcm_data) | ||
| | ||
| # Process based on output format | ||
| if self.output_format == "raw": | ||
| return audio | ||
| elif self.output_format == "spectrogram": | ||
| return self._compute_spectrogram(audio) | ||
| elif self.output_format == "mfcc": | ||
| return self._compute_mfcc(audio) | ||
| else: | ||
| return audio | ||
| | ||
| def _get_silence(self) -> np.ndarray: | ||
| """Return silence array matching the expected shape.""" | ||
| if self.output_format == "raw": | ||
| if self.channels == 1: | ||
| return np.zeros(self.buffer_samples, dtype=np.float32) | ||
| else: | ||
| return np.zeros((self.channels, self.buffer_samples), dtype=np.float32) | ||
| elif self.output_format == "spectrogram": | ||
| time_frames = self.buffer_samples // self.hop_length + 1 | ||
| return np.zeros((self.n_mels, time_frames), dtype=np.float32) | ||
| elif self.output_format == "mfcc": | ||
| time_frames = self.buffer_samples // self.hop_length + 1 | ||
| return np.zeros((13, time_frames), dtype=np.float32) | ||
| else: | ||
| return np.zeros(self.buffer_samples, dtype=np.float32) | ||
| | ||
| def step( | ||
| self, action: WrapperActType | ||
| ) -> tuple[WrapperObsType, SupportsFloat, bool, bool, dict[str, Any]]: | ||
| """Execute action and return observation with waveform.""" | ||
| obs, reward, terminated, truncated, info = self.env.step(action) | ||
| | ||
| waveform = self._extract_waveform(info) | ||
| | ||
| return ( | ||
| {"waveform": waveform}, | ||
| reward, | ||
| terminated, | ||
| truncated, | ||
| info, | ||
| ) | ||
| | ||
| def reset( | ||
| self, | ||
| *, | ||
| seed: Optional[int] = None, | ||
| options: Optional[dict[str, Any]] = None, | ||
| ) -> tuple[WrapperObsType, dict[str, Any]]: | ||
| """Reset environment and return initial observation with waveform.""" | ||
| obs, info = self.env.reset(seed=seed, options=options) | ||
| | ||
| waveform = self._extract_waveform(info) | ||
| | ||
| return {"waveform": waveform}, info | ||
| | ||
The new WaveformWrapper class lacks test coverage. Given that the repository has comprehensive test coverage for other components (see tests/python/unit/), consider adding tests for at least the critical functionality: PCM decoding, audio format conversions (raw, spectrogram, MFCC), handling of missing audio data, and observation space consistency.
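A test for the PCM-decoding path can be written without spinning up an environment. The sketch below reproduces the stereo branch of `_decode_pcm` as a standalone helper (the standalone function and its name are illustrative assumptions; an actual test in `tests/python/unit/` would exercise the wrapper method directly):

```python
import numpy as np


def decode_pcm_stereo(pcm_bytes: bytes, normalize: bool = True) -> np.ndarray:
    # Mirrors WaveformWrapper._decode_pcm for the 2-channel case:
    # 16-bit signed PCM -> float32, optional [-1, 1] scaling,
    # interleaved L R L R ... reshaped to (2, samples).
    audio = np.frombuffer(pcm_bytes, dtype=np.int16).astype(np.float32)
    if normalize:
        audio = audio / 32768.0
    if len(audio) % 2 != 0:
        audio = np.pad(audio, (0, 1))  # pad odd-length data, as the wrapper does
    return audio.reshape(-1, 2).T


def test_decode_pcm_interleaved_stereo():
    # Two interleaved frames: (L=+max, R=-max), (L=0, R=half scale)
    samples = np.array([32767, -32768, 0, 16384], dtype=np.int16)
    decoded = decode_pcm_stereo(samples.tobytes())
    assert decoded.shape == (2, 2)
    assert np.isclose(decoded[0, 0], 32767 / 32768.0)
    assert np.isclose(decoded[1, 0], -1.0)
    assert decoded[0, 1] == 0.0
    assert np.isclose(decoded[1, 1], 0.5)
```

Similar small cases would cover empty input (silence), odd-length padding, and the mono path.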
| /** | ||
| * Render audio samples from the loopback device. | ||
| * Should be called each tick to capture the latest audio. | ||
| * | ||
| * Note: This requires the loopback context to be current when sources are playing. | ||
| * In practice, this means we need to either: | ||
| * 1. Use a shared context approach | ||
| * 2. Mirror all sound commands to the loopback device | ||
| * 3. Use OS-level audio loopback as fallback | ||
| * | ||
| * For now, this demonstrates the API usage. Full integration requires | ||
| * mirroring the SoundSystem's source operations to the loopback context. | ||
| */ |
The comment on lines 139-146 describes incomplete behavior. It states "For now, this demonstrates the API usage. Full integration requires mirroring the SoundSystem's source operations to the loopback context." This suggests that the current implementation may not capture actual game audio correctly. This limitation should be clearly documented in the main docstring or as a warning in the code, as users might expect the audio capture to work fully.
| audio_space = gym.spaces.Box( | ||
| low=0.0, high=np.inf, shape=(self.n_mels, time_frames), dtype=np.float32 |
The spectrogram bounds in the observation space are set to low=0.0, high=np.inf, but in _compute_spectrogram() the mel spectrogram is converted to dB scale which can produce negative values (power_to_db returns values typically in the range [-80, 0] dB). This mismatch between the observation space definition and actual values could cause issues with RL algorithms that rely on correct bounds.
| audio_space = gym.spaces.Box( | |
| low=0.0, high=np.inf, shape=(self.n_mels, time_frames), dtype=np.float32 | |
| # dB-scaled mel spectrogram from librosa.power_to_db is typically in [-80, 0] dB | |
| audio_space = gym.spaces.Box( | |
| low=-80.0, high=0.0, shape=(self.n_mels, time_frames), dtype=np.float32 |
| except ImportError: | ||
| pass |
When audio_format is "spectrogram" but librosa import fails, the method falls through and returns raw audio data instead (line 410). This creates a shape mismatch with the observation space: the space expects shape (n_mels, time_frames) for spectrograms but the returned data has shape (channels, buffer_samples). Either raise an error when librosa is unavailable for spectrogram format, or return zeros with the correct spectrogram shape.
| except ImportError: | |
| pass | |
| except ImportError as exc: | |
| raise RuntimeError( | |
| "BimodalWaveformWrapper is configured with audio_format='spectrogram' " | |
| "but the 'librosa' package is not installed. Install librosa or " | |
| "use audio_format='raw'." | |
| ) from exc |
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/craftground/proto/observation_space_pb2.py (1)
1-6: Update the line 3 comment to reflect that generated protobuf code is intentionally committed. The file header at line 3 states "NO CHECKED-IN PROTOBUF GENCODE", but this file, along with `action_space_pb2.py` and `initial_environment_pb2.py`, is intentionally committed to the repository. Remove the auto-generated comment or replace it with a note explaining the project's decision to commit pre-generated protobuf code.
🧹 Nitpick comments (1)
src/craftground/MinecraftEnv/src/main/java/com/kyhsgeekcode/minecraftenv/proto/ObservationSpace.java (1)
12660-13582: Consider capping raw PCM payload size upstream.
`pcm_data` can grow quickly; enforce a max buffer length (or segmenting) in the producer to prevent oversized observations or memory spikes.
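While an upstream cap is decided, a defensive clamp on the consumer side is cheap. A sketch, assuming 16-bit stereo frames and a hypothetical one-second budget (both the constant and the helper name are illustrative):

```python
BYTES_PER_FRAME = 2 * 2  # 16-bit samples, 2 channels
MAX_PCM_BYTES = 44100 * BYTES_PER_FRAME  # ~1 s at 44.1 kHz (hypothetical budget)


def clamp_pcm(pcm_bytes: bytes, max_bytes: int = MAX_PCM_BYTES) -> bytes:
    """Keep at most max_bytes of the most recent audio, cut on a frame boundary."""
    if len(pcm_bytes) <= max_bytes:
        return pcm_bytes
    keep = max_bytes - (max_bytes % BYTES_PER_FRAME)
    return pcm_bytes[len(pcm_bytes) - keep:]
```

Keeping the tail (most recent samples) rather than the head preserves the audio closest to the current tick, which is usually what the agent cares about.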
Summary by CodeRabbit
New Features
Integration
Chores