feature(wjy): add R1-AQA (Audio Question Answering) training examples#44

Open
JOY-SWang wants to merge 4 commits into opendilab:main from JOY-SWang:main

Conversation

@JOY-SWang

LightRFT Training Script for R1-AQA (Audio Question Answering)

This script fine-tunes Qwen2-Audio-7B-Instruct using the GRPO algorithm with rule-based rewards, faithfully migrating the R1-AQA training pipeline.

Key Design Decisions

  1. Reward summation, not weighting: R1-AQA sums accuracy (0/1) + format (0/1) = max 2.0. LightRFT's GSM8K/Geo3K uses 0.9×acc + 0.1×fmt = max 1.0. We keep R1-AQA's summation for an identical reward signal — GRPO's group normalization handles the scale difference.

  2. Audio via image slot with monkey patches: LightRFT's pipeline is deeply optimized for images/videos. Rather than rewriting core framework code, we repurpose the image data slots and apply targeted patches:

     - pixel_values slot → input_features (Qwen2-Audio features)
     - image_grid_thw slot → feature_attention_mask
     - multi_modal_data["image"] → multi_modal_data["audio"]
     - ActorAudio remaps kwargs in the model forward pass

  3. ActorAudio model class: Qwen2-Audio uses Qwen2AudioForConditionalGeneration (not AutoModelForVision2Seq), so we created a dedicated actor class with correct model loading and kwarg remapping while preserving the same positional interface as ActorVL.

  4. No think mode by default: R1-AQA notes explicit reasoning didn't help for AQA. Think mode is supported via --enable_think in preprocessing but disabled by default.

  5. Chat template via processor: Audio URLs are embedded in chat messages as {"type": "audio", "audio_url": path}. The Qwen2-Audio processor's apply_chat_template converts these to correct audio placeholder tokens, preserving R1-AQA's exact prompt construction.
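As a concrete illustration of the reward summation described above, a rule-based reward of this shape can be sketched as follows (the `aqa_reward` name and the `<answer>` extraction regex are assumptions for illustration, not the repository's actual code):

```python
import re

def aqa_reward(response: str, ground_truth: str) -> float:
    """Sketch of an R1-AQA-style summed reward: accuracy (0/1) plus
    format (0/1), so the maximum is 2.0 — unlike the weighted
    0.9*acc + 0.1*fmt = 1.0 scheme used for GSM8K/Geo3K."""
    # Format reward: the response wraps its choice in <answer>...</answer>.
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    format_reward = 1.0 if match else 0.0
    # Accuracy reward: the extracted answer matches the ground truth.
    predicted = match.group(1).strip() if match else response.strip()
    accuracy_reward = 1.0 if predicted.lower() == ground_truth.lower() else 0.0
    return accuracy_reward + format_reward  # summed, not weighted
```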
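The slot repurposing described above amounts to a kwarg translation applied before the model's forward/generate call. A minimal sketch (the helper name is hypothetical):

```python
def remap_audio_kwargs(kwargs: dict) -> dict:
    """Sketch of the slot repurposing: translate the image-pipeline slot
    names that LightRFT carries through its batching code into the
    argument names Qwen2-Audio's forward() expects."""
    slot_map = {
        "pixel_values": "input_features",          # mel features ride the image slot
        "image_grid_thw": "feature_attention_mask",
    }
    return {slot_map.get(key, key): value for key, value in kwargs.items()}
```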
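The chat-message layout described above can be illustrated as follows (`build_aqa_messages` is a hypothetical helper; only the dict shape mirrors the description — the real pipeline then passes such messages to the Qwen2-Audio processor's apply_chat_template):

```python
def build_aqa_messages(audio_path: str, question: str) -> list:
    """Sketch of the message structure: the audio clip is embedded as
    {"type": "audio", "audio_url": path} alongside the question text,
    so the processor can expand it into audio placeholder tokens."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "audio", "audio_url": audio_path},
                {"type": "text", "text": question},
            ],
        }
    ]
```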

@@ -0,0 +1,2 @@
# R1-AQA Example for LightRFT

There's no need for `__init__.py` here. You can refer to the file structure and import conventions used in examples/gsm8k_geo3k.

python examples/r1_aqa/data_preprocess/avqa.py \
    --input_jsonl train_r1aqa_line.json \
    --audio_dir data/AVQA/audios \
    --local_save_dir ""

Add an example for this field.

)
parser.add_argument(
"--local_save_dir",
default="~/data/avqa_lightrft",

Use a relative path.

This script performs inference on the MMAU test-mini benchmark and outputs
results in the format expected by MMAU's official evaluation script.

Faithfully ported from R1-AQA's ``src/test_mmau.py``.

Add a download link for the data.

@@ -0,0 +1,323 @@
"""

Also add the MMAR evaluation code here.

# Audio Loading
# ============================================================================

def load_audio(

Simplify this function; we can just use librosa here.

self.strategy.print(f"[WARNING] Failed to load audio {audio_path}: {e}")
audio_data = None

# ---- 4. Extract reference and label ----

If some data fields are not used in this demo, we can just leave a default empty value. Please simplify the code here.

},
ActorModality.AUDIO_LANGUAGE: {
"audio_values",
"pixel_values", # Audio pipeline stores input_features in pixel_values slot

We shouldn't modify the definition here; the audio_values field is enough.

@torch.no_grad()
def generate(
self, input_ids: torch.Tensor, audio_values: torch.Tensor, **kwargs
self, input_ids: torch.Tensor, input_features: torch.Tensor = None, **kwargs

Why do we need to change this name here? Can we modify the data processor and the patch part to always use audio_values?

if has_audio_placeholder:
# Qwen2Audio's Whisper encoder requires mel features of
# exactly 3000 frames. Pad / truncate as needed.
EXPECTED_MEL_LEN = 3000

Refactor this part into a new function to reuse code.
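A sketch of the suggested refactor, using numpy as a stand-in for the actual tensor type (the function and constant names are illustrative, not the repository's code):

```python
import numpy as np

EXPECTED_MEL_LEN = 3000  # Qwen2-Audio's Whisper encoder expects exactly 3000 mel frames

def pad_or_truncate_mel(features: np.ndarray,
                        expected_len: int = EXPECTED_MEL_LEN) -> np.ndarray:
    """Force the last (time) dimension of a mel-feature array to exactly
    `expected_len` frames by zero-padding or truncating."""
    current_len = features.shape[-1]
    if current_len < expected_len:
        # Pad only the last dimension with zeros on the right.
        pad_width = [(0, 0)] * (features.ndim - 1) + [(0, expected_len - current_len)]
        return np.pad(features, pad_width)
    return features[..., :expected_len]
```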

@puyuan1996 puyuan1996 changed the title LightRFT Training Script for R1-AQA (Audio Question Answering) feature(wjy): add R1-AQA (Audio Question Answering) training examples Feb 24, 2026
@puyuan1996 puyuan1996 added the enhancement New feature or request label Feb 24, 2026