feature(wjy): add R1-AQA (Audio Question Answering) training examples#44

Open
JOY-SWang wants to merge 4 commits into opendilab:main from JOY-SWang:main

Conversation

@JOY-SWang

LightRFT Training Script for R1-AQA (Audio Question Answering)

This script fine-tunes Qwen2-Audio-7B-Instruct using the GRPO algorithm with rule-based rewards, faithfully migrating the R1-AQA training pipeline.

Key Design Decisions

  1. Reward summation, not weighting: R1-AQA sums accuracy (0/1) + format (0/1) = max 2.0. LightRFT's GSM8K/Geo3K uses 0.9×acc + 0.1×fmt = max 1.0. We keep R1-AQA's summation for an identical reward signal — GRPO's group normalization handles the scale difference.

  2. Audio via image slot with monkey patches: LightRFT's pipeline is deeply optimized for images/videos. Rather than rewriting core framework code, we repurpose the image data slots and apply targeted patches:

     - pixel_values slot → input_features (Qwen2-Audio features)
     - image_grid_thw slot → feature_attention_mask
     - multi_modal_data["image"] → multi_modal_data["audio"]
     - ActorAudio remaps kwargs in the model forward pass

  3. ActorAudio model class: Qwen2-Audio uses Qwen2AudioForConditionalGeneration (not AutoModelForVision2Seq), so we created a dedicated actor class with correct model loading and kwarg remapping while preserving the same positional interface as ActorVL.

  4. No think mode by default: R1-AQA notes explicit reasoning didn't help for AQA. Think mode is supported via --enable_think in preprocessing but disabled by default.

  5. Chat template via processor: Audio URLs are embedded in chat messages as {"type": "audio", "audio_url": path}. The Qwen2-Audio processor's apply_chat_template converts these to correct audio placeholder tokens, preserving R1-AQA's exact prompt construction.
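As a concrete illustration of the reward summation described above, a rule-based reward of this shape can be sketched as follows (the `aqa_reward` name and the `<answer>` extraction regex are assumptions for illustration, not the repository's actual code):

```python
import re

def aqa_reward(response: str, ground_truth: str) -> float:
    """Sketch of an R1-AQA-style summed reward: accuracy (0/1) plus
    format (0/1), so the maximum is 2.0 — unlike the weighted
    0.9*acc + 0.1*fmt = 1.0 scheme used for GSM8K/Geo3K."""
    # Format reward: the response wraps its choice in <answer>...</answer>.
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    format_reward = 1.0 if match else 0.0
    # Accuracy reward: the extracted answer matches the ground truth.
    predicted = match.group(1).strip() if match else response.strip()
    accuracy_reward = 1.0 if predicted.lower() == ground_truth.lower() else 0.0
    return accuracy_reward + format_reward  # summed, not weighted
```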
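The slot repurposing described above amounts to a kwarg translation applied before the model's forward/generate call. A minimal sketch (the helper name is hypothetical):

```python
def remap_audio_kwargs(kwargs: dict) -> dict:
    """Sketch of the slot repurposing: translate the image-pipeline slot
    names that LightRFT carries through its batching code into the
    argument names Qwen2-Audio's forward() expects."""
    slot_map = {
        "pixel_values": "input_features",          # mel features ride the image slot
        "image_grid_thw": "feature_attention_mask",
    }
    return {slot_map.get(key, key): value for key, value in kwargs.items()}
```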
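The chat-message layout described above can be illustrated as follows (`build_aqa_messages` is a hypothetical helper; only the dict shape mirrors the description — the real pipeline then passes such messages to the Qwen2-Audio processor's apply_chat_template):

```python
def build_aqa_messages(audio_path: str, question: str) -> list:
    """Sketch of the message structure: the audio clip is embedded as
    {"type": "audio", "audio_url": path} alongside the question text,
    so the processor can expand it into audio placeholder tokens."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "audio", "audio_url": audio_path},
                {"type": "text", "text": question},
            ],
        }
    ]
```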

@@ -0,0 +1,2 @@
# R1-AQA Example for LightRFT

There's no need for `__init__.py` here. You can refer to the file structure and import conventions used in examples/gsm8k_geo3k.

python examples/r1_aqa/data_preprocess/avqa.py \
    --input_jsonl train_r1aqa_line.json \
    --audio_dir data/AVQA/audios \
    --local_save_dir ""

Add an example for this field.

)
parser.add_argument(
"--local_save_dir",
default="~/data/avqa_lightrft",

Use a relative path.

This script performs inference on the MMAU test-mini benchmark and outputs
results in the format expected by MMAU's official evaluation script.

Faithfully ported from R1-AQA's ``src/test_mmau.py``.

Add a download link for the data.

@@ -0,0 +1,323 @@
"""

Also add the MMAR evaluation code here.

# Audio Loading
# ============================================================================

def load_audio(

Simplify this function; we can just use librosa here.

self.strategy.print(f"[WARNING] Failed to load audio {audio_path}: {e}")
audio_data = None

# ---- 4. Extract reference and label ----

If some data fields are not used in this demo, we can just leave a default empty value. Please simplify the code here.

},
ActorModality.AUDIO_LANGUAGE: {
"audio_values",
"pixel_values", # Audio pipeline stores input_features in pixel_values slot

We shouldn't modify the definition here; the audio_values field is enough.

@torch.no_grad()
def generate(
self, input_ids: torch.Tensor, audio_values: torch.Tensor, **kwargs
self, input_ids: torch.Tensor, input_features: torch.Tensor = None, **kwargs

Why do we need to change this name here? Can we modify the data processor and the patch part to always use audio_values?

if has_audio_placeholder:
# Qwen2Audio's Whisper encoder requires mel features of
# exactly 3000 frames. Pad / truncate as needed.
EXPECTED_MEL_LEN = 3000

Refactor this part into a new function to reuse code.
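A sketch of the suggested refactor, using numpy as a stand-in for the actual tensor type (the function and constant names are illustrative, not the repository's code):

```python
import numpy as np

EXPECTED_MEL_LEN = 3000  # Qwen2-Audio's Whisper encoder expects exactly 3000 mel frames

def pad_or_truncate_mel(features: np.ndarray,
                        expected_len: int = EXPECTED_MEL_LEN) -> np.ndarray:
    """Force the last (time) dimension of a mel-feature array to exactly
    `expected_len` frames by zero-padding or truncating."""
    current_len = features.shape[-1]
    if current_len < expected_len:
        # Pad only the last dimension with zeros on the right.
        pad_width = [(0, 0)] * (features.ndim - 1) + [(0, expected_len - current_len)]
        return np.pad(features, pad_width)
    return features[..., :expected_len]
```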

@puyuan1996 puyuan1996 changed the title LightRFT Training Script for R1-AQA (Audio Question Answering) feature(wjy): add R1-AQA (Audio Question Answering) training examples Feb 24, 2026
@puyuan1996 puyuan1996 added the enhancement New feature or request label Feb 24, 2026