feature(wjy): add R1-AQA (Audio Question Answering) training examples #44
JOY-SWang wants to merge 4 commits into opendilab:main
Conversation
examples/r1_aqa/__init__.py

@@ -0,0 +1,2 @@
# R1-AQA Example for LightRFT

Review: there's no need for `__init__.py` here. You can refer to the file structure and import conventions used in examples/gsm8k_geo3k.
python examples/r1_aqa/data_preprocess/avqa.py \
    --input_jsonl train_r1aqa_line.json \
    --audio_dir data/AVQA/audios \
    --local_save_dir ""
)
parser.add_argument(
    "--local_save_dir",
    default="~/data/avqa_lightrft",
examples/r1_aqa/eval_mmau.py

This script performs inference on the MMAU test-mini benchmark and outputs
results in the format expected by MMAU's official evaluation script.

Faithfully ported from R1-AQA's ``src/test_mmau.py``.

Review: add a download link for the data.
@@ -0,0 +1,323 @@
"""

Review: also add the MMAR evaluation code here.
examples/r1_aqa/audio_pipeline.py

# Audio Loading
# ============================================================================

def load_audio(

Review: simplify this function; we can just use librosa here.
examples/r1_aqa/audio_pipeline.py

self.strategy.print(f"[WARNING] Failed to load audio {audio_path}: {e}")
audio_data = None

# ---- 4. Extract reference and label ----

Review: if some data fields are not used in this demo, just leave a default empty value. Please simplify the code here.
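A minimal sketch of the simplification the reviewer suggests: populate only the fields this demo actually consumes and give everything else a fixed empty default instead of parsing it. The field names here are hypothetical, chosen for illustration.

```python
def extract_fields(sample: dict) -> dict:
    """Keep only the fields the demo uses; default the rest to empty values."""
    return {
        "audio_path": sample.get("audio_path", ""),
        "question": sample.get("question", ""),
        "answer": sample.get("answer", ""),
        # Unused in this demo: fixed empty defaults, no parsing needed.
        "reference": "",
        "metadata": {},
    }
```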
lightrft/models/actor_modality.py

},
ActorModality.AUDIO_LANGUAGE: {
    "audio_values",
    "pixel_values",  # Audio pipeline stores input_features in pixel_values slot

Review: we shouldn't modify the definition here; the audio_values field is enough.
lightrft/models/actor_al.py

@torch.no_grad()
def generate(
-    self, input_ids: torch.Tensor, audio_values: torch.Tensor, **kwargs
+    self, input_ids: torch.Tensor, input_features: torch.Tensor = None, **kwargs

Review: why do we need to change this name here? Can we modify the data processor and the patch so that audio_values is always used?
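A minimal sketch of the normalization the reviewer is asking for: canonicalize the kwarg at the data-processor/patch boundary so actor code only ever sees `audio_values`. The helper name is hypothetical.

```python
def normalize_audio_kwargs(kwargs: dict) -> dict:
    """Map a legacy 'input_features' kwarg onto the canonical 'audio_values'
    key so downstream actor code never needs to know both names."""
    out = dict(kwargs)
    if "input_features" in out and "audio_values" not in out:
        out["audio_values"] = out.pop("input_features")
    return out
```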
lightrft/models/actor_al.py

if has_audio_placeholder:
    # Qwen2Audio's Whisper encoder requires mel features of
    # exactly 3000 frames. Pad / truncate as needed.
    EXPECTED_MEL_LEN = 3000

Review: refactor this part into a new function to reuse the code.
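One way the suggested refactor could look: pull the pad/truncate logic into a reusable helper. This sketch operates on plain lists for clarity; the actual actor code would apply the same logic to torch tensors along the frame dimension.

```python
EXPECTED_MEL_LEN = 3000  # Qwen2-Audio's Whisper encoder expects exactly 3000 mel frames

def pad_or_truncate_mel(frames, target_len=EXPECTED_MEL_LEN, pad_value=0.0):
    """Pad (with pad_value) or truncate a frame sequence to target_len."""
    if len(frames) >= target_len:
        return frames[:target_len]
    return frames + [pad_value] * (target_len - len(frames))
```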
LightRFT Training Script for R1-AQA (Audio Question Answering)

This script fine-tunes Qwen2-Audio-7B-Instruct with the GRPO algorithm and rule-based rewards, faithfully migrating the R1-AQA training pipeline.

Key Design Decisions

- Reward summation, not weighting: R1-AQA sums accuracy (0/1) + format (0/1) for a maximum of 2.0, whereas LightRFT's GSM8K/Geo3K uses 0.9 × acc + 0.1 × fmt for a maximum of 1.0. We keep R1-AQA's summation so the reward signal stays identical; GRPO's group normalization handles the scale difference.
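A minimal sketch of the summed reward under these assumptions: the exact answer-matching and format rules live in the repo's reward code, and the `<answer>` tag convention used here is illustrative, not confirmed from the source.

```python
import re

def aqa_reward(response: str, gold_choice: str) -> float:
    """R1-AQA-style reward: accuracy (0/1) + format (0/1), max 2.0.

    Format rule (illustrative): the final choice must be wrapped in
    <answer>...</answer>. Accuracy: the wrapped choice matches the gold label.
    """
    match = re.search(r"<answer>\s*(.+?)\s*</answer>", response, re.DOTALL)
    format_reward = 1.0 if match else 0.0
    accuracy_reward = 1.0 if match and match.group(1).strip() == gold_choice else 0.0
    return accuracy_reward + format_reward
```

Because GRPO normalizes rewards within each sampled group, the 0..2 range needs no rescaling to match LightRFT's 0..1 examples.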
- Audio via image slot with monkey patches: LightRFT's pipeline is deeply optimized for images/videos. Rather than rewriting core framework code, we repurpose the image data slots and apply targeted patches; ActorAudio remaps kwargs in the model forward pass.
- ActorAudio model class: Qwen2-Audio uses Qwen2AudioForConditionalGeneration (not AutoModelForVision2Seq), so we created a dedicated actor class with correct model loading and kwarg remapping while preserving the same positional interface as ActorVL.
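A minimal sketch of the kind of forward-pass kwarg remapping described above. The class shape and keyword names are assumed for illustration; the real ActorAudio wraps Qwen2AudioForConditionalGeneration inside LightRFT's actor interface.

```python
class ActorAudio:
    """Illustrative shim: route features smuggled through the repurposed
    image slot back to the kwarg the audio model actually expects."""

    def __init__(self, model):
        self.model = model

    def forward(self, input_ids, pixel_values=None, **kwargs):
        # The audio pipeline stored mel features in the pixel_values slot;
        # hand them to the audio model under its expected keyword.
        if pixel_values is not None:
            kwargs["input_features"] = pixel_values
        return self.model(input_ids=input_ids, **kwargs)
```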
- No think mode by default: R1-AQA notes that explicit reasoning didn't help for AQA. Think mode is supported via --enable_think in preprocessing but disabled by default.
- Chat template via processor: audio URLs are embedded in chat messages as {"type": "audio", "audio_url": path}. The Qwen2-Audio processor's apply_chat_template converts these to the correct audio placeholder tokens, preserving R1-AQA's exact prompt construction.
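A sketch of the message construction this implies; the processor's apply_chat_template step itself needs the transformers library and is not shown. The helper name and prompt wording are hypothetical.

```python
def build_aqa_messages(audio_path: str, question: str) -> list:
    """Build a Qwen2-Audio chat message; apply_chat_template later turns the
    audio entry into the model's audio placeholder tokens."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "audio", "audio_url": audio_path},
                {"type": "text", "text": question},
            ],
        }
    ]
```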