Hi, thanks again for sharing this great work!
I have a small question about the fine-tuning setup in your paper. In Appendix B.1, you describe the implementation details, such as using LoRA with rank r = 8 and α = 16, training for 3 epochs with a peak learning rate of 5e-5 under a cosine decay schedule, and AdamW with a warmup ratio of 0.1.
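For concreteness, here is a minimal sketch of how I am currently reproducing that recipe with HuggingFace Transformers + PEFT. Please note that the model names, target modules, dropout, batch size, and precision below are my own guesses, since they are not spelled out in Appendix B.1:

```python
# Sketch of my reading of the Appendix B.1 recipe (Transformers + PEFT).
# Lines marked "guess" are my assumptions, not values stated in the paper.
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "Qwen/Qwen2.5-7B"  # or "meta-llama/Llama-3.1-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# LoRA settings from Appendix B.1: r = 8, alpha = 16.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,                    # guess: dropout not given in the appendix
    target_modules=["q_proj", "v_proj"],  # guess: target modules not given
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Optimizer / schedule from Appendix B.1: AdamW, peak LR 5e-5,
# cosine decay, warmup ratio 0.1, 3 epochs.
training_args = TrainingArguments(
    output_dir="./lora-finetune",
    num_train_epochs=3,
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    optim="adamw_torch",
    per_device_train_batch_size=4,        # guess: batch size not given
    gradient_accumulation_steps=4,        # guess
    bf16=True,                            # guess
    logging_steps=10,
)

# train_dataset would be the paper's fine-tuning data, tokenized; omitted here.
# trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
# trainer.train()
```

If any of the guessed settings above differ from what you actually used, or differ between the two backbones, it would be very helpful to know.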
May I ask whether these hyperparameters and the overall training/inference recipe are exactly the same for LLaMA-3.1-8B-Instruct as for Qwen2.5-7B, or whether any settings differ between the two models?
If you have time to reply, I would really appreciate it.