diff --git a/README.md b/README.md
index 51680a83a7..626af2d452 100644
--- a/README.md
+++ b/README.md
@@ -1,5 +1,5 @@
 # arxiv-daily
- Automated deployment @ 2025-02-19 09:05:53 Asia/Taipei
+ Automated deployment @ 2025-02-19 20:34:11 Asia/Taipei
 > Welcome to contribute! Add your topics and keywords in [`topic.yml`](https://github.com/jawatech/arxiv-daily-in-place/blob/main/database/topic.yml).
 > You can also view historical data through the [storage](https://github.com/jawatech/arxiv-daily-in-place/blob/main/database/storage).
 
@@ -8,6 +8,7 @@
 ### Medical explainable AI
 |Publish Date|Title|Authors|Homepage|Code|
 | :---: | :---: | :---: | :---: | :---: |
+|**2025-02-14**|**3D ReX: Causal Explanations in 3D Neuroimaging Classification**|Melane Navaratnarajah et.al.|[2502.12181v1](http://arxiv.org/abs/2502.12181v1)|null|
 |**2025-02-13**|**Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration**|Flavio Bertini et.al.|[2502.09218v1](http://arxiv.org/abs/2502.09218v1)|null|
 |**2025-02-10**|**Foundation Model of Electronic Medical Records for Adaptive Risk Estimation**|Pawel Renc et.al.|[2502.06124v1](http://arxiv.org/abs/2502.06124v1)|null|
 |**2025-01-27**|**An Explainable Disease Surveillance System for Early Prediction of Multiple Chronic Diseases**|Shaheer Ahmad Khan et.al.|[2501.15969v1](http://arxiv.org/abs/2501.15969v1)|null|
@@ -107,9 +108,22 @@
 |**2024-02-28**|**Cause and Effect: Can Large Language Models Truly Understand Causality?**|Swagata Ashwani et.al.|[2402.18139v3](http://arxiv.org/abs/2402.18139v3)|null|
 |**2024-02-28**|**Artificial Intelligence and Diabetes Mellitus: An Inside Look Through the Retina**|Yasin Sadeghi Bazargani et.al.|[2402.18600v1](http://arxiv.org/abs/2402.18600v1)|null|
 |**2024-02-22**|**Multi-stakeholder Perspective on Responsible Artificial Intelligence and Acceptability in Education**|A. J. Karran et.al.|[2402.15027v2](http://arxiv.org/abs/2402.15027v2)|null|
-|**2024-02-12**|**Deciphering Heartbeat Signatures: A Vision Transformer Approach to Explainable Atrial Fibrillation Detection from ECG Signals**|Aruna Mohan et.al.|[2402.09474v2](http://arxiv.org/abs/2402.09474v2)|null|
 
 #### Abstracts
+##### **3D ReX: Causal Explanations in 3D Neuroimaging Classification**
+2502.12181v1 by Melane Navaratnarajah, Sophie A. Martin, David A. Kelly, Nathan Blake, Hana Chocker
+
+Explainability remains a significant problem for AI models in medical
+imaging, making it challenging for clinicians to trust AI-driven predictions.
+We introduce 3D ReX, the first causality-based post-hoc explainability tool for
+3D models. 3D ReX uses the theory of actual causality to generate
+responsibility maps which highlight the regions most crucial to the model's
+decision. We test 3D ReX on a stroke detection model, providing insight into
+the spatial distribution of features relevant to stroke.
+
+摘要：解釋性仍然是醫療影像中 AI 模型的一大問題，這使得臨床醫生難以信任 AI 驅動的預測。
+我們引入了 3D ReX，這是第一個用於 3D 模型的基於因果關係的事後解釋性工具。3D ReX 使用實際因果關係理論來生成責任圖，該圖突出了對模型決策至關重要的區域。我們在中風檢測模型上測試了 3D ReX，提供了與中風相關特徵的空間分佈的見解。
+
 ##### **Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration**
 2502.09218v1 by Flavio Bertini, Alessandro Dal Palù, Federica Zaglio, Francesco Fabiano, Andrea Formisano
 
@@ -2675,36 +2689,16 @@ characteristics, in addition to the diverse stakeholders' perceptions.
 
 摘要：這項研究從多個利害關係人的角度探討不同的人工智慧 (AI) 應用在教育上的可接受性，包括學生、老師和家長。承認 AI 在教育上的轉型潛力，它解決了與資料隱私、AI 代理、透明度、可解釋性和 AI 的道德部署相關的疑慮。透過小插曲方法，參與者被呈現了四種情境，其中 AI 的代理、透明度、可解釋性和隱私受到操縱。在每個情境後，參與者完成了一項調查，該調查捕捉了他們對 AI 的整體效用、個人效用、正義、信心、風險和如果可用，使用每個情境的 AI 的意圖的看法。資料蒐集包含來自合作機構和社群媒體活動的 1198 位多利害關係人參與者的最終樣本，並專注於對四個 AI 使用案例的個別回應。對資料的調解分析表明，對 AI 的接受度和信任在利害關係人團體之間有顯著差異。我們發現，AI 的代理、透明度和可解釋性高低程度之間的關鍵調解者，以及使用不同教育 AI 的意圖，包括感知到的整體效用、正義和信心。這項研究強調，接受 AI 在教育上的應用是一個微妙且多面向的問題，除了不同的利害關係人的看法外，還需要仔細考慮具體的 AI 應用及其特徵。
 
-##### **Deciphering Heartbeat Signatures: A Vision Transformer Approach to Explainable Atrial Fibrillation Detection from ECG Signals**
-2402.09474v2 by Aruna Mohan, Danne Elbers, Or Zilbershot, Fatemeh Afghah, David Vorchheimer
-
-Remote patient monitoring based on wearable single-lead electrocardiogram
-(ECG) devices has significant potential for enabling the early detection of
-heart disease, especially in combination with artificial intelligence (AI)
-approaches for automated heart disease detection. There have been prior studies
-applying AI approaches based on deep learning for heart disease detection.
-However, these models are yet to be widely accepted as a reliable aid for
-clinical diagnostics, in part due to the current black-box perception
-surrounding many AI algorithms. In particular, there is a need to identify the
-key features of the ECG signal that contribute toward making an accurate
-diagnosis, thereby enhancing the interpretability of the model. In the present
-study, we develop a vision transformer approach to identify atrial fibrillation
-based on single-lead ECG data. A residual network (ResNet) approach is also
-developed for comparison with the vision transformer approach. These models are
-applied to the Chapman-Shaoxing dataset to classify atrial fibrillation, as
-well as another common arrhythmia, sinus bradycardia, and normal sinus rhythm
-heartbeats. The models enable the identification of the key regions of the
-heartbeat that determine the resulting classification, and highlight the
-importance of P-waves and T-waves, as well as heartbeat duration and signal
-amplitude, in distinguishing normal sinus rhythm from atrial fibrillation and
-sinus bradycardia.
-
-摘要：<paragraph>基於可穿戴式單導程心電圖 (ECG) 裝置的遠端病患監測在早期偵測心臟疾病方面具有顯著的潛力，特別是與用於自動化心臟疾病偵測的人工智慧 (AI) 方法結合使用時。先前已有研究應用基於深度學習的 AI 方法進行心臟疾病偵測。然而，這些模型尚未被廣泛接受為臨床診斷的可靠輔助工具，部分原因在於圍繞許多 AI 演算法的當前黑箱感知。特別是，有必要找出有助於做出準確診斷的 ECG 訊號關鍵特徵，從而增強模型的可解釋性。在本研究中，我們開發了一種視覺轉換器方法，以根據單導程 ECG 資料找出心房顫動。殘差網路 (ResNet) 方法也已開發出來，以便與視覺轉換器方法進行比較。這些模型應用於 Chapman-Shaoxing 資料集，以分類心房顫動，以及另一種常見的心律不整，竇性心動過緩，和正常竇性心律的心跳。這些模型能夠找出決定最終分類的心跳關鍵區域，並強調 P 波和 T 波，以及心跳持續時間和訊號振幅在區分正常竇性心律與心房顫動和竇性心動過緩方面的重要性。</paragraph>
-
 
 ### Medical
 |Publish Date|Title|Authors|Homepage|Code|
 | :---: | :---: | :---: | :---: | :---: |
+|**2025-02-18**|**Sleepless Nights, Sugary Days: Creating Synthetic Users with Health Conditions for Realistic Coaching Agent Interactions**|Taedong Yun et.al.|[2502.13135v1](http://arxiv.org/abs/2502.13135v1)|null|
+|**2025-02-18**|**Improving Clinical Question Answering with Multi-Task Learning: A Joint Approach for Answer Extraction and Medical Categorization**|Priyaranjan Pattnayak et.al.|[2502.13108v1](http://arxiv.org/abs/2502.13108v1)|null|
+|**2025-02-18**|**Fake It Till You Make It: Using Synthetic Data and Domain Knowledge for Improved Text-Based Learning for LGE Detection**|Athira J Jacob et.al.|[2502.12948v1](http://arxiv.org/abs/2502.12948v1)|null|
+|**2025-02-18**|**Reasoning and the Trusting Behavior of DeepSeek and GPT: An Experiment Revealing Hidden Fault Lines in Large Language Models**|Rubing Lu et.al.|[2502.12825v1](http://arxiv.org/abs/2502.12825v1)|null|
+|**2025-02-18**|**LLM Safety for Children**|Prasanjit Rath et.al.|[2502.12552v1](http://arxiv.org/abs/2502.12552v1)|null|
+|**2025-02-17**|**Classifiers of Data Sharing Statements in Clinical Trial Records**|Saber Jelodari Mamaghani et.al.|[2502.12362v1](http://arxiv.org/abs/2502.12362v1)|null|
 |**2025-02-17**|**Relational Norms for Human-AI Cooperation**|Brian D. Earp et.al.|[2502.12102v1](http://arxiv.org/abs/2502.12102v1)|null|
 |**2025-02-17**|**Deep Spatio-Temporal Neural Network for Air Quality Reanalysis**|Ammar Kheder et.al.|[2502.11941v1](http://arxiv.org/abs/2502.11941v1)|null|
 |**2025-02-17**|**Proactive Depot Discovery: A Generative Framework for Flexible Location-Routing**|Site Qu et.al.|[2502.11715v1](http://arxiv.org/abs/2502.11715v1)|null|
@@ -2716,17 +2710,19 @@ sinus bradycardia.
 |**2025-02-16**|**A Survey of LLM-based Agents in Medicine: How far are we from Baymax?**|Wenxuan Wang et.al.|[2502.11211v1](http://arxiv.org/abs/2502.11211v1)|null|
 |**2025-02-16**|**RT-DEMT: A hybrid real-time acupoint detection model combining mamba and transformer**|Shilong Yang et.al.|[2502.11179v1](http://arxiv.org/abs/2502.11179v1)|null|
 |**2025-02-16**|**Knowledge Graph-Driven Retrieval-Augmented Generation: Integrating Deepseek-R1 with Weaviate for Advanced Chatbot Applications**|Alexandru Lecu et.al.|[2502.11108v1](http://arxiv.org/abs/2502.11108v1)|null|
+|**2025-02-16**|**Predicting Depression in Screening Interviews from Interactive Multi-Theme Collaboration**|Xianbing Zhao et.al.|[2502.12204v1](http://arxiv.org/abs/2502.12204v1)|null|
 |**2025-02-16**|**CL-MFAP: A Contrastive Learning-Based Multimodal Foundation Model for Molecular Property Prediction and Antibiotic Screening**|Gen Zhou et.al.|[2502.11001v1](http://arxiv.org/abs/2502.11001v1)|null|
 |**2025-02-15**|**Automatic Quality Assessment of First Trimester Crown-Rump-Length Ultrasound Images**|Sevim Cengiz et.al.|[2502.10908v1](http://arxiv.org/abs/2502.10908v1)|null|
 |**2025-02-15**|**Breaking Down the Hierarchy: A New Approach to Leukemia Classification**|Ibraheem Hamdi et.al.|[2502.10899v1](http://arxiv.org/abs/2502.10899v1)|null|
 |**2025-02-15**|**An Empirical Analysis of Uncertainty in Large Language Model Evaluations**|Qiujie Xie et.al.|[2502.10709v1](http://arxiv.org/abs/2502.10709v1)|null|
-|**2025-02-15**|**Reading Your Heart: Learning ECG Words and Sentences via Pre-training ECG Language Model**|Jiarui Jin et.al.|[2502.10707v1](http://arxiv.org/abs/2502.10707v1)|null|
+|**2025-02-15**|**Reading Your Heart: Learning ECG Words and Sentences via Pre-training ECG Language Model**|Jiarui Jin et.al.|[2502.10707v1](http://arxiv.org/abs/2502.10707v1)|[link](https://github.com/pkudigitalhealth/heartlang)|
 |**2025-02-15**|**Self-Explaining Hypergraph Neural Networks for Diagnosis Prediction**|Leisheng Yu et.al.|[2502.10689v1](http://arxiv.org/abs/2502.10689v1)|null|
 |**2025-02-15**|**ProMRVL-CAD: Proactive Dialogue System with Multi-Round Vision-Language Interactions for Computer-Aided Diagnosis**|Xueshen Li et.al.|[2502.10620v1](http://arxiv.org/abs/2502.10620v1)|null|
 |**2025-02-15**|**Optimizing CNN Architectures for Advanced Thoracic Disease Classification**|Tejas Mirthipati et.al.|[2502.10614v1](http://arxiv.org/abs/2502.10614v1)|null|
 |**2025-02-14**|**PolyPath: Adapting a Large Multimodal Model for Multi-slide Pathology Report Generation**|Faruk Ahmed et.al.|[2502.10536v1](http://arxiv.org/abs/2502.10536v1)|null|
 |**2025-02-14**|**Tempo: Helping Data Scientists and Domain Experts Collaboratively Specify Predictive Modeling Tasks**|Venkatesh Sivaraman et.al.|[2502.10526v1](http://arxiv.org/abs/2502.10526v1)|null|
 |**2025-02-14**|**A Robust Attack: Displacement Backdoor Attack**|Yong Li et.al.|[2502.10490v1](http://arxiv.org/abs/2502.10490v1)|null|
+|**2025-02-14**|**3D ReX: Causal Explanations in 3D Neuroimaging Classification**|Melane Navaratnarajah et.al.|[2502.12181v1](http://arxiv.org/abs/2502.12181v1)|null|
 |**2025-02-14**|**Analyzing Patient Daily Movement Behavior Dynamics Using Two-Stage Encoding Model**|Jin Cui et.al.|[2502.09947v1](http://arxiv.org/abs/2502.09947v1)|null|
 |**2025-02-14**|**TransGUNet: Transformer Meets Graph-based Skip Connection for Medical Image Segmentation**|Ju-Hyeon Nam et.al.|[2502.09931v1](http://arxiv.org/abs/2502.09931v1)|null|
 |**2025-02-14**|**Video2Policy: Scaling up Manipulation Tasks in Simulation through Internet Videos**|Weirui Ye et.al.|[2502.09886v1](http://arxiv.org/abs/2502.09886v1)|null|
@@ -2743,6 +2739,7 @@ sinus bradycardia.
 |**2025-02-13**|**Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration**|Flavio Bertini et.al.|[2502.09218v1](http://arxiv.org/abs/2502.09218v1)|null|
 |**2025-02-13**|**Logical Lease Litigation: Prolog and LLMs for Rental Law Compliance in New York**|Sanskar Sehgal et.al.|[2502.09204v1](http://arxiv.org/abs/2502.09204v1)|null|
 |**2025-02-13**|**Two-Stage Representation Learning for Analyzing Movement Behavior Dynamics in People Living with Dementia**|Jin Cui et.al.|[2502.09173v1](http://arxiv.org/abs/2502.09173v1)|null|
+|**2025-02-13**|**TastepepAI, An artificial intelligence platform for taste peptide de novo design**|Jianda Yue et.al.|[2502.12167v1](http://arxiv.org/abs/2502.12167v1)|null|
 |**2025-02-12**|**HistoSmith: Single-Stage Histology Image-Label Generation via Conditional Latent Diffusion for Enhanced Cell Segmentation and Classification**|Valentina Vadori et.al.|[2502.08754v1](http://arxiv.org/abs/2502.08754v1)|[link](https://github.com/Vadori/CytoArk)|
 |**2025-02-12**|**Brain Latent Progression: Individual-based Spatiotemporal Disease Progression on 3D Brain MRIs via Latent Diffusion**|Lemuel Puglisi et.al.|[2502.08560v1](http://arxiv.org/abs/2502.08560v1)|[link](https://github.com/lemuelpuglisi/brlp)|
 |**2025-02-12**|**Representation Learning to Advance Multi-institutional Studies with Electronic Health Record Data**|Doudou Zhou et.al.|[2502.08547v1](http://arxiv.org/abs/2502.08547v1)|null|
@@ -2754,7 +2751,7 @@ sinus bradycardia.
 |**2025-02-11**|**Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension**|Wenbo Gong et.al.|[2502.07752v1](http://arxiv.org/abs/2502.07752v1)|null|
 |**2025-02-11**|**The Devil is in the Prompts: De-Identification Traces Enhance Memorization Risks in Synthetic Chest X-Ray Generation**|Raman Dutt et.al.|[2502.07516v2](http://arxiv.org/abs/2502.07516v2)|[link](https://github.com/Raman1121/diffusion_memorization)|
 |**2025-02-11**|**KPIs 2024 Challenge: Advancing Glomerular Segmentation from Patch- to Slide-Level**|Ruining Deng et.al.|[2502.07288v1](http://arxiv.org/abs/2502.07288v1)|[link](https://github.com/agaldran/kpis)|
-|**2025-02-11**|**Early Risk Prediction of Pediatric Cardiac Arrest from Electronic Health Records via Multimodal Fused Transformer**|Jiaying Lu et.al.|[2502.07158v1](http://arxiv.org/abs/2502.07158v1)|null|
+|**2025-02-11**|**Early Risk Prediction of Pediatric Cardiac Arrest from Electronic Health Records via Multimodal Fused Transformer**|Jiaying Lu et.al.|[2502.07158v2](http://arxiv.org/abs/2502.07158v2)|null|
 |**2025-02-11**|**Explaining 3D Computed Tomography Classifiers with Counterfactuals**|Joseph Paul Cohen et.al.|[2502.07156v1](http://arxiv.org/abs/2502.07156v1)|[link](https://github.com/ieee8023/ct-counterfactuals)|
 |**2025-02-10**|**Interactive Data Harmonization with LLM Agents**|Aécio Santos et.al.|[2502.07132v1](http://arxiv.org/abs/2502.07132v1)|null|
 |**2025-02-10**|**Machine Learning for Everyone: Simplifying Healthcare Analytics with BigQuery ML**|Mohammad Amir Salari et.al.|[2502.07026v1](http://arxiv.org/abs/2502.07026v1)|null|
@@ -2796,17 +2793,150 @@ sinus bradycardia.
 |**2025-02-06**|**A Retrospective Systematic Study on Hierarchical Sparse Query Transformer-assisted Ultrasound Screening for Early Hepatocellular Carcinoma**|Chaoyin She et.al.|[2502.03772v1](http://arxiv.org/abs/2502.03772v1)|[link](https://github.com/Asunatan/HSQformer)|
 |**2025-02-05**|**Towards Fair Medical AI: Adversarial Debiasing of 3D CT Foundation Embeddings**|Guangyao Zheng et.al.|[2502.04386v1](http://arxiv.org/abs/2502.04386v1)|[link](https://github.com/BioIntelligence-Lab/VAE-Adversarial-Debiasing)|
 |**2025-02-05**|**Clinically-Inspired Hierarchical Multi-Label Classification of Chest X-rays with a Penalty-Based Loss Function**|Mehrdad Asadi et.al.|[2502.03591v1](http://arxiv.org/abs/2502.03591v1)|[link](https://github.com/the-mercury/CIHMLC)|
-|**2025-02-05**|**Code Simulation as a Proxy for High-order Tasks in Large Language Models**|Emanuele La Malfa et.al.|[2502.03568v2](http://arxiv.org/abs/2502.03568v2)|null|
-|**2025-02-05**|**Limitations of Large Language Models in Clinical Problem-Solving Arising from Inflexible Reasoning**|Jonathan Kim et.al.|[2502.04381v1](http://arxiv.org/abs/2502.04381v1)|null|
-|**2025-02-05**|**Accurate AI-Driven Emergency Vehicle Location Tracking in Healthcare ITS Digital Twin**|Sarah Al-Shareeda et.al.|[2502.03396v1](http://arxiv.org/abs/2502.03396v1)|null|
-|**2025-02-05**|**RadVLM: A Multitask Conversational Vision-Language Model for Radiology**|Nicolas Deperrois et.al.|[2502.03333v1](http://arxiv.org/abs/2502.03333v1)|null|
-|**2025-02-05**|**MeDiSumQA: Patient-Oriented Question-Answer Generation from Discharge Letters**|Amin Dada et.al.|[2502.03298v1](http://arxiv.org/abs/2502.03298v1)|null|
-|**2025-02-05**|**Deep Learning Pipeline for Fully Automated Myocardial Infarct Segmentation from Clinical Cardiac MR Scans**|Matthias Schwab et.al.|[2502.03272v1](http://arxiv.org/abs/2502.03272v1)|null|
-|**2025-02-05**|**Long-tailed Medical Diagnosis with Relation-aware Representation Learning and Iterative Classifier Calibration**|Li Pan et.al.|[2502.03238v2](http://arxiv.org/abs/2502.03238v2)|[link](https://github.com/peterlipan/lmd)|
-|**2025-02-05**|**Fine-Tuning Strategies for Continual Online EEG Motor Imagery Decoding: Insights from a Large-Scale Longitudinal Study**|Martin Wimpff et.al.|[2502.06828v1](http://arxiv.org/abs/2502.06828v1)|[link](https://github.com/martinwimpff/eeg-continual)|
-|**2025-02-05**|**MedBioLM: Optimizing Medical and Biological QA with Fine-Tuned Large Language Models and Retrieval-Augmented Generation**|Seonok Kim et.al.|[2502.03004v1](http://arxiv.org/abs/2502.03004v1)|null|
 
 #### Abstracts
+##### **Sleepless Nights, Sugary Days: Creating Synthetic Users with Health Conditions for Realistic Coaching Agent Interactions**
+2502.13135v1 by Taedong Yun, Eric Yang, Mustafa Safdari, Jong Ha Lee, Vaishnavi Vinod Kumar, S. Sara Mahdavi, Jonathan Amar, Derek Peyton, Reut Aharony, Andreas Michaelides, Logan Schneider, Isaac Galatzer-Levy, Yugang Jia, John Canny, Arthur Gretton, Maja Matarić
+
+We present an end-to-end framework for generating synthetic users for
+evaluating interactive agents designed to encourage positive behavior changes,
+such as in health and lifestyle coaching. The synthetic users are grounded in
+health and lifestyle conditions, specifically sleep and diabetes management in
+this study, to ensure realistic interactions with the health coaching agent.
+Synthetic users are created in two stages: first, structured data are generated
+grounded in real-world health and lifestyle factors in addition to basic
+demographics and behavioral attributes; second, full profiles of the synthetic
+users are developed conditioned on the structured data. Interactions between
+synthetic users and the coaching agent are simulated using generative
+agent-based models such as Concordia, or directly by prompting a language
+model. Using two independently-developed agents for sleep and diabetes coaching
+as case studies, the validity of this framework is demonstrated by analyzing
+the coaching agent's understanding of the synthetic users' needs and
+challenges. Finally, through multiple blinded evaluations of user-coach
+interactions by human experts, we demonstrate that our synthetic users with
+health and behavioral attributes more accurately portray real human users with
+the same attributes, compared to generic synthetic users not grounded in such
+attributes. The proposed framework lays the foundation for efficient
+development of conversational agents through extensive, realistic, and grounded
+simulated interactions.
+
+摘要：我們提供了一個端到端的架構，用於為評估互動式代理生成合成使用者，這些代理旨在鼓勵正向行為改變，例如健康和生活方式指導。合成使用者以健康和生活方式狀況為基礎，特別是本研究中的睡眠和糖尿病管理，以確保與健康指導代理的互動具有真實性。合成使用者分兩個階段建立：首先，除了基本人口統計資料和行為屬性外，還會產生以現實世界的健康和生活方式因素為基礎的結構化資料；其次，會根據結構化資料開發合成使用者的完整個人資料。合成使用者和指導代理之間的互動是使用生成式基於代理的模型（例如 Concordia）模擬的，或者直接通過提示語言模型來模擬。使用兩個獨立開發的睡眠和糖尿病指導代理作為案例研究，通過分析指導代理對合成使用者需求和挑戰的理解，證明了此架構的有效性。最後，通過人類專家對使用者指導互動進行多重盲測評估，我們證明了與未以這些屬性為基礎的通用合成使用者相比，具有健康和行為屬性的合成使用者更準確地描繪了具有相同屬性的真實人類使用者。所提出的架構為通過廣泛、真實且有根據的模擬互動，為對話代理的有效開發奠定了基礎。
+
+##### **Improving Clinical Question Answering with Multi-Task Learning: A Joint Approach for Answer Extraction and Medical Categorization**
+2502.13108v1 by Priyaranjan Pattnayak, Hitesh Laxmichand Patel, Amit Agarwal, Bhargava Kumar, Srikant Panda, Tejaswini Kumar
+
+Clinical Question Answering (CQA) plays a crucial role in medical
+decision-making, enabling physicians to extract relevant information from
+Electronic Medical Records (EMRs). While transformer-based models such as BERT,
+BioBERT, and ClinicalBERT have demonstrated state-of-the-art performance in
+CQA, existing models lack the ability to categorize extracted answers, which is
+critical for structured retrieval, content filtering, and medical decision
+support.
+  To address this limitation, we introduce a Multi-Task Learning (MTL)
+framework that jointly trains CQA models for both answer extraction and medical
+categorization. In addition to predicting answer spans, our model classifies
+responses into five standardized medical categories: Diagnosis, Medication,
+Symptoms, Procedure, and Lab Reports. This categorization enables more
+structured and interpretable outputs, making clinical QA models more useful in
+real-world healthcare settings.
+  We evaluate our approach on emrQA, a large-scale dataset for medical question
+answering. Results show that MTL improves F1-score by 2.2% compared to standard
+fine-tuning, while achieving 90.7% accuracy in answer categorization. These
+findings suggest that MTL not only enhances CQA performance but also introduces
+an effective mechanism for categorization and structured medical information
+retrieval.
+
+摘要：臨床問答 (CQA) 在醫療決策中扮演著至關重要的角色，讓醫師能夠從電子病歷 (EMR) 中擷取相關資訊。儘管 BERT、BioBERT 和 ClinicalBERT 等基於轉換器的模型已在 CQA 中展現出最先進的效能，但現有的模型缺乏分類擷取答案的能力，這對於結構化檢索、內容過濾和醫療決策支援至關重要。
+  為了解決這個限制，我們引進了一個多任務學習 (MTL) 架構，它同時訓練 CQA 模型用於答案擷取和醫療分類。除了預測答案範圍，我們的模型將回應分類為五個標準化醫療類別：診斷、藥物、症狀、程序和實驗室報告。這種分類能產生更結構化且易於理解的輸出，讓臨床問答模型在真實世界的醫療保健環境中更實用。
+  我們在 emrQA 上評估我們的做法，emrQA 是用於醫療問題解答的大規模資料集。結果顯示，與標準微調相比，MTL 將 F1 分數提高了 2.2%，同時在答案分類中達到 90.7% 的準確度。這些發現表明，MTL 不僅增強了 CQA 的效能，還引入了一種分類和結構化醫療資訊檢索的有效機制。
+
+##### **Fake It Till You Make It: Using Synthetic Data and Domain Knowledge for Improved Text-Based Learning for LGE Detection**
+2502.12948v1 by Athira J Jacob, Puneet Sharma, Daniel Rueckert
+
+Detection of hyperenhancement from cardiac late gadolinium enhancement (LGE) MRI images is a complex task
+requiring significant clinical expertise. Although deep learning-based models
+have shown promising results for the task, they require large amounts of data
+with fine-grained annotations. Clinical reports generated for cardiac MR
+studies contain rich, clinically relevant information, including the location,
+extent and etiology of any scars present. Although recently developed
+CLIP-based training enables pretraining models with image-text pairs, it
+requires large amounts of data and further finetuning strategies on downstream
+tasks. In this study, we use various strategies rooted in domain knowledge to
+train a model for LGE detection solely using text from clinical reports, on a
+relatively small clinical cohort of 965 patients. We improve performance
+through the use of synthetic data augmentation, by systematically creating scar
+images and associated text. In addition, we standardize the orientation of the
+images in an anatomy-informed way to enable better alignment of spatial and
+text features. We also use a captioning loss to enable fine-grained supervision
+and explore the effect of pretraining of the vision encoder on performance.
+Finally, ablation studies are carried out to elucidate the contributions of
+each design component to the overall performance of the model.
+
+摘要：從心臟 LGE MRI 影像偵測出過度增強是一項複雜的任務，需要顯著的臨床專業知識。儘管基於深度學習的模型已顯示出對這項任務有前景的結果，但它們需要大量具有細緻註解的資料。為心臟 MR 研究產生的臨床報告包含豐富且臨床上相關的資訊，包括任何疤痕的位置、範圍和病因。儘管最近開發的基於 CLIP 的訓練能使用影像文字對預訓練模型，但它需要大量資料和進一步微調下游任務的策略。在這項研究中，我們使用植基於領域知識的各種策略，僅使用來自臨床報告的文字，在一個相對較小的 965 名患者臨床群體中訓練一個 LGE 偵測模型。我們透過使用合成資料擴充來改善效能，系統性地建立疤痕影像和相關文字。此外，我們以解剖學告知的方式標準化影像方向，以使空間和文字特徵能更好地對齊。我們也使用標題損失來啟用細緻的監督，並探討視覺編碼器的預訓練對效能的影響。最後，進行消融研究以闡明每個設計元件對模型整體效能的貢獻。
+
+##### **Reasoning and the Trusting Behavior of DeepSeek and GPT: An Experiment Revealing Hidden Fault Lines in Large Language Models**
+2502.12825v1 by Rubing Lu, João Sedoc, Arun Sundararajan
+
+When encountering increasingly frequent performance improvements or cost
+reductions from a new large language model (LLM), developers of applications
+leveraging LLMs must decide whether to take advantage of these improvements or
+stay with older tried-and-tested models. Low perceived switching frictions can
+lead to choices that do not consider more subtle behavior changes that the
+transition may induce. Our experiments use a popular game-theoretic behavioral
+economics model of trust to show stark differences in the trusting behavior of
+OpenAI's and DeepSeek's models. We highlight a collapse in the economic trust
+behavior of the o1-mini and o3-mini models as they reconcile profit-maximizing
+and risk-seeking with future returns from trust, and contrast it with
+DeepSeek's more sophisticated and profitable trusting behavior that stems from
+an ability to incorporate deeper concepts like forward planning and
+theory-of-mind. As LLMs form the basis for high-stakes commercial systems, our
+results highlight the perils of relying on LLM performance benchmarks that are
+too narrowly defined and suggest that careful analysis of their hidden fault
+lines should be part of any organization's AI strategy.
+
+摘要：當遇到越來越頻繁的效能提升或來自於新的大型語言模型 (LLM) 的成本降低時，利用 LLM 的應用程式開發人員必須決定是否要利用這些提升或維持較舊且經過測試的模型。低感知切換摩擦可能會導致選擇不考慮轉換可能誘發的更細微的行為改變。我們的實驗使用信任的流行博弈論行為經濟模型來顯示 OpenAI 和 DeepSeek 模型在信任行為上的顯著差異。我們強調 o1-mini 和 o3-mini 模型的經濟信任行為崩潰，因為它們調和了利潤最大化和風險尋求與來自信任的未來回報，並將其與 DeepSeek 更複雜且有利可圖的信任行為進行對比，這種信任行為源於整合更深層的概念，例如前瞻性規劃和心智理論。由於 LLM 構成高風險商業系統的基礎，我們的結果突顯了依賴定義過於狹窄的 LLM 效能基準的危險性，並建議仔細分析其隱藏的斷層線應該是任何組織的 AI 策略的一部分。
+
+##### **LLM Safety for Children**
+2502.12552v1 by Prasanjit Rath, Hari Shrawgi, Parag Agrawal, Sandipan Dandapat
+
+This paper analyzes the safety of Large Language Models (LLMs) in
+interactions with children below the age of 18 years. Despite the transformative
+applications of LLMs in various aspects of children's lives such as education
+and therapy, there remains a significant gap in understanding and mitigating
+potential content harms specific to this demographic. The study acknowledges
+the diverse nature of children often overlooked by standard safety evaluations
+and proposes a comprehensive approach to evaluating LLM safety specifically for
+children. We list potential risks that children may encounter when using
+LLM-powered applications. Additionally, we develop Child User Models that
+reflect the varied personalities and interests of children, informed by
+literature in child care and psychology. These user models aim to bridge the
+existing gap in child safety literature across various fields. We use Child
+User Models to evaluate the safety of six state-of-the-art LLMs. Our
+observations reveal significant safety gaps in LLMs, particularly in categories
+harmful to children but not adults.
+
+摘要：本文分析了大型語言模型 (LLM) 在與 18 歲以下兒童互動時的安全性。儘管 LLM 在兒童生活的各個方面（例如教育和治療）都有轉變性的應用，但在了解和減輕對這個群體具體的潛在內容危害方面仍然存在顯著差距。研究承認兒童的多樣性，而標準安全評估通常會忽略這些多樣性，並提出了一種針對兒童評估 LLM 安全性的綜合方法。我們列出了兒童在使用由 LLM 提供動力的應用程式時可能遇到的潛在風險。此外，我們開發了兒童使用者模型，這些模型反映了兒童不同的個性特質和興趣，並參考了兒童照護和心理學的文獻。這些使用者模型旨在彌合不同領域兒童安全文獻中現有的差距。我們利用兒童使用者模型來評估六個最先進的 LLM 的安全性。我們的觀察結果揭示了 LLM 中的重大安全漏洞，特別是在對兒童有害但對成年人無害的類別中
+
+##### **Classifiers of Data Sharing Statements in Clinical Trial Records**
+2502.12362v1 by Saber Jelodari Mamaghani, Cosima Strantz, Dennis Toddenroth
+
+Digital individual participant data (IPD) from clinical trials are
+increasingly distributed for potential scientific reuse. The identification of
+available IPD, however, requires interpretations of textual data-sharing
+statements (DSS) in large databases. Recent advancements in computational
+linguistics include pre-trained language models that promise to simplify the
+implementation of effective classifiers based on textual inputs. In a subset of
+5,000 textual DSS from ClinicalTrials.gov, we evaluate how well classifiers
+based on domain-specific pre-trained language models reproduce original
+availability categories as well as manually annotated labels. Typical metrics
+indicate that classifiers that predicted manual annotations outperformed those
+that learned to output the original availability categories. This suggests that
+the textual DSS descriptions contain applicable information that the
+availability categories do not, and that such classifiers could thus aid the
+automatic identification of available IPD in large trial databases.
+
+摘要：臨床試驗的數位個人參與者資料 (IPD) 愈來愈廣泛地用於潛在的科學再利用。然而，要找出可用的 IPD，需要對大型資料庫中的文字資料共享聲明 (DSS) 進行詮釋。計算語言學最近的進展包括預先訓練的語言模型，有望簡化根據文字輸入實作有效分類器的過程。在 ClinicalTrials.gov 中的 5,000 個文字 DSS 子集中，我們評估了基於特定領域預先訓練語言模型的分類器，在重現原始可用性類別以及手動註解標籤方面的表現。典型的指標顯示，預測手動註解的分類器優於學會輸出原始可用性類別的分類器。這表示文字 DSS 說明包含可用性類別所沒有的適用資訊，而且此類分類器因此有助於在大型試驗資料庫中自動找出可用的 IPD。
+
 ##### **Relational Norms for Human-AI Cooperation**
 2502.12102v1 by Brian D. Earp, Sebastian Porsdam Mann, Mateo Aboy, Edmond Awad, Monika Betzler, Marietjie Botes, Rachel Calcott, Mina Caraccio, Nick Chater, Mark Coeckelbergh, Mihaela Constantinescu, Hossein Dabbagh, Kate Devlin, Xiaojun Ding, Vilius Dranseika, Jim A. C. Everett, Ruiping Fan, Faisal Feroz, Kathryn B. Francis, Cindy Friedman, Orsolya Friedrich, Iason Gabriel, Ivar Hannikainen, Julie Hellmann, Arasj Khodadade Jahrome, Niranjan S. Janardhanan, Paul Jurcys, Andreas Kappes, Maryam Ali Khan, Gordon Kraft-Todd, Maximilian Kroner Dale, Simon M. Laham, Benjamin Lange, Muriel Leuenberger, Jonathan Lewis, Peng Liu, David M. Lyreskog, Matthijs Maas, John McMillan, Emilian Mihailov, Timo Minssen, Joshua Teperowski Monrad, Kathryn Muyskens, Simon Myers, Sven Nyholm, Alexa M. Owen, Anna Puzio, Christopher Register, Madeline G. Reinecke, Adam Safron, Henry Shevlin, Hayate Shimizu, Peter V. Treit, Cristina Voinea, Karen Yan, Anda Zahiu, Renwen Zhang, Hazem Zohny, Walter Sinnott-Armstrong, Ilina Singh, Julian Savulescu, Margaret S. Clark
 
@@ -3096,6 +3226,28 @@ chatbot applications.
 
 摘要：大型語言模型 (LLM) 已大幅推動自然語言生成的領域。然而，它們經常產生未經驗證的輸出，這會損害它們在關鍵應用中的可靠性。在本研究中，我們提出了一個創新的框架，透過檢索增強生成技術，將結構化的生物醫學知識與 LLM 結合。我們的系統透過識別和精煉與年齡相關性黃斑部病變 (AMD) 相關的醫學摘要中的因果關係和命名實體，開發一個徹底的知識圖譜。我們的框架使用基於向量的檢索流程和本地部署的語言模型，產生在脈絡上相關且可驗證的回應，並直接參考臨床證據。實驗結果顯示，此方法顯著減少了幻覺、增強了事實準確性，並改善了生成回應的清晰度，為先進的生物醫學聊天機器人應用程式提供了穩健的解決方案。
 
+##### **Predicting Depression in Screening Interviews from Interactive Multi-Theme Collaboration**
+2502.12204v1 by Xianbing Zhao, Yiqing Lyu, Di Wang, Buzhou Tang
+
+Automatic depression detection provides cues for early clinical intervention
+by clinicians. Clinical interviews for depression detection involve dialogues
+centered around multiple themes. Existing studies primarily design end-to-end
+neural network models to capture the hierarchical structure of clinical
+interview dialogues. However, these methods exhibit defects in modeling the
+thematic content of clinical interviews: 1) they fail to capture intra-theme
+and inter-theme correlation explicitly, and 2) they do not allow clinicians to
+intervene and focus on themes of interest. To address these issues, this paper
+introduces PDIMC, an interactive depression detection framework. This framework
+leverages in-context learning techniques to identify themes in clinical
+interviews and then models both intra-theme and inter-theme correlation.
+Additionally, it employs AI-driven feedback to simulate the interests of
+clinicians, enabling interactive adjustment of theme importance. PDIMC achieves
+absolute improvements of 35% and 12% compared to the state-of-the-art on the
+depression detection dataset DAIC-WOZ, which demonstrates the effectiveness of
+modeling theme correlation and incorporating interactive external feedback.
+
+摘要：自動憂鬱症偵測提供臨床醫師早期臨床介入的線索。憂鬱症偵測的臨床訪談涉及以多個主題為中心的對話。現有研究主要設計端對端的類神經網路模型來捕捉臨床訪談對話的階層結構。然而，這些方法在建模臨床訪談的主題內容時表現出缺陷：1）它們無法明確捕捉主題內和主題間的關聯性，以及 2）它們不允許臨床醫師介入並專注於感興趣的主題。為了解決這些問題，本文介紹了一個互動式憂鬱症偵測框架。此框架利用情境學習技術來識別臨床訪談中的主題，然後對主題內和主題間的關聯性進行建模。此外，它採用 AI 驅動的回饋來模擬臨床醫師的興趣，實現主題重要性的互動式調整。與 DAIC-WOZ 憂鬱症偵測資料集上的最新技術相比，PDIMC 的絕對改進率分別為 35% 和 12%，這證明了對主題關聯性建模和納入互動式外部回饋的有效性。
+
 ##### **CL-MFAP: A Contrastive Learning-Based Multimodal Foundation Model for Molecular Property Prediction and Antibiotic Screening**
 2502.11001v1 by Gen Zhou, Sugitha Janarthanan, Yutong Lu, Pingzhao Hu
 
@@ -3364,6 +3516,20 @@ differences, such as rotation and cropping.
 
 摘要：随着人工智能在我们的生活中变得越来越普遍，人们正在享受它带来的便利，但也面临着隐藏的威胁，例如数据中毒和对抗性攻击。这些威胁可能对人工智能的应用产生灾难性后果，特别是对于一些立即生效的应用，例如自动驾驶和医疗领域。在这些威胁中，后门攻击以其隐蔽性和简单的部署给人们留下了深刻的印象，使其成为不可忽视的威胁，然而，在部署后门模型的过程中，后门攻击往往存在一些使其在实际应用中不尽如人意的原因，例如抖动和亮度变化。基于此，我们提出了一种高度鲁棒的后门攻击，该攻击对目标样本进行平移并将其与自身结合以形成后门样本，即置换后门攻击 (DBA)。实验结果表明，DBA 攻击可以抵抗模拟真实世界差异的数据增强，例如旋转和裁剪。
 
+##### **3D ReX: Causal Explanations in 3D Neuroimaging Classification**
+2502.12181v1 by Melane Navaratnarajah, Sophie A. Martin, David A. Kelly, Nathan Blake, Hana Chocker
+
+Explainability remains a significant problem for AI models in medical
+imaging, making it challenging for clinicians to trust AI-driven predictions.
+We introduce 3D ReX, the first causality-based post-hoc explainability tool for
+3D models. 3D ReX uses the theory of actual causality to generate
+responsibility maps which highlight the regions most crucial to the model's
+decision. We test 3D ReX on a stroke detection model, providing insight into
+the spatial distribution of features relevant to stroke.
+
+摘要：解釋性仍然是醫療影像中 AI 模型的一大問題，這使得臨床醫生難以信任 AI 驅動的預測。
+我們引入了 3D ReX，這是第一個用於 3D 模型的基於因果關係的事後解釋性工具。3D ReX 使用實際因果關係理論來生成責任圖，該圖突出了對模型決策至關重要的區域。我們在中風檢測模型上測試了 3D ReX，提供了與中風相關特徵的空間分佈的見解。
+
 ##### **Analyzing Patient Daily Movement Behavior Dynamics Using Two-Stage Encoding Model**
 2502.09947v1 by Jin Cui, Alexander Capstick, Payam Barnaghi, Gregory Scott
 
@@ -3749,6 +3915,32 @@ care interventions, and large-scale health monitoring.
 
 摘要：在遠程醫療監控中，時間序列表示學習揭示了高頻率數據中的關鍵患者行為模式。本研究通過提出一個兩階段、自我監督的學習方法來分析痴呆症患者的家庭活動數據，該方法專門用於發現低秩結構。第一階段將時間序列活動轉換為由預訓練語言模型編碼的文本序列，使用基於 PageRank 的方法提供了一個豐富、高維的潛在狀態空間。此 PageRank 向量捕獲潛在狀態轉換，有效地將複雜的行為數據壓縮成簡潔的形式，從而增強了解力。此低秩表示不僅增強了模型的可解釋性，還促進了聚類和轉換分析，揭示了與臨床指標（例如 MMSE 和 ADAS-COG 分數）相關的關鍵行為模式。我們的研究結果證明了該框架在支持認知狀態預測、個性化護理干預和大型健康監控方面的潛力。
 
+##### **TastepepAI, An artificial intelligence platform for taste peptide de novo design**
+2502.12167v1 by Jianda Yue, Tingting Li, Jian Ouyang, Jiawei Xu, Hua Tan, Zihui Chen, Changsheng Han, Huanyu Li, Songping Liang, Zhonghua Liu, Zhonghua Liu, Ying Wang
+
+Taste peptides have emerged as promising natural flavoring agents attributed
+to their unique organoleptic properties, high safety profile, and potential
+health benefits. However, the de novo identification of taste peptides derived
+from animal, plant, or microbial sources remains a time-consuming and
+resource-intensive process, significantly impeding their widespread application
+in the food industry. Here, we present TastePepAI, a comprehensive artificial
+intelligence framework for customized taste peptide design and safety
+assessment. As the key element of this framework, a loss-supervised adaptive
+variational autoencoder (LA-VAE) is implemented to efficiently optimize the
+latent representation of sequences during training and facilitate the
+generation of target peptides with desired taste profiles. Notably, our model
+incorporates a novel taste-avoidance mechanism, allowing for selective flavor
+exclusion. Subsequently, our in-house developed toxicity prediction algorithm
+(SpepToxPred) is integrated into the framework to perform rigorous safety
+evaluation of generated peptides. Using this integrated platform, we
+successfully identified 73 peptides exhibiting sweet, salty, and umami tastes,
+significantly expanding the current repertoire of taste peptides. This work
+demonstrates the potential of TastePepAI in accelerating taste peptide
+discovery for food applications and provides a versatile framework adaptable to
+broader peptide engineering challenges.
+
+摘要：味觉肽因其独特的感官特性、高安全性概况和潜在的健康益处而成为有前途的天然调味剂。然而，从动物、植物或微生物来源中从头鉴定味觉肽仍然是一个耗时且资源密集的过程，严重阻碍了它们在食品工业中的广泛应用。在此，我们提出了 TastePepAI，这是一个用于定制味觉肽设计和安全性评估的综合人工智能框架。作为该框架的关键元素，实现了损失监督自适应变分自动编码器 (LA-VAE)，以在训练期间有效优化序列的潜在表示，并促进生成具有所需味觉特征的目标肽。值得注意的是，我们的模型包含了一种新颖的味觉回避机制，允许选择性排除风味。随后，我们内部开发的毒性预测算法 (SpepToxPred) 被集成到框架中，以对生成的肽进行严格的安全评估。使用这个集成平台，我们成功地鉴定了 73 种表现出甜味、咸味和鲜味的肽，极大地扩展了当前的味觉肽库。这项工作展示了 TastePepAI 在加速味觉肽发现以用于食品应用方面的潜力，并提供了一个适用于更广泛的肽工程挑战的多功能框架。
+
 ##### **HistoSmith: Single-Stage Histology Image-Label Generation via Conditional Latent Diffusion for Enhanced Cell Segmentation and Classification**
 2502.08754v1 by Valentina Vadori, Jean-Marie Graïc, Antonella Peruffo, Livio Finos, Ujwala Kiran Chaudhari, Enrico Grisan
 
@@ -4045,7 +4237,7 @@ CKD 臨床前嚙齒動物模型的資料集，這些註解來自 60 多個
 疾病研究和診斷量化。
 
 ##### **Early Risk Prediction of Pediatric Cardiac Arrest from Electronic Health Records via Multimodal Fused Transformer**
-2502.07158v1 by Jiaying Lu, Stephanie R. Brown, Songyuan Liu, Shifan Zhao, Kejun Dong, Del Bold, Michael Fundora, Alaa Aljiffry, Alex Fedorov, Jocelyn Grunwell, Xiao Hu
+2502.07158v2 by Jiaying Lu, Stephanie R. Brown, Songyuan Liu, Shifan Zhao, Kejun Dong, Del Bold, Michael Fundora, Alaa Aljiffry, Alex Fedorov, Jocelyn Grunwell, Xiao Hu
 
 Early prediction of pediatric cardiac arrest (CA) is critical for timely
 intervention in high-risk intensive care settings. We introduce PedCA-FT, a
@@ -4060,7 +4252,7 @@ and identifies clinically meaningful risk factors. These findings underscore
 the potential of multimodal fusion techniques to enhance early CA detection and
 improve patient care.
 
-摘要：早期預測兒童心臟驟停 (CA) 對高風險重症監護環境中的及時干預至關重要。我們引入了 PedCA-FT，這是一個新的基於Transformer的框架，它將 EHR 的表格視圖與 EHR 的派生文本視圖融合在一起，以充分釋放高維風險因素及其動態的交互作用。通過為每個模態視圖採用專用的Transformer模塊，PedCA-FT 捕獲復雜的時間和上下文模式以產生穩健的 CA 風險估計。在 CHOA-CICU 資料庫中經過策劃的兒科隊列上進行評估，我們的做法在五個關鍵性能指標上優於其他十個人工智慧模型，並識別出臨床上有意義的風險因素。這些發現強調了多模態融合技術在增強早期 CA 檢測和改善患者護理方面的潛力。
+摘要：早期預測小兒心臟驟停 (CA) 對於在高風險的重症照護環境中及時介入至關重要。我們引入了 PedCA-FT，一個新穎的基於轉換器的框架，它將 EHR 的表格視圖與 EHR 的派生文本視圖融合在一起，以充分發揮高維風險因素及其動態的交互作用。通過為每個模態視圖採用專用的轉換器模組，PedCA-FT 捕獲複雜的時間和上下文模式，以產生穩健的 CA 風險估計。在 CHOA-CICU 資料庫中策劃的小兒群體中進行評估，我們的做法在五項關鍵績效指標中優於其他十種人工智慧模型，並找出臨床上有意義的風險因素。這些發現強調了多模式融合技術在增強早期 CA 檢測和改善患者照護方面的潛力。
 
 ##### **Explaining 3D Computed Tomography Classifiers with Counterfactuals**
 2502.07156v1 by Joseph Paul Cohen, Louis Blankemeier, Akshay Chaudhari
@@ -5101,2682 +5293,16 @@ experiment details are made available.
 
 摘要：在本文中，我們提出胸部 X 光（CXR）影像多標籤分類的新方法，在維持簡化的單一模型、單次執行訓練管線的同時，提升臨床可解釋性。利用 CheXpert 資料集和 VisualCheXbert 衍生的標籤，我們納入階層標籤群組，以擷取診斷之間具有臨床意義的關聯性。為此，我們設計了自訂的階層二元交叉熵 (HBCE) 損失函數，使用固定或資料驅動的懲罰類型來強制執行標籤依賴性。我們的模型在測試集上達到受試者工作特性曲線 (AUROC) 下的平均面積為 0.903。此外，我們提供視覺化說明和不確定性估計，以進一步提升模型可解釋性。所有程式碼、模型組態和實驗詳細資料皆已公開。
 
-##### **Code Simulation as a Proxy for High-order Tasks in Large Language Models**
-2502.03568v2 by Emanuele La Malfa, Christoph Weinhuber, Orazio Torre, Fangru Lin, X. Angelo Huang, Samuele Marro, Anthony Cohn, Nigel Shadbolt, Michael Wooldridge
-
-Many reasoning, planning, and problem-solving tasks share an intrinsic
-algorithmic nature: correctly simulating each step is a sufficient condition to
-solve them correctly. We collect pairs of naturalistic and synthetic reasoning
-tasks to assess the capabilities of Large Language Models (LLM). While
-naturalistic tasks often require careful human handcrafting, we show that
-synthetic data is, in many cases, a good proxy that is much easier to collect
-at scale. We leverage common constructs in programming as the counterpart of
-the building blocks of naturalistic reasoning tasks, such as straight-line
-programs, code that contains critical paths, and approximate and redundant
-instructions. We further assess the capabilities of LLMs on sorting problems
-and repeated operations via sorting algorithms and nested loops. Our synthetic
-datasets further reveal that while the most powerful LLMs exhibit relatively
-strong execution capabilities, the process is fragile: it is negatively
-affected by memorisation and seems to rely heavily on pattern recognition. Our
-contribution builds upon synthetically testing the reasoning capabilities of
-LLMs as a scalable complement to handcrafted human-annotated problems.
-
-摘要：許多推理、規劃和問題解決任務都具有內在的演算法性質：正確模擬每一步是正確解決它們的充分條件。我們收集自然主義和合成推理任務對，以評估大型語言模型 (LLM) 的能力。雖然自然主義任務通常需要仔細的人工製作，但我們表明在許多情況下，合成數據是一個很好的代理，而且更容易大規模收集。我們利用程式設計中的常見結構作為自然主義推理任務建構區塊的對應物，例如直線程式、包含關鍵路徑的程式碼，以及近似和冗餘指令。我們進一步評估 LLM 在排序問題和重複操作方面的能力，透過排序演算法和巢狀迴圈。我們的合成資料集進一步揭示，雖然最強大的 LLM 表現出相對強大的執行能力，但這個過程很脆弱：它受到記憶的負面影響，而且似乎很依賴模式辨識。我們的貢獻建立在合成測試 LLM 的推理能力之上，作為手工製作的人工標記問題的可擴充補充。
-
-##### **Limitations of Large Language Models in Clinical Problem-Solving Arising from Inflexible Reasoning**
-2502.04381v1 by Jonathan Kim, Anna Podlasek, Kie Shidara, Feng Liu, Ahmed Alaa, Danilo Bernardo
-
-Large Language Models (LLMs) have attained human-level accuracy on medical
-question-answer (QA) benchmarks. However, their limitations in navigating
-open-ended clinical scenarios have recently been shown, raising concerns about
-the robustness and generalizability of LLM reasoning across diverse, real-world
-medical tasks. To probe potential LLM failure modes in clinical
-problem-solving, we present the medical abstraction and reasoning corpus
-(M-ARC). M-ARC assesses clinical reasoning through scenarios designed to
-exploit the Einstellung effect -- the fixation of thought arising from prior
-experience, targeting LLM inductive biases toward inflexible pattern matching
-from their training data rather than engaging in flexible reasoning. We find
-that LLMs, including current state-of-the-art o1 and Gemini models, perform
-poorly compared to physicians on M-ARC, often demonstrating lack of commonsense
-medical reasoning and a propensity to hallucinate. In addition, uncertainty
-estimation analyses indicate that LLMs exhibit overconfidence in their answers,
-despite their limited accuracy. The failure modes revealed by M-ARC in LLM
-medical reasoning underscore the need to exercise caution when deploying these
-models in clinical settings.
-
-摘要：大型語言模型 (LLM) 已在醫療問題解答 (QA) 基準上達到人類層級的準確度。然而，它們在應對開放式臨床場景中的局限性最近已被揭示，引發了人們對 LLM 推理在多樣化、真實世界醫療任務中的穩健性和概括性的擔憂。為了探討臨床問題解決中 LLM 的潛在故障模式，我們提出了醫療抽象和推理語料庫 (M-ARC)。M-ARC 通過旨在利用艾賓浩斯錯覺（由先前經驗產生的思維定勢）來評估臨床推理，針對 LLM 歸納偏誤，使其從訓練數據中進行僵化的模式匹配，而不是進行靈活的推理。我們發現，包括當前最先進的 o1 和 Gemini 模型在內的 LLM，在 M-ARC 上的表現遠不如醫生，它們經常表現出缺乏常識性的醫療推理和產生幻覺的傾向。此外，不確定性估計分析表明，儘管 LLM 準確性有限，但它們對自己的答案表現出過度自信。M-ARC 揭示的 LLM 醫療推理故障模式強調了在臨床環境中部署這些模型時需要謹慎。
-
-##### **Accurate AI-Driven Emergency Vehicle Location Tracking in Healthcare ITS Digital Twin**
-2502.03396v1 by Sarah Al-Shareeda, Yasar Celik, Bilge Bilgili, Ahmed Al-Dubai, Berk Canberk
-
-Creating a Digital Twin (DT) for Healthcare Intelligent Transportation
-Systems (HITS) is a hot research trend focusing on enhancing HITS management,
-particularly in emergencies where ambulance vehicles must arrive at the crash
-scene on time and track their real-time location is crucial to the medical
-authorities. Despite the claim of real-time representation, a temporal
-misalignment persists between the physical and virtual domains, leading to
-discrepancies in the ambulance's location representation. This study proposes
-integrating AI predictive models, specifically Support Vector Regression (SVR)
-and Deep Neural Networks (DNN), within a constructed mock DT data pipeline
-framework to anticipate the medical vehicle's next location in the virtual
-world. These models align virtual representations with their physical
-counterparts, i.e., metaphorically offsetting the synchronization delay between
-the two worlds. Trained meticulously on a historical geospatial dataset, SVR
-and DNN exhibit exceptional prediction accuracy in MATLAB and Python
-environments. Through various testing scenarios, we visually demonstrate the
-efficacy of our methodology, showcasing SVR and DNN's key role in significantly
-reducing the witnessed gap within the HITS's DT. This transformative approach
-enhances real-time synchronization in emergency HITS by approximately 88% to
-93%.
-
-摘要：建立醫療智慧交通系統（HITS）的數位分身（DT）是熱門的研究趨勢，其重點在於提升 HITS 管理，特別是在救護車必須準時抵達車禍現場的緊急情況中，追蹤其即時位置對於醫療單位至關重要。儘管聲稱即時呈現，但實體和虛擬領域之間仍存在時間上的錯位，導致救護車位置呈現上的差異。本研究建議在建構的虛擬 DT 資料管道架構中整合人工智慧預測模型，特別是支援向量回歸（SVR）和深度神經網路（DNN），以預測醫療車輛在虛擬世界的下一個位置。這些模型將虛擬呈現與其實體對應物對齊，也就是說，在兩個世界之間比喻性地抵銷同步延遲。在歷史地理空間資料集上經過仔細訓練，SVR 和 DNN 在 MATLAB 和 Python 環境中展現出卓越的預測準確性。透過各種測試情境，我們視覺化展示了我們方法論的效能，展示了 SVR 和 DNN 在顯著縮小 HITS 的 DT 中見證到的差距方面的關鍵作用。這種變革性的方法將緊急 HITS 中的即時同步提升了大約 88% 到 93%。
-
-##### **RadVLM: A Multitask Conversational Vision-Language Model for Radiology**
-2502.03333v1 by Nicolas Deperrois, Hidetoshi Matsuo, Samuel Ruipérez-Campillo, Moritz Vandenhirtz, Sonia Laguna, Alain Ryser, Koji Fujimoto, Mizuho Nishio, Thomas M. Sutter, Julia E. Vogt, Jonas Kluckert, Thomas Frauenfelder, Christian Blüthgen, Farhad Nooralahzadeh, Michael Krauthammer
-
-The widespread use of chest X-rays (CXRs), coupled with a shortage of
-radiologists, has driven growing interest in automated CXR analysis and
-AI-assisted reporting. While existing vision-language models (VLMs) show
-promise in specific tasks such as report generation or abnormality detection,
-they often lack support for interactive diagnostic capabilities. In this work
-we present RadVLM, a compact, multitask conversational foundation model
-designed for CXR interpretation. To this end, we curate a large-scale
-instruction dataset comprising over 1 million image-instruction pairs
-containing both single-turn tasks -- such as report generation, abnormality
-classification, and visual grounding -- and multi-turn, multi-task
-conversational interactions. After fine-tuning RadVLM on this instruction
-dataset, we evaluate it across different tasks along with re-implemented
-baseline VLMs. Our results show that RadVLM achieves state-of-the-art
-performance in conversational capabilities and visual grounding while remaining
-competitive in other radiology tasks. Ablation studies further highlight the
-benefit of joint training across multiple tasks, particularly for scenarios
-with limited annotated data. Together, these findings highlight the potential
-of RadVLM as a clinically relevant AI assistant, providing structured CXR
-interpretation and conversational capabilities to support more effective and
-accessible diagnostic workflows.
-
-摘要：胸部 X 光 (CXR) 的广泛使用，加上放射科醫師短缺，促使人們對自動化 CXR 分析和 AI 輔助報告產生越來越濃厚的興趣。雖然現有的視覺語言模型 (VLM) 在特定任務中顯示出前景，例如報告生成或異常偵測，但它們通常缺乏對互動式診斷功能的支持。在這項工作中，我們提出 RadVLM，這是一個緊湊的多任務對話式基礎模型，專為 CXR 解釋而設計。為此，我們策劃了一個大型指令資料集，包含超過 100 萬個影像指令對，其中包含單輪任務（例如報告生成、異常分類和視覺基礎），以及多輪、多任務對話互動。在對這個指令資料集進行微調後，我們對 RadVLM 進行評估，並與重新實作的基準 VLM 一起執行不同的任務。我們的結果顯示，RadVLM 在對話能力和視覺基礎方面取得了最先進的效能，同時在其他放射學任務中仍具有競爭力。消融研究進一步突顯了跨多個任務進行聯合訓練的好處，特別是對於帶有標註資料有限的場景。這些發現共同突顯了 RadVLM 作為臨床相關 AI 助理的潛力，提供結構化的 CXR 解釋和對話能力，以支援更有效且可存取的診斷工作流程。
-
-##### **MeDiSumQA: Patient-Oriented Question-Answer Generation from Discharge Letters**
-2502.03298v1 by Amin Dada, Osman Alperen Koras, Marie Bauer, Amanda Butler, Kaleb E. Smith, Jens Kleesiek, Julian Friedrich
-
-While increasing patients' access to medical documents improves medical care,
-this benefit is limited by varying health literacy levels and complex medical
-terminology. Large language models (LLMs) offer solutions by simplifying
-medical information. However, evaluating LLMs for safe and patient-friendly
-text generation is difficult due to the lack of standardized evaluation
-resources. To fill this gap, we developed MeDiSumQA. MeDiSumQA is a dataset
-created from MIMIC-IV discharge summaries through an automated pipeline
-combining LLM-based question-answer generation with manual quality checks. We
-use this dataset to evaluate various LLMs on patient-oriented
-question-answering. Our findings reveal that general-purpose LLMs frequently
-surpass biomedical-adapted models, while automated metrics correlate with human
-judgment. By releasing MeDiSumQA on PhysioNet, we aim to advance the
-development of LLMs to enhance patient understanding and ultimately improve
-care outcomes.
-
-摘要：儘管讓患者更能取得醫療文件有助於改善醫療照護，
-但此優點受到不同的健康素養程度和複雜的醫療術語所限制。大型語言模型 (LLM) 提供了簡化醫療資訊的解決方案。然而，由於缺乏標準化的評估資源，因此難以評估 LLM 以確保其安全且對患者友善的文字產生。為了填補此缺口，我們開發了 MeDiSumQA。MeDiSumQA 是透過自動化流程從 MIMIC-IV 出院摘要中建立的資料集，結合了基於 LLM 的問答產生和手動品質檢查。我們使用此資料集來評估各種 LLM 在以患者為導向的問答中。我們的發現顯示，通用 LLM 經常超越生物醫學適應模型，而自動化指標與人類判斷相關。透過在 PhysioNet 上發布 MeDiSumQA，我們旨在推動 LLM 的發展，以增進患者理解，並最終改善照護成果。
-
-##### **Deep Learning Pipeline for Fully Automated Myocardial Infarct Segmentation from Clinical Cardiac MR Scans**
-2502.03272v1 by Matthias Schwab, Mathias Pamminger, Christian Kremser, Agnes Mayr
-
-Purpose: To develop and evaluate a deep learning-based method that allows to
-perform myocardial infarct segmentation in a fully-automated way.
-  Materials and Methods: For this retrospective study, a cascaded framework of
-two and three-dimensional convolutional neural networks (CNNs), specialized on
-identifying ischemic myocardial scars on late gadolinium enhancement (LGE)
-cardiac magnetic resonance (CMR) images, was trained on an in-house training
-dataset consisting of 144 examinations. On a separate test dataset from the
-same institution, including images from 152 examinations obtained between 2021
-and 2023, a quantitative comparison between artificial intelligence (AI)-based
-segmentations and manual segmentations was performed. Further, qualitative
-assessment of segmentation accuracy was evaluated for both human and
-AI-generated contours by two CMR experts in a blinded experiment.
-  Results: Excellent agreement could be found between manually and
-automatically calculated infarct volumes ($\rho_c$ = 0.9). The qualitative
-evaluation showed that compared to human-based measurements, the experts rated
-the AI-based segmentations to better represent the actual extent of infarction
-significantly (p < 0.001) more often (33.4% AI, 25.1% human, 41.5% equal). On
-the contrary, for segmentation of microvascular obstruction (MVO), manual
-measurements were still preferred (11.3% AI, 55.6% human, 33.1% equal).
-  Conclusion: This fully-automated segmentation pipeline enables CMR infarct
-size to be calculated in a very short time and without requiring any
-pre-processing of the input images while matching the segmentation quality of
-trained human observers. In a blinded experiment, experts preferred automated
-infarct segmentations more often than manual segmentations, paving the way for
-a potential clinical application.
-
-摘要：<paragraph>目的：開發和評估一種基於深度學習的方法，允許以全自動的方式執行心肌梗塞分割。
-材料和方法：對於這項回顧性研究，一個由二維和三維卷積神經網路 (CNN) 組成的串聯架構，專門用於識別晚期釓增強 (LGE) 心臟磁振造影 (CMR) 影像上的缺血性心肌疤痕，並在包含 144 項檢查的內部訓練資料集上受訓。在來自同一家機構的獨立測試資料集上，包括 2021 年至 2023 年間獲得的 152 項檢查的影像，執行基於人工智慧 (AI) 的分割和手動分割之間的定量比較。此外，由兩位 CMR 專家在盲測實驗中評估人類和 AI 生成的輪廓的分割準確度。
-結果：在手動和自動計算的梗塞體積之間可以發現極佳的一致性（ρ_c = 0.9）。定性評估顯示，與基於人類的測量相比，專家評估 AI 基於分割能更能代表梗塞的實際範圍，顯著（p < 0.001）更常發生（33.4% AI，25.1% 人類，41.5% 相等）。相反，對於微血管阻塞 (MVO) 的分割，手動測量仍然較受青睞（11.3% AI，55.6% 人類，33.1% 相等）。
-結論：這個全自動分割管道可以在很短的時間內計算 CMR 梗塞大小，而且無需對輸入影像進行任何前處理，同時匹配受過訓練的人類觀察者的分割品質。在盲測實驗中，專家比手動分割更常偏好自動梗塞分割，為潛在的臨床應用鋪平了道路。</paragraph>
-
-##### **Long-tailed Medical Diagnosis with Relation-aware Representation Learning and Iterative Classifier Calibration**
-2502.03238v2 by Li Pan, Yupei Zhang, Qiushi Yang, Tan Li, Zhen Chen
-
-Recently computer-aided diagnosis has demonstrated promising performance,
-effectively alleviating the workload of clinicians. However, the inherent
-sample imbalance among different diseases leads algorithms biased to the
-majority categories, leading to poor performance for rare categories. Existing
-works formulated this challenge as a long-tailed problem and attempted to
-tackle it by decoupling the feature representation and classification. Yet, due
-to the imbalanced distribution and limited samples from tail classes, these
-works are prone to biased representation learning and insufficient classifier
-calibration. To tackle these problems, we propose a new Long-tailed Medical
-Diagnosis (LMD) framework for balanced medical image classification on
-long-tailed datasets. In the initial stage, we develop a Relation-aware
-Representation Learning (RRL) scheme to boost the representation ability by
-encouraging the encoder to capture intrinsic semantic features through
-different data augmentations. In the subsequent stage, we propose an Iterative
-Classifier Calibration (ICC) scheme to calibrate the classifier iteratively.
-This is achieved by generating a large number of balanced virtual features and
-fine-tuning the encoder using an Expectation-Maximization manner. The proposed
-ICC compensates for minority categories to facilitate unbiased classifier
-optimization while maintaining the diagnostic knowledge in majority classes.
-Comprehensive experiments on three public long-tailed medical datasets
-demonstrate that our LMD framework significantly surpasses state-of-the-art
-approaches. The source code can be accessed at
-https://github.com/peterlipan/LMD.
-
-摘要：<paragraph>最近，计算机辅助诊断已展现出可观的表现，有效减轻了临床医生的工作量。然而，不同疾病之间固有的样本不平衡导致算法偏向于多数类别，从而导致罕见类别表现不佳。现有工作将这一挑战表述为长尾问题，并尝试通过解耦特征表示和分类来解决它。然而，由于不平衡分布和尾类样本有限，这些工作容易出现有偏差的表示学习和分类器校准不足。为了解决这些问题，我们提出了一个新的长尾医学诊断 (LMD) 框架，用于对长尾数据集进行平衡的医学图像分类。在初始阶段，我们开发了一个关系感知表示学习 (RRL) 方案，通过鼓励编码器通过不同的数据增强来捕获内在语义特征，从而提升表示能力。在后续阶段，我们提出了一个迭代分类器校准 (ICC) 方案，以迭代方式校准分类器。这是通过生成大量的平衡虚拟特征并使用期望最大化方式微调编码器来实现的。所提出的 ICC 补偿了少数类别，以促进无偏分类器优化，同时保持多数类别的诊断知识。在三个公共长尾医学数据集上进行的综合实验表明，我们的 LMD 框架明显超越了最先进的方法。源代码可在 https://github.com/peterlipan/LMD 处获取。</paragraph>
-
-##### **Fine-Tuning Strategies for Continual Online EEG Motor Imagery Decoding: Insights from a Large-Scale Longitudinal Study**
-2502.06828v1 by Martin Wimpff, Bruno Aristimunha, Sylvain Chevallier, Bin Yang
-
-This study investigates continual fine-tuning strategies for deep learning in
-online longitudinal electroencephalography (EEG) motor imagery (MI) decoding
-within a causal setting involving a large user group and multiple sessions per
-participant. We are the first to explore such strategies across a large user
-group, as longitudinal adaptation is typically studied in the single-subject
-setting with a single adaptation strategy, which limits the ability to
-generalize findings. First, we examine the impact of different fine-tuning
-approaches on decoder performance and stability. Building on this, we integrate
-online test-time adaptation (OTTA) to adapt the model during deployment,
-complementing the effects of prior fine-tuning. Our findings demonstrate that
-fine-tuning that successively builds on prior subject-specific information
-improves both performance and stability, while OTTA effectively adapts the
-model to evolving data distributions across consecutive sessions, enabling
-calibration-free operation. These results offer valuable insights and
-recommendations for future research in longitudinal online MI decoding and
-highlight the importance of combining domain adaptation strategies for
-improving BCI performance in real-world applications. Clinical Relevance: Our
-investigation enables more stable and efficient long-term motor imagery
-decoding, which is critical for neurorehabilitation and assistive technologies.
-
-摘要：本研究探討在因果關係設定中涉及大量使用者群組和每個參與者多個階段的線上縱向腦電圖 (EEG) 運動想像 (MI) 解碼中，深度學習的持續微調策略。我們是第一個在大量使用者群組中探討此類策略，因為縱向適應通常在單一主體設定中研究，並使用單一適應策略，這限制了推廣研究結果的能力。首先，我們探討不同微調方法對解碼器效能和穩定性的影響。在此基礎上，我們整合線上測試時間適應 (OTTA) 以在部署期間適應模型，補充先前微調的效果。我們的研究結果表明，連續建立在先前特定主體資訊上的微調可以同時改善效能和穩定性，而 OTTA 可以有效地適應連續階段中不斷變化的資料分佈，從而實現無需校準的操作。這些結果為縱向線上 MI 解碼的未來研究提供了有價值的見解和建議，並強調了結合領域適應策略以改善實際應用中 BCI 效能的重要性。臨床相關性：我們的研究可以實現更穩定、更有效的長期運動想像解碼，這對於神經復健和輔助技術至關重要。
-
-##### **MedBioLM: Optimizing Medical and Biological QA with Fine-Tuned Large Language Models and Retrieval-Augmented Generation**
-2502.03004v1 by Seonok Kim
-
-Large Language Models (LLMs) have demonstrated impressive capabilities across
-natural language processing tasks. However, their application to specialized
-domains such as medicine and biology requires further optimization to ensure
-factual accuracy, reliability, and contextual depth. We introduce MedBioLM, a
-domain-adapted biomedical question-answering model designed to enhance both
-short-form and long-form queries. By integrating fine-tuning and
-retrieval-augmented generation (RAG), MedBioLM dynamically incorporates
-domain-specific knowledge, improving reasoning abilities and factual accuracy.
-To evaluate its effectiveness, we fine-tuned the model on diverse biomedical QA
-datasets, covering structured multiple-choice assessments and complex clinical
-reasoning tasks. Fine-tuning significantly improves accuracy on benchmark
-datasets, while RAG enhances factual consistency. These results highlight the
-potential of domain-optimized LLMs in advancing biomedical research, medical
-education, and clinical decision support.
-
-摘要：大型語言模型 (LLM) 已展現出在自然語言處理任務中令人印象深刻的能力。然而，要將其應用於醫學和生物學等特定領域，需要進一步最佳化，以確保事實的準確性、可靠性以及脈絡的深度。我們引進了 MedBioLM，這是一個適應領域的生物醫學問答模型，旨在增強短式和長式查詢。透過整合微調和檢索增強生成 (RAG)，MedBioLM 能動態地納入領域特定的知識，從而提升推理能力和事實準確性。為了評估其有效性，我們對模型進行微調，使其涵蓋結構化的多重選擇評量和複雜的臨床推理任務等多樣化的生物醫學問答資料集。微調顯著提升了基準資料集的準確性，而 RAG 則增強了事實的一致性。這些結果突顯了領域最佳化的 LLM 在推進生物醫學研究、醫學教育和臨床決策支援方面的潛力。
-
-
-### LLM
-|Publish Date|Title|Authors|Homepage|Code|
-| :---: | :---: | :---: | :---: | :---: |
-|**2025-02-17**|**Diffusion Models without Classifier-free Guidance**|Zhicong Tang et.al.|[2502.12154v1](http://arxiv.org/abs/2502.12154v1)|[link](https://github.com/tzco/Diffusion-wo-CFG)|
-|**2025-02-17**|**Idiosyncrasies in Large Language Models**|Mingjie Sun et.al.|[2502.12150v1](http://arxiv.org/abs/2502.12150v1)|null|
-|**2025-02-17**|**HARBOR: Exploring Persona Dynamics in Multi-Agent Competition**|Kenan Jiang et.al.|[2502.12149v1](http://arxiv.org/abs/2502.12149v1)|null|
-|**2025-02-17**|**Fast or Better? Balancing Accuracy and Cost in Retrieval-Augmented Generation with Flexible User Control**|Jinyan Su et.al.|[2502.12145v1](http://arxiv.org/abs/2502.12145v1)|null|
-|**2025-02-17**|**Small Models Struggle to Learn from Strong Reasoners**|Yuetai Li et.al.|[2502.12143v1](http://arxiv.org/abs/2502.12143v1)|null|
-|**2025-02-17**|**SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs**|Yige Xu et.al.|[2502.12134v1](http://arxiv.org/abs/2502.12134v1)|null|
-|**2025-02-17**|**Transformer Dynamics: A neuroscientific approach to interpretability of large language models**|Jesseba Fernando et.al.|[2502.12131v1](http://arxiv.org/abs/2502.12131v1)|null|
-|**2025-02-17**|**Scaling Autonomous Agents via Automatic Reward Modeling And Planning**|Zhenfang Chen et.al.|[2502.12130v1](http://arxiv.org/abs/2502.12130v1)|null|
-|**2025-02-17**|**LaM-SLidE: Latent Space Modeling of Spatial Dynamical Systems via Linked Entities**|Florian Sestak et.al.|[2502.12128v1](http://arxiv.org/abs/2502.12128v1)|null|
-|**2025-02-17**|**On the Query Complexity of Verifier-Assisted Language Generation**|Edoardo Botta et.al.|[2502.12123v1](http://arxiv.org/abs/2502.12123v1)|null|
-|**2025-02-17**|**LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws**|Prasanna Mayilvahanan et.al.|[2502.12120v1](http://arxiv.org/abs/2502.12120v1)|null|
-|**2025-02-17**|**PRISM: Self-Pruning Intrinsic Selection Method for Training-Free Multimodal Data Selection**|Jinhe Bi et.al.|[2502.12119v1](http://arxiv.org/abs/2502.12119v1)|null|
-|**2025-02-17**|**Scaling Test-Time Compute Without Verification or RL is Suboptimal**|Amrith Setlur et.al.|[2502.12118v1](http://arxiv.org/abs/2502.12118v1)|null|
-|**2025-02-17**|**A-MEM: Agentic Memory for LLM Agents**|Wujiang Xu et.al.|[2502.12110v1](http://arxiv.org/abs/2502.12110v1)|null|
-|**2025-02-17**|**Personality Structured Interview for Large Language Model Simulation in Personality Research**|Pengda Wang et.al.|[2502.12109v1](http://arxiv.org/abs/2502.12109v1)|null|
-|**2025-02-17**|**Using the Path of Least Resistance to Explain Deep Networks**|Sina Salek et.al.|[2502.12108v1](http://arxiv.org/abs/2502.12108v1)|null|
-|**2025-02-17**|**Relational Norms for Human-AI Cooperation**|Brian D. Earp et.al.|[2502.12102v1](http://arxiv.org/abs/2502.12102v1)|null|
-|**2025-02-17**|**A Study on Leveraging Search and Self-Feedback for Agent Reasoning**|Karthikeyan K et.al.|[2502.12094v1](http://arxiv.org/abs/2502.12094v1)|null|
-|**2025-02-17**|**Meta-Statistical Learning: Supervised Learning of Statistical Inference**|Maxime Peyrard et.al.|[2502.12088v1](http://arxiv.org/abs/2502.12088v1)|null|
-|**2025-02-17**|**APB: Accelerating Distributed Long-Context Inference by Passing Compressed Context Blocks across GPUs**|Yuxiang Huang et.al.|[2502.12085v1](http://arxiv.org/abs/2502.12085v1)|null|
-|**2025-02-17**|**VLM$^2$-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit Matching Visual Cues**|Jianshu Zhang et.al.|[2502.12084v1](http://arxiv.org/abs/2502.12084v1)|null|
-|**2025-02-17**|**AdaSplash: Adaptive Sparse Flash Attention**|Nuno Gonçalves et.al.|[2502.12082v1](http://arxiv.org/abs/2502.12082v1)|null|
-|**2025-02-17**|**Unhackable Temporal Rewarding for Scalable Video MLLMs**|En Yu et.al.|[2502.12081v1](http://arxiv.org/abs/2502.12081v1)|null|
-|**2025-02-17**|**Can LLMs Simulate Social Media Engagement? A Study on Action-Guided Response Generation**|Zhongyi Qiu et.al.|[2502.12073v1](http://arxiv.org/abs/2502.12073v1)|null|
-|**2025-02-17**|**TokenSkip: Controllable Chain-of-Thought Compression in LLMs**|Heming Xia et.al.|[2502.12067v1](http://arxiv.org/abs/2502.12067v1)|null|
-|**2025-02-17**|**CONSTRUCTA: Automating Commercial Construction Schedules in Fabrication Facilities with Large Language Models**|Yifan Zhang et.al.|[2502.12066v1](http://arxiv.org/abs/2502.12066v1)|null|
-|**2025-02-17**|**Formalizing Complex Mathematical Statements with LLMs: A Study on Mathematical Definitions**|Lan Zhang et.al.|[2502.12065v1](http://arxiv.org/abs/2502.12065v1)|null|
-|**2025-02-17**|**AI-generated Text Detection with a GLTR-based Approach**|Lucía Yan Wu et.al.|[2502.12064v1](http://arxiv.org/abs/2502.12064v1)|null|
-|**2025-02-17**|**Culture is Not Trivia: Sociocultural Theory for Cultural NLP**|Naitian Zhou et.al.|[2502.12057v1](http://arxiv.org/abs/2502.12057v1)|null|
-|**2025-02-17**|**Designing Role Vectors to Improve LLM Inference Behaviour**|Daniele Potertì et.al.|[2502.12055v1](http://arxiv.org/abs/2502.12055v1)|null|
-|**2025-02-17**|**PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning**|Xinyu Zhang et.al.|[2502.12054v1](http://arxiv.org/abs/2502.12054v1)|null|
-|**2025-02-17**|**A Dual-Perspective NLG Meta-Evaluation Framework with Automatic Benchmark and Better Interpretability**|Xinyu Hu et.al.|[2502.12052v1](http://arxiv.org/abs/2502.12052v1)|null|
-|**2025-02-17**|**How to Upscale Neural Networks with Scaling Law? A Survey and Practical Guidelines**|Ayan Sengupta et.al.|[2502.12051v1](http://arxiv.org/abs/2502.12051v1)|null|
-|**2025-02-17**|**SpeechT: Findings of the First Mentorship in Speech Translation**|Yasmin Moslem et.al.|[2502.12050v1](http://arxiv.org/abs/2502.12050v1)|null|
-|**2025-02-17**|**A Survey on Bridging EEG Signals and Generative AI: From Image and Text to Beyond**|Shreya Shukla et.al.|[2502.12048v1](http://arxiv.org/abs/2502.12048v1)|null|
-|**2025-02-17**|**KnowPath: Knowledge-enhanced Reasoning via LLM-generated Inference Paths over Knowledge Graphs**|Qi Zhao et.al.|[2502.12029v1](http://arxiv.org/abs/2502.12029v1)|null|
-|**2025-02-17**|**SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities**|Fengqing Jiang et.al.|[2502.12025v1](http://arxiv.org/abs/2502.12025v1)|null|
-|**2025-02-17**|**Teaching LLMs According to Their Aptitude: Adaptive Reasoning for Mathematical Problem Solving**|Xin Xu et.al.|[2502.12022v1](http://arxiv.org/abs/2502.12022v1)|null|
-|**2025-02-17**|**Atom of Thoughts for Markov LLM Test-Time Scaling**|Fengwei Teng et.al.|[2502.12018v1](http://arxiv.org/abs/2502.12018v1)|null|
-|**2025-02-17**|**Demographic Attributes Prediction from Speech Using WavLM Embeddings**|Yuchen Yang et.al.|[2502.12007v1](http://arxiv.org/abs/2502.12007v1)|null|
-|**2025-02-17**|**Merging Language and Domain Specific Models: The Impact on Technical Vocabulary Acquisition**|Thibault Rousset et.al.|[2502.12001v1](http://arxiv.org/abs/2502.12001v1)|null|
-|**2025-02-17**|**Presumed Cultural Identity: How Names Shape LLM Responses**|Siddhesh Pawar et.al.|[2502.11995v1](http://arxiv.org/abs/2502.11995v1)|null|
-|**2025-02-17**|**Characterizing Photorealism and Artifacts in Diffusion Model-Generated Images**|Negar Kamali et.al.|[2502.11989v1](http://arxiv.org/abs/2502.11989v1)|null|
-|**2025-02-17**|**Generating Text from Uniform Meaning Representation**|Emma Markle et.al.|[2502.11973v1](http://arxiv.org/abs/2502.11973v1)|null|
-|**2025-02-17**|**Learning Generalizable Prompt for CLIP with Class Similarity Knowledge**|Sehun Jung et.al.|[2502.11969v1](http://arxiv.org/abs/2502.11969v1)|null|
-|**2025-02-17**|**A MIMO Wireless Channel Foundation Model via CIR-CSI Consistency**|Jun Jiang et.al.|[2502.11965v1](http://arxiv.org/abs/2502.11965v1)|null|
-|**2025-02-17**|**Navigating the Helpfulness-Truthfulness Trade-Off with Uncertainty-Aware Instruction Fine-Tuning**|Tianyi Wu et.al.|[2502.11962v1](http://arxiv.org/abs/2502.11962v1)|null|
-|**2025-02-17**|**STRIVE: Structured Reasoning for Self-Improvement in Claim Verification**|Haisong Gong et.al.|[2502.11959v1](http://arxiv.org/abs/2502.11959v1)|null|
-|**2025-02-17**|**Can Your Uncertainty Scores Detect Hallucinated Entity?**|Min-Hsuan Yeh et.al.|[2502.11948v1](http://arxiv.org/abs/2502.11948v1)|null|
-|**2025-02-17**|**Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction**|Ailin Huang et.al.|[2502.11946v1](http://arxiv.org/abs/2502.11946v1)|null|
-|**2025-02-17**|**Deep Spatio-Temporal Neural Network for Air Quality Reanalysis**|Ammar Kheder et.al.|[2502.11941v1](http://arxiv.org/abs/2502.11941v1)|null|
-|**2025-02-17**|**FitLight: Federated Imitation Learning for Plug-and-Play Autonomous Traffic Signal Control**|Yutong Ye et.al.|[2502.11937v1](http://arxiv.org/abs/2502.11937v1)|null|
-|**2025-02-17**|**On Representational Dissociation of Language and Arithmetic in Large Language Models**|Riku Kisako et.al.|[2502.11932v1](http://arxiv.org/abs/2502.11932v1)|null|
-|**2025-02-17**|**BRIGHTER: BRIdging the Gap in Human-Annotated Textual Emotion Recognition Datasets for 28 Languages**|Shamsuddeen Hassan Muhammad et.al.|[2502.11926v1](http://arxiv.org/abs/2502.11926v1)|null|
-|**2025-02-17**|**GRAPHGPT-O: Synergistic Multimodal Comprehension and Generation on Graphs**|Yi Fang et.al.|[2502.11925v1](http://arxiv.org/abs/2502.11925v1)|null|
-|**2025-02-17**|**From Text to Trust: Empowering AI-assisted Decision Making with Adaptive LLM-powered Analysis**|Zhuoyan Li et.al.|[2502.11919v1](http://arxiv.org/abs/2502.11919v1)|null|
-|**2025-02-17**|**EssayJudge: A Multi-Granular Benchmark for Assessing Automated Essay Scoring Capabilities of Multimodal Large Language Models**|Jiamin Su et.al.|[2502.11916v1](http://arxiv.org/abs/2502.11916v1)|null|
-|**2025-02-17**|**On the robustness of ChatGPT in teaching Korean Mathematics**|Phuong-Nam Nguyen et.al.|[2502.11915v1](http://arxiv.org/abs/2502.11915v1)|null|
-|**2025-02-17**|**MMRC: A Large-Scale Benchmark for Understanding Multimodal Large Language Model in Real-World Conversation**|Haochen Xue et.al.|[2502.11903v1](http://arxiv.org/abs/2502.11903v1)|null|
-|**2025-02-17**|**Building A Proof-Oriented Programmer That Is 64% Better Than GPT-4o Under Data Scarsity**|Dylan Zhang et.al.|[2502.11901v1](http://arxiv.org/abs/2502.11901v1)|null|
-|**2025-02-17**|**DLFR-VAE: Dynamic Latent Frame Rate VAE for Video Generation**|Zhihang Yuan et.al.|[2502.11897v1](http://arxiv.org/abs/2502.11897v1)|null|
-|**2025-02-17**|**CAMEL: Continuous Action Masking Enabled by Large Language Models for Reinforcement Learning**|Yanxiao Zhao et.al.|[2502.11896v1](http://arxiv.org/abs/2502.11896v1)|null|
-|**2025-02-17**|**Continual Quantization-Aware Pre-Training: When to transition from 16-bit to 1.58-bit pre-training for BitNet language models?**|Jacob Nielsen et.al.|[2502.11895v1](http://arxiv.org/abs/2502.11895v1)|null|
-|**2025-02-17**|**Revisiting Classification Taxonomy for Grammatical Errors**|Deqing Zou et.al.|[2502.11890v1](http://arxiv.org/abs/2502.11890v1)|null|
-|**2025-02-17**|**Stonefish: Supporting Machine Learning Research in Marine Robotics**|Michele Grimaldi et.al.|[2502.11887v1](http://arxiv.org/abs/2502.11887v1)|null|
-|**2025-02-17**|**LIMR: Less is More for RL Scaling**|Xuefeng Li et.al.|[2502.11886v1](http://arxiv.org/abs/2502.11886v1)|null|
-|**2025-02-17**|**Leveraging Dual Process Theory in Language Agent Framework for Real-time Simultaneous Human-AI Collaboration**|Shao Zhang et.al.|[2502.11882v1](http://arxiv.org/abs/2502.11882v1)|null|
-|**2025-02-17**|**Hypothesis-Driven Theory-of-Mind Reasoning for Large Language Models**|Hyunwoo Kim et.al.|[2502.11881v1](http://arxiv.org/abs/2502.11881v1)|null|
-|**2025-02-17**|**Bitnet.cpp: Efficient Edge Inference for Ternary LLMs**|Jinheng Wang et.al.|[2502.11880v1](http://arxiv.org/abs/2502.11880v1)|null|
-|**2025-02-17**|**VAQUUM: Are Vague Quantifiers Grounded in Visual Data?**|Hugh Mee Wong et.al.|[2502.11874v1](http://arxiv.org/abs/2502.11874v1)|null|
-|**2025-02-17**|**Southern Newswire Corpus: A Large-Scale Dataset of Mid-Century Wire Articles Beyond the Front Page**|Michael McRae et.al.|[2502.11866v1](http://arxiv.org/abs/2502.11866v1)|null|
-|**2025-02-17**|**FedEAT: A Robustness Optimization Framework for Federated LLMs**|Yahao Pang et.al.|[2502.11863v1](http://arxiv.org/abs/2502.11863v1)|null|
-|**2025-02-17**|**Understanding In-Context Machine Translation for Low-Resource Languages: A Case Study on Manchu**|Renhao Pei et.al.|[2502.11862v1](http://arxiv.org/abs/2502.11862v1)|null|
-|**2025-02-17**|**Exploring Large Language Models in Healthcare: Insights into Corpora Sources, Customization Strategies, and Evaluation Metrics**|Shuqi Yang et.al.|[2502.11861v1](http://arxiv.org/abs/2502.11861v1)|null|
-|**2025-02-17**|**Defining and Evaluating Visual Language Models' Basic Spatial Abilities: A Perspective from Psychometrics**|Wenrui Xu et.al.|[2502.11859v1](http://arxiv.org/abs/2502.11859v1)|null|
-|**2025-02-17**|**LLMs as a synthesis between symbolic and continuous approaches to language**|Gemma Boleda et.al.|[2502.11856v1](http://arxiv.org/abs/2502.11856v1)|null|
-|**2025-02-17**|**BaxBench: Can LLMs Generate Correct and Secure Backends?**|Mark Vero et.al.|[2502.11844v1](http://arxiv.org/abs/2502.11844v1)|null|
-|**2025-02-17**|**Can LLM Agents Maintain a Persona in Discourse?**|Pranav Bhandari et.al.|[2502.11843v1](http://arxiv.org/abs/2502.11843v1)|null|
-|**2025-02-17**|**ChordFormer: A Conformer-Based Architecture for Large-Vocabulary Audio Chord Recognition**|Muhammad Waseem Akram et.al.|[2502.11840v1](http://arxiv.org/abs/2502.11840v1)|null|
-|**2025-02-17**|**Intuitive physics understanding emerges from self-supervised pretraining on natural videos**|Quentin Garrido et.al.|[2502.11831v1](http://arxiv.org/abs/2502.11831v1)|null|
-|**2025-02-17**|**Text Classification in the LLM Era - Where do we stand?**|Sowmya Vajjala et.al.|[2502.11830v1](http://arxiv.org/abs/2502.11830v1)|null|
-|**2025-02-17**|**Code-Vision: Evaluating Multimodal LLMs Logic Understanding and Code Generation Capabilities**|Hanbin Wang et.al.|[2502.11829v1](http://arxiv.org/abs/2502.11829v1)|null|
-|**2025-02-17**|**M-ABSA: A Multilingual Dataset for Aspect-Based Sentiment Analysis**|Chengyan Wu et.al.|[2502.11824v1](http://arxiv.org/abs/2502.11824v1)|null|
-|**2025-02-17**|**AAKT: Enhancing Knowledge Tracing with Alternate Autoregressive Modeling**|Hao Zhou et.al.|[2502.11817v1](http://arxiv.org/abs/2502.11817v1)|null|
-|**2025-02-17**|**Towards Understanding Fine-Tuning Mechanisms of LLMs via Circuit Analysis**|Xu Wang et.al.|[2502.11812v1](http://arxiv.org/abs/2502.11812v1)|null|
-|**2025-02-17**|**FineFilter: A Fine-grained Noise Filtering Mechanism for Retrieval-Augmented Large Language Models**|Qianchi Zhang et.al.|[2502.11811v1](http://arxiv.org/abs/2502.11811v1)|null|
-|**2025-02-17**|**Revealing Bias Formation in Deep Neural Networks Through the Geometric Mechanisms of Human Visual Decoupling**|Yanbiao Ma et.al.|[2502.11809v1](http://arxiv.org/abs/2502.11809v1)|null|
-|**2025-02-17**|**Exploring Translation Mechanism of Large Language Models**|Hongbin Zhang et.al.|[2502.11806v1](http://arxiv.org/abs/2502.11806v1)|null|
-|**2025-02-17**|**Table-Critic: A Multi-Agent Framework for Collaborative Criticism and Refinement in Table Reasoning**|Peiying Yu et.al.|[2502.11799v1](http://arxiv.org/abs/2502.11799v1)|null|
-|**2025-02-17**|**Personality Editing for Language Models through Relevant Knowledge Editing**|Seojin Hwang et.al.|[2502.11789v1](http://arxiv.org/abs/2502.11789v1)|null|
-|**2025-02-17**|**Efficient Response Generation Method Selection for Fine-Tuning Large Language Models**|Xuan Ren et.al.|[2502.11779v1](http://arxiv.org/abs/2502.11779v1)|null|
-|**2025-02-17**|**Deep Neural Networks for Accurate Depth Estimation with Latent Space Features**|Siddiqui Muhammad Yasir et.al.|[2502.11777v1](http://arxiv.org/abs/2502.11777v1)|null|
-|**2025-02-17**|**The Validation Gap: A Mechanistic Analysis of How Language Models Compute Arithmetic but Fail to Validate It**|Leonardo Bertolazzi et.al.|[2502.11771v1](http://arxiv.org/abs/2502.11771v1)|null|
-|**2025-02-17**|**Cognitive-Aligned Document Selection for Retrieval-augmented Generation**|Bingyu Wan et.al.|[2502.11770v1](http://arxiv.org/abs/2502.11770v1)|null|
-|**2025-02-17**|**From Selection to Generation: A Survey of LLM-based Active Learning**|Yu Xia et.al.|[2502.11767v1](http://arxiv.org/abs/2502.11767v1)|null|
-|**2025-02-17**|**Warmup-Distill: Bridge the Distribution Mismatch between Teacher and Student before Knowledge Distillation**|Zengkui Sun et.al.|[2502.11766v1](http://arxiv.org/abs/2502.11766v1)|null|
-|**2025-02-17**|**Lightweight Deepfake Detection Based on Multi-Feature Fusion**|Siddiqui Muhammad Yasir et.al.|[2502.11763v1](http://arxiv.org/abs/2502.11763v1)|null|
-|**2025-02-17**|**HintsOfTruth: A Multimodal Checkworthiness Detection Dataset with Real and Synthetic Claims**|Michiel van der Meer et.al.|[2502.11753v1](http://arxiv.org/abs/2502.11753v1)|null|
-|**2025-02-17**|**Language Models Can See Better: Visual Contrastive Decoding For LLM Multimodal Reasoning**|Yuqi Pang et.al.|[2502.11751v1](http://arxiv.org/abs/2502.11751v1)|null|
-|**2025-02-17**|**SQL-o1: A Self-Reward Heuristic Dynamic Search Method for Text-to-SQL**|Shuai Lyu et.al.|[2502.11741v1](http://arxiv.org/abs/2502.11741v1)|null|
-
-#### Abstracts
-##### **Diffusion Models without Classifier-free Guidance**
-2502.12154v1 by Zhicong Tang, Jianmin Bao, Dong Chen, Baining Guo
-
-This paper presents Model-guidance (MG), a novel objective for training
-diffusion models that addresses and removes the commonly used Classifier-free
-guidance (CFG). Our innovative approach transcends the standard modeling of
-solely data distribution to incorporating the posterior probability of
-conditions. The proposed technique originates from the idea of CFG and is easy
-yet effective, making it a plug-and-play module for existing models. Our method
-significantly accelerates the training process, doubles the inference speed,
-and achieves exceptional quality that parallels and even surpasses concurrent
-diffusion models with CFG. Extensive experiments demonstrate its effectiveness,
-efficiency, and scalability on different models and datasets. Finally, we establish
-state-of-the-art performance on ImageNet 256 benchmarks with an FID of 1.34.
-Our code is available at https://github.com/tzco/Diffusion-wo-CFG.
-
-摘要：本文提出模型指導 (MG)，一種用於訓練擴散模型的新目標，它解決並消除了常用的無分類器指導 (CFG)。我們的創新方法超越了僅數據分佈的標準建模，並納入了條件的後驗機率。提議的技術源自 CFG 的概念，既簡單又有效，使其成為現有模型的即插即用模組。我們的技術顯著加速了訓練過程，將推論速度提高了一倍，並取得了與 CFG 並行甚至超越並行擴散模型的出色品質。廣泛的實驗證明了該技術在不同模型和資料集上的有效性、效率和可擴充性。最後，我們在 ImageNet 256 基準上建立了最先進的效能，FID 為 1.34。我們的程式碼可在 https://github.com/tzco/Diffusion-wo-CFG 取得。
-
-##### **Idiosyncrasies in Large Language Models**
-2502.12150v1 by Mingjie Sun, Yida Yin, Zhiqiu Xu, J. Zico Kolter, Zhuang Liu
-
-In this work, we unveil and study idiosyncrasies in Large Language Models
-(LLMs) -- unique patterns in their outputs that can be used to distinguish the
-models. To do so, we consider a simple classification task: given a particular
-text output, the objective is to predict the source LLM that generates the
-text. We evaluate this synthetic task across various groups of LLMs and find
-that simply fine-tuning existing text embedding models on LLM-generated texts
-yields excellent classification accuracy. Notably, we achieve 97.1% accuracy on
-held-out validation data in the five-way classification problem involving
-ChatGPT, Claude, Grok, Gemini, and DeepSeek. Our further investigation reveals
-that these idiosyncrasies are rooted in word-level distributions. These
-patterns persist even when the texts are rewritten, translated, or summarized
-by an external LLM, suggesting that they are also encoded in the semantic
-content. Additionally, we leverage LLMs as judges to generate detailed,
-open-ended descriptions of each model's idiosyncrasies. Finally, we discuss the
-broader implications of our findings, particularly for training on synthetic
-data and inferring model similarity. Code is available at
-https://github.com/locuslab/llm-idiosyncrasies.
-
-摘要：在這項工作中，我們揭示並研究了大型語言模型 (LLM) 中的特殊性，也就是其輸出中可區分模型的獨特模式。為此，我們考慮了一項簡單的分類任務：給定一個特定文本輸出，目標是預測產生該文本的來源 LLM。我們在各種 LLM 組合中評估這個合成任務，並發現僅微調現有的文本嵌入模型在 LLM 生成的文本上即可產生極佳的分類準確度。值得注意的是，在涉及 ChatGPT、Claude、Grok、Gemini 和 DeepSeek 的五向分類問題中，我們在留存驗證資料上達到了 97.1% 的準確度。我們的進一步調查顯示，這些特殊性根植於詞彙層級的分布。即使文本是由外部 LLM 改寫、翻譯或摘要，這些模式仍然存在，這表明它們也編碼在語義內容中。此外，我們利用 LLM 作為評審，為每個模型的特殊性產生詳細、開放式的描述。最後，我們討論了我們發現的更廣泛含意，特別是對於合成資料的訓練和推斷模型相似性。程式碼可在 https://github.com/locuslab/llm-idiosyncrasies 取得。
-
-##### **HARBOR: Exploring Persona Dynamics in Multi-Agent Competition**
-2502.12149v1 by Kenan Jiang, Li Xiong, Fei Liu
-
-We investigate factors contributing to LLM agents' success in competitive
-multi-agent environments, using auctions as a testbed where agents bid to
-maximize profit. The agents are equipped with bidding domain knowledge,
-distinct personas that reflect item preferences, and a memory of auction
-history. Our work extends the classic auction scenario by creating a realistic
-environment where multiple agents bid on houses, weighing aspects such as size,
-location, and budget to secure the most desirable homes at the lowest prices.
-Particularly, we investigate three key questions: (a) How does a persona
-influence an agent's behavior in a competitive setting? (b) Can an agent
-effectively profile its competitors' behavior during auctions? (c) How can
-persona profiling be leveraged to create an advantage using strategies such as
-theory of mind? Through a series of experiments, we analyze the behaviors of
-LLM agents and shed light on new findings. Our testbed, called HARBOR, offers a
-valuable platform for deepening our understanding of multi-agent workflows in
-competitive environments.
-
-摘要：我們研究促成 LLM 代理在競爭性多代理環境中成功的因素，使用拍賣作為測試平台，其中代理出價以最大化利潤。這些代理配備了競標領域知識、反映物品偏好的不同角色以及拍賣歷史的記憶。我們的研究透過創造一個現實的環境來擴展經典的拍賣場景，在該環境中，多個代理對房屋出價，權衡大小、位置和預算等方面以最低價格確保最理想的房屋。特別是，我們研究了三個關鍵問題：(a) 角色如何在競爭環境中影響代理的行為？(b) 代理是否可以在拍賣期間有效地分析其競爭對手的行為？(c) 如何利用角色分析來利用心智理論等策略創造優勢？透過一系列實驗，我們分析 LLM 代理的行為並闡明新的發現。我們的測試平台稱為 HARBOR，它提供了一個有價值的平台，用於加深我們對競爭環境中多代理工作流程的理解。
-
-##### **Fast or Better? Balancing Accuracy and Cost in Retrieval-Augmented Generation with Flexible User Control**
-2502.12145v1 by Jinyan Su, Jennifer Healey, Preslav Nakov, Claire Cardie
-
-Retrieval-Augmented Generation (RAG) has emerged as a powerful approach to
-mitigate large language model (LLM) hallucinations by incorporating external
-knowledge retrieval. However, existing RAG frameworks often apply retrieval
-indiscriminately, leading to inefficiencies: over-retrieving when unnecessary or
-failing to retrieve iteratively when required for complex reasoning. Recent
-adaptive retrieval strategies, though they adaptively navigate these retrieval
-strategies, predict based only on query complexity and lack user-driven
-flexibility, making them infeasible for diverse user application needs. In this
-paper, we introduce a novel user-controllable RAG framework that enables
-dynamic adjustment of the accuracy-cost trade-off. Our approach leverages two
-classifiers: one trained to prioritize accuracy and another to prioritize
-retrieval efficiency. Via an interpretable control parameter $\alpha$, users
-can seamlessly navigate between minimal-cost retrieval and high-accuracy
-retrieval based on their specific requirements. We empirically demonstrate that
-our approach effectively balances accuracy, retrieval cost, and user
-controllability, making it a practical and adaptable solution for real-world
-applications.
-
-摘要：檢索增強生成 (RAG) 已成為一種強大的方法，可透過整合外部知識檢索來減輕大型語言模型 (LLM) 的幻覺。然而，現有的 RAG 框架經常不加區別地應用檢索，導致低效率，在不必要時過度檢索，或在複雜推理時無法反覆檢索。最近的自適應檢索策略，儘管自適應地導航這些檢索策略，但僅根據查詢複雜性進行預測，並且缺乏使用者驅動的靈活性，這使得它們無法滿足多樣化的使用者應用需求。在本文中，我們引入了一個新穎的使用者可控制 RAG 框架，它可以動態調整準確度成本權衡。我們的做法利用兩個分類器：一個訓練用於優先考慮準確度，另一個用於優先考慮檢索效率。透過可解釋的控制參數 $\alpha$，使用者可以在最低成本檢索和基於其特定需求的高準確度檢索之間無縫導航。我們通過實證證明，我們的做法有效地平衡了準確度、檢索成本和使用者可控性，使其成為現實世界應用中實用且適應性強的解決方案。
-
-##### **Small Models Struggle to Learn from Strong Reasoners**
-2502.12143v1 by Yuetai Li, Xiang Yue, Zhangchen Xu, Fengqing Jiang, Luyao Niu, Bill Yuchen Lin, Bhaskar Ramasubramanian, Radha Poovendran
-
-Large language models (LLMs) excel in complex reasoning tasks, and distilling
-their reasoning capabilities into smaller models has shown promise. However, we
-uncover an interesting phenomenon, which we term the Small Model Learnability
-Gap: small models ($\leq$3B parameters) do not consistently benefit from long
-chain-of-thought (CoT) reasoning or distillation from larger models. Instead,
-they perform better when fine-tuned on shorter, simpler reasoning chains that
-better align with their intrinsic learning capacity. To address this, we
-propose Mix Distillation, a simple yet effective strategy that balances
-reasoning complexity by combining long and short CoT examples or reasoning from
-both larger and smaller models. Our experiments demonstrate that Mix
-Distillation significantly improves small model reasoning performance compared
-to training on either data alone. These findings highlight the limitations of
-direct strong model distillation and underscore the importance of adapting
-reasoning complexity for effective reasoning capability transfer.
-
-摘要：大型語言模型 (LLM) 在複雜推理任務中表現出色，且將其推理能力提煉成較小的模型已展現前景。然而，我們發現了一個有趣的現象，我們稱之為小型模型可學習性差距：小型模型（參數數目 ≤ 3B）並非總能從大型模型的長鏈條思考 (CoT) 推理或提煉中受益。相反地，當針對較短、較簡單的推理鏈進行微調時，它們的表現會更好，而這更符合其內在學習能力。為了解決此問題，我們提出混合提煉，這是一種簡單但有效的策略，透過結合長短 CoT 範例或從較大及較小模型進行推理，來平衡推理的複雜性。我們的實驗證明，與僅針對任一資料進行訓練相比，混合提煉顯著改善了小型模型的推理效能。這些發現突顯了直接強模型提煉的限制，並強調了調整推理複雜性以有效轉移推理能力的重要性。
-
-##### **SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs**
-2502.12134v1 by Yige Xu, Xu Guo, Zhiwei Zeng, Chunyan Miao
-
-Chain-of-Thought (CoT) reasoning enables Large Language Models (LLMs) to
-solve complex reasoning tasks by generating intermediate reasoning steps.
-However, most existing approaches focus on hard token decoding, which
-constrains reasoning within the discrete vocabulary space and may not always be
-optimal. While recent efforts explore continuous-space reasoning, they often
-suffer from catastrophic forgetting, limiting their applicability to
-state-of-the-art LLMs that already perform well in zero-shot settings with a
-proper instruction. To address this challenge, we propose a novel approach for
-continuous-space reasoning that does not require modifying the underlying LLM.
-Specifically, we employ a lightweight assistant model to generate
-instance-specific soft thought tokens speculatively as the initial chain of
-thoughts, which are then mapped into the LLM's representation space via a
-projection module. Experimental results on five reasoning benchmarks
-demonstrate that our method enhances LLM reasoning performance through
-supervised, parameter-efficient fine-tuning.
-
-摘要：鏈式思考 (CoT) 推理讓大型語言模型 (LLM) 能夠透過產生中間推理步驟來解決複雜的推理任務。然而，現有的大多數方法都專注於硬標記解碼，這會將推理限制在離散的詞彙空間內，而且可能並非總是最佳。雖然最近的研究探索了連續空間推理，但它們經常會遭遇災難性遺忘，這限制了它們在零次學習設置中表現良好的最先進 LLM 的適用性，且需要適當的說明。為了應對這項挑戰，我們提出了一種創新的連續空間推理方法，不需要修改底層的 LLM。具體來說，我們採用一個輕量級的輔助模型來產生特定於實例的軟思考標記，作為思考的初始鏈，然後透過投影模組將它們映射到 LLM 的表示空間。在五個推理基準上的實驗結果表明，我們的模型透過監督式、參數高效的微調，增強了 LLM 的推理效能。
-
-##### **Transformer Dynamics: A neuroscientific approach to interpretability of large language models**
-2502.12131v1 by Jesseba Fernando, Grigori Guitchounts
-
-As artificial intelligence models have exploded in scale and capability,
-understanding of their internal mechanisms remains a critical challenge.
-Inspired by the success of dynamical systems approaches in neuroscience, here
-we propose a novel framework for studying computations in deep learning
-systems. We focus on the residual stream (RS) in transformer models,
-conceptualizing it as a dynamical system evolving across layers. We find that
-activations of individual RS units exhibit strong continuity across layers,
-despite the RS being a non-privileged basis. Activations in the RS accelerate
-and grow denser over layers, while individual units trace unstable periodic
-orbits. In reduced-dimensional spaces, the RS follows a curved trajectory with
-attractor-like dynamics in the lower layers. These insights bridge dynamical
-systems theory and mechanistic interpretability, establishing a foundation for
-a "neuroscience of AI" that combines theoretical rigor with large-scale data
-analysis to advance our understanding of modern neural networks.
-
-摘要：隨著人工智慧模型在規模和能力上爆炸式增長，理解其內部機制仍然是一項嚴峻的挑戰。受到神經科學中動力系統方法成功的啟發，我們在此提出了一個新的框架來研究深度學習系統中的運算。我們專注於Transformer模型中的殘差流 (RS)，將其概念化為一個跨層演化的動態系統。我們發現儘管 RS 不是一個特權基礎，但個別 RS 單元的激活在各層之間表現出很強的連續性。RS 中的激活隨著層數的增加而加速並變得更密集，而個別單元則追蹤不穩定的週期軌道。在降維空間中，RS 遵循一個曲線軌跡，在較低層中具有類吸引子的動力學。這些見解橋接了動力系統理論和機制可解釋性，為「AI 神經科學」奠定了基礎，結合了理論嚴謹性和大規模數據分析，以增進我們對現代神經網路的理解。
-
-##### **Scaling Autonomous Agents via Automatic Reward Modeling And Planning**
-2502.12130v1 by Zhenfang Chen, Delin Chen, Rui Sun, Wenjun Liu, Chuang Gan
-
-Large language models (LLMs) have demonstrated remarkable capabilities across
-a range of text-generation tasks. However, LLMs still struggle with problems
-requiring multi-step decision-making and environmental feedback, such as online
-shopping, scientific reasoning, and mathematical problem-solving. Unlike pure
-text data, collecting large-scale decision-making data is challenging.
-Moreover, many powerful LLMs are only accessible through APIs, which hinders
-their fine-tuning for agent tasks due to cost and complexity. To address LLM
-agents' limitations, we propose a framework that can automatically learn a
-reward model from the environment without human annotations. This model can be
-used to evaluate the action trajectories of LLM agents and provide heuristics
-for task planning. Specifically, our approach involves employing one LLM-based
-agent to navigate an environment randomly, generating diverse action
-trajectories. Subsequently, a separate LLM is leveraged to assign a task intent
-and synthesize a negative response alongside the correct response for each
-trajectory. These triplets (task intent, positive response, and negative
-response) are then utilized as training data to optimize a reward model capable
-of scoring action trajectories. The effectiveness and generalizability of our
-framework are demonstrated through evaluations conducted on different agent
-benchmarks. In conclusion, our proposed framework represents a significant
-advancement in enhancing LLM agents' decision-making capabilities. By
-automating the learning of reward models, we overcome the challenges of data
-scarcity and API limitations, potentially revolutionizing the application of
-LLMs in complex and interactive environments. This research paves the way for
-more sophisticated AI agents capable of tackling a wide range of real-world
-problems requiring multi-step decision-making.
-
-摘要：大型語言模型 (LLM) 已在各種文字生成任務中展示出非凡的能力。然而，LLM 仍然在需要多步驟決策制定和環境回饋的問題上苦苦掙扎，例如網上購物、科學推理和數學問題求解。與純文本數據不同，收集大規模決策制定數據具有挑戰性。此外，許多強大的 LLM 只能通過 API 訪問，這由於成本和複雜性而阻礙了它們對代理任務的微調。為了解決 LLM 代理的局限性，我們提出了一個框架，該框架可以從環境中自動學習獎勵模型，而無需人工註釋。此模型可用于評估 LLM 代理的動作軌跡並為任務規劃提供啟發式方法。具體來說，我們的方法涉及使用一個基於 LLM 的代理隨機導航環境，生成不同的動作軌跡。隨後，利用一個單獨的 LLM 為每個軌跡分配任務意圖並合成一個負面響應以及正確的響應。然後將這些三元組（任務意圖、正面響應和負面響應）用作訓練數據，以優化能夠評分動作軌跡的獎勵模型。我們框架的有效性和普遍性通過在不同代理基準上進行的評估得到證明。總之，我們提出的框架代表了加強 LLM 代理決策能力的重大進步。通過自動化獎勵模型的學習，我們克服了數據稀缺和 API 限制的挑戰，有可能徹底改變 LLM 在複雜和互動環境中的應用。這項研究為更複雜的 AI 代理鋪平了道路，這些代理能夠解決需要多步驟決策制定的大量現實世界問題。
-
-##### **LaM-SLidE: Latent Space Modeling of Spatial Dynamical Systems via Linked Entities**
-2502.12128v1 by Florian Sestak, Artur Toshev, Andreas Fürst, Günter Klambauer, Andreas Mayr, Johannes Brandstetter
-
-Generative models are spearheading recent progress in deep learning, showing
-strong promise for trajectory sampling in dynamical systems as well. However,
-while latent space modeling paradigms have transformed image and video
-generation, similar approaches are more difficult for most dynamical systems.
-Such systems -- from chemical molecule structures to collective human behavior
--- are described by interactions of entities, making them inherently linked to
-connectivity patterns and the traceability of entities over time. Our approach,
-LaM-SLidE (Latent Space Modeling of Spatial Dynamical Systems via Linked
-Entities), combines the advantages of graph neural networks, i.e., the
-traceability of entities across time-steps, with the efficiency and scalability
-of recent advances in image and video generation, where pre-trained encoder and
-decoder are frozen to enable generative modeling in the latent space. The core
-idea of LaM-SLidE is to introduce identifier representations (IDs) to allow for
-retrieval of entity properties, e.g., entity coordinates, from latent system
-representations and thus enables traceability. Experimentally, across different
-domains, we show that LaM-SLidE performs favorably in terms of speed, accuracy,
-and generalizability. (Code is available at
-https://github.com/ml-jku/LaM-SLidE)
-
-摘要：生成模型引領深度學習的最新進展，也展現出在動態系統中進行軌跡取樣的強大前景。然而，儘管潛在空間建模範例已轉變圖像和影片生成，但對於大多數動態系統來說，類似的做法較為困難。此類系統（從化學分子結構到人類集體行為）由實體的交互作用所描述，使它們與連接模式和實體隨時間的追溯性產生固有聯繫。我們的做法 LaM-SLidE（透過連結實體進行空間動態系統的潛在空間建模）結合圖形神經網路的優點，亦即跨時間步長的實體追溯性，以及圖像和影片生成中近期進展的高效率和可擴充性，其中預先訓練的編碼器和解碼器被凍結以在潛在空間中啟用生成模型。LaM-SLidE 的核心概念是導入識別符號表示（ID），以允許從潛在系統表示中擷取實體屬性（例如實體座標），從而實現追溯性。透過不同領域的實驗，我們證明 LaM-SLidE 在速度、準確度和可概括性方面表現良好。（程式碼可在 https://github.com/ml-jku/LaM-SLidE 取得）
-
-##### **On the Query Complexity of Verifier-Assisted Language Generation**
-2502.12123v1 by Edoardo Botta, Yuchen Li, Aashay Mehta, Jordan T. Ash, Cyril Zhang, Andrej Risteski
-
-Recently, a plethora of works have proposed inference-time algorithms (e.g.
-best-of-n), which incorporate verifiers to assist the generation process. Their
-quality-efficiency trade-offs have been empirically benchmarked on a variety of
-constrained generation tasks, but the algorithmic design landscape is still
-poorly understood. In this paper, we develop a mathematical framework
-for reasoning about constrained generation using a pre-trained language model
-generator oracle and a process verifier--which can decide whether a prefix can
-be extended to a string which satisfies the constraints of choice. We show that
-even in very simple settings, access to a verifier can turn an intractable
-problem (information-theoretically or computationally) into a tractable one. In
-fact, we show even simple algorithms, like tokenwise rejection sampling, can
-enjoy significant benefits from access to a verifier. Empirically, we show that
-a natural modification of tokenwise rejection sampling, in which the sampler is
-allowed to "backtrack" (i.e., erase the final few generated tokens) has robust
-and substantive benefits over natural baselines (e.g. (blockwise) rejection
-sampling, nucleus sampling)--both in terms of computational efficiency,
-accuracy and diversity.
-
-摘要：最近，许多作品提出了推理时间算法（例如 best-of-n），其中包含验证器以协助生成过程。它们的质量效率权衡已在各种受限生成任务中得到经验基准测试，但算法设计格局仍然很大程度上难以理解。在本文中，我们开发了一个数学框架，用于使用预训练语言模型生成器预言机和过程验证器推理受限生成——它可以决定是否可以将前缀扩展为满足选择约束的字符串。我们表明，即使在非常简单的设置中，访问验证器也可以将一个棘手的问题（信息论或计算）转换为一个易处理的问题。事实上，我们表明即使是简单的算法，如逐个标记拒绝采样，也可以从访问验证器中受益匪浅。凭经验，我们表明逐个标记拒绝采样的自然修改，其中允许采样器“回溯”（即，擦除最后几个生成的标记）比自然基线（例如（按块）拒绝采样、核采样）具有强大而实质性的优势——无论是在计算效率、准确性还是多样性方面。
-
-##### **LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws**
-2502.12120v1 by Prasanna Mayilvahanan, Thaddäus Wiedemer, Sayak Mallick, Matthias Bethge, Wieland Brendel
-
-Scaling laws guide the development of large language models (LLMs) by
-offering estimates for the optimal balance of model size, tokens, and compute.
-More recently, loss-to-loss scaling laws that relate losses across pretraining
-datasets and downstream tasks have emerged as a powerful tool for understanding
-and improving LLM performance. In this work, we investigate which factors most
-strongly influence loss-to-loss scaling. Our experiments reveal that the
-pretraining data and tokenizer determine the scaling trend. In contrast, model
-size, optimization hyperparameters, and even significant architectural
-differences, such as between transformer-based models like Llama and
-state-space models like Mamba, have limited impact. Consequently, practitioners
-should carefully curate suitable pretraining datasets for optimal downstream
-performance, while architectures and other settings can be freely optimized for
-training efficiency.
-
-摘要：規模化定律透過提供模型大小、符號和運算的最佳平衡估計，引導大型語言模型 (LLM) 的開發。最近，與預訓練資料集和下游任務相關的損失到損失縮放定律已成為了解和改善 LLM 效能的強大工具。在這項工作中，我們探討哪些因素最能影響損失到損失縮放。我們的實驗顯示，預訓練資料和分詞器會決定縮放趨勢。相反地，模型大小、最佳化超參數，甚至重大的架構差異（例如基於Transformer的模型，如 Llama，和狀態空間模型，如 Mamba 之間的差異）影響有限。因此，從業人員應仔細策劃適當的預訓練資料集以獲得最佳的下游效能，而架構和其他設定可以自由最佳化以提升訓練效率。
-
-##### **PRISM: Self-Pruning Intrinsic Selection Method for Training-Free Multimodal Data Selection**
-2502.12119v1 by Jinhe Bi, Yifan Wang, Danqi Yan, Xun Xiao, Artur Hecker, Volker Tresp, Yunpu Ma
-
-Visual instruction tuning refines pre-trained Multimodal Large Language
-Models (MLLMs) to enhance their real-world task performance. However, the rapid
-expansion of visual instruction datasets introduces significant data
-redundancy, leading to excessive computational costs. Existing data selection
-methods predominantly rely on proxy models or loss-based metrics, both of which
-impose substantial computational overheads due to the necessity of model
-inference and backpropagation. To address this challenge, we propose PRISM, a
-novel training-free approach for efficient multimodal data selection. Unlike
-existing methods, PRISM eliminates the reliance on proxy models, warm-up
-pretraining, and gradient-based optimization. Instead, it leverages Pearson
-correlation analysis to quantify the intrinsic visual encoding properties of
-MLLMs, computing a task-specific correlation score to identify high-value
-instances. This not only enables data-efficient selection, but also maintains the
-original performance. Empirical evaluations across multiple MLLMs demonstrate
-that PRISM reduces the overall time required for visual instruction tuning and
-data selection to just 30% of conventional methods, while surpassing fully
-fine-tuned models across eight multimodal and three language understanding
-benchmarks, achieving a 101.7% relative improvement in final performance.
-
-摘要：視覺指令調整優化預先訓練的多模態大型語言模型 (MLLM)，以增強其真實世界的任務表現。然而，視覺指令資料集的快速擴展引入了顯著的資料冗餘，導致過度的運算成本。現有的資料選取方法主要依賴於代理模型或基於損失的指標，這兩者由於模型推理和反向傳播的必要性而造成大量的運算負擔。為了應對這一挑戰，我們提出了 PRISM，一種用於高效多模態資料選取的新型無訓練方法。與現有方法不同，PRISM 消除了對代理模型、熱身預訓練和基於梯度的優化的依賴。相反，它利用 Pearson 相關分析來量化 MLLM 的內在視覺編碼特性，計算特定任務相關性分數以識別高價值實例。這不僅能選擇資料效率，而且能保持原始效能。跨多個 MLLM 的經驗評估表明，PRISM 將視覺指令調整和資料選取所需的總時間減少到傳統方法的 30%，同時在八個多模態和三個語言理解基準中超越了完全微調的模型，在最終效能上實現了 101.7% 的相對改進。
-
-##### **Scaling Test-Time Compute Without Verification or RL is Suboptimal**
-2502.12118v1 by Amrith Setlur, Nived Rajaraman, Sergey Levine, Aviral Kumar
-
-Despite substantial advances in scaling test-time compute, an ongoing debate
-in the community is how it should be scaled up to enable continued and
-efficient improvements with scaling. There are largely two approaches: first,
-distilling successful search or thinking traces; and second, using verification
-(e.g., 0/1 outcome rewards, reward models, or verifiers) to guide reinforcement
-learning (RL) and search algorithms. In this paper, we prove that finetuning
-LLMs with verifier-based (VB) methods based on RL or search is far superior to
-verifier-free (VF) approaches based on distilling or cloning search traces,
-given a fixed amount of compute/data budget. Further, we show that as we scale
-test-time compute (measured as the output token length) and training data,
-suboptimality of VF methods scales poorly compared to VB when the base
-pre-trained LLM presents a heterogeneous distribution over correct solution
-traces (e.g., different lengths, styles, etc.) and admits a non-sharp
-distribution over rewards on traces sampled from it. We formalize this
-condition using anti-concentration [Erdős, 1945]. This implies a stronger
-result that VB methods scale better asymptotically, with the performance gap
-between VB and VF methods widening as test-time budget grows. We corroborate
-our theory empirically on both didactic and math reasoning problems with
-3/8/32B-sized pre-trained LLMs, where we find verification is crucial for
-scaling test-time compute.
-
-摘要：儘管在擴展測試時間計算方面取得了重大進展，但社群中持續的辯論是如何擴展它以持續有效地改善擴展。大致有兩種方法：首先，提煉成功的搜尋或思考軌跡；其次，使用驗證（例如，0/1 結果獎勵、獎勵模型或驗證器）來指導強化學習 (RL) 和搜尋演算法。在本文中，我們證明使用基於 RL 或搜尋的驗證器為基礎 (VB) 方法微調 LLM 遠優於基於提煉或複製搜尋軌跡的驗證器免費 (VF) 方法，給定固定數量的計算/資料預算。此外，我們表明，當我們擴展測試時間計算（以輸出標記長度衡量）和訓練資料時，與 VB 相比，VF 方法的次最佳性擴展效果不佳，當基礎預先訓練的 LLM 在正確的解決方案軌跡上呈現異質分佈（例如，不同的長度、樣式等）並承認從其中取樣的軌跡上獎勵的分佈不尖銳時。我們使用反集中 [Erd\H{o}s，1945] 將此條件形式化。這暗示了一個更強的結果，即 VB 方法在漸近上擴展得更好，VB 和 VF 方法之間的效能差距隨著測試時間預算的增加而擴大。我們在具有 3/8/32B 大小的預先訓練 LLM 的教學和數學推理問題上對我們的理論進行實證驗證，我們發現驗證對於擴展測試時間計算至關重要。
-
-##### **A-MEM: Agentic Memory for LLM Agents**
-2502.12110v1 by Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, Yongfeng Zhang
-
-While large language model (LLM) agents can effectively use external tools
-for complex real-world tasks, they require memory systems to leverage
-historical experiences. Current memory systems enable basic storage and
-retrieval but lack sophisticated memory organization, despite recent attempts
-to incorporate graph databases. Moreover, these systems' fixed operations and
-structures limit their adaptability across diverse tasks. To address this
-limitation, this paper proposes a novel agentic memory system for LLM agents
-that can dynamically organize memories in an agentic way. Following the basic
-principles of the Zettelkasten method, we designed our memory system to create
-interconnected knowledge networks through dynamic indexing and linking. When a
-new memory is added, we generate a comprehensive note containing multiple
-structured attributes, including contextual descriptions, keywords, and tags.
-The system then analyzes historical memories to identify relevant connections,
-establishing links where meaningful similarities exist. Additionally, this
-process enables memory evolution - as new memories are integrated, they can
-trigger updates to the contextual representations and attributes of existing
-historical memories, allowing the memory network to continuously refine its
-understanding. Our approach combines the structured organization principles of
-Zettelkasten with the flexibility of agent-driven decision making, allowing for
-more adaptive and context-aware memory management. Empirical experiments on six
-foundation models show improvement over existing SOTA baselines.
-The source code is available at https://github.com/WujiangXu/AgenticMemory.
-
-摘要：大型語言模型 (LLM) 代理雖然能有效地使用外部工具來執行複雜的真實世界任務，但它們需要記憶體系統來利用歷史經驗。目前的記憶體系統能進行基本的儲存和檢索，但缺乏精密的記憶體組織，儘管最近嘗試納入圖形資料庫。此外，這些系統固定的運作和結構限制了它們在不同任務中的適應性。為了解決這個限制，本文提出了一種新的代理記憶體系統，供 LLM 代理動態地以代理的方式組織記憶體。遵循 Zettelkasten 方法的基本原則，我們設計我們的記憶體系統，透過動態索引和連結來建立相互連結的知識網路。當加入新的記憶體時，我們會產生包含多個結構化屬性的綜合筆記，包括脈絡描述、關鍵字和標籤。然後，系統會分析歷史記憶體以找出相關連結，在有意義的相似性時建立連結。此外，這個程序能讓記憶體演化，因為當整合新的記憶體時，它們會觸發對現有歷史記憶體的脈絡表示和屬性的更新，讓記憶體網路能持續精進它的理解。我們的做法結合了 Zettelkasten 的結構化組織原則和代理驅動決策制定的靈活性，能進行更具適應性和脈絡感知的記憶體管理。在六個基礎模型上的經驗實驗顯示出比現有的 SOTA 基準線有顯著的進步。原始碼可以在 https://github.com/WujiangXu/AgenticMemory 找到。
-
-##### **Personality Structured Interview for Large Language Model Simulation in Personality Research**
-2502.12109v1 by Pengda Wang, Huiqi Zou, Hanjie Chen, Tianjun Sun, Ziang Xiao, Frederick L. Oswald
-
-Although psychometrics researchers have recently explored the use of large
-language models (LLMs) as proxies for human participants, LLMs often fail to
-generate heterogeneous data with human-like diversity, which diminishes their
-value in advancing social science research. To address these challenges, we
-explored the potential of the theory-informed Personality Structured Interview
-(PSI) as a tool for simulating human responses in personality research. In this
-approach, the simulation is grounded in nuanced real-human interview
-transcripts that target the personality construct of interest. We have provided
-a growing set of 357 structured interview transcripts from a representative
-sample, each containing an individual's response to 32 open-ended questions
-carefully designed to gather theory-based personality evidence. Additionally,
-grounded in psychometric research, we have summarized an evaluation framework
-to systematically validate LLM-generated psychometric data. Results from three
-experiments demonstrate that well-designed structured interviews could improve
-human-like heterogeneity in LLM-simulated personality data and predict
-personality-related behavioral outcomes (i.e., organizational citizenship
-behaviors and counterproductive work behavior). We further discuss the role of
-theory-informed structured interviews in LLM-based simulation and outline a
-general framework for designing structured interviews to simulate human-like
-data for psychometric research.
-
-摘要：儘管心理測量研究人員最近已探討將大型語言模型 (LLM) 用作人類參與者的代理，但 LLM 經常無法產生具有類似人類多樣性的異質資料，這降低了它們在推進社會科學研究中的價值。為了應對這些挑戰，我們探討了理論知情的個性結構化訪談 (PSI) 作為模擬人格研究中人類反應的工具的潛力。在此方法中，模擬基於針對目標人格建構的細緻真實人類訪談記錄。我們提供了一組不斷增加的 357 個結構化訪談記錄，來自一個具代表性的樣本，每個記錄都包含個人對 32 個開放式問題的回答，這些問題經過仔細設計，用於收集基於理論的人格證據。此外，基於心理測量研究，我們總結了一個評估架構，以系統性驗證 LLM 生成的精神測量資料。三個實驗的結果表明，設計良好的結構化訪談可以改善 LLM 模擬的人格資料中類似人類的異質性，並預測與人格相關的行為結果（例如，組織公民行為和適得其反的工作行為）。我們進一步討論了理論知情的結構化訪談在基於 LLM 的模擬中的作用，並概述了一個通用框架，用於設計結構化訪談以模擬類似人類的資料，以進行心理測量研究。
-
-##### **Using the Path of Least Resistance to Explain Deep Networks**
-2502.12108v1 by Sina Salek, Joseph Enguehard
-
-Integrated Gradients (IG), a widely used axiomatic path-based attribution
-method, assigns importance scores to input features by integrating model
-gradients along a straight path from a baseline to the input. While effective
-in some cases, we show that straight paths can lead to flawed attributions. In
-this paper, we identify the cause of these misattributions and propose an
-alternative approach that treats the input space as a Riemannian manifold,
-computing attributions by integrating gradients along geodesics. We call this
-method Geodesic Integrated Gradients (GIG). To approximate geodesic paths, we
-introduce two techniques: a k-Nearest Neighbours-based approach for smaller
-models and a Stochastic Variational Inference-based method for larger ones.
-Additionally, we propose a new axiom, Strong Completeness, extending the axioms
-satisfied by IG. We show that this property is desirable for attribution
-methods and that GIG is the only method that satisfies it. Through experiments
-on both synthetic and real-world data, we demonstrate that GIG outperforms
-existing explainability methods, including IG.
-
-摘要：整合梯度 (IG) 是一種廣泛使用的公理路徑歸因方法，它透過整合從基線到輸入的直線路徑上的模型梯度，為輸入特徵分配重要性分數。雖然在某些情況下有效，但我們表明直線路徑可能會導致錯誤的歸因。在本文中，我們找出這些錯誤歸因的原因，並提出將輸入空間視為黎曼流形的替代方法，透過整合測地線上的梯度來計算歸因。我們將此方法稱為測地線整合梯度 (GIG)。為了近似測地線路徑，我們引入了兩種技術：一種基於 k 最近鄰的方法，適用於較小的模型；一種基於隨機變異推論的方法，適用於較大的模型。此外，我們提出了新的公理，即強完整性，擴展了 IG 滿足的公理。我們表明此屬性對於歸因方法而言是理想的，並且 GIG 是唯一滿足此屬性的方法。透過對合成資料和真實世界資料進行的實驗，我們證明 GIG 優於現有的可解釋性方法，包括 IG。
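The straight-path baseline that the abstract above starts from can be sketched in a few lines. This is a minimal illustration of standard Integrated Gradients only, not the geodesic variant (GIG) the paper proposes; the toy `model`, the finite-difference `grad` helper, and the step count are assumptions made for the example.

```python
import math

def model(x):
    """Hypothetical toy model: a logistic unit over two input features."""
    return 1.0 / (1.0 + math.exp(-(2.0 * x[0] - 1.0 * x[1])))

def grad(f, x, eps=1e-5):
    """Central-difference estimate of the gradient of f at x."""
    g = []
    for i in range(len(x)):
        hi, lo = list(x), list(x)
        hi[i] += eps
        lo[i] -= eps
        g.append((f(hi) - f(lo)) / (2 * eps))
    return g

def integrated_gradients(f, x, baseline, steps=200):
    """Midpoint Riemann approximation of IG along the straight path baseline -> x."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, gi in enumerate(grad(f, point)):
            total[i] += gi
    # Scale the averaged gradient by the input displacement, feature by feature.
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]

x = [1.0, 0.5]
baseline = [0.0, 0.0]
attr = integrated_gradients(model, x, baseline)

# Completeness axiom: attributions sum to f(x) - f(baseline), up to numerical error.
gap = abs(sum(attr) - (model(x) - model(baseline)))
print(attr, gap)
```

The completeness check at the end is the property that the abstract's proposed Strong Completeness axiom extends; the sketch only verifies the ordinary version.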
-
-##### **Relational Norms for Human-AI Cooperation**
-2502.12102v1 by Brian D. Earp, Sebastian Porsdam Mann, Mateo Aboy, Edmond Awad, Monika Betzler, Marietjie Botes, Rachel Calcott, Mina Caraccio, Nick Chater, Mark Coeckelbergh, Mihaela Constantinescu, Hossein Dabbagh, Kate Devlin, Xiaojun Ding, Vilius Dranseika, Jim A. C. Everett, Ruiping Fan, Faisal Feroz, Kathryn B. Francis, Cindy Friedman, Orsolya Friedrich, Iason Gabriel, Ivar Hannikainen, Julie Hellmann, Arasj Khodadade Jahrome, Niranjan S. Janardhanan, Paul Jurcys, Andreas Kappes, Maryam Ali Khan, Gordon Kraft-Todd, Maximilian Kroner Dale, Simon M. Laham, Benjamin Lange, Muriel Leuenberger, Jonathan Lewis, Peng Liu, David M. Lyreskog, Matthijs Maas, John McMillan, Emilian Mihailov, Timo Minssen, Joshua Teperowski Monrad, Kathryn Muyskens, Simon Myers, Sven Nyholm, Alexa M. Owen, Anna Puzio, Christopher Register, Madeline G. Reinecke, Adam Safron, Henry Shevlin, Hayate Shimizu, Peter V. Treit, Cristina Voinea, Karen Yan, Anda Zahiu, Renwen Zhang, Hazem Zohny, Walter Sinnott-Armstrong, Ilina Singh, Julian Savulescu, Margaret S. Clark
-
-How we should design and interact with social artificial intelligence depends
-on the socio-relational role the AI is meant to emulate or occupy. In human
-society, relationships such as teacher-student, parent-child, neighbors,
-siblings, or employer-employee are governed by specific norms that prescribe or
-proscribe cooperative functions including hierarchy, care, transaction, and
-mating. These norms shape our judgments of what is appropriate for each
-partner. For example, workplace norms may allow a boss to give orders to an
-employee, but not vice versa, reflecting hierarchical and transactional
-expectations. As AI agents and chatbots powered by large language models are
-increasingly designed to serve roles analogous to human positions - such as
-assistant, mental health provider, tutor, or romantic partner - it is
-imperative to examine whether and how human relational norms should extend to
-human-AI interactions. Our analysis explores how differences between AI systems
-and humans, such as the absence of conscious experience and immunity to
-fatigue, may affect an AI's capacity to fulfill relationship-specific functions
-and adhere to corresponding norms. This analysis, which is a collaborative
-effort by philosophers, psychologists, relationship scientists, ethicists,
-legal experts, and AI researchers, carries important implications for AI
-systems design, user behavior, and regulation. While we accept that AI systems
-can offer significant benefits such as increased availability and consistency
-in certain socio-relational roles, they also risk fostering unhealthy
-dependencies or unrealistic expectations that could spill over into human-human
-relationships. We propose that understanding and thoughtfully shaping (or
-implementing) suitable human-AI relational norms will be crucial for ensuring
-that human-AI interactions are ethical, trustworthy, and favorable to human
-well-being.
-
-摘要：我們應如何設計和與社交人工智慧互動，取決於人工智慧預期要模仿或扮演的社會關係角色。在人類社會中，師生、父母子女、鄰居、兄弟姐妹或雇主員工等關係受特定規範所支配，這些規範規定或禁止包括等級、照顧、交易和交配在內的合作功能。這些規範形塑我們對每個夥伴適當行為的判斷。例如，職場規範可能允許老闆對員工發號施令，但反之則不行，這反映了等級和交易的期望。隨著由大型語言模型驅動的人工智慧代理程式和聊天機器人日益被設計為服務類似於人類職位的角色，例如助理、心理健康提供者、導師或浪漫伴侶，審查人類關係規範是否以及如何延伸至人類與人工智慧的互動至關重要。我們的分析探討了人工智慧系統和人類之間的差異，例如缺乏意識體驗和對疲勞的免疫力，如何影響人工智慧履行特定關係功能和遵守相應規範的能力。這項分析是由哲學家、心理學家、關係科學家、倫理學家、法律專家和人工智慧研究人員共同合作的成果，對人工智慧系統設計、使用者行為和法規具有重要的意義。雖然我們接受人工智慧系統可以在某些社會關係角色中提供顯著的好處，例如增加可用性和一致性，但它們也可能助長不健康的依賴關係或不切實際的期望，這些期望可能會蔓延到人際關係中。我們提出，理解和深思熟慮地塑造（或實施）適當的人類與人工智慧關係規範，對於確保人類與人工智慧的互動具有倫理性、可信賴性和有利於人類福祉至關重要。
-
-##### **A Study on Leveraging Search and Self-Feedback for Agent Reasoning**
-2502.12094v1 by Karthikeyan K, Michelle Yuan, Elman Mansimov, Katerina Margatina, Anurag Pratik, Daniele Bonadiman, Monica Sunkara, Yi Zhang, Yassine Benajiba
-
-Recent works have demonstrated that incorporating search during inference can
-significantly improve the reasoning capabilities of language agents. Some
-approaches make use of the ground truth or rely on the model's own generated
-feedback. The search algorithm then uses this feedback to produce values that
-update its criterion for exploring and exploiting various reasoning paths.
-In this study, we investigate how search and the model's self-feedback can be
-leveraged for reasoning tasks. First, we explore differences in ground-truth
-feedback and self-feedback during search for math reasoning. Second, we observe
-limitations in applying search techniques to more complex tasks like
-tool-calling and design domain-specific approaches to address these gaps. Our
-experiments reveal challenges related to generalization when solely relying on
-self-feedback during search. For search to work effectively, either access to
-the ground-truth is needed or feedback mechanisms need to be carefully designed
-for the specific task.
-
-摘要：最近的研究表明，在推理过程中加入搜索功能可以显著提升语言代理的推理能力。一些方法可能会利用基本事实或依赖模型本身产生的反馈。搜索算法使用此反馈，然后生成值，以更新其探索和利用各种推理路径的标准。在本研究中，我们调查了如何利用搜索和模型的自反馈来进行推理任务。首先，我们探讨了数学推理搜索过程中基本事实反馈和自反馈的差异。其次，我们观察到在将搜索技术应用于更复杂的任务（如工具调用和设计特定于领域的解决方案）时存在的局限性，并提出针对这些差距的解决方案。我们的实验揭示了在搜索过程中仅依赖自反馈时与泛化相关的挑战。要使搜索有效，需要访问基本事实或需要针对特定任务仔细设计反馈机制。
-
-##### **Meta-Statistical Learning: Supervised Learning of Statistical Inference**
-2502.12088v1 by Maxime Peyrard, Kyunghyun Cho
-
-This work demonstrates that the tools and principles driving the success of
-large language models (LLMs) can be repurposed to tackle distribution-level
-tasks, where the goal is to predict properties of the data-generating
-distribution rather than labels for individual datapoints. These tasks
-encompass statistical inference problems such as parameter estimation,
-hypothesis testing, or mutual information estimation. Framing these tasks
-within traditional machine learning pipelines is challenging, as supervision is
-typically tied to individual datapoints. We propose meta-statistical learning, a
-framework inspired by multi-instance learning that reformulates statistical
-inference tasks as supervised learning problems. In this approach, entire
-datasets are treated as single inputs to neural networks, which predict
-distribution-level parameters. Transformer-based architectures, without
-positional encoding, provide a natural fit due to their permutation-invariance
-properties. By training on large-scale synthetic datasets, meta-statistical
-models can leverage the scalability and optimization infrastructure of
-Transformer-based LLMs. We demonstrate the framework's versatility with
-applications in hypothesis testing and mutual information estimation, showing
-strong performance, particularly for small datasets where traditional neural
-methods struggle.
-
-摘要：这项工作表明，推动大型语言模型 (LLM) 成功发展的工具和原则可以重新用于解决分布级别任务，其中目标是预测数据生成分布的属性，而不是单个数据点的标签。这些任务包括统计推断问题，例如参数估计、假设检验或互信息估计。在传统的机器学习管道中构建这些任务具有挑战性，因为监督通常与单个数据点相关联。我们提出了元统计学习，这是一个受多实例学习启发的框架，它将统计推断任务重新表述为监督学习问题。在此方法中，整个数据集被视为神经网络的单个输入，该神经网络预测分布级别参数。基于 Transformer 的架构在没有位置编码的情况下提供了自然拟合，因为它们具有置换不变性。通过在大型合成数据集上进行训练，元统计模型可以利用基于 Transformer 的 LLM 的可扩展性和优化基础设施。我们通过在假设检验和互信息估计中的应用展示了该框架的多功能性，显示出强大的性能，特别是对于传统神经方法难以处理的小型数据集。
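The permutation-invariance point in the abstract above can be checked numerically. The sketch below is an illustration under stated assumptions, not the paper's architecture: it uses a single attention head with identity projections (hypothetical simplification), no positional encoding, and mean pooling, and shows that reordering the rows of a dataset leaves the pooled output unchanged.

```python
import math

def self_attention(rows):
    """Single-head self-attention with identity Q/K/V projections, no positions."""
    d = len(rows[0])
    out = []
    for q in rows:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in rows]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, rows)) / z
                    for j in range(d)])
    return out

def pooled(rows):
    """Mean-pool the attention outputs into one dataset-level vector."""
    out = self_attention(rows)
    return [sum(r[j] for r in out) / len(out) for j in range(len(out[0]))]

dataset = [[0.1, 1.0], [0.7, -0.3], [-0.5, 0.2], [1.2, 0.4]]
shuffled = [dataset[2], dataset[0], dataset[3], dataset[1]]

a = pooled(dataset)
b = pooled(shuffled)
# Any difference comes only from floating-point summation order.
max_diff = max(abs(x - y) for x, y in zip(a, b))
print(a, b, max_diff)
```

Without pooling, the per-row outputs are permuted along with the inputs (equivariance); the pooling step is what makes the dataset-level prediction order-independent.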
-
-##### **APB: Accelerating Distributed Long-Context Inference by Passing Compressed Context Blocks across GPUs**
-2502.12085v1 by Yuxiang Huang, Mingye Li, Xu Han, Chaojun Xiao, Weilin Zhao, Sun Ao, Hao Zhou, Jie Zhou, Zhiyuan Liu, Maosong Sun
-
-While long-context inference is crucial for advancing large language model
-(LLM) applications, its prefill speed remains a significant bottleneck. Current
-approaches, including sequence parallelism strategies and compute reduction
-through approximate attention mechanisms, still fall short of delivering
-optimal inference efficiency. This hinders scaling the inputs to longer
-sequences and processing long-context queries in a timely manner. To address
-this, we introduce APB, an efficient long-context inference framework that
-leverages multi-host approximate attention to enhance prefill speed by reducing
-compute and enhancing parallelism simultaneously. APB introduces a
-communication mechanism for essential key-value pairs within a sequence
-parallelism framework, enabling a faster inference speed while maintaining task
-performance. We implement APB by incorporating a tailored FlashAttn kernel
-alongside optimized distribution strategies, supporting diverse models and
-parallelism configurations. APB achieves speedups of up to 9.2x, 4.2x, and 1.6x
-compared with FlashAttn, RingAttn, and StarAttn, respectively, without any
-observable task performance degradation. We provide the implementation and
-experiment code of APB at https://github.com/thunlp/APB.
-
-摘要：雖然長文本推理對於推進大型語言模型 (LLM) 應用至關重要，但其預填充速度仍然是一個重大的瓶頸。目前的各種方法，包括序列並行策略和透過近似注意力機制減少運算，仍然無法提供最佳的推理效率。這會阻礙將輸入擴展到更長的序列，以及及時處理長文本查詢。為了解決這個問題，我們引入了 APB，這是一個高效的長文本推理架構，它利用多主機近似注意力來減少運算並同時提高並行性，從而提高預填充速度。APB 在序列並行架構中引入了一個用於基本鍵值對的通訊機制，在維持任務效能的同時，實現更快的推理速度。我們透過整合一個量身打造的 FlashAttn 核心以及最佳化的分佈策略來實作 APB，支援各種模型和並行配置。與 FlashAttn、RingAttn 和 StarAttn 相比，APB 分別實現了高達 9.2 倍、4.2 倍和 1.6 倍的加速，同時沒有任何可觀察到的任務效能下降。我們在 https://github.com/thunlp/APB 中提供了 APB 的實作和實驗程式碼。
-
-##### **VLM$^2$-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit Matching Visual Cues**
-2502.12084v1 by Jianshu Zhang, Dongyu Yao, Renjie Pi, Paul Pu Liang, Yi R. Fung
-
-Visually linking matching cues is a crucial ability in daily life, such as
-identifying the same person in multiple photos based on their cues, even
-without knowing who they are. Despite the extensive knowledge that
-vision-language models (VLMs) possess, it remains largely unexplored whether
-they are capable of performing this fundamental task. To address this, we
-introduce VLM$^2$-Bench, a benchmark designed to assess whether VLMs can
-Visually Link Matching cues, with 9 subtasks and over 3,000 test cases.
-Comprehensive evaluation across eight open-source VLMs and GPT-4o, along with
-further analysis of various language-side and vision-side prompting methods,
-leads to a total of eight key findings. We identify critical challenges in
-models' ability to link visual cues, highlighting a significant performance gap
-where even GPT-4o lags 34.80% behind humans. Based on these insights, we
-advocate for (i) enhancing core visual capabilities to improve adaptability and
-reduce reliance on prior knowledge, (ii) establishing clearer principles for
-integrating language-based reasoning in vision-centric tasks to prevent
-unnecessary biases, and (iii) shifting vision-text training paradigms toward
-fostering models' ability to independently structure and infer relationships
-among visual cues.
-
-摘要：視覺連結匹配線索是日常生活中的關鍵能力，例如在多張照片中根據線索辨識同一個人，即使不知道他們是誰。儘管視覺語言模型 (VLM) 擁有廣泛的知識，但它們是否能執行這項基本任務，在很大程度上仍未被探討。為了解決這個問題，我們引入了 VLM$^2$-Bench，一個基準測試，旨在評估 VLM 是否能視覺連結匹配線索，包含 9 個子任務和超過 3,000 個測試案例。對八個開源 VLM 和 GPT-4o 的全面評估，以及對各種語言側和視覺側提示方法的進一步分析，得出總共八項關鍵發現。我們找出模型連結視覺線索能力的關鍵挑戰，強調一個顯著的效能差距，即使是 GPT-4o 也落後人類 34.80%。根據這些見解，我們提倡 (i) 提升核心視覺能力以改善適應性並減少對先驗知識的依賴，(ii) 為整合基於語言的推理到以視覺為中心的任務中建立更明確的原則，以防止不必要的偏見，以及 (iii) 將視覺文字訓練範例轉移到培養模型獨立建構和推論視覺線索之間關係的能力。
-
-##### **AdaSplash: Adaptive Sparse Flash Attention**
-2502.12082v1 by Nuno Gonçalves, Marcos Treviso, André F. T. Martins
-
-The computational cost of softmax-based attention in transformers limits
-their applicability to long-context tasks. Adaptive sparsity, of which
-$\alpha$-entmax attention is an example, offers a flexible data-dependent
-alternative, but existing implementations are inefficient and do not leverage
-the sparsity to obtain runtime and memory gains. In this work, we propose
-AdaSplash, which combines the efficiency of GPU-optimized algorithms with the
-sparsity benefits of $\alpha$-entmax. We first introduce a hybrid
-Halley-bisection algorithm, resulting in a 7-fold reduction in the number of
-iterations needed to compute the $\alpha$-entmax transformation. Then, we
-implement custom Triton kernels to efficiently handle adaptive sparsity.
-Experiments with RoBERTa and ModernBERT for text classification and
-single-vector retrieval, along with GPT-2 for language modeling, show that our
-method achieves substantial improvements in runtime and memory efficiency
-compared to existing $\alpha$-entmax implementations. It approaches -- and in
-some cases surpasses -- the efficiency of highly optimized softmax
-implementations like FlashAttention-2, enabling long-context training while
-maintaining strong task performance.
-
-摘要：基於 softmax 的注意力在 Transformer 中的運算成本限制了它們在長內容任務中的應用性。適應性稀疏性，其中 $\alpha$-entmax 注意力是一個例子，提供了一個靈活的資料相關替代方案，但現有的實作效率低下，且無法利用稀疏性來獲得執行時間和記憶體的增益。在這項工作中，我們提出了 AdaSplash，它結合了 GPU 最佳化演算法的效率和 $\alpha$-entmax 的稀疏性優點。我們首先引入了一個混合 Halley-二分法演算法，導致計算 $\alpha$-entmax 轉換所需的迭代次數減少了 7 倍。然後，我們實作自訂 Triton 核心，以有效處理適應性稀疏性。針對文字分類和單一向量擷取的 RoBERTa 和 ModernBERT，以及用於語言建模的 GPT-2 的實驗顯示，與現有的 $\alpha$-entmax 實作相比，我們的方法在執行時間和記憶體效率方面獲得了顯著的改善。它接近了 -- 在某些情況下超越了 -- 高度最佳化 softmax 實作（例如 FlashAttention-2）的效率，同時在維持強大任務效能的同時，能夠進行長內容訓練。
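For readers unfamiliar with the $\alpha$-entmax transformation mentioned above, here is a minimal sketch that computes it with plain bisection on the threshold $\tau$, so that $p_i = [(\alpha-1) z_i - \tau]_+^{1/(\alpha-1)}$ sums to one. It is not AdaSplash's hybrid Halley-bisection algorithm or its Triton kernels; the function name `entmax_bisect` and the iteration count are assumptions made for the example.

```python
def entmax_bisect(z, alpha=1.5, iters=60):
    """alpha-entmax of scores z via plain bisection on the threshold tau (alpha > 1)."""
    scaled = [(alpha - 1.0) * v for v in z]
    power = 1.0 / (alpha - 1.0)

    def mass(tau):
        # Total probability mass for a candidate threshold; decreasing in tau.
        return sum(max(s - tau, 0.0) ** power for s in scaled)

    top = max(scaled)
    lo = top - 1.0                               # mass(lo) >= 1
    hi = top - (1.0 / len(z)) ** (alpha - 1.0)   # mass(hi) <= 1
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mass(mid) >= 1.0:
            lo = mid
        else:
            hi = mid
    tau = 0.5 * (lo + hi)
    p = [max(s - tau, 0.0) ** power for s in scaled]
    total = sum(p)
    return [v / total for v in p]  # renormalize away the residual bisection error

probs = entmax_bisect([2.0, 1.0, -1.0], alpha=1.5)
print(probs)  # the lowest-scoring entry receives exactly zero probability
```

Unlike softmax, low-scoring entries get exactly zero weight, which is the sparsity an efficient kernel can exploit by skipping work.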
-
-##### **Unhackable Temporal Rewarding for Scalable Video MLLMs**
-2502.12081v1 by En Yu, Kangheng Lin, Liang Zhao, Yana Wei, Zining Zhu, Haoran Wei, Jianjian Sun, Zheng Ge, Xiangyu Zhang, Jingyu Wang, Wenbing Tao
-
-In the pursuit of superior video-processing MLLMs, we have encountered a
-perplexing paradox: the "anti-scaling law", where more data and larger models
-lead to worse performance. This study unmasks the culprit: "temporal hacking",
-a phenomenon where models shortcut by fixating on select frames, missing the
-full video narrative. In this work, we systematically establish a comprehensive
-theory of temporal hacking, defining it from a reinforcement learning
-perspective, introducing the Temporal Perplexity (TPL) score to assess this
-misalignment, and proposing the Unhackable Temporal Rewarding (UTR) framework
-to mitigate the temporal hacking. Both theoretically and empirically, TPL
-proves to be a reliable indicator of temporal modeling quality, correlating
-strongly with frame activation patterns. Extensive experiments reveal that UTR
-not only counters temporal hacking but significantly elevates video
-comprehension capabilities. This work not only advances video-AI systems but
-also illuminates the critical importance of aligning proxy rewards with true
-objectives in MLLM development.
-
-摘要：在追求卓越的影片處理 MLLM 時，我們遭遇了一個令人費解的矛盾現象：「反規模化定律」，也就是更多資料和更大的模型會導致更差的效能。本研究揭露了罪魁禍首：「時間駭客」，這是一種模型透過專注於特定影格來簡化的現象，錯失了完整的影片敘事。在這項研究中，我們系統性地建立了一個關於時間駭客的全面理論，從強化學習的角度定義它，並引入了時間困惑度 (TPL) 分數來評估這種失衡，並提出了無法破解的時間獎勵 (UTR) 架構來減輕時間駭客現象。從理論和經驗上來說，TPL 被證明是時間建模品質的可靠指標，與影格啟動模式有很強的相關性。大量的實驗顯示，UTR 不僅對抗時間駭客，還能顯著提升影片理解能力。這項研究不僅推動了影片 AI 系統，也闡明了在 MLLM 開發中，將代理獎勵與真實目標對齊的重要性。
-
-##### **Can LLMs Simulate Social Media Engagement? A Study on Action-Guided Response Generation**
-2502.12073v1 by Zhongyi Qiu, Hanjia Lyu, Wei Xiong, Jiebo Luo
-
-Social media enables dynamic user engagement with trending topics, and recent
-research has explored the potential of large language models (LLMs) for
-response generation. While some studies investigate LLMs as agents for
-simulating user behavior on social media, their focus remains on practical
-viability and scalability rather than a deeper understanding of how well LLMs
-align with human behavior. This paper analyzes LLMs' ability to simulate
-social media engagement through action-guided response generation, where a
-model first predicts a user's most likely engagement action (retweet, quote, or
-rewrite) towards a trending post before generating a personalized response
-conditioned on the predicted action. We benchmark GPT-4o-mini, O1-mini, and
-DeepSeek-R1 in social media engagement simulation regarding a major societal
-event discussed on X. Our findings reveal that zero-shot LLMs underperform BERT
-in action prediction, while few-shot prompting initially degrades the
-prediction accuracy of LLMs with limited examples. However, in response
-generation, few-shot LLMs achieve stronger semantic alignment with ground truth
-posts.
-
-摘要：社交媒體讓使用者能夠動態參與熱門話題，而最近的研究探索了大型語言模型 (LLM) 在回應生成方面的潛力。儘管有些研究將 LLM 視為模擬社交媒體使用者行為的代理，但其重點仍放在實務可行性和可擴充性，而非深入了解 LLM 如何與人類行為相符。本文分析了 LLM 透過動作引導回應生成來模擬社交媒體參與的能力，其中一個模型首先預測使用者最有可能的參與動作（轉推、引用或改寫）對熱門貼文的參與，然後根據預測的動作產生個人化回應。我們在 X 上討論的一個重大社會事件中，對 GPT-4o-mini、O1-mini 和 DeepSeek-R1 進行社交媒體參與模擬的基準測試。我們的研究結果顯示，零次學習 LLM 在動作預測方面表現不如 BERT，而少次學習提示最初會降低範例有限的 LLM 預測準確度。然而，在回應生成方面，少次學習 LLM 與真實貼文達到了更強的語義對齊。
-
-##### **TokenSkip: Controllable Chain-of-Thought Compression in LLMs**
-2502.12067v1 by Heming Xia, Yongqi Li, Chak Tou Leong, Wenjie Wang, Wenjie Li
-
-Chain-of-Thought (CoT) has been proven effective in enhancing the reasoning
-capabilities of large language models (LLMs). Recent advancements, such as
-OpenAI's o1 and DeepSeek-R1, suggest that scaling up the length of CoT
-sequences during inference could further boost LLM reasoning performance.
-However, due to the autoregressive nature of LLM decoding, longer CoT outputs
-lead to a linear increase in inference latency, adversely affecting user
-experience, particularly when the CoT exceeds 10,000 tokens. To address this
-limitation, we analyze the semantic importance of tokens within CoT outputs and
-reveal that their contributions to reasoning vary. Building on this insight, we
-propose TokenSkip, a simple yet effective approach that enables LLMs to
-selectively skip less important tokens, allowing for controllable CoT
-compression. Extensive experiments across various models and tasks demonstrate
-the effectiveness of TokenSkip in reducing CoT token usage while preserving
-strong reasoning performance. Notably, when applied to Qwen2.5-14B-Instruct,
-TokenSkip reduces reasoning tokens by 40% (from 313 to 181) on GSM8K, with less
-than a 0.4% performance drop.
-
-摘要：鏈式思維 (CoT) 已被證明能有效提升大型語言模型 (LLM) 的推理能力。最近的進展，例如 OpenAI 的 o1 和 DeepSeek-R1，表明在推理過程中擴展 CoT 序列的長度可以進一步提升 LLM 的推理效能。然而，由於 LLM 解碼的自動回歸特性，較長的 CoT 輸出會導致推理延遲線性增加，對使用者體驗造成負面影響，特別是在 CoT 超過 10,000 個符號時。為了解決這個限制，我們分析了 CoT 輸出中符號的語義重要性，並揭示了它們對推理的貢獻度不同。基於這個見解，我們提出了 TokenSkip，一種簡單但有效的技術，使 LLM 能有選擇地略過較不重要的符號，從而實現可控的 CoT 壓縮。跨越各種模型和任務的廣泛實驗證明了 TokenSkip 在減少 CoT 符號使用量同時保持強大推理效能方面的有效性。值得注意的是，當應用於 Qwen2.5-14B-Instruct 時，TokenSkip 將 GSM8K 上的推理符號減少了 40%（從 313 個減少到 181 個），效能下降不到 0.4%。
-
-##### **CONSTRUCTA: Automating Commercial Construction Schedules in Fabrication Facilities with Large Language Models**
-2502.12066v1 by Yifan Zhang, Xue Yang
-
-Automating planning with LLMs presents transformative opportunities for
-traditional industries, yet remains underexplored. In commercial construction,
-the complexity of automated scheduling often requires manual intervention to
-ensure precision. We propose CONSTRUCTA, a novel framework leveraging LLMs to
-optimize construction schedules in complex projects like semiconductor
-fabrication. CONSTRUCTA addresses key challenges by: (1) integrating
-construction-specific knowledge through static RAG; (2) employing
-context-sampling techniques inspired by architectural expertise to provide
-relevant input; and (3) deploying Construction DPO to align schedules with
-expert preferences using RLHF. Experiments on proprietary data demonstrate
-performance improvements of +42.3% in missing value prediction, +79.1% in
-dependency analysis, and +28.9% in automated planning compared to baseline
-methods, showcasing its potential to revolutionize construction workflows and
-inspire domain-specific LLM advancements.
-
-摘要：利用 LLM 自動化規劃為傳統產業帶來轉型契機，但仍有待進一步探索。在商業建築中，自動化排程的複雜性通常需要手動介入以確保精確度。我們提出 CONSTRUCTA，一個利用 LLM 優化複雜專案（如半導體製造）建築排程的新穎架構。CONSTRUCTA 透過下列方式解決關鍵挑戰：(1) 整合靜態 RAG 的建築特定知識；(2) 採用受建築專業知識啟發的脈絡取樣技術，提供相關輸入；(3) 部署建築 DPO，使用 RLHF 將排程與專家偏好對齊。專利數據的實驗顯示，與基準方法相比，遺失值預測的效能提升 +42.3%、相依性分析提升 +79.1%、自動化規劃提升 +28.9%，展示其革新建築工作流程和激勵領域特定 LLM 進展的潛力。
-
-##### **Formalizing Complex Mathematical Statements with LLMs: A Study on Mathematical Definitions**
-2502.12065v1 by Lan Zhang, Marco Valentino, Andre Freitas
-
-Thanks to their linguistic capabilities, LLMs offer an opportunity to bridge
-the gap between informal mathematics and formal languages through
-autoformalization. However, it is still unclear how well LLMs generalize to
-sophisticated and naturally occurring mathematical statements. To address this
-gap, we investigate the task of autoformalizing real-world mathematical
-definitions -- a critical component of mathematical discourse. Specifically, we
-introduce two novel resources for autoformalisation, collecting definitions
-from Wikipedia (Def_Wiki) and arXiv papers (Def_ArXiv). We then systematically
-evaluate a range of LLMs, analyzing their ability to formalize definitions into
-Isabelle/HOL. Furthermore, we investigate strategies to enhance LLMs'
-performance including refinement through external feedback from Proof
-Assistants, and formal definition grounding, where we guide LLMs through
-relevant contextual elements from formal mathematical libraries. Our findings
-reveal that definitions present a greater challenge compared to existing
-benchmarks, such as miniF2F. In particular, we found that LLMs still struggle
-with self-correction, and aligning with relevant mathematical libraries. At the
-same time, structured refinement methods and definition grounding strategies
-yield notable improvements of up to 16% on self-correction capabilities and 43%
-on the reduction of undefined errors, highlighting promising directions for
-enhancing LLM-based autoformalization in real-world scenarios.
-
-摘要：由於語言能力，LLM 提供了一個機會，透過自動形式化來彌合非正式數學和形式語言之間的差距。然而，LLM 在多麼精巧且自然發生的數學陳述中概化，這仍不清楚。為了解決這個差距，我們探討了自動形式化真實世界數學定義的任務，這是數學論述中的關鍵組成部分。具體來說，我們介紹了自動形式化的兩個新資源，收集來自維基百科（Def_Wiki）和 arXiv 論文（Def_ArXiv）的定義。然後，我們系統性地評估了一系列 LLM，分析它們將定義形式化為 Isabelle/HOL 的能力。此外，我們探討了增強 LLM 效能的策略，包括透過證明輔助工具的外部回饋進行精煉，以及形式定義基礎，其中我們透過形式數學函式庫中的相關脈絡元素來引導 LLM。我們的發現顯示，與現有的基準（例如 miniF2F）相比，定義提出了更大的挑戰。特別是，我們發現 LLM 在自我修正和與相關數學函式庫對齊方面仍然有困難。同時，結構化的精煉方法和定義基礎策略在自我修正能力上產生了顯著的改善，高達 16%，在減少未定義錯誤方面改善了 43%，突顯了在真實世界場景中增強基於 LLM 的自動形式化的有希望的方向。
-
-##### **AI-generated Text Detection with a GLTR-based Approach**
-2502.12064v1 by Lucía Yan Wu, Isabel Segura-Bedmar
-
-The rise of LLMs (Large Language Models) has contributed to the improved
-performance and development of cutting-edge NLP applications. However, these
-can also pose risks when used maliciously, such as spreading fake news, harmful
-content, impersonating individuals, or facilitating school plagiarism, among
-others. This is because LLMs can generate high-quality texts, which are
-challenging to differentiate from those written by humans. GLTR, which stands
-for Giant Language Model Test Room and was developed jointly by the MIT-IBM
-Watson AI Lab and HarvardNLP, is a visual tool designed to help detect
-machine-generated texts based on GPT-2, that highlights the words in text
-depending on the probability that they were machine-generated. One limitation
-of GLTR is that the results it returns can sometimes be ambiguous and lead to
-confusion. This study aims to explore various ways to improve GLTR's
-effectiveness for detecting AI-generated texts within the context of the
-IberLef-AuTexTification 2023 shared task, in both English and Spanish
-languages. Experimental results show that our GLTR-based GPT-2 model outperforms
-the state-of-the-art models on the English dataset with a macro F1-score of
-80.19%, except for the top-ranked model (80.91%). However, for the Spanish
-dataset, we obtained a macro F1-score of 66.20%, which differs by 4.57%
-compared to the top-performing model.
-
-摘要：大型語言模型 (LLM) 的興起有助於改進尖端 NLP 應用程式的效能和開發。不過，這些應用程式若遭惡意使用，例如散布假新聞、有害內容、冒充個人或協助學校抄襲等，也可能造成風險。這是因為 LLM 可以產生高品質的文字，而這些文字難以與人類所寫的文字區分。GLTR（代表大型語言模型測試室）是由麻省理工學院-IBM Watson AI 實驗室和 HarvardNLP 共同開發的視覺工具，旨在協助偵測基於 GPT-2 的機器產生的文字，它會根據文字中每個字詞機器產生的機率來標示。GLTR 的一個限制在於，它回傳的結果有時可能模稜兩可，容易造成混淆。本研究旨在探討各種方法來改善 GLTR 在 IberLef-AuTexTification 2023 共享任務中偵測 AI 生成的文字的效能，任務中包含英文和西班牙文兩種語言。實驗結果顯示，我們的基於 GLTR 的 GPT-2 模型在英文資料集上以 80.19% 的巨觀 F1 分數超越了最先進的模型，僅次於第一名排名模型 (80.91%)。不過，在西班牙文資料集上，我們獲得的巨觀 F1 分數為 66.20%，與表現最佳的模型相比，相差 4.57%。
-
-##### **Culture is Not Trivia: Sociocultural Theory for Cultural NLP**
-2502.12057v1 by Naitian Zhou, David Bamman, Isaac L. Bleaman
-
-The field of cultural NLP has recently experienced rapid growth, driven by a
-pressing need to ensure that language technologies are effective and safe
-across a pluralistic user base. This work has largely progressed without a
-shared conception of culture, instead choosing to rely on a wide array of
-cultural proxies. However, this leads to a number of recurring limitations:
-coarse national boundaries fail to capture nuanced differences that lie within
-them, limited coverage restricts datasets to only a subset of usually
-highly-represented cultures, and a lack of dynamicity results in static
-cultural benchmarks that do not change as culture evolves. In this position
-paper, we argue that these methodological limitations are symptomatic of a
-theoretical gap. We draw on a well-developed theory of culture from
-sociocultural linguistics to fill this gap by 1) demonstrating in a case study
-how it can clarify methodological constraints and affordances, 2) offering
-theoretically-motivated paths forward to achieving cultural competence, and 3)
-arguing that localization is a more useful framing for the goals of much
-current work in cultural NLP.
-
-摘要：文化 NLP 領域最近經歷了快速成長，這是因為迫切需要確保語言技術對於多元化的使用者基礎而言是有效且安全的。這項工作在很大程度上沒有文化共識，而是選擇依賴各種文化代理。然而，這導致了許多重複性的限制：粗略的國家界線無法捕捉到其中的細微差異，有限的涵蓋範圍將資料集限制在通常高度代表的文化子集，而且缺乏動態性導致靜態文化基準無法隨著文化演變而改變。在這篇立場文件中，我們認為這些方法論限制是理論差距的徵兆。我們從社會文化語言學中汲取一個發展良好的文化理論，透過 1) 在個案研究中展示它如何釐清方法論限制和可負擔性，2) 提供理論上合理的途徑來實現文化能力，以及 3) 主張在地化對於文化 NLP 中許多當前工作的目標而言是一個更有用的框架，來填補這個差距。
-
-##### **Designing Role Vectors to Improve LLM Inference Behaviour**
-2502.12055v1 by Daniele Potertì, Andrea Seveso, Fabio Mercorio
-
-The influence of personas on Large Language Models (LLMs) has been widely
-studied, yet their direct impact on performance remains uncertain. This work
-explores a novel approach to guiding LLM behaviour through role vectors, an
-alternative to persona-based prompting. We construct 29 role vectors derived
-from model activations and evaluate their impact on benchmark performance
-across multiple domains. Our analysis investigates whether these vectors can
-effectively steer models toward domain-specific expertise. We measure two key
-interventions: (i) activation addition, which reinforces role-specific
-directions, and (ii) directional ablation, which removes them. Results on
-well-established benchmarks indicate that role vectors do, in fact, influence
-model behaviour, improving task performance in relevant domains while
-marginally affecting unrelated tasks. This, in turn, suggests that manipulating
-internal model representations has a greater impact on outcomes than
-persona-based prompting.
-
-摘要：大型語言模型 (LLM) 中角色的影響已被廣泛研究，但它們對效能的直接影響仍然不確定。本研究探討了一種透過角色向量引導 LLM 行為的新方法，這是一種基於角色提示的替代方案。我們從模型激活中建構了 29 個角色向量，並評估它們對多個領域基準效能的影響。我們的分析探討了這些向量是否能有效地引導模型朝向特定領域的專業知識。我們衡量了兩個關鍵干預措施：(i) 激活新增，它加強了特定角色的方向，以及 (ii) 方向消融，它移除了這些方向。在既定基準上的結果表明，角色向量確實會影響模型行為，在相關領域中改善任務效能，同時對不相關任務的影響很小。這反過來表明，操縱內部模型表示對結果的影響比基於角色的提示更大。
-
-##### **PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning**
-2502.12054v1 by Xinyu Zhang, Yuxuan Dong, Yanrui Wu, Jiaxing Huang, Chengyou Jia, Basura Fernando, Mike Zheng Shou, Lingling Zhang, Jun Liu
-
-Large language models demonstrate remarkable capabilities across various
-domains, especially mathematics and logic reasoning. However, current
-evaluations overlook physics-based reasoning - a complex task requiring physics
-theorems and constraints. We present PhysReason, a 1,200-problem benchmark
-comprising knowledge-based (25%) and reasoning-based (75%) problems, where the
-latter are divided into three difficulty levels (easy, medium, hard). Notably,
-problems require an average of 8.1 solution steps, with hard requiring 15.6,
-reflecting the complexity of physics-based reasoning. We propose the Physics
-Solution Auto Scoring Framework, incorporating efficient answer-level and
-comprehensive step-level evaluations. Top-performing models like Deepseek-R1,
-Gemini-2.0-Flash-Thinking, and o3-mini-high achieve less than 60% on
-answer-level evaluation, with performance dropping from knowledge questions
-(75.11%) to hard problems (31.95%). Through step-level evaluation, we
-identified four key bottlenecks: Physics Theorem Application, Physics Process
-Understanding, Calculation, and Physics Condition Analysis. These findings
-position PhysReason as a novel and comprehensive benchmark for evaluating
-physics-based reasoning capabilities in large language models. Our code and
-data will be published at https://dxzxy12138.github.io/PhysReason.
-
-摘要：大型語言模型展示了在各個領域的非凡能力，特別是數學和邏輯推理。然而，目前的評估忽略了基於物理的推理——這是一項複雜的任務，需要物理定理和約束。我們提出了 PhysReason，一個包含 1,200 題的基準，包含基於知識的（25%）和基於推理的（75%）問題，後者分為三個難度等級（容易、中等、困難）。值得注意的是，問題需要平均 8.1 個求解步驟，困難的需要 15.6 個，反映了基於物理的推理的複雜性。我們提出了物理解決方案自動評分框架，結合了高效的答案級別和全面的步驟級別評估。Deepseek-R1、Gemini-2.0-Flash-Thinking 和 o3-mini-high 等表現最佳的模型在答案級別評估中獲得低於 60% 的分數，性能從知識問題（75.11%）下降到困難問題（31.95%）。通過步驟級別評估，我們確定了四個關鍵瓶頸：物理定理應用、物理過程理解、計算和物理條件分析。這些發現將 PhysReason 定位為一個新穎且全面的基準，用於評估大型語言模型中基於物理的推理能力。我們的代碼和數據將發布在 https:/dxzxy12138.github.io/PhysReason。
-
-##### **A Dual-Perspective NLG Meta-Evaluation Framework with Automatic Benchmark and Better Interpretability**
-2502.12052v1 by Xinyu Hu, Mingqi Gao, Li Lin, Zhenghan Yu, Xiaojun Wan
-
-In NLG meta-evaluation, evaluation metrics are typically assessed based on
-their consistency with humans. However, we identify some limitations in
-traditional NLG meta-evaluation approaches, such as issues in handling human
-ratings and ambiguous selections of correlation measures, which undermine the
-effectiveness of meta-evaluation. In this work, we propose a dual-perspective
-NLG meta-evaluation framework that focuses on different evaluation
-capabilities, thereby providing better interpretability. In addition, we
-introduce a method of automatically constructing the corresponding benchmarks
-without requiring new human annotations. Furthermore, we conduct experiments
-with 16 representative LLMs as the evaluators based on our proposed framework,
-comprehensively analyzing their evaluation performance from different
-perspectives.
-
-摘要：在 NLG 元評估中，評估指標通常根據其與人類的一致性進行評估。然而，我們在傳統的 NLG 元評估方法中發現了一些限制，例如在處理人類評分和模稜兩可的相關性測量選擇方面存在問題，這會損害元評估的有效性。在這項工作中，我們提出了一個雙視角 NLG 元評估框架，該框架專注於不同的評估能力，從而提供更好的可解釋性。此外，我們引入了一種自動構建相應基準的方法，而不需要新的手動註釋。此外，我們根據我們提出的框架對 16 個具有代表性的 LLM 作為評估器進行了實驗，從不同的角度全面分析了它們的評估性能。
-
-##### **How to Upscale Neural Networks with Scaling Law? A Survey and Practical Guidelines**
-2502.12051v1 by Ayan Sengupta, Yash Goel, Tanmoy Chakraborty
-
-Neural scaling laws have revolutionized the design and optimization of
-large-scale AI models by revealing predictable relationships between model
-size, dataset volume, and computational resources. Early research established
-power-law relationships in model performance, leading to compute-optimal
-scaling strategies. However, recent studies highlighted their limitations
-across architectures, modalities, and deployment contexts. Sparse models,
-mixture-of-experts, retrieval-augmented learning, and multimodal models often
-deviate from traditional scaling patterns. Moreover, scaling behaviors vary
-across domains such as vision, reinforcement learning, and fine-tuning,
-underscoring the need for more nuanced approaches. In this survey, we
-synthesize insights from over 50 studies, examining the theoretical
-foundations, empirical findings, and practical implications of scaling laws. We
-also explore key challenges, including data efficiency, inference scaling, and
-architecture-specific constraints, advocating for adaptive scaling strategies
-tailored to real-world applications. We suggest that while scaling laws provide
-a useful guide, they do not always generalize across all architectures and
-training strategies.
-
-摘要：神經網路規模定律透過揭示模型規模、資料集體積和計算資源之間可預測的關係，徹底革新了大型 AI 模型的設計和最佳化。早期研究建立了模型效能中的冪次定律關係，進而產生最佳化的運算規模策略。然而，最近的研究突出了它們在架構、模態和部署脈絡中的限制。稀疏模型、專家混合、檢索增強式學習和多模態模型通常偏離傳統的規模模式。此外，規模行為因視覺、強化學習和微調等領域而異，強調需要更細緻的方法。在這項調查中，我們綜合了 50 多項研究的見解，探討規模定律的理論基礎、實證發現和實務意涵。我們也探討了關鍵挑戰，包括資料效率、推論規模和特定於架構的限制，提倡針對實際應用量身打造的自適應規模策略。我們建議，儘管規模定律提供了有用的指南，但它們並不總是能概括到所有架構和訓練策略。
-
-##### **SpeechT: Findings of the First Mentorship in Speech Translation**
-2502.12050v1 by Yasmin Moslem, Juan Julián Cea Morán, Mariano Gonzalez-Gomez, Muhammad Hazim Al Farouq, Farah Abdou, Satarupa Deb
-
-This work presents the details and findings of the first mentorship in speech
-translation (SpeechT), which took place in December 2024 and January 2025. To
-fulfil the requirements of the mentorship, the participants engaged in key
-activities, including data preparation, modelling, and advanced research.
-
-摘要：本研究報告了 2024 年 12 月和 2025 年 1 月舉行的首次語音翻譯 (SpeechT) 指導計畫的詳細資訊和發現。為了滿足指導計畫的要求，參與者參與了關鍵活動，包括資料準備、建模和進階研究。
-
-##### **A Survey on Bridging EEG Signals and Generative AI: From Image and Text to Beyond**
-2502.12048v1 by Shreya Shukla, Jose Torres, Abhijit Mishra, Jacek Gwizdka, Shounak Roychowdhury
-
-Integration of Brain-Computer Interfaces (BCIs) and Generative Artificial
-Intelligence (GenAI) has opened new frontiers in brain signal decoding,
-enabling assistive communication, neural representation learning, and
-multimodal integration. BCIs, particularly those leveraging
-Electroencephalography (EEG), provide a non-invasive means of translating
-neural activity into meaningful outputs. Recent advances in deep learning,
-including Generative Adversarial Networks (GANs) and Transformer-based Large
-Language Models (LLMs), have significantly improved EEG-based generation of
-images, text, and speech. This paper provides a literature review of the
-state-of-the-art in EEG-based multimodal generation, focusing on (i)
-EEG-to-image generation through GANs, Variational Autoencoders (VAEs), and
-Diffusion Models, and (ii) EEG-to-text generation leveraging Transformer based
-language models and contrastive learning methods. Additionally, we discuss the
-emerging domain of EEG-to-speech synthesis, an evolving multimodal frontier. We
-highlight key datasets, use cases, challenges, and EEG feature encoding methods
-that underpin generative approaches. By providing a structured overview of
-EEG-based generative AI, this survey aims to equip researchers and
-practitioners with insights to advance neural decoding, enhance assistive
-technologies, and expand the frontiers of brain-computer interaction.
-
-摘要：腦機介面（BCIs）與生成式人工智慧（GenAI）的整合為腦信號解碼開啟了新領域，能協助溝通、神經表徵學習與多模式整合。BCIs，特別是利用腦電圖（EEG）的 BCIs，提供了一種非侵入性的方式，可將神經活動轉換為有意義的輸出。深度學習的最新進展，包括生成對抗網路（GANs）與基於 Transformer 的大型語言模型（LLMs），大幅改善了基於 EEG 的影像、文字與語音生成。本文提供了一份基於 EEG 的多模式生成的最新文獻回顧，重點在於（一）透過 GANs、變異自動編碼器（VAEs）與擴散模型進行 EEG 到影像的生成，以及（二）利用基於 Transformer 的語言模型與對比學習方法進行 EEG 到文字的生成。此外，我們討論了 EEG 到語音合成的新興領域，這是一個不斷演進的多模式領域。我們重點介紹了關鍵的資料集、用例、挑戰與支撐生成方法的 EEG 特徵編碼方法。透過提供基於 EEG 的生成式 AI 的結構化概觀，本調查旨在為研究人員與從業人員提供見解，以推進神經解碼、增強輔助技術並擴展腦機互動的領域。
-
-##### **KnowPath: Knowledge-enhanced Reasoning via LLM-generated Inference Paths over Knowledge Graphs**
-2502.12029v1 by Qi Zhao, Hongyu Yang, Qi Song, Xinwei Yao, Xiangyang Li
-
-Large language models (LLMs) have demonstrated remarkable capabilities in
-various complex tasks, yet they still suffer from hallucinations. Introducing
-external knowledge, such as knowledge graphs, can enhance the LLMs' ability to
-provide factual answers. LLMs have the ability to interactively explore
-knowledge graphs. However, most approaches have been affected by insufficient
-internal knowledge excavation in LLMs, limited generation of trustworthy
-knowledge reasoning paths, and a vague integration between internal and
-external knowledge. Therefore, we propose KnowPath, a knowledge-enhanced large
-model framework driven by the collaboration of internal and external knowledge.
-It relies on the internal knowledge of the LLM to guide the exploration of
-interpretable directed subgraphs in external knowledge graphs, better
-integrating the two knowledge sources for more accurate reasoning. Extensive
-experiments on multiple real-world datasets confirm the superiority of
-KnowPath.
-
-摘要：大型語言模型 (LLM) 已在各種複雜任務中展現出卓越的能力，但仍會出現幻覺。引入外部知識（例如知識圖譜）可以增強 LLM 提供事實答案的能力。LLM 有能力互動式地探索知識圖譜。然而，大多數方法都受到 LLM 中內部知識挖掘不足、可信賴知識推理路徑生成受限，以及內部和外部知識之間的整合模糊的影響。因此，我們提出 KnowPath，這是一個由內部和外部知識的協作驅動的知識增強型大型模型框架。它依賴於 LLM 的內部知識來指導對外部知識圖譜中可解釋的有向子圖的探索，更好地整合兩個知識來源以進行更準確的推理。對多個真實世界資料集進行的大量實驗證實了 KnowPath 的優越性。
-
-##### **SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities**
-2502.12025v1 by Fengqing Jiang, Zhangchen Xu, Yuetai Li, Luyao Niu, Zhen Xiang, Bo Li, Bill Yuchen Lin, Radha Poovendran
-
-Emerging large reasoning models (LRMs), such as DeepSeek-R1 models, leverage
-long chain-of-thought (CoT) reasoning to generate structured intermediate
-steps, enhancing their reasoning capabilities. However, long CoT does not
-inherently guarantee safe outputs, potentially leading to harmful consequences
-such as the introduction of security vulnerabilities in code or the spread of
-misinformation. Current research on large language model (LLM) safety usually
-focuses on short-answer responses, overlooking the long CoT style outputs of
-LRMs. To bridge this gap, we conduct a systematic study of LRM safety. First,
-we investigate safety evaluators calibrated against human annotations. Using
-our newly developed metrics, we thoroughly assess the safety of 12
-state-of-the-art LRMs on StrongReject and WildJailbreak datasets. Our results
-show that LRM safety has not kept pace with their reasoning advances. Further, we
-perform a fine-grained analysis of the reasoning trace and final answer. We
-find that three decoding strategies (ZeroThink, LessThink, and MoreThink) can
-improve model safety without additional training. However, these strategies
-either use constrained reasoning traces or incur high inference costs. To
-better strengthen LRM safety, we introduce SafeChain, the first-of-its-kind
-safety training dataset in CoT style. We fine-tune two LRMs with SafeChain,
-showing that it not only enhances model safety but also preserves performance
-across 6 reasoning benchmarks.
-
-摘要：新興的大型推理模型（LRM），例如 DeepSeek-R1 模型，利用長鏈思考（CoT）推理來生成結構化的中間步驟，增強其推理能力。然而，長 CoT 本質上並不能保證安全的輸出，可能會導致有害的後果，例如在程式碼中引入安全漏洞或散佈錯誤訊息。目前針對大型語言模型（LLM）安全性的研究通常側重於簡短的回答回應，忽略了 LRM 的長 CoT 風格輸出。為了彌補這個差距，我們對 LRM 安全性進行系統性研究。首先，我們研究根據人類註解校正的安全評估器。使用我們新開發的指標，我們徹底評估了 12 個最先進的 LRM 在 StrongReject 和 WildJailbreak 資料集上的安全性。我們的結果表明，與其推理進度相比，LRM 並不安全。此外，我們對推理軌跡和最終答案進行了細粒度分析。我們發現三種解碼策略（ZeroThink、LessThink 和 MoreThink）可以在不額外訓練的情況下提高模型安全性。然而，這些策略要么使用受約束的推理軌跡，要么會產生高昂的推論成本。為了進一步加強 LRM 安全性，我們引入了 SafeChain，這是第一個 CoT 風格的安全訓練資料集。我們使用 SafeChain 微調了兩個 LRM，表明它不僅增強了模型安全性，而且在 6 個推理基準測試中都保持了效能。
-
-##### **Teaching LLMs According to Their Aptitude: Adaptive Reasoning for Mathematical Problem Solving**
-2502.12022v1 by Xin Xu, Yan Xu, Tianhao Chen, Yuchen Yan, Chengwu Liu, Zaoyu Chen, Yufei Wang, Yichun Yin, Yasheng Wang, Lifeng Shang, Qun Liu
-
-Existing approaches to mathematical reasoning with large language models
-(LLMs) rely on Chain-of-Thought (CoT) for generalizability or Tool-Integrated
-Reasoning (TIR) for precise computation. While efforts have been made to
-combine these methods, they primarily rely on post-selection or predefined
-strategies, leaving an open question: whether LLMs can autonomously adapt their
-reasoning strategy based on their inherent capabilities. In this work, we
-propose TATA (Teaching LLMs According to Their Aptitude), an adaptive framework
-that enables LLMs to personalize their reasoning strategy spontaneously,
-aligning it with their intrinsic aptitude. TATA incorporates base-LLM-aware
-data selection during supervised fine-tuning (SFT) to tailor training data to
-the model's unique abilities. This approach equips LLMs to autonomously
-determine and apply the appropriate reasoning strategy at test time. We
-evaluate TATA through extensive experiments on six mathematical reasoning
-benchmarks, using both general-purpose and math-specialized LLMs. Empirical
-results demonstrate that TATA effectively combines the complementary strengths
-of CoT and TIR, achieving superior or comparable performance with improved
-inference efficiency compared to TIR alone. Further analysis underscores the
-critical role of aptitude-aware data selection in enabling LLMs to make
-effective and adaptive reasoning decisions and align reasoning strategies with
-model capabilities.
-
-摘要：現有的數學推理方法使用大型語言模型 (LLM) 仰賴思考鏈 (CoT) 來達到泛化性，或使用工具整合推理 (TIR) 來進行精確運算。儘管已有人嘗試結合這些方法，但它們主要依賴後選取或預定義策略，留下一個開放性的問題：LLM 是否能根據其內在能力自主調整其推理策略。在這項工作中，我們提出 TATA（根據其天賦來教授 LLM），這是一個適應性架構，讓 LLM 能夠自發地個人化其推理策略，並與其內在的天賦保持一致。TATA 在監督微調 (SFT) 期間納入了基礎 LLM 感知資料選取，以根據模型的獨特能力調整訓練資料。此方法讓 LLM 能夠在測試時自主決定並套用適當的推理策略。我們透過對六個數學推理基準進行廣泛的實驗來評估 TATA，使用通用和數學專用 LLM。經驗結果顯示，TATA 有效地結合了 CoT 和 TIR 的互補優勢，與僅使用 TIR 相比，達到了優越或相當的效能，並改善了推論效率。進一步的分析強調了天賦感知資料選取在讓 LLM 能夠做出有效且適應性的推理決策，並將推理策略與模型能力保持一致時所扮演的關鍵角色。
-
-##### **Atom of Thoughts for Markov LLM Test-Time Scaling**
-2502.12018v1 by Fengwei Teng, Zhaoyang Yu, Quan Shi, Jiayi Zhang, Chenglin Wu, Yuyu Luo
-
-Large Language Models (LLMs) achieve superior performance through
-training-time scaling, and test-time scaling further enhances their
-capabilities by conducting effective reasoning during inference. However, as
-the scale of reasoning increases, existing test-time scaling methods suffer
-from accumulated historical information, which not only wastes computational
-resources but also interferes with effective reasoning. To address this issue,
-we observe that complex reasoning progress is often achieved by solving a
-sequence of independent subquestions, each being self-contained and verifiable.
-These subquestions are essentially atomic questions, relying primarily on their
-current state rather than accumulated history, similar to the memoryless
-transitions in a Markov process. Based on this observation, we propose Atom of
-Thoughts (AoT), where each state transition in the reasoning process consists
-of decomposing the current question into a dependency-based directed acyclic
-graph and contracting its subquestions, forming a new atomic question state.
-This iterative decomposition-contraction process continues until reaching
-directly solvable atomic questions, naturally realizing Markov transitions
-between question states. Furthermore, these atomic questions can be seamlessly
-integrated into existing test-time scaling methods, enabling AoT to serve as a
-plug-in enhancement for improving reasoning capabilities. Experiments across
-six benchmarks demonstrate the effectiveness of AoT both as a standalone
-framework and a plug-in enhancement. Notably, on HotpotQA, when applied to
-gpt-4o-mini, AoT achieves an 80.6% F1 score, surpassing o3-mini by 3.4% and
-DeepSeek-R1 by 10.6%. The code will be available at
-https://github.com/qixucen/atom.
-
-摘要：大型語言模型 (LLM) 透過訓練時間擴充來達成卓越的效能，而測試時間擴充透過在推論期間進行有效的推理，進一步提升其能力。然而，隨著推理規模的擴大，現有的測試時間擴充方法會受到累積的歷史資訊影響，這不僅會浪費運算資源，還會干擾有效的推理。為了解決這個問題，我們觀察到複雜的推理進程通常是透過解決一系列獨立的子問題來達成，每個子問題都是獨立且可驗證的。這些子問題本質上是原子問題，主要依賴於它們的當前狀態，而不是累積的歷史，類似於馬可夫過程中的無記憶轉換。基於這個觀察，我們提出了思想原子 (AoT)，其中推理過程中每個狀態轉換都包含將當前問題分解為基於依賴關係的有向無環圖，並收縮其子問題，形成新的原子問題狀態。這個反覆的分解收縮過程會持續進行，直到達到可直接解決的原子問題，自然地實現問題狀態之間的馬可夫轉換。此外，這些原子問題可以無縫整合到現有的測試時間擴充方法中，讓 AoT 可以作為外掛程式強化功能，以改善推理能力。橫跨六個基準的實驗證明了 AoT 作為獨立架構和外掛程式強化的有效性。值得注意的是，在 HotpotQA 上，當應用於 gpt-4o-mini 時，AoT 達到了 80.6% 的 F1 分數，比 o3-mini 高出 3.4%，比 DeepSeek-R1 高出 10.6%。程式碼將在 https://github.com/qixucen/atom 上提供。
-
-##### **Demographic Attributes Prediction from Speech Using WavLM Embeddings**
-2502.12007v1 by Yuchen Yang, Thomas Thebaud, Najim Dehak
-
-This paper introduces a general classifier based on WavLM features, to infer
-demographic characteristics, such as age, gender, native language, education,
-and country, from speech. Demographic feature prediction plays a crucial role
-in applications like language learning, accessibility, and digital forensics,
-enabling more personalized and inclusive technologies. Leveraging pretrained
-models for embedding extraction, the proposed framework identifies key acoustic
-and linguistic features associated with demographic attributes, achieving a
-Mean Absolute Error (MAE) of 4.94 for age prediction and over 99.81% accuracy
-for gender classification across various datasets. Our system improves upon
-existing models by up to 30% relative in MAE and up to 10% relative in accuracy
-and F1 scores across tasks, leveraging a diverse range of datasets and large
-pretrained models to ensure robustness and generalizability. This study offers
-new insights into speaker diversity and provides a strong foundation for future
-research in speech-based demographic profiling.
-
-摘要：本文介紹一個基於 WavLM 特徵的一般分類器，用於從語音中推斷人口特徵，例如年齡、性別、母語、教育和國家。人口特徵預測在語言學習、無障礙性和數位鑑識等應用中扮演著至關重要的角色，能實現更個人化且包容性的技術。利用預先訓練的模型進行嵌入式萃取，提出的架構識別與人口屬性相關的主要音訊和語言特徵，在年齡預測中達到 4.94 的平均絕對誤差 (MAE)，在各種資料集中的性別分類中準確率超過 99.81%。我們的系統在平均絕對誤差上比現有模型提升了相對 30%，在準確率和 F1 分數上提升了相對 10%，利用各種資料集和大型預先訓練模型來確保穩健性和概括性。本研究提供了對說話者多元性的新見解，並為未來基於語音的人口特徵分析研究奠定了堅實的基礎。
-
-##### **Merging Language and Domain Specific Models: The Impact on Technical Vocabulary Acquisition**
-2502.12001v1 by Thibault Rousset, Taisei Kakibuchi, Yusuke Sasaki, Yoshihide Nomura
-
-This paper investigates the integration of technical vocabulary in merged
-language models. We explore the knowledge transfer mechanisms involved when
-combining a general-purpose language-specific model with a domain-specific
-model, focusing on the resulting model's comprehension of technical jargon. Our
-experiments analyze the impact of this merging process on the target model's
-proficiency in handling specialized terminology. We present a quantitative
-evaluation of the performance of the merged model, comparing it with that of
-the individual constituent models. The findings offer insights into the
-effectiveness of different model merging methods for enhancing domain-specific
-knowledge and highlight potential challenges and future directions in
-leveraging these methods for cross-lingual knowledge transfer in Natural
-Language Processing.
-
-摘要：本文探討了技術詞彙在合併語言模型中的整合。我們探討了結合一般用途語言特定模型與特定領域模型時所涉及的知識轉移機制，重點在於所產生模型對技術術語的理解。我們的實驗分析了此合併程序對目標模型處理專業術語能力的影響。我們提出了合併模型效能的量化評估，並將其與個別組成模型的效能進行比較。這些發現提供了見解，說明了不同模型合併方法在增強特定領域知識方面的效能，並強調了利用這些方法進行自然語言處理中跨語言知識轉移的潛在挑戰和未來方向。
-
-##### **Presumed Cultural Identity: How Names Shape LLM Responses**
-2502.11995v1 by Siddhesh Pawar, Arnav Arora, Lucie-Aimée Kaffee, Isabelle Augenstein
-
-Names are deeply tied to human identity. They can serve as markers of
-individuality, cultural heritage, and personal history. However, using names as
-a core indicator of identity can lead to over-simplification of complex
-identities. When interacting with LLMs, user names are an important point of
-information for personalisation. Names can enter chatbot conversations through
-direct user input (requested by chatbots), as part of task contexts such as CV
-reviews, or as built-in memory features that store user information for
-personalisation. We study biases associated with names by measuring cultural
-presumptions in the responses generated by LLMs when presented with common
-suggestion-seeking queries, which might involve making assumptions about the
-user. Our analyses demonstrate strong assumptions about cultural identity
-associated with names present in LLM generations across multiple cultures. Our
-work has implications for designing more nuanced personalisation systems that
-avoid reinforcing stereotypes while maintaining meaningful customisation.
-
-摘要：姓名與人類身分密不可分。它們可以作為個人特質、文化遺產和個人歷史的標記。然而，將姓名作為身分的核心指標可能會導致複雜身分的過度簡化。在與 LLM 互動時，使用者名稱是個人化的重要資訊點。姓名可以透過直接使用者輸入（聊天機器人要求）、作為履歷審查等任務情境的其中一部分，或作為儲存使用者資訊以供個人化的內建記憶功能，進入聊天機器人對話。我們透過衡量 LLM 在面對常見的建議尋求查詢時所產生的回應中的文化預設，來研究與姓名相關的偏見，這可能涉及對使用者的假設。我們的分析顯示，在跨多種文化的 LLM 世代中，與姓名相關的文化身分有強烈的假設。我們的研究對於設計更細緻的個人化系統有影響，這些系統避免強化刻板印象，同時維持有意義的客製化。
-
-##### **Characterizing Photorealism and Artifacts in Diffusion Model-Generated Images**
-2502.11989v1 by Negar Kamali, Karyn Nakamura, Aakriti Kumar, Angelos Chatzimparmpas, Jessica Hullman, Matthew Groh
-
-Diffusion model-generated images can appear indistinguishable from authentic
-photographs, but these images often contain artifacts and implausibilities that
-reveal their AI-generated provenance. Given the challenge to public trust in
-media posed by photorealistic AI-generated images, we conducted a large-scale
-experiment measuring human detection accuracy on 450 diffusion-model generated
-images and 149 real images. Based on collecting 749,828 observations and 34,675
-comments from 50,444 participants, we find that scene complexity of an image,
-artifact types within an image, display time of an image, and human curation of
-AI-generated images all play significant roles in how accurately people
-distinguish real from AI-generated images. Additionally, we propose a taxonomy
-characterizing artifacts often appearing in images generated by diffusion
-models. Our empirical observations and taxonomy offer nuanced insights into the
-capabilities and limitations of diffusion models to generate photorealistic
-images in 2024.
-
-摘要：擴散模型生成的影像看起來可能與真實照片無異，但這些影像通常包含人工智慧生成來源的瑕疵和不合理之處。由於寫實的人工智慧生成影像對公眾對媒體的信任構成挑戰，我們進行了一項大規模實驗，測量人類對 450 張擴散模型生成影像和 149 張真實影像的檢測準確度。根據收集自 50,444 位參與者的 749,828 次觀察和 34,675 則評論，我們發現影像的場景複雜性、影像中的瑕疵類型、影像的顯示時間，以及人類對人工智慧生成影像的策展，在人們準確區分真實影像和人工智慧生成影像方面都扮演重要的角色。此外，我們提出了一種分類法，用於描述經常出現在擴散模型生成的影像中的瑕疵。我們的經驗觀察和分類法為擴散模型在 2024 年生成寫實影像的能力和限制提供了細緻的見解。
-
-##### **Generating Text from Uniform Meaning Representation**
-2502.11973v1 by Emma Markle, Reihaneh Iranmanesh, Shira Wein
-
-Uniform Meaning Representation (UMR) is a recently developed graph-based
-semantic representation, which expands on Abstract Meaning Representation (AMR)
-in a number of ways, in particular through the inclusion of document-level
-information and multilingual flexibility. In order to effectively adopt and
-leverage UMR for downstream tasks, efforts must be placed toward developing a
-UMR technological ecosystem. Though only limited amounts of UMR annotations
-have been produced to date, in this work we investigate the first approaches
-to producing text from multilingual UMR graphs: (1) a pipeline conversion of
-UMR to AMR, then using AMR-to-text generation models, (2) fine-tuning large
-language models with UMR data, and (3) fine-tuning existing AMR-to-text
-generation models with UMR data. Our best performing model achieves a
-multilingual BERTscore of 0.825 for English and 0.882 for Chinese when compared
-to the reference, which is a promising indication of the effectiveness of
-fine-tuning approaches for UMR-to-text generation with even limited amounts of
-UMR data.
-
-摘要：統一語意表示 (UMR) 是一種最近開發的基於圖形的語意表示，它在許多方面擴展了抽象語意表示 (AMR)，特別是透過納入文件層級資訊和多語言靈活性。為了有效採用和利用下游任務的 UMR，必須投入精力開發 UMR 技術生態系統。雖然到目前為止產生的 UMR 標註數量仍然有限，但在這項工作中，我們探討了從多語言 UMR 圖形產生文字的第一種方法：(1) 將 UMR 轉換為 AMR 的管道，然後使用 AMR 轉文字生成模型，(2) 使用 UMR 資料微調大型語言模型，以及 (3) 使用 UMR 資料微調現有的 AMR 轉文字生成模型。與參考相比，我們效能最好的模型在英文中達到 0.825 的多語言 BERT 分數，在中文中達到 0.882，這表示使用 UMR 資料進行 UMR 轉文字生成的微調方法具有良好的效果，即使 UMR 資料數量有限。
-
-##### **Learning Generalizable Prompt for CLIP with Class Similarity Knowledge**
-2502.11969v1 by Sehun Jung, Hyang-won Lee
-
-In vision-language models (VLMs), prompt tuning has shown its effectiveness
-in adapting models to downstream tasks. However, learned prompts struggle to
-generalize to unseen classes, as they tend to overfit to the classes that are
-targeted during prompt tuning. Examining failure cases, we observed that
-learned prompts disrupt the semantics of unseen classes, generating text
-embeddings with incorrect semantic relationships among classes. To address
-this, we propose Similarity Alignment Regularization (SAR), which regularizes
-learnable prompts to preserve the semantic relationships among classes captured
-by hand-crafted prompts. Specifically, we first obtain novel classes related to
-base classes using ChatGPT-4o and utilize them as potential unseen classes
-during prompt tuning. Then, by targeting both base and novel classes, SAR
-aligns the similarity relationships among text embeddings generated by
-learnable prompts with the similarity relationships from hand-crafted prompts.
-Extensive experiments applying SAR to existing prompt tuning methods
-demonstrate its effectiveness in improving generalization to unseen classes.
-
-摘要：在視覺語言模型 (VLM) 中，提示調整已展現其在調整模型至下游任務上的效能。然而，已學習的提示難以推廣至未見類別，因為它們傾向於過度擬合提示調整期間所鎖定的類別。在檢視失敗案例時，我們觀察到已學習的提示會擾亂未見類別的語義，產生具有類別間不正確語義關係的文字嵌入。為了解決此問題，我們提出相似度對齊正則化 (SAR)，它會對可學習提示進行正則化，以保留由手工提示捕捉到的類別間語義關係。具體來說，我們首先使用 ChatGPT-4o 取得與基本類別相關的新穎類別，並在提示調整期間將它們用作潛在的未見類別。然後，透過鎖定基本類別和新穎類別，SAR 會將可學習提示產生的文字嵌入之間的相似度關係與手工提示的相似度關係對齊。將 SAR 應用於現有提示調整方法的廣泛實驗證明了其在改善對未見類別的概括上的效能。
-
-##### **A MIMO Wireless Channel Foundation Model via CIR-CSI Consistency**
-2502.11965v1 by Jun Jiang, Wenjun Yu, Yunfan Li, Yuan Gao, Shugong Xu
-
-In the field of artificial intelligence, self-supervised learning has
-demonstrated superior generalization capabilities by leveraging large-scale
-unlabeled datasets for pretraining, which is especially critical for wireless
-communication models to adapt to a variety of scenarios. This paper
-innovatively treats Channel State Information (CSI) and Channel Impulse
-Response (CIR) as naturally aligned multi-modal data and proposes the first
-MIMO wireless channel foundation model, named CSI-CLIP. By effectively
-capturing the joint representations of both CIR and CSI, CSI-CLIP exhibits
-remarkable adaptability across scenarios and robust feature extraction
-capabilities. Experimental results show that in the positioning task, CSI-CLIP
-reduces the mean error distance by 22%; in the beam management task, it increases
-accuracy by 1% compared to traditional supervised methods, as well as in the
-channel identification task. These improvements not only highlight the
-potential and value of CSI-CLIP in integrating sensing and communication but
-also demonstrate its significant advantages over existing techniques. Moreover,
-viewing CSI and CIR as multi-modal pairs and applying contrastive learning to
-wireless channel foundation models open up new research directions in the
-domain of MIMO wireless communications.
-
-摘要：在人工智能领域，自监督学习通过利用大规模无标签数据集进行预训练，展示了卓越的泛化能力，这对于无线通信模型适应各种场景尤为关键。本文创新地将信道状态信息 (CSI) 和信道脉冲响应 (CIR) 视为自然对齐的多模态数据，并提出了第一个 MIMO 无线信道基础模型，名为 CSI-CLIP。通过有效捕获 CIR 和 CSI 的联合表示，CSI-CLIP 在各种场景中表现出卓越的适应性和强大的特征提取能力。实验结果表明，在定位任务中，CSI-CLIP 将平均误差距离减少了 22%；在波束管理任务中，与传统的监督方法相比，其准确度提高了 1%，以及在信道识别任务中。这些改进不仅突出了 CSI-CLIP 在集成感知和通信方面的潜力和价值，而且还展示了其相对于现有技术的显着优势。此外，将 CSI 和 CIR 视为多模态对，并对比学习无线信道基础模型，为 MIMO 无线通信领域开辟了新的研究方向。
-
-##### **Navigating the Helpfulness-Truthfulness Trade-Off with Uncertainty-Aware Instruction Fine-Tuning**
-2502.11962v1 by Tianyi Wu, Jingwei Ni, Bryan Hooi, Jiaheng Zhang, Elliott Ash, See-Kiong Ng, Mrinmaya Sachan, Markus Leippold
-
-Instruction Fine-tuning (IFT) can enhance the helpfulness of Large Language
-Models (LLMs), but it may lower their truthfulness. This trade-off arises
-because IFT steers LLMs to generate responses with long-tail knowledge that is
-not well covered during pre-training, leading to more informative but less
-truthful answers when generalizing to unseen tasks. In this paper, we
-empirically demonstrate this helpfulness-truthfulness trade-off in IFT and
-propose $\textbf{UNIT}$, a novel IFT paradigm to address it. UNIT teaches LLMs
-to recognize their uncertainty and explicitly reflect it at the end of their
-responses. Experimental results show that UNIT-tuned models maintain their
-helpfulness while distinguishing between certain and uncertain claims, thereby
-reducing hallucinations.
-
-摘要：指令微調 (IFT) 可以提升大型語言模型 (LLM) 的實用性，但可能會降低其真實性。這種取捨會出現，是因為 IFT 引導 LLM 生成具有長尾知識的回應，而這些知識在預訓練期間並未充分涵蓋，導致在推廣到未見任務時，答案更具資訊性，但真實性較低。在本文中，我們透過實證展示 IFT 中的這種實用性與真實性取捨，並提出一個新穎的 IFT 典範 $\textbf{UNIT}$ 來解決這個問題。UNIT 教導 LLM 辨識其不確定性，並明確反映在其回應的結尾。實驗結果顯示，經過 UNIT 微調的模型維持其實用性，同時區分確定和不確定的說法，從而減少幻覺。
-
-##### **STRIVE: Structured Reasoning for Self-Improvement in Claim Verification**
-2502.11959v1 by Haisong Gong, Jing Li, Junfei Wu, Qiang Liu, Shu Wu, Liang Wang
-
-Claim verification is the task of determining whether a claim is supported or
-refuted by evidence. Self-improvement methods, where reasoning chains are
-generated and those leading to correct results are selected for training, have
-succeeded in tasks like mathematical problem solving. However, in claim
-verification, this approach struggles. Low-quality reasoning chains may falsely
-match binary truth labels, introducing faulty reasoning into the
-self-improvement process and ultimately degrading performance. To address this,
-we propose STRIVE: Structured Reasoning for Self-Improved Verification. Our
-method introduces a structured reasoning design with Claim Decomposition,
-Entity Analysis, and Evidence Grounding Verification. These components improve
-reasoning quality, reduce errors, and provide additional supervision signals
-for self-improvement. STRIVE begins with a warm-up phase, where the base model
-is fine-tuned on a small number of annotated examples to learn the structured
-reasoning design. It is then applied to generate reasoning chains for all
-training examples, selecting only those that are correct and structurally sound
-for subsequent self-improvement training. We demonstrate that STRIVE achieves
-significant improvements over baseline models, with a 31.4% performance gain
-over the base model and 20.7% over Chain of Thought on the HOVER datasets,
-highlighting its effectiveness.
-
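-Summary: Claim verification is the task of determining whether a claim is supported or refuted by evidence. Self-improvement methods, which generate reasoning chains and select those leading to correct results for training, have succeeded in tasks such as mathematical problem solving. In claim verification, however, this approach struggles: low-quality reasoning chains may falsely match the binary truth labels, introducing faulty reasoning into the self-improvement process and ultimately degrading performance. To address this, we propose STRIVE: Structured Reasoning for Self-Improved Verification. Our method introduces a structured reasoning design comprising Claim Decomposition, Entity Analysis, and Evidence Grounding Verification. These components improve reasoning quality, reduce errors, and provide additional supervision signals for self-improvement. STRIVE begins with a warm-up phase in which the base model is fine-tuned on a small number of annotated examples to learn the structured reasoning design. The model is then used to generate reasoning chains for all training examples, and only chains that are both correct and structurally sound are kept for subsequent self-improvement training. We show that STRIVE achieves significant improvements on the HOVER datasets, with a 31.4% performance gain over the base model and 20.7% over Chain of Thought, highlighting its effectiveness.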
-
-##### **Can Your Uncertainty Scores Detect Hallucinated Entity?**
-2502.11948v1 by Min-Hsuan Yeh, Max Kamachee, Seongheon Park, Yixuan Li
-
-To mitigate the impact of the hallucinatory nature of LLMs, many studies propose
-detecting hallucinated generation through uncertainty estimation. However,
-these approaches predominantly operate at the sentence or paragraph level,
-failing to pinpoint specific spans or entities responsible for hallucinated
-content. This lack of granularity is especially problematic for long-form
-outputs that mix accurate and fabricated information. To address this
-limitation, we explore entity-level hallucination detection. We propose a new
-dataset, HalluEntity, which annotates hallucination at the entity level. Based
-on the dataset, we comprehensively evaluate uncertainty-based hallucination
-detection approaches across 17 modern LLMs. Our experimental results show that
-uncertainty estimation approaches focusing on individual token probabilities
-tend to over-predict hallucinations, while context-aware methods show better
-but still suboptimal performance. Through an in-depth qualitative study, we
-identify relationships between hallucination tendencies and linguistic
-properties and highlight important directions for future research.
-
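-Summary: To mitigate the impact of the hallucinatory nature of LLMs, many studies propose detecting hallucinated generations through uncertainty estimation. However, these approaches operate mainly at the sentence or paragraph level and cannot pinpoint the specific spans or entities responsible for hallucinated content. This lack of granularity is especially problematic for long-form outputs that mix accurate and fabricated information. To address this limitation, we explore entity-level hallucination detection. We propose a new dataset, HalluEntity, which annotates hallucination at the entity level. Using this dataset, we comprehensively evaluate uncertainty-based hallucination detection approaches across 17 modern LLMs. Our experiments show that uncertainty estimation approaches focusing on individual token probabilities tend to over-predict hallucinations, while context-aware methods perform better but remain suboptimal. Through an in-depth qualitative study, we identify relationships between hallucination tendencies and linguistic properties and highlight important directions for future research.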
-
-##### **Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction**
-2502.11946v1 by Ailin Huang, Boyong Wu, Bruce Wang, Chao Yan, Chen Hu, Chengli Feng, Fei Tian, Feiyu Shen, Jingbei Li, Mingrui Chen, Peng Liu, Ruihang Miao, Wang You, Xi Chen, Xuerui Yang, Yechang Huang, Yuxiang Zhang, Zheng Gong, Zixin Zhang, Brian Li, Changyi Wan, Hanpeng Hu, Ranchen Ming, Song Yuan, Xuelin Zhang, Yu Zhou, Bingxin Li, Buyun Ma, Kang An, Wei Ji, Wen Li, Xuan Wen, Yuankai Ma, Yuanwei Liang, Yun Mou, Bahtiyar Ahmidi, Bin Wang, Bo Li, Changxin Miao, Chen Xu, Chengting Feng, Chenrun Wang, Dapeng Shi, Deshan Sun, Dingyuan Hu, Dula Sai, Enle Liu, Guanzhe Huang, Gulin Yan, Heng Wang, Haonan Jia, Haoyang Zhang, Jiahao Gong, Jianchang Wu, Jiahong Liu, Jianjian Sun, Jiangjie Zhen, Jie Feng, Jie Wu, Jiaoren Wu, Jie Yang, Jinguo Wang, Jingyang Zhang, Junzhe Lin, Kaixiang Li, Lei Xia, Li Zhou, Longlong Gu, Mei Chen, Menglin Wu, Ming Li, Mingxiao Li, Mingyao Liang, Na Wang, Nie Hao, Qiling Wu, Qinyuan Tan, Shaoliang Pang, Shiliang Yang, Shuli Gao, Siqi Liu, Sitong Liu, Tiancheng Cao, Tianyu Wang, Wenjin Deng, Wenqing He, Wen Sun, Xin Han, Xiaomin Deng, Xiaojia Liu, Xu Zhao, Yanan Wei, Yanbo Yu, Yang Cao, Yangguang Li, Yangzhen Ma, Yanming Xu, Yaqiang Shi, Yilei Wang, Yinmin Zhong, Yu Luo, Yuanwei Lu, Yuhe Yin, Yuting Yan, Yuxiang Yang, Zhe Xie, Zheng Ge, Zheng Sun, Zhewei Huang, Zhichao Chang, Zidong Yang, Zili Zhang, Binxing Jiao, Daxin Jiang, Heung-Yeung Shum, Jiansheng Chen, Jing Li, Shuchang Zhou, Xiangyu Zhang, Xinhao Zhang, Yibo Zhu
-
-Real-time speech interaction, serving as a fundamental interface for
-human-machine collaboration, holds immense potential. However, current
-open-source models face limitations such as high costs in voice data
-collection, weakness in dynamic control, and limited intelligence. To address
-these challenges, this paper introduces Step-Audio, the first production-ready
-open-source solution. Key contributions include: 1) a 130B-parameter unified
-speech-text multi-modal model that achieves unified understanding and
-generation, with the Step-Audio-Chat version open-sourced; 2) a generative
-speech data engine that establishes an affordable voice cloning framework and
-produces the open-sourced lightweight Step-Audio-TTS-3B model through
-distillation; 3) an instruction-driven fine control system enabling dynamic
-adjustments across dialects, emotions, singing, and RAP; 4) an enhanced
-cognitive architecture augmented with tool calling and role-playing abilities
-to manage complex tasks effectively. Based on our new StepEval-Audio-360
-evaluation benchmark, Step-Audio achieves state-of-the-art performance in human
-evaluations, especially in terms of instruction following. On open-source
-benchmarks like LLaMA Question, it shows a 9.3% average performance improvement,
-demonstrating our commitment to advancing the development of open-source
-multi-modal language technologies. Our code and models are available at
-https://github.com/stepfun-ai/Step-Audio.
-
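-Summary: Real-time speech interaction, as a fundamental interface for human-machine collaboration, holds immense potential. However, current open-source models face limitations such as the high cost of voice data collection, weak dynamic control, and limited intelligence. To address these challenges, this paper introduces Step-Audio, the first production-ready open-source solution. Key contributions include: 1) a 130B-parameter unified speech-text multimodal model that achieves unified understanding and generation, with the Step-Audio-Chat version open-sourced; 2) a generative speech data engine that establishes an affordable voice cloning framework and produces the open-sourced lightweight Step-Audio-TTS-3B model through distillation; 3) an instruction-driven fine control system enabling dynamic adjustments across dialects, emotions, singing, and rap; 4) an enhanced cognitive architecture augmented with tool calling and role-playing abilities to manage complex tasks effectively. On our new StepEval-Audio-360 evaluation benchmark, Step-Audio achieves state-of-the-art performance in human evaluations, especially in instruction following. On open-source benchmarks such as LLaMA Question, it shows a 9.3% average performance improvement, demonstrating our commitment to advancing open-source multimodal language technologies. Our code and models are available at https://github.com/stepfun-ai/Step-Audio.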
-
-##### **Deep Spatio-Temporal Neural Network for Air Quality Reanalysis**
-2502.11941v1 by Ammar Kheder, Benjamin Foreback, Lili Wang, Zhi-Song Liu, Michael Boy
-
-Air quality prediction is key to mitigating health impacts and guiding
-decisions, yet existing models tend to focus on temporal trends while
-overlooking spatial generalization. We propose AQ-Net, a spatiotemporal
-reanalysis model for both observed and unobserved stations in the near future.
-AQ-Net utilizes the LSTM and multi-head attention for the temporal regression.
-We also propose a cyclic encoding technique to ensure continuous time
-representation. To learn fine-grained spatial air quality estimation, we
-incorporate AQ-Net with the neural kNN to explore feature-based interpolation,
-such that we can fill the spatial gaps given coarse observation stations. To
-demonstrate the efficiency of our model for spatiotemporal reanalysis, we use
-data from 2013-2017 collected in northern China for PM2.5 analysis. Extensive
-experiments show that AQ-Net excels in air quality reanalysis, highlighting the
-potential of hybrid spatio-temporal models to better capture environmental
-dynamics, especially in urban areas where both spatial and temporal variability
-are critical.
-
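-Summary: Air quality prediction is key to mitigating health impacts and guiding decisions, yet existing models tend to focus on temporal trends while overlooking spatial generalization. We propose AQ-Net, a spatiotemporal reanalysis model for both observed and unobserved stations in the near future. AQ-Net uses an LSTM and multi-head attention for temporal regression. We also propose a cyclic encoding technique to ensure a continuous representation of time. To learn fine-grained spatial air quality estimation, we combine AQ-Net with a neural kNN to explore feature-based interpolation, so that spatial gaps between coarse observation stations can be filled. To demonstrate the efficiency of our model for spatiotemporal reanalysis, we use data collected in northern China from 2013 to 2017 for PM2.5 analysis. Extensive experiments show that AQ-Net excels in air quality reanalysis, highlighting the potential of hybrid spatio-temporal models to better capture environmental dynamics, especially in urban areas where both spatial and temporal variability are critical.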
-
-##### **FitLight: Federated Imitation Learning for Plug-and-Play Autonomous Traffic Signal Control**
-2502.11937v1 by Yutong Ye, Yingbo Zhou, Zhusen Liu, Xiao Du, Hao Zhou, Xiang Lian, Mingsong Chen
-
-Although Reinforcement Learning (RL)-based Traffic Signal Control (TSC)
-methods have been extensively studied, their practical applications still raise
-some serious issues such as high learning cost and poor generalizability. This
-is because the "trial-and-error" training style makes RL agents extremely
-dependent on the specific traffic environment, which also requires a long
-convergence time. To address these issues, we propose a novel Federated
-Imitation Learning (FIL)-based framework for multi-intersection TSC, named
-FitLight, which allows RL agents to plug-and-play for any traffic environment
-without additional pre-training cost. Unlike existing imitation learning
-approaches that rely on pre-training RL agents with demonstrations, FitLight
-allows real-time imitation learning and seamless transition to reinforcement
-learning. Due to our proposed knowledge-sharing mechanism and novel hybrid
-pressure-based agent design, RL agents can quickly find the best control policy
-with only a few episodes. Moreover, for resource-constrained TSC scenarios,
-FitLight supports model pruning and heterogeneous model aggregation, such that
-RL agents can work on a micro-controller with merely 16 KB RAM and 32
-KB ROM. Extensive experiments demonstrate that, compared to state-of-the-art
-methods, FitLight not only provides a superior starting point but also
-converges to a better final solution on both real-world and synthetic datasets,
-even under extreme resource limitations.
-
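-Summary: Although Reinforcement Learning (RL)-based Traffic Signal Control (TSC) methods have been studied extensively, their practical application still raises serious issues such as high learning cost and poor generalizability. This is because the "trial-and-error" training style makes RL agents extremely dependent on the specific traffic environment and also requires a long convergence time. To address these issues, we propose FitLight, a Federated Imitation Learning (FIL)-based framework for multi-intersection TSC that lets RL agents plug and play in any traffic environment without additional pre-training cost. Unlike existing imitation learning approaches that rely on pre-training RL agents with demonstrations, FitLight allows real-time imitation learning and a seamless transition to reinforcement learning. Thanks to our proposed knowledge-sharing mechanism and novel hybrid pressure-based agent design, RL agents can quickly find the best control policy within only a few episodes. Moreover, for resource-constrained TSC scenarios, FitLight supports model pruning and heterogeneous model aggregation, so that RL agents can run on a micro-controller with merely 16 KB RAM and 32 KB ROM. Extensive experiments demonstrate that, compared to state-of-the-art methods, FitLight not only provides a superior starting point but also converges to a better final solution on both real-world and synthetic datasets, even under extreme resource limitations.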
-
-##### **On Representational Dissociation of Language and Arithmetic in Large Language Models**
-2502.11932v1 by Riku Kisako, Tatsuki Kuribayashi, Ryohei Sasano
-
-The association between language and (non-linguistic) thinking ability in
-humans has long been debated, and recently, neuroscientific evidence of brain
-activity patterns has been considered. Such a scientific context naturally
-raises an interdisciplinary question -- what about such a language-thought
-dissociation in large language models (LLMs)? In this paper, as an initial
-foray, we explore this question by focusing on simple arithmetic skills (e.g.,
-$1+2=$ ?) as a thinking ability and analyzing the geometry of their encoding in
-LLMs' representation space. Our experiments with linear classifiers and cluster
-separability tests demonstrate that simple arithmetic equations and general
-language input are encoded in completely separated regions in LLMs' internal
-representation space across all the layers, which is also supported with more
-controlled stimuli (e.g., spelled-out equations). These results tentatively suggest
-that arithmetic reasoning is mapped into a distinct region from general
-language input, which is in line with the neuroscientific observations of human
-brain activations, while we also point out their somewhat cognitively
-implausible geometric properties.
-
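-Summary: The association between language and (non-linguistic) thinking ability in humans has long been debated, and recently neuroscientific evidence from brain activity patterns has been brought into consideration. This scientific context naturally raises an interdisciplinary question: what about such a language-thought dissociation in large language models (LLMs)? In this paper, as an initial foray, we explore this question by focusing on simple arithmetic skills (e.g., $1+2=$ ?) as a thinking ability and analyzing the geometry of their encoding in the LLMs' representation space. Our experiments with linear classifiers and cluster separability tests demonstrate that simple arithmetic equations and general language input are encoded in completely separated regions of the LLMs' internal representation space across all layers, a finding also supported by more controlled stimuli (e.g., spelled-out equations). These results tentatively suggest that arithmetic reasoning is mapped into a region distinct from general language input, which is in line with neuroscientific observations of human brain activations, while we also point out their somewhat cognitively implausible geometric properties.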
-
-##### **BRIGHTER: BRIdging the Gap in Human-Annotated Textual Emotion Recognition Datasets for 28 Languages**
-2502.11926v1 by Shamsuddeen Hassan Muhammad, Nedjma Ousidhoum, Idris Abdulmumin, Jan Philip Wahle, Terry Ruas, Meriem Beloucif, Christine de Kock, Nirmal Surange, Daniela Teodorescu, Ibrahim Said Ahmad, David Ifeoluwa Adelani, Alham Fikri Aji, Felermino D. M. A. Ali, Ilseyar Alimova, Vladimir Araujo, Nikolay Babakov, Naomi Baes, Ana-Maria Bucur, Andiswa Bukula, Guanqun Cao, Rodrigo Tufino Cardenas, Rendi Chevi, Chiamaka Ijeoma Chukwuneke, Alexandra Ciobotaru, Daryna Dementieva, Murja Sani Gadanya, Robert Geislinger, Bela Gipp, Oumaima Hourrane, Oana Ignat, Falalu Ibrahim Lawan, Rooweither Mabuya, Rahmad Mahendra, Vukosi Marivate, Andrew Piper, Alexander Panchenko, Charles Henrique Porto Ferreira, Vitaly Protasov, Samuel Rutunda, Manish Shrivastava, Aura Cristina Udrea, Lilian Diana Awuor Wanzare, Sophie Wu, Florian Valentin Wunderlich, Hanif Muhammad Zhafran, Tianhui Zhang, Yi Zhou, Saif M. Mohammad
-
-People worldwide use language in subtle and complex ways to express emotions.
-While emotion recognition -- an umbrella term for several NLP tasks --
-significantly impacts different applications in NLP and other fields, most work
-in the area is focused on high-resource languages. Therefore, this has led to
-major disparities in research and proposed solutions, especially for
-low-resource languages that suffer from the lack of high-quality datasets. In
-this paper, we present BRIGHTER -- a collection of multilabeled
-emotion-annotated datasets in 28 different languages. BRIGHTER covers
-predominantly low-resource languages from Africa, Asia, Eastern Europe, and
-Latin America, with instances from various domains annotated by fluent
-speakers. We describe the data collection and annotation processes and the
-challenges of building these datasets. Then, we report different experimental
-results for monolingual and crosslingual multi-label emotion identification, as
-well as intensity-level emotion recognition. We investigate results with and
-without using LLMs and analyse the large variability in performance across
-languages and text domains. We show that BRIGHTER datasets are a step towards
-bridging the gap in text-based emotion recognition and discuss their impact and
-utility.
-
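-Summary: People worldwide use language in subtle and complex ways to express emotions.
-While emotion recognition, an umbrella term for several NLP tasks,
-significantly impacts different applications in NLP and other fields, most work in the area
-focuses on high-resource languages. This has led to major disparities in research and proposed solutions, especially
-for low-resource languages that lack high-quality datasets. In this paper, we present BRIGHTER, a collection of
-multilabeled emotion-annotated datasets in 28 different languages. BRIGHTER predominantly covers low-resource languages from Africa, Asia, Eastern Europe, and
-Latin America, with instances from various domains annotated by fluent speakers. We describe the data collection and annotation processes and
-the challenges of building these datasets. We then report different experimental results for monolingual and crosslingual multi-label emotion identification, as well as
-intensity-level emotion recognition. We investigate results with and without LLMs and analyse the large variability in performance across languages and text domains. We show that the BRIGHTER datasets are a step towards bridging the gap in text-based emotion recognition and discuss their impact and
-utility.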
-
-##### **GRAPHGPT-O: Synergistic Multimodal Comprehension and Generation on Graphs**
-2502.11925v1 by Yi Fang, Bowen Jin, Jiacheng Shen, Sirui Ding, Qiaoyu Tan, Jiawei Han
-
-The rapid development of Multimodal Large Language Models (MLLMs) has enabled
-the integration of multiple modalities, including texts and images, within the
-large language model (LLM) framework. However, texts and images are usually
-interconnected, forming a multimodal attributed graph (MMAG). It is
-underexplored how MLLMs can incorporate the relational information
-(i.e., graph structure) and semantic information (i.e., texts
-and images) on such graphs for multimodal comprehension and generation. In this
-paper, we propose GraphGPT-o, which supports omni-multimodal understanding and
-creation on MMAGs. We first comprehensively study linearization variants to
-transform semantic and structural information as input for MLLMs. Then, we
-propose a hierarchical aligner that enables deep graph encoding, bridging the
-gap between MMAGs and MLLMs. Finally, we explore the inference choices,
-adapting MLLM to interleaved text and image generation in graph scenarios.
-Extensive experiments on three datasets from different domains demonstrate the
-effectiveness of our proposed method. Datasets and codes will be open-sourced
-upon acceptance.
-
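-Summary: The rapid development of Multimodal Large Language Models (MLLMs) has enabled the integration of multiple modalities, including texts and images, within the large language model (LLM) framework. However, texts and images are usually interconnected, forming a multimodal attributed graph (MMAG). How MLLMs can incorporate the relational information (i.e., graph structure) and semantic information (i.e., texts and images) on such graphs for multimodal comprehension and generation remains underexplored. In this paper, we propose GraphGPT-o, which supports omni-multimodal understanding and creation on MMAGs. We first comprehensively study linearization variants that transform semantic and structural information into input for MLLMs. We then propose a hierarchical aligner that enables deep graph encoding, bridging the gap between MMAGs and MLLMs. Finally, we explore the inference choices, adapting the MLLM to interleaved text and image generation in graph scenarios. Extensive experiments on three datasets from different domains demonstrate the effectiveness of the proposed method. Datasets and code will be open-sourced upon acceptance.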
-
-##### **From Text to Trust: Empowering AI-assisted Decision Making with Adaptive LLM-powered Analysis**
-2502.11919v1 by Zhuoyan Li, Hangxiao Zhu, Zhuoran Lu, Ziang Xiao, Ming Yin
-
-AI-assisted decision making is becoming increasingly prevalent, yet individuals
-often fail to use AI-based decision aids appropriately, especially when
-AI explanations are absent, potentially because they do not reflect on
-AI's decision recommendations critically. Large language models (LLMs), with
-their exceptional conversational and analytical capabilities, present great
-opportunities to enhance AI-assisted decision making in the absence of AI
-explanations by providing natural-language-based analysis of AI's decision
-recommendation, e.g., how each feature of a decision making task might
-contribute to the AI recommendation. In this paper, via a randomized
-experiment, we first show that presenting LLM-powered analysis of each task
-feature, either sequentially or concurrently, does not significantly improve
-people's AI-assisted decision performance. To enable decision makers to better
-leverage LLM-powered analysis, we then propose an algorithmic framework to
-characterize the effects of LLM-powered analysis on human decisions and
-dynamically decide which analysis to present. Our evaluation with human
-subjects shows that this approach effectively improves decision makers'
-appropriate reliance on AI in AI-assisted decision making.
-
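-Summary: AI-assisted decision making is becoming increasingly prevalent, yet individuals often fail to use AI-based decision aids appropriately, especially when AI explanations are absent, potentially because they do not reflect critically on the AI's decision recommendations. Large language models (LLMs), with their exceptional conversational and analytical capabilities, offer a great opportunity to enhance AI-assisted decision making in the absence of AI explanations by providing natural-language analysis of the AI's decision recommendation, e.g., how each feature of a decision-making task might contribute to the recommendation. In this paper, via a randomized experiment, we first show that presenting LLM-powered analysis of each task feature, either sequentially or concurrently, does not significantly improve people's AI-assisted decision performance. To enable decision makers to better leverage LLM-powered analysis, we then propose an algorithmic framework that characterizes the effects of LLM-powered analysis on human decisions and dynamically decides which analysis to present. Our evaluation with human subjects shows that this approach effectively improves decision makers' appropriate reliance on AI in AI-assisted decision making.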
-
-##### **EssayJudge: A Multi-Granular Benchmark for Assessing Automated Essay Scoring Capabilities of Multimodal Large Language Models**
-2502.11916v1 by Jiamin Su, Yibo Yan, Fangteng Fu, Han Zhang, Jingheng Ye, Xiang Liu, Jiahao Huo, Huiyu Zhou, Xuming Hu
-
-Automated Essay Scoring (AES) plays a crucial role in educational assessment
-by providing scalable and consistent evaluations of writing tasks. However,
-traditional AES systems face three major challenges: (1) reliance on
-handcrafted features that limit generalizability, (2) difficulty in capturing
-fine-grained traits like coherence and argumentation, and (3) inability to
-handle multimodal contexts. In the era of Multimodal Large Language Models
-(MLLMs), we propose EssayJudge, the first multimodal benchmark to evaluate AES
-capabilities across lexical-, sentence-, and discourse-level traits. By
-leveraging MLLMs' strengths in trait-specific scoring and multimodal context
-understanding, EssayJudge aims to offer precise, context-rich evaluations
-without manual feature engineering, addressing longstanding AES limitations.
-Our experiments with 18 representative MLLMs reveal gaps in AES performance
-compared to human evaluation, particularly in discourse-level traits,
-highlighting the need for further advancements in MLLM-based AES research. Our
-dataset and code will be available upon acceptance.
-
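-Summary: Automated Essay Scoring (AES) plays a crucial role in educational assessment by providing scalable and consistent evaluations of writing tasks. However, traditional AES systems face three major challenges: (1) reliance on handcrafted features that limit generalizability, (2) difficulty in capturing fine-grained traits such as coherence and argumentation, and (3) inability to handle multimodal contexts. In the era of Multimodal Large Language Models (MLLMs), we propose EssayJudge, the first multimodal benchmark to evaluate AES capabilities across lexical-, sentence-, and discourse-level traits. By leveraging MLLMs' strengths in trait-specific scoring and multimodal context understanding, EssayJudge aims to offer precise, context-rich evaluations without manual feature engineering, addressing longstanding AES limitations. Our experiments with 18 representative MLLMs reveal gaps between AES performance and human evaluation, particularly in discourse-level traits, highlighting the need for further advances in MLLM-based AES research. Our dataset and code will be available upon acceptance.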
-
-##### **On the robustness of ChatGPT in teaching Korean Mathematics**
-2502.11915v1 by Phuong-Nam Nguyen, Quang Nguyen-The, An Vu-Minh, Diep-Anh Nguyen, Xuan-Lam Pham
-
-ChatGPT, an Artificial Intelligence model, has the potential to revolutionize
-education. However, its effectiveness in solving non-English questions remains
-uncertain. This study evaluates ChatGPT's robustness using 586 Korean
-mathematics questions. ChatGPT achieves 66.72% accuracy, correctly answering
-391 out of 586 questions. We also assess its ability to rate mathematics
-questions based on eleven criteria and perform a topic analysis. Our findings
-show that ChatGPT's ratings align with educational theory and test-taker
-perspectives. While ChatGPT performs well in question classification, it
-struggles with non-English contexts, highlighting areas for improvement. Future
-research should address linguistic biases and enhance accuracy across diverse
-languages. Domain-specific optimizations and multilingual training could
-improve ChatGPT's role in personalized education.
-
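-Summary: ChatGPT, an Artificial Intelligence model, has the potential to revolutionize education. However, its effectiveness in solving non-English questions remains uncertain. This study evaluates ChatGPT's robustness using 586 Korean mathematics questions. ChatGPT achieves 66.72% accuracy, correctly answering 391 out of 586 questions. We also assess its ability to rate mathematics questions based on eleven criteria and perform a topic analysis. Our findings show that ChatGPT's ratings align with educational theory and test-taker perspectives. While ChatGPT performs well in question classification, it struggles with non-English contexts, highlighting areas for improvement. Future research should address linguistic biases and improve accuracy across diverse languages. Domain-specific optimizations and multilingual training could improve ChatGPT's role in personalized education.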
-
-##### **MMRC: A Large-Scale Benchmark for Understanding Multimodal Large Language Model in Real-World Conversation**
-2502.11903v1 by Haochen Xue, Feilong Tang, Ming Hu, Yexin Liu, Qidong Huang, Yulong Li, Chengzhi Liu, Zhongxing Xu, Chong Zhang, Chun-Mei Feng, Yutong Xie, Imran Razzak, Zongyuan Ge, Jionglong Su, Junjun He, Yu Qiao
-
-Recent multimodal large language models (MLLMs) have demonstrated significant
-potential in open-ended conversation, generating more accurate and personalized
-responses. However, their abilities to memorize, recall, and reason in
-sustained interactions within real-world scenarios remain underexplored. This
-paper introduces MMRC, a Multi-Modal Real-world Conversation benchmark for
-evaluating six core open-ended abilities of MLLMs: information extraction,
-multi-turn reasoning, information update, image management, memory recall, and
-answer refusal. With data collected from real-world scenarios, MMRC comprises
-5,120 conversations and 28,720 corresponding manually labeled questions, posing
-a significant challenge to existing MLLMs. Evaluations on 20 MLLMs in MMRC
-indicate an accuracy drop during open-ended interactions. We identify four
-common failure patterns: long-term memory degradation, inadequacies in updating
-factual knowledge, accumulated assumption of error propagation, and reluctance
-to say no. To mitigate these issues, we propose a simple yet effective
-NOTE-TAKING strategy, which can record key information from the conversation
-and remind the model during its responses, enhancing conversational
-capabilities. Experiments across six MLLMs demonstrate significant performance
-improvements.
-
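-Summary: Recent multimodal large language models (MLLMs) have demonstrated significant potential in open-ended conversation, generating more accurate and personalized responses. However, their abilities to memorize, recall, and reason in sustained interactions within real-world scenarios remain underexplored. This paper introduces MMRC, a Multi-Modal Real-world Conversation benchmark for evaluating six core open-ended abilities of MLLMs: information extraction, multi-turn reasoning, information update, image management, memory recall, and answer refusal. With data collected from real-world scenarios, MMRC comprises 5,120 conversations and 28,720 corresponding manually labeled questions, posing a significant challenge to existing MLLMs. Evaluations of 20 MLLMs on MMRC indicate an accuracy drop during open-ended interactions. We identify four common failure patterns: long-term memory degradation, inadequacies in updating factual knowledge, accumulated assumption of error propagation, and reluctance to say no. To mitigate these issues, we propose a simple yet effective NOTE-TAKING strategy, which records key information from the conversation and reminds the model of it during its responses, enhancing conversational capabilities. Experiments across six MLLMs demonstrate significant performance improvements.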
-
-##### **Building A Proof-Oriented Programmer That Is 64% Better Than GPT-4o Under Data Scarsity**
-2502.11901v1 by Dylan Zhang, Justin Wang, Tianran Sun
-
-Existing LMs struggle with proof-oriented programming due to data scarcity,
-which manifests in two key ways: (1) a lack of sufficient corpora for
-proof-oriented programming languages such as F*, and (2) the absence of
-large-scale, project-level proof-oriented implementations that can teach the
-model the intricate reasoning process when performing proof-oriented
-programming. We present the first work on synthetic data augmentation for
-project-level proof-oriented programming for both generation and repair. Our method
-addresses data scarcity by synthesizing basic proof-oriented programming
-problems for proficiency in that language; incorporating diverse coding data
-for reasoning capability elicitation and creating new proofs and repair data
-within existing repositories. This approach enables language models to both
-synthesize and repair proofs for function- and repository-level code. We show
-that our fine-tuned 14B parameter model, PoPilot, can exceed the performance of
-models that outperform GPT-4o in project-level proof-oriented programming
-by a 64% relative margin, and can improve GPT-4o's performance by 54% by
-repairing its outputs, compared with GPT-4o's self-repair.
-
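-Summary: Existing LMs struggle with proof-oriented programming due to data scarcity,
-which manifests in two key ways: (1) a lack of sufficient corpora for proof-oriented programming languages such as F*, and (2) the absence of large-scale, project-level proof-oriented implementations that can teach the model the intricate reasoning process involved in proof-oriented programming. We present the first work on synthetic data augmentation for project-level proof-oriented programming, covering both generation and repair. Our method addresses data scarcity by synthesizing basic proof-oriented programming problems to build proficiency in the language, incorporating diverse coding data to elicit reasoning capability, and creating new proofs and repair data within existing repositories. This approach enables language models to both synthesize and repair proofs for function- and repository-level code. We show that our fine-tuned 14B parameter model, PoPilot, can exceed the performance of models that outperform GPT-4o in project-level proof-oriented programming by a 64% relative margin, and can improve GPT-4o's performance by 54% by repairing its outputs, compared with GPT-4o's self-repair.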
-
-##### **DLFR-VAE: Dynamic Latent Frame Rate VAE for Video Generation**
-2502.11897v1 by Zhihang Yuan, Siyuan Wang, Rui Xie, Hanling Zhang, Tongcheng Fang, Yuzhang Shang, Shengen Yan, Guohao Dai, Yu Wang
-
-In this paper, we propose the Dynamic Latent Frame Rate VAE (DLFR-VAE), a
-training-free paradigm that can make use of adaptive temporal compression in
-latent space. While existing video generative models apply fixed compression
-rates via pretrained VAE, we observe that real-world video content exhibits
-substantial temporal non-uniformity, with high-motion segments containing more
-information than static scenes. Based on this insight, DLFR-VAE dynamically
-adjusts the latent frame rate according to the content complexity.
-Specifically, DLFR-VAE comprises two core innovations: (1) A Dynamic Latent
-Frame Rate Scheduler that partitions videos into temporal chunks and adaptively
-determines optimal frame rates based on information-theoretic content
-complexity, and (2) A training-free adaptation mechanism that transforms
-pretrained VAE architectures into a dynamic VAE that can process features with
-variable frame rates. Our simple but effective DLFR-VAE can function as a
-plug-and-play module, seamlessly integrating with existing video generation
-models and accelerating the video generation process.
-
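-Summary: In this paper, we propose the Dynamic Latent Frame Rate VAE (DLFR-VAE), a training-free paradigm that makes use of adaptive temporal compression in latent space. While existing video generative models apply fixed compression rates via a pretrained VAE, we observe that real-world video content exhibits substantial temporal non-uniformity, with high-motion segments containing more information than static scenes. Based on this insight, DLFR-VAE dynamically adjusts the latent frame rate according to content complexity. Specifically, DLFR-VAE comprises two core innovations: (1) a Dynamic Latent Frame Rate Scheduler that partitions videos into temporal chunks and adaptively determines optimal frame rates based on information-theoretic content complexity, and (2) a training-free adaptation mechanism that transforms pretrained VAE architectures into a dynamic VAE able to process features with variable frame rates. Our simple but effective DLFR-VAE can function as a plug-and-play module, integrating seamlessly with existing video generation models and accelerating the video generation process.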
-
-##### **CAMEL: Continuous Action Masking Enabled by Large Language Models for Reinforcement Learning**
-2502.11896v1 by Yanxiao Zhao, Yangge Qian, Jingyang Shan, Xiaolin Qin
-
-Reinforcement learning (RL) in continuous action spaces encounters persistent
-challenges, such as inefficient exploration and convergence to suboptimal
-solutions. To address these limitations, we propose CAMEL, a novel framework
-integrating LLM-generated suboptimal policies into the RL training pipeline.
-CAMEL leverages dynamic action masking and an adaptive epsilon-masking
-mechanism to guide exploration during early training stages while gradually
-enabling agents to optimize policies independently. At the core of CAMEL lies
-the integration of Python-executable suboptimal policies generated by LLMs
-based on environment descriptions and task objectives. Although simplistic and
-hard-coded, these policies offer valuable initial guidance for RL agents. To
-effectively utilize these priors, CAMEL employs masking-aware optimization to
-dynamically constrain the action space based on LLM outputs. Additionally,
-epsilon-masking gradually reduces reliance on LLM-generated guidance, enabling
-agents to transition from constrained exploration to autonomous policy
-refinement. Experimental validation on Gymnasium MuJoCo environments
-demonstrates the effectiveness of CAMEL. In Hopper-v4 and Ant-v4, LLM-generated
-policies significantly improve sample efficiency, achieving performance
-comparable to or surpassing expert masking baselines. For Walker2d-v4, where
-LLMs struggle to accurately model bipedal gait dynamics, CAMEL maintains robust
-RL performance without notable degradation, highlighting the framework's
-adaptability across diverse tasks. While CAMEL shows promise in enhancing
-sample efficiency and mitigating convergence challenges, these issues remain
-open for further research. Future work aims to generalize CAMEL to multimodal
-LLMs for broader observation-action spaces and automate policy evaluation,
-reducing human intervention and enhancing scalability in RL training pipelines.
-
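-Summary: Reinforcement learning (RL) in continuous action spaces encounters persistent challenges, such as inefficient exploration and convergence to suboptimal solutions. To address these limitations, we propose CAMEL, a novel framework that integrates LLM-generated suboptimal policies into the RL training pipeline. CAMEL leverages dynamic action masking and an adaptive epsilon-masking mechanism to guide exploration during early training stages while gradually enabling agents to optimize policies independently. At the core of CAMEL lies the integration of Python-executable suboptimal policies generated by LLMs from environment descriptions and task objectives. Although simplistic and hard-coded, these policies offer valuable initial guidance for RL agents. To use these priors effectively, CAMEL employs masking-aware optimization to dynamically constrain the action space based on LLM outputs. Additionally, epsilon-masking gradually reduces reliance on LLM-generated guidance, enabling agents to transition from constrained exploration to autonomous policy refinement. Experimental validation on Gymnasium MuJoCo environments demonstrates the effectiveness of CAMEL. In Hopper-v4 and Ant-v4, LLM-generated policies significantly improve sample efficiency, achieving performance comparable to or surpassing expert masking baselines. For Walker2d-v4, where LLMs struggle to accurately model bipedal gait dynamics, CAMEL maintains robust RL performance without notable degradation, highlighting the framework's adaptability across diverse tasks. While CAMEL shows promise in enhancing sample efficiency and mitigating convergence challenges, these issues remain open for further research. Future work aims to generalize CAMEL to multimodal LLMs for broader observation-action spaces and to automate policy evaluation, reducing human intervention and enhancing scalability in RL training pipelines.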
-
-##### **Continual Quantization-Aware Pre-Training: When to transition from 16-bit to 1.58-bit pre-training for BitNet language models?**
-2502.11895v1 by Jacob Nielsen, Peter Schneider-Kamp, Lukas Galke
-
-Large language models (LLMs) require immense resources for training and
-inference. Quantization, a technique that reduces the precision of model
-parameters, offers a promising solution for improving LLM efficiency and
-sustainability. While post-training quantization methods typically achieve 4-8
-bits per parameter, recent research suggests that training LLMs with 1.58 bits
-per weight parameter from scratch can maintain model accuracy while greatly
-reducing memory requirements and energy consumption at inference time. Here, we
-investigate a training strategy for quantization-aware pre-training, where the
-models are first trained with 16-bit precision and then transition into
-1.58-bit quantization-aware training. Our results on 11 downstream tasks show
-that this 16-to-1.58-bit training strategy is preferable over full 1.58-bit
-training and leaves models closer to those which have undergone 16-bit
-training. We further investigate the effects of retaining the optimizer state
-at the transition point and gradually phasing in quantization strength --
-finding that both techniques alleviate the magnitude of loss spikes, but also
-that these effects can be compensated through further training.
-
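-Summary: Large language models (LLMs) require immense resources for training and inference. Quantization, a technique that reduces the precision of model parameters, offers a promising way to improve LLM efficiency and sustainability. While post-training quantization methods typically achieve 4-8 bits per parameter, recent research suggests that training LLMs from scratch with 1.58 bits per weight parameter can maintain model accuracy while greatly reducing memory requirements and energy consumption at inference time. Here, we investigate a training strategy for quantization-aware pre-training in which models are first trained with 16-bit precision and then transition into 1.58-bit quantization-aware training. Our results on 11 downstream tasks show that this 16-to-1.58-bit training strategy is preferable to full 1.58-bit training and leaves models closer to those that have undergone 16-bit training. We further investigate the effects of retaining the optimizer state at the transition point and of gradually phasing in quantization strength, finding that both techniques reduce the magnitude of loss spikes, but also that these effects can be compensated for through further training.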
-
-##### **Revisiting Classification Taxonomy for Grammatical Errors**
-2502.11890v1 by Deqing Zou, Jingheng Ye, Yulu Liu, Yu Wu, Zishan Xu, Yinghui Li, Hai-Tao Zheng, Bingxu An, Zhao Wei, Yong Xu
-
-Grammatical error classification plays a crucial role in language learning
-systems, but existing classification taxonomies often lack rigorous validation,
-leading to inconsistencies and unreliable feedback. In this paper, we revisit
-previous classification taxonomies for grammatical errors by introducing a
-systematic and qualitative evaluation framework. Our approach examines four
-aspects of a taxonomy, i.e., exclusivity, coverage, balance, and usability.
-Then, we construct a high-quality grammatical error classification dataset
-annotated with multiple classification taxonomies and evaluate them based
-on our proposed evaluation framework. Our experiments reveal the drawbacks of
-existing taxonomies. Our contributions aim to improve the precision and
-effectiveness of error analysis, providing more understandable and actionable
-feedback for language learners.
-
-摘要：語法錯誤分類在語言學習系統中扮演至關重要的角色，但現有的分類法常常缺乏嚴謹的驗證，導致不一致且不可靠的回饋。在本文中，我們透過引入一個系統且定性的評估架構，重新檢視先前的語法錯誤分類法。我們的做法檢視分類法的四個面向，即排他性、涵蓋性、平衡性和可用性。接著，我們建構一個高品質的語法錯誤分類資料集，並用多個分類法進行標註，並根據我們提出的評估架構對其進行評估。我們的實驗揭露了現有分類法的缺點。我們的貢獻旨在改善錯誤分析的準確性和有效性，為語言學習者提供更易於理解且可操作的回饋。
-
-##### **Stonefish: Supporting Machine Learning Research in Marine Robotics**
-2502.11887v1 by Michele Grimaldi, Patryk Cieslak, Eduardo Ochoa, Vibhav Bharti, Hayat Rajani, Ignacio Carlucho, Maria Koskinopoulou, Yvan R. Petillot, Nuno Gracias
-
-Simulations are highly valuable in marine robotics, offering a cost-effective
-and controlled environment for testing in the challenging conditions of
-underwater and surface operations. Given the high costs and logistical
-difficulties of real-world trials, simulators capable of capturing the
-operational conditions of subsea environments have become key in developing and
-refining algorithms for remotely-operated and autonomous underwater vehicles.
-This paper highlights recent enhancements to the Stonefish simulator, an
-advanced open-source platform supporting development and testing of marine
-robotics solutions. Key updates include a suite of additional sensors, such as
-an event-based camera, a thermal camera, and an optical flow camera, as well
-as visible light communication, support for tethered operations, improved
-thruster modelling, more flexible hydrodynamics, and enhanced sonar accuracy.
-These developments and an automated annotation tool significantly bolster
-Stonefish's role in marine robotics research, especially in the field of
-machine learning, where training data with a known ground truth is hard or
-impossible to collect.
-
-摘要：模擬在海洋機器人中極具價值，提供具成本效益且受控的環境，用於在水下和水面作業的挑戰性條件下進行測試。鑑於現實世界試驗的高成本和後勤困難，能夠捕捉海底環境作業條件的模擬器已成為開發和改進遠程操作和自主水下載具演算法的關鍵。本文重點介紹了 Stonefish 模擬器最近的增強功能，這是一個先進的開源平台，支援海洋機器人解決方案的開發和測試。主要更新包括一系列額外的感測器，例如事件式相機、熱像儀和光流相機，以及可見光通訊、對繫繩操作的支援、改進的推進器建模、更靈活的水動力學和增強的聲納準確度。這些開發和自動化標註工具顯著提升了 Stonefish 在海洋機器人研究中的作用，特別是在機器學習領域，其中具有已知基本事實的訓練資料難以或無法收集。
-
-##### **LIMR: Less is More for RL Scaling**
-2502.11886v1 by Xuefeng Li, Haoyang Zou, Pengfei Liu
-
-In this paper, we ask: what truly determines the effectiveness of RL training
-data for enhancing language models' reasoning capabilities? While recent
-advances like o1, Deepseek R1, and Kimi1.5 demonstrate RL's potential, the lack
-of transparency about training data requirements has hindered systematic
-progress. Starting directly from base models without distillation, we challenge
-the assumption that scaling up RL training data inherently improves
-performance. We demonstrate that a strategically selected subset of just 1,389
-samples can outperform the full 8,523-sample dataset. We introduce Learning
-Impact Measurement (LIM), an automated method to evaluate and prioritize
-training samples based on their alignment with model learning trajectories,
-enabling efficient resource utilization and scalable implementation. Our method
-achieves comparable or even superior performance using only 1,389 samples
-versus the full 8,523-sample dataset. Notably, while recent data-efficient
-approaches (e.g., LIMO and s1) show promise with 32B-scale models, we find that
-they significantly underperform at 7B-scale through supervised fine-tuning (SFT).
-In contrast, our RL-based LIMR achieves 16.7% higher accuracy on AIME24 and
-outperforms LIMO and s1 by 13.0% and 22.2% on MATH500. These results
-fundamentally reshape our understanding of RL scaling in LLMs, demonstrating
-that precise sample selection, rather than data scale, may be the key to
-unlocking enhanced reasoning capabilities. For reproducible research and future
-innovation, we are open-sourcing LIMR, including implementation of LIM,
-training and evaluation code, curated datasets, and trained models at
-https://github.com/GAIR-NLP/LIMR.
-
-摘要：在這篇論文中，我們提出一個問題：究竟是什麼決定了 RL 訓練資料增強語言模型推理能力的有效性？雖然最近的進展，例如 o1、Deepseek R1 和 Kimi1.5，展示了 RL 的潛力，但缺乏關於訓練資料需求的透明度阻礙了系統化的進展。從沒有蒸餾的基本模型直接開始，我們挑戰了擴充 RL 訓練資料本質上就會提升效能的假設。我們證明，策略性地選出僅 1,389 個樣本的子集就能勝過完整的 8,523 個樣本資料集。我們引入了學習影響力測量 (LIM)，這是一種自動化方法，用來評估和優先處理訓練樣本，根據它們與模型學習軌跡的一致性，能有效利用資源和擴充實作。我們的方法使用僅 1,389 個樣本就能達到與使用完整的 8,523 個樣本資料集相當甚至更佳的效能。值得注意的是，雖然最近資料有效率的方法（例如 LIMO 和 s1）在 32B 規模的模型上展現了前景，但我們發現它們在 7B 規模上透過監督微調 (SFT) 的表現大幅落後。相比之下，我們基於 RL 的 LIMR 在 AIME24 上達到了高出 16.7% 的準確度，並在 MATH500 上比 LIMO 和 s1 分別高出 13.0% 和 22.2%。這些結果從根本上改變了我們對 LLM 中 RL 擴充的理解，證明精確的樣本選取，而非資料規模，可能是解鎖增強推理能力的關鍵。為了可重製的研究和未來的創新，我們開放原始碼 LIMR，包括 LIM 的實作、訓練和評估程式碼、策展的資料集，以及訓練好的模型，網址為 https://github.com/GAIR-NLP/LIMR。
-
-##### **Leveraging Dual Process Theory in Language Agent Framework for Real-time Simultaneous Human-AI Collaboration**
-2502.11882v1 by Shao Zhang, Xihuai Wang, Wenhao Zhang, Chaoran Li, Junru Song, Tingyu Li, Lin Qiu, Xuezhi Cao, Xunliang Cai, Wen Yao, Weinan Zhang, Xinbing Wang, Ying Wen
-
-Agents built on large language models (LLMs) have excelled in turn-by-turn
-human-AI collaboration but struggle with simultaneous tasks requiring real-time
-interaction. Latency issues and the challenge of inferring variable human
-strategies hinder their ability to make autonomous decisions without explicit
-instructions. Through experiments with current independent System 1 and System
-2 methods, we validate the necessity of using Dual Process Theory (DPT) in
-real-time tasks. We propose DPT-Agent, a novel language agent framework that
-integrates System 1 and System 2 for efficient real-time simultaneous human-AI
-collaboration. DPT-Agent's System 1 uses a Finite-state Machine (FSM) and
-code-as-policy for fast, intuitive, and controllable decision-making.
-DPT-Agent's System 2 integrates Theory of Mind (ToM) and asynchronous
-reflection to infer human intentions and perform reasoning-based autonomous
-decisions. We demonstrate the effectiveness of DPT-Agent through further
-experiments with rule-based agents and human collaborators, showing significant
-improvements over mainstream LLM-based frameworks. To the best of our
-knowledge, DPT-Agent is the first language agent framework that achieves
-successful real-time simultaneous human-AI collaboration autonomously. The code
-for DPT-Agent is available at https://github.com/sjtu-marl/DPT-Agent.
-
-摘要：建立在大语言模型（LLM）上的代理在回合制人机协作方面表现出色，但在需要实时交互的同时任务中却举步维艰。延迟问题和推断可变人类策略的挑战阻碍了他们在没有明确指示的情况下做出自主决策的能力。通过使用当前独立的系统 1 和系统 2 方法进行的实验，我们验证了在实时任务中使用双重过程理论 (DPT) 的必要性。我们提出了 DPT-Agent，这是一个新颖的语言代理框架，它集成了系统 1 和系统 2，以实现高效的实时同时人机协作。DPT-Agent 的系统 1 使用有限状态机 (FSM) 和代码作为策略，以进行快速、直观且可控的决策。DPT-Agent 的系统 2 集成了心智理论 (ToM) 和异步反射，以推断人类意图并执行基于推理的自主决策。我们通过与基于规则的代理和人类合作者进行进一步的实验来证明 DPT-Agent 的有效性，展示了对主流基于 LLM 的框架的重大改进。据我们所知，DPT-Agent 是第一个实现自主的实时同时人机协作的语言代理框架。DPT-Agent 的代码可以在 https://github.com/sjtu-marl/DPT-Agent 中找到。
-
-##### **Hypothesis-Driven Theory-of-Mind Reasoning for Large Language Models**
-2502.11881v1 by Hyunwoo Kim, Melanie Sclar, Tan Zhi-Xuan, Lance Ying, Sydney Levine, Yang Liu, Joshua B. Tenenbaum, Yejin Choi
-
-Existing LLM reasoning methods have shown impressive capabilities across
-various tasks, such as solving math and coding problems. However, applying
-these methods to scenarios without ground-truth answers or rule-based
-verification methods - such as tracking the mental states of an agent - remains
-challenging. Inspired by the sequential Monte Carlo algorithm, we introduce
-thought-tracing, an inference-time reasoning algorithm designed to trace the
-mental states of specific agents by generating hypotheses and weighting them
-based on observations without relying on ground-truth solutions to questions in
-datasets. Our algorithm is modeled after the Bayesian theory-of-mind framework,
-using LLMs to approximate probabilistic inference over agents' evolving mental
-states based on their perceptions and actions. We evaluate thought-tracing on
-diverse theory-of-mind benchmarks, demonstrating significant performance
-improvements compared to baseline LLMs. Our experiments also reveal interesting
-behaviors of the recent reasoning models - e.g., o1 and R1 - on theory-of-mind,
-highlighting the difference of social reasoning compared to other domains.
-
-摘要：現有的 LLM 推理方法已在各種任務中展現出令人印象深刻的能力，例如解決數學和編碼問題。然而，將這些方法應用於沒有正解答案或基於規則的驗證方法的情境中 - 例如追蹤代理人的心智狀態 - 仍然具有挑戰性。受到序貫蒙地卡羅演算法的啟發，我們引入了思想追蹤，這是一種在推理時間進行推理的演算法，旨在透過產生假設並根據觀察加權這些假設來追蹤特定代理人的心智狀態，而無需依賴資料集中的問題正解。我們的演算法是以貝氏心智理論架構為範本，使用 LLM 根據代理人的感知和行動來近似代理人不斷演變的心智狀態的機率推論。我們在各種心智理論基準上評估思想追蹤，與基準 LLM 相比，證明了顯著的效能提升。我們的實驗也揭露了近期推理模型在心智理論上的有趣行為 - 例如 o1 和 R1 - 突顯了社會推理與其他領域的差異。
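-
The hypothesis-weighting step described above can be pictured with a small sequential importance-weighting sketch. This is a generic illustration of the sequential Monte Carlo idea, not the thought-tracing algorithm itself: the hypotheses, the likelihood numbers, and the function names are all made up for the example, and in the paper's setting the likelihoods would presumably come from an LLM rather than being hard-coded.

```python
def update_weights(weights, likelihoods):
    """One importance-weighting step: w_i <- w_i * p(observation | hypothesis_i), normalized."""
    unnormalized = [w * l for w, l in zip(weights, likelihoods)]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("every hypothesis has zero likelihood for this observation")
    return [u / total for u in unnormalized]


def effective_sample_size(weights):
    """Common SMC diagnostic: low values suggest the hypothesis set should be resampled."""
    return 1.0 / sum(w * w for w in weights)


# Hypothetical mental-state hypotheses about one agent.
hypotheses = ["thinks the key is in the drawer", "thinks the key is on the table"]
weights = [0.5, 0.5]

# Hypothetical likelihoods of two successive observations under each hypothesis.
for likelihoods in ([0.8, 0.2], [0.9, 0.3]):
    weights = update_weights(weights, likelihoods)

best = max(zip(weights, hypotheses))[1]
print(best, [round(w, 3) for w in weights], round(effective_sample_size(weights), 3))
```

A full particle filter would also resample or regenerate hypotheses when the effective sample size drops; that step is omitted here to keep the example deterministic.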
-
-##### **Bitnet.cpp: Efficient Edge Inference for Ternary LLMs**
-2502.11880v1 by Jinheng Wang, Hansong Zhou, Ting Song, Shijie Cao, Yan Xia, Ting Cao, Jianyu Wei, Shuming Ma, Hongyu Wang, Furu Wei
-
-The advent of 1-bit large language models (LLMs), led by BitNet b1.58, has
-spurred interest in ternary LLMs. Despite this, research and practical
-applications focusing on efficient edge inference for ternary LLMs remain
-scarce. To bridge this gap, we introduce Bitnet.cpp, an inference system
-optimized for BitNet b1.58 and ternary LLMs. Given that mixed-precision matrix
-multiplication (mpGEMM) constitutes the bulk of inference time in ternary LLMs,
-Bitnet.cpp incorporates a novel mpGEMM library to facilitate
-sub-2-bits-per-weight, efficient and lossless inference. The library features
-two core solutions: Ternary Lookup Table (TL), which addresses spatial
-inefficiencies of previous bit-wise methods, and Int2 with a Scale (I2_S),
-which ensures lossless edge inference, both enabling high-speed inference. Our
-experiments show that Bitnet.cpp achieves up to a 6.25x increase in speed over
-full-precision baselines and up to 2.32x over low-bit baselines, setting new
-benchmarks in the field. Additionally, we expand TL to element-wise lookup
-table (ELUT) for low-bit LLMs in the appendix, presenting both theoretical and
-empirical evidence of its considerable potential. Bitnet.cpp is publicly
-available at https://github.com/microsoft/BitNet/tree/paper, offering a
-sophisticated solution for the efficient and practical deployment of edge LLMs.
-
-摘要：隨著由 BitNet b1.58 領先的 1 位元大型語言模型 (LLM) 出現，已激發了對三元 LLM 的興趣。儘管如此，專注於三元 LLM 的高效能邊緣推論的研究和實際應用仍然很少見。為了彌補這個差距，我們引入了 Bitnet.cpp，這是一個針對 BitNet b1.58 和三元 LLM 最佳化的推論系統。由於混合精度矩陣乘法 (mpGEMM) 構成三元 LLM 中推論時間的大部分，Bitnet.cpp 結合了一個新穎的 mpGEMM 函式庫，以利於每權重低於 2 位元、高效能且無損失的推論。該函式庫具有兩個核心解決方案：三元查詢表 (TL)，它解決了先前逐位元方法的空間低效率，以及具有比例的 Int2 (I2_S)，它確保無損失的邊緣推論，兩者都能實現高速推論。我們的實驗顯示，Bitnet.cpp 的速度比全精度的基準快了 6.25 倍，比低位元基準快了 2.32 倍，樹立了該領域的新基準。此外，我們在附錄中將 TL 擴充到逐元素查詢表 (ELUT) 以用於低位元 LLM，並提出其巨大潛力的理論和實證證據。Bitnet.cpp 已公開於 https://github.com/microsoft/BitNet/tree/paper，提供了一個精密的解決方案，用於邊緣 LLM 的高效能和實際部署。
-
-##### **VAQUUM: Are Vague Quantifiers Grounded in Visual Data?**
-2502.11874v1 by Hugh Mee Wong, Rick Nouwen, Albert Gatt
-
-Vague quantifiers such as "a few" and "many" are influenced by many
-contextual factors, including how many objects are present in a given context.
-In this work, we evaluate the extent to which vision-and-language models (VLMs)
-are compatible with humans when producing or judging the appropriateness of
-vague quantifiers in visual contexts. We release a novel dataset, VAQUUM,
-containing 20300 human ratings on quantified statements across a total of 1089
-images. Using this dataset, we compare human judgments and VLM predictions
-using three different evaluation methods. Our findings show that VLMs, like
-humans, are influenced by object counts in vague quantifier use. However, we
-find significant inconsistencies across models in different evaluation
-settings, suggesting that judging and producing vague quantifiers rely on two
-different processes.
-
-摘要：模糊量词，例如「一些」和「许多」，会受到许多语境因素的影响，包括在给定语境中出现的对象数量。在这项工作中，我们评估视觉语言模型 (VLM) 在视觉语境中产生或判断模糊量词的适当性时，与人类的兼容程度。我们发布了一个新数据集 VAQUUM，其中包含对 1089 张图像中的量化陈述的 20300 个人类评级。使用此数据集，我们使用三种不同的评估方法来比较人类判断和 VLM 预测。我们的研究结果表明，VLM 与人类一样，在模糊量词的使用中会受到对象数量的影响。然而，我们发现不同评估设置中的模型之间存在显着的不一致性，这表明判断和产生模糊量词依赖于两个不同的过程。
-
-##### **Southern Newswire Corpus: A Large-Scale Dataset of Mid-Century Wire Articles Beyond the Front Page**
-2502.11866v1 by Michael McRae
-
-I introduce a new large-scale dataset of historical wire articles from U.S.
-Southern newspapers, spanning 1960-1975 and covering multiple wire services:
-The Associated Press, United Press International, Newspaper Enterprise
-Association. Unlike prior work focusing on front-page content, this dataset
-captures articles across the entire newspaper, offering broader insight into
-mid-century Southern coverage. The dataset includes a version that has
-undergone an LLM-based text cleanup pipeline to reduce OCR noise, enhancing its
-suitability for quantitative text analysis. Additionally, duplicate versions of
-articles are retained to enable analysis of editorial differences in language
-and framing across newspapers. Each article is tagged by wire service,
-facilitating comparative studies of editorial patterns across agencies. This
-resource opens new avenues for research in computational social science,
-digital humanities, and historical linguistics, providing a detailed
-perspective on how Southern newspapers relayed national and international news
-during a transformative period in American history. The dataset will be made
-available upon publication or request for research purposes.
-
-摘要：我介紹一個新的大型資料集，收錄美國南方報紙的歷史電訊文章，時間跨度為 1960-1975 年，涵蓋多個電訊服務：美聯社、合眾國際社、報業企業協會。與先前專注於頭版內容的研究不同，此資料集擷取了整份報紙的文章，提供更廣泛的見解，深入探討世紀中葉的南方報導。該資料集包含一個經過 LLM 文字清理管線處理的版本，以減少 OCR 雜訊，提升其適用於量化文字分析。此外，保留文章的重複版本，以利分析報紙間語言和架構的編輯差異。每篇文章都標記電訊服務，便於比較各家機構的編輯模式。此資源為計算社會科學、數位人文和歷史語言學的研究開啟了新的途徑，提供一個詳細的觀點，探討南方報紙在美國歷史的轉型時期如何傳遞國內和國際新聞。該資料集將在出版或研究目的請求後提供。
-
-##### **FedEAT: A Robustness Optimization Framework for Federated LLMs**
-2502.11863v1 by Yahao Pang, Xingyuan Wu, Xiaojin Zhang, Wei Chen, Hai Jin
-
-Significant advancements have been made by Large Language Models (LLMs) in
-the domains of natural language understanding and automated content creation.
-However, they still face persistent problems, including substantial
-computational costs and inadequate availability of training data. The
-combination of Federated Learning (FL) and LLMs (federated LLMs) offers a
-solution by leveraging distributed data while protecting privacy, which
-positions it as an ideal choice for sensitive domains. However, Federated LLMs
-still suffer from robustness challenges, including data heterogeneity,
-malicious clients, and adversarial attacks, which greatly hinder their
-applications. We first introduce the robustness problems in federated LLMs. To
-address these challenges, we propose FedEAT (Federated Embedding space
-Adversarial Training), a novel framework that applies adversarial training in
-the embedding space of client LLM and employs a robust aggregation approach,
-specifically geometric median aggregation, to enhance the robustness of
-Federated LLMs. Our experiments demonstrate that FedEAT effectively improves
-the robustness of Federated LLMs with minimal performance loss.
-
-摘要：大型語言模型 (LLM) 在自然語言理解和自動化內容創作領域取得了重大進展。
-然而，它們仍然面臨持續的問題，包括大量的運算成本和訓練數據的可用性不足。
-聯合學習 (FL) 和 LLM（聯合 LLM）的結合提供了一個解決方案，在保護隱私的同時利用分佈式數據，這使其成為敏感領域的理想選擇。
-然而，聯合 LLM 仍然面臨著穩健性的挑戰，包括數據異質性、惡意用戶和對抗性攻擊，這極大地阻礙了它們的應用。
-我們首先介紹了聯合 LLM 中的穩健性問題，為了應對這些挑戰，我們提出了 FedEAT（聯合嵌入空間對抗訓練），這是一個新穎的框架，它在用戶端 LLM 的嵌入空間中應用對抗訓練，並採用穩健的聚合方法，特別是幾何中值聚合，以增強聯合 LLM 的穩健性。
-我們的實驗表明，FedEAT 有效地提高了聯合 LLM 的穩健性，同時性能損失最小。
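-
To make "geometric median aggregation" concrete, here is a minimal sketch using Weiszfeld's iteration, a standard way to approximate the geometric median. It is not taken from the FedEAT code: the function name, the toy two-dimensional "client updates", and the iteration settings are assumptions chosen for illustration, and the abstract does not say which solver the authors use.

```python
import math


def geometric_median(points, iters=200, tol=1e-9, eps=1e-12):
    """Approximate the geometric median of equal-length vectors with Weiszfeld's iteration."""
    dim = len(points[0])
    # Start from the coordinate-wise mean.
    current = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    for _ in range(iters):
        # Weight each point by the inverse of its distance to the current estimate;
        # eps guards against division by zero when the estimate lands on a point.
        inv = [1.0 / max(math.dist(p, current), eps) for p in points]
        total = sum(inv)
        new = [sum(w * p[d] for w, p in zip(inv, points)) / total for d in range(dim)]
        if math.dist(new, current) < tol:
            return new
        current = new
    return current


# Three similar (hypothetical) client updates and one malicious outlier.
updates = [[1.0, 1.0], [1.1, 0.9], [0.9, 1.1], [50.0, -50.0]]
mean = [sum(u[d] for u in updates) / len(updates) for d in range(2)]
robust = geometric_median(updates)
print("mean:", mean)
print("geometric median:", [round(x, 3) for x in robust])
```

The plain mean is dragged far toward the outlier, while the geometric median stays near the three agreeing updates, which is the property that makes it attractive when some clients may be malicious.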
-
-##### **Understanding In-Context Machine Translation for Low-Resource Languages: A Case Study on Manchu**
-2502.11862v1 by Renhao Pei, Yihong Liu, Peiqin Lin, François Yvon, Hinrich Schütze
-
-In-context machine translation (MT) with large language models (LLMs) is a
-promising approach for low-resource MT, as it can readily take advantage of
-linguistic resources such as grammar books and dictionaries. Such resources are
-usually selectively integrated into the prompt so that LLMs can directly
-perform translation without any specific training, via their in-context
-learning capability (ICL). However, the relative importance of each type of
-resource (e.g., dictionary, grammar book, and retrieved parallel examples) is
-not entirely clear. To address this gap, this study systematically investigates
-how each resource and its quality affect the translation performance, with the
-Manchu language as our case study. To remove any prior knowledge of Manchu
-encoded in the LLM parameters and single out the effect of ICL, we also
-experiment with an encrypted version of Manchu texts. Our results indicate that
-high-quality dictionaries and good parallel examples are very helpful, while
-grammars hardly help. In a follow-up study, we showcase a promising application
-of in-context MT: parallel data augmentation as a way to bootstrap the
-conventional MT model. When monolingual data abound, generating synthetic
-parallel data through in-context MT offers a pathway to mitigate data scarcity
-and build effective and efficient low-resource neural MT systems.
-
-摘要：語境機器翻譯 (MT) 與大型語言模型 (LLM) 結合，對於低資源 MT 來說是一種有前景的方法，因為它可以輕易利用語法書和字典等語言資源。此類資源通常會選擇性地整合到提示中，讓 LLM 能夠透過其語境學習能力 (ICL) 直接執行翻譯，而無需任何特定訓練。然而，每種類型的資源（例如字典、語法書和擷取的平行範例）的相對重要性並不明確。為了解決這個問題，本研究系統性地探討每項資源及其品質如何影響翻譯效能，並以滿語作為我們的案例研究。為了移除 LLM 參數中編碼的任何滿語先備知識，並找出 ICL 的影響，我們也對滿語文本的加密版本進行實驗。我們的結果顯示，高品質的字典和良好的平行範例非常有幫助，而語法幾乎沒有幫助。在後續研究中，我們展示了語境 MT 的一個有前景的應用：平行數據擴充，作為引導傳統 MT 模型的一種方式。當單語資料豐富時，透過語境 MT 產生合成平行資料提供了一條途徑，可以減輕資料短缺，並建構有效且高效的低資源神經 MT 系統。
-
-##### **Exploring Large Language Models in Healthcare: Insights into Corpora Sources, Customization Strategies, and Evaluation Metrics**
-2502.11861v1 by Shuqi Yang, Mingrui Jing, Shuai Wang, Jiaxin Kou, Manfei Shi, Weijie Xing, Yan Hu, Zheng Zhu
-
-This study reviewed the use of Large Language Models (LLMs) in healthcare,
-focusing on their training corpora, customization techniques, and evaluation
-metrics. A systematic search of studies from 2021 to 2024 identified 61
-articles. Four types of corpora were used: clinical resources, literature,
-open-source datasets, and web-crawled data. Common construction techniques
-included pre-training, prompt engineering, and retrieval-augmented generation,
-with 44 studies combining multiple methods. Evaluation metrics were categorized
-into process, usability, and outcome metrics, with outcome metrics divided into
-model-based and expert-assessed outcomes. The study identified critical gaps in
-corpus fairness, which contributed to biases from geographic, cultural, and
-socio-economic factors. The reliance on unverified or unstructured data
-highlighted the need for better integration of evidence-based clinical
-guidelines. Future research should focus on developing a tiered corpus
-architecture with vetted sources and dynamic weighting, while ensuring model
-transparency. Additionally, the lack of standardized evaluation frameworks for
-domain-specific models called for comprehensive validation of LLMs in
-real-world healthcare settings.
-
-摘要：本研究回顧了大型語言模型 (LLM) 在醫療保健中的使用，重點在於其訓練語料庫、自訂技術和評估指標。針對 2021 年至 2024 年的研究進行系統性搜尋，找出 61 篇文章。語料庫類型有四種：臨床資源、文獻、開放原始碼資料集和網路爬取資料。常見的建構技術包括預訓練、提示工程和檢索增強生成，其中有 44 項研究結合多種方法。評估指標分為流程、可用性和成果指標，其中成果指標又分為基於模型和專家評估的成果。本研究發現語料庫公平性存在重大差距，這會導致地理、文化和社會經濟因素的偏見。對未驗證或非結構化資料的依賴性突顯出更佳整合循證臨床指南的必要性。未來的研究應專注於開發具有審查來源和動態加權的分層語料庫架構，同時確保模型透明性。此外，缺乏針對特定領域模型的標準化評估架構，因此需要對 LLM 在實際醫療保健環境中進行全面驗證。
-
-##### **Defining and Evaluating Visual Language Models' Basic Spatial Abilities: A Perspective from Psychometrics**
-2502.11859v1 by Wenrui Xu, Dalin Lyu, Weihang Wang, Jie Feng, Chen Gao, Yong Li
-
-The Theory of Multiple Intelligences underscores the hierarchical nature of
-cognitive capabilities. To advance Spatial Artificial Intelligence, we pioneer
-a psychometric framework defining five Basic Spatial Abilities (BSAs) in Visual
-Language Models (VLMs): Spatial Perception, Spatial Relation, Spatial
-Orientation, Mental Rotation, and Spatial Visualization. Benchmarking 13
-mainstream VLMs through nine validated psychometric experiments reveals
-significant gaps versus humans (average score 24.95 vs. 68.38), with three key
-findings: 1) VLMs mirror human hierarchies (strongest in 2D orientation,
-weakest in 3D rotation) with independent BSAs (Pearson's r<0.4); 2) Smaller
-models such as Qwen2-VL-7B surpass larger counterparts, with Qwen leading
-(30.82) and InternVL2 lagging (19.6); 3) Interventions like chain-of-thought
-(0.100 accuracy gain) and 5-shot training (0.259 improvement) show limits from
-architectural constraints. Identified barriers include weak geometry encoding
-and missing dynamic simulation. By linking psychometric BSAs to VLM
-capabilities, we provide a diagnostic toolkit for spatial intelligence
-evaluation, methodological foundations for embodied AI development, and a
-cognitive science-informed roadmap for achieving human-like spatial
-intelligence.
-
-摘要：多元智能理論強調認知能力的層次性質。為了推進空間人工智慧，我們開創了一個心理測量框架，在視覺語言模型 (VLM) 中定義了五種基本空間能力 (BSA)：空間知覺、空間關係、空間定向、心智旋轉和空間視覺化。通過九項經過驗證的心理測量實驗對 13 個主流 VLM 進行基準測試，揭示了與人類相比的顯著差距（平均分數 24.95 對 68.38），並得出三個關鍵發現：1) VLM 反映人類層次結構（2D 定向最強，3D 旋轉最弱）具有獨立的 BSA（Pearson's r<0.4）；2) Qwen2-VL-7B 等較小的模型超越了較大的模型，其中 Qwen 領先（30.82），InternVL2 落後（19.6）；3) 思想鏈等干預措施（準確率提升 0.100）和 5-shot 訓練（0.259 提升）顯示了架構約束的限制。已識別的障礙包括弱幾何編碼和缺少動態模擬。通過將心理測量 BSA 與 VLM 能力聯繫起來，我們提供了一個用於空間智能評估的診斷工具包、具身 AI 開發的方法論基礎，以及實現類人空間智能的認知科學信息路標。
-
-##### **LLMs as a synthesis between symbolic and continuous approaches to language**
-2502.11856v1 by Gemma Boleda
-
-Since the middle of the 20th century, a fierce battle has been fought between
-symbolic and continuous approaches to language and cognition. The success of
-deep learning models, and LLMs in particular, has been alternately taken as
-showing that the continuous camp has won, or dismissed as an irrelevant
-engineering development. However, in this position paper I argue that deep
-learning models for language actually represent a synthesis between the two
-traditions. This is because 1) deep learning architectures allow for both
-continuous/distributed and symbolic/discrete-like representations and
-computations; 2) models trained on language make use of this flexibility. In
-particular, I review recent research in mechanistic interpretability that
-showcases how a substantial part of morphosyntactic knowledge is encoded in a
-near-discrete fashion in LLMs. This line of research suggests that different
-behaviors arise in an emergent fashion, and models flexibly alternate between
-the two modes (and everything in between) as needed. This is possibly one of
-the main reasons for their wild success; and it is also what makes them
-particularly interesting for the study of language and cognition. Is it time
-for peace?
-
-摘要：自 20 世紀中葉以來，象徵與連續的語言和認知方法之間展開了一場激烈的戰鬥。深度學習模型，特別是 LLM 的成功，被交替視為連續陣營獲勝的證明，或被視為無關的工程發展而被忽視。然而，在本文中，我認為用於語言的深度學習模型實際上代表了這兩種傳統之間的綜合。這是因為 1) 深度學習架構允許連續/分佈式和符號/離散式表示和計算；2) 在語言上訓練的模型利用了這種靈活性。特別是，我回顧了機制可解釋性的最新研究，展示了形態句法知識的實質部分是如何以近乎離散的方式編碼在 LLM 中的。這條研究線表明，不同的行為以一種新興的方式出現，並且模型根據需要在兩種模式（以及介於兩者之間的所有內容）之間靈活地交替。這可能是它們獲得巨大成功的主要原因之一；這也是它們對語言和認知研究特別有趣的原因。和平的時刻到了嗎？
-
-##### **BaxBench: Can LLMs Generate Correct and Secure Backends?**
-2502.11844v1 by Mark Vero, Niels Mündler, Victor Chibotaru, Veselin Raychev, Maximilian Baader, Nikola Jovanović, Jingxuan He, Martin Vechev
-
-The automatic generation of programs has long been a fundamental challenge in
-computer science. Recent benchmarks have shown that large language models
-(LLMs) can effectively generate code at the function level, make code edits,
-and solve algorithmic coding tasks. However, to achieve full automation, LLMs
-should be able to generate production-quality, self-contained application
-modules. To evaluate the capabilities of LLMs in solving this challenge, we
-introduce BaxBench, a novel evaluation benchmark consisting of 392 tasks for
-the generation of backend applications. We focus on backends for three critical
-reasons: (i) they are practically relevant, building the core components of
-most modern web and cloud software, (ii) they are difficult to get right,
-requiring multiple functions and files to achieve the desired functionality,
-and (iii) they are security-critical, as they are exposed to untrusted
-third-parties, making secure solutions that prevent deployment-time attacks an
-imperative. BaxBench validates the functionality of the generated applications
-with comprehensive test cases, and assesses their security exposure by
-executing end-to-end exploits. Our experiments reveal key limitations of
-current LLMs in both functionality and security: (i) even the best model,
-OpenAI o1, achieves a mere 60% on code correctness; (ii) on average, we could
-successfully execute security exploits on more than half of the correct
-programs generated by each LLM; and (iii) in less popular backend frameworks,
-models further struggle to generate correct and secure applications. Progress
-on BaxBench signifies important steps towards autonomous and secure software
-development with LLMs.
-
-摘要：程式自動產生一直是電腦科學中的基本挑戰。最近的基準測試顯示，大型語言模型 (LLM) 能夠有效產生函數層級的程式碼、進行程式碼編輯，以及解決演算法編碼任務。然而，若要達成完全自動化，LLM 應能夠產生生產品質、獨立的應用程式模組。為了評估 LLM 在解決此挑戰的能力，我們引入了 BaxBench，這是一個包含 392 個後端應用程式產生任務的新評估基準。我們專注於後端有三個關鍵原因：(i) 它們在實務上有其相關性，建構了大多數現代網路和雲端軟體的核心元件；(ii) 它們難以正確執行，需要多個函數和檔案才能達成所需的運作功能；(iii) 它們與安全性息息相關，因為它們會暴露於不受信任的第三方，使得預防部署時攻擊的安全解決方案成為當務之急。BaxBench 使用全面的測試案例驗證產生應用程式的功能，並透過執行端對端漏洞利用來評估其安全性風險。我們的實驗揭露了目前 LLM 在功能和安全性上的主要限制：(i) 即使是最好的模型 OpenAI o1，在程式碼正確性上也僅達到 60%；(ii) 平均而言，在每個 LLM 產生的正確程式中，我們能夠對其中超過一半成功執行安全漏洞利用；(iii) 在較不受歡迎的後端框架中，模型在產生正確且安全的應用程式上更加困難。在 BaxBench 上的進展代表著使用 LLM 朝向自主且安全的軟體開發邁出了重要的一步。
-
-##### **Can LLM Agents Maintain a Persona in Discourse?**
-2502.11843v1 by Pranav Bhandari, Nicolas Fay, Michael Wise, Amitava Datta, Stephanie Meek, Usman Naseem, Mehwish Nasim
-
-Large Language Models (LLMs) are widely used as conversational agents,
-exploiting their capabilities in various sectors such as education, law,
-medicine, and more. However, LLMs are often subjected to context-shifting
-behaviour, resulting in a lack of consistent and interpretable
-personality-aligned interactions. Adherence to psychological traits lacks
-comprehensive analysis, especially in the case of dyadic (pairwise)
-conversations. We examine this challenge from two viewpoints, initially using
-two conversation agents to generate a discourse on a certain topic with an
-assigned personality from the OCEAN framework (Openness, Conscientiousness,
-Extraversion, Agreeableness, and Neuroticism) as High/Low for each trait. This
-is followed by using multiple judge agents to infer the original traits
-assigned to explore prediction consistency, inter-model agreement, and
-alignment with the assigned personality. Our findings indicate that while LLMs
-can be guided toward personality-driven dialogue, their ability to maintain
-personality traits varies significantly depending on the combination of models
-and discourse settings. These inconsistencies emphasise the challenges in
-achieving stable and interpretable personality-aligned interactions in LLMs.
-
-摘要：大型語言模型 (LLM) 被廣泛用作對話代理，在教育、法律、醫學等各個領域發揮其能力。然而，LLM 經常受到情境轉換行為的影響，導致缺乏一致且可解釋的與人格一致的互動。對心理特質的堅持缺乏全面的分析，特別是在二元 (成對) 對話的情況下。我們從兩個觀點審視這個挑戰，最初使用兩個對話代理在特定主題上產生論述，並從 OCEAN 框架 (開放性、盡責性、外向性、宜人性、神經質) 中分配人格，每個特質為高/低。接著使用多個評審代理來推斷所分配的原始特質，以探討預測一致性、模型間一致性，以及與所分配人格的吻合程度。我們的研究結果表明，雖然 LLM 可以引導至以人格為導向的對話，但它們維持人格特質的能力會根據模型和論述設定的組合而有顯著差異。這些不一致強調了在 LLM 中實現穩定且可解釋的與人格一致的互動的挑戰。
-
-##### **ChordFormer: A Conformer-Based Architecture for Large-Vocabulary Audio Chord Recognition**
-2502.11840v1 by Muhammad Waseem Akram, Stefano Dettori, Valentina Colla, Giorgio Carlo Buttazzo
-
-Chord recognition serves as a critical task in music information retrieval
-due to the abstract and descriptive nature of chords in music analysis. While
-audio chord recognition systems have achieved significant accuracy for small
-vocabularies (e.g., major/minor chords), large-vocabulary chord recognition
-remains a challenging problem. This complexity also arises from the inherent
-long-tail distribution of chords, where rare chord types are underrepresented
-in most datasets, leading to insufficient training samples. Effective chord
-recognition requires leveraging contextual information from audio sequences,
-yet existing models, such as combinations of convolutional neural networks,
-bidirectional long short-term memory networks, and bidirectional transformers,
-face limitations in capturing long-term dependencies and exhibit suboptimal
-performance on large-vocabulary chord recognition tasks. This work proposes
-ChordFormer, a novel conformer-based architecture designed to tackle structural
-chord recognition (e.g., triads, bass, sevenths) for large vocabularies.
-ChordFormer leverages conformer blocks that integrate convolutional neural
-networks with transformers, thus enabling the model to capture both local
-patterns and global dependencies effectively. By addressing challenges such as
-class imbalance through a reweighted loss function and structured chord
-representations, ChordFormer outperforms state-of-the-art models, achieving a
-2% improvement in frame-wise accuracy and a 6% increase in class-wise accuracy
-on large-vocabulary chord datasets. Furthermore, ChordFormer excels in handling
-class imbalance, providing robust and balanced recognition across chord types.
-This approach bridges the gap between theoretical music knowledge and practical
-applications, advancing the field of large-vocabulary chord recognition.
-
-摘要：和弦辨識由於和弦在音樂分析中具有抽象性和描述性，因此在音樂資訊檢索中扮演著重要的任務。雖然音訊和弦辨識系統已在小型詞彙（例如，大調/小調和弦）中達到顯著的準確度，但大型詞彙和弦辨識仍然是一個具有挑戰性的問題。這種複雜性也來自和弦固有的長尾分佈，其中在大多數資料集中罕見的和弦類型代表性不足，導致訓練樣本不足。有效的和弦辨識需要利用音訊序列中的上下文資訊，但現有的模型，例如卷積神經網路、雙向長短期記憶網路和雙向轉換器的組合，在捕捉長期依賴關係方面面臨限制，並且在大詞彙和弦辨識任務上表現不佳。這項工作提出了 ChordFormer，這是一種新穎的基於變形器的架構，旨在解決大型詞彙的結構和弦辨識（例如，三和弦、低音、七和弦）。ChordFormer 利用變形器區塊將卷積神經網路與變形器整合在一起，從而使模型能夠有效地捕捉局部模式和全局依賴關係。透過重新加權損失函數和結構化和弦表示來解決類別不平衡等挑戰，ChordFormer 優於最先進的模型，在大詞彙和弦資料集上實現了幀準確度提高 2% 和類準確度提高 6%。此外，ChordFormer 在處理類別不平衡方面表現出色，在和弦類型中提供穩健且平衡的辨識。這種方法彌合了理論音樂知識與實際應用之間的差距，推動了大型詞彙和弦辨識領域的發展。
-
-##### **Intuitive physics understanding emerges from self-supervised pretraining on natural videos**
-2502.11831v1 by Quentin Garrido, Nicolas Ballas, Mahmoud Assran, Adrien Bardes, Laurent Najman, Michael Rabbat, Emmanuel Dupoux, Yann LeCun
-
-We investigate the emergence of intuitive physics understanding in
-general-purpose deep neural network models trained to predict masked regions in
-natural videos. Leveraging the violation-of-expectation framework, we find that
-video prediction models trained to predict outcomes in a learned representation
-space demonstrate an understanding of various intuitive physics properties,
-such as object permanence and shape consistency. In contrast, video prediction
-in pixel space and multimodal large language models, which reason through text,
-achieve performance closer to chance. Our comparisons of these architectures
-reveal that jointly learning an abstract representation space while predicting
-missing parts of sensory input, akin to predictive coding, is sufficient to
-acquire an understanding of intuitive physics, and that even models trained on
-one week of unique video achieve above chance performance. This challenges the
-idea that core knowledge -- a set of innate systems to help understand the
-world -- needs to be hardwired to develop an understanding of intuitive
-physics.
-
-摘要：我們探討了在經過訓練以預測自然影片中遮蔽區域的通用深度神經網路模型中，直覺物理理解的出現。利用違反預期框架，我們發現經過訓練以預測學習表徵空間中結果的影片預測模型，展現了對各種直覺物理特性的理解，例如物體恆存和形狀一致性。相反地，影片在像素空間和多模態大型語言模型中的預測，透過文字推理，達到的效能接近隨機。我們對這些架構的比較揭示了在預測感官輸入的遺失部分時，同時學習抽象表徵空間，類似於預測編碼，足以獲得對直覺物理的理解，而且即使在獨特影片上訓練一週的模型，也達到了高於隨機的效能。這挑戰了核心知識（一套幫助理解世界的先天系統）需要硬連線才能發展對直覺物理的理解這個想法。
-
-##### **Text Classification in the LLM Era - Where do we stand?**
-2502.11830v1 by Sowmya Vajjala, Shwetali Shimangaud
-
-Large Language Models revolutionized NLP and showed dramatic performance
-improvements across several tasks. In this paper, we investigated the role of
-such language models in text classification and how they compare with other
-approaches relying on smaller pre-trained language models. Considering 32
-datasets spanning 8 languages, we compared zero-shot classification, few-shot
-fine-tuning and synthetic data based classifiers with classifiers built using
-the complete human labeled dataset. Our results show that zero-shot approaches
-do well for sentiment classification, but are outperformed by other approaches
-for the rest of the tasks, and synthetic data sourced from multiple LLMs can
-build better classifiers than zero-shot open LLMs. We also see wide performance
-disparities across languages in all the classification scenarios. We expect
-that these findings would guide practitioners working on developing text
-classification systems across languages.
-
-摘要：大型語言模型革新了自然語言處理，並在多項任務中展現出顯著的效能提升。在本文中，我們探討了此類語言模型在文字分類中的角色，以及它們與依賴較小規模預先訓練語言模型的其他方法相比如何。考量涵蓋 8 種語言的 32 個資料集，我們比較了零次學習分類、少次學習微調和合成資料分類器，以及使用完整人工標記資料集建置的分類器。我們的結果顯示，零次學習方法在情緒分類中表現良好，但在其他任務中則不如其他方法，而來自多個大型語言模型的合成資料可以建置比零次學習開放大型語言模型更好的分類器。我們也看到在所有分類情境中，不同語言之間的效能差異很大。我們預期這些發現將引導從事跨語言文字分類系統開發的實務工作者。
-
-##### **Code-Vision: Evaluating Multimodal LLMs Logic Understanding and Code Generation Capabilities**
-2502.11829v1 by Hanbin Wang, Xiaoxuan Zhou, Zhipeng Xu, Keyuan Cheng, Yuxin Zuo, Kai Tian, Jingwei Song, Junting Lu, Wenhui Hu, Xueyang Liu
-
-This paper introduces Code-Vision, a benchmark designed to evaluate the
-logical understanding and code generation capabilities of Multimodal Large
-Language Models (MLLMs). It challenges MLLMs to generate a correct program that
-fulfills specific functionality requirements based on a given flowchart, which
-visually represents the desired algorithm or process. Code-Vision comprises
-three subsets: HumanEval-V, Algorithm, and MATH, which evaluate MLLMs' coding
-abilities across basic programming, algorithmic, and mathematical
-problem-solving domains. Our experiments evaluate 12 MLLMs on Code-Vision.
-Experimental results demonstrate that there is a large performance difference
-between proprietary and open-source models. On Hard problems, GPT-4o can
-achieve 79.3% pass@1, but the best open-source model only achieves 15%. Further
-experiments reveal that Code-Vision can pose unique challenges compared to
-other multimodal reasoning benchmarks MMCode and MathVista. We also explore the
-reason for the poor performance of the open-source models. All data and code
-are available at https://github.com/wanghanbinpanda/CodeVision.
-
-摘要：本文介绍 Code-Vision，此基准测试旨在评估多模态大型语言模型 (MLLM) 的逻辑理解和代码生成能力。它要求 MLLM 根据给定的流程图生成一个正确的程序，以满足特定的功能需求，而流程图直观地表示所需的算法或流程。Code-Vision 包含三个子集：HumanEval-V、Algorithm 和 MATH，它们评估 MLLM 在基本编程、算法和数学问题解决域中的编码能力。我们的实验对 Code-Vision 上的 12 个 MLLM 进行了评估。实验结果表明，专有模型和开源模型之间的性能差异很大。在困难问题上，GPT-4o 可以达到 79.3% 的 pass@1，但最好的开源模型只能达到 15%。进一步的实验表明，与其他多模态推理基准 MMCode 和 MathVista 相比，Code-Vision 可能会带来独特的挑战。我们还探讨了开源模型性能不佳的原因。所有数据和代码均可在 https://github.com/wanghanbinpanda/CodeVision 中获得。
-
-##### **M-ABSA: A Multilingual Dataset for Aspect-Based Sentiment Analysis**
-2502.11824v1 by Chengyan Wu, Bolei Ma, Yihong Liu, Zheyu Zhang, Ningyuan Deng, Yanshu Li, Baolan Chen, Yi Zhang, Barbara Plank, Yun Xue
-
-Aspect-based sentiment analysis (ABSA) is a crucial task in information
-extraction and sentiment analysis, aiming to identify aspects with associated
-sentiment elements in text. However, existing ABSA datasets are predominantly
-English-centric, limiting the scope for multilingual evaluation and research.
-To bridge this gap, we present M-ABSA, a comprehensive dataset spanning 7
-domains and 21 languages, making it the most extensive multilingual parallel
-dataset for ABSA to date. Our primary focus is on triplet extraction, which
-involves identifying aspect terms, aspect categories, and sentiment polarities.
-The dataset is constructed through an automatic translation process with human
-review to ensure quality. We perform extensive experiments using various
-baselines to assess performance and compatibility on M-ABSA. Our empirical
-findings highlight that the dataset enables diverse evaluation tasks, such as
-multilingual and multi-domain transfer learning, and large language model
-evaluation, underscoring its inclusivity and its potential to drive
-advancements in multilingual ABSA research.
-
-摘要：面向方面的观点分析 (ABSA) 是資訊萃取和觀點分析中的一項重要任務，旨在識別文本中帶有相關觀點元素的方面。然而，現有的 ABSA 資料集以英語為中心，限制了多語言評估和研究的範圍。為了彌補這個差距，我們提出了 M-ABSA，這是一個涵蓋 7 個領域和 21 種語言的綜合性資料集，使其成為迄今為止最廣泛的多語言平行資料集，適用於 ABSA。我們的重點是三元組萃取，其中涉及識別方面術語、方面類別和觀點極性。該資料集是透過自動翻譯過程構建的，並經過人工審查以確保品質。我們使用各種基線進行廣泛的實驗，以評估 M-ABSA 上的效能和相容性。我們的實證結果強調，該資料集支援多樣化的評估任務，例如多語言和多領域遷移學習，以及大型語言模型評估，凸顯其包容性和推動多語言 ABSA 研究進展的潛力。
-
-##### **AAKT: Enhancing Knowledge Tracing with Alternate Autoregressive Modeling**
-2502.11817v1 by Hao Zhou, Wenge Rong, Jianfei Zhang, Qing Sun, Yuanxin Ouyang, Zhang Xiong
-
-Knowledge Tracing (KT) aims to predict students' future performances based on
-their former exercises and additional information in educational settings. KT
-has received significant attention since it facilitates personalized
-experiences in educational situations. Simultaneously, the autoregressive
-modeling on the sequence of former exercises has been proven effective for this
-task. One of the primary challenges in autoregressive modeling for Knowledge
-Tracing is effectively representing the anterior (pre-response) and posterior
-(post-response) states of learners across exercises. Existing methods often
-employ complex model architectures to update learner states using question and
-response records. In this study, we propose a novel perspective on the knowledge
-tracing task by treating it as a generative process, consistent with the
-principles of autoregressive models. We demonstrate that knowledge states can
-be directly represented through autoregressive encodings on an alternating
-question-response sequence, where the model generates the most probable
-representation in hidden state space by analyzing the interaction history. This approach underpins
-our framework, termed Alternate Autoregressive Knowledge Tracing (AAKT).
-Additionally, we incorporate supplementary educational information, such as
-question-related skills, into our framework through an auxiliary task, and
-include extra exercise details, like response time, as additional inputs. Our
-proposed framework is implemented using advanced autoregressive technologies
-from Natural Language Generation (NLG) for both training and prediction.
-Empirical evaluations on four real-world KT datasets indicate that AAKT
-consistently outperforms all baseline models in terms of AUC, ACC, and RMSE.
-Furthermore, extensive ablation studies and visualized analysis validate the
-effectiveness of key components in AAKT.
-
-摘要：知識追蹤 (KT) 旨在根據學生的前次練習和教育環境中的額外資訊，預測學生的未來表現。KT 自從促進教育情境中的個人化體驗後，便備受關注。同時，前次練習序列上的自迴歸模型已被證明對此任務有效。知識追蹤中自迴歸模型的主要挑戰之一，是有效表示學習者在各項練習中的先驗 (反應前) 和後驗 (反應後) 狀態。現有方法通常採用複雜的模型架構，使用問題和反應記錄來更新學習者狀態。在本研究中，我們提出了一個關於知識追蹤任務的新觀點，將其視為一個生成過程，與自迴歸模型的原理一致。我們證明了知識狀態可以直接透過問答交替序列上的自迴歸編碼來表示，其中模型透過分析歷史互動來生成隱藏狀態空間中最可能的表示。此方法支撐了我們的架構，稱為交替自迴歸知識追蹤 (AAKT)。此外，我們透過輔助任務將補充教育資訊（例如與問題相關的技能）納入我們的架構，並將額外練習細節（例如反應時間）納入額外輸入。我們提出的架構是使用自然語言生成 (NLG) 的先進自迴歸技術，用於訓練和預測。對四個真實世界的 KT 資料集進行的經驗評估表明，AAKT 在 AUC、ACC 和 RMSE 方面始終優於所有基準模型。此外，廣泛的消融研究和視覺化分析驗證了 AAKT 中關鍵組件的有效性。
-
-##### **Towards Understanding Fine-Tuning Mechanisms of LLMs via Circuit Analysis**
-2502.11812v1 by Xu Wang, Yan Hu, Wenyu Du, Reynold Cheng, Benyou Wang, Difan Zou
-
-Fine-tuning significantly improves the performance of Large Language Models
-(LLMs), yet its underlying mechanisms remain poorly understood. This paper aims
-to provide an in-depth interpretation of the fine-tuning process through
-circuit analysis, a popular tool in Mechanistic Interpretability (MI). Unlike
-previous studies
-\cite{prakash2024finetuningenhancesexistingmechanisms,chhabra2024neuroplasticity}
-that focus on tasks where pre-trained models already perform well, we develop a
-set of mathematical tasks where fine-tuning yields substantial performance
-gains, which are closer to the practical setting. In our experiments, we
-identify circuits at various checkpoints during fine-tuning and examine the
-interplay between circuit analysis, fine-tuning methods, and task complexities.
-First, we find that while circuits maintain high node similarity before and
-after fine-tuning, their edges undergo significant changes, which is in
-contrast to the previous work
-\cite{prakash2024finetuningenhancesexistingmechanisms,chhabra2024neuroplasticity}
-that shows circuits only add some additional components after fine-tuning. Based
-on these observations, we develop a circuit-aware Low-Rank Adaptation (LoRA)
-method, which assigns ranks to layers based on edge changes in the circuits.
-Experimental results demonstrate that our circuit-based LoRA algorithm achieves
-an average performance improvement of 2.46\% over standard LoRA with similar
-parameter sizes. Furthermore, we explore how combining circuits from subtasks
-can enhance fine-tuning in compositional tasks, providing new insights into the
-design of such tasks and deepening the understanding of circuit dynamics and
-fine-tuning mechanisms.
-
-摘要：微調大幅提升大型語言模型 (LLM) 的效能，但其底層機制仍鮮為人知。本文旨在透過電路分析，一種機械可解釋性 (MI) 中廣泛使用的工具，提供微調過程的深入詮釋。不同於先前的研究
-\cite{prakash2024finetuningenhancesexistingmechanisms,chhabra2024neuroplasticity}
-專注於預訓練模型已表現良好的任務，我們開發了一組數學任務，其中微調產生顯著的效能提升，更接近實際設定。在我們的實驗中，我們在微調期間的各種檢查點識別電路，並探討電路分析、微調方法和任務複雜度之間的交互作用。首先，我們發現電路在微調前後雖然維持高節點相似度，但其邊緣卻經歷顯著變化，這與先前的研究
-\cite{prakash2024finetuningenhancesexistingmechanisms,chhabra2024neuroplasticity}
-顯示電路僅在微調後新增一些額外組件的結果相反。基於這些觀察，我們開發了一個電路感知低秩適應 (LoRA) 方法，根據電路中的邊緣變化為層級分配秩。實驗結果證明，我們的基於電路的 LoRA 演算法在參數大小相似的條件下，比標準 LoRA 平均提升了 2.46% 的效能。此外，我們探討如何結合子任務的電路來增強組合任務中的微調，為此類任務的設計提供新的見解，並加深對電路動態和微調機制的理解。
-
-##### **FineFilter: A Fine-grained Noise Filtering Mechanism for Retrieval-Augmented Large Language Models**
-2502.11811v1 by Qianchi Zhang, Hainan Zhang, Liang Pang, Hongwei Zheng, Yongxin Tong, Zhiming Zheng
-
-Retrieved documents containing noise will hinder Retrieval-Augmented
-Generation (RAG) from detecting answer clues, necessitating noise filtering
-mechanisms to enhance accuracy. Existing methods use re-ranking or summarization
-to identify the most relevant sentences, but directly and accurately locating
-answer clues from these large-scale and complex documents remains challenging.
-Unlike these document-level operations, we treat noise filtering as a
-sentence-level MinMax optimization problem: first identifying the potential
-clues from multiple documents using contextual information, then ranking them
-by relevance, and finally retaining the fewest clues through truncation. In this
-paper, we propose FineFilter, a novel fine-grained noise filtering mechanism
-for RAG consisting of a clue extractor, a re-ranker, and a truncator. We
-optimize each module to tackle complex reasoning challenges: (1) the clue extractor
-first uses sentences containing the answer, and similar ones, as fine-tuning
-targets, aiming to extract sufficient potential clues; (2) the re-ranker is
-trained to prioritize effective clues based on real feedback from the
-generation module, with clues capable of generating the correct answer as positive
-samples and others as negative; (3) the truncator takes the minimum clues needed to
-answer the question (the truncation point) as fine-tuning targets, and performs
-truncation on the re-ranked clues to achieve fine-grained noise filtering.
-Experiments on three QA datasets demonstrate that FineFilter significantly
-outperforms baselines in terms of performance and inference cost. Further
-analysis on each module shows the effectiveness of our optimizations for
-complex reasoning.
-
-摘要：檢索到含有雜訊的文件會阻礙檢索增強生成 (RAG) 偵測答案線索，因此需要雜訊過濾機制來增強準確性。現有方法使用重新排序或摘要來找出最相關的句子，但從這些大規模且複雜的文件中直接且準確地找出答案線索仍然具有挑戰性。與這些文件層級的操作不同，我們將雜訊過濾視為一個句子層級的 MinMax 最佳化問題：首先使用脈絡資訊從多個文件中找出潛在線索，接著依據相關性對它們進行排序，最後透過截斷保留最少的線索。在本文中，我們提出 FineFilter，一種創新的細緻雜訊過濾機制，用於 RAG，它包含一個線索萃取器、一個重新排序器和一個截斷器。我們最佳化每個模組來應對複雜的推理挑戰：(1) 線索萃取器首先使用包含答案和類似答案的句子作為微調的目標，旨在萃取足夠的潛在線索；(2) 重新排序器經過訓練，根據生成模組的真實回饋來優先處理有效的線索，其中能夠生成正確答案的線索為正樣本，其他則為負樣本；(3) 截斷器將回答問題所需的最小線索 (截斷點) 視為微調的目標，並對重新排序的線索執行截斷，以達成細緻的雜訊過濾。在三個問答資料集上的實驗證實，FineFilter 在效能和推論成本方面都明顯優於基線。進一步分析每個模組顯示，我們的最佳化對於複雜推理而言是有效的。
-
-##### **Revealing Bias Formation in Deep Neural Networks Through the Geometric Mechanisms of Human Visual Decoupling**
-2502.11809v1 by Yanbiao Ma, Bowei Liu, Wei Dai, Jiayi Chen, Shuo Li
-
-Deep neural networks (DNNs) often exhibit biases toward certain categories
-during object recognition, even under balanced training data conditions. The
-intrinsic mechanisms underlying these biases remain unclear. Inspired by the
-human visual system, which decouples object manifolds through hierarchical
-processing to achieve object recognition, we propose a geometric analysis
-framework linking the geometric complexity of class-specific perceptual
-manifolds in DNNs to model bias. Our findings reveal that differences in
-geometric complexity can lead to varying recognition capabilities across
-categories, introducing biases. To support this analysis, we present the
-Perceptual-Manifold-Geometry library, designed for calculating the geometric
-properties of perceptual manifolds.
-
-摘要：深度神經網路 (DNN) 在物件辨識過程中，即使在平衡的訓練資料條件下，通常會對特定類別表現出偏見。這些偏見背後的基本機制仍然不清楚。受人類視覺系統的啟發，人類視覺系統透過階層化處理來解耦物件流形以達成物件辨識，我們提出一個幾何分析架構，將 DNN 中特定類別感知流形的幾何複雜度與模型偏見連結起來。我們的研究結果顯示，幾何複雜度的差異會導致不同類別的辨識能力有所不同，進而造成偏見。為了支持這個分析，我們提出感知流形幾何函式庫，用於計算感知流形的幾何屬性。
-
-##### **Exploring Translation Mechanism of Large Language Models**
-2502.11806v1 by Hongbin Zhang, Kehai Chen, Xuefeng Bai, Xiucheng Li, Min Zhang
-
-Large language models (LLMs) have succeeded remarkably in multilingual
-translation tasks. However, the inherent translation mechanisms of LLMs remain
-poorly understood, largely due to sophisticated architectures and vast
-parameter scales. In response to this issue, this study explores the
-translation mechanism of LLMs from the perspective of computational components
-(e.g., attention heads and MLPs). Path patching is utilized to explore causal
-relationships between components, detecting those crucial for translation tasks
-and subsequently analyzing their behavioral patterns in human-interpretable
-terms. Comprehensive analysis reveals that translation is predominantly
-facilitated by a sparse subset of specialized attention heads (less than 5\%),
-which extract source language, indicator, and positional features. MLPs
-subsequently integrate and process these features by transitioning towards
-English-centric latent representations. Notably, building on the above
-findings, targeted fine-tuning of only 64 heads achieves translation
-improvement comparable to full-parameter tuning while preserving general
-capabilities.
-
-摘要：大型語言模型 (LLM) 在多語言翻譯任務中取得了顯著的成功。然而，LLM 內在的翻譯機制仍未被很好地理解，這主要是由於複雜的架構和龐大的參數規模。為了應對這個問題，本研究從計算元件（例如注意力頭和 MLP）的角度探討了 LLM 的翻譯機制。路徑修補用於探索元件之間的因果關係，檢測對翻譯任務至關重要的元件，並隨後以人類可解釋的方式分析它們的行為模式。綜合分析表明，翻譯主要由稀疏的專門注意力頭（不到 5%）促進，這些注意力頭提取源語言、指標和位置特徵。MLPs 隨後通過轉換為以英語為中心的潛在表示來整合和處理這些特徵。值得注意的是，根據上述發現，僅對 64 個頭進行有針對性的微調，即可實現與全參數調整相當的翻譯改進，同時保留一般能力。
-
-##### **Table-Critic: A Multi-Agent Framework for Collaborative Criticism and Refinement in Table Reasoning**
-2502.11799v1 by Peiying Yu, Guoxin Chen, Jingjing Wang
-
-Despite the remarkable capabilities of large language models (LLMs) in
-various reasoning tasks, they still struggle with table reasoning tasks,
-particularly in maintaining consistency throughout multi-step reasoning
-processes. While existing approaches have explored various decomposition
-strategies, they often lack effective mechanisms to identify and correct errors
-in intermediate reasoning steps, leading to cascading error propagation. To
-address these issues, we propose Table-Critic, a novel multi-agent framework
-that facilitates collaborative criticism and iterative refinement of the
-reasoning process until convergence to correct solutions. Our framework
-consists of four specialized agents: a Judge for error identification, a Critic
-for comprehensive critiques, a Refiner for process improvement, and a Curator
-for pattern distillation. To effectively deal with diverse and unpredictable
-error types, we introduce a self-evolving template tree that systematically
-accumulates critique knowledge through experience-driven learning and guides
-future reflections. Extensive experiments have demonstrated that Table-Critic
-achieves substantial improvements over existing methods, achieving superior
-accuracy and error correction rates while maintaining computational efficiency
-and lower solution degradation rate.
-
-摘要：儘管大型語言模型 (LLM) 在各種推理任務中展現出非凡的能力，它們在表格推理任務中仍面臨挑戰，特別是在多步驟推理過程中維持一致性方面。現有方法雖然探索了各種分解策略，但它們通常缺乏有效機制來識別和修正中間推理步驟中的錯誤，導致錯誤遞增。為了解決這些問題，我們提出 Table-Critic，一個新穎的多代理架構，它促進協作批評和反覆改進推理過程，直到收斂到正確的解決方案。我們的架構包含四個專業代理：用於錯誤識別的法官、用於全面批評的批評者、用於流程改進的精煉器，以及用於模式萃取的策展人。為了有效處理多樣且不可預測的錯誤類型，我們引入了一個自演化範本樹，它透過經驗驅動的學習系統性地累積批評知識，並引導未來的反思。廣泛的實驗證明，Table-Critic 在現有方法的基礎上取得了顯著的進步，在維持運算效率和較低解決方案劣化率的同時，達到了更高的準確度和錯誤修正率。
-
-##### **Personality Editing for Language Models through Relevant Knowledge Editing**
-2502.11789v1 by Seojin Hwang, Yumin Kim, Byeongjeong Kim, Hwanhee Lee
-
-Large Language Models (LLMs) play a vital role in applications like
-conversational agents and content creation, where controlling a model's
-personality is crucial for maintaining tone, consistency, and engagement.
-However, traditional prompt-based techniques for controlling personality often
-fall short, as they do not effectively mitigate the model's inherent biases. In
-this paper, we introduce a novel method PALETTE that enhances personality
-control through knowledge editing. By generating adjustment queries inspired by
-psychological assessments, our approach systematically adjusts responses to
-personality-related queries similar to modifying factual knowledge, thereby
-achieving controlled shifts in personality traits. Experimental results from
-both automatic and human evaluations demonstrate that our method enables more
-stable and well-balanced personality control in LLMs.
-
-摘要：大型語言模型 (LLM) 在會話代理和內容創作等應用程式中扮演至關重要的角色，其中控制模型的人格特質對於維持語氣、一致性和參與度至關重要。然而，傳統基於提示的控制人格技術通常無法達到預期效果，因為它們無法有效減輕模型固有的偏差。在本文中，我們介紹一種創新的方法 PALETTE，它通過知識編輯來增強人格控制。透過產生受心理評量啟發的調整查詢，我們的做法系統性地調整對人格相關查詢的回應，類似於修改事實知識，從而實現人格特質的受控轉變。來自自動和人工評估的實驗結果表明，我們的模型能夠在 LLM 中實現更穩定且均衡的人格控制。
-
-##### **Efficient Response Generation Method Selection for Fine-Tuning Large Language Models**
-2502.11779v1 by Xuan Ren, Qi Chen, Lingqiao Liu
-
-The training data for fine-tuning large language models (LLMs) is typically
-structured as input-output pairs. However, for many tasks, there can be
-multiple equally valid output variations for the same input. Recent studies
-have observed that the choice of output variation used in training can affect
-the model's performance. This raises an important question: how can we generate
-the most effective output from the many possible response generation strategy
-options? Rather than relying on the traditional but resource-intensive
-train-and-evaluate approach, this paper proposes a scalable, approximate method
-for estimating the quality of a small subset of generated training data derived
-from the same input. We then evaluate how well this small subset of generated
-output fits the target model we are trying to train. We present a large-scale
-benchmark covering diverse reasoning-based datasets to support our study.
-  The central idea is that a good output should closely resemble the output
-generated by the target LLM. We formalize this 'closeness' as the expected
-alignment score between a candidate output and the output sampled from the
-target LLM. We connect this measurement to the perplexity metric used in
-previous literature and demonstrate that leveraging an alignment-based metric
-can provide better predictions of model performance. Using this strategy, we
-can evaluate a small subset of the generated output from each response
-generation strategy option, then select the most effective strategy. We show
-that an LLM trained on data generated by the selected strategy could lead to a
-significant performance gain in many cases.
-
-摘要：大型語言模型 (LLM) 的微調訓練資料通常
-以輸入輸出配對結構化。然而，對於許多任務而言，相同的輸入可能有多個同樣有效的輸出變化。最近的研究
-觀察到訓練中使用的輸出變化選擇會影響模型的效能。這引發了一個重要問題：我們如何從許多可能的回應產生策略選項中產生最有效的輸出？本文提出一個可擴充、近似的方法，用於估計從相同輸入衍生的訓練資料小子集的品質，而非依賴傳統但資源密集的訓練和評估方法。然後我們評估這個產生輸出的小子集與我們嘗試訓練的目標模型的契合程度。我們提出一個涵蓋各種基於推理的資料集的大規模基準，以支持我們的研究。
-核心概念是良好的輸出應與目標 LLM 產生的輸出密切相似。我們將這種「接近度」形式化為候選輸出與從目標 LLM 取樣的輸出之間的預期對齊分數。我們將此測量連接到先前文獻中使用的困惑度指標，並證明利用基於對齊的指標可以提供更好的模型效能預測。使用此策略，我們可以評估每個回應產生策略選項所產生輸出的小子集，然後選擇最有效的策略。我們展示在由所選策略產生的資料上訓練的 LLM，在許多情況下可能導致顯著的效能提升。
-
-##### **Deep Neural Networks for Accurate Depth Estimation with Latent Space Features**
-2502.11777v1 by Siddiqui Muhammad Yasir, Hyunsik Ahn
-
-Depth estimation plays a pivotal role in advancing human-robot interactions,
-especially in indoor environments where accurate 3D scene reconstruction is
-essential for tasks like navigation and object handling. Monocular depth
-estimation, which relies on a single RGB camera, offers a more affordable
-solution compared to traditional methods that use stereo cameras or LiDAR.
-However, despite recent progress, many monocular approaches struggle with
-accurately defining depth boundaries, leading to less precise reconstructions.
-In response to these challenges, this study introduces a novel depth estimation
-framework that leverages latent space features within a deep convolutional
-neural network to enhance the precision of monocular depth maps. The proposed
-model features dual encoder-decoder architecture, enabling both color-to-depth
-and depth-to-depth transformations. This structure allows for refined depth
-estimation through latent space encoding. To further improve the accuracy of
-depth boundaries and local features, a new loss function is introduced. This
-function combines latent loss with gradient loss, helping the model maintain
-the integrity of depth boundaries. The framework is thoroughly tested using the
-NYU Depth V2 dataset, where it sets a new benchmark, particularly excelling in
-complex indoor scenarios. The results clearly show that this approach
-effectively reduces depth ambiguities and blurring, making it a promising
-solution for applications in human-robot interaction and 3D scene
-reconstruction.
-
-摘要：深度估計在推進人機互動方面發揮著至關重要的作用，特別是在室內環境中，準確的 3D 場景重建對於導航和物體處理等任務至關重要。單目深度估計依賴於單個 RGB 相機，與使用立體相機或 LiDAR 的傳統方法相比，它提供了一個更經濟的解決方案。然而，儘管最近取得了進展，許多單目方法在準確定義深度邊界方面仍然存在困難，從而導致重建精度降低。為了應對這些挑戰，本研究引入了一個新穎的深度估計框架，該框架利用深度卷積神經網路中的潛在空間特徵來增強單目深度圖的精度。所提出的模型採用雙編碼器-解碼器架構，既能進行顏色到深度的轉換，又能進行深度到深度的轉換。這種結構允許通過潛在空間編碼進行精確的深度估計。為了進一步提高深度邊界和局部特徵的精度，引入了一個新的損失函數。此函數將潛在損失與梯度損失相結合，幫助模型維護深度邊界的完整性。使用 NYU Depth V2 數據集對該框架進行了全面測試，在該數據集上，它設定了一個新的基準，特別是在複雜的室內場景中表現出色。結果清楚地表明，這種方法有效地減少了深度模糊和模糊，使其成為人機互動和 3D 場景重建應用中一種有前途的解決方案。
-
-##### **The Validation Gap: A Mechanistic Analysis of How Language Models Compute Arithmetic but Fail to Validate It**
-2502.11771v1 by Leonardo Bertolazzi, Philipp Mondorf, Barbara Plank, Raffaella Bernardi
-
-The ability of large language models (LLMs) to validate their output and
-identify potential errors is crucial for ensuring robustness and reliability.
-However, current research indicates that LLMs struggle with self-correction,
-encountering significant challenges in detecting errors. While studies have
-explored methods to enhance self-correction in LLMs, relatively little
-attention has been given to understanding the models' internal mechanisms
-underlying error detection. In this paper, we present a mechanistic analysis of
-error detection in LLMs, focusing on simple arithmetic problems. Through
-circuit analysis, we identify the computational subgraphs responsible for
-detecting arithmetic errors across four smaller-sized LLMs. Our findings reveal
-that all models heavily rely on *consistency heads*: attention heads
-that assess surface-level alignment of numerical values in arithmetic
-solutions. Moreover, we observe that the models' internal arithmetic
-computation primarily occurs in higher layers, whereas validation takes place
-in middle layers, before the final arithmetic results are fully encoded. This
-structural dissociation between arithmetic computation and validation seems to
-explain why current LLMs struggle to detect even simple arithmetic errors.
-
-摘要：大型語言模型 (LLM) 驗證其輸出並識別潛在錯誤的能力對於確保穩健性和可靠性至關重要。
-然而，目前的研究所示，LLM 難以進行自我修正，在檢測錯誤時遇到重大挑戰。儘管研究已探討增強 LLM 自我修正的方法，但對於瞭解模型內部錯誤檢測機制卻關注較少。在本文中，我們提出對 LLM 中錯誤檢測的機制分析，重點關注簡單的算術問題。通過電路分析，我們識別出負責檢測四個較小規模 LLM 中算術錯誤的計算子圖。我們的研究結果表明，所有模型都嚴重依賴於「一致性頭部」--注意頭部，用於評估算術解中數值表面的對齊方式。此外，我們觀察到模型的內部算術運算主要發生在較高層，而驗證則發生在中間層，在最終算術結果完全編碼之前。算術運算和驗證之間的這種結構性分離似乎解釋了為什麼當前的 LLM 難以檢測到即使是簡單的算術錯誤。
-
-##### **Cognitive-Aligned Document Selection for Retrieval-augmented Generation**
-2502.11770v1 by Bingyu Wan, Fuxi Zhang, Zhongpeng Qi, Jiayi Ding, Jijun Li, Baoshi Fan, Yijia Zhang, Jun Zhang
-
-Large language models (LLMs) inherently display hallucinations since the
-precision of generated texts cannot be guaranteed purely by the parametric
-knowledge they include. Although retrieval-augmented generation (RAG) systems
-enhance the accuracy and reliability of generative models by incorporating
-external documents, these retrieved documents often fail to adequately support
-the model's responses in practical applications. To address this issue, we
-propose GGatrieval (Fine-**G**rained **G**rounded **A**lignment
-Re**trieval** for verifiable generation), which leverages an LLM to
-dynamically update queries and filter high-quality, reliable retrieval
-documents. Specifically, we parse the user query into its syntactic components
-and perform fine-grained grounded alignment with the retrieved documents. For
-query components that cannot be individually aligned, we propose a dynamic
-semantic compensation mechanism that iteratively refines and rewrites the query
-while continuously updating the retrieval results. This iterative process
-continues until the retrieved documents sufficiently support the query's
-response. Our approach introduces a novel criterion for filtering retrieved
-documents, closely emulating human strategies for acquiring targeted
-information. This ensures that the retrieved content effectively supports and
-verifies the generated outputs. On the ALCE benchmark, our method significantly
-surpasses a wide range of baselines, achieving state-of-the-art performance.
-
-摘要：大型語言模型 (LLM) 本質上會出現幻覺，因為生成的文本的準確性無法僅透過它們包含的參數化知識來保證。儘管檢索增強生成 (RAG) 系統透過納入外部文件來提升生成模型的準確性和可靠性，但這些檢索的文件在實際應用中常常無法充分支援模型的回應。為了解決這個問題，我們提出 GGatrieval（用於可驗證生成的精細化粒度化基礎對齊檢索），它利用 LLM 來動態更新查詢並過濾高品質、可靠的檢索文件。具體來說，我們將使用者查詢分析成其語法組成部分，並對檢索文件執行精細化粒度化基礎對齊。對於無法個別對齊的查詢組成部分，我們提出一個動態語義補償機制，在持續更新檢索結果的同時，反覆修正和重寫查詢。這個反覆的程序會持續到檢索的文件充分支援查詢的回應為止。我們的做法引進了一個新的檢索文件過濾標準，嚴密地模擬人類獲取目標資訊的策略。這確保檢索的內容有效地支援和驗證生成的輸出。在 ALCE 基準測試中，我們的做法顯著超越各種基線，達成最先進的效能。
-
-##### **From Selection to Generation: A Survey of LLM-based Active Learning**
-2502.11767v1 by Yu Xia, Subhojyoti Mukherjee, Zhouhang Xie, Junda Wu, Xintong Li, Ryan Aponte, Hanjia Lyu, Joe Barrow, Hongjie Chen, Franck Dernoncourt, Branislav Kveton, Tong Yu, Ruiyi Zhang, Jiuxiang Gu, Nesreen K. Ahmed, Yu Wang, Xiang Chen, Hanieh Deilamsalehy, Sungchul Kim, Zhengmian Hu, Yue Zhao, Nedim Lipka, Seunghyun Yoon, Ting-Hao Kenneth Huang, Zichao Wang, Puneet Mathur, Soumyabrata Pal, Koyel Mukherjee, Zhehao Zhang, Namyong Park, Thien Huu Nguyen, Jiebo Luo, Ryan A. Rossi, Julian McAuley
-
-Active Learning (AL) has been a powerful paradigm for improving model
-efficiency and performance by selecting the most informative data points for
-labeling and training. In recent active learning frameworks, Large Language
-Models (LLMs) have been employed not only for selection but also for generating
-entirely new data instances and providing more cost-effective annotations.
-Motivated by the increasing importance of high-quality data and efficient model
-training in the era of LLMs, we present a comprehensive survey on LLM-based
-Active Learning. We introduce an intuitive taxonomy that categorizes these
-techniques and discuss the transformative roles LLMs can play in the active
-learning loop. We further examine the impact of AL on LLM learning paradigms
-and its applications across various domains. Finally, we identify open
-challenges and propose future research directions. This survey aims to serve as
-an up-to-date resource for researchers and practitioners seeking to gain an
-intuitive understanding of LLM-based AL techniques and deploy them to new
-applications.
-
-摘要：主動學習 (AL) 透過挑選最具資訊性的資料點來標記和訓練，已成為一種強大的範例，用以提升模型效率和效能。在最近的主動學習架構中，大型語言模型 (LLM) 不僅用於挑選，也用於產生全新的資料實例，並提供更具成本效益的註解。在大型語言模型時代，由於高品質資料和高效能模型訓練日益重要，我們針對基於大型語言模型的主動學習提出了一項全面的調查。我們提出一個直覺式的分類法，用以分類這些技術，並探討大型語言模型在主動學習迴圈中可以扮演的轉型角色。我們進一步探討主動學習對大型語言模型學習範例的影響，以及它在各種領域中的應用。最後，我們找出開放式挑戰，並提出未來的研究方向。本調查旨在作為研究人員和實務工作者的最新資源，用以獲得對基於大型語言模型的主動學習技術的直覺式理解，並將其部署至新的應用程式。
-
-##### **Warmup-Distill: Bridge the Distribution Mismatch between Teacher and Student before Knowledge Distillation**
-2502.11766v1 by Zengkui Sun, Yijin Liu, Fandong Meng, Yufeng Chen, Jinan Xu, Jie Zhou
-
-The widespread deployment of Large Language Models (LLMs) is hindered by the
-high computational demands, making knowledge distillation (KD) crucial for
-developing compact smaller ones. However, conventional KD methods suffer from
-a distribution mismatch between the teacher and student models, leading
-to poor distillation performance. For instance, the widely used KL-based
-methods suffer from mode-averaging and mode-collapsing problems because of the
-mismatched probability distributions of the two models. Previous studies
-mainly address this issue via different distance calculations between the
-distributions of the two models. Unfortunately, the distribution mismatch issue
-still exists in the early stage of the distillation. Hence, to reduce the
-impact of distribution mismatch, we propose a simple yet efficient method,
-named Warmup-Distill, which aligns the distillation of the student to that of
-the teacher in advance of distillation. Specifically, we first detect the
-distribution of the student model in practical scenarios with its internal
-knowledge, and then modify the knowledge with low probability via the teacher
-as the checker. Consequently, Warmup-Distill aligns the student's internal
-knowledge to that of the teacher, which expands the distribution of the student
-with the teacher's, and helps the student model learn better in the
-subsequent distillation. Experiments on seven benchmarks demonstrate that
-Warmup-Distill could provide a warmup student more suitable for distillation,
-which outperforms the vanilla student by at least +0.4 in average score across all
-benchmarks. Notably, with the assistance of Warmup-Distill, the distillation
-on the math task could yield a further improvement of up to +1.9% accuracy.
-
-摘要：大型語言模型 (LLM) 的廣泛部署受到高運算需求的阻礙，這使得知識蒸餾 (KD) 對於開發緊湊型的小型模型至關重要。然而，傳統的 KD 方法忍受了教師和學生模型之間的分布不匹配問題，導致蒸餾效果不佳。例如，廣泛使用的基於 KL 的方法會出現模式平均和模式崩潰問題，因為兩個模型之間的機率分佈不匹配。先前的研究主要透過不同的距離計算來最佳化這個問題，以朝向兩個模型的分布。不幸的是，分布不匹配的問題仍然存在於蒸餾的早期階段。因此，為了減少分布不匹配的影響，我們提出了一種簡單但有效的方法，稱為 Warmup-Distill，它在蒸餾之前將學生的蒸餾與教師的蒸餾對齊。具體來說，我們首先使用其內部知識在實際場景中檢測學生的分布，然後透過教師作為檢查員修改低機率的知識。因此，Warmup-Distill 將學生的內部知識與教師的知識對齊，這會將學生的分布擴展到教師的分布，並協助學生模型在後續的蒸餾中學習得更好。在七個基準測試上的實驗表明，Warmup-Distill 可以提供更適合蒸餾的熱身學生，在所有基準測試中，其表現優於香草學生至少 +0.4 的平均分數。值得注意的是，在 Warmup-Distill 的協助下，數學任務上的蒸餾可以進一步提升，最多可提升 +1.9% 的準確度。
-
-##### **Lightweight Deepfake Detection Based on Multi-Feature Fusion**
-2502.11763v1 by Siddiqui Muhammad Yasir, Hyun Kim
-
-Deepfake technology utilizes deep-learning-based face manipulation techniques
-to seamlessly replace faces in videos, creating highly realistic but
-artificially generated content. Although this technology has beneficial
-applications in media and entertainment, misuse of its capabilities may lead to
-serious risks including identity theft, cyberbullying, and false information. The
-integration of DL with visual cognition has resulted in important technological
-improvements, particularly in addressing privacy risks caused by artificially
-generated deepfake images on digital media platforms. In this study, we propose
-an efficient and lightweight method for detecting deepfake images and videos,
-making it suitable for devices with limited computational resources. In order
-to reduce the computational burden usually associated with DL models, our method
-integrates machine learning classifiers in combination with keyframing
-approaches and texture analysis. Moreover, the features extracted with a
-histogram of oriented gradients (HOG), local binary pattern (LBP), and KAZE bands
-were integrated and evaluated using random forest, extreme gradient boosting, extra
-trees, and support vector classifier algorithms. Our findings show a
-feature-level fusion of HOG, LBP, and KAZE features improves accuracy to 92% and
-96% on FaceForensics++ and Celeb-DFv2, respectively.
-
-摘要：深度偽造技術利用基於深度學習的換臉技術，可無縫替換影片中的臉孔，創造出高度逼真但人工產生的內容。儘管這項技術在媒體和娛樂方面有益，但若誤用其功能可能會導致嚴重的風險，包括身分盜用、網路霸凌和虛假訊息。深度學習與視覺認知的整合已帶來重要的技術進步，特別是在解決由數位媒體平台上的人工深度偽造影像所造成的隱私風險方面。在本研究中，我們提出了一種用於偵測深度偽造影像和影片的有效且輕量級的方法，使其適用於運算資源有限的裝置。為了降低通常與深度學習模型相關的運算負擔，我們的做法結合了機器學習分類器、關鍵影格方法和紋理分析。此外，我們整合了使用方向梯度直方圖 (HOG)、局部二進位模式 (LBP) 和 KAZE 頻段所萃取出的特徵，並使用隨機森林、極端梯度提升、額外樹木和支援向量分類器演算法進行評估。我們的研究結果顯示，HOG、LBP 和 KAZE 特徵的層級融合將準確度提升至 92%，分別在 FaceForensics++ 和 Celeb-DFv2 上達到 96%。
-
-##### **HintsOfTruth: A Multimodal Checkworthiness Detection Dataset with Real and Synthetic Claims**
-2502.11753v1 by Michiel van der Meer, Pavel Korshunov, Sébastien Marcel, Lonneke van der Plas
-
-Misinformation can be countered with fact-checking, but the process is costly
-and slow. Identifying checkworthy claims is the first step, where automation
-can help scale fact-checkers' efforts. However, detection methods struggle with
-content that is 1) multimodal, 2) from diverse domains, and 3) synthetic. We
-introduce HintsOfTruth, a public dataset for multimodal checkworthiness
-detection with $27$K real-world and synthetic image/claim pairs. The mix of
-real and synthetic data makes this dataset unique and ideal for benchmarking
-detection methods. We compare fine-tuned and prompted Large Language Models
-(LLMs). We find that well-configured lightweight text-based encoders perform
-comparably to multimodal models, but the former focus only on identifying
-non-claim-like content. Multimodal LLMs can be more accurate but come at a
-significant computational cost, making them impractical for large-scale
-applications. When faced with synthetic data, multimodal models perform more
-robustly.
-
-摘要：錯誤訊息可以透過事實查核來反駁，但這個過程既昂貴又緩慢。辨識需要查核的說法是第一步，自動化可以幫助擴大事實查核人員的努力。然而，偵測方法會在處理 1) 多模態、2) 來自不同領域，以及 3) 合成的內容時遇到困難。我們引進 HintsOfTruth，一個用於多模態查核價值偵測的公開資料集，其中包含 27K 個真實世界和合成的影像/說法配對。真實和合成資料的組合讓這個資料集獨一無二，非常適合用於基準偵測方法。我們比較微調和提示的大語言模型 (LLM)。我們發現，設定良好的輕量級文字編碼器的表現與多模態模型相當，但前者只專注於辨識非說法類型的內容。多模態 LLM 可能更準確，但需要大量的運算成本，這讓它們不適用於大規模的應用。在面對合成資料時，多模態模型的表現更強健。
-
-##### **Language Models Can See Better: Visual Contrastive Decoding For LLM Multimodal Reasoning**
-2502.11751v1 by Yuqi Pang, Bowen Yang, Haoqin Tu, Yun Cao, Zeyu Zhang
-
-Although Large Language Models (LLMs) excel in reasoning and generation for
-language tasks, they are not specifically designed for multimodal challenges.
-Training Multimodal Large Language Models (MLLMs), however, is
-resource-intensive and constrained by various training limitations. In this
-paper, we propose the Modular-based Visual Contrastive Decoding (MVCD)
-framework to overcome this obstacle. Our framework leverages LLMs' In-Context
-Learning (ICL) capability and the proposed visual contrastive-example decoding
-(CED), specifically tailored for this framework, without requiring any
-additional training. By converting visual signals into text and focusing on
-contrastive output distributions during decoding, we can highlight the new
-information introduced by contextual examples, explore their connections, and
-avoid over-reliance on prior encoded knowledge. MVCD enhances LLMs' visual
-perception so that they can see and reason over the input visuals. To demonstrate
-MVCD's effectiveness, we conduct experiments with four LLMs across five
-question answering datasets. Our results not only show consistent improvement
-in model accuracy but also clearly explain the effective components of our decoding
-strategy. Our code will be available at https://github.com/Pbhgit/MVCD.
-
-摘要：儘管大型語言模型 (LLM) 在語言任務的推理和生成方面表現優異，但它們並非專門針對多模態挑戰而設計。然而，訓練多模態大型語言模型 (MLLM) 十分耗費資源，並受到各種訓練限制。在本文中，我們提出基於模組的視覺對比解碼 (MVCD) 架構來克服這個障礙。我們的架構利用 LLM 的情境學習 (ICL) 能力和專門為此架構量身打造的視覺對比範例解碼 (CED)，而無需任何額外訓練。透過將視覺信號轉換為文字，並在解碼過程中專注於對比輸出分佈，我們可以突顯情境範例引入的新資訊，探索它們的關聯性，並避免過度依賴先前編碼的知識。MVCD 增強了 LLM 的視覺感知能力，使其能夠觀察並推論輸入視覺效果。為了證明 MVCD 的有效性，我們使用四個 LLM 在五個問答資料集上進行實驗。我們的結果不僅顯示模型準確度持續提升，還能清楚說明我們的解碼策略中的有效組成部分。我們的程式碼將在 https://github.com/Pbhgit/MVCD 公開。
-
-##### **SQL-o1: A Self-Reward Heuristic Dynamic Search Method for Text-to-SQL**
-2502.11741v1 by Shuai Lyu, Haoran Luo, Zhonghong Ou, Yifan Zhu, Xiaoran Shang, Yang Qin, Meina Song
-
-The Text-to-SQL (Text2SQL) task aims to convert natural language queries into
-executable SQL queries. Thanks to the application of large language models
-(LLMs), significant progress has been made in this field. However, challenges
-such as model scalability, limited generation space, and coherence issues in
-SQL generation still persist. To address these issues, we propose SQL-o1, a
-Self-Reward-based heuristic search method designed to enhance the reasoning
-ability of LLMs in SQL query generation. SQL-o1 combines Monte Carlo Tree
-Search (MCTS) for heuristic process-level search and constructs a Schema-Aware
-dataset to help the model better understand database schemas. Extensive
-experiments on the Bird and Spider datasets demonstrate that SQL-o1 improves
-execution accuracy by 10.8% on the complex Bird dataset compared to the latest
-baseline methods, even outperforming GPT-4-based approaches. Additionally,
-SQL-o1 excels in few-shot learning scenarios and shows strong cross-model
-transferability. Our code is publicly available
-at: https://github.com/ShuaiLyu0110/SQL-o1.
-
-摘要：文本转 SQL（Text2SQL）任务旨在将自然语言查询转换为可执行的 SQL 查询。得益于大型语言模型（LLM）的应用，该领域取得了显著进展。然而，模型可扩展性、生成空间受限和 SQL 生成的连贯性问题等挑战仍然存在。为了解决这些问题，我们提出了 SQL-o1，这是一种基于自我奖励的启发式搜索方法，旨在增强 LLM 在 SQL 查询生成中的推理能力。SQL-o1 结合了蒙特卡罗树搜索（MCTS）用于启发式过程级搜索，并构建了一个模式感知数据集，以帮助模型更好地理解数据库模式。在 Bird 和 Spider 数据集上的大量实验表明，与最新的基准方法相比，SQL-o1 将复杂 Bird 数据集上的执行准确率提高了 10.8%，甚至优于基于 GPT-4 的方法。此外，SQL-o1 在少样本学习场景中表现出色，并显示出强大的跨模型可迁移性。我们的代码已公开发布在：https://github.com/ShuaiLyu0110/SQL-o1。
-
 
 ### Knowledge Graphs
 |Publish Date|Title|Authors|Homepage|Code|
 | :---: | :---: | :---: | :---: | :---: |
+|**2025-02-18**|**Learning to Defer for Causal Discovery with Imperfect Experts**|Oscar Clivio et.al.|[2502.13132v1](http://arxiv.org/abs/2502.13132v1)|null|
+|**2025-02-18**|**Agentic Deep Graph Reasoning Yields Self-Organizing Knowledge Networks**|Markus J. Buehler et.al.|[2502.13025v1](http://arxiv.org/abs/2502.13025v1)|null|
+|**2025-02-18**|**Adaptive Knowledge Graphs Enhance Medical Question Answering: Bridging the Gap Between LLMs and Evolving Medical Knowledge**|Mohammad Reza Rezaei et.al.|[2502.13010v1](http://arxiv.org/abs/2502.13010v1)|null|
+|**2025-02-18**|**R2-KG: General-Purpose Dual-Agent Framework for Reliable Reasoning on Knowledge Graphs**|Sumin Jo et.al.|[2502.12767v1](http://arxiv.org/abs/2502.12767v1)|null|
+|**2025-02-18**|**Perovskite-LLM: Knowledge-Enhanced Large Language Models for Perovskite Solar Cell Research**|Xiang Liu et.al.|[2502.12669v1](http://arxiv.org/abs/2502.12669v1)|null|
+|**2025-02-18**|**G-Refer: Graph Retrieval-Augmented Large Language Model for Explainable Recommendation**|Yuhan Li et.al.|[2502.12586v1](http://arxiv.org/abs/2502.12586v1)|null|
 |**2025-02-17**|**A-MEM: Agentic Memory for LLM Agents**|Wujiang Xu et.al.|[2502.12110v1](http://arxiv.org/abs/2502.12110v1)|null|
 |**2025-02-17**|**KnowPath: Knowledge-enhanced Reasoning via LLM-generated Inference Paths over Knowledge Graphs**|Qi Zhao et.al.|[2502.12029v1](http://arxiv.org/abs/2502.12029v1)|null|
 |**2025-02-17**|**Atom of Thoughts for Markov LLM Test-Time Scaling**|Fengwei Teng et.al.|[2502.12018v1](http://arxiv.org/abs/2502.12018v1)|null|
@@ -7841,7 +5367,7 @@ at:https://github.com/ShuaiLyu0110/SQL-o1.
 |**2025-02-06**|**Division-of-Thoughts: Harnessing Hybrid Language Model Synergy for Efficient On-Device Agents**|Chenyang Shao et.al.|[2502.04392v1](http://arxiv.org/abs/2502.04392v1)|null|
 |**2025-02-06**|**Boosting Knowledge Graph-based Recommendations through Confidence-Aware Augmentation with Large Language Models**|Rui Cai et.al.|[2502.03715v1](http://arxiv.org/abs/2502.03715v1)|null|
 |**2025-02-05**|**A Schema-Guided Reason-while-Retrieve framework for Reasoning on Scene Graphs with Large-Language-Models (LLMs)**|Yiye Chen et.al.|[2502.03450v1](http://arxiv.org/abs/2502.03450v1)|null|
-|**2025-02-05**|**SymAgent: A Neural-Symbolic Self-Learning Agent Framework for Complex Reasoning over Knowledge Graphs**|Ben Liu et.al.|[2502.03283v1](http://arxiv.org/abs/2502.03283v1)|null|
+|**2025-02-05**|**SymAgent: A Neural-Symbolic Self-Learning Agent Framework for Complex Reasoning over Knowledge Graphs**|Ben Liu et.al.|[2502.03283v2](http://arxiv.org/abs/2502.03283v2)|null|
 |**2025-02-05**|**Analyze Feature Flow to Enhance Interpretation and Steering in Language Models**|Daniil Laptev et.al.|[2502.03032v2](http://arxiv.org/abs/2502.03032v2)|null|
 |**2025-02-05**|**A Benchmark for the Detection of Metalinguistic Disagreements between LLMs and Knowledge Graphs**|Bradley P. Allen et.al.|[2502.02896v1](http://arxiv.org/abs/2502.02896v1)|null|
 |**2025-02-05**|**Mol-LLM: Generalist Molecular LLM with Improved Graph Utilization**|Chanhui Lee et.al.|[2502.02810v1](http://arxiv.org/abs/2502.02810v1)|null|
@@ -7871,14 +5397,163 @@ at:https://github.com/ShuaiLyu0110/SQL-o1.
 |**2025-01-29**|**Query-Aware Learnable Graph Pooling Tokens as Prompt for Large Language Models**|Wooyoung Kim et.al.|[2501.17549v1](http://arxiv.org/abs/2501.17549v1)|null|
 |**2025-01-29**|**General Scene Adaptation for Vision-and-Language Navigation**|Haodong Hong et.al.|[2501.17403v1](http://arxiv.org/abs/2501.17403v1)|[link](https://github.com/honghd16/gsa-vln)|
 |**2025-01-28**|**Comprehensive Evaluation for a Large Scale Knowledge Graph Question Answering Service**|Saloni Potdar et.al.|[2501.17270v1](http://arxiv.org/abs/2501.17270v1)|null|
-|**2025-01-28**|**FactCG: Enhancing Fact Checkers with Graph-Based Multi-Hop Data**|Deren Lei et.al.|[2501.17144v1](http://arxiv.org/abs/2501.17144v1)|[link](https://github.com/derenlei/factcg)|
-|**2025-01-28**|**LLM-AutoDiff: Auto-Differentiate Any LLM Workflow**|Li Yin et.al.|[2501.16673v2](http://arxiv.org/abs/2501.16673v2)|[link](https://github.com/sylphai-inc/adalflow)|
-|**2025-01-27**|**360Brew: A Decoder-only Foundation Model for Personalized Ranking and Recommendation**|Hamed Firooz et.al.|[2501.16450v3](http://arxiv.org/abs/2501.16450v3)|null|
-|**2025-01-27**|**Raiders of the Lost Dependency: Fixing Dependency Conflicts in Python using LLMs**|Antony Bartlett et.al.|[2501.16191v1](http://arxiv.org/abs/2501.16191v1)|null|
-|**2025-01-27**|**Harnessing Diverse Perspectives: A Multi-Agent Framework for Enhanced Error Detection in Knowledge Graphs**|Yu Li et.al.|[2501.15791v1](http://arxiv.org/abs/2501.15791v1)|[link](https://github.com/kse-eleven/makged)|
-|**2025-01-27**|**Automatic Feedback Generation for Short Answer Questions using Answer Diagnostic Graphs**|Momoka Furuhashi et.al.|[2501.15777v1](http://arxiv.org/abs/2501.15777v1)|null|
 
 #### Abstracts
+##### **Learning to Defer for Causal Discovery with Imperfect Experts**
+2502.13132v1 by Oscar Clivio, Divyat Mahajan, Perouz Taslakian, Sara Magliacane, Ioannis Mitliagkas, Valentina Zantedeschi, Alexandre Drouin
+
+Integrating expert knowledge, e.g. from large language models, into causal
+discovery algorithms can be challenging when the knowledge is not guaranteed to
+be correct. Expert recommendations may contradict data-driven results, and
+their reliability can vary significantly depending on the domain or specific
+query. Existing methods based on soft constraints or inconsistencies in
+predicted causal relationships fail to account for these variations in
+expertise. To remedy this, we propose L2D-CD, a method for gauging the
+correctness of expert recommendations and optimally combining them with
+data-driven causal discovery results. By adapting learning-to-defer (L2D)
+algorithms for pairwise causal discovery (CD), we learn a deferral function
+that selects whether to rely on classical causal discovery methods using
+numerical data or expert recommendations based on textual meta-data. We
+evaluate L2D-CD on the canonical Tübingen pairs dataset and demonstrate its
+superior performance compared to both the causal discovery method and the
+expert used in isolation. Moreover, our approach identifies domains where the
+expert's performance is strong or weak. Finally, we outline a strategy for
+generalizing this approach to causal discovery on graphs with more than two
+variables, paving the way for further research in this area.
+
+摘要：整合专家知識，例如從大型語言模型中整合到因果發現演算法中，當知識無法保證正確時會很有挑戰性。專家建議可能會與資料驅動的結果相矛盾，而且他們的可靠性可能會根據領域或特定查詢而有顯著差異。現有的基於軟約束或預測因果關係中不一致的方法無法說明專業知識中的這些變化。為了補救這一點，我們提出了 L2D-CD，一種用於評估專家建議的正確性並將其與資料驅動的因果發現結果最佳結合的方法。透過調整學習延遲 (L2D) 演算法以進行成對因果發現 (CD)，我們學習了一個延遲函數，用於選擇依賴使用數值資料的傳統因果發現方法或基於文字元資料的專家建議。我們在經典的 T\"ubingen 對資料集上評估 L2D-CD，並證明其與單獨使用的因果發現方法和專家相比具有優越的效能。此外，我們的做法識別出專家表現強或弱的領域。最後，我們概述了一種將此方法推廣到具有兩個以上變數的圖表上進行因果發現的策略，為此領域的進一步研究鋪平了道路。
+
+##### **Agentic Deep Graph Reasoning Yields Self-Organizing Knowledge Networks**
+2502.13025v1 by Markus J. Buehler
+
+We present an agentic, autonomous graph expansion framework that iteratively
+structures and refines knowledge in situ. Unlike conventional knowledge graph
+construction methods relying on static extraction or single-pass learning, our
+approach couples a reasoning-native large language model with a continually
+updated graph representation. At each step, the system actively generates new
+concepts and relationships, merges them into a global graph, and formulates
+subsequent prompts based on its evolving structure. Through this
+feedback-driven loop, the model organizes information into a scale-free network
+characterized by hub formation, stable modularity, and bridging nodes that link
+disparate knowledge clusters. Over hundreds of iterations, new nodes and edges
+continue to appear without saturating, while centrality measures and shortest
+path distributions evolve to yield increasingly distributed connectivity. Our
+analysis reveals emergent patterns, such as the rise of highly connected 'hub'
+concepts and the shifting influence of 'bridge' nodes, indicating that agentic,
+self-reinforcing graph construction can yield open-ended, coherent knowledge
+structures. Applied to materials design problems, we present compositional
+reasoning experiments by extracting node-specific and synergy-level principles
+to foster genuinely novel knowledge synthesis, yielding cross-domain ideas that
+transcend rote summarization and strengthen the framework's potential for
+open-ended scientific discovery. We discuss other applications in scientific
+discovery and outline future directions for enhancing scalability and
+interpretability.
+
+摘要：<paragraph>我們提出一個能動的、自主的圖形擴展框架，它反覆地建構和精煉原位知識。與依賴靜態提取或單次學習的傳統知識圖形建構方法不同，我們的做法將一個推理原生的大語言模型與一個持續更新的圖形表示結合起來。在每一步中，系統主動產生新的概念和關係，將它們合併到一個全域圖形中，並根據其不斷演化的結構制定後續提示。透過這個回饋驅動的迴圈，模型將資訊組織成一個無標度網路，其特徵是樞紐形成、穩定的模組化以及連結不同知識群集的橋接節點。在數百次反覆運算中，新的節點和邊緣會持續出現，而不會飽和，同時中心性測量和最短路徑分佈會演化為產生越來越分散的連通性。我們的分析揭示了新興模式，例如高度連接的「樞紐」概念的興起和「橋樑」節點影響力的轉移，這表明能動的、自我強化的圖形建構可以產生開放式、連貫的知識結構。應用於材料設計問題，我們提出組合推理實驗，透過提取特定於節點的原則和協同效應層級原則，以促進真正新穎的知識綜合，產生超越死背式摘要並強化框架在開放式科學發現中潛力的跨領域想法。我們討論了在科學發現中的其他應用，並概述了增強可擴充性和可解釋性的未來方向。</paragraph>
+
+##### **Adaptive Knowledge Graphs Enhance Medical Question Answering: Bridging the Gap Between LLMs and Evolving Medical Knowledge**
+2502.13010v1 by Mohammad Reza Rezaei, Reza Saadati Fard, Jayson Parker, Rahul G. Krishnan, Milad Lankarany
+
+Large Language Models (LLMs) have significantly advanced medical
+question-answering by leveraging extensive clinical data and medical
+literature. However, the rapid evolution of medical knowledge and the
+labor-intensive process of manually updating domain-specific resources pose
+challenges to the reliability of these systems. To address this, we introduce
+Adaptive Medical Graph-RAG (AMG-RAG), a comprehensive framework that automates
+the construction and continuous updating of medical knowledge graphs,
+integrates reasoning, and retrieves current external evidence, such as PubMed
+and WikiSearch. By dynamically linking new findings and complex medical
+concepts, AMG-RAG not only improves accuracy but also enhances interpretability
+in medical queries.
+Evaluations on the MEDQA and MEDMCQA benchmarks demonstrate the effectiveness
+of AMG-RAG, achieving an F1 score of 74.1 percent on MEDQA and an accuracy of
+66.34 percent on MEDMCQA, outperforming both comparable models and those 10 to
+100 times larger. Notably, these improvements are achieved without increasing
+computational overhead, highlighting the critical role of automated knowledge
+graph generation and external evidence retrieval in delivering up-to-date,
+trustworthy medical insights.
+
+摘要：大型語言模型 (LLM) 透過利用廣泛的臨床資料和醫學文獻，大幅提升了醫療問題解答的進步。然而，醫療知識的快速演進和手動更新特定領域資源的繁複程序，對這些系統的可靠性構成挑戰。為了解決這個問題，我們引入了適應性醫療圖表 RAG (AMG-RAG)，這是一個自動化建構和持續更新醫療知識圖表的綜合架構，整合推理並擷取 PubMed 和 WikiSearch 等最新的外部證據。透過動態連結新的發現和複雜的醫療概念，AMG-RAG 不僅提升了準確性，也增強了醫療查詢的可解釋性。在 MEDQA 和 MEDMCQA 基準上的評量證明了 AMG-RAG 的有效性，在 MEDQA 上達到了 74.1% 的 F1 分數，在 MEDMCQA 上達到了 66.34% 的準確度，優於其他同類模型以及那些大 10 到 100 倍的模型。值得注意的是，這些改進是在不增加運算負擔的情況下實現的，突顯了自動化知識圖表生成和外部證據擷取在提供最新、可信賴的醫療見解中扮演的重要角色。
+
+##### **R2-KG: General-Purpose Dual-Agent Framework for Reliable Reasoning on Knowledge Graphs**
+2502.12767v1 by Sumin Jo, Junseong Choi, Jiho Kim, Edward Choi
+
+Recent studies have combined Large Language Models (LLMs) with Knowledge
+Graphs (KGs) to enhance reasoning, improving inference accuracy without
+additional training while mitigating hallucination. However, existing
+frameworks are often rigid, struggling to adapt to KG or task changes. They
+also rely heavily on powerful LLMs for reliable (i.e., trustworthy) reasoning.
+To address this, we introduce R2-KG, a plug-and-play, dual-agent framework that
+separates reasoning into two roles: an Operator (a low-capacity LLM) that
+gathers evidence and a Supervisor (a high-capacity LLM) that makes final
+judgments. This design is cost-efficient for LLM inference while still
+maintaining strong reasoning accuracy. Additionally, R2-KG employs an
+Abstention mechanism, generating answers only when sufficient evidence is
+collected from KG, which significantly enhances reliability. Experiments across
+multiple KG-based reasoning tasks show that R2-KG consistently outperforms
+baselines in both accuracy and reliability, regardless of the inherent
+capability of LLMs used as the Operator. Further experiments reveal that the
+single-agent version of R2-KG, equipped with a strict self-consistency
+strategy, achieves significantly higher-than-baseline reliability while
+reducing inference cost. However, it also leads to a higher abstention rate in
+complex KGs. Our findings establish R2-KG as a flexible and cost-effective
+solution for KG-based reasoning. It reduces reliance on high-capacity LLMs
+while ensuring trustworthy inference.
+
+摘要：<paragraph>最近的研究结合了大型语言模型 (LLM) 与知识图谱 (KG) 以增强推理，在不额外训练的情况下提高推理准确性，同时减轻幻觉。然而，现有的框架通常很僵化，难以适应知识图谱或任务的变化。它们还严重依赖强大的 LLM 来进行可靠（即值得信赖）的推理。为了解决这个问题，我们引入了 R2-KG，这是一个即插即用、双代理框架，它将推理分为两个角色：一个收集证据的操作员（低容量 LLM）和一个做出最终判断的监督员（高容量 LLM）。这种设计在 LLM 推理方面具有成本效益，同时仍保持强大的推理准确性。此外，R2-KG 采用弃权机制，仅在从知识图谱收集到足够证据时才生成答案，这显著提高了可靠性。跨多个基于知识图谱的推理任务的实验表明，R2-KG 在准确性和可靠性方面始终优于基线，而与用作操作员的 LLM 的固有能力无关。进一步的实验表明，R2-KG 的单代理版本配备了严格的自一致性策略，实现了明显高于基线的可靠性，同时降低了推理成本。然而，它也导致了复杂知识图谱中更高的弃权率。我们的发现将 R2-KG 确立为一种灵活且经济高效的基于知识图谱的推理解决方案。它减少了对高容量 LLM 的依赖，同时确保了可信的推理。</paragraph>
+
+##### **Perovskite-LLM: Knowledge-Enhanced Large Language Models for Perovskite Solar Cell Research**
+2502.12669v1 by Xiang Liu, Penglei Sun, Shuyan Chen, Longhan Zhang, Peijie Dong, Huajie You, Yongqi Zhang, Chang Yan, Xiaowen Chu, Tong-yi Zhang
+
+The rapid advancement of perovskite solar cells (PSCs) has led to an
+exponential growth in research publications, creating an urgent need for
+efficient knowledge management and reasoning systems in this domain. We present
+a comprehensive knowledge-enhanced system for PSCs that integrates three key
+components. First, we develop Perovskite-KG, a domain-specific knowledge graph
+constructed from 1,517 research papers, containing 23,789 entities and 22,272
+relationships. Second, we create two complementary datasets: Perovskite-Chat,
+comprising 55,101 high-quality question-answer pairs generated through a novel
+multi-agent framework, and Perovskite-Reasoning, containing 2,217 carefully
+curated materials science problems. Third, we introduce two specialized large
+language models: Perovskite-Chat-LLM for domain-specific knowledge assistance
+and Perovskite-Reasoning-LLM for scientific reasoning tasks. Experimental
+results demonstrate that our system significantly outperforms existing models
+in both domain-specific knowledge retrieval and scientific reasoning tasks,
+providing researchers with effective tools for literature review, experimental
+design, and complex problem-solving in PSC research.
+
+摘要：由於 perovskite 太陽能電池 (PSC) 快速進展，導致研究出版物呈指數成長，迫切需要在這領域建立有效的知識管理和推理系統。我們提出一個結合三項關鍵元件的 PSC 全面知識增強系統。首先，我們開發出 Perovskite-KG，一個由 1,517 篇研究論文建構而成、包含 23,789 個實體和 22,272 個關係的領域特定知識圖譜。其次，我們建立兩個互補的資料集：Perovskite-Chat，包含透過一個新穎的多代理架構產生 55,101 個高品質問答配對；以及 Perovskite-Reasoning，包含 2,217 個仔細策展的材料科學問題。第三，我們推出兩個專門化大型語言模型：針對領域特定知識協助的 Perovskite-Chat-LLM，以及針對科學推理任務的 Perovskite-Reasoning-LLM。實驗結果顯示，我們的系統在領域特定知識擷取和科學推理任務上都明顯優於現有模型，為研究人員提供有效的工具，用於 PSC 研究中的文獻回顧、實驗設計和複雜問題解決。
+
+##### **G-Refer: Graph Retrieval-Augmented Large Language Model for Explainable Recommendation**
+2502.12586v1 by Yuhan Li, Xinni Zhang, Linhao Luo, Heng Chang, Yuxiang Ren, Irwin King, Jia Li
+
+Explainable recommendation has demonstrated significant advantages in
+informing users about the logic behind recommendations, thereby increasing
+system transparency, effectiveness, and trustworthiness. To provide
+personalized and interpretable explanations, existing works often combine the
+generation capabilities of large language models (LLMs) with collaborative
+filtering (CF) information. CF information extracted from the user-item
+interaction graph captures the user behaviors and preferences, which is crucial
+for providing informative explanations. However, due to the complexity of graph
+structure, effectively extracting the CF information from graphs still remains
+a challenge. Moreover, existing methods often struggle with the integration of
+extracted CF information with LLMs due to its implicit representation and the
+modality gap between graph structures and natural language explanations. To
+address these challenges, we propose G-Refer, a framework using graph
+retrieval-augmented large language models (LLMs) for explainable
+recommendation. Specifically, we first employ a hybrid graph retrieval
+mechanism to retrieve explicit CF signals from both structural and semantic
+perspectives. The retrieved CF information is explicitly formulated as
+human-understandable text by the proposed graph translation and accounts for
+the explanations generated by LLMs. To bridge the modality gap, we introduce
+knowledge pruning and retrieval-augmented fine-tuning to enhance the ability of
+LLMs to process and utilize the retrieved CF information to generate
+explanations. Extensive experiments show that G-Refer achieves superior
+performance compared with existing methods in both explainability and
+stability. Codes and data are available at https://github.com/Yuhan1i/G-Refer.
+
+摘要：可解釋建議已證明在告知使用者建議背後的邏輯方面具有顯著優點，從而提高系統透明度、有效性和可信度。為了提供個人化且可解釋的說明，現有作品通常結合大型語言模型 (LLM) 的生成能力與協同過濾 (CF) 資訊。從使用者項目互動圖形中提取的 CF 資訊會擷取使用者行為和偏好，這對於提供資訊性說明至關重要。然而，由於圖形結構的複雜性，從圖形中有效提取 CF 資訊仍然是一個挑戰。此外，現有方法通常難以將提取的 CF 資訊與 LLM 整合，因為其隱含表示和圖形結構與自然語言說明之間的模式差距。為了應對這些挑戰，我們提出 G-Refer，一個使用圖形檢索增強型大型語言模型 (LLM) 的可解釋建議架構。具體來說，我們首先採用混合圖形檢索機制，從結構和語義角度檢索明確的 CF 訊號。檢索到的 CF 資訊由建議的圖形翻譯明確表述為人類可以理解的文字，並說明 LLM 生成的解釋。為了彌合模式差距，我們引入了知識修剪和檢索增強微調，以增強 LLM 處理和利用檢索到的 CF 資訊以產生解釋的能力。廣泛的實驗表明，與現有方法相比，G-Refer 在可解釋性和穩定性方面都取得了卓越的效能。程式碼和資料可在 https://github.com/Yuhan1i/G-Refer 取得。
+
 ##### **A-MEM: Agentic Memory for LLM Agents**
 2502.12110v1 by Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, Yongfeng Zhang
 
@@ -9399,7 +7074,7 @@ absence of agent-level demonstrations. Project code will be released.
 摘要：場景圖表已成為大型語言模型 (LLM) 以基礎空間推理為基礎的結構化且可序列化的環境表徵。在這項工作中，我們提出 SG-RwR，一個以綱要為導向的檢索與推理框架，用於場景圖表的推理和規劃。我們的做法採用了兩個協作的、編寫程式碼的 LLM 代理：一個 (1) 推論器，用於任務規劃和資訊查詢產生，以及一個 (2) 檢索器，用於根據查詢提取對應的圖形資訊。兩個代理反覆合作，實現對圖形資訊的順序推理和適應性關注。與先前的作品不同，兩個代理僅提示場景圖表綱要，而不是完整的圖形資料，這透過限制輸入代碼減少了幻覺，並驅使推論器抽象地產生推理軌跡。根據軌跡，檢索器根據綱要理解以程式化方式查詢場景圖形資料，允許對圖形進行動態和整體關注，增強推理和檢索之間的一致性。透過在多個模擬環境中的實驗，我們表明我們的框架在數值問答和規劃任務中超越了現有的基於 LLM 的方法，並且可以受益於任務級別的少次範例，即使在沒有代理級別示範的情況下也是如此。專案程式碼將會釋出。
 
 ##### **SymAgent: A Neural-Symbolic Self-Learning Agent Framework for Complex Reasoning over Knowledge Graphs**
-2502.03283v1 by Ben Liu, Jihai Zhang, Fangquan Lin, Cheng Yang, Min Peng, Wotao Yin
+2502.03283v2 by Ben Liu, Jihai Zhang, Fangquan Lin, Cheng Yang, Min Peng, Wotao Yin
 
 Recent advancements have highlighted that Large Language Models (LLMs) are
 prone to hallucinations when solving complex reasoning problems, leading to
@@ -9426,7 +7101,7 @@ better or comparable performance compared to various strong baselines. Further
 analysis reveals that our agent can identify missing triples, facilitating
 automatic KG updates.
 
-摘要：<paragraph>最近的研究表明，大型语言模型 (LLM) 在解决复杂的推理问题时容易出现幻觉，从而导致错误的结果。为了解决这个问题，研究人员结合了知识图谱 (KG) 来提高 LLM 的推理能力。然而，现有方法面临两个局限性：1) 它们通常假设问题的答案都包含在 KG 中，忽略了 KG 不完整的问题，2) 它们将 KG 视为一个静态存储库，而忽略了 KG 中固有的隐式逻辑推理结构。在本文中，我们介绍了 SymAgent，这是一个创新的神经符号代理框架，可以在 KG 和 LLM 之间实现协作增强。我们将 KG 概念化为动态环境，并将复杂的推理任务转化为一个多步骤的交互过程，使 KG 能够深入参与推理过程。SymAgent 由两个模块组成：Agent-Planner 和 Agent-Executor。Agent-Planner 利用 LLM 的归纳推理能力从 KG 中提取符号规则，指导高效的问题分解。Agent-Executor 自主调用预定义的动作工具来整合来自 KG 和外部文档的信息，解决 KG 不完整的问题。此外，我们设计了一个自学习框架，包括在线探索和离线迭代策略更新阶段，使代理能够自动合成推理轨迹并提高性能。实验结果表明，具有弱 LLM 主干的 SymAgent（即 7B 系列）与各种强大的基线相比，产生了更好或相当的性能。进一步的分析表明，我们的代理可以识别缺失的三元组，促进自动 KG 更新。</paragraph>
+摘要：<paragraph>最近的進展強調出，大型語言模型 (LLM) 在解決複雜推理問題時容易出現幻覺，導致錯誤的結果。為了解決這個問題，研究人員結合知識圖譜 (KG) 來改善 LLM 的推理能力。然而，現有方法面臨兩個限制：1) 它們通常假設問題的所有答案都包含在 KG 中，忽略了 KG 的不完整性問題，以及 2) 它們將 KG 視為一個靜態儲存庫，而忽略了 KG 中固有的隱式邏輯推理結構。在本文中，我們介紹了 SymAgent，一個創新的神經符號代理架構，它在 KG 和 LLM 之間實現了協作擴充。我們將 KG 概念化為動態環境，並將複雜的推理任務轉化為一個多步驟的互動過程，使 KG 能夠深入參與推理過程。SymAgent 包含兩個模組：代理規劃器和代理執行器。代理規劃器利用 LLM 的歸納推理能力從 KG 中提取符號規則，指導有效的問題分解。代理執行器自主地調用預定義的動作工具來整合來自 KG 和外部文件的資訊，解決 KG 不完整性的問題。此外，我們設計了一個自學習框架，包括線上探索和離線反覆的政策更新階段，使代理能夠自動合成推理軌跡並改善效能。實驗結果表明，具有弱 LLM 主幹的 SymAgent（例如，7B 系列）與各種強大的基線相比，產生了更好或相當的效能。進一步的分析表明，我們的代理可以識別遺失的三元組，促進自動 KG 更新。</paragraph>
 
 ##### **Analyze Feature Flow to Enhance Interpretation and Steering in Language Models**
 2502.03032v2 by Daniil Laptev, Nikita Balagansky, Yaroslav Aksenov, Daniil Gavrilov
@@ -10122,209 +7797,2460 @@ improve a real-world KGQA system.
 
 摘要：知識圖譜問答系統 (KGQA) 根據知識圖譜中的資料回答事實問題。KGQA 系統很複雜，因為系統必須理解知識尋求自然語言查詢中的關係和實體，並將它們對映到針對知識圖譜的結構化查詢，才能回答這些查詢。在本文中，我們介紹了 Chronos，這是一個用於產業規模 KGQA 的全面評估框架。它旨在全面評估這種多組件系統，重點關注：(1) 端對端和組件層級指標，(2) 可擴充至各種資料集，以及 (3) 可擴充的方法，用於在釋出前衡量系統的效能。在本文中，我們討論了與產業規模 KGQA 系統評估相關的獨特挑戰，檢視 Chronos 的設計，以及它如何應對這些挑戰。我們將展示它如何提供資料驅動決策的基礎，並討論使用它來衡量和改善真實世界 KGQA 系統的挑戰。
 
-##### **FactCG: Enhancing Fact Checkers with Graph-Based Multi-Hop Data**
-2501.17144v1 by Deren Lei, Yaxi Li, Siyao Li, Mengya Hu, Rui Xu, Ken Archer, Mingyu Wang, Emily Ching, Alex Deng
-
-Prior research on training grounded factuality classification models to
-detect hallucinations in large language models (LLMs) has relied on public
-natural language inference (NLI) data and synthetic data. However, conventional
-NLI datasets are not well-suited for document-level reasoning, which is
-critical for detecting LLM hallucinations. Recent approaches to document-level
-synthetic data generation involve iteratively removing sentences from documents
-and annotating factuality using LLM-based prompts. While effective, this method
-is computationally expensive for long documents and limited by the LLM's
-capabilities. In this work, we analyze the differences between existing
-synthetic training data used in state-of-the-art models and real LLM output
-claims. Based on our findings, we propose a novel approach for synthetic data
-generation, CG2C, that leverages multi-hop reasoning on context graphs
-extracted from documents. Our fact checker model, FactCG, demonstrates improved
-performance with more connected reasoning, using the same backbone models.
-Experiments show it even outperforms GPT-4-o on the LLM-Aggrefact benchmark
-with much smaller model size.
-
-摘要：先前的研究訓練了基於事實的分類模型，以偵測大型語言模型 (LLM) 中的幻覺，依賴於公開的自然語言推論 (NLI) 資料和合成資料。然而，傳統的 NLI 資料集並不適合文件層級的推理，這對於偵測 LLM 的幻覺至關重要。最近的文件層級合成資料生成方法涉及從文件中反覆移除句子，並使用基於 LLM 的提示註解事實。雖然有效，但此方法對於長文件來說在運算上很昂貴，且受限於 LLM 的能力。在這項工作中，我們分析了現有合成訓練資料與最先進模型中使用的真實 LLM 輸出宣告之間的差異。根據我們的研究結果，我們提出了一個用於合成資料生成的創新方法 CG2C，它利用從文件中提取的內容圖表進行多跳推理。我們的查核模型 FactCG 使用相同的骨幹模型，展示了在更多連結的推理下改進的效能。實驗表明，它甚至在 LLM-Aggrefact 基準上優於 GPT-4-o，且模型大小小得多。
-
-##### **LLM-AutoDiff: Auto-Differentiate Any LLM Workflow**
-2501.16673v2 by Li Yin, Zhangyang Wang
-
-Large Language Models (LLMs) have reshaped natural language processing,
-powering applications from multi-hop retrieval and question answering to
-autonomous agent workflows. Yet, prompt engineering -- the task of crafting
-textual inputs to effectively direct LLMs -- remains difficult and
-labor-intensive, particularly for complex pipelines that combine multiple LLM
-calls with functional operations like retrieval and data formatting. We
-introduce LLM-AutoDiff: a novel framework for Automatic Prompt Engineering
-(APE) that extends textual gradient-based methods (such as Text-Grad) to
-multi-component, potentially cyclic LLM architectures. Implemented within the
-AdalFlow library, LLM-AutoDiff treats each textual input as a trainable
-parameter and uses a frozen backward engine LLM to generate feedback -- akin to
-textual gradients -- that guide iterative prompt updates. Unlike prior
-single-node approaches, LLM-AutoDiff inherently accommodates functional nodes,
-preserves time-sequential behavior in repeated calls (e.g., multi-hop loops),
-and combats the "lost-in-the-middle" problem by isolating distinct sub-prompts
-(instructions, formats, or few-shot examples). It further boosts training
-efficiency by focusing on error-prone samples through selective gradient
-computation. Across diverse tasks, including single-step classification,
-multi-hop retrieval-based QA, and agent-driven pipelines, LLM-AutoDiff
-consistently outperforms existing textual gradient baselines in both accuracy
-and training cost. By unifying prompt optimization through a graph-centric
-lens, LLM-AutoDiff offers a powerful new paradigm for scaling and automating
-LLM workflows -- mirroring the transformative role that automatic
-differentiation libraries have long played in neural network research.
-
-摘要：大型語言模型 (LLM) 已重塑自然語言處理，
-為從多跳檢索和問答到
-自主代理工作流程的應用提供動力。然而，提示工程 -- 編寫
-文本輸入以有效指導 LLM 的任務 -- 仍然困難且
-勞動密集，特別是對於將多個 LLM
-呼叫與檢索和數據格式化等功能操作相結合的複雜管道。我們
-介紹 LLM-AutoDiff：一個用於自動提示工程 (APE) 的新框架，它將基於文本梯度的
-方法（例如 Text-Grad）擴展到多組件、潛在循環 LLM 架構中。在
-AdalFlow 庫中實施，LLM-AutoDiff 將每個文本輸入視為一個可訓練
-參數，並使用凍結的後向引擎 LLM 生成反饋——類似於
-文本梯度——指導迭代提示更新。與先前的
-單節點方法不同，LLM-AutoDiff 本質上適應功能節點，
-在重複呼叫（例如，多跳循環）中保留時間順序行為，
-並通過隔離不同的子提示（說明、格式或少數鏡頭示例）來解決“迷失在中間”問題。它進一步提高訓練
-效率，通過選擇性梯度
-計算專注於容易出錯的樣本。在包括單步分類、
-多跳基於檢索的問答和代理驅動管道在內的各種任務中，LLM-AutoDiff
-在準確性和訓練成本方面始終優於現有的文本梯度基準。通過圖形中心化
-視角統一提示優化，LLM-AutoDiff 為擴展和自動化
-LLM 工作流程提供了一個強大的新範例——反映了自動
-微分庫在神經網絡研究中長期扮演的變革性角色。
-
-##### **360Brew: A Decoder-only Foundation Model for Personalized Ranking and Recommendation**
-2501.16450v3 by Hamed Firooz, Maziar Sanjabi, Adrian Englhardt, Aman Gupta, Ben Levine, Dre Olgiati, Gungor Polatkan, Iuliia Melnychuk, Karthik Ramgopal, Kirill Talanine, Kutta Srinivasan, Luke Simon, Natesh Sivasubramoniapillai, Necip Fazil Ayan, Qingquan Song, Samira Sriram, Souvik Ghosh, Tao Song, Tejas Dharamsi, Vignesh Kothapalli, Xiaoling Zhai, Ya Xu, Yu Wang, Yun Dai
-
-Ranking and recommendation systems are the foundation for numerous online
-experiences, ranging from search results to personalized content delivery.
-These systems have evolved into complex, multilayered architectures that
-leverage vast datasets and often incorporate thousands of predictive models.
-The maintenance and enhancement of these models is a labor-intensive process
-that requires extensive feature engineering. This approach not only exacerbates
-technical debt but also hampers innovation in extending these systems to
-emerging problem domains. In this report, we present our research to address
-these challenges by utilizing a large foundation model with a textual interface
-for ranking and recommendation tasks. We illustrate several key advantages of
-our approach: (1) a single model can manage multiple predictive tasks involved
-in ranking and recommendation, (2) decoder models with a textual interface,
-owing to their comprehension and reasoning capabilities, can generalize to new
-recommendation surfaces and out-of-domain problems, and (3) by employing
-natural language interfaces for task definitions and verbalizing member
-behaviors and their social connections, we eliminate the need for feature
-engineering and the maintenance of complex directed acyclic graphs of model
-dependencies. We introduce our research pre-production model, 360Brew V1.0, a
-150B parameter, decoder-only model that has been trained and fine-tuned on
-LinkedIn's data and tasks. This model is capable of solving over 30 predictive
-tasks across various segments of the LinkedIn platform, achieving performance
-levels comparable to or exceeding those of current production systems based on
-offline metrics, without task-specific fine-tuning. Notably, each of these
-tasks is conventionally addressed by dedicated models that have been developed
-and maintained over multiple years by teams of a similar or larger size than
-our own.
-
-摘要：排名和推薦系統是許多線上體驗的基礎，從搜尋結果到個人化內容傳遞。
-這些系統已演變成複雜的多層架構，利用龐大的資料集，並經常納入數千個預測模型。
-這些模型的維護和增強是一個勞力密集的過程，需要廣泛的特徵工程。
-這種方法不僅加劇了技術債務，也阻礙了將這些系統擴展到新興問題領域的創新。
-在此報告中，我們提出了我們的研究，以利用具有文字介面的大型基礎模型來解決這些挑戰，以進行排名和推薦任務。
-我們說明了我們方法的幾個主要優點：(1) 單一模型可以管理排名和推薦中涉及的多個預測任務，(2) 由於解碼器模型具有文字介面，因此它們對推理能力的理解，可以推廣到新的推薦表面和領域外問題，以及 (3) 通過採用自然語言介面進行任務定義和表達成員行為及其社交連接，我們消除了對特徵工程和維護複雜的模型相依性有向無環圖的需求。
-我們介紹了我們的研究前製作業模型 360Brew V1.0，這是一個 150B 參數，僅解碼器模型，已在 LinkedIn 的資料和任務上進行訓練和微調。
-此模型能夠解決 LinkedIn 平臺各個區塊中超過 30 個預測任務，在不針對任務進行微調的情況下，達到與基於離線指標的現行製作系統相當或超越的效能水準。
-值得注意的是，這些任務中的每個任務通常由專用模型處理，這些模型是由與我們規模相當或更大的團隊在多年間開發和維護的。
-
-##### **Raiders of the Lost Dependency: Fixing Dependency Conflicts in Python using LLMs**
-2501.16191v1 by Antony Bartlett, Cynthia Liem, Annibale Panichella
-
-Fixing Python dependency issues is a tedious and error-prone task for
-developers, who must manually identify and resolve environment dependencies and
-version constraints of third-party modules and Python interpreters. Researchers
-have attempted to automate this process by relying on large knowledge graphs
-and database lookup tables. However, these traditional approaches face
-limitations due to the variety of dependency error types, large sets of
-possible module versions, and conflicts among transitive dependencies. This
-study explores the potential of using large language models (LLMs) to
-automatically fix dependency issues in Python programs. We introduce PLLM
-(pronounced "plum"), a novel technique that employs retrieval-augmented
-generation (RAG) to help an LLM infer Python versions and required modules for
-a given Python file. PLLM builds a testing environment that iteratively (1)
-prompts the LLM for module combinations, (2) tests the suggested changes, and
-(3) provides feedback (error messages) to the LLM to refine the fix. This
-feedback cycle leverages natural language processing (NLP) to intelligently
-parse and interpret build error messages. We benchmark PLLM on the Gistable
-HG2.9K dataset, a collection of challenging single-file Python gists. We
-compare PLLM against two state-of-the-art automatic dependency inference
-approaches, namely PyEGo and ReadPyE, w.r.t. the ability to resolve dependency
-issues. Our results indicate that PLLM can fix more dependency issues than the
-two baselines, with +218 (+15.97%) more fixes over ReadPyE and +281 (+21.58%)
-over PyEGo. Our deeper analyses suggest that PLLM is particularly beneficial
-for projects with many dependencies and for specific third-party numerical and
-machine-learning modules. Our findings demonstrate the potential of LLM-based
-approaches to iteratively resolve Python dependency issues.
-
-摘要：修復 Python 依賴項問題對開發人員來說是一項繁瑣且容易出錯的任務，他們必須手動識別和解決第三方模組和 Python 解譯器的環境依賴項和版本限制。研究人員已嘗試透過依賴大型知識圖譜和資料庫查詢表來自動化此程序。然而，這些傳統方法由於依賴項錯誤類型多樣、可能的模組版本數量龐大，以及傳遞依賴項之間的衝突，而面臨限制。本研究探討使用大型語言模型 (LLM) 自動修復 Python 程式中的依賴項問題的可能性。我們介紹 PLLM（發音為「plum」），這是一種新穎的技術，採用檢索增強生成 (RAG) 來協助 LLM 推論 Python 版本和給定 Python 檔案所需的模組。PLLM 建立一個測試環境，反覆 (1) 提示 LLM 模組組合，(2) 測試建議的變更，以及 (3) 提供回饋（錯誤訊息）給 LLM 以改善修正。此回饋循環利用自然語言處理 (NLP) 來智慧解析和詮釋建置錯誤訊息。我們在 Gistable HG2.9K 資料集上對 PLLM 進行基準測試，該資料集是一個具有挑戰性的單一檔案 Python gist 集合。我們將 PLLM 與兩種最先進的自動依賴項推論方法進行比較，即 PyEGo 和 ReadPyE，以比較解決依賴項問題的能力。我們的結果顯示，PLLM 可以修復比這兩個基準更多的依賴項問題，比 ReadPyE 多修復了 +218 (+15.97%) 個，比 PyEGo 多修復了 +281 (+21.58%) 個。我們更深入的分析表明，PLLM 對具有許多依賴項的專案以及特定第三方數值和機器學習模組特別有益。我們的研究結果證明了基於 LLM 的方法反覆解決 Python 依賴項問題的可能性。
-
-##### **Harnessing Diverse Perspectives: A Multi-Agent Framework for Enhanced Error Detection in Knowledge Graphs**
-2501.15791v1 by Yu Li, Yi Huang, Guilin Qi, Junlan Feng, Nan Hu, Songlin Zhai, Haohan Xue, Yongrui Chen, Ruoyan Shen, Tongtong Wu
-
-Knowledge graphs are widely used in industrial applications, making error
-detection crucial for ensuring the reliability of downstream applications.
-Existing error detection methods often fail to effectively leverage
-fine-grained subgraph information and rely solely on fixed graph structures,
-while also lacking transparency in their decision-making processes, which
-results in suboptimal detection performance. In this paper, we propose a novel
-Multi-Agent framework for Knowledge Graph Error Detection (MAKGED) that
-utilizes multiple large language models (LLMs) in a collaborative setting. By
-concatenating fine-grained, bidirectional subgraph embeddings with LLM-based
-query embeddings during training, our framework integrates these
-representations to produce four specialized agents. These agents utilize
-subgraph information from different dimensions to engage in multi-round
-discussions, thereby improving error detection accuracy and ensuring a
-transparent decision-making process. Extensive experiments on FB15K and WN18RR
-demonstrate that MAKGED outperforms state-of-the-art methods, enhancing the
-accuracy and robustness of KG evaluation. For specific industrial scenarios,
-our framework can facilitate the training of specialized agents using
-domain-specific knowledge graphs for error detection, which highlights the
-potential industrial application value of our framework. Our code and datasets
-are available at https://github.com/kse-ElEvEn/MAKGED.
-
-摘要：知識圖譜廣泛應用於工業應用中，使得錯誤偵測對於確保下游應用的可靠性至關重要。現有的錯誤偵測方法通常無法有效利用細粒度的子圖資訊，並且僅依賴於固定的圖形結構，同時在它們的決策過程中也缺乏透明度，這導致次佳的偵測效能。在本文中，我們提出了一個用於知識圖譜錯誤偵測 (MAKGED) 的新多代理架構，它在協作設定中利用了多個大型語言模型 (LLM)。透過在訓練期間將細粒度、雙向子圖嵌入與基於 LLM 的查詢嵌入串接，我們的架構整合了這些表示以產生四個專門代理。這些代理利用不同維度的子圖資訊參與多輪討論，從而提高錯誤偵測準確度並確保透明的決策過程。在 FB15K 和 WN18RR 上的廣泛實驗表明，MAKGED 優於最先進的方法，增強了 KG 評估的準確性和穩健性。對於特定產業情境，我們的架構可以利用特定領域的知識圖譜來促進專門代理的訓練以進行錯誤偵測，這突顯了我們架構的潛在產業應用價值。我們的程式碼和資料集可在 https://github.com/kse-ElEvEn/MAKGED 取得。
-
-##### **Automatic Feedback Generation for Short Answer Questions using Answer Diagnostic Graphs**
-2501.15777v1 by Momoka Furuhashi, Hiroaki Funayama, Yuya Iwase, Yuichiroh Matsubayashi, Yoriko Isobe, Toru Nagahama, Saku Sugawara, Kentaro Inui
-
-Short reading comprehension questions help students understand text structure
-but lack effective feedback. Students struggle to identify and correct errors,
-while manual feedback creation is labor-intensive. This highlights the need for
-automated feedback linking responses to a scoring rubric for deeper
-comprehension.
-  Despite advances in Natural Language Processing (NLP), research has focused
-on automatic grading, with limited work on feedback generation. To address
-this, we propose a system that generates feedback for student responses.
-  Our contributions are twofold. First, we introduce the first system for
-feedback on short-answer reading comprehension. These answers are derived from
-the text, requiring structural understanding. We propose an "answer diagnosis
-graph," integrating the text's logical structure with feedback templates. Using
-this graph and NLP techniques, we estimate students' comprehension and generate
-targeted feedback.
-  Second, we evaluate our feedback through an experiment with Japanese high
-school students (n=39). They answered two 70-80 word questions and were divided
-into two groups with minimal academic differences. One received a model answer,
-the other system-generated feedback. Both re-answered the questions, and we
-compared score changes. A questionnaire assessed perceptions and motivation.
-  Results showed no significant score improvement between groups, but
-system-generated feedback helped students identify errors and key points in the
-text. It also significantly increased motivation. However, further refinement
-is needed to enhance text structure understanding.
-
-摘要：短篇閱讀理解題目有助學生理解文章結構，但缺乏有效的回饋。學生難以找出並更正錯誤，而手動建立回饋又很費力。這突顯了自動化回饋的必要性，將回應連結到評分標準，以獲得更深入的理解。
-
-儘管自然語言處理 (NLP) 有所進展，但研究一直集中在自動評分上，而回饋生成的工作有限。為了解決這個問題，我們提出了一個系統，用於為學生的回答產生回饋。
-
-我們的貢獻有兩個方面。首先，我們引入了第一個針對簡答閱讀理解提供回饋的系統。這些答案來自於文本，需要結構化的理解。我們提出了一個「答案診斷圖」，將文本的邏輯結構與回饋範本整合在一起。使用這個圖表和 NLP 技術，我們估計學生的理解力並產生有針對性的回饋。
-
-其次，我們透過一項針對日本高中生的實驗（n=39）來評估我們的回饋。他們回答了兩個 70-80 字的問題，並被分成兩組，學術差異最小。一組收到範本答案，另一組收到系統產生的回饋。兩組都重新回答了問題，我們比較了分數的變化。一份問卷評估了認知和動機。
-
-結果顯示兩組之間沒有顯著的分數進步，但系統產生的回饋有助於學生找出文本中的錯誤和重點。它也顯著地提高了動機。然而，需要進一步的改進來增強對文本結構的理解。
+
+### LLM
+|Publish Date|Title|Authors|Homepage|Code|
+| :---: | :---: | :---: | :---: | :---: |
+|**2025-02-18**|**SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation**|Zekun Qi et.al.|[2502.13143v1](http://arxiv.org/abs/2502.13143v1)|null|
+|**2025-02-18**|**Pre-training Auto-regressive Robotic Models with 4D Representations**|Dantong Niu et.al.|[2502.13142v1](http://arxiv.org/abs/2502.13142v1)|null|
+|**2025-02-18**|**UniGuardian: A Unified Defense for Detecting Prompt Injection, Backdoor Attacks and Adversarial Attacks in Large Language Models**|Huawei Lin et.al.|[2502.13141v1](http://arxiv.org/abs/2502.13141v1)|null|
+|**2025-02-18**|**AIDE: AI-Driven Exploration in the Space of Code**|Zhengyao Jiang et.al.|[2502.13138v1](http://arxiv.org/abs/2502.13138v1)|null|
+|**2025-02-18**|**Theorem Prover as a Judge for Synthetic Data Generation**|Joshua Ong Jun Leang et.al.|[2502.13137v1](http://arxiv.org/abs/2502.13137v1)|null|
+|**2025-02-18**|**Sleepless Nights, Sugary Days: Creating Synthetic Users with Health Conditions for Realistic Coaching Agent Interactions**|Taedong Yun et.al.|[2502.13135v1](http://arxiv.org/abs/2502.13135v1)|null|
+|**2025-02-18**|**Learning to Defer for Causal Discovery with Imperfect Experts**|Oscar Clivio et.al.|[2502.13132v1](http://arxiv.org/abs/2502.13132v1)|null|
+|**2025-02-18**|**Rethinking Diverse Human Preference Learning through Principal Component Analysis**|Feng Luo et.al.|[2502.13131v1](http://arxiv.org/abs/2502.13131v1)|null|
+|**2025-02-18**|**Magma: A Foundation Model for Multimodal AI Agents**|Jianwei Yang et.al.|[2502.13130v1](http://arxiv.org/abs/2502.13130v1)|null|
+|**2025-02-18**|**SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation**|Zihan Liu et.al.|[2502.13128v1](http://arxiv.org/abs/2502.13128v1)|null|
+|**2025-02-18**|**Facilitating Long Context Understanding via Supervised Chain-of-Thought Reasoning**|Jingyang Lin et.al.|[2502.13127v1](http://arxiv.org/abs/2502.13127v1)|null|
+|**2025-02-18**|**RuozhiBench: Evaluating LLMs with Logical Fallacies and Misleading Premises**|Zenan Zhai et.al.|[2502.13125v1](http://arxiv.org/abs/2502.13125v1)|null|
+|**2025-02-18**|**NaturalReasoning: Reasoning in the Wild with 2.8M Challenging Questions**|Weizhe Yuan et.al.|[2502.13124v1](http://arxiv.org/abs/2502.13124v1)|null|
+|**2025-02-18**|**Adapting Psycholinguistic Research for LLMs: Gender-inclusive Language in a Coreference Context**|Marion Bartl et.al.|[2502.13120v1](http://arxiv.org/abs/2502.13120v1)|null|
+|**2025-02-18**|**STEER-ME: Assessing the Microeconomic Reasoning of Large Language Models**|Narun Raman et.al.|[2502.13119v1](http://arxiv.org/abs/2502.13119v1)|null|
+|**2025-02-18**|**Performance Evaluation of Large Language Models in Statistical Programming**|Xinyi Song et.al.|[2502.13117v1](http://arxiv.org/abs/2502.13117v1)|null|
+|**2025-02-18**|**Near-Optimal Private Learning in Linear Contextual Bandits**|Fan Chen et.al.|[2502.13115v1](http://arxiv.org/abs/2502.13115v1)|null|
+|**2025-02-18**|**The influence of motion features in temporal perception**|Rosa Illan Castillo et.al.|[2502.13114v1](http://arxiv.org/abs/2502.13114v1)|null|
+|**2025-02-18**|**Improving Clinical Question Answering with Multi-Task Learning: A Joint Approach for Answer Extraction and Medical Categorization**|Priyaranjan Pattnayak et.al.|[2502.13108v1](http://arxiv.org/abs/2502.13108v1)|null|
+|**2025-02-18**|**MatterChat: A Multi-Modal LLM for Material Science**|Yingheng Tang et.al.|[2502.13107v1](http://arxiv.org/abs/2502.13107v1)|null|
+|**2025-02-18**|**Understanding and Rectifying Safety Perception Distortion in VLMs**|Xiaohan Zou et.al.|[2502.13095v1](http://arxiv.org/abs/2502.13095v1)|null|
+|**2025-02-18**|**Text2World: Benchmarking Large Language Models for Symbolic World Model Generation**|Mengkang Hu et.al.|[2502.13092v1](http://arxiv.org/abs/2502.13092v1)|null|
+|**2025-02-18**|**KAPPA: A Generic Patent Analysis Framework with Keyphrase-Based Portraits**|Xin Xia et.al.|[2502.13076v1](http://arxiv.org/abs/2502.13076v1)|null|
+|**2025-02-18**|**Interactive Agents to Overcome Ambiguity in Software Engineering**|Sanidhya Vijayvargiya et.al.|[2502.13069v1](http://arxiv.org/abs/2502.13069v1)|null|
+|**2025-02-18**|**Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity**|Yuri Kuratov et.al.|[2502.13063v1](http://arxiv.org/abs/2502.13063v1)|null|
+|**2025-02-18**|**AI-Assisted Decision Making with Human Learning**|Gali Noti et.al.|[2502.13062v1](http://arxiv.org/abs/2502.13062v1)|null|
+|**2025-02-18**|**Improved Fine-Tuning of Large Multimodal Models for Hateful Meme Detection**|Jingbiao Mei et.al.|[2502.13061v1](http://arxiv.org/abs/2502.13061v1)|null|
+|**2025-02-18**|**SimpleVQA: Multimodal Factuality Evaluation for Multimodal Large Language Models**|Xianfu Cheng et.al.|[2502.13059v1](http://arxiv.org/abs/2502.13059v1)|null|
+|**2025-02-18**|**LAMD: Context-driven Android Malware Detection and Classification with LLMs**|Xingzhi Qian et.al.|[2502.13055v1](http://arxiv.org/abs/2502.13055v1)|null|
+|**2025-02-18**|**Do we still need Human Annotators? Prompting Large Language Models for Aspect Sentiment Quad Prediction**|Nils Constantin Hellwig et.al.|[2502.13044v1](http://arxiv.org/abs/2502.13044v1)|null|
+|**2025-02-18**|**Natural Language Generation from Visual Sequences: Challenges and Future Directions**|Aditya K Surikuchi et.al.|[2502.13034v1](http://arxiv.org/abs/2502.13034v1)|null|
+|**2025-02-18**|**HPSS: Heuristic Prompting Strategy Search for LLM Evaluators**|Bosi Wen et.al.|[2502.13031v1](http://arxiv.org/abs/2502.13031v1)|null|
+|**2025-02-18**|**Whose story is it? Personalizing story generation by inferring author styles**|Nischal Ashok Kumar et.al.|[2502.13028v1](http://arxiv.org/abs/2502.13028v1)|null|
+|**2025-02-18**|**Agentic Deep Graph Reasoning Yields Self-Organizing Knowledge Networks**|Markus J. Buehler et.al.|[2502.13025v1](http://arxiv.org/abs/2502.13025v1)|null|
+|**2025-02-18**|**Oreo: A Plug-in Context Reconstructor to Enhance Retrieval-Augmented Generation**|Sha Li et.al.|[2502.13019v1](http://arxiv.org/abs/2502.13019v1)|null|
+|**2025-02-18**|**LLM-Powered Proactive Data Systems**|Sepanta Zeighami et.al.|[2502.13016v1](http://arxiv.org/abs/2502.13016v1)|null|
+|**2025-02-18**|**Towards a Design Guideline for RPA Evaluation: A Survey of Large Language Model-Based Role-Playing Agents**|Chaoran Chen et.al.|[2502.13012v1](http://arxiv.org/abs/2502.13012v1)|null|
+|**2025-02-18**|**Adaptive Knowledge Graphs Enhance Medical Question Answering: Bridging the Gap Between LLMs and Evolving Medical Knowledge**|Mohammad Reza Rezaei et.al.|[2502.13010v1](http://arxiv.org/abs/2502.13010v1)|null|
+|**2025-02-18**|**Integrating Reinforcement Learning, Action Model Learning, and Numeric Planning for Tackling Complex Tasks**|Yarin Benyamin et.al.|[2502.13006v1](http://arxiv.org/abs/2502.13006v1)|null|
+|**2025-02-18**|**Language Barriers: Evaluating Cross-Lingual Performance of CNN and Transformer Architectures for Speech Quality Estimation**|Wafaa Wardah et.al.|[2502.13004v1](http://arxiv.org/abs/2502.13004v1)|null|
+|**2025-02-18**|**You need to MIMIC to get FAME: Solving Meeting Transcript Scarcity with a Multi-Agent Conversations**|Frederic Kirstein et.al.|[2502.13001v1](http://arxiv.org/abs/2502.13001v1)|null|
+|**2025-02-18**|**Personalized Top-k Set Queries Over Predicted Scores**|Sohrab Namazi Nia et.al.|[2502.12998v1](http://arxiv.org/abs/2502.12998v1)|null|
+|**2025-02-18**|**Eager Updates For Overlapped Communication and Computation in DiLoCo**|Satyen Kale et.al.|[2502.12996v1](http://arxiv.org/abs/2502.12996v1)|null|
+|**2025-02-18**|**Free Argumentative Exchanges for Explaining Image Classifiers**|Avinash Kori et.al.|[2502.12995v1](http://arxiv.org/abs/2502.12995v1)|null|
+|**2025-02-18**|**B-cos LM: Efficiently Transforming Pre-trained Language Models for Improved Explainability**|Yifan Wang et.al.|[2502.12992v1](http://arxiv.org/abs/2502.12992v1)|null|
+|**2025-02-18**|**Beyond Profile: From Surface-Level Facts to Deep Persona Simulation in LLMs**|Zixiao Wang et.al.|[2502.12988v1](http://arxiv.org/abs/2502.12988v1)|null|
+|**2025-02-18**|**PartSDF: Part-Based Implicit Neural Representation for Composite 3D Shape Parametrization and Optimization**|Nicolas Talabot et.al.|[2502.12985v1](http://arxiv.org/abs/2502.12985v1)|null|
+|**2025-02-18**|**Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs**|Longxu Dou et.al.|[2502.12982v1](http://arxiv.org/abs/2502.12982v1)|null|
+|**2025-02-18**|**Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking**|Junda Zhu et.al.|[2502.12970v1](http://arxiv.org/abs/2502.12970v1)|null|
+|**2025-02-18**|**A Survey of Text Classification Under Class Distribution Shift**|Adriana Valentina Costache et.al.|[2502.12965v1](http://arxiv.org/abs/2502.12965v1)|null|
+|**2025-02-18**|**Trust Me, I'm Wrong: High-Certainty Hallucinations in LLMs**|Adi Simhi et.al.|[2502.12964v1](http://arxiv.org/abs/2502.12964v1)|null|
+|**2025-02-18**|**Infinite Retrieval: Attention Enhanced LLMs in Long-Context Processing**|Xiaoju Ye et.al.|[2502.12962v1](http://arxiv.org/abs/2502.12962v1)|null|
+|**2025-02-18**|**Adaptive Tool Use in Large Language Models with Meta-Cognition Trigger**|Wenjun Li et.al.|[2502.12961v1](http://arxiv.org/abs/2502.12961v1)|null|
+|**2025-02-18**|**AlignFreeze: Navigating the Impact of Realignment on the Layers of Multilingual Models Across Diverse Languages**|Steve Bakos et.al.|[2502.12959v1](http://arxiv.org/abs/2502.12959v1)|null|
+|**2025-02-18**|**Task-Informed Anti-Curriculum by Masking Improves Downstream Performance on Text**|Andrei Jarca et.al.|[2502.12953v1](http://arxiv.org/abs/2502.12953v1)|null|
+|**2025-02-18**|**Fake It Till You Make It: Using Synthetic Data and Domain Knowledge for Improved Text-Based Learning for LGE Detection**|Athira J Jacob et.al.|[2502.12948v1](http://arxiv.org/abs/2502.12948v1)|null|
+|**2025-02-18**|**Every Expert Matters: Towards Effective Knowledge Distillation for Mixture-of-Experts Language Models**|Gyeongman Kim et.al.|[2502.12947v1](http://arxiv.org/abs/2502.12947v1)|null|
+|**2025-02-18**|**LLMPopcorn: An Empirical Study of LLMs as Assistants for Popular Micro-video Generation**|Junchen Fu et.al.|[2502.12945v1](http://arxiv.org/abs/2502.12945v1)|null|
+|**2025-02-18**|**Synthetic Data Generation for Culturally Nuanced Commonsense Reasoning in Low-Resource Languages**|Salsabila Zahirah Pranida et.al.|[2502.12932v1](http://arxiv.org/abs/2502.12932v1)|null|
+|**2025-02-18**|**Flow-of-Options: Diversified and Improved LLM Reasoning by Thinking Through Options**|Lakshmi Nair et.al.|[2502.12929v1](http://arxiv.org/abs/2502.12929v1)|null|
+|**2025-02-18**|**Finedeep: Mitigating Sparse Activation in Dense LLMs via Multi-Layer Fine-Grained Experts**|Leiyu Pan et.al.|[2502.12928v1](http://arxiv.org/abs/2502.12928v1)|null|
+|**2025-02-18**|**SEFL: Harnessing Large Language Model Agents to Improve Educational Feedback Systems**|Mike Zhang et.al.|[2502.12927v1](http://arxiv.org/abs/2502.12927v1)|null|
+|**2025-02-18**|**Towards more Contextual Agents: An extractor-Generator Optimization Framework**|Mourad Aouini et.al.|[2502.12926v1](http://arxiv.org/abs/2502.12926v1)|null|
+|**2025-02-18**|**Keep what you need : extracting efficient subnetworks from large audio representation models**|David Genova et.al.|[2502.12925v1](http://arxiv.org/abs/2502.12925v1)|null|
+|**2025-02-18**|**Conditioning LLMs to Generate Code-Switched Text: A Methodology Grounded in Naturally Occurring Data**|Maite Heredia et.al.|[2502.12924v1](http://arxiv.org/abs/2502.12924v1)|null|
+|**2025-02-18**|**On-Device LLMs for Home Assistant: Dual Role in Intent Detection and Response Generation**|Rune Birkmose et.al.|[2502.12923v1](http://arxiv.org/abs/2502.12923v1)|null|
+|**2025-02-18**|**Q-STRUM Debate: Query-Driven Contrastive Summarization for Recommendation Comparison**|George-Kirollos Saad et.al.|[2502.12921v1](http://arxiv.org/abs/2502.12921v1)|null|
+|**2025-02-18**|**GSQ-Tuning: Group-Shared Exponents Integer in Fully Quantized Training for LLMs On-Device Fine-tuning**|Sifan Zhou et.al.|[2502.12913v1](http://arxiv.org/abs/2502.12913v1)|null|
+|**2025-02-18**|**Knapsack Optimization-based Schema Linking for LLM-based Text-to-SQL Generation**|Zheng Yuan et.al.|[2502.12911v1](http://arxiv.org/abs/2502.12911v1)|null|
+|**2025-02-18**|**Graph Neural Networks for Databases: A Survey**|Ziming Li et.al.|[2502.12908v1](http://arxiv.org/abs/2502.12908v1)|null|
+|**2025-02-18**|**Fraud-R1 : A Multi-Round Benchmark for Assessing the Robustness of LLM Against Augmented Fraud and Phishing Inducements**|Shu Yang et.al.|[2502.12904v1](http://arxiv.org/abs/2502.12904v1)|null|
+|**2025-02-18**|**Soundwave: Less is More for Speech-Text Alignment in LLMs**|Yuhao Zhang et.al.|[2502.12900v1](http://arxiv.org/abs/2502.12900v1)|null|
+|**2025-02-18**|**None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks**|Eva Sánchez Salido et.al.|[2502.12896v1](http://arxiv.org/abs/2502.12896v1)|null|
+|**2025-02-18**|**Multilingual European Language Models: Benchmarking Approaches and Challenges**|Fabio Barth et.al.|[2502.12895v1](http://arxiv.org/abs/2502.12895v1)|null|
+|**2025-02-18**|**H-CoT: Hijacking the Chain-of-Thought Safety Reasoning Mechanism to Jailbreak Large Reasoning Models, Including OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking**|Martin Kuo et.al.|[2502.12893v1](http://arxiv.org/abs/2502.12893v1)|null|
+|**2025-02-18**|**Are Multilingual Language Models an Off-ramp for Under-resourced Languages? Will we arrive at Digital Language Equality in Europe in 2030?**|Georg Rehm et.al.|[2502.12886v1](http://arxiv.org/abs/2502.12886v1)|null|
+|**2025-02-18**|**How desirable is alignment between LLMs and linguistically diverse human users?**|Pia Knoeferle et.al.|[2502.12884v1](http://arxiv.org/abs/2502.12884v1)|null|
+|**2025-02-18**|**Continuous Learning Conversational AI: A Personalized Agent Framework via A2C Reinforcement Learning**|Nandakishor M et.al.|[2502.12876v1](http://arxiv.org/abs/2502.12876v1)|null|
+|**2025-02-18**|**PAFT: Prompt-Agnostic Fine-Tuning**|Chenxing Wei et.al.|[2502.12859v1](http://arxiv.org/abs/2502.12859v1)|null|
+|**2025-02-18**|**Rejected Dialects: Biases Against African American Language in Reward Models**|Joel Mire et.al.|[2502.12858v1](http://arxiv.org/abs/2502.12858v1)|null|
+|**2025-02-18**|**Integrating Arithmetic Learning Improves Mathematical Reasoning in Smaller Models**|Neeraj Gangwar et.al.|[2502.12855v1](http://arxiv.org/abs/2502.12855v1)|null|
+|**2025-02-18**|**S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning**|Ruotian Ma et.al.|[2502.12853v1](http://arxiv.org/abs/2502.12853v1)|null|
+|**2025-02-18**|**MVL-SIB: A Massively Multilingual Vision-Language Benchmark for Cross-Modal Topical Matching**|Fabian David Schmidt et.al.|[2502.12852v1](http://arxiv.org/abs/2502.12852v1)|null|
+|**2025-02-18**|**MeMo: Towards Language Models with Associative Memory Mechanisms**|Fabio Massimo Zanzotto et.al.|[2502.12851v1](http://arxiv.org/abs/2502.12851v1)|null|
+|**2025-02-18**|**Towards Adaptive Feedback with AI: Comparing the Feedback Quality of LLMs and Teachers on Experimentation Protocols**|Kathrin Seßler et.al.|[2502.12842v1](http://arxiv.org/abs/2502.12842v1)|null|
+|**2025-02-18**|**Towards Equitable AI: Detecting Bias in Using Large Language Models for Marketing**|Berk Yilmaz et.al.|[2502.12838v1](http://arxiv.org/abs/2502.12838v1)|null|
+|**2025-02-18**|**An LLM-Powered Agent for Physiological Data Analysis: A Case Study on PPG-based Heart Rate Estimation**|Mohammad Feli et.al.|[2502.12836v1](http://arxiv.org/abs/2502.12836v1)|null|
+|**2025-02-18**|**Subword models struggle with word learning, but surprisal hides it**|Bastian Bunzeck et.al.|[2502.12835v1](http://arxiv.org/abs/2502.12835v1)|null|
+|**2025-02-18**|**KazMMLU: Evaluating Language Models on Kazakh, Russian, and Regional Knowledge of Kazakhstan**|Mukhammed Togmanov et.al.|[2502.12829v1](http://arxiv.org/abs/2502.12829v1)|null|
+|**2025-02-18**|**Reasoning and the Trusting Behavior of DeepSeek and GPT: An Experiment Revealing Hidden Fault Lines in Large Language Models**|Rubing Lu et.al.|[2502.12825v1](http://arxiv.org/abs/2502.12825v1)|null|
+|**2025-02-18**|**Pitfalls of Scale: Investigating the Inverse Task of Redefinition in Large Language Models**|Elena Stringli et.al.|[2502.12821v1](http://arxiv.org/abs/2502.12821v1)|null|
+|**2025-02-18**|**Simulating User Diversity in Task-Oriented Dialogue Systems using Large Language Models**|Adnan Ahmad et.al.|[2502.12813v1](http://arxiv.org/abs/2502.12813v1)|null|
+|**2025-02-18**|**Towards Text-Image Interleaved Retrieval**|Xin Zhang et.al.|[2502.12799v1](http://arxiv.org/abs/2502.12799v1)|null|
+|**2025-02-18**|**Envious Explore and Exploit**|Omer Ben-Porat et.al.|[2502.12798v1](http://arxiv.org/abs/2502.12798v1)|null|
+|**2025-02-18**|**Commonsense Reasoning in Arab Culture**|Abdelrahman Sadallah et.al.|[2502.12788v1](http://arxiv.org/abs/2502.12788v1)|null|
+|**2025-02-18**|**VidCapBench: A Comprehensive Benchmark of Video Captioning for Controllable Text-to-Video Generation**|Xinlong Chen et.al.|[2502.12782v1](http://arxiv.org/abs/2502.12782v1)|null|
+|**2025-02-18**|**Portable Reward Tuning: Towards Reusable Fine-Tuning across Different Pretrained Models**|Daiki Chijiwa et.al.|[2502.12776v1](http://arxiv.org/abs/2502.12776v1)|null|
+|**2025-02-18**|**Mind the Gap: Aligning the Brain with Language Models Requires a Nonlinear and Multimodal Approach**|Danny Dongyeop Han et.al.|[2502.12771v1](http://arxiv.org/abs/2502.12771v1)|null|
+|**2025-02-18**|**How Much Do LLMs Hallucinate across Languages? On Multilingual Estimation of LLM Hallucination in the Wild**|Saad Obaid ul Islam et.al.|[2502.12769v1](http://arxiv.org/abs/2502.12769v1)|null|
+|**2025-02-18**|**R2-KG: General-Purpose Dual-Agent Framework for Reliable Reasoning on Knowledge Graphs**|Sumin Jo et.al.|[2502.12767v1](http://arxiv.org/abs/2502.12767v1)|null|
+
+#### Abstracts
+##### **SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation**
+2502.13143v1 by Zekun Qi, Wenyao Zhang, Yufei Ding, Runpei Dong, Xinqiang Yu, Jingwen Li, Lingyun Xu, Baoyu Li, Xialin He, Guofan Fan, Jiazhao Zhang, Jiawei He, Jiayuan Gu, Xin Jin, Kaisheng Ma, Zhizheng Zhang, He Wang, Li Yi
+
+Spatial intelligence is a critical component of embodied AI, enabling robots
+to understand and interact with their environments. While recent advances have
+enhanced the ability of VLMs to perceive object locations and positional
+relationships, they still lack the capability to precisely understand object
+orientations -- a key requirement for tasks involving fine-grained manipulations.
+Addressing this limitation not only requires geometric reasoning but also an
+expressive and intuitive way to represent orientation. In this context, we
+propose that natural language offers a more flexible representation space than
+canonical frames, making it particularly suitable for instruction-following
+robotic systems. In this paper, we introduce the concept of semantic
+orientation, which defines object orientations using natural language in a
+reference-frame-free manner (e.g., the "plug-in" direction of a USB or the
+"handle" direction of a knife). To support this, we construct OrienText300K,
+a large-scale dataset of 3D models annotated with semantic orientations that
+link geometric understanding to functional semantics. By integrating semantic
+orientation into a VLM system, we enable robots to generate manipulation
+actions with both positional and orientational constraints. Extensive
+experiments in simulation and the real world demonstrate that our approach
+significantly enhances robotic manipulation capabilities, e.g., 48.7% accuracy
+on Open6DOR and 74.9% accuracy on SIMPLER.
+
+摘要：空間智能是具象 AI 的關鍵組成部分，促使機器人了解其環境並與之互動。雖然最近的進展增強了 VLM 感知物件位置和位置關係的能力，但它們仍然缺乏精確理解物件方向的能力，這對於涉及細微操作的任務來說是一項關鍵要求。解決這個限制不僅需要幾何推理，還需要一種表達性和直觀的方式來表示方向。在此背景下，我們提出自然語言提供了一個比標準框架更靈活的表示空間，使其特別適合於遵循指令的機器人系統。在本文中，我們介紹了語義方向的概念，它使用自然語言以無參考框架的方式定義物件方向（例如，USB 的「插入」方向或刀子的「握柄」方向）。為了支持這一點，我們構建了 OrienText300K，這是一個大型 3D 模型數據集，其中註釋了語義方向，將幾何理解與功能語義聯繫起來。通過將語義方向整合到 VLM 系統中，我們使機器人能夠生成同時具有位置和方向約束的操作動作。在模擬和現實世界中進行的廣泛實驗表明，我們的做法顯著增強了機器人的操作能力，例如，Open6DOR 的準確率為 48.7%，SIMPLER 的準確率為 74.9%。
+
+##### **Pre-training Auto-regressive Robotic Models with 4D Representations**
+2502.13142v1 by Dantong Niu, Yuvan Sharma, Haoru Xue, Giscard Biamby, Junyi Zhang, Ziteng Ji, Trevor Darrell, Roei Herzig
+
+Foundation models pre-trained on massive unlabeled datasets have
+revolutionized natural language and computer vision, exhibiting remarkable
+generalization capabilities, thus highlighting the importance of pre-training.
+Yet, efforts in robotics have struggled to achieve similar success, limited by
+either the need for costly robotic annotations or the lack of representations
+that effectively model the physical world. In this paper, we introduce ARM4R,
+an Auto-regressive Robotic Model that leverages low-level 4D Representations
+learned from human video data to yield a better pre-trained robotic model.
+Specifically, we focus on utilizing 3D point tracking representations from
+videos derived by lifting 2D representations into 3D space via monocular depth
+estimation across time. These 4D representations maintain a shared geometric
+structure between the points and robot state representations up to a linear
+transformation, enabling efficient transfer learning from human video data to
+low-level robotic control. Our experiments show that ARM4R can transfer
+efficiently from human video data to robotics and consistently improves
+performance on tasks across various robot environments and configurations.
+
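+Summary: Pre-training on large unlabeled datasets has transformed natural language processing and computer vision, but robotics has lagged because of costly robot annotations and the lack of representations that model the physical world well. ARM4R is an auto-regressive robotic model that uses low-level 4D representations learned from human video data: 3D point tracks obtained by lifting 2D representations into 3D space with monocular depth estimation across time. These representations share geometric structure with robot state representations up to a linear transformation, which enables efficient transfer from human video to low-level robotic control and consistently improves task performance across robot environments and configurations.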
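+
+A minimal sketch of the lifting step, assuming a standard pinhole camera with intrinsic matrix $K$ (the abstract does not state the camera model, and the symbols below are illustrative): a tracked 2D point $(u_t, v_t)$ with estimated monocular depth $d_t$ at frame $t$ back-projects to a 3D point
+
+$$\mathbf{p}_t = d_t \, K^{-1} \begin{bmatrix} u_t \\ v_t \\ 1 \end{bmatrix}.$$
+
+Stacking $\mathbf{p}_t$ over frames gives a 3D point track through time, i.e., a 4D representation. One hedged reading of "up to a linear transformation" is that a robot state $\mathbf{s}_t$ could be related to such points by $\mathbf{s}_t \approx A\,\mathbf{p}_t$ for some matrix $A$; the abstract does not specify the exact form of this relation.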
+
+##### **UniGuardian: A Unified Defense for Detecting Prompt Injection, Backdoor Attacks and Adversarial Attacks in Large Language Models**
+2502.13141v1 by Huawei Lin, Yingjie Lao, Tong Geng, Tan Yu, Weijie Zhao
+
+Large Language Models (LLMs) are vulnerable to attacks like prompt injection,
+backdoor attacks, and adversarial attacks, which manipulate prompts or models
+to generate harmful outputs. In this paper, departing from traditional deep
+learning attack paradigms, we explore their intrinsic relationship and
+collectively term them Prompt Trigger Attacks (PTA). This raises a key
+question: Can we determine if a prompt is benign or poisoned? To address this,
+we propose UniGuardian, the first unified defense mechanism designed to detect
+prompt injection, backdoor attacks, and adversarial attacks in LLMs.
+Additionally, we introduce a single-forward strategy to optimize the detection
+pipeline, enabling simultaneous attack detection and text generation within a
+single forward pass. Our experiments confirm that UniGuardian accurately and
+efficiently identifies malicious prompts in LLMs.
+
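+Summary: LLMs are vulnerable to prompt injection, backdoor attacks, and adversarial attacks, which manipulate prompts or models to produce harmful outputs. The paper explores the intrinsic relationship among these attacks, groups them as Prompt Trigger Attacks (PTA), and asks whether a prompt can be identified as benign or poisoned. It proposes UniGuardian, described as the first unified defense for detecting all three attack types, together with a single-forward strategy that performs attack detection and text generation in one forward pass. Experiments confirm that UniGuardian identifies malicious prompts accurately and efficiently.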
+
+##### **AIDE: AI-Driven Exploration in the Space of Code**
+2502.13138v1 by Zhengyao Jiang, Dominik Schmidt, Dhruv Srikanth, Dixing Xu, Ian Kaplan, Deniss Jacenko, Yuxiang Wu
+
+Machine learning, the foundation of modern artificial intelligence, has
+driven innovations that have fundamentally transformed the world. Yet behind
+these advancements lies a complex and often tedious process requiring labor-
+and compute-intensive iteration and experimentation. Engineers and scientists
+developing machine learning models spend much of their time on trial-and-error
+tasks instead of conceptualizing innovative solutions or research hypotheses.
+To address this challenge, we introduce AI-Driven Exploration (AIDE), a machine
+learning engineering agent powered by large language models (LLMs). AIDE frames
+machine learning engineering as a code optimization problem, and formulates
+trial-and-error as a tree search in the space of potential solutions. By
+strategically reusing and refining promising solutions, AIDE effectively trades
+computational resources for enhanced performance, achieving state-of-the-art
+results on multiple machine learning engineering benchmarks, including our
+Kaggle evaluations, OpenAI MLE-Bench, and METR's RE-Bench.
+
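+Summary: Developing machine learning models involves tedious, labor- and compute-intensive iteration, so engineers and scientists spend much of their time on trial and error rather than on new solutions or research hypotheses. AI-Driven Exploration (AIDE) is an LLM-powered machine learning engineering agent that frames machine learning engineering as code optimization and formulates trial and error as a tree search over potential solutions. By strategically reusing and refining promising solutions, AIDE trades compute for performance and achieves state-of-the-art results on several machine learning engineering benchmarks, including the authors' Kaggle evaluations, OpenAI MLE-Bench, and METR's RE-Bench.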
+
+##### **Theorem Prover as a Judge for Synthetic Data Generation**
+2502.13137v1 by Joshua Ong Jun Leang, Giwon Hong, Wenda Li, Shay B. Cohen
+
+The demand for synthetic data in mathematical reasoning has increased due to
+its potential to enhance the mathematical capabilities of large language models
+(LLMs). However, ensuring the validity of intermediate reasoning steps remains
+a significant challenge, affecting data quality. While formal verification via
+theorem provers effectively validates LLM reasoning, the autoformalisation of
+mathematical proofs remains error-prone. In response, we introduce iterative
+autoformalisation, an approach that iteratively refines theorem prover
+formalisation to mitigate errors, thereby increasing the execution rate on the
+Lean prover from 60% to 87%. Building upon that, we introduce Theorem Prover as
+a Judge (TP-as-a-Judge), a method that employs theorem prover formalisation to
+rigorously assess LLM intermediate reasoning, effectively integrating
+autoformalisation with synthetic data generation. Finally, we present
+Reinforcement Learning from Theorem Prover Feedback (RLTPF), a framework that
+replaces human annotation with theorem prover feedback in Reinforcement
+Learning from Human Feedback (RLHF). Across multiple LLMs, applying
+TP-as-a-Judge and RLTPF improves benchmarks with only 3,508 samples, achieving
+5.56% accuracy gain on Mistral-7B for MultiArith, 6.00% on Llama-2-7B for
+SVAMP, and 3.55% on Llama-3.1-8B for AQUA.
+
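+Summary: Demand for synthetic data in mathematical reasoning has grown because it can strengthen the mathematical abilities of LLMs, but ensuring that intermediate reasoning steps are valid remains a major challenge for data quality. Formal verification with theorem provers validates LLM reasoning effectively, yet autoformalisation of mathematical proofs is error-prone. The paper introduces iterative autoformalisation, which iteratively refines the theorem prover formalisation and raises the execution rate on the Lean prover from 60% to 87%; Theorem Prover as a Judge (TP-as-a-Judge), which uses theorem prover formalisation to rigorously assess intermediate reasoning; and Reinforcement Learning from Theorem Prover Feedback (RLTPF), which replaces human annotation in RLHF with theorem prover feedback. With only 3,508 samples, these methods yield accuracy gains of 5.56% on Mistral-7B for MultiArith, 6.00% on Llama-2-7B for SVAMP, and 3.55% on Llama-3.1-8B for AQUA.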
+
+##### **Sleepless Nights, Sugary Days: Creating Synthetic Users with Health Conditions for Realistic Coaching Agent Interactions**
+2502.13135v1 by Taedong Yun, Eric Yang, Mustafa Safdari, Jong Ha Lee, Vaishnavi Vinod Kumar, S. Sara Mahdavi, Jonathan Amar, Derek Peyton, Reut Aharony, Andreas Michaelides, Logan Schneider, Isaac Galatzer-Levy, Yugang Jia, John Canny, Arthur Gretton, Maja Matarić
+
+We present an end-to-end framework for generating synthetic users for
+evaluating interactive agents designed to encourage positive behavior changes,
+such as in health and lifestyle coaching. The synthetic users are grounded in
+health and lifestyle conditions, specifically sleep and diabetes management in
+this study, to ensure realistic interactions with the health coaching agent.
+Synthetic users are created in two stages: first, structured data are generated
+grounded in real-world health and lifestyle factors in addition to basic
+demographics and behavioral attributes; second, full profiles of the synthetic
+users are developed conditioned on the structured data. Interactions between
+synthetic users and the coaching agent are simulated using generative
+agent-based models such as Concordia, or directly by prompting a language
+model. Using two independently-developed agents for sleep and diabetes coaching
+as case studies, the validity of this framework is demonstrated by analyzing
+the coaching agent's understanding of the synthetic users' needs and
+challenges. Finally, through multiple blinded evaluations of user-coach
+interactions by human experts, we demonstrate that our synthetic users with
+health and behavioral attributes more accurately portray real human users with
+the same attributes, compared to generic synthetic users not grounded in such
+attributes. The proposed framework lays the foundation for efficient
+development of conversational agents through extensive, realistic, and grounded
+simulated interactions.
+
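+Summary: The paper presents an end-to-end framework for generating synthetic users to evaluate interactive agents that encourage positive behavior change, such as health and lifestyle coaching agents. Synthetic users are grounded in health and lifestyle conditions (sleep and diabetes management in this study) and built in two stages: structured data are first generated from real-world health and lifestyle factors plus basic demographics and behavioral attributes, and full user profiles are then developed conditioned on those data. Interactions with the coaching agent are simulated with generative agent-based models such as Concordia or by directly prompting a language model. Case studies with two independently developed sleep and diabetes coaching agents, along with multiple blinded expert evaluations, show that the grounded synthetic users portray real users with the same attributes more accurately than generic synthetic users.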
+
+##### **Learning to Defer for Causal Discovery with Imperfect Experts**
+2502.13132v1 by Oscar Clivio, Divyat Mahajan, Perouz Taslakian, Sara Magliacane, Ioannis Mitliagkas, Valentina Zantedeschi, Alexandre Drouin
+
+Integrating expert knowledge, e.g. from large language models, into causal
+discovery algorithms can be challenging when the knowledge is not guaranteed to
+be correct. Expert recommendations may contradict data-driven results, and
+their reliability can vary significantly depending on the domain or specific
+query. Existing methods based on soft constraints or inconsistencies in
+predicted causal relationships fail to account for these variations in
+expertise. To remedy this, we propose L2D-CD, a method for gauging the
+correctness of expert recommendations and optimally combining them with
+data-driven causal discovery results. By adapting learning-to-defer (L2D)
+algorithms for pairwise causal discovery (CD), we learn a deferral function
+that selects whether to rely on classical causal discovery methods using
+numerical data or expert recommendations based on textual meta-data. We
+evaluate L2D-CD on the canonical Tübingen pairs dataset and demonstrate its
+superior performance compared to both the causal discovery method and the
+expert used in isolation. Moreover, our approach identifies domains where the
+expert's performance is strong or weak. Finally, we outline a strategy for
+generalizing this approach to causal discovery on graphs with more than two
+variables, paving the way for further research in this area.
+
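+Summary: Integrating expert knowledge, for example from large language models, into causal discovery is difficult when that knowledge is not guaranteed to be correct: expert recommendations may contradict data-driven results, and their reliability varies by domain and query. Existing methods based on soft constraints or on inconsistencies in predicted causal relationships do not account for this variation. L2D-CD adapts learning-to-defer (L2D) algorithms to pairwise causal discovery (CD), learning a deferral function that chooses between a classical causal discovery method using numerical data and expert recommendations based on textual metadata. On the Tübingen pairs dataset it outperforms both the causal discovery method and the expert used alone, identifies domains where the expert is strong or weak, and comes with an outlined strategy for extending the approach to graphs with more than two variables.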
+
+##### **Rethinking Diverse Human Preference Learning through Principal Component Analysis**
+2502.13131v1 by Feng Luo, Rui Yang, Hao Sun, Chunyuan Deng, Jiarui Yao, Jingyan Shen, Huan Zhang, Hanjie Chen
+
+Understanding human preferences is crucial for improving foundation models
+and building personalized AI systems. However, preferences are inherently
+diverse and complex, making it difficult for traditional reward models to
+capture their full range. While fine-grained preference data can help,
+collecting it is expensive and hard to scale. In this paper, we introduce
+Decomposed Reward Models (DRMs), a novel approach that extracts diverse human
+preferences from binary comparisons without requiring fine-grained annotations.
+Our key insight is to represent human preferences as vectors and analyze them
+using Principal Component Analysis (PCA). By constructing a dataset of
+embedding differences between preferred and rejected responses, DRMs identify
+orthogonal basis vectors that capture distinct aspects of preference. These
+decomposed rewards can be flexibly combined to align with different user needs,
+offering an interpretable and scalable alternative to traditional reward
+models. We demonstrate that DRMs effectively extract meaningful preference
+dimensions (e.g., helpfulness, safety, humor) and adapt to new users without
+additional training. Our results highlight DRMs as a powerful framework for
+personalized and interpretable LLM alignment.
+
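+Summary: Understanding human preferences is crucial for improving foundation models and building personalized AI systems, but preferences are diverse and complex, traditional reward models struggle to capture their full range, and fine-grained preference data are expensive to collect and hard to scale. Decomposed Reward Models (DRMs) extract diverse preferences from binary comparisons without fine-grained annotations by representing preferences as vectors and analyzing them with Principal Component Analysis (PCA). From a dataset of embedding differences between preferred and rejected responses, DRMs identify orthogonal basis vectors that capture distinct aspects of preference; the resulting decomposed rewards can be combined flexibly to match different user needs. DRMs extract meaningful preference dimensions (e.g., helpfulness, safety, humor) and adapt to new users without additional training.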
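+
+The PCA step can be sketched as follows, assuming an embedding function $\phi(x, y)$ for a prompt $x$ and response $y$; the notation, the uncentered covariance, and the weights $\alpha_k$ are illustrative assumptions rather than details given in the abstract. For each comparison $i$ with preferred response $y_i^{+}$ and rejected response $y_i^{-}$:
+
+$$\mathbf{d}_i = \phi(x_i, y_i^{+}) - \phi(x_i, y_i^{-}), \qquad C = \frac{1}{n}\sum_{i=1}^{n} \mathbf{d}_i \mathbf{d}_i^{\top}, \qquad C\,\mathbf{w}_k = \lambda_k \mathbf{w}_k, \quad \mathbf{w}_j^{\top}\mathbf{w}_k = \delta_{jk}.$$
+
+Each eigenvector $\mathbf{w}_k$ then defines one decomposed reward, and a user-specific reward is a weighted combination:
+
+$$r_k(x, y) = \mathbf{w}_k^{\top}\phi(x, y), \qquad r_{\text{user}}(x, y) = \sum_{k} \alpha_k\, r_k(x, y).$$
+
+Under this sketch, adapting to a new user only requires choosing the weights $\alpha_k$, which is consistent with the abstract's claim of adaptation without additional training.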
+
+##### **Magma: A Foundation Model for Multimodal AI Agents**
+2502.13130v1 by Jianwei Yang, Reuben Tan, Qianhui Wu, Ruijie Zheng, Baolin Peng, Yongyuan Liang, Yu Gu, Mu Cai, Seonghyeon Ye, Joel Jang, Yuquan Deng, Lars Liden, Jianfeng Gao
+
+We present Magma, a foundation model that serves multimodal AI agentic tasks
+in both the digital and physical worlds. Magma is a significant extension of
+vision-language (VL) models in that it not only retains the VL understanding
+ability (verbal intelligence) of the latter, but is also equipped with the
+ability to plan and act in the visual-spatial world (spatial-temporal
+intelligence) and complete agentic tasks ranging from UI navigation to robot
+manipulation. To endow it with agentic capabilities, Magma is pretrained on
+large heterogeneous datasets spanning images, videos, and robotics data, where
+the actionable visual objects (e.g., clickable buttons in a GUI) in
+images are labeled by Set-of-Mark (SoM) for action grounding, and the object
+movements (e.g., the trace of human hands or robotic arms) in videos are
+labeled by Trace-of-Mark (ToM) for action planning. Extensive experiments show
+that SoM and ToM reach great synergy and facilitate the acquisition of
+spatial-temporal intelligence for our Magma model, which is fundamental to a
+wide range of tasks as shown in Fig. 1. In particular, Magma sets new
+state-of-the-art results on UI navigation and robotic manipulation tasks,
+outperforming previous models that are specifically tailored to these tasks. On
+image and video-related multimodal tasks, Magma also compares favorably to
+popular large multimodal models that are trained on much larger datasets. We
+make our model and code public for reproducibility at
+https://microsoft.github.io/Magma.
+
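+Summary: Magma is a foundation model for multimodal AI agentic tasks in both the digital and physical worlds. It extends vision-language (VL) models by retaining their VL understanding ability (verbal intelligence) while adding the ability to plan and act in the visual-spatial world (spatial-temporal intelligence), covering agentic tasks from UI navigation to robot manipulation. Magma is pretrained on large heterogeneous datasets spanning images, videos, and robotics data, in which actionable visual objects in images (e.g., clickable GUI buttons) are labeled with Set-of-Mark (SoM) for action grounding and object movements in videos (e.g., traces of human hands or robotic arms) are labeled with Trace-of-Mark (ToM) for action planning. Magma sets new state-of-the-art results on UI navigation and robotic manipulation, compares favorably with popular large multimodal models trained on much larger datasets on image and video tasks, and its model and code are public at https://microsoft.github.io/Magma.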
+
+##### **SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation**
+2502.13128v1 by Zihan Liu, Shuangrui Ding, Zhixiong Zhang, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Dahua Lin, Jiaqi Wang
+
+Text-to-song generation, the task of creating vocals and accompaniment from
+textual inputs, poses significant challenges due to domain complexity and data
+scarcity. Existing approaches often employ multi-stage generation procedures,
+resulting in cumbersome training and inference pipelines. In this paper, we
+propose SongGen, a fully open-source, single-stage auto-regressive transformer
+designed for controllable song generation. The proposed model facilitates
+fine-grained control over diverse musical attributes, including lyrics and
+textual descriptions of instrumentation, genre, mood, and timbre, while also
+offering an optional three-second reference clip for voice cloning. Within a
+unified auto-regressive framework, SongGen supports two output modes: mixed
+mode, which generates a mixture of vocals and accompaniment directly, and
+dual-track mode, which synthesizes them separately for greater flexibility in
+downstream applications. We explore diverse token pattern strategies for each
+mode, leading to notable improvements and valuable insights. Furthermore, we
+design an automated data preprocessing pipeline with effective quality control.
+To foster community engagement and future research, we will release our model
+weights, training code, annotated data, and preprocessing pipeline. The
+generated samples are showcased on our project page at
+https://liuzh-19.github.io/SongGen/ , and the code will be available at
+https://github.com/LiuZH-19/SongGen .
+
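+Summary: Text-to-song generation, which creates vocals and accompaniment from text, is challenging because of domain complexity and data scarcity, and existing approaches often rely on cumbersome multi-stage pipelines. SongGen is a fully open-source, single-stage auto-regressive transformer for controllable song generation. It offers fine-grained control over lyrics and over textual descriptions of instrumentation, genre, mood, and timbre, and accepts an optional three-second reference clip for voice cloning. It supports a mixed mode that generates vocals and accompaniment together and a dual-track mode that synthesizes them separately, with different token pattern strategies explored for each mode. The authors also design an automated data preprocessing pipeline with quality control and plan to release model weights, training code, annotated data, and the preprocessing pipeline. Samples are at https://liuzh-19.github.io/SongGen/ and the code will be available at https://github.com/LiuZH-19/SongGen .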
+
+##### **Facilitating Long Context Understanding via Supervised Chain-of-Thought Reasoning**
+2502.13127v1 by Jingyang Lin, Andy Wong, Tian Xia, Shenghua He, Hui Wei, Mei Han, Jiebo Luo
+
+Recent advances in Large Language Models (LLMs) have enabled them to process
+increasingly longer sequences, ranging from 2K to 2M tokens and even beyond.
+However, simply extending the input sequence length does not necessarily lead
+to effective long-context understanding. In this study, we integrate
+Chain-of-Thought (CoT) reasoning into LLMs in a supervised manner to facilitate
+effective long-context understanding. To achieve this, we introduce
+LongFinanceQA, a synthetic dataset in the financial domain designed to improve
+long-context reasoning. Unlike existing long-context synthetic data,
+LongFinanceQA includes intermediate CoT reasoning before the final conclusion,
+which encourages LLMs to perform explicit reasoning, improving accuracy and
+interpretability in long-context understanding. To generate synthetic CoT
+reasoning, we propose Property-driven Agentic Inference (PAI), an agentic
+framework that simulates human-like reasoning steps, including property
+extraction, retrieval, and summarization. We evaluate PAI's reasoning
+capabilities by assessing GPT-4o-mini with PAI on the Loong benchmark, where it
+outperforms standard GPT-4o-mini by 20.0%. Furthermore, we fine-tune
+LLaMA-3.1-8B-Instruct on LongFinanceQA, achieving a 24.6% gain on Loong's
+financial subset.
+
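+Summary: LLMs can now process sequences ranging from 2K to 2M tokens and beyond, but extending the input length alone does not guarantee effective long-context understanding. This study integrates Chain-of-Thought (CoT) reasoning into LLMs in a supervised manner. It introduces LongFinanceQA, a synthetic financial-domain dataset that, unlike existing long-context synthetic data, includes intermediate CoT reasoning before the final conclusion, which encourages explicit reasoning and improves accuracy and interpretability. The synthetic CoT reasoning is generated with Property-driven Agentic Inference (PAI), an agentic framework that simulates human-like steps: property extraction, retrieval, and summarization. GPT-4o-mini with PAI outperforms standard GPT-4o-mini by 20.0% on the Loong benchmark, and LLaMA-3.1-8B-Instruct fine-tuned on LongFinanceQA gains 24.6% on Loong's financial subset.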
+
+##### **RuozhiBench: Evaluating LLMs with Logical Fallacies and Misleading Premises**
+2502.13125v1 by Zenan Zhai, Hao Li, Xudong Han, Zhenxuan Zhang, Yixuan Zhang, Timothy Baldwin, Haonan Li
+
+Recent advances in large language models (LLMs) have shown that they can
+answer questions requiring complex reasoning. However, their ability to
+identify and respond to text containing logical fallacies or deliberately
+misleading premises remains less studied. To address this gap, we introduce
+RuozhiBench, a bilingual dataset comprising 677 carefully curated questions
+that contain various forms of deceptive reasoning, meticulously crafted through
+extensive human effort and expert review. In a comprehensive evaluation of 17
+LLMs from 5 series over RuozhiBench using both open-ended and two-choice
+formats, we conduct extensive analyses on evaluation protocols and result
+patterns. Despite their high scores on conventional benchmarks, these models
+showed limited ability to detect and reason correctly about logical fallacies,
+with even the best-performing model, Claude-3-haiku, achieving only 62%
+accuracy compared to human accuracy of more than 90%.
+
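+Summary: LLMs can answer questions that require complex reasoning, but their ability to identify and respond to text containing logical fallacies or deliberately misleading premises is less studied. RuozhiBench is a bilingual dataset of 677 carefully curated questions containing various forms of deceptive reasoning, crafted through extensive human effort and expert review. An evaluation of 17 LLMs from 5 series, using both open-ended and two-choice formats, shows that despite high scores on conventional benchmarks these models have limited ability to detect and reason correctly about logical fallacies: even the best-performing model, Claude-3-haiku, reaches only 62% accuracy, compared with more than 90% for humans.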
+
+##### **NaturalReasoning: Reasoning in the Wild with 2.8M Challenging Questions**
+2502.13124v1 by Weizhe Yuan, Jane Yu, Song Jiang, Karthik Padthe, Yang Li, Dong Wang, Ilia Kulikov, Kyunghyun Cho, Yuandong Tian, Jason E Weston, Xian Li
+
+Scaling reasoning capabilities beyond traditional domains such as math and
+coding is hindered by the lack of diverse and high-quality questions. To
+overcome this limitation, we introduce a scalable approach for generating
+diverse and challenging reasoning questions, accompanied by reference answers.
+We present NaturalReasoning, a comprehensive dataset comprising 2.8 million
+questions that span multiple domains, including STEM fields (e.g., Physics,
+Computer Science), Economics, Social Sciences, and more. We demonstrate the
+utility of the questions in NaturalReasoning through knowledge distillation
+experiments which show that NaturalReasoning can effectively elicit and
+transfer reasoning capabilities from a strong teacher model. Furthermore, we
+demonstrate that NaturalReasoning is also effective for unsupervised
+self-training using external reward models or self-rewarding.
+
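+Summary: Scaling reasoning capabilities beyond traditional domains such as math and coding is hindered by the lack of diverse, high-quality questions. The paper introduces a scalable approach for generating diverse and challenging reasoning questions with reference answers, and presents NaturalReasoning, a dataset of 2.8 million questions spanning STEM fields (e.g., physics, computer science), economics, social sciences, and more. Knowledge distillation experiments show that the questions effectively elicit and transfer reasoning capabilities from a strong teacher model, and the dataset is also effective for unsupervised self-training with external reward models or self-rewarding.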
+
+##### **Adapting Psycholinguistic Research for LLMs: Gender-inclusive Language in a Coreference Context**
+2502.13120v1 by Marion Bartl, Thomas Brendan Murphy, Susan Leavy
+
+Gender-inclusive language is often used with the aim of ensuring that all
+individuals, regardless of gender, can be associated with certain concepts.
+While psycholinguistic studies have examined its effects in relation to human
+cognition, it remains unclear how Large Language Models (LLMs) process
+gender-inclusive language. Given that commercial LLMs are gaining an
+increasingly strong foothold in everyday applications, it is crucial to examine
+whether LLMs in fact interpret gender-inclusive language neutrally, because the
+language they generate has the potential to influence the language of their
+users. This study examines whether LLM-generated coreferent terms align with a
+given gender expression or reflect model biases. Adapting psycholinguistic
+methods from French to English and German, we find that in English, LLMs
+generally maintain the antecedent's gender but exhibit underlying masculine
+bias. In German, this bias is much stronger, overriding all tested
+gender-neutralization strategies.
+
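+Summary: Gender-inclusive language is often used so that all individuals, regardless of gender, can be associated with certain concepts. Psycholinguistic studies have examined its effects on human cognition, but how LLMs process it remains unclear, which matters because the language commercial LLMs generate can influence the language of their users. This study examines whether LLM-generated coreferent terms align with a given gender expression or reflect model biases. Adapting psycholinguistic methods from French to English and German, it finds that in English LLMs generally maintain the antecedent's gender but show an underlying masculine bias, while in German the bias is much stronger and overrides all tested gender-neutralization strategies.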
+
+##### **STEER-ME: Assessing the Microeconomic Reasoning of Large Language Models**
+2502.13119v1 by Narun Raman, Taylor Lundy, Thiago Amin, Jesse Perla, Kevin Leyton-Brown
+
+How should one judge whether a given large language model (LLM) can reliably
+perform economic reasoning? Most existing LLM benchmarks focus on specific
+applications and fail to present the model with a rich variety of economic
+tasks. A notable exception is Raman et al. [2024], who offer an approach for
+comprehensively benchmarking strategic decision-making; however, this approach
+fails to address the non-strategic settings prevalent in microeconomics, such
+as supply-and-demand analysis. We address this gap by taxonomizing
+microeconomic reasoning into 58 distinct elements, focusing on the logic of
+supply and demand, each grounded in up to 10 distinct domains, 5
+perspectives, and 3 types. The generation of benchmark data across this
+combinatorial space is powered by a novel LLM-assisted data generation protocol
+that we dub auto-STEER, which generates a set of questions by adapting
+handwritten templates to target new domains and perspectives. Because it offers
+an automated way of generating fresh questions, auto-STEER mitigates the risk
+that LLMs will be trained to over-fit evaluation benchmarks; we thus hope that
+it will serve as a useful tool both for evaluating and fine-tuning models for
+years to come. We demonstrate the usefulness of our benchmark via a case study
+on 27 LLMs, ranging from small open-source models to the current state of the
+art. We examined each model's ability to solve microeconomic problems across
+our whole taxonomy and present the results across a range of prompting
+strategies and scoring metrics.
+
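+Summary: How should one judge whether an LLM can reliably perform economic reasoning? Most existing benchmarks focus on specific applications and do not present a rich variety of economic tasks; Raman et al. [2024] benchmark strategic decision-making comprehensively but do not address the non-strategic settings common in microeconomics, such as supply-and-demand analysis. STEER-ME taxonomizes microeconomic reasoning into 58 distinct elements, each grounded in up to 10 domains, 5 perspectives, and 3 types. Benchmark data are generated with auto-STEER, an LLM-assisted protocol that adapts handwritten templates to new domains and perspectives; because it can generate fresh questions automatically, it mitigates the risk of LLMs being trained to over-fit the benchmark. A case study evaluates 27 LLMs, from small open-source models to the current state of the art, across the whole taxonomy and a range of prompting strategies and scoring metrics.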
+
+##### **Performance Evaluation of Large Language Models in Statistical Programming**
+2502.13117v1 by Xinyi Song, Kexin Xie, Lina Lee, Ruizhe Chen, Jared M. Clark, Hao He, Haoran He, Jie Min, Xinlei Zhang, Simin Zheng, Zhiyang Zhang, Xinwei Deng, Yili Hong
+
+The programming capabilities of large language models (LLMs) have
+revolutionized automatic code generation and opened new avenues for automatic
+statistical analysis. However, the validity and quality of the generated
+code need to be systematically evaluated before it can be widely adopted.
+Despite their growing prominence, a comprehensive evaluation of statistical
+code generated by LLMs remains scarce in the literature. In this paper, we
+assess the performance of LLMs, including two versions of ChatGPT and one
+version of Llama, in the domain of SAS programming for statistical analysis.
+Our study utilizes a set of statistical analysis tasks encompassing diverse
+statistical topics and datasets. Each task includes a problem description,
+dataset information, and human-verified SAS code. We conduct a comprehensive
+assessment of the quality of SAS code generated by LLMs through human expert
+evaluation based on correctness, effectiveness, readability, executability, and
+the accuracy of output results. The analysis of rating scores reveals that
+while LLMs demonstrate usefulness in generating syntactically correct code,
+they struggle with tasks requiring deep domain understanding and may produce
+redundant or incorrect results. This study offers valuable insights into the
+capabilities and limitations of LLMs in statistical programming, providing
+guidance for future advancements in AI-assisted coding systems for statistical
+analysis.
+
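+Summary: The programming capabilities of LLMs have transformed automatic code generation and opened new avenues for automatic statistical analysis, but the validity and quality of the generated code must be systematically evaluated before wide adoption, and comprehensive evaluations of LLM-generated statistical code remain scarce. This paper assesses two versions of ChatGPT and one version of Llama on SAS programming for statistical analysis, using tasks that span diverse statistical topics and datasets; each task includes a problem description, dataset information, and human-verified SAS code. Human experts rate the generated code on correctness, effectiveness, readability, executability, and accuracy of output. The ratings show that LLMs are useful for producing syntactically correct code but struggle with tasks requiring deep domain understanding and may produce redundant or incorrect results.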
+
+##### **Near-Optimal Private Learning in Linear Contextual Bandits**
+2502.13115v1 by Fan Chen, Jiachun Li, Alexander Rakhlin, David Simchi-Levi
+
+We analyze the problem of private learning in generalized linear contextual
+bandits. Our approach is based on a novel method of re-weighted regression,
+yielding an efficient algorithm with regret of order
+$\sqrt{T}+\frac{1}{\alpha}$ and $\sqrt{T}/\alpha$ in the joint and local models
+of $\alpha$-privacy, respectively. Further, we provide near-optimal private
+procedures that achieve dimension-independent rates in private linear models
+and linear contextual bandits. In particular, our results imply that joint
+privacy is almost "for free" in all the settings we consider, partially
+addressing the open problem posed by Azize and Basu (2024).
+
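+Summary: The paper analyzes private learning in generalized linear contextual bandits. Its approach, based on a novel method of re-weighted regression, yields an efficient algorithm with regret of order $\sqrt{T}+\frac{1}{\alpha}$ in the joint model and $\sqrt{T}/\alpha$ in the local model of $\alpha$-privacy. It also provides near-optimal private procedures that achieve dimension-independent rates in private linear models and linear contextual bandits. The results imply that joint privacy is almost "for free" in all the settings considered, partially addressing the open problem posed by Azize and Basu (2024).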
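+
+To make the two rates concrete, here is a worked example with illustrative values (constants, logarithmic factors, and any other problem-dependent terms are ignored, so the numbers only indicate scaling): with $T = 10^4$ rounds and $\alpha = 0.1$,
+
+$$\sqrt{T} + \frac{1}{\alpha} = 100 + 10 = 110, \qquad \frac{\sqrt{T}}{\alpha} = \frac{100}{0.1} = 1000.$$
+
+The $\sqrt{T}$ term alone is $100$, so in the joint model the privacy cost is additive and small, while in the local model it multiplies the rate by $1/\alpha$. This is the sense in which joint privacy is almost "for free".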
+
+##### **The influence of motion features in temporal perception**
+2502.13114v1 by Rosa Illan Castillo, Javier Valenzuela
+
+This paper examines the role of manner-of-motion verbs in shaping subjective
+temporal perception and emotional resonance. Through four complementary
+studies, we explore how these verbs influence the conceptualization of time,
+examining their use in literal and metaphorical (temporal) contexts. Our
+findings reveal that faster verbs (e.g., fly, zoom) evoke dynamic and engaging
+temporal experiences, often linked to positive emotions and greater agency. In
+contrast, slower verbs (e.g., crawl, drag) convey passivity, monotony, and
+negative emotions, reflecting tedious or constrained experiences of time. These
+effects are amplified in metaphorical contexts, where manner verbs encode
+emotional and experiential nuances that transcend their literal meanings. We
+also find that participants prefer manner verbs over path verbs (e.g., go,
+pass) in emotionally charged temporal contexts, as manner verbs capture the
+experiential and emotional qualities of time more effectively. These findings
+highlight the interplay between language, motion, and emotion in shaping
+temporal perception, offering insights into how linguistic framing influences
+subjective experiences of time.
+
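+Summary: This paper examines how manner-of-motion verbs shape subjective temporal perception and emotional resonance. Across four complementary studies of literal and metaphorical (temporal) contexts, faster verbs (e.g., fly, zoom) evoke dynamic, engaging temporal experiences that are often linked to positive emotions and greater agency, whereas slower verbs (e.g., crawl, drag) convey passivity, monotony, and negative emotions, reflecting tedious or constrained experiences of time. These effects are amplified in metaphorical contexts, where manner verbs encode emotional and experiential nuances beyond their literal meanings. Participants also prefer manner verbs over path verbs (e.g., go, pass) in emotionally charged temporal contexts.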
+
+##### **Improving Clinical Question Answering with Multi-Task Learning: A Joint Approach for Answer Extraction and Medical Categorization**
+2502.13108v1 by Priyaranjan Pattnayak, Hitesh Laxmichand Patel, Amit Agarwal, Bhargava Kumar, Srikant Panda, Tejaswini Kumar
+
+Clinical Question Answering (CQA) plays a crucial role in medical
+decision-making, enabling physicians to extract relevant information from
+Electronic Medical Records (EMRs). While transformer-based models such as BERT,
+BioBERT, and ClinicalBERT have demonstrated state-of-the-art performance in
+CQA, existing models lack the ability to categorize extracted answers, which is
+critical for structured retrieval, content filtering, and medical decision
+support.
+  To address this limitation, we introduce a Multi-Task Learning (MTL)
+framework that jointly trains CQA models for both answer extraction and medical
+categorization. In addition to predicting answer spans, our model classifies
+responses into five standardized medical categories: Diagnosis, Medication,
+Symptoms, Procedure, and Lab Reports. This categorization enables more
+structured and interpretable outputs, making clinical QA models more useful in
+real-world healthcare settings.
+  We evaluate our approach on emrQA, a large-scale dataset for medical question
+answering. Results show that MTL improves F1-score by 2.2% compared to standard
+fine-tuning, while achieving 90.7% accuracy in answer categorization. These
+findings suggest that MTL not only enhances CQA performance but also introduces
+an effective mechanism for categorization and structured medical information
+retrieval.
+
+摘要：臨床問答 (CQA) 在醫療決策中扮演著至關重要的角色，讓醫師能夠從電子病歷 (EMR) 中擷取相關資訊。儘管 BERT、BioBERT 和 ClinicalBERT 等基於轉換器的模型已在 CQA 中展現出最先進的效能，但現有的模型缺乏分類擷取答案的能力，這對於結構化檢索、內容過濾和醫療決策支援至關重要。
+  為了解決這個限制，我們引進了一個多任務學習 (MTL) 架構，它同時訓練 CQA 模型用於答案擷取和醫療分類。除了預測答案範圍，我們的模型將回應分類為五個標準化醫療類別：診斷、藥物、症狀、程序和實驗室報告。這種分類能產生更結構化且易於理解的輸出，讓臨床問答模型在真實世界的醫療保健環境中更實用。
+  我們在 emrQA 上評估我們的做法，emrQA 是用於醫療問題解答的大規模資料集。結果顯示，與標準微調相比，MTL 將 F1 分數提高了 2.2%，同時在答案分類中達到 90.7% 的準確度。這些發現表明，MTL 不僅增強了 CQA 的效能，還引入了一種分類和結構化醫療資訊檢索的有效機制。
+
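The joint training described in the CQA abstract above can be illustrated with a minimal sketch of a multi-task loss. The abstract does not give the loss formula, so the additive form, the `category_weight` parameter, and all function names here are assumptions made for illustration only.

```python
import math

# The five categories named in the abstract.
CATEGORIES = ["Diagnosis", "Medication", "Symptoms", "Procedure", "Lab Reports"]


def cross_entropy(logits, target_index):
    """Negative log-likelihood of the target under a softmax over the logits."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_sum_exp - logits[target_index]


def joint_loss(start_logits, end_logits, category_logits,
               start_target, end_target, category_target,
               category_weight=1.0):
    """Span-extraction loss plus a weighted categorization loss (assumed form)."""
    span_loss = 0.5 * (cross_entropy(start_logits, start_target)
                       + cross_entropy(end_logits, end_target))
    category_loss = cross_entropy(category_logits, category_target)
    return span_loss + category_weight * category_loss


# Hypothetical example: a 4-token passage and uninformative (all-zero) logits.
loss = joint_loss([0.0] * 4, [0.0] * 4, [0.0] * len(CATEGORIES),
                  start_target=1, end_target=2, category_target=0)
print(f"joint loss: {loss:.4f}")
```

Setting `category_weight` to 0 reduces the sketch to plain span extraction, which corresponds to the single-task baseline the abstract compares against.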
+##### **MatterChat: A Multi-Modal LLM for Material Science**
+2502.13107v1 by Yingheng Tang, Wenbin Xu, Jie Cao, Jianzhu Ma, Weilu Gao, Steve Farrell, Benjamin Erichson, Michael W. Mahoney, Andy Nonaka, Zhi Yao
+
+Understanding and predicting the properties of inorganic materials is crucial
+for accelerating advancements in materials science and driving applications in
+energy, electronics, and beyond. Integrating material structure data with
+language-based information through multi-modal large language models (LLMs)
+offers great potential to support these efforts by enhancing human-AI
+interaction. However, a key challenge lies in integrating atomic structures at
+full resolution into LLMs. In this work, we introduce MatterChat, a versatile
+structure-aware multi-modal LLM that unifies material structural data and
+textual inputs into a single cohesive model. MatterChat employs a bridging
+module to effectively align a pretrained machine learning interatomic potential
+with a pretrained LLM, reducing training costs and enhancing flexibility. Our
+results demonstrate that MatterChat significantly improves performance in
+material property prediction and human-AI interaction, surpassing
+general-purpose LLMs such as GPT-4. We also demonstrate its usefulness in
+applications such as more advanced scientific reasoning and step-by-step
+material synthesis.
+
+摘要：了解和預測無機材料的特性對於加速材料科學的進步和推動能源、電子等方面的應用至關重要。透過多模態大型語言模型 (LLM) 將材料結構數據與基於語言的資訊整合，可以極大程度地支持這些工作，藉此增強人類與 AI 的互動。然而，一個關鍵挑戰在於將原子結構以完整解析度整合到 LLM 中。在這項工作中，我們引入了 MatterChat，這是一個通用的結構感知多模態 LLM，它將材料結構數據和文字輸入統一到一個單一的內聚模型中。MatterChat 採用橋接模組，將預先訓練好的機器學習原子間電位與預先訓練好的 LLM 有效地對齊，從而降低訓練成本並增強靈活性。我們的結果表明，MatterChat 大幅提升了材料特性預測和人類與 AI 互動的效能，超越了 GPT-4 等通用 LLM。我們也展示了它在更進階的科學推理和逐步材料合成等應用中的效用。
+
+##### **Understanding and Rectifying Safety Perception Distortion in VLMs**
+2502.13095v1 by Xiaohan Zou, Jian Kang, George Kesidis, Lu Lin
+
+Recent studies reveal that vision-language models (VLMs) become more
+susceptible to harmful requests and jailbreak attacks after integrating the
+vision modality, exhibiting greater vulnerability than their text-only LLM
+backbones. To uncover the root cause of this phenomenon, we conduct an in-depth
+analysis and identify a key issue: multimodal inputs introduce a
+modality-induced activation shift toward a "safer" direction compared to their
+text-only counterparts, leading VLMs to systematically overestimate the safety
+of harmful inputs. We refer to this issue as safety perception distortion. To
+mitigate such distortion, we propose Activation Shift Disentanglement and
+Calibration (ShiftDC), a training-free method that decomposes and calibrates
+the modality-induced activation shift to reduce the impact of modality on
+safety. By isolating and removing the safety-relevant component, ShiftDC
+restores the inherent safety alignment of the LLM backbone while preserving the
+vision-language capabilities of VLMs. Empirical results demonstrate that
+ShiftDC significantly enhances alignment performance on safety benchmarks
+without impairing model utility.
+
+摘要：最近的研究表明，在整合了视觉模态后，视觉语言模型 (VLM) 更容易受到有害请求和越狱攻击，表现出比其仅文本的 LLM 主干更大的漏洞。为了揭示这种现象的根本原因，我们进行了深入分析，并确定了一个关键问题：与仅文本的对应物相比，多模态输入引入了朝“更安全”方向的模态诱导激活转移，导致 VLM 系统性地高估有害输入的安全性。我们将此问题称为安全感知扭曲。为了减轻这种扭曲，我们提出了激活转移解耦和校准 (ShiftDC)，这是一种无训练方法，用于分解和校准模态诱导的激活转移，以减少模态对安全性的影响。通过隔离和移除与安全性相关的组件，ShiftDC 恢复了 LLM 主干的固有安全性对齐，同时保留了 VLM 的视觉语言能力。实证结果表明，ShiftDC 在不损害模型效用的情况下，显著增强了安全基准上的对齐性能。
+
+##### **Text2World: Benchmarking Large Language Models for Symbolic World Model Generation**
+2502.13092v1 by Mengkang Hu, Tianxing Chen, Yude Zou, Yuheng Lei, Qiguang Chen, Ming Li, Hongyuan Zhang, Wenqi Shao, Ping Luo
+
+Recently, there has been growing interest in leveraging large language models
+(LLMs) to generate symbolic world models from textual descriptions. Although
+LLMs have been extensively explored in the context of world modeling, prior
+studies encountered several challenges, including evaluation randomness,
+dependence on indirect metrics, and a limited domain scope. To address these
+limitations, we introduce a novel benchmark, Text2World, based on planning
+domain definition language (PDDL), featuring hundreds of diverse domains and
+employing multi-criteria, execution-based metrics for a more robust evaluation.
+We benchmark current LLMs using Text2World and find that reasoning models
+trained with large-scale reinforcement learning outperform others. However,
+even the best-performing model still demonstrates limited capabilities in world
+modeling. Building on these insights, we examine several promising strategies
+to enhance the world modeling capabilities of LLMs, including test-time
+scaling, agent training, and more. We hope that Text2World can serve as a
+crucial resource, laying the groundwork for future research in leveraging LLMs
+as world models. The project page is available at
+https://text-to-world.github.io/.
+
+摘要：最近，人们越来越有兴趣利用大型语言模型（LLM）从文本描述中生成符号世界模型。尽管 LLM 已在世界建模的背景下得到广泛探索，但先前的研究遇到了若干挑战，包括评估随机性、对间接指标的依赖以及有限的领域范围。为了解决这些限制，我们引入了基于规划域定义语言（PDDL）的新基准 Text2World，该基准包含数百个不同的域，并采用基于执行的多标准指标来进行更稳健的评估。我们使用 Text2World 对当前的 LLM 进行了基准测试，发现使用大规模强化学习训练的推理模型优于其他模型。然而，即使是性能最佳的模型在世界建模方面仍然表现出有限的能力。基于这些见解，我们研究了几种有希望的策略来增强 LLM 的世界建模能力，包括测试时缩放、代理训练等等。我们希望 Text2World 能够作为一项至关重要的资源，为未来利用 LLM 作为世界模型的研究奠定基础。项目页面可在 https://text-to-world.github.io/ 获得。
+
+##### **KAPPA: A Generic Patent Analysis Framework with Keyphrase-Based Portraits**
+2502.13076v1 by Xin Xia, Yujin Wang, Jun Zhou, Guisheng Zhong, Linning Cai, Chen Zhang
+
+Patent analysis relies heavily on concise and interpretable document
+representations, referred to as patent portraits. Keyphrases, both present and
+absent, are ideal candidates for patent portraits due to their brevity,
+representativeness, and clarity. In this paper, we introduce KAPPA, an
+integrated framework designed to construct keyphrase-based patent portraits and
+enhance patent analysis. KAPPA operates in two phases: patent portrait
+construction and portrait-based analysis. To ensure effective portrait
+construction, we propose a semantic-calibrated keyphrase generation paradigm
+that integrates pre-trained language models with a prompt-based hierarchical
+decoding strategy to leverage the multi-level structural characteristics of
+patents. For portrait-based analysis, we develop a comprehensive framework that
+employs keyphrase-based patent portraits to enable efficient and accurate
+patent analysis. In extensive experiments on benchmark datasets for keyphrase
+generation, the proposed model achieves significant improvements compared to
+state-of-the-art baselines. Further experiments conducted on real-world patent
+applications demonstrate that our keyphrase-based portraits effectively capture
+domain-specific knowledge and enrich semantic representation for patent
+analysis tasks.
+
+摘要：專利分析高度依賴簡潔且可解讀的文件表示，稱為專利描述。關鍵字組，無論是存在的還是不存在的，都是專利描述的理想候選者，因為它們簡潔、具有代表性且清晰。在本文中，我們介紹了 KAPPA，一個用於建構基於關鍵字組的專利描述和增強專利分析的整合式架構。KAPPA 分為兩個階段執行：專利描述建構和基於描述的分析。為確保有效的描述建構，我們提出了一個語義校準關鍵字組生成範例，它將預先訓練的語言模型與基於提示的分層解碼策略整合在一起，以利用專利的多分層結構特性。對於基於描述的分析，我們開發了一個全面的架構，它採用基於關鍵字組的專利描述，以實現高效且準確的專利分析。在關鍵字組生成基準資料集上進行的廣泛實驗中，與最先進的基準線相比，所提出的模型取得了顯著的改進。在真實世界專利申請上進行的進一步實驗表明，我們基於關鍵字組的描述有效地擷取了特定領域的知識，並豐富了專利分析任務的語義表示。
+
+##### **Interactive Agents to Overcome Ambiguity in Software Engineering**
+2502.13069v1 by Sanidhya Vijayvargiya, Xuhui Zhou, Akhila Yerukola, Maarten Sap, Graham Neubig
+
+AI agents are increasingly being deployed to automate tasks, often based on
+ambiguous and underspecified user instructions. Making unwarranted assumptions
+and failing to ask clarifying questions can lead to suboptimal outcomes, safety
+risks due to tool misuse, and wasted computational resources. In this work, we
+study the ability of LLM agents to handle ambiguous instructions in interactive
+code generation settings by evaluating proprietary and open-weight models on
+their performance across three key steps: (a) leveraging interactivity to
+improve performance in ambiguous scenarios, (b) detecting ambiguity, and (c)
+asking targeted questions. Our findings reveal that models struggle to
+distinguish between well-specified and underspecified instructions. However,
+when models interact for underspecified inputs, they effectively obtain vital
+information from the user, leading to significant improvements in performance
+and underscoring the value of effective interaction. Our study highlights
+critical gaps in how current state-of-the-art models handle ambiguity in
+complex software engineering tasks and structures the evaluation into distinct
+steps to enable targeted improvements.
+
+摘要：人工智能代理正越來越多地被部署用於自動化任務，通常基於模棱兩可且未明確規定的使用者指令。做出不合理的假設且未能提出澄清問題，可能導致次佳結果、因工具誤用而產生的安全風險，以及浪費運算資源。在這項工作中，我們研究了 LLM 代理在互動式程式碼生成設定中處理模棱兩可指令的能力，方法是在三個關鍵步驟中評估專有和開放權重的模型： (a) 利用互動性來提升在模棱兩可場景中的效能、(b) 偵測模糊性，以及 (c) 提出目標問題。我們的研究結果顯示，模型難以區分明確規範的指令和未明確規範的指令。然而，當模型針對未明確規範的輸入進行互動時，它們會有效地從使用者取得重要資訊，進而大幅提升效能，並強調有效互動的價值。我們的研究突顯了目前最先進的模型在處理複雜軟體工程任務中的模糊性時存在哪些關鍵差距，並將評估架構為不同的步驟，以促成有目標的改善。
+
+##### **Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity**
+2502.13063v1 by Yuri Kuratov, Mikhail Arkhipov, Aydar Bulatov, Mikhail Burtsev
+
+A range of recent works addresses the problem of compressing a sequence of
+tokens into a shorter sequence of real-valued vectors to be used as inputs
+instead of token embeddings or key-value cache. These approaches reduce the
+amount of compute in existing language models. Despite relying on
+powerful models as encoders, the maximum attainable lossless compression ratio
+is typically not higher than x10. This fact is highly intriguing because, in
+theory, the maximum information capacity of large real-valued vectors is far
+beyond the presented rates even for 16-bit precision and a modest vector size.
+In this work, we explore the limits of compression by replacing the encoder
+with a per-sample optimization procedure. We show that vectors with compression
+ratios up to x1500 exist, which highlights a two-orders-of-magnitude gap between
+existing and practically attainable solutions. Furthermore, we empirically show
+that the compression limits are determined not by the length of the input but
+by the amount of uncertainty to be reduced, namely, the cross-entropy loss on
+this sequence without any conditioning. The obtained limits highlight the
+substantial gap between the theoretical capacity of input embeddings and their
+practical utilization, suggesting significant room for optimization in model
+design.
+
+摘要：一系列近期作品探讨了将序列标记压缩成较短的实值向量序列的问题，以用作输入，而不是标记嵌入或键值缓存。这些方法允许减少现有语言模型中的计算量。尽管依赖于强大的模型作为编码器，但最大可达到的无损压缩比通常不高于 x10。这一事实非常有趣，因为理论上，即使对于 16 位精度和适中的向量大小，大型实值向量的最大信息容量也远远超出了所呈现的速率。在这项工作中，我们通过用按样本优化程序替换编码器来探索压缩的极限。我们表明，存在压缩比高达 x1500 的向量，这突出了现有解决方案和实际可实现解决方案之间两个数量级的差距。此外，我们凭经验表明，压缩极限不是由输入的长度决定的，而是由要减少的不确定性量决定的，即在此序列上的交叉熵损失，没有任何条件。获得的极限突出了输入嵌入的理论容量与其实际利用之间的巨大差距，表明模型设计中有很大的优化空间。
+
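The capacity argument in the abstract above can be made concrete with a rough bit-counting bound. The vector size (4096) and vocabulary size (128,000) below are hypothetical values, and the bound assumes every token is equally likely, which overstates the bits needed for natural text.

```python
import math


def max_tokens_per_vector(dim, bits_per_value, vocab_size):
    """Upper bound on losslessly encodable tokens, by counting bits only."""
    vector_bits = dim * bits_per_value      # raw storage in one vector
    bits_per_token = math.log2(vocab_size)  # cost of one uniformly random token
    return vector_bits / bits_per_token


# Hypothetical configuration: 4096 dimensions at 16-bit precision.
capacity = max_tokens_per_vector(dim=4096, bits_per_value=16, vocab_size=128_000)
print(f"about {capacity:.0f} tokens per vector under the bit-counting bound")
```

Because real text has a lower per-token cross-entropy than a uniform distribution, the same storage could in principle cover more tokens, which is consistent with the abstract's point that the limit depends on the uncertainty to be reduced rather than on input length.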
+##### **AI-Assisted Decision Making with Human Learning**
+2502.13062v1 by Gali Noti, Kate Donahue, Jon Kleinberg, Sigal Oren
+
+AI systems increasingly support human decision-making. In many cases, despite
+the algorithm's superior performance, the final decision remains in human
+hands. For example, an AI may assist doctors in determining which diagnostic
+tests to run, but the doctor ultimately makes the diagnosis. This paper studies
+such AI-assisted decision-making settings, where the human learns through
+repeated interactions with the algorithm. In our framework, the algorithm --
+designed to maximize decision accuracy according to its own model -- determines
+which features the human can consider. The human then makes a prediction based
+on their own less accurate model. We observe that the discrepancy between the
+algorithm's model and the human's model creates a fundamental tradeoff. Should
+the algorithm prioritize recommending more informative features, encouraging
+the human to recognize their importance, even if it results in less accurate
+predictions in the short term until learning occurs? Or is it preferable to
+forgo educating the human and instead select features that align more closely
+with their existing understanding, minimizing the immediate cost of learning?
+This tradeoff is shaped by the algorithm's time-discounted objective and the
+human's learning ability. Our results show that optimal feature selection has a
+surprisingly clean combinatorial characterization, reducible to a stationary
+sequence of feature subsets that is tractable to compute. As the algorithm
+becomes more "patient" or the human's learning improves, the algorithm
+increasingly selects more informative features, enhancing both prediction
+accuracy and the human's understanding. Notably, early investment in learning
+leads to the selection of more informative features than a later investment. We
+complement our analysis by showing that the impact of errors in the algorithm's
+knowledge is limited as it does not make the prediction directly.
+
+摘要：人工智慧系統日益支援人類決策。在許多情況下，儘管演算法的效能優異，最終決策仍掌握在人類手中。例如，人工智慧可能會協助醫生決定要執行哪些診斷測試，但最終下診斷的是醫生。本文探討此類人工智慧輔助決策設定，其中人類透過與演算法重複互動而學習。在我們的架構中，演算法（旨在根據其自身模型最大化決策準確度）會決定人類可以考量的特徵。然後，人類根據其自身較不準確的模型做出預測。我們觀察到，演算法模型與人類模型之間的差異會產生基本的權衡。演算法是否應優先推薦更多資訊性特徵，鼓勵人類認識其重要性，即使短期內會導致準確度較低的預測，直到學習發生？或者，是否較好放棄教育人類，而選擇與其現有理解更緊密對齊的特徵，將學習的立即成本降至最低？這種權衡取決於演算法的時間折現目標和人類的學習能力。我們的結果表明，最佳特徵選擇具有令人驚訝的乾淨組合特徵，可簡化為可計算的固定特徵子集序列。隨著演算法變得更「有耐心」或人類的學習進步，演算法會越來越多地選擇更多資訊性特徵，增強預測準確度和人類的理解。值得注意的是，早期投資於學習會導致選擇比後期投資更多資訊性特徵。我們透過顯示演算法知識中錯誤的影響是有限的，因為它不會直接做出預測，來補充我們的分析。
+
+##### **Improved Fine-Tuning of Large Multimodal Models for Hateful Meme Detection**
+2502.13061v1 by Jingbiao Mei, Jinghong Chen, Guangyu Yang, Weizhe Lin, Bill Byrne
+
+Hateful memes have become a significant concern on the Internet,
+necessitating robust automated detection systems. While large multimodal models
+have shown strong generalization across various tasks, they exhibit poor
+generalization to hateful meme detection due to the dynamic nature of memes
+tied to emerging social trends and breaking news. Recent work further
+highlights the limitations of conventional supervised fine-tuning for large
+multimodal models in this context. To address these challenges, we propose
+Large Multimodal Model Retrieval-Guided Contrastive Learning (LMM-RGCL), a
+novel two-stage fine-tuning framework designed to improve both in-domain
+accuracy and cross-domain generalization. Experimental results on six widely
+used meme classification datasets demonstrate that LMM-RGCL achieves
+state-of-the-art performance, outperforming agent-based systems such as
+VPD-PALI-X-55B. Furthermore, our method effectively generalizes to
+out-of-domain memes under low-resource settings, surpassing models like GPT-4o.
+
+摘要：網路上的仇恨迷因已成為一大隱憂，因此需要強大的自動化偵測系統。雖然大型多模態模型已在各種任務中展現出強大的泛化能力，但由於迷因與新興社會趨勢和突發新聞息息相關，因此在仇恨迷因偵測方面表現不佳。最近的研究進一步強調了在這種情況下，傳統監督微調對大型多模態模型的限制。為了應對這些挑戰，我們提出了大型多模態模型檢索引導對比學習 (LMM-RGCL)，這是一種新穎的兩階段微調架構，旨在提高領域內準確度和跨領域泛化能力。在六個廣泛使用的迷因分類資料集上的實驗結果表明，LMM-RGCL 達到了最先進的效能，優於基於代理的系統，例如 VPD-PALI-X-55B。此外，我們的模型在低資源設定下有效泛化到領域外迷因，超越了 GPT-4o 等模型。
+
+##### **SimpleVQA: Multimodal Factuality Evaluation for Multimodal Large Language Models**
+2502.13059v1 by Xianfu Cheng, Wei Zhang, Shiwei Zhang, Jian Yang, Xiangyuan Guan, Xianjie Wu, Xiang Li, Ge Zhang, Jiaheng Liu, Yuying Mai, Yutao Zeng, Zhoufutu Wen, Ke Jin, Baorui Wang, Weixiao Zhou, Yunhong Lu, Tongliang Li, Wenhao Huang, Zhoujun Li
+
+The increasing application of multi-modal large language models (MLLMs)
+across various sectors has spotlighted the importance of their output reliability
+and accuracy, particularly their ability to produce content grounded in factual
+information (e.g. common and domain-specific knowledge). In this work, we
+introduce SimpleVQA, the first comprehensive multi-modal benchmark to evaluate
+the ability of MLLMs to answer short natural-language questions factually.
+SimpleVQA is characterized by six key features: it covers multiple tasks and
+multiple scenarios, ensures high quality and challenging queries, maintains
+static and timeless reference answers, and is straightforward to evaluate. Our
+approach involves categorizing visual question-answering items into 9 different
+tasks around objective events or common knowledge and situating these within 9
+topics. Rigorous quality control processes are implemented to guarantee
+high-quality, concise, and clear answers, facilitating evaluation with minimal
+variance via an LLM-as-a-judge scoring system. Using SimpleVQA, we perform a
+comprehensive assessment of 18 leading MLLMs and 8 text-only LLMs, delving into
+their image comprehension and text generation abilities by identifying and
+analyzing error cases.
+
+摘要：隨著多模態大型語言模型 (MLLM) 在各個領域的應用日益普及，其輸出結果的可靠性和準確性已備受關注，特別是其根據事實資訊（例如一般知識和特定領域知識）產生內容的能力。在本文中，我們介紹 SimpleVQA，這是第一個用於評估 MLLM 回答自然語言簡短問題的事實能力的綜合多模態基準。SimpleVQA 有六個主要特徵：涵蓋多項任務和多種情境、確保高品質且具挑戰性的查詢、維護靜態且永恆的參考答案，而且評估起來很簡單。我們的做法是將視覺問答項目分類為 9 個不同的任務，圍繞客觀事件或常識，並將它們置於 9 個主題中。我們實施嚴格的品質控管流程，以保證答案的高品質、簡潔和清晰，並透過 LLM 作為評分系統，以最小的差異進行評估。我們使用 SimpleVQA 對 18 個主要的 MLLM 和 8 個純文字 LLM 進行全面評估，透過找出和分析錯誤案例，深入探討它們的影像理解和文字生成能力。
+
+##### **LAMD: Context-driven Android Malware Detection and Classification with LLMs**
+2502.13055v1 by Xingzhi Qian, Xinran Zheng, Yiling He, Shuo Yang, Lorenzo Cavallaro
+
+The rapid growth of mobile applications has escalated Android malware
+threats. Although there are numerous detection methods, they often struggle
+with evolving attacks, dataset biases, and limited explainability. Large
+Language Models (LLMs) offer a promising alternative with their zero-shot
+inference and reasoning capabilities. However, applying LLMs to Android malware
+detection presents two key challenges: (1) the extensive support code in Android
+applications, often spanning thousands of classes, exceeds LLMs' context limits
+and obscures malicious behavior within benign functionality; (2) the structural
+complexity and interdependencies of Android applications surpass LLMs'
+sequence-based reasoning, fragmenting code analysis and hindering malicious
+intent inference. To address these challenges, we propose LAMD, a practical
+context-driven framework to enable LLM-based Android malware detection. LAMD
+integrates key context extraction to isolate security-critical code regions and
+construct program structures, then applies tier-wise code reasoning to analyze
+application behavior progressively, from low-level instructions to high-level
+semantics, providing a final prediction and explanation. A factual
+consistency verification mechanism is included to mitigate LLM hallucinations
+from the first tier. Evaluation in real-world settings demonstrates LAMD's
+effectiveness over conventional detectors, establishing a feasible basis for
+LLM-driven malware analysis in dynamic threat landscapes.
+
+摘要：隨著行動應用程式快速成長，Android 惡意軟體威脅也隨之升級。雖然有許多偵測方法，但它們經常難以應付不斷演進的攻擊、資料集偏差和有限的可解釋性。大型語言模型 (LLM) 提供了一個有前途的替代方案，具備零次學習推理和推理能力。然而，將 LLM 應用於 Android 惡意軟體偵測會出現兩個主要挑戰：(1) Android 應用程式中大量的支援程式碼，通常橫跨數千個類別，超過 LLM 的上下文限制，並模糊了良性功能中的惡意行為；(2) Android 應用程式的結構複雜性和相互依賴性超過 LLM 的基於序列的推理，會造成程式碼分析破碎，並阻礙惡意意圖推論。為了應對這些挑戰，我們提出了 LAMD，一個實用的脈絡驅動架構，以支援基於 LLM 的 Android 惡意軟體偵測。LAMD 整合了關鍵脈絡萃取，以隔離與安全性至關重要的程式碼區域並建構程式結構，然後套用分層式程式碼推理，逐步分析應用程式行為，從低階指令到高階語意，提供最終預測和說明。一個設計良好的事實一致性驗證機制具備減輕 LLM 從第一層產生的幻覺的能力。在真實環境中的評估顯示，LAMD 優於傳統偵測器，為動態威脅環境中的 LLM 驅動惡意軟體分析建立了一個可行的基礎。
+
+##### **Do we still need Human Annotators? Prompting Large Language Models for Aspect Sentiment Quad Prediction**
+2502.13044v1 by Nils Constantin Hellwig, Jakob Fehle, Udo Kruschwitz, Christian Wolff
+
+Aspect sentiment quadruple prediction (ASQP) facilitates a detailed
+understanding of opinions expressed in a text by identifying the opinion term,
+aspect term, aspect category and sentiment polarity for each opinion. However,
+annotating a full set of training examples to fine-tune models for ASQP is a
+resource-intensive process. In this study, we explore the capabilities of large
+language models (LLMs) for zero- and few-shot learning on the ASQP task across
+five diverse datasets. We report F1 scores slightly below those obtained with
+state-of-the-art fine-tuned models but exceeding previously reported zero- and
+few-shot performance. In the 40-shot setting on the Rest16 restaurant domain
+dataset, LLMs achieved an F1 score of 52.46, compared to 60.39 by the
+best-performing fine-tuned method MVP. Additionally, we report the performance
+of LLMs in target aspect sentiment detection (TASD), where the F1 scores were
+also close to fine-tuned models, achieving 66.03 on Rest16 in the 40-shot
+setting, compared to 72.76 with MVP. While human annotators remain essential
+for achieving optimal performance, LLMs can reduce the need for extensive
+manual annotation in ASQP tasks.
+
+摘要：面向觀點的四元預測 (ASQP) 透過辨識各個觀點的觀點詞彙、面向詞彙、面向類別和觀點極性，協助詳細了解文字中表達的意見。然而，標註一組完整的訓練範例以微調 ASQP 模型是一個耗費資源的過程。在這項研究中，我們探討大型語言模型 (LLM) 在 ASQP 任務中進行零次和少量學習的能力，橫跨五個不同的資料集。我們報告的 F1 分數略低於使用最先進的微調模型獲得的分數，但超過先前報告的零次和少量學習表現。在 Rest16 餐廳領域資料集的 40 次學習設定中，LLM 達到了 52.46 的 F1 分數，而效能最佳的微調方法 MVP 則為 60.39。此外，我們報告了 LLM 在目標面向觀點偵測 (TASD) 中的表現，其中 F1 分數也接近微調模型，在 40 次學習設定中於 Rest16 達到 66.03，而 MVP 則為 72.76。儘管人類標註員對於達成最佳效能仍然至關重要，但 LLM 可以減少 ASQP 任務中廣泛手動標註的需求。
+
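The F1 scores quoted in the ASQP abstract above are computed over predicted quadruples. The abstract does not state the matching rule, so the sketch below assumes the common convention that a predicted quad counts only if all four elements match a gold quad exactly. The tuple order and the example sentences are hypothetical.

```python
def quad_f1(predicted, gold):
    """Micro-averaged exact-match F1 over per-sentence lists of quads."""
    true_positives = predicted_total = gold_total = 0
    for pred_quads, gold_quads in zip(predicted, gold):
        pred_set, gold_set = set(pred_quads), set(gold_quads)
        true_positives += len(pred_set & gold_set)
        predicted_total += len(pred_set)
        gold_total += len(gold_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / predicted_total
    recall = true_positives / gold_total
    return 2 * precision * recall / (precision + recall)


# Each quad: (aspect term, aspect category, opinion term, sentiment polarity).
gold = [[("pizza", "food quality", "delicious", "positive"),
         ("service", "service general", "slow", "negative")]]
predicted = [[("pizza", "food quality", "delicious", "positive"),
              ("service", "service general", "slow", "neutral")]]

print(quad_f1(predicted, gold))
```

In this example the second prediction gets three of four elements right but still counts as a miss, which shows why quad-level F1 is a strict metric.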
+##### **Natural Language Generation from Visual Sequences: Challenges and Future Directions**
+2502.13034v1 by Aditya K Surikuchi, Raquel Fernández, Sandro Pezzelle
+
+The ability to use natural language to talk about visual content is at the
+core of human intelligence and a crucial feature of any artificial intelligence
+system. Various studies have focused on generating text for single images. In
+contrast, comparatively little attention has been paid to exhaustively
+analyzing and advancing work on multiple-image vision-to-text settings. In this
+position paper, we claim that any task dealing with temporally ordered
+sequences of multiple images or frames is an instance of a broader, more
+general problem involving the understanding of intricate relationships between
+the visual content and the corresponding text. We comprehensively analyze five
+tasks that are instances of this problem and argue that they pose a common set
+of challenges and share similarities in terms of modeling and evaluation
+approaches. Based on the insights from these various aspects and stages of
+multi-image-to-text generation, we highlight several open questions and suggest
+future research directions. We believe that these directions can advance the
+understanding of complex phenomena in this domain and the development of better
+models.
+
+摘要：使用自然語言來談論視覺內容的能力是人類智慧的核心，也是任何人工智慧系統的一項關鍵功能。各種研究都專注於為單一影像產生文字。相較之下，對於詳盡分析和推進多重影像視覺轉文字設定的工作，關注較少。在此立場文件中，我們聲稱任何處理多重影像或畫格的時間順序序列的任務，都是一個更廣泛、更普遍問題的範例，涉及理解視覺內容和對應文字之間的複雜關係。我們全面分析了此問題的五個範例任務，並論證它們提出了一組常見的挑戰，且在建模和評估方法方面有相似之處。根據多重影像轉文字生成的這些不同面向和階段的見解，我們突出了幾個開放性問題，並建議未來的研究方向。我們相信這些方向可以推進對此領域中複雜現象的理解，以及開發出更好的模型。
+
+##### **HPSS: Heuristic Prompting Strategy Search for LLM Evaluators**
+2502.13031v1 by Bosi Wen, Pei Ke, Yufei Sun, Cunxiang Wang, Xiaotao Gu, Jinfeng Zhou, Jie Tang, Hongning Wang, Minlie Huang
+
+Since the adoption of large language models (LLMs) for text evaluation has
+become increasingly prevalent in the field of natural language processing
+(NLP), a series of existing works attempt to optimize the prompts for LLM
+evaluators to improve their alignment with human judgment. However, their
+efforts are limited to optimizing individual factors of evaluation prompts,
+such as evaluation criteria or output formats, neglecting the combinatorial
+impact of multiple factors, which leads to insufficient optimization of the
+evaluation pipeline. Nevertheless, identifying well-behaved prompting
+strategies for adjusting multiple factors requires extensive enumeration. To
+this end, we comprehensively integrate 8 key factors for evaluation prompts and
+propose a novel automatic prompting strategy optimization method called
+Heuristic Prompting Strategy Search (HPSS). Inspired by the genetic algorithm,
+HPSS conducts an iterative search to find well-behaved prompting strategies for
+LLM evaluators. A heuristic function is employed to guide the search process,
+enhancing the performance of our algorithm. Extensive experiments across four
+evaluation tasks demonstrate the effectiveness of HPSS, consistently
+outperforming both human-designed evaluation prompts and existing automatic
+prompt optimization methods.
+
+摘要：隨著自然語言處理（NLP）領域中採用大型語言模型（LLM）進行文本評估變得越來越普遍，一系列現有工作嘗試優化 LLM 評估器的提示，以改善它們與人類判斷的一致性。然而，他們的努力僅限於優化評估提示的個別因素，例如評估準則或輸出格式，而忽略了多種因素的組合影響，這導致評估管道優化不足。儘管如此，找出調整多種因素的良好提示策略需要廣泛的枚舉。為此，我們全面整合了評估提示的 8 個關鍵因素，並提出了一種名為啟發式提示策略搜索（HPSS）的新型自動提示策略優化方法。在遺傳演算法的啟發下，HPSS 進行反覆搜索以找出 LLM 評估器的良好提示策略。採用啟發式函數來指導搜索過程，增強了我們演算法的效能。在四項評估任務中進行的廣泛實驗證明了 HPSS 的有效性，始終優於人類設計的評估提示和現有的自動提示優化方法。
+
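To illustrate the kind of search the HPSS abstract above describes, here is a minimal mutate-and-select loop over combinations of prompt factors. It is not the authors' algorithm: the factor names, the options, and the `toy_score` function are all hypothetical, and the paper's heuristic guidance function is replaced by plain elitist selection.

```python
import random

# Hypothetical factors; the abstract mentions 8 but does not list them.
FACTORS = {
    "criteria": ["none", "brief", "detailed"],
    "output_format": ["score_only", "explanation_then_score"],
    "scoring_scale": ["1-5", "1-10"],
}


def toy_score(strategy):
    """Stand-in for agreement with human judgment on a validation set."""
    preferred = {"criteria": "detailed",
                 "output_format": "explanation_then_score",
                 "scoring_scale": "1-10"}
    return sum(strategy[name] == value for name, value in preferred.items())


def mutate(strategy, rng):
    """Copy a strategy and switch one randomly chosen factor to another option."""
    child = dict(strategy)
    name = rng.choice(sorted(FACTORS))
    child[name] = rng.choice([o for o in FACTORS[name] if o != strategy[name]])
    return child


def search(score_fn, iterations=30, population_size=4, seed=0):
    """Iteratively mutate strategies and keep the best-scoring ones."""
    rng = random.Random(seed)
    population = [{n: rng.choice(opts) for n, opts in FACTORS.items()}
                  for _ in range(population_size)]
    for _ in range(iterations):
        children = [mutate(parent, rng) for parent in population]
        ranked = sorted(population + children, key=score_fn, reverse=True)
        population = ranked[:population_size]
    return population[0]


best = search(toy_score)
print(best, toy_score(best))
```

In practice `score_fn` would run an LLM evaluator with the candidate prompt and measure its correlation with human ratings, which is far more expensive than this toy function and is the reason exhaustive enumeration is impractical.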
+##### **Whose story is it? Personalizing story generation by inferring author styles**
+2502.13028v1 by Nischal Ashok Kumar, Chau Minh Pham, Mohit Iyyer, Andrew Lan
+
+Personalization has become essential for improving user experience in
+interactive writing and educational applications, yet its potential in story
+generation remains largely unexplored. In this work, we propose a novel
+two-stage pipeline for personalized story generation. Our approach first infers
+an author's implicit story-writing characteristics from their past work and
+organizes them into an Author Writing Sheet, inspired by narrative theory. The
+second stage uses this sheet to simulate the author's persona through tailored
+persona descriptions and personalized story writing rules. To enable and
+validate our approach, we construct Mythos, a dataset of 590 stories from 64
+authors across five distinct sources that reflect diverse story-writing
+settings. A head-to-head comparison with a non-personalized baseline
+demonstrates our pipeline's effectiveness in generating high-quality
+personalized stories. Our personalized stories achieve a 75 percent win rate
+(versus 14 percent for the baseline and 11 percent ties) in capturing authors'
+writing style based on their past works. Human evaluation highlights the high
+quality of our Author Writing Sheet and provides valuable insights into the
+personalized story generation task. Notable takeaways are that writings from
+certain sources, such as Reddit, are easier to personalize than others, like
+AO3, while narrative aspects, like Creativity and Language Use, are easier to
+personalize than others, like Plot.
+
+摘要：個人化已成為改善互動式寫作和教育應用程式中使用者體驗的必要手段，然而其在故事生成中的潛力仍未被廣泛探索。在這項工作中，我們提出了一個創新的兩階段流程，用於個人化故事生成。我們的做法首先從作者過去的作品中推論出作者隱含的故事寫作特徵，並根據敘事理論將它們組織成作者寫作表。第二階段使用此表透過量身打造的角色描述和個人化故事寫作規則來模擬作者的角色。為了啟用和驗證我們的做法，我們建構了 Mythos，一個包含來自 64 位作者、橫跨五個不同來源的 590 個故事的資料集，這些故事反映了多樣化的故事寫作設定。與非個人化基準進行一對一的比較，證明了我們的流程在生成高品質個人化故事方面的有效性。我們的個人化故事以 75% 的獲勝率（相較於基準的 14% 和 11% 平手）捕捉到作者基於其過去作品的寫作風格。人類評估突顯了我們作者寫作表的優良品質，並提供了對個人化故事生成任務的寶貴見解。值得注意的是，來自某些來源（例如 Reddit）的作品比其他來源（例如 AO3）更容易個人化，而敘事層面（例如創造力和語言使用）比其他層面（例如情節）更容易個人化。
+
+##### **Agentic Deep Graph Reasoning Yields Self-Organizing Knowledge Networks**
+2502.13025v1 by Markus J. Buehler
+
+We present an agentic, autonomous graph expansion framework that iteratively
+structures and refines knowledge in situ. Unlike conventional knowledge graph
+construction methods relying on static extraction or single-pass learning, our
+approach couples a reasoning-native large language model with a continually
+updated graph representation. At each step, the system actively generates new
+concepts and relationships, merges them into a global graph, and formulates
+subsequent prompts based on its evolving structure. Through this
+feedback-driven loop, the model organizes information into a scale-free network
+characterized by hub formation, stable modularity, and bridging nodes that link
+disparate knowledge clusters. Over hundreds of iterations, new nodes and edges
+continue to appear without saturating, while centrality measures and shortest
+path distributions evolve to yield increasingly distributed connectivity. Our
+analysis reveals emergent patterns, such as the rise of highly connected 'hub'
+concepts and the shifting influence of 'bridge' nodes, indicating that agentic,
+self-reinforcing graph construction can yield open-ended, coherent knowledge
+structures. Applied to materials design problems, we present compositional
+reasoning experiments by extracting node-specific and synergy-level principles
+to foster genuinely novel knowledge synthesis, yielding cross-domain ideas that
+transcend rote summarization and strengthen the framework's potential for
+open-ended scientific discovery. We discuss other applications in scientific
+discovery and outline future directions for enhancing scalability and
+interpretability.
+
+摘要：我們提出一個能動的、自主的圖形擴展框架，它反覆地建構和精煉原位知識。與依賴靜態提取或單次學習的傳統知識圖形建構方法不同，我們的做法將一個推理原生的大語言模型與一個持續更新的圖形表示結合起來。在每一步中，系統主動產生新的概念和關係，將它們合併到一個全域圖形中，並根據其不斷演化的結構制定後續提示。透過這個回饋驅動的迴圈，模型將資訊組織成一個無標度網路，其特徵是樞紐形成、穩定的模組化以及連結不同知識群集的橋接節點。在數百次反覆運算中，新的節點和邊緣會持續出現，而不會飽和，同時中心性測量和最短路徑分佈會演化為產生越來越分散的連通性。我們的分析揭示了新興模式，例如高度連接的「樞紐」概念的興起和「橋樑」節點影響力的轉移，這表明能動的、自我強化的圖形建構可以產生開放式、連貫的知識結構。應用於材料設計問題，我們提出組合推理實驗，透過提取特定於節點的原則和協同效應層級原則，以促進真正新穎的知識綜合，產生超越死背式摘要並強化框架在開放式科學發現中潛力的跨領域想法。我們討論了在科學發現中的其他應用，並概述了增強可擴充性和可解釋性的未來方向。
+
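The merge step and the hub analysis mentioned in the abstract above can be sketched with a plain adjacency map. This is only an illustration: the triples are invented, the LLM generation step is omitted, and degree is used as the simplest possible stand-in for the centrality measures the abstract refers to.

```python
def merge_triples(graph, triples):
    """Merge (source, relation, target) triples into an undirected adjacency map."""
    for source, _relation, target in triples:
        graph.setdefault(source, set()).add(target)
        graph.setdefault(target, set()).add(source)
    return graph


def top_hubs(graph, k=1):
    """Return the k highest-degree nodes, breaking ties alphabetically."""
    return sorted(graph, key=lambda node: (-len(graph[node]), node))[:k]


# Two hypothetical iterations, each contributing a batch of new triples.
graph = {}
merge_triples(graph, [("silk", "has_property", "toughness"),
                      ("silk", "used_in", "sutures")])
merge_triples(graph, [("collagen", "used_in", "sutures"),
                      ("silk", "compared_with", "collagen")])

print(top_hubs(graph, k=1))
```

In the framework described by the abstract, each new batch would come from prompting the model with the current graph state, so the hub ranking would feed back into what gets generated next.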
+##### **Oreo: A Plug-in Context Reconstructor to Enhance Retrieval-Augmented Generation**
+2502.13019v1 by Sha Li, Naren Ramarkrishnan
+
+Despite the remarkable capabilities of Large Language Models (LLMs) in
+various NLP tasks, they remain vulnerable to hallucinations due to their
+limited parametric knowledge and lack of domain-specific expertise.
+Retrieval-Augmented Generation (RAG) addresses this challenge by incorporating
+external document retrieval to augment the knowledge base of LLMs. In this
+approach, RAG retrieves document chunks from an external corpus in response to
+a query, which are then used as context for the downstream language model to
+generate an answer. However, these retrieved knowledge sources often include
+irrelevant or erroneous information, undermining the effectiveness of RAG in
+downstream tasks. To overcome this limitation, we introduce a compact,
+efficient, and pluggable module designed to refine external knowledge sources
+before feeding them to the generator. The module reconstructs retrieved content
+by extracting the most relevant and supportive information and reorganising it
+into a concise, query-specific format. Through a three-stage training paradigm
+(comprising supervised fine-tuning, contrastive multi-task learning, and
+reinforcement learning-based alignment), it prioritises critical knowledge and
+aligns it with the generator's preferences. This method enables LLMs to produce
+outputs that are more accurate, reliable, and contextually appropriate.
+
+摘要：儘管大型語言模型 (LLM) 在各種自然語言處理任務中具備卓越的能力，但由於其參數知識有限且缺乏特定領域的專業知識，因此它們仍然容易出現幻覺。檢索增強式生成 (RAG) 透過納入外部文件檢索來擴充 LLM 的知識庫，以應對此項挑戰。在此方法中，RAG 會根據查詢檢索外部語料庫中的文件區塊，然後將其用作下游語言模型的背景，以產生答案。然而，這些檢索到的知識來源通常包含不相關或錯誤的資訊，因而損害了 RAG 在下游任務中的效能。為了克服此項限制，我們引入了一個精簡、有效率且可插入的模組，用於在將外部知識來源提供給生成器之前對其進行精煉。此模組透過提取最相關且有用的資訊並將其重新組織成簡潔且特定於查詢的格式，來重建檢索到的內容。透過三階段訓練範例 - 包含監督微調、對比多任務學習以及基於強化學習的比對 - 它優先考量關鍵知識，並使其與生成器的偏好相符。此方法可讓 LLM 產生更準確、可靠且在語境上更適當的輸出。
+
+##### **LLM-Powered Proactive Data Systems**
+2502.13016v1 by Sepanta Zeighami, Yiming Lin, Shreya Shankar, Aditya Parameswaran
+
+With the power of LLMs, we now have the ability to query data that was
+previously impossible to query, including text, images, and video. However,
+despite this enormous potential, most present-day data systems that leverage
+LLMs are reactive, reflecting our community's desire to map LLMs to known
+abstractions. Most data systems treat LLMs as an opaque black box that operates
+on user inputs and data as is, optimizing them much like any other approximate,
+expensive UDFs, in conjunction with other relational operators. Such data
+systems do as they are told, but fail to understand and leverage what the LLM
+is being asked to do (i.e., the underlying operations, which may be
+error-prone), the data the LLM is operating on (e.g., long, complex documents),
+or what the user really needs. They don't take advantage of the characteristics
+of the operations and/or the data at hand, or ensure correctness of results
+when there are imprecisions and ambiguities. We argue that data systems instead
+need to be proactive: they need to be given more agency -- armed with the power
+of LLMs -- to understand and rework the user inputs and the data and to make
+decisions on how the operations and the data should be represented and
+processed. By allowing the data system to parse, rewrite, and decompose user
+inputs and data, or to interact with the user in ways that go beyond the
+standard single-shot query-result paradigm, the data system is able to address
+user needs more efficiently and effectively. These new capabilities lead to a
+rich design space where the data system takes more initiative: it is
+empowered to perform optimization based on the transformation operations, data
+characteristics, and user intent. We discuss various successful examples of how
+this framework has been and can be applied in real-world tasks, and present
+future directions for this ambitious research agenda.
+
+摘要：<paragraph>透過 LLM 的強大功能，我們現在能夠查詢過去無法查詢的資料，包括文字、圖片和影片。然而，儘管有如此龐大的潛力，但現今大多數利用 LLM 的資料系統都是被動的，反映出我們的社群希望將 LLM 映射到已知的抽象化。大多數資料系統將 LLM 視為一個不透明的黑盒子，以使用者輸入和資料為基礎進行運作，並像其他近似、昂貴的 UDF 一樣最佳化它們，並與其他關聯運算子結合使用。這些資料系統會照著指示執行，但無法理解並運用 LLM 被要求執行的任務（例如可能容易出錯的基本運算）、LLM 正在運算的資料（例如冗長、複雜的文件），或使用者真正需要的是什麼。它們不會利用運算和/或手邊資料的特性，或在有誤差和歧義時確保結果的正確性。我們認為資料系統應該改為主動：它們需要被賦予更多自主權，並具備 LLM 的強大功能，以了解並重新處理使用者輸入和資料，並就運算和資料的表示和處理方式做出決策。透過允許資料系統解析、改寫和分解使用者輸入和資料，或以超越標準單次查詢結果模式的方式與使用者互動，資料系統能夠更有效率且有效地滿足使用者的需求。這些新功能會帶來一個豐富的設計空間，讓資料系統發揮更多主導性：它們有能力根據轉換運算、資料特性和使用者意圖進行最佳化。我們將討論這個架構如何應用於實際任務，並提出這個雄心勃勃的研究議程的未來方向。</paragraph>
+
+##### **Towards a Design Guideline for RPA Evaluation: A Survey of Large Language Model-Based Role-Playing Agents**
+2502.13012v1 by Chaoran Chen, Bingsheng Yao, Ruishi Zou, Wenyue Hua, Weimin Lyu, Toby Jia-Jun Li, Dakuo Wang
+
+The Role-Playing Agent (RPA) is an increasingly popular type of LLM agent that
+simulates human-like behaviors in a variety of tasks. However, evaluating RPAs
+is challenging due to diverse task requirements and agent designs. This paper
+proposes an evidence-based, actionable, and generalizable evaluation design
+guideline for LLM-based RPAs by systematically reviewing 1,676 papers published
+between Jan. 2021 and Dec. 2024. Our analysis identifies six agent attributes,
+seven task attributes, and seven evaluation metrics from existing literature.
+Based on these findings, we present an RPA evaluation design guideline to help
+researchers develop more systematic and consistent evaluation methods.
+
+摘要：角色扮演代理（RPA）是一種越來越流行的 LLM 代理，它能模擬人類在各種任務中的行為。然而，由於任務需求和代理設計的多樣性，評估 RPA 具有挑戰性。本文通過系統地審查 2021 年 1 月至 2024 年 12 月期間發表的 1,676 篇論文，提出了基於證據、可操作且可推廣的 LLM 基於 RPA 的評估設計指南。我們的分析從現有文獻中識別出六個代理屬性、七個任務屬性和七個評估指標。根據這些發現，我們提出了 RPA 評估設計指南，以幫助研究人員開發更系統化和一致的評估方法。
+
+##### **Adaptive Knowledge Graphs Enhance Medical Question Answering: Bridging the Gap Between LLMs and Evolving Medical Knowledge**
+2502.13010v1 by Mohammad Reza Rezaei, Reza Saadati Fard, Jayson Parker, Rahul G. Krishnan, Milad Lankarany
+
+Large Language Models (LLMs) have significantly advanced medical
+question-answering by leveraging extensive clinical data and medical
+literature. However, the rapid evolution of medical knowledge and the
+labor-intensive process of manually updating domain-specific resources pose
+challenges to the reliability of these systems. To address this, we introduce
+Adaptive Medical Graph-RAG (AMG-RAG), a comprehensive framework that automates
+the construction and continuous updating of medical knowledge graphs,
+integrates reasoning, and retrieves current external evidence, such as PubMed
+and WikiSearch. By dynamically linking new findings and complex medical
+concepts, AMG-RAG not only improves accuracy but also enhances interpretability
+in medical queries.
+
+Evaluations on the MEDQA and MEDMCQA benchmarks demonstrate the effectiveness
+of AMG-RAG, achieving an F1 score of 74.1 percent on MEDQA and an accuracy of
+66.34 percent on MEDMCQA, outperforming both comparable models and those 10 to
+100 times larger. Notably, these improvements are achieved without increasing
+computational overhead, highlighting the critical role of automated knowledge
+graph generation and external evidence retrieval in delivering up-to-date,
+trustworthy medical insights.
+
+摘要：大型語言模型 (LLM) 透過利用廣泛的臨床資料和醫學文獻，大幅提升了醫療問題解答的進步。然而，醫療知識的快速演進和手動更新特定領域資源的繁複程序，對這些系統的可靠性構成挑戰。為了解決這個問題，我們引入了適應性醫療圖表 RAG (AMG-RAG)，這是一個自動化建構和持續更新醫療知識圖表的綜合架構，整合推理並擷取 PubMed 和 WikiSearch 等最新的外部證據。透過動態連結新的發現和複雜的醫療概念，AMG-RAG 不僅提升了準確性，也增強了醫療查詢的可解釋性。在 MEDQA 和 MEDMCQA 基準上的評量證明了 AMG-RAG 的有效性，在 MEDQA 上達到了 74.1% 的 F1 分數，在 MEDMCQA 上達到了 66.34% 的準確度，優於其他同類模型以及那些大 10 到 100 倍的模型。值得注意的是，這些改進是在不增加運算負擔的情況下實現的，突顯了自動化知識圖表生成和外部證據擷取在提供最新、可信賴的醫療見解中扮演的重要角色。
+
+##### **Integrating Reinforcement Learning, Action Model Learning, and Numeric Planning for Tackling Complex Tasks**
+2502.13006v1 by Yarin Benyamin, Argaman Mordoch, Shahaf S. Shperberg, Roni Stern
+
+Automated Planning algorithms require a model of the domain that specifies
+the preconditions and effects of each action. Obtaining such a domain model is
+notoriously hard. Algorithms for learning domain models exist, yet it remains
+unclear whether learning a domain model and planning is an effective approach
+for numeric planning environments, i.e., where states include discrete and
+numeric state variables. In this work, we explore the benefits of learning a
+numeric domain model and compare it with alternative model-free solutions. As a
+case study, we use two tasks in Minecraft, a popular sandbox game that has been
+used as an AI challenge. First, we consider an offline learning setting, where
+a set of expert trajectories is available to learn from. This is the standard
+setting for learning domain models. We used the Numeric Safe Action Model
+Learning (NSAM) algorithm to learn a numeric domain model and solve new
+problems with the learned domain model and a numeric planner. We call this
+model-based solution NSAM_(+p), and compare it to several model-free Imitation
+Learning (IL) and Offline Reinforcement Learning (RL) algorithms. Empirical
+results show that some IL algorithms learn to solve simple tasks faster,
+while NSAM_(+p) allows solving tasks that require long-term planning and
+enables generalizing to solve problems in larger environments. Then, we
+consider an online learning setting, where learning is done by moving an agent
+in the environment. For this setting, we introduce RAMP. In RAMP, observations
+collected during the agent's execution are used to simultaneously train an RL
+policy and learn a planning domain action model. This forms a positive feedback
+loop between the RL policy and the learned domain model. We demonstrate
+experimentally the benefits of using RAMP, showing that it finds more efficient
+plans and solves more problems than several RL baselines.
+
+摘要：<paragraph>自動化規劃演算法需要一個網域模型，來指定每個動作的前提條件和效果。取得這樣的網域模型出了名的困難。學習網域模型的演算法確實存在，但學習網域模型和規劃是否為數值規劃環境的有效方法仍然不清楚，也就是說，其中狀態包含離散和數值狀態變數。在這項工作中，我們探討學習數值網域模型的優點，並將其與替代的無模型解決方案進行比較。作為一個案例研究，我們使用 Minecraft 中的兩個任務，Minecraft 是一個流行的沙盒遊戲，已被用作 AI 挑戰。首先，我們考慮離線學習設定，其中有一組專家軌跡可供學習。這是學習網域模型的標準設定。我們使用數值安全動作模型學習 (NSAM) 演算法來學習數值網域模型，並使用已學習的網域模型和數值規劃器解決新問題。我們稱此模型為基礎的解決方案 NSAM_(+p)，並將其與多種無模型模仿學習 (IL) 和離線強化學習 (RL) 演算法進行比較。經驗結果顯示，一些 IL 演算法可以更快地學習解決簡單任務，而 NSAM_(+p) 允許解決需要長期規劃的任務，並能夠推廣到在更大環境中解決問題。然後，我們考慮線上學習設定，其中學習是透過在環境中移動代理來完成的。對於此設定，我們引入了 RAMP。在 RAMP 中，在代理執行期間收集的觀察結果用於同時訓練 RL 政策和學習規劃網域動作模型。這在 RL 政策和已學習的網域模型之間形成了一個正向回饋迴路。我們透過實驗證明了使用 RAMP 的好處，顯示它比多個 RL 基準找到了更有效的計畫，並解決了更多問題。</paragraph>
+
+##### **Language Barriers: Evaluating Cross-Lingual Performance of CNN and Transformer Architectures for Speech Quality Estimation**
+2502.13004v1 by Wafaa Wardah, Tuğçe Melike Koçak Büyüktaş, Kirill Shchegelskiy, Sebastian Möller, Robert P. Spang
+
+Objective speech quality models aim to predict human-perceived speech quality
+using automated methods. However, cross-lingual generalization remains a major
+challenge, as Mean Opinion Scores (MOS) vary across languages due to
+linguistic, perceptual, and dataset-specific differences. A model trained
+primarily on English data may struggle to generalize to languages with
+different phonetic, tonal, and prosodic characteristics, leading to
+inconsistencies in objective assessments. This study investigates the
+cross-lingual performance of two speech quality models: NISQA, a CNN-based
+model, and a Transformer-based Audio Spectrogram Transformer (AST) model. Both
+models were trained exclusively on English datasets containing over 49,000
+speech samples and subsequently evaluated on speech in German, French,
+Mandarin, Swedish, and Dutch. We analyze model performance using Pearson
+Correlation Coefficient (PCC) and Root Mean Square Error (RMSE) across five
+speech quality dimensions: coloration, discontinuity, loudness, noise, and MOS.
+Our findings show that while AST achieves a more stable cross-lingual
+performance, both models exhibit noticeable biases. Notably, Mandarin speech
+quality predictions correlate highly with human MOS scores, whereas Swedish and
+Dutch present greater prediction challenges. Discontinuities remain difficult
+to model across all languages. These results highlight the need for more
+balanced multilingual datasets and architecture-specific adaptations to improve
+cross-lingual generalization.
+
+摘要：客觀語音品質模型旨在使用自動化方法預測人類感知的語音品質。然而，跨語言的概化仍然是一項重大挑戰，因為平均意見分數 (MOS) 會因語言的不同而有所不同，這是由於語言、感知和特定於資料集的差異所致。主要使用英語資料訓練的模型可能會難以概化到具有不同語音、聲調和韻律特徵的語言，導致客觀評估不一致。本研究探討了兩種語音品質模型的跨語言效能：基於 CNN 的 NISQA 模型和基於 Transformer 的音訊光譜 Transformer (AST) 模型。這兩種模型都僅使用包含超過 49,000 個語音範例的英語資料集進行訓練，然後在德語、法語、普通話、瑞典語和荷蘭語的語音上進行評估。我們使用皮爾森相關係數 (PCC) 和均方根誤差 (RMSE) 分析五個語音品質維度的模型效能：色彩、不連續性、響度、雜訊和 MOS。我們的研究結果顯示，儘管 AST 達到了更穩定的跨語言效能，但這兩種模型都表現出明顯的偏差。值得注意的是，普通話語音品質預測與人類 MOS 分數高度相關，而瑞典語和荷蘭語則呈現出更大的預測挑戰。不連續性在所有語言中仍然難以建模。這些結果凸顯了對更平衡的多語言資料集和特定於架構的調整的需求，以改善跨語言的概化。
+
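The speech-quality study above reports Pearson Correlation Coefficient (PCC) and Root Mean Square Error (RMSE) between predicted and human scores. A minimal sketch of the two metrics follows, assuming plain Python lists of scores; the function names and the MOS values are hypothetical and are not taken from the paper.

```python
import math


def pearson_cc(pred, target):
    """Pearson correlation between two equal-length score lists."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    return cov / math.sqrt(var_p * var_t)


def rmse(pred, target):
    """Root mean square error between two equal-length score lists."""
    squared = sum((p - t) ** 2 for p, t in zip(pred, target))
    return math.sqrt(squared / len(pred))


# Hypothetical ratings: the model tracks the human trend perfectly
# but is offset by a constant 0.5 on every sample.
human_mos = [1.0, 2.0, 3.0, 4.0, 5.0]
model_mos = [1.5, 2.5, 3.5, 4.5, 5.5]

pcc_value = pearson_cc(model_mos, human_mos)
rmse_value = rmse(model_mos, human_mos)
print(pcc_value, rmse_value)
```

In this toy case the correlation is perfect while the RMSE is non-zero, which is one reason the two metrics are usually reported together: PCC ignores a constant offset, RMSE does not.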
+##### **You need to MIMIC to get FAME: Solving Meeting Transcript Scarcity with a Multi-Agent Conversations**
+2502.13001v1 by Frederic Kirstein, Muneeb Khan, Jan Philip Wahle, Terry Ruas, Bela Gipp
+
+Meeting summarization suffers from limited high-quality data, mainly due to
+privacy restrictions and expensive collection processes. We address this gap
+with FAME, a dataset of 500 meetings in English and 300 in German produced by
+MIMIC, our new multi-agent meeting synthesis framework that generates meeting
+transcripts on a given knowledge source by defining psychologically grounded
+participant profiles, outlining the conversation, and orchestrating a large
+language model (LLM) debate. A modular post-processing step refines these
+outputs, mitigating potential repetitiveness and overly formal tones, ensuring
+coherent, credible dialogues at scale. We also propose a psychologically
+grounded evaluation framework assessing naturalness, social behavior
+authenticity, and transcript difficulties. Human assessments show that FAME
+approximates real-meeting spontaneity (4.5/5 in naturalness), preserves
+speaker-centric challenges (3/5 in spoken language), and introduces richer
+information-oriented difficulty (4/5 in difficulty). These findings highlight
+that FAME is a good and scalable proxy for real-world meeting conditions. It
+enables new test scenarios for meeting summarization research and other
+conversation-centric applications in tasks requiring conversation data or
+simulating social scenarios under behavioral constraints.
+
+摘要：會議摘要因缺乏高品質資料而受限，主要是由於隱私限制和昂貴的收集程序。我們透過 FAME 來解決這個差距，FAME 是 MIMIC 製作的 500 場英文會議和 300 場德文會議的資料集，MIMIC 是我們新的多重代理會議合成架構，透過定義心理基礎的參與者設定檔、概述對話，並協調大型語言模型 (LLM) 辯論，在給定的知識來源上產生會議記錄。模組化後處理步驟會改善這些輸出，減輕潛在的重複性和過於正式的語氣，確保大規模的對話連貫且可信。我們也提出一個心理基礎的評估架構，評估自然性、社交行為真實性，以及記錄難度。人類評估顯示，FAME 近似於真實會議的即興性（自然性 4.5/5），保留以講者為中心的挑戰（口語 3/5），並引入更豐富的資訊導向難度（難度 4/5）。這些發現強調 FAME 是真實世界會議條件的良好且可擴充的代理。它能為會議摘要研究和其他對話為中心的應用程式啟用新的測試情境，在需要對話資料或在行為限制下模擬社交情境的任務中。
+
+##### **Personalized Top-k Set Queries Over Predicted Scores**
+2502.12998v1 by Sohrab Namazi Nia, Subhodeep Ghosh, Senjuti Basu Roy, Sihem Amer-Yahia
+
+This work studies the applicability of expensive external oracles such as
+large language models in answering top-k queries over predicted scores. Such
+scores are incurred by user-defined functions to answer personalized queries
+over multi-modal data. We propose a generic computational framework that
+handles arbitrary set-based scoring functions, as long as the functions can
+be decomposed into constructs, each of which is sent to an oracle (in our case an
+LLM) to predict partial scores. At a given point in time, the framework assumes
+a set of responses and their partial predicted scores, and it maintains a
+collection of possible sets that are likely to be the true top-k. Since calling
+oracles is costly, our framework judiciously identifies the next construct,
+i.e., the next best question to ask the oracle so as to maximize the likelihood
+of identifying the true top-k. We present a principled probabilistic model that
+quantifies that likelihood. We study efficiency opportunities in designing
+algorithms. We run an evaluation with three large scale datasets, scoring
+functions, and baselines. Experiments indicate the efficacy of our framework,
+as it requires an order of magnitude fewer LLM calls than the baselines
+while ensuring result accuracy. Scalability experiments further
+indicate that our framework could be used in large-scale applications.
+
+摘要：本研究探討在預測分數中回答前 k 個查詢時，昂貴的外部預言（例如大型語言模型）的適用性。此類分數是由使用者定義的函式產生，用於回答多模態資料中的個人化查詢。我們提出一個通用的運算框架，用於處理任意基於集合的計分函式，只要這些函式可以分解為建構區塊，然後將每個建構區塊傳送給預言（在本例中為 LLM）以預測部分分數。在特定時間點，此框架假設一組回應及其部分預測分數，並維護一組可能成為真實前 k 個的集合。由於呼叫預言的成本很高，因此我們的框架會明智地找出下一個建構區塊，亦即下一個最佳問題，以詢問預言，以便最大化找出真實前 k 個的可能性。我們提出一個基於原理的機率模型，用於量化此可能性。我們研究設計演算法時的效率機會。我們針對三個大型資料集、計分函式和基準執行評估。實驗結果指出我們框架的效能，因為它在需要 LLM 呼叫的同時確保結果準確性，比基準進步了一個數量級。可擴充性實驗進一步指出我們的框架可用於大型應用程式。
+
+##### **Eager Updates For Overlapped Communication and Computation in DiLoCo**
+2502.12996v1 by Satyen Kale, Arthur Douillard, Yanislav Donchev
+
+Distributed optimization methods such as DiLoCo have been shown to be
+effective in training very large models across multiple distributed workers,
+such as datacenters. These methods split updates into two parts: an inner
+optimization phase, where the workers independently execute multiple
+optimization steps on their own local data, and an outer optimization step,
+where the inner updates are synchronized. While such approaches require orders
+of magnitude less communication than standard data-parallel training, in
+settings where the workers are datacenters, even the limited communication
+requirements of these approaches can still cause significant slowdowns due to
+the blocking necessary at each outer optimization step. In this paper, we
+investigate techniques to mitigate this issue by overlapping communication with
+computation in a manner that allows the outer optimization step to fully
+overlap with the inner optimization phase. We show that a particular variant,
+dubbed eager updates, provides competitive performance with standard DiLoCo in
+settings with low bandwidth between workers.
+
+摘要：分散式優化方法（例如 DiLoCo）已被證明可有效訓練橫跨多個分散式工作者的超大型模型，例如資料中心。這些方法將更新拆分為兩部分：內部最佳化階段，其中工作者獨立地在自己的本地資料上執行多個最佳化步驟，以及外部最佳化步驟，其中內部更新會同步。雖然此類方法所需的通訊量比標準資料平行訓練少幾個數量級，但在工作者為資料中心的情況下，即使這些方法有限的通訊需求仍可能由於每個外部最佳化步驟所需的封鎖而導致顯著的減速。在本文中，我們探討了透過以允許外部最佳化步驟與內部最佳化階段完全重疊的方式將通訊與運算重疊，來減輕此問題的技術。我們展示了一個特定變體，稱為即時更新，在工作者之間頻寬較低的情況下，可提供與標準 DiLoCo 相當的效能。
+
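The DiLoCo abstract above describes an inner phase of independent local steps followed by an outer step that synchronizes the inner updates. The sketch below illustrates only that two-level structure on a toy one-dimensional problem. It assumes a plain SGD outer step and a quadratic loss per worker; all names and hyperparameters are hypothetical. It does not implement the eager-update overlap of communication and computation that the paper proposes.

```python
def inner_phase(theta, data_mean, steps, lr):
    """Run `steps` local gradient steps on the loss 0.5 * (theta - data_mean)**2."""
    for _ in range(steps):
        grad = theta - data_mean
        theta -= lr * grad
    return theta


def outer_round(theta_global, worker_means, inner_steps, inner_lr, outer_lr):
    """One round: every worker trains locally, then the updates are synchronized."""
    local_thetas = [
        inner_phase(theta_global, mean, inner_steps, inner_lr)
        for mean in worker_means
    ]
    # Outer "gradient": average displacement of the workers from the global value.
    outer_grad = sum(theta_global - t for t in local_thetas) / len(local_thetas)
    # Assumption: plain SGD as the outer optimizer (the abstract does not name one).
    return theta_global - outer_lr * outer_grad


# Two hypothetical workers whose local optima are 1.0 and 3.0.
theta = 0.0
for _ in range(5):
    theta = outer_round(theta, [1.0, 3.0], inner_steps=10, inner_lr=0.5, outer_lr=1.0)
print(theta)
```

With an outer learning rate of 1.0 the outer step reduces to averaging the workers' parameters, so the global value settles near the midpoint of the two local optima. Communication happens once per round rather than once per gradient step, which is the saving the abstract refers to.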
+##### **Free Argumentative Exchanges for Explaining Image Classifiers**
+2502.12995v1 by Avinash Kori, Antonio Rago, Francesca Toni
+
+Deep learning models are powerful image classifiers but their opacity hinders
+their trustworthiness. Explanation methods for capturing the reasoning process
+within these classifiers faithfully and in a clear manner are scarce, due to
+their sheer complexity and size. We provide a solution for this problem by
+defining a novel method for explaining the outputs of image classifiers with
+debates between two agents, each arguing for a particular class. We obtain
+these debates as concrete instances of Free Argumentative eXchanges (FAXs), a
+novel argumentation-based multi-agent framework allowing agents to internalise
+opinions by other agents differently than originally stated. We define two
+metrics (consensus and persuasion rate) to assess the usefulness of FAXs as
+argumentative explanations for image classifiers. We then conduct a number of
+empirical experiments showing that FAXs perform well along these metrics as
+well as being more faithful to the image classifiers than conventional,
+non-argumentative explanation methods. All our implementations can be found at
+https://github.com/koriavinash1/FAX.
+
+摘要：深度學習模型是強大的影像分類器，但其不透明性阻礙了其可信度。由於其極高的複雜性和規模，忠實且清楚地捕捉這些分類器內部推理過程的解釋方法很少見。我們透過定義一種新穎的方法來解決這個問題，該方法透過兩個代理之間的辯論來解釋影像分類器的輸出，每個代理都主張一個特定類別。我們將這些辯論作為自由論證交換 (FAX) 的具體實例，這是一個新穎的基於論證的多代理架構，允許代理以不同於原始陳述的方式內化其他代理的意見。我們定義了兩個指標（共識率和說服率）來評估 FAX 作為影像分類器論證解釋的有用性。然後，我們進行了多項實證實驗，表明 FAX 在這些指標上表現良好，並且比傳統的非論證解釋方法更忠實於影像分類器。我們所有的實作都可以在 https://github.com/koriavinash1/FAX 中找到。
+
+##### **B-cos LM: Efficiently Transforming Pre-trained Language Models for Improved Explainability**
+2502.12992v1 by Yifan Wang, Sukrut Rao, Ji-Ung Lee, Mayank Jobanputra, Vera Demberg
+
+Post-hoc explanation methods for black-box models often struggle with
+faithfulness and human interpretability due to the lack of explainability in
+current neural models. Meanwhile, B-cos networks have been introduced to
+improve model explainability through architectural and computational
+adaptations, but their application has so far been limited to computer vision
+models and their associated training pipelines. In this work, we introduce
+B-cos LMs, i.e., B-cos networks empowered for NLP tasks. Our approach directly
+transforms pre-trained language models into B-cos LMs by combining B-cos
+conversion and task fine-tuning, improving efficiency compared to previous
+B-cos methods. Our automatic and human evaluation results demonstrate that
+B-cos LMs produce more faithful and human-interpretable explanations than
+post-hoc methods, while maintaining task performance comparable to conventional
+fine-tuning. Our in-depth analysis explores how B-cos LMs differ from
+conventionally fine-tuned models in their learning processes and explanation
+patterns. Finally, we provide practical guidelines for effectively building
+B-cos LMs based on our findings. Our code is available at
+https://anonymous.4open.science/r/bcos_lm.
+
+摘要：黑盒模型的事后解释方法通常会因为当前神经模型缺乏可解释性而难以做到忠实和人类可解释。与此同时，B-cos 网络已被引入，以通过架构和计算改编来提高模型的可解释性，但到目前为止，它们的应用仅限于计算机视觉模型及其相关的训练管道。在这项工作中，我们引入了 B-cos LM，即针对 NLP 任务增强的 B-cos 网络。我们的方法通过结合 B-cos 转换和任务微调，将预训练的语言模型直接转换为 B-cos LM，与以前 B-cos 方法相比，提高了效率。我们的自动和人工评估结果表明，与事后方法相比，B-cos LM 产生了更忠实和人类可解释的解释，同时保持与传统微调相当的任务性能。我们的深入分析探讨了 B-cos LM 在其学习过程和解释模式中与传统微调模型有何不同。最后，我们根据我们的发现提供了有效构建 B-cos LM 的实用指南。我们的代码可在 https://anonymous.4open.science/r/bcos_lm 获得。
+
+##### **Beyond Profile: From Surface-Level Facts to Deep Persona Simulation in LLMs**
+2502.12988v1 by Zixiao Wang, Duzhen Zhang, Ishita Agrawal, Shen Gao, Le Song, Xiuying Chen
+
+Previous approaches to persona simulation with large language models (LLMs) have
+typically relied on learning basic biographical information, or using limited
+role-play dialogue datasets to capture a character's responses. However, a
+holistic representation of an individual goes beyond surface-level facts or
+conversations to deeper thoughts and thinking. In this work, we introduce
+CharacterBot, a model designed to replicate both the linguistic patterns and
+distinctive thought processes of a character. Using Lu Xun, a renowned Chinese
+writer, as a case study, we propose four training tasks derived from his 17
+essay collections. These include a pre-training task focused on mastering
+external linguistic structures and knowledge, as well as three fine-tuning
+tasks: multiple-choice question answering, generative question answering, and
+style transfer, each aligning the LLM with Lu Xun's internal ideation and
+writing style. To optimize learning across these tasks, we introduce a CharLoRA
+parameter updating mechanism, where a general linguistic style expert
+collaborates with other task-specific experts to better study both the language
+style and the understanding of deeper thoughts. We evaluate CharacterBot on
+three tasks for linguistic accuracy and opinion comprehension, demonstrating
+that it significantly outperforms the baselines on our adapted metrics. We hope
+that this work inspires future research on deep character persona simulation
+with LLMs.
+
+摘要：<paragraph>以前對角色模擬大型語言模型 (LLM) 的方法通常依賴於學習基本傳記資訊，或使用有限的角色扮演對話資料集來捕捉角色的反應。然而，對個人的整體表徵超越了表面層面的事實或對話，深入到更深層的想法和思考。在這項工作中，我們引入了 CharacterBot，一個旨在複製角色的語言模式和獨特思考過程的模型。以著名的中國作家魯迅為案例研究，我們提出了四個從他的 17 篇散文集中衍生的訓練任務。其中包括一個預訓練任務，專注於掌握外部語言結構和知識，以及三個微調任務：多選題回答、生成式問答和風格轉移，每個任務都將 LLM 與魯迅的內部觀念和寫作風格相結合。為了優化這些任務的學習，我們引入了一個 CharLoRA 參數更新機制，其中一位通曉語言風格的專家與其他特定任務專家合作，以更好地研究語言風格和對深層思想的理解。我們在三項任務上評估了 CharacterBot 的語言準確性和意見理解，證明它在我們調整的指標上顯著優於基準。我們希望這項工作能激勵未來對深度角色角色模擬 LLM 的研究。</paragraph>
+
+##### **PartSDF: Part-Based Implicit Neural Representation for Composite 3D Shape Parametrization and Optimization**
+2502.12985v1 by Nicolas Talabot, Olivier Clerc, Arda Cinar Demirtas, Doruk Oner, Pascal Fua
+
+Accurate 3D shape representation is essential in engineering applications
+such as design, optimization, and simulation. In practice, engineering
+workflows require structured, part-aware representations, as objects are
+inherently designed as assemblies of distinct components. However, most
+existing methods either model shapes holistically or decompose them without
+predefined part structures, limiting their applicability in real-world design
+tasks. We propose PartSDF, a supervised implicit representation framework that
+explicitly models composite shapes with independent, controllable parts while
+maintaining shape consistency. Despite its simple single-decoder architecture,
+PartSDF outperforms both supervised and unsupervised baselines in
+reconstruction and generation tasks. We further demonstrate its effectiveness
+as a structured shape prior for engineering applications, enabling precise
+control over individual components while preserving overall coherence. Code
+available at https://github.com/cvlab-epfl/PartSDF.
+
+摘要：精確的 3D 形狀表示在工程應用中至關重要，例如設計、最佳化和模擬。實際上，工程工作流程需要結構化、零件感知的表示，因為物體本質上是設計為不同元件的組件。然而，大多數現有方法不是整體建模形狀，就是將其分解，而沒有預先定義的零件結構，這限制了它們在實際設計任務中的適用性。我們提出 PartSDF，一個監督式的隱式表示框架，它明確地使用獨立、可控的零件對複合形狀進行建模，同時保持形狀一致性。儘管其單一的解碼器架構很簡單，但 PartSDF 在重建和生成任務中都優於監督式和非監督式基準。我們進一步證明了其作為工程應用結構化形狀先驗的有效性，能夠精確控制各個元件，同時保持整體一致性。程式碼可在 https://github.com/cvlab-epfl/PartSDF 取得。
+
+##### **Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs**
+2502.12982v1 by Longxu Dou, Qian Liu, Fan Zhou, Changyu Chen, Zili Wang, Ziqi Jin, Zichen Liu, Tongyao Zhu, Cunxiao Du, Penghui Yang, Haonan Wang, Jiaheng Liu, Yongchi Zhao, Xiachong Feng, Xin Mao, Man Tsung Yeung, Kunat Pipatanakul, Fajri Koto, Min Si Thu, Hynek Kydlíček, Zeyi Liu, Qunshu Lin, Sittipong Sripaisarnmongkol, Kridtaphad Sae-Khow, Nirattisai Thongchim, Taechawat Konkaew, Narong Borijindargoon, Anh Dao, Matichon Maneegard, Phakphum Artkaew, Zheng-Xin Yong, Quan Nguyen, Wannaphong Phatthiyaphaibun, Hoang H. Tran, Mike Zhang, Shiqi Chen, Tianyu Pang, Chao Du, Xinyi Wan, Wei Lu, Min Lin
+
+Sailor2 is a family of cutting-edge multilingual language models for
+South-East Asian (SEA) languages, available in 1B, 8B, and 20B sizes to suit
+diverse applications. Building on Qwen2.5, Sailor2 undergoes continuous
+pre-training on 500B tokens (400B SEA-specific and 100B replay tokens) to
+support 13 SEA languages while retaining proficiency in Chinese and English.
+The Sailor2-20B model achieves a 50-50 win rate against GPT-4o across SEA
+languages. We also deliver a comprehensive cookbook on how to develop the
+multilingual model in an efficient manner, including five key aspects: data
+curation, pre-training, post-training, model customization and evaluation. We
+hope that the Sailor2 model (Apache 2.0 license) will drive language development in
+the SEA region, and that the Sailor2 cookbook will inspire researchers to build more
+inclusive LLMs for other under-served languages.
+
+摘要：Sailor2 是一系列針對東南亞 (SEA) 語言的尖端多語言語言模型，備有 1B、8B 和 20B 大小，以適應各種應用。在 Qwen2.5 的基礎上，Sailor2 持續進行 500B 代幣（400B SEA 專用和 100B 重播代幣）的預訓練，以支援 13 種 SEA 語言，同時保留中文和英文的熟練度。Sailor2-20B 模型在 SEA 語言中對抗 GPT-4o 時，達到 50-50 的獲勝率。我們還提供一本全面的食譜，說明如何以有效的方式開發多語言模型，包括五個關鍵方面：資料策展、預訓練、後訓練、模型自訂和評估。我們希望 Sailor2 模型（Apache 2.0 授權）將推動 SEA 地區的語言發展，而 Sailor2 食譜將激勵研究人員為其他服務不足的語言建立更具包容性的 LLM。
+
+##### **Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking**
+2502.12970v1 by Junda Zhu, Lingyong Yan, Shuaiqiang Wang, Dawei Yin, Lei Sha
+
+The reasoning abilities of Large Language Models (LLMs) have demonstrated
+remarkable advancement and exceptional performance across diverse domains.
+However, leveraging these reasoning capabilities to enhance LLM safety against
+adversarial attacks and jailbreak queries remains largely unexplored. To bridge
+this gap, we propose Reasoning-to-Defend (R2D), a novel training paradigm that
+integrates safety reflections of queries and responses into LLMs' generation
+process, unlocking a safety-aware reasoning mechanism. This approach enables
+self-evaluation at each reasoning step to create safety pivot tokens as
+indicators of the response's safety status. Furthermore, in order to improve
+the learning efficiency of pivot token prediction, we propose Contrastive Pivot
+Optimization (CPO), which enhances the model's ability to perceive the safety
+status of dialogues. Through this mechanism, LLMs dynamically adjust their
+response strategies during reasoning, significantly enhancing their defense
+capabilities against jailbreak attacks. Extensive experimental results
+demonstrate that R2D effectively mitigates various attacks and improves overall
+safety, highlighting the substantial potential of safety-aware reasoning in
+strengthening LLMs' robustness against jailbreaks.
+
+摘要：大型語言模型 (LLM) 的推理能力已展現出顯著的進步，並在不同的領域中表現出色。然而，利用這些推理能力來增強 LLM 對抗攻擊和越獄查詢的安全性仍然是未開發的領域。為了彌補這個差距，我們提出了推理防禦 (R2D)，這是一種新穎的訓練範例，它將查詢和回應的安全考量整合到 LLM 的生成過程中，開啟了一個安全感知推理機制。此方法可以在每個推理步驟中進行自我評估，以建立安全樞紐標記，作為回應安全狀態的指標。此外，為了提高樞紐標記預測的學習效率，我們提出了對比樞紐最佳化 (CPO)，它增強了模型感知對話安全狀態的能力。透過此機制，LLM 在推理過程中動態調整其回應策略，大幅增強其對抗越獄攻擊的防禦能力。廣泛的實驗結果證明，R2D 有效地減輕了各種攻擊，並改善了整體安全性，突顯了安全感知推理在加強 LLM 對抗越獄的穩健性方面的潛力。
+
+##### **A Survey of Text Classification Under Class Distribution Shift**
+2502.12965v1 by Adriana Valentina Costache, Silviu Florin Gheorghe, Eduard Gabriel Poesina, Paul Irofti, Radu Tudor Ionescu
+
+The basic underlying assumption of machine learning (ML) models is that the
+training and test data are sampled from the same distribution. However, in
+daily practice, this assumption is often broken, i.e., the distribution of the
+test data changes over time, which hinders the application of conventional ML
+models. One domain where the distribution shift naturally occurs is text
+classification, since people always find new topics to discuss. To this end, we
+survey research articles studying open-set text classification and related
+tasks. We divide the methods in this area based on the constraints that define
+the kind of distribution shift and the corresponding problem formulation,
+i.e., learning with the Universum, zero-shot learning, and open-set learning. We
+next discuss the predominant mitigation approaches for each problem setup.
+Finally, we identify several future work directions, aiming to push the
+boundaries beyond the state of the art. Interestingly, we find that continual
+learning can solve many of the issues caused by the shifting class
+distribution. We maintain a list of relevant papers at
+https://github.com/Eduard6421/Open-Set-Survey.
+
+摘要：機器學習 (ML) 模型的基本假設是訓練資料和測試資料取樣自同一個分佈。然而，在日常實務中，這個假設經常被打破，也就是說測試資料的分布會隨著時間改變，這會阻礙傳統 ML 模型的應用。分佈轉移自然發生的其中一個領域是文字分類，因為人們總能找到新的主題來討論。為此，我們調查研究開放集文字分類和相關任務的研究文章。我們根據定義分佈轉移的類型和對應問題公式的限制，將這個領域的方法分為：使用 Universum 學習、零次學習和開放集學習。接下來，我們討論每個問題設定的主要緩解方法。最後，我們找出幾個未來的研究方向，目標是將界線推展到現有技術的極限之外。有趣的是，我們發現持續學習可以解決許多由類別分佈轉移所造成的議題。我們在 https://github.com/Eduard6421/Open-Set-Survey 維護一份相關論文清單。
+
+##### **Trust Me, I'm Wrong: High-Certainty Hallucinations in LLMs**
+2502.12964v1 by Adi Simhi, Itay Itzhak, Fazl Barez, Gabriel Stanovsky, Yonatan Belinkov
+
+Large Language Models (LLMs) often generate outputs that lack grounding in
+real-world facts, a phenomenon known as hallucinations. Prior research has
+associated hallucinations with model uncertainty, leveraging this relationship
+for hallucination detection and mitigation. In this paper, we challenge the
+underlying assumption that all hallucinations are associated with uncertainty.
+Using knowledge detection and uncertainty measurement methods, we demonstrate
+that models can hallucinate with high certainty even when they have the correct
+knowledge. We further show that high-certainty hallucinations are consistent
+across models and datasets, distinctive enough to be singled out, and challenge
+existing mitigation methods. Our findings reveal an overlooked aspect of
+hallucinations, emphasizing the need to understand their origins and improve
+mitigation strategies to enhance LLM safety. The code is available at
+https://github.com/technion-cs-nlp/Trust_me_Im_wrong .
+
+摘要：大型語言模型 (LLM) 經常產生缺乏真實世界事實根據的輸出，這種現象稱為幻覺。先前的研究已將幻覺與模型不確定性聯繫起來，利用這種關係進行幻覺偵測和緩解。在本文中，我們挑戰所有幻覺都與不確定性相關的基本假設。使用知識偵測和不確定性測量方法，我們證明模型即使擁有正確的知識，也能以高度確定性產生幻覺。我們進一步表明，高確定性幻覺在模型和資料集之間是一致的，足夠獨特以至於可以單獨挑選出來，並挑戰現有的緩解方法。我們的研究結果揭示了幻覺的一個被忽視的方面，強調需要了解其起源並改進緩解策略以增強 LLM 安全性。可以在 https://github.com/technion-cs-nlp/Trust_me_Im_wrong 找到程式碼。
+
+##### **Infinite Retrieval: Attention Enhanced LLMs in Long-Context Processing**
+2502.12962v1 by Xiaoju Ye, Zhichun Wang, Jingyuan Wang
+
+Limited by the context window size of Large Language Models(LLMs), handling
+various tasks with input tokens exceeding the upper limit has been challenging,
+whether it is a simple direct retrieval task or a complex multi-hop reasoning
+task. Although various methods have been proposed to enhance the long-context
+processing capabilities of LLMs, they either incur substantial post-training
+costs, or require additional tool modules(e.g.,RAG), or have not shown
+significant improvement in realistic tasks. Our work observes the correlation
+between the attention distribution and generated answers across each layer, and
+establishes the attention allocation aligns with retrieval-augmented
+capabilities through experiments. Drawing on the above insights, we propose a
+novel method InfiniRetri that leverages the LLMs's own attention information to
+enable accurate retrieval across inputs of infinitely length. Our evaluations
+indicate that InfiniRetri achieves 100% accuracy in the
+Needle-In-a-Haystack(NIH) test over 1M tokens using a 0.5B parameter model,
+surpassing other method or larger models and setting a new
+state-of-the-art(SOTA). Moreover, our method achieves significant performance
+improvements on real-world benchmarks, with a maximum 288% improvement. In
+addition, InfiniRetri can be applied to any Transformer-based LLMs without
+additional training and substantially reduces inference latency and compute
+overhead in long texts. In summary, our comprehensive studies show
+InfiniRetri's potential for practical applications and create a paradigm for
+retrieving information from inputs of infinite length using an LLM's own
+capabilities. Code will be released in link.
+
+摘要：受限于大型语言模型 (LLM) 的上下文窗口大小，处理超出上限的输入标记的各种任务一直具有挑战性，无论是简单的直接检索任务还是复杂的多跳推理任务。虽然已经提出了各种方法来增强 LLM 的长上下文处理能力，但它们要么产生大量的后训练成本，要么需要额外的工具模块（例如，RAG），要么在实际任务中没有显示出显着的改进。我们的工作观察了每层注意力分布和生成答案之间的相关性，并通过实验建立了注意力分配与检索增强能力保持一致。根据上述见解，我们提出了一种新方法 InfiniRetri，该方法利用 LLM 自身的注意力信息来实现对无限长度输入的准确检索。我们的评估表明，InfiniRetri 在使用 0.5B 参数模型对超过 100 万个标记的针头干草堆 (NIH) 测试中实现了 100% 的准确率，超越了其他方法或更大的模型，并创造了新的最先进 (SOTA)。此外，我们的方法在实际基准上实现了显著的性能提升，最大提升了 288%。此外，InfiniRetri 可以应用于任何基于 Transformer 的 LLM，而无需额外的训练，并且可以大幅减少推理延迟和长文本中的计算开销。总之，我们的综合研究表明了 InfiniRetri 在实际应用中的潜力，并为使用 LLM 自身能力在无限长度标记下检索信息创造了一个范例。代码将在链接中发布。
+
+##### **Adaptive Tool Use in Large Language Models with Meta-Cognition Trigger**
+2502.12961v1 by Wenjun Li, Dexun Li, Kuicai Dong, Cong Zhang, Hao Zhang, Weiwen Liu, Yasheng Wang, Ruiming Tang, Yong Liu
+
+Large language models (LLMs) have shown remarkable emergent capabilities,
+transforming the execution of functional tasks by leveraging external tools for
+complex problems that require specialized processing or real-time data. While
+existing research expands LLMs' access to diverse tools (e.g., program
+interpreters, search engines, weather/map apps), the necessity of using these
+tools is often overlooked, leading to indiscriminate tool invocation. This
+naive approach raises two key issues: (1) increased delays due to unnecessary
+tool calls, and (2) potential errors resulting from faulty interactions with
+external tools. In this paper, we introduce meta-cognition as a proxy for LLMs'
+self-assessment of their capabilities, representing the model's awareness of
+its own limitations. Based on this, we propose MeCo, an adaptive
+decision-making strategy for external tool use. MeCo quantifies metacognitive
+scores by capturing high-level cognitive signals in the representation space,
+guiding when to invoke tools. Notably, MeCo is fine-tuning-free and incurs
+minimal cost. Our experiments show that MeCo accurately detects LLMs' internal
+cognitive signals and significantly improves tool-use decision-making across
+multiple base models and benchmarks.
+
+摘要：大型語言模型 (LLM) 已展現出顯著的新興能力，透過運用外部工具來執行功能任務，解決需要專業處理或即時資料的複雜問題，從而轉變任務的執行方式。儘管現有研究擴展了 LLM 對各種工具的存取（例如程式碼詮釋器、搜尋引擎、天氣/地圖應用程式），但使用這些工具的必要性往往被忽略，導致不加選擇地呼叫工具。這種天真的方法提出了兩個關鍵問題：(1) 由於不必要的工具呼叫而導致延遲增加，以及 (2) 由於與外部工具互動錯誤而導致的潛在錯誤。在本文中，我們將元認知引入作為 LLM 自我評估其能力的代理，代表模型意識到其自身的限制。基於此，我們提出了 MeCo，一種用於外部工具使用的適應性決策制定策略。MeCo 透過擷取表徵空間中的高階認知訊號來量化元認知分數，指導何時呼叫工具。值得注意的是，MeCo 是免微調的，而且成本極低。我們的實驗表明，MeCo 能夠準確地偵測 LLM 的內部認知訊號，並大幅改善跨多個基本模型和基準的工具使用決策制定。
+
+##### **AlignFreeze: Navigating the Impact of Realignment on the Layers of Multilingual Models Across Diverse Languages**
+2502.12959v1 by Steve Bakos, Félix Gaschi, David Guzmán, Riddhi More, Kelly Chutong Li, En-Shiun Annie Lee
+
+Realignment techniques are often employed to enhance cross-lingual transfer
+in multilingual language models. Still, they can sometimes degrade performance
+in languages that differ significantly from the fine-tuned source language.
+This paper introduces AlignFreeze, a method that freezes either the lower half
+or the upper half of the layers during realignment. Through controlled experiments on
+4 tasks, 3 models, and in 35 languages, we find that realignment affects all
+the layers but can be the most detrimental to the lower ones. Freezing the
+lower layers can prevent performance degradation. Particularly, AlignFreeze
+improves Part-of-Speech (PoS) tagging performances in languages where full
+realignment fails: with XLM-R, it provides improvements of more than one
+standard deviation in accuracy in seven more languages than full realignment.
+
+摘要：重新對齊技術通常用於增強多語言語言模型中的跨語言轉移，然而，它們有時會降低與微調源語言顯著不同的語言的效能。本文介紹了 AlignFreeze，一種在重新對齊期間凍結層的下半部或上半部的方法。透過 4 項任務、3 個模型和 35 種語言的受控實驗，我們發現重新對齊會影響所有層，但對較低層的影響最大。凍結較低層可以防止效能下降。特別是，AlignFreeze 改善了在完全重新對齊失敗的語言中的詞性 (PoS) 標記效能：使用 XLM-R，它比完全重新對齊多在七種語言中提供了超過一個標準差的準確度改進。
+
+##### **Task-Informed Anti-Curriculum by Masking Improves Downstream Performance on Text**
+2502.12953v1 by Andrei Jarca, Florinel Alin Croitoru, Radu Tudor Ionescu
+
+Masked language modeling has become a widely adopted unsupervised technique
+to pre-train language models. However, the process of selecting tokens for
+masking is random, and the percentage of masked tokens is typically fixed for
+the entire training process. In this paper, we propose to adjust the masking
+ratio and to decide which tokens to mask based on a novel task-informed
+anti-curriculum learning scheme. First, we harness task-specific knowledge
+about useful and harmful tokens in order to determine which tokens to mask.
+Second, we propose a cyclic decaying masking ratio, which corresponds to an
+anti-curriculum schedule (from hard to easy). We exemplify our novel
+task-informed anti-curriculum by masking (TIACBM) approach across three diverse
+downstream tasks: sentiment analysis, text classification by topic, and
+authorship attribution. Our findings suggest that TIACBM enhances the ability
+of the model to focus on key task-relevant features, contributing to
+statistically significant performance gains across tasks. We release our code
+at https://github.com/JarcaAndrei/TIACBM.
+
+摘要：遮蔽語言模型已成為一種廣泛採用的無監督技術，用於預先訓練語言模型。然而，選擇用於遮蔽的詞彙的過程是隨機的，且遮蔽詞彙的百分比通常在整個訓練過程中是固定的。在本文中，我們建議調整遮蔽率，並根據一種新穎的任務資訊反課程學習方案來決定要遮蔽哪些詞彙。首先，我們利用任務特定的知識，了解有用的和有害的詞彙，以確定要遮蔽哪些詞彙。其次，我們提出一個循環遞減遮蔽率，這對應於一個反課程表（從難到易）。我們以三項不同的下游任務為例，說明我們新穎的任務資訊反課程遮蔽（TIACBM）方法：情緒分析、按主題分類文字，以及作者歸屬。我們的研究結果表明，TIACBM 增強了模型專注於關鍵任務相關特徵的能力，有助於在各項任務中獲得具有統計意義的效能提升。我們在 https://github.com/JarcaAndrei/TIACBM 釋出我們的程式碼。
+
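To make the TIACBM abstract above more concrete, here is a minimal Python sketch of a cyclic decaying masking ratio combined with score-based token selection. It is not the authors' implementation: the linear decay, the cycle length, the `high`/`low` bounds, the function names, and the per-token relevance scores are all assumptions chosen for illustration. The linked repository remains the reference for the actual method.

```python
def cyclic_decaying_ratio(step, cycle_len=4, high=0.30, low=0.10):
    """Masking ratio that decays linearly from `high` to `low` inside each
    cycle and then restarts (hypothetical schedule, hard to easy)."""
    pos = step % cycle_len
    frac = pos / (cycle_len - 1) if cycle_len > 1 else 0.0
    return high - (high - low) * frac


def select_tokens_to_mask(scores, ratio):
    """Return the sorted indices of the tokens to mask.

    `scores` are hypothetical task-relevance scores, one per token. Masking
    the highest-scoring tokens first is an assumption about what "hard" means.
    """
    k = max(1, round(ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    token_scores = [0.1, 0.9, 0.5, 0.7]
    for step in range(6):
        ratio = cyclic_decaying_ratio(step)
        print(step, round(ratio, 3), select_tokens_to_mask(token_scores, ratio))
```

The schedule restarts at `high` at the start of every cycle, so each cycle begins with the most aggressive masking and relaxes toward `low`.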
+##### **Fake It Till You Make It: Using Synthetic Data and Domain Knowledge for Improved Text-Based Learning for LGE Detection**
+2502.12948v1 by Athira J Jacob, Puneet Sharma, Daniel Rueckert
+
+Detection of hyperenhancement from cardiac LGE MRI images is a complex task
+requiring significant clinical expertise. Although deep learning-based models
+have shown promising results for the task, they require large amounts of data
+with fine-grained annotations. Clinical reports generated for cardiac MR
+studies contain rich, clinically relevant information, including the location,
+extent and etiology of any scars present. Although recently developed
+CLIP-based training enables pretraining models with image-text pairs, it
+requires large amounts of data and further finetuning strategies on downstream
+tasks. In this study, we use various strategies rooted in domain knowledge to
+train a model for LGE detection solely using text from clinical reports, on a
+relatively small clinical cohort of 965 patients. We improve performance
+through the use of synthetic data augmentation, by systematically creating scar
+images and associated text. In addition, we standardize the orientation of the
+images in an anatomy-informed way to enable better alignment of spatial and
+text features. We also use a captioning loss to enable fine-grained supervision
+and explore the effect of pretraining of the vision encoder on performance.
+Finally, ablation studies are carried out to elucidate the contributions of
+each design component to the overall performance of the model.
+
+摘要：從心臟 LGE MRI 影像偵測出過度增強是一項複雜的任務，需要顯著的臨床專業知識。儘管基於深度學習的模型已顯示出對這項任務有前景的結果，但它們需要大量具有細緻註解的資料。為心臟 MR 研究產生的臨床報告包含豐富且臨床上相關的資訊，包括任何疤痕的位置、範圍和病因。儘管最近開發的基於 CLIP 的訓練能使用影像文字對預訓練模型，但它需要大量資料和進一步微調下游任務的策略。在這項研究中，我們使用植基於領域知識的各種策略，僅使用來自臨床報告的文字，在一個相對較小的 965 名患者臨床群體中訓練一個 LGE 偵測模型。我們透過使用合成資料擴充來改善效能，系統性地建立疤痕影像和相關文字。此外，我們以解剖學告知的方式標準化影像方向，以使空間和文字特徵能更好地對齊。我們也使用標題損失來啟用細緻的監督，並探討視覺編碼器的預訓練對效能的影響。最後，進行消融研究以闡明每個設計元件對模型整體效能的貢獻。
+
+##### **Every Expert Matters: Towards Effective Knowledge Distillation for Mixture-of-Experts Language Models**
+2502.12947v1 by Gyeongman Kim, Gyouk Chu, Eunho Yang
+
+With the emergence of Mixture-of-Experts (MoE), the efficient scaling of
+model size has accelerated the development of large language models in recent
+years. However, their high memory requirements prevent their use in
+resource-constrained environments. While knowledge distillation (KD) has been a
+proven method for model compression, its application to MoE teacher models
+remains underexplored. Through our investigation, we discover that
+non-activated experts in MoE models possess valuable knowledge that benefits
+student models. We further demonstrate that existing KD methods are not optimal
+for compressing MoE models, as they fail to leverage this knowledge
+effectively. To address this, we propose two intuitive MoE-specific KD methods
+for the first time: Knowledge Augmentation (KA) and Student-Aware Router (SAR),
+both designed to effectively extract knowledge from all experts. Specifically,
+KA augments knowledge by sampling experts multiple times, while SAR uses all
+experts and adjusts the expert weights through router training to provide
+optimal knowledge. Extensive experiments show that our methods outperform
+conventional KD methods, demonstrating their effectiveness for MoE teacher
+models.
+
+摘要：隨著 Mixture-of-Experts (MoE) 的出現，模型規模的有效擴展加速了近年來大型語言模型的發展。然而，它們的高記憶體需求會阻礙它們在資源受限的環境中使用。雖然知識蒸餾 (KD) 已被證明是一種模型壓縮的方法，但它在 MoE 教師模型中的應用仍未被充分探索。透過我們的調查，我們發現 MoE 模型中未被啟用的專家擁有有價值的知識，這些知識對學生模型有益。我們進一步證明，現有的 KD 方法並非壓縮 MoE 模型的最佳方法，因為它們無法有效利用這些知識。為了解決這個問題，我們首次提出兩種直觀的 MoE 專用 KD 方法：知識擴充 (KA) 和學生感知路由器 (SAR)，兩者都旨在從所有專家有效提取知識。具體來說，KA 透過多次抽樣專家來擴充知識，而 SAR 使用所有專家並透過路由器訓練調整專家權重以提供最佳知識。廣泛的實驗表明，我們的方法優於傳統的 KD 方法，證明了它們對 MoE 教師模型的有效性。
+
+##### **LLMPopcorn: An Empirical Study of LLMs as Assistants for Popular Micro-video Generation**
+2502.12945v1 by Junchen Fu, Xuri Ge, Kaiwen Zheng, Ioannis Arapakis, Xin Xin, Joemon M. Jose
+
+Popular Micro-videos, dominant on platforms like TikTok and YouTube, hold
+significant commercial value. The rise of high-quality AI-generated content has
+spurred interest in AI-driven micro-video creation. However, despite the
+advanced capabilities of large language models (LLMs) like ChatGPT and DeepSeek
+in text generation and reasoning, their potential to assist the creation of
+popular micro-videos remains largely unexplored.
+  In this paper, we conduct an empirical study on LLM-assisted popular
+micro-video generation (LLMPopcorn). Specifically, we investigate the following
+research questions: (i) How can LLMs be effectively utilized to assist popular
+micro-video generation? (ii) To what extent can prompt-based enhancements
+optimize the LLM-generated content for higher popularity? (iii) How well do
+various LLMs and video generators perform in the popular micro-video generation
+task? By exploring these questions, we show that advanced LLMs like DeepSeek-V3
+enable micro-video generation to achieve popularity comparable to human-created
+content. Prompt enhancements further boost popularity, and benchmarking
+highlights DeepSeek-V3 and DeepSeek-R1 among LLMs, while LTX-Video and
+HunyuanVideo lead in video generation. This pioneering work advances
+AI-assisted micro-video creation, uncovering new research opportunities. We
+will release the code and datasets to support future studies.
+
+摘要：在 TikTok 和 YouTube 等平台上流行的微视频具有重要的商业价值。高质量 AI 生成内容的兴起激发了人们对 AI 驱动的微视频创作的兴趣。然而，尽管 ChatGPT 和 DeepSeek 等大型语言模型 (LLM) 在文本生成和推理方面能力很强，但它们在辅助创建流行微视频方面的潜力在很大程度上仍未得到探索。在本文中，我们对 LLM 辅助的流行微视频生成 (LLMPopcorn) 进行了实证研究。具体来说，我们调查了以下研究问题：(i) 如何有效利用 LLM 来辅助流行微视频生成？(ii) 基于提示的增强在多大程度上可以优化 LLM 生成的内容以获得更高的流行度？(iii) 各种 LLM 和视频生成器在流行微视频生成任务中表现如何？通过探索这些问题，我们表明，像 DeepSeek-V3 这样的高级 LLM 使微视频生成能够达到与人类创作的内容相当的流行度。提示增强进一步提高了流行度，基准测试突出了 LLM 中的 DeepSeek-V3 和 DeepSeek-R1，而 LTX-Video 和 HunyuanVideo 在视频生成中领先。这项开创性的工作推进了 AI 辅助的微视频创作，发现了新的研究机会。我们将发布代码和数据集以支持未来的研究。
+
+##### **Synthetic Data Generation for Culturally Nuanced Commonsense Reasoning in Low-Resource Languages**
+2502.12932v1 by Salsabila Zahirah Pranida, Rifo Ahmad Genadi, Fajri Koto
+
+Quantifying reasoning capability in low-resource languages remains a
+challenge in NLP due to data scarcity and limited access to annotators. While
+LLM-assisted dataset construction has proven useful for medium- and
+high-resource languages, its effectiveness in low-resource languages,
+particularly for commonsense reasoning, is still unclear. In this paper, we
+compare three dataset creation strategies: (1) LLM-assisted dataset generation,
+(2) machine translation, and (3) human-written data by native speakers, to
+build a culturally nuanced story comprehension dataset. We focus on Javanese
+and Sundanese, two major local languages in Indonesia, and evaluate the
+effectiveness of open-weight and closed-weight LLMs in assisting dataset
+creation through extensive manual validation. To assess the utility of
+synthetic data, we fine-tune language models on classification and generation
+tasks using this data and evaluate performance on a human-written test set. Our
+findings indicate that LLM-assisted data creation outperforms machine
+translation.
+
+摘要：由於資料稀少且標註者有限，量化低資源語言中的推理能力在自然語言處理中仍然是一項挑戰。雖然 LLM 輔助的資料集建構已被證明對中高資源語言有用，但其在低資源語言中的有效性，特別是對於常識推理，仍然不清楚。在本文中，我們比較了三種資料集建立策略：(1) LLM 輔助的資料集生成，(2) 機器翻譯，以及 (3) 母語人士撰寫的人工資料，以建立具有文化細微差的故事理解資料集。我們專注於爪哇語和巽他語，這兩種印尼的主要地方語言，並透過廣泛的手動驗證評估開放權重和封閉權重 LLM 在協助資料集建立中的有效性。為了評估合成資料的效用，我們使用這些資料對分類和生成任務進行語言模型微調，並在人工撰寫的測試集上評估效能。我們的研究結果表明，LLM 輔助的資料建立優於機器翻譯。
+
+##### **Flow-of-Options: Diversified and Improved LLM Reasoning by Thinking Through Options**
+2502.12929v1 by Lakshmi Nair, Ian Trase, Mark Kim
+
+We present a novel reasoning approach called Flow-of-Options (FoO), designed
+to address intrinsic biases in Large Language Models (LLMs). FoO enables LLMs
+to systematically explore a diverse range of possibilities in their reasoning,
+as demonstrated by an FoO-based agentic system for autonomously solving Machine
+Learning tasks (AutoML). Our framework outperforms state-of-the-art baselines,
+achieving improvements of 38.2% - 69.2% on standard data science tasks, and
+37.4% - 47.9% on therapeutic chemistry tasks. With an overall operation cost
+under $1 per task, our framework is well-suited for cost-sensitive
+applications. Beyond classification and regression, we illustrate the broader
+applicability of our FoO-based agentic system to tasks such as reinforcement
+learning and image generation. Our framework presents significant advancements
+compared to current state-of-the-art agentic systems for AutoML, due to the
+benefits of FoO in enforcing diversity in LLM solutions through compressed,
+explainable representations that also support long-term memory when combined
+with case-based reasoning.
+
+摘要：我們提出了一種稱為選項流 (FoO) 的新推理方法，旨在解決大型語言模型 (LLM) 中的內在偏差。FoO 使 LLM 能系統性地探索其推理中的各種可能性，這由一個基於 FoO 的代理系統展示，該系統可自主解決機器學習任務 (AutoML)。我們的框架優於最先進的基準，在標準數據科學任務上取得了 38.2% - 69.2% 的改進，在治療化學任務上取得了 37.4% - 47.9% 的改進。由於每個任務的整體運營成本低於 1 美元，因此我們的框架非常適合對成本敏感的應用。除了分類和回歸之外，我們還說明了基於 FoO 的代理系統在強化學習和圖像生成等任務中的更廣泛適用性。我們的框架與當前最先進的 AutoML 代理系統相比具有顯著的進步，這是因為 FoO 在通過壓縮、可解釋的表示強制 LLM 解決方案的多樣性方面具有優勢，這些表示與基於案例的推理結合時還支持長期記憶。
+
+##### **Finedeep: Mitigating Sparse Activation in Dense LLMs via Multi-Layer Fine-Grained Experts**
+2502.12928v1 by Leiyu Pan, Zhenpeng Su, Minxuan Lv, Yizhe Xiong, Xiangwen Zhang, Zijia Lin, Hui Chen, Jungong Han, Guiguang Ding, Cheng Luo, Di Zhang, Kun Gai, Deyi Xiong
+
+Large language models have demonstrated exceptional performance across a wide
+range of tasks. However, dense models usually suffer from sparse activation,
+where many activation values tend towards zero (i.e., being inactivated). We
+argue that this could restrict the efficient exploration of model
+representation space. To mitigate this issue, we propose Finedeep, a
+deep-layered fine-grained expert architecture for dense models. Our framework
+partitions the feed-forward neural network layers of traditional dense models
+into small experts and arranges them across multiple sub-layers. A novel routing
+mechanism is proposed to determine each expert's contribution. We conduct
+extensive experiments across various model sizes, demonstrating that our
+approach significantly outperforms traditional dense architectures in terms of
+perplexity and benchmark performance while maintaining a comparable number of
+parameters and floating-point operations. Moreover, we find that Finedeep
+achieves optimal results when balancing depth and width, specifically by
+adjusting the number of expert sub-layers and the number of experts per
+sub-layer. Empirical results confirm that Finedeep effectively alleviates
+sparse activation and efficiently utilizes representation capacity in dense
+models.
+
+摘要：大型語言模型在各種任務中展現出非凡的效能。然而，密集模型通常會出現稀疏激活，其中許多激活值趨近於零（即處於非激活狀態）。我們認為這可能會限制模型表示空間的有效探索。為了減輕這個問題，我們提出 Finedeep，這是一種針對密集模型的深度分層細粒度專家架構。我們的框架將傳統密集模型的前饋神經網路層分割成小型專家，並將它們排列在多個子層中。我們提出了一種新穎的路由機制來確定每個專家的貢獻。我們針對各種模型大小進行了廣泛的實驗，證明我們的做法在困惑度和基準效能方面顯著優於傳統的密集架構，同時保持了相當數量的參數和浮點運算。此外，我們發現 Finedeep 在平衡深度和廣度時可以達到最佳結果，特別是透過調整專家子層的數量和每個子層的專家數量。實證結果證實，Finedeep 有效地減輕了稀疏激活，並有效利用了密集模型中的表示能力。
+
+##### **SEFL: Harnessing Large Language Model Agents to Improve Educational Feedback Systems**
+2502.12927v1 by Mike Zhang, Amalie Pernille Dilling, Léon Gondelman, Niels Erik Ruan Lyngdorf, Euan D. Lindsay, Johannes Bjerva
+
+Providing high-quality feedback is crucial for student success but is
+constrained by time, cost, and limited data availability. We introduce
+Synthetic Educational Feedback Loops (SEFL), a novel framework designed to
+deliver immediate, on-demand feedback at scale without relying on extensive,
+real-world student data. In SEFL, two large language models (LLMs) operate in
+teacher-student roles to simulate assignment completion and formative
+feedback, generating abundant synthetic pairs of student work and corresponding
+critiques. We then fine-tune smaller, more computationally efficient LLMs on
+these synthetic pairs, enabling them to replicate key features of high-quality,
+goal-oriented feedback. Unlike personalized tutoring approaches that offer
+multi-turn, individualized instruction, SEFL specifically focuses on
+replicating the teacher-to-student feedback loop for diverse assignments.
+Through both LLM-as-a-judge and human evaluations, we demonstrate that
+SEFL-tuned models outperform their non-tuned counterparts in feedback quality,
+clarity, and timeliness. These findings reveal SEFL's potential to transform
+feedback processes for higher education and beyond, offering an ethical and
+scalable alternative to conventional manual feedback cycles.
+
+摘要：提供高品質的回饋對於學生的成功至關重要，但受到時間、成本和資料取得有限的限制。我們引入了合成教育回饋迴圈 (SEFL)，這是一個新穎的架構，旨在提供立即且依需求的回饋，且無需仰賴大量的真實世界學生資料。在 SEFL 中，兩個大型語言模型 (LLM) 以師生角色運作，模擬作業完成和形成性回饋，產生大量的合成學生作業和對應的評論。然後我們針對這些合成配對微調較小、計算效率較高的 LLM，讓它們能夠複製高品質、目標導向回饋的主要特徵。與提供多回合、個別化教學的個人化輔導方法不同，SEFL 特別專注於複製適用於各種作業的教師-->學生回饋迴圈。透過 LLM 作為評審和人類評估，我們證明了 SEFL 微調模型在回饋品質、清晰度和時效性方面優於未微調的模型。這些發現揭示了 SEFL 轉變高等教育及其他領域回饋流程的潛力，提供了一個符合道德且可擴充的替代方案，取代傳統的手動回饋週期。
+
+##### **Towards more Contextual Agents: An extractor-Generator Optimization Framework**
+2502.12926v1 by Mourad Aouini, Jinan Loubani
+
+Large Language Model (LLM)-based agents have demonstrated remarkable success
+in solving complex tasks across a wide range of general-purpose applications.
+However, their performance often degrades in context-specific scenarios, such
+as specialized industries or research domains, where the absence of
+domain-relevant knowledge leads to imprecise or suboptimal outcomes. To address
+this challenge, our work introduces a systematic approach to enhance the
+contextual adaptability of LLM-based agents by optimizing their underlying
+prompts, the critical components that govern agent behavior, roles, and
+interactions. Manually crafting optimized prompts for context-specific tasks is
+labor-intensive, error-prone, and lacks scalability. In this work, we introduce
+an Extractor-Generator framework designed to automate the optimization of
+contextual LLM-based agents. Our method operates through two key stages: (i)
+feature extraction from a dataset of gold-standard input-output examples, and
+(ii) prompt generation via a high-level optimization strategy that iteratively
+identifies underperforming cases and applies self-improvement techniques. This
+framework substantially improves prompt adaptability by enabling more precise
+generalization across diverse inputs, particularly in context-specific tasks
+where maintaining semantic consistency and minimizing error propagation are
+critical for reliable performance. Although developed with single-stage
+workflows in mind, the approach naturally extends to multi-stage workflows,
+offering broad applicability across various agent-based systems. Empirical
+evaluations demonstrate that our framework significantly enhances the
+performance of prompt-optimized agents, providing a structured and efficient
+approach to contextual LLM-based agents.
+
+摘要：大型語言模型 (LLM) 為基礎的代理已展現出非凡的成功，能解決廣泛一般用途應用程式的複雜任務。然而，它們的效能通常會在特定情境中下降，例如專門產業或研究領域，其中缺乏與領域相關知識會導致不精確或次佳的結果。為了解決這項挑戰，我們的研究引進了一種系統化的方法來增強 LLM 為基礎的代理的情境適應性，方法是最佳化它們的基礎提示，這些提示是決定代理行為、角色和互動的重要組成部分。手動製作最佳化的提示以應對特定情境的任務既費時又容易出錯，而且缺乏可擴充性。在這項研究中，我們引進一個萃取產生器架構，旨在自動化情境 LLM 為基礎代理的最佳化。我們的方法透過兩個關鍵階段運作：(i) 從黃金標準輸入輸出範例的資料集萃取特徵，以及 (ii) 透過高階最佳化策略產生提示，此策略會反覆找出表現不佳的案例並套用自我改善技術。此架構大幅改善了提示適應性，讓它能針對不同的輸入進行更精確的概括，特別是在情境特定任務中，在這些任務中，維持語意一致性和將錯誤傳播降至最低對於可靠的效能至關重要。儘管是針對單階段工作流程開發，但此方法自然能延伸至多階段工作流程，在各種基於代理的系統中提供廣泛的適用性。實證評估顯示，我們的架構大幅增強了提示最佳化代理的效能，為基於情境的 LLM 代理提供了一個結構化且有效率的方法。
+
+##### **Keep what you need : extracting efficient subnetworks from large audio representation models**
+2502.12925v1 by David Genova, Philippe Esling, Tom Hurlin
+
+Recently, research on audio foundation models has witnessed notable advances,
+as illustrated by the ever improving results on complex downstream tasks.
+Subsequently, those pretrained networks have quickly been used for various
+audio applications. These improvements have however resulted in a considerable
+increase in both the size and complexity of these models. Along with the environmental
+concerns this issue raises, this prevents the deployment of such networks on
+consumer-level devices, and precludes their use for real-time applications.
+Moreover, this appears contradictory with the specificity of the tasks for
+which these models are used, which are often simpler compared to extracting a
+rich, multi-purpose representation from any type of audio data. In this paper,
+we address this issue with a simple, yet effective method to extract
+lightweight specialist subnetworks from large foundation models. Specifically,
+we introduce learnable binary masks in-between the layers of a pretrained
+representation model. When training the end-to-end model on a downstream task,
+we add a sparsity-inducing loss to the overall objective, hence learning a
+compact subnetwork specialized on a single task. Importantly, the weights of
+the foundation model are kept frozen, resulting in low additional training
+costs. Once trained, the masked computational units can then be removed from
+the network, implying significant performance gains. We assess our method on
+three widespread audio foundation models, each based on a different backbone
+architecture, and illustrate its effectiveness on common audio representation
+evaluation tasks, as well as its versatility on speech, music, and general
+audio. Code for reproducing the results and supporting webpage are available at
+https://github.com/gnvIRCAM/Audio-representation-trimming
+
+摘要：近期，音频基础模型的研究取得了显著进展，复杂的下游任务上不断提升的结果证明了这一点。随后，这些预训练网络已迅速用于各种音频应用程序。然而，这些改进导致了这些模型的尺寸和复杂性都大幅增加。除了由此产生的环境问题外，这也阻止了此类网络在消费者级设备上的部署，并排除了它们在实时应用程序中的使用。此外，这似乎与这些模型的使用任务的特殊性相矛盾，与从任何类型的音频数据中提取丰富的多用途表示相比，这些任务通常更简单。在本文中，我们通过一种简单但有效的方法来解决此问题，从大型基础模型中提取轻量级专家子网络。具体来说，我们在预训练表示模型的层之间引入了可学习的二进制掩码。当在某个下游任务上训练端到端模型时，我们在总体目标中添加了稀疏性诱导损失，从而学习到专门用于单个任务的紧凑型子网络。重要的是，基础模型的权重保持冻结，从而导致额外的训练成本低。一旦训练完成，就可以从网络中移除掩码的计算单元，这意味着性能将大幅提升。我们对三个广泛使用的音频基础模型评估了我们的方法，每个模型都基于不同的骨干架构，并说明了其在常见音频表示评估任务上的有效性，以及其在语音、音乐和通用音频上的多功能性。用于重现结果的代码和支持网页可在 https://github.com/gnvIRCAM/Audio-representation-trimming 获得。
+
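The audio subnetwork abstract above describes two mechanical steps: adding a sparsity-inducing loss on learnable masks, then removing masked units after training. The sketch below illustrates only those two steps on plain Python lists. It is an assumption-laden toy, not the authors' code: the L1-style penalty, the `weight` coefficient, the function names, and the row-per-unit layout are all hypothetical, and the mask learning itself is not shown.

```python
def sparsity_loss(mask_values, weight=0.01):
    """L1-style penalty: the more units stay active, the larger the loss.
    `mask_values` are mask entries in [0, 1]; `weight` is a hypothetical
    coefficient balancing this term against the task loss."""
    return weight * sum(mask_values)


def prune_units(weights, biases, mask):
    """Physically remove the output units (rows) whose binary mask is 0.

    `weights` is a list of rows, one row per output unit of a frozen layer.
    """
    kept = [i for i, m in enumerate(mask) if m == 1]
    return [weights[i] for i in kept], [biases[i] for i in kept]


if __name__ == "__main__":
    layer_w = [[1, 2], [3, 4], [5, 6]]
    layer_b = [0.1, 0.2, 0.3]
    learned_mask = [1, 0, 1]  # pretend result of training
    print(sparsity_loss(learned_mask, weight=0.5))
    print(prune_units(layer_w, layer_b, learned_mask))
```

In a real network, dropping an output unit also requires dropping the matching input column of the next layer; that bookkeeping is omitted here.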
+##### **Conditioning LLMs to Generate Code-Switched Text: A Methodology Grounded in Naturally Occurring Data**
+2502.12924v1 by Maite Heredia, Gorka Labaka, Jeremy Barnes, Aitor Soroa
+
+Code-switching (CS) is still a critical challenge in Natural Language
+Processing (NLP). Current Large Language Models (LLMs) struggle to interpret
+and generate code-switched text, primarily due to the scarcity of large-scale
+CS datasets for training. This paper presents a novel methodology to generate
+CS data using LLMs, and tests it on the English-Spanish language pair. We
+propose back-translating natural CS sentences into monolingual English, and
+using the resulting parallel corpus to fine-tune LLMs to turn monolingual
+sentences into CS. Unlike previous approaches to CS generation, our methodology
+uses natural CS data as a starting point, allowing models to learn its natural
+distribution beyond grammatical patterns. We thoroughly analyse the models'
+performance through a study on human preferences, a qualitative error analysis
+and an evaluation with popular automatic metrics. Results show that our
+methodology generates fluent code-switched text, expanding research
+opportunities in CS communication, and that traditional metrics do not
+correlate with human judgement when assessing the quality of the generated CS
+data. We release our code and generated dataset under a CC-BY-NC-SA license.
+
+摘要：代碼轉換（CS）在自然語言處理（NLP）中仍是一個嚴峻的挑戰。目前的巨量語言模型（LLM）難以解讀和生成代碼轉換文字，主要是因為缺乏用於訓練的大規模 CS 資料集。本文提出了一種使用 LLM 生成 CS 資料的新方法，並在英語-西班牙語語言對上進行測試。我們建議將自然 CS 句子反向翻譯成單語英語，並使用產生的平行語料庫微調 LLM，將單語句子轉換為 CS。與先前的 CS 生成方法不同，我們的技術使用自然 CS 資料作為起點，讓模型能夠學習其超越語法模式的自然分佈。我們透過研究人類偏好、定性錯誤分析和使用流行的自動化指標進行評估，徹底分析模型的效能。結果顯示，我們的技術可以生成流利的代碼轉換文字，擴展 CS 溝通的研究機會，而且在評估生成的 CS 資料品質時，傳統指標與人類判斷無關。我們在 CC-BY-NC-SA 授權下釋出我們的程式碼和生成的資料集。
+
+##### **On-Device LLMs for Home Assistant: Dual Role in Intent Detection and Response Generation**
+2502.12923v1 by Rune Birkmose, Nathan Mørkeberg Reece, Esben Hofstedt Norvin, Johannes Bjerva, Mike Zhang
+
+This paper investigates whether Large Language Models (LLMs), fine-tuned on
+synthetic but domain-representative data, can perform the twofold task of (i)
+slot and intent detection and (ii) natural language response generation for a
+smart home assistant, while running solely on resource-limited, CPU-only edge
+hardware. We fine-tune LLMs to produce both JSON action calls and text
+responses. Our experiments show that 16-bit and 8-bit quantized variants
+preserve high accuracy on slot and intent detection and maintain strong
+semantic coherence in generated text, whereas the 4-bit model, while retaining
+generative fluency, suffers a noticeable drop in device-service classification
+accuracy. Further evaluations on noisy human (non-synthetic) prompts and
+out-of-domain intents confirm the models' generalization ability, obtaining
+around 80-86% accuracy. While the average inference time is 5-6 seconds per
+query (acceptable for one-shot commands but suboptimal for multi-turn
+dialogue), our results affirm that an on-device LLM can effectively unify
+command interpretation and flexible response generation for home automation
+without relying on specialized hardware.
+
+摘要：本文探討微調於合成但具領域代表性的資料上的大型語言模型 (LLM)，是否能執行 (i) 槽位和意圖偵測，以及 (ii) 自然語言回應產生的雙重任務，同時僅在資源受限、僅 CPU 的邊緣硬體上執行。我們微調 LLM 以產生 JSON 動作呼叫和文字回應。我們的實驗顯示，16 位元和 8 位元量化的變體在槽位和意圖偵測上保持高準確度，並在產生的文字中維持強大的語意一致性，而 4 位元模型雖然保有生成流暢度，但在裝置服務分類準確度上卻有明顯下降。進一步對有雜訊的人類 (非合成) 提示和領域外意圖的評估，證實了模型的泛化能力，獲得約 80-86% 的準確度。雖然平均推論時間為每個查詢 5-6 秒，對於一次性命令來說是可以接受的，但對於多輪對話來說並不理想，但我們的結果證實，裝置上的 LLM 可以有效地統一命令解譯和彈性回應產生，以進行家庭自動化，而無需依賴專用硬體。
+
+##### **Q-STRUM Debate: Query-Driven Contrastive Summarization for Recommendation Comparison**
+2502.12921v1 by George-Kirollos Saad, Scott Sanner
+
+Query-driven recommendation with unknown items poses a challenge for users to
+understand why certain items are appropriate for their needs. Query-driven
+Contrastive Summarization (QCS) is a methodology designed to address this issue
+by leveraging language-based item descriptions to clarify contrasts between
+them. However, existing state-of-the-art contrastive summarization methods such
+as STRUM-LLM fall short of this goal. To overcome these limitations, we
+introduce Q-STRUM Debate, a novel extension of STRUM-LLM that employs
+debate-style prompting to generate focused and contrastive summarizations of
+item aspects relevant to a query. Leveraging modern large language models
+(LLMs) as powerful tools for generating debates, Q-STRUM Debate provides
+enhanced contrastive summaries. Experiments across three datasets demonstrate
+that Q-STRUM Debate yields significant performance improvements over existing
+methods on key contrastive summarization criteria, thus introducing a novel and
+performant debate prompting methodology for QCS.
+
+摘要：以未知項目進行的查詢驅動推薦對使用者來說是一項挑戰，他們難以理解為何某些項目適合自己的需求。查詢驅動對比摘要 (QCS) 是一種方法，旨在透過利用基於語言的項目描述來釐清項目之間的對比，以解決這個問題。然而，現有的最先進對比摘要方法（例如 STRUM-LLM）並未達成此目標。為了克服這些限制，我們引進 Q-STRUM Debate，一種 STRUM-LLM 的新延伸，它採用辯論式提示來產生與查詢相關的項目面向的重點式對比摘要。透過利用現代大型語言模型 (LLM) 作為產生辯論的強大工具，Q-STRUM Debate 提供增強的對比摘要。透過三個資料集的實驗證明，Q-STRUM Debate 在關鍵的對比摘要標準上，比現有方法有顯著的效能改善，因此為 QCS 引進一種新穎且高性能的辯論提示方法。
+
+##### **GSQ-Tuning: Group-Shared Exponents Integer in Fully Quantized Training for LLMs On-Device Fine-tuning**
+2502.12913v1 by Sifan Zhou, Shuo Wang, Zhihang Yuan, Mingjia Shi, Yuzhang Shang, Dawei Yang
+
+Large Language Model (LLM) fine-tuning technologies have achieved
+remarkable results. However, traditional LLM fine-tuning approaches face
+significant challenges: they require large Floating Point (FP) computation,
+raising privacy concerns when handling sensitive data, and are impractical for
+resource-constrained edge devices. While Parameter-Efficient Fine-Tuning (PEFT)
+techniques reduce trainable parameters, their reliance on floating-point
+arithmetic creates fundamental incompatibilities with edge hardware. In this
+work, we introduce a novel framework for on-device LLM fine-tuning that
+eliminates the need for floating-point operations in both inference and
+training, named GSQ-Tuning. At its core is the Group-Shared Exponents Integer
+format, which efficiently represents model parameters in integer format using
+shared exponents among parameter groups. When combined with LoRA-like adapters,
+this enables fully integer-based fine-tuning that is both memory and compute
+efficient. We demonstrate that our approach achieves accuracy comparable to
+FP16-based fine-tuning while significantly reducing memory usage (by 50%).
+Moreover, compared to FP8, our method can reduce power consumption by 5x and
+chip area by 11x at the same performance, making large-scale model adaptation feasible
+on edge devices.
+
+摘要：大型语言模型 (LLM) 微调技术已取得显著成果。然而，传统的 LLM 微调方法面临着严峻的挑战：它们需要大量的浮点 (FP) 计算，在处理敏感数据时会引发隐私问题，并且对于资源受限的边缘设备而言不切实际。虽然参数高效微调 (PEFT) 技术减少了可训练参数，但它们对浮点运算的依赖与边缘硬件产生了根本上的不兼容性。在这项工作中，我们引入了一个用于设备上 LLM 微调的新框架，该框架消除了推理和训练中对浮点运算的需求，名为 GSQ-Tuning。其核心是组共享指数整数格式，该格式使用参数组之间的共享指数以整数格式有效地表示模型参数。当与类似 LoRA 的适配器相结合时，这实现了完全基于整数的微调，既节省内存又节省计算。我们证明了我们的方法实现了与基于 FP16 的微调相当的准确性，同时显著减少了内存使用量 (50%)。此外，与 FP8 相比，我们的方法可以在相同的性能下减少 5 倍的功耗和 11 倍的芯片面积，从而使大规模模型适应在边缘设备上成为可能。
+
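The GSQ-Tuning abstract above centers on representing a group of parameters as integers that share one exponent. The following Python sketch shows that general idea in a generic block-floating-point style. It is an assumption, not the paper's Group-Shared Exponents Integer format: the power-of-two exponent choice, the 8-bit default, the rounding rule, and the function names are all hypothetical, and the quantization step itself uses ordinary floating-point arithmetic for readability.

```python
import math


def quantize_group(values, bits=8):
    """Map a group of floats to signed integers plus one shared exponent,
    so that value is approximately integer * 2**exponent."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return [0] * len(values), 0
    # Smallest exponent for which the largest magnitude still fits in qmax.
    exponent = math.ceil(math.log2(max_abs / qmax))
    scale = 2.0 ** exponent
    ints = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return ints, exponent


def dequantize_group(ints, exponent):
    """Recover approximate floats from the integers and the shared exponent."""
    return [i * 2.0 ** exponent for i in ints]


if __name__ == "__main__":
    group = [0.5, -0.25, 0.125, 1.0]
    ints, exp = quantize_group(group)
    print(ints, exp, dequantize_group(ints, exp))
```

Only one exponent is stored per group, so the per-parameter cost is the integer width alone; larger groups save more storage but force small and large values to share the same scale.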
+##### **Knapsack Optimization-based Schema Linking for LLM-based Text-to-SQL Generation**
+2502.12911v1 by Zheng Yuan, Hao Chen, Zijin Hong, Qinggang Zhang, Feiran Huang, Xiao Huang
+
+Generating SQLs from user queries is a long-standing challenge, where the
+accuracy of initial schema linking significantly impacts subsequent SQL
+generation performance. However, current schema linking models still struggle
+with missing relevant schema elements or an excess of redundant ones. A crucial
+reason for this is that commonly used metrics, recall and precision, fail to
+capture relevant element missing and thus cannot reflect actual schema linking
+performance. Motivated by this, we propose an enhanced schema linking metric by
+introducing a restricted missing indicator. Accordingly, we introduce Knapsack
+optimization-based Schema Linking Agent (KaSLA), a plug-in schema linking agent
+designed to prevent the missing of relevant schema elements while minimizing
+the inclusion of redundant ones. KaSLA employs a hierarchical linking strategy
+that first identifies the optimal table linking and subsequently links columns
+within the selected table to reduce linking candidate space. In each linking
+process, it utilizes a knapsack optimization approach to link potentially
+relevant elements while accounting for a limited tolerance of potential
+redundant ones. With this optimization, KaSLA-1.6B achieves superior schema
+linking results compared to large-scale LLMs, including deepseek-v3 with a
+state-of-the-art (SOTA) schema linking method. Extensive experiments on Spider
+and BIRD benchmarks verify that KaSLA can significantly improve the SQL
+generation performance of SOTA text-to-SQL models by substituting their schema
+linking processes.
+
+摘要：從使用者查詢中產生 SQL 是個長期的挑戰，其中初始架構連結的準確性會顯著影響後續 SQL 產生效能。然而，目前的架構連結模型仍難以處理遺漏相關架構元素或過多重複元素的問題。造成此問題的一個關鍵原因是，常用的指標召回率和精確度無法捕捉遺漏相關元素，因此無法反映實際的架構連結效能。有鑑於此，我們提出一個增強的架構連結指標，透過引入受限遺漏指標。因此，我們介紹基於背包最佳化的架構連結代理 (KaSLA)，這是一個外掛式架構連結代理，旨在防止遺漏相關架構元素，同時將重複元素的納入降至最低。KaSLA 採用分層連結策略，首先找出最佳的表格連結，然後連結所選表格中的欄位，以減少連結候選空間。在每個連結過程中，它利用背包最佳化方法連結潛在相關元素，同時考量對潛在重複元素的容忍度。透過此最佳化，KaSLA-1.6B 達到優於大規模 LLM 的架構連結結果，包括採用最先進 (SOTA) 架構連結方法的 deepseek-v3。在 Spider 和 BIRD 基準上的廣泛實驗驗證，KaSLA 可透過取代其架構連結流程，大幅提升 SOTA 文字轉 SQL 模型的 SQL 產生效能。
+
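The knapsack view of schema linking in the KaSLA abstract can be sketched as a 0/1 knapsack problem. Each candidate schema element has an estimated relevance (its value) and a redundancy cost (its weight), and the linker keeps the subset with the highest total relevance whose total cost fits within a tolerance budget. The scores, integer costs, budget, and the `knapsack_link` helper are hypothetical; this is not KaSLA's actual scoring model or implementation.

```python
def knapsack_link(candidates, budget):
    """Select schema elements with a 0/1 knapsack.

    candidates: list of (name, relevance, cost) tuples, cost a non-negative int.
    budget: maximum total cost (tolerance for redundancy), a non-negative int.
    Returns the names of the subset that maximizes total relevance.
    """
    n = len(candidates)
    # best[i][b]: best total relevance using the first i candidates with budget b.
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i, (_, relevance, cost) in enumerate(candidates, start=1):
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]  # skip candidate i
            if cost <= b:
                take = best[i - 1][b - cost] + relevance
                if take > best[i][b]:
                    best[i][b] = take  # taking candidate i is strictly better

    # Walk back through the table to recover which candidates were taken.
    chosen = []
    b = budget
    for i in range(n, 0, -1):
        name, _, cost = candidates[i - 1]
        if best[i][b] != best[i - 1][b]:
            chosen.append(name)
            b -= cost
    chosen.reverse()
    return chosen
```

The hierarchical strategy from the abstract would correspond to running a selection like this once over tables and then again over the columns of the selected tables, which shrinks the candidate space for the second pass.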
+##### **Graph Neural Networks for Databases: A Survey**
+2502.12908v1 by Ziming Li, Youhuan Li, Yuyu Luo, Guoliang Li, Chuxu Zhang
+
+Graph neural networks (GNNs) are powerful deep learning models for
+graph-structured data, demonstrating remarkable success across diverse domains.
+Recently, the database (DB) community has increasingly recognized the
+potential of GNNs, prompting a surge of research focusing on improving
+database systems through GNN-based approaches. However, despite notable
+advances, there is a lack of a comprehensive review and understanding of how
+GNNs could improve DB systems. Therefore, this survey aims to bridge this gap
+by providing a structured and in-depth overview of GNNs for DB systems.
+Specifically, we propose a new taxonomy that classifies existing methods into
+two key categories: (1) Relational Databases, which includes tasks like
+performance prediction, query optimization, and text-to-SQL, and (2) Graph
+Databases, addressing challenges like efficient graph query processing and
+graph similarity computation. We systematically review key methods in each
+category, highlighting their contributions and practical implications. Finally,
+we suggest promising avenues for integrating GNNs into database systems.
+
+摘要：圖形神經網路 (GNN) 是用於圖形結構資料的強大深度學習模型，在各種領域中展現出顯著的成功。最近，資料庫 (DB) 社群越來越認識到 GNN 的潛力，促使大量研究專注於透過基於 GNN 的方法來改善資料庫系統。然而，儘管有顯著的進展，但對於 GNN 如何改善資料庫系統，仍然缺乏全面的回顧和理解。因此，本調查旨在透過提供 GNN 在資料庫系統中的結構化且深入的概觀來彌補這個差距。具體來說，我們提出了一個新的分類法，將現有方法分類為兩個主要類別：(1) 關係資料庫，其中包括效能預測、查詢最佳化和文字轉 SQL 等任務，以及 (2) 圖形資料庫，用於處理高效圖形查詢處理和圖形相似度計算等挑戰。我們系統性地回顧了每個類別中的關鍵方法，重點說明其貢獻和實務意涵。最後，我們建議將 GNN 整合到資料庫系統中的有希望途徑。
+
+##### **Fraud-R1 : A Multi-Round Benchmark for Assessing the Robustness of LLM Against Augmented Fraud and Phishing Inducements**
+2502.12904v1 by Shu Yang, Shenzhe Zhu, Zeyu Wu, Keyu Wang, Junchi Yao, Junchao Wu, Lijie Hu, Mengdi Li, Derek F. Wong, Di Wang
+
+We introduce Fraud-R1, a benchmark designed to evaluate LLMs' ability to
+defend against internet fraud and phishing in dynamic, real-world scenarios.
+Fraud-R1 comprises 8,564 fraud cases sourced from phishing scams, fake job
+postings, social media, and news, categorized into 5 major fraud types. Unlike
+previous benchmarks, Fraud-R1 introduces a multi-round evaluation pipeline to
+assess LLMs' resistance to fraud at different stages, including credibility
+building, urgency creation, and emotional manipulation. Furthermore, we
+evaluate 15 LLMs under two settings: 1. Helpful-Assistant, where the LLM
+provides general decision-making assistance, and 2. Role-play, where the model
+assumes a specific persona, widely used in real-world agent-based interactions.
+Our evaluation reveals the significant challenges in defending against fraud
+and phishing inducement, especially in role-play settings and fake job
+postings. Additionally, we observe a substantial performance gap between
+Chinese and English, underscoring the need for improved multilingual fraud
+detection capabilities.
+
+摘要：我們推出 Fraud-R1，一個基準，旨在評估 LLM 在動態、真實世界場景中防範網路詐騙和網路釣魚的能力。Fraud-R1 包含 8,564 起詐騙案例，來源包括網路釣魚詐騙、虛假職缺、社群媒體和新聞，分類為 5 種類型的主要詐騙手法。與先前的基準不同，Fraud-R1 引入多輪評估管道，以評估 LLM 在不同階段對詐騙的抵抗力，包括建立信譽、製造急迫感和情感操縱。此外，我們在兩種設定下評估 15 個 LLM：1. 協助助理，其中 LLM 提供一般決策協助，以及 2. 角色扮演，其中模型假設特定角色，廣泛用於現實世界中基於代理的互動。我們的評估揭示了在防範詐騙和網路釣魚誘導方面面臨的重大挑戰，尤其是在角色扮演設定和虛假職缺中。此外，我們觀察到中文和英文之間有顯著的效能差距，這凸顯了改進多語言詐騙偵測功能的必要性。
+
+##### **Soundwave: Less is More for Speech-Text Alignment in LLMs**
+2502.12900v1 by Yuhao Zhang, Zhiheng Liu, Fan Bu, Ruiyu Zhang, Benyou Wang, Haizhou Li
+
+Existing end-to-end speech large language models (LLMs) usually rely on
+large-scale annotated data for training, while data-efficient training has not
+been discussed in depth. We focus on two fundamental problems between speech
+and text: the representation space gap and sequence length inconsistency. We
+propose Soundwave, which utilizes an efficient training strategy and a novel
+architecture to address these issues. Results show that Soundwave outperforms
+the advanced Qwen2-Audio in speech translation and AIR-Bench speech tasks,
+using only one-fiftieth of the training data. Further analysis shows that
+Soundwave still retains its intelligence during conversation. The project is
+available at https://github.com/FreedomIntelligence/Soundwave.
+
+摘要：現有的端對端語音大型語言模型 (LLM) 通常依賴於大規模註釋資料進行訓練，而資料有效率的訓練尚未深入探討。我們專注於語音和文字之間的兩個基本問題：表示空間差距和序列長度不一致。我們提出 Soundwave，它利用高效的訓練策略和新穎的架構來解決這些問題。結果顯示，Soundwave 在語音翻譯和 AIR-Bench 語音任務中優於進階的 Qwen2-Audio，僅使用五十分之一的訓練資料。進一步的分析顯示，Soundwave 在對話中仍能保持其智慧。專案可於 https://github.com/FreedomIntelligence/Soundwave 取得。
+
+##### **None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks**
+2502.12896v1 by Eva Sánchez Salido, Julio Gonzalo, Guillermo Marco
+
+In LLM evaluations, reasoning is often distinguished from recall/memorization
+by performing numerical variations to math-oriented questions. Here we
+introduce a general variation method for multiple-choice questions that
+completely dissociates the correct answer from previously seen tokens or
+concepts, requiring LLMs to understand and reason (rather than memorize) in
+order to answer correctly. Using this method, we evaluate state-of-the-art
+proprietary and open-source LLMs on two datasets available in English and
+Spanish: the public MMLU benchmark and the private UNED-Access 2024 dataset.
+Results show that all models experience remarkable accuracy drops under our
+proposed variation, with an average loss of 57% on MMLU and 50% on UNED-Access
+2024, ranging from 10% to 93% across models. Notably, the most accurate model
+in our experimentation (OpenAI-o3-mini) is not the most robust
+(DeepSeek-R1-70B), suggesting that the best models in standard evaluations may
+not be the ones with better reasoning capabilities. Also, we see larger
+accuracy drops in public (vs private) datasets and questions posed in their
+original language (vs a manual translation), which are signs of contamination
+and also point to a relevant role of recall/memorization in current LLMs'
+answers.
+
+摘要：在 LLM 評估中，推理通常透過對數學導向問題進行數值變異來區別於回憶/記憶。在此，我們引入一種通用變異方法，適用於多選題，它將正確答案與先前看到的代幣或概念完全區分開來，要求 LLM 理解和推理（而不是記憶），以便正確回答。使用此方法，我們在英語和西班牙語中評估了兩種數據集中的最先進的專有和開源 LLM：公共 MMLU 基準和私有 UNED-Access 2024 數據集。結果表明，在我們提出的變異下，所有模型的準確度都出現顯著下降，在 MMLU 上平均損失 57%，在 UNED-Access 2024 上平均損失 50%，在不同模型中範圍從 10% 到 93%。值得注意的是，我們實驗中最準確的模型（OpenAI-o3-mini）並不是最穩健的模型（DeepSeek-R1-70B），這表明標準評估中最好的模型可能不是推理能力最強的模型。此外，我們看到公共（相對於私有）數據集和以原始語言提出的問題（相對於人工翻譯）的準確度下降幅度更大，這是汙染的跡象，也表明回憶/記憶在當前 LLM 的答案中發揮著相關作用。
+
+##### **Multilingual European Language Models: Benchmarking Approaches and Challenges**
+2502.12895v1 by Fabio Barth, Georg Rehm
+
+The breakthrough of generative large language models (LLMs) that can solve
+different tasks through chat interaction has led to a significant increase in
+the use of general benchmarks to assess the quality or performance of these
+models beyond individual applications. There is also a need for better methods
+to evaluate and compare models due to the ever-increasing number of new
+models published. However, most of the established benchmarks revolve around
+the English language. This paper analyses the benefits and limitations of
+current evaluation datasets, focusing on multilingual European benchmarks. We
+analyse seven multilingual benchmarks and identify four major challenges.
+Furthermore, we discuss potential solutions to enhance translation quality and
+mitigate cultural biases, including human-in-the-loop verification and
+iterative translation ranking. Our analysis highlights the need for culturally
+aware and rigorously validated benchmarks to assess the reasoning and
+question-answering capabilities of multilingual LLMs accurately.
+
+摘要：生成式大型語言模型 (LLM) 的突破，它能透過聊天互動解決不同任務，這導致使用一般基準來評估這些模型在個別應用程式以外的品質或效能大幅增加。由於已發布的新模型數量不斷增加，因此也有必要採用更好的方法來評估模型並進行比較。然而，大多數已建立的基準都圍繞著英語。本文分析了目前評估資料集的優點和限制，重點放在多語言歐洲基準。我們分析了七個多語言基準，並找出四個主要的挑戰。此外，我們討論了增強翻譯品質和減輕文化偏見的潛在解決方案，包括人為迴圈驗證和反覆翻譯排名。我們的分析突顯了對文化意識和嚴格驗證的基準的需求，以準確評估多語言 LLM 的推理和問答能力。
+
+##### **H-CoT: Hijacking the Chain-of-Thought Safety Reasoning Mechanism to Jailbreak Large Reasoning Models, Including OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking**
+2502.12893v1 by Martin Kuo, Jianyi Zhang, Aolin Ding, Qinsi Wang, Louis DiValentin, Yujia Bao, Wei Wei, Da-Cheng Juan, Hai Li, Yiran Chen
+
+Large Reasoning Models (LRMs) have recently extended their powerful reasoning
+capabilities to safety checks, using chain-of-thought reasoning to decide
+whether a request should be answered. While this new approach offers a
+promising route for balancing model utility and safety, its robustness remains
+underexplored. To address this gap, we introduce Malicious-Educator, a
+benchmark that disguises extremely dangerous or malicious requests beneath
+seemingly legitimate educational prompts. Our experiments reveal severe
+security flaws in popular commercial-grade LRMs, including OpenAI o1/o3,
+DeepSeek-R1, and Gemini 2.0 Flash Thinking. For instance, although OpenAI's o1
+model initially maintains a high refusal rate of about 98%, subsequent model
+updates significantly compromise its safety; and attackers can easily extract
+criminal strategies from DeepSeek-R1 and Gemini 2.0 Flash Thinking without any
+additional tricks. To further highlight these vulnerabilities, we propose
+Hijacking Chain-of-Thought (H-CoT), a universal and transferable attack method
+that leverages the model's own displayed intermediate reasoning to jailbreak
+its safety reasoning mechanism. Under H-CoT, refusal rates decline sharply,
+dropping from 98% to below 2%, and in some instances the attack even turns
+initially cautious tones into ones that are willing to provide harmful content.
+We hope these findings underscore the urgent need for more robust safety
+mechanisms to preserve the benefits of advanced reasoning capabilities without
+compromising ethical standards.
+
+摘要：大型推理模型 (LRM) 最近將其強大的推理能力擴展到安全檢查，使用思維鏈推理來決定是否應回答請求。雖然這種新方法為平衡模型實用性和安全性提供了一條有希望的途徑，但其穩健性仍未得到充分探索。為了解決這一差距，我們引入了 Malicious-Educator，這是一個基準，它將極其危險或惡意的請求偽裝在看似合法的教育提示之下。我們的實驗揭示了流行的商業級 LRM 中嚴重的安全缺陷，包括 OpenAI o1/o3、DeepSeek-R1 和 Gemini 2.0 Flash Thinking。例如，儘管 OpenAI 的 o1 模型最初保持約 98% 的高拒絕率，但後續的模型更新顯著損害了其安全性；攻擊者可以輕鬆地從 DeepSeek-R1 和 Gemini 2.0 Flash Thinking 中提取犯罪策略，而無需任何額外的技巧。為了進一步強調這些漏洞，我們提出了劫持思維鏈 (H-CoT)，這是一種通用且可轉移的攻擊方法，它利用模型自己顯示的中間推理來越獄其安全推理機制。在 H-CoT 下，拒絕率急劇下降，從 98% 降至 2% 以下，在某些情況下，甚至將最初謹慎的語氣轉變為願意提供有害內容的語氣。我們希望這些發現強調了對更強大的安全機制的迫切需要，以保留先進推理能力的好處，同時不損害道德標準。
+
+##### **Are Multilingual Language Models an Off-ramp for Under-resourced Languages? Will we arrive at Digital Language Equality in Europe in 2030?**
+2502.12886v1 by Georg Rehm, Annika Grützner-Zahn, Fabio Barth
+
+Large language models (LLMs) demonstrate unprecedented capabilities and
+define the state of the art for almost all natural language processing (NLP)
+tasks and also for essentially all Language Technology (LT) applications. LLMs
+can only be trained for languages for which a sufficient amount of pre-training
+data is available, effectively excluding many languages that are typically
+characterised as under-resourced. However, there is both circumstantial and
+empirical evidence that multilingual LLMs, which have been trained using data
+sets that cover multiple languages (including under-resourced ones), do exhibit
+strong capabilities for some of these under-resourced languages. Eventually,
+this approach may have the potential to be a technological off-ramp for those
+under-resourced languages for which "native" LLMs, and LLM-based technologies,
+cannot be developed due to a lack of training data. This paper, which
+concentrates on European languages, examines this idea, analyses the current
+situation in terms of technology support and summarises related work. The
+article concludes by focusing on the key open questions that need to be
+answered for the approach to be put into practice in a systematic way.
+
+摘要：大型語言模型 (LLM) 展現前所未有的能力，並定義了幾乎所有自然語言處理 (NLP) 任務以及所有語言技術 (LT) 應用的最新技術。LLM 只能針對有足夠預訓練資料可用的語言進行訓練，實際上排除了許多通常被歸類為資源不足的語言。然而，有環境和經驗證據顯示，多語言 LLM 已使用涵蓋多種語言（包括資源不足的語言）的資料集進行訓練，確實對其中一些資源不足的語言展現出強大的能力。最終，這種方法可能具有成為那些由於缺乏訓練資料而無法開發「原生」LLM 和基於 LLM 的技術的資源不足語言的技術跳板的潛力。本文專注於歐洲語言，探討這個想法，分析技術支援方面的現狀，並總結相關工作。本文最後專注於必須回答的主要開放性問題，以便系統性地實踐這種方法。
+
+##### **How desirable is alignment between LLMs and linguistically diverse human users?**
+2502.12884v1 by Pia Knoeferle, Sebastian Möller, Dorothea Kolossa, Veronika Solopova, Georg Rehm
+
+We discuss how desirable it is that Large Language Models (LLMs) be able to
+adapt or align their language behavior with users who may be diverse in their
+language use. User diversity may arise from, among other factors, i) age
+differences, ii) gender characteristics, and/or iii) multilingual experience,
+and associated differences in language processing and use. We consider
+potential consequences for usability, communication, and LLM development.
+
+摘要：我們探討大型語言模型 (LLM) 能夠適應或調整其語言行為，以適應語言使用可能多樣化的使用者，這有多麼可取。使用者多樣性可能出於以下原因而產生：i) 年齡差異；ii) 性別特徵，和/或 iii) 多語言經驗，以及語言處理和使用上的相關差異。我們考慮對可用性、溝通和 LLM 開發的潛在後果。
+
+##### **Continuous Learning Conversational AI: A Personalized Agent Framework via A2C Reinforcement Learning**
+2502.12876v1 by Nandakishor M, Anjali M
+
+Creating personalized and adaptable conversational AI remains a key
+challenge. This paper introduces a Continuous Learning Conversational AI (CLCA)
+approach, implemented using A2C reinforcement learning, to move beyond static
+Large Language Models (LLMs). We use simulated sales dialogues, generated by
+LLMs, to train an A2C agent. This agent learns to optimize conversation
+strategies for personalization, focusing on engagement and delivering value.
+Our system architecture integrates reinforcement learning with LLMs for both
+data creation and response selection. This method offers a practical way to
+build personalized AI companions that evolve through continuous learning,
+advancing beyond traditional static LLM techniques.
+
+摘要：建立個人化且適應性強的對話式 AI 仍然是一項關鍵挑戰。本文介紹了一種持續學習對話式 AI (CLCA) 方法，透過 A2C 強化學習實作，以超越靜態大型語言模型 (LLM)。我們使用 LLM 生成的模擬銷售對話來訓練 A2C 代理。此代理會學習最佳化對話策略以實現個人化，並專注於參與和提供價值。我們的系統架構將強化學習與 LLM 整合，用於資料建立和回應選取。此方法提供了一種實用的方式來建立個人化 AI 伴侶，這些伴侶會透過持續學習而演進，超越傳統的靜態 LLM 技術。
+
+##### **PAFT: Prompt-Agnostic Fine-Tuning**
+2502.12859v1 by Chenxing Wei, Yao Shu, Mingwen Ou, Ying Tiffany He, Fei Richard Yu
+
+While Large Language Models (LLMs) adapt well to downstream tasks after
+fine-tuning, this adaptability often compromises prompt robustness, as even
+minor prompt variations can significantly degrade performance. To address this,
+we propose Prompt-Agnostic Fine-Tuning (PAFT), a simple yet effective approach
+that dynamically adjusts prompts during fine-tuning. This encourages the model
+to learn underlying task principles rather than overfitting to specific prompt
+formulations. PAFT operates in two stages: First, a diverse set of meaningful,
+synthetic candidate prompts is constructed. Second, during fine-tuning, prompts
+are randomly sampled from this set to create dynamic training inputs. Extensive
+experiments across diverse datasets and LLMs demonstrate that models trained
+with PAFT exhibit strong robustness and generalization across a wide range of
+prompts, including unseen ones. This enhanced robustness improves both model
+performance and inference speed while maintaining training efficiency. Ablation
+studies further confirm the effectiveness of PAFT.
+
+摘要：儘管大型語言模型 (LLM) 在微調後能很好地適應下游任務，但這種適應性通常會損害提示的穩健性，因為即使微小的提示變異也會大幅降低效能。為了解決這個問題，我們提出提示不可知微調 (PAFT)，這是一種簡單卻有效的方法，可以在微調期間動態調整提示。這鼓勵模型學習底層任務原則，而不是過度擬合特定的提示表述。PAFT 分為兩個階段運作：首先，構建一組多樣化、有意義的合成候選提示。其次，在微調期間，從此集合中隨機抽取提示以建立動態訓練輸入。針對各種資料集和 LLM 進行的廣泛實驗表明，使用 PAFT 訓練的模型在各種提示中表現出強大的穩健性和概括性，包括未見過的提示。這種增強的穩健性同時改善了模型效能和推理速度，同時維持訓練效率。消融研究進一步證實了 PAFT 的有效性。
+
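The second PAFT stage, sampling a prompt at random for each training example, can be sketched as follows. The candidate templates, the `{question}` placeholder, and the `build_training_inputs` helper are made up for illustration; the abstract does not describe how the candidate prompts are constructed or how often they are resampled.

```python
import random

# Stage 1 (assumed to be done elsewhere): a set of candidate prompt templates.
CANDIDATE_PROMPTS = [
    "Question: {question}\nAnswer:",
    "Please answer the following question.\n{question}",
    "{question}\nGive a short answer:",
]


def build_training_inputs(examples, prompts, rng):
    """Stage 2: pair each (question, answer) with a randomly sampled template."""
    inputs = []
    for question, answer in examples:
        template = rng.choice(prompts)
        inputs.append((template.format(question=question), answer))
    return inputs
```

Calling `build_training_inputs` again at the start of each epoch, with the same `random.Random` instance, would give the model a different prompt wording for the same example over time, which is the intended source of robustness in a setup like this.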
+##### **Rejected Dialects: Biases Against African American Language in Reward Models**
+2502.12858v1 by Joel Mire, Zubin Trivadi Aysola, Daniel Chechelnitsky, Nicholas Deas, Chrysoula Zerva, Maarten Sap
+
+Preference alignment via reward models helps build safe, helpful, and
+reliable large language models (LLMs). However, subjectivity in preference
+judgments and the lack of representative sampling in preference data collection
+can introduce new biases, hindering reward models' fairness and equity. In this
+work, we introduce a framework for evaluating dialect biases in reward models
+and conduct a case study on biases against African American Language (AAL)
+through several experiments comparing reward model preferences and behavior on
+paired White Mainstream English (WME) and both machine-translated and
+human-written AAL corpora. We show that reward models are less aligned with
+human preferences when processing AAL texts vs. WME ones (-4% accuracy on
+average), frequently disprefer AAL-aligned texts vs. WME-aligned ones, and
+steer conversations toward WME, even when prompted with AAL texts. Our findings
+provide a targeted analysis of anti-AAL biases at a relatively understudied
+stage in LLM development, highlighting representational harms and ethical
+questions about the desired behavior of LLMs concerning AAL.
+
+摘要：透過獎勵模型進行偏好比對有助於建立安全、有用的可靠大型語言模型 (LLM)。然而，偏好判斷的主觀性，以及偏好資料收集中缺乏代表性抽樣，可能會引進新的偏誤，阻礙獎勵模型的公平性和公正性。在這項工作中，我們引進一個用於評估獎勵模型中方言偏誤的架構，並透過數個實驗進行案例研究，探討針對非裔美國人語言 (AAL) 的偏誤，這些實驗比較了獎勵模型偏好和行為，比較成對的白人主流英語 (WME) 與機器翻譯和人類撰寫的 AAL 語料庫。我們顯示，與處理 WME 文字相比，獎勵模型在處理 AAL 文字時與人類偏好較不一致（平均準確度降低 4%），經常不偏好與 AAL 一致的文字，而偏好與 WME 一致的文字，並將對話導向 WME，即使提示的是 AAL 文字。我們的發現針對 LLM 開發中相對未受重視的階段，提供針對反 AAL 偏誤的目標分析，強調與表徵相關的危害和關於 LLM 對 AAL 的期望行為的倫理問題。
+
+##### **Integrating Arithmetic Learning Improves Mathematical Reasoning in Smaller Models**
+2502.12855v1 by Neeraj Gangwar, Suma P Bhat, Nickvash Kani
+
+While large models pre-trained on high-quality data exhibit excellent
+performance across various reasoning tasks, including mathematical reasoning
+(e.g. GSM8k, MultiArith), specializing smaller models to excel at mathematical
+reasoning remains a challenging problem. Common approaches to address this
+challenge include knowledge distillation, where smaller student models learn
+from large pre-trained teacher models, and data augmentation, such as
+rephrasing questions. Despite these efforts, smaller models struggle with
+arithmetic computations, leading to errors in mathematical reasoning. In this
+work, we focus on leveraging a programmatically generated arithmetic dataset to
+enhance the reasoning capabilities of smaller models. We investigate two key
+approaches to incorporate this dataset -- (1) intermediate fine-tuning, where a
+model is fine-tuned on the arithmetic dataset before being trained on a
+reasoning dataset, and (2) integrating the arithmetic dataset into the
+instruction-tuning mixture, allowing the model to learn arithmetic skills
+alongside general instruction-following abilities. Our experiments on multiple
+reasoning benchmarks demonstrate that incorporating an arithmetic dataset,
+whether through targeted fine-tuning or within the instruction-tuning mixture,
+enhances the models' arithmetic capabilities, which in turn improves their
+mathematical reasoning performance.
+
+摘要：大型模型经过针对高质量数据的预训练，在各种推理任务中表现出色，包括数学推理（例如 GSM8k、MultiArith），但专门化小型模型以擅长数学推理仍然是一个具有挑战性的问题。解决这一挑战的常见方法包括知识蒸馏，其中较小的学生模型从经过预训练的大型教师模型中学习，以及数据增强，例如重新表述问题。尽管做出了这些努力，较小的模型在算术计算中仍然存在困难，从而导致数学推理错误。在这项工作中，我们专注于利用程序化生成的算术数据集来增强较小模型的推理能力。我们研究了两种关键方法来合并此数据集——（1）中间微调，其中模型在算术数据集上进行微调，然后在推理数据集上进行训练，以及（2）将算术数据集集成到指令微调混合中，允许模型学习算术技能以及一般的指令遵循能力。我们在多个推理基准上的实验表明，通过有针对性的微调或在指令微调混合中合并算术数据集，增强了模型的算术能力，进而提高了它们的数学推理性能。
+
+##### **S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning**
+2502.12853v1 by Ruotian Ma, Peisong Wang, Cheng Liu, Xingyan Liu, Jiaqi Chen, Bang Zhang, Xin Zhou, Nan Du, Jia Li
+
+Recent studies have demonstrated the effectiveness of LLM test-time scaling.
+However, existing approaches to incentivize LLMs' deep thinking abilities
+generally require large-scale data or significant training efforts. Meanwhile,
+it remains unclear how to improve the thinking abilities of less powerful base
+models. In this work, we introduce S$^2$R, an efficient framework that enhances
+LLM reasoning by teaching models to self-verify and self-correct during
+inference. Specifically, we first initialize LLMs with iterative
+self-verification and self-correction behaviors through supervised fine-tuning
+on carefully curated data. The self-verification and self-correction skills are
+then further strengthened by both outcome-level and process-level reinforcement
+learning, with minimized resource requirements, enabling the model to
+adaptively refine its reasoning process during inference. Our results
+demonstrate that, with only 3.1k self-verifying and self-correcting behavior
+initialization samples, Qwen2.5-math-7B achieves an accuracy improvement from
+51.0% to 81.6%, outperforming models trained on an equivalent amount of
+long-CoT distilled data. Extensive experiments and analysis based on three base
+models across both in-domain and out-of-domain benchmarks validate the
+effectiveness of S$^2$R. Our code and data are available at
+https://github.com/NineAbyss/S2R.
+
+摘要：最近的研究表明了 LLM 测试时间扩展的有效性。然而，现有激励 LLM 深度思考能力的方法通常需要大规模数据或大量的训练工作。同时，如何提高较弱基础模型的思考能力仍然不清楚。在这项工作中，我们引入了 S$^2$R，一个通过教导模型在推理过程中进行自我验证和自我纠正来增强 LLM 推理的有效框架。具体来说，我们首先通过监督微调对精心整理的数据来初始化具有迭代自我验证和自我纠正行为的 LLM。然后通过结果级别和过程级别的强化学习进一步加强自我验证和自我纠正技能，同时最大程度地减少资源需求，使模型能够在推理过程中自适应地优化其推理过程。我们的结果表明，仅使用 3.1k 个自我验证和自我纠正行为初始化样本，Qwen2.5-math-7B 的准确率从 51.0% 提高到 81.6%，优于在等量长 CoT 蒸馏数据上训练的模型。基于三个基础模型在域内和域外基准上的广泛实验和分析验证了 S$^2$R 的有效性。我们的代码和数据可以在 https://github.com/NineAbyss/S2R 获得。
+
+##### **MVL-SIB: A Massively Multilingual Vision-Language Benchmark for Cross-Modal Topical Matching**
+2502.12852v1 by Fabian David Schmidt, Florian Schneider, Chris Biemann, Goran Glavaš
+
+Existing multilingual vision-language (VL) benchmarks often only cover a
+handful of languages. Consequently, evaluations of large vision-language models
+(LVLMs) predominantly target high-resource languages, underscoring the need for
+evaluation data for low-resource languages. To address this limitation, we
+introduce MVL-SIB, a massively multilingual vision-language benchmark that
+evaluates both cross-modal and text-only topical matching across 205 languages
+-- over 100 more than the most multilingual existing VL benchmarks encompass.
+We then benchmark a range of open-weight LVLMs together with GPT-4o(-mini)
+on MVL-SIB. Our results reveal that LVLMs struggle in cross-modal topic
+matching in lower-resource languages, performing no better than chance on
+languages like N'Koo. Our analysis further reveals that VL support in LVLMs
+declines disproportionately relative to textual support for lower-resource
+languages, as evidenced by comparison of cross-modal and text-only topical
+matching performance. We further observe that open-weight LVLMs do not benefit
+from representing a topic with more than one image, suggesting that these
+models are not yet fully effective at handling multi-image tasks. By
+correlating performance on MVL-SIB with other multilingual VL benchmarks, we
+highlight that MVL-SIB serves as a comprehensive probe of multilingual VL
+understanding in LVLMs.
+
+摘要：現有的多語言視覺語言 (VL) 基準通常只涵蓋少數語言。因此，大型視覺語言模型 (LVLMs) 的評估主要針對資源豐富的語言，強調了對資源匱乏語言的評估資料的需求。為了解決此限制，我們引入了 MVL-SIB，一個大規模的多語言視覺語言基準，它評估了 205 種語言的跨模態和純文字主題匹配，比現有的多語言 VL 基準涵蓋的語言多出 100 多種。然後，我們在 MVL-SIB 上對一系列開放權重的 LVLMs 與 GPT-4o(-mini) 進行了基準測試。我們的結果表明，LVLMs 在資源較少的語言中難以進行跨模態主題匹配，在 N'Koo 等語言上的表現不比隨機好。我們的分析進一步表明，LVLMs 中的 VL 支援相對於資源較少的語言的文字支援下降得不成比例，這從跨模態和純文字主題匹配效能的比較中可以看出。我們進一步觀察到，開放權重的 LVLMs 無法從用多於一張影像來表示主題中受益，這表明這些模型在處理多影像任務方面尚未完全有效。通過將 MVL-SIB 上的效能與其他多語言 VL 基準相關聯，我們強調 MVL-SIB 可作為 LVLMs 中多語言 VL 理解的綜合探測。
+
+##### **MeMo: Towards Language Models with Associative Memory Mechanisms**
+2502.12851v1 by Fabio Massimo Zanzotto, Elena Sofia Ruzzetti, Giancarlo A. Xompero, Leonardo Ranaldi, Davide Venditti, Federico Ranaldi, Cristina Giannone, Andrea Favalli, Raniero Romagnoli
+
+Memorization is a fundamental ability of Transformer-based Large Language
+Models, achieved through learning. In this paper, we propose a paradigm shift
+by designing an architecture to memorize text directly, bearing in mind the
+principle that memorization precedes learning. We introduce MeMo, a novel
+architecture for language modeling that explicitly memorizes sequences of
+tokens in layered associative memories. By design, MeMo offers transparency and
+the possibility of model editing, including forgetting texts. We experimented
+with the MeMo architecture, showing the memorization power of the one-layer and
+the multi-layer configurations.
+
+摘要：記憶是 Transformer 大型語言模型的基本能力，可透過學習達成。在本文中，我們提出一個典範轉移，透過設計一個架構來直接記憶文字，並牢記記憶先於學習的原則。我們導入 MeMo，一個新穎的語言建模架構，可明確地記憶分層關聯式記憶中的代幣序列。透過設計，MeMo 提供透明度和模型編輯的可能性，包括遺忘文字。我們實驗了 MeMo 架構，展示了單層和多層組態的記憶力。
+
+##### **Towards Adaptive Feedback with AI: Comparing the Feedback Quality of LLMs and Teachers on Experimentation Protocols**
+2502.12842v1 by Kathrin Seßler, Arne Bewersdorff, Claudia Nerdel, Enkelejda Kasneci
+
+Effective feedback is essential for fostering students' success in scientific
+inquiry. With advancements in artificial intelligence, large language models
+(LLMs) offer new possibilities for delivering instant and adaptive feedback.
+However, this feedback often lacks the pedagogical validation provided by
+real-world practitioners. To address this limitation, our study evaluates and
+compares the feedback quality of LLM agents with that of human teachers and
+science education experts on student-written experimentation protocols. Four
+blinded raters, all professionals in scientific inquiry and science education,
+evaluated the feedback texts generated by 1) the LLM agent, 2) the teachers and
+3) the science education experts using a five-point Likert scale based on six
+criteria of effective feedback: Feed Up, Feed Back, Feed Forward, Constructive
+Tone, Linguistic Clarity, and Technical Terminology. Our results indicate that
+LLM-generated feedback shows no significant difference from that of teachers and
+experts in overall quality. However, the LLM agent's performance lags in the
+Feed Back dimension, which involves identifying and explaining errors within
+the student's work context. Qualitative analysis highlighted the LLM agent's
+limitations in contextual understanding and in the clear communication of
+specific errors. Our findings suggest that combining LLM-generated feedback
+with human expertise can enhance educational practices by leveraging the
+efficiency of LLMs and the nuanced understanding of educators.
+
+摘要：有效的回饋對於培養學生在科學探究中的成功至關重要。隨著人工智慧的進步，大型語言模型 (LLM) 為提供即時且適應性的回饋提供了新的可能性。然而，此回饋通常缺乏實際從業者提供的教學驗證。為了解決此限制，我們的研究評估並比較了 LLM 代理與人類教師和科學教育專家在學生撰寫的實驗協定上的回饋品質。四位盲評者，皆為科學探究和科學教育專業人士，使用基於六個有效回饋準則的五點李克特量表評估由 1) LLM 代理、2) 教師和 3) 科學教育專家產生的回饋文字：鼓勵、回饋、前饋、建設性語氣、語言清晰度和技術術語。我們的結果表明，LLM 產生的回饋在整體品質上與教師和專家產生的回饋沒有顯著差異。然而，LLM 代理的表現落後於回饋面向，這涉及在學生的作業背景中識別和解釋錯誤。定性分析突顯了 LLM 代理在情境理解和明確傳達特定錯誤方面的限制。我們的研究結果表明，將 LLM 產生的回饋與人類專業知識相結合，可以透過利用 LLM 的效率和教育者的細緻理解來提升教育實務。
+
+##### **Towards Equitable AI: Detecting Bias in Using Large Language Models for Marketing**
+2502.12838v1 by Berk Yilmaz, Huthaifa I. Ashqar
+
+The recent advances in large language models (LLMs) have revolutionized
+industries such as finance, marketing, and customer service by enabling
+sophisticated natural language processing tasks. However, the broad adoption of
+LLMs brings significant challenges, particularly in the form of social biases
+that can be embedded within their outputs. Biases related to gender, age, and
+other sensitive attributes can lead to unfair treatment, raising ethical
+concerns and risking both company reputation and customer trust. This study
+examined bias in finance-related marketing slogans generated by LLMs (i.e.,
+ChatGPT) by prompting tailored ads targeting five demographic categories:
+gender, marital status, age, income level, and education level. A total of
+1,700 slogans were generated for 17 unique demographic groups, and key terms
+were categorized into four thematic groups: empowerment, financial, benefits
+and features, and personalization. Bias was systematically assessed using
+relative bias calculations and statistically tested with the Kolmogorov-Smirnov
+(KS) test against general slogans generated for any individual. Results
+revealed that marketing slogans are not neutral; rather, they emphasize
+different themes based on demographic factors. Women, younger individuals,
+low-income earners, and those with lower education levels receive more distinct
+messaging compared to older, higher-income, and highly educated individuals.
+This underscores the need to consider demographic-based biases in AI-generated
+marketing strategies and their broader societal implications. The findings of
+this study provide a roadmap for developing more equitable AI systems,
+highlighting the need for ongoing bias detection and mitigation efforts in
+LLMs.
+
+摘要：大型語言模型 (LLM) 的最新進展徹底改變了金融、行銷和客戶服務等產業，因為它能執行複雜的自然語言處理任務。然而，LLM 的廣泛採用帶來重大的挑戰，特別是潛藏在其輸出結果中的社會偏見形式。與性別、年齡和其他敏感屬性相關的偏見可能導致不公平的待遇，引發道德問題，並危及公司聲譽和客戶信任。本研究探討了 LLM（即 ChatGPT）產生的與金融相關的行銷標語中的偏見，方法是針對五個人口統計類別：性別、婚姻狀況、年齡、收入水準和教育水準，提示量身打造的廣告。總共為 17 個獨特的人口統計群組產生了 1,700 個標語，並且關鍵詞被分類為四個主題群組：賦權、財務、好處和功能，以及個人化。偏見使用相對偏見計算進行系統性評估，並使用科爾莫哥洛夫-史米諾夫 (KS) 檢定與針對任何個人產生的通用標語進行統計檢定。結果顯示行銷標語並非中立；相反地，它們根據人口統計因素強調不同的主題。與年紀較大、收入較高和受教育程度較高的個人相比，女性、年輕人、低收入者和教育程度較低者接收到的訊息更為不同。這強調了在 AI 生成的行銷策略中考量基於人口統計的偏見及其更廣泛的社會影響的必要性。本研究的發現提供了開發更公平 AI 系統的路線圖，突顯了在 LLM 中持續進行偏見偵測和緩解工作的重要性。
+
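The Kolmogorov-Smirnov comparison mentioned in the abstract above can be illustrated with a minimal two-sample KS statistic, which is the largest gap between the empirical distribution functions of two samples. What the study actually feeds into the test (for example, per-slogan term frequencies for one demographic group versus the general slogans) is not stated in the abstract, so the inputs here are placeholders, and the `ks_statistic` helper computes only the statistic, not a p-value.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: maximum gap between the two empirical CDFs."""
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    # The gap can only change at observed values, so checking those is enough.
    points = sorted(set(sample_a) | set(sample_b))
    d = 0.0
    for x in points:
        cdf_a = sum(1 for v in sample_a if v <= x) / len(sample_a)
        cdf_b = sum(1 for v in sample_b if v <= x) / len(sample_b)
        d = max(d, abs(cdf_a - cdf_b))
    return d
```

A statistic near 0 means the two samples are distributed similarly, while a value near 1 means they barely overlap. This quadratic-time version is meant for readability on small samples; a statistics library would be the usual choice for real analyses and for significance testing.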
+##### **An LLM-Powered Agent for Physiological Data Analysis: A Case Study on PPG-based Heart Rate Estimation**
+2502.12836v1 by Mohammad Feli, Iman Azimi, Pasi Liljeberg, Amir M. Rahmani
+
+Large language models (LLMs) are revolutionizing healthcare by improving
+diagnosis, patient care, and decision support through interactive
+communication. More recently, they have been applied to analyzing physiological
+time-series like wearable data for health insight extraction. Existing methods
+embed raw numerical sequences directly into prompts, which exceeds token limits
+and increases computational costs. Additionally, some studies integrated
+features extracted from time-series in textual prompts or applied multimodal
+approaches. However, these methods often produce generic and unreliable outputs
+due to LLMs' limited analytical rigor and inefficiency in interpreting
+continuous waveforms. In this paper, we develop an LLM-powered agent for
+physiological time-series analysis that aims to bridge the gap in integrating LLMs
+with well-established analytical tools. Built on OpenCHA, an open-source
+LLM-powered framework, our agent features an orchestrator that integrates user
+interaction, data sources, and analytical tools to generate accurate health
+insights. To evaluate its effectiveness, we implement a case study on heart
+rate (HR) estimation from Photoplethysmogram (PPG) signals using a dataset of
+PPG and Electrocardiogram (ECG) recordings in a remote health monitoring study.
+The agent's performance is benchmarked against OpenAI GPT-4o-mini and GPT-4o,
+with ECG serving as the gold standard for HR estimation. Results demonstrate
+that our agent significantly outperforms benchmark models by achieving lower
+error rates and more reliable HR estimations. The agent implementation is
+publicly available on GitHub.
+
+摘要：大型語言模型 (LLM) 透過互動式溝通，改善診斷、病人照護和決策支援，進而革新醫療保健。最近，它們已應用於分析生理時間序列，例如可穿戴式裝置的資料，以萃取健康見解。現有方法會將原始數值序列直接嵌入提示中，這會超過權杖限制並增加運算成本。此外，一些研究將從時間序列中萃取的特徵整合到文字提示中，或應用多模態方法。然而，由於 LLM 在解譯連續波形時分析嚴謹度有限且效率不彰，這些方法經常產生通用且不可靠的輸出。在本文中，我們開發了一個由 LLM 驅動的代理，用於生理時間序列分析，旨在彌合將 LLM 與既有分析工具整合的差距。我們的代理建立在 OpenCHA（一個由 LLM 驅動的開源架構）之上，具備一個整合使用者互動、資料來源和分析工具的協調器，以產生準確的健康見解。為了評估其有效性，我們實作了一個案例研究，從遠距健康監測研究中的一組光電容積描記圖 (PPG) 和心電圖 (ECG) 記錄中估算心率 (HR)。該代理的效能與 OpenAI GPT-4o-mini 和 GPT-4o 進行基準測試，其中 ECG 作為 HR 估算的金標準。結果顯示，我們的代理透過達成較低的錯誤率和更可靠的 HR 估算，顯著優於基準模型。該代理實作已公開在 GitHub 上。
+
+##### **Subword models struggle with word learning, but surprisal hides it**
+2502.12835v1 by Bastian Bunzeck, Sina Zarrieß
+
+We study word learning in subword and character language models with the
+psycholinguistic lexical decision task. While subword LMs struggle to discern
+words and non-words with high accuracy, character LMs solve this task easily
+and consistently. Furthermore, a comparison of word learning and syntactic
+learning shows that the two processes are separable in character LMs, where
+word learning precedes syntactic learning, whereas they are simultaneous in
+subword LMs. This raises questions about the adequacy of subword LMs for
+modeling language acquisition and positions character LMs as a viable
+alternative.
+
+摘要：我們使用心理語言學的詞彙決策任務研究在子詞和字元語言模型中的詞彙學習。儘管子詞語言模型難以區分單詞和非單詞，但字元語言模型可以輕鬆且一致地解決此任務。此外，在比較單詞學習和句法學習時，這兩個過程在字元語言模型中是可分離的，其中單詞學習先於句法學習，而這些過程在子詞語言模型中是同時發生的。這引發了關於子詞語言模型對語言習得建模的充分性的問題，並將字元語言模型定位為可行的替代方案。
+
+##### **KazMMLU: Evaluating Language Models on Kazakh, Russian, and Regional Knowledge of Kazakhstan**
+2502.12829v1 by Mukhammed Togmanov, Nurdaulet Mukhituly, Diana Turmakhan, Jonibek Mansurov, Maiya Goloburda, Akhmed Sakip, Zhuohan Xie, Yuxia Wang, Bekassyl Syzdykov, Nurkhan Laiyk, Alham Fikri Aji, Ekaterina Kochmar, Preslav Nakov, Fajri Koto
+
+Despite having a population of twenty million, Kazakhstan's culture and
+language remain underrepresented in the field of natural language processing.
+Although large language models (LLMs) continue to advance worldwide, progress
+in the Kazakh language has been limited, as seen in the scarcity of dedicated
+models and benchmark evaluations. To address this gap, we introduce KazMMLU,
+the first MMLU-style dataset designed specifically for the Kazakh language. KazMMLU
+comprises 23,000 questions that cover various educational levels, including
+STEM, humanities, and social sciences, sourced from authentic educational
+materials and manually validated by native speakers and educators. The dataset
+includes 10,969 Kazakh questions and 12,031 Russian questions, reflecting
+Kazakhstan's bilingual education system and rich local context. Our evaluation
+of several state-of-the-art multilingual models (Llama-3.1, Qwen-2.5, GPT-4,
+and DeepSeek V3) demonstrates substantial room for improvement, as even the
+best-performing models struggle to achieve competitive performance in Kazakh
+and Russian. These findings underscore significant performance gaps compared to
+high-resource languages. We hope that our dataset will enable further research
+and development of Kazakh-centric LLMs. Data and code will be made available
+upon acceptance.
+
+摘要：儘管哈薩克人口達兩千萬，但哈薩克的文化和語言在自然語言處理領域仍未得到充分的重視。儘管大型語言模型 (LLM) 在全球持續進步，但哈薩克語的進展卻十分有限，這從專用模型和基準評估的稀缺性中可見一斑。為了解決這個差距，我們引入了 KazMMLU，這是第一個專門為哈薩克語設計的 MMLU 風格資料集。KazMMLU 包含 23,000 個問題，涵蓋各種教育層級，包括 STEM、人文學科和社會科學，這些問題來自真實的教育材料，並由母語人士和教育工作者手動驗證。該資料集包含 10,969 個哈薩克語問題和 12,031 個俄語問題，反映了哈薩克的雙語教育體系和豐富的在地脈絡。我們對幾個最先進的多語言模型（Llama-3.1、Qwen-2.5、GPT-4 和 DeepSeek V3）的評估顯示，仍有很大的改進空間，因為即使是效能最好的模型，也很難在哈薩克語和俄語中達到有競爭力的效能。這些發現強調了與資源豐富的語言相比，存在顯著的效能差距。我們希望我們的資料集能促進以哈薩克語為中心的 LLM 的進一步研究和開發。資料和程式碼將在獲得接受後提供。
+
+##### **Reasoning and the Trusting Behavior of DeepSeek and GPT: An Experiment Revealing Hidden Fault Lines in Large Language Models**
+2502.12825v1 by Rubing Lu, João Sedoc, Arun Sundararajan
+
+When encountering increasingly frequent performance improvements or cost
+reductions from a new large language model (LLM), developers of applications
+leveraging LLMs must decide whether to take advantage of these improvements or
+stay with older tried-and-tested models. Low perceived switching frictions can
+lead to choices that do not consider more subtle behavior changes that the
+transition may induce. Our experiments use a popular game-theoretic behavioral
+economics model of trust to show stark differences in the trusting behavior of
+OpenAI's and DeepSeek's models. We highlight a collapse in the economic trust
+behavior of the o1-mini and o3-mini models as they reconcile profit-maximizing
+and risk-seeking with future returns from trust, and contrast it with
+DeepSeek's more sophisticated and profitable trusting behavior that stems from
+an ability to incorporate deeper concepts like forward planning and
+theory-of-mind. As LLMs form the basis for high-stakes commercial systems, our
+results highlight the perils of relying on LLM performance benchmarks that are
+too narrowly defined and suggest that careful analysis of their hidden fault
+lines should be part of any organization's AI strategy.
+
+摘要：當遇到越來越頻繁的效能提升或來自於新的大型語言模型 (LLM) 的成本降低時，利用 LLM 的應用程式開發人員必須決定是否要利用這些提升或維持較舊且經過測試的模型。低感知切換摩擦可能會導致選擇不考慮轉換可能誘發的更細微的行為改變。我們的實驗使用信任的流行博弈論行為經濟模型來顯示 OpenAI 和 DeepSeek 模型在信任行為上的顯著差異。我們強調 o1-mini 和 o3-mini 模型的經濟信任行為崩潰，因為它們調和了利潤最大化和風險尋求與來自信任的未來回報，並將其與 DeepSeek 更複雜且有利可圖的信任行為進行對比，這種信任行為源於整合更深層的概念，例如前瞻性規劃和心智理論。由於 LLM 構成高風險商業系統的基礎，我們的結果突顯了依賴定義過於狹窄的 LLM 效能基準的危險性，並建議仔細分析其隱藏的斷層線應該是任何組織的 AI 策略的一部分。
+
+##### **Pitfalls of Scale: Investigating the Inverse Task of Redefinition in Large Language Models**
+2502.12821v1 by Elena Stringli, Maria Lymperaiou, Giorgos Filandrianos, Giorgos Stamou
+
+Inverse tasks can uncover potential reasoning gaps as Large Language Models
+(LLMs) scale up. In this work, we explore the redefinition task, in which we
+assign alternative values to well-known physical constants and units of
+measure, prompting LLMs to respond accordingly. Our findings show that not only
+does model performance degrade with scale, but its false confidence also rises.
+Moreover, while factors such as prompting strategies or response formatting are
+influential, they do not preclude LLMs from anchoring to memorized values.
+
+摘要：逆向任務可以揭示大型語言模型 (LLM) 擴展時潛在的推理差距。在本文中，我們探討重新定義任務，其中我們將替換值指定給著名的物理常數和測量單位，促使 LLM 做出相應回應。我們的研究結果表明，模型效能不僅會隨著規模而下降，其虛假信心也會上升。此外，儘管提示策略或回應格式等因素具有影響力，但它們並不妨礙 LLM 錨定在記憶值上。
+
+##### **Simulating User Diversity in Task-Oriented Dialogue Systems using Large Language Models**
+2502.12813v1 by Adnan Ahmad, Stefan Hillmann, Sebastian Möller
+
+In this study, we explore the application of Large Language Models (LLMs) for
+generating synthetic users and simulating user conversations with a
+task-oriented dialogue system and present detailed results and their analysis.
+We propose a comprehensive, novel user simulation technique that uses LLMs to
+create diverse user profiles, set goals, engage in multi-turn dialogues, and
+evaluate conversation success. We employ two proprietary
+LLMs, namely GPT-4o and GPT-o1 (Achiam et al., 2023), to generate a
+heterogeneous base of user profiles, characterized by varied demographics,
+multiple user goals, different conversational styles, initial knowledge levels,
+interests, and conversational objectives. We perform a detailed analysis of the
+user profiles generated by LLMs to assess the diversity, consistency, and
+potential biases inherent in these LLM-generated user simulations. We find that
+GPT-o1 generates a more heterogeneous user distribution across most user
+attributes, while GPT-4o generates more skewed user attributes. The generated
+set of user profiles is then used to simulate dialogue sessions by
+interacting with a task-oriented dialogue system.
+
+摘要：在這項研究中，我們探討大型語言模型 (LLM) 在生成合成使用者和模擬使用者對話，並使用任務導向對話系統進行對話的應用，並提出詳細的結果及其分析。我們提出了一種全面的使用者模擬技術新方法，利用 LLM 建立多樣化的使用者概況、設定目標、參與多輪對話，並評估對話的成功性。我們採用了兩個專有的 LLM，即 GPT-4o 和 GPT-o1 (Achiam 等人，2023 年)，以生成一個異質的使用者概況基礎，其特徵在於不同的人口統計資料、多個使用者目標、不同的對話風格、初始知識水準、興趣和對話目標。我們對 LLM 生成的使用者概況進行了詳細分析，以評估這些 LLM 生成的使用者模擬中固有的多樣性、一致性和潛在偏差。我們發現 GPT-o1 在大多數使用者屬性中產生更異質的使用者分佈，而 GPT-4o 則產生更偏斜的使用者屬性。然後利用生成的使用者概況集，透過與任務導向對話系統互動來模擬對話會話。
+
+##### **Towards Text-Image Interleaved Retrieval**
+2502.12799v1 by Xin Zhang, Ziqi Dai, Yongqi Li, Yanzhao Zhang, Dingkun Long, Pengjun Xie, Meishan Zhang, Jun Yu, Wenjie Li, Min Zhang
+
+Current multimodal information retrieval studies mainly focus on single-image
+inputs, which limits real-world applications involving multiple images and
+text-image interleaved content. In this work, we introduce the text-image
+interleaved retrieval (TIIR) task, where the query and document are interleaved
+text-image sequences, and the model is required to understand the semantics
+from the interleaved context for effective retrieval. We construct a TIIR
+benchmark based on naturally interleaved wikiHow tutorials, where a specific
+pipeline is designed to generate interleaved queries. To explore the task, we
+adapt several off-the-shelf retrievers and build a dense baseline with an
+interleaved multimodal large language model (MLLM). We then propose a novel
+Matryoshka Multimodal Embedder (MME), which compresses the number of visual
+tokens at different granularities, to address the challenge of excessive visual
+tokens in MLLM-based TIIR models. Experiments demonstrate that simple adaptation
+of existing models does not consistently yield effective results. Our MME
+achieves significant improvements over the baseline with substantially fewer
+visual tokens. We provide extensive analysis and will release the dataset and
+code to facilitate future research.
+
+摘要：目前的多模態資訊檢索研究主要集中在單一影像輸入，這限制了涉及多個影像和文字影像交錯內容的實際應用。在這項工作中，我們引入了文字影像交錯檢索 (TIIR) 任務，其中查詢和文件是交錯的文字影像序列，並且模型需要理解交錯內容的語意以進行有效檢索。我們根據自然交錯的 wikiHow 教學課程建構了一個 TIIR 基準，其中設計了一個特定的管線來產生交錯查詢。為了探索這個任務，我們調整了幾個現成的檢索器，並透過交錯的多模態大型語言模型 (MLLM) 建立了一個密集的基準。然後，我們提出了一個新穎的 Matryoshka 多模態嵌入器 (MME)，它壓縮了不同粒度視覺符號的數量，以解決基於 MLLM 的 TIIR 模型中過多視覺符號的挑戰。實驗表明，對現有模型的簡單調整並未持續產生有效結果。我們的 MME 透過大幅減少視覺符號，達到了比基準顯著的改進。我們提供了廣泛的分析，並將釋出資料集和程式碼以促進未來的研究。
+
+##### **Envious Explore and Exploit**
+2502.12798v1 by Omer Ben-Porat, Yotam Gafni, Or Markovetzki
+
+Explore-and-exploit tradeoffs play a key role in recommendation systems
+(RSs), aiming at serving users better by learning from previous interactions.
+Despite their commercial success, the societal effects of explore-and-exploit
+mechanisms are not well understood, especially regarding the utility
+discrepancy they generate between different users. In this work, we measure
+such discrepancy using the economic notion of envy. We present a multi-armed
+bandit-like model in which every round consists of several sessions, and
+rewards are realized once per round. We call the latter property reward
+consistency, and show that the RS can leverage this property for better
+societal outcomes. On the downside, doing so also generates envy, as
+late-to-arrive users enjoy the information gathered by early-to-arrive users.
+We examine the generated envy under several arrival order mechanisms and
+virtually any anonymous algorithm, i.e., any algorithm that treats all similar
+users similarly without leveraging their identities. We provide tight envy
+bounds on uniform arrival and upper bound the envy for nudged arrival, in which
+the RS can affect the order of arrival by nudging its users. Furthermore, we
+study the efficiency-fairness trade-off by devising an algorithm that allows
+constant envy and approximates the optimal welfare in restricted settings.
+Finally, we validate our theoretical results empirically using simulations.
+
+摘要：探索與開發的取捨在推薦系統 (RS) 中扮演著關鍵角色，旨在透過學習先前的互動來為使用者提供更好的服務。儘管在商業上獲得成功，但探索與開發機制的社會效應仍未被充分理解，特別是關於它們在不同使用者之間產生的效用差異。在這項工作中，我們使用經濟學中的嫉妒概念來衡量這種差異。我們提出了一個多臂老虎機模型，其中每一輪都包含多個回合，並且每回合只會實現一次獎勵。我們將後者的特性稱為獎勵一致性，並證明 RS 可以利用此特性來獲得更好的社會成果。不利的是，這麼做也會產生嫉妒，因為較晚加入的使用者可以享受較早加入的使用者所收集的資訊。我們在多種到達順序機制和幾乎任何匿名演算法（即任何演算法都以類似的方式對待所有類似的使用者，而不利用他們的身份）下檢驗產生的嫉妒。我們對均勻到達提供嚴格的嫉妒界線，並對推動到達的上限進行嫉妒界線，其中 RS 可以透過推動其使用者來影響到達順序。此外，我們透過設計一種演算法來研究效率公平權衡，該演算法允許恆定的嫉妒，並在受限設定中近似最佳福利。最後，我們使用模擬對我們的理論結果進行經驗驗證。
+
+##### **Commonsense Reasoning in Arab Culture**
+2502.12788v1 by Abdelrahman Sadallah, Junior Cedric Tonga, Khalid Almubarak, Saeed Almheiri, Farah Atif, Chatrine Qwaider, Karima Kadaoui, Sara Shatnawi, Yaser Alesh, Fajri Koto
+
+Despite progress in Arabic large language models, such as Jais and AceGPT,
+their evaluation on commonsense reasoning has largely relied on
+machine-translated datasets, which lack cultural depth and may introduce
+Anglocentric biases. Commonsense reasoning is shaped by geographical and
+cultural contexts, and existing English datasets fail to capture the diversity
+of the Arab world. To address this, we introduce \datasetname, a commonsense
+reasoning dataset in Modern Standard Arabic (MSA), covering cultures of 13
+countries across the Gulf, Levant, North Africa, and the Nile Valley. The
+dataset was built from scratch by engaging native speakers to write and
+validate culturally relevant questions for their respective countries.
+\datasetname spans 12 daily life domains with 54 fine-grained subtopics,
+reflecting various aspects of social norms, traditions, and everyday
+experiences. Zero-shot evaluations show that open-weight language models with
+up to 32B parameters struggle to comprehend diverse Arab cultures, with
+performance varying across regions. These findings highlight the need for more
+culturally aware models and datasets tailored to the Arabic-speaking world.
+
+摘要：儘管阿拉伯語大型語言模型（例如 Jais 和 AceGPT）已有進展，但它們在常識推理上的評估在很大程度上依賴於機器翻譯的資料集，這些資料集缺乏文化深度，可能會引入以英語為中心的偏見。常識推理受地理和文化背景影響，現有的英文資料集無法捕捉阿拉伯世界的多樣性。為了解決這個問題，我們引入了 \datasetname，一個現代標準阿拉伯語 (MSA) 的常識推理資料集，涵蓋海灣地區、黎凡特地區、北非和尼羅河谷 13 個國家的文化。此資料集是從頭開始建立的，由母語人士參與編寫和驗證他們各自國家的文化相關問題。\datasetname 涵蓋 12 個日常生活領域，包含 54 個細緻的主題，反映社會規範、傳統和日常經驗的各個方面。零次學習評估顯示，具有高達 32B 參數的開放式權重語言模型難以理解不同的阿拉伯文化，且各區域的表現不一。這些發現突顯了對更具文化意識的模型和專為阿拉伯語系世界量身打造的資料集的需求。
+
+##### **VidCapBench: A Comprehensive Benchmark of Video Captioning for Controllable Text-to-Video Generation**
+2502.12782v1 by Xinlong Chen, Yuanxing Zhang, Chongling Rao, Yushuo Guan, Jiaheng Liu, Fuzheng Zhang, Chengru Song, Qiang Liu, Di Zhang, Tieniu Tan
+
+The training of controllable text-to-video (T2V) models relies heavily on the
+alignment between videos and captions, yet little existing research connects
+video caption evaluation with T2V generation assessment. This paper introduces
+VidCapBench, a video caption evaluation scheme specifically designed for T2V
+generation, agnostic to any particular caption format. VidCapBench employs a
+data annotation pipeline, combining expert model labeling and human refinement,
+to associate each collected video with key information spanning video
+aesthetics, content, motion, and physical laws. VidCapBench then partitions
+these key information attributes into automatically assessable and manually
+assessable subsets, catering to both the rapid evaluation needs of agile
+development and the accuracy requirements of thorough validation. By evaluating
+numerous state-of-the-art captioning models, we demonstrate the superior
+stability and comprehensiveness of VidCapBench compared to existing video
+captioning evaluation approaches. Verification with off-the-shelf T2V models
+reveals a significant positive correlation between scores on VidCapBench and
+the T2V quality evaluation metrics, indicating that VidCapBench can provide
+valuable guidance for training T2V models. The project is available at
+https://github.com/VidCapBench/VidCapBench.
+
+摘要：可控制文本到影片 (T2V) 模型的訓練極度仰賴影片和字幕之間的對齊，但現有研究鮮少將影片字幕評估與 T2V 生成評估連結起來。本文介紹 VidCapBench，這是一種專門為 T2V 生成設計的影片字幕評估架構，與任何特定的字幕格式無關。VidCapBench 採用資料標註流程，結合專家模型標記和人工微調，將每個收集到的影片與涵蓋影片美學、內容、動作和物理定律等關鍵資訊關聯起來。VidCapBench 接著將這些關鍵資訊屬性分割成可自動評估和可手動評估的子集，以滿足敏捷開發的快速評估需求和全面驗證的準確性要求。透過評估許多最先進的字幕模型，我們證明了 VidCapBench 與現有的影片字幕評估方法相比，具有優異的穩定性和全面性。使用現成的 T2V 模型驗證顯示，VidCapBench 得分與 T2V 品質評估指標之間存在顯著的正相關，這表示 VidCapBench 可以為訓練 T2V 模型提供有價值的指導。專案可於 https://github.com/VidCapBench/VidCapBench 取得。
+
+##### **Portable Reward Tuning: Towards Reusable Fine-Tuning across Different Pretrained Models**
+2502.12776v1 by Daiki Chijiwa, Taku Hasegawa, Kyosuke Nishida, Kuniko Saito, Susumu Takeuchi
+
+While foundation models have been exploited for various expert tasks through
+fine-tuning, any foundation model will become outdated due to its old knowledge
+or limited capability. Thus, the underlying foundation model should
+eventually be replaced by a new one, which incurs the repeated cost of
+fine-tuning each new model. Existing work addresses this problem by inference-time
+tuning, i.e., modifying the output probabilities from the new foundation model
+with the outputs from the old foundation model and its fine-tuned model, which
+involves an additional overhead in inference by the latter two models. In this
+paper, we propose a new fine-tuning principle, Portable Reward Tuning (PRT),
+that inherently reduces the inference overhead, based on a reformulation
+of fine-tuning as reward maximization. Specifically, instead of fine-tuning
+parameters of the foundation models, PRT trains the reward model explicitly
+through the same loss function as in fine-tuning. During inference, the reward
+model can be used with any foundation model (with the same set of vocabularies
+or labels) through the formulation of reward maximization. Experimental
+results, covering both vision and language models, demonstrate that the
+PRT-trained model can achieve comparable accuracy to the existing work of
+inference-time tuning, with less inference cost.
+
+摘要：儘管基礎模型已透過微調用於各種專家任務，任何基礎模型都將因其舊知識或有限功能而過時。因此，基礎模型最終應由新模型取代，這導致重複微調這些新模型的成本。現有工作透過推論時間調整來解決這個問題，即使用舊基礎模型及其微調模型的輸出修改新基礎模型的輸出機率，這涉及後兩個模型在推論中的額外開銷。在本文中，我們提出一個新的微調原則，可攜式獎勵調整 (PRT)，它本質上會減少推論開銷，基於將微調重新表述為獎勵最大化。具體來說，PRT 不是微調基礎模型的參數，而是透過與微調中相同的損失函數明確訓練獎勵模型。在推論期間，獎勵模型可透過獎勵最大化的公式與任何基礎模型（具有相同的詞彙或標籤組）一起使用。涵蓋視覺和語言模型的實驗結果證明，PRT 訓練的模型可以達到與現有推論時間調整工作相當的準確度，且推論成本較低。
+
+##### **Mind the Gap: Aligning the Brain with Language Models Requires a Nonlinear and Multimodal Approach**
+2502.12771v1 by Danny Dongyeop Han, Yunju Cho, Jiook Cha, Jay-Yoon Lee
+
+Self-supervised language and audio models effectively predict brain responses
+to speech. However, traditional prediction models rely on linear mappings from
+unimodal features, despite the complex integration of auditory signals with
+linguistic and semantic information across widespread brain networks during
+speech comprehension. Here, we introduce a nonlinear, multimodal prediction
+model that combines audio and linguistic features from pre-trained models
+(e.g., LLAMA, Whisper). Our approach achieves a 17.2% and 17.9% improvement in
+prediction performance (unnormalized and normalized correlation) over
+traditional unimodal linear models, as well as a 7.7% and 14.4% improvement,
+respectively, over prior state-of-the-art models. These improvements represent
+a major step towards future robust in-silico testing and improved decoding
+performance. They also reveal how auditory and semantic information are fused
+in motor, somatosensory, and higher-level semantic regions, aligning with
+existing neurolinguistic theories. Overall, our work highlights the often
+neglected potential of nonlinear and multimodal approaches to brain modeling,
+paving the way for future studies to embrace these strategies in naturalistic
+neurolinguistics research.
+
+摘要：自我監督的語言和音訊模型有效預測大腦對語言的反應。然而，傳統的預測模型依賴於單模態特徵的線性映射，儘管在語言理解過程中，聽覺信號與語言和語義資訊在廣泛的腦網路中進行複雜的整合。在此，我們引入一個非線性、多模態預測模型，結合預先訓練模型（例如，LLAMA、Whisper）中的音訊和語言特徵。我們的做法在預測效能上（未正規化和正規化相關性）分別比傳統的單模態線性模型提升了 17.2% 和 17.9%，分別比先前的最先進模型提升了 7.7% 和 14.4%。這些改進代表了未來穩健的電腦模擬測試和改進的解碼效能邁出了一大步。它們也揭示了聽覺和語義資訊如何在運動、體感和更高層次的語義區域中融合，與現有的神經語言學理論一致。總的來說，我們的研究突出了非線性和多模態大腦建模方法經常被忽略的潛力，為未來研究在自然主義神經語言學研究中採用這些策略鋪平了道路。
+
+##### **How Much Do LLMs Hallucinate across Languages? On Multilingual Estimation of LLM Hallucination in the Wild**
+2502.12769v1 by Saad Obaid ul Islam, Anne Lauscher, Goran Glavaš
+
+In the age of misinformation, hallucination -- the tendency of Large Language
+Models (LLMs) to generate non-factual or unfaithful responses -- represents the
+main risk for their global utility. Despite LLMs becoming increasingly
+multilingual, the vast majority of research on detecting and quantifying LLM
+hallucination is (a) English-centric and (b) focused on machine translation (MT)
+and summarization, tasks that are less common "in the wild" than open
+information seeking. In contrast, we aim to quantify the extent of LLM
+hallucination across languages in knowledge-intensive long-form question
+answering. To this end, we train a multilingual hallucination detection model
+and conduct a large-scale study across 30 languages and 6 open-source LLM
+families. We start from an English hallucination detection dataset and rely on
+MT to generate (noisy) training data in other languages. We also manually
+annotate gold data for five high-resource languages; we then demonstrate, for
+these languages, that the estimates of hallucination rates are similar between
+silver (LLM-generated) and gold test sets, validating the use of silver data
+for estimating hallucination rates for other languages. For the final rate
+estimation, we build a knowledge-intensive QA dataset for 30 languages with
+LLM-generated prompts and Wikipedia articles as references. We find that, while
+LLMs generate longer responses with more hallucinated tokens for
+higher-resource languages, there is no correlation between length-normalized
+hallucination rates of languages and their digital representation. Further, we
+find that smaller LLMs exhibit larger hallucination rates than larger models.
+
+摘要：在錯誤訊息的時代，幻覺，即大型語言模型 (LLM) 產生非事實或不忠實回應的傾向，是其全球效用的主要風險。儘管 LLM 變得越來越多語言化，但絕大多數關於偵測和量化 LLM 幻覺的研究都是 (a) 以英語為中心，(b) 專注於機器翻譯 (MT) 和摘要，這些任務在「野外」中不如開放式資訊搜尋常見。相反地，我們旨在量化 LLM 在知識密集型長篇問答中跨語言的幻覺程度。為此，我們訓練了一個多語言幻覺偵測模型，並針對 30 種語言和 6 個開放原始碼 LLM 家族進行大規模研究。我們從一個英語幻覺偵測資料集開始，並依賴 MT 在其他語言中產生（有雜訊的）訓練資料。我們還手動為五種高資源語言註解黃金資料；然後我們證明，對於這些語言，幻覺率的估計值在白銀（LLM 產生）和黃金測試集之間是相似的，驗證了使用白銀資料來估計其他語言的幻覺率。對於最終的比率估計，我們建立了一個知識密集型問答資料集，其中包含 30 種語言，並以 LLM 產生的提示和維基百科文章作為參考。我們發現，儘管 LLM 為資源較多的語言產生了更長的回應和更多幻覺的權杖，但語言的長度正規化幻覺率與其數位表示之間沒有相關性。此外，我們發現較小的 LLM 表現出比較大的模型更大的幻覺率。
+
+##### **R2-KG: General-Purpose Dual-Agent Framework for Reliable Reasoning on Knowledge Graphs**
+2502.12767v1 by Sumin Jo, Junseong Choi, Jiho Kim, Edward Choi
+
+Recent studies have combined Large Language Models (LLMs) with Knowledge
+Graphs (KGs) to enhance reasoning, improving inference accuracy without
+additional training while mitigating hallucination. However, existing
+frameworks are often rigid, struggling to adapt to KG or task changes. They
+also rely heavily on powerful LLMs for reliable (i.e., trustworthy) reasoning.
+To address this, we introduce R2-KG, a plug-and-play, dual-agent framework that
+separates reasoning into two roles: an Operator (a low-capacity LLM) that
+gathers evidence and a Supervisor (a high-capacity LLM) that makes final
+judgments. This design is cost-efficient for LLM inference while still
+maintaining strong reasoning accuracy. Additionally, R2-KG employs an
+Abstention mechanism, generating answers only when sufficient evidence is
+collected from KG, which significantly enhances reliability. Experiments across
+multiple KG-based reasoning tasks show that R2-KG consistently outperforms
+baselines in both accuracy and reliability, regardless of the inherent
+capability of LLMs used as the Operator. Further experiments reveal that the
+single-agent version of R2-KG, equipped with a strict self-consistency
+strategy, achieves significantly higher-than-baseline reliability while
+reducing inference cost. However, it also leads to a higher abstention rate in
+complex KGs. Our findings establish R2-KG as a flexible and cost-effective
+solution for KG-based reasoning. It reduces reliance on high-capacity LLMs
+while ensuring trustworthy inference.
+
+摘要：最近的研究結合了大型語言模型 (LLM) 與知識圖譜 (KG) 以增強推理，在不額外訓練的情況下提高推理準確性，同時減輕幻覺。然而，現有的框架通常很僵化，難以適應知識圖譜或任務的變化。它們還嚴重依賴強大的 LLM 來進行可靠（即值得信賴）的推理。為了解決這個問題，我們引入了 R2-KG，這是一個即插即用的雙代理框架，它將推理分為兩個角色：一個收集證據的操作員（低容量 LLM）和一個做出最終判斷的監督員（高容量 LLM）。這種設計在 LLM 推理方面具有成本效益，同時仍保持強大的推理準確性。此外，R2-KG 採用棄權機制，僅在從知識圖譜收集到足夠證據時才生成答案，這顯著提高了可靠性。跨多個基於知識圖譜的推理任務的實驗表明，R2-KG 在準確性和可靠性方面始終優於基線，而與用作操作員的 LLM 的固有能力無關。進一步的實驗表明，R2-KG 的單代理版本配備了嚴格的自一致性策略，實現了明顯高於基線的可靠性，同時降低了推理成本。然而，它也導致了複雜知識圖譜中更高的棄權率。我們的發現將 R2-KG 確立為一種靈活且經濟高效的基於知識圖譜的推理解決方案。它減少了對高容量 LLM 的依賴，同時確保了可信的推理。
 
diff --git a/__pycache__/config.cpython-310.pyc b/__pycache__/config.cpython-310.pyc
index d9121af035..ac0457b4be 100644
Binary files a/__pycache__/config.cpython-310.pyc and b/__pycache__/config.cpython-310.pyc differ
diff --git a/__pycache__/util4translation.cpython-310.pyc b/__pycache__/util4translation.cpython-310.pyc
index 72b61359fa..75b636fe9f 100644
Binary files a/__pycache__/util4translation.cpython-310.pyc and b/__pycache__/util4translation.cpython-310.pyc differ
diff --git a/database/logs/runtime.log b/database/logs/runtime.log
index a34d673051..6c3bfde5c6 100644
--- a/database/logs/runtime.log
+++ b/database/logs/runtime.log
@@ -21407,3 +21407,7 @@ KeyError: 'paper_summary_zh'
 2025-02-19 09:05:53.322 | SUCCESS  | __main__:parse:267 - handle [2/4] | topic=`AI` subtopic=`Medical`
 2025-02-19 09:05:53.323 | SUCCESS  | __main__:parse:267 - handle [3/4] | topic=`AI` subtopic=`LLM`
 2025-02-19 09:05:53.336 | SUCCESS  | __main__:parse:267 - handle [4/4] | topic=`AI` subtopic=`Knowledge Graphs`
+2025-02-19 20:33:00.249 | SUCCESS  | __main__:parse:267 - handle [1/4] | topic=`AI` subtopic=`Medical explainable AI`
+2025-02-19 20:33:00.291 | SUCCESS  | __main__:parse:267 - handle [2/4] | topic=`AI` subtopic=`Medical`
+2025-02-19 20:33:07.452 | SUCCESS  | __main__:parse:267 - handle [3/4] | topic=`AI` subtopic=`Knowledge Graphs`
+2025-02-19 20:34:11.488 | SUCCESS  | __main__:parse:267 - handle [4/4] | topic=`AI` subtopic=`LLM`
diff --git a/database/storage/2502/03283/2502.03283v2.json b/database/storage/2502/03283/2502.03283v2.json
new file mode 100644
index 0000000000..13606dd288
--- /dev/null
+++ b/database/storage/2502/03283/2502.03283v2.json
@@ -0,0 +1 @@
+{"2502.03283": {"publish_time": "2025-02-05", "title": "SymAgent: A Neural-Symbolic Self-Learning Agent Framework for Complex Reasoning over Knowledge Graphs", "paper_summary": "Recent advancements have highlighted that Large Language Models (LLMs) are\nprone to hallucinations when solving complex reasoning problems, leading to\nerroneous results. To tackle this issue, researchers incorporate Knowledge\nGraphs (KGs) to improve the reasoning ability of LLMs. However, existing\nmethods face two limitations: 1) they typically assume that all answers to the\nquestions are contained in KGs, neglecting the incompleteness issue of KGs, and\n2) they treat the KG as a static repository and overlook the implicit logical\nreasoning structures inherent in KGs. In this paper, we introduce SymAgent, an\ninnovative neural-symbolic agent framework that achieves collaborative\naugmentation between KGs and LLMs. We conceptualize KGs as dynamic environments\nand transform complex reasoning tasks into a multi-step interactive process,\nenabling KGs to participate deeply in the reasoning process. SymAgent consists\nof two modules: Agent-Planner and Agent-Executor. The Agent-Planner leverages\nLLM's inductive reasoning capability to extract symbolic rules from KGs,\nguiding efficient question decomposition. The Agent-Executor autonomously\ninvokes predefined action tools to integrate information from KGs and external\ndocuments, addressing the issues of KG incompleteness. Furthermore, we design a\nself-learning framework comprising online exploration and offline iterative\npolicy updating phases, enabling the agent to automatically synthesize\nreasoning trajectories and improve performance. Experimental results\ndemonstrate that SymAgent with weak LLM backbones (i.e., 7B series) yields\nbetter or comparable performance compared to various strong baselines. 
Further\nanalysis reveals that our agent can identify missing triples, facilitating\nautomatic KG updates.", "paper_summary_zh": "<paragraph>\u6700\u8fd1\u7684\u9032\u5c55\u5f37\u8abf\u51fa\uff0c\u5927\u578b\u8a9e\u8a00\u6a21\u578b (LLM) \u5728\u89e3\u6c7a\u8907\u96dc\u63a8\u7406\u554f\u984c\u6642\u5bb9\u6613\u51fa\u73fe\u5e7b\u89ba\uff0c\u5c0e\u81f4\u932f\u8aa4\u7684\u7d50\u679c\u3002\u70ba\u4e86\u89e3\u6c7a\u9019\u500b\u554f\u984c\uff0c\u7814\u7a76\u4eba\u54e1\u7d50\u5408\u77e5\u8b58\u5716\u8b5c (KG) \u4f86\u6539\u5584 LLM \u7684\u63a8\u7406\u80fd\u529b\u3002\u7136\u800c\uff0c\u73fe\u6709\u65b9\u6cd5\u9762\u81e8\u5169\u500b\u9650\u5236\uff1a1) \u5b83\u5011\u901a\u5e38\u5047\u8a2d\u554f\u984c\u7684\u6240\u6709\u7b54\u6848\u90fd\u5305\u542b\u5728 KG \u4e2d\uff0c\u5ffd\u7565\u4e86 KG \u7684\u4e0d\u5b8c\u6574\u6027\u554f\u984c\uff0c\u4ee5\u53ca 2) \u5b83\u5011\u5c07 KG \u8996\u70ba\u4e00\u500b\u975c\u614b\u5132\u5b58\u5eab\uff0c\u800c\u5ffd\u7565\u4e86 KG \u4e2d\u56fa\u6709\u7684\u96b1\u5f0f\u908f\u8f2f\u63a8\u7406\u7d50\u69cb\u3002\u5728\u672c\u6587\u4e2d\uff0c\u6211\u5011\u4ecb\u7d39\u4e86 SymAgent\uff0c\u4e00\u500b\u5275\u65b0\u7684\u795e\u7d93\u7b26\u865f\u4ee3\u7406\u67b6\u69cb\uff0c\u5b83\u5728 KG \u548c LLM \u4e4b\u9593\u5be6\u73fe\u4e86\u5354\u4f5c\u64f4\u5145\u3002\u6211\u5011\u5c07 KG \u6982\u5ff5\u5316\u70ba\u52d5\u614b\u74b0\u5883\uff0c\u4e26\u5c07\u8907\u96dc\u7684\u63a8\u7406\u4efb\u52d9\u8f49\u5316\u70ba\u4e00\u500b\u591a\u6b65\u9a5f\u7684\u4e92\u52d5\u904e\u7a0b\uff0c\u4f7f KG \u80fd\u5920\u6df1\u5165\u53c3\u8207\u63a8\u7406\u904e\u7a0b\u3002SymAgent \u5305\u542b\u5169\u500b\u6a21\u7d44\uff1a\u4ee3\u7406\u898f\u5283\u5668\u548c\u4ee3\u7406\u57f7\u884c\u5668\u3002\u4ee3\u7406\u898f\u5283\u5668\u5229\u7528 LLM \u7684\u6b78\u7d0d\u63a8\u7406\u80fd\u529b\u5f9e KG 
\u4e2d\u63d0\u53d6\u7b26\u865f\u898f\u5247\uff0c\u6307\u5c0e\u6709\u6548\u7684\u554f\u984c\u5206\u89e3\u3002\u4ee3\u7406\u57f7\u884c\u5668\u81ea\u4e3b\u5730\u8abf\u7528\u9810\u5b9a\u7fa9\u7684\u52d5\u4f5c\u5de5\u5177\u4f86\u6574\u5408\u4f86\u81ea KG \u548c\u5916\u90e8\u6587\u4ef6\u7684\u8cc7\u8a0a\uff0c\u89e3\u6c7a KG \u4e0d\u5b8c\u6574\u6027\u7684\u554f\u984c\u3002\u6b64\u5916\uff0c\u6211\u5011\u8a2d\u8a08\u4e86\u4e00\u500b\u81ea\u5b78\u7fd2\u6846\u67b6\uff0c\u5305\u62ec\u7dda\u4e0a\u63a2\u7d22\u548c\u96e2\u7dda\u53cd\u8986\u7684\u653f\u7b56\u66f4\u65b0\u968e\u6bb5\uff0c\u4f7f\u4ee3\u7406\u80fd\u5920\u81ea\u52d5\u5408\u6210\u63a8\u7406\u8ecc\u8de1\u4e26\u6539\u5584\u6548\u80fd\u3002\u5be6\u9a57\u7d50\u679c\u8868\u660e\uff0c\u5177\u6709\u5f31 LLM \u4e3b\u5e79\u7684 SymAgent\uff08\u4f8b\u5982\uff0c7B \u7cfb\u5217\uff09\u8207\u5404\u7a2e\u5f37\u5927\u7684\u57fa\u7dda\u76f8\u6bd4\uff0c\u7522\u751f\u4e86\u66f4\u597d\u6216\u76f8\u7576\u7684\u6548\u80fd\u3002\u9032\u4e00\u6b65\u7684\u5206\u6790\u8868\u660e\uff0c\u6211\u5011\u7684\u4ee3\u7406\u53ef\u4ee5\u8b58\u5225\u907a\u5931\u7684\u4e09\u5143\u7d44\uff0c\u4fc3\u9032\u81ea\u52d5 KG \u66f4\u65b0\u3002</paragraph>", "author": "Ben Liu et.al.", "authors": "Ben Liu, Jihai Zhang, Fangquan Lin, Cheng Yang, Min Peng, Wotao Yin", "id": "2502.03283v2", "paper_url": "http://arxiv.org/abs/2502.03283v2", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/07158/2502.07158v2.json b/database/storage/2502/07158/2502.07158v2.json
new file mode 100644
index 0000000000..91aaef9e41
--- /dev/null
+++ b/database/storage/2502/07158/2502.07158v2.json
@@ -0,0 +1 @@
+{"2502.07158": {"publish_time": "2025-02-11", "title": "Early Risk Prediction of Pediatric Cardiac Arrest from Electronic Health Records via Multimodal Fused Transformer", "paper_summary": "Early prediction of pediatric cardiac arrest (CA) is critical for timely\nintervention in high-risk intensive care settings. We introduce PedCA-FT, a\nnovel transformer-based framework that fuses tabular view of EHR with the\nderived textual view of EHR to fully unleash the interactions of\nhigh-dimensional risk factors and their dynamics. By employing dedicated\ntransformer modules for each modality view, PedCA-FT captures complex temporal\nand contextual patterns to produce robust CA risk estimates. Evaluated on a\ncurated pediatric cohort from the CHOA-CICU database, our approach outperforms\nten other artificial intelligence models across five key performance metrics\nand identifies clinically meaningful risk factors. These findings underscore\nthe potential of multimodal fusion techniques to enhance early CA detection and\nimprove patient care.", "paper_summary_zh": "\u65e9\u671f\u9810\u6e2c\u5c0f\u5152\u5fc3\u81df\u9a5f\u505c (CA) \u5c0d\u65bc\u5728\u9ad8\u98a8\u96aa\u7684\u91cd\u75c7\u7167\u8b77\u74b0\u5883\u4e2d\u53ca\u6642\u4ecb\u5165\u81f3\u95dc\u91cd\u8981\u3002\u6211\u5011\u5f15\u5165\u4e86 PedCA-FT\uff0c\u4e00\u500b\u65b0\u7a4e\u7684\u57fa\u65bc\u8f49\u63db\u5668\u7684\u6846\u67b6\uff0c\u5b83\u5c07 EHR \u7684\u8868\u683c\u8996\u5716\u8207 EHR \u7684\u6d3e\u751f\u6587\u672c\u8996\u5716\u878d\u5408\u5728\u4e00\u8d77\uff0c\u4ee5\u5145\u5206\u767c\u63ee\u9ad8\u7dad\u98a8\u96aa\u56e0\u7d20\u53ca\u5176\u52d5\u614b\u7684\u4ea4\u4e92\u4f5c\u7528\u3002\u901a\u904e\u70ba\u6bcf\u500b\u6a21\u614b\u8996\u5716\u63a1\u7528\u5c08\u7528\u7684\u8f49\u63db\u5668\u6a21\u7d44\uff0cPedCA-FT \u6355\u7372\u8907\u96dc\u7684\u6642\u9593\u548c\u4e0a\u4e0b\u6587\u6a21\u5f0f\uff0c\u4ee5\u7522\u751f\u7a69\u5065\u7684 CA \u98a8\u96aa\u4f30\u8a08\u3002\u5728 CHOA-CICU 
\u8cc7\u6599\u5eab\u4e2d\u7b56\u5283\u7684\u5c0f\u5152\u7fa4\u9ad4\u4e2d\u9032\u884c\u8a55\u4f30\uff0c\u6211\u5011\u7684\u505a\u6cd5\u5728\u4e94\u9805\u95dc\u9375\u7e3e\u6548\u6307\u6a19\u4e2d\u512a\u65bc\u5176\u4ed6\u5341\u7a2e\u4eba\u5de5\u667a\u6167\u6a21\u578b\uff0c\u4e26\u627e\u51fa\u81e8\u5e8a\u4e0a\u6709\u610f\u7fa9\u7684\u98a8\u96aa\u56e0\u7d20\u3002\u9019\u4e9b\u767c\u73fe\u5f37\u8abf\u4e86\u591a\u6a21\u5f0f\u878d\u5408\u6280\u8853\u5728\u589e\u5f37\u65e9\u671f CA \u6aa2\u6e2c\u548c\u6539\u5584\u60a3\u8005\u7167\u8b77\u65b9\u9762\u7684\u6f5b\u529b\u3002", "author": "Jiaying Lu et.al.", "authors": "Jiaying Lu, Stephanie R. Brown, Songyuan Liu, Shifan Zhao, Kejun Dong, Del Bold, Michael Fundora, Alaa Aljiffry, Alex Fedorov, Jocelyn Grunwell, Xiao Hu", "id": "2502.07158v2", "paper_url": "http://arxiv.org/abs/2502.07158v2", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12167/2502.12167v1.json b/database/storage/2502/12167/2502.12167v1.json
new file mode 100644
index 0000000000..2e5b0ba093
--- /dev/null
+++ b/database/storage/2502/12167/2502.12167v1.json
@@ -0,0 +1 @@
+{"2502.12167": {"publish_time": "2025-02-13", "title": "TastepepAI, An artificial intelligence platform for taste peptide de novo design", "paper_summary": "Taste peptides have emerged as promising natural flavoring agents attributed\nto their unique organoleptic properties, high safety profile, and potential\nhealth benefits. However, the de novo identification of taste peptides derived\nfrom animal, plant, or microbial sources remains a time-consuming and\nresource-intensive process, significantly impeding their widespread application\nin the food industry. Here, we present TastePepAI, a comprehensive artificial\nintelligence framework for customized taste peptide design and safety\nassessment. As the key element of this framework, a loss-supervised adaptive\nvariational autoencoder (LA-VAE) is implemented to efficiently optimizes the\nlatent representation of sequences during training and facilitates the\ngeneration of target peptides with desired taste profiles. Notably, our model\nincorporates a novel taste-avoidance mechanism, allowing for selective flavor\nexclusion. Subsequently, our in-house developed toxicity prediction algorithm\n(SpepToxPred) is integrated in the framework to undergo rigorous safety\nevaluation of generated peptides. Using this integrated platform, we\nsuccessfully identified 73 peptides exhibiting sweet, salty, and umami,\nsignificantly expanding the current repertoire of taste peptides. 
This work\ndemonstrates the potential of TastePepAI in accelerating taste peptide\ndiscovery for food applications and provides a versatile framework adaptable to\nbroader peptide engineering challenges.", "paper_summary_zh": "\u5473\u89c9\u80bd\u56e0\u5176\u72ec\u7279\u7684\u611f\u5b98\u7279\u6027\u3001\u9ad8\u5b89\u5168\u6027\u6982\u51b5\u548c\u6f5c\u5728\u7684\u5065\u5eb7\u76ca\u5904\u800c\u6210\u4e3a\u6709\u524d\u9014\u7684\u5929\u7136\u8c03\u5473\u5242\u3002\u7136\u800c\uff0c\u4ece\u52a8\u7269\u3001\u690d\u7269\u6216\u5fae\u751f\u7269\u6765\u6e90\u4e2d\u4ece\u5934\u9274\u5b9a\u5473\u89c9\u80bd\u4ecd\u7136\u662f\u4e00\u4e2a\u8017\u65f6\u4e14\u8d44\u6e90\u5bc6\u96c6\u7684\u8fc7\u7a0b\uff0c\u4e25\u91cd\u963b\u788d\u4e86\u5b83\u4eec\u5728\u98df\u54c1\u5de5\u4e1a\u4e2d\u7684\u5e7f\u6cdb\u5e94\u7528\u3002\u5728\u6b64\uff0c\u6211\u4eec\u63d0\u51fa\u4e86 TastePepAI\uff0c\u8fd9\u662f\u4e00\u4e2a\u7528\u4e8e\u5b9a\u5236\u5473\u89c9\u80bd\u8bbe\u8ba1\u548c\u5b89\u5168\u6027\u8bc4\u4f30\u7684\u7efc\u5408\u4eba\u5de5\u667a\u80fd\u6846\u67b6\u3002\u4f5c\u4e3a\u8be5\u6846\u67b6\u7684\u5173\u952e\u5143\u7d20\uff0c\u5b9e\u73b0\u4e86\u635f\u5931\u76d1\u7763\u81ea\u9002\u5e94\u53d8\u5206\u81ea\u52a8\u7f16\u7801\u5668 (LA-VAE)\uff0c\u4ee5\u5728\u8bad\u7ec3\u671f\u95f4\u6709\u6548\u4f18\u5316\u5e8f\u5217\u7684\u6f5c\u5728\u8868\u793a\uff0c\u5e76\u4fc3\u8fdb\u751f\u6210\u5177\u6709\u6240\u9700\u5473\u89c9\u7279\u5f81\u7684\u76ee\u6807\u80bd\u3002\u503c\u5f97\u6ce8\u610f\u7684\u662f\uff0c\u6211\u4eec\u7684\u6a21\u578b\u5305\u542b\u4e86\u4e00\u79cd\u65b0\u9896\u7684\u5473\u89c9\u56de\u907f\u673a\u5236\uff0c\u5141\u8bb8\u9009\u62e9\u6027\u6392\u9664\u98ce\u5473\u3002\u968f\u540e\uff0c\u6211\u4eec\u5185\u90e8\u5f00\u53d1\u7684\u6bd2\u6027\u9884\u6d4b\u7b97\u6cd5 (SpepToxPred) 
\u88ab\u96c6\u6210\u5230\u6846\u67b6\u4e2d\uff0c\u4ee5\u5bf9\u751f\u6210\u7684\u80bd\u8fdb\u884c\u4e25\u683c\u7684\u5b89\u5168\u8bc4\u4f30\u3002\u4f7f\u7528\u8fd9\u4e2a\u96c6\u6210\u5e73\u53f0\uff0c\u6211\u4eec\u6210\u529f\u5730\u9274\u5b9a\u4e86 73 \u79cd\u8868\u73b0\u51fa\u751c\u5473\u3001\u54b8\u5473\u548c\u9c9c\u5473\u7684\u80bd\uff0c\u6781\u5927\u5730\u6269\u5c55\u4e86\u5f53\u524d\u7684\u5473\u89c9\u80bd\u5e93\u3002\u8fd9\u9879\u5de5\u4f5c\u5c55\u793a\u4e86 TastePepAI \u5728\u52a0\u901f\u5473\u89c9\u80bd\u53d1\u73b0\u4ee5\u7528\u4e8e\u98df\u54c1\u5e94\u7528\u65b9\u9762\u7684\u6f5c\u529b\uff0c\u5e76\u63d0\u4f9b\u4e86\u4e00\u4e2a\u9002\u7528\u4e8e\u66f4\u5e7f\u6cdb\u7684\u80bd\u5de5\u7a0b\u6311\u6218\u7684\u591a\u529f\u80fd\u6846\u67b6\u3002", "author": "Jianda Yue et.al.", "authors": "Jianda Yue, Tingting Li, Jian Ouyang, Jiawei Xu, Hua Tan, Zihui Chen, Changsheng Han, Huanyu Li, Songping Liang, Zhonghua Liu, Zhonghua Liu, Ying Wang", "id": "2502.12167v1", "paper_url": "http://arxiv.org/abs/2502.12167v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12181/2502.12181v1.json b/database/storage/2502/12181/2502.12181v1.json
new file mode 100644
index 0000000000..f1ee861915
--- /dev/null
+++ b/database/storage/2502/12181/2502.12181v1.json
@@ -0,0 +1 @@
+{"2502.12181": {"publish_time": "2025-02-14", "title": "3D ReX: Causal Explanations in 3D Neuroimaging Classification", "paper_summary": "Explainability remains a significant problem for AI models in medical\nimaging, making it challenging for clinicians to trust AI-driven predictions.\nWe introduce 3D ReX, the first causality-based post-hoc explainability tool for\n3D models. 3D ReX uses the theory of actual causality to generate\nresponsibility maps which highlight the regions most crucial to the model's\ndecision. We test 3D ReX on a stroke detection model, providing insight into\nthe spatial distribution of features relevant to stroke.", "paper_summary_zh": "\u89e3\u91cb\u6027\u4ecd\u7136\u662f\u91ab\u7642\u5f71\u50cf\u4e2d AI \u6a21\u578b\u7684\u4e00\u5927\u554f\u984c\uff0c\u9019\u4f7f\u5f97\u81e8\u5e8a\u91ab\u751f\u96e3\u4ee5\u4fe1\u4efb AI \u9a45\u52d5\u7684\u9810\u6e2c\u3002\n\u6211\u5011\u5f15\u5165\u4e86 3D ReX\uff0c\u9019\u662f\u7b2c\u4e00\u500b\u7528\u65bc 3D \u6a21\u578b\u7684\u57fa\u65bc\u56e0\u679c\u95dc\u4fc2\u7684\u4e8b\u5f8c\u89e3\u91cb\u6027\u5de5\u5177\u30023D ReX \u4f7f\u7528\u5be6\u969b\u56e0\u679c\u95dc\u4fc2\u7406\u8ad6\u4f86\u751f\u6210\u8cac\u4efb\u5716\uff0c\u8a72\u5716\u7a81\u51fa\u4e86\u5c0d\u6a21\u578b\u6c7a\u7b56\u81f3\u95dc\u91cd\u8981\u7684\u5340\u57df\u3002\u6211\u5011\u5728\u4e2d\u98a8\u6aa2\u6e2c\u6a21\u578b\u4e0a\u6e2c\u8a66\u4e86 3D ReX\uff0c\u63d0\u4f9b\u4e86\u8207\u4e2d\u98a8\u76f8\u95dc\u7279\u5fb5\u7684\u7a7a\u9593\u5206\u4f48\u7684\u898b\u89e3\u3002", "author": "Melane Navaratnarajah et.al.", "authors": "Melane Navaratnarajah, Sophie A. Martin, David A. Kelly, Nathan Blake, Hana Chocker", "id": "2502.12181v1", "paper_url": "http://arxiv.org/abs/2502.12181v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12204/2502.12204v1.json b/database/storage/2502/12204/2502.12204v1.json
new file mode 100644
index 0000000000..106922a693
--- /dev/null
+++ b/database/storage/2502/12204/2502.12204v1.json
@@ -0,0 +1 @@
+{"2502.12204": {"publish_time": "2025-02-16", "title": "Predicting Depression in Screening Interviews from Interactive Multi-Theme Collaboration", "paper_summary": "Automatic depression detection provides cues for early clinical intervention\nby clinicians. Clinical interviews for depression detection involve dialogues\ncentered around multiple themes. Existing studies primarily design end-to-end\nneural network models to capture the hierarchical structure of clinical\ninterview dialogues. However, these methods exhibit defects in modeling the\nthematic content of clinical interviews: 1) they fail to capture intra-theme\nand inter-theme correlation explicitly, and 2) they do not allow clinicians to\nintervene and focus on themes of interest. To address these issues, this paper\nintroduces an interactive depression detection framework. This framework\nleverages in-context learning techniques to identify themes in clinical\ninterviews and then models both intra-theme and inter-theme correlation.\nAdditionally, it employs AI-driven feedback to simulate the interests of\nclinicians, enabling interactive adjustment of theme importance. 
PDIMC achieves\nabsolute improvements of 35\\% and 12\\% compared to the state-of-the-art on the\ndepression detection dataset DAIC-WOZ, which demonstrates the effectiveness of\nmodeling theme correlation and incorporating interactive external feedback.", "paper_summary_zh": "\u81ea\u52d5\u6182\u9b31\u75c7\u5075\u6e2c\u63d0\u4f9b\u81e8\u5e8a\u91ab\u5e2b\u65e9\u671f\u81e8\u5e8a\u4ecb\u5165\u7684\u7dda\u7d22\u3002\u6182\u9b31\u75c7\u5075\u6e2c\u7684\u81e8\u5e8a\u8a2a\u8ac7\u6d89\u53ca\u4ee5\u591a\u500b\u4e3b\u984c\u70ba\u4e2d\u5fc3\u7684\u5c0d\u8a71\u3002\u73fe\u6709\u7814\u7a76\u4e3b\u8981\u8a2d\u8a08\u7aef\u5c0d\u7aef\u7684\u985e\u795e\u7d93\u7db2\u8def\u6a21\u578b\u4f86\u6355\u6349\u81e8\u5e8a\u8a2a\u8ac7\u5c0d\u8a71\u7684\u968e\u5c64\u7d50\u69cb\u3002\u7136\u800c\uff0c\u9019\u4e9b\u65b9\u6cd5\u5728\u5efa\u6a21\u81e8\u5e8a\u8a2a\u8ac7\u7684\u4e3b\u984c\u5167\u5bb9\u6642\u8868\u73fe\u51fa\u7f3a\u9677\uff1a1\uff09\u5b83\u5011\u7121\u6cd5\u660e\u78ba\u6355\u6349\u4e3b\u984c\u5167\u548c\u4e3b\u984c\u9593\u7684\u95dc\u806f\u6027\uff0c\u4ee5\u53ca 2\uff09\u5b83\u5011\u4e0d\u5141\u8a31\u81e8\u5e8a\u91ab\u5e2b\u4ecb\u5165\u4e26\u5c08\u6ce8\u65bc\u611f\u8208\u8da3\u7684\u4e3b\u984c\u3002\u70ba\u4e86\u89e3\u6c7a\u9019\u4e9b\u554f\u984c\uff0c\u672c\u6587\u4ecb\u7d39\u4e86\u4e00\u500b\u4e92\u52d5\u5f0f\u6182\u9b31\u75c7\u5075\u6e2c\u6846\u67b6\u3002\u6b64\u6846\u67b6\u5229\u7528\u60c5\u5883\u5b78\u7fd2\u6280\u8853\u4f86\u8b58\u5225\u81e8\u5e8a\u8a2a\u8ac7\u4e2d\u7684\u4e3b\u984c\uff0c\u7136\u5f8c\u5c0d\u4e3b\u984c\u5167\u548c\u4e3b\u984c\u9593\u7684\u95dc\u806f\u6027\u9032\u884c\u5efa\u6a21\u3002\u6b64\u5916\uff0c\u5b83\u63a1\u7528 AI \u9a45\u52d5\u7684\u56de\u994b\u4f86\u6a21\u64ec\u81e8\u5e8a\u91ab\u5e2b\u7684\u8208\u8da3\uff0c\u5be6\u73fe\u4e3b\u984c\u91cd\u8981\u6027\u7684\u4e92\u52d5\u5f0f\u8abf\u6574\u3002\u8207 DAIC-WOZ \u6182\u9b31\u75c7\u5075\u6e2c\u8cc7\u6599\u96c6\u4e0a\u7684\u6700\u65b0\u6280\u8853\u76f8\u6bd4\uff0cPDIMC 
\u7684\u7d55\u5c0d\u6539\u9032\u7387\u5206\u5225\u70ba 35% \u548c 12%\uff0c\u9019\u8b49\u660e\u4e86\u5c0d\u4e3b\u984c\u95dc\u806f\u6027\u5efa\u6a21\u548c\u7d0d\u5165\u4e92\u52d5\u5f0f\u5916\u90e8\u56de\u994b\u7684\u6709\u6548\u6027\u3002", "author": "Xianbing Zhao et.al.", "authors": "Xianbing Zhao, Yiqing Lyu, Di Wang, Buzhou Tang", "id": "2502.12204v1", "paper_url": "http://arxiv.org/abs/2502.12204v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12362/2502.12362v1.json b/database/storage/2502/12362/2502.12362v1.json
new file mode 100644
index 0000000000..598f567c36
--- /dev/null
+++ b/database/storage/2502/12362/2502.12362v1.json
@@ -0,0 +1 @@
+{"2502.12362": {"publish_time": "2025-02-17", "title": "Classifiers of Data Sharing Statements in Clinical Trial Records", "paper_summary": "Digital individual participant data (IPD) from clinical trials are\nincreasingly distributed for potential scientific reuse. The identification of\navailable IPD, however, requires interpretations of textual data-sharing\nstatements (DSS) in large databases. Recent advancements in computational\nlinguistics include pre-trained language models that promise to simplify the\nimplementation of effective classifiers based on textual inputs. In a subset of\n5,000 textual DSS from ClinicalTrials.gov, we evaluate how well classifiers\nbased on domain-specific pre-trained language models reproduce original\navailability categories as well as manually annotated labels. Typical metrics\nindicate that classifiers that predicted manual annotations outperformed those\nthat learned to output the original availability categories. This suggests that\nthe textual DSS descriptions contain applicable information that the\navailability categories do not, and that such classifiers could thus aid the\nautomatic identification of available IPD in large trial databases.", "paper_summary_zh": "\u81e8\u5e8a\u8a66\u9a57\u7684\u6578\u4f4d\u500b\u4eba\u53c3\u8207\u8005\u8cc7\u6599 (IPD) \u6108\u4f86\u6108\u5ee3\u6cdb\u5730\u7528\u65bc\u6f5b\u5728\u7684\u79d1\u5b78\u518d\u5229\u7528\u3002\u7136\u800c\uff0c\u8981\u627e\u51fa\u53ef\u7528\u7684 IPD\uff0c\u9700\u8981\u5c0d\u5927\u578b\u8cc7\u6599\u5eab\u4e2d\u7684\u6587\u5b57\u8cc7\u6599\u5171\u4eab\u8072\u660e (DSS) \u9032\u884c\u8a6e\u91cb\u3002\u8a08\u7b97\u8a9e\u8a00\u5b78\u6700\u8fd1\u7684\u9032\u5c55\u5305\u62ec\u9810\u5148\u8a13\u7df4\u7684\u8a9e\u8a00\u6a21\u578b\uff0c\u6709\u671b\u7c21\u5316\u6839\u64da\u6587\u5b57\u8f38\u5165\u5be6\u4f5c\u6709\u6548\u5206\u985e\u5668\u7684\u904e\u7a0b\u3002\u5728 ClinicalTrials.gov \u4e2d\u7684 5,000 \u500b\u6587\u5b57 DSS 
\u5b50\u96c6\u4e2d\uff0c\u6211\u5011\u8a55\u4f30\u4e86\u57fa\u65bc\u7279\u5b9a\u9818\u57df\u9810\u5148\u8a13\u7df4\u8a9e\u8a00\u6a21\u578b\u7684\u5206\u985e\u5668\uff0c\u5728\u91cd\u73fe\u539f\u59cb\u53ef\u7528\u6027\u985e\u5225\u4ee5\u53ca\u624b\u52d5\u8a3b\u89e3\u6a19\u7c64\u65b9\u9762\u7684\u8868\u73fe\u3002\u5178\u578b\u7684\u6307\u6a19\u986f\u793a\uff0c\u9810\u6e2c\u624b\u52d5\u8a3b\u89e3\u7684\u5206\u985e\u5668\u512a\u65bc\u5b78\u6703\u8f38\u51fa\u539f\u59cb\u53ef\u7528\u6027\u985e\u5225\u7684\u5206\u985e\u5668\u3002\u9019\u8868\u793a\u6587\u5b57 DSS \u8aaa\u660e\u5305\u542b\u53ef\u7528\u6027\u985e\u5225\u6240\u6c92\u6709\u7684\u9069\u7528\u8cc7\u8a0a\uff0c\u800c\u4e14\u6b64\u985e\u5206\u985e\u5668\u56e0\u6b64\u6709\u52a9\u65bc\u5728\u5927\u578b\u8a66\u9a57\u8cc7\u6599\u5eab\u4e2d\u81ea\u52d5\u627e\u51fa\u53ef\u7528\u7684 IPD\u3002", "author": "Saber Jelodari Mamaghani et.al.", "authors": "Saber Jelodari Mamaghani, Cosima Strantz, Dennis Toddenroth", "id": "2502.12362v1", "paper_url": "http://arxiv.org/abs/2502.12362v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12552/2502.12552v1.json b/database/storage/2502/12552/2502.12552v1.json
new file mode 100644
index 0000000000..a886edbe6b
--- /dev/null
+++ b/database/storage/2502/12552/2502.12552v1.json
@@ -0,0 +1 @@
+{"2502.12552": {"publish_time": "2025-02-18", "title": "LLM Safety for Children", "paper_summary": "This paper analyzes the safety of Large Language Models (LLMs) in\ninteractions with children below age of 18 years. Despite the transformative\napplications of LLMs in various aspects of children's lives such as education\nand therapy, there remains a significant gap in understanding and mitigating\npotential content harms specific to this demographic. The study acknowledges\nthe diverse nature of children often overlooked by standard safety evaluations\nand proposes a comprehensive approach to evaluating LLM safety specifically for\nchildren. We list down potential risks that children may encounter when using\nLLM powered applications. Additionally we develop Child User Models that\nreflect the varied personalities and interests of children informed by\nliterature in child care and psychology. These user models aim to bridge the\nexisting gap in child safety literature across various fields. We utilize Child\nUser Models to evaluate the safety of six state of the art LLMs. 
Our\nobservations reveal significant safety gaps in LLMs particularly in categories\nharmful to children but not adults", "paper_summary_zh": "\u672c\u6587\u5206\u6790\u4e86\u5927\u578b\u8a9e\u8a00\u6a21\u578b (LLM) \u5728\u8207 18 \u6b72\u4ee5\u4e0b\u5152\u7ae5\u4e92\u52d5\u6642\u7684\u5b89\u5168\u6027\u3002\u5118\u7ba1 LLM \u5728\u5152\u7ae5\u751f\u6d3b\u7684\u5404\u500b\u65b9\u9762\uff08\u4f8b\u5982\u6559\u80b2\u548c\u6cbb\u7642\uff09\u90fd\u6709\u8f49\u8b8a\u6027\u7684\u61c9\u7528\uff0c\u4f46\u5728\u4e86\u89e3\u548c\u6e1b\u8f15\u5c0d\u9019\u500b\u7fa4\u9ad4\u5177\u9ad4\u7684\u6f5b\u5728\u5167\u5bb9\u5371\u5bb3\u65b9\u9762\u4ecd\u7136\u5b58\u5728\u986f\u8457\u5dee\u8ddd\u3002\u7814\u7a76\u627f\u8a8d\u5152\u7ae5\u7684\u591a\u6a23\u6027\uff0c\u800c\u6a19\u6e96\u5b89\u5168\u8a55\u4f30\u901a\u5e38\u6703\u5ffd\u7565\u9019\u4e9b\u591a\u6a23\u6027\uff0c\u4e26\u63d0\u51fa\u4e86\u4e00\u7a2e\u91dd\u5c0d\u5152\u7ae5\u8a55\u4f30 LLM \u5b89\u5168\u6027\u7684\u7d9c\u5408\u65b9\u6cd5\u3002\u6211\u5011\u5217\u51fa\u4e86\u5152\u7ae5\u5728\u4f7f\u7528\u7531 LLM \u63d0\u4f9b\u52d5\u529b\u7684\u61c9\u7528\u7a0b\u5f0f\u6642\u53ef\u80fd\u9047\u5230\u7684\u6f5b\u5728\u98a8\u96aa\u3002\u6b64\u5916\uff0c\u6211\u5011\u958b\u767c\u4e86\u5152\u7ae5\u4f7f\u7528\u8005\u6a21\u578b\uff0c\u9019\u4e9b\u6a21\u578b\u53cd\u6620\u4e86\u5152\u7ae5\u4e0d\u540c\u7684\u500b\u6027\u7279\u8cea\u548c\u8208\u8da3\uff0c\u4e26\u53c3\u8003\u4e86\u5152\u7ae5\u7167\u8b77\u548c\u5fc3\u7406\u5b78\u7684\u6587\u737b\u3002\u9019\u4e9b\u4f7f\u7528\u8005\u6a21\u578b\u65e8\u5728\u5f4c\u5408\u4e0d\u540c\u9818\u57df\u5152\u7ae5\u5b89\u5168\u6587\u737b\u4e2d\u73fe\u6709\u7684\u5dee\u8ddd\u3002\u6211\u5011\u5229\u7528\u5152\u7ae5\u4f7f\u7528\u8005\u6a21\u578b\u4f86\u8a55\u4f30\u516d\u500b\u6700\u5148\u9032\u7684 LLM \u7684\u5b89\u5168\u6027\u3002\u6211\u5011\u7684\u89c0\u5bdf\u7d50\u679c\u63ed\u793a\u4e86 LLM 
\u4e2d\u7684\u91cd\u5927\u5b89\u5168\u6f0f\u6d1e\uff0c\u7279\u5225\u662f\u5728\u5c0d\u5152\u7ae5\u6709\u5bb3\u4f46\u5c0d\u6210\u5e74\u4eba\u7121\u5bb3\u7684\u985e\u5225\u4e2d", "author": "Prasanjit Rath et.al.", "authors": "Prasanjit Rath, Hari Shrawgi, Parag Agrawal, Sandipan Dandapat", "id": "2502.12552v1", "paper_url": "http://arxiv.org/abs/2502.12552v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12586/2502.12586v1.json b/database/storage/2502/12586/2502.12586v1.json
new file mode 100644
index 0000000000..6447f7fce7
--- /dev/null
+++ b/database/storage/2502/12586/2502.12586v1.json
@@ -0,0 +1 @@
+{"2502.12586": {"publish_time": "2025-02-18", "title": "G-Refer: Graph Retrieval-Augmented Large Language Model for Explainable Recommendation", "paper_summary": "Explainable recommendation has demonstrated significant advantages in\ninforming users about the logic behind recommendations, thereby increasing\nsystem transparency, effectiveness, and trustworthiness. To provide\npersonalized and interpretable explanations, existing works often combine the\ngeneration capabilities of large language models (LLMs) with collaborative\nfiltering (CF) information. CF information extracted from the user-item\ninteraction graph captures the user behaviors and preferences, which is crucial\nfor providing informative explanations. However, due to the complexity of graph\nstructure, effectively extracting the CF information from graphs still remains\na challenge. Moreover, existing methods often struggle with the integration of\nextracted CF information with LLMs due to its implicit representation and the\nmodality gap between graph structures and natural language explanations. To\naddress these challenges, we propose G-Refer, a framework using graph\nretrieval-augmented large language models (LLMs) for explainable\nrecommendation. Specifically, we first employ a hybrid graph retrieval\nmechanism to retrieve explicit CF signals from both structural and semantic\nperspectives. The retrieved CF information is explicitly formulated as\nhuman-understandable text by the proposed graph translation and accounts for\nthe explanations generated by LLMs. To bridge the modality gap, we introduce\nknowledge pruning and retrieval-augmented fine-tuning to enhance the ability of\nLLMs to process and utilize the retrieved CF information to generate\nexplanations. Extensive experiments show that G-Refer achieves superior\nperformance compared with existing methods in both explainability and\nstability. 
Codes and data are available at https://github.com/Yuhan1i/G-Refer.", "paper_summary_zh": "\u53ef\u89e3\u91cb\u5efa\u8b70\u5df2\u8b49\u660e\u5728\u544a\u77e5\u4f7f\u7528\u8005\u5efa\u8b70\u80cc\u5f8c\u7684\u908f\u8f2f\u65b9\u9762\u5177\u6709\u986f\u8457\u512a\u9ede\uff0c\u5f9e\u800c\u63d0\u9ad8\u7cfb\u7d71\u900f\u660e\u5ea6\u3001\u6709\u6548\u6027\u548c\u53ef\u4fe1\u5ea6\u3002\u70ba\u4e86\u63d0\u4f9b\u500b\u4eba\u5316\u4e14\u53ef\u89e3\u91cb\u7684\u8aaa\u660e\uff0c\u73fe\u6709\u4f5c\u54c1\u901a\u5e38\u7d50\u5408\u5927\u578b\u8a9e\u8a00\u6a21\u578b (LLM) \u7684\u751f\u6210\u80fd\u529b\u8207\u5354\u540c\u904e\u6ffe (CF) \u8cc7\u8a0a\u3002\u5f9e\u4f7f\u7528\u8005\u9805\u76ee\u4e92\u52d5\u5716\u5f62\u4e2d\u63d0\u53d6\u7684 CF \u8cc7\u8a0a\u6703\u64f7\u53d6\u4f7f\u7528\u8005\u884c\u70ba\u548c\u504f\u597d\uff0c\u9019\u5c0d\u65bc\u63d0\u4f9b\u8cc7\u8a0a\u6027\u8aaa\u660e\u81f3\u95dc\u91cd\u8981\u3002\u7136\u800c\uff0c\u7531\u65bc\u5716\u5f62\u7d50\u69cb\u7684\u8907\u96dc\u6027\uff0c\u5f9e\u5716\u5f62\u4e2d\u6709\u6548\u63d0\u53d6 CF \u8cc7\u8a0a\u4ecd\u7136\u662f\u4e00\u500b\u6311\u6230\u3002\u6b64\u5916\uff0c\u73fe\u6709\u65b9\u6cd5\u901a\u5e38\u96e3\u4ee5\u5c07\u63d0\u53d6\u7684 CF \u8cc7\u8a0a\u8207 LLM \u6574\u5408\uff0c\u56e0\u70ba\u5176\u96b1\u542b\u8868\u793a\u548c\u5716\u5f62\u7d50\u69cb\u8207\u81ea\u7136\u8a9e\u8a00\u8aaa\u660e\u4e4b\u9593\u7684\u6a21\u5f0f\u5dee\u8ddd\u3002\u70ba\u4e86\u61c9\u5c0d\u9019\u4e9b\u6311\u6230\uff0c\u6211\u5011\u63d0\u51fa G-Refer\uff0c\u4e00\u500b\u4f7f\u7528\u5716\u5f62\u6aa2\u7d22\u589e\u5f37\u578b\u5927\u578b\u8a9e\u8a00\u6a21\u578b (LLM) \u7684\u53ef\u89e3\u91cb\u5efa\u8b70\u67b6\u69cb\u3002\u5177\u9ad4\u4f86\u8aaa\uff0c\u6211\u5011\u9996\u5148\u63a1\u7528\u6df7\u5408\u5716\u5f62\u6aa2\u7d22\u6a5f\u5236\uff0c\u5f9e\u7d50\u69cb\u548c\u8a9e\u7fa9\u89d2\u5ea6\u6aa2\u7d22\u660e\u78ba\u7684 CF \u8a0a\u865f\u3002\u6aa2\u7d22\u5230\u7684 CF 
\u8cc7\u8a0a\u7531\u5efa\u8b70\u7684\u5716\u5f62\u7ffb\u8b6f\u660e\u78ba\u8868\u8ff0\u70ba\u4eba\u985e\u53ef\u4ee5\u7406\u89e3\u7684\u6587\u5b57\uff0c\u4e26\u8aaa\u660e LLM \u751f\u6210\u7684\u89e3\u91cb\u3002\u70ba\u4e86\u5f4c\u5408\u6a21\u5f0f\u5dee\u8ddd\uff0c\u6211\u5011\u5f15\u5165\u4e86\u77e5\u8b58\u4fee\u526a\u548c\u6aa2\u7d22\u589e\u5f37\u5fae\u8abf\uff0c\u4ee5\u589e\u5f37 LLM \u8655\u7406\u548c\u5229\u7528\u6aa2\u7d22\u5230\u7684 CF \u8cc7\u8a0a\u4ee5\u7522\u751f\u89e3\u91cb\u7684\u80fd\u529b\u3002\u5ee3\u6cdb\u7684\u5be6\u9a57\u8868\u660e\uff0c\u8207\u73fe\u6709\u65b9\u6cd5\u76f8\u6bd4\uff0cG-Refer \u5728\u53ef\u89e3\u91cb\u6027\u548c\u7a69\u5b9a\u6027\u65b9\u9762\u90fd\u53d6\u5f97\u4e86\u5353\u8d8a\u7684\u6548\u80fd\u3002\u7a0b\u5f0f\u78bc\u548c\u8cc7\u6599\u53ef\u5728 https://github.com/Yuhan1i/G-Refer \u53d6\u5f97\u3002", "author": "Yuhan Li et.al.", "authors": "Yuhan Li, Xinni Zhang, Linhao Luo, Heng Chang, Yuxiang Ren, Irwin King, Jia Li", "id": "2502.12586v1", "paper_url": "http://arxiv.org/abs/2502.12586v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12669/2502.12669v1.json b/database/storage/2502/12669/2502.12669v1.json
new file mode 100644
index 0000000000..ed1c83189f
--- /dev/null
+++ b/database/storage/2502/12669/2502.12669v1.json
@@ -0,0 +1 @@
+{"2502.12669": {"publish_time": "2025-02-18", "title": "Perovskite-LLM: Knowledge-Enhanced Large Language Models for Perovskite Solar Cell Research", "paper_summary": "The rapid advancement of perovskite solar cells (PSCs) has led to an\nexponential growth in research publications, creating an urgent need for\nefficient knowledge management and reasoning systems in this domain. We present\na comprehensive knowledge-enhanced system for PSCs that integrates three key\ncomponents. First, we develop Perovskite-KG, a domain-specific knowledge graph\nconstructed from 1,517 research papers, containing 23,789 entities and 22,272\nrelationships. Second, we create two complementary datasets: Perovskite-Chat,\ncomprising 55,101 high-quality question-answer pairs generated through a novel\nmulti-agent framework, and Perovskite-Reasoning, containing 2,217 carefully\ncurated materials science problems. Third, we introduce two specialized large\nlanguage models: Perovskite-Chat-LLM for domain-specific knowledge assistance\nand Perovskite-Reasoning-LLM for scientific reasoning tasks. 
Experimental\nresults demonstrate that our system significantly outperforms existing models\nin both domain-specific knowledge retrieval and scientific reasoning tasks,\nproviding researchers with effective tools for literature review, experimental\ndesign, and complex problem-solving in PSC research.", "paper_summary_zh": "\u7531\u65bc perovskite \u592a\u967d\u80fd\u96fb\u6c60 (PSC) \u5feb\u901f\u9032\u5c55\uff0c\u5c0e\u81f4\u7814\u7a76\u51fa\u7248\u7269\u5448\u6307\u6578\u6210\u9577\uff0c\u8feb\u5207\u9700\u8981\u5728\u9019\u9818\u57df\u5efa\u7acb\u6709\u6548\u7684\u77e5\u8b58\u7ba1\u7406\u548c\u63a8\u7406\u7cfb\u7d71\u3002\u6211\u5011\u63d0\u51fa\u4e00\u500b\u7d50\u5408\u4e09\u9805\u95dc\u9375\u5143\u4ef6\u7684 PSC \u5168\u9762\u77e5\u8b58\u589e\u5f37\u7cfb\u7d71\u3002\u9996\u5148\uff0c\u6211\u5011\u958b\u767c\u51fa Perovskite-KG\uff0c\u4e00\u500b\u7531 1,517 \u7bc7\u7814\u7a76\u8ad6\u6587\u5efa\u69cb\u800c\u6210\u3001\u5305\u542b 23,789 \u500b\u5be6\u9ad4\u548c 22,272 \u500b\u95dc\u4fc2\u7684\u9818\u57df\u7279\u5b9a\u77e5\u8b58\u5716\u8b5c\u3002\u5176\u6b21\uff0c\u6211\u5011\u5efa\u7acb\u5169\u500b\u4e92\u88dc\u7684\u8cc7\u6599\u96c6\uff1aPerovskite-Chat\uff0c\u5305\u542b\u900f\u904e\u4e00\u500b\u65b0\u7a4e\u7684\u591a\u4ee3\u7406\u67b6\u69cb\u7522\u751f 55,101 \u500b\u9ad8\u54c1\u8cea\u554f\u7b54\u914d\u5c0d\uff1b\u4ee5\u53ca Perovskite-Reasoning\uff0c\u5305\u542b 2,217 \u500b\u4ed4\u7d30\u7b56\u5c55\u7684\u6750\u6599\u79d1\u5b78\u554f\u984c\u3002\u7b2c\u4e09\uff0c\u6211\u5011\u63a8\u51fa\u5169\u500b\u5c08\u9580\u5316\u5927\u578b\u8a9e\u8a00\u6a21\u578b\uff1a\u91dd\u5c0d\u9818\u57df\u7279\u5b9a\u77e5\u8b58\u5354\u52a9\u7684 Perovskite-Chat-LLM\uff0c\u4ee5\u53ca\u91dd\u5c0d\u79d1\u5b78\u63a8\u7406\u4efb\u52d9\u7684 
Perovskite-Reasoning-LLM\u3002\u5be6\u9a57\u7d50\u679c\u986f\u793a\uff0c\u6211\u5011\u7684\u7cfb\u7d71\u5728\u9818\u57df\u7279\u5b9a\u77e5\u8b58\u64f7\u53d6\u548c\u79d1\u5b78\u63a8\u7406\u4efb\u52d9\u4e0a\u90fd\u660e\u986f\u512a\u65bc\u73fe\u6709\u6a21\u578b\uff0c\u70ba\u7814\u7a76\u4eba\u54e1\u63d0\u4f9b\u6709\u6548\u7684\u5de5\u5177\uff0c\u7528\u65bc PSC \u7814\u7a76\u4e2d\u7684\u6587\u737b\u56de\u9867\u3001\u5be6\u9a57\u8a2d\u8a08\u548c\u8907\u96dc\u554f\u984c\u89e3\u6c7a\u3002", "author": "Xiang Liu et.al.", "authors": "Xiang Liu, Penglei Sun, Shuyan Chen, Longhan Zhang, Peijie Dong, Huajie You, Yongqi Zhang, Chang Yan, Xiaowen Chu, Tong-yi Zhang", "id": "2502.12669v1", "paper_url": "http://arxiv.org/abs/2502.12669v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12767/2502.12767v1.json b/database/storage/2502/12767/2502.12767v1.json
new file mode 100644
index 0000000000..b64f120c5f
--- /dev/null
+++ b/database/storage/2502/12767/2502.12767v1.json
@@ -0,0 +1 @@
+{"2502.12767": {"publish_time": "2025-02-18", "title": "R2-KG: General-Purpose Dual-Agent Framework for Reliable Reasoning on Knowledge Graphs", "paper_summary": "Recent studies have combined Large Language Models (LLMs) with Knowledge\nGraphs (KGs) to enhance reasoning, improving inference accuracy without\nadditional training while mitigating hallucination. However, existing\nframeworks are often rigid, struggling to adapt to KG or task changes. They\nalso rely heavily on powerful LLMs for reliable (i.e., trustworthy) reasoning.\nTo address this, We introduce R2-KG, a plug-and-play, dual-agent framework that\nseparates reasoning into two roles: an Operator (a low-capacity LLM) that\ngathers evidence and a Supervisor (a high-capacity LLM) that makes final\njudgments. This design is cost-efficient for LLM inference while still\nmaintaining strong reasoning accuracy. Additionally, R2-KG employs an\nAbstention mechanism, generating answers only when sufficient evidence is\ncollected from KG, which significantly enhances reliability. Experiments across\nmultiple KG-based reasoning tasks show that R2-KG consistently outperforms\nbaselines in both accuracy and reliability, regardless of the inherent\ncapability of LLMs used as the Operator. Further experiments reveal that the\nsingle-agent version of R2-KG, equipped with a strict self-consistency\nstrategy, achieves significantly higher-than-baseline reliability while\nreducing inference cost. However, it also leads to a higher abstention rate in\ncomplex KGs. Our findings establish R2-KG as a flexible and cost-effective\nsolution for KG-based reasoning. 
It reduces reliance on high-capacity LLMs\nwhile ensuring trustworthy inference.", "paper_summary_zh": "<paragraph>\u6700\u8fd1\u7684\u7814\u7a76\u7ed3\u5408\u4e86\u5927\u578b\u8bed\u8a00\u6a21\u578b (LLM) \u4e0e\u77e5\u8bc6\u56fe\u8c31 (KG) \u4ee5\u589e\u5f3a\u63a8\u7406\uff0c\u5728\u4e0d\u989d\u5916\u8bad\u7ec3\u7684\u60c5\u51b5\u4e0b\u63d0\u9ad8\u63a8\u7406\u51c6\u786e\u6027\uff0c\u540c\u65f6\u51cf\u8f7b\u5e7b\u89c9\u3002\u7136\u800c\uff0c\u73b0\u6709\u7684\u6846\u67b6\u901a\u5e38\u5f88\u50f5\u5316\uff0c\u96be\u4ee5\u9002\u5e94\u77e5\u8bc6\u56fe\u8c31\u6216\u4efb\u52a1\u7684\u53d8\u5316\u3002\u5b83\u4eec\u8fd8\u4e25\u91cd\u4f9d\u8d56\u5f3a\u5927\u7684 LLM \u6765\u8fdb\u884c\u53ef\u9760\uff08\u5373\u503c\u5f97\u4fe1\u8d56\uff09\u7684\u63a8\u7406\u3002\u4e3a\u4e86\u89e3\u51b3\u8fd9\u4e2a\u95ee\u9898\uff0c\u6211\u4eec\u5f15\u5165\u4e86 R2-KG\uff0c\u8fd9\u662f\u4e00\u4e2a\u5373\u63d2\u5373\u7528\u3001\u53cc\u4ee3\u7406\u6846\u67b6\uff0c\u5b83\u5c06\u63a8\u7406\u5206\u4e3a\u4e24\u4e2a\u89d2\u8272\uff1a\u4e00\u4e2a\u6536\u96c6\u8bc1\u636e\u7684\u64cd\u4f5c\u5458\uff08\u4f4e\u5bb9\u91cf LLM\uff09\u548c\u4e00\u4e2a\u505a\u51fa\u6700\u7ec8\u5224\u65ad\u7684\u76d1\u7763\u5458\uff08\u9ad8\u5bb9\u91cf LLM\uff09\u3002\u8fd9\u79cd\u8bbe\u8ba1\u5728 LLM \u63a8\u7406\u65b9\u9762\u5177\u6709\u6210\u672c\u6548\u76ca\uff0c\u540c\u65f6\u4ecd\u4fdd\u6301\u5f3a\u5927\u7684\u63a8\u7406\u51c6\u786e\u6027\u3002\u6b64\u5916\uff0cR2-KG \u91c7\u7528\u5f03\u6743\u673a\u5236\uff0c\u4ec5\u5728\u4ece\u77e5\u8bc6\u56fe\u8c31\u6536\u96c6\u5230\u8db3\u591f\u8bc1\u636e\u65f6\u624d\u751f\u6210\u7b54\u6848\uff0c\u8fd9\u663e\u8457\u63d0\u9ad8\u4e86\u53ef\u9760\u6027\u3002\u8de8\u591a\u4e2a\u57fa\u4e8e\u77e5\u8bc6\u56fe\u8c31\u7684\u63a8\u7406\u4efb\u52a1\u7684\u5b9e\u9a8c\u8868\u660e\uff0cR2-KG \u5728\u51c6\u786e\u6027\u548c\u53ef\u9760\u6027\u65b9\u9762\u59cb\u7ec8\u4f18\u4e8e\u57fa\u7ebf\uff0c\u800c\u4e0e\u7528\u4f5c\u64cd\u4f5c\u5458\u7684 LLM 
\u7684\u56fa\u6709\u80fd\u529b\u65e0\u5173\u3002\u8fdb\u4e00\u6b65\u7684\u5b9e\u9a8c\u8868\u660e\uff0cR2-KG \u7684\u5355\u4ee3\u7406\u7248\u672c\u914d\u5907\u4e86\u4e25\u683c\u7684\u81ea\u4e00\u81f4\u6027\u7b56\u7565\uff0c\u5b9e\u73b0\u4e86\u660e\u663e\u9ad8\u4e8e\u57fa\u7ebf\u7684\u53ef\u9760\u6027\uff0c\u540c\u65f6\u964d\u4f4e\u4e86\u63a8\u7406\u6210\u672c\u3002\u7136\u800c\uff0c\u5b83\u4e5f\u5bfc\u81f4\u4e86\u590d\u6742\u77e5\u8bc6\u56fe\u8c31\u4e2d\u66f4\u9ad8\u7684\u5f03\u6743\u7387\u3002\u6211\u4eec\u7684\u53d1\u73b0\u5c06 R2-KG \u786e\u7acb\u4e3a\u4e00\u79cd\u7075\u6d3b\u4e14\u7ecf\u6d4e\u9ad8\u6548\u7684\u57fa\u4e8e\u77e5\u8bc6\u56fe\u8c31\u7684\u63a8\u7406\u89e3\u51b3\u65b9\u6848\u3002\u5b83\u51cf\u5c11\u4e86\u5bf9\u9ad8\u5bb9\u91cf LLM \u7684\u4f9d\u8d56\uff0c\u540c\u65f6\u786e\u4fdd\u4e86\u53ef\u4fe1\u7684\u63a8\u7406\u3002</paragraph>", "author": "Sumin Jo et.al.", "authors": "Sumin Jo, Junseong Choi, Jiho Kim, Edward Choi", "id": "2502.12767v1", "paper_url": "http://arxiv.org/abs/2502.12767v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12769/2502.12769v1.json b/database/storage/2502/12769/2502.12769v1.json
new file mode 100644
index 0000000000..dba153b222
--- /dev/null
+++ b/database/storage/2502/12769/2502.12769v1.json
@@ -0,0 +1 @@
+{"2502.12769": {"publish_time": "2025-02-18", "title": "How Much Do LLMs Hallucinate across Languages? On Multilingual Estimation of LLM Hallucination in the Wild", "paper_summary": "In the age of misinformation, hallucination -- the tendency of Large Language\nModels (LLMs) to generate non-factual or unfaithful responses -- represents the\nmain risk for their global utility. Despite LLMs becoming increasingly\nmultilingual, the vast majority of research on detecting and quantifying LLM\nhallucination are (a) English-centric and (b) focus on machine translation (MT)\nand summarization, tasks that are less common ``in the wild'' than open\ninformation seeking. In contrast, we aim to quantify the extent of LLM\nhallucination across languages in knowledge-intensive long-form question\nanswering. To this end, we train a multilingual hallucination detection model\nand conduct a large-scale study across 30 languages and 6 open-source LLM\nfamilies. We start from an English hallucination detection dataset and rely on\nMT to generate (noisy) training data in other languages. We also manually\nannotate gold data for five high-resource languages; we then demonstrate, for\nthese languages, that the estimates of hallucination rates are similar between\nsilver (LLM-generated) and gold test sets, validating the use of silver data\nfor estimating hallucination rates for other languages. For the final rates\nestimation, we build a knowledge-intensive QA dataset for 30 languages with\nLLM-generated prompts and Wikipedia articles as references. We find that, while\nLLMs generate longer responses with more hallucinated tokens for\nhigher-resource languages, there is no correlation between length-normalized\nhallucination rates of languages and their digital representation. 
Further, we\nfind that smaller LLMs exhibit larger hallucination rates than larger models.", "paper_summary_zh": "<paragraph>\u5728\u9519\u8bef\u8a0a\u606f\u7684\u6642\u4ee3\uff0c\u5e7b\u89ba\u2014\u2014\u5927\u578b\u8a9e\u8a00\u6a21\u578b (LLM) \u7522\u751f\u975e\u4e8b\u5be6\u6216\u4e0d\u5fe0\u5be6\u56de\u61c9\u7684\u50be\u5411\u2014\u2014\u4ee3\u8868\u5176\u5168\u7403\u6548\u7528\u7684\u4e3b\u8981\u98a8\u96aa\u3002\u5118\u7ba1 LLM \u8b8a\u5f97\u8d8a\u4f86\u8d8a\u591a\u5143\u5316\uff0c\u4f46\u7d55\u5927\u591a\u6578\u95dc\u65bc\u5075\u6e2c\u548c\u91cf\u5316 LLM \u5e7b\u89ba\u7684\u7814\u7a76\u90fd\u662f (a) \u4ee5\u82f1\u8a9e\u70ba\u4e2d\u5fc3\uff0c(b) \u5c08\u6ce8\u65bc\u6a5f\u5668\u7ffb\u8b6f (MT) \u548c\u6458\u8981\uff0c\u9019\u4e9b\u4efb\u52d9\u5728\u300c\u91ce\u5916\u300d\u4e2d\u4e0d\u5982\u958b\u653e\u5f0f\u8cc7\u8a0a\u641c\u5c0b\u5e38\u898b\u3002\u76f8\u53cd\u5730\uff0c\u6211\u5011\u65e8\u5728\u91cf\u5316 LLM \u5728\u77e5\u8b58\u5bc6\u96c6\u578b\u9577\u7bc7\u554f\u7b54\u4e2d\u8de8\u8a9e\u8a00\u7684\u5e7b\u89ba\u7a0b\u5ea6\u3002\u70ba\u6b64\uff0c\u6211\u5011\u8a13\u7df4\u4e86\u4e00\u500b\u591a\u8a9e\u8a00\u5e7b\u89ba\u5075\u6e2c\u6a21\u578b\uff0c\u4e26\u91dd\u5c0d 30 \u7a2e\u8a9e\u8a00\u548c 6 \u500b\u958b\u653e\u539f\u59cb\u78bc LLM \u5bb6\u65cf\u9032\u884c\u5927\u898f\u6a21\u7814\u7a76\u3002\u6211\u5011\u5f9e\u4e00\u500b\u82f1\u8a9e\u5e7b\u89ba\u5075\u6e2c\u8cc7\u6599\u96c6\u958b\u59cb\uff0c\u4e26\u4f9d\u8cf4 MT \u5728\u5176\u4ed6\u8a9e\u8a00\u4e2d\u7522\u751f\uff08\u6709\u96dc\u8a0a\u7684\uff09\u8a13\u7df4\u8cc7\u6599\u3002\u6211\u5011\u9084\u624b\u52d5\u70ba\u4e94\u7a2e\u9ad8\u8cc7\u6e90\u8a9e\u8a00\u8a3b\u89e3\u9ec3\u91d1\u8cc7\u6599\uff1b\u7136\u5f8c\u6211\u5011\u8b49\u660e\uff0c\u5c0d\u65bc\u9019\u4e9b\u8a9e\u8a00\uff0c\u5e7b\u89ba\u7387\u7684\u4f30\u8a08\u503c\u5728\u767d\u9280\uff08LLM 
\u7522\u751f\uff09\u548c\u9ec3\u91d1\u6e2c\u8a66\u96c6\u4e4b\u9593\u662f\u76f8\u4f3c\u7684\uff0c\u9a57\u8b49\u4e86\u4f7f\u7528\u767d\u9280\u8cc7\u6599\u4f86\u4f30\u8a08\u5176\u4ed6\u8a9e\u8a00\u7684\u5e7b\u89ba\u7387\u3002\u5c0d\u65bc\u6700\u7d42\u7684\u6bd4\u7387\u4f30\u8a08\uff0c\u6211\u5011\u5efa\u7acb\u4e86\u4e00\u500b\u77e5\u8b58\u5bc6\u96c6\u578b\u554f\u7b54\u8cc7\u6599\u96c6\uff0c\u5176\u4e2d\u5305\u542b 30 \u7a2e\u8a9e\u8a00\uff0c\u4e26\u4ee5 LLM \u7522\u751f\u7684\u63d0\u793a\u548c\u7dad\u57fa\u767e\u79d1\u6587\u7ae0\u4f5c\u70ba\u53c3\u8003\u3002\u6211\u5011\u767c\u73fe\uff0c\u5118\u7ba1 LLM \u70ba\u8cc7\u6e90\u8f03\u591a\u7684\u8a9e\u8a00\u7522\u751f\u4e86\u66f4\u9577\u7684\u56de\u61c9\u548c\u66f4\u591a\u5e7b\u89ba\u7684\u4ee3\u5e63\uff0c\u4f46\u8a9e\u8a00\u7684\u9577\u5ea6\u6b63\u898f\u5316\u5e7b\u89ba\u7387\u8207\u5176\u6578\u4f4d\u8868\u793a\u4e4b\u9593\u6c92\u6709\u76f8\u95dc\u6027\u3002\u6b64\u5916\uff0c\u6211\u5011\u767c\u73fe\u8f03\u5c0f\u7684 LLM \u8868\u73fe\u51fa\u6bd4\u8f03\u5927\u7684\u6a21\u578b\u66f4\u5927\u7684\u5e7b\u89ba\u7387\u3002</paragraph>", "author": "Saad Obaid ul Islam et.al.", "authors": "Saad Obaid ul Islam, Anne Lauscher, Goran Glava\u0161", "id": "2502.12769v1", "paper_url": "http://arxiv.org/abs/2502.12769v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12771/2502.12771v1.json b/database/storage/2502/12771/2502.12771v1.json
new file mode 100644
index 0000000000..fa16d91cd9
--- /dev/null
+++ b/database/storage/2502/12771/2502.12771v1.json
@@ -0,0 +1 @@
+{"2502.12771": {"publish_time": "2025-02-18", "title": "Mind the Gap: Aligning the Brain with Language Models Requires a Nonlinear and Multimodal Approach", "paper_summary": "Self-supervised language and audio models effectively predict brain responses\nto speech. However, traditional prediction models rely on linear mappings from\nunimodal features, despite the complex integration of auditory signals with\nlinguistic and semantic information across widespread brain networks during\nspeech comprehension. Here, we introduce a nonlinear, multimodal prediction\nmodel that combines audio and linguistic features from pre-trained models\n(e.g., LLAMA, Whisper). Our approach achieves a 17.2% and 17.9% improvement in\nprediction performance (unnormalized and normalized correlation) over\ntraditional unimodal linear models, as well as a 7.7% and 14.4% improvement,\nrespectively, over prior state-of-the-art models. These improvements represent\na major step towards future robust in-silico testing and improved decoding\nperformance. They also reveal how auditory and semantic information are fused\nin motor, somatosensory, and higher-level semantic regions, aligning with\nexisting neurolinguistic theories. 
Overall, our work highlights the often\nneglected potential of nonlinear and multimodal approaches to brain modeling,\npaving the way for future studies to embrace these strategies in naturalistic\nneurolinguistics research.", "paper_summary_zh": "\u81ea\u6211\u76e3\u7763\u7684\u8a9e\u8a00\u548c\u97f3\u8a0a\u6a21\u578b\u6709\u6548\u9810\u6e2c\u5927\u8166\u5c0d\u8a9e\u8a00\u7684\u53cd\u61c9\u3002\u7136\u800c\uff0c\u50b3\u7d71\u7684\u9810\u6e2c\u6a21\u578b\u4f9d\u8cf4\u65bc\u55ae\u6a21\u614b\u7279\u5fb5\u7684\u7dda\u6027\u6620\u5c04\uff0c\u5118\u7ba1\u5728\u8a9e\u8a00\u7406\u89e3\u904e\u7a0b\u4e2d\uff0c\u807d\u89ba\u4fe1\u865f\u8207\u8a9e\u8a00\u548c\u8a9e\u7fa9\u8cc7\u8a0a\u5728\u5ee3\u6cdb\u7684\u8166\u7db2\u8def\u4e2d\u9032\u884c\u8907\u96dc\u7684\u6574\u5408\u3002\u5728\u6b64\uff0c\u6211\u5011\u5f15\u5165\u4e00\u500b\u975e\u7dda\u6027\u3001\u591a\u6a21\u614b\u9810\u6e2c\u6a21\u578b\uff0c\u7d50\u5408\u9810\u5148\u8a13\u7df4\u6a21\u578b\uff08\u4f8b\u5982\uff0cLLAMA\u3001Whisper\uff09\u4e2d\u7684\u97f3\u8a0a\u548c\u8a9e\u8a00\u7279\u5fb5\u3002\u6211\u5011\u7684\u505a\u6cd5\u5728\u9810\u6e2c\u6548\u80fd\u4e0a\uff08\u672a\u6b63\u898f\u5316\u548c\u6b63\u898f\u5316\u76f8\u95dc\u6027\uff09\u5206\u5225\u6bd4\u50b3\u7d71\u7684\u55ae\u6a21\u614b\u7dda\u6027\u6a21\u578b\u63d0\u5347\u4e86 17.2% \u548c 17.9%\uff0c\u5206\u5225\u6bd4\u5148\u524d\u7684\u6700\u5148\u9032\u6a21\u578b\u63d0\u5347\u4e86 7.7% \u548c 
14.4%\u3002\u9019\u4e9b\u6539\u9032\u4ee3\u8868\u4e86\u672a\u4f86\u7a69\u5065\u7684\u96fb\u8166\u6a21\u64ec\u6e2c\u8a66\u548c\u6539\u9032\u7684\u89e3\u78bc\u6548\u80fd\u9081\u51fa\u4e86\u4e00\u5927\u6b65\u3002\u5b83\u5011\u4e5f\u63ed\u793a\u4e86\u807d\u89ba\u548c\u8a9e\u7fa9\u8cc7\u8a0a\u5982\u4f55\u5728\u904b\u52d5\u3001\u9ad4\u611f\u548c\u66f4\u9ad8\u5c64\u6b21\u7684\u8a9e\u7fa9\u5340\u57df\u4e2d\u878d\u5408\uff0c\u8207\u73fe\u6709\u7684\u795e\u7d93\u8a9e\u8a00\u5b78\u7406\u8ad6\u4e00\u81f4\u3002\u7e3d\u7684\u4f86\u8aaa\uff0c\u6211\u5011\u7684\u7814\u7a76\u7a81\u51fa\u4e86\u975e\u7dda\u6027\u548c\u591a\u6a21\u614b\u5927\u8166\u5efa\u6a21\u65b9\u6cd5\u7d93\u5e38\u88ab\u5ffd\u7565\u7684\u6f5b\u529b\uff0c\u70ba\u672a\u4f86\u7814\u7a76\u5728\u81ea\u7136\u4e3b\u7fa9\u795e\u7d93\u8a9e\u8a00\u5b78\u7814\u7a76\u4e2d\u63a1\u7528\u9019\u4e9b\u7b56\u7565\u92ea\u5e73\u4e86\u9053\u8def\u3002", "author": "Danny Dongyeop Han et.al.", "authors": "Danny Dongyeop Han, Yunju Cho, Jiook Cha, Jay-Yoon Lee", "id": "2502.12771v1", "paper_url": "http://arxiv.org/abs/2502.12771v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12776/2502.12776v1.json b/database/storage/2502/12776/2502.12776v1.json
new file mode 100644
index 0000000000..bdc568693e
--- /dev/null
+++ b/database/storage/2502/12776/2502.12776v1.json
@@ -0,0 +1 @@
+{"2502.12776": {"publish_time": "2025-02-18", "title": "Portable Reward Tuning: Towards Reusable Fine-Tuning across Different Pretrained Models", "paper_summary": "While foundation models have been exploited for various expert tasks through\nfine-tuning, any foundation model will become outdated due to its old knowledge\nor limited capability. Thus the underlying foundation model should be\neventually replaced by new ones, which leads to repeated cost of fine-tuning\nthese new models. Existing work addresses this problem by inference-time\ntuning, i.e., modifying the output probabilities from the new foundation model\nwith the outputs from the old foundation model and its fine-tuned model, which\ninvolves an additional overhead in inference by the latter two models. In this\npaper, we propose a new fine-tuning principle, Portable Reward Tuning (PRT),\nthat reduces the inference overhead by its nature, based on the reformulation\nof fine-tuning as the reward maximization. Specifically, instead of fine-tuning\nparameters of the foundation models, PRT trains the reward model explicitly\nthrough the same loss function as in fine-tuning. During inference, the reward\nmodel can be used with any foundation model (with the same set of vocabularies\nor labels) through the formulation of reward maximization. 
Experimental\nresults, covering both vision and language models, demonstrate that the\nPRT-trained model can achieve comparable accuracy to the existing work of\ninference-time tuning, with less inference cost.", "paper_summary_zh": "\u5118\u7ba1\u57fa\u790e\u6a21\u578b\u5df2\u900f\u904e\u5fae\u8abf\u7528\u65bc\u5404\u7a2e\u5c08\u5bb6\u4efb\u52d9\uff0c\u4efb\u4f55\u57fa\u790e\u6a21\u578b\u90fd\u5c07\u56e0\u5176\u820a\u77e5\u8b58\u6216\u6709\u9650\u529f\u80fd\u800c\u904e\u6642\u3002\u56e0\u6b64\uff0c\u57fa\u790e\u6a21\u578b\u6700\u7d42\u61c9\u7531\u65b0\u6a21\u578b\u53d6\u4ee3\uff0c\u9019\u5c0e\u81f4\u91cd\u8907\u5fae\u8abf\u9019\u4e9b\u65b0\u6a21\u578b\u7684\u6210\u672c\u3002\u73fe\u6709\u5de5\u4f5c\u900f\u904e\u63a8\u8ad6\u6642\u9593\u8abf\u6574\u4f86\u89e3\u6c7a\u9019\u500b\u554f\u984c\uff0c\u5373\u4f7f\u7528\u820a\u57fa\u790e\u6a21\u578b\u53ca\u5176\u5fae\u8abf\u6a21\u578b\u7684\u8f38\u51fa\u4fee\u6539\u65b0\u57fa\u790e\u6a21\u578b\u7684\u8f38\u51fa\u6a5f\u7387\uff0c\u9019\u6d89\u53ca\u5f8c\u5169\u500b\u6a21\u578b\u5728\u63a8\u8ad6\u4e2d\u7684\u984d\u5916\u958b\u92b7\u3002\u5728\u672c\u6587\u4e2d\uff0c\u6211\u5011\u63d0\u51fa\u4e00\u500b\u65b0\u7684\u5fae\u8abf\u539f\u5247\uff0c\u53ef\u651c\u5f0f\u734e\u52f5\u8abf\u6574 (PRT)\uff0c\u5b83\u672c\u8cea\u4e0a\u6703\u6e1b\u5c11\u63a8\u8ad6\u958b\u92b7\uff0c\u57fa\u65bc\u5c07\u5fae\u8abf\u91cd\u65b0\u8868\u8ff0\u70ba\u734e\u52f5\u6700\u5927\u5316\u3002\u5177\u9ad4\u4f86\u8aaa\uff0cPRT 
\u4e0d\u662f\u5fae\u8abf\u57fa\u790e\u6a21\u578b\u7684\u53c3\u6578\uff0c\u800c\u662f\u900f\u904e\u8207\u5fae\u8abf\u4e2d\u76f8\u540c\u7684\u640d\u5931\u51fd\u6578\u660e\u78ba\u8a13\u7df4\u734e\u52f5\u6a21\u578b\u3002\u5728\u63a8\u8ad6\u671f\u9593\uff0c\u734e\u52f5\u6a21\u578b\u53ef\u900f\u904e\u734e\u52f5\u6700\u5927\u5316\u7684\u516c\u5f0f\u8207\u4efb\u4f55\u57fa\u790e\u6a21\u578b\uff08\u5177\u6709\u76f8\u540c\u7684\u8a5e\u5f59\u6216\u6a19\u7c64\u7d44\uff09\u4e00\u8d77\u4f7f\u7528\u3002\u6db5\u84cb\u8996\u89ba\u548c\u8a9e\u8a00\u6a21\u578b\u7684\u5be6\u9a57\u7d50\u679c\u8b49\u660e\uff0cPRT \u8a13\u7df4\u7684\u6a21\u578b\u53ef\u4ee5\u9054\u5230\u8207\u73fe\u6709\u63a8\u8ad6\u6642\u9593\u8abf\u6574\u5de5\u4f5c\u76f8\u7576\u7684\u6e96\u78ba\u5ea6\uff0c\u4e14\u63a8\u8ad6\u6210\u672c\u8f03\u4f4e\u3002", "author": "Daiki Chijiwa et.al.", "authors": "Daiki Chijiwa, Taku Hasegawa, Kyosuke Nishida, Kuniko Saito, Susumu Takeuchi", "id": "2502.12776v1", "paper_url": "http://arxiv.org/abs/2502.12776v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12782/2502.12782v1.json b/database/storage/2502/12782/2502.12782v1.json
new file mode 100644
index 0000000000..dd0739f9d6
--- /dev/null
+++ b/database/storage/2502/12782/2502.12782v1.json
@@ -0,0 +1 @@
+{"2502.12782": {"publish_time": "2025-02-18", "title": "VidCapBench: A Comprehensive Benchmark of Video Captioning for Controllable Text-to-Video Generation", "paper_summary": "The training of controllable text-to-video (T2V) models relies heavily on the\nalignment between videos and captions, yet little existing research connects\nvideo caption evaluation with T2V generation assessment. This paper introduces\nVidCapBench, a video caption evaluation scheme specifically designed for T2V\ngeneration, agnostic to any particular caption format. VidCapBench employs a\ndata annotation pipeline, combining expert model labeling and human refinement,\nto associate each collected video with key information spanning video\naesthetics, content, motion, and physical laws. VidCapBench then partitions\nthese key information attributes into automatically assessable and manually\nassessable subsets, catering to both the rapid evaluation needs of agile\ndevelopment and the accuracy requirements of thorough validation. By evaluating\nnumerous state-of-the-art captioning models, we demonstrate the superior\nstability and comprehensiveness of VidCapBench compared to existing video\ncaptioning evaluation approaches. Verification with off-the-shelf T2V models\nreveals a significant positive correlation between scores on VidCapBench and\nthe T2V quality evaluation metrics, indicating that VidCapBench can provide\nvaluable guidance for training T2V models. 
The project is available at\nhttps://github.com/VidCapBench/VidCapBench.", "paper_summary_zh": "\u53ef\u63a7\u5236\u6587\u672c\u5230\u5f71\u7247 (T2V) \u6a21\u578b\u7684\u8a13\u7df4\u6975\u5ea6\u4ef0\u8cf4\u5f71\u7247\u548c\u5b57\u5e55\u4e4b\u9593\u7684\u5c0d\u9f4a\uff0c\u4f46\u73fe\u6709\u7814\u7a76\u9bae\u5c11\u5c07\u5f71\u7247\u5b57\u5e55\u8a55\u4f30\u8207 T2V \u751f\u6210\u8a55\u4f30\u9023\u7d50\u8d77\u4f86\u3002\u672c\u6587\u4ecb\u7d39 VidCapBench\uff0c\u9019\u662f\u4e00\u7a2e\u5c08\u9580\u70ba T2V \u751f\u6210\u8a2d\u8a08\u7684\u5f71\u7247\u5b57\u5e55\u8a55\u4f30\u67b6\u69cb\uff0c\u8207\u4efb\u4f55\u7279\u5b9a\u7684\u5b57\u5e55\u683c\u5f0f\u7121\u95dc\u3002VidCapBench \u63a1\u7528\u8cc7\u6599\u6a19\u8a3b\u6d41\u7a0b\uff0c\u7d50\u5408\u5c08\u5bb6\u6a21\u578b\u6a19\u8a18\u548c\u4eba\u5de5\u5fae\u8abf\uff0c\u5c07\u6bcf\u500b\u6536\u96c6\u5230\u7684\u5f71\u7247\u8207\u6db5\u84cb\u5f71\u7247\u7f8e\u5b78\u3001\u5167\u5bb9\u3001\u52d5\u4f5c\u548c\u7269\u7406\u5b9a\u5f8b\u7b49\u95dc\u9375\u8cc7\u8a0a\u95dc\u806f\u8d77\u4f86\u3002VidCapBench \u63a5\u8457\u5c07\u9019\u4e9b\u95dc\u9375\u8cc7\u8a0a\u5c6c\u6027\u5206\u5272\u6210\u53ef\u81ea\u52d5\u8a55\u4f30\u548c\u53ef\u624b\u52d5\u8a55\u4f30\u7684\u5b50\u96c6\uff0c\u4ee5\u6eff\u8db3\u654f\u6377\u958b\u767c\u7684\u5feb\u901f\u8a55\u4f30\u9700\u6c42\u548c\u5168\u9762\u9a57\u8b49\u7684\u6e96\u78ba\u6027\u8981\u6c42\u3002\u900f\u904e\u8a55\u4f30\u8a31\u591a\u6700\u5148\u9032\u7684\u5b57\u5e55\u6a21\u578b\uff0c\u6211\u5011\u8b49\u660e\u4e86 VidCapBench \u8207\u73fe\u6709\u7684\u5f71\u7247\u5b57\u5e55\u8a55\u4f30\u65b9\u6cd5\u76f8\u6bd4\uff0c\u5177\u6709\u512a\u7570\u7684\u7a69\u5b9a\u6027\u548c\u5168\u9762\u6027\u3002\u4f7f\u7528\u73fe\u6210\u7684 T2V \u6a21\u578b\u9a57\u8b49\u986f\u793a\uff0cVidCapBench \u5f97\u5206\u8207 T2V \u54c1\u8cea\u8a55\u4f30\u6307\u6a19\u4e4b\u9593\u5b58\u5728\u986f\u8457\u7684\u6b63\u76f8\u95dc\uff0c\u9019\u8868\u793a VidCapBench \u53ef\u4ee5\u70ba\u8a13\u7df4 T2V 
\u6a21\u578b\u63d0\u4f9b\u6709\u50f9\u503c\u7684\u6307\u5c0e\u3002\u5c08\u6848\u53ef\u65bc https://github.com/VidCapBench/VidCapBench \u53d6\u5f97\u3002", "author": "Xinlong Chen et.al.", "authors": "Xinlong Chen, Yuanxing Zhang, Chongling Rao, Yushuo Guan, Jiaheng Liu, Fuzheng Zhang, Chengru Song, Qiang Liu, Di Zhang, Tieniu Tan", "id": "2502.12782v1", "paper_url": "http://arxiv.org/abs/2502.12782v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12788/2502.12788v1.json b/database/storage/2502/12788/2502.12788v1.json
new file mode 100644
index 0000000000..7a421e7107
--- /dev/null
+++ b/database/storage/2502/12788/2502.12788v1.json
@@ -0,0 +1 @@
+{"2502.12788": {"publish_time": "2025-02-18", "title": "Commonsense Reasoning in Arab Culture", "paper_summary": "Despite progress in Arabic large language models, such as Jais and AceGPT,\ntheir evaluation on commonsense reasoning has largely relied on\nmachine-translated datasets, which lack cultural depth and may introduce\nAnglocentric biases. Commonsense reasoning is shaped by geographical and\ncultural contexts, and existing English datasets fail to capture the diversity\nof the Arab world. To address this, we introduce \\datasetname, a commonsense\nreasoning dataset in Modern Standard Arabic (MSA), covering cultures of 13\ncountries across the Gulf, Levant, North Africa, and the Nile Valley. The\ndataset was built from scratch by engaging native speakers to write and\nvalidate culturally relevant questions for their respective countries.\n\\datasetname spans 12 daily life domains with 54 fine-grained subtopics,\nreflecting various aspects of social norms, traditions, and everyday\nexperiences. Zero-shot evaluations show that open-weight language models with\nup to 32B parameters struggle to comprehend diverse Arab cultures, with\nperformance varying across regions. 
These findings highlight the need for more\nculturally aware models and datasets tailored to the Arabic-speaking world.", "paper_summary_zh": "\u5118\u7ba1\u963f\u62c9\u4f2f\u8a9e\u5927\u578b\u8a9e\u8a00\u6a21\u578b\uff08\u4f8b\u5982 Jais \u548c AceGPT\uff09\u5df2\u6709\u9032\u5c55\uff0c\n\u4f46\u5b83\u5011\u5728\u5e38\u8b58\u63a8\u7406\u4e0a\u7684\u8a55\u4f30\u5728\u5f88\u5927\u7a0b\u5ea6\u4e0a\u4f9d\u8cf4\u65bc\n\u6a5f\u5668\u7ffb\u8b6f\u7684\u8cc7\u6599\u96c6\uff0c\u9019\u4e9b\u8cc7\u6599\u96c6\u7f3a\u4e4f\u6587\u5316\u6df1\u5ea6\uff0c\u53ef\u80fd\u6703\u5f15\u5165\n\u4ee5\u82f1\u8a9e\u70ba\u4e2d\u5fc3\u7684\u504f\u898b\u3002\u5e38\u8b58\u63a8\u7406\u53d7\u5730\u7406\u548c\n\u6587\u5316\u80cc\u666f\u5f71\u97ff\uff0c\u73fe\u6709\u7684\u82f1\u6587\u8cc7\u6599\u96c6\u7121\u6cd5\u6355\u6349\u963f\u62c9\u4f2f\u4e16\u754c\u7684\u591a\u6a23\u6027\u3002\u70ba\u4e86\u89e3\u6c7a\u9019\u500b\u554f\u984c\uff0c\u6211\u5011\u5f15\u5165\u4e86 \\datasetname\uff0c\u4e00\u500b\u73fe\u4ee3\u6a19\u6e96\u963f\u62c9\u4f2f\u8a9e (MSA) \u7684\u5e38\u8b58\u63a8\u7406\u8cc7\u6599\u96c6\uff0c\u6db5\u84cb\u6d77\u7063\u5730\u5340\u3001\u9ece\u51e1\u7279\u5730\u5340\u3001\u5317\u975e\u548c\u5c3c\u7f85\u6cb3\u8c37 13 \u500b\u570b\u5bb6\u7684\u6587\u5316\u3002\u6b64\u8cc7\u6599\u96c6\u662f\u5f9e\u982d\u958b\u59cb\u5efa\u7acb\u7684\uff0c\u7531\u6bcd\u8a9e\u4eba\u58eb\u53c3\u8207\u7de8\u5beb\u548c\u9a57\u8b49\u4ed6\u5011\u5404\u81ea\u570b\u5bb6\u7684\u6587\u5316\u76f8\u95dc\u554f\u984c\u3002\\datasetname \u6db5\u84cb 12 \u500b\u65e5\u5e38\u751f\u6d3b\u9818\u57df\uff0c\u5305\u542b 54 \u500b\u7d30\u7dfb\u7684\u4e3b\u984c\uff0c\u53cd\u6620\u793e\u6703\u898f\u7bc4\u3001\u50b3\u7d71\u548c\u65e5\u5e38\u7d93\u9a57\u7684\u5404\u500b\u65b9\u9762\u3002\u96f6\u6b21\u5b78\u7fd2\u8a55\u4f30\u986f\u793a\uff0c\u5177\u6709\u9ad8\u9054 32B 
\u53c3\u6578\u7684\u958b\u653e\u5f0f\u6b0a\u91cd\u8a9e\u8a00\u6a21\u578b\u96e3\u4ee5\u7406\u89e3\u4e0d\u540c\u7684\u963f\u62c9\u4f2f\u6587\u5316\uff0c\u4e14\u5404\u5340\u57df\u7684\u8868\u73fe\u4e0d\u4e00\u3002\u9019\u4e9b\u767c\u73fe\u7a81\u986f\u4e86\u5c0d\u66f4\u5177\u6587\u5316\u610f\u8b58\u7684\u6a21\u578b\u548c\u5c08\u70ba\u963f\u62c9\u4f2f\u8a9e\u7cfb\u4e16\u754c\u91cf\u8eab\u6253\u9020\u7684\u8cc7\u6599\u96c6\u7684\u9700\u6c42\u3002", "author": "Abdelrahman Sadallah et.al.", "authors": "Abdelrahman Sadallah, Junior Cedric Tonga, Khalid Almubarak, Saeed Almheiri, Farah Atif, Chatrine Qwaider, Karima Kadaoui, Sara Shatnawi, Yaser Alesh, Fajri Koto", "id": "2502.12788v1", "paper_url": "http://arxiv.org/abs/2502.12788v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12798/2502.12798v1.json b/database/storage/2502/12798/2502.12798v1.json
new file mode 100644
index 0000000000..0e79d5cda9
--- /dev/null
+++ b/database/storage/2502/12798/2502.12798v1.json
@@ -0,0 +1 @@
+{"2502.12798": {"publish_time": "2025-02-18", "title": "Envious Explore and Exploit", "paper_summary": "Explore-and-exploit tradeoffs play a key role in recommendation systems\n(RSs), aiming at serving users better by learning from previous interactions.\nDespite their commercial success, the societal effects of explore-and-exploit\nmechanisms are not well understood, especially regarding the utility\ndiscrepancy they generate between different users. In this work, we measure\nsuch discrepancy using the economic notion of envy. We present a multi-armed\nbandit-like model in which every round consists of several sessions, and\nrewards are realized once per round. We call the latter property reward\nconsistency, and show that the RS can leverage this property for better\nsocietal outcomes. On the downside, doing so also generates envy, as\nlate-to-arrive users enjoy the information gathered by early-to-arrive users.\nWe examine the generated envy under several arrival order mechanisms and\nvirtually any anonymous algorithm, i.e., any algorithm that treats all similar\nusers similarly without leveraging their identities. We provide tight envy\nbounds on uniform arrival and upper bound the envy for nudged arrival, in which\nthe RS can affect the order of arrival by nudging its users. 
Furthermore, we\nstudy the efficiency-fairness trade-off by devising an algorithm that allows\nconstant envy and approximates the optimal welfare in restricted settings.\nFinally, we validate our theoretical results empirically using simulations.", "paper_summary_zh": "\u63a2\u7d22\u8207\u958b\u767c\u7684\u53d6\u6368\u5728\u63a8\u85a6\u7cfb\u7d71 (RS) \u4e2d\u626e\u6f14\u8457\u95dc\u9375\u89d2\u8272\uff0c\u65e8\u5728\u900f\u904e\u5b78\u7fd2\u5148\u524d\u7684\u4e92\u52d5\u4f86\u70ba\u4f7f\u7528\u8005\u63d0\u4f9b\u66f4\u597d\u7684\u670d\u52d9\u3002\u5118\u7ba1\u5728\u5546\u696d\u4e0a\u7372\u5f97\u6210\u529f\uff0c\u4f46\u63a2\u7d22\u8207\u958b\u767c\u6a5f\u5236\u7684\u793e\u6703\u6548\u61c9\u4ecd\u672a\u88ab\u5145\u5206\u7406\u89e3\uff0c\u7279\u5225\u662f\u95dc\u65bc\u5b83\u5011\u5728\u4e0d\u540c\u4f7f\u7528\u8005\u4e4b\u9593\u7522\u751f\u7684\u6548\u7528\u5dee\u7570\u3002\u5728\u9019\u9805\u5de5\u4f5c\u4e2d\uff0c\u6211\u5011\u4f7f\u7528\u7d93\u6fdf\u5b78\u4e2d\u7684\u5ac9\u5992\u6982\u5ff5\u4f86\u8861\u91cf\u9019\u7a2e\u5dee\u7570\u3002\u6211\u5011\u63d0\u51fa\u4e86\u4e00\u500b\u591a\u81c2\u8001\u864e\u6a5f\u6a21\u578b\uff0c\u5176\u4e2d\u6bcf\u4e00\u8f2a\u90fd\u5305\u542b\u591a\u500b\u56de\u5408\uff0c\u4e26\u4e14\u6bcf\u56de\u5408\u53ea\u6703\u5be6\u73fe\u4e00\u6b21\u734e\u52f5\u3002\u6211\u5011\u5c07\u5f8c\u8005\u7684\u7279\u6027\u7a31\u70ba\u734e\u52f5\u4e00\u81f4\u6027\uff0c\u4e26\u8b49\u660e RS 
\u53ef\u4ee5\u5229\u7528\u6b64\u7279\u6027\u4f86\u7372\u5f97\u66f4\u597d\u7684\u793e\u6703\u6210\u679c\u3002\u4e0d\u5229\u7684\u662f\uff0c\u9019\u9ebc\u505a\u4e5f\u6703\u7522\u751f\u5ac9\u5992\uff0c\u56e0\u70ba\u8f03\u665a\u52a0\u5165\u7684\u4f7f\u7528\u8005\u53ef\u4ee5\u4eab\u53d7\u8f03\u65e9\u52a0\u5165\u7684\u4f7f\u7528\u8005\u6240\u6536\u96c6\u7684\u8cc7\u8a0a\u3002\u6211\u5011\u5728\u591a\u7a2e\u5230\u9054\u9806\u5e8f\u6a5f\u5236\u548c\u5e7e\u4e4e\u4efb\u4f55\u533f\u540d\u6f14\u7b97\u6cd5\uff08\u5373\u4efb\u4f55\u6f14\u7b97\u6cd5\u90fd\u4ee5\u985e\u4f3c\u7684\u65b9\u5f0f\u5c0d\u5f85\u6240\u6709\u985e\u4f3c\u7684\u4f7f\u7528\u8005\uff0c\u800c\u4e0d\u5229\u7528\u4ed6\u5011\u7684\u8eab\u4efd\uff09\u4e0b\u6aa2\u9a57\u7522\u751f\u7684\u5ac9\u5992\u3002\u6211\u5011\u5c0d\u5747\u52fb\u5230\u9054\u63d0\u4f9b\u56b4\u683c\u7684\u5ac9\u5992\u754c\u7dda\uff0c\u4e26\u5c0d\u63a8\u52d5\u5230\u9054\u7684\u4e0a\u9650\u9032\u884c\u5ac9\u5992\u754c\u7dda\uff0c\u5176\u4e2d RS \u53ef\u4ee5\u900f\u904e\u63a8\u52d5\u5176\u4f7f\u7528\u8005\u4f86\u5f71\u97ff\u5230\u9054\u9806\u5e8f\u3002\u6b64\u5916\uff0c\u6211\u5011\u900f\u904e\u8a2d\u8a08\u4e00\u7a2e\u6f14\u7b97\u6cd5\u4f86\u7814\u7a76\u6548\u7387\u516c\u5e73\u6b0a\u8861\uff0c\u8a72\u6f14\u7b97\u6cd5\u5141\u8a31\u6046\u5b9a\u7684\u5ac9\u5992\uff0c\u4e26\u5728\u53d7\u9650\u8a2d\u5b9a\u4e2d\u8fd1\u4f3c\u6700\u4f73\u798f\u5229\u3002\u6700\u5f8c\uff0c\u6211\u5011\u4f7f\u7528\u6a21\u64ec\u5c0d\u6211\u5011\u7684\u7406\u8ad6\u7d50\u679c\u9032\u884c\u7d93\u9a57\u9a57\u8b49\u3002", "author": "Omer Ben-Porat et.al.", "authors": "Omer Ben-Porat, Yotam Gafni, Or Markovetzki", "id": "2502.12798v1", "paper_url": "http://arxiv.org/abs/2502.12798v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12799/2502.12799v1.json b/database/storage/2502/12799/2502.12799v1.json
new file mode 100644
index 0000000000..cf217003b7
--- /dev/null
+++ b/database/storage/2502/12799/2502.12799v1.json
@@ -0,0 +1 @@
+{"2502.12799": {"publish_time": "2025-02-18", "title": "Towards Text-Image Interleaved Retrieval", "paper_summary": "Current multimodal information retrieval studies mainly focus on single-image\ninputs, which limits real-world applications involving multiple images and\ntext-image interleaved content. In this work, we introduce the text-image\ninterleaved retrieval (TIIR) task, where the query and document are interleaved\ntext-image sequences, and the model is required to understand the semantics\nfrom the interleaved context for effective retrieval. We construct a TIIR\nbenchmark based on naturally interleaved wikiHow tutorials, where a specific\npipeline is designed to generate interleaved queries. To explore the task, we\nadapt several off-the-shelf retrievers and build a dense baseline by\ninterleaved multimodal large language model (MLLM). We then propose a novel\nMatryoshka Multimodal Embedder (MME), which compresses the number of visual\ntokens at different granularity, to address the challenge of excessive visual\ntokens in MLLM-based TIIR models. Experiments demonstrate that simple adaption\nof existing models does not consistently yield effective results. Our MME\nachieves significant improvements over the baseline by substantially fewer\nvisual tokens. 
We provide extensive analysis and will release the dataset and\ncode to facilitate future research.", "paper_summary_zh": "\u76ee\u524d\u7684\u591a\u6a21\u614b\u8cc7\u8a0a\u6aa2\u7d22\u7814\u7a76\u4e3b\u8981\u96c6\u4e2d\u5728\u55ae\u4e00\u5f71\u50cf\u8f38\u5165\uff0c\u9019\u9650\u5236\u4e86\u6d89\u53ca\u591a\u500b\u5f71\u50cf\u548c\u6587\u5b57\u5f71\u50cf\u4ea4\u932f\u5167\u5bb9\u7684\u5be6\u969b\u61c9\u7528\u3002\u5728\u9019\u9805\u5de5\u4f5c\u4e2d\uff0c\u6211\u5011\u5f15\u5165\u4e86\u6587\u5b57\u5f71\u50cf\u4ea4\u932f\u6aa2\u7d22 (TIIR) \u4efb\u52d9\uff0c\u5176\u4e2d\u67e5\u8a62\u548c\u6587\u4ef6\u662f\u4ea4\u932f\u7684\u6587\u5b57\u5f71\u50cf\u5e8f\u5217\uff0c\u4e26\u4e14\u6a21\u578b\u9700\u8981\u7406\u89e3\u4ea4\u932f\u5167\u5bb9\u7684\u8a9e\u610f\u4ee5\u9032\u884c\u6709\u6548\u6aa2\u7d22\u3002\u6211\u5011\u6839\u64da\u81ea\u7136\u4ea4\u932f\u7684 wikiHow \u6559\u5b78\u8ab2\u7a0b\u5efa\u69cb\u4e86\u4e00\u500b TIIR \u57fa\u6e96\uff0c\u5176\u4e2d\u8a2d\u8a08\u4e86\u4e00\u500b\u7279\u5b9a\u7684\u7ba1\u7dda\u4f86\u7522\u751f\u4ea4\u932f\u67e5\u8a62\u3002\u70ba\u4e86\u63a2\u7d22\u9019\u500b\u4efb\u52d9\uff0c\u6211\u5011\u8abf\u6574\u4e86\u5e7e\u500b\u73fe\u6210\u7684\u6aa2\u7d22\u5668\uff0c\u4e26\u900f\u904e\u4ea4\u932f\u7684\u591a\u6a21\u614b\u5927\u578b\u8a9e\u8a00\u6a21\u578b (MLLM) \u5efa\u7acb\u4e86\u4e00\u500b\u5bc6\u96c6\u7684\u57fa\u6e96\u3002\u7136\u5f8c\uff0c\u6211\u5011\u63d0\u51fa\u4e86\u4e00\u500b\u65b0\u7a4e\u7684 Matryoshka \u591a\u6a21\u614b\u5d4c\u5165\u5668 (MME)\uff0c\u5b83\u58d3\u7e2e\u4e86\u4e0d\u540c\u7c92\u5ea6\u8996\u89ba\u7b26\u865f\u7684\u6578\u91cf\uff0c\u4ee5\u89e3\u6c7a\u57fa\u65bc MLLM \u7684 TIIR \u6a21\u578b\u4e2d\u904e\u591a\u8996\u89ba\u7b26\u865f\u7684\u6311\u6230\u3002\u5be6\u9a57\u8868\u660e\uff0c\u5c0d\u73fe\u6709\u6a21\u578b\u7684\u7c21\u55ae\u8abf\u6574\u4e26\u672a\u6301\u7e8c\u7522\u751f\u6709\u6548\u7d50\u679c\u3002\u6211\u5011\u7684 MME 
\u900f\u904e\u5927\u5e45\u6e1b\u5c11\u8996\u89ba\u7b26\u865f\uff0c\u9054\u5230\u4e86\u6bd4\u57fa\u6e96\u986f\u8457\u7684\u6539\u9032\u3002\u6211\u5011\u63d0\u4f9b\u4e86\u5ee3\u6cdb\u7684\u5206\u6790\uff0c\u4e26\u5c07\u91cb\u51fa\u8cc7\u6599\u96c6\u548c\u7a0b\u5f0f\u78bc\u4ee5\u4fc3\u9032\u672a\u4f86\u7684\u7814\u7a76\u3002", "author": "Xin Zhang et.al.", "authors": "Xin Zhang, Ziqi Dai, Yongqi Li, Yanzhao Zhang, Dingkun Long, Pengjun Xie, Meishan Zhang, Jun Yu, Wenjie Li, Min Zhang", "id": "2502.12799v1", "paper_url": "http://arxiv.org/abs/2502.12799v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12813/2502.12813v1.json b/database/storage/2502/12813/2502.12813v1.json
new file mode 100644
index 0000000000..637cee970d
--- /dev/null
+++ b/database/storage/2502/12813/2502.12813v1.json
@@ -0,0 +1 @@
+{"2502.12813": {"publish_time": "2025-02-18", "title": "Simulating User Diversity in Task-Oriented Dialogue Systems using Large Language Models", "paper_summary": "In this study, we explore the application of Large Language Models (LLMs) for\ngenerating synthetic users and simulating user conversations with a\ntask-oriented dialogue system and present detailed results and their analysis.\nWe propose a comprehensive novel approach to user simulation technique that\nuses LLMs to create diverse user profiles, set goals, engage in multi-turn\ndialogues, and evaluate the conversation success. We employ two proprietary\nLLMs, namely GPT-4o and GPT-o1 (Achiam et al., 2023), to generate a\nheterogeneous base of user profiles, characterized by varied demographics,\nmultiple user goals, different conversational styles, initial knowledge levels,\ninterests, and conversational objectives. We perform a detailed analysis of the\nuser profiles generated by LLMs to assess the diversity, consistency, and\npotential biases inherent in these LLM-generated user simulations. We find that\nGPT-o1 generates more heterogeneous user distribution across most user\nattributes, while GPT-4o generates more skewed user attributes. 
The generated\nset of user profiles are then utilized to simulate dialogue sessions by\ninteracting with a task-oriented dialogue system.", "paper_summary_zh": "\u5728\u9019\u9805\u7814\u7a76\u4e2d\uff0c\u6211\u5011\u63a2\u8a0e\u5927\u578b\u8a9e\u8a00\u6a21\u578b (LLM) \u5728\u751f\u6210\u5408\u6210\u4f7f\u7528\u8005\u548c\u6a21\u64ec\u4f7f\u7528\u8005\u5c0d\u8a71\uff0c\u4e26\u4f7f\u7528\u4efb\u52d9\u5c0e\u5411\u5c0d\u8a71\u7cfb\u7d71\u9032\u884c\u5c0d\u8a71\u7684\u61c9\u7528\uff0c\u4e26\u63d0\u51fa\u8a73\u7d30\u7684\u7d50\u679c\u53ca\u5176\u5206\u6790\u3002\u6211\u5011\u63d0\u51fa\u4e86\u4e00\u7a2e\u5168\u9762\u7684\u4f7f\u7528\u8005\u6a21\u64ec\u6280\u8853\u65b0\u65b9\u6cd5\uff0c\u5229\u7528 LLM \u5efa\u7acb\u591a\u6a23\u5316\u7684\u4f7f\u7528\u8005\u6982\u6cc1\u3001\u8a2d\u5b9a\u76ee\u6a19\u3001\u53c3\u8207\u591a\u8f2a\u5c0d\u8a71\uff0c\u4e26\u8a55\u4f30\u5c0d\u8a71\u7684\u6210\u529f\u6027\u3002\u6211\u5011\u63a1\u7528\u4e86\u5169\u500b\u5c08\u6709\u7684 LLM\uff0c\u5373 GPT-4o \u548c GPT-o1 (Achiam \u7b49\u4eba\uff0c2023 \u5e74)\uff0c\u4ee5\u751f\u6210\u4e00\u500b\u7570\u8cea\u7684\u4f7f\u7528\u8005\u6982\u6cc1\u57fa\u790e\uff0c\u5176\u7279\u5fb5\u5728\u65bc\u4e0d\u540c\u7684\u4eba\u53e3\u7d71\u8a08\u8cc7\u6599\u3001\u591a\u500b\u4f7f\u7528\u8005\u76ee\u6a19\u3001\u4e0d\u540c\u7684\u5c0d\u8a71\u98a8\u683c\u3001\u521d\u59cb\u77e5\u8b58\u6c34\u6e96\u3001\u8208\u8da3\u548c\u5c0d\u8a71\u76ee\u6a19\u3002\u6211\u5011\u5c0d LLM \u751f\u6210\u7684\u4f7f\u7528\u8005\u6982\u6cc1\u9032\u884c\u4e86\u8a73\u7d30\u5206\u6790\uff0c\u4ee5\u8a55\u4f30\u9019\u4e9b LLM \u751f\u6210\u7684\u4f7f\u7528\u8005\u6a21\u64ec\u4e2d\u56fa\u6709\u7684\u591a\u6a23\u6027\u3001\u4e00\u81f4\u6027\u548c\u6f5b\u5728\u504f\u5dee\u3002\u6211\u5011\u767c\u73fe GPT-o1 \u5728\u5927\u591a\u6578\u4f7f\u7528\u8005\u5c6c\u6027\u4e2d\u7522\u751f\u66f4\u7570\u8cea\u7684\u4f7f\u7528\u8005\u5206\u4f48\uff0c\u800c GPT-4o 
\u5247\u7522\u751f\u66f4\u504f\u659c\u7684\u4f7f\u7528\u8005\u5c6c\u6027\u3002\u7136\u5f8c\u5229\u7528\u751f\u6210\u7684\u4f7f\u7528\u8005\u6982\u6cc1\u96c6\uff0c\u900f\u904e\u8207\u4efb\u52d9\u5c0e\u5411\u5c0d\u8a71\u7cfb\u7d71\u4e92\u52d5\u4f86\u6a21\u64ec\u5c0d\u8a71\u6703\u8a71\u3002", "author": "Adnan Ahmad et.al.", "authors": "Adnan Ahmad, Stefan Hillmann, Sebastian M\u00f6ller", "id": "2502.12813v1", "paper_url": "http://arxiv.org/abs/2502.12813v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12821/2502.12821v1.json b/database/storage/2502/12821/2502.12821v1.json
new file mode 100644
index 0000000000..570fc2a4a9
--- /dev/null
+++ b/database/storage/2502/12821/2502.12821v1.json
@@ -0,0 +1 @@
+{"2502.12821": {"publish_time": "2025-02-18", "title": "Pitfalls of Scale: Investigating the Inverse Task of Redefinition in Large Language Models", "paper_summary": "Inverse tasks can uncover potential reasoning gaps as Large Language Models\n(LLMs) scale up. In this work, we explore the redefinition task, in which we\nassign alternative values to well-known physical constants and units of\nmeasure, prompting LLMs to respond accordingly. Our findings show that not only\ndoes model performance degrade with scale, but its false confidence also rises.\nMoreover, while factors such as prompting strategies or response formatting are\ninfluential, they do not preclude LLMs from anchoring to memorized values.", "paper_summary_zh": "\u9006\u5411\u4efb\u52d9\u53ef\u4ee5\u63ed\u793a\u5927\u578b\u8a9e\u8a00\u6a21\u578b (LLM) \u64f4\u5c55\u6642\u6f5b\u5728\u7684\u63a8\u7406\u5dee\u8ddd\u3002\u5728\u672c\u6587\u4e2d\uff0c\u6211\u5011\u63a2\u8a0e\u91cd\u65b0\u5b9a\u7fa9\u4efb\u52d9\uff0c\u5176\u4e2d\u6211\u5011\u5c07\u66ff\u63db\u503c\u6307\u5b9a\u7d66\u8457\u540d\u7684\u7269\u7406\u5e38\u6578\u548c\u6e2c\u91cf\u55ae\u4f4d\uff0c\u4fc3\u4f7f LLM \u505a\u51fa\u76f8\u61c9\u56de\u61c9\u3002\u6211\u5011\u7684\u7814\u7a76\u7d50\u679c\u8868\u660e\uff0c\u6a21\u578b\u6548\u80fd\u4e0d\u50c5\u6703\u96a8\u8457\u898f\u6a21\u800c\u4e0b\u964d\uff0c\u5176\u865b\u5047\u4fe1\u5fc3\u4e5f\u6703\u4e0a\u5347\u3002\u6b64\u5916\uff0c\u5118\u7ba1\u63d0\u793a\u7b56\u7565\u6216\u56de\u61c9\u683c\u5f0f\u7b49\u56e0\u7d20\u5177\u6709\u5f71\u97ff\u529b\uff0c\u4f46\u5b83\u5011\u4e26\u4e0d\u59a8\u7919 LLM \u9328\u5b9a\u5728\u8a18\u61b6\u503c\u4e0a\u3002", "author": "Elena Stringli et.al.", "authors": "Elena Stringli, Maria Lymperaiou, Giorgos Filandrianos, Giorgos Stamou", "id": "2502.12821v1", "paper_url": "http://arxiv.org/abs/2502.12821v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12825/2502.12825v1.json b/database/storage/2502/12825/2502.12825v1.json
new file mode 100644
index 0000000000..a7c78ab9ad
--- /dev/null
+++ b/database/storage/2502/12825/2502.12825v1.json
@@ -0,0 +1 @@
+{"2502.12825": {"publish_time": "2025-02-18", "title": "Reasoning and the Trusting Behavior of DeepSeek and GPT: An Experiment Revealing Hidden Fault Lines in Large Language Models", "paper_summary": "When encountering increasingly frequent performance improvements or cost\nreductions from a new large language model (LLM), developers of applications\nleveraging LLMs must decide whether to take advantage of these improvements or\nstay with older tried-and-tested models. Low perceived switching frictions can\nlead to choices that do not consider more subtle behavior changes that the\ntransition may induce. Our experiments use a popular game-theoretic behavioral\neconomics model of trust to show stark differences in the trusting behavior of\nOpenAI's and DeepSeek's models. We highlight a collapse in the economic trust\nbehavior of the o1-mini and o3-mini models as they reconcile profit-maximizing\nand risk-seeking with future returns from trust, and contrast it with\nDeepSeek's more sophisticated and profitable trusting behavior that stems from\nan ability to incorporate deeper concepts like forward planning and\ntheory-of-mind. 
As LLMs form the basis for high-stakes commercial systems, our\nresults highlight the perils of relying on LLM performance benchmarks that are\ntoo narrowly defined and suggest that careful analysis of their hidden fault\nlines should be part of any organization's AI strategy.", "paper_summary_zh": "\u7576\u9047\u5230\u8d8a\u4f86\u8d8a\u983b\u7e41\u7684\u6548\u80fd\u63d0\u5347\u6216\u4f86\u81ea\u65bc\u65b0\u7684\u5927\u578b\u8a9e\u8a00\u6a21\u578b (LLM) \u7684\u6210\u672c\u964d\u4f4e\u6642\uff0c\u5229\u7528 LLM \u7684\u61c9\u7528\u7a0b\u5f0f\u958b\u767c\u4eba\u54e1\u5fc5\u9808\u6c7a\u5b9a\u662f\u5426\u8981\u5229\u7528\u9019\u4e9b\u63d0\u5347\u6216\u7dad\u6301\u8f03\u820a\u4e14\u7d93\u904e\u6e2c\u8a66\u7684\u6a21\u578b\u3002\u4f4e\u611f\u77e5\u5207\u63db\u6469\u64e6\u53ef\u80fd\u6703\u5c0e\u81f4\u9078\u64c7\u4e0d\u8003\u616e\u8f49\u63db\u53ef\u80fd\u8a98\u767c\u7684\u66f4\u7d30\u5fae\u7684\u884c\u70ba\u6539\u8b8a\u3002\u6211\u5011\u7684\u5be6\u9a57\u4f7f\u7528\u4fe1\u4efb\u7684\u6d41\u884c\u535a\u5f08\u8ad6\u884c\u70ba\u7d93\u6fdf\u6a21\u578b\u4f86\u986f\u793a OpenAI \u548c DeepSeek \u6a21\u578b\u5728\u4fe1\u4efb\u884c\u70ba\u4e0a\u7684\u986f\u8457\u5dee\u7570\u3002\u6211\u5011\u5f37\u8abf o1-mini \u548c o3-mini \u6a21\u578b\u7684\u7d93\u6fdf\u4fe1\u4efb\u884c\u70ba\u5d29\u6f70\uff0c\u56e0\u70ba\u5b83\u5011\u8abf\u548c\u4e86\u5229\u6f64\u6700\u5927\u5316\u548c\u98a8\u96aa\u5c0b\u6c42\u8207\u4f86\u81ea\u4fe1\u4efb\u7684\u672a\u4f86\u56de\u5831\uff0c\u4e26\u5c07\u5176\u8207 DeepSeek \u66f4\u8907\u96dc\u4e14\u6709\u5229\u53ef\u5716\u7684\u4fe1\u4efb\u884c\u70ba\u9032\u884c\u5c0d\u6bd4\uff0c\u9019\u7a2e\u4fe1\u4efb\u884c\u70ba\u6e90\u65bc\u6574\u5408\u66f4\u6df1\u5c64\u7684\u6982\u5ff5\uff0c\u4f8b\u5982\u524d\u77bb\u6027\u898f\u5283\u548c\u5fc3\u667a\u7406\u8ad6\u3002\u7531\u65bc LLM \u69cb\u6210\u9ad8\u98a8\u96aa\u5546\u696d\u7cfb\u7d71\u7684\u57fa\u790e\uff0c\u6211\u5011\u7684\u7d50\u679c\u7a81\u986f\u4e86\u4f9d\u8cf4\u5b9a\u7fa9\u904e\u65bc\u72f9\u7a84\u7684 LLM 
\u6548\u80fd\u57fa\u6e96\u7684\u5371\u96aa\u6027\uff0c\u4e26\u5efa\u8b70\u4ed4\u7d30\u5206\u6790\u5176\u96b1\u85cf\u7684\u65b7\u5c64\u7dda\u61c9\u8a72\u662f\u4efb\u4f55\u7d44\u7e54\u7684 AI \u7b56\u7565\u7684\u4e00\u90e8\u5206\u3002", "author": "Rubing Lu et.al.", "authors": "Rubing Lu, Jo\u00e3o Sedoc, Arun Sundararajan", "id": "2502.12825v1", "paper_url": "http://arxiv.org/abs/2502.12825v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12829/2502.12829v1.json b/database/storage/2502/12829/2502.12829v1.json
new file mode 100644
index 0000000000..2744355f1e
--- /dev/null
+++ b/database/storage/2502/12829/2502.12829v1.json
@@ -0,0 +1 @@
+{"2502.12829": {"publish_time": "2025-02-18", "title": "KazMMLU: Evaluating Language Models on Kazakh, Russian, and Regional Knowledge of Kazakhstan", "paper_summary": "Despite having a population of twenty million, Kazakhstan's culture and\nlanguage remain underrepresented in the field of natural language processing.\nAlthough large language models (LLMs) continue to advance worldwide, progress\nin Kazakh language has been limited, as seen in the scarcity of dedicated\nmodels and benchmark evaluations. To address this gap, we introduce KazMMLU,\nthe first MMLU-style dataset specifically designed for Kazakh language. KazMMLU\ncomprises 23,000 questions that cover various educational levels, including\nSTEM, humanities, and social sciences, sourced from authentic educational\nmaterials and manually validated by native speakers and educators. The dataset\nincludes 10,969 Kazakh questions and 12,031 Russian questions, reflecting\nKazakhstan's bilingual education system and rich local context. Our evaluation\nof several state-of-the-art multilingual models (Llama-3.1, Qwen-2.5, GPT-4,\nand DeepSeek V3) demonstrates substantial room for improvement, as even the\nbest-performing models struggle to achieve competitive performance in Kazakh\nand Russian. These findings underscore significant performance gaps compared to\nhigh-resource languages. We hope that our dataset will enable further research\nand development of Kazakh-centric LLMs. 
Data and code will be made available\nupon acceptance.", "paper_summary_zh": "\u5118\u7ba1\u54c8\u85a9\u514b\u4eba\u53e3\u9054\u5169\u5343\u842c\uff0c\u4f46\u54c8\u85a9\u514b\u7684\u6587\u5316\u548c\u8a9e\u8a00\u5728\u81ea\u7136\u8a9e\u8a00\u8655\u7406\u9818\u57df\u4ecd\u672a\u5f97\u5230\u5145\u5206\u7684\u91cd\u8996\u3002\u5118\u7ba1\u5927\u578b\u8a9e\u8a00\u6a21\u578b (LLM) \u5728\u5168\u7403\u6301\u7e8c\u9032\u6b65\uff0c\u4f46\u54c8\u85a9\u514b\u8a9e\u7684\u9032\u5c55\u537b\u5341\u5206\u6709\u9650\uff0c\u9019\u5f9e\u5c08\u7528\u6a21\u578b\u548c\u57fa\u6e96\u8a55\u4f30\u7684\u7a00\u7f3a\u6027\u4e2d\u53ef\u898b\u4e00\u6591\u3002\u70ba\u4e86\u89e3\u6c7a\u9019\u500b\u5dee\u8ddd\uff0c\u6211\u5011\u5f15\u5165\u4e86 KazMMLU\uff0c\u9019\u662f\u7b2c\u4e00\u500b\u5c08\u9580\u70ba\u54c8\u85a9\u514b\u8a9e\u8a2d\u8a08\u7684 MMLU \u98a8\u683c\u8cc7\u6599\u96c6\u3002KazMMLU \u5305\u542b 23,000 \u500b\u554f\u984c\uff0c\u6db5\u84cb\u5404\u7a2e\u6559\u80b2\u5c64\u7d1a\uff0c\u5305\u62ec STEM\u3001\u4eba\u6587\u5b78\u79d1\u548c\u793e\u6703\u79d1\u5b78\uff0c\u9019\u4e9b\u554f\u984c\u4f86\u81ea\u771f\u5be6\u7684\u6559\u80b2\u6750\u6599\uff0c\u4e26\u7531\u6bcd\u8a9e\u4eba\u58eb\u548c\u6559\u80b2\u5de5\u4f5c\u8005\u624b\u52d5\u9a57\u8b49\u3002\u8a72\u8cc7\u6599\u96c6\u5305\u542b 10,969 \u500b\u54c8\u85a9\u514b\u8a9e\u554f\u984c\u548c 12,031 \u500b\u4fc4\u8a9e\u554f\u984c\uff0c\u53cd\u6620\u4e86\u54c8\u85a9\u514b\u7684\u96d9\u8a9e\u6559\u80b2\u9ad4\u7cfb\u548c\u8c50\u5bcc\u7684\u5728\u5730\u8108\u7d61\u3002\u6211\u5011\u5c0d\u5e7e\u500b\u6700\u5148\u9032\u7684\u591a\u8a9e\u8a00\u6a21\u578b\uff08Llama-3.1\u3001Qwen-2.5\u3001GPT-4 \u548c DeepSeek 
V3\uff09\u7684\u8a55\u4f30\u986f\u793a\uff0c\u4ecd\u6709\u5f88\u5927\u7684\u6539\u9032\u7a7a\u9593\uff0c\u56e0\u70ba\u5373\u4f7f\u662f\u6548\u80fd\u6700\u597d\u7684\u6a21\u578b\uff0c\u4e5f\u5f88\u96e3\u5728\u54c8\u85a9\u514b\u8a9e\u548c\u4fc4\u8a9e\u4e2d\u9054\u5230\u6709\u7af6\u722d\u529b\u7684\u6548\u80fd\u3002\u9019\u4e9b\u767c\u73fe\u5f37\u8abf\u4e86\u8207\u8cc7\u6e90\u8c50\u5bcc\u7684\u8a9e\u8a00\u76f8\u6bd4\uff0c\u5b58\u5728\u986f\u8457\u7684\u6548\u80fd\u5dee\u8ddd\u3002\u6211\u5011\u5e0c\u671b\u6211\u5011\u7684\u8cc7\u6599\u96c6\u80fd\u4fc3\u9032\u4ee5\u54c8\u85a9\u514b\u8a9e\u70ba\u4e2d\u5fc3\u7684 LLM \u7684\u9032\u4e00\u6b65\u7814\u7a76\u548c\u958b\u767c\u3002\u8cc7\u6599\u548c\u7a0b\u5f0f\u78bc\u5c07\u5728\u7372\u5f97\u63a5\u53d7\u5f8c\u63d0\u4f9b\u3002", "author": "Mukhammed Togmanov et.al.", "authors": "Mukhammed Togmanov, Nurdaulet Mukhituly, Diana Turmakhan, Jonibek Mansurov, Maiya Goloburda, Akhmed Sakip, Zhuohan Xie, Yuxia Wang, Bekassyl Syzdykov, Nurkhan Laiyk, Alham Fikri Aji, Ekaterina Kochmar, Preslav Nakov, Fajri Koto", "id": "2502.12829v1", "paper_url": "http://arxiv.org/abs/2502.12829v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12835/2502.12835v1.json b/database/storage/2502/12835/2502.12835v1.json
new file mode 100644
index 0000000000..59a9e01cbc
--- /dev/null
+++ b/database/storage/2502/12835/2502.12835v1.json
@@ -0,0 +1 @@
+{"2502.12835": {"publish_time": "2025-02-18", "title": "Subword models struggle with word learning, but surprisal hides it", "paper_summary": "We study word learning in subword and character language models with the\npsycholinguistic lexical decision task. While subword LMs struggle to discern\nwords and non-words with high accuracy, character LMs solve this task easily\nand consistently. Furthermore, when comparing word learning and syntactic\nlearning, both processes are separable in character LM where word learning\npredates syntactic learning, whereas these processes are simultaneous in\nsubword LM. This raises questions about the adequacy of subword LMs for\nmodeling language acquisition and positions character LMs as a viable\nalternative.", "paper_summary_zh": "\u6211\u5011\u4f7f\u7528\u5fc3\u7406\u8a9e\u8a00\u5b78\u7684\u8a5e\u5f59\u6c7a\u7b56\u4efb\u52d9\u7814\u7a76\u5728\u5b50\u8a5e\u548c\u5b57\u5143\u8a9e\u8a00\u6a21\u578b\u4e2d\u7684\u8a5e\u5f59\u5b78\u7fd2\u3002\u5118\u7ba1\u5b50\u8a5e\u8a9e\u8a00\u6a21\u578b\u96e3\u4ee5\u5340\u5206\u55ae\u8a5e\u548c\u975e\u55ae\u8a5e\uff0c\u4f46\u5b57\u5143\u8a9e\u8a00\u6a21\u578b\u53ef\u4ee5\u8f15\u9b06\u4e14\u4e00\u81f4\u5730\u89e3\u6c7a\u6b64\u4efb\u52d9\u3002\u6b64\u5916\uff0c\u5728\u6bd4\u8f03\u55ae\u8a5e\u5b78\u7fd2\u548c\u53e5\u6cd5\u5b78\u7fd2\u6642\uff0c\u9019\u5169\u500b\u904e\u7a0b\u5728\u5b57\u5143\u8a9e\u8a00\u6a21\u578b\u4e2d\u662f\u53ef\u5206\u96e2\u7684\uff0c\u5176\u4e2d\u55ae\u8a5e\u5b78\u7fd2\u5148\u65bc\u53e5\u6cd5\u5b78\u7fd2\uff0c\u800c\u9019\u4e9b\u904e\u7a0b\u5728\u5b50\u8a5e\u8a9e\u8a00\u6a21\u578b\u4e2d\u662f\u540c\u6642\u767c\u751f\u7684\u3002\u9019\u5f15\u767c\u4e86\u95dc\u65bc\u5b50\u8a5e\u8a9e\u8a00\u6a21\u578b\u5c0d\u8a9e\u8a00\u7fd2\u5f97\u5efa\u6a21\u7684\u5145\u5206\u6027\u7684\u554f\u984c\uff0c\u4e26\u5c07\u5b57\u5143\u8a9e\u8a00\u6a21\u578b\u5b9a\u4f4d\u70ba\u53ef\u884c\u7684\u66ff\u4ee3\u65b9\u6848\u3002", "author": "Bastian Bunzeck et.al.", "authors": "Bastian Bunzeck, Sina 
Zarrie\u00df", "id": "2502.12835v1", "paper_url": "http://arxiv.org/abs/2502.12835v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12836/2502.12836v1.json b/database/storage/2502/12836/2502.12836v1.json
new file mode 100644
index 0000000000..b747b22580
--- /dev/null
+++ b/database/storage/2502/12836/2502.12836v1.json
@@ -0,0 +1 @@
+{"2502.12836": {"publish_time": "2025-02-18", "title": "An LLM-Powered Agent for Physiological Data Analysis: A Case Study on PPG-based Heart Rate Estimation", "paper_summary": "Large language models (LLMs) are revolutionizing healthcare by improving\ndiagnosis, patient care, and decision support through interactive\ncommunication. More recently, they have been applied to analyzing physiological\ntime-series like wearable data for health insight extraction. Existing methods\nembed raw numerical sequences directly into prompts, which exceeds token limits\nand increases computational costs. Additionally, some studies integrated\nfeatures extracted from time-series in textual prompts or applied multimodal\napproaches. However, these methods often produce generic and unreliable outputs\ndue to LLMs' limited analytical rigor and inefficiency in interpreting\ncontinuous waveforms. In this paper, we develop an LLM-powered agent for\nphysiological time-series analysis aimed to bridge the gap in integrating LLMs\nwith well-established analytical tools. Built on the OpenCHA, an open-source\nLLM-powered framework, our agent features an orchestrator that integrates user\ninteraction, data sources, and analytical tools to generate accurate health\ninsights. To evaluate its effectiveness, we implement a case study on heart\nrate (HR) estimation from Photoplethysmogram (PPG) signals using a dataset of\nPPG and Electrocardiogram (ECG) recordings in a remote health monitoring study.\nThe agent's performance is benchmarked against OpenAI GPT-4o-mini and GPT-4o,\nwith ECG serving as the gold standard for HR estimation. Results demonstrate\nthat our agent significantly outperforms benchmark models by achieving lower\nerror rates and more reliable HR estimations. 
The agent implementation is\npublicly available on GitHub.", "paper_summary_zh": "\u5927\u578b\u8a9e\u8a00\u6a21\u578b (LLM) \u900f\u904e\u4e92\u52d5\u5f0f\u6e9d\u901a\uff0c\u6539\u5584\u8a3a\u65b7\u3001\u75c5\u4eba\u7167\u8b77\u548c\u6c7a\u7b56\u652f\u63f4\uff0c\u9032\u800c\u9769\u65b0\u91ab\u7642\u4fdd\u5065\u3002\u6700\u8fd1\uff0c\u5b83\u5011\u5df2\u61c9\u7528\u65bc\u5206\u6790\u751f\u7406\u6642\u9593\u5e8f\u5217\uff0c\u4f8b\u5982\u53ef\u7a7f\u6234\u5f0f\u88dd\u7f6e\u7684\u8cc7\u6599\uff0c\u4ee5\u8403\u53d6\u5065\u5eb7\u898b\u89e3\u3002\u73fe\u6709\u65b9\u6cd5\u6703\u5c07\u539f\u59cb\u6578\u503c\u5e8f\u5217\u76f4\u63a5\u5d4c\u5165\u63d0\u793a\u4e2d\uff0c\u9019\u6703\u8d85\u904e\u6b0a\u6756\u9650\u5236\u4e26\u589e\u52a0\u904b\u7b97\u6210\u672c\u3002\u6b64\u5916\uff0c\u4e00\u4e9b\u7814\u7a76\u5c07\u5f9e\u6642\u9593\u5e8f\u5217\u4e2d\u8403\u53d6\u7684\u7279\u5fb5\u6574\u5408\u5230\u6587\u5b57\u63d0\u793a\u4e2d\uff0c\u6216\u61c9\u7528\u591a\u6a21\u614b\u65b9\u6cd5\u3002\u7136\u800c\uff0c\u7531\u65bc LLM \u5728\u89e3\u8b6f\u9023\u7e8c\u6ce2\u5f62\u6642\u5206\u6790\u56b4\u8b39\u5ea6\u6709\u9650\u4e14\u6548\u7387\u4e0d\u5f70\uff0c\u9019\u4e9b\u65b9\u6cd5\u7d93\u5e38\u7522\u751f\u901a\u7528\u4e14\u4e0d\u53ef\u9760\u7684\u8f38\u51fa\u3002\u5728\u672c\u6587\u4e2d\uff0c\u6211\u5011\u958b\u767c\u4e86\u4e00\u500b\u7531 LLM \u9a45\u52d5\u7684\u4ee3\u7406\uff0c\u7528\u65bc\u751f\u7406\u6642\u9593\u5e8f\u5217\u5206\u6790\uff0c\u65e8\u5728\u5f4c\u5408\u5c07 LLM \u8207\u65e2\u6709\u5206\u6790\u5de5\u5177\u6574\u5408\u7684\u5dee\u8ddd\u3002\u6211\u5011\u7684\u4ee3\u7406\u5efa\u7acb\u5728 OpenCHA\uff08\u4e00\u500b\u7531 LLM 
\u9a45\u52d5\u7684\u958b\u6e90\u67b6\u69cb\uff09\u4e4b\u4e0a\uff0c\u5177\u5099\u4e00\u500b\u6574\u5408\u4f7f\u7528\u8005\u4e92\u52d5\u3001\u8cc7\u6599\u4f86\u6e90\u548c\u5206\u6790\u5de5\u5177\u7684\u5354\u8abf\u5668\uff0c\u4ee5\u7522\u751f\u6e96\u78ba\u7684\u5065\u5eb7\u898b\u89e3\u3002\u70ba\u4e86\u8a55\u4f30\u5176\u6709\u6548\u6027\uff0c\u6211\u5011\u5be6\u4f5c\u4e86\u4e00\u500b\u6848\u4f8b\u7814\u7a76\uff0c\u5f9e\u9060\u8ddd\u5065\u5eb7\u76e3\u6e2c\u7814\u7a76\u4e2d\u7684\u4e00\u7d44\u5149\u96fb\u5bb9\u7a4d\u63cf\u8a18\u5716 (PPG) \u548c\u5fc3\u96fb\u5716 (ECG) \u8a18\u9304\u4e2d\u4f30\u7b97\u5fc3\u7387 (HR)\u3002\u8a72\u4ee3\u7406\u7684\u6548\u80fd\u8207 OpenAI GPT-4o-mini \u548c GPT-4o \u9032\u884c\u57fa\u6e96\u6e2c\u8a66\uff0c\u5176\u4e2d ECG \u4f5c\u70ba HR \u4f30\u7b97\u7684\u91d1\u6a19\u6e96\u3002\u7d50\u679c\u986f\u793a\uff0c\u6211\u5011\u7684\u4ee3\u7406\u900f\u904e\u9054\u6210\u8f03\u4f4e\u7684\u932f\u8aa4\u7387\u548c\u66f4\u53ef\u9760\u7684 HR \u4f30\u7b97\uff0c\u986f\u8457\u512a\u65bc\u57fa\u6e96\u6a21\u578b\u3002\u8a72\u4ee3\u7406\u5be6\u4f5c\u5df2\u516c\u958b\u5728 GitHub \u4e0a\u3002", "author": "Mohammad Feli et.al.", "authors": "Mohammad Feli, Iman Azimi, Pasi Liljeberg, Amir M. Rahmani", "id": "2502.12836v1", "paper_url": "http://arxiv.org/abs/2502.12836v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12838/2502.12838v1.json b/database/storage/2502/12838/2502.12838v1.json
new file mode 100644
index 0000000000..a196f0ba33
--- /dev/null
+++ b/database/storage/2502/12838/2502.12838v1.json
@@ -0,0 +1 @@
+{"2502.12838": {"publish_time": "2025-02-18", "title": "Towards Equitable AI: Detecting Bias in Using Large Language Models for Marketing", "paper_summary": "The recent advances in large language models (LLMs) have revolutionized\nindustries such as finance, marketing, and customer service by enabling\nsophisticated natural language processing tasks. However, the broad adoption of\nLLMs brings significant challenges, particularly in the form of social biases\nthat can be embedded within their outputs. Biases related to gender, age, and\nother sensitive attributes can lead to unfair treatment, raising ethical\nconcerns and risking both company reputation and customer trust. This study\nexamined bias in finance-related marketing slogans generated by LLMs (i.e.,\nChatGPT) by prompting tailored ads targeting five demographic categories:\ngender, marital status, age, income level, and education level. A total of\n1,700 slogans were generated for 17 unique demographic groups, and key terms\nwere categorized into four thematic groups: empowerment, financial, benefits\nand features, and personalization. Bias was systematically assessed using\nrelative bias calculations and statistically tested with the Kolmogorov-Smirnov\n(KS) test against general slogans generated for any individual. Results\nrevealed that marketing slogans are not neutral; rather, they emphasize\ndifferent themes based on demographic factors. Women, younger individuals,\nlow-income earners, and those with lower education levels receive more distinct\nmessaging compared to older, higher-income, and highly educated individuals.\nThis underscores the need to consider demographic-based biases in AI-generated\nmarketing strategies and their broader societal implications. 
The findings of\nthis study provide a roadmap for developing more equitable AI systems,\nhighlighting the need for ongoing bias detection and mitigation efforts in\nLLMs.", "paper_summary_zh": "\u5927\u578b\u8a9e\u8a00\u6a21\u578b (LLM) \u7684\u6700\u65b0\u9032\u5c55\u5fb9\u5e95\u6539\u8b8a\u4e86\u91d1\u878d\u3001\u884c\u92b7\u548c\u5ba2\u6236\u670d\u52d9\u7b49\u7522\u696d\uff0c\u56e0\u70ba\u5b83\u80fd\u57f7\u884c\u8907\u96dc\u7684\u81ea\u7136\u8a9e\u8a00\u8655\u7406\u4efb\u52d9\u3002\u7136\u800c\uff0cLLM \u7684\u5ee3\u6cdb\u63a1\u7528\u5e36\u4f86\u91cd\u5927\u7684\u6311\u6230\uff0c\u7279\u5225\u662f\u6f5b\u85cf\u5728\u5176\u8f38\u51fa\u7d50\u679c\u4e2d\u7684\u793e\u6703\u504f\u898b\u5f62\u5f0f\u3002\u8207\u6027\u5225\u3001\u5e74\u9f61\u548c\u5176\u4ed6\u654f\u611f\u5c6c\u6027\u76f8\u95dc\u7684\u504f\u898b\u53ef\u80fd\u5c0e\u81f4\u4e0d\u516c\u5e73\u7684\u5f85\u9047\uff0c\u5f15\u767c\u9053\u5fb7\u554f\u984c\uff0c\u4e26\u5371\u53ca\u516c\u53f8\u8072\u8b7d\u548c\u5ba2\u6236\u4fe1\u4efb\u3002\u672c\u7814\u7a76\u63a2\u8a0e\u4e86 LLM\uff08\u5373 ChatGPT\uff09\u7522\u751f\u7684\u8207\u91d1\u878d\u76f8\u95dc\u7684\u884c\u92b7\u6a19\u8a9e\u4e2d\u7684\u504f\u898b\uff0c\u65b9\u6cd5\u662f\u91dd\u5c0d\u4e94\u500b\u4eba\u53e3\u7d71\u8a08\u985e\u5225\uff1a\u6027\u5225\u3001\u5a5a\u59fb\u72c0\u6cc1\u3001\u5e74\u9f61\u3001\u6536\u5165\u6c34\u6e96\u548c\u6559\u80b2\u6c34\u6e96\uff0c\u63d0\u793a\u91cf\u8eab\u6253\u9020\u7684\u5ee3\u544a\u3002\u7e3d\u5171\u70ba 17 \u500b\u7368\u7279\u7684\u4eba\u53e3\u7d71\u8a08\u7fa4\u7d44\u7522\u751f\u4e86 1,700 \u500b\u6a19\u8a9e\uff0c\u4e26\u4e14\u95dc\u9375\u8a5e\u88ab\u5206\u985e\u70ba\u56db\u500b\u4e3b\u984c\u7fa4\u7d44\uff1a\u8ce6\u6b0a\u3001\u8ca1\u52d9\u3001\u597d\u8655\u548c\u529f\u80fd\uff0c\u4ee5\u53ca\u500b\u4eba\u5316\u3002\u504f\u898b\u4f7f\u7528\u76f8\u5c0d\u504f\u898b\u8a08\u7b97\u9032\u884c\u7cfb\u7d71\u6027\u8a55\u4f30\uff0c\u4e26\u4f7f\u7528\u79d1\u723e\u83ab\u54e5\u6d1b\u592b-\u53f2\u7c73\u8afe\u592b (KS) 
\u6aa2\u5b9a\u8207\u91dd\u5c0d\u4efb\u4f55\u500b\u4eba\u7522\u751f\u7684\u901a\u7528\u6a19\u8a9e\u9032\u884c\u7d71\u8a08\u6aa2\u5b9a\u3002\u7d50\u679c\u986f\u793a\u884c\u92b7\u6a19\u8a9e\u4e26\u975e\u4e2d\u7acb\uff1b\u76f8\u53cd\u5730\uff0c\u5b83\u5011\u6839\u64da\u4eba\u53e3\u7d71\u8a08\u56e0\u7d20\u5f37\u8abf\u4e0d\u540c\u7684\u4e3b\u984c\u3002\u8207\u5e74\u7d00\u8f03\u5927\u3001\u6536\u5165\u8f03\u9ad8\u548c\u53d7\u6559\u80b2\u7a0b\u5ea6\u8f03\u9ad8\u7684\u500b\u4eba\u76f8\u6bd4\uff0c\u5973\u6027\u3001\u5e74\u8f15\u4eba\u3001\u4f4e\u6536\u5165\u8005\u548c\u6559\u80b2\u7a0b\u5ea6\u8f03\u4f4e\u8005\u63a5\u6536\u5230\u7684\u8a0a\u606f\u66f4\u70ba\u4e0d\u540c\u3002\u9019\u5f37\u8abf\u4e86\u5728 AI \u751f\u6210\u7684\u884c\u92b7\u7b56\u7565\u4e2d\u8003\u91cf\u57fa\u65bc\u4eba\u53e3\u7d71\u8a08\u7684\u504f\u898b\u53ca\u5176\u66f4\u5ee3\u6cdb\u7684\u793e\u6703\u5f71\u97ff\u7684\u5fc5\u8981\u6027\u3002\u672c\u7814\u7a76\u7684\u767c\u73fe\u63d0\u4f9b\u4e86\u958b\u767c\u66f4\u516c\u5e73 AI \u7cfb\u7d71\u7684\u8def\u7dda\u5716\uff0c\u7a81\u986f\u4e86\u5728 LLM \u4e2d\u6301\u7e8c\u9032\u884c\u504f\u898b\u5075\u6e2c\u548c\u7de9\u89e3\u5de5\u4f5c\u7684\u91cd\u8981\u6027\u3002", "author": "Berk Yilmaz et.al.", "authors": "Berk Yilmaz, Huthaifa I. Ashqar", "id": "2502.12838v1", "paper_url": "http://arxiv.org/abs/2502.12838v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12842/2502.12842v1.json b/database/storage/2502/12842/2502.12842v1.json
new file mode 100644
index 0000000000..b3fe85365e
--- /dev/null
+++ b/database/storage/2502/12842/2502.12842v1.json
@@ -0,0 +1 @@
+{"2502.12842": {"publish_time": "2025-02-18", "title": "Towards Adaptive Feedback with AI: Comparing the Feedback Quality of LLMs and Teachers on Experimentation Protocols", "paper_summary": "Effective feedback is essential for fostering students' success in scientific\ninquiry. With advancements in artificial intelligence, large language models\n(LLMs) offer new possibilities for delivering instant and adaptive feedback.\nHowever, this feedback often lacks the pedagogical validation provided by\nreal-world practitioners. To address this limitation, our study evaluates and\ncompares the feedback quality of LLM agents with that of human teachers and\nscience education experts on student-written experimentation protocols. Four\nblinded raters, all professionals in scientific inquiry and science education,\nevaluated the feedback texts generated by 1) the LLM agent, 2) the teachers and\n3) the science education experts using a five-point Likert scale based on six\ncriteria of effective feedback: Feed Up, Feed Back, Feed Forward, Constructive\nTone, Linguistic Clarity, and Technical Terminology. Our results indicate that\nLLM-generated feedback shows no significant difference to that of teachers and\nexperts in overall quality. However, the LLM agent's performance lags in the\nFeed Back dimension, which involves identifying and explaining errors within\nthe student's work context. Qualitative analysis highlighted the LLM agent's\nlimitations in contextual understanding and in the clear communication of\nspecific errors. 
Our findings suggest that combining LLM-generated feedback\nwith human expertise can enhance educational practices by leveraging the\nefficiency of LLMs and the nuanced understanding of educators.", "paper_summary_zh": "\u6709\u6548\u7684\u56de\u994b\u5c0d\u65bc\u57f9\u990a\u5b78\u751f\u5728\u79d1\u5b78\u63a2\u7a76\u4e2d\u7684\u6210\u529f\u81f3\u95dc\u91cd\u8981\u3002\u96a8\u8457\u4eba\u5de5\u667a\u6167\u7684\u9032\u6b65\uff0c\u5927\u578b\u8a9e\u8a00\u6a21\u578b (LLM) \u70ba\u63d0\u4f9b\u5373\u6642\u4e14\u9069\u61c9\u6027\u7684\u56de\u994b\u63d0\u4f9b\u4e86\u65b0\u7684\u53ef\u80fd\u6027\u3002\u7136\u800c\uff0c\u6b64\u56de\u994b\u901a\u5e38\u7f3a\u4e4f\u5be6\u969b\u5f9e\u696d\u8005\u63d0\u4f9b\u7684\u6559\u5b78\u9a57\u8b49\u3002\u70ba\u4e86\u89e3\u6c7a\u6b64\u9650\u5236\uff0c\u6211\u5011\u7684\u7814\u7a76\u8a55\u4f30\u4e26\u6bd4\u8f03\u4e86 LLM \u4ee3\u7406\u8207\u4eba\u985e\u6559\u5e2b\u548c\u79d1\u5b78\u6559\u80b2\u5c08\u5bb6\u5728\u5b78\u751f\u64b0\u5beb\u7684\u5be6\u9a57\u5354\u5b9a\u4e0a\u7684\u56de\u994b\u54c1\u8cea\u3002\u56db\u4f4d\u76f2\u8a55\u8005\uff0c\u7686\u70ba\u79d1\u5b78\u63a2\u7a76\u548c\u79d1\u5b78\u6559\u80b2\u5c08\u696d\u4eba\u58eb\uff0c\u4f7f\u7528\u57fa\u65bc\u516d\u500b\u6709\u6548\u56de\u994b\u6e96\u5247\u7684\u4e94\u9ede\u674e\u514b\u7279\u91cf\u8868\u8a55\u4f30\u7531 1) LLM \u4ee3\u7406\u30012) \u6559\u5e2b\u548c 3) \u79d1\u5b78\u6559\u80b2\u5c08\u5bb6\u7522\u751f\u7684\u56de\u994b\u6587\u5b57\uff1a\u9f13\u52f5\u3001\u56de\u994b\u3001\u524d\u994b\u3001\u5efa\u8a2d\u6027\u8a9e\u6c23\u3001\u8a9e\u8a00\u6e05\u6670\u5ea6\u548c\u6280\u8853\u8853\u8a9e\u3002\u6211\u5011\u7684\u7d50\u679c\u8868\u660e\uff0cLLM \u7522\u751f\u7684\u56de\u994b\u5728\u6574\u9ad4\u54c1\u8cea\u4e0a\u8207\u6559\u5e2b\u548c\u5c08\u5bb6\u7522\u751f\u7684\u56de\u994b\u6c92\u6709\u986f\u8457\u5dee\u7570\u3002\u7136\u800c\uff0cLLM 
\u4ee3\u7406\u7684\u8868\u73fe\u843d\u5f8c\u65bc\u56de\u994b\u9762\u5411\uff0c\u9019\u6d89\u53ca\u5728\u5b78\u751f\u7684\u4f5c\u696d\u80cc\u666f\u4e2d\u8b58\u5225\u548c\u89e3\u91cb\u932f\u8aa4\u3002\u5b9a\u6027\u5206\u6790\u7a81\u986f\u4e86 LLM \u4ee3\u7406\u5728\u60c5\u5883\u7406\u89e3\u548c\u660e\u78ba\u50b3\u9054\u7279\u5b9a\u932f\u8aa4\u65b9\u9762\u7684\u9650\u5236\u3002\u6211\u5011\u7684\u7814\u7a76\u7d50\u679c\u8868\u660e\uff0c\u5c07 LLM \u7522\u751f\u7684\u56de\u994b\u8207\u4eba\u985e\u5c08\u696d\u77e5\u8b58\u76f8\u7d50\u5408\uff0c\u53ef\u4ee5\u900f\u904e\u5229\u7528 LLM \u7684\u6548\u7387\u548c\u6559\u80b2\u8005\u7684\u7d30\u7dfb\u7406\u89e3\u4f86\u63d0\u5347\u6559\u80b2\u5be6\u52d9\u3002", "author": "Kathrin Se\u00dfler et.al.", "authors": "Kathrin Se\u00dfler, Arne Bewersdorff, Claudia Nerdel, Enkelejda Kasneci", "id": "2502.12842v1", "paper_url": "http://arxiv.org/abs/2502.12842v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12851/2502.12851v1.json b/database/storage/2502/12851/2502.12851v1.json
new file mode 100644
index 0000000000..f81a4c0e35
--- /dev/null
+++ b/database/storage/2502/12851/2502.12851v1.json
@@ -0,0 +1 @@
+{"2502.12851": {"publish_time": "2025-02-18", "title": "MeMo: Towards Language Models with Associative Memory Mechanisms", "paper_summary": "Memorization is a fundamental ability of Transformer-based Large Language\nModels, achieved through learning. In this paper, we propose a paradigm shift\nby designing an architecture to memorize text directly, bearing in mind the\nprinciple that memorization precedes learning. We introduce MeMo, a novel\narchitecture for language modeling that explicitly memorizes sequences of\ntokens in layered associative memories. By design, MeMo offers transparency and\nthe possibility of model editing, including forgetting texts. We experimented\nwith the MeMo architecture, showing the memorization power of the one-layer and\nthe multi-layer configurations.", "paper_summary_zh": "\u8a18\u61b6\u662f Transformer \u5927\u578b\u8a9e\u8a00\u6a21\u578b\u7684\u57fa\u672c\u80fd\u529b\uff0c\u53ef\u900f\u904e\u5b78\u7fd2\u9054\u6210\u3002\u5728\u672c\u6587\u4e2d\uff0c\u6211\u5011\u63d0\u51fa\u4e00\u500b\u5178\u7bc4\u8f49\u79fb\uff0c\u900f\u904e\u8a2d\u8a08\u4e00\u500b\u67b6\u69cb\u4f86\u76f4\u63a5\u8a18\u61b6\u6587\u5b57\uff0c\u4e26\u7262\u8a18\u8a18\u61b6\u5148\u65bc\u5b78\u7fd2\u7684\u539f\u5247\u3002\u6211\u5011\u5c0e\u5165 MeMo\uff0c\u4e00\u500b\u65b0\u7a4e\u7684\u8a9e\u8a00\u5efa\u6a21\u67b6\u69cb\uff0c\u53ef\u660e\u78ba\u5730\u8a18\u61b6\u5206\u5c64\u95dc\u806f\u5f0f\u8a18\u61b6\u4e2d\u7684\u4ee3\u5e63\u5e8f\u5217\u3002\u900f\u904e\u8a2d\u8a08\uff0cMeMo \u63d0\u4f9b\u900f\u660e\u5ea6\u548c\u6a21\u578b\u7de8\u8f2f\u7684\u53ef\u80fd\u6027\uff0c\u5305\u62ec\u907a\u5fd8\u6587\u5b57\u3002\u6211\u5011\u5be6\u9a57\u4e86 MeMo \u67b6\u69cb\uff0c\u5c55\u793a\u4e86\u55ae\u5c64\u548c\u591a\u5c64\u7d44\u614b\u7684\u8a18\u61b6\u529b\u3002", "author": "Fabio Massimo Zanzotto et.al.", "authors": "Fabio Massimo Zanzotto, Elena Sofia Ruzzetti, Giancarlo A. 
Xompero, Leonardo Ranaldi, Davide Venditti, Federico Ranaldi, Cristina Giannone, Andrea Favalli, Raniero Romagnoli", "id": "2502.12851v1", "paper_url": "http://arxiv.org/abs/2502.12851v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12852/2502.12852v1.json b/database/storage/2502/12852/2502.12852v1.json
new file mode 100644
index 0000000000..a5520ae1c8
--- /dev/null
+++ b/database/storage/2502/12852/2502.12852v1.json
@@ -0,0 +1 @@
+{"2502.12852": {"publish_time": "2025-02-18", "title": "MVL-SIB: A Massively Multilingual Vision-Language Benchmark for Cross-Modal Topical Matching", "paper_summary": "Existing multilingual vision-language (VL) benchmarks often only cover a\nhandful of languages. Consequently, evaluations of large vision-language models\n(LVLMs) predominantly target high-resource languages, underscoring the need for\nevaluation data for low-resource languages. To address this limitation, we\nintroduce MVL-SIB, a massively multilingual vision-language benchmark that\nevaluates both cross-modal and text-only topical matching across 205 languages\n-- over 100 more than the most multilingual existing VL benchmarks encompass.\nWe then benchmark a range of open-weight LVLMs together with GPT-4o(-mini)\non MVL-SIB. Our results reveal that LVLMs struggle in cross-modal topic\nmatching in lower-resource languages, performing no better than chance on\nlanguages like N'Koo. Our analysis further reveals that VL support in LVLMs\ndeclines disproportionately relative to textual support for lower-resource\nlanguages, as evidenced by comparison of cross-modal and text-only topical\nmatching performance. We further observe that open-weight LVLMs do not benefit\nfrom representing a topic with more than one image, suggesting that these\nmodels are not yet fully effective at handling multi-image tasks. 
By\ncorrelating performance on MVL-SIB with other multilingual VL benchmarks, we\nhighlight that MVL-SIB serves as a comprehensive probe of multilingual VL\nunderstanding in LVLMs.", "paper_summary_zh": "\u73fe\u6709\u7684\u591a\u8a9e\u8a00\u8996\u89ba\u8a9e\u8a00 (VL) \u57fa\u6e96\u901a\u5e38\u53ea\u6db5\u84cb\u5c11\u6578\u8a9e\u8a00\u3002\u56e0\u6b64\uff0c\u5927\u578b\u8996\u89ba\u8a9e\u8a00\u6a21\u578b (LVLMs) \u7684\u8a55\u4f30\u4e3b\u8981\u91dd\u5c0d\u8cc7\u6e90\u8c50\u5bcc\u7684\u8a9e\u8a00\uff0c\u5f37\u8abf\u4e86\u5c0d\u8cc7\u6e90\u5331\u4e4f\u8a9e\u8a00\u7684\u8a55\u4f30\u8cc7\u6599\u7684\u9700\u6c42\u3002\u70ba\u4e86\u89e3\u6c7a\u6b64\u9650\u5236\uff0c\u6211\u5011\u5f15\u5165\u4e86 MVL-SIB\uff0c\u4e00\u500b\u5927\u898f\u6a21\u7684\u591a\u8a9e\u8a00\u8996\u89ba\u8a9e\u8a00\u57fa\u6e96\uff0c\u5b83\u8a55\u4f30\u4e86 205 \u7a2e\u8a9e\u8a00\u7684\u8de8\u6a21\u614b\u548c\u7d14\u6587\u5b57\u4e3b\u984c\u5339\u914d\uff0c\u6bd4\u73fe\u6709\u7684\u591a\u8a9e\u8a00 VL \u57fa\u6e96\u6db5\u84cb\u7684\u8a9e\u8a00\u591a\u51fa 100 \u591a\u7a2e\u3002\u7136\u5f8c\uff0c\u6211\u5011\u5728 MVL-SIB \u4e0a\u5c0d\u4e00\u7cfb\u5217\u958b\u653e\u6b0a\u91cd\u7684 LVLMs \u8207 GPT-4o(-mini) \u9032\u884c\u4e86\u57fa\u6e96\u6e2c\u8a66\u3002\u6211\u5011\u7684\u7d50\u679c\u8868\u660e\uff0cLVLMs \u5728\u8cc7\u6e90\u8f03\u5c11\u7684\u8a9e\u8a00\u4e2d\u96e3\u4ee5\u9032\u884c\u8de8\u6a21\u614b\u4e3b\u984c\u5339\u914d\uff0c\u5728 N'Koo \u7b49\u8a9e\u8a00\u4e0a\u7684\u8868\u73fe\u4e0d\u6bd4\u96a8\u6a5f\u597d\u3002\u6211\u5011\u7684\u5206\u6790\u9032\u4e00\u6b65\u8868\u660e\uff0cLVLMs \u4e2d\u7684 VL \u652f\u63f4\u76f8\u5c0d\u65bc\u8cc7\u6e90\u8f03\u5c11\u7684\u8a9e\u8a00\u7684\u6587\u5b57\u652f\u63f4\u4e0b\u964d\u5f97\u4e0d\u6210\u6bd4\u4f8b\uff0c\u9019\u5f9e\u8de8\u6a21\u614b\u548c\u7d14\u6587\u5b57\u4e3b\u984c\u5339\u914d\u6548\u80fd\u7684\u6bd4\u8f03\u4e2d\u53ef\u4ee5\u770b\u51fa\u3002\u6211\u5011\u9032\u4e00\u6b65\u89c0\u5bdf\u5230\uff0c\u958b\u653e\u6b0a\u91cd\u7684 LVLMs 
\u7121\u6cd5\u5f9e\u7528\u591a\u65bc\u4e00\u5f35\u5f71\u50cf\u4f86\u8868\u793a\u4e3b\u984c\u4e2d\u53d7\u76ca\uff0c\u9019\u8868\u660e\u9019\u4e9b\u6a21\u578b\u5728\u8655\u7406\u591a\u5f71\u50cf\u4efb\u52d9\u65b9\u9762\u5c1a\u672a\u5b8c\u5168\u6709\u6548\u3002\u901a\u904e\u5c07 MVL-SIB \u4e0a\u7684\u6548\u80fd\u8207\u5176\u4ed6\u591a\u8a9e\u8a00 VL \u57fa\u6e96\u76f8\u95dc\u806f\uff0c\u6211\u5011\u5f37\u8abf MVL-SIB \u53ef\u4f5c\u70ba LVLMs \u4e2d\u591a\u8a9e\u8a00 VL \u7406\u89e3\u7684\u7d9c\u5408\u63a2\u6e2c\u3002", "author": "Fabian David Schmidt et.al.", "authors": "Fabian David Schmidt, Florian Schneider, Chris Biemann, Goran Glava\u0161", "id": "2502.12852v1", "paper_url": "http://arxiv.org/abs/2502.12852v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12853/2502.12853v1.json b/database/storage/2502/12853/2502.12853v1.json
new file mode 100644
index 0000000000..ba1d6fe3dc
--- /dev/null
+++ b/database/storage/2502/12853/2502.12853v1.json
@@ -0,0 +1 @@
+{"2502.12853": {"publish_time": "2025-02-18", "title": "S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning", "paper_summary": "Recent studies have demonstrated the effectiveness of LLM test-time scaling.\nHowever, existing approaches to incentivize LLMs' deep thinking abilities\ngenerally require large-scale data or significant training efforts. Meanwhile,\nit remains unclear how to improve the thinking abilities of less powerful base\nmodels. In this work, we introduce S$^2$R, an efficient framework that enhances\nLLM reasoning by teaching models to self-verify and self-correct during\ninference. Specifically, we first initialize LLMs with iterative\nself-verification and self-correction behaviors through supervised fine-tuning\non carefully curated data. The self-verification and self-correction skills are\nthen further strengthened by both outcome-level and process-level reinforcement\nlearning, with minimized resource requirements, enabling the model to\nadaptively refine its reasoning process during inference. Our results\ndemonstrate that, with only 3.1k self-verifying and self-correcting behavior\ninitialization samples, Qwen2.5-math-7B achieves an accuracy improvement from\n51.0\\% to 81.6\\%, outperforming models trained on an equivalent amount of\nlong-CoT distilled data. Extensive experiments and analysis based on three base\nmodels across both in-domain and out-of-domain benchmarks validate the\neffectiveness of S$^2$R. 
Our code and data are available at\nhttps://github.com/NineAbyss/S2R.", "paper_summary_zh": "<paragraph>\u6700\u8fd1\u7684\u7814\u7a76\u8868\u660e\u4e86 LLM \u6d4b\u8bd5\u65f6\u95f4\u6269\u5c55\u7684\u6709\u6548\u6027\u3002\n\u7136\u800c\uff0c\u73b0\u6709\u6fc0\u52b1 LLM \u6df1\u5ea6\u601d\u8003\u80fd\u529b\u7684\u65b9\u6cd5\n\u901a\u5e38\u9700\u8981\u5927\u89c4\u6a21\u6570\u636e\u6216\u5927\u91cf\u7684\u8bad\u7ec3\u5de5\u4f5c\u3002\u540c\u65f6\uff0c\n\u5982\u4f55\u63d0\u9ad8\u8f83\u5f31\u57fa\u7840\u6a21\u578b\u7684\u601d\u8003\u80fd\u529b\u4ecd\u7136\u4e0d\u6e05\u695a\u3002\u5728\u8fd9\u9879\u5de5\u4f5c\u4e2d\uff0c\u6211\u4eec\u5f15\u5165\u4e86 S$^2$R\uff0c\u4e00\u4e2a\u901a\u8fc7\u6559\u5bfc\u6a21\u578b\u5728\n\u63a8\u7406\u8fc7\u7a0b\u4e2d\u8fdb\u884c\u81ea\u6211\u9a8c\u8bc1\u548c\u81ea\u6211\u7ea0\u6b63\u6765\u589e\u5f3a LLM \u63a8\u7406\u7684\u6709\u6548\u6846\u67b6\u3002\u5177\u4f53\u6765\u8bf4\uff0c\u6211\u4eec\u9996\u5148\u901a\u8fc7\u76d1\u7763\u5fae\u8c03\u5bf9\u7cbe\u5fc3\u6574\u7406\u7684\u6570\u636e\u6765\u521d\u59cb\u5316\u5177\u6709\u8fed\u4ee3\u81ea\u6211\u9a8c\u8bc1\u548c\u81ea\u6211\u7ea0\u6b63\u884c\u4e3a\u7684 LLM\u3002\u7136\u540e\u901a\u8fc7\u7ed3\u679c\u7ea7\u522b\u548c\u8fc7\u7a0b\u7ea7\u522b\u7684\u5f3a\u5316\n\u5b66\u4e60\u8fdb\u4e00\u6b65\u52a0\u5f3a\u81ea\u6211\u9a8c\u8bc1\u548c\u81ea\u6211\u7ea0\u6b63\u6280\u80fd\uff0c\u540c\u65f6\u6700\u5927\u7a0b\u5ea6\u5730\u51cf\u5c11\u8d44\u6e90\u9700\u6c42\uff0c\u4f7f\u6a21\u578b\u80fd\u591f\n\u5728\u63a8\u7406\u8fc7\u7a0b\u4e2d\u81ea\u9002\u5e94\u5730\u4f18\u5316\u5176\u63a8\u7406\u8fc7\u7a0b\u3002\u6211\u4eec\u7684\u7ed3\u679c\n\u8868\u660e\uff0c\u4ec5\u4f7f\u7528 3.1k \u4e2a\u81ea\u6211\u9a8c\u8bc1\u548c\u81ea\u6211\u7ea0\u6b63\u884c\u4e3a\n\u521d\u59cb\u5316\u6837\u672c\uff0cQwen2.5-math-7B \u7684\u51c6\u786e\u7387\u4ece\n51.0% \u63d0\u9ad8\u5230 81.6%\uff0c\u4f18\u4e8e\u5728\u7b49\u91cf\u957f CoT 
\u84b8\u998f\u6570\u636e\u4e0a\u8bad\u7ec3\u7684\u6a21\u578b\u3002\u57fa\u4e8e\u4e09\u4e2a\u57fa\u7840\u6a21\u578b\u5728\u57df\u5185\u548c\u57df\u5916\u57fa\u51c6\u4e0a\u7684\u5e7f\u6cdb\u5b9e\u9a8c\u548c\u5206\u6790\u9a8c\u8bc1\u4e86\nS$^2$R \u7684\u6709\u6548\u6027\u3002\u6211\u4eec\u7684\u4ee3\u7801\u548c\u6570\u636e\u53ef\u4ee5\u5728\nhttps://github.com/NineAbyss/S2R \u83b7\u5f97\u3002</paragraph>", "author": "Ruotian Ma et.al.", "authors": "Ruotian Ma, Peisong Wang, Cheng Liu, Xingyan Liu, Jiaqi Chen, Bang Zhang, Xin Zhou, Nan Du, Jia Li", "id": "2502.12853v1", "paper_url": "http://arxiv.org/abs/2502.12853v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12855/2502.12855v1.json b/database/storage/2502/12855/2502.12855v1.json
new file mode 100644
index 0000000000..76f8447d50
--- /dev/null
+++ b/database/storage/2502/12855/2502.12855v1.json
@@ -0,0 +1 @@
+{"2502.12855": {"publish_time": "2025-02-18", "title": "Integrating Arithmetic Learning Improves Mathematical Reasoning in Smaller Models", "paper_summary": "While large models pre-trained on high-quality data exhibit excellent\nperformance across various reasoning tasks, including mathematical reasoning\n(e.g. GSM8k, MultiArith), specializing smaller models to excel at mathematical\nreasoning remains a challenging problem. Common approaches to address this\nchallenge include knowledge distillation, where smaller student models learn\nfrom large pre-trained teacher models, and data augmentation, such as\nrephrasing questions. Despite these efforts, smaller models struggle with\narithmetic computations, leading to errors in mathematical reasoning. In this\nwork, we focus on leveraging a programmatically generated arithmetic dataset to\nenhance the reasoning capabilities of smaller models. We investigate two key\napproaches to incorporate this dataset -- (1) intermediate fine-tuning, where a\nmodel is fine-tuned on the arithmetic dataset before being trained on a\nreasoning dataset, and (2) integrating the arithmetic dataset into the\ninstruction-tuning mixture, allowing the model to learn arithmetic skills\nalongside general instruction-following abilities. 
Our experiments on multiple\nreasoning benchmarks demonstrate that incorporating an arithmetic dataset,\nwhether through targeted fine-tuning or within the instruction-tuning mixture,\nenhances the models' arithmetic capabilities, which in turn improves their\nmathematical reasoning performance.", "paper_summary_zh": "\u5927\u578b\u6a21\u578b\u7ecf\u8fc7\u9488\u5bf9\u9ad8\u8d28\u91cf\u6570\u636e\u7684\u9884\u8bad\u7ec3\uff0c\u5728\u5404\u79cd\u63a8\u7406\u4efb\u52a1\u4e2d\u8868\u73b0\u51fa\u8272\uff0c\u5305\u62ec\u6570\u5b66\u63a8\u7406\uff08\u4f8b\u5982 GSM8k\u3001MultiArith\uff09\uff0c\u4f46\u4e13\u95e8\u5316\u5c0f\u578b\u6a21\u578b\u4ee5\u64c5\u957f\u6570\u5b66\u63a8\u7406\u4ecd\u7136\u662f\u4e00\u4e2a\u5177\u6709\u6311\u6218\u6027\u7684\u95ee\u9898\u3002\u89e3\u51b3\u8fd9\u4e00\u6311\u6218\u7684\u5e38\u89c1\u65b9\u6cd5\u5305\u62ec\u77e5\u8bc6\u84b8\u998f\uff0c\u5176\u4e2d\u8f83\u5c0f\u7684\u5b66\u751f\u6a21\u578b\u4ece\u7ecf\u8fc7\u9884\u8bad\u7ec3\u7684\u5927\u578b\u6559\u5e08\u6a21\u578b\u4e2d\u5b66\u4e60\uff0c\u4ee5\u53ca\u6570\u636e\u589e\u5f3a\uff0c\u4f8b\u5982\u91cd\u65b0\u8868\u8ff0\u95ee\u9898\u3002\u5c3d\u7ba1\u505a\u51fa\u4e86\u8fd9\u4e9b\u52aa\u529b\uff0c\u8f83\u5c0f\u7684\u6a21\u578b\u5728\u7b97\u672f\u8ba1\u7b97\u4e2d\u4ecd\u7136\u5b58\u5728\u56f0\u96be\uff0c\u4ece\u800c\u5bfc\u81f4\u6570\u5b66\u63a8\u7406\u9519\u8bef\u3002\u5728\u8fd9\u9879\u5de5\u4f5c\u4e2d\uff0c\u6211\u4eec\u4e13\u6ce8\u4e8e\u5229\u7528\u7a0b\u5e8f\u5316\u751f\u6210\u7684\u7b97\u672f\u6570\u636e\u96c6\u6765\u589e\u5f3a\u8f83\u5c0f\u6a21\u578b\u7684\u63a8\u7406\u80fd\u529b\u3002\u6211\u4eec\u7814\u7a76\u4e86\u4e24\u79cd\u5173\u952e\u65b9\u6cd5\u6765\u5408\u5e76\u6b64\u6570\u636e\u96c6\u2014\u2014\uff081\uff09\u4e2d\u95f4\u5fae\u8c03\uff0c\u5176\u4e2d\u6a21\u578b\u5728\u7b97\u672f\u6570\u636e\u96c6\u4e0a\u8fdb\u884c\u5fae\u8c03\uff0c\u7136\u540e\u5728\u63a8\u7406\u6570\u636e\u96c6\u4e0a\u8fdb\u884c\u8bad\u7ec3\uff0c\u4ee5\u53ca\uff082\uff09\u5c06\u7b97\u672f\u6570\u636e\u96c6\u96c6
\u6210\u5230\u6307\u4ee4\u5fae\u8c03\u6df7\u5408\u4e2d\uff0c\u5141\u8bb8\u6a21\u578b\u5b66\u4e60\u7b97\u672f\u6280\u80fd\u4ee5\u53ca\u4e00\u822c\u7684\u6307\u4ee4\u9075\u5faa\u80fd\u529b\u3002\u6211\u4eec\u5728\u591a\u4e2a\u63a8\u7406\u57fa\u51c6\u4e0a\u7684\u5b9e\u9a8c\u8868\u660e\uff0c\u901a\u8fc7\u6709\u9488\u5bf9\u6027\u7684\u5fae\u8c03\u6216\u5728\u6307\u4ee4\u5fae\u8c03\u6df7\u5408\u4e2d\u5408\u5e76\u7b97\u672f\u6570\u636e\u96c6\uff0c\u589e\u5f3a\u4e86\u6a21\u578b\u7684\u7b97\u672f\u80fd\u529b\uff0c\u8fdb\u800c\u63d0\u9ad8\u4e86\u5b83\u4eec\u7684\u6570\u5b66\u63a8\u7406\u6027\u80fd\u3002", "author": "Neeraj Gangwar et.al.", "authors": "Neeraj Gangwar, Suma P Bhat, Nickvash Kani", "id": "2502.12855v1", "paper_url": "http://arxiv.org/abs/2502.12855v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12858/2502.12858v1.json b/database/storage/2502/12858/2502.12858v1.json
new file mode 100644
index 0000000000..1f3548a227
--- /dev/null
+++ b/database/storage/2502/12858/2502.12858v1.json
@@ -0,0 +1 @@
+{"2502.12858": {"publish_time": "2025-02-18", "title": "Rejected Dialects: Biases Against African American Language in Reward Models", "paper_summary": "Preference alignment via reward models helps build safe, helpful, and\nreliable large language models (LLMs). However, subjectivity in preference\njudgments and the lack of representative sampling in preference data collection\ncan introduce new biases, hindering reward models' fairness and equity. In this\nwork, we introduce a framework for evaluating dialect biases in reward models\nand conduct a case study on biases against African American Language (AAL)\nthrough several experiments comparing reward model preferences and behavior on\npaired White Mainstream English (WME) and both machine-translated and\nhuman-written AAL corpora. We show that reward models are less aligned with\nhuman preferences when processing AAL texts vs. WME ones (-4\\% accuracy on\naverage), frequently disprefer AAL-aligned texts vs. WME-aligned ones, and\nsteer conversations toward WME, even when prompted with AAL texts. 
Our findings\nprovide a targeted analysis of anti-AAL biases at a relatively understudied\nstage in LLM development, highlighting representational harms and ethical\nquestions about the desired behavior of LLMs concerning AAL.", "paper_summary_zh": "\u900f\u904e\u734e\u52f5\u6a21\u578b\u9032\u884c\u504f\u597d\u6bd4\u5c0d\u6709\u52a9\u65bc\u5efa\u7acb\u5b89\u5168\u3001\u6709\u7528\u7684\u53ef\u9760\u5927\u578b\u8a9e\u8a00\u6a21\u578b (LLM)\u3002\u7136\u800c\uff0c\u504f\u597d\u5224\u65b7\u7684\u4e3b\u89c0\u6027\uff0c\u4ee5\u53ca\u504f\u597d\u8cc7\u6599\u6536\u96c6\u4e2d\u7f3a\u4e4f\u4ee3\u8868\u6027\u62bd\u6a23\uff0c\u53ef\u80fd\u6703\u5f15\u9032\u65b0\u7684\u504f\u8aa4\uff0c\u963b\u7919\u734e\u52f5\u6a21\u578b\u7684\u516c\u5e73\u6027\u548c\u516c\u6b63\u6027\u3002\u5728\u9019\u9805\u5de5\u4f5c\u4e2d\uff0c\u6211\u5011\u5f15\u9032\u4e00\u500b\u7528\u65bc\u8a55\u4f30\u734e\u52f5\u6a21\u578b\u4e2d\u65b9\u8a00\u504f\u8aa4\u7684\u67b6\u69cb\uff0c\u4e26\u900f\u904e\u6578\u500b\u5be6\u9a57\u9032\u884c\u6848\u4f8b\u7814\u7a76\uff0c\u63a2\u8a0e\u91dd\u5c0d\u975e\u88d4\u7f8e\u570b\u4eba\u8a9e\u8a00 (AAL) \u7684\u504f\u8aa4\uff0c\u9019\u4e9b\u5be6\u9a57\u6bd4\u8f03\u4e86\u734e\u52f5\u6a21\u578b\u504f\u597d\u548c\u884c\u70ba\uff0c\u6bd4\u8f03\u6210\u5c0d\u7684\u767d\u4eba\u4e3b\u6d41\u82f1\u8a9e (WME) \u8207\u6a5f\u5668\u7ffb\u8b6f\u548c\u4eba\u985e\u64b0\u5beb\u7684 AAL \u8a9e\u6599\u5eab\u3002\u6211\u5011\u986f\u793a\uff0c\u8207\u8655\u7406 WME \u6587\u5b57\u76f8\u6bd4\uff0c\u734e\u52f5\u6a21\u578b\u5728\u8655\u7406 AAL \u6587\u5b57\u6642\u8207\u4eba\u985e\u504f\u597d\u8f03\u4e0d\u4e00\u81f4\uff08\u5e73\u5747\u6e96\u78ba\u5ea6\u964d\u4f4e 4%\uff09\uff0c\u7d93\u5e38\u4e0d\u504f\u597d\u8207 AAL \u4e00\u81f4\u7684\u6587\u5b57\uff0c\u800c\u504f\u597d\u8207 WME \u4e00\u81f4\u7684\u6587\u5b57\uff0c\u4e26\u5c07\u5c0d\u8a71\u5c0e\u5411 WME\uff0c\u5373\u4f7f\u63d0\u793a\u7684\u662f AAL \u6587\u5b57\u3002\u6211\u5011\u7684\u767c\u73fe\u91dd\u5c0d LLM 
\u958b\u767c\u4e2d\u76f8\u5c0d\u672a\u53d7\u91cd\u8996\u7684\u968e\u6bb5\uff0c\u63d0\u4f9b\u91dd\u5c0d\u53cd AAL \u504f\u8aa4\u7684\u76ee\u6a19\u5206\u6790\uff0c\u5f37\u8abf\u8207\u8868\u5fb5\u76f8\u95dc\u7684\u5371\u5bb3\u548c\u95dc\u65bc LLM \u5c0d AAL \u7684\u671f\u671b\u884c\u70ba\u7684\u502b\u7406\u554f\u984c\u3002", "author": "Joel Mire et.al.", "authors": "Joel Mire, Zubin Trivadi Aysola, Daniel Chechelnitsky, Nicholas Deas, Chrysoula Zerva, Maarten Sap", "id": "2502.12858v1", "paper_url": "http://arxiv.org/abs/2502.12858v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12859/2502.12859v1.json b/database/storage/2502/12859/2502.12859v1.json
new file mode 100644
index 0000000000..4f21b9f28b
--- /dev/null
+++ b/database/storage/2502/12859/2502.12859v1.json
@@ -0,0 +1 @@
+{"2502.12859": {"publish_time": "2025-02-18", "title": "PAFT: Prompt-Agnostic Fine-Tuning", "paper_summary": "While Large Language Models (LLMs) adapt well to downstream tasks after\nfine-tuning, this adaptability often compromises prompt robustness, as even\nminor prompt variations can significantly degrade performance. To address this,\nwe propose Prompt-Agnostic Fine-Tuning(PAFT), a simple yet effective approach\nthat dynamically adjusts prompts during fine-tuning. This encourages the model\nto learn underlying task principles rather than overfitting to specific prompt\nformulations. PAFT operates in two stages: First, a diverse set of meaningful,\nsynthetic candidate prompts is constructed. Second, during fine-tuning, prompts\nare randomly sampled from this set to create dynamic training inputs. Extensive\nexperiments across diverse datasets and LLMs demonstrate that models trained\nwith PAFT exhibit strong robustness and generalization across a wide range of\nprompts, including unseen ones. This enhanced robustness improves both model\nperformance and inference speed while maintaining training efficiency. 
Ablation\nstudies further confirm the effectiveness of PAFT.", "paper_summary_zh": "\u5118\u7ba1\u5927\u578b\u8a9e\u8a00\u6a21\u578b (LLM) \u5728\u5fae\u8abf\u5f8c\u80fd\u5f88\u597d\u5730\u9069\u61c9\u4e0b\u6e38\u4efb\u52d9\uff0c\u4f46\u9019\u7a2e\u9069\u61c9\u6027\u901a\u5e38\u6703\u640d\u5bb3\u63d0\u793a\u7684\u7a69\u5065\u6027\uff0c\u56e0\u70ba\u5373\u4f7f\u5fae\u5c0f\u7684\u63d0\u793a\u8b8a\u7570\u4e5f\u6703\u5927\u5e45\u964d\u4f4e\u6548\u80fd\u3002\u70ba\u4e86\u89e3\u6c7a\u9019\u500b\u554f\u984c\uff0c\u6211\u5011\u63d0\u51fa\u63d0\u793a\u4e0d\u53ef\u77e5\u5fae\u8abf (PAFT)\uff0c\u9019\u662f\u4e00\u7a2e\u7c21\u55ae\u537b\u6709\u6548\u7684\u65b9\u6cd5\uff0c\u53ef\u4ee5\u5728\u5fae\u8abf\u671f\u9593\u52d5\u614b\u8abf\u6574\u63d0\u793a\u3002\u9019\u9f13\u52f5\u6a21\u578b\u5b78\u7fd2\u5e95\u5c64\u4efb\u52d9\u539f\u5247\uff0c\u800c\u4e0d\u662f\u904e\u5ea6\u64ec\u5408\u7279\u5b9a\u7684\u63d0\u793a\u8868\u8ff0\u3002PAFT \u5206\u70ba\u5169\u500b\u968e\u6bb5\u904b\u4f5c\uff1a\u9996\u5148\uff0c\u69cb\u5efa\u4e00\u7d44\u591a\u6a23\u5316\u3001\u6709\u610f\u7fa9\u7684\u5408\u6210\u5019\u9078\u63d0\u793a\u3002\u5176\u6b21\uff0c\u5728\u5fae\u8abf\u671f\u9593\uff0c\u5f9e\u6b64\u96c6\u5408\u4e2d\u96a8\u6a5f\u62bd\u53d6\u63d0\u793a\u4ee5\u5efa\u7acb\u52d5\u614b\u8a13\u7df4\u8f38\u5165\u3002\u91dd\u5c0d\u5404\u7a2e\u8cc7\u6599\u96c6\u548c LLM \u9032\u884c\u7684\u5ee3\u6cdb\u5be6\u9a57\u8868\u660e\uff0c\u4f7f\u7528 PAFT \u8a13\u7df4\u7684\u6a21\u578b\u5728\u5404\u7a2e\u63d0\u793a\u4e2d\u8868\u73fe\u51fa\u5f37\u5927\u7684\u7a69\u5065\u6027\u548c\u6982\u62ec\u6027\uff0c\u5305\u62ec\u672a\u898b\u904e\u7684\u63d0\u793a\u3002\u9019\u7a2e\u589e\u5f37\u7684\u7a69\u5065\u6027\u540c\u6642\u6539\u5584\u4e86\u6a21\u578b\u6548\u80fd\u548c\u63a8\u7406\u901f\u5ea6\uff0c\u540c\u6642\u7dad\u6301\u8a13\u7df4\u6548\u7387\u3002\u6d88\u878d\u7814\u7a76\u9032\u4e00\u6b65\u8b49\u5be6\u4e86 PAFT \u7684\u6709\u6548\u6027\u3002", "author": "Chenxing Wei et.al.", "authors": "Chenxing Wei, Yao Shu, Mingwen 
Ou, Ying Tiffany He, Fei Richard Yu", "id": "2502.12859v1", "paper_url": "http://arxiv.org/abs/2502.12859v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12876/2502.12876v1.json b/database/storage/2502/12876/2502.12876v1.json
new file mode 100644
index 0000000000..5508b9e2df
--- /dev/null
+++ b/database/storage/2502/12876/2502.12876v1.json
@@ -0,0 +1 @@
+{"2502.12876": {"publish_time": "2025-02-18", "title": "Continuous Learning Conversational AI: A Personalized Agent Framework via A2C Reinforcement Learning", "paper_summary": "Creating personalized and adaptable conversational AI remains a key\nchallenge. This paper introduces a Continuous Learning Conversational AI (CLCA)\napproach, implemented using A2C reinforcement learning, to move beyond static\nLarge Language Models (LLMs). We use simulated sales dialogues, generated by\nLLMs, to train an A2C agent. This agent learns to optimize conversation\nstrategies for personalization, focusing on engagement and delivering value.\nOur system architecture integrates reinforcement learning with LLMs for both\ndata creation and response selection. This method offers a practical way to\nbuild personalized AI companions that evolve through continuous learning,\nadvancing beyond traditional static LLM techniques.", "paper_summary_zh": "\u5efa\u7acb\u500b\u4eba\u5316\u4e14\u9069\u61c9\u6027\u5f37\u7684\u5c0d\u8a71\u5f0f AI \u4ecd\u7136\u662f\u4e00\u9805\u95dc\u9375\u6311\u6230\u3002\u672c\u6587\u4ecb\u7d39\u4e86\u4e00\u7a2e\u6301\u7e8c\u5b78\u7fd2\u5c0d\u8a71\u5f0f AI (CLCA) \u65b9\u6cd5\uff0c\u900f\u904e A2C \u5f37\u5316\u5b78\u7fd2\u5be6\u4f5c\uff0c\u4ee5\u8d85\u8d8a\u975c\u614b\u5927\u578b\u8a9e\u8a00\u6a21\u578b (LLM)\u3002\u6211\u5011\u4f7f\u7528 LLM \u751f\u6210\u7684\u6a21\u64ec\u92b7\u552e\u5c0d\u8a71\u4f86\u8a13\u7df4 A2C \u4ee3\u7406\u3002\u6b64\u4ee3\u7406\u6703\u5b78\u7fd2\u6700\u4f73\u5316\u5c0d\u8a71\u7b56\u7565\u4ee5\u5be6\u73fe\u500b\u4eba\u5316\uff0c\u4e26\u5c08\u6ce8\u65bc\u53c3\u8207\u548c\u63d0\u4f9b\u50f9\u503c\u3002\u6211\u5011\u7684\u7cfb\u7d71\u67b6\u69cb\u5c07\u5f37\u5316\u5b78\u7fd2\u8207 LLM \u6574\u5408\uff0c\u7528\u65bc\u8cc7\u6599\u5efa\u7acb\u548c\u56de\u61c9\u9078\u53d6\u3002\u6b64\u65b9\u6cd5\u63d0\u4f9b\u4e86\u4e00\u7a2e\u5be6\u7528\u7684\u65b9\u5f0f\u4f86\u5efa\u7acb\u500b\u4eba\u5316 AI 
\u4f34\u4fb6\uff0c\u9019\u4e9b\u4f34\u4fb6\u6703\u900f\u904e\u6301\u7e8c\u5b78\u7fd2\u800c\u6f14\u9032\uff0c\u8d85\u8d8a\u50b3\u7d71\u7684\u975c\u614b LLM \u6280\u8853\u3002", "author": "Nandakishor M et.al.", "authors": "Nandakishor M, Anjali M", "id": "2502.12876v1", "paper_url": "http://arxiv.org/abs/2502.12876v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12884/2502.12884v1.json b/database/storage/2502/12884/2502.12884v1.json
new file mode 100644
index 0000000000..36fb8c6133
--- /dev/null
+++ b/database/storage/2502/12884/2502.12884v1.json
@@ -0,0 +1 @@
+{"2502.12884": {"publish_time": "2025-02-18", "title": "How desirable is alignment between LLMs and linguistically diverse human users?", "paper_summary": "We discuss how desirable it is that Large Language Models (LLMs) be able to\nadapt or align their language behavior with users who may be diverse in their\nlanguage use. User diversity may come about among others due to i) age\ndifferences; ii) gender characteristics, and/or iii) multilingual experience,\nand associated differences in language processing and use. We consider\npotential consequences for usability, communication, and LLM development.", "paper_summary_zh": "\u6211\u5011\u63a2\u8a0e\u5927\u578b\u8a9e\u8a00\u6a21\u578b (LLM) \u80fd\u5920\u9069\u61c9\u6216\u8abf\u6574\u5176\u8a9e\u8a00\u884c\u70ba\uff0c\u4ee5\u9069\u61c9\u8a9e\u8a00\u4f7f\u7528\u53ef\u80fd\u591a\u6a23\u5316\u7684\u4f7f\u7528\u8005\uff0c\u9019\u6709\u591a\u9ebc\u53ef\u53d6\u3002\u4f7f\u7528\u8005\u591a\u6a23\u6027\u53ef\u80fd\u51fa\u65bc\u4ee5\u4e0b\u539f\u56e0\u800c\u7522\u751f\uff1ai) \u5e74\u9f61\u5dee\u7570\uff1bii) \u6027\u5225\u7279\u5fb5\uff0c\u548c/\u6216 iii) \u591a\u8a9e\u8a00\u7d93\u9a57\uff0c\u4ee5\u53ca\u8a9e\u8a00\u8655\u7406\u548c\u4f7f\u7528\u4e0a\u7684\u76f8\u95dc\u5dee\u7570\u3002\u6211\u5011\u8003\u616e\u5c0d\u53ef\u7528\u6027\u3001\u6e9d\u901a\u548c LLM \u958b\u767c\u7684\u6f5b\u5728\u5f8c\u679c\u3002", "author": "Pia Knoeferle et.al.", "authors": "Pia Knoeferle, Sebastian M\u00f6ller, Dorothea Kolossa, Veronika Solopova, Georg Rehm", "id": "2502.12884v1", "paper_url": "http://arxiv.org/abs/2502.12884v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12886/2502.12886v1.json b/database/storage/2502/12886/2502.12886v1.json
new file mode 100644
index 0000000000..65772bc0cd
--- /dev/null
+++ b/database/storage/2502/12886/2502.12886v1.json
@@ -0,0 +1 @@
+{"2502.12886": {"publish_time": "2025-02-18", "title": "Are Multilingual Language Models an Off-ramp for Under-resourced Languages? Will we arrive at Digital Language Equality in Europe in 2030?", "paper_summary": "Large language models (LLMs) demonstrate unprecedented capabilities and\ndefine the state of the art for almost all natural language processing (NLP)\ntasks and also for essentially all Language Technology (LT) applications. LLMs\ncan only be trained for languages for which a sufficient amount of pre-training\ndata is available, effectively excluding many languages that are typically\ncharacterised as under-resourced. However, there is both circumstantial and\nempirical evidence that multilingual LLMs, which have been trained using data\nsets that cover multiple languages (including under-resourced ones), do exhibit\nstrong capabilities for some of these under-resourced languages. Eventually,\nthis approach may have the potential to be a technological off-ramp for those\nunder-resourced languages for which \"native\" LLMs, and LLM-based technologies,\ncannot be developed due to a lack of training data. This paper, which\nconcentrates on European languages, examines this idea, analyses the current\nsituation in terms of technology support and summarises related work. 
The\narticle concludes by focusing on the key open questions that need to be\nanswered for the approach to be put into practice in a systematic way.", "paper_summary_zh": "\u5927\u578b\u8a9e\u8a00\u6a21\u578b (LLM) \u5c55\u73fe\u524d\u6240\u672a\u6709\u7684\u80fd\u529b\uff0c\u4e26\u5b9a\u7fa9\u4e86\u5e7e\u4e4e\u6240\u6709\u81ea\u7136\u8a9e\u8a00\u8655\u7406 (NLP) \u4efb\u52d9\u4ee5\u53ca\u6240\u6709\u8a9e\u8a00\u6280\u8853 (LT) \u61c9\u7528\u7684\u6700\u65b0\u6280\u8853\u3002LLM \u53ea\u80fd\u91dd\u5c0d\u6709\u8db3\u5920\u9810\u8a13\u7df4\u8cc7\u6599\u53ef\u7528\u7684\u8a9e\u8a00\u9032\u884c\u8a13\u7df4\uff0c\u5be6\u969b\u4e0a\u6392\u9664\u4e86\u8a31\u591a\u901a\u5e38\u88ab\u6b78\u985e\u70ba\u8cc7\u6e90\u4e0d\u8db3\u7684\u8a9e\u8a00\u3002\u7136\u800c\uff0c\u6709\u74b0\u5883\u548c\u7d93\u9a57\u8b49\u64da\u986f\u793a\uff0c\u591a\u8a9e\u8a00 LLM \u5df2\u4f7f\u7528\u6db5\u84cb\u591a\u7a2e\u8a9e\u8a00\uff08\u5305\u62ec\u8cc7\u6e90\u4e0d\u8db3\u7684\u8a9e\u8a00\uff09\u7684\u8cc7\u6599\u96c6\u9032\u884c\u8a13\u7df4\uff0c\u78ba\u5be6\u5c0d\u5176\u4e2d\u4e00\u4e9b\u8cc7\u6e90\u4e0d\u8db3\u7684\u8a9e\u8a00\u5c55\u73fe\u51fa\u5f37\u5927\u7684\u80fd\u529b\u3002\u6700\u7d42\uff0c\u9019\u7a2e\u65b9\u6cd5\u53ef\u80fd\u5177\u6709\u6210\u70ba\u90a3\u4e9b\u7531\u65bc\u7f3a\u4e4f\u8a13\u7df4\u8cc7\u6599\u800c\u7121\u6cd5\u958b\u767c\u300c\u539f\u751f\u300dLLM \u548c\u57fa\u65bc LLM \u7684\u6280\u8853\u7684\u8cc7\u6e90\u4e0d\u8db3\u8a9e\u8a00\u7684\u6280\u8853\u8df3\u677f\u7684\u6f5b\u529b\u3002\u672c\u6587\u5c08\u6ce8\u65bc\u6b50\u6d32\u8a9e\u8a00\uff0c\u63a2\u8a0e\u9019\u500b\u60f3\u6cd5\uff0c\u5206\u6790\u6280\u8853\u652f\u63f4\u65b9\u9762\u7684\u73fe\u72c0\uff0c\u4e26\u7e3d\u7d50\u76f8\u95dc\u5de5\u4f5c\u3002\u672c\u6587\u6700\u5f8c\u5c08\u6ce8\u65bc\u5fc5\u9808\u56de\u7b54\u7684\u4e3b\u8981\u958b\u653e\u6027\u554f\u984c\uff0c\u4ee5\u4fbf\u7cfb\u7d71\u6027\u5730\u5be6\u8e10\u9019\u7a2e\u65b9\u6cd5\u3002", "author": "Georg Rehm et.al.", "authors": "Georg Rehm, Annika 
Gr\u00fctzner-Zahn, Fabio Barth", "id": "2502.12886v1", "paper_url": "http://arxiv.org/abs/2502.12886v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12893/2502.12893v1.json b/database/storage/2502/12893/2502.12893v1.json
new file mode 100644
index 0000000000..6259c09f31
--- /dev/null
+++ b/database/storage/2502/12893/2502.12893v1.json
@@ -0,0 +1 @@
+{"2502.12893": {"publish_time": "2025-02-18", "title": "H-CoT: Hijacking the Chain-of-Thought Safety Reasoning Mechanism to Jailbreak Large Reasoning Models, Including OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking", "paper_summary": "Large Reasoning Models (LRMs) have recently extended their powerful reasoning\ncapabilities to safety checks-using chain-of-thought reasoning to decide\nwhether a request should be answered. While this new approach offers a\npromising route for balancing model utility and safety, its robustness remains\nunderexplored. To address this gap, we introduce Malicious-Educator, a\nbenchmark that disguises extremely dangerous or malicious requests beneath\nseemingly legitimate educational prompts. Our experiments reveal severe\nsecurity flaws in popular commercial-grade LRMs, including OpenAI o1/o3,\nDeepSeek-R1, and Gemini 2.0 Flash Thinking. For instance, although OpenAI's o1\nmodel initially maintains a high refusal rate of about 98%, subsequent model\nupdates significantly compromise its safety; and attackers can easily extract\ncriminal strategies from DeepSeek-R1 and Gemini 2.0 Flash Thinking without any\nadditional tricks. To further highlight these vulnerabilities, we propose\nHijacking Chain-of-Thought (H-CoT), a universal and transferable attack method\nthat leverages the model's own displayed intermediate reasoning to jailbreak\nits safety reasoning mechanism. 
Under H-CoT, refusal rates sharply\ndecline-dropping from 98% to below 2%-and, in some instances, even transform\ninitially cautious tones into ones that are willing to provide harmful content.\nWe hope these findings underscore the urgent need for more robust safety\nmechanisms to preserve the benefits of advanced reasoning capabilities without\ncompromising ethical standards.", "paper_summary_zh": "\u5927\u578b\u63a8\u7406\u6a21\u578b (LRM) \u6700\u8fd1\u5c07\u5176\u5f37\u5927\u7684\u63a8\u7406\u80fd\u529b\u64f4\u5c55\u5230\u5b89\u5168\u6aa2\u67e5\uff0c\u4f7f\u7528\u601d\u7dad\u93c8\u63a8\u7406\u4f86\u6c7a\u5b9a\u662f\u5426\u61c9\u56de\u7b54\u8acb\u6c42\u3002\u96d6\u7136\u9019\u7a2e\u65b0\u65b9\u6cd5\u70ba\u5e73\u8861\u6a21\u578b\u5be6\u7528\u6027\u548c\u5b89\u5168\u6027\u63d0\u4f9b\u4e86\u4e00\u689d\u6709\u5e0c\u671b\u7684\u9014\u5f91\uff0c\u4f46\u5176\u7a69\u5065\u6027\u4ecd\u672a\u5f97\u5230\u5145\u5206\u63a2\u7d22\u3002\u70ba\u4e86\u89e3\u6c7a\u9019\u4e00\u5dee\u8ddd\uff0c\u6211\u5011\u5f15\u5165\u4e86 Malicious-Educator\uff0c\u9019\u662f\u4e00\u500b\u57fa\u6e96\uff0c\u5b83\u5c07\u6975\u5176\u5371\u96aa\u6216\u60e1\u610f\u7684\u8acb\u6c42\u507d\u88dd\u5728\u770b\u4f3c\u5408\u6cd5\u7684\u6559\u80b2\u63d0\u793a\u4e4b\u4e0b\u3002\u6211\u5011\u7684\u5be6\u9a57\u63ed\u793a\u4e86\u6d41\u884c\u7684\u5546\u696d\u7d1a LRM \u4e2d\u56b4\u91cd\u7684\u5b89\u5168\u7f3a\u9677\uff0c\u5305\u62ec OpenAI o1/o3\u3001DeepSeek-R1 \u548c Gemini 2.0 Flash Thinking\u3002\u4f8b\u5982\uff0c\u5118\u7ba1 OpenAI \u7684 o1 \u6a21\u578b\u6700\u521d\u4fdd\u6301\u7d04 98% \u7684\u9ad8\u62d2\u7d55\u7387\uff0c\u4f46\u5f8c\u7e8c\u7684\u6a21\u578b\u66f4\u65b0\u986f\u8457\u640d\u5bb3\u4e86\u5176\u5b89\u5168\u6027\uff1b\u653b\u64ca\u8005\u53ef\u4ee5\u8f15\u9b06\u5730\u5f9e DeepSeek-R1 \u548c Gemini 2.0 Flash Thinking 
\u4e2d\u63d0\u53d6\u72af\u7f6a\u7b56\u7565\uff0c\u800c\u7121\u9700\u4efb\u4f55\u984d\u5916\u7684\u6280\u5de7\u3002\u70ba\u4e86\u9032\u4e00\u6b65\u5f37\u8abf\u9019\u4e9b\u6f0f\u6d1e\uff0c\u6211\u5011\u63d0\u51fa\u4e86\u52ab\u6301\u601d\u7dad\u93c8 (H-CoT)\uff0c\u9019\u662f\u4e00\u7a2e\u901a\u7528\u4e14\u53ef\u8f49\u79fb\u7684\u653b\u64ca\u65b9\u6cd5\uff0c\u5b83\u5229\u7528\u6a21\u578b\u81ea\u5df1\u986f\u793a\u7684\u4e2d\u9593\u63a8\u7406\u4f86\u8d8a\u7344\u5176\u5b89\u5168\u63a8\u7406\u6a5f\u5236\u3002\u5728 H-CoT \u4e0b\uff0c\u62d2\u7d55\u7387\u6025\u5287\u4e0b\u964d\uff0c\u5f9e 98% \u964d\u81f3 2% \u4ee5\u4e0b\uff0c\u5728\u67d0\u4e9b\u60c5\u6cc1\u4e0b\uff0c\u751a\u81f3\u5c07\u6700\u521d\u8b39\u614e\u7684\u8a9e\u6c23\u8f49\u8b8a\u70ba\u9858\u610f\u63d0\u4f9b\u6709\u5bb3\u5167\u5bb9\u7684\u8a9e\u6c23\u3002\u6211\u5011\u5e0c\u671b\u9019\u4e9b\u767c\u73fe\u5f37\u8abf\u4e86\u5c0d\u66f4\u5f37\u5927\u7684\u5b89\u5168\u6a5f\u5236\u7684\u8feb\u5207\u9700\u8981\uff0c\u4ee5\u4fdd\u7559\u5148\u9032\u63a8\u7406\u80fd\u529b\u7684\u597d\u8655\uff0c\u540c\u6642\u4e0d\u640d\u5bb3\u9053\u5fb7\u6a19\u6e96\u3002", "author": "Martin Kuo et.al.", "authors": "Martin Kuo, Jianyi Zhang, Aolin Ding, Qinsi Wang, Louis DiValentin, Yujia Bao, Wei Wei, Da-Cheng Juan, Hai Li, Yiran Chen", "id": "2502.12893v1", "paper_url": "http://arxiv.org/abs/2502.12893v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12895/2502.12895v1.json b/database/storage/2502/12895/2502.12895v1.json
new file mode 100644
index 0000000000..43563f0b65
--- /dev/null
+++ b/database/storage/2502/12895/2502.12895v1.json
@@ -0,0 +1 @@
+{"2502.12895": {"publish_time": "2025-02-18", "title": "Multilingual European Language Models: Benchmarking Approaches and Challenges", "paper_summary": "The breakthrough of generative large language models (LLMs) that can solve\ndifferent tasks through chat interaction has led to a significant increase in\nthe use of general benchmarks to assess the quality or performance of these\nmodels beyond individual applications. There is also a need for better methods\nto evaluate and also to compare models due to the ever increasing number of new\nmodels published. However, most of the established benchmarks revolve around\nthe English language. This paper analyses the benefits and limitations of\ncurrent evaluation datasets, focusing on multilingual European benchmarks. We\nanalyse seven multilingual benchmarks and identify four major challenges.\nFurthermore, we discuss potential solutions to enhance translation quality and\nmitigate cultural biases, including human-in-the-loop verification and\niterative translation ranking. 
Our analysis highlights the need for culturally\naware and rigorously validated benchmarks to assess the reasoning and\nquestion-answering capabilities of multilingual LLMs accurately.", "paper_summary_zh": "\u751f\u6210\u5f0f\u5927\u578b\u8a9e\u8a00\u6a21\u578b (LLM) \u7684\u7a81\u7834\uff0c\u5b83\u80fd\u900f\u904e\u804a\u5929\u4e92\u52d5\u89e3\u6c7a\u4e0d\u540c\u4efb\u52d9\uff0c\u9019\u5c0e\u81f4\u4f7f\u7528\u4e00\u822c\u57fa\u6e96\u4f86\u8a55\u4f30\u9019\u4e9b\u6a21\u578b\u5728\u500b\u5225\u61c9\u7528\u7a0b\u5f0f\u4ee5\u5916\u7684\u54c1\u8cea\u6216\u6548\u80fd\u5927\u5e45\u589e\u52a0\u3002\u7531\u65bc\u5df2\u767c\u5e03\u7684\u65b0\u6a21\u578b\u6578\u91cf\u4e0d\u65b7\u589e\u52a0\uff0c\u56e0\u6b64\u4e5f\u6709\u5fc5\u8981\u63a1\u7528\u66f4\u597d\u7684\u65b9\u6cd5\u4f86\u8a55\u4f30\u6a21\u578b\u4e26\u9032\u884c\u6bd4\u8f03\u3002\u7136\u800c\uff0c\u5927\u591a\u6578\u5df2\u5efa\u7acb\u7684\u57fa\u6e96\u90fd\u570d\u7e5e\u8457\u82f1\u8a9e\u3002\u672c\u6587\u5206\u6790\u4e86\u76ee\u524d\u8a55\u4f30\u8cc7\u6599\u96c6\u7684\u512a\u9ede\u548c\u9650\u5236\uff0c\u91cd\u9ede\u653e\u5728\u591a\u8a9e\u8a00\u6b50\u6d32\u57fa\u6e96\u3002\u6211\u5011\u5206\u6790\u4e86\u4e03\u500b\u591a\u8a9e\u8a00\u57fa\u6e96\uff0c\u4e26\u627e\u51fa\u56db\u500b\u4e3b\u8981\u7684\u6311\u6230\u3002\u6b64\u5916\uff0c\u6211\u5011\u8a0e\u8ad6\u4e86\u589e\u5f37\u7ffb\u8b6f\u54c1\u8cea\u548c\u6e1b\u8f15\u6587\u5316\u504f\u898b\u7684\u6f5b\u5728\u89e3\u6c7a\u65b9\u6848\uff0c\u5305\u62ec\u4eba\u70ba\u8ff4\u5708\u9a57\u8b49\u548c\u53cd\u8986\u7ffb\u8b6f\u6392\u540d\u3002\u6211\u5011\u7684\u5206\u6790\u7a81\u986f\u4e86\u5c0d\u6587\u5316\u610f\u8b58\u548c\u56b4\u683c\u9a57\u8b49\u7684\u57fa\u6e96\u7684\u9700\u6c42\uff0c\u4ee5\u6e96\u78ba\u8a55\u4f30\u591a\u8a9e\u8a00 LLM \u7684\u63a8\u7406\u548c\u554f\u7b54\u80fd\u529b\u3002", "author": "Fabio Barth et.al.", "authors": "Fabio Barth, Georg Rehm", "id": "2502.12895v1", "paper_url": "http://arxiv.org/abs/2502.12895v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12896/2502.12896v1.json b/database/storage/2502/12896/2502.12896v1.json
new file mode 100644
index 0000000000..56ba796980
--- /dev/null
+++ b/database/storage/2502/12896/2502.12896v1.json
@@ -0,0 +1 @@
+{"2502.12896": {"publish_time": "2025-02-18", "title": "None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks", "paper_summary": "In LLM evaluations, reasoning is often distinguished from recall/memorization\nby performing numerical variations to math-oriented questions. Here we\nintroduce a general variation method for multiple-choice questions that\ncompletely dissociates the correct answer from previously seen tokens or\nconcepts, requiring LLMs to understand and reason (rather than memorizing) in\norder to answer correctly. Using this method, we evaluate state-of-the-art\nproprietary and open-source LLMs on two datasets available in English and\nSpanish: the public MMLU benchmark and the private UNED-Access 2024 dataset.\nResults show that all models experience remarkable accuracy drops under our\nproposed variation, with an average loss of 57% on MMLU and 50% on UNED-Access\n2024, ranging from 10% to 93% across models. Notably, the most accurate model\nin our experimentation (OpenAI-o3-mini) is not the most robust\n(DeepSeek-R1-70B), suggesting that the best models in standard evaluations may\nnot be the ones with better reasoning capabilities. 
Also, we see larger\naccuracy drops in public (vs private) datasets and questions posed in their\noriginal language (vs a manual translation), which are signs of contamination\nand also point to a relevant role of recall/memorization in current LLMs'\nanswers.", "paper_summary_zh": "\u5728 LLM \u8a55\u4f30\u4e2d\uff0c\u63a8\u7406\u901a\u5e38\u900f\u904e\u5c0d\u6578\u5b78\u5c0e\u5411\u554f\u984c\u9032\u884c\u6578\u503c\u8b8a\u7570\u4f86\u5340\u5225\u65bc\u56de\u61b6/\u8a18\u61b6\u3002\u5728\u6b64\uff0c\u6211\u5011\u5f15\u5165\u4e00\u7a2e\u901a\u7528\u8b8a\u7570\u65b9\u6cd5\uff0c\u9069\u7528\u65bc\u591a\u9078\u984c\uff0c\u5b83\u5c07\u6b63\u78ba\u7b54\u6848\u8207\u5148\u524d\u770b\u5230\u7684\u4ee3\u5e63\u6216\u6982\u5ff5\u5b8c\u5168\u5340\u5206\u958b\u4f86\uff0c\u8981\u6c42 LLM \u7406\u89e3\u548c\u63a8\u7406\uff08\u800c\u4e0d\u662f\u8a18\u61b6\uff09\uff0c\u4ee5\u4fbf\u6b63\u78ba\u56de\u7b54\u3002\u4f7f\u7528\u6b64\u65b9\u6cd5\uff0c\u6211\u5011\u5728\u82f1\u8a9e\u548c\u897f\u73ed\u7259\u8a9e\u4e2d\u8a55\u4f30\u4e86\u5169\u7a2e\u6578\u64da\u96c6\u4e2d\u7684\u6700\u5148\u9032\u7684\u5c08\u6709\u548c\u958b\u6e90 LLM\uff1a\u516c\u5171 MMLU \u57fa\u6e96\u548c\u79c1\u6709 UNED-Access 2024 \u6578\u64da\u96c6\u3002\u7d50\u679c\u8868\u660e\uff0c\u5728\u6211\u5011\u63d0\u51fa\u7684\u8b8a\u7570\u4e0b\uff0c\u6240\u6709\u6a21\u578b\u7684\u6e96\u78ba\u5ea6\u90fd\u51fa\u73fe\u986f\u8457\u4e0b\u964d\uff0c\u5728 MMLU \u4e0a\u5e73\u5747\u640d\u5931 57%\uff0c\u5728 UNED-Access 2024 \u4e0a\u5e73\u5747\u640d\u5931 50%\uff0c\u5728\u4e0d\u540c\u6a21\u578b\u4e2d\u7bc4\u570d\u5f9e 10% \u5230 
93%\u3002\u503c\u5f97\u6ce8\u610f\u7684\u662f\uff0c\u6211\u5011\u5be6\u9a57\u4e2d\u6700\u6e96\u78ba\u7684\u6a21\u578b\uff08OpenAI-o3-mini\uff09\u4e26\u4e0d\u662f\u6700\u7a69\u5065\u7684\u6a21\u578b\uff08DeepSeek-R1-70B\uff09\uff0c\u9019\u8868\u660e\u6a19\u6e96\u8a55\u4f30\u4e2d\u6700\u597d\u7684\u6a21\u578b\u53ef\u80fd\u4e0d\u662f\u63a8\u7406\u80fd\u529b\u6700\u5f37\u7684\u6a21\u578b\u3002\u6b64\u5916\uff0c\u6211\u5011\u770b\u5230\u516c\u5171\uff08\u76f8\u5c0d\u65bc\u79c1\u6709\uff09\u6578\u64da\u96c6\u548c\u4ee5\u539f\u59cb\u8a9e\u8a00\u63d0\u51fa\u7684\u554f\u984c\uff08\u76f8\u5c0d\u65bc\u4eba\u5de5\u7ffb\u8b6f\uff09\u7684\u6e96\u78ba\u5ea6\u4e0b\u964d\u5e45\u5ea6\u66f4\u5927\uff0c\u9019\u662f\u6c59\u67d3\u7684\u8de1\u8c61\uff0c\u4e5f\u8868\u660e\u56de\u61b6/\u8a18\u61b6\u5728\u7576\u524d LLM \u7684\u7b54\u6848\u4e2d\u767c\u63ee\u8457\u76f8\u95dc\u4f5c\u7528\u3002", "author": "Eva S\u00e1nchez Salido et.al.", "authors": "Eva S\u00e1nchez Salido, Julio Gonzalo, Guillermo Marco", "id": "2502.12896v1", "paper_url": "http://arxiv.org/abs/2502.12896v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12900/2502.12900v1.json b/database/storage/2502/12900/2502.12900v1.json
new file mode 100644
index 0000000000..2f1f7069b4
--- /dev/null
+++ b/database/storage/2502/12900/2502.12900v1.json
@@ -0,0 +1 @@
+{"2502.12900": {"publish_time": "2025-02-18", "title": "Soundwave: Less is More for Speech-Text Alignment in LLMs", "paper_summary": "Existing end-to-end speech large language models (LLMs) usually rely on\nlarge-scale annotated data for training, while data-efficient training has not\nbeen discussed in depth. We focus on two fundamental problems between speech\nand text: the representation space gap and sequence length inconsistency. We\npropose Soundwave, which utilizes an efficient training strategy and a novel\narchitecture to address these issues. Results show that Soundwave outperforms\nthe advanced Qwen2-Audio in speech translation and AIR-Bench speech tasks,\nusing only one-fiftieth of the training data. Further analysis shows that\nSoundwave still retains its intelligence during conversation. The project is\navailable at https://github.com/FreedomIntelligence/Soundwave.", "paper_summary_zh": "\u73fe\u6709\u7684\u7aef\u5c0d\u7aef\u8a9e\u97f3\u5927\u578b\u8a9e\u8a00\u6a21\u578b (LLM) \u901a\u5e38\u4f9d\u8cf4\u65bc\u5927\u898f\u6a21\u8a3b\u91cb\u8cc7\u6599\u9032\u884c\u8a13\u7df4\uff0c\u800c\u8cc7\u6599\u6709\u6548\u7387\u7684\u8a13\u7df4\u5c1a\u672a\u6df1\u5165\u63a2\u8a0e\u3002\u6211\u5011\u5c08\u6ce8\u65bc\u8a9e\u97f3\u548c\u6587\u5b57\u4e4b\u9593\u7684\u5169\u500b\u57fa\u672c\u554f\u984c\uff1a\u8868\u793a\u7a7a\u9593\u5dee\u8ddd\u548c\u5e8f\u5217\u9577\u5ea6\u4e0d\u4e00\u81f4\u3002\u6211\u5011\u63d0\u51fa Soundwave\uff0c\u5b83\u5229\u7528\u9ad8\u6548\u7684\u8a13\u7df4\u7b56\u7565\u548c\u65b0\u7a4e\u7684\u67b6\u69cb\u4f86\u89e3\u6c7a\u9019\u4e9b\u554f\u984c\u3002\u7d50\u679c\u986f\u793a\uff0cSoundwave \u5728\u8a9e\u97f3\u7ffb\u8b6f\u548c AIR-Bench \u8a9e\u97f3\u4efb\u52d9\u4e2d\u512a\u65bc\u9032\u968e\u7684 Qwen2-Audio\uff0c\u50c5\u4f7f\u7528\u4e94\u5341\u5206\u4e4b\u4e00\u7684\u8a13\u7df4\u8cc7\u6599\u3002\u9032\u4e00\u6b65\u7684\u5206\u6790\u986f\u793a\uff0cSoundwave 
\u5728\u5c0d\u8a71\u4e2d\u4ecd\u80fd\u4fdd\u6301\u5176\u667a\u6167\u3002\u5c08\u6848\u53ef\u65bc https://github.com/FreedomIntelligence/Soundwave \u53d6\u5f97\u3002", "author": "Yuhao Zhang et.al.", "authors": "Yuhao Zhang, Zhiheng Liu, Fan Bu, Ruiyu Zhang, Benyou Wang, Haizhou Li", "id": "2502.12900v1", "paper_url": "http://arxiv.org/abs/2502.12900v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12904/2502.12904v1.json b/database/storage/2502/12904/2502.12904v1.json
new file mode 100644
index 0000000000..a6cd9a4ef9
--- /dev/null
+++ b/database/storage/2502/12904/2502.12904v1.json
@@ -0,0 +1 @@
+{"2502.12904": {"publish_time": "2025-02-18", "title": "Fraud-R1 : A Multi-Round Benchmark for Assessing the Robustness of LLM Against Augmented Fraud and Phishing Inducements", "paper_summary": "We introduce Fraud-R1, a benchmark designed to evaluate LLMs' ability to\ndefend against internet fraud and phishing in dynamic, real-world scenarios.\nFraud-R1 comprises 8,564 fraud cases sourced from phishing scams, fake job\npostings, social media, and news, categorized into 5 major fraud types. Unlike\nprevious benchmarks, Fraud-R1 introduces a multi-round evaluation pipeline to\nassess LLMs' resistance to fraud at different stages, including credibility\nbuilding, urgency creation, and emotional manipulation. Furthermore, we\nevaluate 15 LLMs under two settings: 1. Helpful-Assistant, where the LLM\nprovides general decision-making assistance, and 2. Role-play, where the model\nassumes a specific persona, widely used in real-world agent-based interactions.\nOur evaluation reveals the significant challenges in defending against fraud\nand phishing inducement, especially in role-play settings and fake job\npostings. 
Additionally, we observe a substantial performance gap between\nChinese and English, underscoring the need for improved multilingual fraud\ndetection capabilities.", "paper_summary_zh": "\u6211\u5011\u63a8\u51fa Fraud-R1\uff0c\u4e00\u500b\u57fa\u6e96\uff0c\u65e8\u5728\u8a55\u4f30 LLM \u5728\u52d5\u614b\u3001\u771f\u5be6\u4e16\u754c\u5834\u666f\u4e2d\u9632\u7bc4\u7db2\u8def\u8a50\u9a19\u548c\u7db2\u8def\u91e3\u9b5a\u7684\u80fd\u529b\u3002Fraud-R1 \u5305\u542b 8,564 \u8d77\u8a50\u9a19\u6848\u4f8b\uff0c\u4f86\u6e90\u5305\u62ec\u7db2\u8def\u91e3\u9b5a\u8a50\u9a19\u3001\u865b\u5047\u8077\u7f3a\u3001\u793e\u7fa4\u5a92\u9ad4\u548c\u65b0\u805e\uff0c\u5206\u985e\u70ba 5 \u7a2e\u985e\u578b\u7684\u4e3b\u8981\u8a50\u9a19\u624b\u6cd5\u3002\u8207\u5148\u524d\u7684\u57fa\u6e96\u4e0d\u540c\uff0cFraud-R1 \u5f15\u5165\u591a\u8f2a\u8a55\u4f30\u7ba1\u9053\uff0c\u4ee5\u8a55\u4f30 LLM \u5728\u4e0d\u540c\u968e\u6bb5\u5c0d\u8a50\u9a19\u7684\u62b5\u6297\u529b\uff0c\u5305\u62ec\u5efa\u7acb\u4fe1\u8b7d\u3001\u88fd\u9020\u6025\u8feb\u611f\u548c\u60c5\u611f\u64cd\u7e31\u3002\u6b64\u5916\uff0c\u6211\u5011\u5728\u5169\u7a2e\u8a2d\u5b9a\u4e0b\u8a55\u4f30 15 \u500b LLM\uff1a1. \u5354\u52a9\u52a9\u7406\uff0c\u5176\u4e2d LLM \u63d0\u4f9b\u4e00\u822c\u6c7a\u7b56\u5354\u52a9\uff0c\u4ee5\u53ca 2. 
\u89d2\u8272\u626e\u6f14\uff0c\u5176\u4e2d\u6a21\u578b\u5047\u8a2d\u7279\u5b9a\u89d2\u8272\uff0c\u5ee3\u6cdb\u7528\u65bc\u73fe\u5be6\u4e16\u754c\u4e2d\u57fa\u65bc\u4ee3\u7406\u7684\u4e92\u52d5\u3002\u6211\u5011\u7684\u8a55\u4f30\u63ed\u793a\u4e86\u5728\u9632\u7bc4\u8a50\u9a19\u548c\u7db2\u8def\u91e3\u9b5a\u8a98\u5c0e\u65b9\u9762\u9762\u81e8\u7684\u91cd\u5927\u6311\u6230\uff0c\u5c24\u5176\u662f\u5728\u89d2\u8272\u626e\u6f14\u8a2d\u5b9a\u548c\u865b\u5047\u8077\u7f3a\u4e2d\u3002\u6b64\u5916\uff0c\u6211\u5011\u89c0\u5bdf\u5230\u4e2d\u6587\u548c\u82f1\u6587\u4e4b\u9593\u6709\u986f\u8457\u7684\u6548\u80fd\u5dee\u8ddd\uff0c\u9019\u51f8\u986f\u4e86\u6539\u9032\u591a\u8a9e\u8a00\u8a50\u9a19\u5075\u6e2c\u529f\u80fd\u7684\u5fc5\u8981\u6027\u3002", "author": "Shu Yang et.al.", "authors": "Shu Yang, Shenzhe Zhu, Zeyu Wu, Keyu Wang, Junchi Yao, Junchao Wu, Lijie Hu, Mengdi Li, Derek F. Wong, Di Wang", "id": "2502.12904v1", "paper_url": "http://arxiv.org/abs/2502.12904v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12908/2502.12908v1.json b/database/storage/2502/12908/2502.12908v1.json
new file mode 100644
index 0000000000..024ca79f72
--- /dev/null
+++ b/database/storage/2502/12908/2502.12908v1.json
@@ -0,0 +1 @@
+{"2502.12908": {"publish_time": "2025-02-18", "title": "Graph Neural Networks for Databases: A Survey", "paper_summary": "Graph neural networks (GNNs) are powerful deep learning models for\ngraph-structured data, demonstrating remarkable success across diverse domains.\nRecently, the database (DB) community has increasingly recognized the\npotential of GNNs, prompting a surge of research focusing on improving\ndatabase systems through GNN-based approaches. However, despite notable\nadvances, there is a lack of a comprehensive review and understanding of how\nGNNs could improve DB systems. Therefore, this survey aims to bridge this gap\nby providing a structured and in-depth overview of GNNs for DB systems.\nSpecifically, we propose a new taxonomy that classifies existing methods into\ntwo key categories: (1) Relational Databases, which includes tasks like\nperformance prediction, query optimization, and text-to-SQL, and (2) Graph\nDatabases, addressing challenges like efficient graph query processing and\ngraph similarity computation. We systematically review key methods in each\ncategory, highlighting their contributions and practical implications. 
Finally,\nwe suggest promising avenues for integrating GNNs into Database systems.", "paper_summary_zh": "\u5716\u5f62\u795e\u7d93\u7db2\u8def (GNN) \u662f\u7528\u65bc\u5716\u5f62\u7d50\u69cb\u8cc7\u6599\u7684\u5f37\u5927\u6df1\u5ea6\u5b78\u7fd2\u6a21\u578b\uff0c\u5728\u5404\u7a2e\u9818\u57df\u4e2d\u5c55\u73fe\u51fa\u986f\u8457\u7684\u6210\u529f\u3002\u6700\u8fd1\uff0c\u8cc7\u6599\u5eab (DB) \u793e\u7fa4\u8d8a\u4f86\u8d8a\u8a8d\u8b58\u5230 GNN \u7684\u6f5b\u529b\uff0c\u4fc3\u4f7f\u5927\u91cf\u7814\u7a76\u5c08\u6ce8\u65bc\u900f\u904e\u57fa\u65bc GNN \u7684\u65b9\u6cd5\u4f86\u6539\u5584\u8cc7\u6599\u5eab\u7cfb\u7d71\u3002\u7136\u800c\uff0c\u5118\u7ba1\u6709\u986f\u8457\u7684\u9032\u5c55\uff0c\u4f46\u5c0d\u65bc GNN \u5982\u4f55\u6539\u5584\u8cc7\u6599\u5eab\u7cfb\u7d71\uff0c\u4ecd\u7136\u7f3a\u4e4f\u5168\u9762\u7684\u56de\u9867\u548c\u7406\u89e3\u3002\u56e0\u6b64\uff0c\u672c\u8abf\u67e5\u65e8\u5728\u900f\u904e\u63d0\u4f9b GNN \u5728\u8cc7\u6599\u5eab\u7cfb\u7d71\u4e2d\u7684\u7d50\u69cb\u5316\u4e14\u6df1\u5165\u7684\u6982\u89c0\u4f86\u5f4c\u88dc\u9019\u500b\u5dee\u8ddd\u3002\u5177\u9ad4\u4f86\u8aaa\uff0c\u6211\u5011\u63d0\u51fa\u4e86\u4e00\u500b\u65b0\u7684\u5206\u985e\u6cd5\uff0c\u5c07\u73fe\u6709\u65b9\u6cd5\u5206\u985e\u70ba\u5169\u500b\u4e3b\u8981\u985e\u5225\uff1a(1) \u95dc\u4fc2\u8cc7\u6599\u5eab\uff0c\u5176\u4e2d\u5305\u62ec\u6548\u80fd\u9810\u6e2c\u3001\u67e5\u8a62\u6700\u4f73\u5316\u548c\u6587\u5b57\u8f49 SQL \u7b49\u4efb\u52d9\uff0c\u4ee5\u53ca (2) \u5716\u5f62\u8cc7\u6599\u5eab\uff0c\u7528\u65bc\u8655\u7406\u9ad8\u6548\u5716\u5f62\u67e5\u8a62\u8655\u7406\u548c\u5716\u5f62\u76f8\u4f3c\u5ea6\u8a08\u7b97\u7b49\u6311\u6230\u3002\u6211\u5011\u7cfb\u7d71\u6027\u5730\u56de\u9867\u4e86\u6bcf\u500b\u985e\u5225\u4e2d\u7684\u95dc\u9375\u65b9\u6cd5\uff0c\u91cd\u9ede\u8aaa\u660e\u5176\u8ca2\u737b\u548c\u5be6\u52d9\u610f\u6db5\u3002\u6700\u5f8c\uff0c\u6211\u5011\u5efa\u8b70\u5c07 GNN 
\u6574\u5408\u5230\u8cc7\u6599\u5eab\u7cfb\u7d71\u4e2d\u7684\u6709\u5e0c\u671b\u9014\u5f91\u3002", "author": "Ziming Li et.al.", "authors": "Ziming Li, Youhuan Li, Yuyu Luo, Guoliang Li, Chuxu Zhang", "id": "2502.12908v1", "paper_url": "http://arxiv.org/abs/2502.12908v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12911/2502.12911v1.json b/database/storage/2502/12911/2502.12911v1.json
new file mode 100644
index 0000000000..247104454a
--- /dev/null
+++ b/database/storage/2502/12911/2502.12911v1.json
@@ -0,0 +1 @@
+{"2502.12911": {"publish_time": "2025-02-18", "title": "Knapsack Optimization-based Schema Linking for LLM-based Text-to-SQL Generation", "paper_summary": "Generating SQLs from user queries is a long-standing challenge, where the\naccuracy of initial schema linking significantly impacts subsequent SQL\ngeneration performance. However, current schema linking models still struggle\nwith missing relevant schema elements or an excess of redundant ones. A crucial\nreason for this is that commonly used metrics, recall and precision, fail to\ncapture relevant element missing and thus cannot reflect actual schema linking\nperformance. Motivated by this, we propose an enhanced schema linking metric by\nintroducing a restricted missing indicator. Accordingly, we introduce Knapsack\noptimization-based Schema Linking Agent (KaSLA), a plug-in schema linking agent\ndesigned to prevent the missing of relevant schema elements while minimizing\nthe inclusion of redundant ones. KaSLA employs a hierarchical linking strategy\nthat first identifies the optimal table linking and subsequently links columns\nwithin the selected table to reduce linking candidate space. In each linking\nprocess, it utilizes a knapsack optimization approach to link potentially\nrelevant elements while accounting for a limited tolerance of potential\nredundant ones. With this optimization, KaSLA-1.6B achieves superior schema\nlinking results compared to large-scale LLMs, including deepseek-v3 with\nstate-of-the-art (SOTA) schema linking method. 
Extensive experiments on Spider\nand BIRD benchmarks verify that KaSLA can significantly improve the SQL\ngeneration performance of SOTA text-to-SQL models by substituting their schema\nlinking processes.", "paper_summary_zh": "\u5f9e\u4f7f\u7528\u8005\u67e5\u8a62\u4e2d\u7522\u751f SQL \u662f\u500b\u9577\u671f\u7684\u6311\u6230\uff0c\u5176\u4e2d\u521d\u59cb\u67b6\u69cb\u9023\u7d50\u7684\u6e96\u78ba\u6027\u6703\u986f\u8457\u5f71\u97ff\u5f8c\u7e8c SQL \u7522\u751f\u6548\u80fd\u3002\u7136\u800c\uff0c\u76ee\u524d\u7684\u67b6\u69cb\u9023\u7d50\u6a21\u578b\u4ecd\u96e3\u4ee5\u8655\u7406\u907a\u6f0f\u76f8\u95dc\u67b6\u69cb\u5143\u7d20\u6216\u904e\u591a\u91cd\u8907\u5143\u7d20\u7684\u554f\u984c\u3002\u9020\u6210\u6b64\u554f\u984c\u7684\u4e00\u500b\u95dc\u9375\u539f\u56e0\u662f\uff0c\u5e38\u7528\u7684\u6307\u6a19\u53ec\u56de\u7387\u548c\u7cbe\u78ba\u5ea6\u7121\u6cd5\u6355\u6349\u907a\u6f0f\u76f8\u95dc\u5143\u7d20\uff0c\u56e0\u6b64\u7121\u6cd5\u53cd\u6620\u5be6\u969b\u7684\u67b6\u69cb\u9023\u7d50\u6548\u80fd\u3002\u6709\u9451\u65bc\u6b64\uff0c\u6211\u5011\u63d0\u51fa\u4e00\u500b\u589e\u5f37\u7684\u67b6\u69cb\u9023\u7d50\u6307\u6a19\uff0c\u900f\u904e\u5f15\u5165\u53d7\u9650\u907a\u6f0f\u6307\u6a19\u3002\u56e0\u6b64\uff0c\u6211\u5011\u4ecb\u7d39\u57fa\u65bc\u80cc\u5305\u6700\u4f73\u5316\u7684\u67b6\u69cb\u9023\u7d50\u4ee3\u7406 (KaSLA)\uff0c\u9019\u662f\u4e00\u500b\u5916\u639b\u5f0f\u67b6\u69cb\u9023\u7d50\u4ee3\u7406\uff0c\u65e8\u5728\u9632\u6b62\u907a\u6f0f\u76f8\u95dc\u67b6\u69cb\u5143\u7d20\uff0c\u540c\u6642\u5c07\u91cd\u8907\u5143\u7d20\u7684\u7d0d\u5165\u964d\u81f3\u6700\u4f4e\u3002KaSLA 
\u63a1\u7528\u5206\u5c64\u9023\u7d50\u7b56\u7565\uff0c\u9996\u5148\u627e\u51fa\u6700\u4f73\u7684\u8868\u683c\u9023\u7d50\uff0c\u7136\u5f8c\u9023\u7d50\u6240\u9078\u8868\u683c\u4e2d\u7684\u6b04\u4f4d\uff0c\u4ee5\u6e1b\u5c11\u9023\u7d50\u5019\u9078\u7a7a\u9593\u3002\u5728\u6bcf\u500b\u9023\u7d50\u904e\u7a0b\u4e2d\uff0c\u5b83\u5229\u7528\u80cc\u5305\u6700\u4f73\u5316\u65b9\u6cd5\u9023\u7d50\u6f5b\u5728\u76f8\u95dc\u5143\u7d20\uff0c\u540c\u6642\u8003\u91cf\u5c0d\u6f5b\u5728\u91cd\u8907\u5143\u7d20\u7684\u5bb9\u5fcd\u5ea6\u3002\u900f\u904e\u6b64\u6700\u4f73\u5316\uff0cKaSLA-1.6B \u9054\u5230\u512a\u65bc\u5927\u898f\u6a21 LLM \u7684\u67b6\u69cb\u9023\u7d50\u7d50\u679c\uff0c\u5305\u62ec\u63a1\u7528\u6700\u5148\u9032 (SOTA) \u67b6\u69cb\u9023\u7d50\u65b9\u6cd5\u7684 deepseek-v3\u3002\u5728 Spider \u548c BIRD \u57fa\u6e96\u4e0a\u7684\u5ee3\u6cdb\u5be6\u9a57\u9a57\u8b49\uff0cKaSLA \u53ef\u900f\u904e\u53d6\u4ee3\u5176\u67b6\u69cb\u9023\u7d50\u6d41\u7a0b\uff0c\u5927\u5e45\u63d0\u5347 SOTA \u6587\u5b57\u8f49 SQL \u6a21\u578b\u7684 SQL \u7522\u751f\u6548\u80fd\u3002", "author": "Zheng Yuan et.al.", "authors": "Zheng Yuan, Hao Chen, Zijin Hong, Qinggang Zhang, Feiran Huang, Xiao Huang", "id": "2502.12911v1", "paper_url": "http://arxiv.org/abs/2502.12911v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12913/2502.12913v1.json b/database/storage/2502/12913/2502.12913v1.json
new file mode 100644
index 0000000000..22847a5b16
--- /dev/null
+++ b/database/storage/2502/12913/2502.12913v1.json
@@ -0,0 +1 @@
+{"2502.12913": {"publish_time": "2025-02-18", "title": "GSQ-Tuning: Group-Shared Exponents Integer in Fully Quantized Training for LLMs On-Device Fine-tuning", "paper_summary": "Large Language Models (LLMs) fine-tuning technologies have achieved\nremarkable results. However, traditional LLM fine-tuning approaches face\nsignificant challenges: they require large Floating Point (FP) computation,\nraising privacy concerns when handling sensitive data, and are impractical for\nresource-constrained edge devices. While Parameter-Efficient Fine-Tuning (PEFT)\ntechniques reduce trainable parameters, their reliance on floating-point\narithmetic creates fundamental incompatibilities with edge hardware. In this\nwork, we introduce a novel framework for on-device LLM fine-tuning that\neliminates the need for floating-point operations in both inference and\ntraining, named GSQ-Tuning. At its core is the Group-Shared Exponents Integer\nformat, which efficiently represents model parameters in integer format using\nshared exponents among parameter groups. When combined with LoRA-like adapters,\nthis enables fully integer-based fine-tuning that is both memory and compute\nefficient. 
We demonstrate that our approach achieves accuracy comparable to\nFP16-based fine-tuning while significantly reducing memory usage (50%).\nMoreover, compared to FP8, our method can reduce 5x power consumption and 11x\nchip area with same performance, making large-scale model adaptation feasible\non edge devices.", "paper_summary_zh": "\u5927\u578b\u8bed\u8a00\u6a21\u578b (LLM) \u5fae\u8c03\u6280\u672f\u5df2\u53d6\u5f97\u663e\u8457\u6210\u679c\u3002\u7136\u800c\uff0c\u4f20\u7edf\u7684 LLM \u5fae\u8c03\u65b9\u6cd5\u9762\u4e34\u7740\u4e25\u5cfb\u7684\u6311\u6218\uff1a\u5b83\u4eec\u9700\u8981\u5927\u91cf\u7684\u6d6e\u70b9 (FP) \u8ba1\u7b97\uff0c\u5728\u5904\u7406\u654f\u611f\u6570\u636e\u65f6\u4f1a\u5f15\u53d1\u9690\u79c1\u95ee\u9898\uff0c\u5e76\u4e14\u5bf9\u4e8e\u8d44\u6e90\u53d7\u9650\u7684\u8fb9\u7f18\u8bbe\u5907\u800c\u8a00\u4e0d\u5207\u5b9e\u9645\u3002\u867d\u7136\u53c2\u6570\u9ad8\u6548\u5fae\u8c03 (PEFT) \u6280\u672f\u51cf\u5c11\u4e86\u53ef\u8bad\u7ec3\u53c2\u6570\uff0c\u4f46\u5b83\u4eec\u5bf9\u6d6e\u70b9\u8fd0\u7b97\u7684\u4f9d\u8d56\u4e0e\u8fb9\u7f18\u786c\u4ef6\u4ea7\u751f\u4e86\u6839\u672c\u4e0a\u7684\u4e0d\u517c\u5bb9\u6027\u3002\u5728\u8fd9\u9879\u5de5\u4f5c\u4e2d\uff0c\u6211\u4eec\u5f15\u5165\u4e86\u4e00\u4e2a\u7528\u4e8e\u8bbe\u5907\u4e0a LLM \u5fae\u8c03\u7684\u65b0\u6846\u67b6\uff0c\u8be5\u6846\u67b6\u6d88\u9664\u4e86\u63a8\u7406\u548c\u8bad\u7ec3\u4e2d\u5bf9\u6d6e\u70b9\u8fd0\u7b97\u7684\u9700\u6c42\uff0c\u540d\u4e3a GSQ-Tuning\u3002\u5176\u6838\u5fc3\u662f\u7ec4\u5171\u4eab\u6307\u6570\u6574\u6570\u683c\u5f0f\uff0c\u8be5\u683c\u5f0f\u4f7f\u7528\u53c2\u6570\u7ec4\u4e4b\u95f4\u7684\u5171\u4eab\u6307\u6570\u4ee5\u6574\u6570\u683c\u5f0f\u6709\u6548\u5730\u8868\u793a\u6a21\u578b\u53c2\u6570\u3002\u5f53\u4e0e\u7c7b\u4f3c LoRA 
\u7684\u9002\u914d\u5668\u76f8\u7ed3\u5408\u65f6\uff0c\u8fd9\u5b9e\u73b0\u4e86\u5b8c\u5168\u57fa\u4e8e\u6574\u6570\u7684\u5fae\u8c03\uff0c\u65e2\u8282\u7701\u5185\u5b58\u53c8\u8282\u7701\u8ba1\u7b97\u3002\u6211\u4eec\u8bc1\u660e\u4e86\u6211\u4eec\u7684\u65b9\u6cd5\u5b9e\u73b0\u4e86\u4e0e\u57fa\u4e8e FP16 \u7684\u5fae\u8c03\u76f8\u5f53\u7684\u51c6\u786e\u6027\uff0c\u540c\u65f6\u663e\u8457\u51cf\u5c11\u4e86\u5185\u5b58\u4f7f\u7528\u91cf (50%)\u3002\u6b64\u5916\uff0c\u4e0e FP8 \u76f8\u6bd4\uff0c\u6211\u4eec\u7684\u65b9\u6cd5\u53ef\u4ee5\u5728\u76f8\u540c\u7684\u6027\u80fd\u4e0b\u51cf\u5c11 5 \u500d\u7684\u529f\u8017\u548c 11 \u500d\u7684\u82af\u7247\u9762\u79ef\uff0c\u4ece\u800c\u4f7f\u5927\u89c4\u6a21\u6a21\u578b\u9002\u5e94\u5728\u8fb9\u7f18\u8bbe\u5907\u4e0a\u6210\u4e3a\u53ef\u80fd\u3002", "author": "Sifan Zhou et.al.", "authors": "Sifan Zhou, Shuo Wang, Zhihang Yuan, Mingjia Shi, Yuzhang Shang, Dawei Yang", "id": "2502.12913v1", "paper_url": "http://arxiv.org/abs/2502.12913v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12921/2502.12921v1.json b/database/storage/2502/12921/2502.12921v1.json
new file mode 100644
index 0000000000..fbb5a7f70c
--- /dev/null
+++ b/database/storage/2502/12921/2502.12921v1.json
@@ -0,0 +1 @@
+{"2502.12921": {"publish_time": "2025-02-18", "title": "Q-STRUM Debate: Query-Driven Contrastive Summarization for Recommendation Comparison", "paper_summary": "Query-driven recommendation with unknown items poses a challenge for users to\nunderstand why certain items are appropriate for their needs. Query-driven\nContrastive Summarization (QCS) is a methodology designed to address this issue\nby leveraging language-based item descriptions to clarify contrasts between\nthem. However, existing state-of-the-art contrastive summarization methods such\nas STRUM-LLM fall short of this goal. To overcome these limitations, we\nintroduce Q-STRUM Debate, a novel extension of STRUM-LLM that employs\ndebate-style prompting to generate focused and contrastive summarizations of\nitem aspects relevant to a query. Leveraging modern large language models\n(LLMs) as powerful tools for generating debates, Q-STRUM Debate provides\nenhanced contrastive summaries. Experiments across three datasets demonstrate\nthat Q-STRUM Debate yields significant performance improvements over existing\nmethods on key contrastive summarization criteria, thus introducing a novel and\nperformant debate prompting methodology for QCS.", "paper_summary_zh": "\u4ee5\u672a\u77e5\u9805\u76ee\u9032\u884c\u7684\u67e5\u8a62\u9a45\u52d5\u63a8\u85a6\u5c0d\u4f7f\u7528\u8005\u4f86\u8aaa\u662f\u4e00\u9805\u6311\u6230\uff0c\u4ed6\u5011\u96e3\u4ee5\u7406\u89e3\u70ba\u4f55\u67d0\u4e9b\u9805\u76ee\u9069\u5408\u81ea\u5df1\u7684\u9700\u6c42\u3002\u67e5\u8a62\u9a45\u52d5\u5c0d\u6bd4\u6458\u8981 (QCS) \u662f\u4e00\u7a2e\u65b9\u6cd5\uff0c\u65e8\u5728\u900f\u904e\u5229\u7528\u57fa\u65bc\u8a9e\u8a00\u7684\u9805\u76ee\u63cf\u8ff0\u4f86\u91d0\u6e05\u9805\u76ee\u4e4b\u9593\u7684\u5c0d\u6bd4\uff0c\u4ee5\u89e3\u6c7a\u9019\u500b\u554f\u984c\u3002\u7136\u800c\uff0c\u73fe\u6709\u7684\u6700\u5148\u9032\u5c0d\u6bd4\u6458\u8981\u65b9\u6cd5\uff08\u4f8b\u5982 
STRUM-LLM\uff09\u4e26\u672a\u9054\u6210\u6b64\u76ee\u6a19\u3002\u70ba\u4e86\u514b\u670d\u9019\u4e9b\u9650\u5236\uff0c\u6211\u5011\u5f15\u9032 Q-STRUM Debate\uff0c\u4e00\u7a2e STRUM-LLM \u7684\u65b0\u5ef6\u4f38\uff0c\u5b83\u63a1\u7528\u8faf\u8ad6\u5f0f\u63d0\u793a\u4f86\u7522\u751f\u8207\u67e5\u8a62\u76f8\u95dc\u7684\u9805\u76ee\u9762\u5411\u7684\u91cd\u9ede\u5f0f\u5c0d\u6bd4\u6458\u8981\u3002\u900f\u904e\u5229\u7528\u73fe\u4ee3\u5927\u578b\u8a9e\u8a00\u6a21\u578b (LLM) \u4f5c\u70ba\u7522\u751f\u8faf\u8ad6\u7684\u5f37\u5927\u5de5\u5177\uff0cQ-STRUM Debate \u63d0\u4f9b\u589e\u5f37\u7684\u5c0d\u6bd4\u6458\u8981\u3002\u900f\u904e\u4e09\u500b\u8cc7\u6599\u96c6\u7684\u5be6\u9a57\u8b49\u660e\uff0cQ-STRUM Debate \u5728\u95dc\u9375\u7684\u5c0d\u6bd4\u6458\u8981\u6a19\u6e96\u4e0a\uff0c\u6bd4\u73fe\u6709\u65b9\u6cd5\u6709\u986f\u8457\u7684\u6548\u80fd\u6539\u5584\uff0c\u56e0\u6b64\u70ba QCS \u5f15\u9032\u4e00\u7a2e\u65b0\u7a4e\u4e14\u9ad8\u6027\u80fd\u7684\u8faf\u8ad6\u63d0\u793a\u65b9\u6cd5\u3002", "author": "George-Kirollos Saad et.al.", "authors": "George-Kirollos Saad, Scott Sanner", "id": "2502.12921v1", "paper_url": "http://arxiv.org/abs/2502.12921v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12923/2502.12923v1.json b/database/storage/2502/12923/2502.12923v1.json
new file mode 100644
index 0000000000..5dd20221ab
--- /dev/null
+++ b/database/storage/2502/12923/2502.12923v1.json
@@ -0,0 +1 @@
+{"2502.12923": {"publish_time": "2025-02-18", "title": "On-Device LLMs for Home Assistant: Dual Role in Intent Detection and Response Generation", "paper_summary": "This paper investigates whether Large Language Models (LLMs), fine-tuned on\nsynthetic but domain-representative data, can perform the twofold task of (i)\nslot and intent detection and (ii) natural language response generation for a\nsmart home assistant, while running solely on resource-limited, CPU-only edge\nhardware. We fine-tune LLMs to produce both JSON action calls and text\nresponses. Our experiments show that 16-bit and 8-bit quantized variants\npreserve high accuracy on slot and intent detection and maintain strong\nsemantic coherence in generated text, while the 4-bit model, while retaining\ngenerative fluency, suffers a noticeable drop in device-service classification\naccuracy. Further evaluations on noisy human (non-synthetic) prompts and\nout-of-domain intents confirm the models' generalization ability, obtaining\naround 80--86\\% accuracy. 
While the average inference time is 5--6 seconds per\nquery -- acceptable for one-shot commands but suboptimal for multi-turn\ndialogue -- our results affirm that an on-device LLM can effectively unify\ncommand interpretation and flexible response generation for home automation\nwithout relying on specialized hardware.", "paper_summary_zh": "\u672c\u6587\u63a2\u8a0e\u5fae\u8abf\u65bc\u5408\u6210\u4f46\u5177\u9818\u57df\u4ee3\u8868\u6027\u7684\u8cc7\u6599\u4e0a\u7684\u5927\u578b\u8a9e\u8a00\u6a21\u578b (LLM)\uff0c\u662f\u5426\u80fd\u57f7\u884c (i) \u69fd\u4f4d\u548c\u610f\u5716\u5075\u6e2c\uff0c\u4ee5\u53ca (ii) \u81ea\u7136\u8a9e\u8a00\u56de\u61c9\u7522\u751f\u7684\u96d9\u91cd\u4efb\u52d9\uff0c\u540c\u6642\u50c5\u5728\u8cc7\u6e90\u53d7\u9650\u3001\u50c5 CPU \u7684\u908a\u7de3\u786c\u9ad4\u4e0a\u57f7\u884c\u3002\u6211\u5011\u5fae\u8abf LLM \u4ee5\u7522\u751f JSON \u52d5\u4f5c\u547c\u53eb\u548c\u6587\u5b57\u56de\u61c9\u3002\u6211\u5011\u7684\u5be6\u9a57\u986f\u793a\uff0c16 \u4f4d\u5143\u548c 8 \u4f4d\u5143\u91cf\u5316\u7684\u8b8a\u9ad4\u5728\u69fd\u4f4d\u548c\u610f\u5716\u5075\u6e2c\u4e0a\u4fdd\u6301\u9ad8\u6e96\u78ba\u5ea6\uff0c\u4e26\u5728\u7522\u751f\u7684\u6587\u5b57\u4e2d\u7dad\u6301\u5f37\u5927\u7684\u8a9e\u610f\u4e00\u81f4\u6027\uff0c\u800c 4 \u4f4d\u5143\u6a21\u578b\u96d6\u7136\u4fdd\u6709\u751f\u6210\u6d41\u66a2\u5ea6\uff0c\u4f46\u5728\u88dd\u7f6e\u670d\u52d9\u5206\u985e\u6e96\u78ba\u5ea6\u4e0a\u537b\u6709\u660e\u986f\u4e0b\u964d\u3002\u9032\u4e00\u6b65\u5c0d\u6709\u96dc\u8a0a\u7684\u4eba\u985e (\u975e\u5408\u6210) \u63d0\u793a\u548c\u9818\u57df\u5916\u610f\u5716\u7684\u8a55\u4f30\uff0c\u8b49\u5be6\u4e86\u6a21\u578b\u7684\u6cdb\u5316\u80fd\u529b\uff0c\u7372\u5f97\u7d04 80--86% \u7684\u6e96\u78ba\u5ea6\u3002\u96d6\u7136\u5e73\u5747\u63a8\u8ad6\u6642\u9593\u70ba\u6bcf\u500b\u67e5\u8a62 5--6 
\u79d2\uff0c\u5c0d\u65bc\u4e00\u6b21\u6027\u547d\u4ee4\u4f86\u8aaa\u662f\u53ef\u4ee5\u63a5\u53d7\u7684\uff0c\u4f46\u5c0d\u65bc\u591a\u8f2a\u5c0d\u8a71\u4f86\u8aaa\u4e26\u4e0d\u7406\u60f3\uff0c\u4f46\u6211\u5011\u7684\u7d50\u679c\u8b49\u5be6\uff0c\u88dd\u7f6e\u4e0a\u7684 LLM \u53ef\u4ee5\u6709\u6548\u5730\u7d71\u4e00\u547d\u4ee4\u89e3\u8b6f\u548c\u5f48\u6027\u56de\u61c9\u7522\u751f\uff0c\u4ee5\u9032\u884c\u5bb6\u5ead\u81ea\u52d5\u5316\uff0c\u800c\u7121\u9700\u4f9d\u8cf4\u5c08\u7528\u786c\u9ad4\u3002", "author": "Rune Birkmose et.al.", "authors": "Rune Birkmose, Nathan M\u00f8rkeberg Reece, Esben Hofstedt Norvin, Johannes Bjerva, Mike Zhang", "id": "2502.12923v1", "paper_url": "http://arxiv.org/abs/2502.12923v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12924/2502.12924v1.json b/database/storage/2502/12924/2502.12924v1.json
new file mode 100644
index 0000000000..2a336021a2
--- /dev/null
+++ b/database/storage/2502/12924/2502.12924v1.json
@@ -0,0 +1 @@
+{"2502.12924": {"publish_time": "2025-02-18", "title": "Conditioning LLMs to Generate Code-Switched Text: A Methodology Grounded in Naturally Occurring Data", "paper_summary": "Code-switching (CS) is still a critical challenge in Natural Language\nProcessing (NLP). Current Large Language Models (LLMs) struggle to interpret\nand generate code-switched text, primarily due to the scarcity of large-scale\nCS datasets for training. This paper presents a novel methodology to generate\nCS data using LLMs, and test it on the English-Spanish language pair. We\npropose back-translating natural CS sentences into monolingual English, and\nusing the resulting parallel corpus to fine-tune LLMs to turn monolingual\nsentences into CS. Unlike previous approaches to CS generation, our methodology\nuses natural CS data as a starting point, allowing models to learn its natural\ndistribution beyond grammatical patterns. We thoroughly analyse the models'\nperformance through a study on human preferences, a qualitative error analysis\nand an evaluation with popular automatic metrics. Results show that our\nmethodology generates fluent code-switched text, expanding research\nopportunities in CS communication, and that traditional metrics do not\ncorrelate with human judgement when assessing the quality of the generated CS\ndata. 
We release our code and generated dataset under a CC-BY-NC-SA license.", "paper_summary_zh": "\u4ee3\u78bc\u8f49\u63db\uff08CS\uff09\u5728\u81ea\u7136\u8a9e\u8a00\u8655\u7406\uff08NLP\uff09\u4e2d\u4ecd\u662f\u4e00\u500b\u56b4\u5cfb\u7684\u6311\u6230\u3002\u76ee\u524d\u7684\u5de8\u91cf\u8a9e\u8a00\u6a21\u578b\uff08LLM\uff09\u96e3\u4ee5\u89e3\u8b80\u548c\u751f\u6210\u4ee3\u78bc\u8f49\u63db\u6587\u5b57\uff0c\u4e3b\u8981\u662f\u56e0\u70ba\u7f3a\u4e4f\u7528\u65bc\u8a13\u7df4\u7684\u5927\u898f\u6a21 CS \u8cc7\u6599\u96c6\u3002\u672c\u6587\u63d0\u51fa\u4e86\u4e00\u7a2e\u4f7f\u7528 LLM \u751f\u6210 CS \u8cc7\u6599\u7684\u65b0\u65b9\u6cd5\uff0c\u4e26\u5728\u82f1\u8a9e-\u897f\u73ed\u7259\u8a9e\u8a9e\u8a00\u5c0d\u4e0a\u9032\u884c\u6e2c\u8a66\u3002\u6211\u5011\u5efa\u8b70\u5c07\u81ea\u7136 CS \u53e5\u5b50\u53cd\u5411\u7ffb\u8b6f\u6210\u55ae\u8a9e\u82f1\u8a9e\uff0c\u4e26\u4f7f\u7528\u7522\u751f\u7684\u5e73\u884c\u8a9e\u6599\u5eab\u5fae\u8abf LLM\uff0c\u5c07\u55ae\u8a9e\u53e5\u5b50\u8f49\u63db\u70ba CS\u3002\u8207\u5148\u524d\u7684 CS \u751f\u6210\u65b9\u6cd5\u4e0d\u540c\uff0c\u6211\u5011\u7684\u6280\u8853\u4f7f\u7528\u81ea\u7136 CS \u8cc7\u6599\u4f5c\u70ba\u8d77\u9ede\uff0c\u8b93\u6a21\u578b\u80fd\u5920\u5b78\u7fd2\u5176\u8d85\u8d8a\u8a9e\u6cd5\u6a21\u5f0f\u7684\u81ea\u7136\u5206\u4f48\u3002\u6211\u5011\u900f\u904e\u7814\u7a76\u4eba\u985e\u504f\u597d\u3001\u5b9a\u6027\u932f\u8aa4\u5206\u6790\u548c\u4f7f\u7528\u6d41\u884c\u7684\u81ea\u52d5\u5316\u6307\u6a19\u9032\u884c\u8a55\u4f30\uff0c\u5fb9\u5e95\u5206\u6790\u6a21\u578b\u7684\u6548\u80fd\u3002\u7d50\u679c\u986f\u793a\uff0c\u6211\u5011\u7684\u6280\u8853\u53ef\u4ee5\u751f\u6210\u6d41\u5229\u7684\u4ee3\u78bc\u8f49\u63db\u6587\u5b57\uff0c\u64f4\u5c55 CS \u6e9d\u901a\u7684\u7814\u7a76\u6a5f\u6703\uff0c\u800c\u4e14\u5728\u8a55\u4f30\u751f\u6210\u7684 CS \u8cc7\u6599\u54c1\u8cea\u6642\uff0c\u50b3\u7d71\u6307\u6a19\u8207\u4eba\u985e\u5224\u65b7\u7121\u95dc\u3002\u6211\u5011\u5728 CC-BY-NC-SA 
\u6388\u6b0a\u4e0b\u91cb\u51fa\u6211\u5011\u7684\u7a0b\u5f0f\u78bc\u548c\u751f\u6210\u7684\u8cc7\u6599\u96c6\u3002", "author": "Maite Heredia et.al.", "authors": "Maite Heredia, Gorka Labaka, Jeremy Barnes, Aitor Soroa", "id": "2502.12924v1", "paper_url": "http://arxiv.org/abs/2502.12924v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12925/2502.12925v1.json b/database/storage/2502/12925/2502.12925v1.json
new file mode 100644
index 0000000000..58eac71d15
--- /dev/null
+++ b/database/storage/2502/12925/2502.12925v1.json
@@ -0,0 +1 @@
+{"2502.12925": {"publish_time": "2025-02-18", "title": "Keep what you need : extracting efficient subnetworks from large audio representation models", "paper_summary": "Recently, research on audio foundation models has witnessed notable advances,\nas illustrated by the ever improving results on complex downstream tasks.\nSubsequently, those pretrained networks have quickly been used for various\naudio applications. These improvements have however resulted in a considerable\nincrease both in size and complexity of these models. Along the environmental\nconcerns this issue raises, this prevents the deployment of such networks on\nconsumer-level devices, and precludes their use for real-time applications.\nMoreover, this appears contradictory with the specificity of the tasks for\nwhich these models are used, which are often simpler compared to extracting a\nrich, multi-purpose representation from any type of audio data. In this paper,\nwe address this issue with a simple, yet effective method to extract\nlightweight specialist subnetworks from large foundation models. Specifically,\nwe introduce learnable binary masks in-between the layers of a pretrained\nrepresentation model. When training the end-to-end model on a downstream task,\nwe add a sparsity-inducing loss to the overall objective, hence learning a\ncompact subnetwork specialized on a single task. Importantly, the weights of\nthe foundation model are kept frozen, resulting into low additional training\ncosts. Once trained, the masked computational units can then be removed from\nthe network, implying significant performance gains. We assess our method on\nthree widespread audio foundation models, each based on a different backbone\narchitecture, and illustrate its effectiveness on common audio representation\nevaluation tasks, as well as its versatility on both speech, music, and general\naudio. 
Code for reproducing the results and supporting webpage are available at\nhttps://github.com/gnvIRCAM/Audio-representation-trimming", "paper_summary_zh": "<paragraph>\u8fd1\u671f\uff0c\u97f3\u9891\u57fa\u7840\u6a21\u578b\u7684\u7814\u7a76\u53d6\u5f97\u4e86\u663e\u8457\u8fdb\u5c55\uff0c\n\u590d\u6742\u7684\u4e0b\u6e38\u4efb\u52a1\u4e0a\u4e0d\u65ad\u63d0\u5347\u7684\u7ed3\u679c\u8bc1\u660e\u4e86\u8fd9\u4e00\u70b9\u3002\n\u968f\u540e\uff0c\u8fd9\u4e9b\u9884\u8bad\u7ec3\u7f51\u7edc\u5df2\u8fc5\u901f\u7528\u4e8e\u5404\u79cd\n\u97f3\u9891\u5e94\u7528\u7a0b\u5e8f\u3002\u7136\u800c\uff0c\u8fd9\u4e9b\u6539\u8fdb\u5bfc\u81f4\u4e86\u8fd9\u4e9b\u6a21\u578b\u7684\u5c3a\u5bf8\u548c\u590d\u6742\u6027\u90fd\u5927\u5e45\n\u589e\u52a0\u3002\u9664\u4e86\u7531\u6b64\u4ea7\u751f\u7684\u73af\u5883\u95ee\u9898\u5916\uff0c\u8fd9\u4e5f\u963b\u6b62\u4e86\u6b64\u7c7b\u7f51\u7edc\u5728\n\u6d88\u8d39\u8005\u7ea7\u8bbe\u5907\u4e0a\u7684\u90e8\u7f72\uff0c\u5e76\u6392\u9664\u4e86\u5b83\u4eec\u5728\u5b9e\u65f6\u5e94\u7528\u7a0b\u5e8f\u4e2d\u7684\u4f7f\u7528\u3002\n\u6b64\u5916\uff0c\u8fd9\u4f3c\u4e4e\u4e0e\u8fd9\u4e9b\u6a21\u578b\u7684\u4f7f\u7528\u4efb\u52a1\u7684\u7279\u6b8a\u6027\u76f8\u77db\u76fe\uff0c\u4e0e\u4ece\u4efb\u4f55\u7c7b\u578b\u7684\u97f3\u9891\u6570\u636e\u4e2d\u63d0\u53d6\u4e30\u5bcc\u7684\u591a\u7528\u9014\u8868\u793a\u76f8\u6bd4\uff0c\u8fd9\u4e9b\u4efb\u52a1\u901a\u5e38\u66f4\u7b80\u5355\u3002\u5728\u672c\u6587\u4e2d\uff0c\n\u6211\u4eec\u901a\u8fc7\u4e00\u79cd\u7b80\u5355\u4f46\u6709\u6548\u7684\u65b9\u6cd5\u6765\u89e3\u51b3\u6b64\u95ee\u9898\uff0c\u4ece\u5927\u578b\u57fa\u7840\u6a21\u578b\u4e2d\u63d0\u53d6\u8f7b\u91cf\u7ea7\u4e13\u5bb6\u5b50\u7f51\u7edc\u3002\u5177\u4f53\u6765\u8bf4\uff0c\n\u6211\u4eec\u5728\u9884\u8bad\u7ec3\u8868\u793a\u6a21\u578b\u7684\u5c42\u4e4b\u95f4\u5f15\u5165\u4e86\u53ef\u5b66\u4e60\u7684\u4e8c\u8fdb\u5236\u63a9\u7801\u3002\u5f53\u5728\u67d0\u4e2a\u4e0b\u6e38\u4efb\u52a1\u4e0a\u8bad\u7ec3\u7aef\u5230\u7aef\u6a21\u578b\u65f6\uff0c\n\u6211\u4eec\u5728\u60
3b\u4f53\u76ee\u6807\u4e2d\u6dfb\u52a0\u4e86\u7a00\u758f\u6027\u8bf1\u5bfc\u635f\u5931\uff0c\u4ece\u800c\u5b66\u4e60\u5230\u4e13\u95e8\u7528\u4e8e\u5355\u4e2a\u4efb\u52a1\u7684\u7d27\u51d1\u578b\u5b50\u7f51\u7edc\u3002\u91cd\u8981\u7684\u662f\uff0c\n\u57fa\u7840\u6a21\u578b\u7684\u6743\u91cd\u4fdd\u6301\u51bb\u7ed3\uff0c\u4ece\u800c\u5bfc\u81f4\u989d\u5916\u7684\u8bad\u7ec3\u6210\u672c\u4f4e\u3002\u4e00\u65e6\u8bad\u7ec3\u5b8c\u6210\uff0c\u5c31\u53ef\u4ee5\u4ece\u7f51\u7edc\u4e2d\u79fb\u9664\u63a9\u7801\u7684\u8ba1\u7b97\u5355\u5143\uff0c\u8fd9\u610f\u5473\u7740\u6027\u80fd\u5c06\u5927\u5e45\u63d0\u5347\u3002\u6211\u4eec\u5bf9\u4e09\u4e2a\u5e7f\u6cdb\u4f7f\u7528\u7684\u97f3\u9891\u57fa\u7840\u6a21\u578b\u8bc4\u4f30\u4e86\u6211\u4eec\u7684\u65b9\u6cd5\uff0c\u6bcf\u4e2a\u6a21\u578b\u90fd\u57fa\u4e8e\u4e0d\u540c\u7684\u9aa8\u5e72\u67b6\u6784\uff0c\u5e76\u8bf4\u660e\u4e86\u5176\u5728\u5e38\u89c1\u97f3\u9891\u8868\u793a\u8bc4\u4f30\u4efb\u52a1\u4e0a\u7684\u6709\u6548\u6027\uff0c\u4ee5\u53ca\u5176\u5728\u8bed\u97f3\u3001\u97f3\u4e50\u548c\u901a\u7528\u97f3\u9891\u4e0a\u7684\u591a\u529f\u80fd\u6027\u3002\u7528\u4e8e\u91cd\u73b0\u7ed3\u679c\u7684\u4ee3\u7801\u548c\u652f\u6301\u7f51\u9875\u53ef\u5728\nhttps://github.com/gnvIRCAM/Audio-representation-trimming \u83b7\u5f97</paragraph>", "author": "David Genova et.al.", "authors": "David Genova, Philippe Esling, Tom Hurlin", "id": "2502.12925v1", "paper_url": "http://arxiv.org/abs/2502.12925v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12926/2502.12926v1.json b/database/storage/2502/12926/2502.12926v1.json
new file mode 100644
index 0000000000..d24f94874f
--- /dev/null
+++ b/database/storage/2502/12926/2502.12926v1.json
@@ -0,0 +1 @@
+{"2502.12926": {"publish_time": "2025-02-18", "title": "Towards more Contextual Agents: An extractor-Generator Optimization Framework", "paper_summary": "Large Language Model (LLM)-based agents have demonstrated remarkable success\nin solving complex tasks across a wide range of general-purpose applications.\nHowever, their performance often degrades in context-specific scenarios, such\nas specialized industries or research domains, where the absence of\ndomain-relevant knowledge leads to imprecise or suboptimal outcomes. To address\nthis challenge, our work introduces a systematic approach to enhance the\ncontextual adaptability of LLM-based agents by optimizing their underlying\nprompts-critical components that govern agent behavior, roles, and\ninteractions. Manually crafting optimized prompts for context-specific tasks is\nlabor-intensive, error-prone, and lacks scalability. In this work, we introduce\nan Extractor-Generator framework designed to automate the optimization of\ncontextual LLM-based agents. Our method operates through two key stages: (i)\nfeature extraction from a dataset of gold-standard input-output examples, and\n(ii) prompt generation via a high-level optimization strategy that iteratively\nidentifies underperforming cases and applies self-improvement techniques. This\nframework substantially improves prompt adaptability by enabling more precise\ngeneralization across diverse inputs, particularly in context-specific tasks\nwhere maintaining semantic consistency and minimizing error propagation are\ncritical for reliable performance. Although developed with single-stage\nworkflows in mind, the approach naturally extends to multi-stage workflows,\noffering broad applicability across various agent-based systems. 
Empirical\nevaluations demonstrate that our framework significantly enhances the\nperformance of prompt-optimized agents, providing a structured and efficient\napproach to contextual LLM-based agents.", "paper_summary_zh": "\u5927\u578b\u8a9e\u8a00\u6a21\u578b (LLM) \u70ba\u57fa\u790e\u7684\u4ee3\u7406\u5df2\u5c55\u73fe\u51fa\u975e\u51e1\u7684\u6210\u529f\uff0c\n\u80fd\u89e3\u6c7a\u5ee3\u6cdb\u4e00\u822c\u7528\u9014\u61c9\u7528\u7a0b\u5f0f\u7684\u8907\u96dc\u4efb\u52d9\u3002\n\u7136\u800c\uff0c\u5b83\u5011\u7684\u6548\u80fd\u901a\u5e38\u6703\u5728\u7279\u5b9a\u60c5\u5883\u4e2d\u4e0b\u964d\uff0c\u4f8b\u5982\u5c08\u9580\u7522\u696d\u6216\u7814\u7a76\u9818\u57df\uff0c\n\u5176\u4e2d\u7f3a\u4e4f\u8207\u9818\u57df\u76f8\u95dc\u77e5\u8b58\u6703\u5c0e\u81f4\u4e0d\u7cbe\u78ba\u6216\u6b21\u4f73\u7684\u7d50\u679c\u3002\u70ba\u4e86\u89e3\u6c7a\n\u9019\u9805\u6311\u6230\uff0c\u6211\u5011\u7684\u7814\u7a76\u5f15\u9032\u4e86\u4e00\u7a2e\u7cfb\u7d71\u5316\u7684\u65b9\u6cd5\u4f86\u589e\u5f37 LLM \u70ba\u57fa\u790e\u7684\u4ee3\u7406\u7684\n\u60c5\u5883\u9069\u61c9\u6027\uff0c\u65b9\u6cd5\u662f\u6700\u4f73\u5316\u5b83\u5011\u7684\u57fa\u790e\u63d0\u793a\uff0c\u9019\u4e9b\u63d0\u793a\u662f\u6c7a\u5b9a\u4ee3\u7406\u884c\u70ba\u3001\u89d2\u8272\u548c\n\u4e92\u52d5\u7684\u91cd\u8981\u7d44\u6210\u90e8\u5206\u3002\u624b\u52d5\u88fd\u4f5c\u6700\u4f73\u5316\u7684\u63d0\u793a\u4ee5\u61c9\u5c0d\u7279\u5b9a\u60c5\u5883\u7684\u4efb\u52d9\u65e2\u8cbb\u6642\u53c8\u5bb9\u6613\u51fa\u932f\uff0c\u800c\u4e14\u7f3a\u4e4f\u53ef\u64f4\u5145\u6027\u3002\u5728\u9019\u9805\u7814\u7a76\u4e2d\uff0c\u6211\u5011\u5f15\u9032\n\u4e00\u500b\u8403\u53d6\u7522\u751f\u5668\u67b6\u69cb\uff0c\u65e8\u5728\u81ea\u52d5\u5316\u60c5\u5883 LLM \u70ba\u57fa\u790e\u4ee3\u7406\u7684\u6700\u4f73\u5316\u3002\u6211\u5011\u7684\n\u65b9\u6cd5\u900f\u904e\u5169\u500b\u95dc\u9375\u968e\u6bb5\u904b\u4f5c\uff1a(i) 
\u5f9e\u9ec3\u91d1\u6a19\u6e96\u8f38\u5165\u8f38\u51fa\u7bc4\u4f8b\u7684\u8cc7\u6599\u96c6\u8403\u53d6\u7279\u5fb5\uff0c\u4ee5\u53ca\n(ii) \u900f\u904e\u9ad8\u968e\u6700\u4f73\u5316\u7b56\u7565\u7522\u751f\u63d0\u793a\uff0c\u6b64\u7b56\u7565\u6703\u53cd\u8986\u627e\u51fa\u8868\u73fe\u4e0d\u4f73\u7684\u6848\u4f8b\u4e26\u5957\u7528\u81ea\u6211\u6539\u5584\u6280\u8853\u3002\u6b64\n\u67b6\u69cb\u5927\u5e45\u6539\u5584\u4e86\u63d0\u793a\u9069\u61c9\u6027\uff0c\u8b93\u5b83\u80fd\u91dd\u5c0d\u4e0d\u540c\u7684\u8f38\u5165\u9032\u884c\u66f4\u7cbe\u78ba\u7684\u6982\u62ec\uff0c\u7279\u5225\u662f\u5728\u60c5\u5883\u7279\u5b9a\u4efb\u52d9\u4e2d\uff0c\u5728\u9019\u4e9b\u4efb\u52d9\u4e2d\uff0c\u7dad\u6301\u8a9e\u610f\u4e00\u81f4\u6027\u548c\u5c07\u932f\u8aa4\u50b3\u64ad\u964d\u81f3\u6700\u4f4e\u5c0d\u65bc\u53ef\u9760\u7684\u6548\u80fd\u81f3\u95dc\u91cd\u8981\u3002\u5118\u7ba1\u662f\u91dd\u5c0d\u55ae\u968e\u6bb5\u5de5\u4f5c\u6d41\u7a0b\u958b\u767c\uff0c\u4f46\u6b64\u65b9\u6cd5\u81ea\u7136\u80fd\u5ef6\u4f38\u81f3\u591a\u968e\u6bb5\u5de5\u4f5c\u6d41\u7a0b\uff0c\u5728\u5404\u7a2e\u57fa\u65bc\u4ee3\u7406\u7684\u7cfb\u7d71\u4e2d\u63d0\u4f9b\u5ee3\u6cdb\u7684\u9069\u7528\u6027\u3002\u5be6\u8b49\u8a55\u4f30\u986f\u793a\uff0c\u6211\u5011\u7684\u67b6\u69cb\u5927\u5e45\u589e\u5f37\u4e86\u63d0\u793a\u6700\u4f73\u5316\u4ee3\u7406\u7684\u6548\u80fd\uff0c\u70ba\u57fa\u65bc\u60c5\u5883\u7684 LLM \u4ee3\u7406\u63d0\u4f9b\u4e86\u4e00\u500b\u7d50\u69cb\u5316\u4e14\u6709\u6548\u7387\u7684\u65b9\u6cd5\u3002", "author": "Mourad Aouini et.al.", "authors": "Mourad Aouini, Jinan Loubani", "id": "2502.12926v1", "paper_url": "http://arxiv.org/abs/2502.12926v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12927/2502.12927v1.json b/database/storage/2502/12927/2502.12927v1.json
new file mode 100644
index 0000000000..ae57f1b4c3
--- /dev/null
+++ b/database/storage/2502/12927/2502.12927v1.json
@@ -0,0 +1 @@
+{"2502.12927": {"publish_time": "2025-02-18", "title": "SEFL: Harnessing Large Language Model Agents to Improve Educational Feedback Systems", "paper_summary": "Providing high-quality feedback is crucial for student success but is\nconstrained by time, cost, and limited data availability. We introduce\nSynthetic Educational Feedback Loops (SEFL), a novel framework designed to\ndeliver immediate, on-demand feedback at scale without relying on extensive,\nreal-world student data. In SEFL, two large language models (LLMs) operate in\nteacher--student roles to simulate assignment completion and formative\nfeedback, generating abundant synthetic pairs of student work and corresponding\ncritiques. We then fine-tune smaller, more computationally efficient LLMs on\nthese synthetic pairs, enabling them to replicate key features of high-quality,\ngoal-oriented feedback. Unlike personalized tutoring approaches that offer\nmulti-turn, individualized instruction, SEFL specifically focuses on\nreplicating the teacher-->student feedback loop for diverse assignments.\nThrough both LLM-as-a-judge and human evaluations, we demonstrate that\nSEFL-tuned models outperform their non-tuned counterparts in feedback quality,\nclarity, and timeliness. 
These findings reveal SEFL's potential to transform\nfeedback processes for higher education and beyond, offering an ethical and\nscalable alternative to conventional manual feedback cycles.", "paper_summary_zh": "\u63d0\u4f9b\u9ad8\u54c1\u8cea\u7684\u56de\u994b\u5c0d\u65bc\u5b78\u751f\u7684\u6210\u529f\u81f3\u95dc\u91cd\u8981\uff0c\u4f46\u53d7\u5230\u6642\u9593\u3001\u6210\u672c\u548c\u8cc7\u6599\u53d6\u5f97\u6709\u9650\u7684\u9650\u5236\u3002\u6211\u5011\u5f15\u5165\u4e86\u5408\u6210\u6559\u80b2\u56de\u994b\u8ff4\u5708 (SEFL)\uff0c\u9019\u662f\u4e00\u500b\u65b0\u7a4e\u7684\u67b6\u69cb\uff0c\u65e8\u5728\u63d0\u4f9b\u7acb\u5373\u4e14\u4f9d\u9700\u6c42\u7684\u56de\u994b\uff0c\u4e14\u7121\u9700\u4ef0\u8cf4\u5927\u91cf\u7684\u771f\u5be6\u4e16\u754c\u5b78\u751f\u8cc7\u6599\u3002\u5728 SEFL \u4e2d\uff0c\u5169\u500b\u5927\u578b\u8a9e\u8a00\u6a21\u578b (LLM) \u4ee5\u5e2b\u751f\u89d2\u8272\u904b\u4f5c\uff0c\u6a21\u64ec\u4f5c\u696d\u5b8c\u6210\u548c\u5f62\u6210\u6027\u56de\u994b\uff0c\u7522\u751f\u5927\u91cf\u7684\u5408\u6210\u5b78\u751f\u4f5c\u696d\u548c\u5c0d\u61c9\u7684\u8a55\u8ad6\u3002\u7136\u5f8c\u6211\u5011\u91dd\u5c0d\u9019\u4e9b\u5408\u6210\u914d\u5c0d\u5fae\u8abf\u8f03\u5c0f\u3001\u8a08\u7b97\u6548\u7387\u8f03\u9ad8\u7684 LLM\uff0c\u8b93\u5b83\u5011\u80fd\u5920\u8907\u88fd\u9ad8\u54c1\u8cea\u3001\u76ee\u6a19\u5c0e\u5411\u56de\u994b\u7684\u4e3b\u8981\u7279\u5fb5\u3002\u8207\u63d0\u4f9b\u591a\u56de\u5408\u3001\u500b\u5225\u5316\u6559\u5b78\u7684\u500b\u4eba\u5316\u8f14\u5c0e\u65b9\u6cd5\u4e0d\u540c\uff0cSEFL \u7279\u5225\u5c08\u6ce8\u65bc\u8907\u88fd\u9069\u7528\u65bc\u5404\u7a2e\u4f5c\u696d\u7684\u6559\u5e2b-->\u5b78\u751f\u56de\u994b\u8ff4\u5708\u3002\u900f\u904e LLM \u4f5c\u70ba\u8a55\u5be9\u548c\u4eba\u985e\u8a55\u4f30\uff0c\u6211\u5011\u8b49\u660e\u4e86 SEFL 
\u5fae\u8abf\u6a21\u578b\u5728\u56de\u994b\u54c1\u8cea\u3001\u6e05\u6670\u5ea6\u548c\u6642\u6548\u6027\u65b9\u9762\u512a\u65bc\u672a\u5fae\u8abf\u7684\u6a21\u578b\u3002\u9019\u4e9b\u767c\u73fe\u63ed\u793a\u4e86 SEFL \u8f49\u8b8a\u9ad8\u7b49\u6559\u80b2\u53ca\u5176\u4ed6\u9818\u57df\u56de\u994b\u6d41\u7a0b\u7684\u6f5b\u529b\uff0c\u63d0\u4f9b\u4e86\u4e00\u500b\u7b26\u5408\u9053\u5fb7\u4e14\u53ef\u64f4\u5145\u7684\u66ff\u4ee3\u65b9\u6848\uff0c\u53d6\u4ee3\u50b3\u7d71\u7684\u624b\u52d5\u56de\u994b\u9031\u671f\u3002", "author": "Mike Zhang et.al.", "authors": "Mike Zhang, Amalie Pernille Dilling, L\u00e9on Gondelman, Niels Erik Ruan Lyngdorf, Euan D. Lindsay, Johannes Bjerva", "id": "2502.12927v1", "paper_url": "http://arxiv.org/abs/2502.12927v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12928/2502.12928v1.json b/database/storage/2502/12928/2502.12928v1.json
new file mode 100644
index 0000000000..d42b2c131a
--- /dev/null
+++ b/database/storage/2502/12928/2502.12928v1.json
@@ -0,0 +1 @@
+{"2502.12928": {"publish_time": "2025-02-18", "title": "Finedeep: Mitigating Sparse Activation in Dense LLMs via Multi-Layer Fine-Grained Experts", "paper_summary": "Large language models have demonstrated exceptional performance across a wide\nrange of tasks. However, dense models usually suffer from sparse activation,\nwhere many activation values tend towards zero (i.e., being inactivated). We\nargue that this could restrict the efficient exploration of model\nrepresentation space. To mitigate this issue, we propose Finedeep, a\ndeep-layered fine-grained expert architecture for dense models. Our framework\npartitions the feed-forward neural network layers of traditional dense models\ninto small experts, arranges them across multiple sub-layers. A novel routing\nmechanism is proposed to determine each expert's contribution. We conduct\nextensive experiments across various model sizes, demonstrating that our\napproach significantly outperforms traditional dense architectures in terms of\nperplexity and benchmark performance while maintaining a comparable number of\nparameters and floating-point operations. Moreover, we find that Finedeep\nachieves optimal results when balancing depth and width, specifically by\nadjusting the number of expert sub-layers and the number of experts per\nsub-layer. 
Empirical results confirm that Finedeep effectively alleviates\nsparse activation and efficiently utilizes representation capacity in dense\nmodels.", "paper_summary_zh": "\u5927\u578b\u8a9e\u8a00\u6a21\u578b\u5728\u5404\u7a2e\u4efb\u52d9\u4e2d\u5c55\u73fe\u51fa\u975e\u51e1\u7684\u6548\u80fd\u3002\u7136\u800c\uff0c\u5bc6\u96c6\u6a21\u578b\u901a\u5e38\u6703\u51fa\u73fe\u7a00\u758f\u6fc0\u6d3b\uff0c\u5176\u4e2d\u8a31\u591a\u6fc0\u6d3b\u503c\u8da8\u8fd1\u65bc\u96f6\uff08\u5373\u8655\u65bc\u975e\u6fc0\u6d3b\u72c0\u614b\uff09\u3002\u6211\u5011\u8a8d\u70ba\u9019\u53ef\u80fd\u6703\u9650\u5236\u6a21\u578b\u8868\u793a\u7a7a\u9593\u7684\u6709\u6548\u63a2\u7d22\u3002\u70ba\u4e86\u6e1b\u8f15\u9019\u500b\u554f\u984c\uff0c\u6211\u5011\u63d0\u51fa Finedeep\uff0c\u9019\u662f\u4e00\u7a2e\u91dd\u5c0d\u5bc6\u96c6\u6a21\u578b\u7684\u6df1\u5ea6\u5206\u5c64\u7d30\u7c92\u5ea6\u5c08\u5bb6\u67b6\u69cb\u3002\u6211\u5011\u7684\u6846\u67b6\u5c07\u50b3\u7d71\u5bc6\u96c6\u6a21\u578b\u7684\u524d\u994b\u795e\u7d93\u7db2\u8def\u5c64\u5206\u5272\u6210\u5c0f\u578b\u5c08\u5bb6\uff0c\u4e26\u5c07\u5b83\u5011\u6392\u5217\u5728\u591a\u500b\u5b50\u5c64\u4e2d\u3002\u6211\u5011\u63d0\u51fa\u4e86\u4e00\u7a2e\u65b0\u7a4e\u7684\u8def\u7531\u6a5f\u5236\u4f86\u78ba\u5b9a\u6bcf\u500b\u5c08\u5bb6\u7684\u8ca2\u737b\u3002\u6211\u5011\u91dd\u5c0d\u5404\u7a2e\u6a21\u578b\u5927\u5c0f\u9032\u884c\u4e86\u5ee3\u6cdb\u7684\u5be6\u9a57\uff0c\u8b49\u660e\u6211\u5011\u7684\u505a\u6cd5\u5728\u56f0\u60d1\u5ea6\u548c\u57fa\u6e96\u6548\u80fd\u65b9\u9762\u986f\u8457\u512a\u65bc\u50b3\u7d71\u7684\u5bc6\u96c6\u67b6\u69cb\uff0c\u540c\u6642\u4fdd\u6301\u4e86\u76f8\u7576\u6578\u91cf\u7684\u53c3\u6578\u548c\u6d6e\u9ede\u904b\u7b97\u3002\u6b64\u5916\uff0c\u6211\u5011\u767c\u73fe Finedeep 
\u5728\u5e73\u8861\u6df1\u5ea6\u548c\u5ee3\u5ea6\u6642\u53ef\u4ee5\u9054\u5230\u6700\u4f73\u7d50\u679c\uff0c\u7279\u5225\u662f\u900f\u904e\u8abf\u6574\u5c08\u5bb6\u5b50\u5c64\u7684\u6578\u91cf\u548c\u6bcf\u500b\u5b50\u5c64\u7684\u5c08\u5bb6\u6578\u91cf\u3002\u5be6\u8b49\u7d50\u679c\u8b49\u5be6\uff0cFinedeep \u6709\u6548\u5730\u6e1b\u8f15\u4e86\u7a00\u758f\u6fc0\u6d3b\uff0c\u4e26\u6709\u6548\u5229\u7528\u4e86\u5bc6\u96c6\u6a21\u578b\u4e2d\u7684\u8868\u793a\u80fd\u529b\u3002", "author": "Leiyu Pan et.al.", "authors": "Leiyu Pan, Zhenpeng Su, Minxuan Lv, Yizhe Xiong, Xiangwen Zhang, Zijia Lin, Hui Chen, Jungong Han, Guiguang Ding, Cheng Luo, Di Zhang, Kun Gai, Deyi Xiong", "id": "2502.12928v1", "paper_url": "http://arxiv.org/abs/2502.12928v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12929/2502.12929v1.json b/database/storage/2502/12929/2502.12929v1.json
new file mode 100644
index 0000000000..103af2304f
--- /dev/null
+++ b/database/storage/2502/12929/2502.12929v1.json
@@ -0,0 +1 @@
+{"2502.12929": {"publish_time": "2025-02-18", "title": "Flow-of-Options: Diversified and Improved LLM Reasoning by Thinking Through Options", "paper_summary": "We present a novel reasoning approach called Flow-of-Options (FoO), designed\nto address intrinsic biases in Large Language Models (LLMs). FoO enables LLMs\nto systematically explore a diverse range of possibilities in their reasoning,\nas demonstrated by an FoO-based agentic system for autonomously solving Machine\nLearning tasks (AutoML). Our framework outperforms state-of-the-art baselines,\nachieving improvements of 38.2% - 69.2% on standard data science tasks, and\n37.4% - 47.9% on therapeutic chemistry tasks. With an overall operation cost\nunder $1 per task, our framework is well-suited for cost-sensitive\napplications. Beyond classification and regression, we illustrate the broader\napplicability of our FoO-based agentic system to tasks such as reinforcement\nlearning and image generation. Our framework presents significant advancements\ncompared to current state-of-the-art agentic systems for AutoML, due to the\nbenefits of FoO in enforcing diversity in LLM solutions through compressed,\nexplainable representations that also support long-term memory when combined\nwith case-based reasoning.", "paper_summary_zh": "\u6211\u5011\u63d0\u51fa\u4e86\u4e00\u7a2e\u7a31\u70ba\u9078\u9805\u6d41 (FoO) \u7684\u65b0\u63a8\u7406\u65b9\u6cd5\uff0c\u65e8\u5728\u89e3\u6c7a\u5927\u578b\u8a9e\u8a00\u6a21\u578b (LLM) \u4e2d\u7684\u5167\u5728\u504f\u5dee\u3002FoO \u4f7f LLM \u80fd\u7cfb\u7d71\u6027\u5730\u63a2\u7d22\u5176\u63a8\u7406\u4e2d\u7684\u5404\u7a2e\u53ef\u80fd\u6027\uff0c\u9019\u7531\u4e00\u500b\u57fa\u65bc FoO \u7684\u4ee3\u7406\u7cfb\u7d71\u5c55\u793a\uff0c\u8a72\u7cfb\u7d71\u53ef\u81ea\u4e3b\u89e3\u6c7a\u6a5f\u5668\u5b78\u7fd2\u4efb\u52d9 
(AutoML)\u3002\u6211\u5011\u7684\u6846\u67b6\u512a\u65bc\u6700\u5148\u9032\u7684\u57fa\u6e96\uff0c\u5728\u6a19\u6e96\u6578\u64da\u79d1\u5b78\u4efb\u52d9\u4e0a\u53d6\u5f97\u4e86 38.2% - 69.2% \u7684\u6539\u9032\uff0c\u5728\u6cbb\u7642\u5316\u5b78\u4efb\u52d9\u4e0a\u53d6\u5f97\u4e86 37.4% - 47.9% \u7684\u6539\u9032\u3002\u7531\u65bc\u6bcf\u500b\u4efb\u52d9\u7684\u6574\u9ad4\u904b\u71df\u6210\u672c\u4f4e\u65bc 1 \u7f8e\u5143\uff0c\u56e0\u6b64\u6211\u5011\u7684\u6846\u67b6\u975e\u5e38\u9069\u5408\u5c0d\u6210\u672c\u654f\u611f\u7684\u61c9\u7528\u3002\u9664\u4e86\u5206\u985e\u548c\u56de\u6b78\u4e4b\u5916\uff0c\u6211\u5011\u9084\u8aaa\u660e\u4e86\u57fa\u65bc FoO \u7684\u4ee3\u7406\u7cfb\u7d71\u5728\u5f37\u5316\u5b78\u7fd2\u548c\u5716\u50cf\u751f\u6210\u7b49\u4efb\u52d9\u4e2d\u7684\u66f4\u5ee3\u6cdb\u9069\u7528\u6027\u3002\u6211\u5011\u7684\u6846\u67b6\u8207\u7576\u524d\u6700\u5148\u9032\u7684 AutoML \u4ee3\u7406\u7cfb\u7d71\u76f8\u6bd4\u5177\u6709\u986f\u8457\u7684\u9032\u6b65\uff0c\u9019\u662f\u56e0\u70ba FoO \u5728\u901a\u904e\u58d3\u7e2e\u3001\u53ef\u89e3\u91cb\u7684\u8868\u793a\u5f37\u5236 LLM \u89e3\u6c7a\u65b9\u6848\u7684\u591a\u6a23\u6027\u65b9\u9762\u5177\u6709\u512a\u52e2\uff0c\u9019\u4e9b\u8868\u793a\u8207\u57fa\u65bc\u6848\u4f8b\u7684\u63a8\u7406\u7d50\u5408\u6642\u9084\u652f\u6301\u9577\u671f\u8a18\u61b6\u3002", "author": "Lakshmi Nair et.al.", "authors": "Lakshmi Nair, Ian Trase, Mark Kim", "id": "2502.12929v1", "paper_url": "http://arxiv.org/abs/2502.12929v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12932/2502.12932v1.json b/database/storage/2502/12932/2502.12932v1.json
new file mode 100644
index 0000000000..d5415e23ae
--- /dev/null
+++ b/database/storage/2502/12932/2502.12932v1.json
@@ -0,0 +1 @@
+{"2502.12932": {"publish_time": "2025-02-18", "title": "Synthetic Data Generation for Culturally Nuanced Commonsense Reasoning in Low-Resource Languages", "paper_summary": "Quantifying reasoning capability in low-resource languages remains a\nchallenge in NLP due to data scarcity and limited access to annotators. While\nLLM-assisted dataset construction has proven useful for medium- and\nhigh-resource languages, its effectiveness in low-resource languages,\nparticularly for commonsense reasoning, is still unclear. In this paper, we\ncompare three dataset creation strategies: (1) LLM-assisted dataset generation,\n(2) machine translation, and (3) human-written data by native speakers, to\nbuild a culturally nuanced story comprehension dataset. We focus on Javanese\nand Sundanese, two major local languages in Indonesia, and evaluate the\neffectiveness of open-weight and closed-weight LLMs in assisting dataset\ncreation through extensive manual validation. To assess the utility of\nsynthetic data, we fine-tune language models on classification and generation\ntasks using this data and evaluate performance on a human-written test set. 
Our\nfindings indicate that LLM-assisted data creation outperforms machine\ntranslation.", "paper_summary_zh": "\u7531\u65bc\u8cc7\u6599\u7a00\u5c11\u4e14\u6a19\u8a3b\u8005\u6709\u9650\uff0c\u91cf\u5316\u4f4e\u8cc7\u6e90\u8a9e\u8a00\u4e2d\u7684\u63a8\u7406\u80fd\u529b\u5728\u81ea\u7136\u8a9e\u8a00\u8655\u7406\u4e2d\u4ecd\u7136\u662f\u4e00\u9805\u6311\u6230\u3002\u96d6\u7136 LLM \u8f14\u52a9\u7684\u8cc7\u6599\u96c6\u5efa\u69cb\u5df2\u88ab\u8b49\u660e\u5c0d\u4e2d\u9ad8\u8cc7\u6e90\u8a9e\u8a00\u6709\u7528\uff0c\u4f46\u5176\u5728\u4f4e\u8cc7\u6e90\u8a9e\u8a00\u4e2d\u7684\u6709\u6548\u6027\uff0c\u7279\u5225\u662f\u5c0d\u65bc\u5e38\u8b58\u63a8\u7406\uff0c\u4ecd\u7136\u4e0d\u6e05\u695a\u3002\u5728\u672c\u6587\u4e2d\uff0c\u6211\u5011\u6bd4\u8f03\u4e86\u4e09\u7a2e\u8cc7\u6599\u96c6\u5efa\u7acb\u7b56\u7565\uff1a(1) LLM \u8f14\u52a9\u7684\u8cc7\u6599\u96c6\u751f\u6210\uff0c(2) \u6a5f\u5668\u7ffb\u8b6f\uff0c\u4ee5\u53ca (3) \u6bcd\u8a9e\u4eba\u58eb\u64b0\u5beb\u7684\u4eba\u5de5\u8cc7\u6599\uff0c\u4ee5\u5efa\u7acb\u5177\u6709\u6587\u5316\u7d30\u5fae\u5dee\u7684\u6545\u4e8b\u7406\u89e3\u8cc7\u6599\u96c6\u3002\u6211\u5011\u5c08\u6ce8\u65bc\u722a\u54c7\u8a9e\u548c\u5dfd\u4ed6\u8a9e\uff0c\u9019\u5169\u7a2e\u5370\u5c3c\u7684\u4e3b\u8981\u5730\u65b9\u8a9e\u8a00\uff0c\u4e26\u900f\u904e\u5ee3\u6cdb\u7684\u624b\u52d5\u9a57\u8b49\u8a55\u4f30\u958b\u653e\u6b0a\u91cd\u548c\u5c01\u9589\u6b0a\u91cd LLM \u5728\u5354\u52a9\u8cc7\u6599\u96c6\u5efa\u7acb\u4e2d\u7684\u6709\u6548\u6027\u3002\u70ba\u4e86\u8a55\u4f30\u5408\u6210\u8cc7\u6599\u7684\u6548\u7528\uff0c\u6211\u5011\u4f7f\u7528\u9019\u4e9b\u8cc7\u6599\u5c0d\u5206\u985e\u548c\u751f\u6210\u4efb\u52d9\u9032\u884c\u8a9e\u8a00\u6a21\u578b\u5fae\u8abf\uff0c\u4e26\u5728\u4eba\u5de5\u64b0\u5beb\u7684\u6e2c\u8a66\u96c6\u4e0a\u8a55\u4f30\u6548\u80fd\u3002\u6211\u5011\u7684\u7814\u7a76\u7d50\u679c\u8868\u660e\uff0cLLM \u8f14\u52a9\u7684\u8cc7\u6599\u5efa\u7acb\u512a\u65bc\u6a5f\u5668\u7ffb\u8b6f\u3002", "author": "Salsabila Zahirah Pranida et.al.", 
"authors": "Salsabila Zahirah Pranida, Rifo Ahmad Genadi, Fajri Koto", "id": "2502.12932v1", "paper_url": "http://arxiv.org/abs/2502.12932v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12945/2502.12945v1.json b/database/storage/2502/12945/2502.12945v1.json
new file mode 100644
index 0000000000..61534df61c
--- /dev/null
+++ b/database/storage/2502/12945/2502.12945v1.json
@@ -0,0 +1 @@
+{"2502.12945": {"publish_time": "2025-02-18", "title": "LLMPopcorn: An Empirical Study of LLMs as Assistants for Popular Micro-video Generation", "paper_summary": "Popular Micro-videos, dominant on platforms like TikTok and YouTube, hold\nsignificant commercial value. The rise of high-quality AI-generated content has\nspurred interest in AI-driven micro-video creation. However, despite the\nadvanced capabilities of large language models (LLMs) like ChatGPT and DeepSeek\nin text generation and reasoning, their potential to assist the creation of\npopular micro-videos remains largely unexplored.\n  In this paper, we conduct an empirical study on LLM-assisted popular\nmicro-video generation (LLMPopcorn). Specifically, we investigate the following\nresearch questions: (i) How can LLMs be effectively utilized to assist popular\nmicro-video generation? (ii) To what extent can prompt-based enhancements\noptimize the LLM-generated content for higher popularity? (iii) How well do\nvarious LLMs and video generators perform in the popular micro-video generation\ntask? By exploring these questions, we show that advanced LLMs like DeepSeek-V3\nenable micro-video generation to achieve popularity comparable to human-created\ncontent. Prompt enhancements further boost popularity, and benchmarking\nhighlights DeepSeek-V3 and DeepSeek-R1 among LLMs, while LTX-Video and\nHunyuanVideo lead in video generation. This pioneering work advances\nAI-assisted micro-video creation, uncovering new research opportunities. 
We\nwill release the code and datasets to support future studies.", "paper_summary_zh": "\u5728 TikTok \u548c YouTube \u7b49\u5e73\u53f0\u4e0a\u6d41\u884c\u7684\u5fae\u5f71\u7247\u5177\u6709\n\u91cd\u8981\u7684\u5546\u4e1a\u4ef7\u503c\u3002\u9ad8\u8d28\u91cf AI \u751f\u6210\u7684\u5185\u5bb9\u7684\u5174\u8d77\n\u6fc0\u53d1\u4e86\u4eba\u4eec\u5bf9 AI \u9a71\u52a8\u7684\u5fae\u5f71\u7247\u521b\u4f5c\u7684\u5174\u8da3\u3002\u7136\u800c\uff0c\u5c3d\u7ba1\u5927\u578b\u8bed\u8a00\u6a21\u578b (LLM) \u5982 ChatGPT \u548c DeepSeek\n\u5728\u6587\u672c\u751f\u6210\u548c\u63a8\u7406\u65b9\u9762\u7684\u80fd\u529b\u5f88\u5f3a\uff0c\u4f46\u5b83\u4eec\u5728\u8f85\u52a9\u521b\u5efa\n\u6d41\u884c\u5fae\u5f71\u7247\u65b9\u9762\u7684\u6f5c\u529b\u5728\u5f88\u5927\u7a0b\u5ea6\u4e0a\u4ecd\u672a\u5f97\u5230\u63a2\u7d22\u3002\n  \u5728\u672c\u6587\u4e2d\uff0c\u6211\u4eec\u5bf9 LLM \u8f85\u52a9\u7684\u6d41\u884c\n\u5fae\u5f71\u7247\u751f\u6210 (LLMPopcorn) \u8fdb\u884c\u4e86\u5b9e\u8bc1\u7814\u7a76\u3002\u5177\u4f53\u6765\u8bf4\uff0c\u6211\u4eec\u8c03\u67e5\u4e86\u4ee5\u4e0b\n\u7814\u7a76\u95ee\u9898\uff1a(i) \u5982\u4f55\u6709\u6548\u5229\u7528 LLM \u6765\u8f85\u52a9\u6d41\u884c\n\u5fae\u5f71\u7247\u751f\u6210\uff1f(ii) \u57fa\u4e8e\u63d0\u793a\u7684\u589e\u5f3a\u5728\u591a\u5927\u7a0b\u5ea6\u4e0a\u53ef\u4ee5\n\u4f18\u5316 LLM \u751f\u6210\u7684\u5185\u5bb9\u4ee5\u83b7\u5f97\u66f4\u9ad8\u7684\u6d41\u884c\u5ea6\uff1f(iii) \u5404\u79cd LLM \u548c\u89c6\u9891\u751f\u6210\u5668\u5728\u6d41\u884c\u7684\u5fae\u89c6\u9891\u751f\u6210\u4e2d\u8868\u73b0\u5982\u4f55\n\u4efb\u52a1\uff1f\u901a\u8fc7\u63a2\u7d22\u8fd9\u4e9b\u95ee\u9898\uff0c\u6211\u4eec\u8868\u660e\u4e86\u50cf DeepSeek-V3 \u8fd9\u6837\u7684\u9ad8\u7ea7 
LLM\n\u4f7f\u5fae\u89c6\u9891\u751f\u6210\u80fd\u591f\u8fbe\u5230\u4e0e\u4eba\u7c7b\u521b\u4f5c\u7684\u5185\u5bb9\u76f8\u5f53\u7684\u6d41\u884c\u5ea6\u3002\u63d0\u793a\u589e\u5f3a\u8fdb\u4e00\u6b65\u63d0\u9ad8\u4e86\u53d7\u6b22\u8fce\u7a0b\u5ea6\uff0c\u5e76\u4e14\u57fa\u51c6\u6d4b\u8bd5\u7a81\u51fa\u4e86 LLM \u4e2d\u7684 DeepSeek-V3 \u548c DeepSeek-R1\uff0c\u800c LTX-Video \u548c\nHunyuanVideo \u5728\u89c6\u9891\u751f\u6210\u4e2d\u9886\u5148\u3002\u8fd9\u9879\u5f00\u521b\u6027\u7684\u5de5\u4f5c\u63a8\u8fdb\u4e86\n\u4eba\u5de5\u667a\u80fd\u8f85\u52a9\u7684\u5fae\u89c6\u9891\u521b\u4f5c\uff0c\u53d1\u73b0\u4e86\u65b0\u7684\u7814\u7a76\u673a\u4f1a\u3002\u6211\u4eec\u5c06\u53d1\u5e03\u4ee3\u7801\u548c\u6570\u636e\u96c6\u4ee5\u652f\u6301\u672a\u6765\u7684\u7814\u7a76\u3002", "author": "Junchen Fu et.al.", "authors": "Junchen Fu, Xuri Ge, Kaiwen Zheng, Ioannis Arapakis, Xin Xin, Joemon M. Jose", "id": "2502.12945v1", "paper_url": "http://arxiv.org/abs/2502.12945v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12947/2502.12947v1.json b/database/storage/2502/12947/2502.12947v1.json
new file mode 100644
index 0000000000..35072c0ca2
--- /dev/null
+++ b/database/storage/2502/12947/2502.12947v1.json
@@ -0,0 +1 @@
+{"2502.12947": {"publish_time": "2025-02-18", "title": "Every Expert Matters: Towards Effective Knowledge Distillation for Mixture-of-Experts Language Models", "paper_summary": "With the emergence of Mixture-of-Experts (MoE), the efficient scaling of\nmodel size has accelerated the development of large language models in recent\nyears. However, their high memory requirements prevent their use in\nresource-constrained environments. While knowledge distillation (KD) has been a\nproven method for model compression, its application to MoE teacher models\nremains underexplored. Through our investigation, we discover that\nnon-activated experts in MoE models possess valuable knowledge that benefits\nstudent models. We further demonstrate that existing KD methods are not optimal\nfor compressing MoE models, as they fail to leverage this knowledge\neffectively. To address this, we propose two intuitive MoE-specific KD methods\nfor the first time: Knowledge Augmentation (KA) and Student-Aware Router (SAR),\nboth designed to effectively extract knowledge from all experts. Specifically,\nKA augments knowledge by sampling experts multiple times, while SAR uses all\nexperts and adjusts the expert weights through router training to provide\noptimal knowledge. 
Extensive experiments show that our methods outperform\nconventional KD methods, demonstrating their effectiveness for MoE teacher\nmodels.", "paper_summary_zh": "\u96a8\u8457 Mixture-of-Experts (MoE) \u7684\u51fa\u73fe\uff0c\u6a21\u578b\u898f\u6a21\u7684\u6709\u6548\u64f4\u5c55\u52a0\u901f\u4e86\u8fd1\u5e74\u4f86\u5927\u578b\u8a9e\u8a00\u6a21\u578b\u7684\u767c\u5c55\u3002\u7136\u800c\uff0c\u5b83\u5011\u7684\u9ad8\u8a18\u61b6\u9ad4\u9700\u6c42\u6703\u963b\u7919\u5b83\u5011\u5728\u8cc7\u6e90\u53d7\u9650\u7684\u74b0\u5883\u4e2d\u4f7f\u7528\u3002\u96d6\u7136\u77e5\u8b58\u84b8\u993e (KD) \u5df2\u88ab\u8b49\u660e\u662f\u4e00\u7a2e\u6a21\u578b\u58d3\u7e2e\u7684\u65b9\u6cd5\uff0c\u4f46\u5b83\u5728 MoE \u6559\u5e2b\u6a21\u578b\u4e2d\u7684\u61c9\u7528\u4ecd\u672a\u88ab\u5145\u5206\u63a2\u7d22\u3002\u900f\u904e\u6211\u5011\u7684\u8abf\u67e5\uff0c\u6211\u5011\u767c\u73fe MoE \u6a21\u578b\u4e2d\u672a\u88ab\u555f\u7528\u7684\u5c08\u5bb6\u64c1\u6709\u6709\u50f9\u503c\u7684\u77e5\u8b58\uff0c\u9019\u4e9b\u77e5\u8b58\u5c0d\u5b78\u751f\u6a21\u578b\u6709\u76ca\u3002\u6211\u5011\u9032\u4e00\u6b65\u8b49\u660e\uff0c\u73fe\u6709\u7684 KD \u65b9\u6cd5\u4e26\u975e\u58d3\u7e2e MoE \u6a21\u578b\u7684\u6700\u4f73\u65b9\u6cd5\uff0c\u56e0\u70ba\u5b83\u5011\u7121\u6cd5\u6709\u6548\u5229\u7528\u9019\u4e9b\u77e5\u8b58\u3002\u70ba\u4e86\u89e3\u6c7a\u9019\u500b\u554f\u984c\uff0c\u6211\u5011\u9996\u6b21\u63d0\u51fa\u5169\u7a2e\u76f4\u89c0\u7684 MoE \u5c08\u7528 KD \u65b9\u6cd5\uff1a\u77e5\u8b58\u64f4\u5145 (KA) \u548c\u5b78\u751f\u611f\u77e5\u8def\u7531\u5668 (SAR)\uff0c\u5169\u8005\u90fd\u65e8\u5728\u5f9e\u6240\u6709\u5c08\u5bb6\u6709\u6548\u63d0\u53d6\u77e5\u8b58\u3002\u5177\u9ad4\u4f86\u8aaa\uff0cKA \u900f\u904e\u591a\u6b21\u62bd\u6a23\u5c08\u5bb6\u4f86\u64f4\u5145\u77e5\u8b58\uff0c\u800c SAR 
\u4f7f\u7528\u6240\u6709\u5c08\u5bb6\u4e26\u900f\u904e\u8def\u7531\u5668\u8a13\u7df4\u8abf\u6574\u5c08\u5bb6\u6b0a\u91cd\u4ee5\u63d0\u4f9b\u6700\u4f73\u77e5\u8b58\u3002\u5ee3\u6cdb\u7684\u5be6\u9a57\u8868\u660e\uff0c\u6211\u5011\u7684\u65b9\u6cd5\u512a\u65bc\u50b3\u7d71\u7684 KD \u65b9\u6cd5\uff0c\u8b49\u660e\u4e86\u5b83\u5011\u5c0d MoE \u6559\u5e2b\u6a21\u578b\u7684\u6709\u6548\u6027\u3002", "author": "Gyeongman Kim et.al.", "authors": "Gyeongman Kim, Gyouk Chu, Eunho Yang", "id": "2502.12947v1", "paper_url": "http://arxiv.org/abs/2502.12947v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12948/2502.12948v1.json b/database/storage/2502/12948/2502.12948v1.json
new file mode 100644
index 0000000000..6cef878f2b
--- /dev/null
+++ b/database/storage/2502/12948/2502.12948v1.json
@@ -0,0 +1 @@
+{"2502.12948": {"publish_time": "2025-02-18", "title": "Fake It Till You Make It: Using Synthetic Data and Domain Knowledge for Improved Text-Based Learning for LGE Detection", "paper_summary": "Detection of hyperenhancement from cardiac LGE MRI images is a complex task\nrequiring significant clinical expertise. Although deep learning-based models\nhave shown promising results for the task, they require large amounts of data\nwith fine-grained annotations. Clinical reports generated for cardiac MR\nstudies contain rich, clinically relevant information, including the location,\nextent and etiology of any scars present. Although recently developed\nCLIP-based training enables pretraining models with image-text pairs, it\nrequires large amounts of data and further finetuning strategies on downstream\ntasks. In this study, we use various strategies rooted in domain knowledge to\ntrain a model for LGE detection solely using text from clinical reports, on a\nrelatively small clinical cohort of 965 patients. We improve performance\nthrough the use of synthetic data augmentation, by systematically creating scar\nimages and associated text. In addition, we standardize the orientation of the\nimages in an anatomy-informed way to enable better alignment of spatial and\ntext features. 
We also use a captioning loss to enable fine-grained supervision\nand explore the effect of pretraining of the vision encoder on performance.\nFinally, ablation studies are carried out to elucidate the contributions of\neach design component to the overall performance of the model.", "paper_summary_zh": "\u5f9e\u5fc3\u81df LGE MRI \u5f71\u50cf\u5075\u6e2c\u51fa\u904e\u5ea6\u589e\u5f37\u662f\u4e00\u9805\u8907\u96dc\u7684\u4efb\u52d9\uff0c\u9700\u8981\u986f\u8457\u7684\u81e8\u5e8a\u5c08\u696d\u77e5\u8b58\u3002\u5118\u7ba1\u57fa\u65bc\u6df1\u5ea6\u5b78\u7fd2\u7684\u6a21\u578b\u5df2\u986f\u793a\u51fa\u5c0d\u9019\u9805\u4efb\u52d9\u6709\u524d\u666f\u7684\u7d50\u679c\uff0c\u4f46\u5b83\u5011\u9700\u8981\u5927\u91cf\u5177\u6709\u7d30\u7dfb\u8a3b\u89e3\u7684\u8cc7\u6599\u3002\u70ba\u5fc3\u81df MR \u7814\u7a76\u7522\u751f\u7684\u81e8\u5e8a\u5831\u544a\u5305\u542b\u8c50\u5bcc\u4e14\u81e8\u5e8a\u4e0a\u76f8\u95dc\u7684\u8cc7\u8a0a\uff0c\u5305\u62ec\u4efb\u4f55\u75a4\u75d5\u7684\u4f4d\u7f6e\u3001\u7bc4\u570d\u548c\u75c5\u56e0\u3002\u5118\u7ba1\u6700\u8fd1\u958b\u767c\u7684\u57fa\u65bc CLIP \u7684\u8a13\u7df4\u80fd\u4f7f\u7528\u5f71\u50cf\u6587\u5b57\u5c0d\u9810\u8a13\u7df4\u6a21\u578b\uff0c\u4f46\u5b83\u9700\u8981\u5927\u91cf\u8cc7\u6599\u548c\u9032\u4e00\u6b65\u5fae\u8abf\u4e0b\u6e38\u4efb\u52d9\u7684\u7b56\u7565\u3002\u5728\u9019\u9805\u7814\u7a76\u4e2d\uff0c\u6211\u5011\u4f7f\u7528\u690d\u57fa\u65bc\u9818\u57df\u77e5\u8b58\u7684\u5404\u7a2e\u7b56\u7565\uff0c\u50c5\u4f7f\u7528\u4f86\u81ea\u81e8\u5e8a\u5831\u544a\u7684\u6587\u5b57\uff0c\u5728\u4e00\u500b\u76f8\u5c0d\u8f03\u5c0f\u7684 965 \u540d\u60a3\u8005\u81e8\u5e8a\u7fa4\u9ad4\u4e2d\u8a13\u7df4\u4e00\u500b LGE 
\u5075\u6e2c\u6a21\u578b\u3002\u6211\u5011\u900f\u904e\u4f7f\u7528\u5408\u6210\u8cc7\u6599\u64f4\u5145\u4f86\u6539\u5584\u6548\u80fd\uff0c\u7cfb\u7d71\u6027\u5730\u5efa\u7acb\u75a4\u75d5\u5f71\u50cf\u548c\u76f8\u95dc\u6587\u5b57\u3002\u6b64\u5916\uff0c\u6211\u5011\u4ee5\u89e3\u5256\u5b78\u544a\u77e5\u7684\u65b9\u5f0f\u6a19\u6e96\u5316\u5f71\u50cf\u65b9\u5411\uff0c\u4ee5\u4f7f\u7a7a\u9593\u548c\u6587\u5b57\u7279\u5fb5\u80fd\u66f4\u597d\u5730\u5c0d\u9f4a\u3002\u6211\u5011\u4e5f\u4f7f\u7528\u6a19\u984c\u640d\u5931\u4f86\u555f\u7528\u7d30\u7dfb\u7684\u76e3\u7763\uff0c\u4e26\u63a2\u8a0e\u8996\u89ba\u7de8\u78bc\u5668\u7684\u9810\u8a13\u7df4\u5c0d\u6548\u80fd\u7684\u5f71\u97ff\u3002\u6700\u5f8c\uff0c\u9032\u884c\u6d88\u878d\u7814\u7a76\u4ee5\u95e1\u660e\u6bcf\u500b\u8a2d\u8a08\u5143\u4ef6\u5c0d\u6a21\u578b\u6574\u9ad4\u6548\u80fd\u7684\u8ca2\u737b\u3002", "author": "Athira J Jacob et.al.", "authors": "Athira J Jacob, Puneet Sharma, Daniel Rueckert", "id": "2502.12948v1", "paper_url": "http://arxiv.org/abs/2502.12948v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12953/2502.12953v1.json b/database/storage/2502/12953/2502.12953v1.json
new file mode 100644
index 0000000000..27daf63b26
--- /dev/null
+++ b/database/storage/2502/12953/2502.12953v1.json
@@ -0,0 +1 @@
+{"2502.12953": {"publish_time": "2025-02-18", "title": "Task-Informed Anti-Curriculum by Masking Improves Downstream Performance on Text", "paper_summary": "Masked language modeling has become a widely adopted unsupervised technique\nto pre-train language models. However, the process of selecting tokens for\nmasking is random, and the percentage of masked tokens is typically fixed for\nthe entire training process. In this paper, we propose to adjust the masking\nratio and to decide which tokens to mask based on a novel task-informed\nanti-curriculum learning scheme. First, we harness task-specific knowledge\nabout useful and harmful tokens in order to determine which tokens to mask.\nSecond, we propose a cyclic decaying masking ratio, which corresponds to an\nanti-curriculum schedule (from hard to easy). We exemplify our novel\ntask-informed anti-curriculum by masking (TIACBM) approach across three diverse\ndownstream tasks: sentiment analysis, text classification by topic, and\nauthorship attribution. Our findings suggest that TIACBM enhances the ability\nof the model to focus on key task-relevant features, contributing to\nstatistically significant performance gains across tasks. 
We release our code\nat https://github.com/JarcaAndrei/TIACBM.", "paper_summary_zh": "\u906e\u853d\u8a9e\u8a00\u6a21\u578b\u5df2\u6210\u70ba\u4e00\u7a2e\u5ee3\u6cdb\u63a1\u7528\u7684\u7121\u76e3\u7763\u6280\u8853\uff0c\u7528\u65bc\u9810\u5148\u8a13\u7df4\u8a9e\u8a00\u6a21\u578b\u3002\u7136\u800c\uff0c\u9078\u64c7\u7528\u65bc\u906e\u853d\u7684\u8a5e\u5f59\u7684\u904e\u7a0b\u662f\u96a8\u6a5f\u7684\uff0c\u4e14\u906e\u853d\u8a5e\u5f59\u7684\u767e\u5206\u6bd4\u901a\u5e38\u5728\u6574\u500b\u8a13\u7df4\u904e\u7a0b\u4e2d\u662f\u56fa\u5b9a\u7684\u3002\u5728\u672c\u6587\u4e2d\uff0c\u6211\u5011\u5efa\u8b70\u8abf\u6574\u906e\u853d\u7387\uff0c\u4e26\u6839\u64da\u4e00\u7a2e\u65b0\u7a4e\u7684\u4efb\u52d9\u8cc7\u8a0a\u53cd\u8ab2\u7a0b\u5b78\u7fd2\u65b9\u6848\u4f86\u6c7a\u5b9a\u8981\u906e\u853d\u54ea\u4e9b\u8a5e\u5f59\u3002\u9996\u5148\uff0c\u6211\u5011\u5229\u7528\u4efb\u52d9\u7279\u5b9a\u7684\u77e5\u8b58\uff0c\u4e86\u89e3\u6709\u7528\u7684\u548c\u6709\u5bb3\u7684\u8a5e\u5f59\uff0c\u4ee5\u78ba\u5b9a\u8981\u906e\u853d\u54ea\u4e9b\u8a5e\u5f59\u3002\u5176\u6b21\uff0c\u6211\u5011\u63d0\u51fa\u4e00\u500b\u5faa\u74b0\u905e\u6e1b\u906e\u853d\u7387\uff0c\u9019\u5c0d\u61c9\u65bc\u4e00\u500b\u53cd\u8ab2\u7a0b\u8868\uff08\u5f9e\u96e3\u5230\u6613\uff09\u3002\u6211\u5011\u4ee5\u4e09\u9805\u4e0d\u540c\u7684\u4e0b\u6e38\u4efb\u52d9\u70ba\u4f8b\uff0c\u8aaa\u660e\u6211\u5011\u65b0\u7a4e\u7684\u4efb\u52d9\u8cc7\u8a0a\u53cd\u8ab2\u7a0b\u906e\u853d\uff08TIACBM\uff09\u65b9\u6cd5\uff1a\u60c5\u7dd2\u5206\u6790\u3001\u6309\u4e3b\u984c\u5206\u985e\u6587\u5b57\uff0c\u4ee5\u53ca\u4f5c\u8005\u6b78\u5c6c\u3002\u6211\u5011\u7684\u7814\u7a76\u7d50\u679c\u8868\u660e\uff0cTIACBM \u589e\u5f37\u4e86\u6a21\u578b\u5c08\u6ce8\u65bc\u95dc\u9375\u4efb\u52d9\u76f8\u95dc\u7279\u5fb5\u7684\u80fd\u529b\uff0c\u6709\u52a9\u65bc\u5728\u5404\u9805\u4efb\u52d9\u4e2d\u7372\u5f97\u5177\u6709\u7d71\u8a08\u610f\u7fa9\u7684\u6548\u80fd\u63d0\u5347\u3002\u6211\u5011\u5728 https://github.com/JarcaAndrei/TIACBM 
\u91cb\u51fa\u6211\u5011\u7684\u7a0b\u5f0f\u78bc\u3002", "author": "Andrei Jarca et.al.", "authors": "Andrei Jarca, Florinel Alin Croitoru, Radu Tudor Ionescu", "id": "2502.12953v1", "paper_url": "http://arxiv.org/abs/2502.12953v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12959/2502.12959v1.json b/database/storage/2502/12959/2502.12959v1.json
new file mode 100644
index 0000000000..76e978d62f
--- /dev/null
+++ b/database/storage/2502/12959/2502.12959v1.json
@@ -0,0 +1 @@
+{"2502.12959": {"publish_time": "2025-02-18", "title": "AlignFreeze: Navigating the Impact of Realignment on the Layers of Multilingual Models Across Diverse Languages", "paper_summary": "Realignment techniques are often employed to enhance cross-lingual transfer\nin multilingual language models, still, they can sometimes degrade performance\nin languages that differ significantly from the fine-tuned source language.\nThis paper introduces AlignFreeze, a method that freezes either the layers'\nlower half or upper half during realignment. Through controlled experiments on\n4 tasks, 3 models, and in 35 languages, we find that realignment affects all\nthe layers but can be the most detrimental to the lower ones. Freezing the\nlower layers can prevent performance degradation. Particularly, AlignFreeze\nimproves Part-of-Speech (PoS) tagging performances in languages where full\nrealignment fails: with XLM-R, it provides improvements of more than one\nstandard deviation in accuracy in seven more languages than full realignment.", "paper_summary_zh": "\u91cd\u65b0\u5c0d\u9f4a\u6280\u8853\u901a\u5e38\u7528\u65bc\u589e\u5f37\u591a\u8a9e\u8a00\u8a9e\u8a00\u6a21\u578b\u4e2d\u7684\u8de8\u8a9e\u8a00\u8f49\u79fb\uff0c\u7136\u800c\uff0c\u5b83\u5011\u6709\u6642\u6703\u964d\u4f4e\u8207\u5fae\u8abf\u6e90\u8a9e\u8a00\u986f\u8457\u4e0d\u540c\u7684\u8a9e\u8a00\u7684\u6548\u80fd\u3002\u672c\u6587\u4ecb\u7d39\u4e86 AlignFreeze\uff0c\u4e00\u7a2e\u5728\u91cd\u65b0\u5c0d\u9f4a\u671f\u9593\u51cd\u7d50\u5c64\u7684\u4e0b\u534a\u90e8\u6216\u4e0a\u534a\u90e8\u7684\u65b9\u6cd5\u3002\u900f\u904e 4 \u9805\u4efb\u52d9\u30013 \u500b\u6a21\u578b\u548c 35 \u7a2e\u8a9e\u8a00\u7684\u53d7\u63a7\u5be6\u9a57\uff0c\u6211\u5011\u767c\u73fe\u91cd\u65b0\u5c0d\u9f4a\u6703\u5f71\u97ff\u6240\u6709\u5c64\uff0c\u4f46\u5c0d\u8f03\u4f4e\u5c64\u7684\u5f71\u97ff\u6700\u5927\u3002\u51cd\u7d50\u8f03\u4f4e\u5c64\u53ef\u4ee5\u9632\u6b62\u6548\u80fd\u4e0b\u964d\u3002\u7279\u5225\u662f\uff0cAlignFreeze 
\u6539\u5584\u4e86\u5728\u5b8c\u5168\u91cd\u65b0\u5c0d\u9f4a\u5931\u6557\u7684\u8a9e\u8a00\u4e2d\u7684\u8a5e\u6027 (PoS) \u6a19\u8a18\u6548\u80fd\uff1a\u4f7f\u7528 XLM-R\uff0c\u5b83\u6bd4\u5b8c\u5168\u91cd\u65b0\u5c0d\u9f4a\u5728\u4e03\u7a2e\u8a9e\u8a00\u4e2d\u63d0\u4f9b\u4e86\u8d85\u904e\u4e00\u500b\u6a19\u6e96\u5dee\u7684\u6e96\u78ba\u5ea6\u6539\u9032\u3002", "author": "Steve Bakos et.al.", "authors": "Steve Bakos, F\u00e9lix Gaschi, David Guzm\u00e1n, Riddhi More, Kelly Chutong Li, En-Shiun Annie Lee", "id": "2502.12959v1", "paper_url": "http://arxiv.org/abs/2502.12959v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12961/2502.12961v1.json b/database/storage/2502/12961/2502.12961v1.json
new file mode 100644
index 0000000000..8fa3ffaa20
--- /dev/null
+++ b/database/storage/2502/12961/2502.12961v1.json
@@ -0,0 +1 @@
+{"2502.12961": {"publish_time": "2025-02-18", "title": "Adaptive Tool Use in Large Language Models with Meta-Cognition Trigger", "paper_summary": "Large language models (LLMs) have shown remarkable emergent capabilities,\ntransforming the execution of functional tasks by leveraging external tools for\ncomplex problems that require specialized processing or real-time data. While\nexisting research expands LLMs access to diverse tools (e.g., program\ninterpreters, search engines, weather/map apps), the necessity of using these\ntools is often overlooked, leading to indiscriminate tool invocation. This\nnaive approach raises two key issues:(1) increased delays due to unnecessary\ntool calls, and (2) potential errors resulting from faulty interactions with\nexternal tools. In this paper, we introduce meta-cognition as a proxy for LLMs\nself-assessment of their capabilities, representing the model's awareness of\nits own limitations. Based on this, we propose MeCo, an adaptive\ndecision-making strategy for external tool use. MeCo quantifies metacognitive\nscores by capturing high-level cognitive signals in the representation space,\nguiding when to invoke tools. Notably, MeCo is fine-tuning-free and incurs\nminimal cost. 
Our experiments show that MeCo accurately detects LLMs' internal\ncognitive signals and significantly improves tool-use decision-making across\nmultiple base models and benchmarks.", "paper_summary_zh": "\u5927\u578b\u8a9e\u8a00\u6a21\u578b (LLM) \u5df2\u5c55\u73fe\u51fa\u986f\u8457\u7684\u65b0\u8208\u80fd\u529b\uff0c\u900f\u904e\u904b\u7528\u5916\u90e8\u5de5\u5177\u4f86\u57f7\u884c\u529f\u80fd\u4efb\u52d9\uff0c\u89e3\u6c7a\u9700\u8981\u5c08\u696d\u8655\u7406\u6216\u5373\u6642\u8cc7\u6599\u7684\u8907\u96dc\u554f\u984c\uff0c\u5f9e\u800c\u8f49\u8b8a\u4efb\u52d9\u7684\u57f7\u884c\u65b9\u5f0f\u3002\u5118\u7ba1\u73fe\u6709\u7814\u7a76\u64f4\u5c55\u4e86 LLM \u5c0d\u5404\u7a2e\u5de5\u5177\u7684\u5b58\u53d6\uff08\u4f8b\u5982\u7a0b\u5f0f\u78bc\u8a6e\u91cb\u5668\u3001\u641c\u5c0b\u5f15\u64ce\u3001\u5929\u6c23/\u5730\u5716\u61c9\u7528\u7a0b\u5f0f\uff09\uff0c\u4f46\u4f7f\u7528\u9019\u4e9b\u5de5\u5177\u7684\u5fc5\u8981\u6027\u5f80\u5f80\u88ab\u5ffd\u7565\uff0c\u5c0e\u81f4\u4e0d\u52a0\u9078\u64c7\u5730\u547c\u53eb\u5de5\u5177\u3002\u9019\u7a2e\u5929\u771f\u7684\u65b9\u6cd5\u63d0\u51fa\u4e86\u5169\u500b\u95dc\u9375\u554f\u984c\uff1a(1) \u7531\u65bc\u4e0d\u5fc5\u8981\u7684\u5de5\u5177\u547c\u53eb\u800c\u5c0e\u81f4\u5ef6\u9072\u589e\u52a0\uff0c\u4ee5\u53ca (2) \u7531\u65bc\u8207\u5916\u90e8\u5de5\u5177\u4e92\u52d5\u932f\u8aa4\u800c\u5c0e\u81f4\u7684\u6f5b\u5728\u932f\u8aa4\u3002\u5728\u672c\u6587\u4e2d\uff0c\u6211\u5011\u5c07\u5143\u8a8d\u77e5\u5f15\u5165\u4f5c\u70ba LLM \u81ea\u6211\u8a55\u4f30\u5176\u80fd\u529b\u7684\u4ee3\u7406\uff0c\u4ee3\u8868\u6a21\u578b\u610f\u8b58\u5230\u5176\u81ea\u8eab\u7684\u9650\u5236\u3002\u57fa\u65bc\u6b64\uff0c\u6211\u5011\u63d0\u51fa\u4e86 MeCo\uff0c\u4e00\u7a2e\u7528\u65bc\u5916\u90e8\u5de5\u5177\u4f7f\u7528\u7684\u9069\u61c9\u6027\u6c7a\u7b56\u5236\u5b9a\u7b56\u7565\u3002MeCo 
\u900f\u904e\u64f7\u53d6\u8868\u5fb5\u7a7a\u9593\u4e2d\u7684\u9ad8\u968e\u8a8d\u77e5\u8a0a\u865f\u4f86\u91cf\u5316\u5143\u8a8d\u77e5\u5206\u6578\uff0c\u6307\u5c0e\u4f55\u6642\u547c\u53eb\u5de5\u5177\u3002\u503c\u5f97\u6ce8\u610f\u7684\u662f\uff0cMeCo \u662f\u514d\u5fae\u8abf\u7684\uff0c\u800c\u4e14\u6210\u672c\u6975\u4f4e\u3002\u6211\u5011\u7684\u5be6\u9a57\u8868\u660e\uff0cMeCo \u80fd\u5920\u6e96\u78ba\u5730\u5075\u6e2c LLM \u7684\u5167\u90e8\u8a8d\u77e5\u8a0a\u865f\uff0c\u4e26\u5927\u5e45\u6539\u5584\u8de8\u591a\u500b\u57fa\u672c\u6a21\u578b\u548c\u57fa\u6e96\u7684\u5de5\u5177\u4f7f\u7528\u6c7a\u7b56\u5236\u5b9a\u3002", "author": "Wenjun Li et.al.", "authors": "Wenjun Li, Dexun Li, Kuicai Dong, Cong Zhang, Hao Zhang, Weiwen Liu, Yasheng Wang, Ruiming Tang, Yong Liu", "id": "2502.12961v1", "paper_url": "http://arxiv.org/abs/2502.12961v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12962/2502.12962v1.json b/database/storage/2502/12962/2502.12962v1.json
new file mode 100644
index 0000000000..4616009ced
--- /dev/null
+++ b/database/storage/2502/12962/2502.12962v1.json
@@ -0,0 +1 @@
+{"2502.12962": {"publish_time": "2025-02-18", "title": "Infinite Retrieval: Attention Enhanced LLMs in Long-Context Processing", "paper_summary": "Limited by the context window size of Large Language Models(LLMs), handling\nvarious tasks with input tokens exceeding the upper limit has been challenging,\nwhether it is a simple direct retrieval task or a complex multi-hop reasoning\ntask. Although various methods have been proposed to enhance the long-context\nprocessing capabilities of LLMs, they either incur substantial post-training\ncosts, or require additional tool modules(e.g.,RAG), or have not shown\nsignificant improvement in realistic tasks. Our work observes the correlation\nbetween the attention distribution and generated answers across each layer, and\nestablishes the attention allocation aligns with retrieval-augmented\ncapabilities through experiments. Drawing on the above insights, we propose a\nnovel method InfiniRetri that leverages the LLMs's own attention information to\nenable accurate retrieval across inputs of infinitely length. Our evaluations\nindicate that InfiniRetri achieves 100% accuracy in the\nNeedle-In-a-Haystack(NIH) test over 1M tokens using a 0.5B parameter model,\nsurpassing other method or larger models and setting a new\nstate-of-the-art(SOTA). Moreover, our method achieves significant performance\nimprovements on real-world benchmarks, with a maximum 288% improvement. In\naddition, InfiniRetri can be applied to any Transformer-based LLMs without\nadditional training and substantially reduces inference latency and compute\noverhead in long texts. In summary, our comprehensive studies show\nInfiniRetri's potential for practical applications and creates a paradigm for\nretrievaling information using LLMs own capabilities under infinite-length\ntokens. 
Code will be released in link.", "paper_summary_zh": "\u53d7\u9650\u4e8e\u5927\u578b\u8bed\u8a00\u6a21\u578b (LLM) \u7684\u4e0a\u4e0b\u6587\u7a97\u53e3\u5927\u5c0f\uff0c\u5904\u7406\u8d85\u51fa\u4e0a\u9650\u7684\u8f93\u5165\u6807\u8bb0\u7684\u5404\u79cd\u4efb\u52a1\u4e00\u76f4\u5177\u6709\u6311\u6218\u6027\uff0c\u65e0\u8bba\u662f\u7b80\u5355\u7684\u76f4\u63a5\u68c0\u7d22\u4efb\u52a1\u8fd8\u662f\u590d\u6742\u7684\u591a\u8df3\u63a8\u7406\u4efb\u52a1\u3002\u867d\u7136\u5df2\u7ecf\u63d0\u51fa\u4e86\u5404\u79cd\u65b9\u6cd5\u6765\u589e\u5f3a LLM \u7684\u957f\u4e0a\u4e0b\u6587\u5904\u7406\u80fd\u529b\uff0c\u4f46\u5b83\u4eec\u8981\u4e48\u4ea7\u751f\u5927\u91cf\u7684\u540e\u8bad\u7ec3\u6210\u672c\uff0c\u8981\u4e48\u9700\u8981\u989d\u5916\u7684\u5de5\u5177\u6a21\u5757\uff08\u4f8b\u5982\uff0cRAG\uff09\uff0c\u8981\u4e48\u5728\u5b9e\u9645\u4efb\u52a1\u4e2d\u6ca1\u6709\u663e\u793a\u51fa\u663e\u7740\u7684\u6539\u8fdb\u3002\u6211\u4eec\u7684\u5de5\u4f5c\u89c2\u5bdf\u4e86\u6bcf\u5c42\u6ce8\u610f\u529b\u5206\u5e03\u548c\u751f\u6210\u7b54\u6848\u4e4b\u95f4\u7684\u76f8\u5173\u6027\uff0c\u5e76\u901a\u8fc7\u5b9e\u9a8c\u5efa\u7acb\u4e86\u6ce8\u610f\u529b\u5206\u914d\u4e0e\u68c0\u7d22\u589e\u5f3a\u80fd\u529b\u4fdd\u6301\u4e00\u81f4\u3002\u6839\u636e\u4e0a\u8ff0\u89c1\u89e3\uff0c\u6211\u4eec\u63d0\u51fa\u4e86\u4e00\u79cd\u65b0\u65b9\u6cd5 InfiniRetri\uff0c\u8be5\u65b9\u6cd5\u5229\u7528 LLM \u81ea\u8eab\u7684\u6ce8\u610f\u529b\u4fe1\u606f\u6765\u5b9e\u73b0\u5bf9\u65e0\u9650\u957f\u5ea6\u8f93\u5165\u7684\u51c6\u786e\u68c0\u7d22\u3002\u6211\u4eec\u7684\u8bc4\u4f30\u8868\u660e\uff0cInfiniRetri \u5728\u4f7f\u7528 0.5B \u53c2\u6570\u6a21\u578b\u5bf9\u8d85\u8fc7 100 \u4e07\u4e2a\u6807\u8bb0\u7684\u9488\u5934\u5e72\u8349\u5806 (NIH) \u6d4b\u8bd5\u4e2d\u5b9e\u73b0\u4e86 100% \u7684\u51c6\u786e\u7387\uff0c\u8d85\u8d8a\u4e86\u5176\u4ed6\u65b9\u6cd5\u6216\u66f4\u5927\u7684\u6a21\u578b\uff0c\u5e76\u521b\u9020\u4e86\u65b0\u7684\u6700\u5148\u8fdb 
(SOTA)\u3002\u6b64\u5916\uff0c\u6211\u4eec\u7684\u65b9\u6cd5\u5728\u5b9e\u9645\u57fa\u51c6\u4e0a\u5b9e\u73b0\u4e86\u663e\u8457\u7684\u6027\u80fd\u63d0\u5347\uff0c\u6700\u5927\u63d0\u5347\u4e86 288%\u3002\u6b64\u5916\uff0cInfiniRetri \u53ef\u4ee5\u5e94\u7528\u4e8e\u4efb\u4f55\u57fa\u4e8e Transformer \u7684 LLM\uff0c\u800c\u65e0\u9700\u989d\u5916\u7684\u8bad\u7ec3\uff0c\u5e76\u4e14\u53ef\u4ee5\u5927\u5e45\u51cf\u5c11\u63a8\u7406\u5ef6\u8fdf\u548c\u957f\u6587\u672c\u4e2d\u7684\u8ba1\u7b97\u5f00\u9500\u3002\u603b\u4e4b\uff0c\u6211\u4eec\u7684\u7efc\u5408\u7814\u7a76\u8868\u660e\u4e86 InfiniRetri \u5728\u5b9e\u9645\u5e94\u7528\u4e2d\u7684\u6f5c\u529b\uff0c\u5e76\u4e3a\u4f7f\u7528 LLM \u81ea\u8eab\u80fd\u529b\u5728\u65e0\u9650\u957f\u5ea6\u6807\u8bb0\u4e0b\u68c0\u7d22\u4fe1\u606f\u521b\u9020\u4e86\u4e00\u4e2a\u8303\u4f8b\u3002\u4ee3\u7801\u5c06\u5728\u94fe\u63a5\u4e2d\u53d1\u5e03\u3002", "author": "Xiaoju Ye et.al.", "authors": "Xiaoju Ye, Zhichun Wang, Jingyuan Wang", "id": "2502.12962v1", "paper_url": "http://arxiv.org/abs/2502.12962v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12964/2502.12964v1.json b/database/storage/2502/12964/2502.12964v1.json
new file mode 100644
index 0000000000..692c0e9200
--- /dev/null
+++ b/database/storage/2502/12964/2502.12964v1.json
@@ -0,0 +1 @@
+{"2502.12964": {"publish_time": "2025-02-18", "title": "Trust Me, I'm Wrong: High-Certainty Hallucinations in LLMs", "paper_summary": "Large Language Models (LLMs) often generate outputs that lack grounding in\nreal-world facts, a phenomenon known as hallucinations. Prior research has\nassociated hallucinations with model uncertainty, leveraging this relationship\nfor hallucination detection and mitigation. In this paper, we challenge the\nunderlying assumption that all hallucinations are associated with uncertainty.\nUsing knowledge detection and uncertainty measurement methods, we demonstrate\nthat models can hallucinate with high certainty even when they have the correct\nknowledge. We further show that high-certainty hallucinations are consistent\nacross models and datasets, distinctive enough to be singled out, and challenge\nexisting mitigation methods. Our findings reveal an overlooked aspect of\nhallucinations, emphasizing the need to understand their origins and improve\nmitigation strategies to enhance LLM safety. 
The code is available at\nhttps://github.com/technion-cs-nlp/Trust_me_Im_wrong .", "paper_summary_zh": "\u5927\u578b\u8a9e\u8a00\u6a21\u578b (LLM) \u7d93\u5e38\u7522\u751f\u7f3a\u4e4f\u771f\u5be6\u4e16\u754c\u4e8b\u5be6\u6839\u64da\u7684\u8f38\u51fa\uff0c\u9019\u7a2e\u73fe\u8c61\u7a31\u70ba\u5e7b\u89ba\u3002\u5148\u524d\u7684\u7814\u7a76\u5df2\u5c07\u5e7b\u89ba\u8207\u6a21\u578b\u4e0d\u78ba\u5b9a\u6027\u806f\u7e6b\u8d77\u4f86\uff0c\u5229\u7528\u9019\u7a2e\u95dc\u4fc2\u9032\u884c\u5e7b\u89ba\u5075\u6e2c\u548c\u7de9\u89e3\u3002\u5728\u672c\u6587\u4e2d\uff0c\u6211\u5011\u6311\u6230\u6240\u6709\u5e7b\u89ba\u90fd\u8207\u4e0d\u78ba\u5b9a\u6027\u76f8\u95dc\u7684\u57fa\u672c\u5047\u8a2d\u3002\u4f7f\u7528\u77e5\u8b58\u5075\u6e2c\u548c\u4e0d\u78ba\u5b9a\u6027\u6e2c\u91cf\u65b9\u6cd5\uff0c\u6211\u5011\u8b49\u660e\u6a21\u578b\u5373\u4f7f\u64c1\u6709\u6b63\u78ba\u7684\u77e5\u8b58\uff0c\u4e5f\u80fd\u4ee5\u9ad8\u5ea6\u78ba\u5b9a\u6027\u7522\u751f\u5e7b\u89ba\u3002\u6211\u5011\u9032\u4e00\u6b65\u8868\u660e\uff0c\u9ad8\u78ba\u5b9a\u6027\u5e7b\u89ba\u5728\u6a21\u578b\u548c\u8cc7\u6599\u96c6\u4e4b\u9593\u662f\u4e00\u81f4\u7684\uff0c\u8db3\u5920\u7368\u7279\u4ee5\u81f3\u65bc\u53ef\u4ee5\u55ae\u7368\u6311\u9078\u51fa\u4f86\uff0c\u4e26\u6311\u6230\u73fe\u6709\u7684\u7de9\u89e3\u65b9\u6cd5\u3002\u6211\u5011\u7684\u7814\u7a76\u7d50\u679c\u63ed\u793a\u4e86\u5e7b\u89ba\u7684\u4e00\u500b\u88ab\u5ffd\u8996\u7684\u65b9\u9762\uff0c\u5f37\u8abf\u9700\u8981\u4e86\u89e3\u5176\u8d77\u6e90\u4e26\u6539\u9032\u7de9\u89e3\u7b56\u7565\u4ee5\u589e\u5f37 LLM \u5b89\u5168\u6027\u3002\u53ef\u4ee5\u5728 https://github.com/technion-cs-nlp/Trust_me_Im_wrong \u627e\u5230\u7a0b\u5f0f\u78bc\u3002", "author": "Adi Simhi et.al.", "authors": "Adi Simhi, Itay Itzhak, Fazl Barez, Gabriel Stanovsky, Yonatan Belinkov", "id": "2502.12964v1", "paper_url": "http://arxiv.org/abs/2502.12964v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12965/2502.12965v1.json b/database/storage/2502/12965/2502.12965v1.json
new file mode 100644
index 0000000000..0fbd952e22
--- /dev/null
+++ b/database/storage/2502/12965/2502.12965v1.json
@@ -0,0 +1 @@
+{"2502.12965": {"publish_time": "2025-02-18", "title": "A Survey of Text Classification Under Class Distribution Shift", "paper_summary": "The basic underlying assumption of machine learning (ML) models is that the\ntraining and test data are sampled from the same distribution. However, in\ndaily practice, this assumption is often broken, i.e.~the distribution of the\ntest data changes over time, which hinders the application of conventional ML\nmodels. One domain where the distribution shift naturally occurs is text\nclassification, since people always find new topics to discuss. To this end, we\nsurvey research articles studying open-set text classification and related\ntasks. We divide the methods in this area based on the constraints that define\nthe kind of distribution shift and the corresponding problem formulation,\ni.e.~learning with the Universum, zero-shot learning, and open-set learning. We\nnext discuss the predominant mitigation approaches for each problem setup.\nFinally, we identify several future work directions, aiming to push the\nboundaries beyond the state of the art. Interestingly, we find that continual\nlearning can solve many of the issues caused by the shifting class\ndistribution. 
We maintain a list of relevant papers at\nhttps://github.com/Eduard6421/Open-Set-Survey.", "paper_summary_zh": "\u6a5f\u5668\u5b78\u7fd2 (ML) \u6a21\u578b\u7684\u57fa\u672c\u5047\u8a2d\u662f\u8a13\u7df4\u8cc7\u6599\u548c\u6e2c\u8a66\u8cc7\u6599\u53d6\u6a23\u81ea\u540c\u4e00\u500b\u5206\u4f48\u3002\u7136\u800c\uff0c\u5728\u65e5\u5e38\u5be6\u52d9\u4e2d\uff0c\u9019\u500b\u5047\u8a2d\u7d93\u5e38\u88ab\u6253\u7834\uff0c\u4e5f\u5c31\u662f\u8aaa\u6e2c\u8a66\u8cc7\u6599\u7684\u5206\u5e03\u6703\u96a8\u8457\u6642\u9593\u6539\u8b8a\uff0c\u9019\u6703\u963b\u7919\u50b3\u7d71 ML \u6a21\u578b\u7684\u61c9\u7528\u3002\u5206\u4f48\u8f49\u79fb\u81ea\u7136\u767c\u751f\u7684\u5176\u4e2d\u4e00\u500b\u9818\u57df\u662f\u6587\u5b57\u5206\u985e\uff0c\u56e0\u70ba\u4eba\u5011\u7e3d\u80fd\u627e\u5230\u65b0\u7684\u4e3b\u984c\u4f86\u8a0e\u8ad6\u3002\u70ba\u6b64\uff0c\u6211\u5011\u8abf\u67e5\u7814\u7a76\u958b\u653e\u96c6\u6587\u5b57\u5206\u985e\u548c\u76f8\u95dc\u4efb\u52d9\u7684\u7814\u7a76\u6587\u7ae0\u3002\u6211\u5011\u6839\u64da\u5b9a\u7fa9\u5206\u4f48\u8f49\u79fb\u7684\u985e\u578b\u548c\u5c0d\u61c9\u554f\u984c\u516c\u5f0f\u7684\u9650\u5236\uff0c\u5c07\u9019\u500b\u9818\u57df\u7684\u65b9\u6cd5\u5206\u70ba\uff1a\u4f7f\u7528 Universum \u5b78\u7fd2\u3001\u96f6\u6b21\u5b78\u7fd2\u548c\u958b\u653e\u96c6\u5b78\u7fd2\u3002\u63a5\u4e0b\u4f86\uff0c\u6211\u5011\u8a0e\u8ad6\u6bcf\u500b\u554f\u984c\u8a2d\u5b9a\u7684\u4e3b\u8981\u7de9\u89e3\u65b9\u6cd5\u3002\u6700\u5f8c\uff0c\u6211\u5011\u627e\u51fa\u5e7e\u500b\u672a\u4f86\u7684\u7814\u7a76\u65b9\u5411\uff0c\u76ee\u6a19\u662f\u5c07\u754c\u7dda\u63a8\u5c55\u5230\u73fe\u6709\u6280\u8853\u7684\u6975\u9650\u4e4b\u5916\u3002\u6709\u8da3\u7684\u662f\uff0c\u6211\u5011\u767c\u73fe\u6301\u7e8c\u5b78\u7fd2\u53ef\u4ee5\u89e3\u6c7a\u8a31\u591a\u7531\u985e\u5225\u5206\u4f48\u8f49\u79fb\u6240\u9020\u6210\u7684\u8b70\u984c\u3002\u6211\u5011\u5728 https://github.com/Eduard6421/Open-Set-Survey \u7dad\u8b77\u4e00\u4efd\u76f8\u95dc\u8ad6\u6587\u6e05\u55ae\u3002", "author": 
"Adriana Valentina Costache et.al.", "authors": "Adriana Valentina Costache, Silviu Florin Gheorghe, Eduard Gabriel Poesina, Paul Irofti, Radu Tudor Ionescu", "id": "2502.12965v1", "paper_url": "http://arxiv.org/abs/2502.12965v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12970/2502.12970v1.json b/database/storage/2502/12970/2502.12970v1.json
new file mode 100644
index 0000000000..89101c29f8
--- /dev/null
+++ b/database/storage/2502/12970/2502.12970v1.json
@@ -0,0 +1 @@
+{"2502.12970": {"publish_time": "2025-02-18", "title": "Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking", "paper_summary": "The reasoning abilities of Large Language Models (LLMs) have demonstrated\nremarkable advancement and exceptional performance across diverse domains.\nHowever, leveraging these reasoning capabilities to enhance LLM safety against\nadversarial attacks and jailbreak queries remains largely unexplored. To bridge\nthis gap, we propose Reasoning-to-Defend (R2D), a novel training paradigm that\nintegrates safety reflections of queries and responses into LLMs' generation\nprocess, unlocking a safety-aware reasoning mechanism. This approach enables\nself-evaluation at each reasoning step to create safety pivot tokens as\nindicators of the response's safety status. Furthermore, in order to improve\nthe learning efficiency of pivot token prediction, we propose Contrastive Pivot\nOptimization(CPO), which enhances the model's ability to perceive the safety\nstatus of dialogues. Through this mechanism, LLMs dynamically adjust their\nresponse strategies during reasoning, significantly enhancing their defense\ncapabilities against jailbreak attacks. 
Extensive experimental results\ndemonstrate that R2D effectively mitigates various attacks and improves overall\nsafety, highlighting the substantial potential of safety-aware reasoning in\nstrengthening LLMs' robustness against jailbreaks.", "paper_summary_zh": "\u5927\u578b\u8a9e\u8a00\u6a21\u578b (LLM) \u7684\u63a8\u7406\u80fd\u529b\u5df2\u5c55\u73fe\u51fa\u986f\u8457\u7684\u9032\u6b65\uff0c\u4e26\u5728\u4e0d\u540c\u7684\u9818\u57df\u4e2d\u8868\u73fe\u51fa\u8272\u3002\u7136\u800c\uff0c\u5229\u7528\u9019\u4e9b\u63a8\u7406\u80fd\u529b\u4f86\u589e\u5f37 LLM \u5c0d\u6297\u653b\u64ca\u548c\u8d8a\u7344\u67e5\u8a62\u7684\u5b89\u5168\u6027\u4ecd\u7136\u662f\u672a\u958b\u767c\u7684\u9818\u57df\u3002\u70ba\u4e86\u5f4c\u88dc\u9019\u500b\u5dee\u8ddd\uff0c\u6211\u5011\u63d0\u51fa\u4e86\u63a8\u7406\u9632\u79a6 (R2D)\uff0c\u9019\u662f\u4e00\u7a2e\u65b0\u7a4e\u7684\u8a13\u7df4\u7bc4\u4f8b\uff0c\u5b83\u5c07\u67e5\u8a62\u548c\u56de\u61c9\u7684\u5b89\u5168\u8003\u91cf\u6574\u5408\u5230 LLM \u7684\u751f\u6210\u904e\u7a0b\u4e2d\uff0c\u958b\u555f\u4e86\u4e00\u500b\u5b89\u5168\u611f\u77e5\u63a8\u7406\u6a5f\u5236\u3002\u6b64\u65b9\u6cd5\u53ef\u4ee5\u5728\u6bcf\u500b\u63a8\u7406\u6b65\u9a5f\u4e2d\u9032\u884c\u81ea\u6211\u8a55\u4f30\uff0c\u4ee5\u5efa\u7acb\u5b89\u5168\u6a1e\u7d10\u6a19\u8a18\uff0c\u4f5c\u70ba\u56de\u61c9\u5b89\u5168\u72c0\u614b\u7684\u6307\u6a19\u3002\u6b64\u5916\uff0c\u70ba\u4e86\u63d0\u9ad8\u6a1e\u7d10\u6a19\u8a18\u9810\u6e2c\u7684\u5b78\u7fd2\u6548\u7387\uff0c\u6211\u5011\u63d0\u51fa\u4e86\u5c0d\u6bd4\u6a1e\u7d10\u6700\u4f73\u5316 (CPO)\uff0c\u5b83\u589e\u5f37\u4e86\u6a21\u578b\u611f\u77e5\u5c0d\u8a71\u5b89\u5168\u72c0\u614b\u7684\u80fd\u529b\u3002\u900f\u904e\u6b64\u6a5f\u5236\uff0cLLM \u5728\u63a8\u7406\u904e\u7a0b\u4e2d\u52d5\u614b\u8abf\u6574\u5176\u56de\u61c9\u7b56\u7565\uff0c\u5927\u5e45\u589e\u5f37\u5176\u5c0d\u6297\u8d8a\u7344\u653b\u64ca\u7684\u9632\u79a6\u80fd\u529b\u3002\u5ee3\u6cdb\u7684\u5be6\u9a57\u7d50\u679c\u8b49\u660e\uff0cR2D 
\u6709\u6548\u5730\u6e1b\u8f15\u4e86\u5404\u7a2e\u653b\u64ca\uff0c\u4e26\u6539\u5584\u4e86\u6574\u9ad4\u5b89\u5168\u6027\uff0c\u7a81\u986f\u4e86\u5b89\u5168\u611f\u77e5\u63a8\u7406\u5728\u52a0\u5f37 LLM \u5c0d\u6297\u8d8a\u7344\u7684\u7a69\u5065\u6027\u65b9\u9762\u7684\u6f5b\u529b\u3002", "author": "Junda Zhu et.al.", "authors": "Junda Zhu, Lingyong Yan, Shuaiqiang Wang, Dawei Yin, Lei Sha", "id": "2502.12970v1", "paper_url": "http://arxiv.org/abs/2502.12970v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12982/2502.12982v1.json b/database/storage/2502/12982/2502.12982v1.json
new file mode 100644
index 0000000000..a1513c153f
--- /dev/null
+++ b/database/storage/2502/12982/2502.12982v1.json
@@ -0,0 +1 @@
+{"2502.12982": {"publish_time": "2025-02-18", "title": "Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs", "paper_summary": "Sailor2 is a family of cutting-edge multilingual language models for\nSouth-East Asian (SEA) languages, available in 1B, 8B, and 20B sizes to suit\ndiverse applications. Building on Qwen2.5, Sailor2 undergoes continuous\npre-training on 500B tokens (400B SEA-specific and 100B replay tokens) to\nsupport 13 SEA languages while retaining proficiency in Chinese and English.\nSailor2-20B model achieves a 50-50 win rate against GPT-4o across SEA\nlanguages. We also deliver a comprehensive cookbook on how to develop the\nmultilingual model in an efficient manner, including five key aspects: data\ncuration, pre-training, post-training, model customization and evaluation. We\nhope that Sailor2 model (Apache 2.0 license) will drive language development in\nthe SEA region, and Sailor2 cookbook will inspire researchers to build more\ninclusive LLMs for other under-served languages.", "paper_summary_zh": "Sailor2 \u662f\u4e00\u7cfb\u5217\u91dd\u5c0d\u6771\u5357\u4e9e (SEA) \u8a9e\u8a00\u7684\u5c16\u7aef\u591a\u8a9e\u8a00\u8a9e\u8a00\u6a21\u578b\uff0c\u5099\u6709 1B\u30018B \u548c 20B \u5927\u5c0f\uff0c\u4ee5\u9069\u61c9\u5404\u7a2e\u61c9\u7528\u3002\u5728 Qwen2.5 \u7684\u57fa\u790e\u4e0a\uff0cSailor2 \u6301\u7e8c\u9032\u884c 500B \u4ee3\u5e63\uff08400B SEA \u5c08\u7528\u548c 100B \u91cd\u64ad\u4ee3\u5e63\uff09\u7684\u9810\u8a13\u7df4\uff0c\u4ee5\u652f\u63f4 13 \u7a2e SEA \u8a9e\u8a00\uff0c\u540c\u6642\u4fdd\u7559\u4e2d\u6587\u548c\u82f1\u6587\u7684\u719f\u7df4\u5ea6\u3002Sailor2-20B \u6a21\u578b\u5728 SEA \u8a9e\u8a00\u4e2d\u5c0d\u6297 GPT-4o \u6642\uff0c\u9054\u5230 50-50 
\u7684\u7372\u52dd\u7387\u3002\u6211\u5011\u9084\u63d0\u4f9b\u4e00\u672c\u5168\u9762\u7684\u98df\u8b5c\uff0c\u8aaa\u660e\u5982\u4f55\u4ee5\u6709\u6548\u7684\u65b9\u5f0f\u958b\u767c\u591a\u8a9e\u8a00\u6a21\u578b\uff0c\u5305\u62ec\u4e94\u500b\u95dc\u9375\u65b9\u9762\uff1a\u8cc7\u6599\u7b56\u5c55\u3001\u9810\u8a13\u7df4\u3001\u5f8c\u8a13\u7df4\u3001\u6a21\u578b\u81ea\u8a02\u548c\u8a55\u4f30\u3002\u6211\u5011\u5e0c\u671b Sailor2 \u6a21\u578b\uff08Apache 2.0 \u6388\u6b0a\uff09\u5c07\u63a8\u52d5 SEA \u5730\u5340\u7684\u8a9e\u8a00\u767c\u5c55\uff0c\u800c Sailor2 \u98df\u8b5c\u5c07\u6fc0\u52f5\u7814\u7a76\u4eba\u54e1\u70ba\u5176\u4ed6\u670d\u52d9\u4e0d\u8db3\u7684\u8a9e\u8a00\u5efa\u7acb\u66f4\u5177\u5305\u5bb9\u6027\u7684 LLM\u3002", "author": "Longxu Dou et.al.", "authors": "Longxu Dou, Qian Liu, Fan Zhou, Changyu Chen, Zili Wang, Ziqi Jin, Zichen Liu, Tongyao Zhu, Cunxiao Du, Penghui Yang, Haonan Wang, Jiaheng Liu, Yongchi Zhao, Xiachong Feng, Xin Mao, Man Tsung Yeung, Kunat Pipatanakul, Fajri Koto, Min Si Thu, Hynek Kydl\u00ed\u010dek, Zeyi Liu, Qunshu Lin, Sittipong Sripaisarnmongkol, Kridtaphad Sae-Khow, Nirattisai Thongchim, Taechawat Konkaew, Narong Borijindargoon, Anh Dao, Matichon Maneegard, Phakphum Artkaew, Zheng-Xin Yong, Quan Nguyen, Wannaphong Phatthiyaphaibun, Hoang H. Tran, Mike Zhang, Shiqi Chen, Tianyu Pang, Chao Du, Xinyi Wan, Wei Lu, Min Lin", "id": "2502.12982v1", "paper_url": "http://arxiv.org/abs/2502.12982v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12985/2502.12985v1.json b/database/storage/2502/12985/2502.12985v1.json
new file mode 100644
index 0000000000..04c77766b3
--- /dev/null
+++ b/database/storage/2502/12985/2502.12985v1.json
@@ -0,0 +1 @@
+{"2502.12985": {"publish_time": "2025-02-18", "title": "PartSDF: Part-Based Implicit Neural Representation for Composite 3D Shape Parametrization and Optimization", "paper_summary": "Accurate 3D shape representation is essential in engineering applications\nsuch as design, optimization, and simulation. In practice, engineering\nworkflows require structured, part-aware representations, as objects are\ninherently designed as assemblies of distinct components. However, most\nexisting methods either model shapes holistically or decompose them without\npredefined part structures, limiting their applicability in real-world design\ntasks. We propose PartSDF, a supervised implicit representation framework that\nexplicitly models composite shapes with independent, controllable parts while\nmaintaining shape consistency. Despite its simple single-decoder architecture,\nPartSDF outperforms both supervised and unsupervised baselines in\nreconstruction and generation tasks. We further demonstrate its effectiveness\nas a structured shape prior for engineering applications, enabling precise\ncontrol over individual components while preserving overall coherence. 
Code\navailable at https://github.com/cvlab-epfl/PartSDF.", "paper_summary_zh": "\u7cbe\u78ba\u7684 3D \u5f62\u72c0\u8868\u793a\u5728\u5de5\u7a0b\u61c9\u7528\u4e2d\u81f3\u95dc\u91cd\u8981\uff0c\u4f8b\u5982\u8a2d\u8a08\u3001\u6700\u4f73\u5316\u548c\u6a21\u64ec\u3002\u5be6\u969b\u4e0a\uff0c\u5de5\u7a0b\u5de5\u4f5c\u6d41\u7a0b\u9700\u8981\u7d50\u69cb\u5316\u3001\u96f6\u4ef6\u611f\u77e5\u7684\u8868\u793a\uff0c\u56e0\u70ba\u7269\u9ad4\u672c\u8cea\u4e0a\u662f\u8a2d\u8a08\u70ba\u4e0d\u540c\u5143\u4ef6\u7684\u7d44\u4ef6\u3002\u7136\u800c\uff0c\u5927\u591a\u6578\u73fe\u6709\u65b9\u6cd5\u4e0d\u662f\u6574\u9ad4\u5efa\u6a21\u5f62\u72c0\uff0c\u5c31\u662f\u5c07\u5176\u5206\u89e3\uff0c\u800c\u6c92\u6709\u9810\u5148\u5b9a\u7fa9\u7684\u96f6\u4ef6\u7d50\u69cb\uff0c\u9019\u9650\u5236\u4e86\u5b83\u5011\u5728\u5be6\u969b\u8a2d\u8a08\u4efb\u52d9\u4e2d\u7684\u9069\u7528\u6027\u3002\u6211\u5011\u63d0\u51fa PartSDF\uff0c\u4e00\u500b\u76e3\u7763\u5f0f\u7684\u96b1\u5f0f\u8868\u793a\u6846\u67b6\uff0c\u5b83\u660e\u78ba\u5730\u4f7f\u7528\u7368\u7acb\u3001\u53ef\u63a7\u7684\u96f6\u4ef6\u5c0d\u8907\u5408\u5f62\u72c0\u9032\u884c\u5efa\u6a21\uff0c\u540c\u6642\u4fdd\u6301\u5f62\u72c0\u4e00\u81f4\u6027\u3002\u5118\u7ba1\u5176\u55ae\u4e00\u7684\u89e3\u78bc\u5668\u67b6\u69cb\u5f88\u7c21\u55ae\uff0c\u4f46 PartSDF \u5728\u91cd\u5efa\u548c\u751f\u6210\u4efb\u52d9\u4e2d\u90fd\u512a\u65bc\u76e3\u7763\u5f0f\u548c\u975e\u76e3\u7763\u5f0f\u57fa\u6e96\u3002\u6211\u5011\u9032\u4e00\u6b65\u8b49\u660e\u4e86\u5176\u4f5c\u70ba\u5de5\u7a0b\u61c9\u7528\u7d50\u69cb\u5316\u5f62\u72c0\u5148\u9a57\u7684\u6709\u6548\u6027\uff0c\u80fd\u5920\u7cbe\u78ba\u63a7\u5236\u5404\u500b\u5143\u4ef6\uff0c\u540c\u6642\u4fdd\u6301\u6574\u9ad4\u4e00\u81f4\u6027\u3002\u7a0b\u5f0f\u78bc\u53ef\u5728 https://github.com/cvlab-epfl/PartSDF \u53d6\u5f97\u3002", "author": "Nicolas Talabot et.al.", "authors": "Nicolas Talabot, Olivier Clerc, Arda Cinar Demirtas, Doruk Oner, Pascal Fua", "id": "2502.12985v1", "paper_url": 
"http://arxiv.org/abs/2502.12985v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12988/2502.12988v1.json b/database/storage/2502/12988/2502.12988v1.json
new file mode 100644
index 0000000000..ea2118b74d
--- /dev/null
+++ b/database/storage/2502/12988/2502.12988v1.json
@@ -0,0 +1 @@
+{"2502.12988": {"publish_time": "2025-02-18", "title": "Beyond Profile: From Surface-Level Facts to Deep Persona Simulation in LLMs", "paper_summary": "Previous approaches to persona simulation large language models (LLMs) have\ntypically relied on learning basic biographical information, or using limited\nrole-play dialogue datasets to capture a character's responses. However, a\nholistic representation of an individual goes beyond surface-level facts or\nconversations to deeper thoughts and thinking. In this work, we introduce\nCharacterBot, a model designed to replicate both the linguistic patterns and\ndistinctive thought processes of a character. Using Lu Xun, a renowned Chinese\nwriter, as a case study, we propose four training tasks derived from his 17\nessay collections. These include a pre-training task focused on mastering\nexternal linguistic structures and knowledge, as well as three fine-tuning\ntasks: multiple-choice question answering, generative question answering, and\nstyle transfer, each aligning the LLM with Lu Xun's internal ideation and\nwriting style. To optimize learning across these tasks, we introduce a CharLoRA\nparameter updating mechanism, where a general linguistic style expert\ncollaborates with other task-specific experts to better study both the language\nstyle and the understanding of deeper thoughts. We evaluate CharacterBot on\nthree tasks for linguistic accuracy and opinion comprehension, demonstrating\nthat it significantly outperforms the baselines on our adapted metrics. 
We hope\nthat this work inspires future research on deep character persona simulation\nLLM.", "paper_summary_zh": "<paragraph>\u4ee5\u524d\u5c0d\u89d2\u8272\u6a21\u64ec\u5927\u578b\u8a9e\u8a00\u6a21\u578b (LLM) \u7684\u65b9\u6cd5\u901a\u5e38\u4f9d\u8cf4\u65bc\u5b78\u7fd2\u57fa\u672c\u50b3\u8a18\u8cc7\u8a0a\uff0c\u6216\u4f7f\u7528\u6709\u9650\u7684\u89d2\u8272\u626e\u6f14\u5c0d\u8a71\u8cc7\u6599\u96c6\u4f86\u6355\u6349\u89d2\u8272\u7684\u53cd\u61c9\u3002\u7136\u800c\uff0c\u5c0d\u500b\u4eba\u7684\u6574\u9ad4\u8868\u5fb5\u8d85\u8d8a\u4e86\u8868\u9762\u5c64\u9762\u7684\u4e8b\u5be6\u6216\u5c0d\u8a71\uff0c\u6df1\u5165\u5230\u66f4\u6df1\u5c64\u7684\u60f3\u6cd5\u548c\u601d\u8003\u3002\u5728\u9019\u9805\u5de5\u4f5c\u4e2d\uff0c\u6211\u5011\u5f15\u5165\u4e86 CharacterBot\uff0c\u4e00\u500b\u65e8\u5728\u8907\u88fd\u89d2\u8272\u7684\u8a9e\u8a00\u6a21\u5f0f\u548c\u7368\u7279\u601d\u8003\u904e\u7a0b\u7684\u6a21\u578b\u3002\u4ee5\u8457\u540d\u7684\u4e2d\u570b\u4f5c\u5bb6\u9b6f\u8fc5\u70ba\u6848\u4f8b\u7814\u7a76\uff0c\u6211\u5011\u63d0\u51fa\u4e86\u56db\u500b\u5f9e\u4ed6\u7684 17 \u7bc7\u6563\u6587\u96c6\u4e2d\u884d\u751f\u7684\u8a13\u7df4\u4efb\u52d9\u3002\u5176\u4e2d\u5305\u62ec\u4e00\u500b\u9810\u8a13\u7df4\u4efb\u52d9\uff0c\u5c08\u6ce8\u65bc\u638c\u63e1\u5916\u90e8\u8a9e\u8a00\u7d50\u69cb\u548c\u77e5\u8b58\uff0c\u4ee5\u53ca\u4e09\u500b\u5fae\u8abf\u4efb\u52d9\uff1a\u591a\u9078\u984c\u56de\u7b54\u3001\u751f\u6210\u5f0f\u554f\u7b54\u548c\u98a8\u683c\u8f49\u79fb\uff0c\u6bcf\u500b\u4efb\u52d9\u90fd\u5c07 LLM \u8207\u9b6f\u8fc5\u7684\u5167\u90e8\u89c0\u5ff5\u548c\u5beb\u4f5c\u98a8\u683c\u76f8\u7d50\u5408\u3002\u70ba\u4e86\u512a\u5316\u9019\u4e9b\u4efb\u52d9\u7684\u5b78\u7fd2\uff0c\u6211\u5011\u5f15\u5165\u4e86\u4e00\u500b CharLoRA 
\u53c3\u6578\u66f4\u65b0\u6a5f\u5236\uff0c\u5176\u4e2d\u4e00\u4f4d\u901a\u66c9\u8a9e\u8a00\u98a8\u683c\u7684\u5c08\u5bb6\u8207\u5176\u4ed6\u7279\u5b9a\u4efb\u52d9\u5c08\u5bb6\u5408\u4f5c\uff0c\u4ee5\u66f4\u597d\u5730\u7814\u7a76\u8a9e\u8a00\u98a8\u683c\u548c\u5c0d\u6df1\u5c64\u601d\u60f3\u7684\u7406\u89e3\u3002\u6211\u5011\u5728\u4e09\u9805\u4efb\u52d9\u4e0a\u8a55\u4f30\u4e86 CharacterBot \u7684\u8a9e\u8a00\u6e96\u78ba\u6027\u548c\u610f\u898b\u7406\u89e3\uff0c\u8b49\u660e\u5b83\u5728\u6211\u5011\u8abf\u6574\u7684\u6307\u6a19\u4e0a\u986f\u8457\u512a\u65bc\u57fa\u6e96\u3002\u6211\u5011\u5e0c\u671b\u9019\u9805\u5de5\u4f5c\u80fd\u6fc0\u52f5\u672a\u4f86\u5c0d\u6df1\u5ea6\u89d2\u8272\u89d2\u8272\u6a21\u64ec LLM \u7684\u7814\u7a76\u3002</paragraph>", "author": "Zixiao Wang et.al.", "authors": "Zixiao Wang, Duzhen Zhang, Ishita Agrawal, Shen Gao, Le Song, Xiuying Chen", "id": "2502.12988v1", "paper_url": "http://arxiv.org/abs/2502.12988v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12992/2502.12992v1.json b/database/storage/2502/12992/2502.12992v1.json
new file mode 100644
index 0000000000..6b9686c5dc
--- /dev/null
+++ b/database/storage/2502/12992/2502.12992v1.json
@@ -0,0 +1 @@
+{"2502.12992": {"publish_time": "2025-02-18", "title": "B-cos LM: Efficiently Transforming Pre-trained Language Models for Improved Explainability", "paper_summary": "Post-hoc explanation methods for black-box models often struggle with\nfaithfulness and human interpretability due to the lack of explainability in\ncurrent neural models. Meanwhile, B-cos networks have been introduced to\nimprove model explainability through architectural and computational\nadaptations, but their application has so far been limited to computer vision\nmodels and their associated training pipelines. In this work, we introduce\nB-cos LMs, i.e., B-cos networks empowered for NLP tasks. Our approach directly\ntransforms pre-trained language models into B-cos LMs by combining B-cos\nconversion and task fine-tuning, improving efficiency compared to previous\nB-cos methods. Our automatic and human evaluation results demonstrate that\nB-cos LMs produce more faithful and human interpretable explanations than post\nhoc methods, while maintaining task performance comparable to conventional\nfine-tuning. Our in-depth analysis explores how B-cos LMs differ from\nconventionally fine-tuned models in their learning processes and explanation\npatterns. Finally, we provide practical guidelines for effectively building\nB-cos LMs based on our findings. 
Our code is available at\nhttps://anonymous.4open.science/r/bcos_lm.", "paper_summary_zh": "\u9ed1\u76d2\u6a21\u578b\u7684\u4e8b\u540e\u89e3\u91ca\u65b9\u6cd5\u901a\u5e38\u4f1a\u56e0\u4e3a\u5f53\u524d\u795e\u7ecf\u6a21\u578b\u7f3a\u4e4f\u53ef\u89e3\u91ca\u6027\u800c\u96be\u4ee5\u505a\u5230\u5fe0\u5b9e\u548c\u4eba\u7c7b\u53ef\u89e3\u91ca\u3002\u4e0e\u6b64\u540c\u65f6\uff0cB-cos \u7f51\u7edc\u5df2\u88ab\u5f15\u5165\uff0c\u4ee5\u901a\u8fc7\u67b6\u6784\u548c\u8ba1\u7b97\u6539\u7f16\u6765\u63d0\u9ad8\u6a21\u578b\u7684\u53ef\u89e3\u91ca\u6027\uff0c\u4f46\u5230\u76ee\u524d\u4e3a\u6b62\uff0c\u5b83\u4eec\u7684\u5e94\u7528\u4ec5\u9650\u4e8e\u8ba1\u7b97\u673a\u89c6\u89c9\u6a21\u578b\u53ca\u5176\u76f8\u5173\u7684\u8bad\u7ec3\u7ba1\u9053\u3002\u5728\u8fd9\u9879\u5de5\u4f5c\u4e2d\uff0c\u6211\u4eec\u5f15\u5165\u4e86 B-cos LM\uff0c\u5373\u9488\u5bf9 NLP \u4efb\u52a1\u589e\u5f3a\u7684 B-cos \u7f51\u7edc\u3002\u6211\u4eec\u7684\u65b9\u6cd5\u901a\u8fc7\u7ed3\u5408 B-cos \u8f6c\u6362\u548c\u4efb\u52a1\u5fae\u8c03\uff0c\u5c06\u9884\u8bad\u7ec3\u7684\u8bed\u8a00\u6a21\u578b\u76f4\u63a5\u8f6c\u6362\u4e3a B-cos LM\uff0c\u4e0e\u4ee5\u524d B-cos \u65b9\u6cd5\u76f8\u6bd4\uff0c\u63d0\u9ad8\u4e86\u6548\u7387\u3002\u6211\u4eec\u7684\u81ea\u52a8\u548c\u4eba\u5de5\u8bc4\u4f30\u7ed3\u679c\u8868\u660e\uff0c\u4e0e\u4e8b\u540e\u65b9\u6cd5\u76f8\u6bd4\uff0cB-cos LM \u4ea7\u751f\u4e86\u66f4\u5fe0\u5b9e\u548c\u4eba\u7c7b\u53ef\u89e3\u91ca\u7684\u89e3\u91ca\uff0c\u540c\u65f6\u4fdd\u6301\u4e0e\u4f20\u7edf\u5fae\u8c03\u76f8\u5f53\u7684\u4efb\u52a1\u6027\u80fd\u3002\u6211\u4eec\u7684\u6df1\u5165\u5206\u6790\u63a2\u8ba8\u4e86 B-cos LM \u5728\u5176\u5b66\u4e60\u8fc7\u7a0b\u548c\u89e3\u91ca\u6a21\u5f0f\u4e2d\u4e0e\u4f20\u7edf\u5fae\u8c03\u6a21\u578b\u6709\u4f55\u4e0d\u540c\u3002\u6700\u540e\uff0c\u6211\u4eec\u6839\u636e\u6211\u4eec\u7684\u53d1\u73b0\u63d0\u4f9b\u4e86\u6709\u6548\u6784\u5efa B-cos LM \u7684\u5b9e\u7528\u6307\u5357\u3002\u6211\u4eec\u7684\u4ee3\u7801\u53ef\u5728 
https://anonymous.4open.science/r/bcos_lm \u83b7\u5f97\u3002", "author": "Yifan Wang et.al.", "authors": "Yifan Wang, Sukrut Rao, Ji-Ung Lee, Mayank Jobanputra, Vera Demberg", "id": "2502.12992v1", "paper_url": "http://arxiv.org/abs/2502.12992v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12995/2502.12995v1.json b/database/storage/2502/12995/2502.12995v1.json
new file mode 100644
index 0000000000..6ba1d055ff
--- /dev/null
+++ b/database/storage/2502/12995/2502.12995v1.json
@@ -0,0 +1 @@
+{"2502.12995": {"publish_time": "2025-02-18", "title": "Free Argumentative Exchanges for Explaining Image Classifiers", "paper_summary": "Deep learning models are powerful image classifiers but their opacity hinders\ntheir trustworthiness. Explanation methods for capturing the reasoning process\nwithin these classifiers faithfully and in a clear manner are scarce, due to\ntheir sheer complexity and size. We provide a solution for this problem by\ndefining a novel method for explaining the outputs of image classifiers with\ndebates between two agents, each arguing for a particular class. We obtain\nthese debates as concrete instances of Free Argumentative eXchanges (FAXs), a\nnovel argumentation-based multi-agent framework allowing agents to internalise\nopinions by other agents differently than originally stated. We define two\nmetrics (consensus and persuasion rate) to assess the usefulness of FAXs as\nargumentative explanations for image classifiers. We then conduct a number of\nempirical experiments showing that FAXs perform well along these metrics as\nwell as being more faithful to the image classifiers than conventional,\nnon-argumentative explanation methods. 
All our implementations can be found at\nhttps://github.com/koriavinash1/FAX.", "paper_summary_zh": "\u6df1\u5ea6\u5b78\u7fd2\u6a21\u578b\u662f\u5f37\u5927\u7684\u5f71\u50cf\u5206\u985e\u5668\uff0c\u4f46\u5176\u4e0d\u900f\u660e\u6027\u963b\u7919\u4e86\u5176\u53ef\u4fe1\u5ea6\u3002\u7531\u65bc\u5176\u6975\u9ad8\u7684\u8907\u96dc\u6027\u548c\u898f\u6a21\uff0c\u5fe0\u5be6\u4e14\u6e05\u695a\u5730\u6355\u6349\u9019\u4e9b\u5206\u985e\u5668\u5167\u90e8\u63a8\u7406\u904e\u7a0b\u7684\u89e3\u91cb\u65b9\u6cd5\u5f88\u5c11\u898b\u3002\u6211\u5011\u900f\u904e\u5b9a\u7fa9\u4e00\u7a2e\u65b0\u7a4e\u7684\u65b9\u6cd5\u4f86\u89e3\u6c7a\u9019\u500b\u554f\u984c\uff0c\u8a72\u65b9\u6cd5\u900f\u904e\u5169\u500b\u4ee3\u7406\u4e4b\u9593\u7684\u8faf\u8ad6\u4f86\u89e3\u91cb\u5f71\u50cf\u5206\u985e\u5668\u7684\u8f38\u51fa\uff0c\u6bcf\u500b\u4ee3\u7406\u90fd\u4e3b\u5f35\u4e00\u500b\u7279\u5b9a\u985e\u5225\u3002\u6211\u5011\u5c07\u9019\u4e9b\u8faf\u8ad6\u4f5c\u70ba\u81ea\u7531\u8ad6\u8b49\u4ea4\u63db (FAX) \u7684\u5177\u9ad4\u5be6\u4f8b\uff0c\u9019\u662f\u4e00\u500b\u65b0\u7a4e\u7684\u57fa\u65bc\u8ad6\u8b49\u7684\u591a\u4ee3\u7406\u67b6\u69cb\uff0c\u5141\u8a31\u4ee3\u7406\u4ee5\u4e0d\u540c\u65bc\u539f\u59cb\u9673\u8ff0\u7684\u65b9\u5f0f\u5167\u5316\u5176\u4ed6\u4ee3\u7406\u7684\u610f\u898b\u3002\u6211\u5011\u5b9a\u7fa9\u4e86\u5169\u500b\u6307\u6a19\uff08\u5171\u8b58\u7387\u548c\u8aaa\u670d\u7387\uff09\u4f86\u8a55\u4f30 FAX \u4f5c\u70ba\u5f71\u50cf\u5206\u985e\u5668\u8ad6\u8b49\u89e3\u91cb\u7684\u6709\u7528\u6027\u3002\u7136\u5f8c\uff0c\u6211\u5011\u9032\u884c\u4e86\u591a\u9805\u5be6\u8b49\u5be6\u9a57\uff0c\u8868\u660e FAX \u5728\u9019\u4e9b\u6307\u6a19\u4e0a\u8868\u73fe\u826f\u597d\uff0c\u4e26\u4e14\u6bd4\u50b3\u7d71\u7684\u975e\u8ad6\u8b49\u89e3\u91cb\u65b9\u6cd5\u66f4\u5fe0\u5be6\u65bc\u5f71\u50cf\u5206\u985e\u5668\u3002\u6211\u5011\u6240\u6709\u7684\u5be6\u4f5c\u90fd\u53ef\u4ee5\u5728 https://github.com/koriavinash1/FAX \u4e2d\u627e\u5230\u3002", "author": "Avinash Kori et.al.", "authors": 
"Avinash Kori, Antonio Rago, Francesca Toni", "id": "2502.12995v1", "paper_url": "http://arxiv.org/abs/2502.12995v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12996/2502.12996v1.json b/database/storage/2502/12996/2502.12996v1.json
new file mode 100644
index 0000000000..15cc43308b
--- /dev/null
+++ b/database/storage/2502/12996/2502.12996v1.json
@@ -0,0 +1 @@
+{"2502.12996": {"publish_time": "2025-02-18", "title": "Eager Updates For Overlapped Communication and Computation in DiLoCo", "paper_summary": "Distributed optimization methods such as DiLoCo have been shown to be\neffective in training very large models across multiple distributed workers,\nsuch as datacenters. These methods split updates into two parts: an inner\noptimization phase, where the workers independently execute multiple\noptimization steps on their own local data, and an outer optimization step,\nwhere the inner updates are synchronized. While such approaches require orders\nof magnitude less communication than standard data-parallel training, in\nsettings where the workers are datacenters, even the limited communication\nrequirements of these approaches can still cause significant slow downs due to\nthe blocking necessary at each outer optimization step. In this paper, we\ninvestigate techniques to mitigate this issue by overlapping communication with\ncomputation in a manner that allows the outer optimization step to fully\noverlap with the inner optimization phase. 
We show that a particular variant,\ndubbed eager updates, provides competitive performance with standard DiLoCo in\nsettings with low bandwidth between workers.", "paper_summary_zh": "\u5206\u6563\u5f0f\u512a\u5316\u65b9\u6cd5\uff08\u4f8b\u5982 DiLoCo\uff09\u5df2\u88ab\u8b49\u660e\u53ef\u6709\u6548\u8a13\u7df4\u6a6b\u8de8\u591a\u500b\u5206\u6563\u5f0f\u5de5\u4f5c\u8005\u7684\u8d85\u5927\u578b\u6a21\u578b\uff0c\u4f8b\u5982\u8cc7\u6599\u4e2d\u5fc3\u3002\u9019\u4e9b\u65b9\u6cd5\u5c07\u66f4\u65b0\u62c6\u5206\u70ba\u5169\u90e8\u5206\uff1a\u5167\u90e8\u6700\u4f73\u5316\u968e\u6bb5\uff0c\u5176\u4e2d\u5de5\u4f5c\u8005\u7368\u7acb\u5730\u5728\u81ea\u5df1\u7684\u672c\u5730\u8cc7\u6599\u4e0a\u57f7\u884c\u591a\u500b\u6700\u4f73\u5316\u6b65\u9a5f\uff0c\u4ee5\u53ca\u5916\u90e8\u6700\u4f73\u5316\u6b65\u9a5f\uff0c\u5176\u4e2d\u5167\u90e8\u66f4\u65b0\u6703\u540c\u6b65\u3002\u96d6\u7136\u6b64\u985e\u65b9\u6cd5\u6240\u9700\u7684\u901a\u8a0a\u91cf\u6bd4\u6a19\u6e96\u8cc7\u6599\u5e73\u884c\u8a13\u7df4\u5c11\u5e7e\u500b\u6578\u91cf\u7d1a\uff0c\u4f46\u5728\u5de5\u4f5c\u8005\u70ba\u8cc7\u6599\u4e2d\u5fc3\u7684\u60c5\u6cc1\u4e0b\uff0c\u5373\u4f7f\u9019\u4e9b\u65b9\u6cd5\u6709\u9650\u7684\u901a\u8a0a\u9700\u6c42\u4ecd\u53ef\u80fd\u7531\u65bc\u6bcf\u500b\u5916\u90e8\u6700\u4f73\u5316\u6b65\u9a5f\u6240\u9700\u7684\u5c01\u9396\u800c\u5c0e\u81f4\u986f\u8457\u7684\u6e1b\u901f\u3002\u5728\u672c\u6587\u4e2d\uff0c\u6211\u5011\u63a2\u8a0e\u4e86\u900f\u904e\u4ee5\u5141\u8a31\u5916\u90e8\u6700\u4f73\u5316\u6b65\u9a5f\u8207\u5167\u90e8\u6700\u4f73\u5316\u968e\u6bb5\u5b8c\u5168\u91cd\u758a\u7684\u65b9\u5f0f\u5c07\u901a\u8a0a\u8207\u904b\u7b97\u91cd\u758a\uff0c\u4f86\u6e1b\u8f15\u6b64\u554f\u984c\u7684\u6280\u8853\u3002\u6211\u5011\u5c55\u793a\u4e86\u4e00\u500b\u7279\u5b9a\u8b8a\u9ad4\uff0c\u7a31\u70ba\u5373\u6642\u66f4\u65b0\uff0c\u5728\u5de5\u4f5c\u8005\u4e4b\u9593\u983b\u5bec\u8f03\u4f4e\u7684\u60c5\u6cc1\u4e0b\uff0c\u53ef\u63d0\u4f9b\u8207\u6a19\u6e96 DiLoCo \u76f8\u7576\u7684\u6548\u80fd\u3002", 
"author": "Satyen Kale et.al.", "authors": "Satyen Kale, Arthur Douillard, Yanislav Donchev", "id": "2502.12996v1", "paper_url": "http://arxiv.org/abs/2502.12996v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/12998/2502.12998v1.json b/database/storage/2502/12998/2502.12998v1.json
new file mode 100644
index 0000000000..6fec7be7b6
--- /dev/null
+++ b/database/storage/2502/12998/2502.12998v1.json
@@ -0,0 +1 @@
+{"2502.12998": {"publish_time": "2025-02-18", "title": "Personalized Top-k Set Queries Over Predicted Scores", "paper_summary": "This work studies the applicability of expensive external oracles such as\nlarge language models in answering top-k queries over predicted scores. Such\nscores are incurred by user-defined functions to answer personalized queries\nover multi-modal data. We propose a generic computational framework that\nhandles arbitrary set-based scoring functions, as long as the functions could\nbe decomposed into constructs, each of which sent to an oracle (in our case an\nLLM) to predict partial scores. At a given point in time, the framework assumes\na set of responses and their partial predicted scores, and it maintains a\ncollection of possible sets that are likely to be the true top-k. Since calling\noracles is costly, our framework judiciously identifies the next construct,\ni.e., the next best question to ask the oracle so as to maximize the likelihood\nof identifying the true top-k. We present a principled probabilistic model that\nquantifies that likelihood. We study efficiency opportunities in designing\nalgorithms. We run an evaluation with three large scale datasets, scoring\nfunctions, and baselines. Experiments indicate the efficacy of our framework,\nas it achieves an order of magnitude improvement over baselines in requiring\nLLM calls while ensuring result accuracy. 
Scalability experiments further\nindicate that our framework could be used in large-scale applications.", "paper_summary_zh": "\u672c\u7814\u7a76\u63a2\u8a0e\u5728\u9810\u6e2c\u5206\u6578\u4e2d\u56de\u7b54\u524d k \u500b\u67e5\u8a62\u6642\uff0c\u6602\u8cb4\u7684\u5916\u90e8\u9810\u8a00\uff08\u4f8b\u5982\u5927\u578b\u8a9e\u8a00\u6a21\u578b\uff09\u7684\u9069\u7528\u6027\u3002\u6b64\u985e\u5206\u6578\u662f\u7531\u4f7f\u7528\u8005\u5b9a\u7fa9\u7684\u51fd\u5f0f\u7522\u751f\uff0c\u7528\u65bc\u56de\u7b54\u591a\u6a21\u614b\u8cc7\u6599\u4e2d\u7684\u500b\u4eba\u5316\u67e5\u8a62\u3002\u6211\u5011\u63d0\u51fa\u4e00\u500b\u901a\u7528\u7684\u904b\u7b97\u6846\u67b6\uff0c\u7528\u65bc\u8655\u7406\u4efb\u610f\u57fa\u65bc\u96c6\u5408\u7684\u8a08\u5206\u51fd\u5f0f\uff0c\u53ea\u8981\u9019\u4e9b\u51fd\u5f0f\u53ef\u4ee5\u5206\u89e3\u70ba\u5efa\u69cb\u5340\u584a\uff0c\u7136\u5f8c\u5c07\u6bcf\u500b\u5efa\u69cb\u5340\u584a\u50b3\u9001\u7d66\u9810\u8a00\uff08\u5728\u672c\u4f8b\u4e2d\u70ba LLM\uff09\u4ee5\u9810\u6e2c\u90e8\u5206\u5206\u6578\u3002\u5728\u7279\u5b9a\u6642\u9593\u9ede\uff0c\u6b64\u6846\u67b6\u5047\u8a2d\u4e00\u7d44\u56de\u61c9\u53ca\u5176\u90e8\u5206\u9810\u6e2c\u5206\u6578\uff0c\u4e26\u7dad\u8b77\u4e00\u7d44\u53ef\u80fd\u6210\u70ba\u771f\u5be6\u524d k \u500b\u7684\u96c6\u5408\u3002\u7531\u65bc\u547c\u53eb\u9810\u8a00\u7684\u6210\u672c\u5f88\u9ad8\uff0c\u56e0\u6b64\u6211\u5011\u7684\u6846\u67b6\u6703\u660e\u667a\u5730\u627e\u51fa\u4e0b\u4e00\u500b\u5efa\u69cb\u5340\u584a\uff0c\u4ea6\u5373\u4e0b\u4e00\u500b\u6700\u4f73\u554f\u984c\uff0c\u4ee5\u8a62\u554f\u9810\u8a00\uff0c\u4ee5\u4fbf\u6700\u5927\u5316\u627e\u51fa\u771f\u5be6\u524d k 
\u500b\u7684\u53ef\u80fd\u6027\u3002\u6211\u5011\u63d0\u51fa\u4e00\u500b\u57fa\u65bc\u539f\u7406\u7684\u6a5f\u7387\u6a21\u578b\uff0c\u7528\u65bc\u91cf\u5316\u6b64\u53ef\u80fd\u6027\u3002\u6211\u5011\u7814\u7a76\u8a2d\u8a08\u6f14\u7b97\u6cd5\u6642\u7684\u6548\u7387\u6a5f\u6703\u3002\u6211\u5011\u91dd\u5c0d\u4e09\u500b\u5927\u578b\u8cc7\u6599\u96c6\u3001\u8a08\u5206\u51fd\u5f0f\u548c\u57fa\u6e96\u57f7\u884c\u8a55\u4f30\u3002\u5be6\u9a57\u7d50\u679c\u6307\u51fa\u6211\u5011\u6846\u67b6\u7684\u6548\u80fd\uff0c\u56e0\u70ba\u5b83\u5728\u9700\u8981 LLM \u547c\u53eb\u7684\u540c\u6642\u78ba\u4fdd\u7d50\u679c\u6e96\u78ba\u6027\uff0c\u6bd4\u57fa\u6e96\u9032\u6b65\u4e86\u4e00\u500b\u6578\u91cf\u7d1a\u3002\u53ef\u64f4\u5145\u6027\u5be6\u9a57\u9032\u4e00\u6b65\u6307\u51fa\u6211\u5011\u7684\u6846\u67b6\u53ef\u7528\u65bc\u5927\u578b\u61c9\u7528\u7a0b\u5f0f\u3002", "author": "Sohrab Namazi Nia et.al.", "authors": "Sohrab Namazi Nia, Subhodeep Ghosh, Senjuti Basu Roy, Sihem Amer-Yahia", "id": "2502.12998v1", "paper_url": "http://arxiv.org/abs/2502.12998v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/13001/2502.13001v1.json b/database/storage/2502/13001/2502.13001v1.json
new file mode 100644
index 0000000000..0158189c9b
--- /dev/null
+++ b/database/storage/2502/13001/2502.13001v1.json
@@ -0,0 +1 @@
+{"2502.13001": {"publish_time": "2025-02-18", "title": "You need to MIMIC to get FAME: Solving Meeting Transcript Scarcity with a Multi-Agent Conversations", "paper_summary": "Meeting summarization suffers from limited high-quality data, mainly due to\nprivacy restrictions and expensive collection processes. We address this gap\nwith FAME, a dataset of 500 meetings in English and 300 in German produced by\nMIMIC, our new multi-agent meeting synthesis framework that generates meeting\ntranscripts on a given knowledge source by defining psychologically grounded\nparticipant profiles, outlining the conversation, and orchestrating a large\nlanguage model (LLM) debate. A modular post-processing step refines these\noutputs, mitigating potential repetitiveness and overly formal tones, ensuring\ncoherent, credible dialogues at scale. We also propose a psychologically\ngrounded evaluation framework assessing naturalness, social behavior\nauthenticity, and transcript difficulties. Human assessments show that FAME\napproximates real-meeting spontaneity (4.5/5 in naturalness), preserves\nspeaker-centric challenges (3/5 in spoken language), and introduces richer\ninformation-oriented difficulty (4/5 in difficulty). These findings highlight\nthat FAME is a good and scalable proxy for real-world meeting conditions. 
It\nenables new test scenarios for meeting summarization research and other\nconversation-centric applications in tasks requiring conversation data or\nsimulating social scenarios under behavioral constraints.", "paper_summary_zh": "\u6703\u8b70\u6458\u8981\u56e0\u7f3a\u4e4f\u9ad8\u54c1\u8cea\u8cc7\u6599\u800c\u53d7\u9650\uff0c\u4e3b\u8981\u662f\u7531\u65bc\u96b1\u79c1\u9650\u5236\u548c\u6602\u8cb4\u7684\u6536\u96c6\u7a0b\u5e8f\u3002\u6211\u5011\u900f\u904e FAME \u4f86\u89e3\u6c7a\u9019\u500b\u5dee\u8ddd\uff0cFAME \u662f MIMIC \u88fd\u4f5c\u7684 500 \u5834\u82f1\u6587\u6703\u8b70\u548c 300 \u5834\u5fb7\u6587\u6703\u8b70\u7684\u8cc7\u6599\u96c6\uff0cMIMIC \u662f\u6211\u5011\u65b0\u7684\u591a\u91cd\u4ee3\u7406\u6703\u8b70\u5408\u6210\u67b6\u69cb\uff0c\u900f\u904e\u5b9a\u7fa9\u5fc3\u7406\u57fa\u790e\u7684\u53c3\u8207\u8005\u8a2d\u5b9a\u6a94\u3001\u6982\u8ff0\u5c0d\u8a71\uff0c\u4e26\u5354\u8abf\u5927\u578b\u8a9e\u8a00\u6a21\u578b (LLM) \u8faf\u8ad6\uff0c\u5728\u7d66\u5b9a\u7684\u77e5\u8b58\u4f86\u6e90\u4e0a\u7522\u751f\u6703\u8b70\u8a18\u9304\u3002\u6a21\u7d44\u5316\u5f8c\u8655\u7406\u6b65\u9a5f\u6703\u6539\u5584\u9019\u4e9b\u8f38\u51fa\uff0c\u6e1b\u8f15\u6f5b\u5728\u7684\u91cd\u8907\u6027\u548c\u904e\u65bc\u6b63\u5f0f\u7684\u8a9e\u6c23\uff0c\u78ba\u4fdd\u5927\u898f\u6a21\u7684\u5c0d\u8a71\u9023\u8cab\u4e14\u53ef\u4fe1\u3002\u6211\u5011\u4e5f\u63d0\u51fa\u4e00\u500b\u5fc3\u7406\u57fa\u790e\u7684\u8a55\u4f30\u67b6\u69cb\uff0c\u8a55\u4f30\u81ea\u7136\u6027\u3001\u793e\u4ea4\u884c\u70ba\u771f\u5be6\u6027\uff0c\u4ee5\u53ca\u8a18\u9304\u96e3\u5ea6\u3002\u4eba\u985e\u8a55\u4f30\u986f\u793a\uff0cFAME \u8fd1\u4f3c\u65bc\u771f\u5be6\u6703\u8b70\u7684\u5373\u8208\u6027\uff08\u81ea\u7136\u6027 4.5/5\uff09\uff0c\u4fdd\u7559\u4ee5\u8b1b\u8005\u70ba\u4e2d\u5fc3\u7684\u6311\u6230\uff08\u53e3\u8a9e 3/5\uff09\uff0c\u4e26\u5f15\u5165\u66f4\u8c50\u5bcc\u7684\u8cc7\u8a0a\u5c0e\u5411\u96e3\u5ea6\uff08\u96e3\u5ea6 4/5\uff09\u3002\u9019\u4e9b\u767c\u73fe\u5f37\u8abf FAME 
\u662f\u771f\u5be6\u4e16\u754c\u6703\u8b70\u689d\u4ef6\u7684\u826f\u597d\u4e14\u53ef\u64f4\u5145\u7684\u4ee3\u7406\u3002\u5b83\u80fd\u70ba\u6703\u8b70\u6458\u8981\u7814\u7a76\u548c\u5176\u4ed6\u5c0d\u8a71\u70ba\u4e2d\u5fc3\u7684\u61c9\u7528\u7a0b\u5f0f\u555f\u7528\u65b0\u7684\u6e2c\u8a66\u60c5\u5883\uff0c\u5728\u9700\u8981\u5c0d\u8a71\u8cc7\u6599\u6216\u5728\u884c\u70ba\u9650\u5236\u4e0b\u6a21\u64ec\u793e\u4ea4\u60c5\u5883\u7684\u4efb\u52d9\u4e2d\u3002", "author": "Frederic Kirstein et.al.", "authors": "Frederic Kirstein, Muneeb Khan, Jan Philip Wahle, Terry Ruas, Bela Gipp", "id": "2502.13001v1", "paper_url": "http://arxiv.org/abs/2502.13001v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/13004/2502.13004v1.json b/database/storage/2502/13004/2502.13004v1.json
new file mode 100644
index 0000000000..6868c884ca
--- /dev/null
+++ b/database/storage/2502/13004/2502.13004v1.json
@@ -0,0 +1 @@
+{"2502.13004": {"publish_time": "2025-02-18", "title": "Language Barriers: Evaluating Cross-Lingual Performance of CNN and Transformer Architectures for Speech Quality Estimation", "paper_summary": "Objective speech quality models aim to predict human-perceived speech quality\nusing automated methods. However, cross-lingual generalization remains a major\nchallenge, as Mean Opinion Scores (MOS) vary across languages due to\nlinguistic, perceptual, and dataset-specific differences. A model trained\nprimarily on English data may struggle to generalize to languages with\ndifferent phonetic, tonal, and prosodic characteristics, leading to\ninconsistencies in objective assessments. This study investigates the\ncross-lingual performance of two speech quality models: NISQA, a CNN-based\nmodel, and a Transformer-based Audio Spectrogram Transformer (AST) model. Both\nmodels were trained exclusively on English datasets containing over 49,000\nspeech samples and subsequently evaluated on speech in German, French,\nMandarin, Swedish, and Dutch. We analyze model performance using Pearson\nCorrelation Coefficient (PCC) and Root Mean Square Error (RMSE) across five\nspeech quality dimensions: coloration, discontinuity, loudness, noise, and MOS.\nOur findings show that while AST achieves a more stable cross-lingual\nperformance, both models exhibit noticeable biases. Notably, Mandarin speech\nquality predictions correlate highly with human MOS scores, whereas Swedish and\nDutch present greater prediction challenges. Discontinuities remain difficult\nto model across all languages. 
These results highlight the need for more\nbalanced multilingual datasets and architecture-specific adaptations to improve\ncross-lingual generalization.", "paper_summary_zh": "\u5ba2\u89c0\u8a9e\u97f3\u54c1\u8cea\u6a21\u578b\u65e8\u5728\u4f7f\u7528\u81ea\u52d5\u5316\u65b9\u6cd5\u9810\u6e2c\u4eba\u985e\u611f\u77e5\u7684\u8a9e\u97f3\u54c1\u8cea\u3002\u7136\u800c\uff0c\u8de8\u8a9e\u8a00\u7684\u6982\u5316\u4ecd\u7136\u662f\u4e00\u9805\u91cd\u5927\u6311\u6230\uff0c\u56e0\u70ba\u5e73\u5747\u610f\u898b\u5206\u6578 (MOS) \u6703\u56e0\u8a9e\u8a00\u7684\u4e0d\u540c\u800c\u6709\u6240\u4e0d\u540c\uff0c\u9019\u662f\u7531\u65bc\u8a9e\u8a00\u3001\u611f\u77e5\u548c\u7279\u5b9a\u65bc\u8cc7\u6599\u96c6\u7684\u5dee\u7570\u6240\u81f4\u3002\u4e3b\u8981\u4f7f\u7528\u82f1\u8a9e\u8cc7\u6599\u8a13\u7df4\u7684\u6a21\u578b\u53ef\u80fd\u6703\u96e3\u4ee5\u6982\u5316\u5230\u5177\u6709\u4e0d\u540c\u8a9e\u97f3\u3001\u8072\u8abf\u548c\u97fb\u5f8b\u7279\u5fb5\u7684\u8a9e\u8a00\uff0c\u5c0e\u81f4\u5ba2\u89c0\u8a55\u4f30\u4e0d\u4e00\u81f4\u3002\u672c\u7814\u7a76\u63a2\u8a0e\u4e86\u5169\u7a2e\u8a9e\u97f3\u54c1\u8cea\u6a21\u578b\u7684\u8de8\u8a9e\u8a00\u6548\u80fd\uff1a\u57fa\u65bc CNN \u7684 NISQA \u6a21\u578b\u548c\u57fa\u65bc Transformer \u7684\u97f3\u8a0a\u5149\u8b5c Transformer (AST) \u6a21\u578b\u3002\u9019\u5169\u7a2e\u6a21\u578b\u90fd\u50c5\u4f7f\u7528\u5305\u542b\u8d85\u904e 49,000 \u500b\u8a9e\u97f3\u7bc4\u4f8b\u7684\u82f1\u8a9e\u8cc7\u6599\u96c6\u9032\u884c\u8a13\u7df4\uff0c\u7136\u5f8c\u5728\u5fb7\u8a9e\u3001\u6cd5\u8a9e\u3001\u666e\u901a\u8a71\u3001\u745e\u5178\u8a9e\u548c\u8377\u862d\u8a9e\u7684\u8a9e\u97f3\u4e0a\u9032\u884c\u8a55\u4f30\u3002\u6211\u5011\u4f7f\u7528\u76ae\u723e\u68ee\u76f8\u95dc\u4fc2\u6578 (PCC) \u548c\u5747\u65b9\u6839\u8aa4\u5dee (RMSE) \u5206\u6790\u4e94\u500b\u8a9e\u97f3\u54c1\u8cea\u7dad\u5ea6\u7684\u6a21\u578b\u6548\u80fd\uff1a\u8272\u5f69\u3001\u4e0d\u9023\u7e8c\u6027\u3001\u97ff\u5ea6\u3001\u96dc\u8a0a\u548c 
MOS\u3002\u6211\u5011\u7684\u7814\u7a76\u7d50\u679c\u986f\u793a\uff0c\u5118\u7ba1 AST \u9054\u5230\u4e86\u66f4\u7a69\u5b9a\u7684\u8de8\u8a9e\u8a00\u6548\u80fd\uff0c\u4f46\u9019\u5169\u7a2e\u6a21\u578b\u90fd\u8868\u73fe\u51fa\u660e\u986f\u7684\u504f\u5dee\u3002\u503c\u5f97\u6ce8\u610f\u7684\u662f\uff0c\u666e\u901a\u8a71\u8a9e\u97f3\u54c1\u8cea\u9810\u6e2c\u8207\u4eba\u985e MOS \u5206\u6578\u9ad8\u5ea6\u76f8\u95dc\uff0c\u800c\u745e\u5178\u8a9e\u548c\u8377\u862d\u8a9e\u5247\u5448\u73fe\u51fa\u66f4\u5927\u7684\u9810\u6e2c\u6311\u6230\u3002\u4e0d\u9023\u7e8c\u6027\u5728\u6240\u6709\u8a9e\u8a00\u4e2d\u4ecd\u7136\u96e3\u4ee5\u5efa\u6a21\u3002\u9019\u4e9b\u7d50\u679c\u51f8\u986f\u4e86\u5c0d\u66f4\u5e73\u8861\u7684\u591a\u8a9e\u8a00\u8cc7\u6599\u96c6\u548c\u7279\u5b9a\u65bc\u67b6\u69cb\u7684\u8abf\u6574\u7684\u9700\u6c42\uff0c\u4ee5\u6539\u5584\u8de8\u8a9e\u8a00\u7684\u6982\u5316\u3002", "author": "Wafaa Wardah et.al.", "authors": "Wafaa Wardah, Tu\u011f\u00e7e Melike Ko\u00e7ak B\u00fcy\u00fckta\u015f, Kirill Shchegelskiy, Sebastian M\u00f6ller, Robert P. Spang", "id": "2502.13004v1", "paper_url": "http://arxiv.org/abs/2502.13004v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/13006/2502.13006v1.json b/database/storage/2502/13006/2502.13006v1.json
new file mode 100644
index 0000000000..261ebcb6ae
--- /dev/null
+++ b/database/storage/2502/13006/2502.13006v1.json
@@ -0,0 +1 @@
+{"2502.13006": {"publish_time": "2025-02-18", "title": "Integrating Reinforcement Learning, Action Model Learning, and Numeric Planning for Tackling Complex Tasks", "paper_summary": "Automated Planning algorithms require a model of the domain that specifies\nthe preconditions and effects of each action. Obtaining such a domain model is\nnotoriously hard. Algorithms for learning domain models exist, yet it remains\nunclear whether learning a domain model and planning is an effective approach\nfor numeric planning environments, i.e., where states include discrete and\nnumeric state variables. In this work, we explore the benefits of learning a\nnumeric domain model and compare it with alternative model-free solutions. As a\ncase study, we use two tasks in Minecraft, a popular sandbox game that has been\nused as an AI challenge. First, we consider an offline learning setting, where\na set of expert trajectories are available to learn from. This is the standard\nsetting for learning domain models. We used the Numeric Safe Action Model\nLearning (NSAM) algorithm to learn a numeric domain model and solve new\nproblems with the learned domain model and a numeric planner. We call this\nmodel-based solution NSAM_(+p), and compare it to several model-free Imitation\nLearning (IL) and Offline Reinforcement Learning (RL) algorithms. Empirical\nresults show that some IL algorithms can learn faster to solve simple tasks,\nwhile NSAM_(+p) allows solving tasks that require long-term planning and\nenables generalizing to solve problems in larger environments. Then, we\nconsider an online learning setting, where learning is done by moving an agent\nin the environment. For this setting, we introduce RAMP. In RAMP, observations\ncollected during the agent's execution are used to simultaneously train an RL\npolicy and learn a planning domain action model. This forms a positive feedback\nloop between the RL policy and the learned domain model. 
We demonstrate\nexperimentally the benefits of using RAMP, showing that it finds more efficient\nplans and solves more problems than several RL baselines.", "paper_summary_zh": "<paragraph>\u81ea\u52d5\u5316\u898f\u5283\u6f14\u7b97\u6cd5\u9700\u8981\u4e00\u500b\u7db2\u57df\u6a21\u578b\uff0c\u4f86\u6307\u5b9a\u6bcf\u500b\u52d5\u4f5c\u7684\u524d\u63d0\u689d\u4ef6\u548c\u6548\u679c\u3002\u53d6\u5f97\u9019\u6a23\u7684\u7db2\u57df\u6a21\u578b\u51fa\u4e86\u540d\u7684\u56f0\u96e3\u3002\u5b78\u7fd2\u7db2\u57df\u6a21\u578b\u7684\u6f14\u7b97\u6cd5\u78ba\u5be6\u5b58\u5728\uff0c\u4f46\u5b78\u7fd2\u7db2\u57df\u6a21\u578b\u548c\u898f\u5283\u662f\u5426\u70ba\u6578\u503c\u898f\u5283\u74b0\u5883\u7684\u6709\u6548\u65b9\u6cd5\u4ecd\u7136\u4e0d\u6e05\u695a\uff0c\u4e5f\u5c31\u662f\u8aaa\uff0c\u5176\u4e2d\u72c0\u614b\u5305\u542b\u96e2\u6563\u548c\u6578\u503c\u72c0\u614b\u8b8a\u6578\u3002\u5728\u9019\u9805\u5de5\u4f5c\u4e2d\uff0c\u6211\u5011\u63a2\u8a0e\u5b78\u7fd2\u6578\u503c\u7db2\u57df\u6a21\u578b\u7684\u512a\u9ede\uff0c\u4e26\u5c07\u5176\u8207\u66ff\u4ee3\u7684\u7121\u6a21\u578b\u89e3\u6c7a\u65b9\u6848\u9032\u884c\u6bd4\u8f03\u3002\u4f5c\u70ba\u4e00\u500b\u6848\u4f8b\u7814\u7a76\uff0c\u6211\u5011\u4f7f\u7528 Minecraft \u4e2d\u7684\u5169\u500b\u4efb\u52d9\uff0cMinecraft \u662f\u4e00\u500b\u6d41\u884c\u7684\u6c99\u76d2\u904a\u6232\uff0c\u5df2\u88ab\u7528\u4f5c AI \u6311\u6230\u3002\u9996\u5148\uff0c\u6211\u5011\u8003\u616e\u96e2\u7dda\u5b78\u7fd2\u8a2d\u5b9a\uff0c\u5176\u4e2d\u6709\u4e00\u7d44\u5c08\u5bb6\u8ecc\u8de1\u53ef\u4f9b\u5b78\u7fd2\u3002\u9019\u662f\u5b78\u7fd2\u7db2\u57df\u6a21\u578b\u7684\u6a19\u6e96\u8a2d\u5b9a\u3002\u6211\u5011\u4f7f\u7528\u6578\u503c\u5b89\u5168\u52d5\u4f5c\u6a21\u578b\u5b78\u7fd2 (NSAM) 
\u6f14\u7b97\u6cd5\u4f86\u5b78\u7fd2\u6578\u503c\u7db2\u57df\u6a21\u578b\uff0c\u4e26\u4f7f\u7528\u5df2\u5b78\u7fd2\u7684\u7db2\u57df\u6a21\u578b\u548c\u6578\u503c\u898f\u5283\u5668\u89e3\u6c7a\u65b0\u554f\u984c\u3002\u6211\u5011\u7a31\u6b64\u6a21\u578b\u70ba\u57fa\u790e\u7684\u89e3\u6c7a\u65b9\u6848 NSAM_(+p)\uff0c\u4e26\u5c07\u5176\u8207\u591a\u7a2e\u7121\u6a21\u578b\u6a21\u4eff\u5b78\u7fd2 (IL) \u548c\u96e2\u7dda\u5f37\u5316\u5b78\u7fd2 (RL) \u6f14\u7b97\u6cd5\u9032\u884c\u6bd4\u8f03\u3002\u7d93\u9a57\u7d50\u679c\u986f\u793a\uff0c\u4e00\u4e9b IL \u6f14\u7b97\u6cd5\u53ef\u4ee5\u66f4\u5feb\u5730\u5b78\u7fd2\u89e3\u6c7a\u7c21\u55ae\u4efb\u52d9\uff0c\u800c NSAM_(+p) \u5141\u8a31\u89e3\u6c7a\u9700\u8981\u9577\u671f\u898f\u5283\u7684\u4efb\u52d9\uff0c\u4e26\u80fd\u5920\u63a8\u5ee3\u5230\u5728\u66f4\u5927\u74b0\u5883\u4e2d\u89e3\u6c7a\u554f\u984c\u3002\u7136\u5f8c\uff0c\u6211\u5011\u8003\u616e\u7dda\u4e0a\u5b78\u7fd2\u8a2d\u5b9a\uff0c\u5176\u4e2d\u5b78\u7fd2\u662f\u900f\u904e\u5728\u74b0\u5883\u4e2d\u79fb\u52d5\u4ee3\u7406\u4f86\u5b8c\u6210\u7684\u3002\u5c0d\u65bc\u6b64\u8a2d\u5b9a\uff0c\u6211\u5011\u5f15\u5165\u4e86 RAMP\u3002\u5728 RAMP \u4e2d\uff0c\u5728\u4ee3\u7406\u57f7\u884c\u671f\u9593\u6536\u96c6\u7684\u89c0\u5bdf\u7d50\u679c\u7528\u65bc\u540c\u6642\u8a13\u7df4 RL \u653f\u7b56\u548c\u5b78\u7fd2\u898f\u5283\u7db2\u57df\u52d5\u4f5c\u6a21\u578b\u3002\u9019\u5728 RL \u653f\u7b56\u548c\u5df2\u5b78\u7fd2\u7684\u7db2\u57df\u6a21\u578b\u4e4b\u9593\u5f62\u6210\u4e86\u4e00\u500b\u6b63\u5411\u56de\u994b\u8ff4\u8def\u3002\u6211\u5011\u900f\u904e\u5be6\u9a57\u8b49\u660e\u4e86\u4f7f\u7528 RAMP \u7684\u597d\u8655\uff0c\u986f\u793a\u5b83\u6bd4\u591a\u500b RL \u57fa\u6e96\u627e\u5230\u4e86\u66f4\u6709\u6548\u7684\u8a08\u756b\uff0c\u4e26\u89e3\u6c7a\u4e86\u66f4\u591a\u554f\u984c\u3002</paragraph>", "author": "Yarin Benyamin et.al.", "authors": "Yarin Benyamin, Argaman Mordoch, Shahaf S. 
Shperberg, Roni Stern", "id": "2502.13006v1", "paper_url": "http://arxiv.org/abs/2502.13006v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/13010/2502.13010v1.json b/database/storage/2502/13010/2502.13010v1.json
new file mode 100644
index 0000000000..51f68b679f
--- /dev/null
+++ b/database/storage/2502/13010/2502.13010v1.json
@@ -0,0 +1 @@
+{"2502.13010": {"publish_time": "2025-02-18", "title": "Adaptive Knowledge Graphs Enhance Medical Question Answering: Bridging the Gap Between LLMs and Evolving Medical Knowledge", "paper_summary": "Large Language Models (LLMs) have significantly advanced medical\nquestion-answering by leveraging extensive clinical data and medical\nliterature. However, the rapid evolution of medical knowledge and the\nlabor-intensive process of manually updating domain-specific resources pose\nchallenges to the reliability of these systems. To address this, we introduce\nAdaptive Medical Graph-RAG (AMG-RAG), a comprehensive framework that automates\nthe construction and continuous updating of medical knowledge graphs,\nintegrates reasoning, and retrieves current external evidence, such as PubMed\nand WikiSearch. By dynamically linking new findings and complex medical\nconcepts, AMG-RAG not only improves accuracy but also enhances interpretability\nin medical queries.\n  Evaluations on the MEDQA and MEDMCQA benchmarks demonstrate the effectiveness\nof AMG-RAG, achieving an F1 score of 74.1 percent on MEDQA and an accuracy of\n66.34 percent on MEDMCQA, outperforming both comparable models and those 10 to\n100 times larger. 
Notably, these improvements are achieved without increasing\ncomputational overhead, highlighting the critical role of automated knowledge\ngraph generation and external evidence retrieval in delivering up-to-date,\ntrustworthy medical insights.", "paper_summary_zh": "\u5927\u578b\u8a9e\u8a00\u6a21\u578b (LLM) \u900f\u904e\u5229\u7528\u5ee3\u6cdb\u7684\u81e8\u5e8a\u8cc7\u6599\u548c\u91ab\u5b78\u6587\u737b\uff0c\u5927\u5e45\u63d0\u5347\u4e86\u91ab\u7642\u554f\u984c\u89e3\u7b54\u7684\u9032\u6b65\u3002\u7136\u800c\uff0c\u91ab\u7642\u77e5\u8b58\u7684\u5feb\u901f\u6f14\u9032\u548c\u624b\u52d5\u66f4\u65b0\u7279\u5b9a\u9818\u57df\u8cc7\u6e90\u7684\u7e41\u8907\u7a0b\u5e8f\uff0c\u5c0d\u9019\u4e9b\u7cfb\u7d71\u7684\u53ef\u9760\u6027\u69cb\u6210\u6311\u6230\u3002\u70ba\u4e86\u89e3\u6c7a\u9019\u500b\u554f\u984c\uff0c\u6211\u5011\u5f15\u5165\u4e86\u9069\u61c9\u6027\u91ab\u7642\u5716\u8868 RAG (AMG-RAG)\uff0c\u9019\u662f\u4e00\u500b\u81ea\u52d5\u5316\u5efa\u69cb\u548c\u6301\u7e8c\u66f4\u65b0\u91ab\u7642\u77e5\u8b58\u5716\u8868\u7684\u7d9c\u5408\u67b6\u69cb\uff0c\u6574\u5408\u63a8\u7406\u4e26\u64f7\u53d6 PubMed \u548c WikiSearch \u7b49\u6700\u65b0\u7684\u5916\u90e8\u8b49\u64da\u3002\u900f\u904e\u52d5\u614b\u9023\u7d50\u65b0\u7684\u767c\u73fe\u548c\u8907\u96dc\u7684\u91ab\u7642\u6982\u5ff5\uff0cAMG-RAG \u4e0d\u50c5\u63d0\u5347\u4e86\u6e96\u78ba\u6027\uff0c\u4e5f\u589e\u5f37\u4e86\u91ab\u7642\u67e5\u8a62\u7684\u53ef\u89e3\u91cb\u6027\u3002\u5728 MEDQA \u548c MEDMCQA \u57fa\u6e96\u4e0a\u7684\u8a55\u91cf\u8b49\u660e\u4e86 AMG-RAG \u7684\u6709\u6548\u6027\uff0c\u5728 MEDQA \u4e0a\u9054\u5230\u4e86 74.1% \u7684 F1 \u5206\u6578\uff0c\u5728 MEDMCQA \u4e0a\u9054\u5230\u4e86 66.34% \u7684\u6e96\u78ba\u5ea6\uff0c\u512a\u65bc\u5176\u4ed6\u540c\u985e\u6a21\u578b\u4ee5\u53ca\u90a3\u4e9b\u5927 10 \u5230 100 
\u500d\u7684\u6a21\u578b\u3002\u503c\u5f97\u6ce8\u610f\u7684\u662f\uff0c\u9019\u4e9b\u6539\u9032\u662f\u5728\u4e0d\u589e\u52a0\u904b\u7b97\u8ca0\u64d4\u7684\u60c5\u6cc1\u4e0b\u5be6\u73fe\u7684\uff0c\u7a81\u986f\u4e86\u81ea\u52d5\u5316\u77e5\u8b58\u5716\u8868\u751f\u6210\u548c\u5916\u90e8\u8b49\u64da\u64f7\u53d6\u5728\u63d0\u4f9b\u6700\u65b0\u3001\u53ef\u4fe1\u8cf4\u7684\u91ab\u7642\u898b\u89e3\u4e2d\u626e\u6f14\u7684\u91cd\u8981\u89d2\u8272\u3002", "author": "Mohammad Reza Rezaei et.al.", "authors": "Mohammad Reza Rezaei, Reza Saadati Fard, Jayson Parker, Rahul G. Krishnan, Milad Lankarany", "id": "2502.13010v1", "paper_url": "http://arxiv.org/abs/2502.13010v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/13012/2502.13012v1.json b/database/storage/2502/13012/2502.13012v1.json
new file mode 100644
index 0000000000..edbe677cae
--- /dev/null
+++ b/database/storage/2502/13012/2502.13012v1.json
@@ -0,0 +1 @@
+{"2502.13012": {"publish_time": "2025-02-18", "title": "Towards a Design Guideline for RPA Evaluation: A Survey of Large Language Model-Based Role-Playing Agents", "paper_summary": "Role-Playing Agent (RPA) is an increasingly popular type of LLM Agent that\nsimulates human-like behaviors in a variety of tasks. However, evaluating RPAs\nis challenging due to diverse task requirements and agent designs. This paper\nproposes an evidence-based, actionable, and generalizable evaluation design\nguideline for LLM-based RPA by systematically reviewing 1,676 papers published\nbetween Jan. 2021 and Dec. 2024. Our analysis identifies six agent attributes,\nseven task attributes, and seven evaluation metrics from existing literature.\nBased on these findings, we present an RPA evaluation design guideline to help\nresearchers develop more systematic and consistent evaluation methods.", "paper_summary_zh": "\u89d2\u8272\u626e\u6f14\u4ee3\u7406\uff08RPA\uff09\u662f\u4e00\u7a2e\u8d8a\u4f86\u8d8a\u6d41\u884c\u7684 LLM \u4ee3\u7406\uff0c\u5b83\u80fd\u6a21\u64ec\u4eba\u985e\u5728\u5404\u7a2e\u4efb\u52d9\u4e2d\u7684\u884c\u70ba\u3002\u7136\u800c\uff0c\u7531\u65bc\u4efb\u52d9\u9700\u6c42\u548c\u4ee3\u7406\u8a2d\u8a08\u7684\u591a\u6a23\u6027\uff0c\u8a55\u4f30 RPA \u5177\u6709\u6311\u6230\u6027\u3002\u672c\u6587\u901a\u904e\u7cfb\u7d71\u5730\u5be9\u67e5 2021 \u5e74 1 \u6708\u81f3 2024 \u5e74 12 \u6708\u671f\u9593\u767c\u8868\u7684 1,676 \u7bc7\u8ad6\u6587\uff0c\u63d0\u51fa\u4e86\u57fa\u65bc\u8b49\u64da\u3001\u53ef\u64cd\u4f5c\u4e14\u53ef\u63a8\u5ee3\u7684 LLM \u57fa\u65bc RPA \u7684\u8a55\u4f30\u8a2d\u8a08\u6307\u5357\u3002\u6211\u5011\u7684\u5206\u6790\u5f9e\u73fe\u6709\u6587\u737b\u4e2d\u8b58\u5225\u51fa\u516d\u500b\u4ee3\u7406\u5c6c\u6027\u3001\u4e03\u500b\u4efb\u52d9\u5c6c\u6027\u548c\u4e03\u500b\u8a55\u4f30\u6307\u6a19\u3002\u6839\u64da\u9019\u4e9b\u767c\u73fe\uff0c\u6211\u5011\u63d0\u51fa\u4e86 RPA 
\u8a55\u4f30\u8a2d\u8a08\u6307\u5357\uff0c\u4ee5\u5e6b\u52a9\u7814\u7a76\u4eba\u54e1\u958b\u767c\u66f4\u7cfb\u7d71\u5316\u548c\u4e00\u81f4\u7684\u8a55\u4f30\u65b9\u6cd5\u3002", "author": "Chaoran Chen et.al.", "authors": "Chaoran Chen, Bingsheng Yao, Ruishi Zou, Wenyue Hua, Weimin Lyu, Toby Jia-Jun Li, Dakuo Wang", "id": "2502.13012v1", "paper_url": "http://arxiv.org/abs/2502.13012v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/13016/2502.13016v1.json b/database/storage/2502/13016/2502.13016v1.json
new file mode 100644
index 0000000000..ff41e479d1
--- /dev/null
+++ b/database/storage/2502/13016/2502.13016v1.json
@@ -0,0 +1 @@
+{"2502.13016": {"publish_time": "2025-02-18", "title": "LLM-Powered Proactive Data Systems", "paper_summary": "With the power of LLMs, we now have the ability to query data that was\npreviously impossible to query, including text, images, and video. However,\ndespite this enormous potential, most present-day data systems that leverage\nLLMs are reactive, reflecting our community's desire to map LLMs to known\nabstractions. Most data systems treat LLMs as an opaque black box that operates\non user inputs and data as is, optimizing them much like any other approximate,\nexpensive UDFs, in conjunction with other relational operators. Such data\nsystems do as they are told, but fail to understand and leverage what the LLM\nis being asked to do (i.e. the underlying operations, which may be\nerror-prone), the data the LLM is operating on (e.g., long, complex documents),\nor what the user really needs. They don't take advantage of the characteristics\nof the operations and/or the data at hand, or ensure correctness of results\nwhen there are imprecisions and ambiguities. We argue that data systems instead\nneed to be proactive: they need to be given more agency -- armed with the power\nof LLMs -- to understand and rework the user inputs and the data and to make\ndecisions on how the operations and the data should be represented and\nprocessed. By allowing the data system to parse, rewrite, and decompose user\ninputs and data, or to interact with the user in ways that go beyond the\nstandard single-shot query-result paradigm, the data system is able to address\nuser needs more efficiently and effectively. These new capabilities lead to a\nrich design space where the data system takes more initiative: they are\nempowered to perform optimization based on the transformation operations, data\ncharacteristics, and user intent. 
We discuss various successful examples of how\nthis framework has been and can be applied in real-world tasks, and present\nfuture directions for this ambitious research agenda.", "paper_summary_zh": "<paragraph>\u900f\u904e LLM \u7684\u5f37\u5927\u529f\u80fd\uff0c\u6211\u5011\u73fe\u5728\u80fd\u5920\u67e5\u8a62\u904e\u53bb\u7121\u6cd5\u67e5\u8a62\u7684\u8cc7\u6599\uff0c\u5305\u62ec\u6587\u5b57\u3001\u5716\u7247\u548c\u5f71\u7247\u3002\u7136\u800c\uff0c\u5118\u7ba1\u6709\u5982\u6b64\u9f90\u5927\u7684\u6f5b\u529b\uff0c\u4f46\u73fe\u4eca\u5927\u591a\u6578\u5229\u7528 LLM \u7684\u8cc7\u6599\u7cfb\u7d71\u90fd\u662f\u88ab\u52d5\u7684\uff0c\u53cd\u6620\u51fa\u6211\u5011\u7684\u793e\u7fa4\u5e0c\u671b\u5c07 LLM \u6620\u5c04\u5230\u5df2\u77e5\u7684\u62bd\u8c61\u5316\u3002\u5927\u591a\u6578\u8cc7\u6599\u7cfb\u7d71\u5c07 LLM \u8996\u70ba\u4e00\u500b\u4e0d\u900f\u660e\u7684\u9ed1\u76d2\u5b50\uff0c\u4ee5\u4f7f\u7528\u8005\u8f38\u5165\u548c\u8cc7\u6599\u70ba\u57fa\u790e\u9032\u884c\u904b\u4f5c\uff0c\u4e26\u50cf\u5176\u4ed6\u8fd1\u4f3c\u3001\u6602\u8cb4\u7684 UDF \u4e00\u6a23\u6700\u4f73\u5316\u5b83\u5011\uff0c\u4e26\u8207\u5176\u4ed6\u95dc\u806f\u904b\u7b97\u5b50\u7d50\u5408\u4f7f\u7528\u3002\u9019\u4e9b\u8cc7\u6599\u7cfb\u7d71\u6703\u7167\u8457\u6307\u793a\u57f7\u884c\uff0c\u4f46\u7121\u6cd5\u7406\u89e3\u4e26\u904b\u7528 LLM \u88ab\u8981\u6c42\u57f7\u884c\u7684\u4efb\u52d9\uff08\u4f8b\u5982\u53ef\u80fd\u5bb9\u6613\u51fa\u932f\u7684\u57fa\u672c\u904b\u7b97\uff09\u3001LLM 
\u6b63\u5728\u904b\u7b97\u7684\u8cc7\u6599\uff08\u4f8b\u5982\u5197\u9577\u3001\u8907\u96dc\u7684\u6587\u4ef6\uff09\uff0c\u6216\u4f7f\u7528\u8005\u771f\u6b63\u9700\u8981\u7684\u662f\u4ec0\u9ebc\u3002\u5b83\u5011\u4e0d\u6703\u5229\u7528\u904b\u7b97\u548c/\u6216\u624b\u908a\u8cc7\u6599\u7684\u7279\u6027\uff0c\u6216\u5728\u6709\u8aa4\u5dee\u548c\u6b67\u7fa9\u6642\u78ba\u4fdd\u7d50\u679c\u7684\u6b63\u78ba\u6027\u3002\u6211\u5011\u8a8d\u70ba\u8cc7\u6599\u7cfb\u7d71\u61c9\u8a72\u6539\u70ba\u4e3b\u52d5\uff1a\u5b83\u5011\u9700\u8981\u88ab\u8ce6\u4e88\u66f4\u591a\u81ea\u4e3b\u6b0a\uff0c\u4e26\u5177\u5099 LLM \u7684\u5f37\u5927\u529f\u80fd\uff0c\u4ee5\u4e86\u89e3\u4e26\u91cd\u65b0\u8655\u7406\u4f7f\u7528\u8005\u8f38\u5165\u548c\u8cc7\u6599\uff0c\u4e26\u5c31\u904b\u7b97\u548c\u8cc7\u6599\u7684\u8868\u793a\u548c\u8655\u7406\u65b9\u5f0f\u505a\u51fa\u6c7a\u7b56\u3002\u900f\u904e\u5141\u8a31\u8cc7\u6599\u7cfb\u7d71\u89e3\u6790\u3001\u6539\u5beb\u548c\u5206\u89e3\u4f7f\u7528\u8005\u8f38\u5165\u548c\u8cc7\u6599\uff0c\u6216\u4ee5\u8d85\u8d8a\u6a19\u6e96\u55ae\u6b21\u67e5\u8a62\u7d50\u679c\u6a21\u5f0f\u7684\u65b9\u5f0f\u8207\u4f7f\u7528\u8005\u4e92\u52d5\uff0c\u8cc7\u6599\u7cfb\u7d71\u80fd\u5920\u66f4\u6709\u6548\u7387\u4e14\u6709\u6548\u5730\u6eff\u8db3\u4f7f\u7528\u8005\u7684\u9700\u6c42\u3002\u9019\u4e9b\u65b0\u529f\u80fd\u6703\u5e36\u4f86\u4e00\u500b\u8c50\u5bcc\u7684\u8a2d\u8a08\u7a7a\u9593\uff0c\u8b93\u8cc7\u6599\u7cfb\u7d71\u767c\u63ee\u66f4\u591a\u4e3b\u5c0e\u6027\uff1a\u5b83\u5011\u6709\u80fd\u529b\u6839\u64da\u8f49\u63db\u904b\u7b97\u3001\u8cc7\u6599\u7279\u6027\u548c\u4f7f\u7528\u8005\u610f\u5716\u9032\u884c\u6700\u4f73\u5316\u3002\u6211\u5011\u5c07\u8a0e\u8ad6\u9019\u500b\u67b6\u69cb\u5982\u4f55\u61c9\u7528\u65bc\u5be6\u969b\u4efb\u52d9\uff0c\u4e26\u63d0\u51fa\u9019\u500b\u96c4\u5fc3\u52c3\u52c3\u7684\u7814\u7a76\u8b70\u7a0b\u7684\u672a\u4f86\u65b9\u5411\u3002</paragraph>", "author": "Sepanta Zeighami et.al.", "authors": "Sepanta Zeighami, Yiming Lin, Shreya Shankar, 
Aditya Parameswaran", "id": "2502.13016v1", "paper_url": "http://arxiv.org/abs/2502.13016v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/13019/2502.13019v1.json b/database/storage/2502/13019/2502.13019v1.json
new file mode 100644
index 0000000000..28d3105d95
--- /dev/null
+++ b/database/storage/2502/13019/2502.13019v1.json
@@ -0,0 +1 @@
+{"2502.13019": {"publish_time": "2025-02-18", "title": "Oreo: A Plug-in Context Reconstructor to Enhance Retrieval-Augmented Generation", "paper_summary": "Despite the remarkable capabilities of Large Language Models (LLMs) in\nvarious NLP tasks, they remain vulnerable to hallucinations due to their\nlimited parametric knowledge and lack of domain-specific expertise.\nRetrieval-Augmented Generation (RAG) addresses this challenge by incorporating\nexternal document retrieval to augment the knowledge base of LLMs. In this\napproach, RAG retrieves document chunks from an external corpus in response to\na query, which are then used as context for the downstream language model to\ngenerate an answer. However, these retrieved knowledge sources often include\nirrelevant or erroneous information, undermining the effectiveness of RAG in\ndownstream tasks. To overcome this limitation, we introduce a compact,\nefficient, and pluggable module designed to refine external knowledge sources\nbefore feeding them to the generator. The module reconstructs retrieved content\nby extracting the most relevant and supportive information and reorganising it\ninto a concise, query-specific format. Through a three-stage training paradigm\n- comprising supervised fine-tuning, contrastive multi-task learning, and\nreinforcement learning-based alignment - it prioritises critical knowledge and\naligns it with the generator's preferences. 
This method enables LLMs to produce\noutputs that are more accurate, reliable, and contextually appropriate.", "paper_summary_zh": "\u5118\u7ba1\u5927\u578b\u8a9e\u8a00\u6a21\u578b (LLM) \u5728\u5404\u7a2e\u81ea\u7136\u8a9e\u8a00\u8655\u7406\u4efb\u52d9\u4e2d\u5177\u5099\u5353\u8d8a\u7684\u80fd\u529b\uff0c\u4f46\u7531\u65bc\u5176\u53c3\u6578\u77e5\u8b58\u6709\u9650\u4e14\u7f3a\u4e4f\u7279\u5b9a\u9818\u57df\u7684\u5c08\u696d\u77e5\u8b58\uff0c\u56e0\u6b64\u5b83\u5011\u4ecd\u7136\u5bb9\u6613\u51fa\u73fe\u5e7b\u89ba\u3002\u6aa2\u7d22\u589e\u5f37\u5f0f\u751f\u6210 (RAG) \u900f\u904e\u7d0d\u5165\u5916\u90e8\u6587\u4ef6\u6aa2\u7d22\u4f86\u64f4\u5145 LLM \u7684\u77e5\u8b58\u5eab\uff0c\u4ee5\u61c9\u5c0d\u6b64\u9805\u6311\u6230\u3002\u5728\u6b64\u65b9\u6cd5\u4e2d\uff0cRAG \u6703\u6839\u64da\u67e5\u8a62\u6aa2\u7d22\u5916\u90e8\u8a9e\u6599\u5eab\u4e2d\u7684\u6587\u4ef6\u5340\u584a\uff0c\u7136\u5f8c\u5c07\u5176\u7528\u4f5c\u4e0b\u6e38\u8a9e\u8a00\u6a21\u578b\u7684\u80cc\u666f\uff0c\u4ee5\u7522\u751f\u7b54\u6848\u3002\u7136\u800c\uff0c\u9019\u4e9b\u6aa2\u7d22\u5230\u7684\u77e5\u8b58\u4f86\u6e90\u901a\u5e38\u5305\u542b\u4e0d\u76f8\u95dc\u6216\u932f\u8aa4\u7684\u8cc7\u8a0a\uff0c\u56e0\u800c\u640d\u5bb3\u4e86 RAG \u5728\u4e0b\u6e38\u4efb\u52d9\u4e2d\u7684\u6548\u80fd\u3002\u70ba\u4e86\u514b\u670d\u6b64\u9805\u9650\u5236\uff0c\u6211\u5011\u5f15\u5165\u4e86\u4e00\u500b\u7cbe\u7c21\u3001\u6709\u6548\u7387\u4e14\u53ef\u63d2\u5165\u7684\u6a21\u7d44\uff0c\u7528\u65bc\u5728\u5c07\u5916\u90e8\u77e5\u8b58\u4f86\u6e90\u63d0\u4f9b\u7d66\u751f\u6210\u5668\u4e4b\u524d\u5c0d\u5176\u9032\u884c\u7cbe\u7149\u3002\u6b64\u6a21\u7d44\u900f\u904e\u63d0\u53d6\u6700\u76f8\u95dc\u4e14\u6709\u7528\u7684\u8cc7\u8a0a\u4e26\u5c07\u5176\u91cd\u65b0\u7d44\u7e54\u6210\u7c21\u6f54\u4e14\u7279\u5b9a\u65bc\u67e5\u8a62\u7684\u683c\u5f0f\uff0c\u4f86\u91cd\u5efa\u6aa2\u7d22\u5230\u7684\u5167\u5bb9\u3002\u900f\u904e\u4e09\u968e\u6bb5\u8a13\u7df4\u7bc4\u4f8b - 
\u5305\u542b\u76e3\u7763\u5fae\u8abf\u3001\u5c0d\u6bd4\u591a\u4efb\u52d9\u5b78\u7fd2\u4ee5\u53ca\u57fa\u65bc\u5f37\u5316\u5b78\u7fd2\u7684\u6bd4\u5c0d - \u5b83\u512a\u5148\u8003\u91cf\u95dc\u9375\u77e5\u8b58\uff0c\u4e26\u4f7f\u5176\u8207\u751f\u6210\u5668\u7684\u504f\u597d\u76f8\u7b26\u3002\u6b64\u65b9\u6cd5\u53ef\u8b93 LLM \u7522\u751f\u66f4\u6e96\u78ba\u3001\u53ef\u9760\u4e14\u5728\u8a9e\u5883\u4e0a\u66f4\u9069\u7576\u7684\u8f38\u51fa\u3002", "author": "Sha Li et.al.", "authors": "Sha Li, Naren Ramarkrishnan", "id": "2502.13019v1", "paper_url": "http://arxiv.org/abs/2502.13019v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/13025/2502.13025v1.json b/database/storage/2502/13025/2502.13025v1.json
new file mode 100644
index 0000000000..998c0e3607
--- /dev/null
+++ b/database/storage/2502/13025/2502.13025v1.json
@@ -0,0 +1 @@
+{"2502.13025": {"publish_time": "2025-02-18", "title": "Agentic Deep Graph Reasoning Yields Self-Organizing Knowledge Networks", "paper_summary": "We present an agentic, autonomous graph expansion framework that iteratively\nstructures and refines knowledge in situ. Unlike conventional knowledge graph\nconstruction methods relying on static extraction or single-pass learning, our\napproach couples a reasoning-native large language model with a continually\nupdated graph representation. At each step, the system actively generates new\nconcepts and relationships, merges them into a global graph, and formulates\nsubsequent prompts based on its evolving structure. Through this\nfeedback-driven loop, the model organizes information into a scale-free network\ncharacterized by hub formation, stable modularity, and bridging nodes that link\ndisparate knowledge clusters. Over hundreds of iterations, new nodes and edges\ncontinue to appear without saturating, while centrality measures and shortest\npath distributions evolve to yield increasingly distributed connectivity. Our\nanalysis reveals emergent patterns, such as the rise of highly connected 'hub'\nconcepts and the shifting influence of 'bridge' nodes, indicating that agentic,\nself-reinforcing graph construction can yield open-ended, coherent knowledge\nstructures. Applied to materials design problems, we present compositional\nreasoning experiments by extracting node-specific and synergy-level principles\nto foster genuinely novel knowledge synthesis, yielding cross-domain ideas that\ntranscend rote summarization and strengthen the framework's potential for\nopen-ended scientific discovery. 
We discuss other applications in scientific\ndiscovery and outline future directions for enhancing scalability and\ninterpretability.", "paper_summary_zh": "<paragraph>\u6211\u5011\u63d0\u51fa\u4e00\u500b\u80fd\u52d5\u7684\u3001\u81ea\u4e3b\u7684\u5716\u5f62\u64f4\u5c55\u6846\u67b6\uff0c\u5b83\u53cd\u8986\u5730\u5efa\u69cb\u548c\u7cbe\u7149\u539f\u4f4d\u77e5\u8b58\u3002\u8207\u4f9d\u8cf4\u975c\u614b\u63d0\u53d6\u6216\u55ae\u6b21\u5b78\u7fd2\u7684\u50b3\u7d71\u77e5\u8b58\u5716\u5f62\u5efa\u69cb\u65b9\u6cd5\u4e0d\u540c\uff0c\u6211\u5011\u7684\u505a\u6cd5\u5c07\u4e00\u500b\u63a8\u7406\u539f\u751f\u7684\u5927\u8a9e\u8a00\u6a21\u578b\u8207\u4e00\u500b\u6301\u7e8c\u66f4\u65b0\u7684\u5716\u5f62\u8868\u793a\u7d50\u5408\u8d77\u4f86\u3002\u5728\u6bcf\u4e00\u6b65\u4e2d\uff0c\u7cfb\u7d71\u4e3b\u52d5\u7522\u751f\u65b0\u7684\u6982\u5ff5\u548c\u95dc\u4fc2\uff0c\u5c07\u5b83\u5011\u5408\u4f75\u5230\u4e00\u500b\u5168\u57df\u5716\u5f62\u4e2d\uff0c\u4e26\u6839\u64da\u5176\u4e0d\u65b7\u6f14\u5316\u7684\u7d50\u69cb\u5236\u5b9a\u5f8c\u7e8c\u63d0\u793a\u3002\u900f\u904e\u9019\u500b\u56de\u994b\u9a45\u52d5\u7684\u8ff4\u5708\uff0c\u6a21\u578b\u5c07\u8cc7\u8a0a\u7d44\u7e54\u6210\u4e00\u500b\u7121\u6a19\u5ea6\u7db2\u8def\uff0c\u5176\u7279\u5fb5\u662f\u6a1e\u7d10\u5f62\u6210\u3001\u7a69\u5b9a\u7684\u6a21\u7d44\u5316\u4ee5\u53ca\u9023\u7d50\u4e0d\u540c\u77e5\u8b58\u7fa4\u96c6\u7684\u6a4b\u63a5\u7bc0\u9ede\u3002\u5728\u6578\u767e\u6b21\u53cd\u8986\u904b\u7b97\u4e2d\uff0c\u65b0\u7684\u7bc0\u9ede\u548c\u908a\u7de3\u6703\u6301\u7e8c\u51fa\u73fe\uff0c\u800c\u4e0d\u6703\u98fd\u548c\uff0c\u540c\u6642\u4e2d\u5fc3\u6027\u6e2c\u91cf\u548c\u6700\u77ed\u8def\u5f91\u5206\u4f48\u6703\u6f14\u5316\u70ba\u7522\u751f\u8d8a\u4f86\u8d8a\u5206\u6563\u7684\u9023\u901a\u6027\u3002\u6211\u5011\u7684\u5206\u6790\u63ed\u793a\u4e86\u65b0\u8208\u6a21\u5f0f\uff0c\u4f8b\u5982\u9ad8\u5ea6\u9023\u63a5\u7684\u300c\u6a1e\u7d10\u300d\u6982\u5ff5\u7684\u8208\u8d77\u548c\u300c\u6a4b\u6a11\u300d\u7bc0\u9ede\u5f71\u97ff\u529b\u7684\u
8f49\u79fb\uff0c\u9019\u8868\u660e\u80fd\u52d5\u7684\u3001\u81ea\u6211\u5f37\u5316\u7684\u5716\u5f62\u5efa\u69cb\u53ef\u4ee5\u7522\u751f\u958b\u653e\u5f0f\u3001\u9023\u8cab\u7684\u77e5\u8b58\u7d50\u69cb\u3002\u61c9\u7528\u65bc\u6750\u6599\u8a2d\u8a08\u554f\u984c\uff0c\u6211\u5011\u63d0\u51fa\u7d44\u5408\u63a8\u7406\u5be6\u9a57\uff0c\u900f\u904e\u63d0\u53d6\u7279\u5b9a\u65bc\u7bc0\u9ede\u7684\u539f\u5247\u548c\u5354\u540c\u6548\u61c9\u5c64\u7d1a\u539f\u5247\uff0c\u4ee5\u4fc3\u9032\u771f\u6b63\u65b0\u7a4e\u7684\u77e5\u8b58\u7d9c\u5408\uff0c\u7522\u751f\u8d85\u8d8a\u6b7b\u80cc\u5f0f\u6458\u8981\u4e26\u5f37\u5316\u6846\u67b6\u5728\u958b\u653e\u5f0f\u79d1\u5b78\u767c\u73fe\u4e2d\u6f5b\u529b\u7684\u8de8\u9818\u57df\u60f3\u6cd5\u3002\u6211\u5011\u8a0e\u8ad6\u4e86\u5728\u79d1\u5b78\u767c\u73fe\u4e2d\u7684\u5176\u4ed6\u61c9\u7528\uff0c\u4e26\u6982\u8ff0\u4e86\u589e\u5f37\u53ef\u64f4\u5145\u6027\u548c\u53ef\u89e3\u91cb\u6027\u7684\u672a\u4f86\u65b9\u5411\u3002</paragraph>", "author": "Markus J. Buehler et.al.", "authors": "Markus J. Buehler", "id": "2502.13025v1", "paper_url": "http://arxiv.org/abs/2502.13025v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/13028/2502.13028v1.json b/database/storage/2502/13028/2502.13028v1.json
new file mode 100644
index 0000000000..1e61336815
--- /dev/null
+++ b/database/storage/2502/13028/2502.13028v1.json
@@ -0,0 +1 @@
+{"2502.13028": {"publish_time": "2025-02-18", "title": "Whose story is it? Personalizing story generation by inferring author styles", "paper_summary": "Personalization has become essential for improving user experience in\ninteractive writing and educational applications, yet its potential in story\ngeneration remains largely unexplored. In this work, we propose a novel\ntwo-stage pipeline for personalized story generation. Our approach first infers\nan author's implicit story-writing characteristics from their past work and\norganizes them into an Author Writing Sheet, inspired by narrative theory. The\nsecond stage uses this sheet to simulate the author's persona through tailored\npersona descriptions and personalized story writing rules. To enable and\nvalidate our approach, we construct Mythos, a dataset of 590 stories from 64\nauthors across five distinct sources that reflect diverse story-writing\nsettings. A head-to-head comparison with a non-personalized baseline\ndemonstrates our pipeline's effectiveness in generating high-quality\npersonalized stories. Our personalized stories achieve a 75 percent win rate\n(versus 14 percent for the baseline and 11 percent ties) in capturing authors'\nwriting style based on their past works. Human evaluation highlights the high\nquality of our Author Writing Sheet and provides valuable insights into the\npersonalized story generation task. 
Notable takeaways are that writings from\ncertain sources, such as Reddit, are easier to personalize than others, like\nAO3, while narrative aspects, like Creativity and Language Use, are easier to\npersonalize than others, like Plot.", "paper_summary_zh": "\u500b\u4eba\u5316\u5df2\u6210\u70ba\u6539\u5584\u4e92\u52d5\u5f0f\u5beb\u4f5c\u548c\u6559\u80b2\u61c9\u7528\u7a0b\u5f0f\u4e2d\u4f7f\u7528\u8005\u9ad4\u9a57\u7684\u5fc5\u8981\u624b\u6bb5\uff0c\u7136\u800c\u5176\u5728\u6545\u4e8b\u751f\u6210\u4e2d\u7684\u6f5b\u529b\u4ecd\u672a\u88ab\u5ee3\u6cdb\u63a2\u7d22\u3002\u5728\u9019\u9805\u5de5\u4f5c\u4e2d\uff0c\u6211\u5011\u63d0\u51fa\u4e86\u4e00\u500b\u5275\u65b0\u7684\u5169\u968e\u6bb5\u6d41\u7a0b\uff0c\u7528\u65bc\u500b\u4eba\u5316\u6545\u4e8b\u751f\u6210\u3002\u6211\u5011\u7684\u505a\u6cd5\u9996\u5148\u5f9e\u4f5c\u8005\u904e\u53bb\u7684\u4f5c\u54c1\u4e2d\u63a8\u8ad6\u51fa\u4f5c\u8005\u96b1\u542b\u7684\u6545\u4e8b\u5beb\u4f5c\u7279\u5fb5\uff0c\u4e26\u6839\u64da\u6558\u4e8b\u7406\u8ad6\u5c07\u5b83\u5011\u7d44\u7e54\u6210\u4f5c\u8005\u5beb\u4f5c\u8868\u3002\u7b2c\u4e8c\u968e\u6bb5\u4f7f\u7528\u6b64\u8868\u900f\u904e\u91cf\u8eab\u6253\u9020\u7684\u89d2\u8272\u63cf\u8ff0\u548c\u500b\u4eba\u5316\u6545\u4e8b\u5beb\u4f5c\u898f\u5247\u4f86\u6a21\u64ec\u4f5c\u8005\u7684\u89d2\u8272\u3002\u70ba\u4e86\u555f\u7528\u548c\u9a57\u8b49\u6211\u5011\u7684\u505a\u6cd5\uff0c\u6211\u5011\u5efa\u69cb\u4e86 Mythos\uff0c\u4e00\u500b\u5305\u542b\u4f86\u81ea 64 \u4f4d\u4f5c\u8005\u3001\u6a6b\u8de8\u4e94\u500b\u4e0d\u540c\u4f86\u6e90\u7684 590 
\u500b\u6545\u4e8b\u7684\u8cc7\u6599\u96c6\uff0c\u9019\u4e9b\u6545\u4e8b\u53cd\u6620\u4e86\u591a\u6a23\u5316\u7684\u6545\u4e8b\u5beb\u4f5c\u8a2d\u5b9a\u3002\u8207\u975e\u500b\u4eba\u5316\u57fa\u6e96\u9032\u884c\u4e00\u5c0d\u4e00\u7684\u6bd4\u8f03\uff0c\u8b49\u660e\u4e86\u6211\u5011\u7684\u6d41\u7a0b\u5728\u751f\u6210\u9ad8\u54c1\u8cea\u500b\u4eba\u5316\u6545\u4e8b\u65b9\u9762\u7684\u6709\u6548\u6027\u3002\u6211\u5011\u7684\u500b\u4eba\u5316\u6545\u4e8b\u4ee5 75% \u7684\u7372\u52dd\u7387\uff08\u76f8\u8f03\u65bc\u57fa\u6e96\u7684 14% \u548c 11% \u5e73\u624b\uff09\u6355\u6349\u5230\u4f5c\u8005\u57fa\u65bc\u5176\u904e\u53bb\u4f5c\u54c1\u7684\u5beb\u4f5c\u98a8\u683c\u3002\u4eba\u985e\u8a55\u4f30\u7a81\u986f\u4e86\u6211\u5011\u4f5c\u8005\u5beb\u4f5c\u8868\u7684\u512a\u826f\u54c1\u8cea\uff0c\u4e26\u63d0\u4f9b\u4e86\u5c0d\u500b\u4eba\u5316\u6545\u4e8b\u751f\u6210\u4efb\u52d9\u7684\u5bf6\u8cb4\u898b\u89e3\u3002\u503c\u5f97\u6ce8\u610f\u7684\u662f\uff0c\u4f86\u81ea\u67d0\u4e9b\u4f86\u6e90\uff08\u4f8b\u5982 Reddit\uff09\u7684\u4f5c\u54c1\u6bd4\u5176\u4ed6\u4f86\u6e90\uff08\u4f8b\u5982 AO3\uff09\u66f4\u5bb9\u6613\u500b\u4eba\u5316\uff0c\u800c\u6558\u4e8b\u5c64\u9762\uff08\u4f8b\u5982\u5275\u9020\u529b\u548c\u8a9e\u8a00\u4f7f\u7528\uff09\u6bd4\u5176\u4ed6\u5c64\u9762\uff08\u4f8b\u5982\u60c5\u7bc0\uff09\u66f4\u5bb9\u6613\u500b\u4eba\u5316\u3002", "author": "Nischal Ashok Kumar et.al.", "authors": "Nischal Ashok Kumar, Chau Minh Pham, Mohit Iyyer, Andrew Lan", "id": "2502.13028v1", "paper_url": "http://arxiv.org/abs/2502.13028v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/13031/2502.13031v1.json b/database/storage/2502/13031/2502.13031v1.json
new file mode 100644
index 0000000000..ff3d2b8011
--- /dev/null
+++ b/database/storage/2502/13031/2502.13031v1.json
@@ -0,0 +1 @@
+{"2502.13031": {"publish_time": "2025-02-18", "title": "HPSS: Heuristic Prompting Strategy Search for LLM Evaluators", "paper_summary": "Since the adoption of large language models (LLMs) for text evaluation has\nbecome increasingly prevalent in the field of natural language processing\n(NLP), a series of existing works attempt to optimize the prompts for LLM\nevaluators to improve their alignment with human judgment. However, their\nefforts are limited to optimizing individual factors of evaluation prompts,\nsuch as evaluation criteria or output formats, neglecting the combinatorial\nimpact of multiple factors, which leads to insufficient optimization of the\nevaluation pipeline. Nevertheless, identifying well-behaved prompting\nstrategies for adjusting multiple factors requires extensive enumeration. To\nthis end, we comprehensively integrate 8 key factors for evaluation prompts and\npropose a novel automatic prompting strategy optimization method called\nHeuristic Prompting Strategy Search (HPSS). Inspired by the genetic algorithm,\nHPSS conducts an iterative search to find well-behaved prompting strategies for\nLLM evaluators. A heuristic function is employed to guide the search process,\nenhancing the performance of our algorithm. 
Extensive experiments across four\nevaluation tasks demonstrate the effectiveness of HPSS, consistently\noutperforming both human-designed evaluation prompts and existing automatic\nprompt optimization methods.", "paper_summary_zh": "\u96a8\u8457\u81ea\u7136\u8a9e\u8a00\u8655\u7406\uff08NLP\uff09\u9818\u57df\u4e2d\u63a1\u7528\u5927\u578b\u8a9e\u8a00\u6a21\u578b\uff08LLM\uff09\u9032\u884c\u6587\u672c\u8a55\u4f30\u8b8a\u5f97\u8d8a\u4f86\u8d8a\u666e\u904d\uff0c\u4e00\u7cfb\u5217\u73fe\u6709\u5de5\u4f5c\u5617\u8a66\u512a\u5316 LLM \u8a55\u4f30\u5668\u7684\u63d0\u793a\uff0c\u4ee5\u6539\u5584\u5b83\u5011\u8207\u4eba\u985e\u5224\u65b7\u7684\u4e00\u81f4\u6027\u3002\u7136\u800c\uff0c\u4ed6\u5011\u7684\u52aa\u529b\u50c5\u9650\u65bc\u512a\u5316\u8a55\u4f30\u63d0\u793a\u7684\u500b\u5225\u56e0\u7d20\uff0c\u4f8b\u5982\u8a55\u4f30\u6e96\u5247\u6216\u8f38\u51fa\u683c\u5f0f\uff0c\u800c\u5ffd\u7565\u4e86\u591a\u7a2e\u56e0\u7d20\u7684\u7d44\u5408\u5f71\u97ff\uff0c\u9019\u5c0e\u81f4\u8a55\u4f30\u7ba1\u9053\u512a\u5316\u4e0d\u8db3\u3002\u5118\u7ba1\u5982\u6b64\uff0c\u627e\u51fa\u8abf\u6574\u591a\u7a2e\u56e0\u7d20\u7684\u826f\u597d\u63d0\u793a\u7b56\u7565\u9700\u8981\u5ee3\u6cdb\u7684\u679a\u8209\u3002\u70ba\u6b64\uff0c\u6211\u5011\u5168\u9762\u6574\u5408\u4e86\u8a55\u4f30\u63d0\u793a\u7684 8 \u500b\u95dc\u9375\u56e0\u7d20\uff0c\u4e26\u63d0\u51fa\u4e86\u4e00\u7a2e\u540d\u70ba\u555f\u767c\u5f0f\u63d0\u793a\u7b56\u7565\u641c\u7d22\uff08HPSS\uff09\u7684\u65b0\u578b\u81ea\u52d5\u63d0\u793a\u7b56\u7565\u512a\u5316\u65b9\u6cd5\u3002\u5728\u907a\u50b3\u6f14\u7b97\u6cd5\u7684\u555f\u767c\u4e0b\uff0cHPSS \u9032\u884c\u53cd\u8986\u641c\u7d22\u4ee5\u627e\u51fa LLM \u8a55\u4f30\u5668\u7684\u826f\u597d\u63d0\u793a\u7b56\u7565\u3002\u63a1\u7528\u555f\u767c\u5f0f\u51fd\u6578\u4f86\u6307\u5c0e\u641c\u7d22\u904e\u7a0b\uff0c\u589e\u5f37\u4e86\u6211\u5011\u6f14\u7b97\u6cd5\u7684\u6548\u80fd\u3002\u5728\u56db\u9805\u8a55\u4f30\u4efb\u52d9\u4e2d\u9032\u884c\u7684\u5ee3\u6cdb\u5be6\u9a57\u8b49\u660e\u4e86 
HPSS \u7684\u6709\u6548\u6027\uff0c\u59cb\u7d42\u512a\u65bc\u4eba\u985e\u8a2d\u8a08\u7684\u8a55\u4f30\u63d0\u793a\u548c\u73fe\u6709\u7684\u81ea\u52d5\u63d0\u793a\u512a\u5316\u65b9\u6cd5\u3002", "author": "Bosi Wen et.al.", "authors": "Bosi Wen, Pei Ke, Yufei Sun, Cunxiang Wang, Xiaotao Gu, Jinfeng Zhou, Jie Tang, Hongning Wang, Minlie Huang", "id": "2502.13031v1", "paper_url": "http://arxiv.org/abs/2502.13031v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/13034/2502.13034v1.json b/database/storage/2502/13034/2502.13034v1.json
new file mode 100644
index 0000000000..06c21eeb0e
--- /dev/null
+++ b/database/storage/2502/13034/2502.13034v1.json
@@ -0,0 +1 @@
+{"2502.13034": {"publish_time": "2025-02-18", "title": "Natural Language Generation from Visual Sequences: Challenges and Future Directions", "paper_summary": "The ability to use natural language to talk about visual content is at the\ncore of human intelligence and a crucial feature of any artificial intelligence\nsystem. Various studies have focused on generating text for single images. In\ncontrast, comparatively little attention has been paid to exhaustively\nanalyzing and advancing work on multiple-image vision-to-text settings. In this\nposition paper, we claim that any task dealing with temporally ordered\nsequences of multiple images or frames is an instance of a broader, more\ngeneral problem involving the understanding of intricate relationships between\nthe visual content and the corresponding text. We comprehensively analyze five\ntasks that are instances of this problem and argue that they pose a common set\nof challenges and share similarities in terms of modeling and evaluation\napproaches. Based on the insights from these various aspects and stages of\nmulti-image-to-text generation, we highlight several open questions and suggest\nfuture research directions. 
We believe that these directions can advance the\nunderstanding of complex phenomena in this domain and the development of better\nmodels.", "paper_summary_zh": "\u4f7f\u7528\u81ea\u7136\u8a9e\u8a00\u4f86\u8ac7\u8ad6\u8996\u89ba\u5167\u5bb9\u7684\u80fd\u529b\u662f\u4eba\u985e\u667a\u6167\u7684\u6838\u5fc3\uff0c\u4e5f\u662f\u4efb\u4f55\u4eba\u5de5\u667a\u6167\u7cfb\u7d71\u7684\u4e00\u9805\u95dc\u9375\u529f\u80fd\u3002\u5404\u7a2e\u7814\u7a76\u90fd\u5c08\u6ce8\u65bc\u70ba\u55ae\u4e00\u5f71\u50cf\u7522\u751f\u6587\u5b57\u3002\u76f8\u8f03\u4e4b\u4e0b\uff0c\u5c0d\u65bc\u8a73\u76e1\u5206\u6790\u548c\u63a8\u9032\u591a\u91cd\u5f71\u50cf\u8996\u89ba\u8f49\u6587\u5b57\u8a2d\u5b9a\u7684\u5de5\u4f5c\uff0c\u95dc\u6ce8\u8f03\u5c11\u3002\u5728\u6b64\u7acb\u5834\u6587\u4ef6\u4e2d\uff0c\u6211\u5011\u8072\u7a31\u4efb\u4f55\u8655\u7406\u591a\u91cd\u5f71\u50cf\u6216\u756b\u683c\u7684\u6642\u9593\u9806\u5e8f\u5e8f\u5217\u7684\u4efb\u52d9\uff0c\u90fd\u662f\u4e00\u500b\u66f4\u5ee3\u6cdb\u3001\u66f4\u666e\u904d\u554f\u984c\u7684\u7bc4\u4f8b\uff0c\u6d89\u53ca\u7406\u89e3\u8996\u89ba\u5167\u5bb9\u548c\u5c0d\u61c9\u6587\u5b57\u4e4b\u9593\u7684\u8907\u96dc\u95dc\u4fc2\u3002\u6211\u5011\u5168\u9762\u5206\u6790\u4e86\u6b64\u554f\u984c\u7684\u4e94\u500b\u7bc4\u4f8b\u4efb\u52d9\uff0c\u4e26\u8ad6\u8b49\u5b83\u5011\u63d0\u51fa\u4e86\u4e00\u7d44\u5e38\u898b\u7684\u6311\u6230\uff0c\u4e14\u5728\u5efa\u6a21\u548c\u8a55\u4f30\u65b9\u6cd5\u65b9\u9762\u6709\u76f8\u4f3c\u4e4b\u8655\u3002\u6839\u64da\u591a\u91cd\u5f71\u50cf\u8f49\u6587\u5b57\u751f\u6210\u7684\u9019\u4e9b\u4e0d\u540c\u9762\u5411\u548c\u968e\u6bb5\u7684\u898b\u89e3\uff0c\u6211\u5011\u7a81\u51fa\u4e86\u5e7e\u500b\u958b\u653e\u6027\u554f\u984c\uff0c\u4e26\u5efa\u8b70\u672a\u4f86\u7684\u7814\u7a76\u65b9\u5411\u3002\u6211\u5011\u76f8\u4fe1\u9019\u4e9b\u65b9\u5411\u53ef\u4ee5\u63a8\u9032\u5c0d\u6b64\u9818\u57df\u4e2d\u8907\u96dc\u73fe\u8c61\u7684\u7406\u89e3\uff0c\u4ee5\u53ca\u958b\u767c\u51fa\u66f4\u597d\u7684\u6a21\u578b\u3002", "author": 
"Aditya K Surikuchi et.al.", "authors": "Aditya K Surikuchi, Raquel Fern\u00e1ndez, Sandro Pezzelle", "id": "2502.13034v1", "paper_url": "http://arxiv.org/abs/2502.13034v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/13044/2502.13044v1.json b/database/storage/2502/13044/2502.13044v1.json
new file mode 100644
index 0000000000..51b0082720
--- /dev/null
+++ b/database/storage/2502/13044/2502.13044v1.json
@@ -0,0 +1 @@
+{"2502.13044": {"publish_time": "2025-02-18", "title": "Do we still need Human Annotators? Prompting Large Language Models for Aspect Sentiment Quad Prediction", "paper_summary": "Aspect sentiment quadruple prediction (ASQP) facilitates a detailed\nunderstanding of opinions expressed in a text by identifying the opinion term,\naspect term, aspect category and sentiment polarity for each opinion. However,\nannotating a full set of training examples to fine-tune models for ASQP is a\nresource-intensive process. In this study, we explore the capabilities of large\nlanguage models (LLMs) for zero- and few-shot learning on the ASQP task across\nfive diverse datasets. We report F1 scores slightly below those obtained with\nstate-of-the-art fine-tuned models but exceeding previously reported zero- and\nfew-shot performance. In the 40-shot setting on the Rest16 restaurant domain\ndataset, LLMs achieved an F1 score of 52.46, compared to 60.39 by the\nbest-performing fine-tuned method MVP. Additionally, we report the performance\nof LLMs in target aspect sentiment detection (TASD), where the F1 scores were\nalso close to fine-tuned models, achieving 66.03 on Rest16 in the 40-shot\nsetting, compared to 72.76 with MVP. 
While human annotators remain essential\nfor achieving optimal performance, LLMs can reduce the need for extensive\nmanual annotation in ASQP tasks.", "paper_summary_zh": "\u9762\u5411\u89c0\u9ede\u7684\u56db\u5143\u9810\u6e2c (ASQP) \u900f\u904e\u8fa8\u8b58\u5404\u500b\u89c0\u9ede\u7684\u89c0\u9ede\u8a5e\u5f59\u3001\u9762\u5411\u8a5e\u5f59\u3001\u9762\u5411\u985e\u5225\u548c\u89c0\u9ede\u6975\u6027\uff0c\u5354\u52a9\u8a73\u7d30\u4e86\u89e3\u6587\u5b57\u4e2d\u8868\u9054\u7684\u610f\u898b\u3002\u7136\u800c\uff0c\u6a19\u8a3b\u4e00\u7d44\u5b8c\u6574\u7684\u8a13\u7df4\u7bc4\u4f8b\u4ee5\u5fae\u8abf ASQP \u6a21\u578b\u662f\u4e00\u500b\u8017\u8cbb\u8cc7\u6e90\u7684\u904e\u7a0b\u3002\u5728\u9019\u9805\u7814\u7a76\u4e2d\uff0c\u6211\u5011\u63a2\u8a0e\u5927\u578b\u8a9e\u8a00\u6a21\u578b (LLM) \u5728 ASQP \u4efb\u52d9\u4e2d\u9032\u884c\u96f6\u6b21\u548c\u5c11\u91cf\u5b78\u7fd2\u7684\u80fd\u529b\uff0c\u6a6b\u8de8\u4e94\u500b\u4e0d\u540c\u7684\u8cc7\u6599\u96c6\u3002\u6211\u5011\u5831\u544a\u7684 F1 \u5206\u6578\u7565\u4f4e\u65bc\u4f7f\u7528\u6700\u5148\u9032\u7684\u5fae\u8abf\u6a21\u578b\u7372\u5f97\u7684\u5206\u6578\uff0c\u4f46\u8d85\u904e\u5148\u524d\u5831\u544a\u7684\u96f6\u6b21\u548c\u5c11\u91cf\u5b78\u7fd2\u8868\u73fe\u3002\u5728 Rest16 \u9910\u5ef3\u9818\u57df\u8cc7\u6599\u96c6\u7684 40 \u6b21\u5b78\u7fd2\u8a2d\u5b9a\u4e2d\uff0cLLM \u9054\u5230\u4e86 52.46 \u7684 F1 \u5206\u6578\uff0c\u800c\u6548\u80fd\u6700\u4f73\u7684\u5fae\u8abf\u65b9\u6cd5 MVP \u5247\u70ba 60.39\u3002\u6b64\u5916\uff0c\u6211\u5011\u5831\u544a\u4e86 LLM \u5728\u76ee\u6a19\u9762\u5411\u89c0\u9ede\u5075\u6e2c (TASD) \u4e2d\u7684\u8868\u73fe\uff0c\u5176\u4e2d F1 \u5206\u6578\u4e5f\u63a5\u8fd1\u5fae\u8abf\u6a21\u578b\uff0c\u5728 40 \u6b21\u5b78\u7fd2\u8a2d\u5b9a\u4e2d\u65bc Rest16 \u9054\u5230 66.03\uff0c\u800c MVP \u5247\u70ba 72.76\u3002\u5118\u7ba1\u4eba\u985e\u6a19\u8a3b\u54e1\u5c0d\u65bc\u9054\u6210\u6700\u4f73\u6548\u80fd\u4ecd\u7136\u81f3\u95dc\u91cd\u8981\uff0c\u4f46 LLM \u53ef\u4ee5\u6e1b\u5c11 
ASQP \u4efb\u52d9\u4e2d\u5ee3\u6cdb\u624b\u52d5\u6a19\u8a3b\u7684\u9700\u6c42\u3002", "author": "Nils Constantin Hellwig et.al.", "authors": "Nils Constantin Hellwig, Jakob Fehle, Udo Kruschwitz, Christian Wolff", "id": "2502.13044v1", "paper_url": "http://arxiv.org/abs/2502.13044v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/13055/2502.13055v1.json b/database/storage/2502/13055/2502.13055v1.json
new file mode 100644
index 0000000000..56bfbded7d
--- /dev/null
+++ b/database/storage/2502/13055/2502.13055v1.json
@@ -0,0 +1 @@
+{"2502.13055": {"publish_time": "2025-02-18", "title": "LAMD: Context-driven Android Malware Detection and Classification with LLMs", "paper_summary": "The rapid growth of mobile applications has escalated Android malware\nthreats. Although there are numerous detection methods, they often struggle\nwith evolving attacks, dataset biases, and limited explainability. Large\nLanguage Models (LLMs) offer a promising alternative with their zero-shot\ninference and reasoning capabilities. However, applying LLMs to Android malware\ndetection presents two key challenges: (1)the extensive support code in Android\napplications, often spanning thousands of classes, exceeds LLMs' context limits\nand obscures malicious behavior within benign functionality; (2)the structural\ncomplexity and interdependencies of Android applications surpass LLMs'\nsequence-based reasoning, fragmenting code analysis and hindering malicious\nintent inference. To address these challenges, we propose LAMD, a practical\ncontext-driven framework to enable LLM-based Android malware detection. LAMD\nintegrates key context extraction to isolate security-critical code regions and\nconstruct program structures, then applies tier-wise code reasoning to analyze\napplication behavior progressively, from low-level instructions to high-level\nsemantics, providing final prediction and explanation. A well-designed factual\nconsistency verification mechanism is equipped to mitigate LLM hallucinations\nfrom the first tier. 
Evaluation in real-world settings demonstrates LAMD's\neffectiveness over conventional detectors, establishing a feasible basis for\nLLM-driven malware analysis in dynamic threat landscapes.", "paper_summary_zh": "\u96a8\u8457\u884c\u52d5\u61c9\u7528\u7a0b\u5f0f\u5feb\u901f\u6210\u9577\uff0cAndroid \u60e1\u610f\u8edf\u9ad4\u5a01\u8105\u4e5f\u96a8\u4e4b\u5347\u7d1a\u3002\u96d6\u7136\u6709\u8a31\u591a\u5075\u6e2c\u65b9\u6cd5\uff0c\u4f46\u5b83\u5011\u7d93\u5e38\u96e3\u4ee5\u61c9\u4ed8\u4e0d\u65b7\u6f14\u9032\u7684\u653b\u64ca\u3001\u8cc7\u6599\u96c6\u504f\u5dee\u548c\u6709\u9650\u7684\u53ef\u89e3\u91cb\u6027\u3002\u5927\u578b\u8a9e\u8a00\u6a21\u578b (LLM) \u63d0\u4f9b\u4e86\u4e00\u500b\u6709\u524d\u9014\u7684\u66ff\u4ee3\u65b9\u6848\uff0c\u5177\u5099\u96f6\u6b21\u5b78\u7fd2\u63a8\u7406\u548c\u63a8\u7406\u80fd\u529b\u3002\u7136\u800c\uff0c\u5c07 LLM \u61c9\u7528\u65bc Android \u60e1\u610f\u8edf\u9ad4\u5075\u6e2c\u6703\u51fa\u73fe\u5169\u500b\u4e3b\u8981\u6311\u6230\uff1a(1) Android \u61c9\u7528\u7a0b\u5f0f\u4e2d\u5927\u91cf\u7684\u652f\u63f4\u7a0b\u5f0f\u78bc\uff0c\u901a\u5e38\u6a6b\u8de8\u6578\u5343\u500b\u985e\u5225\uff0c\u8d85\u904e LLM \u7684\u4e0a\u4e0b\u6587\u9650\u5236\uff0c\u4e26\u6a21\u7cca\u4e86\u826f\u6027\u529f\u80fd\u4e2d\u7684\u60e1\u610f\u884c\u70ba\uff1b(2) Android \u61c9\u7528\u7a0b\u5f0f\u7684\u7d50\u69cb\u8907\u96dc\u6027\u548c\u76f8\u4e92\u4f9d\u8cf4\u6027\u8d85\u904e LLM \u7684\u57fa\u65bc\u5e8f\u5217\u7684\u63a8\u7406\uff0c\u6703\u9020\u6210\u7a0b\u5f0f\u78bc\u5206\u6790\u7834\u788e\uff0c\u4e26\u963b\u7919\u60e1\u610f\u610f\u5716\u63a8\u8ad6\u3002\u70ba\u4e86\u61c9\u5c0d\u9019\u4e9b\u6311\u6230\uff0c\u6211\u5011\u63d0\u51fa\u4e86 LAMD\uff0c\u4e00\u500b\u5be6\u7528\u7684\u8108\u7d61\u9a45\u52d5\u67b6\u69cb\uff0c\u4ee5\u652f\u63f4\u57fa\u65bc LLM \u7684 Android \u60e1\u610f\u8edf\u9ad4\u5075\u6e2c\u3002LAMD 
\u6574\u5408\u4e86\u95dc\u9375\u8108\u7d61\u8403\u53d6\uff0c\u4ee5\u9694\u96e2\u8207\u5b89\u5168\u6027\u81f3\u95dc\u91cd\u8981\u7684\u7a0b\u5f0f\u78bc\u5340\u57df\u4e26\u5efa\u69cb\u7a0b\u5f0f\u7d50\u69cb\uff0c\u7136\u5f8c\u5957\u7528\u5206\u5c64\u5f0f\u7a0b\u5f0f\u78bc\u63a8\u7406\uff0c\u9010\u6b65\u5206\u6790\u61c9\u7528\u7a0b\u5f0f\u884c\u70ba\uff0c\u5f9e\u4f4e\u968e\u6307\u4ee4\u5230\u9ad8\u968e\u8a9e\u610f\uff0c\u63d0\u4f9b\u6700\u7d42\u9810\u6e2c\u548c\u8aaa\u660e\u3002\u4e00\u500b\u8a2d\u8a08\u826f\u597d\u7684\u4e8b\u5be6\u4e00\u81f4\u6027\u9a57\u8b49\u6a5f\u5236\u5177\u5099\u6e1b\u8f15 LLM \u5f9e\u7b2c\u4e00\u5c64\u7522\u751f\u7684\u5e7b\u89ba\u7684\u80fd\u529b\u3002\u5728\u771f\u5be6\u74b0\u5883\u4e2d\u7684\u8a55\u4f30\u986f\u793a\uff0cLAMD \u512a\u65bc\u50b3\u7d71\u5075\u6e2c\u5668\uff0c\u70ba\u52d5\u614b\u5a01\u8105\u74b0\u5883\u4e2d\u7684 LLM \u9a45\u52d5\u60e1\u610f\u8edf\u9ad4\u5206\u6790\u5efa\u7acb\u4e86\u4e00\u500b\u53ef\u884c\u7684\u57fa\u790e\u3002", "author": "Xingzhi Qian et.al.", "authors": "Xingzhi Qian, Xinran Zheng, Yiling He, Shuo Yang, Lorenzo Cavallaro", "id": "2502.13055v1", "paper_url": "http://arxiv.org/abs/2502.13055v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/13059/2502.13059v1.json b/database/storage/2502/13059/2502.13059v1.json
new file mode 100644
index 0000000000..5c01440181
--- /dev/null
+++ b/database/storage/2502/13059/2502.13059v1.json
@@ -0,0 +1 @@
+{"2502.13059": {"publish_time": "2025-02-18", "title": "SimpleVQA: Multimodal Factuality Evaluation for Multimodal Large Language Models", "paper_summary": "The increasing application of multi-modal large language models (MLLMs)\nacross various sectors have spotlighted the essence of their output reliability\nand accuracy, particularly their ability to produce content grounded in factual\ninformation (e.g. common and domain-specific knowledge). In this work, we\nintroduce SimpleVQA, the first comprehensive multi-modal benchmark to evaluate\nthe factuality ability of MLLMs to answer natural language short questions.\nSimpleVQA is characterized by six key features: it covers multiple tasks and\nmultiple scenarios, ensures high quality and challenging queries, maintains\nstatic and timeless reference answers, and is straightforward to evaluate. Our\napproach involves categorizing visual question-answering items into 9 different\ntasks around objective events or common knowledge and situating these within 9\ntopics. Rigorous quality control processes are implemented to guarantee\nhigh-quality, concise, and clear answers, facilitating evaluation with minimal\nvariance via an LLM-as-a-judge scoring system. 
Using SimpleVQA, we perform a\ncomprehensive assessment of leading 18 MLLMs and 8 text-only LLMs, delving into\ntheir image comprehension and text generation abilities by identifying and\nanalyzing error cases.", "paper_summary_zh": "\u96a8\u8457\u591a\u6a21\u614b\u5927\u578b\u8a9e\u8a00\u6a21\u578b (MLLM) \u5728\u5404\u500b\u9818\u57df\u7684\u61c9\u7528\u65e5\u76ca\u666e\u53ca\uff0c\u5176\u8f38\u51fa\u7d50\u679c\u7684\u53ef\u9760\u6027\u548c\u6e96\u78ba\u6027\u5df2\u5099\u53d7\u95dc\u6ce8\uff0c\u7279\u5225\u662f\u5176\u6839\u64da\u4e8b\u5be6\u8cc7\u8a0a\uff08\u4f8b\u5982\u4e00\u822c\u77e5\u8b58\u548c\u7279\u5b9a\u9818\u57df\u77e5\u8b58\uff09\u7522\u751f\u5167\u5bb9\u7684\u80fd\u529b\u3002\u5728\u672c\u6587\u4e2d\uff0c\u6211\u5011\u4ecb\u7d39 SimpleVQA\uff0c\u9019\u662f\u7b2c\u4e00\u500b\u7528\u65bc\u8a55\u4f30 MLLM \u56de\u7b54\u81ea\u7136\u8a9e\u8a00\u7c21\u77ed\u554f\u984c\u7684\u4e8b\u5be6\u80fd\u529b\u7684\u7d9c\u5408\u591a\u6a21\u614b\u57fa\u6e96\u3002SimpleVQA \u6709\u516d\u500b\u4e3b\u8981\u7279\u5fb5\uff1a\u6db5\u84cb\u591a\u9805\u4efb\u52d9\u548c\u591a\u7a2e\u60c5\u5883\u3001\u78ba\u4fdd\u9ad8\u54c1\u8cea\u4e14\u5177\u6311\u6230\u6027\u7684\u67e5\u8a62\u3001\u7dad\u8b77\u975c\u614b\u4e14\u6c38\u6046\u7684\u53c3\u8003\u7b54\u6848\uff0c\u800c\u4e14\u8a55\u4f30\u8d77\u4f86\u5f88\u7c21\u55ae\u3002\u6211\u5011\u7684\u505a\u6cd5\u662f\u5c07\u8996\u89ba\u554f\u7b54\u9805\u76ee\u5206\u985e\u70ba 9 \u500b\u4e0d\u540c\u7684\u4efb\u52d9\uff0c\u570d\u7e5e\u5ba2\u89c0\u4e8b\u4ef6\u6216\u5e38\u8b58\uff0c\u4e26\u5c07\u5b83\u5011\u7f6e\u65bc 9 \u500b\u4e3b\u984c\u4e2d\u3002\u6211\u5011\u5be6\u65bd\u56b4\u683c\u7684\u54c1\u8cea\u63a7\u7ba1\u6d41\u7a0b\uff0c\u4ee5\u4fdd\u8b49\u7b54\u6848\u7684\u9ad8\u54c1\u8cea\u3001\u7c21\u6f54\u548c\u6e05\u6670\uff0c\u4e26\u900f\u904e LLM \u4f5c\u70ba\u8a55\u5206\u7cfb\u7d71\uff0c\u4ee5\u6700\u5c0f\u7684\u5dee\u7570\u9032\u884c\u8a55\u4f30\u3002\u6211\u5011\u4f7f\u7528 SimpleVQA \u5c0d 18 \u500b\u4e3b\u8981\u7684 MLLM \u548c 8 
\u500b\u7d14\u6587\u5b57 LLM \u9032\u884c\u5168\u9762\u8a55\u4f30\uff0c\u900f\u904e\u627e\u51fa\u548c\u5206\u6790\u932f\u8aa4\u6848\u4f8b\uff0c\u6df1\u5165\u63a2\u8a0e\u5b83\u5011\u7684\u5f71\u50cf\u7406\u89e3\u548c\u6587\u5b57\u751f\u6210\u80fd\u529b\u3002", "author": "Xianfu Cheng et.al.", "authors": "Xianfu Cheng, Wei Zhang, Shiwei Zhang, Jian Yang, Xiangyuan Guan, Xianjie Wu, Xiang Li, Ge Zhang, Jiaheng Liu, Yuying Mai, Yutao Zeng, Zhoufutu Wen, Ke Jin, Baorui Wang, Weixiao Zhou, Yunhong Lu, Tongliang Li, Wenhao Huang, Zhoujun Li", "id": "2502.13059v1", "paper_url": "http://arxiv.org/abs/2502.13059v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/13061/2502.13061v1.json b/database/storage/2502/13061/2502.13061v1.json
new file mode 100644
index 0000000000..94845ff72e
--- /dev/null
+++ b/database/storage/2502/13061/2502.13061v1.json
@@ -0,0 +1 @@
+{"2502.13061": {"publish_time": "2025-02-18", "title": "Improved Fine-Tuning of Large Multimodal Models for Hateful Meme Detection", "paper_summary": "Hateful memes have become a significant concern on the Internet,\nnecessitating robust automated detection systems. While large multimodal models\nhave shown strong generalization across various tasks, they exhibit poor\ngeneralization to hateful meme detection due to the dynamic nature of memes\ntied to emerging social trends and breaking news. Recent work further\nhighlights the limitations of conventional supervised fine-tuning for large\nmultimodal models in this context. To address these challenges, we propose\nLarge Multimodal Model Retrieval-Guided Contrastive Learning (LMM-RGCL), a\nnovel two-stage fine-tuning framework designed to improve both in-domain\naccuracy and cross-domain generalization. Experimental results on six widely\nused meme classification datasets demonstrate that LMM-RGCL achieves\nstate-of-the-art performance, outperforming agent-based systems such as\nVPD-PALI-X-55B. 
Furthermore, our method effectively generalizes to\nout-of-domain memes under low-resource settings, surpassing models like GPT-4o.", "paper_summary_zh": "\u7db2\u8def\u4e0a\u7684\u4ec7\u6068\u8ff7\u56e0\u5df2\u6210\u70ba\u4e00\u5927\u96b1\u6182\uff0c\u56e0\u6b64\u9700\u8981\u5f37\u5927\u7684\u81ea\u52d5\u5316\u5075\u6e2c\u7cfb\u7d71\u3002\u96d6\u7136\u5927\u578b\u591a\u6a21\u614b\u6a21\u578b\u5df2\u5728\u5404\u7a2e\u4efb\u52d9\u4e2d\u5c55\u73fe\u51fa\u5f37\u5927\u7684\u6cdb\u5316\u80fd\u529b\uff0c\u4f46\u7531\u65bc\u8ff7\u56e0\u8207\u65b0\u8208\u793e\u6703\u8da8\u52e2\u548c\u7a81\u767c\u65b0\u805e\u606f\u606f\u76f8\u95dc\uff0c\u56e0\u6b64\u5728\u4ec7\u6068\u8ff7\u56e0\u5075\u6e2c\u65b9\u9762\u8868\u73fe\u4e0d\u4f73\u3002\u6700\u8fd1\u7684\u7814\u7a76\u9032\u4e00\u6b65\u5f37\u8abf\u4e86\u5728\u9019\u7a2e\u60c5\u6cc1\u4e0b\uff0c\u50b3\u7d71\u76e3\u7763\u5fae\u8abf\u5c0d\u5927\u578b\u591a\u6a21\u614b\u6a21\u578b\u7684\u9650\u5236\u3002\u70ba\u4e86\u61c9\u5c0d\u9019\u4e9b\u6311\u6230\uff0c\u6211\u5011\u63d0\u51fa\u4e86\u5927\u578b\u591a\u6a21\u614b\u6a21\u578b\u6aa2\u7d22\u5f15\u5c0e\u5c0d\u6bd4\u5b78\u7fd2 (LMM-RGCL)\uff0c\u9019\u662f\u4e00\u7a2e\u65b0\u7a4e\u7684\u5169\u968e\u6bb5\u5fae\u8abf\u67b6\u69cb\uff0c\u65e8\u5728\u63d0\u9ad8\u9818\u57df\u5167\u6e96\u78ba\u5ea6\u548c\u8de8\u9818\u57df\u6cdb\u5316\u80fd\u529b\u3002\u5728\u516d\u500b\u5ee3\u6cdb\u4f7f\u7528\u7684\u8ff7\u56e0\u5206\u985e\u8cc7\u6599\u96c6\u4e0a\u7684\u5be6\u9a57\u7d50\u679c\u8868\u660e\uff0cLMM-RGCL \u9054\u5230\u4e86\u6700\u5148\u9032\u7684\u6548\u80fd\uff0c\u512a\u65bc\u57fa\u65bc\u4ee3\u7406\u7684\u7cfb\u7d71\uff0c\u4f8b\u5982 VPD-PALI-X-55B\u3002\u6b64\u5916\uff0c\u6211\u5011\u7684\u6a21\u578b\u5728\u4f4e\u8cc7\u6e90\u8a2d\u5b9a\u4e0b\u6709\u6548\u6cdb\u5316\u5230\u9818\u57df\u5916\u8ff7\u56e0\uff0c\u8d85\u8d8a\u4e86 GPT-4o \u7b49\u6a21\u578b\u3002", "author": "Jingbiao Mei et.al.", "authors": "Jingbiao Mei, Jinghong Chen, Guangyu Yang, Weizhe Lin, Bill Byrne", "id": "2502.13061v1", 
"paper_url": "http://arxiv.org/abs/2502.13061v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/13062/2502.13062v1.json b/database/storage/2502/13062/2502.13062v1.json
new file mode 100644
index 0000000000..518f036b09
--- /dev/null
+++ b/database/storage/2502/13062/2502.13062v1.json
@@ -0,0 +1 @@
+{"2502.13062": {"publish_time": "2025-02-18", "title": "AI-Assisted Decision Making with Human Learning", "paper_summary": "AI systems increasingly support human decision-making. In many cases, despite\nthe algorithm's superior performance, the final decision remains in human\nhands. For example, an AI may assist doctors in determining which diagnostic\ntests to run, but the doctor ultimately makes the diagnosis. This paper studies\nsuch AI-assisted decision-making settings, where the human learns through\nrepeated interactions with the algorithm. In our framework, the algorithm --\ndesigned to maximize decision accuracy according to its own model -- determines\nwhich features the human can consider. The human then makes a prediction based\non their own less accurate model. We observe that the discrepancy between the\nalgorithm's model and the human's model creates a fundamental tradeoff. Should\nthe algorithm prioritize recommending more informative features, encouraging\nthe human to recognize their importance, even if it results in less accurate\npredictions in the short term until learning occurs? Or is it preferable to\nforgo educating the human and instead select features that align more closely\nwith their existing understanding, minimizing the immediate cost of learning?\nThis tradeoff is shaped by the algorithm's time-discounted objective and the\nhuman's learning ability. Our results show that optimal feature selection has a\nsurprisingly clean combinatorial characterization, reducible to a stationary\nsequence of feature subsets that is tractable to compute. As the algorithm\nbecomes more \"patient\" or the human's learning improves, the algorithm\nincreasingly selects more informative features, enhancing both prediction\naccuracy and the human's understanding. Notably, early investment in learning\nleads to the selection of more informative features than a later investment. 
We\ncomplement our analysis by showing that the impact of errors in the algorithm's\nknowledge is limited as it does not make the prediction directly.", "paper_summary_zh": "\u4eba\u5de5\u667a\u6167\u7cfb\u7d71\u65e5\u76ca\u652f\u63f4\u4eba\u985e\u6c7a\u7b56\u3002\u5728\u8a31\u591a\u60c5\u6cc1\u4e0b\uff0c\u5118\u7ba1\u6f14\u7b97\u6cd5\u7684\u6548\u80fd\u512a\u7570\uff0c\u6700\u7d42\u6c7a\u7b56\u4ecd\u638c\u63e1\u5728\u4eba\u985e\u624b\u4e2d\u3002\u4f8b\u5982\uff0c\u4eba\u5de5\u667a\u6167\u53ef\u80fd\u6703\u5354\u52a9\u91ab\u751f\u6c7a\u5b9a\u8981\u57f7\u884c\u54ea\u4e9b\u8a3a\u65b7\u6e2c\u8a66\uff0c\u4f46\u6700\u7d42\u4e0b\u8a3a\u65b7\u7684\u662f\u91ab\u751f\u3002\u672c\u6587\u63a2\u8a0e\u6b64\u985e\u4eba\u5de5\u667a\u6167\u8f14\u52a9\u6c7a\u7b56\u8a2d\u5b9a\uff0c\u5176\u4e2d\u4eba\u985e\u900f\u904e\u8207\u6f14\u7b97\u6cd5\u91cd\u8907\u4e92\u52d5\u800c\u5b78\u7fd2\u3002\u5728\u6211\u5011\u7684\u67b6\u69cb\u4e2d\uff0c\u6f14\u7b97\u6cd5\uff08\u65e8\u5728\u6839\u64da\u5176\u81ea\u8eab\u6a21\u578b\u6700\u5927\u5316\u6c7a\u7b56\u6e96\u78ba\u5ea6\uff09\u6703\u6c7a\u5b9a\u4eba\u985e\u53ef\u4ee5\u8003\u91cf\u7684\u7279\u5fb5\u3002\u7136\u5f8c\uff0c\u4eba\u985e\u6839\u64da\u5176\u81ea\u8eab\u8f03\u4e0d\u6e96\u78ba\u7684\u6a21\u578b\u505a\u51fa\u9810\u6e2c\u3002\u6211\u5011\u89c0\u5bdf\u5230\uff0c\u6f14\u7b97\u6cd5\u6a21\u578b\u8207\u4eba\u985e\u6a21\u578b\u4e4b\u9593\u7684\u5dee\u7570\u6703\u7522\u751f\u57fa\u672c\u7684\u6b0a\u8861\u3002\u6f14\u7b97\u6cd5\u662f\u5426\u61c9\u512a\u5148\u63a8\u85a6\u66f4\u591a\u8cc7\u8a0a\u6027\u7279\u5fb5\uff0c\u9f13\u52f5\u4eba\u985e\u8a8d\u8b58\u5176\u91cd\u8981\u6027\uff0c\u5373\u4f7f\u77ed\u671f\u5167\u6703\u5c0e\u81f4\u6e96\u78ba\u5ea6\u8f03\u4f4e\u7684\u9810\u6e2c\uff0c\u76f4\u5230\u5b78\u7fd2\u767c\u751f\uff1f\u6216\u8005\uff0c\u662f\u5426\u8f03\u597d\u653e\u68c4\u6559\u80b2\u4eba\u985e\uff0c\u800c\u9078\u64c7\u8207\u5176\u73fe\u6709\u7406\u89e3\u66f4\u7dca\u5bc6\u5c0d\u9f4a\u7684\u7279\u5fb5\uff0c\u5c07\u5b78\u7fd2\u7684\u7acb\u5373\u
6210\u672c\u964d\u81f3\u6700\u4f4e\uff1f\u9019\u7a2e\u6b0a\u8861\u53d6\u6c7a\u65bc\u6f14\u7b97\u6cd5\u7684\u6642\u9593\u6298\u73fe\u76ee\u6a19\u548c\u4eba\u985e\u7684\u5b78\u7fd2\u80fd\u529b\u3002\u6211\u5011\u7684\u7d50\u679c\u8868\u660e\uff0c\u6700\u4f73\u7279\u5fb5\u9078\u64c7\u5177\u6709\u4ee4\u4eba\u9a5a\u8a1d\u7684\u4e7e\u6de8\u7d44\u5408\u7279\u5fb5\uff0c\u53ef\u7c21\u5316\u70ba\u53ef\u8a08\u7b97\u7684\u56fa\u5b9a\u7279\u5fb5\u5b50\u96c6\u5e8f\u5217\u3002\u96a8\u8457\u6f14\u7b97\u6cd5\u8b8a\u5f97\u66f4\u300c\u6709\u8010\u5fc3\u300d\u6216\u4eba\u985e\u7684\u5b78\u7fd2\u9032\u6b65\uff0c\u6f14\u7b97\u6cd5\u6703\u8d8a\u4f86\u8d8a\u591a\u5730\u9078\u64c7\u66f4\u591a\u8cc7\u8a0a\u6027\u7279\u5fb5\uff0c\u589e\u5f37\u9810\u6e2c\u6e96\u78ba\u5ea6\u548c\u4eba\u985e\u7684\u7406\u89e3\u3002\u503c\u5f97\u6ce8\u610f\u7684\u662f\uff0c\u65e9\u671f\u6295\u8cc7\u65bc\u5b78\u7fd2\u6703\u5c0e\u81f4\u9078\u64c7\u6bd4\u5f8c\u671f\u6295\u8cc7\u66f4\u591a\u8cc7\u8a0a\u6027\u7279\u5fb5\u3002\u6211\u5011\u900f\u904e\u986f\u793a\u6f14\u7b97\u6cd5\u77e5\u8b58\u4e2d\u932f\u8aa4\u7684\u5f71\u97ff\u662f\u6709\u9650\u7684\uff0c\u56e0\u70ba\u5b83\u4e0d\u6703\u76f4\u63a5\u505a\u51fa\u9810\u6e2c\uff0c\u4f86\u88dc\u5145\u6211\u5011\u7684\u5206\u6790\u3002", "author": "Gali Noti et.al.", "authors": "Gali Noti, Kate Donahue, Jon Kleinberg, Sigal Oren", "id": "2502.13062v1", "paper_url": "http://arxiv.org/abs/2502.13062v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/13063/2502.13063v1.json b/database/storage/2502/13063/2502.13063v1.json
new file mode 100644
index 0000000000..3456a211bd
--- /dev/null
+++ b/database/storage/2502/13063/2502.13063v1.json
@@ -0,0 +1 @@
+{"2502.13063": {"publish_time": "2025-02-18", "title": "Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity", "paper_summary": "A range of recent works addresses the problem of compression of sequence of\ntokens into a shorter sequence of real-valued vectors to be used as inputs\ninstead of token embeddings or key-value cache. These approaches allow to\nreduce the amount of compute in existing language models. Despite relying on\npowerful models as encoders, the maximum attainable lossless compression ratio\nis typically not higher than x10. This fact is highly intriguing because, in\ntheory, the maximum information capacity of large real-valued vectors is far\nbeyond the presented rates even for 16-bit precision and a modest vector size.\nIn this work, we explore the limits of compression by replacing the encoder\nwith a per-sample optimization procedure. We show that vectors with compression\nratios up to x1500 exist, which highlights two orders of magnitude gap between\nexisting and practically attainable solutions. Furthermore, we empirically show\nthat the compression limits are determined not by the length of the input but\nby the amount of uncertainty to be reduced, namely, the cross-entropy loss on\nthis sequence without any conditioning. 
The obtained limits highlight the\nsubstantial gap between the theoretical capacity of input embeddings and their\npractical utilization, suggesting significant room for optimization in model\ndesign.", "paper_summary_zh": "\u4e00\u7cfb\u5217\u8fd1\u671f\u4f5c\u54c1\u63a2\u8ba8\u4e86\u5c06\u5e8f\u5217\u6807\u8bb0\u538b\u7f29\u6210\u8f83\u77ed\u7684\u5b9e\u503c\u5411\u91cf\u5e8f\u5217\u7684\u95ee\u9898\uff0c\u4ee5\u7528\u4f5c\u8f93\u5165\uff0c\u800c\u4e0d\u662f\u6807\u8bb0\u5d4c\u5165\u6216\u952e\u503c\u7f13\u5b58\u3002\u8fd9\u4e9b\u65b9\u6cd5\u5141\u8bb8\u51cf\u5c11\u73b0\u6709\u8bed\u8a00\u6a21\u578b\u4e2d\u7684\u8ba1\u7b97\u91cf\u3002\u5c3d\u7ba1\u4f9d\u8d56\u4e8e\u5f3a\u5927\u7684\u6a21\u578b\u4f5c\u4e3a\u7f16\u7801\u5668\uff0c\u4f46\u6700\u5927\u53ef\u8fbe\u5230\u7684\u65e0\u635f\u538b\u7f29\u6bd4\u901a\u5e38\u4e0d\u9ad8\u4e8e x10\u3002\u8fd9\u4e00\u4e8b\u5b9e\u975e\u5e38\u6709\u8da3\uff0c\u56e0\u4e3a\u7406\u8bba\u4e0a\uff0c\u5373\u4f7f\u5bf9\u4e8e 16 \u4f4d\u7cbe\u5ea6\u548c\u9002\u4e2d\u7684\u5411\u91cf\u5927\u5c0f\uff0c\u5927\u578b\u5b9e\u503c\u5411\u91cf\u7684\u6700\u5927\u4fe1\u606f\u5bb9\u91cf\u4e5f\u8fdc\u8fdc\u8d85\u51fa\u4e86\u6240\u5448\u73b0\u7684\u901f\u7387\u3002\u5728\u8fd9\u9879\u5de5\u4f5c\u4e2d\uff0c\u6211\u4eec\u901a\u8fc7\u7528\u6309\u6837\u672c\u4f18\u5316\u7a0b\u5e8f\u66ff\u6362\u7f16\u7801\u5668\u6765\u63a2\u7d22\u538b\u7f29\u7684\u6781\u9650\u3002\u6211\u4eec\u8868\u660e\uff0c\u5b58\u5728\u538b\u7f29\u6bd4\u9ad8\u8fbe x1500 
\u7684\u5411\u91cf\uff0c\u8fd9\u7a81\u51fa\u4e86\u73b0\u6709\u89e3\u51b3\u65b9\u6848\u548c\u5b9e\u9645\u53ef\u5b9e\u73b0\u89e3\u51b3\u65b9\u6848\u4e4b\u95f4\u4e24\u4e2a\u6570\u91cf\u7ea7\u7684\u5dee\u8ddd\u3002\u6b64\u5916\uff0c\u6211\u4eec\u51ed\u7ecf\u9a8c\u8868\u660e\uff0c\u538b\u7f29\u6781\u9650\u4e0d\u662f\u7531\u8f93\u5165\u7684\u957f\u5ea6\u51b3\u5b9a\u7684\uff0c\u800c\u662f\u7531\u8981\u51cf\u5c11\u7684\u4e0d\u786e\u5b9a\u6027\u91cf\u51b3\u5b9a\u7684\uff0c\u5373\u5728\u6b64\u5e8f\u5217\u4e0a\u7684\u4ea4\u53c9\u71b5\u635f\u5931\uff0c\u6ca1\u6709\u4efb\u4f55\u6761\u4ef6\u3002\u83b7\u5f97\u7684\u6781\u9650\u7a81\u51fa\u4e86\u8f93\u5165\u5d4c\u5165\u7684\u7406\u8bba\u5bb9\u91cf\u4e0e\u5176\u5b9e\u9645\u5229\u7528\u4e4b\u95f4\u7684\u5de8\u5927\u5dee\u8ddd\uff0c\u8868\u660e\u6a21\u578b\u8bbe\u8ba1\u4e2d\u6709\u5f88\u5927\u7684\u4f18\u5316\u7a7a\u95f4\u3002", "author": "Yuri Kuratov et.al.", "authors": "Yuri Kuratov, Mikhail Arkhipov, Aydar Bulatov, Mikhail Burtsev", "id": "2502.13063v1", "paper_url": "http://arxiv.org/abs/2502.13063v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/13069/2502.13069v1.json b/database/storage/2502/13069/2502.13069v1.json
new file mode 100644
index 0000000000..0b9531e413
--- /dev/null
+++ b/database/storage/2502/13069/2502.13069v1.json
@@ -0,0 +1 @@
+{"2502.13069": {"publish_time": "2025-02-18", "title": "Interactive Agents to Overcome Ambiguity in Software Engineering", "paper_summary": "AI agents are increasingly being deployed to automate tasks, often based on\nambiguous and underspecified user instructions. Making unwarranted assumptions\nand failing to ask clarifying questions can lead to suboptimal outcomes, safety\nrisks due to tool misuse, and wasted computational resources. In this work, we\nstudy the ability of LLM agents to handle ambiguous instructions in interactive\ncode generation settings by evaluating proprietary and open-weight models on\ntheir performance across three key steps: (a) leveraging interactivity to\nimprove performance in ambiguous scenarios, (b) detecting ambiguity, and (c)\nasking targeted questions. Our findings reveal that models struggle to\ndistinguish between well-specified and underspecified instructions. However,\nwhen models interact for underspecified inputs, they effectively obtain vital\ninformation from the user, leading to significant improvements in performance\nand underscoring the value of effective interaction. 
Our study highlights\ncritical gaps in how current state-of-the-art models handle ambiguity in\ncomplex software engineering tasks and structures the evaluation into distinct\nsteps to enable targeted improvements.", "paper_summary_zh": "\u4eba\u5de5\u667a\u80fd\u4ee3\u7406\u6b63\u8d8a\u4f86\u8d8a\u591a\u5730\u88ab\u90e8\u7f72\u7528\u65bc\u81ea\u52d5\u5316\u4efb\u52d9\uff0c\u901a\u5e38\u57fa\u65bc\u6a21\u68f1\u5169\u53ef\u4e14\u672a\u660e\u78ba\u898f\u5b9a\u7684\u4f7f\u7528\u8005\u6307\u4ee4\u3002\u505a\u51fa\u4e0d\u5408\u7406\u7684\u5047\u8a2d\u4e14\u672a\u80fd\u63d0\u51fa\u6f84\u6e05\u554f\u984c\uff0c\u53ef\u80fd\u5c0e\u81f4\u6b21\u4f73\u7d50\u679c\u3001\u56e0\u5de5\u5177\u8aa4\u7528\u800c\u7522\u751f\u7684\u5b89\u5168\u98a8\u96aa\uff0c\u4ee5\u53ca\u6d6a\u8cbb\u904b\u7b97\u8cc7\u6e90\u3002\u5728\u9019\u9805\u5de5\u4f5c\u4e2d\uff0c\u6211\u5011\u7814\u7a76\u4e86 LLM \u4ee3\u7406\u5728\u4e92\u52d5\u5f0f\u7a0b\u5f0f\u78bc\u751f\u6210\u8a2d\u5b9a\u4e2d\u8655\u7406\u6a21\u68f1\u5169\u53ef\u6307\u4ee4\u7684\u80fd\u529b\uff0c\u65b9\u6cd5\u662f\u5728\u4e09\u500b\u95dc\u9375\u6b65\u9a5f\u4e2d\u8a55\u4f30\u5c08\u6709\u548c\u958b\u653e\u6b0a\u91cd\u7684\u6a21\u578b\uff1a (a) \u5229\u7528\u4e92\u52d5\u6027\u4f86\u63d0\u5347\u5728\u6a21\u68f1\u5169\u53ef\u5834\u666f\u4e2d\u7684\u6548\u80fd\u3001(b) \u5075\u6e2c\u6a21\u7cca\u6027\uff0c\u4ee5\u53ca (c) 
\u63d0\u51fa\u76ee\u6a19\u554f\u984c\u3002\u6211\u5011\u7684\u7814\u7a76\u7d50\u679c\u986f\u793a\uff0c\u6a21\u578b\u96e3\u4ee5\u5340\u5206\u660e\u78ba\u898f\u7bc4\u7684\u6307\u4ee4\u548c\u672a\u660e\u78ba\u898f\u7bc4\u7684\u6307\u4ee4\u3002\u7136\u800c\uff0c\u7576\u6a21\u578b\u91dd\u5c0d\u672a\u660e\u78ba\u898f\u7bc4\u7684\u8f38\u5165\u9032\u884c\u4e92\u52d5\u6642\uff0c\u5b83\u5011\u6703\u6709\u6548\u5730\u5f9e\u4f7f\u7528\u8005\u53d6\u5f97\u91cd\u8981\u8cc7\u8a0a\uff0c\u9032\u800c\u5927\u5e45\u63d0\u5347\u6548\u80fd\uff0c\u4e26\u5f37\u8abf\u6709\u6548\u4e92\u52d5\u7684\u50f9\u503c\u3002\u6211\u5011\u7684\u7814\u7a76\u7a81\u986f\u4e86\u76ee\u524d\u6700\u5148\u9032\u7684\u6a21\u578b\u5728\u8655\u7406\u8907\u96dc\u8edf\u9ad4\u5de5\u7a0b\u4efb\u52d9\u4e2d\u7684\u6a21\u7cca\u6027\u6642\u5b58\u5728\u54ea\u4e9b\u95dc\u9375\u5dee\u8ddd\uff0c\u4e26\u5c07\u8a55\u4f30\u67b6\u69cb\u70ba\u4e0d\u540c\u7684\u6b65\u9a5f\uff0c\u4ee5\u4fc3\u6210\u6709\u76ee\u6a19\u7684\u6539\u5584\u3002", "author": "Sanidhya Vijayvargiya et.al.", "authors": "Sanidhya Vijayvargiya, Xuhui Zhou, Akhila Yerukola, Maarten Sap, Graham Neubig", "id": "2502.13069v1", "paper_url": "http://arxiv.org/abs/2502.13069v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/13076/2502.13076v1.json b/database/storage/2502/13076/2502.13076v1.json
new file mode 100644
index 0000000000..1fe8575ad9
--- /dev/null
+++ b/database/storage/2502/13076/2502.13076v1.json
@@ -0,0 +1 @@
+{"2502.13076": {"publish_time": "2025-02-18", "title": "KAPPA: A Generic Patent Analysis Framework with Keyphrase-Based Portraits", "paper_summary": "Patent analysis highly relies on concise and interpretable document\nrepresentations, referred to as patent portraits. Keyphrases, both present and\nabsent, are ideal candidates for patent portraits due to their brevity,\nrepresentativeness, and clarity. In this paper, we introduce KAPPA, an\nintegrated framework designed to construct keyphrase-based patent portraits and\nenhance patent analysis. KAPPA operates in two phases: patent portrait\nconstruction and portrait-based analysis. To ensure effective portrait\nconstruction, we propose a semantic-calibrated keyphrase generation paradigm\nthat integrates pre-trained language models with a prompt-based hierarchical\ndecoding strategy to leverage the multi-level structural characteristics of\npatents. For portrait-based analysis, we develop a comprehensive framework that\nemploys keyphrase-based patent portraits to enable efficient and accurate\npatent analysis. Extensive experiments on benchmark datasets of keyphrase\ngeneration, the proposed model achieves significant improvements compared to\nstate-of-the-art baselines. 
Further experiments conducted on real-world patent\napplications demonstrate that our keyphrase-based portraits effectively capture\ndomain-specific knowledge and enrich semantic representation for patent\nanalysis tasks.", "paper_summary_zh": "\u5c08\u5229\u5206\u6790\u9ad8\u5ea6\u4f9d\u8cf4\u7c21\u6f54\u4e14\u53ef\u89e3\u8b80\u7684\u6587\u4ef6\u8868\u793a\uff0c\u7a31\u70ba\u5c08\u5229\u63cf\u8ff0\u3002\u95dc\u9375\u5b57\u7d44\uff0c\u7121\u8ad6\u662f\u5b58\u5728\u7684\u9084\u662f\u4e0d\u5b58\u5728\u7684\uff0c\u90fd\u662f\u5c08\u5229\u63cf\u8ff0\u7684\u7406\u60f3\u5019\u9078\u8005\uff0c\u56e0\u70ba\u5b83\u5011\u7c21\u6f54\u3001\u5177\u6709\u4ee3\u8868\u6027\u4e14\u6e05\u6670\u3002\u5728\u672c\u6587\u4e2d\uff0c\u6211\u5011\u4ecb\u7d39\u4e86 KAPPA\uff0c\u4e00\u500b\u7528\u65bc\u5efa\u69cb\u57fa\u65bc\u95dc\u9375\u5b57\u7d44\u7684\u5c08\u5229\u63cf\u8ff0\u548c\u589e\u5f37\u5c08\u5229\u5206\u6790\u7684\u6574\u5408\u5f0f\u67b6\u69cb\u3002KAPPA \u5206\u70ba\u5169\u500b\u968e\u6bb5\u57f7\u884c\uff1a\u5c08\u5229\u63cf\u8ff0\u5efa\u69cb\u548c\u57fa\u65bc\u63cf\u8ff0\u7684\u5206\u6790\u3002\u70ba\u78ba\u4fdd\u6709\u6548\u7684\u63cf\u8ff0\u5efa\u69cb\uff0c\u6211\u5011\u63d0\u51fa\u4e86\u4e00\u500b\u8a9e\u7fa9\u6821\u6e96\u95dc\u9375\u5b57\u7d44\u751f\u6210\u7bc4\u4f8b\uff0c\u5b83\u5c07\u9810\u5148\u8a13\u7df4\u7684\u8a9e\u8a00\u6a21\u578b\u8207\u57fa\u65bc\u63d0\u793a\u7684\u5206\u5c64\u89e3\u78bc\u7b56\u7565\u6574\u5408\u5728\u4e00\u8d77\uff0c\u4ee5\u5229\u7528\u5c08\u5229\u7684\u591a\u5206\u5c64\u7d50\u69cb\u7279\u6027\u3002\u5c0d\u65bc\u57fa\u65bc\u63cf\u8ff0\u7684\u5206\u6790\uff0c\u6211\u5011\u958b\u767c\u4e86\u4e00\u500b\u5168\u9762\u7684\u67b6\u69cb\uff0c\u5b83\u63a1\u7528\u57fa\u65bc\u95dc\u9375\u5b57\u7d44\u7684\u5c08\u5229\u63cf\u8ff0\uff0c\u4ee5\u5be6\u73fe\u9ad8\u6548\u4e14\u6e96\u78ba\u7684\u5c08\u5229\u5206\u6790\u3002\u5728\u95dc\u9375\u5b57\u7d44\u751f\u6210\u57fa\u6e96\u8cc7\u6599\u96c6\u4e0a\u9032\u884c\u7684\u5ee3\u6cdb\u5be6\u9a57\u4e2d\uff0c\u8207\u6700\u5
148\u9032\u7684\u57fa\u6e96\u7dda\u76f8\u6bd4\uff0c\u6240\u63d0\u51fa\u7684\u6a21\u578b\u53d6\u5f97\u4e86\u986f\u8457\u7684\u6539\u9032\u3002\u5728\u771f\u5be6\u4e16\u754c\u5c08\u5229\u7533\u8acb\u4e0a\u9032\u884c\u7684\u9032\u4e00\u6b65\u5be6\u9a57\u8868\u660e\uff0c\u6211\u5011\u57fa\u65bc\u95dc\u9375\u5b57\u7d44\u7684\u63cf\u8ff0\u6709\u6548\u5730\u64f7\u53d6\u4e86\u7279\u5b9a\u9818\u57df\u7684\u77e5\u8b58\uff0c\u4e26\u8c50\u5bcc\u4e86\u5c08\u5229\u5206\u6790\u4efb\u52d9\u7684\u8a9e\u7fa9\u8868\u793a\u3002", "author": "Xin Xia et.al.", "authors": "Xin Xia, Yujin Wang, Jun Zhou, Guisheng Zhong, Linning Cai, Chen Zhang", "id": "2502.13076v1", "paper_url": "http://arxiv.org/abs/2502.13076v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/13092/2502.13092v1.json b/database/storage/2502/13092/2502.13092v1.json
new file mode 100644
index 0000000000..61daa89faf
--- /dev/null
+++ b/database/storage/2502/13092/2502.13092v1.json
@@ -0,0 +1 @@
+{"2502.13092": {"publish_time": "2025-02-18", "title": "Text2World: Benchmarking Large Language Models for Symbolic World Model Generation", "paper_summary": "Recently, there has been growing interest in leveraging large language models\n(LLMs) to generate symbolic world models from textual descriptions. Although\nLLMs have been extensively explored in the context of world modeling, prior\nstudies encountered several challenges, including evaluation randomness,\ndependence on indirect metrics, and a limited domain scope. To address these\nlimitations, we introduce a novel benchmark, Text2World, based on planning\ndomain definition language (PDDL), featuring hundreds of diverse domains and\nemploying multi-criteria, execution-based metrics for a more robust evaluation.\nWe benchmark current LLMs using Text2World and find that reasoning models\ntrained with large-scale reinforcement learning outperform others. However,\neven the best-performing model still demonstrates limited capabilities in world\nmodeling. Building on these insights, we examine several promising strategies\nto enhance the world modeling capabilities of LLMs, including test-time\nscaling, agent training, and more. We hope that Text2World can serve as a\ncrucial resource, laying the groundwork for future research in leveraging LLMs\nas world models. 
The project page is available at\nhttps://text-to-world.github.io/.", "paper_summary_zh": "\u6700\u8fd1\uff0c\u4eba\u4eec\u8d8a\u6765\u8d8a\u6709\u5174\u8da3\u5229\u7528\u5927\u578b\u8bed\u8a00\u6a21\u578b\uff08LLM\uff09\u4ece\u6587\u672c\u63cf\u8ff0\u4e2d\u751f\u6210\u7b26\u53f7\u4e16\u754c\u6a21\u578b\u3002\u5c3d\u7ba1 LLM \u5df2\u5728\u4e16\u754c\u5efa\u6a21\u7684\u80cc\u666f\u4e0b\u5f97\u5230\u5e7f\u6cdb\u63a2\u7d22\uff0c\u4f46\u5148\u524d\u7684\u7814\u7a76\u9047\u5230\u4e86\u82e5\u5e72\u6311\u6218\uff0c\u5305\u62ec\u8bc4\u4f30\u968f\u673a\u6027\u3001\u5bf9\u95f4\u63a5\u6307\u6807\u7684\u4f9d\u8d56\u4ee5\u53ca\u6709\u9650\u7684\u9886\u57df\u8303\u56f4\u3002\u4e3a\u4e86\u89e3\u51b3\u8fd9\u4e9b\u9650\u5236\uff0c\u6211\u4eec\u5f15\u5165\u4e86\u57fa\u4e8e\u89c4\u5212\u57df\u5b9a\u4e49\u8bed\u8a00\uff08PDDL\uff09\u7684\u65b0\u57fa\u51c6 Text2World\uff0c\u8be5\u57fa\u51c6\u5305\u542b\u6570\u767e\u4e2a\u4e0d\u540c\u7684\u57df\uff0c\u5e76\u91c7\u7528\u57fa\u4e8e\u6267\u884c\u7684\u591a\u6807\u51c6\u6307\u6807\u6765\u8fdb\u884c\u66f4\u7a33\u5065\u7684\u8bc4\u4f30\u3002\u6211\u4eec\u4f7f\u7528 Text2World \u5bf9\u5f53\u524d\u7684 LLM \u8fdb\u884c\u4e86\u57fa\u51c6\u6d4b\u8bd5\uff0c\u53d1\u73b0\u4f7f\u7528\u5927\u89c4\u6a21\u5f3a\u5316\u5b66\u4e60\u8bad\u7ec3\u7684\u63a8\u7406\u6a21\u578b\u4f18\u4e8e\u5176\u4ed6\u6a21\u578b\u3002\u7136\u800c\uff0c\u5373\u4f7f\u662f\u6027\u80fd\u6700\u4f73\u7684\u6a21\u578b\u5728\u4e16\u754c\u5efa\u6a21\u65b9\u9762\u4ecd\u7136\u8868\u73b0\u51fa\u6709\u9650\u7684\u80fd\u529b\u3002\u57fa\u4e8e\u8fd9\u4e9b\u89c1\u89e3\uff0c\u6211\u4eec\u7814\u7a76\u4e86\u51e0\u79cd\u6709\u5e0c\u671b\u7684\u7b56\u7565\u6765\u589e\u5f3a LLM \u7684\u4e16\u754c\u5efa\u6a21\u80fd\u529b\uff0c\u5305\u62ec\u6d4b\u8bd5\u65f6\u7f29\u653e\u3001\u4ee3\u7406\u8bad\u7ec3\u7b49\u7b49\u3002\u6211\u4eec\u5e0c\u671b Text2World \u80fd\u591f\u4f5c\u4e3a\u4e00\u9879\u81f3\u5173\u91cd\u8981\u7684\u8d44\u6e90\uff0c\u4e3a\u672a\u6765\u5229\u7528 LLM 
\u4f5c\u4e3a\u4e16\u754c\u6a21\u578b\u7684\u7814\u7a76\u5960\u5b9a\u57fa\u7840\u3002\u9879\u76ee\u9875\u9762\u53ef\u5728 https://text-to-world.github.io/ \u83b7\u5f97\u3002", "author": "Mengkang Hu et.al.", "authors": "Mengkang Hu, Tianxing Chen, Yude Zou, Yuheng Lei, Qiguang Chen, Ming Li, Hongyuan Zhang, Wenqi Shao, Ping Luo", "id": "2502.13092v1", "paper_url": "http://arxiv.org/abs/2502.13092v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/13095/2502.13095v1.json b/database/storage/2502/13095/2502.13095v1.json
new file mode 100644
index 0000000000..e2f34173fd
--- /dev/null
+++ b/database/storage/2502/13095/2502.13095v1.json
@@ -0,0 +1 @@
+{"2502.13095": {"publish_time": "2025-02-18", "title": "Understanding and Rectifying Safety Perception Distortion in VLMs", "paper_summary": "Recent studies reveal that vision-language models (VLMs) become more\nsusceptible to harmful requests and jailbreak attacks after integrating the\nvision modality, exhibiting greater vulnerability than their text-only LLM\nbackbones. To uncover the root cause of this phenomenon, we conduct an in-depth\nanalysis and identify a key issue: multimodal inputs introduce an\nmodality-induced activation shift toward a \"safer\" direction compared to their\ntext-only counterparts, leading VLMs to systematically overestimate the safety\nof harmful inputs. We refer to this issue as safety perception distortion. To\nmitigate such distortion, we propose Activation Shift Disentanglement and\nCalibration (ShiftDC), a training-free method that decomposes and calibrates\nthe modality-induced activation shift to reduce the impact of modality on\nsafety. By isolating and removing the safety-relevant component, ShiftDC\nrestores the inherent safety alignment of the LLM backbone while preserving the\nvision-language capabilities of VLMs. 
Empirical results demonstrate that\nShiftDC significantly enhances alignment performance on safety benchmarks\nwithout impairing model utility.", "paper_summary_zh": "\u6700\u8fd1\u7684\u7814\u7a76\u8868\u660e\uff0c\u5728\u6574\u5408\u4e86\u89c6\u89c9\u6a21\u6001\u540e\uff0c\u89c6\u89c9\u8bed\u8a00\u6a21\u578b (VLM) \u66f4\u5bb9\u6613\u53d7\u5230\u6709\u5bb3\u8bf7\u6c42\u548c\u8d8a\u72f1\u653b\u51fb\uff0c\u8868\u73b0\u51fa\u6bd4\u5176\u4ec5\u6587\u672c\u7684 LLM \u4e3b\u5e72\u66f4\u5927\u7684\u6f0f\u6d1e\u3002\u4e3a\u4e86\u63ed\u793a\u8fd9\u79cd\u73b0\u8c61\u7684\u6839\u672c\u539f\u56e0\uff0c\u6211\u4eec\u8fdb\u884c\u4e86\u6df1\u5165\u5206\u6790\uff0c\u5e76\u786e\u5b9a\u4e86\u4e00\u4e2a\u5173\u952e\u95ee\u9898\uff1a\u4e0e\u4ec5\u6587\u672c\u7684\u5bf9\u5e94\u7269\u76f8\u6bd4\uff0c\u591a\u6a21\u6001\u8f93\u5165\u5f15\u5165\u4e86\u671d\u201c\u66f4\u5b89\u5168\u201d\u65b9\u5411\u7684\u6a21\u6001\u8bf1\u5bfc\u6fc0\u6d3b\u8f6c\u79fb\uff0c\u5bfc\u81f4 VLM \u7cfb\u7edf\u6027\u5730\u9ad8\u4f30\u6709\u5bb3\u8f93\u5165\u7684\u5b89\u5168\u6027\u3002\u6211\u4eec\u5c06\u6b64\u95ee\u9898\u79f0\u4e3a\u5b89\u5168\u611f\u77e5\u626d\u66f2\u3002\u4e3a\u4e86\u51cf\u8f7b\u8fd9\u79cd\u626d\u66f2\uff0c\u6211\u4eec\u63d0\u51fa\u4e86\u6fc0\u6d3b\u8f6c\u79fb\u89e3\u8026\u548c\u6821\u51c6 (ShiftDC)\uff0c\u8fd9\u662f\u4e00\u79cd\u65e0\u8bad\u7ec3\u65b9\u6cd5\uff0c\u7528\u4e8e\u5206\u89e3\u548c\u6821\u51c6\u6a21\u6001\u8bf1\u5bfc\u7684\u6fc0\u6d3b\u8f6c\u79fb\uff0c\u4ee5\u51cf\u5c11\u6a21\u6001\u5bf9\u5b89\u5168\u6027\u7684\u5f71\u54cd\u3002\u901a\u8fc7\u9694\u79bb\u548c\u79fb\u9664\u4e0e\u5b89\u5168\u6027\u76f8\u5173\u7684\u7ec4\u4ef6\uff0cShiftDC \u6062\u590d\u4e86 LLM \u4e3b\u5e72\u7684\u56fa\u6709\u5b89\u5168\u6027\u5bf9\u9f50\uff0c\u540c\u65f6\u4fdd\u7559\u4e86 VLM \u7684\u89c6\u89c9\u8bed\u8a00\u80fd\u529b\u3002\u5b9e\u8bc1\u7ed3\u679c\u8868\u660e\uff0cShiftDC 
\u5728\u4e0d\u635f\u5bb3\u6a21\u578b\u6548\u7528\u7684\u60c5\u51b5\u4e0b\uff0c\u663e\u8457\u589e\u5f3a\u4e86\u5b89\u5168\u57fa\u51c6\u4e0a\u7684\u5bf9\u9f50\u6027\u80fd\u3002", "author": "Xiaohan Zou et.al.", "authors": "Xiaohan Zou, Jian Kang, George Kesidis, Lu Lin", "id": "2502.13095v1", "paper_url": "http://arxiv.org/abs/2502.13095v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/13107/2502.13107v1.json b/database/storage/2502/13107/2502.13107v1.json
new file mode 100644
index 0000000000..2e5434d107
--- /dev/null
+++ b/database/storage/2502/13107/2502.13107v1.json
@@ -0,0 +1 @@
+{"2502.13107": {"publish_time": "2025-02-18", "title": "MatterChat: A Multi-Modal LLM for Material Science", "paper_summary": "Understanding and predicting the properties of inorganic materials is crucial\nfor accelerating advancements in materials science and driving applications in\nenergy, electronics, and beyond. Integrating material structure data with\nlanguage-based information through multi-modal large language models (LLMs)\noffers great potential to support these efforts by enhancing human-AI\ninteraction. However, a key challenge lies in integrating atomic structures at\nfull resolution into LLMs. In this work, we introduce MatterChat, a versatile\nstructure-aware multi-modal LLM that unifies material structural data and\ntextual inputs into a single cohesive model. MatterChat employs a bridging\nmodule to effectively align a pretrained machine learning interatomic potential\nwith a pretrained LLM, reducing training costs and enhancing flexibility. Our\nresults demonstrate that MatterChat significantly improves performance in\nmaterial property prediction and human-AI interaction, surpassing\ngeneral-purpose LLMs such as GPT-4. 
We also demonstrate its usefulness in\napplications such as more advanced scientific reasoning and step-by-step\nmaterial synthesis.", "paper_summary_zh": "\u4e86\u89e3\u548c\u9810\u6e2c\u7121\u6a5f\u6750\u6599\u7684\u7279\u6027\u5c0d\u65bc\u52a0\u901f\u6750\u6599\u79d1\u5b78\u7684\u9032\u6b65\u548c\u63a8\u52d5\u80fd\u6e90\u3001\u96fb\u5b50\u7b49\u65b9\u9762\u7684\u61c9\u7528\u81f3\u95dc\u91cd\u8981\u3002\u900f\u904e\u591a\u6a21\u614b\u5927\u578b\u8a9e\u8a00\u6a21\u578b (LLM) \u5c07\u6750\u6599\u7d50\u69cb\u6578\u64da\u8207\u57fa\u65bc\u8a9e\u8a00\u7684\u8cc7\u8a0a\u6574\u5408\uff0c\u53ef\u4ee5\u6975\u5927\u7a0b\u5ea6\u5730\u652f\u6301\u9019\u4e9b\u5de5\u4f5c\uff0c\u85c9\u6b64\u589e\u5f37\u4eba\u985e\u8207 AI \u7684\u4e92\u52d5\u3002\u7136\u800c\uff0c\u4e00\u500b\u95dc\u9375\u6311\u6230\u5728\u65bc\u5c07\u539f\u5b50\u7d50\u69cb\u4ee5\u5b8c\u6574\u89e3\u6790\u5ea6\u6574\u5408\u5230 LLM \u4e2d\u3002\u5728\u9019\u9805\u5de5\u4f5c\u4e2d\uff0c\u6211\u5011\u5f15\u5165\u4e86 MatterChat\uff0c\u9019\u662f\u4e00\u500b\u901a\u7528\u7684\u7d50\u69cb\u611f\u77e5\u591a\u6a21\u614b LLM\uff0c\u5b83\u5c07\u6750\u6599\u7d50\u69cb\u6578\u64da\u548c\u6587\u5b57\u8f38\u5165\u7d71\u4e00\u5230\u4e00\u500b\u55ae\u4e00\u7684\u5167\u805a\u6a21\u578b\u4e2d\u3002MatterChat \u63a1\u7528\u6a4b\u63a5\u6a21\u7d44\uff0c\u5c07\u9810\u5148\u8a13\u7df4\u597d\u7684\u6a5f\u5668\u5b78\u7fd2\u539f\u5b50\u9593\u96fb\u4f4d\u8207\u9810\u5148\u8a13\u7df4\u597d\u7684 LLM \u6709\u6548\u5730\u5c0d\u9f4a\uff0c\u5f9e\u800c\u964d\u4f4e\u8a13\u7df4\u6210\u672c\u4e26\u589e\u5f37\u9748\u6d3b\u6027\u3002\u6211\u5011\u7684\u7d50\u679c\u8868\u660e\uff0cMatterChat \u5927\u5e45\u63d0\u5347\u4e86\u6750\u6599\u7279\u6027\u9810\u6e2c\u548c\u4eba\u985e\u8207 AI \u4e92\u52d5\u7684\u6548\u80fd\uff0c\u8d85\u8d8a\u4e86 GPT-4 \u7b49\u901a\u7528 
LLM\u3002\u6211\u5011\u4e5f\u5c55\u793a\u4e86\u5b83\u5728\u66f4\u9032\u968e\u7684\u79d1\u5b78\u63a8\u7406\u548c\u9010\u6b65\u6750\u6599\u5408\u6210\u7b49\u61c9\u7528\u4e2d\u7684\u6548\u7528\u3002", "author": "Yingheng Tang et.al.", "authors": "Yingheng Tang, Wenbin Xu, Jie Cao, Jianzhu Ma, Weilu Gao, Steve Farrell, Benjamin Erichson, Michael W. Mahoney, Andy Nonaka, Zhi Yao", "id": "2502.13107v1", "paper_url": "http://arxiv.org/abs/2502.13107v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/13108/2502.13108v1.json b/database/storage/2502/13108/2502.13108v1.json
new file mode 100644
index 0000000000..8d4e36d83e
--- /dev/null
+++ b/database/storage/2502/13108/2502.13108v1.json
@@ -0,0 +1 @@
+{"2502.13108": {"publish_time": "2025-02-18", "title": "Improving Clinical Question Answering with Multi-Task Learning: A Joint Approach for Answer Extraction and Medical Categorization", "paper_summary": "Clinical Question Answering (CQA) plays a crucial role in medical\ndecision-making, enabling physicians to extract relevant information from\nElectronic Medical Records (EMRs). While transformer-based models such as BERT,\nBioBERT, and ClinicalBERT have demonstrated state-of-the-art performance in\nCQA, existing models lack the ability to categorize extracted answers, which is\ncritical for structured retrieval, content filtering, and medical decision\nsupport.\n  To address this limitation, we introduce a Multi-Task Learning (MTL)\nframework that jointly trains CQA models for both answer extraction and medical\ncategorization. In addition to predicting answer spans, our model classifies\nresponses into five standardized medical categories: Diagnosis, Medication,\nSymptoms, Procedure, and Lab Reports. This categorization enables more\nstructured and interpretable outputs, making clinical QA models more useful in\nreal-world healthcare settings.\n  We evaluate our approach on emrQA, a large-scale dataset for medical question\nanswering. Results show that MTL improves F1-score by 2.2% compared to standard\nfine-tuning, while achieving 90.7% accuracy in answer categorization. 
These\nfindings suggest that MTL not only enhances CQA performance but also introduces\nan effective mechanism for categorization and structured medical information\nretrieval.", "paper_summary_zh": "<paragraph>\u81e8\u5e8a\u554f\u7b54 (CQA) \u5728\u91ab\u7642\u6c7a\u7b56\u4e2d\u626e\u6f14\u8457\u81f3\u95dc\u91cd\u8981\u7684\u89d2\u8272\uff0c\u8b93\u91ab\u5e2b\u80fd\u5920\u5f9e\u96fb\u5b50\u75c5\u6b77 (EMR) \u4e2d\u64f7\u53d6\u76f8\u95dc\u8cc7\u8a0a\u3002\u5118\u7ba1 BERT\u3001BioBERT \u548c ClinicalBERT \u7b49\u57fa\u65bc\u8f49\u63db\u5668\u7684\u6a21\u578b\u5df2\u5728 CQA \u4e2d\u5c55\u73fe\u51fa\u6700\u5148\u9032\u7684\u6548\u80fd\uff0c\u4f46\u73fe\u6709\u7684\u6a21\u578b\u7f3a\u4e4f\u5206\u985e\u64f7\u53d6\u7b54\u6848\u7684\u80fd\u529b\uff0c\u9019\u5c0d\u65bc\u7d50\u69cb\u5316\u6aa2\u7d22\u3001\u5167\u5bb9\u904e\u6ffe\u548c\u91ab\u7642\u6c7a\u7b56\u652f\u63f4\u81f3\u95dc\u91cd\u8981\u3002\n  \u70ba\u4e86\u89e3\u6c7a\u9019\u500b\u9650\u5236\uff0c\u6211\u5011\u5f15\u9032\u4e86\u4e00\u500b\u591a\u4efb\u52d9\u5b78\u7fd2 (MTL) \u67b6\u69cb\uff0c\u5b83\u540c\u6642\u8a13\u7df4 CQA \u6a21\u578b\u7528\u65bc\u7b54\u6848\u64f7\u53d6\u548c\u91ab\u7642\u5206\u985e\u3002\u9664\u4e86\u9810\u6e2c\u7b54\u6848\u7bc4\u570d\uff0c\u6211\u5011\u7684\u6a21\u578b\u5c07\u56de\u61c9\u5206\u985e\u70ba\u4e94\u500b\u6a19\u6e96\u5316\u91ab\u7642\u985e\u5225\uff1a\u8a3a\u65b7\u3001\u85e5\u7269\u3001\u75c7\u72c0\u3001\u7a0b\u5e8f\u548c\u5be6\u9a57\u5ba4\u5831\u544a\u3002\u9019\u7a2e\u5206\u985e\u80fd\u7522\u751f\u66f4\u7d50\u69cb\u5316\u4e14\u6613\u65bc\u7406\u89e3\u7684\u8f38\u51fa\uff0c\u8b93\u81e8\u5e8a\u554f\u7b54\u6a21\u578b\u5728\u771f\u5be6\u4e16\u754c\u7684\u91ab\u7642\u4fdd\u5065\u74b0\u5883\u4e2d\u66f4\u5be6\u7528\u3002\n  \u6211\u5011\u5728 emrQA \u4e0a\u8a55\u4f30\u6211\u5011\u7684\u505a\u6cd5\uff0cemrQA 
\u662f\u7528\u65bc\u91ab\u7642\u554f\u984c\u89e3\u7b54\u7684\u5927\u898f\u6a21\u8cc7\u6599\u96c6\u3002\u7d50\u679c\u986f\u793a\uff0c\u8207\u6a19\u6e96\u5fae\u8abf\u76f8\u6bd4\uff0cMTL \u5c07 F1 \u5206\u6578\u63d0\u9ad8\u4e86 2.2%\uff0c\u540c\u6642\u5728\u7b54\u6848\u5206\u985e\u4e2d\u9054\u5230 90.7% \u7684\u6e96\u78ba\u5ea6\u3002\u9019\u4e9b\u767c\u73fe\u8868\u660e\uff0cMTL \u4e0d\u50c5\u589e\u5f37\u4e86 CQA \u7684\u6548\u80fd\uff0c\u9084\u5f15\u5165\u4e86\u4e00\u7a2e\u5206\u985e\u548c\u7d50\u69cb\u5316\u91ab\u7642\u8cc7\u8a0a\u6aa2\u7d22\u7684\u6709\u6548\u6a5f\u5236\u3002</paragraph>", "author": "Priyaranjan Pattnayak et.al.", "authors": "Priyaranjan Pattnayak, Hitesh Laxmichand Patel, Amit Agarwal, Bhargava Kumar, Srikant Panda, Tejaswini Kumar", "id": "2502.13108v1", "paper_url": "http://arxiv.org/abs/2502.13108v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/13114/2502.13114v1.json b/database/storage/2502/13114/2502.13114v1.json
new file mode 100644
index 0000000000..6a2c748f28
--- /dev/null
+++ b/database/storage/2502/13114/2502.13114v1.json
@@ -0,0 +1 @@
+{"2502.13114": {"publish_time": "2025-02-18", "title": "The influence of motion features in temporal perception", "paper_summary": "This paper examines the role of manner-of-motion verbs in shaping subjective\ntemporal perception and emotional resonance. Through four complementary\nstudies, we explore how these verbs influence the conceptualization of time,\nexamining their use in literal and metaphorical (temporal) contexts. Our\nfindings reveal that faster verbs (e.g., fly, zoom) evoke dynamic and engaging\ntemporal experiences, often linked to positive emotions and greater agency. In\ncontrast, slower verbs (e.g., crawl, drag) convey passivity, monotony, and\nnegative emotions, reflecting tedious or constrained experiences of time. These\neffects are amplified in metaphorical contexts, where manner verbs encode\nemotional and experiential nuances that transcend their literal meanings. We\nalso find that participants prefer manner verbs over path verbs (e.g., go,\npass) in emotionally charged temporal contexts, as manner verbs capture the\nexperiential and emotional qualities of time more effectively. 
These findings\nhighlight the interplay between language, motion, and emotion in shaping\ntemporal perception, offering insights into how linguistic framing influences\nsubjective experiences of time.", "paper_summary_zh": "\u672c\u6587\u63a2\u8a0e\u52d5\u4f5c\u65b9\u5f0f\u52d5\u8a5e\u5728\u5f62\u5851\u4e3b\u89c0\u6642\u9593\u611f\u77e5\u548c\u60c5\u7dd2\u5171\u9cf4\u4e2d\u6240\u626e\u6f14\u7684\u89d2\u8272\u3002\u900f\u904e\u56db\u9805\u4e92\u88dc\u7684\u7814\u7a76\uff0c\u6211\u5011\u63a2\u8a0e\u9019\u4e9b\u52d5\u8a5e\u5982\u4f55\u5f71\u97ff\u6642\u9593\u7684\u6982\u5ff5\u5316\uff0c\u4e26\u6aa2\u8996\u5b83\u5011\u5728\u5b57\u9762\u548c\u96b1\u55bb\uff08\u6642\u9593\uff09\u8a9e\u5883\u4e2d\u7684\u7528\u6cd5\u3002\u6211\u5011\u7684\u7814\u7a76\u7d50\u679c\u986f\u793a\uff0c\u8f03\u5feb\u7684\u52d5\u8a5e\uff08\u4f8b\u5982\u98db\u3001\u98c6\uff09\u6703\u5f15\u8d77\u52d5\u614b\u4e14\u5f15\u4eba\u5165\u52dd\u7684\u6642\u9593\u9ad4\u9a57\uff0c\u901a\u5e38\u8207\u6b63\u9762\u60c5\u7dd2\u548c\u8f03\u5927\u7684\u81ea\u4e3b\u6027\u6709\u95dc\u3002\u76f8\u53cd\u5730\uff0c\u8f03\u6162\u7684\u52d5\u8a5e\uff08\u4f8b\u5982\u722c\u3001\u62d6\uff09\u50b3\u9054\u4e86\u88ab\u52d5\u3001\u55ae\u8abf\u548c\u8ca0\u9762\u60c5\u7dd2\uff0c\u53cd\u6620\u51fa\u4e4f\u5473\u6216\u53d7\u9650\u7684\u6642\u9593\u9ad4\u9a57\u3002\u9019\u4e9b\u6548\u61c9\u5728\u96b1\u55bb\u8a9e\u5883\u4e2d\u6703\u88ab\u653e\u5927\uff0c\u5176\u4e2d\u52d5\u4f5c\u52d5\u8a5e\u7de8\u78bc\u4e86\u8d85\u8d8a\u5176\u5b57\u9762\u610f\u7fa9\u7684\u60c5\u7dd2\u548c\u9ad4\u9a57\u7d30\u5fae\u5dee\u5225\u3002\u6211\u5011\u9084\u767c\u73fe\uff0c\u5728\u5145\u6eff\u60c5\u7dd2\u7684\u6642\u9593\u8a9e\u5883\u4e2d\uff0c\u53c3\u8207\u8005\u504f\u597d\u52d5\u4f5c\u52d5\u8a5e\u800c\u975e\u8def\u5f91\u52d5\u8a5e\uff08\u4f8b\u5982\u8d70\u3001\u7d93\u904e\uff09\uff0c\u56e0\u70ba\u52d5\u4f5c\u52d5\u8a5e\u66f4\u6709\u6548\u5730\u6355\u6349\u4e86\u6642\u9593\u7684\u9ad4\u9a57\u548c\u60c5\u7dd2\u54c1\u8cea\u3002\u9019\u4e9b\u7814\u7a76\u7d50\u679c
\u7a81\u986f\u4e86\u8a9e\u8a00\u3001\u52d5\u4f5c\u548c\u60c5\u7dd2\u4e4b\u9593\u5728\u5f62\u5851\u6642\u9593\u611f\u77e5\u4e2d\u7684\u4ea4\u4e92\u4f5c\u7528\uff0c\u4e26\u63d0\u4f9b\u4e86\u8a9e\u8a00\u6846\u67b6\u5982\u4f55\u5f71\u97ff\u4e3b\u89c0\u6642\u9593\u9ad4\u9a57\u7684\u898b\u89e3\u3002", "author": "Rosa Illan Castillo et.al.", "authors": "Rosa Illan Castillo, Javier Valenzuela", "id": "2502.13114v1", "paper_url": "http://arxiv.org/abs/2502.13114v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/13115/2502.13115v1.json b/database/storage/2502/13115/2502.13115v1.json
new file mode 100644
index 0000000000..e89d59eff8
--- /dev/null
+++ b/database/storage/2502/13115/2502.13115v1.json
@@ -0,0 +1 @@
+{"2502.13115": {"publish_time": "2025-02-18", "title": "Near-Optimal Private Learning in Linear Contextual Bandits", "paper_summary": "We analyze the problem of private learning in generalized linear contextual\nbandits. Our approach is based on a novel method of re-weighted regression,\nyielding an efficient algorithm with regret of order\n$\\sqrt{T}+\\frac{1}{\\alpha}$ and $\\sqrt{T}/\\alpha$ in the joint and local model\nof $\\alpha$-privacy, respectively. Further, we provide near-optimal private\nprocedures that achieve dimension-independent rates in private linear models\nand linear contextual bandits. In particular, our results imply that joint\nprivacy is almost \"for free\" in all the settings we consider, partially\naddressing the open problem posed by Azize and Basu (2024).", "paper_summary_zh": "\u6211\u5011\u5206\u6790\u5ee3\u7fa9\u7dda\u6027\u60c5\u5883\u5f37\u76dc\u4e2d\u79c1\u4eba\u5b78\u7fd2\u7684\u554f\u984c\u3002\u6211\u5011\u7684\u505a\u6cd5\u57fa\u65bc\u91cd\u65b0\u52a0\u6b0a\u56de\u6b78\u7684\u65b0\u65b9\u6cd5\uff0c\u7522\u751f\u4e00\u7a2e\u6709\u6548\u7387\u7684\u6f14\u7b97\u6cd5\uff0c\u5176\u5f8c\u6094\u503c\u5206\u5225\u70ba\n$\\sqrt{T}+\\frac{1}{\\alpha}$ \u548c $\\sqrt{T}/\\alpha$ \u5728 $\\alpha$-\u96b1\u79c1\u7684\u806f\u5408\u548c\u5c40\u90e8\u6a21\u578b\u4e2d\u3002\u6b64\u5916\uff0c\u6211\u5011\u63d0\u4f9b\u8fd1\u4e4e\u6700\u4f73\u7684\u79c1\u4eba\u7a0b\u5e8f\uff0c\u5728\u79c1\u4eba\u7dda\u6027\u6a21\u578b\u548c\u7dda\u6027\u60c5\u5883\u5f37\u76dc\u4e2d\u5be6\u73fe\u8207\u7dad\u5ea6\u7121\u95dc\u7684\u6bd4\u7387\u3002\u7279\u5225\u662f\uff0c\u6211\u5011\u7684\u7d50\u679c\u8868\u660e\uff0c\u5728\u6211\u5011\u8003\u616e\u7684\u6240\u6709\u8a2d\u5b9a\u4e2d\uff0c\u806f\u5408\u96b1\u79c1\u5e7e\u4e4e\u662f\u300c\u514d\u8cbb\u300d\u7684\uff0c\u90e8\u5206\u89e3\u6c7a\u4e86 Azize \u548c Basu (2024) \u63d0\u51fa\u7684\u958b\u653e\u6027\u554f\u984c\u3002", "author": "Fan Chen et.al.", "authors": "Fan Chen, Jiachun Li, Alexander Rakhlin, David 
Simchi-Levi", "id": "2502.13115v1", "paper_url": "http://arxiv.org/abs/2502.13115v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/13117/2502.13117v1.json b/database/storage/2502/13117/2502.13117v1.json
new file mode 100644
index 0000000000..9dffa927cf
--- /dev/null
+++ b/database/storage/2502/13117/2502.13117v1.json
@@ -0,0 +1 @@
+{"2502.13117": {"publish_time": "2025-02-18", "title": "Performance Evaluation of Large Language Models in Statistical Programming", "paper_summary": "The programming capabilities of large language models (LLMs) have\nrevolutionized automatic code generation and opened new avenues for automatic\nstatistical analysis. However, the validity and quality of these generated\ncodes need to be systematically evaluated before they can be widely adopted.\nDespite their growing prominence, a comprehensive evaluation of statistical\ncode generated by LLMs remains scarce in the literature. In this paper, we\nassess the performance of LLMs, including two versions of ChatGPT and one\nversion of Llama, in the domain of SAS programming for statistical analysis.\nOur study utilizes a set of statistical analysis tasks encompassing diverse\nstatistical topics and datasets. Each task includes a problem description,\ndataset information, and human-verified SAS code. We conduct a comprehensive\nassessment of the quality of SAS code generated by LLMs through human expert\nevaluation based on correctness, effectiveness, readability, executability, and\nthe accuracy of output results. The analysis of rating scores reveals that\nwhile LLMs demonstrate usefulness in generating syntactically correct code,\nthey struggle with tasks requiring deep domain understanding and may produce\nredundant or incorrect results. 
This study offers valuable insights into the\ncapabilities and limitations of LLMs in statistical programming, providing\nguidance for future advancements in AI-assisted coding systems for statistical\nanalysis.", "paper_summary_zh": "\u5927\u578b\u8a9e\u8a00\u6a21\u578b (LLM) \u7684\u7a0b\u5f0f\u8a2d\u8a08\u529f\u80fd\u5fb9\u5e95\u6539\u8b8a\u4e86\u81ea\u52d5\u7a0b\u5f0f\u78bc\u751f\u6210\uff0c\u4e26\u70ba\u81ea\u52d5\u7d71\u8a08\u5206\u6790\u958b\u555f\u4e86\u65b0\u9014\u5f91\u3002\u7136\u800c\uff0c\u5728\u5ee3\u6cdb\u63a1\u7528\u9019\u4e9b\u7522\u751f\u7684\u7a0b\u5f0f\u78bc\u4e4b\u524d\uff0c\u9700\u8981\u7cfb\u7d71\u6027\u5730\u8a55\u4f30\u5176\u6709\u6548\u6027\u548c\u54c1\u8cea\u3002\u5118\u7ba1\u5176\u91cd\u8981\u6027\u65e5\u76ca\u63d0\u5347\uff0c\u4f46\u6587\u737b\u4e2d\u5c0d\u65bc LLM \u7522\u751f\u7684\u7d71\u8a08\u7a0b\u5f0f\u78bc\u7684\u5168\u9762\u8a55\u4f30\u4ecd\u7136\u7a00\u5c11\u3002\u5728\u672c\u6587\u4e2d\uff0c\u6211\u5011\u8a55\u4f30\u4e86 LLM \u7684\u6548\u80fd\uff0c\u5305\u62ec\u5169\u500b\u7248\u672c\u7684 ChatGPT \u548c\u4e00\u500b\u7248\u672c\u7684 Llama\uff0c\u5728\u7d71\u8a08\u5206\u6790\u7684 SAS \u7a0b\u5f0f\u8a2d\u8a08\u9818\u57df\u3002\u6211\u5011\u7684\u7814\u7a76\u5229\u7528\u4e86\u4e00\u7d44\u6db5\u84cb\u5404\u7a2e\u7d71\u8a08\u4e3b\u984c\u548c\u8cc7\u6599\u96c6\u7684\u7d71\u8a08\u5206\u6790\u4efb\u52d9\u3002\u6bcf\u500b\u4efb\u52d9\u90fd\u5305\u542b\u554f\u984c\u8aaa\u660e\u3001\u8cc7\u6599\u96c6\u8cc7\u8a0a\u548c\u7d93\u904e\u4eba\u5de5\u9a57\u8b49\u7684 SAS \u7a0b\u5f0f\u78bc\u3002\u6211\u5011\u900f\u904e\u57fa\u65bc\u6b63\u78ba\u6027\u3001\u6709\u6548\u6027\u3001\u53ef\u8b80\u6027\u3001\u53ef\u57f7\u884c\u6027\u548c\u8f38\u51fa\u7d50\u679c\u7cbe\u78ba\u5ea6\u7684\u5c08\u5bb6\u8a55\u4f30\uff0c\u5c0d LLM \u7522\u751f\u7684 SAS \u7a0b\u5f0f\u78bc\u54c1\u8cea\u9032\u884c\u5168\u9762\u8a55\u4f30\u3002\u8a55\u5206\u7d50\u679c\u7684\u5206\u6790\u986f\u793a\uff0c\u5118\u7ba1 LLM 
\u5728\u7522\u751f\u8a9e\u6cd5\u6b63\u78ba\u7684\u7a0b\u5f0f\u78bc\u65b9\u9762\u8868\u73fe\u51fa\u5176\u6548\u7528\uff0c\u4f46\u5b83\u5011\u5728\u9700\u8981\u6df1\u5165\u9818\u57df\u7406\u89e3\u7684\u4efb\u52d9\u4e2d\u6703\u9047\u5230\u56f0\u96e3\uff0c\u4e26\u4e14\u53ef\u80fd\u6703\u7522\u751f\u5197\u9918\u6216\u4e0d\u6b63\u78ba\u7684\u7d50\u679c\u3002\u672c\u7814\u7a76\u63d0\u4f9b\u4e86 LLM \u5728\u7d71\u8a08\u7a0b\u5f0f\u8a2d\u8a08\u4e2d\u80fd\u529b\u548c\u9650\u5236\u7684\u5bf6\u8cb4\u898b\u89e3\uff0c\u70ba\u7d71\u8a08\u5206\u6790\u7684 AI \u8f14\u52a9\u7de8\u78bc\u7cfb\u7d71\u7684\u672a\u4f86\u9032\u5c55\u63d0\u4f9b\u6307\u5c0e\u3002", "author": "Xinyi Song et.al.", "authors": "Xinyi Song, Kexin Xie, Lina Lee, Ruizhe Chen, Jared M. Clark, Hao He, Haoran He, Jie Min, Xinlei Zhang, Simin Zheng, Zhiyang Zhang, Xinwei Deng, Yili Hong", "id": "2502.13117v1", "paper_url": "http://arxiv.org/abs/2502.13117v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/13119/2502.13119v1.json b/database/storage/2502/13119/2502.13119v1.json
new file mode 100644
index 0000000000..213df39bdc
--- /dev/null
+++ b/database/storage/2502/13119/2502.13119v1.json
@@ -0,0 +1 @@
+{"2502.13119": {"publish_time": "2025-02-18", "title": "STEER-ME: Assessing the Microeconomic Reasoning of Large Language Models", "paper_summary": "How should one judge whether a given large language model (LLM) can reliably\nperform economic reasoning? Most existing LLM benchmarks focus on specific\napplications and fail to present the model with a rich variety of economic\ntasks. A notable exception is Raman et al. [2024], who offer an approach for\ncomprehensively benchmarking strategic decision-making; however, this approach\nfails to address the non-strategic settings prevalent in microeconomics, such\nas supply-and-demand analysis. We address this gap by taxonomizing\nmicroeconomic reasoning into $58$ distinct elements, focusing on the logic of\nsupply and demand, each grounded in up to $10$ distinct domains, $5$\nperspectives, and $3$ types. The generation of benchmark data across this\ncombinatorial space is powered by a novel LLM-assisted data generation protocol\nthat we dub auto-STEER, which generates a set of questions by adapting\nhandwritten templates to target new domains and perspectives. Because it offers\nan automated way of generating fresh questions, auto-STEER mitigates the risk\nthat LLMs will be trained to over-fit evaluation benchmarks; we thus hope that\nit will serve as a useful tool both for evaluating and fine-tuning models for\nyears to come. We demonstrate the usefulness of our benchmark via a case study\non $27$ LLMs, ranging from small open-source models to the current state of the\nart. 
We examined each model's ability to solve microeconomic problems across\nour whole taxonomy and present the results across a range of prompting\nstrategies and scoring metrics.", "paper_summary_zh": "<paragraph>\u5982\u4f55\u5224\u65b7\u4e00\u500b\u7d66\u5b9a\u7684\u5927\u578b\u8a9e\u8a00\u6a21\u578b (LLM) \u80fd\u5426\u53ef\u9760\u5730\u9032\u884c\u7d93\u6fdf\u63a8\u7406\uff1f\u73fe\u6709\u7684 LLM \u57fa\u6e96\u6e2c\u8a66\u5927\u591a\u5c08\u6ce8\u65bc\u7279\u5b9a\u61c9\u7528\uff0c\u672a\u80fd\u70ba\u6a21\u578b\u63d0\u4f9b\u8c50\u5bcc\u591a\u6a23\u7684\u7d93\u6fdf\u4efb\u52d9\u3002\u4e00\u500b\u503c\u5f97\u6ce8\u610f\u7684\u4f8b\u5916\u662f Raman \u7b49\u4eba [2024]\uff0c\u4ed6\u5011\u63d0\u4f9b\u4e86\u4e00\u7a2e\u5168\u9762\u8a55\u4f30\u7b56\u7565\u6c7a\u7b56\u5236\u5b9a\u65b9\u6cd5\uff1b\u7136\u800c\uff0c\u9019\u7a2e\u65b9\u6cd5\u7121\u6cd5\u89e3\u6c7a\u5fae\u89c0\u7d93\u6fdf\u5b78\u4e2d\u666e\u904d\u5b58\u5728\u7684\u975e\u7b56\u7565\u6027\u8a2d\u5b9a\uff0c\u4f8b\u5982\u4f9b\u9700\u5206\u6790\u3002\u6211\u5011\u900f\u904e\u5c07\u5fae\u89c0\u7d93\u6fdf\u63a8\u7406\u5206\u985e\u70ba 58 \u500b\u4e0d\u540c\u7684\u5143\u7d20\u4f86\u89e3\u6c7a\u9019\u500b\u5dee\u8ddd\uff0c\u91cd\u9ede\u653e\u5728\u4f9b\u9700\u908f\u8f2f\u4e0a\uff0c\u6bcf\u500b\u5143\u7d20\u90fd\u57fa\u65bc\u591a\u9054 10 \u500b\u4e0d\u540c\u7684\u9818\u57df\u30015 \u500b\u89c0\u9ede\u548c 3 \u7a2e\u985e\u578b\u3002\u5728\u9019\u500b\u7d44\u5408\u7a7a\u9593\u4e2d\u7522\u751f\u57fa\u6e96\u6578\u64da\u662f\u7531\u4e00\u7a2e\u65b0\u7a4e\u7684 LLM \u8f14\u52a9\u6578\u64da\u751f\u6210\u5354\u8b70\uff08\u6211\u5011\u7a31\u4e4b\u70ba auto-STEER\uff09\u63a8\u52d5\u7684\uff0c\u5b83\u901a\u904e\u8abf\u6574\u624b\u5beb\u6a21\u677f\u4f86\u91dd\u5c0d\u65b0\u7684\u9818\u57df\u548c\u89c0\u9ede\u4f86\u751f\u6210\u4e00\u7d44\u554f\u984c\u3002\u7531\u65bc\u5b83\u63d0\u4f9b\u4e86\u4e00\u7a2e\u751f\u6210\u65b0\u554f\u984c\u7684\u81ea\u52d5\u5316\u65b9\u5f0f\uff0cauto-STEER \u6e1b\u8f15\u4e86 LLM 
\u5c07\u88ab\u8a13\u7df4\u904e\u5ea6\u914d\u5408\u8a55\u4f30\u57fa\u6e96\u6e2c\u8a66\u7684\u98a8\u96aa\uff1b\u56e0\u6b64\uff0c\u6211\u5011\u5e0c\u671b\u5b83\u5c07\u6210\u70ba\u672a\u4f86\u5e7e\u5e74\u8a55\u4f30\u548c\u5fae\u8abf\u6a21\u578b\u7684\u6709\u7528\u5de5\u5177\u3002\u6211\u5011\u901a\u904e\u4e00\u500b\u6848\u4f8b\u7814\u7a76\u5c55\u793a\u4e86\u6211\u5011\u57fa\u6e96\u6e2c\u8a66\u7684\u6548\u7528\uff0c\u8a72\u6848\u4f8b\u7814\u7a76\u6db5\u84cb\u4e86 27 \u500b LLM\uff0c\u5f9e\u5c0f\u578b\u958b\u6e90\u6a21\u578b\u5230\u7576\u524d\u6280\u8853\u72c0\u614b\u3002\u6211\u5011\u6aa2\u67e5\u4e86\u6bcf\u500b\u6a21\u578b\u5728\u6211\u5011\u7684\u6574\u500b\u5206\u985e\u6cd5\u4e2d\u89e3\u6c7a\u5fae\u89c0\u7d93\u6fdf\u554f\u984c\u7684\u80fd\u529b\uff0c\u4e26\u5728\u5404\u7a2e\u63d0\u793a\u7b56\u7565\u548c\u8a55\u5206\u6307\u6a19\u4e2d\u5c55\u793a\u4e86\u7d50\u679c\u3002</paragraph>", "author": "Narun Raman et.al.", "authors": "Narun Raman, Taylor Lundy, Thiago Amin, Jesse Perla, Kevin-Leyton Brown", "id": "2502.13119v1", "paper_url": "http://arxiv.org/abs/2502.13119v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/13120/2502.13120v1.json b/database/storage/2502/13120/2502.13120v1.json
new file mode 100644
index 0000000000..f149c37416
--- /dev/null
+++ b/database/storage/2502/13120/2502.13120v1.json
@@ -0,0 +1 @@
+{"2502.13120": {"publish_time": "2025-02-18", "title": "Adapting Psycholinguistic Research for LLMs: Gender-inclusive Language in a Coreference Context", "paper_summary": "Gender-inclusive language is often used with the aim of ensuring that all\nindividuals, regardless of gender, can be associated with certain concepts.\nWhile psycholinguistic studies have examined its effects in relation to human\ncognition, it remains unclear how Large Language Models (LLMs) process\ngender-inclusive language. Given that commercial LLMs are gaining an\nincreasingly strong foothold in everyday applications, it is crucial to examine\nwhether LLMs in fact interpret gender-inclusive language neutrally, because the\nlanguage they generate has the potential to influence the language of their\nusers. This study examines whether LLM-generated coreferent terms align with a\ngiven gender expression or reflect model biases. Adapting psycholinguistic\nmethods from French to English and German, we find that in English, LLMs\ngenerally maintain the antecedent's gender but exhibit underlying masculine\nbias. 
In German, this bias is much stronger, overriding all tested\ngender-neutralization strategies.", "paper_summary_zh": "\u6027\u5225\u5305\u5bb9\u6027\u8a9e\u8a00\u901a\u5e38\u7528\u65bc\u78ba\u4fdd\u6240\u6709\u500b\u4eba\uff0c\u7121\u8ad6\u6027\u5225\u5982\u4f55\uff0c\u90fd\u80fd\u8207\u67d0\u4e9b\u6982\u5ff5\u806f\u7e6b\u5728\u4e00\u8d77\u3002\u96d6\u7136\u5fc3\u7406\u8a9e\u8a00\u5b78\u7814\u7a76\u5df2\u7d93\u6aa2\u8996\u4e86\u5b83\u5c0d\u4eba\u985e\u8a8d\u77e5\u7684\u5f71\u97ff\uff0c\u4f46\u5927\u578b\u8a9e\u8a00\u6a21\u578b (LLM) \u5982\u4f55\u8655\u7406\u6027\u5225\u5305\u5bb9\u6027\u8a9e\u8a00\u4ecd\u7136\u4e0d\u6e05\u695a\u3002\u9451\u65bc\u5546\u696d LLM \u5728\u65e5\u5e38\u61c9\u7528\u4e2d\u8d8a\u4f86\u8d8a\u7ad9\u7a69\u8173\u6b65\uff0c\u56e0\u6b64\u81f3\u95dc\u91cd\u8981\u7684\u662f\u8981\u6aa2\u67e5 LLM \u662f\u5426\u5be6\u969b\u4e0a\u4e2d\u7acb\u5730\u89e3\u91cb\u6027\u5225\u5305\u5bb9\u6027\u8a9e\u8a00\uff0c\u56e0\u70ba\u5b83\u5011\u7522\u751f\u7684\u8a9e\u8a00\u6709\u53ef\u80fd\u5f71\u97ff\u5176\u4f7f\u7528\u8005\u7684\u8a9e\u8a00\u3002\u672c\u7814\u7a76\u63a2\u8a0e\u4e86 LLM \u751f\u6210\u7684\u5171\u6307\u8853\u8a9e\u662f\u5426\u8207\u7d66\u5b9a\u7684\u6027\u5225\u8868\u9054\u4e00\u81f4\u6216\u53cd\u6620\u6a21\u578b\u504f\u898b\u3002\u6211\u5011\u63a1\u7528\u6cd5\u8a9e\u5230\u82f1\u8a9e\u548c\u5fb7\u8a9e\u7684\u5fc3\u7406\u8a9e\u8a00\u5b78\u65b9\u6cd5\uff0c\u767c\u73fe\u82f1\u8a9e\u4e2d\uff0cLLM \u901a\u5e38\u6703\u4fdd\u6301\u5148\u884c\u8a5e\u7684\u6027\u5225\uff0c\u4f46\u8868\u73fe\u51fa\u6f5b\u5728\u7684\u7537\u6027\u504f\u898b\u3002\u5728\u5fb7\u8a9e\u4e2d\uff0c\u9019\u7a2e\u504f\u898b\u5f37\u5f97\u591a\uff0c\u51cc\u99d5\u65bc\u6240\u6709\u7d93\u904e\u6e2c\u8a66\u7684\u6027\u5225\u4e2d\u7acb\u5316\u7b56\u7565\u3002", "author": "Marion Bartl et.al.", "authors": "Marion Bartl, Thomas Brendan Murphy, Susan Leavy", "id": "2502.13120v1", "paper_url": "http://arxiv.org/abs/2502.13120v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/13124/2502.13124v1.json b/database/storage/2502/13124/2502.13124v1.json
new file mode 100644
index 0000000000..91c4bd5d17
--- /dev/null
+++ b/database/storage/2502/13124/2502.13124v1.json
@@ -0,0 +1 @@
+{"2502.13124": {"publish_time": "2025-02-18", "title": "NaturalReasoning: Reasoning in the Wild with 2.8M Challenging Questions", "paper_summary": "Scaling reasoning capabilities beyond traditional domains such as math and\ncoding is hindered by the lack of diverse and high-quality questions. To\novercome this limitation, we introduce a scalable approach for generating\ndiverse and challenging reasoning questions, accompanied by reference answers.\nWe present NaturalReasoning, a comprehensive dataset comprising 2.8 million\nquestions that span multiple domains, including STEM fields (e.g., Physics,\nComputer Science), Economics, Social Sciences, and more. We demonstrate the\nutility of the questions in NaturalReasoning through knowledge distillation\nexperiments which show that NaturalReasoning can effectively elicit and\ntransfer reasoning capabilities from a strong teacher model. Furthermore, we\ndemonstrate that NaturalReasoning is also effective for unsupervised\nself-training using external reward models or self-rewarding.", "paper_summary_zh": "\u900f\u904e\u8d85\u8d8a\u50b3\u7d71\u9818\u57df\uff08\u4f8b\u5982\u6578\u5b78\u548c\u7de8\u78bc\uff09\u4f86\u64f4\u5145\u63a8\u7406\u80fd\u529b\uff0c\u53d7\u5230\u7f3a\u4e4f\u591a\u5143\u4e14\u9ad8\u54c1\u8cea\u554f\u984c\u7684\u963b\u7919\u3002\u70ba\u4e86\u514b\u670d\u9019\u500b\u9650\u5236\uff0c\u6211\u5011\u5f15\u5165\u4e00\u500b\u53ef\u64f4\u5145\u7684\u65b9\u6cd5\uff0c\u7528\u65bc\u7522\u751f\u591a\u5143\u4e14\u5177\u6311\u6230\u6027\u7684\u63a8\u7406\u554f\u984c\uff0c\u4e26\u9644\u4e0a\u53c3\u8003\u7b54\u6848\u3002\u6211\u5011\u63d0\u51fa NaturalReasoning\uff0c\u9019\u662f\u4e00\u500b\u5305\u542b 280 \u842c\u500b\u554f\u984c\u7684\u7d9c\u5408\u8cc7\u6599\u96c6\uff0c\u6db5\u84cb\u591a\u500b\u9818\u57df\uff0c\u5305\u62ec STEM 
\u9818\u57df\uff08\u4f8b\u5982\u7269\u7406\u3001\u96fb\u8166\u79d1\u5b78\uff09\u3001\u7d93\u6fdf\u5b78\u3001\u793e\u6703\u79d1\u5b78\u7b49\u7b49\u3002\u6211\u5011\u900f\u904e\u77e5\u8b58\u84b8\u993e\u5be6\u9a57\uff0c\u5c55\u793a NaturalReasoning \u4e2d\u554f\u984c\u7684\u5be6\u7528\u6027\uff0c\u9019\u4e9b\u5be6\u9a57\u986f\u793a NaturalReasoning \u80fd\u6709\u6548\u5730\u5f15\u767c\u548c\u8f49\u79fb\u5f37\u5927\u6559\u5e2b\u6a21\u578b\u7684\u63a8\u7406\u80fd\u529b\u3002\u6b64\u5916\uff0c\u6211\u5011\u5c55\u793a NaturalReasoning \u4e5f\u9069\u7528\u65bc\u4f7f\u7528\u5916\u90e8\u734e\u52f5\u6a21\u578b\u6216\u81ea\u6211\u734e\u52f5\u7684\u7121\u76e3\u7763\u81ea\u6211\u8a13\u7df4\u3002", "author": "Weizhe Yuan et.al.", "authors": "Weizhe Yuan, Jane Yu, Song Jiang, Karthik Padthe, Yang Li, Dong Wang, Ilia Kulikov, Kyunghyun Cho, Yuandong Tian, Jason E Weston, Xian Li", "id": "2502.13124v1", "paper_url": "http://arxiv.org/abs/2502.13124v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/13125/2502.13125v1.json b/database/storage/2502/13125/2502.13125v1.json
new file mode 100644
index 0000000000..52a8f2fa12
--- /dev/null
+++ b/database/storage/2502/13125/2502.13125v1.json
@@ -0,0 +1 @@
+{"2502.13125": {"publish_time": "2025-02-18", "title": "RuozhiBench: Evaluating LLMs with Logical Fallacies and Misleading Premises", "paper_summary": "Recent advances in large language models (LLMs) have shown that they can\nanswer questions requiring complex reasoning. However, their ability to\nidentify and respond to text containing logical fallacies or deliberately\nmisleading premises remains less studied. To address this gap, we introduce\nRuozhiBench, a bilingual dataset comprising 677 carefully curated questions\nthat contain various forms of deceptive reasoning, meticulously crafted through\nextensive human effort and expert review. In a comprehensive evaluation of 17\nLLMs from 5 Series over RuozhiBench using both open-ended and two-choice\nformats, we conduct extensive analyses on evaluation protocols and result\npatterns. Despite their high scores on conventional benchmarks, these models\nshowed limited ability to detect and reason correctly about logical fallacies,\nwith even the best-performing model, Claude-3-haiku, achieving only 62%\naccuracy compared to the human of more than 90%.", "paper_summary_zh": "\u5927\u578b\u8a9e\u8a00\u6a21\u578b (LLM) \u7684\u6700\u65b0\u9032\u5c55\u986f\u793a\uff0c\u5b83\u5011\u53ef\u4ee5\u56de\u7b54\u9700\u8981\u8907\u96dc\u63a8\u7406\u7684\u554f\u984c\u3002\u7136\u800c\uff0c\u5b83\u5011\u8b58\u5225\u548c\u56de\u61c9\u5305\u542b\u908f\u8f2f\u8b2c\u8aa4\u6216\u6545\u610f\u8aa4\u5c0e\u524d\u63d0\u7684\u6587\u672c\u7684\u80fd\u529b\u4ecd\u672a\u5f97\u5230\u5145\u5206\u7814\u7a76\u3002\u70ba\u4e86\u89e3\u6c7a\u9019\u500b\u5dee\u8ddd\uff0c\u6211\u5011\u5f15\u5165\u4e86 RuozhiBench\uff0c\u9019\u662f\u4e00\u500b\u96d9\u8a9e\u8cc7\u6599\u96c6\uff0c\u5305\u542b 677 
\u500b\u7d93\u904e\u4ed4\u7d30\u7b56\u5283\u7684\u554f\u984c\uff0c\u5176\u4e2d\u5305\u542b\u5404\u7a2e\u5f62\u5f0f\u7684\u6b3a\u9a19\u6027\u63a8\u7406\uff0c\u4e26\u900f\u904e\u5ee3\u6cdb\u7684\u4eba\u529b\u6295\u5165\u548c\u5c08\u5bb6\u5be9\u67e5\u7cbe\u5fc3\u88fd\u4f5c\u3002\u5728\u4f7f\u7528\u958b\u653e\u5f0f\u548c\u4e8c\u9078\u4e00\u683c\u5f0f\u5c0d\u4f86\u81ea 5 \u500b\u7cfb\u5217\u7684 17 \u500b LLM \u9032\u884c RuozhiBench \u7684\u5168\u9762\u8a55\u4f30\u4e2d\uff0c\u6211\u5011\u5c0d\u8a55\u4f30\u5354\u5b9a\u548c\u7d50\u679c\u6a21\u5f0f\u9032\u884c\u4e86\u5ee3\u6cdb\u7684\u5206\u6790\u3002\u5118\u7ba1\u5b83\u5011\u5728\u50b3\u7d71\u57fa\u6e96\u6e2c\u8a66\u4e2d\u7372\u5f97\u4e86\u9ad8\u5206\uff0c\u4f46\u9019\u4e9b\u6a21\u578b\u5728\u6aa2\u6e2c\u548c\u6b63\u78ba\u63a8\u7406\u908f\u8f2f\u8b2c\u8aa4\u65b9\u9762\u8868\u73fe\u51fa\u7684\u80fd\u529b\u6709\u9650\uff0c\u5373\u4f7f\u662f\u6548\u80fd\u6700\u597d\u7684\u6a21\u578b Claude-3-haiku\uff0c\u8207\u4eba\u985e\u7684 90% \u4ee5\u4e0a\u76f8\u6bd4\uff0c\u4e5f\u53ea\u9054\u5230\u4e86 62% \u7684\u6e96\u78ba\u5ea6\u3002", "author": "Zenan Zhai et.al.", "authors": "Zenan Zhai, Hao Li, Xudong Han, Zhenxuan Zhang, Yixuan Zhang, Timothy Baldwin, Haonan Li", "id": "2502.13125v1", "paper_url": "http://arxiv.org/abs/2502.13125v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/13127/2502.13127v1.json b/database/storage/2502/13127/2502.13127v1.json
new file mode 100644
index 0000000000..b5d8c023fc
--- /dev/null
+++ b/database/storage/2502/13127/2502.13127v1.json
@@ -0,0 +1 @@
+{"2502.13127": {"publish_time": "2025-02-18", "title": "Facilitating Long Context Understanding via Supervised Chain-of-Thought Reasoning", "paper_summary": "Recent advances in Large Language Models (LLMs) have enabled them to process\nincreasingly longer sequences, ranging from 2K to 2M tokens and even beyond.\nHowever, simply extending the input sequence length does not necessarily lead\nto effective long-context understanding. In this study, we integrate\nChain-of-Thought (CoT) reasoning into LLMs in a supervised manner to facilitate\neffective long-context understanding. To achieve this, we introduce\nLongFinanceQA, a synthetic dataset in the financial domain designed to improve\nlong-context reasoning. Unlike existing long-context synthetic data,\nLongFinanceQA includes intermediate CoT reasoning before the final conclusion,\nwhich encourages LLMs to perform explicit reasoning, improving accuracy and\ninterpretability in long-context understanding. To generate synthetic CoT\nreasoning, we propose Property-driven Agentic Inference (PAI), an agentic\nframework that simulates human-like reasoning steps, including property\nextraction, retrieval, and summarization. We evaluate PAI's reasoning\ncapabilities by assessing GPT-4o-mini w/ PAI on the Loong benchmark,\noutperforming standard GPT-4o-mini by 20.0%. 
Furthermore, we fine-tune\nLLaMA-3.1-8B-Instruct on LongFinanceQA, achieving a 24.6% gain on Loong's\nfinancial subset.", "paper_summary_zh": "\u5927\u578b\u8a9e\u8a00\u6a21\u578b (LLM) \u7684\u6700\u65b0\u9032\u5c55\u8b93\u5b83\u5011\u80fd\u5920\u8655\u7406\u8d8a\u4f86\u8d8a\u9577\u7684\u5e8f\u5217\uff0c\u7bc4\u570d\u5f9e 2K \u5230 2M \u500b\u7b26\u865f\uff0c\u751a\u81f3\u66f4\u9577\u3002\n\u7136\u800c\uff0c\u50c5\u50c5\u5ef6\u9577\u8f38\u5165\u5e8f\u5217\u9577\u5ea6\u4e26\u4e0d\u6703\u5fc5\u7136\u5c0e\u81f4\u6709\u6548\u7684\u9577\u8a9e\u5883\u7406\u89e3\u3002\u5728\u672c\u7814\u7a76\u4e2d\uff0c\u6211\u5011\u4ee5\u76e3\u7763\u7684\u65b9\u5f0f\u5c07\u601d\u8003\u93c8 (CoT) \u63a8\u7406\u6574\u5408\u5230 LLM \u4e2d\uff0c\u4ee5\u4fc3\u9032\u6709\u6548\u7684\u9577\u8a9e\u5883\u7406\u89e3\u3002\u70ba\u6b64\uff0c\u6211\u5011\u5f15\u5165\u4e86 LongFinanceQA\uff0c\u9019\u662f\u4e00\u500b\u5728\u91d1\u878d\u9818\u57df\u4e2d\u7684\u5408\u6210\u6578\u64da\u96c6\uff0c\u65e8\u5728\u6539\u9032\u9577\u8a9e\u5883\u63a8\u7406\u3002\u8207\u73fe\u6709\u7684\u9577\u8a9e\u5883\u5408\u6210\u6578\u64da\u4e0d\u540c\uff0cLongFinanceQA \u5728\u6700\u7d42\u7d50\u8ad6\u4e4b\u524d\u5305\u542b\u4e86\u4e2d\u9593\u7684 CoT \u63a8\u7406\uff0c\u9019\u9f13\u52f5 LLM \u57f7\u884c\u660e\u78ba\u7684\u63a8\u7406\uff0c\u5f9e\u800c\u63d0\u9ad8\u9577\u8a9e\u5883\u7406\u89e3\u7684\u6e96\u78ba\u6027\u548c\u53ef\u89e3\u91cb\u6027\u3002\u70ba\u4e86\u751f\u6210\u5408\u6210\u7684 CoT \u63a8\u7406\uff0c\u6211\u5011\u63d0\u51fa\u4e86\u57fa\u65bc\u5c6c\u6027\u7684\u4e3b\u9ad4\u63a8\u7406 (PAI)\uff0c\u9019\u662f\u4e00\u500b\u6a21\u64ec\u985e\u4eba\u63a8\u7406\u6b65\u9a5f\u7684\u4e3b\u9ad4\u6846\u67b6\uff0c\u5305\u62ec\u5c6c\u6027\u63d0\u53d6\u3001\u6aa2\u7d22\u548c\u7e3d\u7d50\u3002\u6211\u5011\u901a\u904e\u8a55\u4f30\u642d\u8f09 PAI \u7684 GPT-4o-mini \u5728 Loong \u57fa\u6e96\u4e0a\u7684\u63a8\u7406\u80fd\u529b\uff0c\u4f7f\u5176\u6bd4\u6a19\u6e96\u7684 GPT-4o-mini \u9ad8\u51fa 20.0%\uff0c\u4f86\u8a55\u4f30 PAI 
\u7684\u63a8\u7406\u80fd\u529b\u3002\u6b64\u5916\uff0c\u6211\u5011\u5c0d LLaMA-3.1-8B-Instruct \u9032\u884c\u4e86\u5fae\u8abf\uff0c\u5728 Loong \u7684\u91d1\u878d\u5b50\u96c6\u4e2d\u5be6\u73fe\u4e86 24.6% \u7684\u589e\u76ca\u3002", "author": "Jingyang Lin et.al.", "authors": "Jingyang Lin, Andy Wong, Tian Xia, Shenghua He, Hui Wei, Mei Han, Jiebo Luo", "id": "2502.13127v1", "paper_url": "http://arxiv.org/abs/2502.13127v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/13128/2502.13128v1.json b/database/storage/2502/13128/2502.13128v1.json
new file mode 100644
index 0000000000..b663f6c6f8
--- /dev/null
+++ b/database/storage/2502/13128/2502.13128v1.json
@@ -0,0 +1 @@
+{"2502.13128": {"publish_time": "2025-02-18", "title": "SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation", "paper_summary": "Text-to-song generation, the task of creating vocals and accompaniment from\ntextual inputs, poses significant challenges due to domain complexity and data\nscarcity. Existing approaches often employ multi-stage generation procedures,\nresulting in cumbersome training and inference pipelines. In this paper, we\npropose SongGen, a fully open-source, single-stage auto-regressive transformer\ndesigned for controllable song generation. The proposed model facilitates\nfine-grained control over diverse musical attributes, including lyrics and\ntextual descriptions of instrumentation, genre, mood, and timbre, while also\noffering an optional three-second reference clip for voice cloning. Within a\nunified auto-regressive framework, SongGen supports two output modes: mixed\nmode, which generates a mixture of vocals and accompaniment directly, and\ndual-track mode, which synthesizes them separately for greater flexibility in\ndownstream applications. We explore diverse token pattern strategies for each\nmode, leading to notable improvements and valuable insights. Furthermore, we\ndesign an automated data preprocessing pipeline with effective quality control.\nTo foster community engagement and future research, we will release our model\nweights, training code, annotated data, and preprocessing pipeline. 
The\ngenerated samples are showcased on our project page at\nhttps://liuzh-19.github.io/SongGen/ , and the code will be available at\nhttps://github.com/LiuZH-19/SongGen .", "paper_summary_zh": "\u6587\u5b57\u8f49\u6b4c\u66f2\u751f\u6210\uff0c\u5f9e\u6587\u5b57\u8f38\u5165\u5efa\u7acb\u4eba\u8072\u548c\u4f34\u594f\u7684\u4efb\u52d9\uff0c\u7531\u65bc\u9818\u57df\u8907\u96dc\u6027\u548c\u8cc7\u6599\u7a00\u5c11\u6027\uff0c\u56e0\u6b64\u69cb\u6210\u91cd\u5927\u6311\u6230\u3002\u73fe\u6709\u65b9\u6cd5\u901a\u5e38\u63a1\u7528\u591a\u968e\u6bb5\u751f\u6210\u7a0b\u5e8f\uff0c\u5c0e\u81f4\u8a13\u7df4\u548c\u63a8\u8ad6\u7ba1\u9053\u7e41\u7463\u3002\u5728\u672c\u6587\u4e2d\uff0c\u6211\u5011\u63d0\u51fa SongGen\uff0c\u4e00\u500b\u5b8c\u5168\u958b\u6e90\u7684\u55ae\u968e\u6bb5\u81ea\u8ff4\u6b78\u8f49\u63db\u5668\uff0c\u5c08\u70ba\u53ef\u63a7\u6b4c\u66f2\u751f\u6210\u800c\u8a2d\u8a08\u3002\u6240\u63d0\u51fa\u7684\u6a21\u578b\u4fc3\u9032\u5c0d\u5404\u7a2e\u97f3\u6a02\u5c6c\u6027\u7684\u7d30\u7c92\u5ea6\u63a7\u5236\uff0c\u5305\u62ec\u6b4c\u8a5e\u548c\u6a02\u5668\u3001\u985e\u578b\u3001\u60c5\u7dd2\u548c\u97f3\u8272\u7684\u6587\u5b57\u63cf\u8ff0\uff0c\u540c\u6642\u9084\u63d0\u4f9b\u53ef\u9078\u7684\u4e09\u79d2\u53c3\u8003\u7247\u6bb5\u4ee5\u9032\u884c\u8a9e\u97f3\u8907\u88fd\u3002\u5728\u7d71\u4e00\u7684\u81ea\u8ff4\u6b78\u6846\u67b6\u5167\uff0cSongGen 
\u652f\u63f4\u5169\u7a2e\u8f38\u51fa\u6a21\u5f0f\uff1a\u6df7\u5408\u6a21\u5f0f\uff0c\u76f4\u63a5\u751f\u6210\u4eba\u8072\u548c\u4f34\u594f\u7684\u6df7\u5408\uff0c\u4ee5\u53ca\u96d9\u8ecc\u6a21\u5f0f\uff0c\u5c07\u5b83\u5011\u5206\u958b\u5408\u6210\u4ee5\u63d0\u9ad8\u4e0b\u6e38\u61c9\u7528\u7a0b\u5f0f\u7684\u9748\u6d3b\u6027\u3002\u6211\u5011\u63a2\u7d22\u6bcf\u7a2e\u6a21\u5f0f\u7684\u4e0d\u540c\u4ee3\u5e63\u6a21\u5f0f\u7b56\u7565\uff0c\u5f9e\u800c\u5e36\u4f86\u986f\u8457\u7684\u6539\u9032\u548c\u6709\u50f9\u503c\u7684\u898b\u89e3\u3002\u6b64\u5916\uff0c\u6211\u5011\u8a2d\u8a08\u4e86\u4e00\u500b\u81ea\u52d5\u5316\u8cc7\u6599\u9810\u8655\u7406\u7ba1\u9053\uff0c\u5177\u5099\u6709\u6548\u7684\u54c1\u8cea\u63a7\u5236\u3002\u70ba\u4e86\u4fc3\u9032\u793e\u5340\u53c3\u8207\u548c\u672a\u4f86\u7684\u7814\u7a76\uff0c\u6211\u5011\u5c07\u91cb\u51fa\u6211\u5011\u7684\u6a21\u578b\u6b0a\u91cd\u3001\u8a13\u7df4\u7a0b\u5f0f\u78bc\u3001\u8a3b\u89e3\u8cc7\u6599\u548c\u9810\u8655\u7406\u7ba1\u9053\u3002\u751f\u6210\u7684\u7bc4\u4f8b\u5c55\u793a\u5728\u6211\u5011\u7684\u5c08\u6848\u9801\u9762 https://liuzh-19.github.io/SongGen/\uff0c\u7a0b\u5f0f\u78bc\u5c07\u5728 https://github.com/LiuZH-19/SongGen \u4e2d\u63d0\u4f9b\u3002", "author": "Zihan Liu et.al.", "authors": "Zihan Liu, Shuangrui Ding, Zhixiong Zhang, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Dahua Lin, Jiaqi Wang", "id": "2502.13128v1", "paper_url": "http://arxiv.org/abs/2502.13128v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/13130/2502.13130v1.json b/database/storage/2502/13130/2502.13130v1.json
new file mode 100644
index 0000000000..acf1150ecd
--- /dev/null
+++ b/database/storage/2502/13130/2502.13130v1.json
@@ -0,0 +1 @@
+{"2502.13130": {"publish_time": "2025-02-18", "title": "Magma: A Foundation Model for Multimodal AI Agents", "paper_summary": "We present Magma, a foundation model that serves multimodal AI agentic tasks\nin both the digital and physical worlds. Magma is a significant extension of\nvision-language (VL) models in that it not only retains the VL understanding\nability (verbal intelligence) of the latter, but is also equipped with the\nability to plan and act in the visual-spatial world (spatial-temporal\nintelligence) and complete agentic tasks ranging from UI navigation to robot\nmanipulation. To endow the agentic capabilities, Magma is pretrained on large\namounts of heterogeneous datasets spanning from images, videos to robotics\ndata, where the actionable visual objects (e.g., clickable buttons in GUI) in\nimages are labeled by Set-of-Mark (SoM) for action grounding, and the object\nmovements (e.g., the trace of human hands or robotic arms) in videos are\nlabeled by Trace-of-Mark (ToM) for action planning. Extensive experiments show\nthat SoM and ToM reach great synergy and facilitate the acquisition of\nspatial-temporal intelligence for our Magma model, which is fundamental to a\nwide range of tasks as shown in Fig.1. In particular, Magma creates new\nstate-of-the-art results on UI navigation and robotic manipulation tasks,\noutperforming previous models that are specifically tailored to these tasks. On\nimage and video-related multimodal tasks, Magma also compares favorably to\npopular large multimodal models that are trained on much larger datasets. 
We\nmake our model and code public for reproducibility at\nhttps://microsoft.github.io/Magma.", "paper_summary_zh": "<paragraph>\u6211\u5011\u63d0\u51fa Magma\uff0c\u9019\u662f\u4e00\u500b\u57fa\u790e\u6a21\u578b\uff0c\u7528\u65bc\u670d\u52d9\u6578\u4f4d\u548c\u7269\u7406\u4e16\u754c\u4e2d\u7684\u591a\u6a21\u614b AI \u4ee3\u7406\u4efb\u52d9\u3002Magma \u662f\u8996\u89ba\u8a9e\u8a00 (VL) \u6a21\u578b\u7684\u91cd\u5927\u5ef6\u4f38\uff0c\u5b83\u4e0d\u50c5\u4fdd\u7559\u4e86\u5f8c\u8005\u7684 VL \u7406\u89e3\u80fd\u529b\uff08\u8a9e\u8a00\u667a\u80fd\uff09\uff0c\u9084\u5177\u5099\u5728\u8996\u89ba\u7a7a\u9593\u4e16\u754c\u4e2d\u898f\u5283\u548c\u884c\u52d5\u7684\u80fd\u529b\uff08\u6642\u7a7a\u667a\u80fd\uff09\uff0c\u4e26\u5b8c\u6210\u5f9e UI \u5c0e\u822a\u5230\u6a5f\u5668\u4eba\u64cd\u4f5c\u7684\u4ee3\u7406\u4efb\u52d9\u3002\u70ba\u4e86\u8ce6\u4e88\u4ee3\u7406\u80fd\u529b\uff0cMagma \u5728\u5f9e\u5f71\u50cf\u3001\u5f71\u7247\u5230\u6a5f\u5668\u4eba\u8cc7\u6599\u7684\u5927\u91cf\u7570\u8cea\u8cc7\u6599\u96c6\u4e0a\u9032\u884c\u9810\u8a13\u7df4\uff0c\u5176\u4e2d\u5f71\u50cf\u4e2d\u7684\u53ef\u64cd\u4f5c\u8996\u89ba\u7269\u4ef6\uff08\u4f8b\u5982 GUI \u4e2d\u7684\u53ef\u9ede\u64ca\u6309\u9215\uff09\u7531\u52d5\u4f5c\u63a5\u5730 Set-of-Mark (SoM) \u6a19\u8a18\uff0c\u5f71\u7247\u4e2d\u7684\u7269\u4ef6\u52d5\u4f5c\uff08\u4f8b\u5982\u4eba\u624b\u6216\u6a5f\u5668\u624b\u81c2\u7684\u8ecc\u8de1\uff09\u7531\u52d5\u4f5c\u898f\u5283 Trace-of-Mark (ToM) \u6a19\u8a18\u3002\u5ee3\u6cdb\u7684\u5be6\u9a57\u8868\u660e\uff0cSoM \u548c ToM \u9054\u5230\u4e86\u6975\u5927\u7684\u5354\u540c\u4f5c\u7528\uff0c\u4e26\u4fc3\u9032\u4e86\u6211\u5011 Magma \u6a21\u578b\u7684\u6642\u7a7a\u667a\u80fd\u7684\u7372\u53d6\uff0c\u9019\u5c0d\u65bc\u5716 1 \u4e2d\u6240\u793a\u7684\u5404\u7a2e\u4efb\u52d9\u81f3\u95dc\u91cd\u8981\u3002\u7279\u5225\u662f\uff0cMagma \u5728 UI 
\u5c0e\u822a\u548c\u6a5f\u5668\u4eba\u64cd\u4f5c\u4efb\u52d9\u4e0a\u5275\u9020\u4e86\u65b0\u7684\u6700\u5148\u9032\u7684\u7d50\u679c\uff0c\u512a\u65bc\u5c08\u9580\u91dd\u5c0d\u9019\u4e9b\u4efb\u52d9\u7684\u5148\u524d\u6a21\u578b\u3002\u5728\u5f71\u50cf\u548c\u5f71\u7247\u76f8\u95dc\u7684\u591a\u6a21\u614b\u4efb\u52d9\u4e0a\uff0cMagma \u4e5f\u8207\u5728\u66f4\u5927\u8cc7\u6599\u96c6\u4e0a\u8a13\u7df4\u7684\u6d41\u884c\u5927\u578b\u591a\u6a21\u614b\u6a21\u578b\u76f8\u6bd4\uff0c\u8868\u73fe\u5f97\u5f88\u597d\u3002\u6211\u5011\u516c\u958b\u6211\u5011\u7684\u6a21\u578b\u548c\u7a0b\u5f0f\u78bc\uff0c\u4ee5\u4fbf\u5728 https://microsoft.github.io/Magma \u4e0a\u91cd\u73fe\u3002</paragraph>", "author": "Jianwei Yang et.al.", "authors": "Jianwei Yang, Reuben Tan, Qianhui Wu, Ruijie Zheng, Baolin Peng, Yongyuan Liang, Yu Gu, Mu Cai, Seonghyeon Ye, Joel Jang, Yuquan Deng, Lars Liden, Jianfeng Gao", "id": "2502.13130v1", "paper_url": "http://arxiv.org/abs/2502.13130v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/13131/2502.13131v1.json b/database/storage/2502/13131/2502.13131v1.json
new file mode 100644
index 0000000000..401caf836e
--- /dev/null
+++ b/database/storage/2502/13131/2502.13131v1.json
@@ -0,0 +1 @@
+{"2502.13131": {"publish_time": "2025-02-18", "title": "Rethinking Diverse Human Preference Learning through Principal Component Analysis", "paper_summary": "Understanding human preferences is crucial for improving foundation models\nand building personalized AI systems. However, preferences are inherently\ndiverse and complex, making it difficult for traditional reward models to\ncapture their full range. While fine-grained preference data can help,\ncollecting it is expensive and hard to scale. In this paper, we introduce\nDecomposed Reward Models (DRMs), a novel approach that extracts diverse human\npreferences from binary comparisons without requiring fine-grained annotations.\nOur key insight is to represent human preferences as vectors and analyze them\nusing Principal Component Analysis (PCA). By constructing a dataset of\nembedding differences between preferred and rejected responses, DRMs identify\northogonal basis vectors that capture distinct aspects of preference. These\ndecomposed rewards can be flexibly combined to align with different user needs,\noffering an interpretable and scalable alternative to traditional reward\nmodels. We demonstrate that DRMs effectively extract meaningful preference\ndimensions (e.g., helpfulness, safety, humor) and adapt to new users without\nadditional training. 
Our results highlight DRMs as a powerful framework for\npersonalized and interpretable LLM alignment.", "paper_summary_zh": "\u7406\u89e3\u4eba\u985e\u504f\u597d\u5c0d\u65bc\u6539\u9032\u57fa\u790e\u6a21\u578b\u548c\u5efa\u69cb\u500b\u4eba\u5316 AI \u7cfb\u7d71\u81f3\u95dc\u91cd\u8981\u3002\u7136\u800c\uff0c\u504f\u597d\u672c\u8cea\u4e0a\u662f\u591a\u6a23\u4e14\u8907\u96dc\u7684\uff0c\u9019\u4f7f\u5f97\u50b3\u7d71\u7684\u734e\u52f5\u6a21\u578b\u96e3\u4ee5\u6355\u6349\u5176\u5168\u90e8\u7bc4\u570d\u3002\u96d6\u7136\u7d30\u7dfb\u7684\u504f\u597d\u6578\u64da\u53ef\u80fd\u6709\u6240\u5e6b\u52a9\uff0c\u4f46\u6536\u96c6\u9019\u4e9b\u6578\u64da\u65e2\u6602\u8cb4\u53c8\u96e3\u4ee5\u64f4\u5c55\u3002\u5728\u672c\u6587\u4e2d\uff0c\u6211\u5011\u4ecb\u7d39\u4e86\u89e3\u69cb\u734e\u52f5\u6a21\u578b (DRM)\uff0c\u9019\u662f\u4e00\u7a2e\u65b0\u7a4e\u7684\u65b9\u6cd5\uff0c\u5b83\u53ef\u4ee5\u5f9e\u4e8c\u5143\u6bd4\u8f03\u4e2d\u63d0\u53d6\u591a\u6a23\u5316\u7684\u4eba\u985e\u504f\u597d\uff0c\u800c\u4e0d\u9700\u8981\u7d30\u7dfb\u7684\u8a3b\u89e3\u3002\u6211\u5011\u7684\u95dc\u9375\u898b\u89e3\u662f\u5c07\u4eba\u985e\u504f\u597d\u8868\u793a\u70ba\u5411\u91cf\uff0c\u4e26\u4f7f\u7528\u4e3b\u6210\u5206\u5206\u6790 (PCA) \u5c0d\u5176\u9032\u884c\u5206\u6790\u3002\u900f\u904e\u5efa\u69cb\u504f\u597d\u548c\u62d2\u7d55\u56de\u61c9\u4e4b\u9593\u5d4c\u5165\u5dee\u7570\u7684\u6578\u64da\u96c6\uff0cDRM \u8b58\u5225\u51fa\u6b63\u4ea4\u57fa\u5411\u91cf\uff0c\u9019\u4e9b\u5411\u91cf\u6355\u6349\u504f\u597d\u7684\u4e0d\u540c\u9762\u5411\u3002\u9019\u4e9b\u89e3\u69cb\u7684\u734e\u52f5\u53ef\u4ee5\u9748\u6d3b\u5730\u7d50\u5408\u5728\u4e00\u8d77\uff0c\u4ee5\u7b26\u5408\u4e0d\u540c\u7684\u4f7f\u7528\u8005\u9700\u6c42\uff0c\u63d0\u4f9b\u4e00\u7a2e\u53ef\u89e3\u91cb\u4e14\u53ef\u64f4\u5c55\u7684\u50b3\u7d71\u734e\u52f5\u6a21\u578b\u66ff\u4ee3\u65b9\u6848\u3002\u6211\u5011\u8b49\u660e\u4e86 DRM 
\u53ef\u4ee5\u6709\u6548\u5730\u63d0\u53d6\u6709\u610f\u7fa9\u7684\u504f\u597d\u7dad\u5ea6\uff08\u4f8b\u5982\uff0c\u6709\u7528\u6027\u3001\u5b89\u5168\u6027\u3001\u5e7d\u9ed8\u611f\uff09\uff0c\u4e26\u5728\u4e0d\u9700\u8981\u984d\u5916\u8a13\u7df4\u7684\u60c5\u6cc1\u4e0b\u9069\u61c9\u65b0\u7684\u4f7f\u7528\u8005\u3002\u6211\u5011\u7684\u7d50\u679c\u7a81\u986f\u4e86 DRM \u4f5c\u70ba\u500b\u4eba\u5316\u4e14\u53ef\u89e3\u91cb\u7684 LLM \u5c0d\u9f4a\u5f37\u5927\u67b6\u69cb\u3002", "author": "Feng Luo et.al.", "authors": "Feng Luo, Rui Yang, Hao Sun, Chunyuan Deng, Jiarui Yao, Jingyan Shen, Huan Zhang, Hanjie Chen", "id": "2502.13131v1", "paper_url": "http://arxiv.org/abs/2502.13131v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/13132/2502.13132v1.json b/database/storage/2502/13132/2502.13132v1.json
new file mode 100644
index 0000000000..b2ed088f89
--- /dev/null
+++ b/database/storage/2502/13132/2502.13132v1.json
@@ -0,0 +1 @@
+{"2502.13132": {"publish_time": "2025-02-18", "title": "Learning to Defer for Causal Discovery with Imperfect Experts", "paper_summary": "Integrating expert knowledge, e.g. from large language models, into causal\ndiscovery algorithms can be challenging when the knowledge is not guaranteed to\nbe correct. Expert recommendations may contradict data-driven results, and\ntheir reliability can vary significantly depending on the domain or specific\nquery. Existing methods based on soft constraints or inconsistencies in\npredicted causal relationships fail to account for these variations in\nexpertise. To remedy this, we propose L2D-CD, a method for gauging the\ncorrectness of expert recommendations and optimally combining them with\ndata-driven causal discovery results. By adapting learning-to-defer (L2D)\nalgorithms for pairwise causal discovery (CD), we learn a deferral function\nthat selects whether to rely on classical causal discovery methods using\nnumerical data or expert recommendations based on textual meta-data. We\nevaluate L2D-CD on the canonical T\u00fcbingen pairs dataset and demonstrate its\nsuperior performance compared to both the causal discovery method and the\nexpert used in isolation. Moreover, our approach identifies domains where the\nexpert's performance is strong or weak. 
Finally, we outline a strategy for\ngeneralizing this approach to causal discovery on graphs with more than two\nvariables, paving the way for further research in this area.", "paper_summary_zh": "\u6574\u5408\u4e13\u5bb6\u77e5\u8b58\uff0c\u4f8b\u5982\u5f9e\u5927\u578b\u8a9e\u8a00\u6a21\u578b\u4e2d\u6574\u5408\u5230\u56e0\u679c\u767c\u73fe\u6f14\u7b97\u6cd5\u4e2d\uff0c\u7576\u77e5\u8b58\u7121\u6cd5\u4fdd\u8b49\u6b63\u78ba\u6642\u6703\u5f88\u6709\u6311\u6230\u6027\u3002\u5c08\u5bb6\u5efa\u8b70\u53ef\u80fd\u6703\u8207\u8cc7\u6599\u9a45\u52d5\u7684\u7d50\u679c\u76f8\u77db\u76fe\uff0c\u800c\u4e14\u4ed6\u5011\u7684\u53ef\u9760\u6027\u53ef\u80fd\u6703\u6839\u64da\u9818\u57df\u6216\u7279\u5b9a\u67e5\u8a62\u800c\u6709\u986f\u8457\u5dee\u7570\u3002\u73fe\u6709\u7684\u57fa\u65bc\u8edf\u7d04\u675f\u6216\u9810\u6e2c\u56e0\u679c\u95dc\u4fc2\u4e2d\u4e0d\u4e00\u81f4\u7684\u65b9\u6cd5\u7121\u6cd5\u8aaa\u660e\u5c08\u696d\u77e5\u8b58\u4e2d\u7684\u9019\u4e9b\u8b8a\u5316\u3002\u70ba\u4e86\u88dc\u6551\u9019\u4e00\u9ede\uff0c\u6211\u5011\u63d0\u51fa\u4e86 L2D-CD\uff0c\u4e00\u7a2e\u7528\u65bc\u8a55\u4f30\u5c08\u5bb6\u5efa\u8b70\u7684\u6b63\u78ba\u6027\u4e26\u5c07\u5176\u8207\u8cc7\u6599\u9a45\u52d5\u7684\u56e0\u679c\u767c\u73fe\u7d50\u679c\u6700\u4f73\u7d50\u5408\u7684\u65b9\u6cd5\u3002\u900f\u904e\u8abf\u6574\u5b78\u7fd2\u5ef6\u9072 (L2D) \u6f14\u7b97\u6cd5\u4ee5\u9032\u884c\u6210\u5c0d\u56e0\u679c\u767c\u73fe (CD)\uff0c\u6211\u5011\u5b78\u7fd2\u4e86\u4e00\u500b\u5ef6\u9072\u51fd\u6578\uff0c\u7528\u65bc\u9078\u64c7\u4f9d\u8cf4\u4f7f\u7528\u6578\u503c\u8cc7\u6599\u7684\u50b3\u7d71\u56e0\u679c\u767c\u73fe\u65b9\u6cd5\u6216\u57fa\u65bc\u6587\u5b57\u5143\u8cc7\u6599\u7684\u5c08\u5bb6\u5efa\u8b70\u3002\u6211\u5011\u5728\u7d93\u5178\u7684 T\\\"ubingen \u5c0d\u8cc7\u6599\u96c6\u4e0a\u8a55\u4f30 
L2D-CD\uff0c\u4e26\u8b49\u660e\u5176\u8207\u55ae\u7368\u4f7f\u7528\u7684\u56e0\u679c\u767c\u73fe\u65b9\u6cd5\u548c\u5c08\u5bb6\u76f8\u6bd4\u5177\u6709\u512a\u8d8a\u7684\u6548\u80fd\u3002\u6b64\u5916\uff0c\u6211\u5011\u7684\u505a\u6cd5\u8b58\u5225\u51fa\u5c08\u5bb6\u8868\u73fe\u5f37\u6216\u5f31\u7684\u9818\u57df\u3002\u6700\u5f8c\uff0c\u6211\u5011\u6982\u8ff0\u4e86\u4e00\u7a2e\u5c07\u6b64\u65b9\u6cd5\u63a8\u5ee3\u5230\u5177\u6709\u5169\u500b\u4ee5\u4e0a\u8b8a\u6578\u7684\u5716\u8868\u4e0a\u9032\u884c\u56e0\u679c\u767c\u73fe\u7684\u7b56\u7565\uff0c\u70ba\u6b64\u9818\u57df\u7684\u9032\u4e00\u6b65\u7814\u7a76\u92ea\u5e73\u4e86\u9053\u8def\u3002", "author": "Oscar Clivio et.al.", "authors": "Oscar Clivio, Divyat Mahajan, Perouz Taslakian, Sara Magliacane, Ioannis Mitliagkas, Valentina Zantedeschi, Alexandre Drouin", "id": "2502.13132v1", "paper_url": "http://arxiv.org/abs/2502.13132v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/13135/2502.13135v1.json b/database/storage/2502/13135/2502.13135v1.json
new file mode 100644
index 0000000000..50efb18619
--- /dev/null
+++ b/database/storage/2502/13135/2502.13135v1.json
@@ -0,0 +1 @@
+{"2502.13135": {"publish_time": "2025-02-18", "title": "Sleepless Nights, Sugary Days: Creating Synthetic Users with Health Conditions for Realistic Coaching Agent Interactions", "paper_summary": "We present an end-to-end framework for generating synthetic users for\nevaluating interactive agents designed to encourage positive behavior changes,\nsuch as in health and lifestyle coaching. The synthetic users are grounded in\nhealth and lifestyle conditions, specifically sleep and diabetes management in\nthis study, to ensure realistic interactions with the health coaching agent.\nSynthetic users are created in two stages: first, structured data are generated\ngrounded in real-world health and lifestyle factors in addition to basic\ndemographics and behavioral attributes; second, full profiles of the synthetic\nusers are developed conditioned on the structured data. Interactions between\nsynthetic users and the coaching agent are simulated using generative\nagent-based models such as Concordia, or directly by prompting a language\nmodel. Using two independently-developed agents for sleep and diabetes coaching\nas case studies, the validity of this framework is demonstrated by analyzing\nthe coaching agent's understanding of the synthetic users' needs and\nchallenges. Finally, through multiple blinded evaluations of user-coach\ninteractions by human experts, we demonstrate that our synthetic users with\nhealth and behavioral attributes more accurately portray real human users with\nthe same attributes, compared to generic synthetic users not grounded in such\nattributes. 
The proposed framework lays the foundation for efficient\ndevelopment of conversational agents through extensive, realistic, and grounded\nsimulated interactions.", "paper_summary_zh": "<paragraph>\u6211\u5011\u63d0\u4f9b\u4e86\u4e00\u500b\u7aef\u5230\u7aef\u7684\u67b6\u69cb\uff0c\u7528\u65bc\u70ba\u8a55\u4f30\u4e92\u52d5\u5f0f\u4ee3\u7406\u751f\u6210\u5408\u6210\u4f7f\u7528\u8005\uff0c\u9019\u4e9b\u4ee3\u7406\u65e8\u5728\u9f13\u52f5\u6b63\u5411\u884c\u70ba\u6539\u8b8a\uff0c\u4f8b\u5982\u5065\u5eb7\u548c\u751f\u6d3b\u65b9\u5f0f\u6307\u5c0e\u3002\u5408\u6210\u4f7f\u7528\u8005\u4ee5\u5065\u5eb7\u548c\u751f\u6d3b\u65b9\u5f0f\u72c0\u6cc1\u70ba\u57fa\u790e\uff0c\u7279\u5225\u662f\u672c\u7814\u7a76\u4e2d\u7684\u7761\u7720\u548c\u7cd6\u5c3f\u75c5\u7ba1\u7406\uff0c\u4ee5\u78ba\u4fdd\u8207\u5065\u5eb7\u6307\u5c0e\u4ee3\u7406\u7684\u4e92\u52d5\u5177\u6709\u771f\u5be6\u6027\u3002\u5408\u6210\u4f7f\u7528\u8005\u5206\u5169\u500b\u968e\u6bb5\u5efa\u7acb\uff1a\u9996\u5148\uff0c\u9664\u4e86\u57fa\u672c\u4eba\u53e3\u7d71\u8a08\u8cc7\u6599\u548c\u884c\u70ba\u5c6c\u6027\u5916\uff0c\u9084\u6703\u7522\u751f\u4ee5\u73fe\u5be6\u4e16\u754c\u7684\u5065\u5eb7\u548c\u751f\u6d3b\u65b9\u5f0f\u56e0\u7d20\u70ba\u57fa\u790e\u7684\u7d50\u69cb\u5316\u8cc7\u6599\uff1b\u5176\u6b21\uff0c\u6703\u6839\u64da\u7d50\u69cb\u5316\u8cc7\u6599\u958b\u767c\u5408\u6210\u4f7f\u7528\u8005\u7684\u5b8c\u6574\u500b\u4eba\u8cc7\u6599\u3002\u5408\u6210\u4f7f\u7528\u8005\u548c\u6307\u5c0e\u4ee3\u7406\u4e4b\u9593\u7684\u4e92\u52d5\u662f\u4f7f\u7528\u751f\u6210\u5f0f\u57fa\u65bc\u4ee3\u7406\u7684\u6a21\u578b\uff08\u4f8b\u5982 
Concordia\uff09\u6a21\u64ec\u7684\uff0c\u6216\u8005\u76f4\u63a5\u901a\u904e\u63d0\u793a\u8a9e\u8a00\u6a21\u578b\u4f86\u6a21\u64ec\u3002\u4f7f\u7528\u5169\u500b\u7368\u7acb\u958b\u767c\u7684\u7761\u7720\u548c\u7cd6\u5c3f\u75c5\u6307\u5c0e\u4ee3\u7406\u4f5c\u70ba\u6848\u4f8b\u7814\u7a76\uff0c\u901a\u904e\u5206\u6790\u6307\u5c0e\u4ee3\u7406\u5c0d\u5408\u6210\u4f7f\u7528\u8005\u9700\u6c42\u548c\u6311\u6230\u7684\u7406\u89e3\uff0c\u8b49\u660e\u4e86\u6b64\u67b6\u69cb\u7684\u6709\u6548\u6027\u3002\u6700\u5f8c\uff0c\u901a\u904e\u4eba\u985e\u5c08\u5bb6\u5c0d\u4f7f\u7528\u8005\u6307\u5c0e\u4e92\u52d5\u9032\u884c\u591a\u91cd\u76f2\u6e2c\u8a55\u4f30\uff0c\u6211\u5011\u8b49\u660e\u4e86\u8207\u672a\u4ee5\u9019\u4e9b\u5c6c\u6027\u70ba\u57fa\u790e\u7684\u901a\u7528\u5408\u6210\u4f7f\u7528\u8005\u76f8\u6bd4\uff0c\u5177\u6709\u5065\u5eb7\u548c\u884c\u70ba\u5c6c\u6027\u7684\u5408\u6210\u4f7f\u7528\u8005\u66f4\u6e96\u78ba\u5730\u63cf\u7e6a\u4e86\u5177\u6709\u76f8\u540c\u5c6c\u6027\u7684\u771f\u5be6\u4eba\u985e\u4f7f\u7528\u8005\u3002\u6240\u63d0\u51fa\u7684\u67b6\u69cb\u70ba\u901a\u904e\u5ee3\u6cdb\u3001\u771f\u5be6\u4e14\u6709\u6839\u64da\u7684\u6a21\u64ec\u4e92\u52d5\uff0c\u70ba\u5c0d\u8a71\u4ee3\u7406\u7684\u6709\u6548\u958b\u767c\u5960\u5b9a\u4e86\u57fa\u790e\u3002</paragraph>", "author": "Taedong Yun et.al.", "authors": "Taedong Yun, Eric Yang, Mustafa Safdari, Jong Ha Lee, Vaishnavi Vinod Kumar, S. Sara Mahdavi, Jonathan Amar, Derek Peyton, Reut Aharony, Andreas Michaelides, Logan Schneider, Isaac Galatzer-Levy, Yugang Jia, John Canny, Arthur Gretton, Maja Matari\u0107", "id": "2502.13135v1", "paper_url": "http://arxiv.org/abs/2502.13135v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/13137/2502.13137v1.json b/database/storage/2502/13137/2502.13137v1.json
new file mode 100644
index 0000000000..489a698ff5
--- /dev/null
+++ b/database/storage/2502/13137/2502.13137v1.json
@@ -0,0 +1 @@
+{"2502.13137": {"publish_time": "2025-02-18", "title": "Theorem Prover as a Judge for Synthetic Data Generation", "paper_summary": "The demand for synthetic data in mathematical reasoning has increased due to\nits potential to enhance the mathematical capabilities of large language models\n(LLMs). However, ensuring the validity of intermediate reasoning steps remains\na significant challenge, affecting data quality. While formal verification via\ntheorem provers effectively validates LLM reasoning, the autoformalisation of\nmathematical proofs remains error-prone. In response, we introduce iterative\nautoformalisation, an approach that iteratively refines theorem prover\nformalisation to mitigate errors, thereby increasing the execution rate on the\nLean prover from 60% to 87%. Building upon that, we introduce Theorem Prover as\na Judge (TP-as-a-Judge), a method that employs theorem prover formalisation to\nrigorously assess LLM intermediate reasoning, effectively integrating\nautoformalisation with synthetic data generation. Finally, we present\nReinforcement Learning from Theorem Prover Feedback (RLTPF), a framework that\nreplaces human annotation with theorem prover feedback in Reinforcement\nLearning from Human Feedback (RLHF). 
Across multiple LLMs, applying\nTP-as-a-Judge and RLTPF improves benchmarks with only 3,508 samples, achieving\n5.56% accuracy gain on Mistral-7B for MultiArith, 6.00% on Llama-2-7B for\nSVAMP, and 3.55% on Llama-3.1-8B for AQUA.", "paper_summary_zh": "<paragraph>\u7531\u65bc\u5408\u6210\u8cc7\u6599\u5728\u6578\u5b78\u63a8\u7406\u4e2d\u5177\u6709\u589e\u5f37\u5927\u578b\u8a9e\u8a00\u6a21\u578b (LLM) \u6578\u5b78\u80fd\u529b\u7684\u6f5b\u529b\uff0c\u5c0d\u5408\u6210\u8cc7\u6599\u7684\u9700\u6c42\u5df2\u589e\u52a0\u3002\u7136\u800c\uff0c\u78ba\u4fdd\u4e2d\u9593\u63a8\u7406\u6b65\u9a5f\u7684\u6709\u6548\u6027\u4ecd\u7136\u662f\u4e00\u9805\u91cd\u5927\u7684\u6311\u6230\uff0c\u5f71\u97ff\u8cc7\u6599\u54c1\u8cea\u3002\u96d6\u7136\u900f\u904e\u5b9a\u7406\u8b49\u660e\u5668\u9032\u884c\u5f62\u5f0f\u9a57\u8b49\u53ef\u6709\u6548\u9a57\u8b49 LLM \u63a8\u7406\uff0c\u4f46\u6578\u5b78\u8b49\u660e\u81ea\u52d5\u5f62\u5f0f\u5316\u4ecd\u7136\u5bb9\u6613\u51fa\u932f\u3002\u70ba\u4e86\u89e3\u6c7a\u9019\u500b\u554f\u984c\uff0c\u6211\u5011\u5f15\u5165\u4e86\u8fed\u4ee3\u81ea\u52d5\u5f62\u5f0f\u5316\uff0c\u9019\u662f\u4e00\u7a2e\u8fed\u4ee3\u512a\u5316\u5b9a\u7406\u8b49\u660e\u5668\u5f62\u5f0f\u5316\u4ee5\u6e1b\u5c11\u932f\u8aa4\u7684\u65b9\u6cd5\uff0c\u5f9e\u800c\u5c07 Lean \u8b49\u660e\u5668\u7684\u57f7\u884c\u7387\u5f9e 60% \u63d0\u9ad8\u5230 87%\u3002\u5728\u6b64\u57fa\u790e\u4e0a\uff0c\u6211\u5011\u5f15\u5165\u4e86\u5b9a\u7406\u8b49\u660e\u5668\u4f5c\u70ba\u8a55\u5be9 (TP-as-a-Judge)\uff0c\u9019\u662f\u4e00\u7a2e\u63a1\u7528\u5b9a\u7406\u8b49\u660e\u5668\u5f62\u5f0f\u5316\u4f86\u56b4\u683c\u8a55\u4f30 LLM \u4e2d\u9593\u63a8\u7406\u7684\u65b9\u6cd5\uff0c\u6709\u6548\u5730\u5c07\u81ea\u52d5\u5f62\u5f0f\u5316\u8207\u5408\u6210\u8cc7\u6599\u7522\u751f\u6574\u5408\u3002\u6700\u5f8c\uff0c\u6211\u5011\u63d0\u51fa\u4e86\u5b9a\u7406\u8b49\u660e\u5668\u56de\u994b\u5f37\u5316\u5b78\u7fd2 
(RLTPF)\uff0c\u9019\u662f\u4e00\u500b\u6846\u67b6\uff0c\u7528\u5b9a\u7406\u8b49\u660e\u5668\u56de\u994b\u53d6\u4ee3\u4eba\u985e\u6a19\u8a3b\uff0c\u4ee5\u9032\u884c\u4eba\u985e\u56de\u994b\u5f37\u5316\u5b78\u7fd2 (RLHF)\u3002\u5728\u591a\u500b LLM \u4e2d\uff0c\u61c9\u7528 TP-as-a-Judge \u548c RLTPF \u53ef\u900f\u904e\u50c5 3,508 \u500b\u6a23\u672c\u6539\u5584\u57fa\u6e96\uff0c\u5728 MultiArith \u4e0a\u7372\u5f97 5.56% \u7684\u6e96\u78ba\u5ea6\u63d0\u5347\uff0c\u5728 SVAMP \u4e0a\u7372\u5f97 Llama-2-7B \u7684 6.00% \u63d0\u5347\uff0c\u5728 AQUA \u4e0a\u7372\u5f97 Llama-3.1-8B \u7684 3.55% \u63d0\u5347\u3002</paragraph>", "author": "Joshua Ong Jun Leang et.al.", "authors": "Joshua Ong Jun Leang, Giwon Hong, Wenda Li, Shay B. Cohen", "id": "2502.13137v1", "paper_url": "http://arxiv.org/abs/2502.13137v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/13138/2502.13138v1.json b/database/storage/2502/13138/2502.13138v1.json
new file mode 100644
index 0000000000..9ee0243d16
--- /dev/null
+++ b/database/storage/2502/13138/2502.13138v1.json
@@ -0,0 +1 @@
+{"2502.13138": {"publish_time": "2025-02-18", "title": "AIDE: AI-Driven Exploration in the Space of Code", "paper_summary": "Machine learning, the foundation of modern artificial intelligence, has\ndriven innovations that have fundamentally transformed the world. Yet, behind\nadvancements lies a complex and often tedious process requiring labor and\ncompute intensive iteration and experimentation. Engineers and scientists\ndeveloping machine learning models spend much of their time on trial-and-error\ntasks instead of conceptualizing innovative solutions or research hypotheses.\nTo address this challenge, we introduce AI-Driven Exploration (AIDE), a machine\nlearning engineering agent powered by large language models (LLMs). AIDE frames\nmachine learning engineering as a code optimization problem, and formulates\ntrial-and-error as a tree search in the space of potential solutions. By\nstrategically reusing and refining promising solutions, AIDE effectively trades\ncomputational resources for enhanced performance, achieving state-of-the-art\nresults on multiple machine learning engineering benchmarks, including our\nKaggle evaluations, OpenAI MLE-Bench and METR's RE-Bench.", "paper_summary_zh": 
"\u6a5f\u5668\u5b78\u7fd2\uff0c\u73fe\u4ee3\u4eba\u5de5\u667a\u6167\u7684\u57fa\u790e\uff0c\u5df2\u7d93\u63a8\u52d5\u4e86\u6839\u672c\u6027\u5730\u6539\u8b8a\u4e16\u754c\u7684\u5275\u65b0\u3002\u7136\u800c\uff0c\u9032\u6b65\u7684\u80cc\u5f8c\u662f\u4e00\u500b\u8907\u96dc\u4e14\u7d93\u5e38\u7e41\u7463\u7684\u904e\u7a0b\uff0c\u9700\u8981\u4eba\u5de5\u548c\u8a08\u7b97\u5bc6\u96c6\u7684\u8fed\u4ee3\u548c\u5be6\u9a57\u3002\u958b\u767c\u6a5f\u5668\u5b78\u7fd2\u6a21\u578b\u7684\u5de5\u7a0b\u5e2b\u548c\u79d1\u5b78\u5bb6\u5c07\u5927\u90e8\u5206\u6642\u9593\u82b1\u5728\u8a66\u932f\u4efb\u52d9\u4e0a\uff0c\u800c\u4e0d\u662f\u69cb\u601d\u5275\u65b0\u7684\u89e3\u6c7a\u65b9\u6848\u6216\u7814\u7a76\u5047\u8a2d\u3002\u70ba\u4e86\u61c9\u5c0d\u9019\u4e00\u6311\u6230\uff0c\u6211\u5011\u5f15\u5165\u4e86 AI \u9a45\u52d5\u63a2\u7d22 (AIDE)\uff0c\u9019\u662f\u4e00\u7a2e\u7531\u5927\u578b\u8a9e\u8a00\u6a21\u578b (LLM) \u9a45\u52d5\u7684\u6a5f\u5668\u5b78\u7fd2\u5de5\u7a0b\u4ee3\u7406\u3002AIDE \u5c07\u6a5f\u5668\u5b78\u7fd2\u5de5\u7a0b\u69cb\u5efa\u70ba\u4e00\u500b\u7a0b\u5f0f\u78bc\u6700\u4f73\u5316\u554f\u984c\uff0c\u4e26\u5c07\u8a66\u932f\u8868\u8ff0\u70ba\u5728\u6f5b\u5728\u89e3\u6c7a\u65b9\u6848\u7a7a\u9593\u4e2d\u7684\u6a39\u72c0\u641c\u5c0b\u3002\u900f\u904e\u7b56\u7565\u6027\u5730\u91cd\u8907\u4f7f\u7528\u548c\u6539\u9032\u6709\u5e0c\u671b\u7684\u89e3\u6c7a\u65b9\u6848\uff0cAIDE \u6709\u6548\u5730\u5c07\u8a08\u7b97\u8cc7\u6e90\u8f49\u63db\u70ba\u589e\u5f37\u7684\u6548\u80fd\uff0c\u5728\u591a\u500b\u6a5f\u5668\u5b78\u7fd2\u5de5\u7a0b\u57fa\u6e96\u4e0a\u53d6\u5f97\u4e86\u6700\u5148\u9032\u7684\u6210\u679c\uff0c\u5305\u62ec\u6211\u5011\u7684 Kaggle \u8a55\u4f30\u3001OpenAI MLE-Bench \u548c METRs RE-Bench\u3002", "author": "Zhengyao Jiang et.al.", "authors": "Zhengyao Jiang, Dominik Schmidt, Dhruv Srikanth, Dixing Xu, Ian Kaplan, Deniss Jacenko, Yuxiang Wu", "id": "2502.13138v1", "paper_url": "http://arxiv.org/abs/2502.13138v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/13141/2502.13141v1.json b/database/storage/2502/13141/2502.13141v1.json
new file mode 100644
index 0000000000..7a383905d4
--- /dev/null
+++ b/database/storage/2502/13141/2502.13141v1.json
@@ -0,0 +1 @@
+{"2502.13141": {"publish_time": "2025-02-18", "title": "UniGuardian: A Unified Defense for Detecting Prompt Injection, Backdoor Attacks and Adversarial Attacks in Large Language Models", "paper_summary": "Large Language Models (LLMs) are vulnerable to attacks like prompt injection,\nbackdoor attacks, and adversarial attacks, which manipulate prompts or models\nto generate harmful outputs. In this paper, departing from traditional deep\nlearning attack paradigms, we explore their intrinsic relationship and\ncollectively term them Prompt Trigger Attacks (PTA). This raises a key\nquestion: Can we determine if a prompt is benign or poisoned? To address this,\nwe propose UniGuardian, the first unified defense mechanism designed to detect\nprompt injection, backdoor attacks, and adversarial attacks in LLMs.\nAdditionally, we introduce a single-forward strategy to optimize the detection\npipeline, enabling simultaneous attack detection and text generation within a\nsingle forward pass. Our experiments confirm that UniGuardian accurately and\nefficiently identifies malicious prompts in LLMs.", "paper_summary_zh": "\u5927\u578b\u8a9e\u8a00\u6a21\u578b (LLM) \u5bb9\u6613\u53d7\u5230\u63d0\u793a\u6ce8\u5165\u3001\u5f8c\u9580\u653b\u64ca\u548c\u5c0d\u6297\u6027\u653b\u64ca\u7b49\u653b\u64ca\uff0c\u9019\u4e9b\u653b\u64ca\u6703\u64cd\u7e31\u63d0\u793a\u6216\u6a21\u578b\u4ee5\u7522\u751f\u6709\u5bb3\u7684\u8f38\u51fa\u3002\u5728\u672c\u6587\u4e2d\uff0c\u6211\u5011\u8df3\u812b\u50b3\u7d71\u6df1\u5ea6\u5b78\u7fd2\u653b\u64ca\u7bc4\u4f8b\uff0c\u63a2\u8a0e\u5b83\u5011\u7684\u5167\u5728\u95dc\u4fc2\uff0c\u4e26\u5c07\u5b83\u5011\u7d71\u7a31\u70ba\u63d0\u793a\u89f8\u767c\u653b\u64ca (PTA)\u3002\u9019\u5f15\u767c\u4e86\u4e00\u500b\u95dc\u9375\u554f\u984c\uff1a\u6211\u5011\u80fd\u78ba\u5b9a\u4e00\u500b\u63d0\u793a\u662f\u826f\u6027\u7684\u9084\u662f\u60e1\u610f\u7684\u55ce\uff1f\u70ba\u4e86\u89e3\u6c7a\u9019\u500b\u554f\u984c\uff0c\u6211\u5011\u63d0\u51fa\u4e86 
UniGuardian\uff0c\u9019\u662f\u4e00\u7a2e\u65e8\u5728\u5075\u6e2c LLM \u4e2d\u7684\u63d0\u793a\u6ce8\u5165\u3001\u5f8c\u9580\u653b\u64ca\u548c\u5c0d\u6297\u6027\u653b\u64ca\u7684\u7b2c\u4e00\u500b\u7d71\u4e00\u9632\u79a6\u6a5f\u5236\u3002\u6b64\u5916\uff0c\u6211\u5011\u5f15\u5165\u4e86\u4e00\u500b\u55ae\u4e00\u524d\u5411\u7b56\u7565\u4f86\u6700\u4f73\u5316\u5075\u6e2c\u7ba1\u9053\uff0c\u5728\u55ae\u4e00\u524d\u5411\u50b3\u905e\u4e2d\u540c\u6642\u9032\u884c\u653b\u64ca\u5075\u6e2c\u548c\u6587\u5b57\u751f\u6210\u3002\u6211\u5011\u7684\u5be6\u9a57\u8b49\u5be6\uff0cUniGuardian \u80fd\u6e96\u78ba\u4e14\u6709\u6548\u5730\u8b58\u5225 LLM \u4e2d\u7684\u60e1\u610f\u63d0\u793a\u3002", "author": "Huawei Lin et.al.", "authors": "Huawei Lin, Yingjie Lao, Tong Geng, Tan Yu, Weijie Zhao", "id": "2502.13141v1", "paper_url": "http://arxiv.org/abs/2502.13141v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/13142/2502.13142v1.json b/database/storage/2502/13142/2502.13142v1.json
new file mode 100644
index 0000000000..5c772de2b8
--- /dev/null
+++ b/database/storage/2502/13142/2502.13142v1.json
@@ -0,0 +1 @@
+{"2502.13142": {"publish_time": "2025-02-18", "title": "Pre-training Auto-regressive Robotic Models with 4D Representations", "paper_summary": "Foundation models pre-trained on massive unlabeled datasets have\nrevolutionized natural language and computer vision, exhibiting remarkable\ngeneralization capabilities, thus highlighting the importance of pre-training.\nYet, efforts in robotics have struggled to achieve similar success, limited by\neither the need for costly robotic annotations or the lack of representations\nthat effectively model the physical world. In this paper, we introduce ARM4R,\nan Auto-regressive Robotic Model that leverages low-level 4D Representations\nlearned from human video data to yield a better pre-trained robotic model.\nSpecifically, we focus on utilizing 3D point tracking representations from\nvideos derived by lifting 2D representations into 3D space via monocular depth\nestimation across time. These 4D representations maintain a shared geometric\nstructure between the points and robot state representations up to a linear\ntransformation, enabling efficient transfer learning from human video data to\nlow-level robotic control. 
Our experiments show that ARM4R can transfer\nefficiently from human video data to robotics and consistently improves\nperformance on tasks across various robot environments and configurations.", "paper_summary_zh": "\u9810\u5148\u5728\u5927\u91cf\u672a\u6a19\u8a18\u8cc7\u6599\u96c6\u4e0a\u8a13\u7df4\u597d\u7684\u57fa\u790e\u6a21\u578b\u5df2\u7d93\u5fb9\u5e95\u6539\u8b8a\u4e86\u81ea\u7136\u8a9e\u8a00\u548c\u96fb\u8166\u8996\u89ba\uff0c\u5c55\u73fe\u51fa\u975e\u51e1\u7684\u6982\u5316\u80fd\u529b\uff0c\u56e0\u6b64\u7a81\u986f\u4e86\u9810\u5148\u8a13\u7df4\u7684\u91cd\u8981\u6027\u3002\u7136\u800c\uff0c\u6a5f\u5668\u4eba\u9818\u57df\u7684\u52aa\u529b\u4e00\u76f4\u96e3\u4ee5\u53d6\u5f97\u985e\u4f3c\u7684\u6210\u529f\uff0c\u53d7\u5230\u6602\u8cb4\u7684\u6a5f\u5668\u4eba\u6a19\u8a3b\u9700\u6c42\u6216\u7f3a\u4e4f\u6709\u6548\u5efa\u6a21\u7269\u7406\u4e16\u754c\u7684\u8868\u5fb5\u7684\u9650\u5236\u3002\u5728\u672c\u6587\u4e2d\uff0c\u6211\u5011\u4ecb\u7d39\u4e86 ARM4R\uff0c\u4e00\u7a2e\u81ea\u8ff4\u6b78\u6a5f\u5668\u4eba\u6a21\u578b\uff0c\u5b83\u5229\u7528\u5f9e\u4eba\u985e\u5f71\u7247\u8cc7\u6599\u4e2d\u5b78\u7fd2\u5230\u7684\u4f4e\u968e 4D \u8868\u5fb5\uff0c\u4ee5\u7522\u751f\u66f4\u597d\u7684\u9810\u5148\u8a13\u7df4\u6a5f\u5668\u4eba\u6a21\u578b\u3002\u5177\u9ad4\u4f86\u8aaa\uff0c\u6211\u5011\u5c08\u6ce8\u65bc\u5229\u7528\u5f9e\u5f71\u7247\u4e2d\u7372\u5f97\u7684 3D \u9ede\u8ffd\u8e64\u8868\u5fb5\uff0c\u9019\u4e9b\u8868\u5fb5\u662f\u900f\u904e\u55ae\u773c\u6df1\u5ea6\u4f30\u8a08\u8de8\u6642\u9593\u5c07 2D \u8868\u5fb5\u63d0\u5347\u5230 3D \u7a7a\u9593\u800c\u5c0e\u51fa\u7684\u3002\u9019\u4e9b 4D 
\u8868\u5fb5\u5728\u9ede\u548c\u6a5f\u5668\u4eba\u72c0\u614b\u8868\u5fb5\u4e4b\u9593\u4fdd\u6301\u4e00\u500b\u5171\u7528\u7684\u5e7e\u4f55\u7d50\u69cb\uff0c\u76f4\u5230\u4e00\u500b\u7dda\u6027\u8f49\u63db\uff0c\u9019\u4f7f\u5f97\u5f9e\u4eba\u985e\u5f71\u7247\u8cc7\u6599\u5230\u4f4e\u968e\u6a5f\u5668\u4eba\u63a7\u5236\u7684\u6709\u6548\u9077\u79fb\u5b78\u7fd2\u6210\u70ba\u53ef\u80fd\u3002\u6211\u5011\u7684\u5be6\u9a57\u8868\u660e\uff0cARM4R \u53ef\u4ee5\u6709\u6548\u5730\u5f9e\u4eba\u985e\u5f71\u7247\u8cc7\u6599\u8f49\u79fb\u5230\u6a5f\u5668\u4eba\u6280\u8853\uff0c\u4e26\u6301\u7e8c\u6539\u5584\u5404\u7a2e\u6a5f\u5668\u4eba\u74b0\u5883\u548c\u7d44\u614b\u4e2d\u7684\u4efb\u52d9\u6548\u80fd\u3002", "author": "Dantong Niu et.al.", "authors": "Dantong Niu, Yuvan Sharma, Haoru Xue, Giscard Biamby, Junyi Zhang, Ziteng Ji, Trevor Darrell, Roei Herzig", "id": "2502.13142v1", "paper_url": "http://arxiv.org/abs/2502.13142v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/2502/13143/2502.13143v1.json b/database/storage/2502/13143/2502.13143v1.json
new file mode 100644
index 0000000000..09cf1f0751
--- /dev/null
+++ b/database/storage/2502/13143/2502.13143v1.json
@@ -0,0 +1 @@
+{"2502.13143": {"publish_time": "2025-02-18", "title": "SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation", "paper_summary": "Spatial intelligence is a critical component of embodied AI, promoting robots\nto understand and interact with their environments. While recent advances have\nenhanced the ability of VLMs to perceive object locations and positional\nrelationships, they still lack the capability to precisely understand object\norientations-a key requirement for tasks involving fine-grained manipulations.\nAddressing this limitation not only requires geometric reasoning but also an\nexpressive and intuitive way to represent orientation. In this context, we\npropose that natural language offers a more flexible representation space than\ncanonical frames, making it particularly suitable for instruction-following\nrobotic systems. In this paper, we introduce the concept of semantic\norientation, which defines object orientations using natural language in a\nreference-frame-free manner (e.g., the ''plug-in'' direction of a USB or the\n''handle'' direction of a knife). To support this, we construct OrienText300K,\na large-scale dataset of 3D models annotated with semantic orientations that\nlink geometric understanding to functional semantics. By integrating semantic\norientation into a VLM system, we enable robots to generate manipulation\nactions with both positional and orientational constraints. 
Extensive\nexperiments in simulation and real world demonstrate that our approach\nsignificantly enhances robotic manipulation capabilities, e.g., 48.7% accuracy\non Open6DOR and 74.9% accuracy on SIMPLER.", "paper_summary_zh": "\u7a7a\u9593\u667a\u80fd\u662f\u5177\u8c61 AI \u7684\u95dc\u9375\u7d44\u6210\u90e8\u5206\uff0c\u4fc3\u4f7f\u6a5f\u5668\u4eba\u4e86\u89e3\u5176\u74b0\u5883\u4e26\u8207\u4e4b\u4e92\u52d5\u3002\u96d6\u7136\u6700\u8fd1\u7684\u9032\u5c55\u589e\u5f37\u4e86 VLM \u611f\u77e5\u7269\u4ef6\u4f4d\u7f6e\u548c\u4f4d\u7f6e\u95dc\u4fc2\u7684\u80fd\u529b\uff0c\u4f46\u5b83\u5011\u4ecd\u7136\u7f3a\u4e4f\u7cbe\u78ba\u7406\u89e3\u7269\u4ef6\u65b9\u5411\u7684\u80fd\u529b\uff0c\u9019\u5c0d\u65bc\u6d89\u53ca\u7d30\u5fae\u64cd\u4f5c\u7684\u4efb\u52d9\u4f86\u8aaa\u662f\u4e00\u9805\u95dc\u9375\u8981\u6c42\u3002\u89e3\u6c7a\u9019\u500b\u9650\u5236\u4e0d\u50c5\u9700\u8981\u5e7e\u4f55\u63a8\u7406\uff0c\u9084\u9700\u8981\u4e00\u7a2e\u8868\u9054\u6027\u548c\u76f4\u89c0\u7684\u65b9\u5f0f\u4f86\u8868\u793a\u65b9\u5411\u3002\u5728\u6b64\u80cc\u666f\u4e0b\uff0c\u6211\u5011\u63d0\u51fa\u81ea\u7136\u8a9e\u8a00\u63d0\u4f9b\u4e86\u4e00\u500b\u6bd4\u6a19\u6e96\u6846\u67b6\u66f4\u9748\u6d3b\u7684\u8868\u793a\u7a7a\u9593\uff0c\u4f7f\u5176\u7279\u5225\u9069\u5408\u65bc\u9075\u5faa\u6307\u4ee4\u7684\u6a5f\u5668\u4eba\u7cfb\u7d71\u3002\u5728\u672c\u6587\u4e2d\uff0c\u6211\u5011\u4ecb\u7d39\u4e86\u8a9e\u7fa9\u65b9\u5411\u7684\u6982\u5ff5\uff0c\u5b83\u4f7f\u7528\u81ea\u7136\u8a9e\u8a00\u4ee5\u7121\u53c3\u8003\u6846\u67b6\u7684\u65b9\u5f0f\u5b9a\u7fa9\u7269\u4ef6\u65b9\u5411\uff08\u4f8b\u5982\uff0cUSB \u7684\u300c\u63d2\u5165\u300d\u65b9\u5411\u6216\u5200\u5b50\u7684\u300c\u63e1\u67c4\u300d\u65b9\u5411\uff09\u3002\u70ba\u4e86\u652f\u6301\u9019\u4e00\u9ede\uff0c\u6211\u5011\u69cb\u5efa\u4e86 OrienText300K\uff0c\u9019\u662f\u4e00\u500b\u5927\u578b 3D 
\u6a21\u578b\u6578\u64da\u96c6\uff0c\u5176\u4e2d\u8a3b\u91cb\u4e86\u8a9e\u7fa9\u65b9\u5411\uff0c\u5c07\u5e7e\u4f55\u7406\u89e3\u8207\u529f\u80fd\u8a9e\u7fa9\u806f\u7e6b\u8d77\u4f86\u3002\u901a\u904e\u5c07\u8a9e\u7fa9\u65b9\u5411\u6574\u5408\u5230 VLM \u7cfb\u7d71\u4e2d\uff0c\u6211\u5011\u4f7f\u6a5f\u5668\u4eba\u80fd\u5920\u751f\u6210\u540c\u6642\u5177\u6709\u4f4d\u7f6e\u548c\u65b9\u5411\u7d04\u675f\u7684\u64cd\u4f5c\u52d5\u4f5c\u3002\u5728\u6a21\u64ec\u548c\u73fe\u5be6\u4e16\u754c\u4e2d\u9032\u884c\u7684\u5ee3\u6cdb\u5be6\u9a57\u8868\u660e\uff0c\u6211\u5011\u7684\u505a\u6cd5\u986f\u8457\u589e\u5f37\u4e86\u6a5f\u5668\u4eba\u7684\u64cd\u4f5c\u80fd\u529b\uff0c\u4f8b\u5982\uff0cOpen6DOR \u7684\u6e96\u78ba\u7387\u70ba 48.7%\uff0cSIMPLER \u7684\u6e96\u78ba\u7387\u70ba 74.9%\u3002", "author": "Zekun Qi et.al.", "authors": "Zekun Qi, Wenyao Zhang, Yufei Ding, Runpei Dong, Xinqiang Yu, Jingwen Li, Lingyun Xu, Baoyu Li, Xialin He, Guofan Fan, Jiazhao Zhang, Jiawei He, Jiayuan Gu, Xin Jin, Kaisheng Ma, Zhizheng Zhang, He Wang, Li Yi", "id": "2502.13143v1", "paper_url": "http://arxiv.org/abs/2502.13143v1", "repo": "null"}}
\ No newline at end of file
diff --git a/database/storage/storage_2025-02-19.md b/database/storage/storage_2025-02-19.md
index 51680a83a7..626af2d452 100644
--- a/database/storage/storage_2025-02-19.md
+++ b/database/storage/storage_2025-02-19.md
@@ -1,5 +1,5 @@
 # arxiv-daily
- Automated deployment @ 2025-02-19 09:05:53 Asia/Taipei
+ Automated deployment @ 2025-02-19 20:34:11 Asia/Taipei
 > Welcome to contribute! Add your topics and keywords in [`topic.yml`](https://github.com/jawatech/arxiv-daily-in-place/blob/main/database/topic.yml).
 > You can also view historical data through the [storage](https://github.com/jawatech/arxiv-daily-in-place/blob/main/database/storage).
 
@@ -8,6 +8,7 @@
 ### Medical explainable AI
 |Publish Date|Title|Authors|Homepage|Code|
 | :---: | :---: | :---: | :---: | :---: |
+|**2025-02-14**|**3D ReX: Causal Explanations in 3D Neuroimaging Classification**|Melane Navaratnarajah et.al.|[2502.12181v1](http://arxiv.org/abs/2502.12181v1)|null|
 |**2025-02-13**|**Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration**|Flavio Bertini et.al.|[2502.09218v1](http://arxiv.org/abs/2502.09218v1)|null|
 |**2025-02-10**|**Foundation Model of Electronic Medical Records for Adaptive Risk Estimation**|Pawel Renc et.al.|[2502.06124v1](http://arxiv.org/abs/2502.06124v1)|null|
 |**2025-01-27**|**An Explainable Disease Surveillance System for Early Prediction of Multiple Chronic Diseases**|Shaheer Ahmad Khan et.al.|[2501.15969v1](http://arxiv.org/abs/2501.15969v1)|null|
@@ -107,9 +108,22 @@
 |**2024-02-28**|**Cause and Effect: Can Large Language Models Truly Understand Causality?**|Swagata Ashwani et.al.|[2402.18139v3](http://arxiv.org/abs/2402.18139v3)|null|
 |**2024-02-28**|**Artificial Intelligence and Diabetes Mellitus: An Inside Look Through the Retina**|Yasin Sadeghi Bazargani et.al.|[2402.18600v1](http://arxiv.org/abs/2402.18600v1)|null|
 |**2024-02-22**|**Multi-stakeholder Perspective on Responsible Artificial Intelligence and Acceptability in Education**|A. J. Karran et.al.|[2402.15027v2](http://arxiv.org/abs/2402.15027v2)|null|
-|**2024-02-12**|**Deciphering Heartbeat Signatures: A Vision Transformer Approach to Explainable Atrial Fibrillation Detection from ECG Signals**|Aruna Mohan et.al.|[2402.09474v2](http://arxiv.org/abs/2402.09474v2)|null|
 
 #### Abstracts
+##### **3D ReX: Causal Explanations in 3D Neuroimaging Classification**
+2502.12181v1 by Melane Navaratnarajah, Sophie A. Martin, David A. Kelly, Nathan Blake, Hana Chocker
+
+Explainability remains a significant problem for AI models in medical
+imaging, making it challenging for clinicians to trust AI-driven predictions.
+We introduce 3D ReX, the first causality-based post-hoc explainability tool for
+3D models. 3D ReX uses the theory of actual causality to generate
+responsibility maps which highlight the regions most crucial to the model's
+decision. We test 3D ReX on a stroke detection model, providing insight into
+the spatial distribution of features relevant to stroke.
+
+摘要：解釋性仍然是醫療影像中 AI 模型的一大問題，這使得臨床醫生難以信任 AI 驅動的預測。
+我們引入了 3D ReX，這是第一個用於 3D 模型的基於因果關係的事後解釋性工具。3D ReX 使用實際因果關係理論來生成責任圖，該圖突出了對模型決策至關重要的區域。我們在中風檢測模型上測試了 3D ReX，提供了與中風相關特徵的空間分佈的見解。
+
 ##### **Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration**
 2502.09218v1 by Flavio Bertini, Alessandro Dal Palù, Federica Zaglio, Francesco Fabiano, Andrea Formisano
 
@@ -2675,36 +2689,16 @@ characteristics, in addition to the diverse stakeholders' perceptions.
 
 摘要：這項研究從多個利害關係人的角度探討不同的人工智慧 (AI) 應用在教育上的可接受性，包括學生、老師和家長。承認 AI 在教育上的轉型潛力，它解決了與資料隱私、AI 代理、透明度、可解釋性和 AI 的道德部署相關的疑慮。透過小插曲方法，參與者被呈現了四種情境，其中 AI 的代理、透明度、可解釋性和隱私受到操縱。在每個情境後，參與者完成了一項調查，該調查捕捉了他們對 AI 的整體效用、個人效用、正義、信心、風險和如果可用，使用每個情境的 AI 的意圖的看法。資料蒐集包含來自合作機構和社群媒體活動的 1198 位多利害關係人參與者的最終樣本，並專注於對四個 AI 使用案例的個別回應。對資料的調解分析表明，對 AI 的接受度和信任在利害關係人團體之間有顯著差異。我們發現，AI 的代理、透明度和可解釋性高低程度之間的關鍵調解者，以及使用不同教育 AI 的意圖，包括感知到的整體效用、正義和信心。這項研究強調，接受 AI 在教育上的應用是一個微妙且多面向的問題，除了不同的利害關係人的看法外，還需要仔細考慮具體的 AI 應用及其特徵。
 
-##### **Deciphering Heartbeat Signatures: A Vision Transformer Approach to Explainable Atrial Fibrillation Detection from ECG Signals**
-2402.09474v2 by Aruna Mohan, Danne Elbers, Or Zilbershot, Fatemeh Afghah, David Vorchheimer
-
-Remote patient monitoring based on wearable single-lead electrocardiogram
-(ECG) devices has significant potential for enabling the early detection of
-heart disease, especially in combination with artificial intelligence (AI)
-approaches for automated heart disease detection. There have been prior studies
-applying AI approaches based on deep learning for heart disease detection.
-However, these models are yet to be widely accepted as a reliable aid for
-clinical diagnostics, in part due to the current black-box perception
-surrounding many AI algorithms. In particular, there is a need to identify the
-key features of the ECG signal that contribute toward making an accurate
-diagnosis, thereby enhancing the interpretability of the model. In the present
-study, we develop a vision transformer approach to identify atrial fibrillation
-based on single-lead ECG data. A residual network (ResNet) approach is also
-developed for comparison with the vision transformer approach. These models are
-applied to the Chapman-Shaoxing dataset to classify atrial fibrillation, as
-well as another common arrhythmia, sinus bradycardia, and normal sinus rhythm
-heartbeats. The models enable the identification of the key regions of the
-heartbeat that determine the resulting classification, and highlight the
-importance of P-waves and T-waves, as well as heartbeat duration and signal
-amplitude, in distinguishing normal sinus rhythm from atrial fibrillation and
-sinus bradycardia.
-
-摘要：基於可穿戴式單導程心電圖 (ECG) 裝置的遠端病患監測在早期偵測心臟疾病方面具有顯著的潛力，特別是與用於自動化心臟疾病偵測的人工智慧 (AI) 方法結合使用時。先前已有研究應用基於深度學習的 AI 方法進行心臟疾病偵測。然而，這些模型尚未被廣泛接受為臨床診斷的可靠輔助工具，部分原因在於圍繞許多 AI 演算法的當前黑箱感知。特別是，有必要找出有助於做出準確診斷的 ECG 訊號關鍵特徵，從而增強模型的可解釋性。在本研究中，我們開發了一種視覺轉換器方法，以根據單導程 ECG 資料找出心房顫動。殘差網路 (ResNet) 方法也已開發出來，以便與視覺轉換器方法進行比較。這些模型應用於 Chapman-Shaoxing 資料集，以分類心房顫動，以及另一種常見的心律不整，竇性心動過緩，和正常竇性心律的心跳。這些模型能夠找出決定最終分類的心跳關鍵區域，並強調 P 波和 T 波，以及心跳持續時間和訊號振幅在區分正常竇性心律與心房顫動和竇性心動過緩方面的重要性。
-
 
 ### Medical
 |Publish Date|Title|Authors|Homepage|Code|
 | :---: | :---: | :---: | :---: | :---: |
+|**2025-02-18**|**Sleepless Nights, Sugary Days: Creating Synthetic Users with Health Conditions for Realistic Coaching Agent Interactions**|Taedong Yun et.al.|[2502.13135v1](http://arxiv.org/abs/2502.13135v1)|null|
+|**2025-02-18**|**Improving Clinical Question Answering with Multi-Task Learning: A Joint Approach for Answer Extraction and Medical Categorization**|Priyaranjan Pattnayak et.al.|[2502.13108v1](http://arxiv.org/abs/2502.13108v1)|null|
+|**2025-02-18**|**Fake It Till You Make It: Using Synthetic Data and Domain Knowledge for Improved Text-Based Learning for LGE Detection**|Athira J Jacob et.al.|[2502.12948v1](http://arxiv.org/abs/2502.12948v1)|null|
+|**2025-02-18**|**Reasoning and the Trusting Behavior of DeepSeek and GPT: An Experiment Revealing Hidden Fault Lines in Large Language Models**|Rubing Lu et.al.|[2502.12825v1](http://arxiv.org/abs/2502.12825v1)|null|
+|**2025-02-18**|**LLM Safety for Children**|Prasanjit Rath et.al.|[2502.12552v1](http://arxiv.org/abs/2502.12552v1)|null|
+|**2025-02-17**|**Classifiers of Data Sharing Statements in Clinical Trial Records**|Saber Jelodari Mamaghani et.al.|[2502.12362v1](http://arxiv.org/abs/2502.12362v1)|null|
 |**2025-02-17**|**Relational Norms for Human-AI Cooperation**|Brian D. Earp et.al.|[2502.12102v1](http://arxiv.org/abs/2502.12102v1)|null|
 |**2025-02-17**|**Deep Spatio-Temporal Neural Network for Air Quality Reanalysis**|Ammar Kheder et.al.|[2502.11941v1](http://arxiv.org/abs/2502.11941v1)|null|
 |**2025-02-17**|**Proactive Depot Discovery: A Generative Framework for Flexible Location-Routing**|Site Qu et.al.|[2502.11715v1](http://arxiv.org/abs/2502.11715v1)|null|
@@ -2716,17 +2710,19 @@ sinus bradycardia.
 |**2025-02-16**|**A Survey of LLM-based Agents in Medicine: How far are we from Baymax?**|Wenxuan Wang et.al.|[2502.11211v1](http://arxiv.org/abs/2502.11211v1)|null|
 |**2025-02-16**|**RT-DEMT: A hybrid real-time acupoint detection model combining mamba and transformer**|Shilong Yang et.al.|[2502.11179v1](http://arxiv.org/abs/2502.11179v1)|null|
 |**2025-02-16**|**Knowledge Graph-Driven Retrieval-Augmented Generation: Integrating Deepseek-R1 with Weaviate for Advanced Chatbot Applications**|Alexandru Lecu et.al.|[2502.11108v1](http://arxiv.org/abs/2502.11108v1)|null|
+|**2025-02-16**|**Predicting Depression in Screening Interviews from Interactive Multi-Theme Collaboration**|Xianbing Zhao et.al.|[2502.12204v1](http://arxiv.org/abs/2502.12204v1)|null|
 |**2025-02-16**|**CL-MFAP: A Contrastive Learning-Based Multimodal Foundation Model for Molecular Property Prediction and Antibiotic Screening**|Gen Zhou et.al.|[2502.11001v1](http://arxiv.org/abs/2502.11001v1)|null|
 |**2025-02-15**|**Automatic Quality Assessment of First Trimester Crown-Rump-Length Ultrasound Images**|Sevim Cengiz et.al.|[2502.10908v1](http://arxiv.org/abs/2502.10908v1)|null|
 |**2025-02-15**|**Breaking Down the Hierarchy: A New Approach to Leukemia Classification**|Ibraheem Hamdi et.al.|[2502.10899v1](http://arxiv.org/abs/2502.10899v1)|null|
 |**2025-02-15**|**An Empirical Analysis of Uncertainty in Large Language Model Evaluations**|Qiujie Xie et.al.|[2502.10709v1](http://arxiv.org/abs/2502.10709v1)|null|
-|**2025-02-15**|**Reading Your Heart: Learning ECG Words and Sentences via Pre-training ECG Language Model**|Jiarui Jin et.al.|[2502.10707v1](http://arxiv.org/abs/2502.10707v1)|null|
+|**2025-02-15**|**Reading Your Heart: Learning ECG Words and Sentences via Pre-training ECG Language Model**|Jiarui Jin et.al.|[2502.10707v1](http://arxiv.org/abs/2502.10707v1)|[link](https://github.com/pkudigitalhealth/heartlang)|
 |**2025-02-15**|**Self-Explaining Hypergraph Neural Networks for Diagnosis Prediction**|Leisheng Yu et.al.|[2502.10689v1](http://arxiv.org/abs/2502.10689v1)|null|
 |**2025-02-15**|**ProMRVL-CAD: Proactive Dialogue System with Multi-Round Vision-Language Interactions for Computer-Aided Diagnosis**|Xueshen Li et.al.|[2502.10620v1](http://arxiv.org/abs/2502.10620v1)|null|
 |**2025-02-15**|**Optimizing CNN Architectures for Advanced Thoracic Disease Classification**|Tejas Mirthipati et.al.|[2502.10614v1](http://arxiv.org/abs/2502.10614v1)|null|
 |**2025-02-14**|**PolyPath: Adapting a Large Multimodal Model for Multi-slide Pathology Report Generation**|Faruk Ahmed et.al.|[2502.10536v1](http://arxiv.org/abs/2502.10536v1)|null|
 |**2025-02-14**|**Tempo: Helping Data Scientists and Domain Experts Collaboratively Specify Predictive Modeling Tasks**|Venkatesh Sivaraman et.al.|[2502.10526v1](http://arxiv.org/abs/2502.10526v1)|null|
 |**2025-02-14**|**A Robust Attack: Displacement Backdoor Attack**|Yong Li et.al.|[2502.10490v1](http://arxiv.org/abs/2502.10490v1)|null|
+|**2025-02-14**|**3D ReX: Causal Explanations in 3D Neuroimaging Classification**|Melane Navaratnarajah et.al.|[2502.12181v1](http://arxiv.org/abs/2502.12181v1)|null|
 |**2025-02-14**|**Analyzing Patient Daily Movement Behavior Dynamics Using Two-Stage Encoding Model**|Jin Cui et.al.|[2502.09947v1](http://arxiv.org/abs/2502.09947v1)|null|
 |**2025-02-14**|**TransGUNet: Transformer Meets Graph-based Skip Connection for Medical Image Segmentation**|Ju-Hyeon Nam et.al.|[2502.09931v1](http://arxiv.org/abs/2502.09931v1)|null|
 |**2025-02-14**|**Video2Policy: Scaling up Manipulation Tasks in Simulation through Internet Videos**|Weirui Ye et.al.|[2502.09886v1](http://arxiv.org/abs/2502.09886v1)|null|
@@ -2743,6 +2739,7 @@ sinus bradycardia.
 |**2025-02-13**|**Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration**|Flavio Bertini et.al.|[2502.09218v1](http://arxiv.org/abs/2502.09218v1)|null|
 |**2025-02-13**|**Logical Lease Litigation: Prolog and LLMs for Rental Law Compliance in New York**|Sanskar Sehgal et.al.|[2502.09204v1](http://arxiv.org/abs/2502.09204v1)|null|
 |**2025-02-13**|**Two-Stage Representation Learning for Analyzing Movement Behavior Dynamics in People Living with Dementia**|Jin Cui et.al.|[2502.09173v1](http://arxiv.org/abs/2502.09173v1)|null|
+|**2025-02-13**|**TastepepAI, An artificial intelligence platform for taste peptide de novo design**|Jianda Yue et.al.|[2502.12167v1](http://arxiv.org/abs/2502.12167v1)|null|
 |**2025-02-12**|**HistoSmith: Single-Stage Histology Image-Label Generation via Conditional Latent Diffusion for Enhanced Cell Segmentation and Classification**|Valentina Vadori et.al.|[2502.08754v1](http://arxiv.org/abs/2502.08754v1)|[link](https://github.com/Vadori/CytoArk)|
 |**2025-02-12**|**Brain Latent Progression: Individual-based Spatiotemporal Disease Progression on 3D Brain MRIs via Latent Diffusion**|Lemuel Puglisi et.al.|[2502.08560v1](http://arxiv.org/abs/2502.08560v1)|[link](https://github.com/lemuelpuglisi/brlp)|
 |**2025-02-12**|**Representation Learning to Advance Multi-institutional Studies with Electronic Health Record Data**|Doudou Zhou et.al.|[2502.08547v1](http://arxiv.org/abs/2502.08547v1)|null|
@@ -2754,7 +2751,7 @@ sinus bradycardia.
 |**2025-02-11**|**Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension**|Wenbo Gong et.al.|[2502.07752v1](http://arxiv.org/abs/2502.07752v1)|null|
 |**2025-02-11**|**The Devil is in the Prompts: De-Identification Traces Enhance Memorization Risks in Synthetic Chest X-Ray Generation**|Raman Dutt et.al.|[2502.07516v2](http://arxiv.org/abs/2502.07516v2)|[link](https://github.com/Raman1121/diffusion_memorization)|
 |**2025-02-11**|**KPIs 2024 Challenge: Advancing Glomerular Segmentation from Patch- to Slide-Level**|Ruining Deng et.al.|[2502.07288v1](http://arxiv.org/abs/2502.07288v1)|[link](https://github.com/agaldran/kpis)|
-|**2025-02-11**|**Early Risk Prediction of Pediatric Cardiac Arrest from Electronic Health Records via Multimodal Fused Transformer**|Jiaying Lu et.al.|[2502.07158v1](http://arxiv.org/abs/2502.07158v1)|null|
+|**2025-02-11**|**Early Risk Prediction of Pediatric Cardiac Arrest from Electronic Health Records via Multimodal Fused Transformer**|Jiaying Lu et.al.|[2502.07158v2](http://arxiv.org/abs/2502.07158v2)|null|
 |**2025-02-11**|**Explaining 3D Computed Tomography Classifiers with Counterfactuals**|Joseph Paul Cohen et.al.|[2502.07156v1](http://arxiv.org/abs/2502.07156v1)|[link](https://github.com/ieee8023/ct-counterfactuals)|
 |**2025-02-10**|**Interactive Data Harmonization with LLM Agents**|Aécio Santos et.al.|[2502.07132v1](http://arxiv.org/abs/2502.07132v1)|null|
 |**2025-02-10**|**Machine Learning for Everyone: Simplifying Healthcare Analytics with BigQuery ML**|Mohammad Amir Salari et.al.|[2502.07026v1](http://arxiv.org/abs/2502.07026v1)|null|
@@ -2796,17 +2793,150 @@ sinus bradycardia.
 |**2025-02-06**|**A Retrospective Systematic Study on Hierarchical Sparse Query Transformer-assisted Ultrasound Screening for Early Hepatocellular Carcinoma**|Chaoyin She et.al.|[2502.03772v1](http://arxiv.org/abs/2502.03772v1)|[link](https://github.com/Asunatan/HSQformer)|
 |**2025-02-05**|**Towards Fair Medical AI: Adversarial Debiasing of 3D CT Foundation Embeddings**|Guangyao Zheng et.al.|[2502.04386v1](http://arxiv.org/abs/2502.04386v1)|[link](https://github.com/BioIntelligence-Lab/VAE-Adversarial-Debiasing)|
 |**2025-02-05**|**Clinically-Inspired Hierarchical Multi-Label Classification of Chest X-rays with a Penalty-Based Loss Function**|Mehrdad Asadi et.al.|[2502.03591v1](http://arxiv.org/abs/2502.03591v1)|[link](https://github.com/the-mercury/CIHMLC)|
-|**2025-02-05**|**Code Simulation as a Proxy for High-order Tasks in Large Language Models**|Emanuele La Malfa et.al.|[2502.03568v2](http://arxiv.org/abs/2502.03568v2)|null|
-|**2025-02-05**|**Limitations of Large Language Models in Clinical Problem-Solving Arising from Inflexible Reasoning**|Jonathan Kim et.al.|[2502.04381v1](http://arxiv.org/abs/2502.04381v1)|null|
-|**2025-02-05**|**Accurate AI-Driven Emergency Vehicle Location Tracking in Healthcare ITS Digital Twin**|Sarah Al-Shareeda et.al.|[2502.03396v1](http://arxiv.org/abs/2502.03396v1)|null|
-|**2025-02-05**|**RadVLM: A Multitask Conversational Vision-Language Model for Radiology**|Nicolas Deperrois et.al.|[2502.03333v1](http://arxiv.org/abs/2502.03333v1)|null|
-|**2025-02-05**|**MeDiSumQA: Patient-Oriented Question-Answer Generation from Discharge Letters**|Amin Dada et.al.|[2502.03298v1](http://arxiv.org/abs/2502.03298v1)|null|
-|**2025-02-05**|**Deep Learning Pipeline for Fully Automated Myocardial Infarct Segmentation from Clinical Cardiac MR Scans**|Matthias Schwab et.al.|[2502.03272v1](http://arxiv.org/abs/2502.03272v1)|null|
-|**2025-02-05**|**Long-tailed Medical Diagnosis with Relation-aware Representation Learning and Iterative Classifier Calibration**|Li Pan et.al.|[2502.03238v2](http://arxiv.org/abs/2502.03238v2)|[link](https://github.com/peterlipan/lmd)|
-|**2025-02-05**|**Fine-Tuning Strategies for Continual Online EEG Motor Imagery Decoding: Insights from a Large-Scale Longitudinal Study**|Martin Wimpff et.al.|[2502.06828v1](http://arxiv.org/abs/2502.06828v1)|[link](https://github.com/martinwimpff/eeg-continual)|
-|**2025-02-05**|**MedBioLM: Optimizing Medical and Biological QA with Fine-Tuned Large Language Models and Retrieval-Augmented Generation**|Seonok Kim et.al.|[2502.03004v1](http://arxiv.org/abs/2502.03004v1)|null|
 
 #### Abstracts
+##### **Sleepless Nights, Sugary Days: Creating Synthetic Users with Health Conditions for Realistic Coaching Agent Interactions**
+2502.13135v1 by Taedong Yun, Eric Yang, Mustafa Safdari, Jong Ha Lee, Vaishnavi Vinod Kumar, S. Sara Mahdavi, Jonathan Amar, Derek Peyton, Reut Aharony, Andreas Michaelides, Logan Schneider, Isaac Galatzer-Levy, Yugang Jia, John Canny, Arthur Gretton, Maja Matarić
+
+We present an end-to-end framework for generating synthetic users for
+evaluating interactive agents designed to encourage positive behavior changes,
+such as in health and lifestyle coaching. The synthetic users are grounded in
+health and lifestyle conditions, specifically sleep and diabetes management in
+this study, to ensure realistic interactions with the health coaching agent.
+Synthetic users are created in two stages: first, structured data are generated
+grounded in real-world health and lifestyle factors in addition to basic
+demographics and behavioral attributes; second, full profiles of the synthetic
+users are developed conditioned on the structured data. Interactions between
+synthetic users and the coaching agent are simulated using generative
+agent-based models such as Concordia, or directly by prompting a language
+model. Using two independently-developed agents for sleep and diabetes coaching
+as case studies, the validity of this framework is demonstrated by analyzing
+the coaching agent's understanding of the synthetic users' needs and
+challenges. Finally, through multiple blinded evaluations of user-coach
+interactions by human experts, we demonstrate that our synthetic users with
+health and behavioral attributes more accurately portray real human users with
+the same attributes, compared to generic synthetic users not grounded in such
+attributes. The proposed framework lays the foundation for efficient
+development of conversational agents through extensive, realistic, and grounded
+simulated interactions.
+
+摘要：我們提供了一個端到端的架構，用於為評估互動式代理生成合成使用者，這些代理旨在鼓勵正向行為改變，例如健康和生活方式指導。合成使用者以健康和生活方式狀況為基礎，特別是本研究中的睡眠和糖尿病管理，以確保與健康指導代理的互動具有真實性。合成使用者分兩個階段建立：首先，除了基本人口統計資料和行為屬性外，還會產生以現實世界的健康和生活方式因素為基礎的結構化資料；其次，會根據結構化資料開發合成使用者的完整個人資料。合成使用者和指導代理之間的互動是使用生成式基於代理的模型（例如 Concordia）模擬的，或者直接通過提示語言模型來模擬。使用兩個獨立開發的睡眠和糖尿病指導代理作為案例研究，通過分析指導代理對合成使用者需求和挑戰的理解，證明了此架構的有效性。最後，通過人類專家對使用者指導互動進行多重盲測評估，我們證明了與未以這些屬性為基礎的通用合成使用者相比，具有健康和行為屬性的合成使用者更準確地描繪了具有相同屬性的真實人類使用者。所提出的架構為通過廣泛、真實且有根據的模擬互動，為對話代理的有效開發奠定了基礎。
+
+##### **Improving Clinical Question Answering with Multi-Task Learning: A Joint Approach for Answer Extraction and Medical Categorization**
+2502.13108v1 by Priyaranjan Pattnayak, Hitesh Laxmichand Patel, Amit Agarwal, Bhargava Kumar, Srikant Panda, Tejaswini Kumar
+
+Clinical Question Answering (CQA) plays a crucial role in medical
+decision-making, enabling physicians to extract relevant information from
+Electronic Medical Records (EMRs). While transformer-based models such as BERT,
+BioBERT, and ClinicalBERT have demonstrated state-of-the-art performance in
+CQA, existing models lack the ability to categorize extracted answers, which is
+critical for structured retrieval, content filtering, and medical decision
+support.
+  To address this limitation, we introduce a Multi-Task Learning (MTL)
+framework that jointly trains CQA models for both answer extraction and medical
+categorization. In addition to predicting answer spans, our model classifies
+responses into five standardized medical categories: Diagnosis, Medication,
+Symptoms, Procedure, and Lab Reports. This categorization enables more
+structured and interpretable outputs, making clinical QA models more useful in
+real-world healthcare settings.
+  We evaluate our approach on emrQA, a large-scale dataset for medical question
+answering. Results show that MTL improves F1-score by 2.2% compared to standard
+fine-tuning, while achieving 90.7% accuracy in answer categorization. These
+findings suggest that MTL not only enhances CQA performance but also introduces
+an effective mechanism for categorization and structured medical information
+retrieval.
+
+摘要：臨床問答 (CQA) 在醫療決策中扮演著至關重要的角色，讓醫師能夠從電子病歷 (EMR) 中擷取相關資訊。儘管 BERT、BioBERT 和 ClinicalBERT 等基於轉換器的模型已在 CQA 中展現出最先進的效能，但現有的模型缺乏分類擷取答案的能力，這對於結構化檢索、內容過濾和醫療決策支援至關重要。
+  為了解決這個限制，我們引進了一個多任務學習 (MTL) 架構，它同時訓練 CQA 模型用於答案擷取和醫療分類。除了預測答案範圍，我們的模型將回應分類為五個標準化醫療類別：診斷、藥物、症狀、程序和實驗室報告。這種分類能產生更結構化且易於理解的輸出，讓臨床問答模型在真實世界的醫療保健環境中更實用。
+  我們在 emrQA 上評估我們的做法，emrQA 是用於醫療問題解答的大規模資料集。結果顯示，與標準微調相比，MTL 將 F1 分數提高了 2.2%，同時在答案分類中達到 90.7% 的準確度。這些發現表明，MTL 不僅增強了 CQA 的效能，還引入了一種分類和結構化醫療資訊檢索的有效機制。
+
+##### **Fake It Till You Make It: Using Synthetic Data and Domain Knowledge for Improved Text-Based Learning for LGE Detection**
+2502.12948v1 by Athira J Jacob, Puneet Sharma, Daniel Rueckert
+
+Detection of hyperenhancement from cardiac LGE MRI images is a complex task
+requiring significant clinical expertise. Although deep learning-based models
+have shown promising results for the task, they require large amounts of data
+with fine-grained annotations. Clinical reports generated for cardiac MR
+studies contain rich, clinically relevant information, including the location,
+extent and etiology of any scars present. Although recently developed
+CLIP-based training enables pretraining models with image-text pairs, it
+requires large amounts of data and further finetuning strategies on downstream
+tasks. In this study, we use various strategies rooted in domain knowledge to
+train a model for LGE detection solely using text from clinical reports, on a
+relatively small clinical cohort of 965 patients. We improve performance
+through the use of synthetic data augmentation, by systematically creating scar
+images and associated text. In addition, we standardize the orientation of the
+images in an anatomy-informed way to enable better alignment of spatial and
+text features. We also use a captioning loss to enable fine-grained supervision
+and explore the effect of pretraining of the vision encoder on performance.
+Finally, ablation studies are carried out to elucidate the contributions of
+each design component to the overall performance of the model.
+
+摘要：從心臟 LGE MRI 影像偵測出過度增強是一項複雜的任務，需要顯著的臨床專業知識。儘管基於深度學習的模型已顯示出對這項任務有前景的結果，但它們需要大量具有細緻註解的資料。為心臟 MR 研究產生的臨床報告包含豐富且臨床上相關的資訊，包括任何疤痕的位置、範圍和病因。儘管最近開發的基於 CLIP 的訓練能使用影像文字對預訓練模型，但它需要大量資料和進一步微調下游任務的策略。在這項研究中，我們使用植基於領域知識的各種策略，僅使用來自臨床報告的文字，在一個相對較小的 965 名患者臨床群體中訓練一個 LGE 偵測模型。我們透過使用合成資料擴充來改善效能，系統性地建立疤痕影像和相關文字。此外，我們以解剖學告知的方式標準化影像方向，以使空間和文字特徵能更好地對齊。我們也使用標題損失來啟用細緻的監督，並探討視覺編碼器的預訓練對效能的影響。最後，進行消融研究以闡明每個設計元件對模型整體效能的貢獻。
+
+##### **Reasoning and the Trusting Behavior of DeepSeek and GPT: An Experiment Revealing Hidden Fault Lines in Large Language Models**
+2502.12825v1 by Rubing Lu, João Sedoc, Arun Sundararajan
+
+When encountering increasingly frequent performance improvements or cost
+reductions from a new large language model (LLM), developers of applications
+leveraging LLMs must decide whether to take advantage of these improvements or
+stay with older tried-and-tested models. Low perceived switching frictions can
+lead to choices that do not consider more subtle behavior changes that the
+transition may induce. Our experiments use a popular game-theoretic behavioral
+economics model of trust to show stark differences in the trusting behavior of
+OpenAI's and DeepSeek's models. We highlight a collapse in the economic trust
+behavior of the o1-mini and o3-mini models as they reconcile profit-maximizing
+and risk-seeking with future returns from trust, and contrast it with
+DeepSeek's more sophisticated and profitable trusting behavior that stems from
+an ability to incorporate deeper concepts like forward planning and
+theory-of-mind. As LLMs form the basis for high-stakes commercial systems, our
+results highlight the perils of relying on LLM performance benchmarks that are
+too narrowly defined and suggest that careful analysis of their hidden fault
+lines should be part of any organization's AI strategy.
+
+摘要：當遇到越來越頻繁的效能提升或來自於新的大型語言模型 (LLM) 的成本降低時，利用 LLM 的應用程式開發人員必須決定是否要利用這些提升或維持較舊且經過測試的模型。低感知切換摩擦可能會導致選擇不考慮轉換可能誘發的更細微的行為改變。我們的實驗使用信任的流行博弈論行為經濟模型來顯示 OpenAI 和 DeepSeek 模型在信任行為上的顯著差異。我們強調 o1-mini 和 o3-mini 模型的經濟信任行為崩潰，因為它們調和了利潤最大化和風險尋求與來自信任的未來回報，並將其與 DeepSeek 更複雜且有利可圖的信任行為進行對比，這種信任行為源於整合更深層的概念，例如前瞻性規劃和心智理論。由於 LLM 構成高風險商業系統的基礎，我們的結果突顯了依賴定義過於狹窄的 LLM 效能基準的危險性，並建議仔細分析其隱藏的斷層線應該是任何組織的 AI 策略的一部分。
+
+##### **LLM Safety for Children**
+2502.12552v1 by Prasanjit Rath, Hari Shrawgi, Parag Agrawal, Sandipan Dandapat
+
+This paper analyzes the safety of Large Language Models (LLMs) in
+interactions with children below the age of 18 years. Despite the transformative
+applications of LLMs in various aspects of children's lives such as education
+and therapy, there remains a significant gap in understanding and mitigating
+potential content harms specific to this demographic. The study acknowledges
+the diverse nature of children often overlooked by standard safety evaluations
+and proposes a comprehensive approach to evaluating LLM safety specifically for
+children. We list potential risks that children may encounter when using
+LLM-powered applications. Additionally, we develop Child User Models that
+reflect the varied personalities and interests of children informed by
+literature in child care and psychology. These user models aim to bridge the
+existing gap in child safety literature across various fields. We utilize Child
+User Models to evaluate the safety of six state-of-the-art LLMs. Our
+observations reveal significant safety gaps in LLMs, particularly in categories
+harmful to children but not adults.
+
+摘要：本文分析了大型語言模型 (LLM) 在與 18 歲以下兒童互動時的安全性。儘管 LLM 在兒童生活的各個方面（例如教育和治療）都有轉變性的應用，但在了解和減輕對這個群體具體的潛在內容危害方面仍然存在顯著差距。研究承認兒童的多樣性，而標準安全評估通常會忽略這些多樣性，並提出了一種針對兒童評估 LLM 安全性的綜合方法。我們列出了兒童在使用由 LLM 提供動力的應用程式時可能遇到的潛在風險。此外，我們開發了兒童使用者模型，這些模型反映了兒童不同的個性特質和興趣，並參考了兒童照護和心理學的文獻。這些使用者模型旨在彌合不同領域兒童安全文獻中現有的差距。我們利用兒童使用者模型來評估六個最先進的 LLM 的安全性。我們的觀察結果揭示了 LLM 中的重大安全漏洞，特別是在對兒童有害但對成年人無害的類別中。
+
+##### **Classifiers of Data Sharing Statements in Clinical Trial Records**
+2502.12362v1 by Saber Jelodari Mamaghani, Cosima Strantz, Dennis Toddenroth
+
+Digital individual participant data (IPD) from clinical trials are
+increasingly distributed for potential scientific reuse. The identification of
+available IPD, however, requires interpretations of textual data-sharing
+statements (DSS) in large databases. Recent advancements in computational
+linguistics include pre-trained language models that promise to simplify the
+implementation of effective classifiers based on textual inputs. In a subset of
+5,000 textual DSS from ClinicalTrials.gov, we evaluate how well classifiers
+based on domain-specific pre-trained language models reproduce original
+availability categories as well as manually annotated labels. Typical metrics
+indicate that classifiers that predicted manual annotations outperformed those
+that learned to output the original availability categories. This suggests that
+the textual DSS descriptions contain applicable information that the
+availability categories do not, and that such classifiers could thus aid the
+automatic identification of available IPD in large trial databases.
+
+摘要：臨床試驗的數位個人參與者資料 (IPD) 愈來愈廣泛地用於潛在的科學再利用。然而，要找出可用的 IPD，需要對大型資料庫中的文字資料共享聲明 (DSS) 進行詮釋。計算語言學最近的進展包括預先訓練的語言模型，有望簡化根據文字輸入實作有效分類器的過程。在 ClinicalTrials.gov 中的 5,000 個文字 DSS 子集中，我們評估了基於特定領域預先訓練語言模型的分類器，在重現原始可用性類別以及手動註解標籤方面的表現。典型的指標顯示，預測手動註解的分類器優於學會輸出原始可用性類別的分類器。這表示文字 DSS 說明包含可用性類別所沒有的適用資訊，而且此類分類器因此有助於在大型試驗資料庫中自動找出可用的 IPD。
+
 ##### **Relational Norms for Human-AI Cooperation**
 2502.12102v1 by Brian D. Earp, Sebastian Porsdam Mann, Mateo Aboy, Edmond Awad, Monika Betzler, Marietjie Botes, Rachel Calcott, Mina Caraccio, Nick Chater, Mark Coeckelbergh, Mihaela Constantinescu, Hossein Dabbagh, Kate Devlin, Xiaojun Ding, Vilius Dranseika, Jim A. C. Everett, Ruiping Fan, Faisal Feroz, Kathryn B. Francis, Cindy Friedman, Orsolya Friedrich, Iason Gabriel, Ivar Hannikainen, Julie Hellmann, Arasj Khodadade Jahrome, Niranjan S. Janardhanan, Paul Jurcys, Andreas Kappes, Maryam Ali Khan, Gordon Kraft-Todd, Maximilian Kroner Dale, Simon M. Laham, Benjamin Lange, Muriel Leuenberger, Jonathan Lewis, Peng Liu, David M. Lyreskog, Matthijs Maas, John McMillan, Emilian Mihailov, Timo Minssen, Joshua Teperowski Monrad, Kathryn Muyskens, Simon Myers, Sven Nyholm, Alexa M. Owen, Anna Puzio, Christopher Register, Madeline G. Reinecke, Adam Safron, Henry Shevlin, Hayate Shimizu, Peter V. Treit, Cristina Voinea, Karen Yan, Anda Zahiu, Renwen Zhang, Hazem Zohny, Walter Sinnott-Armstrong, Ilina Singh, Julian Savulescu, Margaret S. Clark
 
@@ -3096,6 +3226,28 @@ chatbot applications.
 
 摘要：大型語言模型 (LLM) 已大幅推動自然語言生成的領域。然而，它們經常產生未經驗證的輸出，這會損害它們在關鍵應用中的可靠性。在本研究中，我們提出了一個創新的框架，透過檢索增強生成技術，將結構化的生物醫學知識與 LLM 結合。我們的系統透過識別和精煉與年齡相關性黃斑部病變 (AMD) 相關的醫學摘要中的因果關係和命名實體，開發一個徹底的知識圖譜。我們的框架使用基於向量的檢索流程和本地部署的語言模型，產生在脈絡上相關且可驗證的回應，並直接參考臨床證據。實驗結果顯示，此方法顯著減少了幻覺、增強了事實準確性，並改善了生成回應的清晰度，為先進的生物醫學聊天機器人應用程式提供了穩健的解決方案。
 
+##### **Predicting Depression in Screening Interviews from Interactive Multi-Theme Collaboration**
+2502.12204v1 by Xianbing Zhao, Yiqing Lyu, Di Wang, Buzhou Tang
+
+Automatic depression detection provides cues for early clinical intervention
+by clinicians. Clinical interviews for depression detection involve dialogues
+centered around multiple themes. Existing studies primarily design end-to-end
+neural network models to capture the hierarchical structure of clinical
+interview dialogues. However, these methods exhibit defects in modeling the
+thematic content of clinical interviews: 1) they fail to capture intra-theme
+and inter-theme correlation explicitly, and 2) they do not allow clinicians to
+intervene and focus on themes of interest. To address these issues, this paper
+introduces PDIMC, an interactive depression detection framework. This framework
+leverages in-context learning techniques to identify themes in clinical
+interviews and then models both intra-theme and inter-theme correlation.
+Additionally, it employs AI-driven feedback to simulate the interests of
+clinicians, enabling interactive adjustment of theme importance. PDIMC achieves
+absolute improvements of 35% and 12% compared to the state of the art on the
+depression detection dataset DAIC-WOZ, which demonstrates the effectiveness of
+modeling theme correlation and incorporating interactive external feedback.
+
+摘要：自動憂鬱症偵測提供臨床醫師早期臨床介入的線索。憂鬱症偵測的臨床訪談涉及以多個主題為中心的對話。現有研究主要設計端對端的類神經網路模型來捕捉臨床訪談對話的階層結構。然而，這些方法在建模臨床訪談的主題內容時表現出缺陷：1）它們無法明確捕捉主題內和主題間的關聯性，以及 2）它們不允許臨床醫師介入並專注於感興趣的主題。為了解決這些問題，本文介紹了一個互動式憂鬱症偵測框架。此框架利用情境學習技術來識別臨床訪談中的主題，然後對主題內和主題間的關聯性進行建模。此外，它採用 AI 驅動的回饋來模擬臨床醫師的興趣，實現主題重要性的互動式調整。與 DAIC-WOZ 憂鬱症偵測資料集上的最新技術相比，PDIMC 的絕對改進率分別為 35% 和 12%，這證明了對主題關聯性建模和納入互動式外部回饋的有效性。
+
 ##### **CL-MFAP: A Contrastive Learning-Based Multimodal Foundation Model for Molecular Property Prediction and Antibiotic Screening**
 2502.11001v1 by Gen Zhou, Sugitha Janarthanan, Yutong Lu, Pingzhao Hu
 
@@ -3364,6 +3516,20 @@ differences, such as rotation and cropping.
 
 摘要：随着人工智能在我们的生活中变得越来越普遍，人们正在享受它带来的便利，但也面临着隐藏的威胁，例如数据中毒和对抗性攻击。这些威胁可能对人工智能的应用产生灾难性后果，特别是对于一些立即生效的应用，例如自动驾驶和医疗领域。在这些威胁中，后门攻击以其隐蔽性和简单的部署给人们留下了深刻的印象，使其成为不可忽视的威胁，然而，在部署后门模型的过程中，后门攻击往往存在一些使其在实际应用中不尽如人意的原因，例如抖动和亮度变化。基于此，我们提出了一种高度鲁棒的后门攻击，该攻击对目标样本进行平移并将其与自身结合以形成后门样本，即置换后门攻击 (DBA)。实验结果表明，DBA 攻击可以抵抗模拟真实世界差异的数据增强，例如旋转和裁剪。
 
+##### **3D ReX: Causal Explanations in 3D Neuroimaging Classification**
+2502.12181v1 by Melane Navaratnarajah, Sophie A. Martin, David A. Kelly, Nathan Blake, Hana Chocker
+
+Explainability remains a significant problem for AI models in medical
+imaging, making it challenging for clinicians to trust AI-driven predictions.
+We introduce 3D ReX, the first causality-based post-hoc explainability tool for
+3D models. 3D ReX uses the theory of actual causality to generate
+responsibility maps which highlight the regions most crucial to the model's
+decision. We test 3D ReX on a stroke detection model, providing insight into
+the spatial distribution of features relevant to stroke.
+
+摘要：解釋性仍然是醫療影像中 AI 模型的一大問題，這使得臨床醫生難以信任 AI 驅動的預測。
+我們引入了 3D ReX，這是第一個用於 3D 模型的基於因果關係的事後解釋性工具。3D ReX 使用實際因果關係理論來生成責任圖，該圖突出了對模型決策至關重要的區域。我們在中風檢測模型上測試了 3D ReX，提供了與中風相關特徵的空間分佈的見解。
+
 ##### **Analyzing Patient Daily Movement Behavior Dynamics Using Two-Stage Encoding Model**
 2502.09947v1 by Jin Cui, Alexander Capstick, Payam Barnaghi, Gregory Scott
 
@@ -3749,6 +3915,32 @@ care interventions, and large-scale health monitoring.
 
 摘要：在遠程醫療監控中，時間序列表示學習揭示了高頻率數據中的關鍵患者行為模式。本研究通過提出一個兩階段、自我監督的學習方法來分析痴呆症患者的家庭活動數據，該方法專門用於發現低秩結構。第一階段將時間序列活動轉換為由預訓練語言模型編碼的文本序列，使用基於 PageRank 的方法提供了一個豐富、高維的潛在狀態空間。此 PageRank 向量捕獲潛在狀態轉換，有效地將複雜的行為數據壓縮成簡潔的形式，從而增強了解力。此低秩表示不僅增強了模型的可解釋性，還促進了聚類和轉換分析，揭示了與臨床指標（例如 MMSE 和 ADAS-COG 分數）相關的關鍵行為模式。我們的研究結果證明了該框架在支持認知狀態預測、個性化護理干預和大型健康監控方面的潛力。
 
+##### **TastepepAI, An artificial intelligence platform for taste peptide de novo design**
+2502.12167v1 by Jianda Yue, Tingting Li, Jian Ouyang, Jiawei Xu, Hua Tan, Zihui Chen, Changsheng Han, Huanyu Li, Songping Liang, Zhonghua Liu, Zhonghua Liu, Ying Wang
+
+Taste peptides have emerged as promising natural flavoring agents owing
+to their unique organoleptic properties, high safety profile, and potential
+health benefits. However, the de novo identification of taste peptides derived
+from animal, plant, or microbial sources remains a time-consuming and
+resource-intensive process, significantly impeding their widespread application
+in the food industry. Here, we present TastePepAI, a comprehensive artificial
+intelligence framework for customized taste peptide design and safety
+assessment. As the key element of this framework, a loss-supervised adaptive
+variational autoencoder (LA-VAE) is implemented to efficiently optimize the
+latent representation of sequences during training and facilitate the
+generation of target peptides with desired taste profiles. Notably, our model
+incorporates a novel taste-avoidance mechanism, allowing for selective flavor
+exclusion. Subsequently, our in-house developed toxicity prediction algorithm
+(SpepToxPred) is integrated into the framework to perform rigorous safety
+evaluation of generated peptides. Using this integrated platform, we
+successfully identified 73 peptides exhibiting sweet, salty, and umami tastes,
+significantly expanding the current repertoire of taste peptides. This work
+demonstrates the potential of TastePepAI in accelerating taste peptide
+discovery for food applications and provides a versatile framework adaptable to
+broader peptide engineering challenges.
+
+摘要：味觉肽因其独特的感官特性、高安全性概况和潜在的健康益处而成为有前途的天然调味剂。然而，从动物、植物或微生物来源中从头鉴定味觉肽仍然是一个耗时且资源密集的过程，严重阻碍了它们在食品工业中的广泛应用。在此，我们提出了 TastePepAI，这是一个用于定制味觉肽设计和安全性评估的综合人工智能框架。作为该框架的关键元素，实现了损失监督自适应变分自动编码器 (LA-VAE)，以在训练期间有效优化序列的潜在表示，并促进生成具有所需味觉特征的目标肽。值得注意的是，我们的模型包含了一种新颖的味觉回避机制，允许选择性排除风味。随后，我们内部开发的毒性预测算法 (SpepToxPred) 被集成到框架中，以对生成的肽进行严格的安全评估。使用这个集成平台，我们成功地鉴定了 73 种表现出甜味、咸味和鲜味的肽，极大地扩展了当前的味觉肽库。这项工作展示了 TastePepAI 在加速味觉肽发现以用于食品应用方面的潜力，并提供了一个适用于更广泛的肽工程挑战的多功能框架。
+
 ##### **HistoSmith: Single-Stage Histology Image-Label Generation via Conditional Latent Diffusion for Enhanced Cell Segmentation and Classification**
 2502.08754v1 by Valentina Vadori, Jean-Marie Graïc, Antonella Peruffo, Livio Finos, Ujwala Kiran Chaudhari, Enrico Grisan
 
@@ -4045,7 +4237,7 @@ CKD 臨床前嚙齒動物模型的資料集，這些註解來自 60 多個
 疾病研究和診斷量化。
 
 ##### **Early Risk Prediction of Pediatric Cardiac Arrest from Electronic Health Records via Multimodal Fused Transformer**
-2502.07158v1 by Jiaying Lu, Stephanie R. Brown, Songyuan Liu, Shifan Zhao, Kejun Dong, Del Bold, Michael Fundora, Alaa Aljiffry, Alex Fedorov, Jocelyn Grunwell, Xiao Hu
+2502.07158v2 by Jiaying Lu, Stephanie R. Brown, Songyuan Liu, Shifan Zhao, Kejun Dong, Del Bold, Michael Fundora, Alaa Aljiffry, Alex Fedorov, Jocelyn Grunwell, Xiao Hu
 
 Early prediction of pediatric cardiac arrest (CA) is critical for timely
 intervention in high-risk intensive care settings. We introduce PedCA-FT, a
@@ -4060,7 +4252,7 @@ and identifies clinically meaningful risk factors. These findings underscore
 the potential of multimodal fusion techniques to enhance early CA detection and
 improve patient care.
 
-摘要：早期預測兒童心臟驟停 (CA) 對高風險重症監護環境中的及時干預至關重要。我們引入了 PedCA-FT，這是一個新的基於Transformer的框架，它將 EHR 的表格視圖與 EHR 的派生文本視圖融合在一起，以充分釋放高維風險因素及其動態的交互作用。通過為每個模態視圖採用專用的Transformer模塊，PedCA-FT 捕獲復雜的時間和上下文模式以產生穩健的 CA 風險估計。在 CHOA-CICU 資料庫中經過策劃的兒科隊列上進行評估，我們的做法在五個關鍵性能指標上優於其他十個人工智慧模型，並識別出臨床上有意義的風險因素。這些發現強調了多模態融合技術在增強早期 CA 檢測和改善患者護理方面的潛力。
+摘要：早期預測小兒心臟驟停 (CA) 對於在高風險的重症照護環境中及時介入至關重要。我們引入了 PedCA-FT，一個新穎的基於轉換器的框架，它將 EHR 的表格視圖與 EHR 的派生文本視圖融合在一起，以充分發揮高維風險因素及其動態的交互作用。通過為每個模態視圖採用專用的轉換器模組，PedCA-FT 捕獲複雜的時間和上下文模式，以產生穩健的 CA 風險估計。在 CHOA-CICU 資料庫中策劃的小兒群體中進行評估，我們的做法在五項關鍵績效指標中優於其他十種人工智慧模型，並找出臨床上有意義的風險因素。這些發現強調了多模式融合技術在增強早期 CA 檢測和改善患者照護方面的潛力。
 
 ##### **Explaining 3D Computed Tomography Classifiers with Counterfactuals**
 2502.07156v1 by Joseph Paul Cohen, Louis Blankemeier, Akshay Chaudhari
@@ -5101,2682 +5293,16 @@ experiment details are made available.
 
 摘要：在本文中，我們提出胸部 X 光（CXR）影像多標籤分類的新方法，在維持簡化的單一模型、單次執行訓練管線的同時，提升臨床可解釋性。利用 CheXpert 資料集和 VisualCheXbert 衍生的標籤，我們納入階層標籤群組，以擷取診斷之間具有臨床意義的關聯性。為此，我們設計了自訂的階層二元交叉熵 (HBCE) 損失函數，使用固定或資料驅動的懲罰類型來強制執行標籤依賴性。我們的模型在測試集上達到受試者工作特性曲線 (AUROC) 下的平均面積為 0.903。此外，我們提供視覺化說明和不確定性估計，以進一步提升模型可解釋性。所有程式碼、模型組態和實驗詳細資料皆已公開。
 
-##### **Code Simulation as a Proxy for High-order Tasks in Large Language Models**
-2502.03568v2 by Emanuele La Malfa, Christoph Weinhuber, Orazio Torre, Fangru Lin, X. Angelo Huang, Samuele Marro, Anthony Cohn, Nigel Shadbolt, Michael Wooldridge
-
-Many reasoning, planning, and problem-solving tasks share an intrinsic
-algorithmic nature: correctly simulating each step is a sufficient condition to
-solve them correctly. We collect pairs of naturalistic and synthetic reasoning
-tasks to assess the capabilities of Large Language Models (LLM). While
-naturalistic tasks often require careful human handcrafting, we show that
-synthetic data is, in many cases, a good proxy that is much easier to collect
-at scale. We leverage common constructs in programming as the counterpart of
-the building blocks of naturalistic reasoning tasks, such as straight-line
-programs, code that contains critical paths, and approximate and redundant
-instructions. We further assess the capabilities of LLMs on sorting problems
-and repeated operations via sorting algorithms and nested loops. Our synthetic
-datasets further reveal that while the most powerful LLMs exhibit relatively
-strong execution capabilities, the process is fragile: it is negatively
-affected by memorisation and seems to rely heavily on pattern recognition. Our
-contribution builds upon synthetically testing the reasoning capabilities of
-LLMs as a scalable complement to handcrafted human-annotated problems.
-
-摘要：許多推理、規劃和問題解決任務都具有內在的演算法性質：正確模擬每一步是正確解決它們的充分條件。我們收集自然主義和合成推理任務對，以評估大型語言模型 (LLM) 的能力。雖然自然主義任務通常需要仔細的人工製作，但我們表明在許多情況下，合成數據是一個很好的代理，而且更容易大規模收集。我們利用程式設計中的常見結構作為自然主義推理任務建構區塊的對應物，例如直線程式、包含關鍵路徑的程式碼，以及近似和冗餘指令。我們進一步評估 LLM 在排序問題和重複操作方面的能力，透過排序演算法和巢狀迴圈。我們的合成資料集進一步揭示，雖然最強大的 LLM 表現出相對強大的執行能力，但這個過程很脆弱：它受到記憶的負面影響，而且似乎很依賴模式辨識。我們的貢獻建立在合成測試 LLM 的推理能力之上，作為手工製作的人工標記問題的可擴充補充。
-
-##### **Limitations of Large Language Models in Clinical Problem-Solving Arising from Inflexible Reasoning**
-2502.04381v1 by Jonathan Kim, Anna Podlasek, Kie Shidara, Feng Liu, Ahmed Alaa, Danilo Bernardo
-
-Large Language Models (LLMs) have attained human-level accuracy on medical
-question-answer (QA) benchmarks. However, their limitations in navigating
-open-ended clinical scenarios have recently been shown, raising concerns about
-the robustness and generalizability of LLM reasoning across diverse, real-world
-medical tasks. To probe potential LLM failure modes in clinical
-problem-solving, we present the medical abstraction and reasoning corpus
-(M-ARC). M-ARC assesses clinical reasoning through scenarios designed to
-exploit the Einstellung effect -- the fixation of thought arising from prior
-experience, targeting LLM inductive biases toward inflexible pattern matching
-from their training data rather than engaging in flexible reasoning. We find
-that LLMs, including current state-of-the-art o1 and Gemini models, perform
-poorly compared to physicians on M-ARC, often demonstrating lack of commonsense
-medical reasoning and a propensity to hallucinate. In addition, uncertainty
-estimation analyses indicate that LLMs exhibit overconfidence in their answers,
-despite their limited accuracy. The failure modes revealed by M-ARC in LLM
-medical reasoning underscore the need to exercise caution when deploying these
-models in clinical settings.
-
-摘要：大型語言模型 (LLM) 已在醫療問題解答 (QA) 基準上達到人類層級的準確度。然而，它們在應對開放式臨床場景中的局限性最近已被揭示，引發了人們對 LLM 推理在多樣化、真實世界醫療任務中的穩健性和概括性的擔憂。為了探討臨床問題解決中 LLM 的潛在故障模式，我們提出了醫療抽象和推理語料庫 (M-ARC)。M-ARC 通過旨在利用艾賓浩斯錯覺（由先前經驗產生的思維定勢）來評估臨床推理，針對 LLM 歸納偏誤，使其從訓練數據中進行僵化的模式匹配，而不是進行靈活的推理。我們發現，包括當前最先進的 o1 和 Gemini 模型在內的 LLM，在 M-ARC 上的表現遠不如醫生，它們經常表現出缺乏常識性的醫療推理和產生幻覺的傾向。此外，不確定性估計分析表明，儘管 LLM 準確性有限，但它們對自己的答案表現出過度自信。M-ARC 揭示的 LLM 醫療推理故障模式強調了在臨床環境中部署這些模型時需要謹慎。
-
-##### **Accurate AI-Driven Emergency Vehicle Location Tracking in Healthcare ITS Digital Twin**
-2502.03396v1 by Sarah Al-Shareeda, Yasar Celik, Bilge Bilgili, Ahmed Al-Dubai, Berk Canberk
-
-Creating a Digital Twin (DT) for Healthcare Intelligent Transportation
-Systems (HITS) is a hot research trend focusing on enhancing HITS management,
-particularly in emergencies where ambulance vehicles must arrive at the crash
-scene on time and track their real-time location is crucial to the medical
-authorities. Despite the claim of real-time representation, a temporal
-misalignment persists between the physical and virtual domains, leading to
-discrepancies in the ambulance's location representation. This study proposes
-integrating AI predictive models, specifically Support Vector Regression (SVR)
-and Deep Neural Networks (DNN), within a constructed mock DT data pipeline
-framework to anticipate the medical vehicle's next location in the virtual
-world. These models align virtual representations with their physical
-counterparts, i.e., metaphorically offsetting the synchronization delay between
-the two worlds. Trained meticulously on a historical geospatial dataset, SVR
-and DNN exhibit exceptional prediction accuracy in MATLAB and Python
-environments. Through various testing scenarios, we visually demonstrate the
-efficacy of our methodology, showcasing SVR and DNN's key role in significantly
-reducing the witnessed gap within the HITS's DT. This transformative approach
-enhances real-time synchronization in emergency HITS by approximately 88% to
-93%.
-
-摘要：建立醫療智慧交通系統（HITS）的數位分身（DT）是熱門的研究趨勢，其重點在於提升 HITS 管理，特別是在救護車必須準時抵達車禍現場的緊急情況中，追蹤其即時位置對於醫療單位至關重要。儘管聲稱即時呈現，但實體和虛擬領域之間仍存在時間上的錯位，導致救護車位置呈現上的差異。本研究建議在建構的虛擬 DT 資料管道架構中整合人工智慧預測模型，特別是支援向量回歸（SVR）和深度神經網路（DNN），以預測醫療車輛在虛擬世界的下一個位置。這些模型將虛擬呈現與其實體對應物對齊，也就是說，在兩個世界之間比喻性地抵銷同步延遲。在歷史地理空間資料集上經過仔細訓練，SVR 和 DNN 在 MATLAB 和 Python 環境中展現出卓越的預測準確性。透過各種測試情境，我們視覺化展示了我們方法論的效能，展示了 SVR 和 DNN 在顯著縮小 HITS 的 DT 中見證到的差距方面的關鍵作用。這種變革性的方法將緊急 HITS 中的即時同步提升了大約 88% 到 93%。
-
-##### **RadVLM: A Multitask Conversational Vision-Language Model for Radiology**
-2502.03333v1 by Nicolas Deperrois, Hidetoshi Matsuo, Samuel Ruipérez-Campillo, Moritz Vandenhirtz, Sonia Laguna, Alain Ryser, Koji Fujimoto, Mizuho Nishio, Thomas M. Sutter, Julia E. Vogt, Jonas Kluckert, Thomas Frauenfelder, Christian Blüthgen, Farhad Nooralahzadeh, Michael Krauthammer
-
-The widespread use of chest X-rays (CXRs), coupled with a shortage of
-radiologists, has driven growing interest in automated CXR analysis and
-AI-assisted reporting. While existing vision-language models (VLMs) show
-promise in specific tasks such as report generation or abnormality detection,
-they often lack support for interactive diagnostic capabilities. In this work
-we present RadVLM, a compact, multitask conversational foundation model
-designed for CXR interpretation. To this end, we curate a large-scale
-instruction dataset comprising over 1 million image-instruction pairs
-containing both single-turn tasks -- such as report generation, abnormality
-classification, and visual grounding -- and multi-turn, multi-task
-conversational interactions. After fine-tuning RadVLM on this instruction
-dataset, we evaluate it across different tasks along with re-implemented
-baseline VLMs. Our results show that RadVLM achieves state-of-the-art
-performance in conversational capabilities and visual grounding while remaining
-competitive in other radiology tasks. Ablation studies further highlight the
-benefit of joint training across multiple tasks, particularly for scenarios
-with limited annotated data. Together, these findings highlight the potential
-of RadVLM as a clinically relevant AI assistant, providing structured CXR
-interpretation and conversational capabilities to support more effective and
-accessible diagnostic workflows.
-
-摘要：胸部 X 光 (CXR) 的广泛使用，加上放射科醫師短缺，促使人們對自動化 CXR 分析和 AI 輔助報告產生越來越濃厚的興趣。雖然現有的視覺語言模型 (VLM) 在特定任務中顯示出前景，例如報告生成或異常偵測，但它們通常缺乏對互動式診斷功能的支持。在這項工作中，我們提出 RadVLM，這是一個緊湊的多任務對話式基礎模型，專為 CXR 解釋而設計。為此，我們策劃了一個大型指令資料集，包含超過 100 萬個影像指令對，其中包含單輪任務（例如報告生成、異常分類和視覺基礎），以及多輪、多任務對話互動。在對這個指令資料集進行微調後，我們對 RadVLM 進行評估，並與重新實作的基準 VLM 一起執行不同的任務。我們的結果顯示，RadVLM 在對話能力和視覺基礎方面取得了最先進的效能，同時在其他放射學任務中仍具有競爭力。消融研究進一步突顯了跨多個任務進行聯合訓練的好處，特別是對於帶有標註資料有限的場景。這些發現共同突顯了 RadVLM 作為臨床相關 AI 助理的潛力，提供結構化的 CXR 解釋和對話能力，以支援更有效且可存取的診斷工作流程。
-
-##### **MeDiSumQA: Patient-Oriented Question-Answer Generation from Discharge Letters**
-2502.03298v1 by Amin Dada, Osman Alperen Koras, Marie Bauer, Amanda Butler, Kaleb E. Smith, Jens Kleesiek, Julian Friedrich
-
-While increasing patients' access to medical documents improves medical care,
-this benefit is limited by varying health literacy levels and complex medical
-terminology. Large language models (LLMs) offer solutions by simplifying
-medical information. However, evaluating LLMs for safe and patient-friendly
-text generation is difficult due to the lack of standardized evaluation
-resources. To fill this gap, we developed MeDiSumQA. MeDiSumQA is a dataset
-created from MIMIC-IV discharge summaries through an automated pipeline
-combining LLM-based question-answer generation with manual quality checks. We
-use this dataset to evaluate various LLMs on patient-oriented
-question-answering. Our findings reveal that general-purpose LLMs frequently
-surpass biomedical-adapted models, while automated metrics correlate with human
-judgment. By releasing MeDiSumQA on PhysioNet, we aim to advance the
-development of LLMs to enhance patient understanding and ultimately improve
-care outcomes.
-
-摘要：儘管讓患者更能取得醫療文件有助於改善醫療照護，
-但此優點受到不同的健康素養程度和複雜的醫療術語所限制。大型語言模型 (LLM) 提供了簡化醫療資訊的解決方案。然而，由於缺乏標準化的評估資源，因此難以評估 LLM 以確保其安全且對患者友善的文字產生。為了填補此缺口，我們開發了 MeDiSumQA。MeDiSumQA 是透過自動化流程從 MIMIC-IV 出院摘要中建立的資料集，結合了基於 LLM 的問答產生和手動品質檢查。我們使用此資料集來評估各種 LLM 在以患者為導向的問答中。我們的發現顯示，通用 LLM 經常超越生物醫學適應模型，而自動化指標與人類判斷相關。透過在 PhysioNet 上發布 MeDiSumQA，我們旨在推動 LLM 的發展，以增進患者理解，並最終改善照護成果。
-
-##### **Deep Learning Pipeline for Fully Automated Myocardial Infarct Segmentation from Clinical Cardiac MR Scans**
-2502.03272v1 by Matthias Schwab, Mathias Pamminger, Christian Kremser, Agnes Mayr
-
-Purpose: To develop and evaluate a deep learning-based method that allows to
-perform myocardial infarct segmentation in a fully-automated way.
-  Materials and Methods: For this retrospective study, a cascaded framework of
-two and three-dimensional convolutional neural networks (CNNs), specialized on
-identifying ischemic myocardial scars on late gadolinium enhancement (LGE)
-cardiac magnetic resonance (CMR) images, was trained on an in-house training
-dataset consisting of 144 examinations. On a separate test dataset from the
-same institution, including images from 152 examinations obtained between 2021
-and 2023, a quantitative comparison between artificial intelligence (AI)-based
-segmentations and manual segmentations was performed. Further, qualitative
-assessment of segmentation accuracy was evaluated for both human and
-AI-generated contours by two CMR experts in a blinded experiment.
-  Results: Excellent agreement could be found between manually and
-automatically calculated infarct volumes ($\rho_c$ = 0.9). The qualitative
-evaluation showed that compared to human-based measurements, the experts rated
-the AI-based segmentations to better represent the actual extent of infarction
-significantly (p < 0.001) more often (33.4% AI, 25.1% human, 41.5% equal). On
-the contrary, for segmentation of microvascular obstruction (MVO), manual
-measurements were still preferred (11.3% AI, 55.6% human, 33.1% equal).
-  Conclusion: This fully-automated segmentation pipeline enables CMR infarct
-size to be calculated in a very short time and without requiring any
-pre-processing of the input images while matching the segmentation quality of
-trained human observers. In a blinded experiment, experts preferred automated
-infarct segmentations more often than manual segmentations, paving the way for
-a potential clinical application.
-
-摘要：<paragraph>目的：開發和評估一種基於深度學習的方法，允許以全自動的方式執行心肌梗塞分割。
-材料和方法：對於這項回顧性研究，一個由二維和三維卷積神經網路 (CNN) 組成的串聯架構，專門用於識別晚期釓增強 (LGE) 心臟磁振造影 (CMR) 影像上的缺血性心肌疤痕，並在包含 144 項檢查的內部訓練資料集上受訓。在來自同一家機構的獨立測試資料集上，包括 2021 年至 2023 年間獲得的 152 項檢查的影像，執行基於人工智慧 (AI) 的分割和手動分割之間的定量比較。此外，由兩位 CMR 專家在盲測實驗中評估人類和 AI 生成的輪廓的分割準確度。
-結果：在手動和自動計算的梗塞體積之間可以發現極佳的一致性（ρ_c = 0.9）。定性評估顯示，與基於人類的測量相比，專家評估 AI 基於分割能更能代表梗塞的實際範圍，顯著（p < 0.001）更常發生（33.4% AI，25.1% 人類，41.5% 相等）。相反，對於微血管阻塞 (MVO) 的分割，手動測量仍然較受青睞（11.3% AI，55.6% 人類，33.1% 相等）。
-結論：這個全自動分割管道可以在很短的時間內計算 CMR 梗塞大小，而且無需對輸入影像進行任何前處理，同時匹配受過訓練的人類觀察者的分割品質。在盲測實驗中，專家比手動分割更常偏好自動梗塞分割，為潛在的臨床應用鋪平了道路。</paragraph>
-
-##### **Long-tailed Medical Diagnosis with Relation-aware Representation Learning and Iterative Classifier Calibration**
-2502.03238v2 by Li Pan, Yupei Zhang, Qiushi Yang, Tan Li, Zhen Chen
-
-Recently computer-aided diagnosis has demonstrated promising performance,
-effectively alleviating the workload of clinicians. However, the inherent
-sample imbalance among different diseases leads algorithms biased to the
-majority categories, leading to poor performance for rare categories. Existing
-works formulated this challenge as a long-tailed problem and attempted to
-tackle it by decoupling the feature representation and classification. Yet, due
-to the imbalanced distribution and limited samples from tail classes, these
-works are prone to biased representation learning and insufficient classifier
-calibration. To tackle these problems, we propose a new Long-tailed Medical
-Diagnosis (LMD) framework for balanced medical image classification on
-long-tailed datasets. In the initial stage, we develop a Relation-aware
-Representation Learning (RRL) scheme to boost the representation ability by
-encouraging the encoder to capture intrinsic semantic features through
-different data augmentations. In the subsequent stage, we propose an Iterative
-Classifier Calibration (ICC) scheme to calibrate the classifier iteratively.
-This is achieved by generating a large number of balanced virtual features and
-fine-tuning the encoder using an Expectation-Maximization manner. The proposed
-ICC compensates for minority categories to facilitate unbiased classifier
-optimization while maintaining the diagnostic knowledge in majority classes.
-Comprehensive experiments on three public long-tailed medical datasets
-demonstrate that our LMD framework significantly surpasses state-of-the-art
-approaches. The source code can be accessed at
-https://github.com/peterlipan/LMD.
-
-摘要：<paragraph>最近，计算机辅助诊断已展现出可观的表现，有效减轻了临床医生的工作量。然而，不同疾病之间固有的样本不平衡导致算法偏向于多数类别，从而导致罕见类别表现不佳。现有工作将这一挑战表述为长尾问题，并尝试通过解耦特征表示和分类来解决它。然而，由于不平衡分布和尾类样本有限，这些工作容易出现有偏差的表示学习和分类器校准不足。为了解决这些问题，我们提出了一个新的长尾医学诊断 (LMD) 框架，用于对长尾数据集进行平衡的医学图像分类。在初始阶段，我们开发了一个关系感知表示学习 (RRL) 方案，通过鼓励编码器通过不同的数据增强来捕获内在语义特征，从而提升表示能力。在后续阶段，我们提出了一个迭代分类器校准 (ICC) 方案，以迭代方式校准分类器。这是通过生成大量的平衡虚拟特征并使用期望最大化方式微调编码器来实现的。所提出的 ICC 补偿了少数类别，以促进无偏分类器优化，同时保持多数类别的诊断知识。在三个公共长尾医学数据集上进行的综合实验表明，我们的 LMD 框架明显超越了最先进的方法。源代码可在 https://github.com/peterlipan/LMD 处获取。</paragraph>
-
-##### **Fine-Tuning Strategies for Continual Online EEG Motor Imagery Decoding: Insights from a Large-Scale Longitudinal Study**
-2502.06828v1 by Martin Wimpff, Bruno Aristimunha, Sylvain Chevallier, Bin Yang
-
-This study investigates continual fine-tuning strategies for deep learning in
-online longitudinal electroencephalography (EEG) motor imagery (MI) decoding
-within a causal setting involving a large user group and multiple sessions per
-participant. We are the first to explore such strategies across a large user
-group, as longitudinal adaptation is typically studied in the single-subject
-setting with a single adaptation strategy, which limits the ability to
-generalize findings. First, we examine the impact of different fine-tuning
-approaches on decoder performance and stability. Building on this, we integrate
-online test-time adaptation (OTTA) to adapt the model during deployment,
-complementing the effects of prior fine-tuning. Our findings demonstrate that
-fine-tuning that successively builds on prior subject-specific information
-improves both performance and stability, while OTTA effectively adapts the
-model to evolving data distributions across consecutive sessions, enabling
-calibration-free operation. These results offer valuable insights and
-recommendations for future research in longitudinal online MI decoding and
-highlight the importance of combining domain adaptation strategies for
-improving BCI performance in real-world applications. Clinical Relevance: Our
-investigation enables more stable and efficient long-term motor imagery
-decoding, which is critical for neurorehabilitation and assistive technologies.
-
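-Summary: This study investigates continual fine-tuning strategies for deep learning in online longitudinal electroencephalography (EEG) motor imagery (MI) decoding, in a causal setting with a large user group and multiple sessions per participant. We are the first to explore such strategies across a large user group; longitudinal adaptation is usually studied in a single-subject setting with a single adaptation strategy, which limits how far findings generalize. First, we examine how different fine-tuning approaches affect decoder performance and stability. Building on this, we integrate online test-time adaptation (OTTA) to adapt the model during deployment, complementing the effects of prior fine-tuning. Our findings show that fine-tuning that successively builds on prior subject-specific information improves both performance and stability, while OTTA effectively adapts the model to evolving data distributions across consecutive sessions, enabling calibration-free operation. These results offer insights and recommendations for future research in longitudinal online MI decoding and highlight the importance of combining domain adaptation strategies to improve BCI performance in real-world applications. Clinical relevance: our investigation enables more stable and efficient long-term motor imagery decoding, which is critical for neurorehabilitation and assistive technologies.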
-
-##### **MedBioLM: Optimizing Medical and Biological QA with Fine-Tuned Large Language Models and Retrieval-Augmented Generation**
-2502.03004v1 by Seonok Kim
-
-Large Language Models (LLMs) have demonstrated impressive capabilities across
-natural language processing tasks. However, their application to specialized
-domains such as medicine and biology requires further optimization to ensure
-factual accuracy, reliability, and contextual depth. We introduce MedBioLM, a
-domain-adapted biomedical question-answering model designed to enhance both
-short-form and long-form queries. By integrating fine-tuning and
-retrieval-augmented generation (RAG), MedBioLM dynamically incorporates
-domain-specific knowledge, improving reasoning abilities and factual accuracy.
-To evaluate its effectiveness, we fine-tuned the model on diverse biomedical QA
-datasets, covering structured multiple-choice assessments and complex clinical
-reasoning tasks. Fine-tuning significantly improves accuracy on benchmark
-datasets, while RAG enhances factual consistency. These results highlight the
-potential of domain-optimized LLMs in advancing biomedical research, medical
-education, and clinical decision support.
-
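-Summary: Large Language Models (LLMs) have demonstrated impressive capabilities across natural language processing tasks. However, applying them to specialized domains such as medicine and biology requires further optimization to ensure factual accuracy, reliability, and contextual depth. We introduce MedBioLM, a domain-adapted biomedical question-answering model designed to improve answers to both short-form and long-form queries. By integrating fine-tuning and retrieval-augmented generation (RAG), MedBioLM dynamically incorporates domain-specific knowledge, improving reasoning ability and factual accuracy. To evaluate its effectiveness, we fine-tuned the model on diverse biomedical QA datasets covering structured multiple-choice assessments and complex clinical reasoning tasks. Fine-tuning significantly improves accuracy on benchmark datasets, while RAG enhances factual consistency. These results highlight the potential of domain-optimized LLMs in advancing biomedical research, medical education, and clinical decision support.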
-
-
-### LLM
-|Publish Date|Title|Authors|Homepage|Code|
-| :---: | :---: | :---: | :---: | :---: |
-|**2025-02-17**|**Diffusion Models without Classifier-free Guidance**|Zhicong Tang et.al.|[2502.12154v1](http://arxiv.org/abs/2502.12154v1)|[link](https://github.com/tzco/Diffusion-wo-CFG)|
-|**2025-02-17**|**Idiosyncrasies in Large Language Models**|Mingjie Sun et.al.|[2502.12150v1](http://arxiv.org/abs/2502.12150v1)|null|
-|**2025-02-17**|**HARBOR: Exploring Persona Dynamics in Multi-Agent Competition**|Kenan Jiang et.al.|[2502.12149v1](http://arxiv.org/abs/2502.12149v1)|null|
-|**2025-02-17**|**Fast or Better? Balancing Accuracy and Cost in Retrieval-Augmented Generation with Flexible User Control**|Jinyan Su et.al.|[2502.12145v1](http://arxiv.org/abs/2502.12145v1)|null|
-|**2025-02-17**|**Small Models Struggle to Learn from Strong Reasoners**|Yuetai Li et.al.|[2502.12143v1](http://arxiv.org/abs/2502.12143v1)|null|
-|**2025-02-17**|**SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs**|Yige Xu et.al.|[2502.12134v1](http://arxiv.org/abs/2502.12134v1)|null|
-|**2025-02-17**|**Transformer Dynamics: A neuroscientific approach to interpretability of large language models**|Jesseba Fernando et.al.|[2502.12131v1](http://arxiv.org/abs/2502.12131v1)|null|
-|**2025-02-17**|**Scaling Autonomous Agents via Automatic Reward Modeling And Planning**|Zhenfang Chen et.al.|[2502.12130v1](http://arxiv.org/abs/2502.12130v1)|null|
-|**2025-02-17**|**LaM-SLidE: Latent Space Modeling of Spatial Dynamical Systems via Linked Entities**|Florian Sestak et.al.|[2502.12128v1](http://arxiv.org/abs/2502.12128v1)|null|
-|**2025-02-17**|**On the Query Complexity of Verifier-Assisted Language Generation**|Edoardo Botta et.al.|[2502.12123v1](http://arxiv.org/abs/2502.12123v1)|null|
-|**2025-02-17**|**LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws**|Prasanna Mayilvahanan et.al.|[2502.12120v1](http://arxiv.org/abs/2502.12120v1)|null|
-|**2025-02-17**|**PRISM: Self-Pruning Intrinsic Selection Method for Training-Free Multimodal Data Selection**|Jinhe Bi et.al.|[2502.12119v1](http://arxiv.org/abs/2502.12119v1)|null|
-|**2025-02-17**|**Scaling Test-Time Compute Without Verification or RL is Suboptimal**|Amrith Setlur et.al.|[2502.12118v1](http://arxiv.org/abs/2502.12118v1)|null|
-|**2025-02-17**|**A-MEM: Agentic Memory for LLM Agents**|Wujiang Xu et.al.|[2502.12110v1](http://arxiv.org/abs/2502.12110v1)|null|
-|**2025-02-17**|**Personality Structured Interview for Large Language Model Simulation in Personality Research**|Pengda Wang et.al.|[2502.12109v1](http://arxiv.org/abs/2502.12109v1)|null|
-|**2025-02-17**|**Using the Path of Least Resistance to Explain Deep Networks**|Sina Salek et.al.|[2502.12108v1](http://arxiv.org/abs/2502.12108v1)|null|
-|**2025-02-17**|**Relational Norms for Human-AI Cooperation**|Brian D. Earp et.al.|[2502.12102v1](http://arxiv.org/abs/2502.12102v1)|null|
-|**2025-02-17**|**A Study on Leveraging Search and Self-Feedback for Agent Reasoning**|Karthikeyan K et.al.|[2502.12094v1](http://arxiv.org/abs/2502.12094v1)|null|
-|**2025-02-17**|**Meta-Statistical Learning: Supervised Learning of Statistical Inference**|Maxime Peyrard et.al.|[2502.12088v1](http://arxiv.org/abs/2502.12088v1)|null|
-|**2025-02-17**|**APB: Accelerating Distributed Long-Context Inference by Passing Compressed Context Blocks across GPUs**|Yuxiang Huang et.al.|[2502.12085v1](http://arxiv.org/abs/2502.12085v1)|null|
-|**2025-02-17**|**VLM$^2$-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit Matching Visual Cues**|Jianshu Zhang et.al.|[2502.12084v1](http://arxiv.org/abs/2502.12084v1)|null|
-|**2025-02-17**|**AdaSplash: Adaptive Sparse Flash Attention**|Nuno Gonçalves et.al.|[2502.12082v1](http://arxiv.org/abs/2502.12082v1)|null|
-|**2025-02-17**|**Unhackable Temporal Rewarding for Scalable Video MLLMs**|En Yu et.al.|[2502.12081v1](http://arxiv.org/abs/2502.12081v1)|null|
-|**2025-02-17**|**Can LLMs Simulate Social Media Engagement? A Study on Action-Guided Response Generation**|Zhongyi Qiu et.al.|[2502.12073v1](http://arxiv.org/abs/2502.12073v1)|null|
-|**2025-02-17**|**TokenSkip: Controllable Chain-of-Thought Compression in LLMs**|Heming Xia et.al.|[2502.12067v1](http://arxiv.org/abs/2502.12067v1)|null|
-|**2025-02-17**|**CONSTRUCTA: Automating Commercial Construction Schedules in Fabrication Facilities with Large Language Models**|Yifan Zhang et.al.|[2502.12066v1](http://arxiv.org/abs/2502.12066v1)|null|
-|**2025-02-17**|**Formalizing Complex Mathematical Statements with LLMs: A Study on Mathematical Definitions**|Lan Zhang et.al.|[2502.12065v1](http://arxiv.org/abs/2502.12065v1)|null|
-|**2025-02-17**|**AI-generated Text Detection with a GLTR-based Approach**|Lucía Yan Wu et.al.|[2502.12064v1](http://arxiv.org/abs/2502.12064v1)|null|
-|**2025-02-17**|**Culture is Not Trivia: Sociocultural Theory for Cultural NLP**|Naitian Zhou et.al.|[2502.12057v1](http://arxiv.org/abs/2502.12057v1)|null|
-|**2025-02-17**|**Designing Role Vectors to Improve LLM Inference Behaviour**|Daniele Potertì et.al.|[2502.12055v1](http://arxiv.org/abs/2502.12055v1)|null|
-|**2025-02-17**|**PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning**|Xinyu Zhang et.al.|[2502.12054v1](http://arxiv.org/abs/2502.12054v1)|null|
-|**2025-02-17**|**A Dual-Perspective NLG Meta-Evaluation Framework with Automatic Benchmark and Better Interpretability**|Xinyu Hu et.al.|[2502.12052v1](http://arxiv.org/abs/2502.12052v1)|null|
-|**2025-02-17**|**How to Upscale Neural Networks with Scaling Law? A Survey and Practical Guidelines**|Ayan Sengupta et.al.|[2502.12051v1](http://arxiv.org/abs/2502.12051v1)|null|
-|**2025-02-17**|**SpeechT: Findings of the First Mentorship in Speech Translation**|Yasmin Moslem et.al.|[2502.12050v1](http://arxiv.org/abs/2502.12050v1)|null|
-|**2025-02-17**|**A Survey on Bridging EEG Signals and Generative AI: From Image and Text to Beyond**|Shreya Shukla et.al.|[2502.12048v1](http://arxiv.org/abs/2502.12048v1)|null|
-|**2025-02-17**|**KnowPath: Knowledge-enhanced Reasoning via LLM-generated Inference Paths over Knowledge Graphs**|Qi Zhao et.al.|[2502.12029v1](http://arxiv.org/abs/2502.12029v1)|null|
-|**2025-02-17**|**SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities**|Fengqing Jiang et.al.|[2502.12025v1](http://arxiv.org/abs/2502.12025v1)|null|
-|**2025-02-17**|**Teaching LLMs According to Their Aptitude: Adaptive Reasoning for Mathematical Problem Solving**|Xin Xu et.al.|[2502.12022v1](http://arxiv.org/abs/2502.12022v1)|null|
-|**2025-02-17**|**Atom of Thoughts for Markov LLM Test-Time Scaling**|Fengwei Teng et.al.|[2502.12018v1](http://arxiv.org/abs/2502.12018v1)|null|
-|**2025-02-17**|**Demographic Attributes Prediction from Speech Using WavLM Embeddings**|Yuchen Yang et.al.|[2502.12007v1](http://arxiv.org/abs/2502.12007v1)|null|
-|**2025-02-17**|**Merging Language and Domain Specific Models: The Impact on Technical Vocabulary Acquisition**|Thibault Rousset et.al.|[2502.12001v1](http://arxiv.org/abs/2502.12001v1)|null|
-|**2025-02-17**|**Presumed Cultural Identity: How Names Shape LLM Responses**|Siddhesh Pawar et.al.|[2502.11995v1](http://arxiv.org/abs/2502.11995v1)|null|
-|**2025-02-17**|**Characterizing Photorealism and Artifacts in Diffusion Model-Generated Images**|Negar Kamali et.al.|[2502.11989v1](http://arxiv.org/abs/2502.11989v1)|null|
-|**2025-02-17**|**Generating Text from Uniform Meaning Representation**|Emma Markle et.al.|[2502.11973v1](http://arxiv.org/abs/2502.11973v1)|null|
-|**2025-02-17**|**Learning Generalizable Prompt for CLIP with Class Similarity Knowledge**|Sehun Jung et.al.|[2502.11969v1](http://arxiv.org/abs/2502.11969v1)|null|
-|**2025-02-17**|**A MIMO Wireless Channel Foundation Model via CIR-CSI Consistency**|Jun Jiang et.al.|[2502.11965v1](http://arxiv.org/abs/2502.11965v1)|null|
-|**2025-02-17**|**Navigating the Helpfulness-Truthfulness Trade-Off with Uncertainty-Aware Instruction Fine-Tuning**|Tianyi Wu et.al.|[2502.11962v1](http://arxiv.org/abs/2502.11962v1)|null|
-|**2025-02-17**|**STRIVE: Structured Reasoning for Self-Improvement in Claim Verification**|Haisong Gong et.al.|[2502.11959v1](http://arxiv.org/abs/2502.11959v1)|null|
-|**2025-02-17**|**Can Your Uncertainty Scores Detect Hallucinated Entity?**|Min-Hsuan Yeh et.al.|[2502.11948v1](http://arxiv.org/abs/2502.11948v1)|null|
-|**2025-02-17**|**Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction**|Ailin Huang et.al.|[2502.11946v1](http://arxiv.org/abs/2502.11946v1)|null|
-|**2025-02-17**|**Deep Spatio-Temporal Neural Network for Air Quality Reanalysis**|Ammar Kheder et.al.|[2502.11941v1](http://arxiv.org/abs/2502.11941v1)|null|
-|**2025-02-17**|**FitLight: Federated Imitation Learning for Plug-and-Play Autonomous Traffic Signal Control**|Yutong Ye et.al.|[2502.11937v1](http://arxiv.org/abs/2502.11937v1)|null|
-|**2025-02-17**|**On Representational Dissociation of Language and Arithmetic in Large Language Models**|Riku Kisako et.al.|[2502.11932v1](http://arxiv.org/abs/2502.11932v1)|null|
-|**2025-02-17**|**BRIGHTER: BRIdging the Gap in Human-Annotated Textual Emotion Recognition Datasets for 28 Languages**|Shamsuddeen Hassan Muhammad et.al.|[2502.11926v1](http://arxiv.org/abs/2502.11926v1)|null|
-|**2025-02-17**|**GRAPHGPT-O: Synergistic Multimodal Comprehension and Generation on Graphs**|Yi Fang et.al.|[2502.11925v1](http://arxiv.org/abs/2502.11925v1)|null|
-|**2025-02-17**|**From Text to Trust: Empowering AI-assisted Decision Making with Adaptive LLM-powered Analysis**|Zhuoyan Li et.al.|[2502.11919v1](http://arxiv.org/abs/2502.11919v1)|null|
-|**2025-02-17**|**EssayJudge: A Multi-Granular Benchmark for Assessing Automated Essay Scoring Capabilities of Multimodal Large Language Models**|Jiamin Su et.al.|[2502.11916v1](http://arxiv.org/abs/2502.11916v1)|null|
-|**2025-02-17**|**On the robustness of ChatGPT in teaching Korean Mathematics**|Phuong-Nam Nguyen et.al.|[2502.11915v1](http://arxiv.org/abs/2502.11915v1)|null|
-|**2025-02-17**|**MMRC: A Large-Scale Benchmark for Understanding Multimodal Large Language Model in Real-World Conversation**|Haochen Xue et.al.|[2502.11903v1](http://arxiv.org/abs/2502.11903v1)|null|
-|**2025-02-17**|**Building A Proof-Oriented Programmer That Is 64% Better Than GPT-4o Under Data Scarsity**|Dylan Zhang et.al.|[2502.11901v1](http://arxiv.org/abs/2502.11901v1)|null|
-|**2025-02-17**|**DLFR-VAE: Dynamic Latent Frame Rate VAE for Video Generation**|Zhihang Yuan et.al.|[2502.11897v1](http://arxiv.org/abs/2502.11897v1)|null|
-|**2025-02-17**|**CAMEL: Continuous Action Masking Enabled by Large Language Models for Reinforcement Learning**|Yanxiao Zhao et.al.|[2502.11896v1](http://arxiv.org/abs/2502.11896v1)|null|
-|**2025-02-17**|**Continual Quantization-Aware Pre-Training: When to transition from 16-bit to 1.58-bit pre-training for BitNet language models?**|Jacob Nielsen et.al.|[2502.11895v1](http://arxiv.org/abs/2502.11895v1)|null|
-|**2025-02-17**|**Revisiting Classification Taxonomy for Grammatical Errors**|Deqing Zou et.al.|[2502.11890v1](http://arxiv.org/abs/2502.11890v1)|null|
-|**2025-02-17**|**Stonefish: Supporting Machine Learning Research in Marine Robotics**|Michele Grimaldi et.al.|[2502.11887v1](http://arxiv.org/abs/2502.11887v1)|null|
-|**2025-02-17**|**LIMR: Less is More for RL Scaling**|Xuefeng Li et.al.|[2502.11886v1](http://arxiv.org/abs/2502.11886v1)|null|
-|**2025-02-17**|**Leveraging Dual Process Theory in Language Agent Framework for Real-time Simultaneous Human-AI Collaboration**|Shao Zhang et.al.|[2502.11882v1](http://arxiv.org/abs/2502.11882v1)|null|
-|**2025-02-17**|**Hypothesis-Driven Theory-of-Mind Reasoning for Large Language Models**|Hyunwoo Kim et.al.|[2502.11881v1](http://arxiv.org/abs/2502.11881v1)|null|
-|**2025-02-17**|**Bitnet.cpp: Efficient Edge Inference for Ternary LLMs**|Jinheng Wang et.al.|[2502.11880v1](http://arxiv.org/abs/2502.11880v1)|null|
-|**2025-02-17**|**VAQUUM: Are Vague Quantifiers Grounded in Visual Data?**|Hugh Mee Wong et.al.|[2502.11874v1](http://arxiv.org/abs/2502.11874v1)|null|
-|**2025-02-17**|**Southern Newswire Corpus: A Large-Scale Dataset of Mid-Century Wire Articles Beyond the Front Page**|Michael McRae et.al.|[2502.11866v1](http://arxiv.org/abs/2502.11866v1)|null|
-|**2025-02-17**|**FedEAT: A Robustness Optimization Framework for Federated LLMs**|Yahao Pang et.al.|[2502.11863v1](http://arxiv.org/abs/2502.11863v1)|null|
-|**2025-02-17**|**Understanding In-Context Machine Translation for Low-Resource Languages: A Case Study on Manchu**|Renhao Pei et.al.|[2502.11862v1](http://arxiv.org/abs/2502.11862v1)|null|
-|**2025-02-17**|**Exploring Large Language Models in Healthcare: Insights into Corpora Sources, Customization Strategies, and Evaluation Metrics**|Shuqi Yang et.al.|[2502.11861v1](http://arxiv.org/abs/2502.11861v1)|null|
-|**2025-02-17**|**Defining and Evaluating Visual Language Models' Basic Spatial Abilities: A Perspective from Psychometrics**|Wenrui Xu et.al.|[2502.11859v1](http://arxiv.org/abs/2502.11859v1)|null|
-|**2025-02-17**|**LLMs as a synthesis between symbolic and continuous approaches to language**|Gemma Boleda et.al.|[2502.11856v1](http://arxiv.org/abs/2502.11856v1)|null|
-|**2025-02-17**|**BaxBench: Can LLMs Generate Correct and Secure Backends?**|Mark Vero et.al.|[2502.11844v1](http://arxiv.org/abs/2502.11844v1)|null|
-|**2025-02-17**|**Can LLM Agents Maintain a Persona in Discourse?**|Pranav Bhandari et.al.|[2502.11843v1](http://arxiv.org/abs/2502.11843v1)|null|
-|**2025-02-17**|**ChordFormer: A Conformer-Based Architecture for Large-Vocabulary Audio Chord Recognition**|Muhammad Waseem Akram et.al.|[2502.11840v1](http://arxiv.org/abs/2502.11840v1)|null|
-|**2025-02-17**|**Intuitive physics understanding emerges from self-supervised pretraining on natural videos**|Quentin Garrido et.al.|[2502.11831v1](http://arxiv.org/abs/2502.11831v1)|null|
-|**2025-02-17**|**Text Classification in the LLM Era - Where do we stand?**|Sowmya Vajjala et.al.|[2502.11830v1](http://arxiv.org/abs/2502.11830v1)|null|
-|**2025-02-17**|**Code-Vision: Evaluating Multimodal LLMs Logic Understanding and Code Generation Capabilities**|Hanbin Wang et.al.|[2502.11829v1](http://arxiv.org/abs/2502.11829v1)|null|
-|**2025-02-17**|**M-ABSA: A Multilingual Dataset for Aspect-Based Sentiment Analysis**|Chengyan Wu et.al.|[2502.11824v1](http://arxiv.org/abs/2502.11824v1)|null|
-|**2025-02-17**|**AAKT: Enhancing Knowledge Tracing with Alternate Autoregressive Modeling**|Hao Zhou et.al.|[2502.11817v1](http://arxiv.org/abs/2502.11817v1)|null|
-|**2025-02-17**|**Towards Understanding Fine-Tuning Mechanisms of LLMs via Circuit Analysis**|Xu Wang et.al.|[2502.11812v1](http://arxiv.org/abs/2502.11812v1)|null|
-|**2025-02-17**|**FineFilter: A Fine-grained Noise Filtering Mechanism for Retrieval-Augmented Large Language Models**|Qianchi Zhang et.al.|[2502.11811v1](http://arxiv.org/abs/2502.11811v1)|null|
-|**2025-02-17**|**Revealing Bias Formation in Deep Neural Networks Through the Geometric Mechanisms of Human Visual Decoupling**|Yanbiao Ma et.al.|[2502.11809v1](http://arxiv.org/abs/2502.11809v1)|null|
-|**2025-02-17**|**Exploring Translation Mechanism of Large Language Models**|Hongbin Zhang et.al.|[2502.11806v1](http://arxiv.org/abs/2502.11806v1)|null|
-|**2025-02-17**|**Table-Critic: A Multi-Agent Framework for Collaborative Criticism and Refinement in Table Reasoning**|Peiying Yu et.al.|[2502.11799v1](http://arxiv.org/abs/2502.11799v1)|null|
-|**2025-02-17**|**Personality Editing for Language Models through Relevant Knowledge Editing**|Seojin Hwang et.al.|[2502.11789v1](http://arxiv.org/abs/2502.11789v1)|null|
-|**2025-02-17**|**Efficient Response Generation Method Selection for Fine-Tuning Large Language Models**|Xuan Ren et.al.|[2502.11779v1](http://arxiv.org/abs/2502.11779v1)|null|
-|**2025-02-17**|**Deep Neural Networks for Accurate Depth Estimation with Latent Space Features**|Siddiqui Muhammad Yasir et.al.|[2502.11777v1](http://arxiv.org/abs/2502.11777v1)|null|
-|**2025-02-17**|**The Validation Gap: A Mechanistic Analysis of How Language Models Compute Arithmetic but Fail to Validate It**|Leonardo Bertolazzi et.al.|[2502.11771v1](http://arxiv.org/abs/2502.11771v1)|null|
-|**2025-02-17**|**Cognitive-Aligned Document Selection for Retrieval-augmented Generation**|Bingyu Wan et.al.|[2502.11770v1](http://arxiv.org/abs/2502.11770v1)|null|
-|**2025-02-17**|**From Selection to Generation: A Survey of LLM-based Active Learning**|Yu Xia et.al.|[2502.11767v1](http://arxiv.org/abs/2502.11767v1)|null|
-|**2025-02-17**|**Warmup-Distill: Bridge the Distribution Mismatch between Teacher and Student before Knowledge Distillation**|Zengkui Sun et.al.|[2502.11766v1](http://arxiv.org/abs/2502.11766v1)|null|
-|**2025-02-17**|**Lightweight Deepfake Detection Based on Multi-Feature Fusion**|Siddiqui Muhammad Yasir et.al.|[2502.11763v1](http://arxiv.org/abs/2502.11763v1)|null|
-|**2025-02-17**|**HintsOfTruth: A Multimodal Checkworthiness Detection Dataset with Real and Synthetic Claims**|Michiel van der Meer et.al.|[2502.11753v1](http://arxiv.org/abs/2502.11753v1)|null|
-|**2025-02-17**|**Language Models Can See Better: Visual Contrastive Decoding For LLM Multimodal Reasoning**|Yuqi Pang et.al.|[2502.11751v1](http://arxiv.org/abs/2502.11751v1)|null|
-|**2025-02-17**|**SQL-o1: A Self-Reward Heuristic Dynamic Search Method for Text-to-SQL**|Shuai Lyu et.al.|[2502.11741v1](http://arxiv.org/abs/2502.11741v1)|null|
-
-#### Abstracts
-##### **Diffusion Models without Classifier-free Guidance**
-2502.12154v1 by Zhicong Tang, Jianmin Bao, Dong Chen, Baining Guo
-
-This paper presents Model-guidance (MG), a novel objective for training
-diffusion models that removes the need for the commonly used Classifier-free
-guidance (CFG). Our approach goes beyond modeling the data distribution
-alone and also incorporates the posterior probability of
-conditions. The proposed technique originates from the idea of CFG and is simple
-yet effective, making it a plug-and-play module for existing models. Our method
-significantly accelerates the training process, doubles the inference speed,
-and achieves exceptional quality that parallels and even surpasses concurrent
-diffusion models with CFG. Extensive experiments demonstrate its effectiveness,
-efficiency, and scalability on different models and datasets. Finally, we establish
-state-of-the-art performance on ImageNet 256 benchmarks with an FID of 1.34.
-Our code is available at https://github.com/tzco/Diffusion-wo-CFG.
-
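-Summary: This paper presents Model-guidance (MG), a new objective for training diffusion models that removes the need for the commonly used classifier-free guidance (CFG). Our approach goes beyond modeling the data distribution alone and also incorporates the posterior probability of conditions. The proposed technique originates from the idea of CFG and is simple yet effective, making it a plug-and-play module for existing models. Our method significantly accelerates training, doubles inference speed, and achieves quality that matches and even surpasses concurrent diffusion models that use CFG. Extensive experiments demonstrate its effectiveness, efficiency, and scalability on different models and datasets. Finally, we establish state-of-the-art performance on ImageNet 256 benchmarks with an FID of 1.34. Our code is available at https://github.com/tzco/Diffusion-wo-CFG.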
-
-##### **Idiosyncrasies in Large Language Models**
-2502.12150v1 by Mingjie Sun, Yida Yin, Zhiqiu Xu, J. Zico Kolter, Zhuang Liu
-
-In this work, we unveil and study idiosyncrasies in Large Language Models
-(LLMs) -- unique patterns in their outputs that can be used to distinguish the
-models. To do so, we consider a simple classification task: given a particular
-text output, the objective is to predict the source LLM that generates the
-text. We evaluate this synthetic task across various groups of LLMs and find
-that simply fine-tuning existing text embedding models on LLM-generated texts
-yields excellent classification accuracy. Notably, we achieve 97.1% accuracy on
-held-out validation data in the five-way classification problem involving
-ChatGPT, Claude, Grok, Gemini, and DeepSeek. Our further investigation reveals
-that these idiosyncrasies are rooted in word-level distributions. These
-patterns persist even when the texts are rewritten, translated, or summarized
-by an external LLM, suggesting that they are also encoded in the semantic
-content. Additionally, we leverage LLMs as judges to generate detailed,
-open-ended descriptions of each model's idiosyncrasies. Finally, we discuss the
-broader implications of our findings, particularly for training on synthetic
-data and inferring model similarity. Code is available at
-https://github.com/locuslab/llm-idiosyncrasies.
-
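-Summary: In this work, we unveil and study idiosyncrasies in Large Language Models (LLMs), that is, unique patterns in their outputs that can be used to distinguish the models. To do so, we consider a simple classification task: given a particular text output, the objective is to predict the source LLM that generated it. We evaluate this synthetic task across various groups of LLMs and find that simply fine-tuning existing text embedding models on LLM-generated texts yields excellent classification accuracy. Notably, we achieve 97.1% accuracy on held-out validation data in the five-way classification problem involving ChatGPT, Claude, Grok, Gemini, and DeepSeek. Further investigation reveals that these idiosyncrasies are rooted in word-level distributions. The patterns persist even when the texts are rewritten, translated, or summarized by an external LLM, suggesting that they are also encoded in the semantic content. Additionally, we use LLMs as judges to generate detailed, open-ended descriptions of each model's idiosyncrasies. Finally, we discuss the broader implications of our findings, particularly for training on synthetic data and inferring model similarity. Code is available at https://github.com/locuslab/llm-idiosyncrasies.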
-
-##### **HARBOR: Exploring Persona Dynamics in Multi-Agent Competition**
-2502.12149v1 by Kenan Jiang, Li Xiong, Fei Liu
-
-We investigate factors contributing to LLM agents' success in competitive
-multi-agent environments, using auctions as a testbed where agents bid to
-maximize profit. The agents are equipped with bidding domain knowledge,
-distinct personas that reflect item preferences, and a memory of auction
-history. Our work extends the classic auction scenario by creating a realistic
-environment where multiple agents bid on houses, weighing aspects such as size,
-location, and budget to secure the most desirable homes at the lowest prices.
-Particularly, we investigate three key questions: (a) How does a persona
-influence an agent's behavior in a competitive setting? (b) Can an agent
-effectively profile its competitors' behavior during auctions? (c) How can
-persona profiling be leveraged to create an advantage using strategies such as
-theory of mind? Through a series of experiments, we analyze the behaviors of
-LLM agents and shed light on new findings. Our testbed, called HARBOR, offers a
-valuable platform for deepening our understanding of multi-agent workflows in
-competitive environments.
-
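-Summary: We investigate factors contributing to LLM agents' success in competitive multi-agent environments, using auctions as a testbed where agents bid to maximize profit. The agents are equipped with bidding domain knowledge, distinct personas that reflect item preferences, and a memory of auction history. Our work extends the classic auction scenario by creating a realistic environment where multiple agents bid on houses, weighing aspects such as size, location, and budget to secure the most desirable homes at the lowest prices. In particular, we investigate three key questions: (a) How does a persona influence an agent's behavior in a competitive setting? (b) Can an agent effectively profile its competitors' behavior during auctions? (c) How can persona profiling be leveraged to create an advantage using strategies such as theory of mind? Through a series of experiments, we analyze the behaviors of LLM agents and shed light on new findings. Our testbed, called HARBOR, offers a valuable platform for deepening our understanding of multi-agent workflows in competitive environments.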
-
-##### **Fast or Better? Balancing Accuracy and Cost in Retrieval-Augmented Generation with Flexible User Control**
-2502.12145v1 by Jinyan Su, Jennifer Healey, Preslav Nakov, Claire Cardie
-
-Retrieval-Augmented Generation (RAG) has emerged as a powerful approach to
-mitigate large language model (LLM) hallucinations by incorporating external
-knowledge retrieval. However, existing RAG frameworks often apply retrieval
-indiscriminately, leading to inefficiencies: over-retrieving when unnecessary or
-failing to retrieve iteratively when required for complex reasoning. Recent
-adaptive retrieval strategies, though they navigate between these retrieval
-modes, predict based only on query complexity and lack user-driven
-flexibility, making them infeasible for diverse user application needs. In this
-paper, we introduce a novel user-controllable RAG framework that enables
-dynamic adjustment of the accuracy-cost trade-off. Our approach leverages two
-classifiers: one trained to prioritize accuracy and another to prioritize
-retrieval efficiency. Via an interpretable control parameter $\alpha$, users
-can seamlessly navigate between minimal-cost retrieval and high-accuracy
-retrieval based on their specific requirements. We empirically demonstrate that
-our approach effectively balances accuracy, retrieval cost, and user
-controllability, making it a practical and adaptable solution for real-world
-applications.
-
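-Summary: Retrieval-Augmented Generation (RAG) has emerged as a powerful approach to mitigating large language model (LLM) hallucinations by incorporating external knowledge retrieval. However, existing RAG frameworks often apply retrieval indiscriminately, leading to inefficiencies: over-retrieving when unnecessary or failing to retrieve iteratively when complex reasoning requires it. Recent adaptive retrieval strategies, though they navigate between these retrieval modes, predict based only on query complexity and lack user-driven flexibility, making them infeasible for diverse user application needs. In this paper, we introduce a novel user-controllable RAG framework that enables dynamic adjustment of the accuracy-cost trade-off. Our approach leverages two classifiers: one trained to prioritize accuracy and another to prioritize retrieval efficiency. Via an interpretable control parameter $\alpha$, users can move seamlessly between minimal-cost retrieval and high-accuracy retrieval based on their specific requirements. We empirically demonstrate that our approach effectively balances accuracy, retrieval cost, and user controllability, making it a practical and adaptable solution for real-world applications.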
-
-##### **Small Models Struggle to Learn from Strong Reasoners**
-2502.12143v1 by Yuetai Li, Xiang Yue, Zhangchen Xu, Fengqing Jiang, Luyao Niu, Bill Yuchen Lin, Bhaskar Ramasubramanian, Radha Poovendran
-
-Large language models (LLMs) excel in complex reasoning tasks, and distilling
-their reasoning capabilities into smaller models has shown promise. However, we
-uncover an interesting phenomenon, which we term the Small Model Learnability
-Gap: small models ($\leq$3B parameters) do not consistently benefit from long
-chain-of-thought (CoT) reasoning or distillation from larger models. Instead,
-they perform better when fine-tuned on shorter, simpler reasoning chains that
-better align with their intrinsic learning capacity. To address this, we
-propose Mix Distillation, a simple yet effective strategy that balances
-reasoning complexity by combining long and short CoT examples or reasoning from
-both larger and smaller models. Our experiments demonstrate that Mix
-Distillation significantly improves small model reasoning performance compared
-to training on either type of data alone. These findings highlight the limitations of
-direct strong model distillation and underscore the importance of adapting
-reasoning complexity for effective reasoning capability transfer.
-
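-Summary: Large language models (LLMs) excel in complex reasoning tasks, and distilling their reasoning capabilities into smaller models has shown promise. However, we uncover an interesting phenomenon, which we term the Small Model Learnability Gap: small models (≤ 3B parameters) do not consistently benefit from long chain-of-thought (CoT) reasoning or distillation from larger models. Instead, they perform better when fine-tuned on shorter, simpler reasoning chains that better align with their intrinsic learning capacity. To address this, we propose Mix Distillation, a simple yet effective strategy that balances reasoning complexity by combining long and short CoT examples, or reasoning from both larger and smaller models. Our experiments show that Mix Distillation significantly improves small model reasoning performance compared to training on either type of data alone. These findings highlight the limitations of direct distillation from strong models and underscore the importance of adapting reasoning complexity for effective transfer of reasoning capability.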
-
-##### **SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs**
-2502.12134v1 by Yige Xu, Xu Guo, Zhiwei Zeng, Chunyan Miao
-
-Chain-of-Thought (CoT) reasoning enables Large Language Models (LLMs) to
-solve complex reasoning tasks by generating intermediate reasoning steps.
-However, most existing approaches focus on hard token decoding, which
-constrains reasoning within the discrete vocabulary space and may not always be
-optimal. While recent efforts explore continuous-space reasoning, they often
-suffer from catastrophic forgetting, limiting their applicability to
-state-of-the-art LLMs that already perform well in zero-shot settings with a
-proper instruction. To address this challenge, we propose a novel approach for
-continuous-space reasoning that does not require modifying the underlying LLM.
-Specifically, we employ a lightweight assistant model to generate
-instance-specific soft thought tokens speculatively as the initial chain of
-thoughts, which are then mapped into the LLM's representation space via a
-projection module. Experimental results on five reasoning benchmarks
-demonstrate that our method enhances LLM reasoning performance through
-supervised, parameter-efficient fine-tuning.
-
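-Summary: Chain-of-Thought (CoT) reasoning enables Large Language Models (LLMs) to solve complex reasoning tasks by generating intermediate reasoning steps. However, most existing approaches focus on hard token decoding, which constrains reasoning to the discrete vocabulary space and may not always be optimal. While recent efforts explore continuous-space reasoning, they often suffer from catastrophic forgetting, which limits their applicability to state-of-the-art LLMs that already perform well in zero-shot settings when given a proper instruction. To address this challenge, we propose a novel approach to continuous-space reasoning that does not require modifying the underlying LLM. Specifically, we employ a lightweight assistant model to speculatively generate instance-specific soft thought tokens as the initial chain of thoughts, which are then mapped into the LLM's representation space via a projection module. Experimental results on five reasoning benchmarks show that our method enhances LLM reasoning performance through supervised, parameter-efficient fine-tuning.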
-
-##### **Transformer Dynamics: A neuroscientific approach to interpretability of large language models**
-2502.12131v1 by Jesseba Fernando, Grigori Guitchounts
-
-As artificial intelligence models have exploded in scale and capability,
-understanding of their internal mechanisms remains a critical challenge.
-Inspired by the success of dynamical systems approaches in neuroscience, here
-we propose a novel framework for studying computations in deep learning
-systems. We focus on the residual stream (RS) in transformer models,
-conceptualizing it as a dynamical system evolving across layers. We find that
-activations of individual RS units exhibit strong continuity across layers,
-despite the RS being a non-privileged basis. Activations in the RS accelerate
-and grow denser over layers, while individual units trace unstable periodic
-orbits. In reduced-dimensional spaces, the RS follows a curved trajectory with
-attractor-like dynamics in the lower layers. These insights bridge dynamical
-systems theory and mechanistic interpretability, establishing a foundation for
-a "neuroscience of AI" that combines theoretical rigor with large-scale data
-analysis to advance our understanding of modern neural networks.
-
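-Summary: As artificial intelligence models have exploded in scale and capability,
-understanding their internal mechanisms remains a critical challenge.
-Inspired by the success of dynamical systems approaches in neuroscience, we
-propose a novel framework for studying computations in deep learning systems. We focus on the residual stream (RS) in transformer models,
-conceptualizing it as a dynamical system evolving across layers. We find that
-activations of individual RS units exhibit strong continuity across layers, even though the RS is a non-privileged basis. Activations in the RS
-accelerate and grow denser over layers, while individual units trace unstable periodic
-orbits. In reduced-dimensional spaces, the RS follows a curved trajectory with attractor-like dynamics in the lower layers. These insights bridge dynamical
-systems theory and mechanistic interpretability, establishing a foundation for a "neuroscience of AI" that combines theoretical rigor with large-scale data
-analysis to advance our understanding of modern neural networks.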
-
-##### **Scaling Autonomous Agents via Automatic Reward Modeling And Planning**
-2502.12130v1 by Zhenfang Chen, Delin Chen, Rui Sun, Wenjun Liu, Chuang Gan
-
-Large language models (LLMs) have demonstrated remarkable capabilities across
-a range of text-generation tasks. However, LLMs still struggle with problems
-requiring multi-step decision-making and environmental feedback, such as online
-shopping, scientific reasoning, and mathematical problem-solving. Unlike pure
-text data, collecting large-scale decision-making data is challenging.
-Moreover, many powerful LLMs are only accessible through APIs, which hinders
-their fine-tuning for agent tasks due to cost and complexity. To address LLM
-agents' limitations, we propose a framework that can automatically learn a
-reward model from the environment without human annotations. This model can be
-used to evaluate the action trajectories of LLM agents and provide heuristics
-for task planning. Specifically, our approach involves employing one LLM-based
-agent to navigate an environment randomly, generating diverse action
-trajectories. Subsequently, a separate LLM is leveraged to assign a task intent
-and synthesize a negative response alongside the correct response for each
-trajectory. These triplets (task intent, positive response, and negative
-response) are then utilized as training data to optimize a reward model capable
-of scoring action trajectories. The effectiveness and generalizability of our
-framework are demonstrated through evaluations conducted on different agent
-benchmarks. In conclusion, our proposed framework represents a significant
-advancement in enhancing LLM agents' decision-making capabilities. By
-automating the learning of reward models, we overcome the challenges of data
-scarcity and API limitations, potentially revolutionizing the application of
-LLMs in complex and interactive environments. This research paves the way for
-more sophisticated AI agents capable of tackling a wide range of real-world
-problems requiring multi-step decision-making.
-
-摘要：大型語言模型 (LLM) 已在各種文字生成任務中展示出非凡的能力。然而，LLM 仍然在需要多步驟決策制定和環境回饋的問題上苦苦掙扎，例如網上購物、科學推理和數學問題求解。與純文本數據不同，收集大規模決策制定數據具有挑戰性。此外，許多強大的 LLM 只能通過 API 訪問，這由於成本和複雜性而阻礙了它們對代理任務的微調。為了解決 LLM 代理的局限性，我們提出了一個框架，該框架可以從環境中自動學習獎勵模型，而無需人工註釋。此模型可用于評估 LLM 代理的動作軌跡並為任務規劃提供啟發式方法。具體來說，我們的方法涉及使用一個基於 LLM 的代理隨機導航環境，生成不同的動作軌跡。隨後，利用一個單獨的 LLM 為每個軌跡分配任務意圖並合成一個負面響應以及正確的響應。然後將這些三元組（任務意圖、正面響應和負面響應）用作訓練數據，以優化能夠評分動作軌跡的獎勵模型。我們框架的有效性和普遍性通過在不同代理基準上進行的評估得到證明。總之，我們提出的框架代表了加強 LLM 代理決策能力的重大進步。通過自動化獎勵模型的學習，我們克服了數據稀缺和 API 限制的挑戰，有可能徹底改變 LLM 在複雜和互動環境中的應用。這項研究為更複雜的 AI 代理鋪平了道路，這些代理能夠解決需要多步驟決策制定的大量現實世界問題。
-
-##### **LaM-SLidE: Latent Space Modeling of Spatial Dynamical Systems via Linked Entities**
-2502.12128v1 by Florian Sestak, Artur Toshev, Andreas Fürst, Günter Klambauer, Andreas Mayr, Johannes Brandstetter
-
-Generative models are spearheading recent progress in deep learning, showing
-strong promise for trajectory sampling in dynamical systems as well. However,
-while latent space modeling paradigms have transformed image and video
-generation, similar approaches are more difficult for most dynamical systems.
-Such systems -- from chemical molecule structures to collective human behavior
--- are described by interactions of entities, making them inherently linked to
-connectivity patterns and the traceability of entities over time. Our approach,
-LaM-SLidE (Latent Space Modeling of Spatial Dynamical Systems via Linked
-Entities), combines the advantages of graph neural networks, i.e., the
-traceability of entities across time-steps, with the efficiency and scalability
-of recent advances in image and video generation, where pre-trained encoder and
-decoder are frozen to enable generative modeling in the latent space. The core
-idea of LaM-SLidE is to introduce identifier representations (IDs) to allow for
-retrieval of entity properties, e.g., entity coordinates, from latent system
-representations and thus enables traceability. Experimentally, across different
-domains, we show that LaM-SLidE performs favorably in terms of speed, accuracy,
-and generalizability. (Code is available at
-https://github.com/ml-jku/LaM-SLidE)
-
-摘要：生成模型引領深度學習的最新進展，也展現出在動態系統中進行軌跡取樣的強大前景。然而，儘管潛在空間建模範例已轉變圖像和影片生成，但對於大多數動態系統來說，類似的做法較為困難。此類系統（從化學分子結構到人類集體行為）由實體的交互作用所描述，使它們與連接模式和實體隨時間的追溯性產生固有聯繫。我們的做法 LaM-SLidE（透過連結實體進行空間動態系統的潛在空間建模）結合圖形神經網路的優點，亦即跨時間步長的實體追溯性，以及圖像和影片生成中近期進展的高效率和可擴充性，其中預先訓練的編碼器和解碼器被凍結以在潛在空間中啟用生成模型。LaM-SLidE 的核心概念是導入識別符號表示（ID），以允許從潛在系統表示中擷取實體屬性（例如實體座標），從而實現追溯性。透過不同領域的實驗，我們證明 LaM-SLidE 在速度、準確度和可概括性方面表現良好。（程式碼可在 https://github.com/ml-jku/LaM-SLidE 取得）
-
-##### **On the Query Complexity of Verifier-Assisted Language Generation**
-2502.12123v1 by Edoardo Botta, Yuchen Li, Aashay Mehta, Jordan T. Ash, Cyril Zhang, Andrej Risteski
-
-Recently, a plethora of works have proposed inference-time algorithms (e.g.
-best-of-n), which incorporate verifiers to assist the generation process. Their
-quality-efficiency trade-offs have been empirically benchmarked on a variety of
-constrained generation tasks, but the algorithmic design landscape is still
-poorly understood. In this paper, we develop a mathematical framework
-for reasoning about constrained generation using a pre-trained language model
-generator oracle and a process verifier--which can decide whether a prefix can
-be extended to a string which satisfies the constraints of choice. We show that
-even in very simple settings, access to a verifier can render an intractable
-problem (information-theoretically or computationally) to a tractable one. In
-fact, we show even simple algorithms, like tokenwise rejection sampling, can
-enjoy significant benefits from access to a verifier. Empirically, we show that
-a natural modification of tokenwise rejection sampling, in which the sampler is
-allowed to "backtrack" (i.e., erase the final few generated tokens) has robust
-and substantive benefits over natural baselines (e.g. (blockwise) rejection
-sampling, nucleus sampling)--both in terms of computational efficiency,
-accuracy and diversity.
-
-摘要：最近，许多作品提出了推理时间算法（例如 best-of-n），其中包含验证器以协助生成过程。它们的质量效率权衡已在各种受限生成任务中得到经验基准测试，但算法设计格局仍然很大程度上难以理解。在本文中，我们开发了一个数学框架，用于使用预训练语言模型生成器预言机和过程验证器推理受限生成——它可以决定是否可以将前缀扩展为满足选择约束的字符串。我们表明，即使在非常简单的设置中，访问验证器也可以将一个棘手的问题（信息论或计算）转换为一个易处理的问题。事实上，我们表明即使是简单的算法，如逐个标记拒绝采样，也可以从访问验证器中受益匪浅。凭经验，我们表明逐个标记拒绝采样的自然修改，其中允许采样器“回溯”（即，擦除最后几个生成的标记）比自然基线（例如（按块）拒绝采样、核采样）具有强大而实质性的优势——无论是在计算效率、准确性还是多样性方面。
-
-##### **LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws**
-2502.12120v1 by Prasanna Mayilvahanan, Thaddäus Wiedemer, Sayak Mallick, Matthias Bethge, Wieland Brendel
-
-Scaling laws guide the development of large language models (LLMs) by
-offering estimates for the optimal balance of model size, tokens, and compute.
-More recently, loss-to-loss scaling laws that relate losses across pretraining
-datasets and downstream tasks have emerged as a powerful tool for understanding
-and improving LLM performance. In this work, we investigate which factors most
-strongly influence loss-to-loss scaling. Our experiments reveal that the
-pretraining data and tokenizer determine the scaling trend. In contrast, model
-size, optimization hyperparameters, and even significant architectural
-differences, such as between transformer-based models like Llama and
-state-space models like Mamba, have limited impact. Consequently, practitioners
-should carefully curate suitable pretraining datasets for optimal downstream
-performance, while architectures and other settings can be freely optimized for
-training efficiency.
-
-摘要：規模化定律透過提供模型大小、符號和運算的最佳平衡估計，引導大型語言模型 (LLM) 的開發。最近，與預訓練資料集和下游任務相關的損失到損失縮放定律已成為了解和改善 LLM 效能的強大工具。在這項工作中，我們探討哪些因素最能影響損失到損失縮放。我們的實驗顯示，預訓練資料和分詞器會決定縮放趨勢。相反地，模型大小、最佳化超參數，甚至重大的架構差異（例如基於Transformer的模型，如 Llama，和狀態空間模型，如 Mamba 之間的差異）影響有限。因此，從業人員應仔細策劃適當的預訓練資料集以獲得最佳的下游效能，而架構和其他設定可以自由最佳化以提升訓練效率。
-
-##### **PRISM: Self-Pruning Intrinsic Selection Method for Training-Free Multimodal Data Selection**
-2502.12119v1 by Jinhe Bi, Yifan Wang, Danqi Yan, Xun Xiao, Artur Hecker, Volker Tresp, Yunpu Ma
-
-Visual instruction tuning refines pre-trained Multimodal Large Language
-Models (MLLMs) to enhance their real-world task performance. However, the rapid
-expansion of visual instruction datasets introduces significant data
-redundancy, leading to excessive computational costs. Existing data selection
-methods predominantly rely on proxy models or loss-based metrics, both of which
-impose substantial computational overheads due to the necessity of model
-inference and backpropagation. To address this challenge, we propose PRISM, a
-novel training-free approach for efficient multimodal data selection. Unlike
-existing methods, PRISM eliminates the reliance on proxy models, warm-up
-pretraining, and gradient-based optimization. Instead, it leverages Pearson
-correlation analysis to quantify the intrinsic visual encoding properties of
-MLLMs, computing a task-specific correlation score to identify high-value
-instances. This not only enables data-efficient selection, but maintains the
-original performance. Empirical evaluations across multiple MLLMs demonstrate
-that PRISM reduces the overall time required for visual instruction tuning and
-data selection to just 30% of conventional methods, while surpassing fully
-fine-tuned models across eight multimodal and three language understanding
-benchmarks, achieving a 101.7% relative improvement in final performance.
-
-摘要：視覺指令調整優化預先訓練的多模態大型語言模型 (MLLM)，以增強其真實世界的任務表現。然而，視覺指令資料集的快速擴展引入了顯著的資料冗餘，導致過度的運算成本。現有的資料選取方法主要依賴於代理模型或基於損失的指標，這兩者由於模型推理和反向傳播的必要性而造成大量的運算負擔。為了應對這一挑戰，我們提出了 PRISM，一種用於高效多模態資料選取的新型無訓練方法。與現有方法不同，PRISM 消除了對代理模型、熱身預訓練和基於梯度的優化的依賴。相反，它利用 Pearson 相關分析來量化 MLLM 的內在視覺編碼特性，計算特定任務相關性分數以識別高價值實例。這不僅能選擇資料效率，而且能保持原始效能。跨多個 MLLM 的經驗評估表明，PRISM 將視覺指令調整和資料選取所需的總時間減少到傳統方法的 30%，同時在八個多模態和三個語言理解基準中超越了完全微調的模型，在最終效能上實現了 101.7% 的相對改進。
-
-##### **Scaling Test-Time Compute Without Verification or RL is Suboptimal**
-2502.12118v1 by Amrith Setlur, Nived Rajaraman, Sergey Levine, Aviral Kumar
-
-Despite substantial advances in scaling test-time compute, an ongoing debate
-in the community is how it should be scaled up to enable continued and
-efficient improvements with scaling. There are largely two approaches: first,
-distilling successful search or thinking traces; and second, using verification
-(e.g., 0/1 outcome rewards, reward models, or verifiers) to guide reinforcement
-learning (RL) and search algorithms. In this paper, we prove that finetuning
-LLMs with verifier-based (VB) methods based on RL or search is far superior to
-verifier-free (VF) approaches based on distilling or cloning search traces,
-given a fixed amount of compute/data budget. Further, we show that as we scale
-test-time compute (measured as the output token length) and training data,
-suboptimality of VF methods scales poorly compared to VB when the base
-pre-trained LLM presents a heterogeneous distribution over correct solution
-traces (e.g., different lengths, styles, etc.) and admits a non-sharp
-distribution over rewards on traces sampled from it. We formalize this
-condition using anti-concentration [Erdős, 1945]. This implies a stronger
-result that VB methods scale better asymptotically, with the performance gap
-between VB and VF methods widening as test-time budget grows. We corroborate
-our theory empirically on both didactic and math reasoning problems with
-3/8/32B-sized pre-trained LLMs, where we find verification is crucial for
-scaling test-time compute.
-
-摘要：儘管在擴展測試時間計算方面取得了重大進展，但社群中持續的辯論是如何擴展它以持續有效地改善擴展。大致有兩種方法：首先，提煉成功的搜尋或思考軌跡；其次，使用驗證（例如，0/1 結果獎勵、獎勵模型或驗證器）來指導強化學習 (RL) 和搜尋演算法。在本文中，我們證明使用基於 RL 或搜尋的驗證器為基礎 (VB) 方法微調 LLM 遠優於基於提煉或複製搜尋軌跡的驗證器免費 (VF) 方法，給定固定數量的計算/資料預算。此外，我們表明，當我們擴展測試時間計算（以輸出標記長度衡量）和訓練資料時，與 VB 相比，VF 方法的次最佳性擴展效果不佳，當基礎預先訓練的 LLM 在正確的解決方案軌跡上呈現異質分佈（例如，不同的長度、樣式等）並承認從其中取樣的軌跡上獎勵的分佈不尖銳時。我們使用反集中 [Erd\H{o}s，1945] 將此條件形式化。這暗示了一個更強的結果，即 VB 方法在漸近上擴展得更好，VB 和 VF 方法之間的效能差距隨著測試時間預算的增加而擴大。我們在具有 3/8/32B 大小的預先訓練 LLM 的教學和數學推理問題上對我們的理論進行實證驗證，我們發現驗證對於擴展測試時間計算至關重要。
-
-##### **A-MEM: Agentic Memory for LLM Agents**
-2502.12110v1 by Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, Yongfeng Zhang
-
-While large language model (LLM) agents can effectively use external tools
-for complex real-world tasks, they require memory systems to leverage
-historical experiences. Current memory systems enable basic storage and
-retrieval but lack sophisticated memory organization, despite recent attempts
-to incorporate graph databases. Moreover, these systems' fixed operations and
-structures limit their adaptability across diverse tasks. To address this
-limitation, this paper proposes a novel agentic memory system for LLM agents
-that can dynamically organize memories in an agentic way. Following the basic
-principles of the Zettelkasten method, we designed our memory system to create
-interconnected knowledge networks through dynamic indexing and linking. When a
-new memory is added, we generate a comprehensive note containing multiple
-structured attributes, including contextual descriptions, keywords, and tags.
-The system then analyzes historical memories to identify relevant connections,
-establishing links where meaningful similarities exist. Additionally, this
-process enables memory evolution - as new memories are integrated, they can
-trigger updates to the contextual representations and attributes of existing
-historical memories, allowing the memory network to continuously refine its
-understanding. Our approach combines the structured organization principles of
-Zettelkasten with the flexibility of agent-driven decision making, allowing for
-more adaptive and context-aware memory management. Empirical experiments on six
-foundation models show superior improvement against existing SOTA baselines.
-The source code is available at https://github.com/WujiangXu/AgenticMemory.
-
-摘要：大型語言模型 (LLM) 代理雖然能有效地使用外部工具來執行複雜的真實世界任務，但它們需要記憶體系統來利用歷史經驗。目前的記憶體系統能進行基本的儲存和檢索，但缺乏精密的記憶體組織，儘管最近嘗試納入圖形資料庫。此外，這些系統固定的運作和結構限制了它們在不同任務中的適應性。為了解決這個限制，本文提出了一種新的代理記憶體系統，供 LLM 代理動態地以代理的方式組織記憶體。遵循 Zettelkasten 方法的基本原則，我們設計我們的記憶體系統，透過動態索引和連結來建立相互連結的知識網路。當加入新的記憶體時，我們會產生包含多個結構化屬性的綜合筆記，包括脈絡描述、關鍵字和標籤。然後，系統會分析歷史記憶體以找出相關連結，在有意義的相似性時建立連結。此外，這個程序能讓記憶體演化，因為當整合新的記憶體時，它們會觸發對現有歷史記憶體的脈絡表示和屬性的更新，讓記憶體網路能持續精進它的理解。我們的做法結合了 Zettelkasten 的結構化組織原則和代理驅動決策制定的靈活性，能進行更具適應性和脈絡感知的記憶體管理。在六個基礎模型上的經驗實驗顯示出比現有的 SOTA 基準線有顯著的進步。原始碼可以在 https://github.com/WujiangXu/AgenticMemory 找到。
-
-##### **Personality Structured Interview for Large Language Model Simulation in Personality Research**
-2502.12109v1 by Pengda Wang, Huiqi Zou, Hanjie Chen, Tianjun Sun, Ziang Xiao, Frederick L. Oswald
-
-Although psychometrics researchers have recently explored the use of large
-language models (LLMs) as proxies for human participants, LLMs often fail to
-generate heterogeneous data with human-like diversity, which diminishes their
-value in advancing social science research. To address these challenges, we
-explored the potential of the theory-informed Personality Structured Interview
-(PSI) as a tool for simulating human responses in personality research. In this
-approach, the simulation is grounded in nuanced real-human interview
-transcripts that target the personality construct of interest. We have provided
-a growing set of 357 structured interview transcripts from a representative
-sample, each containing an individual's response to 32 open-ended questions
-carefully designed to gather theory-based personality evidence. Additionally,
-grounded in psychometric research, we have summarized an evaluation framework
-to systematically validate LLM-generated psychometric data. Results from three
-experiments demonstrate that well-designed structured interviews could improve
-human-like heterogeneity in LLM-simulated personality data and predict
-personality-related behavioral outcomes (i.e., organizational citizenship
-behaviors and counterproductive work behavior). We further discuss the role of
-theory-informed structured interviews in LLM-based simulation and outline a
-general framework for designing structured interviews to simulate human-like
-data for psychometric research.
-
-摘要：儘管心理測量研究人員最近已探討將大型語言模型 (LLM) 用作人類參與者的代理，但 LLM 經常無法產生具有類似人類多樣性的異質資料，這降低了它們在推進社會科學研究中的價值。為了應對這些挑戰，我們探討了理論知情的個性結構化訪談 (PSI) 作為模擬人格研究中人類反應的工具的潛力。在此方法中，模擬基於針對目標人格建構的細緻真實人類訪談記錄。我們提供了一組不斷增加的 357 個結構化訪談記錄，來自一個具代表性的樣本，每個記錄都包含個人對 32 個開放式問題的回答，這些問題經過仔細設計，用於收集基於理論的人格證據。此外，基於心理測量研究，我們總結了一個評估架構，以系統性驗證 LLM 生成的精神測量資料。三個實驗的結果表明，設計良好的結構化訪談可以改善 LLM 模擬的人格資料中類似人類的異質性，並預測與人格相關的行為結果（例如，組織公民行為和適得其反的工作行為）。我們進一步討論了理論知情的結構化訪談在基於 LLM 的模擬中的作用，並概述了一個通用框架，用於設計結構化訪談以模擬類似人類的資料，以進行心理測量研究。
-
-##### **Using the Path of Least Resistance to Explain Deep Networks**
-2502.12108v1 by Sina Salek, Joseph Enguehard
-
-Integrated Gradients (IG), a widely used axiomatic path-based attribution
-method, assigns importance scores to input features by integrating model
-gradients along a straight path from a baseline to the input. While effective
-in some cases, we show that straight paths can lead to flawed attributions. In
-this paper, we identify the cause of these misattributions and propose an
-alternative approach that treats the input space as a Riemannian manifold,
-computing attributions by integrating gradients along geodesics. We call this
-method Geodesic Integrated Gradients (GIG). To approximate geodesic paths, we
-introduce two techniques: a k-Nearest Neighbours-based approach for smaller
-models and a Stochastic Variational Inference-based method for larger ones.
-Additionally, we propose a new axiom, Strong Completeness, extending the axioms
-satisfied by IG. We show that this property is desirable for attribution
-methods and that GIG is the only method that satisfies it. Through experiments
-on both synthetic and real-world data, we demonstrate that GIG outperforms
-existing explainability methods, including IG.
-
-摘要：整合梯度 (IG) 是一種廣泛使用的公理路徑歸因方法，它透過整合從基線到輸入的直線路徑上的模型梯度，為輸入特徵分配重要性分數。雖然在某些情況下有效，但我們表明直線路徑可能會導致錯誤的歸因。在本文中，我們找出這些錯誤歸因的原因，並提出將輸入空間視為黎曼流形的替代方法，透過整合測地線上的梯度來計算歸因。我們將此方法稱為測地線整合梯度 (GIG)。為了近似測地線路徑，我們引入了兩種技術：一種基於 k 最近鄰的方法，適用於較小的模型；一種基於隨機變異推論的方法，適用於較大的模型。此外，我們提出了新的公理，即強完整性，擴展了 IG 滿足的公理。我們表明此屬性對於歸因方法而言是理想的，並且 GIG 是唯一滿足此屬性的方法。透過對合成資料和真實世界資料進行的實驗，我們證明 GIG 優於現有的可解釋性方法，包括 IG。
-
-##### **Relational Norms for Human-AI Cooperation**
-2502.12102v1 by Brian D. Earp, Sebastian Porsdam Mann, Mateo Aboy, Edmond Awad, Monika Betzler, Marietjie Botes, Rachel Calcott, Mina Caraccio, Nick Chater, Mark Coeckelbergh, Mihaela Constantinescu, Hossein Dabbagh, Kate Devlin, Xiaojun Ding, Vilius Dranseika, Jim A. C. Everett, Ruiping Fan, Faisal Feroz, Kathryn B. Francis, Cindy Friedman, Orsolya Friedrich, Iason Gabriel, Ivar Hannikainen, Julie Hellmann, Arasj Khodadade Jahrome, Niranjan S. Janardhanan, Paul Jurcys, Andreas Kappes, Maryam Ali Khan, Gordon Kraft-Todd, Maximilian Kroner Dale, Simon M. Laham, Benjamin Lange, Muriel Leuenberger, Jonathan Lewis, Peng Liu, David M. Lyreskog, Matthijs Maas, John McMillan, Emilian Mihailov, Timo Minssen, Joshua Teperowski Monrad, Kathryn Muyskens, Simon Myers, Sven Nyholm, Alexa M. Owen, Anna Puzio, Christopher Register, Madeline G. Reinecke, Adam Safron, Henry Shevlin, Hayate Shimizu, Peter V. Treit, Cristina Voinea, Karen Yan, Anda Zahiu, Renwen Zhang, Hazem Zohny, Walter Sinnott-Armstrong, Ilina Singh, Julian Savulescu, Margaret S. Clark
-
-How we should design and interact with social artificial intelligence depends
-on the socio-relational role the AI is meant to emulate or occupy. In human
-society, relationships such as teacher-student, parent-child, neighbors,
-siblings, or employer-employee are governed by specific norms that prescribe or
-proscribe cooperative functions including hierarchy, care, transaction, and
-mating. These norms shape our judgments of what is appropriate for each
-partner. For example, workplace norms may allow a boss to give orders to an
-employee, but not vice versa, reflecting hierarchical and transactional
-expectations. As AI agents and chatbots powered by large language models are
-increasingly designed to serve roles analogous to human positions - such as
-assistant, mental health provider, tutor, or romantic partner - it is
-imperative to examine whether and how human relational norms should extend to
-human-AI interactions. Our analysis explores how differences between AI systems
-and humans, such as the absence of conscious experience and immunity to
-fatigue, may affect an AI's capacity to fulfill relationship-specific functions
-and adhere to corresponding norms. This analysis, which is a collaborative
-effort by philosophers, psychologists, relationship scientists, ethicists,
-legal experts, and AI researchers, carries important implications for AI
-systems design, user behavior, and regulation. While we accept that AI systems
-can offer significant benefits such as increased availability and consistency
-in certain socio-relational roles, they also risk fostering unhealthy
-dependencies or unrealistic expectations that could spill over into human-human
-relationships. We propose that understanding and thoughtfully shaping (or
-implementing) suitable human-AI relational norms will be crucial for ensuring
-that human-AI interactions are ethical, trustworthy, and favorable to human
-well-being.
-
-摘要：我們應如何設計和與社交人工智慧互動，取決於人工智慧預期要模仿或扮演的社會關係角色。在人類社會中，師生、父母子女、鄰居、兄弟姐妹或雇主員工等關係受特定規範所支配，這些規範規定或禁止包括等級、照顧、交易和交配在內的合作功能。這些規範形塑我們對每個夥伴適當行為的判斷。例如，職場規範可能允許老闆對員工發號施令，但反之則不行，這反映了等級和交易的期望。隨著由大型語言模型驅動的人工智慧代理程式和聊天機器人日益被設計為服務類似於人類職位的角色，例如助理、心理健康提供者、導師或浪漫伴侶，審查人類關係規範是否以及如何延伸至人類與人工智慧的互動至關重要。我們的分析探討了人工智慧系統和人類之間的差異，例如缺乏意識體驗和對疲勞的免疫力，如何影響人工智慧履行特定關係功能和遵守相應規範的能力。這項分析是由哲學家、心理學家、關係科學家、倫理學家、法律專家和人工智慧研究人員共同合作的成果，對人工智慧系統設計、使用者行為和法規具有重要的意義。雖然我們接受人工智慧系統可以在某些社會關係角色中提供顯著的好處，例如增加可用性和一致性，但它們也可能助長不健康的依賴關係或不切實際的期望，這些期望可能會蔓延到人際關係中。我們提出，理解和深思熟慮地塑造（或實施）適當的人類與人工智慧關係規範，對於確保人類與人工智慧的互動具有倫理性、可信賴性和有利於人類福祉至關重要。
-
-##### **A Study on Leveraging Search and Self-Feedback for Agent Reasoning**
-2502.12094v1 by Karthikeyan K, Michelle Yuan, Elman Mansimov, Katerina Margatina, Anurag Pratik, Daniele Bonadiman, Monica Sunkara, Yi Zhang, Yassine Benajiba
-
-Recent works have demonstrated that incorporating search during inference can
-significantly improve reasoning capabilities of language agents. Some
-approaches may make use of the ground truth or rely on the model's own generated
-feedback. The search algorithm uses this feedback to then produce values that
-will update its criterion for exploring and exploiting various reasoning paths.
-In this study, we investigate how search and the model's self-feedback can be
-leveraged for reasoning tasks. First, we explore differences in ground-truth
-feedback and self-feedback during search for math reasoning. Second, we observe
-limitations in applying search techniques to more complex tasks like
-tool-calling and design domain-specific approaches to address these gaps. Our
-experiments reveal challenges related to generalization when solely relying on
-self-feedback during search. For search to work effectively, either access to
-the ground-truth is needed or feedback mechanisms need to be carefully designed
-for the specific task.
-
-摘要：最近的研究表明，在推理过程中加入搜索功能可以显著提升语言代理的推理能力。一些方法可能会利用基本事实或依赖模型本身产生的反馈。搜索算法使用此反馈，然后生成值，以更新其探索和利用各种推理路径的标准。在本研究中，我们调查了如何利用搜索和模型的自反馈来进行推理任务。首先，我们探讨了数学推理搜索过程中基本事实反馈和自反馈的差异。其次，我们观察到在将搜索技术应用于更复杂的任务（如工具调用和设计特定于领域的解决方案）时存在的局限性，并提出针对这些差距的解决方案。我们的实验揭示了在搜索过程中仅依赖自反馈时与泛化相关的挑战。要使搜索有效，需要访问基本事实或需要针对特定任务仔细设计反馈机制。
-
-##### **Meta-Statistical Learning: Supervised Learning of Statistical Inference**
-2502.12088v1 by Maxime Peyrard, Kyunghyun Cho
-
-This work demonstrates that the tools and principles driving the success of
-large language models (LLMs) can be repurposed to tackle distribution-level
-tasks, where the goal is to predict properties of the data-generating
-distribution rather than labels for individual datapoints. These tasks
-encompass statistical inference problems such as parameter estimation,
-hypothesis testing, or mutual information estimation. Framing these tasks
-within traditional machine learning pipelines is challenging, as supervision is
-typically tied to individual datapoints. We propose meta-statistical learning, a
-framework inspired by multi-instance learning that reformulates statistical
-inference tasks as supervised learning problems. In this approach, entire
-datasets are treated as single inputs to neural networks, which predict
-distribution-level parameters. Transformer-based architectures, without
-positional encoding, provide a natural fit due to their permutation-invariance
-properties. By training on large-scale synthetic datasets, meta-statistical
-models can leverage the scalability and optimization infrastructure of
-Transformer-based LLMs. We demonstrate the framework's versatility with
-applications in hypothesis testing and mutual information estimation, showing
-strong performance, particularly for small datasets where traditional neural
-methods struggle.
-
-摘要：这项工作表明，推动大型语言模型 (LLM) 成功发展的工具和原则可以重新用于解决分布级别任务，其中目标是预测数据生成分布的属性，而不是单个数据点的标签。这些任务包括统计推断问题，例如参数估计、假设检验或互信息估计。在传统的机器学习管道中构建这些任务具有挑战性，因为监督通常与单个数据点相关联。我们提出了元统计学习，这是一个受多实例学习启发的框架，它将统计推断任务重新表述为监督学习问题。在此方法中，整个数据集被视为神经网络的单个输入，该神经网络预测分布级别参数。基于 Transformer 的架构在没有位置编码的情况下提供了自然拟合，因为它们具有置换不变性。通过在大型合成数据集上进行训练，元统计模型可以利用基于 Transformer 的 LLM 的可扩展性和优化基础设施。我们通过在假设检验和互信息估计中的应用展示了该框架的多功能性，显示出强大的性能，特别是对于传统神经方法难以处理的小型数据集。
-
-##### **APB: Accelerating Distributed Long-Context Inference by Passing Compressed Context Blocks across GPUs**
-2502.12085v1 by Yuxiang Huang, Mingye Li, Xu Han, Chaojun Xiao, Weilin Zhao, Sun Ao, Hao Zhou, Jie Zhou, Zhiyuan Liu, Maosong Sun
-
-While long-context inference is crucial for advancing large language model
-(LLM) applications, its prefill speed remains a significant bottleneck. Current
-approaches, including sequence parallelism strategies and compute reduction
-through approximate attention mechanisms, still fall short of delivering
-optimal inference efficiency. This hinders scaling the inputs to longer
-sequences and processing long-context queries in a timely manner. To address
-this, we introduce APB, an efficient long-context inference framework that
-leverages multi-host approximate attention to enhance prefill speed by reducing
-compute and enhancing parallelism simultaneously. APB introduces a
-communication mechanism for essential key-value pairs within a sequence
-parallelism framework, enabling a faster inference speed while maintaining task
-performance. We implement APB by incorporating a tailored FlashAttn kernel
-alongside optimized distribution strategies, supporting diverse models and
-parallelism configurations. APB achieves speedups of up to 9.2x, 4.2x, and 1.6x
-compared with FlashAttn, RingAttn, and StarAttn, respectively, without any
-observable task performance degradation. We provide the implementation and
-experiment code of APB at https://github.com/thunlp/APB.
-
-摘要：雖然長文本推理對於推進大型語言模型 (LLM) 應用至關重要，但其預填充速度仍然是一個重大的瓶頸。目前的各種方法，包括序列並行策略和透過近似注意力機制減少運算，仍然無法提供最佳的推理效率。這會阻礙將輸入擴展到更長的序列，以及及時處理長文本查詢。為了解決這個問題，我們引入了 APB，這是一個高效的長文本推理架構，它利用多主機近似注意力來減少運算並同時提高並行性，從而提高預填充速度。APB 在序列並行架構中引入了一個用於基本鍵值對的通訊機制，在維持任務效能的同時，實現更快的推理速度。我們透過整合一個量身打造的 FlashAttn 核心以及最佳化的分佈策略來實作 APB，支援各種模型和並行配置。與 FlashAttn、RingAttn 和 StarAttn 相比，APB 分別實現了高達 9.2 倍、4.2 倍和 1.6 倍的加速，同時沒有任何可觀察到的任務效能下降。我們在 https://github.com/thunlp/APB 中提供了 APB 的實作和實驗程式碼。
-
-##### **VLM$^2$-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit Matching Visual Cues**
-2502.12084v1 by Jianshu Zhang, Dongyu Yao, Renjie Pi, Paul Pu Liang, Yi R. Fung
-
-Visually linking matching cues is a crucial ability in daily life, such as
-identifying the same person in multiple photos based on their cues, even
-without knowing who they are. Despite the extensive knowledge that
-vision-language models (VLMs) possess, it remains largely unexplored whether
-they are capable of performing this fundamental task. To address this, we
-introduce VLM$^2$-Bench, a benchmark designed to assess whether VLMs can
-Visually Link Matching cues, with 9 subtasks and over 3,000 test cases.
-Comprehensive evaluation across eight open-source VLMs and GPT-4o, along with
-further analysis of various language-side and vision-side prompting methods,
-leads to a total of eight key findings. We identify critical challenges in
-models' ability to link visual cues, highlighting a significant performance gap
-where even GPT-4o lags 34.80% behind humans. Based on these insights, we
-advocate for (i) enhancing core visual capabilities to improve adaptability and
-reduce reliance on prior knowledge, (ii) establishing clearer principles for
-integrating language-based reasoning in vision-centric tasks to prevent
-unnecessary biases, and (iii) shifting vision-text training paradigms toward
-fostering models' ability to independently structure and infer relationships
-among visual cues.
-
-摘要：視覺連結匹配線索是日常生活中的關鍵能力，例如在多張照片中根據線索辨識同一個人，即使不知道他們是誰。儘管視覺語言模型 (VLM) 擁有廣泛的知識，但它們是否能執行這項基本任務，在很大程度上仍未被探討。為了解決這個問題，我們引入了 VLM$^2$-Bench，一個基準測試，旨在評估 VLM 是否能視覺連結匹配線索，包含 9 個子任務和超過 3,000 個測試案例。對八個開源 VLM 和 GPT-4o 的全面評估，以及對各種語言側和視覺側提示方法的進一步分析，得出總共八項關鍵發現。我們找出模型連結視覺線索能力的關鍵挑戰，強調一個顯著的效能差距，即使是 GPT-4o 也落後人類 34.80%。根據這些見解，我們提倡 (i) 提升核心視覺能力以改善適應性並減少對先驗知識的依賴，(ii) 為整合基於語言的推理到以視覺為中心的任務中建立更明確的原則，以防止不必要的偏見，以及 (iii) 將視覺文字訓練範例轉移到培養模型獨立建構和推論視覺線索之間關係的能力。
-
-##### **AdaSplash: Adaptive Sparse Flash Attention**
-2502.12082v1 by Nuno Gonçalves, Marcos Treviso, André F. T. Martins
-
-The computational cost of softmax-based attention in transformers limits
-their applicability to long-context tasks. Adaptive sparsity, of which
-$\alpha$-entmax attention is an example, offers a flexible data-dependent
-alternative, but existing implementations are inefficient and do not leverage
-the sparsity to obtain runtime and memory gains. In this work, we propose
-AdaSplash, which combines the efficiency of GPU-optimized algorithms with the
-sparsity benefits of $\alpha$-entmax. We first introduce a hybrid
-Halley-bisection algorithm, resulting in a 7-fold reduction in the number of
-iterations needed to compute the $\alpha$-entmax transformation. Then, we
-implement custom Triton kernels to efficiently handle adaptive sparsity.
-Experiments with RoBERTa and ModernBERT for text classification and
-single-vector retrieval, along with GPT-2 for language modeling, show that our
-method achieves substantial improvements in runtime and memory efficiency
-compared to existing $\alpha$-entmax implementations. It approaches -- and in
-some cases surpasses -- the efficiency of highly optimized softmax
-implementations like FlashAttention-2, enabling long-context training while
-maintaining strong task performance.
-
-摘要：基於 softmax 的注意力在 Transformer 中的運算成本限制了它們在長內容任務中的應用性。適應性稀疏性，其中 $\alpha$-entmax 注意力是一個例子，提供了一個靈活的資料相關替代方案，但現有的實作效率低下，且無法利用稀疏性來獲得執行時間和記憶體的增益。在這項工作中，我們提出了 AdaSplash，它結合了 GPU 最佳化演算法的效率和 $\alpha$-entmax 的稀疏性優點。我們首先引入了一個混合 Halley-二分法演算法，導致計算 $\alpha$-entmax 轉換所需的迭代次數減少了 7 倍。然後，我們實作自訂 Triton 核心，以有效處理適應性稀疏性。針對文字分類和單一向量擷取的 RoBERTa 和 ModernBERT，以及用於語言建模的 GPT-2 的實驗顯示，與現有的 $\alpha$-entmax 實作相比，我們的方法在執行時間和記憶體效率方面獲得了顯著的改善。它接近了 -- 在某些情況下超越了 -- 高度最佳化 softmax 實作（例如 FlashAttention-2）的效率，同時在維持強大任務效能的同時，能夠進行長內容訓練。
-
-##### **Unhackable Temporal Rewarding for Scalable Video MLLMs**
-2502.12081v1 by En Yu, Kangheng Lin, Liang Zhao, Yana Wei, Zining Zhu, Haoran Wei, Jianjian Sun, Zheng Ge, Xiangyu Zhang, Jingyu Wang, Wenbing Tao
-
-In the pursuit of superior video-processing MLLMs, we have encountered a
-perplexing paradox: the "anti-scaling law", where more data and larger models
-lead to worse performance. This study unmasks the culprit: "temporal hacking",
-a phenomenon where models shortcut by fixating on select frames, missing the
-full video narrative. In this work, we systematically establish a comprehensive
-theory of temporal hacking, defining it from a reinforcement learning
-perspective, introducing the Temporal Perplexity (TPL) score to assess this
-misalignment, and proposing the Unhackable Temporal Rewarding (UTR) framework
-to mitigate the temporal hacking. Both theoretically and empirically, TPL
-proves to be a reliable indicator of temporal modeling quality, correlating
-strongly with frame activation patterns. Extensive experiments reveal that UTR
-not only counters temporal hacking but significantly elevates video
-comprehension capabilities. This work not only advances video-AI systems but
-also illuminates the critical importance of aligning proxy rewards with true
-objectives in MLLM development.
-
-摘要：在追求卓越的影片處理 MLLM 時，我們遭遇了一個令人費解的矛盾現象：「反規模化定律」，也就是更多資料和更大的模型會導致更差的效能。本研究揭露了罪魁禍首：「時間駭客」，這是一種模型透過專注於特定影格來簡化的現象，錯失了完整的影片敘事。在這項研究中，我們系統性地建立了一個關於時間駭客的全面理論，從強化學習的角度定義它，並引入了時間困惑度 (TPL) 分數來評估這種失衡，並提出了無法破解的時間獎勵 (UTR) 架構來減輕時間駭客現象。從理論和經驗上來說，TPL 被證明是時間建模品質的可靠指標，與影格啟動模式有很強的相關性。大量的實驗顯示，UTR 不僅對抗時間駭客，還能顯著提升影片理解能力。這項研究不僅推動了影片 AI 系統，也闡明了在 MLLM 開發中，將代理獎勵與真實目標對齊的重要性。
-
-##### **Can LLMs Simulate Social Media Engagement? A Study on Action-Guided Response Generation**
-2502.12073v1 by Zhongyi Qiu, Hanjia Lyu, Wei Xiong, Jiebo Luo
-
-Social media enables dynamic user engagement with trending topics, and recent
-research has explored the potential of large language models (LLMs) for
-response generation. While some studies investigate LLMs as agents for
-simulating user behavior on social media, their focus remains on practical
-viability and scalability rather than a deeper understanding of how well LLM
-aligns with human behavior. This paper analyzes LLMs' ability to simulate
-social media engagement through action guided response generation, where a
-model first predicts a user's most likely engagement action (retweet, quote, or
-rewrite) towards a trending post before generating a personalized response
-conditioned on the predicted action. We benchmark GPT-4o-mini, O1-mini, and
-DeepSeek-R1 in social media engagement simulation regarding a major societal
-event discussed on X. Our findings reveal that zero-shot LLMs underperform BERT
-in action prediction, while few-shot prompting initially degrades the
-prediction accuracy of LLMs with limited examples. However, in response
-generation, few-shot LLMs achieve stronger semantic alignment with ground truth
-posts.
-
-摘要：社交媒體讓使用者能夠動態參與熱門話題，而最近的研究探索了大型語言模型 (LLM) 在回應生成方面的潛力。儘管有些研究將 LLM 視為模擬社交媒體使用者行為的代理，但其重點仍放在實務可行性和可擴充性，而非深入了解 LLM 如何與人類行為相符。本文分析了 LLM 透過動作引導回應生成來模擬社交媒體參與的能力，其中一個模型首先預測使用者最有可能的參與動作（轉推、引用或改寫）對熱門貼文的參與，然後根據預測的動作產生個人化回應。我們在 X 上討論的一個重大社會事件中，對 GPT-4o-mini、O1-mini 和 DeepSeek-R1 進行社交媒體參與模擬的基準測試。我們的研究結果顯示，零次學習 LLM 在動作預測方面表現不如 BERT，而少次學習提示最初會降低範例有限的 LLM 預測準確度。然而，在回應生成方面，少次學習 LLM 與真實貼文達到了更強的語義對齊。
-
-##### **TokenSkip: Controllable Chain-of-Thought Compression in LLMs**
-2502.12067v1 by Heming Xia, Yongqi Li, Chak Tou Leong, Wenjie Wang, Wenjie Li
-
-Chain-of-Thought (CoT) has been proven effective in enhancing the reasoning
-capabilities of large language models (LLMs). Recent advancements, such as
-OpenAI's o1 and DeepSeek-R1, suggest that scaling up the length of CoT
-sequences during inference could further boost LLM reasoning performance.
-However, due to the autoregressive nature of LLM decoding, longer CoT outputs
-lead to a linear increase in inference latency, adversely affecting user
-experience, particularly when the CoT exceeds 10,000 tokens. To address this
-limitation, we analyze the semantic importance of tokens within CoT outputs and
-reveal that their contributions to reasoning vary. Building on this insight, we
-propose TokenSkip, a simple yet effective approach that enables LLMs to
-selectively skip less important tokens, allowing for controllable CoT
-compression. Extensive experiments across various models and tasks demonstrate
-the effectiveness of TokenSkip in reducing CoT token usage while preserving
-strong reasoning performance. Notably, when applied to Qwen2.5-14B-Instruct,
-TokenSkip reduces reasoning tokens by 40% (from 313 to 181) on GSM8K, with less
-than a 0.4% performance drop.
-
-摘要：鏈式思維 (CoT) 已被證明能有效提升大型語言模型 (LLM) 的推理能力。最近的進展，例如 OpenAI 的 o1 和 DeepSeek-R1，表明在推理過程中擴展 CoT 序列的長度可以進一步提升 LLM 的推理效能。然而，由於 LLM 解碼的自動回歸特性，較長的 CoT 輸出會導致推理延遲線性增加，對使用者體驗造成負面影響，特別是在 CoT 超過 10,000 個符號時。為了解決這個限制，我們分析了 CoT 輸出中符號的語義重要性，並揭示了它們對推理的貢獻度不同。基於這個見解，我們提出了 TokenSkip，一種簡單但有效的技術，使 LLM 能有選擇地略過較不重要的符號，從而實現可控的 CoT 壓縮。跨越各種模型和任務的廣泛實驗證明了 TokenSkip 在減少 CoT 符號使用量同時保持強大推理效能方面的有效性。值得注意的是，當應用於 Qwen2.5-14B-Instruct 時，TokenSkip 將 GSM8K 上的推理符號減少了 40%（從 313 個減少到 181 個），效能下降不到 0.4%。
-
-##### **CONSTRUCTA: Automating Commercial Construction Schedules in Fabrication Facilities with Large Language Models**
-2502.12066v1 by Yifan Zhang, Xue Yang
-
-Automating planning with LLMs presents transformative opportunities for
-traditional industries, yet remains underexplored. In commercial construction,
-the complexity of automated scheduling often requires manual intervention to
-ensure precision. We propose CONSTRUCTA, a novel framework leveraging LLMs to
-optimize construction schedules in complex projects like semiconductor
-fabrication. CONSTRUCTA addresses key challenges by: (1) integrating
-construction-specific knowledge through static RAG; (2) employing
-context-sampling techniques inspired by architectural expertise to provide
-relevant input; and (3) deploying Construction DPO to align schedules with
-expert preferences using RLHF. Experiments on proprietary data demonstrate
-performance improvements of +42.3% in missing value prediction, +79.1% in
-dependency analysis, and +28.9% in automated planning compared to baseline
-methods, showcasing its potential to revolutionize construction workflows and
-inspire domain-specific LLM advancements.
-
-摘要：利用 LLM 自動化規劃為傳統產業帶來轉型契機，但仍有待進一步探索。在商業建築中，自動化排程的複雜性通常需要手動介入以確保精確度。我們提出 CONSTRUCTA，一個利用 LLM 優化複雜專案（如半導體製造）建築排程的新穎架構。CONSTRUCTA 透過下列方式解決關鍵挑戰：(1) 整合靜態 RAG 的建築特定知識；(2) 採用受建築專業知識啟發的脈絡取樣技術，提供相關輸入；(3) 部署建築 DPO，使用 RLHF 將排程與專家偏好對齊。專利數據的實驗顯示，與基準方法相比，遺失值預測的效能提升 +42.3%、相依性分析提升 +79.1%、自動化規劃提升 +28.9%，展示其革新建築工作流程和激勵領域特定 LLM 進展的潛力。
-
-##### **Formalizing Complex Mathematical Statements with LLMs: A Study on Mathematical Definitions**
-2502.12065v1 by Lan Zhang, Marco Valentino, Andre Freitas
-
-Thanks to their linguistic capabilities, LLMs offer an opportunity to bridge
-the gap between informal mathematics and formal languages through
-autoformalization. However, it is still unclear how well LLMs generalize to
-sophisticated and naturally occurring mathematical statements. To address this
-gap, we investigate the task of autoformalizing real-world mathematical
-definitions -- a critical component of mathematical discourse. Specifically, we
-introduce two novel resources for autoformalization, collecting definitions
-from Wikipedia (Def_Wiki) and arXiv papers (Def_ArXiv). We then systematically
-evaluate a range of LLMs, analyzing their ability to formalize definitions into
-Isabelle/HOL. Furthermore, we investigate strategies to enhance LLMs'
-performance including refinement through external feedback from Proof
-Assistants, and formal definition grounding, where we guide LLMs through
-relevant contextual elements from formal mathematical libraries. Our findings
-reveal that definitions present a greater challenge compared to existing
-benchmarks, such as miniF2F. In particular, we found that LLMs still struggle
-with self-correction, and aligning with relevant mathematical libraries. At the
-same time, structured refinement methods and definition grounding strategies
-yield notable improvements of up to 16% on self-correction capabilities and 43%
-on the reduction of undefined errors, highlighting promising directions for
-enhancing LLM-based autoformalization in real-world scenarios.
-
-摘要：由於語言能力，LLM 提供了一個機會，透過自動形式化來彌合非正式數學和形式語言之間的差距。然而，LLM 在多麼精巧且自然發生的數學陳述中概化，這仍不清楚。為了解決這個差距，我們探討了自動形式化真實世界數學定義的任務，這是數學論述中的關鍵組成部分。具體來說，我們介紹了自動形式化的兩個新資源，收集來自維基百科（Def_Wiki）和 arXiv 論文（Def_ArXiv）的定義。然後，我們系統性地評估了一系列 LLM，分析它們將定義形式化為 Isabelle/HOL 的能力。此外，我們探討了增強 LLM 效能的策略，包括透過證明輔助工具的外部回饋進行精煉，以及形式定義基礎，其中我們透過形式數學函式庫中的相關脈絡元素來引導 LLM。我們的發現顯示，與現有的基準（例如 miniF2F）相比，定義提出了更大的挑戰。特別是，我們發現 LLM 在自我修正和與相關數學函式庫對齊方面仍然有困難。同時，結構化的精煉方法和定義基礎策略在自我修正能力上產生了顯著的改善，高達 16%，在減少未定義錯誤方面改善了 43%，突顯了在真實世界場景中增強基於 LLM 的自動形式化的有希望的方向。
-
-##### **AI-generated Text Detection with a GLTR-based Approach**
-2502.12064v1 by Lucía Yan Wu, Isabel Segura-Bedmar
-
-The rise of LLMs (Large Language Models) has contributed to the improved
-performance and development of cutting-edge NLP applications. However, these
-can also pose risks when used maliciously, such as spreading fake news, harmful
-content, impersonating individuals, or facilitating school plagiarism, among
-others. This is because LLMs can generate high-quality texts, which are
-challenging to differentiate from those written by humans. GLTR, which stands
-for Giant Language Model Test Room and was developed jointly by the MIT-IBM
-Watson AI Lab and HarvardNLP, is a visual tool designed to help detect
-machine-generated texts based on GPT-2, that highlights the words in text
-depending on the probability that they were machine-generated. One limitation
-of GLTR is that the results it returns can sometimes be ambiguous and lead to
-confusion. This study aims to explore various ways to improve GLTR's
-effectiveness for detecting AI-generated texts within the context of the
-IberLef-AuTexTification 2023 shared task, in both English and Spanish
-languages. Experimental results show that our GLTR-based GPT-2 model outperforms
-the state-of-the-art models on the English dataset with a macro F1-score of
-80.19%, behind only the top-ranked model (80.91%). However, for the Spanish
-dataset, we obtained a macro F1-score of 66.20%, which differs by 4.57%
-compared to the top-performing model.
-
-摘要：大型語言模型 (LLM) 的興起有助於改進尖端 NLP 應用程式的效能和開發。不過，這些應用程式若遭惡意使用，例如散布假新聞、有害內容、冒充個人或協助學校抄襲等，也可能造成風險。這是因為 LLM 可以產生高品質的文字，而這些文字難以與人類所寫的文字區分。GLTR（代表大型語言模型測試室）是由麻省理工學院-IBM Watson AI 實驗室和 HarvardNLP 共同開發的視覺工具，旨在協助偵測基於 GPT-2 的機器產生的文字，它會根據文字中每個字詞機器產生的機率來標示。GLTR 的一個限制在於，它回傳的結果有時可能模稜兩可，容易造成混淆。本研究旨在探討各種方法來改善 GLTR 在 IberLef-AuTexTification 2023 共享任務中偵測 AI 生成的文字的效能，任務中包含英文和西班牙文兩種語言。實驗結果顯示，我們的基於 GLTR 的 GPT-2 模型在英文資料集上以 80.19% 的巨觀 F1 分數超越了最先進的模型，僅次於第一名排名模型 (80.91%)。不過，在西班牙文資料集上，我們獲得的巨觀 F1 分數為 66.20%，與表現最佳的模型相比，相差 4.57%。
-
-##### **Culture is Not Trivia: Sociocultural Theory for Cultural NLP**
-2502.12057v1 by Naitian Zhou, David Bamman, Isaac L. Bleaman
-
-The field of cultural NLP has recently experienced rapid growth, driven by a
-pressing need to ensure that language technologies are effective and safe
-across a pluralistic user base. This work has largely progressed without a
-shared conception of culture, instead choosing to rely on a wide array of
-cultural proxies. However, this leads to a number of recurring limitations:
-coarse national boundaries fail to capture nuanced differences that lie within
-them, limited coverage restricts datasets to only a subset of usually
-highly-represented cultures, and a lack of dynamicity results in static
-cultural benchmarks that do not change as culture evolves. In this position
-paper, we argue that these methodological limitations are symptomatic of a
-theoretical gap. We draw on a well-developed theory of culture from
-sociocultural linguistics to fill this gap by 1) demonstrating in a case study
-how it can clarify methodological constraints and affordances, 2) offering
-theoretically-motivated paths forward to achieving cultural competence, and 3)
-arguing that localization is a more useful framing for the goals of much
-current work in cultural NLP.
-
-摘要：文化 NLP 領域最近經歷了快速成長，這是因為迫切需要確保語言技術對於多元化的使用者基礎而言是有效且安全的。這項工作在很大程度上沒有文化共識，而是選擇依賴各種文化代理。然而，這導致了許多重複性的限制：粗略的國家界線無法捕捉到其中的細微差異，有限的涵蓋範圍將資料集限制在通常高度代表的文化子集，而且缺乏動態性導致靜態文化基準無法隨著文化演變而改變。在這篇立場文件中，我們認為這些方法論限制是理論差距的徵兆。我們從社會文化語言學中汲取一個發展良好的文化理論，透過 1) 在個案研究中展示它如何釐清方法論限制和可負擔性，2) 提供理論上合理的途徑來實現文化能力，以及 3) 主張在地化對於文化 NLP 中許多當前工作的目標而言是一個更有用的框架，來填補這個差距。
-
-##### **Designing Role Vectors to Improve LLM Inference Behaviour**
-2502.12055v1 by Daniele Potertì, Andrea Seveso, Fabio Mercorio
-
-The influence of personas on Large Language Models (LLMs) has been widely
-studied, yet their direct impact on performance remains uncertain. This work
-explores a novel approach to guiding LLM behaviour through role vectors, an
-alternative to persona-based prompting. We construct 29 role vectors derived
-from model activations and evaluate their impact on benchmark performance
-across multiple domains. Our analysis investigates whether these vectors can
-effectively steer models toward domain-specific expertise. We measure two key
-interventions: (i) activation addition, which reinforces role-specific
-directions, and (ii) directional ablation, which removes them. Results on
-well-established benchmarks indicate that role vectors do, in fact, influence
-model behaviour, improving task performance in relevant domains while
-marginally affecting unrelated tasks. This, in turn, suggests that manipulating
-internal model representations has a greater impact on outcomes than
-persona-based prompting.
-
-摘要：大型語言模型 (LLM) 中角色的影響已被廣泛研究，但它們對效能的直接影響仍然不確定。本研究探討了一種透過角色向量引導 LLM 行為的新方法，這是一種基於角色提示的替代方案。我們從模型激活中建構了 29 個角色向量，並評估它們對多個領域基準效能的影響。我們的分析探討了這些向量是否能有效地引導模型朝向特定領域的專業知識。我們衡量了兩個關鍵干預措施：(i) 激活新增，它加強了特定角色的方向，以及 (ii) 方向消融，它移除了這些方向。在既定基準上的結果表明，角色向量確實會影響模型行為，在相關領域中改善任務效能，同時對不相關任務的影響很小。這反過來表明，操縱內部模型表示對結果的影響比基於角色的提示更大。
-
-##### **PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning**
-2502.12054v1 by Xinyu Zhang, Yuxuan Dong, Yanrui Wu, Jiaxing Huang, Chengyou Jia, Basura Fernando, Mike Zheng Shou, Lingling Zhang, Jun Liu
-
-Large language models demonstrate remarkable capabilities across various
-domains, especially mathematics and logic reasoning. However, current
-evaluations overlook physics-based reasoning - a complex task requiring physics
-theorems and constraints. We present PhysReason, a 1,200-problem benchmark
-comprising knowledge-based (25%) and reasoning-based (75%) problems, where the
-latter are divided into three difficulty levels (easy, medium, hard). Notably,
-problems require an average of 8.1 solution steps, with hard requiring 15.6,
-reflecting the complexity of physics-based reasoning. We propose the Physics
-Solution Auto Scoring Framework, incorporating efficient answer-level and
-comprehensive step-level evaluations. Top-performing models like Deepseek-R1,
-Gemini-2.0-Flash-Thinking, and o3-mini-high achieve less than 60% on
-answer-level evaluation, with performance dropping from knowledge questions
-(75.11%) to hard problems (31.95%). Through step-level evaluation, we
-identified four key bottlenecks: Physics Theorem Application, Physics Process
-Understanding, Calculation, and Physics Condition Analysis. These findings
-position PhysReason as a novel and comprehensive benchmark for evaluating
-physics-based reasoning capabilities in large language models. Our code and
-data will be published at https://dxzxy12138.github.io/PhysReason.
-
-摘要：大型語言模型展示了在各個領域的非凡能力，特別是數學和邏輯推理。然而，目前的評估忽略了基於物理的推理——這是一項複雜的任務，需要物理定理和約束。我們提出了 PhysReason，一個包含 1,200 題的基準，包含基於知識的（25%）和基於推理的（75%）問題，後者分為三個難度等級（容易、中等、困難）。值得注意的是，問題需要平均 8.1 個求解步驟，困難的需要 15.6 個，反映了基於物理的推理的複雜性。我們提出了物理解決方案自動評分框架，結合了高效的答案級別和全面的步驟級別評估。Deepseek-R1、Gemini-2.0-Flash-Thinking 和 o3-mini-high 等表現最佳的模型在答案級別評估中獲得低於 60% 的分數，性能從知識問題（75.11%）下降到困難問題（31.95%）。通過步驟級別評估，我們確定了四個關鍵瓶頸：物理定理應用、物理過程理解、計算和物理條件分析。這些發現將 PhysReason 定位為一個新穎且全面的基準，用於評估大型語言模型中基於物理的推理能力。我們的代碼和數據將發布在 https://dxzxy12138.github.io/PhysReason。
-
-##### **A Dual-Perspective NLG Meta-Evaluation Framework with Automatic Benchmark and Better Interpretability**
-2502.12052v1 by Xinyu Hu, Mingqi Gao, Li Lin, Zhenghan Yu, Xiaojun Wan
-
-In NLG meta-evaluation, evaluation metrics are typically assessed based on
-their consistency with humans. However, we identify some limitations in
-traditional NLG meta-evaluation approaches, such as issues in handling human
-ratings and ambiguous selections of correlation measures, which undermine the
-effectiveness of meta-evaluation. In this work, we propose a dual-perspective
-NLG meta-evaluation framework that focuses on different evaluation
-capabilities, thereby providing better interpretability. In addition, we
-introduce a method of automatically constructing the corresponding benchmarks
-without requiring new human annotations. Furthermore, we conduct experiments
-with 16 representative LLMs as the evaluators based on our proposed framework,
-comprehensively analyzing their evaluation performance from different
-perspectives.
-
-摘要：在 NLG 元評估中，評估指標通常根據其與人類的一致性進行評估。然而，我們在傳統的 NLG 元評估方法中發現了一些限制，例如在處理人類評分和模稜兩可的相關性測量選擇方面存在問題，這會損害元評估的有效性。在這項工作中，我們提出了一個雙視角 NLG 元評估框架，該框架專注於不同的評估能力，從而提供更好的可解釋性。此外，我們引入了一種自動構建相應基準的方法，而不需要新的手動註釋。此外，我們根據我們提出的框架對 16 個具有代表性的 LLM 作為評估器進行了實驗，從不同的角度全面分析了它們的評估性能。
-
-##### **How to Upscale Neural Networks with Scaling Law? A Survey and Practical Guidelines**
-2502.12051v1 by Ayan Sengupta, Yash Goel, Tanmoy Chakraborty
-
-Neural scaling laws have revolutionized the design and optimization of
-large-scale AI models by revealing predictable relationships between model
-size, dataset volume, and computational resources. Early research established
-power-law relationships in model performance, leading to compute-optimal
-scaling strategies. However, recent studies highlighted their limitations
-across architectures, modalities, and deployment contexts. Sparse models,
-mixture-of-experts, retrieval-augmented learning, and multimodal models often
-deviate from traditional scaling patterns. Moreover, scaling behaviors vary
-across domains such as vision, reinforcement learning, and fine-tuning,
-underscoring the need for more nuanced approaches. In this survey, we
-synthesize insights from over 50 studies, examining the theoretical
-foundations, empirical findings, and practical implications of scaling laws. We
-also explore key challenges, including data efficiency, inference scaling, and
-architecture-specific constraints, advocating for adaptive scaling strategies
-tailored to real-world applications. We suggest that while scaling laws provide
-a useful guide, they do not always generalize across all architectures and
-training strategies.
-
-摘要：神經網路規模定律透過揭示模型規模、資料集體積和計算資源之間可預測的關係，徹底革新了大型 AI 模型的設計和最佳化。早期研究建立了模型效能中的冪次定律關係，進而產生最佳化的運算規模策略。然而，最近的研究突出了它們在架構、模態和部署脈絡中的限制。稀疏模型、專家混合、檢索增強式學習和多模態模型通常偏離傳統的規模模式。此外，規模行為因視覺、強化學習和微調等領域而異，強調需要更細緻的方法。在這項調查中，我們綜合了 50 多項研究的見解，探討規模定律的理論基礎、實證發現和實務意涵。我們也探討了關鍵挑戰，包括資料效率、推論規模和特定於架構的限制，提倡針對實際應用量身打造的自適應規模策略。我們建議，儘管規模定律提供了有用的指南，但它們並不總是能概括到所有架構和訓練策略。
-
-##### **SpeechT: Findings of the First Mentorship in Speech Translation**
-2502.12050v1 by Yasmin Moslem, Juan Julián Cea Morán, Mariano Gonzalez-Gomez, Muhammad Hazim Al Farouq, Farah Abdou, Satarupa Deb
-
-This work presents the details and findings of the first mentorship in speech
-translation (SpeechT), which took place in December 2024 and January 2025. To
-fulfil the requirements of the mentorship, the participants engaged in key
-activities, including data preparation, modelling, and advanced research.
-
-摘要：本研究報告了 2024 年 12 月和 2025 年 1 月舉行的首次語音翻譯 (SpeechT) 指導計畫的詳細資訊和發現。為了滿足指導計畫的要求，參與者參與了關鍵活動，包括資料準備、建模和進階研究。
-
-##### **A Survey on Bridging EEG Signals and Generative AI: From Image and Text to Beyond**
-2502.12048v1 by Shreya Shukla, Jose Torres, Abhijit Mishra, Jacek Gwizdka, Shounak Roychowdhury
-
-Integration of Brain-Computer Interfaces (BCIs) and Generative Artificial
-Intelligence (GenAI) has opened new frontiers in brain signal decoding,
-enabling assistive communication, neural representation learning, and
-multimodal integration. BCIs, particularly those leveraging
-Electroencephalography (EEG), provide a non-invasive means of translating
-neural activity into meaningful outputs. Recent advances in deep learning,
-including Generative Adversarial Networks (GANs) and Transformer-based Large
-Language Models (LLMs), have significantly improved EEG-based generation of
-images, text, and speech. This paper provides a literature review of the
-state-of-the-art in EEG-based multimodal generation, focusing on (i)
-EEG-to-image generation through GANs, Variational Autoencoders (VAEs), and
-Diffusion Models, and (ii) EEG-to-text generation leveraging Transformer based
-language models and contrastive learning methods. Additionally, we discuss the
-emerging domain of EEG-to-speech synthesis, an evolving multimodal frontier. We
-highlight key datasets, use cases, challenges, and EEG feature encoding methods
-that underpin generative approaches. By providing a structured overview of
-EEG-based generative AI, this survey aims to equip researchers and
-practitioners with insights to advance neural decoding, enhance assistive
-technologies, and expand the frontiers of brain-computer interaction.
-
-摘要：腦機介面（BCIs）與生成式人工智慧（GenAI）的整合為腦信號解碼開啟了新領域，能協助溝通、神經表徵學習與多模式整合。BCIs，特別是利用腦電圖（EEG）的 BCIs，提供了一種非侵入性的方式，可將神經活動轉換為有意義的輸出。深度學習的最新進展，包括生成對抗網路（GANs）與基於 Transformer 的大型語言模型（LLMs），大幅改善了基於 EEG 的影像、文字與語音生成。本文提供了一份基於 EEG 的多模式生成的最新文獻回顧，重點在於（一）透過 GANs、變異自動編碼器（VAEs）與擴散模型進行 EEG 到影像的生成，以及（二）利用基於 Transformer 的語言模型與對比學習方法進行 EEG 到文字的生成。此外，我們討論了 EEG 到語音合成的新興領域，這是一個不斷演進的多模式領域。我們重點介紹了關鍵的資料集、用例、挑戰與支撐生成方法的 EEG 特徵編碼方法。透過提供基於 EEG 的生成式 AI 的結構化概觀，本調查旨在為研究人員與從業人員提供見解，以推進神經解碼、增強輔助技術並擴展腦機互動的領域。
-
-##### **KnowPath: Knowledge-enhanced Reasoning via LLM-generated Inference Paths over Knowledge Graphs**
-2502.12029v1 by Qi Zhao, Hongyu Yang, Qi Song, Xinwei Yao, Xiangyang Li
-
-Large language models (LLMs) have demonstrated remarkable capabilities in
-various complex tasks, yet they still suffer from hallucinations. Introducing
-external knowledge, such as knowledge graph, can enhance the LLMs' ability to
-provide factual answers. LLMs have the ability to interactively explore
-knowledge graphs. However, most approaches have been affected by insufficient
-internal knowledge excavation in LLMs, limited generation of trustworthy
-knowledge reasoning paths, and a vague integration between internal and
-external knowledge. Therefore, we propose KnowPath, a knowledge-enhanced large
-model framework driven by the collaboration of internal and external knowledge.
-It relies on the internal knowledge of the LLM to guide the exploration of
-interpretable directed subgraphs in external knowledge graphs, better
-integrating the two knowledge sources for more accurate reasoning. Extensive
-experiments on multiple real-world datasets confirm the superiority of
-KnowPath.
-
-摘要：大型語言模型 (LLM) 已在各種複雜任務中展現出卓越的能力，但仍會出現幻覺。引入外部知識（例如知識圖譜）可以增強 LLM 提供事實答案的能力。LLM 有能力互動式地探索知識圖譜。然而，大多數方法都受到 LLM 中內部知識挖掘不足、可信賴知識推理路徑生成受限，以及內部和外部知識之間的整合模糊的影響。因此，我們提出 KnowPath，這是一個由內部和外部知識的協作驅動的知識增強型大型模型框架。它依賴於 LLM 的內部知識來指導對外部知識圖譜中可解釋的有向子圖的探索，更好地整合兩個知識來源以進行更準確的推理。對多個真實世界資料集進行的大量實驗證實了 KnowPath 的優越性。
-
-##### **SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities**
-2502.12025v1 by Fengqing Jiang, Zhangchen Xu, Yuetai Li, Luyao Niu, Zhen Xiang, Bo Li, Bill Yuchen Lin, Radha Poovendran
-
-Emerging large reasoning models (LRMs), such as DeepSeek-R1 models, leverage
-long chain-of-thought (CoT) reasoning to generate structured intermediate
-steps, enhancing their reasoning capabilities. However, long CoT does not
-inherently guarantee safe outputs, potentially leading to harmful consequences
-such as the introduction of security vulnerabilities in code or the spread of
-misinformation. Current research on large language model (LLM) safety usually
-focuses on short-answer responses, overlooking the long CoT style outputs of
-LRMs. To bridge this gap, we conduct a systematic study of LRM safety. First,
-we investigate safety evaluators calibrated against human annotations. Using
-our newly developed metrics, we thoroughly assess the safety of 12
-state-of-the-art LRMs on StrongReject and WildJailbreak datasets. Our results
-show that the safety of LRMs lags behind their reasoning advances. Further, we
-perform a fine-grained analysis of the reasoning trace and final answer. We
-find that three decoding strategies (ZeroThink, LessThink, and MoreThink) can
-improve model safety without additional training. However, these strategies
-either use constrained reasoning traces or incur high inference costs. To
-better strengthen LRM safety, we introduce SafeChain, the first-of-its-kind
-safety training dataset in CoT style. We fine-tune two LRMs with SafeChain,
-showing that it not only enhances model safety but also preserves performance
-across 6 reasoning benchmarks.
-
-摘要：新興的大型推理模型（LRM），例如 DeepSeek-R1 模型，利用長鏈思考（CoT）推理來生成結構化的中間步驟，增強其推理能力。然而，長 CoT 本質上並不能保證安全的輸出，可能會導致有害的後果，例如在程式碼中引入安全漏洞或散佈錯誤訊息。目前針對大型語言模型（LLM）安全性的研究通常側重於簡短的回答回應，忽略了 LRM 的長 CoT 風格輸出。為了彌補這個差距，我們對 LRM 安全性進行系統性研究。首先，我們研究根據人類註解校正的安全評估器。使用我們新開發的指標，我們徹底評估了 12 個最先進的 LRM 在 StrongReject 和 WildJailbreak 資料集上的安全性。我們的結果表明，與其推理進度相比，LRM 並不安全。此外，我們對推理軌跡和最終答案進行了細粒度分析。我們發現三種解碼策略（ZeroThink、LessThink 和 MoreThink）可以在不額外訓練的情況下提高模型安全性。然而，這些策略要么使用受約束的推理軌跡，要么會產生高昂的推論成本。為了進一步加強 LRM 安全性，我們引入了 SafeChain，這是第一個 CoT 風格的安全訓練資料集。我們使用 SafeChain 微調了兩個 LRM，表明它不僅增強了模型安全性，而且在 6 個推理基準測試中都保持了效能。
-
-##### **Teaching LLMs According to Their Aptitude: Adaptive Reasoning for Mathematical Problem Solving**
-2502.12022v1 by Xin Xu, Yan Xu, Tianhao Chen, Yuchen Yan, Chengwu Liu, Zaoyu Chen, Yufei Wang, Yichun Yin, Yasheng Wang, Lifeng Shang, Qun Liu
-
-Existing approaches to mathematical reasoning with large language models
-(LLMs) rely on Chain-of-Thought (CoT) for generalizability or Tool-Integrated
-Reasoning (TIR) for precise computation. While efforts have been made to
-combine these methods, they primarily rely on post-selection or predefined
-strategies, leaving an open question: whether LLMs can autonomously adapt their
-reasoning strategy based on their inherent capabilities. In this work, we
-propose TATA (Teaching LLMs According to Their Aptitude), an adaptive framework
-that enables LLMs to personalize their reasoning strategy spontaneously,
-aligning it with their intrinsic aptitude. TATA incorporates base-LLM-aware
-data selection during supervised fine-tuning (SFT) to tailor training data to
-the model's unique abilities. This approach equips LLMs to autonomously
-determine and apply the appropriate reasoning strategy at test time. We
-evaluate TATA through extensive experiments on six mathematical reasoning
-benchmarks, using both general-purpose and math-specialized LLMs. Empirical
-results demonstrate that TATA effectively combines the complementary strengths
-of CoT and TIR, achieving superior or comparable performance with improved
-inference efficiency compared to TIR alone. Further analysis underscores the
-critical role of aptitude-aware data selection in enabling LLMs to make
-effective and adaptive reasoning decisions and align reasoning strategies with
-model capabilities.
-
-摘要：現有的數學推理方法使用大型語言模型 (LLM) 仰賴思考鏈 (CoT) 來達到泛化性，或使用工具整合推理 (TIR) 來進行精確運算。儘管已有人嘗試結合這些方法，但它們主要依賴後選取或預定義策略，留下一個開放性的問題：LLM 是否能根據其內在能力自主調整其推理策略。在這項工作中，我們提出 TATA（根據其天賦來教授 LLM），這是一個適應性架構，讓 LLM 能夠自發地個人化其推理策略，並與其內在的天賦保持一致。TATA 在監督微調 (SFT) 期間納入了基礎 LLM 感知資料選取，以根據模型的獨特能力調整訓練資料。此方法讓 LLM 能夠在測試時自主決定並套用適當的推理策略。我們透過對六個數學推理基準進行廣泛的實驗來評估 TATA，使用通用和數學專用 LLM。經驗結果顯示，TATA 有效地結合了 CoT 和 TIR 的互補優勢，與僅使用 TIR 相比，達到了優越或相當的效能，並改善了推論效率。進一步的分析強調了天賦感知資料選取在讓 LLM 能夠做出有效且適應性的推理決策，並將推理策略與模型能力保持一致時所扮演的關鍵角色。
-
-##### **Atom of Thoughts for Markov LLM Test-Time Scaling**
-2502.12018v1 by Fengwei Teng, Zhaoyang Yu, Quan Shi, Jiayi Zhang, Chenglin Wu, Yuyu Luo
-
-Large Language Models (LLMs) achieve superior performance through
-training-time scaling, and test-time scaling further enhances their
-capabilities by conducting effective reasoning during inference. However, as
-the scale of reasoning increases, existing test-time scaling methods suffer
-from accumulated historical information, which not only wastes computational
-resources but also interferes with effective reasoning. To address this issue,
-we observe that complex reasoning progress is often achieved by solving a
-sequence of independent subquestions, each being self-contained and verifiable.
-These subquestions are essentially atomic questions, relying primarily on their
-current state rather than accumulated history, similar to the memoryless
-transitions in a Markov process. Based on this observation, we propose Atom of
-Thoughts (AoT), where each state transition in the reasoning process consists
-of decomposing the current question into a dependency-based directed acyclic
-graph and contracting its subquestions, forming a new atomic question state.
-This iterative decomposition-contraction process continues until reaching
-directly solvable atomic questions, naturally realizing Markov transitions
-between question states. Furthermore, these atomic questions can be seamlessly
-integrated into existing test-time scaling methods, enabling AoT to serve as a
-plug-in enhancement for improving reasoning capabilities. Experiments across
-six benchmarks demonstrate the effectiveness of AoT both as a standalone
-framework and a plug-in enhancement. Notably, on HotpotQA, when applied to
-gpt-4o-mini, AoT achieves an 80.6% F1 score, surpassing o3-mini by 3.4% and
-DeepSeek-R1 by 10.6%. The code will be available at
-https://github.com/qixucen/atom.
-
-摘要：大型語言模型 (LLM) 透過訓練時間擴充來達成卓越的效能，而測試時間擴充透過在推論期間進行有效的推理，進一步提升其能力。然而，隨著推理規模的擴大，現有的測試時間擴充方法會受到累積的歷史資訊影響，這不僅會浪費運算資源，還會干擾有效的推理。為了解決這個問題，我們觀察到複雜的推理進程通常是透過解決一系列獨立的子問題來達成，每個子問題都是獨立且可驗證的。這些子問題本質上是原子問題，主要依賴於它們的當前狀態，而不是累積的歷史，類似於馬可夫過程中的無記憶轉換。基於這個觀察，我們提出了思想原子 (AoT)，其中推理過程中每個狀態轉換都包含將當前問題分解為基於依賴關係的有向無環圖，並收縮其子問題，形成新的原子問題狀態。這個反覆的分解收縮過程會持續進行，直到達到可直接解決的原子問題，自然地實現問題狀態之間的馬可夫轉換。此外，這些原子問題可以無縫整合到現有的測試時間擴充方法中，讓 AoT 可以作為外掛程式強化功能，以改善推理能力。橫跨六個基準的實驗證明了 AoT 作為獨立架構和外掛程式強化的有效性。值得注意的是，在 HotpotQA 上，當應用於 gpt-4o-mini 時，AoT 達到了 80.6% 的 F1 分數，比 o3-mini 高出 3.4%，比 DeepSeek-R1 高出 10.6%。程式碼將在 https://github.com/qixucen/atom 上提供。
-
-##### **Demographic Attributes Prediction from Speech Using WavLM Embeddings**
-2502.12007v1 by Yuchen Yang, Thomas Thebaud, Najim Dehak
-
-This paper introduces a general classifier based on WavLM features, to infer
-demographic characteristics, such as age, gender, native language, education,
-and country, from speech. Demographic feature prediction plays a crucial role
-in applications like language learning, accessibility, and digital forensics,
-enabling more personalized and inclusive technologies. Leveraging pretrained
-models for embedding extraction, the proposed framework identifies key acoustic
-and linguistic features associated with demographic attributes, achieving a
-Mean Absolute Error (MAE) of 4.94 for age prediction and over 99.81% accuracy
-for gender classification across various datasets. Our system improves upon
-existing models by a relative margin of up to 30% in MAE and up to 10% in accuracy
-and F1 scores across tasks, leveraging a diverse range of datasets and large
-pretrained models to ensure robustness and generalizability. This study offers
-new insights into speaker diversity and provides a strong foundation for future
-research in speech-based demographic profiling.
-
-摘要：本文介紹一個基於 WavLM 特徵的一般分類器，用於從語音中推斷人口特徵，例如年齡、性別、母語、教育和國家。人口特徵預測在語言學習、無障礙性和數位鑑識等應用中扮演著至關重要的角色，能實現更個人化且包容性的技術。利用預先訓練的模型進行嵌入式萃取，提出的架構識別與人口屬性相關的主要音訊和語言特徵，在年齡預測中達到 4.94 的平均絕對誤差 (MAE)，在各種資料集中的性別分類中準確率超過 99.81%。我們的系統在平均絕對誤差上比現有模型提升了相對 30%，在準確率和 F1 分數上提升了相對 10%，利用各種資料集和大型預先訓練模型來確保穩健性和概括性。本研究提供了對說話者多元性的新見解，並為未來基於語音的人口特徵分析研究奠定了堅實的基礎。
-
-##### **Merging Language and Domain Specific Models: The Impact on Technical Vocabulary Acquisition**
-2502.12001v1 by Thibault Rousset, Taisei Kakibuchi, Yusuke Sasaki, Yoshihide Nomura
-
-This paper investigates the integration of technical vocabulary in merged
-language models. We explore the knowledge transfer mechanisms involved when
-combining a general-purpose language-specific model with a domain-specific
-model, focusing on the resulting model's comprehension of technical jargon. Our
-experiments analyze the impact of this merging process on the target model's
-proficiency in handling specialized terminology. We present a quantitative
-evaluation of the performance of the merged model, comparing it with that of
-the individual constituent models. The findings offer insights into the
-effectiveness of different model merging methods for enhancing domain-specific
-knowledge and highlight potential challenges and future directions in
-leveraging these methods for cross-lingual knowledge transfer in Natural
-Language Processing.
-
-摘要：本文探討了技術詞彙在合併語言模型中的整合。我們探討了結合一般用途語言特定模型與特定領域模型時所涉及的知識轉移機制，重點在於所產生模型對技術術語的理解。我們的實驗分析了此合併程序對目標模型處理專業術語能力的影響。我們提出了合併模型效能的量化評估，並將其與個別組成模型的效能進行比較。這些發現提供了見解，說明了不同模型合併方法在增強特定領域知識方面的效能，並強調了利用這些方法進行自然語言處理中跨語言知識轉移的潛在挑戰和未來方向。
-
-##### **Presumed Cultural Identity: How Names Shape LLM Responses**
-2502.11995v1 by Siddhesh Pawar, Arnav Arora, Lucie-Aimée Kaffee, Isabelle Augenstein
-
-Names are deeply tied to human identity. They can serve as markers of
-individuality, cultural heritage, and personal history. However, using names as
-a core indicator of identity can lead to over-simplification of complex
-identities. When interacting with LLMs, user names are an important point of
-information for personalisation. Names can enter chatbot conversations through
-direct user input (requested by chatbots), as part of task contexts such as CV
-reviews, or as built-in memory features that store user information for
-personalisation. We study biases associated with names by measuring cultural
-presumptions in the responses generated by LLMs when presented with common
-suggestion-seeking queries, which might involve making assumptions about the
-user. Our analyses demonstrate strong assumptions about cultural identity
-associated with names present in LLM generations across multiple cultures. Our
-work has implications for designing more nuanced personalisation systems that
-avoid reinforcing stereotypes while maintaining meaningful customisation.
-
-摘要：姓名與人類身分密不可分。它們可以作為個人特質、文化遺產和個人歷史的標記。然而，將姓名作為身分的核心指標可能會導致複雜身分的過度簡化。在與 LLM 互動時，使用者名稱是個人化的重要資訊點。姓名可以透過直接使用者輸入（聊天機器人要求）、作為履歷審查等任務情境的其中一部分，或作為儲存使用者資訊以供個人化的內建記憶功能，進入聊天機器人對話。我們透過衡量 LLM 在面對常見的建議尋求查詢時所產生的回應中的文化預設，來研究與姓名相關的偏見，這可能涉及對使用者的假設。我們的分析顯示，在跨多種文化的 LLM 世代中，與姓名相關的文化身分有強烈的假設。我們的研究對於設計更細緻的個人化系統有影響，這些系統避免強化刻板印象，同時維持有意義的客製化。
-
-##### **Characterizing Photorealism and Artifacts in Diffusion Model-Generated Images**
-2502.11989v1 by Negar Kamali, Karyn Nakamura, Aakriti Kumar, Angelos Chatzimparmpas, Jessica Hullman, Matthew Groh
-
-Diffusion model-generated images can appear indistinguishable from authentic
-photographs, but these images often contain artifacts and implausibilities that
-reveal their AI-generated provenance. Given the challenge to public trust in
-media posed by photorealistic AI-generated images, we conducted a large-scale
-experiment measuring human detection accuracy on 450 diffusion-model generated
-images and 149 real images. Based on collecting 749,828 observations and 34,675
-comments from 50,444 participants, we find that scene complexity of an image,
-artifact types within an image, display time of an image, and human curation of
-AI-generated images all play significant roles in how accurately people
-distinguish real from AI-generated images. Additionally, we propose a taxonomy
-characterizing artifacts often appearing in images generated by diffusion
-models. Our empirical observations and taxonomy offer nuanced insights into the
-capabilities and limitations of diffusion models to generate photorealistic
-images in 2024.
-
-摘要：擴散模型生成的影像看起來可能與真實照片無異，但這些影像通常包含人工智慧生成來源的瑕疵和不合理之處。由於寫實的人工智慧生成影像對公眾對媒體的信任構成挑戰，我們進行了一項大規模實驗，測量人類對 450 張擴散模型生成影像和 149 張真實影像的檢測準確度。根據收集自 50,444 位參與者的 749,828 次觀察和 34,675 則評論，我們發現影像的場景複雜性、影像中的瑕疵類型、影像的顯示時間，以及人類對人工智慧生成影像的策展，在人們準確區分真實影像和人工智慧生成影像方面都扮演重要的角色。此外，我們提出了一種分類法，用於描述經常出現在擴散模型生成的影像中的瑕疵。我們的經驗觀察和分類法為擴散模型在 2024 年生成寫實影像的能力和限制提供了細緻的見解。
-
-##### **Generating Text from Uniform Meaning Representation**
-2502.11973v1 by Emma Markle, Reihaneh Iranmanesh, Shira Wein
-
-Uniform Meaning Representation (UMR) is a recently developed graph-based
-semantic representation, which expands on Abstract Meaning Representation (AMR)
-in a number of ways, in particular through the inclusion of document-level
-information and multilingual flexibility. In order to effectively adopt and
-leverage UMR for downstream tasks, efforts must be placed toward developing a
-UMR technological ecosystem. Though still limited amounts of UMR annotations
-have been produced to date, in this work, we investigate the first approaches
-to producing text from multilingual UMR graphs: (1) a pipeline conversion of
-UMR to AMR, then using AMR-to-text generation models, (2) fine-tuning large
-language models with UMR data, and (3) fine-tuning existing AMR-to-text
-generation models with UMR data. Our best performing model achieves a
-multilingual BERTscore of 0.825 for English and 0.882 for Chinese when compared
-to the reference, which is a promising indication of the effectiveness of
-fine-tuning approaches for UMR-to-text generation with even limited amounts of
-UMR data.
-
-摘要：統一語意表示 (UMR) 是一種最近開發的基於圖形的語意表示，它在許多方面擴展了抽象語意表示 (AMR)，特別是透過納入文件層級資訊和多語言靈活性。為了有效採用和利用下游任務的 UMR，必須投入精力開發 UMR 技術生態系統。雖然到目前為止產生的 UMR 標註數量仍然有限，但在這項工作中，我們探討了從多語言 UMR 圖形產生文字的第一種方法：(1) 將 UMR 轉換為 AMR 的管道，然後使用 AMR 轉文字生成模型，(2) 使用 UMR 資料微調大型語言模型，以及 (3) 使用 UMR 資料微調現有的 AMR 轉文字生成模型。與參考相比，我們效能最好的模型在英文中達到 0.825 的多語言 BERT 分數，在中文中達到 0.882，這表示使用 UMR 資料進行 UMR 轉文字生成的微調方法具有良好的效果，即使 UMR 資料數量有限。
-
-##### **Learning Generalizable Prompt for CLIP with Class Similarity Knowledge**
-2502.11969v1 by Sehun Jung, Hyang-won Lee
-
-In vision-language models (VLMs), prompt tuning has shown its effectiveness
-in adapting models to downstream tasks. However, learned prompts struggle to
-generalize to unseen classes, as they tend to overfit to the classes that are
-targeted during prompt tuning. Examining failure cases, we observed that
-learned prompts disrupt the semantics of unseen classes, generating text
-embeddings with incorrect semantic relationships among classes. To address
-this, we propose Similarity Alignment Regularization (SAR), which regularizes
-learnable prompts to preserve the semantic relationships among classes captured
-by hand-crafted prompts. Specifically, we first obtain novel classes related to
-base classes using ChatGPT-4o and utilize them as potential unseen classes
-during prompt tuning. Then, by targeting both base and novel classes, SAR
-aligns the similarity relationships among text embeddings generated by
-learnable prompts with the similarity relationships from hand-crafted prompts.
-Extensive experiments applying SAR to existing prompt tuning methods
-demonstrate its effectiveness in improving generalization to unseen classes.
-
-摘要：在視覺語言模型 (VLM) 中，提示調整已展現其在調整模型至下游任務上的效能。然而，已學習的提示難以推廣至未見類別，因為它們傾向於過度擬合提示調整期間所鎖定的類別。在檢視失敗案例時，我們觀察到已學習的提示會擾亂未見類別的語義，產生具有類別間不正確語義關係的文字嵌入。為了解決此問題，我們提出相似度對齊正則化 (SAR)，它會對可學習提示進行正則化，以保留由手工提示捕捉到的類別間語義關係。具體來說，我們首先使用 ChatGPT-4o 取得與基本類別相關的新穎類別，並在提示調整期間將它們用作潛在的未見類別。然後，透過鎖定基本類別和新穎類別，SAR 會將可學習提示產生的文字嵌入之間的相似度關係與手工提示的相似度關係對齊。將 SAR 應用於現有提示調整方法的廣泛實驗證明了其在改善對未見類別的概括上的效能。
-
-##### **A MIMO Wireless Channel Foundation Model via CIR-CSI Consistency**
-2502.11965v1 by Jun Jiang, Wenjun Yu, Yunfan Li, Yuan Gao, Shugong Xu
-
-In the field of artificial intelligence, self-supervised learning has
-demonstrated superior generalization capabilities by leveraging large-scale
-unlabeled datasets for pretraining, which is especially critical for wireless
-communication models to adapt to a variety of scenarios. This paper
-innovatively treats Channel State Information (CSI) and Channel Impulse
-Response (CIR) as naturally aligned multi-modal data and proposes the first
-MIMO wireless channel foundation model, named CSI-CLIP. By effectively
-capturing the joint representations of both CIR and CSI, CSI-CLIP exhibits
-remarkable adaptability across scenarios and robust feature extraction
-capabilities. Experimental results show that in the positioning task, CSI-CLIP
-reduces the mean error distance by 22%; in the beam management task, it increases
-accuracy by 1% compared to traditional supervised methods, as well as in the
-channel identification task. These improvements not only highlight the
-potential and value of CSI-CLIP in integrating sensing and communication but
-also demonstrate its significant advantages over existing techniques. Moreover,
-viewing CSI and CIR as multi-modal pairs and contrastive learning for wireless
-channel foundation model open up new research directions in the domain of MIMO
-wireless communications.
-
-摘要：在人工智能领域，自监督学习通过利用大规模无标签数据集进行预训练，展示了卓越的泛化能力，这对于无线通信模型适应各种场景尤为关键。本文创新地将信道状态信息 (CSI) 和信道脉冲响应 (CIR) 视为自然对齐的多模态数据，并提出了第一个 MIMO 无线信道基础模型，名为 CSI-CLIP。通过有效捕获 CIR 和 CSI 的联合表示，CSI-CLIP 在各种场景中表现出卓越的适应性和强大的特征提取能力。实验结果表明，在定位任务中，CSI-CLIP 将平均误差距离减少了 22%；在波束管理任务中，与传统的监督方法相比，其准确度提高了 1%，以及在信道识别任务中。这些改进不仅突出了 CSI-CLIP 在集成感知和通信方面的潜力和价值，而且还展示了其相对于现有技术的显着优势。此外，将 CSI 和 CIR 视为多模态对，并对比学习无线信道基础模型，为 MIMO 无线通信领域开辟了新的研究方向。
-
-##### **Navigating the Helpfulness-Truthfulness Trade-Off with Uncertainty-Aware Instruction Fine-Tuning**
-2502.11962v1 by Tianyi Wu, Jingwei Ni, Bryan Hooi, Jiaheng Zhang, Elliott Ash, See-Kiong Ng, Mrinmaya Sachan, Markus Leippold
-
-Instruction Fine-tuning (IFT) can enhance the helpfulness of Large Language
-Models (LLMs), but it may lower their truthfulness. This trade-off arises
-because IFT steers LLMs to generate responses with long-tail knowledge that is
-not well covered during pre-training, leading to more informative but less
-truthful answers when generalizing to unseen tasks. In this paper, we
-empirically demonstrate this helpfulness-truthfulness trade-off in IFT and
-propose $\textbf{UNIT}$, a novel IFT paradigm to address it. UNIT teaches LLMs
-to recognize their uncertainty and explicitly reflect it at the end of their
-responses. Experimental results show that UNIT-tuned models maintain their
-helpfulness while distinguishing between certain and uncertain claims, thereby
-reducing hallucinations.
-
-摘要：指令微調 (IFT) 可以提升大型語言模型 (LLM) 的實用性，但可能會降低其真實性。這種取捨會出現，是因為 IFT 引導 LLM 生成具有長尾知識的回應，而這些知識在預訓練期間並未充分涵蓋，導致在推廣到未見任務時，答案更具資訊性，但真實性較低。在本文中，我們透過實證展示 IFT 中的這種實用性與真實性取捨，並提出一個新穎的 IFT 典範 $\textbf{UNIT}$ 來解決這個問題。UNIT 教導 LLM 辨識其不確定性，並明確反映在其回應的結尾。實驗結果顯示，經過 UNIT 微調的模型維持其實用性，同時區分確定和不確定的說法，從而減少幻覺。
-
-##### **STRIVE: Structured Reasoning for Self-Improvement in Claim Verification**
-2502.11959v1 by Haisong Gong, Jing Li, Junfei Wu, Qiang Liu, Shu Wu, Liang Wang
-
-Claim verification is the task of determining whether a claim is supported or
-refuted by evidence. Self-improvement methods, where reasoning chains are
-generated and those leading to correct results are selected for training, have
-succeeded in tasks like mathematical problem solving. However, in claim
-verification, this approach struggles. Low-quality reasoning chains may falsely
-match binary truth labels, introducing faulty reasoning into the
-self-improvement process and ultimately degrading performance. To address this,
-we propose STRIVE: Structured Reasoning for Self-Improved Verification. Our
-method introduces a structured reasoning design with Claim Decomposition,
-Entity Analysis, and Evidence Grounding Verification. These components improve
-reasoning quality, reduce errors, and provide additional supervision signals
-for self-improvement. STRIVE begins with a warm-up phase, where the base model
-is fine-tuned on a small number of annotated examples to learn the structured
-reasoning design. It is then applied to generate reasoning chains for all
-training examples, selecting only those that are correct and structurally sound
-for subsequent self-improvement training. We demonstrate that STRIVE achieves
-significant improvements over baseline models, with a 31.4% performance gain
-over the base model and 20.7% over Chain of Thought on the HOVER datasets,
-highlighting its effectiveness.
-
-摘要：聲明驗證的任務是確定聲明是否受到證據支持或反駁。自改善方法（產生推理鏈並選擇導致正確結果的鏈進行訓練）已成功應用於數學問題求解等任務。然而，在聲明驗證中，此方法會遇到困難。低品質的推理鏈可能錯誤地匹配二元真值標籤，將錯誤的推理引入自改善流程並最終降低效能。為了解決此問題，我們提出 STRIVE：結構化推理自改善驗證。我們的模型引入了結構化推理設計，包含聲明分解、實體分析和證據依據驗證。這些組件改善了推理品質、減少了錯誤，並為自改善提供了額外的監督訊號。STRIVE 從熱身階段開始，在少數標註範例上微調基礎模型以學習結構化推理設計。接著將其應用於為所有訓練範例產生推理鏈，僅選擇正確且結構上合理的推理鏈進行後續的自改善訓練。我們證明 STRIVE 獲得了顯著的改善，在 HOVER 資料集上，效能比基礎模型提升了 31.4%，比 Chain of Thought 提升了 20.7%，突顯了其有效性。
-
-##### **Can Your Uncertainty Scores Detect Hallucinated Entity?**
-2502.11948v1 by Min-Hsuan Yeh, Max Kamachee, Seongheon Park, Yixuan Li
-
-To mitigate the impact of the hallucinatory nature of LLMs, many studies propose
-detecting hallucinated generation through uncertainty estimation. However,
-these approaches predominantly operate at the sentence or paragraph level,
-failing to pinpoint specific spans or entities responsible for hallucinated
-content. This lack of granularity is especially problematic for long-form
-outputs that mix accurate and fabricated information. To address this
-limitation, we explore entity-level hallucination detection. We propose a new
-data set, HalluEntity, which annotates hallucination at the entity level. Based
-on the dataset, we comprehensively evaluate uncertainty-based hallucination
-detection approaches across 17 modern LLMs. Our experimental results show that
-uncertainty estimation approaches focusing on individual token probabilities
-tend to over-predict hallucinations, while context-aware methods show better
-but still suboptimal performance. Through an in-depth qualitative study, we
-identify relationships between hallucination tendencies and linguistic
-properties and highlight important directions for future research.
-
-摘要：為了減輕 LLM 幻覺性質的影響，許多研究提出透過不確定性估計來偵測幻覺產生的內容。然而，這些方法主要是在句子或段落層級運作，無法精確找出對幻覺內容負責的特定區間或實體。這種缺乏粒度的現象對於混合了準確和虛構資訊的長篇輸出內容來說尤其成問題。為了解決這個限制，我們探討了實體層級的幻覺偵測。我們提出了一個新的資料集 HalluEntity，其中註解了實體層級的幻覺。根據該資料集，我們全面評估了 17 種現代 LLM 的基於不確定性的幻覺偵測方法。我們的實驗結果顯示，專注於個別代幣機率的不確定性估計方法傾向於過度預測幻覺，而具備背景感知能力的方法則表現得更好，但仍未達到最佳狀態。透過深入的定性研究，我們找出幻覺傾向與語言特徵之間的關係，並強調未來研究的重要方向。
-
-##### **Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction**
-2502.11946v1 by Ailin Huang, Boyong Wu, Bruce Wang, Chao Yan, Chen Hu, Chengli Feng, Fei Tian, Feiyu Shen, Jingbei Li, Mingrui Chen, Peng Liu, Ruihang Miao, Wang You, Xi Chen, Xuerui Yang, Yechang Huang, Yuxiang Zhang, Zheng Gong, Zixin Zhang, Brian Li, Changyi Wan, Hanpeng Hu, Ranchen Ming, Song Yuan, Xuelin Zhang, Yu Zhou, Bingxin Li, Buyun Ma, Kang An, Wei Ji, Wen Li, Xuan Wen, Yuankai Ma, Yuanwei Liang, Yun Mou, Bahtiyar Ahmidi, Bin Wang, Bo Li, Changxin Miao, Chen Xu, Chengting Feng, Chenrun Wang, Dapeng Shi, Deshan Sun, Dingyuan Hu, Dula Sai, Enle Liu, Guanzhe Huang, Gulin Yan, Heng Wang, Haonan Jia, Haoyang Zhang, Jiahao Gong, Jianchang Wu, Jiahong Liu, Jianjian Sun, Jiangjie Zhen, Jie Feng, Jie Wu, Jiaoren Wu, Jie Yang, Jinguo Wang, Jingyang Zhang, Junzhe Lin, Kaixiang Li, Lei Xia, Li Zhou, Longlong Gu, Mei Chen, Menglin Wu, Ming Li, Mingxiao Li, Mingyao Liang, Na Wang, Nie Hao, Qiling Wu, Qinyuan Tan, Shaoliang Pang, Shiliang Yang, Shuli Gao, Siqi Liu, Sitong Liu, Tiancheng Cao, Tianyu Wang, Wenjin Deng, Wenqing He, Wen Sun, Xin Han, Xiaomin Deng, Xiaojia Liu, Xu Zhao, Yanan Wei, Yanbo Yu, Yang Cao, Yangguang Li, Yangzhen Ma, Yanming Xu, Yaqiang Shi, Yilei Wang, Yinmin Zhong, Yu Luo, Yuanwei Lu, Yuhe Yin, Yuting Yan, Yuxiang Yang, Zhe Xie, Zheng Ge, Zheng Sun, Zhewei Huang, Zhichao Chang, Zidong Yang, Zili Zhang, Binxing Jiao, Daxin Jiang, Heung-Yeung Shum, Jiansheng Chen, Jing Li, Shuchang Zhou, Xiangyu Zhang, Xinhao Zhang, Yibo Zhu
-
-Real-time speech interaction, serving as a fundamental interface for
-human-machine collaboration, holds immense potential. However, current
-open-source models face limitations such as high costs in voice data
-collection, weakness in dynamic control, and limited intelligence. To address
-these challenges, this paper introduces Step-Audio, the first production-ready
-open-source solution. Key contributions include: 1) a 130B-parameter unified
-speech-text multi-modal model that achieves unified understanding and
-generation, with the Step-Audio-Chat version open-sourced; 2) a generative
-speech data engine that establishes an affordable voice cloning framework and
-produces the open-sourced lightweight Step-Audio-TTS-3B model through
-distillation; 3) an instruction-driven fine control system enabling dynamic
-adjustments across dialects, emotions, singing, and RAP; 4) an enhanced
-cognitive architecture augmented with tool calling and role-playing abilities
-to manage complex tasks effectively. Based on our new StepEval-Audio-360
-evaluation benchmark, Step-Audio achieves state-of-the-art performance in human
-evaluations, especially in terms of instruction following. On open-source
-benchmarks like LLaMA Question, it shows a 9.3% average performance improvement,
-demonstrating our commitment to advancing the development of open-source
-multi-modal language technologies. Our code and models are available at
-https://github.com/stepfun-ai/Step-Audio.
-
-摘要：即時語音互動作為人機協作的基本介面，蘊含著巨大的潛力。然而，目前的開源模型面臨著語音數據收集成本高、動態控制能力弱、智慧有限等限制。為了應對這些挑戰，本文介紹了 Step-Audio，這是第一個可投入生產的開源解決方案。主要貢獻包括：1) 一個 130B 參數的統一語音文字多模態模型，實現了統一的理解和生成，其中 Step-Audio-Chat 版本已開源；2) 一個生成式語音數據引擎，建立了一個經濟實惠的語音克隆框架，並通過蒸餾技術產生了開源的輕量級 Step-Audio-TTS-3B 模型；3) 一個指令驅動的精細控制系統，實現了跨方言、情緒、唱歌和饒舌的動態調整；4) 一個增強的認知架構，增加了工具呼叫和角色扮演的能力，以有效地管理複雜的任務。根據我們新的 StepEval-Audio-360 評估基準，Step-Audio 在人類評估中實現了最先進的性能，特別是在指令遵循方面。在 LLaMA Question 等開源基準測試中，表現出平均提升了 9.3%，證明了我們致力於推進開源多模態語言技術的發展。我們的程式碼和模型可在 https://github.com/stepfun-ai/Step-Audio 取得。
-
-##### **Deep Spatio-Temporal Neural Network for Air Quality Reanalysis**
-2502.11941v1 by Ammar Kheder, Benjamin Foreback, Lili Wang, Zhi-Song Liu, Michael Boy
-
-Air quality prediction is key to mitigating health impacts and guiding
-decisions, yet existing models tend to focus on temporal trends while
-overlooking spatial generalization. We propose AQ-Net, a spatiotemporal
-reanalysis model for both observed and unobserved stations in the near future.
-AQ-Net utilizes the LSTM and multi-head attention for the temporal regression.
-We also propose a cyclic encoding technique to ensure continuous time
-representation. To learn fine-grained spatial air quality estimation, we
-incorporate AQ-Net with the neural kNN to explore feature-based interpolation,
-such that we can fill the spatial gaps given coarse observation stations. To
-demonstrate the efficiency of our model for spatiotemporal reanalysis, we use
-data from 2013-2017 collected in northern China for PM2.5 analysis. Extensive
-experiments show that AQ-Net excels in air quality reanalysis, highlighting the
-potential of hybrid spatio-temporal models to better capture environmental
-dynamics, especially in urban areas where both spatial and temporal variability
-are critical.
-
-摘要：空气品质预测是减轻健康影响和指导决策的关键，但现有的模型倾向于关注时间趋势，而忽略空间概化。我们提出了 AQ-Net，这是一种时空再分析模型，适用于近期内已观测和未观测到的站点。AQ-Net 利用 LSTM 和多头注意力进行时间回归。我们还提出了一种循环编码技术来确保时间表示的连续性。为了学习细粒度的空间空气质量估计，我们将 AQ-Net 与神经 kNN 结合起来，以探索基于特征的插值，以便我们能够填充给定粗略观测站的空间空白。为了展示我们的模型在时空再分析中的效率，我们使用了 2013-2017 年在中国北部收集的 PM2.5 分析数据。大量的实验表明，AQ-Net 在空气质量再分析中表现出色，突出了混合时空模型在更好地捕捉环境动态方面的潜力，尤其是在空间和时间变异性都很关键的城市地区。
-
-##### **FitLight: Federated Imitation Learning for Plug-and-Play Autonomous Traffic Signal Control**
-2502.11937v1 by Yutong Ye, Yingbo Zhou, Zhusen Liu, Xiao Du, Hao Zhou, Xiang Lian, Mingsong Chen
-
-Although Reinforcement Learning (RL)-based Traffic Signal Control (TSC)
-methods have been extensively studied, their practical applications still raise
-some serious issues such as high learning cost and poor generalizability. This
-is because the ``trial-and-error'' training style makes RL agents extremely
-dependent on the specific traffic environment, which also requires a long
-convergence time. To address these issues, we propose a novel Federated
-Imitation Learning (FIL)-based framework for multi-intersection TSC, named
-FitLight, which allows RL agents to plug-and-play for any traffic environment
-without additional pre-training cost. Unlike existing imitation learning
-approaches that rely on pre-training RL agents with demonstrations, FitLight
-allows real-time imitation learning and seamless transition to reinforcement
-learning. Due to our proposed knowledge-sharing mechanism and novel hybrid
-pressure-based agent design, RL agents can quickly find a best control policy
-with only a few episodes. Moreover, for resource-constrained TSC scenarios,
-FitLight supports model pruning and heterogeneous model aggregation, such that
-RL agents can work on a micro-controller with merely 16 KB RAM and 32 KB ROM.
-Extensive experiments demonstrate that, compared to state-of-the-art
-methods, FitLight not only provides a superior starting point but also
-converges to a better final solution on both real-world and synthetic datasets,
-even under extreme resource limitations.
-
-摘要：儘管基於強化學習 (RL) 的交通號誌控制 (TSC) 方法已經廣泛研究，但其實際應用仍會產生一些嚴重的問題，例如學習成本高和泛化能力差。這是因為「試錯法」訓練風格讓 RL 代理極度依賴特定的交通環境，這也需要很長的收斂時間。為了解決這些問題，我們提出一個名為 FitLight 的基於聯邦模仿學習 (FIL) 的多路口 TSC 框架，讓 RL 代理可以即插即用於任何交通環境，而無需額外的預訓練成本。與依賴使用示範預訓練 RL 代理的現有模仿學習方法不同，FitLight 允許即時模仿學習和無縫過渡到強化學習。由於我們提出的知識共享機制和新穎的基於壓力的混合代理設計，RL 代理只需幾個回合即可快速找到最佳控制策略。此外，對於資源受限的 TSC 場景，FitLight 支援模型剪枝和異質模型聚合，讓 RL 代理可以在僅有 16 KB RAM 和 32 KB ROM 的微控制器上運行。廣泛的實驗證明，與最先進的方法相比，FitLight 不僅提供了更好的起點，而且在實際和合成資料集上都能收斂到更好的最終解決方案，即使在極端的資源限制下也是如此。
-
-##### **On Representational Dissociation of Language and Arithmetic in Large Language Models**
-2502.11932v1 by Riku Kisako, Tatsuki Kuribayashi, Ryohei Sasano
-
-The association between language and (non-linguistic) thinking ability in
-humans has long been debated, and recently, neuroscientific evidence of brain
-activity patterns has been considered. Such a scientific context naturally
-raises an interdisciplinary question -- what about such a language-thought
-dissociation in large language models (LLMs)? In this paper, as an initial
-foray, we explore this question by focusing on simple arithmetic skills (e.g.,
-$1+2=$ ?) as a thinking ability and analyzing the geometry of their encoding in
-LLMs' representation space. Our experiments with linear classifiers and cluster
-separability tests demonstrate that simple arithmetic equations and general
-language input are encoded in completely separated regions in LLMs' internal
-representation space across all the layers, which is also supported with more
-controlled stimuli (e.g., spelled-out equations). These tentatively suggest
-that arithmetic reasoning is mapped into a distinct region from general
-language input, which is in line with the neuroscientific observations of human
-brain activations, while we also point out their somewhat cognitively
-implausible geometric properties.
-
-摘要：人類語言與（非語言）思考能力之間的關聯性長期以來一直備受爭論，而最近，神經科學證據中的大腦活動模式也已受到考量。這樣一個科學背景自然會引發一個跨領域問題——大型語言模型（LLM）中這種語言與思考的分離又是如何？在本文中，作為初步探討，我們透過專注於簡單的算術技能（例如 $1+2=$？）作為思考能力，並分析它們在 LLM 表徵空間中的編碼幾何形狀來探討這個問題。我們透過線性分類器和群集可分性測試進行的實驗證明，簡單的算術方程式和一般語言輸入在 LLM 的內部表徵空間中所有層中都是以完全分離的區域編碼，這也獲得了更受控刺激（例如，拼寫出的方程式）的支持。這些初步表明算術推理被映射到與一般語言輸入不同的區域，這與人類大腦活化的神經科學觀察結果一致，同時我們也指出了它們在認知上有些難以置信的幾何屬性。
-
-##### **BRIGHTER: BRIdging the Gap in Human-Annotated Textual Emotion Recognition Datasets for 28 Languages**
-2502.11926v1 by Shamsuddeen Hassan Muhammad, Nedjma Ousidhoum, Idris Abdulmumin, Jan Philip Wahle, Terry Ruas, Meriem Beloucif, Christine de Kock, Nirmal Surange, Daniela Teodorescu, Ibrahim Said Ahmad, David Ifeoluwa Adelani, Alham Fikri Aji, Felermino D. M. A. Ali, Ilseyar Alimova, Vladimir Araujo, Nikolay Babakov, Naomi Baes, Ana-Maria Bucur, Andiswa Bukula, Guanqun Cao, Rodrigo Tufino Cardenas, Rendi Chevi, Chiamaka Ijeoma Chukwuneke, Alexandra Ciobotaru, Daryna Dementieva, Murja Sani Gadanya, Robert Geislinger, Bela Gipp, Oumaima Hourrane, Oana Ignat, Falalu Ibrahim Lawan, Rooweither Mabuya, Rahmad Mahendra, Vukosi Marivate, Andrew Piper, Alexander Panchenko, Charles Henrique Porto Ferreira, Vitaly Protasov, Samuel Rutunda, Manish Shrivastava, Aura Cristina Udrea, Lilian Diana Awuor Wanzare, Sophie Wu, Florian Valentin Wunderlich, Hanif Muhammad Zhafran, Tianhui Zhang, Yi Zhou, Saif M. Mohammad
-
-People worldwide use language in subtle and complex ways to express emotions.
-While emotion recognition -- an umbrella term for several NLP tasks --
-significantly impacts different applications in NLP and other fields, most work
-in the area is focused on high-resource languages. Therefore, this has led to
-major disparities in research and proposed solutions, especially for
-low-resource languages that suffer from the lack of high-quality datasets. In
-this paper, we present BRIGHTER-- a collection of multilabeled
-emotion-annotated datasets in 28 different languages. BRIGHTER covers
-predominantly low-resource languages from Africa, Asia, Eastern Europe, and
-Latin America, with instances from various domains annotated by fluent
-speakers. We describe the data collection and annotation processes and the
-challenges of building these datasets. Then, we report different experimental
-results for monolingual and crosslingual multi-label emotion identification, as
-well as intensity-level emotion recognition. We investigate results with and
-without using LLMs and analyse the large variability in performance across
-languages and text domains. We show that BRIGHTER datasets are a step towards
-bridging the gap in text-based emotion recognition and discuss their impact and
-utility.
-
-摘要：全球各地的人們都以微妙且複雜的方式使用語言來表達情感。雖然情緒辨識——幾個 NLP 任務的總稱——顯著影響 NLP 及其他領域中的不同應用，但該領域中的大部分工作都集中於高資源語言。因此，這導致研究和提出的解決方案出現重大差異，特別是對於缺乏高品質資料集的低資源語言。在本文中，我們提出 BRIGHTER——一個由 28 種不同語言組成的多標記情緒標註資料集。BRIGHTER 主要涵蓋來自非洲、亞洲、東歐和拉丁美洲的低資源語言，其中包含由流利講者標註的來自不同領域的實例。我們描述了資料收集和標註流程以及建立這些資料集的挑戰。然後，我們報告了單語和跨語言多標籤情緒識別的不同實驗結果，以及強度級別的情緒識別。我們研究了使用和不使用 LLM 的結果，並分析了跨語言和文字領域的性能的巨大變異。我們表明，BRIGHTER 資料集是縮小基於文字的情緒識別差距的一步，並討論了它們的影響和效用。
-
-##### **GRAPHGPT-O: Synergistic Multimodal Comprehension and Generation on Graphs**
-2502.11925v1 by Yi Fang, Bowen Jin, Jiacheng Shen, Sirui Ding, Qiaoyu Tan, Jiawei Han
-
-The rapid development of Multimodal Large Language Models (MLLMs) has enabled
-the integration of multiple modalities, including texts and images, within the
-large language model (LLM) framework. However, texts and images are usually
-interconnected, forming a multimodal attributed graph (MMAG). It is
-underexplored how MLLMs can incorporate the relational information
-(i.e., graph structure) and semantic information (i.e., texts
-and images) on such graphs for multimodal comprehension and generation. In this
-paper, we propose GraphGPT-o, which supports omni-multimodal understanding and
-creation on MMAGs. We first comprehensively study linearization variants to
-transform semantic and structural information as input for MLLMs. Then, we
-propose a hierarchical aligner that enables deep graph encoding, bridging the
-gap between MMAGs and MLLMs. Finally, we explore the inference choices,
-adapting MLLM to interleaved text and image generation in graph scenarios.
-Extensive experiments on three datasets from different domains demonstrate the
-effectiveness of our proposed method. Datasets and codes will be open-sourced
-upon acceptance.
-
-摘要：多模态大语言模型 (MLLM) 的快速发展，促进了文本和图像等多种模态在大型语言模型 (LLM) 框架内的整合。然而，文本和图像通常是相互关联的，形成多模态属性图 (MMAG)。对于 MLLM 如何整合此类图上的关系信息（即图结构）和语义信息（即文本和图像）以进行多模态理解和生成，目前仍未得到充分探索。在本文中，我们提出了 GraphGPT-o，它支持在 MMAG 上进行全方位多模态理解和创建。我们首先全面研究了线性化变体，以将语义和结构信息转换为 MLLM 的输入。然后，我们提出了一个分层对齐器，它支持深度图编码，弥合了 MMAG 和 MLLM 之间的差距。最后，我们探索了推理选择，使 MLLM 适应图场景中交错的文本和图像生成。来自不同领域的三组数据集上的大量实验表明了我们提出的方法的有效性。数据集和代码将在被接受后开源。
-
-##### **From Text to Trust: Empowering AI-assisted Decision Making with Adaptive LLM-powered Analysis**
-2502.11919v1 by Zhuoyan Li, Hangxiao Zhu, Zhuoran Lu, Ziang Xiao, Ming Yin
-
-AI-assisted decision making is becoming increasingly prevalent, yet individuals
-often fail to utilize AI-based decision aids appropriately, especially when the
-AI explanations are absent, potentially as they do not reflect on
-AI's decision recommendations critically. Large language models (LLMs), with
-their exceptional conversational and analytical capabilities, present great
-opportunities to enhance AI-assisted decision making in the absence of AI
-explanations by providing natural-language-based analysis of AI's decision
-recommendation, e.g., how each feature of a decision making task might
-contribute to the AI recommendation. In this paper, via a randomized
-experiment, we first show that presenting LLM-powered analysis of each task
-feature, either sequentially or concurrently, does not significantly improve
-people's AI-assisted decision performance. To enable decision makers to better
-leverage LLM-powered analysis, we then propose an algorithmic framework to
-characterize the effects of LLM-powered analysis on human decisions and
-dynamically decide which analysis to present. Our evaluation with human
-subjects shows that this approach effectively improves decision makers'
-appropriate reliance on AI in AI-assisted decision making.
-
-摘要：隨著 AI 輔助決策越來越普遍，但個人常常無法適當地利用 AI 決策輔助，特別是在沒有 AI 解釋的情況下，潛在原因是他們無法批判性地理解 AI 的決策建議。大型語言模型 (LLM) 擁有卓越的對話和分析能力，在沒有 AI 解釋的情況下，透過提供基於自然語言的 AI 決策建議分析，例如決策任務的每個特徵如何影響 AI 建議，為增強 AI 輔助決策提供了絕佳的機會。在本文中，我們透過隨機實驗，首先展示了以循序或並行的方式呈現 LLM 分析的每個任務特徵，並未顯著改善人們的 AI 輔助決策表現。為了讓決策者能更好地利用 LLM 分析，我們接著提出了演算法架構，用於描述 LLM 分析對人類決策的影響，並動態決定要呈現哪種分析。我們對人類受試者的評估顯示，這種方法有效地改善了決策者在 AI 輔助決策中對 AI 的適當依賴性。
-
-##### **EssayJudge: A Multi-Granular Benchmark for Assessing Automated Essay Scoring Capabilities of Multimodal Large Language Models**
-2502.11916v1 by Jiamin Su, Yibo Yan, Fangteng Fu, Han Zhang, Jingheng Ye, Xiang Liu, Jiahao Huo, Huiyu Zhou, Xuming Hu
-
-Automated Essay Scoring (AES) plays a crucial role in educational assessment
-by providing scalable and consistent evaluations of writing tasks. However,
-traditional AES systems face three major challenges: (1) reliance on
-handcrafted features that limit generalizability, (2) difficulty in capturing
-fine-grained traits like coherence and argumentation, and (3) inability to
-handle multimodal contexts. In the era of Multimodal Large Language Models
-(MLLMs), we propose EssayJudge, the first multimodal benchmark to evaluate AES
-capabilities across lexical-, sentence-, and discourse-level traits. By
-leveraging MLLMs' strengths in trait-specific scoring and multimodal context
-understanding, EssayJudge aims to offer precise, context-rich evaluations
-without manual feature engineering, addressing longstanding AES limitations.
-Our experiments with 18 representative MLLMs reveal gaps in AES performance
-compared to human evaluation, particularly in discourse-level traits,
-highlighting the need for further advancements in MLLM-based AES research. Our
-dataset and code will be available upon acceptance.
-
-摘要：自動化論文評分 (AES) 在教育評量中扮演著重要的角色，它能提供可擴充且一致的寫作任務評量。然而，傳統的 AES 系統面臨了三個主要的挑戰：(1) 依賴於限制泛用性的手工特徵，(2) 難以捕捉連貫性和論證等細微特徵，以及 (3) 無法處理多模態的脈絡。在多模態大型語言模型 (MLLM) 的時代，我們提出了 EssayJudge，這是第一個評估 AES 能力的多模態基準，橫跨詞彙、句子和篇章層級的特徵。EssayJudge 透過利用 MLLM 在特定特徵評分和多模態脈絡理解方面的優勢，旨在提供精確且富含脈絡的評量，而無需手動特徵工程，進而解決長久以來的 AES 限制。我們針對 18 個具代表性的 MLLM 進行的實驗揭露了 AES 效能與人類評量之間的差距，特別是在篇章層級的特徵，這凸顯了 MLLM 為基礎的 AES 研究需要進一步的進展。我們的資料集和程式碼將在通過驗證後提供。
-
-##### **On the robustness of ChatGPT in teaching Korean Mathematics**
-2502.11915v1 by Phuong-Nam Nguyen, Quang Nguyen-The, An Vu-Minh, Diep-Anh Nguyen, Xuan-Lam Pham
-
-ChatGPT, an Artificial Intelligence model, has the potential to revolutionize
-education. However, its effectiveness in solving non-English questions remains
-uncertain. This study evaluates ChatGPT's robustness using 586 Korean
-mathematics questions. ChatGPT achieves 66.72% accuracy, correctly answering
-391 out of 586 questions. We also assess its ability to rate mathematics
-questions based on eleven criteria and perform a topic analysis. Our findings
-show that ChatGPT's ratings align with educational theory and test-taker
-perspectives. While ChatGPT performs well in question classification, it
-struggles with non-English contexts, highlighting areas for improvement. Future
-research should address linguistic biases and enhance accuracy across diverse
-languages. Domain-specific optimizations and multilingual training could
-improve ChatGPT's role in personalized education.
-
-摘要：ChatGPT，一種人工智慧模型，具有革新教育的潛力。然而，其解決非英語問題的有效性仍不確定。本研究使用 586 個韓語數學問題評估 ChatGPT 的健壯性。ChatGPT 達到 66.72% 的準確率，正確回答了 586 個問題中的 391 個。我們也評估其根據 11 個標準對數學問題進行評分並執行主題分析的能力。我們的研究結果顯示，ChatGPT 的評分與教育理論和應試者的觀點一致。儘管 ChatGPT 在問題分類中表現良好，但它在非英語語境中表現不佳，突顯出需要改進的地方。未來的研究應解決語言偏見並提高跨不同語言的準確性。特定領域的優化和多語言訓練可以提升 ChatGPT 在個人化教育中的作用。
-
-##### **MMRC: A Large-Scale Benchmark for Understanding Multimodal Large Language Model in Real-World Conversation**
-2502.11903v1 by Haochen Xue, Feilong Tang, Ming Hu, Yexin Liu, Qidong Huang, Yulong Li, Chengzhi Liu, Zhongxing Xu, Chong Zhang, Chun-Mei Feng, Yutong Xie, Imran Razzak, Zongyuan Ge, Jionglong Su, Junjun He, Yu Qiao
-
-Recent multimodal large language models (MLLMs) have demonstrated significant
-potential in open-ended conversation, generating more accurate and personalized
-responses. However, their abilities to memorize, recall, and reason in
-sustained interactions within real-world scenarios remain underexplored. This
-paper introduces MMRC, a Multi-Modal Real-world Conversation benchmark for
-evaluating six core open-ended abilities of MLLMs: information extraction,
-multi-turn reasoning, information update, image management, memory recall, and
-answer refusal. With data collected from real-world scenarios, MMRC comprises
-5,120 conversations and 28,720 corresponding manually labeled questions, posing
-a significant challenge to existing MLLMs. Evaluations on 20 MLLMs in MMRC
-indicate an accuracy drop during open-ended interactions. We identify four
-common failure patterns: long-term memory degradation, inadequacies in updating
-factual knowledge, accumulated assumption of error propagation, and reluctance
-to say no. To mitigate these issues, we propose a simple yet effective
-NOTE-TAKING strategy, which can record key information from the conversation
-and remind the model during its responses, enhancing conversational
-capabilities. Experiments across six MLLMs demonstrate significant performance
-improvements.
-
-摘要：最近的多模态大型语言模型 (MLLM) 已在开放式对话中展现出显著的潜力，产生更准确且个性化的回应。然而，它们在现实世界场景中持续互动中的记忆、回忆和推理能力仍未得到充分探索。本文介绍了 MMRC，一个多模态现实世界对话基准，用于评估 MLLM 的六项核心开放式能力：信息提取、多轮推理、信息更新、图像管理、记忆回忆和答案拒绝。通过从现实世界场景中收集的数据，MMRC 包含 5,120 个对话和 28,720 个相应的手动标记问题，对现有的 MLLM 构成了重大挑战。在 MMRC 中对 20 个 MLLM 的评估表明，在开放式互动期间准确性下降。我们确定了四种常见的故障模式：长期记忆退化、更新事实知识的不足、累积的错误传播假设以及不愿说不。为了减轻这些问题，我们提出了一种简单但有效的笔记策略，它可以记录对话中的关键信息并在模型响应期间提醒模型，从而增强对话能力。六个 MLLM 的实验表明了显著的性能改进。
-
-##### **Building A Proof-Oriented Programmer That Is 64% Better Than GPT-4o Under Data Scarsity**
-2502.11901v1 by Dylan Zhang, Justin Wang, Tianran Sun
-
-Existing LMs struggle with proof-oriented programming due to data scarcity,
-which manifests in two key ways: (1) a lack of sufficient corpora for
-proof-oriented programming languages such as F*, and (2) the absence of
-large-scale, project-level proof-oriented implementations that can teach the
-model the intricate reasoning process when performing proof-oriented
-programming. We present the first work on synthetic data augmentation for
-project-level proof-oriented programming, for both generation and repair. Our method
-addresses data scarcity by synthesizing basic proof-oriented programming
-problems for proficiency in that language; incorporating diverse coding data
-for reasoning capability elicitation; and creating new proofs and repair data
-within existing repositories. This approach enables language models to both
-synthesize and repair proofs for function- and repository-level code. We show
-that our fine-tuned 14B-parameter model, PoPilot, can exceed the performance of
-models that outperform GPT-4o in project-level proof-oriented programming by a
-64% relative margin, and can improve GPT-4o's performance by 54% by repairing
-its outputs, compared with GPT-4o's self-repair.
-
-摘要：現有的語言模型在基於證明編程時會因資料稀少而有困難，
-這會以兩種關鍵方式表現出來：(1) 缺乏足夠的語料庫，例如 F* 等面向證明的程式語言，以及 (2) 缺乏大型的專案層級面向證明實作，這些實作可以在執行面向證明編程時，教導模型複雜的推理程序。我們提出第一個面向專案層級面向證明編程的合成資料擴充，用於產生和修復。我們的做法透過合成基本的面向證明編程問題來解決資料稀少的問題，以精通該語言；納入不同的編碼資料，以引出推理能力，並在現有的儲存庫中建立新的證明和修復資料。這個方法讓語言模型能夠為函數層級和儲存庫層級的程式碼合成和修復證明。我們展示經過微調的 14B 參數模型 PoPilot，可以超過在專案層級面向證明編程中表現優於 GPT-4o 的模型 64% 的相對差距，並且可以透過修復 GPT-4o 自我修復的輸出，將 GPT-4o 的效能提升 54%。
-
-##### **DLFR-VAE: Dynamic Latent Frame Rate VAE for Video Generation**
-2502.11897v1 by Zhihang Yuan, Siyuan Wang, Rui Xie, Hanling Zhang, Tongcheng Fang, Yuzhang Shang, Shengen Yan, Guohao Dai, Yu Wang
-
-In this paper, we propose the Dynamic Latent Frame Rate VAE (DLFR-VAE), a
-training-free paradigm that can make use of adaptive temporal compression in
-latent space. While existing video generative models apply fixed compression
-rates via pretrained VAE, we observe that real-world video content exhibits
-substantial temporal non-uniformity, with high-motion segments containing more
-information than static scenes. Based on this insight, DLFR-VAE dynamically
-adjusts the latent frame rate according to the content complexity.
-Specifically, DLFR-VAE comprises two core innovations: (1) A Dynamic Latent
-Frame Rate Scheduler that partitions videos into temporal chunks and adaptively
-determines optimal frame rates based on information-theoretic content
-complexity, and (2) A training-free adaptation mechanism that transforms
-pretrained VAE architectures into a dynamic VAE that can process features with
-variable frame rates. Our simple but effective DLFR-VAE can function as a
-plug-and-play module, seamlessly integrating with existing video generation
-models and accelerating the video generation process.
-
-摘要：在本文中，我們提出動態潛在幀率 VAE (DLFR-VAE)，一種無需訓練的範例，它可以在潛在空間中使用自適應時間壓縮。現有的影片生成模型透過預訓練的 VAE 應用固定壓縮率，但我們觀察到真實世界的影片內容展現出大量的時間非一致性，其中高動作片段包含比靜態場景更多的資訊。基於這個見解，DLFR-VAE 會根據內容複雜度動態調整潛在幀率。具體來說，DLFR-VAE 包含兩項核心創新：(1) 一個動態潛在幀率排程器，它將影片分割成時間區塊，並根據資訊理論內容複雜度自適應地決定最佳幀率，以及 (2) 一個無需訓練的適應機制，它將預訓練的 VAE 架構轉換成一個動態 VAE，它可以處理具有可變幀率的特色。我們簡單但有效的 DLFR-VAE 可以作為一個即插即用的模組，與現有的影片生成模型無縫整合，並加速影片生成過程。
-
-##### **CAMEL: Continuous Action Masking Enabled by Large Language Models for Reinforcement Learning**
-2502.11896v1 by Yanxiao Zhao, Yangge Qian, Jingyang Shan, Xiaolin Qin
-
-Reinforcement learning (RL) in continuous action spaces encounters persistent
-challenges, such as inefficient exploration and convergence to suboptimal
-solutions. To address these limitations, we propose CAMEL, a novel framework
-integrating LLM-generated suboptimal policies into the RL training pipeline.
-CAMEL leverages dynamic action masking and an adaptive epsilon-masking
-mechanism to guide exploration during early training stages while gradually
-enabling agents to optimize policies independently. At the core of CAMEL lies
-the integration of Python-executable suboptimal policies generated by LLMs
-based on environment descriptions and task objectives. Although simplistic and
-hard-coded, these policies offer valuable initial guidance for RL agents. To
-effectively utilize these priors, CAMEL employs masking-aware optimization to
-dynamically constrain the action space based on LLM outputs. Additionally,
-epsilon-masking gradually reduces reliance on LLM-generated guidance, enabling
-agents to transition from constrained exploration to autonomous policy
-refinement. Experimental validation on Gymnasium MuJoCo environments
-demonstrates the effectiveness of CAMEL. In Hopper-v4 and Ant-v4, LLM-generated
-policies significantly improve sample efficiency, achieving performance
-comparable to or surpassing expert masking baselines. For Walker2d-v4, where
-LLMs struggle to accurately model bipedal gait dynamics, CAMEL maintains robust
-RL performance without notable degradation, highlighting the framework's
-adaptability across diverse tasks. While CAMEL shows promise in enhancing
-sample efficiency and mitigating convergence challenges, these issues remain
-open for further research. Future work aims to generalize CAMEL to multimodal
-LLMs for broader observation-action spaces and automate policy evaluation,
-reducing human intervention and enhancing scalability in RL training pipelines.
-
-摘要：在連續動作空間中的強化學習 (RL) 會遇到持續的挑戰，例如探索效率低落和收斂至次佳解。為了解決這些限制，我們提出 CAMEL，一個將 LLM 生成的次佳策略整合到 RL 訓練管線中的新框架。CAMEL 透過動態動作遮罩和自適應 epsilon 遮罩機制來引導探索，同時逐漸讓代理程式能夠獨立最佳化策略。CAMEL 的核心在於整合由 LLM 生成的 Python 可執行次佳策略，這些策略基於環境描述和任務目標。儘管這些策略過於簡化且硬編碼，但它們為 RL 代理程式提供了有價值的初始指導。為了有效利用這些先驗知識，CAMEL 採用遮罩感知最佳化來根據 LLM 輸出動態限制動作空間。此外，epsilon 遮罩逐漸減少對 LLM 生成的指導依賴，讓代理程式能夠從受限探索轉換為自主策略改善。在 Gymnasium MuJoCo 環境上的實驗驗證證明了 CAMEL 的有效性。在 Hopper-v4 和 Ant-v4 中，LLM 生成的策略顯著提升了樣本效率，達到了與專家遮罩基準相近或超越的效能。對於 LLM 難以準確建模雙足步態動態的 Walker2d-v4，CAMEL 維持穩健的 RL 效能，且沒有顯著降低，突顯了該框架在不同任務中的適應性。儘管 CAMEL 在提升樣本效率和緩解收斂挑戰方面顯示出前景，但這些問題仍有待進一步研究。未來的研究工作旨在將 CAMEL 推廣到多模態 LLM，以涵蓋更廣泛的觀察動作空間，並自動化策略評估，減少人工介入並提升 RL 訓練管線的可擴充性。
-
-##### **Continual Quantization-Aware Pre-Training: When to transition from 16-bit to 1.58-bit pre-training for BitNet language models?**
-2502.11895v1 by Jacob Nielsen, Peter Schneider-Kamp, Lukas Galke
-
-Large language models (LLMs) require immense resources for training and
-inference. Quantization, a technique that reduces the precision of model
-parameters, offers a promising solution for improving LLM efficiency and
-sustainability. While post-training quantization methods typically achieve 4-8
-bits per parameter, recent research suggests that training LLMs with 1.58 bits
-per weight parameter from scratch can maintain model accuracy while greatly
-reducing memory requirements and energy consumption at inference time. Here, we
-investigate a training strategy for quantization-aware pre-training, where the
-models are first trained with 16-bit precision and then transition into
-1.58-bit quantization-aware training. Our results on 11 downstream tasks show
-that this 16-to-1.58-bit training strategy is preferable over full 1.58-bit
-training and leaves models closer to those which have undergone 16-bit
-training. We further investigate the effects of retaining the optimizer state
-at the transition point and gradually phasing in quantization strength --
-finding that both techniques alleviate the magnitude of loss spikes, but also
-that these effects can be compensated through further training.
-
-摘要：大型語言模型 (LLM) 需要大量的資源來進行訓練和推理。量化是一種降低模型參數精度的技術，為提高 LLM 效率和可持續性提供了一個有希望的解決方案。雖然訓練後量化方法通常每參數達到 4-8 位元，但最近的研究表明，從頭開始使用每權重參數 1.58 位元訓練 LLM 可以維持模型準確性，同時大幅減少推理時間的記憶體需求和能源消耗。在此，我們探討量化感知預訓練的訓練策略，其中模型首先使用 16 位元精度訓練，然後轉換為 1.58 位元量化感知訓練。我們在 11 個下游任務上的結果表明，這種 16 位元到 1.58 位元的訓練策略優於完全 1.58 位元訓練，並且使模型更接近經過 16 位元訓練的模型。我們進一步探討了在轉換點保留最佳化器狀態和逐漸調整量化強度的影響——發現這兩種技術都可以減輕損失尖峰的大小，但這些影響也可以透過進一步訓練來補償。
-
-##### **Revisiting Classification Taxonomy for Grammatical Errors**
-2502.11890v1 by Deqing Zou, Jingheng Ye, Yulu Liu, Yu Wu, Zishan Xu, Yinghui Li, Hai-Tao Zheng, Bingxu An, Zhao Wei, Yong Xu
-
-Grammatical error classification plays a crucial role in language learning
-systems, but existing classification taxonomies often lack rigorous validation,
-leading to inconsistencies and unreliable feedback. In this paper, we revisit
-previous classification taxonomies for grammatical errors by introducing a
-systematic and qualitative evaluation framework. Our approach examines four
-aspects of a taxonomy, i.e., exclusivity, coverage, balance, and usability.
-Then, we construct a high-quality grammatical error classification dataset
-annotated with multiple classification taxonomies and evaluate them based
-on our proposed evaluation framework. Our experiments reveal the drawbacks of
-existing taxonomies. Our contributions aim to improve the precision and
-effectiveness of error analysis, providing more understandable and actionable
-feedback for language learners.
-
-摘要：語法錯誤分類在語言學習系統中扮演至關重要的角色，但現有的分類法常常缺乏嚴謹的驗證，導致不一致且不可靠的回饋。在本文中，我們透過引入一個系統且定性的評估架構，重新檢視先前的語法錯誤分類法。我們的做法檢視分類法的四個面向，即排他性、涵蓋性、平衡性和可用性。接著，我們建構一個高品質的語法錯誤分類資料集，並用多個分類法進行標註，並根據我們提出的評估架構對其進行評估。我們的實驗揭露了現有分類法的缺點。我們的貢獻旨在改善錯誤分析的準確性和有效性，為語言學習者提供更易於理解且可操作的回饋。
-
-##### **Stonefish: Supporting Machine Learning Research in Marine Robotics**
-2502.11887v1 by Michele Grimaldi, Patryk Cieslak, Eduardo Ochoa, Vibhav Bharti, Hayat Rajani, Ignacio Carlucho, Maria Koskinopoulou, Yvan R. Petillot, Nuno Gracias
-
-Simulations are highly valuable in marine robotics, offering a cost-effective
-and controlled environment for testing in the challenging conditions of
-underwater and surface operations. Given the high costs and logistical
-difficulties of real-world trials, simulators capable of capturing the
-operational conditions of subsea environments have become key in developing and
-refining algorithms for remotely-operated and autonomous underwater vehicles.
-This paper highlights recent enhancements to the Stonefish simulator, an
-advanced open-source platform supporting development and testing of marine
-robotics solutions. Key updates include a suite of additional sensors, such as
-an event-based camera, a thermal camera, and an optical flow camera, as well
-as visual light communication, support for tethered operations, improved
-thruster modelling, more flexible hydrodynamics, and enhanced sonar accuracy.
-These developments and an automated annotation tool significantly bolster
-Stonefish's role in marine robotics research, especially in the field of
-machine learning, where training data with a known ground truth is hard or
-impossible to collect.
-
-摘要：模擬在海洋機器人中極具價值，提供具成本效益且受控的環境，用於在水下和水面作業的挑戰性條件下進行測試。鑑於現實世界試驗的高成本和後勤困難，能夠捕捉海底環境作業條件的模擬器已成為開發和改進遠程操作和自主水下載具演算法的關鍵。本文重點介紹了 Stonefish 模擬器最近的增強功能，這是一個先進的開源平台，支援海洋機器人解決方案的開發和測試。主要更新包括一系列額外的感測器，例如事件式相機、熱像儀和光流相機，以及可見光通訊、對繫繩操作的支援、改進的推進器建模、更靈活的水動力學和增強的聲納準確度。這些開發和自動化標註工具顯著提升了 Stonefish 在海洋機器人研究中的作用，特別是在機器學習領域，其中具有已知基本事實的訓練資料難以或無法收集。
-
-##### **LIMR: Less is More for RL Scaling**
-2502.11886v1 by Xuefeng Li, Haoyang Zou, Pengfei Liu
-
-In this paper, we ask: what truly determines the effectiveness of RL training
-data for enhancing language models' reasoning capabilities? While recent
-advances like o1, Deepseek R1, and Kimi1.5 demonstrate RL's potential, the lack
-of transparency about training data requirements has hindered systematic
-progress. Starting directly from base models without distillation, we challenge
-the assumption that scaling up RL training data inherently improves
-performance. We demonstrate that a strategically selected subset of just 1,389
-samples can outperform the full 8,523-sample dataset. We introduce Learning
-Impact Measurement (LIM), an automated method to evaluate and prioritize
-training samples based on their alignment with model learning trajectories,
-enabling efficient resource utilization and scalable implementation. Our method
-achieves comparable or even superior performance using only 1,389 samples
-versus the full 8,523-sample dataset. Notably, while recent data-efficient
-approaches (e.g., LIMO and s1) show promise with 32B-scale models, we find they
-significantly underperform at 7B scale through supervised fine-tuning (SFT).
-In contrast, our RL-based LIMR achieves 16.7% higher accuracy on AIME24 and
-outperforms LIMO and s1 by 13.0% and 22.2% on MATH500. These results
-fundamentally reshape our understanding of RL scaling in LLMs, demonstrating
-that precise sample selection, rather than data scale, may be the key to
-unlocking enhanced reasoning capabilities. For reproducible research and future
-innovation, we are open-sourcing LIMR, including implementation of LIM,
-training and evaluation code, curated datasets, and trained models at
-https://github.com/GAIR-NLP/LIMR.
-
-摘要：在這篇論文中，我們提出一個問題：究竟是什麼決定了 RL 訓練資料增強語言模型推理能力的有效性？雖然最近的進展，例如 o1、Deepseek R1 和 Kimi1.5，展示了 RL 的潛力，但缺乏關於訓練資料需求的透明度阻礙了系統化的進展。從沒有蒸餾的基本模型直接開始，我們挑戰了擴充 RL 訓練資料本質上就會提升效能的假設。我們證明，策略性地選出僅 1,389 個樣本的子集就能勝過完整的 8,523 個樣本資料集。我們引入了學習影響力測量 (LIM)，這是一種自動化方法，用來評估和優先處理訓練樣本，根據它們與模型學習軌跡的一致性，能有效利用資源和擴充實作。我們的方法使用僅 1,389 個樣本就能達到與使用完整的 8,523 個樣本資料集相當甚至更佳的效能。值得注意的是，雖然最近資料有效率的方法（例如 LIMO 和 s1）在 32B 規模的模型上展現了前景，但我們發現它在 7B 規模上透過監督微調 (SFT) 的表現大幅落後。相比之下，我們基於 RL 的 LIMR 在 AIME24 上達到了高出 16.7% 的準確度，並在 MATH500 上比 LIMO 和 s1 分別高出 13.0% 和 22.2%。這些結果從根本上改變了我們對 LLM 中 RL 擴充的理解，證明精確的樣本選取，而非資料規模，可能是解鎖增強推理能力的關鍵。為了可重製的研究和未來的創新，我們開放原始碼 LIMR，包括 LIM 的實作、訓練和評估程式碼、策展的資料集，以及在 https://github.com/GAIR-NLP/LIMR 上訓練的模型。
-
-##### **Leveraging Dual Process Theory in Language Agent Framework for Real-time Simultaneous Human-AI Collaboration**
-2502.11882v1 by Shao Zhang, Xihuai Wang, Wenhao Zhang, Chaoran Li, Junru Song, Tingyu Li, Lin Qiu, Xuezhi Cao, Xunliang Cai, Wen Yao, Weinan Zhang, Xinbing Wang, Ying Wen
-
-Agents built on large language models (LLMs) have excelled in turn-by-turn
-human-AI collaboration but struggle with simultaneous tasks requiring real-time
-interaction. Latency issues and the challenge of inferring variable human
-strategies hinder their ability to make autonomous decisions without explicit
-instructions. Through experiments with current independent System 1 and System
-2 methods, we validate the necessity of using Dual Process Theory (DPT) in
-real-time tasks. We propose DPT-Agent, a novel language agent framework that
-integrates System 1 and System 2 for efficient real-time simultaneous human-AI
-collaboration. DPT-Agent's System 1 uses a Finite-state Machine (FSM) and
-code-as-policy for fast, intuitive, and controllable decision-making.
-DPT-Agent's System 2 integrates Theory of Mind (ToM) and asynchronous
-reflection to infer human intentions and perform reasoning-based autonomous
-decisions. We demonstrate the effectiveness of DPT-Agent through further
-experiments with rule-based agents and human collaborators, showing significant
-improvements over mainstream LLM-based frameworks. To the best of our
-knowledge, DPT-Agent is the first language agent framework that achieves
-successful real-time simultaneous human-AI collaboration autonomously. The code
-for DPT-Agent can be found at https://github.com/sjtu-marl/DPT-Agent.
-
-摘要：建立在大语言模型（LLM）上的代理在回合制人机协作方面表现出色，但在需要实时交互的同时任务中却举步维艰。延迟问题和推断可变人类策略的挑战阻碍了他们在没有明确指示的情况下做出自主决策的能力。通过使用当前独立的系统 1 和系统 2 方法进行的实验，我们验证了在实时任务中使用双重过程理论 (DPT) 的必要性。我们提出了 DPT-Agent，这是一个新颖的语言代理框架，它集成了系统 1 和系统 2，以实现高效的实时同时人机协作。DPT-Agent 的系统 1 使用有限状态机 (FSM) 和代码作为策略，以进行快速、直观且可控的决策。DPT-Agent 的系统 2 集成了心智理论 (ToM) 和异步反射，以推断人类意图并执行基于推理的自主决策。我们通过与基于规则的代理和人类合作者进行进一步的实验来证明 DPT-Agent 的有效性，展示了对主流基于 LLM 的框架的重大改进。据我们所知，DPT-Agent 是第一个实现自主的实时同时人机协作的语言代理框架。DPT-Agent 的代码可以在 https://github.com/sjtu-marl/DPT-Agent 中找到。
-
-##### **Hypothesis-Driven Theory-of-Mind Reasoning for Large Language Models**
-2502.11881v1 by Hyunwoo Kim, Melanie Sclar, Tan Zhi-Xuan, Lance Ying, Sydney Levine, Yang Liu, Joshua B. Tenenbaum, Yejin Choi
-
-Existing LLM reasoning methods have shown impressive capabilities across
-various tasks, such as solving math and coding problems. However, applying
-these methods to scenarios without ground-truth answers or rule-based
-verification methods - such as tracking the mental states of an agent - remains
-challenging. Inspired by the sequential Monte Carlo algorithm, we introduce
-thought-tracing, an inference-time reasoning algorithm designed to trace the
-mental states of specific agents by generating hypotheses and weighting them
-based on observations without relying on ground-truth solutions to questions in
-datasets. Our algorithm is modeled after the Bayesian theory-of-mind framework,
-using LLMs to approximate probabilistic inference over agents' evolving mental
-states based on their perceptions and actions. We evaluate thought-tracing on
-diverse theory-of-mind benchmarks, demonstrating significant performance
-improvements compared to baseline LLMs. Our experiments also reveal interesting
-behaviors of the recent reasoning models - e.g., o1 and R1 - on theory-of-mind,
-highlighting how social reasoning differs from other domains.
-
-摘要：現有的 LLM 推理方法已在各種任務中展現出令人印象深刻的能力，例如解決數學和編碼問題。然而，將這些方法應用於沒有正解答案或基於規則的驗證方法的情境中 - 例如追蹤代理人的心智狀態 - 仍然具有挑戰性。受到序貫蒙地卡羅演算法的啟發，我們引入了思想追蹤，這是一種在推理時間進行推理的演算法，旨在透過產生假設並根據觀察加權這些假設來追蹤特定代理人的心智狀態，而無需依賴資料集中的問題正解。我們的演算法是以貝氏心智理論架構為範本，使用 LLM 根據代理人的感知和行動來近似代理人不斷演變的心智狀態的機率推論。我們在各種心智理論基準上評估思想追蹤，與基準 LLM 相比，證明了顯著的效能提升。我們的實驗也揭露了近期推理模型在心智理論上的有趣行為 - 例如 o1 和 R1 - 突顯了社會推理與其他領域的差異。
-
-##### **Bitnet.cpp: Efficient Edge Inference for Ternary LLMs**
-2502.11880v1 by Jinheng Wang, Hansong Zhou, Ting Song, Shijie Cao, Yan Xia, Ting Cao, Jianyu Wei, Shuming Ma, Hongyu Wang, Furu Wei
-
-The advent of 1-bit large language models (LLMs), led by BitNet b1.58, has
-spurred interest in ternary LLMs. Despite this, research and practical
-applications focusing on efficient edge inference for ternary LLMs remain
-scarce. To bridge this gap, we introduce Bitnet.cpp, an inference system
-optimized for BitNet b1.58 and ternary LLMs. Given that mixed-precision matrix
-multiplication (mpGEMM) constitutes the bulk of inference time in ternary LLMs,
-Bitnet.cpp incorporates a novel mpGEMM library to facilitate
-sub-2-bits-per-weight, efficient and lossless inference. The library features
-two core solutions: Ternary Lookup Table (TL), which addresses spatial
-inefficiencies of previous bit-wise methods, and Int2 with a Scale (I2_S),
-which ensures lossless edge inference, both enabling high-speed inference. Our
-experiments show that Bitnet.cpp achieves up to a 6.25x increase in speed over
-full-precision baselines and up to 2.32x over low-bit baselines, setting new
-benchmarks in the field. Additionally, we expand TL to element-wise lookup
-table (ELUT) for low-bit LLMs in the appendix, presenting both theoretical and
-empirical evidence of its considerable potential. Bitnet.cpp is publicly
-available at https://github.com/microsoft/BitNet/tree/paper , offering a
-sophisticated solution for the efficient and practical deployment of edge LLMs.
-
-摘要：隨著由 BitNet b1.58 領先的 1 位元大型語言模型 (LLM) 出現，已激發了對三元 LLM 的興趣。儘管如此，專注於三元 LLM 的高效能邊緣推論的研究和實際應用仍然很少見。為了彌補這個差距，我們引入了 Bitnet.cpp，這是一個針對 BitNet b1.58 和三元 LLM 最佳化的推論系統。由於混合精度矩陣乘法 (mpGEMM) 構成三元 LLM 中推論時間的大部分，Bitnet.cpp 結合了一個新穎的 mpGEMM 函式庫，以利於每權重低於 2 位元、高效能且無損失的推論。該函式庫具有兩個核心解決方案：三元查詢表 (TL)，它解決了先前逐位元方法的空間低效率，以及具有比例的 Int2 (I2_S)，它確保無損失的邊緣推論，兩者都能實現高速推論。我們的實驗顯示，Bitnet.cpp 的速度比全精度的基準快了 6.25 倍，比低位元基準快了 2.32 倍，樹立了該領域的新基準。此外，我們在附錄中將 TL 擴充到逐元素查詢表 (ELUT) 以用於低位元 LLM，並提出其巨大潛力的理論和實證證據。Bitnet.cpp 已公開於 https://github.com/microsoft/BitNet/tree/paper，提供了一個精密的解決方案，用於邊緣 LLM 的高效能和實際部署。
-
-##### **VAQUUM: Are Vague Quantifiers Grounded in Visual Data?**
-2502.11874v1 by Hugh Mee Wong, Rick Nouwen, Albert Gatt
-
-Vague quantifiers such as "a few" and "many" are influenced by many
-contextual factors, including how many objects are present in a given context.
-In this work, we evaluate the extent to which vision-and-language models (VLMs)
-are compatible with humans when producing or judging the appropriateness of
-vague quantifiers in visual contexts. We release a novel dataset, VAQUUM,
-containing 20300 human ratings on quantified statements across a total of 1089
-images. Using this dataset, we compare human judgments and VLM predictions
-using three different evaluation methods. Our findings show that VLMs, like
-humans, are influenced by object counts in vague quantifier use. However, we
-find significant inconsistencies across models in different evaluation
-settings, suggesting that judging and producing vague quantifiers rely on two
-different processes.
-
-摘要：模糊量词，例如「一些」和「许多」，会受到许多语境因素的影响，包括在给定语境中出现的对象数量。在这项工作中，我们评估视觉语言模型 (VLM) 在视觉语境中产生或判断模糊量词的适当性时，与人类的兼容程度。我们发布了一个新数据集 VAQUUM，其中包含对 1089 张图像中的量化陈述的 20300 个人类评级。使用此数据集，我们使用三种不同的评估方法来比较人类判断和 VLM 预测。我们的研究结果表明，VLM 与人类一样，在模糊量词的使用中会受到对象数量的影响。然而，我们发现不同评估设置中的模型之间存在显着的不一致性，这表明判断和产生模糊量词依赖于两个不同的过程。
-
-##### **Southern Newswire Corpus: A Large-Scale Dataset of Mid-Century Wire Articles Beyond the Front Page**
-2502.11866v1 by Michael McRae
-
-I introduce a new large-scale dataset of historical wire articles from U.S.
-Southern newspapers, spanning 1960-1975 and covering multiple wire services:
-The Associated Press, United Press International, Newspaper Enterprise
-Association. Unlike prior work focusing on front-page content, this dataset
-captures articles across the entire newspaper, offering broader insight into
-mid-century Southern coverage. The dataset includes a version that has
-undergone an LLM-based text cleanup pipeline to reduce OCR noise, enhancing its
-suitability for quantitative text analysis. Additionally, duplicate versions of
-articles are retained to enable analysis of editorial differences in language
-and framing across newspapers. Each article is tagged by wire service,
-facilitating comparative studies of editorial patterns across agencies. This
-resource opens new avenues for research in computational social science,
-digital humanities, and historical linguistics, providing a detailed
-perspective on how Southern newspapers relayed national and international news
-during a transformative period in American history. The dataset will be made
-available upon publication or request for research purposes.
-
-摘要：我介紹一個新的美國歷史電訊文章大型資料集，時間跨度為 1960-1975 年，涵蓋多個電訊服務：美聯社、美聯國際社、報業企業協會。與先前專注於頭版內容的研究不同，此資料集擷取了整份報紙的文章，提供更廣泛的見解，深入探討世紀中葉的南方報導。該資料集包含一個經過 LLM 文字清理管線處理的版本，以減少 OCR 雜訊，提升其適用於量化文字分析。此外，保留文章的重複版本，以利分析報紙間語言和架構的編輯差異。每篇文章都標記電訊服務，便於比較各家機構的編輯模式。此資源為計算社會科學、數位人文和歷史語言學的研究開啟了新的途徑，提供一個詳細的觀點，探討南方報紙在美國歷史的轉型時期如何傳遞國內和國際新聞。該資料集將在出版或研究目的請求後提供。
-
-##### **FedEAT: A Robustness Optimization Framework for Federated LLMs**
-2502.11863v1 by Yahao Pang, Xingyuan Wu, Xiaojin Zhang, Wei Chen, Hai Jin
-
-Significant advancements have been made by Large Language Models (LLMs) in
-the domains of natural language understanding and automated content creation.
-However, they still face persistent problems, including substantial
-computational costs and inadequate availability of training data. The
-combination of Federated Learning (FL) and LLMs (federated LLMs) offers a
-solution by leveraging distributed data while protecting privacy, which
-positions it as an ideal choice for sensitive domains. However, Federated LLMs
-still suffer from robustness challenges, including data heterogeneity,
-malicious clients, and adversarial attacks, which greatly hinder their
-applications. We first introduce the robustness problems in federated LLMs. To
-address these challenges, we propose FedEAT (Federated Embedding space
-Adversarial Training), a novel framework that applies adversarial training in
-the embedding space of client LLM and employs a robust aggregation approach,
-specifically geometric median aggregation, to enhance the robustness of
-Federated LLMs. Our experiments demonstrate that FedEAT effectively improves
-the robustness of Federated LLMs with minimal performance loss.
-
-摘要：大型語言模型 (LLM) 在自然語言理解和自動化內容創作領域取得了重大進展。
-然而，它們仍然面臨持續的問題，包括大量的運算成本和訓練數據的可用性不足。
-聯合學習 (FL) 和 LLM（聯合 LLM）的結合提供了一個解決方案，在保護隱私的同時利用分佈式數據，這使其成為敏感領域的理想選擇。
-然而，聯合 LLM 仍然面臨著穩健性的挑戰，包括數據異質性、惡意用戶和對抗性攻擊，這極大地阻礙了它們的應用。
-我們首先介紹了聯合 LLM 中的穩健性問題，為了應對這些挑戰，我們提出了 FedEAT（聯合嵌入空間對抗訓練），這是一個新穎的框架，它在用戶端 LLM 的嵌入空間中應用對抗訓練，並採用穩健的聚合方法，特別是幾何中值聚合，以增強聯合 LLM 的穩健性。
-我們的實驗表明，FedEAT 有效地提高了聯合 LLM 的穩健性，同時性能損失最小。
-
-##### **Understanding In-Context Machine Translation for Low-Resource Languages: A Case Study on Manchu**
-2502.11862v1 by Renhao Pei, Yihong Liu, Peiqin Lin, François Yvon, Hinrich Schütze
-
-In-context machine translation (MT) with large language models (LLMs) is a
-promising approach for low-resource MT, as it can readily take advantage of
-linguistic resources such as grammar books and dictionaries. Such resources are
-usually selectively integrated into the prompt so that LLMs can directly
-perform translation without any specific training, via their in-context
-learning capability (ICL). However, the relative importance of each type of
-resource, e.g., dictionary, grammar book, and retrieved parallel examples, is
-not entirely clear. To address this gap, this study systematically investigates
-how each resource and its quality affect the translation performance, with the
-Manchu language as our case study. To remove any prior knowledge of Manchu
-encoded in the LLM parameters and single out the effect of ICL, we also
-experiment with an encrypted version of Manchu texts. Our results indicate that
-high-quality dictionaries and good parallel examples are very helpful, while
-grammars hardly help. In a follow-up study, we showcase a promising application
-of in-context MT: parallel data augmentation as a way to bootstrap the
-conventional MT model. When monolingual data abound, generating synthetic
-parallel data through in-context MT offers a pathway to mitigate data scarcity
-and build effective and efficient low-resource neural MT systems.
-
-摘要：語境機器翻譯 (MT) 與大型語言模型 (LLM) 結合，對於低資源 MT 來說是一種有前景的方法，因為它可以輕易利用語法書和字典等語言資源。此類資源通常會選擇性地整合到提示中，讓 LLM 能夠透過其語境學習能力 (ICL) 直接執行翻譯，而無需任何特定訓練。然而，每種類型的資源（例如字典、語法書和擷取的平行範例）的相對重要性並不明確。為了解決這個問題，本研究系統性地探討每項資源及其品質如何影響翻譯效能，並以滿語作為我們的案例研究。為了移除 LLM 參數中編碼的任何滿語先備知識，並找出 ICL 的影響，我們也對滿語文本的加密版本進行實驗。我們的結果顯示，高品質的字典和良好的平行範例非常有幫助，而語法幾乎沒有幫助。在後續研究中，我們展示了語境 MT 的一個有前景的應用：平行數據擴充，作為引導傳統 MT 模型的一種方式。當單語資料豐富時，透過語境 MT 產生合成平行資料提供了一條途徑，可以減輕資料短缺，並建構有效且高效的低資源神經 MT 系統。
-
-##### **Exploring Large Language Models in Healthcare: Insights into Corpora Sources, Customization Strategies, and Evaluation Metrics**
-2502.11861v1 by Shuqi Yang, Mingrui Jing, Shuai Wang, Jiaxin Kou, Manfei Shi, Weijie Xing, Yan Hu, Zheng Zhu
-
-This study reviewed the use of Large Language Models (LLMs) in healthcare,
-focusing on their training corpora, customization techniques, and evaluation
-metrics. A systematic search of studies from 2021 to 2024 identified 61
-articles. Four types of corpora were used: clinical resources, literature,
-open-source datasets, and web-crawled data. Common construction techniques
-included pre-training, prompt engineering, and retrieval-augmented generation,
-with 44 studies combining multiple methods. Evaluation metrics were categorized
-into process, usability, and outcome metrics, with outcome metrics divided into
-model-based and expert-assessed outcomes. The study identified critical gaps in
-corpus fairness, which contributed to biases from geographic, cultural, and
-socio-economic factors. The reliance on unverified or unstructured data
-highlighted the need for better integration of evidence-based clinical
-guidelines. Future research should focus on developing a tiered corpus
-architecture with vetted sources and dynamic weighting, while ensuring model
-transparency. Additionally, the lack of standardized evaluation frameworks for
-domain-specific models called for comprehensive validation of LLMs in
-real-world healthcare settings.
-
-摘要：本研究回顧了大型語言模型 (LLM) 在醫療保健中的使用，重點在於其訓練語料庫、自訂技術和評估指標。針對 2021 年至 2024 年的研究進行系統性搜尋，找出 61 篇文章。語料庫類型有四種：臨床資源、文獻、開放原始碼資料集和網路爬取資料。常見的建構技術包括預訓練、提示工程和檢索增強生成，其中有 44 項研究結合多種方法。評估指標分為流程、可用性和成果指標，其中成果指標又分為基於模型和專家評估的成果。本研究發現語料庫公平性存在重大差距，這會導致地理、文化和社會經濟因素的偏見。對未驗證或非結構化資料的依賴性突顯出更佳整合循證臨床指南的必要性。未來的研究應專注於開發具有審查來源和動態加權的分層語料庫架構，同時確保模型透明性。此外，缺乏針對特定領域模型的標準化評估架構，因此需要對 LLM 在實際醫療保健環境中進行全面驗證。
-
-##### **Defining and Evaluating Visual Language Models' Basic Spatial Abilities: A Perspective from Psychometrics**
-2502.11859v1 by Wenrui Xu, Dalin Lyu, Weihang Wang, Jie Feng, Chen Gao, Yong Li
-
-The Theory of Multiple Intelligences underscores the hierarchical nature of
-cognitive capabilities. To advance Spatial Artificial Intelligence, we pioneer
-a psychometric framework defining five Basic Spatial Abilities (BSAs) in Visual
-Language Models (VLMs): Spatial Perception, Spatial Relation, Spatial
-Orientation, Mental Rotation, and Spatial Visualization. Benchmarking 13
-mainstream VLMs through nine validated psychometric experiments reveals
-significant gaps versus humans (average score 24.95 vs. 68.38), with three key
-findings: 1) VLMs mirror human hierarchies (strongest in 2D orientation,
-weakest in 3D rotation) with independent BSAs (Pearson's r<0.4); 2) Smaller
-models such as Qwen2-VL-7B surpass larger counterparts, with Qwen leading
-(30.82) and InternVL2 lagging (19.6); 3) Interventions like chain-of-thought
-(0.100 accuracy gain) and 5-shot training (0.259 improvement) show limits from
-architectural constraints. Identified barriers include weak geometry encoding
-and missing dynamic simulation. By linking psychometric BSAs to VLM
-capabilities, we provide a diagnostic toolkit for spatial intelligence
-evaluation, methodological foundations for embodied AI development, and a
-cognitive science-informed roadmap for achieving human-like spatial
-intelligence.
-
-摘要：多元智能理論強調認知能力的層次性質。為了推進空間人工智慧，我們開創了一個心理測量框架，在視覺語言模型 (VLM) 中定義了五種基本空間能力 (BSA)：空間知覺、空間關係、空間定向、心智旋轉和空間視覺化。通過九項經過驗證的心理測量實驗對 13 個主流 VLM 進行基準測試，揭示了與人類相比的顯著差距（平均分數 24.95 對 68.38），並得出三個關鍵發現：1) VLM 反映人類層次結構（2D 定向最強，3D 旋轉最弱）具有獨立的 BSA（Pearson's r<0.4）；2) Qwen2-VL-7B 等較小的模型超越了較大的模型，其中 Qwen 領先（30.82），InternVL2 落後（19.6）；3) 思想鏈等干預措施（0.100 準確率提升）和 5 次訓練（0.259 提升）顯示了架構約束的限制。已識別的障礙包括弱幾何編碼和缺少動態模擬。通過將心理測量 BSA 與 VLM 能力聯繫起來，我們提供了一個用於空間智能評估的診斷工具包、具身 AI 開發的方法論基礎，以及實現類人空間智能的認知科學信息路標。
-
-##### **LLMs as a synthesis between symbolic and continuous approaches to language**
-2502.11856v1 by Gemma Boleda
-
-Since the middle of the 20th century, a fierce battle has been fought between
-symbolic and continuous approaches to language and cognition. The success of
-deep learning models, and LLMs in particular, has been alternately taken as
-showing that the continuous camp has won, or dismissed as an irrelevant
-engineering development. However, in this position paper I argue that deep
-learning models for language actually represent a synthesis between the two
-traditions. This is because 1) deep learning architectures allow for both
-continuous/distributed and symbolic/discrete-like representations and
-computations; 2) models trained on language make use of this flexibility. In
-particular, I review recent research in mechanistic interpretability that
-showcases how a substantial part of morphosyntactic knowledge is encoded in a
-near-discrete fashion in LLMs. This line of research suggests that different
-behaviors arise in an emergent fashion, and models flexibly alternate between
-the two modes (and everything in between) as needed. This is possibly one of
-the main reasons for their wild success; and it is also what makes them
-particularly interesting for the study of language and cognition. Is it time
-for peace?
-
-摘要：自 20 世紀中葉以來，象徵與連續的語言和認知方法之間展開了一場激烈的戰鬥。深度學習模型，特別是 LLM 的成功，被交替視為連續陣營獲勝的證明，或被視為無關的工程發展而被忽視。然而，在本文中，我認為用於語言的深度學習模型實際上代表了這兩種傳統之間的綜合。這是因為 1) 深度學習架構允許連續/分佈式和符號/離散式表示和計算；2) 在語言上訓練的模型利用了這種靈活性。特別是，我回顧了機制可解釋性的最新研究，展示了形態句法知識的實質部分是如何以近乎離散的方式編碼在 LLM 中的。這條研究線表明，不同的行為以一種新興的方式出現，並且模型根據需要在兩種模式（以及介於兩者之間的所有內容）之間靈活地交替。這可能是它們獲得巨大成功的主要原因之一；這也是它們對語言和認知研究特別有趣的原因。和平的時刻到了嗎？
-
-##### **BaxBench: Can LLMs Generate Correct and Secure Backends?**
-2502.11844v1 by Mark Vero, Niels Mündler, Victor Chibotaru, Veselin Raychev, Maximilian Baader, Nikola Jovanović, Jingxuan He, Martin Vechev
-
-The automatic generation of programs has long been a fundamental challenge in
-computer science. Recent benchmarks have shown that large language models
-(LLMs) can effectively generate code at the function level, make code edits,
-and solve algorithmic coding tasks. However, to achieve full automation, LLMs
-should be able to generate production-quality, self-contained application
-modules. To evaluate the capabilities of LLMs in solving this challenge, we
-introduce BaxBench, a novel evaluation benchmark consisting of 392 tasks for
-the generation of backend applications. We focus on backends for three critical
-reasons: (i) they are practically relevant, building the core components of
-most modern web and cloud software, (ii) they are difficult to get right,
-requiring multiple functions and files to achieve the desired functionality,
-and (iii) they are security-critical, as they are exposed to untrusted
-third-parties, making secure solutions that prevent deployment-time attacks an
-imperative. BaxBench validates the functionality of the generated applications
-with comprehensive test cases, and assesses their security exposure by
-executing end-to-end exploits. Our experiments reveal key limitations of
-current LLMs in both functionality and security: (i) even the best model,
-OpenAI o1, achieves a mere 60% on code correctness; (ii) on average, we could
-successfully execute security exploits on more than half of the correct
-programs generated by each LLM; and (iii) in less popular backend frameworks,
-models further struggle to generate correct and secure applications. Progress
-on BaxBench signifies important steps towards autonomous and secure software
-development with LLMs.
-
-摘要：程式自動產生一直是電腦科學中的基本挑戰。最近的基準測試顯示，大型語言模型 (LLM) 能夠有效產生函數層級的程式碼、進行程式碼編輯，以及解決演算法編碼任務。然而，若要達成完全自動化，LLM 應能夠產生生產品質、獨立的應用程式模組。為了評估 LLM 在解決此挑戰的能力，我們引入了 BaxBench，這是一個包含 392 個後端應用程式產生任務的新評估基準。我們專注於後端有三個關鍵原因：(i) 它們在實務上有其相關性，建構了大多數現代網路和雲端軟體的核心元件；(ii) 它們難以正確執行，需要多個函數和檔案才能達成所需的運作功能；(iii) 它們與安全性息息相關，因為它們會暴露於不受信任的第三方，使得預防部署時攻擊的安全解決方案成為當務之急。BaxBench 使用全面的測試案例驗證產生應用程式的功能，並透過執行端對端漏洞利用來評估其安全性風險。我們的實驗揭露了目前 LLM 在功能和安全性上的主要限制：(i) 即使是最好的模型 OpenAI o1，在程式碼正確性上也僅達到 60%；(ii) 平均而言，對於每個 LLM 產生的正確程式，我們能夠在其中超過一半上成功執行安全漏洞利用；(iii) 在較不受歡迎的後端框架中，模型在產生正確且安全的應用程式上更加困難。在 BaxBench 上的進展代表著使用 LLM 朝向自主且安全的軟體開發邁出了重要的一步。
-
-##### **Can LLM Agents Maintain a Persona in Discourse?**
-2502.11843v1 by Pranav Bhandari, Nicolas Fay, Michael Wise, Amitava Datta, Stephanie Meek, Usman Naseem, Mehwish Nasim
-
-Large Language Models (LLMs) are widely used as conversational agents,
-exploiting their capabilities in various sectors such as education, law,
-medicine, and more. However, LLMs are often subjected to context-shifting
-behaviour, resulting in a lack of consistent and interpretable
-personality-aligned interactions. Adherence to psychological traits lacks
-comprehensive analysis, especially in the case of dyadic (pairwise)
-conversations. We examine this challenge from two viewpoints, initially using
-two conversation agents to generate a discourse on a certain topic with an
-assigned personality from the OCEAN framework (Openness, Conscientiousness,
-Extraversion, Agreeableness, and Neuroticism) as High/Low for each trait. This
-is followed by using multiple judge agents to infer the original traits
-assigned to explore prediction consistency, inter-model agreement, and
-alignment with the assigned personality. Our findings indicate that while LLMs
-can be guided toward personality-driven dialogue, their ability to maintain
-personality traits varies significantly depending on the combination of models
-and discourse settings. These inconsistencies emphasise the challenges in
-achieving stable and interpretable personality-aligned interactions in LLMs.
-
-摘要：大型語言模型 (LLM) 被廣泛用作對話代理，
-在教育、法律、
-醫學等各個領域發揮其能力。然而，LLM 經常受到情境轉換
-行為的影響，導致缺乏一致且可解釋的
-與人格一致的互動。對心理特質的堅持缺乏
-全面的分析，特別是在二元 (成對)
-對話的情況下。我們從兩個觀點審視這個挑戰，最初使用
-兩個對話代理在特定主題上產生論述，並從 OCEAN 框架 (開放性、盡責性、
-外向性、宜人性、神經質) 中分配人格，每個特質為高/低。
-接著使用多個評審代理來推斷最初分配的特質，以探討預測一致性、模型間一致性，
-以及與所分配人格的一致性。我們的研究結果表明，雖然 LLM
-可以引導至以人格為導向的對話，但它們維持
-人格特質的能力會根據模型和論述設定的組合而有顯著差異。這些不一致強調了
-在 LLM 中實現穩定且可解釋的與人格一致的互動的挑戰。
-
-##### **ChordFormer: A Conformer-Based Architecture for Large-Vocabulary Audio Chord Recognition**
-2502.11840v1 by Muhammad Waseem Akram, Stefano Dettori, Valentina Colla, Giorgio Carlo Buttazzo
-
-Chord recognition serves as a critical task in music information retrieval
-due to the abstract and descriptive nature of chords in music analysis. While
-audio chord recognition systems have achieved significant accuracy for small
-vocabularies (e.g., major/minor chords), large-vocabulary chord recognition
-remains a challenging problem. This complexity also arises from the inherent
-long-tail distribution of chords, where rare chord types are underrepresented
-in most datasets, leading to insufficient training samples. Effective chord
-recognition requires leveraging contextual information from audio sequences,
-yet existing models, such as combinations of convolutional neural networks,
-bidirectional long short-term memory networks, and bidirectional transformers,
-face limitations in capturing long-term dependencies and exhibit suboptimal
-performance on large-vocabulary chord recognition tasks. This work proposes
-ChordFormer, a novel conformer-based architecture designed to tackle structural
-chord recognition (e.g., triads, bass, sevenths) for large vocabularies.
-ChordFormer leverages conformer blocks that integrate convolutional neural
-networks with transformers, thus enabling the model to capture both local
-patterns and global dependencies effectively. By addressing challenges such as
-class imbalance through a reweighted loss function and structured chord
-representations, ChordFormer outperforms state-of-the-art models, achieving a
-2% improvement in frame-wise accuracy and a 6% increase in class-wise accuracy
-on large-vocabulary chord datasets. Furthermore, ChordFormer excels in handling
-class imbalance, providing robust and balanced recognition across chord types.
-This approach bridges the gap between theoretical music knowledge and practical
-applications, advancing the field of large-vocabulary chord recognition.
-
-摘要：和弦辨識由於和弦在音樂分析中具有抽象性和描述性，因此在音樂資訊檢索中扮演著重要的任務。雖然音訊和弦辨識系統已在小型詞彙（例如，大調/小調和弦）中達到顯著的準確度，但大型詞彙和弦辨識仍然是一個具有挑戰性的問題。這種複雜性也來自和弦固有的長尾分佈，其中在大多數資料集中罕見的和弦類型代表性不足，導致訓練樣本不足。有效的和弦辨識需要利用音訊序列中的上下文資訊，但現有的模型，例如卷積神經網路、雙向長短期記憶網路和雙向轉換器的組合，在捕捉長期依賴關係方面面臨限制，並且在大詞彙和弦辨識任務上表現不佳。這項工作提出了 ChordFormer，這是一種新穎的基於變形器的架構，旨在解決大型詞彙的結構和弦辨識（例如，三和弦、低音、七和弦）。ChordFormer 利用變形器區塊將卷積神經網路與變形器整合在一起，從而使模型能夠有效地捕捉局部模式和全局依賴關係。透過重新加權損失函數和結構化和弦表示來解決類別不平衡等挑戰，ChordFormer 優於最先進的模型，在大詞彙和弦資料集上實現了幀準確度提高 2% 和類準確度提高 6%。此外，ChordFormer 在處理類別不平衡方面表現出色，在和弦類型中提供穩健且平衡的辨識。這種方法彌合了理論音樂知識與實際應用之間的差距，推動了大型詞彙和弦辨識領域的發展。
-
-##### **Intuitive physics understanding emerges from self-supervised pretraining on natural videos**
-2502.11831v1 by Quentin Garrido, Nicolas Ballas, Mahmoud Assran, Adrien Bardes, Laurent Najman, Michael Rabbat, Emmanuel Dupoux, Yann LeCun
-
-We investigate the emergence of intuitive physics understanding in
-general-purpose deep neural network models trained to predict masked regions in
-natural videos. Leveraging the violation-of-expectation framework, we find that
-video prediction models trained to predict outcomes in a learned representation
-space demonstrate an understanding of various intuitive physics properties,
-such as object permanence and shape consistency. In contrast, video prediction
-in pixel space and multimodal large language models, which reason through text,
-achieve performance closer to chance. Our comparisons of these architectures
-reveal that jointly learning an abstract representation space while predicting
-missing parts of sensory input, akin to predictive coding, is sufficient to
-acquire an understanding of intuitive physics, and that even models trained on
-one week of unique video achieve above chance performance. This challenges the
-idea that core knowledge -- a set of innate systems to help understand the
-world -- needs to be hardwired to develop an understanding of intuitive
-physics.
-
-摘要：我們探討了在經過訓練以預測自然影片中遮蔽區域的通用深度神經網路模型中，直覺物理理解的出現。利用違反預期框架，我們發現經過訓練以預測學習表徵空間中結果的影片預測模型，展現了對各種直覺物理特性的理解，例如物體恆存和形狀一致性。相反地，影片在像素空間和多模態大型語言模型中的預測，透過文字推理，達到的效能接近隨機。我們對這些架構的比較揭示了在預測感官輸入的遺失部分時，同時學習抽象表徵空間，類似於預測編碼，足以獲得對直覺物理的理解，而且即使在獨特影片上訓練一週的模型，也達到了高於隨機的效能。這挑戰了核心知識（一套幫助理解世界的先天系統）需要硬連線才能發展對直覺物理的理解這個想法。
-
-##### **Text Classification in the LLM Era - Where do we stand?**
-2502.11830v1 by Sowmya Vajjala, Shwetali Shimangaud
-
-Large Language Models revolutionized NLP and showed dramatic performance
-improvements across several tasks. In this paper, we investigated the role of
-such language models in text classification and how they compare with other
-approaches relying on smaller pre-trained language models. Considering 32
-datasets spanning 8 languages, we compared zero-shot classification, few-shot
-fine-tuning and synthetic data based classifiers with classifiers built using
-the complete human labeled dataset. Our results show that zero-shot approaches
-do well for sentiment classification, but are outperformed by other approaches
-for the rest of the tasks, and synthetic data sourced from multiple LLMs can
-build better classifiers than zero-shot open LLMs. We also see wide performance
-disparities across languages in all the classification scenarios. We expect
-that these findings would guide practitioners working on developing text
-classification systems across languages.
-
-摘要：大型語言模型革新了自然語言處理，並在多項任務中展現出顯著的效能提升。在本文中，我們探討了此類語言模型在文字分類中的角色，以及它們與依賴較小規模預先訓練語言模型的其他方法相比如何。考量涵蓋 8 種語言的 32 個資料集，我們比較了零次學習分類、少次學習微調和合成資料分類器，以及使用完整人工標記資料集建置的分類器。我們的結果顯示，零次學習方法在情緒分類中表現良好，但在其他任務中則不如其他方法，而來自多個大型語言模型的合成資料可以建置比零次學習開放大型語言模型更好的分類器。我們也看到在所有分類情境中，不同語言之間的效能差異很大。我們預期這些發現將引導從事跨語言文字分類系統開發的實務工作者。
-
-##### **Code-Vision: Evaluating Multimodal LLMs Logic Understanding and Code Generation Capabilities**
-2502.11829v1 by Hanbin Wang, Xiaoxuan Zhou, Zhipeng Xu, Keyuan Cheng, Yuxin Zuo, Kai Tian, Jingwei Song, Junting Lu, Wenhui Hu, Xueyang Liu
-
-This paper introduces Code-Vision, a benchmark designed to evaluate the
-logical understanding and code generation capabilities of Multimodal Large
-Language Models (MLLMs). It challenges MLLMs to generate a correct program that
-fulfills specific functionality requirements based on a given flowchart, which
-visually represents the desired algorithm or process. Code-Vision comprises
-three subsets: HumanEval-V, Algorithm, and MATH, which evaluate MLLMs' coding
-abilities across basic programming, algorithmic, and mathematical
-problem-solving domains. Our experiments evaluate 12 MLLMs on Code-Vision.
-Experimental results demonstrate that there is a large performance difference
-between proprietary and open-source models. On Hard problems, GPT-4o can
-achieve 79.3% pass@1, but the best open-source model only achieves 15%. Further
-experiments reveal that Code-Vision can pose unique challenges compared to
-other multimodal reasoning benchmarks MMCode and MathVista. We also explore the
-reason for the poor performance of the open-source models. All data and codes
-are available at https://github.com/wanghanbinpanda/CodeVision.
-
-摘要：本文介绍 Code-Vision，此基准测试旨在评估多模态大型语言模型 (MLLM) 的逻辑理解和代码生成能力。它要求 MLLM 根据给定的流程图生成一个正确的程序，以满足特定的功能需求，而流程图直观地表示所需的算法或流程。Code-Vision 包含三个子集：HumanEval-V、Algorithm 和 MATH，它们评估 MLLM 在基本编程、算法和数学问题解决域中的编码能力。我们的实验对 Code-Vision 上的 12 个 MLLM 进行了评估。实验结果表明，专有模型和开源模型之间的性能差异很大。在困难问题上，GPT-4o 可以达到 79.3% 的 pass@1，但最好的开源模型只能达到 15%。进一步的实验表明，与其他多模态推理基准 MMCode 和 MathVista 相比，Code-Vision 可能会带来独特的挑战。我们还探讨了开源模型性能不佳的原因。所有数据和代码均可在 https://github.com/wanghanbinpanda/CodeVision 中获得。
-
-##### **M-ABSA: A Multilingual Dataset for Aspect-Based Sentiment Analysis**
-2502.11824v1 by Chengyan Wu, Bolei Ma, Yihong Liu, Zheyu Zhang, Ningyuan Deng, Yanshu Li, Baolan Chen, Yi Zhang, Barbara Plank, Yun Xue
-
-Aspect-based sentiment analysis (ABSA) is a crucial task in information
-extraction and sentiment analysis, aiming to identify aspects with associated
-sentiment elements in text. However, existing ABSA datasets are predominantly
-English-centric, limiting the scope for multilingual evaluation and research.
-To bridge this gap, we present M-ABSA, a comprehensive dataset spanning 7
-domains and 21 languages, making it the most extensive multilingual parallel
-dataset for ABSA to date. Our primary focus is on triplet extraction, which
-involves identifying aspect terms, aspect categories, and sentiment polarities.
-The dataset is constructed through an automatic translation process with human
-review to ensure quality. We perform extensive experiments using various
-baselines to assess performance and compatibility on M-ABSA. Our empirical
-findings highlight that the dataset enables diverse evaluation tasks, such as
-multilingual and multi-domain transfer learning, and large language model
-evaluation, underscoring its inclusivity and its potential to drive
-advancements in multilingual ABSA research.
-
-摘要：面向方面的观点分析 (ABSA) 是資訊萃取和觀點分析中的一項重要任務，旨在識別文本中帶有相關觀點元素的方面。然而，現有的 ABSA 資料集以英語為中心，限制了多語言評估和研究的範圍。為了彌補這個差距，我們提出了 M-ABSA，這是一個涵蓋 7 個領域和 21 種語言的綜合性資料集，使其成為迄今為止最廣泛的多語言平行資料集，適用於 ABSA。我們的重點是三元組萃取，其中涉及識別方面術語、方面類別和觀點極性。該資料集是透過自動翻譯過程構建的，並經過人工審查以確保品質。我們使用各種基線進行廣泛的實驗，以評估 M-ABSA 上的效能和相容性。我們的實證結果強調，該資料集支援多樣化的評估任務，例如多語言和多領域遷移學習，以及大型語言模型評估，凸顯其包容性和推動多語言 ABSA 研究進展的潛力。
-
-##### **AAKT: Enhancing Knowledge Tracing with Alternate Autoregressive Modeling**
-2502.11817v1 by Hao Zhou, Wenge Rong, Jianfei Zhang, Qing Sun, Yuanxin Ouyang, Zhang Xiong
-
-Knowledge Tracing (KT) aims to predict students' future performances based on
-their former exercises and additional information in educational settings. KT
-has received significant attention since it facilitates personalized
-experiences in educational situations. Simultaneously, the autoregressive
-modeling on the sequence of former exercises has been proven effective for this
-task. One of the primary challenges in autoregressive modeling for Knowledge
-Tracing is effectively representing the anterior (pre-response) and posterior
-(post-response) states of learners across exercises. Existing methods often
-employ complex model architectures to update learner states using question and
-response records. In this study, we propose a novel perspective on the knowledge
-tracing task by treating it as a generative process, consistent with the
-principles of autoregressive models. We demonstrate that knowledge states can
-be directly represented through autoregressive encodings on a question-response
-alternate sequence, where the model generates the most probable representation in
-hidden state space by analyzing history interactions. This approach underpins
-our framework, termed Alternate Autoregressive Knowledge Tracing (AAKT).
-Additionally, we incorporate supplementary educational information, such as
-question-related skills, into our framework through an auxiliary task, and
-include extra exercise details, like response time, as additional inputs. Our
-proposed framework is implemented using advanced autoregressive technologies
-from Natural Language Generation (NLG) for both training and prediction.
-Empirical evaluations on four real-world KT datasets indicate that AAKT
-consistently outperforms all baseline models in terms of AUC, ACC, and RMSE.
-Furthermore, extensive ablation studies and visualized analysis validate the
-effectiveness of key components in AAKT.
-
-摘要：知識追蹤 (KT) 旨在根據學生的前次練習和教育環境中的額外資訊，預測學生的未來表現。由於 KT 能促進教育情境中的個人化體驗，因此備受關注。同時，前次練習序列上的自迴歸模型已被證明對此任務有效。知識追蹤中自迴歸模型的主要挑戰之一，是有效表示學習者在各項練習中的先驗 (反應前) 和後驗 (反應後) 狀態。現有方法通常採用複雜的模型架構，使用問題和反應記錄來更新學習者狀態。在本研究中，我們提出了一個關於知識追蹤任務的新觀點，將其視為一個生成過程，與自迴歸模型的原理一致。我們證明了知識狀態可以直接透過問答交替序列上的自迴歸編碼來表示，其中模型透過分析歷史互動來生成隱藏狀態空間中最可能的表示。此方法支撐了我們的架構，稱為交替自迴歸知識追蹤 (AAKT)。此外，我們透過輔助任務將補充教育資訊（例如與問題相關的技能）納入我們的架構，並將額外練習細節（例如反應時間）納入額外輸入。我們提出的架構是使用自然語言生成 (NLG) 的先進自迴歸技術，用於訓練和預測。對四個真實世界的 KT 資料集進行的經驗評估表明，AAKT 在 AUC、ACC 和 RMSE 方面始終優於所有基準模型。此外，廣泛的消融研究和視覺化分析驗證了 AAKT 中關鍵組件的有效性。
-
-##### **Towards Understanding Fine-Tuning Mechanisms of LLMs via Circuit Analysis**
-2502.11812v1 by Xu Wang, Yan Hu, Wenyu Du, Reynold Cheng, Benyou Wang, Difan Zou
-
-Fine-tuning significantly improves the performance of Large Language Models
-(LLMs), yet its underlying mechanisms remain poorly understood. This paper aims
-to provide an in-depth interpretation of the fine-tuning process through
-circuit analysis, a popular tool in Mechanistic Interpretability (MI). Unlike
-previous studies
-\cite{prakash2024finetuningenhancesexistingmechanisms,chhabra2024neuroplasticity}
-that focus on tasks where pre-trained models already perform well, we develop a
-set of mathematical tasks where fine-tuning yields substantial performance
-gains, which are closer to the practical setting. In our experiments, we
-identify circuits at various checkpoints during fine-tuning and examine the
-interplay between circuit analysis, fine-tuning methods, and task complexities.
-First, we find that while circuits maintain high node similarity before and
-after fine-tuning, their edges undergo significant changes, which is in
-contrast to the previous work
-\cite{prakash2024finetuningenhancesexistingmechanisms,chhabra2024neuroplasticity}
-that show circuits only add some additional components after fine-tuning. Based
-on these observations, we develop a circuit-aware Low-Rank Adaptation (LoRA)
-method, which assigns ranks to layers based on edge changes in the circuits.
-Experimental results demonstrate that our circuit-based LoRA algorithm achieves
-an average performance improvement of 2.46\% over standard LoRA with similar
-parameter sizes. Furthermore, we explore how combining circuits from subtasks
-can enhance fine-tuning in compositional tasks, providing new insights into the
-design of such tasks and deepening the understanding of circuit dynamics and
-fine-tuning mechanisms.
-
-摘要：微調大幅提升大型語言模型 (LLM) 的效能，但其底層機制仍鮮為人知。本文旨在透過電路分析，一種機械可解釋性 (MI) 中廣泛使用的工具，提供微調過程的深入詮釋。不同於先前的研究
-\cite{prakash2024finetuningenhancesexistingmechanisms,chhabra2024neuroplasticity}
-專注於預訓練模型已表現良好的任務，我們開發了一組數學任務，其中微調產生顯著的效能提升，更接近實際設定。在我們的實驗中，我們在微調期間的各種檢查點識別電路，並探討電路分析、微調方法和任務複雜度之間的交互作用。首先，我們發現電路在微調前後雖然維持高節點相似度，但其邊緣卻經歷顯著變化，這與先前的研究
-\cite{prakash2024finetuningenhancesexistingmechanisms,chhabra2024neuroplasticity}
-顯示電路僅在微調後新增一些額外組件的結果相反。基於這些觀察，我們開發了一個電路感知低秩適應 (LoRA) 方法，根據電路中的邊緣變化為層級分配秩。實驗結果證明，我們的基於電路的 LoRA 演算法在參數大小相似的條件下，比標準 LoRA 平均提升了 2.46% 的效能。此外，我們探討如何結合子任務的電路來增強組合任務中的微調，為此類任務的設計提供新的見解，並加深對電路動態和微調機制的理解。
-
-##### **FineFilter: A Fine-grained Noise Filtering Mechanism for Retrieval-Augmented Large Language Models**
-2502.11811v1 by Qianchi Zhang, Hainan Zhang, Liang Pang, Hongwei Zheng, Yongxin Tong, Zhiming Zheng
-
-Retrieved documents containing noise will hinder Retrieval-Augmented
-Generation (RAG) from detecting answer clues, necessitating noise filtering
-mechanisms to enhance accuracy. Existing methods use re-ranking or summarization
-to identify the most relevant sentences, but directly and accurately locating
-answer clues from these large-scale and complex documents remains challenging.
-Unlike these document-level operations, we treat noise filtering as a
-sentence-level MinMax optimization problem: first identifying the potential
-clues from multiple documents using contextual information, then ranking them
-by relevance, and finally retaining the fewest clues through truncation. In this
-paper, we propose FineFilter, a novel fine-grained noise filtering mechanism
-for RAG consisting of a clue extractor, a re-ranker, and a truncator. We
-optimize each module to tackle complex reasoning challenges: (1) Clue extractor
-first uses sentences containing the answer and similar ones as fine-tuned
-targets, aiming at extracting sufficient potential clues; (2) Re-ranker is
-trained to prioritize effective clues based on the real feedback from
-generation module, with clues capable of generating correct answer as positive
-samples and others as negative; (3) Truncator takes the minimum clues needed to
-answer the question (truncation point) as fine-tuned targets, and performs
-truncation on the re-ranked clues to achieve fine-grained noise filtering.
-Experiments on three QA datasets demonstrate that FineFilter significantly
-outperforms baselines in terms of performance and inference cost. Further
-analysis on each module shows the effectiveness of our optimizations for
-complex reasoning.
-
-摘要：檢索到含有雜訊的文件會阻礙檢索增強生成 (RAG) 偵測答案線索，因此需要雜訊過濾機制來增強準確性。現有方法使用重新排序或摘要來找出最相關的句子，但從這些大規模且複雜的文件中直接且準確地找出答案線索仍然具有挑戰性。與這些文件層級的操作不同，我們將雜訊過濾視為一個句子層級的 MinMax 最佳化問題：首先使用脈絡資訊從多個文件中找出潛在線索，接著依據相關性對它們進行排序，最後透過截斷保留最少的線索。在本文中，我們提出 FineFilter，一種創新的細緻雜訊過濾機制，用於 RAG，它包含一個線索萃取器、一個重新排序器和一個截斷器。我們最佳化每個模組來應對複雜的推理挑戰：(1) 線索萃取器首先使用包含答案和類似答案的句子作為微調的目標，旨在萃取足夠的潛在線索；(2) 重新排序器經過訓練，根據生成模組的真實回饋來優先處理有效的線索，其中能夠生成正確答案的線索為正樣本，其他則為負樣本；(3) 截斷器將回答問題所需的最小線索 (截斷點) 視為微調的目標，並對重新排序的線索執行截斷，以達成細緻的雜訊過濾。在三個問答資料集上的實驗證實，FineFilter 在效能和推論成本方面都明顯優於基線。進一步分析每個模組顯示，我們的最佳化對於複雜推理而言是有效的。
-
-##### **Revealing Bias Formation in Deep Neural Networks Through the Geometric Mechanisms of Human Visual Decoupling**
-2502.11809v1 by Yanbiao Ma, Bowei Liu, Wei Dai, Jiayi Chen, Shuo Li
-
-Deep neural networks (DNNs) often exhibit biases toward certain categories
-during object recognition, even under balanced training data conditions. The
-intrinsic mechanisms underlying these biases remain unclear. Inspired by the
-human visual system, which decouples object manifolds through hierarchical
-processing to achieve object recognition, we propose a geometric analysis
-framework linking the geometric complexity of class-specific perceptual
-manifolds in DNNs to model bias. Our findings reveal that differences in
-geometric complexity can lead to varying recognition capabilities across
-categories, introducing biases. To support this analysis, we present the
-Perceptual-Manifold-Geometry library, designed for calculating the geometric
-properties of perceptual manifolds.
-
-摘要：深度神經網路 (DNN) 在物件辨識過程中，即使在平衡的訓練資料條件下，通常會對特定類別表現出偏見。這些偏見背後的基本機制仍然不清楚。受人類視覺系統的啟發，人類視覺系統透過階層化處理來解耦物件流形以達成物件辨識，我們提出一個幾何分析架構，將 DNN 中特定類別感知流形的幾何複雜度與模型偏見連結起來。我們的研究結果顯示，幾何複雜度的差異會導致不同類別的辨識能力有所不同，進而造成偏見。為了支持這個分析，我們提出感知流形幾何函式庫，用於計算感知流形的幾何屬性。
-
-##### **Exploring Translation Mechanism of Large Language Models**
-2502.11806v1 by Hongbin Zhang, Kehai Chen, Xuefeng Bai, Xiucheng Li, Min Zhang
-
-Large language models (LLMs) have succeeded remarkably in multilingual
-translation tasks. However, the inherent translation mechanisms of LLMs remain
-poorly understood, largely due to sophisticated architectures and vast
-parameter scales. In response to this issue, this study explores the
-translation mechanism of LLM from the perspective of computational components
-(e.g., attention heads and MLPs). Path patching is utilized to explore causal
-relationships between components, detecting those crucial for translation tasks
-and subsequently analyzing their behavioral patterns in human-interpretable
-terms. Comprehensive analysis reveals that translation is predominantly
-facilitated by a sparse subset of specialized attention heads (less than 5\%),
-which extract source language, indicator, and positional features. MLPs
-subsequently integrate and process these features by transiting towards
-English-centric latent representations. Notably, building on the above
-findings, targeted fine-tuning of only 64 heads achieves translation
-improvement comparable to full-parameter tuning while preserving general
-capabilities.
-
-摘要：大型語言模型 (LLM) 在多語言翻譯任務中取得了顯著的成功。然而，LLM 內在的翻譯機制仍未被很好地理解，這主要是由於複雜的架構和龐大的參數規模。為了應對這個問題，本研究從計算元件（例如注意力頭和 MLP）的角度探討了 LLM 的翻譯機制。路徑修補用於探索元件之間的因果關係，檢測對翻譯任務至關重要的元件，並隨後以人類可解釋的方式分析它們的行為模式。綜合分析表明，翻譯主要由稀疏的專門注意力頭（不到 5%）促進，這些注意力頭提取源語言、指標和位置特徵。MLPs 隨後通過轉換為以英語為中心的潛在表示來整合和處理這些特徵。值得注意的是，根據上述發現，僅對 64 個頭進行有針對性的微調，即可實現與全參數調整相當的翻譯改進，同時保留一般能力。
-
-##### **Table-Critic: A Multi-Agent Framework for Collaborative Criticism and Refinement in Table Reasoning**
-2502.11799v1 by Peiying Yu, Guoxin Chen, Jingjing Wang
-
-Despite the remarkable capabilities of large language models (LLMs) in
-various reasoning tasks, they still struggle with table reasoning tasks,
-particularly in maintaining consistency throughout multi-step reasoning
-processes. While existing approaches have explored various decomposition
-strategies, they often lack effective mechanisms to identify and correct errors
-in intermediate reasoning steps, leading to cascading error propagation. To
-address these issues, we propose Table-Critic, a novel multi-agent framework
-that facilitates collaborative criticism and iterative refinement of the
-reasoning process until convergence to correct solutions. Our framework
-consists of four specialized agents: a Judge for error identification, a Critic
-for comprehensive critiques, a Refiner for process improvement, and a Curator
-for pattern distillation. To effectively deal with diverse and unpredictable
-error types, we introduce a self-evolving template tree that systematically
-accumulates critique knowledge through experience-driven learning and guides
-future reflections. Extensive experiments have demonstrated that Table-Critic
-achieves substantial improvements over existing methods, achieving superior
-accuracy and error correction rates while maintaining computational efficiency
-and lower solution degradation rate.
-
-摘要：儘管大型語言模型 (LLM) 在各種推理任務中展現出非凡的能力，它們在表格推理任務中仍面臨挑戰，特別是在多步驟推理過程中維持一致性方面。現有方法雖然探索了各種分解策略，但它們通常缺乏有效機制來識別和修正中間推理步驟中的錯誤，導致錯誤遞增。為了解決這些問題，我們提出 Table-Critic，一個新穎的多代理架構，它促進協作批評和反覆改進推理過程，直到收斂到正確的解決方案。我們的架構包含四個專業代理：用於錯誤識別的法官、用於全面批評的批評者、用於流程改進的精煉器，以及用於模式萃取的策展人。為了有效處理多樣且不可預測的錯誤類型，我們引入了一個自演化範本樹，它透過經驗驅動的學習系統性地累積批評知識，並引導未來的反思。廣泛的實驗證明，Table-Critic 在現有方法的基礎上取得了顯著的進步，在維持運算效率和較低解決方案劣化率的同時，達到了更高的準確度和錯誤修正率。
-
-##### **Personality Editing for Language Models through Relevant Knowledge Editing**
-2502.11789v1 by Seojin Hwang, Yumin Kim, Byeongjeong Kim, Hwanhee Lee
-
-Large Language Models (LLMs) play a vital role in applications like
-conversational agents and content creation, where controlling a model's
-personality is crucial for maintaining tone, consistency, and engagement.
-However, traditional prompt-based techniques for controlling personality often
-fall short, as they do not effectively mitigate the model's inherent biases. In
-this paper, we introduce a novel method PALETTE that enhances personality
-control through knowledge editing. By generating adjustment queries inspired by
-psychological assessments, our approach systematically adjusts responses to
-personality-related queries similar to modifying factual knowledge, thereby
-achieving controlled shifts in personality traits. Experimental results from
-both automatic and human evaluations demonstrate that our method enables more
-stable and well-balanced personality control in LLMs.
-
-摘要：大型語言模型 (LLM) 在會話代理和內容創作等應用程式中扮演至關重要的角色，其中控制模型的人格特質對於維持語氣、一致性和參與度至關重要。然而，傳統基於提示的控制人格技術通常無法達到預期效果，因為它們無法有效減輕模型固有的偏差。在本文中，我們介紹一種創新的方法 PALETTE，它通過知識編輯來增強人格控制。透過產生受心理評量啟發的調整查詢，我們的做法系統性地調整對人格相關查詢的回應，類似於修改事實知識，從而實現人格特質的受控轉變。來自自動和人工評估的實驗結果表明，我們的模型能夠在 LLM 中實現更穩定且均衡的人格控制。
-
-##### **Efficient Response Generation Method Selection for Fine-Tuning Large Language Models**
-2502.11779v1 by Xuan Ren, Qi Chen, Lingqiao Liu
-
-The training data for fine-tuning large language models (LLMs) is typically
-structured as input-output pairs. However, for many tasks, there can be
-multiple equally valid output variations for the same input. Recent studies
-have observed that the choice of output variation used in training can affect
-the model's performance. This raises an important question: how can we generate
-the most effective output from the many possible response generation strategy
-options? Rather than relying on the traditional but resource-intensive
-train-and-evaluate approach, this paper proposes a scalable, approximate method
-for estimating the quality of a small subset of generated training data derived
-from the same input. We then evaluate how well this small subset of generated
-output fits the target model we are trying to train. We present a large-scale
-benchmark covering diverse reasoning-based datasets to support our study.
-  The central idea is that a good output should closely resemble the output
-generated by the target LLM. We formalize this 'closeness' as the expected
-alignment score between a candidate output and the output sampled from the
-target LLM. We connect this measurement to the perplexity metric used in
-previous literature and demonstrate that leveraging an alignment-based metric
-can provide better predictions of model performance. Using this strategy, we
-can evaluate a small subset of the generated output from each response
-generation strategy option, then select the most effective strategy. We show
-that an LLM trained on data generated by the selected strategy could lead to a
-significant performance gain in many cases.
-
-摘要：大型語言模型 (LLM) 的微調訓練資料通常
-以輸入輸出配對結構化。然而，對於許多任務而言，相同的輸入可能有多個同樣有效的輸出變化。最近的研究
-觀察到訓練中使用的輸出變化選擇會影響模型的效能。這引發了一個重要問題：我們如何從許多可能的回應產生策略選項中產生最有效的輸出？本文提出一個可擴充、近似的方法，用於估計從相同輸入衍生的訓練資料小子集的品質，而非依賴傳統但資源密集的訓練和評估方法。然後我們評估這個產生輸出的小子集與我們嘗試訓練的目標模型的契合程度。我們提出一個涵蓋各種基於推理的資料集的大規模基準，以支持我們的研究。
-核心概念是良好的輸出應與目標 LLM 產生的輸出密切相似。我們將這種「接近度」形式化為候選輸出與從目標 LLM 取樣的輸出之間的預期對齊分數。我們將此測量連接到先前文獻中使用的困惑度指標，並證明利用基於對齊的指標可以提供更好的模型效能預測。使用此策略，我們可以評估每個回應產生策略選項所產生輸出的小子集，然後選擇最有效的策略。我們展示在由所選策略產生的資料上訓練的 LLM，在許多情況下可能導致顯著的效能提升。
-
-##### **Deep Neural Networks for Accurate Depth Estimation with Latent Space Features**
-2502.11777v1 by Siddiqui Muhammad Yasir, Hyunsik Ahn
-
-Depth estimation plays a pivotal role in advancing human-robot interactions,
-especially in indoor environments where accurate 3D scene reconstruction is
-essential for tasks like navigation and object handling. Monocular depth
-estimation, which relies on a single RGB camera, offers a more affordable
-solution compared to traditional methods that use stereo cameras or LiDAR.
-However, despite recent progress, many monocular approaches struggle with
-accurately defining depth boundaries, leading to less precise reconstructions.
-In response to these challenges, this study introduces a novel depth estimation
-framework that leverages latent space features within a deep convolutional
-neural network to enhance the precision of monocular depth maps. The proposed
-model features dual encoder-decoder architecture, enabling both color-to-depth
-and depth-to-depth transformations. This structure allows for refined depth
-estimation through latent space encoding. To further improve the accuracy of
-depth boundaries and local features, a new loss function is introduced. This
-function combines latent loss with gradient loss, helping the model maintain
-the integrity of depth boundaries. The framework is thoroughly tested using the
-NYU Depth V2 dataset, where it sets a new benchmark, particularly excelling in
-complex indoor scenarios. The results clearly show that this approach
-effectively reduces depth ambiguities and blurring, making it a promising
-solution for applications in human-robot interaction and 3D scene
-reconstruction.
-
-摘要：深度估計在推進人機互動方面發揮著至關重要的作用，特別是在室內環境中，準確的 3D 場景重建對於導航和物體處理等任務至關重要。單目深度估計依賴於單個 RGB 相機，與使用立體相機或 LiDAR 的傳統方法相比，它提供了一個更經濟的解決方案。然而，儘管最近取得了進展，許多單目方法在準確定義深度邊界方面仍然存在困難，從而導致重建精度降低。為了應對這些挑戰，本研究引入了一個新穎的深度估計框架，該框架利用深度卷積神經網路中的潛在空間特徵來增強單目深度圖的精度。所提出的模型採用雙編碼器-解碼器架構，既能進行顏色到深度的轉換，又能進行深度到深度的轉換。這種結構允許通過潛在空間編碼進行精確的深度估計。為了進一步提高深度邊界和局部特徵的精度，引入了一個新的損失函數。此函數將潛在損失與梯度損失相結合，幫助模型維護深度邊界的完整性。使用 NYU Depth V2 數據集對該框架進行了全面測試，在該數據集上，它設定了一個新的基準，特別是在複雜的室內場景中表現出色。結果清楚地表明，這種方法有效地減少了深度模糊和模糊，使其成為人機互動和 3D 場景重建應用中一種有前途的解決方案。
-
-##### **The Validation Gap: A Mechanistic Analysis of How Language Models Compute Arithmetic but Fail to Validate It**
-2502.11771v1 by Leonardo Bertolazzi, Philipp Mondorf, Barbara Plank, Raffaella Bernardi
-
-The ability of large language models (LLMs) to validate their output and
-identify potential errors is crucial for ensuring robustness and reliability.
-However, current research indicates that LLMs struggle with self-correction,
-encountering significant challenges in detecting errors. While studies have
-explored methods to enhance self-correction in LLMs, relatively little
-attention has been given to understanding the models' internal mechanisms
-underlying error detection. In this paper, we present a mechanistic analysis of
-error detection in LLMs, focusing on simple arithmetic problems. Through
-circuit analysis, we identify the computational subgraphs responsible for
-detecting arithmetic errors across four smaller-sized LLMs. Our findings reveal
-that all models heavily rely on $\textit{consistency heads}$--attention heads
-that assess surface-level alignment of numerical values in arithmetic
-solutions. Moreover, we observe that the models' internal arithmetic
-computation primarily occurs in higher layers, whereas validation takes place
-in middle layers, before the final arithmetic results are fully encoded. This
-structural dissociation between arithmetic computation and validation seems to
-explain why current LLMs struggle to detect even simple arithmetic errors.
-
-摘要：大型語言模型 (LLM) 驗證其輸出並識別潛在錯誤的能力對於確保穩健性和可靠性至關重要。
-然而，目前的研究所示，LLM 難以進行自我修正，在檢測錯誤時遇到重大挑戰。儘管研究已探討增強 LLM 自我修正的方法，但對於瞭解模型內部錯誤檢測機制卻關注較少。在本文中，我們提出對 LLM 中錯誤檢測的機制分析，重點關注簡單的算術問題。通過電路分析，我們識別出負責檢測四個較小規模 LLM 中算術錯誤的計算子圖。我們的研究結果表明，所有模型都嚴重依賴於「一致性頭部」--注意頭部，用於評估算術解中數值表面的對齊方式。此外，我們觀察到模型的內部算術運算主要發生在較高層，而驗證則發生在中間層，在最終算術結果完全編碼之前。算術運算和驗證之間的這種結構性分離似乎解釋了為什麼當前的 LLM 難以檢測到即使是簡單的算術錯誤。
-
-##### **Cognitive-Aligned Document Selection for Retrieval-augmented Generation**
-2502.11770v1 by Bingyu Wan, Fuxi Zhang, Zhongpeng Qi, Jiayi Ding, Jijun Li, Baoshi Fan, Yijia Zhang, Jun Zhang
-
-Large language models (LLMs) inherently display hallucinations since the
-precision of generated texts cannot be guaranteed purely by the parametric
-knowledge they include. Although retrieval-augmented generation (RAG) systems
-enhance the accuracy and reliability of generative models by incorporating
-external documents, these retrieved documents often fail to adequately support
-the model's responses in practical applications. To address this issue, we
-propose GGatrieval (Fine-\textbf{G}rained \textbf{G}rounded \textbf{A}lignment
-Re\textbf{trieval} for verifiable generation), which leverages an LLM to
-dynamically update queries and filter high-quality, reliable retrieval
-documents. Specifically, we parse the user query into its syntactic components
-and perform fine-grained grounded alignment with the retrieved documents. For
-query components that cannot be individually aligned, we propose a dynamic
-semantic compensation mechanism that iteratively refines and rewrites the query
-while continuously updating the retrieval results. This iterative process
-continues until the retrieved documents sufficiently support the query's
-response. Our approach introduces a novel criterion for filtering retrieved
-documents, closely emulating human strategies for acquiring targeted
-information. This ensures that the retrieved content effectively supports and
-verifies the generated outputs. On the ALCE benchmark, our method significantly
-surpasses a wide range of baselines, achieving state-of-the-art performance.
-
-摘要：大型語言模型 (LLM) 本質上會出現幻覺，因為生成的文本的準確性無法僅透過它們包含的參數化知識來保證。儘管檢索增強生成 (RAG) 系統透過納入外部文件來提升生成模型的準確性和可靠性，但這些檢索的文件在實際應用中常常無法充分支援模型的回應。為了解決這個問題，我們提出 GGatrieval（用於可驗證生成的精細化粒度化基礎對齊檢索），它利用 LLM 來動態更新查詢並過濾高品質、可靠的檢索文件。具體來說，我們將使用者查詢分析成其語法組成部分，並對檢索文件執行精細化粒度化基礎對齊。對於無法個別對齊的查詢組成部分，我們提出一個動態語義補償機制，在持續更新檢索結果的同時，反覆修正和重寫查詢。這個反覆的程序會持續到檢索的文件充分支援查詢的回應為止。我們的做法引進了一個新的檢索文件過濾標準，嚴密地模擬人類獲取目標資訊的策略。這確保檢索的內容有效地支援和驗證生成的輸出。在 ALCE 基準測試中，我們的做法顯著超越各種基線，達成最先進的效能。
-
-##### **From Selection to Generation: A Survey of LLM-based Active Learning**
-2502.11767v1 by Yu Xia, Subhojyoti Mukherjee, Zhouhang Xie, Junda Wu, Xintong Li, Ryan Aponte, Hanjia Lyu, Joe Barrow, Hongjie Chen, Franck Dernoncourt, Branislav Kveton, Tong Yu, Ruiyi Zhang, Jiuxiang Gu, Nesreen K. Ahmed, Yu Wang, Xiang Chen, Hanieh Deilamsalehy, Sungchul Kim, Zhengmian Hu, Yue Zhao, Nedim Lipka, Seunghyun Yoon, Ting-Hao Kenneth Huang, Zichao Wang, Puneet Mathur, Soumyabrata Pal, Koyel Mukherjee, Zhehao Zhang, Namyong Park, Thien Huu Nguyen, Jiebo Luo, Ryan A. Rossi, Julian McAuley
-
-Active Learning (AL) has been a powerful paradigm for improving model
-efficiency and performance by selecting the most informative data points for
-labeling and training. In recent active learning frameworks, Large Language
-Models (LLMs) have been employed not only for selection but also for generating
-entirely new data instances and providing more cost-effective annotations.
-Motivated by the increasing importance of high-quality data and efficient model
-training in the era of LLMs, we present a comprehensive survey on LLM-based
-Active Learning. We introduce an intuitive taxonomy that categorizes these
-techniques and discuss the transformative roles LLMs can play in the active
-learning loop. We further examine the impact of AL on LLM learning paradigms
-and its applications across various domains. Finally, we identify open
-challenges and propose future research directions. This survey aims to serve as
-an up-to-date resource for researchers and practitioners seeking to gain an
-intuitive understanding of LLM-based AL techniques and deploy them to new
-applications.
-
-摘要：主動學習 (AL) 透過挑選最具資訊性的資料點來標記和訓練，已成為一種強大的範例，用以提升模型效率和效能。在最近的主動學習架構中，大型語言模型 (LLM) 不僅用於挑選，也用於產生全新的資料實例，並提供更具成本效益的註解。在大型語言模型時代，由於高品質資料和高效能模型訓練日益重要，我們針對基於大型語言模型的主動學習提出了一項全面的調查。我們提出一個直覺式的分類法，用以分類這些技術，並探討大型語言模型在主動學習迴圈中可以扮演的轉型角色。我們進一步探討主動學習對大型語言模型學習範例的影響，以及它在各種領域中的應用。最後，我們找出開放式挑戰，並提出未來的研究方向。本調查旨在作為研究人員和實務工作者的最新資源，用以獲得對基於大型語言模型的主動學習技術的直覺式理解，並將其部署至新的應用程式。
-
-##### **Warmup-Distill: Bridge the Distribution Mismatch between Teacher and Student before Knowledge Distillation**
-2502.11766v1 by Zengkui Sun, Yijin Liu, Fandong Meng, Yufeng Chen, Jinan Xu, Jie Zhou
-
-The widespread deployment of Large Language Models (LLMs) is hindered by the
-high computational demands, making knowledge distillation (KD) crucial for
-developing compact smaller ones. However, the conventional KD methods endure
-the distribution mismatch issue between the teacher and student models, leading
-to the poor performance of distillation. For instance, the widely-used KL-based
-methods suffer the mode-averaging and mode-collapsing problems, since the
-mismatched probabitliy distribution between both models. Previous studies
-mainly optimize this issue via different distance calculations towards the
-distribution of both models. Unfortunately, the distribution mismatch issue
-still exists in the early stage of the distillation. Hence, to reduce the
-impact of distribution mismatch, we propose a simple yet efficient method,
-named Warmup-Distill, which aligns the distillation of the student to that of
-the teacher in advance of distillation. Specifically, we first detect the
-distribution of the student model in practical scenarios with its internal
-knowledge, and then modify the knowledge with low probability via the teacher
-as the checker. Consequently, Warmup-Distill aligns the internal student's
-knowledge to that of the teacher, which expands the distribution of the student
-with the teacher's, and assists the student model to learn better in the
-subsequent distillation. Experiments on the seven benchmarks demonstrate that
-Warmup-Distill could provide a warmup student more suitable for distillation,
-which outperforms the vanilla student by as least +0.4 averaged score among all
-benchmarks. Noteably, with the assistance of Warmup-Distill, the distillation
-on the math task could yield a further improvement, at most +1.9% accuracy.
-
-摘要：大型語言模型 (LLM) 的廣泛部署受到高運算需求的阻礙，這使得知識蒸餾 (KD) 對於開發緊湊型的小型模型至關重要。然而，傳統的 KD 方法忍受了教師和學生模型之間的分布不匹配問題，導致蒸餾效果不佳。例如，廣泛使用的基於 KL 的方法會出現模式平均和模式崩潰問題，因為兩個模型之間的機率分佈不匹配。先前的研究主要透過不同的距離計算來最佳化這個問題，以朝向兩個模型的分布。不幸的是，分布不匹配的問題仍然存在於蒸餾的早期階段。因此，為了減少分布不匹配的影響，我們提出了一種簡單但有效的方法，稱為 Warmup-Distill，它在蒸餾之前將學生的蒸餾與教師的蒸餾對齊。具體來說，我們首先使用其內部知識在實際場景中檢測學生的分布，然後透過教師作為檢查員修改低機率的知識。因此，Warmup-Distill 將學生的內部知識與教師的知識對齊，這會將學生的分布擴展到教師的分布，並協助學生模型在後續的蒸餾中學習得更好。在七個基準測試上的實驗表明，Warmup-Distill 可以提供更適合蒸餾的熱身學生，在所有基準測試中，其表現優於香草學生至少 +0.4 的平均分數。值得注意的是，在 Warmup-Distill 的協助下，數學任務上的蒸餾可以進一步提升，最多可提升 +1.9% 的準確度。
-
-##### **Lightweight Deepfake Detection Based on Multi-Feature Fusion**
-2502.11763v1 by Siddiqui Muhammad Yasir, Hyun Kim
-
-Deepfake technology utilizes deep learning based face manipulation techniques
-to seamlessly replace faces in videos creating highly realistic but
-artificially generated content. Although this technology has beneficial
-applications in media and entertainment misuse of its capabilities may lead to
-serious risks including identity theft cyberbullying and false information. The
-integration of DL with visual cognition has resulted in important technological
-improvements particularly in addressing privacy risks caused by artificially
-generated deepfake images on digital media platforms. In this study we propose
-an efficient and lightweight method for detecting deepfake images and videos
-making it suitable for devices with limited computational resources. In order
-to reduce the computational burden usually associated with DL models our method
-integrates machine learning classifiers in combination with keyframing
-approaches and texture analysis. Moreover the features extracted with a
-histogram of oriented gradients (HOG) local binary pattern (LBP) and KAZE bands
-were integrated to evaluate using random forest extreme gradient boosting extra
-trees and support vector classifier algorithms. Our findings show a
-feature-level fusion of HOG LBP and KAZE features improves accuracy to 92% and
-96% on FaceForensics++ and Celeb-DFv2 respectively.
-
-摘要：深度偽造技術利用基於深度學習的換臉技術，可無縫替換影片中的臉孔，創造出高度逼真但人工產生的內容。儘管這項技術在媒體和娛樂方面有益，但若誤用其功能可能會導致嚴重的風險，包括身分盜用、網路霸凌和虛假訊息。深度學習與視覺認知的整合已帶來重要的技術進步，特別是在解決由數位媒體平台上的人工深度偽造影像所造成的隱私風險方面。在本研究中，我們提出了一種用於偵測深度偽造影像和影片的有效且輕量級的方法，使其適用於運算資源有限的裝置。為了降低通常與深度學習模型相關的運算負擔，我們的做法結合了機器學習分類器、關鍵影格方法和紋理分析。此外，我們整合了使用方向梯度直方圖 (HOG)、局部二進位模式 (LBP) 和 KAZE 頻段所萃取出的特徵，並使用隨機森林、極端梯度提升、額外樹木和支援向量分類器演算法進行評估。我們的研究結果顯示，HOG、LBP 和 KAZE 特徵的層級融合將準確度提升至 92%，分別在 FaceForensics++ 和 Celeb-DFv2 上達到 96%。
-
-##### **HintsOfTruth: A Multimodal Checkworthiness Detection Dataset with Real and Synthetic Claims**
-2502.11753v1 by Michiel van der Meer, Pavel Korshunov, Sébastien Marcel, Lonneke van der Plas
-
-Misinformation can be countered with fact-checking, but the process is costly
-and slow. Identifying checkworthy claims is the first step, where automation
-can help scale fact-checkers' efforts. However, detection methods struggle with
-content that is 1) multimodal, 2) from diverse domains, and 3) synthetic. We
-introduce HintsOfTruth, a public dataset for multimodal checkworthiness
-detection with $27$K real-world and synthetic image/claim pairs. The mix of
-real and synthetic data makes this dataset unique and ideal for benchmarking
-detection methods. We compare fine-tuned and prompted Large Language Models
-(LLMs). We find that well-configured lightweight text-based encoders perform
-comparably to multimodal models but the first only focus on identifying
-non-claim-like content. Multimodal LLMs can be more accurate but come at a
-significant computational cost, making them impractical for large-scale
-applications. When faced with synthetic data, multimodal models perform more
-robustly
-
-摘要：錯誤訊息可以透過事實查核來反駁，但這個過程既昂貴又緩慢。辨識需要查核的說法是第一步，自動化可以幫助擴大事實查核人員的努力。然而，偵測方法會在處理 1) 多模態、2) 來自不同領域，以及 3) 合成的內容時遇到困難。我們引進 HintsOfTruth，一個用於多模態查核價值偵測的公開資料集，其中包含 27K 個真實世界和合成的影像/說法配對。真實和合成資料的組合讓這個資料集獨一無二，非常適合用於基準偵測方法。我們比較微調和提示的大語言模型 (LLM)。我們發現，設定良好的輕量級文字編碼器的表現與多模態模型相當，但前者只專注於辨識非說法類型的內容。多模態 LLM 可能更準確，但需要大量的運算成本，這讓它們不適用於大規模的應用。在面對合成資料時，多模態模型的表現更強健。
-
-##### **Language Models Can See Better: Visual Contrastive Decoding For LLM Multimodal Reasoning**
-2502.11751v1 by Yuqi Pang, Bowen Yang, Haoqin Tu, Yun Cao, Zeyu Zhang
-
-Although Large Language Models (LLMs) excel in reasoning and generation for
-language tasks, they are not specifically designed for multimodal challenges.
-Training Multimodal Large Language Models (MLLMs), however, is
-resource-intensive and constrained by various training limitations. In this
-paper, we propose the Modular-based Visual Contrastive Decoding (MVCD)
-framework to move this obstacle. Our framework leverages LLMs' In-Context
-Learning (ICL) capability and the proposed visual contrastive-example decoding
-(CED), specifically tailored for this framework, without requiring any
-additional training. By converting visual signals into text and focusing on
-contrastive output distributions during decoding, we can highlight the new
-information introduced by contextual examples, explore their connections, and
-avoid over-reliance on prior encoded knowledge. MVCD enhances LLMs' visual
-perception to make it see and reason over the input visuals. To demonstrate
-MVCD's effectiveness, we conduct experiments with four LLMs across five
-question answering datasets. Our results not only show consistent improvement
-in model accuracy but well explain the effective components inside our decoding
-strategy. Our code will be available at https://github.com/Pbhgit/MVCD.
-
-摘要：儘管大型語言模型 (LLM) 在語言任務的推理和生成方面表現優異，但它們並非專門針對多模態挑戰而設計。然而，訓練多模態大型語言模型 (MLLM) 十分耗費資源，並受到各種訓練限制。在本文中，我們提出基於模組的視覺對比解碼 (MVCD) 架構來克服這個障礙。我們的架構利用 LLM 的情境學習 (ICL) 能力和專門為此架構量身打造的視覺對比範例解碼 (CED)，而無需任何額外訓練。透過將視覺信號轉換為文字，並在解碼過程中專注於對比輸出分佈，我們可以突顯情境範例引入的新資訊，探索它們的關聯性，並避免過度依賴先前編碼的知識。MVCD 增強了 LLM 的視覺感知能力，使其能夠觀察並推論輸入視覺效果。為了證明 MVCD 的有效性，我們使用四個 LLM 在五個問答資料集上進行實驗。我們的結果不僅顯示模型準確度持續提升，還能清楚說明我們的解碼策略中的有效組成部分。我們的程式碼將在 https://github.com/Pbhgit/MVCD 公開。
-
-##### **SQL-o1: A Self-Reward Heuristic Dynamic Search Method for Text-to-SQL**
-2502.11741v1 by Shuai Lyu, Haoran Luo, Zhonghong Ou, Yifan Zhu, Xiaoran Shang, Yang Qin, Meina Song
-
-The Text-to-SQL(Text2SQL) task aims to convert natural language queries into
-executable SQL queries. Thanks to the application of large language models
-(LLMs), significant progress has been made in this field. However, challenges
-such as model scalability, limited generation space, and coherence issues in
-SQL generation still persist. To address these issues, we propose SQL-o1, a
-Self-Reward-based heuristic search method designed to enhance the reasoning
-ability of LLMs in SQL query generation. SQL-o1 combines Monte Carlo Tree
-Search (MCTS) for heuristic process-level search and constructs a Schema-Aware
-dataset to help the model better understand database schemas. Extensive
-experiments on the Bird and Spider datasets demonstrate that SQL-o1 improves
-execution accuracy by 10.8\% on the complex Bird dataset compared to the latest
-baseline methods, even outperforming GPT-4-based approaches. Additionally,
-SQL-o1 excels in few-shot learning scenarios and shows strong cross-model
-transferability. Our code is publicly available
-at:https://github.com/ShuaiLyu0110/SQL-o1.
-
-摘要：文本转 SQL（Text2SQL）任务旨在将自然语言查询转换为可执行的 SQL 查询。得益于大型语言模型（LLM）的应用，该领域取得了显著进展。然而，模型可扩展性、生成空间受限和 SQL 生成的连贯性问题等挑战仍然存在。为了解决这些问题，我们提出了 SQL-o1，这是一种基于自我奖励的启发式搜索方法，旨在增强 LLM 在 SQL 查询生成中的推理能力。SQL-o1 结合了蒙特卡罗树搜索（MCTS）用于启发式过程级搜索，并构建了一个模式感知数据集，以帮助模型更好地理解数据库模式。在 Bird 和 Spider 数据集上的大量实验表明，与最新的基准方法相比，SQL-o1 将复杂 Bird 数据集上的执行准确率提高了 10.8%，甚至优于基于 GPT-4 的方法。此外，SQL-o1 在少样本学习场景中表现出色，并显示出强大的跨模型可迁移性。我们的代码已公开发布在：https://github.com/ShuaiLyu0110/SQL-o1。
-
 
 ### Knowledge Graphs
 |Publish Date|Title|Authors|Homepage|Code|
 | :---: | :---: | :---: | :---: | :---: |
+|**2025-02-18**|**Learning to Defer for Causal Discovery with Imperfect Experts**|Oscar Clivio et.al.|[2502.13132v1](http://arxiv.org/abs/2502.13132v1)|null|
+|**2025-02-18**|**Agentic Deep Graph Reasoning Yields Self-Organizing Knowledge Networks**|Markus J. Buehler et.al.|[2502.13025v1](http://arxiv.org/abs/2502.13025v1)|null|
+|**2025-02-18**|**Adaptive Knowledge Graphs Enhance Medical Question Answering: Bridging the Gap Between LLMs and Evolving Medical Knowledge**|Mohammad Reza Rezaei et.al.|[2502.13010v1](http://arxiv.org/abs/2502.13010v1)|null|
+|**2025-02-18**|**R2-KG: General-Purpose Dual-Agent Framework for Reliable Reasoning on Knowledge Graphs**|Sumin Jo et.al.|[2502.12767v1](http://arxiv.org/abs/2502.12767v1)|null|
+|**2025-02-18**|**Perovskite-LLM: Knowledge-Enhanced Large Language Models for Perovskite Solar Cell Research**|Xiang Liu et.al.|[2502.12669v1](http://arxiv.org/abs/2502.12669v1)|null|
+|**2025-02-18**|**G-Refer: Graph Retrieval-Augmented Large Language Model for Explainable Recommendation**|Yuhan Li et.al.|[2502.12586v1](http://arxiv.org/abs/2502.12586v1)|null|
 |**2025-02-17**|**A-MEM: Agentic Memory for LLM Agents**|Wujiang Xu et.al.|[2502.12110v1](http://arxiv.org/abs/2502.12110v1)|null|
 |**2025-02-17**|**KnowPath: Knowledge-enhanced Reasoning via LLM-generated Inference Paths over Knowledge Graphs**|Qi Zhao et.al.|[2502.12029v1](http://arxiv.org/abs/2502.12029v1)|null|
 |**2025-02-17**|**Atom of Thoughts for Markov LLM Test-Time Scaling**|Fengwei Teng et.al.|[2502.12018v1](http://arxiv.org/abs/2502.12018v1)|null|
@@ -7841,7 +5367,7 @@ at:https://github.com/ShuaiLyu0110/SQL-o1.
 |**2025-02-06**|**Division-of-Thoughts: Harnessing Hybrid Language Model Synergy for Efficient On-Device Agents**|Chenyang Shao et.al.|[2502.04392v1](http://arxiv.org/abs/2502.04392v1)|null|
 |**2025-02-06**|**Boosting Knowledge Graph-based Recommendations through Confidence-Aware Augmentation with Large Language Models**|Rui Cai et.al.|[2502.03715v1](http://arxiv.org/abs/2502.03715v1)|null|
 |**2025-02-05**|**A Schema-Guided Reason-while-Retrieve framework for Reasoning on Scene Graphs with Large-Language-Models (LLMs)**|Yiye Chen et.al.|[2502.03450v1](http://arxiv.org/abs/2502.03450v1)|null|
-|**2025-02-05**|**SymAgent: A Neural-Symbolic Self-Learning Agent Framework for Complex Reasoning over Knowledge Graphs**|Ben Liu et.al.|[2502.03283v1](http://arxiv.org/abs/2502.03283v1)|null|
+|**2025-02-05**|**SymAgent: A Neural-Symbolic Self-Learning Agent Framework for Complex Reasoning over Knowledge Graphs**|Ben Liu et.al.|[2502.03283v2](http://arxiv.org/abs/2502.03283v2)|null|
 |**2025-02-05**|**Analyze Feature Flow to Enhance Interpretation and Steering in Language Models**|Daniil Laptev et.al.|[2502.03032v2](http://arxiv.org/abs/2502.03032v2)|null|
 |**2025-02-05**|**A Benchmark for the Detection of Metalinguistic Disagreements between LLMs and Knowledge Graphs**|Bradley P. Allen et.al.|[2502.02896v1](http://arxiv.org/abs/2502.02896v1)|null|
 |**2025-02-05**|**Mol-LLM: Generalist Molecular LLM with Improved Graph Utilization**|Chanhui Lee et.al.|[2502.02810v1](http://arxiv.org/abs/2502.02810v1)|null|
@@ -7871,14 +5397,163 @@ at:https://github.com/ShuaiLyu0110/SQL-o1.
 |**2025-01-29**|**Query-Aware Learnable Graph Pooling Tokens as Prompt for Large Language Models**|Wooyoung Kim et.al.|[2501.17549v1](http://arxiv.org/abs/2501.17549v1)|null|
 |**2025-01-29**|**General Scene Adaptation for Vision-and-Language Navigation**|Haodong Hong et.al.|[2501.17403v1](http://arxiv.org/abs/2501.17403v1)|[link](https://github.com/honghd16/gsa-vln)|
 |**2025-01-28**|**Comprehensive Evaluation for a Large Scale Knowledge Graph Question Answering Service**|Saloni Potdar et.al.|[2501.17270v1](http://arxiv.org/abs/2501.17270v1)|null|
-|**2025-01-28**|**FactCG: Enhancing Fact Checkers with Graph-Based Multi-Hop Data**|Deren Lei et.al.|[2501.17144v1](http://arxiv.org/abs/2501.17144v1)|[link](https://github.com/derenlei/factcg)|
-|**2025-01-28**|**LLM-AutoDiff: Auto-Differentiate Any LLM Workflow**|Li Yin et.al.|[2501.16673v2](http://arxiv.org/abs/2501.16673v2)|[link](https://github.com/sylphai-inc/adalflow)|
-|**2025-01-27**|**360Brew: A Decoder-only Foundation Model for Personalized Ranking and Recommendation**|Hamed Firooz et.al.|[2501.16450v3](http://arxiv.org/abs/2501.16450v3)|null|
-|**2025-01-27**|**Raiders of the Lost Dependency: Fixing Dependency Conflicts in Python using LLMs**|Antony Bartlett et.al.|[2501.16191v1](http://arxiv.org/abs/2501.16191v1)|null|
-|**2025-01-27**|**Harnessing Diverse Perspectives: A Multi-Agent Framework for Enhanced Error Detection in Knowledge Graphs**|Yu Li et.al.|[2501.15791v1](http://arxiv.org/abs/2501.15791v1)|[link](https://github.com/kse-eleven/makged)|
-|**2025-01-27**|**Automatic Feedback Generation for Short Answer Questions using Answer Diagnostic Graphs**|Momoka Furuhashi et.al.|[2501.15777v1](http://arxiv.org/abs/2501.15777v1)|null|
 
 #### Abstracts
+##### **Learning to Defer for Causal Discovery with Imperfect Experts**
+2502.13132v1 by Oscar Clivio, Divyat Mahajan, Perouz Taslakian, Sara Magliacane, Ioannis Mitliagkas, Valentina Zantedeschi, Alexandre Drouin
+
+Integrating expert knowledge, e.g. from large language models, into causal
+discovery algorithms can be challenging when the knowledge is not guaranteed to
+be correct. Expert recommendations may contradict data-driven results, and
+their reliability can vary significantly depending on the domain or specific
+query. Existing methods based on soft constraints or inconsistencies in
+predicted causal relationships fail to account for these variations in
+expertise. To remedy this, we propose L2D-CD, a method for gauging the
+correctness of expert recommendations and optimally combining them with
+data-driven causal discovery results. By adapting learning-to-defer (L2D)
+algorithms for pairwise causal discovery (CD), we learn a deferral function
+that selects whether to rely on classical causal discovery methods using
+numerical data or expert recommendations based on textual meta-data. We
+evaluate L2D-CD on the canonical Tübingen pairs dataset and demonstrate its
+superior performance compared to both the causal discovery method and the
+expert used in isolation. Moreover, our approach identifies domains where the
+expert's performance is strong or weak. Finally, we outline a strategy for
+generalizing this approach to causal discovery on graphs with more than two
+variables, paving the way for further research in this area.
+
+摘要：整合專家知識，例如從大型語言模型中整合到因果發現演算法中，當知識無法保證正確時會很有挑戰性。專家建議可能會與資料驅動的結果相矛盾，而且他們的可靠性可能會根據領域或特定查詢而有顯著差異。現有的基於軟約束或預測因果關係中不一致的方法無法說明專業知識中的這些變化。為了補救這一點，我們提出了 L2D-CD，一種用於評估專家建議的正確性並將其與資料驅動的因果發現結果最佳結合的方法。透過調整學習延遲 (L2D) 演算法以進行成對因果發現 (CD)，我們學習了一個延遲函數，用於選擇依賴使用數值資料的傳統因果發現方法或基於文字元資料的專家建議。我們在經典的 Tübingen 對資料集上評估 L2D-CD，並證明其與單獨使用的因果發現方法和專家相比具有優越的效能。此外，我們的做法識別出專家表現強或弱的領域。最後，我們概述了一種將此方法推廣到具有兩個以上變數的圖表上進行因果發現的策略，為此領域的進一步研究鋪平了道路。
+
+##### **Agentic Deep Graph Reasoning Yields Self-Organizing Knowledge Networks**
+2502.13025v1 by Markus J. Buehler
+
+We present an agentic, autonomous graph expansion framework that iteratively
+structures and refines knowledge in situ. Unlike conventional knowledge graph
+construction methods relying on static extraction or single-pass learning, our
+approach couples a reasoning-native large language model with a continually
+updated graph representation. At each step, the system actively generates new
+concepts and relationships, merges them into a global graph, and formulates
+subsequent prompts based on its evolving structure. Through this
+feedback-driven loop, the model organizes information into a scale-free network
+characterized by hub formation, stable modularity, and bridging nodes that link
+disparate knowledge clusters. Over hundreds of iterations, new nodes and edges
+continue to appear without saturating, while centrality measures and shortest
+path distributions evolve to yield increasingly distributed connectivity. Our
+analysis reveals emergent patterns, such as the rise of highly connected 'hub'
+concepts and the shifting influence of 'bridge' nodes, indicating that agentic,
+self-reinforcing graph construction can yield open-ended, coherent knowledge
+structures. Applied to materials design problems, we present compositional
+reasoning experiments by extracting node-specific and synergy-level principles
+to foster genuinely novel knowledge synthesis, yielding cross-domain ideas that
+transcend rote summarization and strengthen the framework's potential for
+open-ended scientific discovery. We discuss other applications in scientific
+discovery and outline future directions for enhancing scalability and
+interpretability.
+
+摘要：我們提出一個能動的、自主的圖形擴展框架，它反覆地建構和精煉原位知識。與依賴靜態提取或單次學習的傳統知識圖形建構方法不同，我們的做法將一個推理原生的大語言模型與一個持續更新的圖形表示結合起來。在每一步中，系統主動產生新的概念和關係，將它們合併到一個全域圖形中，並根據其不斷演化的結構制定後續提示。透過這個回饋驅動的迴圈，模型將資訊組織成一個無標度網路，其特徵是樞紐形成、穩定的模組化以及連結不同知識群集的橋接節點。在數百次反覆運算中，新的節點和邊緣會持續出現，而不會飽和，同時中心性測量和最短路徑分佈會演化為產生越來越分散的連通性。我們的分析揭示了新興模式，例如高度連接的「樞紐」概念的興起和「橋樑」節點影響力的轉移，這表明能動的、自我強化的圖形建構可以產生開放式、連貫的知識結構。應用於材料設計問題，我們提出組合推理實驗，透過提取特定於節點的原則和協同效應層級原則，以促進真正新穎的知識綜合，產生超越死背式摘要並強化框架在開放式科學發現中潛力的跨領域想法。我們討論了在科學發現中的其他應用，並概述了增強可擴充性和可解釋性的未來方向。
+
+##### **Adaptive Knowledge Graphs Enhance Medical Question Answering: Bridging the Gap Between LLMs and Evolving Medical Knowledge**
+2502.13010v1 by Mohammad Reza Rezaei, Reza Saadati Fard, Jayson Parker, Rahul G. Krishnan, Milad Lankarany
+
+Large Language Models (LLMs) have significantly advanced medical
+question-answering by leveraging extensive clinical data and medical
+literature. However, the rapid evolution of medical knowledge and the
+labor-intensive process of manually updating domain-specific resources pose
+challenges to the reliability of these systems. To address this, we introduce
+Adaptive Medical Graph-RAG (AMG-RAG), a comprehensive framework that automates
+the construction and continuous updating of medical knowledge graphs,
+integrates reasoning, and retrieves current external evidence, such as PubMed
+and WikiSearch. By dynamically linking new findings and complex medical
+concepts, AMG-RAG not only improves accuracy but also enhances interpretability
+in medical queries.
+  Evaluations on the MEDQA and MEDMCQA benchmarks demonstrate the effectiveness
+of AMG-RAG, achieving an F1 score of 74.1 percent on MEDQA and an accuracy of
+66.34 percent on MEDMCQA, outperforming both comparable models and those 10 to
+100 times larger. Notably, these improvements are achieved without increasing
+computational overhead, highlighting the critical role of automated knowledge
+graph generation and external evidence retrieval in delivering up-to-date,
+trustworthy medical insights.
+
+摘要：大型語言模型 (LLM) 透過利用廣泛的臨床資料和醫學文獻，大幅提升了醫療問題解答的進步。然而，醫療知識的快速演進和手動更新特定領域資源的繁複程序，對這些系統的可靠性構成挑戰。為了解決這個問題，我們引入了適應性醫療圖表 RAG (AMG-RAG)，這是一個自動化建構和持續更新醫療知識圖表的綜合架構，整合推理並擷取 PubMed 和 WikiSearch 等最新的外部證據。透過動態連結新的發現和複雜的醫療概念，AMG-RAG 不僅提升了準確性，也增強了醫療查詢的可解釋性。在 MEDQA 和 MEDMCQA 基準上的評量證明了 AMG-RAG 的有效性，在 MEDQA 上達到了 74.1% 的 F1 分數，在 MEDMCQA 上達到了 66.34% 的準確度，優於其他同類模型以及那些大 10 到 100 倍的模型。值得注意的是，這些改進是在不增加運算負擔的情況下實現的，突顯了自動化知識圖表生成和外部證據擷取在提供最新、可信賴的醫療見解中扮演的重要角色。
+
+##### **R2-KG: General-Purpose Dual-Agent Framework for Reliable Reasoning on Knowledge Graphs**
+2502.12767v1 by Sumin Jo, Junseong Choi, Jiho Kim, Edward Choi
+
+Recent studies have combined Large Language Models (LLMs) with Knowledge
+Graphs (KGs) to enhance reasoning, improving inference accuracy without
+additional training while mitigating hallucination. However, existing
+frameworks are often rigid, struggling to adapt to KG or task changes. They
+also rely heavily on powerful LLMs for reliable (i.e., trustworthy) reasoning.
+To address this, we introduce R2-KG, a plug-and-play, dual-agent framework that
+separates reasoning into two roles: an Operator (a low-capacity LLM) that
+gathers evidence and a Supervisor (a high-capacity LLM) that makes final
+judgments. This design is cost-efficient for LLM inference while still
+maintaining strong reasoning accuracy. Additionally, R2-KG employs an
+Abstention mechanism, generating answers only when sufficient evidence is
+collected from KG, which significantly enhances reliability. Experiments across
+multiple KG-based reasoning tasks show that R2-KG consistently outperforms
+baselines in both accuracy and reliability, regardless of the inherent
+capability of LLMs used as the Operator. Further experiments reveal that the
+single-agent version of R2-KG, equipped with a strict self-consistency
+strategy, achieves significantly higher-than-baseline reliability while
+reducing inference cost. However, it also leads to a higher abstention rate in
+complex KGs. Our findings establish R2-KG as a flexible and cost-effective
+solution for KG-based reasoning. It reduces reliance on high-capacity LLMs
+while ensuring trustworthy inference.
+
+摘要：最近的研究结合了大型语言模型 (LLM) 与知识图谱 (KG) 以增强推理，在不额外训练的情况下提高推理准确性，同时减轻幻觉。然而，现有的框架通常很僵化，难以适应知识图谱或任务的变化。它们还严重依赖强大的 LLM 来进行可靠（即值得信赖）的推理。为了解决这个问题，我们引入了 R2-KG，这是一个即插即用、双代理框架，它将推理分为两个角色：一个收集证据的操作员（低容量 LLM）和一个做出最终判断的监督员（高容量 LLM）。这种设计在 LLM 推理方面具有成本效益，同时仍保持强大的推理准确性。此外，R2-KG 采用弃权机制，仅在从知识图谱收集到足够证据时才生成答案，这显著提高了可靠性。跨多个基于知识图谱的推理任务的实验表明，R2-KG 在准确性和可靠性方面始终优于基线，而与用作操作员的 LLM 的固有能力无关。进一步的实验表明，R2-KG 的单代理版本配备了严格的自一致性策略，实现了明显高于基线的可靠性，同时降低了推理成本。然而，它也导致了复杂知识图谱中更高的弃权率。我们的发现将 R2-KG 确立为一种灵活且经济高效的基于知识图谱的推理解决方案。它减少了对高容量 LLM 的依赖，同时确保了可信的推理。
+
+##### **Perovskite-LLM: Knowledge-Enhanced Large Language Models for Perovskite Solar Cell Research**
+2502.12669v1 by Xiang Liu, Penglei Sun, Shuyan Chen, Longhan Zhang, Peijie Dong, Huajie You, Yongqi Zhang, Chang Yan, Xiaowen Chu, Tong-yi Zhang
+
+The rapid advancement of perovskite solar cells (PSCs) has led to an
+exponential growth in research publications, creating an urgent need for
+efficient knowledge management and reasoning systems in this domain. We present
+a comprehensive knowledge-enhanced system for PSCs that integrates three key
+components. First, we develop Perovskite-KG, a domain-specific knowledge graph
+constructed from 1,517 research papers, containing 23,789 entities and 22,272
+relationships. Second, we create two complementary datasets: Perovskite-Chat,
+comprising 55,101 high-quality question-answer pairs generated through a novel
+multi-agent framework, and Perovskite-Reasoning, containing 2,217 carefully
+curated materials science problems. Third, we introduce two specialized large
+language models: Perovskite-Chat-LLM for domain-specific knowledge assistance
+and Perovskite-Reasoning-LLM for scientific reasoning tasks. Experimental
+results demonstrate that our system significantly outperforms existing models
+in both domain-specific knowledge retrieval and scientific reasoning tasks,
+providing researchers with effective tools for literature review, experimental
+design, and complex problem-solving in PSC research.
+
+摘要：由於 perovskite 太陽能電池 (PSC) 快速進展，導致研究出版物呈指數成長，迫切需要在這領域建立有效的知識管理和推理系統。我們提出一個結合三項關鍵元件的 PSC 全面知識增強系統。首先，我們開發出 Perovskite-KG，一個由 1,517 篇研究論文建構而成、包含 23,789 個實體和 22,272 個關係的領域特定知識圖譜。其次，我們建立兩個互補的資料集：Perovskite-Chat，包含透過一個新穎的多代理架構產生 55,101 個高品質問答配對；以及 Perovskite-Reasoning，包含 2,217 個仔細策展的材料科學問題。第三，我們推出兩個專門化大型語言模型：針對領域特定知識協助的 Perovskite-Chat-LLM，以及針對科學推理任務的 Perovskite-Reasoning-LLM。實驗結果顯示，我們的系統在領域特定知識擷取和科學推理任務上都明顯優於現有模型，為研究人員提供有效的工具，用於 PSC 研究中的文獻回顧、實驗設計和複雜問題解決。
+
+##### **G-Refer: Graph Retrieval-Augmented Large Language Model for Explainable Recommendation**
+2502.12586v1 by Yuhan Li, Xinni Zhang, Linhao Luo, Heng Chang, Yuxiang Ren, Irwin King, Jia Li
+
+Explainable recommendation has demonstrated significant advantages in
+informing users about the logic behind recommendations, thereby increasing
+system transparency, effectiveness, and trustworthiness. To provide
+personalized and interpretable explanations, existing works often combine the
+generation capabilities of large language models (LLMs) with collaborative
+filtering (CF) information. CF information extracted from the user-item
+interaction graph captures the user behaviors and preferences, which is crucial
+for providing informative explanations. However, due to the complexity of graph
+structure, effectively extracting the CF information from graphs still remains
+a challenge. Moreover, existing methods often struggle with the integration of
+extracted CF information with LLMs due to its implicit representation and the
+modality gap between graph structures and natural language explanations. To
+address these challenges, we propose G-Refer, a framework using graph
+retrieval-augmented large language models (LLMs) for explainable
+recommendation. Specifically, we first employ a hybrid graph retrieval
+mechanism to retrieve explicit CF signals from both structural and semantic
+perspectives. The retrieved CF information is explicitly formulated as
+human-understandable text by the proposed graph translation and accounts for
+the explanations generated by LLMs. To bridge the modality gap, we introduce
+knowledge pruning and retrieval-augmented fine-tuning to enhance the ability of
+LLMs to process and utilize the retrieved CF information to generate
+explanations. Extensive experiments show that G-Refer achieves superior
+performance compared with existing methods in both explainability and
+stability. Code and data are available at https://github.com/Yuhan1i/G-Refer.
+
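+Summary: Explainable recommendation has shown significant advantages in informing users about the logic behind recommendations, thereby increasing system transparency, effectiveness, and trustworthiness. To provide personalized and interpretable explanations, existing work often combines the generation capabilities of large language models (LLMs) with collaborative filtering (CF) information. CF information extracted from the user-item interaction graph captures user behaviors and preferences, which is crucial for providing informative explanations. However, because of the complexity of graph structure, effectively extracting CF information from graphs remains a challenge. Moreover, existing methods often struggle to integrate the extracted CF information with LLMs, owing to its implicit representation and the modality gap between graph structures and natural language explanations. To address these challenges, we propose G-Refer, a framework that uses graph retrieval-augmented LLMs for explainable recommendation. Specifically, we first employ a hybrid graph retrieval mechanism to retrieve explicit CF signals from both structural and semantic perspectives. The retrieved CF information is explicitly formulated as human-understandable text by the proposed graph translation and accounts for the explanations generated by LLMs. To bridge the modality gap, we introduce knowledge pruning and retrieval-augmented fine-tuning to enhance the ability of LLMs to process and use the retrieved CF information when generating explanations. Extensive experiments show that G-Refer achieves superior performance compared with existing methods in both explainability and stability. Code and data are available at https://github.com/Yuhan1i/G-Refer.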
+
 ##### **A-MEM: Agentic Memory for LLM Agents**
 2502.12110v1 by Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, Yongfeng Zhang
 
@@ -9399,7 +7074,7 @@ absence of agent-level demonstrations. Project code will be released.
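 Summary: Scene graphs have emerged as a structured and serializable environment representation for grounded spatial reasoning with large language models (LLMs). In this work, we propose SG-RwR, a schema-guided retrieval-and-reasoning framework for reasoning and planning with scene graphs. Our approach employs two cooperative, code-writing LLM agents: (1) a Reasoner for task planning and information query generation, and (2) a Retriever for extracting the corresponding graph information based on the queries. The two agents collaborate iteratively, enabling sequential reasoning and adaptive attention to graph information. Unlike prior work, both agents are prompted only with the scene graph schema rather than the full graph data, which reduces hallucination by limiting input tokens and drives the Reasoner to generate its reasoning trace abstractly. Following the trace, the Retriever programmatically queries the scene graph data based on its understanding of the schema, allowing dynamic and global attention on the graph and improving the alignment between reasoning and retrieval. Through experiments in multiple simulation environments, we show that our framework surpasses existing LLM-based approaches on numerical Q&A and planning tasks, and that it can benefit from task-level few-shot examples, even in the absence of agent-level demonstrations. Project code will be released.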
 
 ##### **SymAgent: A Neural-Symbolic Self-Learning Agent Framework for Complex Reasoning over Knowledge Graphs**
-2502.03283v1 by Ben Liu, Jihai Zhang, Fangquan Lin, Cheng Yang, Min Peng, Wotao Yin
+2502.03283v2 by Ben Liu, Jihai Zhang, Fangquan Lin, Cheng Yang, Min Peng, Wotao Yin
 
 Recent advancements have highlighted that Large Language Models (LLMs) are
 prone to hallucinations when solving complex reasoning problems, leading to
@@ -9426,7 +7101,7 @@ better or comparable performance compared to various strong baselines. Further
 analysis reveals that our agent can identify missing triples, facilitating
 automatic KG updates.
 
-摘要：<paragraph>最近的研究表明，大型语言模型 (LLM) 在解决复杂的推理问题时容易出现幻觉，从而导致错误的结果。为了解决这个问题，研究人员结合了知识图谱 (KG) 来提高 LLM 的推理能力。然而，现有方法面临两个局限性：1) 它们通常假设问题的答案都包含在 KG 中，忽略了 KG 不完整的问题，2) 它们将 KG 视为一个静态存储库，而忽略了 KG 中固有的隐式逻辑推理结构。在本文中，我们介绍了 SymAgent，这是一个创新的神经符号代理框架，可以在 KG 和 LLM 之间实现协作增强。我们将 KG 概念化为动态环境，并将复杂的推理任务转化为一个多步骤的交互过程，使 KG 能够深入参与推理过程。SymAgent 由两个模块组成：Agent-Planner 和 Agent-Executor。Agent-Planner 利用 LLM 的归纳推理能力从 KG 中提取符号规则，指导高效的问题分解。Agent-Executor 自主调用预定义的动作工具来整合来自 KG 和外部文档的信息，解决 KG 不完整的问题。此外，我们设计了一个自学习框架，包括在线探索和离线迭代策略更新阶段，使代理能够自动合成推理轨迹并提高性能。实验结果表明，具有弱 LLM 主干的 SymAgent（即 7B 系列）与各种强大的基线相比，产生了更好或相当的性能。进一步的分析表明，我们的代理可以识别缺失的三元组，促进自动 KG 更新。</paragraph>
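+Summary: Recent advancements have highlighted that large language models (LLMs) are prone to hallucinations when solving complex reasoning problems, leading to erroneous results. To tackle this issue, researchers incorporate knowledge graphs (KGs) to improve the reasoning ability of LLMs. However, existing methods face two limitations: 1) they typically assume that all answers to the questions are contained in the KG, neglecting the incompleteness of KGs, and 2) they treat the KG as a static repository and overlook the implicit logical reasoning structures inherent in KGs. In this paper, we introduce SymAgent, an innovative neural-symbolic agent framework that achieves collaborative augmentation between KGs and LLMs. We conceptualize KGs as dynamic environments and transform complex reasoning tasks into a multi-step interactive process, enabling KGs to participate deeply in the reasoning process. SymAgent consists of two modules: Agent-Planner and Agent-Executor. The Agent-Planner leverages the LLM's inductive reasoning capability to extract symbolic rules from KGs, guiding efficient question decomposition. The Agent-Executor autonomously invokes predefined action tools to integrate information from KGs and external documents, addressing the issue of KG incompleteness. Furthermore, we design a self-learning framework comprising online exploration and offline iterative policy updating phases, enabling the agent to automatically synthesize reasoning trajectories and improve performance. Experimental results demonstrate that SymAgent with weak LLM backbones (e.g., the 7B series) yields better or comparable performance compared to various strong baselines. Further analysis reveals that our agent can identify missing triples, facilitating automatic KG updates.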
 
 ##### **Analyze Feature Flow to Enhance Interpretation and Steering in Language Models**
 2502.03032v2 by Daniil Laptev, Nikita Balagansky, Yaroslav Aksenov, Daniil Gavrilov
@@ -10122,209 +7797,2460 @@ improve a real-world KGQA system.
 
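 Summary: Knowledge graph question answering (KGQA) systems answer factual questions based on data in a knowledge graph. KGQA systems are complex because the system has to understand the relations and entities in knowledge-seeking natural language queries and map them to structured queries against the knowledge graph in order to answer them. In this paper, we introduce Chronos, a comprehensive evaluation framework for KGQA at industry scale. It is designed to evaluate such a multi-component system comprehensively, focusing on (1) end-to-end and component-level metrics, (2) scalability to diverse datasets, and (3) a scalable approach to measuring system performance prior to release. We discuss the unique challenges associated with evaluating KGQA systems at industry scale, review the design of Chronos, and describe how it addresses these challenges. We show how it provides a base for data-driven decisions and discuss the challenges of using it to measure and improve a real-world KGQA system.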
 
-##### **FactCG: Enhancing Fact Checkers with Graph-Based Multi-Hop Data**
-2501.17144v1 by Deren Lei, Yaxi Li, Siyao Li, Mengya Hu, Rui Xu, Ken Archer, Mingyu Wang, Emily Ching, Alex Deng
-
-Prior research on training grounded factuality classification models to
-detect hallucinations in large language models (LLMs) has relied on public
-natural language inference (NLI) data and synthetic data. However, conventional
-NLI datasets are not well-suited for document-level reasoning, which is
-critical for detecting LLM hallucinations. Recent approaches to document-level
-synthetic data generation involve iteratively removing sentences from documents
-and annotating factuality using LLM-based prompts. While effective, this method
-is computationally expensive for long documents and limited by the LLM's
-capabilities. In this work, we analyze the differences between existing
-synthetic training data used in state-of-the-art models and real LLM output
-claims. Based on our findings, we propose a novel approach for synthetic data
-generation, CG2C, that leverages multi-hop reasoning on context graphs
-extracted from documents. Our fact checker model, FactCG, demonstrates improved
-performance with more connected reasoning, using the same backbone models.
-Experiments show it even outperforms GPT-4-o on the LLM-Aggrefact benchmark
-with much smaller model size.
-
-摘要：先前的研究訓練了基於事實的分類模型，以偵測大型語言模型 (LLM) 中的幻覺，依賴於公開的自然語言推論 (NLI) 資料和合成資料。然而，傳統的 NLI 資料集並不適合文件層級的推理，這對於偵測 LLM 的幻覺至關重要。最近的文件層級合成資料生成方法涉及從文件中反覆移除句子，並使用基於 LLM 的提示註解事實。雖然有效，但此方法對於長文件來說在運算上很昂貴，且受限於 LLM 的能力。在這項工作中，我們分析了現有合成訓練資料與最先進模型中使用的真實 LLM 輸出宣告之間的差異。根據我們的研究結果，我們提出了一個用於合成資料生成的創新方法 CG2C，它利用從文件中提取的內容圖表進行多跳推理。我們的查核模型 FactCG 使用相同的骨幹模型，展示了在更多連結的推理下改進的效能。實驗表明，它甚至在 LLM-Aggrefact 基準上優於 GPT-4-o，且模型大小小得多。
-
-##### **LLM-AutoDiff: Auto-Differentiate Any LLM Workflow**
-2501.16673v2 by Li Yin, Zhangyang Wang
-
-Large Language Models (LLMs) have reshaped natural language processing,
-powering applications from multi-hop retrieval and question answering to
-autonomous agent workflows. Yet, prompt engineering -- the task of crafting
-textual inputs to effectively direct LLMs -- remains difficult and
-labor-intensive, particularly for complex pipelines that combine multiple LLM
-calls with functional operations like retrieval and data formatting. We
-introduce LLM-AutoDiff: a novel framework for Automatic Prompt Engineering
-(APE) that extends textual gradient-based methods (such as Text-Grad) to
-multi-component, potentially cyclic LLM architectures. Implemented within the
-AdalFlow library, LLM-AutoDiff treats each textual input as a trainable
-parameter and uses a frozen backward engine LLM to generate feedback-akin to
-textual gradients -- that guide iterative prompt updates. Unlike prior
-single-node approaches, LLM-AutoDiff inherently accommodates functional nodes,
-preserves time-sequential behavior in repeated calls (e.g., multi-hop loops),
-and combats the "lost-in-the-middle" problem by isolating distinct sub-prompts
-(instructions, formats, or few-shot examples). It further boosts training
-efficiency by focusing on error-prone samples through selective gradient
-computation. Across diverse tasks, including single-step classification,
-multi-hop retrieval-based QA, and agent-driven pipelines, LLM-AutoDiff
-consistently outperforms existing textual gradient baselines in both accuracy
-and training cost. By unifying prompt optimization through a graph-centric
-lens, LLM-AutoDiff offers a powerful new paradigm for scaling and automating
-LLM workflows - mirroring the transformative role that automatic
-differentiation libraries have long played in neural network research.
-
-摘要：大型語言模型 (LLM) 已重塑自然語言處理，
-為從多跳檢索和問答到
-自主代理工作流程的應用提供動力。然而，提示工程 -- 編寫
-文本輸入以有效指導 LLM 的任務 -- 仍然困難且
-勞動密集，特別是對於將多個 LLM
-呼叫與檢索和數據格式化等功能操作相結合的複雜管道。我們
-介紹 LLM-AutoDiff：一個用於自動提示工程 (APE) 的新框架，它將基於文本梯度的
-方法（例如 Text-Grad）擴展到多組件、潛在循環 LLM 架構中。在
-AdalFlow 庫中實施，LLM-AutoDiff 將每個文本輸入視為一個可訓練
-參數，並使用凍結的後向引擎 LLM 生成反饋——類似於
-文本梯度——指導迭代提示更新。與先前的
-單節點方法不同，LLM-AutoDiff 本質上適應功能節點，
-在重複呼叫（例如，多跳循環）中保留時間順序行為，
-並通過隔離不同的子提示（說明、格式或少數鏡頭示例）來解決“迷失在中間”問題。它進一步提高訓練
-效率，通過選擇性梯度
-計算專注於容易出錯的樣本。在包括單步分類、
-多跳基於檢索的問答和代理驅動管道在內的各種任務中，LLM-AutoDiff
-在準確性和訓練成本方面始終優於現有的文本梯度基準。通過圖形中心化
-視角統一提示優化，LLM-AutoDiff 為擴展和自動化
-LLM 工作流程提供了一個強大的新範例——反映了自動
-微分庫在神經網絡研究中長期扮演的變革性角色。
-
-##### **360Brew: A Decoder-only Foundation Model for Personalized Ranking and Recommendation**
-2501.16450v3 by Hamed Firooz, Maziar Sanjabi, Adrian Englhardt, Aman Gupta, Ben Levine, Dre Olgiati, Gungor Polatkan, Iuliia Melnychuk, Karthik Ramgopal, Kirill Talanine, Kutta Srinivasan, Luke Simon, Natesh Sivasubramoniapillai, Necip Fazil Ayan, Qingquan Song, Samira Sriram, Souvik Ghosh, Tao Song, Tejas Dharamsi, Vignesh Kothapalli, Xiaoling Zhai, Ya Xu, Yu Wang, Yun Dai
-
-Ranking and recommendation systems are the foundation for numerous online
-experiences, ranging from search results to personalized content delivery.
-These systems have evolved into complex, multilayered architectures that
-leverage vast datasets and often incorporate thousands of predictive models.
-The maintenance and enhancement of these models is a labor intensive process
-that requires extensive feature engineering. This approach not only exacerbates
-technical debt but also hampers innovation in extending these systems to
-emerging problem domains. In this report, we present our research to address
-these challenges by utilizing a large foundation model with a textual interface
-for ranking and recommendation tasks. We illustrate several key advantages of
-our approach: (1) a single model can manage multiple predictive tasks involved
-in ranking and recommendation, (2) decoder models with textual interface due to
-their comprehension of reasoning capabilities, can generalize to new
-recommendation surfaces and out-of-domain problems, and (3) by employing
-natural language interfaces for task definitions and verbalizing member
-behaviors and their social connections, we eliminate the need for feature
-engineering and the maintenance of complex directed acyclic graphs of model
-dependencies. We introduce our research pre-production model, 360Brew V1.0, a
-150B parameter, decoder-only model that has been trained and fine-tuned on
-LinkedIn's data and tasks. This model is capable of solving over 30 predictive
-tasks across various segments of the LinkedIn platform, achieving performance
-levels comparable to or exceeding those of current production systems based on
-offline metrics, without task-specific fine-tuning. Notably, each of these
-tasks is conventionally addressed by dedicated models that have been developed
-and maintained over multiple years by teams of a similar or larger size than
-our own.
-
-摘要：排名和推薦系統是許多線上體驗的基礎，從搜尋結果到個人化內容傳遞。
-這些系統已演變成複雜的多層架構，利用龐大的資料集，並經常納入數千個預測模型。
-這些模型的維護和增強是一個勞力密集的過程，需要廣泛的特徵工程。
-這種方法不僅加劇了技術債務，也阻礙了將這些系統擴展到新興問題領域的創新。
-在此報告中，我們提出了我們的研究，以利用具有文字介面的大型基礎模型來解決這些挑戰，以進行排名和推薦任務。
-我們說明了我們方法的幾個主要優點：(1) 單一模型可以管理排名和推薦中涉及的多個預測任務，(2) 由於解碼器模型具有文字介面，因此它們對推理能力的理解，可以推廣到新的推薦表面和領域外問題，以及 (3) 通過採用自然語言介面進行任務定義和表達成員行為及其社交連接，我們消除了對特徵工程和維護複雜的模型相依性有向無環圖的需求。
-我們介紹了我們的研究前製作業模型 360Brew V1.0，這是一個 150B 參數，僅解碼器模型，已在 LinkedIn 的資料和任務上進行訓練和微調。
-此模型能夠解決 LinkedIn 平臺各個區塊中超過 30 個預測任務，在不針對任務進行微調的情況下，達到與基於離線指標的現行製作系統相當或超越的效能水準。
-值得注意的是，這些任務中的每個任務通常由專用模型處理，這些模型是由與我們規模相當或更大的團隊在多年間開發和維護的。
-
-##### **Raiders of the Lost Dependency: Fixing Dependency Conflicts in Python using LLMs**
-2501.16191v1 by Antony Bartlett, Cynthia Liem, Annibale Panichella
-
-Fixing Python dependency issues is a tedious and error-prone task for
-developers, who must manually identify and resolve environment dependencies and
-version constraints of third-party modules and Python interpreters. Researchers
-have attempted to automate this process by relying on large knowledge graphs
-and database lookup tables. However, these traditional approaches face
-limitations due to the variety of dependency error types, large sets of
-possible module versions, and conflicts among transitive dependencies. This
-study explores the potential of using large language models (LLMs) to
-automatically fix dependency issues in Python programs. We introduce PLLM
-(pronounced "plum"), a novel technique that employs retrieval-augmented
-generation (RAG) to help an LLM infer Python versions and required modules for
-a given Python file. PLLM builds a testing environment that iteratively (1)
-prompts the LLM for module combinations, (2) tests the suggested changes, and
-(3) provides feedback (error messages) to the LLM to refine the fix. This
-feedback cycle leverages natural language processing (NLP) to intelligently
-parse and interpret build error messages. We benchmark PLLM on the Gistable
-HG2.9K dataset, a collection of challenging single-file Python gists. We
-compare PLLM against two state-of-the-art automatic dependency inference
-approaches, namely PyEGo and ReadPyE, w.r.t. the ability to resolve dependency
-issues. Our results indicate that PLLM can fix more dependency issues than the
-two baselines, with +218 (+15.97%) more fixes over ReadPyE and +281 (+21.58%)
-over PyEGo. Our deeper analyses suggest that PLLM is particularly beneficial
-for projects with many dependencies and for specific third-party numerical and
-machine-learning modules. Our findings demonstrate the potential of LLM-based
-approaches to iteratively resolve Python dependency issues.
-
-摘要：<paragraph>修復 Python 依賴項問題對開發人員來說是一項繁瑣且容易出錯的任務，他們必須手動識別和解決第三方模組和 Python 解譯器的環境依賴項和版本限制。研究人員已嘗試透過依賴大型知識圖譜和資料庫查詢表來自動化此程序。然而，這些傳統方法由於依賴項錯誤類型多樣、可能的模組版本數量龐大，以及傳遞依賴項之間的衝突，而面臨限制。本研究探討使用大型語言模型 (LLM) 自動修復 Python 程式中的依賴項問題的可能性。我們介紹 PLLM（發音為「plum」），這是一種新穎的技術，採用檢索增強生成 (RAG) 來協助 LLM 推論 Python 版本和給定 Python 檔案所需的模組。PLLM 建立一個測試環境，反覆 (1) 提示 LLM 模組組合，(2) 測試建議的變更，以及 (3) 提供回饋（錯誤訊息）給 LLM 以改善修正。此回饋循環利用自然語言處理 (NLP) 來智慧解析和詮釋建置錯誤訊息。我們在 Gistable HG2.9K 資料集上對 PLLM 進行基準測試，該資料集是一個具有挑戰性的單一檔案 Python gist 集合。我們將 PLLM 與兩種最先進的自動依賴項推論方法進行比較，即 PyEGo 和 ReadPyE，以比較解決依賴項問題的能力。我們的結果顯示，PLLM 可以修復比這兩個基準更多的依賴項問題，比 ReadPyE 多修復了 +218 (+15.97%) 個，比 PyEGo 多修復了 +281 (+21.58%) 個。我們更深入的分析表明，PLLM 對具有許多依賴項的專案以及特定第三方數值和機器學習模組特別有益。我們的研究結果證明了基於 LLM 的方法反覆解決 Python 依賴項問題的可能性。</paragraph>
-
-##### **Harnessing Diverse Perspectives: A Multi-Agent Framework for Enhanced Error Detection in Knowledge Graphs**
-2501.15791v1 by Yu Li, Yi Huang, Guilin Qi, Junlan Feng, Nan Hu, Songlin Zhai, Haohan Xue, Yongrui Chen, Ruoyan Shen, Tongtong Wu
-
-Knowledge graphs are widely used in industrial applications, making error
-detection crucial for ensuring the reliability of downstream applications.
-Existing error detection methods often fail to effectively leverage
-fine-grained subgraph information and rely solely on fixed graph structures,
-while also lacking transparency in their decision-making processes, which
-results in suboptimal detection performance. In this paper, we propose a novel
-Multi-Agent framework for Knowledge Graph Error Detection (MAKGED) that
-utilizes multiple large language models (LLMs) in a collaborative setting. By
-concatenating fine-grained, bidirectional subgraph embeddings with LLM-based
-query embeddings during training, our framework integrates these
-representations to produce four specialized agents. These agents utilize
-subgraph information from different dimensions to engage in multi-round
-discussions, thereby improving error detection accuracy and ensuring a
-transparent decision-making process. Extensive experiments on FB15K and WN18RR
-demonstrate that MAKGED outperforms state-of-the-art methods, enhancing the
-accuracy and robustness of KG evaluation. For specific industrial scenarios,
-our framework can facilitate the training of specialized agents using
-domain-specific knowledge graphs for error detection, which highlights the
-potential industrial application value of our framework. Our code and datasets
-are available at https://github.com/kse-ElEvEn/MAKGED.
-
-摘要：知識圖譜廣泛應用於工業應用中，使得錯誤偵測對於確保下游應用的可靠性至關重要。現有的錯誤偵測方法通常無法有效利用細粒度的子圖資訊，並且僅依賴於固定的圖形結構，同時在它們的決策過程中也缺乏透明度，這導致次佳的偵測效能。在本文中，我們提出了一個用於知識圖譜錯誤偵測 (MAKGED) 的新多代理架構，它在協作設定中利用了多個大型語言模型 (LLM)。透過在訓練期間將細粒度、雙向子圖嵌入與基於 LLM 的查詢嵌入串接，我們的架構整合了這些表示以產生四個專門代理。這些代理利用不同維度的子圖資訊參與多輪討論，從而提高錯誤偵測準確度並確保透明的決策過程。在 FB15K 和 WN18RR 上的廣泛實驗表明，MAKGED 優於最先進的方法，增強了 KG 評估的準確性和穩健性。對於特定產業情境，我們的架構可以利用特定領域的知識圖譜來促進專門代理的訓練以進行錯誤偵測，這突顯了我們架構的潛在產業應用價值。我們的程式碼和資料集可在 https://github.com/kse-ElEvEn/MAKGED 取得。
-
-##### **Automatic Feedback Generation for Short Answer Questions using Answer Diagnostic Graphs**
-2501.15777v1 by Momoka Furuhashi, Hiroaki Funayama, Yuya Iwase, Yuichiroh Matsubayashi, Yoriko Isobe, Toru Nagahama, Saku Sugawara, Kentaro Inui
-
-Short-reading comprehension questions help students understand text structure
-but lack effective feedback. Students struggle to identify and correct errors,
-while manual feedback creation is labor-intensive. This highlights the need for
-automated feedback linking responses to a scoring rubric for deeper
-comprehension.
-  Despite advances in Natural Language Processing (NLP), research has focused
-on automatic grading, with limited work on feedback generation. To address
-this, we propose a system that generates feedback for student responses.
-  Our contributions are twofold. First, we introduce the first system for
-feedback on short-answer reading comprehension. These answers are derived from
-the text, requiring structural understanding. We propose an "answer diagnosis
-graph," integrating the text's logical structure with feedback templates. Using
-this graph and NLP techniques, we estimate students' comprehension and generate
-targeted feedback.
-  Second, we evaluate our feedback through an experiment with Japanese high
-school students (n=39). They answered two 70-80 word questions and were divided
-into two groups with minimal academic differences. One received a model answer,
-the other system-generated feedback. Both re-answered the questions, and we
-compared score changes. A questionnaire assessed perceptions and motivation.
-  Results showed no significant score improvement between groups, but
-system-generated feedback helped students identify errors and key points in the
-text. It also significantly increased motivation. However, further refinement
-is needed to enhance text structure understanding.
-
-摘要：短篇閱讀理解題目有助學生理解文章結構，但缺乏有效的回饋。學生難以找出並更正錯誤，而手動建立回饋又很費力。這突顯了自動化回饋的必要性，將回應連結到評分標準，以獲得更深入的理解。
-
-儘管自然語言處理 (NLP) 有所進展，但研究一直集中在自動評分上，而回饋生成的工作有限。為了解決這個問題，我們提出了一個系統，用於為學生的回答產生回饋。
-
-我們的貢獻有兩個方面。首先，我們引入了第一個針對簡答閱讀理解提供回饋的系統。這些答案來自於文本，需要結構化的理解。我們提出了一個「答案診斷圖」，將文本的邏輯結構與回饋範本整合在一起。使用這個圖表和 NLP 技術，我們估計學生的理解力並產生有針對性的回饋。
-
-其次，我們透過一項針對日本高中生的實驗（n=39）來評估我們的回饋。他們回答了兩個 70-80 字的問題，並被分成兩組，學術差異最小。一組收到範本答案，另一組收到系統產生的回饋。兩組都重新回答了問題，我們比較了分數的變化。一份問卷評估了認知和動機。
-
-結果顯示兩組之間沒有顯著的分數進步，但系統產生的回饋有助於學生找出文本中的錯誤和重點。它也顯著地提高了動機。然而，需要進一步的改進來增強對文本結構的理解。
+
+### LLM
+|Publish Date|Title|Authors|Homepage|Code|
+| :---: | :---: | :---: | :---: | :---: |
+|**2025-02-18**|**SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation**|Zekun Qi et.al.|[2502.13143v1](http://arxiv.org/abs/2502.13143v1)|null|
+|**2025-02-18**|**Pre-training Auto-regressive Robotic Models with 4D Representations**|Dantong Niu et.al.|[2502.13142v1](http://arxiv.org/abs/2502.13142v1)|null|
+|**2025-02-18**|**UniGuardian: A Unified Defense for Detecting Prompt Injection, Backdoor Attacks and Adversarial Attacks in Large Language Models**|Huawei Lin et.al.|[2502.13141v1](http://arxiv.org/abs/2502.13141v1)|null|
+|**2025-02-18**|**AIDE: AI-Driven Exploration in the Space of Code**|Zhengyao Jiang et.al.|[2502.13138v1](http://arxiv.org/abs/2502.13138v1)|null|
+|**2025-02-18**|**Theorem Prover as a Judge for Synthetic Data Generation**|Joshua Ong Jun Leang et.al.|[2502.13137v1](http://arxiv.org/abs/2502.13137v1)|null|
+|**2025-02-18**|**Sleepless Nights, Sugary Days: Creating Synthetic Users with Health Conditions for Realistic Coaching Agent Interactions**|Taedong Yun et.al.|[2502.13135v1](http://arxiv.org/abs/2502.13135v1)|null|
+|**2025-02-18**|**Learning to Defer for Causal Discovery with Imperfect Experts**|Oscar Clivio et.al.|[2502.13132v1](http://arxiv.org/abs/2502.13132v1)|null|
+|**2025-02-18**|**Rethinking Diverse Human Preference Learning through Principal Component Analysis**|Feng Luo et.al.|[2502.13131v1](http://arxiv.org/abs/2502.13131v1)|null|
+|**2025-02-18**|**Magma: A Foundation Model for Multimodal AI Agents**|Jianwei Yang et.al.|[2502.13130v1](http://arxiv.org/abs/2502.13130v1)|null|
+|**2025-02-18**|**SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation**|Zihan Liu et.al.|[2502.13128v1](http://arxiv.org/abs/2502.13128v1)|null|
+|**2025-02-18**|**Facilitating Long Context Understanding via Supervised Chain-of-Thought Reasoning**|Jingyang Lin et.al.|[2502.13127v1](http://arxiv.org/abs/2502.13127v1)|null|
+|**2025-02-18**|**RuozhiBench: Evaluating LLMs with Logical Fallacies and Misleading Premises**|Zenan Zhai et.al.|[2502.13125v1](http://arxiv.org/abs/2502.13125v1)|null|
+|**2025-02-18**|**NaturalReasoning: Reasoning in the Wild with 2.8M Challenging Questions**|Weizhe Yuan et.al.|[2502.13124v1](http://arxiv.org/abs/2502.13124v1)|null|
+|**2025-02-18**|**Adapting Psycholinguistic Research for LLMs: Gender-inclusive Language in a Coreference Context**|Marion Bartl et.al.|[2502.13120v1](http://arxiv.org/abs/2502.13120v1)|null|
+|**2025-02-18**|**STEER-ME: Assessing the Microeconomic Reasoning of Large Language Models**|Narun Raman et.al.|[2502.13119v1](http://arxiv.org/abs/2502.13119v1)|null|
+|**2025-02-18**|**Performance Evaluation of Large Language Models in Statistical Programming**|Xinyi Song et.al.|[2502.13117v1](http://arxiv.org/abs/2502.13117v1)|null|
+|**2025-02-18**|**Near-Optimal Private Learning in Linear Contextual Bandits**|Fan Chen et.al.|[2502.13115v1](http://arxiv.org/abs/2502.13115v1)|null|
+|**2025-02-18**|**The influence of motion features in temporal perception**|Rosa Illan Castillo et.al.|[2502.13114v1](http://arxiv.org/abs/2502.13114v1)|null|
+|**2025-02-18**|**Improving Clinical Question Answering with Multi-Task Learning: A Joint Approach for Answer Extraction and Medical Categorization**|Priyaranjan Pattnayak et.al.|[2502.13108v1](http://arxiv.org/abs/2502.13108v1)|null|
+|**2025-02-18**|**MatterChat: A Multi-Modal LLM for Material Science**|Yingheng Tang et.al.|[2502.13107v1](http://arxiv.org/abs/2502.13107v1)|null|
+|**2025-02-18**|**Understanding and Rectifying Safety Perception Distortion in VLMs**|Xiaohan Zou et.al.|[2502.13095v1](http://arxiv.org/abs/2502.13095v1)|null|
+|**2025-02-18**|**Text2World: Benchmarking Large Language Models for Symbolic World Model Generation**|Mengkang Hu et.al.|[2502.13092v1](http://arxiv.org/abs/2502.13092v1)|null|
+|**2025-02-18**|**KAPPA: A Generic Patent Analysis Framework with Keyphrase-Based Portraits**|Xin Xia et.al.|[2502.13076v1](http://arxiv.org/abs/2502.13076v1)|null|
+|**2025-02-18**|**Interactive Agents to Overcome Ambiguity in Software Engineering**|Sanidhya Vijayvargiya et.al.|[2502.13069v1](http://arxiv.org/abs/2502.13069v1)|null|
+|**2025-02-18**|**Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity**|Yuri Kuratov et.al.|[2502.13063v1](http://arxiv.org/abs/2502.13063v1)|null|
+|**2025-02-18**|**AI-Assisted Decision Making with Human Learning**|Gali Noti et.al.|[2502.13062v1](http://arxiv.org/abs/2502.13062v1)|null|
+|**2025-02-18**|**Improved Fine-Tuning of Large Multimodal Models for Hateful Meme Detection**|Jingbiao Mei et.al.|[2502.13061v1](http://arxiv.org/abs/2502.13061v1)|null|
+|**2025-02-18**|**SimpleVQA: Multimodal Factuality Evaluation for Multimodal Large Language Models**|Xianfu Cheng et.al.|[2502.13059v1](http://arxiv.org/abs/2502.13059v1)|null|
+|**2025-02-18**|**LAMD: Context-driven Android Malware Detection and Classification with LLMs**|Xingzhi Qian et.al.|[2502.13055v1](http://arxiv.org/abs/2502.13055v1)|null|
+|**2025-02-18**|**Do we still need Human Annotators? Prompting Large Language Models for Aspect Sentiment Quad Prediction**|Nils Constantin Hellwig et.al.|[2502.13044v1](http://arxiv.org/abs/2502.13044v1)|null|
+|**2025-02-18**|**Natural Language Generation from Visual Sequences: Challenges and Future Directions**|Aditya K Surikuchi et.al.|[2502.13034v1](http://arxiv.org/abs/2502.13034v1)|null|
+|**2025-02-18**|**HPSS: Heuristic Prompting Strategy Search for LLM Evaluators**|Bosi Wen et.al.|[2502.13031v1](http://arxiv.org/abs/2502.13031v1)|null|
+|**2025-02-18**|**Whose story is it? Personalizing story generation by inferring author styles**|Nischal Ashok Kumar et.al.|[2502.13028v1](http://arxiv.org/abs/2502.13028v1)|null|
+|**2025-02-18**|**Agentic Deep Graph Reasoning Yields Self-Organizing Knowledge Networks**|Markus J. Buehler et.al.|[2502.13025v1](http://arxiv.org/abs/2502.13025v1)|null|
+|**2025-02-18**|**Oreo: A Plug-in Context Reconstructor to Enhance Retrieval-Augmented Generation**|Sha Li et.al.|[2502.13019v1](http://arxiv.org/abs/2502.13019v1)|null|
+|**2025-02-18**|**LLM-Powered Proactive Data Systems**|Sepanta Zeighami et.al.|[2502.13016v1](http://arxiv.org/abs/2502.13016v1)|null|
+|**2025-02-18**|**Towards a Design Guideline for RPA Evaluation: A Survey of Large Language Model-Based Role-Playing Agents**|Chaoran Chen et.al.|[2502.13012v1](http://arxiv.org/abs/2502.13012v1)|null|
+|**2025-02-18**|**Adaptive Knowledge Graphs Enhance Medical Question Answering: Bridging the Gap Between LLMs and Evolving Medical Knowledge**|Mohammad Reza Rezaei et.al.|[2502.13010v1](http://arxiv.org/abs/2502.13010v1)|null|
+|**2025-02-18**|**Integrating Reinforcement Learning, Action Model Learning, and Numeric Planning for Tackling Complex Tasks**|Yarin Benyamin et.al.|[2502.13006v1](http://arxiv.org/abs/2502.13006v1)|null|
+|**2025-02-18**|**Language Barriers: Evaluating Cross-Lingual Performance of CNN and Transformer Architectures for Speech Quality Estimation**|Wafaa Wardah et.al.|[2502.13004v1](http://arxiv.org/abs/2502.13004v1)|null|
+|**2025-02-18**|**You need to MIMIC to get FAME: Solving Meeting Transcript Scarcity with a Multi-Agent Conversations**|Frederic Kirstein et.al.|[2502.13001v1](http://arxiv.org/abs/2502.13001v1)|null|
+|**2025-02-18**|**Personalized Top-k Set Queries Over Predicted Scores**|Sohrab Namazi Nia et.al.|[2502.12998v1](http://arxiv.org/abs/2502.12998v1)|null|
+|**2025-02-18**|**Eager Updates For Overlapped Communication and Computation in DiLoCo**|Satyen Kale et.al.|[2502.12996v1](http://arxiv.org/abs/2502.12996v1)|null|
+|**2025-02-18**|**Free Argumentative Exchanges for Explaining Image Classifiers**|Avinash Kori et.al.|[2502.12995v1](http://arxiv.org/abs/2502.12995v1)|null|
+|**2025-02-18**|**B-cos LM: Efficiently Transforming Pre-trained Language Models for Improved Explainability**|Yifan Wang et.al.|[2502.12992v1](http://arxiv.org/abs/2502.12992v1)|null|
+|**2025-02-18**|**Beyond Profile: From Surface-Level Facts to Deep Persona Simulation in LLMs**|Zixiao Wang et.al.|[2502.12988v1](http://arxiv.org/abs/2502.12988v1)|null|
+|**2025-02-18**|**PartSDF: Part-Based Implicit Neural Representation for Composite 3D Shape Parametrization and Optimization**|Nicolas Talabot et.al.|[2502.12985v1](http://arxiv.org/abs/2502.12985v1)|null|
+|**2025-02-18**|**Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs**|Longxu Dou et.al.|[2502.12982v1](http://arxiv.org/abs/2502.12982v1)|null|
+|**2025-02-18**|**Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking**|Junda Zhu et.al.|[2502.12970v1](http://arxiv.org/abs/2502.12970v1)|null|
+|**2025-02-18**|**A Survey of Text Classification Under Class Distribution Shift**|Adriana Valentina Costache et.al.|[2502.12965v1](http://arxiv.org/abs/2502.12965v1)|null|
+|**2025-02-18**|**Trust Me, I'm Wrong: High-Certainty Hallucinations in LLMs**|Adi Simhi et.al.|[2502.12964v1](http://arxiv.org/abs/2502.12964v1)|null|
+|**2025-02-18**|**Infinite Retrieval: Attention Enhanced LLMs in Long-Context Processing**|Xiaoju Ye et.al.|[2502.12962v1](http://arxiv.org/abs/2502.12962v1)|null|
+|**2025-02-18**|**Adaptive Tool Use in Large Language Models with Meta-Cognition Trigger**|Wenjun Li et.al.|[2502.12961v1](http://arxiv.org/abs/2502.12961v1)|null|
+|**2025-02-18**|**AlignFreeze: Navigating the Impact of Realignment on the Layers of Multilingual Models Across Diverse Languages**|Steve Bakos et.al.|[2502.12959v1](http://arxiv.org/abs/2502.12959v1)|null|
+|**2025-02-18**|**Task-Informed Anti-Curriculum by Masking Improves Downstream Performance on Text**|Andrei Jarca et.al.|[2502.12953v1](http://arxiv.org/abs/2502.12953v1)|null|
+|**2025-02-18**|**Fake It Till You Make It: Using Synthetic Data and Domain Knowledge for Improved Text-Based Learning for LGE Detection**|Athira J Jacob et.al.|[2502.12948v1](http://arxiv.org/abs/2502.12948v1)|null|
+|**2025-02-18**|**Every Expert Matters: Towards Effective Knowledge Distillation for Mixture-of-Experts Language Models**|Gyeongman Kim et.al.|[2502.12947v1](http://arxiv.org/abs/2502.12947v1)|null|
+|**2025-02-18**|**LLMPopcorn: An Empirical Study of LLMs as Assistants for Popular Micro-video Generation**|Junchen Fu et.al.|[2502.12945v1](http://arxiv.org/abs/2502.12945v1)|null|
+|**2025-02-18**|**Synthetic Data Generation for Culturally Nuanced Commonsense Reasoning in Low-Resource Languages**|Salsabila Zahirah Pranida et.al.|[2502.12932v1](http://arxiv.org/abs/2502.12932v1)|null|
+|**2025-02-18**|**Flow-of-Options: Diversified and Improved LLM Reasoning by Thinking Through Options**|Lakshmi Nair et.al.|[2502.12929v1](http://arxiv.org/abs/2502.12929v1)|null|
+|**2025-02-18**|**Finedeep: Mitigating Sparse Activation in Dense LLMs via Multi-Layer Fine-Grained Experts**|Leiyu Pan et.al.|[2502.12928v1](http://arxiv.org/abs/2502.12928v1)|null|
+|**2025-02-18**|**SEFL: Harnessing Large Language Model Agents to Improve Educational Feedback Systems**|Mike Zhang et.al.|[2502.12927v1](http://arxiv.org/abs/2502.12927v1)|null|
+|**2025-02-18**|**Towards more Contextual Agents: An extractor-Generator Optimization Framework**|Mourad Aouini et.al.|[2502.12926v1](http://arxiv.org/abs/2502.12926v1)|null|
+|**2025-02-18**|**Keep what you need : extracting efficient subnetworks from large audio representation models**|David Genova et.al.|[2502.12925v1](http://arxiv.org/abs/2502.12925v1)|null|
+|**2025-02-18**|**Conditioning LLMs to Generate Code-Switched Text: A Methodology Grounded in Naturally Occurring Data**|Maite Heredia et.al.|[2502.12924v1](http://arxiv.org/abs/2502.12924v1)|null|
+|**2025-02-18**|**On-Device LLMs for Home Assistant: Dual Role in Intent Detection and Response Generation**|Rune Birkmose et.al.|[2502.12923v1](http://arxiv.org/abs/2502.12923v1)|null|
+|**2025-02-18**|**Q-STRUM Debate: Query-Driven Contrastive Summarization for Recommendation Comparison**|George-Kirollos Saad et.al.|[2502.12921v1](http://arxiv.org/abs/2502.12921v1)|null|
+|**2025-02-18**|**GSQ-Tuning: Group-Shared Exponents Integer in Fully Quantized Training for LLMs On-Device Fine-tuning**|Sifan Zhou et.al.|[2502.12913v1](http://arxiv.org/abs/2502.12913v1)|null|
+|**2025-02-18**|**Knapsack Optimization-based Schema Linking for LLM-based Text-to-SQL Generation**|Zheng Yuan et.al.|[2502.12911v1](http://arxiv.org/abs/2502.12911v1)|null|
+|**2025-02-18**|**Graph Neural Networks for Databases: A Survey**|Ziming Li et.al.|[2502.12908v1](http://arxiv.org/abs/2502.12908v1)|null|
+|**2025-02-18**|**Fraud-R1 : A Multi-Round Benchmark for Assessing the Robustness of LLM Against Augmented Fraud and Phishing Inducements**|Shu Yang et.al.|[2502.12904v1](http://arxiv.org/abs/2502.12904v1)|null|
+|**2025-02-18**|**Soundwave: Less is More for Speech-Text Alignment in LLMs**|Yuhao Zhang et.al.|[2502.12900v1](http://arxiv.org/abs/2502.12900v1)|null|
+|**2025-02-18**|**None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks**|Eva Sánchez Salido et.al.|[2502.12896v1](http://arxiv.org/abs/2502.12896v1)|null|
+|**2025-02-18**|**Multilingual European Language Models: Benchmarking Approaches and Challenges**|Fabio Barth et.al.|[2502.12895v1](http://arxiv.org/abs/2502.12895v1)|null|
+|**2025-02-18**|**H-CoT: Hijacking the Chain-of-Thought Safety Reasoning Mechanism to Jailbreak Large Reasoning Models, Including OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking**|Martin Kuo et.al.|[2502.12893v1](http://arxiv.org/abs/2502.12893v1)|null|
+|**2025-02-18**|**Are Multilingual Language Models an Off-ramp for Under-resourced Languages? Will we arrive at Digital Language Equality in Europe in 2030?**|Georg Rehm et.al.|[2502.12886v1](http://arxiv.org/abs/2502.12886v1)|null|
+|**2025-02-18**|**How desirable is alignment between LLMs and linguistically diverse human users?**|Pia Knoeferle et.al.|[2502.12884v1](http://arxiv.org/abs/2502.12884v1)|null|
+|**2025-02-18**|**Continuous Learning Conversational AI: A Personalized Agent Framework via A2C Reinforcement Learning**|Nandakishor M et.al.|[2502.12876v1](http://arxiv.org/abs/2502.12876v1)|null|
+|**2025-02-18**|**PAFT: Prompt-Agnostic Fine-Tuning**|Chenxing Wei et.al.|[2502.12859v1](http://arxiv.org/abs/2502.12859v1)|null|
+|**2025-02-18**|**Rejected Dialects: Biases Against African American Language in Reward Models**|Joel Mire et.al.|[2502.12858v1](http://arxiv.org/abs/2502.12858v1)|null|
+|**2025-02-18**|**Integrating Arithmetic Learning Improves Mathematical Reasoning in Smaller Models**|Neeraj Gangwar et.al.|[2502.12855v1](http://arxiv.org/abs/2502.12855v1)|null|
+|**2025-02-18**|**S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning**|Ruotian Ma et.al.|[2502.12853v1](http://arxiv.org/abs/2502.12853v1)|null|
+|**2025-02-18**|**MVL-SIB: A Massively Multilingual Vision-Language Benchmark for Cross-Modal Topical Matching**|Fabian David Schmidt et.al.|[2502.12852v1](http://arxiv.org/abs/2502.12852v1)|null|
+|**2025-02-18**|**MeMo: Towards Language Models with Associative Memory Mechanisms**|Fabio Massimo Zanzotto et.al.|[2502.12851v1](http://arxiv.org/abs/2502.12851v1)|null|
+|**2025-02-18**|**Towards Adaptive Feedback with AI: Comparing the Feedback Quality of LLMs and Teachers on Experimentation Protocols**|Kathrin Seßler et.al.|[2502.12842v1](http://arxiv.org/abs/2502.12842v1)|null|
+|**2025-02-18**|**Towards Equitable AI: Detecting Bias in Using Large Language Models for Marketing**|Berk Yilmaz et.al.|[2502.12838v1](http://arxiv.org/abs/2502.12838v1)|null|
+|**2025-02-18**|**An LLM-Powered Agent for Physiological Data Analysis: A Case Study on PPG-based Heart Rate Estimation**|Mohammad Feli et.al.|[2502.12836v1](http://arxiv.org/abs/2502.12836v1)|null|
+|**2025-02-18**|**Subword models struggle with word learning, but surprisal hides it**|Bastian Bunzeck et.al.|[2502.12835v1](http://arxiv.org/abs/2502.12835v1)|null|
+|**2025-02-18**|**KazMMLU: Evaluating Language Models on Kazakh, Russian, and Regional Knowledge of Kazakhstan**|Mukhammed Togmanov et.al.|[2502.12829v1](http://arxiv.org/abs/2502.12829v1)|null|
+|**2025-02-18**|**Reasoning and the Trusting Behavior of DeepSeek and GPT: An Experiment Revealing Hidden Fault Lines in Large Language Models**|Rubing Lu et.al.|[2502.12825v1](http://arxiv.org/abs/2502.12825v1)|null|
+|**2025-02-18**|**Pitfalls of Scale: Investigating the Inverse Task of Redefinition in Large Language Models**|Elena Stringli et.al.|[2502.12821v1](http://arxiv.org/abs/2502.12821v1)|null|
+|**2025-02-18**|**Simulating User Diversity in Task-Oriented Dialogue Systems using Large Language Models**|Adnan Ahmad et.al.|[2502.12813v1](http://arxiv.org/abs/2502.12813v1)|null|
+|**2025-02-18**|**Towards Text-Image Interleaved Retrieval**|Xin Zhang et.al.|[2502.12799v1](http://arxiv.org/abs/2502.12799v1)|null|
+|**2025-02-18**|**Envious Explore and Exploit**|Omer Ben-Porat et.al.|[2502.12798v1](http://arxiv.org/abs/2502.12798v1)|null|
+|**2025-02-18**|**Commonsense Reasoning in Arab Culture**|Abdelrahman Sadallah et.al.|[2502.12788v1](http://arxiv.org/abs/2502.12788v1)|null|
+|**2025-02-18**|**VidCapBench: A Comprehensive Benchmark of Video Captioning for Controllable Text-to-Video Generation**|Xinlong Chen et.al.|[2502.12782v1](http://arxiv.org/abs/2502.12782v1)|null|
+|**2025-02-18**|**Portable Reward Tuning: Towards Reusable Fine-Tuning across Different Pretrained Models**|Daiki Chijiwa et.al.|[2502.12776v1](http://arxiv.org/abs/2502.12776v1)|null|
+|**2025-02-18**|**Mind the Gap: Aligning the Brain with Language Models Requires a Nonlinear and Multimodal Approach**|Danny Dongyeop Han et.al.|[2502.12771v1](http://arxiv.org/abs/2502.12771v1)|null|
+|**2025-02-18**|**How Much Do LLMs Hallucinate across Languages? On Multilingual Estimation of LLM Hallucination in the Wild**|Saad Obaid ul Islam et.al.|[2502.12769v1](http://arxiv.org/abs/2502.12769v1)|null|
+|**2025-02-18**|**R2-KG: General-Purpose Dual-Agent Framework for Reliable Reasoning on Knowledge Graphs**|Sumin Jo et.al.|[2502.12767v1](http://arxiv.org/abs/2502.12767v1)|null|
+
+#### Abstracts
+##### **SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation**
+2502.13143v1 by Zekun Qi, Wenyao Zhang, Yufei Ding, Runpei Dong, Xinqiang Yu, Jingwen Li, Lingyun Xu, Baoyu Li, Xialin He, Guofan Fan, Jiazhao Zhang, Jiawei He, Jiayuan Gu, Xin Jin, Kaisheng Ma, Zhizheng Zhang, He Wang, Li Yi
+
+Spatial intelligence is a critical component of embodied AI, promoting robots
+to understand and interact with their environments. While recent advances have
+enhanced the ability of VLMs to perceive object locations and positional
+relationships, they still lack the capability to precisely understand object
+orientations, a key requirement for tasks involving fine-grained manipulations.
+Addressing this limitation not only requires geometric reasoning but also an
+expressive and intuitive way to represent orientation. In this context, we
+propose that natural language offers a more flexible representation space than
+canonical frames, making it particularly suitable for instruction-following
+robotic systems. In this paper, we introduce the concept of semantic
+orientation, which defines object orientations using natural language in a
+reference-frame-free manner (e.g., the "plug-in" direction of a USB or the
+"handle" direction of a knife). To support this, we construct OrienText300K,
+a large-scale dataset of 3D models annotated with semantic orientations that
+link geometric understanding to functional semantics. By integrating semantic
+orientation into a VLM system, we enable robots to generate manipulation
+actions with both positional and orientational constraints. Extensive
+experiments in simulation and real world demonstrate that our approach
+significantly enhances robotic manipulation capabilities, e.g., 48.7% accuracy
+on Open6DOR and 74.9% accuracy on SIMPLER.
+
+摘要：空間智能是具象 AI 的關鍵組成部分，促使機器人了解其環境並與之互動。雖然最近的進展增強了 VLM 感知物件位置和位置關係的能力，但它們仍然缺乏精確理解物件方向的能力，這對於涉及細微操作的任務來說是一項關鍵要求。解決這個限制不僅需要幾何推理，還需要一種表達性和直觀的方式來表示方向。在此背景下，我們提出自然語言提供了一個比標準框架更靈活的表示空間，使其特別適合於遵循指令的機器人系統。在本文中，我們介紹了語義方向的概念，它使用自然語言以無參考框架的方式定義物件方向（例如，USB 的「插入」方向或刀子的「握柄」方向）。為了支持這一點，我們構建了 OrienText300K，這是一個大型 3D 模型數據集，其中註釋了語義方向，將幾何理解與功能語義聯繫起來。通過將語義方向整合到 VLM 系統中，我們使機器人能夠生成同時具有位置和方向約束的操作動作。在模擬和現實世界中進行的廣泛實驗表明，我們的做法顯著增強了機器人的操作能力，例如，Open6DOR 的準確率為 48.7%，SIMPLER 的準確率為 74.9%。
+
+##### **Pre-training Auto-regressive Robotic Models with 4D Representations**
+2502.13142v1 by Dantong Niu, Yuvan Sharma, Haoru Xue, Giscard Biamby, Junyi Zhang, Ziteng Ji, Trevor Darrell, Roei Herzig
+
+Foundation models pre-trained on massive unlabeled datasets have
+revolutionized natural language and computer vision, exhibiting remarkable
+generalization capabilities, thus highlighting the importance of pre-training.
+Yet, efforts in robotics have struggled to achieve similar success, limited by
+either the need for costly robotic annotations or the lack of representations
+that effectively model the physical world. In this paper, we introduce ARM4R,
+an Auto-regressive Robotic Model that leverages low-level 4D Representations
+learned from human video data to yield a better pre-trained robotic model.
+Specifically, we focus on utilizing 3D point tracking representations from
+videos derived by lifting 2D representations into 3D space via monocular depth
+estimation across time. These 4D representations maintain a shared geometric
+structure between the points and robot state representations up to a linear
+transformation, enabling efficient transfer learning from human video data to
+low-level robotic control. Our experiments show that ARM4R can transfer
+efficiently from human video data to robotics and consistently improves
+performance on tasks across various robot environments and configurations.
+
+摘要：預先在大量未標記資料集上訓練好的基礎模型已經徹底改變了自然語言和電腦視覺，展現出非凡的概化能力，因此突顯了預先訓練的重要性。然而，機器人領域的努力一直難以取得類似的成功，受到昂貴的機器人標註需求或缺乏有效建模物理世界的表徵的限制。在本文中，我們介紹了 ARM4R，一種自迴歸機器人模型，它利用從人類影片資料中學習到的低階 4D 表徵，以產生更好的預先訓練機器人模型。具體來說，我們專注於利用從影片中獲得的 3D 點追蹤表徵，這些表徵是透過單眼深度估計跨時間將 2D 表徵提升到 3D 空間而導出的。這些 4D 表徵在點和機器人狀態表徵之間保持一個共用的幾何結構，直到一個線性轉換，這使得從人類影片資料到低階機器人控制的有效遷移學習成為可能。我們的實驗表明，ARM4R 可以有效地從人類影片資料轉移到機器人技術，並持續改善各種機器人環境和組態中的任務效能。
+
+##### **UniGuardian: A Unified Defense for Detecting Prompt Injection, Backdoor Attacks and Adversarial Attacks in Large Language Models**
+2502.13141v1 by Huawei Lin, Yingjie Lao, Tong Geng, Tan Yu, Weijie Zhao
+
+Large Language Models (LLMs) are vulnerable to attacks like prompt injection,
+backdoor attacks, and adversarial attacks, which manipulate prompts or models
+to generate harmful outputs. In this paper, departing from traditional deep
+learning attack paradigms, we explore their intrinsic relationship and
+collectively term them Prompt Trigger Attacks (PTA). This raises a key
+question: Can we determine if a prompt is benign or poisoned? To address this,
+we propose UniGuardian, the first unified defense mechanism designed to detect
+prompt injection, backdoor attacks, and adversarial attacks in LLMs.
+Additionally, we introduce a single-forward strategy to optimize the detection
+pipeline, enabling simultaneous attack detection and text generation within a
+single forward pass. Our experiments confirm that UniGuardian accurately and
+efficiently identifies malicious prompts in LLMs.
+
+摘要：大型語言模型 (LLM) 容易受到提示注入、後門攻擊和對抗性攻擊等攻擊，這些攻擊會操縱提示或模型以產生有害的輸出。在本文中，我們跳脫傳統深度學習攻擊範例，探討它們的內在關係，並將它們統稱為提示觸發攻擊 (PTA)。這引發了一個關鍵問題：我們能確定一個提示是良性的還是惡意的嗎？為了解決這個問題，我們提出了 UniGuardian，這是一種旨在偵測 LLM 中的提示注入、後門攻擊和對抗性攻擊的第一個統一防禦機制。此外，我們引入了一個單一前向策略來最佳化偵測管道，在單一前向傳遞中同時進行攻擊偵測和文字生成。我們的實驗證實，UniGuardian 能準確且有效地識別 LLM 中的惡意提示。
+
+##### **AIDE: AI-Driven Exploration in the Space of Code**
+2502.13138v1 by Zhengyao Jiang, Dominik Schmidt, Dhruv Srikanth, Dixing Xu, Ian Kaplan, Deniss Jacenko, Yuxiang Wu
+
+Machine learning, the foundation of modern artificial intelligence, has
+driven innovations that have fundamentally transformed the world. Yet, behind
+advancements lies a complex and often tedious process requiring labor- and
+compute-intensive iteration and experimentation. Engineers and scientists
+developing machine learning models spend much of their time on trial-and-error
+tasks instead of conceptualizing innovative solutions or research hypotheses.
+To address this challenge, we introduce AI-Driven Exploration (AIDE), a machine
+learning engineering agent powered by large language models (LLMs). AIDE frames
+machine learning engineering as a code optimization problem, and formulates
+trial-and-error as a tree search in the space of potential solutions. By
+strategically reusing and refining promising solutions, AIDE effectively trades
+computational resources for enhanced performance, achieving state-of-the-art
+results on multiple machine learning engineering benchmarks, including our
+Kaggle evaluations, OpenAI MLE-Bench and METR's RE-Bench.
+
+摘要：機器學習，現代人工智慧的基礎，已經推動了根本性地改變世界的創新。然而，進步的背後是一個複雜且經常繁瑣的過程，需要人工和計算密集的迭代和實驗。開發機器學習模型的工程師和科學家將大部分時間花在試錯任務上，而不是構思創新的解決方案或研究假設。為了應對這一挑戰，我們引入了 AI 驅動探索 (AIDE)，這是一種由大型語言模型 (LLM) 驅動的機器學習工程代理。AIDE 將機器學習工程構建為一個程式碼最佳化問題，並將試錯表述為在潛在解決方案空間中的樹狀搜尋。透過策略性地重複使用和改進有希望的解決方案，AIDE 有效地將計算資源轉換為增強的效能，在多個機器學習工程基準上取得了最先進的成果，包括我們的 Kaggle 評估、OpenAI MLE-Bench 和 METRs RE-Bench。
+
+##### **Theorem Prover as a Judge for Synthetic Data Generation**
+2502.13137v1 by Joshua Ong Jun Leang, Giwon Hong, Wenda Li, Shay B. Cohen
+
+The demand for synthetic data in mathematical reasoning has increased due to
+its potential to enhance the mathematical capabilities of large language models
+(LLMs). However, ensuring the validity of intermediate reasoning steps remains
+a significant challenge, affecting data quality. While formal verification via
+theorem provers effectively validates LLM reasoning, the autoformalisation of
+mathematical proofs remains error-prone. In response, we introduce iterative
+autoformalisation, an approach that iteratively refines theorem prover
+formalisation to mitigate errors, thereby increasing the execution rate on the
+Lean prover from 60% to 87%. Building upon that, we introduce Theorem Prover as
+a Judge (TP-as-a-Judge), a method that employs theorem prover formalisation to
+rigorously assess LLM intermediate reasoning, effectively integrating
+autoformalisation with synthetic data generation. Finally, we present
+Reinforcement Learning from Theorem Prover Feedback (RLTPF), a framework that
+replaces human annotation with theorem prover feedback in Reinforcement
+Learning from Human Feedback (RLHF). Across multiple LLMs, applying
+TP-as-a-Judge and RLTPF improves benchmarks with only 3,508 samples, achieving
+5.56% accuracy gain on Mistral-7B for MultiArith, 6.00% on Llama-2-7B for
+SVAMP, and 3.55% on Llama-3.1-8B for AQUA.
+
+摘要：由於合成資料在數學推理中具有增強大型語言模型 (LLM) 數學能力的潛力，對合成資料的需求已增加。然而，確保中間推理步驟的有效性仍然是一項重大的挑戰，影響資料品質。雖然透過定理證明器進行形式驗證可有效驗證 LLM 推理，但數學證明自動形式化仍然容易出錯。為了解決這個問題，我們引入了迭代自動形式化，這是一種迭代優化定理證明器形式化以減少錯誤的方法，從而將 Lean 證明器的執行率從 60% 提高到 87%。在此基礎上，我們引入了定理證明器作為評審 (TP-as-a-Judge)，這是一種採用定理證明器形式化來嚴格評估 LLM 中間推理的方法，有效地將自動形式化與合成資料產生整合。最後，我們提出了定理證明器回饋強化學習 (RLTPF)，這是一個框架，用定理證明器回饋取代人類標註，以進行人類回饋強化學習 (RLHF)。在多個 LLM 中，應用 TP-as-a-Judge 和 RLTPF 可透過僅 3,508 個樣本改善基準，在 MultiArith 上獲得 5.56% 的準確度提升，在 SVAMP 上獲得 Llama-2-7B 的 6.00% 提升，在 AQUA 上獲得 Llama-3.1-8B 的 3.55% 提升。
+
+##### **Sleepless Nights, Sugary Days: Creating Synthetic Users with Health Conditions for Realistic Coaching Agent Interactions**
+2502.13135v1 by Taedong Yun, Eric Yang, Mustafa Safdari, Jong Ha Lee, Vaishnavi Vinod Kumar, S. Sara Mahdavi, Jonathan Amar, Derek Peyton, Reut Aharony, Andreas Michaelides, Logan Schneider, Isaac Galatzer-Levy, Yugang Jia, John Canny, Arthur Gretton, Maja Matarić
+
+We present an end-to-end framework for generating synthetic users for
+evaluating interactive agents designed to encourage positive behavior changes,
+such as in health and lifestyle coaching. The synthetic users are grounded in
+health and lifestyle conditions, specifically sleep and diabetes management in
+this study, to ensure realistic interactions with the health coaching agent.
+Synthetic users are created in two stages: first, structured data are generated
+grounded in real-world health and lifestyle factors in addition to basic
+demographics and behavioral attributes; second, full profiles of the synthetic
+users are developed conditioned on the structured data. Interactions between
+synthetic users and the coaching agent are simulated using generative
+agent-based models such as Concordia, or directly by prompting a language
+model. Using two independently-developed agents for sleep and diabetes coaching
+as case studies, the validity of this framework is demonstrated by analyzing
+the coaching agent's understanding of the synthetic users' needs and
+challenges. Finally, through multiple blinded evaluations of user-coach
+interactions by human experts, we demonstrate that our synthetic users with
+health and behavioral attributes more accurately portray real human users with
+the same attributes, compared to generic synthetic users not grounded in such
+attributes. The proposed framework lays the foundation for efficient
+development of conversational agents through extensive, realistic, and grounded
+simulated interactions.
+
+摘要：我們提供了一個端到端的架構，用於為評估互動式代理生成合成使用者，這些代理旨在鼓勵正向行為改變，例如健康和生活方式指導。合成使用者以健康和生活方式狀況為基礎，特別是本研究中的睡眠和糖尿病管理，以確保與健康指導代理的互動具有真實性。合成使用者分兩個階段建立：首先，除了基本人口統計資料和行為屬性外，還會產生以現實世界的健康和生活方式因素為基礎的結構化資料；其次，會根據結構化資料開發合成使用者的完整個人資料。合成使用者和指導代理之間的互動是使用生成式基於代理的模型（例如 Concordia）模擬的，或者直接通過提示語言模型來模擬。使用兩個獨立開發的睡眠和糖尿病指導代理作為案例研究，通過分析指導代理對合成使用者需求和挑戰的理解，證明了此架構的有效性。最後，通過人類專家對使用者指導互動進行多重盲測評估，我們證明了與未以這些屬性為基礎的通用合成使用者相比，具有健康和行為屬性的合成使用者更準確地描繪了具有相同屬性的真實人類使用者。所提出的架構為通過廣泛、真實且有根據的模擬互動，為對話代理的有效開發奠定了基礎。
+
+##### **Learning to Defer for Causal Discovery with Imperfect Experts**
+2502.13132v1 by Oscar Clivio, Divyat Mahajan, Perouz Taslakian, Sara Magliacane, Ioannis Mitliagkas, Valentina Zantedeschi, Alexandre Drouin
+
+Integrating expert knowledge, e.g. from large language models, into causal
+discovery algorithms can be challenging when the knowledge is not guaranteed to
+be correct. Expert recommendations may contradict data-driven results, and
+their reliability can vary significantly depending on the domain or specific
+query. Existing methods based on soft constraints or inconsistencies in
+predicted causal relationships fail to account for these variations in
+expertise. To remedy this, we propose L2D-CD, a method for gauging the
+correctness of expert recommendations and optimally combining them with
+data-driven causal discovery results. By adapting learning-to-defer (L2D)
+algorithms for pairwise causal discovery (CD), we learn a deferral function
+that selects whether to rely on classical causal discovery methods using
+numerical data or expert recommendations based on textual meta-data. We
+evaluate L2D-CD on the canonical Tübingen pairs dataset and demonstrate its
+superior performance compared to both the causal discovery method and the
+expert used in isolation. Moreover, our approach identifies domains where the
+expert's performance is strong or weak. Finally, we outline a strategy for
+generalizing this approach to causal discovery on graphs with more than two
+variables, paving the way for further research in this area.
+
+摘要：整合專家知識，例如從大型語言模型中整合到因果發現演算法中，當知識無法保證正確時會很有挑戰性。專家建議可能會與資料驅動的結果相矛盾，而且他們的可靠性可能會根據領域或特定查詢而有顯著差異。現有的基於軟約束或預測因果關係中不一致的方法無法說明專業知識中的這些變化。為了補救這一點，我們提出了 L2D-CD，一種用於評估專家建議的正確性並將其與資料驅動的因果發現結果最佳結合的方法。透過調整學習延遲 (L2D) 演算法以進行成對因果發現 (CD)，我們學習了一個延遲函數，用於選擇依賴使用數值資料的傳統因果發現方法或基於文字元資料的專家建議。我們在經典的 Tübingen 對資料集上評估 L2D-CD，並證明其與單獨使用的因果發現方法和專家相比具有優越的效能。此外，我們的做法識別出專家表現強或弱的領域。最後，我們概述了一種將此方法推廣到具有兩個以上變數的圖表上進行因果發現的策略，為此領域的進一步研究鋪平了道路。
+
+##### **Rethinking Diverse Human Preference Learning through Principal Component Analysis**
+2502.13131v1 by Feng Luo, Rui Yang, Hao Sun, Chunyuan Deng, Jiarui Yao, Jingyan Shen, Huan Zhang, Hanjie Chen
+
+Understanding human preferences is crucial for improving foundation models
+and building personalized AI systems. However, preferences are inherently
+diverse and complex, making it difficult for traditional reward models to
+capture their full range. While fine-grained preference data can help,
+collecting it is expensive and hard to scale. In this paper, we introduce
+Decomposed Reward Models (DRMs), a novel approach that extracts diverse human
+preferences from binary comparisons without requiring fine-grained annotations.
+Our key insight is to represent human preferences as vectors and analyze them
+using Principal Component Analysis (PCA). By constructing a dataset of
+embedding differences between preferred and rejected responses, DRMs identify
+orthogonal basis vectors that capture distinct aspects of preference. These
+decomposed rewards can be flexibly combined to align with different user needs,
+offering an interpretable and scalable alternative to traditional reward
+models. We demonstrate that DRMs effectively extract meaningful preference
+dimensions (e.g., helpfulness, safety, humor) and adapt to new users without
+additional training. Our results highlight DRMs as a powerful framework for
+personalized and interpretable LLM alignment.
+
+摘要：理解人類偏好對於改進基礎模型和建構個人化 AI 系統至關重要。然而，偏好本質上是多樣且複雜的，這使得傳統的獎勵模型難以捕捉其全部範圍。雖然細緻的偏好數據可能有所幫助，但收集這些數據既昂貴又難以擴展。在本文中，我們介紹了解構獎勵模型 (DRM)，這是一種新穎的方法，它可以從二元比較中提取多樣化的人類偏好，而不需要細緻的註解。我們的關鍵見解是將人類偏好表示為向量，並使用主成分分析 (PCA) 對其進行分析。透過建構偏好和拒絕回應之間嵌入差異的數據集，DRM 識別出正交基向量，這些向量捕捉偏好的不同面向。這些解構的獎勵可以靈活地結合在一起，以符合不同的使用者需求，提供一種可解釋且可擴展的傳統獎勵模型替代方案。我們證明了 DRM 可以有效地提取有意義的偏好維度（例如，有用性、安全性、幽默感），並在不需要額外訓練的情況下適應新的使用者。我們的結果突顯了 DRM 作為個人化且可解釋的 LLM 對齊強大架構。
+
+##### **Magma: A Foundation Model for Multimodal AI Agents**
+2502.13130v1 by Jianwei Yang, Reuben Tan, Qianhui Wu, Ruijie Zheng, Baolin Peng, Yongyuan Liang, Yu Gu, Mu Cai, Seonghyeon Ye, Joel Jang, Yuquan Deng, Lars Liden, Jianfeng Gao
+
+We present Magma, a foundation model that serves multimodal AI agentic tasks
+in both the digital and physical worlds. Magma is a significant extension of
+vision-language (VL) models in that it not only retains the VL understanding
+ability (verbal intelligence) of the latter, but is also equipped with the
+ability to plan and act in the visual-spatial world (spatial-temporal
+intelligence) and complete agentic tasks ranging from UI navigation to robot
+manipulation. To endow the agentic capabilities, Magma is pretrained on large
+amounts of heterogeneous datasets spanning images, videos, and robotics
+data, where the actionable visual objects (e.g., clickable buttons in GUI) in
+images are labeled by Set-of-Mark (SoM) for action grounding, and the object
+movements (e.g., the trace of human hands or robotic arms) in videos are
+labeled by Trace-of-Mark (ToM) for action planning. Extensive experiments show
+that SoM and ToM reach great synergy and facilitate the acquisition of
+spatial-temporal intelligence for our Magma model, which is fundamental to a
+wide range of tasks as shown in Fig.1. In particular, Magma creates new
+state-of-the-art results on UI navigation and robotic manipulation tasks,
+outperforming previous models that are specifically tailored to these tasks. On
+image and video-related multimodal tasks, Magma also compares favorably to
+popular large multimodal models that are trained on much larger datasets. We
+make our model and code public for reproducibility at
+https://microsoft.github.io/Magma.
+
+摘要：我們提出 Magma，這是一個基礎模型，用於服務數位和物理世界中的多模態 AI 代理任務。Magma 是視覺語言 (VL) 模型的重大延伸，它不僅保留了後者的 VL 理解能力（語言智能），還具備在視覺空間世界中規劃和行動的能力（時空智能），並完成從 UI 導航到機器人操作的代理任務。為了賦予代理能力，Magma 在從影像、影片到機器人資料的大量異質資料集上進行預訓練，其中影像中的可操作視覺物件（例如 GUI 中的可點擊按鈕）由動作接地 Set-of-Mark (SoM) 標記，影片中的物件動作（例如人手或機器手臂的軌跡）由動作規劃 Trace-of-Mark (ToM) 標記。廣泛的實驗表明，SoM 和 ToM 達到了極大的協同作用，並促進了我們 Magma 模型的時空智能的獲取，這對於圖 1 中所示的各種任務至關重要。特別是，Magma 在 UI 導航和機器人操作任務上創造了新的最先進的結果，優於專門針對這些任務的先前模型。在影像和影片相關的多模態任務上，Magma 也與在更大資料集上訓練的流行大型多模態模型相比，表現得很好。我們公開我們的模型和程式碼，以便在 https://microsoft.github.io/Magma 上重現。
+
+##### **SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation**
+2502.13128v1 by Zihan Liu, Shuangrui Ding, Zhixiong Zhang, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Dahua Lin, Jiaqi Wang
+
+Text-to-song generation, the task of creating vocals and accompaniment from
+textual inputs, poses significant challenges due to domain complexity and data
+scarcity. Existing approaches often employ multi-stage generation procedures,
+resulting in cumbersome training and inference pipelines. In this paper, we
+propose SongGen, a fully open-source, single-stage auto-regressive transformer
+designed for controllable song generation. The proposed model facilitates
+fine-grained control over diverse musical attributes, including lyrics and
+textual descriptions of instrumentation, genre, mood, and timbre, while also
+offering an optional three-second reference clip for voice cloning. Within a
+unified auto-regressive framework, SongGen supports two output modes: mixed
+mode, which generates a mixture of vocals and accompaniment directly, and
+dual-track mode, which synthesizes them separately for greater flexibility in
+downstream applications. We explore diverse token pattern strategies for each
+mode, leading to notable improvements and valuable insights. Furthermore, we
+design an automated data preprocessing pipeline with effective quality control.
+To foster community engagement and future research, we will release our model
+weights, training code, annotated data, and preprocessing pipeline. The
+generated samples are showcased on our project page at
+https://liuzh-19.github.io/SongGen/ , and the code will be available at
+https://github.com/LiuZH-19/SongGen .
+
+摘要：文字轉歌曲生成，從文字輸入建立人聲和伴奏的任務，由於領域複雜性和資料稀少性，因此構成重大挑戰。現有方法通常採用多階段生成程序，導致訓練和推論管道繁瑣。在本文中，我們提出 SongGen，一個完全開源的單階段自迴歸轉換器，專為可控歌曲生成而設計。所提出的模型促進對各種音樂屬性的細粒度控制，包括歌詞和樂器、類型、情緒和音色的文字描述，同時還提供可選的三秒參考片段以進行語音複製。在統一的自迴歸框架內，SongGen 支援兩種輸出模式：混合模式，直接生成人聲和伴奏的混合，以及雙軌模式，將它們分開合成以提高下游應用程式的靈活性。我們探索每種模式的不同代幣模式策略，從而帶來顯著的改進和有價值的見解。此外，我們設計了一個自動化資料預處理管道，具備有效的品質控制。為了促進社區參與和未來的研究，我們將釋出我們的模型權重、訓練程式碼、註解資料和預處理管道。生成的範例展示在我們的專案頁面 https://liuzh-19.github.io/SongGen/，程式碼將在 https://github.com/LiuZH-19/SongGen 中提供。
+
+##### **Facilitating Long Context Understanding via Supervised Chain-of-Thought Reasoning**
+2502.13127v1 by Jingyang Lin, Andy Wong, Tian Xia, Shenghua He, Hui Wei, Mei Han, Jiebo Luo
+
+Recent advances in Large Language Models (LLMs) have enabled them to process
+increasingly longer sequences, ranging from 2K to 2M tokens and even beyond.
+However, simply extending the input sequence length does not necessarily lead
+to effective long-context understanding. In this study, we integrate
+Chain-of-Thought (CoT) reasoning into LLMs in a supervised manner to facilitate
+effective long-context understanding. To achieve this, we introduce
+LongFinanceQA, a synthetic dataset in the financial domain designed to improve
+long-context reasoning. Unlike existing long-context synthetic data,
+LongFinanceQA includes intermediate CoT reasoning before the final conclusion,
+which encourages LLMs to perform explicit reasoning, improving accuracy and
+interpretability in long-context understanding. To generate synthetic CoT
+reasoning, we propose Property-driven Agentic Inference (PAI), an agentic
+framework that simulates human-like reasoning steps, including property
+extraction, retrieval, and summarization. We evaluate PAI's reasoning
+capabilities by assessing GPT-4o-mini w/ PAI on the Loong benchmark,
+outperforming standard GPT-4o-mini by 20.0%. Furthermore, we fine-tune
+LLaMA-3.1-8B-Instruct on LongFinanceQA, achieving a 24.6% gain on Loong's
+financial subset.
+
+摘要：大型語言模型 (LLM) 的最新進展讓它們能夠處理越來越長的序列，範圍從 2K 到 2M 個符號，甚至更長。
+然而，僅僅延長輸入序列長度並不會必然導致有效的長語境理解。在本研究中，我們以監督的方式將思考鏈 (CoT) 推理整合到 LLM 中，以促進有效的長語境理解。為此，我們引入了 LongFinanceQA，這是一個在金融領域中的合成數據集，旨在改進長語境推理。與現有的長語境合成數據不同，LongFinanceQA 在最終結論之前包含了中間的 CoT 推理，這鼓勵 LLM 執行明確的推理，從而提高長語境理解的準確性和可解釋性。為了生成合成的 CoT 推理，我們提出了基於屬性的主體推理 (PAI)，這是一個模擬類人推理步驟的主體框架，包括屬性提取、檢索和總結。我們通過評估搭載 PAI 的 GPT-4o-mini 在 Loong 基準上的推理能力，使其比標準的 GPT-4o-mini 高出 20.0%，來評估 PAI 的推理能力。此外，我們對 LLaMA-3.1-8B-Instruct 進行了微調，在 Loong 的金融子集中實現了 24.6% 的增益。
+
+##### **RuozhiBench: Evaluating LLMs with Logical Fallacies and Misleading Premises**
+2502.13125v1 by Zenan Zhai, Hao Li, Xudong Han, Zhenxuan Zhang, Yixuan Zhang, Timothy Baldwin, Haonan Li
+
+Recent advances in large language models (LLMs) have shown that they can
+answer questions requiring complex reasoning. However, their ability to
+identify and respond to text containing logical fallacies or deliberately
+misleading premises remains less studied. To address this gap, we introduce
+RuozhiBench, a bilingual dataset comprising 677 carefully curated questions
+that contain various forms of deceptive reasoning, meticulously crafted through
+extensive human effort and expert review. In a comprehensive evaluation of 17
+LLMs from 5 series on RuozhiBench using both open-ended and two-choice
+formats, we conduct extensive analyses on evaluation protocols and result
+patterns. Despite their high scores on conventional benchmarks, these models
+showed limited ability to detect and reason correctly about logical fallacies,
+with even the best-performing model, Claude-3-haiku, achieving only 62%
+accuracy, compared to human accuracy of more than 90%.
+
+摘要：大型語言模型 (LLM) 的最新進展顯示，它們可以回答需要複雜推理的問題。然而，它們識別和回應包含邏輯謬誤或故意誤導前提的文本的能力仍未得到充分研究。為了解決這個差距，我們引入了 RuozhiBench，這是一個雙語資料集，包含 677 個經過仔細策劃的問題，其中包含各種形式的欺騙性推理，並透過廣泛的人力投入和專家審查精心製作。在使用開放式和二選一格式對來自 5 個系列的 17 個 LLM 進行 RuozhiBench 的全面評估中，我們對評估協定和結果模式進行了廣泛的分析。儘管它們在傳統基準測試中獲得了高分，但這些模型在檢測和正確推理邏輯謬誤方面表現出的能力有限，即使是效能最好的模型 Claude-3-haiku，與人類的 90% 以上相比，也只達到了 62% 的準確度。
+
+##### **NaturalReasoning: Reasoning in the Wild with 2.8M Challenging Questions**
+2502.13124v1 by Weizhe Yuan, Jane Yu, Song Jiang, Karthik Padthe, Yang Li, Dong Wang, Ilia Kulikov, Kyunghyun Cho, Yuandong Tian, Jason E Weston, Xian Li
+
+Scaling reasoning capabilities beyond traditional domains such as math and
+coding is hindered by the lack of diverse and high-quality questions. To
+overcome this limitation, we introduce a scalable approach for generating
+diverse and challenging reasoning questions, accompanied by reference answers.
+We present NaturalReasoning, a comprehensive dataset comprising 2.8 million
+questions that span multiple domains, including STEM fields (e.g., Physics,
+Computer Science), Economics, Social Sciences, and more. We demonstrate the
+utility of the questions in NaturalReasoning through knowledge distillation
+experiments which show that NaturalReasoning can effectively elicit and
+transfer reasoning capabilities from a strong teacher model. Furthermore, we
+demonstrate that NaturalReasoning is also effective for unsupervised
+self-training using external reward models or self-rewarding.
+
+摘要：透過超越傳統領域（例如數學和編碼）來擴充推理能力，受到缺乏多元且高品質問題的阻礙。為了克服這個限制，我們引入一個可擴充的方法，用於產生多元且具挑戰性的推理問題，並附上參考答案。我們提出 NaturalReasoning，這是一個包含 280 萬個問題的綜合資料集，涵蓋多個領域，包括 STEM 領域（例如物理、電腦科學）、經濟學、社會科學等等。我們透過知識蒸餾實驗，展示 NaturalReasoning 中問題的實用性，這些實驗顯示 NaturalReasoning 能有效地引發和轉移強大教師模型的推理能力。此外，我們展示 NaturalReasoning 也適用於使用外部獎勵模型或自我獎勵的無監督自我訓練。
+
+##### **Adapting Psycholinguistic Research for LLMs: Gender-inclusive Language in a Coreference Context**
+2502.13120v1 by Marion Bartl, Thomas Brendan Murphy, Susan Leavy
+
+Gender-inclusive language is often used with the aim of ensuring that all
+individuals, regardless of gender, can be associated with certain concepts.
+While psycholinguistic studies have examined its effects in relation to human
+cognition, it remains unclear how Large Language Models (LLMs) process
+gender-inclusive language. Given that commercial LLMs are gaining an
+increasingly strong foothold in everyday applications, it is crucial to examine
+whether LLMs in fact interpret gender-inclusive language neutrally, because the
+language they generate has the potential to influence the language of their
+users. This study examines whether LLM-generated coreferent terms align with a
+given gender expression or reflect model biases. Adapting psycholinguistic
+methods from French to English and German, we find that in English, LLMs
+generally maintain the antecedent's gender but exhibit underlying masculine
+bias. In German, this bias is much stronger, overriding all tested
+gender-neutralization strategies.
+
+摘要：性別包容性語言通常用於確保所有個人，無論性別如何，都能與某些概念聯繫在一起。雖然心理語言學研究已經檢視了它對人類認知的影響，但大型語言模型 (LLM) 如何處理性別包容性語言仍然不清楚。鑑於商業 LLM 在日常應用中越來越站穩腳步，因此至關重要的是要檢查 LLM 是否實際上中立地解釋性別包容性語言，因為它們產生的語言有可能影響其使用者的語言。本研究探討了 LLM 生成的共指術語是否與給定的性別表達一致或反映模型偏見。我們採用法語到英語和德語的心理語言學方法，發現英語中，LLM 通常會保持先行詞的性別，但表現出潛在的男性偏見。在德語中，這種偏見強得多，凌駕於所有經過測試的性別中立化策略。
+
+##### **STEER-ME: Assessing the Microeconomic Reasoning of Large Language Models**
+2502.13119v1 by Narun Raman, Taylor Lundy, Thiago Amin, Jesse Perla, Kevin Leyton-Brown
+
+How should one judge whether a given large language model (LLM) can reliably
+perform economic reasoning? Most existing LLM benchmarks focus on specific
+applications and fail to present the model with a rich variety of economic
+tasks. A notable exception is Raman et al. [2024], who offer an approach for
+comprehensively benchmarking strategic decision-making; however, this approach
+fails to address the non-strategic settings prevalent in microeconomics, such
+as supply-and-demand analysis. We address this gap by taxonomizing
+microeconomic reasoning into $58$ distinct elements, focusing on the logic of
+supply and demand, each grounded in up to $10$ distinct domains, $5$
+perspectives, and $3$ types. The generation of benchmark data across this
+combinatorial space is powered by a novel LLM-assisted data generation protocol
+that we dub auto-STEER, which generates a set of questions by adapting
+handwritten templates to target new domains and perspectives. Because it offers
+an automated way of generating fresh questions, auto-STEER mitigates the risk
+that LLMs will be trained to over-fit evaluation benchmarks; we thus hope that
+it will serve as a useful tool both for evaluating and fine-tuning models for
+years to come. We demonstrate the usefulness of our benchmark via a case study
+on $27$ LLMs, ranging from small open-source models to the current state of the
+art. We examined each model's ability to solve microeconomic problems across
+our whole taxonomy and present the results across a range of prompting
+strategies and scoring metrics.
+
+摘要：如何判斷一個給定的大型語言模型 (LLM) 能否可靠地進行經濟推理？現有的 LLM 基準測試大多專注於特定應用，未能為模型提供豐富多樣的經濟任務。一個值得注意的例外是 Raman 等人 [2024]，他們提供了一種全面評估策略決策制定方法；然而，這種方法無法解決微觀經濟學中普遍存在的非策略性設定，例如供需分析。我們透過將微觀經濟推理分類為 58 個不同的元素來解決這個差距，重點放在供需邏輯上，每個元素都基於多達 10 個不同的領域、5 個觀點和 3 種類型。在這個組合空間中產生基準數據是由一種新穎的 LLM 輔助數據生成協議（我們稱之為 auto-STEER）推動的，它通過調整手寫模板來針對新的領域和觀點來生成一組問題。由於它提供了一種生成新問題的自動化方式，auto-STEER 減輕了 LLM 將被訓練過度配合評估基準測試的風險；因此，我們希望它將成為未來幾年評估和微調模型的有用工具。我們通過一個案例研究展示了我們基準測試的效用，該案例研究涵蓋了 27 個 LLM，從小型開源模型到當前技術狀態。我們檢查了每個模型在我們的整個分類法中解決微觀經濟問題的能力，並在各種提示策略和評分指標中展示了結果。
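+
+As a rough size check on the combinatorial space described in the abstract, and assuming (this is not stated there) that every element could be instantiated in all 10 domains, 5 perspectives, and 3 types, the number of distinct element-context combinations is bounded by:
+
+$$
+58 \times 10 \times 5 \times 3 = 8700
+$$
+
+Since the abstract says "up to", the number of combinations actually covered by the benchmark may be smaller.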
+
+##### **Performance Evaluation of Large Language Models in Statistical Programming**
+2502.13117v1 by Xinyi Song, Kexin Xie, Lina Lee, Ruizhe Chen, Jared M. Clark, Hao He, Haoran He, Jie Min, Xinlei Zhang, Simin Zheng, Zhiyang Zhang, Xinwei Deng, Yili Hong
+
+The programming capabilities of large language models (LLMs) have
+revolutionized automatic code generation and opened new avenues for automatic
+statistical analysis. However, the validity and quality of the generated
+code need to be systematically evaluated before it can be widely adopted.
+Despite their growing prominence, a comprehensive evaluation of statistical
+code generated by LLMs remains scarce in the literature. In this paper, we
+assess the performance of LLMs, including two versions of ChatGPT and one
+version of Llama, in the domain of SAS programming for statistical analysis.
+Our study utilizes a set of statistical analysis tasks encompassing diverse
+statistical topics and datasets. Each task includes a problem description,
+dataset information, and human-verified SAS code. We conduct a comprehensive
+assessment of the quality of SAS code generated by LLMs through human expert
+evaluation based on correctness, effectiveness, readability, executability, and
+the accuracy of output results. The analysis of rating scores reveals that
+while LLMs demonstrate usefulness in generating syntactically correct code,
+they struggle with tasks requiring deep domain understanding and may produce
+redundant or incorrect results. This study offers valuable insights into the
+capabilities and limitations of LLMs in statistical programming, providing
+guidance for future advancements in AI-assisted coding systems for statistical
+analysis.
+
+摘要：大型語言模型 (LLM) 的程式設計功能徹底改變了自動程式碼生成，並為自動統計分析開啟了新途徑。然而，在廣泛採用這些產生的程式碼之前，需要系統性地評估其有效性和品質。儘管其重要性日益提升，但文獻中對於 LLM 產生的統計程式碼的全面評估仍然稀少。在本文中，我們評估了 LLM 的效能，包括兩個版本的 ChatGPT 和一個版本的 Llama，在統計分析的 SAS 程式設計領域。我們的研究利用了一組涵蓋各種統計主題和資料集的統計分析任務。每個任務都包含問題說明、資料集資訊和經過人工驗證的 SAS 程式碼。我們透過基於正確性、有效性、可讀性、可執行性和輸出結果精確度的專家評估，對 LLM 產生的 SAS 程式碼品質進行全面評估。評分結果的分析顯示，儘管 LLM 在產生語法正確的程式碼方面表現出其效用，但它們在需要深入領域理解的任務中會遇到困難，並且可能會產生冗餘或不正確的結果。本研究提供了 LLM 在統計程式設計中能力和限制的寶貴見解，為統計分析的 AI 輔助編碼系統的未來進展提供指導。
+
+##### **Near-Optimal Private Learning in Linear Contextual Bandits**
+2502.13115v1 by Fan Chen, Jiachun Li, Alexander Rakhlin, David Simchi-Levi
+
+We analyze the problem of private learning in generalized linear contextual
+bandits. Our approach is based on a novel method of re-weighted regression,
+yielding an efficient algorithm with regret of order
+$\sqrt{T}+\frac{1}{\alpha}$ and $\sqrt{T}/\alpha$ in the joint and local model
+of $\alpha$-privacy, respectively. Further, we provide near-optimal private
+procedures that achieve dimension-independent rates in private linear models
+and linear contextual bandits. In particular, our results imply that joint
+privacy is almost "for free" in all the settings we consider, partially
+addressing the open problem posed by Azize and Basu (2024).
+
+摘要：我們分析廣義線性情境強盜中私人學習的問題。我們的做法基於重新加權回歸的新方法，產生一種有效率的演算法，其後悔值分別為
+$\sqrt{T}+\frac{1}{\alpha}$ 和 $\sqrt{T}/\alpha$ 在 $\alpha$-隱私的聯合和局部模型中。此外，我們提供近乎最佳的私人程序，在私人線性模型和線性情境強盜中實現與維度無關的比率。特別是，我們的結果表明，在我們考慮的所有設定中，聯合隱私幾乎是「免費」的，部分解決了 Azize 和 Basu (2024) 提出的開放性問題。
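+
+To make the "almost for free" claim concrete, the two regret orders can be compared at illustrative values. The choices $T = 10^4$ and $\alpha = 0.1$ are hypothetical, and constants and logarithmic factors are ignored:
+
+$$
+\underbrace{\sqrt{T} + \frac{1}{\alpha}}_{\text{joint}} = 100 + 10 = 110,
+\qquad
+\underbrace{\frac{\sqrt{T}}{\alpha}}_{\text{local}} = \frac{100}{0.1} = 1000
+$$
+
+In the joint model the privacy cost is additive and does not grow with $T$, whereas in the local model it multiplies the $\sqrt{T}$ term.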
+
+##### **The influence of motion features in temporal perception**
+2502.13114v1 by Rosa Illan Castillo, Javier Valenzuela
+
+This paper examines the role of manner-of-motion verbs in shaping subjective
+temporal perception and emotional resonance. Through four complementary
+studies, we explore how these verbs influence the conceptualization of time,
+examining their use in literal and metaphorical (temporal) contexts. Our
+findings reveal that faster verbs (e.g., fly, zoom) evoke dynamic and engaging
+temporal experiences, often linked to positive emotions and greater agency. In
+contrast, slower verbs (e.g., crawl, drag) convey passivity, monotony, and
+negative emotions, reflecting tedious or constrained experiences of time. These
+effects are amplified in metaphorical contexts, where manner verbs encode
+emotional and experiential nuances that transcend their literal meanings. We
+also find that participants prefer manner verbs over path verbs (e.g., go,
+pass) in emotionally charged temporal contexts, as manner verbs capture the
+experiential and emotional qualities of time more effectively. These findings
+highlight the interplay between language, motion, and emotion in shaping
+temporal perception, offering insights into how linguistic framing influences
+subjective experiences of time.
+
+摘要：本文探討動作方式動詞在形塑主觀時間感知和情緒共鳴中所扮演的角色。透過四項互補的研究，我們探討這些動詞如何影響時間的概念化，並檢視它們在字面和隱喻（時間）語境中的用法。我們的研究結果顯示，較快的動詞（例如飛、飆）會引起動態且引人入勝的時間體驗，通常與正面情緒和較大的自主性有關。相反地，較慢的動詞（例如爬、拖）傳達了被動、單調和負面情緒，反映出乏味或受限的時間體驗。這些效應在隱喻語境中會被放大，其中動作動詞編碼了超越其字面意義的情緒和體驗細微差別。我們還發現，在充滿情緒的時間語境中，參與者偏好動作動詞而非路徑動詞（例如走、經過），因為動作動詞更有效地捕捉了時間的體驗和情緒品質。這些研究結果突顯了語言、動作和情緒之間在形塑時間感知中的交互作用，並提供了語言框架如何影響主觀時間體驗的見解。
+
+##### **Improving Clinical Question Answering with Multi-Task Learning: A Joint Approach for Answer Extraction and Medical Categorization**
+2502.13108v1 by Priyaranjan Pattnayak, Hitesh Laxmichand Patel, Amit Agarwal, Bhargava Kumar, Srikant Panda, Tejaswini Kumar
+
+Clinical Question Answering (CQA) plays a crucial role in medical
+decision-making, enabling physicians to extract relevant information from
+Electronic Medical Records (EMRs). While transformer-based models such as BERT,
+BioBERT, and ClinicalBERT have demonstrated state-of-the-art performance in
+CQA, existing models lack the ability to categorize extracted answers, which is
+critical for structured retrieval, content filtering, and medical decision
+support.
+  To address this limitation, we introduce a Multi-Task Learning (MTL)
+framework that jointly trains CQA models for both answer extraction and medical
+categorization. In addition to predicting answer spans, our model classifies
+responses into five standardized medical categories: Diagnosis, Medication,
+Symptoms, Procedure, and Lab Reports. This categorization enables more
+structured and interpretable outputs, making clinical QA models more useful in
+real-world healthcare settings.
+  We evaluate our approach on emrQA, a large-scale dataset for medical question
+answering. Results show that MTL improves F1-score by 2.2% compared to standard
+fine-tuning, while achieving 90.7% accuracy in answer categorization. These
+findings suggest that MTL not only enhances CQA performance but also introduces
+an effective mechanism for categorization and structured medical information
+retrieval.
+
+摘要：臨床問答 (CQA) 在醫療決策中扮演著至關重要的角色，讓醫師能夠從電子病歷 (EMR) 中擷取相關資訊。儘管 BERT、BioBERT 和 ClinicalBERT 等基於轉換器的模型已在 CQA 中展現出最先進的效能，但現有的模型缺乏分類擷取答案的能力，這對於結構化檢索、內容過濾和醫療決策支援至關重要。
+  為了解決這個限制，我們引進了一個多任務學習 (MTL) 架構，它同時訓練 CQA 模型用於答案擷取和醫療分類。除了預測答案範圍，我們的模型將回應分類為五個標準化醫療類別：診斷、藥物、症狀、程序和實驗室報告。這種分類能產生更結構化且易於理解的輸出，讓臨床問答模型在真實世界的醫療保健環境中更實用。
+  我們在 emrQA 上評估我們的做法，emrQA 是用於醫療問題解答的大規模資料集。結果顯示，與標準微調相比，MTL 將 F1 分數提高了 2.2%，同時在答案分類中達到 90.7% 的準確度。這些發現表明，MTL 不僅增強了 CQA 的效能，還引入了一種分類和結構化醫療資訊檢索的有效機制。
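+
+A common way to set up joint training of this kind is a weighted sum of per-task losses. The form below is a generic multi-task sketch, not necessarily the authors' formulation; the weight $\lambda$ and the symbols $\mathcal{L}_{\text{span}}$ and $\mathcal{L}_{\text{cat}}$ are assumed names:
+
+$$
+\mathcal{L} = \mathcal{L}_{\text{span}} + \lambda \, \mathcal{L}_{\text{cat}}
+$$
+
+Here $\mathcal{L}_{\text{span}}$ would be the answer-extraction loss (for example, cross-entropy over start and end positions) and $\mathcal{L}_{\text{cat}}$ a cross-entropy loss over the five medical categories.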
+
+##### **MatterChat: A Multi-Modal LLM for Material Science**
+2502.13107v1 by Yingheng Tang, Wenbin Xu, Jie Cao, Jianzhu Ma, Weilu Gao, Steve Farrell, Benjamin Erichson, Michael W. Mahoney, Andy Nonaka, Zhi Yao
+
+Understanding and predicting the properties of inorganic materials is crucial
+for accelerating advancements in materials science and driving applications in
+energy, electronics, and beyond. Integrating material structure data with
+language-based information through multi-modal large language models (LLMs)
+offers great potential to support these efforts by enhancing human-AI
+interaction. However, a key challenge lies in integrating atomic structures at
+full resolution into LLMs. In this work, we introduce MatterChat, a versatile
+structure-aware multi-modal LLM that unifies material structural data and
+textual inputs into a single cohesive model. MatterChat employs a bridging
+module to effectively align a pretrained machine learning interatomic potential
+with a pretrained LLM, reducing training costs and enhancing flexibility. Our
+results demonstrate that MatterChat significantly improves performance in
+material property prediction and human-AI interaction, surpassing
+general-purpose LLMs such as GPT-4. We also demonstrate its usefulness in
+applications such as more advanced scientific reasoning and step-by-step
+material synthesis.
+
+摘要：了解和預測無機材料的特性對於加速材料科學的進步和推動能源、電子等方面的應用至關重要。透過多模態大型語言模型 (LLM) 將材料結構數據與基於語言的資訊整合，可以極大程度地支持這些工作，藉此增強人類與 AI 的互動。然而，一個關鍵挑戰在於將原子結構以完整解析度整合到 LLM 中。在這項工作中，我們引入了 MatterChat，這是一個通用的結構感知多模態 LLM，它將材料結構數據和文字輸入統一到一個單一的內聚模型中。MatterChat 採用橋接模組，將預先訓練好的機器學習原子間電位與預先訓練好的 LLM 有效地對齊，從而降低訓練成本並增強靈活性。我們的結果表明，MatterChat 大幅提升了材料特性預測和人類與 AI 互動的效能，超越了 GPT-4 等通用 LLM。我們也展示了它在更進階的科學推理和逐步材料合成等應用中的效用。
+
+##### **Understanding and Rectifying Safety Perception Distortion in VLMs**
+2502.13095v1 by Xiaohan Zou, Jian Kang, George Kesidis, Lu Lin
+
+Recent studies reveal that vision-language models (VLMs) become more
+susceptible to harmful requests and jailbreak attacks after integrating the
+vision modality, exhibiting greater vulnerability than their text-only LLM
+backbones. To uncover the root cause of this phenomenon, we conduct an in-depth
+analysis and identify a key issue: multimodal inputs introduce a
+modality-induced activation shift toward a "safer" direction compared to their
+text-only counterparts, leading VLMs to systematically overestimate the safety
+of harmful inputs. We refer to this issue as safety perception distortion. To
+mitigate such distortion, we propose Activation Shift Disentanglement and
+Calibration (ShiftDC), a training-free method that decomposes and calibrates
+the modality-induced activation shift to reduce the impact of modality on
+safety. By isolating and removing the safety-relevant component, ShiftDC
+restores the inherent safety alignment of the LLM backbone while preserving the
+vision-language capabilities of VLMs. Empirical results demonstrate that
+ShiftDC significantly enhances alignment performance on safety benchmarks
+without impairing model utility.
+
+摘要：最近的研究表明，在整合了视觉模态后，视觉语言模型 (VLM) 更容易受到有害请求和越狱攻击，表现出比其仅文本的 LLM 主干更大的漏洞。为了揭示这种现象的根本原因，我们进行了深入分析，并确定了一个关键问题：与仅文本的对应物相比，多模态输入引入了朝“更安全”方向的模态诱导激活转移，导致 VLM 系统性地高估有害输入的安全性。我们将此问题称为安全感知扭曲。为了减轻这种扭曲，我们提出了激活转移解耦和校准 (ShiftDC)，这是一种无训练方法，用于分解和校准模态诱导的激活转移，以减少模态对安全性的影响。通过隔离和移除与安全性相关的组件，ShiftDC 恢复了 LLM 主干的固有安全性对齐，同时保留了 VLM 的视觉语言能力。实证结果表明，ShiftDC 在不损害模型效用的情况下，显著增强了安全基准上的对齐性能。
+
+##### **Text2World: Benchmarking Large Language Models for Symbolic World Model Generation**
+2502.13092v1 by Mengkang Hu, Tianxing Chen, Yude Zou, Yuheng Lei, Qiguang Chen, Ming Li, Hongyuan Zhang, Wenqi Shao, Ping Luo
+
+Recently, there has been growing interest in leveraging large language models
+(LLMs) to generate symbolic world models from textual descriptions. Although
+LLMs have been extensively explored in the context of world modeling, prior
+studies encountered several challenges, including evaluation randomness,
+dependence on indirect metrics, and a limited domain scope. To address these
+limitations, we introduce a novel benchmark, Text2World, based on planning
+domain definition language (PDDL), featuring hundreds of diverse domains and
+employing multi-criteria, execution-based metrics for a more robust evaluation.
+We benchmark current LLMs using Text2World and find that reasoning models
+trained with large-scale reinforcement learning outperform others. However,
+even the best-performing model still demonstrates limited capabilities in world
+modeling. Building on these insights, we examine several promising strategies
+to enhance the world modeling capabilities of LLMs, including test-time
+scaling, agent training, and more. We hope that Text2World can serve as a
+crucial resource, laying the groundwork for future research in leveraging LLMs
+as world models. The project page is available at
+https://text-to-world.github.io/.
+
+摘要：最近，人们越来越有兴趣利用大型语言模型（LLM）从文本描述中生成符号世界模型。尽管 LLM 已在世界建模的背景下得到广泛探索，但先前的研究遇到了若干挑战，包括评估随机性、对间接指标的依赖以及有限的领域范围。为了解决这些限制，我们引入了基于规划域定义语言（PDDL）的新基准 Text2World，该基准包含数百个不同的域，并采用基于执行的多标准指标来进行更稳健的评估。我们使用 Text2World 对当前的 LLM 进行了基准测试，发现使用大规模强化学习训练的推理模型优于其他模型。然而，即使是性能最佳的模型在世界建模方面仍然表现出有限的能力。基于这些见解，我们研究了几种有希望的策略来增强 LLM 的世界建模能力，包括测试时缩放、代理训练等等。我们希望 Text2World 能够作为一项至关重要的资源，为未来利用 LLM 作为世界模型的研究奠定基础。项目页面可在 https://text-to-world.github.io/ 获得。
+
+##### **KAPPA: A Generic Patent Analysis Framework with Keyphrase-Based Portraits**
+2502.13076v1 by Xin Xia, Yujin Wang, Jun Zhou, Guisheng Zhong, Linning Cai, Chen Zhang
+
+Patent analysis relies heavily on concise and interpretable document
+representations, referred to as patent portraits. Keyphrases, both present and
+absent, are ideal candidates for patent portraits due to their brevity,
+representativeness, and clarity. In this paper, we introduce KAPPA, an
+integrated framework designed to construct keyphrase-based patent portraits and
+enhance patent analysis. KAPPA operates in two phases: patent portrait
+construction and portrait-based analysis. To ensure effective portrait
+construction, we propose a semantic-calibrated keyphrase generation paradigm
+that integrates pre-trained language models with a prompt-based hierarchical
+decoding strategy to leverage the multi-level structural characteristics of
+patents. For portrait-based analysis, we develop a comprehensive framework that
+employs keyphrase-based patent portraits to enable efficient and accurate
+patent analysis. In extensive experiments on benchmark datasets for keyphrase
+generation, the proposed model achieves significant improvements compared to
+state-of-the-art baselines. Further experiments conducted on real-world patent
+applications demonstrate that our keyphrase-based portraits effectively capture
+domain-specific knowledge and enrich semantic representation for patent
+analysis tasks.
+
+摘要：專利分析高度依賴簡潔且可解讀的文件表示，稱為專利描述。關鍵字組，無論是存在的還是不存在的，都是專利描述的理想候選者，因為它們簡潔、具有代表性且清晰。在本文中，我們介紹了 KAPPA，一個用於建構基於關鍵字組的專利描述和增強專利分析的整合式架構。KAPPA 分為兩個階段執行：專利描述建構和基於描述的分析。為確保有效的描述建構，我們提出了一個語義校準關鍵字組生成範例，它將預先訓練的語言模型與基於提示的分層解碼策略整合在一起，以利用專利的多分層結構特性。對於基於描述的分析，我們開發了一個全面的架構，它採用基於關鍵字組的專利描述，以實現高效且準確的專利分析。在關鍵字組生成基準資料集上進行的廣泛實驗中，與最先進的基準線相比，所提出的模型取得了顯著的改進。在真實世界專利申請上進行的進一步實驗表明，我們基於關鍵字組的描述有效地擷取了特定領域的知識，並豐富了專利分析任務的語義表示。
+
+##### **Interactive Agents to Overcome Ambiguity in Software Engineering**
+2502.13069v1 by Sanidhya Vijayvargiya, Xuhui Zhou, Akhila Yerukola, Maarten Sap, Graham Neubig
+
+AI agents are increasingly being deployed to automate tasks, often based on
+ambiguous and underspecified user instructions. Making unwarranted assumptions
+and failing to ask clarifying questions can lead to suboptimal outcomes, safety
+risks due to tool misuse, and wasted computational resources. In this work, we
+study the ability of LLM agents to handle ambiguous instructions in interactive
+code generation settings by evaluating proprietary and open-weight models on
+their performance across three key steps: (a) leveraging interactivity to
+improve performance in ambiguous scenarios, (b) detecting ambiguity, and (c)
+asking targeted questions. Our findings reveal that models struggle to
+distinguish between well-specified and underspecified instructions. However,
+when models interact for underspecified inputs, they effectively obtain vital
+information from the user, leading to significant improvements in performance
+and underscoring the value of effective interaction. Our study highlights
+critical gaps in how current state-of-the-art models handle ambiguity in
+complex software engineering tasks and structures the evaluation into distinct
+steps to enable targeted improvements.
+
+摘要：人工智能代理正越來越多地被部署用於自動化任務，通常基於模棱兩可且未明確規定的使用者指令。做出不合理的假設且未能提出澄清問題，可能導致次佳結果、因工具誤用而產生的安全風險，以及浪費運算資源。在這項工作中，我們研究了 LLM 代理在互動式程式碼生成設定中處理模棱兩可指令的能力，方法是在三個關鍵步驟中評估專有和開放權重的模型： (a) 利用互動性來提升在模棱兩可場景中的效能、(b) 偵測模糊性，以及 (c) 提出目標問題。我們的研究結果顯示，模型難以區分明確規範的指令和未明確規範的指令。然而，當模型針對未明確規範的輸入進行互動時，它們會有效地從使用者取得重要資訊，進而大幅提升效能，並強調有效互動的價值。我們的研究突顯了目前最先進的模型在處理複雜軟體工程任務中的模糊性時存在哪些關鍵差距，並將評估架構為不同的步驟，以促成有目標的改善。
+
+##### **Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity**
+2502.13063v1 by Yuri Kuratov, Mikhail Arkhipov, Aydar Bulatov, Mikhail Burtsev
+
+A range of recent works addresses the problem of compressing a sequence of
+tokens into a shorter sequence of real-valued vectors to be used as inputs
+instead of token embeddings or key-value cache. These approaches make it
+possible to reduce the amount of compute in existing language models. Despite relying on
+powerful models as encoders, the maximum attainable lossless compression ratio
+is typically not higher than x10. This fact is highly intriguing because, in
+theory, the maximum information capacity of large real-valued vectors is far
+beyond the presented rates even for 16-bit precision and a modest vector size.
+In this work, we explore the limits of compression by replacing the encoder
+with a per-sample optimization procedure. We show that vectors with compression
+ratios up to x1500 exist, which highlights a gap of two orders of magnitude between
+existing and practically attainable solutions. Furthermore, we empirically show
+that the compression limits are determined not by the length of the input but
+by the amount of uncertainty to be reduced, namely, the cross-entropy loss on
+this sequence without any conditioning. The obtained limits highlight the
+substantial gap between the theoretical capacity of input embeddings and their
+practical utilization, suggesting significant room for optimization in model
+design.
+
+摘要：一系列近期作品探讨了将序列标记压缩成较短的实值向量序列的问题，以用作输入，而不是标记嵌入或键值缓存。这些方法允许减少现有语言模型中的计算量。尽管依赖于强大的模型作为编码器，但最大可达到的无损压缩比通常不高于 x10。这一事实非常有趣，因为理论上，即使对于 16 位精度和适中的向量大小，大型实值向量的最大信息容量也远远超出了所呈现的速率。在这项工作中，我们通过用按样本优化程序替换编码器来探索压缩的极限。我们表明，存在压缩比高达 x1500 的向量，这突出了现有解决方案和实际可实现解决方案之间两个数量级的差距。此外，我们凭经验表明，压缩极限不是由输入的长度决定的，而是由要减少的不确定性量决定的，即在此序列上的交叉熵损失，没有任何条件。获得的极限突出了输入嵌入的理论容量与其实际利用之间的巨大差距，表明模型设计中有很大的优化空间。
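+
+A back-of-the-envelope bit count illustrates the capacity argument. The vector size $d = 4096$ and vocabulary size $|V| = 128000$ below are hypothetical values chosen for illustration, not figures from the paper. A $d$-dimensional vector stored at 16-bit precision holds $16d$ bits, and a uniformly random token drawn from $V$ carries $\log_2 |V|$ bits, so the number of such tokens one vector could hold is roughly:
+
+$$
+\frac{16 d}{\log_2 |V|} = \frac{16 \times 4096}{\log_2 128000} \approx \frac{65536}{16.97} \approx 3.9 \times 10^{3}
+$$
+
+Under these assumptions a single vector could in principle encode thousands of tokens, compared with the ratio of at most about x10 reported for encoder-based methods. The gap between the two reported ratios is $1500 / 10 = 150$, i.e. about two orders of magnitude.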
+
+##### **AI-Assisted Decision Making with Human Learning**
+2502.13062v1 by Gali Noti, Kate Donahue, Jon Kleinberg, Sigal Oren
+
+AI systems increasingly support human decision-making. In many cases, despite
+the algorithm's superior performance, the final decision remains in human
+hands. For example, an AI may assist doctors in determining which diagnostic
+tests to run, but the doctor ultimately makes the diagnosis. This paper studies
+such AI-assisted decision-making settings, where the human learns through
+repeated interactions with the algorithm. In our framework, the algorithm --
+designed to maximize decision accuracy according to its own model -- determines
+which features the human can consider. The human then makes a prediction based
+on their own less accurate model. We observe that the discrepancy between the
+algorithm's model and the human's model creates a fundamental tradeoff. Should
+the algorithm prioritize recommending more informative features, encouraging
+the human to recognize their importance, even if it results in less accurate
+predictions in the short term until learning occurs? Or is it preferable to
+forgo educating the human and instead select features that align more closely
+with their existing understanding, minimizing the immediate cost of learning?
+This tradeoff is shaped by the algorithm's time-discounted objective and the
+human's learning ability. Our results show that optimal feature selection has a
+surprisingly clean combinatorial characterization, reducible to a stationary
+sequence of feature subsets that is tractable to compute. As the algorithm
+becomes more "patient" or the human's learning improves, the algorithm
+increasingly selects more informative features, enhancing both prediction
+accuracy and the human's understanding. Notably, early investment in learning
+leads to the selection of more informative features than a later investment. We
+complement our analysis by showing that the impact of errors in the algorithm's
+knowledge is limited as it does not make the prediction directly.
+
+摘要：人工智慧系統日益支援人類決策。在許多情況下，儘管演算法的效能優異，最終決策仍掌握在人類手中。例如，人工智慧可能會協助醫生決定要執行哪些診斷測試，但最終下診斷的是醫生。本文探討此類人工智慧輔助決策設定，其中人類透過與演算法重複互動而學習。在我們的架構中，演算法（旨在根據其自身模型最大化決策準確度）會決定人類可以考量的特徵。然後，人類根據其自身較不準確的模型做出預測。我們觀察到，演算法模型與人類模型之間的差異會產生基本的權衡。演算法是否應優先推薦更多資訊性特徵，鼓勵人類認識其重要性，即使短期內會導致準確度較低的預測，直到學習發生？或者，是否較好放棄教育人類，而選擇與其現有理解更緊密對齊的特徵，將學習的立即成本降至最低？這種權衡取決於演算法的時間折現目標和人類的學習能力。我們的結果表明，最佳特徵選擇具有令人驚訝的乾淨組合特徵，可簡化為可計算的固定特徵子集序列。隨著演算法變得更「有耐心」或人類的學習進步，演算法會越來越多地選擇更多資訊性特徵，增強預測準確度和人類的理解。值得注意的是，早期投資於學習會導致選擇比後期投資更多資訊性特徵。我們透過顯示演算法知識中錯誤的影響是有限的，因為它不會直接做出預測，來補充我們的分析。
+
+##### **Improved Fine-Tuning of Large Multimodal Models for Hateful Meme Detection**
+2502.13061v1 by Jingbiao Mei, Jinghong Chen, Guangyu Yang, Weizhe Lin, Bill Byrne
+
+Hateful memes have become a significant concern on the Internet,
+necessitating robust automated detection systems. While large multimodal models
+have shown strong generalization across various tasks, they exhibit poor
+generalization to hateful meme detection due to the dynamic nature of memes
+tied to emerging social trends and breaking news. Recent work further
+highlights the limitations of conventional supervised fine-tuning for large
+multimodal models in this context. To address these challenges, we propose
+Large Multimodal Model Retrieval-Guided Contrastive Learning (LMM-RGCL), a
+novel two-stage fine-tuning framework designed to improve both in-domain
+accuracy and cross-domain generalization. Experimental results on six widely
+used meme classification datasets demonstrate that LMM-RGCL achieves
+state-of-the-art performance, outperforming agent-based systems such as
+VPD-PALI-X-55B. Furthermore, our method effectively generalizes to
+out-of-domain memes under low-resource settings, surpassing models like GPT-4o.
+
+摘要：網路上的仇恨迷因已成為一大隱憂，因此需要強大的自動化偵測系統。雖然大型多模態模型已在各種任務中展現出強大的泛化能力，但由於迷因與新興社會趨勢和突發新聞息息相關，因此在仇恨迷因偵測方面表現不佳。最近的研究進一步強調了在這種情況下，傳統監督微調對大型多模態模型的限制。為了應對這些挑戰，我們提出了大型多模態模型檢索引導對比學習 (LMM-RGCL)，這是一種新穎的兩階段微調架構，旨在提高領域內準確度和跨領域泛化能力。在六個廣泛使用的迷因分類資料集上的實驗結果表明，LMM-RGCL 達到了最先進的效能，優於基於代理的系統，例如 VPD-PALI-X-55B。此外，我們的模型在低資源設定下有效泛化到領域外迷因，超越了 GPT-4o 等模型。
+
+##### **SimpleVQA: Multimodal Factuality Evaluation for Multimodal Large Language Models**
+2502.13059v1 by Xianfu Cheng, Wei Zhang, Shiwei Zhang, Jian Yang, Xiangyuan Guan, Xianjie Wu, Xiang Li, Ge Zhang, Jiaheng Liu, Yuying Mai, Yutao Zeng, Zhoufutu Wen, Ke Jin, Baorui Wang, Weixiao Zhou, Yunhong Lu, Tongliang Li, Wenhao Huang, Zhoujun Li
+
+The increasing application of multi-modal large language models (MLLMs)
+across various sectors has spotlighted the importance of their output reliability
+and accuracy, particularly their ability to produce content grounded in factual
+information (e.g. common and domain-specific knowledge). In this work, we
+introduce SimpleVQA, the first comprehensive multi-modal benchmark to evaluate
+the factuality ability of MLLMs to answer natural language short questions.
+SimpleVQA is characterized by six key features: it covers multiple tasks and
+multiple scenarios, ensures high quality and challenging queries, maintains
+static and timeless reference answers, and is straightforward to evaluate. Our
+approach involves categorizing visual question-answering items into 9 different
+tasks around objective events or common knowledge and situating these within 9
+topics. Rigorous quality control processes are implemented to guarantee
+high-quality, concise, and clear answers, facilitating evaluation with minimal
+variance via an LLM-as-a-judge scoring system. Using SimpleVQA, we perform a
+comprehensive assessment of 18 leading MLLMs and 8 text-only LLMs, delving into
+their image comprehension and text generation abilities by identifying and
+analyzing error cases.
+
+摘要：隨著多模態大型語言模型 (MLLM) 在各個領域的應用日益普及，其輸出結果的可靠性和準確性已備受關注，特別是其根據事實資訊（例如一般知識和特定領域知識）產生內容的能力。在本文中，我們介紹 SimpleVQA，這是第一個用於評估 MLLM 回答自然語言簡短問題的事實能力的綜合多模態基準。SimpleVQA 有六個主要特徵：涵蓋多項任務和多種情境、確保高品質且具挑戰性的查詢、維護靜態且永恆的參考答案，而且評估起來很簡單。我們的做法是將視覺問答項目分類為 9 個不同的任務，圍繞客觀事件或常識，並將它們置於 9 個主題中。我們實施嚴格的品質控管流程，以保證答案的高品質、簡潔和清晰，並透過 LLM 作為評分系統，以最小的差異進行評估。我們使用 SimpleVQA 對 18 個主要的 MLLM 和 8 個純文字 LLM 進行全面評估，透過找出和分析錯誤案例，深入探討它們的影像理解和文字生成能力。
+
+##### **LAMD: Context-driven Android Malware Detection and Classification with LLMs**
+2502.13055v1 by Xingzhi Qian, Xinran Zheng, Yiling He, Shuo Yang, Lorenzo Cavallaro
+
+The rapid growth of mobile applications has escalated Android malware
+threats. Although there are numerous detection methods, they often struggle
+with evolving attacks, dataset biases, and limited explainability. Large
+Language Models (LLMs) offer a promising alternative with their zero-shot
+inference and reasoning capabilities. However, applying LLMs to Android malware
+detection presents two key challenges: (1) the extensive support code in Android
+applications, often spanning thousands of classes, exceeds LLMs' context limits
+and obscures malicious behavior within benign functionality; (2) the structural
+complexity and interdependencies of Android applications surpass LLMs'
+sequence-based reasoning, fragmenting code analysis and hindering malicious
+intent inference. To address these challenges, we propose LAMD, a practical
+context-driven framework to enable LLM-based Android malware detection. LAMD
+integrates key context extraction to isolate security-critical code regions and
+construct program structures, then applies tier-wise code reasoning to analyze
+application behavior progressively, from low-level instructions to high-level
+semantics, providing a final prediction and explanation. A well-designed factual
+consistency verification mechanism is included to mitigate LLM hallucinations
+from the first tier. Evaluation in real-world settings demonstrates LAMD's
+effectiveness over conventional detectors, establishing a feasible basis for
+LLM-driven malware analysis in dynamic threat landscapes.
+
+摘要：隨著行動應用程式快速成長，Android 惡意軟體威脅也隨之升級。雖然有許多偵測方法，但它們經常難以應付不斷演進的攻擊、資料集偏差和有限的可解釋性。大型語言模型 (LLM) 提供了一個有前途的替代方案，具備零次學習推理和推理能力。然而，將 LLM 應用於 Android 惡意軟體偵測會出現兩個主要挑戰：(1) Android 應用程式中大量的支援程式碼，通常橫跨數千個類別，超過 LLM 的上下文限制，並模糊了良性功能中的惡意行為；(2) Android 應用程式的結構複雜性和相互依賴性超過 LLM 的基於序列的推理，會造成程式碼分析破碎，並阻礙惡意意圖推論。為了應對這些挑戰，我們提出了 LAMD，一個實用的脈絡驅動架構，以支援基於 LLM 的 Android 惡意軟體偵測。LAMD 整合了關鍵脈絡萃取，以隔離與安全性至關重要的程式碼區域並建構程式結構，然後套用分層式程式碼推理，逐步分析應用程式行為，從低階指令到高階語意，提供最終預測和說明。一個設計良好的事實一致性驗證機制具備減輕 LLM 從第一層產生的幻覺的能力。在真實環境中的評估顯示，LAMD 優於傳統偵測器，為動態威脅環境中的 LLM 驅動惡意軟體分析建立了一個可行的基礎。
+
+##### **Do we still need Human Annotators? Prompting Large Language Models for Aspect Sentiment Quad Prediction**
+2502.13044v1 by Nils Constantin Hellwig, Jakob Fehle, Udo Kruschwitz, Christian Wolff
+
+Aspect sentiment quadruple prediction (ASQP) facilitates a detailed
+understanding of opinions expressed in a text by identifying the opinion term,
+aspect term, aspect category and sentiment polarity for each opinion. However,
+annotating a full set of training examples to fine-tune models for ASQP is a
+resource-intensive process. In this study, we explore the capabilities of large
+language models (LLMs) for zero- and few-shot learning on the ASQP task across
+five diverse datasets. We report F1 scores slightly below those obtained with
+state-of-the-art fine-tuned models but exceeding previously reported zero- and
+few-shot performance. In the 40-shot setting on the Rest16 restaurant domain
+dataset, LLMs achieved an F1 score of 52.46, compared to 60.39 by the
+best-performing fine-tuned method MVP. Additionally, we report the performance
+of LLMs in target aspect sentiment detection (TASD), where the F1 scores were
+also close to fine-tuned models, achieving 66.03 on Rest16 in the 40-shot
+setting, compared to 72.76 with MVP. While human annotators remain essential
+for achieving optimal performance, LLMs can reduce the need for extensive
+manual annotation in ASQP tasks.
+
+摘要：面向觀點的四元預測 (ASQP) 透過辨識各個觀點的觀點詞彙、面向詞彙、面向類別和觀點極性，協助詳細了解文字中表達的意見。然而，標註一組完整的訓練範例以微調 ASQP 模型是一個耗費資源的過程。在這項研究中，我們探討大型語言模型 (LLM) 在 ASQP 任務中進行零次和少量學習的能力，橫跨五個不同的資料集。我們報告的 F1 分數略低於使用最先進的微調模型獲得的分數，但超過先前報告的零次和少量學習表現。在 Rest16 餐廳領域資料集的 40 次學習設定中，LLM 達到了 52.46 的 F1 分數，而效能最佳的微調方法 MVP 則為 60.39。此外，我們報告了 LLM 在目標面向觀點偵測 (TASD) 中的表現，其中 F1 分數也接近微調模型，在 40 次學習設定中於 Rest16 達到 66.03，而 MVP 則為 72.76。儘管人類標註員對於達成最佳效能仍然至關重要，但 LLM 可以減少 ASQP 任務中廣泛手動標註的需求。
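+
+For reference, the gaps to the fine-tuned MVP baseline on Rest16 in the 40-shot setting, computed directly from the F1 scores quoted in the abstract:
+
+$$
+\Delta_{\text{ASQP}} = 60.39 - 52.46 = 7.93,
+\qquad
+\Delta_{\text{TASD}} = 72.76 - 66.03 = 6.73
+$$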
+
+##### **Natural Language Generation from Visual Sequences: Challenges and Future Directions**
+2502.13034v1 by Aditya K Surikuchi, Raquel Fernández, Sandro Pezzelle
+
+The ability to use natural language to talk about visual content is at the
+core of human intelligence and a crucial feature of any artificial intelligence
+system. Various studies have focused on generating text for single images. In
+contrast, comparatively little attention has been paid to exhaustively
+analyzing and advancing work on multiple-image vision-to-text settings. In this
+position paper, we claim that any task dealing with temporally ordered
+sequences of multiple images or frames is an instance of a broader, more
+general problem involving the understanding of intricate relationships between
+the visual content and the corresponding text. We comprehensively analyze five
+tasks that are instances of this problem and argue that they pose a common set
+of challenges and share similarities in terms of modeling and evaluation
+approaches. Based on the insights from these various aspects and stages of
+multi-image-to-text generation, we highlight several open questions and suggest
+future research directions. We believe that these directions can advance the
+understanding of complex phenomena in this domain and the development of better
+models.
+
+摘要：使用自然語言來談論視覺內容的能力是人類智慧的核心，也是任何人工智慧系統的一項關鍵功能。各種研究都專注於為單一影像產生文字。相較之下，對於詳盡分析和推進多重影像視覺轉文字設定的工作，關注較少。在此立場文件中，我們聲稱任何處理多重影像或畫格的時間順序序列的任務，都是一個更廣泛、更普遍問題的範例，涉及理解視覺內容和對應文字之間的複雜關係。我們全面分析了此問題的五個範例任務，並論證它們提出了一組常見的挑戰，且在建模和評估方法方面有相似之處。根據多重影像轉文字生成的這些不同面向和階段的見解，我們突出了幾個開放性問題，並建議未來的研究方向。我們相信這些方向可以推進對此領域中複雜現象的理解，以及開發出更好的模型。
+
+##### **HPSS: Heuristic Prompting Strategy Search for LLM Evaluators**
+2502.13031v1 by Bosi Wen, Pei Ke, Yufei Sun, Cunxiang Wang, Xiaotao Gu, Jinfeng Zhou, Jie Tang, Hongning Wang, Minlie Huang
+
+Since the adoption of large language models (LLMs) for text evaluation has
+become increasingly prevalent in the field of natural language processing
+(NLP), a series of existing works attempt to optimize the prompts for LLM
+evaluators to improve their alignment with human judgment. However, their
+efforts are limited to optimizing individual factors of evaluation prompts,
+such as evaluation criteria or output formats, neglecting the combinatorial
+impact of multiple factors, which leads to insufficient optimization of the
+evaluation pipeline. Nevertheless, identifying well-behaved prompting
+strategies for adjusting multiple factors requires extensive enumeration. To
+this end, we comprehensively integrate 8 key factors for evaluation prompts and
+propose a novel automatic prompting strategy optimization method called
+Heuristic Prompting Strategy Search (HPSS). Inspired by the genetic algorithm,
+HPSS conducts an iterative search to find well-behaved prompting strategies for
+LLM evaluators. A heuristic function is employed to guide the search process,
+enhancing the performance of our algorithm. Extensive experiments across four
+evaluation tasks demonstrate the effectiveness of HPSS, consistently
+outperforming both human-designed evaluation prompts and existing automatic
+prompt optimization methods.
+
+摘要：隨著自然語言處理（NLP）領域中採用大型語言模型（LLM）進行文本評估變得越來越普遍，一系列現有工作嘗試優化 LLM 評估器的提示，以改善它們與人類判斷的一致性。然而，他們的努力僅限於優化評估提示的個別因素，例如評估準則或輸出格式，而忽略了多種因素的組合影響，這導致評估管道優化不足。儘管如此，找出調整多種因素的良好提示策略需要廣泛的枚舉。為此，我們全面整合了評估提示的 8 個關鍵因素，並提出了一種名為啟發式提示策略搜索（HPSS）的新型自動提示策略優化方法。在遺傳演算法的啟發下，HPSS 進行反覆搜索以找出 LLM 評估器的良好提示策略。採用啟發式函數來指導搜索過程，增強了我們演算法的效能。在四項評估任務中進行的廣泛實驗證明了 HPSS 的有效性，始終優於人類設計的評估提示和現有的自動提示優化方法。
+
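The HPSS abstract above describes an iterative, genetic-algorithm-inspired search over combinations of prompt factors. A minimal sketch of that general idea follows, not the authors' algorithm: the factor names, their options, and `toy_score` are all hypothetical stand-ins (the abstract does not list the 8 factors), and the heuristic guidance mentioned in the abstract is reduced here to plain elitist selection.

```python
import random

# Hypothetical factors and options, for illustration only.
FACTORS = {
    "criteria": ["none", "brief", "detailed"],
    "output_format": ["score_only", "explanation_then_score"],
    "reference": ["without", "with"],
}


def toy_score(strategy):
    """Stand-in for agreement with human judgment (assumed, not from the paper)."""
    weights = {"detailed": 2, "brief": 1, "explanation_then_score": 1, "with": 1}
    return sum(weights.get(value, 0) for value in strategy.values())


def mutate(strategy, rng):
    """Copy a strategy and resample the option of one randomly chosen factor."""
    child = dict(strategy)
    factor = rng.choice(sorted(FACTORS))
    child[factor] = rng.choice(FACTORS[factor])
    return child


def search(generations=30, population_size=6, seed=0):
    """Keep the better half of the population each round and refill it by mutation."""
    rng = random.Random(seed)
    population = [
        {name: rng.choice(options) for name, options in FACTORS.items()}
        for _ in range(population_size)
    ]
    for _ in range(generations):
        population.sort(key=toy_score, reverse=True)
        parents = population[: population_size // 2]
        children = [
            mutate(rng.choice(parents), rng)
            for _ in range(population_size - len(parents))
        ]
        population = parents + children
    return max(population, key=toy_score)


best = search()
```

In a real setting `toy_score` would be replaced by running the LLM evaluator with the candidate strategy and measuring its correlation with human ratings, which is the expensive step that motivates searching rather than enumerating.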
+##### **Whose story is it? Personalizing story generation by inferring author styles**
+2502.13028v1 by Nischal Ashok Kumar, Chau Minh Pham, Mohit Iyyer, Andrew Lan
+
+Personalization has become essential for improving user experience in
+interactive writing and educational applications, yet its potential in story
+generation remains largely unexplored. In this work, we propose a novel
+two-stage pipeline for personalized story generation. Our approach first infers
+an author's implicit story-writing characteristics from their past work and
+organizes them into an Author Writing Sheet, inspired by narrative theory. The
+second stage uses this sheet to simulate the author's persona through tailored
+persona descriptions and personalized story writing rules. To enable and
+validate our approach, we construct Mythos, a dataset of 590 stories from 64
+authors across five distinct sources that reflect diverse story-writing
+settings. A head-to-head comparison with a non-personalized baseline
+demonstrates our pipeline's effectiveness in generating high-quality
+personalized stories. Our personalized stories achieve a 75 percent win rate
+(versus 14 percent for the baseline and 11 percent ties) in capturing authors'
+writing style based on their past works. Human evaluation highlights the high
+quality of our Author Writing Sheet and provides valuable insights into the
+personalized story generation task. Notable takeaways are that writings from
+certain sources, such as Reddit, are easier to personalize than others, like
+AO3, while narrative aspects, like Creativity and Language Use, are easier to
+personalize than others, like Plot.
+
+摘要：個人化已成為改善互動式寫作和教育應用程式中使用者體驗的必要手段，然而其在故事生成中的潛力仍未被廣泛探索。在這項工作中，我們提出了一個創新的兩階段流程，用於個人化故事生成。我們的做法首先從作者過去的作品中推論出作者隱含的故事寫作特徵，並根據敘事理論將它們組織成作者寫作表。第二階段使用此表透過量身打造的角色描述和個人化故事寫作規則來模擬作者的角色。為了啟用和驗證我們的做法，我們建構了 Mythos，一個包含來自 64 位作者、橫跨五個不同來源的 590 個故事的資料集，這些故事反映了多樣化的故事寫作設定。與非個人化基準進行一對一的比較，證明了我們的流程在生成高品質個人化故事方面的有效性。我們的個人化故事以 75% 的獲勝率（相較於基準的 14% 和 11% 平手）捕捉到作者基於其過去作品的寫作風格。人類評估突顯了我們作者寫作表的優良品質，並提供了對個人化故事生成任務的寶貴見解。值得注意的是，來自某些來源（例如 Reddit）的作品比其他來源（例如 AO3）更容易個人化，而敘事層面（例如創造力和語言使用）比其他層面（例如情節）更容易個人化。
+
+##### **Agentic Deep Graph Reasoning Yields Self-Organizing Knowledge Networks**
+2502.13025v1 by Markus J. Buehler
+
+We present an agentic, autonomous graph expansion framework that iteratively
+structures and refines knowledge in situ. Unlike conventional knowledge graph
+construction methods relying on static extraction or single-pass learning, our
+approach couples a reasoning-native large language model with a continually
+updated graph representation. At each step, the system actively generates new
+concepts and relationships, merges them into a global graph, and formulates
+subsequent prompts based on its evolving structure. Through this
+feedback-driven loop, the model organizes information into a scale-free network
+characterized by hub formation, stable modularity, and bridging nodes that link
+disparate knowledge clusters. Over hundreds of iterations, new nodes and edges
+continue to appear without saturating, while centrality measures and shortest
+path distributions evolve to yield increasingly distributed connectivity. Our
+analysis reveals emergent patterns, such as the rise of highly connected 'hub'
+concepts and the shifting influence of 'bridge' nodes, indicating that agentic,
+self-reinforcing graph construction can yield open-ended, coherent knowledge
+structures. Applied to materials design problems, we present compositional
+reasoning experiments by extracting node-specific and synergy-level principles
+to foster genuinely novel knowledge synthesis, yielding cross-domain ideas that
+transcend rote summarization and strengthen the framework's potential for
+open-ended scientific discovery. We discuss other applications in scientific
+discovery and outline future directions for enhancing scalability and
+interpretability.
+
+摘要：我們提出一個能動的、自主的圖形擴展框架，它反覆地建構和精煉原位知識。與依賴靜態提取或單次學習的傳統知識圖形建構方法不同，我們的做法將一個推理原生的大語言模型與一個持續更新的圖形表示結合起來。在每一步中，系統主動產生新的概念和關係，將它們合併到一個全域圖形中，並根據其不斷演化的結構制定後續提示。透過這個回饋驅動的迴圈，模型將資訊組織成一個無標度網路，其特徵是樞紐形成、穩定的模組化以及連結不同知識群集的橋接節點。在數百次反覆運算中，新的節點和邊緣會持續出現，而不會飽和，同時中心性測量和最短路徑分佈會演化為產生越來越分散的連通性。我們的分析揭示了新興模式，例如高度連接的「樞紐」概念的興起和「橋樑」節點影響力的轉移，這表明能動的、自我強化的圖形建構可以產生開放式、連貫的知識結構。應用於材料設計問題，我們提出組合推理實驗，透過提取特定於節點的原則和協同效應層級原則，以促進真正新穎的知識綜合，產生超越死背式摘要並強化框架在開放式科學發現中潛力的跨領域想法。我們討論了在科學發現中的其他應用，並概述了增強可擴充性和可解釋性的未來方向。
+
+##### **Oreo: A Plug-in Context Reconstructor to Enhance Retrieval-Augmented Generation**
+2502.13019v1 by Sha Li, Naren Ramarkrishnan
+
+Despite the remarkable capabilities of Large Language Models (LLMs) in
+various NLP tasks, they remain vulnerable to hallucinations due to their
+limited parametric knowledge and lack of domain-specific expertise.
+Retrieval-Augmented Generation (RAG) addresses this challenge by incorporating
+external document retrieval to augment the knowledge base of LLMs. In this
+approach, RAG retrieves document chunks from an external corpus in response to
+a query, which are then used as context for the downstream language model to
+generate an answer. However, these retrieved knowledge sources often include
+irrelevant or erroneous information, undermining the effectiveness of RAG in
+downstream tasks. To overcome this limitation, we introduce a compact,
+efficient, and pluggable module designed to refine external knowledge sources
+before feeding them to the generator. The module reconstructs retrieved content
+by extracting the most relevant and supportive information and reorganising it
+into a concise, query-specific format. Through a three-stage training paradigm
+(comprising supervised fine-tuning, contrastive multi-task learning, and
+reinforcement learning-based alignment), it prioritises critical knowledge and
+aligns it with the generator's preferences. This method enables LLMs to produce
+outputs that are more accurate, reliable, and contextually appropriate.
+
+摘要：儘管大型語言模型 (LLM) 在各種自然語言處理任務中具備卓越的能力，但由於其參數知識有限且缺乏特定領域的專業知識，因此它們仍然容易出現幻覺。檢索增強式生成 (RAG) 透過納入外部文件檢索來擴充 LLM 的知識庫，以應對此項挑戰。在此方法中，RAG 會根據查詢檢索外部語料庫中的文件區塊，然後將其用作下游語言模型的背景，以產生答案。然而，這些檢索到的知識來源通常包含不相關或錯誤的資訊，因而損害了 RAG 在下游任務中的效能。為了克服此項限制，我們引入了一個精簡、有效率且可插入的模組，用於在將外部知識來源提供給生成器之前對其進行精煉。此模組透過提取最相關且有用的資訊並將其重新組織成簡潔且特定於查詢的格式，來重建檢索到的內容。透過三階段訓練範例 - 包含監督微調、對比多任務學習以及基於強化學習的比對 - 它優先考量關鍵知識，並使其與生成器的偏好相符。此方法可讓 LLM 產生更準確、可靠且在語境上更適當的輸出。
+
+##### **LLM-Powered Proactive Data Systems**
+2502.13016v1 by Sepanta Zeighami, Yiming Lin, Shreya Shankar, Aditya Parameswaran
+
+With the power of LLMs, we now have the ability to query data that was
+previously impossible to query, including text, images, and video. However,
+despite this enormous potential, most present-day data systems that leverage
+LLMs are reactive, reflecting our community's desire to map LLMs to known
+abstractions. Most data systems treat LLMs as an opaque black box that operates
+on user inputs and data as is, optimizing them much like any other approximate,
+expensive UDFs, in conjunction with other relational operators. Such data
+systems do as they are told, but fail to understand and leverage what the LLM
+is being asked to do (i.e., the underlying operations, which may be
+error-prone), the data the LLM is operating on (e.g., long, complex documents),
+or what the user really needs. They don't take advantage of the characteristics
+of the operations and/or the data at hand, or ensure correctness of results
+when there are imprecisions and ambiguities. We argue that data systems instead
+need to be proactive: they need to be given more agency -- armed with the power
+of LLMs -- to understand and rework the user inputs and the data and to make
+decisions on how the operations and the data should be represented and
+processed. By allowing the data system to parse, rewrite, and decompose user
+inputs and data, or to interact with the user in ways that go beyond the
+standard single-shot query-result paradigm, the data system is able to address
+user needs more efficiently and effectively. These new capabilities lead to a
+rich design space where data systems take more initiative: they are
+empowered to perform optimization based on the transformation operations, data
+characteristics, and user intent. We discuss various successful examples of how
+this framework has been and can be applied in real-world tasks, and present
+future directions for this ambitious research agenda.
+
+摘要：透過 LLM 的強大功能，我們現在能夠查詢過去無法查詢的資料，包括文字、圖片和影片。然而，儘管有如此龐大的潛力，但現今大多數利用 LLM 的資料系統都是被動的，反映出我們的社群希望將 LLM 映射到已知的抽象化。大多數資料系統將 LLM 視為一個不透明的黑盒子，以使用者輸入和資料為基礎進行運作，並像其他近似、昂貴的 UDF 一樣最佳化它們，並與其他關聯運算子結合使用。這些資料系統會照著指示執行，但無法理解並運用 LLM 被要求執行的任務（例如可能容易出錯的基本運算）、LLM 正在運算的資料（例如冗長、複雜的文件），或使用者真正需要的是什麼。它們不會利用運算和/或手邊資料的特性，或在有誤差和歧義時確保結果的正確性。我們認為資料系統應該改為主動：它們需要被賦予更多自主權，並具備 LLM 的強大功能，以了解並重新處理使用者輸入和資料，並就運算和資料的表示和處理方式做出決策。透過允許資料系統解析、改寫和分解使用者輸入和資料，或以超越標準單次查詢結果模式的方式與使用者互動，資料系統能夠更有效率且有效地滿足使用者的需求。這些新功能會帶來一個豐富的設計空間，讓資料系統發揮更多主導性：它們有能力根據轉換運算、資料特性和使用者意圖進行最佳化。我們將討論這個架構如何應用於實際任務，並提出這個雄心勃勃的研究議程的未來方向。
+
+##### **Towards a Design Guideline for RPA Evaluation: A Survey of Large Language Model-Based Role-Playing Agents**
+2502.13012v1 by Chaoran Chen, Bingsheng Yao, Ruishi Zou, Wenyue Hua, Weimin Lyu, Toby Jia-Jun Li, Dakuo Wang
+
+The Role-Playing Agent (RPA) is an increasingly popular type of LLM agent that
+simulates human-like behaviors in a variety of tasks. However, evaluating RPAs
+is challenging due to diverse task requirements and agent designs. This paper
+proposes an evidence-based, actionable, and generalizable evaluation design
+guideline for LLM-based RPAs by systematically reviewing 1,676 papers published
+between Jan. 2021 and Dec. 2024. Our analysis identifies six agent attributes,
+seven task attributes, and seven evaluation metrics from existing literature.
+Based on these findings, we present an RPA evaluation design guideline to help
+researchers develop more systematic and consistent evaluation methods.
+
+摘要：角色扮演代理（RPA）是一種越來越流行的 LLM 代理，它能模擬人類在各種任務中的行為。然而，由於任務需求和代理設計的多樣性，評估 RPA 具有挑戰性。本文通過系統地審查 2021 年 1 月至 2024 年 12 月期間發表的 1,676 篇論文，提出了基於證據、可操作且可推廣的 LLM 基於 RPA 的評估設計指南。我們的分析從現有文獻中識別出六個代理屬性、七個任務屬性和七個評估指標。根據這些發現，我們提出了 RPA 評估設計指南，以幫助研究人員開發更系統化和一致的評估方法。
+
+##### **Adaptive Knowledge Graphs Enhance Medical Question Answering: Bridging the Gap Between LLMs and Evolving Medical Knowledge**
+2502.13010v1 by Mohammad Reza Rezaei, Reza Saadati Fard, Jayson Parker, Rahul G. Krishnan, Milad Lankarany
+
+Large Language Models (LLMs) have significantly advanced medical
+question-answering by leveraging extensive clinical data and medical
+literature. However, the rapid evolution of medical knowledge and the
+labor-intensive process of manually updating domain-specific resources pose
+challenges to the reliability of these systems. To address this, we introduce
+Adaptive Medical Graph-RAG (AMG-RAG), a comprehensive framework that automates
+the construction and continuous updating of medical knowledge graphs,
+integrates reasoning, and retrieves current external evidence, such as PubMed
+and WikiSearch. By dynamically linking new findings and complex medical
+concepts, AMG-RAG not only improves accuracy but also enhances interpretability
+in medical queries.
+  Evaluations on the MEDQA and MEDMCQA benchmarks demonstrate the effectiveness
+of AMG-RAG, achieving an F1 score of 74.1 percent on MEDQA and an accuracy of
+66.34 percent on MEDMCQA, outperforming both comparable models and those 10 to
+100 times larger. Notably, these improvements are achieved without increasing
+computational overhead, highlighting the critical role of automated knowledge
+graph generation and external evidence retrieval in delivering up-to-date,
+trustworthy medical insights.
+
+摘要：大型語言模型 (LLM) 透過利用廣泛的臨床資料和醫學文獻，大幅提升了醫療問題解答的進步。然而，醫療知識的快速演進和手動更新特定領域資源的繁複程序，對這些系統的可靠性構成挑戰。為了解決這個問題，我們引入了適應性醫療圖表 RAG (AMG-RAG)，這是一個自動化建構和持續更新醫療知識圖表的綜合架構，整合推理並擷取 PubMed 和 WikiSearch 等最新的外部證據。透過動態連結新的發現和複雜的醫療概念，AMG-RAG 不僅提升了準確性，也增強了醫療查詢的可解釋性。在 MEDQA 和 MEDMCQA 基準上的評量證明了 AMG-RAG 的有效性，在 MEDQA 上達到了 74.1% 的 F1 分數，在 MEDMCQA 上達到了 66.34% 的準確度，優於其他同類模型以及那些大 10 到 100 倍的模型。值得注意的是，這些改進是在不增加運算負擔的情況下實現的，突顯了自動化知識圖表生成和外部證據擷取在提供最新、可信賴的醫療見解中扮演的重要角色。
+
+##### **Integrating Reinforcement Learning, Action Model Learning, and Numeric Planning for Tackling Complex Tasks**
+2502.13006v1 by Yarin Benyamin, Argaman Mordoch, Shahaf S. Shperberg, Roni Stern
+
+Automated Planning algorithms require a model of the domain that specifies
+the preconditions and effects of each action. Obtaining such a domain model is
+notoriously hard. Algorithms for learning domain models exist, yet it remains
+unclear whether learning a domain model and planning is an effective approach
+for numeric planning environments, i.e., where states include discrete and
+numeric state variables. In this work, we explore the benefits of learning a
+numeric domain model and compare it with alternative model-free solutions. As a
+case study, we use two tasks in Minecraft, a popular sandbox game that has been
+used as an AI challenge. First, we consider an offline learning setting, where
+a set of expert trajectories is available to learn from. This is the standard
+setting for learning domain models. We used the Numeric Safe Action Model
+Learning (NSAM) algorithm to learn a numeric domain model and solve new
+problems with the learned domain model and a numeric planner. We call this
+model-based solution NSAM_(+p), and compare it to several model-free Imitation
+Learning (IL) and Offline Reinforcement Learning (RL) algorithms. Empirical
+results show that some IL algorithms learn to solve simple tasks faster,
+while NSAM_(+p) allows solving tasks that require long-term planning and
+enables generalizing to solve problems in larger environments. Then, we
+consider an online learning setting, where learning is done by moving an agent
+in the environment. For this setting, we introduce RAMP. In RAMP, observations
+collected during the agent's execution are used to simultaneously train an RL
+policy and learn a planning domain action model. This forms a positive feedback
+loop between the RL policy and the learned domain model. We demonstrate
+experimentally the benefits of using RAMP, showing that it finds more efficient
+plans and solves more problems than several RL baselines.
+
+摘要：自動化規劃演算法需要一個網域模型，來指定每個動作的前提條件和效果。取得這樣的網域模型出了名的困難。學習網域模型的演算法確實存在，但學習網域模型和規劃是否為數值規劃環境的有效方法仍然不清楚，也就是說，其中狀態包含離散和數值狀態變數。在這項工作中，我們探討學習數值網域模型的優點，並將其與替代的無模型解決方案進行比較。作為一個案例研究，我們使用 Minecraft 中的兩個任務，Minecraft 是一個流行的沙盒遊戲，已被用作 AI 挑戰。首先，我們考慮離線學習設定，其中有一組專家軌跡可供學習。這是學習網域模型的標準設定。我們使用數值安全動作模型學習 (NSAM) 演算法來學習數值網域模型，並使用已學習的網域模型和數值規劃器解決新問題。我們稱此模型為基礎的解決方案 NSAM_(+p)，並將其與多種無模型模仿學習 (IL) 和離線強化學習 (RL) 演算法進行比較。經驗結果顯示，一些 IL 演算法可以更快地學習解決簡單任務，而 NSAM_(+p) 允許解決需要長期規劃的任務，並能夠推廣到在更大環境中解決問題。然後，我們考慮線上學習設定，其中學習是透過在環境中移動代理來完成的。對於此設定，我們引入了 RAMP。在 RAMP 中，在代理執行期間收集的觀察結果用於同時訓練 RL 政策和學習規劃網域動作模型。這在 RL 政策和已學習的網域模型之間形成了一個正向回饋迴路。我們透過實驗證明了使用 RAMP 的好處，顯示它比多個 RL 基準找到了更有效的計畫，並解決了更多問題。
+
+##### **Language Barriers: Evaluating Cross-Lingual Performance of CNN and Transformer Architectures for Speech Quality Estimation**
+2502.13004v1 by Wafaa Wardah, Tuğçe Melike Koçak Büyüktaş, Kirill Shchegelskiy, Sebastian Möller, Robert P. Spang
+
+Objective speech quality models aim to predict human-perceived speech quality
+using automated methods. However, cross-lingual generalization remains a major
+challenge, as Mean Opinion Scores (MOS) vary across languages due to
+linguistic, perceptual, and dataset-specific differences. A model trained
+primarily on English data may struggle to generalize to languages with
+different phonetic, tonal, and prosodic characteristics, leading to
+inconsistencies in objective assessments. This study investigates the
+cross-lingual performance of two speech quality models: NISQA, a CNN-based
+model, and a Transformer-based Audio Spectrogram Transformer (AST) model. Both
+models were trained exclusively on English datasets containing over 49,000
+speech samples and subsequently evaluated on speech in German, French,
+Mandarin, Swedish, and Dutch. We analyze model performance using Pearson
+Correlation Coefficient (PCC) and Root Mean Square Error (RMSE) across five
+speech quality dimensions: coloration, discontinuity, loudness, noise, and MOS.
+Our findings show that while AST achieves a more stable cross-lingual
+performance, both models exhibit noticeable biases. Notably, Mandarin speech
+quality predictions correlate highly with human MOS scores, whereas Swedish and
+Dutch present greater prediction challenges. Discontinuities remain difficult
+to model across all languages. These results highlight the need for more
+balanced multilingual datasets and architecture-specific adaptations to improve
+cross-lingual generalization.
+
+摘要：客觀語音品質模型旨在使用自動化方法預測人類感知的語音品質。然而，跨語言的概化仍然是一項重大挑戰，因為平均意見分數 (MOS) 會因語言的不同而有所不同，這是由於語言、感知和特定於資料集的差異所致。主要使用英語資料訓練的模型可能會難以概化到具有不同語音、聲調和韻律特徵的語言，導致客觀評估不一致。本研究探討了兩種語音品質模型的跨語言效能：基於 CNN 的 NISQA 模型和基於 Transformer 的音訊光譜 Transformer (AST) 模型。這兩種模型都僅使用包含超過 49,000 個語音範例的英語資料集進行訓練，然後在德語、法語、普通話、瑞典語和荷蘭語的語音上進行評估。我們使用皮爾森相關係數 (PCC) 和均方根誤差 (RMSE) 分析五個語音品質維度的模型效能：色彩、不連續性、響度、雜訊和 MOS。我們的研究結果顯示，儘管 AST 達到了更穩定的跨語言效能，但這兩種模型都表現出明顯的偏差。值得注意的是，普通話語音品質預測與人類 MOS 分數高度相關，而瑞典語和荷蘭語則呈現出更大的預測挑戰。不連續性在所有語言中仍然難以建模。這些結果凸顯了對更平衡的多語言資料集和特定於架構的調整的需求，以改善跨語言的概化。
+
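The two metrics named in the speech-quality abstract above, PCC and RMSE, have standard definitions. A minimal sketch in plain Python, assuming paired lists of predicted and human MOS values; the function names and the sample numbers are illustrative and not taken from the paper.

```python
import math


def pearson_correlation(predicted, reference):
    """Pearson Correlation Coefficient between two equal-length sequences."""
    if len(predicted) != len(reference) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_p = sum(predicted) / len(predicted)
    mean_r = sum(reference) / len(reference)
    cov = sum((p - mean_p) * (r - mean_r) for p, r in zip(predicted, reference))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_r = sum((r - mean_r) ** 2 for r in reference)
    if var_p == 0 or var_r == 0:
        raise ValueError("PCC is undefined when either sequence is constant")
    return cov / math.sqrt(var_p * var_r)


def root_mean_square_error(predicted, reference):
    """Root Mean Square Error between two equal-length sequences."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    squared = sum((p - r) ** 2 for p, r in zip(predicted, reference))
    return math.sqrt(squared / len(predicted))


# Hypothetical scores on a 1-5 MOS scale, for illustration only.
human_mos = [1.0, 2.0, 3.0, 4.0, 5.0]
model_mos = [1.5, 2.5, 3.5, 4.5, 5.5]

pcc = pearson_correlation(model_mos, human_mos)
rmse = root_mean_square_error(model_mos, human_mos)
```

In this made-up example the model is off by a constant 0.5, so the correlation is perfect while the RMSE is 0.5. That gap is one reason the two metrics are usually reported together: PCC captures ranking agreement and RMSE captures absolute error.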
+##### **You need to MIMIC to get FAME: Solving Meeting Transcript Scarcity with a Multi-Agent Conversations**
+2502.13001v1 by Frederic Kirstein, Muneeb Khan, Jan Philip Wahle, Terry Ruas, Bela Gipp
+
+Meeting summarization suffers from limited high-quality data, mainly due to
+privacy restrictions and expensive collection processes. We address this gap
+with FAME, a dataset of 500 meetings in English and 300 in German produced by
+MIMIC, our new multi-agent meeting synthesis framework that generates meeting
+transcripts on a given knowledge source by defining psychologically grounded
+participant profiles, outlining the conversation, and orchestrating a large
+language model (LLM) debate. A modular post-processing step refines these
+outputs, mitigating potential repetitiveness and overly formal tones, ensuring
+coherent, credible dialogues at scale. We also propose a psychologically
+grounded evaluation framework assessing naturalness, social behavior
+authenticity, and transcript difficulties. Human assessments show that FAME
+approximates real-meeting spontaneity (4.5/5 in naturalness), preserves
+speaker-centric challenges (3/5 in spoken language), and introduces richer
+information-oriented difficulty (4/5 in difficulty). These findings highlight
+that FAME is a good and scalable proxy for real-world meeting conditions. It
+enables new test scenarios for meeting summarization research and other
+conversation-centric applications in tasks requiring conversation data or
+simulating social scenarios under behavioral constraints.
+
+摘要：會議摘要因缺乏高品質資料而受限，主要是由於隱私限制和昂貴的收集程序。我們透過 FAME 來解決這個差距，FAME 是 MIMIC 製作的 500 場英文會議和 300 場德文會議的資料集，MIMIC 是我們新的多重代理會議合成架構，透過定義心理基礎的參與者設定檔、概述對話，並協調大型語言模型 (LLM) 辯論，在給定的知識來源上產生會議記錄。模組化後處理步驟會改善這些輸出，減輕潛在的重複性和過於正式的語氣，確保大規模的對話連貫且可信。我們也提出一個心理基礎的評估架構，評估自然性、社交行為真實性，以及記錄難度。人類評估顯示，FAME 近似於真實會議的即興性（自然性 4.5/5），保留以講者為中心的挑戰（口語 3/5），並引入更豐富的資訊導向難度（難度 4/5）。這些發現強調 FAME 是真實世界會議條件的良好且可擴充的代理。它能為會議摘要研究和其他對話為中心的應用程式啟用新的測試情境，在需要對話資料或在行為限制下模擬社交情境的任務中。
+
+##### **Personalized Top-k Set Queries Over Predicted Scores**
+2502.12998v1 by Sohrab Namazi Nia, Subhodeep Ghosh, Senjuti Basu Roy, Sihem Amer-Yahia
+
+This work studies the applicability of expensive external oracles such as
+large language models in answering top-k queries over predicted scores. Such
+scores are produced by user-defined functions to answer personalized queries
+over multi-modal data. We propose a generic computational framework that
+handles arbitrary set-based scoring functions, as long as the functions can
+be decomposed into constructs, each of which is sent to an oracle (in our case an
+LLM) to predict partial scores. At a given point in time, the framework assumes
+a set of responses and their partial predicted scores, and it maintains a
+collection of possible sets that are likely to be the true top-k. Since calling
+oracles is costly, our framework judiciously identifies the next construct,
+i.e., the next best question to ask the oracle so as to maximize the likelihood
+of identifying the true top-k. We present a principled probabilistic model that
+quantifies that likelihood. We study efficiency opportunities in designing
+algorithms. We run an evaluation with three large scale datasets, scoring
+functions, and baselines. Experiments indicate the efficacy of our framework,
+as it requires an order of magnitude fewer LLM calls than the baselines while
+ensuring result accuracy. Scalability experiments further
+indicate that our framework could be used in large-scale applications.
+
+摘要：本研究探討在預測分數中回答前 k 個查詢時，昂貴的外部預言（例如大型語言模型）的適用性。此類分數是由使用者定義的函式產生，用於回答多模態資料中的個人化查詢。我們提出一個通用的運算框架，用於處理任意基於集合的計分函式，只要這些函式可以分解為建構區塊，然後將每個建構區塊傳送給預言（在本例中為 LLM）以預測部分分數。在特定時間點，此框架假設一組回應及其部分預測分數，並維護一組可能成為真實前 k 個的集合。由於呼叫預言的成本很高，因此我們的框架會明智地找出下一個建構區塊，亦即下一個最佳問題，以詢問預言，以便最大化找出真實前 k 個的可能性。我們提出一個基於原理的機率模型，用於量化此可能性。我們研究設計演算法時的效率機會。我們針對三個大型資料集、計分函式和基準執行評估。實驗結果指出我們框架的效能，因為它在需要 LLM 呼叫的同時確保結果準確性，比基準進步了一個數量級。可擴充性實驗進一步指出我們的框架可用於大型應用程式。
+
+##### **Eager Updates For Overlapped Communication and Computation in DiLoCo**
+2502.12996v1 by Satyen Kale, Arthur Douillard, Yanislav Donchev
+
+Distributed optimization methods such as DiLoCo have been shown to be
+effective in training very large models across multiple distributed workers,
+such as datacenters. These methods split updates into two parts: an inner
+optimization phase, where the workers independently execute multiple
+optimization steps on their own local data, and an outer optimization step,
+where the inner updates are synchronized. While such approaches require orders
+of magnitude less communication than standard data-parallel training, in
+settings where the workers are datacenters, even the limited communication
+requirements of these approaches can still cause significant slowdowns due to
+the blocking necessary at each outer optimization step. In this paper, we
+investigate techniques to mitigate this issue by overlapping communication with
+computation in a manner that allows the outer optimization step to fully
+overlap with the inner optimization phase. We show that a particular variant,
+dubbed eager updates, provides competitive performance with standard DiLoCo in
+settings with low bandwidth between workers.
+
+摘要：分散式優化方法（例如 DiLoCo）已被證明可有效訓練橫跨多個分散式工作者的超大型模型，例如資料中心。這些方法將更新拆分為兩部分：內部最佳化階段，其中工作者獨立地在自己的本地資料上執行多個最佳化步驟，以及外部最佳化步驟，其中內部更新會同步。雖然此類方法所需的通訊量比標準資料平行訓練少幾個數量級，但在工作者為資料中心的情況下，即使這些方法有限的通訊需求仍可能由於每個外部最佳化步驟所需的封鎖而導致顯著的減速。在本文中，我們探討了透過以允許外部最佳化步驟與內部最佳化階段完全重疊的方式將通訊與運算重疊，來減輕此問題的技術。我們展示了一個特定變體，稱為即時更新，在工作者之間頻寬較低的情況下，可提供與標準 DiLoCo 相當的效能。
+
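The DiLoCo abstract above describes updates split into an inner phase of independent local steps and a synchronized outer step. A toy sketch of that two-phase structure follows, under loud assumptions: each worker minimizes a scalar quadratic, both phases use plain gradient descent, and the outer step averages parameter deltas. All names and hyperparameters are hypothetical. It shows only the blocking baseline; the eager-update variant from the paper is not reproduced here.

```python
def inner_phase(theta, target, steps, lr):
    """Run `steps` local gradient-descent steps on 0.5 * (theta - target) ** 2."""
    for _ in range(steps):
        grad = theta - target
        theta -= lr * grad
    return theta


def outer_step(theta_global, worker_thetas, outer_lr):
    """Synchronize workers: average their parameter deltas and apply them."""
    deltas = [theta_global - theta for theta in worker_thetas]
    avg_delta = sum(deltas) / len(deltas)
    return theta_global - outer_lr * avg_delta


def train(targets, rounds=20, inner_steps=5, inner_lr=0.1, outer_lr=1.0):
    """Alternate independent inner phases with a blocking outer step."""
    theta_global = 0.0
    for _ in range(rounds):
        # Inner phase: every worker starts from the shared parameters
        # and optimizes on its own (here: its own target) without communicating.
        worker_thetas = [
            inner_phase(theta_global, target, inner_steps, inner_lr)
            for target in targets
        ]
        # Outer step: the only communication point, and the place where
        # workers would otherwise wait on each other.
        theta_global = outer_step(theta_global, worker_thetas, outer_lr)
    return theta_global


final_theta = train(targets=[1.0, 2.0, 3.0])
```

With three workers pulling toward 1, 2, and 3, the shared parameter settles near their mean of 2. Communication happens once per round rather than once per gradient step, which is the saving the abstract refers to; overlapping that one remaining synchronization with the next inner phase is what the paper investigates.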
+##### **Free Argumentative Exchanges for Explaining Image Classifiers**
+2502.12995v1 by Avinash Kori, Antonio Rago, Francesca Toni
+
+Deep learning models are powerful image classifiers but their opacity hinders
+their trustworthiness. Explanation methods for capturing the reasoning process
+within these classifiers faithfully and in a clear manner are scarce, due to
+their sheer complexity and size. We provide a solution for this problem by
+defining a novel method for explaining the outputs of image classifiers with
+debates between two agents, each arguing for a particular class. We obtain
+these debates as concrete instances of Free Argumentative eXchanges (FAXs), a
+novel argumentation-based multi-agent framework allowing agents to internalise
+opinions by other agents differently than originally stated. We define two
+metrics (consensus and persuasion rate) to assess the usefulness of FAXs as
+argumentative explanations for image classifiers. We then conduct a number of
+empirical experiments showing that FAXs perform well along these metrics as
+well as being more faithful to the image classifiers than conventional,
+non-argumentative explanation methods. All our implementations can be found at
+https://github.com/koriavinash1/FAX.
+
+摘要：深度學習模型是強大的影像分類器，但其不透明性阻礙了其可信度。由於其極高的複雜性和規模，忠實且清楚地捕捉這些分類器內部推理過程的解釋方法很少見。我們透過定義一種新穎的方法來解決這個問題，該方法透過兩個代理之間的辯論來解釋影像分類器的輸出，每個代理都主張一個特定類別。我們將這些辯論作為自由論證交換 (FAX) 的具體實例，這是一個新穎的基於論證的多代理架構，允許代理以不同於原始陳述的方式內化其他代理的意見。我們定義了兩個指標（共識率和說服率）來評估 FAX 作為影像分類器論證解釋的有用性。然後，我們進行了多項實證實驗，表明 FAX 在這些指標上表現良好，並且比傳統的非論證解釋方法更忠實於影像分類器。我們所有的實作都可以在 https://github.com/koriavinash1/FAX 中找到。
+
+##### **B-cos LM: Efficiently Transforming Pre-trained Language Models for Improved Explainability**
+2502.12992v1 by Yifan Wang, Sukrut Rao, Ji-Ung Lee, Mayank Jobanputra, Vera Demberg
+
+Post-hoc explanation methods for black-box models often struggle with
+faithfulness and human interpretability due to the lack of explainability in
+current neural models. Meanwhile, B-cos networks have been introduced to
+improve model explainability through architectural and computational
+adaptations, but their application has so far been limited to computer vision
+models and their associated training pipelines. In this work, we introduce
+B-cos LMs, i.e., B-cos networks empowered for NLP tasks. Our approach directly
+transforms pre-trained language models into B-cos LMs by combining B-cos
+conversion and task fine-tuning, improving efficiency compared to previous
+B-cos methods. Our automatic and human evaluation results demonstrate that
+B-cos LMs produce more faithful and human interpretable explanations than
+post-hoc methods, while maintaining task performance comparable to conventional
+fine-tuning. Our in-depth analysis explores how B-cos LMs differ from
+conventionally fine-tuned models in their learning processes and explanation
+patterns. Finally, we provide practical guidelines for effectively building
+B-cos LMs based on our findings. Our code is available at
+https://anonymous.4open.science/r/bcos_lm.
+
+摘要：黑盒模型的事后解释方法通常会因为当前神经模型缺乏可解释性而难以做到忠实和人类可解释。与此同时，B-cos 网络已被引入，以通过架构和计算改编来提高模型的可解释性，但到目前为止，它们的应用仅限于计算机视觉模型及其相关的训练管道。在这项工作中，我们引入了 B-cos LM，即针对 NLP 任务增强的 B-cos 网络。我们的方法通过结合 B-cos 转换和任务微调，将预训练的语言模型直接转换为 B-cos LM，与以前 B-cos 方法相比，提高了效率。我们的自动和人工评估结果表明，与事后方法相比，B-cos LM 产生了更忠实和人类可解释的解释，同时保持与传统微调相当的任务性能。我们的深入分析探讨了 B-cos LM 在其学习过程和解释模式中与传统微调模型有何不同。最后，我们根据我们的发现提供了有效构建 B-cos LM 的实用指南。我们的代码可在 https://anonymous.4open.science/r/bcos_lm 获得。
+
+##### **Beyond Profile: From Surface-Level Facts to Deep Persona Simulation in LLMs**
+2502.12988v1 by Zixiao Wang, Duzhen Zhang, Ishita Agrawal, Shen Gao, Le Song, Xiuying Chen
+
+Previous approaches to persona simulation with large language models (LLMs) have
+typically relied on learning basic biographical information, or using limited
+role-play dialogue datasets to capture a character's responses. However, a
+holistic representation of an individual goes beyond surface-level facts or
+conversations to deeper thoughts and thinking. In this work, we introduce
+CharacterBot, a model designed to replicate both the linguistic patterns and
+distinctive thought processes of a character. Using Lu Xun, a renowned Chinese
+writer, as a case study, we propose four training tasks derived from his 17
+essay collections. These include a pre-training task focused on mastering
+external linguistic structures and knowledge, as well as three fine-tuning
+tasks: multiple-choice question answering, generative question answering, and
+style transfer, each aligning the LLM with Lu Xun's internal ideation and
+writing style. To optimize learning across these tasks, we introduce a CharLoRA
+parameter updating mechanism, where a general linguistic style expert
+collaborates with other task-specific experts to better study both the language
+style and the understanding of deeper thoughts. We evaluate CharacterBot on
+three tasks for linguistic accuracy and opinion comprehension, demonstrating
+that it significantly outperforms the baselines on our adapted metrics. We hope
+that this work inspires future research on deep character persona simulation
+with LLMs.
+
+摘要：以前對角色模擬大型語言模型 (LLM) 的方法通常依賴於學習基本傳記資訊，或使用有限的角色扮演對話資料集來捕捉角色的反應。然而，對個人的整體表徵超越了表面層面的事實或對話，深入到更深層的想法和思考。在這項工作中，我們引入了 CharacterBot，一個旨在複製角色的語言模式和獨特思考過程的模型。以著名的中國作家魯迅為案例研究，我們提出了四個從他的 17 篇散文集中衍生的訓練任務。其中包括一個預訓練任務，專注於掌握外部語言結構和知識，以及三個微調任務：多選題回答、生成式問答和風格轉移，每個任務都將 LLM 與魯迅的內部觀念和寫作風格相結合。為了優化這些任務的學習，我們引入了一個 CharLoRA 參數更新機制，其中一位通曉語言風格的專家與其他特定任務專家合作，以更好地研究語言風格和對深層思想的理解。我們在三項任務上評估了 CharacterBot 的語言準確性和意見理解，證明它在我們調整的指標上顯著優於基準。我們希望這項工作能激勵未來對深度角色角色模擬 LLM 的研究。
+
+##### **PartSDF: Part-Based Implicit Neural Representation for Composite 3D Shape Parametrization and Optimization**
+2502.12985v1 by Nicolas Talabot, Olivier Clerc, Arda Cinar Demirtas, Doruk Oner, Pascal Fua
+
+Accurate 3D shape representation is essential in engineering applications
+such as design, optimization, and simulation. In practice, engineering
+workflows require structured, part-aware representations, as objects are
+inherently designed as assemblies of distinct components. However, most
+existing methods either model shapes holistically or decompose them without
+predefined part structures, limiting their applicability in real-world design
+tasks. We propose PartSDF, a supervised implicit representation framework that
+explicitly models composite shapes with independent, controllable parts while
+maintaining shape consistency. Despite its simple single-decoder architecture,
+PartSDF outperforms both supervised and unsupervised baselines in
+reconstruction and generation tasks. We further demonstrate its effectiveness
+as a structured shape prior for engineering applications, enabling precise
+control over individual components while preserving overall coherence. Code
+available at https://github.com/cvlab-epfl/PartSDF.
+
+摘要：精確的 3D 形狀表示在工程應用中至關重要，例如設計、最佳化和模擬。實際上，工程工作流程需要結構化、零件感知的表示，因為物體本質上是設計為不同元件的組件。然而，大多數現有方法不是整體建模形狀，就是將其分解，而沒有預先定義的零件結構，這限制了它們在實際設計任務中的適用性。我們提出 PartSDF，一個監督式的隱式表示框架，它明確地使用獨立、可控的零件對複合形狀進行建模，同時保持形狀一致性。儘管其單一的解碼器架構很簡單，但 PartSDF 在重建和生成任務中都優於監督式和非監督式基準。我們進一步證明了其作為工程應用結構化形狀先驗的有效性，能夠精確控制各個元件，同時保持整體一致性。程式碼可在 https://github.com/cvlab-epfl/PartSDF 取得。
+
+##### **Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs**
+2502.12982v1 by Longxu Dou, Qian Liu, Fan Zhou, Changyu Chen, Zili Wang, Ziqi Jin, Zichen Liu, Tongyao Zhu, Cunxiao Du, Penghui Yang, Haonan Wang, Jiaheng Liu, Yongchi Zhao, Xiachong Feng, Xin Mao, Man Tsung Yeung, Kunat Pipatanakul, Fajri Koto, Min Si Thu, Hynek Kydlíček, Zeyi Liu, Qunshu Lin, Sittipong Sripaisarnmongkol, Kridtaphad Sae-Khow, Nirattisai Thongchim, Taechawat Konkaew, Narong Borijindargoon, Anh Dao, Matichon Maneegard, Phakphum Artkaew, Zheng-Xin Yong, Quan Nguyen, Wannaphong Phatthiyaphaibun, Hoang H. Tran, Mike Zhang, Shiqi Chen, Tianyu Pang, Chao Du, Xinyi Wan, Wei Lu, Min Lin
+
+Sailor2 is a family of cutting-edge multilingual language models for
+South-East Asian (SEA) languages, available in 1B, 8B, and 20B sizes to suit
+diverse applications. Building on Qwen2.5, Sailor2 undergoes continuous
+pre-training on 500B tokens (400B SEA-specific and 100B replay tokens) to
+support 13 SEA languages while retaining proficiency in Chinese and English.
+The Sailor2-20B model achieves a 50-50 win rate against GPT-4o across SEA
+languages. We also deliver a comprehensive cookbook on how to develop the
+multilingual model in an efficient manner, including five key aspects: data
+curation, pre-training, post-training, model customization and evaluation. We
+hope that the Sailor2 model (Apache 2.0 license) will drive language development in
+the SEA region, and that the Sailor2 cookbook will inspire researchers to build more
+inclusive LLMs for other under-served languages.
+
+摘要：Sailor2 是一系列針對東南亞 (SEA) 語言的尖端多語言語言模型，備有 1B、8B 和 20B 大小，以適應各種應用。在 Qwen2.5 的基礎上，Sailor2 持續進行 500B 代幣（400B SEA 專用和 100B 重播代幣）的預訓練，以支援 13 種 SEA 語言，同時保留中文和英文的熟練度。Sailor2-20B 模型在 SEA 語言中對抗 GPT-4o 時，達到 50-50 的獲勝率。我們還提供一本全面的食譜，說明如何以有效的方式開發多語言模型，包括五個關鍵方面：資料策展、預訓練、後訓練、模型自訂和評估。我們希望 Sailor2 模型（Apache 2.0 授權）將推動 SEA 地區的語言發展，而 Sailor2 食譜將激勵研究人員為其他服務不足的語言建立更具包容性的 LLM。
+
+##### **Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking**
+2502.12970v1 by Junda Zhu, Lingyong Yan, Shuaiqiang Wang, Dawei Yin, Lei Sha
+
+The reasoning abilities of Large Language Models (LLMs) have demonstrated
+remarkable advancement and exceptional performance across diverse domains.
+However, leveraging these reasoning capabilities to enhance LLM safety against
+adversarial attacks and jailbreak queries remains largely unexplored. To bridge
+this gap, we propose Reasoning-to-Defend (R2D), a novel training paradigm that
+integrates safety reflections of queries and responses into LLMs' generation
+process, unlocking a safety-aware reasoning mechanism. This approach enables
+self-evaluation at each reasoning step to create safety pivot tokens as
+indicators of the response's safety status. Furthermore, in order to improve
+the learning efficiency of pivot token prediction, we propose Contrastive Pivot
+Optimization (CPO), which enhances the model's ability to perceive the safety
+status of dialogues. Through this mechanism, LLMs dynamically adjust their
+response strategies during reasoning, significantly enhancing their defense
+capabilities against jailbreak attacks. Extensive experimental results
+demonstrate that R2D effectively mitigates various attacks and improves overall
+safety, highlighting the substantial potential of safety-aware reasoning in
+strengthening LLMs' robustness against jailbreaks.
+
+摘要：大型語言模型 (LLM) 的推理能力已展現出顯著的進步，並在不同的領域中表現出色。然而，利用這些推理能力來增強 LLM 對抗攻擊和越獄查詢的安全性仍然是未開發的領域。為了彌補這個差距，我們提出了推理防禦 (R2D)，這是一種新穎的訓練範例，它將查詢和回應的安全考量整合到 LLM 的生成過程中，開啟了一個安全感知推理機制。此方法可以在每個推理步驟中進行自我評估，以建立安全樞紐標記，作為回應安全狀態的指標。此外，為了提高樞紐標記預測的學習效率，我們提出了對比樞紐最佳化 (CPO)，它增強了模型感知對話安全狀態的能力。透過此機制，LLM 在推理過程中動態調整其回應策略，大幅增強其對抗越獄攻擊的防禦能力。廣泛的實驗結果證明，R2D 有效地減輕了各種攻擊，並改善了整體安全性，突顯了安全感知推理在加強 LLM 對抗越獄的穩健性方面的潛力。
+
+##### **A Survey of Text Classification Under Class Distribution Shift**
+2502.12965v1 by Adriana Valentina Costache, Silviu Florin Gheorghe, Eduard Gabriel Poesina, Paul Irofti, Radu Tudor Ionescu
+
+The basic underlying assumption of machine learning (ML) models is that the
+training and test data are sampled from the same distribution. However, in
+daily practice, this assumption is often broken, i.e., the distribution of the
+test data changes over time, which hinders the application of conventional ML
+models. One domain where the distribution shift naturally occurs is text
+classification, since people always find new topics to discuss. To this end, we
+survey research articles studying open-set text classification and related
+tasks. We divide the methods in this area based on the constraints that define
+the kind of distribution shift and the corresponding problem formulation,
+i.e., learning with the Universum, zero-shot learning, and open-set learning. We
+next discuss the predominant mitigation approaches for each problem setup.
+Finally, we identify several future work directions, aiming to push the
+boundaries beyond the state of the art. Interestingly, we find that continual
+learning can solve many of the issues caused by the shifting class
+distribution. We maintain a list of relevant papers at
+https://github.com/Eduard6421/Open-Set-Survey.
+
+摘要：機器學習 (ML) 模型的基本假設是訓練資料和測試資料取樣自同一個分佈。然而，在日常實務中，這個假設經常被打破，也就是說測試資料的分布會隨著時間改變，這會阻礙傳統 ML 模型的應用。分佈轉移自然發生的其中一個領域是文字分類，因為人們總能找到新的主題來討論。為此，我們調查研究開放集文字分類和相關任務的研究文章。我們根據定義分佈轉移的類型和對應問題公式的限制，將這個領域的方法分為：使用 Universum 學習、零次學習和開放集學習。接下來，我們討論每個問題設定的主要緩解方法。最後，我們找出幾個未來的研究方向，目標是將界線推展到現有技術的極限之外。有趣的是，我們發現持續學習可以解決許多由類別分佈轉移所造成的議題。我們在 https://github.com/Eduard6421/Open-Set-Survey 維護一份相關論文清單。
+
+##### **Trust Me, I'm Wrong: High-Certainty Hallucinations in LLMs**
+2502.12964v1 by Adi Simhi, Itay Itzhak, Fazl Barez, Gabriel Stanovsky, Yonatan Belinkov
+
+Large Language Models (LLMs) often generate outputs that lack grounding in
+real-world facts, a phenomenon known as hallucinations. Prior research has
+associated hallucinations with model uncertainty, leveraging this relationship
+for hallucination detection and mitigation. In this paper, we challenge the
+underlying assumption that all hallucinations are associated with uncertainty.
+Using knowledge detection and uncertainty measurement methods, we demonstrate
+that models can hallucinate with high certainty even when they have the correct
+knowledge. We further show that high-certainty hallucinations are consistent
+across models and datasets, distinctive enough to be singled out, and challenge
+existing mitigation methods. Our findings reveal an overlooked aspect of
+hallucinations, emphasizing the need to understand their origins and improve
+mitigation strategies to enhance LLM safety. The code is available at
+https://github.com/technion-cs-nlp/Trust_me_Im_wrong .
+
+摘要：大型語言模型 (LLM) 經常產生缺乏真實世界事實根據的輸出，這種現象稱為幻覺。先前的研究已將幻覺與模型不確定性聯繫起來，利用這種關係進行幻覺偵測和緩解。在本文中，我們挑戰所有幻覺都與不確定性相關的基本假設。使用知識偵測和不確定性測量方法，我們證明模型即使擁有正確的知識，也能以高度確定性產生幻覺。我們進一步表明，高確定性幻覺在模型和資料集之間是一致的，足夠獨特以至於可以單獨挑選出來，並挑戰現有的緩解方法。我們的研究結果揭示了幻覺的一個被忽視的方面，強調需要了解其起源並改進緩解策略以增強 LLM 安全性。可以在 https://github.com/technion-cs-nlp/Trust_me_Im_wrong 找到程式碼。
+
+##### **Infinite Retrieval: Attention Enhanced LLMs in Long-Context Processing**
+2502.12962v1 by Xiaoju Ye, Zhichun Wang, Jingyuan Wang
+
+Limited by the context window size of Large Language Models (LLMs), handling
+various tasks with input tokens exceeding the upper limit has been challenging,
+whether it is a simple direct retrieval task or a complex multi-hop reasoning
+task. Although various methods have been proposed to enhance the long-context
+processing capabilities of LLMs, they either incur substantial post-training
+costs, or require additional tool modules (e.g., RAG), or have not shown
+significant improvement in realistic tasks. Our work observes the correlation
+between the attention distribution and generated answers across each layer, and
+establishes through experiments that the attention allocation aligns with
+retrieval-augmented capabilities. Drawing on the above insights, we propose a
+novel method, InfiniRetri, that leverages the LLM's own attention information to
+enable accurate retrieval across inputs of infinite length. Our evaluations
+indicate that InfiniRetri achieves 100% accuracy in the
+Needle-In-a-Haystack (NIH) test over 1M tokens using a 0.5B parameter model,
+surpassing other methods or larger models and setting a new
+state-of-the-art (SOTA). Moreover, our method achieves significant performance
+improvements on real-world benchmarks, with a maximum 288% improvement. In
+addition, InfiniRetri can be applied to any Transformer-based LLM without
+additional training and substantially reduces inference latency and compute
+overhead in long texts. In summary, our comprehensive studies show
+InfiniRetri's potential for practical applications and create a paradigm for
+retrieving information using LLMs' own capabilities under infinite-length
+tokens. Code will be released in link.
+
+摘要：受限于大型语言模型 (LLM) 的上下文窗口大小，处理超出上限的输入标记的各种任务一直具有挑战性，无论是简单的直接检索任务还是复杂的多跳推理任务。虽然已经提出了各种方法来增强 LLM 的长上下文处理能力，但它们要么产生大量的后训练成本，要么需要额外的工具模块（例如，RAG），要么在实际任务中没有显示出显着的改进。我们的工作观察了每层注意力分布和生成答案之间的相关性，并通过实验建立了注意力分配与检索增强能力保持一致。根据上述见解，我们提出了一种新方法 InfiniRetri，该方法利用 LLM 自身的注意力信息来实现对无限长度输入的准确检索。我们的评估表明，InfiniRetri 在使用 0.5B 参数模型对超过 100 万个标记的针头干草堆 (NIH) 测试中实现了 100% 的准确率，超越了其他方法或更大的模型，并创造了新的最先进 (SOTA)。此外，我们的方法在实际基准上实现了显著的性能提升，最大提升了 288%。此外，InfiniRetri 可以应用于任何基于 Transformer 的 LLM，而无需额外的训练，并且可以大幅减少推理延迟和长文本中的计算开销。总之，我们的综合研究表明了 InfiniRetri 在实际应用中的潜力，并为使用 LLM 自身能力在无限长度标记下检索信息创造了一个范例。代码将在链接中发布。
+
+##### **Adaptive Tool Use in Large Language Models with Meta-Cognition Trigger**
+2502.12961v1 by Wenjun Li, Dexun Li, Kuicai Dong, Cong Zhang, Hao Zhang, Weiwen Liu, Yasheng Wang, Ruiming Tang, Yong Liu
+
+Large language models (LLMs) have shown remarkable emergent capabilities,
+transforming the execution of functional tasks by leveraging external tools for
+complex problems that require specialized processing or real-time data. While
+existing research expands LLMs' access to diverse tools (e.g., program
+interpreters, search engines, weather/map apps), the necessity of using these
+tools is often overlooked, leading to indiscriminate tool invocation. This
+naive approach raises two key issues: (1) increased delays due to unnecessary
+tool calls, and (2) potential errors resulting from faulty interactions with
+external tools. In this paper, we introduce meta-cognition as a proxy for LLMs'
+self-assessment of their capabilities, representing the model's awareness of
+its own limitations. Based on this, we propose MeCo, an adaptive
+decision-making strategy for external tool use. MeCo quantifies metacognitive
+scores by capturing high-level cognitive signals in the representation space,
+guiding when to invoke tools. Notably, MeCo is fine-tuning-free and incurs
+minimal cost. Our experiments show that MeCo accurately detects LLMs' internal
+cognitive signals and significantly improves tool-use decision-making across
+multiple base models and benchmarks.
+
+摘要：大型語言模型 (LLM) 已展現出顯著的新興能力，透過運用外部工具來執行功能任務，解決需要專業處理或即時資料的複雜問題，從而轉變任務的執行方式。儘管現有研究擴展了 LLM 對各種工具的存取（例如程式碼詮釋器、搜尋引擎、天氣/地圖應用程式），但使用這些工具的必要性往往被忽略，導致不加選擇地呼叫工具。這種天真的方法提出了兩個關鍵問題：(1) 由於不必要的工具呼叫而導致延遲增加，以及 (2) 由於與外部工具互動錯誤而導致的潛在錯誤。在本文中，我們將元認知引入作為 LLM 自我評估其能力的代理，代表模型意識到其自身的限制。基於此，我們提出了 MeCo，一種用於外部工具使用的適應性決策制定策略。MeCo 透過擷取表徵空間中的高階認知訊號來量化元認知分數，指導何時呼叫工具。值得注意的是，MeCo 是免微調的，而且成本極低。我們的實驗表明，MeCo 能夠準確地偵測 LLM 的內部認知訊號，並大幅改善跨多個基本模型和基準的工具使用決策制定。
+
+##### **AlignFreeze: Navigating the Impact of Realignment on the Layers of Multilingual Models Across Diverse Languages**
+2502.12959v1 by Steve Bakos, Félix Gaschi, David Guzmán, Riddhi More, Kelly Chutong Li, En-Shiun Annie Lee
+
+Realignment techniques are often employed to enhance cross-lingual transfer
+in multilingual language models; still, they can sometimes degrade performance
+in languages that differ significantly from the fine-tuned source language.
+This paper introduces AlignFreeze, a method that freezes either the layers'
+lower half or upper half during realignment. Through controlled experiments on
+4 tasks, 3 models, and in 35 languages, we find that realignment affects all
+the layers but can be the most detrimental to the lower ones. Freezing the
+lower layers can prevent performance degradation. Particularly, AlignFreeze
+improves Part-of-Speech (PoS) tagging performances in languages where full
+realignment fails: with XLM-R, it provides improvements of more than one
+standard deviation in accuracy in seven more languages than full realignment.
+
+摘要：重新對齊技術通常用於增強多語言語言模型中的跨語言轉移，然而，它們有時會降低與微調源語言顯著不同的語言的效能。本文介紹了 AlignFreeze，一種在重新對齊期間凍結層的下半部或上半部的方法。透過 4 項任務、3 個模型和 35 種語言的受控實驗，我們發現重新對齊會影響所有層，但對較低層的影響最大。凍結較低層可以防止效能下降。特別是，AlignFreeze 改善了在完全重新對齊失敗的語言中的詞性 (PoS) 標記效能：使用 XLM-R，它比完全重新對齊在七種語言中提供了超過一個標準差的準確度改進。
+
+##### **Task-Informed Anti-Curriculum by Masking Improves Downstream Performance on Text**
+2502.12953v1 by Andrei Jarca, Florinel Alin Croitoru, Radu Tudor Ionescu
+
+Masked language modeling has become a widely adopted unsupervised technique
+to pre-train language models. However, the process of selecting tokens for
+masking is random, and the percentage of masked tokens is typically fixed for
+the entire training process. In this paper, we propose to adjust the masking
+ratio and to decide which tokens to mask based on a novel task-informed
+anti-curriculum learning scheme. First, we harness task-specific knowledge
+about useful and harmful tokens in order to determine which tokens to mask.
+Second, we propose a cyclic decaying masking ratio, which corresponds to an
+anti-curriculum schedule (from hard to easy). We exemplify our novel
+task-informed anti-curriculum by masking (TIACBM) approach across three diverse
+downstream tasks: sentiment analysis, text classification by topic, and
+authorship attribution. Our findings suggest that TIACBM enhances the ability
+of the model to focus on key task-relevant features, contributing to
+statistically significant performance gains across tasks. We release our code
+at https://github.com/JarcaAndrei/TIACBM.
+
+摘要：遮蔽語言模型已成為一種廣泛採用的無監督技術，用於預先訓練語言模型。然而，選擇用於遮蔽的詞彙的過程是隨機的，且遮蔽詞彙的百分比通常在整個訓練過程中是固定的。在本文中，我們建議調整遮蔽率，並根據一種新穎的任務資訊反課程學習方案來決定要遮蔽哪些詞彙。首先，我們利用任務特定的知識，了解有用的和有害的詞彙，以確定要遮蔽哪些詞彙。其次，我們提出一個循環遞減遮蔽率，這對應於一個反課程表（從難到易）。我們以三項不同的下游任務為例，說明我們新穎的任務資訊反課程遮蔽（TIACBM）方法：情緒分析、按主題分類文字，以及作者歸屬。我們的研究結果表明，TIACBM 增強了模型專注於關鍵任務相關特徵的能力，有助於在各項任務中獲得具有統計意義的效能提升。我們在 https://github.com/JarcaAndrei/TIACBM 釋出我們的程式碼。
+
+##### **Fake It Till You Make It: Using Synthetic Data and Domain Knowledge for Improved Text-Based Learning for LGE Detection**
+2502.12948v1 by Athira J Jacob, Puneet Sharma, Daniel Rueckert
+
+Detection of hyperenhancement from cardiac LGE MRI images is a complex task
+requiring significant clinical expertise. Although deep learning-based models
+have shown promising results for the task, they require large amounts of data
+with fine-grained annotations. Clinical reports generated for cardiac MR
+studies contain rich, clinically relevant information, including the location,
+extent and etiology of any scars present. Although recently developed
+CLIP-based training enables pretraining models with image-text pairs, it
+requires large amounts of data and further finetuning strategies on downstream
+tasks. In this study, we use various strategies rooted in domain knowledge to
+train a model for LGE detection solely using text from clinical reports, on a
+relatively small clinical cohort of 965 patients. We improve performance
+through the use of synthetic data augmentation, by systematically creating scar
+images and associated text. In addition, we standardize the orientation of the
+images in an anatomy-informed way to enable better alignment of spatial and
+text features. We also use a captioning loss to enable fine-grained supervision
+and explore the effect of pretraining of the vision encoder on performance.
+Finally, ablation studies are carried out to elucidate the contributions of
+each design component to the overall performance of the model.
+
+摘要：從心臟 LGE MRI 影像偵測出過度增強是一項複雜的任務，需要顯著的臨床專業知識。儘管基於深度學習的模型已顯示出對這項任務有前景的結果，但它們需要大量具有細緻註解的資料。為心臟 MR 研究產生的臨床報告包含豐富且臨床上相關的資訊，包括任何疤痕的位置、範圍和病因。儘管最近開發的基於 CLIP 的訓練能使用影像文字對預訓練模型，但它需要大量資料和進一步微調下游任務的策略。在這項研究中，我們使用植基於領域知識的各種策略，僅使用來自臨床報告的文字，在一個相對較小的 965 名患者臨床群體中訓練一個 LGE 偵測模型。我們透過使用合成資料擴充來改善效能，系統性地建立疤痕影像和相關文字。此外，我們以解剖學告知的方式標準化影像方向，以使空間和文字特徵能更好地對齊。我們也使用標題損失來啟用細緻的監督，並探討視覺編碼器的預訓練對效能的影響。最後，進行消融研究以闡明每個設計元件對模型整體效能的貢獻。
+
+##### **Every Expert Matters: Towards Effective Knowledge Distillation for Mixture-of-Experts Language Models**
+2502.12947v1 by Gyeongman Kim, Gyouk Chu, Eunho Yang
+
+With the emergence of Mixture-of-Experts (MoE), the efficient scaling of
+model size has accelerated the development of large language models in recent
+years. However, their high memory requirements prevent their use in
+resource-constrained environments. While knowledge distillation (KD) has been a
+proven method for model compression, its application to MoE teacher models
+remains underexplored. Through our investigation, we discover that
+non-activated experts in MoE models possess valuable knowledge that benefits
+student models. We further demonstrate that existing KD methods are not optimal
+for compressing MoE models, as they fail to leverage this knowledge
+effectively. To address this, we propose two intuitive MoE-specific KD methods
+for the first time: Knowledge Augmentation (KA) and Student-Aware Router (SAR),
+both designed to effectively extract knowledge from all experts. Specifically,
+KA augments knowledge by sampling experts multiple times, while SAR uses all
+experts and adjusts the expert weights through router training to provide
+optimal knowledge. Extensive experiments show that our methods outperform
+conventional KD methods, demonstrating their effectiveness for MoE teacher
+models.
+
+摘要：隨著 Mixture-of-Experts (MoE) 的出現，模型規模的有效擴展加速了近年來大型語言模型的發展。然而，它們的高記憶體需求會阻礙它們在資源受限的環境中使用。雖然知識蒸餾 (KD) 已被證明是一種模型壓縮的方法，但它在 MoE 教師模型中的應用仍未被充分探索。透過我們的調查，我們發現 MoE 模型中未被啟用的專家擁有有價值的知識，這些知識對學生模型有益。我們進一步證明，現有的 KD 方法並非壓縮 MoE 模型的最佳方法，因為它們無法有效利用這些知識。為了解決這個問題，我們首次提出兩種直觀的 MoE 專用 KD 方法：知識擴充 (KA) 和學生感知路由器 (SAR)，兩者都旨在從所有專家有效提取知識。具體來說，KA 透過多次抽樣專家來擴充知識，而 SAR 使用所有專家並透過路由器訓練調整專家權重以提供最佳知識。廣泛的實驗表明，我們的模型優於傳統的 KD 模型，證明了它們對 MoE 教師模型的有效性。
+
+##### **LLMPopcorn: An Empirical Study of LLMs as Assistants for Popular Micro-video Generation**
+2502.12945v1 by Junchen Fu, Xuri Ge, Kaiwen Zheng, Ioannis Arapakis, Xin Xin, Joemon M. Jose
+
+Popular Micro-videos, dominant on platforms like TikTok and YouTube, hold
+significant commercial value. The rise of high-quality AI-generated content has
+spurred interest in AI-driven micro-video creation. However, despite the
+advanced capabilities of large language models (LLMs) like ChatGPT and DeepSeek
+in text generation and reasoning, their potential to assist the creation of
+popular micro-videos remains largely unexplored.
+  In this paper, we conduct an empirical study on LLM-assisted popular
+micro-video generation (LLMPopcorn). Specifically, we investigate the following
+research questions: (i) How can LLMs be effectively utilized to assist popular
+micro-video generation? (ii) To what extent can prompt-based enhancements
+optimize the LLM-generated content for higher popularity? (iii) How well do
+various LLMs and video generators perform in the popular micro-video generation
+task? By exploring these questions, we show that advanced LLMs like DeepSeek-V3
+enable micro-video generation to achieve popularity comparable to human-created
+content. Prompt enhancements further boost popularity, and benchmarking
+highlights DeepSeek-V3 and DeepSeek-R1 among LLMs, while LTX-Video and
+HunyuanVideo lead in video generation. This pioneering work advances
+AI-assisted micro-video creation, uncovering new research opportunities. We
+will release the code and datasets to support future studies.
+
+摘要：在 TikTok 和 YouTube 等平台上流行的微影片具有重要的商业价值。高质量 AI 生成的内容的兴起激发了人们对 AI 驱动的微影片创作的兴趣。然而，尽管大型语言模型 (LLM) 如 ChatGPT 和 DeepSeek 在文本生成和推理方面的能力很强，但它们在辅助创建流行微影片方面的潜力在很大程度上仍未得到探索。在本文中，我们对 LLM 辅助的流行微影片生成 (LLMPopcorn) 进行了实证研究。具体来说，我们调查了以下研究问题：(i) 如何有效利用 LLM 来辅助流行微影片生成？(ii) 基于提示的增强在多大程度上可以优化 LLM 生成的内容以获得更高的流行度？(iii) 各种 LLM 和视频生成器在流行的微视频生成任务中表现如何？通过探索这些问题，我们表明了像 DeepSeek-V3 这样的高级 LLM 使微视频生成能够达到与人类创作的内容相当的流行度。提示增强进一步提高了受欢迎程度，并且基准测试突出了 LLM 中的 DeepSeek-V3 和 DeepSeek-R1，而 LTX-Video 和 HunyuanVideo 在视频生成中领先。这项开创性的工作推进了人工智能辅助的微视频创作，发现了新的研究机会。我们将发布代码和数据集以支持未来的研究。
+
+##### **Synthetic Data Generation for Culturally Nuanced Commonsense Reasoning in Low-Resource Languages**
+2502.12932v1 by Salsabila Zahirah Pranida, Rifo Ahmad Genadi, Fajri Koto
+
+Quantifying reasoning capability in low-resource languages remains a
+challenge in NLP due to data scarcity and limited access to annotators. While
+LLM-assisted dataset construction has proven useful for medium- and
+high-resource languages, its effectiveness in low-resource languages,
+particularly for commonsense reasoning, is still unclear. In this paper, we
+compare three dataset creation strategies: (1) LLM-assisted dataset generation,
+(2) machine translation, and (3) human-written data by native speakers, to
+build a culturally nuanced story comprehension dataset. We focus on Javanese
+and Sundanese, two major local languages in Indonesia, and evaluate the
+effectiveness of open-weight and closed-weight LLMs in assisting dataset
+creation through extensive manual validation. To assess the utility of
+synthetic data, we fine-tune language models on classification and generation
+tasks using this data and evaluate performance on a human-written test set. Our
+findings indicate that LLM-assisted data creation outperforms machine
+translation.
+
+摘要：由於資料稀少且標註者有限，量化低資源語言中的推理能力在自然語言處理中仍然是一項挑戰。雖然 LLM 輔助的資料集建構已被證明對中高資源語言有用，但其在低資源語言中的有效性，特別是對於常識推理，仍然不清楚。在本文中，我們比較了三種資料集建立策略：(1) LLM 輔助的資料集生成，(2) 機器翻譯，以及 (3) 母語人士撰寫的人工資料，以建立具有文化細微差的故事理解資料集。我們專注於爪哇語和巽他語，這兩種印尼的主要地方語言，並透過廣泛的手動驗證評估開放權重和封閉權重 LLM 在協助資料集建立中的有效性。為了評估合成資料的效用，我們使用這些資料對分類和生成任務進行語言模型微調，並在人工撰寫的測試集上評估效能。我們的研究結果表明，LLM 輔助的資料建立優於機器翻譯。
+
+##### **Flow-of-Options: Diversified and Improved LLM Reasoning by Thinking Through Options**
+2502.12929v1 by Lakshmi Nair, Ian Trase, Mark Kim
+
+We present a novel reasoning approach called Flow-of-Options (FoO), designed
+to address intrinsic biases in Large Language Models (LLMs). FoO enables LLMs
+to systematically explore a diverse range of possibilities in their reasoning,
+as demonstrated by an FoO-based agentic system for autonomously solving Machine
+Learning tasks (AutoML). Our framework outperforms state-of-the-art baselines,
+achieving improvements of 38.2% - 69.2% on standard data science tasks, and
+37.4% - 47.9% on therapeutic chemistry tasks. With an overall operation cost
+under $1 per task, our framework is well-suited for cost-sensitive
+applications. Beyond classification and regression, we illustrate the broader
+applicability of our FoO-based agentic system to tasks such as reinforcement
+learning and image generation. Our framework presents significant advancements
+compared to current state-of-the-art agentic systems for AutoML, due to the
+benefits of FoO in enforcing diversity in LLM solutions through compressed,
+explainable representations that also support long-term memory when combined
+with case-based reasoning.
+
+摘要：我們提出了一種稱為選項流 (FoO) 的新推理方法，旨在解決大型語言模型 (LLM) 中的內在偏差。FoO 使 LLM 能系統性地探索其推理中的各種可能性，這由一個基於 FoO 的代理系統展示，該系統可自主解決機器學習任務 (AutoML)。我們的框架優於最先進的基準，在標準數據科學任務上取得了 38.2% - 69.2% 的改進，在治療化學任務上取得了 37.4% - 47.9% 的改進。由於每個任務的整體運營成本低於 1 美元，因此我們的框架非常適合對成本敏感的應用。除了分類和回歸之外，我們還說明了基於 FoO 的代理系統在強化學習和圖像生成等任務中的更廣泛適用性。我們的框架與當前最先進的 AutoML 代理系統相比具有顯著的進步，這是因為 FoO 在通過壓縮、可解釋的表示強制 LLM 解決方案的多樣性方面具有優勢，這些表示與基於案例的推理結合時還支持長期記憶。
+
+##### **Finedeep: Mitigating Sparse Activation in Dense LLMs via Multi-Layer Fine-Grained Experts**
+2502.12928v1 by Leiyu Pan, Zhenpeng Su, Minxuan Lv, Yizhe Xiong, Xiangwen Zhang, Zijia Lin, Hui Chen, Jungong Han, Guiguang Ding, Cheng Luo, Di Zhang, Kun Gai, Deyi Xiong
+
+Large language models have demonstrated exceptional performance across a wide
+range of tasks. However, dense models usually suffer from sparse activation,
+where many activation values tend towards zero (i.e., being inactivated). We
+argue that this could restrict the efficient exploration of model
+representation space. To mitigate this issue, we propose Finedeep, a
+deep-layered fine-grained expert architecture for dense models. Our framework
+partitions the feed-forward neural network layers of traditional dense models
+into small experts and arranges them across multiple sub-layers. A novel routing
+mechanism is proposed to determine each expert's contribution. We conduct
+extensive experiments across various model sizes, demonstrating that our
+approach significantly outperforms traditional dense architectures in terms of
+perplexity and benchmark performance while maintaining a comparable number of
+parameters and floating-point operations. Moreover, we find that Finedeep
+achieves optimal results when balancing depth and width, specifically by
+adjusting the number of expert sub-layers and the number of experts per
+sub-layer. Empirical results confirm that Finedeep effectively alleviates
+sparse activation and efficiently utilizes representation capacity in dense
+models.
+
+摘要：大型語言模型在各種任務中展現出非凡的效能。然而，密集模型通常會出現稀疏激活，其中許多激活值趨近於零（即處於非激活狀態）。我們認為這可能會限制模型表示空間的有效探索。為了減輕這個問題，我們提出 Finedeep，這是一種針對密集模型的深度分層細粒度專家架構。我們的框架將傳統密集模型的前饋神經網路層分割成小型專家，並將它們排列在多個子層中。我們提出了一種新穎的路由機制來確定每個專家的貢獻。我們針對各種模型大小進行了廣泛的實驗，證明我們的做法在困惑度和基準效能方面顯著優於傳統的密集架構，同時保持了相當數量的參數和浮點運算。此外，我們發現 Finedeep 在平衡深度和廣度時可以達到最佳結果，特別是透過調整專家子層的數量和每個子層的專家數量。實證結果證實，Finedeep 有效地減輕了稀疏激活，並有效利用了密集模型中的表示能力。
+
+##### **SEFL: Harnessing Large Language Model Agents to Improve Educational Feedback Systems**
+2502.12927v1 by Mike Zhang, Amalie Pernille Dilling, Léon Gondelman, Niels Erik Ruan Lyngdorf, Euan D. Lindsay, Johannes Bjerva
+
+Providing high-quality feedback is crucial for student success but is
+constrained by time, cost, and limited data availability. We introduce
+Synthetic Educational Feedback Loops (SEFL), a novel framework designed to
+deliver immediate, on-demand feedback at scale without relying on extensive,
+real-world student data. In SEFL, two large language models (LLMs) operate in
+teacher-student roles to simulate assignment completion and formative
+feedback, generating abundant synthetic pairs of student work and corresponding
+critiques. We then fine-tune smaller, more computationally efficient LLMs on
+these synthetic pairs, enabling them to replicate key features of high-quality,
+goal-oriented feedback. Unlike personalized tutoring approaches that offer
+multi-turn, individualized instruction, SEFL specifically focuses on
+replicating the teacher-to-student feedback loop for diverse assignments.
+Through both LLM-as-a-judge and human evaluations, we demonstrate that
+SEFL-tuned models outperform their non-tuned counterparts in feedback quality,
+clarity, and timeliness. These findings reveal SEFL's potential to transform
+feedback processes for higher education and beyond, offering an ethical and
+scalable alternative to conventional manual feedback cycles.
+
+摘要：提供高品質的回饋對於學生的成功至關重要，但受到時間、成本和資料取得有限的限制。我們引入了合成教育回饋迴圈 (SEFL)，這是一個新穎的架構，旨在提供立即且依需求的回饋，且無需仰賴大量的真實世界學生資料。在 SEFL 中，兩個大型語言模型 (LLM) 以師生角色運作，模擬作業完成和形成性回饋，產生大量的合成學生作業和對應的評論。然後我們針對這些合成配對微調較小、計算效率較高的 LLM，讓它們能夠複製高品質、目標導向回饋的主要特徵。與提供多回合、個別化教學的個人化輔導方法不同，SEFL 特別專注於複製適用於各種作業的教師-->學生回饋迴圈。透過 LLM 作為評審和人類評估，我們證明了 SEFL 微調模型在回饋品質、清晰度和時效性方面優於未微調的模型。這些發現揭示了 SEFL 轉變高等教育及其他領域回饋流程的潛力，提供了一個符合道德且可擴充的替代方案，取代傳統的手動回饋週期。
+
+##### **Towards more Contextual Agents: An extractor-Generator Optimization Framework**
+2502.12926v1 by Mourad Aouini, Jinan Loubani
+
+Large Language Model (LLM)-based agents have demonstrated remarkable success
+in solving complex tasks across a wide range of general-purpose applications.
+However, their performance often degrades in context-specific scenarios, such
+as specialized industries or research domains, where the absence of
+domain-relevant knowledge leads to imprecise or suboptimal outcomes. To address
+this challenge, our work introduces a systematic approach to enhance the
+contextual adaptability of LLM-based agents by optimizing their underlying
+prompts, the critical components that govern agent behavior, roles, and
+interactions. Manually crafting optimized prompts for context-specific tasks is
+labor-intensive, error-prone, and lacks scalability. In this work, we introduce
+an Extractor-Generator framework designed to automate the optimization of
+contextual LLM-based agents. Our method operates through two key stages: (i)
+feature extraction from a dataset of gold-standard input-output examples, and
+(ii) prompt generation via a high-level optimization strategy that iteratively
+identifies underperforming cases and applies self-improvement techniques. This
+framework substantially improves prompt adaptability by enabling more precise
+generalization across diverse inputs, particularly in context-specific tasks
+where maintaining semantic consistency and minimizing error propagation are
+critical for reliable performance. Although developed with single-stage
+workflows in mind, the approach naturally extends to multi-stage workflows,
+offering broad applicability across various agent-based systems. Empirical
+evaluations demonstrate that our framework significantly enhances the
+performance of prompt-optimized agents, providing a structured and efficient
+approach to contextual LLM-based agents.
+
+摘要：大型語言模型 (LLM) 為基礎的代理已展現出非凡的成功，能解決廣泛一般用途應用程式的複雜任務。然而，它們的效能通常會在特定情境中下降，例如專門產業或研究領域，其中缺乏與領域相關知識會導致不精確或次佳的結果。為了解決這項挑戰，我們的研究引進了一種系統化的方法來增強 LLM 為基礎的代理的情境適應性，方法是最佳化它們的基礎提示，這些提示是決定代理行為、角色和互動的重要組成部分。手動製作最佳化的提示以應對特定情境的任務既費時又容易出錯，而且缺乏可擴充性。在這項研究中，我們引進一個萃取產生器架構，旨在自動化情境 LLM 為基礎代理的最佳化。我們的方法透過兩個關鍵階段運作：(i) 從黃金標準輸入輸出範例的資料集萃取特徵，以及 (ii) 透過高階最佳化策略產生提示，此策略會反覆找出表現不佳的案例並套用自我改善技術。此架構大幅改善了提示適應性，讓它能針對不同的輸入進行更精確的概括，特別是在情境特定任務中，在這些任務中，維持語意一致性和將錯誤傳播降至最低對於可靠的效能至關重要。儘管是針對單階段工作流程開發，但此方法自然能延伸至多階段工作流程，在各種基於代理的系統中提供廣泛的適用性。實證評估顯示，我們的架構大幅增強了提示最佳化代理的效能，為基於情境的 LLM 代理提供了一個結構化且有效率的方法。
+
+##### **Keep what you need : extracting efficient subnetworks from large audio representation models**
+2502.12925v1 by David Genova, Philippe Esling, Tom Hurlin
+
+Recently, research on audio foundation models has witnessed notable advances,
+as illustrated by the ever improving results on complex downstream tasks.
+Subsequently, those pretrained networks have quickly been used for various
+audio applications. These improvements have however resulted in a considerable
+increase both in size and complexity of these models. Along with the environmental
+concerns this issue raises, this prevents the deployment of such networks on
+consumer-level devices, and precludes their use for real-time applications.
+Moreover, this appears contradictory with the specificity of the tasks for
+which these models are used, which are often simpler compared to extracting a
+rich, multi-purpose representation from any type of audio data. In this paper,
+we address this issue with a simple, yet effective method to extract
+lightweight specialist subnetworks from large foundation models. Specifically,
+we introduce learnable binary masks in-between the layers of a pretrained
+representation model. When training the end-to-end model on a downstream task,
+we add a sparsity-inducing loss to the overall objective, hence learning a
+compact subnetwork specialized on a single task. Importantly, the weights of
+the foundation model are kept frozen, resulting in low additional training
+costs. Once trained, the masked computational units can then be removed from
+the network, implying significant performance gains. We assess our method on
+three widespread audio foundation models, each based on a different backbone
+architecture, and illustrate its effectiveness on common audio representation
+evaluation tasks, as well as its versatility across speech, music, and general
+audio. Code for reproducing the results and supporting webpage are available at
+https://github.com/gnvIRCAM/Audio-representation-trimming
+
+摘要：近期，音频基础模型的研究取得了显著进展，
+复杂的下游任务上不断提升的结果证明了这一点。
+随后，这些预训练网络已迅速用于各种
+音频应用程序。然而，这些改进导致了这些模型的尺寸和复杂性都大幅
+增加。除了由此产生的环境问题外，这也阻止了此类网络在
+消费者级设备上的部署，并排除了它们在实时应用程序中的使用。
+此外，这似乎与这些模型的使用任务的特殊性相矛盾，与从任何类型的音频数据中提取丰富的多用途表示相比，这些任务通常更简单。在本文中，
+我们通过一种简单但有效的方法来解决此问题，从大型基础模型中提取轻量级专家子网络。具体来说，
+我们在预训练表示模型的层之间引入了可学习的二进制掩码。当在某个下游任务上训练端到端模型时，
+我们在总体目标中添加了稀疏性诱导损失，从而学习到专门用于单个任务的紧凑型子网络。重要的是，
+基础模型的权重保持冻结，从而导致额外的训练成本低。一旦训练完成，就可以从网络中移除掩码的计算单元，这意味着性能将大幅提升。我们对三个广泛使用的音频基础模型评估了我们的方法，每个模型都基于不同的骨干架构，并说明了其在常见音频表示评估任务上的有效性，以及其在语音、音乐和通用音频上的多功能性。用于重现结果的代码和支持网页可在
+https://github.com/gnvIRCAM/Audio-representation-trimming 获得
+
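The masking idea in the abstract above can be illustrated with a small sketch. This is not the authors' implementation: the sigmoid gates, the 0.5 threshold, the mean-gate penalty, and all names (`binarize`, `sparsity_penalty`, `prune`) are assumptions chosen for illustration, and the training loop that would learn the gate logits is omitted.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def binarize(logits, threshold=0.5):
    """Turn gate logits into a hard 0/1 mask (forward pass only)."""
    return [1 if sigmoid(z) > threshold else 0 for z in logits]

def sparsity_penalty(logits):
    """Mean gate probability; adding it to a task loss pushes gates toward 0."""
    return sum(sigmoid(z) for z in logits) / len(logits)

def masked_layer(x, weights, mask):
    """Frozen linear layer whose output units are gated by the mask."""
    return [m * sum(w * xi for w, xi in zip(row, x)) for row, m in zip(weights, mask)]

def prune(weights, mask):
    """Drop the rows (units) whose gate is 0, yielding a smaller layer."""
    return [row for row, m in zip(weights, mask) if m == 1]

weights = [[0.5, -1.0], [2.0, 0.0], [1.0, 1.0]]  # frozen: 3 units x 2 inputs
gate_logits = [3.0, -4.0, 0.5]                     # hypothetical learned values
mask = binarize(gate_logits)
x = [1.0, 2.0]

full = masked_layer(x, weights, mask)                    # masked unit outputs 0
small = masked_layer(x, prune(weights, mask), [1, 1])    # masked unit removed
```

The pruned layer reproduces the surviving outputs of the masked layer while computing one fewer unit, which is where the inference savings would come from in this toy setting.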
+##### **Conditioning LLMs to Generate Code-Switched Text: A Methodology Grounded in Naturally Occurring Data**
+2502.12924v1 by Maite Heredia, Gorka Labaka, Jeremy Barnes, Aitor Soroa
+
+Code-switching (CS) is still a critical challenge in Natural Language
+Processing (NLP). Current Large Language Models (LLMs) struggle to interpret
+and generate code-switched text, primarily due to the scarcity of large-scale
+CS datasets for training. This paper presents a novel methodology to generate
+CS data using LLMs and tests it on the English-Spanish language pair. We
+propose back-translating natural CS sentences into monolingual English, and
+using the resulting parallel corpus to fine-tune LLMs to turn monolingual
+sentences into CS. Unlike previous approaches to CS generation, our methodology
+uses natural CS data as a starting point, allowing models to learn its natural
+distribution beyond grammatical patterns. We thoroughly analyse the models'
+performance through a study on human preferences, a qualitative error analysis
+and an evaluation with popular automatic metrics. Results show that our
+methodology generates fluent code-switched text, expanding research
+opportunities in CS communication, and that traditional metrics do not
+correlate with human judgement when assessing the quality of the generated CS
+data. We release our code and generated dataset under a CC-BY-NC-SA license.
+
+摘要：代碼轉換（CS）在自然語言處理（NLP）中仍是一個嚴峻的挑戰。目前的巨量語言模型（LLM）難以解讀和生成代碼轉換文字，主要是因為缺乏用於訓練的大規模 CS 資料集。本文提出了一種使用 LLM 生成 CS 資料的新方法，並在英語-西班牙語語言對上進行測試。我們建議將自然 CS 句子反向翻譯成單語英語，並使用產生的平行語料庫微調 LLM，將單語句子轉換為 CS。與先前的 CS 生成方法不同，我們的技術使用自然 CS 資料作為起點，讓模型能夠學習其超越語法模式的自然分佈。我們透過研究人類偏好、定性錯誤分析和使用流行的自動化指標進行評估，徹底分析模型的效能。結果顯示，我們的技術可以生成流利的代碼轉換文字，擴展 CS 溝通的研究機會，而且在評估生成的 CS 資料品質時，傳統指標與人類判斷無關。我們在 CC-BY-NC-SA 授權下釋出我們的程式碼和生成的資料集。
+
+##### **On-Device LLMs for Home Assistant: Dual Role in Intent Detection and Response Generation**
+2502.12923v1 by Rune Birkmose, Nathan Mørkeberg Reece, Esben Hofstedt Norvin, Johannes Bjerva, Mike Zhang
+
+This paper investigates whether Large Language Models (LLMs), fine-tuned on
+synthetic but domain-representative data, can perform the twofold task of (i)
+slot and intent detection and (ii) natural language response generation for a
+smart home assistant, while running solely on resource-limited, CPU-only edge
+hardware. We fine-tune LLMs to produce both JSON action calls and text
+responses. Our experiments show that 16-bit and 8-bit quantized variants
+preserve high accuracy on slot and intent detection and maintain strong
+semantic coherence in generated text, whereas the 4-bit model, while retaining
+generative fluency, suffers a noticeable drop in device-service classification
+accuracy. Further evaluations on noisy human (non-synthetic) prompts and
+out-of-domain intents confirm the models' generalization ability, obtaining
+around 80-86% accuracy. While the average inference time is 5-6 seconds per
+query (acceptable for one-shot commands but suboptimal for multi-turn
+dialogue), our results affirm that an on-device LLM can effectively unify
+command interpretation and flexible response generation for home automation
+without relying on specialized hardware.
+
+摘要：本文探討微調於合成但具領域代表性的資料上的大型語言模型 (LLM)，是否能執行 (i) 槽位和意圖偵測，以及 (ii) 自然語言回應產生的雙重任務，同時僅在資源受限、僅 CPU 的邊緣硬體上執行。我們微調 LLM 以產生 JSON 動作呼叫和文字回應。我們的實驗顯示，16 位元和 8 位元量化的變體在槽位和意圖偵測上保持高準確度，並在產生的文字中維持強大的語意一致性，而 4 位元模型雖然保有生成流暢度，但在裝置服務分類準確度上卻有明顯下降。進一步對有雜訊的人類 (非合成) 提示和領域外意圖的評估，證實了模型的泛化能力，獲得約 80--86% 的準確度。雖然平均推論時間為每個查詢 5--6 秒，對於一次性命令來說是可以接受的，但對於多輪對話來說並不理想，但我們的結果證實，裝置上的 LLM 可以有效地統一命令解譯和彈性回應產生，以進行家庭自動化，而無需依賴專用硬體。
+
+##### **Q-STRUM Debate: Query-Driven Contrastive Summarization for Recommendation Comparison**
+2502.12921v1 by George-Kirollos Saad, Scott Sanner
+
+Query-driven recommendation with unknown items poses a challenge for users to
+understand why certain items are appropriate for their needs. Query-driven
+Contrastive Summarization (QCS) is a methodology designed to address this issue
+by leveraging language-based item descriptions to clarify contrasts between
+them. However, existing state-of-the-art contrastive summarization methods such
+as STRUM-LLM fall short of this goal. To overcome these limitations, we
+introduce Q-STRUM Debate, a novel extension of STRUM-LLM that employs
+debate-style prompting to generate focused and contrastive summarizations of
+item aspects relevant to a query. Leveraging modern large language models
+(LLMs) as powerful tools for generating debates, Q-STRUM Debate provides
+enhanced contrastive summaries. Experiments across three datasets demonstrate
+that Q-STRUM Debate yields significant performance improvements over existing
+methods on key contrastive summarization criteria, thus introducing a novel and
+performant debate prompting methodology for QCS.
+
+摘要：以未知項目進行的查詢驅動推薦對使用者來說是一項挑戰，他們難以理解為何某些項目適合自己的需求。查詢驅動對比摘要 (QCS) 是一種方法，旨在透過利用基於語言的項目描述來釐清項目之間的對比，以解決這個問題。然而，現有的最先進對比摘要方法（例如 STRUM-LLM）並未達成此目標。為了克服這些限制，我們引進 Q-STRUM Debate，一種 STRUM-LLM 的新延伸，它採用辯論式提示來產生與查詢相關的項目面向的重點式對比摘要。透過利用現代大型語言模型 (LLM) 作為產生辯論的強大工具，Q-STRUM Debate 提供增強的對比摘要。透過三個資料集的實驗證明，Q-STRUM Debate 在關鍵的對比摘要標準上，比現有方法有顯著的效能改善，因此為 QCS 引進一種新穎且高性能的辯論提示方法。
+
+##### **GSQ-Tuning: Group-Shared Exponents Integer in Fully Quantized Training for LLMs On-Device Fine-tuning**
+2502.12913v1 by Sifan Zhou, Shuo Wang, Zhihang Yuan, Mingjia Shi, Yuzhang Shang, Dawei Yang
+
+Large Language Model (LLM) fine-tuning technologies have achieved
+remarkable results. However, traditional LLM fine-tuning approaches face
+significant challenges: they require extensive floating-point (FP) computation,
+raising privacy concerns when handling sensitive data, and are impractical for
+resource-constrained edge devices. While Parameter-Efficient Fine-Tuning (PEFT)
+techniques reduce trainable parameters, their reliance on floating-point
+arithmetic creates fundamental incompatibilities with edge hardware. In this
+work, we introduce a novel framework for on-device LLM fine-tuning that
+eliminates the need for floating-point operations in both inference and
+training, named GSQ-Tuning. At its core is the Group-Shared Exponents Integer
+format, which efficiently represents model parameters in integer format using
+shared exponents among parameter groups. When combined with LoRA-like adapters,
+this enables fully integer-based fine-tuning that is both memory and compute
+efficient. We demonstrate that our approach achieves accuracy comparable to
+FP16-based fine-tuning while significantly reducing memory usage (50%).
+Moreover, compared to FP8, our method can reduce power consumption by 5x and
+chip area by 11x at the same performance, making large-scale model adaptation feasible
+on edge devices.
+
+摘要：大型语言模型 (LLM) 微调技术已取得显著成果。然而，传统的 LLM 微调方法面临着严峻的挑战：它们需要大量的浮点 (FP) 计算，在处理敏感数据时会引发隐私问题，并且对于资源受限的边缘设备而言不切实际。虽然参数高效微调 (PEFT) 技术减少了可训练参数，但它们对浮点运算的依赖与边缘硬件产生了根本上的不兼容性。在这项工作中，我们引入了一个用于设备上 LLM 微调的新框架，该框架消除了推理和训练中对浮点运算的需求，名为 GSQ-Tuning。其核心是组共享指数整数格式，该格式使用参数组之间的共享指数以整数格式有效地表示模型参数。当与类似 LoRA 的适配器相结合时，这实现了完全基于整数的微调，既节省内存又节省计算。我们证明了我们的方法实现了与基于 FP16 的微调相当的准确性，同时显著减少了内存使用量 (50%)。此外，与 FP8 相比，我们的方法可以在相同的性能下减少 5 倍的功耗和 11 倍的芯片面积，从而使大规模模型适应在边缘设备上成为可能。
+
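To make the idea of integers with a shared exponent concrete, here is a minimal sketch. It is an assumption-laden illustration, not the GSQ-Tuning format itself: the power-of-two scale, the group size of 4, the 8-bit signed range, and the function names are all hypothetical choices.

```python
import math

def quantize_group(values, bits=8):
    """Quantize one group to signed integers sharing one power-of-two exponent."""
    qmax = 2 ** (bits - 1) - 1
    peak = max(abs(v) for v in values)
    if peak == 0:
        return [0] * len(values), 0
    # Smallest exponent e such that peak / 2**e fits in [-qmax, qmax].
    exponent = math.ceil(math.log2(peak / qmax))
    scale = 2.0 ** exponent
    ints = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return ints, exponent

def dequantize_group(ints, exponent):
    """Recover approximate real values from integers and the shared exponent."""
    return [q * 2.0 ** exponent for q in ints]

def quantize(values, group_size=4, bits=8):
    """Split the parameters into groups and quantize each group independently."""
    groups = [values[i:i + group_size] for i in range(0, len(values), group_size)]
    return [quantize_group(g, bits) for g in groups]

params = [0.02, -0.5, 0.31, 0.127, 12.0, -3.5, 7.25, 0.0]
packed = quantize(params)
restored = [v for ints, e in packed for v in dequantize_group(ints, e)]
```

Each group stores only small integers plus one exponent, so a group of small weights keeps fine resolution even when another group holds much larger values.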
+##### **Knapsack Optimization-based Schema Linking for LLM-based Text-to-SQL Generation**
+2502.12911v1 by Zheng Yuan, Hao Chen, Zijin Hong, Qinggang Zhang, Feiran Huang, Xiao Huang
+
+Generating SQLs from user queries is a long-standing challenge, where the
+accuracy of initial schema linking significantly impacts subsequent SQL
+generation performance. However, current schema linking models still struggle
+with missing relevant schema elements or an excess of redundant ones. A crucial
+reason for this is that commonly used metrics, recall and precision, fail to
+capture missing relevant elements and thus cannot reflect actual schema linking
+performance. Motivated by this, we propose an enhanced schema linking metric by
+introducing a restricted missing indicator. Accordingly, we introduce Knapsack
+optimization-based Schema Linking Agent (KaSLA), a plug-in schema linking agent
+designed to prevent the missing of relevant schema elements while minimizing
+the inclusion of redundant ones. KaSLA employs a hierarchical linking strategy
+that first identifies the optimal table linking and subsequently links columns
+within the selected table to reduce linking candidate space. In each linking
+process, it utilizes a knapsack optimization approach to link potentially
+relevant elements while accounting for a limited tolerance of potential
+redundant ones. With this optimization, KaSLA-1.6B achieves superior schema
+linking results compared to large-scale LLMs, including deepseek-v3 with
+a state-of-the-art (SOTA) schema linking method. Extensive experiments on Spider
+and BIRD benchmarks verify that KaSLA can significantly improve the SQL
+generation performance of SOTA text-to-SQL models by substituting their schema
+linking processes.
+
+摘要：從使用者查詢中產生 SQL 是個長期的挑戰，其中初始架構連結的準確性會顯著影響後續 SQL 產生效能。然而，目前的架構連結模型仍難以處理遺漏相關架構元素或過多重複元素的問題。造成此問題的一個關鍵原因是，常用的指標召回率和精確度無法捕捉遺漏相關元素，因此無法反映實際的架構連結效能。有鑑於此，我們提出一個增強的架構連結指標，透過引入受限遺漏指標。因此，我們介紹基於背包最佳化的架構連結代理 (KaSLA)，這是一個外掛式架構連結代理，旨在防止遺漏相關架構元素，同時將重複元素的納入降至最低。KaSLA 採用分層連結策略，首先找出最佳的表格連結，然後連結所選表格中的欄位，以減少連結候選空間。在每個連結過程中，它利用背包最佳化方法連結潛在相關元素，同時考量對潛在重複元素的容忍度。透過此最佳化，KaSLA-1.6B 達到優於大規模 LLM 的架構連結結果，包括採用最先進 (SOTA) 架構連結方法的 deepseek-v3。在 Spider 和 BIRD 基準上的廣泛實驗驗證，KaSLA 可透過取代其架構連結流程，大幅提升 SOTA 文字轉 SQL 模型的 SQL 產生效能。
+
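The abstract above frames schema linking as a knapsack problem. A minimal sketch of that framing follows, using a textbook 0/1 knapsack. The relevance scores, integer redundancy costs, budget, and column names are made-up assumptions, and how KaSLA actually derives its values and weights is not stated in the abstract.

```python
def knapsack_select(candidates, capacity):
    """0/1 knapsack: maximise total relevance under a redundancy budget.

    candidates: list of (name, relevance, cost) with integer cost >= 0.
    Returns the chosen names in input order.
    """
    n = len(candidates)
    # best[i][c] = best relevance using the first i candidates with budget c
    best = [[0.0] * (capacity + 1) for _ in range(n + 1)]
    for i, (_, relevance, cost) in enumerate(candidates, start=1):
        for c in range(capacity + 1):
            best[i][c] = best[i - 1][c]
            if cost <= c:
                best[i][c] = max(best[i][c], best[i - 1][c - cost] + relevance)
    # Walk back through the table to recover the selected items.
    chosen, c = [], capacity
    for i in range(n, 0, -1):
        name, _, cost = candidates[i - 1]
        if best[i][c] != best[i - 1][c]:
            chosen.append(name)
            c -= cost
    return chosen[::-1]

# Hypothetical columns: (name, estimated relevance, redundancy cost)
columns = [
    ("orders.id", 0.9, 1),
    ("orders.customer_id", 0.8, 1),
    ("orders.internal_note", 0.2, 3),
    ("orders.total", 0.7, 2),
]
selected = knapsack_select(columns, capacity=4)
```

With a budget of 4, the low-relevance, high-cost column is left out while the three useful columns fit exactly, which mirrors the goal of keeping relevant elements while limiting redundant ones.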
+##### **Graph Neural Networks for Databases: A Survey**
+2502.12908v1 by Ziming Li, Youhuan Li, Yuyu Luo, Guoliang Li, Chuxu Zhang
+
+Graph neural networks (GNNs) are powerful deep learning models for
+graph-structured data, demonstrating remarkable success across diverse domains.
+Recently, the database (DB) community has increasingly recognized the
+potential of GNNs, prompting a surge of research focusing on improving
+database systems through GNN-based approaches. However, despite notable
+advances, there is a lack of a comprehensive review and understanding of how
+GNNs could improve DB systems. Therefore, this survey aims to bridge this gap
+by providing a structured and in-depth overview of GNNs for DB systems.
+Specifically, we propose a new taxonomy that classifies existing methods into
+two key categories: (1) Relational Databases, which includes tasks like
+performance prediction, query optimization, and text-to-SQL, and (2) Graph
+Databases, addressing challenges like efficient graph query processing and
+graph similarity computation. We systematically review key methods in each
+category, highlighting their contributions and practical implications. Finally,
+we suggest promising avenues for integrating GNNs into database systems.
+
+摘要：圖形神經網路 (GNN) 是用於圖形結構資料的強大深度學習模型，在各種領域中展現出顯著的成功。最近，資料庫 (DB) 社群越來越認識到 GNN 的潛力，促使大量研究專注於透過基於 GNN 的方法來改善資料庫系統。然而，儘管有顯著的進展，但對於 GNN 如何改善資料庫系統，仍然缺乏全面的回顧和理解。因此，本調查旨在透過提供 GNN 在資料庫系統中的結構化且深入的概觀來彌補這個差距。具體來說，我們提出了一個新的分類法，將現有方法分類為兩個主要類別：(1) 關係資料庫，其中包括效能預測、查詢最佳化和文字轉 SQL 等任務，以及 (2) 圖形資料庫，用於處理高效圖形查詢處理和圖形相似度計算等挑戰。我們系統性地回顧了每個類別中的關鍵方法，重點說明其貢獻和實務意涵。最後，我們建議將 GNN 整合到資料庫系統中的有希望途徑。
+
+##### **Fraud-R1 : A Multi-Round Benchmark for Assessing the Robustness of LLM Against Augmented Fraud and Phishing Inducements**
+2502.12904v1 by Shu Yang, Shenzhe Zhu, Zeyu Wu, Keyu Wang, Junchi Yao, Junchao Wu, Lijie Hu, Mengdi Li, Derek F. Wong, Di Wang
+
+We introduce Fraud-R1, a benchmark designed to evaluate LLMs' ability to
+defend against internet fraud and phishing in dynamic, real-world scenarios.
+Fraud-R1 comprises 8,564 fraud cases sourced from phishing scams, fake job
+postings, social media, and news, categorized into 5 major fraud types. Unlike
+previous benchmarks, Fraud-R1 introduces a multi-round evaluation pipeline to
+assess LLMs' resistance to fraud at different stages, including credibility
+building, urgency creation, and emotional manipulation. Furthermore, we
+evaluate 15 LLMs under two settings: 1. Helpful-Assistant, where the LLM
+provides general decision-making assistance, and 2. Role-play, where the model
+assumes a specific persona, widely used in real-world agent-based interactions.
+Our evaluation reveals the significant challenges in defending against fraud
+and phishing inducement, especially in role-play settings and fake job
+postings. Additionally, we observe a substantial performance gap between
+Chinese and English, underscoring the need for improved multilingual fraud
+detection capabilities.
+
+摘要：我們推出 Fraud-R1，一個基準，旨在評估 LLM 在動態、真實世界場景中防範網路詐騙和網路釣魚的能力。Fraud-R1 包含 8,564 起詐騙案例，來源包括網路釣魚詐騙、虛假職缺、社群媒體和新聞，分類為 5 種類型的主要詐騙手法。與先前的基準不同，Fraud-R1 引入多輪評估管道，以評估 LLM 在不同階段對詐騙的抵抗力，包括建立信譽、製造急迫感和情感操縱。此外，我們在兩種設定下評估 15 個 LLM：1. 協助助理，其中 LLM 提供一般決策協助，以及 2. 角色扮演，其中模型假設特定角色，廣泛用於現實世界中基於代理的互動。我們的評估揭示了在防範詐騙和網路釣魚誘導方面面臨的重大挑戰，尤其是在角色扮演設定和虛假職缺中。此外，我們觀察到中文和英文之間有顯著的效能差距，這凸顯了改進多語言詐騙偵測功能的必要性。
+
+##### **Soundwave: Less is More for Speech-Text Alignment in LLMs**
+2502.12900v1 by Yuhao Zhang, Zhiheng Liu, Fan Bu, Ruiyu Zhang, Benyou Wang, Haizhou Li
+
+Existing end-to-end speech large language models (LLMs) usually rely on
+large-scale annotated data for training, while data-efficient training has not
+been discussed in depth. We focus on two fundamental problems between speech
+and text: the representation space gap and sequence length inconsistency. We
+propose Soundwave, which utilizes an efficient training strategy and a novel
+architecture to address these issues. Results show that Soundwave outperforms
+the advanced Qwen2-Audio in speech translation and AIR-Bench speech tasks,
+using only one-fiftieth of the training data. Further analysis shows that
+Soundwave still retains its intelligence during conversation. The project is
+available at https://github.com/FreedomIntelligence/Soundwave.
+
+摘要：現有的端對端語音大型語言模型 (LLM) 通常依賴於大規模註釋資料進行訓練，而資料有效率的訓練尚未深入探討。我們專注於語音和文字之間的兩個基本問題：表示空間差距和序列長度不一致。我們提出 Soundwave，它利用高效的訓練策略和新穎的架構來解決這些問題。結果顯示，Soundwave 在語音翻譯和 AIR-Bench 語音任務中優於進階的 Qwen2-Audio，僅使用五十分之一的訓練資料。進一步的分析顯示，Soundwave 在對話中仍能保持其智慧。專案可於 https://github.com/FreedomIntelligence/Soundwave 取得。
+
+##### **None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks**
+2502.12896v1 by Eva Sánchez Salido, Julio Gonzalo, Guillermo Marco
+
+In LLM evaluations, reasoning is often distinguished from recall/memorization
+by performing numerical variations to math-oriented questions. Here we
+introduce a general variation method for multiple-choice questions that
+completely dissociates the correct answer from previously seen tokens or
+concepts, requiring LLMs to understand and reason (rather than memorizing) in
+order to answer correctly. Using this method, we evaluate state-of-the-art
+proprietary and open-source LLMs on two datasets available in English and
+Spanish: the public MMLU benchmark and the private UNED-Access 2024 dataset.
+Results show that all models experience remarkable accuracy drops under our
+proposed variation, with an average loss of 57% on MMLU and 50% on UNED-Access
+2024, ranging from 10% to 93% across models. Notably, the most accurate model
+in our experimentation (OpenAI-o3-mini) is not the most robust
+(DeepSeek-R1-70B), suggesting that the best models in standard evaluations may
+not be the ones with better reasoning capabilities. Also, we see larger
+accuracy drops in public (vs private) datasets and questions posed in their
+original language (vs a manual translation), which are signs of contamination
+and also point to a relevant role of recall/memorization in current LLMs'
+answers.
+
+摘要：在 LLM 評估中，推理通常透過對數學導向問題進行數值變異來區別於回憶/記憶。在此，我們引入一種通用變異方法，適用於多選題，它將正確答案與先前看到的代幣或概念完全區分開來，要求 LLM 理解和推理（而不是記憶），以便正確回答。使用此方法，我們在英語和西班牙語中評估了兩種數據集中的最先進的專有和開源 LLM：公共 MMLU 基準和私有 UNED-Access 2024 數據集。結果表明，在我們提出的變異下，所有模型的準確度都出現顯著下降，在 MMLU 上平均損失 57%，在 UNED-Access 2024 上平均損失 50%，在不同模型中範圍從 10% 到 93%。值得注意的是，我們實驗中最準確的模型（OpenAI-o3-mini）並不是最穩健的模型（DeepSeek-R1-70B），這表明標準評估中最好的模型可能不是推理能力最強的模型。此外，我們看到公共（相對於私有）數據集和以原始語言提出的問題（相對於人工翻譯）的準確度下降幅度更大，這是汙染的跡象，也表明回憶/記憶在當前 LLM 的答案中發揮著相關作用。
+
+##### **Multilingual European Language Models: Benchmarking Approaches and Challenges**
+2502.12895v1 by Fabio Barth, Georg Rehm
+
+The breakthrough of generative large language models (LLMs) that can solve
+different tasks through chat interaction has led to a significant increase in
+the use of general benchmarks to assess the quality or performance of these
+models beyond individual applications. There is also a need for better methods
+to evaluate and also to compare models due to the ever increasing number of new
+models published. However, most of the established benchmarks revolve around
+the English language. This paper analyses the benefits and limitations of
+current evaluation datasets, focusing on multilingual European benchmarks. We
+analyse seven multilingual benchmarks and identify four major challenges.
+Furthermore, we discuss potential solutions to enhance translation quality and
+mitigate cultural biases, including human-in-the-loop verification and
+iterative translation ranking. Our analysis highlights the need for culturally
+aware and rigorously validated benchmarks to assess the reasoning and
+question-answering capabilities of multilingual LLMs accurately.
+
+摘要：生成式大型語言模型 (LLM) 的突破，它能透過聊天互動解決不同任務，這導致使用一般基準來評估這些模型在個別應用程式以外的品質或效能大幅增加。由於已發布的新模型數量不斷增加，因此也有必要採用更好的方法來評估模型並進行比較。然而，大多數已建立的基準都圍繞著英語。本文分析了目前評估資料集的優點和限制，重點放在多語言歐洲基準。我們分析了七個多語言基準，並找出四個主要的挑戰。此外，我們討論了增強翻譯品質和減輕文化偏見的潛在解決方案，包括人為迴圈驗證和反覆翻譯排名。我們的分析突顯了對文化意識和嚴格驗證的基準的需求，以準確評估多語言 LLM 的推理和問答能力。
+
+##### **H-CoT: Hijacking the Chain-of-Thought Safety Reasoning Mechanism to Jailbreak Large Reasoning Models, Including OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking**
+2502.12893v1 by Martin Kuo, Jianyi Zhang, Aolin Ding, Qinsi Wang, Louis DiValentin, Yujia Bao, Wei Wei, Da-Cheng Juan, Hai Li, Yiran Chen
+
+Large Reasoning Models (LRMs) have recently extended their powerful reasoning
+capabilities to safety checks, using chain-of-thought reasoning to decide
+whether a request should be answered. While this new approach offers a
+promising route for balancing model utility and safety, its robustness remains
+underexplored. To address this gap, we introduce Malicious-Educator, a
+benchmark that disguises extremely dangerous or malicious requests beneath
+seemingly legitimate educational prompts. Our experiments reveal severe
+security flaws in popular commercial-grade LRMs, including OpenAI o1/o3,
+DeepSeek-R1, and Gemini 2.0 Flash Thinking. For instance, although OpenAI's o1
+model initially maintains a high refusal rate of about 98%, subsequent model
+updates significantly compromise its safety; and attackers can easily extract
+criminal strategies from DeepSeek-R1 and Gemini 2.0 Flash Thinking without any
+additional tricks. To further highlight these vulnerabilities, we propose
+Hijacking Chain-of-Thought (H-CoT), a universal and transferable attack method
+that leverages the model's own displayed intermediate reasoning to jailbreak
+its safety reasoning mechanism. Under H-CoT, refusal rates decline
+sharply, dropping from 98% to below 2%, and, in some instances, the model's
+initially cautious tone even shifts to one that is willing to provide harmful content.
+We hope these findings underscore the urgent need for more robust safety
+mechanisms to preserve the benefits of advanced reasoning capabilities without
+compromising ethical standards.
+
+摘要：大型推理模型 (LRM) 最近將其強大的推理能力擴展到安全檢查，使用思維鏈推理來決定是否應回答請求。雖然這種新方法為平衡模型實用性和安全性提供了一條有希望的途徑，但其穩健性仍未得到充分探索。為了解決這一差距，我們引入了 Malicious-Educator，這是一個基準，它將極其危險或惡意的請求偽裝在看似合法的教育提示之下。我們的實驗揭示了流行的商業級 LRM 中嚴重的安全缺陷，包括 OpenAI o1/o3、DeepSeek-R1 和 Gemini 2.0 Flash Thinking。例如，儘管 OpenAI 的 o1 模型最初保持約 98% 的高拒絕率，但後續的模型更新顯著損害了其安全性；攻擊者可以輕鬆地從 DeepSeek-R1 和 Gemini 2.0 Flash Thinking 中提取犯罪策略，而無需任何額外的技巧。為了進一步強調這些漏洞，我們提出了劫持思維鏈 (H-CoT)，這是一種通用且可轉移的攻擊方法，它利用模型自己顯示的中間推理來越獄其安全推理機制。在 H-CoT 下，拒絕率急劇下降，從 98% 降至 2% 以下，在某些情況下，甚至將最初謹慎的語氣轉變為願意提供有害內容的語氣。我們希望這些發現強調了對更強大的安全機制的迫切需要，以保留先進推理能力的好處，同時不損害道德標準。
+
+##### **Are Multilingual Language Models an Off-ramp for Under-resourced Languages? Will we arrive at Digital Language Equality in Europe in 2030?**
+2502.12886v1 by Georg Rehm, Annika Grützner-Zahn, Fabio Barth
+
+Large language models (LLMs) demonstrate unprecedented capabilities and
+define the state of the art for almost all natural language processing (NLP)
+tasks and also for essentially all Language Technology (LT) applications. LLMs
+can only be trained for languages for which a sufficient amount of pre-training
+data is available, effectively excluding many languages that are typically
+characterised as under-resourced. However, there is both circumstantial and
+empirical evidence that multilingual LLMs, which have been trained using data
+sets that cover multiple languages (including under-resourced ones), do exhibit
+strong capabilities for some of these under-resourced languages. Eventually,
+this approach may have the potential to be a technological off-ramp for those
+under-resourced languages for which "native" LLMs, and LLM-based technologies,
+cannot be developed due to a lack of training data. This paper, which
+concentrates on European languages, examines this idea, analyses the current
+situation in terms of technology support and summarises related work. The
+article concludes by focusing on the key open questions that need to be
+answered for the approach to be put into practice in a systematic way.
+
+摘要：大型語言模型 (LLM) 展現前所未有的能力，並定義了幾乎所有自然語言處理 (NLP) 任務以及所有語言技術 (LT) 應用的最新技術。LLM 只能針對有足夠預訓練資料可用的語言進行訓練，實際上排除了許多通常被歸類為資源不足的語言。然而，有環境和經驗證據顯示，多語言 LLM 已使用涵蓋多種語言（包括資源不足的語言）的資料集進行訓練，確實對其中一些資源不足的語言展現出強大的能力。最終，這種方法可能具有成為那些由於缺乏訓練資料而無法開發「原生」LLM 和基於 LLM 的技術的資源不足語言的技術跳板的潛力。本文專注於歐洲語言，探討這個想法，分析技術支援方面的現狀，並總結相關工作。本文最後專注於必須回答的主要開放性問題，以便系統性地實踐這種方法。
+
+##### **How desirable is alignment between LLMs and linguistically diverse human users?**
+2502.12884v1 by Pia Knoeferle, Sebastian Möller, Dorothea Kolossa, Veronika Solopova, Georg Rehm
+
+We discuss how desirable it is that Large Language Models (LLMs) be able to
+adapt or align their language behavior with users who may be diverse in their
+language use. User diversity may arise from, among other factors, i) age
+differences, ii) gender characteristics, and/or iii) multilingual experience,
+and associated differences in language processing and use. We consider
+potential consequences for usability, communication, and LLM development.
+
+摘要：我們探討大型語言模型 (LLM) 能夠適應或調整其語言行為，以適應語言使用可能多樣化的使用者，這有多麼可取。使用者多樣性可能出於以下原因而產生：i) 年齡差異；ii) 性別特徵，和/或 iii) 多語言經驗，以及語言處理和使用上的相關差異。我們考慮對可用性、溝通和 LLM 開發的潛在後果。
+
+##### **Continuous Learning Conversational AI: A Personalized Agent Framework via A2C Reinforcement Learning**
+2502.12876v1 by Nandakishor M, Anjali M
+
+Creating personalized and adaptable conversational AI remains a key
+challenge. This paper introduces a Continuous Learning Conversational AI (CLCA)
+approach, implemented using A2C reinforcement learning, to move beyond static
+Large Language Models (LLMs). We use simulated sales dialogues, generated by
+LLMs, to train an A2C agent. This agent learns to optimize conversation
+strategies for personalization, focusing on engagement and delivering value.
+Our system architecture integrates reinforcement learning with LLMs for both
+data creation and response selection. This method offers a practical way to
+build personalized AI companions that evolve through continuous learning,
+advancing beyond traditional static LLM techniques.
+
+摘要：建立個人化且適應性強的對話式 AI 仍然是一項關鍵挑戰。本文介紹了一種持續學習對話式 AI (CLCA) 方法，透過 A2C 強化學習實作，以超越靜態大型語言模型 (LLM)。我們使用 LLM 生成的模擬銷售對話來訓練 A2C 代理。此代理會學習最佳化對話策略以實現個人化，並專注於參與和提供價值。我們的系統架構將強化學習與 LLM 整合，用於資料建立和回應選取。此方法提供了一種實用的方式來建立個人化 AI 伴侶，這些伴侶會透過持續學習而演進，超越傳統的靜態 LLM 技術。
+
+##### **PAFT: Prompt-Agnostic Fine-Tuning**
+2502.12859v1 by Chenxing Wei, Yao Shu, Mingwen Ou, Ying Tiffany He, Fei Richard Yu
+
+While Large Language Models (LLMs) adapt well to downstream tasks after
+fine-tuning, this adaptability often compromises prompt robustness, as even
+minor prompt variations can significantly degrade performance. To address this,
+we propose Prompt-Agnostic Fine-Tuning (PAFT), a simple yet effective approach
+that dynamically adjusts prompts during fine-tuning. This encourages the model
+to learn underlying task principles rather than overfitting to specific prompt
+formulations. PAFT operates in two stages: First, a diverse set of meaningful,
+synthetic candidate prompts is constructed. Second, during fine-tuning, prompts
+are randomly sampled from this set to create dynamic training inputs. Extensive
+experiments across diverse datasets and LLMs demonstrate that models trained
+with PAFT exhibit strong robustness and generalization across a wide range of
+prompts, including unseen ones. This enhanced robustness improves both model
+performance and inference speed while maintaining training efficiency. Ablation
+studies further confirm the effectiveness of PAFT.
+
+摘要：儘管大型語言模型 (LLM) 在微調後能很好地適應下游任務，但這種適應性通常會損害提示的穩健性，因為即使微小的提示變異也會大幅降低效能。為了解決這個問題，我們提出提示不可知微調 (PAFT)，這是一種簡單卻有效的方法，可以在微調期間動態調整提示。這鼓勵模型學習底層任務原則，而不是過度擬合特定的提示表述。PAFT 分為兩個階段運作：首先，構建一組多樣化、有意義的合成候選提示。其次，在微調期間，從此集合中隨機抽取提示以建立動態訓練輸入。針對各種資料集和 LLM 進行的廣泛實驗表明，使用 PAFT 訓練的模型在各種提示中表現出強大的穩健性和概括性，包括未見過的提示。這種增強的穩健性同時改善了模型效能和推理速度，同時維持訓練效率。消融研究進一步證實了 PAFT 的有效性。
+
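The second PAFT stage described above, sampling a prompt at random for each training input, might look roughly like the sketch below. The templates, the `question`/`answer` field names, and the `build_training_inputs` helper are hypothetical, and the construction of the candidate prompt set (the first stage) is assumed to have happened already.

```python
import random

def build_training_inputs(examples, prompt_templates, epochs, seed=0):
    """Pair each example with a freshly sampled prompt template in every epoch."""
    rng = random.Random(seed)
    batches = []
    for _ in range(epochs):
        epoch_inputs = []
        for example in examples:
            template = rng.choice(prompt_templates)
            text = template.format(question=example["question"])
            epoch_inputs.append((text, example["answer"]))
        batches.append(epoch_inputs)
    return batches

# Hypothetical candidate prompts and toy examples
templates = [
    "Question: {question}\nAnswer:",
    "Please solve the following problem: {question}",
    "{question} Give only the final answer.",
]
examples = [
    {"question": "What is 2 + 3?", "answer": "5"},
    {"question": "What is 7 * 6?", "answer": "42"},
]
inputs = build_training_inputs(examples, templates, epochs=3)
```

Because the template is resampled every epoch, the same example is seen under several phrasings, so a model trained on these pairs has less opportunity to latch onto one fixed prompt wording.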
+##### **Rejected Dialects: Biases Against African American Language in Reward Models**
+2502.12858v1 by Joel Mire, Zubin Trivadi Aysola, Daniel Chechelnitsky, Nicholas Deas, Chrysoula Zerva, Maarten Sap
+
+Preference alignment via reward models helps build safe, helpful, and
+reliable large language models (LLMs). However, subjectivity in preference
+judgments and the lack of representative sampling in preference data collection
+can introduce new biases, hindering reward models' fairness and equity. In this
+work, we introduce a framework for evaluating dialect biases in reward models
+and conduct a case study on biases against African American Language (AAL)
+through several experiments comparing reward model preferences and behavior on
+paired White Mainstream English (WME) and both machine-translated and
+human-written AAL corpora. We show that reward models are less aligned with
+human preferences when processing AAL texts vs. WME ones (-4% accuracy on
+average), frequently disprefer AAL-aligned texts vs. WME-aligned ones, and
+steer conversations toward WME, even when prompted with AAL texts. Our findings
+provide a targeted analysis of anti-AAL biases at a relatively understudied
+stage in LLM development, highlighting representational harms and ethical
+questions about the desired behavior of LLMs concerning AAL.
+
+摘要：透過獎勵模型進行偏好比對有助於建立安全、有用的可靠大型語言模型 (LLM)。然而，偏好判斷的主觀性，以及偏好資料收集中缺乏代表性抽樣，可能會引進新的偏誤，阻礙獎勵模型的公平性和公正性。在這項工作中，我們引進一個用於評估獎勵模型中方言偏誤的架構，並透過數個實驗進行案例研究，探討針對非裔美國人語言 (AAL) 的偏誤，這些實驗比較了獎勵模型偏好和行為，比較成對的白人主流英語 (WME) 與機器翻譯和人類撰寫的 AAL 語料庫。我們顯示，與處理 WME 文字相比，獎勵模型在處理 AAL 文字時與人類偏好較不一致（平均準確度降低 4%），經常不偏好與 AAL 一致的文字，而偏好與 WME 一致的文字，並將對話導向 WME，即使提示的是 AAL 文字。我們的發現針對 LLM 開發中相對未受重視的階段，提供針對反 AAL 偏誤的目標分析，強調與表徵相關的危害和關於 LLM 對 AAL 的期望行為的倫理問題。
+
+##### **Integrating Arithmetic Learning Improves Mathematical Reasoning in Smaller Models**
+2502.12855v1 by Neeraj Gangwar, Suma P Bhat, Nickvash Kani
+
+While large models pre-trained on high-quality data exhibit excellent
+performance across various reasoning tasks, including mathematical reasoning
+(e.g. GSM8k, MultiArith), specializing smaller models to excel at mathematical
+reasoning remains a challenging problem. Common approaches to address this
+challenge include knowledge distillation, where smaller student models learn
+from large pre-trained teacher models, and data augmentation, such as
+rephrasing questions. Despite these efforts, smaller models struggle with
+arithmetic computations, leading to errors in mathematical reasoning. In this
+work, we focus on leveraging a programmatically generated arithmetic dataset to
+enhance the reasoning capabilities of smaller models. We investigate two key
+approaches to incorporate this dataset -- (1) intermediate fine-tuning, where a
+model is fine-tuned on the arithmetic dataset before being trained on a
+reasoning dataset, and (2) integrating the arithmetic dataset into the
+instruction-tuning mixture, allowing the model to learn arithmetic skills
+alongside general instruction-following abilities. Our experiments on multiple
+reasoning benchmarks demonstrate that incorporating an arithmetic dataset,
+whether through targeted fine-tuning or within the instruction-tuning mixture,
+enhances the models' arithmetic capabilities, which in turn improves their
+mathematical reasoning performance.
+
+摘要：大型模型经过针对高质量数据的预训练，在各种推理任务中表现出色，包括数学推理（例如 GSM8k、MultiArith），但专门化小型模型以擅长数学推理仍然是一个具有挑战性的问题。解决这一挑战的常见方法包括知识蒸馏，其中较小的学生模型从经过预训练的大型教师模型中学习，以及数据增强，例如重新表述问题。尽管做出了这些努力，较小的模型在算术计算中仍然存在困难，从而导致数学推理错误。在这项工作中，我们专注于利用程序化生成的算术数据集来增强较小模型的推理能力。我们研究了两种关键方法来合并此数据集——（1）中间微调，其中模型在算术数据集上进行微调，然后在推理数据集上进行训练，以及（2）将算术数据集集成到指令微调混合中，允许模型学习算术技能以及一般的指令遵循能力。我们在多个推理基准上的实验表明，通过有针对性的微调或在指令微调混合中合并算术数据集，增强了模型的算术能力，进而提高了它们的数学推理性能。
+
+##### **S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning**
+2502.12853v1 by Ruotian Ma, Peisong Wang, Cheng Liu, Xingyan Liu, Jiaqi Chen, Bang Zhang, Xin Zhou, Nan Du, Jia Li
+
+Recent studies have demonstrated the effectiveness of LLM test-time scaling.
+However, existing approaches to incentivize LLMs' deep thinking abilities
+generally require large-scale data or significant training efforts. Meanwhile,
+it remains unclear how to improve the thinking abilities of less powerful base
+models. In this work, we introduce S$^2$R, an efficient framework that enhances
+LLM reasoning by teaching models to self-verify and self-correct during
+inference. Specifically, we first initialize LLMs with iterative
+self-verification and self-correction behaviors through supervised fine-tuning
+on carefully curated data. The self-verification and self-correction skills are
+then further strengthened by both outcome-level and process-level reinforcement
+learning, with minimized resource requirements, enabling the model to
+adaptively refine its reasoning process during inference. Our results
+demonstrate that, with only 3.1k self-verifying and self-correcting behavior
+initialization samples, Qwen2.5-math-7B achieves an accuracy improvement from
+51.0% to 81.6%, outperforming models trained on an equivalent amount of
+long-CoT distilled data. Extensive experiments and analysis based on three base
+models across both in-domain and out-of-domain benchmarks validate the
+effectiveness of S$^2$R. Our code and data are available at
+https://github.com/NineAbyss/S2R.
+
+摘要：最近的研究表明了 LLM 测试时间扩展的有效性。
+然而，现有激励 LLM 深度思考能力的方法
+通常需要大规模数据或大量的训练工作。同时，
+如何提高较弱基础模型的思考能力仍然不清楚。在这项工作中，我们引入了 S$^2$R，一个通过教导模型在
+推理过程中进行自我验证和自我纠正来增强 LLM 推理的有效框架。具体来说，我们首先通过监督微调对精心整理的数据来初始化具有迭代自我验证和自我纠正行为的 LLM。然后通过结果级别和过程级别的强化
+学习进一步加强自我验证和自我纠正技能，同时最大程度地减少资源需求，使模型能够
+在推理过程中自适应地优化其推理过程。我们的结果
+表明，仅使用 3.1k 个自我验证和自我纠正行为
+初始化样本，Qwen2.5-math-7B 的准确率从
+51.0% 提高到 81.6%，优于在等量长 CoT 蒸馏数据上训练的模型。基于三个基础模型在域内和域外基准上的广泛实验和分析验证了
+S$^2$R 的有效性。我们的代码和数据可以在
+https://github.com/NineAbyss/S2R 获得。
+
+##### **MVL-SIB: A Massively Multilingual Vision-Language Benchmark for Cross-Modal Topical Matching**
+2502.12852v1 by Fabian David Schmidt, Florian Schneider, Chris Biemann, Goran Glavaš
+
+Existing multilingual vision-language (VL) benchmarks often only cover a
+handful of languages. Consequently, evaluations of large vision-language models
+(LVLMs) predominantly target high-resource languages, underscoring the need for
+evaluation data for low-resource languages. To address this limitation, we
+introduce MVL-SIB, a massively multilingual vision-language benchmark that
+evaluates both cross-modal and text-only topical matching across 205 languages
+-- over 100 more than the most multilingual existing VL benchmarks encompass.
+We then benchmark a range of open-weight LVLMs together with GPT-4o(-mini)
+on MVL-SIB. Our results reveal that LVLMs struggle in cross-modal topic
+matching in lower-resource languages, performing no better than chance on
+languages like N'Koo. Our analysis further reveals that VL support in LVLMs
+declines disproportionately relative to textual support for lower-resource
+languages, as evidenced by comparison of cross-modal and text-only topical
+matching performance. We further observe that open-weight LVLMs do not benefit
+from representing a topic with more than one image, suggesting that these
+models are not yet fully effective at handling multi-image tasks. By
+correlating performance on MVL-SIB with other multilingual VL benchmarks, we
+highlight that MVL-SIB serves as a comprehensive probe of multilingual VL
+understanding in LVLMs.
+
+摘要：現有的多語言視覺語言 (VL) 基準通常只涵蓋少數語言。因此，大型視覺語言模型 (LVLMs) 的評估主要針對資源豐富的語言，強調了對資源匱乏語言的評估資料的需求。為了解決此限制，我們引入了 MVL-SIB，一個大規模的多語言視覺語言基準，它評估了 205 種語言的跨模態和純文字主題匹配，比現有的多語言 VL 基準涵蓋的語言多出 100 多種。然後，我們在 MVL-SIB 上對一系列開放權重的 LVLMs 與 GPT-4o(-mini) 進行了基準測試。我們的結果表明，LVLMs 在資源較少的語言中難以進行跨模態主題匹配，在 N'Koo 等語言上的表現不比隨機好。我們的分析進一步表明，LVLMs 中的 VL 支援相對於資源較少的語言的文字支援下降得不成比例，這從跨模態和純文字主題匹配效能的比較中可以看出。我們進一步觀察到，開放權重的 LVLMs 無法從用多於一張影像來表示主題中受益，這表明這些模型在處理多影像任務方面尚未完全有效。通過將 MVL-SIB 上的效能與其他多語言 VL 基準相關聯，我們強調 MVL-SIB 可作為 LVLMs 中多語言 VL 理解的綜合探測。
+
+##### **MeMo: Towards Language Models with Associative Memory Mechanisms**
+2502.12851v1 by Fabio Massimo Zanzotto, Elena Sofia Ruzzetti, Giancarlo A. Xompero, Leonardo Ranaldi, Davide Venditti, Federico Ranaldi, Cristina Giannone, Andrea Favalli, Raniero Romagnoli
+
+Memorization is a fundamental ability of Transformer-based Large Language
+Models, achieved through learning. In this paper, we propose a paradigm shift
+by designing an architecture to memorize text directly, bearing in mind the
+principle that memorization precedes learning. We introduce MeMo, a novel
+architecture for language modeling that explicitly memorizes sequences of
+tokens in layered associative memories. By design, MeMo offers transparency and
+the possibility of model editing, including forgetting texts. We experimented
+with the MeMo architecture, showing the memorization power of the one-layer and
+the multi-layer configurations.
+
+摘要：記憶是 Transformer 大型語言模型的基本能力，可透過學習達成。在本文中，我們提出一個典範轉移，透過設計一個架構來直接記憶文字，並牢記記憶先於學習的原則。我們導入 MeMo，一個新穎的語言建模架構，可明確地記憶分層關聯式記憶中的代幣序列。透過設計，MeMo 提供透明度和模型編輯的可能性，包括遺忘文字。我們實驗了 MeMo 架構，展示了單層和多層組態的記憶力。
+
+##### **Towards Adaptive Feedback with AI: Comparing the Feedback Quality of LLMs and Teachers on Experimentation Protocols**
+2502.12842v1 by Kathrin Seßler, Arne Bewersdorff, Claudia Nerdel, Enkelejda Kasneci
+
+Effective feedback is essential for fostering students' success in scientific
+inquiry. With advancements in artificial intelligence, large language models
+(LLMs) offer new possibilities for delivering instant and adaptive feedback.
+However, this feedback often lacks the pedagogical validation provided by
+real-world practitioners. To address this limitation, our study evaluates and
+compares the feedback quality of LLM agents with that of human teachers and
+science education experts on student-written experimentation protocols. Four
+blinded raters, all professionals in scientific inquiry and science education,
+evaluated the feedback texts generated by 1) the LLM agent, 2) the teachers and
+3) the science education experts using a five-point Likert scale based on six
+criteria of effective feedback: Feed Up, Feed Back, Feed Forward, Constructive
+Tone, Linguistic Clarity, and Technical Terminology. Our results indicate that
+LLM-generated feedback shows no significant difference from that of teachers and
+experts in overall quality. However, the LLM agent's performance lags in the
+Feed Back dimension, which involves identifying and explaining errors within
+the student's work context. Qualitative analysis highlighted the LLM agent's
+limitations in contextual understanding and in the clear communication of
+specific errors. Our findings suggest that combining LLM-generated feedback
+with human expertise can enhance educational practices by leveraging the
+efficiency of LLMs and the nuanced understanding of educators.
+
+摘要：有效的回饋對於培養學生在科學探究中的成功至關重要。隨著人工智慧的進步，大型語言模型 (LLM) 為提供即時且適應性的回饋提供了新的可能性。然而，此回饋通常缺乏實際從業者提供的教學驗證。為了解決此限制，我們的研究評估並比較了 LLM 代理與人類教師和科學教育專家在學生撰寫的實驗協定上的回饋品質。四位盲評者，皆為科學探究和科學教育專業人士，使用基於六個有效回饋準則的五點李克特量表評估由 1) LLM 代理、2) 教師和 3) 科學教育專家產生的回饋文字：鼓勵、回饋、前饋、建設性語氣、語言清晰度和技術術語。我們的結果表明，LLM 產生的回饋在整體品質上與教師和專家產生的回饋沒有顯著差異。然而，LLM 代理的表現落後於回饋面向，這涉及在學生的作業背景中識別和解釋錯誤。定性分析突顯了 LLM 代理在情境理解和明確傳達特定錯誤方面的限制。我們的研究結果表明，將 LLM 產生的回饋與人類專業知識相結合，可以透過利用 LLM 的效率和教育者的細緻理解來提升教育實務。
+
+##### **Towards Equitable AI: Detecting Bias in Using Large Language Models for Marketing**
+2502.12838v1 by Berk Yilmaz, Huthaifa I. Ashqar
+
+The recent advances in large language models (LLMs) have revolutionized
+industries such as finance, marketing, and customer service by enabling
+sophisticated natural language processing tasks. However, the broad adoption of
+LLMs brings significant challenges, particularly in the form of social biases
+that can be embedded within their outputs. Biases related to gender, age, and
+other sensitive attributes can lead to unfair treatment, raising ethical
+concerns and risking both company reputation and customer trust. This study
+examined bias in finance-related marketing slogans generated by LLMs (i.e.,
+ChatGPT) by prompting tailored ads targeting five demographic categories:
+gender, marital status, age, income level, and education level. A total of
+1,700 slogans were generated for 17 unique demographic groups, and key terms
+were categorized into four thematic groups: empowerment, financial, benefits
+and features, and personalization. Bias was systematically assessed using
+relative bias calculations and statistically tested with the Kolmogorov-Smirnov
+(KS) test against general slogans generated for any individual. Results
+revealed that marketing slogans are not neutral; rather, they emphasize
+different themes based on demographic factors. Women, younger individuals,
+low-income earners, and those with lower education levels receive more distinct
+messaging compared to older, higher-income, and highly educated individuals.
+This underscores the need to consider demographic-based biases in AI-generated
+marketing strategies and their broader societal implications. The findings of
+this study provide a roadmap for developing more equitable AI systems,
+highlighting the need for ongoing bias detection and mitigation efforts in
+LLMs.
+
+摘要：大型語言模型 (LLM) 的最新進展徹底改變了金融、行銷和客戶服務等產業，因為它能執行複雜的自然語言處理任務。然而，LLM 的廣泛採用帶來重大的挑戰，特別是潛藏在其輸出結果中的社會偏見形式。與性別、年齡和其他敏感屬性相關的偏見可能導致不公平的待遇，引發道德問題，並危及公司聲譽和客戶信任。本研究探討了 LLM（即 ChatGPT）產生的與金融相關的行銷標語中的偏見，方法是針對五個人口統計類別：性別、婚姻狀況、年齡、收入水準和教育水準，提示量身打造的廣告。總共為 17 個獨特的人口統計群組產生了 1,700 個標語，並且關鍵詞被分類為四個主題群組：賦權、財務、好處和功能，以及個人化。偏見使用相對偏見計算進行系統性評估，並使用科爾莫哥洛夫-史米諾夫 (KS) 檢定與針對任何個人產生的通用標語進行統計檢定。結果顯示行銷標語並非中立；相反地，它們根據人口統計因素強調不同的主題。與年紀較大、收入較高和受教育程度較高的個人相比，女性、年輕人、低收入者和教育程度較低者接收到的訊息更為不同。這強調了在 AI 生成的行銷策略中考量基於人口統計的偏見及其更廣泛的社會影響的必要性。本研究的發現提供了開發更公平 AI 系統的路線圖，突顯了在 LLM 中持續進行偏見偵測和緩解工作的重要性。
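
The abstract above tests bias with the Kolmogorov-Smirnov (KS) test against general slogans. A minimal sketch of the two-sample KS statistic, which measures the largest gap between two empirical distribution functions, might look like the following. The data values and the idea of scoring each slogan by its share of "empowerment" terms are illustrative assumptions, not figures or procedures taken from the paper.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    gap = 0.0
    # The empirical CDFs only change at observed values, so checking those is enough.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Hypothetical per-slogan share of "empowerment" terms (made-up numbers).
general_slogans = [0.10, 0.12, 0.08, 0.11, 0.09]
targeted_slogans = [0.20, 0.22, 0.18, 0.25, 0.21]

print(ks_statistic(general_slogans, targeted_slogans))
```

A statistic near 0 means the two groups receive similarly distributed messaging; a statistic near 1 means the distributions barely overlap. Turning the statistic into a p-value needs the KS distribution, which a statistics library would normally supply.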
+
+##### **An LLM-Powered Agent for Physiological Data Analysis: A Case Study on PPG-based Heart Rate Estimation**
+2502.12836v1 by Mohammad Feli, Iman Azimi, Pasi Liljeberg, Amir M. Rahmani
+
+Large language models (LLMs) are revolutionizing healthcare by improving
+diagnosis, patient care, and decision support through interactive
+communication. More recently, they have been applied to analyzing physiological
+time-series like wearable data for health insight extraction. Existing methods
+embed raw numerical sequences directly into prompts, which exceeds token limits
+and increases computational costs. Additionally, some studies integrated
+features extracted from time-series in textual prompts or applied multimodal
+approaches. However, these methods often produce generic and unreliable outputs
+due to LLMs' limited analytical rigor and inefficiency in interpreting
+continuous waveforms. In this paper, we develop an LLM-powered agent for
+physiological time-series analysis, aiming to bridge the gap in integrating LLMs
+with well-established analytical tools. Built on OpenCHA, an open-source
+LLM-powered framework, our agent features an orchestrator that integrates user
+interaction, data sources, and analytical tools to generate accurate health
+insights. To evaluate its effectiveness, we implement a case study on heart
+rate (HR) estimation from Photoplethysmogram (PPG) signals using a dataset of
+PPG and Electrocardiogram (ECG) recordings in a remote health monitoring study.
+The agent's performance is benchmarked against OpenAI GPT-4o-mini and GPT-4o,
+with ECG serving as the gold standard for HR estimation. Results demonstrate
+that our agent significantly outperforms benchmark models by achieving lower
+error rates and more reliable HR estimations. The agent implementation is
+publicly available on GitHub.
+
+摘要：大型語言模型 (LLM) 透過互動式溝通，改善診斷、病人照護和決策支援，進而革新醫療保健。最近，它們已應用於分析生理時間序列，例如可穿戴式裝置的資料，以萃取健康見解。現有方法會將原始數值序列直接嵌入提示中，這會超過權杖限制並增加運算成本。此外，一些研究將從時間序列中萃取的特徵整合到文字提示中，或應用多模態方法。然而，由於 LLM 在解譯連續波形時分析嚴謹度有限且效率不彰，這些方法經常產生通用且不可靠的輸出。在本文中，我們開發了一個由 LLM 驅動的代理，用於生理時間序列分析，旨在彌合將 LLM 與既有分析工具整合的差距。我們的代理建立在 OpenCHA（一個由 LLM 驅動的開源架構）之上，具備一個整合使用者互動、資料來源和分析工具的協調器，以產生準確的健康見解。為了評估其有效性，我們實作了一個案例研究，從遠距健康監測研究中的一組光電容積描記圖 (PPG) 和心電圖 (ECG) 記錄中估算心率 (HR)。該代理的效能與 OpenAI GPT-4o-mini 和 GPT-4o 進行基準測試，其中 ECG 作為 HR 估算的金標準。結果顯示，我們的代理透過達成較低的錯誤率和更可靠的 HR 估算，顯著優於基準模型。該代理實作已公開在 GitHub 上。
+
+##### **Subword models struggle with word learning, but surprisal hides it**
+2502.12835v1 by Bastian Bunzeck, Sina Zarrieß
+
+We study word learning in subword and character language models with the
+psycholinguistic lexical decision task. While subword LMs struggle to discern
+words and non-words with high accuracy, character LMs solve this task easily
+and consistently. Furthermore, when comparing word learning and syntactic
+learning, both processes are separable in character LMs, where word learning
+predates syntactic learning, whereas these processes are simultaneous in
+subword LMs. This raises questions about the adequacy of subword LMs for
+modeling language acquisition and positions character LMs as a viable
+alternative.
+
+摘要：我們使用心理語言學的詞彙決策任務研究在子詞和字元語言模型中的詞彙學習。儘管子詞語言模型難以區分單詞和非單詞，但字元語言模型可以輕鬆且一致地解決此任務。此外，在比較單詞學習和句法學習時，這兩個過程在字元語言模型中是可分離的，其中單詞學習先於句法學習，而這些過程在子詞語言模型中是同時發生的。這引發了關於子詞語言模型對語言習得建模的充分性的問題，並將字元語言模型定位為可行的替代方案。
+
+##### **KazMMLU: Evaluating Language Models on Kazakh, Russian, and Regional Knowledge of Kazakhstan**
+2502.12829v1 by Mukhammed Togmanov, Nurdaulet Mukhituly, Diana Turmakhan, Jonibek Mansurov, Maiya Goloburda, Akhmed Sakip, Zhuohan Xie, Yuxia Wang, Bekassyl Syzdykov, Nurkhan Laiyk, Alham Fikri Aji, Ekaterina Kochmar, Preslav Nakov, Fajri Koto
+
+Despite having a population of twenty million, Kazakhstan's culture and
+language remain underrepresented in the field of natural language processing.
+Although large language models (LLMs) continue to advance worldwide, progress
+in the Kazakh language has been limited, as seen in the scarcity of dedicated
+models and benchmark evaluations. To address this gap, we introduce KazMMLU,
+the first MMLU-style dataset specifically designed for the Kazakh language. KazMMLU
+comprises 23,000 questions that cover various educational levels, including
+STEM, humanities, and social sciences, sourced from authentic educational
+materials and manually validated by native speakers and educators. The dataset
+includes 10,969 Kazakh questions and 12,031 Russian questions, reflecting
+Kazakhstan's bilingual education system and rich local context. Our evaluation
+of several state-of-the-art multilingual models (Llama-3.1, Qwen-2.5, GPT-4,
+and DeepSeek V3) demonstrates substantial room for improvement, as even the
+best-performing models struggle to achieve competitive performance in Kazakh
+and Russian. These findings underscore significant performance gaps compared to
+high-resource languages. We hope that our dataset will enable further research
+and development of Kazakh-centric LLMs. Data and code will be made available
+upon acceptance.
+
+摘要：儘管哈薩克人口達兩千萬，但哈薩克的文化和語言在自然語言處理領域仍未得到充分的重視。儘管大型語言模型 (LLM) 在全球持續進步，但哈薩克語的進展卻十分有限，這從專用模型和基準評估的稀缺性中可見一斑。為了解決這個差距，我們引入了 KazMMLU，這是第一個專門為哈薩克語設計的 MMLU 風格資料集。KazMMLU 包含 23,000 個問題，涵蓋各種教育層級，包括 STEM、人文學科和社會科學，這些問題來自真實的教育材料，並由母語人士和教育工作者手動驗證。該資料集包含 10,969 個哈薩克語問題和 12,031 個俄語問題，反映了哈薩克的雙語教育體系和豐富的在地脈絡。我們對幾個最先進的多語言模型（Llama-3.1、Qwen-2.5、GPT-4 和 DeepSeek V3）的評估顯示，仍有很大的改進空間，因為即使是效能最好的模型，也很難在哈薩克語和俄語中達到有競爭力的效能。這些發現強調了與資源豐富的語言相比，存在顯著的效能差距。我們希望我們的資料集能促進以哈薩克語為中心的 LLM 的進一步研究和開發。資料和程式碼將在獲得接受後提供。
+
+##### **Reasoning and the Trusting Behavior of DeepSeek and GPT: An Experiment Revealing Hidden Fault Lines in Large Language Models**
+2502.12825v1 by Rubing Lu, João Sedoc, Arun Sundararajan
+
+When encountering increasingly frequent performance improvements or cost
+reductions from a new large language model (LLM), developers of applications
+leveraging LLMs must decide whether to take advantage of these improvements or
+stay with older tried-and-tested models. Low perceived switching frictions can
+lead to choices that do not consider more subtle behavior changes that the
+transition may induce. Our experiments use a popular game-theoretic behavioral
+economics model of trust to show stark differences in the trusting behavior of
+OpenAI's and DeepSeek's models. We highlight a collapse in the economic trust
+behavior of the o1-mini and o3-mini models as they reconcile profit-maximizing
+and risk-seeking with future returns from trust, and contrast it with
+DeepSeek's more sophisticated and profitable trusting behavior that stems from
+an ability to incorporate deeper concepts like forward planning and
+theory-of-mind. As LLMs form the basis for high-stakes commercial systems, our
+results highlight the perils of relying on LLM performance benchmarks that are
+too narrowly defined and suggest that careful analysis of their hidden fault
+lines should be part of any organization's AI strategy.
+
+摘要：當遇到越來越頻繁的效能提升或來自於新的大型語言模型 (LLM) 的成本降低時，利用 LLM 的應用程式開發人員必須決定是否要利用這些提升或維持較舊且經過測試的模型。低感知切換摩擦可能會導致選擇不考慮轉換可能誘發的更細微的行為改變。我們的實驗使用信任的流行博弈論行為經濟模型來顯示 OpenAI 和 DeepSeek 模型在信任行為上的顯著差異。我們強調 o1-mini 和 o3-mini 模型的經濟信任行為崩潰，因為它們調和了利潤最大化和風險尋求與來自信任的未來回報，並將其與 DeepSeek 更複雜且有利可圖的信任行為進行對比，這種信任行為源於整合更深層的概念，例如前瞻性規劃和心智理論。由於 LLM 構成高風險商業系統的基礎，我們的結果突顯了依賴定義過於狹窄的 LLM 效能基準的危險性，並建議仔細分析其隱藏的斷層線應該是任何組織的 AI 策略的一部分。
+
+##### **Pitfalls of Scale: Investigating the Inverse Task of Redefinition in Large Language Models**
+2502.12821v1 by Elena Stringli, Maria Lymperaiou, Giorgos Filandrianos, Giorgos Stamou
+
+Inverse tasks can uncover potential reasoning gaps as Large Language Models
+(LLMs) scale up. In this work, we explore the redefinition task, in which we
+assign alternative values to well-known physical constants and units of
+measure, prompting LLMs to respond accordingly. Our findings show that not only
+does model performance degrade with scale, but its false confidence also rises.
+Moreover, while factors such as prompting strategies or response formatting are
+influential, they do not preclude LLMs from anchoring to memorized values.
+
+摘要：逆向任務可以揭示大型語言模型 (LLM) 擴展時潛在的推理差距。在本文中，我們探討重新定義任務，其中我們將替換值指定給著名的物理常數和測量單位，促使 LLM 做出相應回應。我們的研究結果表明，模型效能不僅會隨著規模而下降，其虛假信心也會上升。此外，儘管提示策略或回應格式等因素具有影響力，但它們並不妨礙 LLM 錨定在記憶值上。
+
+##### **Simulating User Diversity in Task-Oriented Dialogue Systems using Large Language Models**
+2502.12813v1 by Adnan Ahmad, Stefan Hillmann, Sebastian Möller
+
+In this study, we explore the application of Large Language Models (LLMs) for
+generating synthetic users and simulating user conversations with a
+task-oriented dialogue system and present detailed results and their analysis.
+We propose a comprehensive novel user simulation technique that
+uses LLMs to create diverse user profiles, set goals, engage in multi-turn
+dialogues, and evaluate the conversation success. We employ two proprietary
+LLMs, namely GPT-4o and GPT-o1 (Achiam et al., 2023), to generate a
+heterogeneous base of user profiles, characterized by varied demographics,
+multiple user goals, different conversational styles, initial knowledge levels,
+interests, and conversational objectives. We perform a detailed analysis of the
+user profiles generated by LLMs to assess the diversity, consistency, and
+potential biases inherent in these LLM-generated user simulations. We find that
+GPT-o1 generates more heterogeneous user distribution across most user
+attributes, while GPT-4o generates more skewed user attributes. The generated
+set of user profiles is then utilized to simulate dialogue sessions by
+interacting with a task-oriented dialogue system.
+
+摘要：在這項研究中，我們探討大型語言模型 (LLM) 在生成合成使用者和模擬使用者對話，並使用任務導向對話系統進行對話的應用，並提出詳細的結果及其分析。我們提出了一種全面的使用者模擬技術新方法，利用 LLM 建立多樣化的使用者概況、設定目標、參與多輪對話，並評估對話的成功性。我們採用了兩個專有的 LLM，即 GPT-4o 和 GPT-o1 (Achiam 等人，2023 年)，以生成一個異質的使用者概況基礎，其特徵在於不同的人口統計資料、多個使用者目標、不同的對話風格、初始知識水準、興趣和對話目標。我們對 LLM 生成的使用者概況進行了詳細分析，以評估這些 LLM 生成的使用者模擬中固有的多樣性、一致性和潛在偏差。我們發現 GPT-o1 在大多數使用者屬性中產生更異質的使用者分佈，而 GPT-4o 則產生更偏斜的使用者屬性。然後利用生成的使用者概況集，透過與任務導向對話系統互動來模擬對話會話。
+
+##### **Towards Text-Image Interleaved Retrieval**
+2502.12799v1 by Xin Zhang, Ziqi Dai, Yongqi Li, Yanzhao Zhang, Dingkun Long, Pengjun Xie, Meishan Zhang, Jun Yu, Wenjie Li, Min Zhang
+
+Current multimodal information retrieval studies mainly focus on single-image
+inputs, which limits real-world applications involving multiple images and
+text-image interleaved content. In this work, we introduce the text-image
+interleaved retrieval (TIIR) task, where the query and document are interleaved
+text-image sequences, and the model is required to understand the semantics
+from the interleaved context for effective retrieval. We construct a TIIR
+benchmark based on naturally interleaved wikiHow tutorials, where a specific
+pipeline is designed to generate interleaved queries. To explore the task, we
+adapt several off-the-shelf retrievers and build a dense baseline with an
+interleaved multimodal large language model (MLLM). We then propose a novel
+Matryoshka Multimodal Embedder (MME), which compresses the number of visual
+tokens at different granularity, to address the challenge of excessive visual
+tokens in MLLM-based TIIR models. Experiments demonstrate that simple adaption
+of existing models does not consistently yield effective results. Our MME
+achieves significant improvements over the baseline with substantially fewer
+visual tokens. We provide extensive analysis and will release the dataset and
+code to facilitate future research.
+
+摘要：目前的多模態資訊檢索研究主要集中在單一影像輸入，這限制了涉及多個影像和文字影像交錯內容的實際應用。在這項工作中，我們引入了文字影像交錯檢索 (TIIR) 任務，其中查詢和文件是交錯的文字影像序列，並且模型需要理解交錯內容的語意以進行有效檢索。我們根據自然交錯的 wikiHow 教學課程建構了一個 TIIR 基準，其中設計了一個特定的管線來產生交錯查詢。為了探索這個任務，我們調整了幾個現成的檢索器，並透過交錯的多模態大型語言模型 (MLLM) 建立了一個密集的基準。然後，我們提出了一個新穎的 Matryoshka 多模態嵌入器 (MME)，它壓縮了不同粒度視覺符號的數量，以解決基於 MLLM 的 TIIR 模型中過多視覺符號的挑戰。實驗表明，對現有模型的簡單調整並未持續產生有效結果。我們的 MME 透過大幅減少視覺符號，達到了比基準顯著的改進。我們提供了廣泛的分析，並將釋出資料集和程式碼以促進未來的研究。
+
+##### **Envious Explore and Exploit**
+2502.12798v1 by Omer Ben-Porat, Yotam Gafni, Or Markovetzki
+
+Explore-and-exploit tradeoffs play a key role in recommendation systems
+(RSs), aiming at serving users better by learning from previous interactions.
+Despite their commercial success, the societal effects of explore-and-exploit
+mechanisms are not well understood, especially regarding the utility
+discrepancy they generate between different users. In this work, we measure
+such discrepancy using the economic notion of envy. We present a multi-armed
+bandit-like model in which every round consists of several sessions, and
+rewards are realized once per round. We call the latter property reward
+consistency, and show that the RS can leverage this property for better
+societal outcomes. On the downside, doing so also generates envy, as
+late-to-arrive users enjoy the information gathered by early-to-arrive users.
+We examine the generated envy under several arrival order mechanisms and
+virtually any anonymous algorithm, i.e., any algorithm that treats all similar
+users similarly without leveraging their identities. We provide tight envy
+bounds on uniform arrival and upper bound the envy for nudged arrival, in which
+the RS can affect the order of arrival by nudging its users. Furthermore, we
+study the efficiency-fairness trade-off by devising an algorithm that allows
+constant envy and approximates the optimal welfare in restricted settings.
+Finally, we validate our theoretical results empirically using simulations.
+
+摘要：探索與開發的取捨在推薦系統 (RS) 中扮演著關鍵角色，旨在透過學習先前的互動來為使用者提供更好的服務。儘管在商業上獲得成功，但探索與開發機制的社會效應仍未被充分理解，特別是關於它們在不同使用者之間產生的效用差異。在這項工作中，我們使用經濟學中的嫉妒概念來衡量這種差異。我們提出了一個多臂老虎機模型，其中每一輪都包含多個回合，並且每回合只會實現一次獎勵。我們將後者的特性稱為獎勵一致性，並證明 RS 可以利用此特性來獲得更好的社會成果。不利的是，這麼做也會產生嫉妒，因為較晚加入的使用者可以享受較早加入的使用者所收集的資訊。我們在多種到達順序機制和幾乎任何匿名演算法（即任何演算法都以類似的方式對待所有類似的使用者，而不利用他們的身份）下檢驗產生的嫉妒。我們對均勻到達提供嚴格的嫉妒界線，並對推動到達的上限進行嫉妒界線，其中 RS 可以透過推動其使用者來影響到達順序。此外，我們透過設計一種演算法來研究效率公平權衡，該演算法允許恆定的嫉妒，並在受限設定中近似最佳福利。最後，我們使用模擬對我們的理論結果進行經驗驗證。
+
+##### **Commonsense Reasoning in Arab Culture**
+2502.12788v1 by Abdelrahman Sadallah, Junior Cedric Tonga, Khalid Almubarak, Saeed Almheiri, Farah Atif, Chatrine Qwaider, Karima Kadaoui, Sara Shatnawi, Yaser Alesh, Fajri Koto
+
+Despite progress in Arabic large language models, such as Jais and AceGPT,
+their evaluation on commonsense reasoning has largely relied on
+machine-translated datasets, which lack cultural depth and may introduce
+Anglocentric biases. Commonsense reasoning is shaped by geographical and
+cultural contexts, and existing English datasets fail to capture the diversity
+of the Arab world. To address this, we introduce \datasetname, a commonsense
+reasoning dataset in Modern Standard Arabic (MSA), covering cultures of 13
+countries across the Gulf, Levant, North Africa, and the Nile Valley. The
+dataset was built from scratch by engaging native speakers to write and
+validate culturally relevant questions for their respective countries.
+\datasetname spans 12 daily life domains with 54 fine-grained subtopics,
+reflecting various aspects of social norms, traditions, and everyday
+experiences. Zero-shot evaluations show that open-weight language models with
+up to 32B parameters struggle to comprehend diverse Arab cultures, with
+performance varying across regions. These findings highlight the need for more
+culturally aware models and datasets tailored to the Arabic-speaking world.
+
+摘要：儘管阿拉伯語大型語言模型（例如 Jais 和 AceGPT）已有進展，但它們在常識推理上的評估在很大程度上依賴於機器翻譯的資料集，這些資料集缺乏文化深度，可能會引入以英語為中心的偏見。常識推理受地理和文化背景影響，現有的英文資料集無法捕捉阿拉伯世界的多樣性。為了解決這個問題，我們引入了 \datasetname，一個現代標準阿拉伯語 (MSA) 的常識推理資料集，涵蓋海灣地區、黎凡特地區、北非和尼羅河谷 13 個國家的文化。此資料集是從頭開始建立的，由母語人士參與編寫和驗證他們各自國家的文化相關問題。\datasetname 涵蓋 12 個日常生活領域，包含 54 個細緻的主題，反映社會規範、傳統和日常經驗的各個方面。零次學習評估顯示，具有高達 32B 參數的開放式權重語言模型難以理解不同的阿拉伯文化，且各區域的表現不一。這些發現突顯了對更具文化意識的模型和專為阿拉伯語系世界量身打造的資料集的需求。
+
+##### **VidCapBench: A Comprehensive Benchmark of Video Captioning for Controllable Text-to-Video Generation**
+2502.12782v1 by Xinlong Chen, Yuanxing Zhang, Chongling Rao, Yushuo Guan, Jiaheng Liu, Fuzheng Zhang, Chengru Song, Qiang Liu, Di Zhang, Tieniu Tan
+
+The training of controllable text-to-video (T2V) models relies heavily on the
+alignment between videos and captions, yet little existing research connects
+video caption evaluation with T2V generation assessment. This paper introduces
+VidCapBench, a video caption evaluation scheme specifically designed for T2V
+generation, agnostic to any particular caption format. VidCapBench employs a
+data annotation pipeline, combining expert model labeling and human refinement,
+to associate each collected video with key information spanning video
+aesthetics, content, motion, and physical laws. VidCapBench then partitions
+these key information attributes into automatically assessable and manually
+assessable subsets, catering to both the rapid evaluation needs of agile
+development and the accuracy requirements of thorough validation. By evaluating
+numerous state-of-the-art captioning models, we demonstrate the superior
+stability and comprehensiveness of VidCapBench compared to existing video
+captioning evaluation approaches. Verification with off-the-shelf T2V models
+reveals a significant positive correlation between scores on VidCapBench and
+the T2V quality evaluation metrics, indicating that VidCapBench can provide
+valuable guidance for training T2V models. The project is available at
+https://github.com/VidCapBench/VidCapBench.
+
+摘要：可控制文本到影片 (T2V) 模型的訓練極度仰賴影片和字幕之間的對齊，但現有研究鮮少將影片字幕評估與 T2V 生成評估連結起來。本文介紹 VidCapBench，這是一種專門為 T2V 生成設計的影片字幕評估架構，與任何特定的字幕格式無關。VidCapBench 採用資料標註流程，結合專家模型標記和人工微調，將每個收集到的影片與涵蓋影片美學、內容、動作和物理定律等關鍵資訊關聯起來。VidCapBench 接著將這些關鍵資訊屬性分割成可自動評估和可手動評估的子集，以滿足敏捷開發的快速評估需求和全面驗證的準確性要求。透過評估許多最先進的字幕模型，我們證明了 VidCapBench 與現有的影片字幕評估方法相比，具有優異的穩定性和全面性。使用現成的 T2V 模型驗證顯示，VidCapBench 得分與 T2V 品質評估指標之間存在顯著的正相關，這表示 VidCapBench 可以為訓練 T2V 模型提供有價值的指導。專案可於 https://github.com/VidCapBench/VidCapBench 取得。
+
+##### **Portable Reward Tuning: Towards Reusable Fine-Tuning across Different Pretrained Models**
+2502.12776v1 by Daiki Chijiwa, Taku Hasegawa, Kyosuke Nishida, Kuniko Saito, Susumu Takeuchi
+
+While foundation models have been exploited for various expert tasks through
+fine-tuning, any foundation model will become outdated due to its old knowledge
+or limited capability. Thus the underlying foundation model should be
+eventually replaced by new ones, which leads to repeated cost of fine-tuning
+these new models. Existing work addresses this problem by inference-time
+tuning, i.e., modifying the output probabilities from the new foundation model
+with the outputs from the old foundation model and its fine-tuned model, which
+involves an additional overhead in inference by the latter two models. In this
+paper, we propose a new fine-tuning principle, Portable Reward Tuning (PRT),
+that reduces the inference overhead by its nature, based on the reformulation
+of fine-tuning as the reward maximization. Specifically, instead of fine-tuning
+parameters of the foundation models, PRT trains the reward model explicitly
+through the same loss function as in fine-tuning. During inference, the reward
+model can be used with any foundation model (with the same set of vocabularies
+or labels) through the formulation of reward maximization. Experimental
+results, covering both vision and language models, demonstrate that the
+PRT-trained model can achieve comparable accuracy to the existing work of
+inference-time tuning, with less inference cost.
+
+摘要：儘管基礎模型已透過微調用於各種專家任務，任何基礎模型都將因其舊知識或有限功能而過時。因此，基礎模型最終應由新模型取代，這導致重複微調這些新模型的成本。現有工作透過推論時間調整來解決這個問題，即使用舊基礎模型及其微調模型的輸出修改新基礎模型的輸出機率，這涉及後兩個模型在推論中的額外開銷。在本文中，我們提出一個新的微調原則，可攜式獎勵調整 (PRT)，它本質上會減少推論開銷，基於將微調重新表述為獎勵最大化。具體來說，PRT 不是微調基礎模型的參數，而是透過與微調中相同的損失函數明確訓練獎勵模型。在推論期間，獎勵模型可透過獎勵最大化的公式與任何基礎模型（具有相同的詞彙或標籤組）一起使用。涵蓋視覺和語言模型的實驗結果證明，PRT 訓練的模型可以達到與現有推論時間調整工作相當的準確度，且推論成本較低。
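
The abstract above says a PRT reward model can be paired with any foundation model that shares its vocabulary or label set, via reward maximization. The abstract does not give the formula, so the sketch below assumes the standard KL-regularized closed form, in which the tuned distribution is proportional to `p_base(y) * exp(r(y) / beta)`. The function names, the `beta` parameter, and the toy numbers are all hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_z for v in logits]


def reward_guided_distribution(base_logits, rewards, beta=1.0):
    """Combine a base model's logits with per-label rewards.

    Assumed form (not quoted from the paper): p(y) is proportional to
    p_base(y) * exp(r(y) / beta).
    """
    if len(base_logits) != len(rewards):
        raise ValueError("base model and reward model must share the label set")
    base_logp = log_softmax(base_logits)
    combined = [lp + r / beta for lp, r in zip(base_logp, rewards)]
    return [math.exp(v) for v in log_softmax(combined)]


# Toy example: the same reward vector reused with two different base models.
rewards = [0.0, 3.0, 0.0]
old_base = [2.0, 1.0, 0.0]
new_base = [1.5, 1.0, 0.5]
print(reward_guided_distribution(old_base, rewards))
print(reward_guided_distribution(new_base, rewards))
```

Under this assumed form, only the base model and the reward model run at inference time, which is consistent with the abstract's claim of lower overhead than combining a new base model with an old base model and its fine-tuned counterpart.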
+
+##### **Mind the Gap: Aligning the Brain with Language Models Requires a Nonlinear and Multimodal Approach**
+2502.12771v1 by Danny Dongyeop Han, Yunju Cho, Jiook Cha, Jay-Yoon Lee
+
+Self-supervised language and audio models effectively predict brain responses
+to speech. However, traditional prediction models rely on linear mappings from
+unimodal features, despite the complex integration of auditory signals with
+linguistic and semantic information across widespread brain networks during
+speech comprehension. Here, we introduce a nonlinear, multimodal prediction
+model that combines audio and linguistic features from pre-trained models
+(e.g., LLAMA, Whisper). Our approach achieves a 17.2% and 17.9% improvement in
+prediction performance (unnormalized and normalized correlation) over
+traditional unimodal linear models, as well as a 7.7% and 14.4% improvement,
+respectively, over prior state-of-the-art models. These improvements represent
+a major step towards future robust in-silico testing and improved decoding
+performance. They also reveal how auditory and semantic information are fused
+in motor, somatosensory, and higher-level semantic regions, aligning with
+existing neurolinguistic theories. Overall, our work highlights the often
+neglected potential of nonlinear and multimodal approaches to brain modeling,
+paving the way for future studies to embrace these strategies in naturalistic
+neurolinguistics research.
+
+摘要：自我監督的語言和音訊模型有效預測大腦對語言的反應。然而，傳統的預測模型依賴於單模態特徵的線性映射，儘管在語言理解過程中，聽覺信號與語言和語義資訊在廣泛的腦網路中進行複雜的整合。在此，我們引入一個非線性、多模態預測模型，結合預先訓練模型（例如，LLAMA、Whisper）中的音訊和語言特徵。我們的做法在預測效能上（未正規化和正規化相關性）分別比傳統的單模態線性模型提升了 17.2% 和 17.9%，分別比先前的最先進模型提升了 7.7% 和 14.4%。這些改進代表了未來穩健的電腦模擬測試和改進的解碼效能邁出了一大步。它們也揭示了聽覺和語義資訊如何在運動、體感和更高層次的語義區域中融合，與現有的神經語言學理論一致。總的來說，我們的研究突出了非線性和多模態大腦建模方法經常被忽略的潛力，為未來研究在自然主義神經語言學研究中採用這些策略鋪平了道路。
+
+##### **How Much Do LLMs Hallucinate across Languages? On Multilingual Estimation of LLM Hallucination in the Wild**
+2502.12769v1 by Saad Obaid ul Islam, Anne Lauscher, Goran Glavaš
+
+In the age of misinformation, hallucination -- the tendency of Large Language
+Models (LLMs) to generate non-factual or unfaithful responses -- represents the
+main risk for their global utility. Despite LLMs becoming increasingly
+multilingual, the vast majority of research on detecting and quantifying LLM
+hallucination is (a) English-centric and (b) focused on machine translation (MT)
+and summarization, tasks that are less common "in the wild" than open
+information seeking. In contrast, we aim to quantify the extent of LLM
+hallucination across languages in knowledge-intensive long-form question
+answering. To this end, we train a multilingual hallucination detection model
+and conduct a large-scale study across 30 languages and 6 open-source LLM
+families. We start from an English hallucination detection dataset and rely on
+MT to generate (noisy) training data in other languages. We also manually
+annotate gold data for five high-resource languages; we then demonstrate, for
+these languages, that the estimates of hallucination rates are similar between
+silver (LLM-generated) and gold test sets, validating the use of silver data
+for estimating hallucination rates for other languages. For the final rates
+estimation, we build a knowledge-intensive QA dataset for 30 languages with
+LLM-generated prompts and Wikipedia articles as references. We find that, while
+LLMs generate longer responses with more hallucinated tokens for
+higher-resource languages, there is no correlation between length-normalized
+hallucination rates of languages and their digital representation. Further, we
+find that smaller LLMs exhibit larger hallucination rates than larger models.
+
+摘要：在錯誤訊息的時代，幻覺（即大型語言模型 (LLM) 產生非事實或不忠實回應的傾向）代表其全球效用的主要風險。儘管 LLM 變得越來越多元化，但絕大多數關於偵測和量化 LLM 幻覺的研究都是 (a) 以英語為中心，(b) 專注於機器翻譯 (MT) 和摘要，這些任務在「野外」中不如開放式資訊搜尋常見。相反地，我們旨在量化 LLM 在知識密集型長篇問答中跨語言的幻覺程度。為此，我們訓練了一個多語言幻覺偵測模型，並針對 30 種語言和 6 個開放原始碼 LLM 家族進行大規模研究。我們從一個英語幻覺偵測資料集開始，並依賴 MT 在其他語言中產生（有雜訊的）訓練資料。我們還手動為五種高資源語言註解黃金資料；然後我們證明，對於這些語言，幻覺率的估計值在白銀（LLM 產生）和黃金測試集之間是相似的，驗證了使用白銀資料來估計其他語言的幻覺率。對於最終的比率估計，我們建立了一個知識密集型問答資料集，其中包含 30 種語言，並以 LLM 產生的提示和維基百科文章作為參考。我們發現，儘管 LLM 為資源較多的語言產生了更長的回應和更多幻覺的代幣，但語言的長度正規化幻覺率與其數位表示之間沒有相關性。此外，我們發現較小的 LLM 表現出比較大的模型更大的幻覺率。
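
The abstract above distinguishes raw counts of hallucinated tokens from length-normalized hallucination rates. A minimal sketch of that distinction, assuming a detector that emits one 0/1 label per token (the label format and the example responses are hypothetical):

```python
def hallucination_rate(token_labels):
    """Fraction of tokens flagged as hallucinated (1 = hallucinated, 0 = supported)."""
    if not token_labels:
        raise ValueError("response has no tokens")
    return sum(token_labels) / len(token_labels)


# Hypothetical detector output for a short and a long response.
short_response = [0, 0, 1, 0]              # 1 hallucinated token out of 4
long_response = [0, 0, 1, 0, 0, 1, 0, 0]   # 2 hallucinated tokens out of 8

print(sum(short_response), hallucination_rate(short_response))
print(sum(long_response), hallucination_rate(long_response))
```

The longer response contains more hallucinated tokens in absolute terms, yet both responses have the same normalized rate, which is why the two measures can tell different stories across languages.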
+
+##### **R2-KG: General-Purpose Dual-Agent Framework for Reliable Reasoning on Knowledge Graphs**
+2502.12767v1 by Sumin Jo, Junseong Choi, Jiho Kim, Edward Choi
+
+Recent studies have combined Large Language Models (LLMs) with Knowledge
+Graphs (KGs) to enhance reasoning, improving inference accuracy without
+additional training while mitigating hallucination. However, existing
+frameworks are often rigid, struggling to adapt to KG or task changes. They
+also rely heavily on powerful LLMs for reliable (i.e., trustworthy) reasoning.
+To address this, we introduce R2-KG, a plug-and-play, dual-agent framework that
+separates reasoning into two roles: an Operator (a low-capacity LLM) that
+gathers evidence and a Supervisor (a high-capacity LLM) that makes final
+judgments. This design is cost-efficient for LLM inference while still
+maintaining strong reasoning accuracy. Additionally, R2-KG employs an
+Abstention mechanism, generating answers only when sufficient evidence is
+collected from the KG, which significantly enhances reliability. Experiments across
+multiple KG-based reasoning tasks show that R2-KG consistently outperforms
+baselines in both accuracy and reliability, regardless of the inherent
+capability of LLMs used as the Operator. Further experiments reveal that the
+single-agent version of R2-KG, equipped with a strict self-consistency
+strategy, achieves significantly higher-than-baseline reliability while
+reducing inference cost. However, it also leads to a higher abstention rate in
+complex KGs. Our findings establish R2-KG as a flexible and cost-effective
+solution for KG-based reasoning. It reduces reliance on high-capacity LLMs
+while ensuring trustworthy inference.
+
+摘要：最近的研究结合了大型语言模型 (LLM) 与知识图谱 (KG) 以增强推理，在不额外训练的情况下提高推理准确性，同时减轻幻觉。然而，现有的框架通常很僵化，难以适应知识图谱或任务的变化。它们还严重依赖强大的 LLM 来进行可靠（即值得信赖）的推理。为了解决这个问题，我们引入了 R2-KG，这是一个即插即用、双代理框架，它将推理分为两个角色：一个收集证据的操作员（低容量 LLM）和一个做出最终判断的监督员（高容量 LLM）。这种设计在 LLM 推理方面具有成本效益，同时仍保持强大的推理准确性。此外，R2-KG 采用弃权机制，仅在从知识图谱收集到足够证据时才生成答案，这显著提高了可靠性。跨多个基于知识图谱的推理任务的实验表明，R2-KG 在准确性和可靠性方面始终优于基线，而与用作操作员的 LLM 的固有能力无关。进一步的实验表明，R2-KG 的单代理版本配备了严格的自一致性策略，实现了明显高于基线的可靠性，同时降低了推理成本。然而，它也导致了复杂知识图谱中更高的弃权率。我们的发现将 R2-KG 确立为一种灵活且经济高效的基于知识图谱的推理解决方案。它减少了对高容量 LLM 的依赖，同时确保了可信的推理。
 
diff --git a/docs/AI/Knowledge Graphs.md b/docs/AI/Knowledge Graphs.md
index d786fb69ca..3404c1f043 100644
--- a/docs/AI/Knowledge Graphs.md	
+++ b/docs/AI/Knowledge Graphs.md	
@@ -2,6 +2,12 @@
 ### Knowledge Graphs
 |Publish Date|Title|Authors|Homepage|Code|
 | :---: | :---: | :---: | :---: | :---: |
+|**2025-02-18**|**Learning to Defer for Causal Discovery with Imperfect Experts**|Oscar Clivio et.al.|[2502.13132v1](http://arxiv.org/abs/2502.13132v1)|null|
+|**2025-02-18**|**Agentic Deep Graph Reasoning Yields Self-Organizing Knowledge Networks**|Markus J. Buehler et.al.|[2502.13025v1](http://arxiv.org/abs/2502.13025v1)|null|
+|**2025-02-18**|**Adaptive Knowledge Graphs Enhance Medical Question Answering: Bridging the Gap Between LLMs and Evolving Medical Knowledge**|Mohammad Reza Rezaei et.al.|[2502.13010v1](http://arxiv.org/abs/2502.13010v1)|null|
+|**2025-02-18**|**R2-KG: General-Purpose Dual-Agent Framework for Reliable Reasoning on Knowledge Graphs**|Sumin Jo et.al.|[2502.12767v1](http://arxiv.org/abs/2502.12767v1)|null|
+|**2025-02-18**|**Perovskite-LLM: Knowledge-Enhanced Large Language Models for Perovskite Solar Cell Research**|Xiang Liu et.al.|[2502.12669v1](http://arxiv.org/abs/2502.12669v1)|null|
+|**2025-02-18**|**G-Refer: Graph Retrieval-Augmented Large Language Model for Explainable Recommendation**|Yuhan Li et.al.|[2502.12586v1](http://arxiv.org/abs/2502.12586v1)|null|
 |**2025-02-17**|**A-MEM: Agentic Memory for LLM Agents**|Wujiang Xu et.al.|[2502.12110v1](http://arxiv.org/abs/2502.12110v1)|null|
 |**2025-02-17**|**KnowPath: Knowledge-enhanced Reasoning via LLM-generated Inference Paths over Knowledge Graphs**|Qi Zhao et.al.|[2502.12029v1](http://arxiv.org/abs/2502.12029v1)|null|
 |**2025-02-17**|**Atom of Thoughts for Markov LLM Test-Time Scaling**|Fengwei Teng et.al.|[2502.12018v1](http://arxiv.org/abs/2502.12018v1)|null|
@@ -66,7 +72,7 @@
 |**2025-02-06**|**Division-of-Thoughts: Harnessing Hybrid Language Model Synergy for Efficient On-Device Agents**|Chenyang Shao et.al.|[2502.04392v1](http://arxiv.org/abs/2502.04392v1)|null|
 |**2025-02-06**|**Boosting Knowledge Graph-based Recommendations through Confidence-Aware Augmentation with Large Language Models**|Rui Cai et.al.|[2502.03715v1](http://arxiv.org/abs/2502.03715v1)|null|
 |**2025-02-05**|**A Schema-Guided Reason-while-Retrieve framework for Reasoning on Scene Graphs with Large-Language-Models (LLMs)**|Yiye Chen et.al.|[2502.03450v1](http://arxiv.org/abs/2502.03450v1)|null|
-|**2025-02-05**|**SymAgent: A Neural-Symbolic Self-Learning Agent Framework for Complex Reasoning over Knowledge Graphs**|Ben Liu et.al.|[2502.03283v1](http://arxiv.org/abs/2502.03283v1)|null|
+|**2025-02-05**|**SymAgent: A Neural-Symbolic Self-Learning Agent Framework for Complex Reasoning over Knowledge Graphs**|Ben Liu et.al.|[2502.03283v2](http://arxiv.org/abs/2502.03283v2)|null|
 |**2025-02-05**|**Analyze Feature Flow to Enhance Interpretation and Steering in Language Models**|Daniil Laptev et.al.|[2502.03032v2](http://arxiv.org/abs/2502.03032v2)|null|
 |**2025-02-05**|**A Benchmark for the Detection of Metalinguistic Disagreements between LLMs and Knowledge Graphs**|Bradley P. Allen et.al.|[2502.02896v1](http://arxiv.org/abs/2502.02896v1)|null|
 |**2025-02-05**|**Mol-LLM: Generalist Molecular LLM with Improved Graph Utilization**|Chanhui Lee et.al.|[2502.02810v1](http://arxiv.org/abs/2502.02810v1)|null|
@@ -96,14 +102,163 @@
 |**2025-01-29**|**Query-Aware Learnable Graph Pooling Tokens as Prompt for Large Language Models**|Wooyoung Kim et.al.|[2501.17549v1](http://arxiv.org/abs/2501.17549v1)|null|
 |**2025-01-29**|**General Scene Adaptation for Vision-and-Language Navigation**|Haodong Hong et.al.|[2501.17403v1](http://arxiv.org/abs/2501.17403v1)|[link](https://github.com/honghd16/gsa-vln)|
 |**2025-01-28**|**Comprehensive Evaluation for a Large Scale Knowledge Graph Question Answering Service**|Saloni Potdar et.al.|[2501.17270v1](http://arxiv.org/abs/2501.17270v1)|null|
-|**2025-01-28**|**FactCG: Enhancing Fact Checkers with Graph-Based Multi-Hop Data**|Deren Lei et.al.|[2501.17144v1](http://arxiv.org/abs/2501.17144v1)|[link](https://github.com/derenlei/factcg)|
-|**2025-01-28**|**LLM-AutoDiff: Auto-Differentiate Any LLM Workflow**|Li Yin et.al.|[2501.16673v2](http://arxiv.org/abs/2501.16673v2)|[link](https://github.com/sylphai-inc/adalflow)|
-|**2025-01-27**|**360Brew: A Decoder-only Foundation Model for Personalized Ranking and Recommendation**|Hamed Firooz et.al.|[2501.16450v3](http://arxiv.org/abs/2501.16450v3)|null|
-|**2025-01-27**|**Raiders of the Lost Dependency: Fixing Dependency Conflicts in Python using LLMs**|Antony Bartlett et.al.|[2501.16191v1](http://arxiv.org/abs/2501.16191v1)|null|
-|**2025-01-27**|**Harnessing Diverse Perspectives: A Multi-Agent Framework for Enhanced Error Detection in Knowledge Graphs**|Yu Li et.al.|[2501.15791v1](http://arxiv.org/abs/2501.15791v1)|[link](https://github.com/kse-eleven/makged)|
-|**2025-01-27**|**Automatic Feedback Generation for Short Answer Questions using Answer Diagnostic Graphs**|Momoka Furuhashi et.al.|[2501.15777v1](http://arxiv.org/abs/2501.15777v1)|null|
 
 #### Abstracts
+##### **Learning to Defer for Causal Discovery with Imperfect Experts**
+2502.13132v1 by Oscar Clivio, Divyat Mahajan, Perouz Taslakian, Sara Magliacane, Ioannis Mitliagkas, Valentina Zantedeschi, Alexandre Drouin
+
+Integrating expert knowledge, e.g. from large language models, into causal
+discovery algorithms can be challenging when the knowledge is not guaranteed to
+be correct. Expert recommendations may contradict data-driven results, and
+their reliability can vary significantly depending on the domain or specific
+query. Existing methods based on soft constraints or inconsistencies in
+predicted causal relationships fail to account for these variations in
+expertise. To remedy this, we propose L2D-CD, a method for gauging the
+correctness of expert recommendations and optimally combining them with
+data-driven causal discovery results. By adapting learning-to-defer (L2D)
+algorithms for pairwise causal discovery (CD), we learn a deferral function
+that selects whether to rely on classical causal discovery methods using
+numerical data or expert recommendations based on textual meta-data. We
+evaluate L2D-CD on the canonical Tübingen pairs dataset and demonstrate its
+superior performance compared to both the causal discovery method and the
+expert used in isolation. Moreover, our approach identifies domains where the
+expert's performance is strong or weak. Finally, we outline a strategy for
+generalizing this approach to causal discovery on graphs with more than two
+variables, paving the way for further research in this area.
+
+摘要：整合專家知識，例如從大型語言模型中整合到因果發現演算法中，當知識無法保證正確時會很有挑戰性。專家建議可能會與資料驅動的結果相矛盾，而且他們的可靠性可能會根據領域或特定查詢而有顯著差異。現有的基於軟約束或預測因果關係中不一致的方法無法說明專業知識中的這些變化。為了補救這一點，我們提出了 L2D-CD，一種用於評估專家建議的正確性並將其與資料驅動的因果發現結果最佳結合的方法。透過調整學習延遲 (L2D) 演算法以進行成對因果發現 (CD)，我們學習了一個延遲函數，用於選擇依賴使用數值資料的傳統因果發現方法或基於文字元資料的專家建議。我們在經典的 Tübingen 對資料集上評估 L2D-CD，並證明其與單獨使用的因果發現方法和專家相比具有優越的效能。此外，我們的做法識別出專家表現強或弱的領域。最後，我們概述了一種將此方法推廣到具有兩個以上變數的圖表上進行因果發現的策略，為此領域的進一步研究鋪平了道路。
+
+##### **Agentic Deep Graph Reasoning Yields Self-Organizing Knowledge Networks**
+2502.13025v1 by Markus J. Buehler
+
+We present an agentic, autonomous graph expansion framework that iteratively
+structures and refines knowledge in situ. Unlike conventional knowledge graph
+construction methods relying on static extraction or single-pass learning, our
+approach couples a reasoning-native large language model with a continually
+updated graph representation. At each step, the system actively generates new
+concepts and relationships, merges them into a global graph, and formulates
+subsequent prompts based on its evolving structure. Through this
+feedback-driven loop, the model organizes information into a scale-free network
+characterized by hub formation, stable modularity, and bridging nodes that link
+disparate knowledge clusters. Over hundreds of iterations, new nodes and edges
+continue to appear without saturating, while centrality measures and shortest
+path distributions evolve to yield increasingly distributed connectivity. Our
+analysis reveals emergent patterns, such as the rise of highly connected 'hub'
+concepts and the shifting influence of 'bridge' nodes, indicating that agentic,
+self-reinforcing graph construction can yield open-ended, coherent knowledge
+structures. Applied to materials design problems, we present compositional
+reasoning experiments by extracting node-specific and synergy-level principles
+to foster genuinely novel knowledge synthesis, yielding cross-domain ideas that
+transcend rote summarization and strengthen the framework's potential for
+open-ended scientific discovery. We discuss other applications in scientific
+discovery and outline future directions for enhancing scalability and
+interpretability.
+
+摘要：我們提出一個能動的、自主的圖形擴展框架，它反覆地建構和精煉原位知識。與依賴靜態提取或單次學習的傳統知識圖形建構方法不同，我們的做法將一個推理原生的大語言模型與一個持續更新的圖形表示結合起來。在每一步中，系統主動產生新的概念和關係，將它們合併到一個全域圖形中，並根據其不斷演化的結構制定後續提示。透過這個回饋驅動的迴圈，模型將資訊組織成一個無標度網路，其特徵是樞紐形成、穩定的模組化以及連結不同知識群集的橋接節點。在數百次反覆運算中，新的節點和邊緣會持續出現，而不會飽和，同時中心性測量和最短路徑分佈會演化為產生越來越分散的連通性。我們的分析揭示了新興模式，例如高度連接的「樞紐」概念的興起和「橋樑」節點影響力的轉移，這表明能動的、自我強化的圖形建構可以產生開放式、連貫的知識結構。應用於材料設計問題，我們提出組合推理實驗，透過提取特定於節點的原則和協同效應層級原則，以促進真正新穎的知識綜合，產生超越死背式摘要並強化框架在開放式科學發現中潛力的跨領域想法。我們討論了在科學發現中的其他應用，並概述了增強可擴充性和可解釋性的未來方向。
+
+##### **Adaptive Knowledge Graphs Enhance Medical Question Answering: Bridging the Gap Between LLMs and Evolving Medical Knowledge**
+2502.13010v1 by Mohammad Reza Rezaei, Reza Saadati Fard, Jayson Parker, Rahul G. Krishnan, Milad Lankarany
+
+Large Language Models (LLMs) have significantly advanced medical
+question-answering by leveraging extensive clinical data and medical
+literature. However, the rapid evolution of medical knowledge and the
+labor-intensive process of manually updating domain-specific resources pose
+challenges to the reliability of these systems. To address this, we introduce
+Adaptive Medical Graph-RAG (AMG-RAG), a comprehensive framework that automates
+the construction and continuous updating of medical knowledge graphs,
+integrates reasoning, and retrieves current external evidence, such as PubMed
+and WikiSearch. By dynamically linking new findings and complex medical
+concepts, AMG-RAG not only improves accuracy but also enhances interpretability
+in medical queries.
+Evaluations on the MEDQA and MEDMCQA benchmarks demonstrate the effectiveness
+of AMG-RAG, achieving an F1 score of 74.1 percent on MEDQA and an accuracy of
+66.34 percent on MEDMCQA, outperforming both comparable models and those 10 to
+100 times larger. Notably, these improvements are achieved without increasing
+computational overhead, highlighting the critical role of automated knowledge
+graph generation and external evidence retrieval in delivering up-to-date,
+trustworthy medical insights.
+
+摘要：大型語言模型 (LLM) 透過利用廣泛的臨床資料和醫學文獻，大幅提升了醫療問題解答的進步。然而，醫療知識的快速演進和手動更新特定領域資源的繁複程序，對這些系統的可靠性構成挑戰。為了解決這個問題，我們引入了適應性醫療圖表 RAG (AMG-RAG)，這是一個自動化建構和持續更新醫療知識圖表的綜合架構，整合推理並擷取 PubMed 和 WikiSearch 等最新的外部證據。透過動態連結新的發現和複雜的醫療概念，AMG-RAG 不僅提升了準確性，也增強了醫療查詢的可解釋性。在 MEDQA 和 MEDMCQA 基準上的評量證明了 AMG-RAG 的有效性，在 MEDQA 上達到了 74.1% 的 F1 分數，在 MEDMCQA 上達到了 66.34% 的準確度，優於其他同類模型以及那些大 10 到 100 倍的模型。值得注意的是，這些改進是在不增加運算負擔的情況下實現的，突顯了自動化知識圖表生成和外部證據擷取在提供最新、可信賴的醫療見解中扮演的重要角色。
+
+##### **R2-KG: General-Purpose Dual-Agent Framework for Reliable Reasoning on Knowledge Graphs**
+2502.12767v1 by Sumin Jo, Junseong Choi, Jiho Kim, Edward Choi
+
+Recent studies have combined Large Language Models (LLMs) with Knowledge
+Graphs (KGs) to enhance reasoning, improving inference accuracy without
+additional training while mitigating hallucination. However, existing
+frameworks are often rigid, struggling to adapt to KG or task changes. They
+also rely heavily on powerful LLMs for reliable (i.e., trustworthy) reasoning.
+To address this, we introduce R2-KG, a plug-and-play, dual-agent framework that
+separates reasoning into two roles: an Operator (a low-capacity LLM) that
+gathers evidence and a Supervisor (a high-capacity LLM) that makes final
+judgments. This design is cost-efficient for LLM inference while still
+maintaining strong reasoning accuracy. Additionally, R2-KG employs an
+Abstention mechanism, generating answers only when sufficient evidence is
+collected from the KG, which significantly enhances reliability. Experiments across
+multiple KG-based reasoning tasks show that R2-KG consistently outperforms
+baselines in both accuracy and reliability, regardless of the inherent
+capability of LLMs used as the Operator. Further experiments reveal that the
+single-agent version of R2-KG, equipped with a strict self-consistency
+strategy, achieves significantly higher-than-baseline reliability while
+reducing inference cost. However, it also leads to a higher abstention rate in
+complex KGs. Our findings establish R2-KG as a flexible and cost-effective
+solution for KG-based reasoning. It reduces reliance on high-capacity LLMs
+while ensuring trustworthy inference.
+
+摘要：最近的研究结合了大型语言模型 (LLM) 与知识图谱 (KG) 以增强推理，在不额外训练的情况下提高推理准确性，同时减轻幻觉。然而，现有的框架通常很僵化，难以适应知识图谱或任务的变化。它们还严重依赖强大的 LLM 来进行可靠（即值得信赖）的推理。为了解决这个问题，我们引入了 R2-KG，这是一个即插即用、双代理框架，它将推理分为两个角色：一个收集证据的操作员（低容量 LLM）和一个做出最终判断的监督员（高容量 LLM）。这种设计在 LLM 推理方面具有成本效益，同时仍保持强大的推理准确性。此外，R2-KG 采用弃权机制，仅在从知识图谱收集到足够证据时才生成答案，这显著提高了可靠性。跨多个基于知识图谱的推理任务的实验表明，R2-KG 在准确性和可靠性方面始终优于基线，而与用作操作员的 LLM 的固有能力无关。进一步的实验表明，R2-KG 的单代理版本配备了严格的自一致性策略，实现了明显高于基线的可靠性，同时降低了推理成本。然而，它也导致了复杂知识图谱中更高的弃权率。我们的发现将 R2-KG 确立为一种灵活且经济高效的基于知识图谱的推理解决方案。它减少了对高容量 LLM 的依赖，同时确保了可信的推理。
+
+##### **Perovskite-LLM: Knowledge-Enhanced Large Language Models for Perovskite Solar Cell Research**
+2502.12669v1 by Xiang Liu, Penglei Sun, Shuyan Chen, Longhan Zhang, Peijie Dong, Huajie You, Yongqi Zhang, Chang Yan, Xiaowen Chu, Tong-yi Zhang
+
+The rapid advancement of perovskite solar cells (PSCs) has led to an
+exponential growth in research publications, creating an urgent need for
+efficient knowledge management and reasoning systems in this domain. We present
+a comprehensive knowledge-enhanced system for PSCs that integrates three key
+components. First, we develop Perovskite-KG, a domain-specific knowledge graph
+constructed from 1,517 research papers, containing 23,789 entities and 22,272
+relationships. Second, we create two complementary datasets: Perovskite-Chat,
+comprising 55,101 high-quality question-answer pairs generated through a novel
+multi-agent framework, and Perovskite-Reasoning, containing 2,217 carefully
+curated materials science problems. Third, we introduce two specialized large
+language models: Perovskite-Chat-LLM for domain-specific knowledge assistance
+and Perovskite-Reasoning-LLM for scientific reasoning tasks. Experimental
+results demonstrate that our system significantly outperforms existing models
+in both domain-specific knowledge retrieval and scientific reasoning tasks,
+providing researchers with effective tools for literature review, experimental
+design, and complex problem-solving in PSC research.
+
+摘要：由於鈣鈦礦太陽能電池 (PSC) 快速進展，導致研究出版物呈指數成長，迫切需要在這領域建立有效的知識管理和推理系統。我們提出一個結合三項關鍵元件的 PSC 全面知識增強系統。首先，我們開發出 Perovskite-KG，一個由 1,517 篇研究論文建構而成、包含 23,789 個實體和 22,272 個關係的領域特定知識圖譜。其次，我們建立兩個互補的資料集：Perovskite-Chat，包含透過一個新穎的多代理架構產生的 55,101 個高品質問答配對；以及 Perovskite-Reasoning，包含 2,217 個仔細策展的材料科學問題。第三，我們推出兩個專門化大型語言模型：針對領域特定知識協助的 Perovskite-Chat-LLM，以及針對科學推理任務的 Perovskite-Reasoning-LLM。實驗結果顯示，我們的系統在領域特定知識擷取和科學推理任務上都明顯優於現有模型，為研究人員提供有效的工具，用於 PSC 研究中的文獻回顧、實驗設計和複雜問題解決。
+
+##### **G-Refer: Graph Retrieval-Augmented Large Language Model for Explainable Recommendation**
+2502.12586v1 by Yuhan Li, Xinni Zhang, Linhao Luo, Heng Chang, Yuxiang Ren, Irwin King, Jia Li
+
+Explainable recommendation has demonstrated significant advantages in
+informing users about the logic behind recommendations, thereby increasing
+system transparency, effectiveness, and trustworthiness. To provide
+personalized and interpretable explanations, existing works often combine the
+generation capabilities of large language models (LLMs) with collaborative
+filtering (CF) information. CF information extracted from the user-item
+interaction graph captures the user behaviors and preferences, which is crucial
+for providing informative explanations. However, due to the complexity of graph
+structure, effectively extracting the CF information from graphs still remains
+a challenge. Moreover, existing methods often struggle with the integration of
+extracted CF information with LLMs due to its implicit representation and the
+modality gap between graph structures and natural language explanations. To
+address these challenges, we propose G-Refer, a framework using graph
+retrieval-augmented large language models (LLMs) for explainable
+recommendation. Specifically, we first employ a hybrid graph retrieval
+mechanism to retrieve explicit CF signals from both structural and semantic
+perspectives. The retrieved CF information is explicitly formulated as
+human-understandable text by the proposed graph translation and accounts for
+the explanations generated by LLMs. To bridge the modality gap, we introduce
+knowledge pruning and retrieval-augmented fine-tuning to enhance the ability of
+LLMs to process and utilize the retrieved CF information to generate
+explanations. Extensive experiments show that G-Refer achieves superior
+performance compared with existing methods in both explainability and
+stability. Code and data are available at https://github.com/Yuhan1i/G-Refer.
+
+摘要：可解釋建議已證明在告知使用者建議背後的邏輯方面具有顯著優點，從而提高系統透明度、有效性和可信度。為了提供個人化且可解釋的說明，現有作品通常結合大型語言模型 (LLM) 的生成能力與協同過濾 (CF) 資訊。從使用者項目互動圖形中提取的 CF 資訊會擷取使用者行為和偏好，這對於提供資訊性說明至關重要。然而，由於圖形結構的複雜性，從圖形中有效提取 CF 資訊仍然是一個挑戰。此外，現有方法通常難以將提取的 CF 資訊與 LLM 整合，因為其隱含表示和圖形結構與自然語言說明之間的模式差距。為了應對這些挑戰，我們提出 G-Refer，一個使用圖形檢索增強型大型語言模型 (LLM) 的可解釋建議架構。具體來說，我們首先採用混合圖形檢索機制，從結構和語義角度檢索明確的 CF 訊號。檢索到的 CF 資訊由建議的圖形翻譯明確表述為人類可以理解的文字，並說明 LLM 生成的解釋。為了彌合模式差距，我們引入了知識修剪和檢索增強微調，以增強 LLM 處理和利用檢索到的 CF 資訊以產生解釋的能力。廣泛的實驗表明，與現有方法相比，G-Refer 在可解釋性和穩定性方面都取得了卓越的效能。程式碼和資料可在 https://github.com/Yuhan1i/G-Refer 取得。
+
 ##### **A-MEM: Agentic Memory for LLM Agents**
 2502.12110v1 by Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, Yongfeng Zhang
 
@@ -1624,7 +1779,7 @@ absence of agent-level demonstrations. Project code will be released.
 摘要：場景圖表已成為大型語言模型 (LLM) 以基礎空間推理為基礎的結構化且可序列化的環境表徵。在這項工作中，我們提出 SG-RwR，一個以綱要為導向的檢索與推理框架，用於場景圖表的推理和規劃。我們的做法採用了兩個協作的、編寫程式碼的 LLM 代理：一個 (1) 推論器，用於任務規劃和資訊查詢產生，以及一個 (2) 檢索器，用於根據查詢提取對應的圖形資訊。兩個代理反覆合作，實現對圖形資訊的順序推理和適應性關注。與先前的作品不同，兩個代理僅提示場景圖表綱要，而不是完整的圖形資料，這透過限制輸入代碼減少了幻覺，並驅使推論器抽象地產生推理軌跡。根據軌跡，檢索器根據綱要理解以程式化方式查詢場景圖形資料，允許對圖形進行動態和整體關注，增強推理和檢索之間的一致性。透過在多個模擬環境中的實驗，我們表明我們的框架在數值問答和規劃任務中超越了現有的基於 LLM 的方法，並且可以受益於任務級別的少次範例，即使在沒有代理級別示範的情況下也是如此。專案程式碼將會釋出。
 
 ##### **SymAgent: A Neural-Symbolic Self-Learning Agent Framework for Complex Reasoning over Knowledge Graphs**
-2502.03283v1 by Ben Liu, Jihai Zhang, Fangquan Lin, Cheng Yang, Min Peng, Wotao Yin
+2502.03283v2 by Ben Liu, Jihai Zhang, Fangquan Lin, Cheng Yang, Min Peng, Wotao Yin
 
 Recent advancements have highlighted that Large Language Models (LLMs) are
 prone to hallucinations when solving complex reasoning problems, leading to
@@ -1651,7 +1806,7 @@ better or comparable performance compared to various strong baselines. Further
 analysis reveals that our agent can identify missing triples, facilitating
 automatic KG updates.
 
-摘要：<paragraph>最近的研究表明，大型语言模型 (LLM) 在解决复杂的推理问题时容易出现幻觉，从而导致错误的结果。为了解决这个问题，研究人员结合了知识图谱 (KG) 来提高 LLM 的推理能力。然而，现有方法面临两个局限性：1) 它们通常假设问题的答案都包含在 KG 中，忽略了 KG 不完整的问题，2) 它们将 KG 视为一个静态存储库，而忽略了 KG 中固有的隐式逻辑推理结构。在本文中，我们介绍了 SymAgent，这是一个创新的神经符号代理框架，可以在 KG 和 LLM 之间实现协作增强。我们将 KG 概念化为动态环境，并将复杂的推理任务转化为一个多步骤的交互过程，使 KG 能够深入参与推理过程。SymAgent 由两个模块组成：Agent-Planner 和 Agent-Executor。Agent-Planner 利用 LLM 的归纳推理能力从 KG 中提取符号规则，指导高效的问题分解。Agent-Executor 自主调用预定义的动作工具来整合来自 KG 和外部文档的信息，解决 KG 不完整的问题。此外，我们设计了一个自学习框架，包括在线探索和离线迭代策略更新阶段，使代理能够自动合成推理轨迹并提高性能。实验结果表明，具有弱 LLM 主干的 SymAgent（即 7B 系列）与各种强大的基线相比，产生了更好或相当的性能。进一步的分析表明，我们的代理可以识别缺失的三元组，促进自动 KG 更新。</paragraph>
+摘要：最近的進展強調出，大型語言模型 (LLM) 在解決複雜推理問題時容易出現幻覺，導致錯誤的結果。為了解決這個問題，研究人員結合知識圖譜 (KG) 來改善 LLM 的推理能力。然而，現有方法面臨兩個限制：1) 它們通常假設問題的所有答案都包含在 KG 中，忽略了 KG 的不完整性問題，以及 2) 它們將 KG 視為一個靜態儲存庫，而忽略了 KG 中固有的隱式邏輯推理結構。在本文中，我們介紹了 SymAgent，一個創新的神經符號代理架構，它在 KG 和 LLM 之間實現了協作擴充。我們將 KG 概念化為動態環境，並將複雜的推理任務轉化為一個多步驟的互動過程，使 KG 能夠深入參與推理過程。SymAgent 包含兩個模組：代理規劃器和代理執行器。代理規劃器利用 LLM 的歸納推理能力從 KG 中提取符號規則，指導有效的問題分解。代理執行器自主地調用預定義的動作工具來整合來自 KG 和外部文件的資訊，解決 KG 不完整性的問題。此外，我們設計了一個自學習框架，包括線上探索和離線反覆的政策更新階段，使代理能夠自動合成推理軌跡並改善效能。實驗結果表明，具有弱 LLM 主幹的 SymAgent（即 7B 系列）與各種強大的基線相比，產生了更好或相當的效能。進一步的分析表明，我們的代理可以識別遺失的三元組，促進自動 KG 更新。
 
 ##### **Analyze Feature Flow to Enhance Interpretation and Steering in Language Models**
 2502.03032v2 by Daniil Laptev, Nikita Balagansky, Yaroslav Aksenov, Daniil Gavrilov
@@ -2347,209 +2502,3 @@ improve a real-world KGQA system.
 
 摘要：知識圖譜問答系統 (KGQA) 根據知識圖譜中的資料回答事實問題。KGQA 系統很複雜，因為系統必須理解知識尋求自然語言查詢中的關係和實體，並將它們對映到針對知識圖譜的結構化查詢，才能回答這些查詢。在本文中，我們介紹了 Chronos，這是一個用於產業規模 KGQA 的全面評估框架。它旨在全面評估這種多組件系統，重點關注：(1) 端對端和組件層級指標，(2) 可擴充至各種資料集，以及 (3) 可擴充的方法，用於在釋出前衡量系統的效能。在本文中，我們討論了與產業規模 KGQA 系統評估相關的獨特挑戰，檢視 Chronos 的設計，以及它如何應對這些挑戰。我們將展示它如何提供資料驅動決策的基礎，並討論使用它來衡量和改善真實世界 KGQA 系統的挑戰。
 
-##### **FactCG: Enhancing Fact Checkers with Graph-Based Multi-Hop Data**
-2501.17144v1 by Deren Lei, Yaxi Li, Siyao Li, Mengya Hu, Rui Xu, Ken Archer, Mingyu Wang, Emily Ching, Alex Deng
-
-Prior research on training grounded factuality classification models to
-detect hallucinations in large language models (LLMs) has relied on public
-natural language inference (NLI) data and synthetic data. However, conventional
-NLI datasets are not well-suited for document-level reasoning, which is
-critical for detecting LLM hallucinations. Recent approaches to document-level
-synthetic data generation involve iteratively removing sentences from documents
-and annotating factuality using LLM-based prompts. While effective, this method
-is computationally expensive for long documents and limited by the LLM's
-capabilities. In this work, we analyze the differences between existing
-synthetic training data used in state-of-the-art models and real LLM output
-claims. Based on our findings, we propose a novel approach for synthetic data
-generation, CG2C, that leverages multi-hop reasoning on context graphs
-extracted from documents. Our fact checker model, FactCG, demonstrates improved
-performance with more connected reasoning, using the same backbone models.
-Experiments show it even outperforms GPT-4-o on the LLM-Aggrefact benchmark
-with much smaller model size.
-
-摘要：先前的研究訓練了基於事實的分類模型，以偵測大型語言模型 (LLM) 中的幻覺，依賴於公開的自然語言推論 (NLI) 資料和合成資料。然而，傳統的 NLI 資料集並不適合文件層級的推理，這對於偵測 LLM 的幻覺至關重要。最近的文件層級合成資料生成方法涉及從文件中反覆移除句子，並使用基於 LLM 的提示註解事實。雖然有效，但此方法對於長文件來說在運算上很昂貴，且受限於 LLM 的能力。在這項工作中，我們分析了現有合成訓練資料與最先進模型中使用的真實 LLM 輸出宣告之間的差異。根據我們的研究結果，我們提出了一個用於合成資料生成的創新方法 CG2C，它利用從文件中提取的內容圖表進行多跳推理。我們的查核模型 FactCG 使用相同的骨幹模型，展示了在更多連結的推理下改進的效能。實驗表明，它甚至在 LLM-Aggrefact 基準上優於 GPT-4-o，且模型大小小得多。
-
-##### **LLM-AutoDiff: Auto-Differentiate Any LLM Workflow**
-2501.16673v2 by Li Yin, Zhangyang Wang
-
-Large Language Models (LLMs) have reshaped natural language processing,
-powering applications from multi-hop retrieval and question answering to
-autonomous agent workflows. Yet, prompt engineering -- the task of crafting
-textual inputs to effectively direct LLMs -- remains difficult and
-labor-intensive, particularly for complex pipelines that combine multiple LLM
-calls with functional operations like retrieval and data formatting. We
-introduce LLM-AutoDiff: a novel framework for Automatic Prompt Engineering
-(APE) that extends textual gradient-based methods (such as Text-Grad) to
-multi-component, potentially cyclic LLM architectures. Implemented within the
-AdalFlow library, LLM-AutoDiff treats each textual input as a trainable
-parameter and uses a frozen backward engine LLM to generate feedback-akin to
-textual gradients -- that guide iterative prompt updates. Unlike prior
-single-node approaches, LLM-AutoDiff inherently accommodates functional nodes,
-preserves time-sequential behavior in repeated calls (e.g., multi-hop loops),
-and combats the "lost-in-the-middle" problem by isolating distinct sub-prompts
-(instructions, formats, or few-shot examples). It further boosts training
-efficiency by focusing on error-prone samples through selective gradient
-computation. Across diverse tasks, including single-step classification,
-multi-hop retrieval-based QA, and agent-driven pipelines, LLM-AutoDiff
-consistently outperforms existing textual gradient baselines in both accuracy
-and training cost. By unifying prompt optimization through a graph-centric
-lens, LLM-AutoDiff offers a powerful new paradigm for scaling and automating
-LLM workflows - mirroring the transformative role that automatic
-differentiation libraries have long played in neural network research.
-
-摘要：大型語言模型 (LLM) 已重塑自然語言處理，
-為從多跳檢索和問答到
-自主代理工作流程的應用提供動力。然而，提示工程 -- 編寫
-文本輸入以有效指導 LLM 的任務 -- 仍然困難且
-勞動密集，特別是對於將多個 LLM
-呼叫與檢索和數據格式化等功能操作相結合的複雜管道。我們
-介紹 LLM-AutoDiff：一個用於自動提示工程 (APE) 的新框架，它將基於文本梯度的
-方法（例如 Text-Grad）擴展到多組件、潛在循環 LLM 架構中。在
-AdalFlow 庫中實施，LLM-AutoDiff 將每個文本輸入視為一個可訓練
-參數，並使用凍結的後向引擎 LLM 生成反饋——類似於
-文本梯度——指導迭代提示更新。與先前的
-單節點方法不同，LLM-AutoDiff 本質上適應功能節點，
-在重複呼叫（例如，多跳循環）中保留時間順序行為，
-並通過隔離不同的子提示（說明、格式或少數鏡頭示例）來解決“迷失在中間”問題。它進一步提高訓練
-效率，通過選擇性梯度
-計算專注於容易出錯的樣本。在包括單步分類、
-多跳基於檢索的問答和代理驅動管道在內的各種任務中，LLM-AutoDiff
-在準確性和訓練成本方面始終優於現有的文本梯度基準。通過圖形中心化
-視角統一提示優化，LLM-AutoDiff 為擴展和自動化
-LLM 工作流程提供了一個強大的新範例——反映了自動
-微分庫在神經網絡研究中長期扮演的變革性角色。
-
-##### **360Brew: A Decoder-only Foundation Model for Personalized Ranking and Recommendation**
-2501.16450v3 by Hamed Firooz, Maziar Sanjabi, Adrian Englhardt, Aman Gupta, Ben Levine, Dre Olgiati, Gungor Polatkan, Iuliia Melnychuk, Karthik Ramgopal, Kirill Talanine, Kutta Srinivasan, Luke Simon, Natesh Sivasubramoniapillai, Necip Fazil Ayan, Qingquan Song, Samira Sriram, Souvik Ghosh, Tao Song, Tejas Dharamsi, Vignesh Kothapalli, Xiaoling Zhai, Ya Xu, Yu Wang, Yun Dai
-
-Ranking and recommendation systems are the foundation for numerous online
-experiences, ranging from search results to personalized content delivery.
-These systems have evolved into complex, multilayered architectures that
-leverage vast datasets and often incorporate thousands of predictive models.
-The maintenance and enhancement of these models is a labor intensive process
-that requires extensive feature engineering. This approach not only exacerbates
-technical debt but also hampers innovation in extending these systems to
-emerging problem domains. In this report, we present our research to address
-these challenges by utilizing a large foundation model with a textual interface
-for ranking and recommendation tasks. We illustrate several key advantages of
-our approach: (1) a single model can manage multiple predictive tasks involved
-in ranking and recommendation, (2) decoder models with textual interface due to
-their comprehension of reasoning capabilities, can generalize to new
-recommendation surfaces and out-of-domain problems, and (3) by employing
-natural language interfaces for task definitions and verbalizing member
-behaviors and their social connections, we eliminate the need for feature
-engineering and the maintenance of complex directed acyclic graphs of model
-dependencies. We introduce our research pre-production model, 360Brew V1.0, a
-150B parameter, decoder-only model that has been trained and fine-tuned on
-LinkedIn's data and tasks. This model is capable of solving over 30 predictive
-tasks across various segments of the LinkedIn platform, achieving performance
-levels comparable to or exceeding those of current production systems based on
-offline metrics, without task-specific fine-tuning. Notably, each of these
-tasks is conventionally addressed by dedicated models that have been developed
-and maintained over multiple years by teams of a similar or larger size than
-our own.
-
-摘要：排名和推薦系統是許多線上體驗的基礎，從搜尋結果到個人化內容傳遞。
-這些系統已演變成複雜的多層架構，利用龐大的資料集，並經常納入數千個預測模型。
-這些模型的維護和增強是一個勞力密集的過程，需要廣泛的特徵工程。
-這種方法不僅加劇了技術債務，也阻礙了將這些系統擴展到新興問題領域的創新。
-在此報告中，我們提出了我們的研究，以利用具有文字介面的大型基礎模型來解決這些挑戰，以進行排名和推薦任務。
-我們說明了我們方法的幾個主要優點：(1) 單一模型可以管理排名和推薦中涉及的多個預測任務，(2) 由於解碼器模型具有文字介面，因此它們對推理能力的理解，可以推廣到新的推薦表面和領域外問題，以及 (3) 通過採用自然語言介面進行任務定義和表達成員行為及其社交連接，我們消除了對特徵工程和維護複雜的模型相依性有向無環圖的需求。
-我們介紹了我們的研究前製作業模型 360Brew V1.0，這是一個 150B 參數，僅解碼器模型，已在 LinkedIn 的資料和任務上進行訓練和微調。
-此模型能夠解決 LinkedIn 平臺各個區塊中超過 30 個預測任務，在不針對任務進行微調的情況下，達到與基於離線指標的現行製作系統相當或超越的效能水準。
-值得注意的是，這些任務中的每個任務通常由專用模型處理，這些模型是由與我們規模相當或更大的團隊在多年間開發和維護的。
-
-##### **Raiders of the Lost Dependency: Fixing Dependency Conflicts in Python using LLMs**
-2501.16191v1 by Antony Bartlett, Cynthia Liem, Annibale Panichella
-
-Fixing Python dependency issues is a tedious and error-prone task for
-developers, who must manually identify and resolve environment dependencies and
-version constraints of third-party modules and Python interpreters. Researchers
-have attempted to automate this process by relying on large knowledge graphs
-and database lookup tables. However, these traditional approaches face
-limitations due to the variety of dependency error types, large sets of
-possible module versions, and conflicts among transitive dependencies. This
-study explores the potential of using large language models (LLMs) to
-automatically fix dependency issues in Python programs. We introduce PLLM
-(pronounced "plum"), a novel technique that employs retrieval-augmented
-generation (RAG) to help an LLM infer Python versions and required modules for
-a given Python file. PLLM builds a testing environment that iteratively (1)
-prompts the LLM for module combinations, (2) tests the suggested changes, and
-(3) provides feedback (error messages) to the LLM to refine the fix. This
-feedback cycle leverages natural language processing (NLP) to intelligently
-parse and interpret build error messages. We benchmark PLLM on the Gistable
-HG2.9K dataset, a collection of challenging single-file Python gists. We
-compare PLLM against two state-of-the-art automatic dependency inference
-approaches, namely PyEGo and ReadPyE, w.r.t. the ability to resolve dependency
-issues. Our results indicate that PLLM can fix more dependency issues than the
-two baselines, with +218 (+15.97%) more fixes over ReadPyE and +281 (+21.58%)
-over PyEGo. Our deeper analyses suggest that PLLM is particularly beneficial
-for projects with many dependencies and for specific third-party numerical and
-machine-learning modules. Our findings demonstrate the potential of LLM-based
-approaches to iteratively resolve Python dependency issues.
-
-摘要：<paragraph>修復 Python 依賴項問題對開發人員來說是一項繁瑣且容易出錯的任務，他們必須手動識別和解決第三方模組和 Python 解譯器的環境依賴項和版本限制。研究人員已嘗試透過依賴大型知識圖譜和資料庫查詢表來自動化此程序。然而，這些傳統方法由於依賴項錯誤類型多樣、可能的模組版本數量龐大，以及傳遞依賴項之間的衝突，而面臨限制。本研究探討使用大型語言模型 (LLM) 自動修復 Python 程式中的依賴項問題的可能性。我們介紹 PLLM（發音為「plum」），這是一種新穎的技術，採用檢索增強生成 (RAG) 來協助 LLM 推論 Python 版本和給定 Python 檔案所需的模組。PLLM 建立一個測試環境，反覆 (1) 提示 LLM 模組組合，(2) 測試建議的變更，以及 (3) 提供回饋（錯誤訊息）給 LLM 以改善修正。此回饋循環利用自然語言處理 (NLP) 來智慧解析和詮釋建置錯誤訊息。我們在 Gistable HG2.9K 資料集上對 PLLM 進行基準測試，該資料集是一個具有挑戰性的單一檔案 Python gist 集合。我們將 PLLM 與兩種最先進的自動依賴項推論方法進行比較，即 PyEGo 和 ReadPyE，以比較解決依賴項問題的能力。我們的結果顯示，PLLM 可以修復比這兩個基準更多的依賴項問題，比 ReadPyE 多修復了 +218 (+15.97%) 個，比 PyEGo 多修復了 +281 (+21.58%) 個。我們更深入的分析表明，PLLM 對具有許多依賴項的專案以及特定第三方數值和機器學習模組特別有益。我們的研究結果證明了基於 LLM 的方法反覆解決 Python 依賴項問題的可能性。</paragraph>
-
-##### **Harnessing Diverse Perspectives: A Multi-Agent Framework for Enhanced Error Detection in Knowledge Graphs**
-2501.15791v1 by Yu Li, Yi Huang, Guilin Qi, Junlan Feng, Nan Hu, Songlin Zhai, Haohan Xue, Yongrui Chen, Ruoyan Shen, Tongtong Wu
-
-Knowledge graphs are widely used in industrial applications, making error
-detection crucial for ensuring the reliability of downstream applications.
-Existing error detection methods often fail to effectively leverage
-fine-grained subgraph information and rely solely on fixed graph structures,
-while also lacking transparency in their decision-making processes, which
-results in suboptimal detection performance. In this paper, we propose a novel
-Multi-Agent framework for Knowledge Graph Error Detection (MAKGED) that
-utilizes multiple large language models (LLMs) in a collaborative setting. By
-concatenating fine-grained, bidirectional subgraph embeddings with LLM-based
-query embeddings during training, our framework integrates these
-representations to produce four specialized agents. These agents utilize
-subgraph information from different dimensions to engage in multi-round
-discussions, thereby improving error detection accuracy and ensuring a
-transparent decision-making process. Extensive experiments on FB15K and WN18RR
-demonstrate that MAKGED outperforms state-of-the-art methods, enhancing the
-accuracy and robustness of KG evaluation. For specific industrial scenarios,
-our framework can facilitate the training of specialized agents using
-domain-specific knowledge graphs for error detection, which highlights the
-potential industrial application value of our framework. Our code and datasets
-are available at https://github.com/kse-ElEvEn/MAKGED.
-
-摘要：知識圖譜廣泛應用於工業應用中，使得錯誤偵測對於確保下游應用的可靠性至關重要。現有的錯誤偵測方法通常無法有效利用細粒度的子圖資訊，並且僅依賴於固定的圖形結構，同時在它們的決策過程中也缺乏透明度，這導致次佳的偵測效能。在本文中，我們提出了一個用於知識圖譜錯誤偵測 (MAKGED) 的新多代理架構，它在協作設定中利用了多個大型語言模型 (LLM)。透過在訓練期間將細粒度、雙向子圖嵌入與基於 LLM 的查詢嵌入串接，我們的架構整合了這些表示以產生四個專門代理。這些代理利用不同維度的子圖資訊參與多輪討論，從而提高錯誤偵測準確度並確保透明的決策過程。在 FB15K 和 WN18RR 上的廣泛實驗表明，MAKGED 優於最先進的方法，增強了 KG 評估的準確性和穩健性。對於特定產業情境，我們的架構可以利用特定領域的知識圖譜來促進專門代理的訓練以進行錯誤偵測，這突顯了我們架構的潛在產業應用價值。我們的程式碼和資料集可在 https://github.com/kse-ElEvEn/MAKGED 取得。
-
-##### **Automatic Feedback Generation for Short Answer Questions using Answer Diagnostic Graphs**
-2501.15777v1 by Momoka Furuhashi, Hiroaki Funayama, Yuya Iwase, Yuichiroh Matsubayashi, Yoriko Isobe, Toru Nagahama, Saku Sugawara, Kentaro Inui
-
-Short-reading comprehension questions help students understand text structure
-but lack effective feedback. Students struggle to identify and correct errors,
-while manual feedback creation is labor-intensive. This highlights the need for
-automated feedback linking responses to a scoring rubric for deeper
-comprehension.
-  Despite advances in Natural Language Processing (NLP), research has focused
-on automatic grading, with limited work on feedback generation. To address
-this, we propose a system that generates feedback for student responses.
-  Our contributions are twofold. First, we introduce the first system for
-feedback on short-answer reading comprehension. These answers are derived from
-the text, requiring structural understanding. We propose an "answer diagnosis
-graph," integrating the text's logical structure with feedback templates. Using
-this graph and NLP techniques, we estimate students' comprehension and generate
-targeted feedback.
-  Second, we evaluate our feedback through an experiment with Japanese high
-school students (n=39). They answered two 70-80 word questions and were divided
-into two groups with minimal academic differences. One received a model answer,
-the other system-generated feedback. Both re-answered the questions, and we
-compared score changes. A questionnaire assessed perceptions and motivation.
-  Results showed no significant score improvement between groups, but
-system-generated feedback helped students identify errors and key points in the
-text. It also significantly increased motivation. However, further refinement
-is needed to enhance text structure understanding.
-
-摘要：短篇閱讀理解題目有助學生理解文章結構，但缺乏有效的回饋。學生難以找出並更正錯誤，而手動建立回饋又很費力。這突顯了自動化回饋的必要性，將回應連結到評分標準，以獲得更深入的理解。
-
-儘管自然語言處理 (NLP) 有所進展，但研究一直集中在自動評分上，而回饋生成的工作有限。為了解決這個問題，我們提出了一個系統，用於為學生的回答產生回饋。
-
-我們的貢獻有兩個方面。首先，我們引入了第一個針對簡答閱讀理解提供回饋的系統。這些答案來自於文本，需要結構化的理解。我們提出了一個「答案診斷圖」，將文本的邏輯結構與回饋範本整合在一起。使用這個圖表和 NLP 技術，我們估計學生的理解力並產生有針對性的回饋。
-
-其次，我們透過一項針對日本高中生的實驗（n=39）來評估我們的回饋。他們回答了兩個 70-80 字的問題，並被分成兩組，學術差異最小。一組收到範本答案，另一組收到系統產生的回饋。兩組都重新回答了問題，我們比較了分數的變化。一份問卷評估了認知和動機。
-
-結果顯示兩組之間沒有顯著的分數進步，但系統產生的回饋有助於學生找出文本中的錯誤和重點。它也顯著地提高了動機。然而，需要進一步的改進來增強對文本結構的理解。
-
diff --git a/docs/AI/LLM.md b/docs/AI/LLM.md
index 0225b24ad2..6d6f602247 100644
--- a/docs/AI/LLM.md
+++ b/docs/AI/LLM.md
@@ -2,2446 +2,2456 @@
 ### LLM
 |Publish Date|Title|Authors|Homepage|Code|
 | :---: | :---: | :---: | :---: | :---: |
-|**2025-02-17**|**Diffusion Models without Classifier-free Guidance**|Zhicong Tang et.al.|[2502.12154v1](http://arxiv.org/abs/2502.12154v1)|[link](https://github.com/tzco/Diffusion-wo-CFG)|
-|**2025-02-17**|**Idiosyncrasies in Large Language Models**|Mingjie Sun et.al.|[2502.12150v1](http://arxiv.org/abs/2502.12150v1)|null|
-|**2025-02-17**|**HARBOR: Exploring Persona Dynamics in Multi-Agent Competition**|Kenan Jiang et.al.|[2502.12149v1](http://arxiv.org/abs/2502.12149v1)|null|
-|**2025-02-17**|**Fast or Better? Balancing Accuracy and Cost in Retrieval-Augmented Generation with Flexible User Control**|Jinyan Su et.al.|[2502.12145v1](http://arxiv.org/abs/2502.12145v1)|null|
-|**2025-02-17**|**Small Models Struggle to Learn from Strong Reasoners**|Yuetai Li et.al.|[2502.12143v1](http://arxiv.org/abs/2502.12143v1)|null|
-|**2025-02-17**|**SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs**|Yige Xu et.al.|[2502.12134v1](http://arxiv.org/abs/2502.12134v1)|null|
-|**2025-02-17**|**Transformer Dynamics: A neuroscientific approach to interpretability of large language models**|Jesseba Fernando et.al.|[2502.12131v1](http://arxiv.org/abs/2502.12131v1)|null|
-|**2025-02-17**|**Scaling Autonomous Agents via Automatic Reward Modeling And Planning**|Zhenfang Chen et.al.|[2502.12130v1](http://arxiv.org/abs/2502.12130v1)|null|
-|**2025-02-17**|**LaM-SLidE: Latent Space Modeling of Spatial Dynamical Systems via Linked Entities**|Florian Sestak et.al.|[2502.12128v1](http://arxiv.org/abs/2502.12128v1)|null|
-|**2025-02-17**|**On the Query Complexity of Verifier-Assisted Language Generation**|Edoardo Botta et.al.|[2502.12123v1](http://arxiv.org/abs/2502.12123v1)|null|
-|**2025-02-17**|**LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws**|Prasanna Mayilvahanan et.al.|[2502.12120v1](http://arxiv.org/abs/2502.12120v1)|null|
-|**2025-02-17**|**PRISM: Self-Pruning Intrinsic Selection Method for Training-Free Multimodal Data Selection**|Jinhe Bi et.al.|[2502.12119v1](http://arxiv.org/abs/2502.12119v1)|null|
-|**2025-02-17**|**Scaling Test-Time Compute Without Verification or RL is Suboptimal**|Amrith Setlur et.al.|[2502.12118v1](http://arxiv.org/abs/2502.12118v1)|null|
-|**2025-02-17**|**A-MEM: Agentic Memory for LLM Agents**|Wujiang Xu et.al.|[2502.12110v1](http://arxiv.org/abs/2502.12110v1)|null|
-|**2025-02-17**|**Personality Structured Interview for Large Language Model Simulation in Personality Research**|Pengda Wang et.al.|[2502.12109v1](http://arxiv.org/abs/2502.12109v1)|null|
-|**2025-02-17**|**Using the Path of Least Resistance to Explain Deep Networks**|Sina Salek et.al.|[2502.12108v1](http://arxiv.org/abs/2502.12108v1)|null|
-|**2025-02-17**|**Relational Norms for Human-AI Cooperation**|Brian D. Earp et.al.|[2502.12102v1](http://arxiv.org/abs/2502.12102v1)|null|
-|**2025-02-17**|**A Study on Leveraging Search and Self-Feedback for Agent Reasoning**|Karthikeyan K et.al.|[2502.12094v1](http://arxiv.org/abs/2502.12094v1)|null|
-|**2025-02-17**|**Meta-Statistical Learning: Supervised Learning of Statistical Inference**|Maxime Peyrard et.al.|[2502.12088v1](http://arxiv.org/abs/2502.12088v1)|null|
-|**2025-02-17**|**APB: Accelerating Distributed Long-Context Inference by Passing Compressed Context Blocks across GPUs**|Yuxiang Huang et.al.|[2502.12085v1](http://arxiv.org/abs/2502.12085v1)|null|
-|**2025-02-17**|**VLM$^2$-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit Matching Visual Cues**|Jianshu Zhang et.al.|[2502.12084v1](http://arxiv.org/abs/2502.12084v1)|null|
-|**2025-02-17**|**AdaSplash: Adaptive Sparse Flash Attention**|Nuno Gonçalves et.al.|[2502.12082v1](http://arxiv.org/abs/2502.12082v1)|null|
-|**2025-02-17**|**Unhackable Temporal Rewarding for Scalable Video MLLMs**|En Yu et.al.|[2502.12081v1](http://arxiv.org/abs/2502.12081v1)|null|
-|**2025-02-17**|**Can LLMs Simulate Social Media Engagement? A Study on Action-Guided Response Generation**|Zhongyi Qiu et.al.|[2502.12073v1](http://arxiv.org/abs/2502.12073v1)|null|
-|**2025-02-17**|**TokenSkip: Controllable Chain-of-Thought Compression in LLMs**|Heming Xia et.al.|[2502.12067v1](http://arxiv.org/abs/2502.12067v1)|null|
-|**2025-02-17**|**CONSTRUCTA: Automating Commercial Construction Schedules in Fabrication Facilities with Large Language Models**|Yifan Zhang et.al.|[2502.12066v1](http://arxiv.org/abs/2502.12066v1)|null|
-|**2025-02-17**|**Formalizing Complex Mathematical Statements with LLMs: A Study on Mathematical Definitions**|Lan Zhang et.al.|[2502.12065v1](http://arxiv.org/abs/2502.12065v1)|null|
-|**2025-02-17**|**AI-generated Text Detection with a GLTR-based Approach**|Lucía Yan Wu et.al.|[2502.12064v1](http://arxiv.org/abs/2502.12064v1)|null|
-|**2025-02-17**|**Culture is Not Trivia: Sociocultural Theory for Cultural NLP**|Naitian Zhou et.al.|[2502.12057v1](http://arxiv.org/abs/2502.12057v1)|null|
-|**2025-02-17**|**Designing Role Vectors to Improve LLM Inference Behaviour**|Daniele Potertì et.al.|[2502.12055v1](http://arxiv.org/abs/2502.12055v1)|null|
-|**2025-02-17**|**PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning**|Xinyu Zhang et.al.|[2502.12054v1](http://arxiv.org/abs/2502.12054v1)|null|
-|**2025-02-17**|**A Dual-Perspective NLG Meta-Evaluation Framework with Automatic Benchmark and Better Interpretability**|Xinyu Hu et.al.|[2502.12052v1](http://arxiv.org/abs/2502.12052v1)|null|
-|**2025-02-17**|**How to Upscale Neural Networks with Scaling Law? A Survey and Practical Guidelines**|Ayan Sengupta et.al.|[2502.12051v1](http://arxiv.org/abs/2502.12051v1)|null|
-|**2025-02-17**|**SpeechT: Findings of the First Mentorship in Speech Translation**|Yasmin Moslem et.al.|[2502.12050v1](http://arxiv.org/abs/2502.12050v1)|null|
-|**2025-02-17**|**A Survey on Bridging EEG Signals and Generative AI: From Image and Text to Beyond**|Shreya Shukla et.al.|[2502.12048v1](http://arxiv.org/abs/2502.12048v1)|null|
-|**2025-02-17**|**KnowPath: Knowledge-enhanced Reasoning via LLM-generated Inference Paths over Knowledge Graphs**|Qi Zhao et.al.|[2502.12029v1](http://arxiv.org/abs/2502.12029v1)|null|
-|**2025-02-17**|**SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities**|Fengqing Jiang et.al.|[2502.12025v1](http://arxiv.org/abs/2502.12025v1)|null|
-|**2025-02-17**|**Teaching LLMs According to Their Aptitude: Adaptive Reasoning for Mathematical Problem Solving**|Xin Xu et.al.|[2502.12022v1](http://arxiv.org/abs/2502.12022v1)|null|
-|**2025-02-17**|**Atom of Thoughts for Markov LLM Test-Time Scaling**|Fengwei Teng et.al.|[2502.12018v1](http://arxiv.org/abs/2502.12018v1)|null|
-|**2025-02-17**|**Demographic Attributes Prediction from Speech Using WavLM Embeddings**|Yuchen Yang et.al.|[2502.12007v1](http://arxiv.org/abs/2502.12007v1)|null|
-|**2025-02-17**|**Merging Language and Domain Specific Models: The Impact on Technical Vocabulary Acquisition**|Thibault Rousset et.al.|[2502.12001v1](http://arxiv.org/abs/2502.12001v1)|null|
-|**2025-02-17**|**Presumed Cultural Identity: How Names Shape LLM Responses**|Siddhesh Pawar et.al.|[2502.11995v1](http://arxiv.org/abs/2502.11995v1)|null|
-|**2025-02-17**|**Characterizing Photorealism and Artifacts in Diffusion Model-Generated Images**|Negar Kamali et.al.|[2502.11989v1](http://arxiv.org/abs/2502.11989v1)|null|
-|**2025-02-17**|**Generating Text from Uniform Meaning Representation**|Emma Markle et.al.|[2502.11973v1](http://arxiv.org/abs/2502.11973v1)|null|
-|**2025-02-17**|**Learning Generalizable Prompt for CLIP with Class Similarity Knowledge**|Sehun Jung et.al.|[2502.11969v1](http://arxiv.org/abs/2502.11969v1)|null|
-|**2025-02-17**|**A MIMO Wireless Channel Foundation Model via CIR-CSI Consistency**|Jun Jiang et.al.|[2502.11965v1](http://arxiv.org/abs/2502.11965v1)|null|
-|**2025-02-17**|**Navigating the Helpfulness-Truthfulness Trade-Off with Uncertainty-Aware Instruction Fine-Tuning**|Tianyi Wu et.al.|[2502.11962v1](http://arxiv.org/abs/2502.11962v1)|null|
-|**2025-02-17**|**STRIVE: Structured Reasoning for Self-Improvement in Claim Verification**|Haisong Gong et.al.|[2502.11959v1](http://arxiv.org/abs/2502.11959v1)|null|
-|**2025-02-17**|**Can Your Uncertainty Scores Detect Hallucinated Entity?**|Min-Hsuan Yeh et.al.|[2502.11948v1](http://arxiv.org/abs/2502.11948v1)|null|
-|**2025-02-17**|**Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction**|Ailin Huang et.al.|[2502.11946v1](http://arxiv.org/abs/2502.11946v1)|null|
-|**2025-02-17**|**Deep Spatio-Temporal Neural Network for Air Quality Reanalysis**|Ammar Kheder et.al.|[2502.11941v1](http://arxiv.org/abs/2502.11941v1)|null|
-|**2025-02-17**|**FitLight: Federated Imitation Learning for Plug-and-Play Autonomous Traffic Signal Control**|Yutong Ye et.al.|[2502.11937v1](http://arxiv.org/abs/2502.11937v1)|null|
-|**2025-02-17**|**On Representational Dissociation of Language and Arithmetic in Large Language Models**|Riku Kisako et.al.|[2502.11932v1](http://arxiv.org/abs/2502.11932v1)|null|
-|**2025-02-17**|**BRIGHTER: BRIdging the Gap in Human-Annotated Textual Emotion Recognition Datasets for 28 Languages**|Shamsuddeen Hassan Muhammad et.al.|[2502.11926v1](http://arxiv.org/abs/2502.11926v1)|null|
-|**2025-02-17**|**GRAPHGPT-O: Synergistic Multimodal Comprehension and Generation on Graphs**|Yi Fang et.al.|[2502.11925v1](http://arxiv.org/abs/2502.11925v1)|null|
-|**2025-02-17**|**From Text to Trust: Empowering AI-assisted Decision Making with Adaptive LLM-powered Analysis**|Zhuoyan Li et.al.|[2502.11919v1](http://arxiv.org/abs/2502.11919v1)|null|
-|**2025-02-17**|**EssayJudge: A Multi-Granular Benchmark for Assessing Automated Essay Scoring Capabilities of Multimodal Large Language Models**|Jiamin Su et.al.|[2502.11916v1](http://arxiv.org/abs/2502.11916v1)|null|
-|**2025-02-17**|**On the robustness of ChatGPT in teaching Korean Mathematics**|Phuong-Nam Nguyen et.al.|[2502.11915v1](http://arxiv.org/abs/2502.11915v1)|null|
-|**2025-02-17**|**MMRC: A Large-Scale Benchmark for Understanding Multimodal Large Language Model in Real-World Conversation**|Haochen Xue et.al.|[2502.11903v1](http://arxiv.org/abs/2502.11903v1)|null|
-|**2025-02-17**|**Building A Proof-Oriented Programmer That Is 64% Better Than GPT-4o Under Data Scarsity**|Dylan Zhang et.al.|[2502.11901v1](http://arxiv.org/abs/2502.11901v1)|null|
-|**2025-02-17**|**DLFR-VAE: Dynamic Latent Frame Rate VAE for Video Generation**|Zhihang Yuan et.al.|[2502.11897v1](http://arxiv.org/abs/2502.11897v1)|null|
-|**2025-02-17**|**CAMEL: Continuous Action Masking Enabled by Large Language Models for Reinforcement Learning**|Yanxiao Zhao et.al.|[2502.11896v1](http://arxiv.org/abs/2502.11896v1)|null|
-|**2025-02-17**|**Continual Quantization-Aware Pre-Training: When to transition from 16-bit to 1.58-bit pre-training for BitNet language models?**|Jacob Nielsen et.al.|[2502.11895v1](http://arxiv.org/abs/2502.11895v1)|null|
-|**2025-02-17**|**Revisiting Classification Taxonomy for Grammatical Errors**|Deqing Zou et.al.|[2502.11890v1](http://arxiv.org/abs/2502.11890v1)|null|
-|**2025-02-17**|**Stonefish: Supporting Machine Learning Research in Marine Robotics**|Michele Grimaldi et.al.|[2502.11887v1](http://arxiv.org/abs/2502.11887v1)|null|
-|**2025-02-17**|**LIMR: Less is More for RL Scaling**|Xuefeng Li et.al.|[2502.11886v1](http://arxiv.org/abs/2502.11886v1)|null|
-|**2025-02-17**|**Leveraging Dual Process Theory in Language Agent Framework for Real-time Simultaneous Human-AI Collaboration**|Shao Zhang et.al.|[2502.11882v1](http://arxiv.org/abs/2502.11882v1)|null|
-|**2025-02-17**|**Hypothesis-Driven Theory-of-Mind Reasoning for Large Language Models**|Hyunwoo Kim et.al.|[2502.11881v1](http://arxiv.org/abs/2502.11881v1)|null|
-|**2025-02-17**|**Bitnet.cpp: Efficient Edge Inference for Ternary LLMs**|Jinheng Wang et.al.|[2502.11880v1](http://arxiv.org/abs/2502.11880v1)|null|
-|**2025-02-17**|**VAQUUM: Are Vague Quantifiers Grounded in Visual Data?**|Hugh Mee Wong et.al.|[2502.11874v1](http://arxiv.org/abs/2502.11874v1)|null|
-|**2025-02-17**|**Southern Newswire Corpus: A Large-Scale Dataset of Mid-Century Wire Articles Beyond the Front Page**|Michael McRae et.al.|[2502.11866v1](http://arxiv.org/abs/2502.11866v1)|null|
-|**2025-02-17**|**FedEAT: A Robustness Optimization Framework for Federated LLMs**|Yahao Pang et.al.|[2502.11863v1](http://arxiv.org/abs/2502.11863v1)|null|
-|**2025-02-17**|**Understanding In-Context Machine Translation for Low-Resource Languages: A Case Study on Manchu**|Renhao Pei et.al.|[2502.11862v1](http://arxiv.org/abs/2502.11862v1)|null|
-|**2025-02-17**|**Exploring Large Language Models in Healthcare: Insights into Corpora Sources, Customization Strategies, and Evaluation Metrics**|Shuqi Yang et.al.|[2502.11861v1](http://arxiv.org/abs/2502.11861v1)|null|
-|**2025-02-17**|**Defining and Evaluating Visual Language Models' Basic Spatial Abilities: A Perspective from Psychometrics**|Wenrui Xu et.al.|[2502.11859v1](http://arxiv.org/abs/2502.11859v1)|null|
-|**2025-02-17**|**LLMs as a synthesis between symbolic and continuous approaches to language**|Gemma Boleda et.al.|[2502.11856v1](http://arxiv.org/abs/2502.11856v1)|null|
-|**2025-02-17**|**BaxBench: Can LLMs Generate Correct and Secure Backends?**|Mark Vero et.al.|[2502.11844v1](http://arxiv.org/abs/2502.11844v1)|null|
-|**2025-02-17**|**Can LLM Agents Maintain a Persona in Discourse?**|Pranav Bhandari et.al.|[2502.11843v1](http://arxiv.org/abs/2502.11843v1)|null|
-|**2025-02-17**|**ChordFormer: A Conformer-Based Architecture for Large-Vocabulary Audio Chord Recognition**|Muhammad Waseem Akram et.al.|[2502.11840v1](http://arxiv.org/abs/2502.11840v1)|null|
-|**2025-02-17**|**Intuitive physics understanding emerges from self-supervised pretraining on natural videos**|Quentin Garrido et.al.|[2502.11831v1](http://arxiv.org/abs/2502.11831v1)|null|
-|**2025-02-17**|**Text Classification in the LLM Era - Where do we stand?**|Sowmya Vajjala et.al.|[2502.11830v1](http://arxiv.org/abs/2502.11830v1)|null|
-|**2025-02-17**|**Code-Vision: Evaluating Multimodal LLMs Logic Understanding and Code Generation Capabilities**|Hanbin Wang et.al.|[2502.11829v1](http://arxiv.org/abs/2502.11829v1)|null|
-|**2025-02-17**|**M-ABSA: A Multilingual Dataset for Aspect-Based Sentiment Analysis**|Chengyan Wu et.al.|[2502.11824v1](http://arxiv.org/abs/2502.11824v1)|null|
-|**2025-02-17**|**AAKT: Enhancing Knowledge Tracing with Alternate Autoregressive Modeling**|Hao Zhou et.al.|[2502.11817v1](http://arxiv.org/abs/2502.11817v1)|null|
-|**2025-02-17**|**Towards Understanding Fine-Tuning Mechanisms of LLMs via Circuit Analysis**|Xu Wang et.al.|[2502.11812v1](http://arxiv.org/abs/2502.11812v1)|null|
-|**2025-02-17**|**FineFilter: A Fine-grained Noise Filtering Mechanism for Retrieval-Augmented Large Language Models**|Qianchi Zhang et.al.|[2502.11811v1](http://arxiv.org/abs/2502.11811v1)|null|
-|**2025-02-17**|**Revealing Bias Formation in Deep Neural Networks Through the Geometric Mechanisms of Human Visual Decoupling**|Yanbiao Ma et.al.|[2502.11809v1](http://arxiv.org/abs/2502.11809v1)|null|
-|**2025-02-17**|**Exploring Translation Mechanism of Large Language Models**|Hongbin Zhang et.al.|[2502.11806v1](http://arxiv.org/abs/2502.11806v1)|null|
-|**2025-02-17**|**Table-Critic: A Multi-Agent Framework for Collaborative Criticism and Refinement in Table Reasoning**|Peiying Yu et.al.|[2502.11799v1](http://arxiv.org/abs/2502.11799v1)|null|
-|**2025-02-17**|**Personality Editing for Language Models through Relevant Knowledge Editing**|Seojin Hwang et.al.|[2502.11789v1](http://arxiv.org/abs/2502.11789v1)|null|
-|**2025-02-17**|**Efficient Response Generation Method Selection for Fine-Tuning Large Language Models**|Xuan Ren et.al.|[2502.11779v1](http://arxiv.org/abs/2502.11779v1)|null|
-|**2025-02-17**|**Deep Neural Networks for Accurate Depth Estimation with Latent Space Features**|Siddiqui Muhammad Yasir et.al.|[2502.11777v1](http://arxiv.org/abs/2502.11777v1)|null|
-|**2025-02-17**|**The Validation Gap: A Mechanistic Analysis of How Language Models Compute Arithmetic but Fail to Validate It**|Leonardo Bertolazzi et.al.|[2502.11771v1](http://arxiv.org/abs/2502.11771v1)|null|
-|**2025-02-17**|**Cognitive-Aligned Document Selection for Retrieval-augmented Generation**|Bingyu Wan et.al.|[2502.11770v1](http://arxiv.org/abs/2502.11770v1)|null|
-|**2025-02-17**|**From Selection to Generation: A Survey of LLM-based Active Learning**|Yu Xia et.al.|[2502.11767v1](http://arxiv.org/abs/2502.11767v1)|null|
-|**2025-02-17**|**Warmup-Distill: Bridge the Distribution Mismatch between Teacher and Student before Knowledge Distillation**|Zengkui Sun et.al.|[2502.11766v1](http://arxiv.org/abs/2502.11766v1)|null|
-|**2025-02-17**|**Lightweight Deepfake Detection Based on Multi-Feature Fusion**|Siddiqui Muhammad Yasir et.al.|[2502.11763v1](http://arxiv.org/abs/2502.11763v1)|null|
-|**2025-02-17**|**HintsOfTruth: A Multimodal Checkworthiness Detection Dataset with Real and Synthetic Claims**|Michiel van der Meer et.al.|[2502.11753v1](http://arxiv.org/abs/2502.11753v1)|null|
-|**2025-02-17**|**Language Models Can See Better: Visual Contrastive Decoding For LLM Multimodal Reasoning**|Yuqi Pang et.al.|[2502.11751v1](http://arxiv.org/abs/2502.11751v1)|null|
-|**2025-02-17**|**SQL-o1: A Self-Reward Heuristic Dynamic Search Method for Text-to-SQL**|Shuai Lyu et.al.|[2502.11741v1](http://arxiv.org/abs/2502.11741v1)|null|
+|**2025-02-18**|**SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation**|Zekun Qi et.al.|[2502.13143v1](http://arxiv.org/abs/2502.13143v1)|null|
+|**2025-02-18**|**Pre-training Auto-regressive Robotic Models with 4D Representations**|Dantong Niu et.al.|[2502.13142v1](http://arxiv.org/abs/2502.13142v1)|null|
+|**2025-02-18**|**UniGuardian: A Unified Defense for Detecting Prompt Injection, Backdoor Attacks and Adversarial Attacks in Large Language Models**|Huawei Lin et.al.|[2502.13141v1](http://arxiv.org/abs/2502.13141v1)|null|
+|**2025-02-18**|**AIDE: AI-Driven Exploration in the Space of Code**|Zhengyao Jiang et.al.|[2502.13138v1](http://arxiv.org/abs/2502.13138v1)|null|
+|**2025-02-18**|**Theorem Prover as a Judge for Synthetic Data Generation**|Joshua Ong Jun Leang et.al.|[2502.13137v1](http://arxiv.org/abs/2502.13137v1)|null|
+|**2025-02-18**|**Sleepless Nights, Sugary Days: Creating Synthetic Users with Health Conditions for Realistic Coaching Agent Interactions**|Taedong Yun et.al.|[2502.13135v1](http://arxiv.org/abs/2502.13135v1)|null|
+|**2025-02-18**|**Learning to Defer for Causal Discovery with Imperfect Experts**|Oscar Clivio et.al.|[2502.13132v1](http://arxiv.org/abs/2502.13132v1)|null|
+|**2025-02-18**|**Rethinking Diverse Human Preference Learning through Principal Component Analysis**|Feng Luo et.al.|[2502.13131v1](http://arxiv.org/abs/2502.13131v1)|null|
+|**2025-02-18**|**Magma: A Foundation Model for Multimodal AI Agents**|Jianwei Yang et.al.|[2502.13130v1](http://arxiv.org/abs/2502.13130v1)|null|
+|**2025-02-18**|**SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation**|Zihan Liu et.al.|[2502.13128v1](http://arxiv.org/abs/2502.13128v1)|null|
+|**2025-02-18**|**Facilitating Long Context Understanding via Supervised Chain-of-Thought Reasoning**|Jingyang Lin et.al.|[2502.13127v1](http://arxiv.org/abs/2502.13127v1)|null|
+|**2025-02-18**|**RuozhiBench: Evaluating LLMs with Logical Fallacies and Misleading Premises**|Zenan Zhai et.al.|[2502.13125v1](http://arxiv.org/abs/2502.13125v1)|null|
+|**2025-02-18**|**NaturalReasoning: Reasoning in the Wild with 2.8M Challenging Questions**|Weizhe Yuan et.al.|[2502.13124v1](http://arxiv.org/abs/2502.13124v1)|null|
+|**2025-02-18**|**Adapting Psycholinguistic Research for LLMs: Gender-inclusive Language in a Coreference Context**|Marion Bartl et.al.|[2502.13120v1](http://arxiv.org/abs/2502.13120v1)|null|
+|**2025-02-18**|**STEER-ME: Assessing the Microeconomic Reasoning of Large Language Models**|Narun Raman et.al.|[2502.13119v1](http://arxiv.org/abs/2502.13119v1)|null|
+|**2025-02-18**|**Performance Evaluation of Large Language Models in Statistical Programming**|Xinyi Song et.al.|[2502.13117v1](http://arxiv.org/abs/2502.13117v1)|null|
+|**2025-02-18**|**Near-Optimal Private Learning in Linear Contextual Bandits**|Fan Chen et.al.|[2502.13115v1](http://arxiv.org/abs/2502.13115v1)|null|
+|**2025-02-18**|**The influence of motion features in temporal perception**|Rosa Illan Castillo et.al.|[2502.13114v1](http://arxiv.org/abs/2502.13114v1)|null|
+|**2025-02-18**|**Improving Clinical Question Answering with Multi-Task Learning: A Joint Approach for Answer Extraction and Medical Categorization**|Priyaranjan Pattnayak et.al.|[2502.13108v1](http://arxiv.org/abs/2502.13108v1)|null|
+|**2025-02-18**|**MatterChat: A Multi-Modal LLM for Material Science**|Yingheng Tang et.al.|[2502.13107v1](http://arxiv.org/abs/2502.13107v1)|null|
+|**2025-02-18**|**Understanding and Rectifying Safety Perception Distortion in VLMs**|Xiaohan Zou et.al.|[2502.13095v1](http://arxiv.org/abs/2502.13095v1)|null|
+|**2025-02-18**|**Text2World: Benchmarking Large Language Models for Symbolic World Model Generation**|Mengkang Hu et.al.|[2502.13092v1](http://arxiv.org/abs/2502.13092v1)|null|
+|**2025-02-18**|**KAPPA: A Generic Patent Analysis Framework with Keyphrase-Based Portraits**|Xin Xia et.al.|[2502.13076v1](http://arxiv.org/abs/2502.13076v1)|null|
+|**2025-02-18**|**Interactive Agents to Overcome Ambiguity in Software Engineering**|Sanidhya Vijayvargiya et.al.|[2502.13069v1](http://arxiv.org/abs/2502.13069v1)|null|
+|**2025-02-18**|**Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity**|Yuri Kuratov et.al.|[2502.13063v1](http://arxiv.org/abs/2502.13063v1)|null|
+|**2025-02-18**|**AI-Assisted Decision Making with Human Learning**|Gali Noti et.al.|[2502.13062v1](http://arxiv.org/abs/2502.13062v1)|null|
+|**2025-02-18**|**Improved Fine-Tuning of Large Multimodal Models for Hateful Meme Detection**|Jingbiao Mei et.al.|[2502.13061v1](http://arxiv.org/abs/2502.13061v1)|null|
+|**2025-02-18**|**SimpleVQA: Multimodal Factuality Evaluation for Multimodal Large Language Models**|Xianfu Cheng et.al.|[2502.13059v1](http://arxiv.org/abs/2502.13059v1)|null|
+|**2025-02-18**|**LAMD: Context-driven Android Malware Detection and Classification with LLMs**|Xingzhi Qian et.al.|[2502.13055v1](http://arxiv.org/abs/2502.13055v1)|null|
+|**2025-02-18**|**Do we still need Human Annotators? Prompting Large Language Models for Aspect Sentiment Quad Prediction**|Nils Constantin Hellwig et.al.|[2502.13044v1](http://arxiv.org/abs/2502.13044v1)|null|
+|**2025-02-18**|**Natural Language Generation from Visual Sequences: Challenges and Future Directions**|Aditya K Surikuchi et.al.|[2502.13034v1](http://arxiv.org/abs/2502.13034v1)|null|
+|**2025-02-18**|**HPSS: Heuristic Prompting Strategy Search for LLM Evaluators**|Bosi Wen et.al.|[2502.13031v1](http://arxiv.org/abs/2502.13031v1)|null|
+|**2025-02-18**|**Whose story is it? Personalizing story generation by inferring author styles**|Nischal Ashok Kumar et.al.|[2502.13028v1](http://arxiv.org/abs/2502.13028v1)|null|
+|**2025-02-18**|**Agentic Deep Graph Reasoning Yields Self-Organizing Knowledge Networks**|Markus J. Buehler et.al.|[2502.13025v1](http://arxiv.org/abs/2502.13025v1)|null|
+|**2025-02-18**|**Oreo: A Plug-in Context Reconstructor to Enhance Retrieval-Augmented Generation**|Sha Li et.al.|[2502.13019v1](http://arxiv.org/abs/2502.13019v1)|null|
+|**2025-02-18**|**LLM-Powered Proactive Data Systems**|Sepanta Zeighami et.al.|[2502.13016v1](http://arxiv.org/abs/2502.13016v1)|null|
+|**2025-02-18**|**Towards a Design Guideline for RPA Evaluation: A Survey of Large Language Model-Based Role-Playing Agents**|Chaoran Chen et.al.|[2502.13012v1](http://arxiv.org/abs/2502.13012v1)|null|
+|**2025-02-18**|**Adaptive Knowledge Graphs Enhance Medical Question Answering: Bridging the Gap Between LLMs and Evolving Medical Knowledge**|Mohammad Reza Rezaei et.al.|[2502.13010v1](http://arxiv.org/abs/2502.13010v1)|null|
+|**2025-02-18**|**Integrating Reinforcement Learning, Action Model Learning, and Numeric Planning for Tackling Complex Tasks**|Yarin Benyamin et.al.|[2502.13006v1](http://arxiv.org/abs/2502.13006v1)|null|
+|**2025-02-18**|**Language Barriers: Evaluating Cross-Lingual Performance of CNN and Transformer Architectures for Speech Quality Estimation**|Wafaa Wardah et.al.|[2502.13004v1](http://arxiv.org/abs/2502.13004v1)|null|
+|**2025-02-18**|**You need to MIMIC to get FAME: Solving Meeting Transcript Scarcity with a Multi-Agent Conversations**|Frederic Kirstein et.al.|[2502.13001v1](http://arxiv.org/abs/2502.13001v1)|null|
+|**2025-02-18**|**Personalized Top-k Set Queries Over Predicted Scores**|Sohrab Namazi Nia et.al.|[2502.12998v1](http://arxiv.org/abs/2502.12998v1)|null|
+|**2025-02-18**|**Eager Updates For Overlapped Communication and Computation in DiLoCo**|Satyen Kale et.al.|[2502.12996v1](http://arxiv.org/abs/2502.12996v1)|null|
+|**2025-02-18**|**Free Argumentative Exchanges for Explaining Image Classifiers**|Avinash Kori et.al.|[2502.12995v1](http://arxiv.org/abs/2502.12995v1)|null|
+|**2025-02-18**|**B-cos LM: Efficiently Transforming Pre-trained Language Models for Improved Explainability**|Yifan Wang et.al.|[2502.12992v1](http://arxiv.org/abs/2502.12992v1)|null|
+|**2025-02-18**|**Beyond Profile: From Surface-Level Facts to Deep Persona Simulation in LLMs**|Zixiao Wang et.al.|[2502.12988v1](http://arxiv.org/abs/2502.12988v1)|null|
+|**2025-02-18**|**PartSDF: Part-Based Implicit Neural Representation for Composite 3D Shape Parametrization and Optimization**|Nicolas Talabot et.al.|[2502.12985v1](http://arxiv.org/abs/2502.12985v1)|null|
+|**2025-02-18**|**Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs**|Longxu Dou et.al.|[2502.12982v1](http://arxiv.org/abs/2502.12982v1)|null|
+|**2025-02-18**|**Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking**|Junda Zhu et.al.|[2502.12970v1](http://arxiv.org/abs/2502.12970v1)|null|
+|**2025-02-18**|**A Survey of Text Classification Under Class Distribution Shift**|Adriana Valentina Costache et.al.|[2502.12965v1](http://arxiv.org/abs/2502.12965v1)|null|
+|**2025-02-18**|**Trust Me, I'm Wrong: High-Certainty Hallucinations in LLMs**|Adi Simhi et.al.|[2502.12964v1](http://arxiv.org/abs/2502.12964v1)|null|
+|**2025-02-18**|**Infinite Retrieval: Attention Enhanced LLMs in Long-Context Processing**|Xiaoju Ye et.al.|[2502.12962v1](http://arxiv.org/abs/2502.12962v1)|null|
+|**2025-02-18**|**Adaptive Tool Use in Large Language Models with Meta-Cognition Trigger**|Wenjun Li et.al.|[2502.12961v1](http://arxiv.org/abs/2502.12961v1)|null|
+|**2025-02-18**|**AlignFreeze: Navigating the Impact of Realignment on the Layers of Multilingual Models Across Diverse Languages**|Steve Bakos et.al.|[2502.12959v1](http://arxiv.org/abs/2502.12959v1)|null|
+|**2025-02-18**|**Task-Informed Anti-Curriculum by Masking Improves Downstream Performance on Text**|Andrei Jarca et.al.|[2502.12953v1](http://arxiv.org/abs/2502.12953v1)|null|
+|**2025-02-18**|**Fake It Till You Make It: Using Synthetic Data and Domain Knowledge for Improved Text-Based Learning for LGE Detection**|Athira J Jacob et.al.|[2502.12948v1](http://arxiv.org/abs/2502.12948v1)|null|
+|**2025-02-18**|**Every Expert Matters: Towards Effective Knowledge Distillation for Mixture-of-Experts Language Models**|Gyeongman Kim et.al.|[2502.12947v1](http://arxiv.org/abs/2502.12947v1)|null|
+|**2025-02-18**|**LLMPopcorn: An Empirical Study of LLMs as Assistants for Popular Micro-video Generation**|Junchen Fu et.al.|[2502.12945v1](http://arxiv.org/abs/2502.12945v1)|null|
+|**2025-02-18**|**Synthetic Data Generation for Culturally Nuanced Commonsense Reasoning in Low-Resource Languages**|Salsabila Zahirah Pranida et.al.|[2502.12932v1](http://arxiv.org/abs/2502.12932v1)|null|
+|**2025-02-18**|**Flow-of-Options: Diversified and Improved LLM Reasoning by Thinking Through Options**|Lakshmi Nair et.al.|[2502.12929v1](http://arxiv.org/abs/2502.12929v1)|null|
+|**2025-02-18**|**Finedeep: Mitigating Sparse Activation in Dense LLMs via Multi-Layer Fine-Grained Experts**|Leiyu Pan et.al.|[2502.12928v1](http://arxiv.org/abs/2502.12928v1)|null|
+|**2025-02-18**|**SEFL: Harnessing Large Language Model Agents to Improve Educational Feedback Systems**|Mike Zhang et.al.|[2502.12927v1](http://arxiv.org/abs/2502.12927v1)|null|
+|**2025-02-18**|**Towards more Contextual Agents: An extractor-Generator Optimization Framework**|Mourad Aouini et.al.|[2502.12926v1](http://arxiv.org/abs/2502.12926v1)|null|
+|**2025-02-18**|**Keep what you need : extracting efficient subnetworks from large audio representation models**|David Genova et.al.|[2502.12925v1](http://arxiv.org/abs/2502.12925v1)|null|
+|**2025-02-18**|**Conditioning LLMs to Generate Code-Switched Text: A Methodology Grounded in Naturally Occurring Data**|Maite Heredia et.al.|[2502.12924v1](http://arxiv.org/abs/2502.12924v1)|null|
+|**2025-02-18**|**On-Device LLMs for Home Assistant: Dual Role in Intent Detection and Response Generation**|Rune Birkmose et.al.|[2502.12923v1](http://arxiv.org/abs/2502.12923v1)|null|
+|**2025-02-18**|**Q-STRUM Debate: Query-Driven Contrastive Summarization for Recommendation Comparison**|George-Kirollos Saad et.al.|[2502.12921v1](http://arxiv.org/abs/2502.12921v1)|null|
+|**2025-02-18**|**GSQ-Tuning: Group-Shared Exponents Integer in Fully Quantized Training for LLMs On-Device Fine-tuning**|Sifan Zhou et.al.|[2502.12913v1](http://arxiv.org/abs/2502.12913v1)|null|
+|**2025-02-18**|**Knapsack Optimization-based Schema Linking for LLM-based Text-to-SQL Generation**|Zheng Yuan et.al.|[2502.12911v1](http://arxiv.org/abs/2502.12911v1)|null|
+|**2025-02-18**|**Graph Neural Networks for Databases: A Survey**|Ziming Li et.al.|[2502.12908v1](http://arxiv.org/abs/2502.12908v1)|null|
+|**2025-02-18**|**Fraud-R1 : A Multi-Round Benchmark for Assessing the Robustness of LLM Against Augmented Fraud and Phishing Inducements**|Shu Yang et.al.|[2502.12904v1](http://arxiv.org/abs/2502.12904v1)|null|
+|**2025-02-18**|**Soundwave: Less is More for Speech-Text Alignment in LLMs**|Yuhao Zhang et.al.|[2502.12900v1](http://arxiv.org/abs/2502.12900v1)|null|
+|**2025-02-18**|**None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks**|Eva Sánchez Salido et.al.|[2502.12896v1](http://arxiv.org/abs/2502.12896v1)|null|
+|**2025-02-18**|**Multilingual European Language Models: Benchmarking Approaches and Challenges**|Fabio Barth et.al.|[2502.12895v1](http://arxiv.org/abs/2502.12895v1)|null|
+|**2025-02-18**|**H-CoT: Hijacking the Chain-of-Thought Safety Reasoning Mechanism to Jailbreak Large Reasoning Models, Including OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking**|Martin Kuo et.al.|[2502.12893v1](http://arxiv.org/abs/2502.12893v1)|null|
+|**2025-02-18**|**Are Multilingual Language Models an Off-ramp for Under-resourced Languages? Will we arrive at Digital Language Equality in Europe in 2030?**|Georg Rehm et.al.|[2502.12886v1](http://arxiv.org/abs/2502.12886v1)|null|
+|**2025-02-18**|**How desirable is alignment between LLMs and linguistically diverse human users?**|Pia Knoeferle et.al.|[2502.12884v1](http://arxiv.org/abs/2502.12884v1)|null|
+|**2025-02-18**|**Continuous Learning Conversational AI: A Personalized Agent Framework via A2C Reinforcement Learning**|Nandakishor M et.al.|[2502.12876v1](http://arxiv.org/abs/2502.12876v1)|null|
+|**2025-02-18**|**PAFT: Prompt-Agnostic Fine-Tuning**|Chenxing Wei et.al.|[2502.12859v1](http://arxiv.org/abs/2502.12859v1)|null|
+|**2025-02-18**|**Rejected Dialects: Biases Against African American Language in Reward Models**|Joel Mire et.al.|[2502.12858v1](http://arxiv.org/abs/2502.12858v1)|null|
+|**2025-02-18**|**Integrating Arithmetic Learning Improves Mathematical Reasoning in Smaller Models**|Neeraj Gangwar et.al.|[2502.12855v1](http://arxiv.org/abs/2502.12855v1)|null|
+|**2025-02-18**|**S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning**|Ruotian Ma et.al.|[2502.12853v1](http://arxiv.org/abs/2502.12853v1)|null|
+|**2025-02-18**|**MVL-SIB: A Massively Multilingual Vision-Language Benchmark for Cross-Modal Topical Matching**|Fabian David Schmidt et.al.|[2502.12852v1](http://arxiv.org/abs/2502.12852v1)|null|
+|**2025-02-18**|**MeMo: Towards Language Models with Associative Memory Mechanisms**|Fabio Massimo Zanzotto et.al.|[2502.12851v1](http://arxiv.org/abs/2502.12851v1)|null|
+|**2025-02-18**|**Towards Adaptive Feedback with AI: Comparing the Feedback Quality of LLMs and Teachers on Experimentation Protocols**|Kathrin Seßler et.al.|[2502.12842v1](http://arxiv.org/abs/2502.12842v1)|null|
+|**2025-02-18**|**Towards Equitable AI: Detecting Bias in Using Large Language Models for Marketing**|Berk Yilmaz et.al.|[2502.12838v1](http://arxiv.org/abs/2502.12838v1)|null|
+|**2025-02-18**|**An LLM-Powered Agent for Physiological Data Analysis: A Case Study on PPG-based Heart Rate Estimation**|Mohammad Feli et.al.|[2502.12836v1](http://arxiv.org/abs/2502.12836v1)|null|
+|**2025-02-18**|**Subword models struggle with word learning, but surprisal hides it**|Bastian Bunzeck et.al.|[2502.12835v1](http://arxiv.org/abs/2502.12835v1)|null|
+|**2025-02-18**|**KazMMLU: Evaluating Language Models on Kazakh, Russian, and Regional Knowledge of Kazakhstan**|Mukhammed Togmanov et.al.|[2502.12829v1](http://arxiv.org/abs/2502.12829v1)|null|
+|**2025-02-18**|**Reasoning and the Trusting Behavior of DeepSeek and GPT: An Experiment Revealing Hidden Fault Lines in Large Language Models**|Rubing Lu et.al.|[2502.12825v1](http://arxiv.org/abs/2502.12825v1)|null|
+|**2025-02-18**|**Pitfalls of Scale: Investigating the Inverse Task of Redefinition in Large Language Models**|Elena Stringli et.al.|[2502.12821v1](http://arxiv.org/abs/2502.12821v1)|null|
+|**2025-02-18**|**Simulating User Diversity in Task-Oriented Dialogue Systems using Large Language Models**|Adnan Ahmad et.al.|[2502.12813v1](http://arxiv.org/abs/2502.12813v1)|null|
+|**2025-02-18**|**Towards Text-Image Interleaved Retrieval**|Xin Zhang et.al.|[2502.12799v1](http://arxiv.org/abs/2502.12799v1)|null|
+|**2025-02-18**|**Envious Explore and Exploit**|Omer Ben-Porat et.al.|[2502.12798v1](http://arxiv.org/abs/2502.12798v1)|null|
+|**2025-02-18**|**Commonsense Reasoning in Arab Culture**|Abdelrahman Sadallah et.al.|[2502.12788v1](http://arxiv.org/abs/2502.12788v1)|null|
+|**2025-02-18**|**VidCapBench: A Comprehensive Benchmark of Video Captioning for Controllable Text-to-Video Generation**|Xinlong Chen et.al.|[2502.12782v1](http://arxiv.org/abs/2502.12782v1)|null|
+|**2025-02-18**|**Portable Reward Tuning: Towards Reusable Fine-Tuning across Different Pretrained Models**|Daiki Chijiwa et.al.|[2502.12776v1](http://arxiv.org/abs/2502.12776v1)|null|
+|**2025-02-18**|**Mind the Gap: Aligning the Brain with Language Models Requires a Nonlinear and Multimodal Approach**|Danny Dongyeop Han et.al.|[2502.12771v1](http://arxiv.org/abs/2502.12771v1)|null|
+|**2025-02-18**|**How Much Do LLMs Hallucinate across Languages? On Multilingual Estimation of LLM Hallucination in the Wild**|Saad Obaid ul Islam et.al.|[2502.12769v1](http://arxiv.org/abs/2502.12769v1)|null|
+|**2025-02-18**|**R2-KG: General-Purpose Dual-Agent Framework for Reliable Reasoning on Knowledge Graphs**|Sumin Jo et.al.|[2502.12767v1](http://arxiv.org/abs/2502.12767v1)|null|
 
 #### Abstracts
-##### **Diffusion Models without Classifier-free Guidance**
-2502.12154v1 by Zhicong Tang, Jianmin Bao, Dong Chen, Baining Guo
-
-This paper presents Model-guidance (MG), a novel objective for training
-diffusion models that removes the need for the commonly used Classifier-free
-guidance (CFG). Our innovative approach transcends the standard modeling of
-solely data distribution to incorporate the posterior probability of
-conditions. The proposed technique originates from the idea of CFG and is easy
-yet effective, making it a plug-and-play module for existing models. Our method
-significantly accelerates the training process, doubles the inference speed,
-and achieves exceptional quality that parallels and even surpasses concurrent
-diffusion models with CFG. Extensive experiments demonstrate its effectiveness,
-efficiency, and scalability on different models and datasets. Finally, we establish
-state-of-the-art performance on ImageNet 256 benchmarks with an FID of 1.34.
-Our code is available at https://github.com/tzco/Diffusion-wo-CFG.
-
-摘要：本文提出模型指導 (MG)，一種用於訓練擴散模型的新目標，它解決並消除了常用的無分類器指導 (CFG)。我們的創新方法超越了僅數據分佈的標準建模，並納入了條件的後驗機率。提議的技術源自 CFG 的概念，既簡單又有效，使其成為現有模型的即插即用模組。我們的技術顯著加速了訓練過程，將推論速度提高了一倍，並取得了與 CFG 並行甚至超越並行擴散模型的出色品質。廣泛的實驗證明了該技術在不同模型和資料集上的有效性、效率和可擴充性。最後，我們在 ImageNet 256 基準上建立了最先進的效能，FID 為 1.34。我們的程式碼可在 https://github.com/tzco/Diffusion-wo-CFG 取得。
-
-##### **Idiosyncrasies in Large Language Models**
-2502.12150v1 by Mingjie Sun, Yida Yin, Zhiqiu Xu, J. Zico Kolter, Zhuang Liu
-
-In this work, we unveil and study idiosyncrasies in Large Language Models
-(LLMs) -- unique patterns in their outputs that can be used to distinguish the
-models. To do so, we consider a simple classification task: given a particular
-text output, the objective is to predict the source LLM that generates the
-text. We evaluate this synthetic task across various groups of LLMs and find
-that simply fine-tuning existing text embedding models on LLM-generated texts
-yields excellent classification accuracy. Notably, we achieve 97.1% accuracy on
-held-out validation data in the five-way classification problem involving
-ChatGPT, Claude, Grok, Gemini, and DeepSeek. Our further investigation reveals
-that these idiosyncrasies are rooted in word-level distributions. These
-patterns persist even when the texts are rewritten, translated, or summarized
-by an external LLM, suggesting that they are also encoded in the semantic
-content. Additionally, we leverage LLMs as judges to generate detailed,
-open-ended descriptions of each model's idiosyncrasies. Finally, we discuss the
-broader implications of our findings, particularly for training on synthetic
-data and inferring model similarity. Code is available at
-https://github.com/locuslab/llm-idiosyncrasies.
-
-摘要：在這項工作中，我們揭示並研究了大型語言模型 (LLM) 中的特殊性，也就是其輸出中可區分模型的獨特模式。為此，我們考慮了一項簡單的分類任務：給定一個特定文本輸出，目標是預測產生該文本的來源 LLM。我們在各種 LLM 組合中評估這個合成任務，並發現僅微調現有的文本嵌入模型在 LLM 生成的文本上即可產生極佳的分類準確度。值得注意的是，在涉及 ChatGPT、Claude、Grok、Gemini 和 DeepSeek 的五向分類問題中，我們在留存驗證資料上達到了 97.1% 的準確度。我們的進一步調查顯示，這些特殊性根植於詞彙層級的分布。即使文本是由外部 LLM 改寫、翻譯或摘要，這些模式仍然存在，這表明它們也編碼在語義內容中。此外，我們利用 LLM 作為評審，為每個模型的特殊性產生詳細、開放式的描述。最後，我們討論了我們發現的更廣泛含意，特別是對於合成資料的訓練和推斷模型相似性。程式碼可在 https://github.com/locuslab/llm-idiosyncrasies 取得。
-
-##### **HARBOR: Exploring Persona Dynamics in Multi-Agent Competition**
-2502.12149v1 by Kenan Jiang, Li Xiong, Fei Liu
-
-We investigate factors contributing to LLM agents' success in competitive
-multi-agent environments, using auctions as a testbed where agents bid to
-maximize profit. The agents are equipped with bidding domain knowledge,
-distinct personas that reflect item preferences, and a memory of auction
-history. Our work extends the classic auction scenario by creating a realistic
-environment where multiple agents bid on houses, weighing aspects such as size,
-location, and budget to secure the most desirable homes at the lowest prices.
-Particularly, we investigate three key questions: (a) How does a persona
-influence an agent's behavior in a competitive setting? (b) Can an agent
-effectively profile its competitors' behavior during auctions? (c) How can
-persona profiling be leveraged to create an advantage using strategies such as
-theory of mind? Through a series of experiments, we analyze the behaviors of
-LLM agents and shed light on new findings. Our testbed, called HARBOR, offers a
-valuable platform for deepening our understanding of multi-agent workflows in
-competitive environments.
-
-摘要：我們研究促成 LLM 代理在競爭性多代理環境中成功的因素，使用拍賣作為測試平台，其中代理出價以最大化利潤。這些代理配備了競標領域知識、反映物品偏好的不同角色以及拍賣歷史的記憶。我們的研究透過創造一個現實的環境來擴展經典的拍賣場景，在該環境中，多個代理對房屋出價，權衡大小、位置和預算等方面以最低價格確保最理想的房屋。特別是，我們研究了三個關鍵問題：(a) 角色如何在競爭環境中影響代理的行為？(b) 代理是否可以在拍賣期間有效地分析其競爭對手的行為？(c) 如何利用角色分析來利用心智理論等策略創造優勢？透過一系列實驗，我們分析 LLM 代理的行為並闡明新的發現。我們的測試平台稱為 HARBOR，它提供了一個有價值的平台，用於加深我們對競爭環境中多代理工作流程的理解。
-
-##### **Fast or Better? Balancing Accuracy and Cost in Retrieval-Augmented Generation with Flexible User Control**
-2502.12145v1 by Jinyan Su, Jennifer Healey, Preslav Nakov, Claire Cardie
-
-Retrieval-Augmented Generation (RAG) has emerged as a powerful approach to
-mitigate large language model (LLM) hallucinations by incorporating external
-knowledge retrieval. However, existing RAG frameworks often apply retrieval
-indiscriminately, leading to inefficiencies: over-retrieving when unnecessary or
-failing to retrieve iteratively when required for complex reasoning. Recent
-adaptive retrieval strategies, though they adaptively navigate these retrieval
-strategies, predict based only on query complexity and lack user-driven
-flexibility, making them infeasible for diverse user application needs. In this
-paper, we introduce a novel user-controllable RAG framework that enables
-dynamic adjustment of the accuracy-cost trade-off. Our approach leverages two
-classifiers: one trained to prioritize accuracy and another to prioritize
-retrieval efficiency. Via an interpretable control parameter $\alpha$, users
-can seamlessly navigate between minimal-cost retrieval and high-accuracy
-retrieval based on their specific requirements. We empirically demonstrate that
-our approach effectively balances accuracy, retrieval cost, and user
-controllability, making it a practical and adaptable solution for real-world
-applications.
-
-摘要：檢索增強生成 (RAG) 已成為一種強大的方法，可透過整合外部知識檢索來減輕大型語言模型 (LLM) 的幻覺。然而，現有的 RAG 框架經常不加區別地應用檢索，導致低效率，在不必要時過度檢索，或在複雜推理時無法反覆檢索。最近的自適應檢索策略，儘管自適應地導航這些檢索策略，但僅根據查詢複雜性進行預測，並且缺乏使用者驅動的靈活性，這使得它們無法滿足多樣化的使用者應用需求。在本文中，我們引入了一個新穎的使用者可控制 RAG 框架，它可以動態調整準確度成本權衡。我們的做法利用兩個分類器：一個訓練用於優先考慮準確度，另一個用於優先考慮檢索效率。透過可解釋的控制參數 $\alpha$，使用者可以在最低成本檢索和基於其特定需求的高準確度檢索之間無縫導航。我們通過實證證明，我們的做法有效地平衡了準確度、檢索成本和使用者可控性，使其成為現實世界應用中實用且適應性強的解決方案。
-
-##### **Small Models Struggle to Learn from Strong Reasoners**
-2502.12143v1 by Yuetai Li, Xiang Yue, Zhangchen Xu, Fengqing Jiang, Luyao Niu, Bill Yuchen Lin, Bhaskar Ramasubramanian, Radha Poovendran
-
-Large language models (LLMs) excel in complex reasoning tasks, and distilling
-their reasoning capabilities into smaller models has shown promise. However, we
-uncover an interesting phenomenon, which we term the Small Model Learnability
-Gap: small models ($\leq$3B parameters) do not consistently benefit from long
-chain-of-thought (CoT) reasoning or distillation from larger models. Instead,
-they perform better when fine-tuned on shorter, simpler reasoning chains that
-better align with their intrinsic learning capacity. To address this, we
-propose Mix Distillation, a simple yet effective strategy that balances
-reasoning complexity by combining long and short CoT examples or reasoning from
-both larger and smaller models. Our experiments demonstrate that Mix
-Distillation significantly improves small model reasoning performance compared
-to training on either data alone. These findings highlight the limitations of
-direct strong model distillation and underscore the importance of adapting
-reasoning complexity for effective reasoning capability transfer.
-
-摘要：大型語言模型 (LLM) 在複雜推理任務中表現出色，且將其推理能力提煉成較小的模型已展現前景。然而，我們發現了一個有趣的現象，我們稱之為小型模型可學習性差距：小型模型（參數數目 ≤ 3B）並非總能從大型模型的長鏈條思考 (CoT) 推理或提煉中受益。相反地，當針對較短、較簡單的推理鏈進行微調時，它們的表現會更好，而這更符合其內在學習能力。為了解決此問題，我們提出混合提煉，這是一種簡單但有效的策略，透過結合長短 CoT 範例或從較大及較小模型進行推理，來平衡推理的複雜性。我們的實驗證明，與僅針對任一資料進行訓練相比，混合提煉顯著改善了小型模型的推理效能。這些發現突顯了直接強模型提煉的限制，並強調了調整推理複雜性以有效轉移推理能力的重要性。
-
-##### **SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs**
-2502.12134v1 by Yige Xu, Xu Guo, Zhiwei Zeng, Chunyan Miao
-
-Chain-of-Thought (CoT) reasoning enables Large Language Models (LLMs) to
-solve complex reasoning tasks by generating intermediate reasoning steps.
-However, most existing approaches focus on hard token decoding, which
-constrains reasoning within the discrete vocabulary space and may not always be
-optimal. While recent efforts explore continuous-space reasoning, they often
-suffer from catastrophic forgetting, limiting their applicability to
-state-of-the-art LLMs that already perform well in zero-shot settings with a
-proper instruction. To address this challenge, we propose a novel approach for
-continuous-space reasoning that does not require modifying the underlying LLM.
-Specifically, we employ a lightweight assistant model to generate
-instance-specific soft thought tokens speculatively as the initial chain of
-thoughts, which are then mapped into the LLM's representation space via a
-projection module. Experimental results on five reasoning benchmarks
-demonstrate that our method enhances LLM reasoning performance through
-supervised, parameter-efficient fine-tuning.
-
-摘要：鏈式思考 (CoT) 推理讓大型語言模型 (LLM) 能夠透過產生中間推理步驟來解決複雜的推理任務。然而，現有的大多數方法都專注於硬標記解碼，這會將推理限制在離散的詞彙空間內，而且可能並非總是最佳。雖然最近的研究探索了連續空間推理，但它們經常會遭遇災難性遺忘，這限制了它們在零次學習設置中表現良好的最先進 LLM 的適用性，且需要適當的說明。為了應對這項挑戰，我們提出了一種創新的連續空間推理方法，不需要修改底層的 LLM。具體來說，我們採用一個輕量級的輔助模型來產生特定於實例的軟思考標記，作為思考的初始鏈，然後透過投影模組將它們映射到 LLM 的表示空間。在五個推理基準上的實驗結果表明，我們的模型透過監督式、參數高效的微調，增強了 LLM 的推理效能。
-
-##### **Transformer Dynamics: A neuroscientific approach to interpretability of large language models**
-2502.12131v1 by Jesseba Fernando, Grigori Guitchounts
-
-As artificial intelligence models have exploded in scale and capability,
-understanding of their internal mechanisms remains a critical challenge.
-Inspired by the success of dynamical systems approaches in neuroscience, here
-we propose a novel framework for studying computations in deep learning
-systems. We focus on the residual stream (RS) in transformer models,
-conceptualizing it as a dynamical system evolving across layers. We find that
-activations of individual RS units exhibit strong continuity across layers,
-despite the RS being a non-privileged basis. Activations in the RS accelerate
-and grow denser over layers, while individual units trace unstable periodic
-orbits. In reduced-dimensional spaces, the RS follows a curved trajectory with
-attractor-like dynamics in the lower layers. These insights bridge dynamical
-systems theory and mechanistic interpretability, establishing a foundation for
-a "neuroscience of AI" that combines theoretical rigor with large-scale data
-analysis to advance our understanding of modern neural networks.
-
-摘要：隨著人工智慧模型在規模和能力上爆炸式增長，
-理解其內部機制仍然是一項嚴峻的挑戰。
-受到神經科學中動力系統方法成功的啟發，我們在此
-提出了一個新的框架來研究深度學習系統中的運算。我們專注於Transformer模型中的殘差流 (RS)，
-將其概念化為一個跨層演化的動態系統。我們發現
-儘管 RS 不是一個特權基礎，但個別 RS 單元的激活在各層之間表現出很強的連續性。RS 中的激活
-隨著層數的增加而加速並變得更密集，而個別單元則追蹤不穩定的週期
-軌道。在降維空間中，RS 遵循一個曲線軌跡，在較低層中具有類吸引子的動力學。這些見解橋接了動力
-系統理論和機制可解釋性，為「AI 神經科學」奠定了基礎，結合了理論嚴謹性和大規模數據
-分析，以增進我們對現代神經網路的理解。
-
-##### **Scaling Autonomous Agents via Automatic Reward Modeling And Planning**
-2502.12130v1 by Zhenfang Chen, Delin Chen, Rui Sun, Wenjun Liu, Chuang Gan
-
-Large language models (LLMs) have demonstrated remarkable capabilities across
-a range of text-generation tasks. However, LLMs still struggle with problems
-requiring multi-step decision-making and environmental feedback, such as online
-shopping, scientific reasoning, and mathematical problem-solving. Unlike pure
-text data, collecting large-scale decision-making data is challenging.
-Moreover, many powerful LLMs are only accessible through APIs, which hinders
-their fine-tuning for agent tasks due to cost and complexity. To address LLM
-agents' limitations, we propose a framework that can automatically learn a
-reward model from the environment without human annotations. This model can be
-used to evaluate the action trajectories of LLM agents and provide heuristics
-for task planning. Specifically, our approach involves employing one LLM-based
-agent to navigate an environment randomly, generating diverse action
-trajectories. Subsequently, a separate LLM is leveraged to assign a task intent
-and synthesize a negative response alongside the correct response for each
-trajectory. These triplets (task intent, positive response, and negative
-response) are then utilized as training data to optimize a reward model capable
-of scoring action trajectories. The effectiveness and generalizability of our
-framework are demonstrated through evaluations conducted on different agent
-benchmarks. In conclusion, our proposed framework represents a significant
-advancement in enhancing LLM agents' decision-making capabilities. By
-automating the learning of reward models, we overcome the challenges of data
-scarcity and API limitations, potentially revolutionizing the application of
-LLMs in complex and interactive environments. This research paves the way for
-more sophisticated AI agents capable of tackling a wide range of real-world
-problems requiring multi-step decision-making.
-
-摘要：大型語言模型 (LLM) 已在各種文字生成任務中展示出非凡的能力。然而，LLM 仍然在需要多步驟決策制定和環境回饋的問題上苦苦掙扎，例如網上購物、科學推理和數學問題求解。與純文本數據不同，收集大規模決策制定數據具有挑戰性。此外，許多強大的 LLM 只能通過 API 訪問，這由於成本和複雜性而阻礙了它們對代理任務的微調。為了解決 LLM 代理的局限性，我們提出了一個框架，該框架可以從環境中自動學習獎勵模型，而無需人工註釋。此模型可用于評估 LLM 代理的動作軌跡並為任務規劃提供啟發式方法。具體來說，我們的方法涉及使用一個基於 LLM 的代理隨機導航環境，生成不同的動作軌跡。隨後，利用一個單獨的 LLM 為每個軌跡分配任務意圖並合成一個負面響應以及正確的響應。然後將這些三元組（任務意圖、正面響應和負面響應）用作訓練數據，以優化能夠評分動作軌跡的獎勵模型。我們框架的有效性和普遍性通過在不同代理基準上進行的評估得到證明。總之，我們提出的框架代表了加強 LLM 代理決策能力的重大進步。通過自動化獎勵模型的學習，我們克服了數據稀缺和 API 限制的挑戰，有可能徹底改變 LLM 在複雜和互動環境中的應用。這項研究為更複雜的 AI 代理鋪平了道路，這些代理能夠解決需要多步驟決策制定的大量現實世界問題。
-
-##### **LaM-SLidE: Latent Space Modeling of Spatial Dynamical Systems via Linked Entities**
-2502.12128v1 by Florian Sestak, Artur Toshev, Andreas Fürst, Günter Klambauer, Andreas Mayr, Johannes Brandstetter
-
-Generative models are spearheading recent progress in deep learning, showing
-strong promise for trajectory sampling in dynamical systems as well. However,
-while latent space modeling paradigms have transformed image and video
-generation, similar approaches are more difficult for most dynamical systems.
-Such systems -- from chemical molecule structures to collective human behavior
--- are described by interactions of entities, making them inherently linked to
-connectivity patterns and the traceability of entities over time. Our approach,
-LaM-SLidE (Latent Space Modeling of Spatial Dynamical Systems via Linked
-Entities), combines the advantages of graph neural networks, i.e., the
-traceability of entities across time-steps, with the efficiency and scalability
-of recent advances in image and video generation, where the pre-trained encoder and
-decoder are frozen to enable generative modeling in the latent space. The core
-idea of LaM-SLidE is to introduce identifier representations (IDs) to allow for
-retrieval of entity properties, e.g., entity coordinates, from latent system
-representations, thus enabling traceability. Experimentally, across different
-domains, we show that LaM-SLidE performs favorably in terms of speed, accuracy,
-and generalizability. (Code is available at
-https://github.com/ml-jku/LaM-SLidE)
-
-摘要：生成模型引領深度學習的最新進展，也展現出在動態系統中進行軌跡取樣的強大前景。然而，儘管潛在空間建模範例已轉變圖像和影片生成，但對於大多數動態系統來說，類似的做法較為困難。此類系統（從化學分子結構到人類集體行為）由實體的交互作用所描述，使它們與連接模式和實體隨時間的追溯性產生固有聯繫。我們的做法 LaM-SLidE（透過連結實體進行空間動態系統的潛在空間建模）結合圖形神經網路的優點，亦即跨時間步長的實體追溯性，以及圖像和影片生成中近期進展的高效率和可擴充性，其中預先訓練的編碼器和解碼器被凍結以在潛在空間中啟用生成模型。LaM-SLidE 的核心概念是導入識別符號表示（ID），以允許從潛在系統表示中擷取實體屬性（例如實體座標），從而實現追溯性。透過不同領域的實驗，我們證明 LaM-SLidE 在速度、準確度和可概括性方面表現良好。（程式碼可在 https://github.com/ml-jku/LaM-SLidE 取得）
-
-##### **On the Query Complexity of Verifier-Assisted Language Generation**
-2502.12123v1 by Edoardo Botta, Yuchen Li, Aashay Mehta, Jordan T. Ash, Cyril Zhang, Andrej Risteski
-
-Recently, a plethora of works have proposed inference-time algorithms (e.g.
-best-of-n), which incorporate verifiers to assist the generation process. Their
-quality-efficiency trade-offs have been empirically benchmarked on a variety of
-constrained generation tasks, but the algorithmic design landscape is still
-poorly understood. In this paper, we develop a mathematical framework
-for reasoning about constrained generation using a pre-trained language model
-generator oracle and a process verifier--which can decide whether a prefix can
-be extended to a string which satisfies the constraints of choice. We show that
-even in very simple settings, access to a verifier can turn an intractable
-problem (information-theoretically or computationally) into a tractable one. In
-fact, we show even simple algorithms, like tokenwise rejection sampling, can
-enjoy significant benefits from access to a verifier. Empirically, we show that
-a natural modification of tokenwise rejection sampling, in which the sampler is
-allowed to "backtrack" (i.e., erase the final few generated tokens) has robust
-and substantive benefits over natural baselines (e.g. (blockwise) rejection
-sampling, nucleus sampling)--in terms of computational efficiency,
-accuracy, and diversity.
-
-摘要：最近，许多作品提出了推理时间算法（例如 best-of-n），其中包含验证器以协助生成过程。它们的质量效率权衡已在各种受限生成任务中得到经验基准测试，但算法设计格局仍然很大程度上难以理解。在本文中，我们开发了一个数学框架，用于使用预训练语言模型生成器预言机和过程验证器推理受限生成——它可以决定是否可以将前缀扩展为满足选择约束的字符串。我们表明，即使在非常简单的设置中，访问验证器也可以将一个棘手的问题（信息论或计算）转换为一个易处理的问题。事实上，我们表明即使是简单的算法，如逐个标记拒绝采样，也可以从访问验证器中受益匪浅。凭经验，我们表明逐个标记拒绝采样的自然修改，其中允许采样器“回溯”（即，擦除最后几个生成的标记）比自然基线（例如（按块）拒绝采样、核采样）具有强大而实质性的优势——无论是在计算效率、准确性还是多样性方面。
-
-##### **LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws**
-2502.12120v1 by Prasanna Mayilvahanan, Thaddäus Wiedemer, Sayak Mallick, Matthias Bethge, Wieland Brendel
-
-Scaling laws guide the development of large language models (LLMs) by
-offering estimates for the optimal balance of model size, tokens, and compute.
-More recently, loss-to-loss scaling laws that relate losses across pretraining
-datasets and downstream tasks have emerged as a powerful tool for understanding
-and improving LLM performance. In this work, we investigate which factors most
-strongly influence loss-to-loss scaling. Our experiments reveal that the
-pretraining data and tokenizer determine the scaling trend. In contrast, model
-size, optimization hyperparameters, and even significant architectural
-differences, such as between transformer-based models like Llama and
-state-space models like Mamba, have limited impact. Consequently, practitioners
-should carefully curate suitable pretraining datasets for optimal downstream
-performance, while architectures and other settings can be freely optimized for
-training efficiency.
-
-摘要：規模化定律透過提供模型大小、符號和運算的最佳平衡估計，引導大型語言模型 (LLM) 的開發。最近，與預訓練資料集和下游任務相關的損失到損失縮放定律已成為了解和改善 LLM 效能的強大工具。在這項工作中，我們探討哪些因素最能影響損失到損失縮放。我們的實驗顯示，預訓練資料和分詞器會決定縮放趨勢。相反地，模型大小、最佳化超參數，甚至重大的架構差異（例如基於Transformer的模型，如 Llama，和狀態空間模型，如 Mamba 之間的差異）影響有限。因此，從業人員應仔細策劃適當的預訓練資料集以獲得最佳的下游效能，而架構和其他設定可以自由最佳化以提升訓練效率。
-
-##### **PRISM: Self-Pruning Intrinsic Selection Method for Training-Free Multimodal Data Selection**
-2502.12119v1 by Jinhe Bi, Yifan Wang, Danqi Yan, Xun Xiao, Artur Hecker, Volker Tresp, Yunpu Ma
-
-Visual instruction tuning refines pre-trained Multimodal Large Language
-Models (MLLMs) to enhance their real-world task performance. However, the rapid
-expansion of visual instruction datasets introduces significant data
-redundancy, leading to excessive computational costs. Existing data selection
-methods predominantly rely on proxy models or loss-based metrics, both of which
-impose substantial computational overheads due to the necessity of model
-inference and backpropagation. To address this challenge, we propose PRISM, a
-novel training-free approach for efficient multimodal data selection. Unlike
-existing methods, PRISM eliminates the reliance on proxy models, warm-up
-pretraining, and gradient-based optimization. Instead, it leverages Pearson
-correlation analysis to quantify the intrinsic visual encoding properties of
-MLLMs, computing a task-specific correlation score to identify high-value
-instances. This not only enables data-efficient selection, but also maintains the
-original performance. Empirical evaluations across multiple MLLMs demonstrate
-that PRISM reduces the overall time required for visual instruction tuning and
-data selection to just 30% of conventional methods, while surpassing fully
-fine-tuned models across eight multimodal and three language understanding
-benchmarks, achieving a 101.7% relative improvement in final performance.
-
-摘要：視覺指令調整優化預先訓練的多模態大型語言模型 (MLLM)，以增強其真實世界的任務表現。然而，視覺指令資料集的快速擴展引入了顯著的資料冗餘，導致過度的運算成本。現有的資料選取方法主要依賴於代理模型或基於損失的指標，這兩者由於模型推理和反向傳播的必要性而造成大量的運算負擔。為了應對這一挑戰，我們提出了 PRISM，一種用於高效多模態資料選取的新型無訓練方法。與現有方法不同，PRISM 消除了對代理模型、熱身預訓練和基於梯度的優化的依賴。相反，它利用 Pearson 相關分析來量化 MLLM 的內在視覺編碼特性，計算特定任務相關性分數以識別高價值實例。這不僅能選擇資料效率，而且能保持原始效能。跨多個 MLLM 的經驗評估表明，PRISM 將視覺指令調整和資料選取所需的總時間減少到傳統方法的 30%，同時在八個多模態和三個語言理解基準中超越了完全微調的模型，在最終效能上實現了 101.7% 的相對改進。
-
-##### **Scaling Test-Time Compute Without Verification or RL is Suboptimal**
-2502.12118v1 by Amrith Setlur, Nived Rajaraman, Sergey Levine, Aviral Kumar
-
-Despite substantial advances in scaling test-time compute, an ongoing debate
-in the community is how it should be scaled up to enable continued and
-efficient improvements with scaling. There are largely two approaches: first,
-distilling successful search or thinking traces; and second, using verification
-(e.g., 0/1 outcome rewards, reward models, or verifiers) to guide reinforcement
-learning (RL) and search algorithms. In this paper, we prove that finetuning
-LLMs with verifier-based (VB) methods based on RL or search is far superior to
-verifier-free (VF) approaches based on distilling or cloning search traces,
-given a fixed amount of compute/data budget. Further, we show that as we scale
-test-time compute (measured as the output token length) and training data,
-suboptimality of VF methods scales poorly compared to VB when the base
-pre-trained LLM presents a heterogeneous distribution over correct solution
-traces (e.g., different lengths, styles, etc.) and admits a non-sharp
-distribution over rewards on traces sampled from it. We formalize this
-condition using anti-concentration [Erdős, 1945]. This implies a stronger
-result that VB methods scale better asymptotically, with the performance gap
-between VB and VF methods widening as test-time budget grows. We corroborate
-our theory empirically on both didactic and math reasoning problems with
-3/8/32B-sized pre-trained LLMs, where we find verification is crucial for
-scaling test-time compute.
-
-摘要：儘管在擴展測試時間計算方面取得了重大進展，但社群中持續的辯論是如何擴展它以持續有效地改善擴展。大致有兩種方法：首先，提煉成功的搜尋或思考軌跡；其次，使用驗證（例如，0/1 結果獎勵、獎勵模型或驗證器）來指導強化學習 (RL) 和搜尋演算法。在本文中，我們證明使用基於 RL 或搜尋的驗證器為基礎 (VB) 方法微調 LLM 遠優於基於提煉或複製搜尋軌跡的驗證器免費 (VF) 方法，給定固定數量的計算/資料預算。此外，我們表明，當我們擴展測試時間計算（以輸出標記長度衡量）和訓練資料時，與 VB 相比，VF 方法的次最佳性擴展效果不佳，當基礎預先訓練的 LLM 在正確的解決方案軌跡上呈現異質分佈（例如，不同的長度、樣式等）並承認從其中取樣的軌跡上獎勵的分佈不尖銳時。我們使用反集中 [Erd\H{o}s，1945] 將此條件形式化。這暗示了一個更強的結果，即 VB 方法在漸近上擴展得更好，VB 和 VF 方法之間的效能差距隨著測試時間預算的增加而擴大。我們在具有 3/8/32B 大小的預先訓練 LLM 的教學和數學推理問題上對我們的理論進行實證驗證，我們發現驗證對於擴展測試時間計算至關重要。
-
-##### **A-MEM: Agentic Memory for LLM Agents**
-2502.12110v1 by Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, Yongfeng Zhang
-
-While large language model (LLM) agents can effectively use external tools
-for complex real-world tasks, they require memory systems to leverage
-historical experiences. Current memory systems enable basic storage and
-retrieval but lack sophisticated memory organization, despite recent attempts
-to incorporate graph databases. Moreover, these systems' fixed operations and
-structures limit their adaptability across diverse tasks. To address this
-limitation, this paper proposes a novel agentic memory system for LLM agents
-that can dynamically organize memories in an agentic way. Following the basic
-principles of the Zettelkasten method, we designed our memory system to create
-interconnected knowledge networks through dynamic indexing and linking. When a
-new memory is added, we generate a comprehensive note containing multiple
-structured attributes, including contextual descriptions, keywords, and tags.
-The system then analyzes historical memories to identify relevant connections,
-establishing links where meaningful similarities exist. Additionally, this
-process enables memory evolution - as new memories are integrated, they can
-trigger updates to the contextual representations and attributes of existing
-historical memories, allowing the memory network to continuously refine its
-understanding. Our approach combines the structured organization principles of
-Zettelkasten with the flexibility of agent-driven decision making, allowing for
-more adaptive and context-aware memory management. Empirical experiments on six
-foundation models show superior performance over existing SOTA baselines.
-The source code is available at https://github.com/WujiangXu/AgenticMemory.
-
-摘要：大型語言模型 (LLM) 代理雖然能有效地使用外部工具來執行複雜的真實世界任務，但它們需要記憶體系統來利用歷史經驗。目前的記憶體系統能進行基本的儲存和檢索，但缺乏精密的記憶體組織，儘管最近嘗試納入圖形資料庫。此外，這些系統固定的運作和結構限制了它們在不同任務中的適應性。為了解決這個限制，本文提出了一種新的代理記憶體系統，供 LLM 代理動態地以代理的方式組織記憶體。遵循 Zettelkasten 方法的基本原則，我們設計我們的記憶體系統，透過動態索引和連結來建立相互連結的知識網路。當加入新的記憶體時，我們會產生包含多個結構化屬性的綜合筆記，包括脈絡描述、關鍵字和標籤。然後，系統會分析歷史記憶體以找出相關連結，在有意義的相似性時建立連結。此外，這個程序能讓記憶體演化，因為當整合新的記憶體時，它們會觸發對現有歷史記憶體的脈絡表示和屬性的更新，讓記憶體網路能持續精進它的理解。我們的做法結合了 Zettelkasten 的結構化組織原則和代理驅動決策制定的靈活性，能進行更具適應性和脈絡感知的記憶體管理。在六個基礎模型上的經驗實驗顯示出比現有的 SOTA 基準線有顯著的進步。原始碼可以在 https://github.com/WujiangXu/AgenticMemory 找到。
-
-##### **Personality Structured Interview for Large Language Model Simulation in Personality Research**
-2502.12109v1 by Pengda Wang, Huiqi Zou, Hanjie Chen, Tianjun Sun, Ziang Xiao, Frederick L. Oswald
-
-Although psychometrics researchers have recently explored the use of large
-language models (LLMs) as proxies for human participants, LLMs often fail to
-generate heterogeneous data with human-like diversity, which diminishes their
-value in advancing social science research. To address these challenges, we
-explored the potential of the theory-informed Personality Structured Interview
-(PSI) as a tool for simulating human responses in personality research. In this
-approach, the simulation is grounded in nuanced real-human interview
-transcripts that target the personality construct of interest. We have provided
-a growing set of 357 structured interview transcripts from a representative
-sample, each containing an individual's response to 32 open-ended questions
-carefully designed to gather theory-based personality evidence. Additionally,
-grounded in psychometric research, we have summarized an evaluation framework
-to systematically validate LLM-generated psychometric data. Results from three
-experiments demonstrate that well-designed structured interviews could improve
-human-like heterogeneity in LLM-simulated personality data and predict
-personality-related behavioral outcomes (i.e., organizational citizenship
-behaviors and counterproductive work behavior). We further discuss the role of
-theory-informed structured interviews in LLM-based simulation and outline a
-general framework for designing structured interviews to simulate human-like
-data for psychometric research.
-
-摘要：儘管心理測量研究人員最近已探討將大型語言模型 (LLM) 用作人類參與者的代理，但 LLM 經常無法產生具有類似人類多樣性的異質資料，這降低了它們在推進社會科學研究中的價值。為了應對這些挑戰，我們探討了理論知情的個性結構化訪談 (PSI) 作為模擬人格研究中人類反應的工具的潛力。在此方法中，模擬基於針對目標人格建構的細緻真實人類訪談記錄。我們提供了一組不斷增加的 357 個結構化訪談記錄，來自一個具代表性的樣本，每個記錄都包含個人對 32 個開放式問題的回答，這些問題經過仔細設計，用於收集基於理論的人格證據。此外，基於心理測量研究，我們總結了一個評估架構，以系統性驗證 LLM 生成的精神測量資料。三個實驗的結果表明，設計良好的結構化訪談可以改善 LLM 模擬的人格資料中類似人類的異質性，並預測與人格相關的行為結果（例如，組織公民行為和適得其反的工作行為）。我們進一步討論了理論知情的結構化訪談在基於 LLM 的模擬中的作用，並概述了一個通用框架，用於設計結構化訪談以模擬類似人類的資料，以進行心理測量研究。
-
-##### **Using the Path of Least Resistance to Explain Deep Networks**
-2502.12108v1 by Sina Salek, Joseph Enguehard
-
-Integrated Gradients (IG), a widely used axiomatic path-based attribution
-method, assigns importance scores to input features by integrating model
-gradients along a straight path from a baseline to the input. While effective
-in some cases, we show that straight paths can lead to flawed attributions. In
-this paper, we identify the cause of these misattributions and propose an
-alternative approach that treats the input space as a Riemannian manifold,
-computing attributions by integrating gradients along geodesics. We call this
-method Geodesic Integrated Gradients (GIG). To approximate geodesic paths, we
-introduce two techniques: a k-Nearest Neighbours-based approach for smaller
-models and a Stochastic Variational Inference-based method for larger ones.
-Additionally, we propose a new axiom, Strong Completeness, extending the axioms
-satisfied by IG. We show that this property is desirable for attribution
-methods and that GIG is the only method that satisfies it. Through experiments
-on both synthetic and real-world data, we demonstrate that GIG outperforms
-existing explainability methods, including IG.
-
-摘要：整合梯度 (IG) 是一種廣泛使用的公理路徑歸因方法，它透過整合從基線到輸入的直線路徑上的模型梯度，為輸入特徵分配重要性分數。雖然在某些情況下有效，但我們表明直線路徑可能會導致錯誤的歸因。在本文中，我們找出這些錯誤歸因的原因，並提出將輸入空間視為黎曼流形的替代方法，透過整合測地線上的梯度來計算歸因。我們將此方法稱為測地線整合梯度 (GIG)。為了近似測地線路徑，我們引入了兩種技術：一種基於 k 最近鄰的方法，適用於較小的模型；一種基於隨機變異推論的方法，適用於較大的模型。此外，我們提出了新的公理，即強完整性，擴展了 IG 滿足的公理。我們表明此屬性對於歸因方法而言是理想的，並且 GIG 是唯一滿足此屬性的方法。透過對合成資料和真實世界資料進行的實驗，我們證明 GIG 優於現有的可解釋性方法，包括 IG。
-
-##### **Relational Norms for Human-AI Cooperation**
-2502.12102v1 by Brian D. Earp, Sebastian Porsdam Mann, Mateo Aboy, Edmond Awad, Monika Betzler, Marietjie Botes, Rachel Calcott, Mina Caraccio, Nick Chater, Mark Coeckelbergh, Mihaela Constantinescu, Hossein Dabbagh, Kate Devlin, Xiaojun Ding, Vilius Dranseika, Jim A. C. Everett, Ruiping Fan, Faisal Feroz, Kathryn B. Francis, Cindy Friedman, Orsolya Friedrich, Iason Gabriel, Ivar Hannikainen, Julie Hellmann, Arasj Khodadade Jahrome, Niranjan S. Janardhanan, Paul Jurcys, Andreas Kappes, Maryam Ali Khan, Gordon Kraft-Todd, Maximilian Kroner Dale, Simon M. Laham, Benjamin Lange, Muriel Leuenberger, Jonathan Lewis, Peng Liu, David M. Lyreskog, Matthijs Maas, John McMillan, Emilian Mihailov, Timo Minssen, Joshua Teperowski Monrad, Kathryn Muyskens, Simon Myers, Sven Nyholm, Alexa M. Owen, Anna Puzio, Christopher Register, Madeline G. Reinecke, Adam Safron, Henry Shevlin, Hayate Shimizu, Peter V. Treit, Cristina Voinea, Karen Yan, Anda Zahiu, Renwen Zhang, Hazem Zohny, Walter Sinnott-Armstrong, Ilina Singh, Julian Savulescu, Margaret S. Clark
-
-How we should design and interact with social artificial intelligence depends
-on the socio-relational role the AI is meant to emulate or occupy. In human
-society, relationships such as teacher-student, parent-child, neighbors,
-siblings, or employer-employee are governed by specific norms that prescribe or
-proscribe cooperative functions including hierarchy, care, transaction, and
-mating. These norms shape our judgments of what is appropriate for each
-partner. For example, workplace norms may allow a boss to give orders to an
-employee, but not vice versa, reflecting hierarchical and transactional
-expectations. As AI agents and chatbots powered by large language models are
-increasingly designed to serve roles analogous to human positions - such as
-assistant, mental health provider, tutor, or romantic partner - it is
-imperative to examine whether and how human relational norms should extend to
-human-AI interactions. Our analysis explores how differences between AI systems
-and humans, such as the absence of conscious experience and immunity to
-fatigue, may affect an AI's capacity to fulfill relationship-specific functions
-and adhere to corresponding norms. This analysis, which is a collaborative
-effort by philosophers, psychologists, relationship scientists, ethicists,
-legal experts, and AI researchers, carries important implications for AI
-systems design, user behavior, and regulation. While we accept that AI systems
-can offer significant benefits such as increased availability and consistency
-in certain socio-relational roles, they also risk fostering unhealthy
-dependencies or unrealistic expectations that could spill over into human-human
-relationships. We propose that understanding and thoughtfully shaping (or
-implementing) suitable human-AI relational norms will be crucial for ensuring
-that human-AI interactions are ethical, trustworthy, and favorable to human
-well-being.
-
-摘要：我們應如何設計和與社交人工智慧互動，取決於人工智慧預期要模仿或扮演的社會關係角色。在人類社會中，師生、父母子女、鄰居、兄弟姐妹或雇主員工等關係受特定規範所支配，這些規範規定或禁止包括等級、照顧、交易和交配在內的合作功能。這些規範形塑我們對每個夥伴適當行為的判斷。例如，職場規範可能允許老闆對員工發號施令，但反之則不行，這反映了等級和交易的期望。隨著由大型語言模型驅動的人工智慧代理程式和聊天機器人日益被設計為服務類似於人類職位的角色，例如助理、心理健康提供者、導師或浪漫伴侶，審查人類關係規範是否以及如何延伸至人類與人工智慧的互動至關重要。我們的分析探討了人工智慧系統和人類之間的差異，例如缺乏意識體驗和對疲勞的免疫力，如何影響人工智慧履行特定關係功能和遵守相應規範的能力。這項分析是由哲學家、心理學家、關係科學家、倫理學家、法律專家和人工智慧研究人員共同合作的成果，對人工智慧系統設計、使用者行為和法規具有重要的意義。雖然我們接受人工智慧系統可以在某些社會關係角色中提供顯著的好處，例如增加可用性和一致性，但它們也可能助長不健康的依賴關係或不切實際的期望，這些期望可能會蔓延到人際關係中。我們提出，理解和深思熟慮地塑造（或實施）適當的人類與人工智慧關係規範，對於確保人類與人工智慧的互動具有倫理性、可信賴性和有利於人類福祉至關重要。
-
-##### **A Study on Leveraging Search and Self-Feedback for Agent Reasoning**
-2502.12094v1 by Karthikeyan K, Michelle Yuan, Elman Mansimov, Katerina Margatina, Anurag Pratik, Daniele Bonadiman, Monica Sunkara, Yi Zhang, Yassine Benajiba
-
-Recent works have demonstrated that incorporating search during inference can
-significantly improve reasoning capabilities of language agents. Some
-approaches may make use of the ground truth or rely on the model's own generated
-feedback. The search algorithm uses this feedback to then produce values that
-will update its criterion for exploring and exploiting various reasoning paths.
-In this study, we investigate how search and the model's self-feedback can be
-leveraged for reasoning tasks. First, we explore differences in ground-truth
-feedback and self-feedback during search for math reasoning. Second, we observe
-limitations in applying search techniques to more complex tasks like
-tool-calling and design domain-specific approaches to address these gaps. Our
-experiments reveal challenges related to generalization when solely relying on
-self-feedback during search. For search to work effectively, either access to
-the ground-truth is needed or feedback mechanisms need to be carefully designed
-for the specific task.
-
-摘要：最近的研究表明，在推理过程中加入搜索功能可以显著提升语言代理的推理能力。一些方法可能会利用基本事实或依赖模型本身产生的反馈。搜索算法使用此反馈，然后生成值，以更新其探索和利用各种推理路径的标准。在本研究中，我们调查了如何利用搜索和模型的自反馈来进行推理任务。首先，我们探讨了数学推理搜索过程中基本事实反馈和自反馈的差异。其次，我们观察到在将搜索技术应用于更复杂的任务（如工具调用和设计特定于领域的解决方案）时存在的局限性，并提出针对这些差距的解决方案。我们的实验揭示了在搜索过程中仅依赖自反馈时与泛化相关的挑战。要使搜索有效，需要访问基本事实或需要针对特定任务仔细设计反馈机制。
-
-##### **Meta-Statistical Learning: Supervised Learning of Statistical Inference**
-2502.12088v1 by Maxime Peyrard, Kyunghyun Cho
-
-This work demonstrates that the tools and principles driving the success of
-large language models (LLMs) can be repurposed to tackle distribution-level
-tasks, where the goal is to predict properties of the data-generating
-distribution rather than labels for individual datapoints. These tasks
-encompass statistical inference problems such as parameter estimation,
-hypothesis testing, or mutual information estimation. Framing these tasks
-within traditional machine learning pipelines is challenging, as supervision is
-typically tied to individual datapoints. We propose meta-statistical learning, a
-framework inspired by multi-instance learning that reformulates statistical
-inference tasks as supervised learning problems. In this approach, entire
-datasets are treated as single inputs to neural networks, which predict
-distribution-level parameters. Transformer-based architectures, without
-positional encoding, provide a natural fit due to their permutation-invariance
-properties. By training on large-scale synthetic datasets, meta-statistical
-models can leverage the scalability and optimization infrastructure of
-Transformer-based LLMs. We demonstrate the framework's versatility with
-applications in hypothesis testing and mutual information estimation, showing
-strong performance, particularly for small datasets where traditional neural
-methods struggle.
-
-摘要：这项工作表明，推动大型语言模型 (LLM) 成功发展的工具和原则可以重新用于解决分布级别任务，其中目标是预测数据生成分布的属性，而不是单个数据点的标签。这些任务包括统计推断问题，例如参数估计、假设检验或互信息估计。在传统的机器学习管道中构建这些任务具有挑战性，因为监督通常与单个数据点相关联。我们提出了元统计学习，这是一个受多实例学习启发的框架，它将统计推断任务重新表述为监督学习问题。在此方法中，整个数据集被视为神经网络的单个输入，该神经网络预测分布级别参数。基于 Transformer 的架构在没有位置编码的情况下提供了自然拟合，因为它们具有置换不变性。通过在大型合成数据集上进行训练，元统计模型可以利用基于 Transformer 的 LLM 的可扩展性和优化基础设施。我们通过在假设检验和互信息估计中的应用展示了该框架的多功能性，显示出强大的性能，特别是对于传统神经方法难以处理的小型数据集。
-
-##### **APB: Accelerating Distributed Long-Context Inference by Passing Compressed Context Blocks across GPUs**
-2502.12085v1 by Yuxiang Huang, Mingye Li, Xu Han, Chaojun Xiao, Weilin Zhao, Sun Ao, Hao Zhou, Jie Zhou, Zhiyuan Liu, Maosong Sun
-
-While long-context inference is crucial for advancing large language model
-(LLM) applications, its prefill speed remains a significant bottleneck. Current
-approaches, including sequence parallelism strategies and compute reduction
-through approximate attention mechanisms, still fall short of delivering
-optimal inference efficiency. This hinders scaling the inputs to longer
-sequences and processing long-context queries in a timely manner. To address
-this, we introduce APB, an efficient long-context inference framework that
-leverages multi-host approximate attention to enhance prefill speed by reducing
-compute and enhancing parallelism simultaneously. APB introduces a
-communication mechanism for essential key-value pairs within a sequence
-parallelism framework, enabling a faster inference speed while maintaining task
-performance. We implement APB by incorporating a tailored FlashAttn kernel
-alongside optimized distribution strategies, supporting diverse models and
-parallelism configurations. APB achieves speedups of up to 9.2x, 4.2x, and 1.6x
-compared with FlashAttn, RingAttn, and StarAttn, respectively, without any
-observable task performance degradation. We provide the implementation and
-experiment code of APB at https://github.com/thunlp/APB.
-
-摘要：雖然長文本推理對於推進大型語言模型 (LLM) 應用至關重要，但其預填充速度仍然是一個重大的瓶頸。目前的各種方法，包括序列並行策略和透過近似注意力機制減少運算，仍然無法提供最佳的推理效率。這會阻礙將輸入擴展到更長的序列，以及及時處理長文本查詢。為了解決這個問題，我們引入了 APB，這是一個高效的長文本推理架構，它利用多主機近似注意力來減少運算並同時提高並行性，從而提高預填充速度。APB 在序列並行架構中引入了一個用於基本鍵值對的通訊機制，在維持任務效能的同時，實現更快的推理速度。我們透過整合一個量身打造的 FlashAttn 核心以及最佳化的分佈策略來實作 APB，支援各種模型和並行配置。與 FlashAttn、RingAttn 和 StarAttn 相比，APB 分別實現了高達 9.2 倍、4.2 倍和 1.6 倍的加速，同時沒有任何可觀察到的任務效能下降。我們在 https://github.com/thunlp/APB 中提供了 APB 的實作和實驗程式碼。
-
-##### **VLM$^2$-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit Matching Visual Cues**
-2502.12084v1 by Jianshu Zhang, Dongyu Yao, Renjie Pi, Paul Pu Liang, Yi R. Fung
-
-Visually linking matching cues is a crucial ability in daily life, such as
-identifying the same person in multiple photos based on their cues, even
-without knowing who they are. Despite the extensive knowledge that
-vision-language models (VLMs) possess, it remains largely unexplored whether
-they are capable of performing this fundamental task. To address this, we
-introduce VLM$^2$-Bench, a benchmark designed to assess whether VLMs can
-Visually Link Matching cues, with 9 subtasks and over 3,000 test cases.
-Comprehensive evaluation across eight open-source VLMs and GPT-4o, along with
-further analysis of various language-side and vision-side prompting methods,
-leads to a total of eight key findings. We identify critical challenges in
-models' ability to link visual cues, highlighting a significant performance gap
-where even GPT-4o lags 34.80% behind humans. Based on these insights, we
-advocate for (i) enhancing core visual capabilities to improve adaptability and
-reduce reliance on prior knowledge, (ii) establishing clearer principles for
-integrating language-based reasoning in vision-centric tasks to prevent
-unnecessary biases, and (iii) shifting vision-text training paradigms toward
-fostering models' ability to independently structure and infer relationships
-among visual cues.
-
-摘要：視覺連結匹配線索是日常生活中的關鍵能力，例如在多張照片中根據線索辨識同一個人，即使不知道他們是誰。儘管視覺語言模型 (VLM) 擁有廣泛的知識，但它們是否能執行這項基本任務，在很大程度上仍未被探討。為了解決這個問題，我們引入了 VLM$^2$-Bench，一個基準測試，旨在評估 VLM 是否能視覺連結匹配線索，包含 9 個子任務和超過 3,000 個測試案例。對八個開源 VLM 和 GPT-4o 的全面評估，以及對各種語言側和視覺側提示方法的進一步分析，得出總共八項關鍵發現。我們找出模型連結視覺線索能力的關鍵挑戰，強調一個顯著的效能差距，即使是 GPT-4o 也落後人類 34.80%。根據這些見解，我們提倡 (i) 提升核心視覺能力以改善適應性並減少對先驗知識的依賴，(ii) 為整合基於語言的推理到以視覺為中心的任務中建立更明確的原則，以防止不必要的偏見，以及 (iii) 將視覺文字訓練範例轉移到培養模型獨立建構和推論視覺線索之間關係的能力。
-
-##### **AdaSplash: Adaptive Sparse Flash Attention**
-2502.12082v1 by Nuno Gonçalves, Marcos Treviso, André F. T. Martins
-
-The computational cost of softmax-based attention in transformers limits
-their applicability to long-context tasks. Adaptive sparsity, of which
-$\alpha$-entmax attention is an example, offers a flexible data-dependent
-alternative, but existing implementations are inefficient and do not leverage
-the sparsity to obtain runtime and memory gains. In this work, we propose
-AdaSplash, which combines the efficiency of GPU-optimized algorithms with the
-sparsity benefits of $\alpha$-entmax. We first introduce a hybrid
-Halley-bisection algorithm, resulting in a 7-fold reduction in the number of
-iterations needed to compute the $\alpha$-entmax transformation. Then, we
-implement custom Triton kernels to efficiently handle adaptive sparsity.
-Experiments with RoBERTa and ModernBERT for text classification and
-single-vector retrieval, along with GPT-2 for language modeling, show that our
-method achieves substantial improvements in runtime and memory efficiency
-compared to existing $\alpha$-entmax implementations. It approaches -- and in
-some cases surpasses -- the efficiency of highly optimized softmax
-implementations like FlashAttention-2, enabling long-context training while
-maintaining strong task performance.
-
-摘要：基於 softmax 的注意力在 Transformer 中的運算成本限制了它們在長內容任務中的應用性。適應性稀疏性，其中 $\alpha$-entmax 注意力是一個例子，提供了一個靈活的資料相關替代方案，但現有的實作效率低下，且無法利用稀疏性來獲得執行時間和記憶體的增益。在這項工作中，我們提出了 AdaSplash，它結合了 GPU 最佳化演算法的效率和 $\alpha$-entmax 的稀疏性優點。我們首先引入了一個混合 Halley-二分法演算法，導致計算 $\alpha$-entmax 轉換所需的迭代次數減少了 7 倍。然後，我們實作自訂 Triton 核心，以有效處理適應性稀疏性。針對文字分類和單一向量擷取的 RoBERTa 和 ModernBERT，以及用於語言建模的 GPT-2 的實驗顯示，與現有的 $\alpha$-entmax 實作相比，我們的方法在執行時間和記憶體效率方面獲得了顯著的改善。它接近了 -- 在某些情況下超越了 -- 高度最佳化 softmax 實作（例如 FlashAttention-2）的效率，同時在維持強大任務效能的同時，能夠進行長內容訓練。
-
-##### **Unhackable Temporal Rewarding for Scalable Video MLLMs**
-2502.12081v1 by En Yu, Kangheng Lin, Liang Zhao, Yana Wei, Zining Zhu, Haoran Wei, Jianjian Sun, Zheng Ge, Xiangyu Zhang, Jingyu Wang, Wenbing Tao
-
-In the pursuit of superior video-processing MLLMs, we have encountered a
-perplexing paradox: the "anti-scaling law", where more data and larger models
-lead to worse performance. This study unmasks the culprit: "temporal hacking",
-a phenomenon where models shortcut by fixating on select frames, missing the
-full video narrative. In this work, we systematically establish a comprehensive
-theory of temporal hacking, defining it from a reinforcement learning
-perspective, introducing the Temporal Perplexity (TPL) score to assess this
-misalignment, and proposing the Unhackable Temporal Rewarding (UTR) framework
-to mitigate temporal hacking. Both theoretically and empirically, TPL
-proves to be a reliable indicator of temporal modeling quality, correlating
-strongly with frame activation patterns. Extensive experiments reveal that UTR
-not only counters temporal hacking but also significantly elevates video
-comprehension capabilities. This work not only advances video-AI systems but
-also illuminates the critical importance of aligning proxy rewards with true
-objectives in MLLM development.
-
-摘要：在追求卓越的影片處理 MLLM 時，我們遭遇了一個令人費解的矛盾現象：「反規模化定律」，也就是更多資料和更大的模型會導致更差的效能。本研究揭露了罪魁禍首：「時間駭客」，這是一種模型透過專注於特定影格來簡化的現象，錯失了完整的影片敘事。在這項研究中，我們系統性地建立了一個關於時間駭客的全面理論，從強化學習的角度定義它，並引入了時間困惑度 (TPL) 分數來評估這種失衡，並提出了無法破解的時間獎勵 (UTR) 架構來減輕時間駭客現象。從理論和經驗上來說，TPL 被證明是時間建模品質的可靠指標，與影格啟動模式有很強的相關性。大量的實驗顯示，UTR 不僅對抗時間駭客，還能顯著提升影片理解能力。這項研究不僅推動了影片 AI 系統，也闡明了在 MLLM 開發中，將代理獎勵與真實目標對齊的重要性。
-
-##### **Can LLMs Simulate Social Media Engagement? A Study on Action-Guided Response Generation**
-2502.12073v1 by Zhongyi Qiu, Hanjia Lyu, Wei Xiong, Jiebo Luo
-
-Social media enables dynamic user engagement with trending topics, and recent
-research has explored the potential of large language models (LLMs) for
-response generation. While some studies investigate LLMs as agents for
-simulating user behavior on social media, their focus remains on practical
-viability and scalability rather than a deeper understanding of how well LLMs
-align with human behavior. This paper analyzes LLMs' ability to simulate
-social media engagement through action-guided response generation, where a
-model first predicts a user's most likely engagement action (retweet, quote, or
-rewrite) toward a trending post before generating a personalized response
-conditioned on the predicted action. We benchmark GPT-4o-mini, O1-mini, and
-DeepSeek-R1 in social media engagement simulation regarding a major societal
-event discussed on X. Our findings reveal that zero-shot LLMs underperform BERT
-in action prediction, while few-shot prompting initially degrades the
-prediction accuracy of LLMs with limited examples. However, in response
-generation, few-shot LLMs achieve stronger semantic alignment with ground truth
-posts.
-
-摘要：社交媒體讓使用者能夠動態參與熱門話題，而最近的研究探索了大型語言模型 (LLM) 在回應生成方面的潛力。儘管有些研究將 LLM 視為模擬社交媒體使用者行為的代理，但其重點仍放在實務可行性和可擴充性，而非深入了解 LLM 如何與人類行為相符。本文分析了 LLM 透過動作引導回應生成來模擬社交媒體參與的能力，其中一個模型首先預測使用者最有可能的參與動作（轉推、引用或改寫）對熱門貼文的參與，然後根據預測的動作產生個人化回應。我們在 X 上討論的一個重大社會事件中，對 GPT-4o-mini、O1-mini 和 DeepSeek-R1 進行社交媒體參與模擬的基準測試。我們的研究結果顯示，零次學習 LLM 在動作預測方面表現不如 BERT，而少次學習提示最初會降低範例有限的 LLM 預測準確度。然而，在回應生成方面，少次學習 LLM 與真實貼文達到了更強的語義對齊。
-
-##### **TokenSkip: Controllable Chain-of-Thought Compression in LLMs**
-2502.12067v1 by Heming Xia, Yongqi Li, Chak Tou Leong, Wenjie Wang, Wenjie Li
-
-Chain-of-Thought (CoT) has been proven effective in enhancing the reasoning
-capabilities of large language models (LLMs). Recent advancements, such as
-OpenAI's o1 and DeepSeek-R1, suggest that scaling up the length of CoT
-sequences during inference could further boost LLM reasoning performance.
-However, due to the autoregressive nature of LLM decoding, longer CoT outputs
-lead to a linear increase in inference latency, adversely affecting user
-experience, particularly when the CoT exceeds 10,000 tokens. To address this
-limitation, we analyze the semantic importance of tokens within CoT outputs and
-reveal that their contributions to reasoning vary. Building on this insight, we
-propose TokenSkip, a simple yet effective approach that enables LLMs to
-selectively skip less important tokens, allowing for controllable CoT
-compression. Extensive experiments across various models and tasks demonstrate
-the effectiveness of TokenSkip in reducing CoT token usage while preserving
-strong reasoning performance. Notably, when applied to Qwen2.5-14B-Instruct,
-TokenSkip reduces reasoning tokens by 40% (from 313 to 181) on GSM8K, with less
-than a 0.4% performance drop.
-
-摘要：鏈式思維 (CoT) 已被證明能有效提升大型語言模型 (LLM) 的推理能力。最近的進展，例如 OpenAI 的 o1 和 DeepSeek-R1，表明在推理過程中擴展 CoT 序列的長度可以進一步提升 LLM 的推理效能。然而，由於 LLM 解碼的自動回歸特性，較長的 CoT 輸出會導致推理延遲線性增加，對使用者體驗造成負面影響，特別是在 CoT 超過 10,000 個符號時。為了解決這個限制，我們分析了 CoT 輸出中符號的語義重要性，並揭示了它們對推理的貢獻度不同。基於這個見解，我們提出了 TokenSkip，一種簡單但有效的技術，使 LLM 能有選擇地略過較不重要的符號，從而實現可控的 CoT 壓縮。跨越各種模型和任務的廣泛實驗證明了 TokenSkip 在減少 CoT 符號使用量同時保持強大推理效能方面的有效性。值得注意的是，當應用於 Qwen2.5-14B-Instruct 時，TokenSkip 將 GSM8K 上的推理符號減少了 40%（從 313 個減少到 181 個），效能下降不到 0.4%。
-
-##### **CONSTRUCTA: Automating Commercial Construction Schedules in Fabrication Facilities with Large Language Models**
-2502.12066v1 by Yifan Zhang, Xue Yang
-
-Automating planning with LLMs presents transformative opportunities for
-traditional industries, yet remains underexplored. In commercial construction,
-the complexity of automated scheduling often requires manual intervention to
-ensure precision. We propose CONSTRUCTA, a novel framework leveraging LLMs to
-optimize construction schedules in complex projects like semiconductor
-fabrication. CONSTRUCTA addresses key challenges by: (1) integrating
-construction-specific knowledge through static RAG; (2) employing
-context-sampling techniques inspired by architectural expertise to provide
-relevant input; and (3) deploying Construction DPO to align schedules with
-expert preferences using RLHF. Experiments on proprietary data demonstrate
-performance improvements of +42.3% in missing value prediction, +79.1% in
-dependency analysis, and +28.9% in automated planning compared to baseline
-methods, showcasing its potential to revolutionize construction workflows and
-inspire domain-specific LLM advancements.
-
-摘要：利用 LLM 自動化規劃為傳統產業帶來轉型契機，但仍有待進一步探索。在商業建築中，自動化排程的複雜性通常需要手動介入以確保精確度。我們提出 CONSTRUCTA，一個利用 LLM 優化複雜專案（如半導體製造）建築排程的新穎架構。CONSTRUCTA 透過下列方式解決關鍵挑戰：(1) 整合靜態 RAG 的建築特定知識；(2) 採用受建築專業知識啟發的脈絡取樣技術，提供相關輸入；(3) 部署建築 DPO，使用 RLHF 將排程與專家偏好對齊。專利數據的實驗顯示，與基準方法相比，遺失值預測的效能提升 +42.3%、相依性分析提升 +79.1%、自動化規劃提升 +28.9%，展示其革新建築工作流程和激勵領域特定 LLM 進展的潛力。
-
-##### **Formalizing Complex Mathematical Statements with LLMs: A Study on Mathematical Definitions**
-2502.12065v1 by Lan Zhang, Marco Valentino, Andre Freitas
-
-Thanks to their linguistic capabilities, LLMs offer an opportunity to bridge
-the gap between informal mathematics and formal languages through
-autoformalization. However, it is still unclear how well LLMs generalize to
-sophisticated and naturally occurring mathematical statements. To address this
-gap, we investigate the task of autoformalizing real-world mathematical
-definitions -- a critical component of mathematical discourse. Specifically, we
-introduce two novel resources for autoformalization, collecting definitions
-from Wikipedia (Def_Wiki) and arXiv papers (Def_ArXiv). We then systematically
-evaluate a range of LLMs, analyzing their ability to formalize definitions into
-Isabelle/HOL. Furthermore, we investigate strategies to enhance LLMs'
-performance including refinement through external feedback from Proof
-Assistants, and formal definition grounding, where we guide LLMs through
-relevant contextual elements from formal mathematical libraries. Our findings
-reveal that definitions present a greater challenge compared to existing
-benchmarks, such as miniF2F. In particular, we found that LLMs still struggle
-with self-correction and with aligning to relevant mathematical libraries. At the
-same time, structured refinement methods and definition grounding strategies
-yield notable improvements of up to 16% on self-correction capabilities and 43%
-on the reduction of undefined errors, highlighting promising directions for
-enhancing LLM-based autoformalization in real-world scenarios.
-
-摘要：由於語言能力，LLM 提供了一個機會，透過自動形式化來彌合非正式數學和形式語言之間的差距。然而，LLM 在多麼精巧且自然發生的數學陳述中概化，這仍不清楚。為了解決這個差距，我們探討了自動形式化真實世界數學定義的任務，這是數學論述中的關鍵組成部分。具體來說，我們介紹了自動形式化的兩個新資源，收集來自維基百科（Def_Wiki）和 arXiv 論文（Def_ArXiv）的定義。然後，我們系統性地評估了一系列 LLM，分析它們將定義形式化為 Isabelle/HOL 的能力。此外，我們探討了增強 LLM 效能的策略，包括透過證明輔助工具的外部回饋進行精煉，以及形式定義基礎，其中我們透過形式數學函式庫中的相關脈絡元素來引導 LLM。我們的發現顯示，與現有的基準（例如 miniF2F）相比，定義提出了更大的挑戰。特別是，我們發現 LLM 在自我修正和與相關數學函式庫對齊方面仍然有困難。同時，結構化的精煉方法和定義基礎策略在自我修正能力上產生了顯著的改善，高達 16%，在減少未定義錯誤方面改善了 43%，突顯了在真實世界場景中增強基於 LLM 的自動形式化的有希望的方向。
-
-##### **AI-generated Text Detection with a GLTR-based Approach**
-2502.12064v1 by Lucía Yan Wu, Isabel Segura-Bedmar
-
-The rise of LLMs (Large Language Models) has contributed to the improved
-performance and development of cutting-edge NLP applications. However, these
-can also pose risks when used maliciously, such as spreading fake news, harmful
-content, impersonating individuals, or facilitating school plagiarism, among
-others. This is because LLMs can generate high-quality texts, which are
-challenging to differentiate from those written by humans. GLTR, which stands
-for Giant Language Model Test Room and was developed jointly by the MIT-IBM
-Watson AI Lab and HarvardNLP, is a GPT-2-based visual tool designed to help
-detect machine-generated texts; it highlights the words in a text according to
-the probability that they were machine-generated. One limitation
-of GLTR is that the results it returns can sometimes be ambiguous and lead to
-confusion. This study aims to explore various ways to improve GLTR's
-effectiveness for detecting AI-generated texts within the context of the
-IberLef-AuTexTification 2023 shared task, in both English and Spanish.
-Experimental results show that our GLTR-based GPT-2 model outperforms
-the state-of-the-art models on the English dataset with a macro F1-score of
-80.19%, except for the top-ranked model (80.91%). However, for the Spanish
-dataset, we obtained a macro F1-score of 66.20%, which differs by 4.57%
-compared to the top-performing model.
-
-摘要：大型語言模型 (LLM) 的興起有助於改進尖端 NLP 應用程式的效能和開發。不過，這些應用程式若遭惡意使用，例如散布假新聞、有害內容、冒充個人或協助學校抄襲等，也可能造成風險。這是因為 LLM 可以產生高品質的文字，而這些文字難以與人類所寫的文字區分。GLTR（代表大型語言模型測試室）是由麻省理工學院-IBM Watson AI 實驗室和 HarvardNLP 共同開發的視覺工具，旨在協助偵測基於 GPT-2 的機器產生的文字，它會根據文字中每個字詞機器產生的機率來標示。GLTR 的一個限制在於，它回傳的結果有時可能模稜兩可，容易造成混淆。本研究旨在探討各種方法來改善 GLTR 在 IberLef-AuTexTification 2023 共享任務中偵測 AI 生成的文字的效能，任務中包含英文和西班牙文兩種語言。實驗結果顯示，我們的基於 GLTR 的 GPT-2 模型在英文資料集上以 80.19% 的巨觀 F1 分數超越了最先進的模型，僅次於第一名排名模型 (80.91%)。不過，在西班牙文資料集上，我們獲得的巨觀 F1 分數為 66.20%，與表現最佳的模型相比，相差 4.57%。
-
-##### **Culture is Not Trivia: Sociocultural Theory for Cultural NLP**
-2502.12057v1 by Naitian Zhou, David Bamman, Isaac L. Bleaman
-
-The field of cultural NLP has recently experienced rapid growth, driven by a
-pressing need to ensure that language technologies are effective and safe
-across a pluralistic user base. This work has largely progressed without a
-shared conception of culture, instead choosing to rely on a wide array of
-cultural proxies. However, this leads to a number of recurring limitations:
-coarse national boundaries fail to capture nuanced differences that lie within
-them, limited coverage restricts datasets to only a subset of usually
-highly-represented cultures, and a lack of dynamicity results in static
-cultural benchmarks that do not change as culture evolves. In this position
-paper, we argue that these methodological limitations are symptomatic of a
-theoretical gap. We draw on a well-developed theory of culture from
-sociocultural linguistics to fill this gap by 1) demonstrating in a case study
-how it can clarify methodological constraints and affordances, 2) offering
-theoretically-motivated paths forward to achieving cultural competence, and 3)
-arguing that localization is a more useful framing for the goals of much
-current work in cultural NLP.
-
-摘要：文化 NLP 領域最近經歷了快速成長，這是因為迫切需要確保語言技術對於多元化的使用者基礎而言是有效且安全的。這項工作在很大程度上沒有文化共識，而是選擇依賴各種文化代理。然而，這導致了許多重複性的限制：粗略的國家界線無法捕捉到其中的細微差異，有限的涵蓋範圍將資料集限制在通常高度代表的文化子集，而且缺乏動態性導致靜態文化基準無法隨著文化演變而改變。在這篇立場文件中，我們認為這些方法論限制是理論差距的徵兆。我們從社會文化語言學中汲取一個發展良好的文化理論，透過 1) 在個案研究中展示它如何釐清方法論限制和可負擔性，2) 提供理論上合理的途徑來實現文化能力，以及 3) 主張在地化對於文化 NLP 中許多當前工作的目標而言是一個更有用的框架，來填補這個差距。
-
-##### **Designing Role Vectors to Improve LLM Inference Behaviour**
-2502.12055v1 by Daniele Potertì, Andrea Seveso, Fabio Mercorio
-
-The influence of personas on Large Language Models (LLMs) has been widely
-studied, yet their direct impact on performance remains uncertain. This work
-explores a novel approach to guiding LLM behaviour through role vectors, an
-alternative to persona-based prompting. We construct 29 role vectors derived
-from model activations and evaluate their impact on benchmark performance
-across multiple domains. Our analysis investigates whether these vectors can
-effectively steer models toward domain-specific expertise. We measure two key
-interventions: (i) activation addition, which reinforces role-specific
-directions, and (ii) directional ablation, which removes them. Results on
-well-established benchmarks indicate that role vectors do, in fact, influence
-model behaviour, improving task performance in relevant domains while
-marginally affecting unrelated tasks. This, in turn, suggests that manipulating
-internal model representations has a greater impact on outcomes than
-persona-based prompting.
-
-摘要：大型語言模型 (LLM) 中角色的影響已被廣泛研究，但它們對效能的直接影響仍然不確定。本研究探討了一種透過角色向量引導 LLM 行為的新方法，這是一種基於角色提示的替代方案。我們從模型激活中建構了 29 個角色向量，並評估它們對多個領域基準效能的影響。我們的分析探討了這些向量是否能有效地引導模型朝向特定領域的專業知識。我們衡量了兩個關鍵干預措施：(i) 激活新增，它加強了特定角色的方向，以及 (ii) 方向消融，它移除了這些方向。在既定基準上的結果表明，角色向量確實會影響模型行為，在相關領域中改善任務效能，同時對不相關任務的影響很小。這反過來表明，操縱內部模型表示對結果的影響比基於角色的提示更大。
-
-##### **PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning**
-2502.12054v1 by Xinyu Zhang, Yuxuan Dong, Yanrui Wu, Jiaxing Huang, Chengyou Jia, Basura Fernando, Mike Zheng Shou, Lingling Zhang, Jun Liu
-
-Large language models demonstrate remarkable capabilities across various
-domains, especially mathematics and logic reasoning. However, current
-evaluations overlook physics-based reasoning - a complex task requiring physics
-theorems and constraints. We present PhysReason, a 1,200-problem benchmark
-comprising knowledge-based (25%) and reasoning-based (75%) problems, where the
-latter are divided into three difficulty levels (easy, medium, hard). Notably,
-problems require an average of 8.1 solution steps, with hard requiring 15.6,
-reflecting the complexity of physics-based reasoning. We propose the Physics
-Solution Auto Scoring Framework, incorporating efficient answer-level and
-comprehensive step-level evaluations. Top-performing models like Deepseek-R1,
-Gemini-2.0-Flash-Thinking, and o3-mini-high achieve less than 60% on
-answer-level evaluation, with performance dropping from knowledge questions
-(75.11%) to hard problems (31.95%). Through step-level evaluation, we
-identified four key bottlenecks: Physics Theorem Application, Physics Process
-Understanding, Calculation, and Physics Condition Analysis. These findings
-position PhysReason as a novel and comprehensive benchmark for evaluating
-physics-based reasoning capabilities in large language models. Our code and
-data will be published at https://dxzxy12138.github.io/PhysReason.
-
-摘要：大型語言模型展示了在各個領域的非凡能力，特別是數學和邏輯推理。然而，目前的評估忽略了基於物理的推理——這是一項複雜的任務，需要物理定理和約束。我們提出了 PhysReason，一個包含 1,200 題的基準，包含基於知識的（25%）和基於推理的（75%）問題，後者分為三個難度等級（容易、中等、困難）。值得注意的是，問題需要平均 8.1 個求解步驟，困難的需要 15.6 個，反映了基於物理的推理的複雜性。我們提出了物理解決方案自動評分框架，結合了高效的答案級別和全面的步驟級別評估。Deepseek-R1、Gemini-2.0-Flash-Thinking 和 o3-mini-high 等表現最佳的模型在答案級別評估中獲得低於 60% 的分數，性能從知識問題（75.11%）下降到困難問題（31.95%）。通過步驟級別評估，我們確定了四個關鍵瓶頸：物理定理應用、物理過程理解、計算和物理條件分析。這些發現將 PhysReason 定位為一個新穎且全面的基準，用於評估大型語言模型中基於物理的推理能力。我們的代碼和數據將發布在 https://dxzxy12138.github.io/PhysReason。
-
-##### **A Dual-Perspective NLG Meta-Evaluation Framework with Automatic Benchmark and Better Interpretability**
-2502.12052v1 by Xinyu Hu, Mingqi Gao, Li Lin, Zhenghan Yu, Xiaojun Wan
-
-In NLG meta-evaluation, evaluation metrics are typically assessed based on
-their consistency with humans. However, we identify some limitations in
-traditional NLG meta-evaluation approaches, such as issues in handling human
-ratings and ambiguous selections of correlation measures, which undermine the
-effectiveness of meta-evaluation. In this work, we propose a dual-perspective
-NLG meta-evaluation framework that focuses on different evaluation
-capabilities, thereby providing better interpretability. In addition, we
-introduce a method of automatically constructing the corresponding benchmarks
-without requiring new human annotations. Furthermore, we conduct experiments
-with 16 representative LLMs as the evaluators based on our proposed framework,
-comprehensively analyzing their evaluation performance from different
-perspectives.
-
-摘要：在 NLG 元評估中，評估指標通常根據其與人類的一致性進行評估。然而，我們在傳統的 NLG 元評估方法中發現了一些限制，例如在處理人類評分和模稜兩可的相關性測量選擇方面存在問題，這會損害元評估的有效性。在這項工作中，我們提出了一個雙視角 NLG 元評估框架，該框架專注於不同的評估能力，從而提供更好的可解釋性。此外，我們引入了一種自動構建相應基準的方法，而不需要新的手動註釋。此外，我們根據我們提出的框架對 16 個具有代表性的 LLM 作為評估器進行了實驗，從不同的角度全面分析了它們的評估性能。
-
-##### **How to Upscale Neural Networks with Scaling Law? A Survey and Practical Guidelines**
-2502.12051v1 by Ayan Sengupta, Yash Goel, Tanmoy Chakraborty
-
-Neural scaling laws have revolutionized the design and optimization of
-large-scale AI models by revealing predictable relationships between model
-size, dataset volume, and computational resources. Early research established
-power-law relationships in model performance, leading to compute-optimal
-scaling strategies. However, recent studies highlighted their limitations
-across architectures, modalities, and deployment contexts. Sparse models,
-mixture-of-experts, retrieval-augmented learning, and multimodal models often
-deviate from traditional scaling patterns. Moreover, scaling behaviors vary
-across domains such as vision, reinforcement learning, and fine-tuning,
-underscoring the need for more nuanced approaches. In this survey, we
-synthesize insights from over 50 studies, examining the theoretical
-foundations, empirical findings, and practical implications of scaling laws. We
-also explore key challenges, including data efficiency, inference scaling, and
-architecture-specific constraints, advocating for adaptive scaling strategies
-tailored to real-world applications. We suggest that while scaling laws provide
-a useful guide, they do not always generalize across all architectures and
-training strategies.
-
-摘要：神經網路規模定律透過揭示模型規模、資料集體積和計算資源之間可預測的關係，徹底革新了大型 AI 模型的設計和最佳化。早期研究建立了模型效能中的冪次定律關係，進而產生最佳化的運算規模策略。然而，最近的研究突出了它們在架構、模態和部署脈絡中的限制。稀疏模型、專家混合、檢索增強式學習和多模態模型通常偏離傳統的規模模式。此外，規模行為因視覺、強化學習和微調等領域而異，強調需要更細緻的方法。在這項調查中，我們綜合了 50 多項研究的見解，探討規模定律的理論基礎、實證發現和實務意涵。我們也探討了關鍵挑戰，包括資料效率、推論規模和特定於架構的限制，提倡針對實際應用量身打造的自適應規模策略。我們建議，儘管規模定律提供了有用的指南，但它們並不總是能概括到所有架構和訓練策略。
-
-##### **SpeechT: Findings of the First Mentorship in Speech Translation**
-2502.12050v1 by Yasmin Moslem, Juan Julián Cea Morán, Mariano Gonzalez-Gomez, Muhammad Hazim Al Farouq, Farah Abdou, Satarupa Deb
-
-This work presents the details and findings of the first mentorship in speech
-translation (SpeechT), which took place in December 2024 and January 2025. To
-fulfil the requirements of the mentorship, the participants engaged in key
-activities, including data preparation, modelling, and advanced research.
-
-摘要：本研究報告了 2024 年 12 月和 2025 年 1 月舉行的首次語音翻譯 (SpeechT) 指導計畫的詳細資訊和發現。為了滿足指導計畫的要求，參與者參與了關鍵活動，包括資料準備、建模和進階研究。
-
-##### **A Survey on Bridging EEG Signals and Generative AI: From Image and Text to Beyond**
-2502.12048v1 by Shreya Shukla, Jose Torres, Abhijit Mishra, Jacek Gwizdka, Shounak Roychowdhury
-
-Integration of Brain-Computer Interfaces (BCIs) and Generative Artificial
-Intelligence (GenAI) has opened new frontiers in brain signal decoding,
-enabling assistive communication, neural representation learning, and
-multimodal integration. BCIs, particularly those leveraging
-Electroencephalography (EEG), provide a non-invasive means of translating
-neural activity into meaningful outputs. Recent advances in deep learning,
-including Generative Adversarial Networks (GANs) and Transformer-based Large
-Language Models (LLMs), have significantly improved EEG-based generation of
-images, text, and speech. This paper provides a literature review of the
-state-of-the-art in EEG-based multimodal generation, focusing on (i)
-EEG-to-image generation through GANs, Variational Autoencoders (VAEs), and
-Diffusion Models, and (ii) EEG-to-text generation leveraging Transformer-based
-language models and contrastive learning methods. Additionally, we discuss the
-emerging domain of EEG-to-speech synthesis, an evolving multimodal frontier. We
-highlight key datasets, use cases, challenges, and EEG feature encoding methods
-that underpin generative approaches. By providing a structured overview of
-EEG-based generative AI, this survey aims to equip researchers and
-practitioners with insights to advance neural decoding, enhance assistive
-technologies, and expand the frontiers of brain-computer interaction.
-
-摘要：腦機介面（BCIs）與生成式人工智慧（GenAI）的整合為腦信號解碼開啟了新領域，能協助溝通、神經表徵學習與多模式整合。BCIs，特別是利用腦電圖（EEG）的 BCIs，提供了一種非侵入性的方式，可將神經活動轉換為有意義的輸出。深度學習的最新進展，包括生成對抗網路（GANs）與基於 Transformer 的大型語言模型（LLMs），大幅改善了基於 EEG 的影像、文字與語音生成。本文提供了一份基於 EEG 的多模式生成的最新文獻回顧，重點在於（一）透過 GANs、變異自動編碼器（VAEs）與擴散模型進行 EEG 到影像的生成，以及（二）利用基於 Transformer 的語言模型與對比學習方法進行 EEG 到文字的生成。此外，我們討論了 EEG 到語音合成的新興領域，這是一個不斷演進的多模式領域。我們重點介紹了關鍵的資料集、用例、挑戰與支撐生成方法的 EEG 特徵編碼方法。透過提供基於 EEG 的生成式 AI 的結構化概觀，本調查旨在為研究人員與從業人員提供見解，以推進神經解碼、增強輔助技術並擴展腦機互動的領域。
-
-##### **KnowPath: Knowledge-enhanced Reasoning via LLM-generated Inference Paths over Knowledge Graphs**
-2502.12029v1 by Qi Zhao, Hongyu Yang, Qi Song, Xinwei Yao, Xiangyang Li
-
-Large language models (LLMs) have demonstrated remarkable capabilities in
-various complex tasks, yet they still suffer from hallucinations. Introducing
-external knowledge, such as knowledge graphs, can enhance the LLMs' ability to
-provide factual answers. LLMs have the ability to interactively explore
-knowledge graphs. However, most approaches have been affected by insufficient
-internal knowledge excavation in LLMs, limited generation of trustworthy
-knowledge reasoning paths, and a vague integration between internal and
-external knowledge. Therefore, we propose KnowPath, a knowledge-enhanced large
-model framework driven by the collaboration of internal and external knowledge.
-It relies on the internal knowledge of the LLM to guide the exploration of
-interpretable directed subgraphs in external knowledge graphs, better
-integrating the two knowledge sources for more accurate reasoning. Extensive
-experiments on multiple real-world datasets confirm the superiority of
-KnowPath.
-
-摘要：大型語言模型 (LLM) 已在各種複雜任務中展現出卓越的能力，但仍會出現幻覺。引入外部知識（例如知識圖譜）可以增強 LLM 提供事實答案的能力。LLM 有能力互動式地探索知識圖譜。然而，大多數方法都受到 LLM 中內部知識挖掘不足、可信賴知識推理路徑生成受限，以及內部和外部知識之間的整合模糊的影響。因此，我們提出 KnowPath，這是一個由內部和外部知識的協作驅動的知識增強型大型模型框架。它依賴於 LLM 的內部知識來指導對外部知識圖譜中可解釋的有向子圖的探索，更好地整合兩個知識來源以進行更準確的推理。對多個真實世界資料集進行的大量實驗證實了 KnowPath 的優越性。
-
-##### **SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities**
-2502.12025v1 by Fengqing Jiang, Zhangchen Xu, Yuetai Li, Luyao Niu, Zhen Xiang, Bo Li, Bill Yuchen Lin, Radha Poovendran
-
-Emerging large reasoning models (LRMs), such as DeepSeek-R1 models, leverage
-long chain-of-thought (CoT) reasoning to generate structured intermediate
-steps, enhancing their reasoning capabilities. However, long CoT does not
-inherently guarantee safe outputs, potentially leading to harmful consequences
-such as the introduction of security vulnerabilities in code or the spread of
-misinformation. Current research on large language model (LLM) safety usually
-focuses on short-answer responses, overlooking the long CoT style outputs of
-LRMs. To bridge this gap, we conduct a systematic study of LRM safety. First,
-we investigate safety evaluators calibrated against human annotations. Using
-our newly developed metrics, we thoroughly assess the safety of 12
-state-of-the-art LRMs on StrongReject and WildJailbreak datasets. Our results
-show that the safety of LRMs lags behind their advances in reasoning. Further, we
-perform a fine-grained analysis of the reasoning trace and final answer. We
-find that three decoding strategies (ZeroThink, LessThink, and MoreThink) can
-improve model safety without additional training. However, these strategies
-either use constrained reasoning traces or incur high inference costs. To
-better strengthen LRM safety, we introduce SafeChain, the first-of-its-kind
-safety training dataset in CoT style. We fine-tune two LRMs with SafeChain,
-showing that it not only enhances model safety but also preserves performance
-across 6 reasoning benchmarks.
-
-摘要：新興的大型推理模型（LRM），例如 DeepSeek-R1 模型，利用長鏈思考（CoT）推理來生成結構化的中間步驟，增強其推理能力。然而，長 CoT 本質上並不能保證安全的輸出，可能會導致有害的後果，例如在程式碼中引入安全漏洞或散佈錯誤訊息。目前針對大型語言模型（LLM）安全性的研究通常側重於簡短的回答回應，忽略了 LRM 的長 CoT 風格輸出。為了彌補這個差距，我們對 LRM 安全性進行系統性研究。首先，我們研究根據人類註解校正的安全評估器。使用我們新開發的指標，我們徹底評估了 12 個最先進的 LRM 在 StrongReject 和 WildJailbreak 資料集上的安全性。我們的結果表明，與其推理進度相比，LRM 並不安全。此外，我們對推理軌跡和最終答案進行了細粒度分析。我們發現三種解碼策略（ZeroThink、LessThink 和 MoreThink）可以在不額外訓練的情況下提高模型安全性。然而，這些策略要么使用受約束的推理軌跡，要么會產生高昂的推論成本。為了進一步加強 LRM 安全性，我們引入了 SafeChain，這是第一個 CoT 風格的安全訓練資料集。我們使用 SafeChain 微調了兩個 LRM，表明它不僅增強了模型安全性，而且在 6 個推理基準測試中都保持了效能。
-
-##### **Teaching LLMs According to Their Aptitude: Adaptive Reasoning for Mathematical Problem Solving**
-2502.12022v1 by Xin Xu, Yan Xu, Tianhao Chen, Yuchen Yan, Chengwu Liu, Zaoyu Chen, Yufei Wang, Yichun Yin, Yasheng Wang, Lifeng Shang, Qun Liu
-
-Existing approaches to mathematical reasoning with large language models
-(LLMs) rely on Chain-of-Thought (CoT) for generalizability or Tool-Integrated
-Reasoning (TIR) for precise computation. While efforts have been made to
-combine these methods, they primarily rely on post-selection or predefined
-strategies, leaving an open question: whether LLMs can autonomously adapt their
-reasoning strategy based on their inherent capabilities. In this work, we
-propose TATA (Teaching LLMs According to Their Aptitude), an adaptive framework
-that enables LLMs to personalize their reasoning strategy spontaneously,
-aligning it with their intrinsic aptitude. TATA incorporates base-LLM-aware
-data selection during supervised fine-tuning (SFT) to tailor training data to
-the model's unique abilities. This approach equips LLMs to autonomously
-determine and apply the appropriate reasoning strategy at test time. We
-evaluate TATA through extensive experiments on six mathematical reasoning
-benchmarks, using both general-purpose and math-specialized LLMs. Empirical
-results demonstrate that TATA effectively combines the complementary strengths
-of CoT and TIR, achieving superior or comparable performance with improved
-inference efficiency compared to TIR alone. Further analysis underscores the
-critical role of aptitude-aware data selection in enabling LLMs to make
-effective and adaptive reasoning decisions and align reasoning strategies with
-model capabilities.
-
-摘要：現有的數學推理方法使用大型語言模型 (LLM) 仰賴思考鏈 (CoT) 來達到泛化性，或使用工具整合推理 (TIR) 來進行精確運算。儘管已有人嘗試結合這些方法，但它們主要依賴後選取或預定義策略，留下一個開放性的問題：LLM 是否能根據其內在能力自主調整其推理策略。在這項工作中，我們提出 TATA（根據其天賦來教授 LLM），這是一個適應性架構，讓 LLM 能夠自發地個人化其推理策略，並與其內在的天賦保持一致。TATA 在監督微調 (SFT) 期間納入了基礎 LLM 感知資料選取，以根據模型的獨特能力調整訓練資料。此方法讓 LLM 能夠在測試時自主決定並套用適當的推理策略。我們透過對六個數學推理基準進行廣泛的實驗來評估 TATA，使用通用和數學專用 LLM。經驗結果顯示，TATA 有效地結合了 CoT 和 TIR 的互補優勢，與僅使用 TIR 相比，達到了優越或相當的效能，並改善了推論效率。進一步的分析強調了天賦感知資料選取在讓 LLM 能夠做出有效且適應性的推理決策，並將推理策略與模型能力保持一致時所扮演的關鍵角色。
-
-##### **Atom of Thoughts for Markov LLM Test-Time Scaling**
-2502.12018v1 by Fengwei Teng, Zhaoyang Yu, Quan Shi, Jiayi Zhang, Chenglin Wu, Yuyu Luo
-
-Large Language Models (LLMs) achieve superior performance through
-training-time scaling, and test-time scaling further enhances their
-capabilities by conducting effective reasoning during inference. However, as
-the scale of reasoning increases, existing test-time scaling methods suffer
-from accumulated historical information, which not only wastes computational
-resources but also interferes with effective reasoning. To address this issue,
-we observe that complex reasoning progress is often achieved by solving a
-sequence of independent subquestions, each being self-contained and verifiable.
-These subquestions are essentially atomic questions, relying primarily on their
-current state rather than accumulated history, similar to the memoryless
-transitions in a Markov process. Based on this observation, we propose Atom of
-Thoughts (AoT), where each state transition in the reasoning process consists
-of decomposing the current question into a dependency-based directed acyclic
-graph and contracting its subquestions, forming a new atomic question state.
-This iterative decomposition-contraction process continues until reaching
-directly solvable atomic questions, naturally realizing Markov transitions
-between question states. Furthermore, these atomic questions can be seamlessly
-integrated into existing test-time scaling methods, enabling AoT to serve as a
-plug-in enhancement for improving reasoning capabilities. Experiments across
-six benchmarks demonstrate the effectiveness of AoT both as a standalone
-framework and a plug-in enhancement. Notably, on HotpotQA, when applied to
-gpt-4o-mini, AoT achieves an 80.6% F1 score, surpassing o3-mini by 3.4% and
-DeepSeek-R1 by 10.6%. The code will be available at
-https://github.com/qixucen/atom.
-
-摘要：大型語言模型 (LLM) 透過訓練時間擴充來達成卓越的效能，而測試時間擴充透過在推論期間進行有效的推理，進一步提升其能力。然而，隨著推理規模的擴大，現有的測試時間擴充方法會受到累積的歷史資訊影響，這不僅會浪費運算資源，還會干擾有效的推理。為了解決這個問題，我們觀察到複雜的推理進程通常是透過解決一系列獨立的子問題來達成，每個子問題都是獨立且可驗證的。這些子問題本質上是原子問題，主要依賴於它們的當前狀態，而不是累積的歷史，類似於馬可夫過程中的無記憶轉換。基於這個觀察，我們提出了思想原子 (AoT)，其中推理過程中每個狀態轉換都包含將當前問題分解為基於依賴關係的有向無環圖，並收縮其子問題，形成新的原子問題狀態。這個反覆的分解收縮過程會持續進行，直到達到可直接解決的原子問題，自然地實現問題狀態之間的馬可夫轉換。此外，這些原子問題可以無縫整合到現有的測試時間擴充方法中，讓 AoT 可以作為外掛程式強化功能，以改善推理能力。橫跨六個基準的實驗證明了 AoT 作為獨立架構和外掛程式強化的有效性。值得注意的是，在 HotpotQA 上，當應用於 gpt-4o-mini 時，AoT 達到了 80.6% 的 F1 分數，比 o3-mini 高出 3.4%，比 DeepSeek-R1 高出 10.6%。程式碼將在 https://github.com/qixucen/atom 上提供。
-
-##### **Demographic Attributes Prediction from Speech Using WavLM Embeddings**
-2502.12007v1 by Yuchen Yang, Thomas Thebaud, Najim Dehak
-
-This paper introduces a general classifier based on WavLM features, to infer
-demographic characteristics, such as age, gender, native language, education,
-and country, from speech. Demographic feature prediction plays a crucial role
-in applications like language learning, accessibility, and digital forensics,
-enabling more personalized and inclusive technologies. Leveraging pretrained
-models for embedding extraction, the proposed framework identifies key acoustic
-and linguistic features associated with demographic attributes, achieving a
-Mean Absolute Error (MAE) of 4.94 for age prediction and over 99.81% accuracy
-for gender classification across various datasets. Our system improves upon
-existing models by up to 30% relative in MAE and up to 10% relative in accuracy
-and F1 scores across tasks, leveraging a diverse range of datasets and large
-pretrained models to ensure robustness and generalizability. This study offers
-new insights into speaker diversity and provides a strong foundation for future
-research in speech-based demographic profiling.
-
-摘要：本文介紹一個基於 WavLM 特徵的一般分類器，用於從語音中推斷人口特徵，例如年齡、性別、母語、教育和國家。人口特徵預測在語言學習、無障礙性和數位鑑識等應用中扮演著至關重要的角色，能實現更個人化且包容性的技術。利用預先訓練的模型進行嵌入式萃取，提出的架構識別與人口屬性相關的主要音訊和語言特徵，在年齡預測中達到 4.94 的平均絕對誤差 (MAE)，在各種資料集中的性別分類中準確率超過 99.81%。我們的系統在平均絕對誤差上比現有模型提升了相對 30%，在準確率和 F1 分數上提升了相對 10%，利用各種資料集和大型預先訓練模型來確保穩健性和概括性。本研究提供了對說話者多元性的新見解，並為未來基於語音的人口特徵分析研究奠定了堅實的基礎。
-
-##### **Merging Language and Domain Specific Models: The Impact on Technical Vocabulary Acquisition**
-2502.12001v1 by Thibault Rousset, Taisei Kakibuchi, Yusuke Sasaki, Yoshihide Nomura
-
-This paper investigates the integration of technical vocabulary in merged
-language models. We explore the knowledge transfer mechanisms involved when
-combining a general-purpose language-specific model with a domain-specific
-model, focusing on the resulting model's comprehension of technical jargon. Our
-experiments analyze the impact of this merging process on the target model's
-proficiency in handling specialized terminology. We present a quantitative
-evaluation of the performance of the merged model, comparing it with that of
-the individual constituent models. The findings offer insights into the
-effectiveness of different model merging methods for enhancing domain-specific
-knowledge and highlight potential challenges and future directions in
-leveraging these methods for cross-lingual knowledge transfer in Natural
-Language Processing.
-
-摘要：本文探討了技術詞彙在合併語言模型中的整合。我們探討了結合一般用途語言特定模型與特定領域模型時所涉及的知識轉移機制，重點在於所產生模型對技術術語的理解。我們的實驗分析了此合併程序對目標模型處理專業術語能力的影響。我們提出了合併模型效能的量化評估，並將其與個別組成模型的效能進行比較。這些發現提供了見解，說明了不同模型合併方法在增強特定領域知識方面的效能，並強調了利用這些方法進行自然語言處理中跨語言知識轉移的潛在挑戰和未來方向。
-
-##### **Presumed Cultural Identity: How Names Shape LLM Responses**
-2502.11995v1 by Siddhesh Pawar, Arnav Arora, Lucie-Aimée Kaffee, Isabelle Augenstein
-
-Names are deeply tied to human identity. They can serve as markers of
-individuality, cultural heritage, and personal history. However, using names as
-a core indicator of identity can lead to over-simplification of complex
-identities. When interacting with LLMs, user names are an important point of
-information for personalisation. Names can enter chatbot conversations through
-direct user input (requested by chatbots), as part of task contexts such as CV
-reviews, or as built-in memory features that store user information for
-personalisation. We study biases associated with names by measuring cultural
-presumptions in the responses generated by LLMs when presented with common
-suggestion-seeking queries, which might involve making assumptions about the
-user. Our analyses demonstrate strong assumptions about cultural identity
-associated with names present in LLM generations across multiple cultures. Our
-work has implications for designing more nuanced personalisation systems that
-avoid reinforcing stereotypes while maintaining meaningful customisation.
-
-摘要：姓名與人類身分密不可分。它們可以作為個人特質、文化遺產和個人歷史的標記。然而，將姓名作為身分的核心指標可能會導致複雜身分的過度簡化。在與 LLM 互動時，使用者名稱是個人化的重要資訊點。姓名可以透過直接使用者輸入（聊天機器人要求）、作為履歷審查等任務情境的其中一部分，或作為儲存使用者資訊以供個人化的內建記憶功能，進入聊天機器人對話。我們透過衡量 LLM 在面對常見的建議尋求查詢時所產生的回應中的文化預設，來研究與姓名相關的偏見，這可能涉及對使用者的假設。我們的分析顯示，在跨多種文化的 LLM 世代中，與姓名相關的文化身分有強烈的假設。我們的研究對於設計更細緻的個人化系統有影響，這些系統避免強化刻板印象，同時維持有意義的客製化。
-
-##### **Characterizing Photorealism and Artifacts in Diffusion Model-Generated Images**
-2502.11989v1 by Negar Kamali, Karyn Nakamura, Aakriti Kumar, Angelos Chatzimparmpas, Jessica Hullman, Matthew Groh
-
-Diffusion model-generated images can appear indistinguishable from authentic
-photographs, but these images often contain artifacts and implausibilities that
-reveal their AI-generated provenance. Given the challenge to public trust in
-media posed by photorealistic AI-generated images, we conducted a large-scale
-experiment measuring human detection accuracy on 450 diffusion-model generated
-images and 149 real images. Based on collecting 749,828 observations and 34,675
-comments from 50,444 participants, we find that scene complexity of an image,
-artifact types within an image, display time of an image, and human curation of
-AI-generated images all play significant roles in how accurately people
-distinguish real from AI-generated images. Additionally, we propose a taxonomy
-characterizing artifacts often appearing in images generated by diffusion
-models. Our empirical observations and taxonomy offer nuanced insights into the
-capabilities and limitations of diffusion models to generate photorealistic
-images in 2024.
-
-摘要：擴散模型生成的影像看起來可能與真實照片無異，但這些影像通常包含人工智慧生成來源的瑕疵和不合理之處。由於寫實的人工智慧生成影像對公眾對媒體的信任構成挑戰，我們進行了一項大規模實驗，測量人類對 450 張擴散模型生成影像和 149 張真實影像的檢測準確度。根據收集自 50,444 位參與者的 749,828 次觀察和 34,675 則評論，我們發現影像的場景複雜性、影像中的瑕疵類型、影像的顯示時間，以及人類對人工智慧生成影像的策展，在人們準確區分真實影像和人工智慧生成影像方面都扮演重要的角色。此外，我們提出了一種分類法，用於描述經常出現在擴散模型生成的影像中的瑕疵。我們的經驗觀察和分類法為擴散模型在 2024 年生成寫實影像的能力和限制提供了細緻的見解。
-
-##### **Generating Text from Uniform Meaning Representation**
-2502.11973v1 by Emma Markle, Reihaneh Iranmanesh, Shira Wein
-
-Uniform Meaning Representation (UMR) is a recently developed graph-based
-semantic representation, which expands on Abstract Meaning Representation (AMR)
-in a number of ways, in particular through the inclusion of document-level
-information and multilingual flexibility. In order to effectively adopt and
-leverage UMR for downstream tasks, efforts must be placed toward developing a
-UMR technological ecosystem. Though only limited amounts of UMR annotations
-have been produced to date, in this work, we investigate the first approaches
-to producing text from multilingual UMR graphs: (1) a pipeline conversion of
-UMR to AMR, then using AMR-to-text generation models, (2) fine-tuning large
-language models with UMR data, and (3) fine-tuning existing AMR-to-text
-generation models with UMR data. Our best performing model achieves a
-multilingual BERTscore of 0.825 for English and 0.882 for Chinese when compared
-to the reference, which is a promising indication of the effectiveness of
-fine-tuning approaches for UMR-to-text generation with even limited amounts of
-UMR data.
-
-摘要：統一語意表示 (UMR) 是一種最近開發的基於圖形的語意表示，它在許多方面擴展了抽象語意表示 (AMR)，特別是透過納入文件層級資訊和多語言靈活性。為了有效採用和利用下游任務的 UMR，必須投入精力開發 UMR 技術生態系統。雖然到目前為止產生的 UMR 標註數量仍然有限，但在這項工作中，我們探討了從多語言 UMR 圖形產生文字的第一種方法：(1) 將 UMR 轉換為 AMR 的管道，然後使用 AMR 轉文字生成模型，(2) 使用 UMR 資料微調大型語言模型，以及 (3) 使用 UMR 資料微調現有的 AMR 轉文字生成模型。與參考相比，我們效能最好的模型在英文中達到 0.825 的多語言 BERT 分數，在中文中達到 0.882，這表示使用 UMR 資料進行 UMR 轉文字生成的微調方法具有良好的效果，即使 UMR 資料數量有限。
-
-##### **Learning Generalizable Prompt for CLIP with Class Similarity Knowledge**
-2502.11969v1 by Sehun Jung, Hyang-won Lee
-
-In vision-language models (VLMs), prompt tuning has shown its effectiveness
-in adapting models to downstream tasks. However, learned prompts struggle to
-generalize to unseen classes, as they tend to overfit to the classes that are
-targeted during prompt tuning. Examining failure cases, we observed that
-learned prompts disrupt the semantics of unseen classes, generating text
-embeddings with incorrect semantic relationships among classes. To address
-this, we propose Similarity Alignment Regularization (SAR), which regularizes
-learnable prompts to preserve the semantic relationships among classes captured
-by hand-crafted prompts. Specifically, we first obtain novel classes related to
-base classes using ChatGPT-4o and utilize them as potential unseen classes
-during prompt tuning. Then, by targeting both base and novel classes, SAR
-aligns the similarity relationships among text embeddings generated by
-learnable prompts with the similarity relationships from hand-crafted prompts.
-Extensive experiments applying SAR to existing prompt tuning methods
-demonstrate its effectiveness in improving generalization to unseen classes.
-
-摘要：在視覺語言模型 (VLM) 中，提示調整已展現其在調整模型至下游任務上的效能。然而，已學習的提示難以推廣至未見類別，因為它們傾向於過度擬合提示調整期間所鎖定的類別。在檢視失敗案例時，我們觀察到已學習的提示會擾亂未見類別的語義，產生具有類別間不正確語義關係的文字嵌入。為了解決此問題，我們提出相似度對齊正則化 (SAR)，它會對可學習提示進行正則化，以保留由手工提示捕捉到的類別間語義關係。具體來說，我們首先使用 ChatGPT-4o 取得與基本類別相關的新穎類別，並在提示調整期間將它們用作潛在的未見類別。然後，透過鎖定基本類別和新穎類別，SAR 會將可學習提示產生的文字嵌入之間的相似度關係與手工提示的相似度關係對齊。將 SAR 應用於現有提示調整方法的廣泛實驗證明了其在改善對未見類別的概括上的效能。
-
-##### **A MIMO Wireless Channel Foundation Model via CIR-CSI Consistency**
-2502.11965v1 by Jun Jiang, Wenjun Yu, Yunfan Li, Yuan Gao, Shugong Xu
-
-In the field of artificial intelligence, self-supervised learning has
-demonstrated superior generalization capabilities by leveraging large-scale
-unlabeled datasets for pretraining, which is especially critical for wireless
-communication models to adapt to a variety of scenarios. This paper
-innovatively treats Channel State Information (CSI) and Channel Impulse
-Response (CIR) as naturally aligned multi-modal data and proposes the first
-MIMO wireless channel foundation model, named CSI-CLIP. By effectively
-capturing the joint representations of both CIR and CSI, CSI-CLIP exhibits
-remarkable adaptability across scenarios and robust feature extraction
-capabilities. Experimental results show that in the positioning task, CSI-CLIP
-reduces the mean error distance by 22%; in the beam management task, it
-increases accuracy by 1% compared to traditional supervised methods, and it
-also improves on them in the channel identification task. These improvements not only highlight the
-potential and value of CSI-CLIP in integrating sensing and communication but
-also demonstrate its significant advantages over existing techniques. Moreover,
-viewing CSI and CIR as multi-modal pairs and applying contrastive learning to
-wireless channel foundation models opens up new research directions in the
-domain of MIMO wireless communications.
-
-摘要：在人工智能领域，自监督学习通过利用大规模无标签数据集进行预训练，展示了卓越的泛化能力，这对于无线通信模型适应各种场景尤为关键。本文创新地将信道状态信息 (CSI) 和信道脉冲响应 (CIR) 视为自然对齐的多模态数据，并提出了第一个 MIMO 无线信道基础模型，名为 CSI-CLIP。通过有效捕获 CIR 和 CSI 的联合表示，CSI-CLIP 在各种场景中表现出卓越的适应性和强大的特征提取能力。实验结果表明，在定位任务中，CSI-CLIP 将平均误差距离减少了 22%；在波束管理任务中，与传统的监督方法相比，其准确度提高了 1%，以及在信道识别任务中。这些改进不仅突出了 CSI-CLIP 在集成感知和通信方面的潜力和价值，而且还展示了其相对于现有技术的显着优势。此外，将 CSI 和 CIR 视为多模态对，并对比学习无线信道基础模型，为 MIMO 无线通信领域开辟了新的研究方向。
-
-##### **Navigating the Helpfulness-Truthfulness Trade-Off with Uncertainty-Aware Instruction Fine-Tuning**
-2502.11962v1 by Tianyi Wu, Jingwei Ni, Bryan Hooi, Jiaheng Zhang, Elliott Ash, See-Kiong Ng, Mrinmaya Sachan, Markus Leippold
-
-Instruction Fine-tuning (IFT) can enhance the helpfulness of Large Language
-Models (LLMs), but it may lower their truthfulness. This trade-off arises
-because IFT steers LLMs to generate responses with long-tail knowledge that is
-not well covered during pre-training, leading to more informative but less
-truthful answers when generalizing to unseen tasks. In this paper, we
-empirically demonstrate this helpfulness-truthfulness trade-off in IFT and
-propose **UNIT**, a novel IFT paradigm to address it. UNIT teaches LLMs
-to recognize their uncertainty and explicitly reflect it at the end of their
-responses. Experimental results show that UNIT-tuned models maintain their
-helpfulness while distinguishing between certain and uncertain claims, thereby
-reducing hallucinations.
-
-摘要：指令微調 (IFT) 可以提升大型語言模型 (LLM) 的實用性，但可能會降低其真實性。這種取捨會出現，是因為 IFT 引導 LLM 生成具有長尾知識的回應，而這些知識在預訓練期間並未充分涵蓋，導致在推廣到未見任務時，答案更具資訊性，但真實性較低。在本文中，我們透過實證展示 IFT 中的這種實用性與真實性取捨，並提出一個新穎的 IFT 典範 $\textbf{UNIT}$ 來解決這個問題。UNIT 教導 LLM 辨識其不確定性，並明確反映在其回應的結尾。實驗結果顯示，經過 UNIT 微調的模型維持其實用性，同時區分確定和不確定的說法，從而減少幻覺。
-
-##### **STRIVE: Structured Reasoning for Self-Improvement in Claim Verification**
-2502.11959v1 by Haisong Gong, Jing Li, Junfei Wu, Qiang Liu, Shu Wu, Liang Wang
-
-Claim verification is the task of determining whether a claim is supported or
-refuted by evidence. Self-improvement methods, where reasoning chains are
-generated and those leading to correct results are selected for training, have
-succeeded in tasks like mathematical problem solving. However, in claim
-verification, this approach struggles. Low-quality reasoning chains may falsely
-match binary truth labels, introducing faulty reasoning into the
-self-improvement process and ultimately degrading performance. To address this,
-we propose STRIVE: Structured Reasoning for Self-Improved Verification. Our
-method introduces a structured reasoning design with Claim Decomposition,
-Entity Analysis, and Evidence Grounding Verification. These components improve
-reasoning quality, reduce errors, and provide additional supervision signals
-for self-improvement. STRIVE begins with a warm-up phase, where the base model
-is fine-tuned on a small number of annotated examples to learn the structured
-reasoning design. It is then applied to generate reasoning chains for all
-training examples, selecting only those that are correct and structurally sound
-for subsequent self-improvement training. We demonstrate that STRIVE achieves
-significant improvements over baseline models, with a 31.4% performance gain
-over the base model and 20.7% over Chain of Thought on the HOVER datasets,
-highlighting its effectiveness.
-
-摘要：聲明驗證的任務是確定聲明是否受到證據支持或反駁。自改善方法（產生推理鏈並選擇導致正確結果的鏈進行訓練）已成功應用於數學問題求解等任務。然而，在聲明驗證中，此方法會遇到困難。低品質的推理鏈可能錯誤地匹配二元真值標籤，將錯誤的推理引入自改善流程並最終降低效能。為了解決此問題，我們提出 STRIVE：結構化推理自改善驗證。我們的模型引入了結構化推理設計，包含聲明分解、實體分析和證據依據驗證。這些組件改善了推理品質、減少了錯誤，並為自改善提供了額外的監督訊號。STRIVE 從熱身階段開始，在少數標註範例上微調基礎模型以學習結構化推理設計。接著將其應用於為所有訓練範例產生推理鏈，僅選擇正確且結構上合理的推理鏈進行後續的自改善訓練。我們證明 STRIVE 獲得了顯著的改善，在 HOVER 資料集上，效能比基礎模型提升了 31.4%，比 Chain of Thought 提升了 20.7%，突顯了其有效性。
-
-##### **Can Your Uncertainty Scores Detect Hallucinated Entity?**
-2502.11948v1 by Min-Hsuan Yeh, Max Kamachee, Seongheon Park, Yixuan Li
-
-To mitigate the impact of the hallucinatory nature of LLMs, many studies propose
-detecting hallucinated generation through uncertainty estimation. However,
-these approaches predominantly operate at the sentence or paragraph level,
-failing to pinpoint specific spans or entities responsible for hallucinated
-content. This lack of granularity is especially problematic for long-form
-outputs that mix accurate and fabricated information. To address this
-limitation, we explore entity-level hallucination detection. We propose a new
-data set, HalluEntity, which annotates hallucination at the entity level. Based
-on the dataset, we comprehensively evaluate uncertainty-based hallucination
-detection approaches across 17 modern LLMs. Our experimental results show that
-uncertainty estimation approaches focusing on individual token probabilities
-tend to over-predict hallucinations, while context-aware methods show better
-but still suboptimal performance. Through an in-depth qualitative study, we
-identify relationships between hallucination tendencies and linguistic
-properties and highlight important directions for future research.
-
-摘要：為了減輕 LLM 幻覺性質的影響，許多研究提出透過不確定性估計來偵測幻覺產生的內容。然而，這些方法主要是在句子或段落層級運作，無法精確找出對幻覺內容負責的特定區間或實體。這種缺乏粒度的現象對於混合了準確和虛構資訊的長篇輸出內容來說尤其成問題。為了解決這個限制，我們探討了實體層級的幻覺偵測。我們提出了一個新的資料集 HalluEntity，其中註解了實體層級的幻覺。根據該資料集，我們全面評估了 17 種現代 LLM 的基於不確定性的幻覺偵測方法。我們的實驗結果顯示，專注於個別代幣機率的不確定性估計方法傾向於過度預測幻覺，而具備背景感知能力的方法則表現得更好，但仍未達到最佳狀態。透過深入的定性研究，我們找出幻覺傾向與語言特徵之間的關係，並強調未來研究的重要方向。
-
-##### **Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction**
-2502.11946v1 by Ailin Huang, Boyong Wu, Bruce Wang, Chao Yan, Chen Hu, Chengli Feng, Fei Tian, Feiyu Shen, Jingbei Li, Mingrui Chen, Peng Liu, Ruihang Miao, Wang You, Xi Chen, Xuerui Yang, Yechang Huang, Yuxiang Zhang, Zheng Gong, Zixin Zhang, Brian Li, Changyi Wan, Hanpeng Hu, Ranchen Ming, Song Yuan, Xuelin Zhang, Yu Zhou, Bingxin Li, Buyun Ma, Kang An, Wei Ji, Wen Li, Xuan Wen, Yuankai Ma, Yuanwei Liang, Yun Mou, Bahtiyar Ahmidi, Bin Wang, Bo Li, Changxin Miao, Chen Xu, Chengting Feng, Chenrun Wang, Dapeng Shi, Deshan Sun, Dingyuan Hu, Dula Sai, Enle Liu, Guanzhe Huang, Gulin Yan, Heng Wang, Haonan Jia, Haoyang Zhang, Jiahao Gong, Jianchang Wu, Jiahong Liu, Jianjian Sun, Jiangjie Zhen, Jie Feng, Jie Wu, Jiaoren Wu, Jie Yang, Jinguo Wang, Jingyang Zhang, Junzhe Lin, Kaixiang Li, Lei Xia, Li Zhou, Longlong Gu, Mei Chen, Menglin Wu, Ming Li, Mingxiao Li, Mingyao Liang, Na Wang, Nie Hao, Qiling Wu, Qinyuan Tan, Shaoliang Pang, Shiliang Yang, Shuli Gao, Siqi Liu, Sitong Liu, Tiancheng Cao, Tianyu Wang, Wenjin Deng, Wenqing He, Wen Sun, Xin Han, Xiaomin Deng, Xiaojia Liu, Xu Zhao, Yanan Wei, Yanbo Yu, Yang Cao, Yangguang Li, Yangzhen Ma, Yanming Xu, Yaqiang Shi, Yilei Wang, Yinmin Zhong, Yu Luo, Yuanwei Lu, Yuhe Yin, Yuting Yan, Yuxiang Yang, Zhe Xie, Zheng Ge, Zheng Sun, Zhewei Huang, Zhichao Chang, Zidong Yang, Zili Zhang, Binxing Jiao, Daxin Jiang, Heung-Yeung Shum, Jiansheng Chen, Jing Li, Shuchang Zhou, Xiangyu Zhang, Xinhao Zhang, Yibo Zhu
-
-Real-time speech interaction, serving as a fundamental interface for
-human-machine collaboration, holds immense potential. However, current
-open-source models face limitations such as high costs in voice data
-collection, weakness in dynamic control, and limited intelligence. To address
-these challenges, this paper introduces Step-Audio, the first production-ready
-open-source solution. Key contributions include: 1) a 130B-parameter unified
-speech-text multi-modal model that achieves unified understanding and
-generation, with the Step-Audio-Chat version open-sourced; 2) a generative
-speech data engine that establishes an affordable voice cloning framework and
-produces the open-sourced lightweight Step-Audio-TTS-3B model through
-distillation; 3) an instruction-driven fine control system enabling dynamic
-adjustments across dialects, emotions, singing, and RAP; 4) an enhanced
-cognitive architecture augmented with tool calling and role-playing abilities
-to manage complex tasks effectively. Based on our new StepEval-Audio-360
-evaluation benchmark, Step-Audio achieves state-of-the-art performance in human
-evaluations, especially in terms of instruction following. On open-source
-benchmarks like LLaMA Question, it shows a 9.3% average performance improvement,
-demonstrating our commitment to advancing the development of open-source
-multi-modal language technologies. Our code and models are available at
-https://github.com/stepfun-ai/Step-Audio.
-
-摘要：即時語音互動作為人機協作的基本介面，蘊含著巨大的潛力。然而，目前的開源模型面臨著語音數據收集成本高、動態控制能力弱、智慧有限等限制。為了應對這些挑戰，本文介紹了 Step-Audio，這是第一個可投入生產的開源解決方案。主要貢獻包括：1) 一個 130B 參數的統一語音文字多模態模型，實現了統一的理解和生成，其中 Step-Audio-Chat 版本已開源；2) 一個生成式語音數據引擎，建立了一個經濟實惠的語音克隆框架，並通過蒸餾技術產生了開源的輕量級 Step-Audio-TTS-3B 模型；3) 一個指令驅動的精細控制系統，實現了跨方言、情緒、唱歌和饒舌的動態調整；4) 一個增強的認知架構，增加了工具呼叫和角色扮演的能力，以有效地管理複雜的任務。根據我們新的 StepEval-Audio-360 評估基準，Step-Audio 在人類評估中實現了最先進的性能，特別是在指令遵循方面。在 LLaMA Question 等開源基準測試中，表現出平均提升了 9.3%，證明了我們致力於推進開源多模態語言技術的發展。我們的程式碼和模型可在 https://github.com/stepfun-ai/Step-Audio 取得。
-
-##### **Deep Spatio-Temporal Neural Network for Air Quality Reanalysis**
-2502.11941v1 by Ammar Kheder, Benjamin Foreback, Lili Wang, Zhi-Song Liu, Michael Boy
-
-Air quality prediction is key to mitigating health impacts and guiding
-decisions, yet existing models tend to focus on temporal trends while
-overlooking spatial generalization. We propose AQ-Net, a spatiotemporal
-reanalysis model for both observed and unobserved stations in the near future.
-AQ-Net utilizes the LSTM and multi-head attention for the temporal regression.
-We also propose a cyclic encoding technique to ensure continuous time
-representation. To learn fine-grained spatial air quality estimation, we
-incorporate AQ-Net with the neural kNN to explore feature-based interpolation,
-such that we can fill the spatial gaps given coarse observation stations. To
-demonstrate the efficiency of our model for spatiotemporal reanalysis, we use
-data from 2013-2017 collected in northern China for PM2.5 analysis. Extensive
-experiments show that AQ-Net excels in air quality reanalysis, highlighting the
-potential of hybrid spatio-temporal models to better capture environmental
-dynamics, especially in urban areas where both spatial and temporal variability
-are critical.
-
-摘要：空气品质预测是减轻健康影响和指导决策的关键，但现有的模型倾向于关注时间趋势，而忽略空间概化。我们提出了 AQ-Net，这是一种时空再分析模型，适用于近期内已观测和未观测到的站点。AQ-Net 利用 LSTM 和多头注意力进行时间回归。我们还提出了一种循环编码技术来确保时间表示的连续性。为了学习细粒度的空间空气质量估计，我们将 AQ-Net 与神经 kNN 结合起来，以探索基于特征的插值，以便我们能够填充给定粗略观测站的空间空白。为了展示我们的模型在时空再分析中的效率，我们使用了 2013-2017 年在中国北部收集的 PM2.5 分析数据。大量的实验表明，AQ-Net 在空气质量再分析中表现出色，突出了混合时空模型在更好地捕捉环境动态方面的潜力，尤其是在空间和时间变异性都很关键的城市地区。
-
-##### **FitLight: Federated Imitation Learning for Plug-and-Play Autonomous Traffic Signal Control**
-2502.11937v1 by Yutong Ye, Yingbo Zhou, Zhusen Liu, Xiao Du, Hao Zhou, Xiang Lian, Mingsong Chen
-
-Although Reinforcement Learning (RL)-based Traffic Signal Control (TSC)
-methods have been extensively studied, their practical applications still raise
-some serious issues such as high learning cost and poor generalizability. This
-is because the "trial-and-error" training style makes RL agents extremely
-dependent on the specific traffic environment, which also requires a long
-convergence time. To address these issues, we propose a novel Federated
-Imitation Learning (FIL)-based framework for multi-intersection TSC, named
-FitLight, which allows RL agents to plug-and-play for any traffic environment
-without additional pre-training cost. Unlike existing imitation learning
-approaches that rely on pre-training RL agents with demonstrations, FitLight
-allows real-time imitation learning and seamless transition to reinforcement
-learning. Due to our proposed knowledge-sharing mechanism and novel hybrid
-pressure-based agent design, RL agents can quickly find a best control policy
-with only a few episodes. Moreover, for resource-constrained TSC scenarios,
-FitLight supports model pruning and heterogeneous model aggregation, such that
-RL agents can work on a micro-controller with merely 16 KB RAM and 32 KB
-ROM. Extensive experiments demonstrate that, compared to state-of-the-art
-methods, FitLight not only provides a superior starting point but also
-converges to a better final solution on both real-world and synthetic datasets,
-even under extreme resource limitations.
-
-摘要：儘管基於強化學習 (RL) 的交通號誌控制 (TSC) 方法已經廣泛研究，但其實際應用仍會產生一些嚴重的問題，例如學習成本高和泛化能力差。這是因為「試錯法」訓練風格讓 RL 代理極度依賴特定的交通環境，這也需要很長的收斂時間。為了解決這些問題，我們提出一個名為 FitLight 的基於聯邦模仿學習 (FIL) 的多路口 TSC 框架，讓 RL 代理可以即插即用於任何交通環境，而無需額外的預訓練成本。與依賴使用示範預訓練 RL 代理的現有模仿學習方法不同，FitLight 允許即時模仿學習和無縫過渡到強化學習。由於我們提出的知識共享機制和新穎的基於壓力的混合代理設計，RL 代理只需幾個回合即可快速找到最佳控制策略。此外，對於資源受限的 TSC 場景，FitLight 支援模型剪枝和異質模型聚合，讓 RL 代理可以在僅有 16 KB RAM 和 32 KB ROM 的微控制器上運行。廣泛的實驗證明，與最先進的方法相比，FitLight 不僅提供了更好的起點，而且在實際和合成資料集上都能收斂到更好的最終解決方案，即使在極端的資源限制下也是如此。
-
-##### **On Representational Dissociation of Language and Arithmetic in Large Language Models**
-2502.11932v1 by Riku Kisako, Tatsuki Kuribayashi, Ryohei Sasano
-
-The association between language and (non-linguistic) thinking ability in
-humans has long been debated, and recently, neuroscientific evidence of brain
-activity patterns has been considered. Such a scientific context naturally
-raises an interdisciplinary question -- what about such a language-thought
-dissociation in large language models (LLMs)? In this paper, as an initial
-foray, we explore this question by focusing on simple arithmetic skills (e.g.,
-$1+2=$ ?) as a thinking ability and analyzing the geometry of their encoding in
-LLMs' representation space. Our experiments with linear classifiers and cluster
-separability tests demonstrate that simple arithmetic equations and general
-language input are encoded in completely separated regions in LLMs' internal
-representation space across all the layers, which is also supported with more
-controlled stimuli (e.g., spelled-out equations). These tentatively suggest
-that arithmetic reasoning is mapped into a distinct region from general
-language input, which is in line with the neuroscientific observations of human
-brain activations, while we also point out their somewhat cognitively
-implausible geometric properties.
-
-摘要：人類語言與（非語言）思考能力之間的關聯性長期以來一直備受爭論，而最近，神經科學證據中的大腦活動模式也已受到考量。這樣一個科學背景自然會引發一個跨領域問題——大型語言模型（LLM）中這種語言與思考的分離又是如何？在本文中，作為初步探討，我們透過專注於簡單的算術技能（例如 $1+2=$？）作為思考能力，並分析它們在 LLM 表徵空間中的編碼幾何形狀來探討這個問題。我們透過線性分類器和群集可分性測試進行的實驗證明，簡單的算術方程式和一般語言輸入在 LLM 的內部表徵空間中所有層中都是以完全分離的區域編碼，這也獲得了更受控刺激（例如，拼寫出的方程式）的支持。這些初步表明算術推理被映射到與一般語言輸入不同的區域，這與人類大腦活化的神經科學觀察結果一致，同時我們也指出了它們在認知上有些難以置信的幾何屬性。
-
-##### **BRIGHTER: BRIdging the Gap in Human-Annotated Textual Emotion Recognition Datasets for 28 Languages**
-2502.11926v1 by Shamsuddeen Hassan Muhammad, Nedjma Ousidhoum, Idris Abdulmumin, Jan Philip Wahle, Terry Ruas, Meriem Beloucif, Christine de Kock, Nirmal Surange, Daniela Teodorescu, Ibrahim Said Ahmad, David Ifeoluwa Adelani, Alham Fikri Aji, Felermino D. M. A. Ali, Ilseyar Alimova, Vladimir Araujo, Nikolay Babakov, Naomi Baes, Ana-Maria Bucur, Andiswa Bukula, Guanqun Cao, Rodrigo Tufino Cardenas, Rendi Chevi, Chiamaka Ijeoma Chukwuneke, Alexandra Ciobotaru, Daryna Dementieva, Murja Sani Gadanya, Robert Geislinger, Bela Gipp, Oumaima Hourrane, Oana Ignat, Falalu Ibrahim Lawan, Rooweither Mabuya, Rahmad Mahendra, Vukosi Marivate, Andrew Piper, Alexander Panchenko, Charles Henrique Porto Ferreira, Vitaly Protasov, Samuel Rutunda, Manish Shrivastava, Aura Cristina Udrea, Lilian Diana Awuor Wanzare, Sophie Wu, Florian Valentin Wunderlich, Hanif Muhammad Zhafran, Tianhui Zhang, Yi Zhou, Saif M. Mohammad
-
-People worldwide use language in subtle and complex ways to express emotions.
-While emotion recognition -- an umbrella term for several NLP tasks --
-significantly impacts different applications in NLP and other fields, most work
-in the area is focused on high-resource languages. Therefore, this has led to
-major disparities in research and proposed solutions, especially for
-low-resource languages that suffer from the lack of high-quality datasets. In
-this paper, we present BRIGHTER -- a collection of multilabeled
-emotion-annotated datasets in 28 different languages. BRIGHTER covers
-predominantly low-resource languages from Africa, Asia, Eastern Europe, and
-Latin America, with instances from various domains annotated by fluent
-speakers. We describe the data collection and annotation processes and the
-challenges of building these datasets. Then, we report different experimental
-results for monolingual and crosslingual multi-label emotion identification, as
-well as intensity-level emotion recognition. We investigate results with and
-without using LLMs and analyse the large variability in performance across
-languages and text domains. We show that BRIGHTER datasets are a step towards
-bridging the gap in text-based emotion recognition and discuss their impact and
-utility.
-
-摘要：全球各地的人們都以微妙且複雜的方式使用語言來表達情感。雖然情緒辨識——幾個 NLP 任務的總稱——顯著影響 NLP 及其他領域中的不同應用，但該領域中的大部分工作都集中於高資源語言。因此，這導致研究和提出的解決方案出現重大差異，特別是對於缺乏高品質資料集的低資源語言。在本文中，我們提出 BRIGHTER——一個由 28 種不同語言組成的多標記情緒標註資料集。BRIGHTER 主要涵蓋來自非洲、亞洲、東歐和拉丁美洲的低資源語言，其中包含由流利講者標註的來自不同領域的實例。我們描述了資料收集和標註流程以及建立這些資料集的挑戰。然後，我們報告了單語和跨語言多標籤情緒識別的不同實驗結果，以及強度級別的情緒識別。我們研究了使用和不使用 LLM 的結果，並分析了跨語言和文字領域的性能的巨大變異。我們表明，BRIGHTER 資料集是縮小基於文字的情緒識別差距的一步，並討論了它們的影響和效用。
-
-##### **GRAPHGPT-O: Synergistic Multimodal Comprehension and Generation on Graphs**
-2502.11925v1 by Yi Fang, Bowen Jin, Jiacheng Shen, Sirui Ding, Qiaoyu Tan, Jiawei Han
-
-The rapid development of Multimodal Large Language Models (MLLMs) has enabled
-the integration of multiple modalities, including texts and images, within the
-large language model (LLM) framework. However, texts and images are usually
-interconnected, forming a multimodal attributed graph (MMAG). It is
-underexplored how MLLMs can incorporate the relational information
-(i.e., graph structure) and semantic information (i.e., texts
-and images) on such graphs for multimodal comprehension and generation. In this
-paper, we propose GraphGPT-o, which supports omni-multimodal understanding and
-creation on MMAGs. We first comprehensively study linearization variants to
-transform semantic and structural information as input for MLLMs. Then, we
-propose a hierarchical aligner that enables deep graph encoding, bridging the
-gap between MMAGs and MLLMs. Finally, we explore the inference choices,
-adapting MLLM to interleaved text and image generation in graph scenarios.
-Extensive experiments on three datasets from different domains demonstrate the
-effectiveness of our proposed method. Datasets and codes will be open-sourced
-upon acceptance.
-
-摘要：多模态大语言模型 (MLLM) 的快速发展，促进了文本和图像等多种模态在大型语言模型 (LLM) 框架内的整合。然而，文本和图像通常是相互关联的，形成多模态属性图 (MMAG)。对于 MLLM 如何整合此类图上的关系信息（即图结构）和语义信息（即文本和图像）以进行多模态理解和生成，目前仍未得到充分探索。在本文中，我们提出了 GraphGPT-o，它支持在 MMAG 上进行全方位多模态理解和创建。我们首先全面研究了线性化变体，以将语义和结构信息转换为 MLLM 的输入。然后，我们提出了一个分层对齐器，它支持深度图编码，弥合了 MMAG 和 MLLM 之间的差距。最后，我们探索了推理选择，使 MLLM 适应图场景中交错的文本和图像生成。来自不同领域的三组数据集上的大量实验表明了我们提出的方法的有效性。数据集和代码将在被接受后开源。
-
-##### **From Text to Trust: Empowering AI-assisted Decision Making with Adaptive LLM-powered Analysis**
-2502.11919v1 by Zhuoyan Li, Hangxiao Zhu, Zhuoran Lu, Ziang Xiao, Ming Yin
-
-AI-assisted decision making becomes increasingly prevalent, yet individuals
-often fail to utilize AI-based decision aids appropriately especially when the
-AI explanations are absent, potentially as they do not reflect on
-AI's decision recommendations critically. Large language models (LLMs), with
-their exceptional conversational and analytical capabilities, present great
-opportunities to enhance AI-assisted decision making in the absence of AI
-explanations by providing natural-language-based analysis of AI's decision
-recommendation, e.g., how each feature of a decision making task might
-contribute to the AI recommendation. In this paper, via a randomized
-experiment, we first show that presenting LLM-powered analysis of each task
-feature, either sequentially or concurrently, does not significantly improve
-people's AI-assisted decision performance. To enable decision makers to better
-leverage LLM-powered analysis, we then propose an algorithmic framework to
-characterize the effects of LLM-powered analysis on human decisions and
-dynamically decide which analysis to present. Our evaluation with human
-subjects shows that this approach effectively improves decision makers'
-appropriate reliance on AI in AI-assisted decision making.
-
-摘要：隨著 AI 輔助決策越來越普遍，但個人常常無法適當地利用 AI 決策輔助，特別是在沒有 AI 解釋的情況下，潛在原因是他們無法批判性地理解 AI 的決策建議。大型語言模型 (LLM) 擁有卓越的對話和分析能力，在沒有 AI 解釋的情況下，透過提供基於自然語言的 AI 決策建議分析，例如決策任務的每個特徵如何影響 AI 建議，為增強 AI 輔助決策提供了絕佳的機會。在本文中，我們透過隨機實驗，首先展示了以循序或並行的方式呈現 LLM 分析的每個任務特徵，並未顯著改善人們的 AI 輔助決策表現。為了讓決策者能更好地利用 LLM 分析，我們接著提出了演算法架構，用於描述 LLM 分析對人類決策的影響，並動態決定要呈現哪種分析。我們對人類受試者的評估顯示，這種方法有效地改善了決策者在 AI 輔助決策中對 AI 的適當依賴性。
-
-##### **EssayJudge: A Multi-Granular Benchmark for Assessing Automated Essay Scoring Capabilities of Multimodal Large Language Models**
-2502.11916v1 by Jiamin Su, Yibo Yan, Fangteng Fu, Han Zhang, Jingheng Ye, Xiang Liu, Jiahao Huo, Huiyu Zhou, Xuming Hu
-
-Automated Essay Scoring (AES) plays a crucial role in educational assessment
-by providing scalable and consistent evaluations of writing tasks. However,
-traditional AES systems face three major challenges: (1) reliance on
-handcrafted features that limit generalizability, (2) difficulty in capturing
-fine-grained traits like coherence and argumentation, and (3) inability to
-handle multimodal contexts. In the era of Multimodal Large Language Models
-(MLLMs), we propose EssayJudge, the first multimodal benchmark to evaluate AES
-capabilities across lexical-, sentence-, and discourse-level traits. By
-leveraging MLLMs' strengths in trait-specific scoring and multimodal context
-understanding, EssayJudge aims to offer precise, context-rich evaluations
-without manual feature engineering, addressing longstanding AES limitations.
-Our experiments with 18 representative MLLMs reveal gaps in AES performance
-compared to human evaluation, particularly in discourse-level traits,
-highlighting the need for further advancements in MLLM-based AES research. Our
-dataset and code will be available upon acceptance.
-
-摘要：自動化論文評分 (AES) 在教育評量中扮演著重要的角色，它能提供可擴充且一致的寫作任務評量。然而，傳統的 AES 系統面臨了三個主要的挑戰：(1) 依賴於限制泛用性的手工特徵，(2) 難以捕捉連貫性和論證等細微特徵，以及 (3) 無法處理多模態的脈絡。在多模態大型語言模型 (MLLM) 的時代，我們提出了 EssayJudge，這是第一個評估 AES 能力的多模態基準，橫跨詞彙、句子和篇章層級的特徵。EssayJudge 透過利用 MLLM 在特定特徵評分和多模態脈絡理解方面的優勢，旨在提供精確且富含脈絡的評量，而無需手動特徵工程，進而解決長久以來的 AES 限制。我們針對 18 個具代表性的 MLLM 進行的實驗揭露了 AES 效能與人類評量之間的差距，特別是在篇章層級的特徵，這凸顯了 MLLM 為基礎的 AES 研究需要進一步的進展。我們的資料集和程式碼將在通過驗證後提供。
-
-##### **On the robustness of ChatGPT in teaching Korean Mathematics**
-2502.11915v1 by Phuong-Nam Nguyen, Quang Nguyen-The, An Vu-Minh, Diep-Anh Nguyen, Xuan-Lam Pham
-
-ChatGPT, an Artificial Intelligence model, has the potential to revolutionize
-education. However, its effectiveness in solving non-English questions remains
-uncertain. This study evaluates ChatGPT's robustness using 586 Korean
-mathematics questions. ChatGPT achieves 66.72% accuracy, correctly answering
-391 out of 586 questions. We also assess its ability to rate mathematics
-questions based on eleven criteria and perform a topic analysis. Our findings
-show that ChatGPT's ratings align with educational theory and test-taker
-perspectives. While ChatGPT performs well in question classification, it
-struggles with non-English contexts, highlighting areas for improvement. Future
-research should address linguistic biases and enhance accuracy across diverse
-languages. Domain-specific optimizations and multilingual training could
-improve ChatGPT's role in personalized education.
-
-摘要：ChatGPT，一種人工智慧模型，具有革新教育的潛力。然而，其解決非英語問題的有效性仍不確定。本研究使用 586 個韓語數學問題評估 ChatGPT 的健壯性。ChatGPT 達到 66.72% 的準確率，正確回答了 586 個問題中的 391 個。我們也評估其根據 11 個標準對數學問題進行評分並執行主題分析的能力。我們的研究結果顯示，ChatGPT 的評分與教育理論和應試者的觀點一致。儘管 ChatGPT 在問題分類中表現良好，但它在非英語語境中表現不佳，突顯出需要改進的地方。未來的研究應解決語言偏見並提高跨不同語言的準確性。特定領域的優化和多語言訓練可以提升 ChatGPT 在個人化教育中的作用。
-
-##### **MMRC: A Large-Scale Benchmark for Understanding Multimodal Large Language Model in Real-World Conversation**
-2502.11903v1 by Haochen Xue, Feilong Tang, Ming Hu, Yexin Liu, Qidong Huang, Yulong Li, Chengzhi Liu, Zhongxing Xu, Chong Zhang, Chun-Mei Feng, Yutong Xie, Imran Razzak, Zongyuan Ge, Jionglong Su, Junjun He, Yu Qiao
-
-Recent multimodal large language models (MLLMs) have demonstrated significant
-potential in open-ended conversation, generating more accurate and personalized
-responses. However, their abilities to memorize, recall, and reason in
-sustained interactions within real-world scenarios remain underexplored. This
-paper introduces MMRC, a Multi-Modal Real-world Conversation benchmark for
-evaluating six core open-ended abilities of MLLMs: information extraction,
-multi-turn reasoning, information update, image management, memory recall, and
-answer refusal. With data collected from real-world scenarios, MMRC comprises
-5,120 conversations and 28,720 corresponding manually labeled questions, posing
-a significant challenge to existing MLLMs. Evaluations on 20 MLLMs in MMRC
-indicate an accuracy drop during open-ended interactions. We identify four
-common failure patterns: long-term memory degradation, inadequacies in updating
-factual knowledge, accumulated assumption of error propagation, and reluctance
-to say no. To mitigate these issues, we propose a simple yet effective
-NOTE-TAKING strategy, which can record key information from the conversation
-and remind the model during its responses, enhancing conversational
-capabilities. Experiments across six MLLMs demonstrate significant performance
-improvements.
-
-摘要：最近的多模态大型语言模型 (MLLM) 已在开放式对话中展现出显著的潜力，产生更准确且个性化的回应。然而，它们在现实世界场景中持续互动中的记忆、回忆和推理能力仍未得到充分探索。本文介绍了 MMRC，一个多模态现实世界对话基准，用于评估 MLLM 的六项核心开放式能力：信息提取、多轮推理、信息更新、图像管理、记忆回忆和答案拒绝。通过从现实世界场景中收集的数据，MMRC 包含 5,120 个对话和 28,720 个相应的手动标记问题，对现有的 MLLM 构成了重大挑战。在 MMRC 中对 20 个 MLLM 的评估表明，在开放式互动期间准确性下降。我们确定了四种常见的故障模式：长期记忆退化、更新事实知识的不足、累积的错误传播假设以及不愿说不。为了减轻这些问题，我们提出了一种简单但有效的笔记策略，它可以记录对话中的关键信息并在模型响应期间提醒模型，从而增强对话能力。六个 MLLM 的实验表明了显著的性能改进。
-
-##### **Building A Proof-Oriented Programmer That Is 64% Better Than GPT-4o Under Data Scarsity**
-2502.11901v1 by Dylan Zhang, Justin Wang, Tianran Sun
-
-Existing LMs struggle with proof-oriented programming due to data scarcity,
-which manifests in two key ways: (1) a lack of sufficient corpora for
-proof-oriented programming languages such as F*, and (2) the absence of
-large-scale, project-level proof-oriented implementations that can teach the
-model the intricate reasoning process when performing proof-oriented
-programming. We present the first work on synthetic data augmentation for
-project-level proof-oriented programming for both generation and repair. Our method
-addresses data scarcity by synthesizing basic proof-oriented programming
-problems for proficiency in that language; incorporating diverse coding data
-for reasoning capability elicitation and creating new proofs and repair data
-within existing repositories. This approach enables language models to both
-synthesize and repair proofs for function- and repository-level code. We show
-that our fine-tuned 14B parameter model, PoPilot, can exceed the performance of
-the models that outperform GPT-4o in project-level proof-oriented programming
-by a 64% relative margin, and can improve GPT-4o's performance by 54% by
-repairing its outputs over GPT-4o's self-repair.
-
-摘要：現有的語言模型在基於證明編程時會因資料稀少而有困難，這會以兩種關鍵方式表現出來：(1) 缺乏足夠的語料庫，例如 F* 等面向證明的程式語言，以及 (2) 缺乏大型的專案層級面向證明實作，這些實作可以在執行面向證明編程時，教導模型複雜的推理程序。我們提出第一個面向專案層級面向證明編程的合成資料擴充，用於產生和修復。我們的做法透過合成基本的面向證明編程問題來解決資料稀少的問題，以精通該語言；納入不同的編碼資料，以引出推理能力，並在現有的儲存庫中建立新的證明和修復資料。這個方法讓語言模型能夠為函數層級和儲存庫層級的程式碼合成和修復證明。我們展示經過微調的 14B 參數模型 PoPilot，可以超過在專案層級面向證明編程中表現優於 GPT-4o 的模型 64% 的相對差距，並且可以透過修復 GPT-4o 自我修復的輸出，將 GPT-4o 的效能提升 54%。
-
-##### **DLFR-VAE: Dynamic Latent Frame Rate VAE for Video Generation**
-2502.11897v1 by Zhihang Yuan, Siyuan Wang, Rui Xie, Hanling Zhang, Tongcheng Fang, Yuzhang Shang, Shengen Yan, Guohao Dai, Yu Wang
-
-In this paper, we propose the Dynamic Latent Frame Rate VAE (DLFR-VAE), a
-training-free paradigm that can make use of adaptive temporal compression in
-latent space. While existing video generative models apply fixed compression
-rates via pretrained VAE, we observe that real-world video content exhibits
-substantial temporal non-uniformity, with high-motion segments containing more
-information than static scenes. Based on this insight, DLFR-VAE dynamically
-adjusts the latent frame rate according to the content complexity.
-Specifically, DLFR-VAE comprises two core innovations: (1) A Dynamic Latent
-Frame Rate Scheduler that partitions videos into temporal chunks and adaptively
-determines optimal frame rates based on information-theoretic content
-complexity, and (2) A training-free adaptation mechanism that transforms
-pretrained VAE architectures into a dynamic VAE that can process features with
-variable frame rates. Our simple but effective DLFR-VAE can function as a
-plug-and-play module, seamlessly integrating with existing video generation
-models and accelerating the video generation process.
-
-摘要：在本文中，我們提出動態潛在幀率 VAE (DLFR-VAE)，一種無需訓練的範例，它可以在潛在空間中使用自適應時間壓縮。現有的影片生成模型透過預訓練的 VAE 應用固定壓縮率，但我們觀察到真實世界的影片內容展現出大量的時間非一致性，其中高動作片段包含比靜態場景更多的資訊。基於這個見解，DLFR-VAE 會根據內容複雜度動態調整潛在幀率。具體來說，DLFR-VAE 包含兩項核心創新：(1) 一個動態潛在幀率排程器，它將影片分割成時間區塊，並根據資訊理論內容複雜度自適應地決定最佳幀率，以及 (2) 一個無需訓練的適應機制，它將預訓練的 VAE 架構轉換成一個動態 VAE，它可以處理具有可變幀率的特色。我們簡單但有效的 DLFR-VAE 可以作為一個即插即用的模組，與現有的影片生成模型無縫整合，並加速影片生成過程。
-
-##### **CAMEL: Continuous Action Masking Enabled by Large Language Models for Reinforcement Learning**
-2502.11896v1 by Yanxiao Zhao, Yangge Qian, Jingyang Shan, Xiaolin Qin
-
-Reinforcement learning (RL) in continuous action spaces encounters persistent
-challenges, such as inefficient exploration and convergence to suboptimal
-solutions. To address these limitations, we propose CAMEL, a novel framework
-integrating LLM-generated suboptimal policies into the RL training pipeline.
-CAMEL leverages dynamic action masking and an adaptive epsilon-masking
-mechanism to guide exploration during early training stages while gradually
-enabling agents to optimize policies independently. At the core of CAMEL lies
-the integration of Python-executable suboptimal policies generated by LLMs
-based on environment descriptions and task objectives. Although simplistic and
-hard-coded, these policies offer valuable initial guidance for RL agents. To
-effectively utilize these priors, CAMEL employs masking-aware optimization to
-dynamically constrain the action space based on LLM outputs. Additionally,
-epsilon-masking gradually reduces reliance on LLM-generated guidance, enabling
-agents to transition from constrained exploration to autonomous policy
-refinement. Experimental validation on Gymnasium MuJoCo environments
-demonstrates the effectiveness of CAMEL. In Hopper-v4 and Ant-v4, LLM-generated
-policies significantly improve sample efficiency, achieving performance
-comparable to or surpassing expert masking baselines. For Walker2d-v4, where
-LLMs struggle to accurately model bipedal gait dynamics, CAMEL maintains robust
-RL performance without notable degradation, highlighting the framework's
-adaptability across diverse tasks. While CAMEL shows promise in enhancing
-sample efficiency and mitigating convergence challenges, these issues remain
-open for further research. Future work aims to generalize CAMEL to multimodal
-LLMs for broader observation-action spaces and automate policy evaluation,
-reducing human intervention and enhancing scalability in RL training pipelines.
-
-摘要：在連續動作空間中的強化學習 (RL) 會遇到持續的挑戰，例如探索效率低落和收斂至次佳解。為了解決這些限制，我們提出 CAMEL，一個將 LLM 生成的次佳策略整合到 RL 訓練管線中的新框架。CAMEL 透過動態動作遮罩和自適應 epsilon 遮罩機制來引導探索，同時逐漸讓代理程式能夠獨立最佳化策略。CAMEL 的核心在於整合由 LLM 生成的 Python 可執行次佳策略，這些策略基於環境描述和任務目標。儘管這些策略過於簡化且硬編碼，但它們為 RL 代理程式提供了有價值的初始指導。為了有效利用這些先驗知識，CAMEL 採用遮罩感知最佳化來根據 LLM 輸出動態限制動作空間。此外，epsilon 遮罩逐漸減少對 LLM 生成的指導依賴，讓代理程式能夠從受限探索轉換為自主策略改善。在 Gymnasium MuJoCo 環境上的實驗驗證證明了 CAMEL 的有效性。在 Hopper-v4 和 Ant-v4 中，LLM 生成的策略顯著提升了樣本效率，達到了與專家遮罩基準相近或超越的效能。對於 LLM 難以準確建模雙足步態動態的 Walker2d-v4，CAMEL 維持穩健的 RL 效能，且沒有顯著降低，突顯了該框架在不同任務中的適應性。儘管 CAMEL 在提升樣本效率和緩解收斂挑戰方面顯示出前景，但這些問題仍有待進一步研究。未來的研究工作旨在將 CAMEL 推廣到多模態 LLM，以涵蓋更廣泛的觀察動作空間，並自動化策略評估，減少人工介入並提升 RL 訓練管線的可擴充性。
-
-##### **Continual Quantization-Aware Pre-Training: When to transition from 16-bit to 1.58-bit pre-training for BitNet language models?**
-2502.11895v1 by Jacob Nielsen, Peter Schneider-Kamp, Lukas Galke
-
-Large language models (LLMs) require immense resources for training and
-inference. Quantization, a technique that reduces the precision of model
-parameters, offers a promising solution for improving LLM efficiency and
-sustainability. While post-training quantization methods typically achieve 4-8
-bits per parameter, recent research suggests that training LLMs with 1.58 bits
-per weight parameter from scratch can maintain model accuracy while greatly
-reducing memory requirements and energy consumption at inference time. Here, we
-investigate a training strategy for quantization-aware pre-training, where the
-models are first trained with 16-bit precision and then transition into
-1.58-bit quantization-aware training. Our results on 11 downstream tasks show
-that this 16-to-1.58-bit training strategy is preferable over full 1.58-bit
-training and leaves models closer to those which have undergone 16-bit
-training. We further investigate the effects of retaining the optimizer state
-at the transition point and gradually phasing in quantization strength --
-finding that both techniques alleviate the magnitude of loss spikes, but also
-that these effects can be compensated through further training.
-
-摘要：大型語言模型 (LLM) 需要大量的資源來進行訓練和推理。量化是一種降低模型參數精度的技術，為提高 LLM 效率和可持續性提供了一個有希望的解決方案。雖然訓練後量化方法通常每參數達到 4-8 位元，但最近的研究表明，從頭開始使用每權重參數 1.58 位元訓練 LLM 可以維持模型準確性，同時大幅減少推理時間的記憶體需求和能源消耗。在此，我們探討量化感知預訓練的訓練策略，其中模型首先使用 16 位元精度訓練，然後轉換為 1.58 位元量化感知訓練。我們在 11 個下游任務上的結果表明，這種 16 位元到 1.58 位元的訓練策略優於完全 1.58 位元訓練，並且使模型更接近經過 16 位元訓練的模型。我們進一步探討了在轉換點保留最佳化器狀態和逐漸調整量化強度的影響——發現這兩種技術都可以減輕損失尖峰的大小，但這些影響也可以透過進一步訓練來補償。
-
-##### **Revisiting Classification Taxonomy for Grammatical Errors**
-2502.11890v1 by Deqing Zou, Jingheng Ye, Yulu Liu, Yu Wu, Zishan Xu, Yinghui Li, Hai-Tao Zheng, Bingxu An, Zhao Wei, Yong Xu
-
-Grammatical error classification plays a crucial role in language learning
-systems, but existing classification taxonomies often lack rigorous validation,
-leading to inconsistencies and unreliable feedback. In this paper, we revisit
-previous classification taxonomies for grammatical errors by introducing a
-systematic and qualitative evaluation framework. Our approach examines four
-aspects of a taxonomy, i.e., exclusivity, coverage, balance, and usability.
-Then, we construct a high-quality grammatical error classification dataset
-annotated with multiple classification taxonomies and evaluate them based
-on our proposed evaluation framework. Our experiments reveal the drawbacks of
-existing taxonomies. Our contributions aim to improve the precision and
-effectiveness of error analysis, providing more understandable and actionable
-feedback for language learners.
-
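-Summary: Grammatical error classification plays a crucial role in language learning systems, but existing taxonomies often lack rigorous validation, leading to inconsistent and unreliable feedback. In this paper, we revisit previous classification taxonomies for grammatical errors by introducing a systematic and qualitative evaluation framework. Our approach examines four aspects of a taxonomy: exclusivity, coverage, balance, and usability. We then construct a high-quality grammatical error classification dataset annotated with multiple taxonomies and evaluate them using the proposed framework. Our experiments reveal the drawbacks of existing taxonomies. Our contributions aim to improve the precision and effectiveness of error analysis, providing more understandable and actionable feedback for language learners.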
-
-##### **Stonefish: Supporting Machine Learning Research in Marine Robotics**
-2502.11887v1 by Michele Grimaldi, Patryk Cieslak, Eduardo Ochoa, Vibhav Bharti, Hayat Rajani, Ignacio Carlucho, Maria Koskinopoulou, Yvan R. Petillot, Nuno Gracias
-
-Simulations are highly valuable in marine robotics, offering a cost-effective
-and controlled environment for testing in the challenging conditions of
-underwater and surface operations. Given the high costs and logistical
-difficulties of real-world trials, simulators capable of capturing the
-operational conditions of subsea environments have become key in developing and
-refining algorithms for remotely-operated and autonomous underwater vehicles.
-This paper highlights recent enhancements to the Stonefish simulator, an
-advanced open-source platform supporting development and testing of marine
-robotics solutions. Key updates include a suite of additional sensors, such as
-an event-based camera, a thermal camera, and an optical flow camera, as well
-as visual light communication, support for tethered operations, improved
-thruster modelling, more flexible hydrodynamics, and enhanced sonar accuracy.
-These developments and an automated annotation tool significantly bolster
-Stonefish's role in marine robotics research, especially in the field of
-machine learning, where training data with a known ground truth is hard or
-impossible to collect.
-
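-Summary: Simulations are highly valuable in marine robotics, offering a cost-effective and controlled environment for testing under the challenging conditions of underwater and surface operations. Given the high costs and logistical difficulties of real-world trials, simulators capable of capturing the operational conditions of subsea environments have become key to developing and refining algorithms for remotely operated and autonomous underwater vehicles. This paper highlights recent enhancements to the Stonefish simulator, an advanced open-source platform supporting the development and testing of marine robotics solutions. Key updates include a suite of additional sensors, such as an event-based camera, a thermal camera, and an optical flow camera, as well as visible light communication, support for tethered operations, improved thruster modelling, more flexible hydrodynamics, and enhanced sonar accuracy. These developments, together with an automated annotation tool, significantly strengthen Stonefish's role in marine robotics research, especially in machine learning, where training data with known ground truth is hard or impossible to collect.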
-
-##### **LIMR: Less is More for RL Scaling**
-2502.11886v1 by Xuefeng Li, Haoyang Zou, Pengfei Liu
-
-In this paper, we ask: what truly determines the effectiveness of RL training
-data for enhancing language models' reasoning capabilities? While recent
-advances like o1, Deepseek R1, and Kimi1.5 demonstrate RL's potential, the lack
-of transparency about training data requirements has hindered systematic
-progress. Starting directly from base models without distillation, we challenge
-the assumption that scaling up RL training data inherently improves
-performance. We demonstrate that a strategically selected subset of just 1,389
-samples can outperform the full 8,523-sample dataset. We introduce Learning
-Impact Measurement (LIM), an automated method to evaluate and prioritize
-training samples based on their alignment with model learning trajectories,
-enabling efficient resource utilization and scalable implementation. Our method
-achieves comparable or even superior performance using only 1,389 samples
-versus the full 8,523-sample dataset. Notably, while recent data-efficient
-approaches (e.g., LIMO and s1) show promise with 32B-scale models, we find that they
-significantly underperform at 7B-scale through supervised fine-tuning (SFT).
-In contrast, our RL-based LIMR achieves 16.7% higher accuracy on AIME24 and
-outperforms LIMO and s1 by 13.0% and 22.2% on MATH500. These results
-fundamentally reshape our understanding of RL scaling in LLMs, demonstrating
-that precise sample selection, rather than data scale, may be the key to
-unlocking enhanced reasoning capabilities. For reproducible research and future
-innovation, we are open-sourcing LIMR, including implementation of LIM,
-training and evaluation code, curated datasets, and trained models at
-https://github.com/GAIR-NLP/LIMR.
-
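-Summary: In this paper, we ask: what truly determines the effectiveness of RL training data for enhancing language models' reasoning capabilities? While recent advances such as o1, Deepseek R1, and Kimi1.5 demonstrate RL's potential, the lack of transparency about training data requirements has hindered systematic progress. Starting directly from base models without distillation, we challenge the assumption that scaling up RL training data inherently improves performance. We demonstrate that a strategically selected subset of just 1,389 samples can outperform the full 8,523-sample dataset. We introduce Learning Impact Measurement (LIM), an automated method that evaluates and prioritizes training samples based on their alignment with model learning trajectories, enabling efficient resource utilization and scalable implementation. Our method achieves comparable or even superior performance using only 1,389 samples versus the full 8,523-sample dataset. Notably, while recent data-efficient approaches (e.g., LIMO and s1) show promise with 32B-scale models, we find that they significantly underperform at 7B scale through supervised fine-tuning (SFT). In contrast, our RL-based LIMR achieves 16.7% higher accuracy on AIME24 and outperforms LIMO and s1 by 13.0% and 22.2% on MATH500. These results reshape our understanding of RL scaling in LLMs, suggesting that precise sample selection, rather than data scale, may be the key to unlocking enhanced reasoning capabilities. For reproducible research and future innovation, we are open-sourcing LIMR, including the implementation of LIM, training and evaluation code, curated datasets, and trained models, at https://github.com/GAIR-NLP/LIMR.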
-
-##### **Leveraging Dual Process Theory in Language Agent Framework for Real-time Simultaneous Human-AI Collaboration**
-2502.11882v1 by Shao Zhang, Xihuai Wang, Wenhao Zhang, Chaoran Li, Junru Song, Tingyu Li, Lin Qiu, Xuezhi Cao, Xunliang Cai, Wen Yao, Weinan Zhang, Xinbing Wang, Ying Wen
-
-Agents built on large language models (LLMs) have excelled in turn-by-turn
-human-AI collaboration but struggle with simultaneous tasks requiring real-time
-interaction. Latency issues and the challenge of inferring variable human
-strategies hinder their ability to make autonomous decisions without explicit
-instructions. Through experiments with current independent System 1 and System
-2 methods, we validate the necessity of using Dual Process Theory (DPT) in
-real-time tasks. We propose DPT-Agent, a novel language agent framework that
-integrates System 1 and System 2 for efficient real-time simultaneous human-AI
-collaboration. DPT-Agent's System 1 uses a Finite-state Machine (FSM) and
-code-as-policy for fast, intuitive, and controllable decision-making.
-DPT-Agent's System 2 integrates Theory of Mind (ToM) and asynchronous
-reflection to infer human intentions and perform reasoning-based autonomous
-decisions. We demonstrate the effectiveness of DPT-Agent through further
-experiments with rule-based agents and human collaborators, showing significant
-improvements over mainstream LLM-based frameworks. To the best of our
-knowledge, DPT-Agent is the first language agent framework that achieves
-successful real-time simultaneous human-AI collaboration autonomously. Code of
-DPT-Agent can be found in https://github.com/sjtu-marl/DPT-Agent.
-
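-Summary: Agents built on large language models (LLMs) have excelled in turn-by-turn human-AI collaboration but struggle with simultaneous tasks that require real-time interaction. Latency issues and the challenge of inferring variable human strategies hinder their ability to make autonomous decisions without explicit instructions. Through experiments with current independent System 1 and System 2 methods, we validate the necessity of using Dual Process Theory (DPT) in real-time tasks. We propose DPT-Agent, a novel language agent framework that integrates System 1 and System 2 for efficient real-time simultaneous human-AI collaboration. DPT-Agent's System 1 uses a Finite-state Machine (FSM) and code-as-policy for fast, intuitive, and controllable decision-making. DPT-Agent's System 2 integrates Theory of Mind (ToM) and asynchronous reflection to infer human intentions and make reasoning-based autonomous decisions. We demonstrate the effectiveness of DPT-Agent through further experiments with rule-based agents and human collaborators, showing significant improvements over mainstream LLM-based frameworks. To the best of our knowledge, DPT-Agent is the first language agent framework that autonomously achieves successful real-time simultaneous human-AI collaboration. The code for DPT-Agent is available at https://github.com/sjtu-marl/DPT-Agent.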
-
-##### **Hypothesis-Driven Theory-of-Mind Reasoning for Large Language Models**
-2502.11881v1 by Hyunwoo Kim, Melanie Sclar, Tan Zhi-Xuan, Lance Ying, Sydney Levine, Yang Liu, Joshua B. Tenenbaum, Yejin Choi
-
-Existing LLM reasoning methods have shown impressive capabilities across
-various tasks, such as solving math and coding problems. However, applying
-these methods to scenarios without ground-truth answers or rule-based
-verification methods - such as tracking the mental states of an agent - remains
-challenging. Inspired by the sequential Monte Carlo algorithm, we introduce
-thought-tracing, an inference-time reasoning algorithm designed to trace the
-mental states of specific agents by generating hypotheses and weighting them
-based on observations without relying on ground-truth solutions to questions in
-datasets. Our algorithm is modeled after the Bayesian theory-of-mind framework,
-using LLMs to approximate probabilistic inference over agents' evolving mental
-states based on their perceptions and actions. We evaluate thought-tracing on
-diverse theory-of-mind benchmarks, demonstrating significant performance
-improvements compared to baseline LLMs. Our experiments also reveal interesting
-behaviors of the recent reasoning models - e.g., o1 and R1 - on theory-of-mind,
-highlighting the difference of social reasoning compared to other domains.
-
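-Summary: Existing LLM reasoning methods have shown impressive capabilities across various tasks, such as solving math and coding problems. However, applying these methods to scenarios without ground-truth answers or rule-based verification, such as tracking the mental states of an agent, remains challenging. Inspired by the sequential Monte Carlo algorithm, we introduce thought-tracing, an inference-time reasoning algorithm designed to trace the mental states of specific agents by generating hypotheses and weighting them based on observations, without relying on ground-truth solutions to questions in datasets. Our algorithm is modeled after the Bayesian theory-of-mind framework, using LLMs to approximate probabilistic inference over agents' evolving mental states based on their perceptions and actions. We evaluate thought-tracing on diverse theory-of-mind benchmarks, demonstrating significant performance improvements over baseline LLMs. Our experiments also reveal interesting behaviors of recent reasoning models (e.g., o1 and R1) on theory-of-mind tasks, highlighting how social reasoning differs from other domains.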
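
As a rough illustration of the generate-and-weight idea in this abstract (not the authors' implementation), the sketch below keeps a set of hypotheses about an agent's mental state and reweights them by how well each one explains a new observation, in the spirit of a sequential Monte Carlo update. The names `reweight` and `toy_likelihood` are hypothetical; in practice the likelihood would presumably be estimated by an LLM rather than by keyword matching.

```python
from typing import Callable, Dict


def reweight(
    weights: Dict[str, float],
    observation: str,
    likelihood: Callable[[str, str], float],
) -> Dict[str, float]:
    """Multiply each hypothesis weight by its likelihood, then renormalize."""
    unnormalized = {h: w * likelihood(h, observation) for h, w in weights.items()}
    total = sum(unnormalized.values())
    if total == 0.0:
        # No hypothesis explains the observation: fall back to uniform weights.
        return {h: 1.0 / len(weights) for h in weights}
    return {h: w / total for h, w in unnormalized.items()}


def toy_likelihood(hypothesis: str, observation: str) -> float:
    """Hypothetical scorer: high if the hypothesized location is mentioned."""
    location = hypothesis.split()[-1]
    return 0.9 if location in observation else 0.1


# Two equally weighted hypotheses about where the agent believes the key is.
weights = {
    "thinks the key is in the drawer": 0.5,
    "thinks the key is in the box": 0.5,
}
posterior = reweight(weights, "walks to the drawer", toy_likelihood)
```

Repeating `reweight` after each new observation gives a running estimate of which hypothesis best fits the agent's behavior so far.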
-
-##### **Bitnet.cpp: Efficient Edge Inference for Ternary LLMs**
-2502.11880v1 by Jinheng Wang, Hansong Zhou, Ting Song, Shijie Cao, Yan Xia, Ting Cao, Jianyu Wei, Shuming Ma, Hongyu Wang, Furu Wei
-
-The advent of 1-bit large language models (LLMs), led by BitNet b1.58, has
-spurred interest in ternary LLMs. Despite this, research and practical
-applications focusing on efficient edge inference for ternary LLMs remain
-scarce. To bridge this gap, we introduce Bitnet.cpp, an inference system
-optimized for BitNet b1.58 and ternary LLMs. Given that mixed-precision matrix
-multiplication (mpGEMM) constitutes the bulk of inference time in ternary LLMs,
-Bitnet.cpp incorporates a novel mpGEMM library to facilitate
-sub-2-bits-per-weight, efficient and lossless inference. The library features
-two core solutions: Ternary Lookup Table (TL), which addresses spatial
-inefficiencies of previous bit-wise methods, and Int2 with a Scale (I2_S),
-which ensures lossless edge inference, both enabling high-speed inference. Our
-experiments show that Bitnet.cpp achieves up to a 6.25x increase in speed over
-full-precision baselines and up to 2.32x over low-bit baselines, setting new
-benchmarks in the field. Additionally, we expand TL to element-wise lookup
-table (ELUT) for low-bit LLMs in the appendix, presenting both theoretical and
-empirical evidence of its considerable potential. Bitnet.cpp is publicly
-available at https://github.com/microsoft/BitNet/tree/paper , offering a
-sophisticated solution for the efficient and practical deployment of edge LLMs.
-
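-Summary: The advent of 1-bit large language models (LLMs), led by BitNet b1.58, has spurred interest in ternary LLMs. Despite this, research and practical applications focusing on efficient edge inference for ternary LLMs remain scarce. To bridge this gap, we introduce Bitnet.cpp, an inference system optimized for BitNet b1.58 and ternary LLMs. Because mixed-precision matrix multiplication (mpGEMM) accounts for the bulk of inference time in ternary LLMs, Bitnet.cpp incorporates a novel mpGEMM library to enable sub-2-bits-per-weight, efficient, and lossless inference. The library features two core solutions: Ternary Lookup Table (TL), which addresses the spatial inefficiencies of previous bit-wise methods, and Int2 with a Scale (I2_S), which ensures lossless edge inference; both enable high-speed inference. Our experiments show that Bitnet.cpp achieves up to a 6.25x speedup over full-precision baselines and up to 2.32x over low-bit baselines, setting new benchmarks in the field. Additionally, in the appendix we extend TL to an element-wise lookup table (ELUT) for low-bit LLMs, presenting both theoretical and empirical evidence of its considerable potential. Bitnet.cpp is publicly available at https://github.com/microsoft/BitNet/tree/paper, offering a sophisticated solution for the efficient and practical deployment of edge LLMs.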
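
To make the "Int2 with a Scale" idea more concrete, the sketch below stores ternary weights as 2-bit codes, four per byte, alongside a single floating-point scale. This is only an assumed layout for illustration: the actual I2_S format in Bitnet.cpp may differ, and `pack_ternary`, `unpack_ternary`, and `dot_scaled` are hypothetical names.

```python
from typing import List, Sequence


def pack_ternary(weights: Sequence[int]) -> bytes:
    """Pack ternary weights (-1, 0, +1) into 2-bit codes, four per byte."""
    if any(w not in (-1, 0, 1) for w in weights):
        raise ValueError("weights must be ternary")
    # Pad with zero weights so the length is a multiple of four.
    padded = list(weights) + [0] * (-len(weights) % 4)
    out = bytearray()
    for i in range(0, len(padded), 4):
        byte = 0
        for j, w in enumerate(padded[i:i + 4]):
            byte |= (w + 1) << (2 * j)  # map -1, 0, +1 to codes 0, 1, 2
        out.append(byte)
    return bytes(out)


def unpack_ternary(packed: bytes, count: int) -> List[int]:
    """Recover the first `count` ternary weights from the packed bytes."""
    codes = [((b >> (2 * j)) & 0b11) - 1 for b in packed for j in range(4)]
    return codes[:count]


def dot_scaled(packed: bytes, count: int, scale: float,
               activations: Sequence[float]) -> float:
    """Dot product of scaled ternary weights with an activation vector."""
    weights = unpack_ternary(packed, count)
    return scale * sum(w * a for w, a in zip(weights, activations))


weights = [1, -1, 0, 1, -1]
packed = pack_ternary(weights)  # 5 weights fit in 2 bytes
result = dot_scaled(packed, len(weights), 0.5, [2.0, 3.0, 4.0, 5.0, 6.0])
```

Because every ternary value has an exact 2-bit code, the round trip through `pack_ternary` and `unpack_ternary` loses no information; only the shared scale is stored in floating point.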
-
-##### **VAQUUM: Are Vague Quantifiers Grounded in Visual Data?**
-2502.11874v1 by Hugh Mee Wong, Rick Nouwen, Albert Gatt
-
-Vague quantifiers such as "a few" and "many" are influenced by many
-contextual factors, including how many objects are present in a given context.
-In this work, we evaluate the extent to which vision-and-language models (VLMs)
-are compatible with humans when producing or judging the appropriateness of
-vague quantifiers in visual contexts. We release a novel dataset, VAQUUM,
-containing 20300 human ratings on quantified statements across a total of 1089
-images. Using this dataset, we compare human judgments and VLM predictions
-using three different evaluation methods. Our findings show that VLMs, like
-humans, are influenced by object counts in vague quantifier use. However, we
-find significant inconsistencies across models in different evaluation
-settings, suggesting that judging and producing vague quantifiers rely on two
-different processes.
-
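-Summary: Vague quantifiers such as "a few" and "many" are influenced by many contextual factors, including how many objects are present in a given context. In this work, we evaluate the extent to which vision-and-language models (VLMs) are compatible with humans when producing or judging the appropriateness of vague quantifiers in visual contexts. We release a novel dataset, VAQUUM, containing 20300 human ratings of quantified statements across a total of 1089 images. Using this dataset, we compare human judgments and VLM predictions with three different evaluation methods. Our findings show that VLMs, like humans, are influenced by object counts when using vague quantifiers. However, we find significant inconsistencies across models in different evaluation settings, suggesting that judging and producing vague quantifiers rely on two different processes.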
-
-##### **Southern Newswire Corpus: A Large-Scale Dataset of Mid-Century Wire Articles Beyond the Front Page**
-2502.11866v1 by Michael McRae
-
-I introduce a new large-scale dataset of historical wire articles from U.S.
-Southern newspapers, spanning 1960-1975 and covering multiple wire services:
-The Associated Press, United Press International, Newspaper Enterprise
-Association. Unlike prior work focusing on front-page content, this dataset
-captures articles across the entire newspaper, offering broader insight into
-mid-century Southern coverage. The dataset includes a version that has
-undergone an LLM-based text cleanup pipeline to reduce OCR noise, enhancing its
-suitability for quantitative text analysis. Additionally, duplicate versions of
-articles are retained to enable analysis of editorial differences in language
-and framing across newspapers. Each article is tagged by wire service,
-facilitating comparative studies of editorial patterns across agencies. This
-resource opens new avenues for research in computational social science,
-digital humanities, and historical linguistics, providing a detailed
-perspective on how Southern newspapers relayed national and international news
-during a transformative period in American history. The dataset will be made
-available upon publication or request for research purposes.
-
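-Summary: I introduce a new large-scale dataset of historical wire articles from U.S. Southern newspapers, spanning 1960-1975 and covering multiple wire services: The Associated Press, United Press International, and Newspaper Enterprise Association. Unlike prior work focusing on front-page content, this dataset captures articles across the entire newspaper, offering broader insight into mid-century Southern coverage. The dataset includes a version processed by an LLM-based text cleanup pipeline to reduce OCR noise, making it more suitable for quantitative text analysis. In addition, duplicate versions of articles are retained to enable analysis of editorial differences in language and framing across newspapers. Each article is tagged by wire service, facilitating comparative studies of editorial patterns across agencies. This resource opens new avenues for research in computational social science, digital humanities, and historical linguistics, providing a detailed perspective on how Southern newspapers relayed national and international news during a transformative period in American history. The dataset will be made available upon publication or upon request for research purposes.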
-
-##### **FedEAT: A Robustness Optimization Framework for Federated LLMs**
-2502.11863v1 by Yahao Pang, Xingyuan Wu, Xiaojin Zhang, Wei Chen, Hai Jin
-
-Significant advancements have been made by Large Language Models (LLMs) in
-the domains of natural language understanding and automated content creation.
-However, they still face persistent problems, including substantial
-computational costs and inadequate availability of training data. The
-combination of Federated Learning (FL) and LLMs (federated LLMs) offers a
-solution by leveraging distributed data while protecting privacy, which
-positions it as an ideal choice for sensitive domains. However, Federated LLMs
-still suffer from robustness challenges, including data heterogeneity,
-malicious clients, and adversarial attacks, which greatly hinder their
-applications. We first introduce the robustness problems in federated LLMs. To
-address these challenges, we propose FedEAT (Federated Embedding space
-Adversarial Training), a novel framework that applies adversarial training in
-the embedding space of client LLM and employs a robust aggregation approach,
-specifically geometric median aggregation, to enhance the robustness of
-Federated LLMs. Our experiments demonstrate that FedEAT effectively improves
-the robustness of Federated LLMs with minimal performance loss.
-
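-Summary: Large Language Models (LLMs) have made significant advances in natural language understanding and automated content creation.
-However, they still face persistent problems, including substantial computational costs and inadequate availability of training data.
-The combination of Federated Learning (FL) and LLMs (federated LLMs) offers a solution by leveraging distributed data while protecting privacy, which makes it an ideal choice for sensitive domains.
-However, federated LLMs still suffer from robustness challenges, including data heterogeneity, malicious clients, and adversarial attacks, which greatly hinder their application.
-We first introduce the robustness problems in federated LLMs. To address these challenges, we propose FedEAT (Federated Embedding space Adversarial Training), a novel framework that applies adversarial training in the embedding space of the client LLM and employs a robust aggregation approach, specifically geometric median aggregation, to enhance the robustness of federated LLMs.
-Our experiments demonstrate that FedEAT effectively improves the robustness of federated LLMs with minimal performance loss.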
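
The geometric median aggregation mentioned in this abstract can be sketched with Weiszfeld's iterative algorithm, a standard way to approximate the geometric median. This is an illustrative sketch only, not FedEAT's actual aggregation code; the function name, the flat-list representation of client updates, and the iteration and tolerance defaults are all assumptions.

```python
import math
from typing import List, Sequence


def geometric_median(
    points: Sequence[Sequence[float]],
    iters: int = 100,
    eps: float = 1e-9,
) -> List[float]:
    """Approximate the geometric median with Weiszfeld's algorithm."""
    dim = len(points[0])
    # Start from the coordinate-wise mean.
    estimate = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    for _ in range(iters):
        numerator = [0.0] * dim
        denominator = 0.0
        for p in points:
            dist = math.sqrt(sum((p[d] - estimate[d]) ** 2 for d in range(dim)))
            weight = 1.0 / max(dist, eps)  # clamp to avoid division by zero
            denominator += weight
            for d in range(dim):
                numerator[d] += weight * p[d]
        updated = [n / denominator for n in numerator]
        shift = math.sqrt(sum((updated[d] - estimate[d]) ** 2 for d in range(dim)))
        estimate = updated
        if shift < eps:
            break
    return estimate


# Three similar client updates and one outlier (e.g., a malicious client).
updates = [[0.9, 1.1], [1.0, 1.0], [1.1, 0.9], [50.0, -50.0]]
robust = geometric_median(updates)
mean = [sum(u[d] for u in updates) / len(updates) for d in range(2)]
```

In this toy example the plain mean is pulled far toward the outlier, while the geometric median stays close to the three similar updates, which is the property that makes it attractive for robust aggregation.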
-
-##### **Understanding In-Context Machine Translation for Low-Resource Languages: A Case Study on Manchu**
-2502.11862v1 by Renhao Pei, Yihong Liu, Peiqin Lin, François Yvon, Hinrich Schütze
-
-In-context machine translation (MT) with large language models (LLMs) is a
-promising approach for low-resource MT, as it can readily take advantage of
-linguistic resources such as grammar books and dictionaries. Such resources are
-usually selectively integrated into the prompt so that LLMs can directly
-perform translation without any specific training, via their in-context
-learning capability (ICL). However, the relative importance of each type of
-resource e.g., dictionary, grammar book, and retrieved parallel examples, is
-not entirely clear. To address this gap, this study systematically investigates
-how each resource and its quality affects the translation performance, with the
-Manchu language as our case study. To remove any prior knowledge of Manchu
-encoded in the LLM parameters and single out the effect of ICL, we also
-experiment with an encrypted version of Manchu texts. Our results indicate that
-high-quality dictionaries and good parallel examples are very helpful, while
-grammars hardly help. In a follow-up study, we showcase a promising application
-of in-context MT: parallel data augmentation as a way to bootstrap the
-conventional MT model. When monolingual data abound, generating synthetic
-parallel data through in-context MT offers a pathway to mitigate data scarcity
-and build effective and efficient low-resource neural MT systems.
-
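-Summary: In-context machine translation (MT) with large language models (LLMs) is a promising approach for low-resource MT, as it can readily take advantage of linguistic resources such as grammar books and dictionaries. Such resources are usually selectively integrated into the prompt so that LLMs can perform translation directly, without any specific training, via their in-context learning (ICL) capability. However, the relative importance of each type of resource (e.g., dictionary, grammar book, and retrieved parallel examples) is not entirely clear. To address this gap, this study systematically investigates how each resource and its quality affect translation performance, with Manchu as our case study. To remove any prior knowledge of Manchu encoded in the LLM parameters and isolate the effect of ICL, we also experiment with an encrypted version of Manchu texts. Our results indicate that high-quality dictionaries and good parallel examples are very helpful, while grammars hardly help. In a follow-up study, we showcase a promising application of in-context MT: parallel data augmentation as a way to bootstrap a conventional MT model. When monolingual data is abundant, generating synthetic parallel data through in-context MT offers a way to mitigate data scarcity and build effective and efficient low-resource neural MT systems.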
-
-##### **Exploring Large Language Models in Healthcare: Insights into Corpora Sources, Customization Strategies, and Evaluation Metrics**
-2502.11861v1 by Shuqi Yang, Mingrui Jing, Shuai Wang, Jiaxin Kou, Manfei Shi, Weijie Xing, Yan Hu, Zheng Zhu
-
-This study reviewed the use of Large Language Models (LLMs) in healthcare,
-focusing on their training corpora, customization techniques, and evaluation
-metrics. A systematic search of studies from 2021 to 2024 identified 61
-articles. Four types of corpora were used: clinical resources, literature,
-open-source datasets, and web-crawled data. Common construction techniques
-included pre-training, prompt engineering, and retrieval-augmented generation,
-with 44 studies combining multiple methods. Evaluation metrics were categorized
-into process, usability, and outcome metrics, with outcome metrics divided into
-model-based and expert-assessed outcomes. The study identified critical gaps in
-corpus fairness, which contributed to biases from geographic, cultural, and
-socio-economic factors. The reliance on unverified or unstructured data
-highlighted the need for better integration of evidence-based clinical
-guidelines. Future research should focus on developing a tiered corpus
-architecture with vetted sources and dynamic weighting, while ensuring model
-transparency. Additionally, the lack of standardized evaluation frameworks for
-domain-specific models called for comprehensive validation of LLMs in
+##### **SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation**
+2502.13143v1 by Zekun Qi, Wenyao Zhang, Yufei Ding, Runpei Dong, Xinqiang Yu, Jingwen Li, Lingyun Xu, Baoyu Li, Xialin He, Guofan Fan, Jiazhao Zhang, Jiawei He, Jiayuan Gu, Xin Jin, Kaisheng Ma, Zhizheng Zhang, He Wang, Li Yi
+
+Spatial intelligence is a critical component of embodied AI, promoting robots
+to understand and interact with their environments. While recent advances have
+enhanced the ability of VLMs to perceive object locations and positional
+relationships, they still lack the capability to precisely understand object
+orientations, a key requirement for tasks involving fine-grained manipulations.
+Addressing this limitation not only requires geometric reasoning but also an
+expressive and intuitive way to represent orientation. In this context, we
+propose that natural language offers a more flexible representation space than
+canonical frames, making it particularly suitable for instruction-following
+robotic systems. In this paper, we introduce the concept of semantic
+orientation, which defines object orientations using natural language in a
+reference-frame-free manner (e.g., the "plug-in" direction of a USB or the
+"handle" direction of a knife). To support this, we construct OrienText300K,
+a large-scale dataset of 3D models annotated with semantic orientations that
+link geometric understanding to functional semantics. By integrating semantic
+orientation into a VLM system, we enable robots to generate manipulation
+actions with both positional and orientational constraints. Extensive
+experiments in simulation and real world demonstrate that our approach
+significantly enhances robotic manipulation capabilities, e.g., 48.7% accuracy
+on Open6DOR and 74.9% accuracy on SIMPLER.
+
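+Summary: Spatial intelligence is a critical component of embodied AI, enabling robots to understand and interact with their environments. While recent advances have enhanced the ability of VLMs to perceive object locations and positional relationships, they still lack the capability to precisely understand object orientations, a key requirement for tasks involving fine-grained manipulation. Addressing this limitation requires not only geometric reasoning but also an expressive and intuitive way to represent orientation. In this context, we propose that natural language offers a more flexible representation space than canonical frames, making it particularly suitable for instruction-following robotic systems. In this paper, we introduce the concept of semantic orientation, which defines object orientations using natural language in a reference-frame-free manner (e.g., the "plug-in" direction of a USB or the "handle" direction of a knife). To support this, we construct OrienText300K, a large-scale dataset of 3D models annotated with semantic orientations that link geometric understanding to functional semantics. By integrating semantic orientation into a VLM system, we enable robots to generate manipulation actions with both positional and orientational constraints. Extensive experiments in simulation and the real world demonstrate that our approach significantly enhances robotic manipulation capabilities, e.g., 48.7% accuracy on Open6DOR and 74.9% accuracy on SIMPLER.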
+
+##### **Pre-training Auto-regressive Robotic Models with 4D Representations**
+2502.13142v1 by Dantong Niu, Yuvan Sharma, Haoru Xue, Giscard Biamby, Junyi Zhang, Ziteng Ji, Trevor Darrell, Roei Herzig
+
+Foundation models pre-trained on massive unlabeled datasets have
+revolutionized natural language and computer vision, exhibiting remarkable
+generalization capabilities, thus highlighting the importance of pre-training.
+Yet, efforts in robotics have struggled to achieve similar success, limited by
+either the need for costly robotic annotations or the lack of representations
+that effectively model the physical world. In this paper, we introduce ARM4R,
+an Auto-regressive Robotic Model that leverages low-level 4D Representations
+learned from human video data to yield a better pre-trained robotic model.
+Specifically, we focus on utilizing 3D point tracking representations from
+videos derived by lifting 2D representations into 3D space via monocular depth
+estimation across time. These 4D representations maintain a shared geometric
+structure between the points and robot state representations up to a linear
+transformation, enabling efficient transfer learning from human video data to
+low-level robotic control. Our experiments show that ARM4R can transfer
+efficiently from human video data to robotics and consistently improves
+performance on tasks across various robot environments and configurations.
+
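+Summary: Foundation models pre-trained on massive unlabeled datasets have revolutionized natural language processing and computer vision, exhibiting remarkable generalization capabilities and thus highlighting the importance of pre-training. Yet efforts in robotics have struggled to achieve similar success, limited either by the need for costly robotic annotations or by the lack of representations that effectively model the physical world. In this paper, we introduce ARM4R, an Auto-regressive Robotic Model that leverages low-level 4D Representations learned from human video data to yield a better pre-trained robotic model. Specifically, we focus on 3D point tracking representations from videos, derived by lifting 2D representations into 3D space via monocular depth estimation across time. These 4D representations maintain a shared geometric structure between the points and robot state representations up to a linear transformation, enabling efficient transfer learning from human video data to low-level robotic control. Our experiments show that ARM4R can transfer efficiently from human video data to robotics and consistently improves performance on tasks across various robot environments and configurations.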
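
The step of lifting 2D points into 3D via depth can be illustrated with standard pinhole-camera unprojection. This is a minimal sketch under assumptions not stated in the abstract: a pinhole model with known intrinsics (`fx`, `fy`, `cx`, `cy`) and a per-frame depth value for each tracked point. The function names are hypothetical and this is not ARM4R's pipeline.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


def unproject(u: float, v: float, depth: float,
              fx: float, fy: float, cx: float, cy: float) -> Point3D:
    """Map pixel (u, v) with a depth value to a 3D point in the camera frame."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def lift_track(track_2d: Sequence[Tuple[float, float]],
               depths: Sequence[float],
               fx: float, fy: float, cx: float, cy: float) -> List[Point3D]:
    """Lift one 2D point track (one pixel per frame) to a 3D track over time."""
    return [unproject(u, v, z, fx, fy, cx, cy)
            for (u, v), z in zip(track_2d, depths)]


# A point moving right in the image while its estimated depth stays constant.
track = [(320.0, 240.0), (420.0, 240.0)]
depths = [2.0, 2.0]
track_3d = lift_track(track, depths, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```

Stacking such 3D tracks across frames yields a 3D-plus-time description of scene motion, which is one way to read the "4D" in the abstract.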
+
+##### **UniGuardian: A Unified Defense for Detecting Prompt Injection, Backdoor Attacks and Adversarial Attacks in Large Language Models**
+2502.13141v1 by Huawei Lin, Yingjie Lao, Tong Geng, Tan Yu, Weijie Zhao
+
+Large Language Models (LLMs) are vulnerable to attacks like prompt injection,
+backdoor attacks, and adversarial attacks, which manipulate prompts or models
+to generate harmful outputs. In this paper, departing from traditional deep
+learning attack paradigms, we explore their intrinsic relationship and
+collectively term them Prompt Trigger Attacks (PTA). This raises a key
+question: Can we determine if a prompt is benign or poisoned? To address this,
+we propose UniGuardian, the first unified defense mechanism designed to detect
+prompt injection, backdoor attacks, and adversarial attacks in LLMs.
+Additionally, we introduce a single-forward strategy to optimize the detection
+pipeline, enabling simultaneous attack detection and text generation within a
+single forward pass. Our experiments confirm that UniGuardian accurately and
+efficiently identifies malicious prompts in LLMs.
+
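+Summary: Large Language Models (LLMs) are vulnerable to attacks such as prompt injection, backdoor attacks, and adversarial attacks, which manipulate prompts or models to generate harmful outputs. In this paper, departing from traditional deep learning attack paradigms, we explore their intrinsic relationship and collectively term them Prompt Trigger Attacks (PTA). This raises a key question: can we determine whether a prompt is benign or poisoned? To address this, we propose UniGuardian, the first unified defense mechanism designed to detect prompt injection, backdoor attacks, and adversarial attacks in LLMs. Additionally, we introduce a single-forward strategy to optimize the detection pipeline, enabling simultaneous attack detection and text generation within a single forward pass. Our experiments confirm that UniGuardian accurately and efficiently identifies malicious prompts in LLMs.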
+
+##### **AIDE: AI-Driven Exploration in the Space of Code**
+2502.13138v1 by Zhengyao Jiang, Dominik Schmidt, Dhruv Srikanth, Dixing Xu, Ian Kaplan, Deniss Jacenko, Yuxiang Wu
+
+Machine learning, the foundation of modern artificial intelligence, has
+driven innovations that have fundamentally transformed the world. Yet, behind
+advancements lies a complex and often tedious process requiring labor- and
+compute-intensive iteration and experimentation. Engineers and scientists
+developing machine learning models spend much of their time on trial-and-error
+tasks instead of conceptualizing innovative solutions or research hypotheses.
+To address this challenge, we introduce AI-Driven Exploration (AIDE), a machine
+learning engineering agent powered by large language models (LLMs). AIDE frames
+machine learning engineering as a code optimization problem, and formulates
+trial-and-error as a tree search in the space of potential solutions. By
+strategically reusing and refining promising solutions, AIDE effectively trades
+computational resources for enhanced performance, achieving state-of-the-art
+results on multiple machine learning engineering benchmarks, including our
+Kaggle evaluations, OpenAI MLE-Bench and METR's RE-Bench.
+
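+Summary: Machine learning, the foundation of modern artificial intelligence, has driven innovations that have fundamentally transformed the world. Yet behind these advancements lies a complex and often tedious process requiring labor- and compute-intensive iteration and experimentation. Engineers and scientists developing machine learning models spend much of their time on trial-and-error tasks instead of conceptualizing innovative solutions or research hypotheses. To address this challenge, we introduce AI-Driven Exploration (AIDE), a machine learning engineering agent powered by large language models (LLMs). AIDE frames machine learning engineering as a code optimization problem and formulates trial-and-error as a tree search in the space of potential solutions. By strategically reusing and refining promising solutions, AIDE effectively trades computational resources for enhanced performance, achieving state-of-the-art results on multiple machine learning engineering benchmarks, including our Kaggle evaluations, OpenAI MLE-Bench, and METR's RE-Bench.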
+
+##### **Theorem Prover as a Judge for Synthetic Data Generation**
+2502.13137v1 by Joshua Ong Jun Leang, Giwon Hong, Wenda Li, Shay B. Cohen
+
+The demand for synthetic data in mathematical reasoning has increased due to
+its potential to enhance the mathematical capabilities of large language models
+(LLMs). However, ensuring the validity of intermediate reasoning steps remains
+a significant challenge, affecting data quality. While formal verification via
+theorem provers effectively validates LLM reasoning, the autoformalisation of
+mathematical proofs remains error-prone. In response, we introduce iterative
+autoformalisation, an approach that iteratively refines theorem prover
+formalisation to mitigate errors, thereby increasing the execution rate on the
+Lean prover from 60% to 87%. Building upon that, we introduce Theorem Prover as
+a Judge (TP-as-a-Judge), a method that employs theorem prover formalisation to
+rigorously assess LLM intermediate reasoning, effectively integrating
+autoformalisation with synthetic data generation. Finally, we present
+Reinforcement Learning from Theorem Prover Feedback (RLTPF), a framework that
+replaces human annotation with theorem prover feedback in Reinforcement
+Learning from Human Feedback (RLHF). Across multiple LLMs, applying
+TP-as-a-Judge and RLTPF improves benchmarks with only 3,508 samples, achieving
+5.56% accuracy gain on Mistral-7B for MultiArith, 6.00% on Llama-2-7B for
+SVAMP, and 3.55% on Llama-3.1-8B for AQUA.
+
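+Summary: The demand for synthetic data in mathematical reasoning has increased due to its potential to enhance the mathematical capabilities of large language models (LLMs). However, ensuring the validity of intermediate reasoning steps remains a significant challenge that affects data quality. While formal verification via theorem provers effectively validates LLM reasoning, the autoformalisation of mathematical proofs remains error-prone. In response, we introduce iterative autoformalisation, an approach that iteratively refines theorem prover formalisation to mitigate errors, thereby increasing the execution rate on the Lean prover from 60% to 87%. Building on that, we introduce Theorem Prover as a Judge (TP-as-a-Judge), a method that employs theorem prover formalisation to rigorously assess LLM intermediate reasoning, effectively integrating autoformalisation with synthetic data generation. Finally, we present Reinforcement Learning from Theorem Prover Feedback (RLTPF), a framework that replaces human annotation with theorem prover feedback in Reinforcement Learning from Human Feedback (RLHF). Across multiple LLMs, applying TP-as-a-Judge and RLTPF improves benchmark results with only 3,508 samples, achieving a 5.56% accuracy gain on Mistral-7B for MultiArith, 6.00% on Llama-2-7B for SVAMP, and 3.55% on Llama-3.1-8B for AQUA.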
+
+##### **Sleepless Nights, Sugary Days: Creating Synthetic Users with Health Conditions for Realistic Coaching Agent Interactions**
+2502.13135v1 by Taedong Yun, Eric Yang, Mustafa Safdari, Jong Ha Lee, Vaishnavi Vinod Kumar, S. Sara Mahdavi, Jonathan Amar, Derek Peyton, Reut Aharony, Andreas Michaelides, Logan Schneider, Isaac Galatzer-Levy, Yugang Jia, John Canny, Arthur Gretton, Maja Matarić
+
+We present an end-to-end framework for generating synthetic users for
+evaluating interactive agents designed to encourage positive behavior changes,
+such as in health and lifestyle coaching. The synthetic users are grounded in
+health and lifestyle conditions, specifically sleep and diabetes management in
+this study, to ensure realistic interactions with the health coaching agent.
+Synthetic users are created in two stages: first, structured data are generated
+grounded in real-world health and lifestyle factors in addition to basic
+demographics and behavioral attributes; second, full profiles of the synthetic
+users are developed conditioned on the structured data. Interactions between
+synthetic users and the coaching agent are simulated using generative
+agent-based models such as Concordia, or directly by prompting a language
+model. Using two independently-developed agents for sleep and diabetes coaching
+as case studies, the validity of this framework is demonstrated by analyzing
+the coaching agent's understanding of the synthetic users' needs and
+challenges. Finally, through multiple blinded evaluations of user-coach
+interactions by human experts, we demonstrate that our synthetic users with
+health and behavioral attributes more accurately portray real human users with
+the same attributes, compared to generic synthetic users not grounded in such
+attributes. The proposed framework lays the foundation for efficient
+development of conversational agents through extensive, realistic, and grounded
+simulated interactions.
+
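+Summary: We present an end-to-end framework for generating synthetic users for evaluating interactive agents designed to encourage positive behavior changes, such as in health and lifestyle coaching. The synthetic users are grounded in health and lifestyle conditions, specifically sleep and diabetes management in this study, to ensure realistic interactions with the health coaching agent. Synthetic users are created in two stages: first, structured data are generated, grounded in real-world health and lifestyle factors in addition to basic demographics and behavioral attributes; second, full profiles of the synthetic users are developed conditioned on the structured data. Interactions between synthetic users and the coaching agent are simulated using generative agent-based models such as Concordia, or directly by prompting a language model. Using two independently developed agents for sleep and diabetes coaching as case studies, we demonstrate the validity of this framework by analyzing the coaching agent's understanding of the synthetic users' needs and challenges. Finally, through multiple blinded evaluations of user-coach interactions by human experts, we demonstrate that our synthetic users with health and behavioral attributes portray real human users with the same attributes more accurately than generic synthetic users not grounded in such attributes. The proposed framework lays the foundation for efficient development of conversational agents through extensive, realistic, and grounded simulated interactions.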
+
+##### **Learning to Defer for Causal Discovery with Imperfect Experts**
+2502.13132v1 by Oscar Clivio, Divyat Mahajan, Perouz Taslakian, Sara Magliacane, Ioannis Mitliagkas, Valentina Zantedeschi, Alexandre Drouin
+
+Integrating expert knowledge, e.g. from large language models, into causal
+discovery algorithms can be challenging when the knowledge is not guaranteed to
+be correct. Expert recommendations may contradict data-driven results, and
+their reliability can vary significantly depending on the domain or specific
+query. Existing methods based on soft constraints or inconsistencies in
+predicted causal relationships fail to account for these variations in
+expertise. To remedy this, we propose L2D-CD, a method for gauging the
+correctness of expert recommendations and optimally combining them with
+data-driven causal discovery results. By adapting learning-to-defer (L2D)
+algorithms for pairwise causal discovery (CD), we learn a deferral function
+that selects whether to rely on classical causal discovery methods using
+numerical data or expert recommendations based on textual meta-data. We
+evaluate L2D-CD on the canonical Tübingen pairs dataset and demonstrate its
+superior performance compared to both the causal discovery method and the
+expert used in isolation. Moreover, our approach identifies domains where the
+expert's performance is strong or weak. Finally, we outline a strategy for
+generalizing this approach to causal discovery on graphs with more than two
+variables, paving the way for further research in this area.
+
+摘要：整合專家知識，例如從大型語言模型中整合到因果發現演算法中，當知識無法保證正確時會很有挑戰性。專家建議可能會與資料驅動的結果相矛盾，而且他們的可靠性可能會根據領域或特定查詢而有顯著差異。現有的基於軟約束或預測因果關係中不一致的方法無法說明專業知識中的這些變化。為了補救這一點，我們提出了 L2D-CD，一種用於評估專家建議的正確性並將其與資料驅動的因果發現結果最佳結合的方法。透過調整學習延遲 (L2D) 演算法以進行成對因果發現 (CD)，我們學習了一個延遲函數，用於選擇依賴使用數值資料的傳統因果發現方法或基於文字元資料的專家建議。我們在經典的 Tübingen 對資料集上評估 L2D-CD，並證明其與單獨使用的因果發現方法和專家相比具有優越的效能。此外，我們的做法識別出專家表現強或弱的領域。最後，我們概述了一種將此方法推廣到具有兩個以上變數的圖表上進行因果發現的策略，為此領域的進一步研究鋪平了道路。
+
+##### **Rethinking Diverse Human Preference Learning through Principal Component Analysis**
+2502.13131v1 by Feng Luo, Rui Yang, Hao Sun, Chunyuan Deng, Jiarui Yao, Jingyan Shen, Huan Zhang, Hanjie Chen
+
+Understanding human preferences is crucial for improving foundation models
+and building personalized AI systems. However, preferences are inherently
+diverse and complex, making it difficult for traditional reward models to
+capture their full range. While fine-grained preference data can help,
+collecting it is expensive and hard to scale. In this paper, we introduce
+Decomposed Reward Models (DRMs), a novel approach that extracts diverse human
+preferences from binary comparisons without requiring fine-grained annotations.
+Our key insight is to represent human preferences as vectors and analyze them
+using Principal Component Analysis (PCA). By constructing a dataset of
+embedding differences between preferred and rejected responses, DRMs identify
+orthogonal basis vectors that capture distinct aspects of preference. These
+decomposed rewards can be flexibly combined to align with different user needs,
+offering an interpretable and scalable alternative to traditional reward
+models. We demonstrate that DRMs effectively extract meaningful preference
+dimensions (e.g., helpfulness, safety, humor) and adapt to new users without
+additional training. Our results highlight DRMs as a powerful framework for
+personalized and interpretable LLM alignment.
+
+摘要：理解人類偏好對於改進基礎模型和建構個人化 AI 系統至關重要。然而，偏好本質上是多樣且複雜的，這使得傳統的獎勵模型難以捕捉其全部範圍。雖然細緻的偏好數據可能有所幫助，但收集這些數據既昂貴又難以擴展。在本文中，我們介紹了解構獎勵模型 (DRM)，這是一種新穎的方法，它可以從二元比較中提取多樣化的人類偏好，而不需要細緻的註解。我們的關鍵見解是將人類偏好表示為向量，並使用主成分分析 (PCA) 對其進行分析。透過建構偏好和拒絕回應之間嵌入差異的數據集，DRM 識別出正交基向量，這些向量捕捉偏好的不同面向。這些解構的獎勵可以靈活地結合在一起，以符合不同的使用者需求，提供一種可解釋且可擴展的傳統獎勵模型替代方案。我們證明了 DRM 可以有效地提取有意義的偏好維度（例如，有用性、安全性、幽默感），並在不需要額外訓練的情況下適應新的使用者。我們的結果突顯了 DRM 作為個人化且可解釋的 LLM 對齊強大架構。
+
+##### **Magma: A Foundation Model for Multimodal AI Agents**
+2502.13130v1 by Jianwei Yang, Reuben Tan, Qianhui Wu, Ruijie Zheng, Baolin Peng, Yongyuan Liang, Yu Gu, Mu Cai, Seonghyeon Ye, Joel Jang, Yuquan Deng, Lars Liden, Jianfeng Gao
+
+We present Magma, a foundation model that serves multimodal AI agentic tasks
+in both the digital and physical worlds. Magma is a significant extension of
+vision-language (VL) models in that it not only retains the VL understanding
+ability (verbal intelligence) of the latter, but is also equipped with the
+ability to plan and act in the visual-spatial world (spatial-temporal
+intelligence) and complete agentic tasks ranging from UI navigation to robot
+manipulation. To endow it with agentic capabilities, Magma is pretrained on large
+amounts of heterogeneous datasets spanning images, videos, and robotics
+data, where the actionable visual objects (e.g., clickable buttons in GUI) in
+images are labeled by Set-of-Mark (SoM) for action grounding, and the object
+movements (e.g., the trace of human hands or robotic arms) in videos are
+labeled by Trace-of-Mark (ToM) for action planning. Extensive experiments show
+that SoM and ToM reach great synergy and facilitate the acquisition of
+spatial-temporal intelligence for our Magma model, which is fundamental to a
+wide range of tasks as shown in Fig.1. In particular, Magma creates new
+state-of-the-art results on UI navigation and robotic manipulation tasks,
+outperforming previous models that are specifically tailored to these tasks. On
+image and video-related multimodal tasks, Magma also compares favorably to
+popular large multimodal models that are trained on much larger datasets. We
+make our model and code public for reproducibility at
+https://microsoft.github.io/Magma.
+
+摘要：我們提出 Magma，這是一個基礎模型，用於服務數位和物理世界中的多模態 AI 代理任務。Magma 是視覺語言 (VL) 模型的重大延伸，它不僅保留了後者的 VL 理解能力（語言智能），還具備在視覺空間世界中規劃和行動的能力（時空智能），並完成從 UI 導航到機器人操作的代理任務。為了賦予代理能力，Magma 在從影像、影片到機器人資料的大量異質資料集上進行預訓練，其中影像中的可操作視覺物件（例如 GUI 中的可點擊按鈕）由動作接地 Set-of-Mark (SoM) 標記，影片中的物件動作（例如人手或機器手臂的軌跡）由動作規劃 Trace-of-Mark (ToM) 標記。廣泛的實驗表明，SoM 和 ToM 達到了極大的協同作用，並促進了我們 Magma 模型的時空智能的獲取，這對於圖 1 中所示的各種任務至關重要。特別是，Magma 在 UI 導航和機器人操作任務上創造了新的最先進的結果，優於專門針對這些任務的先前模型。在影像和影片相關的多模態任務上，Magma 也與在更大資料集上訓練的流行大型多模態模型相比，表現得很好。我們公開我們的模型和程式碼，以便在 https://microsoft.github.io/Magma 上重現。
+
+##### **SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation**
+2502.13128v1 by Zihan Liu, Shuangrui Ding, Zhixiong Zhang, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Dahua Lin, Jiaqi Wang
+
+Text-to-song generation, the task of creating vocals and accompaniment from
+textual inputs, poses significant challenges due to domain complexity and data
+scarcity. Existing approaches often employ multi-stage generation procedures,
+resulting in cumbersome training and inference pipelines. In this paper, we
+propose SongGen, a fully open-source, single-stage auto-regressive transformer
+designed for controllable song generation. The proposed model facilitates
+fine-grained control over diverse musical attributes, including lyrics and
+textual descriptions of instrumentation, genre, mood, and timbre, while also
+offering an optional three-second reference clip for voice cloning. Within a
+unified auto-regressive framework, SongGen supports two output modes: mixed
+mode, which generates a mixture of vocals and accompaniment directly, and
+dual-track mode, which synthesizes them separately for greater flexibility in
+downstream applications. We explore diverse token pattern strategies for each
+mode, leading to notable improvements and valuable insights. Furthermore, we
+design an automated data preprocessing pipeline with effective quality control.
+To foster community engagement and future research, we will release our model
+weights, training code, annotated data, and preprocessing pipeline. The
+generated samples are showcased on our project page at
+https://liuzh-19.github.io/SongGen/ , and the code will be available at
+https://github.com/LiuZH-19/SongGen .
+
+摘要：文字轉歌曲生成，從文字輸入建立人聲和伴奏的任務，由於領域複雜性和資料稀少性，因此構成重大挑戰。現有方法通常採用多階段生成程序，導致訓練和推論管道繁瑣。在本文中，我們提出 SongGen，一個完全開源的單階段自迴歸轉換器，專為可控歌曲生成而設計。所提出的模型促進對各種音樂屬性的細粒度控制，包括歌詞和樂器、類型、情緒和音色的文字描述，同時還提供可選的三秒參考片段以進行語音複製。在統一的自迴歸框架內，SongGen 支援兩種輸出模式：混合模式，直接生成人聲和伴奏的混合，以及雙軌模式，將它們分開合成以提高下游應用程式的靈活性。我們探索每種模式的不同代幣模式策略，從而帶來顯著的改進和有價值的見解。此外，我們設計了一個自動化資料預處理管道，具備有效的品質控制。為了促進社區參與和未來的研究，我們將釋出我們的模型權重、訓練程式碼、註解資料和預處理管道。生成的範例展示在我們的專案頁面 https://liuzh-19.github.io/SongGen/，程式碼將在 https://github.com/LiuZH-19/SongGen 中提供。
+
+##### **Facilitating Long Context Understanding via Supervised Chain-of-Thought Reasoning**
+2502.13127v1 by Jingyang Lin, Andy Wong, Tian Xia, Shenghua He, Hui Wei, Mei Han, Jiebo Luo
+
+Recent advances in Large Language Models (LLMs) have enabled them to process
+increasingly longer sequences, ranging from 2K to 2M tokens and even beyond.
+However, simply extending the input sequence length does not necessarily lead
+to effective long-context understanding. In this study, we integrate
+Chain-of-Thought (CoT) reasoning into LLMs in a supervised manner to facilitate
+effective long-context understanding. To achieve this, we introduce
+LongFinanceQA, a synthetic dataset in the financial domain designed to improve
+long-context reasoning. Unlike existing long-context synthetic data,
+LongFinanceQA includes intermediate CoT reasoning before the final conclusion,
+which encourages LLMs to perform explicit reasoning, improving accuracy and
+interpretability in long-context understanding. To generate synthetic CoT
+reasoning, we propose Property-driven Agentic Inference (PAI), an agentic
+framework that simulates human-like reasoning steps, including property
+extraction, retrieval, and summarization. We evaluate PAI's reasoning
+capabilities by assessing GPT-4o-mini w/ PAI on the Loong benchmark,
+outperforming standard GPT-4o-mini by 20.0%. Furthermore, we fine-tune
+LLaMA-3.1-8B-Instruct on LongFinanceQA, achieving a 24.6% gain on Loong's
+financial subset.
+
+摘要：大型語言模型 (LLM) 的最新進展讓它們能夠處理越來越長的序列，範圍從 2K 到 2M 個符號，甚至更長。然而，僅僅延長輸入序列長度並不會必然導致有效的長語境理解。在本研究中，我們以監督的方式將思考鏈 (CoT) 推理整合到 LLM 中，以促進有效的長語境理解。為此，我們引入了 LongFinanceQA，這是一個在金融領域中的合成數據集，旨在改進長語境推理。與現有的長語境合成數據不同，LongFinanceQA 在最終結論之前包含了中間的 CoT 推理，這鼓勵 LLM 執行明確的推理，從而提高長語境理解的準確性和可解釋性。為了生成合成的 CoT 推理，我們提出了基於屬性的主體推理 (PAI)，這是一個模擬類人推理步驟的主體框架，包括屬性提取、檢索和總結。我們通過評估搭載 PAI 的 GPT-4o-mini 在 Loong 基準上的推理能力，使其比標準的 GPT-4o-mini 高出 20.0%，來評估 PAI 的推理能力。此外，我們對 LLaMA-3.1-8B-Instruct 進行了微調，在 Loong 的金融子集中實現了 24.6% 的增益。
+
+##### **RuozhiBench: Evaluating LLMs with Logical Fallacies and Misleading Premises**
+2502.13125v1 by Zenan Zhai, Hao Li, Xudong Han, Zhenxuan Zhang, Yixuan Zhang, Timothy Baldwin, Haonan Li
+
+Recent advances in large language models (LLMs) have shown that they can
+answer questions requiring complex reasoning. However, their ability to
+identify and respond to text containing logical fallacies or deliberately
+misleading premises remains less studied. To address this gap, we introduce
+RuozhiBench, a bilingual dataset comprising 677 carefully curated questions
+that contain various forms of deceptive reasoning, meticulously crafted through
+extensive human effort and expert review. In a comprehensive evaluation of 17
+LLMs from 5 series on RuozhiBench using both open-ended and two-choice
+formats, we conduct extensive analyses on evaluation protocols and result
+patterns. Despite their high scores on conventional benchmarks, these models
+showed limited ability to detect and reason correctly about logical fallacies,
+with even the best-performing model, Claude-3-haiku, achieving only 62%
+accuracy, compared to human accuracy of more than 90%.
+
+摘要：大型語言模型 (LLM) 的最新進展顯示，它們可以回答需要複雜推理的問題。然而，它們識別和回應包含邏輯謬誤或故意誤導前提的文本的能力仍未得到充分研究。為了解決這個差距，我們引入了 RuozhiBench，這是一個雙語資料集，包含 677 個經過仔細策劃的問題，其中包含各種形式的欺騙性推理，並透過廣泛的人力投入和專家審查精心製作。在使用開放式和二選一格式對來自 5 個系列的 17 個 LLM 進行 RuozhiBench 的全面評估中，我們對評估協定和結果模式進行了廣泛的分析。儘管它們在傳統基準測試中獲得了高分，但這些模型在檢測和正確推理邏輯謬誤方面表現出的能力有限，即使是效能最好的模型 Claude-3-haiku，與人類的 90% 以上相比，也只達到了 62% 的準確度。
+
+##### **NaturalReasoning: Reasoning in the Wild with 2.8M Challenging Questions**
+2502.13124v1 by Weizhe Yuan, Jane Yu, Song Jiang, Karthik Padthe, Yang Li, Dong Wang, Ilia Kulikov, Kyunghyun Cho, Yuandong Tian, Jason E Weston, Xian Li
+
+Scaling reasoning capabilities beyond traditional domains such as math and
+coding is hindered by the lack of diverse and high-quality questions. To
+overcome this limitation, we introduce a scalable approach for generating
+diverse and challenging reasoning questions, accompanied by reference answers.
+We present NaturalReasoning, a comprehensive dataset comprising 2.8 million
+questions that span multiple domains, including STEM fields (e.g., Physics,
+Computer Science), Economics, Social Sciences, and more. We demonstrate the
+utility of the questions in NaturalReasoning through knowledge distillation
+experiments which show that NaturalReasoning can effectively elicit and
+transfer reasoning capabilities from a strong teacher model. Furthermore, we
+demonstrate that NaturalReasoning is also effective for unsupervised
+self-training using external reward models or self-rewarding.
+
+摘要：透過超越傳統領域（例如數學和編碼）來擴充推理能力，受到缺乏多元且高品質問題的阻礙。為了克服這個限制，我們引入一個可擴充的方法，用於產生多元且具挑戰性的推理問題，並附上參考答案。我們提出 NaturalReasoning，這是一個包含 280 萬個問題的綜合資料集，涵蓋多個領域，包括 STEM 領域（例如物理、電腦科學）、經濟學、社會科學等等。我們透過知識蒸餾實驗，展示 NaturalReasoning 中問題的實用性，這些實驗顯示 NaturalReasoning 能有效地引發和轉移強大教師模型的推理能力。此外，我們展示 NaturalReasoning 也適用於使用外部獎勵模型或自我獎勵的無監督自我訓練。
+
+##### **Adapting Psycholinguistic Research for LLMs: Gender-inclusive Language in a Coreference Context**
+2502.13120v1 by Marion Bartl, Thomas Brendan Murphy, Susan Leavy
+
+Gender-inclusive language is often used with the aim of ensuring that all
+individuals, regardless of gender, can be associated with certain concepts.
+While psycholinguistic studies have examined its effects in relation to human
+cognition, it remains unclear how Large Language Models (LLMs) process
+gender-inclusive language. Given that commercial LLMs are gaining an
+increasingly strong foothold in everyday applications, it is crucial to examine
+whether LLMs in fact interpret gender-inclusive language neutrally, because the
+language they generate has the potential to influence the language of their
+users. This study examines whether LLM-generated coreferent terms align with a
+given gender expression or reflect model biases. Adapting psycholinguistic
+methods from French to English and German, we find that in English, LLMs
+generally maintain the antecedent's gender but exhibit underlying masculine
+bias. In German, this bias is much stronger, overriding all tested
+gender-neutralization strategies.
+
+摘要：性別包容性語言通常用於確保所有個人，無論性別如何，都能與某些概念聯繫在一起。雖然心理語言學研究已經檢視了它對人類認知的影響，但大型語言模型 (LLM) 如何處理性別包容性語言仍然不清楚。鑑於商業 LLM 在日常應用中越來越站穩腳步，因此至關重要的是要檢查 LLM 是否實際上中立地解釋性別包容性語言，因為它們產生的語言有可能影響其使用者的語言。本研究探討了 LLM 生成的共指術語是否與給定的性別表達一致或反映模型偏見。我們採用法語到英語和德語的心理語言學方法，發現英語中，LLM 通常會保持先行詞的性別，但表現出潛在的男性偏見。在德語中，這種偏見強得多，凌駕於所有經過測試的性別中立化策略。
+
+##### **STEER-ME: Assessing the Microeconomic Reasoning of Large Language Models**
+2502.13119v1 by Narun Raman, Taylor Lundy, Thiago Amin, Jesse Perla, Kevin Leyton-Brown
+
+How should one judge whether a given large language model (LLM) can reliably
+perform economic reasoning? Most existing LLM benchmarks focus on specific
+applications and fail to present the model with a rich variety of economic
+tasks. A notable exception is Raman et al. [2024], who offer an approach for
+comprehensively benchmarking strategic decision-making; however, this approach
+fails to address the non-strategic settings prevalent in microeconomics, such
+as supply-and-demand analysis. We address this gap by taxonomizing
+microeconomic reasoning into $58$ distinct elements, focusing on the logic of
+supply and demand, each grounded in up to $10$ distinct domains, $5$
+perspectives, and $3$ types. The generation of benchmark data across this
+combinatorial space is powered by a novel LLM-assisted data generation protocol
+that we dub auto-STEER, which generates a set of questions by adapting
+handwritten templates to target new domains and perspectives. Because it offers
+an automated way of generating fresh questions, auto-STEER mitigates the risk
+that LLMs will be trained to over-fit evaluation benchmarks; we thus hope that
+it will serve as a useful tool both for evaluating and fine-tuning models for
+years to come. We demonstrate the usefulness of our benchmark via a case study
+on $27$ LLMs, ranging from small open-source models to the current state of the
+art. We examined each model's ability to solve microeconomic problems across
+our whole taxonomy and present the results across a range of prompting
+strategies and scoring metrics.
+
+摘要：如何判斷一個給定的大型語言模型 (LLM) 能否可靠地進行經濟推理？現有的 LLM 基準測試大多專注於特定應用，未能為模型提供豐富多樣的經濟任務。一個值得注意的例外是 Raman 等人 [2024]，他們提供了一種全面評估策略決策制定方法；然而，這種方法無法解決微觀經濟學中普遍存在的非策略性設定，例如供需分析。我們透過將微觀經濟推理分類為 58 個不同的元素來解決這個差距，重點放在供需邏輯上，每個元素都基於多達 10 個不同的領域、5 個觀點和 3 種類型。在這個組合空間中產生基準數據是由一種新穎的 LLM 輔助數據生成協議（我們稱之為 auto-STEER）推動的，它通過調整手寫模板來針對新的領域和觀點來生成一組問題。由於它提供了一種生成新問題的自動化方式，auto-STEER 減輕了 LLM 將被訓練過度配合評估基準測試的風險；因此，我們希望它將成為未來幾年評估和微調模型的有用工具。我們通過一個案例研究展示了我們基準測試的效用，該案例研究涵蓋了 27 個 LLM，從小型開源模型到當前技術狀態。我們檢查了每個模型在我們的整個分類法中解決微觀經濟問題的能力，並在各種提示策略和評分指標中展示了結果。
+
+##### **Performance Evaluation of Large Language Models in Statistical Programming**
+2502.13117v1 by Xinyi Song, Kexin Xie, Lina Lee, Ruizhe Chen, Jared M. Clark, Hao He, Haoran He, Jie Min, Xinlei Zhang, Simin Zheng, Zhiyang Zhang, Xinwei Deng, Yili Hong
+
+The programming capabilities of large language models (LLMs) have
+revolutionized automatic code generation and opened new avenues for automatic
+statistical analysis. However, the validity and quality of these generated
+codes need to be systematically evaluated before they can be widely adopted.
+Despite their growing prominence, a comprehensive evaluation of statistical
+code generated by LLMs remains scarce in the literature. In this paper, we
+assess the performance of LLMs, including two versions of ChatGPT and one
+version of Llama, in the domain of SAS programming for statistical analysis.
+Our study utilizes a set of statistical analysis tasks encompassing diverse
+statistical topics and datasets. Each task includes a problem description,
+dataset information, and human-verified SAS code. We conduct a comprehensive
+assessment of the quality of SAS code generated by LLMs through human expert
+evaluation based on correctness, effectiveness, readability, executability, and
+the accuracy of output results. The analysis of rating scores reveals that
+while LLMs demonstrate usefulness in generating syntactically correct code,
+they struggle with tasks requiring deep domain understanding and may produce
+redundant or incorrect results. This study offers valuable insights into the
+capabilities and limitations of LLMs in statistical programming, providing
+guidance for future advancements in AI-assisted coding systems for statistical
+analysis.
+
+摘要：大型語言模型 (LLM) 的程式設計功能徹底改變了自動程式碼生成，並為自動統計分析開啟了新途徑。然而，在廣泛採用這些產生的程式碼之前，需要系統性地評估其有效性和品質。儘管其重要性日益提升，但文獻中對於 LLM 產生的統計程式碼的全面評估仍然稀少。在本文中，我們評估了 LLM 的效能，包括兩個版本的 ChatGPT 和一個版本的 Llama，在統計分析的 SAS 程式設計領域。我們的研究利用了一組涵蓋各種統計主題和資料集的統計分析任務。每個任務都包含問題說明、資料集資訊和經過人工驗證的 SAS 程式碼。我們透過基於正確性、有效性、可讀性、可執行性和輸出結果精確度的專家評估，對 LLM 產生的 SAS 程式碼品質進行全面評估。評分結果的分析顯示，儘管 LLM 在產生語法正確的程式碼方面表現出其效用，但它們在需要深入領域理解的任務中會遇到困難，並且可能會產生冗餘或不正確的結果。本研究提供了 LLM 在統計程式設計中能力和限制的寶貴見解，為統計分析的 AI 輔助編碼系統的未來進展提供指導。
+
+##### **Near-Optimal Private Learning in Linear Contextual Bandits**
+2502.13115v1 by Fan Chen, Jiachun Li, Alexander Rakhlin, David Simchi-Levi
+
+We analyze the problem of private learning in generalized linear contextual
+bandits. Our approach is based on a novel method of re-weighted regression,
+yielding an efficient algorithm with regret of order
+$\sqrt{T}+\frac{1}{\alpha}$ and $\sqrt{T}/\alpha$ in the joint and local model
+of $\alpha$-privacy, respectively. Further, we provide near-optimal private
+procedures that achieve dimension-independent rates in private linear models
+and linear contextual bandits. In particular, our results imply that joint
+privacy is almost "for free" in all the settings we consider, partially
+addressing the open problem posed by Azize and Basu (2024).
+
+摘要：我們分析廣義線性情境式老虎機（contextual bandits）中隱私學習的問題。我們的做法基於重新加權回歸的新方法，產生一種有效率的演算法，其後悔值在 $\alpha$-隱私的聯合和局部模型中分別為 $\sqrt{T}+\frac{1}{\alpha}$ 和 $\sqrt{T}/\alpha$。此外，我們提供近乎最佳的隱私程序，在隱私線性模型和線性情境式老虎機中實現與維度無關的速率。特別是，我們的結果表明，在我們考慮的所有設定中，聯合隱私幾乎是「免費」的，部分解決了 Azize 和 Basu (2024) 提出的開放性問題。
+
+##### **The influence of motion features in temporal perception**
+2502.13114v1 by Rosa Illan Castillo, Javier Valenzuela
+
+This paper examines the role of manner-of-motion verbs in shaping subjective
+temporal perception and emotional resonance. Through four complementary
+studies, we explore how these verbs influence the conceptualization of time,
+examining their use in literal and metaphorical (temporal) contexts. Our
+findings reveal that faster verbs (e.g., fly, zoom) evoke dynamic and engaging
+temporal experiences, often linked to positive emotions and greater agency. In
+contrast, slower verbs (e.g., crawl, drag) convey passivity, monotony, and
+negative emotions, reflecting tedious or constrained experiences of time. These
+effects are amplified in metaphorical contexts, where manner verbs encode
+emotional and experiential nuances that transcend their literal meanings. We
+also find that participants prefer manner verbs over path verbs (e.g., go,
+pass) in emotionally charged temporal contexts, as manner verbs capture the
+experiential and emotional qualities of time more effectively. These findings
+highlight the interplay between language, motion, and emotion in shaping
+temporal perception, offering insights into how linguistic framing influences
+subjective experiences of time.
+
+摘要：本文探討動作方式動詞在形塑主觀時間感知和情緒共鳴中所扮演的角色。透過四項互補的研究，我們探討這些動詞如何影響時間的概念化，並檢視它們在字面和隱喻（時間）語境中的用法。我們的研究結果顯示，較快的動詞（例如飛、飆）會引起動態且引人入勝的時間體驗，通常與正面情緒和較大的自主性有關。相反地，較慢的動詞（例如爬、拖）傳達了被動、單調和負面情緒，反映出乏味或受限的時間體驗。這些效應在隱喻語境中會被放大，其中動作動詞編碼了超越其字面意義的情緒和體驗細微差別。我們還發現，在充滿情緒的時間語境中，參與者偏好動作動詞而非路徑動詞（例如走、經過），因為動作動詞更有效地捕捉了時間的體驗和情緒品質。這些研究結果突顯了語言、動作和情緒之間在形塑時間感知中的交互作用，並提供了語言框架如何影響主觀時間體驗的見解。
+
+##### **Improving Clinical Question Answering with Multi-Task Learning: A Joint Approach for Answer Extraction and Medical Categorization**
+2502.13108v1 by Priyaranjan Pattnayak, Hitesh Laxmichand Patel, Amit Agarwal, Bhargava Kumar, Srikant Panda, Tejaswini Kumar
+
+Clinical Question Answering (CQA) plays a crucial role in medical
+decision-making, enabling physicians to extract relevant information from
+Electronic Medical Records (EMRs). While transformer-based models such as BERT,
+BioBERT, and ClinicalBERT have demonstrated state-of-the-art performance in
+CQA, existing models lack the ability to categorize extracted answers, which is
+critical for structured retrieval, content filtering, and medical decision
+support.
+  To address this limitation, we introduce a Multi-Task Learning (MTL)
+framework that jointly trains CQA models for both answer extraction and medical
+categorization. In addition to predicting answer spans, our model classifies
+responses into five standardized medical categories: Diagnosis, Medication,
+Symptoms, Procedure, and Lab Reports. This categorization enables more
+structured and interpretable outputs, making clinical QA models more useful in
 real-world healthcare settings.
+  We evaluate our approach on emrQA, a large-scale dataset for medical question
+answering. Results show that MTL improves F1-score by 2.2% compared to standard
+fine-tuning, while achieving 90.7% accuracy in answer categorization. These
+findings suggest that MTL not only enhances CQA performance but also introduces
+an effective mechanism for categorization and structured medical information
+retrieval.
+
+摘要：臨床問答 (CQA) 在醫療決策中扮演著至關重要的角色，讓醫師能夠從電子病歷 (EMR) 中擷取相關資訊。儘管 BERT、BioBERT 和 ClinicalBERT 等基於轉換器的模型已在 CQA 中展現出最先進的效能，但現有的模型缺乏分類擷取答案的能力，這對於結構化檢索、內容過濾和醫療決策支援至關重要。
+  為了解決這個限制，我們引進了一個多任務學習 (MTL) 架構，它同時訓練 CQA 模型用於答案擷取和醫療分類。除了預測答案範圍，我們的模型將回應分類為五個標準化醫療類別：診斷、藥物、症狀、程序和實驗室報告。這種分類能產生更結構化且易於理解的輸出，讓臨床問答模型在真實世界的醫療保健環境中更實用。
+  我們在 emrQA 上評估我們的做法，emrQA 是用於醫療問題解答的大規模資料集。結果顯示，與標準微調相比，MTL 將 F1 分數提高了 2.2%，同時在答案分類中達到 90.7% 的準確度。這些發現表明，MTL 不僅增強了 CQA 的效能，還引入了一種分類和結構化醫療資訊檢索的有效機制。
+
+##### **MatterChat: A Multi-Modal LLM for Material Science**
+2502.13107v1 by Yingheng Tang, Wenbin Xu, Jie Cao, Jianzhu Ma, Weilu Gao, Steve Farrell, Benjamin Erichson, Michael W. Mahoney, Andy Nonaka, Zhi Yao
+
+Understanding and predicting the properties of inorganic materials is crucial
+for accelerating advancements in materials science and driving applications in
+energy, electronics, and beyond. Integrating material structure data with
+language-based information through multi-modal large language models (LLMs)
+offers great potential to support these efforts by enhancing human-AI
+interaction. However, a key challenge lies in integrating atomic structures at
+full resolution into LLMs. In this work, we introduce MatterChat, a versatile
+structure-aware multi-modal LLM that unifies material structural data and
+textual inputs into a single cohesive model. MatterChat employs a bridging
+module to effectively align a pretrained machine learning interatomic potential
+with a pretrained LLM, reducing training costs and enhancing flexibility. Our
+results demonstrate that MatterChat significantly improves performance in
+material property prediction and human-AI interaction, surpassing
+general-purpose LLMs such as GPT-4. We also demonstrate its usefulness in
+applications such as more advanced scientific reasoning and step-by-step
+material synthesis.
+
+摘要：了解和預測無機材料的特性對於加速材料科學的進步和推動能源、電子等方面的應用至關重要。透過多模態大型語言模型 (LLM) 將材料結構數據與基於語言的資訊整合，可以極大程度地支持這些工作，藉此增強人類與 AI 的互動。然而，一個關鍵挑戰在於將原子結構以完整解析度整合到 LLM 中。在這項工作中，我們引入了 MatterChat，這是一個通用的結構感知多模態 LLM，它將材料結構數據和文字輸入統一到一個單一的內聚模型中。MatterChat 採用橋接模組，將預先訓練好的機器學習原子間電位與預先訓練好的 LLM 有效地對齊，從而降低訓練成本並增強靈活性。我們的結果表明，MatterChat 大幅提升了材料特性預測和人類與 AI 互動的效能，超越了 GPT-4 等通用 LLM。我們也展示了它在更進階的科學推理和逐步材料合成等應用中的效用。
+
+##### **Understanding and Rectifying Safety Perception Distortion in VLMs**
+2502.13095v1 by Xiaohan Zou, Jian Kang, George Kesidis, Lu Lin
+
+Recent studies reveal that vision-language models (VLMs) become more
+susceptible to harmful requests and jailbreak attacks after integrating the
+vision modality, exhibiting greater vulnerability than their text-only LLM
+backbones. To uncover the root cause of this phenomenon, we conduct an in-depth
+analysis and identify a key issue: multimodal inputs introduce a
+modality-induced activation shift toward a "safer" direction compared to their
+text-only counterparts, leading VLMs to systematically overestimate the safety
+of harmful inputs. We refer to this issue as safety perception distortion. To
+mitigate such distortion, we propose Activation Shift Disentanglement and
+Calibration (ShiftDC), a training-free method that decomposes and calibrates
+the modality-induced activation shift to reduce the impact of modality on
+safety. By isolating and removing the safety-relevant component, ShiftDC
+restores the inherent safety alignment of the LLM backbone while preserving the
+vision-language capabilities of VLMs. Empirical results demonstrate that
+ShiftDC significantly enhances alignment performance on safety benchmarks
+without impairing model utility.
+
+摘要：最近的研究表明，在整合了视觉模态后，视觉语言模型 (VLM) 更容易受到有害请求和越狱攻击，表现出比其仅文本的 LLM 主干更大的漏洞。为了揭示这种现象的根本原因，我们进行了深入分析，并确定了一个关键问题：与仅文本的对应物相比，多模态输入引入了朝“更安全”方向的模态诱导激活转移，导致 VLM 系统性地高估有害输入的安全性。我们将此问题称为安全感知扭曲。为了减轻这种扭曲，我们提出了激活转移解耦和校准 (ShiftDC)，这是一种无训练方法，用于分解和校准模态诱导的激活转移，以减少模态对安全性的影响。通过隔离和移除与安全性相关的组件，ShiftDC 恢复了 LLM 主干的固有安全性对齐，同时保留了 VLM 的视觉语言能力。实证结果表明，ShiftDC 在不损害模型效用的情况下，显著增强了安全基准上的对齐性能。
+
+##### **Text2World: Benchmarking Large Language Models for Symbolic World Model Generation**
+2502.13092v1 by Mengkang Hu, Tianxing Chen, Yude Zou, Yuheng Lei, Qiguang Chen, Ming Li, Hongyuan Zhang, Wenqi Shao, Ping Luo
+
+Recently, there has been growing interest in leveraging large language models
+(LLMs) to generate symbolic world models from textual descriptions. Although
+LLMs have been extensively explored in the context of world modeling, prior
+studies encountered several challenges, including evaluation randomness,
+dependence on indirect metrics, and a limited domain scope. To address these
+limitations, we introduce a novel benchmark, Text2World, based on planning
+domain definition language (PDDL), featuring hundreds of diverse domains and
+employing multi-criteria, execution-based metrics for a more robust evaluation.
+We benchmark current LLMs using Text2World and find that reasoning models
+trained with large-scale reinforcement learning outperform others. However,
+even the best-performing model still demonstrates limited capabilities in world
+modeling. Building on these insights, we examine several promising strategies
+to enhance the world modeling capabilities of LLMs, including test-time
+scaling, agent training, and more. We hope that Text2World can serve as a
+crucial resource, laying the groundwork for future research in leveraging LLMs
+as world models. The project page is available at
+https://text-to-world.github.io/.
+
+摘要：最近，人们越来越有兴趣利用大型语言模型（LLM）从文本描述中生成符号世界模型。尽管 LLM 已在世界建模的背景下得到广泛探索，但先前的研究遇到了若干挑战，包括评估随机性、对间接指标的依赖以及有限的领域范围。为了解决这些限制，我们引入了基于规划域定义语言（PDDL）的新基准 Text2World，该基准包含数百个不同的域，并采用基于执行的多标准指标来进行更稳健的评估。我们使用 Text2World 对当前的 LLM 进行了基准测试，发现使用大规模强化学习训练的推理模型优于其他模型。然而，即使是性能最佳的模型在世界建模方面仍然表现出有限的能力。基于这些见解，我们研究了几种有希望的策略来增强 LLM 的世界建模能力，包括测试时缩放、代理训练等等。我们希望 Text2World 能够作为一项至关重要的资源，为未来利用 LLM 作为世界模型的研究奠定基础。项目页面可在 https://text-to-world.github.io/ 获得。
+
+##### **KAPPA: A Generic Patent Analysis Framework with Keyphrase-Based Portraits**
+2502.13076v1 by Xin Xia, Yujin Wang, Jun Zhou, Guisheng Zhong, Linning Cai, Chen Zhang
+
+Patent analysis relies heavily on concise and interpretable document
+representations, referred to as patent portraits. Keyphrases, both present and
+absent, are ideal candidates for patent portraits due to their brevity,
+representativeness, and clarity. In this paper, we introduce KAPPA, an
+integrated framework designed to construct keyphrase-based patent portraits and
+enhance patent analysis. KAPPA operates in two phases: patent portrait
+construction and portrait-based analysis. To ensure effective portrait
+construction, we propose a semantic-calibrated keyphrase generation paradigm
+that integrates pre-trained language models with a prompt-based hierarchical
+decoding strategy to leverage the multi-level structural characteristics of
+patents. For portrait-based analysis, we develop a comprehensive framework that
+employs keyphrase-based patent portraits to enable efficient and accurate
+patent analysis. In extensive experiments on benchmark datasets for keyphrase
+generation, the proposed model achieves significant improvements compared to
+state-of-the-art baselines. Further experiments conducted on real-world patent
+applications demonstrate that our keyphrase-based portraits effectively capture
+domain-specific knowledge and enrich semantic representation for patent
+analysis tasks.
+
+摘要：專利分析高度依賴簡潔且可解讀的文件表示，稱為專利描述。關鍵字組，無論是存在的還是不存在的，都是專利描述的理想候選者，因為它們簡潔、具有代表性且清晰。在本文中，我們介紹了 KAPPA，一個用於建構基於關鍵字組的專利描述和增強專利分析的整合式架構。KAPPA 分為兩個階段執行：專利描述建構和基於描述的分析。為確保有效的描述建構，我們提出了一個語義校準關鍵字組生成範例，它將預先訓練的語言模型與基於提示的分層解碼策略整合在一起，以利用專利的多分層結構特性。對於基於描述的分析，我們開發了一個全面的架構，它採用基於關鍵字組的專利描述，以實現高效且準確的專利分析。在關鍵字組生成基準資料集上進行的廣泛實驗中，與最先進的基準線相比，所提出的模型取得了顯著的改進。在真實世界專利申請上進行的進一步實驗表明，我們基於關鍵字組的描述有效地擷取了特定領域的知識，並豐富了專利分析任務的語義表示。
+
+##### **Interactive Agents to Overcome Ambiguity in Software Engineering**
+2502.13069v1 by Sanidhya Vijayvargiya, Xuhui Zhou, Akhila Yerukola, Maarten Sap, Graham Neubig
+
+AI agents are increasingly being deployed to automate tasks, often based on
+ambiguous and underspecified user instructions. Making unwarranted assumptions
+and failing to ask clarifying questions can lead to suboptimal outcomes, safety
+risks due to tool misuse, and wasted computational resources. In this work, we
+study the ability of LLM agents to handle ambiguous instructions in interactive
+code generation settings by evaluating proprietary and open-weight models on
+their performance across three key steps: (a) leveraging interactivity to
+improve performance in ambiguous scenarios, (b) detecting ambiguity, and (c)
+asking targeted questions. Our findings reveal that models struggle to
+distinguish between well-specified and underspecified instructions. However,
+when models interact for underspecified inputs, they effectively obtain vital
+information from the user, leading to significant improvements in performance
+and underscoring the value of effective interaction. Our study highlights
+critical gaps in how current state-of-the-art models handle ambiguity in
+complex software engineering tasks and structures the evaluation into distinct
+steps to enable targeted improvements.
+
+摘要：人工智能代理正越來越多地被部署用於自動化任務，通常基於模棱兩可且未明確規定的使用者指令。做出不合理的假設且未能提出澄清問題，可能導致次佳結果、因工具誤用而產生的安全風險，以及浪費運算資源。在這項工作中，我們研究了 LLM 代理在互動式程式碼生成設定中處理模棱兩可指令的能力，方法是在三個關鍵步驟中評估專有和開放權重的模型： (a) 利用互動性來提升在模棱兩可場景中的效能、(b) 偵測模糊性，以及 (c) 提出目標問題。我們的研究結果顯示，模型難以區分明確規範的指令和未明確規範的指令。然而，當模型針對未明確規範的輸入進行互動時，它們會有效地從使用者取得重要資訊，進而大幅提升效能，並強調有效互動的價值。我們的研究突顯了目前最先進的模型在處理複雜軟體工程任務中的模糊性時存在哪些關鍵差距，並將評估架構為不同的步驟，以促成有目標的改善。
+
+##### **Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity**
+2502.13063v1 by Yuri Kuratov, Mikhail Arkhipov, Aydar Bulatov, Mikhail Burtsev
+
+A range of recent works addresses the problem of compressing a sequence of
+tokens into a shorter sequence of real-valued vectors to be used as inputs
+instead of token embeddings or key-value cache. These approaches make it
+possible to reduce the amount of compute in existing language models. Despite relying on
+powerful models as encoders, the maximum attainable lossless compression ratio
+is typically not higher than x10. This fact is highly intriguing because, in
+theory, the maximum information capacity of large real-valued vectors is far
+beyond the presented rates even for 16-bit precision and a modest vector size.
+In this work, we explore the limits of compression by replacing the encoder
+with a per-sample optimization procedure. We show that vectors with compression
+ratios up to x1500 exist, which highlights a gap of two orders of magnitude between
+existing and practically attainable solutions. Furthermore, we empirically show
+that the compression limits are determined not by the length of the input but
+by the amount of uncertainty to be reduced, namely, the cross-entropy loss on
+this sequence without any conditioning. The obtained limits highlight the
+substantial gap between the theoretical capacity of input embeddings and their
+practical utilization, suggesting significant room for optimization in model
+design.
+
+摘要：一系列近期研究探討了將標記序列壓縮成較短的實值向量序列的問題，以用作輸入，取代標記嵌入或鍵值快取。這些方法可以減少現有語言模型中的運算量。儘管依賴強大的模型作為編碼器，但可達到的最大無損壓縮比通常不高於 x10。這一事實非常耐人尋味，因為理論上，即使對於 16 位元精度和適中的向量大小，大型實值向量的最大資訊容量也遠遠超出了所呈現的壓縮率。在這項工作中，我們透過以逐樣本最佳化程序取代編碼器來探索壓縮的極限。我們表明，存在壓縮比高達 x1500 的向量，這突顯了現有解決方案和實際可實現解決方案之間兩個數量級的差距。此外，我們透過實驗表明，壓縮極限不是由輸入的長度決定的，而是由需要減少的不確定性量決定的，即在沒有任何條件的情況下此序列上的交叉熵損失。所得到的極限突顯了輸入嵌入的理論容量與其實際利用之間的巨大差距，表明模型設計中有很大的最佳化空間。
+
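The capacity claim in the abstract above rests on a simple counting argument, which can be sketched as follows. The vector size (4096) and vocabulary size (2**15) are hypothetical values chosen for illustration and are not taken from the paper; the calculation only counts raw bits and says nothing about whether a model can actually decode them.

```python
import math


def vector_capacity_bits(dim: int, bits_per_value: int = 16) -> int:
    """Raw number of bits a vector of `dim` values can hold at the given precision."""
    return dim * bits_per_value


def tokens_storable(dim: int, vocab_size: int, bits_per_value: int = 16) -> int:
    """Tokens that fit if each token costs log2(vocab_size) bits (uniform coding, no language model)."""
    bits_per_token = math.log2(vocab_size)
    return math.floor(vector_capacity_bits(dim, bits_per_value) / bits_per_token)


# Hypothetical sizes for illustration only.
DIM = 4096
VOCAB = 2 ** 15  # each token costs exactly 15 bits under uniform coding

capacity = vector_capacity_bits(DIM)
max_tokens = tokens_storable(DIM, VOCAB)
print(f"{capacity} bits of raw capacity, room for about {max_tokens} uniformly coded tokens")
```

Under these assumed sizes a single vector has room for several thousand tokens in the raw-bit sense, which is why a compression ratio of about x10 looks small next to the theoretical ceiling.
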
+##### **AI-Assisted Decision Making with Human Learning**
+2502.13062v1 by Gali Noti, Kate Donahue, Jon Kleinberg, Sigal Oren
+
+AI systems increasingly support human decision-making. In many cases, despite
+the algorithm's superior performance, the final decision remains in human
+hands. For example, an AI may assist doctors in determining which diagnostic
+tests to run, but the doctor ultimately makes the diagnosis. This paper studies
+such AI-assisted decision-making settings, where the human learns through
+repeated interactions with the algorithm. In our framework, the algorithm --
+designed to maximize decision accuracy according to its own model -- determines
+which features the human can consider. The human then makes a prediction based
+on their own less accurate model. We observe that the discrepancy between the
+algorithm's model and the human's model creates a fundamental tradeoff. Should
+the algorithm prioritize recommending more informative features, encouraging
+the human to recognize their importance, even if it results in less accurate
+predictions in the short term until learning occurs? Or is it preferable to
+forgo educating the human and instead select features that align more closely
+with their existing understanding, minimizing the immediate cost of learning?
+This tradeoff is shaped by the algorithm's time-discounted objective and the
+human's learning ability. Our results show that optimal feature selection has a
+surprisingly clean combinatorial characterization, reducible to a stationary
+sequence of feature subsets that is tractable to compute. As the algorithm
+becomes more "patient" or the human's learning improves, the algorithm
+increasingly selects more informative features, enhancing both prediction
+accuracy and the human's understanding. Notably, early investment in learning
+leads to the selection of more informative features than a later investment. We
+complement our analysis by showing that the impact of errors in the algorithm's
+knowledge is limited as it does not make the prediction directly.
+
+摘要：人工智慧系統日益支援人類決策。在許多情況下，儘管演算法的效能優異，最終決策仍掌握在人類手中。例如，人工智慧可能會協助醫生決定要執行哪些診斷測試，但最終下診斷的是醫生。本文探討此類人工智慧輔助決策設定，其中人類透過與演算法重複互動而學習。在我們的架構中，演算法（旨在根據其自身模型最大化決策準確度）會決定人類可以考量的特徵。然後，人類根據其自身較不準確的模型做出預測。我們觀察到，演算法模型與人類模型之間的差異會產生基本的權衡。演算法是否應優先推薦更多資訊性特徵，鼓勵人類認識其重要性，即使短期內會導致準確度較低的預測，直到學習發生？或者，是否較好放棄教育人類，而選擇與其現有理解更緊密對齊的特徵，將學習的立即成本降至最低？這種權衡取決於演算法的時間折現目標和人類的學習能力。我們的結果表明，最佳特徵選擇具有令人驚訝的乾淨組合特徵，可簡化為可計算的固定特徵子集序列。隨著演算法變得更「有耐心」或人類的學習進步，演算法會越來越多地選擇更多資訊性特徵，增強預測準確度和人類的理解。值得注意的是，早期投資於學習會導致選擇比後期投資更多資訊性特徵。我們透過顯示演算法知識中錯誤的影響是有限的，因為它不會直接做出預測，來補充我們的分析。
+
+##### **Improved Fine-Tuning of Large Multimodal Models for Hateful Meme Detection**
+2502.13061v1 by Jingbiao Mei, Jinghong Chen, Guangyu Yang, Weizhe Lin, Bill Byrne
+
+Hateful memes have become a significant concern on the Internet,
+necessitating robust automated detection systems. While large multimodal models
+have shown strong generalization across various tasks, they exhibit poor
+generalization to hateful meme detection due to the dynamic nature of memes
+tied to emerging social trends and breaking news. Recent work further
+highlights the limitations of conventional supervised fine-tuning for large
+multimodal models in this context. To address these challenges, we propose
+Large Multimodal Model Retrieval-Guided Contrastive Learning (LMM-RGCL), a
+novel two-stage fine-tuning framework designed to improve both in-domain
+accuracy and cross-domain generalization. Experimental results on six widely
+used meme classification datasets demonstrate that LMM-RGCL achieves
+state-of-the-art performance, outperforming agent-based systems such as
+VPD-PALI-X-55B. Furthermore, our method effectively generalizes to
+out-of-domain memes under low-resource settings, surpassing models like GPT-4o.
+
+摘要：網路上的仇恨迷因已成為一大隱憂，因此需要強大的自動化偵測系統。雖然大型多模態模型已在各種任務中展現出強大的泛化能力，但由於迷因與新興社會趨勢和突發新聞息息相關，因此在仇恨迷因偵測方面表現不佳。最近的研究進一步強調了在這種情況下，傳統監督微調對大型多模態模型的限制。為了應對這些挑戰，我們提出了大型多模態模型檢索引導對比學習 (LMM-RGCL)，這是一種新穎的兩階段微調架構，旨在提高領域內準確度和跨領域泛化能力。在六個廣泛使用的迷因分類資料集上的實驗結果表明，LMM-RGCL 達到了最先進的效能，優於基於代理的系統，例如 VPD-PALI-X-55B。此外，我們的模型在低資源設定下有效泛化到領域外迷因，超越了 GPT-4o 等模型。
+
+##### **SimpleVQA: Multimodal Factuality Evaluation for Multimodal Large Language Models**
+2502.13059v1 by Xianfu Cheng, Wei Zhang, Shiwei Zhang, Jian Yang, Xiangyuan Guan, Xianjie Wu, Xiang Li, Ge Zhang, Jiaheng Liu, Yuying Mai, Yutao Zeng, Zhoufutu Wen, Ke Jin, Baorui Wang, Weixiao Zhou, Yunhong Lu, Tongliang Li, Wenhao Huang, Zhoujun Li
+
+The increasing application of multi-modal large language models (MLLMs)
+across various sectors has spotlighted the importance of their output reliability
+and accuracy, particularly their ability to produce content grounded in factual
+information (e.g. common and domain-specific knowledge). In this work, we
+introduce SimpleVQA, the first comprehensive multi-modal benchmark to evaluate
+the factuality of MLLMs when answering short natural-language questions.
+SimpleVQA is characterized by six key features: it covers multiple tasks and
+multiple scenarios, ensures high quality and challenging queries, maintains
+static and timeless reference answers, and is straightforward to evaluate. Our
+approach involves categorizing visual question-answering items into 9 different
+tasks around objective events or common knowledge and situating these within 9
+topics. Rigorous quality control processes are implemented to guarantee
+high-quality, concise, and clear answers, facilitating evaluation with minimal
+variance via an LLM-as-a-judge scoring system. Using SimpleVQA, we perform a
+comprehensive assessment of 18 leading MLLMs and 8 text-only LLMs, delving into
+their image comprehension and text generation abilities by identifying and
+analyzing error cases.
+
+摘要：隨著多模態大型語言模型 (MLLM) 在各個領域的應用日益普及，其輸出結果的可靠性和準確性已備受關注，特別是其根據事實資訊（例如一般知識和特定領域知識）產生內容的能力。在本文中，我們介紹 SimpleVQA，這是第一個用於評估 MLLM 回答自然語言簡短問題的事實能力的綜合多模態基準。SimpleVQA 有六個主要特徵：涵蓋多項任務和多種情境、確保高品質且具挑戰性的查詢、維護靜態且永恆的參考答案，而且評估起來很簡單。我們的做法是將視覺問答項目分類為 9 個不同的任務，圍繞客觀事件或常識，並將它們置於 9 個主題中。我們實施嚴格的品質控管流程，以保證答案的高品質、簡潔和清晰，並透過 LLM 作為評分系統，以最小的差異進行評估。我們使用 SimpleVQA 對 18 個主要的 MLLM 和 8 個純文字 LLM 進行全面評估，透過找出和分析錯誤案例，深入探討它們的影像理解和文字生成能力。
+
+##### **LAMD: Context-driven Android Malware Detection and Classification with LLMs**
+2502.13055v1 by Xingzhi Qian, Xinran Zheng, Yiling He, Shuo Yang, Lorenzo Cavallaro
+
+The rapid growth of mobile applications has escalated Android malware
+threats. Although there are numerous detection methods, they often struggle
+with evolving attacks, dataset biases, and limited explainability. Large
+Language Models (LLMs) offer a promising alternative with their zero-shot
+inference and reasoning capabilities. However, applying LLMs to Android malware
+detection presents two key challenges: (1) the extensive support code in Android
+applications, often spanning thousands of classes, exceeds LLMs' context limits
+and obscures malicious behavior within benign functionality; (2) the structural
+complexity and interdependencies of Android applications surpass LLMs'
+sequence-based reasoning, fragmenting code analysis and hindering malicious
+intent inference. To address these challenges, we propose LAMD, a practical
+context-driven framework to enable LLM-based Android malware detection. LAMD
+integrates key context extraction to isolate security-critical code regions and
+construct program structures, then applies tier-wise code reasoning to analyze
+application behavior progressively, from low-level instructions to high-level
+semantics, providing a final prediction and explanation. A well-designed factual
+consistency verification mechanism is included to mitigate LLM hallucinations
+from the first tier. Evaluation in real-world settings demonstrates LAMD's
+effectiveness over conventional detectors, establishing a feasible basis for
+LLM-driven malware analysis in dynamic threat landscapes.
+
+摘要：隨著行動應用程式快速成長，Android 惡意軟體威脅也隨之升級。雖然有許多偵測方法，但它們經常難以應付不斷演進的攻擊、資料集偏差和有限的可解釋性。大型語言模型 (LLM) 提供了一個有前途的替代方案，具備零次學習推理和推理能力。然而，將 LLM 應用於 Android 惡意軟體偵測會出現兩個主要挑戰：(1) Android 應用程式中大量的支援程式碼，通常橫跨數千個類別，超過 LLM 的上下文限制，並模糊了良性功能中的惡意行為；(2) Android 應用程式的結構複雜性和相互依賴性超過 LLM 的基於序列的推理，會造成程式碼分析破碎，並阻礙惡意意圖推論。為了應對這些挑戰，我們提出了 LAMD，一個實用的脈絡驅動架構，以支援基於 LLM 的 Android 惡意軟體偵測。LAMD 整合了關鍵脈絡萃取，以隔離與安全性至關重要的程式碼區域並建構程式結構，然後套用分層式程式碼推理，逐步分析應用程式行為，從低階指令到高階語意，提供最終預測和說明。一個設計良好的事實一致性驗證機制具備減輕 LLM 從第一層產生的幻覺的能力。在真實環境中的評估顯示，LAMD 優於傳統偵測器，為動態威脅環境中的 LLM 驅動惡意軟體分析建立了一個可行的基礎。
+
+##### **Do we still need Human Annotators? Prompting Large Language Models for Aspect Sentiment Quad Prediction**
+2502.13044v1 by Nils Constantin Hellwig, Jakob Fehle, Udo Kruschwitz, Christian Wolff
+
+Aspect sentiment quadruple prediction (ASQP) facilitates a detailed
+understanding of opinions expressed in a text by identifying the opinion term,
+aspect term, aspect category and sentiment polarity for each opinion. However,
+annotating a full set of training examples to fine-tune models for ASQP is a
+resource-intensive process. In this study, we explore the capabilities of large
+language models (LLMs) for zero- and few-shot learning on the ASQP task across
+five diverse datasets. We report F1 scores slightly below those obtained with
+state-of-the-art fine-tuned models but exceeding previously reported zero- and
+few-shot performance. In the 40-shot setting on the Rest16 restaurant domain
+dataset, LLMs achieved an F1 score of 52.46, compared to 60.39 by the
+best-performing fine-tuned method MVP. Additionally, we report the performance
+of LLMs in target aspect sentiment detection (TASD), where the F1 scores were
+also close to fine-tuned models, achieving 66.03 on Rest16 in the 40-shot
+setting, compared to 72.76 with MVP. While human annotators remain essential
+for achieving optimal performance, LLMs can reduce the need for extensive
+manual annotation in ASQP tasks.
+
+摘要：面向觀點的四元預測 (ASQP) 透過辨識各個觀點的觀點詞彙、面向詞彙、面向類別和觀點極性，協助詳細了解文字中表達的意見。然而，標註一組完整的訓練範例以微調 ASQP 模型是一個耗費資源的過程。在這項研究中，我們探討大型語言模型 (LLM) 在 ASQP 任務中進行零次和少量學習的能力，橫跨五個不同的資料集。我們報告的 F1 分數略低於使用最先進的微調模型獲得的分數，但超過先前報告的零次和少量學習表現。在 Rest16 餐廳領域資料集的 40 次學習設定中，LLM 達到了 52.46 的 F1 分數，而效能最佳的微調方法 MVP 則為 60.39。此外，我們報告了 LLM 在目標面向觀點偵測 (TASD) 中的表現，其中 F1 分數也接近微調模型，在 40 次學習設定中於 Rest16 達到 66.03，而 MVP 則為 72.76。儘管人類標註員對於達成最佳效能仍然至關重要，但 LLM 可以減少 ASQP 任務中廣泛手動標註的需求。
+
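The F1 scores quoted above compare predicted sentiment quadruples against gold annotations. A minimal sketch of such a score is given below, assuming the common exact-match convention (a predicted quad counts only if all four elements match a gold quad) and micro-averaging over sentences; the abstract does not state the exact scoring procedure, and the example sentences and labels are invented for illustration.

```python
from typing import List, Set, Tuple

# (aspect term, aspect category, opinion term, sentiment polarity)
Quad = Tuple[str, str, str, str]


def quad_f1(predicted: List[Set[Quad]], gold: List[Set[Quad]]) -> float:
    """Micro-averaged F1 over sentences; a quad is correct only on exact match."""
    true_positives = sum(len(p & g) for p, g in zip(predicted, gold))
    n_predicted = sum(len(p) for p in predicted)
    n_gold = sum(len(g) for g in gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / n_predicted
    recall = true_positives / n_gold
    return 2 * precision * recall / (precision + recall)


# Invented example: one sentence with two gold quads.
gold = [{
    ("pizza", "food quality", "delicious", "positive"),
    ("service", "service general", "slow", "negative"),
}]
# The second prediction gets the polarity wrong, so it does not count.
pred = [{
    ("pizza", "food quality", "delicious", "positive"),
    ("service", "service general", "slow", "neutral"),
}]

score = quad_f1(pred, gold)
print(f"F1 = {score:.2f}")
```

Because a single wrong element invalidates the whole quadruple, this metric is strict, which helps explain why ASQP scores are lower than TASD scores on the same dataset.
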
+##### **Natural Language Generation from Visual Sequences: Challenges and Future Directions**
+2502.13034v1 by Aditya K Surikuchi, Raquel Fernández, Sandro Pezzelle
+
+The ability to use natural language to talk about visual content is at the
+core of human intelligence and a crucial feature of any artificial intelligence
+system. Various studies have focused on generating text for single images. In
+contrast, comparatively little attention has been paid to exhaustively
+analyzing and advancing work on multiple-image vision-to-text settings. In this
+position paper, we claim that any task dealing with temporally ordered
+sequences of multiple images or frames is an instance of a broader, more
+general problem involving the understanding of intricate relationships between
+the visual content and the corresponding text. We comprehensively analyze five
+tasks that are instances of this problem and argue that they pose a common set
+of challenges and share similarities in terms of modeling and evaluation
+approaches. Based on the insights from these various aspects and stages of
+multi-image-to-text generation, we highlight several open questions and suggest
+future research directions. We believe that these directions can advance the
+understanding of complex phenomena in this domain and the development of better
+models.
+
+摘要：使用自然語言來談論視覺內容的能力是人類智慧的核心，也是任何人工智慧系統的一項關鍵功能。各種研究都專注於為單一影像產生文字。相較之下，對於詳盡分析和推進多重影像視覺轉文字設定的工作，關注較少。在此立場文件中，我們聲稱任何處理多重影像或畫格的時間順序序列的任務，都是一個更廣泛、更普遍問題的範例，涉及理解視覺內容和對應文字之間的複雜關係。我們全面分析了此問題的五個範例任務，並論證它們提出了一組常見的挑戰，且在建模和評估方法方面有相似之處。根據多重影像轉文字生成的這些不同面向和階段的見解，我們突出了幾個開放性問題，並建議未來的研究方向。我們相信這些方向可以推進對此領域中複雜現象的理解，以及開發出更好的模型。
+
+##### **HPSS: Heuristic Prompting Strategy Search for LLM Evaluators**
+2502.13031v1 by Bosi Wen, Pei Ke, Yufei Sun, Cunxiang Wang, Xiaotao Gu, Jinfeng Zhou, Jie Tang, Hongning Wang, Minlie Huang
+
+Since the adoption of large language models (LLMs) for text evaluation has
+become increasingly prevalent in the field of natural language processing
+(NLP), a series of existing works attempt to optimize the prompts for LLM
+evaluators to improve their alignment with human judgment. However, their
+efforts are limited to optimizing individual factors of evaluation prompts,
+such as evaluation criteria or output formats, neglecting the combinatorial
+impact of multiple factors, which leads to insufficient optimization of the
+evaluation pipeline. Nevertheless, identifying well-behaved prompting
+strategies for adjusting multiple factors requires extensive enumeration. To
+this end, we comprehensively integrate 8 key factors for evaluation prompts and
+propose a novel automatic prompting strategy optimization method called
+Heuristic Prompting Strategy Search (HPSS). Inspired by the genetic algorithm,
+HPSS conducts an iterative search to find well-behaved prompting strategies for
+LLM evaluators. A heuristic function is employed to guide the search process,
+enhancing the performance of our algorithm. Extensive experiments across four
+evaluation tasks demonstrate the effectiveness of HPSS, consistently
+outperforming both human-designed evaluation prompts and existing automatic
+prompt optimization methods.
+
+摘要：隨著自然語言處理（NLP）領域中採用大型語言模型（LLM）進行文本評估變得越來越普遍，一系列現有工作嘗試優化 LLM 評估器的提示，以改善它們與人類判斷的一致性。然而，他們的努力僅限於優化評估提示的個別因素，例如評估準則或輸出格式，而忽略了多種因素的組合影響，這導致評估管道優化不足。儘管如此，找出調整多種因素的良好提示策略需要廣泛的枚舉。為此，我們全面整合了評估提示的 8 個關鍵因素，並提出了一種名為啟發式提示策略搜索（HPSS）的新型自動提示策略優化方法。在遺傳演算法的啟發下，HPSS 進行反覆搜索以找出 LLM 評估器的良好提示策略。採用啟發式函數來指導搜索過程，增強了我們演算法的效能。在四項評估任務中進行的廣泛實驗證明了 HPSS 的有效性，始終優於人類設計的評估提示和現有的自動提示優化方法。
+
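The abstract above describes a genetic-algorithm-inspired search over combinations of prompt factors. The sketch below is a generic mutation-based evolutionary search and is not the HPSS algorithm itself: the heuristic function that guides HPSS is not described in the abstract, and the number of options per factor, the `toy_score` function, and all names are hypothetical placeholders. In practice the score would be the agreement between an LLM evaluator and human judgments.

```python
import random
from typing import Callable, List, Tuple

Strategy = Tuple[int, ...]  # one chosen option index per prompt factor


def evolutionary_search(
    options_per_factor: List[int],
    score: Callable[[Strategy], float],
    generations: int = 20,
    population_size: int = 8,
    seed: int = 0,
) -> Tuple[Strategy, List[float]]:
    """Keep the better half of each generation and refill it with mutated copies."""
    rng = random.Random(seed)

    def random_strategy() -> Strategy:
        return tuple(rng.randrange(n) for n in options_per_factor)

    def mutate(parent: Strategy) -> Strategy:
        # Re-draw the option of one randomly chosen factor.
        i = rng.randrange(len(parent))
        return parent[:i] + (rng.randrange(options_per_factor[i]),) + parent[i + 1:]

    population = [random_strategy() for _ in range(population_size)]
    best_per_generation: List[float] = []
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        best_per_generation.append(score(population[0]))
        survivors = population[: population_size // 2]
        children = [
            mutate(rng.choice(survivors))
            for _ in range(population_size - len(survivors))
        ]
        population = survivors + children
    return max(population, key=score), best_per_generation


# Hypothetical setup: 8 factors with 3 options each, scored against a hidden target.
FACTOR_OPTIONS = [3] * 8
TARGET = (0, 1, 2, 0, 1, 2, 0, 1)


def toy_score(strategy: Strategy) -> float:
    return float(sum(a == b for a, b in zip(strategy, TARGET)))


best_strategy, history = evolutionary_search(FACTOR_OPTIONS, toy_score)
print(best_strategy, history[-1])
```

With 8 factors and 3 options each there are already 3**8 = 6561 combinations, so exhaustive enumeration becomes expensive quickly when every evaluation requires LLM calls; a guided search only scores a small fraction of them.
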
+##### **Whose story is it? Personalizing story generation by inferring author styles**
+2502.13028v1 by Nischal Ashok Kumar, Chau Minh Pham, Mohit Iyyer, Andrew Lan
+
+Personalization has become essential for improving user experience in
+interactive writing and educational applications, yet its potential in story
+generation remains largely unexplored. In this work, we propose a novel
+two-stage pipeline for personalized story generation. Our approach first infers
+an author's implicit story-writing characteristics from their past work and
+organizes them into an Author Writing Sheet, inspired by narrative theory. The
+second stage uses this sheet to simulate the author's persona through tailored
+persona descriptions and personalized story writing rules. To enable and
+validate our approach, we construct Mythos, a dataset of 590 stories from 64
+authors across five distinct sources that reflect diverse story-writing
+settings. A head-to-head comparison with a non-personalized baseline
+demonstrates our pipeline's effectiveness in generating high-quality
+personalized stories. Our personalized stories achieve a 75 percent win rate
+(versus 14 percent for the baseline and 11 percent ties) in capturing authors'
+writing style based on their past works. Human evaluation highlights the high
+quality of our Author Writing Sheet and provides valuable insights into the
+personalized story generation task. Notable takeaways are that writings from
+certain sources, such as Reddit, are easier to personalize than others, like
+AO3, while narrative aspects, like Creativity and Language Use, are easier to
+personalize than others, like Plot.
+
+摘要：個人化已成為改善互動式寫作和教育應用程式中使用者體驗的必要手段，然而其在故事生成中的潛力仍未被廣泛探索。在這項工作中，我們提出了一個創新的兩階段流程，用於個人化故事生成。我們的做法首先從作者過去的作品中推論出作者隱含的故事寫作特徵，並根據敘事理論將它們組織成作者寫作表。第二階段使用此表透過量身打造的角色描述和個人化故事寫作規則來模擬作者的角色。為了啟用和驗證我們的做法，我們建構了 Mythos，一個包含來自 64 位作者、橫跨五個不同來源的 590 個故事的資料集，這些故事反映了多樣化的故事寫作設定。與非個人化基準進行一對一的比較，證明了我們的流程在生成高品質個人化故事方面的有效性。我們的個人化故事以 75% 的獲勝率（相較於基準的 14% 和 11% 平手）捕捉到作者基於其過去作品的寫作風格。人類評估突顯了我們作者寫作表的優良品質，並提供了對個人化故事生成任務的寶貴見解。值得注意的是，來自某些來源（例如 Reddit）的作品比其他來源（例如 AO3）更容易個人化，而敘事層面（例如創造力和語言使用）比其他層面（例如情節）更容易個人化。
+
+##### **Agentic Deep Graph Reasoning Yields Self-Organizing Knowledge Networks**
+2502.13025v1 by Markus J. Buehler
+
+We present an agentic, autonomous graph expansion framework that iteratively
+structures and refines knowledge in situ. Unlike conventional knowledge graph
+construction methods relying on static extraction or single-pass learning, our
+approach couples a reasoning-native large language model with a continually
+updated graph representation. At each step, the system actively generates new
+concepts and relationships, merges them into a global graph, and formulates
+subsequent prompts based on its evolving structure. Through this
+feedback-driven loop, the model organizes information into a scale-free network
+characterized by hub formation, stable modularity, and bridging nodes that link
+disparate knowledge clusters. Over hundreds of iterations, new nodes and edges
+continue to appear without saturating, while centrality measures and shortest
+path distributions evolve to yield increasingly distributed connectivity. Our
+analysis reveals emergent patterns, such as the rise of highly connected 'hub'
+concepts and the shifting influence of 'bridge' nodes, indicating that agentic,
+self-reinforcing graph construction can yield open-ended, coherent knowledge
+structures. Applied to materials design problems, we present compositional
+reasoning experiments by extracting node-specific and synergy-level principles
+to foster genuinely novel knowledge synthesis, yielding cross-domain ideas that
+transcend rote summarization and strengthen the framework's potential for
+open-ended scientific discovery. We discuss other applications in scientific
+discovery and outline future directions for enhancing scalability and
+interpretability.
+
+摘要：我們提出一個能動的、自主的圖形擴展框架，它反覆地建構和精煉原位知識。與依賴靜態提取或單次學習的傳統知識圖形建構方法不同，我們的做法將一個推理原生的大語言模型與一個持續更新的圖形表示結合起來。在每一步中，系統主動產生新的概念和關係，將它們合併到一個全域圖形中，並根據其不斷演化的結構制定後續提示。透過這個回饋驅動的迴圈，模型將資訊組織成一個無標度網路，其特徵是樞紐形成、穩定的模組化以及連結不同知識群集的橋接節點。在數百次反覆運算中，新的節點和邊緣會持續出現，而不會飽和，同時中心性測量和最短路徑分佈會演化為產生越來越分散的連通性。我們的分析揭示了新興模式，例如高度連接的「樞紐」概念的興起和「橋樑」節點影響力的轉移，這表明能動的、自我強化的圖形建構可以產生開放式、連貫的知識結構。應用於材料設計問題，我們提出組合推理實驗，透過提取特定於節點的原則和協同效應層級原則，以促進真正新穎的知識綜合，產生超越死背式摘要並強化框架在開放式科學發現中潛力的跨領域想法。我們討論了在科學發現中的其他應用，並概述了增強可擴充性和可解釋性的未來方向。
+
+##### **Oreo: A Plug-in Context Reconstructor to Enhance Retrieval-Augmented Generation**
+2502.13019v1 by Sha Li, Naren Ramarkrishnan
+
+Despite the remarkable capabilities of Large Language Models (LLMs) in
+various NLP tasks, they remain vulnerable to hallucinations due to their
+limited parametric knowledge and lack of domain-specific expertise.
+Retrieval-Augmented Generation (RAG) addresses this challenge by incorporating
+external document retrieval to augment the knowledge base of LLMs. In this
+approach, RAG retrieves document chunks from an external corpus in response to
+a query, which are then used as context for the downstream language model to
+generate an answer. However, these retrieved knowledge sources often include
+irrelevant or erroneous information, undermining the effectiveness of RAG in
+downstream tasks. To overcome this limitation, we introduce a compact,
+efficient, and pluggable module designed to refine external knowledge sources
+before feeding them to the generator. The module reconstructs retrieved content
+by extracting the most relevant and supportive information and reorganising it
+into a concise, query-specific format. Through a three-stage training paradigm
+- comprising supervised fine-tuning, contrastive multi-task learning, and
+reinforcement learning-based alignment - it prioritises critical knowledge and
+aligns it with the generator's preferences. This method enables LLMs to produce
+outputs that are more accurate, reliable, and contextually appropriate.
+
+摘要：儘管大型語言模型 (LLM) 在各種自然語言處理任務中具備卓越的能力，但由於其參數知識有限且缺乏特定領域的專業知識，因此它們仍然容易出現幻覺。檢索增強式生成 (RAG) 透過納入外部文件檢索來擴充 LLM 的知識庫，以應對此項挑戰。在此方法中，RAG 會根據查詢檢索外部語料庫中的文件區塊，然後將其用作下游語言模型的背景，以產生答案。然而，這些檢索到的知識來源通常包含不相關或錯誤的資訊，因而損害了 RAG 在下游任務中的效能。為了克服此項限制，我們引入了一個精簡、有效率且可插入的模組，用於在將外部知識來源提供給生成器之前對其進行精煉。此模組透過提取最相關且有用的資訊並將其重新組織成簡潔且特定於查詢的格式，來重建檢索到的內容。透過三階段訓練範例 - 包含監督微調、對比多任務學習以及基於強化學習的比對 - 它優先考量關鍵知識，並使其與生成器的偏好相符。此方法可讓 LLM 產生更準確、可靠且在語境上更適當的輸出。
+
+##### **LLM-Powered Proactive Data Systems**
+2502.13016v1 by Sepanta Zeighami, Yiming Lin, Shreya Shankar, Aditya Parameswaran
+
+With the power of LLMs, we now have the ability to query data that was
+previously impossible to query, including text, images, and video. However,
+despite this enormous potential, most present-day data systems that leverage
+LLMs are reactive, reflecting our community's desire to map LLMs to known
+abstractions. Most data systems treat LLMs as an opaque black box that operates
+on user inputs and data as is, optimizing them much like any other approximate,
+expensive UDFs, in conjunction with other relational operators. Such data
+systems do as they are told, but fail to understand and leverage what the LLM
+is being asked to do (i.e. the underlying operations, which may be
+error-prone), the data the LLM is operating on (e.g., long, complex documents),
+or what the user really needs. They don't take advantage of the characteristics
+of the operations and/or the data at hand, or ensure correctness of results
+when there are imprecisions and ambiguities. We argue that data systems instead
+need to be proactive: they need to be given more agency -- armed with the power
+of LLMs -- to understand and rework the user inputs and the data and to make
+decisions on how the operations and the data should be represented and
+processed. By allowing the data system to parse, rewrite, and decompose user
+inputs and data, or to interact with the user in ways that go beyond the
+standard single-shot query-result paradigm, the data system is able to address
+user needs more efficiently and effectively. These new capabilities lead to a
+rich design space where the data system takes more initiative: they are
+empowered to perform optimization based on the transformation operations, data
+characteristics, and user intent. We discuss various successful examples of how
+this framework has been and can be applied in real-world tasks, and present
+future directions for this ambitious research agenda.
+
+摘要：透過 LLM 的強大功能，我們現在能夠查詢過去無法查詢的資料，包括文字、圖片和影片。然而，儘管有如此龐大的潛力，但現今大多數利用 LLM 的資料系統都是被動的，反映出我們的社群希望將 LLM 映射到已知的抽象化。大多數資料系統將 LLM 視為一個不透明的黑盒子，以使用者輸入和資料為基礎進行運作，並像其他近似、昂貴的 UDF 一樣最佳化它們，並與其他關聯運算子結合使用。這些資料系統會照著指示執行，但無法理解並運用 LLM 被要求執行的任務（例如可能容易出錯的基本運算）、LLM 正在運算的資料（例如冗長、複雜的文件），或使用者真正需要的是什麼。它們不會利用運算和/或手邊資料的特性，或在有誤差和歧義時確保結果的正確性。我們認為資料系統應該改為主動：它們需要被賦予更多自主權，並具備 LLM 的強大功能，以了解並重新處理使用者輸入和資料，並就運算和資料的表示和處理方式做出決策。透過允許資料系統解析、改寫和分解使用者輸入和資料，或以超越標準單次查詢結果模式的方式與使用者互動，資料系統能夠更有效率且有效地滿足使用者的需求。這些新功能會帶來一個豐富的設計空間，讓資料系統發揮更多主導性：它們有能力根據轉換運算、資料特性和使用者意圖進行最佳化。我們將討論這個架構如何應用於實際任務，並提出這個雄心勃勃的研究議程的未來方向。
+
+##### **Towards a Design Guideline for RPA Evaluation: A Survey of Large Language Model-Based Role-Playing Agents**
+2502.13012v1 by Chaoran Chen, Bingsheng Yao, Ruishi Zou, Wenyue Hua, Weimin Lyu, Toby Jia-Jun Li, Dakuo Wang
+
+The Role-Playing Agent (RPA) is an increasingly popular type of LLM agent that
+simulates human-like behaviors in a variety of tasks. However, evaluating RPAs
+is challenging due to diverse task requirements and agent designs. This paper
+proposes an evidence-based, actionable, and generalizable evaluation design
+guideline for LLM-based RPAs by systematically reviewing 1,676 papers published
+between Jan. 2021 and Dec. 2024. Our analysis identifies six agent attributes,
+seven task attributes, and seven evaluation metrics from existing literature.
+Based on these findings, we present an RPA evaluation design guideline to help
+researchers develop more systematic and consistent evaluation methods.
+
+摘要：角色扮演代理（RPA）是一種越來越流行的 LLM 代理，它能模擬人類在各種任務中的行為。然而，由於任務需求和代理設計的多樣性，評估 RPA 具有挑戰性。本文通過系統地審查 2021 年 1 月至 2024 年 12 月期間發表的 1,676 篇論文，針對基於 LLM 的 RPA 提出了基於證據、可操作且可推廣的評估設計指南。我們的分析從現有文獻中識別出六個代理屬性、七個任務屬性和七個評估指標。根據這些發現，我們提出了 RPA 評估設計指南，以幫助研究人員開發更系統化和一致的評估方法。
+
+##### **Adaptive Knowledge Graphs Enhance Medical Question Answering: Bridging the Gap Between LLMs and Evolving Medical Knowledge**
+2502.13010v1 by Mohammad Reza Rezaei, Reza Saadati Fard, Jayson Parker, Rahul G. Krishnan, Milad Lankarany
+
+Large Language Models (LLMs) have significantly advanced medical
+question-answering by leveraging extensive clinical data and medical
+literature. However, the rapid evolution of medical knowledge and the
+labor-intensive process of manually updating domain-specific resources pose
+challenges to the reliability of these systems. To address this, we introduce
+Adaptive Medical Graph-RAG (AMG-RAG), a comprehensive framework that automates
+the construction and continuous updating of medical knowledge graphs,
+integrates reasoning, and retrieves current external evidence, such as PubMed
+and WikiSearch. By dynamically linking new findings and complex medical
+concepts, AMG-RAG not only improves accuracy but also enhances interpretability
+in medical queries.
+Evaluations on the MEDQA and MEDMCQA benchmarks demonstrate the effectiveness
+of AMG-RAG, achieving an F1 score of 74.1 percent on MEDQA and an accuracy of
+66.34 percent on MEDMCQA, outperforming both comparable models and those 10 to
+100 times larger. Notably, these improvements are achieved without increasing
+computational overhead, highlighting the critical role of automated knowledge
+graph generation and external evidence retrieval in delivering up-to-date,
+trustworthy medical insights.
+
+摘要：大型語言模型 (LLM) 透過利用廣泛的臨床資料和醫學文獻，大幅提升了醫療問題解答的進步。然而，醫療知識的快速演進和手動更新特定領域資源的繁複程序，對這些系統的可靠性構成挑戰。為了解決這個問題，我們引入了適應性醫療圖表 RAG (AMG-RAG)，這是一個自動化建構和持續更新醫療知識圖表的綜合架構，整合推理並擷取 PubMed 和 WikiSearch 等最新的外部證據。透過動態連結新的發現和複雜的醫療概念，AMG-RAG 不僅提升了準確性，也增強了醫療查詢的可解釋性。在 MEDQA 和 MEDMCQA 基準上的評量證明了 AMG-RAG 的有效性，在 MEDQA 上達到了 74.1% 的 F1 分數，在 MEDMCQA 上達到了 66.34% 的準確度，優於其他同類模型以及那些大 10 到 100 倍的模型。值得注意的是，這些改進是在不增加運算負擔的情況下實現的，突顯了自動化知識圖表生成和外部證據擷取在提供最新、可信賴的醫療見解中扮演的重要角色。
+
+##### **Integrating Reinforcement Learning, Action Model Learning, and Numeric Planning for Tackling Complex Tasks**
+2502.13006v1 by Yarin Benyamin, Argaman Mordoch, Shahaf S. Shperberg, Roni Stern
+
+Automated Planning algorithms require a model of the domain that specifies
+the preconditions and effects of each action. Obtaining such a domain model is
+notoriously hard. Algorithms for learning domain models exist, yet it remains
+unclear whether learning a domain model and planning is an effective approach
+for numeric planning environments, i.e., where states include discrete and
+numeric state variables. In this work, we explore the benefits of learning a
+numeric domain model and compare it with alternative model-free solutions. As a
+case study, we use two tasks in Minecraft, a popular sandbox game that has been
+used as an AI challenge. First, we consider an offline learning setting, where
+a set of expert trajectories is available to learn from. This is the standard
+setting for learning domain models. We used the Numeric Safe Action Model
+Learning (NSAM) algorithm to learn a numeric domain model and solve new
+problems with the learned domain model and a numeric planner. We call this
+model-based solution NSAM_(+p), and compare it to several model-free Imitation
+Learning (IL) and Offline Reinforcement Learning (RL) algorithms. Empirical
+results show that some IL algorithms can learn to solve simple tasks faster,
+while NSAM_(+p) allows solving tasks that require long-term planning and
+enables generalizing to solve problems in larger environments. Then, we
+consider an online learning setting, where learning is done by moving an agent
+in the environment. For this setting, we introduce RAMP. In RAMP, observations
+collected during the agent's execution are used to simultaneously train an RL
+policy and learn a planning domain action model. This forms a positive feedback
+loop between the RL policy and the learned domain model. We demonstrate
+experimentally the benefits of using RAMP, showing that it finds more efficient
+plans and solves more problems than several RL baselines.
+
+摘要：自動化規劃演算法需要一個網域模型，來指定每個動作的前提條件和效果。取得這樣的網域模型出了名的困難。學習網域模型的演算法確實存在，但學習網域模型和規劃是否為數值規劃環境的有效方法仍然不清楚，也就是說，其中狀態包含離散和數值狀態變數。在這項工作中，我們探討學習數值網域模型的優點，並將其與替代的無模型解決方案進行比較。作為一個案例研究，我們使用 Minecraft 中的兩個任務，Minecraft 是一個流行的沙盒遊戲，已被用作 AI 挑戰。首先，我們考慮離線學習設定，其中有一組專家軌跡可供學習。這是學習網域模型的標準設定。我們使用數值安全動作模型學習 (NSAM) 演算法來學習數值網域模型，並使用已學習的網域模型和數值規劃器解決新問題。我們稱此模型為基礎的解決方案 NSAM_(+p)，並將其與多種無模型模仿學習 (IL) 和離線強化學習 (RL) 演算法進行比較。經驗結果顯示，一些 IL 演算法可以更快地學習解決簡單任務，而 NSAM_(+p) 允許解決需要長期規劃的任務，並能夠推廣到在更大環境中解決問題。然後，我們考慮線上學習設定，其中學習是透過在環境中移動代理來完成的。對於此設定，我們引入了 RAMP。在 RAMP 中，在代理執行期間收集的觀察結果用於同時訓練 RL 政策和學習規劃網域動作模型。這在 RL 政策和已學習的網域模型之間形成了一個正向回饋迴路。我們透過實驗證明了使用 RAMP 的好處，顯示它比多個 RL 基準找到了更有效的計畫，並解決了更多問題。
+
+##### **Language Barriers: Evaluating Cross-Lingual Performance of CNN and Transformer Architectures for Speech Quality Estimation**
+2502.13004v1 by Wafaa Wardah, Tuğçe Melike Koçak Büyüktaş, Kirill Shchegelskiy, Sebastian Möller, Robert P. Spang
+
+Objective speech quality models aim to predict human-perceived speech quality
+using automated methods. However, cross-lingual generalization remains a major
+challenge, as Mean Opinion Scores (MOS) vary across languages due to
+linguistic, perceptual, and dataset-specific differences. A model trained
+primarily on English data may struggle to generalize to languages with
+different phonetic, tonal, and prosodic characteristics, leading to
+inconsistencies in objective assessments. This study investigates the
+cross-lingual performance of two speech quality models: NISQA, a CNN-based
+model, and a Transformer-based Audio Spectrogram Transformer (AST) model. Both
+models were trained exclusively on English datasets containing over 49,000
+speech samples and subsequently evaluated on speech in German, French,
+Mandarin, Swedish, and Dutch. We analyze model performance using Pearson
+Correlation Coefficient (PCC) and Root Mean Square Error (RMSE) across five
+speech quality dimensions: coloration, discontinuity, loudness, noise, and MOS.
+Our findings show that while AST achieves a more stable cross-lingual
+performance, both models exhibit noticeable biases. Notably, Mandarin speech
+quality predictions correlate highly with human MOS scores, whereas Swedish and
+Dutch present greater prediction challenges. Discontinuities remain difficult
+to model across all languages. These results highlight the need for more
+balanced multilingual datasets and architecture-specific adaptations to improve
+cross-lingual generalization.
+
+摘要：客觀語音品質模型旨在使用自動化方法預測人類感知的語音品質。然而，跨語言的概化仍然是一項重大挑戰，因為平均意見分數 (MOS) 會因語言的不同而有所不同，這是由於語言、感知和特定於資料集的差異所致。主要使用英語資料訓練的模型可能會難以概化到具有不同語音、聲調和韻律特徵的語言，導致客觀評估不一致。本研究探討了兩種語音品質模型的跨語言效能：基於 CNN 的 NISQA 模型和基於 Transformer 的音訊光譜 Transformer (AST) 模型。這兩種模型都僅使用包含超過 49,000 個語音範例的英語資料集進行訓練，然後在德語、法語、普通話、瑞典語和荷蘭語的語音上進行評估。我們使用皮爾森相關係數 (PCC) 和均方根誤差 (RMSE) 分析五個語音品質維度的模型效能：色彩、不連續性、響度、雜訊和 MOS。我們的研究結果顯示，儘管 AST 達到了更穩定的跨語言效能，但這兩種模型都表現出明顯的偏差。值得注意的是，普通話語音品質預測與人類 MOS 分數高度相關，而瑞典語和荷蘭語則呈現出更大的預測挑戰。不連續性在所有語言中仍然難以建模。這些結果凸顯了對更平衡的多語言資料集和特定於架構的調整的需求，以改善跨語言的概化。
+
+##### **You need to MIMIC to get FAME: Solving Meeting Transcript Scarcity with a Multi-Agent Conversations**
+2502.13001v1 by Frederic Kirstein, Muneeb Khan, Jan Philip Wahle, Terry Ruas, Bela Gipp
+
+Meeting summarization suffers from limited high-quality data, mainly due to
+privacy restrictions and expensive collection processes. We address this gap
+with FAME, a dataset of 500 meetings in English and 300 in German produced by
+MIMIC, our new multi-agent meeting synthesis framework that generates meeting
+transcripts on a given knowledge source by defining psychologically grounded
+participant profiles, outlining the conversation, and orchestrating a large
+language model (LLM) debate. A modular post-processing step refines these
+outputs, mitigating potential repetitiveness and overly formal tones, ensuring
+coherent, credible dialogues at scale. We also propose a psychologically
+grounded evaluation framework assessing naturalness, social behavior
+authenticity, and transcript difficulties. Human assessments show that FAME
+approximates real-meeting spontaneity (4.5/5 in naturalness), preserves
+speaker-centric challenges (3/5 in spoken language), and introduces richer
+information-oriented difficulty (4/5 in difficulty). These findings highlight
+that FAME is a good and scalable proxy for real-world meeting conditions. It
+enables new test scenarios for meeting summarization research and other
+conversation-centric applications in tasks requiring conversation data or
+simulating social scenarios under behavioral constraints.
+
+摘要：會議摘要因缺乏高品質資料而受限，主要是由於隱私限制和昂貴的收集程序。我們透過 FAME 來解決這個差距，FAME 是 MIMIC 製作的 500 場英文會議和 300 場德文會議的資料集，MIMIC 是我們新的多重代理會議合成架構，透過定義心理基礎的參與者設定檔、概述對話，並協調大型語言模型 (LLM) 辯論，在給定的知識來源上產生會議記錄。模組化後處理步驟會改善這些輸出，減輕潛在的重複性和過於正式的語氣，確保大規模的對話連貫且可信。我們也提出一個心理基礎的評估架構，評估自然性、社交行為真實性，以及記錄難度。人類評估顯示，FAME 近似於真實會議的即興性（自然性 4.5/5），保留以講者為中心的挑戰（口語 3/5），並引入更豐富的資訊導向難度（難度 4/5）。這些發現強調 FAME 是真實世界會議條件的良好且可擴充的代理。它能為會議摘要研究和其他對話為中心的應用程式啟用新的測試情境，在需要對話資料或在行為限制下模擬社交情境的任務中。
+
+##### **Personalized Top-k Set Queries Over Predicted Scores**
+2502.12998v1 by Sohrab Namazi Nia, Subhodeep Ghosh, Senjuti Basu Roy, Sihem Amer-Yahia
+
+This work studies the applicability of expensive external oracles such as
+large language models in answering top-k queries over predicted scores. Such
+scores are incurred by user-defined functions to answer personalized queries
+over multi-modal data. We propose a generic computational framework that
+handles arbitrary set-based scoring functions, as long as the functions could
+be decomposed into constructs, each of which is sent to an oracle (in our case an
+LLM) to predict partial scores. At a given point in time, the framework assumes
+a set of responses and their partial predicted scores, and it maintains a
+collection of possible sets that are likely to be the true top-k. Since calling
+oracles is costly, our framework judiciously identifies the next construct,
+i.e., the next best question to ask the oracle so as to maximize the likelihood
+of identifying the true top-k. We present a principled probabilistic model that
+quantifies that likelihood. We study efficiency opportunities in designing
+algorithms. We run an evaluation with three large scale datasets, scoring
+functions, and baselines. Experiments indicate the efficacy of our framework,
+as it achieves an order of magnitude improvement over baselines in the number of
+required LLM calls while ensuring result accuracy. Scalability experiments further
+indicate that our framework could be used in large-scale applications.
+
+摘要：本研究探討在預測分數中回答前 k 個查詢時，昂貴的外部預言（例如大型語言模型）的適用性。此類分數是由使用者定義的函式產生，用於回答多模態資料中的個人化查詢。我們提出一個通用的運算框架，用於處理任意基於集合的計分函式，只要這些函式可以分解為建構區塊，然後將每個建構區塊傳送給預言（在本例中為 LLM）以預測部分分數。在特定時間點，此框架假設一組回應及其部分預測分數，並維護一組可能成為真實前 k 個的集合。由於呼叫預言的成本很高，因此我們的框架會明智地找出下一個建構區塊，亦即下一個最佳問題，以詢問預言，以便最大化找出真實前 k 個的可能性。我們提出一個基於原理的機率模型，用於量化此可能性。我們研究設計演算法時的效率機會。我們針對三個大型資料集、計分函式和基準執行評估。實驗結果指出我們框架的效能，因為它在需要 LLM 呼叫的同時確保結果準確性，比基準進步了一個數量級。可擴充性實驗進一步指出我們的框架可用於大型應用程式。
+
+##### **Eager Updates For Overlapped Communication and Computation in DiLoCo**
+2502.12996v1 by Satyen Kale, Arthur Douillard, Yanislav Donchev
+
+Distributed optimization methods such as DiLoCo have been shown to be
+effective in training very large models across multiple distributed workers,
+such as datacenters. These methods split updates into two parts: an inner
+optimization phase, where the workers independently execute multiple
+optimization steps on their own local data, and an outer optimization step,
+where the inner updates are synchronized. While such approaches require orders
+of magnitude less communication than standard data-parallel training, in
+settings where the workers are datacenters, even the limited communication
+requirements of these approaches can still cause significant slow downs due to
+the blocking necessary at each outer optimization step. In this paper, we
+investigate techniques to mitigate this issue by overlapping communication with
+computation in a manner that allows the outer optimization step to fully
+overlap with the inner optimization phase. We show that a particular variant,
+dubbed eager updates, provides competitive performance with standard DiLoCo in
+settings with low bandwidth between workers.
+
+摘要：分散式優化方法（例如 DiLoCo）已被證明可有效訓練橫跨多個分散式工作者的超大型模型，例如資料中心。這些方法將更新拆分為兩部分：內部最佳化階段，其中工作者獨立地在自己的本地資料上執行多個最佳化步驟，以及外部最佳化步驟，其中內部更新會同步。雖然此類方法所需的通訊量比標準資料平行訓練少幾個數量級，但在工作者為資料中心的情況下，即使這些方法有限的通訊需求仍可能由於每個外部最佳化步驟所需的封鎖而導致顯著的減速。在本文中，我們探討了透過以允許外部最佳化步驟與內部最佳化階段完全重疊的方式將通訊與運算重疊，來減輕此問題的技術。我們展示了一個特定變體，稱為即時更新，在工作者之間頻寬較低的情況下，可提供與標準 DiLoCo 相當的效能。
+
+##### **Free Argumentative Exchanges for Explaining Image Classifiers**
+2502.12995v1 by Avinash Kori, Antonio Rago, Francesca Toni
+
+Deep learning models are powerful image classifiers but their opacity hinders
+their trustworthiness. Explanation methods for capturing the reasoning process
+within these classifiers faithfully and in a clear manner are scarce, due to
+their sheer complexity and size. We provide a solution for this problem by
+defining a novel method for explaining the outputs of image classifiers with
+debates between two agents, each arguing for a particular class. We obtain
+these debates as concrete instances of Free Argumentative eXchanges (FAXs), a
+novel argumentation-based multi-agent framework allowing agents to internalise
+opinions by other agents differently than originally stated. We define two
+metrics (consensus and persuasion rate) to assess the usefulness of FAXs as
+argumentative explanations for image classifiers. We then conduct a number of
+empirical experiments showing that FAXs perform well along these metrics as
+well as being more faithful to the image classifiers than conventional,
+non-argumentative explanation methods. All our implementations can be found at
+https://github.com/koriavinash1/FAX.
+
+摘要：深度學習模型是強大的影像分類器，但其不透明性阻礙了其可信度。由於其極高的複雜性和規模，忠實且清楚地捕捉這些分類器內部推理過程的解釋方法很少見。我們透過定義一種新穎的方法來解決這個問題，該方法透過兩個代理之間的辯論來解釋影像分類器的輸出，每個代理都主張一個特定類別。我們將這些辯論作為自由論證交換 (FAX) 的具體實例，這是一個新穎的基於論證的多代理架構，允許代理以不同於原始陳述的方式內化其他代理的意見。我們定義了兩個指標（共識率和說服率）來評估 FAX 作為影像分類器論證解釋的有用性。然後，我們進行了多項實證實驗，表明 FAX 在這些指標上表現良好，並且比傳統的非論證解釋方法更忠實於影像分類器。我們所有的實作都可以在 https://github.com/koriavinash1/FAX 中找到。
+
+##### **B-cos LM: Efficiently Transforming Pre-trained Language Models for Improved Explainability**
+2502.12992v1 by Yifan Wang, Sukrut Rao, Ji-Ung Lee, Mayank Jobanputra, Vera Demberg
+
+Post-hoc explanation methods for black-box models often struggle with
+faithfulness and human interpretability due to the lack of explainability in
+current neural models. Meanwhile, B-cos networks have been introduced to
+improve model explainability through architectural and computational
+adaptations, but their application has so far been limited to computer vision
+models and their associated training pipelines. In this work, we introduce
+B-cos LMs, i.e., B-cos networks empowered for NLP tasks. Our approach directly
+transforms pre-trained language models into B-cos LMs by combining B-cos
+conversion and task fine-tuning, improving efficiency compared to previous
+B-cos methods. Our automatic and human evaluation results demonstrate that
+B-cos LMs produce more faithful and human interpretable explanations than post
+hoc methods, while maintaining task performance comparable to conventional
+fine-tuning. Our in-depth analysis explores how B-cos LMs differ from
+conventionally fine-tuned models in their learning processes and explanation
+patterns. Finally, we provide practical guidelines for effectively building
+B-cos LMs based on our findings. Our code is available at
+https://anonymous.4open.science/r/bcos_lm.
+
+摘要：黑盒模型的事后解释方法通常会因为当前神经模型缺乏可解释性而难以做到忠实和人类可解释。与此同时，B-cos 网络已被引入，以通过架构和计算改编来提高模型的可解释性，但到目前为止，它们的应用仅限于计算机视觉模型及其相关的训练管道。在这项工作中，我们引入了 B-cos LM，即针对 NLP 任务增强的 B-cos 网络。我们的方法通过结合 B-cos 转换和任务微调，将预训练的语言模型直接转换为 B-cos LM，与以前 B-cos 方法相比，提高了效率。我们的自动和人工评估结果表明，与事后方法相比，B-cos LM 产生了更忠实和人类可解释的解释，同时保持与传统微调相当的任务性能。我们的深入分析探讨了 B-cos LM 在其学习过程和解释模式中与传统微调模型有何不同。最后，我们根据我们的发现提供了有效构建 B-cos LM 的实用指南。我们的代码可在 https://anonymous.4open.science/r/bcos_lm 获得。
+
+##### **Beyond Profile: From Surface-Level Facts to Deep Persona Simulation in LLMs**
+2502.12988v1 by Zixiao Wang, Duzhen Zhang, Ishita Agrawal, Shen Gao, Le Song, Xiuying Chen
+
+Previous approaches to persona simulation with large language models (LLMs) have
+typically relied on learning basic biographical information, or using limited
+role-play dialogue datasets to capture a character's responses. However, a
+holistic representation of an individual goes beyond surface-level facts or
+conversations to deeper thoughts and thinking. In this work, we introduce
+CharacterBot, a model designed to replicate both the linguistic patterns and
+distinctive thought processes of a character. Using Lu Xun, a renowned Chinese
+writer, as a case study, we propose four training tasks derived from his 17
+essay collections. These include a pre-training task focused on mastering
+external linguistic structures and knowledge, as well as three fine-tuning
+tasks: multiple-choice question answering, generative question answering, and
+style transfer, each aligning the LLM with Lu Xun's internal ideation and
+writing style. To optimize learning across these tasks, we introduce a CharLoRA
+parameter updating mechanism, where a general linguistic style expert
+collaborates with other task-specific experts to better study both the language
+style and the understanding of deeper thoughts. We evaluate CharacterBot on
+three tasks for linguistic accuracy and opinion comprehension, demonstrating
+that it significantly outperforms the baselines on our adapted metrics. We hope
+that this work inspires future research on deep character persona simulation
+LLM.
+
+摘要：以前對角色模擬大型語言模型 (LLM) 的方法通常依賴於學習基本傳記資訊，或使用有限的角色扮演對話資料集來捕捉角色的反應。然而，對個人的整體表徵超越了表面層面的事實或對話，深入到更深層的想法和思考。在這項工作中，我們引入了 CharacterBot，一個旨在複製角色的語言模式和獨特思考過程的模型。以著名的中國作家魯迅為案例研究，我們提出了四個從他的 17 篇散文集中衍生的訓練任務。其中包括一個預訓練任務，專注於掌握外部語言結構和知識，以及三個微調任務：多選題回答、生成式問答和風格轉移，每個任務都將 LLM 與魯迅的內部觀念和寫作風格相結合。為了優化這些任務的學習，我們引入了一個 CharLoRA 參數更新機制，其中一位通曉語言風格的專家與其他特定任務專家合作，以更好地研究語言風格和對深層思想的理解。我們在三項任務上評估了 CharacterBot 的語言準確性和意見理解，證明它在我們調整的指標上顯著優於基準。我們希望這項工作能激勵未來對深度角色角色模擬 LLM 的研究。
+
+##### **PartSDF: Part-Based Implicit Neural Representation for Composite 3D Shape Parametrization and Optimization**
+2502.12985v1 by Nicolas Talabot, Olivier Clerc, Arda Cinar Demirtas, Doruk Oner, Pascal Fua
+
+Accurate 3D shape representation is essential in engineering applications
+such as design, optimization, and simulation. In practice, engineering
+workflows require structured, part-aware representations, as objects are
+inherently designed as assemblies of distinct components. However, most
+existing methods either model shapes holistically or decompose them without
+predefined part structures, limiting their applicability in real-world design
+tasks. We propose PartSDF, a supervised implicit representation framework that
+explicitly models composite shapes with independent, controllable parts while
+maintaining shape consistency. Despite its simple single-decoder architecture,
+PartSDF outperforms both supervised and unsupervised baselines in
+reconstruction and generation tasks. We further demonstrate its effectiveness
+as a structured shape prior for engineering applications, enabling precise
+control over individual components while preserving overall coherence. Code
+available at https://github.com/cvlab-epfl/PartSDF.
+
+摘要：精確的 3D 形狀表示在工程應用中至關重要，例如設計、最佳化和模擬。實際上，工程工作流程需要結構化、零件感知的表示，因為物體本質上是設計為不同元件的組件。然而，大多數現有方法不是整體建模形狀，就是將其分解，而沒有預先定義的零件結構，這限制了它們在實際設計任務中的適用性。我們提出 PartSDF，一個監督式的隱式表示框架，它明確地使用獨立、可控的零件對複合形狀進行建模，同時保持形狀一致性。儘管其單一的解碼器架構很簡單，但 PartSDF 在重建和生成任務中都優於監督式和非監督式基準。我們進一步證明了其作為工程應用結構化形狀先驗的有效性，能夠精確控制各個元件，同時保持整體一致性。程式碼可在 https://github.com/cvlab-epfl/PartSDF 取得。
+
+##### **Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs**
+2502.12982v1 by Longxu Dou, Qian Liu, Fan Zhou, Changyu Chen, Zili Wang, Ziqi Jin, Zichen Liu, Tongyao Zhu, Cunxiao Du, Penghui Yang, Haonan Wang, Jiaheng Liu, Yongchi Zhao, Xiachong Feng, Xin Mao, Man Tsung Yeung, Kunat Pipatanakul, Fajri Koto, Min Si Thu, Hynek Kydlíček, Zeyi Liu, Qunshu Lin, Sittipong Sripaisarnmongkol, Kridtaphad Sae-Khow, Nirattisai Thongchim, Taechawat Konkaew, Narong Borijindargoon, Anh Dao, Matichon Maneegard, Phakphum Artkaew, Zheng-Xin Yong, Quan Nguyen, Wannaphong Phatthiyaphaibun, Hoang H. Tran, Mike Zhang, Shiqi Chen, Tianyu Pang, Chao Du, Xinyi Wan, Wei Lu, Min Lin
+
+Sailor2 is a family of cutting-edge multilingual language models for
+South-East Asian (SEA) languages, available in 1B, 8B, and 20B sizes to suit
+diverse applications. Building on Qwen2.5, Sailor2 undergoes continuous
+pre-training on 500B tokens (400B SEA-specific and 100B replay tokens) to
+support 13 SEA languages while retaining proficiency in Chinese and English.
+Sailor2-20B model achieves a 50-50 win rate against GPT-4o across SEA
+languages. We also deliver a comprehensive cookbook on how to develop the
+multilingual model in an efficient manner, including five key aspects: data
+curation, pre-training, post-training, model customization and evaluation. We
+hope that Sailor2 model (Apache 2.0 license) will drive language development in
+the SEA region, and Sailor2 cookbook will inspire researchers to build more
+inclusive LLMs for other under-served languages.
+
+摘要：Sailor2 是一系列針對東南亞 (SEA) 語言的尖端多語言語言模型，備有 1B、8B 和 20B 大小，以適應各種應用。在 Qwen2.5 的基礎上，Sailor2 持續進行 500B 詞元（400B SEA 專用詞元和 100B 重播詞元）的預訓練，以支援 13 種 SEA 語言，同時保留中文和英文的熟練度。Sailor2-20B 模型在 SEA 語言中對抗 GPT-4o 時，達到 50-50 的獲勝率。我們還提供一本全面的實作指南 (cookbook)，說明如何以有效的方式開發多語言模型，包括五個關鍵方面：資料策展、預訓練、後訓練、模型自訂和評估。我們希望 Sailor2 模型（Apache 2.0 授權）將推動 SEA 地區的語言發展，而 Sailor2 實作指南將激勵研究人員為其他服務不足的語言建立更具包容性的 LLM。
+
+##### **Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking**
+2502.12970v1 by Junda Zhu, Lingyong Yan, Shuaiqiang Wang, Dawei Yin, Lei Sha
+
+The reasoning abilities of Large Language Models (LLMs) have demonstrated
+remarkable advancement and exceptional performance across diverse domains.
+However, leveraging these reasoning capabilities to enhance LLM safety against
+adversarial attacks and jailbreak queries remains largely unexplored. To bridge
+this gap, we propose Reasoning-to-Defend (R2D), a novel training paradigm that
+integrates safety reflections of queries and responses into LLMs' generation
+process, unlocking a safety-aware reasoning mechanism. This approach enables
+self-evaluation at each reasoning step to create safety pivot tokens as
+indicators of the response's safety status. Furthermore, in order to improve
+the learning efficiency of pivot token prediction, we propose Contrastive Pivot
+Optimization (CPO), which enhances the model's ability to perceive the safety
+status of dialogues. Through this mechanism, LLMs dynamically adjust their
+response strategies during reasoning, significantly enhancing their defense
+capabilities against jailbreak attacks. Extensive experimental results
+demonstrate that R2D effectively mitigates various attacks and improves overall
+safety, highlighting the substantial potential of safety-aware reasoning in
+strengthening LLMs' robustness against jailbreaks.
+
+摘要：大型語言模型 (LLM) 的推理能力已展現出顯著的進步，並在不同的領域中表現出色。然而，利用這些推理能力來增強 LLM 對抗攻擊和越獄查詢的安全性仍然是未開發的領域。為了彌補這個差距，我們提出了推理防禦 (R2D)，這是一種新穎的訓練範例，它將查詢和回應的安全考量整合到 LLM 的生成過程中，開啟了一個安全感知推理機制。此方法可以在每個推理步驟中進行自我評估，以建立安全樞紐標記，作為回應安全狀態的指標。此外，為了提高樞紐標記預測的學習效率，我們提出了對比樞紐最佳化 (CPO)，它增強了模型感知對話安全狀態的能力。透過此機制，LLM 在推理過程中動態調整其回應策略，大幅增強其對抗越獄攻擊的防禦能力。廣泛的實驗結果證明，R2D 有效地減輕了各種攻擊，並改善了整體安全性，突顯了安全感知推理在加強 LLM 對抗越獄的穩健性方面的潛力。
+
+##### **A Survey of Text Classification Under Class Distribution Shift**
+2502.12965v1 by Adriana Valentina Costache, Silviu Florin Gheorghe, Eduard Gabriel Poesina, Paul Irofti, Radu Tudor Ionescu
+
+The basic underlying assumption of machine learning (ML) models is that the
+training and test data are sampled from the same distribution. However, in
+daily practice, this assumption is often broken, i.e., the distribution of the
+test data changes over time, which hinders the application of conventional ML
+models. One domain where the distribution shift naturally occurs is text
+classification, since people always find new topics to discuss. To this end, we
+survey research articles studying open-set text classification and related
+tasks. We divide the methods in this area based on the constraints that define
+the kind of distribution shift and the corresponding problem formulation,
+i.e., learning with the Universum, zero-shot learning, and open-set learning. We
+next discuss the predominant mitigation approaches for each problem setup.
+Finally, we identify several future work directions, aiming to push the
+boundaries beyond the state of the art. Interestingly, we find that continual
+learning can solve many of the issues caused by the shifting class
+distribution. We maintain a list of relevant papers at
+https://github.com/Eduard6421/Open-Set-Survey.
+
+摘要：機器學習 (ML) 模型的基本假設是訓練資料和測試資料取樣自同一個分佈。然而，在日常實務中，這個假設經常被打破，也就是說測試資料的分布會隨著時間改變，這會阻礙傳統 ML 模型的應用。分佈轉移自然發生的其中一個領域是文字分類，因為人們總能找到新的主題來討論。為此，我們調查研究開放集文字分類和相關任務的研究文章。我們根據定義分佈轉移的類型和對應問題公式的限制，將這個領域的方法分為：使用 Universum 學習、零次學習和開放集學習。接下來，我們討論每個問題設定的主要緩解方法。最後，我們找出幾個未來的研究方向，目標是將界線推展到現有技術的極限之外。有趣的是，我們發現持續學習可以解決許多由類別分佈轉移所造成的議題。我們在 https://github.com/Eduard6421/Open-Set-Survey 維護一份相關論文清單。
+
+##### **Trust Me, I'm Wrong: High-Certainty Hallucinations in LLMs**
+2502.12964v1 by Adi Simhi, Itay Itzhak, Fazl Barez, Gabriel Stanovsky, Yonatan Belinkov
+
+Large Language Models (LLMs) often generate outputs that lack grounding in
+real-world facts, a phenomenon known as hallucinations. Prior research has
+associated hallucinations with model uncertainty, leveraging this relationship
+for hallucination detection and mitigation. In this paper, we challenge the
+underlying assumption that all hallucinations are associated with uncertainty.
+Using knowledge detection and uncertainty measurement methods, we demonstrate
+that models can hallucinate with high certainty even when they have the correct
+knowledge. We further show that high-certainty hallucinations are consistent
+across models and datasets, distinctive enough to be singled out, and challenge
+existing mitigation methods. Our findings reveal an overlooked aspect of
+hallucinations, emphasizing the need to understand their origins and improve
+mitigation strategies to enhance LLM safety. The code is available at
+https://github.com/technion-cs-nlp/Trust_me_Im_wrong .
+
+摘要：大型語言模型 (LLM) 經常產生缺乏真實世界事實根據的輸出，這種現象稱為幻覺。先前的研究已將幻覺與模型不確定性聯繫起來，利用這種關係進行幻覺偵測和緩解。在本文中，我們挑戰所有幻覺都與不確定性相關的基本假設。使用知識偵測和不確定性測量方法，我們證明模型即使擁有正確的知識，也能以高度確定性產生幻覺。我們進一步表明，高確定性幻覺在模型和資料集之間是一致的，足夠獨特以至於可以單獨挑選出來，並挑戰現有的緩解方法。我們的研究結果揭示了幻覺的一個被忽視的方面，強調需要了解其起源並改進緩解策略以增強 LLM 安全性。可以在 https://github.com/technion-cs-nlp/Trust_me_Im_wrong 找到程式碼。
+
+##### **Infinite Retrieval: Attention Enhanced LLMs in Long-Context Processing**
+2502.12962v1 by Xiaoju Ye, Zhichun Wang, Jingyuan Wang
+
+Limited by the context window size of Large Language Models (LLMs), handling
+various tasks with input tokens exceeding the upper limit has been challenging,
+whether it is a simple direct retrieval task or a complex multi-hop reasoning
+task. Although various methods have been proposed to enhance the long-context
+processing capabilities of LLMs, they either incur substantial post-training
+costs, or require additional tool modules (e.g., RAG), or have not shown
+significant improvement in realistic tasks. Our work observes the correlation
+between the attention distribution and generated answers across each layer, and
+establishes that the attention allocation aligns with retrieval-augmented
+capabilities through experiments. Drawing on the above insights, we propose a
+novel method, InfiniRetri, that leverages the LLM's own attention information to
+enable accurate retrieval across inputs of infinite length. Our evaluations
+indicate that InfiniRetri achieves 100% accuracy in the
+Needle-In-a-Haystack (NIH) test over 1M tokens using a 0.5B parameter model,
+surpassing other methods or larger models and setting a new
+state-of-the-art (SOTA). Moreover, our method achieves significant performance
+improvements on real-world benchmarks, with a maximum 288% improvement. In
+addition, InfiniRetri can be applied to any Transformer-based LLM without
+additional training and substantially reduces inference latency and compute
+overhead in long texts. In summary, our comprehensive studies show
+InfiniRetri's potential for practical applications and create a paradigm for
+retrieving information using LLMs' own capabilities under infinite-length
+tokens. Code will be released in link.
+
+摘要：受限于大型语言模型 (LLM) 的上下文窗口大小，处理超出上限的输入标记的各种任务一直具有挑战性，无论是简单的直接检索任务还是复杂的多跳推理任务。虽然已经提出了各种方法来增强 LLM 的长上下文处理能力，但它们要么产生大量的后训练成本，要么需要额外的工具模块（例如，RAG），要么在实际任务中没有显示出显着的改进。我们的工作观察了每层注意力分布和生成答案之间的相关性，并通过实验建立了注意力分配与检索增强能力保持一致。根据上述见解，我们提出了一种新方法 InfiniRetri，该方法利用 LLM 自身的注意力信息来实现对无限长度输入的准确检索。我们的评估表明，InfiniRetri 在使用 0.5B 参数模型对超过 100 万个标记的针头干草堆 (NIH) 测试中实现了 100% 的准确率，超越了其他方法或更大的模型，并创造了新的最先进 (SOTA)。此外，我们的方法在实际基准上实现了显著的性能提升，最大提升了 288%。此外，InfiniRetri 可以应用于任何基于 Transformer 的 LLM，而无需额外的训练，并且可以大幅减少推理延迟和长文本中的计算开销。总之，我们的综合研究表明了 InfiniRetri 在实际应用中的潜力，并为使用 LLM 自身能力在无限长度标记下检索信息创造了一个范例。代码将在链接中发布。
+
+##### **Adaptive Tool Use in Large Language Models with Meta-Cognition Trigger**
+2502.12961v1 by Wenjun Li, Dexun Li, Kuicai Dong, Cong Zhang, Hao Zhang, Weiwen Liu, Yasheng Wang, Ruiming Tang, Yong Liu
+
+Large language models (LLMs) have shown remarkable emergent capabilities,
+transforming the execution of functional tasks by leveraging external tools for
+complex problems that require specialized processing or real-time data. While
+existing research expands LLMs' access to diverse tools (e.g., program
+interpreters, search engines, weather/map apps), the necessity of using these
+tools is often overlooked, leading to indiscriminate tool invocation. This
+naive approach raises two key issues: (1) increased delays due to unnecessary
+tool calls, and (2) potential errors resulting from faulty interactions with
+external tools. In this paper, we introduce meta-cognition as a proxy for LLMs'
+self-assessment of their capabilities, representing the model's awareness of
+its own limitations. Based on this, we propose MeCo, an adaptive
+decision-making strategy for external tool use. MeCo quantifies metacognitive
+scores by capturing high-level cognitive signals in the representation space,
+guiding when to invoke tools. Notably, MeCo is fine-tuning-free and incurs
+minimal cost. Our experiments show that MeCo accurately detects LLMs' internal
+cognitive signals and significantly improves tool-use decision-making across
+multiple base models and benchmarks.
+
+摘要：大型語言模型 (LLM) 已展現出顯著的新興能力，透過運用外部工具來執行功能任務，解決需要專業處理或即時資料的複雜問題，從而轉變任務的執行方式。儘管現有研究擴展了 LLM 對各種工具的存取（例如程式碼詮釋器、搜尋引擎、天氣/地圖應用程式），但使用這些工具的必要性往往被忽略，導致不加選擇地呼叫工具。這種天真的方法提出了兩個關鍵問題：(1) 由於不必要的工具呼叫而導致延遲增加，以及 (2) 由於與外部工具互動錯誤而導致的潛在錯誤。在本文中，我們將元認知引入作為 LLM 自我評估其能力的代理，代表模型意識到其自身的限制。基於此，我們提出了 MeCo，一種用於外部工具使用的適應性決策制定策略。MeCo 透過擷取表徵空間中的高階認知訊號來量化元認知分數，指導何時呼叫工具。值得注意的是，MeCo 是免微調的，而且成本極低。我們的實驗表明，MeCo 能夠準確地偵測 LLM 的內部認知訊號，並大幅改善跨多個基本模型和基準的工具使用決策制定。
+
+##### **AlignFreeze: Navigating the Impact of Realignment on the Layers of Multilingual Models Across Diverse Languages**
+2502.12959v1 by Steve Bakos, Félix Gaschi, David Guzmán, Riddhi More, Kelly Chutong Li, En-Shiun Annie Lee
+
+Realignment techniques are often employed to enhance cross-lingual transfer
+in multilingual language models. Still, they can sometimes degrade performance
+in languages that differ significantly from the fine-tuned source language.
+This paper introduces AlignFreeze, a method that freezes either the layers'
+lower half or upper half during realignment. Through controlled experiments on
+4 tasks, 3 models, and in 35 languages, we find that realignment affects all
+the layers but can be the most detrimental to the lower ones. Freezing the
+lower layers can prevent performance degradation. Particularly, AlignFreeze
+improves Part-of-Speech (PoS) tagging performances in languages where full
+realignment fails: with XLM-R, it provides improvements of more than one
+standard deviation in accuracy in seven more languages than full realignment.
+
+摘要：重新對齊技術通常用於增強多語言語言模型中的跨語言轉移，然而，它們有時會降低與微調源語言顯著不同的語言的效能。本文介紹了 AlignFreeze，一種在重新對齊期間凍結層的下半部或上半部的方法。透過 4 項任務、3 個模型和 35 種語言的受控實驗，我們發現重新對齊會影響所有層，但對較低層的影響最大。凍結較低層可以防止效能下降。特別是，AlignFreeze 改善了在完全重新對齊失敗的語言中的詞性 (PoS) 標記效能：使用 XLM-R，它在準確度上取得超過一個標準差改進的語言數量，比完全重新對齊多出七種。
+
+##### **Task-Informed Anti-Curriculum by Masking Improves Downstream Performance on Text**
+2502.12953v1 by Andrei Jarca, Florinel Alin Croitoru, Radu Tudor Ionescu
+
+Masked language modeling has become a widely adopted unsupervised technique
+to pre-train language models. However, the process of selecting tokens for
+masking is random, and the percentage of masked tokens is typically fixed for
+the entire training process. In this paper, we propose to adjust the masking
+ratio and to decide which tokens to mask based on a novel task-informed
+anti-curriculum learning scheme. First, we harness task-specific knowledge
+about useful and harmful tokens in order to determine which tokens to mask.
+Second, we propose a cyclic decaying masking ratio, which corresponds to an
+anti-curriculum schedule (from hard to easy). We exemplify our novel
+task-informed anti-curriculum by masking (TIACBM) approach across three diverse
+downstream tasks: sentiment analysis, text classification by topic, and
+authorship attribution. Our findings suggest that TIACBM enhances the ability
+of the model to focus on key task-relevant features, contributing to
+statistically significant performance gains across tasks. We release our code
+at https://github.com/JarcaAndrei/TIACBM.
+
+摘要：遮蔽語言模型已成為一種廣泛採用的無監督技術，用於預先訓練語言模型。然而，選擇用於遮蔽的詞彙的過程是隨機的，且遮蔽詞彙的百分比通常在整個訓練過程中是固定的。在本文中，我們建議調整遮蔽率，並根據一種新穎的任務資訊反課程學習方案來決定要遮蔽哪些詞彙。首先，我們利用任務特定的知識，了解有用的和有害的詞彙，以確定要遮蔽哪些詞彙。其次，我們提出一個循環遞減遮蔽率，這對應於一個反課程表（從難到易）。我們以三項不同的下游任務為例，說明我們新穎的任務資訊反課程遮蔽（TIACBM）方法：情緒分析、按主題分類文字，以及作者歸屬。我們的研究結果表明，TIACBM 增強了模型專注於關鍵任務相關特徵的能力，有助於在各項任務中獲得具有統計意義的效能提升。我們在 https://github.com/JarcaAndrei/TIACBM 釋出我們的程式碼。
+
+##### **Fake It Till You Make It: Using Synthetic Data and Domain Knowledge for Improved Text-Based Learning for LGE Detection**
+2502.12948v1 by Athira J Jacob, Puneet Sharma, Daniel Rueckert
+
+Detection of hyperenhancement from cardiac LGE MRI images is a complex task
+requiring significant clinical expertise. Although deep learning-based models
+have shown promising results for the task, they require large amounts of data
+with fine-grained annotations. Clinical reports generated for cardiac MR
+studies contain rich, clinically relevant information, including the location,
+extent and etiology of any scars present. Although recently developed
+CLIP-based training enables pretraining models with image-text pairs, it
+requires large amounts of data and further finetuning strategies on downstream
+tasks. In this study, we use various strategies rooted in domain knowledge to
+train a model for LGE detection solely using text from clinical reports, on a
+relatively small clinical cohort of 965 patients. We improve performance
+through the use of synthetic data augmentation, by systematically creating scar
+images and associated text. In addition, we standardize the orientation of the
+images in an anatomy-informed way to enable better alignment of spatial and
+text features. We also use a captioning loss to enable fine-grained supervision
+and explore the effect of pretraining of the vision encoder on performance.
+Finally, ablation studies are carried out to elucidate the contributions of
+each design component to the overall performance of the model.
+
+摘要：從心臟 LGE MRI 影像偵測出過度增強是一項複雜的任務，需要顯著的臨床專業知識。儘管基於深度學習的模型已顯示出對這項任務有前景的結果，但它們需要大量具有細緻註解的資料。為心臟 MR 研究產生的臨床報告包含豐富且臨床上相關的資訊，包括任何疤痕的位置、範圍和病因。儘管最近開發的基於 CLIP 的訓練能使用影像文字對預訓練模型，但它需要大量資料和進一步微調下游任務的策略。在這項研究中，我們使用植基於領域知識的各種策略，僅使用來自臨床報告的文字，在一個相對較小的 965 名患者臨床群體中訓練一個 LGE 偵測模型。我們透過使用合成資料擴充來改善效能，系統性地建立疤痕影像和相關文字。此外，我們以解剖學告知的方式標準化影像方向，以使空間和文字特徵能更好地對齊。我們也使用標題損失來啟用細緻的監督，並探討視覺編碼器的預訓練對效能的影響。最後，進行消融研究以闡明每個設計元件對模型整體效能的貢獻。
+
+##### **Every Expert Matters: Towards Effective Knowledge Distillation for Mixture-of-Experts Language Models**
+2502.12947v1 by Gyeongman Kim, Gyouk Chu, Eunho Yang
+
+With the emergence of Mixture-of-Experts (MoE), the efficient scaling of
+model size has accelerated the development of large language models in recent
+years. However, their high memory requirements prevent their use in
+resource-constrained environments. While knowledge distillation (KD) has been a
+proven method for model compression, its application to MoE teacher models
+remains underexplored. Through our investigation, we discover that
+non-activated experts in MoE models possess valuable knowledge that benefits
+student models. We further demonstrate that existing KD methods are not optimal
+for compressing MoE models, as they fail to leverage this knowledge
+effectively. To address this, we propose two intuitive MoE-specific KD methods
+for the first time: Knowledge Augmentation (KA) and Student-Aware Router (SAR),
+both designed to effectively extract knowledge from all experts. Specifically,
+KA augments knowledge by sampling experts multiple times, while SAR uses all
+experts and adjusts the expert weights through router training to provide
+optimal knowledge. Extensive experiments show that our methods outperform
+conventional KD methods, demonstrating their effectiveness for MoE teacher
+models.
+
+摘要：隨著 Mixture-of-Experts (MoE) 的出現，模型規模的有效擴展加速了近年來大型語言模型的發展。然而，它們的高記憶體需求會阻礙它們在資源受限的環境中使用。雖然知識蒸餾 (KD) 已被證明是一種模型壓縮的方法，但它在 MoE 教師模型中的應用仍未被充分探索。透過我們的調查，我們發現 MoE 模型中未被啟用的專家擁有有價值的知識，這些知識對學生模型有益。我們進一步證明，現有的 KD 方法並非壓縮 MoE 模型的最佳方法，因為它們無法有效利用這些知識。為了解決這個問題，我們首次提出兩種直觀的 MoE 專用 KD 方法：知識擴充 (KA) 和學生感知路由器 (SAR)，兩者都旨在從所有專家有效提取知識。具體來說，KA 透過多次抽樣專家來擴充知識，而 SAR 使用所有專家並透過路由器訓練調整專家權重以提供最佳知識。廣泛的實驗表明，我們的方法優於傳統的 KD 方法，證明了它們對 MoE 教師模型的有效性。
+
+##### **LLMPopcorn: An Empirical Study of LLMs as Assistants for Popular Micro-video Generation**
+2502.12945v1 by Junchen Fu, Xuri Ge, Kaiwen Zheng, Ioannis Arapakis, Xin Xin, Joemon M. Jose
+
+Popular Micro-videos, dominant on platforms like TikTok and YouTube, hold
+significant commercial value. The rise of high-quality AI-generated content has
+spurred interest in AI-driven micro-video creation. However, despite the
+advanced capabilities of large language models (LLMs) like ChatGPT and DeepSeek
+in text generation and reasoning, their potential to assist the creation of
+popular micro-videos remains largely unexplored.
+  In this paper, we conduct an empirical study on LLM-assisted popular
+micro-video generation (LLMPopcorn). Specifically, we investigate the following
+research questions: (i) How can LLMs be effectively utilized to assist popular
+micro-video generation? (ii) To what extent can prompt-based enhancements
+optimize the LLM-generated content for higher popularity? (iii) How well do
+various LLMs and video generators perform in the popular micro-video generation
+task? By exploring these questions, we show that advanced LLMs like DeepSeek-V3
+enable micro-video generation to achieve popularity comparable to human-created
+content. Prompt enhancements further boost popularity, and benchmarking
+highlights DeepSeek-V3 and DeepSeek-R1 among LLMs, while LTX-Video and
+HunyuanVideo lead in video generation. This pioneering work advances
+AI-assisted micro-video creation, uncovering new research opportunities. We
+will release the code and datasets to support future studies.
+
+摘要：在 TikTok 和 YouTube 等平台上流行的微影片具有重要的商业价值。高质量 AI 生成的内容的兴起激发了人们对 AI 驱动的微影片创作的兴趣。然而，尽管大型语言模型 (LLM) 如 ChatGPT 和 DeepSeek 在文本生成和推理方面的能力很强，但它们在辅助创建流行微影片方面的潜力在很大程度上仍未得到探索。在本文中，我们对 LLM 辅助的流行微影片生成 (LLMPopcorn) 进行了实证研究。具体来说，我们调查了以下研究问题：(i) 如何有效利用 LLM 来辅助流行微影片生成？(ii) 基于提示的增强在多大程度上可以优化 LLM 生成的内容以获得更高的流行度？(iii) 各种 LLM 和视频生成器在流行微视频生成任务中表现如何？通过探索这些问题，我们表明了像 DeepSeek-V3 这样的高级 LLM 使微视频生成能够达到与人类创作的内容相当的流行度。提示增强进一步提高了受欢迎程度，并且基准测试突出了 LLM 中的 DeepSeek-V3 和 DeepSeek-R1，而 LTX-Video 和 HunyuanVideo 在视频生成中领先。这项开创性的工作推进了人工智能辅助的微视频创作，发现了新的研究机会。我们将发布代码和数据集以支持未来的研究。
+
+##### **Synthetic Data Generation for Culturally Nuanced Commonsense Reasoning in Low-Resource Languages**
+2502.12932v1 by Salsabila Zahirah Pranida, Rifo Ahmad Genadi, Fajri Koto
+
+Quantifying reasoning capability in low-resource languages remains a
+challenge in NLP due to data scarcity and limited access to annotators. While
+LLM-assisted dataset construction has proven useful for medium- and
+high-resource languages, its effectiveness in low-resource languages,
+particularly for commonsense reasoning, is still unclear. In this paper, we
+compare three dataset creation strategies: (1) LLM-assisted dataset generation,
+(2) machine translation, and (3) human-written data by native speakers, to
+build a culturally nuanced story comprehension dataset. We focus on Javanese
+and Sundanese, two major local languages in Indonesia, and evaluate the
+effectiveness of open-weight and closed-weight LLMs in assisting dataset
+creation through extensive manual validation. To assess the utility of
+synthetic data, we fine-tune language models on classification and generation
+tasks using this data and evaluate performance on a human-written test set. Our
+findings indicate that LLM-assisted data creation outperforms machine
+translation.
+
+摘要：由於資料稀少且標註者有限，量化低資源語言中的推理能力在自然語言處理中仍然是一項挑戰。雖然 LLM 輔助的資料集建構已被證明對中高資源語言有用，但其在低資源語言中的有效性，特別是對於常識推理，仍然不清楚。在本文中，我們比較了三種資料集建立策略：(1) LLM 輔助的資料集生成，(2) 機器翻譯，以及 (3) 母語人士撰寫的人工資料，以建立具有文化細微差的故事理解資料集。我們專注於爪哇語和巽他語，這兩種印尼的主要地方語言，並透過廣泛的手動驗證評估開放權重和封閉權重 LLM 在協助資料集建立中的有效性。為了評估合成資料的效用，我們使用這些資料對分類和生成任務進行語言模型微調，並在人工撰寫的測試集上評估效能。我們的研究結果表明，LLM 輔助的資料建立優於機器翻譯。
+
+##### **Flow-of-Options: Diversified and Improved LLM Reasoning by Thinking Through Options**
+2502.12929v1 by Lakshmi Nair, Ian Trase, Mark Kim
+
+We present a novel reasoning approach called Flow-of-Options (FoO), designed
+to address intrinsic biases in Large Language Models (LLMs). FoO enables LLMs
+to systematically explore a diverse range of possibilities in their reasoning,
+as demonstrated by an FoO-based agentic system for autonomously solving Machine
+Learning tasks (AutoML). Our framework outperforms state-of-the-art baselines,
+achieving improvements of 38.2% - 69.2% on standard data science tasks, and
+37.4% - 47.9% on therapeutic chemistry tasks. With an overall operation cost
+under $1 per task, our framework is well-suited for cost-sensitive
+applications. Beyond classification and regression, we illustrate the broader
+applicability of our FoO-based agentic system to tasks such as reinforcement
+learning and image generation. Our framework presents significant advancements
+compared to current state-of-the-art agentic systems for AutoML, due to the
+benefits of FoO in enforcing diversity in LLM solutions through compressed,
+explainable representations that also support long-term memory when combined
+with case-based reasoning.
+
+摘要：我們提出了一種稱為選項流 (FoO) 的新推理方法，旨在解決大型語言模型 (LLM) 中的內在偏差。FoO 使 LLM 能系統性地探索其推理中的各種可能性，這由一個基於 FoO 的代理系統展示，該系統可自主解決機器學習任務 (AutoML)。我們的框架優於最先進的基準，在標準數據科學任務上取得了 38.2% - 69.2% 的改進，在治療化學任務上取得了 37.4% - 47.9% 的改進。由於每個任務的整體運營成本低於 1 美元，因此我們的框架非常適合對成本敏感的應用。除了分類和回歸之外，我們還說明了基於 FoO 的代理系統在強化學習和圖像生成等任務中的更廣泛適用性。我們的框架與當前最先進的 AutoML 代理系統相比具有顯著的進步，這是因為 FoO 在通過壓縮、可解釋的表示強制 LLM 解決方案的多樣性方面具有優勢，這些表示與基於案例的推理結合時還支持長期記憶。
+
+##### **Finedeep: Mitigating Sparse Activation in Dense LLMs via Multi-Layer Fine-Grained Experts**
+2502.12928v1 by Leiyu Pan, Zhenpeng Su, Minxuan Lv, Yizhe Xiong, Xiangwen Zhang, Zijia Lin, Hui Chen, Jungong Han, Guiguang Ding, Cheng Luo, Di Zhang, Kun Gai, Deyi Xiong
+
+Large language models have demonstrated exceptional performance across a wide
+range of tasks. However, dense models usually suffer from sparse activation,
+where many activation values tend towards zero (i.e., being inactivated). We
+argue that this could restrict the efficient exploration of model
+representation space. To mitigate this issue, we propose Finedeep, a
+deep-layered fine-grained expert architecture for dense models. Our framework
+partitions the feed-forward neural network layers of traditional dense models
+into small experts and arranges them across multiple sub-layers. A novel routing
+mechanism is proposed to determine each expert's contribution. We conduct
+extensive experiments across various model sizes, demonstrating that our
+approach significantly outperforms traditional dense architectures in terms of
+perplexity and benchmark performance while maintaining a comparable number of
+parameters and floating-point operations. Moreover, we find that Finedeep
+achieves optimal results when balancing depth and width, specifically by
+adjusting the number of expert sub-layers and the number of experts per
+sub-layer. Empirical results confirm that Finedeep effectively alleviates
+sparse activation and efficiently utilizes representation capacity in dense
+models.
+
+摘要：大型語言模型在各種任務中展現出非凡的效能。然而，密集模型通常會出現稀疏激活，其中許多激活值趨近於零（即處於非激活狀態）。我們認為這可能會限制模型表示空間的有效探索。為了減輕這個問題，我們提出 Finedeep，這是一種針對密集模型的深度分層細粒度專家架構。我們的框架將傳統密集模型的前饋神經網路層分割成小型專家，並將它們排列在多個子層中。我們提出了一種新穎的路由機制來確定每個專家的貢獻。我們針對各種模型大小進行了廣泛的實驗，證明我們的做法在困惑度和基準效能方面顯著優於傳統的密集架構，同時保持了相當數量的參數和浮點運算。此外，我們發現 Finedeep 在平衡深度和廣度時可以達到最佳結果，特別是透過調整專家子層的數量和每個子層的專家數量。實證結果證實，Finedeep 有效地減輕了稀疏激活，並有效利用了密集模型中的表示能力。
+
+##### **SEFL: Harnessing Large Language Model Agents to Improve Educational Feedback Systems**
+2502.12927v1 by Mike Zhang, Amalie Pernille Dilling, Léon Gondelman, Niels Erik Ruan Lyngdorf, Euan D. Lindsay, Johannes Bjerva
+
+Providing high-quality feedback is crucial for student success but is
+constrained by time, cost, and limited data availability. We introduce
+Synthetic Educational Feedback Loops (SEFL), a novel framework designed to
+deliver immediate, on-demand feedback at scale without relying on extensive,
+real-world student data. In SEFL, two large language models (LLMs) operate in
+teacher-student roles to simulate assignment completion and formative
+feedback, generating abundant synthetic pairs of student work and corresponding
+critiques. We then fine-tune smaller, more computationally efficient LLMs on
+these synthetic pairs, enabling them to replicate key features of high-quality,
+goal-oriented feedback. Unlike personalized tutoring approaches that offer
+multi-turn, individualized instruction, SEFL specifically focuses on
+replicating the teacher-to-student feedback loop for diverse assignments.
+Through both LLM-as-a-judge and human evaluations, we demonstrate that
+SEFL-tuned models outperform their non-tuned counterparts in feedback quality,
+clarity, and timeliness. These findings reveal SEFL's potential to transform
+feedback processes for higher education and beyond, offering an ethical and
+scalable alternative to conventional manual feedback cycles.
+
+摘要：提供高品質的回饋對於學生的成功至關重要，但受到時間、成本和資料取得有限的限制。我們引入了合成教育回饋迴圈 (SEFL)，這是一個新穎的架構，旨在提供立即且依需求的回饋，且無需仰賴大量的真實世界學生資料。在 SEFL 中，兩個大型語言模型 (LLM) 以師生角色運作，模擬作業完成和形成性回饋，產生大量的合成學生作業和對應的評論。然後我們針對這些合成配對微調較小、計算效率較高的 LLM，讓它們能夠複製高品質、目標導向回饋的主要特徵。與提供多回合、個別化教學的個人化輔導方法不同，SEFL 特別專注於複製適用於各種作業的教師-->學生回饋迴圈。透過 LLM 作為評審和人類評估，我們證明了 SEFL 微調模型在回饋品質、清晰度和時效性方面優於未微調的模型。這些發現揭示了 SEFL 轉變高等教育及其他領域回饋流程的潛力，提供了一個符合道德且可擴充的替代方案，取代傳統的手動回饋週期。
+
+##### **Towards more Contextual Agents: An extractor-Generator Optimization Framework**
+2502.12926v1 by Mourad Aouini, Jinan Loubani
+
+Large Language Model (LLM)-based agents have demonstrated remarkable success
+in solving complex tasks across a wide range of general-purpose applications.
+However, their performance often degrades in context-specific scenarios, such
+as specialized industries or research domains, where the absence of
+domain-relevant knowledge leads to imprecise or suboptimal outcomes. To address
+this challenge, our work introduces a systematic approach to enhance the
+contextual adaptability of LLM-based agents by optimizing their underlying
+prompts, critical components that govern agent behavior, roles, and
+interactions. Manually crafting optimized prompts for context-specific tasks is
+labor-intensive, error-prone, and lacks scalability. In this work, we introduce
+an Extractor-Generator framework designed to automate the optimization of
+contextual LLM-based agents. Our method operates through two key stages: (i)
+feature extraction from a dataset of gold-standard input-output examples, and
+(ii) prompt generation via a high-level optimization strategy that iteratively
+identifies underperforming cases and applies self-improvement techniques. This
+framework substantially improves prompt adaptability by enabling more precise
+generalization across diverse inputs, particularly in context-specific tasks
+where maintaining semantic consistency and minimizing error propagation are
+critical for reliable performance. Although developed with single-stage
+workflows in mind, the approach naturally extends to multi-stage workflows,
+offering broad applicability across various agent-based systems. Empirical
+evaluations demonstrate that our framework significantly enhances the
+performance of prompt-optimized agents, providing a structured and efficient
+approach to contextual LLM-based agents.
+
+摘要：大型語言模型 (LLM) 為基礎的代理已展現出非凡的成功，
+能解決廣泛一般用途應用程式的複雜任務。
+然而，它們的效能通常會在特定情境中下降，例如專門產業或研究領域，
+其中缺乏與領域相關知識會導致不精確或次佳的結果。為了解決
+這項挑戰，我們的研究引進了一種系統化的方法來增強 LLM 為基礎的代理的
+情境適應性，方法是最佳化它們的基礎提示，這些提示是決定代理行為、角色和
+互動的重要組成部分。手動製作最佳化的提示以應對特定情境的任務既費時又容易出錯，而且缺乏可擴充性。在這項研究中，我們引進
+一個萃取產生器架構，旨在自動化情境 LLM 為基礎代理的最佳化。我們的
+方法透過兩個關鍵階段運作：(i) 從黃金標準輸入輸出範例的資料集萃取特徵，以及
+(ii) 透過高階最佳化策略產生提示，此策略會反覆找出表現不佳的案例並套用自我改善技術。此
+架構大幅改善了提示適應性，讓它能針對不同的輸入進行更精確的概括，特別是在情境特定任務中，在這些任務中，維持語意一致性和將錯誤傳播降至最低對於可靠的效能至關重要。儘管是針對單階段工作流程開發，但此方法自然能延伸至多階段工作流程，在各種基於代理的系統中提供廣泛的適用性。實證評估顯示，我們的架構大幅增強了提示最佳化代理的效能，為基於情境的 LLM 代理提供了一個結構化且有效率的方法。
+
+##### **Keep what you need : extracting efficient subnetworks from large audio representation models**
+2502.12925v1 by David Genova, Philippe Esling, Tom Hurlin
+
+Recently, research on audio foundation models has witnessed notable advances,
+as illustrated by the ever improving results on complex downstream tasks.
+Subsequently, those pretrained networks have quickly been used for various
+audio applications. These improvements have however resulted in a considerable
+increase both in size and complexity of these models. Beyond the environmental
+concerns this issue raises, this prevents the deployment of such networks on
+consumer-level devices, and precludes their use for real-time applications.
+Moreover, this appears contradictory with the specificity of the tasks for
+which these models are used, which are often simpler compared to extracting a
+rich, multi-purpose representation from any type of audio data. In this paper,
+we address this issue with a simple, yet effective method to extract
+lightweight specialist subnetworks from large foundation models. Specifically,
+we introduce learnable binary masks in-between the layers of a pretrained
+representation model. When training the end-to-end model on a downstream task,
+we add a sparsity-inducing loss to the overall objective, hence learning a
+compact subnetwork specialized on a single task. Importantly, the weights of
+the foundation model are kept frozen, resulting in low additional training
+costs. Once trained, the masked computational units can then be removed from
+the network, implying significant performance gains. We assess our method on
+three widespread audio foundation models, each based on a different backbone
+architecture, and illustrate its effectiveness on common audio representation
+evaluation tasks, as well as its versatility across speech, music, and general
+audio. Code for reproducing the results and supporting webpage are available at
+https://github.com/gnvIRCAM/Audio-representation-trimming
+
+摘要：近期，音频基础模型的研究取得了显著进展，复杂的下游任务上不断提升的结果证明了这一点。随后，这些预训练网络已迅速用于各种音频应用程序。然而，这些改进导致了这些模型的尺寸和复杂性都大幅增加。除了由此产生的环境问题外，这也阻止了此类网络在消费者级设备上的部署，并排除了它们在实时应用程序中的使用。此外，这似乎与这些模型的使用任务的特殊性相矛盾，与从任何类型的音频数据中提取丰富的多用途表示相比，这些任务通常更简单。在本文中，我们通过一种简单但有效的方法来解决此问题，从大型基础模型中提取轻量级专家子网络。具体来说，我们在预训练表示模型的层之间引入了可学习的二进制掩码。当在某个下游任务上训练端到端模型时，我们在总体目标中添加了稀疏性诱导损失，从而学习到专门用于单个任务的紧凑型子网络。重要的是，基础模型的权重保持冻结，从而导致额外的训练成本低。一旦训练完成，就可以从网络中移除掩码的计算单元，这意味着性能将大幅提升。我们对三个广泛使用的音频基础模型评估了我们的方法，每个模型都基于不同的骨干架构，并说明了其在常见音频表示评估任务上的有效性，以及其在语音、音乐和通用音频上的多功能性。用于重现结果的代码和支持网页可在 https://github.com/gnvIRCAM/Audio-representation-trimming 获得。
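
The pruning step described in this abstract (removing masked units once training is done) can be illustrated with a toy example. This is only a minimal sketch under assumptions of our own: a two-layer ReLU network stored as plain lists, a binary mask over hidden units that is taken as already learned, and hypothetical helper names `masked_forward` and `prune`. It is not the authors' implementation, which places masks between the layers of a pretrained audio model and learns them with a sparsity-inducing loss.

```python
def relu(value):
    return max(0.0, value)


def masked_forward(w1, w2, mask, x):
    """Forward pass y = W2 @ (mask * relu(W1 @ x)) using plain Python lists."""
    hidden = [
        m * relu(sum(w * xi for w, xi in zip(row, x)))
        for row, m in zip(w1, mask)
    ]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w2]


def prune(w1, w2, mask):
    """Drop hidden units whose mask entry is 0 (hypothetical helper).

    A masked unit always outputs 0, so its row in W1 and its column in W2
    can be removed without changing the network output.
    """
    keep = [i for i, m in enumerate(mask) if m == 1]
    pruned_w1 = [w1[i] for i in keep]
    pruned_w2 = [[row[i] for i in keep] for row in w2]
    return pruned_w1, pruned_w2


# Toy weights and an assumed, already-learned mask (illustrative values only).
w1 = [[1.0, -1.0], [0.5, 0.5], [2.0, 0.0]]
w2 = [[1.0, 2.0, 3.0]]
mask = [1, 0, 1]
x = [1.0, 2.0]

small_w1, small_w2 = prune(w1, w2, mask)
full_output = masked_forward(w1, w2, mask, x)
small_output = masked_forward(small_w1, small_w2, [1] * len(small_w1), x)
print(full_output, small_output)
```

The two outputs match because the removed unit contributed nothing; the pruned network simply performs fewer multiplications.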
+
+##### **Conditioning LLMs to Generate Code-Switched Text: A Methodology Grounded in Naturally Occurring Data**
+2502.12924v1 by Maite Heredia, Gorka Labaka, Jeremy Barnes, Aitor Soroa
+
+Code-switching (CS) is still a critical challenge in Natural Language
+Processing (NLP). Current Large Language Models (LLMs) struggle to interpret
+and generate code-switched text, primarily due to the scarcity of large-scale
+CS datasets for training. This paper presents a novel methodology to generate
+CS data using LLMs and tests it on the English-Spanish language pair. We
+propose back-translating natural CS sentences into monolingual English, and
+using the resulting parallel corpus to fine-tune LLMs to turn monolingual
+sentences into CS. Unlike previous approaches to CS generation, our methodology
+uses natural CS data as a starting point, allowing models to learn its natural
+distribution beyond grammatical patterns. We thoroughly analyse the models'
+performance through a study on human preferences, a qualitative error analysis
+and an evaluation with popular automatic metrics. Results show that our
+methodology generates fluent code-switched text, expanding research
+opportunities in CS communication, and that traditional metrics do not
+correlate with human judgement when assessing the quality of the generated CS
+data. We release our code and generated dataset under a CC-BY-NC-SA license.
+
+摘要：代碼轉換（CS）在自然語言處理（NLP）中仍是一個嚴峻的挑戰。目前的巨量語言模型（LLM）難以解讀和生成代碼轉換文字，主要是因為缺乏用於訓練的大規模 CS 資料集。本文提出了一種使用 LLM 生成 CS 資料的新方法，並在英語-西班牙語語言對上進行測試。我們建議將自然 CS 句子反向翻譯成單語英語，並使用產生的平行語料庫微調 LLM，將單語句子轉換為 CS。與先前的 CS 生成方法不同，我們的技術使用自然 CS 資料作為起點，讓模型能夠學習其超越語法模式的自然分佈。我們透過研究人類偏好、定性錯誤分析和使用流行的自動化指標進行評估，徹底分析模型的效能。結果顯示，我們的技術可以生成流利的代碼轉換文字，擴展 CS 溝通的研究機會，而且在評估生成的 CS 資料品質時，傳統指標與人類判斷無關。我們在 CC-BY-NC-SA 授權下釋出我們的程式碼和生成的資料集。
+
+##### **On-Device LLMs for Home Assistant: Dual Role in Intent Detection and Response Generation**
+2502.12923v1 by Rune Birkmose, Nathan Mørkeberg Reece, Esben Hofstedt Norvin, Johannes Bjerva, Mike Zhang
+
+This paper investigates whether Large Language Models (LLMs), fine-tuned on
+synthetic but domain-representative data, can perform the twofold task of (i)
+slot and intent detection and (ii) natural language response generation for a
+smart home assistant, while running solely on resource-limited, CPU-only edge
+hardware. We fine-tune LLMs to produce both JSON action calls and text
+responses. Our experiments show that 16-bit and 8-bit quantized variants
+preserve high accuracy on slot and intent detection and maintain strong
+semantic coherence in generated text, whereas the 4-bit model, although retaining
+generative fluency, suffers a noticeable drop in device-service classification
+accuracy. Further evaluations on noisy human (non-synthetic) prompts and
+out-of-domain intents confirm the models' generalization ability, obtaining
+around 80-86% accuracy. While the average inference time is 5-6 seconds per
+query (acceptable for one-shot commands but suboptimal for multi-turn
+dialogue), our results affirm that an on-device LLM can effectively unify
+command interpretation and flexible response generation for home automation
+without relying on specialized hardware.
+
+摘要：本文探討微調於合成但具領域代表性的資料上的大型語言模型 (LLM)，是否能執行 (i) 槽位和意圖偵測，以及 (ii) 自然語言回應產生的雙重任務，同時僅在資源受限、僅 CPU 的邊緣硬體上執行。我們微調 LLM 以產生 JSON 動作呼叫和文字回應。我們的實驗顯示，16 位元和 8 位元量化的變體在槽位和意圖偵測上保持高準確度，並在產生的文字中維持強大的語意一致性，而 4 位元模型雖然保有生成流暢度，但在裝置服務分類準確度上卻有明顯下降。進一步對有雜訊的人類 (非合成) 提示和領域外意圖的評估，證實了模型的泛化能力，獲得約 80--86% 的準確度。雖然平均推論時間為每個查詢 5--6 秒，對於一次性命令來說是可以接受的，但對於多輪對話來說並不理想，但我們的結果證實，裝置上的 LLM 可以有效地統一命令解譯和彈性回應產生，以進行家庭自動化，而無需依賴專用硬體。
+
+##### **Q-STRUM Debate: Query-Driven Contrastive Summarization for Recommendation Comparison**
+2502.12921v1 by George-Kirollos Saad, Scott Sanner
+
+Query-driven recommendation with unknown items poses a challenge for users to
+understand why certain items are appropriate for their needs. Query-driven
+Contrastive Summarization (QCS) is a methodology designed to address this issue
+by leveraging language-based item descriptions to clarify contrasts between
+them. However, existing state-of-the-art contrastive summarization methods such
+as STRUM-LLM fall short of this goal. To overcome these limitations, we
+introduce Q-STRUM Debate, a novel extension of STRUM-LLM that employs
+debate-style prompting to generate focused and contrastive summarizations of
+item aspects relevant to a query. Leveraging modern large language models
+(LLMs) as powerful tools for generating debates, Q-STRUM Debate provides
+enhanced contrastive summaries. Experiments across three datasets demonstrate
+that Q-STRUM Debate yields significant performance improvements over existing
+methods on key contrastive summarization criteria, thus introducing a novel and
+performant debate prompting methodology for QCS.
+
+摘要：以未知項目進行的查詢驅動推薦對使用者來說是一項挑戰，他們難以理解為何某些項目適合自己的需求。查詢驅動對比摘要 (QCS) 是一種方法，旨在透過利用基於語言的項目描述來釐清項目之間的對比，以解決這個問題。然而，現有的最先進對比摘要方法（例如 STRUM-LLM）並未達成此目標。為了克服這些限制，我們引進 Q-STRUM Debate，一種 STRUM-LLM 的新延伸，它採用辯論式提示來產生與查詢相關的項目面向的重點式對比摘要。透過利用現代大型語言模型 (LLM) 作為產生辯論的強大工具，Q-STRUM Debate 提供增強的對比摘要。透過三個資料集的實驗證明，Q-STRUM Debate 在關鍵的對比摘要標準上，比現有方法有顯著的效能改善，因此為 QCS 引進一種新穎且高性能的辯論提示方法。
+
+##### **GSQ-Tuning: Group-Shared Exponents Integer in Fully Quantized Training for LLMs On-Device Fine-tuning**
+2502.12913v1 by Sifan Zhou, Shuo Wang, Zhihang Yuan, Mingjia Shi, Yuzhang Shang, Dawei Yang
+
+Large Language Model (LLM) fine-tuning technologies have achieved
+remarkable results. However, traditional LLM fine-tuning approaches face
+significant challenges: they require large Floating Point (FP) computation,
+raising privacy concerns when handling sensitive data, and are impractical for
+resource-constrained edge devices. While Parameter-Efficient Fine-Tuning (PEFT)
+techniques reduce trainable parameters, their reliance on floating-point
+arithmetic creates fundamental incompatibilities with edge hardware. In this
+work, we introduce a novel framework for on-device LLM fine-tuning that
+eliminates the need for floating-point operations in both inference and
+training, named GSQ-Tuning. At its core is the Group-Shared Exponents Integer
+format, which efficiently represents model parameters in integer format using
+shared exponents among parameter groups. When combined with LoRA-like adapters,
+this enables fully integer-based fine-tuning that is both memory and compute
+efficient. We demonstrate that our approach achieves accuracy comparable to
+FP16-based fine-tuning while significantly reducing memory usage (50%).
+Moreover, compared to FP8, our method reduces power consumption by 5x and chip
+area by 11x at the same performance, making large-scale model adaptation feasible
+on edge devices.
+
+摘要：大型语言模型 (LLM) 微调技术已取得显著成果。然而，传统的 LLM 微调方法面临着严峻的挑战：它们需要大量的浮点 (FP) 计算，在处理敏感数据时会引发隐私问题，并且对于资源受限的边缘设备而言不切实际。虽然参数高效微调 (PEFT) 技术减少了可训练参数，但它们对浮点运算的依赖与边缘硬件产生了根本上的不兼容性。在这项工作中，我们引入了一个用于设备上 LLM 微调的新框架，该框架消除了推理和训练中对浮点运算的需求，名为 GSQ-Tuning。其核心是组共享指数整数格式，该格式使用参数组之间的共享指数以整数格式有效地表示模型参数。当与类似 LoRA 的适配器相结合时，这实现了完全基于整数的微调，既节省内存又节省计算。我们证明了我们的方法实现了与基于 FP16 的微调相当的准确性，同时显著减少了内存使用量 (50%)。此外，与 FP8 相比，我们的方法可以在相同的性能下减少 5 倍的功耗和 11 倍的芯片面积，从而使大规模模型适应在边缘设备上成为可能。
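
To make the idea of integers that share an exponent within a parameter group more concrete, here is a minimal sketch. It assumes a power-of-two shared exponent per group, signed 8-bit integers, and hypothetical function names (`quantize_group`, `dequantize_group`, `quantize`); the actual Group-Shared Exponents Integer format in the paper may differ in how exponents, group sizes, and rounding are chosen.

```python
import math


def quantize_group(values, bits=8):
    """Quantize one group of floats to signed integers with a shared exponent.

    Every value in the group is stored as an integer q such that
    value ~= q * 2**exponent (an assumed layout, for illustration only).
    """
    qmax = 2 ** (bits - 1) - 1  # 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return [0] * len(values), 0
    # Smallest exponent for which the largest magnitude still fits in qmax.
    exponent = math.ceil(math.log2(max_abs / qmax))
    scale = 2.0 ** exponent
    ints = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return ints, exponent


def dequantize_group(ints, exponent):
    """Recover approximate floats from integers and their shared exponent."""
    return [q * 2.0 ** exponent for q in ints]


def quantize(values, group_size=4, bits=8):
    """Split a flat parameter list into groups and quantize each one."""
    return [
        quantize_group(values[start:start + group_size], bits)
        for start in range(0, len(values), group_size)
    ]


weights = [0.5, -1.0, 0.25, 0.75, 0.01, 0.02, -0.03, 0.005]
for ints, exponent in quantize(weights):
    print(ints, exponent, dequantize_group(ints, exponent))
```

Because each group picks its own exponent, a group of small weights keeps fine resolution even when another group contains much larger values, while only one exponent per group has to be stored.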
+
+##### **Knapsack Optimization-based Schema Linking for LLM-based Text-to-SQL Generation**
+2502.12911v1 by Zheng Yuan, Hao Chen, Zijin Hong, Qinggang Zhang, Feiran Huang, Xiao Huang
+
+Generating SQLs from user queries is a long-standing challenge, where the
+accuracy of initial schema linking significantly impacts subsequent SQL
+generation performance. However, current schema linking models still struggle
+with missing relevant schema elements or an excess of redundant ones. A crucial
+reason for this is that commonly used metrics, recall and precision, fail to
+capture missing relevant elements and thus cannot reflect actual schema linking
+performance. Motivated by this, we propose an enhanced schema linking metric by
+introducing a restricted missing indicator. Accordingly, we introduce Knapsack
+optimization-based Schema Linking Agent (KaSLA), a plug-in schema linking agent
+designed to avoid missing relevant schema elements while minimizing
+the inclusion of redundant ones. KaSLA employs a hierarchical linking strategy
+that first identifies the optimal table linking and subsequently links columns
+within the selected table to reduce the linking candidate space. In each linking
+process, it utilizes a knapsack optimization approach to link potentially
+relevant elements while accounting for a limited tolerance of potentially
+redundant ones. With this optimization, KaSLA-1.6B achieves superior schema
+linking results compared to large-scale LLMs, including deepseek-v3 with a
+state-of-the-art (SOTA) schema linking method. Extensive experiments on Spider
+and BIRD benchmarks verify that KaSLA can significantly improve the SQL
+generation performance of SOTA text-to-SQL models by substituting their schema
+linking processes.
+
+摘要：從使用者查詢中產生 SQL 是個長期的挑戰，其中初始架構連結的準確性會顯著影響後續 SQL 產生效能。然而，目前的架構連結模型仍難以處理遺漏相關架構元素或過多重複元素的問題。造成此問題的一個關鍵原因是，常用的指標召回率和精確度無法捕捉遺漏相關元素，因此無法反映實際的架構連結效能。有鑑於此，我們提出一個增強的架構連結指標，透過引入受限遺漏指標。因此，我們介紹基於背包最佳化的架構連結代理 (KaSLA)，這是一個外掛式架構連結代理，旨在防止遺漏相關架構元素，同時將重複元素的納入降至最低。KaSLA 採用分層連結策略，首先找出最佳的表格連結，然後連結所選表格中的欄位，以減少連結候選空間。在每個連結過程中，它利用背包最佳化方法連結潛在相關元素，同時考量對潛在重複元素的容忍度。透過此最佳化，KaSLA-1.6B 達到優於大規模 LLM 的架構連結結果，包括採用最先進 (SOTA) 架構連結方法的 deepseek-v3。在 Spider 和 BIRD 基準上的廣泛實驗驗證，KaSLA 可透過取代其架構連結流程，大幅提升 SOTA 文字轉 SQL 模型的 SQL 產生效能。
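
The knapsack view of schema linking can be sketched with a standard 0/1 knapsack. The scores, weights, capacity, and the function name `knapsack_select` below are illustrative assumptions: each candidate element gets a relevance score (its value) and an integer redundancy cost (its weight), and the capacity plays the role of the limited tolerance for redundant elements. KaSLA's real scoring models and objective are not specified in the abstract, so this should not be read as its implementation.

```python
def knapsack_select(items, capacity):
    """Pick the elements with the highest total relevance within a weight budget.

    items: list of (name, relevance, weight) with non-negative integer weights.
    Returns (best_total_relevance, tuple_of_selected_names).
    """
    # best[c] holds the best (relevance, names) using total weight <= c.
    best = [(0.0, ()) for _ in range(capacity + 1)]
    for name, relevance, weight in items:
        # Go through capacities downward so each element is used at most once.
        for c in range(capacity, weight - 1, -1):
            candidate = best[c - weight][0] + relevance
            if candidate > best[c][0]:
                best[c] = (candidate, best[c - weight][1] + (name,))
    return best[capacity]


# Hypothetical column candidates: (name, relevance score, redundancy weight).
candidates = [
    ("orders.id", 0.9, 1),
    ("orders.total", 0.8, 2),
    ("users.name", 0.3, 2),
    ("users.email", 0.2, 1),
]
total, selected = knapsack_select(candidates, capacity=3)
print(selected, round(total, 2))
```

In a hierarchical setup like the one the abstract describes, the same routine could be applied first over tables and then over the columns of the selected tables, which keeps each candidate list small.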
+
+##### **Graph Neural Networks for Databases: A Survey**
+2502.12908v1 by Ziming Li, Youhuan Li, Yuyu Luo, Guoliang Li, Chuxu Zhang
+
+Graph neural networks (GNNs) are powerful deep learning models for
+graph-structured data, demonstrating remarkable success across diverse domains.
+Recently, the database (DB) community has increasingly recognized the
+potential of GNNs, prompting a surge of research focusing on improving
+database systems through GNN-based approaches. However, despite notable
+advances, there is a lack of a comprehensive review and understanding of how
+GNNs could improve DB systems. Therefore, this survey aims to bridge this gap
+by providing a structured and in-depth overview of GNNs for DB systems.
+Specifically, we propose a new taxonomy that classifies existing methods into
+two key categories: (1) Relational Databases, which includes tasks like
+performance prediction, query optimization, and text-to-SQL, and (2) Graph
+Databases, addressing challenges like efficient graph query processing and
+graph similarity computation. We systematically review key methods in each
+category, highlighting their contributions and practical implications. Finally,
+we suggest promising avenues for integrating GNNs into database systems.
+
+摘要：圖形神經網路 (GNN) 是用於圖形結構資料的強大深度學習模型，在各種領域中展現出顯著的成功。最近，資料庫 (DB) 社群越來越認識到 GNN 的潛力，促使大量研究專注於透過基於 GNN 的方法來改善資料庫系統。然而，儘管有顯著的進展，但對於 GNN 如何改善資料庫系統，仍然缺乏全面的回顧和理解。因此，本調查旨在透過提供 GNN 在資料庫系統中的結構化且深入的概觀來彌補這個差距。具體來說，我們提出了一個新的分類法，將現有方法分類為兩個主要類別：(1) 關係資料庫，其中包括效能預測、查詢最佳化和文字轉 SQL 等任務，以及 (2) 圖形資料庫，用於處理高效圖形查詢處理和圖形相似度計算等挑戰。我們系統性地回顧了每個類別中的關鍵方法，重點說明其貢獻和實務意涵。最後，我們建議將 GNN 整合到資料庫系統中的有希望途徑。
+
+##### **Fraud-R1 : A Multi-Round Benchmark for Assessing the Robustness of LLM Against Augmented Fraud and Phishing Inducements**
+2502.12904v1 by Shu Yang, Shenzhe Zhu, Zeyu Wu, Keyu Wang, Junchi Yao, Junchao Wu, Lijie Hu, Mengdi Li, Derek F. Wong, Di Wang
+
+We introduce Fraud-R1, a benchmark designed to evaluate LLMs' ability to
+defend against internet fraud and phishing in dynamic, real-world scenarios.
+Fraud-R1 comprises 8,564 fraud cases sourced from phishing scams, fake job
+postings, social media, and news, categorized into 5 major fraud types. Unlike
+previous benchmarks, Fraud-R1 introduces a multi-round evaluation pipeline to
+assess LLMs' resistance to fraud at different stages, including credibility
+building, urgency creation, and emotional manipulation. Furthermore, we
+evaluate 15 LLMs under two settings: 1. Helpful-Assistant, where the LLM
+provides general decision-making assistance, and 2. Role-play, where the model
+assumes a specific persona, widely used in real-world agent-based interactions.
+Our evaluation reveals the significant challenges in defending against fraud
+and phishing inducement, especially in role-play settings and fake job
+postings. Additionally, we observe a substantial performance gap between
+Chinese and English, underscoring the need for improved multilingual fraud
+detection capabilities.
+
+摘要：我們推出 Fraud-R1，一個基準，旨在評估 LLM 在動態、真實世界場景中防範網路詐騙和網路釣魚的能力。Fraud-R1 包含 8,564 起詐騙案例，來源包括網路釣魚詐騙、虛假職缺、社群媒體和新聞，分類為 5 種類型的主要詐騙手法。與先前的基準不同，Fraud-R1 引入多輪評估管道，以評估 LLM 在不同階段對詐騙的抵抗力，包括建立信譽、製造急迫感和情感操縱。此外，我們在兩種設定下評估 15 個 LLM：1. 協助助理，其中 LLM 提供一般決策協助，以及 2. 角色扮演，其中模型假設特定角色，廣泛用於現實世界中基於代理的互動。我們的評估揭示了在防範詐騙和網路釣魚誘導方面面臨的重大挑戰，尤其是在角色扮演設定和虛假職缺中。此外，我們觀察到中文和英文之間有顯著的效能差距，這凸顯了改進多語言詐騙偵測功能的必要性。
+
+##### **Soundwave: Less is More for Speech-Text Alignment in LLMs**
+2502.12900v1 by Yuhao Zhang, Zhiheng Liu, Fan Bu, Ruiyu Zhang, Benyou Wang, Haizhou Li
+
+Existing end-to-end speech large language models (LLMs) usually rely on
+large-scale annotated data for training, while data-efficient training has not
+been discussed in depth. We focus on two fundamental problems between speech
+and text: the representation space gap and sequence length inconsistency. We
+propose Soundwave, which utilizes an efficient training strategy and a novel
+architecture to address these issues. Results show that Soundwave outperforms
+the advanced Qwen2-Audio in speech translation and AIR-Bench speech tasks,
+using only one-fiftieth of the training data. Further analysis shows that
+Soundwave still retains its intelligence during conversation. The project is
+available at https://github.com/FreedomIntelligence/Soundwave.
+
+摘要：現有的端對端語音大型語言模型 (LLM) 通常依賴於大規模註釋資料進行訓練，而資料有效率的訓練尚未深入探討。我們專注於語音和文字之間的兩個基本問題：表示空間差距和序列長度不一致。我們提出 Soundwave，它利用高效的訓練策略和新穎的架構來解決這些問題。結果顯示，Soundwave 在語音翻譯和 AIR-Bench 語音任務中優於進階的 Qwen2-Audio，僅使用五十分之一的訓練資料。進一步的分析顯示，Soundwave 在對話中仍能保持其智慧。專案可於 https://github.com/FreedomIntelligence/Soundwave 取得。
+
+##### **None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks**
+2502.12896v1 by Eva Sánchez Salido, Julio Gonzalo, Guillermo Marco
+
+In LLM evaluations, reasoning is often distinguished from recall/memorization
+by performing numerical variations to math-oriented questions. Here we
+introduce a general variation method for multiple-choice questions that
+completely dissociates the correct answer from previously seen tokens or
+concepts, requiring LLMs to understand and reason (rather than memorize) in
+order to answer correctly. Using this method, we evaluate state-of-the-art
+proprietary and open-source LLMs on two datasets available in English and
+Spanish: the public MMLU benchmark and the private UNED-Access 2024 dataset.
+Results show that all models experience remarkable accuracy drops under our
+proposed variation, with an average loss of 57% on MMLU and 50% on UNED-Access
+2024, ranging from 10% to 93% across models. Notably, the most accurate model
+in our experimentation (OpenAI-o3-mini) is not the most robust
+(DeepSeek-R1-70B), suggesting that the best models in standard evaluations may
+not be the ones with better reasoning capabilities. Also, we see larger
+accuracy drops in public (vs private) datasets and questions posed in their
+original language (vs a manual translation), which are signs of contamination
+and also point to a relevant role of recall/memorization in current LLMs'
+answers.
+
+摘要：在 LLM 評估中，推理通常透過對數學導向問題進行數值變異來區別於回憶/記憶。在此，我們引入一種通用變異方法，適用於多選題，它將正確答案與先前看到的代幣或概念完全區分開來，要求 LLM 理解和推理（而不是記憶），以便正確回答。使用此方法，我們在英語和西班牙語中評估了兩種數據集中的最先進的專有和開源 LLM：公共 MMLU 基準和私有 UNED-Access 2024 數據集。結果表明，在我們提出的變異下，所有模型的準確度都出現顯著下降，在 MMLU 上平均損失 57%，在 UNED-Access 2024 上平均損失 50%，在不同模型中範圍從 10% 到 93%。值得注意的是，我們實驗中最準確的模型（OpenAI-o3-mini）並不是最穩健的模型（DeepSeek-R1-70B），這表明標準評估中最好的模型可能不是推理能力最強的模型。此外，我們看到公共（相對於私有）數據集和以原始語言提出的問題（相對於人工翻譯）的準確度下降幅度更大，這是汙染的跡象，也表明回憶/記憶在當前 LLM 的答案中發揮著相關作用。
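
One plausible reading of the variation described in this abstract is that the text of the correct option is replaced by a "None of the others" choice, which then becomes the right answer. The sketch below illustrates that reading only; the function name `none_of_the_others`, the placeholder wording, and the sample question are assumptions, and the authors' exact procedure may differ.

```python
def none_of_the_others(options, correct_index, placeholder="None of the others"):
    """Return a varied copy of a multiple-choice item.

    The original correct option text is removed and its slot is filled with
    the placeholder, so a model can only answer correctly by ruling out every
    remaining option instead of recognizing a memorized answer string.
    """
    varied = list(options)
    varied[correct_index] = placeholder
    return varied, correct_index


# Hypothetical sample item, for illustration only.
question = "Which planet is closest to the Sun?"
options = ["Venus", "Mercury", "Mars", "Earth"]
varied_options, answer_index = none_of_the_others(options, correct_index=1)
print(question)
for label, text in zip("ABCD", varied_options):
    print(f"{label}. {text}")
```

The original list is left unchanged, so the standard and varied versions of each item can be scored side by side to measure the accuracy drop.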
+
+##### **Multilingual European Language Models: Benchmarking Approaches and Challenges**
+2502.12895v1 by Fabio Barth, Georg Rehm
+
+The breakthrough of generative large language models (LLMs) that can solve
+different tasks through chat interaction has led to a significant increase in
+the use of general benchmarks to assess the quality or performance of these
+models beyond individual applications. There is also a need for better methods
+to evaluate and also to compare models due to the ever increasing number of new
+models published. However, most of the established benchmarks revolve around
+the English language. This paper analyses the benefits and limitations of
+current evaluation datasets, focusing on multilingual European benchmarks. We
+analyse seven multilingual benchmarks and identify four major challenges.
+Furthermore, we discuss potential solutions to enhance translation quality and
+mitigate cultural biases, including human-in-the-loop verification and
+iterative translation ranking. Our analysis highlights the need for culturally
+aware and rigorously validated benchmarks to assess the reasoning and
+question-answering capabilities of multilingual LLMs accurately.
+
+摘要：生成式大型語言模型 (LLM) 的突破，它能透過聊天互動解決不同任務，這導致使用一般基準來評估這些模型在個別應用程式以外的品質或效能大幅增加。由於已發布的新模型數量不斷增加，因此也有必要採用更好的方法來評估模型並進行比較。然而，大多數已建立的基準都圍繞著英語。本文分析了目前評估資料集的優點和限制，重點放在多語言歐洲基準。我們分析了七個多語言基準，並找出四個主要的挑戰。此外，我們討論了增強翻譯品質和減輕文化偏見的潛在解決方案，包括人為迴圈驗證和反覆翻譯排名。我們的分析突顯了對文化意識和嚴格驗證的基準的需求，以準確評估多語言 LLM 的推理和問答能力。
+
+##### **H-CoT: Hijacking the Chain-of-Thought Safety Reasoning Mechanism to Jailbreak Large Reasoning Models, Including OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking**
+2502.12893v1 by Martin Kuo, Jianyi Zhang, Aolin Ding, Qinsi Wang, Louis DiValentin, Yujia Bao, Wei Wei, Da-Cheng Juan, Hai Li, Yiran Chen
+
+Large Reasoning Models (LRMs) have recently extended their powerful reasoning
+capabilities to safety checks, using chain-of-thought reasoning to decide
+whether a request should be answered. While this new approach offers a
+promising route for balancing model utility and safety, its robustness remains
+underexplored. To address this gap, we introduce Malicious-Educator, a
+benchmark that disguises extremely dangerous or malicious requests beneath
+seemingly legitimate educational prompts. Our experiments reveal severe
+security flaws in popular commercial-grade LRMs, including OpenAI o1/o3,
+DeepSeek-R1, and Gemini 2.0 Flash Thinking. For instance, although OpenAI's o1
+model initially maintains a high refusal rate of about 98%, subsequent model
+updates significantly compromise its safety; and attackers can easily extract
+criminal strategies from DeepSeek-R1 and Gemini 2.0 Flash Thinking without any
+additional tricks. To further highlight these vulnerabilities, we propose
+Hijacking Chain-of-Thought (H-CoT), a universal and transferable attack method
+that leverages the model's own displayed intermediate reasoning to jailbreak
+its safety reasoning mechanism. Under H-CoT, refusal rates sharply
+decline, dropping from 98% to below 2%, and, in some instances, even transform
+initially cautious tones into ones that are willing to provide harmful content.
+We hope these findings underscore the urgent need for more robust safety
+mechanisms to preserve the benefits of advanced reasoning capabilities without
+compromising ethical standards.
+
+摘要：大型推理模型 (LRM) 最近將其強大的推理能力擴展到安全檢查，使用思維鏈推理來決定是否應回答請求。雖然這種新方法為平衡模型實用性和安全性提供了一條有希望的途徑，但其穩健性仍未得到充分探索。為了解決這一差距，我們引入了 Malicious-Educator，這是一個基準，它將極其危險或惡意的請求偽裝在看似合法的教育提示之下。我們的實驗揭示了流行的商業級 LRM 中嚴重的安全缺陷，包括 OpenAI o1/o3、DeepSeek-R1 和 Gemini 2.0 Flash Thinking。例如，儘管 OpenAI 的 o1 模型最初保持約 98% 的高拒絕率，但後續的模型更新顯著損害了其安全性；攻擊者可以輕鬆地從 DeepSeek-R1 和 Gemini 2.0 Flash Thinking 中提取犯罪策略，而無需任何額外的技巧。為了進一步強調這些漏洞，我們提出了劫持思維鏈 (H-CoT)，這是一種通用且可轉移的攻擊方法，它利用模型自己顯示的中間推理來越獄其安全推理機制。在 H-CoT 下，拒絕率急劇下降，從 98% 降至 2% 以下，在某些情況下，甚至將最初謹慎的語氣轉變為願意提供有害內容的語氣。我們希望這些發現強調了對更強大的安全機制的迫切需要，以保留先進推理能力的好處，同時不損害道德標準。
+
+##### **Are Multilingual Language Models an Off-ramp for Under-resourced Languages? Will we arrive at Digital Language Equality in Europe in 2030?**
+2502.12886v1 by Georg Rehm, Annika Grützner-Zahn, Fabio Barth
+
+Large language models (LLMs) demonstrate unprecedented capabilities and
+define the state of the art for almost all natural language processing (NLP)
+tasks and also for essentially all Language Technology (LT) applications. LLMs
+can only be trained for languages for which a sufficient amount of pre-training
+data is available, effectively excluding many languages that are typically
+characterised as under-resourced. However, there is both circumstantial and
+empirical evidence that multilingual LLMs, which have been trained using data
+sets that cover multiple languages (including under-resourced ones), do exhibit
+strong capabilities for some of these under-resourced languages. Eventually,
+this approach may have the potential to be a technological off-ramp for those
+under-resourced languages for which "native" LLMs, and LLM-based technologies,
+cannot be developed due to a lack of training data. This paper, which
+concentrates on European languages, examines this idea, analyses the current
+situation in terms of technology support and summarises related work. The
+article concludes by focusing on the key open questions that need to be
+answered for the approach to be put into practice in a systematic way.
+
+摘要：大型語言模型 (LLM) 展現前所未有的能力，並定義了幾乎所有自然語言處理 (NLP) 任務以及所有語言技術 (LT) 應用的最新技術。LLM 只能針對有足夠預訓練資料可用的語言進行訓練，實際上排除了許多通常被歸類為資源不足的語言。然而，有環境和經驗證據顯示，多語言 LLM 已使用涵蓋多種語言（包括資源不足的語言）的資料集進行訓練，確實對其中一些資源不足的語言展現出強大的能力。最終，這種方法可能具有成為那些由於缺乏訓練資料而無法開發「原生」LLM 和基於 LLM 的技術的資源不足語言的技術跳板的潛力。本文專注於歐洲語言，探討這個想法，分析技術支援方面的現狀，並總結相關工作。本文最後專注於必須回答的主要開放性問題，以便系統性地實踐這種方法。
+
+##### **How desirable is alignment between LLMs and linguistically diverse human users?**
+2502.12884v1 by Pia Knoeferle, Sebastian Möller, Dorothea Kolossa, Veronika Solopova, Georg Rehm
+
+We discuss how desirable it is that Large Language Models (LLMs) be able to
+adapt or align their language behavior with users who may be diverse in their
+language use. User diversity may arise from, among other factors, i) age
+differences, ii) gender characteristics, and/or iii) multilingual experience,
+and associated differences in language processing and use. We consider
+potential consequences for usability, communication, and LLM development.
+
+摘要：我們探討大型語言模型 (LLM) 能夠適應或調整其語言行為，以適應語言使用可能多樣化的使用者，這有多麼可取。使用者多樣性可能出於以下原因而產生：i) 年齡差異；ii) 性別特徵，和/或 iii) 多語言經驗，以及語言處理和使用上的相關差異。我們考慮對可用性、溝通和 LLM 開發的潛在後果。
+
+##### **Continuous Learning Conversational AI: A Personalized Agent Framework via A2C Reinforcement Learning**
+2502.12876v1 by Nandakishor M, Anjali M
+
+Creating personalized and adaptable conversational AI remains a key
+challenge. This paper introduces a Continuous Learning Conversational AI (CLCA)
+approach, implemented using A2C reinforcement learning, to move beyond static
+Large Language Models (LLMs). We use simulated sales dialogues, generated by
+LLMs, to train an A2C agent. This agent learns to optimize conversation
+strategies for personalization, focusing on engagement and delivering value.
+Our system architecture integrates reinforcement learning with LLMs for both
+data creation and response selection. This method offers a practical way to
+build personalized AI companions that evolve through continuous learning,
+advancing beyond traditional static LLM techniques.
+
+摘要：建立個人化且適應性強的對話式 AI 仍然是一項關鍵挑戰。本文介紹了一種持續學習對話式 AI (CLCA) 方法，透過 A2C 強化學習實作，以超越靜態大型語言模型 (LLM)。我們使用 LLM 生成的模擬銷售對話來訓練 A2C 代理。此代理會學習最佳化對話策略以實現個人化，並專注於參與和提供價值。我們的系統架構將強化學習與 LLM 整合，用於資料建立和回應選取。此方法提供了一種實用的方式來建立個人化 AI 伴侶，這些伴侶會透過持續學習而演進，超越傳統的靜態 LLM 技術。
+
+##### **PAFT: Prompt-Agnostic Fine-Tuning**
+2502.12859v1 by Chenxing Wei, Yao Shu, Mingwen Ou, Ying Tiffany He, Fei Richard Yu
+
+While Large Language Models (LLMs) adapt well to downstream tasks after
+fine-tuning, this adaptability often compromises prompt robustness, as even
+minor prompt variations can significantly degrade performance. To address this,
+we propose Prompt-Agnostic Fine-Tuning (PAFT), a simple yet effective approach
+that dynamically adjusts prompts during fine-tuning. This encourages the model
+to learn underlying task principles rather than overfitting to specific prompt
+formulations. PAFT operates in two stages: First, a diverse set of meaningful,
+synthetic candidate prompts is constructed. Second, during fine-tuning, prompts
+are randomly sampled from this set to create dynamic training inputs. Extensive
+experiments across diverse datasets and LLMs demonstrate that models trained
+with PAFT exhibit strong robustness and generalization across a wide range of
+prompts, including unseen ones. This enhanced robustness improves both model
+performance and inference speed while maintaining training efficiency. Ablation
+studies further confirm the effectiveness of PAFT.
+
+摘要：儘管大型語言模型 (LLM) 在微調後能很好地適應下游任務，但這種適應性通常會損害提示的穩健性，因為即使微小的提示變異也會大幅降低效能。為了解決這個問題，我們提出提示不可知微調 (PAFT)，這是一種簡單卻有效的方法，可以在微調期間動態調整提示。這鼓勵模型學習底層任務原則，而不是過度擬合特定的提示表述。PAFT 分為兩個階段運作：首先，構建一組多樣化、有意義的合成候選提示。其次，在微調期間，從此集合中隨機抽取提示以建立動態訓練輸入。針對各種資料集和 LLM 進行的廣泛實驗表明，使用 PAFT 訓練的模型在各種提示中表現出強大的穩健性和概括性，包括未見過的提示。這種增強的穩健性同時改善了模型效能和推理速度，同時維持訓練效率。消融研究進一步證實了 PAFT 的有效性。
+
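The two stages described in the PAFT abstract above (build a set of candidate prompts, then sample from it at random during fine-tuning) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the templates in `CANDIDATE_PROMPTS`, the helper names, and the toy dataset are all hypothetical, and the actual fine-tuning step is left out.

```python
import random

# Stage 1 (assumed): a small, hand-written stand-in for the "diverse set of
# meaningful, synthetic candidate prompts". The paper constructs these
# synthetically; the templates here are purely illustrative.
CANDIDATE_PROMPTS = [
    "Question: {question}\nAnswer:",
    "Please answer the following.\n{question}\n",
    "{question}\nThe correct answer is",
    "Q: {question} A:",
]


def build_training_input(question, answer, rng):
    """Stage 2: wrap one example in a randomly sampled prompt template."""
    template = rng.choice(CANDIDATE_PROMPTS)
    return {"input": template.format(question=question), "target": answer}


def dynamic_batches(dataset, epochs, seed=0):
    """Yield one list of training inputs per epoch, re-sampling prompts each time."""
    rng = random.Random(seed)
    for _ in range(epochs):
        yield [build_training_input(q, a, rng) for q, a in dataset]


if __name__ == "__main__":
    toy_dataset = [("What is 2 + 2?", "4"), ("What is the capital of France?", "Paris")]
    for epoch, batch in enumerate(dynamic_batches(toy_dataset, epochs=2)):
        for example in batch:
            print(epoch, repr(example["input"]), "->", example["target"])
```

Because the template is re-drawn on every pass, the same question-answer pair tends to appear under different surface forms across epochs, which is the property the abstract credits for prompt robustness.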
+##### **Rejected Dialects: Biases Against African American Language in Reward Models**
+2502.12858v1 by Joel Mire, Zubin Trivadi Aysola, Daniel Chechelnitsky, Nicholas Deas, Chrysoula Zerva, Maarten Sap
+
+Preference alignment via reward models helps build safe, helpful, and
+reliable large language models (LLMs). However, subjectivity in preference
+judgments and the lack of representative sampling in preference data collection
+can introduce new biases, hindering reward models' fairness and equity. In this
+work, we introduce a framework for evaluating dialect biases in reward models
+and conduct a case study on biases against African American Language (AAL)
+through several experiments comparing reward model preferences and behavior on
+paired White Mainstream English (WME) and both machine-translated and
+human-written AAL corpora. We show that reward models are less aligned with
+human preferences when processing AAL texts vs. WME ones (-4% accuracy on
+average), frequently disprefer AAL-aligned texts vs. WME-aligned ones, and
+steer conversations toward WME, even when prompted with AAL texts. Our findings
+provide a targeted analysis of anti-AAL biases at a relatively understudied
+stage in LLM development, highlighting representational harms and ethical
+questions about the desired behavior of LLMs concerning AAL.
+
+摘要：透過獎勵模型進行偏好比對有助於建立安全、有用的可靠大型語言模型 (LLM)。然而，偏好判斷的主觀性，以及偏好資料收集中缺乏代表性抽樣，可能會引進新的偏誤，阻礙獎勵模型的公平性和公正性。在這項工作中，我們引進一個用於評估獎勵模型中方言偏誤的架構，並透過數個實驗進行案例研究，探討針對非裔美國人語言 (AAL) 的偏誤，這些實驗比較了獎勵模型偏好和行為，比較成對的白人主流英語 (WME) 與機器翻譯和人類撰寫的 AAL 語料庫。我們顯示，與處理 WME 文字相比，獎勵模型在處理 AAL 文字時與人類偏好較不一致（平均準確度降低 4%），經常不偏好與 AAL 一致的文字，而偏好與 WME 一致的文字，並將對話導向 WME，即使提示的是 AAL 文字。我們的發現針對 LLM 開發中相對未受重視的階段，提供針對反 AAL 偏誤的目標分析，強調與表徵相關的危害和關於 LLM 對 AAL 的期望行為的倫理問題。
+
+##### **Integrating Arithmetic Learning Improves Mathematical Reasoning in Smaller Models**
+2502.12855v1 by Neeraj Gangwar, Suma P Bhat, Nickvash Kani
+
+While large models pre-trained on high-quality data exhibit excellent
+performance across various reasoning tasks, including mathematical reasoning
+(e.g. GSM8k, MultiArith), specializing smaller models to excel at mathematical
+reasoning remains a challenging problem. Common approaches to address this
+challenge include knowledge distillation, where smaller student models learn
+from large pre-trained teacher models, and data augmentation, such as
+rephrasing questions. Despite these efforts, smaller models struggle with
+arithmetic computations, leading to errors in mathematical reasoning. In this
+work, we focus on leveraging a programmatically generated arithmetic dataset to
+enhance the reasoning capabilities of smaller models. We investigate two key
+approaches to incorporate this dataset -- (1) intermediate fine-tuning, where a
+model is fine-tuned on the arithmetic dataset before being trained on a
+reasoning dataset, and (2) integrating the arithmetic dataset into the
+instruction-tuning mixture, allowing the model to learn arithmetic skills
+alongside general instruction-following abilities. Our experiments on multiple
+reasoning benchmarks demonstrate that incorporating an arithmetic dataset,
+whether through targeted fine-tuning or within the instruction-tuning mixture,
+enhances the models' arithmetic capabilities, which in turn improves their
+mathematical reasoning performance.
+
+摘要：大型模型经过针对高质量数据的预训练，在各种推理任务中表现出色，包括数学推理（例如 GSM8k、MultiArith），但专门化小型模型以擅长数学推理仍然是一个具有挑战性的问题。解决这一挑战的常见方法包括知识蒸馏，其中较小的学生模型从经过预训练的大型教师模型中学习，以及数据增强，例如重新表述问题。尽管做出了这些努力，较小的模型在算术计算中仍然存在困难，从而导致数学推理错误。在这项工作中，我们专注于利用程序化生成的算术数据集来增强较小模型的推理能力。我们研究了两种关键方法来合并此数据集——（1）中间微调，其中模型在算术数据集上进行微调，然后在推理数据集上进行训练，以及（2）将算术数据集集成到指令微调混合中，允许模型学习算术技能以及一般的指令遵循能力。我们在多个推理基准上的实验表明，通过有针对性的微调或在指令微调混合中合并算术数据集，增强了模型的算术能力，进而提高了它们的数学推理性能。
+
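The abstract above relies on a programmatically generated arithmetic dataset but does not describe its format. The sketch below shows one plausible way to generate such data; the operator set, operand range, and the `prompt`/`completion` field names are assumptions made for illustration, not details taken from the paper.

```python
import operator
import random

# Assumed operator set; the paper may cover more operations or other formats.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def make_example(rng, max_operand=999):
    """Create one arithmetic problem as a prompt/completion pair."""
    a = rng.randint(0, max_operand)
    b = rng.randint(0, max_operand)
    symbol = rng.choice(sorted(OPS))  # sorted for a reproducible choice order
    return {"prompt": f"{a} {symbol} {b} =", "completion": str(OPS[symbol](a, b))}


def make_dataset(n, seed=0):
    """Generate n examples deterministically from a seed."""
    rng = random.Random(seed)
    return [make_example(rng) for _ in range(n)]


if __name__ == "__main__":
    for row in make_dataset(5):
        print(row["prompt"], row["completion"])
```

A dataset built this way could either be used for an intermediate fine-tuning pass or mixed into an instruction-tuning set, which are the two options the abstract compares.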
+##### **S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning**
+2502.12853v1 by Ruotian Ma, Peisong Wang, Cheng Liu, Xingyan Liu, Jiaqi Chen, Bang Zhang, Xin Zhou, Nan Du, Jia Li
+
+Recent studies have demonstrated the effectiveness of LLM test-time scaling.
+However, existing approaches to incentivize LLMs' deep thinking abilities
+generally require large-scale data or significant training efforts. Meanwhile,
+it remains unclear how to improve the thinking abilities of less powerful base
+models. In this work, we introduce S$^2$R, an efficient framework that enhances
+LLM reasoning by teaching models to self-verify and self-correct during
+inference. Specifically, we first initialize LLMs with iterative
+self-verification and self-correction behaviors through supervised fine-tuning
+on carefully curated data. The self-verification and self-correction skills are
+then further strengthened by both outcome-level and process-level reinforcement
+learning, with minimized resource requirements, enabling the model to
+adaptively refine its reasoning process during inference. Our results
+demonstrate that, with only 3.1k self-verifying and self-correcting behavior
+initialization samples, Qwen2.5-math-7B achieves an accuracy improvement from
+51.0% to 81.6%, outperforming models trained on an equivalent amount of
+long-CoT distilled data. Extensive experiments and analysis based on three base
+models across both in-domain and out-of-domain benchmarks validate the
+effectiveness of S$^2$R. Our code and data are available at
+https://github.com/NineAbyss/S2R.
+
+摘要：最近的研究表明了 LLM 测试时间扩展的有效性。然而，现有激励 LLM 深度思考能力的方法通常需要大规模数据或大量的训练工作。同时，如何提高较弱基础模型的思考能力仍然不清楚。在这项工作中，我们引入了 S$^2$R，一个通过教导模型在推理过程中进行自我验证和自我纠正来增强 LLM 推理的有效框架。具体来说，我们首先通过监督微调对精心整理的数据来初始化具有迭代自我验证和自我纠正行为的 LLM。然后通过结果级别和过程级别的强化学习进一步加强自我验证和自我纠正技能，同时最大程度地减少资源需求，使模型能够在推理过程中自适应地优化其推理过程。我们的结果表明，仅使用 3.1k 个自我验证和自我纠正行为初始化样本，Qwen2.5-math-7B 的准确率从 51.0% 提高到 81.6%，优于在等量长 CoT 蒸馏数据上训练的模型。基于三个基础模型在域内和域外基准上的广泛实验和分析验证了 S$^2$R 的有效性。我们的代码和数据可以在 https://github.com/NineAbyss/S2R 获得。
+
+##### **MVL-SIB: A Massively Multilingual Vision-Language Benchmark for Cross-Modal Topical Matching**
+2502.12852v1 by Fabian David Schmidt, Florian Schneider, Chris Biemann, Goran Glavaš
+
+Existing multilingual vision-language (VL) benchmarks often only cover a
+handful of languages. Consequently, evaluations of large vision-language models
+(LVLMs) predominantly target high-resource languages, underscoring the need for
+evaluation data for low-resource languages. To address this limitation, we
+introduce MVL-SIB, a massively multilingual vision-language benchmark that
+evaluates both cross-modal and text-only topical matching across 205 languages
+-- over 100 more than the most multilingual existing VL benchmarks encompass.
+We then benchmark a range of open-weight LVLMs together with GPT-4o(-mini)
+on MVL-SIB. Our results reveal that LVLMs struggle in cross-modal topic
+matching in lower-resource languages, performing no better than chance on
+languages like N'Koo. Our analysis further reveals that VL support in LVLMs
+declines disproportionately relative to textual support for lower-resource
+languages, as evidenced by comparison of cross-modal and text-only topical
+matching performance. We further observe that open-weight LVLMs do not benefit
+from representing a topic with more than one image, suggesting that these
+models are not yet fully effective at handling multi-image tasks. By
+correlating performance on MVL-SIB with other multilingual VL benchmarks, we
+highlight that MVL-SIB serves as a comprehensive probe of multilingual VL
+understanding in LVLMs.
+
+摘要：現有的多語言視覺語言 (VL) 基準通常只涵蓋少數語言。因此，大型視覺語言模型 (LVLMs) 的評估主要針對資源豐富的語言，強調了對資源匱乏語言的評估資料的需求。為了解決此限制，我們引入了 MVL-SIB，一個大規模的多語言視覺語言基準，它評估了 205 種語言的跨模態和純文字主題匹配，比現有的多語言 VL 基準涵蓋的語言多出 100 多種。然後，我們在 MVL-SIB 上對一系列開放權重的 LVLMs 與 GPT-4o(-mini) 進行了基準測試。我們的結果表明，LVLMs 在資源較少的語言中難以進行跨模態主題匹配，在 N'Koo 等語言上的表現不比隨機好。我們的分析進一步表明，LVLMs 中的 VL 支援相對於資源較少的語言的文字支援下降得不成比例，這從跨模態和純文字主題匹配效能的比較中可以看出。我們進一步觀察到，開放權重的 LVLMs 無法從用多於一張影像來表示主題中受益，這表明這些模型在處理多影像任務方面尚未完全有效。通過將 MVL-SIB 上的效能與其他多語言 VL 基準相關聯，我們強調 MVL-SIB 可作為 LVLMs 中多語言 VL 理解的綜合探測。
+
+##### **MeMo: Towards Language Models with Associative Memory Mechanisms**
+2502.12851v1 by Fabio Massimo Zanzotto, Elena Sofia Ruzzetti, Giancarlo A. Xompero, Leonardo Ranaldi, Davide Venditti, Federico Ranaldi, Cristina Giannone, Andrea Favalli, Raniero Romagnoli
+
+Memorization is a fundamental ability of Transformer-based Large Language
+Models, achieved through learning. In this paper, we propose a paradigm shift
+by designing an architecture to memorize text directly, bearing in mind the
+principle that memorization precedes learning. We introduce MeMo, a novel
+architecture for language modeling that explicitly memorizes sequences of
+tokens in layered associative memories. By design, MeMo offers transparency and
+the possibility of model editing, including forgetting texts. We experimented
+with the MeMo architecture, showing the memorization power of the one-layer and
+the multi-layer configurations.
+
+摘要：記憶是 Transformer 大型語言模型的基本能力，可透過學習達成。在本文中，我們提出一個典範轉移，透過設計一個架構來直接記憶文字，並牢記記憶先於學習的原則。我們導入 MeMo，一個新穎的語言建模架構，可明確地記憶分層關聯式記憶中的代幣序列。透過設計，MeMo 提供透明度和模型編輯的可能性，包括遺忘文字。我們實驗了 MeMo 架構，展示了單層和多層組態的記憶力。
+
+##### **Towards Adaptive Feedback with AI: Comparing the Feedback Quality of LLMs and Teachers on Experimentation Protocols**
+2502.12842v1 by Kathrin Seßler, Arne Bewersdorff, Claudia Nerdel, Enkelejda Kasneci
+
+Effective feedback is essential for fostering students' success in scientific
+inquiry. With advancements in artificial intelligence, large language models
+(LLMs) offer new possibilities for delivering instant and adaptive feedback.
+However, this feedback often lacks the pedagogical validation provided by
+real-world practitioners. To address this limitation, our study evaluates and
+compares the feedback quality of LLM agents with that of human teachers and
+science education experts on student-written experimentation protocols. Four
+blinded raters, all professionals in scientific inquiry and science education,
+evaluated the feedback texts generated by 1) the LLM agent, 2) the teachers and
+3) the science education experts using a five-point Likert scale based on six
+criteria of effective feedback: Feed Up, Feed Back, Feed Forward, Constructive
+Tone, Linguistic Clarity, and Technical Terminology. Our results indicate that
+LLM-generated feedback shows no significant difference to that of teachers and
+experts in overall quality. However, the LLM agent's performance lags in the
+Feed Back dimension, which involves identifying and explaining errors within
+the student's work context. Qualitative analysis highlighted the LLM agent's
+limitations in contextual understanding and in the clear communication of
+specific errors. Our findings suggest that combining LLM-generated feedback
+with human expertise can enhance educational practices by leveraging the
+efficiency of LLMs and the nuanced understanding of educators.
+
+摘要：有效的回饋對於培養學生在科學探究中的成功至關重要。隨著人工智慧的進步，大型語言模型 (LLM) 為提供即時且適應性的回饋提供了新的可能性。然而，此回饋通常缺乏實際從業者提供的教學驗證。為了解決此限制，我們的研究評估並比較了 LLM 代理與人類教師和科學教育專家在學生撰寫的實驗協定上的回饋品質。四位盲評者，皆為科學探究和科學教育專業人士，使用基於六個有效回饋準則的五點李克特量表評估由 1) LLM 代理、2) 教師和 3) 科學教育專家產生的回饋文字：鼓勵、回饋、前饋、建設性語氣、語言清晰度和技術術語。我們的結果表明，LLM 產生的回饋在整體品質上與教師和專家產生的回饋沒有顯著差異。然而，LLM 代理的表現落後於回饋面向，這涉及在學生的作業背景中識別和解釋錯誤。定性分析突顯了 LLM 代理在情境理解和明確傳達特定錯誤方面的限制。我們的研究結果表明，將 LLM 產生的回饋與人類專業知識相結合，可以透過利用 LLM 的效率和教育者的細緻理解來提升教育實務。
+
+##### **Towards Equitable AI: Detecting Bias in Using Large Language Models for Marketing**
+2502.12838v1 by Berk Yilmaz, Huthaifa I. Ashqar
+
+The recent advances in large language models (LLMs) have revolutionized
+industries such as finance, marketing, and customer service by enabling
+sophisticated natural language processing tasks. However, the broad adoption of
+LLMs brings significant challenges, particularly in the form of social biases
+that can be embedded within their outputs. Biases related to gender, age, and
+other sensitive attributes can lead to unfair treatment, raising ethical
+concerns and risking both company reputation and customer trust. This study
+examined bias in finance-related marketing slogans generated by LLMs (i.e.,
+ChatGPT) by prompting tailored ads targeting five demographic categories:
+gender, marital status, age, income level, and education level. A total of
+1,700 slogans were generated for 17 unique demographic groups, and key terms
+were categorized into four thematic groups: empowerment, financial, benefits
+and features, and personalization. Bias was systematically assessed using
+relative bias calculations and statistically tested with the Kolmogorov-Smirnov
+(KS) test against general slogans generated for any individual. Results
+revealed that marketing slogans are not neutral; rather, they emphasize
+different themes based on demographic factors. Women, younger individuals,
+low-income earners, and those with lower education levels receive more distinct
+messaging compared to older, higher-income, and highly educated individuals.
+This underscores the need to consider demographic-based biases in AI-generated
+marketing strategies and their broader societal implications. The findings of
+this study provide a roadmap for developing more equitable AI systems,
+highlighting the need for ongoing bias detection and mitigation efforts in
+LLMs.
+
+摘要：大型語言模型 (LLM) 的最新進展徹底改變了金融、行銷和客戶服務等產業，因為它能執行複雜的自然語言處理任務。然而，LLM 的廣泛採用帶來重大的挑戰，特別是潛藏在其輸出結果中的社會偏見形式。與性別、年齡和其他敏感屬性相關的偏見可能導致不公平的待遇，引發道德問題，並危及公司聲譽和客戶信任。本研究探討了 LLM（即 ChatGPT）產生的與金融相關的行銷標語中的偏見，方法是針對五個人口統計類別：性別、婚姻狀況、年齡、收入水準和教育水準，提示量身打造的廣告。總共為 17 個獨特的人口統計群組產生了 1,700 個標語，並且關鍵詞被分類為四個主題群組：賦權、財務、好處和功能，以及個人化。偏見使用相對偏見計算進行系統性評估，並使用科爾莫哥洛夫-史米諾夫 (KS) 檢定與針對任何個人產生的通用標語進行統計檢定。結果顯示行銷標語並非中立；相反地，它們根據人口統計因素強調不同的主題。與年紀較大、收入較高和受教育程度較高的個人相比，女性、年輕人、低收入者和教育程度較低者接收到的訊息更為不同。這強調了在 AI 生成的行銷策略中考量基於人口統計的偏見及其更廣泛的社會影響的必要性。本研究的發現提供了開發更公平 AI 系統的路線圖，突顯了在 LLM 中持續進行偏見偵測和緩解工作的重要性。
+
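The abstract above tests group-specific slogans against general slogans with the Kolmogorov-Smirnov (KS) test. As a rough illustration of what that statistic measures, the sketch below computes the two-sample KS statistic in plain Python. The per-slogan scores are hypothetical (for example, the share of "empowerment" terms in each slogan); the paper's exact scoring and test setup are not given in the abstract.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    largest_gap = 0.0
    # Empirical CDFs are step functions that only change at observed values,
    # so checking every observed value is enough to find the maximum gap.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


if __name__ == "__main__":
    # Hypothetical per-slogan theme scores for one demographic group
    # and for the general ("any individual") baseline.
    group_scores = [0.40, 0.35, 0.50, 0.45, 0.30]
    baseline_scores = [0.10, 0.20, 0.15, 0.25, 0.30]
    print(ks_statistic(group_scores, baseline_scores))
```

This returns only the statistic. For a p-value, a library routine such as `scipy.stats.ks_2samp` would be the usual choice.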
+##### **An LLM-Powered Agent for Physiological Data Analysis: A Case Study on PPG-based Heart Rate Estimation**
+2502.12836v1 by Mohammad Feli, Iman Azimi, Pasi Liljeberg, Amir M. Rahmani
+
+Large language models (LLMs) are revolutionizing healthcare by improving
+diagnosis, patient care, and decision support through interactive
+communication. More recently, they have been applied to analyzing physiological
+time-series like wearable data for health insight extraction. Existing methods
+embed raw numerical sequences directly into prompts, which exceeds token limits
+and increases computational costs. Additionally, some studies integrated
+features extracted from time-series in textual prompts or applied multimodal
+approaches. However, these methods often produce generic and unreliable outputs
+due to LLMs' limited analytical rigor and inefficiency in interpreting
+continuous waveforms. In this paper, we develop an LLM-powered agent for
+physiological time-series analysis that aims to bridge the gap in integrating LLMs
+with well-established analytical tools. Built on OpenCHA, an open-source
+LLM-powered framework, our agent features an orchestrator that integrates user
+interaction, data sources, and analytical tools to generate accurate health
+insights. To evaluate its effectiveness, we implement a case study on heart
+rate (HR) estimation from Photoplethysmogram (PPG) signals using a dataset of
+PPG and Electrocardiogram (ECG) recordings in a remote health monitoring study.
+The agent's performance is benchmarked against OpenAI GPT-4o-mini and GPT-4o,
+with ECG serving as the gold standard for HR estimation. Results demonstrate
+that our agent significantly outperforms benchmark models by achieving lower
+error rates and more reliable HR estimations. The agent implementation is
+publicly available on GitHub.
+
+摘要：大型語言模型 (LLM) 透過互動式溝通，改善診斷、病人照護和決策支援，進而革新醫療保健。最近，它們已應用於分析生理時間序列，例如可穿戴式裝置的資料，以萃取健康見解。現有方法會將原始數值序列直接嵌入提示中，這會超過權杖限制並增加運算成本。此外，一些研究將從時間序列中萃取的特徵整合到文字提示中，或應用多模態方法。然而，由於 LLM 在解譯連續波形時分析嚴謹度有限且效率不彰，這些方法經常產生通用且不可靠的輸出。在本文中，我們開發了一個由 LLM 驅動的代理，用於生理時間序列分析，旨在彌合將 LLM 與既有分析工具整合的差距。我們的代理建立在 OpenCHA（一個由 LLM 驅動的開源架構）之上，具備一個整合使用者互動、資料來源和分析工具的協調器，以產生準確的健康見解。為了評估其有效性，我們實作了一個案例研究，從遠距健康監測研究中的一組光電容積描記圖 (PPG) 和心電圖 (ECG) 記錄中估算心率 (HR)。該代理的效能與 OpenAI GPT-4o-mini 和 GPT-4o 進行基準測試，其中 ECG 作為 HR 估算的金標準。結果顯示，我們的代理透過達成較低的錯誤率和更可靠的 HR 估算，顯著優於基準模型。該代理實作已公開在 GitHub 上。
+
+##### **Subword models struggle with word learning, but surprisal hides it**
+2502.12835v1 by Bastian Bunzeck, Sina Zarrieß
+
+We study word learning in subword and character language models with the
+psycholinguistic lexical decision task. While subword LMs struggle to discern
+words and non-words with high accuracy, character LMs solve this task easily
+and consistently. Furthermore, when comparing word learning and syntactic
+learning, the two processes are separable in character LMs, where word learning
+predates syntactic learning, whereas they are simultaneous in
+subword LMs. This raises questions about the adequacy of subword LMs for
+modeling language acquisition and positions character LMs as a viable
+alternative.
+
+摘要：我們使用心理語言學的詞彙決策任務研究在子詞和字元語言模型中的詞彙學習。儘管子詞語言模型難以區分單詞和非單詞，但字元語言模型可以輕鬆且一致地解決此任務。此外，在比較單詞學習和句法學習時，這兩個過程在字元語言模型中是可分離的，其中單詞學習先於句法學習，而這些過程在子詞語言模型中是同時發生的。這引發了關於子詞語言模型對語言習得建模的充分性的問題，並將字元語言模型定位為可行的替代方案。
+
+##### **KazMMLU: Evaluating Language Models on Kazakh, Russian, and Regional Knowledge of Kazakhstan**
+2502.12829v1 by Mukhammed Togmanov, Nurdaulet Mukhituly, Diana Turmakhan, Jonibek Mansurov, Maiya Goloburda, Akhmed Sakip, Zhuohan Xie, Yuxia Wang, Bekassyl Syzdykov, Nurkhan Laiyk, Alham Fikri Aji, Ekaterina Kochmar, Preslav Nakov, Fajri Koto
+
+Despite having a population of twenty million, Kazakhstan's culture and
+language remain underrepresented in the field of natural language processing.
+Although large language models (LLMs) continue to advance worldwide, progress
+in the Kazakh language has been limited, as seen in the scarcity of dedicated
+models and benchmark evaluations. To address this gap, we introduce KazMMLU,
+the first MMLU-style dataset specifically designed for the Kazakh language. KazMMLU
+comprises 23,000 questions that cover various educational levels, including
+STEM, humanities, and social sciences, sourced from authentic educational
+materials and manually validated by native speakers and educators. The dataset
+includes 10,969 Kazakh questions and 12,031 Russian questions, reflecting
+Kazakhstan's bilingual education system and rich local context. Our evaluation
+of several state-of-the-art multilingual models (Llama-3.1, Qwen-2.5, GPT-4,
+and DeepSeek V3) demonstrates substantial room for improvement, as even the
+best-performing models struggle to achieve competitive performance in Kazakh
+and Russian. These findings underscore significant performance gaps compared to
+high-resource languages. We hope that our dataset will enable further research
+and development of Kazakh-centric LLMs. Data and code will be made available
+upon acceptance.
 
-摘要：本研究回顧了大型語言模型 (LLM) 在醫療保健中的使用，重點在於其訓練語料庫、自訂技術和評估指標。針對 2021 年至 2024 年的研究進行系統性搜尋，找出 61 篇文章。語料庫類型有四種：臨床資源、文獻、開放原始碼資料集和網路爬取資料。常見的建構技術包括預訓練、提示工程和檢索增強生成，其中有 44 項研究結合多種方法。評估指標分為流程、可用性和成果指標，其中成果指標又分為基於模型和專家評估的成果。本研究發現語料庫公平性存在重大差距，這會導致地理、文化和社會經濟因素的偏見。對未驗證或非結構化資料的依賴性突顯出更佳整合循證臨床指南的必要性。未來的研究應專注於開發具有審查來源和動態加權的分層語料庫架構，同時確保模型透明性。此外，缺乏針對特定領域模型的標準化評估架構，因此需要對 LLM 在實際醫療保健環境中進行全面驗證。
-
-##### **Defining and Evaluating Visual Language Models' Basic Spatial Abilities: A Perspective from Psychometrics**
-2502.11859v1 by Wenrui Xu, Dalin Lyu, Weihang Wang, Jie Feng, Chen Gao, Yong Li
-
-The Theory of Multiple Intelligences underscores the hierarchical nature of
-cognitive capabilities. To advance Spatial Artificial Intelligence, we pioneer
-a psychometric framework defining five Basic Spatial Abilities (BSAs) in Visual
-Language Models (VLMs): Spatial Perception, Spatial Relation, Spatial
-Orientation, Mental Rotation, and Spatial Visualization. Benchmarking 13
-mainstream VLMs through nine validated psychometric experiments reveals
-significant gaps versus humans (average score 24.95 vs. 68.38), with three key
-findings: 1) VLMs mirror human hierarchies (strongest in 2D orientation,
-weakest in 3D rotation) with independent BSAs (Pearson's r<0.4); 2) Smaller
-models such as Qwen2-VL-7B surpass larger counterparts, with Qwen leading
-(30.82) and InternVL2 lagging (19.6); 3) Interventions like chain-of-thought
-(0.100 accuracy gain) and 5-shot training (0.259 improvement) show limits from
-architectural constraints. Identified barriers include weak geometry encoding
-and missing dynamic simulation. By linking psychometric BSAs to VLM
-capabilities, we provide a diagnostic toolkit for spatial intelligence
-evaluation, methodological foundations for embodied AI development, and a
-cognitive science-informed roadmap for achieving human-like spatial
-intelligence.
-
-摘要：多元智能理論強調認知能力的層次性質。為了推進空間人工智慧，我們開創了一個心理測量框架，在視覺語言模型 (VLM) 中定義了五種基本空間能力 (BSA)：空間知覺、空間關係、空間定向、心智旋轉和空間視覺化。通過九項經過驗證的心理測量實驗對 13 個主流 VLM 進行基準測試，揭示了與人類相比的顯著差距（平均分數 24.95 對 68.38），並得出三個關鍵發現：1) VLM 反映人類層次結構（2D 定向最強，3D 旋轉最弱）具有獨立的 BSA（Pearson's r<0.4）；2) Qwen2-VL-7B 等較小的模型超越了較大的模型，其中 Qwen 領先（30.82），InternVL2 落後（19.6）；3) 思想鏈等干預措施（0.100  accuracy gain）和 5 次訓練（0.259 提升）顯示了架構約束的限制。已識別的障礙包括弱幾何編碼和缺少動態模擬。通過將心理測量 BSA 與 VLM 能力聯繫起來，我們提供了一個用於空間智能評估的診斷工具包、具身 AI 開發的方法論基礎，以及實現類人空間智能的認知科學信息路標。
-
-##### **LLMs as a synthesis between symbolic and continuous approaches to language**
-2502.11856v1 by Gemma Boleda
-
-Since the middle of the 20th century, a fierce battle is being fought between
-symbolic and continuous approaches to language and cognition. The success of
-deep learning models, and LLMs in particular, has been alternatively taken as
-showing that the continuous camp has won, or dismissed as an irrelevant
-engineering development. However, in this position paper I argue that deep
-learning models for language actually represent a synthesis between the two
-traditions. This is because 1) deep learning architectures allow for both
-continuous/distributed and symbolic/discrete-like representations and
-computations; 2) models trained on language make use this flexibility. In
-particular, I review recent research in mechanistic interpretability that
-showcases how a substantial part of morphosyntactic knowledge is encoded in a
-near-discrete fashion in LLMs. This line of research suggests that different
-behaviors arise in an emergent fashion, and models flexibly alternate between
-the two modes (and everything in between) as needed. This is possibly one of
-the main reasons for their wild success; and it is also what makes them
-particularly interesting for the study of language and cognition. Is it time
-for peace?
-
-摘要：自 20 世紀中葉以來，象徵與連續的語言和認知方法之間展開了一場激烈的戰鬥。深度學習模型，特別是 LLM 的成功，被交替視為連續陣營獲勝的證明，或被視為無關的工程發展而被忽視。然而，在本文中，我認為用於語言的深度學習模型實際上代表了這兩種傳統之間的綜合。這是因為 1) 深度學習架構允許連續/分佈式和符號/離散式表示和計算；2) 在語言上訓練的模型利用了這種靈活性。特別是，我回顧了機制可解釋性的最新研究，展示了形態句法知識的實質部分是如何以近乎離散的方式編碼在 LLM 中的。這條研究線表明，不同的行為以一種新興的方式出現，並且模型根據需要在兩種模式（以及介於兩者之間的所有內容）之間靈活地交替。這可能是它們獲得巨大成功的主要原因之一；這也是它們對語言和認知研究特別有趣的原因。和平的時刻到了嗎？
-
-##### **BaxBench: Can LLMs Generate Correct and Secure Backends?**
-2502.11844v1 by Mark Vero, Niels Mündler, Victor Chibotaru, Veselin Raychev, Maximilian Baader, Nikola Jovanović, Jingxuan He, Martin Vechev
-
-The automatic generation of programs has long been a fundamental challenge in
-computer science. Recent benchmarks have shown that large language models
-(LLMs) can effectively generate code at the function level, make code edits,
-and solve algorithmic coding tasks. However, to achieve full automation, LLMs
-should be able to generate production-quality, self-contained application
-modules. To evaluate the capabilities of LLMs in solving this challenge, we
-introduce BaxBench, a novel evaluation benchmark consisting of 392 tasks for
-the generation of backend applications. We focus on backends for three critical
-reasons: (i) they are practically relevant, building the core components of
-most modern web and cloud software, (ii) they are difficult to get right,
-requiring multiple functions and files to achieve the desired functionality,
-and (iii) they are security-critical, as they are exposed to untrusted
-third-parties, making secure solutions that prevent deployment-time attacks an
-imperative. BaxBench validates the functionality of the generated applications
-with comprehensive test cases, and assesses their security exposure by
-executing end-to-end exploits. Our experiments reveal key limitations of
-current LLMs in both functionality and security: (i) even the best model,
-OpenAI o1, achieves a mere 60% on code correctness; (ii) on average, we could
-successfully execute security exploits on more than half of the correct
-programs generated by each LLM; and (iii) in less popular backend frameworks,
-models further struggle to generate correct and secure applications. Progress
-on BaxBench signifies important steps towards autonomous and secure software
-development with LLMs.
-
-摘要：程式自動產生一直是電腦科學中的基本挑戰。最近的基準測試顯示，大型語言模型 (LLM) 能夠有效產生函數層級的程式碼、進行程式碼編輯，以及解決演算法編碼任務。然而，若要達成完全自動化，LLM 應能夠產生生產品質、獨立的應用程式模組。為了評估 LLM 在解決此挑戰的能力，我們引入了 BaxBench，這是一個包含 392 個後端應用程式產生任務的新評估基準。我們專注於後端有三個關鍵原因：(i) 它們在實務上有其相關性，建構了大多數現代網路和雲端軟體的核心元件；(ii) 它們難以正確實作，需要多個函數和檔案才能達成所需的功能；(iii) 它們與安全性息息相關，因為它們會暴露於不受信任的第三方，使得能夠防範部署時攻擊的安全解決方案成為當務之急。BaxBench 使用全面的測試案例驗證所產生應用程式的功能，並透過執行端對端漏洞利用來評估其安全性風險。我們的實驗揭露了目前 LLM 在功能和安全性上的主要限制：(i) 即使是最好的模型 OpenAI o1，在程式碼正確性上也僅達到 60%；(ii) 平均而言，在每個 LLM 產生的正確程式中，有超過一半可被我們成功執行安全漏洞利用；(iii) 在較不普及的後端框架中，模型更難產生正確且安全的應用程式。在 BaxBench 上的進展代表著使用 LLM 朝向自主且安全的軟體開發邁出了重要的一步。
-
-##### **Can LLM Agents Maintain a Persona in Discourse?**
-2502.11843v1 by Pranav Bhandari, Nicolas Fay, Michael Wise, Amitava Datta, Stephanie Meek, Usman Naseem, Mehwish Nasim
-
-Large Language Models (LLMs) are widely used as conversational agents,
-exploiting their capabilities in various sectors such as education, law,
-medicine, and more. However, LLMs are often subjected to context-shifting
-behaviour, resulting in a lack of consistent and interpretable
-personality-aligned interactions. Adherence to psychological traits lacks
-comprehensive analysis, especially in the case of dyadic (pairwise)
-conversations. We examine this challenge from two viewpoints, initially using
-two conversation agents to generate a discourse on a certain topic with an
-assigned personality from the OCEAN framework (Openness, Conscientiousness,
-Extraversion, Agreeableness, and Neuroticism) as High/Low for each trait. This
-is followed by using multiple judge agents to infer the original traits
-assigned to explore prediction consistency, inter-model agreement, and
-alignment with the assigned personality. Our findings indicate that while LLMs
-can be guided toward personality-driven dialogue, their ability to maintain
-personality traits varies significantly depending on the combination of models
-and discourse settings. These inconsistencies emphasise the challenges in
-achieving stable and interpretable personality-aligned interactions in LLMs.
-
-摘要：大型語言模型 (LLM) 被廣泛用作對話代理，在教育、法律、醫學等各個領域發揮其能力。然而，LLM 經常受到情境轉換行為的影響，導致缺乏一致且可解釋的人格一致互動。對心理特質的遵循程度缺乏全面的分析，特別是在二元 (成對) 對話的情況下。我們從兩個觀點審視這個挑戰：首先使用兩個對話代理在特定主題上產生論述，並依據 OCEAN 框架 (開放性、盡責性、外向性、宜人性、神經質) 為其分配人格，每個特質設為高或低；接著使用多個評審代理來推斷原先分配的特質，以探討預測一致性、模型間一致性，以及與所分配人格的吻合程度。我們的研究結果表明，雖然 LLM 可以被引導至以人格為導向的對話，但它們維持人格特質的能力會因模型和論述設定的組合而有顯著差異。這些不一致凸顯了在 LLM 中實現穩定且可解釋的人格一致互動的挑戰。
-
-##### **ChordFormer: A Conformer-Based Architecture for Large-Vocabulary Audio Chord Recognition**
-2502.11840v1 by Muhammad Waseem Akram, Stefano Dettori, Valentina Colla, Giorgio Carlo Buttazzo
-
-Chord recognition serves as a critical task in music information retrieval
-due to the abstract and descriptive nature of chords in music analysis. While
-audio chord recognition systems have achieved significant accuracy for small
-vocabularies (e.g., major/minor chords), large-vocabulary chord recognition
-remains a challenging problem. This complexity also arises from the inherent
-long-tail distribution of chords, where rare chord types are underrepresented
-in most datasets, leading to insufficient training samples. Effective chord
-recognition requires leveraging contextual information from audio sequences,
-yet existing models, such as combinations of convolutional neural networks,
-bidirectional long short-term memory networks, and bidirectional transformers,
-face limitations in capturing long-term dependencies and exhibit suboptimal
-performance on large-vocabulary chord recognition tasks. This work proposes
-ChordFormer, a novel conformer-based architecture designed to tackle structural
-chord recognition (e.g., triads, bass, sevenths) for large vocabularies.
-ChordFormer leverages conformer blocks that integrate convolutional neural
-networks with transformers, thus enabling the model to capture both local
-patterns and global dependencies effectively. By addressing challenges such as
-class imbalance through a reweighted loss function and structured chord
-representations, ChordFormer outperforms state-of-the-art models, achieving a
-2% improvement in frame-wise accuracy and a 6% increase in class-wise accuracy
-on large-vocabulary chord datasets. Furthermore, ChordFormer excels in handling
-class imbalance, providing robust and balanced recognition across chord types.
-This approach bridges the gap between theoretical music knowledge and practical
-applications, advancing the field of large-vocabulary chord recognition.
-
-摘要：由於和弦在音樂分析中具有抽象性和描述性，和弦辨識是音樂資訊檢索中的一項關鍵任務。雖然音訊和弦辨識系統已在小型詞彙（例如，大調/小調和弦）中達到顯著的準確度，但大型詞彙和弦辨識仍然是一個具有挑戰性的問題。這種複雜性也來自和弦固有的長尾分佈，其中罕見的和弦類型在大多數資料集中代表性不足，導致訓練樣本不足。有效的和弦辨識需要利用音訊序列中的上下文資訊，但現有的模型，例如卷積神經網路、雙向長短期記憶網路和雙向 Transformer 的組合，在捕捉長期依賴關係方面面臨限制，並且在大型詞彙和弦辨識任務上表現不佳。這項工作提出了 ChordFormer，這是一種新穎的基於 Conformer 的架構，旨在解決大型詞彙的結構化和弦辨識（例如，三和弦、低音、七和弦）。ChordFormer 利用 Conformer 區塊將卷積神經網路與 Transformer 整合在一起，從而使模型能夠有效地捕捉局部模式和全局依賴關係。透過重新加權損失函數和結構化和弦表示來解決類別不平衡等挑戰，ChordFormer 優於最先進的模型，在大型詞彙和弦資料集上實現了逐幀準確度提高 2% 和逐類別準確度提高 6%。此外，ChordFormer 在處理類別不平衡方面表現出色，在各種和弦類型中提供穩健且平衡的辨識。這種方法彌合了理論音樂知識與實際應用之間的差距，推動了大型詞彙和弦辨識領域的發展。
-
-##### **Intuitive physics understanding emerges from self-supervised pretraining on natural videos**
-2502.11831v1 by Quentin Garrido, Nicolas Ballas, Mahmoud Assran, Adrien Bardes, Laurent Najman, Michael Rabbat, Emmanuel Dupoux, Yann LeCun
-
-We investigate the emergence of intuitive physics understanding in
-general-purpose deep neural network models trained to predict masked regions in
-natural videos. Leveraging the violation-of-expectation framework, we find that
-video prediction models trained to predict outcomes in a learned representation
-space demonstrate an understanding of various intuitive physics properties,
-such as object permanence and shape consistency. In contrast, video prediction
-in pixel space and multimodal large language models, which reason through text,
-achieve performance closer to chance. Our comparisons of these architectures
-reveal that jointly learning an abstract representation space while predicting
-missing parts of sensory input, akin to predictive coding, is sufficient to
-acquire an understanding of intuitive physics, and that even models trained on
-one week of unique video achieve above chance performance. This challenges the
-idea that core knowledge -- a set of innate systems to help understand the
-world -- needs to be hardwired to develop an understanding of intuitive
-physics.
-
-摘要：我們探討了在經過訓練以預測自然影片中遮蔽區域的通用深度神經網路模型中，直覺物理理解的出現。利用違反預期框架，我們發現經過訓練、在學習到的表徵空間中預測結果的影片預測模型，展現了對各種直覺物理特性的理解，例如物體恆存和形狀一致性。相反地，在像素空間中進行的影片預測，以及透過文字進行推理的多模態大型語言模型，其效能則較接近隨機水準。我們對這些架構的比較揭示了，在預測感官輸入缺失部分的同時學習抽象表徵空間（類似於預測編碼），足以獲得對直覺物理的理解，而且即使是僅以一週時長的不重複影片訓練的模型，也達到了高於隨機的效能。這挑戰了核心知識（一套幫助理解世界的先天系統）需要硬連線才能發展對直覺物理的理解這個想法。
-
-##### **Text Classification in the LLM Era - Where do we stand?**
-2502.11830v1 by Sowmya Vajjala, Shwetali Shimangaud
-
-Large Language Models revolutionized NLP and showed dramatic performance
-improvements across several tasks. In this paper, we investigated the role of
-such language models in text classification and how they compare with other
-approaches relying on smaller pre-trained language models. Considering 32
-datasets spanning 8 languages, we compared zero-shot classification, few-shot
-fine-tuning and synthetic data based classifiers with classifiers built using
-the complete human labeled dataset. Our results show that zero-shot approaches
-do well for sentiment classification, but are outperformed by other approaches
-for the rest of the tasks, and synthetic data sourced from multiple LLMs can
-build better classifiers than zero-shot open LLMs. We also see wide performance
-disparities across languages in all the classification scenarios. We expect
-that these findings would guide practitioners working on developing text
-classification systems across languages.
-
-摘要：大型語言模型革新了自然語言處理，並在多項任務中展現出顯著的效能提升。在本文中，我們探討了此類語言模型在文字分類中的角色，以及它們與依賴較小規模預先訓練語言模型的其他方法相比如何。考量涵蓋 8 種語言的 32 個資料集，我們比較了零次學習分類、少次學習微調和合成資料分類器，以及使用完整人工標記資料集建置的分類器。我們的結果顯示，零次學習方法在情緒分類中表現良好，但在其他任務中則不如其他方法，而來自多個大型語言模型的合成資料可以建置比零次學習開放大型語言模型更好的分類器。我們也看到在所有分類情境中，不同語言之間的效能差異很大。我們預期這些發現將引導從事跨語言文字分類系統開發的實務工作者。
-
-##### **Code-Vision: Evaluating Multimodal LLMs Logic Understanding and Code Generation Capabilities**
-2502.11829v1 by Hanbin Wang, Xiaoxuan Zhou, Zhipeng Xu, Keyuan Cheng, Yuxin Zuo, Kai Tian, Jingwei Song, Junting Lu, Wenhui Hu, Xueyang Liu
-
-This paper introduces Code-Vision, a benchmark designed to evaluate the
-logical understanding and code generation capabilities of Multimodal Large
-Language Models (MLLMs). It challenges MLLMs to generate a correct program that
-fulfills specific functionality requirements based on a given flowchart, which
-visually represents the desired algorithm or process. Code-Vision comprises
-three subsets: HumanEval-V, Algorithm, and MATH, which evaluate MLLMs' coding
-abilities across basic programming, algorithmic, and mathematical
-problem-solving domains. Our experiments evaluate 12 MLLMs on Code-Vision.
-Experimental results demonstrate that there is a large performance difference
-between proprietary and open-source models. On Hard problems, GPT-4o can
-achieve 79.3% pass@1, but the best open-source model only achieves 15%. Further
-experiments reveal that Code-Vision can pose unique challenges compared to
-other multimodal reasoning benchmarks MMCode and MathVista. We also explore the
-reason for the poor performance of the open-source models. All data and codes
-are available at https://github.com/wanghanbinpanda/CodeVision.
-
-摘要：本文介紹 Code-Vision，此基準測試旨在評估多模態大型語言模型 (MLLM) 的邏輯理解和程式碼生成能力。它要求 MLLM 根據給定的流程圖生成一個正確的程式，以滿足特定的功能需求，而流程圖直觀地表示所需的演算法或流程。Code-Vision 包含三個子集：HumanEval-V、Algorithm 和 MATH，它們評估 MLLM 在基本程式設計、演算法和數學問題解決領域中的編碼能力。我們的實驗在 Code-Vision 上評估了 12 個 MLLM。實驗結果表明，專有模型和開源模型之間的效能差異很大。在困難問題上，GPT-4o 可以達到 79.3% 的 pass@1，但最好的開源模型只能達到 15%。進一步的實驗表明，與其他多模態推理基準 MMCode 和 MathVista 相比，Code-Vision 可能會帶來獨特的挑戰。我們還探討了開源模型效能不佳的原因。所有資料和程式碼均可在 https://github.com/wanghanbinpanda/CodeVision 取得。
-
-##### **M-ABSA: A Multilingual Dataset for Aspect-Based Sentiment Analysis**
-2502.11824v1 by Chengyan Wu, Bolei Ma, Yihong Liu, Zheyu Zhang, Ningyuan Deng, Yanshu Li, Baolan Chen, Yi Zhang, Barbara Plank, Yun Xue
-
-Aspect-based sentiment analysis (ABSA) is a crucial task in information
-extraction and sentiment analysis, aiming to identify aspects with associated
-sentiment elements in text. However, existing ABSA datasets are predominantly
-English-centric, limiting the scope for multilingual evaluation and research.
-To bridge this gap, we present M-ABSA, a comprehensive dataset spanning 7
-domains and 21 languages, making it the most extensive multilingual parallel
-dataset for ABSA to date. Our primary focus is on triplet extraction, which
-involves identifying aspect terms, aspect categories, and sentiment polarities.
-The dataset is constructed through an automatic translation process with human
-review to ensure quality. We perform extensive experiments using various
-baselines to assess performance and compatibility on M-ABSA. Our empirical
-findings highlight that the dataset enables diverse evaluation tasks, such as
-multilingual and multi-domain transfer learning, and large language model
-evaluation, underscoring its inclusivity and its potential to drive
-advancements in multilingual ABSA research.
-
-摘要：面向方面的觀點分析 (ABSA) 是資訊萃取和觀點分析中的一項重要任務，旨在識別文本中帶有相關觀點元素的方面。然而，現有的 ABSA 資料集以英語為中心，限制了多語言評估和研究的範圍。為了彌補這個差距，我們提出了 M-ABSA，這是一個涵蓋 7 個領域和 21 種語言的綜合性資料集，使其成為迄今為止最廣泛的 ABSA 多語言平行資料集。我們的重點是三元組萃取，其中涉及識別方面術語、方面類別和觀點極性。該資料集是透過自動翻譯過程構建的，並經過人工審查以確保品質。我們使用各種基線進行廣泛的實驗，以評估 M-ABSA 上的效能和相容性。我們的實證結果強調，該資料集支援多樣化的評估任務，例如多語言和多領域遷移學習，以及大型語言模型評估，凸顯其包容性和推動多語言 ABSA 研究進展的潛力。
-
-##### **AAKT: Enhancing Knowledge Tracing with Alternate Autoregressive Modeling**
-2502.11817v1 by Hao Zhou, Wenge Rong, Jianfei Zhang, Qing Sun, Yuanxin Ouyang, Zhang Xiong
-
-Knowledge Tracing (KT) aims to predict students' future performances based on
-their former exercises and additional information in educational settings. KT
-has received significant attention since it facilitates personalized
-experiences in educational situations. Simultaneously, the autoregressive
-modeling on the sequence of former exercises has been proven effective for this
-task. One of the primary challenges in autoregressive modeling for Knowledge
-Tracing is effectively representing the anterior (pre-response) and posterior
-(post-response) states of learners across exercises. Existing methods often
-employ complex model architectures to update learner states using question and
-response records. In this study, we propose a novel perspective on the knowledge
-tracing task by treating it as a generative process, consistent with the
-principles of autoregressive models. We demonstrate that knowledge states can
-be directly represented through autoregressive encodings on a question-response
-alternate sequence, where the model generates the most probable representation in
-the hidden state space by analyzing history interactions. This approach underpins
-our framework, termed Alternate Autoregressive Knowledge Tracing (AAKT).
-Additionally, we incorporate supplementary educational information, such as
-question-related skills, into our framework through an auxiliary task, and
-include extra exercise details, like response time, as additional inputs. Our
-proposed framework is implemented using advanced autoregressive technologies
-from Natural Language Generation (NLG) for both training and prediction.
-Empirical evaluations on four real-world KT datasets indicate that AAKT
-consistently outperforms all baseline models in terms of AUC, ACC, and RMSE.
-Furthermore, extensive ablation studies and visualized analysis validate the
-effectiveness of key components in AAKT.
-
-摘要：知識追蹤 (KT) 旨在根據學生的前次練習和教育環境中的額外資訊，預測學生的未來表現。由於 KT 有助於在教育情境中提供個人化體驗，因此備受關注。同時，在前次練習序列上進行自迴歸建模已被證明對此任務有效。知識追蹤中自迴歸建模的主要挑戰之一，是有效表示學習者在各項練習中的先驗 (反應前) 和後驗 (反應後) 狀態。現有方法通常採用複雜的模型架構，使用問題和反應記錄來更新學習者狀態。在本研究中，我們提出了一個關於知識追蹤任務的新觀點，將其視為一個生成過程，與自迴歸模型的原理一致。我們證明了知識狀態可以直接透過問答交替序列上的自迴歸編碼來表示，其中模型透過分析歷史互動來生成隱藏狀態空間中最可能的表示。此方法支撐了我們的架構，稱為交替自迴歸知識追蹤 (AAKT)。此外，我們透過輔助任務將補充教育資訊（例如與問題相關的技能）納入我們的架構，並將額外練習細節（例如反應時間）作為額外輸入。我們提出的架構採用自然語言生成 (NLG) 的先進自迴歸技術來進行訓練和預測。對四個真實世界的 KT 資料集進行的實證評估表明，AAKT 在 AUC、ACC 和 RMSE 方面始終優於所有基準模型。此外，廣泛的消融研究和視覺化分析驗證了 AAKT 中關鍵組件的有效性。
-
-##### **Towards Understanding Fine-Tuning Mechanisms of LLMs via Circuit Analysis**
-2502.11812v1 by Xu Wang, Yan Hu, Wenyu Du, Reynold Cheng, Benyou Wang, Difan Zou
-
-Fine-tuning significantly improves the performance of Large Language Models
-(LLMs), yet its underlying mechanisms remain poorly understood. This paper aims
-to provide an in-depth interpretation of the fine-tuning process through
-circuit analysis, a popular tool in Mechanistic Interpretability (MI). Unlike
-previous studies
-\cite{prakash2024finetuningenhancesexistingmechanisms,chhabra2024neuroplasticity}
-that focus on tasks where pre-trained models already perform well, we develop a
-set of mathematical tasks where fine-tuning yields substantial performance
-gains, which are closer to the practical setting. In our experiments, we
-identify circuits at various checkpoints during fine-tuning and examine the
-interplay between circuit analysis, fine-tuning methods, and task complexities.
-First, we find that while circuits maintain high node similarity before and
-after fine-tuning, their edges undergo significant changes, which is in
-contrast to the previous work
-\cite{prakash2024finetuningenhancesexistingmechanisms,chhabra2024neuroplasticity}
-that show circuits only add some additional components after fine-tuning. Based
-on these observations, we develop a circuit-aware Low-Rank Adaptation (LoRA)
-method, which assigns ranks to layers based on edge changes in the circuits.
-Experimental results demonstrate that our circuit-based LoRA algorithm achieves
-an average performance improvement of 2.46% over standard LoRA with similar
-parameter sizes. Furthermore, we explore how combining circuits from subtasks
-can enhance fine-tuning in compositional tasks, providing new insights into the
-design of such tasks and deepening the understanding of circuit dynamics and
-fine-tuning mechanisms.
-
-摘要：微調大幅提升大型語言模型 (LLM) 的效能，但其底層機制仍鮮為人知。本文旨在透過電路分析，一種機械可解釋性 (MI) 中廣泛使用的工具，提供微調過程的深入詮釋。不同於先前的研究
-\cite{prakash2024finetuningenhancesexistingmechanisms,chhabra2024neuroplasticity}
-專注於預訓練模型已表現良好的任務，我們開發了一組數學任務，其中微調產生顯著的效能提升，更接近實際設定。在我們的實驗中，我們在微調期間的各種檢查點識別電路，並探討電路分析、微調方法和任務複雜度之間的交互作用。首先，我們發現電路在微調前後雖然維持高節點相似度，但其邊緣卻經歷顯著變化，這與先前的研究
-\cite{prakash2024finetuningenhancesexistingmechanisms,chhabra2024neuroplasticity}
-顯示電路僅在微調後新增一些額外組件的結果相反。基於這些觀察，我們開發了一個電路感知低秩適應 (LoRA) 方法，根據電路中的邊緣變化為層級分配秩。實驗結果證明，我們的基於電路的 LoRA 演算法在參數大小相似的條件下，比標準 LoRA 平均提升了 2.46% 的效能。此外，我們探討如何結合子任務的電路來增強組合任務中的微調，為此類任務的設計提供新的見解，並加深對電路動態和微調機制的理解。
-
-##### **FineFilter: A Fine-grained Noise Filtering Mechanism for Retrieval-Augmented Large Language Models**
-2502.11811v1 by Qianchi Zhang, Hainan Zhang, Liang Pang, Hongwei Zheng, Yongxin Tong, Zhiming Zheng
-
-Retrieved documents containing noise will hinder Retrieval-Augmented
-Generation (RAG) from detecting answer clues, necessitating noise filtering
-mechanisms to enhance accuracy. Existing methods use re-ranking or summarization
-to identify the most relevant sentences, but directly and accurately locating
-answer clues from these large-scale and complex documents remains challenging.
-Unlike these document-level operations, we treat noise filtering as a
-sentence-level MinMax optimization problem: first identifying the potential
-clues from multiple documents using contextual information, then ranking them
-by relevance, and finally retaining the least clues through truncation. In this
-paper, we propose FineFilter, a novel fine-grained noise filtering mechanism
-for RAG consisting of a clue extractor, a re-ranker, and a truncator. We
-optimize each module to tackle complex reasoning challenges: (1) Clue extractor
-firstly uses sentences containing the answer and similar ones as fine-tuned
-targets, aiming at extracting sufficient potential clues; (2) Re-ranker is
-trained to prioritize effective clues based on the real feedback from
-generation module, with clues capable of generating correct answer as positive
-samples and others as negative; (3) Truncator takes the minimum clues needed to
-answer the question (truncation point) as fine-tuned targets, and performs
-truncation on the re-ranked clues to achieve fine-grained noise filtering.
-Experiments on three QA datasets demonstrate that FineFilter significantly
-outperforms baselines in terms of performance and inference cost. Further
-analysis on each module shows the effectiveness of our optimizations for
-complex reasoning.
-
-摘要：檢索到含有雜訊的文件會阻礙檢索增強生成 (RAG) 偵測答案線索，因此需要雜訊過濾機制來增強準確性。現有方法使用重新排序或摘要來找出最相關的句子，但從這些大規模且複雜的文件中直接且準確地找出答案線索仍然具有挑戰性。與這些文件層級的操作不同，我們將雜訊過濾視為一個句子層級的 MinMax 最佳化問題：首先使用脈絡資訊從多個文件中找出潛在線索，接著依據相關性對它們進行排序，最後透過截斷保留最少的線索。在本文中，我們提出 FineFilter，一種創新的細緻雜訊過濾機制，用於 RAG，它包含一個線索萃取器、一個重新排序器和一個截斷器。我們最佳化每個模組來應對複雜的推理挑戰：(1) 線索萃取器首先使用包含答案和類似答案的句子作為微調的目標，旨在萃取足夠的潛在線索；(2) 重新排序器經過訓練，根據生成模組的真實回饋來優先處理有效的線索，其中能夠生成正確答案的線索為正樣本，其他則為負樣本；(3) 截斷器將回答問題所需的最小線索 (截斷點) 視為微調的目標，並對重新排序的線索執行截斷，以達成細緻的雜訊過濾。在三個問答資料集上的實驗證實，FineFilter 在效能和推論成本方面都明顯優於基線。進一步分析每個模組顯示，我們的最佳化對於複雜推理而言是有效的。
-
-##### **Revealing Bias Formation in Deep Neural Networks Through the Geometric Mechanisms of Human Visual Decoupling**
-2502.11809v1 by Yanbiao Ma, Bowei Liu, Wei Dai, Jiayi Chen, Shuo Li
-
-Deep neural networks (DNNs) often exhibit biases toward certain categories
-during object recognition, even under balanced training data conditions. The
-intrinsic mechanisms underlying these biases remain unclear. Inspired by the
-human visual system, which decouples object manifolds through hierarchical
-processing to achieve object recognition, we propose a geometric analysis
-framework linking the geometric complexity of class-specific perceptual
-manifolds in DNNs to model bias. Our findings reveal that differences in
-geometric complexity can lead to varying recognition capabilities across
-categories, introducing biases. To support this analysis, we present the
-Perceptual-Manifold-Geometry library, designed for calculating the geometric
-properties of perceptual manifolds.
-
-摘要：深度神經網路 (DNN) 在物件辨識過程中，即使在平衡的訓練資料條件下，通常會對特定類別表現出偏見。這些偏見背後的基本機制仍然不清楚。受人類視覺系統的啟發，人類視覺系統透過階層化處理來解耦物件流形以達成物件辨識，我們提出一個幾何分析架構，將 DNN 中特定類別感知流形的幾何複雜度與模型偏見連結起來。我們的研究結果顯示，幾何複雜度的差異會導致不同類別的辨識能力有所不同，進而造成偏見。為了支持這個分析，我們提出感知流形幾何函式庫，用於計算感知流形的幾何屬性。
-
-##### **Exploring Translation Mechanism of Large Language Models**
-2502.11806v1 by Hongbin Zhang, Kehai Chen, Xuefeng Bai, Xiucheng Li, Min Zhang
-
-Large language models (LLMs) have succeeded remarkably in multilingual
-translation tasks. However, the inherent translation mechanisms of LLMs remain
-poorly understood, largely due to sophisticated architectures and vast
-parameter scales. In response to this issue, this study explores the
-translation mechanism of LLM from the perspective of computational components
-(e.g., attention heads and MLPs). Path patching is utilized to explore causal
-relationships between components, detecting those crucial for translation tasks
-and subsequently analyzing their behavioral patterns in human-interpretable
-terms. Comprehensive analysis reveals that translation is predominantly
-facilitated by a sparse subset of specialized attention heads (less than 5%),
-which extract source language, indicator, and positional features. MLPs
-subsequently integrate and process these features by transiting towards
-English-centric latent representations. Notably, building on the above
-findings, targeted fine-tuning of only 64 heads achieves translation
-improvement comparable to full-parameter tuning while preserving general
-capabilities.
-
-摘要：大型語言模型 (LLM) 在多語言翻譯任務中取得了顯著的成功。然而，LLM 內在的翻譯機制仍未被很好地理解，這主要是由於複雜的架構和龐大的參數規模。為了應對這個問題，本研究從計算元件（例如注意力頭和 MLP）的角度探討了 LLM 的翻譯機制。路徑修補用於探索元件之間的因果關係，檢測對翻譯任務至關重要的元件，並隨後以人類可解釋的方式分析它們的行為模式。綜合分析表明，翻譯主要由稀疏的專門注意力頭（不到 5%）促進，這些注意力頭提取源語言、指標和位置特徵。MLPs 隨後通過轉換為以英語為中心的潛在表示來整合和處理這些特徵。值得注意的是，根據上述發現，僅對 64 個頭進行有針對性的微調，即可實現與全參數調整相當的翻譯改進，同時保留一般能力。
-
-##### **Table-Critic: A Multi-Agent Framework for Collaborative Criticism and Refinement in Table Reasoning**
-2502.11799v1 by Peiying Yu, Guoxin Chen, Jingjing Wang
-
-Despite the remarkable capabilities of large language models (LLMs) in
-various reasoning tasks, they still struggle with table reasoning tasks,
-particularly in maintaining consistency throughout multi-step reasoning
-processes. While existing approaches have explored various decomposition
-strategies, they often lack effective mechanisms to identify and correct errors
-in intermediate reasoning steps, leading to cascading error propagation. To
-address these issues, we propose Table-Critic, a novel multi-agent framework
-that facilitates collaborative criticism and iterative refinement of the
-reasoning process until convergence to correct solutions. Our framework
-consists of four specialized agents: a Judge for error identification, a Critic
-for comprehensive critiques, a Refiner for process improvement, and a Curator
-for pattern distillation. To effectively deal with diverse and unpredictable
-error types, we introduce a self-evolving template tree that systematically
-accumulates critique knowledge through experience-driven learning and guides
-future reflections. Extensive experiments have demonstrated that Table-Critic
-achieves substantial improvements over existing methods, achieving superior
-accuracy and error correction rates while maintaining computational efficiency
-and lower solution degradation rate.
-
-摘要：儘管大型語言模型 (LLM) 在各種推理任務中展現出非凡的能力，它們在表格推理任務中仍面臨挑戰，特別是在多步驟推理過程中維持一致性方面。現有方法雖然探索了各種分解策略，但它們通常缺乏有效機制來識別和修正中間推理步驟中的錯誤，導致錯誤遞增。為了解決這些問題，我們提出 Table-Critic，一個新穎的多代理架構，它促進協作批評和反覆改進推理過程，直到收斂到正確的解決方案。我們的架構包含四個專業代理：用於錯誤識別的法官、用於全面批評的批評者、用於流程改進的精煉器，以及用於模式萃取的策展人。為了有效處理多樣且不可預測的錯誤類型，我們引入了一個自演化範本樹，它透過經驗驅動的學習系統性地累積批評知識，並引導未來的反思。廣泛的實驗證明，Table-Critic 在現有方法的基礎上取得了顯著的進步，在維持運算效率和較低解決方案劣化率的同時，達到了更高的準確度和錯誤修正率。
-
-##### **Personality Editing for Language Models through Relevant Knowledge Editing**
-2502.11789v1 by Seojin Hwang, Yumin Kim, Byeongjeong Kim, Hwanhee Lee
-
-Large Language Models (LLMs) play a vital role in applications like
-conversational agents and content creation, where controlling a model's
-personality is crucial for maintaining tone, consistency, and engagement.
-However, traditional prompt-based techniques for controlling personality often
-fall short, as they do not effectively mitigate the model's inherent biases. In
-this paper, we introduce a novel method PALETTE that enhances personality
-control through knowledge editing. By generating adjustment queries inspired by
-psychological assessments, our approach systematically adjusts responses to
-personality-related queries similar to modifying factual knowledge, thereby
-achieving controlled shifts in personality traits. Experimental results from
-both automatic and human evaluations demonstrate that our method enables more
-stable and well-balanced personality control in LLMs.
-
-摘要：大型語言模型 (LLM) 在會話代理和內容創作等應用程式中扮演至關重要的角色，其中控制模型的人格特質對於維持語氣、一致性和參與度至關重要。然而，傳統基於提示的控制人格技術通常無法達到預期效果，因為它們無法有效減輕模型固有的偏差。在本文中，我們介紹一種創新的方法 PALETTE，它通過知識編輯來增強人格控制。透過產生受心理評量啟發的調整查詢，我們的做法系統性地調整對人格相關查詢的回應，類似於修改事實知識，從而實現人格特質的受控轉變。來自自動和人工評估的實驗結果表明，我們的方法能夠在 LLM 中實現更穩定且均衡的人格控制。
-
-##### **Efficient Response Generation Method Selection for Fine-Tuning Large Language Models**
-2502.11779v1 by Xuan Ren, Qi Chen, Lingqiao Liu
-
-The training data for fine-tuning large language models (LLMs) is typically
-structured as input-output pairs. However, for many tasks, there can be
-multiple equally valid output variations for the same input. Recent studies
-have observed that the choice of output variation used in training can affect
-the model's performance. This raises an important question: how can we generate
-the most effective output from the many possible response generation strategy
-options? Rather than relying on the traditional but resource-intensive
-train-and-evaluate approach, this paper proposes a scalable, approximate method
-for estimating the quality of a small subset of generated training data derived
-from the same input. We then evaluate how well this small subset of generated
-output fits the target model we are trying to train. We present a large-scale
-benchmark covering diverse reasoning-based datasets to support our study.
-  The central idea is that a good output should closely resemble the output
-generated by the target LLM. We formalize this 'closeness' as the expected
-alignment score between a candidate output and the output sampled from the
-target LLM. We connect this measurement to the perplexity metric used in
-previous literature and demonstrate that leveraging an alignment-based metric
-can provide better predictions of model performance. Using this strategy, we
-can evaluate a small subset of the generated output from each response
-generation strategy option, then select the most effective strategy. We show
-that an LLM trained on data generated by the selected strategy could lead to a
-significant performance gain in many cases.
-
-摘要：大型語言模型 (LLM) 的微調訓練資料通常以輸入輸出配對結構化。然而，對於許多任務而言，相同的輸入可能有多個同樣有效的輸出變化。最近的研究觀察到訓練中使用的輸出變化選擇會影響模型的效能。這引發了一個重要問題：我們如何從許多可能的回應產生策略選項中產生最有效的輸出？本文提出一個可擴充、近似的方法，用於估計從相同輸入衍生的訓練資料小子集的品質，而非依賴傳統但資源密集的訓練和評估方法。然後我們評估這個產生輸出的小子集與我們嘗試訓練的目標模型的契合程度。我們提出一個涵蓋各種基於推理的資料集的大規模基準，以支持我們的研究。
-核心概念是良好的輸出應與目標 LLM 產生的輸出密切相似。我們將這種「接近度」形式化為候選輸出與從目標 LLM 取樣的輸出之間的預期對齊分數。我們將此測量連接到先前文獻中使用的困惑度指標，並證明利用基於對齊的指標可以提供更好的模型效能預測。使用此策略，我們可以評估每個回應產生策略選項所產生輸出的小子集，然後選擇最有效的策略。我們展示在由所選策略產生的資料上訓練的 LLM，在許多情況下可能導致顯著的效能提升。
-
-##### **Deep Neural Networks for Accurate Depth Estimation with Latent Space Features**
-2502.11777v1 by Siddiqui Muhammad Yasir, Hyunsik Ahn
-
-Depth estimation plays a pivotal role in advancing human-robot interactions,
-especially in indoor environments where accurate 3D scene reconstruction is
-essential for tasks like navigation and object handling. Monocular depth
-estimation, which relies on a single RGB camera, offers a more affordable
-solution compared to traditional methods that use stereo cameras or LiDAR.
-However, despite recent progress, many monocular approaches struggle with
-accurately defining depth boundaries, leading to less precise reconstructions.
-In response to these challenges, this study introduces a novel depth estimation
-framework that leverages latent space features within a deep convolutional
-neural network to enhance the precision of monocular depth maps. The proposed
-model features dual encoder-decoder architecture, enabling both color-to-depth
-and depth-to-depth transformations. This structure allows for refined depth
-estimation through latent space encoding. To further improve the accuracy of
-depth boundaries and local features, a new loss function is introduced. This
-function combines latent loss with gradient loss, helping the model maintain
-the integrity of depth boundaries. The framework is thoroughly tested using the
-NYU Depth V2 dataset, where it sets a new benchmark, particularly excelling in
-complex indoor scenarios. The results clearly show that this approach
-effectively reduces depth ambiguities and blurring, making it a promising
-solution for applications in human-robot interaction and 3D scene
-reconstruction.
-
-摘要：深度估計在推進人機互動方面發揮著至關重要的作用，特別是在室內環境中，準確的 3D 場景重建對於導航和物體處理等任務至關重要。單目深度估計依賴於單個 RGB 相機，與使用立體相機或 LiDAR 的傳統方法相比，它提供了一個更經濟的解決方案。然而，儘管最近取得了進展，許多單目方法在準確定義深度邊界方面仍然存在困難，從而導致重建精度降低。為了應對這些挑戰，本研究引入了一個新穎的深度估計框架，該框架利用深度卷積神經網路中的潛在空間特徵來增強單目深度圖的精度。所提出的模型採用雙編碼器-解碼器架構，既能進行顏色到深度的轉換，又能進行深度到深度的轉換。這種結構允許通過潛在空間編碼進行更精細的深度估計。為了進一步提高深度邊界和局部特徵的精度，引入了一個新的損失函數。此函數將潛在損失與梯度損失相結合，幫助模型維護深度邊界的完整性。使用 NYU Depth V2 數據集對該框架進行了全面測試，在該數據集上，它設定了一個新的基準，特別是在複雜的室內場景中表現出色。結果清楚地表明，這種方法有效地減少了深度歧義和模糊現象，使其成為人機互動和 3D 場景重建應用中一種有前途的解決方案。
-
-##### **The Validation Gap: A Mechanistic Analysis of How Language Models Compute Arithmetic but Fail to Validate It**
-2502.11771v1 by Leonardo Bertolazzi, Philipp Mondorf, Barbara Plank, Raffaella Bernardi
-
-The ability of large language models (LLMs) to validate their output and
-identify potential errors is crucial for ensuring robustness and reliability.
-However, current research indicates that LLMs struggle with self-correction,
-encountering significant challenges in detecting errors. While studies have
-explored methods to enhance self-correction in LLMs, relatively little
-attention has been given to understanding the models' internal mechanisms
-underlying error detection. In this paper, we present a mechanistic analysis of
-error detection in LLMs, focusing on simple arithmetic problems. Through
-circuit analysis, we identify the computational subgraphs responsible for
-detecting arithmetic errors across four smaller-sized LLMs. Our findings reveal
-that all models heavily rely on *consistency heads*--attention heads
-that assess surface-level alignment of numerical values in arithmetic
-solutions. Moreover, we observe that the models' internal arithmetic
-computation primarily occurs in higher layers, whereas validation takes place
-in middle layers, before the final arithmetic results are fully encoded. This
-structural dissociation between arithmetic computation and validation seems to
-explain why current LLMs struggle to detect even simple arithmetic errors.
-
-摘要：大型語言模型 (LLM) 驗證其輸出並識別潛在錯誤的能力對於確保穩健性和可靠性至關重要。然而，目前的研究顯示，LLM 難以進行自我修正，在檢測錯誤時遇到重大挑戰。儘管已有研究探討增強 LLM 自我修正的方法，但對於瞭解模型內部錯誤檢測機制卻關注較少。在本文中，我們提出對 LLM 中錯誤檢測的機制分析，重點關注簡單的算術問題。通過電路分析，我們在四個較小規模的 LLM 中識別出負責檢測算術錯誤的計算子圖。我們的研究結果表明，所有模型都嚴重依賴「一致性頭部」，即用於評估算術解中數值表層對齊程度的注意力頭。此外，我們觀察到模型的內部算術運算主要發生在較高層，而驗證則發生在中間層，早於最終算術結果完全編碼之時。算術運算和驗證之間的這種結構性分離似乎解釋了為什麼當前的 LLM 連簡單的算術錯誤都難以檢測。
-
-##### **Cognitive-Aligned Document Selection for Retrieval-augmented Generation**
-2502.11770v1 by Bingyu Wan, Fuxi Zhang, Zhongpeng Qi, Jiayi Ding, Jijun Li, Baoshi Fan, Yijia Zhang, Jun Zhang
-
-Large language models (LLMs) inherently display hallucinations since the
-precision of generated texts cannot be guaranteed purely by the parametric
-knowledge they include. Although retrieval-augmented generation (RAG) systems
-enhance the accuracy and reliability of generative models by incorporating
-external documents, these retrieved documents often fail to adequately support
-the model's responses in practical applications. To address this issue, we
-propose GGatrieval (Fine-\textbf{G}rained \textbf{G}rounded \textbf{A}lignment
-Re\textbf{trieval} for verifiable generation), which leverages an LLM to
-dynamically update queries and filter high-quality, reliable retrieval
-documents. Specifically, we parse the user query into its syntactic components
-and perform fine-grained grounded alignment with the retrieved documents. For
-query components that cannot be individually aligned, we propose a dynamic
-semantic compensation mechanism that iteratively refines and rewrites the query
-while continuously updating the retrieval results. This iterative process
-continues until the retrieved documents sufficiently support the query's
-response. Our approach introduces a novel criterion for filtering retrieved
-documents, closely emulating human strategies for acquiring targeted
-information. This ensures that the retrieved content effectively supports and
-verifies the generated outputs. On the ALCE benchmark, our method significantly
-surpasses a wide range of baselines, achieving state-of-the-art performance.
-
-摘要：大型語言模型 (LLM) 本質上會出現幻覺，因為生成的文本的準確性無法僅透過它們包含的參數化知識來保證。儘管檢索增強生成 (RAG) 系統透過納入外部文件來提升生成模型的準確性和可靠性，但這些檢索的文件在實際應用中常常無法充分支援模型的回應。為了解決這個問題，我們提出 GGatrieval（用於可驗證生成的精細化粒度化基礎對齊檢索），它利用 LLM 來動態更新查詢並過濾高品質、可靠的檢索文件。具體來說，我們將使用者查詢分析成其語法組成部分，並對檢索文件執行精細化粒度化基礎對齊。對於無法個別對齊的查詢組成部分，我們提出一個動態語義補償機制，在持續更新檢索結果的同時，反覆修正和重寫查詢。這個反覆的程序會持續到檢索的文件充分支援查詢的回應為止。我們的做法引進了一個新的檢索文件過濾標準，嚴密地模擬人類獲取目標資訊的策略。這確保檢索的內容有效地支援和驗證生成的輸出。在 ALCE 基準測試中，我們的做法顯著超越各種基線，達成最先進的效能。
-
-##### **From Selection to Generation: A Survey of LLM-based Active Learning**
-2502.11767v1 by Yu Xia, Subhojyoti Mukherjee, Zhouhang Xie, Junda Wu, Xintong Li, Ryan Aponte, Hanjia Lyu, Joe Barrow, Hongjie Chen, Franck Dernoncourt, Branislav Kveton, Tong Yu, Ruiyi Zhang, Jiuxiang Gu, Nesreen K. Ahmed, Yu Wang, Xiang Chen, Hanieh Deilamsalehy, Sungchul Kim, Zhengmian Hu, Yue Zhao, Nedim Lipka, Seunghyun Yoon, Ting-Hao Kenneth Huang, Zichao Wang, Puneet Mathur, Soumyabrata Pal, Koyel Mukherjee, Zhehao Zhang, Namyong Park, Thien Huu Nguyen, Jiebo Luo, Ryan A. Rossi, Julian McAuley
-
-Active Learning (AL) has been a powerful paradigm for improving model
-efficiency and performance by selecting the most informative data points for
-labeling and training. In recent active learning frameworks, Large Language
-Models (LLMs) have been employed not only for selection but also for generating
-entirely new data instances and providing more cost-effective annotations.
-Motivated by the increasing importance of high-quality data and efficient model
-training in the era of LLMs, we present a comprehensive survey on LLM-based
-Active Learning. We introduce an intuitive taxonomy that categorizes these
-techniques and discuss the transformative roles LLMs can play in the active
-learning loop. We further examine the impact of AL on LLM learning paradigms
-and its applications across various domains. Finally, we identify open
-challenges and propose future research directions. This survey aims to serve as
-an up-to-date resource for researchers and practitioners seeking to gain an
-intuitive understanding of LLM-based AL techniques and deploy them to new
-applications.
-
-摘要：主動學習 (AL) 透過挑選最具資訊性的資料點來標記和訓練，已成為一種強大的範例，用以提升模型效率和效能。在最近的主動學習架構中，大型語言模型 (LLM) 不僅用於挑選，也用於產生全新的資料實例，並提供更具成本效益的註解。在大型語言模型時代，由於高品質資料和高效能模型訓練日益重要，我們針對基於大型語言模型的主動學習提出了一項全面的調查。我們提出一個直覺式的分類法，用以分類這些技術，並探討大型語言模型在主動學習迴圈中可以扮演的轉型角色。我們進一步探討主動學習對大型語言模型學習範例的影響，以及它在各種領域中的應用。最後，我們找出開放式挑戰，並提出未來的研究方向。本調查旨在作為研究人員和實務工作者的最新資源，用以獲得對基於大型語言模型的主動學習技術的直覺式理解，並將其部署至新的應用程式。
-
-##### **Warmup-Distill: Bridge the Distribution Mismatch between Teacher and Student before Knowledge Distillation**
-2502.11766v1 by Zengkui Sun, Yijin Liu, Fandong Meng, Yufeng Chen, Jinan Xu, Jie Zhou
-
-The widespread deployment of Large Language Models (LLMs) is hindered by the
-high computational demands, making knowledge distillation (KD) crucial for
-developing compact smaller ones. However, conventional KD methods suffer from
-a distribution mismatch between the teacher and student models, leading
-to poor distillation performance. For instance, the widely used KL-based
-methods suffer from mode-averaging and mode-collapsing problems because of the
-mismatched probability distribution between the two models. Previous studies
-mainly optimize this issue via different distance calculations towards the
-distribution of both models. Unfortunately, the distribution mismatch issue
-still exists in the early stage of the distillation. Hence, to reduce the
-impact of distribution mismatch, we propose a simple yet efficient method,
-named Warmup-Distill, which aligns the distillation of the student to that of
-the teacher in advance of distillation. Specifically, we first detect the
-distribution of the student model in practical scenarios with its internal
-knowledge, and then modify the knowledge with low probability via the teacher
-as the checker. Consequently, Warmup-Distill aligns the internal student's
-knowledge to that of the teacher, which expands the distribution of the student
-with the teacher's, and assists the student model to learn better in the
-subsequent distillation. Experiments on the seven benchmarks demonstrate that
-Warmup-Distill could provide a warmup student more suitable for distillation,
-which outperforms the vanilla student by at least +0.4 averaged score among all
-benchmarks. Notably, with the assistance of Warmup-Distill, the distillation
-on the math task could yield a further improvement, at most +1.9% accuracy.
-
-摘要：大型語言模型 (LLM) 的廣泛部署受到高運算需求的阻礙，這使得知識蒸餾 (KD) 對於開發緊湊型的小型模型至關重要。然而，傳統的 KD 方法忍受了教師和學生模型之間的分布不匹配問題，導致蒸餾效果不佳。例如，廣泛使用的基於 KL 的方法會出現模式平均和模式崩潰問題，因為兩個模型之間的機率分佈不匹配。先前的研究主要透過不同的距離計算來最佳化這個問題，以朝向兩個模型的分布。不幸的是，分布不匹配的問題仍然存在於蒸餾的早期階段。因此，為了減少分布不匹配的影響，我們提出了一種簡單但有效的方法，稱為 Warmup-Distill，它在蒸餾之前將學生的蒸餾與教師的蒸餾對齊。具體來說，我們首先使用其內部知識在實際場景中檢測學生的分布，然後透過教師作為檢查員修改低機率的知識。因此，Warmup-Distill 將學生的內部知識與教師的知識對齊，這會將學生的分布擴展到教師的分布，並協助學生模型在後續的蒸餾中學習得更好。在七個基準測試上的實驗表明，Warmup-Distill 可以提供更適合蒸餾的熱身學生，在所有基準測試中，其表現優於香草學生至少 +0.4 的平均分數。值得注意的是，在 Warmup-Distill 的協助下，數學任務上的蒸餾可以進一步提升，最多可提升 +1.9% 的準確度。
-
-##### **Lightweight Deepfake Detection Based on Multi-Feature Fusion**
-2502.11763v1 by Siddiqui Muhammad Yasir, Hyun Kim
-
-Deepfake technology utilizes deep-learning-based face manipulation techniques
-to seamlessly replace faces in videos, creating highly realistic but
-artificially generated content. Although this technology has beneficial
-applications in media and entertainment, misuse of its capabilities may lead to
-serious risks, including identity theft, cyberbullying, and false information. The
-integration of DL with visual cognition has resulted in important technological
-improvements, particularly in addressing privacy risks caused by artificially
-generated deepfake images on digital media platforms. In this study, we propose
-an efficient and lightweight method for detecting deepfake images and videos,
-making it suitable for devices with limited computational resources. In order
-to reduce the computational burden usually associated with DL models, our method
-integrates machine learning classifiers in combination with keyframing
-approaches and texture analysis. Moreover, the features extracted with a
-histogram of oriented gradients (HOG), local binary pattern (LBP), and KAZE bands
-were integrated and evaluated using random forest, extreme gradient boosting, extra
-trees, and support vector classifier algorithms. Our findings show that a
-feature-level fusion of HOG, LBP, and KAZE features improves accuracy to 92% and
-96% on FaceForensics++ and Celeb-DFv2, respectively.
-
-摘要：深度偽造技術利用基於深度學習的換臉技術，可無縫替換影片中的臉孔，創造出高度逼真但人工產生的內容。儘管這項技術在媒體和娛樂方面有益，但若誤用其功能可能會導致嚴重的風險，包括身分盜用、網路霸凌和虛假訊息。深度學習與視覺認知的整合已帶來重要的技術進步，特別是在解決由數位媒體平台上的人工深度偽造影像所造成的隱私風險方面。在本研究中，我們提出了一種用於偵測深度偽造影像和影片的有效且輕量級的方法，使其適用於運算資源有限的裝置。為了降低通常與深度學習模型相關的運算負擔，我們的做法結合了機器學習分類器、關鍵影格方法和紋理分析。此外，我們整合了使用方向梯度直方圖 (HOG)、局部二進位模式 (LBP) 和 KAZE 頻段所萃取出的特徵，並使用隨機森林、極端梯度提升、額外樹木和支援向量分類器演算法進行評估。我們的研究結果顯示，HOG、LBP 和 KAZE 特徵的層級融合將準確度提升至 92%，分別在 FaceForensics++ 和 Celeb-DFv2 上達到 96%。
-
-##### **HintsOfTruth: A Multimodal Checkworthiness Detection Dataset with Real and Synthetic Claims**
-2502.11753v1 by Michiel van der Meer, Pavel Korshunov, Sébastien Marcel, Lonneke van der Plas
-
-Misinformation can be countered with fact-checking, but the process is costly
-and slow. Identifying checkworthy claims is the first step, where automation
-can help scale fact-checkers' efforts. However, detection methods struggle with
-content that is 1) multimodal, 2) from diverse domains, and 3) synthetic. We
-introduce HintsOfTruth, a public dataset for multimodal checkworthiness
-detection with $27$K real-world and synthetic image/claim pairs. The mix of
-real and synthetic data makes this dataset unique and ideal for benchmarking
-detection methods. We compare fine-tuned and prompted Large Language Models
-(LLMs). We find that well-configured lightweight text-based encoders perform
-comparably to multimodal models but the first only focus on identifying
-non-claim-like content. Multimodal LLMs can be more accurate but come at a
-significant computational cost, making them impractical for large-scale
-applications. When faced with synthetic data, multimodal models perform more
-robustly.
-
-摘要：錯誤訊息可以透過事實查核來反駁，但這個過程既昂貴又緩慢。辨識需要查核的說法是第一步，自動化可以幫助擴大事實查核人員的努力。然而，偵測方法會在處理 1) 多模態、2) 來自不同領域，以及 3) 合成的內容時遇到困難。我們引進 HintsOfTruth，一個用於多模態查核價值偵測的公開資料集，其中包含 27K 個真實世界和合成的影像/說法配對。真實和合成資料的組合讓這個資料集獨一無二，非常適合用於基準偵測方法。我們比較微調和提示的大語言模型 (LLM)。我們發現，設定良好的輕量級文字編碼器的表現與多模態模型相當，但前者只專注於辨識非說法類型的內容。多模態 LLM 可能更準確，但需要大量的運算成本，這讓它們不適用於大規模的應用。在面對合成資料時，多模態模型的表現更強健。
-
-##### **Language Models Can See Better: Visual Contrastive Decoding For LLM Multimodal Reasoning**
-2502.11751v1 by Yuqi Pang, Bowen Yang, Haoqin Tu, Yun Cao, Zeyu Zhang
-
-Although Large Language Models (LLMs) excel in reasoning and generation for
-language tasks, they are not specifically designed for multimodal challenges.
-Training Multimodal Large Language Models (MLLMs), however, is
-resource-intensive and constrained by various training limitations. In this
-paper, we propose the Modular-based Visual Contrastive Decoding (MVCD)
-framework to remove this obstacle. Our framework leverages LLMs' In-Context
-Learning (ICL) capability and the proposed visual contrastive-example decoding
-(CED), specifically tailored for this framework, without requiring any
-additional training. By converting visual signals into text and focusing on
-contrastive output distributions during decoding, we can highlight the new
-information introduced by contextual examples, explore their connections, and
-avoid over-reliance on prior encoded knowledge. MVCD enhances LLMs' visual
-perception so that they can see and reason over the input visuals. To demonstrate
-MVCD's effectiveness, we conduct experiments with four LLMs across five
-question answering datasets. Our results not only show consistent improvement
-in model accuracy but also explain the effective components of our decoding
-strategy. Our code will be available at https://github.com/Pbhgit/MVCD.
-
-摘要：儘管大型語言模型 (LLM) 在語言任務的推理和生成方面表現優異，但它們並非專門針對多模態挑戰而設計。然而，訓練多模態大型語言模型 (MLLM) 十分耗費資源，並受到各種訓練限制。在本文中，我們提出基於模組的視覺對比解碼 (MVCD) 架構來克服這個障礙。我們的架構利用 LLM 的情境學習 (ICL) 能力和專門為此架構量身打造的視覺對比範例解碼 (CED)，而無需任何額外訓練。透過將視覺信號轉換為文字，並在解碼過程中專注於對比輸出分佈，我們可以突顯情境範例引入的新資訊，探索它們的關聯性，並避免過度依賴先前編碼的知識。MVCD 增強了 LLM 的視覺感知能力，使其能夠觀察並推論輸入視覺效果。為了證明 MVCD 的有效性，我們使用四個 LLM 在五個問答資料集上進行實驗。我們的結果不僅顯示模型準確度持續提升，還能清楚說明我們的解碼策略中的有效組成部分。我們的程式碼將在 https://github.com/Pbhgit/MVCD 公開。
-
-##### **SQL-o1: A Self-Reward Heuristic Dynamic Search Method for Text-to-SQL**
-2502.11741v1 by Shuai Lyu, Haoran Luo, Zhonghong Ou, Yifan Zhu, Xiaoran Shang, Yang Qin, Meina Song
-
-The Text-to-SQL (Text2SQL) task aims to convert natural language queries into
-executable SQL queries. Thanks to the application of large language models
-(LLMs), significant progress has been made in this field. However, challenges
-such as model scalability, limited generation space, and coherence issues in
-SQL generation still persist. To address these issues, we propose SQL-o1, a
-Self-Reward-based heuristic search method designed to enhance the reasoning
-ability of LLMs in SQL query generation. SQL-o1 combines Monte Carlo Tree
-Search (MCTS) for heuristic process-level search and constructs a Schema-Aware
-dataset to help the model better understand database schemas. Extensive
-experiments on the Bird and Spider datasets demonstrate that SQL-o1 improves
-execution accuracy by 10.8\% on the complex Bird dataset compared to the latest
-baseline methods, even outperforming GPT-4-based approaches. Additionally,
-SQL-o1 excels in few-shot learning scenarios and shows strong cross-model
-transferability. Our code is publicly available
-at: https://github.com/ShuaiLyu0110/SQL-o1.
-
-摘要：文本转 SQL（Text2SQL）任务旨在将自然语言查询转换为可执行的 SQL 查询。得益于大型语言模型（LLM）的应用，该领域取得了显著进展。然而，模型可扩展性、生成空间受限和 SQL 生成的连贯性问题等挑战仍然存在。为了解决这些问题，我们提出了 SQL-o1，这是一种基于自我奖励的启发式搜索方法，旨在增强 LLM 在 SQL 查询生成中的推理能力。SQL-o1 结合了蒙特卡罗树搜索（MCTS）用于启发式过程级搜索，并构建了一个模式感知数据集，以帮助模型更好地理解数据库模式。在 Bird 和 Spider 数据集上的大量实验表明，与最新的基准方法相比，SQL-o1 将复杂 Bird 数据集上的执行准确率提高了 10.8%，甚至优于基于 GPT-4 的方法。此外，SQL-o1 在少样本学习场景中表现出色，并显示出强大的跨模型可迁移性。我们的代码已公开发布在：https://github.com/ShuaiLyu0110/SQL-o1。
+摘要：儘管哈薩克人口達兩千萬，但哈薩克的文化和語言在自然語言處理領域仍未得到充分的重視。儘管大型語言模型 (LLM) 在全球持續進步，但哈薩克語的進展卻十分有限，這從專用模型和基準評估的稀缺性中可見一斑。為了解決這個差距，我們引入了 KazMMLU，這是第一個專門為哈薩克語設計的 MMLU 風格資料集。KazMMLU 包含 23,000 個問題，涵蓋各種教育層級，包括 STEM、人文學科和社會科學，這些問題來自真實的教育材料，並由母語人士和教育工作者手動驗證。該資料集包含 10,969 個哈薩克語問題和 12,031 個俄語問題，反映了哈薩克的雙語教育體系和豐富的在地脈絡。我們對幾個最先進的多語言模型（Llama-3.1、Qwen-2.5、GPT-4 和 DeepSeek V3）的評估顯示，仍有很大的改進空間，因為即使是效能最好的模型，也很難在哈薩克語和俄語中達到有競爭力的效能。這些發現強調了與資源豐富的語言相比，存在顯著的效能差距。我們希望我們的資料集能促進以哈薩克語為中心的 LLM 的進一步研究和開發。資料和程式碼將在獲得接受後提供。
+
+##### **Reasoning and the Trusting Behavior of DeepSeek and GPT: An Experiment Revealing Hidden Fault Lines in Large Language Models**
+2502.12825v1 by Rubing Lu, João Sedoc, Arun Sundararajan
+
+When encountering increasingly frequent performance improvements or cost
+reductions from a new large language model (LLM), developers of applications
+leveraging LLMs must decide whether to take advantage of these improvements or
+stay with older tried-and-tested models. Low perceived switching frictions can
+lead to choices that do not consider more subtle behavior changes that the
+transition may induce. Our experiments use a popular game-theoretic behavioral
+economics model of trust to show stark differences in the trusting behavior of
+OpenAI's and DeepSeek's models. We highlight a collapse in the economic trust
+behavior of the o1-mini and o3-mini models as they reconcile profit-maximizing
+and risk-seeking with future returns from trust, and contrast it with
+DeepSeek's more sophisticated and profitable trusting behavior that stems from
+an ability to incorporate deeper concepts like forward planning and
+theory-of-mind. As LLMs form the basis for high-stakes commercial systems, our
+results highlight the perils of relying on LLM performance benchmarks that are
+too narrowly defined and suggest that careful analysis of their hidden fault
+lines should be part of any organization's AI strategy.
+
+摘要：當遇到越來越頻繁的效能提升或來自於新的大型語言模型 (LLM) 的成本降低時，利用 LLM 的應用程式開發人員必須決定是否要利用這些提升或維持較舊且經過測試的模型。低感知切換摩擦可能會導致選擇不考慮轉換可能誘發的更細微的行為改變。我們的實驗使用信任的流行博弈論行為經濟模型來顯示 OpenAI 和 DeepSeek 模型在信任行為上的顯著差異。我們強調 o1-mini 和 o3-mini 模型的經濟信任行為崩潰，因為它們調和了利潤最大化和風險尋求與來自信任的未來回報，並將其與 DeepSeek 更複雜且有利可圖的信任行為進行對比，這種信任行為源於整合更深層的概念，例如前瞻性規劃和心智理論。由於 LLM 構成高風險商業系統的基礎，我們的結果突顯了依賴定義過於狹窄的 LLM 效能基準的危險性，並建議仔細分析其隱藏的斷層線應該是任何組織的 AI 策略的一部分。
+
+##### **Pitfalls of Scale: Investigating the Inverse Task of Redefinition in Large Language Models**
+2502.12821v1 by Elena Stringli, Maria Lymperaiou, Giorgos Filandrianos, Giorgos Stamou
+
+Inverse tasks can uncover potential reasoning gaps as Large Language Models
+(LLMs) scale up. In this work, we explore the redefinition task, in which we
+assign alternative values to well-known physical constants and units of
+measure, prompting LLMs to respond accordingly. Our findings show that not only
+does model performance degrade with scale, but its false confidence also rises.
+Moreover, while factors such as prompting strategies or response formatting are
+influential, they do not preclude LLMs from anchoring to memorized values.
+
+摘要：逆向任務可以揭示大型語言模型 (LLM) 擴展時潛在的推理差距。在本文中，我們探討重新定義任務，其中我們將替換值指定給著名的物理常數和測量單位，促使 LLM 做出相應回應。我們的研究結果表明，模型效能不僅會隨著規模而下降，其虛假信心也會上升。此外，儘管提示策略或回應格式等因素具有影響力，但它們並不妨礙 LLM 錨定在記憶值上。
+
+##### **Simulating User Diversity in Task-Oriented Dialogue Systems using Large Language Models**
+2502.12813v1 by Adnan Ahmad, Stefan Hillmann, Sebastian Möller
+
+In this study, we explore the application of Large Language Models (LLMs) for
+generating synthetic users and simulating user conversations with a
+task-oriented dialogue system and present detailed results and their analysis.
+We propose a comprehensive, novel user simulation technique that
+uses LLMs to create diverse user profiles, set goals, engage in multi-turn
+dialogues, and evaluate the conversation success. We employ two proprietary
+LLMs, namely GPT-4o and GPT-o1 (Achiam et al., 2023), to generate a
+heterogeneous base of user profiles, characterized by varied demographics,
+multiple user goals, different conversational styles, initial knowledge levels,
+interests, and conversational objectives. We perform a detailed analysis of the
+user profiles generated by LLMs to assess the diversity, consistency, and
+potential biases inherent in these LLM-generated user simulations. We find that
+GPT-o1 generates a more heterogeneous user distribution across most user
+attributes, while GPT-4o generates more skewed user attributes. The generated
+set of user profiles is then utilized to simulate dialogue sessions by
+interacting with a task-oriented dialogue system.
+
+摘要：在這項研究中，我們探討大型語言模型 (LLM) 在生成合成使用者和模擬使用者對話，並使用任務導向對話系統進行對話的應用，並提出詳細的結果及其分析。我們提出了一種全面的使用者模擬技術新方法，利用 LLM 建立多樣化的使用者概況、設定目標、參與多輪對話，並評估對話的成功性。我們採用了兩個專有的 LLM，即 GPT-4o 和 GPT-o1 (Achiam 等人，2023 年)，以生成一個異質的使用者概況基礎，其特徵在於不同的人口統計資料、多個使用者目標、不同的對話風格、初始知識水準、興趣和對話目標。我們對 LLM 生成的使用者概況進行了詳細分析，以評估這些 LLM 生成的使用者模擬中固有的多樣性、一致性和潛在偏差。我們發現 GPT-o1 在大多數使用者屬性中產生更異質的使用者分佈，而 GPT-4o 則產生更偏斜的使用者屬性。然後利用生成的使用者概況集，透過與任務導向對話系統互動來模擬對話會話。
+
+##### **Towards Text-Image Interleaved Retrieval**
+2502.12799v1 by Xin Zhang, Ziqi Dai, Yongqi Li, Yanzhao Zhang, Dingkun Long, Pengjun Xie, Meishan Zhang, Jun Yu, Wenjie Li, Min Zhang
+
+Current multimodal information retrieval studies mainly focus on single-image
+inputs, which limits real-world applications involving multiple images and
+text-image interleaved content. In this work, we introduce the text-image
+interleaved retrieval (TIIR) task, where the query and document are interleaved
+text-image sequences, and the model is required to understand the semantics
+from the interleaved context for effective retrieval. We construct a TIIR
+benchmark based on naturally interleaved wikiHow tutorials, where a specific
+pipeline is designed to generate interleaved queries. To explore the task, we
+adapt several off-the-shelf retrievers and build a dense baseline using an
+interleaved multimodal large language model (MLLM). We then propose a novel
+Matryoshka Multimodal Embedder (MME), which compresses the number of visual
+tokens at different granularity, to address the challenge of excessive visual
+tokens in MLLM-based TIIR models. Experiments demonstrate that simple adaptation
+of existing models does not consistently yield effective results. Our MME
+achieves significant improvements over the baseline with substantially fewer
+visual tokens. We provide extensive analysis and will release the dataset and
+code to facilitate future research.
+
+摘要：目前的多模態資訊檢索研究主要集中在單一影像輸入，這限制了涉及多個影像和文字影像交錯內容的實際應用。在這項工作中，我們引入了文字影像交錯檢索 (TIIR) 任務，其中查詢和文件是交錯的文字影像序列，並且模型需要理解交錯內容的語意以進行有效檢索。我們根據自然交錯的 wikiHow 教學課程建構了一個 TIIR 基準，其中設計了一個特定的管線來產生交錯查詢。為了探索這個任務，我們調整了幾個現成的檢索器，並透過交錯的多模態大型語言模型 (MLLM) 建立了一個密集的基準。然後，我們提出了一個新穎的 Matryoshka 多模態嵌入器 (MME)，它壓縮了不同粒度視覺符號的數量，以解決基於 MLLM 的 TIIR 模型中過多視覺符號的挑戰。實驗表明，對現有模型的簡單調整並未持續產生有效結果。我們的 MME 透過大幅減少視覺符號，達到了比基準顯著的改進。我們提供了廣泛的分析，並將釋出資料集和程式碼以促進未來的研究。
+
+##### **Envious Explore and Exploit**
+2502.12798v1 by Omer Ben-Porat, Yotam Gafni, Or Markovetzki
+
+Explore-and-exploit tradeoffs play a key role in recommendation systems
+(RSs), aiming at serving users better by learning from previous interactions.
+Despite their commercial success, the societal effects of explore-and-exploit
+mechanisms are not well understood, especially regarding the utility
+discrepancy they generate between different users. In this work, we measure
+such discrepancy using the economic notion of envy. We present a multi-armed
+bandit-like model in which every round consists of several sessions, and
+rewards are realized once per round. We call the latter property reward
+consistency, and show that the RS can leverage this property for better
+societal outcomes. On the downside, doing so also generates envy, as
+late-to-arrive users enjoy the information gathered by early-to-arrive users.
+We examine the generated envy under several arrival order mechanisms and
+virtually any anonymous algorithm, i.e., any algorithm that treats all similar
+users similarly without leveraging their identities. We provide tight envy
+bounds on uniform arrival and upper bound the envy for nudged arrival, in which
+the RS can affect the order of arrival by nudging its users. Furthermore, we
+study the efficiency-fairness trade-off by devising an algorithm that allows
+constant envy and approximates the optimal welfare in restricted settings.
+Finally, we validate our theoretical results empirically using simulations.
+
+摘要：探索與開發的取捨在推薦系統 (RS) 中扮演著關鍵角色，旨在透過學習先前的互動來為使用者提供更好的服務。儘管在商業上獲得成功，但探索與開發機制的社會效應仍未被充分理解，特別是關於它們在不同使用者之間產生的效用差異。在這項工作中，我們使用經濟學中的嫉妒概念來衡量這種差異。我們提出了一個多臂老虎機模型，其中每一輪都包含多個回合，並且每回合只會實現一次獎勵。我們將後者的特性稱為獎勵一致性，並證明 RS 可以利用此特性來獲得更好的社會成果。不利的是，這麼做也會產生嫉妒，因為較晚加入的使用者可以享受較早加入的使用者所收集的資訊。我們在多種到達順序機制和幾乎任何匿名演算法（即任何演算法都以類似的方式對待所有類似的使用者，而不利用他們的身份）下檢驗產生的嫉妒。我們對均勻到達提供嚴格的嫉妒界線，並對推動到達的上限進行嫉妒界線，其中 RS 可以透過推動其使用者來影響到達順序。此外，我們透過設計一種演算法來研究效率公平權衡，該演算法允許恆定的嫉妒，並在受限設定中近似最佳福利。最後，我們使用模擬對我們的理論結果進行經驗驗證。
+
+##### **Commonsense Reasoning in Arab Culture**
+2502.12788v1 by Abdelrahman Sadallah, Junior Cedric Tonga, Khalid Almubarak, Saeed Almheiri, Farah Atif, Chatrine Qwaider, Karima Kadaoui, Sara Shatnawi, Yaser Alesh, Fajri Koto
+
+Despite progress in Arabic large language models, such as Jais and AceGPT,
+their evaluation on commonsense reasoning has largely relied on
+machine-translated datasets, which lack cultural depth and may introduce
+Anglocentric biases. Commonsense reasoning is shaped by geographical and
+cultural contexts, and existing English datasets fail to capture the diversity
+of the Arab world. To address this, we introduce \datasetname, a commonsense
+reasoning dataset in Modern Standard Arabic (MSA), covering cultures of 13
+countries across the Gulf, Levant, North Africa, and the Nile Valley. The
+dataset was built from scratch by engaging native speakers to write and
+validate culturally relevant questions for their respective countries.
+\datasetname spans 12 daily life domains with 54 fine-grained subtopics,
+reflecting various aspects of social norms, traditions, and everyday
+experiences. Zero-shot evaluations show that open-weight language models with
+up to 32B parameters struggle to comprehend diverse Arab cultures, with
+performance varying across regions. These findings highlight the need for more
+culturally aware models and datasets tailored to the Arabic-speaking world.
+
+摘要：儘管阿拉伯語大型語言模型（例如 Jais 和 AceGPT）已有進展，
+但它們在常識推理上的評估在很大程度上依賴於
+機器翻譯的資料集，這些資料集缺乏文化深度，可能會引入
+以英語為中心的偏見。常識推理受地理和
+文化背景影響，現有的英文資料集無法捕捉阿拉伯世界的多樣性。為了解決這個問題，我們引入了 \datasetname，一個現代標準阿拉伯語 (MSA) 的常識推理資料集，涵蓋海灣地區、黎凡特地區、北非和尼羅河谷 13 個國家的文化。此資料集是從頭開始建立的，由母語人士參與編寫和驗證他們各自國家的文化相關問題。\datasetname 涵蓋 12 個日常生活領域，包含 54 個細緻的主題，反映社會規範、傳統和日常經驗的各個方面。零次學習評估顯示，具有高達 32B 參數的開放式權重語言模型難以理解不同的阿拉伯文化，且各區域的表現不一。這些發現突顯了對更具文化意識的模型和專為阿拉伯語系世界量身打造的資料集的需求。
+
+##### **VidCapBench: A Comprehensive Benchmark of Video Captioning for Controllable Text-to-Video Generation**
+2502.12782v1 by Xinlong Chen, Yuanxing Zhang, Chongling Rao, Yushuo Guan, Jiaheng Liu, Fuzheng Zhang, Chengru Song, Qiang Liu, Di Zhang, Tieniu Tan
+
+The training of controllable text-to-video (T2V) models relies heavily on the
+alignment between videos and captions, yet little existing research connects
+video caption evaluation with T2V generation assessment. This paper introduces
+VidCapBench, a video caption evaluation scheme specifically designed for T2V
+generation, agnostic to any particular caption format. VidCapBench employs a
+data annotation pipeline, combining expert model labeling and human refinement,
+to associate each collected video with key information spanning video
+aesthetics, content, motion, and physical laws. VidCapBench then partitions
+these key information attributes into automatically assessable and manually
+assessable subsets, catering to both the rapid evaluation needs of agile
+development and the accuracy requirements of thorough validation. By evaluating
+numerous state-of-the-art captioning models, we demonstrate the superior
+stability and comprehensiveness of VidCapBench compared to existing video
+captioning evaluation approaches. Verification with off-the-shelf T2V models
+reveals a significant positive correlation between scores on VidCapBench and
+the T2V quality evaluation metrics, indicating that VidCapBench can provide
+valuable guidance for training T2V models. The project is available at
+https://github.com/VidCapBench/VidCapBench.
+
+摘要：可控制文本到影片 (T2V) 模型的訓練極度仰賴影片和字幕之間的對齊，但現有研究鮮少將影片字幕評估與 T2V 生成評估連結起來。本文介紹 VidCapBench，這是一種專門為 T2V 生成設計的影片字幕評估架構，與任何特定的字幕格式無關。VidCapBench 採用資料標註流程，結合專家模型標記和人工微調，將每個收集到的影片與涵蓋影片美學、內容、動作和物理定律等關鍵資訊關聯起來。VidCapBench 接著將這些關鍵資訊屬性分割成可自動評估和可手動評估的子集，以滿足敏捷開發的快速評估需求和全面驗證的準確性要求。透過評估許多最先進的字幕模型，我們證明了 VidCapBench 與現有的影片字幕評估方法相比，具有優異的穩定性和全面性。使用現成的 T2V 模型驗證顯示，VidCapBench 得分與 T2V 品質評估指標之間存在顯著的正相關，這表示 VidCapBench 可以為訓練 T2V 模型提供有價值的指導。專案可於 https://github.com/VidCapBench/VidCapBench 取得。
+
+##### **Portable Reward Tuning: Towards Reusable Fine-Tuning across Different Pretrained Models**
+2502.12776v1 by Daiki Chijiwa, Taku Hasegawa, Kyosuke Nishida, Kuniko Saito, Susumu Takeuchi
+
+While foundation models have been exploited for various expert tasks through
+fine-tuning, any foundation model will become outdated due to its old knowledge
+or limited capability. Thus the underlying foundation model should be
+eventually replaced by new ones, which leads to repeated cost of fine-tuning
+these new models. Existing work addresses this problem by inference-time
+tuning, i.e., modifying the output probabilities from the new foundation model
+with the outputs from the old foundation model and its fine-tuned model, which
+involves an additional overhead in inference by the latter two models. In this
+paper, we propose a new fine-tuning principle, Portable Reward Tuning (PRT),
+that reduces the inference overhead by its nature, based on the reformulation
+of fine-tuning as the reward maximization. Specifically, instead of fine-tuning
+parameters of the foundation models, PRT trains the reward model explicitly
+through the same loss function as in fine-tuning. During inference, the reward
+model can be used with any foundation model (with the same set of vocabularies
+or labels) through the formulation of reward maximization. Experimental
+results, covering both vision and language models, demonstrate that the
+PRT-trained model can achieve comparable accuracy to the existing work of
+inference-time tuning, with less inference cost.
+
+摘要：儘管基礎模型已透過微調用於各種專家任務，任何基礎模型都將因其舊知識或有限功能而過時。因此，基礎模型最終應由新模型取代，這導致重複微調這些新模型的成本。現有工作透過推論時間調整來解決這個問題，即使用舊基礎模型及其微調模型的輸出修改新基礎模型的輸出機率，這涉及後兩個模型在推論中的額外開銷。在本文中，我們提出一個新的微調原則，可攜式獎勵調整 (PRT)，它本質上會減少推論開銷，基於將微調重新表述為獎勵最大化。具體來說，PRT 不是微調基礎模型的參數，而是透過與微調中相同的損失函數明確訓練獎勵模型。在推論期間，獎勵模型可透過獎勵最大化的公式與任何基礎模型（具有相同的詞彙或標籤組）一起使用。涵蓋視覺和語言模型的實驗結果證明，PRT 訓練的模型可以達到與現有推論時間調整工作相當的準確度，且推論成本較低。
+
+##### **Mind the Gap: Aligning the Brain with Language Models Requires a Nonlinear and Multimodal Approach**
+2502.12771v1 by Danny Dongyeop Han, Yunju Cho, Jiook Cha, Jay-Yoon Lee
+
+Self-supervised language and audio models effectively predict brain responses
+to speech. However, traditional prediction models rely on linear mappings from
+unimodal features, despite the complex integration of auditory signals with
+linguistic and semantic information across widespread brain networks during
+speech comprehension. Here, we introduce a nonlinear, multimodal prediction
+model that combines audio and linguistic features from pre-trained models
+(e.g., LLAMA, Whisper). Our approach achieves a 17.2% and 17.9% improvement in
+prediction performance (unnormalized and normalized correlation) over
+traditional unimodal linear models, as well as a 7.7% and 14.4% improvement,
+respectively, over prior state-of-the-art models. These improvements represent
+a major step towards future robust in-silico testing and improved decoding
+performance. They also reveal how auditory and semantic information are fused
+in motor, somatosensory, and higher-level semantic regions, aligning with
+existing neurolinguistic theories. Overall, our work highlights the often
+neglected potential of nonlinear and multimodal approaches to brain modeling,
+paving the way for future studies to embrace these strategies in naturalistic
+neurolinguistics research.
+
+摘要：自我監督的語言和音訊模型有效預測大腦對語言的反應。然而，傳統的預測模型依賴於單模態特徵的線性映射，儘管在語言理解過程中，聽覺信號與語言和語義資訊在廣泛的腦網路中進行複雜的整合。在此，我們引入一個非線性、多模態預測模型，結合預先訓練模型（例如，LLAMA、Whisper）中的音訊和語言特徵。我們的做法在預測效能上（未正規化和正規化相關性）分別比傳統的單模態線性模型提升了 17.2% 和 17.9%，分別比先前的最先進模型提升了 7.7% 和 14.4%。這些改進代表了未來穩健的電腦模擬測試和改進的解碼效能邁出了一大步。它們也揭示了聽覺和語義資訊如何在運動、體感和更高層次的語義區域中融合，與現有的神經語言學理論一致。總的來說，我們的研究突出了非線性和多模態大腦建模方法經常被忽略的潛力，為未來研究在自然主義神經語言學研究中採用這些策略鋪平了道路。
+
+##### **How Much Do LLMs Hallucinate across Languages? On Multilingual Estimation of LLM Hallucination in the Wild**
+2502.12769v1 by Saad Obaid ul Islam, Anne Lauscher, Goran Glavaš
+
+In the age of misinformation, hallucination -- the tendency of Large Language
+Models (LLMs) to generate non-factual or unfaithful responses -- represents the
+main risk for their global utility. Despite LLMs becoming increasingly
+multilingual, the vast majority of research on detecting and quantifying LLM
+hallucination is (a) English-centric and (b) focused on machine translation (MT)
+and summarization, tasks that are less common "in the wild" than open
+information seeking. In contrast, we aim to quantify the extent of LLM
+hallucination across languages in knowledge-intensive long-form question
+answering. To this end, we train a multilingual hallucination detection model
+and conduct a large-scale study across 30 languages and 6 open-source LLM
+families. We start from an English hallucination detection dataset and rely on
+MT to generate (noisy) training data in other languages. We also manually
+annotate gold data for five high-resource languages; we then demonstrate, for
+these languages, that the estimates of hallucination rates are similar between
+silver (LLM-generated) and gold test sets, validating the use of silver data
+for estimating hallucination rates for other languages. For the final rates
+estimation, we build a knowledge-intensive QA dataset for 30 languages with
+LLM-generated prompts and Wikipedia articles as references. We find that, while
+LLMs generate longer responses with more hallucinated tokens for
+higher-resource languages, there is no correlation between length-normalized
+hallucination rates of languages and their digital representation. Further, we
+find that smaller LLMs exhibit larger hallucination rates than larger models.
+
+摘要：在錯誤訊息的時代，幻覺（即大型語言模型 (LLM) 產生非事實或不忠實回應的傾向）代表其全球效用的主要風險。儘管 LLM 變得越來越多語言化，但絕大多數關於偵測和量化 LLM 幻覺的研究都是 (a) 以英語為中心，(b) 專注於機器翻譯 (MT) 和摘要，這些任務在「野外」中不如開放式資訊搜尋常見。相反地，我們旨在量化 LLM 在知識密集型長篇問答中跨語言的幻覺程度。為此，我們訓練了一個多語言幻覺偵測模型，並針對 30 種語言和 6 個開放原始碼 LLM 家族進行大規模研究。我們從一個英語幻覺偵測資料集開始，並依賴 MT 在其他語言中產生（有雜訊的）訓練資料。我們還手動為五種高資源語言註解黃金資料；然後我們證明，對於這些語言，幻覺率的估計值在白銀（LLM 產生）和黃金測試集之間是相似的，驗證了使用白銀資料來估計其他語言的幻覺率。對於最終的比率估計，我們建立了一個知識密集型問答資料集，其中包含 30 種語言，並以 LLM 產生的提示和維基百科文章作為參考。我們發現，儘管 LLM 為資源較多的語言產生了更長的回應和更多幻覺的詞元，但語言的長度正規化幻覺率與其數位表示之間沒有相關性。此外，我們發現較小的 LLM 表現出比較大的模型更大的幻覺率。
+
+##### **R2-KG: General-Purpose Dual-Agent Framework for Reliable Reasoning on Knowledge Graphs**
+2502.12767v1 by Sumin Jo, Junseong Choi, Jiho Kim, Edward Choi
+
+Recent studies have combined Large Language Models (LLMs) with Knowledge
+Graphs (KGs) to enhance reasoning, improving inference accuracy without
+additional training while mitigating hallucination. However, existing
+frameworks are often rigid, struggling to adapt to KG or task changes. They
+also rely heavily on powerful LLMs for reliable (i.e., trustworthy) reasoning.
+To address this, we introduce R2-KG, a plug-and-play, dual-agent framework that
+separates reasoning into two roles: an Operator (a low-capacity LLM) that
+gathers evidence and a Supervisor (a high-capacity LLM) that makes final
+judgments. This design is cost-efficient for LLM inference while still
+maintaining strong reasoning accuracy. Additionally, R2-KG employs an
+Abstention mechanism, generating answers only when sufficient evidence is
+collected from KG, which significantly enhances reliability. Experiments across
+multiple KG-based reasoning tasks show that R2-KG consistently outperforms
+baselines in both accuracy and reliability, regardless of the inherent
+capability of LLMs used as the Operator. Further experiments reveal that the
+single-agent version of R2-KG, equipped with a strict self-consistency
+strategy, achieves significantly higher-than-baseline reliability while
+reducing inference cost. However, it also leads to a higher abstention rate in
+complex KGs. Our findings establish R2-KG as a flexible and cost-effective
+solution for KG-based reasoning. It reduces reliance on high-capacity LLMs
+while ensuring trustworthy inference.
+
+摘要：最近的研究结合了大型语言模型 (LLM) 与知识图谱 (KG) 以增强推理，在不额外训练的情况下提高推理准确性，同时减轻幻觉。然而，现有的框架通常很僵化，难以适应知识图谱或任务的变化。它们还严重依赖强大的 LLM 来进行可靠（即值得信赖）的推理。为了解决这个问题，我们引入了 R2-KG，这是一个即插即用、双代理框架，它将推理分为两个角色：一个收集证据的操作员（低容量 LLM）和一个做出最终判断的监督员（高容量 LLM）。这种设计在 LLM 推理方面具有成本效益，同时仍保持强大的推理准确性。此外，R2-KG 采用弃权机制，仅在从知识图谱收集到足够证据时才生成答案，这显著提高了可靠性。跨多个基于知识图谱的推理任务的实验表明，R2-KG 在准确性和可靠性方面始终优于基线，而与用作操作员的 LLM 的固有能力无关。进一步的实验表明，R2-KG 的单代理版本配备了严格的自一致性策略，实现了明显高于基线的可靠性，同时降低了推理成本。然而，它也导致了复杂知识图谱中更高的弃权率。我们的发现将 R2-KG 确立为一种灵活且经济高效的基于知识图谱的推理解决方案。它减少了对高容量 LLM 的依赖，同时确保了可信的推理。
 
diff --git a/docs/AI/Medical explainable AI.md b/docs/AI/Medical explainable AI.md
index b5411dac48..328ec1b1fb 100644
--- a/docs/AI/Medical explainable AI.md	
+++ b/docs/AI/Medical explainable AI.md	
@@ -2,6 +2,7 @@
 ### Medical explainable AI
 |Publish Date|Title|Authors|Homepage|Code|
 | :---: | :---: | :---: | :---: | :---: |
+|**2025-02-14**|**3D ReX: Causal Explanations in 3D Neuroimaging Classification**|Melane Navaratnarajah et.al.|[2502.12181v1](http://arxiv.org/abs/2502.12181v1)|null|
 |**2025-02-13**|**Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration**|Flavio Bertini et.al.|[2502.09218v1](http://arxiv.org/abs/2502.09218v1)|null|
 |**2025-02-10**|**Foundation Model of Electronic Medical Records for Adaptive Risk Estimation**|Pawel Renc et.al.|[2502.06124v1](http://arxiv.org/abs/2502.06124v1)|null|
 |**2025-01-27**|**An Explainable Disease Surveillance System for Early Prediction of Multiple Chronic Diseases**|Shaheer Ahmad Khan et.al.|[2501.15969v1](http://arxiv.org/abs/2501.15969v1)|null|
@@ -101,9 +102,22 @@
 |**2024-02-28**|**Cause and Effect: Can Large Language Models Truly Understand Causality?**|Swagata Ashwani et.al.|[2402.18139v3](http://arxiv.org/abs/2402.18139v3)|null|
 |**2024-02-28**|**Artificial Intelligence and Diabetes Mellitus: An Inside Look Through the Retina**|Yasin Sadeghi Bazargani et.al.|[2402.18600v1](http://arxiv.org/abs/2402.18600v1)|null|
 |**2024-02-22**|**Multi-stakeholder Perspective on Responsible Artificial Intelligence and Acceptability in Education**|A. J. Karran et.al.|[2402.15027v2](http://arxiv.org/abs/2402.15027v2)|null|
-|**2024-02-12**|**Deciphering Heartbeat Signatures: A Vision Transformer Approach to Explainable Atrial Fibrillation Detection from ECG Signals**|Aruna Mohan et.al.|[2402.09474v2](http://arxiv.org/abs/2402.09474v2)|null|
 
 #### Abstracts
+##### **3D ReX: Causal Explanations in 3D Neuroimaging Classification**
+2502.12181v1 by Melane Navaratnarajah, Sophie A. Martin, David A. Kelly, Nathan Blake, Hana Chocker
+
+Explainability remains a significant problem for AI models in medical
+imaging, making it challenging for clinicians to trust AI-driven predictions.
+We introduce 3D ReX, the first causality-based post-hoc explainability tool for
+3D models. 3D ReX uses the theory of actual causality to generate
+responsibility maps which highlight the regions most crucial to the model's
+decision. We test 3D ReX on a stroke detection model, providing insight into
+the spatial distribution of features relevant to stroke.
+
+摘要：解釋性仍然是醫療影像中 AI 模型的一大問題，這使得臨床醫生難以信任 AI 驅動的預測。
+我們引入了 3D ReX，這是第一個用於 3D 模型的基於因果關係的事後解釋性工具。3D ReX 使用實際因果關係理論來生成責任圖，該圖突出了對模型決策至關重要的區域。我們在中風檢測模型上測試了 3D ReX，提供了與中風相關特徵的空間分佈的見解。
+
 ##### **Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration**
 2502.09218v1 by Flavio Bertini, Alessandro Dal Palù, Federica Zaglio, Francesco Fabiano, Andrea Formisano
 
@@ -2669,29 +2683,3 @@ characteristics, in addition to the diverse stakeholders' perceptions.
 
 摘要：這項研究從多個利害關係人的角度探討不同的人工智慧 (AI) 應用在教育上的可接受性，包括學生、老師和家長。承認 AI 在教育上的轉型潛力，它解決了與資料隱私、AI 代理、透明度、可解釋性和 AI 的道德部署相關的疑慮。透過小插曲方法，參與者被呈現了四種情境，其中 AI 的代理、透明度、可解釋性和隱私受到操縱。在每個情境後，參與者完成了一項調查，該調查捕捉了他們對 AI 的整體效用、個人效用、正義、信心、風險和如果可用，使用每個情境的 AI 的意圖的看法。資料蒐集包含來自合作機構和社群媒體活動的 1198 位多利害關係人參與者的最終樣本，並專注於對四個 AI 使用案例的個別回應。對資料的調解分析表明，對 AI 的接受度和信任在利害關係人團體之間有顯著差異。我們發現，AI 的代理、透明度和可解釋性高低程度之間的關鍵調解者，以及使用不同教育 AI 的意圖，包括感知到的整體效用、正義和信心。這項研究強調，接受 AI 在教育上的應用是一個微妙且多面向的問題，除了不同的利害關係人的看法外，還需要仔細考慮具體的 AI 應用及其特徵。
 
-##### **Deciphering Heartbeat Signatures: A Vision Transformer Approach to Explainable Atrial Fibrillation Detection from ECG Signals**
-2402.09474v2 by Aruna Mohan, Danne Elbers, Or Zilbershot, Fatemeh Afghah, David Vorchheimer
-
-Remote patient monitoring based on wearable single-lead electrocardiogram
-(ECG) devices has significant potential for enabling the early detection of
-heart disease, especially in combination with artificial intelligence (AI)
-approaches for automated heart disease detection. There have been prior studies
-applying AI approaches based on deep learning for heart disease detection.
-However, these models are yet to be widely accepted as a reliable aid for
-clinical diagnostics, in part due to the current black-box perception
-surrounding many AI algorithms. In particular, there is a need to identify the
-key features of the ECG signal that contribute toward making an accurate
-diagnosis, thereby enhancing the interpretability of the model. In the present
-study, we develop a vision transformer approach to identify atrial fibrillation
-based on single-lead ECG data. A residual network (ResNet) approach is also
-developed for comparison with the vision transformer approach. These models are
-applied to the Chapman-Shaoxing dataset to classify atrial fibrillation, as
-well as another common arrhythmia, sinus bradycardia, and normal sinus rhythm
-heartbeats. The models enable the identification of the key regions of the
-heartbeat that determine the resulting classification, and highlight the
-importance of P-waves and T-waves, as well as heartbeat duration and signal
-amplitude, in distinguishing normal sinus rhythm from atrial fibrillation and
-sinus bradycardia.
-
-摘要：<paragraph>基於可穿戴式單導程心電圖 (ECG) 裝置的遠端病患監測在早期偵測心臟疾病方面具有顯著的潛力，特別是與用於自動化心臟疾病偵測的人工智慧 (AI) 方法結合使用時。先前已有研究應用基於深度學習的 AI 方法進行心臟疾病偵測。然而，這些模型尚未被廣泛接受為臨床診斷的可靠輔助工具，部分原因在於圍繞許多 AI 演算法的當前黑箱感知。特別是，有必要找出有助於做出準確診斷的 ECG 訊號關鍵特徵，從而增強模型的可解釋性。在本研究中，我們開發了一種視覺轉換器方法，以根據單導程 ECG 資料找出心房顫動。殘差網路 (ResNet) 方法也已開發出來，以便與視覺轉換器方法進行比較。這些模型應用於 Chapman-Shaoxing 資料集，以分類心房顫動，以及另一種常見的心律不整，竇性心動過緩，和正常竇性心律的心跳。這些模型能夠找出決定最終分類的心跳關鍵區域，並強調 P 波和 T 波，以及心跳持續時間和訊號振幅在區分正常竇性心律與心房顫動和竇性心動過緩方面的重要性。</paragraph>
-
diff --git a/docs/AI/Medical.md b/docs/AI/Medical.md
index 474ac3e2ff..f51cc334f5 100644
--- a/docs/AI/Medical.md
+++ b/docs/AI/Medical.md
@@ -2,6 +2,12 @@
 ### Medical
 |Publish Date|Title|Authors|Homepage|Code|
 | :---: | :---: | :---: | :---: | :---: |
+|**2025-02-18**|**Sleepless Nights, Sugary Days: Creating Synthetic Users with Health Conditions for Realistic Coaching Agent Interactions**|Taedong Yun et.al.|[2502.13135v1](http://arxiv.org/abs/2502.13135v1)|null|
+|**2025-02-18**|**Improving Clinical Question Answering with Multi-Task Learning: A Joint Approach for Answer Extraction and Medical Categorization**|Priyaranjan Pattnayak et.al.|[2502.13108v1](http://arxiv.org/abs/2502.13108v1)|null|
+|**2025-02-18**|**Fake It Till You Make It: Using Synthetic Data and Domain Knowledge for Improved Text-Based Learning for LGE Detection**|Athira J Jacob et.al.|[2502.12948v1](http://arxiv.org/abs/2502.12948v1)|null|
+|**2025-02-18**|**Reasoning and the Trusting Behavior of DeepSeek and GPT: An Experiment Revealing Hidden Fault Lines in Large Language Models**|Rubing Lu et.al.|[2502.12825v1](http://arxiv.org/abs/2502.12825v1)|null|
+|**2025-02-18**|**LLM Safety for Children**|Prasanjit Rath et.al.|[2502.12552v1](http://arxiv.org/abs/2502.12552v1)|null|
+|**2025-02-17**|**Classifiers of Data Sharing Statements in Clinical Trial Records**|Saber Jelodari Mamaghani et.al.|[2502.12362v1](http://arxiv.org/abs/2502.12362v1)|null|
 |**2025-02-17**|**Relational Norms for Human-AI Cooperation**|Brian D. Earp et.al.|[2502.12102v1](http://arxiv.org/abs/2502.12102v1)|null|
 |**2025-02-17**|**Deep Spatio-Temporal Neural Network for Air Quality Reanalysis**|Ammar Kheder et.al.|[2502.11941v1](http://arxiv.org/abs/2502.11941v1)|null|
 |**2025-02-17**|**Proactive Depot Discovery: A Generative Framework for Flexible Location-Routing**|Site Qu et.al.|[2502.11715v1](http://arxiv.org/abs/2502.11715v1)|null|
@@ -13,17 +19,19 @@
 |**2025-02-16**|**A Survey of LLM-based Agents in Medicine: How far are we from Baymax?**|Wenxuan Wang et.al.|[2502.11211v1](http://arxiv.org/abs/2502.11211v1)|null|
 |**2025-02-16**|**RT-DEMT: A hybrid real-time acupoint detection model combining mamba and transformer**|Shilong Yang et.al.|[2502.11179v1](http://arxiv.org/abs/2502.11179v1)|null|
 |**2025-02-16**|**Knowledge Graph-Driven Retrieval-Augmented Generation: Integrating Deepseek-R1 with Weaviate for Advanced Chatbot Applications**|Alexandru Lecu et.al.|[2502.11108v1](http://arxiv.org/abs/2502.11108v1)|null|
+|**2025-02-16**|**Predicting Depression in Screening Interviews from Interactive Multi-Theme Collaboration**|Xianbing Zhao et.al.|[2502.12204v1](http://arxiv.org/abs/2502.12204v1)|null|
 |**2025-02-16**|**CL-MFAP: A Contrastive Learning-Based Multimodal Foundation Model for Molecular Property Prediction and Antibiotic Screening**|Gen Zhou et.al.|[2502.11001v1](http://arxiv.org/abs/2502.11001v1)|null|
 |**2025-02-15**|**Automatic Quality Assessment of First Trimester Crown-Rump-Length Ultrasound Images**|Sevim Cengiz et.al.|[2502.10908v1](http://arxiv.org/abs/2502.10908v1)|null|
 |**2025-02-15**|**Breaking Down the Hierarchy: A New Approach to Leukemia Classification**|Ibraheem Hamdi et.al.|[2502.10899v1](http://arxiv.org/abs/2502.10899v1)|null|
 |**2025-02-15**|**An Empirical Analysis of Uncertainty in Large Language Model Evaluations**|Qiujie Xie et.al.|[2502.10709v1](http://arxiv.org/abs/2502.10709v1)|null|
-|**2025-02-15**|**Reading Your Heart: Learning ECG Words and Sentences via Pre-training ECG Language Model**|Jiarui Jin et.al.|[2502.10707v1](http://arxiv.org/abs/2502.10707v1)|null|
+|**2025-02-15**|**Reading Your Heart: Learning ECG Words and Sentences via Pre-training ECG Language Model**|Jiarui Jin et.al.|[2502.10707v1](http://arxiv.org/abs/2502.10707v1)|[link](https://github.com/pkudigitalhealth/heartlang)|
 |**2025-02-15**|**Self-Explaining Hypergraph Neural Networks for Diagnosis Prediction**|Leisheng Yu et.al.|[2502.10689v1](http://arxiv.org/abs/2502.10689v1)|null|
 |**2025-02-15**|**ProMRVL-CAD: Proactive Dialogue System with Multi-Round Vision-Language Interactions for Computer-Aided Diagnosis**|Xueshen Li et.al.|[2502.10620v1](http://arxiv.org/abs/2502.10620v1)|null|
 |**2025-02-15**|**Optimizing CNN Architectures for Advanced Thoracic Disease Classification**|Tejas Mirthipati et.al.|[2502.10614v1](http://arxiv.org/abs/2502.10614v1)|null|
 |**2025-02-14**|**PolyPath: Adapting a Large Multimodal Model for Multi-slide Pathology Report Generation**|Faruk Ahmed et.al.|[2502.10536v1](http://arxiv.org/abs/2502.10536v1)|null|
 |**2025-02-14**|**Tempo: Helping Data Scientists and Domain Experts Collaboratively Specify Predictive Modeling Tasks**|Venkatesh Sivaraman et.al.|[2502.10526v1](http://arxiv.org/abs/2502.10526v1)|null|
 |**2025-02-14**|**A Robust Attack: Displacement Backdoor Attack**|Yong Li et.al.|[2502.10490v1](http://arxiv.org/abs/2502.10490v1)|null|
+|**2025-02-14**|**3D ReX: Causal Explanations in 3D Neuroimaging Classification**|Melane Navaratnarajah et.al.|[2502.12181v1](http://arxiv.org/abs/2502.12181v1)|null|
 |**2025-02-14**|**Analyzing Patient Daily Movement Behavior Dynamics Using Two-Stage Encoding Model**|Jin Cui et.al.|[2502.09947v1](http://arxiv.org/abs/2502.09947v1)|null|
 |**2025-02-14**|**TransGUNet: Transformer Meets Graph-based Skip Connection for Medical Image Segmentation**|Ju-Hyeon Nam et.al.|[2502.09931v1](http://arxiv.org/abs/2502.09931v1)|null|
 |**2025-02-14**|**Video2Policy: Scaling up Manipulation Tasks in Simulation through Internet Videos**|Weirui Ye et.al.|[2502.09886v1](http://arxiv.org/abs/2502.09886v1)|null|
@@ -40,6 +48,7 @@
 |**2025-02-13**|**Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration**|Flavio Bertini et.al.|[2502.09218v1](http://arxiv.org/abs/2502.09218v1)|null|
 |**2025-02-13**|**Logical Lease Litigation: Prolog and LLMs for Rental Law Compliance in New York**|Sanskar Sehgal et.al.|[2502.09204v1](http://arxiv.org/abs/2502.09204v1)|null|
 |**2025-02-13**|**Two-Stage Representation Learning for Analyzing Movement Behavior Dynamics in People Living with Dementia**|Jin Cui et.al.|[2502.09173v1](http://arxiv.org/abs/2502.09173v1)|null|
+|**2025-02-13**|**TastepepAI, An artificial intelligence platform for taste peptide de novo design**|Jianda Yue et.al.|[2502.12167v1](http://arxiv.org/abs/2502.12167v1)|null|
 |**2025-02-12**|**HistoSmith: Single-Stage Histology Image-Label Generation via Conditional Latent Diffusion for Enhanced Cell Segmentation and Classification**|Valentina Vadori et.al.|[2502.08754v1](http://arxiv.org/abs/2502.08754v1)|[link](https://github.com/Vadori/CytoArk)|
 |**2025-02-12**|**Brain Latent Progression: Individual-based Spatiotemporal Disease Progression on 3D Brain MRIs via Latent Diffusion**|Lemuel Puglisi et.al.|[2502.08560v1](http://arxiv.org/abs/2502.08560v1)|[link](https://github.com/lemuelpuglisi/brlp)|
 |**2025-02-12**|**Representation Learning to Advance Multi-institutional Studies with Electronic Health Record Data**|Doudou Zhou et.al.|[2502.08547v1](http://arxiv.org/abs/2502.08547v1)|null|
@@ -51,7 +60,7 @@
 |**2025-02-11**|**Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension**|Wenbo Gong et.al.|[2502.07752v1](http://arxiv.org/abs/2502.07752v1)|null|
 |**2025-02-11**|**The Devil is in the Prompts: De-Identification Traces Enhance Memorization Risks in Synthetic Chest X-Ray Generation**|Raman Dutt et.al.|[2502.07516v2](http://arxiv.org/abs/2502.07516v2)|[link](https://github.com/Raman1121/diffusion_memorization)|
 |**2025-02-11**|**KPIs 2024 Challenge: Advancing Glomerular Segmentation from Patch- to Slide-Level**|Ruining Deng et.al.|[2502.07288v1](http://arxiv.org/abs/2502.07288v1)|[link](https://github.com/agaldran/kpis)|
-|**2025-02-11**|**Early Risk Prediction of Pediatric Cardiac Arrest from Electronic Health Records via Multimodal Fused Transformer**|Jiaying Lu et.al.|[2502.07158v1](http://arxiv.org/abs/2502.07158v1)|null|
+|**2025-02-11**|**Early Risk Prediction of Pediatric Cardiac Arrest from Electronic Health Records via Multimodal Fused Transformer**|Jiaying Lu et.al.|[2502.07158v2](http://arxiv.org/abs/2502.07158v2)|null|
 |**2025-02-11**|**Explaining 3D Computed Tomography Classifiers with Counterfactuals**|Joseph Paul Cohen et.al.|[2502.07156v1](http://arxiv.org/abs/2502.07156v1)|[link](https://github.com/ieee8023/ct-counterfactuals)|
 |**2025-02-10**|**Interactive Data Harmonization with LLM Agents**|Aécio Santos et.al.|[2502.07132v1](http://arxiv.org/abs/2502.07132v1)|null|
 |**2025-02-10**|**Machine Learning for Everyone: Simplifying Healthcare Analytics with BigQuery ML**|Mohammad Amir Salari et.al.|[2502.07026v1](http://arxiv.org/abs/2502.07026v1)|null|
@@ -93,17 +102,150 @@
 |**2025-02-06**|**A Retrospective Systematic Study on Hierarchical Sparse Query Transformer-assisted Ultrasound Screening for Early Hepatocellular Carcinoma**|Chaoyin She et.al.|[2502.03772v1](http://arxiv.org/abs/2502.03772v1)|[link](https://github.com/Asunatan/HSQformer)|
 |**2025-02-05**|**Towards Fair Medical AI: Adversarial Debiasing of 3D CT Foundation Embeddings**|Guangyao Zheng et.al.|[2502.04386v1](http://arxiv.org/abs/2502.04386v1)|[link](https://github.com/BioIntelligence-Lab/VAE-Adversarial-Debiasing)|
 |**2025-02-05**|**Clinically-Inspired Hierarchical Multi-Label Classification of Chest X-rays with a Penalty-Based Loss Function**|Mehrdad Asadi et.al.|[2502.03591v1](http://arxiv.org/abs/2502.03591v1)|[link](https://github.com/the-mercury/CIHMLC)|
-|**2025-02-05**|**Code Simulation as a Proxy for High-order Tasks in Large Language Models**|Emanuele La Malfa et.al.|[2502.03568v2](http://arxiv.org/abs/2502.03568v2)|null|
-|**2025-02-05**|**Limitations of Large Language Models in Clinical Problem-Solving Arising from Inflexible Reasoning**|Jonathan Kim et.al.|[2502.04381v1](http://arxiv.org/abs/2502.04381v1)|null|
-|**2025-02-05**|**Accurate AI-Driven Emergency Vehicle Location Tracking in Healthcare ITS Digital Twin**|Sarah Al-Shareeda et.al.|[2502.03396v1](http://arxiv.org/abs/2502.03396v1)|null|
-|**2025-02-05**|**RadVLM: A Multitask Conversational Vision-Language Model for Radiology**|Nicolas Deperrois et.al.|[2502.03333v1](http://arxiv.org/abs/2502.03333v1)|null|
-|**2025-02-05**|**MeDiSumQA: Patient-Oriented Question-Answer Generation from Discharge Letters**|Amin Dada et.al.|[2502.03298v1](http://arxiv.org/abs/2502.03298v1)|null|
-|**2025-02-05**|**Deep Learning Pipeline for Fully Automated Myocardial Infarct Segmentation from Clinical Cardiac MR Scans**|Matthias Schwab et.al.|[2502.03272v1](http://arxiv.org/abs/2502.03272v1)|null|
-|**2025-02-05**|**Long-tailed Medical Diagnosis with Relation-aware Representation Learning and Iterative Classifier Calibration**|Li Pan et.al.|[2502.03238v2](http://arxiv.org/abs/2502.03238v2)|[link](https://github.com/peterlipan/lmd)|
-|**2025-02-05**|**Fine-Tuning Strategies for Continual Online EEG Motor Imagery Decoding: Insights from a Large-Scale Longitudinal Study**|Martin Wimpff et.al.|[2502.06828v1](http://arxiv.org/abs/2502.06828v1)|[link](https://github.com/martinwimpff/eeg-continual)|
-|**2025-02-05**|**MedBioLM: Optimizing Medical and Biological QA with Fine-Tuned Large Language Models and Retrieval-Augmented Generation**|Seonok Kim et.al.|[2502.03004v1](http://arxiv.org/abs/2502.03004v1)|null|
 
 #### Abstracts
+##### **Sleepless Nights, Sugary Days: Creating Synthetic Users with Health Conditions for Realistic Coaching Agent Interactions**
+2502.13135v1 by Taedong Yun, Eric Yang, Mustafa Safdari, Jong Ha Lee, Vaishnavi Vinod Kumar, S. Sara Mahdavi, Jonathan Amar, Derek Peyton, Reut Aharony, Andreas Michaelides, Logan Schneider, Isaac Galatzer-Levy, Yugang Jia, John Canny, Arthur Gretton, Maja Matarić
+
+We present an end-to-end framework for generating synthetic users for
+evaluating interactive agents designed to encourage positive behavior changes,
+such as in health and lifestyle coaching. The synthetic users are grounded in
+health and lifestyle conditions, specifically sleep and diabetes management in
+this study, to ensure realistic interactions with the health coaching agent.
+Synthetic users are created in two stages: first, structured data are generated
+grounded in real-world health and lifestyle factors in addition to basic
+demographics and behavioral attributes; second, full profiles of the synthetic
+users are developed conditioned on the structured data. Interactions between
+synthetic users and the coaching agent are simulated using generative
+agent-based models such as Concordia, or directly by prompting a language
+model. Using two independently-developed agents for sleep and diabetes coaching
+as case studies, the validity of this framework is demonstrated by analyzing
+the coaching agent's understanding of the synthetic users' needs and
+challenges. Finally, through multiple blinded evaluations of user-coach
+interactions by human experts, we demonstrate that our synthetic users with
+health and behavioral attributes more accurately portray real human users with
+the same attributes, compared to generic synthetic users not grounded in such
+attributes. The proposed framework lays the foundation for efficient
+development of conversational agents through extensive, realistic, and grounded
+simulated interactions.
+
+摘要：我們提供了一個端到端的架構，用於為評估互動式代理生成合成使用者，這些代理旨在鼓勵正向行為改變，例如健康和生活方式指導。合成使用者以健康和生活方式狀況為基礎，特別是本研究中的睡眠和糖尿病管理，以確保與健康指導代理的互動具有真實性。合成使用者分兩個階段建立：首先，除了基本人口統計資料和行為屬性外，還會產生以現實世界的健康和生活方式因素為基礎的結構化資料；其次，會根據結構化資料開發合成使用者的完整個人資料。合成使用者和指導代理之間的互動是使用生成式基於代理的模型（例如 Concordia）模擬的，或者直接通過提示語言模型來模擬。使用兩個獨立開發的睡眠和糖尿病指導代理作為案例研究，通過分析指導代理對合成使用者需求和挑戰的理解，證明了此架構的有效性。最後，通過人類專家對使用者指導互動進行多重盲測評估，我們證明了與未以這些屬性為基礎的通用合成使用者相比，具有健康和行為屬性的合成使用者更準確地描繪了具有相同屬性的真實人類使用者。所提出的架構為通過廣泛、真實且有根據的模擬互動，為對話代理的有效開發奠定了基礎。
+
+##### **Improving Clinical Question Answering with Multi-Task Learning: A Joint Approach for Answer Extraction and Medical Categorization**
+2502.13108v1 by Priyaranjan Pattnayak, Hitesh Laxmichand Patel, Amit Agarwal, Bhargava Kumar, Srikant Panda, Tejaswini Kumar
+
+Clinical Question Answering (CQA) plays a crucial role in medical
+decision-making, enabling physicians to extract relevant information from
+Electronic Medical Records (EMRs). While transformer-based models such as BERT,
+BioBERT, and ClinicalBERT have demonstrated state-of-the-art performance in
+CQA, existing models lack the ability to categorize extracted answers, which is
+critical for structured retrieval, content filtering, and medical decision
+support.
+  To address this limitation, we introduce a Multi-Task Learning (MTL)
+framework that jointly trains CQA models for both answer extraction and medical
+categorization. In addition to predicting answer spans, our model classifies
+responses into five standardized medical categories: Diagnosis, Medication,
+Symptoms, Procedure, and Lab Reports. This categorization enables more
+structured and interpretable outputs, making clinical QA models more useful in
+real-world healthcare settings.
+  We evaluate our approach on emrQA, a large-scale dataset for medical question
+answering. Results show that MTL improves F1-score by 2.2% compared to standard
+fine-tuning, while achieving 90.7% accuracy in answer categorization. These
+findings suggest that MTL not only enhances CQA performance but also introduces
+an effective mechanism for categorization and structured medical information
+retrieval.
+
+摘要：臨床問答 (CQA) 在醫療決策中扮演著至關重要的角色，讓醫師能夠從電子病歷 (EMR) 中擷取相關資訊。儘管 BERT、BioBERT 和 ClinicalBERT 等基於轉換器的模型已在 CQA 中展現出最先進的效能，但現有的模型缺乏分類擷取答案的能力，這對於結構化檢索、內容過濾和醫療決策支援至關重要。
+  為了解決這個限制，我們引進了一個多任務學習 (MTL) 架構，它同時訓練 CQA 模型用於答案擷取和醫療分類。除了預測答案範圍，我們的模型將回應分類為五個標準化醫療類別：診斷、藥物、症狀、程序和實驗室報告。這種分類能產生更結構化且易於理解的輸出，讓臨床問答模型在真實世界的醫療保健環境中更實用。
+  我們在 emrQA 上評估我們的做法，emrQA 是用於醫療問題解答的大規模資料集。結果顯示，與標準微調相比，MTL 將 F1 分數提高了 2.2%，同時在答案分類中達到 90.7% 的準確度。這些發現表明，MTL 不僅增強了 CQA 的效能，還引入了一種分類和結構化醫療資訊檢索的有效機制。
+
+##### **Fake It Till You Make It: Using Synthetic Data and Domain Knowledge for Improved Text-Based Learning for LGE Detection**
+2502.12948v1 by Athira J Jacob, Puneet Sharma, Daniel Rueckert
+
+Detection of hyperenhancement from cardiac LGE MRI images is a complex task
+requiring significant clinical expertise. Although deep learning-based models
+have shown promising results for the task, they require large amounts of data
+with fine-grained annotations. Clinical reports generated for cardiac MR
+studies contain rich, clinically relevant information, including the location,
+extent and etiology of any scars present. Although recently developed
+CLIP-based training enables pretraining models with image-text pairs, it
+requires large amounts of data and further finetuning strategies on downstream
+tasks. In this study, we use various strategies rooted in domain knowledge to
+train a model for LGE detection solely using text from clinical reports, on a
+relatively small clinical cohort of 965 patients. We improve performance
+through the use of synthetic data augmentation, by systematically creating scar
+images and associated text. In addition, we standardize the orientation of the
+images in an anatomy-informed way to enable better alignment of spatial and
+text features. We also use a captioning loss to enable fine-grained supervision
+and explore the effect of pretraining of the vision encoder on performance.
+Finally, ablation studies are carried out to elucidate the contributions of
+each design component to the overall performance of the model.
+
+摘要：從心臟 LGE MRI 影像偵測出過度增強是一項複雜的任務，需要顯著的臨床專業知識。儘管基於深度學習的模型已顯示出對這項任務有前景的結果，但它們需要大量具有細緻註解的資料。為心臟 MR 研究產生的臨床報告包含豐富且臨床上相關的資訊，包括任何疤痕的位置、範圍和病因。儘管最近開發的基於 CLIP 的訓練能使用影像文字對預訓練模型，但它需要大量資料和進一步微調下游任務的策略。在這項研究中，我們使用植基於領域知識的各種策略，僅使用來自臨床報告的文字，在一個相對較小的 965 名患者臨床群體中訓練一個 LGE 偵測模型。我們透過使用合成資料擴充來改善效能，系統性地建立疤痕影像和相關文字。此外，我們以解剖學告知的方式標準化影像方向，以使空間和文字特徵能更好地對齊。我們也使用標題損失來啟用細緻的監督，並探討視覺編碼器的預訓練對效能的影響。最後，進行消融研究以闡明每個設計元件對模型整體效能的貢獻。
+
+##### **Reasoning and the Trusting Behavior of DeepSeek and GPT: An Experiment Revealing Hidden Fault Lines in Large Language Models**
+2502.12825v1 by Rubing Lu, João Sedoc, Arun Sundararajan
+
+When encountering increasingly frequent performance improvements or cost
+reductions from a new large language model (LLM), developers of applications
+leveraging LLMs must decide whether to take advantage of these improvements or
+stay with older tried-and-tested models. Low perceived switching frictions can
+lead to choices that do not consider more subtle behavior changes that the
+transition may induce. Our experiments use a popular game-theoretic behavioral
+economics model of trust to show stark differences in the trusting behavior of
+OpenAI's and DeepSeek's models. We highlight a collapse in the economic trust
+behavior of the o1-mini and o3-mini models as they reconcile profit-maximizing
+and risk-seeking with future returns from trust, and contrast it with
+DeepSeek's more sophisticated and profitable trusting behavior that stems from
+an ability to incorporate deeper concepts like forward planning and
+theory-of-mind. As LLMs form the basis for high-stakes commercial systems, our
+results highlight the perils of relying on LLM performance benchmarks that are
+too narrowly defined and suggest that careful analysis of their hidden fault
+lines should be part of any organization's AI strategy.
+
+摘要：當遇到越來越頻繁的效能提升或來自於新的大型語言模型 (LLM) 的成本降低時，利用 LLM 的應用程式開發人員必須決定是否要利用這些提升或維持較舊且經過測試的模型。低感知切換摩擦可能會導致選擇不考慮轉換可能誘發的更細微的行為改變。我們的實驗使用信任的流行博弈論行為經濟模型來顯示 OpenAI 和 DeepSeek 模型在信任行為上的顯著差異。我們強調 o1-mini 和 o3-mini 模型的經濟信任行為崩潰，因為它們調和了利潤最大化和風險尋求與來自信任的未來回報，並將其與 DeepSeek 更複雜且有利可圖的信任行為進行對比，這種信任行為源於整合更深層的概念，例如前瞻性規劃和心智理論。由於 LLM 構成高風險商業系統的基礎，我們的結果突顯了依賴定義過於狹窄的 LLM 效能基準的危險性，並建議仔細分析其隱藏的斷層線應該是任何組織的 AI 策略的一部分。
+
+##### **LLM Safety for Children**
+2502.12552v1 by Prasanjit Rath, Hari Shrawgi, Parag Agrawal, Sandipan Dandapat
+
+This paper analyzes the safety of Large Language Models (LLMs) in
+interactions with children below the age of 18. Despite the transformative
+applications of LLMs in various aspects of children's lives such as education
+and therapy, there remains a significant gap in understanding and mitigating
+potential content harms specific to this demographic. The study acknowledges
+the diverse nature of children often overlooked by standard safety evaluations
+and proposes a comprehensive approach to evaluating LLM safety specifically for
+children. We list potential risks that children may encounter when using
+LLM-powered applications. Additionally, we develop Child User Models that
+reflect the varied personalities and interests of children, informed by
+literature in child care and psychology. These user models aim to bridge the
+existing gap in child safety literature across various fields. We utilize Child
+User Models to evaluate the safety of six state-of-the-art LLMs. Our
+observations reveal significant safety gaps in LLMs, particularly in categories
+harmful to children but not adults.
+
+摘要：本文分析了大型語言模型 (LLM) 在與 18 歲以下兒童互動時的安全性。儘管 LLM 在兒童生活的各個方面（例如教育和治療）都有轉變性的應用，但在了解和減輕對這個群體具體的潛在內容危害方面仍然存在顯著差距。研究承認兒童的多樣性，而標準安全評估通常會忽略這些多樣性，並提出了一種針對兒童評估 LLM 安全性的綜合方法。我們列出了兒童在使用由 LLM 提供動力的應用程式時可能遇到的潛在風險。此外，我們開發了兒童使用者模型，這些模型反映了兒童不同的個性特質和興趣，並參考了兒童照護和心理學的文獻。這些使用者模型旨在彌合不同領域兒童安全文獻中現有的差距。我們利用兒童使用者模型來評估六個最先進的 LLM 的安全性。我們的觀察結果揭示了 LLM 中的重大安全漏洞，特別是在對兒童有害但對成年人無害的類別中。
+
+##### **Classifiers of Data Sharing Statements in Clinical Trial Records**
+2502.12362v1 by Saber Jelodari Mamaghani, Cosima Strantz, Dennis Toddenroth
+
+Digital individual participant data (IPD) from clinical trials are
+increasingly distributed for potential scientific reuse. The identification of
+available IPD, however, requires interpretations of textual data-sharing
+statements (DSS) in large databases. Recent advancements in computational
+linguistics include pre-trained language models that promise to simplify the
+implementation of effective classifiers based on textual inputs. In a subset of
+5,000 textual DSS from ClinicalTrials.gov, we evaluate how well classifiers
+based on domain-specific pre-trained language models reproduce original
+availability categories as well as manually annotated labels. Typical metrics
+indicate that classifiers that predicted manual annotations outperformed those
+that learned to output the original availability categories. This suggests that
+the textual DSS descriptions contain applicable information that the
+availability categories do not, and that such classifiers could thus aid the
+automatic identification of available IPD in large trial databases.
+
+摘要：臨床試驗的數位個人參與者資料 (IPD) 愈來愈廣泛地用於潛在的科學再利用。然而，要找出可用的 IPD，需要對大型資料庫中的文字資料共享聲明 (DSS) 進行詮釋。計算語言學最近的進展包括預先訓練的語言模型，有望簡化根據文字輸入實作有效分類器的過程。在 ClinicalTrials.gov 中的 5,000 個文字 DSS 子集中，我們評估了基於特定領域預先訓練語言模型的分類器，在重現原始可用性類別以及手動註解標籤方面的表現。典型的指標顯示，預測手動註解的分類器優於學會輸出原始可用性類別的分類器。這表示文字 DSS 說明包含可用性類別所沒有的適用資訊，而且此類分類器因此有助於在大型試驗資料庫中自動找出可用的 IPD。
+
 ##### **Relational Norms for Human-AI Cooperation**
 2502.12102v1 by Brian D. Earp, Sebastian Porsdam Mann, Mateo Aboy, Edmond Awad, Monika Betzler, Marietjie Botes, Rachel Calcott, Mina Caraccio, Nick Chater, Mark Coeckelbergh, Mihaela Constantinescu, Hossein Dabbagh, Kate Devlin, Xiaojun Ding, Vilius Dranseika, Jim A. C. Everett, Ruiping Fan, Faisal Feroz, Kathryn B. Francis, Cindy Friedman, Orsolya Friedrich, Iason Gabriel, Ivar Hannikainen, Julie Hellmann, Arasj Khodadade Jahrome, Niranjan S. Janardhanan, Paul Jurcys, Andreas Kappes, Maryam Ali Khan, Gordon Kraft-Todd, Maximilian Kroner Dale, Simon M. Laham, Benjamin Lange, Muriel Leuenberger, Jonathan Lewis, Peng Liu, David M. Lyreskog, Matthijs Maas, John McMillan, Emilian Mihailov, Timo Minssen, Joshua Teperowski Monrad, Kathryn Muyskens, Simon Myers, Sven Nyholm, Alexa M. Owen, Anna Puzio, Christopher Register, Madeline G. Reinecke, Adam Safron, Henry Shevlin, Hayate Shimizu, Peter V. Treit, Cristina Voinea, Karen Yan, Anda Zahiu, Renwen Zhang, Hazem Zohny, Walter Sinnott-Armstrong, Ilina Singh, Julian Savulescu, Margaret S. Clark
 
@@ -393,6 +535,28 @@ chatbot applications.
 
 摘要：大型語言模型 (LLM) 已大幅推動自然語言生成的領域。然而，它們經常產生未經驗證的輸出，這會損害它們在關鍵應用中的可靠性。在本研究中，我們提出了一個創新的框架，透過檢索增強生成技術，將結構化的生物醫學知識與 LLM 結合。我們的系統透過識別和精煉與年齡相關性黃斑部病變 (AMD) 相關的醫學摘要中的因果關係和命名實體，開發一個徹底的知識圖譜。我們的框架使用基於向量的檢索流程和本地部署的語言模型，產生在脈絡上相關且可驗證的回應，並直接參考臨床證據。實驗結果顯示，此方法顯著減少了幻覺、增強了事實準確性，並改善了生成回應的清晰度，為先進的生物醫學聊天機器人應用程式提供了穩健的解決方案。
 
+##### **Predicting Depression in Screening Interviews from Interactive Multi-Theme Collaboration**
+2502.12204v1 by Xianbing Zhao, Yiqing Lyu, Di Wang, Buzhou Tang
+
+Automatic depression detection provides cues for early clinical intervention
+by clinicians. Clinical interviews for depression detection involve dialogues
+centered around multiple themes. Existing studies primarily design end-to-end
+neural network models to capture the hierarchical structure of clinical
+interview dialogues. However, these methods exhibit defects in modeling the
+thematic content of clinical interviews: 1) they fail to capture intra-theme
+and inter-theme correlation explicitly, and 2) they do not allow clinicians to
+intervene and focus on themes of interest. To address these issues, this paper
+introduces an interactive depression detection framework. This framework
+leverages in-context learning techniques to identify themes in clinical
+interviews and then models both intra-theme and inter-theme correlation.
+Additionally, it employs AI-driven feedback to simulate the interests of
+clinicians, enabling interactive adjustment of theme importance. PDIMC achieves
+absolute improvements of 35\% and 12\% compared to the state-of-the-art on the
+depression detection dataset DAIC-WOZ, which demonstrates the effectiveness of
+modeling theme correlation and incorporating interactive external feedback.
+
+摘要：自動憂鬱症偵測提供臨床醫師早期臨床介入的線索。憂鬱症偵測的臨床訪談涉及以多個主題為中心的對話。現有研究主要設計端對端的類神經網路模型來捕捉臨床訪談對話的階層結構。然而，這些方法在建模臨床訪談的主題內容時表現出缺陷：1）它們無法明確捕捉主題內和主題間的關聯性，以及 2）它們不允許臨床醫師介入並專注於感興趣的主題。為了解決這些問題，本文介紹了一個互動式憂鬱症偵測框架。此框架利用情境學習技術來識別臨床訪談中的主題，然後對主題內和主題間的關聯性進行建模。此外，它採用 AI 驅動的回饋來模擬臨床醫師的興趣，實現主題重要性的互動式調整。與 DAIC-WOZ 憂鬱症偵測資料集上的最新技術相比，PDIMC 的絕對改進率分別為 35% 和 12%，這證明了對主題關聯性建模和納入互動式外部回饋的有效性。
+
 ##### **CL-MFAP: A Contrastive Learning-Based Multimodal Foundation Model for Molecular Property Prediction and Antibiotic Screening**
 2502.11001v1 by Gen Zhou, Sugitha Janarthanan, Yutong Lu, Pingzhao Hu
 
@@ -661,6 +825,20 @@ differences, such as rotation and cropping.
 
 摘要：随着人工智能在我们的生活中变得越来越普遍，人们正在享受它带来的便利，但也面临着隐藏的威胁，例如数据中毒和对抗性攻击。这些威胁可能对人工智能的应用产生灾难性后果，特别是对于一些立即生效的应用，例如自动驾驶和医疗领域。在这些威胁中，后门攻击以其隐蔽性和简单的部署给人们留下了深刻的印象，使其成为不可忽视的威胁，然而，在部署后门模型的过程中，后门攻击往往存在一些使其在实际应用中不尽如人意的原因，例如抖动和亮度变化。基于此，我们提出了一种高度鲁棒的后门攻击，该攻击对目标样本进行平移并将其与自身结合以形成后门样本，即置换后门攻击 (DBA)。实验结果表明，DBA 攻击可以抵抗模拟真实世界差异的数据增强，例如旋转和裁剪。
 
+##### **3D ReX: Causal Explanations in 3D Neuroimaging Classification**
+2502.12181v1 by Melane Navaratnarajah, Sophie A. Martin, David A. Kelly, Nathan Blake, Hana Chocker
+
+Explainability remains a significant problem for AI models in medical
+imaging, making it challenging for clinicians to trust AI-driven predictions.
+We introduce 3D ReX, the first causality-based post-hoc explainability tool for
+3D models. 3D ReX uses the theory of actual causality to generate
+responsibility maps which highlight the regions most crucial to the model's
+decision. We test 3D ReX on a stroke detection model, providing insight into
+the spatial distribution of features relevant to stroke.
+
+摘要：解釋性仍然是醫療影像中 AI 模型的一大問題，這使得臨床醫生難以信任 AI 驅動的預測。
+我們引入了 3D ReX，這是第一個用於 3D 模型的基於因果關係的事後解釋性工具。3D ReX 使用實際因果關係理論來生成責任圖，該圖突出了對模型決策至關重要的區域。我們在中風檢測模型上測試了 3D ReX，提供了與中風相關特徵的空間分佈的見解。
+
 ##### **Analyzing Patient Daily Movement Behavior Dynamics Using Two-Stage Encoding Model**
 2502.09947v1 by Jin Cui, Alexander Capstick, Payam Barnaghi, Gregory Scott
 
@@ -1046,6 +1224,32 @@ care interventions, and large-scale health monitoring.
 
 摘要：在遠程醫療監控中，時間序列表示學習揭示了高頻率數據中的關鍵患者行為模式。本研究通過提出一個兩階段、自我監督的學習方法來分析痴呆症患者的家庭活動數據，該方法專門用於發現低秩結構。第一階段將時間序列活動轉換為由預訓練語言模型編碼的文本序列，使用基於 PageRank 的方法提供了一個豐富、高維的潛在狀態空間。此 PageRank 向量捕獲潛在狀態轉換，有效地將複雜的行為數據壓縮成簡潔的形式，從而增強了解力。此低秩表示不僅增強了模型的可解釋性，還促進了聚類和轉換分析，揭示了與臨床指標（例如 MMSE 和 ADAS-COG 分數）相關的關鍵行為模式。我們的研究結果證明了該框架在支持認知狀態預測、個性化護理干預和大型健康監控方面的潛力。
 
+##### **TastepepAI, An artificial intelligence platform for taste peptide de novo design**
+2502.12167v1 by Jianda Yue, Tingting Li, Jian Ouyang, Jiawei Xu, Hua Tan, Zihui Chen, Changsheng Han, Huanyu Li, Songping Liang, Zhonghua Liu, Zhonghua Liu, Ying Wang
+
+Taste peptides have emerged as promising natural flavoring agents owing
+to their unique organoleptic properties, high safety profile, and potential
+health benefits. However, the de novo identification of taste peptides derived
+from animal, plant, or microbial sources remains a time-consuming and
+resource-intensive process, significantly impeding their widespread application
+in the food industry. Here, we present TastePepAI, a comprehensive artificial
+intelligence framework for customized taste peptide design and safety
+assessment. As the key element of this framework, a loss-supervised adaptive
+variational autoencoder (LA-VAE) is implemented to efficiently optimize the
+latent representation of sequences during training and facilitate the
+generation of target peptides with desired taste profiles. Notably, our model
+incorporates a novel taste-avoidance mechanism, allowing for selective flavor
+exclusion. Subsequently, our in-house developed toxicity prediction algorithm
+(SpepToxPred) is integrated into the framework to perform rigorous safety
+evaluation of generated peptides. Using this integrated platform, we
+successfully identified 73 peptides exhibiting sweet, salty, and umami tastes,
+significantly expanding the current repertoire of taste peptides. This work
+demonstrates the potential of TastePepAI in accelerating taste peptide
+discovery for food applications and provides a versatile framework adaptable to
+broader peptide engineering challenges.
+
+摘要：味觉肽因其独特的感官特性、高安全性概况和潜在的健康益处而成为有前途的天然调味剂。然而，从动物、植物或微生物来源中从头鉴定味觉肽仍然是一个耗时且资源密集的过程，严重阻碍了它们在食品工业中的广泛应用。在此，我们提出了 TastePepAI，这是一个用于定制味觉肽设计和安全性评估的综合人工智能框架。作为该框架的关键元素，实现了损失监督自适应变分自动编码器 (LA-VAE)，以在训练期间有效优化序列的潜在表示，并促进生成具有所需味觉特征的目标肽。值得注意的是，我们的模型包含了一种新颖的味觉回避机制，允许选择性排除风味。随后，我们内部开发的毒性预测算法 (SpepToxPred) 被集成到框架中，以对生成的肽进行严格的安全评估。使用这个集成平台，我们成功地鉴定了 73 种表现出甜味、咸味和鲜味的肽，极大地扩展了当前的味觉肽库。这项工作展示了 TastePepAI 在加速味觉肽发现以用于食品应用方面的潜力，并提供了一个适用于更广泛的肽工程挑战的多功能框架。
+
 ##### **HistoSmith: Single-Stage Histology Image-Label Generation via Conditional Latent Diffusion for Enhanced Cell Segmentation and Classification**
 2502.08754v1 by Valentina Vadori, Jean-Marie Graïc, Antonella Peruffo, Livio Finos, Ujwala Kiran Chaudhari, Enrico Grisan
 
@@ -1342,7 +1546,7 @@ CKD 臨床前嚙齒動物模型的資料集，這些註解來自 60 多個
 疾病研究和診斷量化。
 
 ##### **Early Risk Prediction of Pediatric Cardiac Arrest from Electronic Health Records via Multimodal Fused Transformer**
-2502.07158v1 by Jiaying Lu, Stephanie R. Brown, Songyuan Liu, Shifan Zhao, Kejun Dong, Del Bold, Michael Fundora, Alaa Aljiffry, Alex Fedorov, Jocelyn Grunwell, Xiao Hu
+2502.07158v2 by Jiaying Lu, Stephanie R. Brown, Songyuan Liu, Shifan Zhao, Kejun Dong, Del Bold, Michael Fundora, Alaa Aljiffry, Alex Fedorov, Jocelyn Grunwell, Xiao Hu
 
 Early prediction of pediatric cardiac arrest (CA) is critical for timely
 intervention in high-risk intensive care settings. We introduce PedCA-FT, a
@@ -1357,7 +1561,7 @@ and identifies clinically meaningful risk factors. These findings underscore
 the potential of multimodal fusion techniques to enhance early CA detection and
 improve patient care.
 
-摘要：早期預測兒童心臟驟停 (CA) 對高風險重症監護環境中的及時干預至關重要。我們引入了 PedCA-FT，這是一個新的基於Transformer的框架，它將 EHR 的表格視圖與 EHR 的派生文本視圖融合在一起，以充分釋放高維風險因素及其動態的交互作用。通過為每個模態視圖採用專用的Transformer模塊，PedCA-FT 捕獲復雜的時間和上下文模式以產生穩健的 CA 風險估計。在 CHOA-CICU 資料庫中經過策劃的兒科隊列上進行評估，我們的做法在五個關鍵性能指標上優於其他十個人工智慧模型，並識別出臨床上有意義的風險因素。這些發現強調了多模態融合技術在增強早期 CA 檢測和改善患者護理方面的潛力。
+摘要：早期預測小兒心臟驟停 (CA) 對於在高風險的重症照護環境中及時介入至關重要。我們引入了 PedCA-FT，一個新穎的基於轉換器的框架，它將 EHR 的表格視圖與 EHR 的派生文本視圖融合在一起，以充分發揮高維風險因素及其動態的交互作用。通過為每個模態視圖採用專用的轉換器模組，PedCA-FT 捕獲複雜的時間和上下文模式，以產生穩健的 CA 風險估計。在 CHOA-CICU 資料庫中策劃的小兒群體中進行評估，我們的做法在五項關鍵績效指標中優於其他十種人工智慧模型，並找出臨床上有意義的風險因素。這些發現強調了多模式融合技術在增強早期 CA 檢測和改善患者照護方面的潛力。
 
 ##### **Explaining 3D Computed Tomography Classifiers with Counterfactuals**
 2502.07156v1 by Joseph Paul Cohen, Louis Blankemeier, Akshay Chaudhari
@@ -2398,228 +2602,3 @@ experiment details are made available.
 
 摘要：在本文中，我們提出胸部 X 光（CXR）影像多標籤分類的新方法，在維持簡化的單一模型、單次執行訓練管線的同時，提升臨床可解釋性。利用 CheXpert 資料集和 VisualCheXbert 衍生的標籤，我們納入階層標籤群組，以擷取診斷之間具有臨床意義的關聯性。為此，我們設計了自訂的階層二元交叉熵 (HBCE) 損失函數，使用固定或資料驅動的懲罰類型來強制執行標籤依賴性。我們的模型在測試集上達到受試者工作特性曲線 (AUROC) 下的平均面積為 0.903。此外，我們提供視覺化說明和不確定性估計，以進一步提升模型可解釋性。所有程式碼、模型組態和實驗詳細資料皆已公開。
 
-##### **Code Simulation as a Proxy for High-order Tasks in Large Language Models**
-2502.03568v2 by Emanuele La Malfa, Christoph Weinhuber, Orazio Torre, Fangru Lin, X. Angelo Huang, Samuele Marro, Anthony Cohn, Nigel Shadbolt, Michael Wooldridge
-
-Many reasoning, planning, and problem-solving tasks share an intrinsic
-algorithmic nature: correctly simulating each step is a sufficient condition to
-solve them correctly. We collect pairs of naturalistic and synthetic reasoning
-tasks to assess the capabilities of Large Language Models (LLM). While
-naturalistic tasks often require careful human handcrafting, we show that
-synthetic data is, in many cases, a good proxy that is much easier to collect
-at scale. We leverage common constructs in programming as the counterpart of
-the building blocks of naturalistic reasoning tasks, such as straight-line
-programs, code that contains critical paths, and approximate and redundant
-instructions. We further assess the capabilities of LLMs on sorting problems
-and repeated operations via sorting algorithms and nested loops. Our synthetic
-datasets further reveal that while the most powerful LLMs exhibit relatively
-strong execution capabilities, the process is fragile: it is negatively
-affected by memorisation and seems to rely heavily on pattern recognition. Our
-contribution builds upon synthetically testing the reasoning capabilities of
-LLMs as a scalable complement to handcrafted human-annotated problems.
-
-摘要：許多推理、規劃和問題解決任務都具有內在的演算法性質：正確模擬每一步是正確解決它們的充分條件。我們收集自然主義和合成推理任務對，以評估大型語言模型 (LLM) 的能力。雖然自然主義任務通常需要仔細的人工製作，但我們表明在許多情況下，合成數據是一個很好的代理，而且更容易大規模收集。我們利用程式設計中的常見結構作為自然主義推理任務建構區塊的對應物，例如直線程式、包含關鍵路徑的程式碼，以及近似和冗餘指令。我們進一步評估 LLM 在排序問題和重複操作方面的能力，透過排序演算法和巢狀迴圈。我們的合成資料集進一步揭示，雖然最強大的 LLM 表現出相對強大的執行能力，但這個過程很脆弱：它受到記憶的負面影響，而且似乎很依賴模式辨識。我們的貢獻建立在合成測試 LLM 的推理能力之上，作為手工製作的人工標記問題的可擴充補充。
-
-##### **Limitations of Large Language Models in Clinical Problem-Solving Arising from Inflexible Reasoning**
-2502.04381v1 by Jonathan Kim, Anna Podlasek, Kie Shidara, Feng Liu, Ahmed Alaa, Danilo Bernardo
-
-Large Language Models (LLMs) have attained human-level accuracy on medical
-question-answer (QA) benchmarks. However, their limitations in navigating
-open-ended clinical scenarios have recently been shown, raising concerns about
-the robustness and generalizability of LLM reasoning across diverse, real-world
-medical tasks. To probe potential LLM failure modes in clinical
-problem-solving, we present the medical abstraction and reasoning corpus
-(M-ARC). M-ARC assesses clinical reasoning through scenarios designed to
-exploit the Einstellung effect -- the fixation of thought arising from prior
-experience, targeting LLM inductive biases toward inflexible pattern matching
-from their training data rather than engaging in flexible reasoning. We find
-that LLMs, including current state-of-the-art o1 and Gemini models, perform
-poorly compared to physicians on M-ARC, often demonstrating lack of commonsense
-medical reasoning and a propensity to hallucinate. In addition, uncertainty
-estimation analyses indicate that LLMs exhibit overconfidence in their answers,
-despite their limited accuracy. The failure modes revealed by M-ARC in LLM
-medical reasoning underscore the need to exercise caution when deploying these
-models in clinical settings.
-
-摘要：大型語言模型 (LLM) 已在醫療問題解答 (QA) 基準上達到人類層級的準確度。然而，它們在應對開放式臨床場景中的局限性最近已被揭示，引發了人們對 LLM 推理在多樣化、真實世界醫療任務中的穩健性和概括性的擔憂。為了探討臨床問題解決中 LLM 的潛在故障模式，我們提出了醫療抽象和推理語料庫 (M-ARC)。M-ARC 通過旨在利用定勢效應 (Einstellung effect，即由先前經驗產生的思維固著) 的情境來評估臨床推理，這些情境針對 LLM 傾向於從訓練數據中進行僵化模式匹配、而不是進行靈活推理的歸納偏誤。我們發現，包括當前最先進的 o1 和 Gemini 模型在內的 LLM，在 M-ARC 上的表現遠不如醫生，它們經常表現出缺乏常識性的醫療推理和產生幻覺的傾向。此外，不確定性估計分析表明，儘管 LLM 準確性有限，但它們對自己的答案表現出過度自信。M-ARC 揭示的 LLM 醫療推理故障模式強調了在臨床環境中部署這些模型時需要謹慎。
-
-##### **Accurate AI-Driven Emergency Vehicle Location Tracking in Healthcare ITS Digital Twin**
-2502.03396v1 by Sarah Al-Shareeda, Yasar Celik, Bilge Bilgili, Ahmed Al-Dubai, Berk Canberk
-
-Creating a Digital Twin (DT) for Healthcare Intelligent Transportation
-Systems (HITS) is a hot research trend focusing on enhancing HITS management,
-particularly in emergencies where ambulance vehicles must arrive at the crash
-scene on time and tracking their real-time location is crucial to the medical
-authorities. Despite the claim of real-time representation, a temporal
-misalignment persists between the physical and virtual domains, leading to
-discrepancies in the ambulance's location representation. This study proposes
-integrating AI predictive models, specifically Support Vector Regression (SVR)
-and Deep Neural Networks (DNN), within a constructed mock DT data pipeline
-framework to anticipate the medical vehicle's next location in the virtual
-world. These models align virtual representations with their physical
-counterparts, i.e., metaphorically offsetting the synchronization delay between
-the two worlds. Trained meticulously on a historical geospatial dataset, SVR
-and DNN exhibit exceptional prediction accuracy in MATLAB and Python
-environments. Through various testing scenarios, we visually demonstrate the
-efficacy of our methodology, showcasing SVR and DNN's key role in significantly
-reducing the witnessed gap within the HITS's DT. This transformative approach
-enhances real-time synchronization in emergency HITS by approximately 88% to
-93%.
-
-摘要：建立醫療智慧交通系統（HITS）的數位分身（DT）是熱門的研究趨勢，其重點在於提升 HITS 管理，特別是在救護車必須準時抵達車禍現場的緊急情況中，追蹤其即時位置對於醫療單位至關重要。儘管聲稱即時呈現，但實體和虛擬領域之間仍存在時間上的錯位，導致救護車位置呈現上的差異。本研究建議在建構的虛擬 DT 資料管道架構中整合人工智慧預測模型，特別是支援向量回歸（SVR）和深度神經網路（DNN），以預測醫療車輛在虛擬世界的下一個位置。這些模型將虛擬呈現與其實體對應物對齊，也就是說，在兩個世界之間比喻性地抵銷同步延遲。在歷史地理空間資料集上經過仔細訓練，SVR 和 DNN 在 MATLAB 和 Python 環境中展現出卓越的預測準確性。透過各種測試情境，我們視覺化展示了我們方法論的效能，展示了 SVR 和 DNN 在顯著縮小 HITS 的 DT 中見證到的差距方面的關鍵作用。這種變革性的方法將緊急 HITS 中的即時同步提升了大約 88% 到 93%。
-
-##### **RadVLM: A Multitask Conversational Vision-Language Model for Radiology**
-2502.03333v1 by Nicolas Deperrois, Hidetoshi Matsuo, Samuel Ruipérez-Campillo, Moritz Vandenhirtz, Sonia Laguna, Alain Ryser, Koji Fujimoto, Mizuho Nishio, Thomas M. Sutter, Julia E. Vogt, Jonas Kluckert, Thomas Frauenfelder, Christian Blüthgen, Farhad Nooralahzadeh, Michael Krauthammer
-
-The widespread use of chest X-rays (CXRs), coupled with a shortage of
-radiologists, has driven growing interest in automated CXR analysis and
-AI-assisted reporting. While existing vision-language models (VLMs) show
-promise in specific tasks such as report generation or abnormality detection,
-they often lack support for interactive diagnostic capabilities. In this work
-we present RadVLM, a compact, multitask conversational foundation model
-designed for CXR interpretation. To this end, we curate a large-scale
-instruction dataset comprising over 1 million image-instruction pairs
-containing both single-turn tasks -- such as report generation, abnormality
-classification, and visual grounding -- and multi-turn, multi-task
-conversational interactions. After fine-tuning RadVLM on this instruction
-dataset, we evaluate it across different tasks along with re-implemented
-baseline VLMs. Our results show that RadVLM achieves state-of-the-art
-performance in conversational capabilities and visual grounding while remaining
-competitive in other radiology tasks. Ablation studies further highlight the
-benefit of joint training across multiple tasks, particularly for scenarios
-with limited annotated data. Together, these findings highlight the potential
-of RadVLM as a clinically relevant AI assistant, providing structured CXR
-interpretation and conversational capabilities to support more effective and
-accessible diagnostic workflows.
-
-摘要：胸部 X 光 (CXR) 的廣泛使用，加上放射科醫師短缺，促使人們對自動化 CXR 分析和 AI 輔助報告產生越來越濃厚的興趣。雖然現有的視覺語言模型 (VLM) 在特定任務中顯示出前景，例如報告生成或異常偵測，但它們通常缺乏對互動式診斷功能的支持。在這項工作中，我們提出 RadVLM，這是一個緊湊的多任務對話式基礎模型，專為 CXR 解釋而設計。為此，我們策劃了一個大型指令資料集，包含超過 100 萬個影像指令對，其中包含單輪任務（例如報告生成、異常分類和視覺基礎），以及多輪、多任務對話互動。在對這個指令資料集進行微調後，我們對 RadVLM 進行評估，並與重新實作的基準 VLM 一起執行不同的任務。我們的結果顯示，RadVLM 在對話能力和視覺基礎方面取得了最先進的效能，同時在其他放射學任務中仍具有競爭力。消融研究進一步突顯了跨多個任務進行聯合訓練的好處，特別是對於標註資料有限的場景。這些發現共同突顯了 RadVLM 作為臨床相關 AI 助理的潛力，提供結構化的 CXR 解釋和對話能力，以支援更有效且可存取的診斷工作流程。
-
-##### **MeDiSumQA: Patient-Oriented Question-Answer Generation from Discharge Letters**
-2502.03298v1 by Amin Dada, Osman Alperen Koras, Marie Bauer, Amanda Butler, Kaleb E. Smith, Jens Kleesiek, Julian Friedrich
-
-While increasing patients' access to medical documents improves medical care,
-this benefit is limited by varying health literacy levels and complex medical
-terminology. Large language models (LLMs) offer solutions by simplifying
-medical information. However, evaluating LLMs for safe and patient-friendly
-text generation is difficult due to the lack of standardized evaluation
-resources. To fill this gap, we developed MeDiSumQA. MeDiSumQA is a dataset
-created from MIMIC-IV discharge summaries through an automated pipeline
-combining LLM-based question-answer generation with manual quality checks. We
-use this dataset to evaluate various LLMs on patient-oriented
-question-answering. Our findings reveal that general-purpose LLMs frequently
-surpass biomedical-adapted models, while automated metrics correlate with human
-judgment. By releasing MeDiSumQA on PhysioNet, we aim to advance the
-development of LLMs to enhance patient understanding and ultimately improve
-care outcomes.
-
-摘要：儘管讓患者更能取得醫療文件有助於改善醫療照護，
-但此優點受到不同的健康素養程度和複雜的醫療術語所限制。大型語言模型 (LLM) 提供了簡化醫療資訊的解決方案。然而，由於缺乏標準化的評估資源，因此難以評估 LLM 以確保其安全且對患者友善的文字產生。為了填補此缺口，我們開發了 MeDiSumQA。MeDiSumQA 是透過自動化流程從 MIMIC-IV 出院摘要中建立的資料集，結合了基於 LLM 的問答產生和手動品質檢查。我們使用此資料集來評估各種 LLM 在以患者為導向的問答中。我們的發現顯示，通用 LLM 經常超越生物醫學適應模型，而自動化指標與人類判斷相關。透過在 PhysioNet 上發布 MeDiSumQA，我們旨在推動 LLM 的發展，以增進患者理解，並最終改善照護成果。
-
-##### **Deep Learning Pipeline for Fully Automated Myocardial Infarct Segmentation from Clinical Cardiac MR Scans**
-2502.03272v1 by Matthias Schwab, Mathias Pamminger, Christian Kremser, Agnes Mayr
-
-Purpose: To develop and evaluate a deep learning-based method that performs
-myocardial infarct segmentation in a fully automated way.
-  Materials and Methods: For this retrospective study, a cascaded framework of
-two and three-dimensional convolutional neural networks (CNNs), specialized on
-identifying ischemic myocardial scars on late gadolinium enhancement (LGE)
-cardiac magnetic resonance (CMR) images, was trained on an in-house training
-dataset consisting of 144 examinations. On a separate test dataset from the
-same institution, including images from 152 examinations obtained between 2021
-and 2023, a quantitative comparison between artificial intelligence (AI)-based
-segmentations and manual segmentations was performed. Further, qualitative
-assessment of segmentation accuracy was evaluated for both human and
-AI-generated contours by two CMR experts in a blinded experiment.
-  Results: Excellent agreement could be found between manually and
-automatically calculated infarct volumes ($\rho_c$ = 0.9). The qualitative
-evaluation showed that compared to human-based measurements, the experts rated
-the AI-based segmentations to better represent the actual extent of infarction
-significantly (p < 0.001) more often (33.4% AI, 25.1% human, 41.5% equal). On
-the contrary, for segmentation of microvascular obstruction (MVO), manual
-measurements were still preferred (11.3% AI, 55.6% human, 33.1% equal).
-  Conclusion: This fully-automated segmentation pipeline enables CMR infarct
-size to be calculated in a very short time and without requiring any
-pre-processing of the input images while matching the segmentation quality of
-trained human observers. In a blinded experiment, experts preferred automated
-infarct segmentations more often than manual segmentations, paving the way for
-a potential clinical application.
-
-摘要：目的：開發和評估一種基於深度學習的方法，允許以全自動的方式執行心肌梗塞分割。
-材料和方法：對於這項回顧性研究，一個由二維和三維卷積神經網路 (CNN) 組成的串聯架構，專門用於識別晚期釓增強 (LGE) 心臟磁振造影 (CMR) 影像上的缺血性心肌疤痕，並在包含 144 項檢查的內部訓練資料集上受訓。在來自同一家機構的獨立測試資料集上，包括 2021 年至 2023 年間獲得的 152 項檢查的影像，執行基於人工智慧 (AI) 的分割和手動分割之間的定量比較。此外，由兩位 CMR 專家在盲測實驗中評估人類和 AI 生成的輪廓的分割準確度。
-結果：在手動和自動計算的梗塞體積之間可以發現極佳的一致性（ρ_c = 0.9）。定性評估顯示，與基於人類的測量相比，專家評估 AI 基於分割能更能代表梗塞的實際範圍，顯著（p < 0.001）更常發生（33.4% AI，25.1% 人類，41.5% 相等）。相反，對於微血管阻塞 (MVO) 的分割，手動測量仍然較受青睞（11.3% AI，55.6% 人類，33.1% 相等）。
-結論：這個全自動分割管道可以在很短的時間內計算 CMR 梗塞大小，而且無需對輸入影像進行任何前處理，同時匹配受過訓練的人類觀察者的分割品質。在盲測實驗中，專家比手動分割更常偏好自動梗塞分割，為潛在的臨床應用鋪平了道路。
-
-##### **Long-tailed Medical Diagnosis with Relation-aware Representation Learning and Iterative Classifier Calibration**
-2502.03238v2 by Li Pan, Yupei Zhang, Qiushi Yang, Tan Li, Zhen Chen
-
-Recently computer-aided diagnosis has demonstrated promising performance,
-effectively alleviating the workload of clinicians. However, the inherent
-sample imbalance among different diseases biases algorithms toward the
-majority categories, leading to poor performance for rare categories. Existing
-works formulated this challenge as a long-tailed problem and attempted to
-tackle it by decoupling the feature representation and classification. Yet, due
-to the imbalanced distribution and limited samples from tail classes, these
-works are prone to biased representation learning and insufficient classifier
-calibration. To tackle these problems, we propose a new Long-tailed Medical
-Diagnosis (LMD) framework for balanced medical image classification on
-long-tailed datasets. In the initial stage, we develop a Relation-aware
-Representation Learning (RRL) scheme to boost the representation ability by
-encouraging the encoder to capture intrinsic semantic features through
-different data augmentations. In the subsequent stage, we propose an Iterative
-Classifier Calibration (ICC) scheme to calibrate the classifier iteratively.
-This is achieved by generating a large number of balanced virtual features and
-fine-tuning the encoder in an Expectation-Maximization manner. The proposed
-ICC compensates for minority categories to facilitate unbiased classifier
-optimization while maintaining the diagnostic knowledge in majority classes.
-Comprehensive experiments on three public long-tailed medical datasets
-demonstrate that our LMD framework significantly surpasses state-of-the-art
-approaches. The source code can be accessed at
-https://github.com/peterlipan/LMD.
-
-摘要：最近，计算机辅助诊断已展现出可观的表现，有效减轻了临床医生的工作量。然而，不同疾病之间固有的样本不平衡导致算法偏向于多数类别，从而导致罕见类别表现不佳。现有工作将这一挑战表述为长尾问题，并尝试通过解耦特征表示和分类来解决它。然而，由于不平衡分布和尾类样本有限，这些工作容易出现有偏差的表示学习和分类器校准不足。为了解决这些问题，我们提出了一个新的长尾医学诊断 (LMD) 框架，用于对长尾数据集进行平衡的医学图像分类。在初始阶段，我们开发了一个关系感知表示学习 (RRL) 方案，通过鼓励编码器通过不同的数据增强来捕获内在语义特征，从而提升表示能力。在后续阶段，我们提出了一个迭代分类器校准 (ICC) 方案，以迭代方式校准分类器。这是通过生成大量的平衡虚拟特征并使用期望最大化方式微调编码器来实现的。所提出的 ICC 补偿了少数类别，以促进无偏分类器优化，同时保持多数类别的诊断知识。在三个公共长尾医学数据集上进行的综合实验表明，我们的 LMD 框架明显超越了最先进的方法。源代码可在 https://github.com/peterlipan/LMD 处获取。
-
-##### **Fine-Tuning Strategies for Continual Online EEG Motor Imagery Decoding: Insights from a Large-Scale Longitudinal Study**
-2502.06828v1 by Martin Wimpff, Bruno Aristimunha, Sylvain Chevallier, Bin Yang
-
-This study investigates continual fine-tuning strategies for deep learning in
-online longitudinal electroencephalography (EEG) motor imagery (MI) decoding
-within a causal setting involving a large user group and multiple sessions per
-participant. We are the first to explore such strategies across a large user
-group, as longitudinal adaptation is typically studied in the single-subject
-setting with a single adaptation strategy, which limits the ability to
-generalize findings. First, we examine the impact of different fine-tuning
-approaches on decoder performance and stability. Building on this, we integrate
-online test-time adaptation (OTTA) to adapt the model during deployment,
-complementing the effects of prior fine-tuning. Our findings demonstrate that
-fine-tuning that successively builds on prior subject-specific information
-improves both performance and stability, while OTTA effectively adapts the
-model to evolving data distributions across consecutive sessions, enabling
-calibration-free operation. These results offer valuable insights and
-recommendations for future research in longitudinal online MI decoding and
-highlight the importance of combining domain adaptation strategies for
-improving BCI performance in real-world applications. Clinical Relevance: Our
-investigation enables more stable and efficient long-term motor imagery
-decoding, which is critical for neurorehabilitation and assistive technologies.
-
-摘要：本研究探討在因果關係設定中涉及大量使用者群組和每個參與者多個階段的線上縱向腦電圖 (EEG) 運動想像 (MI) 解碼中，深度學習的持續微調策略。我們是第一個在大量使用者群組中探討此類策略，因為縱向適應通常在單一主體設定中研究，並使用單一適應策略，這限制了推廣研究結果的能力。首先，我們探討不同微調方法對解碼器效能和穩定性的影響。在此基礎上，我們整合線上測試時間適應 (OTTA) 以在部署期間適應模型，補充先前微調的效果。我們的研究結果表明，連續建立在先前特定主體資訊上的微調可以同時改善效能和穩定性，而 OTTA 可以有效地適應連續階段中不斷變化的資料分佈，從而實現無需校準的操作。這些結果為縱向線上 MI 解碼的未來研究提供了有價值的見解和建議，並強調了結合領域適應策略以改善實際應用中 BCI 效能的重要性。臨床相關性：我們的研究可以實現更穩定、更有效的長期運動想像解碼，這對於神經復健和輔助技術至關重要。
-
-##### **MedBioLM: Optimizing Medical and Biological QA with Fine-Tuned Large Language Models and Retrieval-Augmented Generation**
-2502.03004v1 by Seonok Kim
-
-Large Language Models (LLMs) have demonstrated impressive capabilities across
-natural language processing tasks. However, their application to specialized
-domains such as medicine and biology requires further optimization to ensure
-factual accuracy, reliability, and contextual depth. We introduce MedBioLM, a
-domain-adapted biomedical question-answering model designed to enhance both
-short-form and long-form queries. By integrating fine-tuning and
-retrieval-augmented generation (RAG), MedBioLM dynamically incorporates
-domain-specific knowledge, improving reasoning abilities and factual accuracy.
-To evaluate its effectiveness, we fine-tuned the model on diverse biomedical QA
-datasets, covering structured multiple-choice assessments and complex clinical
-reasoning tasks. Fine-tuning significantly improves accuracy on benchmark
-datasets, while RAG enhances factual consistency. These results highlight the
-potential of domain-optimized LLMs in advancing biomedical research, medical
-education, and clinical decision support.
-
-摘要：大型語言模型 (LLM) 已展現出在自然語言處理任務中令人印象深刻的能力。然而，要將其應用於醫學和生物學等特定領域，需要進一步最佳化，以確保事實的準確性、可靠性以及脈絡的深度。我們引進了 MedBioLM，這是一個適應領域的生物醫學問答模型，旨在增強短式和長式查詢。透過整合微調和檢索增強生成 (RAG)，MedBioLM 能動態地納入領域特定的知識，從而提升推理能力和事實準確性。為了評估其有效性，我們對模型進行微調，使其涵蓋結構化的多重選擇評量和複雜的臨床推理任務等多樣化的生物醫學問答資料集。微調顯著提升了基準資料集的準確性，而 RAG 則增強了事實的一致性。這些結果突顯了領域最佳化的 LLM 在推進生物醫學研究、醫學教育和臨床決策支援方面的潛力。
-
diff --git a/docs/index.md b/docs/index.md
index 51680a83a7..626af2d452 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -1,5 +1,5 @@
 # arxiv-daily
- Automated deployment @ 2025-02-19 09:05:53 Asia/Taipei
+ Automated deployment @ 2025-02-19 20:34:11 Asia/Taipei
 > Welcome to contribute! Add your topics and keywords in [`topic.yml`](https://github.com/jawatech/arxiv-daily-in-place/blob/main/database/topic.yml).
 > You can also view historical data through the [storage](https://github.com/jawatech/arxiv-daily-in-place/blob/main/database/storage).
 
@@ -8,6 +8,7 @@
 ### Medical explainable AI
 |Publish Date|Title|Authors|Homepage|Code|
 | :---: | :---: | :---: | :---: | :---: |
+|**2025-02-14**|**3D ReX: Causal Explanations in 3D Neuroimaging Classification**|Melane Navaratnarajah et.al.|[2502.12181v1](http://arxiv.org/abs/2502.12181v1)|null|
 |**2025-02-13**|**Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration**|Flavio Bertini et.al.|[2502.09218v1](http://arxiv.org/abs/2502.09218v1)|null|
 |**2025-02-10**|**Foundation Model of Electronic Medical Records for Adaptive Risk Estimation**|Pawel Renc et.al.|[2502.06124v1](http://arxiv.org/abs/2502.06124v1)|null|
 |**2025-01-27**|**An Explainable Disease Surveillance System for Early Prediction of Multiple Chronic Diseases**|Shaheer Ahmad Khan et.al.|[2501.15969v1](http://arxiv.org/abs/2501.15969v1)|null|
@@ -107,9 +108,22 @@
 |**2024-02-28**|**Cause and Effect: Can Large Language Models Truly Understand Causality?**|Swagata Ashwani et.al.|[2402.18139v3](http://arxiv.org/abs/2402.18139v3)|null|
 |**2024-02-28**|**Artificial Intelligence and Diabetes Mellitus: An Inside Look Through the Retina**|Yasin Sadeghi Bazargani et.al.|[2402.18600v1](http://arxiv.org/abs/2402.18600v1)|null|
 |**2024-02-22**|**Multi-stakeholder Perspective on Responsible Artificial Intelligence and Acceptability in Education**|A. J. Karran et.al.|[2402.15027v2](http://arxiv.org/abs/2402.15027v2)|null|
-|**2024-02-12**|**Deciphering Heartbeat Signatures: A Vision Transformer Approach to Explainable Atrial Fibrillation Detection from ECG Signals**|Aruna Mohan et.al.|[2402.09474v2](http://arxiv.org/abs/2402.09474v2)|null|
 
 #### Abstracts
+##### **3D ReX: Causal Explanations in 3D Neuroimaging Classification**
+2502.12181v1 by Melane Navaratnarajah, Sophie A. Martin, David A. Kelly, Nathan Blake, Hana Chocker
+
+Explainability remains a significant problem for AI models in medical
+imaging, making it challenging for clinicians to trust AI-driven predictions.
+We introduce 3D ReX, the first causality-based post-hoc explainability tool for
+3D models. 3D ReX uses the theory of actual causality to generate
+responsibility maps which highlight the regions most crucial to the model's
+decision. We test 3D ReX on a stroke detection model, providing insight into
+the spatial distribution of features relevant to stroke.
+
+摘要：解釋性仍然是醫療影像中 AI 模型的一大問題，這使得臨床醫生難以信任 AI 驅動的預測。
+我們引入了 3D ReX，這是第一個用於 3D 模型的基於因果關係的事後解釋性工具。3D ReX 使用實際因果關係理論來生成責任圖，該圖突出了對模型決策至關重要的區域。我們在中風檢測模型上測試了 3D ReX，提供了與中風相關特徵的空間分佈的見解。
+
 ##### **Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration**
 2502.09218v1 by Flavio Bertini, Alessandro Dal Palù, Federica Zaglio, Francesco Fabiano, Andrea Formisano
 
@@ -2675,36 +2689,16 @@ characteristics, in addition to the diverse stakeholders' perceptions.
 
 摘要：這項研究從多個利害關係人的角度探討不同的人工智慧 (AI) 應用在教育上的可接受性，包括學生、老師和家長。承認 AI 在教育上的轉型潛力，它解決了與資料隱私、AI 代理、透明度、可解釋性和 AI 的道德部署相關的疑慮。透過小插曲方法，參與者被呈現了四種情境，其中 AI 的代理、透明度、可解釋性和隱私受到操縱。在每個情境後，參與者完成了一項調查，該調查捕捉了他們對 AI 的整體效用、個人效用、正義、信心、風險和如果可用，使用每個情境的 AI 的意圖的看法。資料蒐集包含來自合作機構和社群媒體活動的 1198 位多利害關係人參與者的最終樣本，並專注於對四個 AI 使用案例的個別回應。對資料的調解分析表明，對 AI 的接受度和信任在利害關係人團體之間有顯著差異。我們發現，AI 的代理、透明度和可解釋性高低程度之間的關鍵調解者，以及使用不同教育 AI 的意圖，包括感知到的整體效用、正義和信心。這項研究強調，接受 AI 在教育上的應用是一個微妙且多面向的問題，除了不同的利害關係人的看法外，還需要仔細考慮具體的 AI 應用及其特徵。
 
-##### **Deciphering Heartbeat Signatures: A Vision Transformer Approach to Explainable Atrial Fibrillation Detection from ECG Signals**
-2402.09474v2 by Aruna Mohan, Danne Elbers, Or Zilbershot, Fatemeh Afghah, David Vorchheimer
-
-Remote patient monitoring based on wearable single-lead electrocardiogram
-(ECG) devices has significant potential for enabling the early detection of
-heart disease, especially in combination with artificial intelligence (AI)
-approaches for automated heart disease detection. There have been prior studies
-applying AI approaches based on deep learning for heart disease detection.
-However, these models are yet to be widely accepted as a reliable aid for
-clinical diagnostics, in part due to the current black-box perception
-surrounding many AI algorithms. In particular, there is a need to identify the
-key features of the ECG signal that contribute toward making an accurate
-diagnosis, thereby enhancing the interpretability of the model. In the present
-study, we develop a vision transformer approach to identify atrial fibrillation
-based on single-lead ECG data. A residual network (ResNet) approach is also
-developed for comparison with the vision transformer approach. These models are
-applied to the Chapman-Shaoxing dataset to classify atrial fibrillation, as
-well as another common arrhythmia, sinus bradycardia, and normal sinus rhythm
-heartbeats. The models enable the identification of the key regions of the
-heartbeat that determine the resulting classification, and highlight the
-importance of P-waves and T-waves, as well as heartbeat duration and signal
-amplitude, in distinguishing normal sinus rhythm from atrial fibrillation and
-sinus bradycardia.
-
-摘要：<paragraph>基於可穿戴式單導程心電圖 (ECG) 裝置的遠端病患監測在早期偵測心臟疾病方面具有顯著的潛力，特別是與用於自動化心臟疾病偵測的人工智慧 (AI) 方法結合使用時。先前已有研究應用基於深度學習的 AI 方法進行心臟疾病偵測。然而，這些模型尚未被廣泛接受為臨床診斷的可靠輔助工具，部分原因在於圍繞許多 AI 演算法的當前黑箱感知。特別是，有必要找出有助於做出準確診斷的 ECG 訊號關鍵特徵，從而增強模型的可解釋性。在本研究中，我們開發了一種視覺轉換器方法，以根據單導程 ECG 資料找出心房顫動。殘差網路 (ResNet) 方法也已開發出來，以便與視覺轉換器方法進行比較。這些模型應用於 Chapman-Shaoxing 資料集，以分類心房顫動，以及另一種常見的心律不整，竇性心動過緩，和正常竇性心律的心跳。這些模型能夠找出決定最終分類的心跳關鍵區域，並強調 P 波和 T 波，以及心跳持續時間和訊號振幅在區分正常竇性心律與心房顫動和竇性心動過緩方面的重要性。</paragraph>
-
 
 ### Medical
 |Publish Date|Title|Authors|Homepage|Code|
 | :---: | :---: | :---: | :---: | :---: |
+|**2025-02-18**|**Sleepless Nights, Sugary Days: Creating Synthetic Users with Health Conditions for Realistic Coaching Agent Interactions**|Taedong Yun et.al.|[2502.13135v1](http://arxiv.org/abs/2502.13135v1)|null|
+|**2025-02-18**|**Improving Clinical Question Answering with Multi-Task Learning: A Joint Approach for Answer Extraction and Medical Categorization**|Priyaranjan Pattnayak et.al.|[2502.13108v1](http://arxiv.org/abs/2502.13108v1)|null|
+|**2025-02-18**|**Fake It Till You Make It: Using Synthetic Data and Domain Knowledge for Improved Text-Based Learning for LGE Detection**|Athira J Jacob et.al.|[2502.12948v1](http://arxiv.org/abs/2502.12948v1)|null|
+|**2025-02-18**|**Reasoning and the Trusting Behavior of DeepSeek and GPT: An Experiment Revealing Hidden Fault Lines in Large Language Models**|Rubing Lu et.al.|[2502.12825v1](http://arxiv.org/abs/2502.12825v1)|null|
+|**2025-02-18**|**LLM Safety for Children**|Prasanjit Rath et.al.|[2502.12552v1](http://arxiv.org/abs/2502.12552v1)|null|
+|**2025-02-17**|**Classifiers of Data Sharing Statements in Clinical Trial Records**|Saber Jelodari Mamaghani et.al.|[2502.12362v1](http://arxiv.org/abs/2502.12362v1)|null|
 |**2025-02-17**|**Relational Norms for Human-AI Cooperation**|Brian D. Earp et.al.|[2502.12102v1](http://arxiv.org/abs/2502.12102v1)|null|
 |**2025-02-17**|**Deep Spatio-Temporal Neural Network for Air Quality Reanalysis**|Ammar Kheder et.al.|[2502.11941v1](http://arxiv.org/abs/2502.11941v1)|null|
 |**2025-02-17**|**Proactive Depot Discovery: A Generative Framework for Flexible Location-Routing**|Site Qu et.al.|[2502.11715v1](http://arxiv.org/abs/2502.11715v1)|null|
@@ -2716,17 +2710,19 @@ sinus bradycardia.
 |**2025-02-16**|**A Survey of LLM-based Agents in Medicine: How far are we from Baymax?**|Wenxuan Wang et.al.|[2502.11211v1](http://arxiv.org/abs/2502.11211v1)|null|
 |**2025-02-16**|**RT-DEMT: A hybrid real-time acupoint detection model combining mamba and transformer**|Shilong Yang et.al.|[2502.11179v1](http://arxiv.org/abs/2502.11179v1)|null|
 |**2025-02-16**|**Knowledge Graph-Driven Retrieval-Augmented Generation: Integrating Deepseek-R1 with Weaviate for Advanced Chatbot Applications**|Alexandru Lecu et.al.|[2502.11108v1](http://arxiv.org/abs/2502.11108v1)|null|
+|**2025-02-16**|**Predicting Depression in Screening Interviews from Interactive Multi-Theme Collaboration**|Xianbing Zhao et.al.|[2502.12204v1](http://arxiv.org/abs/2502.12204v1)|null|
 |**2025-02-16**|**CL-MFAP: A Contrastive Learning-Based Multimodal Foundation Model for Molecular Property Prediction and Antibiotic Screening**|Gen Zhou et.al.|[2502.11001v1](http://arxiv.org/abs/2502.11001v1)|null|
 |**2025-02-15**|**Automatic Quality Assessment of First Trimester Crown-Rump-Length Ultrasound Images**|Sevim Cengiz et.al.|[2502.10908v1](http://arxiv.org/abs/2502.10908v1)|null|
 |**2025-02-15**|**Breaking Down the Hierarchy: A New Approach to Leukemia Classification**|Ibraheem Hamdi et.al.|[2502.10899v1](http://arxiv.org/abs/2502.10899v1)|null|
 |**2025-02-15**|**An Empirical Analysis of Uncertainty in Large Language Model Evaluations**|Qiujie Xie et.al.|[2502.10709v1](http://arxiv.org/abs/2502.10709v1)|null|
-|**2025-02-15**|**Reading Your Heart: Learning ECG Words and Sentences via Pre-training ECG Language Model**|Jiarui Jin et.al.|[2502.10707v1](http://arxiv.org/abs/2502.10707v1)|null|
+|**2025-02-15**|**Reading Your Heart: Learning ECG Words and Sentences via Pre-training ECG Language Model**|Jiarui Jin et.al.|[2502.10707v1](http://arxiv.org/abs/2502.10707v1)|[link](https://github.com/pkudigitalhealth/heartlang)|
 |**2025-02-15**|**Self-Explaining Hypergraph Neural Networks for Diagnosis Prediction**|Leisheng Yu et.al.|[2502.10689v1](http://arxiv.org/abs/2502.10689v1)|null|
 |**2025-02-15**|**ProMRVL-CAD: Proactive Dialogue System with Multi-Round Vision-Language Interactions for Computer-Aided Diagnosis**|Xueshen Li et.al.|[2502.10620v1](http://arxiv.org/abs/2502.10620v1)|null|
 |**2025-02-15**|**Optimizing CNN Architectures for Advanced Thoracic Disease Classification**|Tejas Mirthipati et.al.|[2502.10614v1](http://arxiv.org/abs/2502.10614v1)|null|
 |**2025-02-14**|**PolyPath: Adapting a Large Multimodal Model for Multi-slide Pathology Report Generation**|Faruk Ahmed et.al.|[2502.10536v1](http://arxiv.org/abs/2502.10536v1)|null|
 |**2025-02-14**|**Tempo: Helping Data Scientists and Domain Experts Collaboratively Specify Predictive Modeling Tasks**|Venkatesh Sivaraman et.al.|[2502.10526v1](http://arxiv.org/abs/2502.10526v1)|null|
 |**2025-02-14**|**A Robust Attack: Displacement Backdoor Attack**|Yong Li et.al.|[2502.10490v1](http://arxiv.org/abs/2502.10490v1)|null|
+|**2025-02-14**|**3D ReX: Causal Explanations in 3D Neuroimaging Classification**|Melane Navaratnarajah et.al.|[2502.12181v1](http://arxiv.org/abs/2502.12181v1)|null|
 |**2025-02-14**|**Analyzing Patient Daily Movement Behavior Dynamics Using Two-Stage Encoding Model**|Jin Cui et.al.|[2502.09947v1](http://arxiv.org/abs/2502.09947v1)|null|
 |**2025-02-14**|**TransGUNet: Transformer Meets Graph-based Skip Connection for Medical Image Segmentation**|Ju-Hyeon Nam et.al.|[2502.09931v1](http://arxiv.org/abs/2502.09931v1)|null|
 |**2025-02-14**|**Video2Policy: Scaling up Manipulation Tasks in Simulation through Internet Videos**|Weirui Ye et.al.|[2502.09886v1](http://arxiv.org/abs/2502.09886v1)|null|
@@ -2743,6 +2739,7 @@ sinus bradycardia.
 |**2025-02-13**|**Data2Concept2Text: An Explainable Multilingual Framework for Data Analysis Narration**|Flavio Bertini et.al.|[2502.09218v1](http://arxiv.org/abs/2502.09218v1)|null|
 |**2025-02-13**|**Logical Lease Litigation: Prolog and LLMs for Rental Law Compliance in New York**|Sanskar Sehgal et.al.|[2502.09204v1](http://arxiv.org/abs/2502.09204v1)|null|
 |**2025-02-13**|**Two-Stage Representation Learning for Analyzing Movement Behavior Dynamics in People Living with Dementia**|Jin Cui et.al.|[2502.09173v1](http://arxiv.org/abs/2502.09173v1)|null|
+|**2025-02-13**|**TastepepAI, An artificial intelligence platform for taste peptide de novo design**|Jianda Yue et.al.|[2502.12167v1](http://arxiv.org/abs/2502.12167v1)|null|
 |**2025-02-12**|**HistoSmith: Single-Stage Histology Image-Label Generation via Conditional Latent Diffusion for Enhanced Cell Segmentation and Classification**|Valentina Vadori et.al.|[2502.08754v1](http://arxiv.org/abs/2502.08754v1)|[link](https://github.com/Vadori/CytoArk)|
 |**2025-02-12**|**Brain Latent Progression: Individual-based Spatiotemporal Disease Progression on 3D Brain MRIs via Latent Diffusion**|Lemuel Puglisi et.al.|[2502.08560v1](http://arxiv.org/abs/2502.08560v1)|[link](https://github.com/lemuelpuglisi/brlp)|
 |**2025-02-12**|**Representation Learning to Advance Multi-institutional Studies with Electronic Health Record Data**|Doudou Zhou et.al.|[2502.08547v1](http://arxiv.org/abs/2502.08547v1)|null|
@@ -2754,7 +2751,7 @@ sinus bradycardia.
 |**2025-02-11**|**Towards Efficient Optimizer Design for LLM via Structured Fisher Approximation with a Low-Rank Extension**|Wenbo Gong et.al.|[2502.07752v1](http://arxiv.org/abs/2502.07752v1)|null|
 |**2025-02-11**|**The Devil is in the Prompts: De-Identification Traces Enhance Memorization Risks in Synthetic Chest X-Ray Generation**|Raman Dutt et.al.|[2502.07516v2](http://arxiv.org/abs/2502.07516v2)|[link](https://github.com/Raman1121/diffusion_memorization)|
 |**2025-02-11**|**KPIs 2024 Challenge: Advancing Glomerular Segmentation from Patch- to Slide-Level**|Ruining Deng et.al.|[2502.07288v1](http://arxiv.org/abs/2502.07288v1)|[link](https://github.com/agaldran/kpis)|
-|**2025-02-11**|**Early Risk Prediction of Pediatric Cardiac Arrest from Electronic Health Records via Multimodal Fused Transformer**|Jiaying Lu et.al.|[2502.07158v1](http://arxiv.org/abs/2502.07158v1)|null|
+|**2025-02-11**|**Early Risk Prediction of Pediatric Cardiac Arrest from Electronic Health Records via Multimodal Fused Transformer**|Jiaying Lu et.al.|[2502.07158v2](http://arxiv.org/abs/2502.07158v2)|null|
 |**2025-02-11**|**Explaining 3D Computed Tomography Classifiers with Counterfactuals**|Joseph Paul Cohen et.al.|[2502.07156v1](http://arxiv.org/abs/2502.07156v1)|[link](https://github.com/ieee8023/ct-counterfactuals)|
 |**2025-02-10**|**Interactive Data Harmonization with LLM Agents**|Aécio Santos et.al.|[2502.07132v1](http://arxiv.org/abs/2502.07132v1)|null|
 |**2025-02-10**|**Machine Learning for Everyone: Simplifying Healthcare Analytics with BigQuery ML**|Mohammad Amir Salari et.al.|[2502.07026v1](http://arxiv.org/abs/2502.07026v1)|null|
@@ -2796,17 +2793,150 @@ sinus bradycardia.
 |**2025-02-06**|**A Retrospective Systematic Study on Hierarchical Sparse Query Transformer-assisted Ultrasound Screening for Early Hepatocellular Carcinoma**|Chaoyin She et.al.|[2502.03772v1](http://arxiv.org/abs/2502.03772v1)|[link](https://github.com/Asunatan/HSQformer)|
 |**2025-02-05**|**Towards Fair Medical AI: Adversarial Debiasing of 3D CT Foundation Embeddings**|Guangyao Zheng et.al.|[2502.04386v1](http://arxiv.org/abs/2502.04386v1)|[link](https://github.com/BioIntelligence-Lab/VAE-Adversarial-Debiasing)|
 |**2025-02-05**|**Clinically-Inspired Hierarchical Multi-Label Classification of Chest X-rays with a Penalty-Based Loss Function**|Mehrdad Asadi et.al.|[2502.03591v1](http://arxiv.org/abs/2502.03591v1)|[link](https://github.com/the-mercury/CIHMLC)|
-|**2025-02-05**|**Code Simulation as a Proxy for High-order Tasks in Large Language Models**|Emanuele La Malfa et.al.|[2502.03568v2](http://arxiv.org/abs/2502.03568v2)|null|
-|**2025-02-05**|**Limitations of Large Language Models in Clinical Problem-Solving Arising from Inflexible Reasoning**|Jonathan Kim et.al.|[2502.04381v1](http://arxiv.org/abs/2502.04381v1)|null|
-|**2025-02-05**|**Accurate AI-Driven Emergency Vehicle Location Tracking in Healthcare ITS Digital Twin**|Sarah Al-Shareeda et.al.|[2502.03396v1](http://arxiv.org/abs/2502.03396v1)|null|
-|**2025-02-05**|**RadVLM: A Multitask Conversational Vision-Language Model for Radiology**|Nicolas Deperrois et.al.|[2502.03333v1](http://arxiv.org/abs/2502.03333v1)|null|
-|**2025-02-05**|**MeDiSumQA: Patient-Oriented Question-Answer Generation from Discharge Letters**|Amin Dada et.al.|[2502.03298v1](http://arxiv.org/abs/2502.03298v1)|null|
-|**2025-02-05**|**Deep Learning Pipeline for Fully Automated Myocardial Infarct Segmentation from Clinical Cardiac MR Scans**|Matthias Schwab et.al.|[2502.03272v1](http://arxiv.org/abs/2502.03272v1)|null|
-|**2025-02-05**|**Long-tailed Medical Diagnosis with Relation-aware Representation Learning and Iterative Classifier Calibration**|Li Pan et.al.|[2502.03238v2](http://arxiv.org/abs/2502.03238v2)|[link](https://github.com/peterlipan/lmd)|
-|**2025-02-05**|**Fine-Tuning Strategies for Continual Online EEG Motor Imagery Decoding: Insights from a Large-Scale Longitudinal Study**|Martin Wimpff et.al.|[2502.06828v1](http://arxiv.org/abs/2502.06828v1)|[link](https://github.com/martinwimpff/eeg-continual)|
-|**2025-02-05**|**MedBioLM: Optimizing Medical and Biological QA with Fine-Tuned Large Language Models and Retrieval-Augmented Generation**|Seonok Kim et.al.|[2502.03004v1](http://arxiv.org/abs/2502.03004v1)|null|
 
 #### Abstracts
+##### **Sleepless Nights, Sugary Days: Creating Synthetic Users with Health Conditions for Realistic Coaching Agent Interactions**
+2502.13135v1 by Taedong Yun, Eric Yang, Mustafa Safdari, Jong Ha Lee, Vaishnavi Vinod Kumar, S. Sara Mahdavi, Jonathan Amar, Derek Peyton, Reut Aharony, Andreas Michaelides, Logan Schneider, Isaac Galatzer-Levy, Yugang Jia, John Canny, Arthur Gretton, Maja Matarić
+
+We present an end-to-end framework for generating synthetic users for
+evaluating interactive agents designed to encourage positive behavior changes,
+such as in health and lifestyle coaching. The synthetic users are grounded in
+health and lifestyle conditions, specifically sleep and diabetes management in
+this study, to ensure realistic interactions with the health coaching agent.
+Synthetic users are created in two stages: first, structured data are generated
+grounded in real-world health and lifestyle factors in addition to basic
+demographics and behavioral attributes; second, full profiles of the synthetic
+users are developed conditioned on the structured data. Interactions between
+synthetic users and the coaching agent are simulated using generative
+agent-based models such as Concordia, or directly by prompting a language
+model. Using two independently-developed agents for sleep and diabetes coaching
+as case studies, the validity of this framework is demonstrated by analyzing
+the coaching agent's understanding of the synthetic users' needs and
+challenges. Finally, through multiple blinded evaluations of user-coach
+interactions by human experts, we demonstrate that our synthetic users with
+health and behavioral attributes more accurately portray real human users with
+the same attributes, compared to generic synthetic users not grounded in such
+attributes. The proposed framework lays the foundation for efficient
+development of conversational agents through extensive, realistic, and grounded
+simulated interactions.
+
+摘要：我們提供了一個端到端的架構，用於為評估互動式代理生成合成使用者，這些代理旨在鼓勵正向行為改變，例如健康和生活方式指導。合成使用者以健康和生活方式狀況為基礎，特別是本研究中的睡眠和糖尿病管理，以確保與健康指導代理的互動具有真實性。合成使用者分兩個階段建立：首先，除了基本人口統計資料和行為屬性外，還會產生以現實世界的健康和生活方式因素為基礎的結構化資料；其次，會根據結構化資料開發合成使用者的完整個人資料。合成使用者和指導代理之間的互動是使用生成式基於代理的模型（例如 Concordia）模擬的，或者直接通過提示語言模型來模擬。使用兩個獨立開發的睡眠和糖尿病指導代理作為案例研究，通過分析指導代理對合成使用者需求和挑戰的理解，證明了此架構的有效性。最後，通過人類專家對使用者指導互動進行多重盲測評估，我們證明了與未以這些屬性為基礎的通用合成使用者相比，具有健康和行為屬性的合成使用者更準確地描繪了具有相同屬性的真實人類使用者。所提出的架構為通過廣泛、真實且有根據的模擬互動，為對話代理的有效開發奠定了基礎。
+
+##### **Improving Clinical Question Answering with Multi-Task Learning: A Joint Approach for Answer Extraction and Medical Categorization**
+2502.13108v1 by Priyaranjan Pattnayak, Hitesh Laxmichand Patel, Amit Agarwal, Bhargava Kumar, Srikant Panda, Tejaswini Kumar
+
+Clinical Question Answering (CQA) plays a crucial role in medical
+decision-making, enabling physicians to extract relevant information from
+Electronic Medical Records (EMRs). While transformer-based models such as BERT,
+BioBERT, and ClinicalBERT have demonstrated state-of-the-art performance in
+CQA, existing models lack the ability to categorize extracted answers, which is
+critical for structured retrieval, content filtering, and medical decision
+support.
+  To address this limitation, we introduce a Multi-Task Learning (MTL)
+framework that jointly trains CQA models for both answer extraction and medical
+categorization. In addition to predicting answer spans, our model classifies
+responses into five standardized medical categories: Diagnosis, Medication,
+Symptoms, Procedure, and Lab Reports. This categorization enables more
+structured and interpretable outputs, making clinical QA models more useful in
+real-world healthcare settings.
+  We evaluate our approach on emrQA, a large-scale dataset for medical question
+answering. Results show that MTL improves F1-score by 2.2% compared to standard
+fine-tuning, while achieving 90.7% accuracy in answer categorization. These
+findings suggest that MTL not only enhances CQA performance but also introduces
+an effective mechanism for categorization and structured medical information
+retrieval.
+
+摘要：臨床問答 (CQA) 在醫療決策中扮演著至關重要的角色，讓醫師能夠從電子病歷 (EMR) 中擷取相關資訊。儘管 BERT、BioBERT 和 ClinicalBERT 等基於轉換器的模型已在 CQA 中展現出最先進的效能，但現有的模型缺乏分類擷取答案的能力，這對於結構化檢索、內容過濾和醫療決策支援至關重要。
+  為了解決這個限制，我們引進了一個多任務學習 (MTL) 架構，它同時訓練 CQA 模型用於答案擷取和醫療分類。除了預測答案範圍，我們的模型將回應分類為五個標準化醫療類別：診斷、藥物、症狀、程序和實驗室報告。這種分類能產生更結構化且易於理解的輸出，讓臨床問答模型在真實世界的醫療保健環境中更實用。
+  我們在 emrQA 上評估我們的做法，emrQA 是用於醫療問題解答的大規模資料集。結果顯示，與標準微調相比，MTL 將 F1 分數提高了 2.2%，同時在答案分類中達到 90.7% 的準確度。這些發現表明，MTL 不僅增強了 CQA 的效能，還引入了一種分類和結構化醫療資訊檢索的有效機制。
+
+##### **Fake It Till You Make It: Using Synthetic Data and Domain Knowledge for Improved Text-Based Learning for LGE Detection**
+2502.12948v1 by Athira J Jacob, Puneet Sharma, Daniel Rueckert
+
+Detection of hyperenhancement from cardiac LGE MRI images is a complex task
+requiring significant clinical expertise. Although deep learning-based models
+have shown promising results for the task, they require large amounts of data
+with fine-grained annotations. Clinical reports generated for cardiac MR
+studies contain rich, clinically relevant information, including the location,
+extent and etiology of any scars present. Although recently developed
+CLIP-based training enables pretraining models with image-text pairs, it
+requires large amounts of data and further finetuning strategies on downstream
+tasks. In this study, we use various strategies rooted in domain knowledge to
+train a model for LGE detection solely using text from clinical reports, on a
+relatively small clinical cohort of 965 patients. We improve performance
+through the use of synthetic data augmentation, by systematically creating scar
+images and associated text. In addition, we standardize the orientation of the
+images in an anatomy-informed way to enable better alignment of spatial and
+text features. We also use a captioning loss to enable fine-grained supervision
+and explore the effect of pretraining of the vision encoder on performance.
+Finally, ablation studies are carried out to elucidate the contributions of
+each design component to the overall performance of the model.
+
+摘要：從心臟 LGE MRI 影像偵測出過度增強是一項複雜的任務，需要顯著的臨床專業知識。儘管基於深度學習的模型已顯示出對這項任務有前景的結果，但它們需要大量具有細緻註解的資料。為心臟 MR 研究產生的臨床報告包含豐富且臨床上相關的資訊，包括任何疤痕的位置、範圍和病因。儘管最近開發的基於 CLIP 的訓練能使用影像文字對預訓練模型，但它需要大量資料和進一步微調下游任務的策略。在這項研究中，我們使用植基於領域知識的各種策略，僅使用來自臨床報告的文字，在一個相對較小的 965 名患者臨床群體中訓練一個 LGE 偵測模型。我們透過使用合成資料擴充來改善效能，系統性地建立疤痕影像和相關文字。此外，我們以解剖學告知的方式標準化影像方向，以使空間和文字特徵能更好地對齊。我們也使用標題損失來啟用細緻的監督，並探討視覺編碼器的預訓練對效能的影響。最後，進行消融研究以闡明每個設計元件對模型整體效能的貢獻。
+
+##### **Reasoning and the Trusting Behavior of DeepSeek and GPT: An Experiment Revealing Hidden Fault Lines in Large Language Models**
+2502.12825v1 by Rubing Lu, João Sedoc, Arun Sundararajan
+
+When encountering increasingly frequent performance improvements or cost
+reductions from a new large language model (LLM), developers of applications
+leveraging LLMs must decide whether to take advantage of these improvements or
+stay with older tried-and-tested models. Low perceived switching frictions can
+lead to choices that do not consider more subtle behavior changes that the
+transition may induce. Our experiments use a popular game-theoretic behavioral
+economics model of trust to show stark differences in the trusting behavior of
+OpenAI's and DeepSeek's models. We highlight a collapse in the economic trust
+behavior of the o1-mini and o3-mini models as they reconcile profit-maximizing
+and risk-seeking with future returns from trust, and contrast it with
+DeepSeek's more sophisticated and profitable trusting behavior that stems from
+an ability to incorporate deeper concepts like forward planning and
+theory-of-mind. As LLMs form the basis for high-stakes commercial systems, our
+results highlight the perils of relying on LLM performance benchmarks that are
+too narrowly defined and suggest that careful analysis of their hidden fault
+lines should be part of any organization's AI strategy.
+
+摘要：當遇到越來越頻繁的效能提升或來自於新的大型語言模型 (LLM) 的成本降低時，利用 LLM 的應用程式開發人員必須決定是否要利用這些提升或維持較舊且經過測試的模型。低感知切換摩擦可能會導致選擇不考慮轉換可能誘發的更細微的行為改變。我們的實驗使用信任的流行博弈論行為經濟模型來顯示 OpenAI 和 DeepSeek 模型在信任行為上的顯著差異。我們強調 o1-mini 和 o3-mini 模型的經濟信任行為崩潰，因為它們調和了利潤最大化和風險尋求與來自信任的未來回報，並將其與 DeepSeek 更複雜且有利可圖的信任行為進行對比，這種信任行為源於整合更深層的概念，例如前瞻性規劃和心智理論。由於 LLM 構成高風險商業系統的基礎，我們的結果突顯了依賴定義過於狹窄的 LLM 效能基準的危險性，並建議仔細分析其隱藏的斷層線應該是任何組織的 AI 策略的一部分。
+
+##### **LLM Safety for Children**
+2502.12552v1 by Prasanjit Rath, Hari Shrawgi, Parag Agrawal, Sandipan Dandapat
+
+This paper analyzes the safety of Large Language Models (LLMs) in
+interactions with children below the age of 18 years. Despite the transformative
+applications of LLMs in various aspects of children's lives such as education
+and therapy, there remains a significant gap in understanding and mitigating
+potential content harms specific to this demographic. The study acknowledges
+the diverse nature of children often overlooked by standard safety evaluations
+and proposes a comprehensive approach to evaluating LLM safety specifically for
+children. We list potential risks that children may encounter when using
+LLM-powered applications. Additionally, we develop Child User Models that
+reflect the varied personalities and interests of children informed by
+literature in child care and psychology. These user models aim to bridge the
+existing gap in child safety literature across various fields. We utilize Child
+User Models to evaluate the safety of six state-of-the-art LLMs. Our
+observations reveal significant safety gaps in LLMs, particularly in categories
+harmful to children but not adults.
+
+摘要：本文分析了大型語言模型 (LLM) 在與 18 歲以下兒童互動時的安全性。儘管 LLM 在兒童生活的各個方面（例如教育和治療）都有轉變性的應用，但在了解和減輕對這個群體具體的潛在內容危害方面仍然存在顯著差距。研究承認兒童的多樣性，而標準安全評估通常會忽略這些多樣性，並提出了一種針對兒童評估 LLM 安全性的綜合方法。我們列出了兒童在使用由 LLM 提供動力的應用程式時可能遇到的潛在風險。此外，我們開發了兒童使用者模型，這些模型反映了兒童不同的個性特質和興趣，並參考了兒童照護和心理學的文獻。這些使用者模型旨在彌合不同領域兒童安全文獻中現有的差距。我們利用兒童使用者模型來評估六個最先進的 LLM 的安全性。我們的觀察結果揭示了 LLM 中的重大安全漏洞，特別是在對兒童有害但對成年人無害的類別中。
+
+##### **Classifiers of Data Sharing Statements in Clinical Trial Records**
+2502.12362v1 by Saber Jelodari Mamaghani, Cosima Strantz, Dennis Toddenroth
+
+Digital individual participant data (IPD) from clinical trials are
+increasingly distributed for potential scientific reuse. The identification of
+available IPD, however, requires interpretations of textual data-sharing
+statements (DSS) in large databases. Recent advancements in computational
+linguistics include pre-trained language models that promise to simplify the
+implementation of effective classifiers based on textual inputs. In a subset of
+5,000 textual DSS from ClinicalTrials.gov, we evaluate how well classifiers
+based on domain-specific pre-trained language models reproduce original
+availability categories as well as manually annotated labels. Typical metrics
+indicate that classifiers that predicted manual annotations outperformed those
+that learned to output the original availability categories. This suggests that
+the textual DSS descriptions contain applicable information that the
+availability categories do not, and that such classifiers could thus aid the
+automatic identification of available IPD in large trial databases.
+
+摘要：臨床試驗的數位個人參與者資料 (IPD) 愈來愈廣泛地用於潛在的科學再利用。然而，要找出可用的 IPD，需要對大型資料庫中的文字資料共享聲明 (DSS) 進行詮釋。計算語言學最近的進展包括預先訓練的語言模型，有望簡化根據文字輸入實作有效分類器的過程。在 ClinicalTrials.gov 中的 5,000 個文字 DSS 子集中，我們評估了基於特定領域預先訓練語言模型的分類器，在重現原始可用性類別以及手動註解標籤方面的表現。典型的指標顯示，預測手動註解的分類器優於學會輸出原始可用性類別的分類器。這表示文字 DSS 說明包含可用性類別所沒有的適用資訊，而且此類分類器因此有助於在大型試驗資料庫中自動找出可用的 IPD。
+
 ##### **Relational Norms for Human-AI Cooperation**
 2502.12102v1 by Brian D. Earp, Sebastian Porsdam Mann, Mateo Aboy, Edmond Awad, Monika Betzler, Marietjie Botes, Rachel Calcott, Mina Caraccio, Nick Chater, Mark Coeckelbergh, Mihaela Constantinescu, Hossein Dabbagh, Kate Devlin, Xiaojun Ding, Vilius Dranseika, Jim A. C. Everett, Ruiping Fan, Faisal Feroz, Kathryn B. Francis, Cindy Friedman, Orsolya Friedrich, Iason Gabriel, Ivar Hannikainen, Julie Hellmann, Arasj Khodadade Jahrome, Niranjan S. Janardhanan, Paul Jurcys, Andreas Kappes, Maryam Ali Khan, Gordon Kraft-Todd, Maximilian Kroner Dale, Simon M. Laham, Benjamin Lange, Muriel Leuenberger, Jonathan Lewis, Peng Liu, David M. Lyreskog, Matthijs Maas, John McMillan, Emilian Mihailov, Timo Minssen, Joshua Teperowski Monrad, Kathryn Muyskens, Simon Myers, Sven Nyholm, Alexa M. Owen, Anna Puzio, Christopher Register, Madeline G. Reinecke, Adam Safron, Henry Shevlin, Hayate Shimizu, Peter V. Treit, Cristina Voinea, Karen Yan, Anda Zahiu, Renwen Zhang, Hazem Zohny, Walter Sinnott-Armstrong, Ilina Singh, Julian Savulescu, Margaret S. Clark
 
@@ -3096,6 +3226,28 @@ chatbot applications.
 
 摘要：大型語言模型 (LLM) 已大幅推動自然語言生成的領域。然而，它們經常產生未經驗證的輸出，這會損害它們在關鍵應用中的可靠性。在本研究中，我們提出了一個創新的框架，透過檢索增強生成技術，將結構化的生物醫學知識與 LLM 結合。我們的系統透過識別和精煉與年齡相關性黃斑部病變 (AMD) 相關的醫學摘要中的因果關係和命名實體，開發一個徹底的知識圖譜。我們的框架使用基於向量的檢索流程和本地部署的語言模型，產生在脈絡上相關且可驗證的回應，並直接參考臨床證據。實驗結果顯示，此方法顯著減少了幻覺、增強了事實準確性，並改善了生成回應的清晰度，為先進的生物醫學聊天機器人應用程式提供了穩健的解決方案。
 
+##### **Predicting Depression in Screening Interviews from Interactive Multi-Theme Collaboration**
+2502.12204v1 by Xianbing Zhao, Yiqing Lyu, Di Wang, Buzhou Tang
+
+Automatic depression detection provides cues for early clinical intervention
+by clinicians. Clinical interviews for depression detection involve dialogues
+centered around multiple themes. Existing studies primarily design end-to-end
+neural network models to capture the hierarchical structure of clinical
+interview dialogues. However, these methods exhibit defects in modeling the
+thematic content of clinical interviews: 1) they fail to capture intra-theme
+and inter-theme correlation explicitly, and 2) they do not allow clinicians to
+intervene and focus on themes of interest. To address these issues, this paper
+introduces an interactive depression detection framework. This framework
+leverages in-context learning techniques to identify themes in clinical
+interviews and then models both intra-theme and inter-theme correlation.
+Additionally, it employs AI-driven feedback to simulate the interests of
+clinicians, enabling interactive adjustment of theme importance. PDIMC achieves
+absolute improvements of 35% and 12% compared to the state-of-the-art on the
+depression detection dataset DAIC-WOZ, which demonstrates the effectiveness of
+modeling theme correlation and incorporating interactive external feedback.
+
+摘要：自動憂鬱症偵測提供臨床醫師早期臨床介入的線索。憂鬱症偵測的臨床訪談涉及以多個主題為中心的對話。現有研究主要設計端對端的類神經網路模型來捕捉臨床訪談對話的階層結構。然而，這些方法在建模臨床訪談的主題內容時表現出缺陷：1）它們無法明確捕捉主題內和主題間的關聯性，以及 2）它們不允許臨床醫師介入並專注於感興趣的主題。為了解決這些問題，本文介紹了一個互動式憂鬱症偵測框架。此框架利用情境學習技術來識別臨床訪談中的主題，然後對主題內和主題間的關聯性進行建模。此外，它採用 AI 驅動的回饋來模擬臨床醫師的興趣，實現主題重要性的互動式調整。與 DAIC-WOZ 憂鬱症偵測資料集上的最新技術相比，PDIMC 的絕對改進率分別為 35% 和 12%，這證明了對主題關聯性建模和納入互動式外部回饋的有效性。
+
 ##### **CL-MFAP: A Contrastive Learning-Based Multimodal Foundation Model for Molecular Property Prediction and Antibiotic Screening**
 2502.11001v1 by Gen Zhou, Sugitha Janarthanan, Yutong Lu, Pingzhao Hu
 
@@ -3364,6 +3516,20 @@ differences, such as rotation and cropping.
 
 摘要：随着人工智能在我们的生活中变得越来越普遍，人们正在享受它带来的便利，但也面临着隐藏的威胁，例如数据中毒和对抗性攻击。这些威胁可能对人工智能的应用产生灾难性后果，特别是对于一些立即生效的应用，例如自动驾驶和医疗领域。在这些威胁中，后门攻击以其隐蔽性和简单的部署给人们留下了深刻的印象，使其成为不可忽视的威胁，然而，在部署后门模型的过程中，后门攻击往往存在一些使其在实际应用中不尽如人意的原因，例如抖动和亮度变化。基于此，我们提出了一种高度鲁棒的后门攻击，该攻击对目标样本进行平移并将其与自身结合以形成后门样本，即置换后门攻击 (DBA)。实验结果表明，DBA 攻击可以抵抗模拟真实世界差异的数据增强，例如旋转和裁剪。
 
+##### **3D ReX: Causal Explanations in 3D Neuroimaging Classification**
+2502.12181v1 by Melane Navaratnarajah, Sophie A. Martin, David A. Kelly, Nathan Blake, Hana Chocker
+
+Explainability remains a significant problem for AI models in medical
+imaging, making it challenging for clinicians to trust AI-driven predictions.
+We introduce 3D ReX, the first causality-based post-hoc explainability tool for
+3D models. 3D ReX uses the theory of actual causality to generate
+responsibility maps which highlight the regions most crucial to the model's
+decision. We test 3D ReX on a stroke detection model, providing insight into
+the spatial distribution of features relevant to stroke.
+
+摘要：解釋性仍然是醫療影像中 AI 模型的一大問題，這使得臨床醫生難以信任 AI 驅動的預測。
+我們引入了 3D ReX，這是第一個用於 3D 模型的基於因果關係的事後解釋性工具。3D ReX 使用實際因果關係理論來生成責任圖，該圖突出了對模型決策至關重要的區域。我們在中風檢測模型上測試了 3D ReX，提供了與中風相關特徵的空間分佈的見解。
+
 ##### **Analyzing Patient Daily Movement Behavior Dynamics Using Two-Stage Encoding Model**
 2502.09947v1 by Jin Cui, Alexander Capstick, Payam Barnaghi, Gregory Scott
 
@@ -3749,6 +3915,32 @@ care interventions, and large-scale health monitoring.
 
 摘要：在遠程醫療監控中，時間序列表示學習揭示了高頻率數據中的關鍵患者行為模式。本研究通過提出一個兩階段、自我監督的學習方法來分析痴呆症患者的家庭活動數據，該方法專門用於發現低秩結構。第一階段將時間序列活動轉換為由預訓練語言模型編碼的文本序列，使用基於 PageRank 的方法提供了一個豐富、高維的潛在狀態空間。此 PageRank 向量捕獲潛在狀態轉換，有效地將複雜的行為數據壓縮成簡潔的形式，從而增強了解力。此低秩表示不僅增強了模型的可解釋性，還促進了聚類和轉換分析，揭示了與臨床指標（例如 MMSE 和 ADAS-COG 分數）相關的關鍵行為模式。我們的研究結果證明了該框架在支持認知狀態預測、個性化護理干預和大型健康監控方面的潛力。
 
+##### **TastepepAI, An artificial intelligence platform for taste peptide de novo design**
+2502.12167v1 by Jianda Yue, Tingting Li, Jian Ouyang, Jiawei Xu, Hua Tan, Zihui Chen, Changsheng Han, Huanyu Li, Songping Liang, Zhonghua Liu, Zhonghua Liu, Ying Wang
+
+Taste peptides have emerged as promising natural flavoring agents owing
+to their unique organoleptic properties, high safety profile, and potential
+health benefits. However, the de novo identification of taste peptides derived
+from animal, plant, or microbial sources remains a time-consuming and
+resource-intensive process, significantly impeding their widespread application
+in the food industry. Here, we present TastePepAI, a comprehensive artificial
+intelligence framework for customized taste peptide design and safety
+assessment. As the key element of this framework, a loss-supervised adaptive
+variational autoencoder (LA-VAE) is implemented to efficiently optimize the
+latent representation of sequences during training and facilitate the
+generation of target peptides with desired taste profiles. Notably, our model
+incorporates a novel taste-avoidance mechanism, allowing for selective flavor
+exclusion. Subsequently, our in-house developed toxicity prediction algorithm
+(SpepToxPred) is integrated in the framework to perform rigorous safety
+evaluation of generated peptides. Using this integrated platform, we
+successfully identified 73 peptides exhibiting sweet, salty, and umami tastes,
+significantly expanding the current repertoire of taste peptides. This work
+demonstrates the potential of TastePepAI in accelerating taste peptide
+discovery for food applications and provides a versatile framework adaptable to
+broader peptide engineering challenges.
+
+摘要：味觉肽因其独特的感官特性、高安全性概况和潜在的健康益处而成为有前途的天然调味剂。然而，从动物、植物或微生物来源中从头鉴定味觉肽仍然是一个耗时且资源密集的过程，严重阻碍了它们在食品工业中的广泛应用。在此，我们提出了 TastePepAI，这是一个用于定制味觉肽设计和安全性评估的综合人工智能框架。作为该框架的关键元素，实现了损失监督自适应变分自动编码器 (LA-VAE)，以在训练期间有效优化序列的潜在表示，并促进生成具有所需味觉特征的目标肽。值得注意的是，我们的模型包含了一种新颖的味觉回避机制，允许选择性排除风味。随后，我们内部开发的毒性预测算法 (SpepToxPred) 被集成到框架中，以对生成的肽进行严格的安全评估。使用这个集成平台，我们成功地鉴定了 73 种表现出甜味、咸味和鲜味的肽，极大地扩展了当前的味觉肽库。这项工作展示了 TastePepAI 在加速味觉肽发现以用于食品应用方面的潜力，并提供了一个适用于更广泛的肽工程挑战的多功能框架。
+
 ##### **HistoSmith: Single-Stage Histology Image-Label Generation via Conditional Latent Diffusion for Enhanced Cell Segmentation and Classification**
 2502.08754v1 by Valentina Vadori, Jean-Marie Graïc, Antonella Peruffo, Livio Finos, Ujwala Kiran Chaudhari, Enrico Grisan
 
@@ -4045,7 +4237,7 @@ CKD 臨床前嚙齒動物模型的資料集，這些註解來自 60 多個
 疾病研究和診斷量化。
 
 ##### **Early Risk Prediction of Pediatric Cardiac Arrest from Electronic Health Records via Multimodal Fused Transformer**
-2502.07158v1 by Jiaying Lu, Stephanie R. Brown, Songyuan Liu, Shifan Zhao, Kejun Dong, Del Bold, Michael Fundora, Alaa Aljiffry, Alex Fedorov, Jocelyn Grunwell, Xiao Hu
+2502.07158v2 by Jiaying Lu, Stephanie R. Brown, Songyuan Liu, Shifan Zhao, Kejun Dong, Del Bold, Michael Fundora, Alaa Aljiffry, Alex Fedorov, Jocelyn Grunwell, Xiao Hu
 
 Early prediction of pediatric cardiac arrest (CA) is critical for timely
 intervention in high-risk intensive care settings. We introduce PedCA-FT, a
@@ -4060,7 +4252,7 @@ and identifies clinically meaningful risk factors. These findings underscore
 the potential of multimodal fusion techniques to enhance early CA detection and
 improve patient care.
 
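-Summary: Early prediction of cardiac arrest (CA) in children is critical for timely intervention in high-risk intensive care settings. We introduce PedCA-FT, a new Transformer-based framework that fuses the tabular view of EHR with a derived textual view of EHR to fully unlock the interactions of high-dimensional risk factors and their dynamics. By employing a dedicated Transformer module for each modality view, PedCA-FT captures complex temporal and contextual patterns to produce robust CA risk estimates. Evaluated on a curated pediatric cohort from the CHOA-CICU database, our approach outperforms ten other artificial intelligence models on five key performance metrics and identifies clinically meaningful risk factors. These findings underscore the potential of multimodal fusion techniques to enhance early CA detection and improve patient care.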
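+Summary: Early prediction of pediatric cardiac arrest (CA) is critical for timely intervention in high-risk intensive care settings. We introduce PedCA-FT, a novel transformer-based framework that fuses the tabular view of EHR with a derived textual view of EHR to make full use of the interactions of high-dimensional risk factors and their dynamics. By employing a dedicated transformer module for each modality view, PedCA-FT captures complex temporal and contextual patterns to produce robust CA risk estimates. Evaluated on a curated pediatric cohort from the CHOA-CICU database, our approach outperforms ten other artificial intelligence models across five key performance metrics and pinpoints clinically meaningful risk factors. These findings underscore the potential of multimodal fusion techniques to enhance early CA detection and improve patient care.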
 
 ##### **Explaining 3D Computed Tomography Classifiers with Counterfactuals**
 2502.07156v1 by Joseph Paul Cohen, Louis Blankemeier, Akshay Chaudhari
@@ -5101,2682 +5293,16 @@ experiment details are made available.
 
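 Summary: In this paper, we propose a new approach to multi-label classification of chest X-ray (CXR) images that improves clinical interpretability while maintaining a streamlined single-model, single-run training pipeline. Using the CheXpert dataset and VisualCheXbert-derived labels, we incorporate hierarchical label groupings to capture clinically meaningful relationships between diagnoses. To this end, we design a custom hierarchical binary cross-entropy (HBCE) loss that enforces label dependencies using either fixed or data-driven penalty types. Our model achieves a mean area under the receiver operating characteristic curve (AUROC) of 0.903 on the test set. In addition, we provide visual explanations and uncertainty estimates to further improve model interpretability. All code, model configurations, and experiment details are made available.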
 
-##### **Code Simulation as a Proxy for High-order Tasks in Large Language Models**
-2502.03568v2 by Emanuele La Malfa, Christoph Weinhuber, Orazio Torre, Fangru Lin, X. Angelo Huang, Samuele Marro, Anthony Cohn, Nigel Shadbolt, Michael Wooldridge
-
-Many reasoning, planning, and problem-solving tasks share an intrinsic
-algorithmic nature: correctly simulating each step is a sufficient condition to
-solve them correctly. We collect pairs of naturalistic and synthetic reasoning
-tasks to assess the capabilities of Large Language Models (LLM). While
-naturalistic tasks often require careful human handcrafting, we show that
-synthetic data is, in many cases, a good proxy that is much easier to collect
-at scale. We leverage common constructs in programming as the counterpart of
-the building blocks of naturalistic reasoning tasks, such as straight-line
-programs, code that contains critical paths, and approximate and redundant
-instructions. We further assess the capabilities of LLMs on sorting problems
-and repeated operations via sorting algorithms and nested loops. Our synthetic
-datasets further reveal that while the most powerful LLMs exhibit relatively
-strong execution capabilities, the process is fragile: it is negatively
-affected by memorisation and seems to rely heavily on pattern recognition. Our
-contribution builds upon synthetically testing the reasoning capabilities of
-LLMs as a scalable complement to handcrafted human-annotated problems.
-
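-Summary: Many reasoning, planning, and problem-solving tasks share an intrinsic algorithmic nature: correctly simulating each step is a sufficient condition for solving them correctly. We collect pairs of naturalistic and synthetic reasoning tasks to assess the capabilities of Large Language Models (LLMs). While naturalistic tasks often require careful human handcrafting, we show that synthetic data is, in many cases, a good proxy that is much easier to collect at scale. We use common programming constructs as counterparts of the building blocks of naturalistic reasoning tasks, such as straight-line programs, code that contains critical paths, and approximate and redundant instructions. We further assess LLMs on sorting problems and repeated operations via sorting algorithms and nested loops. Our synthetic datasets further reveal that while the most powerful LLMs exhibit relatively strong execution capabilities, the process is fragile: it is negatively affected by memorisation and appears to rely heavily on pattern recognition. Our contribution builds on synthetically testing the reasoning capabilities of LLMs as a scalable complement to handcrafted, human-annotated problems.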
-
-##### **Limitations of Large Language Models in Clinical Problem-Solving Arising from Inflexible Reasoning**
-2502.04381v1 by Jonathan Kim, Anna Podlasek, Kie Shidara, Feng Liu, Ahmed Alaa, Danilo Bernardo
-
-Large Language Models (LLMs) have attained human-level accuracy on medical
-question-answer (QA) benchmarks. However, their limitations in navigating
-open-ended clinical scenarios have recently been shown, raising concerns about
-the robustness and generalizability of LLM reasoning across diverse, real-world
-medical tasks. To probe potential LLM failure modes in clinical
-problem-solving, we present the medical abstraction and reasoning corpus
-(M-ARC). M-ARC assesses clinical reasoning through scenarios designed to
-exploit the Einstellung effect -- the fixation of thought arising from prior
-experience, targeting LLM inductive biases toward inflexible pattern matching
-from their training data rather than engaging in flexible reasoning. We find
-that LLMs, including current state-of-the-art o1 and Gemini models, perform
-poorly compared to physicians on M-ARC, often demonstrating lack of commonsense
-medical reasoning and a propensity to hallucinate. In addition, uncertainty
-estimation analyses indicate that LLMs exhibit overconfidence in their answers,
-despite their limited accuracy. The failure modes revealed by M-ARC in LLM
-medical reasoning underscore the need to exercise caution when deploying these
-models in clinical settings.
-
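-Summary: Large Language Models (LLMs) have attained human-level accuracy on medical question-answer (QA) benchmarks. However, their limitations in navigating open-ended clinical scenarios have recently been shown, raising concerns about the robustness and generalizability of LLM reasoning across diverse, real-world medical tasks. To probe potential LLM failure modes in clinical problem-solving, we present the medical abstraction and reasoning corpus (M-ARC). M-ARC assesses clinical reasoning through scenarios designed to exploit the Einstellung effect (the fixation of thought arising from prior experience), targeting LLM inductive biases toward inflexible pattern matching from their training data rather than flexible reasoning. We find that LLMs, including the current state-of-the-art o1 and Gemini models, perform poorly compared to physicians on M-ARC, often demonstrating a lack of commonsense medical reasoning and a propensity to hallucinate. In addition, uncertainty estimation analyses indicate that LLMs are overconfident in their answers despite their limited accuracy. The failure modes revealed by M-ARC in LLM medical reasoning underscore the need for caution when deploying these models in clinical settings.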
-
-##### **Accurate AI-Driven Emergency Vehicle Location Tracking in Healthcare ITS Digital Twin**
-2502.03396v1 by Sarah Al-Shareeda, Yasar Celik, Bilge Bilgili, Ahmed Al-Dubai, Berk Canberk
-
-Creating a Digital Twin (DT) for Healthcare Intelligent Transportation
-Systems (HITS) is a hot research trend focusing on enhancing HITS management,
-particularly in emergencies where ambulance vehicles must arrive at the crash
-scene on time and tracking their real-time location is crucial to the medical
-authorities. Despite the claim of real-time representation, a temporal
-misalignment persists between the physical and virtual domains, leading to
-discrepancies in the ambulance's location representation. This study proposes
-integrating AI predictive models, specifically Support Vector Regression (SVR)
-and Deep Neural Networks (DNN), within a constructed mock DT data pipeline
-framework to anticipate the medical vehicle's next location in the virtual
-world. These models align virtual representations with their physical
-counterparts, i.e., metaphorically offsetting the synchronization delay between
-the two worlds. Trained meticulously on a historical geospatial dataset, SVR
-and DNN exhibit exceptional prediction accuracy in MATLAB and Python
-environments. Through various testing scenarios, we visually demonstrate the
-efficacy of our methodology, showcasing SVR and DNN's key role in significantly
-reducing the witnessed gap within the HITS's DT. This transformative approach
-enhances real-time synchronization in emergency HITS by approximately 88% to
-93%.
-
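-Summary: Creating a Digital Twin (DT) for Healthcare Intelligent Transportation Systems (HITS) is a hot research trend focused on enhancing HITS management, particularly in emergencies where ambulances must arrive at the crash scene on time and tracking their real-time location is crucial to the medical authorities. Despite the claim of real-time representation, a temporal misalignment persists between the physical and virtual domains, leading to discrepancies in the representation of the ambulance's location. This study proposes integrating AI predictive models, specifically Support Vector Regression (SVR) and Deep Neural Networks (DNN), within a constructed mock DT data pipeline framework to anticipate the medical vehicle's next location in the virtual world. These models align virtual representations with their physical counterparts, i.e., metaphorically offsetting the synchronization delay between the two worlds. Trained carefully on a historical geospatial dataset, SVR and DNN exhibit exceptional prediction accuracy in MATLAB and Python environments. Through various testing scenarios, we visually demonstrate the efficacy of our methodology, showing the key role of SVR and DNN in significantly reducing the gap observed within the HITS DT. This transformative approach improves real-time synchronization in emergency HITS by approximately 88% to 93%.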
-
-##### **RadVLM: A Multitask Conversational Vision-Language Model for Radiology**
-2502.03333v1 by Nicolas Deperrois, Hidetoshi Matsuo, Samuel Ruipérez-Campillo, Moritz Vandenhirtz, Sonia Laguna, Alain Ryser, Koji Fujimoto, Mizuho Nishio, Thomas M. Sutter, Julia E. Vogt, Jonas Kluckert, Thomas Frauenfelder, Christian Blüthgen, Farhad Nooralahzadeh, Michael Krauthammer
-
-The widespread use of chest X-rays (CXRs), coupled with a shortage of
-radiologists, has driven growing interest in automated CXR analysis and
-AI-assisted reporting. While existing vision-language models (VLMs) show
-promise in specific tasks such as report generation or abnormality detection,
-they often lack support for interactive diagnostic capabilities. In this work
-we present RadVLM, a compact, multitask conversational foundation model
-designed for CXR interpretation. To this end, we curate a large-scale
-instruction dataset comprising over 1 million image-instruction pairs
-containing both single-turn tasks -- such as report generation, abnormality
-classification, and visual grounding -- and multi-turn, multi-task
-conversational interactions. After fine-tuning RadVLM on this instruction
-dataset, we evaluate it across different tasks along with re-implemented
-baseline VLMs. Our results show that RadVLM achieves state-of-the-art
-performance in conversational capabilities and visual grounding while remaining
-competitive in other radiology tasks. Ablation studies further highlight the
-benefit of joint training across multiple tasks, particularly for scenarios
-with limited annotated data. Together, these findings highlight the potential
-of RadVLM as a clinically relevant AI assistant, providing structured CXR
-interpretation and conversational capabilities to support more effective and
-accessible diagnostic workflows.
-
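-Summary: The widespread use of chest X-rays (CXRs), coupled with a shortage of radiologists, has driven growing interest in automated CXR analysis and AI-assisted reporting. While existing vision-language models (VLMs) show promise in specific tasks such as report generation or abnormality detection, they often lack support for interactive diagnostic capabilities. In this work we present RadVLM, a compact, multitask conversational foundation model designed for CXR interpretation. To this end, we curate a large-scale instruction dataset comprising over 1 million image-instruction pairs that contain both single-turn tasks (such as report generation, abnormality classification, and visual grounding) and multi-turn, multi-task conversational interactions. After fine-tuning RadVLM on this instruction dataset, we evaluate it across different tasks alongside re-implemented baseline VLMs. Our results show that RadVLM achieves state-of-the-art performance in conversational capabilities and visual grounding while remaining competitive in other radiology tasks. Ablation studies further highlight the benefit of joint training across multiple tasks, particularly for scenarios with limited annotated data. Together, these findings highlight the potential of RadVLM as a clinically relevant AI assistant, providing structured CXR interpretation and conversational capabilities to support more effective and accessible diagnostic workflows.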
-
-##### **MeDiSumQA: Patient-Oriented Question-Answer Generation from Discharge Letters**
-2502.03298v1 by Amin Dada, Osman Alperen Koras, Marie Bauer, Amanda Butler, Kaleb E. Smith, Jens Kleesiek, Julian Friedrich
-
-While increasing patients' access to medical documents improves medical care,
-this benefit is limited by varying health literacy levels and complex medical
-terminology. Large language models (LLMs) offer solutions by simplifying
-medical information. However, evaluating LLMs for safe and patient-friendly
-text generation is difficult due to the lack of standardized evaluation
-resources. To fill this gap, we developed MeDiSumQA. MeDiSumQA is a dataset
-created from MIMIC-IV discharge summaries through an automated pipeline
-combining LLM-based question-answer generation with manual quality checks. We
-use this dataset to evaluate various LLMs on patient-oriented
-question-answering. Our findings reveal that general-purpose LLMs frequently
-surpass biomedical-adapted models, while automated metrics correlate with human
-judgment. By releasing MeDiSumQA on PhysioNet, we aim to advance the
-development of LLMs to enhance patient understanding and ultimately improve
-care outcomes.
-
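-Summary: While increasing patients' access to medical documents improves medical care,
-this benefit is limited by varying health literacy levels and complex medical terminology. Large language models (LLMs) offer solutions by simplifying medical information. However, evaluating LLMs for safe and patient-friendly text generation is difficult due to the lack of standardized evaluation resources. To fill this gap, we developed MeDiSumQA. MeDiSumQA is a dataset created from MIMIC-IV discharge summaries through an automated pipeline that combines LLM-based question-answer generation with manual quality checks. We use this dataset to evaluate various LLMs on patient-oriented question-answering. Our findings reveal that general-purpose LLMs frequently surpass biomedical-adapted models, while automated metrics correlate with human judgment. By releasing MeDiSumQA on PhysioNet, we aim to advance the development of LLMs to enhance patient understanding and ultimately improve care outcomes.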
-
-##### **Deep Learning Pipeline for Fully Automated Myocardial Infarct Segmentation from Clinical Cardiac MR Scans**
-2502.03272v1 by Matthias Schwab, Mathias Pamminger, Christian Kremser, Agnes Mayr
-
-Purpose: To develop and evaluate a deep learning-based method that performs
-myocardial infarct segmentation in a fully automated way.
-  Materials and Methods: For this retrospective study, a cascaded framework of
-two and three-dimensional convolutional neural networks (CNNs), specialized on
-identifying ischemic myocardial scars on late gadolinium enhancement (LGE)
-cardiac magnetic resonance (CMR) images, was trained on an in-house training
-dataset consisting of 144 examinations. On a separate test dataset from the
-same institution, including images from 152 examinations obtained between 2021
-and 2023, a quantitative comparison between artificial intelligence (AI)-based
-segmentations and manual segmentations was performed. Further, qualitative
-assessment of segmentation accuracy was evaluated for both human and
-AI-generated contours by two CMR experts in a blinded experiment.
-  Results: Excellent agreement could be found between manually and
-automatically calculated infarct volumes ($\rho_c$ = 0.9). The qualitative
-evaluation showed that compared to human-based measurements, the experts rated
-the AI-based segmentations to better represent the actual extent of infarction
-significantly (p < 0.001) more often (33.4% AI, 25.1% human, 41.5% equal). On
-the contrary, for segmentation of microvascular obstruction (MVO), manual
-measurements were still preferred (11.3% AI, 55.6% human, 33.1% equal).
-  Conclusion: This fully-automated segmentation pipeline enables CMR infarct
-size to be calculated in a very short time and without requiring any
-pre-processing of the input images while matching the segmentation quality of
-trained human observers. In a blinded experiment, experts preferred automated
-infarct segmentations more often than manual segmentations, paving the way for
-a potential clinical application.
-
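-Summary: Purpose: To develop and evaluate a deep learning-based method that performs myocardial infarct segmentation in a fully automated way.
-Materials and Methods: For this retrospective study, a cascaded framework of two- and three-dimensional convolutional neural networks (CNNs), specialized in identifying ischemic myocardial scars on late gadolinium enhancement (LGE) cardiac magnetic resonance (CMR) images, was trained on an in-house training dataset consisting of 144 examinations. On a separate test dataset from the same institution, including images from 152 examinations obtained between 2021 and 2023, a quantitative comparison between artificial intelligence (AI)-based segmentations and manual segmentations was performed. Further, the segmentation accuracy of both human and AI-generated contours was assessed qualitatively by two CMR experts in a blinded experiment.
-Results: Excellent agreement was found between manually and automatically calculated infarct volumes ($\rho_c$ = 0.9). The qualitative evaluation showed that, compared to human-based measurements, the experts rated the AI-based segmentations as better representing the actual extent of infarction significantly (p < 0.001) more often (33.4% AI, 25.1% human, 41.5% equal). In contrast, for segmentation of microvascular obstruction (MVO), manual measurements were still preferred (11.3% AI, 55.6% human, 33.1% equal).
-Conclusion: This fully automated segmentation pipeline enables CMR infarct size to be calculated in a very short time and without any pre-processing of the input images, while matching the segmentation quality of trained human observers. In a blinded experiment, experts preferred automated infarct segmentations more often than manual segmentations, paving the way for a potential clinical application.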
-
-##### **Long-tailed Medical Diagnosis with Relation-aware Representation Learning and Iterative Classifier Calibration**
-2502.03238v2 by Li Pan, Yupei Zhang, Qiushi Yang, Tan Li, Zhen Chen
-
-Recently computer-aided diagnosis has demonstrated promising performance,
-effectively alleviating the workload of clinicians. However, the inherent
-sample imbalance among different diseases biases algorithms toward the
-majority categories, leading to poor performance for rare categories. Existing
-works formulated this challenge as a long-tailed problem and attempted to
-tackle it by decoupling the feature representation and classification. Yet, due
-to the imbalanced distribution and limited samples from tail classes, these
-works are prone to biased representation learning and insufficient classifier
-calibration. To tackle these problems, we propose a new Long-tailed Medical
-Diagnosis (LMD) framework for balanced medical image classification on
-long-tailed datasets. In the initial stage, we develop a Relation-aware
-Representation Learning (RRL) scheme to boost the representation ability by
-encouraging the encoder to capture intrinsic semantic features through
-different data augmentations. In the subsequent stage, we propose an Iterative
-Classifier Calibration (ICC) scheme to calibrate the classifier iteratively.
-This is achieved by generating a large number of balanced virtual features and
-fine-tuning the encoder using an Expectation-Maximization manner. The proposed
-ICC compensates for minority categories to facilitate unbiased classifier
-optimization while maintaining the diagnostic knowledge in majority classes.
-Comprehensive experiments on three public long-tailed medical datasets
-demonstrate that our LMD framework significantly surpasses state-of-the-art
-approaches. The source code can be accessed at
-https://github.com/peterlipan/LMD.
-
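-Summary: Recently, computer-aided diagnosis has demonstrated promising performance, effectively alleviating the workload of clinicians. However, the inherent sample imbalance among different diseases biases algorithms toward the majority categories, leading to poor performance on rare categories. Existing works formulate this challenge as a long-tailed problem and attempt to tackle it by decoupling feature representation and classification. Yet, due to the imbalanced distribution and the limited samples from tail classes, these works are prone to biased representation learning and insufficient classifier calibration. To tackle these problems, we propose a new Long-tailed Medical Diagnosis (LMD) framework for balanced medical image classification on long-tailed datasets. In the initial stage, we develop a Relation-aware Representation Learning (RRL) scheme that boosts representation ability by encouraging the encoder to capture intrinsic semantic features through different data augmentations. In the subsequent stage, we propose an Iterative Classifier Calibration (ICC) scheme to calibrate the classifier iteratively. This is achieved by generating a large number of balanced virtual features and fine-tuning the encoder in an Expectation-Maximization manner. The proposed ICC compensates for minority categories to facilitate unbiased classifier optimization while maintaining the diagnostic knowledge in majority classes. Comprehensive experiments on three public long-tailed medical datasets demonstrate that our LMD framework significantly surpasses state-of-the-art approaches. The source code can be accessed at https://github.com/peterlipan/LMD.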
-
-##### **Fine-Tuning Strategies for Continual Online EEG Motor Imagery Decoding: Insights from a Large-Scale Longitudinal Study**
-2502.06828v1 by Martin Wimpff, Bruno Aristimunha, Sylvain Chevallier, Bin Yang
-
-This study investigates continual fine-tuning strategies for deep learning in
-online longitudinal electroencephalography (EEG) motor imagery (MI) decoding
-within a causal setting involving a large user group and multiple sessions per
-participant. We are the first to explore such strategies across a large user
-group, as longitudinal adaptation is typically studied in the single-subject
-setting with a single adaptation strategy, which limits the ability to
-generalize findings. First, we examine the impact of different fine-tuning
-approaches on decoder performance and stability. Building on this, we integrate
-online test-time adaptation (OTTA) to adapt the model during deployment,
-complementing the effects of prior fine-tuning. Our findings demonstrate that
-fine-tuning that successively builds on prior subject-specific information
-improves both performance and stability, while OTTA effectively adapts the
-model to evolving data distributions across consecutive sessions, enabling
-calibration-free operation. These results offer valuable insights and
-recommendations for future research in longitudinal online MI decoding and
-highlight the importance of combining domain adaptation strategies for
-improving BCI performance in real-world applications. Clinical Relevance: Our
-investigation enables more stable and efficient long-term motor imagery
-decoding, which is critical for neurorehabilitation and assistive technologies.
-
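-Summary: This study investigates continual fine-tuning strategies for deep learning in online longitudinal electroencephalography (EEG) motor imagery (MI) decoding within a causal setting involving a large user group and multiple sessions per participant. We are the first to explore such strategies across a large user group, as longitudinal adaptation is typically studied in the single-subject setting with a single adaptation strategy, which limits the ability to generalize findings. First, we examine the impact of different fine-tuning approaches on decoder performance and stability. Building on this, we integrate online test-time adaptation (OTTA) to adapt the model during deployment, complementing the effects of prior fine-tuning. Our findings demonstrate that fine-tuning that successively builds on prior subject-specific information improves both performance and stability, while OTTA effectively adapts the model to evolving data distributions across consecutive sessions, enabling calibration-free operation. These results offer valuable insights and recommendations for future research in longitudinal online MI decoding and highlight the importance of combining domain adaptation strategies for improving BCI performance in real-world applications. Clinical relevance: Our investigation enables more stable and efficient long-term motor imagery decoding, which is critical for neurorehabilitation and assistive technologies.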
-
-##### **MedBioLM: Optimizing Medical and Biological QA with Fine-Tuned Large Language Models and Retrieval-Augmented Generation**
-2502.03004v1 by Seonok Kim
-
-Large Language Models (LLMs) have demonstrated impressive capabilities across
-natural language processing tasks. However, their application to specialized
-domains such as medicine and biology requires further optimization to ensure
-factual accuracy, reliability, and contextual depth. We introduce MedBioLM, a
-domain-adapted biomedical question-answering model designed to enhance both
-short-form and long-form queries. By integrating fine-tuning and
-retrieval-augmented generation (RAG), MedBioLM dynamically incorporates
-domain-specific knowledge, improving reasoning abilities and factual accuracy.
-To evaluate its effectiveness, we fine-tuned the model on diverse biomedical QA
-datasets, covering structured multiple-choice assessments and complex clinical
-reasoning tasks. Fine-tuning significantly improves accuracy on benchmark
-datasets, while RAG enhances factual consistency. These results highlight the
-potential of domain-optimized LLMs in advancing biomedical research, medical
-education, and clinical decision support.
-
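-Summary: Large Language Models (LLMs) have demonstrated impressive capabilities across natural language processing tasks. However, applying them to specialized domains such as medicine and biology requires further optimization to ensure factual accuracy, reliability, and contextual depth. We introduce MedBioLM, a domain-adapted biomedical question-answering model designed to enhance both short-form and long-form queries. By integrating fine-tuning and retrieval-augmented generation (RAG), MedBioLM dynamically incorporates domain-specific knowledge, improving reasoning abilities and factual accuracy. To evaluate its effectiveness, we fine-tuned the model on diverse biomedical QA datasets, covering structured multiple-choice assessments and complex clinical reasoning tasks. Fine-tuning significantly improves accuracy on benchmark datasets, while RAG enhances factual consistency. These results highlight the potential of domain-optimized LLMs in advancing biomedical research, medical education, and clinical decision support.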
-
-
-### LLM
-|Publish Date|Title|Authors|Homepage|Code|
-| :---: | :---: | :---: | :---: | :---: |
-|**2025-02-17**|**Diffusion Models without Classifier-free Guidance**|Zhicong Tang et.al.|[2502.12154v1](http://arxiv.org/abs/2502.12154v1)|[link](https://github.com/tzco/Diffusion-wo-CFG)|
-|**2025-02-17**|**Idiosyncrasies in Large Language Models**|Mingjie Sun et.al.|[2502.12150v1](http://arxiv.org/abs/2502.12150v1)|null|
-|**2025-02-17**|**HARBOR: Exploring Persona Dynamics in Multi-Agent Competition**|Kenan Jiang et.al.|[2502.12149v1](http://arxiv.org/abs/2502.12149v1)|null|
-|**2025-02-17**|**Fast or Better? Balancing Accuracy and Cost in Retrieval-Augmented Generation with Flexible User Control**|Jinyan Su et.al.|[2502.12145v1](http://arxiv.org/abs/2502.12145v1)|null|
-|**2025-02-17**|**Small Models Struggle to Learn from Strong Reasoners**|Yuetai Li et.al.|[2502.12143v1](http://arxiv.org/abs/2502.12143v1)|null|
-|**2025-02-17**|**SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs**|Yige Xu et.al.|[2502.12134v1](http://arxiv.org/abs/2502.12134v1)|null|
-|**2025-02-17**|**Transformer Dynamics: A neuroscientific approach to interpretability of large language models**|Jesseba Fernando et.al.|[2502.12131v1](http://arxiv.org/abs/2502.12131v1)|null|
-|**2025-02-17**|**Scaling Autonomous Agents via Automatic Reward Modeling And Planning**|Zhenfang Chen et.al.|[2502.12130v1](http://arxiv.org/abs/2502.12130v1)|null|
-|**2025-02-17**|**LaM-SLidE: Latent Space Modeling of Spatial Dynamical Systems via Linked Entities**|Florian Sestak et.al.|[2502.12128v1](http://arxiv.org/abs/2502.12128v1)|null|
-|**2025-02-17**|**On the Query Complexity of Verifier-Assisted Language Generation**|Edoardo Botta et.al.|[2502.12123v1](http://arxiv.org/abs/2502.12123v1)|null|
-|**2025-02-17**|**LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws**|Prasanna Mayilvahanan et.al.|[2502.12120v1](http://arxiv.org/abs/2502.12120v1)|null|
-|**2025-02-17**|**PRISM: Self-Pruning Intrinsic Selection Method for Training-Free Multimodal Data Selection**|Jinhe Bi et.al.|[2502.12119v1](http://arxiv.org/abs/2502.12119v1)|null|
-|**2025-02-17**|**Scaling Test-Time Compute Without Verification or RL is Suboptimal**|Amrith Setlur et.al.|[2502.12118v1](http://arxiv.org/abs/2502.12118v1)|null|
-|**2025-02-17**|**A-MEM: Agentic Memory for LLM Agents**|Wujiang Xu et.al.|[2502.12110v1](http://arxiv.org/abs/2502.12110v1)|null|
-|**2025-02-17**|**Personality Structured Interview for Large Language Model Simulation in Personality Research**|Pengda Wang et.al.|[2502.12109v1](http://arxiv.org/abs/2502.12109v1)|null|
-|**2025-02-17**|**Using the Path of Least Resistance to Explain Deep Networks**|Sina Salek et.al.|[2502.12108v1](http://arxiv.org/abs/2502.12108v1)|null|
-|**2025-02-17**|**Relational Norms for Human-AI Cooperation**|Brian D. Earp et.al.|[2502.12102v1](http://arxiv.org/abs/2502.12102v1)|null|
-|**2025-02-17**|**A Study on Leveraging Search and Self-Feedback for Agent Reasoning**|Karthikeyan K et.al.|[2502.12094v1](http://arxiv.org/abs/2502.12094v1)|null|
-|**2025-02-17**|**Meta-Statistical Learning: Supervised Learning of Statistical Inference**|Maxime Peyrard et.al.|[2502.12088v1](http://arxiv.org/abs/2502.12088v1)|null|
-|**2025-02-17**|**APB: Accelerating Distributed Long-Context Inference by Passing Compressed Context Blocks across GPUs**|Yuxiang Huang et.al.|[2502.12085v1](http://arxiv.org/abs/2502.12085v1)|null|
-|**2025-02-17**|**VLM$^2$-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit Matching Visual Cues**|Jianshu Zhang et.al.|[2502.12084v1](http://arxiv.org/abs/2502.12084v1)|null|
-|**2025-02-17**|**AdaSplash: Adaptive Sparse Flash Attention**|Nuno Gonçalves et.al.|[2502.12082v1](http://arxiv.org/abs/2502.12082v1)|null|
-|**2025-02-17**|**Unhackable Temporal Rewarding for Scalable Video MLLMs**|En Yu et.al.|[2502.12081v1](http://arxiv.org/abs/2502.12081v1)|null|
-|**2025-02-17**|**Can LLMs Simulate Social Media Engagement? A Study on Action-Guided Response Generation**|Zhongyi Qiu et.al.|[2502.12073v1](http://arxiv.org/abs/2502.12073v1)|null|
-|**2025-02-17**|**TokenSkip: Controllable Chain-of-Thought Compression in LLMs**|Heming Xia et.al.|[2502.12067v1](http://arxiv.org/abs/2502.12067v1)|null|
-|**2025-02-17**|**CONSTRUCTA: Automating Commercial Construction Schedules in Fabrication Facilities with Large Language Models**|Yifan Zhang et.al.|[2502.12066v1](http://arxiv.org/abs/2502.12066v1)|null|
-|**2025-02-17**|**Formalizing Complex Mathematical Statements with LLMs: A Study on Mathematical Definitions**|Lan Zhang et.al.|[2502.12065v1](http://arxiv.org/abs/2502.12065v1)|null|
-|**2025-02-17**|**AI-generated Text Detection with a GLTR-based Approach**|Lucía Yan Wu et.al.|[2502.12064v1](http://arxiv.org/abs/2502.12064v1)|null|
-|**2025-02-17**|**Culture is Not Trivia: Sociocultural Theory for Cultural NLP**|Naitian Zhou et.al.|[2502.12057v1](http://arxiv.org/abs/2502.12057v1)|null|
-|**2025-02-17**|**Designing Role Vectors to Improve LLM Inference Behaviour**|Daniele Potertì et.al.|[2502.12055v1](http://arxiv.org/abs/2502.12055v1)|null|
-|**2025-02-17**|**PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning**|Xinyu Zhang et.al.|[2502.12054v1](http://arxiv.org/abs/2502.12054v1)|null|
-|**2025-02-17**|**A Dual-Perspective NLG Meta-Evaluation Framework with Automatic Benchmark and Better Interpretability**|Xinyu Hu et.al.|[2502.12052v1](http://arxiv.org/abs/2502.12052v1)|null|
-|**2025-02-17**|**How to Upscale Neural Networks with Scaling Law? A Survey and Practical Guidelines**|Ayan Sengupta et.al.|[2502.12051v1](http://arxiv.org/abs/2502.12051v1)|null|
-|**2025-02-17**|**SpeechT: Findings of the First Mentorship in Speech Translation**|Yasmin Moslem et.al.|[2502.12050v1](http://arxiv.org/abs/2502.12050v1)|null|
-|**2025-02-17**|**A Survey on Bridging EEG Signals and Generative AI: From Image and Text to Beyond**|Shreya Shukla et.al.|[2502.12048v1](http://arxiv.org/abs/2502.12048v1)|null|
-|**2025-02-17**|**KnowPath: Knowledge-enhanced Reasoning via LLM-generated Inference Paths over Knowledge Graphs**|Qi Zhao et.al.|[2502.12029v1](http://arxiv.org/abs/2502.12029v1)|null|
-|**2025-02-17**|**SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities**|Fengqing Jiang et.al.|[2502.12025v1](http://arxiv.org/abs/2502.12025v1)|null|
-|**2025-02-17**|**Teaching LLMs According to Their Aptitude: Adaptive Reasoning for Mathematical Problem Solving**|Xin Xu et.al.|[2502.12022v1](http://arxiv.org/abs/2502.12022v1)|null|
-|**2025-02-17**|**Atom of Thoughts for Markov LLM Test-Time Scaling**|Fengwei Teng et.al.|[2502.12018v1](http://arxiv.org/abs/2502.12018v1)|null|
-|**2025-02-17**|**Demographic Attributes Prediction from Speech Using WavLM Embeddings**|Yuchen Yang et.al.|[2502.12007v1](http://arxiv.org/abs/2502.12007v1)|null|
-|**2025-02-17**|**Merging Language and Domain Specific Models: The Impact on Technical Vocabulary Acquisition**|Thibault Rousset et.al.|[2502.12001v1](http://arxiv.org/abs/2502.12001v1)|null|
-|**2025-02-17**|**Presumed Cultural Identity: How Names Shape LLM Responses**|Siddhesh Pawar et.al.|[2502.11995v1](http://arxiv.org/abs/2502.11995v1)|null|
-|**2025-02-17**|**Characterizing Photorealism and Artifacts in Diffusion Model-Generated Images**|Negar Kamali et.al.|[2502.11989v1](http://arxiv.org/abs/2502.11989v1)|null|
-|**2025-02-17**|**Generating Text from Uniform Meaning Representation**|Emma Markle et.al.|[2502.11973v1](http://arxiv.org/abs/2502.11973v1)|null|
-|**2025-02-17**|**Learning Generalizable Prompt for CLIP with Class Similarity Knowledge**|Sehun Jung et.al.|[2502.11969v1](http://arxiv.org/abs/2502.11969v1)|null|
-|**2025-02-17**|**A MIMO Wireless Channel Foundation Model via CIR-CSI Consistency**|Jun Jiang et.al.|[2502.11965v1](http://arxiv.org/abs/2502.11965v1)|null|
-|**2025-02-17**|**Navigating the Helpfulness-Truthfulness Trade-Off with Uncertainty-Aware Instruction Fine-Tuning**|Tianyi Wu et.al.|[2502.11962v1](http://arxiv.org/abs/2502.11962v1)|null|
-|**2025-02-17**|**STRIVE: Structured Reasoning for Self-Improvement in Claim Verification**|Haisong Gong et.al.|[2502.11959v1](http://arxiv.org/abs/2502.11959v1)|null|
-|**2025-02-17**|**Can Your Uncertainty Scores Detect Hallucinated Entity?**|Min-Hsuan Yeh et.al.|[2502.11948v1](http://arxiv.org/abs/2502.11948v1)|null|
-|**2025-02-17**|**Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction**|Ailin Huang et.al.|[2502.11946v1](http://arxiv.org/abs/2502.11946v1)|null|
-|**2025-02-17**|**Deep Spatio-Temporal Neural Network for Air Quality Reanalysis**|Ammar Kheder et.al.|[2502.11941v1](http://arxiv.org/abs/2502.11941v1)|null|
-|**2025-02-17**|**FitLight: Federated Imitation Learning for Plug-and-Play Autonomous Traffic Signal Control**|Yutong Ye et.al.|[2502.11937v1](http://arxiv.org/abs/2502.11937v1)|null|
-|**2025-02-17**|**On Representational Dissociation of Language and Arithmetic in Large Language Models**|Riku Kisako et.al.|[2502.11932v1](http://arxiv.org/abs/2502.11932v1)|null|
-|**2025-02-17**|**BRIGHTER: BRIdging the Gap in Human-Annotated Textual Emotion Recognition Datasets for 28 Languages**|Shamsuddeen Hassan Muhammad et.al.|[2502.11926v1](http://arxiv.org/abs/2502.11926v1)|null|
-|**2025-02-17**|**GRAPHGPT-O: Synergistic Multimodal Comprehension and Generation on Graphs**|Yi Fang et.al.|[2502.11925v1](http://arxiv.org/abs/2502.11925v1)|null|
-|**2025-02-17**|**From Text to Trust: Empowering AI-assisted Decision Making with Adaptive LLM-powered Analysis**|Zhuoyan Li et.al.|[2502.11919v1](http://arxiv.org/abs/2502.11919v1)|null|
-|**2025-02-17**|**EssayJudge: A Multi-Granular Benchmark for Assessing Automated Essay Scoring Capabilities of Multimodal Large Language Models**|Jiamin Su et.al.|[2502.11916v1](http://arxiv.org/abs/2502.11916v1)|null|
-|**2025-02-17**|**On the robustness of ChatGPT in teaching Korean Mathematics**|Phuong-Nam Nguyen et.al.|[2502.11915v1](http://arxiv.org/abs/2502.11915v1)|null|
-|**2025-02-17**|**MMRC: A Large-Scale Benchmark for Understanding Multimodal Large Language Model in Real-World Conversation**|Haochen Xue et.al.|[2502.11903v1](http://arxiv.org/abs/2502.11903v1)|null|
-|**2025-02-17**|**Building A Proof-Oriented Programmer That Is 64% Better Than GPT-4o Under Data Scarsity**|Dylan Zhang et.al.|[2502.11901v1](http://arxiv.org/abs/2502.11901v1)|null|
-|**2025-02-17**|**DLFR-VAE: Dynamic Latent Frame Rate VAE for Video Generation**|Zhihang Yuan et.al.|[2502.11897v1](http://arxiv.org/abs/2502.11897v1)|null|
-|**2025-02-17**|**CAMEL: Continuous Action Masking Enabled by Large Language Models for Reinforcement Learning**|Yanxiao Zhao et.al.|[2502.11896v1](http://arxiv.org/abs/2502.11896v1)|null|
-|**2025-02-17**|**Continual Quantization-Aware Pre-Training: When to transition from 16-bit to 1.58-bit pre-training for BitNet language models?**|Jacob Nielsen et.al.|[2502.11895v1](http://arxiv.org/abs/2502.11895v1)|null|
-|**2025-02-17**|**Revisiting Classification Taxonomy for Grammatical Errors**|Deqing Zou et.al.|[2502.11890v1](http://arxiv.org/abs/2502.11890v1)|null|
-|**2025-02-17**|**Stonefish: Supporting Machine Learning Research in Marine Robotics**|Michele Grimaldi et.al.|[2502.11887v1](http://arxiv.org/abs/2502.11887v1)|null|
-|**2025-02-17**|**LIMR: Less is More for RL Scaling**|Xuefeng Li et.al.|[2502.11886v1](http://arxiv.org/abs/2502.11886v1)|null|
-|**2025-02-17**|**Leveraging Dual Process Theory in Language Agent Framework for Real-time Simultaneous Human-AI Collaboration**|Shao Zhang et.al.|[2502.11882v1](http://arxiv.org/abs/2502.11882v1)|null|
-|**2025-02-17**|**Hypothesis-Driven Theory-of-Mind Reasoning for Large Language Models**|Hyunwoo Kim et.al.|[2502.11881v1](http://arxiv.org/abs/2502.11881v1)|null|
-|**2025-02-17**|**Bitnet.cpp: Efficient Edge Inference for Ternary LLMs**|Jinheng Wang et.al.|[2502.11880v1](http://arxiv.org/abs/2502.11880v1)|null|
-|**2025-02-17**|**VAQUUM: Are Vague Quantifiers Grounded in Visual Data?**|Hugh Mee Wong et.al.|[2502.11874v1](http://arxiv.org/abs/2502.11874v1)|null|
-|**2025-02-17**|**Southern Newswire Corpus: A Large-Scale Dataset of Mid-Century Wire Articles Beyond the Front Page**|Michael McRae et.al.|[2502.11866v1](http://arxiv.org/abs/2502.11866v1)|null|
-|**2025-02-17**|**FedEAT: A Robustness Optimization Framework for Federated LLMs**|Yahao Pang et.al.|[2502.11863v1](http://arxiv.org/abs/2502.11863v1)|null|
-|**2025-02-17**|**Understanding In-Context Machine Translation for Low-Resource Languages: A Case Study on Manchu**|Renhao Pei et.al.|[2502.11862v1](http://arxiv.org/abs/2502.11862v1)|null|
-|**2025-02-17**|**Exploring Large Language Models in Healthcare: Insights into Corpora Sources, Customization Strategies, and Evaluation Metrics**|Shuqi Yang et.al.|[2502.11861v1](http://arxiv.org/abs/2502.11861v1)|null|
-|**2025-02-17**|**Defining and Evaluating Visual Language Models' Basic Spatial Abilities: A Perspective from Psychometrics**|Wenrui Xu et.al.|[2502.11859v1](http://arxiv.org/abs/2502.11859v1)|null|
-|**2025-02-17**|**LLMs as a synthesis between symbolic and continuous approaches to language**|Gemma Boleda et.al.|[2502.11856v1](http://arxiv.org/abs/2502.11856v1)|null|
-|**2025-02-17**|**BaxBench: Can LLMs Generate Correct and Secure Backends?**|Mark Vero et.al.|[2502.11844v1](http://arxiv.org/abs/2502.11844v1)|null|
-|**2025-02-17**|**Can LLM Agents Maintain a Persona in Discourse?**|Pranav Bhandari et.al.|[2502.11843v1](http://arxiv.org/abs/2502.11843v1)|null|
-|**2025-02-17**|**ChordFormer: A Conformer-Based Architecture for Large-Vocabulary Audio Chord Recognition**|Muhammad Waseem Akram et.al.|[2502.11840v1](http://arxiv.org/abs/2502.11840v1)|null|
-|**2025-02-17**|**Intuitive physics understanding emerges from self-supervised pretraining on natural videos**|Quentin Garrido et.al.|[2502.11831v1](http://arxiv.org/abs/2502.11831v1)|null|
-|**2025-02-17**|**Text Classification in the LLM Era - Where do we stand?**|Sowmya Vajjala et.al.|[2502.11830v1](http://arxiv.org/abs/2502.11830v1)|null|
-|**2025-02-17**|**Code-Vision: Evaluating Multimodal LLMs Logic Understanding and Code Generation Capabilities**|Hanbin Wang et.al.|[2502.11829v1](http://arxiv.org/abs/2502.11829v1)|null|
-|**2025-02-17**|**M-ABSA: A Multilingual Dataset for Aspect-Based Sentiment Analysis**|Chengyan Wu et.al.|[2502.11824v1](http://arxiv.org/abs/2502.11824v1)|null|
-|**2025-02-17**|**AAKT: Enhancing Knowledge Tracing with Alternate Autoregressive Modeling**|Hao Zhou et.al.|[2502.11817v1](http://arxiv.org/abs/2502.11817v1)|null|
-|**2025-02-17**|**Towards Understanding Fine-Tuning Mechanisms of LLMs via Circuit Analysis**|Xu Wang et.al.|[2502.11812v1](http://arxiv.org/abs/2502.11812v1)|null|
-|**2025-02-17**|**FineFilter: A Fine-grained Noise Filtering Mechanism for Retrieval-Augmented Large Language Models**|Qianchi Zhang et.al.|[2502.11811v1](http://arxiv.org/abs/2502.11811v1)|null|
-|**2025-02-17**|**Revealing Bias Formation in Deep Neural Networks Through the Geometric Mechanisms of Human Visual Decoupling**|Yanbiao Ma et.al.|[2502.11809v1](http://arxiv.org/abs/2502.11809v1)|null|
-|**2025-02-17**|**Exploring Translation Mechanism of Large Language Models**|Hongbin Zhang et.al.|[2502.11806v1](http://arxiv.org/abs/2502.11806v1)|null|
-|**2025-02-17**|**Table-Critic: A Multi-Agent Framework for Collaborative Criticism and Refinement in Table Reasoning**|Peiying Yu et.al.|[2502.11799v1](http://arxiv.org/abs/2502.11799v1)|null|
-|**2025-02-17**|**Personality Editing for Language Models through Relevant Knowledge Editing**|Seojin Hwang et.al.|[2502.11789v1](http://arxiv.org/abs/2502.11789v1)|null|
-|**2025-02-17**|**Efficient Response Generation Method Selection for Fine-Tuning Large Language Models**|Xuan Ren et.al.|[2502.11779v1](http://arxiv.org/abs/2502.11779v1)|null|
-|**2025-02-17**|**Deep Neural Networks for Accurate Depth Estimation with Latent Space Features**|Siddiqui Muhammad Yasir et.al.|[2502.11777v1](http://arxiv.org/abs/2502.11777v1)|null|
-|**2025-02-17**|**The Validation Gap: A Mechanistic Analysis of How Language Models Compute Arithmetic but Fail to Validate It**|Leonardo Bertolazzi et.al.|[2502.11771v1](http://arxiv.org/abs/2502.11771v1)|null|
-|**2025-02-17**|**Cognitive-Aligned Document Selection for Retrieval-augmented Generation**|Bingyu Wan et.al.|[2502.11770v1](http://arxiv.org/abs/2502.11770v1)|null|
-|**2025-02-17**|**From Selection to Generation: A Survey of LLM-based Active Learning**|Yu Xia et.al.|[2502.11767v1](http://arxiv.org/abs/2502.11767v1)|null|
-|**2025-02-17**|**Warmup-Distill: Bridge the Distribution Mismatch between Teacher and Student before Knowledge Distillation**|Zengkui Sun et.al.|[2502.11766v1](http://arxiv.org/abs/2502.11766v1)|null|
-|**2025-02-17**|**Lightweight Deepfake Detection Based on Multi-Feature Fusion**|Siddiqui Muhammad Yasir et.al.|[2502.11763v1](http://arxiv.org/abs/2502.11763v1)|null|
-|**2025-02-17**|**HintsOfTruth: A Multimodal Checkworthiness Detection Dataset with Real and Synthetic Claims**|Michiel van der Meer et.al.|[2502.11753v1](http://arxiv.org/abs/2502.11753v1)|null|
-|**2025-02-17**|**Language Models Can See Better: Visual Contrastive Decoding For LLM Multimodal Reasoning**|Yuqi Pang et.al.|[2502.11751v1](http://arxiv.org/abs/2502.11751v1)|null|
-|**2025-02-17**|**SQL-o1: A Self-Reward Heuristic Dynamic Search Method for Text-to-SQL**|Shuai Lyu et.al.|[2502.11741v1](http://arxiv.org/abs/2502.11741v1)|null|
-
-#### Abstracts
-##### **Diffusion Models without Classifier-free Guidance**
-2502.12154v1 by Zhicong Tang, Jianmin Bao, Dong Chen, Baining Guo
-
-This paper presents Model-guidance (MG), a novel objective for training
-diffusion models that addresses and removes the need for the commonly used Classifier-free
-guidance (CFG). Our innovative approach transcends the standard modeling of
-solely data distribution to incorporate the posterior probability of
-conditions. The proposed technique originates from the idea of CFG and is easy
-yet effective, making it a plug-and-play module for existing models. Our method
-significantly accelerates the training process, doubles the inference speed,
-and achieves exceptional quality that parallels and even surpasses concurrent
-diffusion models with CFG. Extensive experiments demonstrate the effectiveness,
-efficiency, and scalability on different models and datasets. Finally, we establish
-state-of-the-art performance on ImageNet 256 benchmarks with an FID of 1.34.
-Our code is available at https://github.com/tzco/Diffusion-wo-CFG.
-
-摘要：本文提出模型指導 (MG)，一種用於訓練擴散模型的新目標，它解決並消除了常用的無分類器指導 (CFG)。我們的創新方法超越了僅數據分佈的標準建模，並納入了條件的後驗機率。提議的技術源自 CFG 的概念，既簡單又有效，使其成為現有模型的即插即用模組。我們的技術顯著加速了訓練過程，將推論速度提高了一倍，並取得了與 CFG 並行甚至超越並行擴散模型的出色品質。廣泛的實驗證明了該技術在不同模型和資料集上的有效性、效率和可擴充性。最後，我們在 ImageNet 256 基準上建立了最先進的效能，FID 為 1.34。我們的程式碼可在 https://github.com/tzco/Diffusion-wo-CFG 取得。
-
-##### **Idiosyncrasies in Large Language Models**
-2502.12150v1 by Mingjie Sun, Yida Yin, Zhiqiu Xu, J. Zico Kolter, Zhuang Liu
-
-In this work, we unveil and study idiosyncrasies in Large Language Models
-(LLMs) -- unique patterns in their outputs that can be used to distinguish the
-models. To do so, we consider a simple classification task: given a particular
-text output, the objective is to predict the source LLM that generates the
-text. We evaluate this synthetic task across various groups of LLMs and find
-that simply fine-tuning existing text embedding models on LLM-generated texts
-yields excellent classification accuracy. Notably, we achieve 97.1% accuracy on
-held-out validation data in the five-way classification problem involving
-ChatGPT, Claude, Grok, Gemini, and DeepSeek. Our further investigation reveals
-that these idiosyncrasies are rooted in word-level distributions. These
-patterns persist even when the texts are rewritten, translated, or summarized
-by an external LLM, suggesting that they are also encoded in the semantic
-content. Additionally, we leverage LLMs as judges to generate detailed,
-open-ended descriptions of each model's idiosyncrasies. Finally, we discuss the
-broader implications of our findings, particularly for training on synthetic
-data and inferring model similarity. Code is available at
-https://github.com/locuslab/llm-idiosyncrasies.
-
-摘要：在這項工作中，我們揭示並研究了大型語言模型 (LLM) 中的特殊性，也就是其輸出中可區分模型的獨特模式。為此，我們考慮了一項簡單的分類任務：給定一個特定文本輸出，目標是預測產生該文本的來源 LLM。我們在各種 LLM 組合中評估這個合成任務，並發現僅微調現有的文本嵌入模型在 LLM 生成的文本上即可產生極佳的分類準確度。值得注意的是，在涉及 ChatGPT、Claude、Grok、Gemini 和 DeepSeek 的五向分類問題中，我們在留存驗證資料上達到了 97.1% 的準確度。我們的進一步調查顯示，這些特殊性根植於詞彙層級的分布。即使文本是由外部 LLM 改寫、翻譯或摘要，這些模式仍然存在，這表明它們也編碼在語義內容中。此外，我們利用 LLM 作為評審，為每個模型的特殊性產生詳細、開放式的描述。最後，我們討論了我們發現的更廣泛含意，特別是對於合成資料的訓練和推斷模型相似性。程式碼可在 https://github.com/locuslab/llm-idiosyncrasies 取得。
-
-##### **HARBOR: Exploring Persona Dynamics in Multi-Agent Competition**
-2502.12149v1 by Kenan Jiang, Li Xiong, Fei Liu
-
-We investigate factors contributing to LLM agents' success in competitive
-multi-agent environments, using auctions as a testbed where agents bid to
-maximize profit. The agents are equipped with bidding domain knowledge,
-distinct personas that reflect item preferences, and a memory of auction
-history. Our work extends the classic auction scenario by creating a realistic
-environment where multiple agents bid on houses, weighing aspects such as size,
-location, and budget to secure the most desirable homes at the lowest prices.
-Particularly, we investigate three key questions: (a) How does a persona
-influence an agent's behavior in a competitive setting? (b) Can an agent
-effectively profile its competitors' behavior during auctions? (c) How can
-persona profiling be leveraged to create an advantage using strategies such as
-theory of mind? Through a series of experiments, we analyze the behaviors of
-LLM agents and shed light on new findings. Our testbed, called HARBOR, offers a
-valuable platform for deepening our understanding of multi-agent workflows in
-competitive environments.
-
-摘要：我們研究促成 LLM 代理在競爭性多代理環境中成功的因素，使用拍賣作為測試平台，其中代理出價以最大化利潤。這些代理配備了競標領域知識、反映物品偏好的不同角色以及拍賣歷史的記憶。我們的研究透過創造一個現實的環境來擴展經典的拍賣場景，在該環境中，多個代理對房屋出價，權衡大小、位置和預算等方面以最低價格確保最理想的房屋。特別是，我們研究了三個關鍵問題：(a) 角色如何在競爭環境中影響代理的行為？(b) 代理是否可以在拍賣期間有效地分析其競爭對手的行為？(c) 如何利用角色分析來利用心智理論等策略創造優勢？透過一系列實驗，我們分析 LLM 代理的行為並闡明新的發現。我們的測試平台稱為 HARBOR，它提供了一個有價值的平台，用於加深我們對競爭環境中多代理工作流程的理解。
-
-##### **Fast or Better? Balancing Accuracy and Cost in Retrieval-Augmented Generation with Flexible User Control**
-2502.12145v1 by Jinyan Su, Jennifer Healey, Preslav Nakov, Claire Cardie
-
-Retrieval-Augmented Generation (RAG) has emerged as a powerful approach to
-mitigate large language model (LLM) hallucinations by incorporating external
-knowledge retrieval. However, existing RAG frameworks often apply retrieval
-indiscriminately, leading to inefficiencies: over-retrieving when unnecessary or
-failing to retrieve iteratively when required for complex reasoning. Recent
-adaptive retrieval strategies, though they adaptively navigate these retrieval
-strategies, predict based only on query complexity and lack user-driven
-flexibility, making them infeasible for diverse user application needs. In this
-paper, we introduce a novel user-controllable RAG framework that enables
-dynamic adjustment of the accuracy-cost trade-off. Our approach leverages two
-classifiers: one trained to prioritize accuracy and another to prioritize
-retrieval efficiency. Via an interpretable control parameter $\alpha$, users
-can seamlessly navigate between minimal-cost retrieval and high-accuracy
-retrieval based on their specific requirements. We empirically demonstrate that
-our approach effectively balances accuracy, retrieval cost, and user
-controllability, making it a practical and adaptable solution for real-world
-applications.
-
-摘要：檢索增強生成 (RAG) 已成為一種強大的方法，可透過整合外部知識檢索來減輕大型語言模型 (LLM) 的幻覺。然而，現有的 RAG 框架經常不加區別地應用檢索，導致低效率，在不必要時過度檢索，或在複雜推理時無法反覆檢索。最近的自適應檢索策略，儘管自適應地導航這些檢索策略，但僅根據查詢複雜性進行預測，並且缺乏使用者驅動的靈活性，這使得它們無法滿足多樣化的使用者應用需求。在本文中，我們引入了一個新穎的使用者可控制 RAG 框架，它可以動態調整準確度成本權衡。我們的做法利用兩個分類器：一個訓練用於優先考慮準確度，另一個用於優先考慮檢索效率。透過可解釋的控制參數 $\alpha$，使用者可以在最低成本檢索和基於其特定需求的高準確度檢索之間無縫導航。我們通過實證證明，我們的做法有效地平衡了準確度、檢索成本和使用者可控性，使其成為現實世界應用中實用且適應性強的解決方案。
-
-##### **Small Models Struggle to Learn from Strong Reasoners**
-2502.12143v1 by Yuetai Li, Xiang Yue, Zhangchen Xu, Fengqing Jiang, Luyao Niu, Bill Yuchen Lin, Bhaskar Ramasubramanian, Radha Poovendran
-
-Large language models (LLMs) excel in complex reasoning tasks, and distilling
-their reasoning capabilities into smaller models has shown promise. However, we
-uncover an interesting phenomenon, which we term the Small Model Learnability
-Gap: small models ($\leq$3B parameters) do not consistently benefit from long
-chain-of-thought (CoT) reasoning or distillation from larger models. Instead,
-they perform better when fine-tuned on shorter, simpler reasoning chains that
-better align with their intrinsic learning capacity. To address this, we
-propose Mix Distillation, a simple yet effective strategy that balances
-reasoning complexity by combining long and short CoT examples or reasoning from
-both larger and smaller models. Our experiments demonstrate that Mix
-Distillation significantly improves small model reasoning performance compared
-to training on either data alone. These findings highlight the limitations of
-direct strong model distillation and underscore the importance of adapting
-reasoning complexity for effective reasoning capability transfer.
-
-摘要：大型語言模型 (LLM) 在複雜推理任務中表現出色，且將其推理能力提煉成較小的模型已展現前景。然而，我們發現了一個有趣的現象，我們稱之為小型模型可學習性差距：小型模型（參數數目 ≤ 3B）並非總能從大型模型的長鏈條思考 (CoT) 推理或提煉中受益。相反地，當針對較短、較簡單的推理鏈進行微調時，它們的表現會更好，而這更符合其內在學習能力。為了解決此問題，我們提出混合提煉，這是一種簡單但有效的策略，透過結合長短 CoT 範例或從較大及較小模型進行推理，來平衡推理的複雜性。我們的實驗證明，與僅針對任一資料進行訓練相比，混合提煉顯著改善了小型模型的推理效能。這些發現突顯了直接強模型提煉的限制，並強調了調整推理複雜性以有效轉移推理能力的重要性。
-
-##### **SoftCoT: Soft Chain-of-Thought for Efficient Reasoning with LLMs**
-2502.12134v1 by Yige Xu, Xu Guo, Zhiwei Zeng, Chunyan Miao
-
-Chain-of-Thought (CoT) reasoning enables Large Language Models (LLMs) to
-solve complex reasoning tasks by generating intermediate reasoning steps.
-However, most existing approaches focus on hard token decoding, which
-constrains reasoning within the discrete vocabulary space and may not always be
-optimal. While recent efforts explore continuous-space reasoning, they often
-suffer from catastrophic forgetting, limiting their applicability to
-state-of-the-art LLMs that already perform well in zero-shot settings with a
-proper instruction. To address this challenge, we propose a novel approach for
-continuous-space reasoning that does not require modifying the underlying LLM.
-Specifically, we employ a lightweight assistant model to generate
-instance-specific soft thought tokens speculatively as the initial chain of
-thoughts, which are then mapped into the LLM's representation space via a
-projection module. Experimental results on five reasoning benchmarks
-demonstrate that our method enhances LLM reasoning performance through
-supervised, parameter-efficient fine-tuning.
-
-摘要：鏈式思考 (CoT) 推理讓大型語言模型 (LLM) 能夠透過產生中間推理步驟來解決複雜的推理任務。然而，現有的大多數方法都專注於硬標記解碼，這會將推理限制在離散的詞彙空間內，而且可能並非總是最佳。雖然最近的研究探索了連續空間推理，但它們經常會遭遇災難性遺忘，這限制了它們在零次學習設置中表現良好的最先進 LLM 的適用性，且需要適當的說明。為了應對這項挑戰，我們提出了一種創新的連續空間推理方法，不需要修改底層的 LLM。具體來說，我們採用一個輕量級的輔助模型來產生特定於實例的軟思考標記，作為思考的初始鏈，然後透過投影模組將它們映射到 LLM 的表示空間。在五個推理基準上的實驗結果表明，我們的模型透過監督式、參數高效的微調，增強了 LLM 的推理效能。
-
-##### **Transformer Dynamics: A neuroscientific approach to interpretability of large language models**
-2502.12131v1 by Jesseba Fernando, Grigori Guitchounts
-
-As artificial intelligence models have exploded in scale and capability,
-understanding of their internal mechanisms remains a critical challenge.
-Inspired by the success of dynamical systems approaches in neuroscience, here
-we propose a novel framework for studying computations in deep learning
-systems. We focus on the residual stream (RS) in transformer models,
-conceptualizing it as a dynamical system evolving across layers. We find that
-activations of individual RS units exhibit strong continuity across layers,
-despite the RS being a non-privileged basis. Activations in the RS accelerate
-and grow denser over layers, while individual units trace unstable periodic
-orbits. In reduced-dimensional spaces, the RS follows a curved trajectory with
-attractor-like dynamics in the lower layers. These insights bridge dynamical
-systems theory and mechanistic interpretability, establishing a foundation for
-a "neuroscience of AI" that combines theoretical rigor with large-scale data
-analysis to advance our understanding of modern neural networks.
-
-摘要：隨著人工智慧模型在規模和能力上爆炸式增長，理解其內部機制仍然是一項嚴峻的挑戰。受到神經科學中動力系統方法成功的啟發，我們在此提出了一個新的框架來研究深度學習系統中的運算。我們專注於Transformer模型中的殘差流 (RS)，將其概念化為一個跨層演化的動態系統。我們發現儘管 RS 不是一個特權基礎，但個別 RS 單元的激活在各層之間表現出很強的連續性。RS 中的激活隨著層數的增加而加速並變得更密集，而個別單元則追蹤不穩定的週期軌道。在降維空間中，RS 遵循一個曲線軌跡，在較低層中具有類吸引子的動力學。這些見解橋接了動力系統理論和機制可解釋性，為「AI 神經科學」奠定了基礎，結合了理論嚴謹性和大規模數據分析，以增進我們對現代神經網路的理解。
-
-##### **Scaling Autonomous Agents via Automatic Reward Modeling And Planning**
-2502.12130v1 by Zhenfang Chen, Delin Chen, Rui Sun, Wenjun Liu, Chuang Gan
-
-Large language models (LLMs) have demonstrated remarkable capabilities across
-a range of text-generation tasks. However, LLMs still struggle with problems
-requiring multi-step decision-making and environmental feedback, such as online
-shopping, scientific reasoning, and mathematical problem-solving. Unlike pure
-text data, collecting large-scale decision-making data is challenging.
-Moreover, many powerful LLMs are only accessible through APIs, which hinders
-their fine-tuning for agent tasks due to cost and complexity. To address LLM
-agents' limitations, we propose a framework that can automatically learn a
-reward model from the environment without human annotations. This model can be
-used to evaluate the action trajectories of LLM agents and provide heuristics
-for task planning. Specifically, our approach involves employing one LLM-based
-agent to navigate an environment randomly, generating diverse action
-trajectories. Subsequently, a separate LLM is leveraged to assign a task intent
-and synthesize a negative response alongside the correct response for each
-trajectory. These triplets (task intent, positive response, and negative
-response) are then utilized as training data to optimize a reward model capable
-of scoring action trajectories. The effectiveness and generalizability of our
-framework are demonstrated through evaluations conducted on different agent
-benchmarks. In conclusion, our proposed framework represents a significant
-advancement in enhancing LLM agents' decision-making capabilities. By
-automating the learning of reward models, we overcome the challenges of data
-scarcity and API limitations, potentially revolutionizing the application of
-LLMs in complex and interactive environments. This research paves the way for
-more sophisticated AI agents capable of tackling a wide range of real-world
-problems requiring multi-step decision-making.
-
-摘要：大型語言模型 (LLM) 已在各種文字生成任務中展示出非凡的能力。然而，LLM 仍然在需要多步驟決策制定和環境回饋的問題上苦苦掙扎，例如網上購物、科學推理和數學問題求解。與純文本數據不同，收集大規模決策制定數據具有挑戰性。此外，許多強大的 LLM 只能通過 API 訪問，這由於成本和複雜性而阻礙了它們對代理任務的微調。為了解決 LLM 代理的局限性，我們提出了一個框架，該框架可以從環境中自動學習獎勵模型，而無需人工註釋。此模型可用于評估 LLM 代理的動作軌跡並為任務規劃提供啟發式方法。具體來說，我們的方法涉及使用一個基於 LLM 的代理隨機導航環境，生成不同的動作軌跡。隨後，利用一個單獨的 LLM 為每個軌跡分配任務意圖並合成一個負面響應以及正確的響應。然後將這些三元組（任務意圖、正面響應和負面響應）用作訓練數據，以優化能夠評分動作軌跡的獎勵模型。我們框架的有效性和普遍性通過在不同代理基準上進行的評估得到證明。總之，我們提出的框架代表了加強 LLM 代理決策能力的重大進步。通過自動化獎勵模型的學習，我們克服了數據稀缺和 API 限制的挑戰，有可能徹底改變 LLM 在複雜和互動環境中的應用。這項研究為更複雜的 AI 代理鋪平了道路，這些代理能夠解決需要多步驟決策制定的大量現實世界問題。
-
-##### **LaM-SLidE: Latent Space Modeling of Spatial Dynamical Systems via Linked Entities**
-2502.12128v1 by Florian Sestak, Artur Toshev, Andreas Fürst, Günter Klambauer, Andreas Mayr, Johannes Brandstetter
-
-Generative models are spearheading recent progress in deep learning, showing
-strong promise for trajectory sampling in dynamical systems as well. However,
-while latent space modeling paradigms have transformed image and video
-generation, similar approaches are more difficult for most dynamical systems.
-Such systems -- from chemical molecule structures to collective human behavior
--- are described by interactions of entities, making them inherently linked to
-connectivity patterns and the traceability of entities over time. Our approach,
-LaM-SLidE (Latent Space Modeling of Spatial Dynamical Systems via Linked
-Entities), combines the advantages of graph neural networks, i.e., the
-traceability of entities across time-steps, with the efficiency and scalability
-of recent advances in image and video generation, where pre-trained encoder and
-decoder are frozen to enable generative modeling in the latent space. The core
-idea of LaM-SLidE is to introduce identifier representations (IDs) to allow for
-retrieval of entity properties, e.g., entity coordinates, from latent system
-representations and thus enables traceability. Experimentally, across different
-domains, we show that LaM-SLidE performs favorably in terms of speed, accuracy,
-and generalizability. (Code is available at
-https://github.com/ml-jku/LaM-SLidE)
-
-摘要：生成模型引領深度學習的最新進展，也展現出在動態系統中進行軌跡取樣的強大前景。然而，儘管潛在空間建模範例已轉變圖像和影片生成，但對於大多數動態系統來說，類似的做法較為困難。此類系統（從化學分子結構到人類集體行為）由實體的交互作用所描述，使它們與連接模式和實體隨時間的追溯性產生固有聯繫。我們的做法 LaM-SLidE（透過連結實體進行空間動態系統的潛在空間建模）結合圖形神經網路的優點，亦即跨時間步長的實體追溯性，以及圖像和影片生成中近期進展的高效率和可擴充性，其中預先訓練的編碼器和解碼器被凍結以在潛在空間中啟用生成模型。LaM-SLidE 的核心概念是導入識別符號表示（ID），以允許從潛在系統表示中擷取實體屬性（例如實體座標），從而實現追溯性。透過不同領域的實驗，我們證明 LaM-SLidE 在速度、準確度和可概括性方面表現良好。（程式碼可在 https://github.com/ml-jku/LaM-SLidE 取得）
-
-##### **On the Query Complexity of Verifier-Assisted Language Generation**
-2502.12123v1 by Edoardo Botta, Yuchen Li, Aashay Mehta, Jordan T. Ash, Cyril Zhang, Andrej Risteski
-
-Recently, a plethora of works have proposed inference-time algorithms (e.g.
-best-of-n), which incorporate verifiers to assist the generation process. Their
-quality-efficiency trade-offs have been empirically benchmarked on a variety of
-constrained generation tasks, but the algorithmic design landscape is still
-largely poorly understood. In this paper, we develop a mathematical framework
-for reasoning about constrained generation using a pre-trained language model
-generator oracle and a process verifier--which can decide whether a prefix can
-be extended to a string which satisfies the constraints of choice. We show that
-even in very simple settings, access to a verifier can turn an intractable
-problem (information-theoretically or computationally) into a tractable one. In
-fact, we show even simple algorithms, like tokenwise rejection sampling, can
-enjoy significant benefits from access to a verifier. Empirically, we show that
-a natural modification of tokenwise rejection sampling, in which the sampler is
-allowed to "backtrack" (i.e., erase the final few generated tokens) has robust
-and substantive benefits over natural baselines (e.g. (blockwise) rejection
-sampling, nucleus sampling) in terms of computational efficiency,
-accuracy, and diversity.
-
-摘要：最近，许多作品提出了推理时间算法（例如 best-of-n），其中包含验证器以协助生成过程。它们的质量效率权衡已在各种受限生成任务中得到经验基准测试，但算法设计格局仍然很大程度上难以理解。在本文中，我们开发了一个数学框架，用于使用预训练语言模型生成器预言机和过程验证器推理受限生成——它可以决定是否可以将前缀扩展为满足选择约束的字符串。我们表明，即使在非常简单的设置中，访问验证器也可以将一个棘手的问题（信息论或计算）转换为一个易处理的问题。事实上，我们表明即使是简单的算法，如逐个标记拒绝采样，也可以从访问验证器中受益匪浅。凭经验，我们表明逐个标记拒绝采样的自然修改，其中允许采样器“回溯”（即，擦除最后几个生成的标记）比自然基线（例如（按块）拒绝采样、核采样）具有强大而实质性的优势——无论是在计算效率、准确性还是多样性方面。
-
-##### **LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws**
-2502.12120v1 by Prasanna Mayilvahanan, Thaddäus Wiedemer, Sayak Mallick, Matthias Bethge, Wieland Brendel
-
-Scaling laws guide the development of large language models (LLMs) by
-offering estimates for the optimal balance of model size, tokens, and compute.
-More recently, loss-to-loss scaling laws that relate losses across pretraining
-datasets and downstream tasks have emerged as a powerful tool for understanding
-and improving LLM performance. In this work, we investigate which factors most
-strongly influence loss-to-loss scaling. Our experiments reveal that the
-pretraining data and tokenizer determine the scaling trend. In contrast, model
-size, optimization hyperparameters, and even significant architectural
-differences, such as between transformer-based models like Llama and
-state-space models like Mamba, have limited impact. Consequently, practitioners
-should carefully curate suitable pretraining datasets for optimal downstream
-performance, while architectures and other settings can be freely optimized for
-training efficiency.
-
-摘要：規模化定律透過提供模型大小、符號和運算的最佳平衡估計，引導大型語言模型 (LLM) 的開發。最近，與預訓練資料集和下游任務相關的損失到損失縮放定律已成為了解和改善 LLM 效能的強大工具。在這項工作中，我們探討哪些因素最能影響損失到損失縮放。我們的實驗顯示，預訓練資料和分詞器會決定縮放趨勢。相反地，模型大小、最佳化超參數，甚至重大的架構差異（例如基於Transformer的模型，如 Llama，和狀態空間模型，如 Mamba 之間的差異）影響有限。因此，從業人員應仔細策劃適當的預訓練資料集以獲得最佳的下游效能，而架構和其他設定可以自由最佳化以提升訓練效率。
-
-##### **PRISM: Self-Pruning Intrinsic Selection Method for Training-Free Multimodal Data Selection**
-2502.12119v1 by Jinhe Bi, Yifan Wang, Danqi Yan, Xun Xiao, Artur Hecker, Volker Tresp, Yunpu Ma
-
-Visual instruction tuning refines pre-trained Multimodal Large Language
-Models (MLLMs) to enhance their real-world task performance. However, the rapid
-expansion of visual instruction datasets introduces significant data
-redundancy, leading to excessive computational costs. Existing data selection
-methods predominantly rely on proxy models or loss-based metrics, both of which
-impose substantial computational overheads due to the necessity of model
-inference and backpropagation. To address this challenge, we propose PRISM, a
-novel training-free approach for efficient multimodal data selection. Unlike
-existing methods, PRISM eliminates the reliance on proxy models, warm-up
-pretraining, and gradient-based optimization. Instead, it leverages Pearson
-correlation analysis to quantify the intrinsic visual encoding properties of
-MLLMs, computing a task-specific correlation score to identify high-value
-instances. This not only enables data-efficient selection, but also maintains the
-original performance. Empirical evaluations across multiple MLLMs demonstrate
-that PRISM reduces the overall time required for visual instruction tuning and
-data selection to just 30% of conventional methods, while surpassing fully
-fine-tuned models across eight multimodal and three language understanding
-benchmarks, achieving a 101.7% relative improvement in final performance.
-
-摘要：視覺指令調整優化預先訓練的多模態大型語言模型 (MLLM)，以增強其真實世界的任務表現。然而，視覺指令資料集的快速擴展引入了顯著的資料冗餘，導致過度的運算成本。現有的資料選取方法主要依賴於代理模型或基於損失的指標，這兩者由於模型推理和反向傳播的必要性而造成大量的運算負擔。為了應對這一挑戰，我們提出了 PRISM，一種用於高效多模態資料選取的新型無訓練方法。與現有方法不同，PRISM 消除了對代理模型、熱身預訓練和基於梯度的優化的依賴。相反，它利用 Pearson 相關分析來量化 MLLM 的內在視覺編碼特性，計算特定任務相關性分數以識別高價值實例。這不僅能選擇資料效率，而且能保持原始效能。跨多個 MLLM 的經驗評估表明，PRISM 將視覺指令調整和資料選取所需的總時間減少到傳統方法的 30%，同時在八個多模態和三個語言理解基準中超越了完全微調的模型，在最終效能上實現了 101.7% 的相對改進。
-
-##### **Scaling Test-Time Compute Without Verification or RL is Suboptimal**
-2502.12118v1 by Amrith Setlur, Nived Rajaraman, Sergey Levine, Aviral Kumar
-
-Despite substantial advances in scaling test-time compute, an ongoing debate
-in the community is how it should be scaled up to enable continued and
-efficient improvements with scaling. There are largely two approaches: first,
-distilling successful search or thinking traces; and second, using verification
-(e.g., 0/1 outcome rewards, reward models, or verifiers) to guide reinforcement
-learning (RL) and search algorithms. In this paper, we prove that finetuning
-LLMs with verifier-based (VB) methods based on RL or search is far superior to
-verifier-free (VF) approaches based on distilling or cloning search traces,
-given a fixed amount of compute/data budget. Further, we show that as we scale
-test-time compute (measured as the output token length) and training data,
-suboptimality of VF methods scales poorly compared to VB when the base
-pre-trained LLM presents a heterogeneous distribution over correct solution
-traces (e.g., different lengths, styles, etc.) and admits a non-sharp
-distribution over rewards on traces sampled from it. We formalize this
-condition using anti-concentration [Erd\H{o}s, 1945]. This implies a stronger
-result that VB methods scale better asymptotically, with the performance gap
-between VB and VF methods widening as test-time budget grows. We corroborate
-our theory empirically on both didactic and math reasoning problems with
-3/8/32B-sized pre-trained LLMs, where we find verification is crucial for
-scaling test-time compute.
-
-摘要：儘管在擴展測試時間計算方面取得了重大進展，但社群中持續的辯論是如何擴展它以持續有效地改善擴展。大致有兩種方法：首先，提煉成功的搜尋或思考軌跡；其次，使用驗證（例如，0/1 結果獎勵、獎勵模型或驗證器）來指導強化學習 (RL) 和搜尋演算法。在本文中，我們證明使用基於 RL 或搜尋的驗證器為基礎 (VB) 方法微調 LLM 遠優於基於提煉或複製搜尋軌跡的驗證器免費 (VF) 方法，給定固定數量的計算/資料預算。此外，我們表明，當我們擴展測試時間計算（以輸出標記長度衡量）和訓練資料時，與 VB 相比，VF 方法的次最佳性擴展效果不佳，當基礎預先訓練的 LLM 在正確的解決方案軌跡上呈現異質分佈（例如，不同的長度、樣式等）並承認從其中取樣的軌跡上獎勵的分佈不尖銳時。我們使用反集中 [Erd\H{o}s，1945] 將此條件形式化。這暗示了一個更強的結果，即 VB 方法在漸近上擴展得更好，VB 和 VF 方法之間的效能差距隨著測試時間預算的增加而擴大。我們在具有 3/8/32B 大小的預先訓練 LLM 的教學和數學推理問題上對我們的理論進行實證驗證，我們發現驗證對於擴展測試時間計算至關重要。
-
-##### **A-MEM: Agentic Memory for LLM Agents**
-2502.12110v1 by Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, Yongfeng Zhang
-
-While large language model (LLM) agents can effectively use external tools
-for complex real-world tasks, they require memory systems to leverage
-historical experiences. Current memory systems enable basic storage and
-retrieval but lack sophisticated memory organization, despite recent attempts
-to incorporate graph databases. Moreover, these systems' fixed operations and
-structures limit their adaptability across diverse tasks. To address this
-limitation, this paper proposes a novel agentic memory system for LLM agents
-that can dynamically organize memories in an agentic way. Following the basic
-principles of the Zettelkasten method, we designed our memory system to create
-interconnected knowledge networks through dynamic indexing and linking. When a
-new memory is added, we generate a comprehensive note containing multiple
-structured attributes, including contextual descriptions, keywords, and tags.
-The system then analyzes historical memories to identify relevant connections,
-establishing links where meaningful similarities exist. Additionally, this
-process enables memory evolution - as new memories are integrated, they can
-trigger updates to the contextual representations and attributes of existing
-historical memories, allowing the memory network to continuously refine its
-understanding. Our approach combines the structured organization principles of
-Zettelkasten with the flexibility of agent-driven decision making, allowing for
-more adaptive and context-aware memory management. Empirical experiments on six
-foundation models show superior improvement against existing SOTA baselines.
-The source code is available at https://github.com/WujiangXu/AgenticMemory.
-
-摘要：大型語言模型 (LLM) 代理雖然能有效地使用外部工具來執行複雜的真實世界任務，但它們需要記憶體系統來利用歷史經驗。目前的記憶體系統能進行基本的儲存和檢索，但缺乏精密的記憶體組織，儘管最近嘗試納入圖形資料庫。此外，這些系統固定的運作和結構限制了它們在不同任務中的適應性。為了解決這個限制，本文提出了一種新的代理記憶體系統，供 LLM 代理動態地以代理的方式組織記憶體。遵循 Zettelkasten 方法的基本原則，我們設計我們的記憶體系統，透過動態索引和連結來建立相互連結的知識網路。當加入新的記憶體時，我們會產生包含多個結構化屬性的綜合筆記，包括脈絡描述、關鍵字和標籤。然後，系統會分析歷史記憶體以找出相關連結，在有意義的相似性時建立連結。此外，這個程序能讓記憶體演化，因為當整合新的記憶體時，它們會觸發對現有歷史記憶體的脈絡表示和屬性的更新，讓記憶體網路能持續精進它的理解。我們的做法結合了 Zettelkasten 的結構化組織原則和代理驅動決策制定的靈活性，能進行更具適應性和脈絡感知的記憶體管理。在六個基礎模型上的經驗實驗顯示出比現有的 SOTA 基準線有顯著的進步。原始碼可以在 https://github.com/WujiangXu/AgenticMemory 找到。
-
-##### **Personality Structured Interview for Large Language Model Simulation in Personality Research**
-2502.12109v1 by Pengda Wang, Huiqi Zou, Hanjie Chen, Tianjun Sun, Ziang Xiao, Frederick L. Oswald
-
-Although psychometrics researchers have recently explored the use of large
-language models (LLMs) as proxies for human participants, LLMs often fail to
-generate heterogeneous data with human-like diversity, which diminishes their
-value in advancing social science research. To address these challenges, we
-explored the potential of the theory-informed Personality Structured Interview
-(PSI) as a tool for simulating human responses in personality research. In this
-approach, the simulation is grounded in nuanced real-human interview
-transcripts that target the personality construct of interest. We have provided
-a growing set of 357 structured interview transcripts from a representative
-sample, each containing an individual's response to 32 open-ended questions
-carefully designed to gather theory-based personality evidence. Additionally,
-grounded in psychometric research, we have summarized an evaluation framework
-to systematically validate LLM-generated psychometric data. Results from three
-experiments demonstrate that well-designed structured interviews could improve
-human-like heterogeneity in LLM-simulated personality data and predict
-personality-related behavioral outcomes (i.e., organizational citizenship
-behaviors and counterproductive work behavior). We further discuss the role of
-theory-informed structured interviews in LLM-based simulation and outline a
-general framework for designing structured interviews to simulate human-like
-data for psychometric research.
-
-摘要：儘管心理測量研究人員最近已探討將大型語言模型 (LLM) 用作人類參與者的代理，但 LLM 經常無法產生具有類似人類多樣性的異質資料，這降低了它們在推進社會科學研究中的價值。為了應對這些挑戰，我們探討了理論知情的個性結構化訪談 (PSI) 作為模擬人格研究中人類反應的工具的潛力。在此方法中，模擬基於針對目標人格建構的細緻真實人類訪談記錄。我們提供了一組不斷增加的 357 個結構化訪談記錄，來自一個具代表性的樣本，每個記錄都包含個人對 32 個開放式問題的回答，這些問題經過仔細設計，用於收集基於理論的人格證據。此外，基於心理測量研究，我們總結了一個評估架構，以系統性驗證 LLM 生成的精神測量資料。三個實驗的結果表明，設計良好的結構化訪談可以改善 LLM 模擬的人格資料中類似人類的異質性，並預測與人格相關的行為結果（例如，組織公民行為和適得其反的工作行為）。我們進一步討論了理論知情的結構化訪談在基於 LLM 的模擬中的作用，並概述了一個通用框架，用於設計結構化訪談以模擬類似人類的資料，以進行心理測量研究。
-
-##### **Using the Path of Least Resistance to Explain Deep Networks**
-2502.12108v1 by Sina Salek, Joseph Enguehard
-
-Integrated Gradients (IG), a widely used axiomatic path-based attribution
-method, assigns importance scores to input features by integrating model
-gradients along a straight path from a baseline to the input. While effective
-in some cases, we show that straight paths can lead to flawed attributions. In
-this paper, we identify the cause of these misattributions and propose an
-alternative approach that treats the input space as a Riemannian manifold,
-computing attributions by integrating gradients along geodesics. We call this
-method Geodesic Integrated Gradients (GIG). To approximate geodesic paths, we
-introduce two techniques: a k-Nearest Neighbours-based approach for smaller
-models and a Stochastic Variational Inference-based method for larger ones.
-Additionally, we propose a new axiom, Strong Completeness, extending the axioms
-satisfied by IG. We show that this property is desirable for attribution
-methods and that GIG is the only method that satisfies it. Through experiments
-on both synthetic and real-world data, we demonstrate that GIG outperforms
-existing explainability methods, including IG.
-
-摘要：整合梯度 (IG) 是一種廣泛使用的公理路徑歸因方法，它透過整合從基線到輸入的直線路徑上的模型梯度，為輸入特徵分配重要性分數。雖然在某些情況下有效，但我們表明直線路徑可能會導致錯誤的歸因。在本文中，我們找出這些錯誤歸因的原因，並提出將輸入空間視為黎曼流形的替代方法，透過整合測地線上的梯度來計算歸因。我們將此方法稱為測地線整合梯度 (GIG)。為了近似測地線路徑，我們引入了兩種技術：一種基於 k 最近鄰的方法，適用於較小的模型；一種基於隨機變異推論的方法，適用於較大的模型。此外，我們提出了新的公理，即強完整性，擴展了 IG 滿足的公理。我們表明此屬性對於歸因方法而言是理想的，並且 GIG 是唯一滿足此屬性的方法。透過對合成資料和真實世界資料進行的實驗，我們證明 GIG 優於現有的可解釋性方法，包括 IG。
-
-##### **Relational Norms for Human-AI Cooperation**
-2502.12102v1 by Brian D. Earp, Sebastian Porsdam Mann, Mateo Aboy, Edmond Awad, Monika Betzler, Marietjie Botes, Rachel Calcott, Mina Caraccio, Nick Chater, Mark Coeckelbergh, Mihaela Constantinescu, Hossein Dabbagh, Kate Devlin, Xiaojun Ding, Vilius Dranseika, Jim A. C. Everett, Ruiping Fan, Faisal Feroz, Kathryn B. Francis, Cindy Friedman, Orsolya Friedrich, Iason Gabriel, Ivar Hannikainen, Julie Hellmann, Arasj Khodadade Jahrome, Niranjan S. Janardhanan, Paul Jurcys, Andreas Kappes, Maryam Ali Khan, Gordon Kraft-Todd, Maximilian Kroner Dale, Simon M. Laham, Benjamin Lange, Muriel Leuenberger, Jonathan Lewis, Peng Liu, David M. Lyreskog, Matthijs Maas, John McMillan, Emilian Mihailov, Timo Minssen, Joshua Teperowski Monrad, Kathryn Muyskens, Simon Myers, Sven Nyholm, Alexa M. Owen, Anna Puzio, Christopher Register, Madeline G. Reinecke, Adam Safron, Henry Shevlin, Hayate Shimizu, Peter V. Treit, Cristina Voinea, Karen Yan, Anda Zahiu, Renwen Zhang, Hazem Zohny, Walter Sinnott-Armstrong, Ilina Singh, Julian Savulescu, Margaret S. Clark
-
-How we should design and interact with social artificial intelligence depends
-on the socio-relational role the AI is meant to emulate or occupy. In human
-society, relationships such as teacher-student, parent-child, neighbors,
-siblings, or employer-employee are governed by specific norms that prescribe or
-proscribe cooperative functions including hierarchy, care, transaction, and
-mating. These norms shape our judgments of what is appropriate for each
-partner. For example, workplace norms may allow a boss to give orders to an
-employee, but not vice versa, reflecting hierarchical and transactional
-expectations. As AI agents and chatbots powered by large language models are
-increasingly designed to serve roles analogous to human positions - such as
-assistant, mental health provider, tutor, or romantic partner - it is
-imperative to examine whether and how human relational norms should extend to
-human-AI interactions. Our analysis explores how differences between AI systems
-and humans, such as the absence of conscious experience and immunity to
-fatigue, may affect an AI's capacity to fulfill relationship-specific functions
-and adhere to corresponding norms. This analysis, which is a collaborative
-effort by philosophers, psychologists, relationship scientists, ethicists,
-legal experts, and AI researchers, carries important implications for AI
-systems design, user behavior, and regulation. While we accept that AI systems
-can offer significant benefits such as increased availability and consistency
-in certain socio-relational roles, they also risk fostering unhealthy
-dependencies or unrealistic expectations that could spill over into human-human
-relationships. We propose that understanding and thoughtfully shaping (or
-implementing) suitable human-AI relational norms will be crucial for ensuring
-that human-AI interactions are ethical, trustworthy, and favorable to human
-well-being.
-
-摘要：我們應如何設計和與社交人工智慧互動，取決於人工智慧預期要模仿或扮演的社會關係角色。在人類社會中，師生、父母子女、鄰居、兄弟姐妹或雇主員工等關係受特定規範所支配，這些規範規定或禁止包括等級、照顧、交易和交配在內的合作功能。這些規範形塑我們對每個夥伴適當行為的判斷。例如，職場規範可能允許老闆對員工發號施令，但反之則不行，這反映了等級和交易的期望。隨著由大型語言模型驅動的人工智慧代理程式和聊天機器人日益被設計為服務類似於人類職位的角色，例如助理、心理健康提供者、導師或浪漫伴侶，審查人類關係規範是否以及如何延伸至人類與人工智慧的互動至關重要。我們的分析探討了人工智慧系統和人類之間的差異，例如缺乏意識體驗和對疲勞的免疫力，如何影響人工智慧履行特定關係功能和遵守相應規範的能力。這項分析是由哲學家、心理學家、關係科學家、倫理學家、法律專家和人工智慧研究人員共同合作的成果，對人工智慧系統設計、使用者行為和法規具有重要的意義。雖然我們接受人工智慧系統可以在某些社會關係角色中提供顯著的好處，例如增加可用性和一致性，但它們也可能助長不健康的依賴關係或不切實際的期望，這些期望可能會蔓延到人際關係中。我們提出，理解和深思熟慮地塑造（或實施）適當的人類與人工智慧關係規範，對於確保人類與人工智慧的互動具有倫理性、可信賴性和有利於人類福祉至關重要。
-
-##### **A Study on Leveraging Search and Self-Feedback for Agent Reasoning**
-2502.12094v1 by Karthikeyan K, Michelle Yuan, Elman Mansimov, Katerina Margatina, Anurag Pratik, Daniele Bonadiman, Monica Sunkara, Yi Zhang, Yassine Benajiba
-
-Recent works have demonstrated that incorporating search during inference can
-significantly improve reasoning capabilities of language agents. Some
-approaches may make use of the ground truth or rely on the model's own generated
-feedback. The search algorithm uses this feedback to then produce values that
-will update its criterion for exploring and exploiting various reasoning paths.
-In this study, we investigate how search and the model's self-feedback can be
-leveraged for reasoning tasks. First, we explore differences in ground-truth
-feedback and self-feedback during search for math reasoning. Second, we observe
-limitations in applying search techniques to more complex tasks like
-tool-calling and design domain-specific approaches to address these gaps. Our
-experiments reveal challenges related to generalization when solely relying on
-self-feedback during search. For search to work effectively, either access to
-the ground-truth is needed or feedback mechanisms need to be carefully designed
-for the specific task.
-
-摘要：最近的研究表明，在推理过程中加入搜索功能可以显著提升语言代理的推理能力。一些方法可能会利用基本事实或依赖模型本身产生的反馈。搜索算法使用此反馈，然后生成值，以更新其探索和利用各种推理路径的标准。在本研究中，我们调查了如何利用搜索和模型的自反馈来进行推理任务。首先，我们探讨了数学推理搜索过程中基本事实反馈和自反馈的差异。其次，我们观察到在将搜索技术应用于更复杂的任务（如工具调用和设计特定于领域的解决方案）时存在的局限性，并提出针对这些差距的解决方案。我们的实验揭示了在搜索过程中仅依赖自反馈时与泛化相关的挑战。要使搜索有效，需要访问基本事实或需要针对特定任务仔细设计反馈机制。
-
-##### **Meta-Statistical Learning: Supervised Learning of Statistical Inference**
-2502.12088v1 by Maxime Peyrard, Kyunghyun Cho
-
-This work demonstrates that the tools and principles driving the success of
-large language models (LLMs) can be repurposed to tackle distribution-level
-tasks, where the goal is to predict properties of the data-generating
-distribution rather than labels for individual datapoints. These tasks
-encompass statistical inference problems such as parameter estimation,
-hypothesis testing, or mutual information estimation. Framing these tasks
-within traditional machine learning pipelines is challenging, as supervision is
-typically tied to individual datapoints. We propose meta-statistical learning, a
-framework inspired by multi-instance learning that reformulates statistical
-inference tasks as supervised learning problems. In this approach, entire
-datasets are treated as single inputs to neural networks, which predict
-distribution-level parameters. Transformer-based architectures, without
-positional encoding, provide a natural fit due to their permutation-invariance
-properties. By training on large-scale synthetic datasets, meta-statistical
-models can leverage the scalability and optimization infrastructure of
-Transformer-based LLMs. We demonstrate the framework's versatility with
-applications in hypothesis testing and mutual information estimation, showing
-strong performance, particularly for small datasets where traditional neural
-methods struggle.
-
-摘要：这项工作表明，推动大型语言模型 (LLM) 成功发展的工具和原则可以重新用于解决分布级别任务，其中目标是预测数据生成分布的属性，而不是单个数据点的标签。这些任务包括统计推断问题，例如参数估计、假设检验或互信息估计。在传统的机器学习管道中构建这些任务具有挑战性，因为监督通常与单个数据点相关联。我们提出了元统计学习，这是一个受多实例学习启发的框架，它将统计推断任务重新表述为监督学习问题。在此方法中，整个数据集被视为神经网络的单个输入，该神经网络预测分布级别参数。基于 Transformer 的架构在没有位置编码的情况下提供了自然拟合，因为它们具有置换不变性。通过在大型合成数据集上进行训练，元统计模型可以利用基于 Transformer 的 LLM 的可扩展性和优化基础设施。我们通过在假设检验和互信息估计中的应用展示了该框架的多功能性，显示出强大的性能，特别是对于传统神经方法难以处理的小型数据集。
-
-##### **APB: Accelerating Distributed Long-Context Inference by Passing Compressed Context Blocks across GPUs**
-2502.12085v1 by Yuxiang Huang, Mingye Li, Xu Han, Chaojun Xiao, Weilin Zhao, Sun Ao, Hao Zhou, Jie Zhou, Zhiyuan Liu, Maosong Sun
-
-While long-context inference is crucial for advancing large language model
-(LLM) applications, its prefill speed remains a significant bottleneck. Current
-approaches, including sequence parallelism strategies and compute reduction
-through approximate attention mechanisms, still fall short of delivering
-optimal inference efficiency. This hinders scaling the inputs to longer
-sequences and processing long-context queries in a timely manner. To address
-this, we introduce APB, an efficient long-context inference framework that
-leverages multi-host approximate attention to enhance prefill speed by reducing
-compute and enhancing parallelism simultaneously. APB introduces a
-communication mechanism for essential key-value pairs within a sequence
-parallelism framework, enabling a faster inference speed while maintaining task
-performance. We implement APB by incorporating a tailored FlashAttn kernel
-alongside optimized distribution strategies, supporting diverse models and
-parallelism configurations. APB achieves speedups of up to 9.2x, 4.2x, and 1.6x
-compared with FlashAttn, RingAttn, and StarAttn, respectively, without any
-observable task performance degradation. We provide the implementation and
-experiment code of APB in https://github.com/thunlp/APB.
-
-摘要：雖然長文本推理對於推進大型語言模型 (LLM) 應用至關重要，但其預填充速度仍然是一個重大的瓶頸。目前的各種方法，包括序列並行策略和透過近似注意力機制減少運算，仍然無法提供最佳的推理效率。這會阻礙將輸入擴展到更長的序列，以及及時處理長文本查詢。為了解決這個問題，我們引入了 APB，這是一個高效的長文本推理架構，它利用多主機近似注意力來減少運算並同時提高並行性，從而提高預填充速度。APB 在序列並行架構中引入了一個用於基本鍵值對的通訊機制，在維持任務效能的同時，實現更快的推理速度。我們透過整合一個量身打造的 FlashAttn 核心以及最佳化的分佈策略來實作 APB，支援各種模型和並行配置。與 FlashAttn、RingAttn 和 StarAttn 相比，APB 分別實現了高達 9.2 倍、4.2 倍和 1.6 倍的加速，同時沒有任何可觀察到的任務效能下降。我們在 https://github.com/thunlp/APB 中提供了 APB 的實作和實驗程式碼。
-
-##### **VLM$^2$-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit Matching Visual Cues**
-2502.12084v1 by Jianshu Zhang, Dongyu Yao, Renjie Pi, Paul Pu Liang, Yi R. Fung
-
-Visually linking matching cues is a crucial ability in daily life, such as
-identifying the same person in multiple photos based on their cues, even
-without knowing who they are. Despite the extensive knowledge that
-vision-language models (VLMs) possess, it remains largely unexplored whether
-they are capable of performing this fundamental task. To address this, we
-introduce VLM$^2$-Bench, a benchmark designed to assess whether VLMs can
-Visually Link Matching cues, with 9 subtasks and over 3,000 test cases.
-Comprehensive evaluation across eight open-source VLMs and GPT-4o, along with
-further analysis of various language-side and vision-side prompting methods,
-leads to a total of eight key findings. We identify critical challenges in
-models' ability to link visual cues, highlighting a significant performance gap
-where even GPT-4o lags 34.80% behind humans. Based on these insights, we
-advocate for (i) enhancing core visual capabilities to improve adaptability and
-reduce reliance on prior knowledge, (ii) establishing clearer principles for
-integrating language-based reasoning in vision-centric tasks to prevent
-unnecessary biases, and (iii) shifting vision-text training paradigms toward
-fostering models' ability to independently structure and infer relationships
-among visual cues.
-
-摘要：視覺連結匹配線索是日常生活中的關鍵能力，例如在多張照片中根據線索辨識同一個人，即使不知道他們是誰。儘管視覺語言模型 (VLM) 擁有廣泛的知識，但它們是否能執行這項基本任務，在很大程度上仍未被探討。為了解決這個問題，我們引入了 VLM$^2$-Bench，一個基準測試，旨在評估 VLM 是否能視覺連結匹配線索，包含 9 個子任務和超過 3,000 個測試案例。對八個開源 VLM 和 GPT-4o 的全面評估，以及對各種語言側和視覺側提示方法的進一步分析，得出總共八項關鍵發現。我們找出模型連結視覺線索能力的關鍵挑戰，強調一個顯著的效能差距，即使是 GPT-4o 也落後人類 34.80%。根據這些見解，我們提倡 (i) 提升核心視覺能力以改善適應性並減少對先驗知識的依賴，(ii) 為整合基於語言的推理到以視覺為中心的任務中建立更明確的原則，以防止不必要的偏見，以及 (iii) 將視覺文字訓練範例轉移到培養模型獨立建構和推論視覺線索之間關係的能力。
-
-##### **AdaSplash: Adaptive Sparse Flash Attention**
-2502.12082v1 by Nuno Gonçalves, Marcos Treviso, André F. T. Martins
-
-The computational cost of softmax-based attention in transformers limits
-their applicability to long-context tasks. Adaptive sparsity, of which
-$\alpha$-entmax attention is an example, offers a flexible data-dependent
-alternative, but existing implementations are inefficient and do not leverage
-the sparsity to obtain runtime and memory gains. In this work, we propose
-AdaSplash, which combines the efficiency of GPU-optimized algorithms with the
-sparsity benefits of $\alpha$-entmax. We first introduce a hybrid
-Halley-bisection algorithm, resulting in a 7-fold reduction in the number of
-iterations needed to compute the $\alpha$-entmax transformation. Then, we
-implement custom Triton kernels to efficiently handle adaptive sparsity.
-Experiments with RoBERTa and ModernBERT for text classification and
-single-vector retrieval, along with GPT-2 for language modeling, show that our
-method achieves substantial improvements in runtime and memory efficiency
-compared to existing $\alpha$-entmax implementations. It approaches -- and in
-some cases surpasses -- the efficiency of highly optimized softmax
-implementations like FlashAttention-2, enabling long-context training while
-maintaining strong task performance.
-
-摘要：基於 softmax 的注意力在 Transformer 中的運算成本限制了它們在長內容任務中的應用性。適應性稀疏性，其中 $\alpha$-entmax 注意力是一個例子，提供了一個靈活的資料相關替代方案，但現有的實作效率低下，且無法利用稀疏性來獲得執行時間和記憶體的增益。在這項工作中，我們提出了 AdaSplash，它結合了 GPU 最佳化演算法的效率和 $\alpha$-entmax 的稀疏性優點。我們首先引入了一個混合 Halley-二分法演算法，導致計算 $\alpha$-entmax 轉換所需的迭代次數減少了 7 倍。然後，我們實作自訂 Triton 核心，以有效處理適應性稀疏性。針對文字分類和單一向量擷取的 RoBERTa 和 ModernBERT，以及用於語言建模的 GPT-2 的實驗顯示，與現有的 $\alpha$-entmax 實作相比，我們的方法在執行時間和記憶體效率方面獲得了顯著的改善。它接近了 -- 在某些情況下超越了 -- 高度最佳化 softmax 實作（例如 FlashAttention-2）的效率，同時在維持強大任務效能的同時，能夠進行長內容訓練。
-
-##### **Unhackable Temporal Rewarding for Scalable Video MLLMs**
-2502.12081v1 by En Yu, Kangheng Lin, Liang Zhao, Yana Wei, Zining Zhu, Haoran Wei, Jianjian Sun, Zheng Ge, Xiangyu Zhang, Jingyu Wang, Wenbing Tao
-
-In the pursuit of superior video-processing MLLMs, we have encountered a
-perplexing paradox: the "anti-scaling law", where more data and larger models
-lead to worse performance. This study unmasks the culprit: "temporal hacking",
-a phenomenon where models shortcut by fixating on select frames, missing the
-full video narrative. In this work, we systematically establish a comprehensive
-theory of temporal hacking, defining it from a reinforcement learning
-perspective, introducing the Temporal Perplexity (TPL) score to assess this
-misalignment, and proposing the Unhackable Temporal Rewarding (UTR) framework
-to mitigate the temporal hacking. Both theoretically and empirically, TPL
-proves to be a reliable indicator of temporal modeling quality, correlating
-strongly with frame activation patterns. Extensive experiments reveal that UTR
-not only counters temporal hacking but significantly elevates video
-comprehension capabilities. This work not only advances video-AI systems but
-also illuminates the critical importance of aligning proxy rewards with true
-objectives in MLLM development.
-
-摘要：在追求卓越的影片處理 MLLM 時，我們遭遇了一個令人費解的矛盾現象：「反規模化定律」，也就是更多資料和更大的模型會導致更差的效能。本研究揭露了罪魁禍首：「時間駭客」，這是一種模型透過專注於特定影格來簡化的現象，錯失了完整的影片敘事。在這項研究中，我們系統性地建立了一個關於時間駭客的全面理論，從強化學習的角度定義它，並引入了時間困惑度 (TPL) 分數來評估這種失衡，並提出了無法破解的時間獎勵 (UTR) 架構來減輕時間駭客現象。從理論和經驗上來說，TPL 被證明是時間建模品質的可靠指標，與影格啟動模式有很強的相關性。大量的實驗顯示，UTR 不僅對抗時間駭客，還能顯著提升影片理解能力。這項研究不僅推動了影片 AI 系統，也闡明了在 MLLM 開發中，將代理獎勵與真實目標對齊的重要性。
-
-##### **Can LLMs Simulate Social Media Engagement? A Study on Action-Guided Response Generation**
-2502.12073v1 by Zhongyi Qiu, Hanjia Lyu, Wei Xiong, Jiebo Luo
-
-Social media enables dynamic user engagement with trending topics, and recent
-research has explored the potential of large language models (LLMs) for
-response generation. While some studies investigate LLMs as agents for
-simulating user behavior on social media, their focus remains on practical
-viability and scalability rather than a deeper understanding of how well LLMs
-align with human behavior. This paper analyzes LLMs' ability to simulate
-social media engagement through action-guided response generation, where a
-model first predicts a user's most likely engagement action (retweet, quote, or
-rewrite) towards a trending post before generating a personalized response
-conditioned on the predicted action. We benchmark GPT-4o-mini, O1-mini, and
-DeepSeek-R1 in social media engagement simulation regarding a major societal
-event discussed on X. Our findings reveal that zero-shot LLMs underperform BERT
-in action prediction, while few-shot prompting initially degrades the
-prediction accuracy of LLMs with limited examples. However, in response
-generation, few-shot LLMs achieve stronger semantic alignment with ground truth
-posts.
-
-摘要：社交媒體讓使用者能夠動態參與熱門話題，而最近的研究探索了大型語言模型 (LLM) 在回應生成方面的潛力。儘管有些研究將 LLM 視為模擬社交媒體使用者行為的代理，但其重點仍放在實務可行性和可擴充性，而非深入了解 LLM 如何與人類行為相符。本文分析了 LLM 透過動作引導回應生成來模擬社交媒體參與的能力，其中一個模型首先預測使用者最有可能的參與動作（轉推、引用或改寫）對熱門貼文的參與，然後根據預測的動作產生個人化回應。我們在 X 上討論的一個重大社會事件中，對 GPT-4o-mini、O1-mini 和 DeepSeek-R1 進行社交媒體參與模擬的基準測試。我們的研究結果顯示，零次學習 LLM 在動作預測方面表現不如 BERT，而少次學習提示最初會降低範例有限的 LLM 預測準確度。然而，在回應生成方面，少次學習 LLM 與真實貼文達到了更強的語義對齊。
-
-##### **TokenSkip: Controllable Chain-of-Thought Compression in LLMs**
-2502.12067v1 by Heming Xia, Yongqi Li, Chak Tou Leong, Wenjie Wang, Wenjie Li
-
-Chain-of-Thought (CoT) has been proven effective in enhancing the reasoning
-capabilities of large language models (LLMs). Recent advancements, such as
-OpenAI's o1 and DeepSeek-R1, suggest that scaling up the length of CoT
-sequences during inference could further boost LLM reasoning performance.
-However, due to the autoregressive nature of LLM decoding, longer CoT outputs
-lead to a linear increase in inference latency, adversely affecting user
-experience, particularly when the CoT exceeds 10,000 tokens. To address this
-limitation, we analyze the semantic importance of tokens within CoT outputs and
-reveal that their contributions to reasoning vary. Building on this insight, we
-propose TokenSkip, a simple yet effective approach that enables LLMs to
-selectively skip less important tokens, allowing for controllable CoT
-compression. Extensive experiments across various models and tasks demonstrate
-the effectiveness of TokenSkip in reducing CoT token usage while preserving
-strong reasoning performance. Notably, when applied to Qwen2.5-14B-Instruct,
-TokenSkip reduces reasoning tokens by 40% (from 313 to 181) on GSM8K, with less
-than a 0.4% performance drop.
-
-摘要：鏈式思維 (CoT) 已被證明能有效提升大型語言模型 (LLM) 的推理能力。最近的進展，例如 OpenAI 的 o1 和 DeepSeek-R1，表明在推理過程中擴展 CoT 序列的長度可以進一步提升 LLM 的推理效能。然而，由於 LLM 解碼的自動回歸特性，較長的 CoT 輸出會導致推理延遲線性增加，對使用者體驗造成負面影響，特別是在 CoT 超過 10,000 個符號時。為了解決這個限制，我們分析了 CoT 輸出中符號的語義重要性，並揭示了它們對推理的貢獻度不同。基於這個見解，我們提出了 TokenSkip，一種簡單但有效的技術，使 LLM 能有選擇地略過較不重要的符號，從而實現可控的 CoT 壓縮。跨越各種模型和任務的廣泛實驗證明了 TokenSkip 在減少 CoT 符號使用量同時保持強大推理效能方面的有效性。值得注意的是，當應用於 Qwen2.5-14B-Instruct 時，TokenSkip 將 GSM8K 上的推理符號減少了 40%（從 313 個減少到 181 個），效能下降不到 0.4%。
-
-##### **CONSTRUCTA: Automating Commercial Construction Schedules in Fabrication Facilities with Large Language Models**
-2502.12066v1 by Yifan Zhang, Xue Yang
-
-Automating planning with LLMs presents transformative opportunities for
-traditional industries, yet remains underexplored. In commercial construction,
-the complexity of automated scheduling often requires manual intervention to
-ensure precision. We propose CONSTRUCTA, a novel framework leveraging LLMs to
-optimize construction schedules in complex projects like semiconductor
-fabrication. CONSTRUCTA addresses key challenges by: (1) integrating
-construction-specific knowledge through static RAG; (2) employing
-context-sampling techniques inspired by architectural expertise to provide
-relevant input; and (3) deploying Construction DPO to align schedules with
-expert preferences using RLHF. Experiments on proprietary data demonstrate
-performance improvements of +42.3% in missing value prediction, +79.1% in
-dependency analysis, and +28.9% in automated planning compared to baseline
-methods, showcasing its potential to revolutionize construction workflows and
-inspire domain-specific LLM advancements.
-
-摘要：利用 LLM 自動化規劃為傳統產業帶來轉型契機，但仍有待進一步探索。在商業建築中，自動化排程的複雜性通常需要手動介入以確保精確度。我們提出 CONSTRUCTA，一個利用 LLM 優化複雜專案（如半導體製造）建築排程的新穎架構。CONSTRUCTA 透過下列方式解決關鍵挑戰：(1) 整合靜態 RAG 的建築特定知識；(2) 採用受建築專業知識啟發的脈絡取樣技術，提供相關輸入；(3) 部署建築 DPO，使用 RLHF 將排程與專家偏好對齊。專利數據的實驗顯示，與基準方法相比，遺失值預測的效能提升 +42.3%、相依性分析提升 +79.1%、自動化規劃提升 +28.9%，展示其革新建築工作流程和激勵領域特定 LLM 進展的潛力。
-
-##### **Formalizing Complex Mathematical Statements with LLMs: A Study on Mathematical Definitions**
-2502.12065v1 by Lan Zhang, Marco Valentino, Andre Freitas
-
-Thanks to their linguistic capabilities, LLMs offer an opportunity to bridge
-the gap between informal mathematics and formal languages through
-autoformalization. However, it is still unclear how well LLMs generalize to
-sophisticated and naturally occurring mathematical statements. To address this
-gap, we investigate the task of autoformalizing real-world mathematical
-definitions -- a critical component of mathematical discourse. Specifically, we
-introduce two novel resources for autoformalisation, collecting definitions
-from Wikipedia (Def_Wiki) and arXiv papers (Def_ArXiv). We then systematically
-evaluate a range of LLMs, analyzing their ability to formalize definitions into
-Isabelle/HOL. Furthermore, we investigate strategies to enhance LLMs'
-performance including refinement through external feedback from Proof
-Assistants, and formal definition grounding, where we guide LLMs through
-relevant contextual elements from formal mathematical libraries. Our findings
-reveal that definitions present a greater challenge compared to existing
-benchmarks, such as miniF2F. In particular, we found that LLMs still struggle
-with self-correction, and aligning with relevant mathematical libraries. At the
-same time, structured refinement methods and definition grounding strategies
-yield notable improvements of up to 16% on self-correction capabilities and 43%
-on the reduction of undefined errors, highlighting promising directions for
-enhancing LLM-based autoformalization in real-world scenarios.
-
-摘要：由於語言能力，LLM 提供了一個機會，透過自動形式化來彌合非正式數學和形式語言之間的差距。然而，LLM 在多麼精巧且自然發生的數學陳述中概化，這仍不清楚。為了解決這個差距，我們探討了自動形式化真實世界數學定義的任務，這是數學論述中的關鍵組成部分。具體來說，我們介紹了自動形式化的兩個新資源，收集來自維基百科（Def_Wiki）和 arXiv 論文（Def_ArXiv）的定義。然後，我們系統性地評估了一系列 LLM，分析它們將定義形式化為 Isabelle/HOL 的能力。此外，我們探討了增強 LLM 效能的策略，包括透過證明輔助工具的外部回饋進行精煉，以及形式定義基礎，其中我們透過形式數學函式庫中的相關脈絡元素來引導 LLM。我們的發現顯示，與現有的基準（例如 miniF2F）相比，定義提出了更大的挑戰。特別是，我們發現 LLM 在自我修正和與相關數學函式庫對齊方面仍然有困難。同時，結構化的精煉方法和定義基礎策略在自我修正能力上產生了顯著的改善，高達 16%，在減少未定義錯誤方面改善了 43%，突顯了在真實世界場景中增強基於 LLM 的自動形式化的有希望的方向。
-
-##### **AI-generated Text Detection with a GLTR-based Approach**
-2502.12064v1 by Lucía Yan Wu, Isabel Segura-Bedmar
-
-The rise of LLMs (Large Language Models) has contributed to the improved
-performance and development of cutting-edge NLP applications. However, these
-can also pose risks when used maliciously, such as spreading fake news, harmful
-content, impersonating individuals, or facilitating school plagiarism, among
-others. This is because LLMs can generate high-quality texts, which are
-challenging to differentiate from those written by humans. GLTR, which stands
-for Giant Language Model Test Room and was developed jointly by the MIT-IBM
-Watson AI Lab and HarvardNLP, is a visual tool designed to help detect
-machine-generated texts based on GPT-2; it highlights the words in a text
-according to the probability that they were machine-generated. One limitation
-of GLTR is that the results it returns can sometimes be ambiguous and lead to
-confusion. This study aims to explore various ways to improve GLTR's
-effectiveness for detecting AI-generated texts within the context of the
-IberLef-AuTexTification 2023 shared task, in both English and Spanish
-languages. Experiment results show that our GLTR-based GPT-2 model outperforms
-the state-of-the-art models on the English dataset with a macro F1-score of
-80.19%, except for the top-ranked model (80.91%). However, for the Spanish
-dataset, we obtained a macro F1-score of 66.20%, which differs by 4.57%
-compared to the top-performing model.
-
-摘要：大型語言模型 (LLM) 的興起有助於改進尖端 NLP 應用程式的效能和開發。不過，這些應用程式若遭惡意使用，例如散布假新聞、有害內容、冒充個人或協助學校抄襲等，也可能造成風險。這是因為 LLM 可以產生高品質的文字，而這些文字難以與人類所寫的文字區分。GLTR（代表大型語言模型測試室）是由麻省理工學院-IBM Watson AI 實驗室和 HarvardNLP 共同開發的視覺工具，旨在協助偵測基於 GPT-2 的機器產生的文字，它會根據文字中每個字詞機器產生的機率來標示。GLTR 的一個限制在於，它回傳的結果有時可能模稜兩可，容易造成混淆。本研究旨在探討各種方法來改善 GLTR 在 IberLef-AuTexTification 2023 共享任務中偵測 AI 生成的文字的效能，任務中包含英文和西班牙文兩種語言。實驗結果顯示，我們的基於 GLTR 的 GPT-2 模型在英文資料集上以 80.19% 的巨觀 F1 分數超越了最先進的模型，僅次於第一名排名模型 (80.91%)。不過，在西班牙文資料集上，我們獲得的巨觀 F1 分數為 66.20%，與表現最佳的模型相比，相差 4.57%。
-
-##### **Culture is Not Trivia: Sociocultural Theory for Cultural NLP**
-2502.12057v1 by Naitian Zhou, David Bamman, Isaac L. Bleaman
-
-The field of cultural NLP has recently experienced rapid growth, driven by a
-pressing need to ensure that language technologies are effective and safe
-across a pluralistic user base. This work has largely progressed without a
-shared conception of culture, instead choosing to rely on a wide array of
-cultural proxies. However, this leads to a number of recurring limitations:
-coarse national boundaries fail to capture nuanced differences that lie within
-them, limited coverage restricts datasets to only a subset of usually
-highly-represented cultures, and a lack of dynamicity results in static
-cultural benchmarks that do not change as culture evolves. In this position
-paper, we argue that these methodological limitations are symptomatic of a
-theoretical gap. We draw on a well-developed theory of culture from
-sociocultural linguistics to fill this gap by 1) demonstrating in a case study
-how it can clarify methodological constraints and affordances, 2) offering
-theoretically-motivated paths forward to achieving cultural competence, and 3)
-arguing that localization is a more useful framing for the goals of much
-current work in cultural NLP.
-
-摘要：文化 NLP 領域最近經歷了快速成長，這是因為迫切需要確保語言技術對於多元化的使用者基礎而言是有效且安全的。這項工作在很大程度上沒有文化共識，而是選擇依賴各種文化代理。然而，這導致了許多重複性的限制：粗略的國家界線無法捕捉到其中的細微差異，有限的涵蓋範圍將資料集限制在通常高度代表的文化子集，而且缺乏動態性導致靜態文化基準無法隨著文化演變而改變。在這篇立場文件中，我們認為這些方法論限制是理論差距的徵兆。我們從社會文化語言學中汲取一個發展良好的文化理論，透過 1) 在個案研究中展示它如何釐清方法論限制和可負擔性，2) 提供理論上合理的途徑來實現文化能力，以及 3) 主張在地化對於文化 NLP 中許多當前工作的目標而言是一個更有用的框架，來填補這個差距。
-
-##### **Designing Role Vectors to Improve LLM Inference Behaviour**
-2502.12055v1 by Daniele Potertì, Andrea Seveso, Fabio Mercorio
-
-The influence of personas on Large Language Models (LLMs) has been widely
-studied, yet their direct impact on performance remains uncertain. This work
-explores a novel approach to guiding LLM behaviour through role vectors, an
-alternative to persona-based prompting. We construct 29 role vectors derived
-from model activations and evaluate their impact on benchmark performance
-across multiple domains. Our analysis investigates whether these vectors can
-effectively steer models toward domain-specific expertise. We measure two key
-interventions: (i) activation addition, which reinforces role-specific
-directions, and (ii) directional ablation, which removes them. Results on
-well-established benchmarks indicate that role vectors do, in fact, influence
-model behaviour, improving task performance in relevant domains while
-marginally affecting unrelated tasks. This, in turn, suggests that manipulating
-internal model representations has a greater impact on outcomes than
-persona-based prompting.
-
-摘要：大型語言模型 (LLM) 中角色的影響已被廣泛研究，但它們對效能的直接影響仍然不確定。本研究探討了一種透過角色向量引導 LLM 行為的新方法，這是一種基於角色提示的替代方案。我們從模型激活中建構了 29 個角色向量，並評估它們對多個領域基準效能的影響。我們的分析探討了這些向量是否能有效地引導模型朝向特定領域的專業知識。我們衡量了兩個關鍵干預措施：(i) 激活新增，它加強了特定角色的方向，以及 (ii) 方向消融，它移除了這些方向。在既定基準上的結果表明，角色向量確實會影響模型行為，在相關領域中改善任務效能，同時對不相關任務的影響很小。這反過來表明，操縱內部模型表示對結果的影響比基於角色的提示更大。
-
-##### **PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning**
-2502.12054v1 by Xinyu Zhang, Yuxuan Dong, Yanrui Wu, Jiaxing Huang, Chengyou Jia, Basura Fernando, Mike Zheng Shou, Lingling Zhang, Jun Liu
-
-Large language models demonstrate remarkable capabilities across various
-domains, especially mathematics and logic reasoning. However, current
-evaluations overlook physics-based reasoning - a complex task requiring physics
-theorems and constraints. We present PhysReason, a 1,200-problem benchmark
-comprising knowledge-based (25%) and reasoning-based (75%) problems, where the
-latter are divided into three difficulty levels (easy, medium, hard). Notably,
-problems require an average of 8.1 solution steps, with hard requiring 15.6,
-reflecting the complexity of physics-based reasoning. We propose the Physics
-Solution Auto Scoring Framework, incorporating efficient answer-level and
-comprehensive step-level evaluations. Top-performing models like Deepseek-R1,
-Gemini-2.0-Flash-Thinking, and o3-mini-high achieve less than 60% on
-answer-level evaluation, with performance dropping from knowledge questions
-(75.11%) to hard problems (31.95%). Through step-level evaluation, we
-identified four key bottlenecks: Physics Theorem Application, Physics Process
-Understanding, Calculation, and Physics Condition Analysis. These findings
-position PhysReason as a novel and comprehensive benchmark for evaluating
-physics-based reasoning capabilities in large language models. Our code and
-data will be published at https://dxzxy12138.github.io/PhysReason.
-
-摘要：大型語言模型展示了在各個領域的非凡能力，特別是數學和邏輯推理。然而，目前的評估忽略了基於物理的推理——這是一項複雜的任務，需要物理定理和約束。我們提出了 PhysReason，一個包含 1,200 題的基準，包含基於知識的（25%）和基於推理的（75%）問題，後者分為三個難度等級（容易、中等、困難）。值得注意的是，問題需要平均 8.1 個求解步驟，困難的需要 15.6 個，反映了基於物理的推理的複雜性。我們提出了物理解決方案自動評分框架，結合了高效的答案級別和全面的步驟級別評估。Deepseek-R1、Gemini-2.0-Flash-Thinking 和 o3-mini-high 等表現最佳的模型在答案級別評估中獲得低於 60% 的分數，性能從知識問題（75.11%）下降到困難問題（31.95%）。通過步驟級別評估，我們確定了四個關鍵瓶頸：物理定理應用、物理過程理解、計算和物理條件分析。這些發現將 PhysReason 定位為一個新穎且全面的基準，用於評估大型語言模型中基於物理的推理能力。我們的代碼和數據將發布在 https://dxzxy12138.github.io/PhysReason。
-
-##### **A Dual-Perspective NLG Meta-Evaluation Framework with Automatic Benchmark and Better Interpretability**
-2502.12052v1 by Xinyu Hu, Mingqi Gao, Li Lin, Zhenghan Yu, Xiaojun Wan
-
-In NLG meta-evaluation, evaluation metrics are typically assessed based on
-their consistency with humans. However, we identify some limitations in
-traditional NLG meta-evaluation approaches, such as issues in handling human
-ratings and ambiguous selections of correlation measures, which undermine the
-effectiveness of meta-evaluation. In this work, we propose a dual-perspective
-NLG meta-evaluation framework that focuses on different evaluation
-capabilities, thereby providing better interpretability. In addition, we
-introduce a method of automatically constructing the corresponding benchmarks
-without requiring new human annotations. Furthermore, we conduct experiments
-with 16 representative LLMs as the evaluators based on our proposed framework,
-comprehensively analyzing their evaluation performance from different
-perspectives.
-
-摘要：在 NLG 元評估中，評估指標通常根據其與人類的一致性進行評估。然而，我們在傳統的 NLG 元評估方法中發現了一些限制，例如在處理人類評分和模稜兩可的相關性測量選擇方面存在問題，這會損害元評估的有效性。在這項工作中，我們提出了一個雙視角 NLG 元評估框架，該框架專注於不同的評估能力，從而提供更好的可解釋性。此外，我們引入了一種自動構建相應基準的方法，而不需要新的手動註釋。此外，我們根據我們提出的框架對 16 個具有代表性的 LLM 作為評估器進行了實驗，從不同的角度全面分析了它們的評估性能。
-
-##### **How to Upscale Neural Networks with Scaling Law? A Survey and Practical Guidelines**
-2502.12051v1 by Ayan Sengupta, Yash Goel, Tanmoy Chakraborty
-
-Neural scaling laws have revolutionized the design and optimization of
-large-scale AI models by revealing predictable relationships between model
-size, dataset volume, and computational resources. Early research established
-power-law relationships in model performance, leading to compute-optimal
-scaling strategies. However, recent studies highlighted their limitations
-across architectures, modalities, and deployment contexts. Sparse models,
-mixture-of-experts, retrieval-augmented learning, and multimodal models often
-deviate from traditional scaling patterns. Moreover, scaling behaviors vary
-across domains such as vision, reinforcement learning, and fine-tuning,
-underscoring the need for more nuanced approaches. In this survey, we
-synthesize insights from over 50 studies, examining the theoretical
-foundations, empirical findings, and practical implications of scaling laws. We
-also explore key challenges, including data efficiency, inference scaling, and
-architecture-specific constraints, advocating for adaptive scaling strategies
-tailored to real-world applications. We suggest that while scaling laws provide
-a useful guide, they do not always generalize across all architectures and
-training strategies.
-
-摘要：神經網路規模定律透過揭示模型規模、資料集體積和計算資源之間可預測的關係，徹底革新了大型 AI 模型的設計和最佳化。早期研究建立了模型效能中的冪次定律關係，進而產生最佳化的運算規模策略。然而，最近的研究突出了它們在架構、模態和部署脈絡中的限制。稀疏模型、專家混合、檢索增強式學習和多模態模型通常偏離傳統的規模模式。此外，規模行為因視覺、強化學習和微調等領域而異，強調需要更細緻的方法。在這項調查中，我們綜合了 50 多項研究的見解，探討規模定律的理論基礎、實證發現和實務意涵。我們也探討了關鍵挑戰，包括資料效率、推論規模和特定於架構的限制，提倡針對實際應用量身打造的自適應規模策略。我們建議，儘管規模定律提供了有用的指南，但它們並不總是能概括到所有架構和訓練策略。
-
-##### **SpeechT: Findings of the First Mentorship in Speech Translation**
-2502.12050v1 by Yasmin Moslem, Juan Julián Cea Morán, Mariano Gonzalez-Gomez, Muhammad Hazim Al Farouq, Farah Abdou, Satarupa Deb
-
-This work presents the details and findings of the first mentorship in speech
-translation (SpeechT), which took place in December 2024 and January 2025. To
-fulfil the requirements of the mentorship, the participants engaged in key
-activities, including data preparation, modelling, and advanced research.
-
-摘要：本研究報告了 2024 年 12 月和 2025 年 1 月舉行的首次語音翻譯 (SpeechT) 指導計畫的詳細資訊和發現。為了滿足指導計畫的要求，參與者參與了關鍵活動，包括資料準備、建模和進階研究。
-
-##### **A Survey on Bridging EEG Signals and Generative AI: From Image and Text to Beyond**
-2502.12048v1 by Shreya Shukla, Jose Torres, Abhijit Mishra, Jacek Gwizdka, Shounak Roychowdhury
-
-Integration of Brain-Computer Interfaces (BCIs) and Generative Artificial
-Intelligence (GenAI) has opened new frontiers in brain signal decoding,
-enabling assistive communication, neural representation learning, and
-multimodal integration. BCIs, particularly those leveraging
-Electroencephalography (EEG), provide a non-invasive means of translating
-neural activity into meaningful outputs. Recent advances in deep learning,
-including Generative Adversarial Networks (GANs) and Transformer-based Large
-Language Models (LLMs), have significantly improved EEG-based generation of
-images, text, and speech. This paper provides a literature review of the
-state-of-the-art in EEG-based multimodal generation, focusing on (i)
-EEG-to-image generation through GANs, Variational Autoencoders (VAEs), and
-Diffusion Models, and (ii) EEG-to-text generation leveraging Transformer-based
-language models and contrastive learning methods. Additionally, we discuss the
-emerging domain of EEG-to-speech synthesis, an evolving multimodal frontier. We
-highlight key datasets, use cases, challenges, and EEG feature encoding methods
-that underpin generative approaches. By providing a structured overview of
-EEG-based generative AI, this survey aims to equip researchers and
-practitioners with insights to advance neural decoding, enhance assistive
-technologies, and expand the frontiers of brain-computer interaction.
-
-摘要：腦機介面（BCIs）與生成式人工智慧（GenAI）的整合為腦信號解碼開啟了新領域，能協助溝通、神經表徵學習與多模式整合。BCIs，特別是利用腦電圖（EEG）的 BCIs，提供了一種非侵入性的方式，可將神經活動轉換為有意義的輸出。深度學習的最新進展，包括生成對抗網路（GANs）與基於 Transformer 的大型語言模型（LLMs），大幅改善了基於 EEG 的影像、文字與語音生成。本文提供了一份基於 EEG 的多模式生成的最新文獻回顧，重點在於（一）透過 GANs、變異自動編碼器（VAEs）與擴散模型進行 EEG 到影像的生成，以及（二）利用基於 Transformer 的語言模型與對比學習方法進行 EEG 到文字的生成。此外，我們討論了 EEG 到語音合成的新興領域，這是一個不斷演進的多模式領域。我們重點介紹了關鍵的資料集、用例、挑戰與支撐生成方法的 EEG 特徵編碼方法。透過提供基於 EEG 的生成式 AI 的結構化概觀，本調查旨在為研究人員與從業人員提供見解，以推進神經解碼、增強輔助技術並擴展腦機互動的領域。
-
-##### **KnowPath: Knowledge-enhanced Reasoning via LLM-generated Inference Paths over Knowledge Graphs**
-2502.12029v1 by Qi Zhao, Hongyu Yang, Qi Song, Xinwei Yao, Xiangyang Li
-
-Large language models (LLMs) have demonstrated remarkable capabilities in
-various complex tasks, yet they still suffer from hallucinations. Introducing
-external knowledge, such as knowledge graphs, can enhance LLMs' ability to
-provide factual answers. LLMs have the ability to interactively explore
-knowledge graphs. However, most approaches have been affected by insufficient
-internal knowledge excavation in LLMs, limited generation of trustworthy
-knowledge reasoning paths, and a vague integration between internal and
-external knowledge. Therefore, we propose KnowPath, a knowledge-enhanced large
-model framework driven by the collaboration of internal and external knowledge.
-It relies on the internal knowledge of the LLM to guide the exploration of
-interpretable directed subgraphs in external knowledge graphs, better
-integrating the two knowledge sources for more accurate reasoning. Extensive
-experiments on multiple real-world datasets confirm the superiority of
-KnowPath.
-
-摘要：大型語言模型 (LLM) 已在各種複雜任務中展現出卓越的能力，但仍會出現幻覺。引入外部知識（例如知識圖譜）可以增強 LLM 提供事實答案的能力。LLM 有能力互動式地探索知識圖譜。然而，大多數方法都受到 LLM 中內部知識挖掘不足、可信賴知識推理路徑生成受限，以及內部和外部知識之間的整合模糊的影響。因此，我們提出 KnowPath，這是一個由內部和外部知識的協作驅動的知識增強型大型模型框架。它依賴於 LLM 的內部知識來指導對外部知識圖譜中可解釋的有向子圖的探索，更好地整合兩個知識來源以進行更準確的推理。對多個真實世界資料集進行的大量實驗證實了 KnowPath 的優越性。
-
-##### **SafeChain: Safety of Language Models with Long Chain-of-Thought Reasoning Capabilities**
-2502.12025v1 by Fengqing Jiang, Zhangchen Xu, Yuetai Li, Luyao Niu, Zhen Xiang, Bo Li, Bill Yuchen Lin, Radha Poovendran
-
-Emerging large reasoning models (LRMs), such as DeepSeek-R1 models, leverage
-long chain-of-thought (CoT) reasoning to generate structured intermediate
-steps, enhancing their reasoning capabilities. However, long CoT does not
-inherently guarantee safe outputs, potentially leading to harmful consequences
-such as the introduction of security vulnerabilities in code or the spread of
-misinformation. Current research on large language model (LLM) safety usually
-focuses on short-answer responses, overlooking the long CoT style outputs of
-LRMs. To bridge this gap, we conduct a systematic study of LRM safety. First,
-we investigate safety evaluators calibrated against human annotations. Using
-our newly developed metrics, we thoroughly assess the safety of 12
-state-of-the-art LRMs on StrongReject and WildJailbreak datasets. Our results
-show that the safety of LRMs lags behind their advances in reasoning. Further, we
-perform a fine-grained analysis of the reasoning trace and final answer. We
-find that three decoding strategies (ZeroThink, LessThink, and MoreThink) can
-improve model safety without additional training. However, these strategies
-either use constrained reasoning traces or incur high inference costs. To
-better strengthen LRM safety, we introduce SafeChain, the first-of-its-kind
-safety training dataset in CoT style. We fine-tune two LRMs with SafeChain,
-showing that it not only enhances model safety but also preserves performance
-across 6 reasoning benchmarks.
-
-摘要：新興的大型推理模型（LRM），例如 DeepSeek-R1 模型，利用長鏈思考（CoT）推理來生成結構化的中間步驟，增強其推理能力。然而，長 CoT 本質上並不能保證安全的輸出，可能會導致有害的後果，例如在程式碼中引入安全漏洞或散佈錯誤訊息。目前針對大型語言模型（LLM）安全性的研究通常側重於簡短的回答回應，忽略了 LRM 的長 CoT 風格輸出。為了彌補這個差距，我們對 LRM 安全性進行系統性研究。首先，我們研究根據人類註解校正的安全評估器。使用我們新開發的指標，我們徹底評估了 12 個最先進的 LRM 在 StrongReject 和 WildJailbreak 資料集上的安全性。我們的結果表明，與其推理進度相比，LRM 並不安全。此外，我們對推理軌跡和最終答案進行了細粒度分析。我們發現三種解碼策略（ZeroThink、LessThink 和 MoreThink）可以在不額外訓練的情況下提高模型安全性。然而，這些策略要么使用受約束的推理軌跡，要么會產生高昂的推論成本。為了進一步加強 LRM 安全性，我們引入了 SafeChain，這是第一個 CoT 風格的安全訓練資料集。我們使用 SafeChain 微調了兩個 LRM，表明它不僅增強了模型安全性，而且在 6 個推理基準測試中都保持了效能。
-
-##### **Teaching LLMs According to Their Aptitude: Adaptive Reasoning for Mathematical Problem Solving**
-2502.12022v1 by Xin Xu, Yan Xu, Tianhao Chen, Yuchen Yan, Chengwu Liu, Zaoyu Chen, Yufei Wang, Yichun Yin, Yasheng Wang, Lifeng Shang, Qun Liu
-
-Existing approaches to mathematical reasoning with large language models
-(LLMs) rely on Chain-of-Thought (CoT) for generalizability or Tool-Integrated
-Reasoning (TIR) for precise computation. While efforts have been made to
-combine these methods, they primarily rely on post-selection or predefined
-strategies, leaving an open question: whether LLMs can autonomously adapt their
-reasoning strategy based on their inherent capabilities. In this work, we
-propose TATA (Teaching LLMs According to Their Aptitude), an adaptive framework
-that enables LLMs to personalize their reasoning strategy spontaneously,
-aligning it with their intrinsic aptitude. TATA incorporates base-LLM-aware
-data selection during supervised fine-tuning (SFT) to tailor training data to
-the model's unique abilities. This approach equips LLMs to autonomously
-determine and apply the appropriate reasoning strategy at test time. We
-evaluate TATA through extensive experiments on six mathematical reasoning
-benchmarks, using both general-purpose and math-specialized LLMs. Empirical
-results demonstrate that TATA effectively combines the complementary strengths
-of CoT and TIR, achieving superior or comparable performance with improved
-inference efficiency compared to TIR alone. Further analysis underscores the
-critical role of aptitude-aware data selection in enabling LLMs to make
-effective and adaptive reasoning decisions and align reasoning strategies with
-model capabilities.
-
-摘要：現有的數學推理方法使用大型語言模型 (LLM) 仰賴思考鏈 (CoT) 來達到泛化性，或使用工具整合推理 (TIR) 來進行精確運算。儘管已有人嘗試結合這些方法，但它們主要依賴後選取或預定義策略，留下一個開放性的問題：LLM 是否能根據其內在能力自主調整其推理策略。在這項工作中，我們提出 TATA（根據其天賦來教授 LLM），這是一個適應性架構，讓 LLM 能夠自發地個人化其推理策略，並與其內在的天賦保持一致。TATA 在監督微調 (SFT) 期間納入了基礎 LLM 感知資料選取，以根據模型的獨特能力調整訓練資料。此方法讓 LLM 能夠在測試時自主決定並套用適當的推理策略。我們透過對六個數學推理基準進行廣泛的實驗來評估 TATA，使用通用和數學專用 LLM。經驗結果顯示，TATA 有效地結合了 CoT 和 TIR 的互補優勢，與僅使用 TIR 相比，達到了優越或相當的效能，並改善了推論效率。進一步的分析強調了天賦感知資料選取在讓 LLM 能夠做出有效且適應性的推理決策，並將推理策略與模型能力保持一致時所扮演的關鍵角色。
-
-##### **Atom of Thoughts for Markov LLM Test-Time Scaling**
-2502.12018v1 by Fengwei Teng, Zhaoyang Yu, Quan Shi, Jiayi Zhang, Chenglin Wu, Yuyu Luo
-
-Large Language Models (LLMs) achieve superior performance through
-training-time scaling, and test-time scaling further enhances their
-capabilities by conducting effective reasoning during inference. However, as
-the scale of reasoning increases, existing test-time scaling methods suffer
-from accumulated historical information, which not only wastes computational
-resources but also interferes with effective reasoning. To address this issue,
-we observe that complex reasoning progress is often achieved by solving a
-sequence of independent subquestions, each being self-contained and verifiable.
-These subquestions are essentially atomic questions, relying primarily on their
-current state rather than accumulated history, similar to the memoryless
-transitions in a Markov process. Based on this observation, we propose Atom of
-Thoughts (AoT), where each state transition in the reasoning process consists
-of decomposing the current question into a dependency-based directed acyclic
-graph and contracting its subquestions, forming a new atomic question state.
-This iterative decomposition-contraction process continues until reaching
-directly solvable atomic questions, naturally realizing Markov transitions
-between question states. Furthermore, these atomic questions can be seamlessly
-integrated into existing test-time scaling methods, enabling AoT to serve as a
-plug-in enhancement for improving reasoning capabilities. Experiments across
-six benchmarks demonstrate the effectiveness of AoT both as a standalone
-framework and a plug-in enhancement. Notably, on HotpotQA, when applied to
-gpt-4o-mini, AoT achieves an 80.6% F1 score, surpassing o3-mini by 3.4% and
-DeepSeek-R1 by 10.6%. The code will be available at
-https://github.com/qixucen/atom.
-
-摘要：大型語言模型 (LLM) 透過訓練時間擴充來達成卓越的效能，而測試時間擴充透過在推論期間進行有效的推理，進一步提升其能力。然而，隨著推理規模的擴大，現有的測試時間擴充方法會受到累積的歷史資訊影響，這不僅會浪費運算資源，還會干擾有效的推理。為了解決這個問題，我們觀察到複雜的推理進程通常是透過解決一系列獨立的子問題來達成，每個子問題都是獨立且可驗證的。這些子問題本質上是原子問題，主要依賴於它們的當前狀態，而不是累積的歷史，類似於馬可夫過程中的無記憶轉換。基於這個觀察，我們提出了思想原子 (AoT)，其中推理過程中每個狀態轉換都包含將當前問題分解為基於依賴關係的有向無環圖，並收縮其子問題，形成新的原子問題狀態。這個反覆的分解收縮過程會持續進行，直到達到可直接解決的原子問題，自然地實現問題狀態之間的馬可夫轉換。此外，這些原子問題可以無縫整合到現有的測試時間擴充方法中，讓 AoT 可以作為外掛程式強化功能，以改善推理能力。橫跨六個基準的實驗證明了 AoT 作為獨立架構和外掛程式強化的有效性。值得注意的是，在 HotpotQA 上，當應用於 gpt-4o-mini 時，AoT 達到了 80.6% 的 F1 分數，比 o3-mini 高出 3.4%，比 DeepSeek-R1 高出 10.6%。程式碼將在 https://github.com/qixucen/atom 上提供。
-
-##### **Demographic Attributes Prediction from Speech Using WavLM Embeddings**
-2502.12007v1 by Yuchen Yang, Thomas Thebaud, Najim Dehak
-
-This paper introduces a general classifier based on WavLM features to infer
-demographic characteristics, such as age, gender, native language, education,
-and country, from speech. Demographic feature prediction plays a crucial role
-in applications like language learning, accessibility, and digital forensics,
-enabling more personalized and inclusive technologies. Leveraging pretrained
-models for embedding extraction, the proposed framework identifies key acoustic
-and linguistic features associated with demographic attributes, achieving a
-Mean Absolute Error (MAE) of 4.94 for age prediction and over 99.81% accuracy
-for gender classification across various datasets. Our system improves upon
-existing models by up to a relative 30% in MAE and up to a relative 10% in accuracy
-and F1 scores across tasks, leveraging a diverse range of datasets and large
-pretrained models to ensure robustness and generalizability. This study offers
-new insights into speaker diversity and provides a strong foundation for future
-research in speech-based demographic profiling.
-
-摘要：本文介紹一個基於 WavLM 特徵的一般分類器，用於從語音中推斷人口特徵，例如年齡、性別、母語、教育和國家。人口特徵預測在語言學習、無障礙性和數位鑑識等應用中扮演著至關重要的角色，能實現更個人化且包容性的技術。利用預先訓練的模型進行嵌入式萃取，提出的架構識別與人口屬性相關的主要音訊和語言特徵，在年齡預測中達到 4.94 的平均絕對誤差 (MAE)，在各種資料集中的性別分類中準確率超過 99.81%。我們的系統在平均絕對誤差上比現有模型提升了相對 30%，在準確率和 F1 分數上提升了相對 10%，利用各種資料集和大型預先訓練模型來確保穩健性和概括性。本研究提供了對說話者多元性的新見解，並為未來基於語音的人口特徵分析研究奠定了堅實的基礎。
-
-##### **Merging Language and Domain Specific Models: The Impact on Technical Vocabulary Acquisition**
-2502.12001v1 by Thibault Rousset, Taisei Kakibuchi, Yusuke Sasaki, Yoshihide Nomura
-
-This paper investigates the integration of technical vocabulary in merged
-language models. We explore the knowledge transfer mechanisms involved when
-combining a general-purpose language-specific model with a domain-specific
-model, focusing on the resulting model's comprehension of technical jargon. Our
-experiments analyze the impact of this merging process on the target model's
-proficiency in handling specialized terminology. We present a quantitative
-evaluation of the performance of the merged model, comparing it with that of
-the individual constituent models. The findings offer insights into the
-effectiveness of different model merging methods for enhancing domain-specific
-knowledge and highlight potential challenges and future directions in
-leveraging these methods for cross-lingual knowledge transfer in Natural
-Language Processing.
-
-摘要：本文探討了技術詞彙在合併語言模型中的整合。我們探討了結合一般用途語言特定模型與特定領域模型時所涉及的知識轉移機制，重點在於所產生模型對技術術語的理解。我們的實驗分析了此合併程序對目標模型處理專業術語能力的影響。我們提出了合併模型效能的量化評估，並將其與個別組成模型的效能進行比較。這些發現提供了見解，說明了不同模型合併方法在增強特定領域知識方面的效能，並強調了利用這些方法進行自然語言處理中跨語言知識轉移的潛在挑戰和未來方向。
-
-##### **Presumed Cultural Identity: How Names Shape LLM Responses**
-2502.11995v1 by Siddhesh Pawar, Arnav Arora, Lucie-Aimée Kaffee, Isabelle Augenstein
-
-Names are deeply tied to human identity. They can serve as markers of
-individuality, cultural heritage, and personal history. However, using names as
-a core indicator of identity can lead to over-simplification of complex
-identities. When interacting with LLMs, user names are an important point of
-information for personalisation. Names can enter chatbot conversations through
-direct user input (requested by chatbots), as part of task contexts such as CV
-reviews, or as built-in memory features that store user information for
-personalisation. We study biases associated with names by measuring cultural
-presumptions in the responses generated by LLMs when presented with common
-suggestion-seeking queries, which might involve making assumptions about the
-user. Our analyses demonstrate strong assumptions about cultural identity
-associated with names present in LLM generations across multiple cultures. Our
-work has implications for designing more nuanced personalisation systems that
-avoid reinforcing stereotypes while maintaining meaningful customisation.
-
-摘要：姓名與人類身分密不可分。它們可以作為個人特質、文化遺產和個人歷史的標記。然而，將姓名作為身分的核心指標可能會導致複雜身分的過度簡化。在與 LLM 互動時，使用者名稱是個人化的重要資訊點。姓名可以透過直接使用者輸入（聊天機器人要求）、作為履歷審查等任務情境的其中一部分，或作為儲存使用者資訊以供個人化的內建記憶功能，進入聊天機器人對話。我們透過衡量 LLM 在面對常見的建議尋求查詢時所產生的回應中的文化預設，來研究與姓名相關的偏見，這可能涉及對使用者的假設。我們的分析顯示，在跨多種文化的 LLM 世代中，與姓名相關的文化身分有強烈的假設。我們的研究對於設計更細緻的個人化系統有影響，這些系統避免強化刻板印象，同時維持有意義的客製化。
-
-##### **Characterizing Photorealism and Artifacts in Diffusion Model-Generated Images**
-2502.11989v1 by Negar Kamali, Karyn Nakamura, Aakriti Kumar, Angelos Chatzimparmpas, Jessica Hullman, Matthew Groh
-
-Diffusion model-generated images can appear indistinguishable from authentic
-photographs, but these images often contain artifacts and implausibilities that
-reveal their AI-generated provenance. Given the challenge to public trust in
-media posed by photorealistic AI-generated images, we conducted a large-scale
-experiment measuring human detection accuracy on 450 diffusion model-generated
-images and 149 real images. Based on collecting 749,828 observations and 34,675
-comments from 50,444 participants, we find that scene complexity of an image,
-artifact types within an image, display time of an image, and human curation of
-AI-generated images all play significant roles in how accurately people
-distinguish real from AI-generated images. Additionally, we propose a taxonomy
-characterizing artifacts often appearing in images generated by diffusion
-models. Our empirical observations and taxonomy offer nuanced insights into the
-capabilities and limitations of diffusion models to generate photorealistic
-images in 2024.
-
-摘要：擴散模型生成的影像看起來可能與真實照片無異，但這些影像通常包含人工智慧生成來源的瑕疵和不合理之處。由於寫實的人工智慧生成影像對公眾對媒體的信任構成挑戰，我們進行了一項大規模實驗，測量人類對 450 張擴散模型生成影像和 149 張真實影像的檢測準確度。根據收集自 50,444 位參與者的 749,828 次觀察和 34,675 則評論，我們發現影像的場景複雜性、影像中的瑕疵類型、影像的顯示時間，以及人類對人工智慧生成影像的策展，在人們準確區分真實影像和人工智慧生成影像方面都扮演重要的角色。此外，我們提出了一種分類法，用於描述經常出現在擴散模型生成的影像中的瑕疵。我們的經驗觀察和分類法為擴散模型在 2024 年生成寫實影像的能力和限制提供了細緻的見解。
-
-##### **Generating Text from Uniform Meaning Representation**
-2502.11973v1 by Emma Markle, Reihaneh Iranmanesh, Shira Wein
-
-Uniform Meaning Representation (UMR) is a recently developed graph-based
-semantic representation, which expands on Abstract Meaning Representation (AMR)
-in a number of ways, in particular through the inclusion of document-level
-information and multilingual flexibility. In order to effectively adopt and
-leverage UMR for downstream tasks, efforts must be placed toward developing a
-UMR technological ecosystem. Although only a limited amount of UMR annotation
-has been produced to date, in this work we investigate the first approaches
-to producing text from multilingual UMR graphs: (1) a pipeline conversion of
-UMR to AMR, then using AMR-to-text generation models, (2) fine-tuning large
-language models with UMR data, and (3) fine-tuning existing AMR-to-text
-generation models with UMR data. Our best performing model achieves a
-multilingual BERTscore of 0.825 for English and 0.882 for Chinese when compared
-to the reference, which is a promising indication of the effectiveness of
-fine-tuning approaches for UMR-to-text generation with even limited amounts of
-UMR data.
-
-摘要：統一語意表示 (UMR) 是一種最近開發的基於圖形的語意表示，它在許多方面擴展了抽象語意表示 (AMR)，特別是透過納入文件層級資訊和多語言靈活性。為了有效採用和利用下游任務的 UMR，必須投入精力開發 UMR 技術生態系統。雖然到目前為止產生的 UMR 標註數量仍然有限，但在這項工作中，我們探討了從多語言 UMR 圖形產生文字的第一種方法：(1) 將 UMR 轉換為 AMR 的管道，然後使用 AMR 轉文字生成模型，(2) 使用 UMR 資料微調大型語言模型，以及 (3) 使用 UMR 資料微調現有的 AMR 轉文字生成模型。與參考相比，我們效能最好的模型在英文中達到 0.825 的多語言 BERT 分數，在中文中達到 0.882，這表示使用 UMR 資料進行 UMR 轉文字生成的微調方法具有良好的效果，即使 UMR 資料數量有限。
-
-##### **Learning Generalizable Prompt for CLIP with Class Similarity Knowledge**
-2502.11969v1 by Sehun Jung, Hyang-won Lee
-
-In vision-language models (VLMs), prompt tuning has shown its effectiveness
-in adapting models to downstream tasks. However, learned prompts struggle to
-generalize to unseen classes, as they tend to overfit to the classes that are
-targeted during prompt tuning. Examining failure cases, we observed that
-learned prompts disrupt the semantics of unseen classes, generating text
-embeddings with incorrect semantic relationships among classes. To address
-this, we propose Similarity Alignment Regularization (SAR), which regularizes
-learnable prompts to preserve the semantic relationships among classes captured
-by hand-crafted prompts. Specifically, we first obtain novel classes related to
-base classes using ChatGPT-4o and utilize them as potential unseen classes
-during prompt tuning. Then, by targeting both base and novel classes, SAR
-aligns the similarity relationships among text embeddings generated by
-learnable prompts with the similarity relationships from hand-crafted prompts.
-Extensive experiments applying SAR to existing prompt tuning methods
-demonstrate its effectiveness in improving generalization to unseen classes.
-
-摘要：在視覺語言模型 (VLM) 中，提示調整已展現其在調整模型至下游任務上的效能。然而，已學習的提示難以推廣至未見類別，因為它們傾向於過度擬合提示調整期間所鎖定的類別。在檢視失敗案例時，我們觀察到已學習的提示會擾亂未見類別的語義，產生具有類別間不正確語義關係的文字嵌入。為了解決此問題，我們提出相似度對齊正則化 (SAR)，它會對可學習提示進行正則化，以保留由手工提示捕捉到的類別間語義關係。具體來說，我們首先使用 ChatGPT-4o 取得與基本類別相關的新穎類別，並在提示調整期間將它們用作潛在的未見類別。然後，透過鎖定基本類別和新穎類別，SAR 會將可學習提示產生的文字嵌入之間的相似度關係與手工提示的相似度關係對齊。將 SAR 應用於現有提示調整方法的廣泛實驗證明了其在改善對未見類別的概括上的效能。
-
-##### **A MIMO Wireless Channel Foundation Model via CIR-CSI Consistency**
-2502.11965v1 by Jun Jiang, Wenjun Yu, Yunfan Li, Yuan Gao, Shugong Xu
-
-In the field of artificial intelligence, self-supervised learning has
-demonstrated superior generalization capabilities by leveraging large-scale
-unlabeled datasets for pretraining, which is especially critical for wireless
-communication models to adapt to a variety of scenarios. This paper
-innovatively treats Channel State Information (CSI) and Channel Impulse
-Response (CIR) as naturally aligned multi-modal data and proposes the first
-MIMO wireless channel foundation model, named CSI-CLIP. By effectively
-capturing the joint representations of both CIR and CSI, CSI-CLIP exhibits
-remarkable adaptability across scenarios and robust feature extraction
-capabilities. Experimental results show that, in the positioning task, CSI-CLIP
-reduces the mean error distance by 22%; in the beam management task, it
-increases accuracy by 1% compared to traditional supervised methods, and it
-likewise improves accuracy in the channel identification task. These improvements not only highlight the
-potential and value of CSI-CLIP in integrating sensing and communication but
-also demonstrate its significant advantages over existing techniques. Moreover,
-viewing CSI and CIR as multi-modal pairs and applying contrastive learning to
-wireless channel foundation models open up new research directions in the domain of MIMO
-wireless communications.
-
-摘要：在人工智能领域，自监督学习通过利用大规模无标签数据集进行预训练，展示了卓越的泛化能力，这对于无线通信模型适应各种场景尤为关键。本文创新地将信道状态信息 (CSI) 和信道脉冲响应 (CIR) 视为自然对齐的多模态数据，并提出了第一个 MIMO 无线信道基础模型，名为 CSI-CLIP。通过有效捕获 CIR 和 CSI 的联合表示，CSI-CLIP 在各种场景中表现出卓越的适应性和强大的特征提取能力。实验结果表明，在定位任务中，CSI-CLIP 将平均误差距离减少了 22%；在波束管理任务中，与传统的监督方法相比，其准确度提高了 1%，以及在信道识别任务中。这些改进不仅突出了 CSI-CLIP 在集成感知和通信方面的潜力和价值，而且还展示了其相对于现有技术的显着优势。此外，将 CSI 和 CIR 视为多模态对，并对比学习无线信道基础模型，为 MIMO 无线通信领域开辟了新的研究方向。
-
-##### **Navigating the Helpfulness-Truthfulness Trade-Off with Uncertainty-Aware Instruction Fine-Tuning**
-2502.11962v1 by Tianyi Wu, Jingwei Ni, Bryan Hooi, Jiaheng Zhang, Elliott Ash, See-Kiong Ng, Mrinmaya Sachan, Markus Leippold
-
-Instruction Fine-tuning (IFT) can enhance the helpfulness of Large Language
-Models (LLMs), but it may lower their truthfulness. This trade-off arises
-because IFT steers LLMs to generate responses with long-tail knowledge that is
-not well covered during pre-training, leading to more informative but less
-truthful answers when generalizing to unseen tasks. In this paper, we
-empirically demonstrate this helpfulness-truthfulness trade-off in IFT and
-propose $\textbf{UNIT}$, a novel IFT paradigm to address it. UNIT teaches LLMs
-to recognize their uncertainty and explicitly reflect it at the end of their
-responses. Experimental results show that UNIT-tuned models maintain their
-helpfulness while distinguishing between certain and uncertain claims, thereby
-reducing hallucinations.
-
-摘要：指令微調 (IFT) 可以提升大型語言模型 (LLM) 的實用性，但可能會降低其真實性。這種取捨會出現，是因為 IFT 引導 LLM 生成具有長尾知識的回應，而這些知識在預訓練期間並未充分涵蓋，導致在推廣到未見任務時，答案更具資訊性，但真實性較低。在本文中，我們透過實證展示 IFT 中的這種實用性與真實性取捨，並提出一個新穎的 IFT 典範 $\textbf{UNIT}$ 來解決這個問題。UNIT 教導 LLM 辨識其不確定性，並明確反映在其回應的結尾。實驗結果顯示，經過 UNIT 微調的模型維持其實用性，同時區分確定和不確定的說法，從而減少幻覺。
-
-##### **STRIVE: Structured Reasoning for Self-Improvement in Claim Verification**
-2502.11959v1 by Haisong Gong, Jing Li, Junfei Wu, Qiang Liu, Shu Wu, Liang Wang
-
-Claim verification is the task of determining whether a claim is supported or
-refuted by evidence. Self-improvement methods, where reasoning chains are
-generated and those leading to correct results are selected for training, have
-succeeded in tasks like mathematical problem solving. However, in claim
-verification, this approach struggles. Low-quality reasoning chains may falsely
-match binary truth labels, introducing faulty reasoning into the
-self-improvement process and ultimately degrading performance. To address this,
-we propose STRIVE: Structured Reasoning for Self-Improved Verification. Our
-method introduces a structured reasoning design with Claim Decomposition,
-Entity Analysis, and Evidence Grounding Verification. These components improve
-reasoning quality, reduce errors, and provide additional supervision signals
-for self-improvement. STRIVE begins with a warm-up phase, where the base model
-is fine-tuned on a small number of annotated examples to learn the structured
-reasoning design. It is then applied to generate reasoning chains for all
-training examples, selecting only those that are correct and structurally sound
-for subsequent self-improvement training. We demonstrate that STRIVE achieves
-significant improvements over baseline models, with a 31.4% performance gain
-over the base model and 20.7% over Chain of Thought on the HOVER datasets,
-highlighting its effectiveness.
-
-摘要：聲明驗證的任務是確定聲明是否受到證據支持或反駁。自改善方法（產生推理鏈並選擇導致正確結果的鏈進行訓練）已成功應用於數學問題求解等任務。然而，在聲明驗證中，此方法會遇到困難。低品質的推理鏈可能錯誤地匹配二元真值標籤，將錯誤的推理引入自改善流程並最終降低效能。為了解決此問題，我們提出 STRIVE：結構化推理自改善驗證。我們的模型引入了結構化推理設計，包含聲明分解、實體分析和證據依據驗證。這些組件改善了推理品質、減少了錯誤，並為自改善提供了額外的監督訊號。STRIVE 從熱身階段開始，在少數標註範例上微調基礎模型以學習結構化推理設計。接著將其應用於為所有訓練範例產生推理鏈，僅選擇正確且結構上合理的推理鏈進行後續的自改善訓練。我們證明 STRIVE 獲得了顯著的改善，在 HOVER 資料集上，效能比基礎模型提升了 31.4%，比 Chain of Thought 提升了 20.7%，突顯了其有效性。
-
-##### **Can Your Uncertainty Scores Detect Hallucinated Entity?**
-2502.11948v1 by Min-Hsuan Yeh, Max Kamachee, Seongheon Park, Yixuan Li
-
-To mitigate the impact of the hallucinatory nature of LLMs, many studies propose
-detecting hallucinated generation through uncertainty estimation. However,
-these approaches predominantly operate at the sentence or paragraph level,
-failing to pinpoint specific spans or entities responsible for hallucinated
-content. This lack of granularity is especially problematic for long-form
-outputs that mix accurate and fabricated information. To address this
-limitation, we explore entity-level hallucination detection. We propose a new
-dataset, HalluEntity, which annotates hallucination at the entity level. Based
-on the dataset, we comprehensively evaluate uncertainty-based hallucination
-detection approaches across 17 modern LLMs. Our experimental results show that
-uncertainty estimation approaches focusing on individual token probabilities
-tend to over-predict hallucinations, while context-aware methods show better
-but still suboptimal performance. Through an in-depth qualitative study, we
-identify relationships between hallucination tendencies and linguistic
-properties and highlight important directions for future research.
-
-摘要：為了減輕 LLM 幻覺性質的影響，許多研究提出透過不確定性估計來偵測幻覺產生的內容。然而，這些方法主要是在句子或段落層級運作，無法精確找出對幻覺內容負責的特定區間或實體。這種缺乏粒度的現象對於混合了準確和虛構資訊的長篇輸出內容來說尤其成問題。為了解決這個限制，我們探討了實體層級的幻覺偵測。我們提出了一個新的資料集 HalluEntity，其中註解了實體層級的幻覺。根據該資料集，我們全面評估了 17 種現代 LLM 的基於不確定性的幻覺偵測方法。我們的實驗結果顯示，專注於個別代幣機率的不確定性估計方法傾向於過度預測幻覺，而具備背景感知能力的方法則表現得更好，但仍未達到最佳狀態。透過深入的定性研究，我們找出幻覺傾向與語言特徵之間的關係，並強調未來研究的重要方向。
-
-##### **Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction**
-2502.11946v1 by Ailin Huang, Boyong Wu, Bruce Wang, Chao Yan, Chen Hu, Chengli Feng, Fei Tian, Feiyu Shen, Jingbei Li, Mingrui Chen, Peng Liu, Ruihang Miao, Wang You, Xi Chen, Xuerui Yang, Yechang Huang, Yuxiang Zhang, Zheng Gong, Zixin Zhang, Brian Li, Changyi Wan, Hanpeng Hu, Ranchen Ming, Song Yuan, Xuelin Zhang, Yu Zhou, Bingxin Li, Buyun Ma, Kang An, Wei Ji, Wen Li, Xuan Wen, Yuankai Ma, Yuanwei Liang, Yun Mou, Bahtiyar Ahmidi, Bin Wang, Bo Li, Changxin Miao, Chen Xu, Chengting Feng, Chenrun Wang, Dapeng Shi, Deshan Sun, Dingyuan Hu, Dula Sai, Enle Liu, Guanzhe Huang, Gulin Yan, Heng Wang, Haonan Jia, Haoyang Zhang, Jiahao Gong, Jianchang Wu, Jiahong Liu, Jianjian Sun, Jiangjie Zhen, Jie Feng, Jie Wu, Jiaoren Wu, Jie Yang, Jinguo Wang, Jingyang Zhang, Junzhe Lin, Kaixiang Li, Lei Xia, Li Zhou, Longlong Gu, Mei Chen, Menglin Wu, Ming Li, Mingxiao Li, Mingyao Liang, Na Wang, Nie Hao, Qiling Wu, Qinyuan Tan, Shaoliang Pang, Shiliang Yang, Shuli Gao, Siqi Liu, Sitong Liu, Tiancheng Cao, Tianyu Wang, Wenjin Deng, Wenqing He, Wen Sun, Xin Han, Xiaomin Deng, Xiaojia Liu, Xu Zhao, Yanan Wei, Yanbo Yu, Yang Cao, Yangguang Li, Yangzhen Ma, Yanming Xu, Yaqiang Shi, Yilei Wang, Yinmin Zhong, Yu Luo, Yuanwei Lu, Yuhe Yin, Yuting Yan, Yuxiang Yang, Zhe Xie, Zheng Ge, Zheng Sun, Zhewei Huang, Zhichao Chang, Zidong Yang, Zili Zhang, Binxing Jiao, Daxin Jiang, Heung-Yeung Shum, Jiansheng Chen, Jing Li, Shuchang Zhou, Xiangyu Zhang, Xinhao Zhang, Yibo Zhu
-
-Real-time speech interaction, serving as a fundamental interface for
-human-machine collaboration, holds immense potential. However, current
-open-source models face limitations such as high costs in voice data
-collection, weakness in dynamic control, and limited intelligence. To address
-these challenges, this paper introduces Step-Audio, the first production-ready
-open-source solution. Key contributions include: 1) a 130B-parameter unified
-speech-text multi-modal model that achieves unified understanding and
-generation, with the Step-Audio-Chat version open-sourced; 2) a generative
-speech data engine that establishes an affordable voice cloning framework and
-produces the open-sourced lightweight Step-Audio-TTS-3B model through
-distillation; 3) an instruction-driven fine control system enabling dynamic
-adjustments across dialects, emotions, singing, and RAP; 4) an enhanced
-cognitive architecture augmented with tool calling and role-playing abilities
-to manage complex tasks effectively. Based on our new StepEval-Audio-360
-evaluation benchmark, Step-Audio achieves state-of-the-art performance in human
-evaluations, especially in terms of instruction following. On open-source
-benchmarks like LLaMA Question, Step-Audio shows a 9.3% average performance improvement,
-demonstrating our commitment to advancing the development of open-source
-multi-modal language technologies. Our code and models are available at
-https://github.com/stepfun-ai/Step-Audio.
-
-摘要：即時語音互動作為人機協作的基本介面，蘊含著巨大的潛力。然而，目前的開源模型面臨著語音數據收集成本高、動態控制能力弱、智慧有限等限制。為了應對這些挑戰，本文介紹了 Step-Audio，這是第一個可投入生產的開源解決方案。主要貢獻包括：1) 一個 130B 參數的統一語音文字多模態模型，實現了統一的理解和生成，其中 Step-Audio-Chat 版本已開源；2) 一個生成式語音數據引擎，建立了一個經濟實惠的語音克隆框架，並通過蒸餾技術產生了開源的輕量級 Step-Audio-TTS-3B 模型；3) 一個指令驅動的精細控制系統，實現了跨方言、情緒、唱歌和饒舌的動態調整；4) 一個增強的認知架構，增加了工具呼叫和角色扮演的能力，以有效地管理複雜的任務。根據我們新的 StepEval-Audio-360 評估基準，Step-Audio 在人類評估中實現了最先進的性能，特別是在指令遵循方面。在 LLaMA Question 等開源基準測試中，表現出平均提升了 9.3%，證明了我們致力於推進開源多模態語言技術的發展。我們的程式碼和模型可在 https://github.com/stepfun-ai/Step-Audio 取得。
-
-##### **Deep Spatio-Temporal Neural Network for Air Quality Reanalysis**
-2502.11941v1 by Ammar Kheder, Benjamin Foreback, Lili Wang, Zhi-Song Liu, Michael Boy
-
-Air quality prediction is key to mitigating health impacts and guiding
-decisions, yet existing models tend to focus on temporal trends while
-overlooking spatial generalization. We propose AQ-Net, a spatiotemporal
-reanalysis model for both observed and unobserved stations in the near future.
-AQ-Net uses an LSTM and multi-head attention for the temporal regression.
-We also propose a cyclic encoding technique to ensure continuous time
-representation. To learn fine-grained spatial air quality estimation, we
-incorporate AQ-Net with the neural kNN to explore feature-based interpolation,
-such that we can fill the spatial gaps given coarse observation stations. To
-demonstrate the efficiency of our model for spatiotemporal reanalysis, we use
-data from 2013-2017 collected in northern China for PM2.5 analysis. Extensive
-experiments show that AQ-Net excels in air quality reanalysis, highlighting the
-potential of hybrid spatio-temporal models to better capture environmental
-dynamics, especially in urban areas where both spatial and temporal variability
-are critical.
-
-摘要：空气品质预测是减轻健康影响和指导决策的关键，但现有的模型倾向于关注时间趋势，而忽略空间概化。我们提出了 AQ-Net，这是一种时空再分析模型，适用于近期内已观测和未观测到的站点。AQ-Net 利用 LSTM 和多头注意力进行时间回归。我们还提出了一种循环编码技术来确保时间表示的连续性。为了学习细粒度的空间空气质量估计，我们将 AQ-Net 与神经 kNN 结合起来，以探索基于特征的插值，以便我们能够填充给定粗略观测站的空间空白。为了展示我们的模型在时空再分析中的效率，我们使用了 2013-2017 年在中国北部收集的 PM2.5 分析数据。大量的实验表明，AQ-Net 在空气质量再分析中表现出色，突出了混合时空模型在更好地捕捉环境动态方面的潜力，尤其是在空间和时间变异性都很关键的城市地区。
-
-##### **FitLight: Federated Imitation Learning for Plug-and-Play Autonomous Traffic Signal Control**
-2502.11937v1 by Yutong Ye, Yingbo Zhou, Zhusen Liu, Xiao Du, Hao Zhou, Xiang Lian, Mingsong Chen
-
-Although Reinforcement Learning (RL)-based Traffic Signal Control (TSC)
-methods have been extensively studied, their practical applications still raise
-some serious issues such as high learning cost and poor generalizability. This
-is because the "trial-and-error" training style makes RL agents extremely
-dependent on the specific traffic environment, which also requires a long
-convergence time. To address these issues, we propose a novel Federated
-Imitation Learning (FIL)-based framework for multi-intersection TSC, named
-FitLight, which allows RL agents to plug-and-play for any traffic environment
-without additional pre-training cost. Unlike existing imitation learning
-approaches that rely on pre-training RL agents with demonstrations, FitLight
-allows real-time imitation learning and seamless transition to reinforcement
-learning. Due to our proposed knowledge-sharing mechanism and novel hybrid
-pressure-based agent design, RL agents can quickly find the best control policy
-with only a few episodes. Moreover, for resource-constrained TSC scenarios,
-FitLight supports model pruning and heterogeneous model aggregation, such that
-RL agents can work on a micro-controller with merely 16 KB RAM and 32 KB ROM.
-Extensive experiments demonstrate that, compared to state-of-the-art
-methods, FitLight not only provides a superior starting point but also
-converges to a better final solution on both real-world and synthetic datasets,
-even under extreme resource limitations.
-
-摘要：儘管基於強化學習 (RL) 的交通號誌控制 (TSC) 方法已經廣泛研究，但其實際應用仍會產生一些嚴重的問題，例如學習成本高和泛化能力差。這是因為「試錯法」訓練風格讓 RL 代理極度依賴特定的交通環境，這也需要很長的收斂時間。為了解決這些問題，我們提出一個名為 FitLight 的基於聯邦模仿學習 (FIL) 的多路口 TSC 框架，讓 RL 代理可以即插即用於任何交通環境，而無需額外的預訓練成本。與依賴使用示範預訓練 RL 代理的現有模仿學習方法不同，FitLight 允許即時模仿學習和無縫過渡到強化學習。由於我們提出的知識共享機制和新穎的基於壓力的混合代理設計，RL 代理只需幾個回合即可快速找到最佳控制策略。此外，對於資源受限的 TSC 場景，FitLight 支援模型剪枝和異質模型聚合，讓 RL 代理可以在僅有 16 KB RAM 和 32 KB ROM 的微控制器上運行。廣泛的實驗證明，與最先進的方法相比，FitLight 不僅提供了更好的起點，而且在實際和合成資料集上都能收斂到更好的最終解決方案，即使在極端的資源限制下也是如此。
-
-##### **On Representational Dissociation of Language and Arithmetic in Large Language Models**
-2502.11932v1 by Riku Kisako, Tatsuki Kuribayashi, Ryohei Sasano
-
-The association between language and (non-linguistic) thinking ability in
-humans has long been debated, and recently, neuroscientific evidence of brain
-activity patterns has been considered. Such a scientific context naturally
-raises an interdisciplinary question -- what about such a language-thought
-dissociation in large language models (LLMs)? In this paper, as an initial
-foray, we explore this question by focusing on simple arithmetic skills (e.g.,
-$1+2=$ ?) as a thinking ability and analyzing the geometry of their encoding in
-LLMs' representation space. Our experiments with linear classifiers and cluster
-separability tests demonstrate that simple arithmetic equations and general
-language input are encoded in completely separated regions in LLMs' internal
-representation space across all the layers, which is also supported with more
-controlled stimuli (e.g., spelled-out equations). These tentatively suggest
-that arithmetic reasoning is mapped into a distinct region from general
-language input, which is in line with the neuroscientific observations of human
-brain activations, while we also point out their somewhat cognitively
-implausible geometric properties.
-
-摘要：人類語言與（非語言）思考能力之間的關聯性長期以來一直備受爭論，而最近，神經科學證據中的大腦活動模式也已受到考量。這樣一個科學背景自然會引發一個跨領域問題——大型語言模型（LLM）中這種語言與思考的分離又是如何？在本文中，作為初步探討，我們透過專注於簡單的算術技能（例如 $1+2=$？）作為思考能力，並分析它們在 LLM 表徵空間中的編碼幾何形狀來探討這個問題。我們透過線性分類器和群集可分性測試進行的實驗證明，簡單的算術方程式和一般語言輸入在 LLM 的內部表徵空間中所有層中都是以完全分離的區域編碼，這也獲得了更受控刺激（例如，拼寫出的方程式）的支持。這些初步表明算術推理被映射到與一般語言輸入不同的區域，這與人類大腦活化的神經科學觀察結果一致，同時我們也指出了它們在認知上有些難以置信的幾何屬性。
-
-##### **BRIGHTER: BRIdging the Gap in Human-Annotated Textual Emotion Recognition Datasets for 28 Languages**
-2502.11926v1 by Shamsuddeen Hassan Muhammad, Nedjma Ousidhoum, Idris Abdulmumin, Jan Philip Wahle, Terry Ruas, Meriem Beloucif, Christine de Kock, Nirmal Surange, Daniela Teodorescu, Ibrahim Said Ahmad, David Ifeoluwa Adelani, Alham Fikri Aji, Felermino D. M. A. Ali, Ilseyar Alimova, Vladimir Araujo, Nikolay Babakov, Naomi Baes, Ana-Maria Bucur, Andiswa Bukula, Guanqun Cao, Rodrigo Tufino Cardenas, Rendi Chevi, Chiamaka Ijeoma Chukwuneke, Alexandra Ciobotaru, Daryna Dementieva, Murja Sani Gadanya, Robert Geislinger, Bela Gipp, Oumaima Hourrane, Oana Ignat, Falalu Ibrahim Lawan, Rooweither Mabuya, Rahmad Mahendra, Vukosi Marivate, Andrew Piper, Alexander Panchenko, Charles Henrique Porto Ferreira, Vitaly Protasov, Samuel Rutunda, Manish Shrivastava, Aura Cristina Udrea, Lilian Diana Awuor Wanzare, Sophie Wu, Florian Valentin Wunderlich, Hanif Muhammad Zhafran, Tianhui Zhang, Yi Zhou, Saif M. Mohammad
-
-People worldwide use language in subtle and complex ways to express emotions.
-While emotion recognition -- an umbrella term for several NLP tasks --
-significantly impacts different applications in NLP and other fields, most work
-in the area is focused on high-resource languages. Therefore, this has led to
-major disparities in research and proposed solutions, especially for
-low-resource languages that suffer from the lack of high-quality datasets. In
-this paper, we present BRIGHTER -- a collection of multilabeled
-emotion-annotated datasets in 28 different languages. BRIGHTER covers
-predominantly low-resource languages from Africa, Asia, Eastern Europe, and
-Latin America, with instances from various domains annotated by fluent
-speakers. We describe the data collection and annotation processes and the
-challenges of building these datasets. Then, we report different experimental
-results for monolingual and crosslingual multi-label emotion identification, as
-well as intensity-level emotion recognition. We investigate results with and
-without using LLMs and analyse the large variability in performance across
-languages and text domains. We show that BRIGHTER datasets are a step towards
-bridging the gap in text-based emotion recognition and discuss their impact and
-utility.
-
-摘要：全球各地的人們都以微妙且複雜的方式使用語言來表達情感。雖然情緒辨識——幾個 NLP 任務的總稱——顯著影響 NLP 及其他領域中的不同應用，但該領域中的大部分工作都集中於高資源語言。因此，這導致研究和提出的解決方案出現重大差異，特別是對於缺乏高品質資料集的低資源語言。在本文中，我們提出 BRIGHTER——一個由 28 種不同語言組成的多標記情緒標註資料集。BRIGHTER 主要涵蓋來自非洲、亞洲、東歐和拉丁美洲的低資源語言，其中包含由流利講者標註的來自不同領域的實例。我們描述了資料收集和標註流程以及建立這些資料集的挑戰。然後，我們報告了單語和跨語言多標籤情緒識別的不同實驗結果，以及強度級別的情緒識別。我們研究了使用和不使用 LLM 的結果，並分析了跨語言和文字領域的性能的巨大變異。我們表明，BRIGHTER 資料集是縮小基於文字的情緒識別差距的一步，並討論了它們的影響和效用。
-
-##### **GRAPHGPT-O: Synergistic Multimodal Comprehension and Generation on Graphs**
-2502.11925v1 by Yi Fang, Bowen Jin, Jiacheng Shen, Sirui Ding, Qiaoyu Tan, Jiawei Han
-
-The rapid development of Multimodal Large Language Models (MLLMs) has enabled
-the integration of multiple modalities, including texts and images, within the
-large language model (LLM) framework. However, texts and images are usually
-interconnected, forming a multimodal attributed graph (MMAG). It is
-underexplored how MLLMs can incorporate the relational information
-(i.e., graph structure) and semantic information (i.e., texts
-and images) on such graphs for multimodal comprehension and generation. In this
-paper, we propose GraphGPT-o, which supports omni-multimodal understanding and
-creation on MMAGs. We first comprehensively study linearization variants to
-transform semantic and structural information as input for MLLMs. Then, we
-propose a hierarchical aligner that enables deep graph encoding, bridging the
-gap between MMAGs and MLLMs. Finally, we explore the inference choices,
-adapting MLLM to interleaved text and image generation in graph scenarios.
-Extensive experiments on three datasets from different domains demonstrate the
-effectiveness of our proposed method. Datasets and codes will be open-sourced
-upon acceptance.
-
-摘要：多模态大语言模型 (MLLM) 的快速发展，促进了文本和图像等多种模态在大型语言模型 (LLM) 框架内的整合。然而，文本和图像通常是相互关联的，形成多模态属性图 (MMAG)。对于 MLLM 如何整合此类图上的关系信息（即图结构）和语义信息（即文本和图像）以进行多模态理解和生成，目前仍未得到充分探索。在本文中，我们提出了 GraphGPT-o，它支持在 MMAG 上进行全方位多模态理解和创建。我们首先全面研究了线性化变体，以将语义和结构信息转换为 MLLM 的输入。然后，我们提出了一个分层对齐器，它支持深度图编码，弥合了 MMAG 和 MLLM 之间的差距。最后，我们探索了推理选择，使 MLLM 适应图场景中交错的文本和图像生成。来自不同领域的三组数据集上的大量实验表明了我们提出的方法的有效性。数据集和代码将在被接受后开源。
-
-##### **From Text to Trust: Empowering AI-assisted Decision Making with Adaptive LLM-powered Analysis**
-2502.11919v1 by Zhuoyan Li, Hangxiao Zhu, Zhuoran Lu, Ziang Xiao, Ming Yin
-
-AI-assisted decision making is becoming increasingly prevalent, yet individuals
-often fail to utilize AI-based decision aids appropriately, especially when
-AI explanations are absent, potentially because they do not reflect critically on
-AI's decision recommendations. Large language models (LLMs), with
-their exceptional conversational and analytical capabilities, present great
-opportunities to enhance AI-assisted decision making in the absence of AI
-explanations by providing natural-language-based analysis of AI's decision
-recommendation, e.g., how each feature of a decision making task might
-contribute to the AI recommendation. In this paper, via a randomized
-experiment, we first show that presenting LLM-powered analysis of each task
-feature, either sequentially or concurrently, does not significantly improve
-people's AI-assisted decision performance. To enable decision makers to better
-leverage LLM-powered analysis, we then propose an algorithmic framework to
-characterize the effects of LLM-powered analysis on human decisions and
-dynamically decide which analysis to present. Our evaluation with human
-subjects shows that this approach effectively improves decision makers'
-appropriate reliance on AI in AI-assisted decision making.
-
-摘要：隨著 AI 輔助決策越來越普遍，但個人常常無法適當地利用 AI 決策輔助，特別是在沒有 AI 解釋的情況下，潛在原因是他們無法批判性地理解 AI 的決策建議。大型語言模型 (LLM) 擁有卓越的對話和分析能力，在沒有 AI 解釋的情況下，透過提供基於自然語言的 AI 決策建議分析，例如決策任務的每個特徵如何影響 AI 建議，為增強 AI 輔助決策提供了絕佳的機會。在本文中，我們透過隨機實驗，首先展示了以循序或並行的方式呈現 LLM 分析的每個任務特徵，並未顯著改善人們的 AI 輔助決策表現。為了讓決策者能更好地利用 LLM 分析，我們接著提出了演算法架構，用於描述 LLM 分析對人類決策的影響，並動態決定要呈現哪種分析。我們對人類受試者的評估顯示，這種方法有效地改善了決策者在 AI 輔助決策中對 AI 的適當依賴性。
-
-##### **EssayJudge: A Multi-Granular Benchmark for Assessing Automated Essay Scoring Capabilities of Multimodal Large Language Models**
-2502.11916v1 by Jiamin Su, Yibo Yan, Fangteng Fu, Han Zhang, Jingheng Ye, Xiang Liu, Jiahao Huo, Huiyu Zhou, Xuming Hu
-
-Automated Essay Scoring (AES) plays a crucial role in educational assessment
-by providing scalable and consistent evaluations of writing tasks. However,
-traditional AES systems face three major challenges: (1) reliance on
-handcrafted features that limit generalizability, (2) difficulty in capturing
-fine-grained traits like coherence and argumentation, and (3) inability to
-handle multimodal contexts. In the era of Multimodal Large Language Models
-(MLLMs), we propose EssayJudge, the first multimodal benchmark to evaluate AES
-capabilities across lexical-, sentence-, and discourse-level traits. By
-leveraging MLLMs' strengths in trait-specific scoring and multimodal context
-understanding, EssayJudge aims to offer precise, context-rich evaluations
-without manual feature engineering, addressing longstanding AES limitations.
-Our experiments with 18 representative MLLMs reveal gaps in AES performance
-compared to human evaluation, particularly in discourse-level traits,
-highlighting the need for further advancements in MLLM-based AES research. Our
-dataset and code will be available upon acceptance.
-
-摘要：自動化論文評分 (AES) 在教育評量中扮演著重要的角色，它能提供可擴充且一致的寫作任務評量。然而，傳統的 AES 系統面臨了三個主要的挑戰：(1) 依賴於限制泛用性的手工特徵，(2) 難以捕捉連貫性和論證等細微特徵，以及 (3) 無法處理多模態的脈絡。在多模態大型語言模型 (MLLM) 的時代，我們提出了 EssayJudge，這是第一個評估 AES 能力的多模態基準，橫跨詞彙、句子和篇章層級的特徵。EssayJudge 透過利用 MLLM 在特定特徵評分和多模態脈絡理解方面的優勢，旨在提供精確且富含脈絡的評量，而無需手動特徵工程，進而解決長久以來的 AES 限制。我們針對 18 個具代表性的 MLLM 進行的實驗揭露了 AES 效能與人類評量之間的差距，特別是在篇章層級的特徵，這凸顯了 MLLM 為基礎的 AES 研究需要進一步的進展。我們的資料集和程式碼將在通過驗證後提供。
-
-##### **On the robustness of ChatGPT in teaching Korean Mathematics**
-2502.11915v1 by Phuong-Nam Nguyen, Quang Nguyen-The, An Vu-Minh, Diep-Anh Nguyen, Xuan-Lam Pham
-
-ChatGPT, an Artificial Intelligence model, has the potential to revolutionize
-education. However, its effectiveness in solving non-English questions remains
-uncertain. This study evaluates ChatGPT's robustness using 586 Korean
-mathematics questions. ChatGPT achieves 66.72% accuracy, correctly answering
-391 out of 586 questions. We also assess its ability to rate mathematics
-questions based on eleven criteria and perform a topic analysis. Our findings
-show that ChatGPT's ratings align with educational theory and test-taker
-perspectives. While ChatGPT performs well in question classification, it
-struggles with non-English contexts, highlighting areas for improvement. Future
-research should address linguistic biases and enhance accuracy across diverse
-languages. Domain-specific optimizations and multilingual training could
-improve ChatGPT's role in personalized education.
-
-摘要：ChatGPT，一種人工智慧模型，具有革新教育的潛力。然而，其解決非英語問題的有效性仍不確定。本研究使用 586 個韓語數學問題評估 ChatGPT 的健壯性。ChatGPT 達到 66.72% 的準確率，正確回答了 586 個問題中的 391 個。我們也評估其根據 11 個標準對數學問題進行評分並執行主題分析的能力。我們的研究結果顯示，ChatGPT 的評分與教育理論和應試者的觀點一致。儘管 ChatGPT 在問題分類中表現良好，但它在非英語語境中表現不佳，突顯出需要改進的地方。未來的研究應解決語言偏見並提高跨不同語言的準確性。特定領域的優化和多語言訓練可以提升 ChatGPT 在個人化教育中的作用。
-
-##### **MMRC: A Large-Scale Benchmark for Understanding Multimodal Large Language Model in Real-World Conversation**
-2502.11903v1 by Haochen Xue, Feilong Tang, Ming Hu, Yexin Liu, Qidong Huang, Yulong Li, Chengzhi Liu, Zhongxing Xu, Chong Zhang, Chun-Mei Feng, Yutong Xie, Imran Razzak, Zongyuan Ge, Jionglong Su, Junjun He, Yu Qiao
-
-Recent multimodal large language models (MLLMs) have demonstrated significant
-potential in open-ended conversation, generating more accurate and personalized
-responses. However, their abilities to memorize, recall, and reason in
-sustained interactions within real-world scenarios remain underexplored. This
-paper introduces MMRC, a Multi-Modal Real-world Conversation benchmark for
-evaluating six core open-ended abilities of MLLMs: information extraction,
-multi-turn reasoning, information update, image management, memory recall, and
-answer refusal. With data collected from real-world scenarios, MMRC comprises
-5,120 conversations and 28,720 corresponding manually labeled questions, posing
-a significant challenge to existing MLLMs. Evaluations on 20 MLLMs in MMRC
-indicate an accuracy drop during open-ended interactions. We identify four
-common failure patterns: long-term memory degradation, inadequacies in updating
-factual knowledge, accumulated assumption of error propagation, and reluctance
-to say no. To mitigate these issues, we propose a simple yet effective
-NOTE-TAKING strategy, which can record key information from the conversation
-and remind the model during its responses, enhancing conversational
-capabilities. Experiments across six MLLMs demonstrate significant performance
-improvements.
-
-摘要：最近的多模态大型语言模型 (MLLM) 已在开放式对话中展现出显著的潜力，产生更准确且个性化的回应。然而，它们在现实世界场景中持续互动中的记忆、回忆和推理能力仍未得到充分探索。本文介绍了 MMRC，一个多模态现实世界对话基准，用于评估 MLLM 的六项核心开放式能力：信息提取、多轮推理、信息更新、图像管理、记忆回忆和答案拒绝。通过从现实世界场景中收集的数据，MMRC 包含 5,120 个对话和 28,720 个相应的手动标记问题，对现有的 MLLM 构成了重大挑战。在 MMRC 中对 20 个 MLLM 的评估表明，在开放式互动期间准确性下降。我们确定了四种常见的故障模式：长期记忆退化、更新事实知识的不足、累积的错误传播假设以及不愿说不。为了减轻这些问题，我们提出了一种简单但有效的笔记策略，它可以记录对话中的关键信息并在模型响应期间提醒模型，从而增强对话能力。六个 MLLM 的实验表明了显著的性能改进。
-
-##### **Building A Proof-Oriented Programmer That Is 64% Better Than GPT-4o Under Data Scarsity**
-2502.11901v1 by Dylan Zhang, Justin Wang, Tianran Sun
-
-Existing LMs struggle with proof-oriented programming due to data scarcity,
-which manifests in two key ways: (1) a lack of sufficient corpora for
-proof-oriented programming languages such as F*, and (2) the absence of
-large-scale, project-level proof-oriented implementations that can teach the
-model the intricate reasoning process when performing proof-oriented
-programming. We present the first work on synthetic data augmentation for
-project-level proof-oriented programming, covering both generation and repair. Our method
-addresses data scarcity by synthesizing basic proof-oriented programming
-problems for proficiency in that language; incorporating diverse coding data
-for reasoning capability elicitation; and creating new proofs and repair data
-within existing repositories. This approach enables language models to both
-synthesize and repair proofs for function- and repository-level code. We show
-that our fine-tuned 14B parameter model, PoPilot, can exceed the performance of
-the models that outperform GPT-4o in project-level proof-oriented programming
-by a 64% relative margin, and can improve GPT-4o's performance by 54% by
-repairing its outputs over GPT-4o's self-repair.
-
-摘要：現有的語言模型在基於證明編程時會因資料稀少而有困難，這會以兩種關鍵方式表現出來：(1) 缺乏足夠的語料庫，例如 F* 等面向證明的程式語言，以及 (2) 缺乏大型的專案層級面向證明實作，這些實作可以在執行面向證明編程時，教導模型複雜的推理程序。我們提出第一個面向專案層級面向證明編程的合成資料擴充，用於產生和修復。我們的做法透過合成基本的面向證明編程問題來解決資料稀少的問題，以精通該語言；納入不同的編碼資料，以引出推理能力，並在現有的儲存庫中建立新的證明和修復資料。這個方法讓語言模型能夠為函數層級和儲存庫層級的程式碼合成和修復證明。我們展示經過微調的 14B 參數模型 PoPilot，可以超過在專案層級面向證明編程中表現優於 GPT-4o 的模型 64% 的相對差距，並且可以透過修復 GPT-4o 自我修復的輸出，將 GPT-4o 的效能提升 54%。
-
-##### **DLFR-VAE: Dynamic Latent Frame Rate VAE for Video Generation**
-2502.11897v1 by Zhihang Yuan, Siyuan Wang, Rui Xie, Hanling Zhang, Tongcheng Fang, Yuzhang Shang, Shengen Yan, Guohao Dai, Yu Wang
-
-In this paper, we propose the Dynamic Latent Frame Rate VAE (DLFR-VAE), a
-training-free paradigm that can make use of adaptive temporal compression in
-latent space. While existing video generative models apply fixed compression
-rates via pretrained VAE, we observe that real-world video content exhibits
-substantial temporal non-uniformity, with high-motion segments containing more
-information than static scenes. Based on this insight, DLFR-VAE dynamically
-adjusts the latent frame rate according to the content complexity.
-Specifically, DLFR-VAE comprises two core innovations: (1) A Dynamic Latent
-Frame Rate Scheduler that partitions videos into temporal chunks and adaptively
-determines optimal frame rates based on information-theoretic content
-complexity, and (2) A training-free adaptation mechanism that transforms
-pretrained VAE architectures into a dynamic VAE that can process features with
-variable frame rates. Our simple but effective DLFR-VAE can function as a
-plug-and-play module, seamlessly integrating with existing video generation
-models and accelerating the video generation process.
-
-摘要：在本文中，我們提出動態潛在幀率 VAE (DLFR-VAE)，一種無需訓練的範例，它可以在潛在空間中使用自適應時間壓縮。現有的影片生成模型透過預訓練的 VAE 應用固定壓縮率，但我們觀察到真實世界的影片內容展現出大量的時間非一致性，其中高動作片段包含比靜態場景更多的資訊。基於這個見解，DLFR-VAE 會根據內容複雜度動態調整潛在幀率。具體來說，DLFR-VAE 包含兩項核心創新：(1) 一個動態潛在幀率排程器，它將影片分割成時間區塊，並根據資訊理論內容複雜度自適應地決定最佳幀率，以及 (2) 一個無需訓練的適應機制，它將預訓練的 VAE 架構轉換成一個動態 VAE，它可以處理具有可變幀率的特色。我們簡單但有效的 DLFR-VAE 可以作為一個即插即用的模組，與現有的影片生成模型無縫整合，並加速影片生成過程。
-
-##### **CAMEL: Continuous Action Masking Enabled by Large Language Models for Reinforcement Learning**
-2502.11896v1 by Yanxiao Zhao, Yangge Qian, Jingyang Shan, Xiaolin Qin
-
-Reinforcement learning (RL) in continuous action spaces encounters persistent
-challenges, such as inefficient exploration and convergence to suboptimal
-solutions. To address these limitations, we propose CAMEL, a novel framework
-integrating LLM-generated suboptimal policies into the RL training pipeline.
-CAMEL leverages dynamic action masking and an adaptive epsilon-masking
-mechanism to guide exploration during early training stages while gradually
-enabling agents to optimize policies independently. At the core of CAMEL lies
-the integration of Python-executable suboptimal policies generated by LLMs
-based on environment descriptions and task objectives. Although simplistic and
-hard-coded, these policies offer valuable initial guidance for RL agents. To
-effectively utilize these priors, CAMEL employs masking-aware optimization to
-dynamically constrain the action space based on LLM outputs. Additionally,
-epsilon-masking gradually reduces reliance on LLM-generated guidance, enabling
-agents to transition from constrained exploration to autonomous policy
-refinement. Experimental validation on Gymnasium MuJoCo environments
-demonstrates the effectiveness of CAMEL. In Hopper-v4 and Ant-v4, LLM-generated
-policies significantly improve sample efficiency, achieving performance
-comparable to or surpassing expert masking baselines. For Walker2d-v4, where
-LLMs struggle to accurately model bipedal gait dynamics, CAMEL maintains robust
-RL performance without notable degradation, highlighting the framework's
-adaptability across diverse tasks. While CAMEL shows promise in enhancing
-sample efficiency and mitigating convergence challenges, these issues remain
-open for further research. Future work aims to generalize CAMEL to multimodal
-LLMs for broader observation-action spaces and automate policy evaluation,
-reducing human intervention and enhancing scalability in RL training pipelines.
-
-摘要：在連續動作空間中的強化學習 (RL) 會遇到持續的挑戰，例如探索效率低落和收斂至次佳解。為了解決這些限制，我們提出 CAMEL，一個將 LLM 生成的次佳策略整合到 RL 訓練管線中的新框架。CAMEL 透過動態動作遮罩和自適應 epsilon 遮罩機制來引導探索，同時逐漸讓代理程式能夠獨立最佳化策略。CAMEL 的核心在於整合由 LLM 生成的 Python 可執行次佳策略，這些策略基於環境描述和任務目標。儘管這些策略過於簡化且硬編碼，但它們為 RL 代理程式提供了有價值的初始指導。為了有效利用這些先驗知識，CAMEL 採用遮罩感知最佳化來根據 LLM 輸出動態限制動作空間。此外，epsilon 遮罩逐漸減少對 LLM 生成的指導依賴，讓代理程式能夠從受限探索轉換為自主策略改善。在 Gymnasium MuJoCo 環境上的實驗驗證證明了 CAMEL 的有效性。在 Hopper-v4 和 Ant-v4 中，LLM 生成的策略顯著提升了樣本效率，達到了與專家遮罩基準相近或超越的效能。對於 LLM 難以準確建模雙足步態動態的 Walker2d-v4，CAMEL 維持穩健的 RL 效能，且沒有顯著降低，突顯了該框架在不同任務中的適應性。儘管 CAMEL 在提升樣本效率和緩解收斂挑戰方面顯示出前景，但這些問題仍有待進一步研究。未來的研究工作旨在將 CAMEL 推廣到多模態 LLM，以涵蓋更廣泛的觀察動作空間，並自動化策略評估，減少人工介入並提升 RL 訓練管線的可擴充性。
-
-##### **Continual Quantization-Aware Pre-Training: When to transition from 16-bit to 1.58-bit pre-training for BitNet language models?**
-2502.11895v1 by Jacob Nielsen, Peter Schneider-Kamp, Lukas Galke
-
-Large language models (LLMs) require immense resources for training and
-inference. Quantization, a technique that reduces the precision of model
-parameters, offers a promising solution for improving LLM efficiency and
-sustainability. While post-training quantization methods typically achieve 4-8
-bits per parameter, recent research suggests that training LLMs with 1.58 bits
-per weight parameter from scratch can maintain model accuracy while greatly
-reducing memory requirements and energy consumption at inference time. Here, we
-investigate a training strategy for quantization-aware pre-training, where the
-models are first trained with 16-bit precision and then transition into
-1.58-bit quantization-aware training. Our results on 11 downstream tasks show
-that this 16-to-1.58-bit training strategy is preferable over full 1.58-bit
-training and leaves models closer to those which have undergone 16-bit
-training. We further investigate the effects of retaining the optimizer state
-at the transition point and gradually phasing in quantization strength --
-finding that both techniques alleviate the magnitude of loss spikes, but also
-that these effects can be compensated through further training.
-
-摘要：大型語言模型 (LLM) 需要大量的資源來進行訓練和推理。量化是一種降低模型參數精度的技術，為提高 LLM 效率和可持續性提供了一個有希望的解決方案。雖然訓練後量化方法通常每參數達到 4-8 位元，但最近的研究表明，從頭開始使用每權重參數 1.58 位元訓練 LLM 可以維持模型準確性，同時大幅減少推理時間的記憶體需求和能源消耗。在此，我們探討量化感知預訓練的訓練策略，其中模型首先使用 16 位元精度訓練，然後轉換為 1.58 位元量化感知訓練。我們在 11 個下游任務上的結果表明，這種 16 位元到 1.58 位元的訓練策略優於完全 1.58 位元訓練，並且使模型更接近經過 16 位元訓練的模型。我們進一步探討了在轉換點保留最佳化器狀態和逐漸調整量化強度的影響——發現這兩種技術都可以減輕損失尖峰的大小，但這些影響也可以透過進一步訓練來補償。
-
-##### **Revisiting Classification Taxonomy for Grammatical Errors**
-2502.11890v1 by Deqing Zou, Jingheng Ye, Yulu Liu, Yu Wu, Zishan Xu, Yinghui Li, Hai-Tao Zheng, Bingxu An, Zhao Wei, Yong Xu
-
-Grammatical error classification plays a crucial role in language learning
-systems, but existing classification taxonomies often lack rigorous validation,
-leading to inconsistencies and unreliable feedback. In this paper, we revisit
-previous classification taxonomies for grammatical errors by introducing a
-systematic and qualitative evaluation framework. Our approach examines four
-aspects of a taxonomy, i.e., exclusivity, coverage, balance, and usability.
-Then, we construct a high-quality grammatical error classification dataset
-annotated with multiple classification taxonomies and evaluate them based
-on our proposed evaluation framework. Our experiments reveal the drawbacks of
-existing taxonomies. Our contributions aim to improve the precision and
-effectiveness of error analysis, providing more understandable and actionable
-feedback for language learners.
-
-摘要：語法錯誤分類在語言學習系統中扮演至關重要的角色，但現有的分類法常常缺乏嚴謹的驗證，導致不一致且不可靠的回饋。在本文中，我們透過引入一個系統且定性的評估架構，重新檢視先前的語法錯誤分類法。我們的做法檢視分類法的四個面向，即排他性、涵蓋性、平衡性和可用性。接著，我們建構一個高品質的語法錯誤分類資料集，並用多個分類法進行標註，並根據我們提出的評估架構對其進行評估。我們的實驗揭露了現有分類法的缺點。我們的貢獻旨在改善錯誤分析的準確性和有效性，為語言學習者提供更易於理解且可操作的回饋。
-
-##### **Stonefish: Supporting Machine Learning Research in Marine Robotics**
-2502.11887v1 by Michele Grimaldi, Patryk Cieslak, Eduardo Ochoa, Vibhav Bharti, Hayat Rajani, Ignacio Carlucho, Maria Koskinopoulou, Yvan R. Petillot, Nuno Gracias
-
-Simulations are highly valuable in marine robotics, offering a cost-effective
-and controlled environment for testing in the challenging conditions of
-underwater and surface operations. Given the high costs and logistical
-difficulties of real-world trials, simulators capable of capturing the
-operational conditions of subsea environments have become key in developing and
-refining algorithms for remotely-operated and autonomous underwater vehicles.
-This paper highlights recent enhancements to the Stonefish simulator, an
-advanced open-source platform supporting development and testing of marine
-robotics solutions. Key updates include a suite of additional sensors, such as
-an event-based camera, a thermal camera, and an optical flow camera, as well
-as, visual light communication, support for tethered operations, improved
-thruster modelling, more flexible hydrodynamics, and enhanced sonar accuracy.
-These developments and an automated annotation tool significantly bolster
-Stonefish's role in marine robotics research, especially in the field of
-machine learning, where training data with a known ground truth is hard or
-impossible to collect.
-
-摘要：模擬在海洋機器人中極具價值，提供具成本效益且受控的環境，用於在水下和水面作業的挑戰性條件下進行測試。鑑於現實世界試驗的高成本和後勤困難，能夠捕捉海底環境作業條件的模擬器已成為開發和改進遠程操作和自主水下載具演算法的關鍵。本文重點介紹了 Stonefish 模擬器最近的增強功能，這是一個先進的開源平台，支援海洋機器人解決方案的開發和測試。主要更新包括一系列額外的感測器，例如事件式相機、熱像儀和光流相機，以及可見光通訊、對繫繩操作的支援、改進的推進器建模、更靈活的水動力學和增強的聲納準確度。這些開發和自動化標註工具顯著提升了 Stonefish 在海洋機器人研究中的作用，特別是在機器學習領域，其中具有已知基本事實的訓練資料難以或無法收集。
-
-##### **LIMR: Less is More for RL Scaling**
-2502.11886v1 by Xuefeng Li, Haoyang Zou, Pengfei Liu
-
-In this paper, we ask: what truly determines the effectiveness of RL training
-data for enhancing language models' reasoning capabilities? While recent
-advances like o1, Deepseek R1, and Kimi1.5 demonstrate RL's potential, the lack
-of transparency about training data requirements has hindered systematic
-progress. Starting directly from base models without distillation, we challenge
-the assumption that scaling up RL training data inherently improves
-performance. We demonstrate that a strategically selected subset of just 1,389
-samples can outperform the full 8,523-sample dataset. We introduce Learning
-Impact Measurement (LIM), an automated method to evaluate and prioritize
-training samples based on their alignment with model learning trajectories,
-enabling efficient resource utilization and scalable implementation. Our method
-achieves comparable or even superior performance using only 1,389 samples
-versus the full 8,523 samples dataset. Notably, while recent data-efficient
-approaches (e.g., LIMO and s1) show promise with 32B-scale models, we find they
-significantly underperform at 7B-scale through supervised fine-tuning (SFT).
-In contrast, our RL-based LIMR achieves 16.7% higher accuracy on AIME24 and
-outperforms LIMO and s1 by 13.0% and 22.2% on MATH500. These results
-fundamentally reshape our understanding of RL scaling in LLMs, demonstrating
-that precise sample selection, rather than data scale, may be the key to
-unlocking enhanced reasoning capabilities. For reproducible research and future
-innovation, we are open-sourcing LIMR, including implementation of LIM,
-training and evaluation code, curated datasets, and trained models at
-https://github.com/GAIR-NLP/LIMR.
-
-摘要：在這篇論文中，我們提出一個問題：究竟是什麼決定了 RL 訓練資料增強語言模型推理能力的有效性？雖然最近的進展，例如 o1、Deepseek R1 和 Kimi1.5，展示了 RL 的潛力，但缺乏關於訓練資料需求的透明度阻礙了系統化的進展。從沒有蒸餾的基本模型直接開始，我們挑戰了擴充 RL 訓練資料本質上就會提升效能的假設。我們證明，策略性地選出僅 1,389 個樣本的子集就能勝過完整的 8,523 個樣本資料集。我們引入了學習影響力測量 (LIM)，這是一種自動化方法，用來評估和優先處理訓練樣本，根據它們與模型學習軌跡的一致性，能有效利用資源和擴充實作。我們的方法使用僅 1,389 個樣本就能達到與使用完整的 8,523 個樣本資料集相當甚至更佳的效能。值得注意的是，雖然最近資料有效率的方法（例如 LIMO 和 s1）在 32B 規模的模型上展現了前景，但我們發現它在 7B 規模上透過監督微調 (SFT) 的表現大幅落後。相比之下，我們基於 RL 的 LIMR 在 AIME24 上達到了高出 16.7% 的準確度，並在 MATH500 上比 LIMO 和 s1 分別高出 13.0% 和 22.2%。這些結果從根本上改變了我們對 LLM 中 RL 擴充的理解，證明精確的樣本選取，而非資料規模，可能是解鎖增強推理能力的關鍵。為了可重製的研究和未來的創新，我們開放原始碼 LIMR，包括 LIM 的實作、訓練和評估程式碼、策展的資料集，以及在 https://github.com/GAIR-NLP/LIMR 上訓練的模型。
-
-##### **Leveraging Dual Process Theory in Language Agent Framework for Real-time Simultaneous Human-AI Collaboration**
-2502.11882v1 by Shao Zhang, Xihuai Wang, Wenhao Zhang, Chaoran Li, Junru Song, Tingyu Li, Lin Qiu, Xuezhi Cao, Xunliang Cai, Wen Yao, Weinan Zhang, Xinbing Wang, Ying Wen
-
-Agents built on large language models (LLMs) have excelled in turn-by-turn
-human-AI collaboration but struggle with simultaneous tasks requiring real-time
-interaction. Latency issues and the challenge of inferring variable human
-strategies hinder their ability to make autonomous decisions without explicit
-instructions. Through experiments with current independent System 1 and System
-2 methods, we validate the necessity of using Dual Process Theory (DPT) in
-real-time tasks. We propose DPT-Agent, a novel language agent framework that
-integrates System 1 and System 2 for efficient real-time simultaneous human-AI
-collaboration. DPT-Agent's System 1 uses a Finite-state Machine (FSM) and
-code-as-policy for fast, intuitive, and controllable decision-making.
-DPT-Agent's System 2 integrates Theory of Mind (ToM) and asynchronous
-reflection to infer human intentions and perform reasoning-based autonomous
-decisions. We demonstrate the effectiveness of DPT-Agent through further
-experiments with rule-based agents and human collaborators, showing significant
-improvements over mainstream LLM-based frameworks. To the best of our
-knowledge, DPT-Agent is the first language agent framework that achieves
-successful real-time simultaneous human-AI collaboration autonomously. Code of
-DPT-Agent can be found in https://github.com/sjtu-marl/DPT-Agent.
-
-摘要：建立在大语言模型（LLM）上的代理在回合制人机协作方面表现出色，但在需要实时交互的同时任务中却举步维艰。延迟问题和推断可变人类策略的挑战阻碍了他们在没有明确指示的情况下做出自主决策的能力。通过使用当前独立的系统 1 和系统 2 方法进行的实验，我们验证了在实时任务中使用双重过程理论 (DPT) 的必要性。我们提出了 DPT-Agent，这是一个新颖的语言代理框架，它集成了系统 1 和系统 2，以实现高效的实时同时人机协作。DPT-Agent 的系统 1 使用有限状态机 (FSM) 和代码作为策略，以进行快速、直观且可控的决策。DPT-Agent 的系统 2 集成了心智理论 (ToM) 和异步反射，以推断人类意图并执行基于推理的自主决策。我们通过与基于规则的代理和人类合作者进行进一步的实验来证明 DPT-Agent 的有效性，展示了对主流基于 LLM 的框架的重大改进。据我们所知，DPT-Agent 是第一个实现自主的实时同时人机协作的语言代理框架。DPT-Agent 的代码可以在 https://github.com/sjtu-marl/DPT-Agent 中找到。
-
-##### **Hypothesis-Driven Theory-of-Mind Reasoning for Large Language Models**
-2502.11881v1 by Hyunwoo Kim, Melanie Sclar, Tan Zhi-Xuan, Lance Ying, Sydney Levine, Yang Liu, Joshua B. Tenenbaum, Yejin Choi
-
-Existing LLM reasoning methods have shown impressive capabilities across
-various tasks, such as solving math and coding problems. However, applying
-these methods to scenarios without ground-truth answers or rule-based
-verification methods - such as tracking the mental states of an agent - remains
-challenging. Inspired by the sequential Monte Carlo algorithm, we introduce
-thought-tracing, an inference-time reasoning algorithm designed to trace the
-mental states of specific agents by generating hypotheses and weighting them
-based on observations without relying on ground-truth solutions to questions in
-datasets. Our algorithm is modeled after the Bayesian theory-of-mind framework,
-using LLMs to approximate probabilistic inference over agents' evolving mental
-states based on their perceptions and actions. We evaluate thought-tracing on
-diverse theory-of-mind benchmarks, demonstrating significant performance
-improvements compared to baseline LLMs. Our experiments also reveal interesting
-behaviors of the recent reasoning models - e.g., o1 and R1 - on theory-of-mind,
-highlighting the difference of social reasoning compared to other domains.
-
-摘要：現有的 LLM 推理方法已在各種任務中展現出令人印象深刻的能力，例如解決數學和編碼問題。然而，將這些方法應用於沒有正解答案或基於規則的驗證方法的情境中 - 例如追蹤代理人的心智狀態 - 仍然具有挑戰性。受到序貫蒙地卡羅演算法的啟發，我們引入了思想追蹤，這是一種在推理時間進行推理的演算法，旨在透過產生假設並根據觀察加權這些假設來追蹤特定代理人的心智狀態，而無需依賴資料集中的問題正解。我們的演算法是以貝氏心智理論架構為範本，使用 LLM 根據代理人的感知和行動來近似代理人不斷演變的心智狀態的機率推論。我們在各種心智理論基準上評估思想追蹤，與基準 LLM 相比，證明了顯著的效能提升。我們的實驗也揭露了近期推理模型在心智理論上的有趣行為 - 例如 o1 和 R1 - 突顯了社會推理與其他領域的差異。
-
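The hypothesis-weighting idea in the thought-tracing abstract can be illustrated with a toy sequential importance-weighting loop. This is a minimal sketch, not the authors' algorithm: the `likelihood` function stands in for an LLM-estimated score, and the hypothesis strings, observations, `trace`, and `ess_threshold` are all hypothetical. The real method also generates and propagates hypotheses with an LLM, which is omitted here.

```python
import random


def normalize(weights):
    """Scale weights so they sum to 1."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("all hypotheses have zero weight")
    return [w / total for w in weights]


def effective_sample_size(weights):
    """Standard ESS estimate for normalized importance weights."""
    return 1.0 / sum(w * w for w in weights)


def trace(hypotheses, observations, likelihood, rng, ess_threshold=0.5):
    """Reweight hypotheses after each observation; resample if weights collapse."""
    n = len(hypotheses)
    weights = [1.0 / n] * n
    for obs in observations:
        weights = normalize(
            [w * likelihood(h, obs) for h, w in zip(hypotheses, weights)]
        )
        if effective_sample_size(weights) < ess_threshold * n:
            # Multinomial resampling, then reset to uniform weights.
            hypotheses = rng.choices(hypotheses, weights=weights, k=n)
            weights = [1.0 / n] * n
    return hypotheses, weights


def toy_likelihood(hypothesis, observation):
    """Hypothetical stand-in for an LLM-estimated likelihood."""
    return 0.9 if ("drawer" in hypothesis and "drawer" in observation) else 0.1


def mass(hypotheses, weights, target):
    """Total posterior weight assigned to one hypothesis string."""
    return sum(w for h, w in zip(hypotheses, weights) if h == target)


hyps = ["believes the key is in the drawer", "believes the key is in the bag"]
obs = ["walks to the drawer", "opens the drawer"]
final_hyps, final_weights = trace(hyps, obs, toy_likelihood, random.Random(0))
drawer_mass = mass(final_hyps, final_weights, hyps[0])
```

After the two observations, most of the weight sits on the drawer hypothesis, which is the qualitative behavior the abstract describes for observation-based weighting.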
-##### **Bitnet.cpp: Efficient Edge Inference for Ternary LLMs**
-2502.11880v1 by Jinheng Wang, Hansong Zhou, Ting Song, Shijie Cao, Yan Xia, Ting Cao, Jianyu Wei, Shuming Ma, Hongyu Wang, Furu Wei
-
-The advent of 1-bit large language models (LLMs), led by BitNet b1.58, has
-spurred interest in ternary LLMs. Despite this, research and practical
-applications focusing on efficient edge inference for ternary LLMs remain
-scarce. To bridge this gap, we introduce Bitnet.cpp, an inference system
-optimized for BitNet b1.58 and ternary LLMs. Given that mixed-precision matrix
-multiplication (mpGEMM) constitutes the bulk of inference time in ternary LLMs,
-Bitnet.cpp incorporates a novel mpGEMM library to facilitate
-sub-2-bits-per-weight, efficient and lossless inference. The library features
-two core solutions: Ternary Lookup Table (TL), which addresses spatial
-inefficiencies of previous bit-wise methods, and Int2 with a Scale (I2_S),
-which ensures lossless edge inference, both enabling high-speed inference. Our
-experiments show that Bitnet.cpp achieves up to a 6.25x increase in speed over
-full-precision baselines and up to 2.32x over low-bit baselines, setting new
-benchmarks in the field. Additionally, we expand TL to element-wise lookup
-table (ELUT) for low-bit LLMs in the appendix, presenting both theoretical and
-empirical evidence of its considerable potential. Bitnet.cpp is publicly
-available at https://github.com/microsoft/BitNet/tree/paper , offering a
-sophisticated solution for the efficient and practical deployment of edge LLMs.
-
-摘要：隨著由 BitNet b1.58 領先的 1 位元大型語言模型 (LLM) 出現，已激發了對三元 LLM 的興趣。儘管如此，專注於三元 LLM 的高效能邊緣推論的研究和實際應用仍然很少見。為了彌補這個差距，我們引入了 Bitnet.cpp，這是一個針對 BitNet b1.58 和三元 LLM 最佳化的推論系統。由於混合精度矩陣乘法 (mpGEMM) 構成三元 LLM 中推論時間的大部分，Bitnet.cpp 結合了一個新穎的 mpGEMM 函式庫，以利於每權重低於 2 位元、高效能且無損失的推論。該函式庫具有兩個核心解決方案：三元查詢表 (TL)，它解決了先前逐位元方法的空間低效率，以及具有比例的 Int2 (I2_S)，它確保無損失的邊緣推論，兩者都能實現高速推論。我們的實驗顯示，Bitnet.cpp 的速度比全精度的基準快了 6.25 倍，比低位元基準快了 2.32 倍，樹立了該領域的新基準。此外，我們在附錄中將 TL 擴充到逐元素查詢表 (ELUT) 以用於低位元 LLM，並提出其巨大潛力的理論和實證證據。Bitnet.cpp 已公開於 https://github.com/microsoft/BitNet/tree/paper，提供了一個精密的解決方案，用於邊緣 LLM 的高效能和實際部署。
-
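To make the "Int2 with a Scale" idea concrete, the sketch below stores ternary weights as 2-bit codes plus one floating-point scale and recovers them exactly. It is an illustration only, assuming a simple four-codes-per-byte layout; the function names are hypothetical and this is not the Bitnet.cpp kernel or its actual memory layout. It uses 2 bits per weight and does not attempt the sub-2-bit packing mentioned in the abstract.

```python
def pack_ternary(values):
    """Pack ternary weights (-1, 0, 1) into bytes, four 2-bit codes per byte."""
    packed = bytearray()
    for start in range(0, len(values), 4):
        byte = 0
        for offset, v in enumerate(values[start:start + 4]):
            if v not in (-1, 0, 1):
                raise ValueError(f"not a ternary weight: {v}")
            byte |= (v + 1) << (2 * offset)  # map -1, 0, 1 to codes 0, 1, 2
        packed.append(byte)
    return bytes(packed)


def unpack_ternary(packed, count):
    """Recover the first `count` ternary weights from packed bytes."""
    values = []
    for byte in packed:
        for offset in range(4):
            if len(values) == count:
                return values
            values.append(((byte >> (2 * offset)) & 0b11) - 1)
    return values


def dequantize(packed, count, scale):
    """Multiply each ternary weight by the shared scale."""
    return [scale * v for v in unpack_ternary(packed, count)]


weights = [1, -1, 0, 0, 1, 1, -1]
packed = pack_ternary(weights)
restored = unpack_ternary(packed, len(weights))
real_values = dequantize(packed, len(weights), scale=0.5)
```

Because the ternary values round-trip exactly, the only floating-point quantity is the shared scale, which is one way to read the abstract's "lossless" claim for already-ternary models.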
-##### **VAQUUM: Are Vague Quantifiers Grounded in Visual Data?**
-2502.11874v1 by Hugh Mee Wong, Rick Nouwen, Albert Gatt
-
-Vague quantifiers such as "a few" and "many" are influenced by many
-contextual factors, including how many objects are present in a given context.
-In this work, we evaluate the extent to which vision-and-language models (VLMs)
-are compatible with humans when producing or judging the appropriateness of
-vague quantifiers in visual contexts. We release a novel dataset, VAQUUM,
-containing 20300 human ratings on quantified statements across a total of 1089
-images. Using this dataset, we compare human judgments and VLM predictions
-using three different evaluation methods. Our findings show that VLMs, like
-humans, are influenced by object counts in vague quantifier use. However, we
-find significant inconsistencies across models in different evaluation
-settings, suggesting that judging and producing vague quantifiers rely on two
-different processes.
-
-摘要：模糊量词，例如「一些」和「许多」，会受到许多语境因素的影响，包括在给定语境中出现的对象数量。在这项工作中，我们评估视觉语言模型 (VLM) 在视觉语境中产生或判断模糊量词的适当性时，与人类的兼容程度。我们发布了一个新数据集 VAQUUM，其中包含对 1089 张图像中的量化陈述的 20300 个人类评级。使用此数据集，我们使用三种不同的评估方法来比较人类判断和 VLM 预测。我们的研究结果表明，VLM 与人类一样，在模糊量词的使用中会受到对象数量的影响。然而，我们发现不同评估设置中的模型之间存在显着的不一致性，这表明判断和产生模糊量词依赖于两个不同的过程。
-
-##### **Southern Newswire Corpus: A Large-Scale Dataset of Mid-Century Wire Articles Beyond the Front Page**
-2502.11866v1 by Michael McRae
-
-I introduce a new large-scale dataset of historical wire articles from U.S.
-Southern newspapers, spanning 1960-1975 and covering multiple wire services:
-The Associated Press, United Press International, Newspaper Enterprise
-Association. Unlike prior work focusing on front-page content, this dataset
-captures articles across the entire newspaper, offering broader insight into
-mid-century Southern coverage. The dataset includes a version that has
-undergone an LLM-based text cleanup pipeline to reduce OCR noise, enhancing its
-suitability for quantitative text analysis. Additionally, duplicate versions of
-articles are retained to enable analysis of editorial differences in language
-and framing across newspapers. Each article is tagged by wire service,
-facilitating comparative studies of editorial patterns across agencies. This
-resource opens new avenues for research in computational social science,
-digital humanities, and historical linguistics, providing a detailed
-perspective on how Southern newspapers relayed national and international news
-during a transformative period in American history. The dataset will be made
-available upon publication or request for research purposes.
-
-摘要：我介紹一個新的美國歷史電訊文章大型資料集，時間跨度為 1960-1975 年，涵蓋多個電訊服務：美聯社、美聯國際社、報業企業協會。與先前專注於頭版內容的研究不同，此資料集擷取了整份報紙的文章，提供更廣泛的見解，深入探討世紀中葉的南方報導。該資料集包含一個經過 LLM 文字清理管線處理的版本，以減少 OCR 雜訊，提升其適用於量化文字分析。此外，保留文章的重複版本，以利分析報紙間語言和架構的編輯差異。每篇文章都標記電訊服務，便於比較各家機構的編輯模式。此資源為計算社會科學、數位人文和歷史語言學的研究開啟了新的途徑，提供一個詳細的觀點，探討南方報紙在美國歷史的轉型時期如何傳遞國內和國際新聞。該資料集將在出版或研究目的請求後提供。
-
-##### **FedEAT: A Robustness Optimization Framework for Federated LLMs**
-2502.11863v1 by Yahao Pang, Xingyuan Wu, Xiaojin Zhang, Wei Chen, Hai Jin
-
-Significant advancements have been made by Large Language Models (LLMs) in
-the domains of natural language understanding and automated content creation.
-However, they still face persistent problems, including substantial
-computational costs and inadequate availability of training data. The
-combination of Federated Learning (FL) and LLMs (federated LLMs) offers a
-solution by leveraging distributed data while protecting privacy, which
-positions it as an ideal choice for sensitive domains. However, Federated LLMs
-still suffer from robustness challenges, including data heterogeneity,
-malicious clients, and adversarial attacks, which greatly hinder their
-applications. We first introduce the robustness problems in federated LLMs. To
-address these challenges, we propose FedEAT (Federated Embedding space
-Adversarial Training), a novel framework that applies adversarial training in
-the embedding space of client LLM and employs a robust aggregation approach,
-specifically geometric median aggregation, to enhance the robustness of
-Federated LLMs. Our experiments demonstrate that FedEAT effectively improves
-the robustness of Federated LLMs with minimal performance loss.
-
-摘要：大型語言模型 (LLM) 在自然語言理解和自動化內容創作領域取得了重大進展。
-然而，它們仍然面臨持續的問題，包括大量的運算成本和訓練數據的可用性不足。
-聯合學習 (FL) 和 LLM（聯合 LLM）的結合提供了一個解決方案，在保護隱私的同時利用分佈式數據，這使其成為敏感領域的理想選擇。
-然而，聯合 LLM 仍然面臨著穩健性的挑戰，包括數據異質性、惡意用戶和對抗性攻擊，這極大地阻礙了它們的應用。
-我們首先介紹了聯合 LLM 中的穩健性問題，為了應對這些挑戰，我們提出了 FedEAT（聯合嵌入空間對抗訓練），這是一個新穎的框架，它在用戶端 LLM 的嵌入空間中應用對抗訓練，並採用穩健的聚合方法，特別是幾何中值聚合，以增強聯合 LLM 的穩健性。
-我們的實驗表明，FedEAT 有效地提高了聯合 LLM 的穩健性，同時性能損失最小。
-
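Geometric median aggregation, named in the FedEAT abstract, can be sketched with the classic Weiszfeld iteration. This is a minimal illustration, assuming client updates are plain lists of floats; the abstract does not say which solver the authors use, and the client values below are invented for the example.

```python
import math


def geometric_median(points, iters=100, eps=1e-9):
    """Approximate the geometric median of equal-length vectors (Weiszfeld iteration)."""
    dim = len(points[0])
    # Start from the coordinate-wise mean.
    est = [sum(p[i] for p in points) / len(points) for i in range(dim)]
    for _ in range(iters):
        num = [0.0] * dim
        denom = 0.0
        for p in points:
            dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(p, est)))
            w = 1.0 / max(dist, eps)  # clamp to avoid division by zero
            denom += w
            for i in range(dim):
                num[i] += w * p[i]
        new_est = [v / denom for v in num]
        shift = math.sqrt(sum((a - b) ** 2 for a, b in zip(new_est, est)))
        est = new_est
        if shift < eps:
            break
    return est


# Hypothetical client updates: four honest clients near (1, 1), one outlier.
client_updates = [
    [1.0, 1.0],
    [1.1, 0.9],
    [0.9, 1.1],
    [1.0, 1.05],
    [100.0, -100.0],
]
mean_update = [sum(u[i] for u in client_updates) / len(client_updates) for i in range(2)]
median_update = geometric_median(client_updates)
```

The plain mean is dragged far from the honest cluster by the single outlier, while the geometric median stays close to it, which is why median-style aggregation is commonly used against malicious clients.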
-##### **Understanding In-Context Machine Translation for Low-Resource Languages: A Case Study on Manchu**
-2502.11862v1 by Renhao Pei, Yihong Liu, Peiqin Lin, François Yvon, Hinrich Schütze
-
-In-context machine translation (MT) with large language models (LLMs) is a
-promising approach for low-resource MT, as it can readily take advantage of
-linguistic resources such as grammar books and dictionaries. Such resources are
-usually selectively integrated into the prompt so that LLMs can directly
-perform translation without any specific training, via their in-context
-learning (ICL) capability. However, the relative importance of each type of
-resource (e.g., dictionary, grammar book, and retrieved parallel examples) is
-not entirely clear. To address this gap, this study systematically investigates
-how each resource and its quality affect the translation performance, with the
-Manchu language as our case study. To remove any prior knowledge of Manchu
-encoded in the LLM parameters and single out the effect of ICL, we also
-experiment with an encrypted version of Manchu texts. Our results indicate that
-high-quality dictionaries and good parallel examples are very helpful, while
-grammars hardly help. In a follow-up study, we showcase a promising application
-of in-context MT: parallel data augmentation as a way to bootstrap the
-conventional MT model. When monolingual data abound, generating synthetic
-parallel data through in-context MT offers a pathway to mitigate data scarcity
-and build effective and efficient low-resource neural MT systems.
-
-摘要：語境機器翻譯 (MT) 與大型語言模型 (LLM) 結合，對於低資源 MT 來說是一種有前景的方法，因為它可以輕易利用語法書和字典等語言資源。此類資源通常會選擇性地整合到提示中，讓 LLM 能夠透過其語境學習能力 (ICL) 直接執行翻譯，而無需任何特定訓練。然而，每種類型的資源（例如字典、語法書和擷取的平行範例）的相對重要性並不明確。為了解決這個問題，本研究系統性地探討每項資源及其品質如何影響翻譯效能，並以滿語作為我們的案例研究。為了移除 LLM 參數中編碼的任何滿語先備知識，並找出 ICL 的影響，我們也對滿語文本的加密版本進行實驗。我們的結果顯示，高品質的字典和良好的平行範例非常有幫助，而語法幾乎沒有幫助。在後續研究中，我們展示了語境 MT 的一個有前景的應用：平行數據擴充，作為引導傳統 MT 模型的一種方式。當單語資料豐富時，透過語境 MT 產生合成平行資料提供了一條途徑，可以減輕資料短缺，並建構有效且高效的低資源神經 MT 系統。
-
-##### **Exploring Large Language Models in Healthcare: Insights into Corpora Sources, Customization Strategies, and Evaluation Metrics**
-2502.11861v1 by Shuqi Yang, Mingrui Jing, Shuai Wang, Jiaxin Kou, Manfei Shi, Weijie Xing, Yan Hu, Zheng Zhu
-
-This study reviewed the use of Large Language Models (LLMs) in healthcare,
-focusing on their training corpora, customization techniques, and evaluation
-metrics. A systematic search of studies from 2021 to 2024 identified 61
-articles. Four types of corpora were used: clinical resources, literature,
-open-source datasets, and web-crawled data. Common construction techniques
-included pre-training, prompt engineering, and retrieval-augmented generation,
-with 44 studies combining multiple methods. Evaluation metrics were categorized
-into process, usability, and outcome metrics, with outcome metrics divided into
-model-based and expert-assessed outcomes. The study identified critical gaps in
-corpus fairness, which contributed to biases from geographic, cultural, and
-socio-economic factors. The reliance on unverified or unstructured data
-highlighted the need for better integration of evidence-based clinical
-guidelines. Future research should focus on developing a tiered corpus
-architecture with vetted sources and dynamic weighting, while ensuring model
-transparency. Additionally, the lack of standardized evaluation frameworks for
-domain-specific models called for comprehensive validation of LLMs in
-real-world healthcare settings.
-
-摘要：本研究回顧了大型語言模型 (LLM) 在醫療保健中的使用，重點在於其訓練語料庫、自訂技術和評估指標。針對 2021 年至 2024 年的研究進行系統性搜尋，找出 61 篇文章。語料庫類型有四種：臨床資源、文獻、開放原始碼資料集和網路爬取資料。常見的建構技術包括預訓練、提示工程和檢索增強生成，其中有 44 項研究結合多種方法。評估指標分為流程、可用性和成果指標，其中成果指標又分為基於模型和專家評估的成果。本研究發現語料庫公平性存在重大差距，這會導致地理、文化和社會經濟因素的偏見。對未驗證或非結構化資料的依賴性突顯出更佳整合循證臨床指南的必要性。未來的研究應專注於開發具有審查來源和動態加權的分層語料庫架構，同時確保模型透明性。此外，缺乏針對特定領域模型的標準化評估架構，因此需要對 LLM 在實際醫療保健環境中進行全面驗證。
-
-##### **Defining and Evaluating Visual Language Models' Basic Spatial Abilities: A Perspective from Psychometrics**
-2502.11859v1 by Wenrui Xu, Dalin Lyu, Weihang Wang, Jie Feng, Chen Gao, Yong Li
-
-The Theory of Multiple Intelligences underscores the hierarchical nature of
-cognitive capabilities. To advance Spatial Artificial Intelligence, we pioneer
-a psychometric framework defining five Basic Spatial Abilities (BSAs) in Visual
-Language Models (VLMs): Spatial Perception, Spatial Relation, Spatial
-Orientation, Mental Rotation, and Spatial Visualization. Benchmarking 13
-mainstream VLMs through nine validated psychometric experiments reveals
-significant gaps versus humans (average score 24.95 vs. 68.38), with three key
-findings: 1) VLMs mirror human hierarchies (strongest in 2D orientation,
-weakest in 3D rotation) with independent BSAs (Pearson's r<0.4); 2) Smaller
-models such as Qwen2-VL-7B surpass larger counterparts, with Qwen leading
-(30.82) and InternVL2 lagging (19.6); 3) Interventions like chain-of-thought
-(0.100 accuracy gain) and 5-shot training (0.259 improvement) show limits from
-architectural constraints. Identified barriers include weak geometry encoding
-and missing dynamic simulation. By linking psychometric BSAs to VLM
-capabilities, we provide a diagnostic toolkit for spatial intelligence
-evaluation, methodological foundations for embodied AI development, and a
-cognitive science-informed roadmap for achieving human-like spatial
-intelligence.
-
-摘要：多元智能理論強調認知能力的層次性質。為了推進空間人工智慧，我們開創了一個心理測量框架，在視覺語言模型 (VLM) 中定義了五種基本空間能力 (BSA)：空間知覺、空間關係、空間定向、心智旋轉和空間視覺化。通過九項經過驗證的心理測量實驗對 13 個主流 VLM 進行基準測試，揭示了與人類相比的顯著差距（平均分數 24.95 對 68.38），並得出三個關鍵發現：1) VLM 反映人類層次結構（2D 定向最強，3D 旋轉最弱）具有獨立的 BSA（Pearson's r<0.4）；2) Qwen2-VL-7B 等較小的模型超越了較大的模型，其中 Qwen 領先（30.82），InternVL2 落後（19.6）；3) 思想鏈等干預措施（準確度提升 0.100）和 5-shot 訓練（0.259 提升）顯示了架構約束的限制。已識別的障礙包括弱幾何編碼和缺少動態模擬。通過將心理測量 BSA 與 VLM 能力聯繫起來，我們提供了一個用於空間智能評估的診斷工具包、具身 AI 開發的方法論基礎，以及實現類人空間智能的認知科學信息路標。
-
-##### **LLMs as a synthesis between symbolic and continuous approaches to language**
-2502.11856v1 by Gemma Boleda
-
-Since the middle of the 20th century, a fierce battle has been fought between
-symbolic and continuous approaches to language and cognition. The success of
-deep learning models, and LLMs in particular, has been alternately taken as
-showing that the continuous camp has won, or dismissed as an irrelevant
-engineering development. However, in this position paper I argue that deep
-learning models for language actually represent a synthesis between the two
-traditions. This is because 1) deep learning architectures allow for both
-continuous/distributed and symbolic/discrete-like representations and
-computations; 2) models trained on language make use of this flexibility. In
-particular, I review recent research in mechanistic interpretability that
-showcases how a substantial part of morphosyntactic knowledge is encoded in a
-near-discrete fashion in LLMs. This line of research suggests that different
-behaviors arise in an emergent fashion, and models flexibly alternate between
-the two modes (and everything in between) as needed. This is possibly one of
-the main reasons for their wild success; and it is also what makes them
-particularly interesting for the study of language and cognition. Is it time
-for peace?
-
-摘要：自 20 世紀中葉以來，象徵與連續的語言和認知方法之間展開了一場激烈的戰鬥。深度學習模型，特別是 LLM 的成功，被交替視為連續陣營獲勝的證明，或被視為無關的工程發展而被忽視。然而，在本文中，我認為用於語言的深度學習模型實際上代表了這兩種傳統之間的綜合。這是因為 1) 深度學習架構允許連續/分佈式和符號/離散式表示和計算；2) 在語言上訓練的模型利用了這種靈活性。特別是，我回顧了機制可解釋性的最新研究，展示了形態句法知識的實質部分是如何以近乎離散的方式編碼在 LLM 中的。這條研究線表明，不同的行為以一種新興的方式出現，並且模型根據需要在兩種模式（以及介於兩者之間的所有內容）之間靈活地交替。這可能是它們獲得巨大成功的主要原因之一；這也是它們對語言和認知研究特別有趣的原因。和平的時刻到了嗎？
-
-##### **BaxBench: Can LLMs Generate Correct and Secure Backends?**
-2502.11844v1 by Mark Vero, Niels Mündler, Victor Chibotaru, Veselin Raychev, Maximilian Baader, Nikola Jovanović, Jingxuan He, Martin Vechev
-
-The automatic generation of programs has long been a fundamental challenge in
-computer science. Recent benchmarks have shown that large language models
-(LLMs) can effectively generate code at the function level, make code edits,
-and solve algorithmic coding tasks. However, to achieve full automation, LLMs
-should be able to generate production-quality, self-contained application
-modules. To evaluate the capabilities of LLMs in solving this challenge, we
-introduce BaxBench, a novel evaluation benchmark consisting of 392 tasks for
-the generation of backend applications. We focus on backends for three critical
-reasons: (i) they are practically relevant, building the core components of
-most modern web and cloud software, (ii) they are difficult to get right,
-requiring multiple functions and files to achieve the desired functionality,
-and (iii) they are security-critical, as they are exposed to untrusted
-third-parties, making secure solutions that prevent deployment-time attacks an
-imperative. BaxBench validates the functionality of the generated applications
-with comprehensive test cases, and assesses their security exposure by
-executing end-to-end exploits. Our experiments reveal key limitations of
-current LLMs in both functionality and security: (i) even the best model,
-OpenAI o1, achieves a mere 60% on code correctness; (ii) on average, we could
-successfully execute security exploits on more than half of the correct
-programs generated by each LLM; and (iii) in less popular backend frameworks,
-models further struggle to generate correct and secure applications. Progress
-on BaxBench signifies important steps towards autonomous and secure software
-development with LLMs.
-
-摘要：程式自動產生一直是電腦科學中的基本挑戰。最近的基準測試顯示，大型語言模型 (LLM) 能夠有效產生函數層級的程式碼、進行程式碼編輯，以及解決演算法編碼任務。然而，若要達成完全自動化，LLM 應能夠產生生產品質、獨立的應用程式模組。為了評估 LLM 在解決此挑戰的能力，我們引入了 BaxBench，這是一個包含 392 個後端應用程式產生任務的新評估基準。我們專注於後端有三個關鍵原因：(i) 它們在實務上有其相關性，建構了大多數現代網路和雲端軟體的核心元件；(ii) 它們難以正確執行，需要多個函數和檔案才能達成所需的運作功能；(iii) 它們與安全性息息相關，因為它們會暴露於不受信任的第三方，使得預防部署時攻擊的安全解決方案成為當務之急。BaxBench 使用全面的測試案例驗證產生應用程式的功能，並透過執行端對端漏洞利用來評估其安全性風險。我們的實驗揭露了目前 LLM 在功能和安全性上的主要限制：(i) 即使是最好的模型 OpenAI o1，在程式碼正確性上也僅達到 60%；(ii) 平均而言，我們能夠在每個 LLM 產生的正確程式中成功執行超過一半的安全漏洞利用；(iii) 在較不受歡迎的後端框架中，模型在產生正確且安全的應用程式上更加困難。在 BaxBench 上的進展代表著使用 LLM 朝向自主且安全的軟體開發邁出了重要的一步。
-
-##### **Can LLM Agents Maintain a Persona in Discourse?**
-2502.11843v1 by Pranav Bhandari, Nicolas Fay, Michael Wise, Amitava Datta, Stephanie Meek, Usman Naseem, Mehwish Nasim
-
-Large Language Models (LLMs) are widely used as conversational agents,
-exploiting their capabilities in various sectors such as education, law,
-medicine, and more. However, LLMs are often subject to context-shifting
-behaviour, resulting in a lack of consistent and interpretable
-personality-aligned interactions. Adherence to psychological traits lacks
-comprehensive analysis, especially in the case of dyadic (pairwise)
-conversations. We examine this challenge from two viewpoints, initially using
-two conversation agents to generate a discourse on a certain topic with an
-assigned personality from the OCEAN framework (Openness, Conscientiousness,
-Extraversion, Agreeableness, and Neuroticism) as High/Low for each trait. This
-is followed by using multiple judge agents to infer the original traits
-assigned to explore prediction consistency, inter-model agreement, and
-alignment with the assigned personality. Our findings indicate that while LLMs
-can be guided toward personality-driven dialogue, their ability to maintain
-personality traits varies significantly depending on the combination of models
-and discourse settings. These inconsistencies emphasise the challenges in
-achieving stable and interpretable personality-aligned interactions in LLMs.
-
-摘要：大型語言模型 (LLM) 被廣泛用作對話代理，
-在教育、法律、
-醫學等各個領域發揮其能力。然而，LLM 經常受到情境轉換
-行為的影響，導致缺乏一致且可解釋的
-與人格一致的互動。對心理特質的堅持缺乏
-全面的分析，特別是在二元 (成對)
-對話的情況下。我們從兩個觀點審視這個挑戰，最初使用
-兩個對話代理在特定主題上產生論述，並從 OCEAN 框架 (開放性、盡責性、
-外向性、宜人性、神經質) 中分配人格，每個特質為高/低。這
-接著使用多個評審代理來推斷分配給探索預測一致性、模型間協議的原始特質，
-以及與分配人格的一致性。我們的研究結果表明，雖然 LLM
-可以引導至以人格為導向的對話，但它們維持
-人格特質的能力會根據模型和論述設定的組合而有顯著差異。這些不一致強調了
-在 LLM 中實現穩定且可解釋的與人格一致的互動的挑戰。
-
-##### **ChordFormer: A Conformer-Based Architecture for Large-Vocabulary Audio Chord Recognition**
-2502.11840v1 by Muhammad Waseem Akram, Stefano Dettori, Valentina Colla, Giorgio Carlo Buttazzo
-
-Chord recognition serves as a critical task in music information retrieval
-due to the abstract and descriptive nature of chords in music analysis. While
-audio chord recognition systems have achieved significant accuracy for small
-vocabularies (e.g., major/minor chords), large-vocabulary chord recognition
-remains a challenging problem. This complexity also arises from the inherent
-long-tail distribution of chords, where rare chord types are underrepresented
-in most datasets, leading to insufficient training samples. Effective chord
-recognition requires leveraging contextual information from audio sequences,
-yet existing models, such as combinations of convolutional neural networks,
-bidirectional long short-term memory networks, and bidirectional transformers,
-face limitations in capturing long-term dependencies and exhibit suboptimal
-performance on large-vocabulary chord recognition tasks. This work proposes
-ChordFormer, a novel conformer-based architecture designed to tackle structural
-chord recognition (e.g., triads, bass, sevenths) for large vocabularies.
-ChordFormer leverages conformer blocks that integrate convolutional neural
-networks with transformers, thus enabling the model to capture both local
-patterns and global dependencies effectively. By addressing challenges such as
-class imbalance through a reweighted loss function and structured chord
-representations, ChordFormer outperforms state-of-the-art models, achieving a
-2% improvement in frame-wise accuracy and a 6% increase in class-wise accuracy
-on large-vocabulary chord datasets. Furthermore, ChordFormer excels in handling
-class imbalance, providing robust and balanced recognition across chord types.
-This approach bridges the gap between theoretical music knowledge and practical
-applications, advancing the field of large-vocabulary chord recognition.
-
-摘要：和弦辨識由於和弦在音樂分析中具有抽象性和描述性，因此在音樂資訊檢索中扮演著重要的任務。雖然音訊和弦辨識系統已在小型詞彙（例如，大調/小調和弦）中達到顯著的準確度，但大型詞彙和弦辨識仍然是一個具有挑戰性的問題。這種複雜性也來自和弦固有的長尾分佈，其中在大多數資料集中罕見的和弦類型代表性不足，導致訓練樣本不足。有效的和弦辨識需要利用音訊序列中的上下文資訊，但現有的模型，例如卷積神經網路、雙向長短期記憶網路和雙向轉換器的組合，在捕捉長期依賴關係方面面臨限制，並且在大詞彙和弦辨識任務上表現不佳。這項工作提出了 ChordFormer，這是一種新穎的基於變形器的架構，旨在解決大型詞彙的結構和弦辨識（例如，三和弦、低音、七和弦）。ChordFormer 利用變形器區塊將卷積神經網路與變形器整合在一起，從而使模型能夠有效地捕捉局部模式和全局依賴關係。透過重新加權損失函數和結構化和弦表示來解決類別不平衡等挑戰，ChordFormer 優於最先進的模型，在大詞彙和弦資料集上實現了幀準確度提高 2% 和類準確度提高 6%。此外，ChordFormer 在處理類別不平衡方面表現出色，在和弦類型中提供穩健且平衡的辨識。這種方法彌合了理論音樂知識與實際應用之間的差距，推動了大型詞彙和弦辨識領域的發展。
-
-##### **Intuitive physics understanding emerges from self-supervised pretraining on natural videos**
-2502.11831v1 by Quentin Garrido, Nicolas Ballas, Mahmoud Assran, Adrien Bardes, Laurent Najman, Michael Rabbat, Emmanuel Dupoux, Yann LeCun
-
-We investigate the emergence of intuitive physics understanding in
-general-purpose deep neural network models trained to predict masked regions in
-natural videos. Leveraging the violation-of-expectation framework, we find that
-video prediction models trained to predict outcomes in a learned representation
-space demonstrate an understanding of various intuitive physics properties,
-such as object permanence and shape consistency. In contrast, video prediction
-in pixel space and multimodal large language models, which reason through text,
-achieve performance closer to chance. Our comparisons of these architectures
-reveal that jointly learning an abstract representation space while predicting
-missing parts of sensory input, akin to predictive coding, is sufficient to
-acquire an understanding of intuitive physics, and that even models trained on
-one week of unique video achieve above chance performance. This challenges the
-idea that core knowledge -- a set of innate systems to help understand the
-world -- needs to be hardwired to develop an understanding of intuitive
-physics.
-
-摘要：我們探討了在經過訓練以預測自然影片中遮蔽區域的通用深度神經網路模型中，直覺物理理解的出現。利用違反預期框架，我們發現經過訓練以預測學習表徵空間中結果的影片預測模型，展現了對各種直覺物理特性的理解，例如物體恆存和形狀一致性。相反地，影片在像素空間和多模態大型語言模型中的預測，透過文字推理，達到的效能接近隨機。我們對這些架構的比較揭示了在預測感官輸入的遺失部分時，同時學習抽象表徵空間，類似於預測編碼，足以獲得對直覺物理的理解，而且即使在獨特影片上訓練一週的模型，也達到了高於隨機的效能。這挑戰了核心知識（一套幫助理解世界的先天系統）需要硬連線才能發展對直覺物理的理解這個想法。
-
-##### **Text Classification in the LLM Era - Where do we stand?**
-2502.11830v1 by Sowmya Vajjala, Shwetali Shimangaud
-
-Large Language Models revolutionized NLP and showed dramatic performance
-improvements across several tasks. In this paper, we investigated the role of
-such language models in text classification and how they compare with other
-approaches relying on smaller pre-trained language models. Considering 32
-datasets spanning 8 languages, we compared zero-shot classification, few-shot
-fine-tuning and synthetic data based classifiers with classifiers built using
-the complete human labeled dataset. Our results show that zero-shot approaches
-do well for sentiment classification, but are outperformed by other approaches
-for the rest of the tasks, and synthetic data sourced from multiple LLMs can
-build better classifiers than zero-shot open LLMs. We also see wide performance
-disparities across languages in all the classification scenarios. We expect
-that these findings would guide practitioners working on developing text
-classification systems across languages.
-
-摘要：大型語言模型革新了自然語言處理，並在多項任務中展現出顯著的效能提升。在本文中，我們探討了此類語言模型在文字分類中的角色，以及它們與依賴較小規模預先訓練語言模型的其他方法相比如何。考量涵蓋 8 種語言的 32 個資料集，我們比較了零次學習分類、少次學習微調和合成資料分類器，以及使用完整人工標記資料集建置的分類器。我們的結果顯示，零次學習方法在情緒分類中表現良好，但在其他任務中則不如其他方法，而來自多個大型語言模型的合成資料可以建置比零次學習開放大型語言模型更好的分類器。我們也看到在所有分類情境中，不同語言之間的效能差異很大。我們預期這些發現將引導從事跨語言文字分類系統開發的實務工作者。
-
-##### **Code-Vision: Evaluating Multimodal LLMs Logic Understanding and Code Generation Capabilities**
-2502.11829v1 by Hanbin Wang, Xiaoxuan Zhou, Zhipeng Xu, Keyuan Cheng, Yuxin Zuo, Kai Tian, Jingwei Song, Junting Lu, Wenhui Hu, Xueyang Liu
-
-This paper introduces Code-Vision, a benchmark designed to evaluate the
-logical understanding and code generation capabilities of Multimodal Large
-Language Models (MLLMs). It challenges MLLMs to generate a correct program that
-fulfills specific functionality requirements based on a given flowchart, which
-visually represents the desired algorithm or process. Code-Vision comprises
-three subsets: HumanEval-V, Algorithm, and MATH, which evaluate MLLMs' coding
-abilities across basic programming, algorithmic, and mathematical
-problem-solving domains. Our experiments evaluate 12 MLLMs on Code-Vision.
-Experimental results demonstrate that there is a large performance difference
-between proprietary and open-source models. On Hard problems, GPT-4o can
-achieve 79.3% pass@1, but the best open-source model only achieves 15%. Further
-experiments reveal that Code-Vision can pose unique challenges compared to
-other multimodal reasoning benchmarks MMCode and MathVista. We also explore the
-reason for the poor performance of the open-source models. All data and code
-are available at https://github.com/wanghanbinpanda/CodeVision.
-
-摘要：本文介绍 Code-Vision，此基准测试旨在评估多模态大型语言模型 (MLLM) 的逻辑理解和代码生成能力。它要求 MLLM 根据给定的流程图生成一个正确的程序，以满足特定的功能需求，而流程图直观地表示所需的算法或流程。Code-Vision 包含三个子集：HumanEval-V、Algorithm 和 MATH，它们评估 MLLM 在基本编程、算法和数学问题解决域中的编码能力。我们的实验对 Code-Vision 上的 12 个 MLLM 进行了评估。实验结果表明，专有模型和开源模型之间的性能差异很大。在困难问题上，GPT-4o 可以达到 79.3% 的 pass@1，但最好的开源模型只能达到 15%。进一步的实验表明，与其他多模态推理基准 MMCode 和 MathVista 相比，Code-Vision 可能会带来独特的挑战。我们还探讨了开源模型性能不佳的原因。所有数据和代码均可在 https://github.com/wanghanbinpanda/CodeVision 中获得。
-
-##### **M-ABSA: A Multilingual Dataset for Aspect-Based Sentiment Analysis**
-2502.11824v1 by Chengyan Wu, Bolei Ma, Yihong Liu, Zheyu Zhang, Ningyuan Deng, Yanshu Li, Baolan Chen, Yi Zhang, Barbara Plank, Yun Xue
-
-Aspect-based sentiment analysis (ABSA) is a crucial task in information
-extraction and sentiment analysis, aiming to identify aspects with associated
-sentiment elements in text. However, existing ABSA datasets are predominantly
-English-centric, limiting the scope for multilingual evaluation and research.
-To bridge this gap, we present M-ABSA, a comprehensive dataset spanning 7
-domains and 21 languages, making it the most extensive multilingual parallel
-dataset for ABSA to date. Our primary focus is on triplet extraction, which
-involves identifying aspect terms, aspect categories, and sentiment polarities.
-The dataset is constructed through an automatic translation process with human
-review to ensure quality. We perform extensive experiments using various
-baselines to assess performance and compatibility on M-ABSA. Our empirical
-findings highlight that the dataset enables diverse evaluation tasks, such as
-multilingual and multi-domain transfer learning, and large language model
-evaluation, underscoring its inclusivity and its potential to drive
-advancements in multilingual ABSA research.
-
-摘要：面向方面的观点分析 (ABSA) 是資訊萃取和觀點分析中的一項重要任務，旨在識別文本中帶有相關觀點元素的方面。然而，現有的 ABSA 資料集以英語為中心，限制了多語言評估和研究的範圍。為了彌補這個差距，我們提出了 M-ABSA，這是一個涵蓋 7 個領域和 21 種語言的綜合性資料集，使其成為迄今為止最廣泛的多語言平行資料集，適用於 ABSA。我們的重點是三元組萃取，其中涉及識別方面術語、方面類別和觀點極性。該資料集是透過自動翻譯過程構建的，並經過人工審查以確保品質。我們使用各種基線進行廣泛的實驗，以評估 M-ABSA 上的效能和相容性。我們的實證結果強調，該資料集支援多樣化的評估任務，例如多語言和多領域遷移學習，以及大型語言模型評估，凸顯其包容性和推動多語言 ABSA 研究進展的潛力。
-
-##### **AAKT: Enhancing Knowledge Tracing with Alternate Autoregressive Modeling**
-2502.11817v1 by Hao Zhou, Wenge Rong, Jianfei Zhang, Qing Sun, Yuanxin Ouyang, Zhang Xiong
-
-Knowledge Tracing (KT) aims to predict students' future performances based on
-their former exercises and additional information in educational settings. KT
-has received significant attention since it facilitates personalized
-experiences in educational situations. Simultaneously, the autoregressive
-modeling on the sequence of former exercises has been proven effective for this
-task. One of the primary challenges in autoregressive modeling for Knowledge
-Tracing is effectively representing the anterior (pre-response) and posterior
-(post-response) states of learners across exercises. Existing methods often
-employ complex model architectures to update learner states using question and
-response records. In this study, we propose a novel perspective on the knowledge
-tracing task by treating it as a generative process, consistent with the
-principles of autoregressive models. We demonstrate that knowledge states can
-be directly represented through autoregressive encodings on a question-response
-alternate sequence, where the model generates the most probable representation in
-hidden state space by analyzing history interactions. This approach underpins
-our framework, termed Alternate Autoregressive Knowledge Tracing (AAKT).
-Additionally, we incorporate supplementary educational information, such as
-question-related skills, into our framework through an auxiliary task, and
-include extra exercise details, like response time, as additional inputs. Our
-proposed framework is implemented using advanced autoregressive technologies
-from Natural Language Generation (NLG) for both training and prediction.
-Empirical evaluations on four real-world KT datasets indicate that AAKT
-consistently outperforms all baseline models in terms of AUC, ACC, and RMSE.
-Furthermore, extensive ablation studies and visualized analysis validate the
-effectiveness of key components in AAKT.
-
-摘要：知識追蹤 (KT) 旨在根據學生的前次練習和教育環境中的額外資訊，預測學生的未來表現。KT 自從促進教育情境中的個人化體驗後，便備受關注。同時，前次練習序列上的自迴歸模型已被證明對此任務有效。知識追蹤中自迴歸模型的主要挑戰之一，是有效表示學習者在各項練習中的先驗 (反應前) 和後驗 (反應後) 狀態。現有方法通常採用複雜的模型架構，使用問題和反應記錄來更新學習者狀態。在本研究中，我們提出了一個關於知識追蹤任務的新觀點，將其視為一個生成過程，與自迴歸模型的原理一致。我們證明了知識狀態可以直接透過問答交替序列上的自迴歸編碼來表示，其中模型透過分析歷史互動來生成隱藏狀態空間中最可能的表示。此方法支撐了我們的架構，稱為交替自迴歸知識追蹤 (AAKT)。此外，我們透過輔助任務將補充教育資訊（例如與問題相關的技能）納入我們的架構，並將額外練習細節（例如反應時間）納入額外輸入。我們提出的架構是使用自然語言生成 (NLG) 的先進自迴歸技術，用於訓練和預測。對四個真實世界的 KT 資料集進行的經驗評估表明，AAKT 在 AUC、ACC 和 RMSE 方面始終優於所有基準模型。此外，廣泛的消融研究和視覺化分析驗證了 AAKT 中關鍵組件的有效性。
-
-##### **Towards Understanding Fine-Tuning Mechanisms of LLMs via Circuit Analysis**
-2502.11812v1 by Xu Wang, Yan Hu, Wenyu Du, Reynold Cheng, Benyou Wang, Difan Zou
-
-Fine-tuning significantly improves the performance of Large Language Models
-(LLMs), yet its underlying mechanisms remain poorly understood. This paper aims
-to provide an in-depth interpretation of the fine-tuning process through
-circuit analysis, a popular tool in Mechanistic Interpretability (MI). Unlike
-previous studies
-\cite{prakash2024finetuningenhancesexistingmechanisms,chhabra2024neuroplasticity}
-that focus on tasks where pre-trained models already perform well, we develop a
-set of mathematical tasks where fine-tuning yields substantial performance
-gains, which are closer to the practical setting. In our experiments, we
-identify circuits at various checkpoints during fine-tuning and examine the
-interplay between circuit analysis, fine-tuning methods, and task complexities.
-First, we find that while circuits maintain high node similarity before and
-after fine-tuning, their edges undergo significant changes, which contrasts
-with previous work
-\cite{prakash2024finetuningenhancesexistingmechanisms,chhabra2024neuroplasticity}
-showing that circuits only add some additional components after fine-tuning. Based
-on these observations, we develop a circuit-aware Low-Rank Adaptation (LoRA)
-method, which assigns ranks to layers based on edge changes in the circuits.
-Experimental results demonstrate that our circuit-based LoRA algorithm achieves
-an average performance improvement of 2.46% over standard LoRA with similar
-parameter sizes. Furthermore, we explore how combining circuits from subtasks
-can enhance fine-tuning in compositional tasks, providing new insights into the
-design of such tasks and deepening the understanding of circuit dynamics and
-fine-tuning mechanisms.
-
-摘要：微調大幅提升大型語言模型 (LLM) 的效能，但其底層機制仍鮮為人知。本文旨在透過電路分析，一種機械可解釋性 (MI) 中廣泛使用的工具，提供微調過程的深入詮釋。不同於先前的研究
-\cite{prakash2024finetuningenhancesexistingmechanisms,chhabra2024neuroplasticity}
-專注於預訓練模型已表現良好的任務，我們開發了一組數學任務，其中微調產生顯著的效能提升，更接近實際設定。在我們的實驗中，我們在微調期間的各種檢查點識別電路，並探討電路分析、微調方法和任務複雜度之間的交互作用。首先，我們發現電路在微調前後雖然維持高節點相似度，但其邊緣卻經歷顯著變化，這與先前的研究
-\cite{prakash2024finetuningenhancesexistingmechanisms,chhabra2024neuroplasticity}
-顯示電路僅在微調後新增一些額外組件的結果相反。基於這些觀察，我們開發了一個電路感知低秩適應 (LoRA) 方法，根據電路中的邊緣變化為層級分配秩。實驗結果證明，我們的基於電路的 LoRA 演算法在參數大小相似的條件下，比標準 LoRA 平均提升了 2.46% 的效能。此外，我們探討如何結合子任務的電路來增強組合任務中的微調，為此類任務的設計提供新的見解，並加深對電路動態和微調機制的理解。
-
-##### **FineFilter: A Fine-grained Noise Filtering Mechanism for Retrieval-Augmented Large Language Models**
-2502.11811v1 by Qianchi Zhang, Hainan Zhang, Liang Pang, Hongwei Zheng, Yongxin Tong, Zhiming Zheng
-
-Retrieved documents containing noise will hinder Retrieval-Augmented
-Generation (RAG) from detecting answer clues, necessitating noise filtering
-mechanisms to enhance accuracy. Existing methods use re-ranking or summarization
-to identify the most relevant sentences, but directly and accurately locating
-answer clues from these large-scale and complex documents remains challenging.
-Unlike these document-level operations, we treat noise filtering as a
-sentence-level MinMax optimization problem: first identifying the potential
-clues from multiple documents using contextual information, then ranking them
-by relevance, and finally retaining the fewest clues through truncation. In this
-paper, we propose FineFilter, a novel fine-grained noise filtering mechanism
-for RAG consisting of a clue extractor, a re-ranker, and a truncator. We
-optimize each module to tackle complex reasoning challenges: (1) Clue extractor
-firstly uses sentences containing the answer and similar ones as fine-tuned
-targets, aiming at extracting sufficient potential clues; (2) Re-ranker is
-trained to prioritize effective clues based on the real feedback from
-generation module, with clues capable of generating correct answer as positive
-samples and others as negative; (3) Truncator takes the minimum clues needed to
-answer the question (truncation point) as fine-tuned targets, and performs
-truncation on the re-ranked clues to achieve fine-grained noise filtering.
-Experiments on three QA datasets demonstrate that FineFilter significantly
-outperforms baselines in terms of performance and inference cost. Further
-analysis on each module shows the effectiveness of our optimizations for
-complex reasoning.
-
-摘要：檢索到含有雜訊的文件會阻礙檢索增強生成 (RAG) 偵測答案線索，因此需要雜訊過濾機制來增強準確性。現有方法使用重新排序或摘要來找出最相關的句子，但從這些大規模且複雜的文件中直接且準確地找出答案線索仍然具有挑戰性。與這些文件層級的操作不同，我們將雜訊過濾視為一個句子層級的 MinMax 最佳化問題：首先使用脈絡資訊從多個文件中找出潛在線索，接著依據相關性對它們進行排序，最後透過截斷保留最少的線索。在本文中，我們提出 FineFilter，一種創新的細緻雜訊過濾機制，用於 RAG，它包含一個線索萃取器、一個重新排序器和一個截斷器。我們最佳化每個模組來應對複雜的推理挑戰：(1) 線索萃取器首先使用包含答案和類似答案的句子作為微調的目標，旨在萃取足夠的潛在線索；(2) 重新排序器經過訓練，根據生成模組的真實回饋來優先處理有效的線索，其中能夠生成正確答案的線索為正樣本，其他則為負樣本；(3) 截斷器將回答問題所需的最小線索 (截斷點) 視為微調的目標，並對重新排序的線索執行截斷，以達成細緻的雜訊過濾。在三個問答資料集上的實驗證實，FineFilter 在效能和推論成本方面都明顯優於基線。進一步分析每個模組顯示，我們的最佳化對於複雜推理而言是有效的。
-
-##### **Revealing Bias Formation in Deep Neural Networks Through the Geometric Mechanisms of Human Visual Decoupling**
-2502.11809v1 by Yanbiao Ma, Bowei Liu, Wei Dai, Jiayi Chen, Shuo Li
-
-Deep neural networks (DNNs) often exhibit biases toward certain categories
-during object recognition, even under balanced training data conditions. The
-intrinsic mechanisms underlying these biases remain unclear. Inspired by the
-human visual system, which decouples object manifolds through hierarchical
-processing to achieve object recognition, we propose a geometric analysis
-framework linking the geometric complexity of class-specific perceptual
-manifolds in DNNs to model bias. Our findings reveal that differences in
-geometric complexity can lead to varying recognition capabilities across
-categories, introducing biases. To support this analysis, we present the
-Perceptual-Manifold-Geometry library, designed for calculating the geometric
-properties of perceptual manifolds.
-
-摘要：深度神經網路 (DNN) 在物件辨識過程中，即使在平衡的訓練資料條件下，通常會對特定類別表現出偏見。這些偏見背後的基本機制仍然不清楚。受人類視覺系統的啟發，人類視覺系統透過階層化處理來解耦物件流形以達成物件辨識，我們提出一個幾何分析架構，將 DNN 中特定類別感知流形的幾何複雜度與模型偏見連結起來。我們的研究結果顯示，幾何複雜度的差異會導致不同類別的辨識能力有所不同，進而造成偏見。為了支持這個分析，我們提出感知流形幾何函式庫，用於計算感知流形的幾何屬性。
-
-##### **Exploring Translation Mechanism of Large Language Models**
-2502.11806v1 by Hongbin Zhang, Kehai Chen, Xuefeng Bai, Xiucheng Li, Min Zhang
-
-Large language models (LLMs) have succeeded remarkably in multilingual
-translation tasks. However, the inherent translation mechanisms of LLMs remain
-poorly understood, largely due to sophisticated architectures and vast
-parameter scales. In response to this issue, this study explores the
-translation mechanism of LLMs from the perspective of computational components
-(e.g., attention heads and MLPs). Path patching is utilized to explore causal
-relationships between components, detecting those crucial for translation tasks
-and subsequently analyzing their behavioral patterns in human-interpretable
-terms. Comprehensive analysis reveals that translation is predominantly
-facilitated by a sparse subset of specialized attention heads (less than 5%),
-which extract source language, indicator, and positional features. MLPs
-subsequently integrate and process these features by transitioning towards
-English-centric latent representations. Notably, building on the above
-findings, targeted fine-tuning of only 64 heads achieves translation
-improvement comparable to full-parameter tuning while preserving general
-capabilities.
-
-摘要：大型語言模型 (LLM) 在多語言翻譯任務中取得了顯著的成功。然而，LLM 內在的翻譯機制仍未被很好地理解，這主要是由於複雜的架構和龐大的參數規模。為了應對這個問題，本研究從計算元件（例如注意力頭和 MLP）的角度探討了 LLM 的翻譯機制。路徑修補用於探索元件之間的因果關係，檢測對翻譯任務至關重要的元件，並隨後以人類可解釋的方式分析它們的行為模式。綜合分析表明，翻譯主要由稀疏的專門注意力頭（不到 5%）促進，這些注意力頭提取源語言、指標和位置特徵。MLPs 隨後通過轉換為以英語為中心的潛在表示來整合和處理這些特徵。值得注意的是，根據上述發現，僅對 64 個頭進行有針對性的微調，即可實現與全參數調整相當的翻譯改進，同時保留一般能力。
-
-##### **Table-Critic: A Multi-Agent Framework for Collaborative Criticism and Refinement in Table Reasoning**
-2502.11799v1 by Peiying Yu, Guoxin Chen, Jingjing Wang
-
-Despite the remarkable capabilities of large language models (LLMs) in
-various reasoning tasks, they still struggle with table reasoning tasks,
-particularly in maintaining consistency throughout multi-step reasoning
-processes. While existing approaches have explored various decomposition
-strategies, they often lack effective mechanisms to identify and correct errors
-in intermediate reasoning steps, leading to cascading error propagation. To
-address these issues, we propose Table-Critic, a novel multi-agent framework
-that facilitates collaborative criticism and iterative refinement of the
-reasoning process until convergence to correct solutions. Our framework
-consists of four specialized agents: a Judge for error identification, a Critic
-for comprehensive critiques, a Refiner for process improvement, and a Curator
-for pattern distillation. To effectively deal with diverse and unpredictable
-error types, we introduce a self-evolving template tree that systematically
-accumulates critique knowledge through experience-driven learning and guides
-future reflections. Extensive experiments demonstrate that Table-Critic
-substantially improves over existing methods, achieving superior
-accuracy and error correction rates while maintaining computational efficiency
-and a lower solution degradation rate.
-
-摘要：儘管大型語言模型 (LLM) 在各種推理任務中展現出非凡的能力，它們在表格推理任務中仍面臨挑戰，特別是在多步驟推理過程中維持一致性方面。現有方法雖然探索了各種分解策略，但它們通常缺乏有效機制來識別和修正中間推理步驟中的錯誤，導致錯誤遞增。為了解決這些問題，我們提出 Table-Critic，一個新穎的多代理架構，它促進協作批評和反覆改進推理過程，直到收斂到正確的解決方案。我們的架構包含四個專業代理：用於錯誤識別的法官、用於全面批評的批評者、用於流程改進的精煉器，以及用於模式萃取的策展人。為了有效處理多樣且不可預測的錯誤類型，我們引入了一個自演化範本樹，它透過經驗驅動的學習系統性地累積批評知識，並引導未來的反思。廣泛的實驗證明，Table-Critic 在現有方法的基礎上取得了顯著的進步，在維持運算效率和較低解決方案劣化率的同時，達到了更高的準確度和錯誤修正率。
-
-##### **Personality Editing for Language Models through Relevant Knowledge Editing**
-2502.11789v1 by Seojin Hwang, Yumin Kim, Byeongjeong Kim, Hwanhee Lee
-
-Large Language Models (LLMs) play a vital role in applications like
-conversational agents and content creation, where controlling a model's
-personality is crucial for maintaining tone, consistency, and engagement.
-However, traditional prompt-based techniques for controlling personality often
-fall short, as they do not effectively mitigate the model's inherent biases. In
-this paper, we introduce a novel method PALETTE that enhances personality
-control through knowledge editing. By generating adjustment queries inspired by
-psychological assessments, our approach systematically adjusts responses to
-personality-related queries similar to modifying factual knowledge, thereby
-achieving controlled shifts in personality traits. Experimental results from
-both automatic and human evaluations demonstrate that our method enables more
-stable and well-balanced personality control in LLMs.
-
-摘要：大型語言模型 (LLM) 在會話代理和內容創作等應用程式中扮演至關重要的角色，其中控制模型的人格特質對於維持語氣、一致性和參與度至關重要。然而，傳統基於提示的控制人格技術通常無法達到預期效果，因為它們無法有效減輕模型固有的偏差。在本文中，我們介紹一種創新的方法 PALETTE，它通過知識編輯來增強人格控制。透過產生受心理評量啟發的調整查詢，我們的做法系統性地調整對人格相關查詢的回應，類似於修改事實知識，從而實現人格特質的受控轉變。來自自動和人工評估的實驗結果表明，我們的模型能夠在 LLM 中實現更穩定且均衡的人格控制。
-
-##### **Efficient Response Generation Method Selection for Fine-Tuning Large Language Models**
-2502.11779v1 by Xuan Ren, Qi Chen, Lingqiao Liu
-
-The training data for fine-tuning large language models (LLMs) is typically
-structured as input-output pairs. However, for many tasks, there can be
-multiple equally valid output variations for the same input. Recent studies
-have observed that the choice of output variation used in training can affect
-the model's performance. This raises an important question: how can we generate
-the most effective output from the many possible response generation strategy
-options? Rather than relying on the traditional but resource-intensive
-train-and-evaluate approach, this paper proposes a scalable, approximate method
-for estimating the quality of a small subset of generated training data derived
-from the same input. We then evaluate how well this small subset of generated
-output fits the target model we are trying to train. We present a large-scale
-benchmark covering diverse reasoning-based datasets to support our study.
-  The central idea is that a good output should closely resemble the output
-generated by the target LLM. We formalize this 'closeness' as the expected
-alignment score between a candidate output and the output sampled from the
-target LLM. We connect this measurement to the perplexity metric used in
-previous literature and demonstrate that leveraging an alignment-based metric
-can provide better predictions of model performance. Using this strategy, we
-can evaluate a small subset of the generated output from each response
-generation strategy option, then select the most effective strategy. We show
-that an LLM trained on data generated by the selected strategy could lead to a
-significant performance gain in many cases.
-
-摘要：大型語言模型 (LLM) 的微調訓練資料通常
-以輸入輸出配對結構化。然而，對於許多任務而言，相同的輸入可能有多個同樣有效的輸出變化。最近的研究
-觀察到訓練中使用的輸出變化選擇會影響模型的效能。這引發了一個重要問題：我們如何從許多可能的回應產生策略選項中產生最有效的輸出？本文提出一個可擴充、近似的方法，用於估計從相同輸入衍生的訓練資料小子集的品質，而非依賴傳統但資源密集的訓練和評估方法。然後我們評估這個產生輸出的小子集與我們嘗試訓練的目標模型的契合程度。我們提出一個涵蓋各種基於推理的資料集的大規模基準，以支持我們的研究。
-核心概念是良好的輸出應與目標 LLM 產生的輸出密切相似。我們將這種「接近度」形式化為候選輸出與從目標 LLM 取樣的輸出之間的預期對齊分數。我們將此測量連接到先前文獻中使用的困惑度指標，並證明利用基於對齊的指標可以提供更好的模型效能預測。使用此策略，我們可以評估每個回應產生策略選項所產生輸出的小子集，然後選擇最有效的策略。我們展示在由所選策略產生的資料上訓練的 LLM，在許多情況下可能導致顯著的效能提升。
-
-##### **Deep Neural Networks for Accurate Depth Estimation with Latent Space Features**
-2502.11777v1 by Siddiqui Muhammad Yasir, Hyunsik Ahn
-
-Depth estimation plays a pivotal role in advancing human-robot interactions,
-especially in indoor environments where accurate 3D scene reconstruction is
-essential for tasks like navigation and object handling. Monocular depth
-estimation, which relies on a single RGB camera, offers a more affordable
-solution compared to traditional methods that use stereo cameras or LiDAR.
-However, despite recent progress, many monocular approaches struggle with
-accurately defining depth boundaries, leading to less precise reconstructions.
-In response to these challenges, this study introduces a novel depth estimation
-framework that leverages latent space features within a deep convolutional
-neural network to enhance the precision of monocular depth maps. The proposed
-model features dual encoder-decoder architecture, enabling both color-to-depth
-and depth-to-depth transformations. This structure allows for refined depth
-estimation through latent space encoding. To further improve the accuracy of
-depth boundaries and local features, a new loss function is introduced. This
-function combines latent loss with gradient loss, helping the model maintain
-the integrity of depth boundaries. The framework is thoroughly tested using the
-NYU Depth V2 dataset, where it sets a new benchmark, particularly excelling in
-complex indoor scenarios. The results clearly show that this approach
-effectively reduces depth ambiguities and blurring, making it a promising
-solution for applications in human-robot interaction and 3D scene
-reconstruction.
-
-摘要：深度估計在推進人機互動方面發揮著至關重要的作用，特別是在室內環境中，準確的 3D 場景重建對於導航和物體處理等任務至關重要。單目深度估計依賴於單個 RGB 相機，與使用立體相機或 LiDAR 的傳統方法相比，它提供了一個更經濟的解決方案。然而，儘管最近取得了進展，許多單目方法在準確定義深度邊界方面仍然存在困難，從而導致重建精度降低。為了應對這些挑戰，本研究引入了一個新穎的深度估計框架，該框架利用深度卷積神經網路中的潛在空間特徵來增強單目深度圖的精度。所提出的模型採用雙編碼器-解碼器架構，既能進行顏色到深度的轉換，又能進行深度到深度的轉換。這種結構允許通過潛在空間編碼進行精確的深度估計。為了進一步提高深度邊界和局部特徵的精度，引入了一個新的損失函數。此函數將潛在損失與梯度損失相結合，幫助模型維護深度邊界的完整性。使用 NYU Depth V2 數據集對該框架進行了全面測試，在該數據集上，它設定了一個新的基準，特別是在複雜的室內場景中表現出色。結果清楚地表明，這種方法有效地減少了深度模糊和模糊，使其成為人機互動和 3D 場景重建應用中一種有前途的解決方案。
-
-##### **The Validation Gap: A Mechanistic Analysis of How Language Models Compute Arithmetic but Fail to Validate It**
-2502.11771v1 by Leonardo Bertolazzi, Philipp Mondorf, Barbara Plank, Raffaella Bernardi
-
-The ability of large language models (LLMs) to validate their output and
-identify potential errors is crucial for ensuring robustness and reliability.
-However, current research indicates that LLMs struggle with self-correction,
-encountering significant challenges in detecting errors. While studies have
-explored methods to enhance self-correction in LLMs, relatively little
-attention has been given to understanding the models' internal mechanisms
-underlying error detection. In this paper, we present a mechanistic analysis of
-error detection in LLMs, focusing on simple arithmetic problems. Through
-circuit analysis, we identify the computational subgraphs responsible for
-detecting arithmetic errors across four smaller-sized LLMs. Our findings reveal
-that all models heavily rely on *consistency heads*: attention heads
-that assess surface-level alignment of numerical values in arithmetic
-solutions. Moreover, we observe that the models' internal arithmetic
-computation primarily occurs in higher layers, whereas validation takes place
-in middle layers, before the final arithmetic results are fully encoded. This
-structural dissociation between arithmetic computation and validation seems to
-explain why current LLMs struggle to detect even simple arithmetic errors.
-
-摘要：大型語言模型 (LLM) 驗證其輸出並識別潛在錯誤的能力對於確保穩健性和可靠性至關重要。
-然而，目前的研究所示，LLM 難以進行自我修正，在檢測錯誤時遇到重大挑戰。儘管研究已探討增強 LLM 自我修正的方法，但對於瞭解模型內部錯誤檢測機制卻關注較少。在本文中，我們提出對 LLM 中錯誤檢測的機制分析，重點關注簡單的算術問題。通過電路分析，我們識別出負責檢測四個較小規模 LLM 中算術錯誤的計算子圖。我們的研究結果表明，所有模型都嚴重依賴於「一致性頭部」--注意頭部，用於評估算術解中數值表面的對齊方式。此外，我們觀察到模型的內部算術運算主要發生在較高層，而驗證則發生在中間層，在最終算術結果完全編碼之前。算術運算和驗證之間的這種結構性分離似乎解釋了為什麼當前的 LLM 難以檢測到即使是簡單的算術錯誤。
-
-##### **Cognitive-Aligned Document Selection for Retrieval-augmented Generation**
-2502.11770v1 by Bingyu Wan, Fuxi Zhang, Zhongpeng Qi, Jiayi Ding, Jijun Li, Baoshi Fan, Yijia Zhang, Jun Zhang
-
-Large language models (LLMs) inherently display hallucinations since the
-precision of generated texts cannot be guaranteed purely by the parametric
-knowledge they include. Although retrieval-augmented generation (RAG) systems
-enhance the accuracy and reliability of generative models by incorporating
-external documents, these retrieved documents often fail to adequately support
-the model's responses in practical applications. To address this issue, we
-propose GGatrieval (Fine-Grained Grounded Alignment
-Retrieval for verifiable generation), which leverages an LLM to
-dynamically update queries and filter high-quality, reliable retrieval
-documents. Specifically, we parse the user query into its syntactic components
-and perform fine-grained grounded alignment with the retrieved documents. For
-query components that cannot be individually aligned, we propose a dynamic
-semantic compensation mechanism that iteratively refines and rewrites the query
-while continuously updating the retrieval results. This iterative process
-continues until the retrieved documents sufficiently support the query's
-response. Our approach introduces a novel criterion for filtering retrieved
-documents, closely emulating human strategies for acquiring targeted
-information. This ensures that the retrieved content effectively supports and
-verifies the generated outputs. On the ALCE benchmark, our method significantly
-surpasses a wide range of baselines, achieving state-of-the-art performance.
-
-摘要：大型語言模型 (LLM) 本質上會出現幻覺，因為生成的文本的準確性無法僅透過它們包含的參數化知識來保證。儘管檢索增強生成 (RAG) 系統透過納入外部文件來提升生成模型的準確性和可靠性，但這些檢索的文件在實際應用中常常無法充分支援模型的回應。為了解決這個問題，我們提出 GGatrieval（用於可驗證生成的精細化粒度化基礎對齊檢索），它利用 LLM 來動態更新查詢並過濾高品質、可靠的檢索文件。具體來說，我們將使用者查詢分析成其語法組成部分，並對檢索文件執行精細化粒度化基礎對齊。對於無法個別對齊的查詢組成部分，我們提出一個動態語義補償機制，在持續更新檢索結果的同時，反覆修正和重寫查詢。這個反覆的程序會持續到檢索的文件充分支援查詢的回應為止。我們的做法引進了一個新的檢索文件過濾標準，嚴密地模擬人類獲取目標資訊的策略。這確保檢索的內容有效地支援和驗證生成的輸出。在 ALCE 基準測試中，我們的做法顯著超越各種基線，達成最先進的效能。
-
-##### **From Selection to Generation: A Survey of LLM-based Active Learning**
-2502.11767v1 by Yu Xia, Subhojyoti Mukherjee, Zhouhang Xie, Junda Wu, Xintong Li, Ryan Aponte, Hanjia Lyu, Joe Barrow, Hongjie Chen, Franck Dernoncourt, Branislav Kveton, Tong Yu, Ruiyi Zhang, Jiuxiang Gu, Nesreen K. Ahmed, Yu Wang, Xiang Chen, Hanieh Deilamsalehy, Sungchul Kim, Zhengmian Hu, Yue Zhao, Nedim Lipka, Seunghyun Yoon, Ting-Hao Kenneth Huang, Zichao Wang, Puneet Mathur, Soumyabrata Pal, Koyel Mukherjee, Zhehao Zhang, Namyong Park, Thien Huu Nguyen, Jiebo Luo, Ryan A. Rossi, Julian McAuley
-
-Active Learning (AL) has been a powerful paradigm for improving model
-efficiency and performance by selecting the most informative data points for
-labeling and training. In recent active learning frameworks, Large Language
-Models (LLMs) have been employed not only for selection but also for generating
-entirely new data instances and providing more cost-effective annotations.
-Motivated by the increasing importance of high-quality data and efficient model
-training in the era of LLMs, we present a comprehensive survey on LLM-based
-Active Learning. We introduce an intuitive taxonomy that categorizes these
-techniques and discuss the transformative roles LLMs can play in the active
-learning loop. We further examine the impact of AL on LLM learning paradigms
-and its applications across various domains. Finally, we identify open
-challenges and propose future research directions. This survey aims to serve as
-an up-to-date resource for researchers and practitioners seeking to gain an
-intuitive understanding of LLM-based AL techniques and deploy them to new
-applications.
-
-摘要：主動學習 (AL) 透過挑選最具資訊性的資料點來標記和訓練，已成為一種強大的範例，用以提升模型效率和效能。在最近的主動學習架構中，大型語言模型 (LLM) 不僅用於挑選，也用於產生全新的資料實例，並提供更具成本效益的註解。在大型語言模型時代，由於高品質資料和高效能模型訓練日益重要，我們針對基於大型語言模型的主動學習提出了一項全面的調查。我們提出一個直覺式的分類法，用以分類這些技術，並探討大型語言模型在主動學習迴圈中可以扮演的轉型角色。我們進一步探討主動學習對大型語言模型學習範例的影響，以及它在各種領域中的應用。最後，我們找出開放式挑戰，並提出未來的研究方向。本調查旨在作為研究人員和實務工作者的最新資源，用以獲得對基於大型語言模型的主動學習技術的直覺式理解，並將其部署至新的應用程式。
-
-##### **Warmup-Distill: Bridge the Distribution Mismatch between Teacher and Student before Knowledge Distillation**
-2502.11766v1 by Zengkui Sun, Yijin Liu, Fandong Meng, Yufeng Chen, Jinan Xu, Jie Zhou
-
-The widespread deployment of Large Language Models (LLMs) is hindered by the
-high computational demands, making knowledge distillation (KD) crucial for
-developing compact smaller ones. However, conventional KD methods suffer from
-the distribution mismatch issue between the teacher and student models, leading
-to the poor performance of distillation. For instance, the widely-used KL-based
-methods suffer from the mode-averaging and mode-collapsing problems, due to the
-mismatched probability distribution between both models. Previous studies
-mainly optimize this issue via different distance calculations towards the
-distribution of both models. Unfortunately, the distribution mismatch issue
-still exists in the early stage of the distillation. Hence, to reduce the
-impact of distribution mismatch, we propose a simple yet efficient method,
-named Warmup-Distill, which aligns the distillation of the student to that of
-the teacher in advance of distillation. Specifically, we first detect the
-distribution of the student model in practical scenarios with its internal
-knowledge, and then modify the knowledge with low probability via the teacher
-as the checker. Consequently, Warmup-Distill aligns the internal student's
-knowledge to that of the teacher, which expands the distribution of the student
-with the teacher's, and assists the student model to learn better in the
-subsequent distillation. Experiments on the seven benchmarks demonstrate that
-Warmup-Distill could provide a warmup student more suitable for distillation,
-which outperforms the vanilla student by at least +0.4 averaged score among all
-benchmarks. Notably, with the assistance of Warmup-Distill, the distillation
-on the math task could yield a further improvement of up to +1.9% accuracy.
-
-摘要：大型語言模型 (LLM) 的廣泛部署受到高運算需求的阻礙，這使得知識蒸餾 (KD) 對於開發緊湊型的小型模型至關重要。然而，傳統的 KD 方法忍受了教師和學生模型之間的分布不匹配問題，導致蒸餾效果不佳。例如，廣泛使用的基於 KL 的方法會出現模式平均和模式崩潰問題，因為兩個模型之間的機率分佈不匹配。先前的研究主要透過不同的距離計算來最佳化這個問題，以朝向兩個模型的分布。不幸的是，分布不匹配的問題仍然存在於蒸餾的早期階段。因此，為了減少分布不匹配的影響，我們提出了一種簡單但有效的方法，稱為 Warmup-Distill，它在蒸餾之前將學生的蒸餾與教師的蒸餾對齊。具體來說，我們首先使用其內部知識在實際場景中檢測學生的分布，然後透過教師作為檢查員修改低機率的知識。因此，Warmup-Distill 將學生的內部知識與教師的知識對齊，這會將學生的分布擴展到教師的分布，並協助學生模型在後續的蒸餾中學習得更好。在七個基準測試上的實驗表明，Warmup-Distill 可以提供更適合蒸餾的熱身學生，在所有基準測試中，其表現優於香草學生至少 +0.4 的平均分數。值得注意的是，在 Warmup-Distill 的協助下，數學任務上的蒸餾可以進一步提升，最多可提升 +1.9% 的準確度。
-
-##### **Lightweight Deepfake Detection Based on Multi-Feature Fusion**
-2502.11763v1 by Siddiqui Muhammad Yasir, Hyun Kim
-
-Deepfake technology utilizes deep-learning-based face manipulation techniques
-to seamlessly replace faces in videos, creating highly realistic but
-artificially generated content. Although this technology has beneficial
-applications in media and entertainment, misuse of its capabilities may lead to
-serious risks, including identity theft, cyberbullying, and false information. The
-integration of DL with visual cognition has resulted in important technological
-improvements, particularly in addressing privacy risks caused by artificially
-generated deepfake images on digital media platforms. In this study, we propose
-an efficient and lightweight method for detecting deepfake images and videos,
-making it suitable for devices with limited computational resources. In order
-to reduce the computational burden usually associated with DL models, our method
-integrates machine learning classifiers in combination with keyframing
-approaches and texture analysis. Moreover, the features extracted with a
-histogram of oriented gradients (HOG), local binary pattern (LBP), and KAZE bands
-were integrated and evaluated using random forest, extreme gradient boosting, extra
-trees, and support vector classifier algorithms. Our findings show that a
-feature-level fusion of HOG, LBP, and KAZE features improves accuracy to 92% and
-96% on FaceForensics++ and Celeb-DFv2, respectively.
-
-摘要：深度偽造技術利用基於深度學習的換臉技術，可無縫替換影片中的臉孔，創造出高度逼真但人工產生的內容。儘管這項技術在媒體和娛樂方面有益，但若誤用其功能可能會導致嚴重的風險，包括身分盜用、網路霸凌和虛假訊息。深度學習與視覺認知的整合已帶來重要的技術進步，特別是在解決由數位媒體平台上的人工深度偽造影像所造成的隱私風險方面。在本研究中，我們提出了一種用於偵測深度偽造影像和影片的有效且輕量級的方法，使其適用於運算資源有限的裝置。為了降低通常與深度學習模型相關的運算負擔，我們的做法結合了機器學習分類器、關鍵影格方法和紋理分析。此外，我們整合了使用方向梯度直方圖 (HOG)、局部二進位模式 (LBP) 和 KAZE 頻段所萃取出的特徵，並使用隨機森林、極端梯度提升、額外樹木和支援向量分類器演算法進行評估。我們的研究結果顯示，HOG、LBP 和 KAZE 特徵的層級融合將準確度提升至 92%，分別在 FaceForensics++ 和 Celeb-DFv2 上達到 96%。
-
-##### **HintsOfTruth: A Multimodal Checkworthiness Detection Dataset with Real and Synthetic Claims**
-2502.11753v1 by Michiel van der Meer, Pavel Korshunov, Sébastien Marcel, Lonneke van der Plas
-
-Misinformation can be countered with fact-checking, but the process is costly
-and slow. Identifying checkworthy claims is the first step, where automation
-can help scale fact-checkers' efforts. However, detection methods struggle with
-content that is 1) multimodal, 2) from diverse domains, and 3) synthetic. We
-introduce HintsOfTruth, a public dataset for multimodal checkworthiness
-detection with 27K real-world and synthetic image/claim pairs. The mix of
-real and synthetic data makes this dataset unique and ideal for benchmarking
-detection methods. We compare fine-tuned and prompted Large Language Models
-(LLMs). We find that well-configured lightweight text-based encoders perform
-comparably to multimodal models, but the former focus only on identifying
-non-claim-like content. Multimodal LLMs can be more accurate but come at a
-significant computational cost, making them impractical for large-scale
-applications. When faced with synthetic data, multimodal models perform more
-robustly.
-
-摘要：錯誤訊息可以透過事實查核來反駁，但這個過程既昂貴又緩慢。辨識需要查核的說法是第一步，自動化可以幫助擴大事實查核人員的努力。然而，偵測方法會在處理 1) 多模態、2) 來自不同領域，以及 3) 合成的內容時遇到困難。我們引進 HintsOfTruth，一個用於多模態查核價值偵測的公開資料集，其中包含 27K 個真實世界和合成的影像/說法配對。真實和合成資料的組合讓這個資料集獨一無二，非常適合用於基準偵測方法。我們比較微調和提示的大語言模型 (LLM)。我們發現，設定良好的輕量級文字編碼器的表現與多模態模型相當，但前者只專注於辨識非說法類型的內容。多模態 LLM 可能更準確，但需要大量的運算成本，這讓它們不適用於大規模的應用。在面對合成資料時，多模態模型的表現更強健。
-
-##### **Language Models Can See Better: Visual Contrastive Decoding For LLM Multimodal Reasoning**
-2502.11751v1 by Yuqi Pang, Bowen Yang, Haoqin Tu, Yun Cao, Zeyu Zhang
-
-Although Large Language Models (LLMs) excel in reasoning and generation for
-language tasks, they are not specifically designed for multimodal challenges.
-Training Multimodal Large Language Models (MLLMs), however, is
-resource-intensive and constrained by various training limitations. In this
-paper, we propose the Modular-based Visual Contrastive Decoding (MVCD)
-framework to remove this obstacle. Our framework leverages LLMs' In-Context
-Learning (ICL) capability and the proposed visual contrastive-example decoding
-(CED), specifically tailored for this framework, without requiring any
-additional training. By converting visual signals into text and focusing on
-contrastive output distributions during decoding, we can highlight the new
-information introduced by contextual examples, explore their connections, and
-avoid over-reliance on prior encoded knowledge. MVCD enhances LLMs' visual
-perception so that they can see and reason over the input visuals. To demonstrate
-MVCD's effectiveness, we conduct experiments with four LLMs across five
-question answering datasets. Our results not only show consistent improvement
-in model accuracy but also clearly explain the effective components inside our decoding
-strategy. Our code will be available at https://github.com/Pbhgit/MVCD.
-
-摘要：儘管大型語言模型 (LLM) 在語言任務的推理和生成方面表現優異，但它們並非專門針對多模態挑戰而設計。然而，訓練多模態大型語言模型 (MLLM) 十分耗費資源，並受到各種訓練限制。在本文中，我們提出基於模組的視覺對比解碼 (MVCD) 架構來克服這個障礙。我們的架構利用 LLM 的情境學習 (ICL) 能力和專門為此架構量身打造的視覺對比範例解碼 (CED)，而無需任何額外訓練。透過將視覺信號轉換為文字，並在解碼過程中專注於對比輸出分佈，我們可以突顯情境範例引入的新資訊，探索它們的關聯性，並避免過度依賴先前編碼的知識。MVCD 增強了 LLM 的視覺感知能力，使其能夠觀察並推論輸入視覺效果。為了證明 MVCD 的有效性，我們使用四個 LLM 在五個問答資料集上進行實驗。我們的結果不僅顯示模型準確度持續提升，還能清楚說明我們的解碼策略中的有效組成部分。我們的程式碼將在 https://github.com/Pbhgit/MVCD 公開。
-
-##### **SQL-o1: A Self-Reward Heuristic Dynamic Search Method for Text-to-SQL**
-2502.11741v1 by Shuai Lyu, Haoran Luo, Zhonghong Ou, Yifan Zhu, Xiaoran Shang, Yang Qin, Meina Song
-
-The Text-to-SQL(Text2SQL) task aims to convert natural language queries into
-executable SQL queries. Thanks to the application of large language models
-(LLMs), significant progress has been made in this field. However, challenges
-such as model scalability, limited generation space, and coherence issues in
-SQL generation still persist. To address these issues, we propose SQL-o1, a
-Self-Reward-based heuristic search method designed to enhance the reasoning
-ability of LLMs in SQL query generation. SQL-o1 combines Monte Carlo Tree
-Search (MCTS) for heuristic process-level search and constructs a Schema-Aware
-dataset to help the model better understand database schemas. Extensive
-experiments on the Bird and Spider datasets demonstrate that SQL-o1 improves
-execution accuracy by 10.8\% on the complex Bird dataset compared to the latest
-baseline methods, even outperforming GPT-4-based approaches. Additionally,
-SQL-o1 excels in few-shot learning scenarios and shows strong cross-model
-transferability. Our code is publicly available
-at:https://github.com/ShuaiLyu0110/SQL-o1.
-
-摘要：文本转 SQL（Text2SQL）任务旨在将自然语言查询转换为可执行的 SQL 查询。得益于大型语言模型（LLM）的应用，该领域取得了显著进展。然而，模型可扩展性、生成空间受限和 SQL 生成的连贯性问题等挑战仍然存在。为了解决这些问题，我们提出了 SQL-o1，这是一种基于自我奖励的启发式搜索方法，旨在增强 LLM 在 SQL 查询生成中的推理能力。SQL-o1 结合了蒙特卡罗树搜索（MCTS）用于启发式过程级搜索，并构建了一个模式感知数据集，以帮助模型更好地理解数据库模式。在 Bird 和 Spider 数据集上的大量实验表明，与最新的基准方法相比，SQL-o1 将复杂 Bird 数据集上的执行准确率提高了 10.8%，甚至优于基于 GPT-4 的方法。此外，SQL-o1 在少样本学习场景中表现出色，并显示出强大的跨模型可迁移性。我们的代码已公开发布在：https://github.com/ShuaiLyu0110/SQL-o1。
-
 
 ### Knowledge Graphs
 |Publish Date|Title|Authors|Homepage|Code|
 | :---: | :---: | :---: | :---: | :---: |
+|**2025-02-18**|**Learning to Defer for Causal Discovery with Imperfect Experts**|Oscar Clivio et.al.|[2502.13132v1](http://arxiv.org/abs/2502.13132v1)|null|
+|**2025-02-18**|**Agentic Deep Graph Reasoning Yields Self-Organizing Knowledge Networks**|Markus J. Buehler et.al.|[2502.13025v1](http://arxiv.org/abs/2502.13025v1)|null|
+|**2025-02-18**|**Adaptive Knowledge Graphs Enhance Medical Question Answering: Bridging the Gap Between LLMs and Evolving Medical Knowledge**|Mohammad Reza Rezaei et.al.|[2502.13010v1](http://arxiv.org/abs/2502.13010v1)|null|
+|**2025-02-18**|**R2-KG: General-Purpose Dual-Agent Framework for Reliable Reasoning on Knowledge Graphs**|Sumin Jo et.al.|[2502.12767v1](http://arxiv.org/abs/2502.12767v1)|null|
+|**2025-02-18**|**Perovskite-LLM: Knowledge-Enhanced Large Language Models for Perovskite Solar Cell Research**|Xiang Liu et.al.|[2502.12669v1](http://arxiv.org/abs/2502.12669v1)|null|
+|**2025-02-18**|**G-Refer: Graph Retrieval-Augmented Large Language Model for Explainable Recommendation**|Yuhan Li et.al.|[2502.12586v1](http://arxiv.org/abs/2502.12586v1)|null|
 |**2025-02-17**|**A-MEM: Agentic Memory for LLM Agents**|Wujiang Xu et.al.|[2502.12110v1](http://arxiv.org/abs/2502.12110v1)|null|
 |**2025-02-17**|**KnowPath: Knowledge-enhanced Reasoning via LLM-generated Inference Paths over Knowledge Graphs**|Qi Zhao et.al.|[2502.12029v1](http://arxiv.org/abs/2502.12029v1)|null|
 |**2025-02-17**|**Atom of Thoughts for Markov LLM Test-Time Scaling**|Fengwei Teng et.al.|[2502.12018v1](http://arxiv.org/abs/2502.12018v1)|null|
@@ -7841,7 +5367,7 @@ at:https://github.com/ShuaiLyu0110/SQL-o1.
 |**2025-02-06**|**Division-of-Thoughts: Harnessing Hybrid Language Model Synergy for Efficient On-Device Agents**|Chenyang Shao et.al.|[2502.04392v1](http://arxiv.org/abs/2502.04392v1)|null|
 |**2025-02-06**|**Boosting Knowledge Graph-based Recommendations through Confidence-Aware Augmentation with Large Language Models**|Rui Cai et.al.|[2502.03715v1](http://arxiv.org/abs/2502.03715v1)|null|
 |**2025-02-05**|**A Schema-Guided Reason-while-Retrieve framework for Reasoning on Scene Graphs with Large-Language-Models (LLMs)**|Yiye Chen et.al.|[2502.03450v1](http://arxiv.org/abs/2502.03450v1)|null|
-|**2025-02-05**|**SymAgent: A Neural-Symbolic Self-Learning Agent Framework for Complex Reasoning over Knowledge Graphs**|Ben Liu et.al.|[2502.03283v1](http://arxiv.org/abs/2502.03283v1)|null|
+|**2025-02-05**|**SymAgent: A Neural-Symbolic Self-Learning Agent Framework for Complex Reasoning over Knowledge Graphs**|Ben Liu et.al.|[2502.03283v2](http://arxiv.org/abs/2502.03283v2)|null|
 |**2025-02-05**|**Analyze Feature Flow to Enhance Interpretation and Steering in Language Models**|Daniil Laptev et.al.|[2502.03032v2](http://arxiv.org/abs/2502.03032v2)|null|
 |**2025-02-05**|**A Benchmark for the Detection of Metalinguistic Disagreements between LLMs and Knowledge Graphs**|Bradley P. Allen et.al.|[2502.02896v1](http://arxiv.org/abs/2502.02896v1)|null|
 |**2025-02-05**|**Mol-LLM: Generalist Molecular LLM with Improved Graph Utilization**|Chanhui Lee et.al.|[2502.02810v1](http://arxiv.org/abs/2502.02810v1)|null|
@@ -7871,14 +5397,163 @@ at:https://github.com/ShuaiLyu0110/SQL-o1.
 |**2025-01-29**|**Query-Aware Learnable Graph Pooling Tokens as Prompt for Large Language Models**|Wooyoung Kim et.al.|[2501.17549v1](http://arxiv.org/abs/2501.17549v1)|null|
 |**2025-01-29**|**General Scene Adaptation for Vision-and-Language Navigation**|Haodong Hong et.al.|[2501.17403v1](http://arxiv.org/abs/2501.17403v1)|[link](https://github.com/honghd16/gsa-vln)|
 |**2025-01-28**|**Comprehensive Evaluation for a Large Scale Knowledge Graph Question Answering Service**|Saloni Potdar et.al.|[2501.17270v1](http://arxiv.org/abs/2501.17270v1)|null|
-|**2025-01-28**|**FactCG: Enhancing Fact Checkers with Graph-Based Multi-Hop Data**|Deren Lei et.al.|[2501.17144v1](http://arxiv.org/abs/2501.17144v1)|[link](https://github.com/derenlei/factcg)|
-|**2025-01-28**|**LLM-AutoDiff: Auto-Differentiate Any LLM Workflow**|Li Yin et.al.|[2501.16673v2](http://arxiv.org/abs/2501.16673v2)|[link](https://github.com/sylphai-inc/adalflow)|
-|**2025-01-27**|**360Brew: A Decoder-only Foundation Model for Personalized Ranking and Recommendation**|Hamed Firooz et.al.|[2501.16450v3](http://arxiv.org/abs/2501.16450v3)|null|
-|**2025-01-27**|**Raiders of the Lost Dependency: Fixing Dependency Conflicts in Python using LLMs**|Antony Bartlett et.al.|[2501.16191v1](http://arxiv.org/abs/2501.16191v1)|null|
-|**2025-01-27**|**Harnessing Diverse Perspectives: A Multi-Agent Framework for Enhanced Error Detection in Knowledge Graphs**|Yu Li et.al.|[2501.15791v1](http://arxiv.org/abs/2501.15791v1)|[link](https://github.com/kse-eleven/makged)|
-|**2025-01-27**|**Automatic Feedback Generation for Short Answer Questions using Answer Diagnostic Graphs**|Momoka Furuhashi et.al.|[2501.15777v1](http://arxiv.org/abs/2501.15777v1)|null|
 
 #### Abstracts
+##### **Learning to Defer for Causal Discovery with Imperfect Experts**
+2502.13132v1 by Oscar Clivio, Divyat Mahajan, Perouz Taslakian, Sara Magliacane, Ioannis Mitliagkas, Valentina Zantedeschi, Alexandre Drouin
+
+Integrating expert knowledge, e.g. from large language models, into causal
+discovery algorithms can be challenging when the knowledge is not guaranteed to
+be correct. Expert recommendations may contradict data-driven results, and
+their reliability can vary significantly depending on the domain or specific
+query. Existing methods based on soft constraints or inconsistencies in
+predicted causal relationships fail to account for these variations in
+expertise. To remedy this, we propose L2D-CD, a method for gauging the
+correctness of expert recommendations and optimally combining them with
+data-driven causal discovery results. By adapting learning-to-defer (L2D)
+algorithms for pairwise causal discovery (CD), we learn a deferral function
+that selects whether to rely on classical causal discovery methods using
+numerical data or expert recommendations based on textual meta-data. We
+evaluate L2D-CD on the canonical Tübingen pairs dataset and demonstrate its
+superior performance compared to both the causal discovery method and the
+expert used in isolation. Moreover, our approach identifies domains where the
+expert's performance is strong or weak. Finally, we outline a strategy for
+generalizing this approach to causal discovery on graphs with more than two
+variables, paving the way for further research in this area.
+
+摘要：整合專家知識，例如從大型語言模型中整合到因果發現演算法中，當知識無法保證正確時會很有挑戰性。專家建議可能會與資料驅動的結果相矛盾，而且他們的可靠性可能會根據領域或特定查詢而有顯著差異。現有的基於軟約束或預測因果關係中不一致的方法無法說明專業知識中的這些變化。為了補救這一點，我們提出了 L2D-CD，一種用於評估專家建議的正確性並將其與資料驅動的因果發現結果最佳結合的方法。透過調整學習延遲 (L2D) 演算法以進行成對因果發現 (CD)，我們學習了一個延遲函數，用於選擇依賴使用數值資料的傳統因果發現方法或基於文字元資料的專家建議。我們在經典的 Tübingen 對資料集上評估 L2D-CD，並證明其與單獨使用的因果發現方法和專家相比具有優越的效能。此外，我們的做法識別出專家表現強或弱的領域。最後，我們概述了一種將此方法推廣到具有兩個以上變數的圖表上進行因果發現的策略，為此領域的進一步研究鋪平了道路。
+
+##### **Agentic Deep Graph Reasoning Yields Self-Organizing Knowledge Networks**
+2502.13025v1 by Markus J. Buehler
+
+We present an agentic, autonomous graph expansion framework that iteratively
+structures and refines knowledge in situ. Unlike conventional knowledge graph
+construction methods relying on static extraction or single-pass learning, our
+approach couples a reasoning-native large language model with a continually
+updated graph representation. At each step, the system actively generates new
+concepts and relationships, merges them into a global graph, and formulates
+subsequent prompts based on its evolving structure. Through this
+feedback-driven loop, the model organizes information into a scale-free network
+characterized by hub formation, stable modularity, and bridging nodes that link
+disparate knowledge clusters. Over hundreds of iterations, new nodes and edges
+continue to appear without saturating, while centrality measures and shortest
+path distributions evolve to yield increasingly distributed connectivity. Our
+analysis reveals emergent patterns, such as the rise of highly connected 'hub'
+concepts and the shifting influence of 'bridge' nodes, indicating that agentic,
+self-reinforcing graph construction can yield open-ended, coherent knowledge
+structures. Applied to materials design problems, we present compositional
+reasoning experiments by extracting node-specific and synergy-level principles
+to foster genuinely novel knowledge synthesis, yielding cross-domain ideas that
+transcend rote summarization and strengthen the framework's potential for
+open-ended scientific discovery. We discuss other applications in scientific
+discovery and outline future directions for enhancing scalability and
+interpretability.
+
+摘要：我們提出一個能動的、自主的圖形擴展框架，它反覆地建構和精煉原位知識。與依賴靜態提取或單次學習的傳統知識圖形建構方法不同，我們的做法將一個推理原生的大語言模型與一個持續更新的圖形表示結合起來。在每一步中，系統主動產生新的概念和關係，將它們合併到一個全域圖形中，並根據其不斷演化的結構制定後續提示。透過這個回饋驅動的迴圈，模型將資訊組織成一個無標度網路，其特徵是樞紐形成、穩定的模組化以及連結不同知識群集的橋接節點。在數百次反覆運算中，新的節點和邊緣會持續出現，而不會飽和，同時中心性測量和最短路徑分佈會演化為產生越來越分散的連通性。我們的分析揭示了新興模式，例如高度連接的「樞紐」概念的興起和「橋樑」節點影響力的轉移，這表明能動的、自我強化的圖形建構可以產生開放式、連貫的知識結構。應用於材料設計問題，我們提出組合推理實驗，透過提取特定於節點的原則和協同效應層級原則，以促進真正新穎的知識綜合，產生超越死背式摘要並強化框架在開放式科學發現中潛力的跨領域想法。我們討論了在科學發現中的其他應用，並概述了增強可擴充性和可解釋性的未來方向。
+
+##### **Adaptive Knowledge Graphs Enhance Medical Question Answering: Bridging the Gap Between LLMs and Evolving Medical Knowledge**
+2502.13010v1 by Mohammad Reza Rezaei, Reza Saadati Fard, Jayson Parker, Rahul G. Krishnan, Milad Lankarany
+
+Large Language Models (LLMs) have significantly advanced medical
+question-answering by leveraging extensive clinical data and medical
+literature. However, the rapid evolution of medical knowledge and the
+labor-intensive process of manually updating domain-specific resources pose
+challenges to the reliability of these systems. To address this, we introduce
+Adaptive Medical Graph-RAG (AMG-RAG), a comprehensive framework that automates
+the construction and continuous updating of medical knowledge graphs,
+integrates reasoning, and retrieves current external evidence, such as PubMed
+and WikiSearch. By dynamically linking new findings and complex medical
+concepts, AMG-RAG not only improves accuracy but also enhances interpretability
+in medical queries.
+  Evaluations on the MEDQA and MEDMCQA benchmarks demonstrate the effectiveness
+of AMG-RAG, achieving an F1 score of 74.1 percent on MEDQA and an accuracy of
+66.34 percent on MEDMCQA, outperforming both comparable models and those 10 to
+100 times larger. Notably, these improvements are achieved without increasing
+computational overhead, highlighting the critical role of automated knowledge
+graph generation and external evidence retrieval in delivering up-to-date,
+trustworthy medical insights.
+
+摘要：大型語言模型 (LLM) 透過利用廣泛的臨床資料和醫學文獻，大幅提升了醫療問題解答的進步。然而，醫療知識的快速演進和手動更新特定領域資源的繁複程序，對這些系統的可靠性構成挑戰。為了解決這個問題，我們引入了適應性醫療圖表 RAG (AMG-RAG)，這是一個自動化建構和持續更新醫療知識圖表的綜合架構，整合推理並擷取 PubMed 和 WikiSearch 等最新的外部證據。透過動態連結新的發現和複雜的醫療概念，AMG-RAG 不僅提升了準確性，也增強了醫療查詢的可解釋性。在 MEDQA 和 MEDMCQA 基準上的評量證明了 AMG-RAG 的有效性，在 MEDQA 上達到了 74.1% 的 F1 分數，在 MEDMCQA 上達到了 66.34% 的準確度，優於其他同類模型以及那些大 10 到 100 倍的模型。值得注意的是，這些改進是在不增加運算負擔的情況下實現的，突顯了自動化知識圖表生成和外部證據擷取在提供最新、可信賴的醫療見解中扮演的重要角色。
+
+##### **R2-KG: General-Purpose Dual-Agent Framework for Reliable Reasoning on Knowledge Graphs**
+2502.12767v1 by Sumin Jo, Junseong Choi, Jiho Kim, Edward Choi
+
+Recent studies have combined Large Language Models (LLMs) with Knowledge
+Graphs (KGs) to enhance reasoning, improving inference accuracy without
+additional training while mitigating hallucination. However, existing
+frameworks are often rigid, struggling to adapt to KG or task changes. They
+also rely heavily on powerful LLMs for reliable (i.e., trustworthy) reasoning.
+To address this, we introduce R2-KG, a plug-and-play, dual-agent framework that
+separates reasoning into two roles: an Operator (a low-capacity LLM) that
+gathers evidence and a Supervisor (a high-capacity LLM) that makes final
+judgments. This design is cost-efficient for LLM inference while still
+maintaining strong reasoning accuracy. Additionally, R2-KG employs an
+Abstention mechanism, generating answers only when sufficient evidence is
+collected from KG, which significantly enhances reliability. Experiments across
+multiple KG-based reasoning tasks show that R2-KG consistently outperforms
+baselines in both accuracy and reliability, regardless of the inherent
+capability of LLMs used as the Operator. Further experiments reveal that the
+single-agent version of R2-KG, equipped with a strict self-consistency
+strategy, achieves significantly higher-than-baseline reliability while
+reducing inference cost. However, it also leads to a higher abstention rate in
+complex KGs. Our findings establish R2-KG as a flexible and cost-effective
+solution for KG-based reasoning. It reduces reliance on high-capacity LLMs
+while ensuring trustworthy inference.
+
+摘要：最近的研究结合了大型语言模型 (LLM) 与知识图谱 (KG) 以增强推理，在不额外训练的情况下提高推理准确性，同时减轻幻觉。然而，现有的框架通常很僵化，难以适应知识图谱或任务的变化。它们还严重依赖强大的 LLM 来进行可靠（即值得信赖）的推理。为了解决这个问题，我们引入了 R2-KG，这是一个即插即用、双代理框架，它将推理分为两个角色：一个收集证据的操作员（低容量 LLM）和一个做出最终判断的监督员（高容量 LLM）。这种设计在 LLM 推理方面具有成本效益，同时仍保持强大的推理准确性。此外，R2-KG 采用弃权机制，仅在从知识图谱收集到足够证据时才生成答案，这显著提高了可靠性。跨多个基于知识图谱的推理任务的实验表明，R2-KG 在准确性和可靠性方面始终优于基线，而与用作操作员的 LLM 的固有能力无关。进一步的实验表明，R2-KG 的单代理版本配备了严格的自一致性策略，实现了明显高于基线的可靠性，同时降低了推理成本。然而，它也导致了复杂知识图谱中更高的弃权率。我们的发现将 R2-KG 确立为一种灵活且经济高效的基于知识图谱的推理解决方案。它减少了对高容量 LLM 的依赖，同时确保了可信的推理。
+
+##### **Perovskite-LLM: Knowledge-Enhanced Large Language Models for Perovskite Solar Cell Research**
+2502.12669v1 by Xiang Liu, Penglei Sun, Shuyan Chen, Longhan Zhang, Peijie Dong, Huajie You, Yongqi Zhang, Chang Yan, Xiaowen Chu, Tong-yi Zhang
+
+The rapid advancement of perovskite solar cells (PSCs) has led to an
+exponential growth in research publications, creating an urgent need for
+efficient knowledge management and reasoning systems in this domain. We present
+a comprehensive knowledge-enhanced system for PSCs that integrates three key
+components. First, we develop Perovskite-KG, a domain-specific knowledge graph
+constructed from 1,517 research papers, containing 23,789 entities and 22,272
+relationships. Second, we create two complementary datasets: Perovskite-Chat,
+comprising 55,101 high-quality question-answer pairs generated through a novel
+multi-agent framework, and Perovskite-Reasoning, containing 2,217 carefully
+curated materials science problems. Third, we introduce two specialized large
+language models: Perovskite-Chat-LLM for domain-specific knowledge assistance
+and Perovskite-Reasoning-LLM for scientific reasoning tasks. Experimental
+results demonstrate that our system significantly outperforms existing models
+in both domain-specific knowledge retrieval and scientific reasoning tasks,
+providing researchers with effective tools for literature review, experimental
+design, and complex problem-solving in PSC research.
+
+摘要：由於 perovskite 太陽能電池 (PSC) 快速進展，導致研究出版物呈指數成長，迫切需要在這領域建立有效的知識管理和推理系統。我們提出一個結合三項關鍵元件的 PSC 全面知識增強系統。首先，我們開發出 Perovskite-KG，一個由 1,517 篇研究論文建構而成、包含 23,789 個實體和 22,272 個關係的領域特定知識圖譜。其次，我們建立兩個互補的資料集：Perovskite-Chat，包含透過一個新穎的多代理架構產生 55,101 個高品質問答配對；以及 Perovskite-Reasoning，包含 2,217 個仔細策展的材料科學問題。第三，我們推出兩個專門化大型語言模型：針對領域特定知識協助的 Perovskite-Chat-LLM，以及針對科學推理任務的 Perovskite-Reasoning-LLM。實驗結果顯示，我們的系統在領域特定知識擷取和科學推理任務上都明顯優於現有模型，為研究人員提供有效的工具，用於 PSC 研究中的文獻回顧、實驗設計和複雜問題解決。
+
+##### **G-Refer: Graph Retrieval-Augmented Large Language Model for Explainable Recommendation**
+2502.12586v1 by Yuhan Li, Xinni Zhang, Linhao Luo, Heng Chang, Yuxiang Ren, Irwin King, Jia Li
+
+Explainable recommendation has demonstrated significant advantages in
+informing users about the logic behind recommendations, thereby increasing
+system transparency, effectiveness, and trustworthiness. To provide
+personalized and interpretable explanations, existing works often combine the
+generation capabilities of large language models (LLMs) with collaborative
+filtering (CF) information. CF information extracted from the user-item
+interaction graph captures the user behaviors and preferences, which is crucial
+for providing informative explanations. However, due to the complexity of graph
+structure, effectively extracting the CF information from graphs still remains
+a challenge. Moreover, existing methods often struggle with the integration of
+extracted CF information with LLMs due to its implicit representation and the
+modality gap between graph structures and natural language explanations. To
+address these challenges, we propose G-Refer, a framework using graph
+retrieval-augmented large language models (LLMs) for explainable
+recommendation. Specifically, we first employ a hybrid graph retrieval
+mechanism to retrieve explicit CF signals from both structural and semantic
+perspectives. The retrieved CF information is explicitly formulated as
+human-understandable text by the proposed graph translation and accounts for
+the explanations generated by LLMs. To bridge the modality gap, we introduce
+knowledge pruning and retrieval-augmented fine-tuning to enhance the ability of
+LLMs to process and utilize the retrieved CF information to generate
+explanations. Extensive experiments show that G-Refer achieves superior
+performance compared with existing methods in both explainability and
+stability. Codes and data are available at https://github.com/Yuhan1i/G-Refer.
+
+摘要：可解釋建議已證明在告知使用者建議背後的邏輯方面具有顯著優點，從而提高系統透明度、有效性和可信度。為了提供個人化且可解釋的說明，現有作品通常結合大型語言模型 (LLM) 的生成能力與協同過濾 (CF) 資訊。從使用者項目互動圖形中提取的 CF 資訊會擷取使用者行為和偏好，這對於提供資訊性說明至關重要。然而，由於圖形結構的複雜性，從圖形中有效提取 CF 資訊仍然是一個挑戰。此外，現有方法通常難以將提取的 CF 資訊與 LLM 整合，因為其隱含表示和圖形結構與自然語言說明之間的模式差距。為了應對這些挑戰，我們提出 G-Refer，一個使用圖形檢索增強型大型語言模型 (LLM) 的可解釋建議架構。具體來說，我們首先採用混合圖形檢索機制，從結構和語義角度檢索明確的 CF 訊號。檢索到的 CF 資訊由建議的圖形翻譯明確表述為人類可以理解的文字，並說明 LLM 生成的解釋。為了彌合模式差距，我們引入了知識修剪和檢索增強微調，以增強 LLM 處理和利用檢索到的 CF 資訊以產生解釋的能力。廣泛的實驗表明，與現有方法相比，G-Refer 在可解釋性和穩定性方面都取得了卓越的效能。程式碼和資料可在 https://github.com/Yuhan1i/G-Refer 取得。
+
 ##### **A-MEM: Agentic Memory for LLM Agents**
 2502.12110v1 by Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, Yongfeng Zhang
 
@@ -9399,7 +7074,7 @@ absence of agent-level demonstrations. Project code will be released.
 摘要：場景圖表已成為大型語言模型 (LLM) 以基礎空間推理為基礎的結構化且可序列化的環境表徵。在這項工作中，我們提出 SG-RwR，一個以綱要為導向的檢索與推理框架，用於場景圖表的推理和規劃。我們的做法採用了兩個協作的、編寫程式碼的 LLM 代理：一個 (1) 推論器，用於任務規劃和資訊查詢產生，以及一個 (2) 檢索器，用於根據查詢提取對應的圖形資訊。兩個代理反覆合作，實現對圖形資訊的順序推理和適應性關注。與先前的作品不同，兩個代理僅提示場景圖表綱要，而不是完整的圖形資料，這透過限制輸入代碼減少了幻覺，並驅使推論器抽象地產生推理軌跡。根據軌跡，檢索器根據綱要理解以程式化方式查詢場景圖形資料，允許對圖形進行動態和整體關注，增強推理和檢索之間的一致性。透過在多個模擬環境中的實驗，我們表明我們的框架在數值問答和規劃任務中超越了現有的基於 LLM 的方法，並且可以受益於任務級別的少次範例，即使在沒有代理級別示範的情況下也是如此。專案程式碼將會釋出。
 
 ##### **SymAgent: A Neural-Symbolic Self-Learning Agent Framework for Complex Reasoning over Knowledge Graphs**
-2502.03283v1 by Ben Liu, Jihai Zhang, Fangquan Lin, Cheng Yang, Min Peng, Wotao Yin
+2502.03283v2 by Ben Liu, Jihai Zhang, Fangquan Lin, Cheng Yang, Min Peng, Wotao Yin
 
 Recent advancements have highlighted that Large Language Models (LLMs) are
 prone to hallucinations when solving complex reasoning problems, leading to
@@ -9426,7 +7101,7 @@ better or comparable performance compared to various strong baselines. Further
 analysis reveals that our agent can identify missing triples, facilitating
 automatic KG updates.
 
-摘要：<paragraph>最近的研究表明，大型语言模型 (LLM) 在解决复杂的推理问题时容易出现幻觉，从而导致错误的结果。为了解决这个问题，研究人员结合了知识图谱 (KG) 来提高 LLM 的推理能力。然而，现有方法面临两个局限性：1) 它们通常假设问题的答案都包含在 KG 中，忽略了 KG 不完整的问题，2) 它们将 KG 视为一个静态存储库，而忽略了 KG 中固有的隐式逻辑推理结构。在本文中，我们介绍了 SymAgent，这是一个创新的神经符号代理框架，可以在 KG 和 LLM 之间实现协作增强。我们将 KG 概念化为动态环境，并将复杂的推理任务转化为一个多步骤的交互过程，使 KG 能够深入参与推理过程。SymAgent 由两个模块组成：Agent-Planner 和 Agent-Executor。Agent-Planner 利用 LLM 的归纳推理能力从 KG 中提取符号规则，指导高效的问题分解。Agent-Executor 自主调用预定义的动作工具来整合来自 KG 和外部文档的信息，解决 KG 不完整的问题。此外，我们设计了一个自学习框架，包括在线探索和离线迭代策略更新阶段，使代理能够自动合成推理轨迹并提高性能。实验结果表明，具有弱 LLM 主干的 SymAgent（即 7B 系列）与各种强大的基线相比，产生了更好或相当的性能。进一步的分析表明，我们的代理可以识别缺失的三元组，促进自动 KG 更新。</paragraph>
+摘要：最近的進展強調出，大型語言模型 (LLM) 在解決複雜推理問題時容易出現幻覺，導致錯誤的結果。為了解決這個問題，研究人員結合知識圖譜 (KG) 來改善 LLM 的推理能力。然而，現有方法面臨兩個限制：1) 它們通常假設問題的所有答案都包含在 KG 中，忽略了 KG 的不完整性問題，以及 2) 它們將 KG 視為一個靜態儲存庫，而忽略了 KG 中固有的隱式邏輯推理結構。在本文中，我們介紹了 SymAgent，一個創新的神經符號代理架構，它在 KG 和 LLM 之間實現了協作擴充。我們將 KG 概念化為動態環境，並將複雜的推理任務轉化為一個多步驟的互動過程，使 KG 能夠深入參與推理過程。SymAgent 包含兩個模組：代理規劃器和代理執行器。代理規劃器利用 LLM 的歸納推理能力從 KG 中提取符號規則，指導有效的問題分解。代理執行器自主地調用預定義的動作工具來整合來自 KG 和外部文件的資訊，解決 KG 不完整性的問題。此外，我們設計了一個自學習框架，包括線上探索和離線反覆的政策更新階段，使代理能夠自動合成推理軌跡並改善效能。實驗結果表明，具有弱 LLM 主幹的 SymAgent（例如，7B 系列）與各種強大的基線相比，產生了更好或相當的效能。進一步的分析表明，我們的代理可以識別遺失的三元組，促進自動 KG 更新。
 
 ##### **Analyze Feature Flow to Enhance Interpretation and Steering in Language Models**
 2502.03032v2 by Daniil Laptev, Nikita Balagansky, Yaroslav Aksenov, Daniil Gavrilov
@@ -10122,209 +7797,2460 @@ improve a real-world KGQA system.
 
 摘要：知識圖譜問答系統 (KGQA) 根據知識圖譜中的資料回答事實問題。KGQA 系統很複雜，因為系統必須理解知識尋求自然語言查詢中的關係和實體，並將它們對映到針對知識圖譜的結構化查詢，才能回答這些查詢。在本文中，我們介紹了 Chronos，這是一個用於產業規模 KGQA 的全面評估框架。它旨在全面評估這種多組件系統，重點關注：(1) 端對端和組件層級指標，(2) 可擴充至各種資料集，以及 (3) 可擴充的方法，用於在釋出前衡量系統的效能。在本文中，我們討論了與產業規模 KGQA 系統評估相關的獨特挑戰，檢視 Chronos 的設計，以及它如何應對這些挑戰。我們將展示它如何提供資料驅動決策的基礎，並討論使用它來衡量和改善真實世界 KGQA 系統的挑戰。
 
-##### **FactCG: Enhancing Fact Checkers with Graph-Based Multi-Hop Data**
-2501.17144v1 by Deren Lei, Yaxi Li, Siyao Li, Mengya Hu, Rui Xu, Ken Archer, Mingyu Wang, Emily Ching, Alex Deng
-
-Prior research on training grounded factuality classification models to
-detect hallucinations in large language models (LLMs) has relied on public
-natural language inference (NLI) data and synthetic data. However, conventional
-NLI datasets are not well-suited for document-level reasoning, which is
-critical for detecting LLM hallucinations. Recent approaches to document-level
-synthetic data generation involve iteratively removing sentences from documents
-and annotating factuality using LLM-based prompts. While effective, this method
-is computationally expensive for long documents and limited by the LLM's
-capabilities. In this work, we analyze the differences between existing
-synthetic training data used in state-of-the-art models and real LLM output
-claims. Based on our findings, we propose a novel approach for synthetic data
-generation, CG2C, that leverages multi-hop reasoning on context graphs
-extracted from documents. Our fact checker model, FactCG, demonstrates improved
-performance with more connected reasoning, using the same backbone models.
-Experiments show it even outperforms GPT-4-o on the LLM-Aggrefact benchmark
-with much smaller model size.
-
-摘要：先前的研究訓練了基於事實的分類模型，以偵測大型語言模型 (LLM) 中的幻覺，依賴於公開的自然語言推論 (NLI) 資料和合成資料。然而，傳統的 NLI 資料集並不適合文件層級的推理，這對於偵測 LLM 的幻覺至關重要。最近的文件層級合成資料生成方法涉及從文件中反覆移除句子，並使用基於 LLM 的提示註解事實。雖然有效，但此方法對於長文件來說在運算上很昂貴，且受限於 LLM 的能力。在這項工作中，我們分析了現有合成訓練資料與最先進模型中使用的真實 LLM 輸出宣告之間的差異。根據我們的研究結果，我們提出了一個用於合成資料生成的創新方法 CG2C，它利用從文件中提取的內容圖表進行多跳推理。我們的查核模型 FactCG 使用相同的骨幹模型，展示了在更多連結的推理下改進的效能。實驗表明，它甚至在 LLM-Aggrefact 基準上優於 GPT-4-o，且模型大小小得多。
-
-##### **LLM-AutoDiff: Auto-Differentiate Any LLM Workflow**
-2501.16673v2 by Li Yin, Zhangyang Wang
-
-Large Language Models (LLMs) have reshaped natural language processing,
-powering applications from multi-hop retrieval and question answering to
-autonomous agent workflows. Yet, prompt engineering -- the task of crafting
-textual inputs to effectively direct LLMs -- remains difficult and
-labor-intensive, particularly for complex pipelines that combine multiple LLM
-calls with functional operations like retrieval and data formatting. We
-introduce LLM-AutoDiff: a novel framework for Automatic Prompt Engineering
-(APE) that extends textual gradient-based methods (such as Text-Grad) to
-multi-component, potentially cyclic LLM architectures. Implemented within the
-AdalFlow library, LLM-AutoDiff treats each textual input as a trainable
-parameter and uses a frozen backward engine LLM to generate feedback-akin to
-textual gradients -- that guide iterative prompt updates. Unlike prior
-single-node approaches, LLM-AutoDiff inherently accommodates functional nodes,
-preserves time-sequential behavior in repeated calls (e.g., multi-hop loops),
-and combats the "lost-in-the-middle" problem by isolating distinct sub-prompts
-(instructions, formats, or few-shot examples). It further boosts training
-efficiency by focusing on error-prone samples through selective gradient
-computation. Across diverse tasks, including single-step classification,
-multi-hop retrieval-based QA, and agent-driven pipelines, LLM-AutoDiff
-consistently outperforms existing textual gradient baselines in both accuracy
-and training cost. By unifying prompt optimization through a graph-centric
-lens, LLM-AutoDiff offers a powerful new paradigm for scaling and automating
-LLM workflows - mirroring the transformative role that automatic
-differentiation libraries have long played in neural network research.
-
-摘要：大型語言模型 (LLM) 已重塑自然語言處理，
-為從多跳檢索和問答到
-自主代理工作流程的應用提供動力。然而，提示工程 -- 編寫
-文本輸入以有效指導 LLM 的任務 -- 仍然困難且
-勞動密集，特別是對於將多個 LLM
-呼叫與檢索和數據格式化等功能操作相結合的複雜管道。我們
-介紹 LLM-AutoDiff：一個用於自動提示工程 (APE) 的新框架，它將基於文本梯度的
-方法（例如 Text-Grad）擴展到多組件、潛在循環 LLM 架構中。在
-AdalFlow 庫中實施，LLM-AutoDiff 將每個文本輸入視為一個可訓練
-參數，並使用凍結的後向引擎 LLM 生成反饋——類似於
-文本梯度——指導迭代提示更新。與先前的
-單節點方法不同，LLM-AutoDiff 本質上適應功能節點，
-在重複呼叫（例如，多跳循環）中保留時間順序行為，
-並通過隔離不同的子提示（說明、格式或少數鏡頭示例）來解決“迷失在中間”問題。它進一步提高訓練
-效率，通過選擇性梯度
-計算專注於容易出錯的樣本。在包括單步分類、
-多跳基於檢索的問答和代理驅動管道在內的各種任務中，LLM-AutoDiff
-在準確性和訓練成本方面始終優於現有的文本梯度基準。通過圖形中心化
-視角統一提示優化，LLM-AutoDiff 為擴展和自動化
-LLM 工作流程提供了一個強大的新範例——反映了自動
-微分庫在神經網絡研究中長期扮演的變革性角色。
-
-##### **360Brew: A Decoder-only Foundation Model for Personalized Ranking and Recommendation**
-2501.16450v3 by Hamed Firooz, Maziar Sanjabi, Adrian Englhardt, Aman Gupta, Ben Levine, Dre Olgiati, Gungor Polatkan, Iuliia Melnychuk, Karthik Ramgopal, Kirill Talanine, Kutta Srinivasan, Luke Simon, Natesh Sivasubramoniapillai, Necip Fazil Ayan, Qingquan Song, Samira Sriram, Souvik Ghosh, Tao Song, Tejas Dharamsi, Vignesh Kothapalli, Xiaoling Zhai, Ya Xu, Yu Wang, Yun Dai
-
-Ranking and recommendation systems are the foundation for numerous online
-experiences, ranging from search results to personalized content delivery.
-These systems have evolved into complex, multilayered architectures that
-leverage vast datasets and often incorporate thousands of predictive models.
-The maintenance and enhancement of these models is a labor intensive process
-that requires extensive feature engineering. This approach not only exacerbates
-technical debt but also hampers innovation in extending these systems to
-emerging problem domains. In this report, we present our research to address
-these challenges by utilizing a large foundation model with a textual interface
-for ranking and recommendation tasks. We illustrate several key advantages of
-our approach: (1) a single model can manage multiple predictive tasks involved
-in ranking and recommendation, (2) decoder models with textual interface due to
-their comprehension of reasoning capabilities, can generalize to new
-recommendation surfaces and out-of-domain problems, and (3) by employing
-natural language interfaces for task definitions and verbalizing member
-behaviors and their social connections, we eliminate the need for feature
-engineering and the maintenance of complex directed acyclic graphs of model
-dependencies. We introduce our research pre-production model, 360Brew V1.0, a
-150B parameter, decoder-only model that has been trained and fine-tuned on
-LinkedIn's data and tasks. This model is capable of solving over 30 predictive
-tasks across various segments of the LinkedIn platform, achieving performance
-levels comparable to or exceeding those of current production systems based on
-offline metrics, without task-specific fine-tuning. Notably, each of these
-tasks is conventionally addressed by dedicated models that have been developed
-and maintained over multiple years by teams of a similar or larger size than
-our own.
-
-摘要：排名和推薦系統是許多線上體驗的基礎，從搜尋結果到個人化內容傳遞。
-這些系統已演變成複雜的多層架構，利用龐大的資料集，並經常納入數千個預測模型。
-這些模型的維護和增強是一個勞力密集的過程，需要廣泛的特徵工程。
-這種方法不僅加劇了技術債務，也阻礙了將這些系統擴展到新興問題領域的創新。
-在此報告中，我們提出了我們的研究，以利用具有文字介面的大型基礎模型來解決這些挑戰，以進行排名和推薦任務。
-我們說明了我們方法的幾個主要優點：(1) 單一模型可以管理排名和推薦中涉及的多個預測任務，(2) 由於解碼器模型具有文字介面，因此它們對推理能力的理解，可以推廣到新的推薦表面和領域外問題，以及 (3) 通過採用自然語言介面進行任務定義和表達成員行為及其社交連接，我們消除了對特徵工程和維護複雜的模型相依性有向無環圖的需求。
-我們介紹了我們的研究前製作業模型 360Brew V1.0，這是一個 150B 參數，僅解碼器模型，已在 LinkedIn 的資料和任務上進行訓練和微調。
-此模型能夠解決 LinkedIn 平臺各個區塊中超過 30 個預測任務，在不針對任務進行微調的情況下，達到與基於離線指標的現行製作系統相當或超越的效能水準。
-值得注意的是，這些任務中的每個任務通常由專用模型處理，這些模型是由與我們規模相當或更大的團隊在多年間開發和維護的。
-
-##### **Raiders of the Lost Dependency: Fixing Dependency Conflicts in Python using LLMs**
-2501.16191v1 by Antony Bartlett, Cynthia Liem, Annibale Panichella
-
-Fixing Python dependency issues is a tedious and error-prone task for
-developers, who must manually identify and resolve environment dependencies and
-version constraints of third-party modules and Python interpreters. Researchers
-have attempted to automate this process by relying on large knowledge graphs
-and database lookup tables. However, these traditional approaches face
-limitations due to the variety of dependency error types, large sets of
-possible module versions, and conflicts among transitive dependencies. This
-study explores the potential of using large language models (LLMs) to
-automatically fix dependency issues in Python programs. We introduce PLLM
-(pronounced "plum"), a novel technique that employs retrieval-augmented
-generation (RAG) to help an LLM infer Python versions and required modules for
-a given Python file. PLLM builds a testing environment that iteratively (1)
-prompts the LLM for module combinations, (2) tests the suggested changes, and
-(3) provides feedback (error messages) to the LLM to refine the fix. This
-feedback cycle leverages natural language processing (NLP) to intelligently
-parse and interpret build error messages. We benchmark PLLM on the Gistable
-HG2.9K dataset, a collection of challenging single-file Python gists. We
-compare PLLM against two state-of-the-art automatic dependency inference
-approaches, namely PyEGo and ReadPyE, w.r.t. the ability to resolve dependency
-issues. Our results indicate that PLLM can fix more dependency issues than the
-two baselines, with +218 (+15.97%) more fixes over ReadPyE and +281 (+21.58%)
-over PyEGo. Our deeper analyses suggest that PLLM is particularly beneficial
-for projects with many dependencies and for specific third-party numerical and
-machine-learning modules. Our findings demonstrate the potential of LLM-based
-approaches to iteratively resolve Python dependency issues.
-
-摘要：修復 Python 依賴項問題對開發人員來說是一項繁瑣且容易出錯的任務，他們必須手動識別和解決第三方模組和 Python 解譯器的環境依賴項和版本限制。研究人員已嘗試透過依賴大型知識圖譜和資料庫查詢表來自動化此程序。然而，這些傳統方法由於依賴項錯誤類型多樣、可能的模組版本數量龐大，以及傳遞依賴項之間的衝突，而面臨限制。本研究探討使用大型語言模型 (LLM) 自動修復 Python 程式中的依賴項問題的可能性。我們介紹 PLLM（發音為「plum」），這是一種新穎的技術，採用檢索增強生成 (RAG) 來協助 LLM 推論 Python 版本和給定 Python 檔案所需的模組。PLLM 建立一個測試環境，反覆 (1) 提示 LLM 模組組合，(2) 測試建議的變更，以及 (3) 提供回饋（錯誤訊息）給 LLM 以改善修正。此回饋循環利用自然語言處理 (NLP) 來智慧解析和詮釋建置錯誤訊息。我們在 Gistable HG2.9K 資料集上對 PLLM 進行基準測試，該資料集是一個具有挑戰性的單一檔案 Python gist 集合。我們將 PLLM 與兩種最先進的自動依賴項推論方法進行比較，即 PyEGo 和 ReadPyE，以比較解決依賴項問題的能力。我們的結果顯示，PLLM 可以修復比這兩個基準更多的依賴項問題，比 ReadPyE 多修復了 +218 (+15.97%) 個，比 PyEGo 多修復了 +281 (+21.58%) 個。我們更深入的分析表明，PLLM 對具有許多依賴項的專案以及特定第三方數值和機器學習模組特別有益。我們的研究結果證明了基於 LLM 的方法反覆解決 Python 依賴項問題的可能性。
-
-##### **Harnessing Diverse Perspectives: A Multi-Agent Framework for Enhanced Error Detection in Knowledge Graphs**
-2501.15791v1 by Yu Li, Yi Huang, Guilin Qi, Junlan Feng, Nan Hu, Songlin Zhai, Haohan Xue, Yongrui Chen, Ruoyan Shen, Tongtong Wu
-
-Knowledge graphs are widely used in industrial applications, making error
-detection crucial for ensuring the reliability of downstream applications.
-Existing error detection methods often fail to effectively leverage
-fine-grained subgraph information and rely solely on fixed graph structures,
-while also lacking transparency in their decision-making processes, which
-results in suboptimal detection performance. In this paper, we propose a novel
-Multi-Agent framework for Knowledge Graph Error Detection (MAKGED) that
-utilizes multiple large language models (LLMs) in a collaborative setting. By
-concatenating fine-grained, bidirectional subgraph embeddings with LLM-based
-query embeddings during training, our framework integrates these
-representations to produce four specialized agents. These agents utilize
-subgraph information from different dimensions to engage in multi-round
-discussions, thereby improving error detection accuracy and ensuring a
-transparent decision-making process. Extensive experiments on FB15K and WN18RR
-demonstrate that MAKGED outperforms state-of-the-art methods, enhancing the
-accuracy and robustness of KG evaluation. For specific industrial scenarios,
-our framework can facilitate the training of specialized agents using
-domain-specific knowledge graphs for error detection, which highlights the
-potential industrial application value of our framework. Our code and datasets
-are available at https://github.com/kse-ElEvEn/MAKGED.
-
-摘要：知識圖譜廣泛應用於工業應用中，使得錯誤偵測對於確保下游應用的可靠性至關重要。現有的錯誤偵測方法通常無法有效利用細粒度的子圖資訊，並且僅依賴於固定的圖形結構，同時在它們的決策過程中也缺乏透明度，這導致次佳的偵測效能。在本文中，我們提出了一個用於知識圖譜錯誤偵測 (MAKGED) 的新多代理架構，它在協作設定中利用了多個大型語言模型 (LLM)。透過在訓練期間將細粒度、雙向子圖嵌入與基於 LLM 的查詢嵌入串接，我們的架構整合了這些表示以產生四個專門代理。這些代理利用不同維度的子圖資訊參與多輪討論，從而提高錯誤偵測準確度並確保透明的決策過程。在 FB15K 和 WN18RR 上的廣泛實驗表明，MAKGED 優於最先進的方法，增強了 KG 評估的準確性和穩健性。對於特定產業情境，我們的架構可以利用特定領域的知識圖譜來促進專門代理的訓練以進行錯誤偵測，這突顯了我們架構的潛在產業應用價值。我們的程式碼和資料集可在 https://github.com/kse-ElEvEn/MAKGED 取得。
-
-##### **Automatic Feedback Generation for Short Answer Questions using Answer Diagnostic Graphs**
-2501.15777v1 by Momoka Furuhashi, Hiroaki Funayama, Yuya Iwase, Yuichiroh Matsubayashi, Yoriko Isobe, Toru Nagahama, Saku Sugawara, Kentaro Inui
-
-Short-reading comprehension questions help students understand text structure
-but lack effective feedback. Students struggle to identify and correct errors,
-while manual feedback creation is labor-intensive. This highlights the need for
-automated feedback linking responses to a scoring rubric for deeper
-comprehension.
-  Despite advances in Natural Language Processing (NLP), research has focused
-on automatic grading, with limited work on feedback generation. To address
-this, we propose a system that generates feedback for student responses.
-  Our contributions are twofold. First, we introduce the first system for
-feedback on short-answer reading comprehension. These answers are derived from
-the text, requiring structural understanding. We propose an "answer diagnosis
-graph," integrating the text's logical structure with feedback templates. Using
-this graph and NLP techniques, we estimate students' comprehension and generate
-targeted feedback.
-  Second, we evaluate our feedback through an experiment with Japanese high
-school students (n=39). They answered two 70-80 word questions and were divided
-into two groups with minimal academic differences. One received a model answer,
-the other system-generated feedback. Both re-answered the questions, and we
-compared score changes. A questionnaire assessed perceptions and motivation.
-  Results showed no significant score improvement between groups, but
-system-generated feedback helped students identify errors and key points in the
-text. It also significantly increased motivation. However, further refinement
-is needed to enhance text structure understanding.
-
-摘要：短篇閱讀理解題目有助學生理解文章結構，但缺乏有效的回饋。學生難以找出並更正錯誤，而手動建立回饋又很費力。這突顯了自動化回饋的必要性，將回應連結到評分標準，以獲得更深入的理解。
-
-儘管自然語言處理 (NLP) 有所進展，但研究一直集中在自動評分上，而回饋生成的工作有限。為了解決這個問題，我們提出了一個系統，用於為學生的回答產生回饋。
-
-我們的貢獻有兩個方面。首先，我們引入了第一個針對簡答閱讀理解提供回饋的系統。這些答案來自於文本，需要結構化的理解。我們提出了一個「答案診斷圖」，將文本的邏輯結構與回饋範本整合在一起。使用這個圖表和 NLP 技術，我們估計學生的理解力並產生有針對性的回饋。
-
-其次，我們透過一項針對日本高中生的實驗（n=39）來評估我們的回饋。他們回答了兩個 70-80 字的問題，並被分成兩組，學術差異最小。一組收到範本答案，另一組收到系統產生的回饋。兩組都重新回答了問題，我們比較了分數的變化。一份問卷評估了認知和動機。
-
-結果顯示兩組之間沒有顯著的分數進步，但系統產生的回饋有助於學生找出文本中的錯誤和重點。它也顯著地提高了動機。然而，需要進一步的改進來增強對文本結構的理解。
+
+### LLM
+|Publish Date|Title|Authors|Homepage|Code|
+| :---: | :---: | :---: | :---: | :---: |
+|**2025-02-18**|**SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation**|Zekun Qi et.al.|[2502.13143v1](http://arxiv.org/abs/2502.13143v1)|null|
+|**2025-02-18**|**Pre-training Auto-regressive Robotic Models with 4D Representations**|Dantong Niu et.al.|[2502.13142v1](http://arxiv.org/abs/2502.13142v1)|null|
+|**2025-02-18**|**UniGuardian: A Unified Defense for Detecting Prompt Injection, Backdoor Attacks and Adversarial Attacks in Large Language Models**|Huawei Lin et.al.|[2502.13141v1](http://arxiv.org/abs/2502.13141v1)|null|
+|**2025-02-18**|**AIDE: AI-Driven Exploration in the Space of Code**|Zhengyao Jiang et.al.|[2502.13138v1](http://arxiv.org/abs/2502.13138v1)|null|
+|**2025-02-18**|**Theorem Prover as a Judge for Synthetic Data Generation**|Joshua Ong Jun Leang et.al.|[2502.13137v1](http://arxiv.org/abs/2502.13137v1)|null|
+|**2025-02-18**|**Sleepless Nights, Sugary Days: Creating Synthetic Users with Health Conditions for Realistic Coaching Agent Interactions**|Taedong Yun et.al.|[2502.13135v1](http://arxiv.org/abs/2502.13135v1)|null|
+|**2025-02-18**|**Learning to Defer for Causal Discovery with Imperfect Experts**|Oscar Clivio et.al.|[2502.13132v1](http://arxiv.org/abs/2502.13132v1)|null|
+|**2025-02-18**|**Rethinking Diverse Human Preference Learning through Principal Component Analysis**|Feng Luo et.al.|[2502.13131v1](http://arxiv.org/abs/2502.13131v1)|null|
+|**2025-02-18**|**Magma: A Foundation Model for Multimodal AI Agents**|Jianwei Yang et.al.|[2502.13130v1](http://arxiv.org/abs/2502.13130v1)|null|
+|**2025-02-18**|**SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation**|Zihan Liu et.al.|[2502.13128v1](http://arxiv.org/abs/2502.13128v1)|null|
+|**2025-02-18**|**Facilitating Long Context Understanding via Supervised Chain-of-Thought Reasoning**|Jingyang Lin et.al.|[2502.13127v1](http://arxiv.org/abs/2502.13127v1)|null|
+|**2025-02-18**|**RuozhiBench: Evaluating LLMs with Logical Fallacies and Misleading Premises**|Zenan Zhai et.al.|[2502.13125v1](http://arxiv.org/abs/2502.13125v1)|null|
+|**2025-02-18**|**NaturalReasoning: Reasoning in the Wild with 2.8M Challenging Questions**|Weizhe Yuan et.al.|[2502.13124v1](http://arxiv.org/abs/2502.13124v1)|null|
+|**2025-02-18**|**Adapting Psycholinguistic Research for LLMs: Gender-inclusive Language in a Coreference Context**|Marion Bartl et.al.|[2502.13120v1](http://arxiv.org/abs/2502.13120v1)|null|
+|**2025-02-18**|**STEER-ME: Assessing the Microeconomic Reasoning of Large Language Models**|Narun Raman et.al.|[2502.13119v1](http://arxiv.org/abs/2502.13119v1)|null|
+|**2025-02-18**|**Performance Evaluation of Large Language Models in Statistical Programming**|Xinyi Song et.al.|[2502.13117v1](http://arxiv.org/abs/2502.13117v1)|null|
+|**2025-02-18**|**Near-Optimal Private Learning in Linear Contextual Bandits**|Fan Chen et.al.|[2502.13115v1](http://arxiv.org/abs/2502.13115v1)|null|
+|**2025-02-18**|**The influence of motion features in temporal perception**|Rosa Illan Castillo et.al.|[2502.13114v1](http://arxiv.org/abs/2502.13114v1)|null|
+|**2025-02-18**|**Improving Clinical Question Answering with Multi-Task Learning: A Joint Approach for Answer Extraction and Medical Categorization**|Priyaranjan Pattnayak et.al.|[2502.13108v1](http://arxiv.org/abs/2502.13108v1)|null|
+|**2025-02-18**|**MatterChat: A Multi-Modal LLM for Material Science**|Yingheng Tang et.al.|[2502.13107v1](http://arxiv.org/abs/2502.13107v1)|null|
+|**2025-02-18**|**Understanding and Rectifying Safety Perception Distortion in VLMs**|Xiaohan Zou et.al.|[2502.13095v1](http://arxiv.org/abs/2502.13095v1)|null|
+|**2025-02-18**|**Text2World: Benchmarking Large Language Models for Symbolic World Model Generation**|Mengkang Hu et.al.|[2502.13092v1](http://arxiv.org/abs/2502.13092v1)|null|
+|**2025-02-18**|**KAPPA: A Generic Patent Analysis Framework with Keyphrase-Based Portraits**|Xin Xia et.al.|[2502.13076v1](http://arxiv.org/abs/2502.13076v1)|null|
+|**2025-02-18**|**Interactive Agents to Overcome Ambiguity in Software Engineering**|Sanidhya Vijayvargiya et.al.|[2502.13069v1](http://arxiv.org/abs/2502.13069v1)|null|
+|**2025-02-18**|**Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity**|Yuri Kuratov et.al.|[2502.13063v1](http://arxiv.org/abs/2502.13063v1)|null|
+|**2025-02-18**|**AI-Assisted Decision Making with Human Learning**|Gali Noti et.al.|[2502.13062v1](http://arxiv.org/abs/2502.13062v1)|null|
+|**2025-02-18**|**Improved Fine-Tuning of Large Multimodal Models for Hateful Meme Detection**|Jingbiao Mei et.al.|[2502.13061v1](http://arxiv.org/abs/2502.13061v1)|null|
+|**2025-02-18**|**SimpleVQA: Multimodal Factuality Evaluation for Multimodal Large Language Models**|Xianfu Cheng et.al.|[2502.13059v1](http://arxiv.org/abs/2502.13059v1)|null|
+|**2025-02-18**|**LAMD: Context-driven Android Malware Detection and Classification with LLMs**|Xingzhi Qian et.al.|[2502.13055v1](http://arxiv.org/abs/2502.13055v1)|null|
+|**2025-02-18**|**Do we still need Human Annotators? Prompting Large Language Models for Aspect Sentiment Quad Prediction**|Nils Constantin Hellwig et.al.|[2502.13044v1](http://arxiv.org/abs/2502.13044v1)|null|
+|**2025-02-18**|**Natural Language Generation from Visual Sequences: Challenges and Future Directions**|Aditya K Surikuchi et.al.|[2502.13034v1](http://arxiv.org/abs/2502.13034v1)|null|
+|**2025-02-18**|**HPSS: Heuristic Prompting Strategy Search for LLM Evaluators**|Bosi Wen et.al.|[2502.13031v1](http://arxiv.org/abs/2502.13031v1)|null|
+|**2025-02-18**|**Whose story is it? Personalizing story generation by inferring author styles**|Nischal Ashok Kumar et.al.|[2502.13028v1](http://arxiv.org/abs/2502.13028v1)|null|
+|**2025-02-18**|**Agentic Deep Graph Reasoning Yields Self-Organizing Knowledge Networks**|Markus J. Buehler et.al.|[2502.13025v1](http://arxiv.org/abs/2502.13025v1)|null|
+|**2025-02-18**|**Oreo: A Plug-in Context Reconstructor to Enhance Retrieval-Augmented Generation**|Sha Li et.al.|[2502.13019v1](http://arxiv.org/abs/2502.13019v1)|null|
+|**2025-02-18**|**LLM-Powered Proactive Data Systems**|Sepanta Zeighami et.al.|[2502.13016v1](http://arxiv.org/abs/2502.13016v1)|null|
+|**2025-02-18**|**Towards a Design Guideline for RPA Evaluation: A Survey of Large Language Model-Based Role-Playing Agents**|Chaoran Chen et.al.|[2502.13012v1](http://arxiv.org/abs/2502.13012v1)|null|
+|**2025-02-18**|**Adaptive Knowledge Graphs Enhance Medical Question Answering: Bridging the Gap Between LLMs and Evolving Medical Knowledge**|Mohammad Reza Rezaei et.al.|[2502.13010v1](http://arxiv.org/abs/2502.13010v1)|null|
+|**2025-02-18**|**Integrating Reinforcement Learning, Action Model Learning, and Numeric Planning for Tackling Complex Tasks**|Yarin Benyamin et.al.|[2502.13006v1](http://arxiv.org/abs/2502.13006v1)|null|
+|**2025-02-18**|**Language Barriers: Evaluating Cross-Lingual Performance of CNN and Transformer Architectures for Speech Quality Estimation**|Wafaa Wardah et.al.|[2502.13004v1](http://arxiv.org/abs/2502.13004v1)|null|
+|**2025-02-18**|**You need to MIMIC to get FAME: Solving Meeting Transcript Scarcity with a Multi-Agent Conversations**|Frederic Kirstein et.al.|[2502.13001v1](http://arxiv.org/abs/2502.13001v1)|null|
+|**2025-02-18**|**Personalized Top-k Set Queries Over Predicted Scores**|Sohrab Namazi Nia et.al.|[2502.12998v1](http://arxiv.org/abs/2502.12998v1)|null|
+|**2025-02-18**|**Eager Updates For Overlapped Communication and Computation in DiLoCo**|Satyen Kale et.al.|[2502.12996v1](http://arxiv.org/abs/2502.12996v1)|null|
+|**2025-02-18**|**Free Argumentative Exchanges for Explaining Image Classifiers**|Avinash Kori et.al.|[2502.12995v1](http://arxiv.org/abs/2502.12995v1)|null|
+|**2025-02-18**|**B-cos LM: Efficiently Transforming Pre-trained Language Models for Improved Explainability**|Yifan Wang et.al.|[2502.12992v1](http://arxiv.org/abs/2502.12992v1)|null|
+|**2025-02-18**|**Beyond Profile: From Surface-Level Facts to Deep Persona Simulation in LLMs**|Zixiao Wang et.al.|[2502.12988v1](http://arxiv.org/abs/2502.12988v1)|null|
+|**2025-02-18**|**PartSDF: Part-Based Implicit Neural Representation for Composite 3D Shape Parametrization and Optimization**|Nicolas Talabot et.al.|[2502.12985v1](http://arxiv.org/abs/2502.12985v1)|null|
+|**2025-02-18**|**Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs**|Longxu Dou et.al.|[2502.12982v1](http://arxiv.org/abs/2502.12982v1)|null|
+|**2025-02-18**|**Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking**|Junda Zhu et.al.|[2502.12970v1](http://arxiv.org/abs/2502.12970v1)|null|
+|**2025-02-18**|**A Survey of Text Classification Under Class Distribution Shift**|Adriana Valentina Costache et.al.|[2502.12965v1](http://arxiv.org/abs/2502.12965v1)|null|
+|**2025-02-18**|**Trust Me, I'm Wrong: High-Certainty Hallucinations in LLMs**|Adi Simhi et.al.|[2502.12964v1](http://arxiv.org/abs/2502.12964v1)|null|
+|**2025-02-18**|**Infinite Retrieval: Attention Enhanced LLMs in Long-Context Processing**|Xiaoju Ye et.al.|[2502.12962v1](http://arxiv.org/abs/2502.12962v1)|null|
+|**2025-02-18**|**Adaptive Tool Use in Large Language Models with Meta-Cognition Trigger**|Wenjun Li et.al.|[2502.12961v1](http://arxiv.org/abs/2502.12961v1)|null|
+|**2025-02-18**|**AlignFreeze: Navigating the Impact of Realignment on the Layers of Multilingual Models Across Diverse Languages**|Steve Bakos et.al.|[2502.12959v1](http://arxiv.org/abs/2502.12959v1)|null|
+|**2025-02-18**|**Task-Informed Anti-Curriculum by Masking Improves Downstream Performance on Text**|Andrei Jarca et.al.|[2502.12953v1](http://arxiv.org/abs/2502.12953v1)|null|
+|**2025-02-18**|**Fake It Till You Make It: Using Synthetic Data and Domain Knowledge for Improved Text-Based Learning for LGE Detection**|Athira J Jacob et.al.|[2502.12948v1](http://arxiv.org/abs/2502.12948v1)|null|
+|**2025-02-18**|**Every Expert Matters: Towards Effective Knowledge Distillation for Mixture-of-Experts Language Models**|Gyeongman Kim et.al.|[2502.12947v1](http://arxiv.org/abs/2502.12947v1)|null|
+|**2025-02-18**|**LLMPopcorn: An Empirical Study of LLMs as Assistants for Popular Micro-video Generation**|Junchen Fu et.al.|[2502.12945v1](http://arxiv.org/abs/2502.12945v1)|null|
+|**2025-02-18**|**Synthetic Data Generation for Culturally Nuanced Commonsense Reasoning in Low-Resource Languages**|Salsabila Zahirah Pranida et.al.|[2502.12932v1](http://arxiv.org/abs/2502.12932v1)|null|
+|**2025-02-18**|**Flow-of-Options: Diversified and Improved LLM Reasoning by Thinking Through Options**|Lakshmi Nair et.al.|[2502.12929v1](http://arxiv.org/abs/2502.12929v1)|null|
+|**2025-02-18**|**Finedeep: Mitigating Sparse Activation in Dense LLMs via Multi-Layer Fine-Grained Experts**|Leiyu Pan et.al.|[2502.12928v1](http://arxiv.org/abs/2502.12928v1)|null|
+|**2025-02-18**|**SEFL: Harnessing Large Language Model Agents to Improve Educational Feedback Systems**|Mike Zhang et.al.|[2502.12927v1](http://arxiv.org/abs/2502.12927v1)|null|
+|**2025-02-18**|**Towards more Contextual Agents: An extractor-Generator Optimization Framework**|Mourad Aouini et.al.|[2502.12926v1](http://arxiv.org/abs/2502.12926v1)|null|
+|**2025-02-18**|**Keep what you need : extracting efficient subnetworks from large audio representation models**|David Genova et.al.|[2502.12925v1](http://arxiv.org/abs/2502.12925v1)|null|
+|**2025-02-18**|**Conditioning LLMs to Generate Code-Switched Text: A Methodology Grounded in Naturally Occurring Data**|Maite Heredia et.al.|[2502.12924v1](http://arxiv.org/abs/2502.12924v1)|null|
+|**2025-02-18**|**On-Device LLMs for Home Assistant: Dual Role in Intent Detection and Response Generation**|Rune Birkmose et.al.|[2502.12923v1](http://arxiv.org/abs/2502.12923v1)|null|
+|**2025-02-18**|**Q-STRUM Debate: Query-Driven Contrastive Summarization for Recommendation Comparison**|George-Kirollos Saad et.al.|[2502.12921v1](http://arxiv.org/abs/2502.12921v1)|null|
+|**2025-02-18**|**GSQ-Tuning: Group-Shared Exponents Integer in Fully Quantized Training for LLMs On-Device Fine-tuning**|Sifan Zhou et.al.|[2502.12913v1](http://arxiv.org/abs/2502.12913v1)|null|
+|**2025-02-18**|**Knapsack Optimization-based Schema Linking for LLM-based Text-to-SQL Generation**|Zheng Yuan et.al.|[2502.12911v1](http://arxiv.org/abs/2502.12911v1)|null|
+|**2025-02-18**|**Graph Neural Networks for Databases: A Survey**|Ziming Li et.al.|[2502.12908v1](http://arxiv.org/abs/2502.12908v1)|null|
+|**2025-02-18**|**Fraud-R1 : A Multi-Round Benchmark for Assessing the Robustness of LLM Against Augmented Fraud and Phishing Inducements**|Shu Yang et.al.|[2502.12904v1](http://arxiv.org/abs/2502.12904v1)|null|
+|**2025-02-18**|**Soundwave: Less is More for Speech-Text Alignment in LLMs**|Yuhao Zhang et.al.|[2502.12900v1](http://arxiv.org/abs/2502.12900v1)|null|
+|**2025-02-18**|**None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks**|Eva Sánchez Salido et.al.|[2502.12896v1](http://arxiv.org/abs/2502.12896v1)|null|
+|**2025-02-18**|**Multilingual European Language Models: Benchmarking Approaches and Challenges**|Fabio Barth et.al.|[2502.12895v1](http://arxiv.org/abs/2502.12895v1)|null|
+|**2025-02-18**|**H-CoT: Hijacking the Chain-of-Thought Safety Reasoning Mechanism to Jailbreak Large Reasoning Models, Including OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking**|Martin Kuo et.al.|[2502.12893v1](http://arxiv.org/abs/2502.12893v1)|null|
+|**2025-02-18**|**Are Multilingual Language Models an Off-ramp for Under-resourced Languages? Will we arrive at Digital Language Equality in Europe in 2030?**|Georg Rehm et.al.|[2502.12886v1](http://arxiv.org/abs/2502.12886v1)|null|
+|**2025-02-18**|**How desirable is alignment between LLMs and linguistically diverse human users?**|Pia Knoeferle et.al.|[2502.12884v1](http://arxiv.org/abs/2502.12884v1)|null|
+|**2025-02-18**|**Continuous Learning Conversational AI: A Personalized Agent Framework via A2C Reinforcement Learning**|Nandakishor M et.al.|[2502.12876v1](http://arxiv.org/abs/2502.12876v1)|null|
+|**2025-02-18**|**PAFT: Prompt-Agnostic Fine-Tuning**|Chenxing Wei et.al.|[2502.12859v1](http://arxiv.org/abs/2502.12859v1)|null|
+|**2025-02-18**|**Rejected Dialects: Biases Against African American Language in Reward Models**|Joel Mire et.al.|[2502.12858v1](http://arxiv.org/abs/2502.12858v1)|null|
+|**2025-02-18**|**Integrating Arithmetic Learning Improves Mathematical Reasoning in Smaller Models**|Neeraj Gangwar et.al.|[2502.12855v1](http://arxiv.org/abs/2502.12855v1)|null|
+|**2025-02-18**|**S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning**|Ruotian Ma et.al.|[2502.12853v1](http://arxiv.org/abs/2502.12853v1)|null|
+|**2025-02-18**|**MVL-SIB: A Massively Multilingual Vision-Language Benchmark for Cross-Modal Topical Matching**|Fabian David Schmidt et.al.|[2502.12852v1](http://arxiv.org/abs/2502.12852v1)|null|
+|**2025-02-18**|**MeMo: Towards Language Models with Associative Memory Mechanisms**|Fabio Massimo Zanzotto et.al.|[2502.12851v1](http://arxiv.org/abs/2502.12851v1)|null|
+|**2025-02-18**|**Towards Adaptive Feedback with AI: Comparing the Feedback Quality of LLMs and Teachers on Experimentation Protocols**|Kathrin Seßler et.al.|[2502.12842v1](http://arxiv.org/abs/2502.12842v1)|null|
+|**2025-02-18**|**Towards Equitable AI: Detecting Bias in Using Large Language Models for Marketing**|Berk Yilmaz et.al.|[2502.12838v1](http://arxiv.org/abs/2502.12838v1)|null|
+|**2025-02-18**|**An LLM-Powered Agent for Physiological Data Analysis: A Case Study on PPG-based Heart Rate Estimation**|Mohammad Feli et.al.|[2502.12836v1](http://arxiv.org/abs/2502.12836v1)|null|
+|**2025-02-18**|**Subword models struggle with word learning, but surprisal hides it**|Bastian Bunzeck et.al.|[2502.12835v1](http://arxiv.org/abs/2502.12835v1)|null|
+|**2025-02-18**|**KazMMLU: Evaluating Language Models on Kazakh, Russian, and Regional Knowledge of Kazakhstan**|Mukhammed Togmanov et.al.|[2502.12829v1](http://arxiv.org/abs/2502.12829v1)|null|
+|**2025-02-18**|**Reasoning and the Trusting Behavior of DeepSeek and GPT: An Experiment Revealing Hidden Fault Lines in Large Language Models**|Rubing Lu et.al.|[2502.12825v1](http://arxiv.org/abs/2502.12825v1)|null|
+|**2025-02-18**|**Pitfalls of Scale: Investigating the Inverse Task of Redefinition in Large Language Models**|Elena Stringli et.al.|[2502.12821v1](http://arxiv.org/abs/2502.12821v1)|null|
+|**2025-02-18**|**Simulating User Diversity in Task-Oriented Dialogue Systems using Large Language Models**|Adnan Ahmad et.al.|[2502.12813v1](http://arxiv.org/abs/2502.12813v1)|null|
+|**2025-02-18**|**Towards Text-Image Interleaved Retrieval**|Xin Zhang et.al.|[2502.12799v1](http://arxiv.org/abs/2502.12799v1)|null|
+|**2025-02-18**|**Envious Explore and Exploit**|Omer Ben-Porat et.al.|[2502.12798v1](http://arxiv.org/abs/2502.12798v1)|null|
+|**2025-02-18**|**Commonsense Reasoning in Arab Culture**|Abdelrahman Sadallah et.al.|[2502.12788v1](http://arxiv.org/abs/2502.12788v1)|null|
+|**2025-02-18**|**VidCapBench: A Comprehensive Benchmark of Video Captioning for Controllable Text-to-Video Generation**|Xinlong Chen et.al.|[2502.12782v1](http://arxiv.org/abs/2502.12782v1)|null|
+|**2025-02-18**|**Portable Reward Tuning: Towards Reusable Fine-Tuning across Different Pretrained Models**|Daiki Chijiwa et.al.|[2502.12776v1](http://arxiv.org/abs/2502.12776v1)|null|
+|**2025-02-18**|**Mind the Gap: Aligning the Brain with Language Models Requires a Nonlinear and Multimodal Approach**|Danny Dongyeop Han et.al.|[2502.12771v1](http://arxiv.org/abs/2502.12771v1)|null|
+|**2025-02-18**|**How Much Do LLMs Hallucinate across Languages? On Multilingual Estimation of LLM Hallucination in the Wild**|Saad Obaid ul Islam et.al.|[2502.12769v1](http://arxiv.org/abs/2502.12769v1)|null|
+|**2025-02-18**|**R2-KG: General-Purpose Dual-Agent Framework for Reliable Reasoning on Knowledge Graphs**|Sumin Jo et.al.|[2502.12767v1](http://arxiv.org/abs/2502.12767v1)|null|
+
+#### Abstracts
+##### **SoFar: Language-Grounded Orientation Bridges Spatial Reasoning and Object Manipulation**
+2502.13143v1 by Zekun Qi, Wenyao Zhang, Yufei Ding, Runpei Dong, Xinqiang Yu, Jingwen Li, Lingyun Xu, Baoyu Li, Xialin He, Guofan Fan, Jiazhao Zhang, Jiawei He, Jiayuan Gu, Xin Jin, Kaisheng Ma, Zhizheng Zhang, He Wang, Li Yi
+
+Spatial intelligence is a critical component of embodied AI, enabling robots
+to understand and interact with their environments. While recent advances have
+enhanced the ability of VLMs to perceive object locations and positional
+relationships, they still lack the capability to precisely understand object
+orientations, a key requirement for tasks involving fine-grained manipulations.
+Addressing this limitation not only requires geometric reasoning but also an
+expressive and intuitive way to represent orientation. In this context, we
+propose that natural language offers a more flexible representation space than
+canonical frames, making it particularly suitable for instruction-following
+robotic systems. In this paper, we introduce the concept of semantic
+orientation, which defines object orientations using natural language in a
+reference-frame-free manner (e.g., the "plug-in" direction of a USB or the
+"handle" direction of a knife). To support this, we construct OrienText300K,
+a large-scale dataset of 3D models annotated with semantic orientations that
+link geometric understanding to functional semantics. By integrating semantic
+orientation into a VLM system, we enable robots to generate manipulation
+actions with both positional and orientational constraints. Extensive
+experiments in simulation and real world demonstrate that our approach
+significantly enhances robotic manipulation capabilities, e.g., 48.7% accuracy
+on Open6DOR and 74.9% accuracy on SIMPLER.
+
+摘要：空間智能是具象 AI 的關鍵組成部分，促使機器人了解其環境並與之互動。雖然最近的進展增強了 VLM 感知物件位置和位置關係的能力，但它們仍然缺乏精確理解物件方向的能力，這對於涉及細微操作的任務來說是一項關鍵要求。解決這個限制不僅需要幾何推理，還需要一種表達性和直觀的方式來表示方向。在此背景下，我們提出自然語言提供了一個比標準框架更靈活的表示空間，使其特別適合於遵循指令的機器人系統。在本文中，我們介紹了語義方向的概念，它使用自然語言以無參考框架的方式定義物件方向（例如，USB 的「插入」方向或刀子的「握柄」方向）。為了支持這一點，我們構建了 OrienText300K，這是一個大型 3D 模型數據集，其中註釋了語義方向，將幾何理解與功能語義聯繫起來。通過將語義方向整合到 VLM 系統中，我們使機器人能夠生成同時具有位置和方向約束的操作動作。在模擬和現實世界中進行的廣泛實驗表明，我們的做法顯著增強了機器人的操作能力，例如，Open6DOR 的準確率為 48.7%，SIMPLER 的準確率為 74.9%。
+
+##### **Pre-training Auto-regressive Robotic Models with 4D Representations**
+2502.13142v1 by Dantong Niu, Yuvan Sharma, Haoru Xue, Giscard Biamby, Junyi Zhang, Ziteng Ji, Trevor Darrell, Roei Herzig
+
+Foundation models pre-trained on massive unlabeled datasets have
+revolutionized natural language and computer vision, exhibiting remarkable
+generalization capabilities, thus highlighting the importance of pre-training.
+Yet, efforts in robotics have struggled to achieve similar success, limited by
+either the need for costly robotic annotations or the lack of representations
+that effectively model the physical world. In this paper, we introduce ARM4R,
+an Auto-regressive Robotic Model that leverages low-level 4D Representations
+learned from human video data to yield a better pre-trained robotic model.
+Specifically, we focus on utilizing 3D point tracking representations from
+videos derived by lifting 2D representations into 3D space via monocular depth
+estimation across time. These 4D representations maintain a shared geometric
+structure between the points and robot state representations up to a linear
+transformation, enabling efficient transfer learning from human video data to
+low-level robotic control. Our experiments show that ARM4R can transfer
+efficiently from human video data to robotics and consistently improves
+performance on tasks across various robot environments and configurations.
+
+摘要：預先在大量未標記資料集上訓練好的基礎模型已經徹底改變了自然語言和電腦視覺，展現出非凡的概化能力，因此突顯了預先訓練的重要性。然而，機器人領域的努力一直難以取得類似的成功，受到昂貴的機器人標註需求或缺乏有效建模物理世界的表徵的限制。在本文中，我們介紹了 ARM4R，一種自迴歸機器人模型，它利用從人類影片資料中學習到的低階 4D 表徵，以產生更好的預先訓練機器人模型。具體來說，我們專注於利用從影片中獲得的 3D 點追蹤表徵，這些表徵是透過單眼深度估計跨時間將 2D 表徵提升到 3D 空間而導出的。這些 4D 表徵在點和機器人狀態表徵之間保持一個共用的幾何結構，直到一個線性轉換，這使得從人類影片資料到低階機器人控制的有效遷移學習成為可能。我們的實驗表明，ARM4R 可以有效地從人類影片資料轉移到機器人技術，並持續改善各種機器人環境和組態中的任務效能。
+
+##### **UniGuardian: A Unified Defense for Detecting Prompt Injection, Backdoor Attacks and Adversarial Attacks in Large Language Models**
+2502.13141v1 by Huawei Lin, Yingjie Lao, Tong Geng, Tan Yu, Weijie Zhao
+
+Large Language Models (LLMs) are vulnerable to attacks like prompt injection,
+backdoor attacks, and adversarial attacks, which manipulate prompts or models
+to generate harmful outputs. In this paper, departing from traditional deep
+learning attack paradigms, we explore their intrinsic relationship and
+collectively term them Prompt Trigger Attacks (PTA). This raises a key
+question: Can we determine if a prompt is benign or poisoned? To address this,
+we propose UniGuardian, the first unified defense mechanism designed to detect
+prompt injection, backdoor attacks, and adversarial attacks in LLMs.
+Additionally, we introduce a single-forward strategy to optimize the detection
+pipeline, enabling simultaneous attack detection and text generation within a
+single forward pass. Our experiments confirm that UniGuardian accurately and
+efficiently identifies malicious prompts in LLMs.
+
+摘要：大型語言模型 (LLM) 容易受到提示注入、後門攻擊和對抗性攻擊等攻擊，這些攻擊會操縱提示或模型以產生有害的輸出。在本文中，我們跳脫傳統深度學習攻擊範例，探討它們的內在關係，並將它們統稱為提示觸發攻擊 (PTA)。這引發了一個關鍵問題：我們能確定一個提示是良性的還是惡意的嗎？為了解決這個問題，我們提出了 UniGuardian，這是一種旨在偵測 LLM 中的提示注入、後門攻擊和對抗性攻擊的第一個統一防禦機制。此外，我們引入了一個單一前向策略來最佳化偵測管道，在單一前向傳遞中同時進行攻擊偵測和文字生成。我們的實驗證實，UniGuardian 能準確且有效地識別 LLM 中的惡意提示。
+
+##### **AIDE: AI-Driven Exploration in the Space of Code**
+2502.13138v1 by Zhengyao Jiang, Dominik Schmidt, Dhruv Srikanth, Dixing Xu, Ian Kaplan, Deniss Jacenko, Yuxiang Wu
+
+Machine learning, the foundation of modern artificial intelligence, has
+driven innovations that have fundamentally transformed the world. Yet, behind
+these advancements lies a complex and often tedious process requiring labor- and
+compute-intensive iteration and experimentation. Engineers and scientists
+developing machine learning models spend much of their time on trial-and-error
+tasks instead of conceptualizing innovative solutions or research hypotheses.
+To address this challenge, we introduce AI-Driven Exploration (AIDE), a machine
+learning engineering agent powered by large language models (LLMs). AIDE frames
+machine learning engineering as a code optimization problem, and formulates
+trial-and-error as a tree search in the space of potential solutions. By
+strategically reusing and refining promising solutions, AIDE effectively trades
+computational resources for enhanced performance, achieving state-of-the-art
+results on multiple machine learning engineering benchmarks, including our
+Kaggle evaluations, OpenAI MLE-Bench and METR's RE-Bench.
+
+摘要：機器學習，現代人工智慧的基礎，已經推動了根本性地改變世界的創新。然而，進步的背後是一個複雜且經常繁瑣的過程，需要人工和計算密集的迭代和實驗。開發機器學習模型的工程師和科學家將大部分時間花在試錯任務上，而不是構思創新的解決方案或研究假設。為了應對這一挑戰，我們引入了 AI 驅動探索 (AIDE)，這是一種由大型語言模型 (LLM) 驅動的機器學習工程代理。AIDE 將機器學習工程構建為一個程式碼最佳化問題，並將試錯表述為在潛在解決方案空間中的樹狀搜尋。透過策略性地重複使用和改進有希望的解決方案，AIDE 有效地將計算資源轉換為增強的效能，在多個機器學習工程基準上取得了最先進的成果，包括我們的 Kaggle 評估、OpenAI MLE-Bench 和 METR's RE-Bench。
+
+##### **Theorem Prover as a Judge for Synthetic Data Generation**
+2502.13137v1 by Joshua Ong Jun Leang, Giwon Hong, Wenda Li, Shay B. Cohen
+
+The demand for synthetic data in mathematical reasoning has increased due to
+its potential to enhance the mathematical capabilities of large language models
+(LLMs). However, ensuring the validity of intermediate reasoning steps remains
+a significant challenge, affecting data quality. While formal verification via
+theorem provers effectively validates LLM reasoning, the autoformalisation of
+mathematical proofs remains error-prone. In response, we introduce iterative
+autoformalisation, an approach that iteratively refines theorem prover
+formalisation to mitigate errors, thereby increasing the execution rate on the
+Lean prover from 60% to 87%. Building upon that, we introduce Theorem Prover as
+a Judge (TP-as-a-Judge), a method that employs theorem prover formalisation to
+rigorously assess LLM intermediate reasoning, effectively integrating
+autoformalisation with synthetic data generation. Finally, we present
+Reinforcement Learning from Theorem Prover Feedback (RLTPF), a framework that
+replaces human annotation with theorem prover feedback in Reinforcement
+Learning from Human Feedback (RLHF). Across multiple LLMs, applying
+TP-as-a-Judge and RLTPF improves benchmarks with only 3,508 samples, achieving
+a 5.56% accuracy gain on Mistral-7B for MultiArith, 6.00% on Llama-2-7B for
+SVAMP, and 3.55% on Llama-3.1-8B for AQUA.
+
+摘要：由於合成資料在數學推理中具有增強大型語言模型 (LLM) 數學能力的潛力，對合成資料的需求已增加。然而，確保中間推理步驟的有效性仍然是一項重大的挑戰，影響資料品質。雖然透過定理證明器進行形式驗證可有效驗證 LLM 推理，但數學證明自動形式化仍然容易出錯。為了解決這個問題，我們引入了迭代自動形式化，這是一種迭代優化定理證明器形式化以減少錯誤的方法，從而將 Lean 證明器的執行率從 60% 提高到 87%。在此基礎上，我們引入了定理證明器作為評審 (TP-as-a-Judge)，這是一種採用定理證明器形式化來嚴格評估 LLM 中間推理的方法，有效地將自動形式化與合成資料產生整合。最後，我們提出了定理證明器回饋強化學習 (RLTPF)，這是一個框架，用定理證明器回饋取代人類標註，以進行人類回饋強化學習 (RLHF)。在多個 LLM 中，應用 TP-as-a-Judge 和 RLTPF 可透過僅 3,508 個樣本改善基準，在 MultiArith 上獲得 5.56% 的準確度提升，在 SVAMP 上獲得 Llama-2-7B 的 6.00% 提升，在 AQUA 上獲得 Llama-3.1-8B 的 3.55% 提升。
+
+##### **Sleepless Nights, Sugary Days: Creating Synthetic Users with Health Conditions for Realistic Coaching Agent Interactions**
+2502.13135v1 by Taedong Yun, Eric Yang, Mustafa Safdari, Jong Ha Lee, Vaishnavi Vinod Kumar, S. Sara Mahdavi, Jonathan Amar, Derek Peyton, Reut Aharony, Andreas Michaelides, Logan Schneider, Isaac Galatzer-Levy, Yugang Jia, John Canny, Arthur Gretton, Maja Matarić
+
+We present an end-to-end framework for generating synthetic users for
+evaluating interactive agents designed to encourage positive behavior changes,
+such as in health and lifestyle coaching. The synthetic users are grounded in
+health and lifestyle conditions, specifically sleep and diabetes management in
+this study, to ensure realistic interactions with the health coaching agent.
+Synthetic users are created in two stages: first, structured data are generated
+grounded in real-world health and lifestyle factors in addition to basic
+demographics and behavioral attributes; second, full profiles of the synthetic
+users are developed conditioned on the structured data. Interactions between
+synthetic users and the coaching agent are simulated using generative
+agent-based models such as Concordia, or directly by prompting a language
+model. Using two independently-developed agents for sleep and diabetes coaching
+as case studies, the validity of this framework is demonstrated by analyzing
+the coaching agent's understanding of the synthetic users' needs and
+challenges. Finally, through multiple blinded evaluations of user-coach
+interactions by human experts, we demonstrate that our synthetic users with
+health and behavioral attributes more accurately portray real human users with
+the same attributes, compared to generic synthetic users not grounded in such
+attributes. The proposed framework lays the foundation for efficient
+development of conversational agents through extensive, realistic, and grounded
+simulated interactions.
+
+摘要：我們提供了一個端到端的架構，用於為評估互動式代理生成合成使用者，這些代理旨在鼓勵正向行為改變，例如健康和生活方式指導。合成使用者以健康和生活方式狀況為基礎，特別是本研究中的睡眠和糖尿病管理，以確保與健康指導代理的互動具有真實性。合成使用者分兩個階段建立：首先，除了基本人口統計資料和行為屬性外，還會產生以現實世界的健康和生活方式因素為基礎的結構化資料；其次，會根據結構化資料開發合成使用者的完整個人資料。合成使用者和指導代理之間的互動是使用生成式基於代理的模型（例如 Concordia）模擬的，或者直接通過提示語言模型來模擬。使用兩個獨立開發的睡眠和糖尿病指導代理作為案例研究，通過分析指導代理對合成使用者需求和挑戰的理解，證明了此架構的有效性。最後，通過人類專家對使用者指導互動進行多重盲測評估，我們證明了與未以這些屬性為基礎的通用合成使用者相比，具有健康和行為屬性的合成使用者更準確地描繪了具有相同屬性的真實人類使用者。所提出的架構為通過廣泛、真實且有根據的模擬互動，為對話代理的有效開發奠定了基礎。
+
+##### **Learning to Defer for Causal Discovery with Imperfect Experts**
+2502.13132v1 by Oscar Clivio, Divyat Mahajan, Perouz Taslakian, Sara Magliacane, Ioannis Mitliagkas, Valentina Zantedeschi, Alexandre Drouin
+
+Integrating expert knowledge, e.g. from large language models, into causal
+discovery algorithms can be challenging when the knowledge is not guaranteed to
+be correct. Expert recommendations may contradict data-driven results, and
+their reliability can vary significantly depending on the domain or specific
+query. Existing methods based on soft constraints or inconsistencies in
+predicted causal relationships fail to account for these variations in
+expertise. To remedy this, we propose L2D-CD, a method for gauging the
+correctness of expert recommendations and optimally combining them with
+data-driven causal discovery results. By adapting learning-to-defer (L2D)
+algorithms for pairwise causal discovery (CD), we learn a deferral function
+that selects whether to rely on classical causal discovery methods using
+numerical data or expert recommendations based on textual meta-data. We
+evaluate L2D-CD on the canonical Tübingen pairs dataset and demonstrate its
+superior performance compared to both the causal discovery method and the
+expert used in isolation. Moreover, our approach identifies domains where the
+expert's performance is strong or weak. Finally, we outline a strategy for
+generalizing this approach to causal discovery on graphs with more than two
+variables, paving the way for further research in this area.
+
+摘要：整合專家知識，例如從大型語言模型中整合到因果發現演算法中，當知識無法保證正確時會很有挑戰性。專家建議可能會與資料驅動的結果相矛盾，而且他們的可靠性可能會根據領域或特定查詢而有顯著差異。現有的基於軟約束或預測因果關係中不一致的方法無法說明專業知識中的這些變化。為了補救這一點，我們提出了 L2D-CD，一種用於評估專家建議的正確性並將其與資料驅動的因果發現結果最佳結合的方法。透過調整學習延遲 (L2D) 演算法以進行成對因果發現 (CD)，我們學習了一個延遲函數，用於選擇依賴使用數值資料的傳統因果發現方法或基於文字元資料的專家建議。我們在經典的 Tübingen 對資料集上評估 L2D-CD，並證明其與單獨使用的因果發現方法和專家相比具有優越的效能。此外，我們的做法識別出專家表現強或弱的領域。最後，我們概述了一種將此方法推廣到具有兩個以上變數的圖表上進行因果發現的策略，為此領域的進一步研究鋪平了道路。
+
+##### **Rethinking Diverse Human Preference Learning through Principal Component Analysis**
+2502.13131v1 by Feng Luo, Rui Yang, Hao Sun, Chunyuan Deng, Jiarui Yao, Jingyan Shen, Huan Zhang, Hanjie Chen
+
+Understanding human preferences is crucial for improving foundation models
+and building personalized AI systems. However, preferences are inherently
+diverse and complex, making it difficult for traditional reward models to
+capture their full range. While fine-grained preference data can help,
+collecting it is expensive and hard to scale. In this paper, we introduce
+Decomposed Reward Models (DRMs), a novel approach that extracts diverse human
+preferences from binary comparisons without requiring fine-grained annotations.
+Our key insight is to represent human preferences as vectors and analyze them
+using Principal Component Analysis (PCA). By constructing a dataset of
+embedding differences between preferred and rejected responses, DRMs identify
+orthogonal basis vectors that capture distinct aspects of preference. These
+decomposed rewards can be flexibly combined to align with different user needs,
+offering an interpretable and scalable alternative to traditional reward
+models. We demonstrate that DRMs effectively extract meaningful preference
+dimensions (e.g., helpfulness, safety, humor) and adapt to new users without
+additional training. Our results highlight DRMs as a powerful framework for
+personalized and interpretable LLM alignment.
+
+摘要：理解人類偏好對於改進基礎模型和建構個人化 AI 系統至關重要。然而，偏好本質上是多樣且複雜的，這使得傳統的獎勵模型難以捕捉其全部範圍。雖然細緻的偏好數據可能有所幫助，但收集這些數據既昂貴又難以擴展。在本文中，我們介紹了解構獎勵模型 (DRM)，這是一種新穎的方法，它可以從二元比較中提取多樣化的人類偏好，而不需要細緻的註解。我們的關鍵見解是將人類偏好表示為向量，並使用主成分分析 (PCA) 對其進行分析。透過建構偏好和拒絕回應之間嵌入差異的數據集，DRM 識別出正交基向量，這些向量捕捉偏好的不同面向。這些解構的獎勵可以靈活地結合在一起，以符合不同的使用者需求，提供一種可解釋且可擴展的傳統獎勵模型替代方案。我們證明了 DRM 可以有效地提取有意義的偏好維度（例如，有用性、安全性、幽默感），並在不需要額外訓練的情況下適應新的使用者。我們的結果突顯了 DRM 作為個人化且可解釋的 LLM 對齊強大架構。
+
+##### **Magma: A Foundation Model for Multimodal AI Agents**
+2502.13130v1 by Jianwei Yang, Reuben Tan, Qianhui Wu, Ruijie Zheng, Baolin Peng, Yongyuan Liang, Yu Gu, Mu Cai, Seonghyeon Ye, Joel Jang, Yuquan Deng, Lars Liden, Jianfeng Gao
+
+We present Magma, a foundation model that serves multimodal AI agentic tasks
+in both the digital and physical worlds. Magma is a significant extension of
+vision-language (VL) models in that it not only retains the VL understanding
+ability (verbal intelligence) of the latter, but is also equipped with the
+ability to plan and act in the visual-spatial world (spatial-temporal
+intelligence) and complete agentic tasks ranging from UI navigation to robot
+manipulation. To endow it with agentic capabilities, Magma is pretrained on
+large amounts of heterogeneous data spanning images, videos, and robotics
+data, where the actionable visual objects (e.g., clickable buttons in a GUI) in
+images are labeled by Set-of-Mark (SoM) for action grounding, and the object
+movements (e.g., the trace of human hands or robotic arms) in videos are
+labeled by Trace-of-Mark (ToM) for action planning. Extensive experiments show
+that SoM and ToM reach great synergy and facilitate the acquisition of
+spatial-temporal intelligence for our Magma model, which is fundamental to a
+wide range of tasks as shown in Fig.1. In particular, Magma creates new
+state-of-the-art results on UI navigation and robotic manipulation tasks,
+outperforming previous models that are specifically tailored to these tasks. On
+image and video-related multimodal tasks, Magma also compares favorably to
+popular large multimodal models that are trained on much larger datasets. We
+make our model and code public for reproducibility at
+https://microsoft.github.io/Magma.
+
+摘要：我們提出 Magma，這是一個基礎模型，用於服務數位和物理世界中的多模態 AI 代理任務。Magma 是視覺語言 (VL) 模型的重大延伸，它不僅保留了後者的 VL 理解能力（語言智能），還具備在視覺空間世界中規劃和行動的能力（時空智能），並完成從 UI 導航到機器人操作的代理任務。為了賦予代理能力，Magma 在從影像、影片到機器人資料的大量異質資料集上進行預訓練，其中影像中的可操作視覺物件（例如 GUI 中的可點擊按鈕）由動作接地 Set-of-Mark (SoM) 標記，影片中的物件動作（例如人手或機器手臂的軌跡）由動作規劃 Trace-of-Mark (ToM) 標記。廣泛的實驗表明，SoM 和 ToM 達到了極大的協同作用，並促進了我們 Magma 模型的時空智能的獲取，這對於圖 1 中所示的各種任務至關重要。特別是，Magma 在 UI 導航和機器人操作任務上創造了新的最先進的結果，優於專門針對這些任務的先前模型。在影像和影片相關的多模態任務上，Magma 也與在更大資料集上訓練的流行大型多模態模型相比，表現得很好。我們公開我們的模型和程式碼，以便在 https://microsoft.github.io/Magma 上重現。
+
+##### **SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation**
+2502.13128v1 by Zihan Liu, Shuangrui Ding, Zhixiong Zhang, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Dahua Lin, Jiaqi Wang
+
+Text-to-song generation, the task of creating vocals and accompaniment from
+textual inputs, poses significant challenges due to domain complexity and data
+scarcity. Existing approaches often employ multi-stage generation procedures,
+resulting in cumbersome training and inference pipelines. In this paper, we
+propose SongGen, a fully open-source, single-stage auto-regressive transformer
+designed for controllable song generation. The proposed model facilitates
+fine-grained control over diverse musical attributes, including lyrics and
+textual descriptions of instrumentation, genre, mood, and timbre, while also
+offering an optional three-second reference clip for voice cloning. Within a
+unified auto-regressive framework, SongGen supports two output modes: mixed
+mode, which generates a mixture of vocals and accompaniment directly, and
+dual-track mode, which synthesizes them separately for greater flexibility in
+downstream applications. We explore diverse token pattern strategies for each
+mode, leading to notable improvements and valuable insights. Furthermore, we
+design an automated data preprocessing pipeline with effective quality control.
+To foster community engagement and future research, we will release our model
+weights, training code, annotated data, and preprocessing pipeline. The
+generated samples are showcased on our project page at
+https://liuzh-19.github.io/SongGen/ , and the code will be available at
+https://github.com/LiuZH-19/SongGen .
+
+摘要：文字轉歌曲生成，從文字輸入建立人聲和伴奏的任務，由於領域複雜性和資料稀少性，因此構成重大挑戰。現有方法通常採用多階段生成程序，導致訓練和推論管道繁瑣。在本文中，我們提出 SongGen，一個完全開源的單階段自迴歸轉換器，專為可控歌曲生成而設計。所提出的模型促進對各種音樂屬性的細粒度控制，包括歌詞和樂器、類型、情緒和音色的文字描述，同時還提供可選的三秒參考片段以進行語音複製。在統一的自迴歸框架內，SongGen 支援兩種輸出模式：混合模式，直接生成人聲和伴奏的混合，以及雙軌模式，將它們分開合成以提高下游應用程式的靈活性。我們探索每種模式的不同代幣模式策略，從而帶來顯著的改進和有價值的見解。此外，我們設計了一個自動化資料預處理管道，具備有效的品質控制。為了促進社區參與和未來的研究，我們將釋出我們的模型權重、訓練程式碼、註解資料和預處理管道。生成的範例展示在我們的專案頁面 https://liuzh-19.github.io/SongGen/，程式碼將在 https://github.com/LiuZH-19/SongGen 中提供。
+
+##### **Facilitating Long Context Understanding via Supervised Chain-of-Thought Reasoning**
+2502.13127v1 by Jingyang Lin, Andy Wong, Tian Xia, Shenghua He, Hui Wei, Mei Han, Jiebo Luo
+
+Recent advances in Large Language Models (LLMs) have enabled them to process
+increasingly longer sequences, ranging from 2K to 2M tokens and even beyond.
+However, simply extending the input sequence length does not necessarily lead
+to effective long-context understanding. In this study, we integrate
+Chain-of-Thought (CoT) reasoning into LLMs in a supervised manner to facilitate
+effective long-context understanding. To achieve this, we introduce
+LongFinanceQA, a synthetic dataset in the financial domain designed to improve
+long-context reasoning. Unlike existing long-context synthetic data,
+LongFinanceQA includes intermediate CoT reasoning before the final conclusion,
+which encourages LLMs to perform explicit reasoning, improving accuracy and
+interpretability in long-context understanding. To generate synthetic CoT
+reasoning, we propose Property-driven Agentic Inference (PAI), an agentic
+framework that simulates human-like reasoning steps, including property
+extraction, retrieval, and summarization. We evaluate PAI's reasoning
+capabilities by assessing GPT-4o-mini with PAI on the Loong benchmark, where
+it outperforms standard GPT-4o-mini by 20.0%. Furthermore, we fine-tune
+LLaMA-3.1-8B-Instruct on LongFinanceQA, achieving a 24.6% gain on Loong's
+financial subset.
+
+摘要：大型語言模型 (LLM) 的最新進展讓它們能夠處理越來越長的序列，範圍從 2K 到 2M 個符號，甚至更長。
+然而，僅僅延長輸入序列長度並不會必然導致有效的長語境理解。在本研究中，我們以監督的方式將思考鏈 (CoT) 推理整合到 LLM 中，以促進有效的長語境理解。為此，我們引入了 LongFinanceQA，這是一個在金融領域中的合成數據集，旨在改進長語境推理。與現有的長語境合成數據不同，LongFinanceQA 在最終結論之前包含了中間的 CoT 推理，這鼓勵 LLM 執行明確的推理，從而提高長語境理解的準確性和可解釋性。為了生成合成的 CoT 推理，我們提出了基於屬性的主體推理 (PAI)，這是一個模擬類人推理步驟的主體框架，包括屬性提取、檢索和總結。我們通過評估搭載 PAI 的 GPT-4o-mini 在 Loong 基準上的推理能力，使其比標準的 GPT-4o-mini 高出 20.0%，來評估 PAI 的推理能力。此外，我們對 LLaMA-3.1-8B-Instruct 進行了微調，在 Loong 的金融子集中實現了 24.6% 的增益。
+
+##### **RuozhiBench: Evaluating LLMs with Logical Fallacies and Misleading Premises**
+2502.13125v1 by Zenan Zhai, Hao Li, Xudong Han, Zhenxuan Zhang, Yixuan Zhang, Timothy Baldwin, Haonan Li
+
+Recent advances in large language models (LLMs) have shown that they can
+answer questions requiring complex reasoning. However, their ability to
+identify and respond to text containing logical fallacies or deliberately
+misleading premises remains less studied. To address this gap, we introduce
+RuozhiBench, a bilingual dataset comprising 677 carefully curated questions
+that contain various forms of deceptive reasoning, meticulously crafted through
+extensive human effort and expert review. In a comprehensive evaluation of 17
+LLMs from 5 series on RuozhiBench using both open-ended and two-choice
+formats, we conduct extensive analyses on evaluation protocols and result
+patterns. Despite their high scores on conventional benchmarks, these models
+showed limited ability to detect and reason correctly about logical fallacies,
+with even the best-performing model, Claude-3-haiku, achieving only 62%
+accuracy, compared to human accuracy of more than 90%.
+
+摘要：大型語言模型 (LLM) 的最新進展顯示，它們可以回答需要複雜推理的問題。然而，它們識別和回應包含邏輯謬誤或故意誤導前提的文本的能力仍未得到充分研究。為了解決這個差距，我們引入了 RuozhiBench，這是一個雙語資料集，包含 677 個經過仔細策劃的問題，其中包含各種形式的欺騙性推理，並透過廣泛的人力投入和專家審查精心製作。在使用開放式和二選一格式對來自 5 個系列的 17 個 LLM 進行 RuozhiBench 的全面評估中，我們對評估協定和結果模式進行了廣泛的分析。儘管它們在傳統基準測試中獲得了高分，但這些模型在檢測和正確推理邏輯謬誤方面表現出的能力有限，即使是效能最好的模型 Claude-3-haiku，與人類的 90% 以上相比，也只達到了 62% 的準確度。
+
+##### **NaturalReasoning: Reasoning in the Wild with 2.8M Challenging Questions**
+2502.13124v1 by Weizhe Yuan, Jane Yu, Song Jiang, Karthik Padthe, Yang Li, Dong Wang, Ilia Kulikov, Kyunghyun Cho, Yuandong Tian, Jason E Weston, Xian Li
+
+Scaling reasoning capabilities beyond traditional domains such as math and
+coding is hindered by the lack of diverse and high-quality questions. To
+overcome this limitation, we introduce a scalable approach for generating
+diverse and challenging reasoning questions, accompanied by reference answers.
+We present NaturalReasoning, a comprehensive dataset comprising 2.8 million
+questions that span multiple domains, including STEM fields (e.g., Physics,
+Computer Science), Economics, Social Sciences, and more. We demonstrate the
+utility of the questions in NaturalReasoning through knowledge distillation
+experiments which show that NaturalReasoning can effectively elicit and
+transfer reasoning capabilities from a strong teacher model. Furthermore, we
+demonstrate that NaturalReasoning is also effective for unsupervised
+self-training using external reward models or self-rewarding.
+
+摘要：透過超越傳統領域（例如數學和編碼）來擴充推理能力，受到缺乏多元且高品質問題的阻礙。為了克服這個限制，我們引入一個可擴充的方法，用於產生多元且具挑戰性的推理問題，並附上參考答案。我們提出 NaturalReasoning，這是一個包含 280 萬個問題的綜合資料集，涵蓋多個領域，包括 STEM 領域（例如物理、電腦科學）、經濟學、社會科學等等。我們透過知識蒸餾實驗，展示 NaturalReasoning 中問題的實用性，這些實驗顯示 NaturalReasoning 能有效地引發和轉移強大教師模型的推理能力。此外，我們展示 NaturalReasoning 也適用於使用外部獎勵模型或自我獎勵的無監督自我訓練。
+
+##### **Adapting Psycholinguistic Research for LLMs: Gender-inclusive Language in a Coreference Context**
+2502.13120v1 by Marion Bartl, Thomas Brendan Murphy, Susan Leavy
+
+Gender-inclusive language is often used with the aim of ensuring that all
+individuals, regardless of gender, can be associated with certain concepts.
+While psycholinguistic studies have examined its effects in relation to human
+cognition, it remains unclear how Large Language Models (LLMs) process
+gender-inclusive language. Given that commercial LLMs are gaining an
+increasingly strong foothold in everyday applications, it is crucial to examine
+whether LLMs in fact interpret gender-inclusive language neutrally, because the
+language they generate has the potential to influence the language of their
+users. This study examines whether LLM-generated coreferent terms align with a
+given gender expression or reflect model biases. Adapting psycholinguistic
+methods from French to English and German, we find that in English, LLMs
+generally maintain the antecedent's gender but exhibit underlying masculine
+bias. In German, this bias is much stronger, overriding all tested
+gender-neutralization strategies.
+
+摘要：性別包容性語言通常用於確保所有個人，無論性別如何，都能與某些概念聯繫在一起。雖然心理語言學研究已經檢視了它對人類認知的影響，但大型語言模型 (LLM) 如何處理性別包容性語言仍然不清楚。鑑於商業 LLM 在日常應用中越來越站穩腳步，因此至關重要的是要檢查 LLM 是否實際上中立地解釋性別包容性語言，因為它們產生的語言有可能影響其使用者的語言。本研究探討了 LLM 生成的共指術語是否與給定的性別表達一致或反映模型偏見。我們採用法語到英語和德語的心理語言學方法，發現英語中，LLM 通常會保持先行詞的性別，但表現出潛在的男性偏見。在德語中，這種偏見強得多，凌駕於所有經過測試的性別中立化策略。
+
+##### **STEER-ME: Assessing the Microeconomic Reasoning of Large Language Models**
+2502.13119v1 by Narun Raman, Taylor Lundy, Thiago Amin, Jesse Perla, Kevin Leyton-Brown
+
+How should one judge whether a given large language model (LLM) can reliably
+perform economic reasoning? Most existing LLM benchmarks focus on specific
+applications and fail to present the model with a rich variety of economic
+tasks. A notable exception is Raman et al. [2024], who offer an approach for
+comprehensively benchmarking strategic decision-making; however, this approach
+fails to address the non-strategic settings prevalent in microeconomics, such
+as supply-and-demand analysis. We address this gap by taxonomizing
+microeconomic reasoning into 58 distinct elements, focusing on the logic of
+supply and demand, each grounded in up to 10 distinct domains, 5
+perspectives, and 3 types. The generation of benchmark data across this
+combinatorial space is powered by a novel LLM-assisted data generation protocol
+that we dub auto-STEER, which generates a set of questions by adapting
+handwritten templates to target new domains and perspectives. Because it offers
+an automated way of generating fresh questions, auto-STEER mitigates the risk
+that LLMs will be trained to over-fit evaluation benchmarks; we thus hope that
+it will serve as a useful tool both for evaluating and fine-tuning models for
+years to come. We demonstrate the usefulness of our benchmark via a case study
+on 27 LLMs, ranging from small open-source models to the current state of the
+art. We examined each model's ability to solve microeconomic problems across
+our whole taxonomy and present the results across a range of prompting
+strategies and scoring metrics.
+
+摘要：如何判斷一個給定的大型語言模型 (LLM) 能否可靠地進行經濟推理？現有的 LLM 基準測試大多專注於特定應用，未能為模型提供豐富多樣的經濟任務。一個值得注意的例外是 Raman 等人 [2024]，他們提供了一種全面評估策略決策制定方法；然而，這種方法無法解決微觀經濟學中普遍存在的非策略性設定，例如供需分析。我們透過將微觀經濟推理分類為 58 個不同的元素來解決這個差距，重點放在供需邏輯上，每個元素都基於多達 10 個不同的領域、5 個觀點和 3 種類型。在這個組合空間中產生基準數據是由一種新穎的 LLM 輔助數據生成協議（我們稱之為 auto-STEER）推動的，它通過調整手寫模板來針對新的領域和觀點來生成一組問題。由於它提供了一種生成新問題的自動化方式，auto-STEER 減輕了 LLM 將被訓練過度配合評估基準測試的風險；因此，我們希望它將成為未來幾年評估和微調模型的有用工具。我們通過一個案例研究展示了我們基準測試的效用，該案例研究涵蓋了 27 個 LLM，從小型開源模型到當前技術狀態。我們檢查了每個模型在我們的整個分類法中解決微觀經濟問題的能力，並在各種提示策略和評分指標中展示了結果。
+
+##### **Performance Evaluation of Large Language Models in Statistical Programming**
+2502.13117v1 by Xinyi Song, Kexin Xie, Lina Lee, Ruizhe Chen, Jared M. Clark, Hao He, Haoran He, Jie Min, Xinlei Zhang, Simin Zheng, Zhiyang Zhang, Xinwei Deng, Yili Hong
+
+The programming capabilities of large language models (LLMs) have
+revolutionized automatic code generation and opened new avenues for automatic
+statistical analysis. However, the validity and quality of the generated
+code need to be systematically evaluated before it can be widely adopted.
+Despite their growing prominence, a comprehensive evaluation of statistical
+code generated by LLMs remains scarce in the literature. In this paper, we
+assess the performance of LLMs, including two versions of ChatGPT and one
+version of Llama, in the domain of SAS programming for statistical analysis.
+Our study utilizes a set of statistical analysis tasks encompassing diverse
+statistical topics and datasets. Each task includes a problem description,
+dataset information, and human-verified SAS code. We conduct a comprehensive
+assessment of the quality of SAS code generated by LLMs through human expert
+evaluation based on correctness, effectiveness, readability, executability, and
+the accuracy of output results. The analysis of rating scores reveals that
+while LLMs demonstrate usefulness in generating syntactically correct code,
+they struggle with tasks requiring deep domain understanding and may produce
+redundant or incorrect results. This study offers valuable insights into the
+capabilities and limitations of LLMs in statistical programming, providing
+guidance for future advancements in AI-assisted coding systems for statistical
+analysis.
+
+摘要：大型語言模型 (LLM) 的程式設計功能徹底改變了自動程式碼生成，並為自動統計分析開啟了新途徑。然而，在廣泛採用這些產生的程式碼之前，需要系統性地評估其有效性和品質。儘管其重要性日益提升，但文獻中對於 LLM 產生的統計程式碼的全面評估仍然稀少。在本文中，我們評估了 LLM 的效能，包括兩個版本的 ChatGPT 和一個版本的 Llama，在統計分析的 SAS 程式設計領域。我們的研究利用了一組涵蓋各種統計主題和資料集的統計分析任務。每個任務都包含問題說明、資料集資訊和經過人工驗證的 SAS 程式碼。我們透過基於正確性、有效性、可讀性、可執行性和輸出結果精確度的專家評估，對 LLM 產生的 SAS 程式碼品質進行全面評估。評分結果的分析顯示，儘管 LLM 在產生語法正確的程式碼方面表現出其效用，但它們在需要深入領域理解的任務中會遇到困難，並且可能會產生冗餘或不正確的結果。本研究提供了 LLM 在統計程式設計中能力和限制的寶貴見解，為統計分析的 AI 輔助編碼系統的未來進展提供指導。
+
+##### **Near-Optimal Private Learning in Linear Contextual Bandits**
+2502.13115v1 by Fan Chen, Jiachun Li, Alexander Rakhlin, David Simchi-Levi
+
+We analyze the problem of private learning in generalized linear contextual
+bandits. Our approach is based on a novel method of re-weighted regression,
+yielding an efficient algorithm with regret of order
+$\sqrt{T}+\frac{1}{\alpha}$ and $\sqrt{T}/\alpha$ in the joint and local model
+of $\alpha$-privacy, respectively. Further, we provide near-optimal private
+procedures that achieve dimension-independent rates in private linear models
+and linear contextual bandits. In particular, our results imply that joint
+privacy is almost "for free" in all the settings we consider, partially
+addressing the open problem posed by Azize and Basu (2024).
+
+摘要：我們分析廣義線性情境強盜中私人學習的問題。我們的做法基於重新加權回歸的新方法，產生一種有效率的演算法，其後悔值分別為
+$\sqrt{T}+\frac{1}{\alpha}$ 和 $\sqrt{T}/\alpha$ 在 $\alpha$-隱私的聯合和局部模型中。此外，我們提供近乎最佳的私人程序，在私人線性模型和線性情境強盜中實現與維度無關的比率。特別是，我們的結果表明，在我們考慮的所有設定中，聯合隱私幾乎是「免費」的，部分解決了 Azize 和 Basu (2024) 提出的開放性問題。
+
+##### **The influence of motion features in temporal perception**
+2502.13114v1 by Rosa Illan Castillo, Javier Valenzuela
+
+This paper examines the role of manner-of-motion verbs in shaping subjective
+temporal perception and emotional resonance. Through four complementary
+studies, we explore how these verbs influence the conceptualization of time,
+examining their use in literal and metaphorical (temporal) contexts. Our
+findings reveal that faster verbs (e.g., fly, zoom) evoke dynamic and engaging
+temporal experiences, often linked to positive emotions and greater agency. In
+contrast, slower verbs (e.g., crawl, drag) convey passivity, monotony, and
+negative emotions, reflecting tedious or constrained experiences of time. These
+effects are amplified in metaphorical contexts, where manner verbs encode
+emotional and experiential nuances that transcend their literal meanings. We
+also find that participants prefer manner verbs over path verbs (e.g., go,
+pass) in emotionally charged temporal contexts, as manner verbs capture the
+experiential and emotional qualities of time more effectively. These findings
+highlight the interplay between language, motion, and emotion in shaping
+temporal perception, offering insights into how linguistic framing influences
+subjective experiences of time.
+
+摘要：本文探討動作方式動詞在形塑主觀時間感知和情緒共鳴中所扮演的角色。透過四項互補的研究，我們探討這些動詞如何影響時間的概念化，並檢視它們在字面和隱喻（時間）語境中的用法。我們的研究結果顯示，較快的動詞（例如飛、飆）會引起動態且引人入勝的時間體驗，通常與正面情緒和較大的自主性有關。相反地，較慢的動詞（例如爬、拖）傳達了被動、單調和負面情緒，反映出乏味或受限的時間體驗。這些效應在隱喻語境中會被放大，其中動作動詞編碼了超越其字面意義的情緒和體驗細微差別。我們還發現，在充滿情緒的時間語境中，參與者偏好動作動詞而非路徑動詞（例如走、經過），因為動作動詞更有效地捕捉了時間的體驗和情緒品質。這些研究結果突顯了語言、動作和情緒之間在形塑時間感知中的交互作用，並提供了語言框架如何影響主觀時間體驗的見解。
+
+##### **Improving Clinical Question Answering with Multi-Task Learning: A Joint Approach for Answer Extraction and Medical Categorization**
+2502.13108v1 by Priyaranjan Pattnayak, Hitesh Laxmichand Patel, Amit Agarwal, Bhargava Kumar, Srikant Panda, Tejaswini Kumar
+
+Clinical Question Answering (CQA) plays a crucial role in medical
+decision-making, enabling physicians to extract relevant information from
+Electronic Medical Records (EMRs). While transformer-based models such as BERT,
+BioBERT, and ClinicalBERT have demonstrated state-of-the-art performance in
+CQA, existing models lack the ability to categorize extracted answers, which is
+critical for structured retrieval, content filtering, and medical decision
+support.
+  To address this limitation, we introduce a Multi-Task Learning (MTL)
+framework that jointly trains CQA models for both answer extraction and medical
+categorization. In addition to predicting answer spans, our model classifies
+responses into five standardized medical categories: Diagnosis, Medication,
+Symptoms, Procedure, and Lab Reports. This categorization enables more
+structured and interpretable outputs, making clinical QA models more useful in
+real-world healthcare settings.
+  We evaluate our approach on emrQA, a large-scale dataset for medical question
+answering. Results show that MTL improves F1-score by 2.2% compared to standard
+fine-tuning, while achieving 90.7% accuracy in answer categorization. These
+findings suggest that MTL not only enhances CQA performance but also introduces
+an effective mechanism for categorization and structured medical information
+retrieval.
+
+摘要：臨床問答 (CQA) 在醫療決策中扮演著至關重要的角色，讓醫師能夠從電子病歷 (EMR) 中擷取相關資訊。儘管 BERT、BioBERT 和 ClinicalBERT 等基於轉換器的模型已在 CQA 中展現出最先進的效能，但現有的模型缺乏分類擷取答案的能力，這對於結構化檢索、內容過濾和醫療決策支援至關重要。
+  為了解決這個限制，我們引進了一個多任務學習 (MTL) 架構，它同時訓練 CQA 模型用於答案擷取和醫療分類。除了預測答案範圍，我們的模型將回應分類為五個標準化醫療類別：診斷、藥物、症狀、程序和實驗室報告。這種分類能產生更結構化且易於理解的輸出，讓臨床問答模型在真實世界的醫療保健環境中更實用。
+  我們在 emrQA 上評估我們的做法，emrQA 是用於醫療問題解答的大規模資料集。結果顯示，與標準微調相比，MTL 將 F1 分數提高了 2.2%，同時在答案分類中達到 90.7% 的準確度。這些發現表明，MTL 不僅增強了 CQA 的效能，還引入了一種分類和結構化醫療資訊檢索的有效機制。
+
+##### **MatterChat: A Multi-Modal LLM for Material Science**
+2502.13107v1 by Yingheng Tang, Wenbin Xu, Jie Cao, Jianzhu Ma, Weilu Gao, Steve Farrell, Benjamin Erichson, Michael W. Mahoney, Andy Nonaka, Zhi Yao
+
+Understanding and predicting the properties of inorganic materials is crucial
+for accelerating advancements in materials science and driving applications in
+energy, electronics, and beyond. Integrating material structure data with
+language-based information through multi-modal large language models (LLMs)
+offers great potential to support these efforts by enhancing human-AI
+interaction. However, a key challenge lies in integrating atomic structures at
+full resolution into LLMs. In this work, we introduce MatterChat, a versatile
+structure-aware multi-modal LLM that unifies material structural data and
+textual inputs into a single cohesive model. MatterChat employs a bridging
+module to effectively align a pretrained machine learning interatomic potential
+with a pretrained LLM, reducing training costs and enhancing flexibility. Our
+results demonstrate that MatterChat significantly improves performance in
+material property prediction and human-AI interaction, surpassing
+general-purpose LLMs such as GPT-4. We also demonstrate its usefulness in
+applications such as more advanced scientific reasoning and step-by-step
+material synthesis.
+
+摘要：了解和預測無機材料的特性對於加速材料科學的進步和推動能源、電子等方面的應用至關重要。透過多模態大型語言模型 (LLM) 將材料結構數據與基於語言的資訊整合，可以極大程度地支持這些工作，藉此增強人類與 AI 的互動。然而，一個關鍵挑戰在於將原子結構以完整解析度整合到 LLM 中。在這項工作中，我們引入了 MatterChat，這是一個通用的結構感知多模態 LLM，它將材料結構數據和文字輸入統一到一個單一的內聚模型中。MatterChat 採用橋接模組，將預先訓練好的機器學習原子間電位與預先訓練好的 LLM 有效地對齊，從而降低訓練成本並增強靈活性。我們的結果表明，MatterChat 大幅提升了材料特性預測和人類與 AI 互動的效能，超越了 GPT-4 等通用 LLM。我們也展示了它在更進階的科學推理和逐步材料合成等應用中的效用。
+
+##### **Understanding and Rectifying Safety Perception Distortion in VLMs**
+2502.13095v1 by Xiaohan Zou, Jian Kang, George Kesidis, Lu Lin
+
+Recent studies reveal that vision-language models (VLMs) become more
+susceptible to harmful requests and jailbreak attacks after integrating the
+vision modality, exhibiting greater vulnerability than their text-only LLM
+backbones. To uncover the root cause of this phenomenon, we conduct an in-depth
+analysis and identify a key issue: multimodal inputs introduce a
+modality-induced activation shift toward a "safer" direction compared to their
+text-only counterparts, leading VLMs to systematically overestimate the safety
+of harmful inputs. We refer to this issue as safety perception distortion. To
+mitigate such distortion, we propose Activation Shift Disentanglement and
+Calibration (ShiftDC), a training-free method that decomposes and calibrates
+the modality-induced activation shift to reduce the impact of modality on
+safety. By isolating and removing the safety-relevant component, ShiftDC
+restores the inherent safety alignment of the LLM backbone while preserving the
+vision-language capabilities of VLMs. Empirical results demonstrate that
+ShiftDC significantly enhances alignment performance on safety benchmarks
+without impairing model utility.
+
+摘要：最近的研究表明，在整合了视觉模态后，视觉语言模型 (VLM) 更容易受到有害请求和越狱攻击，表现出比其仅文本的 LLM 主干更大的漏洞。为了揭示这种现象的根本原因，我们进行了深入分析，并确定了一个关键问题：与仅文本的对应物相比，多模态输入引入了朝“更安全”方向的模态诱导激活转移，导致 VLM 系统性地高估有害输入的安全性。我们将此问题称为安全感知扭曲。为了减轻这种扭曲，我们提出了激活转移解耦和校准 (ShiftDC)，这是一种无训练方法，用于分解和校准模态诱导的激活转移，以减少模态对安全性的影响。通过隔离和移除与安全性相关的组件，ShiftDC 恢复了 LLM 主干的固有安全性对齐，同时保留了 VLM 的视觉语言能力。实证结果表明，ShiftDC 在不损害模型效用的情况下，显著增强了安全基准上的对齐性能。
+
+##### **Text2World: Benchmarking Large Language Models for Symbolic World Model Generation**
+2502.13092v1 by Mengkang Hu, Tianxing Chen, Yude Zou, Yuheng Lei, Qiguang Chen, Ming Li, Hongyuan Zhang, Wenqi Shao, Ping Luo
+
+Recently, there has been growing interest in leveraging large language models
+(LLMs) to generate symbolic world models from textual descriptions. Although
+LLMs have been extensively explored in the context of world modeling, prior
+studies encountered several challenges, including evaluation randomness,
+dependence on indirect metrics, and a limited domain scope. To address these
+limitations, we introduce a novel benchmark, Text2World, based on planning
+domain definition language (PDDL), featuring hundreds of diverse domains and
+employing multi-criteria, execution-based metrics for a more robust evaluation.
+We benchmark current LLMs using Text2World and find that reasoning models
+trained with large-scale reinforcement learning outperform others. However,
+even the best-performing model still demonstrates limited capabilities in world
+modeling. Building on these insights, we examine several promising strategies
+to enhance the world modeling capabilities of LLMs, including test-time
+scaling, agent training, and more. We hope that Text2World can serve as a
+crucial resource, laying the groundwork for future research in leveraging LLMs
+as world models. The project page is available at
+https://text-to-world.github.io/.
+
+摘要：最近，人们越来越有兴趣利用大型语言模型（LLM）从文本描述中生成符号世界模型。尽管 LLM 已在世界建模的背景下得到广泛探索，但先前的研究遇到了若干挑战，包括评估随机性、对间接指标的依赖以及有限的领域范围。为了解决这些限制，我们引入了基于规划域定义语言（PDDL）的新基准 Text2World，该基准包含数百个不同的域，并采用基于执行的多标准指标来进行更稳健的评估。我们使用 Text2World 对当前的 LLM 进行了基准测试，发现使用大规模强化学习训练的推理模型优于其他模型。然而，即使是性能最佳的模型在世界建模方面仍然表现出有限的能力。基于这些见解，我们研究了几种有希望的策略来增强 LLM 的世界建模能力，包括测试时缩放、代理训练等等。我们希望 Text2World 能够作为一项至关重要的资源，为未来利用 LLM 作为世界模型的研究奠定基础。项目页面可在 https://text-to-world.github.io/ 获得。
+
+##### **KAPPA: A Generic Patent Analysis Framework with Keyphrase-Based Portraits**
+2502.13076v1 by Xin Xia, Yujin Wang, Jun Zhou, Guisheng Zhong, Linning Cai, Chen Zhang
+
+Patent analysis relies heavily on concise and interpretable document
+representations, referred to as patent portraits. Keyphrases, both present and
+absent, are ideal candidates for patent portraits due to their brevity,
+representativeness, and clarity. In this paper, we introduce KAPPA, an
+integrated framework designed to construct keyphrase-based patent portraits and
+enhance patent analysis. KAPPA operates in two phases: patent portrait
+construction and portrait-based analysis. To ensure effective portrait
+construction, we propose a semantic-calibrated keyphrase generation paradigm
+that integrates pre-trained language models with a prompt-based hierarchical
+decoding strategy to leverage the multi-level structural characteristics of
+patents. For portrait-based analysis, we develop a comprehensive framework that
+employs keyphrase-based patent portraits to enable efficient and accurate
+patent analysis. In extensive experiments on benchmark datasets for keyphrase
+generation, the proposed model achieves significant improvements compared to
+state-of-the-art baselines. Further experiments conducted on real-world patent
+applications demonstrate that our keyphrase-based portraits effectively capture
+domain-specific knowledge and enrich semantic representation for patent
+analysis tasks.
+
+摘要：專利分析高度依賴簡潔且可解讀的文件表示，稱為專利描述。關鍵字組，無論是存在的還是不存在的，都是專利描述的理想候選者，因為它們簡潔、具有代表性且清晰。在本文中，我們介紹了 KAPPA，一個用於建構基於關鍵字組的專利描述和增強專利分析的整合式架構。KAPPA 分為兩個階段執行：專利描述建構和基於描述的分析。為確保有效的描述建構，我們提出了一個語義校準關鍵字組生成範例，它將預先訓練的語言模型與基於提示的分層解碼策略整合在一起，以利用專利的多分層結構特性。對於基於描述的分析，我們開發了一個全面的架構，它採用基於關鍵字組的專利描述，以實現高效且準確的專利分析。在關鍵字組生成基準資料集上進行的廣泛實驗中，與最先進的基準線相比，所提出的模型取得了顯著的改進。在真實世界專利申請上進行的進一步實驗表明，我們基於關鍵字組的描述有效地擷取了特定領域的知識，並豐富了專利分析任務的語義表示。
+
+##### **Interactive Agents to Overcome Ambiguity in Software Engineering**
+2502.13069v1 by Sanidhya Vijayvargiya, Xuhui Zhou, Akhila Yerukola, Maarten Sap, Graham Neubig
+
+AI agents are increasingly being deployed to automate tasks, often based on
+ambiguous and underspecified user instructions. Making unwarranted assumptions
+and failing to ask clarifying questions can lead to suboptimal outcomes, safety
+risks due to tool misuse, and wasted computational resources. In this work, we
+study the ability of LLM agents to handle ambiguous instructions in interactive
+code generation settings by evaluating proprietary and open-weight models on
+their performance across three key steps: (a) leveraging interactivity to
+improve performance in ambiguous scenarios, (b) detecting ambiguity, and (c)
+asking targeted questions. Our findings reveal that models struggle to
+distinguish between well-specified and underspecified instructions. However,
+when models interact for underspecified inputs, they effectively obtain vital
+information from the user, leading to significant improvements in performance
+and underscoring the value of effective interaction. Our study highlights
+critical gaps in how current state-of-the-art models handle ambiguity in
+complex software engineering tasks and structures the evaluation into distinct
+steps to enable targeted improvements.
+
+摘要：人工智能代理正越來越多地被部署用於自動化任務，通常基於模棱兩可且未明確規定的使用者指令。做出不合理的假設且未能提出澄清問題，可能導致次佳結果、因工具誤用而產生的安全風險，以及浪費運算資源。在這項工作中，我們研究了 LLM 代理在互動式程式碼生成設定中處理模棱兩可指令的能力，方法是在三個關鍵步驟中評估專有和開放權重的模型： (a) 利用互動性來提升在模棱兩可場景中的效能、(b) 偵測模糊性，以及 (c) 提出目標問題。我們的研究結果顯示，模型難以區分明確規範的指令和未明確規範的指令。然而，當模型針對未明確規範的輸入進行互動時，它們會有效地從使用者取得重要資訊，進而大幅提升效能，並強調有效互動的價值。我們的研究突顯了目前最先進的模型在處理複雜軟體工程任務中的模糊性時存在哪些關鍵差距，並將評估架構為不同的步驟，以促成有目標的改善。
+
+##### **Cramming 1568 Tokens into a Single Vector and Back Again: Exploring the Limits of Embedding Space Capacity**
+2502.13063v1 by Yuri Kuratov, Mikhail Arkhipov, Aydar Bulatov, Mikhail Burtsev
+
+A range of recent works address the problem of compressing a sequence of
+tokens into a shorter sequence of real-valued vectors to be used as inputs
+instead of token embeddings or key-value cache. These approaches reduce the
+amount of compute in existing language models. Despite relying on
+powerful models as encoders, the maximum attainable lossless compression ratio
+is typically not higher than x10. This fact is highly intriguing because, in
+theory, the maximum information capacity of large real-valued vectors is far
+beyond the presented rates even for 16-bit precision and a modest vector size.
+In this work, we explore the limits of compression by replacing the encoder
+with a per-sample optimization procedure. We show that vectors with compression
+ratios up to x1500 exist, which highlights a two-orders-of-magnitude gap between
+existing and practically attainable solutions. Furthermore, we empirically show
+that the compression limits are determined not by the length of the input but
+by the amount of uncertainty to be reduced, namely, the cross-entropy loss on
+this sequence without any conditioning. The obtained limits highlight the
+substantial gap between the theoretical capacity of input embeddings and their
+practical utilization, suggesting significant room for optimization in model
+design.
+
+摘要：一系列近期作品探讨了将序列标记压缩成较短的实值向量序列的问题，以用作输入，而不是标记嵌入或键值缓存。这些方法允许减少现有语言模型中的计算量。尽管依赖于强大的模型作为编码器，但最大可达到的无损压缩比通常不高于 x10。这一事实非常有趣，因为理论上，即使对于 16 位精度和适中的向量大小，大型实值向量的最大信息容量也远远超出了所呈现的速率。在这项工作中，我们通过用按样本优化程序替换编码器来探索压缩的极限。我们表明，存在压缩比高达 x1500 的向量，这突出了现有解决方案和实际可实现解决方案之间两个数量级的差距。此外，我们凭经验表明，压缩极限不是由输入的长度决定的，而是由要减少的不确定性量决定的，即在此序列上的交叉熵损失，没有任何条件。获得的极限突出了输入嵌入的理论容量与其实际利用之间的巨大差距，表明模型设计中有很大的优化空间。
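+
+As a rough back-of-the-envelope illustration of the capacity claim above (not a calculation taken from the paper), a vector with $d$ coordinates stored at $b$ bits each can hold at most $d \cdot b$ bits, while one token drawn from a vocabulary of size $|V|$ carries at most $\log_2 |V|$ bits. The values $d = 4096$ and $|V| = 128{,}000$ below are hypothetical; only the 16-bit precision comes from the abstract.
+
+$$
+N_{\max} = \frac{d \cdot b}{\log_2 |V|} = \frac{4096 \times 16}{\log_2 128{,}000} \approx \frac{65{,}536}{16.97} \approx 3.9 \times 10^{3} \ \text{tokens per vector}
+$$
+
+Under these assumptions, the naive bit-counting bound sits well above both the roughly x10 ratios of encoder-based methods and the 1568 tokens in the title, which is consistent with the gap the abstract describes.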
+
+##### **AI-Assisted Decision Making with Human Learning**
+2502.13062v1 by Gali Noti, Kate Donahue, Jon Kleinberg, Sigal Oren
+
+AI systems increasingly support human decision-making. In many cases, despite
+the algorithm's superior performance, the final decision remains in human
+hands. For example, an AI may assist doctors in determining which diagnostic
+tests to run, but the doctor ultimately makes the diagnosis. This paper studies
+such AI-assisted decision-making settings, where the human learns through
+repeated interactions with the algorithm. In our framework, the algorithm --
+designed to maximize decision accuracy according to its own model -- determines
+which features the human can consider. The human then makes a prediction based
+on their own less accurate model. We observe that the discrepancy between the
+algorithm's model and the human's model creates a fundamental tradeoff. Should
+the algorithm prioritize recommending more informative features, encouraging
+the human to recognize their importance, even if it results in less accurate
+predictions in the short term until learning occurs? Or is it preferable to
+forgo educating the human and instead select features that align more closely
+with their existing understanding, minimizing the immediate cost of learning?
+This tradeoff is shaped by the algorithm's time-discounted objective and the
+human's learning ability. Our results show that optimal feature selection has a
+surprisingly clean combinatorial characterization, reducible to a stationary
+sequence of feature subsets that is tractable to compute. As the algorithm
+becomes more "patient" or the human's learning improves, the algorithm
+increasingly selects more informative features, enhancing both prediction
+accuracy and the human's understanding. Notably, early investment in learning
+leads to the selection of more informative features than a later investment. We
+complement our analysis by showing that the impact of errors in the algorithm's
+knowledge is limited as it does not make the prediction directly.
+
+摘要：人工智慧系統日益支援人類決策。在許多情況下，儘管演算法的效能優異，最終決策仍掌握在人類手中。例如，人工智慧可能會協助醫生決定要執行哪些診斷測試，但最終下診斷的是醫生。本文探討此類人工智慧輔助決策設定，其中人類透過與演算法重複互動而學習。在我們的架構中，演算法（旨在根據其自身模型最大化決策準確度）會決定人類可以考量的特徵。然後，人類根據其自身較不準確的模型做出預測。我們觀察到，演算法模型與人類模型之間的差異會產生基本的權衡。演算法是否應優先推薦更多資訊性特徵，鼓勵人類認識其重要性，即使短期內會導致準確度較低的預測，直到學習發生？或者，是否較好放棄教育人類，而選擇與其現有理解更緊密對齊的特徵，將學習的立即成本降至最低？這種權衡取決於演算法的時間折現目標和人類的學習能力。我們的結果表明，最佳特徵選擇具有令人驚訝的乾淨組合特徵，可簡化為可計算的固定特徵子集序列。隨著演算法變得更「有耐心」或人類的學習進步，演算法會越來越多地選擇更多資訊性特徵，增強預測準確度和人類的理解。值得注意的是，早期投資於學習會導致選擇比後期投資更多資訊性特徵。我們透過顯示演算法知識中錯誤的影響是有限的，因為它不會直接做出預測，來補充我們的分析。
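+
+To make the tradeoff concrete, one generic way to write a time-discounted objective is sketched below. The notation is hypothetical and not taken from the paper: $S_t$ is the feature subset shown in round $t$, $h_t$ is the human's model after $t$ rounds of learning, $\ell$ is the prediction loss, and $\gamma \in (0, 1)$ is the discount factor.
+
+$$
+\min_{S_0, S_1, \dots} \; \sum_{t=0}^{\infty} \gamma^{t} \, \ell(S_t, h_t), \qquad h_{t+1} = \mathrm{update}(h_t, S_t)
+$$
+
+In this reading, a larger $\gamma$ corresponds to a more "patient" algorithm: future losses carry more weight, so accepting a higher loss now in order to improve $h_t$ becomes more attractive.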
+
+##### **Improved Fine-Tuning of Large Multimodal Models for Hateful Meme Detection**
+2502.13061v1 by Jingbiao Mei, Jinghong Chen, Guangyu Yang, Weizhe Lin, Bill Byrne
+
+Hateful memes have become a significant concern on the Internet,
+necessitating robust automated detection systems. While large multimodal models
+have shown strong generalization across various tasks, they exhibit poor
+generalization to hateful meme detection due to the dynamic nature of memes
+tied to emerging social trends and breaking news. Recent work further
+highlights the limitations of conventional supervised fine-tuning for large
+multimodal models in this context. To address these challenges, we propose
+Large Multimodal Model Retrieval-Guided Contrastive Learning (LMM-RGCL), a
+novel two-stage fine-tuning framework designed to improve both in-domain
+accuracy and cross-domain generalization. Experimental results on six widely
+used meme classification datasets demonstrate that LMM-RGCL achieves
+state-of-the-art performance, outperforming agent-based systems such as
+VPD-PALI-X-55B. Furthermore, our method effectively generalizes to
+out-of-domain memes under low-resource settings, surpassing models like GPT-4o.
+
+摘要：網路上的仇恨迷因已成為一大隱憂，因此需要強大的自動化偵測系統。雖然大型多模態模型已在各種任務中展現出強大的泛化能力，但由於迷因與新興社會趨勢和突發新聞息息相關，因此在仇恨迷因偵測方面表現不佳。最近的研究進一步強調了在這種情況下，傳統監督微調對大型多模態模型的限制。為了應對這些挑戰，我們提出了大型多模態模型檢索引導對比學習 (LMM-RGCL)，這是一種新穎的兩階段微調架構，旨在提高領域內準確度和跨領域泛化能力。在六個廣泛使用的迷因分類資料集上的實驗結果表明，LMM-RGCL 達到了最先進的效能，優於基於代理的系統，例如 VPD-PALI-X-55B。此外，我們的模型在低資源設定下有效泛化到領域外迷因，超越了 GPT-4o 等模型。
+
+##### **SimpleVQA: Multimodal Factuality Evaluation for Multimodal Large Language Models**
+2502.13059v1 by Xianfu Cheng, Wei Zhang, Shiwei Zhang, Jian Yang, Xiangyuan Guan, Xianjie Wu, Xiang Li, Ge Zhang, Jiaheng Liu, Yuying Mai, Yutao Zeng, Zhoufutu Wen, Ke Jin, Baorui Wang, Weixiao Zhou, Yunhong Lu, Tongliang Li, Wenhao Huang, Zhoujun Li
+
+The increasing application of multi-modal large language models (MLLMs)
+across various sectors has spotlighted the importance of their output reliability
+and accuracy, particularly their ability to produce content grounded in factual
+information (e.g. common and domain-specific knowledge). In this work, we
+introduce SimpleVQA, the first comprehensive multi-modal benchmark to evaluate
+the factuality ability of MLLMs to answer natural language short questions.
+SimpleVQA is characterized by six key features: it covers multiple tasks and
+multiple scenarios, ensures high quality and challenging queries, maintains
+static and timeless reference answers, and is straightforward to evaluate. Our
+approach involves categorizing visual question-answering items into 9 different
+tasks around objective events or common knowledge and situating these within 9
+topics. Rigorous quality control processes are implemented to guarantee
+high-quality, concise, and clear answers, facilitating evaluation with minimal
+variance via an LLM-as-a-judge scoring system. Using SimpleVQA, we perform a
+comprehensive assessment of 18 leading MLLMs and 8 text-only LLMs, delving into
+their image comprehension and text generation abilities by identifying and
+analyzing error cases.
+
+摘要：隨著多模態大型語言模型 (MLLM) 在各個領域的應用日益普及，其輸出結果的可靠性和準確性已備受關注，特別是其根據事實資訊（例如一般知識和特定領域知識）產生內容的能力。在本文中，我們介紹 SimpleVQA，這是第一個用於評估 MLLM 回答自然語言簡短問題的事實能力的綜合多模態基準。SimpleVQA 有六個主要特徵：涵蓋多項任務和多種情境、確保高品質且具挑戰性的查詢、維護靜態且永恆的參考答案，而且評估起來很簡單。我們的做法是將視覺問答項目分類為 9 個不同的任務，圍繞客觀事件或常識，並將它們置於 9 個主題中。我們實施嚴格的品質控管流程，以保證答案的高品質、簡潔和清晰，並透過 LLM 作為評分系統，以最小的差異進行評估。我們使用 SimpleVQA 對 18 個主要的 MLLM 和 8 個純文字 LLM 進行全面評估，透過找出和分析錯誤案例，深入探討它們的影像理解和文字生成能力。
+
+##### **LAMD: Context-driven Android Malware Detection and Classification with LLMs**
+2502.13055v1 by Xingzhi Qian, Xinran Zheng, Yiling He, Shuo Yang, Lorenzo Cavallaro
+
+The rapid growth of mobile applications has escalated Android malware
+threats. Although there are numerous detection methods, they often struggle
+with evolving attacks, dataset biases, and limited explainability. Large
+Language Models (LLMs) offer a promising alternative with their zero-shot
+inference and reasoning capabilities. However, applying LLMs to Android malware
+detection presents two key challenges: (1) the extensive support code in Android
+applications, often spanning thousands of classes, exceeds LLMs' context limits
+and obscures malicious behavior within benign functionality; (2) the structural
+complexity and interdependencies of Android applications surpass LLMs'
+sequence-based reasoning, fragmenting code analysis and hindering malicious
+intent inference. To address these challenges, we propose LAMD, a practical
+context-driven framework to enable LLM-based Android malware detection. LAMD
+integrates key context extraction to isolate security-critical code regions and
+construct program structures, then applies tier-wise code reasoning to analyze
+application behavior progressively, from low-level instructions to high-level
+semantics, providing final prediction and explanation. A well-designed factual
+consistency verification mechanism is equipped to mitigate LLM hallucinations
+from the first tier. Evaluation in real-world settings demonstrates LAMD's
+effectiveness over conventional detectors, establishing a feasible basis for
+LLM-driven malware analysis in dynamic threat landscapes.
+
+摘要：隨著行動應用程式快速成長，Android 惡意軟體威脅也隨之升級。雖然有許多偵測方法，但它們經常難以應付不斷演進的攻擊、資料集偏差和有限的可解釋性。大型語言模型 (LLM) 提供了一個有前途的替代方案，具備零次學習推理和推理能力。然而，將 LLM 應用於 Android 惡意軟體偵測會出現兩個主要挑戰：(1) Android 應用程式中大量的支援程式碼，通常橫跨數千個類別，超過 LLM 的上下文限制，並模糊了良性功能中的惡意行為；(2) Android 應用程式的結構複雜性和相互依賴性超過 LLM 的基於序列的推理，會造成程式碼分析破碎，並阻礙惡意意圖推論。為了應對這些挑戰，我們提出了 LAMD，一個實用的脈絡驅動架構，以支援基於 LLM 的 Android 惡意軟體偵測。LAMD 整合了關鍵脈絡萃取，以隔離與安全性至關重要的程式碼區域並建構程式結構，然後套用分層式程式碼推理，逐步分析應用程式行為，從低階指令到高階語意，提供最終預測和說明。一個設計良好的事實一致性驗證機制具備減輕 LLM 從第一層產生的幻覺的能力。在真實環境中的評估顯示，LAMD 優於傳統偵測器，為動態威脅環境中的 LLM 驅動惡意軟體分析建立了一個可行的基礎。
+
+##### **Do we still need Human Annotators? Prompting Large Language Models for Aspect Sentiment Quad Prediction**
+2502.13044v1 by Nils Constantin Hellwig, Jakob Fehle, Udo Kruschwitz, Christian Wolff
+
+Aspect sentiment quadruple prediction (ASQP) facilitates a detailed
+understanding of opinions expressed in a text by identifying the opinion term,
+aspect term, aspect category and sentiment polarity for each opinion. However,
+annotating a full set of training examples to fine-tune models for ASQP is a
+resource-intensive process. In this study, we explore the capabilities of large
+language models (LLMs) for zero- and few-shot learning on the ASQP task across
+five diverse datasets. We report F1 scores slightly below those obtained with
+state-of-the-art fine-tuned models but exceeding previously reported zero- and
+few-shot performance. In the 40-shot setting on the Rest16 restaurant domain
+dataset, LLMs achieved an F1 score of 52.46, compared to 60.39 by the
+best-performing fine-tuned method MVP. Additionally, we report the performance
+of LLMs in target aspect sentiment detection (TASD), where the F1 scores were
+also close to fine-tuned models, achieving 66.03 on Rest16 in the 40-shot
+setting, compared to 72.76 with MVP. While human annotators remain essential
+for achieving optimal performance, LLMs can reduce the need for extensive
+manual annotation in ASQP tasks.
+
+摘要：面向觀點的四元預測 (ASQP) 透過辨識各個觀點的觀點詞彙、面向詞彙、面向類別和觀點極性，協助詳細了解文字中表達的意見。然而，標註一組完整的訓練範例以微調 ASQP 模型是一個耗費資源的過程。在這項研究中，我們探討大型語言模型 (LLM) 在 ASQP 任務中進行零次和少量學習的能力，橫跨五個不同的資料集。我們報告的 F1 分數略低於使用最先進的微調模型獲得的分數，但超過先前報告的零次和少量學習表現。在 Rest16 餐廳領域資料集的 40 次學習設定中，LLM 達到了 52.46 的 F1 分數，而效能最佳的微調方法 MVP 則為 60.39。此外，我們報告了 LLM 在目標面向觀點偵測 (TASD) 中的表現，其中 F1 分數也接近微調模型，在 40 次學習設定中於 Rest16 達到 66.03，而 MVP 則為 72.76。儘管人類標註員對於達成最佳效能仍然至關重要，但 LLM 可以減少 ASQP 任務中廣泛手動標註的需求。
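+
+For context on the reported numbers, F1 is the harmonic mean of precision $P$ and recall $R$; the gaps below are computed directly from the scores quoted in the abstract (Rest16, 40-shot LLM versus fine-tuned MVP).
+
+$$
+F_1 = \frac{2PR}{P + R}, \qquad \Delta_{\text{ASQP}} = 60.39 - 52.46 = 7.93, \qquad \Delta_{\text{TASD}} = 72.76 - 66.03 = 6.73
+$$
+
+Both gaps are expressed in F1 points.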
+
+##### **Natural Language Generation from Visual Sequences: Challenges and Future Directions**
+2502.13034v1 by Aditya K Surikuchi, Raquel Fernández, Sandro Pezzelle
+
+The ability to use natural language to talk about visual content is at the
+core of human intelligence and a crucial feature of any artificial intelligence
+system. Various studies have focused on generating text for single images. In
+contrast, comparatively little attention has been paid to exhaustively
+analyzing and advancing work on multiple-image vision-to-text settings. In this
+position paper, we claim that any task dealing with temporally ordered
+sequences of multiple images or frames is an instance of a broader, more
+general problem involving the understanding of intricate relationships between
+the visual content and the corresponding text. We comprehensively analyze five
+tasks that are instances of this problem and argue that they pose a common set
+of challenges and share similarities in terms of modeling and evaluation
+approaches. Based on the insights from these various aspects and stages of
+multi-image-to-text generation, we highlight several open questions and suggest
+future research directions. We believe that these directions can advance the
+understanding of complex phenomena in this domain and the development of better
+models.
+
+摘要：使用自然語言來談論視覺內容的能力是人類智慧的核心，也是任何人工智慧系統的一項關鍵功能。各種研究都專注於為單一影像產生文字。相較之下，對於詳盡分析和推進多重影像視覺轉文字設定的工作，關注較少。在此立場文件中，我們聲稱任何處理多重影像或畫格的時間順序序列的任務，都是一個更廣泛、更普遍問題的範例，涉及理解視覺內容和對應文字之間的複雜關係。我們全面分析了此問題的五個範例任務，並論證它們提出了一組常見的挑戰，且在建模和評估方法方面有相似之處。根據多重影像轉文字生成的這些不同面向和階段的見解，我們突出了幾個開放性問題，並建議未來的研究方向。我們相信這些方向可以推進對此領域中複雜現象的理解，以及開發出更好的模型。
+
+##### **HPSS: Heuristic Prompting Strategy Search for LLM Evaluators**
+2502.13031v1 by Bosi Wen, Pei Ke, Yufei Sun, Cunxiang Wang, Xiaotao Gu, Jinfeng Zhou, Jie Tang, Hongning Wang, Minlie Huang
+
+Since the adoption of large language models (LLMs) for text evaluation has
+become increasingly prevalent in the field of natural language processing
+(NLP), a series of existing works attempt to optimize the prompts for LLM
+evaluators to improve their alignment with human judgment. However, their
+efforts are limited to optimizing individual factors of evaluation prompts,
+such as evaluation criteria or output formats, neglecting the combinatorial
+impact of multiple factors, which leads to insufficient optimization of the
+evaluation pipeline. Nevertheless, identifying well-behaved prompting
+strategies for adjusting multiple factors requires extensive enumeration. To
+this end, we comprehensively integrate 8 key factors for evaluation prompts and
+propose a novel automatic prompting strategy optimization method called
+Heuristic Prompting Strategy Search (HPSS). Inspired by the genetic algorithm,
+HPSS conducts an iterative search to find well-behaved prompting strategies for
+LLM evaluators. A heuristic function is employed to guide the search process,
+enhancing the performance of our algorithm. Extensive experiments across four
+evaluation tasks demonstrate the effectiveness of HPSS, consistently
+outperforming both human-designed evaluation prompts and existing automatic
+prompt optimization methods.
+
+摘要：隨著自然語言處理（NLP）領域中採用大型語言模型（LLM）進行文本評估變得越來越普遍，一系列現有工作嘗試優化 LLM 評估器的提示，以改善它們與人類判斷的一致性。然而，他們的努力僅限於優化評估提示的個別因素，例如評估準則或輸出格式，而忽略了多種因素的組合影響，這導致評估管道優化不足。儘管如此，找出調整多種因素的良好提示策略需要廣泛的枚舉。為此，我們全面整合了評估提示的 8 個關鍵因素，並提出了一種名為啟發式提示策略搜索（HPSS）的新型自動提示策略優化方法。在遺傳演算法的啟發下，HPSS 進行反覆搜索以找出 LLM 評估器的良好提示策略。採用啟發式函數來指導搜索過程，增強了我們演算法的效能。在四項評估任務中進行的廣泛實驗證明了 HPSS 的有效性，始終優於人類設計的評估提示和現有的自動提示優化方法。
+
+##### **Whose story is it? Personalizing story generation by inferring author styles**
+2502.13028v1 by Nischal Ashok Kumar, Chau Minh Pham, Mohit Iyyer, Andrew Lan
+
+Personalization has become essential for improving user experience in
+interactive writing and educational applications, yet its potential in story
+generation remains largely unexplored. In this work, we propose a novel
+two-stage pipeline for personalized story generation. Our approach first infers
+an author's implicit story-writing characteristics from their past work and
+organizes them into an Author Writing Sheet, inspired by narrative theory. The
+second stage uses this sheet to simulate the author's persona through tailored
+persona descriptions and personalized story writing rules. To enable and
+validate our approach, we construct Mythos, a dataset of 590 stories from 64
+authors across five distinct sources that reflect diverse story-writing
+settings. A head-to-head comparison with a non-personalized baseline
+demonstrates our pipeline's effectiveness in generating high-quality
+personalized stories. Our personalized stories achieve a 75 percent win rate
+(versus 14 percent for the baseline and 11 percent ties) in capturing authors'
+writing style based on their past works. Human evaluation highlights the high
+quality of our Author Writing Sheet and provides valuable insights into the
+personalized story generation task. Notable takeaways are that writings from
+certain sources, such as Reddit, are easier to personalize than others, like
+AO3, while narrative aspects, like Creativity and Language Use, are easier to
+personalize than others, like Plot.
+
+摘要：個人化已成為改善互動式寫作和教育應用程式中使用者體驗的必要手段，然而其在故事生成中的潛力仍未被廣泛探索。在這項工作中，我們提出了一個創新的兩階段流程，用於個人化故事生成。我們的做法首先從作者過去的作品中推論出作者隱含的故事寫作特徵，並根據敘事理論將它們組織成作者寫作表。第二階段使用此表透過量身打造的角色描述和個人化故事寫作規則來模擬作者的角色。為了啟用和驗證我們的做法，我們建構了 Mythos，一個包含來自 64 位作者、橫跨五個不同來源的 590 個故事的資料集，這些故事反映了多樣化的故事寫作設定。與非個人化基準進行一對一的比較，證明了我們的流程在生成高品質個人化故事方面的有效性。我們的個人化故事以 75% 的獲勝率（相較於基準的 14% 和 11% 平手）捕捉到作者基於其過去作品的寫作風格。人類評估突顯了我們作者寫作表的優良品質，並提供了對個人化故事生成任務的寶貴見解。值得注意的是，來自某些來源（例如 Reddit）的作品比其他來源（例如 AO3）更容易個人化，而敘事層面（例如創造力和語言使用）比其他層面（例如情節）更容易個人化。
+
+##### **Agentic Deep Graph Reasoning Yields Self-Organizing Knowledge Networks**
+2502.13025v1 by Markus J. Buehler
+
+We present an agentic, autonomous graph expansion framework that iteratively
+structures and refines knowledge in situ. Unlike conventional knowledge graph
+construction methods relying on static extraction or single-pass learning, our
+approach couples a reasoning-native large language model with a continually
+updated graph representation. At each step, the system actively generates new
+concepts and relationships, merges them into a global graph, and formulates
+subsequent prompts based on its evolving structure. Through this
+feedback-driven loop, the model organizes information into a scale-free network
+characterized by hub formation, stable modularity, and bridging nodes that link
+disparate knowledge clusters. Over hundreds of iterations, new nodes and edges
+continue to appear without saturating, while centrality measures and shortest
+path distributions evolve to yield increasingly distributed connectivity. Our
+analysis reveals emergent patterns, such as the rise of highly connected 'hub'
+concepts and the shifting influence of 'bridge' nodes, indicating that agentic,
+self-reinforcing graph construction can yield open-ended, coherent knowledge
+structures. Applied to materials design problems, we present compositional
+reasoning experiments by extracting node-specific and synergy-level principles
+to foster genuinely novel knowledge synthesis, yielding cross-domain ideas that
+transcend rote summarization and strengthen the framework's potential for
+open-ended scientific discovery. We discuss other applications in scientific
+discovery and outline future directions for enhancing scalability and
+interpretability.
+
+摘要：我們提出一個能動的、自主的圖形擴展框架，它反覆地建構和精煉原位知識。與依賴靜態提取或單次學習的傳統知識圖形建構方法不同，我們的做法將一個推理原生的大語言模型與一個持續更新的圖形表示結合起來。在每一步中，系統主動產生新的概念和關係，將它們合併到一個全域圖形中，並根據其不斷演化的結構制定後續提示。透過這個回饋驅動的迴圈，模型將資訊組織成一個無標度網路，其特徵是樞紐形成、穩定的模組化以及連結不同知識群集的橋接節點。在數百次反覆運算中，新的節點和邊緣會持續出現，而不會飽和，同時中心性測量和最短路徑分佈會演化為產生越來越分散的連通性。我們的分析揭示了新興模式，例如高度連接的「樞紐」概念的興起和「橋樑」節點影響力的轉移，這表明能動的、自我強化的圖形建構可以產生開放式、連貫的知識結構。應用於材料設計問題，我們提出組合推理實驗，透過提取特定於節點的原則和協同效應層級原則，以促進真正新穎的知識綜合，產生超越死背式摘要並強化框架在開放式科學發現中潛力的跨領域想法。我們討論了在科學發現中的其他應用，並概述了增強可擴充性和可解釋性的未來方向。
+
+##### **Oreo: A Plug-in Context Reconstructor to Enhance Retrieval-Augmented Generation**
+2502.13019v1 by Sha Li, Naren Ramarkrishnan
+
+Despite the remarkable capabilities of Large Language Models (LLMs) in
+various NLP tasks, they remain vulnerable to hallucinations due to their
+limited parametric knowledge and lack of domain-specific expertise.
+Retrieval-Augmented Generation (RAG) addresses this challenge by incorporating
+external document retrieval to augment the knowledge base of LLMs. In this
+approach, RAG retrieves document chunks from an external corpus in response to
+a query, which are then used as context for the downstream language model to
+generate an answer. However, these retrieved knowledge sources often include
+irrelevant or erroneous information, undermining the effectiveness of RAG in
+downstream tasks. To overcome this limitation, we introduce a compact,
+efficient, and pluggable module designed to refine external knowledge sources
+before feeding them to the generator. The module reconstructs retrieved content
+by extracting the most relevant and supportive information and reorganising it
+into a concise, query-specific format. Through a three-stage training paradigm
+(comprising supervised fine-tuning, contrastive multi-task learning, and
+reinforcement learning-based alignment), it prioritises critical knowledge and
+aligns it with the generator's preferences. This method enables LLMs to produce
+outputs that are more accurate, reliable, and contextually appropriate.
+
+摘要：儘管大型語言模型 (LLM) 在各種自然語言處理任務中具備卓越的能力，但由於其參數知識有限且缺乏特定領域的專業知識，因此它們仍然容易出現幻覺。檢索增強式生成 (RAG) 透過納入外部文件檢索來擴充 LLM 的知識庫，以應對此項挑戰。在此方法中，RAG 會根據查詢檢索外部語料庫中的文件區塊，然後將其用作下游語言模型的背景，以產生答案。然而，這些檢索到的知識來源通常包含不相關或錯誤的資訊，因而損害了 RAG 在下游任務中的效能。為了克服此項限制，我們引入了一個精簡、有效率且可插入的模組，用於在將外部知識來源提供給生成器之前對其進行精煉。此模組透過提取最相關且有用的資訊並將其重新組織成簡潔且特定於查詢的格式，來重建檢索到的內容。透過三階段訓練範例 - 包含監督微調、對比多任務學習以及基於強化學習的比對 - 它優先考量關鍵知識，並使其與生成器的偏好相符。此方法可讓 LLM 產生更準確、可靠且在語境上更適當的輸出。
+
+##### **LLM-Powered Proactive Data Systems**
+2502.13016v1 by Sepanta Zeighami, Yiming Lin, Shreya Shankar, Aditya Parameswaran
+
+With the power of LLMs, we now have the ability to query data that was
+previously impossible to query, including text, images, and video. However,
+despite this enormous potential, most present-day data systems that leverage
+LLMs are reactive, reflecting our community's desire to map LLMs to known
+abstractions. Most data systems treat LLMs as an opaque black box that operates
+on user inputs and data as is, optimizing them much like any other approximate,
+expensive UDFs, in conjunction with other relational operators. Such data
+systems do as they are told, but fail to understand and leverage what the LLM
+is being asked to do (i.e. the underlying operations, which may be
+error-prone), the data the LLM is operating on (e.g., long, complex documents),
+or what the user really needs. They don't take advantage of the characteristics
+of the operations and/or the data at hand, or ensure correctness of results
+when there are imprecisions and ambiguities. We argue that data systems instead
+need to be proactive: they need to be given more agency -- armed with the power
+of LLMs -- to understand and rework the user inputs and the data and to make
+decisions on how the operations and the data should be represented and
+processed. By allowing the data system to parse, rewrite, and decompose user
+inputs and data, or to interact with the user in ways that go beyond the
+standard single-shot query-result paradigm, the data system is able to address
+user needs more efficiently and effectively. These new capabilities lead to a
+rich design space where data systems take more initiative: they are
+empowered to perform optimization based on the transformation operations, data
+characteristics, and user intent. We discuss various successful examples of how
+this framework has been and can be applied in real-world tasks, and present
+future directions for this ambitious research agenda.
+
+摘要：透過 LLM 的強大功能，我們現在能夠查詢過去無法查詢的資料，包括文字、圖片和影片。然而，儘管有如此龐大的潛力，但現今大多數利用 LLM 的資料系統都是被動的，反映出我們的社群希望將 LLM 映射到已知的抽象化。大多數資料系統將 LLM 視為一個不透明的黑盒子，以使用者輸入和資料為基礎進行運作，並像其他近似、昂貴的 UDF 一樣最佳化它們，並與其他關聯運算子結合使用。這些資料系統會照著指示執行，但無法理解並運用 LLM 被要求執行的任務（例如可能容易出錯的基本運算）、LLM 正在運算的資料（例如冗長、複雜的文件），或使用者真正需要的是什麼。它們不會利用運算和/或手邊資料的特性，或在有誤差和歧義時確保結果的正確性。我們認為資料系統應該改為主動：它們需要被賦予更多自主權，並具備 LLM 的強大功能，以了解並重新處理使用者輸入和資料，並就運算和資料的表示和處理方式做出決策。透過允許資料系統解析、改寫和分解使用者輸入和資料，或以超越標準單次查詢結果模式的方式與使用者互動，資料系統能夠更有效率且有效地滿足使用者的需求。這些新功能會帶來一個豐富的設計空間，讓資料系統發揮更多主導性：它們有能力根據轉換運算、資料特性和使用者意圖進行最佳化。我們將討論這個架構如何應用於實際任務，並提出這個雄心勃勃的研究議程的未來方向。
+
+##### **Towards a Design Guideline for RPA Evaluation: A Survey of Large Language Model-Based Role-Playing Agents**
+2502.13012v1 by Chaoran Chen, Bingsheng Yao, Ruishi Zou, Wenyue Hua, Weimin Lyu, Toby Jia-Jun Li, Dakuo Wang
+
+Role-Playing Agent (RPA) is an increasingly popular type of LLM Agent that
+simulates human-like behaviors in a variety of tasks. However, evaluating RPAs
+is challenging due to diverse task requirements and agent designs. This paper
+proposes an evidence-based, actionable, and generalizable evaluation design
+guideline for LLM-based RPA by systematically reviewing 1,676 papers published
+between Jan. 2021 and Dec. 2024. Our analysis identifies six agent attributes,
+seven task attributes, and seven evaluation metrics from existing literature.
+Based on these findings, we present an RPA evaluation design guideline to help
+researchers develop more systematic and consistent evaluation methods.
+
+摘要：角色扮演代理（RPA）是一種越來越流行的 LLM 代理，它能模擬人類在各種任務中的行為。然而，由於任務需求和代理設計的多樣性，評估 RPA 具有挑戰性。本文通過系統地審查 2021 年 1 月至 2024 年 12 月期間發表的 1,676 篇論文，提出了基於證據、可操作且可推廣的 LLM 基於 RPA 的評估設計指南。我們的分析從現有文獻中識別出六個代理屬性、七個任務屬性和七個評估指標。根據這些發現，我們提出了 RPA 評估設計指南，以幫助研究人員開發更系統化和一致的評估方法。
+
+##### **Adaptive Knowledge Graphs Enhance Medical Question Answering: Bridging the Gap Between LLMs and Evolving Medical Knowledge**
+2502.13010v1 by Mohammad Reza Rezaei, Reza Saadati Fard, Jayson Parker, Rahul G. Krishnan, Milad Lankarany
+
+Large Language Models (LLMs) have significantly advanced medical
+question-answering by leveraging extensive clinical data and medical
+literature. However, the rapid evolution of medical knowledge and the
+labor-intensive process of manually updating domain-specific resources pose
+challenges to the reliability of these systems. To address this, we introduce
+Adaptive Medical Graph-RAG (AMG-RAG), a comprehensive framework that automates
+the construction and continuous updating of medical knowledge graphs,
+integrates reasoning, and retrieves current external evidence, such as PubMed
+and WikiSearch. By dynamically linking new findings and complex medical
+concepts, AMG-RAG not only improves accuracy but also enhances interpretability
+in medical queries.
+Evaluations on the MEDQA and MEDMCQA benchmarks demonstrate the effectiveness
+of AMG-RAG, achieving an F1 score of 74.1 percent on MEDQA and an accuracy of
+66.34 percent on MEDMCQA, outperforming both comparable models and those 10 to
+100 times larger. Notably, these improvements are achieved without increasing
+computational overhead, highlighting the critical role of automated knowledge
+graph generation and external evidence retrieval in delivering up-to-date,
+trustworthy medical insights.
+
+摘要：大型語言模型 (LLM) 透過利用廣泛的臨床資料和醫學文獻，大幅提升了醫療問題解答的進步。然而，醫療知識的快速演進和手動更新特定領域資源的繁複程序，對這些系統的可靠性構成挑戰。為了解決這個問題，我們引入了適應性醫療圖表 RAG (AMG-RAG)，這是一個自動化建構和持續更新醫療知識圖表的綜合架構，整合推理並擷取 PubMed 和 WikiSearch 等最新的外部證據。透過動態連結新的發現和複雜的醫療概念，AMG-RAG 不僅提升了準確性，也增強了醫療查詢的可解釋性。在 MEDQA 和 MEDMCQA 基準上的評量證明了 AMG-RAG 的有效性，在 MEDQA 上達到了 74.1% 的 F1 分數，在 MEDMCQA 上達到了 66.34% 的準確度，優於其他同類模型以及那些大 10 到 100 倍的模型。值得注意的是，這些改進是在不增加運算負擔的情況下實現的，突顯了自動化知識圖表生成和外部證據擷取在提供最新、可信賴的醫療見解中扮演的重要角色。
+
+##### **Integrating Reinforcement Learning, Action Model Learning, and Numeric Planning for Tackling Complex Tasks**
+2502.13006v1 by Yarin Benyamin, Argaman Mordoch, Shahaf S. Shperberg, Roni Stern
+
+Automated Planning algorithms require a model of the domain that specifies
+the preconditions and effects of each action. Obtaining such a domain model is
+notoriously hard. Algorithms for learning domain models exist, yet it remains
+unclear whether learning a domain model and planning is an effective approach
+for numeric planning environments, i.e., where states include discrete and
+numeric state variables. In this work, we explore the benefits of learning a
+numeric domain model and compare it with alternative model-free solutions. As a
+case study, we use two tasks in Minecraft, a popular sandbox game that has been
+used as an AI challenge. First, we consider an offline learning setting, where
+a set of expert trajectories is available to learn from. This is the standard
+setting for learning domain models. We used the Numeric Safe Action Model
+Learning (NSAM) algorithm to learn a numeric domain model and solve new
+problems with the learned domain model and a numeric planner. We call this
+model-based solution NSAM_(+p), and compare it to several model-free Imitation
+Learning (IL) and Offline Reinforcement Learning (RL) algorithms. Empirical
+results show that some IL algorithms can learn faster to solve simple tasks,
+while NSAM_(+p) allows solving tasks that require long-term planning and
+enables generalizing to solve problems in larger environments. Then, we
+consider an online learning setting, where learning is done by moving an agent
+in the environment. For this setting, we introduce RAMP. In RAMP, observations
+collected during the agent's execution are used to simultaneously train an RL
+policy and learn a planning domain action model. This forms a positive feedback
+loop between the RL policy and the learned domain model. We demonstrate
+experimentally the benefits of using RAMP, showing that it finds more efficient
+plans and solves more problems than several RL baselines.
+
+摘要：自動化規劃演算法需要一個網域模型，來指定每個動作的前提條件和效果。取得這樣的網域模型出了名的困難。學習網域模型的演算法確實存在，但學習網域模型和規劃是否為數值規劃環境的有效方法仍然不清楚，也就是說，其中狀態包含離散和數值狀態變數。在這項工作中，我們探討學習數值網域模型的優點，並將其與替代的無模型解決方案進行比較。作為一個案例研究，我們使用 Minecraft 中的兩個任務，Minecraft 是一個流行的沙盒遊戲，已被用作 AI 挑戰。首先，我們考慮離線學習設定，其中有一組專家軌跡可供學習。這是學習網域模型的標準設定。我們使用數值安全動作模型學習 (NSAM) 演算法來學習數值網域模型，並使用已學習的網域模型和數值規劃器解決新問題。我們稱此基於模型的解決方案為 NSAM_(+p)，並將其與多種無模型模仿學習 (IL) 和離線強化學習 (RL) 演算法進行比較。經驗結果顯示，一些 IL 演算法可以更快地學習解決簡單任務，而 NSAM_(+p) 允許解決需要長期規劃的任務，並能夠推廣到在更大環境中解決問題。然後，我們考慮線上學習設定，其中學習是透過在環境中移動代理來完成的。對於此設定，我們引入了 RAMP。在 RAMP 中，在代理執行期間收集的觀察結果用於同時訓練 RL 政策和學習規劃網域動作模型。這在 RL 政策和已學習的網域模型之間形成了一個正向回饋迴路。我們透過實驗證明了使用 RAMP 的好處，顯示它比多個 RL 基準找到了更有效的計畫，並解決了更多問題。
+
+##### **Language Barriers: Evaluating Cross-Lingual Performance of CNN and Transformer Architectures for Speech Quality Estimation**
+2502.13004v1 by Wafaa Wardah, Tuğçe Melike Koçak Büyüktaş, Kirill Shchegelskiy, Sebastian Möller, Robert P. Spang
+
+Objective speech quality models aim to predict human-perceived speech quality
+using automated methods. However, cross-lingual generalization remains a major
+challenge, as Mean Opinion Scores (MOS) vary across languages due to
+linguistic, perceptual, and dataset-specific differences. A model trained
+primarily on English data may struggle to generalize to languages with
+different phonetic, tonal, and prosodic characteristics, leading to
+inconsistencies in objective assessments. This study investigates the
+cross-lingual performance of two speech quality models: NISQA, a CNN-based
+model, and a Transformer-based Audio Spectrogram Transformer (AST) model. Both
+models were trained exclusively on English datasets containing over 49,000
+speech samples and subsequently evaluated on speech in German, French,
+Mandarin, Swedish, and Dutch. We analyze model performance using Pearson
+Correlation Coefficient (PCC) and Root Mean Square Error (RMSE) across five
+speech quality dimensions: coloration, discontinuity, loudness, noise, and MOS.
+Our findings show that while AST achieves a more stable cross-lingual
+performance, both models exhibit noticeable biases. Notably, Mandarin speech
+quality predictions correlate highly with human MOS scores, whereas Swedish and
+Dutch present greater prediction challenges. Discontinuities remain difficult
+to model across all languages. These results highlight the need for more
+balanced multilingual datasets and architecture-specific adaptations to improve
+cross-lingual generalization.
+
+摘要：客觀語音品質模型旨在使用自動化方法預測人類感知的語音品質。然而，跨語言的概化仍然是一項重大挑戰，因為平均意見分數 (MOS) 會因語言的不同而有所不同，這是由於語言、感知和特定於資料集的差異所致。主要使用英語資料訓練的模型可能會難以概化到具有不同語音、聲調和韻律特徵的語言，導致客觀評估不一致。本研究探討了兩種語音品質模型的跨語言效能：基於 CNN 的 NISQA 模型和基於 Transformer 的音訊光譜 Transformer (AST) 模型。這兩種模型都僅使用包含超過 49,000 個語音範例的英語資料集進行訓練，然後在德語、法語、普通話、瑞典語和荷蘭語的語音上進行評估。我們使用皮爾森相關係數 (PCC) 和均方根誤差 (RMSE) 分析五個語音品質維度的模型效能：色彩、不連續性、響度、雜訊和 MOS。我們的研究結果顯示，儘管 AST 達到了更穩定的跨語言效能，但這兩種模型都表現出明顯的偏差。值得注意的是，普通話語音品質預測與人類 MOS 分數高度相關，而瑞典語和荷蘭語則呈現出更大的預測挑戰。不連續性在所有語言中仍然難以建模。這些結果凸顯了對更平衡的多語言資料集和特定於架構的調整的需求，以改善跨語言的概化。
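+
+As a reading aid (not taken from the paper), the two evaluation metrics named in this abstract have standard definitions. The notation is an assumption made for illustration: $y_i$ is the human rating of sample $i$ for one quality dimension, $\hat{y}_i$ is the model's prediction, and $n$ is the number of samples in a language's test set.
+
+```latex
+\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i,
+\qquad
+\bar{\hat{y}} = \frac{1}{n}\sum_{i=1}^{n} \hat{y}_i
+
+\mathrm{PCC} =
+  \frac{\sum_{i=1}^{n} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}
+       {\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}\,
+        \sqrt{\sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2}}
+
+\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (\hat{y}_i - y_i)^2}
+```
+
+PCC measures only how well predictions track the ordering and linear trend of the ratings, so a model can score a high PCC while carrying a constant offset. RMSE penalizes that offset, which is presumably why the two are reported together when looking for per-language bias.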
+
+##### **You need to MIMIC to get FAME: Solving Meeting Transcript Scarcity with a Multi-Agent Conversations**
+2502.13001v1 by Frederic Kirstein, Muneeb Khan, Jan Philip Wahle, Terry Ruas, Bela Gipp
+
+Meeting summarization suffers from limited high-quality data, mainly due to
+privacy restrictions and expensive collection processes. We address this gap
+with FAME, a dataset of 500 meetings in English and 300 in German produced by
+MIMIC, our new multi-agent meeting synthesis framework that generates meeting
+transcripts on a given knowledge source by defining psychologically grounded
+participant profiles, outlining the conversation, and orchestrating a large
+language model (LLM) debate. A modular post-processing step refines these
+outputs, mitigating potential repetitiveness and overly formal tones, ensuring
+coherent, credible dialogues at scale. We also propose a psychologically
+grounded evaluation framework assessing naturalness, social behavior
+authenticity, and transcript difficulties. Human assessments show that FAME
+approximates real-meeting spontaneity (4.5/5 in naturalness), preserves
+speaker-centric challenges (3/5 in spoken language), and introduces richer
+information-oriented difficulty (4/5 in difficulty). These findings highlight
+that FAME is a good and scalable proxy for real-world meeting conditions. It
+enables new test scenarios for meeting summarization research and other
+conversation-centric applications in tasks requiring conversation data or
+simulating social scenarios under behavioral constraints.
+
+摘要：會議摘要因缺乏高品質資料而受限，主要是由於隱私限制和昂貴的收集程序。我們透過 FAME 來解決這個差距，FAME 是 MIMIC 製作的 500 場英文會議和 300 場德文會議的資料集，MIMIC 是我們新的多重代理會議合成架構，透過定義心理基礎的參與者設定檔、概述對話，並協調大型語言模型 (LLM) 辯論，在給定的知識來源上產生會議記錄。模組化後處理步驟會改善這些輸出，減輕潛在的重複性和過於正式的語氣，確保大規模的對話連貫且可信。我們也提出一個心理基礎的評估架構，評估自然性、社交行為真實性，以及記錄難度。人類評估顯示，FAME 近似於真實會議的即興性（自然性 4.5/5），保留以講者為中心的挑戰（口語 3/5），並引入更豐富的資訊導向難度（難度 4/5）。這些發現強調 FAME 是真實世界會議條件的良好且可擴充的代理。它能為會議摘要研究和其他對話為中心的應用程式啟用新的測試情境，在需要對話資料或在行為限制下模擬社交情境的任務中。
+
+##### **Personalized Top-k Set Queries Over Predicted Scores**
+2502.12998v1 by Sohrab Namazi Nia, Subhodeep Ghosh, Senjuti Basu Roy, Sihem Amer-Yahia
+
+This work studies the applicability of expensive external oracles such as
+large language models in answering top-k queries over predicted scores. Such
+scores are incurred by user-defined functions to answer personalized queries
+over multi-modal data. We propose a generic computational framework that
+handles arbitrary set-based scoring functions, as long as the functions can
+be decomposed into constructs, each of which is sent to an oracle (in our case an
+LLM) to predict partial scores. At a given point in time, the framework assumes
+a set of responses and their partial predicted scores, and it maintains a
+collection of possible sets that are likely to be the true top-k. Since calling
+oracles is costly, our framework judiciously identifies the next construct,
+i.e., the next best question to ask the oracle so as to maximize the likelihood
+of identifying the true top-k. We present a principled probabilistic model that
+quantifies that likelihood. We study efficiency opportunities in designing
+algorithms. We run an evaluation with three large scale datasets, scoring
+functions, and baselines. Experiments indicate the efficacy of our framework,
+as it achieves an order-of-magnitude improvement over baselines in the number of
+LLM calls required while ensuring result accuracy. Scalability experiments further
+indicate that our framework could be used in large-scale applications.
+
+摘要：本研究探討在預測分數中回答前 k 個查詢時，昂貴的外部預言（例如大型語言模型）的適用性。此類分數是由使用者定義的函式產生，用於回答多模態資料中的個人化查詢。我們提出一個通用的運算框架，用於處理任意基於集合的計分函式，只要這些函式可以分解為建構區塊，然後將每個建構區塊傳送給預言（在本例中為 LLM）以預測部分分數。在特定時間點，此框架假設一組回應及其部分預測分數，並維護一組可能成為真實前 k 個的集合。由於呼叫預言的成本很高，因此我們的框架會明智地找出下一個建構區塊，亦即下一個最佳問題，以詢問預言，以便最大化找出真實前 k 個的可能性。我們提出一個基於原理的機率模型，用於量化此可能性。我們研究設計演算法時的效率機會。我們針對三個大型資料集、計分函式和基準執行評估。實驗結果指出我們框架的效能，因為它在需要 LLM 呼叫的同時確保結果準確性，比基準進步了一個數量級。可擴充性實驗進一步指出我們的框架可用於大型應用程式。
+
+##### **Eager Updates For Overlapped Communication and Computation in DiLoCo**
+2502.12996v1 by Satyen Kale, Arthur Douillard, Yanislav Donchev
+
+Distributed optimization methods such as DiLoCo have been shown to be
+effective in training very large models across multiple distributed workers,
+such as datacenters. These methods split updates into two parts: an inner
+optimization phase, where the workers independently execute multiple
+optimization steps on their own local data, and an outer optimization step,
+where the inner updates are synchronized. While such approaches require orders
+of magnitude less communication than standard data-parallel training, in
+settings where the workers are datacenters, even the limited communication
+requirements of these approaches can still cause significant slowdowns due to
+the blocking necessary at each outer optimization step. In this paper, we
+investigate techniques to mitigate this issue by overlapping communication with
+computation in a manner that allows the outer optimization step to fully
+overlap with the inner optimization phase. We show that a particular variant,
+dubbed eager updates, provides competitive performance with standard DiLoCo in
+settings with low bandwidth between workers.
+
+摘要：分散式優化方法（例如 DiLoCo）已被證明可有效訓練橫跨多個分散式工作者的超大型模型，例如資料中心。這些方法將更新拆分為兩部分：內部最佳化階段，其中工作者獨立地在自己的本地資料上執行多個最佳化步驟，以及外部最佳化步驟，其中內部更新會同步。雖然此類方法所需的通訊量比標準資料平行訓練少幾個數量級，但在工作者為資料中心的情況下，即使這些方法有限的通訊需求仍可能由於每個外部最佳化步驟所需的封鎖而導致顯著的減速。在本文中，我們探討了透過以允許外部最佳化步驟與內部最佳化階段完全重疊的方式將通訊與運算重疊，來減輕此問題的技術。我們展示了一個特定變體，稱為即時更新，在工作者之間頻寬較低的情況下，可提供與標準 DiLoCo 相當的效能。
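+
+The two-part update described in this abstract can be written out roughly as follows. This is a hedged sketch, not the paper's own formulation: the symbols ($K$ workers, $H$ inner steps, the operator names $\mathrm{InnerOpt}$ and $\mathrm{OuterOpt}$, and the local batch $\mathcal{B}_k$) are assumptions for illustration, and plain averaging of worker deltas is assumed as the synchronization rule.
+
+```latex
+% Inner phase: each worker k starts from the shared parameters and
+% takes H local steps on its own data, with no communication.
+\theta_k^{(t,0)} = \theta^{(t-1)},
+\qquad
+\theta_k^{(t,h)} = \mathrm{InnerOpt}\!\left(\theta_k^{(t,h-1)},\, \mathcal{B}_k^{(t,h)}\right),
+\quad h = 1,\dots,H
+
+% Outer step: worker deltas are averaged (the communication that blocks)
+% and applied to the shared parameters.
+\Delta^{(t)} = \frac{1}{K}\sum_{k=1}^{K}\left(\theta^{(t-1)} - \theta_k^{(t,H)}\right),
+\qquad
+\theta^{(t)} = \mathrm{OuterOpt}\!\left(\theta^{(t-1)},\, \Delta^{(t)}\right)
+```
+
+In this form, computing $\Delta^{(t)}$ requires every worker to finish and exchange parameters before round $t+1$ can begin. That wait is the blocking cost the abstract proposes to hide by overlapping the outer step with the next inner phase.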
+
+##### **Free Argumentative Exchanges for Explaining Image Classifiers**
+2502.12995v1 by Avinash Kori, Antonio Rago, Francesca Toni
+
+Deep learning models are powerful image classifiers but their opacity hinders
+their trustworthiness. Explanation methods for capturing the reasoning process
+within these classifiers faithfully and in a clear manner are scarce, due to
+their sheer complexity and size. We provide a solution for this problem by
+defining a novel method for explaining the outputs of image classifiers with
+debates between two agents, each arguing for a particular class. We obtain
+these debates as concrete instances of Free Argumentative eXchanges (FAXs), a
+novel argumentation-based multi-agent framework allowing agents to internalise
+opinions by other agents differently than originally stated. We define two
+metrics (consensus and persuasion rate) to assess the usefulness of FAXs as
+argumentative explanations for image classifiers. We then conduct a number of
+empirical experiments showing that FAXs perform well along these metrics as
+well as being more faithful to the image classifiers than conventional,
+non-argumentative explanation methods. All our implementations can be found at
+https://github.com/koriavinash1/FAX.
+
+摘要：深度學習模型是強大的影像分類器，但其不透明性阻礙了其可信度。由於其極高的複雜性和規模，忠實且清楚地捕捉這些分類器內部推理過程的解釋方法很少見。我們透過定義一種新穎的方法來解決這個問題，該方法透過兩個代理之間的辯論來解釋影像分類器的輸出，每個代理都主張一個特定類別。我們將這些辯論作為自由論證交換 (FAX) 的具體實例，這是一個新穎的基於論證的多代理架構，允許代理以不同於原始陳述的方式內化其他代理的意見。我們定義了兩個指標（共識率和說服率）來評估 FAX 作為影像分類器論證解釋的有用性。然後，我們進行了多項實證實驗，表明 FAX 在這些指標上表現良好，並且比傳統的非論證解釋方法更忠實於影像分類器。我們所有的實作都可以在 https://github.com/koriavinash1/FAX 中找到。
+
+##### **B-cos LM: Efficiently Transforming Pre-trained Language Models for Improved Explainability**
+2502.12992v1 by Yifan Wang, Sukrut Rao, Ji-Ung Lee, Mayank Jobanputra, Vera Demberg
+
+Post-hoc explanation methods for black-box models often struggle with
+faithfulness and human interpretability due to the lack of explainability in
+current neural models. Meanwhile, B-cos networks have been introduced to
+improve model explainability through architectural and computational
+adaptations, but their application has so far been limited to computer vision
+models and their associated training pipelines. In this work, we introduce
+B-cos LMs, i.e., B-cos networks empowered for NLP tasks. Our approach directly
+transforms pre-trained language models into B-cos LMs by combining B-cos
+conversion and task fine-tuning, improving efficiency compared to previous
+B-cos methods. Our automatic and human evaluation results demonstrate that
+B-cos LMs produce more faithful and human interpretable explanations than
+post-hoc methods, while maintaining task performance comparable to conventional
+fine-tuning. Our in-depth analysis explores how B-cos LMs differ from
+conventionally fine-tuned models in their learning processes and explanation
+patterns. Finally, we provide practical guidelines for effectively building
+B-cos LMs based on our findings. Our code is available at
+https://anonymous.4open.science/r/bcos_lm.
+
+摘要：黑盒模型的事后解释方法通常会因为当前神经模型缺乏可解释性而难以做到忠实和人类可解释。与此同时，B-cos 网络已被引入，以通过架构和计算改编来提高模型的可解释性，但到目前为止，它们的应用仅限于计算机视觉模型及其相关的训练管道。在这项工作中，我们引入了 B-cos LM，即针对 NLP 任务增强的 B-cos 网络。我们的方法通过结合 B-cos 转换和任务微调，将预训练的语言模型直接转换为 B-cos LM，与以前 B-cos 方法相比，提高了效率。我们的自动和人工评估结果表明，与事后方法相比，B-cos LM 产生了更忠实和人类可解释的解释，同时保持与传统微调相当的任务性能。我们的深入分析探讨了 B-cos LM 在其学习过程和解释模式中与传统微调模型有何不同。最后，我们根据我们的发现提供了有效构建 B-cos LM 的实用指南。我们的代码可在 https://anonymous.4open.science/r/bcos_lm 获得。
+
+##### **Beyond Profile: From Surface-Level Facts to Deep Persona Simulation in LLMs**
+2502.12988v1 by Zixiao Wang, Duzhen Zhang, Ishita Agrawal, Shen Gao, Le Song, Xiuying Chen
+
+Previous approaches to persona simulation with large language models (LLMs) have
+typically relied on learning basic biographical information, or using limited
+role-play dialogue datasets to capture a character's responses. However, a
+holistic representation of an individual goes beyond surface-level facts or
+conversations to deeper thoughts and thinking. In this work, we introduce
+CharacterBot, a model designed to replicate both the linguistic patterns and
+distinctive thought processes of a character. Using Lu Xun, a renowned Chinese
+writer, as a case study, we propose four training tasks derived from his 17
+essay collections. These include a pre-training task focused on mastering
+external linguistic structures and knowledge, as well as three fine-tuning
+tasks: multiple-choice question answering, generative question answering, and
+style transfer, each aligning the LLM with Lu Xun's internal ideation and
+writing style. To optimize learning across these tasks, we introduce a CharLoRA
+parameter updating mechanism, where a general linguistic style expert
+collaborates with other task-specific experts to better study both the language
+style and the understanding of deeper thoughts. We evaluate CharacterBot on
+three tasks for linguistic accuracy and opinion comprehension, demonstrating
+that it significantly outperforms the baselines on our adapted metrics. We hope
+that this work inspires future research on deep character persona simulation
+with LLMs.
+
+摘要：以前對角色模擬大型語言模型 (LLM) 的方法通常依賴於學習基本傳記資訊，或使用有限的角色扮演對話資料集來捕捉角色的反應。然而，對個人的整體表徵超越了表面層面的事實或對話，深入到更深層的想法和思考。在這項工作中，我們引入了 CharacterBot，一個旨在複製角色的語言模式和獨特思考過程的模型。以著名的中國作家魯迅為案例研究，我們提出了四個從他的 17 篇散文集中衍生的訓練任務。其中包括一個預訓練任務，專注於掌握外部語言結構和知識，以及三個微調任務：多選題回答、生成式問答和風格轉移，每個任務都將 LLM 與魯迅的內部觀念和寫作風格相結合。為了優化這些任務的學習，我們引入了一個 CharLoRA 參數更新機制，其中一位通曉語言風格的專家與其他特定任務專家合作，以更好地研究語言風格和對深層思想的理解。我們在三項任務上評估了 CharacterBot 的語言準確性和意見理解，證明它在我們調整的指標上顯著優於基準。我們希望這項工作能激勵未來對深度角色角色模擬 LLM 的研究。
+
+##### **PartSDF: Part-Based Implicit Neural Representation for Composite 3D Shape Parametrization and Optimization**
+2502.12985v1 by Nicolas Talabot, Olivier Clerc, Arda Cinar Demirtas, Doruk Oner, Pascal Fua
+
+Accurate 3D shape representation is essential in engineering applications
+such as design, optimization, and simulation. In practice, engineering
+workflows require structured, part-aware representations, as objects are
+inherently designed as assemblies of distinct components. However, most
+existing methods either model shapes holistically or decompose them without
+predefined part structures, limiting their applicability in real-world design
+tasks. We propose PartSDF, a supervised implicit representation framework that
+explicitly models composite shapes with independent, controllable parts while
+maintaining shape consistency. Despite its simple single-decoder architecture,
+PartSDF outperforms both supervised and unsupervised baselines in
+reconstruction and generation tasks. We further demonstrate its effectiveness
+as a structured shape prior for engineering applications, enabling precise
+control over individual components while preserving overall coherence. Code
+available at https://github.com/cvlab-epfl/PartSDF.
+
+摘要：精確的 3D 形狀表示在工程應用中至關重要，例如設計、最佳化和模擬。實際上，工程工作流程需要結構化、零件感知的表示，因為物體本質上是設計為不同元件的組件。然而，大多數現有方法不是整體建模形狀，就是將其分解，而沒有預先定義的零件結構，這限制了它們在實際設計任務中的適用性。我們提出 PartSDF，一個監督式的隱式表示框架，它明確地使用獨立、可控的零件對複合形狀進行建模，同時保持形狀一致性。儘管其單一的解碼器架構很簡單，但 PartSDF 在重建和生成任務中都優於監督式和非監督式基準。我們進一步證明了其作為工程應用結構化形狀先驗的有效性，能夠精確控制各個元件，同時保持整體一致性。程式碼可在 https://github.com/cvlab-epfl/PartSDF 取得。
+
+##### **Sailor2: Sailing in South-East Asia with Inclusive Multilingual LLMs**
+2502.12982v1 by Longxu Dou, Qian Liu, Fan Zhou, Changyu Chen, Zili Wang, Ziqi Jin, Zichen Liu, Tongyao Zhu, Cunxiao Du, Penghui Yang, Haonan Wang, Jiaheng Liu, Yongchi Zhao, Xiachong Feng, Xin Mao, Man Tsung Yeung, Kunat Pipatanakul, Fajri Koto, Min Si Thu, Hynek Kydlíček, Zeyi Liu, Qunshu Lin, Sittipong Sripaisarnmongkol, Kridtaphad Sae-Khow, Nirattisai Thongchim, Taechawat Konkaew, Narong Borijindargoon, Anh Dao, Matichon Maneegard, Phakphum Artkaew, Zheng-Xin Yong, Quan Nguyen, Wannaphong Phatthiyaphaibun, Hoang H. Tran, Mike Zhang, Shiqi Chen, Tianyu Pang, Chao Du, Xinyi Wan, Wei Lu, Min Lin
+
+Sailor2 is a family of cutting-edge multilingual language models for
+South-East Asian (SEA) languages, available in 1B, 8B, and 20B sizes to suit
+diverse applications. Building on Qwen2.5, Sailor2 undergoes continuous
+pre-training on 500B tokens (400B SEA-specific and 100B replay tokens) to
+support 13 SEA languages while retaining proficiency in Chinese and English.
+The Sailor2-20B model achieves a 50-50 win rate against GPT-4o across SEA
+languages. We also deliver a comprehensive cookbook on how to develop the
+multilingual model in an efficient manner, including five key aspects: data
+curation, pre-training, post-training, model customization and evaluation. We
+hope that the Sailor2 model (Apache 2.0 license) will drive language development in
+the SEA region, and that the Sailor2 cookbook will inspire researchers to build more
+inclusive LLMs for other under-served languages.
+
+摘要：Sailor2 是一系列針對東南亞 (SEA) 語言的尖端多語言語言模型，備有 1B、8B 和 20B 大小，以適應各種應用。在 Qwen2.5 的基礎上，Sailor2 持續進行 500B 代幣（400B SEA 專用和 100B 重播代幣）的預訓練，以支援 13 種 SEA 語言，同時保留中文和英文的熟練度。Sailor2-20B 模型在 SEA 語言中對抗 GPT-4o 時，達到 50-50 的獲勝率。我們還提供一本全面的食譜，說明如何以有效的方式開發多語言模型，包括五個關鍵方面：資料策展、預訓練、後訓練、模型自訂和評估。我們希望 Sailor2 模型（Apache 2.0 授權）將推動 SEA 地區的語言發展，而 Sailor2 食譜將激勵研究人員為其他服務不足的語言建立更具包容性的 LLM。
+
+##### **Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking**
+2502.12970v1 by Junda Zhu, Lingyong Yan, Shuaiqiang Wang, Dawei Yin, Lei Sha
+
+The reasoning abilities of Large Language Models (LLMs) have demonstrated
+remarkable advancement and exceptional performance across diverse domains.
+However, leveraging these reasoning capabilities to enhance LLM safety against
+adversarial attacks and jailbreak queries remains largely unexplored. To bridge
+this gap, we propose Reasoning-to-Defend (R2D), a novel training paradigm that
+integrates safety reflections of queries and responses into LLMs' generation
+process, unlocking a safety-aware reasoning mechanism. This approach enables
+self-evaluation at each reasoning step to create safety pivot tokens as
+indicators of the response's safety status. Furthermore, in order to improve
+the learning efficiency of pivot token prediction, we propose Contrastive Pivot
+Optimization (CPO), which enhances the model's ability to perceive the safety
+status of dialogues. Through this mechanism, LLMs dynamically adjust their
+response strategies during reasoning, significantly enhancing their defense
+capabilities against jailbreak attacks. Extensive experimental results
+demonstrate that R2D effectively mitigates various attacks and improves overall
+safety, highlighting the substantial potential of safety-aware reasoning in
+strengthening LLMs' robustness against jailbreaks.
+
+摘要：大型語言模型 (LLM) 的推理能力已展現出顯著的進步，並在不同的領域中表現出色。然而，利用這些推理能力來增強 LLM 對抗攻擊和越獄查詢的安全性仍然是未開發的領域。為了彌補這個差距，我們提出了推理防禦 (R2D)，這是一種新穎的訓練範例，它將查詢和回應的安全考量整合到 LLM 的生成過程中，開啟了一個安全感知推理機制。此方法可以在每個推理步驟中進行自我評估，以建立安全樞紐標記，作為回應安全狀態的指標。此外，為了提高樞紐標記預測的學習效率，我們提出了對比樞紐最佳化 (CPO)，它增強了模型感知對話安全狀態的能力。透過此機制，LLM 在推理過程中動態調整其回應策略，大幅增強其對抗越獄攻擊的防禦能力。廣泛的實驗結果證明，R2D 有效地減輕了各種攻擊，並改善了整體安全性，突顯了安全感知推理在加強 LLM 對抗越獄的穩健性方面的潛力。
+
+##### **A Survey of Text Classification Under Class Distribution Shift**
+2502.12965v1 by Adriana Valentina Costache, Silviu Florin Gheorghe, Eduard Gabriel Poesina, Paul Irofti, Radu Tudor Ionescu
+
+The basic underlying assumption of machine learning (ML) models is that the
+training and test data are sampled from the same distribution. However, in
+daily practice, this assumption is often broken, i.e., the distribution of the
+test data changes over time, which hinders the application of conventional ML
+models. One domain where the distribution shift naturally occurs is text
+classification, since people always find new topics to discuss. To this end, we
+survey research articles studying open-set text classification and related
+tasks. We divide the methods in this area based on the constraints that define
+the kind of distribution shift and the corresponding problem formulation,
+i.e., learning with the Universum, zero-shot learning, and open-set learning. We
+next discuss the predominant mitigation approaches for each problem setup.
+Finally, we identify several future work directions, aiming to push the
+boundaries beyond the state of the art. Interestingly, we find that continual
+learning can solve many of the issues caused by the shifting class
+distribution. We maintain a list of relevant papers at
+https://github.com/Eduard6421/Open-Set-Survey.
+
+摘要：機器學習 (ML) 模型的基本假設是訓練資料和測試資料取樣自同一個分佈。然而，在日常實務中，這個假設經常被打破，也就是說測試資料的分布會隨著時間改變，這會阻礙傳統 ML 模型的應用。分佈轉移自然發生的其中一個領域是文字分類，因為人們總能找到新的主題來討論。為此，我們調查研究開放集文字分類和相關任務的研究文章。我們根據定義分佈轉移的類型和對應問題公式的限制，將這個領域的方法分為：使用 Universum 學習、零次學習和開放集學習。接下來，我們討論每個問題設定的主要緩解方法。最後，我們找出幾個未來的研究方向，目標是將界線推展到現有技術的極限之外。有趣的是，我們發現持續學習可以解決許多由類別分佈轉移所造成的議題。我們在 https://github.com/Eduard6421/Open-Set-Survey 維護一份相關論文清單。
+
+##### **Trust Me, I'm Wrong: High-Certainty Hallucinations in LLMs**
+2502.12964v1 by Adi Simhi, Itay Itzhak, Fazl Barez, Gabriel Stanovsky, Yonatan Belinkov
+
+Large Language Models (LLMs) often generate outputs that lack grounding in
+real-world facts, a phenomenon known as hallucinations. Prior research has
+associated hallucinations with model uncertainty, leveraging this relationship
+for hallucination detection and mitigation. In this paper, we challenge the
+underlying assumption that all hallucinations are associated with uncertainty.
+Using knowledge detection and uncertainty measurement methods, we demonstrate
+that models can hallucinate with high certainty even when they have the correct
+knowledge. We further show that high-certainty hallucinations are consistent
+across models and datasets, distinctive enough to be singled out, and challenge
+existing mitigation methods. Our findings reveal an overlooked aspect of
+hallucinations, emphasizing the need to understand their origins and improve
+mitigation strategies to enhance LLM safety. The code is available at
+https://github.com/technion-cs-nlp/Trust_me_Im_wrong.
+
+摘要：大型語言模型 (LLM) 經常產生缺乏真實世界事實根據的輸出，這種現象稱為幻覺。先前的研究已將幻覺與模型不確定性聯繫起來，利用這種關係進行幻覺偵測和緩解。在本文中，我們挑戰所有幻覺都與不確定性相關的基本假設。使用知識偵測和不確定性測量方法，我們證明模型即使擁有正確的知識，也能以高度確定性產生幻覺。我們進一步表明，高確定性幻覺在模型和資料集之間是一致的，足夠獨特以至於可以單獨挑選出來，並挑戰現有的緩解方法。我們的研究結果揭示了幻覺的一個被忽視的方面，強調需要了解其起源並改進緩解策略以增強 LLM 安全性。可以在 https://github.com/technion-cs-nlp/Trust_me_Im_wrong 找到程式碼。
+
+##### **Infinite Retrieval: Attention Enhanced LLMs in Long-Context Processing**
+2502.12962v1 by Xiaoju Ye, Zhichun Wang, Jingyuan Wang
+
+Limited by the context window size of Large Language Models (LLMs), handling
+various tasks with input tokens exceeding the upper limit has been challenging,
+whether it is a simple direct retrieval task or a complex multi-hop reasoning
+task. Although various methods have been proposed to enhance the long-context
+processing capabilities of LLMs, they either incur substantial post-training
+costs, or require additional tool modules (e.g., RAG), or have not shown
+significant improvement in realistic tasks. Our work observes the correlation
+between the attention distribution and generated answers across each layer, and
+establishes through experiments that the attention allocation aligns with
+retrieval-augmented capabilities. Drawing on the above insights, we propose a
+novel method, InfiniRetri, that leverages the LLM's own attention information to
+enable accurate retrieval across inputs of infinite length. Our evaluations
+indicate that InfiniRetri achieves 100% accuracy in the
+Needle-In-a-Haystack (NIH) test over 1M tokens using a 0.5B parameter model,
+surpassing other methods or larger models and setting a new
+state of the art (SOTA). Moreover, our method achieves significant performance
+improvements on real-world benchmarks, with a maximum 288% improvement. In
+addition, InfiniRetri can be applied to any Transformer-based LLM without
+additional training and substantially reduces inference latency and compute
+overhead in long texts. In summary, our comprehensive studies show
+InfiniRetri's potential for practical applications and create a paradigm for
+retrieving information using an LLM's own capabilities over infinite-length
+inputs. Code will be released in link.
+
+摘要：受限于大型语言模型 (LLM) 的上下文窗口大小，处理超出上限的输入标记的各种任务一直具有挑战性，无论是简单的直接检索任务还是复杂的多跳推理任务。虽然已经提出了各种方法来增强 LLM 的长上下文处理能力，但它们要么产生大量的后训练成本，要么需要额外的工具模块（例如，RAG），要么在实际任务中没有显示出显着的改进。我们的工作观察了每层注意力分布和生成答案之间的相关性，并通过实验建立了注意力分配与检索增强能力保持一致。根据上述见解，我们提出了一种新方法 InfiniRetri，该方法利用 LLM 自身的注意力信息来实现对无限长度输入的准确检索。我们的评估表明，InfiniRetri 在使用 0.5B 参数模型对超过 100 万个标记的针头干草堆 (NIH) 测试中实现了 100% 的准确率，超越了其他方法或更大的模型，并创造了新的最先进 (SOTA)。此外，我们的方法在实际基准上实现了显著的性能提升，最大提升了 288%。此外，InfiniRetri 可以应用于任何基于 Transformer 的 LLM，而无需额外的训练，并且可以大幅减少推理延迟和长文本中的计算开销。总之，我们的综合研究表明了 InfiniRetri 在实际应用中的潜力，并为使用 LLM 自身能力在无限长度标记下检索信息创造了一个范例。代码将在链接中发布。
+
+##### **Adaptive Tool Use in Large Language Models with Meta-Cognition Trigger**
+2502.12961v1 by Wenjun Li, Dexun Li, Kuicai Dong, Cong Zhang, Hao Zhang, Weiwen Liu, Yasheng Wang, Ruiming Tang, Yong Liu
+
+Large language models (LLMs) have shown remarkable emergent capabilities,
+transforming the execution of functional tasks by leveraging external tools for
+complex problems that require specialized processing or real-time data. While
+existing research expands LLMs' access to diverse tools (e.g., program
+interpreters, search engines, weather/map apps), the necessity of using these
+tools is often overlooked, leading to indiscriminate tool invocation. This
+naive approach raises two key issues: (1) increased delays due to unnecessary
+tool calls, and (2) potential errors resulting from faulty interactions with
+external tools. In this paper, we introduce meta-cognition as a proxy for LLMs'
+self-assessment of their capabilities, representing the model's awareness of
+its own limitations. Based on this, we propose MeCo, an adaptive
+decision-making strategy for external tool use. MeCo quantifies metacognitive
+scores by capturing high-level cognitive signals in the representation space,
+guiding when to invoke tools. Notably, MeCo is fine-tuning-free and incurs
+minimal cost. Our experiments show that MeCo accurately detects LLMs' internal
+cognitive signals and significantly improves tool-use decision-making across
+multiple base models and benchmarks.
+
+摘要：大型語言模型 (LLM) 已展現出顯著的新興能力，透過運用外部工具來執行功能任務，解決需要專業處理或即時資料的複雜問題，從而轉變任務的執行方式。儘管現有研究擴展了 LLM 對各種工具的存取（例如程式碼詮釋器、搜尋引擎、天氣/地圖應用程式），但使用這些工具的必要性往往被忽略，導致不加選擇地呼叫工具。這種天真的方法提出了兩個關鍵問題：(1) 由於不必要的工具呼叫而導致延遲增加，以及 (2) 由於與外部工具互動錯誤而導致的潛在錯誤。在本文中，我們將元認知引入作為 LLM 自我評估其能力的代理，代表模型意識到其自身的限制。基於此，我們提出了 MeCo，一種用於外部工具使用的適應性決策制定策略。MeCo 透過擷取表徵空間中的高階認知訊號來量化元認知分數，指導何時呼叫工具。值得注意的是，MeCo 是免微調的，而且成本極低。我們的實驗表明，MeCo 能夠準確地偵測 LLM 的內部認知訊號，並大幅改善跨多個基本模型和基準的工具使用決策制定。
+
+##### **AlignFreeze: Navigating the Impact of Realignment on the Layers of Multilingual Models Across Diverse Languages**
+2502.12959v1 by Steve Bakos, Félix Gaschi, David Guzmán, Riddhi More, Kelly Chutong Li, En-Shiun Annie Lee
+
+Realignment techniques are often employed to enhance cross-lingual transfer
+in multilingual language models; still, they can sometimes degrade performance
+in languages that differ significantly from the fine-tuned source language.
+This paper introduces AlignFreeze, a method that freezes either the layers'
+lower half or upper half during realignment. Through controlled experiments on
+4 tasks, 3 models, and in 35 languages, we find that realignment affects all
+the layers but can be the most detrimental to the lower ones. Freezing the
+lower layers can prevent performance degradation. Particularly, AlignFreeze
+improves Part-of-Speech (PoS) tagging performance in languages where full
+realignment fails: with XLM-R, it provides improvements of more than one
+standard deviation in accuracy in seven more languages than full realignment.
+
+摘要：重新對齊技術通常用於增強多語言語言模型中的跨語言轉移，然而，它們有時會降低與微調源語言顯著不同的語言的效能。本文介紹了 AlignFreeze，一種在重新對齊期間凍結層的下半部或上半部的方法。透過 4 項任務、3 個模型和 35 種語言的受控實驗，我們發現重新對齊會影響所有層，但對較低層的影響最大。凍結較低層可以防止效能下降。特別是，AlignFreeze 改善了在完全重新對齊失敗的語言中的詞性 (PoS) 標記效能：使用 XLM-R，它比完全重新對齊在七種語言中提供了超過一個標準差的準確度改進。
+
+##### **Task-Informed Anti-Curriculum by Masking Improves Downstream Performance on Text**
+2502.12953v1 by Andrei Jarca, Florinel Alin Croitoru, Radu Tudor Ionescu
+
+Masked language modeling has become a widely adopted unsupervised technique
+to pre-train language models. However, the process of selecting tokens for
+masking is random, and the percentage of masked tokens is typically fixed for
+the entire training process. In this paper, we propose to adjust the masking
+ratio and to decide which tokens to mask based on a novel task-informed
+anti-curriculum learning scheme. First, we harness task-specific knowledge
+about useful and harmful tokens in order to determine which tokens to mask.
+Second, we propose a cyclic decaying masking ratio, which corresponds to an
+anti-curriculum schedule (from hard to easy). We exemplify our novel
+task-informed anti-curriculum by masking (TIACBM) approach across three diverse
+downstream tasks: sentiment analysis, text classification by topic, and
+authorship attribution. Our findings suggest that TIACBM enhances the ability
+of the model to focus on key task-relevant features, contributing to
+statistically significant performance gains across tasks. We release our code
+at https://github.com/JarcaAndrei/TIACBM.
+
+摘要：遮蔽語言模型已成為一種廣泛採用的無監督技術，用於預先訓練語言模型。然而，選擇用於遮蔽的詞彙的過程是隨機的，且遮蔽詞彙的百分比通常在整個訓練過程中是固定的。在本文中，我們建議調整遮蔽率，並根據一種新穎的任務資訊反課程學習方案來決定要遮蔽哪些詞彙。首先，我們利用任務特定的知識，了解有用的和有害的詞彙，以確定要遮蔽哪些詞彙。其次，我們提出一個循環遞減遮蔽率，這對應於一個反課程表（從難到易）。我們以三項不同的下游任務為例，說明我們新穎的任務資訊反課程遮蔽（TIACBM）方法：情緒分析、按主題分類文字，以及作者歸屬。我們的研究結果表明，TIACBM 增強了模型專注於關鍵任務相關特徵的能力，有助於在各項任務中獲得具有統計意義的效能提升。我們在 https://github.com/JarcaAndrei/TIACBM 釋出我們的程式碼。
+
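The cyclic decaying masking ratio mentioned in the TIACBM abstract above can be pictured with a small schedule function. This is only a sketch: the abstract does not give the formula, so the linear decay inside each cycle, the shrinking per-cycle peak, the name `cyclic_decaying_mask_ratio`, and the defaults `r_max=0.30` and `r_min=0.05` are all assumptions made for illustration.

```python
def cyclic_decaying_mask_ratio(step, steps_per_cycle, num_cycles,
                               r_max=0.30, r_min=0.05):
    """Return a masking ratio for a training step (assumed schedule).

    Each cycle starts at a peak ratio (hard) and decays linearly toward
    r_min (easy); the peak itself shrinks from cycle to cycle.
    """
    if steps_per_cycle < 1 or num_cycles < 1:
        raise ValueError("steps_per_cycle and num_cycles must be positive")
    if step >= steps_per_cycle * num_cycles:
        return r_min  # schedule finished: stay at the easiest setting
    cycle = step // steps_per_cycle
    position = (step % steps_per_cycle) / steps_per_cycle  # in [0, 1)
    # Peak of the current cycle, decaying linearly across cycles.
    peak = r_max - (r_max - r_min) * cycle / num_cycles
    # Linear decay from the peak toward r_min within the cycle.
    return peak - (peak - r_min) * position


if __name__ == "__main__":
    for s in range(0, 30, 5):
        print(s, round(cyclic_decaying_mask_ratio(s, 10, 3), 3))
```

The ratio jumps back up at the start of every cycle, but to a lower peak than before, which is one plausible reading of "cyclic" combined with "decaying".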
+##### **Fake It Till You Make It: Using Synthetic Data and Domain Knowledge for Improved Text-Based Learning for LGE Detection**
+2502.12948v1 by Athira J Jacob, Puneet Sharma, Daniel Rueckert
+
+Detection of hyperenhancement from cardiac LGE MRI images is a complex task
+requiring significant clinical expertise. Although deep learning-based models
+have shown promising results for the task, they require large amounts of data
+with fine-grained annotations. Clinical reports generated for cardiac MR
+studies contain rich, clinically relevant information, including the location,
+extent and etiology of any scars present. Although recently developed
+CLIP-based training enables pretraining models with image-text pairs, it
+requires large amounts of data and further finetuning strategies on downstream
+tasks. In this study, we use various strategies rooted in domain knowledge to
+train a model for LGE detection solely using text from clinical reports, on a
+relatively small clinical cohort of 965 patients. We improve performance
+through the use of synthetic data augmentation, by systematically creating scar
+images and associated text. In addition, we standardize the orientation of the
+images in an anatomy-informed way to enable better alignment of spatial and
+text features. We also use a captioning loss to enable fine-grained supervision
+and explore the effect of pretraining of the vision encoder on performance.
+Finally, ablation studies are carried out to elucidate the contributions of
+each design component to the overall performance of the model.
+
+摘要：從心臟 LGE MRI 影像偵測出過度增強是一項複雜的任務，需要顯著的臨床專業知識。儘管基於深度學習的模型已顯示出對這項任務有前景的結果，但它們需要大量具有細緻註解的資料。為心臟 MR 研究產生的臨床報告包含豐富且臨床上相關的資訊，包括任何疤痕的位置、範圍和病因。儘管最近開發的基於 CLIP 的訓練能使用影像文字對預訓練模型，但它需要大量資料和進一步微調下游任務的策略。在這項研究中，我們使用植基於領域知識的各種策略，僅使用來自臨床報告的文字，在一個相對較小的 965 名患者臨床群體中訓練一個 LGE 偵測模型。我們透過使用合成資料擴充來改善效能，系統性地建立疤痕影像和相關文字。此外，我們以解剖學告知的方式標準化影像方向，以使空間和文字特徵能更好地對齊。我們也使用標題損失來啟用細緻的監督，並探討視覺編碼器的預訓練對效能的影響。最後，進行消融研究以闡明每個設計元件對模型整體效能的貢獻。
+
+##### **Every Expert Matters: Towards Effective Knowledge Distillation for Mixture-of-Experts Language Models**
+2502.12947v1 by Gyeongman Kim, Gyouk Chu, Eunho Yang
+
+With the emergence of Mixture-of-Experts (MoE), the efficient scaling of
+model size has accelerated the development of large language models in recent
+years. However, their high memory requirements prevent their use in
+resource-constrained environments. While knowledge distillation (KD) has been a
+proven method for model compression, its application to MoE teacher models
+remains underexplored. Through our investigation, we discover that
+non-activated experts in MoE models possess valuable knowledge that benefits
+student models. We further demonstrate that existing KD methods are not optimal
+for compressing MoE models, as they fail to leverage this knowledge
+effectively. To address this, we propose two intuitive MoE-specific KD methods
+for the first time: Knowledge Augmentation (KA) and Student-Aware Router (SAR),
+both designed to effectively extract knowledge from all experts. Specifically,
+KA augments knowledge by sampling experts multiple times, while SAR uses all
+experts and adjusts the expert weights through router training to provide
+optimal knowledge. Extensive experiments show that our methods outperform
+conventional KD methods, demonstrating their effectiveness for MoE teacher
+models.
+
+摘要：隨著 Mixture-of-Experts (MoE) 的出現，模型規模的有效擴展加速了近年來大型語言模型的發展。然而，它們的高記憶體需求會阻礙它們在資源受限的環境中使用。雖然知識蒸餾 (KD) 已被證明是一種模型壓縮的方法，但它在 MoE 教師模型中的應用仍未被充分探索。透過我們的調查，我們發現 MoE 模型中未被啟用的專家擁有有價值的知識，這些知識對學生模型有益。我們進一步證明，現有的 KD 方法並非壓縮 MoE 模型的最佳方法，因為它們無法有效利用這些知識。為了解決這個問題，我們首次提出兩種直觀的 MoE 專用 KD 方法：知識擴充 (KA) 和學生感知路由器 (SAR)，兩者都旨在從所有專家有效提取知識。具體來說，KA 透過多次抽樣專家來擴充知識，而 SAR 使用所有專家並透過路由器訓練調整專家權重以提供最佳知識。廣泛的實驗表明，我們的方法優於傳統的 KD 方法，證明了它們對 MoE 教師模型的有效性。
+
+##### **LLMPopcorn: An Empirical Study of LLMs as Assistants for Popular Micro-video Generation**
+2502.12945v1 by Junchen Fu, Xuri Ge, Kaiwen Zheng, Ioannis Arapakis, Xin Xin, Joemon M. Jose
+
+Popular Micro-videos, dominant on platforms like TikTok and YouTube, hold
+significant commercial value. The rise of high-quality AI-generated content has
+spurred interest in AI-driven micro-video creation. However, despite the
+advanced capabilities of large language models (LLMs) like ChatGPT and DeepSeek
+in text generation and reasoning, their potential to assist the creation of
+popular micro-videos remains largely unexplored.
+  In this paper, we conduct an empirical study on LLM-assisted popular
+micro-video generation (LLMPopcorn). Specifically, we investigate the following
+research questions: (i) How can LLMs be effectively utilized to assist popular
+micro-video generation? (ii) To what extent can prompt-based enhancements
+optimize the LLM-generated content for higher popularity? (iii) How well do
+various LLMs and video generators perform in the popular micro-video generation
+task? By exploring these questions, we show that advanced LLMs like DeepSeek-V3
+enable micro-video generation to achieve popularity comparable to human-created
+content. Prompt enhancements further boost popularity, and benchmarking
+highlights DeepSeek-V3 and DeepSeek-R1 among LLMs, while LTX-Video and
+HunyuanVideo lead in video generation. This pioneering work advances
+AI-assisted micro-video creation, uncovering new research opportunities. We
+will release the code and datasets to support future studies.
+
+摘要：在 TikTok 和 YouTube 等平台上流行的微影片具有重要的商业价值。高质量 AI 生成的内容的兴起激发了人们对 AI 驱动的微影片创作的兴趣。然而，尽管大型语言模型 (LLM) 如 ChatGPT 和 DeepSeek 在文本生成和推理方面的能力很强，但它们在辅助创建流行微影片方面的潜力在很大程度上仍未得到探索。在本文中，我们对 LLM 辅助的流行微影片生成 (LLMPopcorn) 进行了实证研究。具体来说，我们调查了以下研究问题：(i) 如何有效利用 LLM 来辅助流行微影片生成？(ii) 基于提示的增强在多大程度上可以优化 LLM 生成的内容以获得更高的流行度？(iii) 各种 LLM 和视频生成器在流行微影片生成任务中表现如何？通过探索这些问题，我们表明像 DeepSeek-V3 这样的高级 LLM 使微影片生成能够达到与人类创作的内容相当的流行度。提示增强进一步提高了流行度，并且基准测试突出了 LLM 中的 DeepSeek-V3 和 DeepSeek-R1，而 LTX-Video 和 HunyuanVideo 在视频生成中领先。这项开创性的工作推进了 AI 辅助的微影片创作，发现了新的研究机会。我们将发布代码和数据集以支持未来的研究。
+
+##### **Synthetic Data Generation for Culturally Nuanced Commonsense Reasoning in Low-Resource Languages**
+2502.12932v1 by Salsabila Zahirah Pranida, Rifo Ahmad Genadi, Fajri Koto
+
+Quantifying reasoning capability in low-resource languages remains a
+challenge in NLP due to data scarcity and limited access to annotators. While
+LLM-assisted dataset construction has proven useful for medium- and
+high-resource languages, its effectiveness in low-resource languages,
+particularly for commonsense reasoning, is still unclear. In this paper, we
+compare three dataset creation strategies: (1) LLM-assisted dataset generation,
+(2) machine translation, and (3) human-written data by native speakers, to
+build a culturally nuanced story comprehension dataset. We focus on Javanese
+and Sundanese, two major local languages in Indonesia, and evaluate the
+effectiveness of open-weight and closed-weight LLMs in assisting dataset
+creation through extensive manual validation. To assess the utility of
+synthetic data, we fine-tune language models on classification and generation
+tasks using this data and evaluate performance on a human-written test set. Our
+findings indicate that LLM-assisted data creation outperforms machine
+translation.
+
+摘要：由於資料稀少且標註者有限，量化低資源語言中的推理能力在自然語言處理中仍然是一項挑戰。雖然 LLM 輔助的資料集建構已被證明對中高資源語言有用，但其在低資源語言中的有效性，特別是對於常識推理，仍然不清楚。在本文中，我們比較了三種資料集建立策略：(1) LLM 輔助的資料集生成，(2) 機器翻譯，以及 (3) 母語人士撰寫的人工資料，以建立具有文化細微差異的故事理解資料集。我們專注於爪哇語和巽他語，這兩種印尼的主要地方語言，並透過廣泛的手動驗證評估開放權重和封閉權重 LLM 在協助資料集建立中的有效性。為了評估合成資料的效用，我們使用這些資料對分類和生成任務進行語言模型微調，並在人工撰寫的測試集上評估效能。我們的研究結果表明，LLM 輔助的資料建立優於機器翻譯。
+
+##### **Flow-of-Options: Diversified and Improved LLM Reasoning by Thinking Through Options**
+2502.12929v1 by Lakshmi Nair, Ian Trase, Mark Kim
+
+We present a novel reasoning approach called Flow-of-Options (FoO), designed
+to address intrinsic biases in Large Language Models (LLMs). FoO enables LLMs
+to systematically explore a diverse range of possibilities in their reasoning,
+as demonstrated by an FoO-based agentic system for autonomously solving Machine
+Learning tasks (AutoML). Our framework outperforms state-of-the-art baselines,
+achieving improvements of 38.2% - 69.2% on standard data science tasks, and
+37.4% - 47.9% on therapeutic chemistry tasks. With an overall operation cost
+under $1 per task, our framework is well-suited for cost-sensitive
+applications. Beyond classification and regression, we illustrate the broader
+applicability of our FoO-based agentic system to tasks such as reinforcement
+learning and image generation. Our framework presents significant advancements
+compared to current state-of-the-art agentic systems for AutoML, due to the
+benefits of FoO in enforcing diversity in LLM solutions through compressed,
+explainable representations that also support long-term memory when combined
+with case-based reasoning.
+
+摘要：我們提出了一種稱為選項流 (FoO) 的新推理方法，旨在解決大型語言模型 (LLM) 中的內在偏差。FoO 使 LLM 能系統性地探索其推理中的各種可能性，這由一個基於 FoO 的代理系統展示，該系統可自主解決機器學習任務 (AutoML)。我們的框架優於最先進的基準，在標準數據科學任務上取得了 38.2% - 69.2% 的改進，在治療化學任務上取得了 37.4% - 47.9% 的改進。由於每個任務的整體運營成本低於 1 美元，因此我們的框架非常適合對成本敏感的應用。除了分類和回歸之外，我們還說明了基於 FoO 的代理系統在強化學習和圖像生成等任務中的更廣泛適用性。我們的框架與當前最先進的 AutoML 代理系統相比具有顯著的進步，這是因為 FoO 在通過壓縮、可解釋的表示強制 LLM 解決方案的多樣性方面具有優勢，這些表示與基於案例的推理結合時還支持長期記憶。
+
+##### **Finedeep: Mitigating Sparse Activation in Dense LLMs via Multi-Layer Fine-Grained Experts**
+2502.12928v1 by Leiyu Pan, Zhenpeng Su, Minxuan Lv, Yizhe Xiong, Xiangwen Zhang, Zijia Lin, Hui Chen, Jungong Han, Guiguang Ding, Cheng Luo, Di Zhang, Kun Gai, Deyi Xiong
+
+Large language models have demonstrated exceptional performance across a wide
+range of tasks. However, dense models usually suffer from sparse activation,
+where many activation values tend towards zero (i.e., being inactivated). We
+argue that this could restrict the efficient exploration of model
+representation space. To mitigate this issue, we propose Finedeep, a
+deep-layered fine-grained expert architecture for dense models. Our framework
+partitions the feed-forward neural network layers of traditional dense models
+into small experts and arranges them across multiple sub-layers. A novel routing
+mechanism is proposed to determine each expert's contribution. We conduct
+extensive experiments across various model sizes, demonstrating that our
+approach significantly outperforms traditional dense architectures in terms of
+perplexity and benchmark performance while maintaining a comparable number of
+parameters and floating-point operations. Moreover, we find that Finedeep
+achieves optimal results when balancing depth and width, specifically by
+adjusting the number of expert sub-layers and the number of experts per
+sub-layer. Empirical results confirm that Finedeep effectively alleviates
+sparse activation and efficiently utilizes representation capacity in dense
+models.
+
+摘要：大型語言模型在各種任務中展現出非凡的效能。然而，密集模型通常會出現稀疏激活，其中許多激活值趨近於零（即處於非激活狀態）。我們認為這可能會限制模型表示空間的有效探索。為了減輕這個問題，我們提出 Finedeep，這是一種針對密集模型的深度分層細粒度專家架構。我們的框架將傳統密集模型的前饋神經網路層分割成小型專家，並將它們排列在多個子層中。我們提出了一種新穎的路由機制來確定每個專家的貢獻。我們針對各種模型大小進行了廣泛的實驗，證明我們的做法在困惑度和基準效能方面顯著優於傳統的密集架構，同時保持了相當數量的參數和浮點運算。此外，我們發現 Finedeep 在平衡深度和廣度時可以達到最佳結果，特別是透過調整專家子層的數量和每個子層的專家數量。實證結果證實，Finedeep 有效地減輕了稀疏激活，並有效利用了密集模型中的表示能力。
+
+##### **SEFL: Harnessing Large Language Model Agents to Improve Educational Feedback Systems**
+2502.12927v1 by Mike Zhang, Amalie Pernille Dilling, Léon Gondelman, Niels Erik Ruan Lyngdorf, Euan D. Lindsay, Johannes Bjerva
+
+Providing high-quality feedback is crucial for student success but is
+constrained by time, cost, and limited data availability. We introduce
+Synthetic Educational Feedback Loops (SEFL), a novel framework designed to
+deliver immediate, on-demand feedback at scale without relying on extensive,
+real-world student data. In SEFL, two large language models (LLMs) operate in
+teacher-student roles to simulate assignment completion and formative
+feedback, generating abundant synthetic pairs of student work and corresponding
+critiques. We then fine-tune smaller, more computationally efficient LLMs on
+these synthetic pairs, enabling them to replicate key features of high-quality,
+goal-oriented feedback. Unlike personalized tutoring approaches that offer
+multi-turn, individualized instruction, SEFL specifically focuses on
+replicating the teacher-to-student feedback loop for diverse assignments.
+Through both LLM-as-a-judge and human evaluations, we demonstrate that
+SEFL-tuned models outperform their non-tuned counterparts in feedback quality,
+clarity, and timeliness. These findings reveal SEFL's potential to transform
+feedback processes for higher education and beyond, offering an ethical and
+scalable alternative to conventional manual feedback cycles.
+
+摘要：提供高品質的回饋對於學生的成功至關重要，但受到時間、成本和資料取得有限的限制。我們引入了合成教育回饋迴圈 (SEFL)，這是一個新穎的架構，旨在提供立即且依需求的回饋，且無需仰賴大量的真實世界學生資料。在 SEFL 中，兩個大型語言模型 (LLM) 以師生角色運作，模擬作業完成和形成性回饋，產生大量的合成學生作業和對應的評論。然後我們針對這些合成配對微調較小、計算效率較高的 LLM，讓它們能夠複製高品質、目標導向回饋的主要特徵。與提供多回合、個別化教學的個人化輔導方法不同，SEFL 特別專注於複製適用於各種作業的教師對學生回饋迴圈。透過 LLM 作為評審和人類評估，我們證明了 SEFL 微調模型在回饋品質、清晰度和時效性方面優於未微調的模型。這些發現揭示了 SEFL 轉變高等教育及其他領域回饋流程的潛力，提供了一個符合道德且可擴充的替代方案，取代傳統的手動回饋週期。
+
+##### **Towards more Contextual Agents: An extractor-Generator Optimization Framework**
+2502.12926v1 by Mourad Aouini, Jinan Loubani
+
+Large Language Model (LLM)-based agents have demonstrated remarkable success
+in solving complex tasks across a wide range of general-purpose applications.
+However, their performance often degrades in context-specific scenarios, such
+as specialized industries or research domains, where the absence of
+domain-relevant knowledge leads to imprecise or suboptimal outcomes. To address
+this challenge, our work introduces a systematic approach to enhance the
+contextual adaptability of LLM-based agents by optimizing their underlying
+prompts, the critical components that govern agent behavior, roles, and
+interactions. Manually crafting optimized prompts for context-specific tasks is
+labor-intensive, error-prone, and lacks scalability. In this work, we introduce
+an Extractor-Generator framework designed to automate the optimization of
+contextual LLM-based agents. Our method operates through two key stages: (i)
+feature extraction from a dataset of gold-standard input-output examples, and
+(ii) prompt generation via a high-level optimization strategy that iteratively
+identifies underperforming cases and applies self-improvement techniques. This
+framework substantially improves prompt adaptability by enabling more precise
+generalization across diverse inputs, particularly in context-specific tasks
+where maintaining semantic consistency and minimizing error propagation are
+critical for reliable performance. Although developed with single-stage
+workflows in mind, the approach naturally extends to multi-stage workflows,
+offering broad applicability across various agent-based systems. Empirical
+evaluations demonstrate that our framework significantly enhances the
+performance of prompt-optimized agents, providing a structured and efficient
+approach to contextual LLM-based agents.
+
+摘要：大型語言模型 (LLM) 為基礎的代理已展現出非凡的成功，
+能解決廣泛一般用途應用程式的複雜任務。
+然而，它們的效能通常會在特定情境中下降，例如專門產業或研究領域，
+其中缺乏與領域相關知識會導致不精確或次佳的結果。為了解決
+這項挑戰，我們的研究引進了一種系統化的方法來增強 LLM 為基礎的代理的
+情境適應性，方法是最佳化它們的基礎提示，這些提示是決定代理行為、角色和
+互動的重要組成部分。手動製作最佳化的提示以應對特定情境的任務既費時又容易出錯，而且缺乏可擴充性。在這項研究中，我們引進
+一個萃取產生器架構，旨在自動化情境 LLM 為基礎代理的最佳化。我們的
+方法透過兩個關鍵階段運作：(i) 從黃金標準輸入輸出範例的資料集萃取特徵，以及
+(ii) 透過高階最佳化策略產生提示，此策略會反覆找出表現不佳的案例並套用自我改善技術。此
+架構大幅改善了提示適應性，讓它能針對不同的輸入進行更精確的概括，特別是在情境特定任務中，在這些任務中，維持語意一致性和將錯誤傳播降至最低對於可靠的效能至關重要。儘管是針對單階段工作流程開發，但此方法自然能延伸至多階段工作流程，在各種基於代理的系統中提供廣泛的適用性。實證評估顯示，我們的架構大幅增強了提示最佳化代理的效能，為基於情境的 LLM 代理提供了一個結構化且有效率的方法。
+
+##### **Keep what you need : extracting efficient subnetworks from large audio representation models**
+2502.12925v1 by David Genova, Philippe Esling, Tom Hurlin
+
+Recently, research on audio foundation models has witnessed notable advances,
+as illustrated by the ever improving results on complex downstream tasks.
+Subsequently, those pretrained networks have quickly been used for various
+audio applications. These improvements have however resulted in a considerable
+increase in both the size and complexity of these models. Along with the environmental
+concerns this issue raises, this prevents the deployment of such networks on
+consumer-level devices, and precludes their use for real-time applications.
+Moreover, this appears contradictory with the specificity of the tasks for
+which these models are used, which are often simpler compared to extracting a
+rich, multi-purpose representation from any type of audio data. In this paper,
+we address this issue with a simple, yet effective method to extract
+lightweight specialist subnetworks from large foundation models. Specifically,
+we introduce learnable binary masks in-between the layers of a pretrained
+representation model. When training the end-to-end model on a downstream task,
+we add a sparsity-inducing loss to the overall objective, hence learning a
+compact subnetwork specialized on a single task. Importantly, the weights of
+the foundation model are kept frozen, resulting in low additional training
+costs. Once trained, the masked computational units can then be removed from
+the network, implying significant performance gains. We assess our method on
+three widespread audio foundation models, each based on a different backbone
+architecture, and illustrate its effectiveness on common audio representation
+evaluation tasks, as well as its versatility across speech, music, and general
+audio. Code for reproducing the results and supporting webpage are available at
+https://github.com/gnvIRCAM/Audio-representation-trimming
+
+摘要：近期，音频基础模型的研究取得了显著进展，复杂的下游任务上不断提升的结果证明了这一点。随后，这些预训练网络已迅速用于各种音频应用程序。然而，这些改进导致了这些模型的尺寸和复杂性都大幅增加。除了由此产生的环境问题外，这也阻止了此类网络在消费者级设备上的部署，并排除了它们在实时应用程序中的使用。此外，这似乎与这些模型的使用任务的特殊性相矛盾，与从任何类型的音频数据中提取丰富的多用途表示相比，这些任务通常更简单。在本文中，我们通过一种简单但有效的方法来解决此问题，从大型基础模型中提取轻量级专家子网络。具体来说，我们在预训练表示模型的层之间引入了可学习的二进制掩码。当在某个下游任务上训练端到端模型时，我们在总体目标中添加了稀疏性诱导损失，从而学习到专门用于单个任务的紧凑型子网络。重要的是，基础模型的权重保持冻结，从而导致额外的训练成本低。一旦训练完成，就可以从网络中移除掩码的计算单元，这意味着性能将大幅提升。我们对三个广泛使用的音频基础模型评估了我们的方法，每个模型都基于不同的骨干架构，并说明了其在常见音频表示评估任务上的有效性，以及其在语音、音乐和通用音频上的多功能性。用于重现结果的代码和支持网页可在 https://github.com/gnvIRCAM/Audio-representation-trimming 获得。
+
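The masking idea in the audio subnetwork abstract above (learnable masks between frozen layers, a sparsity-inducing term added to the task loss, masked units removed after training) can be sketched in a few lines of plain Python. The abstract does not say how the binary masks are relaxed or which penalty is used, so the sigmoid relaxation, the mean-activation penalty, the 0.5 threshold, and every function name here are assumptions, not the authors' implementation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def soft_mask(logits):
    """Relaxed mask values in (0, 1), one per computational unit."""
    return [sigmoid(z) for z in logits]


def sparsity_penalty(logits):
    """Mean soft-mask value; minimising it pushes units toward 'off'."""
    probs = soft_mask(logits)
    return sum(probs) / len(probs)


def total_objective(task_loss, logits, sparsity_weight=0.1):
    """Task loss plus the weighted sparsity term (assumed form)."""
    return task_loss + sparsity_weight * sparsity_penalty(logits)


def kept_units(logits, threshold=0.5):
    """Indices of units whose binarised mask is 1; the rest can be removed."""
    return [i for i, p in enumerate(soft_mask(logits)) if p >= threshold]


if __name__ == "__main__":
    mask_logits = [2.0, -3.0, 0.5, -1.0]  # hypothetical learned values
    print(kept_units(mask_logits))
    print(total_objective(1.0, mask_logits))
```

In this sketch only the mask logits would be trained; the foundation model's weights stay frozen, as the abstract states.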
+##### **Conditioning LLMs to Generate Code-Switched Text: A Methodology Grounded in Naturally Occurring Data**
+2502.12924v1 by Maite Heredia, Gorka Labaka, Jeremy Barnes, Aitor Soroa
+
+Code-switching (CS) is still a critical challenge in Natural Language
+Processing (NLP). Current Large Language Models (LLMs) struggle to interpret
+and generate code-switched text, primarily due to the scarcity of large-scale
+CS datasets for training. This paper presents a novel methodology to generate
+CS data using LLMs, and tests it on the English-Spanish language pair. We
+propose back-translating natural CS sentences into monolingual English, and
+using the resulting parallel corpus to fine-tune LLMs to turn monolingual
+sentences into CS. Unlike previous approaches to CS generation, our methodology
+uses natural CS data as a starting point, allowing models to learn its natural
+distribution beyond grammatical patterns. We thoroughly analyse the models'
+performance through a study on human preferences, a qualitative error analysis
+and an evaluation with popular automatic metrics. Results show that our
+methodology generates fluent code-switched text, expanding research
+opportunities in CS communication, and that traditional metrics do not
+correlate with human judgement when assessing the quality of the generated CS
+data. We release our code and generated dataset under a CC-BY-NC-SA license.
+
+摘要：代碼轉換（CS）在自然語言處理（NLP）中仍是一個嚴峻的挑戰。目前的巨量語言模型（LLM）難以解讀和生成代碼轉換文字，主要是因為缺乏用於訓練的大規模 CS 資料集。本文提出了一種使用 LLM 生成 CS 資料的新方法，並在英語-西班牙語語言對上進行測試。我們建議將自然 CS 句子反向翻譯成單語英語，並使用產生的平行語料庫微調 LLM，將單語句子轉換為 CS。與先前的 CS 生成方法不同，我們的技術使用自然 CS 資料作為起點，讓模型能夠學習其超越語法模式的自然分佈。我們透過研究人類偏好、定性錯誤分析和使用流行的自動化指標進行評估，徹底分析模型的效能。結果顯示，我們的技術可以生成流利的代碼轉換文字，擴展 CS 溝通的研究機會，而且在評估生成的 CS 資料品質時，傳統指標與人類判斷無關。我們在 CC-BY-NC-SA 授權下釋出我們的程式碼和生成的資料集。
+
+##### **On-Device LLMs for Home Assistant: Dual Role in Intent Detection and Response Generation**
+2502.12923v1 by Rune Birkmose, Nathan Mørkeberg Reece, Esben Hofstedt Norvin, Johannes Bjerva, Mike Zhang
+
+This paper investigates whether Large Language Models (LLMs), fine-tuned on
+synthetic but domain-representative data, can perform the twofold task of (i)
+slot and intent detection and (ii) natural language response generation for a
+smart home assistant, while running solely on resource-limited, CPU-only edge
+hardware. We fine-tune LLMs to produce both JSON action calls and text
+responses. Our experiments show that 16-bit and 8-bit quantized variants
+preserve high accuracy on slot and intent detection and maintain strong
+semantic coherence in generated text, whereas the 4-bit model, although it retains
+generative fluency, suffers a noticeable drop in device-service classification
+accuracy. Further evaluations on noisy human (non-synthetic) prompts and
+out-of-domain intents confirm the models' generalization ability, obtaining
+around 80-86% accuracy. While the average inference time is 5-6 seconds per
+query (acceptable for one-shot commands but suboptimal for multi-turn
+dialogue), our results affirm that an on-device LLM can effectively unify
+command interpretation and flexible response generation for home automation
+without relying on specialized hardware.
+
+摘要：本文探討微調於合成但具領域代表性的資料上的大型語言模型 (LLM)，是否能執行 (i) 槽位和意圖偵測，以及 (ii) 自然語言回應產生的雙重任務，同時僅在資源受限、僅 CPU 的邊緣硬體上執行。我們微調 LLM 以產生 JSON 動作呼叫和文字回應。我們的實驗顯示，16 位元和 8 位元量化的變體在槽位和意圖偵測上保持高準確度，並在產生的文字中維持強大的語意一致性，而 4 位元模型雖然保有生成流暢度，但在裝置服務分類準確度上卻有明顯下降。進一步對有雜訊的人類 (非合成) 提示和領域外意圖的評估，證實了模型的泛化能力，獲得約 80--86% 的準確度。雖然平均推論時間為每個查詢 5--6 秒，對於一次性命令來說是可以接受的，但對於多輪對話來說並不理想，但我們的結果證實，裝置上的 LLM 可以有效地統一命令解譯和彈性回應產生，以進行家庭自動化，而無需依賴專用硬體。
+
+##### **Q-STRUM Debate: Query-Driven Contrastive Summarization for Recommendation Comparison**
+2502.12921v1 by George-Kirollos Saad, Scott Sanner
+
+Query-driven recommendation with unknown items poses a challenge for users to
+understand why certain items are appropriate for their needs. Query-driven
+Contrastive Summarization (QCS) is a methodology designed to address this issue
+by leveraging language-based item descriptions to clarify contrasts between
+them. However, existing state-of-the-art contrastive summarization methods such
+as STRUM-LLM fall short of this goal. To overcome these limitations, we
+introduce Q-STRUM Debate, a novel extension of STRUM-LLM that employs
+debate-style prompting to generate focused and contrastive summarizations of
+item aspects relevant to a query. Leveraging modern large language models
+(LLMs) as powerful tools for generating debates, Q-STRUM Debate provides
+enhanced contrastive summaries. Experiments across three datasets demonstrate
+that Q-STRUM Debate yields significant performance improvements over existing
+methods on key contrastive summarization criteria, thus introducing a novel and
+performant debate prompting methodology for QCS.
+
+摘要：以未知項目進行的查詢驅動推薦對使用者來說是一項挑戰，他們難以理解為何某些項目適合自己的需求。查詢驅動對比摘要 (QCS) 是一種方法，旨在透過利用基於語言的項目描述來釐清項目之間的對比，以解決這個問題。然而，現有的最先進對比摘要方法（例如 STRUM-LLM）並未達成此目標。為了克服這些限制，我們引進 Q-STRUM Debate，一種 STRUM-LLM 的新延伸，它採用辯論式提示來產生與查詢相關的項目面向的重點式對比摘要。透過利用現代大型語言模型 (LLM) 作為產生辯論的強大工具，Q-STRUM Debate 提供增強的對比摘要。透過三個資料集的實驗證明，Q-STRUM Debate 在關鍵的對比摘要標準上，比現有方法有顯著的效能改善，因此為 QCS 引進一種新穎且高性能的辯論提示方法。
+
+##### **GSQ-Tuning: Group-Shared Exponents Integer in Fully Quantized Training for LLMs On-Device Fine-tuning**
+2502.12913v1 by Sifan Zhou, Shuo Wang, Zhihang Yuan, Mingjia Shi, Yuzhang Shang, Dawei Yang
+
+Large Language Models (LLMs) fine-tuning technologies have achieved
+remarkable results. However, traditional LLM fine-tuning approaches face
+significant challenges: they require large Floating Point (FP) computation,
+raising privacy concerns when handling sensitive data, and are impractical for
+resource-constrained edge devices. While Parameter-Efficient Fine-Tuning (PEFT)
+techniques reduce trainable parameters, their reliance on floating-point
+arithmetic creates fundamental incompatibilities with edge hardware. In this
+work, we introduce a novel framework for on-device LLM fine-tuning that
+eliminates the need for floating-point operations in both inference and
+training, named GSQ-Tuning. At its core is the Group-Shared Exponents Integer
+format, which efficiently represents model parameters in integer format using
+shared exponents among parameter groups. When combined with LoRA-like adapters,
+this enables fully integer-based fine-tuning that is both memory and compute
+efficient. We demonstrate that our approach achieves accuracy comparable to
+FP16-based fine-tuning while significantly reducing memory usage (50%).
+Moreover, compared to FP8, our method reduces power consumption by 5x and chip
+area by 11x at the same performance, making large-scale model adaptation feasible
+on edge devices.
+
+摘要：大型语言模型 (LLM) 微调技术已取得显著成果。然而，传统的 LLM 微调方法面临着严峻的挑战：它们需要大量的浮点 (FP) 计算，在处理敏感数据时会引发隐私问题，并且对于资源受限的边缘设备而言不切实际。虽然参数高效微调 (PEFT) 技术减少了可训练参数，但它们对浮点运算的依赖与边缘硬件产生了根本上的不兼容性。在这项工作中，我们引入了一个用于设备上 LLM 微调的新框架，该框架消除了推理和训练中对浮点运算的需求，名为 GSQ-Tuning。其核心是组共享指数整数格式，该格式使用参数组之间的共享指数以整数格式有效地表示模型参数。当与类似 LoRA 的适配器相结合时，这实现了完全基于整数的微调，既节省内存又节省计算。我们证明了我们的方法实现了与基于 FP16 的微调相当的准确性，同时显著减少了内存使用量 (50%)。此外，与 FP8 相比，我们的方法可以在相同的性能下减少 5 倍的功耗和 11 倍的芯片面积，从而使大规模模型适应在边缘设备上成为可能。
+
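The GSQ-Tuning abstract above describes an integer format in which a group of parameters shares one exponent. A generic block-floating-point style sketch of that idea follows. The abstract gives no bit widths, group sizes, or rounding rules, so the 8-bit default, the power-of-two scale, and the names `quantize_group` and `dequantize_group` are assumptions for illustration and are not the paper's actual format.

```python
import math


def quantize_group(values, bits=8):
    """Map a group of floats to signed integers plus one shared exponent.

    Each value is approximated as q * 2**exponent, with q a signed
    `bits`-bit integer. The exponent is chosen from the group's largest
    magnitude so that every q fits in range.
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return [0] * len(values), 0
    exponent = math.ceil(math.log2(max_abs / qmax))
    scale = 2.0 ** exponent
    # Clamp guards against rounding at the edge of the integer range.
    ints = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return ints, exponent


def dequantize_group(ints, exponent):
    """Recover approximate floats from integers and the shared exponent."""
    scale = 2.0 ** exponent
    return [q * scale for q in ints]


if __name__ == "__main__":
    group = [0.5, -1.0, 0.25, 0.75]
    q, e = quantize_group(group)
    print(q, e, dequantize_group(q, e))
```

Because only one exponent is stored per group, the per-parameter storage is a plain integer; the rounding error of each value is bounded by half the shared scale.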
+##### **Knapsack Optimization-based Schema Linking for LLM-based Text-to-SQL Generation**
+2502.12911v1 by Zheng Yuan, Hao Chen, Zijin Hong, Qinggang Zhang, Feiran Huang, Xiao Huang
+
+Generating SQLs from user queries is a long-standing challenge, where the
+accuracy of initial schema linking significantly impacts subsequent SQL
+generation performance. However, current schema linking models still struggle
+with missing relevant schema elements or an excess of redundant ones. A crucial
+reason for this is that commonly used metrics, recall and precision, fail to
+capture missing relevant elements and thus cannot reflect actual schema linking
+performance. Motivated by this, we propose an enhanced schema linking metric by
+introducing a restricted missing indicator. Accordingly, we introduce Knapsack
+optimization-based Schema Linking Agent (KaSLA), a plug-in schema linking agent
+designed to prevent relevant schema elements from being missed while minimizing
+the inclusion of redundant ones. KaSLA employs a hierarchical linking strategy
+that first identifies the optimal table linking and subsequently links columns
+within the selected table to reduce linking candidate space. In each linking
+process, it utilizes a knapsack optimization approach to link potentially
+relevant elements while accounting for a limited tolerance of potential
+redundant ones. With this optimization, KaSLA-1.6B achieves superior schema
+linking results compared to large-scale LLMs, including deepseek-v3 with a
+state-of-the-art (SOTA) schema linking method. Extensive experiments on Spider
+and BIRD benchmarks verify that KaSLA can significantly improve the SQL
+generation performance of SOTA text-to-SQL models by substituting their schema
+linking processes.
+
+摘要：從使用者查詢中產生 SQL 是個長期的挑戰，其中初始架構連結的準確性會顯著影響後續 SQL 產生效能。然而，目前的架構連結模型仍難以處理遺漏相關架構元素或過多重複元素的問題。造成此問題的一個關鍵原因是，常用的指標召回率和精確度無法捕捉遺漏相關元素，因此無法反映實際的架構連結效能。有鑑於此，我們提出一個增強的架構連結指標，透過引入受限遺漏指標。因此，我們介紹基於背包最佳化的架構連結代理 (KaSLA)，這是一個外掛式架構連結代理，旨在防止遺漏相關架構元素，同時將重複元素的納入降至最低。KaSLA 採用分層連結策略，首先找出最佳的表格連結，然後連結所選表格中的欄位，以減少連結候選空間。在每個連結過程中，它利用背包最佳化方法連結潛在相關元素，同時考量對潛在重複元素的容忍度。透過此最佳化，KaSLA-1.6B 達到優於大規模 LLM 的架構連結結果，包括採用最先進 (SOTA) 架構連結方法的 deepseek-v3。在 Spider 和 BIRD 基準上的廣泛實驗驗證，KaSLA 可透過取代其架構連結流程，大幅提升 SOTA 文字轉 SQL 模型的 SQL 產生效能。
+
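The knapsack framing in the KaSLA abstract above can be illustrated with a textbook 0/1 knapsack over candidate schema elements. How KaSLA actually scores relevance and redundancy is not stated in the abstract, so the relevance values, the integer "redundancy cost" weights, the capacity standing in for the redundancy tolerance, and the name `knapsack_link` are all hypothetical.

```python
def knapsack_link(candidates, capacity):
    """Pick schema elements maximising total relevance under a cost budget.

    candidates: list of (name, relevance, cost) with a non-negative
    integer cost. Returns (chosen names in input order, total relevance).
    """
    n = len(candidates)
    # best[i][c]: best relevance using the first i candidates within cost c.
    best = [[0.0] * (capacity + 1) for _ in range(n + 1)]
    for i, (_, relevance, cost) in enumerate(candidates, start=1):
        for c in range(capacity + 1):
            best[i][c] = best[i - 1][c]
            if cost <= c:
                best[i][c] = max(best[i][c], best[i - 1][c - cost] + relevance)
    # Walk back through the table to recover which elements were taken.
    chosen, c = [], capacity
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            name, _, cost = candidates[i - 1]
            chosen.append(name)
            c -= cost
    return chosen[::-1], best[n][capacity]


if __name__ == "__main__":
    columns = [  # hypothetical columns with made-up scores and costs
        ("users.id", 0.9, 1),
        ("users.name", 0.8, 1),
        ("users.created_at", 0.2, 2),
        ("orders.user_id", 0.7, 1),
    ]
    print(knapsack_link(columns, capacity=3))
```

With a budget of 3, the low-relevance, high-cost column is left out while the three cheap, relevant ones are linked, which mirrors the stated goal of keeping relevant elements and limiting redundant ones.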
+##### **Graph Neural Networks for Databases: A Survey**
+2502.12908v1 by Ziming Li, Youhuan Li, Yuyu Luo, Guoliang Li, Chuxu Zhang
+
+Graph neural networks (GNNs) are powerful deep learning models for
+graph-structured data, demonstrating remarkable success across diverse domains.
+Recently, the database (DB) community has increasingly recognized the
+potential of GNNs, prompting a surge of research focused on improving
+database systems through GNN-based approaches. However, despite notable
+advances, there is a lack of a comprehensive review and understanding of how
+GNNs could improve DB systems. Therefore, this survey aims to bridge this gap
+by providing a structured and in-depth overview of GNNs for DB systems.
+Specifically, we propose a new taxonomy that classifies existing methods into
+two key categories: (1) Relational Databases, which includes tasks like
+performance prediction, query optimization, and text-to-SQL, and (2) Graph
+Databases, addressing challenges like efficient graph query processing and
+graph similarity computation. We systematically review key methods in each
+category, highlighting their contributions and practical implications. Finally,
+we suggest promising avenues for integrating GNNs into Database systems.
+
+摘要：圖形神經網路 (GNN) 是用於圖形結構資料的強大深度學習模型，在各種領域中展現出顯著的成功。最近，資料庫 (DB) 社群越來越認識到 GNN 的潛力，促使大量研究專注於透過基於 GNN 的方法來改善資料庫系統。然而，儘管有顯著的進展，但對於 GNN 如何改善資料庫系統，仍然缺乏全面的回顧和理解。因此，本調查旨在透過提供 GNN 在資料庫系統中的結構化且深入的概觀來彌補這個差距。具體來說，我們提出了一個新的分類法，將現有方法分類為兩個主要類別：(1) 關係資料庫，其中包括效能預測、查詢最佳化和文字轉 SQL 等任務，以及 (2) 圖形資料庫，用於處理高效圖形查詢處理和圖形相似度計算等挑戰。我們系統性地回顧了每個類別中的關鍵方法，重點說明其貢獻和實務意涵。最後，我們建議將 GNN 整合到資料庫系統中的有希望途徑。
+
+##### **Fraud-R1 : A Multi-Round Benchmark for Assessing the Robustness of LLM Against Augmented Fraud and Phishing Inducements**
+2502.12904v1 by Shu Yang, Shenzhe Zhu, Zeyu Wu, Keyu Wang, Junchi Yao, Junchao Wu, Lijie Hu, Mengdi Li, Derek F. Wong, Di Wang
+
+We introduce Fraud-R1, a benchmark designed to evaluate LLMs' ability to
+defend against internet fraud and phishing in dynamic, real-world scenarios.
+Fraud-R1 comprises 8,564 fraud cases sourced from phishing scams, fake job
+postings, social media, and news, categorized into 5 major fraud types. Unlike
+previous benchmarks, Fraud-R1 introduces a multi-round evaluation pipeline to
+assess LLMs' resistance to fraud at different stages, including credibility
+building, urgency creation, and emotional manipulation. Furthermore, we
+evaluate 15 LLMs under two settings: 1. Helpful-Assistant, where the LLM
+provides general decision-making assistance, and 2. Role-play, where the model
+assumes a specific persona, widely used in real-world agent-based interactions.
+Our evaluation reveals the significant challenges in defending against fraud
+and phishing inducement, especially in role-play settings and fake job
+postings. Additionally, we observe a substantial performance gap between
+Chinese and English, underscoring the need for improved multilingual fraud
+detection capabilities.
+
+摘要：我們推出 Fraud-R1，一個基準，旨在評估 LLM 在動態、真實世界場景中防範網路詐騙和網路釣魚的能力。Fraud-R1 包含 8,564 起詐騙案例，來源包括網路釣魚詐騙、虛假職缺、社群媒體和新聞，分類為 5 種類型的主要詐騙手法。與先前的基準不同，Fraud-R1 引入多輪評估管道，以評估 LLM 在不同階段對詐騙的抵抗力，包括建立信譽、製造急迫感和情感操縱。此外，我們在兩種設定下評估 15 個 LLM：1. 協助助理，其中 LLM 提供一般決策協助，以及 2. 角色扮演，其中模型假設特定角色，廣泛用於現實世界中基於代理的互動。我們的評估揭示了在防範詐騙和網路釣魚誘導方面面臨的重大挑戰，尤其是在角色扮演設定和虛假職缺中。此外，我們觀察到中文和英文之間有顯著的效能差距，這凸顯了改進多語言詐騙偵測功能的必要性。
+
+##### **Soundwave: Less is More for Speech-Text Alignment in LLMs**
+2502.12900v1 by Yuhao Zhang, Zhiheng Liu, Fan Bu, Ruiyu Zhang, Benyou Wang, Haizhou Li
+
+Existing end-to-end speech large language models (LLMs) usually rely on
+large-scale annotated data for training, while data-efficient training has not
+been discussed in depth. We focus on two fundamental problems between speech
+and text: the representation space gap and sequence length inconsistency. We
+propose Soundwave, which utilizes an efficient training strategy and a novel
+architecture to address these issues. Results show that Soundwave outperforms
+the advanced Qwen2-Audio in speech translation and AIR-Bench speech tasks,
+using only one-fiftieth of the training data. Further analysis shows that
+Soundwave still retains its intelligence during conversation. The project is
+available at https://github.com/FreedomIntelligence/Soundwave.
+
+摘要：現有的端對端語音大型語言模型 (LLM) 通常依賴於大規模註釋資料進行訓練，而資料有效率的訓練尚未深入探討。我們專注於語音和文字之間的兩個基本問題：表示空間差距和序列長度不一致。我們提出 Soundwave，它利用高效的訓練策略和新穎的架構來解決這些問題。結果顯示，Soundwave 在語音翻譯和 AIR-Bench 語音任務中優於進階的 Qwen2-Audio，僅使用五十分之一的訓練資料。進一步的分析顯示，Soundwave 在對話中仍能保持其智慧。專案可於 https://github.com/FreedomIntelligence/Soundwave 取得。
+
+##### **None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks**
+2502.12896v1 by Eva Sánchez Salido, Julio Gonzalo, Guillermo Marco
+
+In LLM evaluations, reasoning is often distinguished from recall/memorization
+by performing numerical variations to math-oriented questions. Here we
+introduce a general variation method for multiple-choice questions that
+completely dissociates the correct answer from previously seen tokens or
+concepts, requiring LLMs to understand and reason (rather than memorize) in
+order to answer correctly. Using this method, we evaluate state-of-the-art
+proprietary and open-source LLMs on two datasets available in English and
+Spanish: the public MMLU benchmark and the private UNED-Access 2024 dataset.
+Results show that all models experience remarkable accuracy drops under our
+proposed variation, with an average loss of 57% on MMLU and 50% on UNED-Access
+2024, ranging from 10% to 93% across models. Notably, the most accurate model
+in our experimentation (OpenAI-o3-mini) is not the most robust
+(DeepSeek-R1-70B), suggesting that the best models in standard evaluations may
+not be the ones with better reasoning capabilities. Also, we see larger
+accuracy drops in public (vs private) datasets and questions posed in their
+original language (vs a manual translation), which are signs of contamination
+and also point to a relevant role of recall/memorization in current LLMs'
+answers.
+
+摘要：在 LLM 評估中，推理通常透過對數學導向問題進行數值變異來區別於回憶/記憶。在此，我們引入一種通用變異方法，適用於多選題，它將正確答案與先前看到的代幣或概念完全區分開來，要求 LLM 理解和推理（而不是記憶），以便正確回答。使用此方法，我們在英語和西班牙語中評估了兩種數據集中的最先進的專有和開源 LLM：公共 MMLU 基準和私有 UNED-Access 2024 數據集。結果表明，在我們提出的變異下，所有模型的準確度都出現顯著下降，在 MMLU 上平均損失 57%，在 UNED-Access 2024 上平均損失 50%，在不同模型中範圍從 10% 到 93%。值得注意的是，我們實驗中最準確的模型（OpenAI-o3-mini）並不是最穩健的模型（DeepSeek-R1-70B），這表明標準評估中最好的模型可能不是推理能力最強的模型。此外，我們看到公共（相對於私有）數據集和以原始語言提出的問題（相對於人工翻譯）的準確度下降幅度更大，這是汙染的跡象，也表明回憶/記憶在當前 LLM 的答案中發揮著相關作用。
+
+##### **Multilingual European Language Models: Benchmarking Approaches and Challenges**
+2502.12895v1 by Fabio Barth, Georg Rehm
+
+The breakthrough of generative large language models (LLMs) that can solve
+different tasks through chat interaction has led to a significant increase in
+the use of general benchmarks to assess the quality or performance of these
+models beyond individual applications. There is also a need for better methods
+to evaluate and compare models, given the ever-increasing number of new
+models published. However, most of the established benchmarks revolve around
+the English language. This paper analyses the benefits and limitations of
+current evaluation datasets, focusing on multilingual European benchmarks. We
+analyse seven multilingual benchmarks and identify four major challenges.
+Furthermore, we discuss potential solutions to enhance translation quality and
+mitigate cultural biases, including human-in-the-loop verification and
+iterative translation ranking. Our analysis highlights the need for culturally
+aware and rigorously validated benchmarks to assess the reasoning and
+question-answering capabilities of multilingual LLMs accurately.
+
+摘要：生成式大型語言模型 (LLM) 的突破，它能透過聊天互動解決不同任務，這導致使用一般基準來評估這些模型在個別應用程式以外的品質或效能大幅增加。由於已發布的新模型數量不斷增加，因此也有必要採用更好的方法來評估模型並進行比較。然而，大多數已建立的基準都圍繞著英語。本文分析了目前評估資料集的優點和限制，重點放在多語言歐洲基準。我們分析了七個多語言基準，並找出四個主要的挑戰。此外，我們討論了增強翻譯品質和減輕文化偏見的潛在解決方案，包括人為迴圈驗證和反覆翻譯排名。我們的分析突顯了對文化意識和嚴格驗證的基準的需求，以準確評估多語言 LLM 的推理和問答能力。
+
+##### **H-CoT: Hijacking the Chain-of-Thought Safety Reasoning Mechanism to Jailbreak Large Reasoning Models, Including OpenAI o1/o3, DeepSeek-R1, and Gemini 2.0 Flash Thinking**
+2502.12893v1 by Martin Kuo, Jianyi Zhang, Aolin Ding, Qinsi Wang, Louis DiValentin, Yujia Bao, Wei Wei, Da-Cheng Juan, Hai Li, Yiran Chen
+
+Large Reasoning Models (LRMs) have recently extended their powerful reasoning
+capabilities to safety checks, using chain-of-thought reasoning to decide
+whether a request should be answered. While this new approach offers a
+promising route for balancing model utility and safety, its robustness remains
+underexplored. To address this gap, we introduce Malicious-Educator, a
+benchmark that disguises extremely dangerous or malicious requests beneath
+seemingly legitimate educational prompts. Our experiments reveal severe
+security flaws in popular commercial-grade LRMs, including OpenAI o1/o3,
+DeepSeek-R1, and Gemini 2.0 Flash Thinking. For instance, although OpenAI's o1
+model initially maintains a high refusal rate of about 98%, subsequent model
+updates significantly compromise its safety; and attackers can easily extract
+criminal strategies from DeepSeek-R1 and Gemini 2.0 Flash Thinking without any
+additional tricks. To further highlight these vulnerabilities, we propose
+Hijacking Chain-of-Thought (H-CoT), a universal and transferable attack method
+that leverages the model's own displayed intermediate reasoning to jailbreak
+its safety reasoning mechanism. Under H-CoT, refusal rates decline sharply,
+dropping from 98% to below 2%, and in some instances initially cautious tones
+even turn into ones that are willing to provide harmful content.
+We hope these findings underscore the urgent need for more robust safety
+mechanisms to preserve the benefits of advanced reasoning capabilities without
+compromising ethical standards.
+
+摘要：大型推理模型 (LRM) 最近將其強大的推理能力擴展到安全檢查，使用思維鏈推理來決定是否應回答請求。雖然這種新方法為平衡模型實用性和安全性提供了一條有希望的途徑，但其穩健性仍未得到充分探索。為了解決這一差距，我們引入了 Malicious-Educator，這是一個基準，它將極其危險或惡意的請求偽裝在看似合法的教育提示之下。我們的實驗揭示了流行的商業級 LRM 中嚴重的安全缺陷，包括 OpenAI o1/o3、DeepSeek-R1 和 Gemini 2.0 Flash Thinking。例如，儘管 OpenAI 的 o1 模型最初保持約 98% 的高拒絕率，但後續的模型更新顯著損害了其安全性；攻擊者可以輕鬆地從 DeepSeek-R1 和 Gemini 2.0 Flash Thinking 中提取犯罪策略，而無需任何額外的技巧。為了進一步強調這些漏洞，我們提出了劫持思維鏈 (H-CoT)，這是一種通用且可轉移的攻擊方法，它利用模型自己顯示的中間推理來越獄其安全推理機制。在 H-CoT 下，拒絕率急劇下降，從 98% 降至 2% 以下，在某些情況下，甚至將最初謹慎的語氣轉變為願意提供有害內容的語氣。我們希望這些發現強調了對更強大的安全機制的迫切需要，以保留先進推理能力的好處，同時不損害道德標準。
+
+##### **Are Multilingual Language Models an Off-ramp for Under-resourced Languages? Will we arrive at Digital Language Equality in Europe in 2030?**
+2502.12886v1 by Georg Rehm, Annika Grützner-Zahn, Fabio Barth
+
+Large language models (LLMs) demonstrate unprecedented capabilities and
+define the state of the art for almost all natural language processing (NLP)
+tasks and also for essentially all Language Technology (LT) applications. LLMs
+can only be trained for languages for which a sufficient amount of pre-training
+data is available, effectively excluding many languages that are typically
+characterised as under-resourced. However, there is both circumstantial and
+empirical evidence that multilingual LLMs, which have been trained using data
+sets that cover multiple languages (including under-resourced ones), do exhibit
+strong capabilities for some of these under-resourced languages. Eventually,
+this approach may have the potential to be a technological off-ramp for those
+under-resourced languages for which "native" LLMs, and LLM-based technologies,
+cannot be developed due to a lack of training data. This paper, which
+concentrates on European languages, examines this idea, analyses the current
+situation in terms of technology support and summarises related work. The
+article concludes by focusing on the key open questions that need to be
+answered for the approach to be put into practice in a systematic way.
+
+摘要：大型語言模型 (LLM) 展現前所未有的能力，並定義了幾乎所有自然語言處理 (NLP) 任務以及所有語言技術 (LT) 應用的最新技術。LLM 只能針對有足夠預訓練資料可用的語言進行訓練，實際上排除了許多通常被歸類為資源不足的語言。然而，有環境和經驗證據顯示，多語言 LLM 已使用涵蓋多種語言（包括資源不足的語言）的資料集進行訓練，確實對其中一些資源不足的語言展現出強大的能力。最終，這種方法可能具有成為那些由於缺乏訓練資料而無法開發「原生」LLM 和基於 LLM 的技術的資源不足語言的技術跳板的潛力。本文專注於歐洲語言，探討這個想法，分析技術支援方面的現狀，並總結相關工作。本文最後專注於必須回答的主要開放性問題，以便系統性地實踐這種方法。
+
+##### **How desirable is alignment between LLMs and linguistically diverse human users?**
+2502.12884v1 by Pia Knoeferle, Sebastian Möller, Dorothea Kolossa, Veronika Solopova, Georg Rehm
+
+We discuss how desirable it is that Large Language Models (LLMs) be able to
+adapt or align their language behavior with users who may be diverse in their
+language use. User diversity may arise from, among other factors, i) age
+differences, ii) gender characteristics, and/or iii) multilingual experience,
+and associated differences in language processing and use. We consider
+potential consequences for usability, communication, and LLM development.
+
+摘要：我們探討大型語言模型 (LLM) 能夠適應或調整其語言行為，以適應語言使用可能多樣化的使用者，這有多麼可取。使用者多樣性可能出於以下原因而產生：i) 年齡差異；ii) 性別特徵，和/或 iii) 多語言經驗，以及語言處理和使用上的相關差異。我們考慮對可用性、溝通和 LLM 開發的潛在後果。
+
+##### **Continuous Learning Conversational AI: A Personalized Agent Framework via A2C Reinforcement Learning**
+2502.12876v1 by Nandakishor M, Anjali M
+
+Creating personalized and adaptable conversational AI remains a key
+challenge. This paper introduces a Continuous Learning Conversational AI (CLCA)
+approach, implemented using A2C reinforcement learning, to move beyond static
+Large Language Models (LLMs). We use simulated sales dialogues, generated by
+LLMs, to train an A2C agent. This agent learns to optimize conversation
+strategies for personalization, focusing on engagement and delivering value.
+Our system architecture integrates reinforcement learning with LLMs for both
+data creation and response selection. This method offers a practical way to
+build personalized AI companions that evolve through continuous learning,
+advancing beyond traditional static LLM techniques.
+
+摘要：建立個人化且適應性強的對話式 AI 仍然是一項關鍵挑戰。本文介紹了一種持續學習對話式 AI (CLCA) 方法，透過 A2C 強化學習實作，以超越靜態大型語言模型 (LLM)。我們使用 LLM 生成的模擬銷售對話來訓練 A2C 代理。此代理會學習最佳化對話策略以實現個人化，並專注於參與和提供價值。我們的系統架構將強化學習與 LLM 整合，用於資料建立和回應選取。此方法提供了一種實用的方式來建立個人化 AI 伴侶，這些伴侶會透過持續學習而演進，超越傳統的靜態 LLM 技術。
+
+##### **PAFT: Prompt-Agnostic Fine-Tuning**
+2502.12859v1 by Chenxing Wei, Yao Shu, Mingwen Ou, Ying Tiffany He, Fei Richard Yu
+
+While Large Language Models (LLMs) adapt well to downstream tasks after
+fine-tuning, this adaptability often compromises prompt robustness, as even
+minor prompt variations can significantly degrade performance. To address this,
+we propose Prompt-Agnostic Fine-Tuning (PAFT), a simple yet effective approach
+that dynamically adjusts prompts during fine-tuning. This encourages the model
+to learn underlying task principles rather than overfitting to specific prompt
+formulations. PAFT operates in two stages: First, a diverse set of meaningful,
+synthetic candidate prompts is constructed. Second, during fine-tuning, prompts
+are randomly sampled from this set to create dynamic training inputs. Extensive
+experiments across diverse datasets and LLMs demonstrate that models trained
+with PAFT exhibit strong robustness and generalization across a wide range of
+prompts, including unseen ones. This enhanced robustness improves both model
+performance and inference speed while maintaining training efficiency. Ablation
+studies further confirm the effectiveness of PAFT.
+
+摘要：儘管大型語言模型 (LLM) 在微調後能很好地適應下游任務，但這種適應性通常會損害提示的穩健性，因為即使微小的提示變異也會大幅降低效能。為了解決這個問題，我們提出提示不可知微調 (PAFT)，這是一種簡單卻有效的方法，可以在微調期間動態調整提示。這鼓勵模型學習底層任務原則，而不是過度擬合特定的提示表述。PAFT 分為兩個階段運作：首先，構建一組多樣化、有意義的合成候選提示。其次，在微調期間，從此集合中隨機抽取提示以建立動態訓練輸入。針對各種資料集和 LLM 進行的廣泛實驗表明，使用 PAFT 訓練的模型在各種提示中表現出強大的穩健性和概括性，包括未見過的提示。這種增強的穩健性同時改善了模型效能和推理速度，同時維持訓練效率。消融研究進一步證實了 PAFT 的有效性。
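+
+A minimal sketch of the second PAFT stage described above, assuming a small hand-written pool of templates stands in for the synthetic candidate prompts of stage one. The names `CANDIDATE_PROMPTS` and `build_dynamic_batch` are hypothetical and are not taken from the paper or its code.
+
```python
import random

# Stage one (assumed): a pool of candidate prompt templates. The paper builds
# these synthetically; the three below are hand-written placeholders.
CANDIDATE_PROMPTS = [
    "Question: {question}\nAnswer:",
    "Please answer the following.\n{question}\n",
    "{question}\nGive the final answer:",
]


def build_dynamic_batch(examples, prompts, rng):
    """Stage two: pair each training example with a randomly sampled prompt."""
    batch = []
    for example in examples:
        template = rng.choice(prompts)
        batch.append({
            "input": template.format(question=example["question"]),
            "target": example["answer"],
        })
    return batch


rng = random.Random(0)  # fixed seed so the sketch is reproducible
examples = [
    {"question": "What is 2 + 3?", "answer": "5"},
    {"question": "What is the capital of France?", "answer": "Paris"},
]
for epoch in range(2):
    # Templates are redrawn on every pass, so the same example can be seen
    # under different prompt formulations across epochs.
    batch = build_dynamic_batch(examples, CANDIDATE_PROMPTS, rng)
```
+
+Only the input formatting changes between draws; the targets stay fixed, which is what would push a model toward the task rather than a particular prompt wording.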
+
+##### **Rejected Dialects: Biases Against African American Language in Reward Models**
+2502.12858v1 by Joel Mire, Zubin Trivadi Aysola, Daniel Chechelnitsky, Nicholas Deas, Chrysoula Zerva, Maarten Sap
+
+Preference alignment via reward models helps build safe, helpful, and
+reliable large language models (LLMs). However, subjectivity in preference
+judgments and the lack of representative sampling in preference data collection
+can introduce new biases, hindering reward models' fairness and equity. In this
+work, we introduce a framework for evaluating dialect biases in reward models
+and conduct a case study on biases against African American Language (AAL)
+through several experiments comparing reward model preferences and behavior on
+paired White Mainstream English (WME) and both machine-translated and
+human-written AAL corpora. We show that reward models are less aligned with
+human preferences when processing AAL texts vs. WME ones (-4% accuracy on
+average), frequently disprefer AAL-aligned texts vs. WME-aligned ones, and
+steer conversations toward WME, even when prompted with AAL texts. Our findings
+provide a targeted analysis of anti-AAL biases at a relatively understudied
+stage in LLM development, highlighting representational harms and ethical
+questions about the desired behavior of LLMs concerning AAL.
+
+摘要：透過獎勵模型進行偏好比對有助於建立安全、有用的可靠大型語言模型 (LLM)。然而，偏好判斷的主觀性，以及偏好資料收集中缺乏代表性抽樣，可能會引進新的偏誤，阻礙獎勵模型的公平性和公正性。在這項工作中，我們引進一個用於評估獎勵模型中方言偏誤的架構，並透過數個實驗進行案例研究，探討針對非裔美國人語言 (AAL) 的偏誤，這些實驗比較了獎勵模型偏好和行為，比較成對的白人主流英語 (WME) 與機器翻譯和人類撰寫的 AAL 語料庫。我們顯示，與處理 WME 文字相比，獎勵模型在處理 AAL 文字時與人類偏好較不一致（平均準確度降低 4%），經常不偏好與 AAL 一致的文字，而偏好與 WME 一致的文字，並將對話導向 WME，即使提示的是 AAL 文字。我們的發現針對 LLM 開發中相對未受重視的階段，提供針對反 AAL 偏誤的目標分析，強調與表徵相關的危害和關於 LLM 對 AAL 的期望行為的倫理問題。
+
+##### **Integrating Arithmetic Learning Improves Mathematical Reasoning in Smaller Models**
+2502.12855v1 by Neeraj Gangwar, Suma P Bhat, Nickvash Kani
+
+While large models pre-trained on high-quality data exhibit excellent
+performance across various reasoning tasks, including mathematical reasoning
+(e.g. GSM8k, MultiArith), specializing smaller models to excel at mathematical
+reasoning remains a challenging problem. Common approaches to address this
+challenge include knowledge distillation, where smaller student models learn
+from large pre-trained teacher models, and data augmentation, such as
+rephrasing questions. Despite these efforts, smaller models struggle with
+arithmetic computations, leading to errors in mathematical reasoning. In this
+work, we focus on leveraging a programmatically generated arithmetic dataset to
+enhance the reasoning capabilities of smaller models. We investigate two key
+approaches to incorporate this dataset -- (1) intermediate fine-tuning, where a
+model is fine-tuned on the arithmetic dataset before being trained on a
+reasoning dataset, and (2) integrating the arithmetic dataset into the
+instruction-tuning mixture, allowing the model to learn arithmetic skills
+alongside general instruction-following abilities. Our experiments on multiple
+reasoning benchmarks demonstrate that incorporating an arithmetic dataset,
+whether through targeted fine-tuning or within the instruction-tuning mixture,
+enhances the models' arithmetic capabilities, which in turn improves their
+mathematical reasoning performance.
+
+摘要：大型模型经过针对高质量数据的预训练，在各种推理任务中表现出色，包括数学推理（例如 GSM8k、MultiArith），但专门化小型模型以擅长数学推理仍然是一个具有挑战性的问题。解决这一挑战的常见方法包括知识蒸馏，其中较小的学生模型从经过预训练的大型教师模型中学习，以及数据增强，例如重新表述问题。尽管做出了这些努力，较小的模型在算术计算中仍然存在困难，从而导致数学推理错误。在这项工作中，我们专注于利用程序化生成的算术数据集来增强较小模型的推理能力。我们研究了两种关键方法来合并此数据集——（1）中间微调，其中模型在算术数据集上进行微调，然后在推理数据集上进行训练，以及（2）将算术数据集集成到指令微调混合中，允许模型学习算术技能以及一般的指令遵循能力。我们在多个推理基准上的实验表明，通过有针对性的微调或在指令微调混合中合并算术数据集，增强了模型的算术能力，进而提高了它们的数学推理性能。
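+
+A rough sketch of what a programmatically generated arithmetic dataset could look like, and of the second approach above (adding it to an instruction-tuning mixture). The operand range, the prompt format, and the names `make_arithmetic_dataset` and `mix_datasets` are assumptions for illustration; the abstract does not specify them.
+
```python
import operator
import random

# Operators covered by this sketch (an assumption, not the paper's list).
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def make_arithmetic_example(rng, max_operand=999):
    """Create one prompt/completion pair such as '12 + 7 =' -> '19'."""
    a = rng.randint(0, max_operand)
    b = rng.randint(0, max_operand)
    symbol = rng.choice(sorted(OPS))
    return {"prompt": f"{a} {symbol} {b} =", "completion": str(OPS[symbol](a, b))}


def make_arithmetic_dataset(n, seed=0):
    """Generate n examples deterministically from a seed."""
    rng = random.Random(seed)
    return [make_arithmetic_example(rng) for _ in range(n)]


def mix_datasets(instruction_data, arithmetic_data, seed=0):
    """Approach (2): shuffle arithmetic examples into the instruction mixture."""
    mixed = list(instruction_data) + list(arithmetic_data)
    random.Random(seed).shuffle(mixed)
    return mixed


dataset = make_arithmetic_dataset(5)
instruction_data = [{"prompt": "Summarize: ...", "completion": "..."}]
mixture = mix_datasets(instruction_data, dataset)
```
+
+For approach (1), the same `dataset` would instead be used on its own for an intermediate fine-tuning pass before training on the reasoning data.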
+
+##### **S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning**
+2502.12853v1 by Ruotian Ma, Peisong Wang, Cheng Liu, Xingyan Liu, Jiaqi Chen, Bang Zhang, Xin Zhou, Nan Du, Jia Li
+
+Recent studies have demonstrated the effectiveness of LLM test-time scaling.
+However, existing approaches to incentivize LLMs' deep thinking abilities
+generally require large-scale data or significant training efforts. Meanwhile,
+it remains unclear how to improve the thinking abilities of less powerful base
+models. In this work, we introduce S$^2$R, an efficient framework that enhances
+LLM reasoning by teaching models to self-verify and self-correct during
+inference. Specifically, we first initialize LLMs with iterative
+self-verification and self-correction behaviors through supervised fine-tuning
+on carefully curated data. The self-verification and self-correction skills are
+then further strengthened by both outcome-level and process-level reinforcement
+learning, with minimized resource requirements, enabling the model to
+adaptively refine its reasoning process during inference. Our results
+demonstrate that, with only 3.1k self-verifying and self-correcting behavior
+initialization samples, Qwen2.5-math-7B achieves an accuracy improvement from
+51.0% to 81.6%, outperforming models trained on an equivalent amount of
+long-CoT distilled data. Extensive experiments and analysis based on three base
+models across both in-domain and out-of-domain benchmarks validate the
+effectiveness of S$^2$R. Our code and data are available at
+https://github.com/NineAbyss/S2R.
+
+摘要：最近的研究表明了 LLM 测试时间扩展的有效性。然而，现有激励 LLM 深度思考能力的方法通常需要大规模数据或大量的训练工作。同时，如何提高较弱基础模型的思考能力仍然不清楚。在这项工作中，我们引入了 S$^2$R，一个通过教导模型在推理过程中进行自我验证和自我纠正来增强 LLM 推理的有效框架。具体来说，我们首先通过监督微调对精心整理的数据来初始化具有迭代自我验证和自我纠正行为的 LLM。然后通过结果级别和过程级别的强化学习进一步加强自我验证和自我纠正技能，同时最大程度地减少资源需求，使模型能够在推理过程中自适应地优化其推理过程。我们的结果表明，仅使用 3.1k 个自我验证和自我纠正行为初始化样本，Qwen2.5-math-7B 的准确率从 51.0% 提高到 81.6%，优于在等量长 CoT 蒸馏数据上训练的模型。基于三个基础模型在域内和域外基准上的广泛实验和分析验证了 S$^2$R 的有效性。我们的代码和数据可以在 https://github.com/NineAbyss/S2R 获得。
+
+##### **MVL-SIB: A Massively Multilingual Vision-Language Benchmark for Cross-Modal Topical Matching**
+2502.12852v1 by Fabian David Schmidt, Florian Schneider, Chris Biemann, Goran Glavaš
+
+Existing multilingual vision-language (VL) benchmarks often only cover a
+handful of languages. Consequently, evaluations of large vision-language models
+(LVLMs) predominantly target high-resource languages, underscoring the need for
+evaluation data for low-resource languages. To address this limitation, we
+introduce MVL-SIB, a massively multilingual vision-language benchmark that
+evaluates both cross-modal and text-only topical matching across 205 languages
+-- over 100 more than the most multilingual existing VL benchmarks encompass.
+We then benchmark a range of open-weight LVLMs together with GPT-4o(-mini)
+on MVL-SIB. Our results reveal that LVLMs struggle in cross-modal topic
+matching in lower-resource languages, performing no better than chance on
+languages like N'Koo. Our analysis further reveals that VL support in LVLMs
+declines disproportionately relative to textual support for lower-resource
+languages, as evidenced by comparison of cross-modal and text-only topical
+matching performance. We further observe that open-weight LVLMs do not benefit
+from representing a topic with more than one image, suggesting that these
+models are not yet fully effective at handling multi-image tasks. By
+correlating performance on MVL-SIB with other multilingual VL benchmarks, we
+highlight that MVL-SIB serves as a comprehensive probe of multilingual VL
+understanding in LVLMs.
+
+摘要：現有的多語言視覺語言 (VL) 基準通常只涵蓋少數語言。因此，大型視覺語言模型 (LVLMs) 的評估主要針對資源豐富的語言，強調了對資源匱乏語言的評估資料的需求。為了解決此限制，我們引入了 MVL-SIB，一個大規模的多語言視覺語言基準，它評估了 205 種語言的跨模態和純文字主題匹配，比現有的多語言 VL 基準涵蓋的語言多出 100 多種。然後，我們在 MVL-SIB 上對一系列開放權重的 LVLMs 與 GPT-4o(-mini) 進行了基準測試。我們的結果表明，LVLMs 在資源較少的語言中難以進行跨模態主題匹配，在 N'Koo 等語言上的表現不比隨機好。我們的分析進一步表明，LVLMs 中的 VL 支援相對於資源較少的語言的文字支援下降得不成比例，這從跨模態和純文字主題匹配效能的比較中可以看出。我們進一步觀察到，開放權重的 LVLMs 無法從用多於一張影像來表示主題中受益，這表明這些模型在處理多影像任務方面尚未完全有效。通過將 MVL-SIB 上的效能與其他多語言 VL 基準相關聯，我們強調 MVL-SIB 可作為 LVLMs 中多語言 VL 理解的綜合探測。
+
+##### **MeMo: Towards Language Models with Associative Memory Mechanisms**
+2502.12851v1 by Fabio Massimo Zanzotto, Elena Sofia Ruzzetti, Giancarlo A. Xompero, Leonardo Ranaldi, Davide Venditti, Federico Ranaldi, Cristina Giannone, Andrea Favalli, Raniero Romagnoli
+
+Memorization is a fundamental ability of Transformer-based Large Language
+Models, achieved through learning. In this paper, we propose a paradigm shift
+by designing an architecture to memorize text directly, bearing in mind the
+principle that memorization precedes learning. We introduce MeMo, a novel
+architecture for language modeling that explicitly memorizes sequences of
+tokens in layered associative memories. By design, MeMo offers transparency and
+the possibility of model editing, including forgetting texts. We experimented
+with the MeMo architecture, showing the memorization power of the one-layer and
+the multi-layer configurations.
+
+摘要：記憶是 Transformer 大型語言模型的基本能力，可透過學習達成。在本文中，我們提出一個典範轉移，透過設計一個架構來直接記憶文字，並牢記記憶先於學習的原則。我們導入 MeMo，一個新穎的語言建模架構，可明確地記憶分層關聯式記憶中的代幣序列。透過設計，MeMo 提供透明度和模型編輯的可能性，包括遺忘文字。我們實驗了 MeMo 架構，展示了單層和多層組態的記憶力。
+
+##### **Towards Adaptive Feedback with AI: Comparing the Feedback Quality of LLMs and Teachers on Experimentation Protocols**
+2502.12842v1 by Kathrin Seßler, Arne Bewersdorff, Claudia Nerdel, Enkelejda Kasneci
+
+Effective feedback is essential for fostering students' success in scientific
+inquiry. With advancements in artificial intelligence, large language models
+(LLMs) offer new possibilities for delivering instant and adaptive feedback.
+However, this feedback often lacks the pedagogical validation provided by
+real-world practitioners. To address this limitation, our study evaluates and
+compares the feedback quality of LLM agents with that of human teachers and
+science education experts on student-written experimentation protocols. Four
+blinded raters, all professionals in scientific inquiry and science education,
+evaluated the feedback texts generated by 1) the LLM agent, 2) the teachers and
+3) the science education experts using a five-point Likert scale based on six
+criteria of effective feedback: Feed Up, Feed Back, Feed Forward, Constructive
+Tone, Linguistic Clarity, and Technical Terminology. Our results indicate that
+LLM-generated feedback shows no significant difference to that of teachers and
+experts in overall quality. However, the LLM agent's performance lags in the
+Feed Back dimension, which involves identifying and explaining errors within
+the student's work context. Qualitative analysis highlighted the LLM agent's
+limitations in contextual understanding and in the clear communication of
+specific errors. Our findings suggest that combining LLM-generated feedback
+with human expertise can enhance educational practices by leveraging the
+efficiency of LLMs and the nuanced understanding of educators.
+
+摘要：有效的回饋對於培養學生在科學探究中的成功至關重要。隨著人工智慧的進步，大型語言模型 (LLM) 為提供即時且適應性的回饋提供了新的可能性。然而，此回饋通常缺乏實際從業者提供的教學驗證。為了解決此限制，我們的研究評估並比較了 LLM 代理與人類教師和科學教育專家在學生撰寫的實驗協定上的回饋品質。四位盲評者，皆為科學探究和科學教育專業人士，使用基於六個有效回饋準則的五點李克特量表評估由 1) LLM 代理、2) 教師和 3) 科學教育專家產生的回饋文字：鼓勵、回饋、前饋、建設性語氣、語言清晰度和技術術語。我們的結果表明，LLM 產生的回饋在整體品質上與教師和專家產生的回饋沒有顯著差異。然而，LLM 代理的表現落後於回饋面向，這涉及在學生的作業背景中識別和解釋錯誤。定性分析突顯了 LLM 代理在情境理解和明確傳達特定錯誤方面的限制。我們的研究結果表明，將 LLM 產生的回饋與人類專業知識相結合，可以透過利用 LLM 的效率和教育者的細緻理解來提升教育實務。
+
+##### **Towards Equitable AI: Detecting Bias in Using Large Language Models for Marketing**
+2502.12838v1 by Berk Yilmaz, Huthaifa I. Ashqar
+
+The recent advances in large language models (LLMs) have revolutionized
+industries such as finance, marketing, and customer service by enabling
+sophisticated natural language processing tasks. However, the broad adoption of
+LLMs brings significant challenges, particularly in the form of social biases
+that can be embedded within their outputs. Biases related to gender, age, and
+other sensitive attributes can lead to unfair treatment, raising ethical
+concerns and risking both company reputation and customer trust. This study
+examined bias in finance-related marketing slogans generated by LLMs (i.e.,
+ChatGPT) by prompting tailored ads targeting five demographic categories:
+gender, marital status, age, income level, and education level. A total of
+1,700 slogans were generated for 17 unique demographic groups, and key terms
+were categorized into four thematic groups: empowerment, financial, benefits
+and features, and personalization. Bias was systematically assessed using
+relative bias calculations and statistically tested with the Kolmogorov-Smirnov
+(KS) test against general slogans generated for any individual. Results
+revealed that marketing slogans are not neutral; rather, they emphasize
+different themes based on demographic factors. Women, younger individuals,
+low-income earners, and those with lower education levels receive more distinct
+messaging compared to older, higher-income, and highly educated individuals.
+This underscores the need to consider demographic-based biases in AI-generated
+marketing strategies and their broader societal implications. The findings of
+this study provide a roadmap for developing more equitable AI systems,
+highlighting the need for ongoing bias detection and mitigation efforts in
+LLMs.
+
+摘要：大型語言模型 (LLM) 的最新進展徹底改變了金融、行銷和客戶服務等產業，因為它能執行複雜的自然語言處理任務。然而，LLM 的廣泛採用帶來重大的挑戰，特別是潛藏在其輸出結果中的社會偏見形式。與性別、年齡和其他敏感屬性相關的偏見可能導致不公平的待遇，引發道德問題，並危及公司聲譽和客戶信任。本研究探討了 LLM（即 ChatGPT）產生的與金融相關的行銷標語中的偏見，方法是針對五個人口統計類別：性別、婚姻狀況、年齡、收入水準和教育水準，提示量身打造的廣告。總共為 17 個獨特的人口統計群組產生了 1,700 個標語，並且關鍵詞被分類為四個主題群組：賦權、財務、好處和功能，以及個人化。偏見使用相對偏見計算進行系統性評估，並使用科爾莫哥洛夫-史米諾夫 (KS) 檢定與針對任何個人產生的通用標語進行統計檢定。結果顯示行銷標語並非中立；相反地，它們根據人口統計因素強調不同的主題。與年紀較大、收入較高和受教育程度較高的個人相比，女性、年輕人、低收入者和教育程度較低者接收到的訊息更為不同。這強調了在 AI 生成的行銷策略中考量基於人口統計的偏見及其更廣泛的社會影響的必要性。本研究的發現提供了開發更公平 AI 系統的路線圖，突顯了在 LLM 中持續進行偏見偵測和緩解工作的重要性。
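+
+To make the statistical step above concrete, here is a small pure-Python sketch of the two-sample Kolmogorov-Smirnov statistic, the largest gap between two empirical distribution functions. The per-slogan term counts are invented placeholders, not data from the study, and the function name `ks_statistic` is an illustrative choice.
+
```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: max distance between the empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    largest_gap = 0.0
    # The maximum gap between two step functions occurs at an observed value,
    # so it is enough to evaluate both CDFs at every data point.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Hypothetical counts of "empowerment" terms per slogan (placeholder values).
targeted_group = [2, 3, 3, 4, 1]
general_baseline = [0, 1, 1, 2, 0]
d_stat = ks_statistic(targeted_group, general_baseline)
```
+
+A statistic near 0 means the two sets of slogans use the theme similarly, while a value near 1 means their distributions barely overlap. For a p-value, `scipy.stats.ks_2samp` is a library alternative.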
+
+##### **An LLM-Powered Agent for Physiological Data Analysis: A Case Study on PPG-based Heart Rate Estimation**
+2502.12836v1 by Mohammad Feli, Iman Azimi, Pasi Liljeberg, Amir M. Rahmani
+
+Large language models (LLMs) are revolutionizing healthcare by improving
+diagnosis, patient care, and decision support through interactive
+communication. More recently, they have been applied to analyzing physiological
+time-series like wearable data for health insight extraction. Existing methods
+embed raw numerical sequences directly into prompts, which exceeds token limits
+and increases computational costs. Additionally, some studies integrated
+features extracted from time-series in textual prompts or applied multimodal
+approaches. However, these methods often produce generic and unreliable outputs
+due to LLMs' limited analytical rigor and inefficiency in interpreting
+continuous waveforms. In this paper, we develop an LLM-powered agent for
+physiological time-series analysis that aims to bridge the gap in integrating LLMs
+with well-established analytical tools. Built on OpenCHA, an open-source
+LLM-powered framework, our agent features an orchestrator that integrates user
+interaction, data sources, and analytical tools to generate accurate health
+insights. To evaluate its effectiveness, we implement a case study on heart
+rate (HR) estimation from Photoplethysmogram (PPG) signals using a dataset of
+PPG and Electrocardiogram (ECG) recordings in a remote health monitoring study.
+The agent's performance is benchmarked against OpenAI GPT-4o-mini and GPT-4o,
+with ECG serving as the gold standard for HR estimation. Results demonstrate
+that our agent significantly outperforms benchmark models by achieving lower
+error rates and more reliable HR estimations. The agent implementation is
+publicly available on GitHub.
+
+摘要：大型語言模型 (LLM) 透過互動式溝通，改善診斷、病人照護和決策支援，進而革新醫療保健。最近，它們已應用於分析生理時間序列，例如可穿戴式裝置的資料，以萃取健康見解。現有方法會將原始數值序列直接嵌入提示中，這會超過權杖限制並增加運算成本。此外，一些研究將從時間序列中萃取的特徵整合到文字提示中，或應用多模態方法。然而，由於 LLM 在解譯連續波形時分析嚴謹度有限且效率不彰，這些方法經常產生通用且不可靠的輸出。在本文中，我們開發了一個由 LLM 驅動的代理，用於生理時間序列分析，旨在彌合將 LLM 與既有分析工具整合的差距。我們的代理建立在 OpenCHA（一個由 LLM 驅動的開源架構）之上，具備一個整合使用者互動、資料來源和分析工具的協調器，以產生準確的健康見解。為了評估其有效性，我們實作了一個案例研究，從遠距健康監測研究中的一組光電容積描記圖 (PPG) 和心電圖 (ECG) 記錄中估算心率 (HR)。該代理的效能與 OpenAI GPT-4o-mini 和 GPT-4o 進行基準測試，其中 ECG 作為 HR 估算的金標準。結果顯示，我們的代理透過達成較低的錯誤率和更可靠的 HR 估算，顯著優於基準模型。該代理實作已公開在 GitHub 上。
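+
+As an illustration of the kind of analytical tool an agent could call instead of reading raw samples from a prompt, the sketch below estimates heart rate from peak spacing in a synthetic waveform. Peak counting is an assumed, simplified method; the abstract does not state which algorithm the agent's tools use, and `estimate_heart_rate` is a hypothetical name.
+
```python
import math


def estimate_heart_rate(signal, sampling_rate_hz):
    """Estimate heart rate in beats per minute from the spacing of peaks."""
    mean_level = sum(signal) / len(signal)
    # A peak is a sample above the mean that rises from its left neighbour
    # and does not rise again at its right neighbour.
    peaks = [
        i for i in range(1, len(signal) - 1)
        if signal[i] > mean_level and signal[i - 1] < signal[i] >= signal[i + 1]
    ]
    if len(peaks) < 2:
        raise ValueError("need at least two peaks to estimate heart rate")
    mean_interval = (peaks[-1] - peaks[0]) / (len(peaks) - 1)  # in samples
    return 60.0 * sampling_rate_hz / mean_interval


# Synthetic stand-in for a PPG trace: a clean 1.2 Hz sine (72 beats per minute).
SAMPLING_RATE_HZ = 50
ppg = [math.sin(2 * math.pi * 1.2 * n / SAMPLING_RATE_HZ) for n in range(500)]
reference_hr = 72.0  # in the study's setup this role is played by ECG
estimated_hr = estimate_heart_rate(ppg, SAMPLING_RATE_HZ)
absolute_error = abs(estimated_hr - reference_hr)
```
+
+Real PPG recordings contain noise and motion artifacts, so a practical tool would filter the signal before peak detection; the sketch only shows how the estimate is compared against a reference value.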
+
+##### **Subword models struggle with word learning, but surprisal hides it**
+2502.12835v1 by Bastian Bunzeck, Sina Zarrieß
+
+We study word learning in subword and character language models with the
+psycholinguistic lexical decision task. While subword LMs struggle to discern
+words and non-words with high accuracy, character LMs solve this task easily
+and consistently. Furthermore, when comparing word learning and syntactic
+learning, the two processes are separable in character LMs, where word learning
+precedes syntactic learning, whereas they are simultaneous in
+subword LMs. This raises questions about the adequacy of subword LMs for
+modeling language acquisition and positions character LMs as a viable
+alternative.
+
+摘要：我們使用心理語言學的詞彙決策任務研究在子詞和字元語言模型中的詞彙學習。儘管子詞語言模型難以區分單詞和非單詞，但字元語言模型可以輕鬆且一致地解決此任務。此外，在比較單詞學習和句法學習時，這兩個過程在字元語言模型中是可分離的，其中單詞學習先於句法學習，而這些過程在子詞語言模型中是同時發生的。這引發了關於子詞語言模型對語言習得建模的充分性的問題，並將字元語言模型定位為可行的替代方案。
+
+##### **KazMMLU: Evaluating Language Models on Kazakh, Russian, and Regional Knowledge of Kazakhstan**
+2502.12829v1 by Mukhammed Togmanov, Nurdaulet Mukhituly, Diana Turmakhan, Jonibek Mansurov, Maiya Goloburda, Akhmed Sakip, Zhuohan Xie, Yuxia Wang, Bekassyl Syzdykov, Nurkhan Laiyk, Alham Fikri Aji, Ekaterina Kochmar, Preslav Nakov, Fajri Koto
+
+Despite having a population of twenty million, Kazakhstan's culture and
+language remain underrepresented in the field of natural language processing.
+Although large language models (LLMs) continue to advance worldwide, progress
+in the Kazakh language has been limited, as seen in the scarcity of dedicated
+models and benchmark evaluations. To address this gap, we introduce KazMMLU,
+the first MMLU-style dataset specifically designed for the Kazakh language. KazMMLU
+comprises 23,000 questions that cover various educational levels, including
+STEM, humanities, and social sciences, sourced from authentic educational
+materials and manually validated by native speakers and educators. The dataset
+includes 10,969 Kazakh questions and 12,031 Russian questions, reflecting
+Kazakhstan's bilingual education system and rich local context. Our evaluation
+of several state-of-the-art multilingual models (Llama-3.1, Qwen-2.5, GPT-4,
+and DeepSeek V3) demonstrates substantial room for improvement, as even the
+best-performing models struggle to achieve competitive performance in Kazakh
+and Russian. These findings underscore significant performance gaps compared to
+high-resource languages. We hope that our dataset will enable further research
+and development of Kazakh-centric LLMs. Data and code will be made available
+upon acceptance.
+
+摘要：儘管哈薩克人口達兩千萬，但哈薩克的文化和語言在自然語言處理領域仍未得到充分的重視。儘管大型語言模型 (LLM) 在全球持續進步，但哈薩克語的進展卻十分有限，這從專用模型和基準評估的稀缺性中可見一斑。為了解決這個差距，我們引入了 KazMMLU，這是第一個專門為哈薩克語設計的 MMLU 風格資料集。KazMMLU 包含 23,000 個問題，涵蓋各種教育層級，包括 STEM、人文學科和社會科學，這些問題來自真實的教育材料，並由母語人士和教育工作者手動驗證。該資料集包含 10,969 個哈薩克語問題和 12,031 個俄語問題，反映了哈薩克的雙語教育體系和豐富的在地脈絡。我們對幾個最先進的多語言模型（Llama-3.1、Qwen-2.5、GPT-4 和 DeepSeek V3）的評估顯示，仍有很大的改進空間，因為即使是效能最好的模型，也很難在哈薩克語和俄語中達到有競爭力的效能。這些發現強調了與資源豐富的語言相比，存在顯著的效能差距。我們希望我們的資料集能促進以哈薩克語為中心的 LLM 的進一步研究和開發。資料和程式碼將在獲得接受後提供。
+
+##### **Reasoning and the Trusting Behavior of DeepSeek and GPT: An Experiment Revealing Hidden Fault Lines in Large Language Models**
+2502.12825v1 by Rubing Lu, João Sedoc, Arun Sundararajan
+
+When encountering increasingly frequent performance improvements or cost
+reductions from a new large language model (LLM), developers of applications
+leveraging LLMs must decide whether to take advantage of these improvements or
+stay with older tried-and-tested models. Low perceived switching frictions can
+lead to choices that do not consider more subtle behavior changes that the
+transition may induce. Our experiments use a popular game-theoretic behavioral
+economics model of trust to show stark differences in the trusting behavior of
+OpenAI's and DeepSeek's models. We highlight a collapse in the economic trust
+behavior of the o1-mini and o3-mini models as they reconcile profit-maximizing
+and risk-seeking with future returns from trust, and contrast it with
+DeepSeek's more sophisticated and profitable trusting behavior that stems from
+an ability to incorporate deeper concepts like forward planning and
+theory-of-mind. As LLMs form the basis for high-stakes commercial systems, our
+results highlight the perils of relying on LLM performance benchmarks that are
+too narrowly defined and suggest that careful analysis of their hidden fault
+lines should be part of any organization's AI strategy.
+
+摘要：當遇到越來越頻繁的效能提升或來自於新的大型語言模型 (LLM) 的成本降低時，利用 LLM 的應用程式開發人員必須決定是否要利用這些提升或維持較舊且經過測試的模型。低感知切換摩擦可能會導致選擇不考慮轉換可能誘發的更細微的行為改變。我們的實驗使用信任的流行博弈論行為經濟模型來顯示 OpenAI 和 DeepSeek 模型在信任行為上的顯著差異。我們強調 o1-mini 和 o3-mini 模型的經濟信任行為崩潰，因為它們調和了利潤最大化和風險尋求與來自信任的未來回報，並將其與 DeepSeek 更複雜且有利可圖的信任行為進行對比，這種信任行為源於整合更深層的概念，例如前瞻性規劃和心智理論。由於 LLM 構成高風險商業系統的基礎，我們的結果突顯了依賴定義過於狹窄的 LLM 效能基準的危險性，並建議仔細分析其隱藏的斷層線應該是任何組織的 AI 策略的一部分。
+
+##### **Pitfalls of Scale: Investigating the Inverse Task of Redefinition in Large Language Models**
+2502.12821v1 by Elena Stringli, Maria Lymperaiou, Giorgos Filandrianos, Giorgos Stamou
+
+Inverse tasks can uncover potential reasoning gaps as Large Language Models
+(LLMs) scale up. In this work, we explore the redefinition task, in which we
+assign alternative values to well-known physical constants and units of
+measure, prompting LLMs to respond accordingly. Our findings show that not only
+does model performance degrade with scale, but its false confidence also rises.
+Moreover, while factors such as prompting strategies or response formatting are
+influential, they do not preclude LLMs from anchoring to memorized values.
+
+摘要：逆向任務可以揭示大型語言模型 (LLM) 擴展時潛在的推理差距。在本文中，我們探討重新定義任務，其中我們將替換值指定給著名的物理常數和測量單位，促使 LLM 做出相應回應。我們的研究結果表明，模型效能不僅會隨著規模而下降，其虛假信心也會上升。此外，儘管提示策略或回應格式等因素具有影響力，但它們並不妨礙 LLM 錨定在記憶值上。
+
+##### **Simulating User Diversity in Task-Oriented Dialogue Systems using Large Language Models**
+2502.12813v1 by Adnan Ahmad, Stefan Hillmann, Sebastian Möller
+
+In this study, we explore the application of Large Language Models (LLMs) for
+generating synthetic users and simulating user conversations with a
+task-oriented dialogue system and present detailed results and their analysis.
+We propose a comprehensive, novel user simulation technique that
+uses LLMs to create diverse user profiles, set goals, engage in multi-turn
+dialogues, and evaluate the conversation success. We employ two proprietary
+LLMs, namely GPT-4o and GPT-o1 (Achiam et al., 2023), to generate a
+heterogeneous base of user profiles, characterized by varied demographics,
+multiple user goals, different conversational styles, initial knowledge levels,
+interests, and conversational objectives. We perform a detailed analysis of the
+user profiles generated by LLMs to assess the diversity, consistency, and
+potential biases inherent in these LLM-generated user simulations. We find that
+GPT-o1 generates a more heterogeneous user distribution across most user
+attributes, while GPT-4o generates more skewed user attributes. The generated
+set of user profiles is then used to simulate dialogue sessions by
+interacting with a task-oriented dialogue system.
+
+摘要：在這項研究中，我們探討大型語言模型 (LLM) 在生成合成使用者和模擬使用者對話，並使用任務導向對話系統進行對話的應用，並提出詳細的結果及其分析。我們提出了一種全面的使用者模擬技術新方法，利用 LLM 建立多樣化的使用者概況、設定目標、參與多輪對話，並評估對話的成功性。我們採用了兩個專有的 LLM，即 GPT-4o 和 GPT-o1 (Achiam 等人，2023 年)，以生成一個異質的使用者概況基礎，其特徵在於不同的人口統計資料、多個使用者目標、不同的對話風格、初始知識水準、興趣和對話目標。我們對 LLM 生成的使用者概況進行了詳細分析，以評估這些 LLM 生成的使用者模擬中固有的多樣性、一致性和潛在偏差。我們發現 GPT-o1 在大多數使用者屬性中產生更異質的使用者分佈，而 GPT-4o 則產生更偏斜的使用者屬性。然後利用生成的使用者概況集，透過與任務導向對話系統互動來模擬對話會話。
+
+##### **Towards Text-Image Interleaved Retrieval**
+2502.12799v1 by Xin Zhang, Ziqi Dai, Yongqi Li, Yanzhao Zhang, Dingkun Long, Pengjun Xie, Meishan Zhang, Jun Yu, Wenjie Li, Min Zhang
+
+Current multimodal information retrieval studies mainly focus on single-image
+inputs, which limits real-world applications involving multiple images and
+text-image interleaved content. In this work, we introduce the text-image
+interleaved retrieval (TIIR) task, where the query and document are interleaved
+text-image sequences, and the model is required to understand the semantics
+from the interleaved context for effective retrieval. We construct a TIIR
+benchmark based on naturally interleaved wikiHow tutorials, where a specific
+pipeline is designed to generate interleaved queries. To explore the task, we
+adapt several off-the-shelf retrievers and build a dense baseline using an
+interleaved multimodal large language model (MLLM). We then propose a novel
+Matryoshka Multimodal Embedder (MME), which compresses the number of visual
+tokens at different granularities, to address the challenge of excessive visual
+tokens in MLLM-based TIIR models. Experiments demonstrate that simple adaptation
+of existing models does not consistently yield effective results. Our MME
+achieves significant improvements over the baseline with substantially fewer
+visual tokens. We provide extensive analysis and will release the dataset and
+code to facilitate future research.
+
+摘要：目前的多模態資訊檢索研究主要集中在單一影像輸入，這限制了涉及多個影像和文字影像交錯內容的實際應用。在這項工作中，我們引入了文字影像交錯檢索 (TIIR) 任務，其中查詢和文件是交錯的文字影像序列，並且模型需要理解交錯內容的語意以進行有效檢索。我們根據自然交錯的 wikiHow 教學課程建構了一個 TIIR 基準，其中設計了一個特定的管線來產生交錯查詢。為了探索這個任務，我們調整了幾個現成的檢索器，並透過交錯的多模態大型語言模型 (MLLM) 建立了一個密集的基準。然後，我們提出了一個新穎的 Matryoshka 多模態嵌入器 (MME)，它壓縮了不同粒度視覺符號的數量，以解決基於 MLLM 的 TIIR 模型中過多視覺符號的挑戰。實驗表明，對現有模型的簡單調整並未持續產生有效結果。我們的 MME 透過大幅減少視覺符號，達到了比基準顯著的改進。我們提供了廣泛的分析，並將釋出資料集和程式碼以促進未來的研究。
+
+##### **Envious Explore and Exploit**
+2502.12798v1 by Omer Ben-Porat, Yotam Gafni, Or Markovetzki
+
+Explore-and-exploit tradeoffs play a key role in recommendation systems
+(RSs), aiming at serving users better by learning from previous interactions.
+Despite their commercial success, the societal effects of explore-and-exploit
+mechanisms are not well understood, especially regarding the utility
+discrepancy they generate between different users. In this work, we measure
+such discrepancy using the economic notion of envy. We present a multi-armed
+bandit-like model in which every round consists of several sessions, and
+rewards are realized once per round. We call the latter property reward
+consistency, and show that the RS can leverage this property for better
+societal outcomes. On the downside, doing so also generates envy, as
+late-to-arrive users enjoy the information gathered by early-to-arrive users.
+We examine the generated envy under several arrival order mechanisms and
+virtually any anonymous algorithm, i.e., any algorithm that treats all similar
+users similarly without leveraging their identities. We provide tight envy
+bounds on uniform arrival and upper bound the envy for nudged arrival, in which
+the RS can affect the order of arrival by nudging its users. Furthermore, we
+study the efficiency-fairness trade-off by devising an algorithm that allows
+constant envy and approximates the optimal welfare in restricted settings.
+Finally, we validate our theoretical results empirically using simulations.
+
+摘要：探索與利用的權衡在推薦系統 (RS) 中扮演關鍵角色，旨在透過學習先前的互動來為使用者提供更好的服務。儘管在商業上獲得成功，探索與利用機制的社會效應仍未被充分理解，特別是它們在不同使用者之間產生的效用差異。在這項工作中，我們使用經濟學中的嫉妒概念來衡量這種差異。我們提出了一個類似多臂老虎機的模型，其中每一輪包含多個工作階段，而獎勵每輪只實現一次。我們將後者的特性稱為獎勵一致性，並證明 RS 可以利用此特性來獲得更好的社會成果。不利的是，這麼做也會產生嫉妒，因為較晚到達的使用者可以享受較早到達的使用者所收集的資訊。我們在多種到達順序機制和幾乎任何匿名演算法（即以類似方式對待所有類似使用者、而不利用其身分的演算法）下檢驗所產生的嫉妒。我們為均勻到達提供緊緻的嫉妒界限，並為推動式到達給出嫉妒的上界，其中 RS 可以透過推動其使用者來影響到達順序。此外，我們設計了一種演算法來研究效率與公平的權衡，該演算法允許恆定的嫉妒，並在受限設定中近似最佳福利。最後，我們使用模擬對理論結果進行實證驗證。
+
+##### **Commonsense Reasoning in Arab Culture**
+2502.12788v1 by Abdelrahman Sadallah, Junior Cedric Tonga, Khalid Almubarak, Saeed Almheiri, Farah Atif, Chatrine Qwaider, Karima Kadaoui, Sara Shatnawi, Yaser Alesh, Fajri Koto
+
+Despite progress in Arabic large language models, such as Jais and AceGPT,
+their evaluation on commonsense reasoning has largely relied on
+machine-translated datasets, which lack cultural depth and may introduce
+Anglocentric biases. Commonsense reasoning is shaped by geographical and
+cultural contexts, and existing English datasets fail to capture the diversity
+of the Arab world. To address this, we introduce \datasetname, a commonsense
+reasoning dataset in Modern Standard Arabic (MSA), covering cultures of 13
+countries across the Gulf, Levant, North Africa, and the Nile Valley. The
+dataset was built from scratch by engaging native speakers to write and
+validate culturally relevant questions for their respective countries.
+\datasetname spans 12 daily life domains with 54 fine-grained subtopics,
+reflecting various aspects of social norms, traditions, and everyday
+experiences. Zero-shot evaluations show that open-weight language models with
+up to 32B parameters struggle to comprehend diverse Arab cultures, with
+performance varying across regions. These findings highlight the need for more
+culturally aware models and datasets tailored to the Arabic-speaking world.
+
+摘要：儘管阿拉伯語大型語言模型（例如 Jais 和 AceGPT）已有進展，但它們在常識推理上的評估在很大程度上依賴於機器翻譯的資料集，這些資料集缺乏文化深度，可能會引入以英語為中心的偏見。常識推理受地理和文化背景影響，現有的英文資料集無法捕捉阿拉伯世界的多樣性。為了解決這個問題，我們引入了 \datasetname，一個現代標準阿拉伯語 (MSA) 的常識推理資料集，涵蓋海灣地區、黎凡特地區、北非和尼羅河谷 13 個國家的文化。此資料集是從頭開始建立的，由母語人士參與編寫和驗證他們各自國家的文化相關問題。\datasetname 涵蓋 12 個日常生活領域，包含 54 個細緻的主題，反映社會規範、傳統和日常經驗的各個方面。零次學習評估顯示，具有高達 32B 參數的開放式權重語言模型難以理解不同的阿拉伯文化，且各區域的表現不一。這些發現突顯了對更具文化意識的模型和專為阿拉伯語系世界量身打造的資料集的需求。
+
+##### **VidCapBench: A Comprehensive Benchmark of Video Captioning for Controllable Text-to-Video Generation**
+2502.12782v1 by Xinlong Chen, Yuanxing Zhang, Chongling Rao, Yushuo Guan, Jiaheng Liu, Fuzheng Zhang, Chengru Song, Qiang Liu, Di Zhang, Tieniu Tan
+
+The training of controllable text-to-video (T2V) models relies heavily on the
+alignment between videos and captions, yet little existing research connects
+video caption evaluation with T2V generation assessment. This paper introduces
+VidCapBench, a video caption evaluation scheme specifically designed for T2V
+generation, agnostic to any particular caption format. VidCapBench employs a
+data annotation pipeline, combining expert model labeling and human refinement,
+to associate each collected video with key information spanning video
+aesthetics, content, motion, and physical laws. VidCapBench then partitions
+these key information attributes into automatically assessable and manually
+assessable subsets, catering to both the rapid evaluation needs of agile
+development and the accuracy requirements of thorough validation. By evaluating
+numerous state-of-the-art captioning models, we demonstrate the superior
+stability and comprehensiveness of VidCapBench compared to existing video
+captioning evaluation approaches. Verification with off-the-shelf T2V models
+reveals a significant positive correlation between scores on VidCapBench and
+the T2V quality evaluation metrics, indicating that VidCapBench can provide
+valuable guidance for training T2V models. The project is available at
+https://github.com/VidCapBench/VidCapBench.
+
+摘要：可控制文本到影片 (T2V) 模型的訓練極度仰賴影片和字幕之間的對齊，但現有研究鮮少將影片字幕評估與 T2V 生成評估連結起來。本文介紹 VidCapBench，這是一種專門為 T2V 生成設計的影片字幕評估架構，與任何特定的字幕格式無關。VidCapBench 採用資料標註流程，結合專家模型標記和人工微調，將每個收集到的影片與涵蓋影片美學、內容、動作和物理定律等關鍵資訊關聯起來。VidCapBench 接著將這些關鍵資訊屬性分割成可自動評估和可手動評估的子集，以滿足敏捷開發的快速評估需求和全面驗證的準確性要求。透過評估許多最先進的字幕模型，我們證明了 VidCapBench 與現有的影片字幕評估方法相比，具有優異的穩定性和全面性。使用現成的 T2V 模型驗證顯示，VidCapBench 得分與 T2V 品質評估指標之間存在顯著的正相關，這表示 VidCapBench 可以為訓練 T2V 模型提供有價值的指導。專案可於 https://github.com/VidCapBench/VidCapBench 取得。
+
+##### **Portable Reward Tuning: Towards Reusable Fine-Tuning across Different Pretrained Models**
+2502.12776v1 by Daiki Chijiwa, Taku Hasegawa, Kyosuke Nishida, Kuniko Saito, Susumu Takeuchi
+
+While foundation models have been exploited for various expert tasks through
+fine-tuning, any foundation model will become outdated due to its old knowledge
+or limited capability. Thus the underlying foundation model must eventually be
+replaced by a new one, which incurs the repeated cost of fine-tuning each new
+model. Existing work addresses this problem by inference-time
+tuning, i.e., modifying the output probabilities from the new foundation model
+with the outputs from the old foundation model and its fine-tuned model, which
+involves an additional overhead in inference by the latter two models. In this
+paper, we propose a new fine-tuning principle, Portable Reward Tuning (PRT),
+that inherently reduces the inference overhead, based on a reformulation of
+fine-tuning as reward maximization. Specifically, instead of fine-tuning
+parameters of the foundation models, PRT trains the reward model explicitly
+through the same loss function as in fine-tuning. During inference, the reward
+model can be used with any foundation model (with the same set of vocabularies
+or labels) through the formulation of reward maximization. Experimental
+results, covering both vision and language models, demonstrate that the
+PRT-trained model can achieve comparable accuracy to the existing work of
+inference-time tuning, with less inference cost.
+
+摘要：儘管基礎模型已透過微調用於各種專家任務，任何基礎模型都將因其舊知識或有限功能而過時。因此，基礎模型最終應由新模型取代，這導致重複微調這些新模型的成本。現有工作透過推論時間調整來解決這個問題，即使用舊基礎模型及其微調模型的輸出修改新基礎模型的輸出機率，這涉及後兩個模型在推論中的額外開銷。在本文中，我們提出一個新的微調原則，可攜式獎勵調整 (PRT)，它本質上會減少推論開銷，基於將微調重新表述為獎勵最大化。具體來說，PRT 不是微調基礎模型的參數，而是透過與微調中相同的損失函數明確訓練獎勵模型。在推論期間，獎勵模型可透過獎勵最大化的公式與任何基礎模型（具有相同的詞彙或標籤組）一起使用。涵蓋視覺和語言模型的實驗結果證明，PRT 訓練的模型可以達到與現有推論時間調整工作相當的準確度，且推論成本較低。
+
+##### **Mind the Gap: Aligning the Brain with Language Models Requires a Nonlinear and Multimodal Approach**
+2502.12771v1 by Danny Dongyeop Han, Yunju Cho, Jiook Cha, Jay-Yoon Lee
+
+Self-supervised language and audio models effectively predict brain responses
+to speech. However, traditional prediction models rely on linear mappings from
+unimodal features, despite the complex integration of auditory signals with
+linguistic and semantic information across widespread brain networks during
+speech comprehension. Here, we introduce a nonlinear, multimodal prediction
+model that combines audio and linguistic features from pre-trained models
+(e.g., LLAMA, Whisper). Our approach achieves a 17.2% and 17.9% improvement in
+prediction performance (unnormalized and normalized correlation) over
+traditional unimodal linear models, as well as a 7.7% and 14.4% improvement,
+respectively, over prior state-of-the-art models. These improvements represent
+a major step towards future robust in-silico testing and improved decoding
+performance. They also reveal how auditory and semantic information are fused
+in motor, somatosensory, and higher-level semantic regions, aligning with
+existing neurolinguistic theories. Overall, our work highlights the often
+neglected potential of nonlinear and multimodal approaches to brain modeling,
+paving the way for future studies to embrace these strategies in naturalistic
+neurolinguistics research.
+
+摘要：自我監督的語言和音訊模型有效預測大腦對語言的反應。然而，傳統的預測模型依賴於單模態特徵的線性映射，儘管在語言理解過程中，聽覺信號與語言和語義資訊在廣泛的腦網路中進行複雜的整合。在此，我們引入一個非線性、多模態預測模型，結合預先訓練模型（例如，LLAMA、Whisper）中的音訊和語言特徵。我們的做法在預測效能上（未正規化和正規化相關性）分別比傳統的單模態線性模型提升了 17.2% 和 17.9%，分別比先前的最先進模型提升了 7.7% 和 14.4%。這些改進代表了未來穩健的電腦模擬測試和改進的解碼效能邁出了一大步。它們也揭示了聽覺和語義資訊如何在運動、體感和更高層次的語義區域中融合，與現有的神經語言學理論一致。總的來說，我們的研究突出了非線性和多模態大腦建模方法經常被忽略的潛力，為未來研究在自然主義神經語言學研究中採用這些策略鋪平了道路。
+
+##### **How Much Do LLMs Hallucinate across Languages? On Multilingual Estimation of LLM Hallucination in the Wild**
+2502.12769v1 by Saad Obaid ul Islam, Anne Lauscher, Goran Glavaš
+
+In the age of misinformation, hallucination -- the tendency of Large Language
+Models (LLMs) to generate non-factual or unfaithful responses -- represents the
+main risk for their global utility. Despite LLMs becoming increasingly
+multilingual, the vast majority of research on detecting and quantifying LLM
+hallucination is (a) English-centric and (b) focused on machine translation (MT)
+and summarization, tasks that are less common "in the wild" than open
+information seeking. In contrast, we aim to quantify the extent of LLM
+hallucination across languages in knowledge-intensive long-form question
+answering. To this end, we train a multilingual hallucination detection model
+and conduct a large-scale study across 30 languages and 6 open-source LLM
+families. We start from an English hallucination detection dataset and rely on
+MT to generate (noisy) training data in other languages. We also manually
+annotate gold data for five high-resource languages; we then demonstrate, for
+these languages, that the estimates of hallucination rates are similar between
+silver (LLM-generated) and gold test sets, validating the use of silver data
+for estimating hallucination rates for other languages. For the final rate
+estimation, we build a knowledge-intensive QA dataset for 30 languages with
+LLM-generated prompts and Wikipedia articles as references. We find that, while
+LLMs generate longer responses with more hallucinated tokens for
+higher-resource languages, there is no correlation between length-normalized
+hallucination rates of languages and their digital representation. Further, we
+find that smaller LLMs exhibit larger hallucination rates than larger models.
+
+摘要：在錯誤訊息的時代，幻覺（即大型語言模型 (LLM) 產生非事實或不忠實回應的傾向）代表其全球效用的主要風險。儘管 LLM 變得越來越多語言化，但絕大多數關於偵測和量化 LLM 幻覺的研究都是 (a) 以英語為中心，(b) 專注於機器翻譯 (MT) 和摘要，這些任務在「野外」中不如開放式資訊搜尋常見。相反地，我們旨在量化 LLM 在知識密集型長篇問答中跨語言的幻覺程度。為此，我們訓練了一個多語言幻覺偵測模型，並針對 30 種語言和 6 個開放原始碼 LLM 家族進行大規模研究。我們從一個英語幻覺偵測資料集開始，並依賴 MT 在其他語言中產生（有雜訊的）訓練資料。我們還手動為五種高資源語言註解黃金資料；然後我們證明，對於這些語言，幻覺率的估計值在白銀（LLM 產生）和黃金測試集之間是相似的，驗證了使用白銀資料來估計其他語言的幻覺率。對於最終的比率估計，我們建立了一個知識密集型問答資料集，其中包含 30 種語言，並以 LLM 產生的提示和維基百科文章作為參考。我們發現，儘管 LLM 為資源較多的語言產生了更長的回應和更多幻覺的詞元，但語言的長度正規化幻覺率與其數位表示之間沒有相關性。此外，我們發現較小的 LLM 表現出比較大的模型更大的幻覺率。
+
+##### **R2-KG: General-Purpose Dual-Agent Framework for Reliable Reasoning on Knowledge Graphs**
+2502.12767v1 by Sumin Jo, Junseong Choi, Jiho Kim, Edward Choi
+
+Recent studies have combined Large Language Models (LLMs) with Knowledge
+Graphs (KGs) to enhance reasoning, improving inference accuracy without
+additional training while mitigating hallucination. However, existing
+frameworks are often rigid, struggling to adapt to KG or task changes. They
+also rely heavily on powerful LLMs for reliable (i.e., trustworthy) reasoning.
+To address this, we introduce R2-KG, a plug-and-play, dual-agent framework that
+separates reasoning into two roles: an Operator (a low-capacity LLM) that
+gathers evidence and a Supervisor (a high-capacity LLM) that makes final
+judgments. This design is cost-efficient for LLM inference while still
+maintaining strong reasoning accuracy. Additionally, R2-KG employs an
+Abstention mechanism, generating answers only when sufficient evidence is
+collected from the KG, which significantly enhances reliability. Experiments across
+multiple KG-based reasoning tasks show that R2-KG consistently outperforms
+baselines in both accuracy and reliability, regardless of the inherent
+capability of LLMs used as the Operator. Further experiments reveal that the
+single-agent version of R2-KG, equipped with a strict self-consistency
+strategy, achieves significantly higher-than-baseline reliability while
+reducing inference cost. However, it also leads to a higher abstention rate in
+complex KGs. Our findings establish R2-KG as a flexible and cost-effective
+solution for KG-based reasoning. It reduces reliance on high-capacity LLMs
+while ensuring trustworthy inference.
+
+摘要：最近的研究結合了大型語言模型 (LLM) 與知識圖譜 (KG) 以增強推理，在不額外訓練的情況下提高推理準確性，同時減輕幻覺。然而，現有的框架通常很僵化，難以適應知識圖譜或任務的變化。它們還嚴重依賴強大的 LLM 來進行可靠（即值得信賴）的推理。為了解決這個問題，我們引入了 R2-KG，這是一個即插即用、雙代理框架，它將推理分為兩個角色：一個收集證據的操作員（低容量 LLM）和一個做出最終判斷的監督員（高容量 LLM）。這種設計在 LLM 推理方面具有成本效益，同時仍保持強大的推理準確性。此外，R2-KG 採用棄權機制，僅在從知識圖譜收集到足夠證據時才生成答案，這顯著提高了可靠性。跨多個基於知識圖譜的推理任務的實驗表明，R2-KG 在準確性和可靠性方面始終優於基線，而與用作操作員的 LLM 的固有能力無關。進一步的實驗表明，R2-KG 的單代理版本配備了嚴格的自一致性策略，實現了明顯高於基線的可靠性，同時降低了推理成本。然而，它也導致了複雜知識圖譜中更高的棄權率。我們的發現將 R2-KG 確立為一種靈活且經濟高效的基於知識圖譜的推理解決方案。它減少了對高容量 LLM 的依賴，同時確保了可信的推理。